NGINX backend cascade failure: when slow upstreams take down everything

Users report timeouts. 502 Bad Gateway and 504 Gateway Time-out responses are climbing, and nginx error logs show upstream timeouts. On the nginx host, CPU and memory are normal, and the master process is alive. The proxy is healthy but out of connections.

This is a backend cascade failure. One slow upstream causes nginx workers to hold connections open while waiting for responses, consuming finite worker_connections slots. As slots fill, new requests cannot be forwarded. Traffic concentrates on the remaining healthy backends, which overload and slow down. Eventually every backend times out or fails health checks, and nginx returns 502/504 to all clients while the proxy process remains up.

The distinguishing feature is the sequence: upstream latency rises first, active connections pile up second, and 502/504 errors appear last. Errors without a preceding latency spike indicate the upstream died suddenly rather than degraded gradually.

What this means

Each nginx worker has a fixed pool of connection slots set by worker_connections. Every proxied request consumes at least two slots: one for the client and one for the upstream. When an upstream slows down, the worker holds that connection open until proxy_read_timeout (default 60s). If enough requests stall, the worker exhausts its slots. New connections are dropped because workers have no slots available, which widens the gap between accepts and handled in stub_status.
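
The slot arithmetic above can be sketched in a few lines of shell. The function names and sample numbers below are hypothetical, and the second function is only a steady-state approximation (arrival rate multiplied by hold time), not a figure nginx reports.

```shell
# Proxied requests that fit at once: each one uses a client slot and an upstream slot.
# $1 = worker_processes, $2 = worker_connections
proxy_capacity() {
  echo $(( $1 * $2 / 2 ))
}

# Approximate slots pinned by stalled requests at steady state.
# $1 = stalled requests per second, $2 = proxy_read_timeout in seconds
stalled_slots() {
  echo $(( $1 * $2 * 2 ))
}

# Hypothetical sizing: 4 workers with 1024 connections each
proxy_capacity 4 1024
# Hypothetical load: 20 stalled requests per second held for the 60s default timeout
stalled_slots 20 60
```

With these sample numbers, stalled requests alone would pin 2400 of the 4096 total slots before any healthy traffic is counted.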

Open-source nginx relies on passive health checks: real requests are the probes. A backend that slows down stays in rotation until it hits max_fails failures within fail_timeout. Until then, every stalled request holds a slot that could have served a healthy backend. Remaining backends absorb the redirected load, slow down, and accelerate the cascade.

flowchart TD
    A[One upstream slows down] --> B[nginx waits, holding client and upstream connections]
    B --> C[Worker connection slots fill]
    C --> D[New requests queue in kernel backlog then drop]
    D --> E[Traffic shifts to healthy backends]
    E --> F[Healthy backends overload and slow down]
    F --> G[All backends time out or fail health checks]
    G --> H[nginx returns 502/504 to all clients]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Database lock contention or slow queries in the backend | Upstream header time spikes before response time; specific endpoints affected | Backend application logs and database slow query log |
| Upstream memory pressure or GC pause | Intermittent latency spikes that recover, then spike again | Backend memory and GC metrics, system OOM logs |
| Network partition or latency between nginx and upstream | $upstream_connect_time rises or connections fail; specific backend IP affected | Network path with mtr or ping; ss for retransmits |
| Deployment regression (new code is slower) | Latency increase correlates with deploy timestamp | Backend release logs and rollback status |
| Downstream dependency failure behind the upstream | Upstream header time high but connect time normal; backend logs show dependency timeouts | Backend dependency health checks and outbound connection logs |

Quick checks

Run these from the nginx host.

# Check nginx saturation
curl -s http://127.0.0.1/stub_status

# Compare upstream time vs total request time.
# Assumes $request_time and $upstream_response_time are the final two fields.
tail -1000 /var/log/nginx/access.log | awk '{print "total:", $(NF-1), "upstream:", $NF}' | tail -20

# Check error logs for upstream timeout patterns
tail -1000 /var/log/nginx/error.log | grep -E "upstream timed out|connect\(\) failed|no live upstreams"

# Check if specific backends are refusing connections (bash only)
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  timeout 2 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && echo "$backend UP" || echo "$backend DOWN"
done

# Check nginx worker CPU (should be normal in a pure cascade)
ps -eo pid,pcpu,comm,args | grep '[n]ginx: worker'

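# Check connection slot utilization
active=$(curl -s http://127.0.0.1/stub_status | awk '/Active connections/ {print $3}')
# Count only worker processes (cache manager/loader are also children of the master)
workers=$(pgrep -c -P "$(cat /var/run/nginx.pid)" -f 'nginx: worker')
# Match the directive itself, not comments that merely mention it
wc=$(nginx -T 2>/dev/null | awk '$1 == "worker_connections" {sub(/;/, "", $2); print $2; exit}')
# Fall back to the nginx default when the directive is not set
wc=${wc:-512}
echo "Utilization: $(awk "BEGIN {printf \"%.1f\", $active * 100 / ($workers * $wc)}")%"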

How to diagnose it

  1. Confirm nginx itself is not CPU-bound. Check worker CPU with ps or top. In a backend cascade, worker CPU is normal because the workers are idle, waiting on upstreams. If CPU is pegged, suspect SSL termination overload or event loop saturation instead.

  2. Compare $request_time and $upstream_response_time. If $upstream_response_time is the dominant component of $request_time, the bottleneck is upstream. If the gap between them is large, the client or nginx buffering is the problem.

  3. Check error logs for upstream timeouts. Look for upstream timed out (110: Connection timed out) while reading response header and connect() failed (111: Connection refused). These confirm the backend is either too slow or down.

  4. Inspect the connection state breakdown. If Writing connections dominate while request rate is flat or dropping, workers are waiting on upstreams. High Reading with low throughput suggests a slow client attack, not a backend cascade.

  5. Identify the specific failing backend. Parse $upstream_addr from access logs. If one IP appears repeatedly with high $upstream_response_time or - in $upstream_status, that server is the trigger.

  6. Calculate connection slot utilization. Divide active connections by worker_connections * worker_processes. Above 80% is the danger zone. Because each proxied request uses two slots, effective proxy capacity is roughly half the theoretical maximum.

  7. Look for 499s before 502/504s. A spike in 499s (nginx's status for a client that closed the connection before a response was sent) often precedes the timeout errors, as clients abandon waiting connections.
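
Step 5 can be approximated with a short awk aggregation. This is a minimal sketch that assumes a hypothetical log format whose last three fields are $upstream_addr, $request_time, and $upstream_response_time, in that order. The function name is made up for illustration. Lines with no upstream, and lines where nginx retried several upstreams and logged comma-separated values, are skipped to keep the parsing simple.

```shell
# Summarize upstream response time per backend from access log lines on stdin.
# Assumed (hypothetical) trailing fields: $upstream_addr $request_time $upstream_response_time
per_backend_latency() {
  awk '
    NF < 3 { next }
    {
      addr = $(NF-2); total = $(NF-1); up = $NF
      # No upstream was contacted for this request
      if (addr == "-" || up == "-") next
      # Retried requests log comma-separated lists; skip them in this sketch
      if (addr ~ /,/ || total ~ /,/ || up ~ /,/) next
      up += 0
      n[addr]++; sum[addr] += up
      if (up > max[addr]) max[addr] = up
    }
    END {
      for (a in n) printf "%s count=%d avg=%.3f max=%.3f\n", a, n[a], sum[a] / n[a], max[a]
    }
  ' | sort
}

# Typical use (log path as in the quick checks above):
#   tail -5000 /var/log/nginx/access.log | per_backend_latency
```

A backend whose average or maximum sits far above its peers is the likely trigger. If your log format differs, adjust the field positions before trusting the output.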

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| $upstream_response_time P95 | Isolates backend latency from client or nginx overhead | P95 > 80% of proxy_read_timeout or trending >20% above baseline |
| Active connections | Total connection pressure against finite worker slots | Sustained >80% of worker_connections * worker_processes |
| Writing connections | Workers waiting for upstream responses; dominant during cascade | >50% of active connections with low request rate |
| 499 rate | Clients closing connections before response completes | >1% sustained, correlating with latency increase |
| 502/504 rate | Upstream failures and timeouts becoming visible to clients | Any sustained nonzero rate |
| Accepts vs handled gap | Connections dropped because no slots available | Gap increasing for >60 seconds |
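
Two of these signals, the Writing share and the accepts vs handled gap, can be read directly from stub_status output. The sketch below uses a hypothetical function name and assumes the standard four-line stub_status layout.

```shell
# Read stub_status output on stdin and print the accepts-handled gap
# and the share of active connections in the Writing state.
stub_status_summary() {
  awk '
    /^Active connections/            { active = $3 }
    /^ *[0-9]+ +[0-9]+ +[0-9]+/      { accepts = $1; handled = $2 }
    /^Reading/                       { writing = $4 }
    END {
      pct = (active > 0) ? writing * 100 / active : 0
      printf "dropped=%d writing_pct=%.1f\n", accepts - handled, pct
    }
  '
}

# Typical use (endpoint as in the quick checks above):
#   curl -s http://127.0.0.1/stub_status | stub_status_summary
```

The accepts and handled counters are cumulative since nginx started, so the number to watch is how dropped changes between two samples, not its absolute value.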

Fixes

Isolate the slow backend immediately. If one upstream is clearly degraded, remove it from the upstream block or mark it with the down parameter, then reload. Setting weight to zero is not an option, because nginx rejects weight=0 as invalid. Open-source nginx cannot drain backends dynamically without a reload. If you cannot take the backend out of rotation, temporarily reduce proxy_read_timeout and reload so nginx gives up on slow responses faster. This frees connection slots sooner but increases 504 errors for those requests.
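
One way to take a server out of rotation is to add the down parameter to its server line and reload. The helper below is a sketch with a hypothetical name. It assumes each server directive sits on its own line and that the address is a plain host:port. It prints the modified config to stdout and leaves the original file untouched, so the change can be reviewed first.

```shell
# Print a copy of an nginx config with one upstream server marked "down".
# $1 = config file, $2 = server address exactly as written in the config
mark_backend_down() {
  conf=$1
  # Escape dots so the address is matched literally
  addr=$(printf '%s' "$2" | sed 's/\./\\./g')
  sed -E "s/^([[:space:]]*server[[:space:]]+${addr})([[:space:]][^;]*)?;/\1\2 down;/" "$conf"
}

# Typical flow (the file path is an assumption; review the output before installing it):
#   mark_backend_down /etc/nginx/conf.d/upstream.conf 10.0.1.10:8080 > /tmp/upstream.conf.new
#   cp /tmp/upstream.conf.new /etc/nginx/conf.d/upstream.conf
#   nginx -t && nginx -s reload
```

Running nginx -t before the reload catches a broken edit before it reaches the running workers.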

Tune passive health check sensitivity. Defaults of max_fails=1 and fail_timeout=10s are aggressive: a single timeout removes a backend for 10 seconds. If your network has brief blips, increase max_fails to 3 so transient errors do not shift all load to the remaining backends prematurely. Raising it too high delays removal of genuinely failed servers, so balance retry tolerance against failover speed.

Increase connection capacity. If utilization is near the limit, increase worker_connections and ensure worker_rlimit_nofile is at least double that value to cover client connections, upstream connections, and log files. Reload to apply. This does not fix the slow backend, but it buys runway.

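Verify upstream keepalive configuration. Without keepalive in the upstream block, every proxied request opens a new TCP connection to the backend. During recovery, this adds handshake overhead and ephemeral port pressure. Configure keepalive with an appropriate idle pool size, and remember that the pool is per worker, not global: a backend can see up to the keepalive value multiplied by worker_processes idle connections. For HTTP upstreams, keepalive also needs proxy_http_version 1.1 and a cleared Connection header (proxy_set_header Connection "") unless your nginx version already defaults to them.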
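
A quick way to spot upstream blocks that lack the directive is to scan the effective configuration. This sketch uses a hypothetical function name and assumes the usual layout, with the upstream line, each directive, and the closing brace on separate lines. It matches only the keepalive directive itself, not keepalive_timeout or keepalive_requests.

```shell
# Read nginx configuration text on stdin and print the name of every
# upstream block that has no "keepalive" directive.
upstreams_without_keepalive() {
  awk '
    $1 == "upstream" {
      name = $2; sub(/[{].*/, "", name)
      inblock = 1; has = 0
      next
    }
    inblock && $1 == "keepalive" { has = 1 }
    inblock && /[}]/ {
      if (!has) print name
      inblock = 0
    }
  '
}

# Typical use:
#   nginx -T 2>/dev/null | upstreams_without_keepalive
```

An empty result means every upstream block found has a keepalive pool configured. It does not confirm that the matching proxy settings are in place.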

Prevention

Log $upstream_response_time, $upstream_header_time, and $upstream_addr. Aggregate latency dilutes a single slow backend across the average of healthy peers. Per-upstream breakdowns are essential to detect partial failures.

Monitor the proxy connection multiplier. Size worker_connections so that peak traffic uses no more than 60% of the theoretical maximum after halving for the proxy multiplier. Do not assume the default of 512 is sufficient for production reverse proxy workloads.
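
The sizing rule above can be turned into a small calculation. The function name and the sample peak are hypothetical. The formula rearranges peak <= 0.6 * worker_processes * worker_connections / 2 and rounds up.

```shell
# Smallest worker_connections that keeps peak proxied concurrency at or below
# 60% of (worker_processes * worker_connections / 2).
# $1 = peak concurrent proxied requests, $2 = worker_processes
required_worker_connections() {
  peak=$1
  workers=$2
  # wc >= peak * 2 / (0.6 * workers) = peak * 10 / (3 * workers), rounded up
  echo $(( (peak * 10 + 3 * workers - 1) / (3 * workers) ))
}

# Hypothetical peak of 3000 concurrent proxied requests on 4 workers
required_worker_connections 3000 4
```

For this sample the result is 2500, nearly five times the default of 512. Treat the output as a floor, and raise worker_rlimit_nofile to match.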

Set worker_shutdown_timeout. In environments with frequent reloads, old workers can linger on long-lived connections, temporarily increasing worker count and memory. worker_shutdown_timeout forces old workers to exit, keeping process count predictable.

Review proxy_next_upstream carefully. Default conditions are error and timeout. Add http_502 and http_504 explicitly if you need retry on those. Under concurrent load, retries can land on the same failing peer or overload the remaining backends, so do not rely on retries alone to prevent a cascade.

How Netdata helps

  • Correlate rising $upstream_response_time percentiles with climbing nginx.connections_active to confirm a backend cascade before 502s appear.
  • Alert on the accepts-handled gap from stub_status to detect connection slot exhaustion at the nginx layer.
  • Track nginx.requests rate alongside 5xx and 499 response rates to distinguish between traffic drops and backend failures.
  • Surface per-upstream latency when access log parsing is configured, revealing which backend is degrading before it triggers max_fails.
  • Monitor nginx worker CPU to rule out proxy-side saturation and focus the investigation on upstream health.