NGINX backend cascade failure: when slow upstreams take down everything

Users report timeouts. 502 Bad Gateway and 504 Gateway Time-out responses are climbing, and nginx error logs show upstream timeouts. On the nginx host, CPU and memory are normal, and the master process is alive. The proxy is healthy but out of connections.

This is a backend cascade failure. One slow upstream causes nginx workers to hold connections open while waiting for responses, consuming finite worker_connections slots. As slots fill, new requests cannot be forwarded. Traffic concentrates on the remaining healthy backends, which overload and slow down. Eventually every backend times out or fails health checks, and nginx returns 502/504 to all clients while the proxy process remains up.

The distinguishing feature is the sequence: upstream latency rises first, active connections pile up second, and 502/504 errors appear last. Errors without a preceding latency spike indicate the upstream died suddenly rather than degraded gradually.

What this means

Each nginx worker has a fixed pool of connection slots set by worker_connections. Every proxied request consumes at least two slots: one for the client and one for the upstream. When an upstream slows down, the worker holds both connections open until the response arrives or proxy_read_timeout expires (default 60s). If enough requests stall, the worker exhausts its slots. New connections are then dropped, which widens the gap between accepts and handled in stub_status.
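The slot arithmetic above can be sketched in a few lines of shell. This is a rough illustration rather than a capacity model: the function names are hypothetical, the example figures (4 workers, 1024 connections, 50 requests per second) are invented, and it treats slots as one shared pool even though nginx allocates them per worker.

```shell
# Total connection slots across all workers.
total_slots() {  # usage: total_slots WORKER_PROCESSES WORKER_CONNECTIONS
  echo $(( $1 * $2 ))
}

# Rough ceiling on concurrent proxied requests:
# each one holds a client slot and an upstream slot.
proxied_capacity() {  # usage: proxied_capacity WORKER_PROCESSES WORKER_CONNECTIONS
  echo $(( $1 * $2 / 2 ))
}

# Slots held in steady state by requests that stall until the timeout:
# in-flight requests = arrival rate * time held, times two slots each.
stalled_slots() {  # usage: stalled_slots REQ_PER_SEC TIMEOUT_SECONDS
  echo $(( $1 * $2 * 2 ))
}
```

With these assumed numbers, `total_slots 4 1024` gives 4096 slots, while `stalled_slots 50 60` gives 6000: a backend receiving 50 requests per second that all stall for the full 60-second timeout would need more slots than the proxy has.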

Open-source nginx relies on passive health checks: real requests are the probes. A backend that slows down stays in rotation until it hits max_fails failures within fail_timeout. Until then, every stalled request holds a slot that could have served a healthy backend. Remaining backends absorb the redirected load, slow down, and accelerate the cascade.

flowchart TD
    A[One upstream slows down] --> B[nginx waits, holding client and upstream connections]
    B --> C[Worker connection slots fill]
    C --> D[New requests queue in kernel backlog then drop]
    D --> E[Traffic shifts to healthy backends]
    E --> F[Healthy backends overload and slow down]
    F --> G[All backends time out or fail health checks]
    G --> H[nginx returns 502/504 to all clients]

Common causes

Cause	What it looks like	First thing to check
Database lock contention or slow queries in the backend	Upstream header time spikes before response time; specific endpoints affected	Backend application logs and database slow query log
Upstream memory pressure or GC pause	Intermittent latency spikes that recover, then spike again	Backend memory and GC metrics, system OOM logs
Network partition or latency between nginx and upstream	`$upstream_connect_time` rises or connections fail; specific backend IP affected	Network path with `mtr` or `ping`; `ss` for retransmits
Deployment regression (new code is slower)	Latency increase correlates with deploy timestamp	Backend release logs and rollback status
Downstream dependency failure behind the upstream	Upstream header time high but connect time normal; backend logs show dependency timeouts	Backend dependency health checks and outbound connection logs

Quick checks

Run these from the nginx host.

# Check nginx saturation
curl -s http://127.0.0.1/stub_status

# Compare upstream time vs total request time.
# Assumes $request_time and $upstream_response_time are the final two fields.
tail -1000 /var/log/nginx/access.log | awk '{print "total:", $(NF-1), "upstream:", $NF}' | tail -20

# Check error logs for upstream timeout patterns
tail -1000 /var/log/nginx/error.log | grep -E "upstream timed out|connect\(\) failed|no live upstreams"

# Check if specific backends are refusing connections (bash only)
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  timeout 2 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && echo "$backend UP" || echo "$backend DOWN"
done

# Check nginx worker CPU (should be normal in a pure cascade)
ps -eo pid,pcpu,comm,args | grep '[n]ginx: worker'

# Check connection slot utilization
active=$(curl -s http://127.0.0.1/stub_status | awk '/Active connections/ {print $3}')
workers=$(pgrep -c -P $(cat /var/run/nginx.pid))
wc=$(nginx -T 2>/dev/null | grep -m1 'worker_connections' | awk '{print $2}' | tr -d ';')
wc=${wc:-512}
echo "Utilization: $(awk "BEGIN {printf \"%.1f\", $active * 100 / ($workers * $wc)}")%"

How to diagnose it

Confirm nginx is not saturated. Check worker CPU with ps or top. In a backend cascade, CPU is normal. If CPU is pegged, suspect SSL termination overload or event loop saturation instead.
Compare $request_time and $upstream_response_time. If $upstream_response_time is the dominant component of $request_time, the bottleneck is upstream. If the gap between them is large, the client or nginx buffering is the problem.
Check error logs for upstream timeouts. Look for upstream timed out (110: Connection timed out) while reading response header and connect() failed (111: Connection refused). These confirm the backend is either too slow or down.
Inspect the connection state breakdown. If Writing connections dominate while request rate is flat or dropping, workers are waiting on upstreams. High Reading with low throughput suggests a slow client attack, not a backend cascade.
Identify the specific failing backend. Parse $upstream_addr from access logs. If one IP appears repeatedly with high $upstream_response_time or - in $upstream_status, that server is the trigger.
Calculate connection slot utilization. Divide active connections by worker_connections * worker_processes. Above 80% is the danger zone. Because each proxied request uses two slots, effective proxy capacity is roughly half the theoretical maximum.
Look for 499s before 502/504s. A spike in 499s precedes the timeout errors as clients abandon waiting connections.
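Identifying the failing backend can be sketched as a per-address summary of the access log. The sketch assumes a hypothetical log_format whose last three fields are $upstream_addr, $request_time, and $upstream_response_time; adjust the field positions for your own format. The function name is illustrative, and requests that nginx retried on a second upstream (comma-separated values) are skipped rather than parsed.

```shell
# Summarize upstream latency per backend address from access log lines on stdin.
# Assumed field order at end of line: $upstream_addr $request_time $upstream_response_time
per_upstream_latency() {
  awk '
    NF < 3 { next }                  # ignore short or blank lines
    $(NF-1) ~ /,/ { next }           # skip retried requests (comma-separated values)
    { addr = $(NF-2); seen[addr] = 1 }
    $NF ~ /^[0-9.]+$/ {
      ok[addr]++; sum[addr] += $NF
      if ($NF + 0 > max[addr] + 0) max[addr] = $NF + 0
      next
    }
    { no_time[addr]++ }              # "-" means no response time was recorded
    END {
      for (a in seen) {
        mean = (ok[a] > 0) ? sum[a] / ok[a] : 0
        printf "%s ok=%d no_time=%d mean=%.3f max=%.3f\n", a, ok[a], no_time[a], mean, max[a]
      }
    }
  ' | sort
}
```

A typical invocation would be `tail -5000 /var/log/nginx/access.log | per_upstream_latency`. A backend whose mean or no_time count stands out from its peers is the likely trigger.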

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`$upstream_response_time` P95	Isolates backend latency from client or nginx overhead	P95 > 80% of `proxy_read_timeout` or trending >20% above baseline
Active connections	Total connection pressure against finite worker slots	Sustained >80% of `worker_connections * worker_processes`
Writing connections	Workers waiting for upstream responses; dominant during cascade	>50% of active connections with low request rate
499 rate	Clients closing connections before response completes	>1% sustained, correlating with latency increase
502/504 rate	Upstream failures and timeouts becoming visible to clients	Any sustained nonzero rate
Accepts vs handled gap	Connections dropped because no slots available	Gap increasing for >60 seconds
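Two of the signals in the table, the Writing share and the accepts-handled gap, can be derived from raw stub_status output. The following is a minimal sketch that assumes the standard four-line stub_status text format; the function name is hypothetical.

```shell
# Read stub_status text on stdin and print the accepts-handled gap
# and the share of active connections that are in the Writing state.
stub_status_summary() {
  awk '
    /^Active connections/           { active = $3 }
    /^ *[0-9]+ +[0-9]+ +[0-9]+/     { accepts = $1; handled = $2 }
    /^Reading/                      { writing = $4 }
    END {
      pct = (active > 0) ? writing * 100 / active : 0
      printf "dropped=%d writing_pct=%.1f\n", accepts - handled, pct
    }
  '
}
```

For example, `curl -s http://127.0.0.1/stub_status | stub_status_summary`. The accepts and handled counters are cumulative since nginx started, so sample twice and compare: a dropped value that grows between samples is the warning sign, not the absolute number.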

Fixes

Isolate the slow backend immediately. If one upstream is clearly degraded, mark its server entry as down in the upstream block, or remove the entry, and reload. Open-source nginx cannot drain backends dynamically without a reload. If you cannot take the backend out of rotation, temporarily reduce proxy_read_timeout so nginx gives up on slow responses faster; this change also takes effect only after a reload. It frees connection slots but increases 504 errors for those requests.

Tune passive health check sensitivity. Defaults of max_fails=1 and fail_timeout=10s are aggressive: a single timeout removes a backend for 10 seconds. If your network has brief blips, increase max_fails to 3 so transient errors do not shift all load to the remaining backends prematurely. Raising it too high delays removal of genuinely failed servers, so balance retry tolerance against failover speed.

Increase connection capacity. If utilization is near the limit, increase worker_connections and ensure worker_rlimit_nofile is at least double that value to cover client connections, upstream connections, and log files. Reload to apply. This does not fix the slow backend, but it buys runway.
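Before raising worker_connections, it can help to check whether the file descriptor limit of a running worker already covers the doubled value. The sketch below assumes a Linux host, where /proc/PID/limits exposes the limit; both function names are hypothetical.

```shell
# Print the soft "Max open files" limit from /proc/<pid>/limits text on stdin.
soft_nofile() {
  awk '/^Max open files/ { print $4 }'
}

# Succeed when the limit covers at least two descriptors per connection slot.
nofile_covers() {  # usage: nofile_covers WORKER_CONNECTIONS NOFILE_LIMIT
  [ "$2" -ge $(( $1 * 2 )) ]
}
```

One way to use it, assuming you have looked up a worker PID with ps: `limit=$(soft_nofile < /proc/PID/limits)`, then `nofile_covers 4096 "$limit" || echo "raise worker_rlimit_nofile"`.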

Verify upstream keepalive configuration. Without keepalive in the upstream block, every proxied request opens a new TCP connection to the backend. During recovery, this adds handshake overhead and ephemeral port pressure. Ensure keepalive is configured with an appropriate pool size per worker. Upstream keepalive pools are per-worker, not global.

Prevention

Log $upstream_response_time, $upstream_header_time, and $upstream_addr. Aggregate latency dilutes a single slow backend across the average of healthy peers. Per-upstream breakdowns are essential to detect partial failures.

Monitor the proxy connection multiplier. Size worker_connections so that peak traffic uses no more than 60% of the theoretical maximum after halving for the proxy multiplier. Do not assume the default of 512 is sufficient for production reverse proxy workloads.
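The sizing rule above can be turned around to estimate a minimum worker_connections value. This is a back-of-the-envelope sketch: the function name is hypothetical, the peak figure is something you would take from your own monitoring, and it ignores idle keepalive connections, which also occupy slots.

```shell
# Smallest worker_connections that keeps PEAK concurrent proxied requests
# at or below 60% of (worker_processes * worker_connections / 2).
required_worker_connections() {  # usage: required_worker_connections PEAK_REQUESTS WORKER_PROCESSES
  peak=$1
  workers=$2
  # peak * 2 slots per request / 0.60 target / workers, rounded up with integer math.
  echo $(( (peak * 200 + 60 * workers - 1) / (60 * workers) ))
}
```

For an assumed peak of 3000 concurrent proxied requests on 4 workers, `required_worker_connections 3000 4` prints 2500, nearly five times the default of 512.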

Set worker_shutdown_timeout. In environments with frequent reloads, old workers can linger on long-lived connections, temporarily increasing worker count and memory. worker_shutdown_timeout forces old workers to exit, keeping process count predictable.

Review proxy_next_upstream carefully. Default conditions are error and timeout. Add http_502 and http_504 explicitly if you need retry on those. Under concurrent load, retries can land on the same failing peer or overload the remaining backends, so do not rely on retries alone to prevent a cascade.

How Netdata helps

Correlate rising $upstream_response_time percentiles with climbing nginx.connections_active to confirm a backend cascade before 502s appear.
Alert on the accepts-handled gap from stub_status to detect connection slot exhaustion at the nginx layer.
Track nginx.requests rate alongside 5xx and 499 response rates to distinguish between traffic drops and backend failures.
Surface per-upstream latency when access log parsing is configured, revealing which backend is degrading before it triggers max_fails.
Monitor nginx worker CPU to rule out proxy-side saturation and focus the investigation on upstream health.
