NGINX backend cascade failure: when slow upstreams take down everything
Users report timeouts. 502 Bad Gateway and 504 Gateway Time-out responses are climbing, and nginx error logs show upstream timeouts. On the nginx host, CPU and memory are normal, and the master process is alive. The proxy is healthy but out of connections.
This is a backend cascade failure. One slow upstream causes nginx workers to hold connections open while waiting for responses, consuming finite worker_connections slots. As slots fill, new requests cannot be forwarded. Traffic concentrates on the remaining healthy backends, which overload and slow down. Eventually every backend times out or fails health checks, and nginx returns 502/504 to all clients while the proxy process remains up.
The distinguishing feature is the sequence: upstream latency rises first, active connections pile up second, and 502/504 errors appear last. Errors without a preceding latency spike indicate the upstream died suddenly rather than degraded gradually.
What this means
Each nginx worker has a fixed pool of connection slots set by worker_connections. Every proxied request consumes at least two slots: one for the client and one for the upstream. When an upstream slows down, the worker holds both connections open until the upstream responds or proxy_read_timeout expires (default 60s, measured between successive reads rather than over the whole response). If enough requests stall, the worker exhausts its slots. New connections are then dropped because workers have no free slots, which widens the gap between accepts and handled in stub_status.
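The slot arithmetic can be made concrete with Little's law: requests in flight are roughly the arrival rate multiplied by how long each request is held. The sketch below is illustrative only; the function names and sample numbers are assumptions, not values from a real deployment.

```shell
#!/usr/bin/env bash
# slots_held RATE_PER_SEC HOLD_SECONDS
# Little's law: in-flight requests = arrival rate x time each request is held.
# Each proxied request pins two slots (client side + upstream side).
slots_held() {
  local rate=$1 hold=$2
  echo $(( rate * hold * 2 ))
}

# slot_capacity WORKER_PROCESSES WORKER_CONNECTIONS
# Theoretical total of connection slots across all workers.
slot_capacity() {
  echo $(( $1 * $2 ))
}

# Illustrative numbers: 4 workers at the default 512 connections, and
# 50 req/s routed to a backend that stalls for the full 60s timeout.
echo "capacity: $(slot_capacity 4 512) slots"
echo "held by stalled requests: $(slots_held 50 60) slots"
```

With these assumed numbers the stalled requests would need 6000 slots against a capacity of 2048, so a single stalled backend receiving a modest share of traffic is enough to exhaust the proxy well before the timeout fires.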
Open-source nginx relies on passive health checks: real requests are the probes. A backend that slows down stays in rotation until it hits max_fails failures within fail_timeout. Until then, every stalled request holds a slot that could have served a healthy backend. Remaining backends absorb the redirected load, slow down, and accelerate the cascade.
flowchart TD
A[One upstream slows down] --> B[nginx waits, holding client and upstream connections]
B --> C[Worker connection slots fill]
C --> D[New requests queue in kernel backlog then drop]
D --> E[Traffic shifts to healthy backends]
E --> F[Healthy backends overload and slow down]
F --> G[All backends time out or fail health checks]
G --> H[nginx returns 502/504 to all clients]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Database lock contention or slow queries in the backend | Upstream header time spikes before response time; specific endpoints affected | Backend application logs and database slow query log |
| Upstream memory pressure or GC pause | Intermittent latency spikes that recover, then spike again | Backend memory and GC metrics, system OOM logs |
| Network partition or latency between nginx and upstream | $upstream_connect_time rises or connections fail outright; specific backend IP affected | Network path with mtr or ping; ss for retransmits |
| Deployment regression (new code is slower) | Latency increase correlates with deploy timestamp | Backend release logs and rollback status |
| Downstream dependency failure behind the upstream | Upstream header time high but connect time normal; backend logs show dependency timeouts | Backend dependency health checks and outbound connection logs |
Quick checks
Run these from the nginx host.
# Check nginx saturation (assumes a stub_status location is exposed at this URL)
curl -s http://127.0.0.1/stub_status
# Compare upstream time vs total request time.
# Assumes $request_time and $upstream_response_time are the final two fields.
tail -1000 /var/log/nginx/access.log | awk '{print "total:", $(NF-1), "upstream:", $NF}' | tail -20
# Check error logs for upstream timeout patterns
tail -1000 /var/log/nginx/error.log | grep -E "upstream timed out|connect\(\) failed|no live upstreams"
# Check if specific backends are refusing connections (bash only)
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  timeout 2 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && echo "$backend UP" || echo "$backend DOWN"
done
# Check nginx worker CPU (should be normal in a pure cascade)
ps -eo pid,pcpu,comm,args | grep '[n]ginx: worker'
# Check connection slot utilization
active=$(curl -s http://127.0.0.1/stub_status | awk '/Active connections/ {print $3}')
workers=$(pgrep -c -P $(cat /var/run/nginx.pid))
wc=$(nginx -T 2>/dev/null | grep -m1 'worker_connections' | awk '{print $2}' | tr -d ';')
wc=${wc:-512}
echo "Utilization: $(awk "BEGIN {printf \"%.1f\", $active * 100 / ($workers * $wc)}")%"
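The raw stub_status output is easier to compare over time once it is flattened into named fields. The following is a minimal sketch: parse_stub_status is a hypothetical helper name, and the sample input uses made-up numbers in the layout that stub_status prints.

```shell
#!/usr/bin/env bash
# parse_stub_status: reads stub_status output on stdin, prints one summary line.
# "dropped" is accepts minus handled, i.e. connections nginx could not service.
parse_stub_status() {
  awk '
    /^Active connections:/       { active = $3 }
    /^ *[0-9]+ +[0-9]+ +[0-9]+/  { accepts = $1; handled = $2 }
    /^Reading:/                  { reading = $2; writing = $4; waiting = $6 }
    END {
      printf "active=%d reading=%d writing=%d waiting=%d dropped=%d\n",
             active, reading, writing, waiting, accepts - handled
    }'
}

# Sample input with made-up counters.
sample='Active connections: 291
server accepts handled requests
 1000 990 3000
Reading: 6 Writing: 179 Waiting: 106'

printf '%s\n' "$sample" | parse_stub_status
```

Against a live server, the same function could be fed with `curl -s http://127.0.0.1/stub_status | parse_stub_status`, assuming the status endpoint is exposed at that URL. Running it twice a few seconds apart shows whether the dropped count is still growing.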
How to diagnose it
1. Confirm nginx is not saturated. Check worker CPU with ps or top. In a backend cascade, CPU is normal. If CPU is pegged, suspect SSL termination overload or event loop saturation instead.
2. Compare $request_time and $upstream_response_time. If $upstream_response_time is the dominant component of $request_time, the bottleneck is upstream. If the gap between them is large, the client or nginx buffering is the problem.
3. Check error logs for upstream timeouts. Look for "upstream timed out (110: Connection timed out) while reading response header" and "connect() failed (111: Connection refused)". These confirm the backend is either too slow or down.
4. Inspect the connection state breakdown. If Writing connections dominate while request rate is flat or dropping, workers are waiting on upstreams. High Reading with low throughput suggests a slow client attack, not a backend cascade.
5. Identify the specific failing backend. Parse $upstream_addr from access logs. If one IP appears repeatedly with high $upstream_response_time or "-" in $upstream_status, that server is the trigger.
6. Calculate connection slot utilization. Divide active connections by worker_connections * worker_processes. Above 80% is the danger zone. Because each proxied request uses two slots, effective proxy capacity is roughly half the theoretical maximum.
7. Look for 499s before 502/504s. A spike in 499s precedes the timeout errors as clients abandon waiting connections.
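Identifying the failing backend from access logs can be sketched with a small awk aggregation. This assumes a hypothetical log_format whose final three fields are $upstream_addr, $request_time, and $upstream_response_time, in that order; adjust the field positions for your own format. Lines where the upstream time is not a single number (for example "-", or a comma-separated list after a retry) are skipped.

```shell
#!/usr/bin/env bash
# per_upstream_latency: reads access log lines on stdin and prints
# request count, average, and maximum upstream time per backend address.
# Assumed field layout (hypothetical): ... $upstream_addr $request_time $upstream_response_time
per_upstream_latency() {
  awk '
    $NF ~ /^[0-9]+(\.[0-9]+)?$/ {
      addr = $(NF-2)
      t = $NF + 0                      # force numeric comparison
      n[addr]++
      sum[addr] += t
      if (t > max[addr]) max[addr] = t
    }
    END {
      for (a in n)
        printf "%s count=%d avg=%.3f max=%.3f\n", a, n[a], sum[a] / n[a], max[a]
    }' | sort
}

# Made-up sample lines in the assumed layout.
sample_log='GET /api 200 10.0.1.10:8080 0.120 0.100
GET /api 200 10.0.1.10:8080 0.320 0.300
GET /api 504 10.0.1.11:8080 60.001 60.000
GET /health 499 - 0.000 -'

printf '%s\n' "$sample_log" | per_upstream_latency
```

On a live host the input could be something like `tail -5000 /var/log/nginx/access.log | per_upstream_latency`. A backend whose max sits at the proxy_read_timeout value while its peers stay low is a likely trigger.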
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| $upstream_response_time P95 | Isolates backend latency from client or nginx overhead | P95 > 80% of proxy_read_timeout or trending >20% above baseline |
| Active connections | Total connection pressure against finite worker slots | Sustained >80% of worker_connections * worker_processes |
| Writing connections | Workers waiting for upstream responses; dominant during cascade | >50% of active connections with low request rate |
| 499 rate | Clients closing connections before response completes | >1% sustained, correlating with latency increase |
| 502/504 rate | Upstream failures and timeouts becoming visible to clients | Any sustained nonzero rate |
| Accepts vs handled gap | Connections dropped because no slots available | Gap increasing for >60 seconds |
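Where no monitoring pipeline is in place yet, a rough P95 can be computed from the access log directly. This is a minimal sketch using the nearest-rank method; p95 and exceeds_budget are hypothetical helper names, and the 80% threshold mirrors the warning sign in the table above.

```shell
#!/usr/bin/env bash
# p95: reads one numeric value per line on stdin and prints the
# 95th percentile (nearest-rank). Non-numeric lines such as "-" are ignored.
p95() {
  grep -E '^[0-9]+(\.[0-9]+)?$' | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      idx = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR)
      print v[idx]
    }'
}

# exceeds_budget P95 TIMEOUT: succeeds when P95 is above 80% of the timeout.
exceeds_budget() {
  awk -v p="$1" -v t="$2" 'BEGIN { exit !(p > 0.8 * t) }'
}

# Made-up upstream times in seconds; "-" marks a request with no upstream.
value=$(printf '%s\n' 0.10 0.12 0.11 - 52.00 0.09 | p95)
echo "p95=$value"
if exceeds_budget "$value" 60; then
  echo "WARNING: P95 is above 80% of a 60s proxy_read_timeout"
fi
```

If $upstream_response_time is the last field of your log format, the input could be `awk '{print $NF}' /var/log/nginx/access.log | p95`. Small samples make the percentile noisy, so treat the result as a quick check and not a replacement for a time-series metric.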
Fixes
Isolate the slow backend immediately. If one upstream is clearly degraded, remove it from the upstream block or mark it with the down parameter, then reload. A weight of zero is rejected as invalid, so it cannot be used to drain a server. Open-source nginx cannot drain backends dynamically without a reload. If you cannot take the backend out of rotation, temporarily reduce proxy_read_timeout and reload so nginx gives up on slow responses faster. This frees connection slots but increases 504 errors for those requests.
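One way to take a server out of rotation is to add the down parameter to its server line. The sketch below rewrites a config on stdin and prints the result to stdout, so nothing is changed in place. mark_down is a hypothetical helper; it assumes one server directive per line with no trailing comment, and the upstream name and addresses in the sample are made up.

```shell
#!/usr/bin/env bash
# mark_down ADDR: reads nginx config on stdin, appends "down" to the
# matching "server ADDR ...;" line, and prints the result to stdout.
# Lines already marked down, or with a trailing comment, are left unchanged.
mark_down() {
  awk -v addr="$1" '
    $1 == "server" && ($2 == addr || $2 == (addr ";")) && !/ down[ ;]/ {
      sub(/;[[:space:]]*$/, " down;")
    }
    { print }
  '
}

# Made-up upstream block for illustration.
sample_conf='upstream app_backend {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=10s;
}'

printf '%s\n' "$sample_conf" | mark_down 10.0.1.11:8080
```

In practice you would write the output to a scratch file, diff it against the live file, copy it into place, and then run `nginx -t && nginx -s reload`. Editing the file by hand is just as valid; the point is that the change is a one-word edit plus a reload.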
Tune passive health check sensitivity. Defaults of max_fails=1 and fail_timeout=10s are aggressive: a single timeout removes a backend for 10 seconds. If your network has brief blips, increase max_fails to 3 so transient errors do not shift all load to the remaining backends prematurely. Raising it too high delays removal of genuinely failed servers, so balance retry tolerance against failover speed.
Increase connection capacity. If utilization is near the limit, increase worker_connections and ensure worker_rlimit_nofile is at least double that value to cover client connections, upstream connections, and log files. Reload to apply. This does not fix the slow backend, but it buys runway.
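Before raising worker_connections, it helps to confirm that the file descriptor limit of a running worker leaves enough headroom. A minimal sketch for Linux, where soft_nofile and check_nofile are hypothetical helper names and the sample limits line uses made-up values:

```shell
#!/usr/bin/env bash
# soft_nofile: reads the contents of /proc/<pid>/limits on stdin and
# prints the soft "Max open files" limit.
soft_nofile() {
  awk '/^Max open files/ { print $4 }'
}

# check_nofile WORKER_CONNECTIONS LIMIT
# Reports whether LIMIT is at least double worker_connections.
check_nofile() {
  local wc=$1 limit=$2
  local need=$(( wc * 2 ))
  if [ "$limit" -ge "$need" ]; then
    echo "OK: limit $limit >= $need"
  else
    echo "LOW: limit $limit < $need"
  fi
}

# Made-up line in the layout of /proc/<pid>/limits.
sample_limits='Max open files            1024                 524288               files'

limit=$(printf '%s\n' "$sample_limits" | soft_nofile)
check_nofile 4096 "$limit"
```

On a live host the input could come from `soft_nofile < /proc/<worker pid>/limits`, using the PID of any nginx worker process. A LOW result means the larger worker_connections value would not be usable until worker_rlimit_nofile is raised as well.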
Verify upstream keepalive configuration. Without keepalive in the upstream block, every proxied request opens a new TCP connection to the backend. During recovery, this adds handshake overhead and ephemeral port pressure. Ensure keepalive is configured with an appropriate pool size per worker. Upstream keepalive pools are per-worker, not global.
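A quick way to verify this is to scan the effective configuration for upstream blocks that lack a keepalive directive. The sketch below is a heuristic, not a config parser: upstreams_without_keepalive is a hypothetical helper that assumes each upstream opens with "upstream NAME {" on one line and closes with a brace on its own line. The upstream names in the sample are made up.

```shell
#!/usr/bin/env bash
# upstreams_without_keepalive: reads nginx config on stdin and prints the
# name of every upstream block that has no "keepalive N;" directive.
# keepalive_timeout and keepalive_requests do not count as enabling the pool.
upstreams_without_keepalive() {
  awk '
    $1 == "upstream"              { name = $2; has = 0; inblock = 1; next }
    inblock && $1 == "keepalive"  { has = 1 }
    inblock && $1 == "}"          { if (!has) print name; inblock = 0 }
  '
}

# Made-up configuration for illustration.
sample_conf='upstream app_backend {
    server 10.0.1.10:8080;
    keepalive 32;
}
upstream legacy_backend {
    server 10.0.2.10:8080;
    keepalive_timeout 30s;
}'

printf '%s\n' "$sample_conf" | upstreams_without_keepalive
```

For the sample above this prints legacy_backend. Against a live server the input could be `nginx -T 2>/dev/null | upstreams_without_keepalive`. Note that the pool is commonly only effective when the proxying location also sets proxy_http_version 1.1 and clears the Connection header, so check those settings for your nginx version as well.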
Prevention
Log $upstream_response_time, $upstream_header_time, and $upstream_addr. Aggregate latency dilutes a single slow backend across the average of healthy peers. Per-upstream breakdowns are essential to detect partial failures.
Monitor the proxy connection multiplier. Size worker_connections so that peak traffic uses no more than 60% of the theoretical maximum after halving for the proxy multiplier. Do not assume the default of 512 is sufficient for production reverse proxy workloads.
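The sizing rule can be turned into a small calculation. Peak concurrent proxied requests should stay at or below 60% of half the total slots, which rearranges to worker_connections >= peak * 2 / (0.6 * worker_processes). A minimal sketch with a hypothetical function name and made-up traffic numbers:

```shell
#!/usr/bin/env bash
# min_worker_connections PEAK_CONCURRENT_REQUESTS WORKER_PROCESSES
# Smallest worker_connections that keeps peak proxied traffic at or below
# 60% of effective proxy capacity (total slots / 2).
# Integer form of ceil(peak * 2 / (0.6 * workers)) = ceil(peak * 20 / (6 * workers)).
min_worker_connections() {
  local peak=$1 workers=$2
  local den=$(( 6 * workers ))
  echo $(( (peak * 20 + den - 1) / den ))
}

# Illustrative numbers: 3000 concurrent proxied requests at peak, 4 workers.
echo "worker_connections >= $(min_worker_connections 3000 4)"
```

With these assumed numbers the result is 2500: four workers give 10000 slots, halved to 5000 proxied requests, of which 3000 is 60%. Connections are not spread perfectly evenly across workers, so treat the figure as a floor and not a target.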
Set worker_shutdown_timeout. In environments with frequent reloads, old workers can linger on long-lived connections, temporarily increasing worker count and memory. worker_shutdown_timeout forces old workers to exit, keeping process count predictable.
Review proxy_next_upstream carefully. Default conditions are error and timeout. Add http_502 and http_504 explicitly if you need retry on those. Under concurrent load, retries can land on the same failing peer or overload the remaining backends, so do not rely on retries alone to prevent a cascade.
How Netdata helps
- Correlate rising $upstream_response_time percentiles with climbing nginx.connections_active to confirm a backend cascade before 502s appear.
- Alert on the accepts-handled gap from stub_status to detect connection slot exhaustion at the nginx layer.
- Track nginx.requests rate alongside 5xx and 499 response rates to distinguish between traffic drops and backend failures.
- Surface per-upstream latency when access log parsing is configured, revealing which backend is degrading before it triggers max_fails.
- Monitor nginx worker CPU to rule out proxy-side saturation and focus the investigation on upstream health.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 502 Bad Gateway: causes and how to fix it
- nginx 504 Gateway Time-out: causes and fixes
- NGINX active connections climbing: reading, writing, waiting explained
- nginx connect() failed (111: Connection refused) while connecting to upstream
- NGINX connection exhaustion: detection, diagnosis, and prevention
- NGINX dropped connections: the accepts vs handled gap
- NGINX monitoring checklist: the signals every production server needs
- NGINX monitoring maturity model: from survival to expert
- nginx no live upstreams while connecting to upstream: what it means
- NGINX slowloris and slow-client attacks: detection and mitigation
- nginx: too many open files - diagnosing file descriptor exhaustion







