nginx no live upstreams while connecting to upstream: what it means
When nginx logs no live upstreams while connecting to upstream, every server in the affected upstream block is marked unavailable. The proxied request has no eligible backend, so nginx returns 502 Bad Gateway. This is not an nginx defect; it signals that all backends have failed open-source nginx’s passive health checks, or a network partition has made them unreachable from the nginx host.
Open-source nginx supports only passive health checks. The defaults are aggressive: max_fails=1 and fail_timeout=10s. One timeout or connection failure inside a ten-second window removes a server from rotation for ten seconds. When every server crosses that threshold, the pool has zero live members.
The most common trigger is a backend cascade: one server slows down, traffic shifts to the remainder, they overload and fail their own health checks, and the entire pool is marked down within seconds. The fix is upstream-side, but the immediate priority is confirming scope, restoring capacity, and preventing recurrence.
What this means
nginx uses max_fails and fail_timeout in an upstream block to decide server availability. max_fails sets the number of failed attempts within the fail_timeout window that causes nginx to consider the server unavailable. Once that number is reached, nginx marks the server unavailable for a period of fail_timeout. After that period, nginx sends the next request to the server as a probe. If the probe succeeds, the server is restored; if it fails, the server stays unavailable for another full fail_timeout cycle.
If every server in the group is marked unavailable simultaneously, nginx has no destination for the proxied request. It logs no live upstreams while connecting to upstream and returns 502.
Two nuances limit when this error can appear:
- Single-server guardrail. If an upstream block contains exactly one server, nginx ignores max_fails, fail_timeout, and slow_start. The lone server is never marked unavailable by passive checks. The no live upstreams error only appears when an upstream block has two or more servers and all are down.
- Backup servers. A server with the backup parameter receives traffic only when all non-backup servers are unavailable. Active backups should prevent this error, though they may overload if primaries stay down.
The zone directive creates a shared memory zone for the upstream group, letting worker processes share health-check state. Without it, each worker maintains independent failure counters, which can produce inconsistent behavior under high concurrency.
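As a rough audit aid, the following sketch scans configuration text and lists upstream blocks that have no zone directive. It assumes one directive per line inside each upstream block; the function name upstreams_without_zone, the upstream names, and the addresses are illustrative and not part of nginx.

```shell
#!/usr/bin/env bash
# List upstream blocks that lack a `zone` directive.
# Assumption: one directive per line, closing brace on its own line.
upstreams_without_zone() {
  awk '
    /^[ \t]*upstream[ \t]+/ {
      name = $2
      sub(/[{]$/, "", name)        # tolerate "upstream name{" with no space
      in_up = 1
      has_zone = 0
      next
    }
    in_up && /^[ \t]*zone[ \t]+/ { has_zone = 1 }
    in_up && /^[ \t]*[}]/ {
      if (!has_zone) print name
      in_up = 0
    }
  '
}

# Example run on inline sample configuration (hypothetical names).
upstreams_without_zone <<'EOF'
upstream app_shared {
    zone app_shared 64k;
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
}
upstream app_local {
    server 10.0.1.12:8080;
    server 10.0.1.13:8080;
}
EOF
```

On a live host the same function could be fed from the full configuration dump, for example nginx -T 2>/dev/null | upstreams_without_zone. Configurations that put several directives on one line would need a more careful parser.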
flowchart TD
A[Backend slowness or network issue] --> B[Request times out or fails]
B --> C{max_fails reached within fail_timeout?}
C -->|Yes| D[Server marked unavailable for fail_timeout]
C -->|No| E[Traffic continues to server]
D --> F[Traffic shifts to remaining servers]
F --> G[Remaining servers overload and fail]
G --> H[All servers marked unavailable]
H --> I[nginx returns 502 no live upstreams]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Backend cascade failure | Upstream response time climbs, Writing connections pile up, then 502s spike; error log shows upstream timed out | Backend application metrics and logs: CPU, memory, or database contention |
| Network partition or firewall change | Error log shows connect() failed (110: Connection timed out) or (111: Connection refused); backends respond when checked directly | Network path from nginx host to backend: nc, curl, or /dev/tcp tests |
| Simultaneous backend restart or crash | Immediate 502 spike; connect() failed (111) for every upstream | Backend process liveness and listen ports |
| Overly aggressive passive health checks | Brief latency blip removes healthy servers for fail_timeout seconds; flapping under load | nginx -T output for max_fails and fail_timeout values |
| Unreachable entries in upstream block | A reachable server is marked down, leaving only down or unreachable entries; common in multi-network container setups | nginx -T for 127.0.0.1 down or unreachable addresses mixed with live backends |
| DNS resolution failure (dynamic upstreams) | 502s isolated to locations using variable proxy_pass; resolver errors in log | resolver directive and DNS server reachability |
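The log signatures in the table can be tallied to see which failure mode dominates. This is a minimal sketch that matches only the message fragments quoted in this guide; the function name classify_upstream_errors and the abbreviated sample lines are illustrative, and real error-log lines carry more fields.

```shell
#!/usr/bin/env bash
# Count upstream failure modes in nginx error-log text read from stdin.
classify_upstream_errors() {
  awk '
    /no live upstreams/          { none++ }
    /connect\(\) failed \(111:/  { refused++ }
    /connect\(\) failed \(110:/  { conn_timeout++ }
    /upstream timed out/         { timed_out++ }
    END {
      printf "no_live_upstreams=%d\n", none
      printf "connection_refused=%d\n", refused
      printf "connect_timed_out=%d\n", conn_timeout
      printf "upstream_timed_out=%d\n", timed_out
    }
  '
}

# Example run on abbreviated, hypothetical log lines.
classify_upstream_errors <<'EOF'
[error] connect() failed (111: Connection refused) while connecting to upstream
[error] connect() failed (111: Connection refused) while connecting to upstream
[error] upstream timed out (110: Connection timed out) while reading response header from upstream
[error] no live upstreams while connecting to upstream
EOF
```

Against a real log, something like tail -n 5000 /var/log/nginx/error.log | classify_upstream_errors gives a quick breakdown. A majority of connection_refused points at dead processes, while upstream_timed_out points at slow ones.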
Quick checks
# Error log for the exact error and upstream failures
tail -1000 /var/log/nginx/error.log | grep -E "no live upstreams|upstream timed out|connect\(\) failed"
# Upstream configuration, including max_fails and fail_timeout
nginx -T 2>/dev/null | grep -A 20 'upstream'
# TCP reachability to each backend from the nginx host
# Substitute your actual backend host:port values
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
timeout 2 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && \
echo "$backend: UP" || echo "$backend: DOWN"
done
# Active connection state to confirm connection pile-up
# (assumes stub_status is enabled and exposed at /stub_status on localhost)
curl -s http://127.0.0.1/stub_status
# 502 rate from access log
tail -n 10000 /var/log/nginx/access.log | awk '{if ($9 == 502) count++} END {print "502 count:", count+0}'
# Resolver errors when using variable-based proxy_pass
tail -1000 /var/log/nginx/error.log | grep -iE "resolver|host not found"
How to diagnose it
- Confirm the scope. Check the error log for the exact no live upstreams line. Note the upstream name and timestamp. Determine whether the error is continuous or intermittent. Intermittent errors that correlate with traffic spikes suggest aggressive max_fails.
- Check backend health directly. From the nginx host, test TCP connectivity to each backend in the affected group. If TCP fails, the backend or network is down. If TCP succeeds but HTTP fails, the application process is unhealthy.
- Distinguish slowness from death. If backends accept TCP but nginx logs upstream timed out, the backends are slow, not dead. Compare $upstream_response_time in the access log against proxy_read_timeout. Times near the timeout value mean nginx is giving up before the backend responds.
- Verify the failure mode in the error log. connect() failed (111) means the backend refused the connection: process down or port not listening. connect() failed (110) means the TCP handshake timed out: network issue or firewall. upstream timed out means the backend accepted the connection but did not respond in time.
- Review health check thresholds. Run nginx -T and inspect the upstream block. Default max_fails=1 and fail_timeout=10s are aggressive for many workloads. Servers flapping in and out of availability indicate thresholds that are too sensitive.
- Check for configuration traps. Look for unreachable addresses mixed with live ones in the upstream block, or a down parameter on every server except one that is now also failing. In auto-generated configurations, ensure the generation logic filters out addresses that are not routable from the nginx host.
- Look for the cascade signature. Plot or tail $upstream_response_time and 502 rate over time. A true cascade shows upstream latency rising first, active connections increasing, then 502s emerging as servers are marked unavailable. If 502s appear without preceding latency increase, the backends failed suddenly.
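To look for the cascade signature without a metrics system, a per-minute count of 502s can be pulled from the access log. This sketch assumes the default combined log format, where the timestamp is field 4 and the status code is field 9; the function name status_per_minute and the sample lines are illustrative.

```shell
#!/usr/bin/env bash
# Count responses with a given status code per minute.
# Assumes the combined log format: $4 = "[dd/Mon/yyyy:HH:MM:SS", $9 = status.
status_per_minute() {
  local status="${1:-502}"
  awk -v want="$status" '
    $9 == want { count[substr($4, 2, 17)]++ }   # keep dd/Mon/yyyy:HH:MM
    END { for (m in count) print m, count[m] }
  ' | LC_ALL=C sort
}

# Example run on hypothetical access-log lines.
status_per_minute 502 <<'EOF'
203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "curl"
203.0.113.5 - - [10/Oct/2024:13:56:02 +0000] "GET / HTTP/1.1" 502 157 "-" "curl"
203.0.113.6 - - [10/Oct/2024:13:56:40 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl"
203.0.113.7 - - [10/Oct/2024:13:57:05 +0000] "GET / HTTP/1.1" 502 157 "-" "curl"
EOF
```

The sort is plain text ordering, so the series is chronological only within a single day. Passing 499 instead of 502 gives the client-abandon series for comparison.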
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| HTTP 5xx rate | Confirms user-facing impact | Any sustained no live upstreams episode, or 502 rate above 1% |
| Upstream response time (P95) | Leading indicator of a cascade | P95 trending up more than 20% from baseline, or approaching 80% of proxy_read_timeout |
| Active connections / Writing state | Connections piling up behind slow backends | Writing sustained above 90% of active with low request throughput |
| Upstream errors in error log | Which backends are failing and how | Sustained connect() failed or upstream timed out for specific servers |
| Accepts vs handled gap | Rules out connection exhaustion that mimics upstream failure | Growing gap indicates nginx is dropping connections at the worker level |
| Client abandons (499 rate) | Users give up before nginx times out | Spike correlating with upstream latency increase |
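Where no monitoring agent is available, the P95 of upstream response time can be approximated from the access log. This is a sketch under a strong assumption: a custom log_format that writes $upstream_response_time as the last field of each line. The function name upstream_p95 and the sample lines are hypothetical. Lines where the last field is not a plain number (for example "-") are skipped.

```shell
#!/usr/bin/env bash
# Nearest-rank P95 of the last field of each line read from stdin.
# Assumes $upstream_response_time is logged as the final field.
upstream_p95() {
  awk '$NF ~ /^[0-9]+(\.[0-9]+)?$/ { print $NF }' | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) { print "no samples"; exit }
      idx = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) using integer math
      print v[idx]
    }
  '
}

# Example run on hypothetical log lines (last field is seconds).
upstream_p95 <<'EOF'
203.0.113.5 "GET /a" 200 0.120
203.0.113.5 "GET /b" 200 0.480
203.0.113.6 "GET /c" 200 0.250
203.0.113.7 "GET /d" 502 57.900
203.0.113.8 "GET /e" 200 0.310
203.0.113.9 "GET /f" 502 -
EOF
```

Comparing the printed value with proxy_read_timeout shows how close the pool is to the 80% warning sign in the table. When nginx retries a request on several servers it logs several comma-separated times, and this sketch reads only the last of them.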
Fixes
Backends are down or unresponsive
Fix the root cause: restart crashed processes, resolve application errors, or scale capacity. If you cannot restore backends immediately and you have backup servers, verify they are receiving traffic. Without backups, consider adding a temporary backend or redirecting traffic at the edge.
Backends are slow (cascade scenario)
Reduce pressure on the upstream pool. If you have a load balancer or CDN in front of nginx, enable stricter rate limiting or queue shedding there. As a temporary tradeoff, reduce proxy_read_timeout so nginx fails faster on slow requests, freeing connection slots. This increases 502s for slow requests but preserves capacity for fast ones. Do not raise timeouts to mask underlying slowness.
Passive health checks are too aggressive
Increase max_fails to 3 or higher, and raise fail_timeout if backends experience brief, recoverable blips. The tradeoff is slower failure detection: more requests hit a failing server before it is removed. Tune based on your tolerance for errors versus false positives.
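As an illustration, relaxed thresholds might look like the block below, staged in a scratch file through a shell heredoc so it can be reviewed before it is merged into the real configuration. The upstream name app_backend, the addresses, the zone size, and the specific values (3 failures, 30 seconds) are assumptions for the example, not recommendations for every workload.

```shell
#!/usr/bin/env bash
# Stage a candidate upstream block in a scratch file for review.
# All names, addresses, and values below are placeholders.
candidate="$(mktemp)"
cat > "$candidate" <<'EOF'
upstream app_backend {
    zone app_backend 64k;    # share failure counters across workers
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 backup;    # used only when both primaries are unavailable
}
EOF
cat "$candidate"
```

After merging a block like this into the http context, run nginx -t before reloading. With these values a failing server absorbs up to three failed requests before removal and then stays out for thirty seconds, which is the detection-speed tradeoff described above.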
Unreachable or spurious upstream entries
Remove down entries or unreachable addresses that do not belong in the active pool, then reload nginx. In auto-generated configurations, ensure the generation logic filters out addresses that are not routable from the nginx host.
Emergency recovery when all servers are marked down
Open-source nginx does not support manually marking a server available at runtime. You must either fix the backend and wait for the fail_timeout cycle to attempt a probe, or reload nginx with adjusted configuration. Setting max_fails=0 on a server disables passive health checks entirely for that server. Use this only as a last resort: nginx will continue sending traffic to a genuinely failed backend.
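If a reload is the chosen path, it is safer to gate it on a configuration test so that a typo made under pressure does not compound the outage. A minimal sketch follows; safe_reload and the NGINX_BIN override are hypothetical helpers, while nginx -t and nginx -s reload are the standard commands.

```shell
#!/usr/bin/env bash
# Validate the configuration and reload only if the test passes.
# NGINX_BIN is a hypothetical override so the function can be exercised
# without a real nginx; it defaults to the nginx found on PATH.
safe_reload() {
  local nginx_bin="${NGINX_BIN:-nginx}"
  if "$nginx_bin" -t; then
    "$nginx_bin" -s reload
  else
    echo "nginx -t failed; not reloading" >&2
    return 1
  fi
}
```

Calling safe_reload usually requires the same privileges as the running master process. A non-zero exit status means the running configuration was left untouched.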
Prevention
- Size upstream capacity with failure in mind. Maintain N+1 redundancy so that losing one backend does not overload the remainder and trigger a cascade.
- Tune max_fails and fail_timeout to your workload. Defaults of 1 failure and 10 seconds are rarely appropriate for production pools larger than two servers.
- Use backup servers for critical upstream blocks. Backups only receive traffic when all primaries are unavailable, providing a safety valve.
- Avoid mixing reachable and unreachable addresses in the same upstream block. Audit auto-generated configurations regularly.
- Monitor upstream response time as a leading indicator. Alert on P95 latency deviations before backends hit timeout thresholds.
- Use the zone directive so health state is shared across all workers, preventing inconsistent failure counting.
How Netdata helps
- Correlate nginx 502 rate with upstream response time to spot a backend cascade before every server is marked down.
- Alert on error log entries containing no live upstreams to reduce detection time.
- Track active connections and the Reading/Writing/Waiting breakdown to see connections piling up behind slow backends.
- Monitor per-upstream latency and status code distributions via access log parsing to identify which backend is degrading first.
- Use anomaly detection on upstream response time percentiles to catch degradation before passive health checks trigger.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 502 Bad Gateway: causes and how to fix it
- NGINX active connections climbing: reading, writing, waiting explained
- NGINX connection exhaustion: detection, diagnosis, and prevention
- NGINX dropped connections: the accepts vs handled gap
- NGINX monitoring checklist: the signals every production server needs
- NGINX monitoring maturity model: from survival to expert
- NGINX slowloris and slow-client attacks: detection and mitigation
- nginx: too many open files - diagnosing file descriptor exhaustion
- nginx: worker_connections are not enough - causes and fixes
- NGINX worker_connections and worker_processes: sizing for real traffic
- NGINX worker_rlimit_nofile: setting file descriptor limits correctly







