nginx no live upstreams while connecting to upstream: what it means

When nginx logs no live upstreams while connecting to upstream, every server in the affected upstream block is marked unavailable. The proxied request has no eligible backend, so nginx returns 502 Bad Gateway. This is not an nginx defect; it signals that all backends have failed open-source nginx’s passive health checks, or a network partition has made them unreachable from the nginx host.

Open-source nginx supports only passive health checks. The defaults are aggressive: max_fails=1 and fail_timeout=10s. One timeout or connection failure inside a ten-second window removes a server from rotation for ten seconds. When every server crosses that threshold, the pool has zero live members.

The most common trigger is a backend cascade: one server slows down, traffic shifts to the remainder, they overload and fail their own health checks, and the entire pool is marked down within seconds. The fix is upstream-side, but the immediate priority is confirming scope, restoring capacity, and preventing recurrence.

What this means

nginx uses max_fails and fail_timeout in an upstream block to decide server availability. max_fails sets the number of failed attempts that must occur inside the fail_timeout window for the server to be considered unavailable. Once that count is reached, nginx marks the server unavailable for a period that is also set by fail_timeout. After that period, nginx sends the next request to the server as a probe. If the probe succeeds, the server is restored; if it fails, the server stays unavailable for another full fail_timeout cycle.

If every server in the group is marked unavailable simultaneously, nginx has no destination for the proxied request. It logs no live upstreams while connecting to upstream and returns 502.

Two nuances limit when this error can appear:

Single-server guardrail. If an upstream block contains exactly one server, nginx ignores max_fails, fail_timeout, and slow_start. The lone server is never marked unavailable by passive checks. no live upstreams only appears when an upstream block has two or more servers and all are down.
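Backup servers. A server with the backup parameter receives traffic only when all non-backup servers are unavailable. Backups are subject to the same passive checks, so the error appears only once the backups have also been marked unavailable. A healthy backup prevents the error, though it may overload if the primaries stay down.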

The zone directive creates a shared memory zone for the upstream group, letting worker processes share health-check state. Without it, each worker maintains independent failure counters, which can produce inconsistent behavior under high concurrency.
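
As a rough sketch of how these parameters sit together in a configuration, the shell function below prints an upstream block with explicit max_fails and fail_timeout values, an optional backup server, and a zone. The function name render_upstream, the upstream name app_backend, the 64k zone size, and every address are hypothetical placeholders chosen for illustration.

```shell
# Print an nginx upstream block to stdout.
# Usage: render_upstream NAME MAX_FAILS FAIL_TIMEOUT BACKUP PRIMARY...
# BACKUP may be an empty string to omit the backup server.
render_upstream() {
  name=$1; max_fails=$2; fail_timeout=$3; backup=$4
  shift 4
  printf 'upstream %s {\n' "$name"
  # Shared memory zone so all workers see the same failure counters.
  # The 64k size is an assumed placeholder.
  printf '    zone %s 64k;\n' "$name"
  for addr in "$@"; do
    printf '    server %s max_fails=%s fail_timeout=%s;\n' \
      "$addr" "$max_fails" "$fail_timeout"
  done
  if [ -n "$backup" ]; then
    # Receives traffic only when every primary is unavailable.
    printf '    server %s backup;\n' "$backup"
  fi
  printf '}\n'
}

# The nginx defaults written out explicitly (hypothetical addresses).
render_upstream app_backend 1 10s 10.0.1.12:8080 10.0.1.10:8080 10.0.1.11:8080
```

Writing the defaults out explicitly makes it easier to see, in nginx -T output, which thresholds a pool is actually running with.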

flowchart TD
    A[Backend slowness or network issue] --> B[Request times out or fails]
    B --> C{max_fails reached within fail_timeout?}
    C -->|Yes| D[Server marked unavailable for fail_timeout]
    C -->|No| E[Traffic continues to server]
    D --> F[Traffic shifts to remaining servers]
    F --> G[Remaining servers overload and fail]
    G --> H[All servers marked unavailable]
    H --> I[nginx returns 502 no live upstreams]

Common causes

Cause	What it looks like	First thing to check
Backend cascade failure	Upstream response time climbs, Writing connections pile up, then 502s spike; error log shows `upstream timed out`	Backend application metrics and logs: CPU, memory, or database contention
Network partition or firewall change	Error log shows `connect() failed (110: Connection timed out)` or `(111: Connection refused)`; backends respond when checked directly	Network path from nginx host to backend: `nc`, `curl`, or `/dev/tcp` tests
Simultaneous backend restart or crash	Immediate 502 spike; `connect() failed (111)` for every upstream	Backend process liveness and listen ports
Overly aggressive passive health checks	Brief latency blip removes healthy servers for `fail_timeout` seconds; flapping under load	`nginx -T` output for `max_fails` and `fail_timeout` values
Unreachable entries in upstream block	The only reachable server is marked unavailable, leaving entries that are flagged `down` or unreachable; common in multi-network container setups	`nginx -T` for entries such as `127.0.0.1 down` or unreachable addresses mixed with live backends
DNS resolution failure (dynamic upstreams)	502s isolated to locations using variable `proxy_pass`; resolver errors in log	`resolver` directive and DNS server reachability

Quick checks

# Error log for the exact error and upstream failures
tail -n 1000 /var/log/nginx/error.log | grep -E "no live upstreams|upstream timed out|connect\(\) failed"

# Upstream configuration, including max_fails and fail_timeout
# (matches only lines that open an upstream block, not every mention of "upstream")
nginx -T 2>/dev/null | grep -A 20 -E '^[[:space:]]*upstream[[:space:]]'

# TCP reachability to each backend from the nginx host
# Substitute your actual backend host:port values
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  timeout 2 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && \
    echo "$backend: UP" || echo "$backend: DOWN"
done

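# Active connection state to confirm connection pile-up
# Requires stub_status to be enabled; adjust the URL to your stub_status location
curl -s http://127.0.0.1/stub_status

# 502 count and share of the last 10000 requests
# Assumes the default "combined" log format, where field 9 is the status code
tail -n 10000 /var/log/nginx/access.log | awk '$9 == 502 {count++} END {printf "502 count: %d of %d (%.2f%%)\n", count, NR, NR ? 100 * count / NR : 0}'

# Resolver errors when using variable-based proxy_pass
tail -n 1000 /var/log/nginx/error.log | grep -iE "resolver|host not found"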
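
The stub_status output can be reduced to the two numbers that matter during a suspected pile-up. The sketch below assumes the standard four-line stub_status text format; the function name summarize_stub_status and the output field names are hypothetical.

```shell
# Summarize stub_status text read from stdin.
# Prints the accepts-minus-handled gap and Writing as a percentage of active.
summarize_stub_status() {
  awk '
    /^Active connections:/ { active = $3 }
    # The line after this header holds the accepts, handled, requests counters.
    /^server accepts handled requests/ { getline; accepts = $1; handled = $2 }
    /^Reading:/ { writing = $4 }
    END {
      pct = (active > 0) ? int(100 * writing / active) : 0
      printf "dropped=%d writing_pct=%d\n", accepts - handled, pct
    }
  '
}
```

A possible invocation, using the same assumed endpoint as above, is curl -s http://127.0.0.1/stub_status | summarize_stub_status. A nonzero dropped value points toward connection exhaustion on the nginx side instead of an upstream failure.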

How to diagnose it

Confirm the scope. Check the error log for the exact no live upstreams line. Note the upstream name and timestamp. Determine whether the error is continuous or intermittent. Intermittent errors that correlate with traffic spikes suggest max_fails and fail_timeout thresholds that are too sensitive.
Check backend health directly. From the nginx host, test TCP connectivity to each backend in the affected group. If TCP fails, the backend or network is down. If TCP succeeds but HTTP fails, the application process is unhealthy.
Distinguish slowness from death. If backends accept TCP but nginx logs upstream timed out, the backends are slow, not dead. Compare $upstream_response_time in the access log against proxy_read_timeout. Times near the timeout value mean nginx is giving up before the backend responds.
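Verify the failure mode in the error log. connect() failed (111) means the backend refused the connection: process down or port not listening. connect() failed (110) means the TCP handshake timed out: network issue or firewall. For upstream timed out, read the phase at the end of the message: while connecting to upstream means the connection was not established within proxy_connect_timeout, and while reading response header from upstream means the backend accepted the connection but did not respond in time.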
Review health check thresholds. Run nginx -T and inspect the upstream block. Default max_fails=1 and fail_timeout=10s are aggressive for many workloads. Servers flapping in and out of availability indicate thresholds that are too sensitive.
Check for configuration traps. Look for unreachable addresses mixed with live ones in the upstream block, or a down parameter on every server except one that is now also failing. In auto-generated configurations, check whether the generator emitted addresses that are not routable from the nginx host.
Look for the cascade signature. Plot or tail $upstream_response_time and 502 rate over time. A true cascade shows upstream latency rising first, active connections increasing, then 502s emerging as servers are marked unavailable. If 502s appear without preceding latency increase, the backends failed suddenly.
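
The failure modes from the steps above can be tallied in one pass over the error log. This is a minimal sketch that matches on the message fragments nginx writes for each case; the function name classify_upstream_errors and the output field names are hypothetical.

```shell
# Count upstream failure modes in an nginx error log read from stdin.
classify_upstream_errors() {
  awk '
    # Backend refused the connection: process down or port not listening.
    /connect\(\) failed \(111:/ { refused++ }
    # Connection could not be established in time: network or firewall.
    /connect\(\) failed \(110:/ { connect_timeout++ }
    /upstream timed out.*while connecting to upstream/ { connect_timeout++ }
    # Connection was accepted but the response did not arrive in time.
    /upstream timed out.*while reading/ { read_timeout++ }
    # Every server in the group was already marked unavailable.
    /no live upstreams/ { no_live++ }
    END {
      printf "refused=%d connect_timeout=%d read_timeout=%d no_live=%d\n",
        refused, connect_timeout, read_timeout, no_live
    }
  '
}
```

For example, tail -n 1000 /var/log/nginx/error.log | classify_upstream_errors gives a quick read on whether backends are refusing connections, unreachable, or slow. A high read_timeout count that precedes the first no_live entries is consistent with the cascade signature.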

Metrics and signals to monitor

Signal	Why it matters	Warning sign
HTTP 5xx rate	Confirms user-facing impact	Any sustained `no live upstreams` episode, or 502 rate above 1%
Upstream response time (P95)	Leading indicator of a cascade	P95 trending up more than 20% from baseline, or approaching 80% of `proxy_read_timeout`
Active connections / Writing state	Connections piling up behind slow backends	Writing sustained above 90% of active with low request throughput
Upstream errors in error log	Which backends are failing and how	Sustained `connect() failed` or `upstream timed out` for specific servers
Accepts vs handled gap	Rules out connection exhaustion that mimics upstream failure	Growing gap indicates nginx is dropping connections at the worker level
Client abandons (499 rate)	Users give up before nginx times out	Spike correlating with upstream latency increase
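
The P95 warning sign in the table can be checked from the command line. The sketch below assumes you can extract upstream response times as one plain number per line, which requires a log_format that includes $upstream_response_time; the function names p95 and p95_near_timeout are hypothetical, and values such as "-" or comma-separated lists from retried requests would need to be filtered out first.

```shell
# Print the nearest-rank P95 of numbers read from stdin, one per line.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((95 * NR + 99) / 100)   # ceil(0.95 * NR) in integer arithmetic
      print v[idx]
    }
  '
}

# Exit 0 when P95 has reached 80% of the given timeout (time to warn).
# Usage: p95_near_timeout TIMEOUT_SECONDS < response_times.txt
p95_near_timeout() {
  limit=$1
  value=$(p95) || return 2          # no samples on stdin
  awk -v p="$value" -v t="$limit" 'BEGIN { exit !(p >= 0.8 * t) }'
}
```

With the default proxy_read_timeout of 60 seconds, p95_near_timeout 60 succeeds once P95 reaches 48 seconds. The 80% threshold is taken from the table above and should be adjusted to your own baseline.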

Fixes

Backends are down or unresponsive

Fix the root cause: restart crashed processes, resolve application errors, or scale capacity. If you cannot restore backends immediately and you have backup servers, verify they are receiving traffic. Without backups, consider adding a temporary backend or redirecting traffic at the edge.

Backends are slow (cascade scenario)

Reduce pressure on the upstream pool. If you have a load balancer or CDN in front of nginx, enable stricter rate limiting or queue shedding there. As a temporary tradeoff, reduce proxy_read_timeout so nginx fails faster on slow requests, freeing connection slots. This increases 502s for slow requests but preserves capacity for fast ones. Do not raise timeouts to mask underlying slowness.

Passive health checks are too aggressive

Increase max_fails to 3 or higher, and raise fail_timeout if backends experience brief, recoverable blips. The tradeoff is slower failure detection: more requests hit a failing server before it is removed. Tune based on your tolerance for errors versus false positives.

Unreachable or spurious upstream entries

Remove down entries or unreachable addresses that do not belong in the active pool, then reload nginx. In auto-generated configurations, ensure the generation logic filters out addresses that are not routable from the nginx host.
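
One way a configuration generator might filter its input is sketched below. It reuses the /dev/tcp probe from the quick checks and emits a server line only for backends that accept a TCP connection from the nginx host. The function names probe_backend and emit_reachable_servers are hypothetical, and the probe requires bash and the timeout utility.

```shell
# Return success if a TCP connection to HOST PORT opens within 2 seconds.
# Relies on bash's /dev/tcp pseudo-device, as in the quick checks.
probe_backend() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null
}

# Print a "server HOST:PORT;" line for each reachable backend argument.
# Unreachable backends are reported on stderr and left out.
emit_reachable_servers() {
  for backend in "$@"; do
    host=${backend%:*}
    port=${backend##*:}
    if probe_backend "$host" "$port"; then
      printf '    server %s;\n' "$backend"
    else
      echo "skipping unreachable backend: $backend" >&2
    fi
  done
}
```

A generator built this way should refuse to write the file when no server line is emitted, because nginx rejects an upstream block that contains no servers. Note also that a point-in-time probe removes backends that are only briefly down, so this is a tradeoff, not a substitute for fixing the address source.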

Emergency recovery when all servers are marked down

Open-source nginx does not support manually marking a server available at runtime. You must either fix the backend and wait for fail_timeout to elapse, after which nginx probes the server with the next request, or reload nginx with adjusted configuration. Setting max_fails=0 on a server disables failure accounting for that server, so passive checks never mark it unavailable. Use this only as a last resort: nginx will continue sending traffic to a genuinely failed backend.
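
When reloading under pressure, it helps to validate the edited configuration first so a typo does not compound the outage. The sketch below wraps the standard nginx -t and nginx -s reload commands; the function name safe_reload and the NGINX_BIN override are hypothetical.

```shell
# Validate the configuration and reload only if validation passes.
# NGINX_BIN is a hypothetical override for the nginx binary path.
safe_reload() {
  nginx_bin=${NGINX_BIN:-nginx}
  if "$nginx_bin" -t; then
    "$nginx_bin" -s reload
  else
    echo "configuration test failed; not reloading" >&2
    return 1
  fi
}
```

Running safe_reload after adjusting max_fails or removing a bad server entry keeps the running workers untouched if the new configuration does not parse.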

Prevention

Size upstream capacity with failure in mind. Maintain N+1 redundancy so that losing one backend does not overload the remainder and trigger a cascade.
Tune max_fails and fail_timeout to your workload. Defaults of 1 failure and 10 seconds are rarely appropriate for production pools larger than two servers.
Use backup servers for critical upstream blocks. Backups only receive traffic when all primaries are unavailable, providing a safety valve.
Avoid mixing reachable and unreachable addresses in the same upstream block. Audit auto-generated configurations regularly.
Monitor upstream response time as a leading indicator. Alert on P95 latency deviations before backends hit timeout thresholds.
Use the zone directive so health state is shared across all workers, preventing inconsistent failure counting.

How Netdata helps

Correlate nginx 502 rate with upstream response time to spot a backend cascade before every server is marked down.
Alert on error log entries containing no live upstreams to reduce detection time.
Track active connections and the Reading/Writing/Waiting breakdown to see connections piling up behind slow backends.
Monitor per-upstream latency and status code distributions via access log parsing to identify which backend is degrading first.
Use anomaly detection on upstream response time percentiles to catch degradation before passive health checks trigger.

