nginx upstream timed out (110: Connection timed out) while connecting/reading
upstream timed out (110: Connection timed out) in the nginx error log usually surfaces to clients as a 504 Gateway Timeout. The suffix after the error string tells you which phase failed: connecting, sending, or reading. That phase determines whether you are looking at a dead backend, a network partition, or a retry storm hiding the real problem.
The defaults are unforgiving. proxy_connect_timeout, proxy_send_timeout, and proxy_read_timeout all default to 60 seconds, and proxy_next_upstream implicitly retries on error and timeout. Retries can mask the root cause while exhausting upstream capacity.
This guide maps the exact log message to the directive that fired, reads retry evidence in $upstream_response_time, and fixes the failure without worsening retry behavior.
What this means
nginx distinguishes three timeout phases when talking to an upstream. Each produces a distinct suffix in the error log.
| Phase | Error log suffix | Directive | What it measures |
|---|---|---|---|
| Connect | while connecting to upstream | proxy_connect_timeout | TCP handshake (and TLS handshake if HTTPS). Defaults to 60s. The nginx documentation notes that this timeout usually cannot exceed 75s. |
| Send | while sending request to upstream | proxy_send_timeout | Idle time between successive write operations to upstream, not total upload duration. Defaults to 60s. |
| Read header | while reading response header from upstream | proxy_read_timeout | Idle time between successive read operations from upstream. Defaults to 60s. If the backend sends headers slowly, this fires. |
| Read body | while reading upstream | proxy_read_timeout | Same directive as header read, but fires while streaming the response body. |
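
To see which phase dominates, the suffixes can be tallied directly from the error log. The following is a minimal sketch rather than a definitive tool: the function name timeout_phases is hypothetical, and it assumes the stock error-log wording, where the phase text follows "while" and ends with the word "upstream".

```shell
# Hypothetical helper: tally upstream timeouts by phase.
# Reads nginx error-log lines on stdin, prints "<count><TAB><phase>".
timeout_phases() {
  # Keep only the "upstream timed out (...) while <phase>" fragment of each line
  grep -oE 'upstream timed out \([^)]*\) while [a-z ]+upstream' \
    | sed -E 's/.* while //' \
    | sort | uniq -c | sort -rn \
    | awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count "\t" $0 }'
}
```

Usage, assuming the same log path as the rest of this guide: `timeout_phases < /var/log/nginx/error.log`. A result dominated by one phase tells you which directive in the table is firing.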
The client sees a 504 when nginx gives up on an upstream attempt. If the client disconnects first, nginx logs a 499 instead. If retries are configured and all attempts fail, the final client-facing status is still 504.
Because proxy_next_upstream implicitly defaults to error timeout, nginx may retry the same request on another upstream server when any of these timeouts fires. The retry can succeed and the user never sees an error, but the upstream that timed out is still sick, and the retry adds load to the remaining backends.
```mermaid
flowchart TD
    A[upstream timed out 110] --> B{Which phase?}
    B -->|while connecting| C[Network / TCP / backlog issue]
    B -->|while sending| D[Upstream read stall or buffering]
    B -->|while reading header| E[Backend processing is slow]
    B -->|while reading body| F[Large body / slow transfer]
    C --> G[Check upstream_connect_time and TCP path]
    E --> H[Check upstream_header_time and backend logs]
```

Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Backend is genuinely slow or dead | Read-phase timeouts dominate; $upstream_response_time or $upstream_header_time near proxy_read_timeout; 504 rate rises | Backend CPU, memory, and application logs |
| Network partition or firewall between nginx and upstream | Connect-phase timeouts; $upstream_connect_time missing or maxed; TCP handshake never completes | Layer-3/4 reachability from the nginx host to the upstream port |
| Upstream accept queue full | Connect-phase timeouts; backend process is alive but kernel backlog is overflowing | ss -tlnp on the upstream and TcpExtListenOverflows |
| Retry storm exhausting upstream capacity | Multiple comma-separated values in $upstream_response_time; errors rotate across backends | Whether proxy_next_upstream_timeout or proxy_next_upstream_tries is unbounded |
| Keepalive pool exhausted | $upstream_connect_time suddenly nonzero and spiking; upstream otherwise healthy | Upstream keepalive connection reuse ratio |
| Dynamic upstream DNS resolution failing | Variable proxy_pass with resolver; 502s mixed with timeouts; latency clusters near resolver_timeout (default 30s) | resolver directive reachability and valid= TTL |
Quick checks
Run these in order. They are read-only and safe on a live server.
```shell
# Identify the exact timeout phase from recent error logs
grep -E 'upstream timed out.*while (connecting|reading|sending)' /var/log/nginx/error.log | tail -20

# Inspect current timeout directive values in the running config
nginx -T 2>/dev/null | grep -E 'proxy_(connect|send|read)_timeout'

# Look for retry evidence in access logs. nginx separates attempts with ", ",
# so match the whole line instead of a single whitespace-delimited field.
# Assumes $upstream_response_time is the last field; adjust the pattern to your log_format
tail -10000 /var/log/nginx/access.log | grep -E '([0-9.]+|-), ([0-9.]+|-)$' | head -10

# Check nginx connection pressure (high Writing + low throughput = slow upstream)
curl -s http://127.0.0.1/stub_status

# Test raw TCP connectivity from nginx to each upstream backend
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  timeout 2 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && \
    echo "$backend: UP" || echo "$backend: DOWN"
done

# Detect kernel-level drops that never reach nginx logs
nstat -az TcpExtListenOverflows 2>/dev/null | awk '/ListenOverflows/ {print $2}'

# Review proxy_next_upstream settings
nginx -T 2>/dev/null | grep -E 'proxy_next_upstream'

# Check if retry limits are actually bounded
nginx -T 2>/dev/null | grep -E 'proxy_next_upstream_(tries|timeout)'
```
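
The first check shows which phase failed but not which backend. As a rough sketch, the timeouts can also be grouped by the upstream: "..." field that nginx appends to each error line. The function name timeouts_by_backend is hypothetical, and the sed expression assumes TCP upstreams written as scheme://host:port; unix-socket upstreams would need a different pattern.

```shell
# Hypothetical helper: count upstream timeouts per backend address.
# Reads nginx error-log lines on stdin, prints "<count><TAB><host:port>".
timeouts_by_backend() {
  grep 'upstream timed out' \
    | grep -oE 'upstream: "[^"]+"' \
    | sed -E 's#upstream: "[a-z]+://([^/"]+).*#\1#' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1 "\t" $2 }'
}
```

Like the checks above, this only reads the log: `timeouts_by_backend < /var/log/nginx/error.log`. One backend far ahead of the others suggests a single sick server; an even spread points to a shared dependency or the network path.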
How to diagnose it
1. Classify the phase from the error suffix.
   - while connecting points to the network or the upstream TCP accept path.
   - while reading response header points to backend application processing.
   - while reading upstream (the response-body phase) points to slow payload generation or transfer.
   - while sending request is rare and usually means the upstream stopped reading.
2. Correlate with access-log timing.
   Log $upstream_response_time, $upstream_connect_time, and $upstream_header_time. If $upstream_header_time is high while the gap between $upstream_response_time and $upstream_header_time is small, the backend is slow to generate the response, not slow to transfer it.
3. Detect retries from $upstream_response_time punctuation.
   A comma separates times for different upstream servers contacted during retries. A colon separates times for different upstream groups when an internal redirect occurred. The last value is the final attempt. If you see commas, nginx retried and the first upstream failed.
4. Verify that retries are not making things worse.
   Check whether proxy_next_upstream_timeout is set. The default is 0 (unlimited total wall-clock time). A request with multiple slow retries can hang for minutes. Also check proxy_next_upstream_tries. This directive has existed since 1.7.5, but it is effectively capped at the number of servers in the upstream block. With one upstream server, proxy_next_upstream_tries 5 still only attempts once.
5. Test the upstream directly from the nginx host.
   Use curl or a raw TCP connect to bypass nginx entirely. If the direct test also times out, the problem is the backend or the network, not nginx configuration.
6. Check for 499s preceding 504s.
   A spike in 499 status codes means clients are giving up before nginx fires the upstream timeout. This is often the first visible symptom and confirms that total latency is breaching client-side patience thresholds.
7. Rule out nginx-side saturation.
   If active connections are near worker_connections * worker_processes or file descriptors are exhausted, nginx may be too constrained to maintain upstream connections efficiently. Check stub_status and per-worker FD counts.
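
The comma and colon rule for $upstream_response_time can be sketched as a small parser. This is an illustration under assumptions: the function name retry_summary is hypothetical, and it expects the variable's value to be extracted already, one value per line, since the field position depends on your log_format.

```shell
# Hypothetical helper: summarize retry evidence in $upstream_response_time.
# Reads one value per line on stdin, e.g. "60.001, 0.012" or "0.250".
retry_summary() {
  awk '
    $0 == "-" { next }                    # "-" alone: no upstream time recorded
    {
      n = split($0, t, "[,:]")            # commas and colons both separate attempts
      total = 0
      for (i = 1; i <= n; i++) {
        gsub(/ /, "", t[i])
        if (t[i] != "-") total += t[i]    # skip attempts with no recorded time
      }
      printf "attempts=%d total=%.3f\n", n, total
    }'
}
```

Any line reporting more than one attempt is a request that nginx retried, and the total shows how long the client actually waited on upstreams across all attempts.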
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| $upstream_response_time | Isolates backend latency from client slowness | P95 > 80% of proxy_read_timeout |
| $upstream_connect_time | Reveals TCP/TLS handshake overhead | Nonzero and spiking when keepalive should reuse connections |
| $upstream_header_time | Shows time to first byte from upstream | Approaching proxy_read_timeout while body transfer time remains low |
| 504 rate | Direct client impact of upstream timeouts | Any sustained nonzero rate |
| 499 rate | Clients abandoning before nginx times out | Correlates with rising upstream latency; often precedes 504 spikes |
| Active connections in Writing state | Connections held waiting for upstream | Writing > 50% of active connections with low request throughput |
| $upstream_addr | Identifies which specific backend is failing | Repeated same-backend failures before a retry succeeds |
| proxy_next_upstream_timeout vs request duration | Total retry budget exhaustion | Requests hanging longer than the primary timeout because retries are unbounded |
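
The P95 warning sign in the first row can be approximated offline from logged values. This is a sketch under assumptions, not a replacement for real monitoring: the function name p95_ratio is hypothetical, it uses the nearest-rank percentile method, and it expects one numeric upstream time per line with the timeout passed in seconds.

```shell
# Hypothetical helper: p95_ratio TIMEOUT_SECONDS
# Reads one upstream time (seconds) per line on stdin and prints the
# nearest-rank P95 plus its ratio to the given timeout.
p95_ratio() {
  grep -E '^[0-9.]+$' \
    | sort -n \
    | awk -v limit="$1" '
        { v[NR] = $1 }
        END {
          if (NR == 0) exit 1
          idx = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR)
          printf "p95=%.3f ratio=%.2f\n", v[idx], v[idx] / limit
        }'
}
```

With the default proxy_read_timeout of 60s you would call `p95_ratio 60`; a ratio above 0.80 matches the warning threshold in the table.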
Fixes
Backend is slow or overloaded
Fix the backend. There is no nginx tuning that makes a slow database query fast.
If you need immediate relief, reduce proxy_read_timeout so nginx fails faster and frees connection slots. The tradeoff is more 504s for legitimate long requests. You can also temporarily shrink proxy_next_upstream_tries or remove the timeout keyword from proxy_next_upstream to stop retrying slow backends and avoid amplifying load. If caching is enabled, set proxy_cache_use_stale updating error timeout so nginx serves stale content while the backend recovers.
Network or connect-phase failures
Fix the network path or firewall rule. If the upstream is cross-region and legitimately needs more handshake time, increase proxy_connect_timeout to the minimum necessary value. The nginx proxy module documentation notes that this timeout usually cannot exceed 75s, so larger values are unlikely to help. If you are proxying to hostnames resolved per-request, ensure the resolver directive points to a reliable server and add a valid= cache TTL to avoid repeated DNS lookups.
Retry behavior causing cascades
Set proxy_next_upstream_timeout to a finite total budget. The default of 0 means retries can accumulate unlimited wall-clock time. This is a total limit across all attempts, not per-attempt.
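
Whether a retry budget is actually in place can be checked from the dumped configuration. The sketch below is deliberately naive and its function name check_retry_bounds is hypothetical: it only looks at the last occurrence of each directive and ignores which server or location block it belongs to, so treat the output as a hint, not proof.

```shell
# Hypothetical helper: report whether retry limits are set.
# Reads nginx config text on stdin, for example from `nginx -T`.
check_retry_bounds() {
  awk '
    $1 == "proxy_next_upstream_timeout" { t = $2 }
    $1 == "proxy_next_upstream_tries"   { n = $2 }
    END {
      sub(/;$/, "", t); sub(/;$/, "", n)
      # Unset or 0 means no limit for both directives
      print "proxy_next_upstream_timeout: " ((t == "" || t ~ /^0[a-z]*$/) ? "unbounded" : t)
      print "proxy_next_upstream_tries: "   ((n == "" || n == "0") ? "unbounded" : n)
    }'
}
```

Usage: `nginx -T 2>/dev/null | check_retry_bounds`. If either line reports unbounded, retries are limited only by the number of servers in the upstream block and the per-attempt timeouts.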
Avoid adding http_500 to proxy_next_upstream unless you understand the interaction with max_fails. The default max_fails is 1 and fail_timeout is 10s. Adding http_500 means a request that returns HTTP 500 can be retried across multiple upstreams; each server that returns 500 is marked failed for 10 seconds. If the underlying bug affects all peers, this can empty the upstream pool.
Do not add non_idempotent to proxy_next_upstream unless the upstream safely handles duplicate non-idempotent requests. Without it, nginx does not pass POST, LOCK, or PATCH requests to the next server once the request has been sent to an upstream. Enabling it risks duplicate side effects if the first attempt partially succeeded.
Capacity and resource limits on nginx
If nginx itself is saturated, timeouts can be a secondary effect. Increase worker_connections and worker_rlimit_nofile to ensure workers can maintain both client and upstream sockets. Each proxied request uses at least two connection slots. Reduce keepalive_timeout on the upstream side if idle connections are filling the upstream’s accept queue.
Prevention
- Set proxy_read_timeout based on application SLA, not the default 60s. Endpoints with predictable fast responses should have tight timeouts; genuinely long-polling endpoints should have location-specific overrides.
- Log $upstream_response_time, $upstream_connect_time, and $upstream_header_time in your access log. Without them, you cannot distinguish connect latency from processing latency.
- Monitor the ratio of P95 $upstream_response_time to proxy_read_timeout. When it crosses 80%, mass timeouts are likely imminent.
- Verify upstream keepalive reuse. Without connection reuse, every request pays a TCP handshake tax and contributes to ephemeral port exhaustion.
- Set explicit proxy_next_upstream_timeout and proxy_next_upstream_tries to bound retry cost.
- Do not rely solely on nginx passive health checking for critical paths. Open-source nginx discovers unhealthy upstreams by sending real user traffic to them. Use external health checks or nginx Plus active checks if you need probe-based detection.
How Netdata helps
- Correlate 504 and 499 spikes with upstream response time percentiles to confirm backend degradation versus client impatience.
- Monitor active connections and the Writing state ratio to detect upstream slowdown before timeouts fire.
- Track file descriptor utilization and connection slot saturation to rule out nginx-side resource exhaustion.
- Alert on kernel-level TcpExtListenOverflows and listen backlog depth to catch silent connection drops that never appear in nginx logs.
- Parse access-log timing variables to flag P95 upstream latency approaching configured timeout thresholds.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 502 Bad Gateway: causes and how to fix it
- NGINX active connections climbing: reading, writing, waiting explained
- NGINX connection exhaustion: detection, diagnosis, and prevention
- NGINX dropped connections: the accepts vs handled gap
- NGINX monitoring checklist: the signals every production server needs
- NGINX monitoring maturity model: from survival to expert
- NGINX slowloris and slow-client attacks: detection and mitigation
- nginx: too many open files - diagnosing file descriptor exhaustion
- nginx: worker_connections are not enough - causes and fixes
- NGINX worker_connections and worker_processes: sizing for real traffic
- NGINX worker_rlimit_nofile: setting file descriptor limits correctly







