nginx upstream timed out (110: Connection timed out) while connecting/reading

upstream timed out (110: Connection timed out) in the nginx error log usually surfaces to clients as a 504 Gateway Timeout. The suffix after the error string tells you which phase failed: connecting, sending, or reading. That phase determines whether you are looking at a dead backend, a network partition, or a retry storm hiding the real problem.

The defaults are unforgiving. proxy_connect_timeout, proxy_send_timeout, and proxy_read_timeout all default to 60 seconds, and proxy_next_upstream implicitly retries on error and timeout. Retries can mask the root cause while exhausting upstream capacity.

This guide maps the exact log message to the directive that fired, reads retry evidence in $upstream_response_time, and fixes the failure without worsening retry behavior.

What this means

nginx distinguishes three timeout phases when talking to an upstream. Each produces a distinct suffix in the error log.

| Phase | Error log suffix | Directive | What it measures |
| --- | --- | --- | --- |
| Connect | while connecting to upstream | proxy_connect_timeout | TCP handshake (and TLS handshake if HTTPS). Defaults to 60s. The nginx documentation notes that this timeout usually cannot exceed 75s. |
| Send | while sending request to upstream | proxy_send_timeout | Idle time between successive write operations to upstream, not total upload duration. Defaults to 60s. |
| Read header | while reading response header from upstream | proxy_read_timeout | Idle time between successive read operations from upstream. Defaults to 60s. If the backend sends headers slowly, this fires. |
| Read body | while reading upstream | proxy_read_timeout | Same directive as header read, but fires while nginx is streaming the response body. |
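
To see which phase dominates, the mapping above can be applied to the error log in bulk. The following is a minimal sketch: count_timeout_phases is a hypothetical helper name, not an nginx tool, and the phase labels it prints are illustrative. It reads error-log lines on stdin and accepts both the short and long forms of the send and body-read suffixes.

```shell
# Hypothetical helper: tally "upstream timed out" errors by phase.
# stdin: nginx error-log lines. Output: "<count> <phase> <directive>".
count_timeout_phases() {
  awk '
    /upstream timed out/ {
      if ($0 ~ /while connecting to upstream/)
        key = "connect proxy_connect_timeout"
      else if ($0 ~ /while sending (request )?to upstream/)
        key = "send proxy_send_timeout"
      else if ($0 ~ /while reading response header from upstream/)
        key = "read-header proxy_read_timeout"
      else if ($0 ~ /while reading (response body from )?upstream/)
        key = "read-body proxy_read_timeout"
      else
        key = "other unknown"
      count[key]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -rn
}
```

For example, `count_timeout_phases < /var/log/nginx/error.log` lists the most frequent phase first, assuming the default error log path used elsewhere in this guide.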

The client sees a 504 when nginx gives up on the final upstream attempt. If the client disconnects first, nginx records a 499 in the access log instead. If retries are configured and every attempt times out, the client-facing status is still 504.

Because proxy_next_upstream implicitly defaults to error timeout, nginx may retry the same request on another upstream server when any of these timeouts fires. The retry can succeed and the user never sees an error, but the upstream that timed out is still sick, and the retry adds load to the remaining backends.

flowchart TD
  A[upstream timed out 110] --> B{Which phase?}
  B -->|while connecting| C[Network / TCP / backlog issue]
  B -->|while sending| D[Upstream read stall or buffering]
  B -->|while reading header| E[Backend processing is slow]
  B -->|while reading body| F[Large body / slow transfer]
  C --> G[Check upstream_connect_time and TCP path]
  E --> H[Check upstream_header_time and backend logs]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Backend is genuinely slow or dead | Read-phase timeouts dominate; $upstream_response_time or $upstream_header_time near proxy_read_timeout; 504 rate rises | Backend CPU, memory, and application logs |
| Network partition or firewall between nginx and upstream | Connect-phase timeouts; $upstream_connect_time missing or maxed; TCP handshake never completes | Layer-3/4 reachability from the nginx host to the upstream port |
| Upstream accept queue full | Connect-phase timeouts; backend process is alive but kernel backlog is overflowing | ss -tlnp on the upstream and TcpExtListenOverflows |
| Retry storm exhausting upstream capacity | Multiple comma-separated values in $upstream_response_time; errors rotate across backends | Whether proxy_next_upstream_timeout or proxy_next_upstream_tries is unbounded |
| Keepalive pool exhausted | $upstream_connect_time suddenly nonzero and spiking; upstream otherwise healthy | Upstream keepalive connection reuse ratio |
| Dynamic upstream DNS resolution failing | Variable proxy_pass with resolver; 502s mixed with timeouts; latency clusters near resolver_timeout (default 30s) | resolver directive reachability and valid= TTL |

Quick checks

Run these in order. They are read-only and safe on a live server.

# Identify the exact timeout phase from recent error logs
grep -E 'upstream timed out.*while (connecting|reading|sending)' /var/log/nginx/error.log | tail -20

# Inspect current timeout directive values in the running config
nginx -T 2>/dev/null | grep -E 'proxy_(connect|send|read)_timeout'

# Look for retry evidence in access logs: nginx joins per-attempt times with ", " (e.g. 60.001, 0.012)
# Assumes $upstream_response_time is logged; the pattern matches two adjacent timing values anywhere in the line
tail -10000 /var/log/nginx/access.log | grep -E '[0-9]+\.[0-9]{3}, ([0-9]+\.[0-9]{3}|-)' | head -10

# Check nginx connection pressure (high Writing + low throughput = slow upstream)
# Assumes a stub_status location is exposed at /stub_status on localhost; adjust to your config
curl -s http://127.0.0.1/stub_status

# Test raw TCP connectivity from nginx to each upstream backend
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  timeout 2 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && \
    echo "$backend: UP" || echo "$backend: DOWN"
done

# Detect kernel-level drops that never reach nginx logs
nstat -az TcpExtListenOverflows 2>/dev/null | awk '/ListenOverflows/ {print $2}'

# Review proxy_next_upstream settings
nginx -T 2>/dev/null | grep -E 'proxy_next_upstream'

# Check if retry limits are actually bounded
# No output means the defaults apply: no explicit limit on tries or on total retry time
nginx -T 2>/dev/null | grep -E 'proxy_next_upstream_(tries|timeout)'
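
Beyond the phase, it helps to know whether the timeouts concentrate on one backend. The sketch below assumes the standard error-log layout, in which each entry carries an upstream: "scheme://host:port/..." field; timeouts_by_backend is a hypothetical helper name.

```shell
# Hypothetical helper: count upstream timeouts per backend address.
# stdin: nginx error-log lines. Output: "<count> <host:port>".
timeouts_by_backend() {
  grep 'upstream timed out' \
    | sed -n 's#.*upstream: "[a-z]*://\([^/"]*\).*#\1#p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'   # normalize the padding uniq -c adds
}
```

Run it as `timeouts_by_backend < /var/log/nginx/error.log`. One address far ahead of the rest suggests a single sick backend; an even spread points at a shared dependency or the network path.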

How to diagnose it

  1. Classify the phase from the error suffix.
    while connecting points to the network or the upstream TCP accept path. while reading response header points to backend application processing. while reading upstream (nginx's wording for the response body phase) points to slow payload generation or transfer. while sending request is rare and usually means the upstream stopped reading.

  2. Correlate with access-log timing.
    Log $upstream_response_time, $upstream_connect_time, and $upstream_header_time. If $upstream_header_time is high while the gap between $upstream_response_time and $upstream_header_time is small, the backend is slow to generate the response, not slow to transfer it.

  3. Detect retries from $upstream_response_time punctuation.
    A comma separates times for different upstream servers contacted during retries. A colon separates times for different upstream groups when an internal redirect occurred. The last value is the final attempt. If you see commas, nginx retried and the first upstream failed.

  4. Verify that retries are not making things worse.
    Check whether proxy_next_upstream_timeout is set. The default is 0, which places no limit on the total wall-clock time spent across attempts, so a request with several slow retries can hang for minutes. Also check proxy_next_upstream_tries (available since 1.7.5). Whatever value you set, it is silently capped at the number of servers in the upstream block. With one upstream server, proxy_next_upstream_tries 5 still only attempts once.

  5. Test the upstream directly from the nginx host.
    Use curl or a raw TCP connect to bypass nginx entirely. If the direct test also times out, the problem is the backend or the network, not nginx configuration.

  6. Check for 499s preceding 504s.
    A spike in 499 status codes means clients are giving up before nginx fires the upstream timeout. This is often the first visible symptom and confirms that total latency is breaching client-side patience thresholds.

  7. Rule out nginx-side saturation.
    If active connections are near worker_connections * worker_processes or file descriptors are exhausted, nginx may be too constrained to maintain upstream connections efficiently. Check stub_status and per-worker FD counts.
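
The punctuation rules in step 3 can be applied mechanically. The sketch below is a hypothetical helper (summarize_upstream_times is not an nginx tool) that takes one $upstream_response_time value and reports how many upstream attempts it contains and their combined time. It assumes nginx's usual formatting, where attempts are joined with ", " and group hops with " : ", and it treats a "-" entry as an attempt with no recorded time.

```shell
# Hypothetical helper: summarize one $upstream_response_time value.
# $1: the logged value, e.g. "60.001, 0.012".
# Output: "attempts=<n> total=<seconds>".
summarize_upstream_times() {
  printf '%s\n' "$1" | awk '
    {
      gsub(/ *[,:] */, " ")   # retry commas and group colons become plain separators
      n = split($0, t, " ")
      total = 0
      for (i = 1; i <= n; i++)
        if (t[i] != "-") total += t[i]   # "-" means no time was recorded
      printf "attempts=%d total=%.3f\n", n, total
    }'
}
```

A value such as "60.001, 0.012" reports two attempts: the first ran into a 60-second timeout and the retry answered quickly, so the client waited about a minute for an apparently successful response.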

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| $upstream_response_time | Isolates backend latency from client slowness | P95 > 80% of proxy_read_timeout |
| $upstream_connect_time | Reveals TCP/TLS handshake overhead | Nonzero and spiking when keepalive should reuse connections |
| $upstream_header_time | Shows time to first byte from upstream | Approaching proxy_read_timeout while body transfer time remains low |
| 504 rate | Direct client impact of upstream timeouts | Any sustained nonzero rate |
| 499 rate | Clients abandoning before nginx times out | Correlates with rising upstream latency; often precedes 504 spikes |
| Active connections in Writing state | Connections held waiting for upstream | Writing > 50% of active connections with low request throughput |
| $upstream_addr | Identifies which specific backend is failing | Repeated same-backend failures before a retry succeeds |
| proxy_next_upstream_timeout vs request duration | Total retry budget exhaustion | Requests hanging longer than the primary timeout because retries are unbounded |
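
The 80% warning threshold for $upstream_response_time can be checked from the shell without a metrics pipeline. This is a rough sketch, not a substitute for proper percentile monitoring: p95_vs_timeout is a hypothetical helper, it uses the nearest-rank method, and it assumes each input line ends with the $upstream_response_time value (so the final attempt is the one measured).

```shell
# Hypothetical helper: compare P95 upstream latency with 80% of the read timeout.
# stdin: lines whose last field is $upstream_response_time.
# $1: proxy_read_timeout in seconds (defaults to 60).
p95_vs_timeout() {
  awk '{ print $NF }' \
    | grep -E '^[0-9]+(\.[0-9]+)?$' \
    | sort -n \
    | awk -v limit="${1:-60}" '
        { v[NR] = $1 }
        END {
          if (NR == 0) { print "no samples"; exit }
          idx = int((NR * 95 + 99) / 100)   # nearest rank: ceil(0.95 * NR)
          verdict = (v[idx] > 0.8 * limit) ? "WARN" : "ok"
          printf "p95=%.3f threshold=%.3f %s\n", v[idx], 0.8 * limit, verdict
        }'
}
```

For instance, `tail -10000 /var/log/nginx/access.log | p95_vs_timeout 60` works if $upstream_response_time is the last field of your log_format; otherwise extract that field first.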

Fixes

Backend is slow or overloaded

Fix the backend. There is no nginx tuning that makes a slow database query fast.

If you need immediate relief, reduce proxy_read_timeout so nginx fails faster and frees connection slots. The tradeoff is more 504s for legitimate long requests. You can also temporarily shrink proxy_next_upstream_tries or remove the timeout keyword from proxy_next_upstream to stop retrying slow backends and avoid amplifying load. If caching is enabled, set proxy_cache_use_stale updating error timeout so nginx serves stale content while the backend recovers.

Network or connect-phase failures

Fix the network path or firewall rule. If the upstream is cross-region and legitimately needs more handshake time, increase proxy_connect_timeout to the minimum necessary value. The nginx proxy module documentation notes that this timeout usually cannot exceed 75 seconds, so larger values are unlikely to help. If you are proxying to hostnames resolved per-request, ensure the resolver directive points to a reliable server and add a valid= cache TTL to avoid repeated DNS lookups.

Retry behavior causing cascades

Set proxy_next_upstream_timeout to a finite total budget. The default of 0 means retries can accumulate unlimited wall-clock time. This is a total limit across all attempts, not per-attempt.
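
To confirm whether a total budget is actually in place, you can scan the rendered configuration. The sketch below is a deliberately simple, hypothetical check (retry_budget_status is not an nginx command): it reads configuration text on stdin and does not model inheritance between http, server, and location blocks, so treat "bounded" as a hint rather than proof.

```shell
# Hypothetical helper: report whether proxy_next_upstream_timeout is bounded.
# stdin: nginx configuration text, for example the output of `nginx -T`.
# Prints "unbounded" if the directive is absent or set to 0 anywhere.
retry_budget_status() {
  awk '
    $1 == "proxy_next_upstream_timeout" {
      seen = 1
      value = $2
      sub(/;.*/, "", value)              # drop the trailing semicolon
      if (value == "0" || value == "0s") unbounded = 1
    }
    END {
      if (!seen || unbounded) print "unbounded"
      else print "bounded"
    }'
}
```

Typical use is `nginx -T 2>/dev/null | retry_budget_status`. Commented-out directives are ignored because their first field is not the directive name.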

Avoid adding http_500 to proxy_next_upstream unless you understand the interaction with max_fails. The default max_fails is 1 and fail_timeout is 10s. Adding http_500 means a request that returns HTTP 500 can be retried across multiple upstreams; each server that returns 500 is marked failed for 10 seconds. If the underlying bug affects all peers, this can empty the upstream pool.

Do not add non_idempotent to proxy_next_upstream unless the upstream safely handles duplicate non-idempotent requests. Without it, nginx does not pass POST, LOCK, or PATCH requests to the next server once the request has been sent to an upstream. Enabling it risks duplicate side effects if the first attempt partially succeeded.
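
A quick audit for these two keywords might look like the following. It is a sketch under the same assumptions as any text scan of the configuration: risky_retry_flags is a hypothetical helper, and it only reports that a keyword appears somewhere, not which location it applies to.

```shell
# Hypothetical helper: list risky keywords present in proxy_next_upstream.
# stdin: nginx configuration text, for example the output of `nginx -T`.
risky_retry_flags() {
  awk '
    $1 == "proxy_next_upstream" {
      for (i = 2; i <= NF; i++) {
        flag = $i
        sub(/;.*/, "", flag)             # drop the trailing semicolon
        if (flag == "non_idempotent" || flag == "http_500") print flag
      }
    }' | sort -u
}
```

Empty output from `nginx -T 2>/dev/null | risky_retry_flags` means neither keyword is configured.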

Capacity and resource limits on nginx

If nginx itself is saturated, timeouts can be a secondary effect. Increase worker_connections and worker_rlimit_nofile so that workers can hold both client and upstream sockets. Each proxied request uses at least two connection slots. If idle keepalive connections are tying up the upstream's connection slots, reduce keepalive_timeout for upstream connections.
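
As a back-of-the-envelope check on the two-slots-per-request point, the arithmetic can be sketched as follows. max_proxied_requests is a hypothetical helper, and the result is an upper bound only: it ignores idle keepalive connections and any file descriptor limit that is lower than worker_connections.

```shell
# Hypothetical helper: rough ceiling on concurrently proxied requests.
# $1: worker_processes, $2: worker_connections.
# Assumes each proxied request holds two slots (client side and upstream side).
max_proxied_requests() {
  echo $(( $1 * $2 / 2 ))
}
```

With 4 workers and 1024 connections each, `max_proxied_requests 4 1024` gives 2048, so active connections approaching that figure in stub_status indicate nginx-side saturation.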

Prevention

  • Set proxy_read_timeout based on application SLA, not the default 60s. Endpoints with predictable fast responses should have tight timeouts; genuinely long-polling endpoints should have location-specific overrides.
  • Log $upstream_response_time, $upstream_connect_time, and $upstream_header_time in your access log. Without them, you cannot distinguish connect latency from processing latency.
  • Monitor the ratio of P95 $upstream_response_time to proxy_read_timeout. When it crosses 80%, mass timeouts are likely imminent.
  • Verify upstream keepalive reuse. Without connection reuse, every request pays a TCP handshake tax and contributes to ephemeral port exhaustion.
  • Set explicit proxy_next_upstream_timeout and proxy_next_upstream_tries to bound retry cost.
  • Do not rely solely on nginx passive health checking for critical paths. Open-source nginx discovers unhealthy upstreams by sending real user traffic to them. Use external health checks or nginx Plus active checks if you need probe-based detection.
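
For the logging recommendation above, one possible starting point is a dedicated log_format. The sketch below writes one to a file of your choosing. Everything about it is an assumption to adapt: the helper name write_timing_log_format, the format name upstream_timing, the field order, and the expectation that the target file is included inside the http context. $upstream_response_time is kept as the last field so that it is easy to extract.

```shell
# Hypothetical helper: write a log_format that records upstream timing fields.
# $1: destination file, assumed to be included inside the http {} context.
write_timing_log_format() {
  cat > "$1" <<'EOF'
log_format upstream_timing '$remote_addr [$time_local] "$request" $status '
                           '$request_time $upstream_addr $upstream_connect_time '
                           '$upstream_header_time $upstream_response_time';
EOF
}
```

After writing the file (for example to a conf.d directory your nginx.conf includes), validate with nginx -t and reference the format from an access_log directive; defining the format alone does not change what is logged.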

How Netdata helps

  • Correlate 504 and 499 spikes with upstream response time percentiles to confirm backend degradation versus client impatience.
  • Monitor active connections and the Writing state ratio to detect upstream slowdown before timeouts fire.
  • Track file descriptor utilization and connection slot saturation to rule out nginx-side resource exhaustion.
  • Alert on kernel-level TcpExtListenOverflows and listen backlog depth to catch silent connection drops that never appear in nginx logs.
  • Parse access-log timing variables to flag P95 upstream latency approaching configured timeout thresholds.