nginx 504 Gateway Time-out: causes and fixes

A 504 Gateway Time-out means nginx reached the upstream but gave up waiting for its response, most often because no data arrived within proxy_read_timeout. The default is 60 seconds. Unlike a 502 Bad Gateway, which means the upstream connection failed or returned an invalid response, a 504 means the upstream was reachable but did not answer in time.

This guide covers isolating slow upstreams via access log variables, tuning timeouts and retries, and distinguishing 504 from 502.

What this means

When nginx proxies a request, it opens a TCP connection to an upstream server, sends the request, and waits for the response. A 504 occurs when that wait exceeds the timeout configured for the proxy operation. The most common trigger is proxy_read_timeout, which governs how long nginx waits between successive read operations from the upstream.
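
As a reference point, the three proxy timeouts involved in that sequence can be written out explicitly. This is a minimal sketch: the values shown are the nginx defaults, and the upstream name backend is a placeholder.

```nginx
location / {
    proxy_pass http://backend;    # "backend" is a placeholder upstream name

    proxy_connect_timeout 60s;    # time allowed to establish the upstream connection
    proxy_send_timeout    60s;    # gap allowed between two successive writes of the request
    proxy_read_timeout    60s;    # gap allowed between two successive reads of the response
}
```

Both the read and send timeouts apply between successive operations, not to the whole transfer, so an upstream that keeps sending data does not trip them.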

The error log signature is:

upstream timed out (110: Connection timed out) while reading response header from upstream

By default, proxy_next_upstream includes timeout, so nginx may try the next server in the upstream block when a timeout occurs. Because retries are serial, the total client-visible delay is the sum of each attempt, not a single timeout interval.
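
To make the serial-retry arithmetic concrete, here is a rough upper-bound estimate. It is a sketch under a simplifying assumption: every attempt fails by waiting the full proxy_read_timeout for the response header, and connect and send time are ignored. The function name worst_case_delay is hypothetical.

```shell
# Rough upper bound on client-visible delay when every attempt times out.
# Assumption: each attempt waits the full proxy_read_timeout for the
# response header; connect and send time are ignored.
worst_case_delay() {
    tries=$1          # number of upstream attempts nginx will make
    read_timeout=$2   # proxy_read_timeout in seconds
    echo $(( tries * read_timeout ))
}

# Three backends tried in sequence with the default 60-second read timeout
worst_case_delay 3 60
```

With three attempts at 60 seconds each, the client can wait about 180 seconds before seeing the 504.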

flowchart LR
    A[Client request] --> B{nginx connects to upstream?}
    B -->|Connection refused or invalid response| C[502 Bad Gateway]
    B -->|Connected successfully| D{Response completed within proxy_read_timeout?}
    D -->|No| E[504 Gateway Time-out]
    D -->|Yes| F[Deliver response to client]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow upstream application | $upstream_response_time clusters near proxy_read_timeout; error log shows reading-response-header timeouts; Writing connections dominate stub_status | Backend application logs, database lock metrics, and GC pause data |
| proxy_read_timeout too aggressive | 504s appear only on specific heavy endpoints such as report generation or bulk exports; $upstream_header_time is just under the timeout value | Access log for $upstream_response_time percentiles on the affected endpoints |
| Upstream keepalive pool exhausted | $upstream_connect_time is nonzero on requests that previously showed 0.000; connect time spikes even though the backend is healthy | Upstream block for keepalive directive and whether the backend sends Connection: close |
| Retry cascade | Comma-separated values in $upstream_response_time; error log shows multiple upstream attempts for one client request | proxy_next_upstream_tries and proxy_next_upstream_timeout settings |
| Timeout mismatch in proxy chain | 504s returned at exactly 60 seconds despite a higher nginx timeout; an L7 load balancer sits between the client and nginx | Downstream load balancer idle timeout relative to proxy_read_timeout |
| Upstream saturation or queueing | Gradual climb in $upstream_response_time across all endpoints before 504s spike; backend CPU or memory pegged | Backend resource utilization and request queue depth |

Quick checks

Run these read-only checks before making configuration changes.

# Check error log for upstream timeout messages
grep "upstream timed out (110: Connection timed out)" /var/log/nginx/error.log | tail -20
# Count 504 responses in the access log (assumes the status follows the quoted request line)
grep -c '" 504 ' /var/log/nginx/access.log
# Check upstream response times for recent 504 responses
# Assumes access log includes $upstream_response_time as urt=...
grep '" 504 ' /var/log/nginx/access.log | grep -oP 'urt=\K[0-9.]+' | tail -20
# Check connection states: high Writing indicates upstream wait
# Requires stub_status module and /nginx_status location
curl -s http://127.0.0.1/nginx_status | tail -1
# Verify current proxy timeout values in effective configuration
nginx -T 2>/dev/null | grep -E 'proxy_(read|connect|send)_timeout'
# Check TCP connectivity to upstream backends directly
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  timeout 5 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && \
    echo "$backend: TCP OK" || echo "$backend: TCP FAIL"
done
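
The urt= check above only works if the access log actually records upstream timing. One possible format is sketched below; the format name upstream_timing and the key names (us=, ua=, uct=, uht=, urt=) are illustrative choices, not nginx defaults.

```nginx
# Goes in the http context. Format name and key names are hypothetical.
log_format upstream_timing '$remote_addr - $remote_user [$time_local] '
                           '"$request" $status $body_bytes_sent '
                           '"$http_referer" "$http_user_agent" '
                           'rt=$request_time us=$upstream_status '
                           'ua=$upstream_addr uct=$upstream_connect_time '
                           'uht=$upstream_header_time urt=$upstream_response_time';

access_log /var/log/nginx/access.log upstream_timing;
```

Keeping the combined-format fields first means status-based greps such as '" 504 ' continue to match.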

How to diagnose it

  1. Confirm nginx is generating 504s, not 502s. The error log distinguishes them. upstream timed out (110: Connection timed out) while reading response header from upstream is a 504. connect() failed (111: Connection refused) or upstream prematurely closed connection is a 502. Do not tune proxy timeouts for a 502.

  2. Identify which upstream servers are slow. Parse $upstream_addr and $upstream_response_time from access logs. If one backend address dominates the 504 rows, that server is the problem. If all backends are equally represented, the issue is systemic: database pressure, a shared dependency, or a global timeout misconfiguration.

  3. Determine whether the backend is genuinely slow or the timeout is too tight. Compare $upstream_response_time to proxy_read_timeout. If the P95 upstream time is consistently within a few seconds of the timeout, the backend needs attention, not the timeout. If $upstream_response_time is well below the timeout but 504s still occur, suspect an intermediate proxy or a mismatched timeout elsewhere in the stack.

  4. Check for retry amplification. When nginx retries on timeout, $upstream_response_time contains comma-separated values for each attempt (for example, 2.001, 2.003 means two attempts each taking about two seconds). If the sum approaches the client-facing timeout, retries are eating the budget. Review proxy_next_upstream_tries to see if retries are unbounded; the default value of 0 disables the limit.

  5. Examine stub_status connection states. A high Writing count with a normal or declining request rate means connections are stuck waiting for upstreams. If Writing is high and Reading is also elevated, clients may be slow to send their request headers (client_header_timeout territory, not a proxy timeout). If Writing is high and Reading is low, the upstream is the bottleneck.

  6. Validate keepalive behavior. Check $upstream_connect_time. Values of 0.000 indicate keepalive pool reuse. If connect time is consistently nonzero where it used to be zero, nginx is opening new TCP connections for every request. This adds handshake latency and upstream load, which can push borderline requests over the timeout threshold.

  7. Review backend application metrics. Database lock waits, thread pool exhaustion, garbage collection pauses, and queue saturation usually appear in application metrics before nginx logs the 504. Correlate the nginx 504 spike with backend telemetry.

  8. Inspect the full proxy chain. If an L7 load balancer or ingress controller sits between the client and nginx, ensure its idle timeout is equal to or greater than proxy_read_timeout. If the LB times out first, the client sees a 504 generated by the LB while nginx is still waiting for the upstream.
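
Steps 2 and 4 can be sketched as a single pass over the access log. This is an illustration under stated assumptions: the log line carries the status right after the quoted request, and $upstream_addr is logged unquoted as ua=... followed by another space-separated key. The function name summarize_504 and the sample lines are made up.

```shell
# summarize_504 reads access log lines on stdin and prints, for 504 rows,
# a count per upstream address (last attempt) and how many rows were retried.
summarize_504() {
    awk '
        /" 504 / {
            # Extract the (possibly comma-separated) upstream address list
            if (match($0, /ua=[^, ]+(, [^, ]+)*/)) {
                addrs = substr($0, RSTART + 3, RLENGTH - 3)
                n = split(addrs, parts, ", ")
                last[parts[n]]++          # the attempt that produced the 504
                if (n > 1) retried++      # more than one attempt was made
            }
        }
        END {
            for (a in last) printf "%s %d\n", a, last[a]
            printf "retried %d\n", retried + 0
        }
    ' | sort
}

# Made-up sample rows: one plain 504, one retried 504, one 200
summarize_504 <<'EOF'
10.0.0.5 - - [01/Jan/2025:10:00:00 +0000] "GET /api/reports HTTP/1.1" 504 160 "-" "curl/8.0" ua=10.0.1.10:8080 urt=60.001
10.0.0.6 - - [01/Jan/2025:10:02:00 +0000] "GET /api/reports HTTP/1.1" 504 160 "-" "curl/8.0" ua=10.0.1.10:8080, 10.0.1.11:8080 urt=60.001, 60.002
10.0.0.7 - - [01/Jan/2025:10:03:00 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/8.0" ua=10.0.1.11:8080 urt=0.120
EOF
```

If one address dominates the counts, look at that server first; a high retried count points at retry amplification. Against a real log, pipe the file into the function instead of the sample rows.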

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| $upstream_response_time | Isolates backend speed from client network variability | P95 approaching 80% of proxy_read_timeout |
| $upstream_header_time | Separates backend processing time from response body transfer | Header time dominates total upstream time |
| $upstream_connect_time | Reveals whether keepalive reuse is working or TCP handshakes are adding latency | Nonzero values where keepalive reuse was previously near zero |
| Writing connections | Writing includes upstream wait during proxying | Writing > 50% of active connections with low request rate |
| 504 response rate | Direct measure of timeout impact on clients | Any sustained nonzero rate in production |
| 499 client abandons | Clients give up before nginx officially times out | 499 rate rising in tandem with upstream latency |
| Active connections | Connection pile-up reduces headroom for new requests | Approaching worker_connections * worker_processes |
| Accepts vs handled gap | Detects silent connection drops before timeout symptoms appear | Gap growing under load |
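
Two of these signals, the Writing share and the accepts vs handled gap, can be derived from stub_status output directly. The following is a minimal sketch that assumes the standard four-line stub_status text format; the function name stub_status_summary is hypothetical.

```shell
# stub_status_summary reads stub_status output on stdin and prints the
# Writing share of active connections plus the accepts/handled gap.
stub_status_summary() {
    awk '
        /^Active connections:/ { active = $3 }
        # The counters line is the only three-field line starting with a number
        NF == 3 && $1 ~ /^[0-9]+$/ { accepts = $1; handled = $2 }
        /^Reading:/ { writing = $4 }
        END {
            pct = (active > 0) ? 100 * writing / active : 0
            printf "writing_pct=%.1f dropped=%d\n", pct, accepts - handled
        }
    '
}

# Sample output in the stub_status text format
stub_status_summary <<'EOF'
Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
EOF
```

Against a live server, the input would come from the status endpoint, for example curl -s http://127.0.0.1/nginx_status | stub_status_summary.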

Fixes

Increase proxy_read_timeout for legitimate slow endpoints

If the upstream is healthy but inherently slow (large report generation, heavy analytics queries), raise proxy_read_timeout on the specific location block rather than globally.

location /api/reports {
    proxy_pass http://backend;   # placeholder upstream name
    proxy_read_timeout 120s;     # applies to this location only
}

Tradeoff: Longer timeouts hold connections and memory longer. In reverse proxy mode, each active request consumes two connection slots. A higher timeout increases the blast radius of a slow upstream because slots remain occupied longer.

Reduce upstream load

Scale the backend horizontally or vertically. If one backend in a pool is degraded, remove it temporarily and reload nginx. Enable the keepalive directive in the upstream block to avoid TCP handshake overhead on every request; for HTTP upstreams this typically also requires proxy_http_version 1.1 and an empty Connection header in the proxied location. Ensure the backend does not send Connection: close, which defeats keepalive.

Tradeoff: Keepalive connections hold file descriptors and memory. Size worker_connections and worker_rlimit_nofile to accommodate both client and upstream keepalive pools.

Tune retry behavior

Set explicit bounds on retries to prevent a single slow request from serially timing out across every backend.

proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 10s;

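Tradeoff: Limiting retries improves tail latency but may increase the error rate for requests that would have succeeded on the third attempt. For read-heavy, idempotent traffic, retries are safer. By default nginx does not pass POST, LOCK, or PATCH requests to the next server once they have been sent upstream; enabling the non_idempotent parameter of proxy_next_upstream changes that and can duplicate side effects unless the application is designed for it.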

Fix keepalive pool exhaustion

Add or increase the keepalive count in the upstream block:

upstream backend {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    keepalive 64;
}

Tradeoff: A pool that is too small wastes handshake time. A pool that is too large consumes FDs and memory without additional benefit. Monitor $upstream_connect_time to find the sweet spot.
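
The upstream pool is only reused when the proxied location speaks HTTP/1.1 and does not forward a Connection: close header. A minimal sketch of the matching location block follows, assuming an nginx version where proxy_http_version defaults to 1.0 and reusing the upstream name backend from the block above:

```nginx
location / {
    proxy_pass http://backend;          # the upstream block with the keepalive directive
    proxy_http_version 1.1;             # upstream keepalive needs HTTP/1.1
    proxy_set_header Connection "";     # clear the header so "close" is not sent upstream
}
```

After reloading, $upstream_connect_time should return to 0.000 for most requests if reuse is working.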

Scope timeouts by protocol and endpoint

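If your stack uses FastCGI, uwsgi, or SCGI instead of HTTP proxying, adjust the corresponding protocol-specific timeout directives (fastcgi_read_timeout, uwsgi_read_timeout, scgi_read_timeout). Raising proxy_read_timeout has no effect on FastCGI backends.

WebSocket or gRPC endpoints often need longer read timeouts than REST APIs. WebSocket connections proxied with proxy_pass use proxy_read_timeout, while grpc_pass has its own grpc_read_timeout. Use location-specific overrides rather than global defaults.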
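
As an illustration of per-protocol, per-location scoping, the sketch below sets a read timeout for each kind of backend. All paths, socket locations, upstream names, and timeout values are placeholders chosen for the example.

```nginx
# FastCGI backend (for example PHP-FPM); the socket path is a placeholder
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_pass unix:/run/php/php-fpm.sock;
    fastcgi_read_timeout 120s;
}

# WebSocket endpoint proxied over HTTP; uses the proxy_* timeouts
location /ws/ {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;
}

# gRPC service; "backend_grpc" is a hypothetical upstream name
location /my.Service/ {
    grpc_pass grpc://backend_grpc;
    grpc_read_timeout 300s;
}
```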

Prevention

  • Include $upstream_response_time, $upstream_header_time, $upstream_connect_time, and $upstream_status in your access log format. Without them, you cannot distinguish backend slowness from client slowness or connection issues.
  • Alert on P95 $upstream_response_time exceeding 80% of proxy_read_timeout. This gives you runway before timeouts begin.
  • Monitor the ratio of Writing connections to total active connections. Sustained dominance with flat request rate predicts a 504 spike.
  • Size worker_connections to account for the proxy multiplier. Each proxied request uses at least two connections. The effective proxy capacity is at most half of worker_connections * worker_processes.
  • Configure explicit values for proxy_next_upstream_tries and proxy_next_upstream_timeout so retry storms cannot amplify a backend slowdown into a user-visible outage.
  • Test timeout changes under realistic slow-backend conditions in staging before applying to production.
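
The P95 alert rule from the list above can be prototyped in a few lines of shell before it is wired into a monitoring system. This sketch uses the nearest-rank percentile and takes one upstream time per line on stdin; the function name p95_check is hypothetical and the sample values are made up.

```shell
# p95_check READ_TIMEOUT_SECONDS
# Reads one upstream response time per line on stdin and warns when the
# P95 exceeds 80% of the given proxy_read_timeout.
p95_check() {
    sort -n | awk -v timeout="$1" '
        { v[NR] = $1 + 0 }
        END {
            if (NR == 0) { print "no samples"; exit }
            # Nearest-rank percentile: index = ceil(0.95 * NR), in integer math
            i = int((95 * NR + 99) / 100)
            status = (v[i] > 0.8 * timeout) ? "WARN" : "OK"
            printf "p95=%s threshold=%s %s\n", v[i], 0.8 * timeout, status
        }
    '
}

# Made-up sample values checked against the default 60-second timeout
printf '%s\n' 12.4 48.9 3.1 55.0 20.2 | p95_check 60

# Against a real log (assumes the urt= key used in the quick checks):
#   grep -oP 'urt=\K[0-9.]+' /var/log/nginx/access.log | p95_check 60
```

Note that the grep in the commented usage line captures only the first attempt of a retried request, so retried requests are under-counted.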

How Netdata helps

  • Correlate 504 spikes with upstream response time percentiles parsed from access logs.
  • Alert on Writing connection dominance and active connection saturation before 504s cascade into client-visible errors.
  • Track per-upstream latency by parsing $upstream_addr and $upstream_response_time without manual log crunching.
  • Visualize the accepts vs handled gap to confirm connection pressure during timeout events.
  • Surface rising 499 client abandon rates alongside upstream latency to detect user-visible slowness before official timeouts fire.