nginx 504 Gateway Time-out: causes and fixes
A 504 Gateway Time-out means nginx gave up waiting on an upstream. In the usual case nginx reached the upstream, but the upstream did not send its response before proxy_read_timeout expired (the default is 60 seconds); a connection attempt that exceeds proxy_connect_timeout is also reported as a 504. A 502 Bad Gateway, by contrast, means the upstream refused the connection, closed it prematurely, or returned an invalid response.
This guide covers isolating slow upstreams via access log variables, tuning timeouts and retries, and distinguishing 504 from 502.
What this means
When nginx proxies a request, it opens a TCP connection to an upstream server, sends the request, and waits for the response. A 504 occurs when that wait exceeds the timeout configured for the proxy operation. The most common trigger is proxy_read_timeout, which governs how long nginx waits between successive read operations from the upstream.
The error log signature is:
upstream timed out (110: Connection timed out) while reading response header from upstream
By default, proxy_next_upstream includes timeout, so nginx may try the next server in the upstream block when a timeout occurs. Because retries are serial, the total client-visible delay is the sum of each attempt, not a single timeout interval.
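To make the serial-retry arithmetic concrete, here is a minimal sketch that adds up the per-attempt times recorded in a single $upstream_response_time value. The function name sum_upstream_attempts is hypothetical, and the sketch assumes the value has already been extracted from the log line.

```shell
# Sum the per-attempt times in one $upstream_response_time value.
# nginx separates attempts with commas (and colons after internal redirects).
# An attempt logged as "-" has no timing and is counted as 0 here.
sum_upstream_attempts() {
    printf '%s\n' "$1" | awk '{
        n = split($0, parts, /[,:]/)
        total = 0
        for (i = 1; i <= n; i++) {
            gsub(/ /, "", parts[i])
            total += parts[i]    # non-numeric text evaluates to 0 in awk
        }
        printf "%.3f\n", total
    }'
}

# Three backends that each hit a 60-second read timeout:
sum_upstream_attempts "60.001, 60.002, 60.001"   # prints 180.004
```

The client in this example waits roughly three minutes, even though no single attempt ran longer than the configured timeout.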
flowchart LR
A[Client request] --> B{nginx connects to upstream?}
B -->|Connection refused or invalid response| C[502 Bad Gateway]
B -->|Connected successfully| D{Response completed within proxy_read_timeout?}
D -->|No| E[504 Gateway Time-out]
D -->|Yes| F[Deliver response to client]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow upstream application | $upstream_response_time clusters near proxy_read_timeout; error log shows reading-response-header timeouts; Writing connections dominate stub_status | Backend application logs, database lock metrics, and GC pause data |
| proxy_read_timeout too aggressive | 504s appear only on specific heavy endpoints such as report generation or bulk exports; $upstream_header_time is just under the timeout value | Access log for $upstream_response_time percentiles on the affected endpoints |
| Upstream keepalive pool exhausted | $upstream_connect_time is nonzero on requests that previously showed 0.000; connect time spikes even though the backend is healthy | Upstream block for keepalive directive and whether the backend sends Connection: close |
| Retry cascade | Comma-separated values in $upstream_response_time; error log shows multiple upstream attempts for one client request | proxy_next_upstream_tries and proxy_next_upstream_timeout settings |
| Timeout mismatch in proxy chain | 504s returned at exactly 60 seconds despite a higher nginx timeout; an L7 load balancer sits between the client and nginx | Downstream load balancer idle timeout relative to proxy_read_timeout |
| Upstream saturation or queueing | Gradual climb in $upstream_response_time across all endpoints before 504s spike; backend CPU or memory pegged | Backend resource utilization and request queue depth |
Quick checks
Run these read-only checks before making configuration changes.
# Check error log for upstream timeout messages
grep "upstream timed out (110: Connection timed out)" /var/log/nginx/error.log | tail -20
# Check 504 rate in access log
grep '" 504 ' /var/log/nginx/access.log | wc -l
# Check upstream response times for recent 504 responses
# Assumes access log includes $upstream_response_time as urt=...
grep ' 504 ' /var/log/nginx/access.log | grep -oP 'urt=\K[0-9.]+' | tail -20
# Check connection states: high Writing indicates upstream wait
# Requires stub_status module and /nginx_status location
curl -s http://127.0.0.1/nginx_status | tail -1
# Verify current proxy timeout values in effective configuration
nginx -T 2>/dev/null | grep -E 'proxy_(read|connect|send)_timeout'
# Check TCP connectivity to upstream backends directly
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
    timeout 5 bash -c "echo > /dev/tcp/${backend%:*}/${backend#*:}" 2>/dev/null && \
        echo "$backend: TCP OK" || echo "$backend: TCP FAIL"
done
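The urt= extraction in the checks above yields raw values; a percentile is usually easier to compare against proxy_read_timeout. The following is a rough sketch using the nearest-rank method. The helper names percentile and near_timeout are hypothetical, and the 80% threshold is the one this guide uses for alerting.

```shell
# Print the nearest-rank percentile of numbers read from stdin, one per line.
# Usage: percentile 95 < values.txt
percentile() {
    LC_ALL=C sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)    # ceil(p * NR / 100) for integers
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# Exit 0 when VALUE has reached 80% of TIMEOUT (both in seconds).
# Usage: near_timeout VALUE TIMEOUT
near_timeout() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v * 10 >= t * 8) }'
}

# Example, assuming the access log carries urt=... as in the checks above:
#   p95=$(grep -oP 'urt=\K[0-9.]+' /var/log/nginx/access.log | percentile 95)
#   near_timeout "$p95" 60 && echo "P95 of ${p95}s is close to the 60s timeout"
```

Nearest-rank was chosen because it needs no interpolation and always returns a value that actually occurred in the log.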
How to diagnose it
1. Confirm nginx is generating 504s, not 502s. The error log distinguishes them. "upstream timed out (110: Connection timed out) while reading response header from upstream" is a 504. "connect() failed (111: Connection refused)" or "upstream prematurely closed connection" is a 502. Do not tune proxy timeouts for a 502.
2. Identify which upstream servers are slow. Parse $upstream_addr and $upstream_response_time from access logs. If one backend address dominates the 504 rows, that server is the problem. If all backends are equally represented, the issue is systemic: database pressure, a shared dependency, or a global timeout misconfiguration.
3. Determine whether the backend is genuinely slow or the timeout is too tight. Compare $upstream_response_time to proxy_read_timeout. If the P95 upstream time is consistently within a few seconds of the timeout, the backend needs attention, not the timeout. If $upstream_response_time is well below the timeout but 504s still occur, suspect an intermediate proxy or a mismatched timeout elsewhere in the stack.
4. Check for retry amplification. When nginx retries on timeout, $upstream_response_time contains comma-separated values for each attempt (for example, 2.001, 2.003 means two attempts each taking about two seconds). If the sum approaches the client-facing timeout, retries are eating the budget. Review proxy_next_upstream_tries to see if retries are unbounded.
5. Examine stub_status connection states. A high Writing count with a normal or declining request rate means connections are stuck waiting for upstreams. If Writing is high and Reading is also elevated, clients may be slow to send their requests, which is client_header_timeout and client_body_timeout territory rather than a proxy timeout. If Writing is high and Reading is low, the upstream is the bottleneck.
6. Validate keepalive behavior. Check $upstream_connect_time. Values of 0.000 indicate keepalive pool reuse. If connect time is consistently nonzero where it used to be zero, nginx is opening new TCP connections for every request. This adds handshake latency and upstream load, which can push borderline requests over the timeout threshold.
7. Review backend application metrics. Database lock waits, thread pool exhaustion, garbage collection pauses, and queue saturation usually appear in application metrics before nginx logs the 504. Correlate the nginx 504 spike with backend telemetry.
8. Inspect the full proxy chain. If an L7 load balancer or ingress controller sits between the client and nginx, ensure its idle timeout is equal to or greater than proxy_read_timeout. If the LB times out first, the client sees a 504 generated by the LB while nginx is still waiting for the upstream.
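The per-backend comparison described in the diagnosis steps can be sketched with awk. This assumes a hypothetical log format that writes $upstream_addr as ua="..." (quoted, because retried requests log several comma-separated addresses) and, like the quick checks, it identifies 504 rows by the substring " 504 ". The function name count_504_by_upstream is illustrative only.

```shell
# Count 504 responses per upstream address from access log lines on stdin.
# Assumes a hypothetical log_format field: ua="$upstream_addr"
# Addresses separated by " : " (internal redirects) are not split here.
count_504_by_upstream() {
    awk '
        / 504 / && match($0, /ua="[^"]*"/) {
            # Strip the leading ua=" (4 chars) and the trailing quote.
            addrs = substr($0, RSTART + 4, RLENGTH - 5)
            n = split(addrs, a, /, */)
            for (i = 1; i <= n; i++) count[a[i]]++
        }
        END { for (addr in count) print count[addr], addr }
    ' | LC_ALL=C sort -k1,1nr -k2,2
}

# Usage: count_504_by_upstream < /var/log/nginx/access.log
```

A retried request counts once for every backend it touched, so one address far ahead of the others points at a single bad server, while an even spread points at something shared.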
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| $upstream_response_time | Isolates backend speed from client network variability | P95 approaching 80% of proxy_read_timeout |
| $upstream_header_time | Separates backend processing time from response body transfer | Header time dominates total upstream time |
| $upstream_connect_time | Reveals whether keepalive reuse is working or TCP handshakes are adding latency | Nonzero values where keepalive reuse was previously near zero |
| Writing connections | Writing includes upstream wait during proxying | Writing > 50% of active connections with low request rate |
| 504 response rate | Direct measure of timeout impact on clients | Any sustained nonzero rate in production |
| 499 client abandons | Clients give up before nginx officially times out | 499 rate rising in tandem with upstream latency |
| Active connections | Connection pile-up reduces headroom for new requests | Approaching worker_connections * worker_processes |
| Accepts vs handled gap | Detects silent connection drops before timeout symptoms appear | Gap growing under load |
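Two of the signals in the table, the Writing share and the accepts vs handled gap, can be derived from a single stub_status response. The sketch below assumes the standard four-line stub_status output; the function name stub_status_summary and the output field names are hypothetical.

```shell
# Summarize stub_status output read from stdin:
#   writing_pct = Writing as a percentage of active connections
#   dropped     = accepts minus handled
stub_status_summary() {
    awk '
        /^Active connections:/       { active = $3 }
        /^ *[0-9]+ +[0-9]+ +[0-9]+/  { accepts = $1; handled = $2 }
        /^Reading:/                  { writing = $4 }
        END {
            pct = (active > 0) ? 100 * writing / active : 0
            printf "writing_pct=%.1f dropped=%d\n", pct, accepts - handled
        }'
}

# Usage, assuming the /nginx_status location from the quick checks:
#   curl -s http://127.0.0.1/nginx_status | stub_status_summary
```

Both numbers are point-in-time samples, so the trend across repeated samples says more than any single reading.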
Fixes
Increase proxy_read_timeout for legitimate slow endpoints
If the upstream is healthy but inherently slow (large report generation, heavy analytics queries), raise proxy_read_timeout on the specific location block rather than globally.
location /api/reports {
    proxy_read_timeout 120s;
}
Tradeoff: Longer timeouts hold connections and memory longer. In reverse proxy mode, each active request consumes two connection slots. A higher timeout increases the blast radius of a slow upstream because slots remain occupied longer.
Reduce upstream load
Scale the backend horizontally or vertically. If one backend in a pool is degraded, remove it temporarily and reload nginx. Enable the keepalive directive in the upstream block to avoid TCP handshake overhead on every request; for HTTP upstreams, the nginx documentation pairs it with proxy_http_version 1.1 and proxy_set_header Connection "" in the proxying location. Ensure the backend does not send Connection: close, which defeats keepalive.
Tradeoff: Keepalive connections hold file descriptors and memory. Size worker_connections and worker_rlimit_nofile to accommodate both client and upstream keepalive pools.
Tune retry behavior
Set explicit bounds on retries to prevent a single slow request from serially timing out across every backend.
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 10s;
Tradeoff: Limiting retries improves tail latency but may increase the error rate for requests that would have succeeded on the third attempt. For read-heavy, idempotent traffic, retries are safer. For POST or mutating requests, retries can duplicate side effects unless the application is designed for it.
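To see how the two limits interact, the following sketch models the retry decision in a simplified way. It assumes that after each failed attempt nginx moves to the next server only if tries remain and the time elapsed so far is below proxy_next_upstream_timeout, with 0 disabling either limit. The function simulate_retries is hypothetical and the model ignores connect time, backoff, and which errors are eligible for retry.

```shell
# Simplified model of serial retries, not a reimplementation of nginx.
# Usage: simulate_retries TRIES NEXT_UPSTREAM_TIMEOUT ATTEMPT_SECONDS...
# Each ATTEMPT_SECONDS is how long one failing attempt would take.
simulate_retries() {
    max_tries=$1
    time_limit=$2
    shift 2
    printf '%s\n' "$@" | awk -v tries="$max_tries" -v limit="$time_limit" '
        stopped { next }
        {
            attempts++
            elapsed += $1
            if ((tries > 0 && attempts >= tries) || (limit > 0 && elapsed >= limit)) stopped = 1
        }
        END { printf "attempts=%d elapsed=%.0f\n", attempts, elapsed }'
}

simulate_retries 0 0 60 60 60    # no limits: attempts=3 elapsed=180
simulate_retries 2 0 60 60 60    # tries capped at 2: attempts=2 elapsed=120
simulate_retries 2 10 60 60 60   # 10s window: attempts=1 elapsed=60
```

Under this model, a 10-second retry window combined with a 60-second read timeout means a request that times out while reading is not retried at all, because the window has already closed when the first attempt fails. Whether that is desirable depends on how the read timeout and the retry window are meant to relate.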
Fix keepalive pool exhaustion
Add or increase the keepalive count in the upstream block:
upstream backend {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    keepalive 64;
}
Tradeoff: A pool that is too small wastes handshake time. A pool that is too large consumes FDs and memory without additional benefit. Monitor $upstream_connect_time to find the sweet spot.
Scope timeouts by protocol and endpoint
If your stack uses FastCGI, uwsgi, or SCGI instead of HTTP proxying, adjust the corresponding protocol-specific directives (fastcgi_read_timeout, uwsgi_read_timeout, scgi_read_timeout). Raising proxy_read_timeout has no effect on FastCGI backends.
WebSocket or gRPC endpoints often need longer read timeouts than REST APIs; for gRPC served through grpc_pass, the relevant directive is grpc_read_timeout. Use location-specific overrides rather than global defaults.
Prevention
- Include $upstream_response_time, $upstream_header_time, $upstream_connect_time, and $upstream_status in your access log format. Without them, you cannot distinguish backend slowness from client slowness or connection issues.
- Alert on P95 $upstream_response_time exceeding 80% of proxy_read_timeout. This gives you runway before timeouts begin.
- Monitor the ratio of Writing connections to total active connections. Sustained dominance with flat request rate predicts a 504 spike.
- Size worker_connections to account for the proxy multiplier. Each proxied request uses at least two connections. The effective proxy capacity is at most half of worker_connections * worker_processes.
- Configure explicit values for proxy_next_upstream_tries and proxy_next_upstream_timeout so retry storms cannot amplify a backend slowdown into a user-visible outage.
- Test timeout changes under realistic slow-backend conditions in staging before applying to production.
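The sizing rule in the list above is simple arithmetic, sketched here with a hypothetical helper named proxy_capacity. The values 1024 and 4 are illustrative, not recommendations.

```shell
# Upper bound on concurrently proxied requests: each proxied request
# holds at least two connections (one client side, one upstream side).
# Usage: proxy_capacity WORKER_CONNECTIONS WORKER_PROCESSES
proxy_capacity() {
    echo $(( $1 * $2 / 2 ))
}

proxy_capacity 1024 4   # prints 2048
```

The real ceiling is lower once idle keepalive connections and other listeners are counted, so treat the result as a bound rather than a target.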
How Netdata helps
- Correlate 504 spikes with upstream response time percentiles parsed from access logs.
- Alert on Writing connection dominance and active connection saturation before 504s cascade into client-visible errors.
- Track per-upstream latency by parsing $upstream_addr and $upstream_response_time without manual log crunching.
- Visualize the accepts vs handled gap to confirm connection pressure during timeout events.
- Surface rising 499 client abandon rates alongside upstream latency to detect user-visible slowness before official timeouts fire.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 502 Bad Gateway: causes and how to fix it
- NGINX active connections climbing: reading, writing, waiting explained
- NGINX connection exhaustion: detection, diagnosis, and prevention
- NGINX dropped connections: the accepts vs handled gap
- NGINX monitoring checklist: the signals every production server needs
- NGINX monitoring maturity model: from survival to expert
- NGINX slowloris and slow-client attacks: detection and mitigation
- nginx: too many open files - diagnosing file descriptor exhaustion
- nginx: worker_connections are not enough - causes and fixes
- NGINX worker_connections and worker_processes: sizing for real traffic
- NGINX worker_rlimit_nofile: setting file descriptor limits correctly







