NGINX slow requests: from access log to root cause
Elevated $request_time in access logs does not mean the upstream is slow. The variable measures the full lifecycle: from reading the first client byte through sending the last response byte. That includes client upload, upstream wait, nginx processing, and client download. Blaming the backend by reflex is the most common nginx latency mistake.
To split the time accurately, confirm your log_format includes $request_time, $upstream_response_time, $upstream_connect_time, and $upstream_header_time.
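A minimal sketch of such a format, written out from the shell so it can be reviewed before it goes into the http context. The format name timing, the key=value labels (rt=, uct=, uht=, urt=), and the output path are assumptions chosen for illustration; nginx does not require any of them.

```shell
# Write a hypothetical log_format snippet to a file for review.
# The name "timing" and the key=value labels are illustrative assumptions.
write_timing_log_format() {
  out="${1:-./timing_log_format.conf}"
  cat > "$out" <<'EOF'
# Place inside the http {} context.
log_format timing '$remote_addr [$time_local] "$request" $status '
                  'len=$request_length sent=$body_bytes_sent '
                  'rt=$request_time uct=$upstream_connect_time '
                  'uht=$upstream_header_time urt=$upstream_response_time';
access_log /var/log/nginx/access.log timing;
EOF
  echo "$out"
}
```

Keeping urt= as the last field is deliberate: a multi-attempt value contains spaces, so placing it at the end of the line keeps the earlier fields easy to split on whitespace.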
What this means
$request_time is the wall-clock time nginx spends on a single request. For a proxied request:
$request_time ≈ client header/body read + upstream connection setup + upstream response wait + response body transfer to client
$upstream_response_time covers only the upstream segment: establishing the TCP connection through receiving the last byte of the response body. It excludes time spent reading the request from the client or sending the response to the client.
The delta between $request_time and $upstream_response_time is the diagnostic space. A small delta means the upstream is the bottleneck. A large delta points to the client, nginx processing, or disk I/O from temp-file spill.
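$upstream_response_time is logged as "-" when no upstream was contacted, such as internal redirects, cache hits, or internal errors. If nginx tries more than one upstream server, the variable holds one value per attempt, separated by commas (and by colons when an internal redirect moves the request to another server group).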
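The delta can be computed with a small helper once the multi-value case is handled. The sketch below sums every attempt in an $upstream_response_time value and subtracts the total from $request_time. The function names are hypothetical, and summing the attempts is an assumption about how you want to attribute retry time.

```shell
# Sum all attempts in an $upstream_response_time value.
# Attempts are separated by commas or colons; "-" counts as zero.
upstream_total() {
  printf '%s\n' "$1" | awk '{
    n = split($0, parts, /[,:]/)
    total = 0
    for (i = 1; i <= n; i++) {
      gsub(/[ \t]/, "", parts[i])
      if (parts[i] != "-" && parts[i] != "") total += parts[i]
    }
    printf "%.3f\n", total
  }'
}

# Delta = $request_time minus total upstream time, in seconds.
request_delta() {
  awk -v rt="$1" -v ut="$(upstream_total "$2")" \
    'BEGIN { printf "%.3f\n", rt - ut }'
}

# Example: request_delta 1.500 "0.200" prints 1.300
```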
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow upstream / backend | $upstream_response_time close to $request_time; Writing state high | $upstream_header_time vs $upstream_response_time; backend CPU and logs |
| Slow client or large upload | $request_time high, $upstream_response_time low; Reading state sustained | $request_length; per-IP connection concentration |
| Temp-file spill | $request_time much larger than $upstream_response_time; latency correlates with large payloads | Disk I/O on the temp-file partition; proxy_buffers and client_body_buffer_size |
| CPU saturation (SSL/gzip/regex) | $upstream_response_time low; per-worker CPU >80%; latency uniform across request sizes | Per-worker CPU; SSL session cache hit rate; gzip ratio |
| Connection exhaustion | Active connections near worker_connections × worker_processes; accepts-handled gap growing | stub_status; TcpExtListenOverflows |
Quick checks
Run these read-only commands before making changes.
# Slow requests (>1s). Adjust $NF if $request_time is not the last field.
tail -n 1000 /var/log/nginx/access.log | awk '$NF > 1.0'
# Connection pressure
curl -s http://127.0.0.1/nginx_status
# Connection state breakdown: Reading, Writing, Waiting
curl -s http://127.0.0.1/nginx_status | awk '/Reading/ {print "R:"$2,"W:"$4,"Wait:"$6}'
# Silently dropped connections
curl -s http://127.0.0.1/nginx_status | awk '/^[[:space:]]*[0-9]/ {print "dropped="$1-$2; exit}'
# File descriptor pressure per worker
for pid in $(pgrep -P "$(cat /var/run/nginx.pid)"); do
  echo "Worker $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) FDs"
done
# Per-worker CPU
for pid in $(pgrep -P "$(cat /var/run/nginx.pid)"); do
  echo "Worker $pid: $(ps -o %cpu= -p $pid)%"
done
# Listen queue depth on ports 80 and 443
ss -tlnp | awk 'NR>1 && ($4 ~ /:80$/ || $4 ~ /:443$/) {print "Recv-Q:"$2,"Send-Q:"$3,"Local:"$4}'
nstat -az TcpExtListenOverflows 2>/dev/null | awk '/ListenOverflows/ {print $2}'
How to diagnose it
Do not skip step 1. It prevents the most common nginx misdiagnosis.
1. Compare $request_time with $upstream_response_time. If the values are within a few milliseconds, the upstream is the bottleneck. Go to step 2. If $request_time is much larger, the time is spent reading from the client, writing to the client, or inside nginx. Go to step 3.
2. Isolate the upstream segment. Check $upstream_connect_time. High values mean network congestion or a full upstream accept queue. If $upstream_connect_time is near zero (keepalive reuse or fast local path), check $upstream_header_time. A large gap between $upstream_header_time and $upstream_response_time means the backend is slow to generate the response body. A high $upstream_header_time with a small gap means the application is slow to start responding. Check backend CPU, memory, and logs. If $upstream_response_time contains multiple comma-separated values, nginx is retrying after timeouts.
3. Check for slow clients or large uploads. Look at $request_length. Large values indicate big POST/PUT bodies. Check stub_status: if Reading dominates active connections, clients are sending slowly. This is expected for mobile networks or file uploads, but it inflates $request_time without implying an upstream problem.
4. Check for temp-file spill. If the request is not upload-heavy and the upstream is fast, but $request_time remains high, suspect disk I/O. When request bodies exceed client_body_buffer_size or proxy responses exceed proxy_buffers, nginx writes to temp files. Check disk I/O latency on the partition holding your temp paths. Look for correlation between large $body_bytes_sent or $request_length and high $request_time. Check iostat -x 1 or /proc/diskstats: if %util or queue depth spikes with latency, temp-file I/O is the likely culprit. Slow proxy_temp_path or client_body_temp_path storage adds latency that $upstream_response_time cannot see.
5. Check worker CPU for SSL, gzip, or regex overhead. If none of the above fit, look at per-worker CPU. A worker sustained above 80% indicates an event-loop bottleneck. Correlate with SSL session cache hit rate (log $ssl_session_reused; "r" means reused, "." means new). A low hit rate with high connection rate means workers are burning CPU on TLS handshakes. High CPU without high SSL load points to gzip compression or inefficient regex in location or rewrite blocks.
6. Rule out connection exhaustion. If there is still no clear culprit, check active connections against worker_connections × worker_processes. Utilization above 80%, or a growing accepts-handled gap, means requests are queuing or dropping. Check TcpExtListenOverflows: any increasing rate means the kernel is dropping SYNs before nginx sees them.
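The first three steps depend only on logged values, so they can be sketched as a small classifier. The function name, the 50 ms tolerance, and the 1 MiB upload threshold are hypothetical and should be tuned to your traffic. The sketch handles a single-valued $upstream_response_time only. Steps 4 to 6 need host metrics and are left as a catch-all label.

```shell
# Classify one request from its logged timings (steps 1-3 only).
# Args: request_time upstream_response_time upstream_header_time request_length
# Thresholds are illustrative assumptions, not nginx defaults.
classify_request() {
  awk -v rt="$1" -v urt="$2" -v uht="$3" -v len="$4" 'BEGIN {
    if (urt == "-") { print "no-upstream"; exit }
    delta = rt - urt
    if (delta <= 0.05) {
      # Step 2: upstream dominates; split first byte vs body time.
      if (urt - uht > uht) print "upstream-slow-body"
      else                 print "upstream-slow-first-byte"
    } else if (len + 0 > 1048576) {
      # Step 3: large request body suggests a slow client or upload.
      print "slow-client-or-upload"
    } else {
      # Steps 4-6: needs disk, CPU, and connection metrics.
      print "check-nginx-side"
    }
  }'
}
```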
flowchart TD
    A["Elevated $request_time"] --> B{"$upstream_response_time<br/>close to $request_time?"}
    B -->|Yes| C["Upstream slow"]
    B -->|No| D{"Large $request_length<br/>or high Reading?"}
    D -->|Yes| E["Slow client / upload"]
    D -->|No| F{"Worker CPU >80%?"}
    F -->|Yes| G["SSL / gzip / regex"]
    F -->|No| H["Temp-file spill<br/>or kernel queueing"]
    C --> I["Check $upstream_header_time,<br/>connect time, backend logs"]
    E --> J["Check client_body_buffer_size<br/>and timeouts"]
    G --> K["Check SSL session cache,<br/>compression level"]
    H --> L["Check disk I/O,<br/>proxy buffers, listen queue"]

Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| $request_time | End-to-end latency from nginx’s perspective | P95 > 2x rolling baseline sustained for 5+ minutes |
| $upstream_response_time | Isolates backend latency from client and nginx overhead | P95 trending up >20% from baseline |
| $upstream_header_time | Distinguishes backend processing delay from response body transfer | Approaching proxy_read_timeout |
| $upstream_connect_time | Reveals TCP/TLS handshake overhead and keepalive health | >100ms for same-datacenter backends |
| Reading / Writing / Waiting | Shows where connections are stuck | Writing dominant with low throughput = upstream bottleneck; Reading dominant = slow client |
| Accepts - Handled gap | Silent connection drops before processing | Gap increasing under load |
| Worker CPU per process | Detects SSL/gzip/regex saturation | Sustained >80% per worker |
| TcpExtListenOverflows | Kernel-level drops invisible to nginx | Any nonzero increasing rate |
| $ssl_session_reused | TLS session resumption efficiency | Hit rate <70% |
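Several warning signs above are expressed as P95 against a baseline. A rough way to get a P95 from a batch of log values is the nearest-rank method sketched below. The function names are hypothetical, and the commented usage assumes $request_time is logged with an rt= label, which your format may not use.

```shell
# Nearest-rank P95 of numeric values, one per line on stdin.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) with integer math
      print v[idx]
    }'
}

# Succeeds when the current P95 is more than 2x the baseline.
exceeds_baseline() {
  awk -v c="$1" -v b="$2" 'BEGIN { exit !(c > 2 * b) }'
}

# Usage sketch (assumes an rt=<seconds> field in each log line):
#   awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^rt=/) print substr($i, 4) }' \
#     /var/log/nginx/access.log | p95
```

A single batch P95 is only a spot check. The sustained and rolling-baseline conditions in the table still need a monitoring system that keeps history.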
Fixes
Upstream slow
If $upstream_response_time dominates, the problem is behind nginx. Reduce proxy_read_timeout temporarily to fail faster and free connection slots. This trades 504s for faster turnaround on requests that would have timed out anyway. If one upstream server in a pool is degraded, remove it from the upstream block and reload. Do not restart; a reload preserves existing connections on old workers while new workers pick up the changed configuration.
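After editing the upstream block, a reload is safer when it is gated on a config test. The wrapper below is a sketch: the function name and the NGINX_BIN override are hypothetical conveniences, while nginx -t and nginx -s reload are the standard test and reload invocations.

```shell
# Validate the configuration, then reload only if the test passes.
# NGINX_BIN is a hypothetical override for non-standard install paths.
reload_nginx_checked() {
  nginx_bin="${NGINX_BIN:-nginx}"
  if "$nginx_bin" -t; then
    "$nginx_bin" -s reload
  else
    echo "config test failed; not reloading" >&2
    return 1
  fi
}
```

Running the test first means a typo in the edited upstream block leaves the running workers untouched instead of failing mid-incident.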
Slow client or large upload
Reduce client_header_timeout and client_body_timeout from the default 60s to 10-15s if your workload allows. This cuts off stalled clients. If you legitimately handle large uploads, increase client_body_buffer_size so bodies stay in memory rather than spilling to disk. Use limit_conn to cap concurrent connections per client IP.
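Before tightening timeouts or adding limit_conn, it helps to confirm that slow requests really concentrate on a few clients. The sketch below counts slow requests per client IP. It assumes $remote_addr is the first field and $request_time is logged as rt=<seconds>; both are assumptions about your log format, and the function name is hypothetical.

```shell
# Count requests slower than a threshold, grouped by client IP.
# Assumes field 1 is $remote_addr and $request_time appears as rt=<seconds>.
slow_requests_by_ip() {
  threshold="${1:-1.0}"
  awk -v t="$threshold" '
    {
      rt = -1
      for (i = 1; i <= NF; i++)
        if ($i ~ /^rt=/) rt = substr($i, 4) + 0
      if (rt > t) count[$1]++
    }
    END { for (ip in count) print count[ip], ip }
  ' | sort -rn
}

# Usage sketch:
#   tail -n 10000 /var/log/nginx/access.log | slow_requests_by_ip 1.0 | head
```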
Temp-file spill
Increase proxy_buffers count or size so large upstream responses buffer in memory. If responses are predictably large and memory is constrained, move proxy_temp_path to faster storage such as a local SSD. Using tmpfs is an option only if you have sufficient memory headroom and can tolerate OOM pressure. For write-heavy workloads, avoid placing temp files on a spindle shared with OS swap or application logs. Verify the change by monitoring disk I/O latency on that partition after the next traffic peak.
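To size the buffers, a rough estimate of in-memory capacity per proxied response is proxy_buffer_size plus the proxy_buffers count times size. The sketch below applies that approximation. The function name is hypothetical, all values are in bytes, and the real spill point also depends on proxy_busy_buffers_size and on how quickly the client drains the response, so treat the result as a first guess.

```shell
# Rough check: would a response of this size exceed the memory buffers?
# Args: response_bytes proxy_buffer_size buffers_count buffers_size (bytes)
# Capacity is approximated as proxy_buffer_size + count * size.
response_spills() {
  response_bytes="$1"; buffer_size="$2"; buf_count="$3"; buf_size="$4"
  capacity=$(( buffer_size + buf_count * buf_size ))
  [ "$response_bytes" -gt "$capacity" ]
}

# Example: a 100000-byte response against 8k + 8 x 8k (73728 bytes)
if response_spills 100000 8192 8 8192; then
  echo "likely to spill to proxy_temp_path"
fi
```

Comparing typical $body_bytes_sent values against this capacity shows whether raising the buffer count is enough or whether faster temp storage is the more realistic fix.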
CPU saturation
Enable ssl_session_cache shared:SSL:... with a generous size and set ssl_session_timeout to at least several hours. If gzip is enabled, consider lowering the compression level or disabling it for internal traffic. Audit location and rewrite blocks for regex backtracking. Offloading TLS termination to a dedicated edge layer is the last resort if connection rates exceed what your CPU can handshake.
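To verify that session cache changes help, measure the reuse rate before and after. The sketch below takes $ssl_session_reused values, one per line, and prints the percentage of "r" entries. The function name is hypothetical, and extracting the values from your access log depends on where you log the variable.

```shell
# Percentage of TLS connections that reused a session.
# Input: one $ssl_session_reused value per line ("r" reused, "." new).
# Other values, such as those from non-TLS requests, are ignored.
ssl_reuse_rate() {
  awk '
    $1 == "r" { reused++ }
    $1 == "r" || $1 == "." { total++ }
    END {
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * reused / total
    }'
}

# Example: printf 'r\nr\n.\nr\n' | ssl_reuse_rate   prints 75.0
```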
Connection exhaustion
Increase worker_connections and ensure worker_rlimit_nofile is at least double that value to account for upstream connections, log files, and temp files. Also verify that the systemd LimitNOFILE or ulimit for the nginx process matches worker_rlimit_nofile; a mismatch silently caps the effective limit below what nginx expects. If keepalive connections consume the majority of slots, reduce keepalive_timeout or keepalive_requests to reclaim capacity faster.
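The effective limit can be read from a running worker rather than inferred from configuration. The sketch below parses the soft "Max open files" value from /proc/<pid>/limits (Linux only) and checks it against twice worker_connections. The function names are hypothetical, and the check does not handle a limit of "unlimited".

```shell
# Extract the soft "Max open files" limit from /proc/<pid>/limits on stdin.
soft_nofile() {
  awk '/^Max open files/ { print $4; exit }'
}

# Succeeds when the limit is at least twice worker_connections.
# Args: limit worker_connections
nofile_sufficient() {
  limit="$1"; worker_connections="$2"
  [ "$limit" -ge $(( 2 * worker_connections )) ]
}

# Usage sketch, for a worker PID in $pid and worker_connections 16384:
#   nofile_sufficient "$(soft_nofile < /proc/$pid/limits)" 16384 \
#     || echo "worker $pid: open-file limit too low"
```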
Prevention
Log all four timing variables in a single format so you can derive the delta in any parsing tool. Alert on the gap between $request_time and $upstream_response_time: a sustained gap >100ms without a corresponding upstream latency increase signals client slowness or temp-file spill.
Size worker_connections with the proxy multiplier in mind: each proxied request uses at least two connection slots. Keep peak active connections below 60% of worker_connections × worker_processes. Monitor accepts - handled and TcpExtListenOverflows from day one; they are leading indicators that cost nothing to collect. On multi-tenant nodes, temp-file spill from one virtual host can degrade latency for others sharing the same disk.
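The 60% guideline is easy to check from stub_status output. The sketch below extracts the active connection count and expresses it as a percentage of worker_connections × worker_processes. The function names are hypothetical, and the two configuration values must be supplied by hand. Since stub_status counts client connections only, proxied traffic consumes more slots than this figure suggests.

```shell
# Extract "Active connections" from stub_status output on stdin.
active_connections() {
  awk '/^Active connections:/ { print $3; exit }'
}

# Utilization as a percentage of total connection slots.
# Args: active worker_connections worker_processes
conn_utilization() {
  awk -v a="$1" -v wc="$2" -v wp="$3" \
    'BEGIN { printf "%.1f\n", 100 * a / (wc * wp) }'
}

# Usage sketch, assuming worker_connections 1024 and 4 workers:
#   active="$(curl -s http://127.0.0.1/nginx_status | active_connections)"
#   conn_utilization "$active" 1024 4
```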
How Netdata helps
Netdata collects stub_status metrics, including the accepts-handled gap and the Reading/Writing/Waiting breakdown, and correlates them with per-worker CPU, memory, and file descriptor usage on the same node. Disk I/O latency charts for the partitions hosting logs and temp files surface temp-file spill that access logs alone cannot explain. The web_log collector can parse structured nginx access logs to chart $request_time percentiles and upstream timing variables alongside system metrics.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 502 Bad Gateway: causes and how to fix it
- nginx 503 Service Temporarily Unavailable: causes and fixes
- nginx 504 Gateway Time-out: causes and fixes
- NGINX active connections climbing: reading, writing, waiting explained
- NGINX backend cascade failure: when slow upstreams take down everything
- nginx connect() failed (111: Connection refused) while connecting to upstream
- NGINX connection exhaustion: detection, diagnosis, and prevention
- NGINX dropped connections: the accepts vs handled gap
- NGINX monitoring checklist: the signals every production server needs
- NGINX monitoring maturity model: from survival to expert
- nginx no live upstreams while connecting to upstream: what it means







