NGINX slow requests: from access log to root cause

Elevated $request_time in access logs does not mean the upstream is slow. The variable measures the full lifecycle: from reading the first client byte through sending the last response byte. That includes client upload, upstream wait, nginx processing, and client download. Blaming the backend by reflex is the most common nginx latency mistake.

To split the time accurately, confirm your log_format includes $request_time, $upstream_response_time, $upstream_connect_time, and $upstream_header_time.
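
A quick way to confirm this is to search the running configuration for the four variables. The following is a minimal sketch: the function name `missing_timing_vars` is hypothetical, and it checks the whole configuration dump rather than only `log_format` lines, so a variable used elsewhere would also count as present.

```shell
# Sketch: report which timing variables are absent from an nginx config dump.
# Reads config text on stdin. The function name is an assumption for illustration.
missing_timing_vars() {
  conf=$(cat)
  for v in request_time upstream_response_time upstream_connect_time upstream_header_time; do
    # grep -F matches "$name" as a literal string, not as a regex.
    printf '%s\n' "$conf" | grep -qF "\$$v" || echo "missing: \$$v"
  done
}

# Usage (read-only): nginx -T 2>/dev/null | missing_timing_vars
```

No output means all four variables appear somewhere in the configuration.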

What this means

$request_time is the wall-clock time nginx spends on a single request. For a proxied request:

$request_time ≈ client header/body read + upstream connection setup + upstream response wait + response body transfer to client

$upstream_response_time covers only the upstream segment: establishing the TCP connection through receiving the last byte of the response body. It excludes time spent reading the request from the client or sending the response to the client.

The delta between $request_time and $upstream_response_time is the diagnostic space. When $request_time is high and the delta is small, the upstream is the bottleneck. A large delta points to the client, nginx processing, or disk I/O from temp-file spill.

$upstream_response_time is logged as "-" for internal redirects, cache hits without upstream contact, or internal errors. If nginx retries upstreams, the variable contains one comma-separated value per attempt. A colon separates values when an internal redirect moves the request to another server group.
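
The delta can be derived per request with awk. This is a sketch under a stated assumption: the access log contains the fields `rt=$request_time` and `urt="$upstream_response_time"`. Those key names and the function name `request_delta` are hypothetical. Because retries happen one after another, the sketch sums the per-attempt values.

```shell
# Sketch: print "request_time upstream_total delta" for each log line.
# Assumes hypothetical fields rt=<seconds> and urt="<seconds[, seconds...]>".
request_delta() {
  awk '{
    rt = ""; urt = ""
    # Require a leading space or line start so "sort=5" in a URL is not matched.
    if (match($0, /(^| )rt=[0-9.]+/)) {
      rt = substr($0, RSTART, RLENGTH); sub(/^ ?rt=/, "", rt)
    }
    if (match($0, /(^| )urt="[^"]*"/)) {
      urt = substr($0, RSTART, RLENGTH); sub(/^ ?urt="/, "", urt); sub(/"$/, "", urt)
    }
    if (rt == "") next
    # "-" means no upstream was contacted, so there is no delta to compute.
    if (urt == "" || urt == "-") { print rt, "-", "-"; next }
    n = split(urt, parts, / *[,:] */)
    sum = 0
    for (i = 1; i <= n; i++) if (parts[i] != "-") sum += parts[i]
    printf "%.3f %.3f %.3f\n", rt, sum, rt - sum
  }'
}

# Usage: tail -n 1000 /var/log/nginx/access.log | request_delta
```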

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow upstream / backend | $upstream_response_time close to $request_time; Writing state high | $upstream_header_time vs $upstream_response_time; backend CPU and logs |
| Slow client or large upload | $request_time high, $upstream_response_time low; Reading state sustained | $request_length; per-IP connection concentration |
| Temp-file spill | $request_time much larger than $upstream_response_time; latency correlates with large payloads | Disk I/O on the temp-file partition; proxy_buffers and client_body_buffer_size |
| CPU saturation (SSL/gzip/regex) | $upstream_response_time low; per-worker CPU >80%; latency uniform across request sizes | Per-worker CPU; SSL session cache hit rate; gzip ratio |
| Connection exhaustion | Active connections near worker_connections × worker_processes; accepts-handled gap growing | stub_status; TcpExtListenOverflows |

Quick checks

Run these read-only commands before making changes.

# Slow requests (>1s). Adjust $NF if $request_time is not the last field.
tail -n 1000 /var/log/nginx/access.log | awk '$NF > 1.0'

# Connection pressure
curl -s http://127.0.0.1/nginx_status

# Connection state breakdown: Reading, Writing, Waiting
curl -s http://127.0.0.1/nginx_status | awk '/Reading/ {print "R:"$2,"W:"$4,"Wait:"$6}'

# Silently dropped connections (accepts minus handled)
curl -s http://127.0.0.1/nginx_status | awk '/^[[:space:]]*[0-9]/ {print "dropped=" ($1 - $2); exit}'

# File descriptor pressure per worker
for pid in $(pgrep -P $(cat /var/run/nginx.pid)); do
  echo "Worker $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) FDs"
done

# Per-worker CPU
for pid in $(pgrep -P $(cat /var/run/nginx.pid)); do
  echo "Worker $pid: $(ps -o %cpu= -p $pid)%"
done

# Listen queue depth on ports 80 and 443
ss -tlnp | awk 'NR>1 && ($4 ~ /:80$/ || $4 ~ /:443$/) {print "Recv-Q:"$2,"Send-Q:"$3,"Local:"$4}'
nstat -az TcpExtListenOverflows 2>/dev/null | awk '/ListenOverflows/ {print $2}'

How to diagnose it

Do not skip step 1. It prevents the most common nginx misdiagnosis.

  1. Compare $request_time with $upstream_response_time. If the values are within a few milliseconds, the upstream is the bottleneck. Go to step 2. If $request_time is much larger, the time is spent reading from the client, writing to the client, or inside nginx. Go to step 3.

  2. Isolate the upstream segment. Check $upstream_connect_time. High values mean network congestion or a full upstream accept queue. If $upstream_connect_time is near zero (keepalive reuse or fast local path), check $upstream_header_time. A large gap between $upstream_header_time and $upstream_response_time means the backend is slow to generate the response body. A high $upstream_header_time with a small gap means the application is slow to start responding. Check backend CPU, memory, and logs. If $upstream_response_time contains multiple comma-separated values, nginx is retrying after timeouts.

  3. Check for slow clients or large uploads. Look at $request_length. Large values indicate big POST/PUT bodies. Check stub_status: if Reading dominates active connections, clients are sending slowly. This is expected for mobile networks or file uploads, but it inflates $request_time without implying an upstream problem.

  4. Check for temp-file spill. If the request is not upload-heavy and the upstream is fast, but $request_time remains high, suspect disk I/O. When request bodies exceed client_body_buffer_size or proxy responses exceed proxy_buffers, nginx writes to temp files. Check disk I/O latency on the partition holding your temp paths. Look for correlation between large $body_bytes_sent or $request_length and high $request_time. Check iostat -x 1 or /proc/diskstats: if %util or queue depth spikes with latency, temp-file I/O is the likely culprit. Slow proxy_temp_path or client_body_temp_path storage adds latency that $upstream_response_time cannot see.

  5. Check worker CPU for SSL, gzip, or regex overhead. If none of the above fit, look at per-worker CPU. A worker sustained above 80% indicates an event-loop bottleneck. Correlate with SSL session cache hit rate (log $ssl_session_reused; "r" means reused, "." means new). A low hit rate with high connection rate means workers are burning CPU on TLS handshakes. High CPU without high SSL load points to gzip compression or inefficient regex in location or rewrite blocks.

  6. Rule out connection exhaustion. If there is still no clear culprit, check active connections against worker_connections × worker_processes. Utilization above 80%, or a growing accepts-handled gap, means requests are queuing or dropping. Check TcpExtListenOverflows: any increasing rate means the kernel is dropping connections at a full listen queue before nginx sees them.
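
The split described in step 2 can be sketched in awk. The field names `uct=`, `uht=`, and `urt=` and the function name `upstream_phases` are assumptions for illustration. The sketch also assumes all three upstream timers are measured from the same starting point, so the differences approximate connect time, wait for the first response byte, and body transfer. Lines with "-" or with multiple retry values are skipped.

```shell
# Sketch: break the upstream segment into phases for single-attempt requests.
# Assumes hypothetical fields uct=, uht=, urt= (optionally double-quoted).
upstream_phases() {
  awk '{
    uct = uht = urt = ""
    for (i = 1; i <= NF; i++) {
      val = $i; gsub(/"/, "", val)
      if (val ~ /^uct=/) uct = substr(val, 5)
      if (val ~ /^uht=/) uht = substr(val, 5)
      if (val ~ /^urt=/) urt = substr(val, 5)
    }
    # Skip "-" (no upstream contact) and comma lists (retries).
    if (uct !~ /^[0-9.]+$/ || uht !~ /^[0-9.]+$/ || urt !~ /^[0-9.]+$/) next
    printf "connect=%.3f first_byte=%.3f body=%.3f\n", uct, uht - uct, urt - uht
  }'
}

# Usage: tail -n 1000 /var/log/nginx/access.log | upstream_phases
```

A large `first_byte` value points at application processing, while a large `body` value points at response generation or transfer.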

flowchart TD
  A[Elevated $request_time] --> B{$upstream_response_time<br/>close to $request_time?}
  B -->|Yes| C[Upstream slow]
  B -->|No| D{Large $request_length<br/>or high Reading?}
  D -->|Yes| E[Slow client / upload]
  D -->|No| F{Worker CPU >80%?}
  F -->|Yes| G[SSL / gzip / regex]
  F -->|No| H[Temp-file spill<br/>or kernel queueing]
  C --> I[Check $upstream_header_time,<br/>connect time, backend logs]
  E --> J[Check client_body_buffer_size<br/>and timeouts]
  G --> K[Check SSL session cache,<br/>compression level]
  H --> L[Check disk I/O,<br/>proxy buffers, listen queue]

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| $request_time | End-to-end latency from nginx’s perspective | P95 > 2x rolling baseline sustained for 5+ minutes |
| $upstream_response_time | Isolates backend latency from client and nginx overhead | P95 trending up >20% from baseline |
| $upstream_header_time | Distinguishes backend processing delay from response body transfer | Approaching proxy_read_timeout |
| $upstream_connect_time | Reveals TCP/TLS handshake overhead and keepalive health | >100ms for same-datacenter backends |
| Reading / Writing / Waiting | Shows where connections are stuck | Writing dominant with low throughput = upstream bottleneck; Reading dominant = slow client |
| Accepts - Handled gap | Silent connection drops before processing | Gap increasing under load |
| Worker CPU per process | Detects SSL/gzip/regex saturation | Sustained >80% per worker |
| TcpExtListenOverflows | Kernel-level drops invisible to nginx | Any nonzero increasing rate |
| $ssl_session_reused | TLS session resumption efficiency | Hit rate <70% |
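
The session resumption hit rate can be estimated from the access log. This sketch assumes a hypothetical field `ssl=$ssl_session_reused` in the log format, and the function name `ssl_reuse_rate` is also an assumption. It counts per request rather than per connection, so long keepalive connections weight the result.

```shell
# Sketch: TLS session reuse rate from a hypothetical "ssl=" log field.
# "r" means the session was reused, "." means a new session.
ssl_reuse_rate() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "ssl=r") reused++
      else if ($i == "ssl=.") fresh++
    }
  }
  END {
    total = reused + fresh
    if (total == 0) { print "no TLS requests seen"; exit }
    printf "reused=%d new=%d hit_rate=%.1f%%\n", reused, fresh, 100 * reused / total
  }'
}

# Usage: tail -n 10000 /var/log/nginx/access.log | ssl_reuse_rate
```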

Fixes

Upstream slow

If $upstream_response_time dominates, the problem is behind nginx. Reduce proxy_read_timeout temporarily to fail faster and free connection slots. This trades 504s for faster turnaround on requests that would have timed out anyway. If one upstream server in a pool is degraded, remove it from the upstream block and reload. Do not restart; a reload preserves existing connections on old workers while new workers pick up the changed configuration.
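
To find a degraded server in a pool, group upstream latency by server address. The sketch below assumes the log format includes hypothetical fields `ua=$upstream_addr` and `urt=$upstream_response_time`. The function name `upstream_latency_by_addr` is an assumption, and retried requests (comma lists) are skipped for simplicity.

```shell
# Sketch: request count, mean, and max upstream response time per upstream address.
# Assumes hypothetical fields ua= and urt= (optionally double-quoted).
upstream_latency_by_addr() {
  awk '{
    addr = ""; t = ""
    for (i = 1; i <= NF; i++) {
      val = $i; gsub(/"/, "", val)
      if (val ~ /^ua=/)  addr = substr(val, 4)
      if (val ~ /^urt=/) t = substr(val, 5)
    }
    # Skip lines without a single numeric upstream time.
    if (addr == "" || t !~ /^[0-9.]+$/) next
    n[addr]++; sum[addr] += t
    if (t + 0 > max[addr]) max[addr] = t + 0
  }
  END {
    for (a in n) printf "%s count=%d mean=%.3f max=%.3f\n", a, n[a], sum[a] / n[a], max[a]
  }' | sort
}

# Usage: tail -n 10000 /var/log/nginx/access.log | upstream_latency_by_addr
```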

Slow client or large upload

Reduce client_header_timeout and client_body_timeout from the default 60s to 10-15s if your workload allows. This cuts off stalled clients. If you legitimately handle large uploads, increase client_body_buffer_size so bodies stay in memory rather than spilling to disk. Use limit_conn to cap concurrent connections per client IP.

Temp-file spill

Increase proxy_buffers count or size so large upstream responses buffer in memory. If responses are predictably large and memory is constrained, move proxy_temp_path to faster storage such as a local SSD. Using tmpfs is an option only if you have sufficient memory headroom and can tolerate OOM pressure. For write-heavy workloads, avoid placing temp files on a spindle shared with OS swap or application logs. Verify the change by monitoring disk I/O latency on that partition after the next traffic peak.
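
Besides disk metrics, the nginx error log can confirm whether spill is still happening. nginx logs a warning each time it buffers a response or request body to a temporary file, so this only works if error_log is set to the warn level or lower. A minimal sketch, with the function name `count_temp_spills` as an assumption:

```shell
# Sketch: count temp-file buffering warnings in nginx error log text on stdin.
# Requires error_log at "warn" level or more verbose for the messages to appear.
count_temp_spills() {
  awk '
    /an upstream response is buffered to a temporary file/  { up++ }
    /a client request body is buffered to a temporary file/ { body++ }
    END { printf "upstream_response_spills=%d client_body_spills=%d\n", up, body }
  '
}

# Usage: tail -n 5000 /var/log/nginx/error.log | count_temp_spills
```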

CPU saturation

Enable ssl_session_cache shared:SSL:... with a generous size and set ssl_session_timeout to at least several hours. If gzip is enabled, consider lowering the compression level or disabling it for internal traffic. Audit location and rewrite blocks for regex backtracking. Offloading TLS termination to a dedicated edge layer is the last resort if connection rates exceed what your CPU can handshake.

Connection exhaustion

Increase worker_connections and ensure worker_rlimit_nofile is at least double that value to account for upstream connections, log files, and temp files. Also verify that the systemd LimitNOFILE or ulimit for the nginx process matches worker_rlimit_nofile; a mismatch silently caps the effective limit below what nginx expects. If keepalive connections consume the majority of slots, reduce keepalive_timeout or keepalive_requests to reclaim capacity faster.
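
The effective limit of a running worker can be read from /proc and compared with worker_rlimit_nofile. This is a Linux-only sketch, and the function name `nofile_soft_limit` is an assumption. It reads the contents of a `/proc/<pid>/limits` file on stdin.

```shell
# Sketch: extract the soft "Max open files" limit from /proc/<pid>/limits content.
nofile_soft_limit() {
  awk '/^Max open files/ { print $4 }'
}

# Usage (read-only), one line per worker:
# for pid in $(pgrep -P "$(cat /var/run/nginx.pid)"); do
#   echo "Worker $pid: $(nofile_soft_limit < "/proc/$pid/limits")"
# done
```

If the printed value is lower than worker_rlimit_nofile, the service manager or shell limit is capping the worker.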

Prevention

Log all four timing variables in a single format so you can derive the delta in any parsing tool. Alert on the gap between $request_time and $upstream_response_time: a sustained gap >100ms without a corresponding upstream latency increase signals client slowness or temp-file spill.
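
One way to evaluate such an alert is on a percentile of the per-request gap rather than on single requests. The sketch below computes a nearest-rank P95 from one number per line on stdin. The function name `p95` is an assumption, and how the gap values are extracted depends on your log format.

```shell
# Sketch: nearest-rank 95th percentile of numbers read from stdin (one per line).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      idx = int((95 * NR + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[idx]
    }'
}

# Usage: <command that prints one gap value per line> | p95
```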

Size worker_connections with the proxy multiplier in mind: each proxied request uses at least two connection slots. Keep peak active connections below 60% of worker_connections × worker_processes. Monitor accepts - handled and TcpExtListenOverflows from day one; they are leading indicators that cost nothing to collect. On multi-tenant nodes, temp-file spill from one virtual host can degrade latency for others sharing the same disk.
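
The sizing rule above reduces to simple arithmetic. A minimal sketch, assuming you pass the peak active connection count, worker_connections, and worker_processes as arguments; the function name `conn_headroom` and the output labels are hypothetical.

```shell
# Sketch: connection capacity and utilization against the 60% target.
# $1 = peak active connections, $2 = worker_connections, $3 = worker_processes
conn_headroom() {
  awk -v active="$1" -v wc="$2" -v wp="$3" 'BEGIN {
    capacity = wc * wp
    pct = 100 * active / capacity
    status = (pct < 60) ? "ok" : "over-60%-target"
    # Each proxied request needs at least two slots, so halve the capacity.
    printf "capacity=%d proxied_clients_max=%d used=%.1f%% %s\n", capacity, int(capacity / 2), pct, status
  }'
}

# Usage: conn_headroom 1500 1024 4
```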

How Netdata helps

Netdata collects stub_status metrics, including the accepts-handled gap and the Reading/Writing/Waiting breakdown, and correlates them with per-worker CPU, memory, and file descriptor usage on the same node. Disk I/O latency charts for the partitions hosting logs and temp files surface temp-file spill that access logs alone cannot explain. The web_log collector can parse structured nginx access logs to chart $request_time percentiles and upstream timing variables alongside system metrics.