NGINX connection exhaustion: detection, diagnosis, and prevention
Users see connection timeouts while load balancer health checks and the NGINX stub_status endpoint still return HTTP 200. New connections are silently dropped. Connection exhaustion is a cliff-edge failure: once the limit is hit, there is no graceful degradation. Connections are refused at the kernel level, or accepted into the TCP backlog but discarded by NGINX because no worker has a free connection slot.
The capacity boundary is worker_connections multiplied by worker_processes. In proxy mode, every request consumes at least two slots: one for the client and one for the upstream. With the default worker_connections of 512, four worker processes can handle at most 2,048 connections, or roughly 1,000 concurrent proxied requests. The gap between accepts and handled in stub_status is the leading indicator; it grows before the active connection count flatlines.
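The arithmetic above can be sketched as two small shell functions. The function names are illustrative and not part of NGINX; substitute your own worker_connections and worker_processes values.

```bash
# Theoretical connection ceiling: worker_connections * worker_processes.
max_connections() {
  echo $(( $1 * $2 ))
}

# A proxied request holds a client slot and an upstream slot,
# so the usable request ceiling is about half the connection ceiling.
max_proxied_requests() {
  echo $(( $(max_connections "$1" "$2") / 2 ))
}

max_connections 512 4        # prints 2048
max_proxied_requests 512 4   # prints 1024
```

This is an upper bound only: file descriptor limits and the kernel listen queue can cut in earlier.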
Exhaustion can be triggered by slow upstream backends holding connections in the Writing state, keepalive clients piling up in the Waiting state, slowloris-style attacks stuck in Reading, or a simple capacity mismatch.
What this means
NGINX allocates a fixed pool of connection structures per worker. worker_connections sets the pool size; the default is 512. Each active connection, including idle keepalive, consumes one slot. When a worker’s pool is full, it cannot pull new connections from the kernel listen queue. If the kernel queue is also full, the kernel drops the SYN. If the queue has room, the TCP handshake completes but NGINX cannot allocate a connection structure, incrementing accepts without incrementing handled.
There are two invisible drop points: the kernel accept queue (TcpExtListenOverflows) and the NGINX connection slot limit (accepts - handled gap). Both produce the same client symptom: timeouts. Kernel drops leave zero evidence in NGINX logs.
File descriptor exhaustion is a separate ceiling. Every client connection, upstream connection, log file, and temp file consumes an FD. The effective limit is the lower of worker_rlimit_nofile and the system ulimit. If FDs exhaust before connection slots, you see accept4() failed (24: Too many open files) in the error log.
```mermaid
flowchart TD
    A[Connection exhaustion] --> B{Which state dominates?}
    B -->|Waiting high| C[Keepalive pileup]
    B -->|Writing high| D[Slow upstream pileup]
    B -->|Reading high| E[Slow client attack]
    C --> F[Slots held idle]
    D --> G[Slots blocked on backend]
    E --> H[Slots blocked on client]
    F --> I[accepts > handled]
    G --> I
    H --> I
    I --> J[Cliff-edge drops]
    K[Listen queue full] --> J
```
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow upstream backends | Writing state dominates; upstream_response_time high; 502/504 emerging | Error log for upstream timed out; check backend latency directly |
| Keepalive timeout too long | Waiting state dominates; high active connections with low throughput | keepalive_timeout and keepalive_requests settings |
| worker_connections or FD limit too low | accepts > handled gap growing; active connections plateau at a hard ceiling | worker_connections * worker_processes vs peak load; /proc/$pid/limits for FDs |
| Kernel accept queue overflow | TcpExtListenOverflows increasing; zero entries in nginx error log | ss -tlnp Recv-Q vs Send-Q; net.core.somaxconn |
| Slowloris / slow clients | Reading state dominates; low CPU; low request rate | client_header_timeout; per-source IP connection concentration |
Quick checks
Run these in order. They are all read-only.
```bash
# Check active connections vs theoretical capacity
curl -s http://127.0.0.1/nginx_status

# Calculate the accepts-handled gap (dropped connections)
curl -s http://127.0.0.1/nginx_status | awk '/^[[:space:]]*[0-9]/ {print "gap=" $1-$2; exit}'

# Check which connection state dominates
curl -s http://127.0.0.1/nginx_status | awk '/Reading/ {print "R:"$2, "W:"$4, "Wait:"$6}'

# Count running workers to verify capacity math
pgrep -c -P $(cat /var/run/nginx.pid)

# Check file descriptor usage per worker against limits
for pid in $(pgrep -P $(cat /var/run/nginx.pid)); do
  used=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
  max=$(awk '/^Max open files/ {print $4}' /proc/$pid/limits)
  echo "Worker $pid: $used / $max FDs"
done

# Check kernel listen queue depth for nginx ports
ss -tlnp '( sport = :80 or sport = :443 )' | awk 'NR>1 {print "Recv-Q:"$2, "Send-Q:"$3, $4}'

# Check kernel-level listen overflows (silent drops)
awk '/^TcpExt: / {
  h=$0; getline; v=$0
  n=split(h,ha); split(v,va)
  for(i=1;i<=n;i++) if(ha[i]=="ListenOverflows") print va[i]
}' /proc/net/netstat
```
How to diagnose it
1. Confirm active connections against maximum capacity. Calculate the ratio of active connections to worker_connections * worker_processes. Above 80% is the danger zone. At 95% with a growing accepts - handled gap, exhaustion is already causing admission loss.
2. Check for active admission loss. Evaluate the accepts - handled delta as a rate, not an absolute. A nonzero gap that is stable means drops happened in the past; a growing gap means drops are happening now. Simultaneously check TcpExtListenOverflows. If it is increasing, the kernel is dropping connections before NGINX sees them.
3. Identify the dominant connection state. If Writing dominates, the bottleneck is upstream slowness. If Waiting dominates, keepalive connections are consuming slots without doing work. If Reading dominates, clients are sending data slowly or not at all. Normal proxy traffic usually shows Writing at 40-60%, Waiting at 30-50%, and Reading below 10%.
4. Verify file descriptors are not the real limit. Compare FD count per worker against /proc/$pid/limits. If FD usage is above 80%, the ceiling is worker_rlimit_nofile or the OS ulimit, not worker_connections. Raising worker_connections without raising the FD limit will not help.
5. Correlate with upstream health if Writing is high. Check $upstream_response_time and $upstream_connect_time in access logs. If upstream_response_time dominates $request_time, the backend is the bottleneck. If upstream_connect_time is elevated or nonzero when keepalive is configured, the upstream connection pool is ineffective.
6. Check keepalive configuration if Waiting is high. Idle keepalive connections still consume slots. Verify whether keepalive_timeout is unnecessarily long and whether keepalive_requests is unbounded. Confirm upstream keepalive is configured correctly: an upstream{} block, keepalive N; inside it, and proxy_http_version 1.1; with proxy_set_header Connection ""; in the location block.
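The first three checks above can be roughed out as a single pass over stub_status output. This is a minimal sketch: summarize_stub_status is a hypothetical helper name, and the capacity argument is assumed to be your own worker_connections * worker_processes figure.

```bash
# Summarize stub_status text read from stdin.
# $1 = theoretical capacity (worker_connections * worker_processes).
summarize_stub_status() {
  awk -v cap="$1" '
    /^Active connections:/ { active = $3 + 0 }
    # The counters line holds three integers: accepts handled requests.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { gap = $1 - $2 }
    /^Reading:/ {
      r = $2 + 0; w = $4 + 0; idle = $6 + 0
      state = "Reading"; top = r
      if (w > top)    { state = "Writing"; top = w }
      if (idle > top) { state = "Waiting" }
    }
    END {
      printf "utilization=%d%% gap=%d dominant=%s\n", active * 100 / cap, gap, state
    }'
}

# Live usage, assuming the status URL used in the quick checks:
#   curl -s http://127.0.0.1/nginx_status | summarize_stub_status 2048
```

A single snapshot only tells you where you are; whether the gap is still growing needs two samples taken some seconds apart.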
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Connection slot utilization | Primary saturation metric | Active / (worker_connections * worker_processes) > 80% |
| Dropped connections (accepts - handled) | Leading indicator of NGINX-level admission loss | Rate of gap increase > 0 sustained for > 60 seconds |
| Connection state breakdown | Reveals the nature of the load | Writing > 50% with low throughput; Reading > 20% sustained |
| File descriptor usage per worker | Hard ceiling independent of connection slots | > 80% of per-process limit |
| Kernel listen overflows (TcpExtListenOverflows) | Silent kernel-level drops invisible to NGINX logs | Any increasing rate |
| Upstream response time | Backend slowness is the most common exhaustion trigger | P95 trending up > 2x baseline or approaching proxy_read_timeout |
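The dropped-connections signal is a rate, so it takes two snapshots to evaluate. A minimal sketch follows, assuming each snapshot is a file holding raw stub_status output; gap_of and drop_rate_per_min are hypothetical helper names.

```bash
# Print the accepts - handled gap from a stub_status snapshot file.
gap_of() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $1 - $2; exit }' "$1"
}

# Drops per minute between two snapshots.
# $1 = earlier snapshot, $2 = later snapshot, $3 = seconds between them.
drop_rate_per_min() {
  echo $(( ( $(gap_of "$2") - $(gap_of "$1") ) * 60 / $3 ))
}

# Live usage, assuming the status URL used in the quick checks:
#   curl -s http://127.0.0.1/nginx_status > /tmp/s1; sleep 30
#   curl -s http://127.0.0.1/nginx_status > /tmp/s2
#   drop_rate_per_min /tmp/s1 /tmp/s2 30
```

A result of zero with a nonzero gap means the drops are historical. The counters reset when NGINX restarts, so a negative result usually indicates a restart between samples.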
Fixes
If slow upstream is causing the pileup
Reduce proxy_read_timeout temporarily to shed slow requests faster. This trades error rate for availability. Identify the degraded backend via $upstream_addr in access logs and remove it from rotation if possible. Scale upstream capacity. Do not simply raise worker_connections to absorb backend slowness; this delays the inevitable and can overwhelm the upstream with more concurrent load.
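To find the degraded backend, one option is to average $upstream_response_time per $upstream_addr. The sketch below assumes a hypothetical access log format that writes these variables as upstream_addr=... and upstream_response_time=... key=value fields; adapt the field matching to your own log_format. It also assumes one upstream per request, since NGINX writes comma-separated values when a request is retried on another server.

```bash
# Read access log lines from stdin and print, per upstream,
# "average_response_time request_count address", slowest first.
slowest_upstreams() {
  awk '{
    addr = ""; t = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^upstream_addr=/)          addr = substr($i, 15)
      if ($i ~ /^upstream_response_time=/) t = substr($i, 24)
    }
    # Skip requests that never reached an upstream (logged as "-").
    if (addr != "" && t != "" && t != "-") { sum[addr] += t; n[addr]++ }
  }
  END { for (a in sum) printf "%.3f %d %s\n", sum[a] / n[a], n[a], a }' | sort -rn
}

# Live usage (the log path is an assumption):
#   tail -n 5000 /var/log/nginx/access.log | slowest_upstreams | head
```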
If keepalive is consuming all slots
Lower keepalive_timeout to reclaim idle connections faster. Tune keepalive_requests to force periodic connection rotation. For upstream keepalive, ensure all three prerequisites are met: the upstream{} block, the keepalive directive inside it, and the HTTP/1.1 proxy settings. Without these, NGINX opens a fresh TCP connection per request, which also causes ephemeral port exhaustion via TIME_WAIT buildup.
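A quick text check of the three upstream keepalive prerequisites might look like the following. check_upstream_keepalive is a hypothetical helper that reads a configuration dump on stdin. It only confirms that each directive appears somewhere, not that it sits in the right block, so treat a clean result as a hint rather than proof.

```bash
# Report which upstream keepalive prerequisites are missing from a
# configuration dump read on stdin. Returns non-zero if any are missing.
check_upstream_keepalive() {
  cfg=$(cat)
  missing=0
  printf '%s\n' "$cfg" | grep -Eq '^[[:space:]]*keepalive[[:space:]]+[0-9]+;' ||
    { echo 'missing: keepalive N; inside upstream{}'; missing=1; }
  printf '%s\n' "$cfg" | grep -Eq '^[[:space:]]*proxy_http_version[[:space:]]+1\.1;' ||
    { echo 'missing: proxy_http_version 1.1;'; missing=1; }
  printf '%s\n' "$cfg" | grep -Eq '^[[:space:]]*proxy_set_header[[:space:]]+Connection[[:space:]]+"";' ||
    { echo 'missing: proxy_set_header Connection "";'; missing=1; }
  return "$missing"
}

# Live usage: nginx -T prints the full effective configuration.
#   nginx -T 2>/dev/null | check_upstream_keepalive
```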
If slow clients are the problem
Reduce client_header_timeout and client_body_timeout from the default 60 seconds to 10-15 seconds. Enable limit_conn per source IP after defining a limit_conn_zone to cap concurrent connections from a single client. If the source is identifiable and malicious, block at the firewall rather than inside NGINX.
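Before setting a limit_conn value or a firewall rule, it helps to see how concentrated connections are per source IP. A minimal sketch, assuming input lines in the shape produced by ss -Htn, where the peer address:port is the last field; top_peers is a hypothetical helper name.

```bash
# Count connections per peer IP from ss-style lines on stdin.
# $1 = how many peers to show (default 10).
top_peers() {
  awk '{
    peer = $NF
    sub(/:[0-9]+$/, "", peer)   # strip the trailing :port
    count[peer]++
  }
  END { for (p in count) print count[p], p }' | sort -rn | head -n "${1:-10}"
}

# Live usage (-H suppresses the header line):
#   ss -Htn state established '( sport = :80 or sport = :443 )' | top_peers 5
```

If clients reach NGINX through a load balancer or CDN, the peer address is the proxy rather than the real client, so the counts need interpreting accordingly.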
If you are genuinely at capacity
Increase worker_connections to provide headroom, but account for memory: every slot allocates buffers. Ensure worker_rlimit_nofile is at least twice the per-worker worker_connections to accommodate client and upstream FDs, log files, and temp files. Check that systemd or container runtime limits are not overriding worker_rlimit_nofile. Enable reuseport on Linux to distribute connections evenly across workers and reduce accept contention.
If the kernel queue is overflowing
Raise net.core.somaxconn to at least match the backlog parameter in your listen directives. The effective backlog is the minimum of the two. On modern Linux kernels the default somaxconn is 4096, but older systems may default to 128. The kernel silently caps the listen backlog at somaxconn, and NGINX logs no error when this happens.
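The minimum-of-the-two rule is easy to check by hand. A small sketch, where effective_backlog is a hypothetical helper; 511 is the default backlog NGINX requests on Linux when the listen directive does not set one.

```bash
# Effective listen queue length: min(requested backlog, somaxconn).
# $1 = backlog requested by the listen directive, $2 = net.core.somaxconn.
effective_backlog() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

effective_backlog 511 128    # prints 128: somaxconn is the cap
effective_backlog 511 4096   # prints 511: the listen backlog is the cap

# Live usage:
#   effective_backlog 511 "$(sysctl -n net.core.somaxconn)"
```

The Send-Q column of ss -tln for a listening socket shows the value actually in effect, which is the simplest way to confirm the result.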
Prevention
Size worker_connections for at least three to five times your average peak active connections, then double it again if you are proxying traffic to account for the two-connections-per-request multiplier. Monitor the accepts - handled gap as a leading indicator; it grows before users complain. Monitor TcpExtListenOverflows at the OS level because NGINX cannot see kernel drops.
Set worker_rlimit_nofile generously, typically to at least 65,536 for production proxies. Set worker_shutdown_timeout to prevent old workers from lingering indefinitely after reloads, which artificially inflates connection counts. Tune keepalive_timeout and keepalive_requests to match your traffic patterns rather than leaving them at defaults. If you run in Kubernetes or another environment with frequent reloads, monitor worker process count to detect old worker accumulation.
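Old worker accumulation can be spotted from the process list, because a worker that is draining after a reload changes its process title to "nginx: worker process is shutting down". A minimal sketch, with count_draining_workers as a hypothetical helper that reads ps output on stdin:

```bash
# Count NGINX workers that are still draining after a reload.
count_draining_workers() {
  awk '/nginx: worker process is shutting down/ { n++ } END { print n + 0 }'
}

# Live usage: etime shows how long each process has been running.
#   ps -eo pid,etime,args | count_draining_workers
```

A count that stays above zero long after a reload suggests long-lived connections are pinning old workers, which is the case worker_shutdown_timeout is meant to bound.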
How Netdata helps
- Correlates NGINX active connections with per-worker FD usage and system-wide socket metrics to distinguish slot exhaustion from FD exhaustion from kernel queue overflow.
- Tracks the accepts - handled gap rate from stub_status to flag admission loss before active connections flatline.
- Monitors TcpExtListenOverflows and listen queue depth alongside NGINX metrics to catch kernel drops.
- Breaks down Reading, Writing, and Waiting states to distinguish upstream slowdown, keepalive pileup, and slow client attacks.
- Correlates upstream response time percentiles with NGINX connection state changes to expose backend cascade failures.







