NGINX connection exhaustion: detection, diagnosis, and prevention

Users see connection timeouts while load balancer health checks and the NGINX stub_status endpoint still return HTTP 200. New connections are silently dropped. Connection exhaustion is a cliff-edge failure: once the limit is hit, there is no graceful degradation. Connections are refused at the kernel level, or accepted into the TCP backlog but discarded by NGINX because no worker has a free connection slot.

The capacity boundary is worker_connections multiplied by worker_processes. In proxy mode, every request consumes at least two slots: one for the client and one for the upstream. With the default worker_connections of 512, four worker processes can handle at most 2,048 connections, or roughly 1,000 concurrent proxied requests. The gap between accepts and handled in stub_status is the leading indicator; it grows before the active connection count flatlines.
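The arithmetic above can be written as a small shell helper. This is a minimal sketch: `nginx_capacity` is a hypothetical name, not an NGINX tool, and the third argument (slots per request) assumes 2 for proxied traffic and 1 for content served directly.

```shell
#!/bin/sh
# Hypothetical helper: theoretical connection capacity of one NGINX instance.
# $1 = worker_processes, $2 = worker_connections,
# $3 = slots consumed per request (assumed 2 when proxying: client + upstream).
nginx_capacity() {
  workers=$1
  conns=$2
  slots_per_request=${3:-2}
  total=$((workers * conns))
  echo "max_connections=$total max_concurrent_requests=$((total / slots_per_request))"
}
```

With the defaults from the paragraph above, `nginx_capacity 4 512` prints `max_connections=2048 max_concurrent_requests=1024`. Treat the second figure as an upper bound, since a proxied request can hold more than two slots.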

Exhaustion can be triggered by slow upstream backends holding connections in the Writing state, keepalive clients piling up in the Waiting state, slowloris-style attacks stuck in Reading, or a simple capacity mismatch.

What this means

NGINX allocates a fixed pool of connection structures per worker. worker_connections sets the pool size; the default is 512. Every open connection, including idle keepalive connections and connections to upstreams, consumes one slot. When a worker’s pool is full, it cannot pull new connections from the kernel listen queue. If the kernel queue is also full, the kernel drops the SYN. If the queue has room, the TCP handshake completes but NGINX cannot allocate a connection structure, so accepts increments while handled does not.

There are two invisible drop points: the kernel accept queue (TcpExtListenOverflows) and the NGINX connection slot limit (accepts - handled gap). Both produce the same client symptom: timeouts. Kernel drops leave zero evidence in NGINX logs.

File descriptor exhaustion is a separate ceiling. Every client connection, upstream connection, log file, and temp file consumes an FD. The effective limit is the lower of worker_rlimit_nofile and the system ulimit. If FDs exhaust before connection slots, you see accept4() failed (24: Too many open files) in the error log.

flowchart TD
    A[Connection exhaustion] --> B{Which state dominates?}
    B -->|Waiting high| C[Keepalive pileup]
    B -->|Writing high| D[Slow upstream pileup]
    B -->|Reading high| E[Slow client attack]
    C --> F[Slots held idle]
    D --> G[Slots blocked on backend]
    E --> H[Slots blocked on client]
    F --> I[accepts > handled]
    G --> I
    H --> I
    I --> J[Cliff-edge drops]
    K[Listen queue full] --> J

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Slow upstream backends | Writing state dominates; upstream_response_time high; 502/504 emerging | Error log for upstream timed out; check backend latency directly |
| Keepalive timeout too long | Waiting state dominates; high active connections with low throughput | keepalive_timeout and keepalive_requests settings |
| worker_connections or FD limit too low | accepts > handled gap growing; active connections plateau at a hard ceiling | worker_connections * worker_processes vs peak load; /proc/$pid/limits for FDs |
| Kernel accept queue overflow | TcpExtListenOverflows increasing; zero entries in nginx error log | ss -tlnp Recv-Q vs Send-Q; net.core.somaxconn |
| Slowloris / slow clients | Reading state dominates; low CPU; low request rate | client_header_timeout; per-source IP connection concentration |

Quick checks

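Run these in order. They are all read-only. They assume stub_status is exposed at http://127.0.0.1/nginx_status and the master PID file is /var/run/nginx.pid; adjust both for your setup.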

# Check active connections vs theoretical capacity
curl -s http://127.0.0.1/nginx_status

# Calculate the accepts-handled gap (dropped connections)
curl -s http://127.0.0.1/nginx_status | awk '/^[[:space:]]*[0-9]/ {print "gap=" ($1 - $2); exit}'

# Check which connection state dominates
curl -s http://127.0.0.1/nginx_status | awk '/Reading/ {print "R:"$2, "W:"$4, "Wait:"$6}'

# Count running workers to verify capacity math
pgrep -c -P $(cat /var/run/nginx.pid)

# Check file descriptor usage per worker against limits
for pid in $(pgrep -P $(cat /var/run/nginx.pid)); do
  used=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
  max=$(awk '/^Max open files/ {print $4}' /proc/$pid/limits)
  echo "Worker $pid: $used / $max FDs"
done

# Check kernel listen queue depth for nginx ports
ss -tlnp '( sport = :80 or sport = :443 )' | awk 'NR>1 {print "Recv-Q:"$2, "Send-Q:"$3, $4}'

# Check kernel-level listen overflows (silent drops)
awk '/^TcpExt: / {
  h=$0; getline; v=$0
  n=split(h,ha); split(v,va)
  for(i=1;i<=n;i++) if(ha[i]=="ListenOverflows") print va[i]
}' /proc/net/netstat

How to diagnose it

  1. Confirm active connections against maximum capacity. Calculate the ratio of active connections to worker_connections * worker_processes. Above 80% is the danger zone. At 95% with a growing accepts - handled gap, exhaustion is already causing admission loss.

  2. Check for active admission loss. Evaluate the accepts - handled delta as a rate, not an absolute. A nonzero gap that is stable means drops happened in the past; a growing gap means drops are happening now. Simultaneously check TcpExtListenOverflows. If it is increasing, the kernel is dropping connections before NGINX sees them.

  3. Identify the dominant connection state. If Writing dominates, the bottleneck is upstream slowness. If Waiting dominates, keepalive connections are consuming slots without doing work. If Reading dominates, clients are sending data slowly or not at all. Normal proxy traffic usually shows Writing at 40-60%, Waiting at 30-50%, and Reading below 10%.

  4. Verify file descriptors are not the real limit. Compare FD count per worker against /proc/$pid/limits. If FD usage is above 80%, the ceiling is worker_rlimit_nofile or the OS ulimit, not worker_connections. Raising worker_connections without raising the FD limit will not help.

  5. Correlate with upstream health if Writing is high. Check $upstream_response_time and $upstream_connect_time in access logs. If $upstream_response_time accounts for most of $request_time, the backend is the bottleneck. If $upstream_connect_time is elevated, or is consistently nonzero even though upstream keepalive is configured, connections are not being reused and the upstream connection pool is ineffective.

  6. Check keepalive configuration if Waiting is high. Idle keepalive connections still consume slots. Verify whether keepalive_timeout is unnecessarily long and whether keepalive_requests is unbounded. Confirm upstream keepalive is configured correctly: an upstream{} block, keepalive N; inside it, and proxy_http_version 1.1; with proxy_set_header Connection ""; in the location block.
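Steps 1 to 3 can be sketched as a single pass over stub_status output. This is an illustrative helper, not part of NGINX: `summarize_stub_status` is a hypothetical name, it reads the stub_status text on stdin, and the capacity argument is whatever you computed as worker_connections * worker_processes.

```shell
#!/bin/sh
# Hypothetical helper: summarize stub_status text read from stdin.
# $1 = total slot capacity (worker_connections * worker_processes).
summarize_stub_status() {
  awk -v capacity="$1" '
    # "Active connections: N"
    $1 == "Active" { active = $3 + 0 }
    # Counters line: three integers (accepts handled requests).
    NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { gap = $1 - $2 }
    # "Reading: R Writing: W Waiting: X" -> pick the largest state.
    $1 == "Reading:" {
      state = "Reading"; top = $2 + 0
      if ($4 + 0 > top) { state = "Writing"; top = $4 + 0 }
      if ($6 + 0 > top) { state = "Waiting"; top = $6 + 0 }
    }
    END {
      if (capacity > 0)
        printf "utilization=%d%% gap=%d dominant=%s\n", active * 100 / capacity, gap, state
    }'
}
```

A typical call would be `curl -s http://127.0.0.1/nginx_status | summarize_stub_status 2048`. The gap printed here is the absolute value since the last restart; step 2 still requires comparing two samples over time.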

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Connection slot utilization | Primary saturation metric | Active / (worker_connections * worker_processes) > 80% |
| Dropped connections (accepts - handled) | Leading indicator of NGINX-level admission loss | Rate of gap increase > 0 sustained for > 60 seconds |
| Connection state breakdown | Reveals the nature of the load | Writing > 50% with low throughput; Reading > 20% sustained |
| File descriptor usage per worker | Hard ceiling independent of connection slots | > 80% of per-process limit |
| Kernel listen overflows (TcpExtListenOverflows) | Silent kernel-level drops invisible to NGINX logs | Any increasing rate |
| Upstream response time | Backend slowness is the most common exhaustion trigger | P95 trending up > 2x baseline or approaching proxy_read_timeout |
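The dropped-connections signal is a rate, so it takes two samples to evaluate. The sketch below shows one way to do that; `read_counters`, `gap_growth`, and `sample_gap` are hypothetical helper names, and the 60-second default interval simply mirrors the warning threshold in the table.

```shell
#!/bin/sh
# Print "accepts handled" from stub_status text on stdin.
read_counters() {
  awk 'NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $1, $2; exit }'
}

# Growth of the accepts - handled gap between two samples.
# $1 $2 = accepts/handled at t0, $3 $4 = accepts/handled at t1.
gap_growth() {
  echo $(( ($3 - $4) - ($1 - $2) ))
}

# Poll a stub_status URL twice, $2 seconds apart (default 60).
sample_gap() {
  url=$1
  interval=${2:-60}
  first=$(curl -s "$url" | read_counters)
  sleep "$interval"
  second=$(curl -s "$url" | read_counters)
  if [ -z "$first" ] || [ -z "$second" ]; then
    echo "could not read stub_status from $url" >&2
    return 1
  fi
  # Unquoted on purpose: each sample expands to two arguments.
  gap_growth $first $second
}
```

Calling `sample_gap http://127.0.0.1/nginx_status 60` prints how many connections were accepted but not handled during the interval. Any sustained value above zero means drops are happening now.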

Fixes

If slow upstream is causing the pileup

Reduce proxy_read_timeout temporarily to shed slow requests faster. This trades error rate for availability. Identify the degraded backend via $upstream_addr in access logs and remove it from rotation if possible. Scale upstream capacity. Do not simply raise worker_connections to absorb backend slowness; this delays the inevitable and can overwhelm the upstream with more concurrent load.
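To find the degraded backend, per-upstream latency can be aggregated from the access log. The sketch below assumes a hypothetical log_format in which each line carries `upstream_addr=$upstream_addr` and `upstream_response_time=$upstream_response_time` as space-separated key=value fields; NGINX's default combined format does not include them, so adapt the parsing to your own format. `slowest_upstreams` is an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: mean upstream response time per backend.
# Reads access log lines on stdin; prints "addr mean_seconds requests",
# slowest first. Assumes key=value fields as described above.
slowest_upstreams() {
  awk '
    {
      addr = ""; rt = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^upstream_addr=/) addr = substr($i, 15)
        if ($i ~ /^upstream_response_time=/) rt = substr($i, 24)
      }
      # Skip lines without an upstream, "-" placeholders, and
      # comma-separated values produced when several upstreams were tried.
      if (addr == "" || rt !~ /^[0-9]+(\.[0-9]+)?$/) next
      sum[addr] += rt
      n[addr]++
    }
    END {
      for (a in sum) printf "%s %.3f %d\n", a, sum[a] / n[a], n[a]
    }' | LC_ALL=C sort -k2,2nr
}
```

For example, `tail -n 50000 /var/log/nginx/access.log | slowest_upstreams` (the log path is an assumption) lists backends by mean response time. A mean hides tail latency, so treat this as a first pass before looking at percentiles.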

If keepalive is consuming all slots

Lower keepalive_timeout to reclaim idle connections faster. Tune keepalive_requests to force periodic connection rotation. For upstream keepalive, ensure all three prerequisites are met: the upstream{} block, the keepalive directive inside it, and the HTTP/1.1 proxy settings. Without these, NGINX opens a fresh TCP connection per request, which also causes ephemeral port exhaustion via TIME_WAIT buildup.
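The upstream keepalive prerequisites can be checked against the effective configuration. This is a rough sketch: `report` and `check_upstream_keepalive` are hypothetical helpers that only test whether each directive appears somewhere in the config text on stdin. They do not verify that the proxy settings sit in the same location as the proxy_pass that targets the upstream, and they only recognize the `Connection ""` form with double quotes.

```shell
#!/bin/sh
# Print ok/MISSING for one directive.
# $1 = label, $2 = extended regex, $3 = config text.
report() {
  if printf '%s\n' "$3" | grep -Eq "$2"; then
    echo "ok $1"
  else
    echo "MISSING $1"
    return 1
  fi
}

# Presence check for the upstream keepalive prerequisites.
# Reads configuration text on stdin; returns nonzero if any is missing.
check_upstream_keepalive() {
  conf=$(cat)
  rc=0
  report 'upstream block' \
    '^[[:space:]]*upstream[[:space:]]+[^[:space:]]+' "$conf" || rc=1
  report 'keepalive N' \
    '^[[:space:]]*keepalive[[:space:]]+[0-9]+[[:space:]]*;' "$conf" || rc=1
  report 'proxy_http_version 1.1' \
    '^[[:space:]]*proxy_http_version[[:space:]]+1\.1[[:space:]]*;' "$conf" || rc=1
  report 'proxy_set_header Connection ""' \
    '^[[:space:]]*proxy_set_header[[:space:]]+Connection[[:space:]]+""' "$conf" || rc=1
  return $rc
}
```

Feed it the full configuration dump, for example `nginx -T 2>/dev/null | check_upstream_keepalive`. A MISSING line is a prompt to inspect the config by hand, not proof of a misconfiguration.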

If slow clients are the problem

Reduce client_header_timeout and client_body_timeout from the default 60 seconds to 10-15 seconds. Enable limit_conn per source IP after defining a limit_conn_zone to cap concurrent connections from a single client. If the source is identifiable and malicious, block at the firewall rather than inside NGINX.
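Before setting a limit_conn threshold, it helps to see how connections are distributed across source addresses. The following sketch counts established connections per peer from ss output; `top_client_ips` is a hypothetical helper, and it assumes the peer address is the last column of each row, which holds for `ss -tn` without the -p flag.

```shell
#!/bin/sh
# Hypothetical helper: count connections per peer address.
# Reads `ss -tn` style output on stdin; $1 = how many rows to print (default 10).
top_client_ips() {
  awk 'NR > 1 && NF >= 4 {
         peer = $NF
         sub(/:[0-9]+$/, "", peer)   # strip the trailing :port
         count[peer]++
       }
       END { for (ip in count) print count[ip], ip }' |
    LC_ALL=C sort -k1,1nr | head -n "${1:-10}"
}
```

A possible invocation is `ss -tn state established '( sport = :80 or sport = :443 )' | top_client_ips 5`. If NGINX sits behind a load balancer or NAT, the peers shown are the intermediaries rather than the real clients, so a high count is not necessarily an attack.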

If you are genuinely at capacity

Increase worker_connections to provide headroom, but account for memory: every slot allocates buffers. Ensure worker_rlimit_nofile is at least twice the per-worker worker_connections to accommodate client and upstream FDs, log files, and temp files. Check that systemd or container runtime limits are not overriding worker_rlimit_nofile. Enable reuseport on Linux to distribute connections evenly across workers and reduce accept contention.

If the kernel queue is overflowing

Raise net.core.somaxconn to at least match the backlog parameter in your listen directives; the effective backlog is the minimum of the two. On Linux, NGINX requests a backlog of 511 unless the listen directive sets one explicitly. Kernels from 5.4 onward default somaxconn to 4096, while older kernels default to 128. The kernel, not NGINX, silently caps the requested backlog at somaxconn, so no error is logged when the cap applies.
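The minimum rule is easy to check by hand. A minimal sketch, with `effective_backlog` as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: effective listen backlog.
# $1 = net.core.somaxconn, $2 = backlog requested by the listen directive.
effective_backlog() {
  somaxconn=$1
  backlog=$2
  if [ "$backlog" -lt "$somaxconn" ]; then
    echo "$backlog"
  else
    echo "$somaxconn"
  fi
}
```

On a live host, `effective_backlog "$(cat /proc/sys/net/core/somaxconn)" 511` uses 511 on the assumption that your listen directives do not set backlog explicitly. For listening sockets, the Send-Q column of `ss -tln` reports the backlog actually in effect, which is a useful cross-check.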

Prevention

Size worker_connections for three to five times your typical peak active connections, then double that figure if you are proxying, to account for the two slots each proxied request consumes. Monitor the accepts - handled gap as a leading indicator; it grows before users complain. Monitor TcpExtListenOverflows at the OS level because NGINX cannot see kernel drops.

Set worker_rlimit_nofile generously, typically to at least 65,536 for production proxies. Set worker_shutdown_timeout to prevent old workers from lingering indefinitely after reloads, which artificially inflates connection counts. Tune keepalive_timeout and keepalive_requests to match your traffic patterns rather than leaving them at defaults. If you run in Kubernetes or another environment with frequent reloads, monitor worker process count to detect old worker accumulation.
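Old workers that linger after a reload can be spotted by their process title, which NGINX changes to "worker process is shutting down". The sketch below lists them from ps output; `old_workers` is a hypothetical helper that expects the PID and elapsed time in the first two columns.

```shell
#!/bin/sh
# Hypothetical helper: list NGINX workers still draining after a reload.
# Reads `ps -eo pid,etime,args` style output on stdin.
# Prints "pid etime" per old worker, then a total.
old_workers() {
  awk '/nginx: worker process is shutting down/ { print $1, $2; n++ }
       END { print "old_workers=" (n + 0) }'
}
```

Run it as `ps -eo pid,etime,args | old_workers`. A count that keeps rising across reloads suggests long-lived connections are pinning old workers, which is the case worker_shutdown_timeout is meant to bound.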

How Netdata helps

  • Correlates NGINX active connections with per-worker FD usage and system-wide socket metrics to tell slot exhaustion, FD exhaustion, and kernel queue overflow apart.
  • Tracks the accepts - handled gap rate from stub_status to flag admission loss before active connections flatline.
  • Monitors TcpExtListenOverflows and listen queue depth alongside NGINX metrics to catch kernel drops.
  • Breaks down Reading, Writing, and Waiting states to distinguish upstream slowdown, keepalive pileup, and slow client attacks.
  • Correlates upstream response time percentiles with NGINX connection state changes to expose backend cascade failures.