NGINX dropped connections: the accepts vs handled gap
Users report intermittent connection timeouts. Your HTTP 5xx rate is flat. The error log is quiet. Something is dropping traffic before it ever becomes a request.
When the stub_status module is enabled, its status page exposes two cumulative counters: accepts and handled. When accepts grows faster than handled, NGINX is taking connections from the kernel and then discarding them. This gap is a leading indicator of connection-slot or file-descriptor exhaustion, and it often starts increasing minutes before the system hits the hard wall.
Because the counters are cumulative since process start, a static nonzero difference only proves a past event. Watch the rate of change.
What this means
In stub_status output, the first two numbers on the data line are cumulative counters:
accepts – total connections accepted from the kernel listen queue.
handled – total connections that NGINX actually processed.
In a healthy instance, the two counters increase in lockstep. When accepts exceeds handled, the difference is the number of dropped connections: the kernel completed the TCP handshake and NGINX accepted the connection, but NGINX could not allocate the resources to process it, so the connection was closed without serving a request.
The gap is not a single-point metric. Both counters are monotonic since startup and survive configuration reloads. A nonzero absolute value only tells you that drops happened at some point. What matters is whether the gap is actively growing during a measurement window: if it increases at all over 60 seconds, connections are being dropped now.
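A minimal sketch of the two-sample check, assuming the stub_status text arrives on stdin (for example piped from curl against your status URL). The helper names stub_gap and gap_growth are hypothetical, and the counter values in the samples are made up for illustration.

```bash
# Hypothetical helper: print accepts - handled for one stub_status sample on stdin.
# The counter line is the only one that starts with (optional spaces and) a digit.
stub_gap() {
  awk '/^[[:space:]]*[0-9]/ {print $1 - $2; exit}'
}

# Hypothetical helper: connections dropped between two gap readings ($1 earlier, $2 later).
gap_growth() {
  echo $(( $2 - $1 ))
}

# Made-up samples taken 60 seconds apart (the counters are cumulative since start)
g1=$(printf 'Active connections: 291\nserver accepts handled requests\n 16630948 16630940 31070465\n' | stub_gap)
g2=$(printf 'Active connections: 300\nserver accepts handled requests\n 16631500 16631480 31071000\n' | stub_gap)
echo "gap before=$g1 after=$g2 dropped_in_window=$(gap_growth "$g1" "$g2")"
```

A nonzero dropped_in_window value is the signal; the absolute gap on its own (8 in the first sample) only proves that drops happened at some point since startup.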
The usual triggers are resource ceilings: the per-worker worker_connections limit or the process file-descriptor limit (worker_rlimit_nofile or the system ulimit). The gap grows because workers cannot create connection structures or cannot open new file descriptors. The kernel listen backlog is a related ceiling with the same client-side symptom, but connections it drops never reach accept(), so they do not appear in either counter.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Connection slot exhaustion | Active connections plateau near worker_connections x worker_processes. Waiting (keepalive) or Writing (slow upstream) states dominate. | stub_status active connections against configured maximum. |
| File-descriptor exhaustion | Active connections are below the slot limit, but accept4() failed (24: Too many open files) appears in the error log. FD count per worker is pinned near its limit. | /proc/<pid>/fd count against /proc/<pid>/limits for each worker. |
| Kernel listen-backlog overflow | stub_status shows no stress, but clients see connection timeouts. NGINX logs show no evidence because drops happen before accept(). | TcpExtListenOverflows or ss -tlnp Recv-Q approaching Send-Q. |
| Worker restart or reload | Brief gap spike that stabilizes within seconds. Correlates with a reload event in the error log. | Error log for reconfiguring notices around the same timestamp. |
Quick checks
These checks are read-only and safe to run during an incident.
```bash
# Run as root: nginx -T, /proc/<pid>/fd, and prlimit on worker processes usually need it.
# Adjust the status URL and PID file path to match your configuration.
status_url=http://127.0.0.1/nginx_status
master=$(cat /var/run/nginx.pid)

# 1. Check the current gap, then again 60 seconds later to see whether it is growing
gap() { curl -s "$status_url" | awk '/^[[:space:]]*[0-9]/ {print $1 - $2; exit}'; }
g1=$(gap); sleep 60; g2=$(gap)
echo "gap: $g1 -> $g2 (dropped in window: $((g2 - g1)))"

# 2. Check active connections vs theoretical maximum
active=$(curl -s "$status_url" | awk '/Active connections/ {print $3}')
workers=$(pgrep -a -P "$master" | grep -c 'nginx: worker')
# Match the directive itself, not commented-out copies; 512 is the NGINX default
wconn=$(nginx -T 2>/dev/null | awk '$1 == "worker_connections" {gsub(";", "", $2); print $2; exit}')
wconn=${wconn:-512}
echo "Active: $active / Max: $((workers * wconn))"

# 3. Check file-descriptor usage per worker
pgrep -a -P "$master" | awk '/nginx: worker/ {print $1}' | while read -r pid; do
  used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  echo "Worker $pid: $used FDs"
done

# 4. Check the FD limit enforced on one worker
prlimit --nofile --pid "$(pgrep -a -P "$master" | awk '/nginx: worker/ {print $1; exit}')" 2>/dev/null

# 5. Check kernel-level listen-queue overflows (cumulative; compare two samples)
nstat -az TcpExtListenOverflows 2>/dev/null | awk '/ListenOverflows/ {print $2}'

# 6. Check backlog depth on listening sockets (Recv-Q = queued now, Send-Q = backlog)
ss -tlnp | awk 'NR > 1 && $4 ~ /:(80|443)$/ {print "Recv-Q:" $2, "Send-Q:" $3, $4}'
```
How to diagnose it
```mermaid
flowchart TD
A[accepts - handled gap growing] --> B{Active connections near limit?}
B -->|Yes| C[Connection slot exhaustion]
B -->|No| D{FD usage near process limit?}
D -->|Yes| E[File descriptor exhaustion]
D -->|No| F{TcpExtListenOverflows increasing?}
F -->|Yes| G[Kernel listen backlog overflow]
F -->|No| H[Worker restart or transient reload]
```

- Establish that the gap is live. Take two stub_status samples 30 to 60 seconds apart. If (accepts - handled) increases between samples, drops are occurring now. A static gap is historical noise.
- Check connection slot utilization. Divide active connections by worker_connections x worker_processes. If the ratio is above 0.8 and climbing, the worker event loops are running out of connection structures. For reverse-proxy traffic, remember the proxy multiplier: each proxied request consumes at least two slots, one client-facing and one upstream.
- Check file-descriptor saturation. Even if slots are available, the OS may refuse to allocate new FDs. Count open FDs per worker against the limit in /proc/<pid>/limits. If usage is above 90 percent, FD exhaustion is the bottleneck. The error log may contain "accept4() failed (24: Too many open files)".
- Check the kernel listen queue. If neither slots nor FDs are saturated, look at the kernel level. Run nstat -az TcpExtListenOverflows. An increasing counter means the kernel is dropping connections because the listen queue is full, before NGINX ever sees them. Use ss -tlnp to compare Recv-Q (currently queued) against Send-Q (the configured backlog).
- Correlate with the error log and process state. Search the error log for "accept4() failed", "too many open files", or "worker_connections are not enough". Verify that the worker count matches the configured worker_processes; missing workers concentrate load and accelerate slot exhaustion on the remaining processes.
- Determine the root cause category. Slot exhaustion points to insufficient capacity or slow upstreams. FD exhaustion points to worker_rlimit_nofile or the system ulimit being too low, or to an FD leak. Kernel overflow points to a low net.core.somaxconn or to worker CPU saturation preventing accept().
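The decision tree above can be condensed into a small classifier. This is a sketch under stated assumptions: you have already collected the five inputs (for example with the quick checks), and the 0.8 slot ratio and 90 percent FD thresholds are the ones used in the steps, not values NGINX itself defines. The function name classify_drop_cause is hypothetical.

```bash
# Hypothetical classifier that walks the decision tree in order.
# Arguments: active connections, max slots (worker_connections x worker_processes),
# FDs used by the busiest worker, that worker's FD limit,
# and the TcpExtListenOverflows increase over the same window.
classify_drop_cause() {
  local active=$1 max_slots=$2 fd_used=$3 fd_limit=$4 overflow_delta=$5
  # active / max_slots > 0.8, rewritten as integer math: active * 10 > max_slots * 8
  if [ $((active * 10)) -gt $((max_slots * 8)) ]; then
    echo "connection slot exhaustion"
  # fd_used / fd_limit > 0.9, rewritten as: fd_used * 10 > fd_limit * 9
  elif [ $((fd_used * 10)) -gt $((fd_limit * 9)) ]; then
    echo "file descriptor exhaustion"
  elif [ "$overflow_delta" -gt 0 ]; then
    echo "kernel listen backlog overflow"
  else
    echo "worker restart or transient reload"
  fi
}

classify_drop_cause 3900 4096 1200 65535 0   # slots at about 95 percent
classify_drop_cause 900 4096 1000 1024 0     # FDs at about 98 percent of the limit
classify_drop_cause 900 4096 1200 65535 37   # kernel queue overflowing
```

Multiplying both sides instead of dividing keeps the comparisons in integer shell arithmetic, so the sketch also runs under plain sh.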
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Dropped connection rate (accepts - handled delta) | The only NGINX-native signal that proves connections are being discarded after TCP handshake. | Gap increasing over any 60-second window while request rate is above zero. |
| Connection slot utilization | Reveals whether the event loop is near its hard ceiling. | Active / (worker_connections x worker_processes) > 0.8. |
| FD usage per worker | FD limits often bind before connection slots do, especially with keepalive and log files. | Used FDs / limit > 0.8, or any accept4() failed (24) in error log. |
| TcpExtListenOverflows | Kernel drops are invisible to NGINX logs. This OS counter is the only evidence. | Counter increasing over a 60-second window. |
| Active connection state breakdown | Distinguishes slot exhaustion caused by slow upstreams (Writing) from keepalive waste (Waiting). | Writing > 50% of active with low throughput, or Waiting consuming > 80% of capacity. |
| Error log severity rate | Resource-limit errors may not produce HTTP status codes but still log at alert, crit, or error level. | Any "too many open files", "worker_connections are not enough", or "could not allocate node" entries. |
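The state-breakdown row can be checked from a single stub_status sample. This sketch assumes the standard stub_status layout (an "Active connections:" line and a "Reading: Writing: Waiting:" line) and takes the slot capacity as its only argument. The function name state_warnings is hypothetical, it does not evaluate the "low throughput" half of the Writing condition (that needs request-rate data from a second sample), and the sample numbers are illustrative.

```bash
# Hypothetical helper: read one stub_status sample on stdin and apply the
# state-breakdown thresholds from the table.
# $1 is the slot capacity (worker_connections x worker_processes).
state_warnings() {
  awk -v max="$1" '
    /^Active connections/ { active = $3 }
    /^Reading/            { writing = $4; waiting = $6 }
    END {
      printf "writing_pct_of_active=%.0f waiting_pct_of_capacity=%.0f\n",
             (active ? 100 * writing / active : 0), (max ? 100 * waiting / max : 0)
      if (active && writing / active > 0.5) print "WARN: Writing dominates (check upstream latency)"
      if (max && waiting / max > 0.8)       print "WARN: Waiting (keepalive) consumes most capacity"
    }'
}

printf 'Active connections: 291\nserver accepts handled requests\n 16630948 16630948 31070465\nReading: 6 Writing: 179 Waiting: 106\n' | state_warnings 1024
```

Against a live instance, pipe curl -s http://127.0.0.1/nginx_status into state_warnings and pass your own worker_connections x worker_processes as the argument.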
Fixes
Fix choice depends on which boundary you hit.
Connection slots exhausted
If active connections are near worker_connections x worker_processes, you have three levers.
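Increase capacity. Raise worker_connections in the events block and reload. For a reverse proxy, set the value to at least 2x your peak concurrent proxied requests plus keepalive headroom, and raise the FD limit with it; NGINX logs a "worker_connections exceed open file resource limit" warning when slots outnumber available file descriptors. Monitor memory: each connection allocates buffers, so doubling slots increases RSS.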
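The sizing rule in the paragraph above is simple arithmetic, sketched here as a hypothetical helper. It assumes you know your peak concurrent proxied requests and how many idle keepalive connections you want to allow for. Spreading the total evenly across workers is also an assumption, since real load is rarely balanced perfectly, so treat the per-worker number as a floor rather than a target.

```bash
# Hypothetical sizing helper for a reverse proxy.
# Arguments: peak concurrent proxied requests, keepalive headroom (idle connections
# to allow for), and the number of worker processes.
size_worker_connections() {
  local peak=$1 keepalive_headroom=$2 workers=$3
  # At least two slots per proxied request: one client-facing, one upstream
  local total=$((2 * peak + keepalive_headroom))
  # worker_connections is per worker: divide and round up
  local per_worker=$(( (total + workers - 1) / workers ))
  echo "total_slots=$total per_worker=$per_worker"
}

size_worker_connections 5000 2000 4   # prints total_slots=12000 per_worker=3000
```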
Reclaim idle capacity. If Waiting connections dominate active connections, reduce keepalive_timeout so idle keepalive connections close sooner; lowering keepalive_requests also recycles client connections after fewer requests. This is a fast config change with no memory cost.
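Before lowering keepalive_timeout or keepalive_requests, it helps to see which values are actually in effect and in which file they are set. A sketch that filters the nginx -T dump, assuming the "# configuration file <path>:" marker lines that nginx -T prints before each included file. The function name keepalive_settings is hypothetical; it skips commented-out lines but does not work out which server or location block wins.

```bash
# Hypothetical helper: list keepalive_timeout and keepalive_requests directives
# from an `nginx -T` dump on stdin, prefixed with the file they appear in.
keepalive_settings() {
  awk '
    /^# configuration file / { file = $4; sub(/:$/, "", file); next }
    $1 == "keepalive_timeout" || $1 == "keepalive_requests" {
      sub(/;.*/, "")          # drop the semicolon and any trailing comment
      $1 = $1                 # rebuild the line to normalize whitespace
      print file ": " $0
    }'
}
```

Typical use is nginx -T 2>/dev/null | keepalive_settings, run as a user that can read the whole configuration.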
Fix the upstream bottleneck. If Writing connections dominate, the backend is too slow and each slow request holds its slots open. Lowering proxy_read_timeout frees those slots sooner, but only do it if you can tolerate more 504 errors; otherwise add backend capacity.
File descriptors exhausted
Check the actual process limit with prlimit or /proc/<pid>/limits, not just the NGINX config.
Raise the worker FD ceiling. Set worker_rlimit_nofile to a value comfortably above worker_connections plus overhead for log files, upstream sockets, and temp files, then reload. systemd LimitNOFILE or container runtime limits may still override what the workers get, so verify the limit on the new worker processes after the reload.
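One way to verify the limit the workers actually received is to read it back from /proc/<pid>/limits. A sketch with two hypothetical helpers: soft_nofile extracts the soft "Max open files" value (the one accept() runs into), and check_nofile compares it with the 2x worker_connections rule of thumb from the Prevention section.

```bash
# Hypothetical helper: read a /proc/<pid>/limits file on stdin and print the
# soft "Max open files" limit (fields: Max open files <soft> <hard> files).
soft_nofile() {
  awk '/^Max open files/ {print $4}'
}

# Hypothetical check: is the soft limit at least 2x worker_connections?
# Arguments: soft nofile limit, worker_connections.
check_nofile() {
  local limit=$1 wconn=$2
  if [ "$limit" -ge $((2 * wconn)) ]; then
    echo "ok: nofile=$limit >= 2 x worker_connections ($wconn)"
  else
    echo "low: nofile=$limit < 2 x worker_connections ($wconn)"
  fi
}
```

For each worker PID, run soft_nofile < /proc/<pid>/limits and pass the result together with your worker_connections value to check_nofile.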
Find a leak. If FD usage grows while active connections stay flat, inspect /proc/<pid>/fd for repeated patterns. Leaked upstream connections, excessive temp files, or per-vhost log proliferation are common culprits.
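Repeated patterns are easier to spot with counts than with a raw listing. This sketch groups the targets of the FD symlinks in a directory such as /proc/<pid>/fd and collapses "socket:[inode]"-style entries into their type. The function name fd_summary is hypothetical, and reading another user's /proc/<pid>/fd generally requires root.

```bash
# Hypothetical helper: summarize what the FDs in a directory point at.
# $1 is an fd directory such as /proc/<pid>/fd.
fd_summary() {
  local dir=$1 f
  # Resolve each fd symlink, collapse "socket:[12345]"-style targets into
  # their type, then count identical targets, most frequent first.
  for f in "$dir"/*; do
    [ -L "$f" ] && readlink "$f"
  done | sed -E 's/^(socket|pipe|anon_inode):.*/\1/' | sort | uniq -c | sort -rn
}
```

Running fd_summary /proc/<pid>/fd twice a few minutes apart makes a leak visible: a count that keeps climbing while active connections stay flat is the candidate.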
Kernel listen queue overflowing
The effective backlog is the lesser of net.core.somaxconn and the backlog parameter on the listen directive.
Raise the kernel limit. Increase net.core.somaxconn to at least 4096 on modern kernels.
Raise the socket backlog. Add backlog=4096 to the listen directive and reload.
Speed up accept capacity. If the queue fills despite a deep backlog, workers are not calling accept() fast enough. Check worker CPU saturation or event-loop blocking from synchronous disk I/O.
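The first rule of this subsection, that the effective backlog is the lesser of two values, can be sketched as a hypothetical helper. It assumes the NGINX default backlog of 511 (the documented default on Linux) when the listen directive has no backlog= parameter.

```bash
# Hypothetical helper: the accept-queue depth the kernel actually enforces.
# Arguments: net.core.somaxconn, and the listen backlog= value (511 if unset).
effective_backlog() {
  local somaxconn=$1 backlog=${2:-511}
  echo $(( somaxconn < backlog ? somaxconn : backlog ))
}

effective_backlog 4096        # no backlog= on listen: prints 511
effective_backlog 4096 8192   # capped by somaxconn: prints 4096
```

On a live host, effective_backlog "$(cat /proc/sys/net/core/somaxconn)" 4096 shows what backlog=4096 would actually give you. For a socket that is already listening, ss -tln reports the effective value in the Send-Q column.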
Prevention
- Monitor the rate, not the absolute. Alert when (accepts - handled) increases over a rolling 60-second window, not when the cumulative counter is nonzero.
- Size for the proxy multiplier. In reverse-proxy mode, each request needs at least two connections. Size worker_connections at 2x peak proxied concurrency.
- Set generous FD limits. worker_rlimit_nofile should be at least 2x worker_connections. Verify the systemd or container limit independently.
- Watch the kernel queue. Include TcpExtListenOverflows in host-level monitoring. It is the only signal that catches drops before NGINX sees them.
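The first and last rules can be combined into one per-window check. This is a sketch of the alert decision only, assuming you sample the gap (for example with the awk one-liner in the quick checks) and TcpExtListenOverflows (with nstat -az) at the start and end of each window. The function name window_alerts is hypothetical, and wiring it into cron or a monitoring agent is left out.

```bash
# Hypothetical per-window check. Arguments are cumulative values sampled at the
# start and end of one window: gap0 gap1 (accepts - handled) and ovf0 ovf1
# (TcpExtListenOverflows). Only growth inside the window raises an alert.
window_alerts() {
  local gap0=$1 gap1=$2 ovf0=$3 ovf1=$4 alerts=0
  # A lower value at the end of the window means the counters were reset
  # (full NGINX restart or host reboot), which is not treated as a drop.
  if [ "$gap1" -gt "$gap0" ]; then
    echo "ALERT: NGINX dropped $((gap1 - gap0)) accepted connections in this window"
    alerts=$((alerts + 1))
  fi
  if [ "$ovf1" -gt "$ovf0" ]; then
    echo "ALERT: kernel listen queue overflowed $((ovf1 - ovf0)) times in this window"
    alerts=$((alerts + 1))
  fi
  if [ "$alerts" -eq 0 ]; then
    echo "ok: no drops in this window"
  fi
}

window_alerts 8 20 100 100
```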
How Netdata helps
- Delta rate visualization. Netdata computes the accepts minus handled gap rate from stub_status counters, removing manual two-point sampling.
- Per-process FD charts. File-descriptor usage per NGINX worker is shown against process limits, surfacing exhaustion before errors appear.
- Kernel drop correlation. Host-level kernel listen-overflow counters are charted next to NGINX connection metrics, linking silent kernel drops to the application.
- Slot utilization alerts. Active connections are compared against worker_connections capacity, with alerts when utilization crosses 80 percent.
- State breakdown correlation. Reading, Writing, and Waiting proportions are shown alongside upstream latency and error-log rates to distinguish slot exhaustion from backend slowness.







