nginx: too many open files - diagnosing file descriptor exhaustion

After a traffic spike, the error log shows accept4() failed (24: Too many open files), then goes silent. Existing connections still serve, but new ones cannot land.

File descriptor (FD) exhaustion is a hard failure. Once the limit is hit, nginx cannot accept new connections, open upstream sockets, or write to the error log. The common default soft limit of 1024 open files per process is too low for production reverse proxies. Each proxied request consumes at least two FDs, and idle keepalive connections hold theirs until keepalive_timeout expires. The effective limit is the lower of worker_rlimit_nofile and the OS hard limit enforced by systemd or the container runtime.

What this means

Each resource a worker holds open consumes one FD from that worker's own limit: client sockets, upstream sockets, open log files, temporary files for large request or response bodies, entries held by open_file_cache, and internal event notification descriptors such as the epoll instance. In reverse proxy mode, a single request ties up two sockets simultaneously, so FD demand is at least double the active client connection count.

When a worker exhausts its allowance, accept4() fails with EMFILE (errno 24, the number shown in the log message). The kernel still completes TCP handshakes and queues connections in the listen backlog, but nginx cannot pull them into the event loop. If the backlog fills, the kernel silently drops new SYN packets. Existing connections continue to process, so the server looks partially healthy from the inside while appearing down to new clients. Because the error log file is also an FD, severe exhaustion can prevent nginx from recording further diagnostics.
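
The silent drops described above show up in the kernel's TcpExt counters. As a minimal sketch, assuming a Linux host that exposes them in /proc/net/netstat as a line of names followed by a line of values, the helper below pulls out the two listen-queue counters. The function name listen_drop_counters is hypothetical.

```shell
# Sketch: extract ListenOverflows and ListenDrops from text in the
# /proc/net/netstat layout (one "TcpExt:" line of names, then one of values).
# listen_drop_counters is a hypothetical helper name.
listen_drop_counters() {
  awk '
    # First TcpExt line: remember the counter name for each column.
    $1 == "TcpExt:" && !have_names {
      for (i = 2; i <= NF; i++) name[i] = $i
      have_names = 1
      next
    }
    # Second TcpExt line: print the values for the two listen-queue counters.
    $1 == "TcpExt:" && have_names {
      for (i = 2; i <= NF; i++)
        if (name[i] == "ListenOverflows" || name[i] == "ListenDrops")
          print name[i], $i
    }
  '
}
# Usage on a live host:
#   listen_drop_counters < /proc/net/netstat
```

Running it twice a few seconds apart and comparing the values gives the rate of increase, which matters more than the absolute number.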

flowchart TD
    A[accept4 failed 24] --> B{Check proc PID limits}
    B -->|Hard limit below config| C[Fix systemd or OS ulimit]
    B -->|Limit is high| D{Check FD consumers}
    D -->|Waiting high| E[Reduce keepalive timeout]
    D -->|Writing high| F[Check upstream latency]
    D -->|Files exceed sockets| G[Check temp and cache files]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| OS or systemd limit lower than worker_rlimit_nofile | accept4() failed (24) under moderate load; /proc/<pid>/limits shows a hard limit below the nginx config | awk '/^Max open files/ {print $4, $5}' /proc/<worker_pid>/limits (prints soft, then hard) |
| worker_rlimit_nofile set below connection demand | Active connections plateau below worker_connections but FDs are maxed | ls /proc/<worker_pid>/fd \| wc -l against worker_rlimit_nofile |
| Keepalive hoarding idle connections | High Waiting count in stub_status; FD usage climbs while throughput is flat | curl -s http://127.0.0.1/nginx_status and compare Waiting to total active |
| Missing upstream keepalive causing churn | Many upstream sockets in TIME_WAIT; high $upstream_connect_time | ss -tan state time-wait \| wc -l and access log $upstream_connect_time values |
| Open file cache or temp files consuming headroom | Static file workloads with open_file_cache; FD count exceeds socket count | ls /proc/<worker_pid>/fd and count regular files versus sockets |
| FD leak in a module or configuration | FD count grows monotonically without matching connection growth | FD count sampled every minute from /proc/<pid>/fd |
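
For the leak case, the per-minute samples can be checked mechanically. The following is a rough sketch rather than a definitive detector: it assumes one FD count per line, oldest first, and only reports whether the series never decreases and ends higher than it started. The name fd_growth_check is hypothetical.

```shell
# Sketch: flag a monotonically growing series of FD counts.
# Input: one integer per line on stdin, oldest sample first.
# fd_growth_check is a hypothetical helper name.
fd_growth_check() {
  awk '
    NR == 1 { first = $1; prev = $1; next }
    { if ($1 < prev) dipped = 1; prev = $1 }   # any decrease breaks the pattern
    END {
      if (NR < 2) { print "insufficient samples"; exit }
      if (!dipped && prev > first) print "monotonic growth: " first " -> " prev
      else print "no monotonic growth"
    }
  '
}
# Collecting samples on a live host (worker PID is a placeholder):
#   while sleep 60; do ls /proc/<worker_pid>/fd | wc -l; done >> fd_samples.txt
#   fd_growth_check < fd_samples.txt
```

Growth alone is not proof of a leak. Compare it with the active connection count over the same window before drawing conclusions.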

Quick checks

Run these read-only commands to confirm the failure and locate the bottleneck.

# Confirm FD exhaustion in the error log
grep 'accept4() failed (24' /var/log/nginx/error.log | tail -5
# FD count per worker
for pid in $(pgrep -P $(cat /var/run/nginx.pid)); do
  echo "Worker $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) FDs"
done
# Effective soft and hard limits for a worker
prlimit -n -p $(pgrep -P $(cat /var/run/nginx.pid) | head -1)
# Soft and hard limits from procfs (column 4 is soft, column 5 is hard)
awk '/^Max open files/ {print "Soft limit:", $4, "Hard limit:", $5}' \
  /proc/$(pgrep -P $(cat /var/run/nginx.pid) | head -1)/limits
# Connection state breakdown
curl -s http://127.0.0.1/nginx_status | \
  awk '/Reading/ {print "R:"$2, "W:"$4, "Wait:"$6}'
# Kernel listen queue depth and silent drops
ss -tlnp | awk 'NR>1 && /nginx/ {print $4, "Recv-Q:", $2, "Send-Q:", $3}'
# -z keeps the counter in the output even when its value is zero
nstat -az TcpExtListenOverflows 2>/dev/null | \
  awk '/ListenOverflows/ {print "Kernel drops:", $2}'
# Relevant configuration directives
nginx -T 2>/dev/null | grep -E 'worker_rlimit_nofile|worker_connections|keepalive|open_file_cache'
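
To separate sockets from regular files in a worker's descriptor table, the symlink targets under /proc/<worker_pid>/fd can be tallied by type. This is a minimal sketch that assumes the usual Linux target formats (socket:[inode], pipe:[inode], anon_inode:[name], or an absolute path). The name classify_fd_targets is hypothetical.

```shell
# Sketch: tally FD types from readlink output, one target per line on stdin.
# "files" counts every absolute path, which includes /dev/null and
# deleted temp files. classify_fd_targets is a hypothetical helper name.
classify_fd_targets() {
  awk '
    /^socket:/     { n["sockets"]++; next }
    /^pipe:/       { n["pipes"]++; next }
    /^anon_inode:/ { n["anon_inodes"]++; next }
    /^\//          { n["files"]++; next }
                   { n["other"]++ }
    END {
      printf "sockets=%d files=%d pipes=%d anon_inodes=%d other=%d\n",
        n["sockets"], n["files"], n["pipes"], n["anon_inodes"], n["other"]
    }
  '
}
# Usage (needs permission to read the worker's fd directory):
#   for fd in /proc/<worker_pid>/fd/*; do readlink "$fd"; done | classify_fd_targets
```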

How to diagnose it

  1. Confirm the error pattern. Look for accept4() failed (24: Too many open files) in the error log. If the log has gone silent under load, FD exhaustion may already be preventing new entries.
  2. Quantify per-worker FD consumption. Count entries in /proc/<worker_pid>/fd for each worker.
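  3. Identify the effective limit. Read the Max open files row in /proc/<worker_pid>/limits, which lists the soft limit and then the hard limit, and compare both with worker_rlimit_nofile. The lower value wins, and the soft limit is the one the worker is actually held to.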
  4. Classify FD consumers. Inside /proc/<pid>/fd, sockets dominate for proxy workloads. A high count of regular files points to temp files, logs, or open_file_cache pressure.
  5. Correlate with connection state. High Waiting means keepalive is hoarding FDs. High Writing with low throughput means slow upstreams are piling up active connections that each hold FDs.
  6. Check for silent kernel drops. TcpExtListenOverflows increasing, or ss showing Recv-Q near Send-Q, means connections are dropping before nginx can accept them.
  7. Verify the capacity math. For reverse proxy, worker_rlimit_nofile must cover at least two FDs per worker_connections plus log files, temp files, and cache headroom.
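
Step 5 can be scripted against the stub_status text. The sketch below assumes the standard four-line stub_status output and reports the share of active connections that are idle in Waiting, plus the accepts minus handled gap. The name stub_status_summary is hypothetical.

```shell
# Sketch: summarize stub_status output read from stdin.
# stub_status_summary is a hypothetical helper name.
stub_status_summary() {
  awk '
    /^Active connections:/       { active = $3 }
    /^ *[0-9]+ +[0-9]+ +[0-9]+/  { accepts = $1; handled = $2 }   # counters line
    /^Reading:/                  { waiting = $6 }
    END {
      pct = (active > 0) ? int(waiting * 100 / active) : 0
      printf "active=%d waiting=%d waiting_pct=%d dropped=%d\n",
        active, waiting, pct, accepts - handled
    }
  '
}
# Usage (the status URL is the one used in the quick checks):
#   curl -s http://127.0.0.1/nginx_status | stub_status_summary
```

A high waiting_pct points toward keepalive hoarding, while a low one with many active connections points toward slow upstreams.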

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| File descriptor usage per worker | Direct measure of proximity to the hard limit | >75% of limit sustained |
| Active connections vs worker_connections | Shows slot pressure; each proxy request uses two slots | >80% of worker_connections * worker_processes |
| Dropped connections (accepts - handled gap) | Confirms admission loss before total saturation | Gap increasing for >60 seconds |
| Kernel listen overflows (TcpExtListenOverflows) | Reveals silent drops invisible to nginx | Any nonzero rate of increase |
| Connection state breakdown (Reading/Writing/Waiting) | Distinguishes keepalive bloat from slow upstream | Waiting >80% of active |
| Upstream connect time | Detects keepalive pool miss and TIME_WAIT churn | Reuse rate low, connect time nonzero |
| Error log rate and content | FD exhaustion eventually kills logging itself | Emergence of accept4() failed (24) or sudden silence |
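
The first signal in the table reduces to a simple ratio. As a sketch, assuming the open FD count and the worker's soft limit have already been collected, the helper below prints the utilization and flags anything above the 75% warning level. The name fd_utilization is hypothetical.

```shell
# Sketch: FD utilization as a percentage of the limit.
# $1 = open FD count, $2 = soft limit. Flags values above 75%.
# fd_utilization is a hypothetical helper name.
fd_utilization() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    if (limit + 0 <= 0) { print "invalid limit"; exit 1 }
    pct = int(used * 100 / limit)
    status = (pct > 75) ? "WARN" : "ok"
    printf "%d%% %s\n", pct, status
  }'
}
# Usage on a live host (worker PID is a placeholder):
#   fd_utilization "$(ls /proc/<worker_pid>/fd | wc -l)" \
#     "$(awk '/^Max open files/ {print $4}' /proc/<worker_pid>/limits)"
```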

Fixes

Align the OS limit with nginx config

If systemd or the container runtime enforces a hard limit below worker_rlimit_nofile, the config value is ignored. If the master process was started under an OS hard limit below worker_rlimit_nofile, a reload cannot raise the workers above that inherited hard limit. Raise the OS or container limit, then restart nginx.
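
On a systemd host, one way to raise the limit is a drop-in that sets LimitNOFILE for the service. The sketch below only writes the file. The unit name nginx.service, the drop-in file name limits.conf, the value 65536, and the helper name write_nofile_dropin are all assumptions for illustration.

```shell
# Sketch: write a systemd drop-in that sets LimitNOFILE.
# $1 = drop-in directory, $2 = limit to apply (soft and hard).
# write_nofile_dropin and the file name limits.conf are hypothetical.
write_nofile_dropin() {
  mkdir -p "$1" &&
    printf '[Service]\nLimitNOFILE=%s\n' "$2" > "$1/limits.conf"
}
# Usage as root (unit name and value are assumptions):
#   write_nofile_dropin /etc/systemd/system/nginx.service.d 65536
#   systemctl daemon-reload && systemctl restart nginx
#   awk '/^Max open files/' /proc/<worker_pid>/limits   # confirm it took effect
```

A restart rather than a reload is used here because the new limit has to be inherited by a freshly started master process.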

Raise worker_rlimit_nofile

Set worker_rlimit_nofile to at least double worker_connections, plus headroom for log files, temp files, and the open file cache. For a reverse proxy, each connection slot can require two FDs. Reload to apply. Tradeoff: FDs are cheap on modern systems, but the master process can only raise the limit up to the OS hard ceiling at worker spawn time.
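
The sizing rule is plain arithmetic: two FDs per connection slot plus headroom. As an illustrative sketch, the helper below prints a candidate directive from a worker_connections value and a headroom figure you choose. The name suggest_rlimit_nofile and the example numbers are hypothetical.

```shell
# Sketch: candidate worker_rlimit_nofile for a reverse proxy.
# $1 = worker_connections, $2 = headroom for logs, temp files, and cache entries.
# suggest_rlimit_nofile is a hypothetical helper name.
suggest_rlimit_nofile() {
  echo "worker_rlimit_nofile $(( $1 * 2 + $2 ));"
}
# Example: 4096 connection slots with 1024 FDs of headroom
#   suggest_rlimit_nofile 4096 1024
```

Treat the result as a floor, and check that the OS hard limit is at least as high before relying on it.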

Shed idle keepalive connections

If Waiting connections dominate active connections, reduce keepalive_timeout for client connections and verify keepalive pool sizing in upstream blocks. Reload to apply. Tradeoff: lower timeouts increase TCP and TLS handshake overhead for repeat clients, but idle connections release their FDs sooner.

Enable or tune upstream keepalive

If every proxied request opens a new upstream socket, configure keepalive inside the upstream block to reuse connections. Reuse replaces the open-and-close cycle per request with a small pool of idle connections per worker, which cuts socket churn and TIME_WAIT buildup. Tradeoff: pooled connections consume upstream server connection slots and memory.
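
Upstream connection reuse for HTTP typically needs three pieces in place: a keepalive pool in the upstream block, proxy_http_version 1.1, and an empty Connection header. The sketch below is a coarse check over a config dump, assuming those directives appear one per line. It does not verify that they apply to the same location or upstream, and newer nginx versions may not need all three set explicitly. The name check_upstream_keepalive is hypothetical.

```shell
# Sketch: coarse check for upstream keepalive prerequisites in a config dump.
# Reads nginx configuration text on stdin; matches anywhere in the dump.
# check_upstream_keepalive is a hypothetical helper name.
check_upstream_keepalive() {
  awk '
    /^[ \t]*keepalive[ \t]+[0-9]+[ \t]*;/                  { pool = 1 }
    /^[ \t]*proxy_http_version[ \t]+1\.1[ \t]*;/           { http11 = 1 }
    /^[ \t]*proxy_set_header[ \t]+Connection[ \t]+""[ \t]*;/ { conn = 1 }
    END {
      print "keepalive pool: " (pool ? "yes" : "missing")
      print "proxy_http_version 1.1: " (http11 ? "yes" : "missing")
      print "Connection header cleared: " (conn ? "yes" : "missing")
    }
  '
}
# Usage:
#   nginx -T 2>/dev/null | check_upstream_keepalive
```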

Reduce open file cache or temp file pressure

Lower open_file_cache max=, or reduce how often request and response bodies spill to client_body_temp_path and proxy_temp_path by sizing the relevant buffers to fit typical bodies. Tradeoff: more file opens for static content and more buffer memory per connection, but fewer simultaneous open file descriptors.

Emergency load shedding without restart

If you cannot restart, reload with a very low keepalive_timeout to force idle connections to close. If even a reload is too risky, block new traffic at the edge firewall or load balancer to reduce connection creation while preserving existing sessions. Tradeoff: impacts some clients but prevents total lockup.

Prevention

  • Set worker_rlimit_nofile to at least 2x worker_connections per worker, plus margin for logs and cache.
  • Verify the effective limit in /proc/<pid>/limits after every deployment, not just during config syntax checks.
  • Monitor FD utilization percentage per worker as a core saturation signal.
  • Keep upstream keepalive pools effective by logging $upstream_connect_time and targeting near-zero connect times.
  • Treat TcpExtListenOverflows as a first-class signal. It reveals exhaustion before nginx logs do.
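
The post-deployment check in the list above can be automated. As a minimal sketch, assuming text in the /proc/<pid>/limits layout on stdin, the helper below compares the soft Max open files value with an expected minimum and returns a nonzero status when it is too low or the row is missing. The name verify_nofile is hypothetical.

```shell
# Sketch: verify the soft "Max open files" limit against an expected minimum.
# $1 = expected minimum; limits text on stdin. Column 4 is the soft limit.
# verify_nofile is a hypothetical helper name.
verify_nofile() {
  awk -v want="$1" '
    /^Max open files/ {
      found = 1
      if ($4 + 0 >= want + 0) print "ok soft=" $4
      else { print "too low soft=" $4 " want=" want; bad = 1 }
    }
    END { exit (bad || !found) ? 1 : 0 }
  '
}
# Usage in a deploy script (worker PID and value are placeholders):
#   verify_nofile 9216 < /proc/<worker_pid>/limits || echo "FD limit not applied"
```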

How Netdata helps

Netdata charts per-process FD utilization for each worker against its limit without manual /proc scraping. It correlates nginx stub_status active connections with kernel TcpExtListenOverflows, and alerts on growing accepts - handled gaps and error log matches for accept4() failed (24). Connection state breakdowns distinguish keepalive bloat from upstream slowness.