NGINX old worker processes accumulating after reload: worker_shutdown_timeout
After nginx -s reload, the number of running worker processes climbs above the configured worker_processes value. Memory and file descriptor usage grow. The workers are not crashing; they are old generations waiting for long-lived connections to close.
This pattern is normal in small doses. During a reload, the master keeps old workers alive until active connections drain. Without a shutdown deadline, a single WebSocket, gRPC stream, or long-polling connection can pin an old worker indefinitely. In environments that reload frequently, such as Kubernetes ingress controllers reacting to endpoint changes, accumulation becomes a resource leak that can exhaust memory or file descriptors.
flowchart TD
A[Reload received] --> B[Master spawns new workers]
B --> C[Old workers stop accepting]
C --> D{worker_shutdown_timeout set?}
D -->|No| E[Remain until connections close]
D -->|Yes| F[Remain until timeout expires]
F --> G[Force close connections]
E --> H[Old worker exits]
G --> H
What this means
When nginx reloads, the master spawns new workers with the updated configuration. Previous workers stop accepting new connections but continue serving existing ones. In a healthy short-request workload, old workers finish within seconds and exit.
The worker_shutdown_timeout directive caps how long old workers wait. If the directive is absent, the default is an infinite grace period. Old workers wait until every last connection closes naturally. WebSockets, Server-Sent Events, gRPC streaming, and long-polling HTTP connections often persist for minutes or hours. Each reload under these conditions adds another generation of workers that may never leave.
In Kubernetes ingress deployments, endpoint changes can trigger reloads many times per hour. This leads to a steady rise in worker count, memory footprint, and open file descriptors. The symptom is simple: running worker processes exceed the configured worker_processes, and the gap widens over time.
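The split between current and draining workers described above can usually be read from process titles. The following is a minimal sketch, assuming a Linux host where nginx updates its process titles so that draining workers show as "nginx: worker process is shutting down". The function name worker_states is illustrative and not part of nginx.

```shell
# Summarize nginx workers by state from process-list lines read on stdin
# (for example the output of `ps -eo pid=,args=`).
# Assumption: draining workers carry the title
# "nginx: worker process is shutting down".
worker_states() {
  awk '
    /nginx: worker process is shutting down/ { draining++; next }
    /nginx: worker process/                  { active++ }
    END { printf "active=%d draining=%d\n", active, draining }
  '
}

# Usage (not run here):
#   ps -eo pid=,args= | worker_states
```

A draining count that stays above zero long after a reload is the accumulation this guide is about.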
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Long-lived connections blocking drain | Worker count stays above worker_processes indefinitely; specific PIDs persist across multiple reloads | ss -tnp mapped to old worker PIDs to confirm active connections |
| Frequent configuration reloads | Process count climbs steadily over hours; many reconfiguring entries in the error log | Error log reload frequency and correlation with worker spawn times |
| worker_shutdown_timeout not configured | Old workers never disappear regardless of age; memory grows monotonically | nginx.conf for the presence of worker_shutdown_timeout |
| Streaming endpoints without endpoint-specific timeouts | Old workers cluster on ports or locations serving WebSocket, SSE, or gRPC | Access log for long-running requests to those paths |
Quick checks
Run these read-only checks to confirm accumulation and identify the generation causing it.
# Count configured workers (check included configs if this returns nothing)
configured=$(grep -E '^\s*worker_processes' /etc/nginx/nginx.conf | awk '{print $2}' | tr -d ';')
echo "configured: $configured"
# Count running worker processes (excludes cache loader and manager)
# Adjust PID file path if your instance uses a different location
pgrep -a -P $(cat /var/run/nginx.pid) | grep -c 'nginx: worker process'
# Check whether worker_shutdown_timeout is configured
grep 'worker_shutdown_timeout' /etc/nginx/nginx.conf
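# Review recent reload frequency from error log (adjust path if needed)
# These entries are written at notice level, so error_log must be set to notice or a more verbose level
grep -c 'reconfiguring' /var/log/nginx/error.log
# Check active connection totals and state breakdown
# Requires stub_status to be enabled; adjust the URL to match your configuration
curl -s http://127.0.0.1/nginx_status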
# List worker processes to spot generations
pgrep -a -f 'nginx: worker process'
# Check connections associated with a specific old worker PID
# Matching on pid= avoids false hits on ports or addresses containing the same digits
ss -tnp | grep 'pid=<OLD_PID>,'
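The first two checks can be combined into a single comparison. This is a rough sketch, not a definitive tool: the function names expected_workers and worker_surplus are made up for illustration, and resolving auto through getconf is an approximation, because nginx running in a container or under CPU affinity limits may see a different core count.

```shell
# Resolve the expected worker count from a worker_processes value.
# "auto" is approximated by the online CPU count reported by getconf.
expected_workers() {
  if [ "$1" = "auto" ]; then
    getconf _NPROCESSORS_ONLN
  else
    printf '%s\n' "$1"
  fi
}

# Print how many running workers exceed the expected count (never negative).
worker_surplus() {
  local running=$1 expected=$2
  if [ "$running" -gt "$expected" ]; then
    echo $(( running - expected ))
  else
    echo 0
  fi
}

# Usage (not run here), reusing the commands from the checks above:
#   running=$(pgrep -a -P "$(cat /var/run/nginx.pid)" | grep -c 'nginx: worker process')
#   worker_surplus "$running" "$(expected_workers "$configured")"
```

A surplus that is non-zero only briefly after a reload is expected; one that persists points to stuck generations.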
How to diagnose it
1. Establish the baseline. Note the configured worker_processes. If set to auto, the expected count equals the number of CPU cores visible to nginx.
2. Count running workers. Use pgrep -a -P $(cat /var/run/nginx.pid) | grep -c 'nginx: worker process' and compare to the configured value. A sustained count above baseline confirms accumulation.
3. Identify old generations. In the process list, old workers have earlier start times than the current generation, and their process title normally reads "nginx: worker process is shutting down". They often have lower PIDs as well, but PID reuse makes that an unreliable signal. If the same PIDs survive across multiple reloads, they are stuck.
4. Map stuck workers to open connections. For each old worker PID, run ss -tnp | grep 'pid=<PID>,' to see what connections it still holds. Long-lived connections to upstream or client ports indicate why the worker cannot exit.
5. Check for worker_shutdown_timeout. If grep returns nothing, there is no deadline. Old workers will wait indefinitely.
6. Correlate with reload events. Check the error log for reconfiguring entries. If reloads occur more than once per minute on a sustained basis, the environment is generating new worker generations faster than old ones can drain.
7. Determine the connection type. Review access logs for the stuck endpoints. Look for long $request_time values or requests to known streaming paths. WebSocket connections appear as a single long-lived request in access logs.
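To map connections to workers for every nginx process at once, the ss output can be grouped by PID. The sketch below assumes the usual `users:(("nginx",pid=N,fd=M))` column that `ss -tnp` prints on Linux; conns_by_pid is an illustrative name.

```shell
# Group TCP connections by owning nginx PID.
# Reads `ss -tnp` output on stdin and prints "PID COUNT" lines sorted by PID.
conns_by_pid() {
  grep -o '"nginx",pid=[0-9]*' \
    | cut -d= -f2 \
    | sort -n \
    | uniq -c \
    | awk '{ print $2, $1 }'
}

# Usage (not run here; ss typically needs root to show other users' processes):
#   ss -tnp | conns_by_pid
```

PIDs in this list that belong to draining workers are the ones holding a generation open.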
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Worker process count | Direct indicator of old worker accumulation | Count > worker_processes sustained for more than a few minutes |
| Active connections | Old workers hold connections that should have drained | Monotonic growth without corresponding traffic growth |
| Writing connections | Active proxy or streaming connections prevent worker exit | Sustained high Writing count tied to old worker PIDs |
| Worker RSS | Each old worker retains memory for open buffers and connections | Per-worker memory growing without bound |
| Reload frequency | Each reload spawns a new generation of workers | More than one reload per minute sustained |
| accepts - handled delta | Connections dropped because workers or the system ran out of file descriptors | Gap increasing under load while old worker count is high |
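The accepts - handled delta can be computed directly from stub_status text. This is a minimal sketch that assumes the standard four-line stub_status layout, where the third line holds the accepts, handled, and requests counters; dropped_connections is an illustrative name.

```shell
# Print accepts minus handled from stub_status text read on stdin.
# The counters are on the third line: accepts, handled, requests.
dropped_connections() {
  awk 'NR == 3 { printf "%d\n", $1 - $2 }'
}

# Usage (not run here; the URL is the one used in the quick checks):
#   curl -s http://127.0.0.1/nginx_status | dropped_connections
```

The counters are cumulative since nginx started, so what matters is how the value changes between samples, not its absolute size.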
Fixes
Set worker_shutdown_timeout
Add worker_shutdown_timeout at the main configuration level. A value between 30 and 120 seconds is a reasonable starting point for most web workloads. After the timeout expires, nginx forcefully closes remaining connections on old workers, allowing them to exit.
Tradeoff: Connections terminated by this timeout drop mid-stream. Clients see disconnections. If your application relies on persistent WebSocket or gRPC streams, set the timeout to the longest acceptable interruption window, or isolate streaming locations to a separate nginx instance where reloads are rare.
# In nginx.conf, main context
worker_shutdown_timeout 60s;
Test and reload:
nginx -t
nginx -s reload
Reduce reload frequency
If a configuration management system, ingress controller, or CI/CD pipeline triggers reloads on every minor change, batch those changes. Fewer reloads mean fewer opportunities for worker generations to pile up. In Kubernetes environments, coalesce ingress object changes before they hit the nginx control loop.
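For a custom deploy script that calls nginx -s reload itself, a simple minimum-interval guard is one way to batch changes. The sketch below is an assumption-heavy illustration: should_reload and the stamp file path are hypothetical, and ingress controllers have their own batching settings that should be preferred where they exist.

```shell
# Decide whether enough time has passed since the last reload.
# Arguments: last reload time and current time (epoch seconds), then the
# minimum interval in seconds. Returns 0 (reload allowed) or 1 (defer).
should_reload() {
  local last=$1 now=$2 min_interval=$3
  [ $(( now - last )) -ge "$min_interval" ]
}

# Usage (not run here): STAMP is a hypothetical file holding the last reload time.
#   STAMP=/run/nginx-last-reload
#   last=$(cat "$STAMP" 2>/dev/null || echo 0)
#   now=$(date +%s)
#   if should_reload "$last" "$now" 30; then
#     nginx -t && nginx -s reload && echo "$now" > "$STAMP"
#   fi
```

A deferred change still has to be applied later, so a real wrapper would also need a retry or a scheduled catch-up reload.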
Tune streaming endpoint timeouts
For locations that intentionally handle long-lived connections, review proxy_read_timeout, proxy_send_timeout, and client-side timeouts. If a connection is idle for minutes, a shorter timeout may allow the upstream or client to close gracefully, freeing the worker sooner. Balance this against the legitimate needs of the application protocol.
Emergency manual termination
If old workers have accumulated to the point of memory pressure and you cannot reload immediately, terminate specific old worker processes by PID. This is disruptive: their connections drop immediately.
# DANGEROUS: Terminates the worker and its open connections
kill <OLD_WORKER_PID>
Only use this as a last resort to reclaim memory or file descriptors before the OOM killer intervenes.
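Before killing anything, it helps to list candidates without acting on them. The following sketch only prints PIDs. It assumes procps-ng `ps` (for the etimes field) and the "is shutting down" process title; stale_draining_pids is an illustrative name. The age used is the age of the process, which for a draining worker is at least as long as it has been draining.

```shell
# Print PIDs of draining workers older than a given age in seconds.
# Reads lines of "PID ELAPSED_SECONDS COMMAND..." on stdin, for example from
# `ps -eo pid=,etimes=,args=`. It never sends a signal.
stale_draining_pids() {
  local max_age=$1
  awk -v max="$max_age" '
    /nginx: worker process is shutting down/ && $2 + 0 > max + 0 { print $1 }
  '
}

# Usage (not run here): review the list before terminating any PID.
#   ps -eo pid=,etimes=,args= | stale_draining_pids 3600
```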
Prevention
- Set worker_shutdown_timeout in the base configuration. Do not wait for accumulation to appear. The directive is absent by default; the omission is only safe if you have no long-lived connections and infrequent reloads.
- Batch configuration changes. Coalesce updates so reloads happen at a controlled rate rather than reactively.
- Monitor worker count against worker_processes. Alert when the running count exceeds the configured count for more than a brief transient window.
- Review long-lived connection paths. If you proxy WebSocket, SSE, or gRPC, validate that upstream and client idle timeouts are aligned so connections close when genuinely idle.
How Netdata helps
- Charts worker process count and alerts when sustained above configured worker_processes.
- Correlates per-process RSS and file descriptor usage with reload events to highlight memory pressure from stuck workers.
- Surfaces stub_status metrics including active connections and the Reading/Writing/Waiting breakdown to identify whether old workers are holding connections open.
- Tracks reload frequency via error log patterns to detect configuration storms.
- Cross-references upstream response time and connection state to distinguish old worker accumulation from legitimate traffic growth.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 413 Request Entity Too Large: client_max_body_size explained
- nginx 499 status code: why clients close connections before the response
- nginx 500 Internal Server Error: how to diagnose it
- nginx 502 Bad Gateway: causes and how to fix it
- nginx 503 Service Temporarily Unavailable: causes and fixes
- nginx 504 Gateway Time-out: causes and fixes
- NGINX active connections climbing: reading, writing, waiting explained
- NGINX backend cascade failure: when slow upstreams take down everything
- nginx: a client request body is buffered to a temporary file - what it means
- NGINX proxy cache hit rate is low: measuring and improving it
- nginx connect() failed (111: Connection refused) while connecting to upstream







