NGINX old worker processes accumulating after reload: worker_shutdown_timeout

After nginx -s reload, worker process count climbs above worker_processes. Memory and file descriptor usage grow. The workers are not crashing; they are old generations waiting for long-lived connections to close.

This pattern is normal in small doses. During a reload, the master keeps old workers alive until active connections drain. Without a shutdown deadline, a single WebSocket, gRPC stream, or long-polling connection can pin an old worker indefinitely. In environments that reload frequently, such as Kubernetes ingress controllers reacting to endpoint changes, accumulation becomes a resource leak that can exhaust memory or file descriptors.

flowchart TD
    A[Reload received] --> B[Master spawns new workers]
    B --> C[Old workers stop accepting]
    C --> D{worker_shutdown_timeout set?}
    D -->|No| E[Remain until connections close]
    D -->|Yes| F[Remain until timeout expires]
    F --> G[Force close connections]
    E --> H[Old worker exits]
    G --> H

What this means

When nginx reloads, the master spawns new workers with the updated configuration. Previous workers stop accepting new connections but continue serving existing ones. In a healthy short-request workload, old workers finish within seconds and exit.

The worker_shutdown_timeout directive caps how long old workers wait. If the directive is absent, there is no deadline: old workers wait until every last connection closes naturally. WebSockets, Server-Sent Events, gRPC streaming, and long-polling HTTP connections often persist for minutes or hours. Each reload under these conditions adds another generation of workers that may never leave.

In Kubernetes ingress deployments, endpoint changes can trigger reloads many times per hour. This leads to a steady rise in worker count, memory footprint, and open file descriptors. The symptom is simple: running worker processes exceed the configured worker_processes, and the gap widens over time.
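To make the accumulation concrete, the following is a rough back-of-the-envelope sketch rather than a model of nginx internals. It assumes reloads arrive at a fixed interval and that every old worker holds at least one connection that lives for the maximum lifetime; the function name `estimate_old_workers` and its parameters are hypothetical.

```shell
# Rough upper bound on old workers that can coexist.
# Usage: estimate_old_workers WORKERS RELOAD_INTERVAL_S MAX_CONN_LIFETIME_S [SHUTDOWN_TIMEOUT_S]
# Assumption: fixed reload interval, and every old worker is pinned for the full lifetime.
estimate_old_workers() {
  eow_workers=$1
  eow_interval=$2
  eow_linger=$3          # how long an old generation can linger
  eow_timeout=${4:-0}    # 0 means worker_shutdown_timeout is not set

  # A shutdown timeout shorter than the connection lifetime caps the linger time.
  if [ "$eow_timeout" -gt 0 ] && [ "$eow_timeout" -lt "$eow_linger" ]; then
    eow_linger=$eow_timeout
  fi

  # Generations retired within the linger window (ceiling division).
  eow_generations=$(( (eow_linger + eow_interval - 1) / eow_interval ))
  echo $(( eow_generations * eow_workers ))
}

# 4 workers, reload every 5 minutes, connections lasting up to 1 hour:
estimate_old_workers 4 300 3600      # no timeout
estimate_old_workers 4 300 3600 60   # with a 60-second timeout
```

Under these assumptions the first call prints 48 and the second prints 4, which illustrates why a deadline matters far more than the absolute worker count.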

Common causes

Cause	What it looks like	First thing to check
Long-lived connections blocking drain	Worker count stays above `worker_processes` indefinitely; specific PIDs persist across multiple reloads	`ss -tnp` mapped to old worker PIDs to confirm active connections
Frequent configuration reloads	Process count climbs steadily over hours; many `reconfiguring` entries in the error log	Error log reload frequency and correlation with worker spawn times
`worker_shutdown_timeout` not configured	Old workers never disappear regardless of age; memory grows monotonically	`nginx.conf` for the presence of `worker_shutdown_timeout`
Streaming endpoints without endpoint-specific timeouts	Old workers cluster on ports or locations serving WebSocket, SSE, or gRPC	Access log for long-running requests to those paths

Quick checks

Run these read-only checks to confirm accumulation and identify the generation causing it.

# Count configured workers (check included configs if this returns nothing)
configured=$(grep -E '^\s*worker_processes' /etc/nginx/nginx.conf | awk '{print $2}' | tr -d ';')
echo "configured: $configured"

# Count running worker processes (excludes cache loader and manager;
# includes old workers that are still draining)
# Adjust PID file path if your instance uses a different location
pgrep -a -P $(cat /var/run/nginx.pid) | grep -c 'nginx: worker process'

# Check whether worker_shutdown_timeout is configured (searches included files too)
grep -rn 'worker_shutdown_timeout' /etc/nginx/

# Review recent reload frequency from error log (adjust path if needed)
# These entries are logged at notice level, so error_log must be set to notice, info, or debug
grep -c 'reconfiguring' /var/log/nginx/error.log

# Check active connection totals and state breakdown
# (assumes stub_status is exposed at /nginx_status; adjust the URL to your config)
curl -s http://127.0.0.1/nginx_status

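# List worker processes to spot generations
# Draining workers show the title "nginx: worker process is shutting down"
pgrep -a -f 'nginx: worker process'

# Check connections held by a specific old worker PID
# (run as root so ss can show sockets owned by other users)
ss -tnp | grep 'pid=<OLD_PID>,'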
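When several old workers are present, checking them one PID at a time gets tedious. The sketch below tallies connections per worker from `ss -tnp` output in one pass. It assumes the process name reported by ss is `nginx` and that the output uses the usual `users:(("nginx",pid=N,fd=M))` form; the function name `conns_per_pid` is hypothetical.

```shell
# Tally TCP connections per nginx PID from `ss -tnp` output on stdin.
# Prints "PID COUNT" lines sorted by PID.
# Assumption: ss reports the owning process as "nginx".
conns_per_pid() {
  grep -o '"nginx",pid=[0-9]*' |   # keep only the owner fragments
    sed 's/.*pid=//' |             # reduce each fragment to the bare PID
    sort -n |
    uniq -c |
    awk '{ print $2, $1 }'         # reorder to "PID COUNT"
}

# Typical use (as root, so sockets of other users are visible):
#   ss -tnp | conns_per_pid
```

Compare the PIDs in the output against the worker list: any PID that belongs to a draining worker and still has a non-zero count is what keeps that worker alive.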

How to diagnose it

Establish the baseline. Note the configured worker_processes. If set to auto, the expected count equals the number of CPU cores visible to nginx.
Count running workers. Use pgrep -a -P $(cat /var/run/nginx.pid) | grep -c 'nginx: worker process' and compare to the configured value. A sustained count above baseline confirms accumulation.
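Identify old generations. In the process list, old workers carry the title nginx: worker process is shutting down and have earlier start times than the current generation. PIDs alone are an unreliable age indicator because the kernel can wrap around and reuse them. If the same PIDs survive across multiple reloads, they are stuck.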
Map stuck workers to open connections. For each old worker PID, run ss -tnp | grep 'pid=<PID>,' to see what connections it still holds; matching on pid= avoids false hits on port numbers. Long-lived connections to upstream or client ports indicate why the worker cannot exit.
Check for worker_shutdown_timeout. If grep returns nothing, there is no deadline. Old workers will wait indefinitely.
Correlate with reload events. Check the error log for reconfiguring entries. If reloads are sustained at more than one per minute, the environment is generating new worker generations faster than old ones can drain.
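Determine the connection type. Review access logs for the stuck endpoints. Look for long $request_time values or requests to known streaming paths. nginx writes the access log entry when a request completes, so a WebSocket connection appears as a single long-lived request only after it closes; connections that are still open are not listed yet.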
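The first steps above can be condensed into one pass over the process list. This is a minimal sketch, assuming process titles are visible and that draining workers carry the "is shutting down" title; the function name `summarize_workers` and the column layout it expects are choices made for this example.

```shell
# Classify nginx workers from `ps` output on stdin.
# Expected columns: PID, elapsed seconds, RSS in KiB, then the process title.
# Assumption: draining workers are titled "nginx: worker process is shutting down".
summarize_workers() {
  awk '
    /nginx: worker process/ {
      state = ($0 ~ /is shutting down/) ? "draining" : "active"
      count[state]++
      if (state == "draining")
        printf "draining pid=%s age=%ss rss=%sKiB\n", $1, $2, $3
    }
    END {
      # Unset counters print as 0.
      printf "active=%d draining=%d\n", count["active"], count["draining"]
    }
  '
}

# Typical use on Linux with procps (adjust the PID file path if needed):
#   ps -o pid=,etimes=,rss=,args= --ppid "$(cat /var/run/nginx.pid)" | summarize_workers
```

A draining worker whose age spans several reload intervals is a candidate for the connection mapping step.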

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Worker process count	Direct indicator of old worker accumulation	Count > `worker_processes` sustained for more than a few minutes
Active connections	Old workers hold connections that should have drained	Monotonic growth without corresponding traffic growth
Writing connections	Active proxy or streaming connections prevent worker exit	Sustained high Writing count while old workers are still present (stub_status is aggregated across workers, so confirm per PID with ss)
Worker RSS	Each old worker retains memory for open buffers and connections	Per-worker memory growing without bound
Reload frequency	Each reload spawns a new generation of workers	Sustained rate above one reload per minute
`accepts - handled` delta	Connections dropped because workers or the system ran out of file descriptors	Gap increasing under load while old worker count is high
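Several of these signals come from the stub_status page. As a sketch of how they could be extracted for a quick look or a cron job, the function below reduces the standard stub_status text to three numbers. The function name `stub_status_summary` is hypothetical, and the parsing assumes the stock four-line output format.

```shell
# Reduce stub_status output on stdin to active, writing, and accepts-minus-handled.
# Assumption: the stock text format of the stub_status module.
stub_status_summary() {
  awk '
    /^Active connections:/          { active = $3 }
    /^ *[0-9]+ +[0-9]+ +[0-9]+/     { dropped = $1 - $2 }   # accepts - handled
    /^Reading:/                     { writing = $4 }
    END { printf "active=%d writing=%d dropped=%d\n", active, writing, dropped }
  '
}

# Typical use (the URL depends on where stub_status is exposed):
#   curl -s http://127.0.0.1/nginx_status | stub_status_summary
```

Because these counters are totals for the whole instance, pair them with the per-PID checks when attributing connections to old workers.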

Fixes

Set `worker_shutdown_timeout`

Add worker_shutdown_timeout at the main configuration level (the directive is available in nginx 1.11.11 and later). A value between 30 and 120 seconds is a reasonable starting point for most web workloads. After the timeout expires, nginx forcefully closes remaining connections on old workers, allowing them to exit.

Tradeoff: Connections terminated by this timeout drop mid-stream. Clients see disconnections. If your application relies on persistent WebSocket or gRPC streams, set the timeout to the longest acceptable interruption window, or isolate streaming locations to a separate nginx instance where reloads are rare.

# In nginx.conf, main context
worker_shutdown_timeout 60s;

Test and reload:

nginx -t
nginx -s reload
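To confirm which value nginx actually parsed, including anything set in included files, one option is to inspect the full configuration dump from `nginx -T`. The sketch below reads that dump from stdin; the function name `shutdown_timeout_from_config` is hypothetical, and it assumes the directive and its value sit on one line.

```shell
# Print the effective worker_shutdown_timeout from a config dump on stdin,
# or "unset" if the directive does not appear. Commented-out lines are ignored
# because their first field is not the bare directive name.
shutdown_timeout_from_config() {
  awk '
    $1 == "worker_shutdown_timeout" { sub(/;.*$/, "", $2); value = $2 }
    END { print (value == "" ? "unset" : value) }
  '
}

# Typical use (may need root to read every included file):
#   nginx -T 2>/dev/null | shutdown_timeout_from_config
```

Running this before and after the change gives a quick confirmation that the directive landed in the main context of the configuration nginx loads.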

Reduce reload frequency

If a configuration management system, ingress controller, or CI/CD pipeline triggers reloads on every minor change, batch those changes. Fewer reloads mean fewer opportunities for worker generations to pile up. In Kubernetes environments, coalesce ingress object changes before they hit the nginx control loop.
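Before batching, it helps to know how often reloads really happen. The following sketch buckets reload entries from the error log by minute. It assumes the default error log timestamp format and that each reload produces one notice line ending in ": reconfiguring"; the signal-receipt line that also mentions reconfiguring is deliberately not counted. The function name `reloads_per_minute` is hypothetical.

```shell
# Count reloads per minute from nginx error log lines on stdin.
# Prints "YYYY/MM/DD HH:MM COUNT" lines in chronological order.
# Assumption: error_log is at notice level and uses the default timestamp format.
reloads_per_minute() {
  awk '
    /: reconfiguring$/ {
      minute = $1 " " substr($2, 1, 5)   # date plus HH:MM
      count[minute]++
    }
    END { for (m in count) print m, count[m] }
  ' | sort
}

# Typical use (adjust the path if needed):
#   reloads_per_minute < /var/log/nginx/error.log
```

Minutes that show more than one reload are the places where coalescing changes would pay off first.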

Tune streaming endpoint timeouts

For locations that intentionally handle long-lived connections, review proxy_read_timeout, proxy_send_timeout, and client-side timeouts. If a connection is idle for minutes, a shorter timeout may allow the upstream or client to close gracefully, freeing the worker sooner. Balance this against the legitimate needs of the application protocol.

Emergency manual termination

If old workers have accumulated to the point of memory pressure and you cannot reload immediately, terminate specific old worker processes by PID. This is disruptive: their connections drop immediately.

# DANGEROUS: Terminates the worker and its open connections
kill <OLD_WORKER_PID>

Only use this as a last resort to reclaim memory or file descriptors before the OOM killer intervenes.

Prevention

Set worker_shutdown_timeout in the base configuration. Do not wait for accumulation to appear. The directive is absent by default; the omission is only safe if you have no long-lived connections and infrequent reloads.
Batch configuration changes. Coalesce updates so reloads happen at a controlled rate rather than reactively.
Monitor worker count against worker_processes. Alert when the running count exceeds the configured count for more than a brief transient window.
Review long-lived connection paths. If you proxy WebSocket, SSE, or gRPC, validate that upstream and client idle timeouts are aligned so connections close when genuinely idle.
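The alerting rule above needs to ignore the brief overlap that every reload causes. One way to express that, shown here as a sketch rather than a recommended monitoring setup, is to alert only when the most recent N samples all exceed the configured count. The function name `sustained_excess` and the sampling scheme are assumptions for this example.

```shell
# Read worker-count samples from stdin (one per line, oldest first) and print
# "alert" if the last NEED samples all exceed LIMIT, otherwise "ok".
# Usage: sustained_excess LIMIT NEED
sustained_excess() {
  awk -v limit="$1" -v need="$2" '
    # Extend the streak on an excess sample, reset it otherwise.
    { streak = ($1 + 0 > limit + 0) ? streak + 1 : 0 }
    END { print (streak >= need + 0 ? "alert" : "ok") }
  '
}

# Example with 4 configured workers and samples taken once a minute:
printf '4\n8\n8\n8\n' | sustained_excess 4 3   # excess held for 3 samples
printf '8\n8\n4\n8\n' | sustained_excess 4 3   # excess interrupted
```

The first example prints alert and the second prints ok. Choosing N so that N times the sampling interval exceeds the shutdown timeout keeps normal reload overlap from triggering the alert.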

How Netdata helps

Charts worker process count and alerts when sustained above configured worker_processes.
Correlates per-process RSS and file descriptor usage with reload events to highlight memory pressure from stuck workers.
Surfaces stub_status metrics including active connections and the Reading/Writing/Waiting breakdown to identify whether old workers are holding connections open.
Tracks reload frequency via error log patterns to detect configuration storms.
Cross-references upstream response time and connection state to distinguish old worker accumulation from legitimate traffic growth.
