NGINX old worker processes accumulating after reload: worker_shutdown_timeout

After nginx -s reload, worker process count climbs above worker_processes. Memory and file descriptor usage grow. The workers are not crashing; they are old generations waiting for long-lived connections to close.

This pattern is normal in small doses. During a reload, the master keeps old workers alive until active connections drain. Without a shutdown deadline, a single WebSocket, gRPC stream, or long-polling connection can pin an old worker indefinitely. In environments that reload frequently, such as Kubernetes ingress controllers reacting to endpoint changes, accumulation becomes a resource leak that can exhaust memory or file descriptors.

flowchart TD
    A[Reload received] --> B[Master spawns new workers]
    B --> C[Old workers stop accepting]
    C --> D{worker_shutdown_timeout set?}
    D -->|No| E[Remain until connections close]
    D -->|Yes| F[Remain until timeout expires]
    F --> G[Force close connections]
    E --> H[Old worker exits]
    G --> H

What this means

When nginx reloads, the master spawns new workers with the updated configuration. Previous workers stop accepting new connections but continue serving existing ones. In a healthy short-request workload, old workers finish within seconds and exit.

The worker_shutdown_timeout directive caps how long old workers wait. If the directive is absent, the default is an infinite grace period. Old workers wait until every last connection closes naturally. WebSockets, Server-Sent Events, gRPC streaming, and long-polling HTTP connections often persist for minutes or hours. Each reload under these conditions adds another generation of workers that may never leave.

In Kubernetes ingress deployments, endpoint changes can trigger reloads many times per hour. This leads to a steady rise in worker count, memory footprint, and open file descriptors. The symptom is simple: running worker processes exceed the configured worker_processes, and the gap widens over time.
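
As a rough back-of-the-envelope sketch of why a deadline bounds the problem (an illustrative assumption, not an nginx formula): suppose every reload leaves behind a full generation of workers, each pinned by at least one long-lived connection. With a shutdown deadline, the number of generations alive at once is at most the deadline divided by the reload interval, rounded up. The helper name `estimate_old_workers` and the sample numbers are hypothetical.

```shell
# Hypothetical helper: worst-case number of lingering old workers.
# Args: worker_processes, shutdown deadline in seconds, seconds between reloads.
estimate_old_workers() {
    # ceil(deadline / interval) generations can overlap at any moment
    generations=$(( ($2 + $3 - 1) / $3 ))
    echo $(( generations * $1 ))
}

# 4 workers, 60s deadline, one reload every 20s
estimate_old_workers 4 60 20   # prints 12
```

Without a deadline there is no such bound, which is why the count can keep growing.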

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Long-lived connections blocking drain | Worker count stays above worker_processes indefinitely; specific PIDs persist across multiple reloads | ss -tnp mapped to old worker PIDs to confirm active connections |
| Frequent configuration reloads | Process count climbs steadily over hours; many reconfiguring entries in the error log | Error log reload frequency and correlation with worker spawn times |
| worker_shutdown_timeout not configured | Old workers never disappear regardless of age; memory grows monotonically | nginx.conf for the presence of worker_shutdown_timeout |
| Streaming endpoints without endpoint-specific timeouts | Old workers cluster on ports or locations serving WebSocket, SSE, or gRPC | Access log for long-running requests to those paths |

Quick checks

Run these read-only checks to confirm accumulation and identify the generation causing it.

# Count configured workers (check included configs if this returns nothing)
configured=$(grep -E '^[[:space:]]*worker_processes' /etc/nginx/nginx.conf | awk '{print $2}' | tr -d ';')
echo "configured: $configured"
# Count running worker processes (excludes cache loader and manager)
# Adjust PID file path if your instance uses a different location
pgrep -a -P "$(cat /var/run/nginx.pid)" | grep -c 'nginx: worker process'
# Check whether worker_shutdown_timeout is configured
grep 'worker_shutdown_timeout' /etc/nginx/nginx.conf
# Review recent reload frequency from error log (adjust path if needed).
# These entries are logged at notice level, so error_log must include notice.
# A signal-triggered reload typically logs two lines containing "reconfiguring";
# the anchored pattern counts each reload once.
grep -c ': reconfiguring$' /var/log/nginx/error.log
# Check active connection totals and state breakdown
# (requires a stub_status location; adjust the URL to match yours)
curl -s http://127.0.0.1/nginx_status
# List worker processes to spot generations
pgrep -a -f 'nginx: worker process'
# Check connections associated with a specific old worker PID
# (run as root so ss can show process information)
ss -tnp | grep 'pid=<OLD_PID>,'
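
The generation check above can be condensed into a small summary. The following is a minimal sketch, assuming a procps-style `ps` and relying on the fact that draining workers retitle themselves `nginx: worker process is shutting down`. The function name `summarize_workers` is hypothetical, and it reads `ps` output from standard input so it can be tried on saved output.

```shell
# Summarize worker states from `ps -o pid=,etimes=,args=` output on stdin.
# Draining workers carry the title "nginx: worker process is shutting down".
summarize_workers() {
    awk '
        /nginx: worker process is shutting down/ { old++; next }
        /nginx: worker process/                  { cur++ }
        END { printf "current=%d draining=%d\n", cur, old }
    '
}

# Usage against a live instance (the PID file path is an assumption):
# ps -o pid=,etimes=,args= --ppid "$(cat /var/run/nginx.pid)" | summarize_workers
```

A non-zero draining count that persists between checks points at stuck old generations rather than a transient reload.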

How to diagnose it

  1. Establish the baseline. Note the configured worker_processes. If set to auto, the expected count equals the number of CPU cores visible to nginx.

  2. Count running workers. Use pgrep -a -P $(cat /var/run/nginx.pid) | grep -c 'nginx: worker process' and compare to the configured value. A sustained count above baseline confirms accumulation.

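  3. Identify old generations. In the process list, old workers carry the title nginx: worker process is shutting down and have earlier start times than the current generation. A lower PID is a hint but not proof, because PID numbers can wrap. If the same PIDs survive across multiple reloads, they are stuck.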

  4. Map stuck workers to open connections. For each old worker PID, run ss -tnp | grep 'pid=<PID>,' to see what connections it still holds. Matching on pid= avoids false hits on port numbers that happen to contain the same digits. Long-lived connections to upstream or client ports indicate why the worker cannot exit.

  5. Check for worker_shutdown_timeout. If grep returns nothing, there is no deadline. Old workers will wait indefinitely.

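  6. Correlate with reload events. Check the error log for reconfiguring entries; these are logged at notice level, so they are missing if error_log uses a stricter level. If reloads occur more than once per minute for a sustained period, the environment is likely generating new worker generations faster than old ones can drain.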

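  7. Determine the connection type. Review access logs for the stuck endpoints. Look for long $request_time values or requests to known streaming paths. nginx writes the access log entry when a request completes, so a WebSocket connection appears as a single long-lived request only after it closes; connections that are still open must be found with ss.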
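
The mapping in step 4 can be done for all PIDs at once. This is a minimal sketch, assuming the `users:(("nginx",pid=1234,fd=12))` process column that `ss -p` prints; the function name `conns_per_pid` is hypothetical. It reads `ss -tnp` output from standard input and prints one `PID count` pair per line, busiest first.

```shell
# Count TCP connections per owning PID from `ss -tnp` output on stdin.
conns_per_pid() {
    grep -o 'pid=[0-9]*' |
        sort | uniq -c |
        awk '{ sub(/^pid=/, "", $2); print $2, $1 }' |
        sort -k2,2nr -k1,1n
}

# Usage (run as root so ss can see process information):
# ss -tnp | grep '"nginx"' | conns_per_pid
```

Comparing the PIDs in this output with the draining workers from the process list shows which old generations are pinned and by how many connections.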

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Worker process count | Direct indicator of old worker accumulation | Count > worker_processes sustained for more than a few minutes |
| Active connections | Old workers hold connections that should have drained | Monotonic growth without corresponding traffic growth |
| Writing connections | Active proxy or streaming connections prevent worker exit | Sustained high Writing count tied to old worker PIDs |
| Worker RSS | Each old worker retains memory for open buffers and connections | Per-worker memory growing without bound |
| Reload frequency | Each reload spawns a new generation of workers | More than one reload per minute sustained |
| accepts - handled delta | Connections dropped because workers or the system ran out of file descriptors | Gap increasing under load while old worker count is high |
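
The reload frequency signal can be approximated directly from the error log. The sketch below assumes the default error log timestamp format (`YYYY/MM/DD HH:MM:SS`), an `error_log` level that includes notice, and that each reload produces one line ending in `: reconfiguring`. The function name `reload_bursts` is hypothetical.

```shell
# Print each minute in which more than `threshold` reloads were logged.
# Reads nginx error log lines on stdin; output is "date HH:MM count".
reload_bursts() {
    threshold=${1:-1}
    grep ': reconfiguring$' |
        awk -v t="$threshold" '
            { minute = $1 " " substr($2, 1, 5); count[minute]++ }
            END { for (m in count) if (count[m] > t) print m, count[m] }
        ' | sort
}

# Usage (adjust the log path if needed):
# reload_bursts 1 < /var/log/nginx/error.log
```

Many consecutive minutes in the output suggest a reload storm rather than an isolated configuration change.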

Fixes

Set worker_shutdown_timeout

Add worker_shutdown_timeout at the main configuration level (the directive is available since nginx 1.11.11). A value between 30 and 120 seconds is a reasonable starting point for most web workloads. After the timeout expires, nginx tries to close all connections still open on old workers so that they can exit.

Tradeoff: Connections terminated by this timeout drop mid-stream. Clients see disconnections. If your application relies on persistent WebSocket or gRPC streams, set the timeout to the longest acceptable interruption window, or isolate streaming locations to a separate nginx instance where reloads are rare.

# In nginx.conf, main context
worker_shutdown_timeout 60s;

Test and reload:

nginx -t
nginx -s reload
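
To confirm the directive is present in the full configuration, including included files, the output of `nginx -T` can be inspected. This is a minimal sketch; the function name `shutdown_timeout_value` is hypothetical, and it only reports what is in the configuration files, not what already-running old workers are using.

```shell
# Extract worker_shutdown_timeout from a configuration dump on stdin.
# Prints the configured value, or "unset" when the directive is absent.
shutdown_timeout_value() {
    awk '
        $1 == "worker_shutdown_timeout" { gsub(/;/, "", $2); v = $2 }
        END { print (v == "" ? "unset" : v) }
    '
}

# Usage: nginx -T prints the parsed configuration to standard output.
# nginx -T 2>/dev/null | shutdown_timeout_value
```

Commented-out lines are ignored because their first field is not the directive name.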

Reduce reload frequency

If a configuration management system, ingress controller, or CI/CD pipeline triggers reloads on every minor change, batch those changes. Fewer reloads mean fewer opportunities for worker generations to pile up. In Kubernetes environments, coalesce ingress object changes before they hit the nginx control loop.

Tune streaming endpoint timeouts

For locations that intentionally handle long-lived connections, review proxy_read_timeout, proxy_send_timeout, and client-side timeouts. If a connection is idle for minutes, a shorter timeout may allow the upstream or client to close gracefully, freeing the worker sooner. Balance this against the legitimate needs of the application protocol.

Emergency manual termination

If old workers have accumulated to the point of memory pressure and you cannot wait for their connections to drain, terminate specific old worker processes by PID. This is disruptive: their connections drop immediately.

# DANGEROUS: Terminates the worker and its open connections
kill <OLD_WORKER_PID>

Only use this as a last resort to reclaim memory or file descriptors before the OOM killer intervenes.
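
Before terminating anything, it helps to list candidates without acting on them. The following is a dry-run sketch, assuming a procps-style `ps` whose `etimes` column gives elapsed seconds. The function name `stale_draining_pids` and the one-hour default are hypothetical. It only prints PIDs and never sends a signal.

```shell
# Print PIDs of draining workers that have been alive at least min_age seconds.
# Reads `ps -o pid=,etimes=,args=` output on stdin. Prints only; never kills.
stale_draining_pids() {
    awk -v min="${1:-3600}" '
        /nginx: worker process is shutting down/ && $2 + 0 >= min + 0 { print $1 }
    '
}

# Usage (the PID file path is an assumption):
# ps -o pid=,etimes=,args= --ppid "$(cat /var/run/nginx.pid)" | stale_draining_pids 3600
```

Review the printed PIDs against their open connections before passing any of them to kill.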

Prevention

  • Set worker_shutdown_timeout in the base configuration. Do not wait for accumulation to appear. The directive is absent by default; the omission is only safe if you have no long-lived connections and infrequent reloads.

  • Batch configuration changes. Coalesce updates so reloads happen at a controlled rate rather than reactively.

  • Monitor worker count against worker_processes. Alert when the running count exceeds the configured count for more than a brief transient window.

  • Review long-lived connection paths. If you proxy WebSocket, SSE, or gRPC, validate that upstream and client idle timeouts are aligned so connections close when genuinely idle.
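
The worker count check from the list above can be sketched as a small probe for a cron job or monitoring agent. The function names are hypothetical, and the sketch assumes that `auto` corresponds to the number of online CPUs as reported by `getconf _NPROCESSORS_ONLN`. It checks a single sample; alerting only after several consecutive failures is left to the caller.

```shell
# Resolve the expected worker count; "auto" maps to the online CPU count.
expected_workers() {
    if [ "$1" = "auto" ]; then
        getconf _NPROCESSORS_ONLN
    else
        echo "$1"
    fi
}

# Return non-zero when more workers are running than expected.
# Args: running worker count, configured worker_processes value.
check_worker_count() {
    running=$1
    expected=$(expected_workers "$2")
    if [ "$running" -gt "$expected" ]; then
        echo "WARN: $running workers running, $expected expected"
        return 1
    fi
    echo "OK: $running workers running, $expected expected"
}
```

For example, `check_worker_count 6 4` prints a warning and returns 1, while `check_worker_count 4 4` reports OK.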

How Netdata helps

  • Charts worker process count and alerts when sustained above configured worker_processes.
  • Correlates per-process RSS and file descriptor usage with reload events to highlight memory pressure from stuck workers.
  • Surfaces stub_status metrics including active connections and the Reading/Writing/Waiting breakdown to identify whether old workers are holding connections open.
  • Tracks reload frequency via error log patterns to detect configuration storms.
  • Cross-references upstream response time and connection state to distinguish old worker accumulation from legitimate traffic growth.