NGINX reload not applying config: why old workers keep serving

You pushed a config change, ran nginx -s reload, and moved on. Hours later, the new certificate is not being served, the updated upstream is not receiving traffic, or the tightened rate limit never took effect. NGINX did not stop running, but the reload never applied. This is the silent rollback: when a reload fails validation, the master keeps the previous configuration active and old workers continue serving. Even when validation passes, old workers can remain alive for hours if long-lived connections prevent them from draining and worker_shutdown_timeout is not set. This guide shows how to confirm the failure, find the root cause, and prevent config drift from going undetected.

What this means

When you run nginx -s reload, the master validates the new configuration. If validation fails, the master rolls back silently and continues serving with the old configuration. There is no stdout error; a script that only checks the exit code misses the failure. The only evidence is an [emerg] line in the error log.

If validation succeeds, the master spawns new workers with the new configuration. Old workers stop accepting new connections but continue processing existing ones until they drain. Without worker_shutdown_timeout, old workers wait indefinitely for long-lived connections to close. During this window, both old and new workers coexist. If you reload frequently or connections never close, workers accumulate and consume memory and file descriptors while still bound to the old config.

flowchart TD
    A[Operator: nginx -s reload] --> B{Master tests config}
    B -->|Fail| C["Write [emerg] to error log"]
    C --> D[Rollback: old config continues]
    B -->|Pass| E[Start new workers]
    E --> F[Old workers drain connections]
    F --> G{worker_shutdown_timeout?}
    G -->|Not set| H[Old workers wait forever]
    G -->|Set| I[Force close after timeout]
    H --> J[Old workers accumulate]
    I --> K[Old workers exit]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Config validation failure during reload | New behavior is missing after reload; error log shows [emerg] | nginx -t and /var/log/nginx/error.log for [emerg] |
| Long-lived connections blocking drain | Worker count exceeds worker_processes for minutes or hours after reload | pgrep -a -P $(cat /var/run/nginx.pid) \| grep -c 'nginx: worker' vs configured worker_processes |
| Missing worker_shutdown_timeout | Old workers persist indefinitely on WebSocket, gRPC, or SSE connections | nginx -T 2>/dev/null \| grep worker_shutdown_timeout |
| Rapid reload accumulation | Memory and FD usage climb; dozens of worker processes visible | Reload frequency in error log and total process count |

Quick checks

# Validate configuration without applying it
nginx -t

# Count worker processes (excludes cache loader/manager)
pgrep -a -P $(cat /var/run/nginx.pid) | grep -c 'nginx: worker'

# Compare against configured worker_processes
# (anchored so commented-out lines are ignored; no output means the default of 1)
nginx -T 2>/dev/null | grep -m1 -E '^[[:space:]]*worker_processes'

# Check for recent reload failure evidence
tail -1000 /var/log/nginx/error.log | grep -E '\[emerg\]'

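# Check for reload events in the error log
# ("reconfiguring" is logged at notice level, so error_log must be set to notice or a more verbose level)
tail -1000 /var/log/nginx/error.log | grep 'reconfiguring'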

# Verify worker_shutdown_timeout is configured (ignores commented-out lines)
nginx -T 2>/dev/null | grep -m1 -E '^[[:space:]]*worker_shutdown_timeout'

# Check file descriptor usage per worker (old workers hold FDs)
for pid in $(pgrep -a -P $(cat /var/run/nginx.pid) | grep 'nginx: worker' | awk '{print $1}'); do
  echo "Worker $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) FDs"
done
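
The comparison between running and configured workers is easier to automate if worker_processes is resolved to a number first. The following is a minimal sketch, not part of NGINX: directive_value and expected_workers are hypothetical helper names, and the sketch assumes the directive and its value sit on one line, as they do in typical configurations.

```shell
# Print the value of the first uncommented occurrence of a directive.
# Reads a configuration dump (for example `nginx -T` output) on stdin.
# directive_value is a hypothetical helper name, not an nginx command.
directive_value() {
  awk -v name="$1" '
    $1 == name {
      sub(/;.*$/, "", $2)   # drop the trailing semicolon and anything after it
      print $2
      exit
    }'
}

# Resolve worker_processes to a number of workers.
expected_workers() {
  value=$(directive_value worker_processes)
  case "$value" in
    "")   echo 1 ;;                       # directive absent: nginx defaults to one worker
    auto) getconf _NPROCESSORS_ONLN ;;    # auto: one worker per online CPU
    *)    echo "$value" ;;
  esac
}
```

Usage would look like `nginx -T 2>/dev/null | expected_workers`, with the result compared against the pgrep count above. The same helper can read other main-context directives, for example `nginx -T 2>/dev/null | directive_value worker_shutdown_timeout`.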

How to diagnose it

  1. Run nginx -t before reloading. If it reports [emerg], fix the named file and line before proceeding. Do not rely on nginx -s reload to surface validation errors; the master validates asynchronously and logs failures to the error log.
  2. Execute nginx -s reload, then immediately tail the error log for [emerg]. A reload that the master rejects prints nothing to the terminal, and the "configuration file ... test failed" summary appears only in nginx -t output, not in the error log.
  3. Count worker processes. If the count exceeds worker_processes, old workers are still alive.
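  4. Identify old workers. A worker that is draining changes its process title to "nginx: worker process is shutting down", so the pgrep -a output shows it directly. PID order is only a hint: newer workers usually have higher PIDs because Linux allocates them sequentially, but PIDs can wrap. Old workers that still hold connections also keep their sockets open in /proc/<pid>/fd.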
  5. Check whether worker_shutdown_timeout is present in the configuration. Without it, NGINX waits indefinitely for connections to close.
  6. If old workers remain, determine what connections they hold. WebSocket, gRPC streaming, or SSE connections prevent graceful exit. Inspect /proc/<pid>/fd to confirm sockets are still open.
  7. Calculate reload frequency from the error log. If reloads occur more than once per minute, rapid accumulation is likely compounding the problem.
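
The reload-frequency calculation in the last step can be sketched with awk. This assumes the default error log line format (date, then time, then level) and an error_log level of notice or more verbose; reloads_per_minute is a hypothetical helper name. It matches the SIGHUP line rather than the bare word "reconfiguring", so that a reload which logs both the signal and a separate "reconfiguring" line is counted once.

```shell
# Count reloads per minute from nginx error log lines read on stdin.
# Assumes lines shaped like:
#   2024/01/15 10:23:05 [notice] 100#100: signal 1 (SIGHUP) received from 200, reconfiguring
reloads_per_minute() {
  awk '/\(SIGHUP\) received/ {
         minute = $1 " " substr($2, 1, 5)   # "YYYY/MM/DD HH:MM"
         count[minute]++
       }
       END { for (m in count) print m, count[m] }' | sort
}
```

Usage would look like `tail -5000 /var/log/nginx/error.log | reloads_per_minute`. Any minute with a count above 1 matches the "more than once per minute" threshold from the step above.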

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Worker process count vs configured | Old workers accumulate when connections do not drain | Count exceeds worker_processes for more than 60 seconds after reload |
| Error log [emerg] rate | Failed reloads log [emerg] but do not stop NGINX | Any [emerg] entry following a reload event |
| Reload frequency | Each reload spawns a new generation; rapid reloads multiply old workers | More than 1 reload per minute sustained |
| File descriptor usage per worker | Old workers consume FDs while draining connections | FD count remains high on specific worker PIDs long after reload |
| Active connections | Brief elevation during reload is normal; persistent elevation is not | Active connections stay high with low request throughput |

Fixes

Fix the configuration error

If nginx -t fails, the [emerg] message names the file, line, and directive. Resolve the error, then reload. Verify with nginx -t after editing.
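
When the test output is long, pulling out just the file and line can speed up the fix. A small sketch, assuming the common message shape that ends in "in <file>:<line>"; emerg_locations is a hypothetical helper name, and [emerg] messages without a location (a failed bind, for example) are skipped.

```shell
# Extract "file line" pairs from [emerg] messages read on stdin.
# Lines that do not end in "in <file>:<line>" produce no output.
emerg_locations() {
  sed -n 's/.*\[emerg\].* in \(.*\):\([0-9][0-9]*\)$/\1 \2/p'
}
```

Usage would look like `nginx -t 2>&1 | emerg_locations`, since nginx -t writes its diagnostics to stderr.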

Force old workers to exit with worker_shutdown_timeout

Add worker_shutdown_timeout 120s; in the main context of nginx.conf. This sets a hard deadline for old workers to finish active connections. After the timeout expires, NGINX closes all remaining connections on the old workers and they exit. Tradeoff: legitimate long-lived connections (WebSocket, gRPC, SSE) are severed. Do not set this lower than your longest acceptable request duration without accepting that those requests will disconnect.
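
One way to confirm the timeout is doing its job is to poll after a reload until no draining workers remain. The sketch below relies on the process title NGINX gives a draining worker ("nginx: worker process is shutting down"); list_draining_workers and wait_for_drain are hypothetical helper names, and the deadline you pass should be a little longer than your worker_shutdown_timeout.

```shell
# Print the PIDs of workers that are draining.
# Hypothetical hook: replace it if your process titles differ.
list_draining_workers() {
  pgrep -f 'nginx: worker process is shutting down'
}

# Poll until no draining workers remain or the deadline (seconds) has passed.
# Usage: wait_for_drain DEADLINE [INTERVAL]
# Returns 0 when all old workers have exited, 1 on timeout.
wait_for_drain() {
  deadline=$1
  interval=${2:-5}
  waited=0
  while [ -n "$(list_draining_workers)" ]; do
    [ "$waited" -ge "$deadline" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

With worker_shutdown_timeout 120s, something like `nginx -s reload && wait_for_drain 130 || echo "old workers still draining" >&2` would flag a worker that outlives the deadline.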

Reduce reload frequency

Batch configuration changes rather than reloading after each edit. In Kubernetes ingress environments, reduce object churn to avoid reload storms. Each reload has a cost: new worker spawn, connection draining delay, brief latency bump, and potential memory growth from overlapping generations.

Manually terminate stuck old workers

If memory or FD exhaustion is critical and old workers have been draining longer than your operational window allows, identify the old worker PIDs and terminate them: send SIGTERM first, and fall back to SIGKILL only if a worker does not exit. Warning: either signal drops active connections on those workers immediately. This is a last resort when graceful cleanup is not possible.
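
A hedged sketch of that last resort follows. It selects only workers whose process title says they are shutting down, sends SIGTERM, and uses SIGKILL for any that survive a grace period. draining_worker_pids, terminate_pids, and GRACE_SECONDS are hypothetical names introduced for this example; review the PID list before piping it into the second function.

```shell
# Read `pgrep -a` style lines ("PID command line") on stdin and print the PIDs
# of workers that nginx has marked as shutting down.
draining_worker_pids() {
  awk '/nginx: worker process is shutting down/ { print $1 }'
}

# Read PIDs on stdin, send SIGTERM to each, then SIGKILL any that are still
# alive after GRACE_SECONDS (default 10). Active connections are dropped.
terminate_pids() {
  grace=${GRACE_SECONDS:-10}
  pids=$(cat)
  [ -z "$pids" ] && return 0
  for pid in $pids; do
    kill -TERM "$pid" 2>/dev/null || true
  done
  sleep "$grace"
  for pid in $pids; do
    if kill -0 "$pid" 2>/dev/null; then
      kill -KILL "$pid" 2>/dev/null || true
    fi
  done
}
```

Usage would look like `pgrep -a -P "$(cat /var/run/nginx.pid)" | draining_worker_pids | terminate_pids`. Workers running the current configuration keep the plain "nginx: worker process" title and are left alone.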

Prevention

  • Always run nginx -t before nginx -s reload. Automation and deployment scripts should test first and abort on any [emerg].
  • Set worker_shutdown_timeout to an upper bound that matches your longest expected legitimate connection, but cap it to prevent infinite drift.
  • Monitor worker count against worker_processes. Any sustained excess after reload indicates stuck old workers.
  • Monitor error logs for [emerg] after every reload. Treat any post-reload [emerg] as a ticket-worthy event.
  • Batch dynamic configuration changes. In environments with frequent updates, rate-limit reloads.
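
The first and fourth points can be combined into one reload wrapper. This is a minimal sketch under stated assumptions: the error log path is the one used in this guide, the two-second settle time is a guess to tune, and safe_reload, emerg_since, ERROR_LOG, and SETTLE_SECONDS are hypothetical names. It records the log size before the reload and inspects only the bytes written afterwards, so older [emerg] entries do not cause false alarms. If the log is rotated between the two steps, the check sees nothing.

```shell
ERROR_LOG=${ERROR_LOG:-/var/log/nginx/error.log}   # assumed path; adjust to your error_log

# Print [emerg] lines written to a log file after a given byte offset.
# Usage: emerg_since FILE OFFSET
emerg_since() {
  tail -c "+$(( $2 + 1 ))" "$1" | grep -F '[emerg]'
}

# Test, reload, then fail loudly if the master logged [emerg] after the reload.
safe_reload() {
  nginx -t || return 1                     # abort before signalling the running master
  offset=$(wc -c < "$ERROR_LOG")           # remember where the log ended
  nginx -s reload || return 1
  sleep "${SETTLE_SECONDS:-2}"             # give the master time to re-read the config
  if emerg_since "$ERROR_LOG" "$offset"; then
    echo "reload rejected: master kept the previous configuration" >&2
    return 1
  fi
}
```

A deployment script would call `safe_reload || exit 1` in place of a bare nginx -s reload, so that a rejected configuration stops the pipeline.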

How Netdata helps

Netdata surfaces the signals that reveal silent reload failures:

  • Worker process count: Compare running workers against the configured value. A sustained excess after reload flags stuck old workers.
  • Error log severity: Track [emerg] and [error] rates in real time. A spike immediately after a reload indicates a silent rollback.
  • Active connections: Correlate stub_status metrics to see whether connections are draining or accumulating after a reload.
  • File descriptor utilization: Per-worker FD usage reveals when old workers are still holding resources long after they should have exited.
  • Reload frequency: Custom log monitoring can count reconfiguring lines to detect reload storms from configuration management.