NGINX reload not applying config: why old workers keep serving
You pushed a config change, ran nginx -s reload, and moved on. Hours later, the new certificate is not being served, the updated upstream is not receiving traffic, or the tightened rate limit never took effect. NGINX did not stop running, but the reload never applied. This is the silent rollback: when a reload fails validation, the master keeps the previous configuration active and old workers continue serving. Even when validation passes, old workers can remain alive for hours if long-lived connections prevent them from draining and worker_shutdown_timeout is not set. This guide shows how to confirm the failure, find the root cause, and prevent config drift from going undetected.
What this means
When you run nginx -s reload, the master validates the new configuration. If validation fails, the master rolls back silently and continues serving with the old configuration. There is no stdout error; a script that only checks the exit code misses the failure. The only evidence is an [emerg] line in the error log.
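Because the failure is recorded only in the error log, one way to catch it is to note how long the log was before the reload and scan only the lines written afterwards. The sketch below is one possible approach, not an NGINX feature: `new_emerg_lines` is a hypothetical helper name, and it assumes the log is not rotated between the two steps.

```shell
# Hypothetical helper: print [emerg] lines that were appended to a log
# after its first <lines_before> lines. The exit status is grep's:
# 0 if at least one such line exists, 1 if none.
new_emerg_lines() {
    log_file=$1
    lines_before=$2
    # tail -n +K starts printing at line K, so this skips everything
    # that was already in the log before the reload.
    tail -n "+$((lines_before + 1))" "$log_file" | grep -F '[emerg]'
}

# Usage sketch (not run here; assumes no log rotation in between):
#   before=$(wc -l < /var/log/nginx/error.log)
#   nginx -s reload; sleep 1
#   if new_emerg_lines /var/log/nginx/error.log "$before"; then
#       echo "reload was rejected" >&2
#   fi
```

The one-second sleep is an arbitrary allowance for the master to process the signal; a deployment script may need a longer wait.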
If validation succeeds, the master spawns new workers with the new configuration. Old workers stop accepting new connections but continue processing existing ones until they drain. Without worker_shutdown_timeout, old workers wait indefinitely for long-lived connections to close. During this window, both old and new workers coexist. If you reload frequently or connections never close, workers accumulate and consume memory and file descriptors while still bound to the old config.
flowchart TD
A[Operator: nginx -s reload] --> B{Master tests config}
B -->|Fail| C["Write [emerg] to error log"]
C --> D[Rollback: old config continues]
B -->|Pass| E[Start new workers]
E --> F[Old workers drain connections]
F --> G{worker_shutdown_timeout?}
G -->|Not set| H[Old workers wait forever]
G -->|Set| I[Force close after timeout]
H --> J[Old workers accumulate]
I --> K[Old workers exit]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Config validation failure during reload | New behavior is missing after reload; error log shows [emerg] | nginx -t and /var/log/nginx/error.log for [emerg] |
| Long-lived connections blocking drain | Worker count exceeds worker_processes for minutes or hours after reload | pgrep -a -P $(cat /var/run/nginx.pid) \| grep -c 'nginx: worker' vs configured worker_processes |
| Missing worker_shutdown_timeout | Old workers persist indefinitely on WebSocket, gRPC, or SSE connections | nginx -T 2>/dev/null \| grep worker_shutdown_timeout |
| Rapid reload accumulation | Memory and FD usage climb; dozens of worker processes visible | Reload frequency in error log and total process count |
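Draining workers can be told apart from the current generation by process title: NGINX retitles a worker that is finishing its connections to `nginx: worker process is shutting down`. A minimal sketch that counts such workers in a process listing supplied on stdin (the function name is illustrative):

```shell
# Count old-generation workers in a process listing read from stdin.
# A draining worker carries the title "worker process is shutting down".
count_draining_workers() {
    # grep -c prints 0 but exits 1 when nothing matches; "|| true" keeps
    # the function usable under "set -e".
    grep -c 'nginx: worker process is shutting down' || true
}

# Usage sketch (not run here):
#   pgrep -a -P "$(cat /var/run/nginx.pid)" | count_draining_workers
```

A result that stays above zero long after a reload points at the long-lived connection or missing timeout rows in the table.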
Quick checks
# Validate configuration without applying it
nginx -t
# Count worker processes (excludes cache loader/manager)
pgrep -a -P $(cat /var/run/nginx.pid) | grep -c 'nginx: worker'
# Compare against configured worker_processes
nginx -T 2>/dev/null | grep -m1 'worker_processes'
# Check for recent reload failure evidence
tail -1000 /var/log/nginx/error.log | grep -E '\[emerg\]'
# Check for reload events in the error log (logged at [notice], so error_log must be at notice level or more verbose)
tail -1000 /var/log/nginx/error.log | grep 'reconfiguring'
# Verify worker_shutdown_timeout is configured
nginx -T 2>/dev/null | grep -m1 'worker_shutdown_timeout'
# Check file descriptor usage per worker (old workers hold FDs)
for pid in $(pgrep -a -P "$(cat /var/run/nginx.pid)" | awk '/nginx: worker/ {print $1}'); do
    echo "Worker $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) FDs"
done
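Comparing the running count with the configured value by eye gets tedious in scripts. The sketch below extracts the worker_processes value from an `nginx -T` dump on stdin. It assumes the directive sits on a single line, falls back to the NGINX default of 1 when the directive is absent, and resolves `auto` to the online CPU count as an approximation of what NGINX does. `configured_workers` is a hypothetical name.

```shell
# Print the effective worker_processes value from an `nginx -T` dump
# supplied on stdin.
configured_workers() {
    # Take the first uncommented worker_processes line and strip the
    # trailing semicolon from its argument.
    value=$(awk '$1 == "worker_processes" { sub(/;.*/, "", $2); print $2; exit }')
    case $value in
        '')   echo 1 ;;                          # directive absent: default is 1
        auto) getconf _NPROCESSORS_ONLN ;;       # assumption: auto = online CPUs
        *)    echo "$value" ;;
    esac
}

# Usage sketch (not run here):
#   expected=$(nginx -T 2>/dev/null | configured_workers)
#   running=$(pgrep -a -P "$(cat /var/run/nginx.pid)" | grep -c 'nginx: worker')
#   [ "$running" -gt "$expected" ] && echo "old workers still alive" >&2
```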
How to diagnose it
- Run nginx -t before reloading. If it reports [emerg], fix the named file and line before proceeding. Do not rely on nginx -s reload to surface validation errors; the master validates asynchronously and logs failures to the error log.
- Execute nginx -s reload, then immediately tail the error log for [emerg] or configuration file .* test failed. A failed reload emits no stdout error.
- Count worker processes. If the count exceeds worker_processes, old workers are still alive.
- Identify old workers by PID. Newer workers usually have higher PIDs (Linux allocates sequentially, but PIDs can wrap). Compare FD counts across workers; old workers holding active connections show elevated FD usage in /proc/<pid>/fd.
- Check whether worker_shutdown_timeout is present in the configuration. Without it, NGINX waits indefinitely for connections to close.
- If old workers remain, determine what connections they hold. WebSocket, gRPC streaming, or SSE connections prevent graceful exit. Inspect /proc/<pid>/fd to confirm sockets are still open.
- Calculate reload frequency from the error log. If reloads occur more than once per minute, rapid accumulation is likely compounding the problem.
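For the socket check in the steps above, a small helper can count how many descriptors in a worker's fd directory point at sockets. This is a sketch that assumes the Linux /proc layout, where each socket descriptor is a symlink whose target starts with `socket:`. Reading another process's fd directory usually requires root or the worker's own user. The function name is illustrative.

```shell
# Count descriptors in a /proc/<pid>/fd-style directory that are sockets.
count_socket_fds() {
    fd_dir=$1
    count=0
    for fd in "$fd_dir"/*; do
        # Each entry is a symlink; skip anything that cannot be resolved
        # (including the unexpanded glob when the directory is empty).
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            socket:*) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Usage sketch (not run here), where 12345 stands for an old worker's PID:
#   count_socket_fds /proc/12345/fd
```

The count includes listening sockets inherited from the master, so a small nonzero baseline is expected even on an idle worker.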
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Worker process count vs configured | Old workers accumulate when connections do not drain | Count exceeds worker_processes for more than 60 seconds after reload |
| Error log [emerg] rate | Failed reloads log [emerg] but do not stop NGINX | Any [emerg] entry following a reload event |
| Reload frequency | Each reload spawns a new generation; rapid reloads multiply old workers | More than 1 reload per minute sustained |
| File descriptor usage per worker | Old workers consume FDs while draining connections | FD count remains high on specific worker PIDs long after reload |
| Active connections | Brief elevation during reload is normal; persistent elevation is not | Active connections stay high with low request throughput |
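The reload frequency signal can be derived from the error log itself. The sketch below buckets reload events by minute. It assumes the error log is written at notice level or more verbose, that lines start with the default `YYYY/MM/DD HH:MM:SS` timestamp, and that the master records each reload with a line ending in `: reconfiguring`. The signal-received line also contains the word, so it is excluded to avoid double counting. `reloads_per_minute` is a hypothetical name.

```shell
# Read an NGINX error log on stdin and print "date HH:MM count" for each
# minute in which the master logged a reload.
reloads_per_minute() {
    # Keep only the master's bare "reconfiguring" entry, then reduce each
    # line to its date and HH:MM so identical minutes can be counted.
    grep -E '\[notice\] [0-9]+#[0-9]+: reconfiguring$' |
        awk '{ print $1, substr($2, 1, 5) }' |
        sort | uniq -c |
        awk '{ print $2, $3, $1 }'
}

# Usage sketch (not run here):
#   tail -n 5000 /var/log/nginx/error.log | reloads_per_minute
```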
Fixes
Fix the configuration error
If nginx -t fails, the [emerg] message names the file, line, and directive. Resolve the error, then reload. Verify with nginx -t after editing.
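In automation, the test and the reload can be tied together so that a failed test never reaches the running master. A minimal sketch, where `safe_reload` and the `NGINX_BIN` override are hypothetical names introduced for illustration:

```shell
# Hypothetical wrapper: reload only if the configuration test passes.
# NGINX_BIN can be overridden, for example to point at a non-default path.
NGINX_BIN=${NGINX_BIN:-nginx}

safe_reload() {
    # -t parses and checks the configuration without signalling the
    # running master.
    if ! "$NGINX_BIN" -t; then
        echo "config test failed; reload skipped" >&2
        return 1
    fi
    "$NGINX_BIN" -s reload
}

# Usage sketch (not run here):
#   safe_reload || exit 1
```

A passing test does not guarantee that the master accepts the reload, so checking the error log for [emerg] afterwards is still worthwhile.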
Force old workers to exit with worker_shutdown_timeout
Add worker_shutdown_timeout 120s; in the main context of nginx.conf. This sets a hard deadline for old workers to finish active connections. After the timeout expires, NGINX closes all remaining connections on the old workers and they exit. Tradeoff: legitimate long-lived connections (WebSocket, gRPC, SSE) are severed. Do not set this lower than your longest acceptable request duration without accepting that those requests will disconnect.
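After adding the directive, it is worth confirming that the effective configuration actually contains it, since an include file or a templating step can drop it. The sketch below reads an `nginx -T` dump on stdin and assumes the directive is written on one line; `shutdown_timeout_value` is a hypothetical name.

```shell
# Print the configured worker_shutdown_timeout from an `nginx -T` dump on
# stdin. Returns 1 and prints nothing when the directive is absent.
shutdown_timeout_value() {
    value=$(awk '$1 == "worker_shutdown_timeout" { sub(/;.*/, "", $2); print $2; exit }')
    [ -n "$value" ] || return 1
    echo "$value"
}

# Usage sketch (not run here):
#   nginx -T 2>/dev/null | shutdown_timeout_value \
#       || echo "worker_shutdown_timeout is not set" >&2
```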
Reduce reload frequency
Batch configuration changes rather than reloading after each edit. In Kubernetes ingress environments, reduce object churn to avoid reload storms. Each reload has a cost: new worker spawn, connection draining delay, brief latency bump, and potential memory growth from overlapping generations.
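Where reloads are triggered by scripts, a simple debounce can enforce a minimum gap between them. The following is a sketch under stated assumptions, not a feature of NGINX: `reload_allowed`, the stamp file path, and the 60-second interval are all illustrative choices.

```shell
# Hypothetical debounce: succeed only if at least <min_interval> seconds
# have passed since the last recorded reload, and record the new time.
# <now> is optional and defaults to the current epoch time.
reload_allowed() {
    stamp_file=$1
    min_interval=$2
    now=${3:-$(date +%s)}
    last=0
    if [ -f "$stamp_file" ]; then
        last=$(cat "$stamp_file")
    fi
    # Too soon since the previous reload: refuse without touching the stamp.
    if [ $((now - last)) -lt "$min_interval" ]; then
        return 1
    fi
    echo "$now" > "$stamp_file"
}

# Usage sketch (not run here):
#   reload_allowed /run/nginx-reload.stamp 60 && nginx -s reload
```

A caller that gets a refusal should retry later rather than discard the change, otherwise the last edit in a burst would never be applied.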
Manually terminate stuck old workers
If memory or FD exhaustion is critical and old workers have been draining longer than your operational window allows, identify the old worker PIDs and terminate them: send SIGTERM first, and use SIGKILL only if a worker does not exit. Warning: either signal drops active connections on those workers immediately. This is a last resort when graceful cleanup is not possible.
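To build the candidate list, one option is to filter a process listing for draining workers above an age threshold. The sketch below expects rows of PID, elapsed seconds, and command on stdin, which on Linux with procps can be produced by `ps -C nginx -o pid= -o etimes= -o args=`. Note the limitation: age is measured from process start, not from the moment draining began, so the output is a list to review, not a kill list. `stuck_worker_pids` and the 3600-second threshold are illustrative.

```shell
# Print PIDs of draining workers whose process age is at least <min_age>
# seconds. Expects "PID ELAPSED_SECONDS COMMAND..." rows on stdin.
stuck_worker_pids() {
    min_age=$1
    awk -v min_age="$min_age" '
        /nginx: worker process is shutting down/ && $2 + 0 >= min_age + 0 { print $1 }
    '
}

# Usage sketch (not run here; review the PIDs before signalling them):
#   ps -C nginx -o pid= -o etimes= -o args= | stuck_worker_pids 3600
```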
Prevention
- Always run nginx -t before nginx -s reload. Automation and deployment scripts should test first and abort on any [emerg].
- Set worker_shutdown_timeout to an upper bound that matches your longest expected legitimate connection, but cap it to prevent infinite drift.
- Monitor worker count against worker_processes. Any sustained excess after reload indicates stuck old workers.
- Monitor error logs for [emerg] after every reload. Treat any post-reload [emerg] as a ticket-worthy event.
- Batch dynamic configuration changes. In environments with frequent updates, rate-limit reloads.
How Netdata helps
Netdata surfaces the signals that reveal silent reload failures:
- Worker process count: Compare running workers against the configured value. A sustained excess after reload flags stuck old workers.
- Error log severity: Track [emerg] and [error] rates in real time. A spike immediately after a reload indicates a silent rollback.
- Active connections: Correlates stub_status metrics to show whether connections are draining or accumulating after a reload.
- File descriptor utilization: Per-worker FD usage reveals when old workers are still holding resources long after they should have exited.
- Reload frequency: Custom log monitoring can count reconfiguring lines to detect reload storms from configuration management.