NGINX monitoring maturity model: from survival to expert
nginx exposes exactly seven scalars through stub_status. Latency distributions, upstream health, cache efficiency, and kernel-level drops live in access logs, error logs, or OS counters. Teams that collect only the stub_status numbers assume they have visibility. They do not.
This article defines four monitoring maturity levels. Level 1 tells you if nginx is alive. Level 2 tells you if it is healthy. Level 3 gives you leading indicators of saturation. Level 4 exposes the blind spots that only appear after repeated incidents. Use these levels to audit your current coverage and decide which signals to add next.
The levels are cumulative. Do not build Level 4 dashboards until Level 1 is automated and paging. A team that cannot reliably detect a dead master or a 5xx spike will not benefit from tracking per-worker connection imbalance. Start at the bottom and move up.
flowchart TD
L1["Level 1: Survival"] --> L2["Level 2: Operational"]
L2 --> L3["Level 3: Mature"]
L3 --> L4["Level 4: Expert"]

Level 1 — Survival
The absolute minimum. If any of these signals fail, users are already affected or nginx is down entirely.
- Master and worker process liveness. The master reads configuration and spawns workers; workers handle all traffic. Verify the master with `kill -0 $(cat /var/run/nginx.pid)` and count workers via `pgrep -c -P $(cat /var/run/nginx.pid)`. A dead master means no new workers will spawn. A worker crash loop is visible in `dmesg` as OOM kills or segfaults.
- Listening port responsiveness. Confirm that nginx holds listening sockets with `ss -tlnp | grep nginx`. Even a successful TCP handshake only means the kernel accepted the connection into the backlog. If all workers are blocked, the HTTP request still stalls, so treat this as a necessary but insufficient check.
- Active connections. The `Active connections:` line from `stub_status` is a point-in-time capacity gauge. It includes keepalive idle connections, so interpret it against the configured `worker_connections` limit.
- HTTP 5xx response rate. Derived from access log `$status`. Sustained rates above 1 percent warrant investigation; rates above 5 percent on meaningful traffic volume indicate an active outage. Distinguish 502 (upstream down, refusing connections, or no live upstreams), 504 (upstream too slow), and 503 (request or connection limiting, or an upstream that itself returns 503).
- Error log `[emerg]`, `[alert]`, and `[crit]` rate. Check recent lines with `grep -cE '\[(emerg|alert|crit)\]'`. A failed reload logs `[emerg]`, but the previous configuration stays active. Correlate with process state and 5xx rate before paging.
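As a rough illustration of the 5xx thresholds above, the following Python sketch classifies one window of status codes. It assumes the codes have already been parsed from the access log into integers. The function name `summarize_5xx` and the `min_requests` floor are hypothetical, added here so that a near-idle server does not trip the ratio on a handful of requests.

```python
from collections import Counter

# Thresholds taken from the text above; tune them for your own traffic.
INVESTIGATE_RATIO = 0.01
OUTAGE_RATIO = 0.05


def summarize_5xx(statuses, min_requests=100):
    """Summarize the 5xx share of one window of HTTP status codes.

    `statuses` is an iterable of integer status codes. `min_requests` is an
    assumed floor below which the ratio is considered too noisy to act on.
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if 500 <= code <= 599)
    ratio = errors / total if total else 0.0

    if total < min_requests:
        level = "insufficient-traffic"
    elif ratio > OUTAGE_RATIO:
        level = "outage"
    elif ratio > INVESTIGATE_RATIO:
        level = "investigate"
    else:
        level = "ok"

    return {
        "total": total,
        "ratio": ratio,
        "level": level,
        # Keep the gateway codes separate: they point at different causes.
        "by_code": {code: counts.get(code, 0) for code in (502, 503, 504)},
    }
```

Returning the 502, 503, and 504 counts alongside the ratio keeps the cause visible in the same alert, rather than requiring a second query during an incident.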
Level 2 — Operational
A competent production team monitors throughput, latency, and resource saturation. These signals separate an internal tool from a public-facing service under load.
- Requests per second. Derived from the cumulative `requests` counter in `stub_status`. A sudden drop indicates upstream failure or network partition; a spike may signal abuse or a retry storm from a client.
- Connection state breakdown. `stub_status` splits active connections into Reading, Writing, and Waiting. High Writing with low request rate indicates slow upstreams or slow clients. Sustained Reading above 20 percent of active connections suggests a slowloris-style attack or pathological network degradation.
- Dropped connections (`accepts - handled` delta). If the cumulative `accepts` counter exceeds `handled`, nginx is dropping connections. This happens when `worker_connections` or file descriptor limits are reached. Track the rate of gap growth, not the absolute cumulative difference.
- Latency percentiles. Log `$request_time` and `$upstream_response_time` to compute p50, p95, and p99. `$request_time` includes time spent sending the response to the client; a slow mobile client can inflate it even when upstream is fast. Always compare it with `$upstream_response_time` to isolate backend latency.
- HTTP status code distribution. Break out 4xx and 5xx. Monitor 499 specifically; it is an nginx-specific code meaning the client closed the connection before the response completed. A 499 spike is the canary for user impatience before official timeouts trigger 5xx.
- File descriptor utilization per worker. Each client connection, upstream socket, log file, and temp file consumes an FD. Check the count under `/proc/$pid/fd` against the `Max open files` limit in `/proc/$pid/limits`. Default OS limits are often 1024, which is dangerously low for a reverse proxy where each request can consume multiple handles.
- Worker CPU and RSS. Per-worker CPU via `ps -o %cpu=` and RSS via the `VmRSS` field in `/proc/$pid/status`. One worker at 100 percent while others idle indicates load imbalance, often from `reuseport` hash collisions when traffic comes from a small number of source IPs.
- Error log rate by severity. Baseline the rate of `[error]` and `[warn]` messages. During an incident, error log volume can spike 100-1000x, which itself generates disk I/O pressure that slows the event loop.
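The `stub_status` counters are cumulative, so requests per second and dropped connections come from the difference between two samples. A minimal sketch follows, assuming the standard four-line text output of `stub_status`. The names `parse_stub_status` and `rates` are hypothetical, and fetching the status page is left out.

```python
import re

# Matches the plain-text stub_status page across its four lines.
STUB_RE = re.compile(
    r"Active connections:\s*(?P<active>\d+)\s+"
    r"server accepts handled requests\s+"
    r"(?P<accepts>\d+)\s+(?P<handled>\d+)\s+(?P<requests>\d+)\s+"
    r"Reading:\s*(?P<reading>\d+)\s+"
    r"Writing:\s*(?P<writing>\d+)\s+"
    r"Waiting:\s*(?P<waiting>\d+)"
)


def parse_stub_status(text):
    """Return the seven stub_status scalars as a dict of ints."""
    match = STUB_RE.search(text)
    if match is None:
        raise ValueError("unrecognized stub_status output")
    return {name: int(value) for name, value in match.groupdict().items()}


def rates(prev, curr, interval_seconds):
    """Per-second rates between two samples taken interval_seconds apart."""
    if curr["requests"] < prev["requests"]:
        # Counters went backwards (for example after a restart): skip this interval.
        return None
    drops_prev = prev["accepts"] - prev["handled"]
    drops_curr = curr["accepts"] - curr["handled"]
    return {
        "requests_per_second": (curr["requests"] - prev["requests"]) / interval_seconds,
        # Growth of the accepts/handled gap, not its absolute size.
        "dropped_per_second": (drops_curr - drops_prev) / interval_seconds,
        "reading_share": curr["reading"] / curr["active"] if curr["active"] else 0.0,
    }
```

Alerting on `dropped_per_second` rather than on `accepts - handled` directly follows the advice above: a gap left over from an old incident stays in the cumulative counters long after the incident ends.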
Level 3 — Mature
Full coverage with leading indicators and upstream visibility. At this level you stop reacting to outages and start catching saturation before it drops connections.
- Upstream connect and header time. `$upstream_connect_time` reveals TCP and TLS handshake cost. A value of `0.000` indicates keepalive pool reuse. `$upstream_header_time` isolates backend processing time from response body transfer. High connect time with low header time means the network or pool is the problem, not the application.
- Per-upstream-server metrics. Parse `$upstream_addr` from access logs to aggregate latency and error rates per backend. One degraded server can hide inside a healthy average. Open-source nginx has no native per-upstream API, so access log parsing is the only option.
- Cache hit rate and status distribution. Log `$upstream_cache_status` to track HIT, MISS, STALE, EXPIRED, and BYPASS. A sudden STALE spike can mask an upstream outage; a sudden MISS spike may indicate a cache purge, zone exhaustion, or mass TTL expiration.
- Listen socket backlog and `TcpExtListenOverflows`. Use `ss -tlnp` to read Recv-Q (current queue depth) and Send-Q (configured backlog). Monitor `nstat -az TcpExtListenOverflows`; any nonzero increasing rate means the kernel is dropping connections before nginx sees them. This produces zero evidence in nginx logs.
- Worker connection slot utilization. Calculate `active_connections / (worker_connections * worker_processes)`. The default `worker_connections` is 512, not 1024. For reverse proxy, effective capacity is roughly half because each request uses two connection slots. Account for keepalive idle connections in your headroom.
- Reload frequency and success. Track `reconfiguring` messages and `configuration file .* test failed` in error logs. A failed reload leaves the old configuration active, creating silent configuration drift. Frequent reloads without `worker_shutdown_timeout` cause old workers to accumulate on long-lived connections.
- SSL session cache hit rate. Log `$ssl_session_reused`; `"r"` means the session was reused. Low hit rates drive CPU saturation from full TLS handshakes. Size the shared cache for connection rate multiplied by `ssl_session_timeout`.
- Rate limit rejection rate. The default rejection status is 503, not 429. Monitor error log entries containing `limiting requests` and verify that legitimate traffic is not blocked.
- Request and response size distributions. Log `$request_length` and `$body_bytes_sent` to detect shifts that affect buffer usage, compression CPU, and bandwidth.
- Disk space on log partitions. Under incident conditions, log volume growth can fill the partition, causing writes to block the event loop and increasing latency for all requests.
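To make the per-upstream-server idea concrete, here is a small Python sketch that groups latency and 5xx rate by backend. It assumes the access log format already includes `$upstream_addr`, `$upstream_response_time`, and `$upstream_status`, and that each log line has been split into those three raw strings. The function names and the nearest-rank percentile are illustrative choices, not something nginx prescribes.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def per_upstream(records):
    """Aggregate request count, p95 latency, and 5xx rate per backend.

    Each record is an (upstream_addr, upstream_response_time, upstream_status)
    tuple of raw strings taken from one access log line.
    """
    times = defaultdict(list)
    errors = defaultdict(int)
    for addrs, rtimes, statuses in records:
        if " : " in addrs:
            continue  # internal redirects between server groups: not handled here
        # On retries nginx logs one comma-separated value per attempt.
        for addr, rtime, status in zip(
            addrs.split(", "), rtimes.split(", "), statuses.split(", ")
        ):
            if "-" in (addr, rtime, status):
                continue  # no upstream contacted, or no value for this attempt
            times[addr].append(float(rtime))
            if status.startswith("5"):
                errors[addr] += 1
    return {
        addr: {
            "requests": len(vals),
            "p95": percentile(vals, 95),
            "error_rate": errors[addr] / len(vals),
        }
        for addr, vals in times.items()
    }
```

Splitting retried requests into one sample per attempt matters: otherwise the latency and failure of the first backend are attributed to whichever server finally answered.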
Level 4 — Expert
These signals are added after the third or fourth major incident reveals a blind spot. They explain why aggregate metrics looked green while users suffered.
- Per-worker load imbalance. Compare CPU and active connection counts across individual workers. With `reuseport`, a small number of client IPs can hash to a single worker, creating a hot spot that aggregate metrics miss.
- Kernel conntrack table utilization. If the host runs iptables or nftables with connection tracking, compare `conntrack -C` against `/proc/sys/net/netfilter/nf_conntrack_max`. A full table drops packets silently and mimics upstream connectivity failures.
- Per-location-block latency and error rate. Use custom log variables or separate access logs per location block to isolate which routes degrade. Open-source nginx provides no native per-location metrics.
- Rate limit zone effective capacity. Estimate unique keys against zone size; `limit_req_zone` entries consume roughly 128 bytes each. When the zone is exhausted, nginx removes the least recently used state, so evicted clients effectively get a fresh allowance. If a new state still cannot be created, nginx logs `could not allocate node` and terminates the request with an error.
- Old worker accumulation after reloads. Count total worker processes against `worker_processes`. Old workers from previous reloads linger indefinitely on WebSocket, gRPC streaming, or long-polling connections unless `worker_shutdown_timeout` is set.
- Temp file creation rate. When proxy or client body buffers overflow, nginx writes to disk. There is no native metric for this. Watch for persistent gaps between `$request_time` and `$upstream_response_time`, or monitor disk I/O on the mount point used by `proxy_temp_path`.
- DNS resolution latency. For variable-based `proxy_pass` with the `resolver` directive, a failing lookup waits asynchronously for up to `resolver_timeout` (default 30s) before the request returns 502. Log resolver errors separately from upstream errors.
- `$upstream_status` versus `$status` discrepancy. An upstream may return 500 while nginx serves 200 from stale cache or a custom `error_page`. Logging both reveals when failures are masked by nginx-layer resilience.
- TIME_WAIT and ephemeral port utilization. Without effective keepalive reuse, every proxied request opens and closes a TCP connection to upstream. Check `ss -tan state time-wait` per destination to detect port exhaustion before `connect() failed (99)` appears.
- TLS version and cipher distribution shifts. Log `$ssl_protocol` and `$ssl_cipher` to detect downgrade attacks or client population changes that affect CPU load.
- Stale cache serving rate. A sustained high STALE rate with no corresponding MISS spike suggests upstream is down and `proxy_cache_use_stale` is hiding the failure from standard error metrics. Verify upstream health independently before celebrating the resilience.
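The `$upstream_status` versus `$status` comparison can be sketched as a simple counter over parsed log lines. This Python example assumes both variables are present in the log format and have been extracted as raw strings. The name `masked_failures` is hypothetical, and treating the last logged attempt as the decisive one is a simplifying assumption.

```python
def masked_failures(records):
    """Count proxied requests where an upstream 5xx never reached the client.

    Each record is a (status, upstream_status) pair of raw log strings.
    """
    proxied = masked = 0
    for status, upstream_status in records:
        if upstream_status in ("", "-"):
            continue  # answered without contacting an upstream
        proxied += 1
        # With retries or internal redirects, upstream_status holds one value
        # per attempt; this sketch only looks at the last one.
        last = upstream_status.replace(" : ", ", ").split(", ")[-1]
        if last.startswith("5") and not status.startswith("5"):
            masked += 1
    return {
        "proxied": proxied,
        "masked": masked,
        "masked_ratio": masked / proxied if proxied else 0.0,
    }
```

A rising `masked_ratio` while the client-facing 5xx rate stays flat is the pattern described above: the resilience layer is working, and the upstream behind it is not.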
How Netdata helps
Netdata automates the collection and correlation of signals across all four levels on a single node.
- It scrapes `stub_status` and derives requests per second, connection state breakdown, and dropped connections without manual log parsing.
- It correlates nginx access log metrics, including latency percentiles, response codes, and upstream times, with OS-level signals like `TcpExtListenOverflows`, per-worker CPU, and file descriptor usage.
- It tracks per-process file descriptor consumption alongside configured `worker_connections`, surfacing when the OS limit becomes the real bottleneck before connection slots max out.
- It monitors error log severity rates and surfaces spikes alongside upstream status code changes or cache status anomalies.
- It visualizes per-worker CPU and memory RSS imbalance, helping identify `reuseport` hot spots or stuck old workers after frequent reloads.