NGINX monitoring maturity model: from survival to expert

nginx exposes exactly seven scalars through stub_status. Latency distributions, upstream health, cache efficiency, and kernel-level drops live in access logs, error logs, or OS counters. Teams that collect only the stub_status numbers assume they have visibility. They do not.

This article defines four monitoring maturity levels. Level 1 tells you if nginx is alive. Level 2 tells you if it is healthy. Level 3 gives you leading indicators of saturation. Level 4 exposes the blind spots that only appear after repeated incidents. Use these levels to audit your current coverage and decide which signals to add next.

The levels are cumulative. Do not build Level 4 dashboards until Level 1 is automated and paging. A team that cannot reliably detect a dead master or a 5xx spike will not benefit from tracking per-worker connection imbalance. Start at the bottom and move up.

flowchart TD
    L1["Level 1: Survival"] --> L2["Level 2: Operational"]
    L2 --> L3["Level 3: Mature"]
    L3 --> L4["Level 4: Expert"]

Level 1 — Survival

The absolute minimum. If any of these signals fail, users are already affected or nginx is down entirely.

  • Master and worker process liveness. The master reads configuration and spawns workers; workers handle all traffic. Verify the master with kill -0 $(cat /var/run/nginx.pid) and count workers via pgrep -c -P $(cat /var/run/nginx.pid). A dead master means no new workers will spawn. A worker crash loop is visible in dmesg as OOM kills or segfaults.
  • Listening port responsiveness. Confirm that nginx holds listening sockets with ss -tlnp | grep nginx. A successful TCP handshake only means the kernel accepted the connection into the backlog. If all workers are blocked, the HTTP request still stalls, so treat this as a necessary but insufficient check.
  • Active connections. The Active connections: line from stub_status is a point-in-time capacity gauge. It includes keepalive idle connections, so interpret it against the configured worker_connections limit.
  • HTTP 5xx response rate. Derived from access log $status. Sustained rates above 1 percent warrant investigation; rates above 5 percent under real traffic indicate an active outage. Distinguish 502 (upstream down, refusing connections, or no live upstream servers), 504 (upstream too slow), and 503 (nginx rate or connection limiting, or a 503 passed through from the upstream).
  • Error log [emerg], [alert], and [crit] rate. Count recent lines in the error log with grep -cE '\[(emerg|alert|crit)\]'. A failed reload logs [emerg], but the previous configuration stays active. Correlate with process state and 5xx rate before paging.
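
As a rough illustration of the stub_status and 5xx checks in this level, the Python sketch below parses a stub_status body into its seven scalars and classifies a window of $status values against the 1 percent and 5 percent thresholds. The sample body, the function names, and the "no-traffic" label are assumptions made for this example; fetching the status page and tailing the access log are left out.

```python
import re

# Sample stub_status body; the numbers are invented for illustration.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text):
    """Return the seven stub_status scalars as a dict of ints."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The counters line is the only line made of exactly three integers.
    counters = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = (int(v) for v in counters.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


def classify_5xx(statuses, warn=0.01, outage=0.05):
    """Classify a window of integer status codes by its 5xx share.

    The thresholds mirror the 1 percent and 5 percent guidance above;
    an empty window is reported separately so it never looks healthy.
    """
    if not statuses:
        return "no-traffic"
    rate = sum(1 for s in statuses if 500 <= s <= 599) / len(statuses)
    if rate > outage:
        return "outage"
    if rate > warn:
        return "investigate"
    return "ok"
```

In practice the status window would come from the last minute or so of access log lines, and the result would feed an alert rule rather than being read by hand.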

Level 2 — Operational

A competent production team monitors throughput, latency, and resource saturation. These signals separate an internal tool from a public-facing service under load.

  • Requests per second. Derived from the cumulative requests counter in stub_status. A sudden drop indicates upstream failure or network partition; a spike may signal abuse or a retry storm from a client.
  • Connection state breakdown. stub_status splits active connections into Reading, Writing, and Waiting. High Writing with low request rate indicates slow upstreams or slow clients. Sustained Reading above 20 percent of active connections suggests a slowloris-style attack or pathological network degradation.
  • Dropped connections (accepts - handled delta). If the cumulative accepts counter exceeds handled, nginx is dropping connections. This happens when worker_connections or file descriptor limits are reached. Track the rate of gap growth, not the absolute cumulative difference.
  • Latency percentiles. Log $request_time and $upstream_response_time to compute p50, p95, and p99. $request_time includes time spent sending the response to the client; a slow mobile client can inflate it even when upstream is fast. Always compare it with $upstream_response_time to isolate backend latency.
  • HTTP status code distribution. Break out 4xx and 5xx. Monitor 499 specifically; it is an nginx-specific code meaning the client closed the connection before the response completed. A 499 spike is the canary for user impatience: clients give up before server-side timeouts start producing 5xx.
  • File descriptor utilization per worker. Each client connection, upstream socket, log file, and temp file consumes an FD. Check the count under /proc/$pid/fd against the Max open files limit in /proc/$pid/limits. Default OS limits are often 1024, which is dangerously low for a reverse proxy where each request can consume multiple handles.
  • Worker CPU and RSS. Per-worker CPU via ps -o %cpu= and RSS via /proc/$pid/status VmRSS. One worker at 100 percent while others idle indicates load imbalance, often from reuseport hash collisions when traffic comes from a small number of source IPs.
  • Error log rate by severity. Baseline the rate of [error] and [warn] messages. During an incident, error log volume can spike 100-1000x, which itself generates disk I/O pressure that slows the event loop.
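
The counter and latency signals in this level reduce to a little arithmetic. The sketch below is one possible way to turn two cumulative stub_status samples into per-second rates and to compute nearest-rank percentiles from logged $request_time values. The dictionary keys, function names, and the choice of the nearest-rank method are assumptions for illustration, not something nginx prescribes.

```python
import math


def counter_rates(prev, curr, interval_s):
    """Convert two cumulative stub_status samples into per-second rates.

    prev and curr are dicts with "accepts", "handled", and "requests".
    Returns None when a counter went backwards, which usually means
    nginx restarted between the two samples.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if any(curr[k] < prev[k] for k in ("accepts", "handled", "requests")):
        return None
    prev_gap = prev["accepts"] - prev["handled"]
    curr_gap = curr["accepts"] - curr["handled"]
    return {
        "requests_per_s": (curr["requests"] - prev["requests"]) / interval_s,
        # Growth of the accepts - handled gap, not its absolute size.
        "dropped_per_s": (curr_gap - prev_gap) / interval_s,
    }


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def non_upstream_time(request_time, upstream_response_time):
    """Approximate time spent outside the upstream call, in seconds.

    This is only a rough proxy for client transfer time: it also
    includes request reading and any nginx-side buffering.
    """
    return max(0.0, request_time - upstream_response_time)
```

For example, samples taken 10 seconds apart with requests moving from 5000 to 8000 yield 300 requests per second, and a large non_upstream_time alongside a small $upstream_response_time points at slow clients rather than a slow backend.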

Level 3 — Mature

Full coverage with leading indicators and upstream visibility. At this level you stop reacting to outages and start catching saturation before it drops connections.

  • Upstream connect and header time. $upstream_connect_time reveals TCP and TLS handshake cost. A value of 0.000 indicates keepalive pool reuse. $upstream_header_time isolates backend processing time from response body transfer. High connect time with low header time means the network or pool is the problem, not the application.
  • Per-upstream-server metrics. Parse $upstream_addr from access logs to aggregate latency and error rates per backend. One degraded server can hide inside a healthy average. Open-source nginx has no native per-upstream API, so without third-party modules access log parsing is the only option.
  • Cache hit rate and status distribution. Log $upstream_cache_status to track HIT, MISS, STALE, EXPIRED, and BYPASS. A sudden STALE spike can mask an upstream outage; a sudden MISS spike may indicate a cache purge, zone exhaustion, or mass TTL expiration.
  • Listen socket backlog and TcpExtListenOverflows. Use ss -tlnp to read Recv-Q (current queue depth) and Send-Q (configured backlog). Monitor nstat -az TcpExtListenOverflows; any nonzero increasing rate means the kernel is dropping connections before nginx sees them. This produces zero evidence in nginx logs.
  • Worker connection slot utilization. Calculate active_connections / (worker_connections * worker_processes). The default worker_connections is 512, not 1024. For a reverse proxy, effective capacity is roughly half because each proxied request uses two connection slots, one facing the client and one facing the upstream. Account for keepalive idle connections in your headroom.
  • Reload frequency and success. Track reconfiguring messages and configuration file .* test failed in error logs. A failed reload leaves the old configuration active, creating silent configuration drift. Frequent reloads without worker_shutdown_timeout cause old workers to accumulate on long-lived connections.
  • SSL session cache hit rate. Log $ssl_session_reused; "r" means the session was reused. Low hit rates drive CPU saturation from full TLS handshakes. Size the shared cache for connection rate multiplied by ssl_session_timeout.
  • Rate limit rejection rate. The default rejection status is 503, not 429. Monitor error log entries containing limiting requests and verify that legitimate traffic is not blocked.
  • Request and response size distributions. Log $request_length and $body_bytes_sent to detect shifts that affect buffer usage, compression CPU, and bandwidth.
  • Disk space on log partitions. Under incident conditions, log volume growth can fill the partition, causing writes to block the event loop and increasing latency for all requests.
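
Two of the calculations in this level are easy to get subtly wrong, so here is a minimal Python sketch of them: connection slot utilization with the reverse-proxy halving, and per-backend aggregation from $upstream_addr. The record layout (address, response time, and status as raw log strings) and the function names are assumptions for this example. The sketch handles the comma-separated lists nginx writes when a request is retried on another server, but not the colon-separated groups produced by internal redirects.

```python
from collections import defaultdict


def slot_utilization(active, worker_connections=512, worker_processes=1,
                     reverse_proxy=True):
    """Estimate the fraction of connection slots in use.

    With reverse_proxy=True the capacity is halved, on the assumption
    that every client connection also holds one upstream connection.
    """
    capacity = worker_connections * worker_processes
    if reverse_proxy:
        capacity /= 2
    return active / capacity


def per_upstream_stats(records):
    """Aggregate request count, 5xx rate, and mean time per backend.

    records is an iterable of (upstream_addr, upstream_response_time,
    upstream_status) strings exactly as logged. Retried requests carry
    comma-separated lists, and a time of "-" means no value was recorded.
    """
    totals = defaultdict(
        lambda: {"count": 0, "errors": 0, "timed": 0, "time": 0.0}
    )
    for addr_field, time_field, status_field in records:
        addrs = [a.strip() for a in addr_field.split(",")]
        times = [t.strip() for t in time_field.split(",")]
        statuses = [s.strip() for s in status_field.split(",")]
        for addr, elapsed, status in zip(addrs, times, statuses):
            entry = totals[addr]
            entry["count"] += 1
            if status.startswith("5"):
                entry["errors"] += 1
            if elapsed != "-":
                entry["timed"] += 1
                entry["time"] += float(elapsed)
    return {
        addr: {
            "count": e["count"],
            "error_rate": e["errors"] / e["count"],
            "mean_time": e["time"] / e["timed"] if e["timed"] else None,
        }
        for addr, e in totals.items()
    }
```

Splitting retried requests into one sample per attempt is a deliberate choice here: it charges the failed attempt to the server that failed, so a degraded backend stands out even when the retry succeeded and the client saw a 200.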

Level 4 — Expert

These signals are added after the third or fourth major incident reveals a blind spot. They explain why aggregate metrics looked green while users suffered.

  • Per-worker load imbalance. Compare CPU and active connection counts across individual workers. With reuseport, a small number of client IPs can hash to a single worker, creating a hot spot that aggregate metrics miss.
  • Kernel conntrack table utilization. If the host runs iptables or nftables with connection tracking, compare conntrack -C against /proc/sys/net/netfilter/nf_conntrack_max. A full table drops packets silently and mimics upstream connectivity failures.
  • Per-location-block latency and error rate. Use custom log variables or separate access logs per location block to isolate which routes degrade. Open-source nginx provides no native per-location metrics.
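  • Rate limit zone effective capacity. Estimate unique keys against zone size; limit_req_zone entries consume roughly 128 bytes each on 64-bit platforms. When the zone fills, nginx evicts the least recently used states, which quietly resets the limits for those clients. If a new state still cannot be created, nginx logs could not allocate node and rejects the request with the limit_req status.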
  • Old worker accumulation after reloads. Count total worker processes against worker_processes. Old workers from previous reloads linger indefinitely on WebSocket, gRPC streaming, or long-polling connections unless worker_shutdown_timeout is set.
  • Temp file creation rate. When proxy or client body buffers overflow, nginx writes to disk. There is no native metric for this. Watch for persistent gaps between $request_time and $upstream_response_time, or monitor disk I/O on the mount point used by proxy_temp_path.
  • DNS resolution latency. For variable-based proxy_pass with the resolver directive, a lookup that gets no answer holds the request for up to resolver_timeout (default 30s) before nginx returns 502. Log resolver errors separately from upstream errors.
  • $upstream_status versus $status discrepancy. An upstream may return 500 while nginx serves 200 from stale cache or a custom error_page. Logging both reveals when failures are masked by nginx-layer resilience.
  • TIME_WAIT and ephemeral port utilization. Without effective keepalive reuse, every proxied request opens and closes a TCP connection to upstream. Count ss -tan state time-wait entries per destination to detect port exhaustion before connect() failed (99: Cannot assign requested address) appears in the error log.
  • TLS version and cipher distribution shifts. Log $ssl_protocol and $ssl_cipher to detect downgrade attacks or client population changes that affect CPU load.
  • Stale cache serving rate. A sustained high STALE rate with no corresponding MISS spike suggests upstream is down and proxy_cache_use_stale is hiding the failure from standard error metrics. Verify upstream health independently before celebrating the resilience.
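
Two of these blind spots can be approximated with very little code. The sketch below estimates rate limit zone capacity from the roughly 128 bytes per entry figure above, and flags requests where the final upstream attempt failed but the client still received a non-5xx $status. The function names and the decision to look only at the last upstream attempt are assumptions for illustration. The capacity figure is an upper bound, since shared memory bookkeeping overhead is ignored.

```python
def zone_capacity(zone_bytes, entry_bytes=128):
    """Upper bound on the number of states a limit_req zone can hold."""
    return zone_bytes // entry_bytes


def zone_utilization(unique_keys, zone_bytes, entry_bytes=128):
    """Estimated fraction of the zone consumed by unique_keys states."""
    return unique_keys / zone_capacity(zone_bytes, entry_bytes)


def is_masked_failure(status, upstream_status):
    """True when the last upstream attempt was a 5xx but the client was not
    sent a 5xx, for example because stale cache or error_page stepped in.

    Both arguments are raw log strings. upstream_status may be "-" when no
    upstream was contacted, or a comma-separated list when nginx retried.
    """
    last = upstream_status.split(",")[-1].strip()
    if not last.isdigit():
        return False
    return int(last) >= 500 and int(status) < 500
```

Counting masked failures per minute next to the ordinary 5xx rate gives a view of how much nginx-layer resilience is currently absorbing, which is the number to check before trusting a green dashboard.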

How Netdata helps

Netdata automates the collection and correlation of signals across all four levels on a single node.

  • It scrapes stub_status and derives requests per second, connection state breakdown, and dropped connections without manual log parsing.
  • It correlates nginx access log metrics, including latency percentiles, response codes, and upstream times, with OS-level signals like TcpExtListenOverflows, per-worker CPU, and file descriptor usage.
  • It tracks per-process file descriptor consumption alongside configured worker_connections, surfacing when the OS limit becomes the real bottleneck before connection slots max out.
  • It monitors error log severity rates and surfaces spikes alongside upstream status code changes or cache status anomalies.
  • It visualizes per-worker CPU and memory RSS imbalance, helping identify reuseport hot spots or stuck old workers after frequent reloads.