NGINX monitoring checklist: the signals every production server needs

NGINX uses an event-driven architecture in which each worker process runs a single-threaded event loop. Most production failures follow predictable patterns: connection exhaustion, backend cascades, file descriptor limits, or silent kernel-level drops. This article maps the signals that expose those failures into four cumulative maturity levels: Survival, Operational, Mature, and Expert. Use it to audit your current coverage or to justify instrumentation work before the next incident.

Each level adds depth. Survival answers “Is it up?” Operational answers “Is it healthy?” Mature adds leading indicators. Expert adds the signals you instrument after your third postmortem. The tables below list each signal, why it matters, and the threshold that should trigger a response.

Level 1 — Survival

Level 1 is the absolute minimum. If you only have five minutes to instrument a new NGINX deployment, start here. Note that the default worker_connections is 512, not 1024, and each proxied request consumes at least two connection slots: one for the client and one for the upstream. The stub_status module (ngx_http_stub_status_module) must be compiled in and enabled in a location block to expose the active connection count.

| Signal | Why it matters | Threshold |
| --- | --- | --- |
| Worker process liveness | Master dead = total outage; workers crashing faster than respawn = reduced capacity | PAGE: master dead or zero workers sustained for more than 30 seconds |
| Listening port responsiveness | Binary check that the kernel accept queue and NGINX are functional | PAGE: TCP connect fails, or localhost connect exceeds 10 ms |
| Active connections | Fundamental capacity gauge from stub_status; approaching the limit means imminent drops | PAGE: active connections at or above 90% of worker_connections * worker_processes |
| HTTP 5xx rate | User-visible server errors passed through or generated by NGINX | PAGE: 5xx rate above 5% of total requests with a floor of at least 10 req/min sustained for more than 2 minutes |
| Error log severity | [emerg] and [alert] indicate catastrophic conditions; [crit] indicates serious degradation | TICKET: any [emerg] or [alert]; sustained [crit] burst; error rate above 10 per minute |
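
The Active connections threshold can be checked with a little arithmetic on the stub_status output. The following is a minimal Python sketch, assuming the standard plain-text stub_status response. The sample text, the function names, and the worker settings (512 connections, 4 workers) are illustrative assumptions, not values from a real deployment.

```python
import re

# Illustrative stub_status response; the numbers are made up.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

def parse_stub_status(text):
    """Extract the seven stub_status counters into a dict."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = map(int, totals.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }

def connection_utilization(active, worker_connections, worker_processes):
    """Fraction of the theoretical connection ceiling currently in use."""
    return active / (worker_connections * worker_processes)

def should_page(active, worker_connections, worker_processes, threshold=0.90):
    """Apply the Level 1 rule: page at or above 90% of the ceiling."""
    return connection_utilization(
        active, worker_connections, worker_processes
    ) >= threshold

stats = parse_stub_status(SAMPLE)
# Assumed configuration: worker_connections 512, worker_processes 4.
print(connection_utilization(stats["active"], 512, 4))
print(should_page(stats["active"], 512, 4))
```

Because proxied requests hold at least two slots each, the number of client requests a proxy can serve concurrently is roughly half of what this ceiling suggests.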

Level 2 — Operational

Level 2 adds latency decomposition and resource saturation. The most common misdiagnosis at this stage is confusing $request_time with backend latency. $request_time includes client send time; always pair it with $upstream_response_time to isolate the bottleneck.
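
To make the distinction concrete, the sketch below subtracts upstream time from $request_time for a single request. It assumes both values have already been extracted from an access log line as strings, and the function names are hypothetical. NGINX writes several upstream times separated by commas or colons when a request touches more than one upstream server, and logs a dash when a value is missing, so the parser sums only the numeric parts.

```python
def upstream_total(field):
    """Sum every numeric value in an $upstream_response_time field."""
    total = 0.0
    # Treat colons like commas so both separators split the same way.
    for part in field.replace(":", ",").split(","):
        part = part.strip()
        if part and part != "-":
            total += float(part)
    return total

def decompose(request_time, upstream_field):
    """Split one request's latency into upstream and non-upstream time."""
    upstream = upstream_total(upstream_field)
    # The remainder is time spent outside the upstream: receiving the
    # request from the client, sending the response, and NGINX itself.
    non_upstream = max(float(request_time) - upstream, 0.0)
    return {"upstream": upstream, "non_upstream": non_upstream}

# A slow request where the backend answered quickly.
print(decompose("1.250", "0.200"))
# A request that was retried on a second upstream server.
print(decompose("0.450", "0.100, 0.300"))
```

A large non-upstream share on a request with a small upstream time points at slow clients or network transfer rather than at the backend.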

| Signal | Why it matters | Threshold |
| --- | --- | --- |
| Requests per second | Core throughput; sudden drop indicates outage, backend failure, or client-side issue | TICKET: sustained drop above 50% from baseline, or spike above 3x |
| Connection state breakdown | Reading, Writing, and Waiting reveal the nature of the load: slow clients, slow upstreams, or keepalive bloat | TICKET: Reading above 20% or Writing above 50% of active connections sustained |
| Dropped connections | accepts - handled gap is the canary for connection or FD exhaustion before hard failure | PAGE: gap rate increasing above zero sustained for more than 60 seconds |
| Request processing time | End-to-end latency from NGINX’s perspective; includes client send time | TICKET: P95 trending above 2x rolling baseline for more than 5 minutes |
| Upstream response time | Backend performance in isolation; the dominant factor in proxy deployments | TICKET: P95 trending up more than 20% from baseline |
| HTTP 4xx rate | Client errors, scanning, authentication issues, or misconfiguration | TICKET: sudden 2x increase without known cause |
| Client abandons (499) | Users closing connections before NGINX finishes; often precedes 5xx spikes | TICKET: sustained rate above 1% of requests |
| File descriptor usage | FD exhaustion produces silent connection drops and upstream connect failures | PAGE: above 95% of limit with “too many open files” in error log |
| Worker CPU utilization | Event loop saturation from SSL, compression, or regex overhead | TICKET: sustained above 80% of one core per worker |
| Worker RSS | Memory growth may indicate leaks, buffer bloat, or module overhead | TICKET: RSS growing without bound while connection count is stable |
| Disk space (log partitions) | Prevents a feedback loop where incident-driven log spikes fill the disk | TICKET: below 20% free on log or cache partitions |
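
The Dropped connections signal is a rate derived from two cumulative counters, so it needs at least two samples. Below is a minimal Python sketch of that calculation, assuming each sample is a dict with the accepts and handled counters from stub_status. The sample values, the 10-second interval, and the function names are illustrative assumptions.

```python
def dropped_per_second(prev, curr, interval_seconds):
    """Rate at which the accepts - handled gap grew between two samples."""
    gap_prev = prev["accepts"] - prev["handled"]
    gap_curr = curr["accepts"] - curr["handled"]
    growth = gap_curr - gap_prev
    # The counters restart from zero when NGINX restarts, which can make
    # the delta negative; report zero for that interval instead.
    return max(growth, 0) / interval_seconds

def sustained(rates, window):
    """True if each of the last `window` rates is above zero."""
    recent = rates[-window:]
    return len(recent) == window and all(rate > 0 for rate in recent)

# Three made-up samples taken 10 seconds apart.
samples = [
    {"accepts": 1000, "handled": 1000},
    {"accepts": 1100, "handled": 1095},
    {"accepts": 1200, "handled": 1185},
]
rates = [
    dropped_per_second(a, b, 10) for a, b in zip(samples, samples[1:])
]
print(rates)
print(sustained(rates, 2))
```

With 10-second sampling, a window of six consecutive positive intervals approximates the 60-second condition in the table.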

Level 3 — Mature

Level 3 introduces upstream visibility and kernel-level signals. The kernel accept queue and shared memory zones are invisible to stub_status but are frequent root causes of silent connection drops.
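
Since stub_status cannot see the kernel accept queue, the Listen backlog depth check has to read it from the operating system. The sketch below parses text in the column layout printed by `ss -ltn` (State, Recv-Q, Send-Q, Local Address:Port), where for listening sockets Recv-Q is the current queue depth and Send-Q is the configured backlog. The sample output and function names are illustrative assumptions, and the layout may differ between iproute2 versions.

```python
# Made-up output in the layout of `ss -ltn`.
SAMPLE_SS = """State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      511    0.0.0.0:80         0.0.0.0:*
LISTEN 420    511    0.0.0.0:443        0.0.0.0:*
"""

def backlog_usage(ss_output):
    """Map each listening address to its Recv-Q / Send-Q fraction."""
    usage = {}
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a listening socket.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        recv_q, send_q, local = int(fields[1]), int(fields[2]), fields[3]
        if send_q > 0:
            usage[local] = recv_q / send_q
    return usage

def saturated(ss_output, threshold=0.80):
    """Listening addresses whose accept queue is at or above the threshold."""
    usage = backlog_usage(ss_output)
    return sorted(addr for addr, frac in usage.items() if frac >= threshold)

print(saturated(SAMPLE_SS))
```

In a real collector the text would come from running the command, for example with `subprocess.run(["ss", "-ltn"], capture_output=True, text=True).stdout`, and the check would be repeated over time to confirm the condition is sustained.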

| Signal | Why it matters | Threshold |
| --- | --- | --- |
| Upstream connect time | TCP/TLS handshake overhead to backends; near-zero values indicate keepalive reuse (available since NGINX 1.9.1) | TICKET: P95 above 100 ms for same-datacenter backends |
| Upstream header time | Time to first byte from upstream; isolates backend processing from body transfer (available since NGINX 1.7.10) | TICKET: approaching proxy_read_timeout or trending up |
| Per-upstream-server health | Partial upstream failures are diluted in aggregate metrics | TICKET: individual server generating errors or removed from rotation |
| Cache hit rate | Protects upstream from full traffic load; declining rate increases backend pressure | TICKET: drop above 10% from baseline |
| Shared memory zone errors | limit_req_zone exhaustion silently stops rate limiting; cache zones thrash | PAGE: any "could not allocate node" for rate or connection limit zones |
| Listen backlog depth | ss Recv-Q shows connections queued in the kernel but not yet accepted | PAGE: Recv-Q at or above 80% of Send-Q sustained |
| TcpExtListenOverflows | Confirmed kernel-level connection drops before NGINX sees them | PAGE: rate of increase above zero sustained for more than 60 seconds |
| Ephemeral port utilization | Upstream TIME_WAIT accumulation can exhaust ports even at moderate connection counts | TICKET: active plus TIME_WAIT above 60% of the ephemeral range per destination |
| Connection slot utilization | Hard cliff at 100%; proxy traffic uses two slots per request | PAGE: above 95% with corroborating admission loss (gap growing or listen overflows) |
| Reload frequency and success | Failed reloads leave old config active; frequent reloads cause worker accumulation | TICKET: reload failure, or frequency above 1 per minute sustained |
| SSL session cache hit rate | Low reuse drives CPU-heavy full handshakes | TICKET: below 70% |
| Request and response size | Anomalies may indicate data leakage, attacks, or buffer pressure | TICKET: unexpected large responses on non-download endpoints |
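
TcpExtListenOverflows is a cumulative kernel counter, so the alert is on its rate of change. On Linux the counter is exposed in /proc/net/netstat, which alternates a line of counter names with a line of values for each prefix. The following is a minimal Python sketch of reading it; the sample text is truncated to a few counters for illustration, and the function names are hypothetical.

```python
# Shortened, made-up excerpt in the layout of /proc/net/netstat.
SAMPLE_NETSTAT = """TcpExt: SyncookiesSent SyncookiesRecv ListenOverflows ListenDrops
TcpExt: 0 0 12 12
IpExt: InNoRoutes InTruncatedPkts
IpExt: 0 0
"""

def parse_netstat(text):
    """Return counters keyed as prefix + name, e.g. TcpExtListenOverflows."""
    counters = {}
    lines = text.splitlines()
    # Pair each header line of names with the value line that follows it.
    for names_line, values_line in zip(lines[0::2], lines[1::2]):
        prefix, names = names_line.split(":", 1)
        _, values = values_line.split(":", 1)
        for name, value in zip(names.split(), values.split()):
            counters[prefix + name] = int(value)
    return counters

def counter_rate(prev, curr, interval_seconds, key="TcpExtListenOverflows"):
    """Per-second increase of one counter between two parsed samples."""
    return max(curr[key] - prev[key], 0) / interval_seconds

counters = parse_netstat(SAMPLE_NETSTAT)
print(counters["TcpExtListenOverflows"])
```

On a live host the input would be the contents of /proc/net/netstat read at two points in time, with `counter_rate` applied to the pair of samples.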

Level 4 — Expert

Level 4 covers the edge cases that only appear after repeated incidents: per-worker imbalance, stale cache masking upstream failures, and the gap between $upstream_status and $status.
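
The gap between $upstream_status and $status can be detected per request once both fields are logged. This is a minimal Python sketch, assuming the access log format has been extended to include $upstream_status (the default combined format does not log it) and that both fields have already been extracted. The function names are hypothetical. When several upstream servers were tried, NGINX records one status per attempt separated by commas or colons, so the sketch compares the last attempt against the client-facing code.

```python
def last_upstream_status(field):
    """Status code of the final upstream attempt, or None if there was none."""
    parts = [p.strip() for p in field.replace(":", ",").split(",")]
    codes = [int(p) for p in parts if p.isdigit()]
    return codes[-1] if codes else None

def is_masked(status, upstream_field):
    """True when the upstream failed with 5xx but the client did not see 5xx."""
    upstream = last_upstream_status(upstream_field)
    return upstream is not None and upstream >= 500 and status < 500

# Upstream returned 502 but the client got 200, e.g. from stale cache.
print(is_masked(200, "502"))
# First attempt failed, the retry succeeded: not masking.
print(is_masked(200, "502, 200"))
```

Counting masked requests over time gives the sustained-discrepancy signal without needing to know whether error_page or a stale cache entry produced the client response.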

| Signal | Why it matters | Threshold |
| --- | --- | --- |
| Per-worker load imbalance | Uneven connection distribution hides saturation under average metrics | TICKET: one worker above 90% CPU or connections while others remain below 30% |
| Kernel conntrack utilization | iptables/nftables connection tracking can drop packets under high load | TICKET: above 80% of nf_conntrack_max |
| SYN queue overflow | TcpExtSyncookiesSent indicates SYN backlog exhaustion | TICKET: any nonzero rate |
| Per-location latency and error rate | Routing logic failures are invisible in global aggregates | TICKET: P95 or error rate spike isolated to a single location block |
| Upstream keepalive reuse | Ineffective pools generate TIME_WAIT and eventual port exhaustion | TICKET: apparent reuse rate below 70% |
| Rate limit zone capacity | Unique keys versus zone size; silent bypass when the zone fills | TICKET: unique keys approaching 85% of estimated zone capacity |
| Old worker accumulation | Long-lived connections prevent draining after reloads | TICKET: total worker count consistently above worker_processes |
| Stale cache serving rate | proxy_cache_use_stale can mask upstream failures indefinitely | TICKET: sustained STALE without corresponding upstream error visibility |
| Upstream vs client status | $upstream_status versus $status detects error masking via error_page or stale cache | TICKET: sustained discrepancies between upstream and client-facing codes |
| DNS resolution latency | Variable proxy_pass with resolver fails asynchronously after timeout | TICKET: sustained resolver errors correlated with 502 spikes |
| TLS certificate expiry | NGINX serves expired certs from memory until reloaded after disk renewal | TICKET: below 30 days; PAGE: expired and causing handshake failures |
| Access log write latency | Synchronous logging stalls the event loop under high error volume | TICKET: periodic $request_time spikes correlating with log flush |
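
The TLS certificate expiry thresholds reduce to a date comparison. The sketch below is a minimal Python illustration, assuming the certificate's notAfter value is available as a string in the format Python's ssl module reports it (for example from `getpeercert()` on a live connection). The function names and the simplified rule that any expired certificate pages are assumptions made for the example.

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds) and a notAfter timestamp string."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

def classify(days_left):
    """Map remaining lifetime to the alert levels used in the table."""
    if days_left <= 0:
        # Simplification: the table also requires handshake failures.
        return "PAGE"
    if days_left < 30:
        return "TICKET"
    return "OK"

# Evaluate a sample notAfter value as if checked 10 days before expiry.
not_after = "Jan  5 09:34:43 2018 GMT"
check_time = ssl.cert_time_to_seconds(not_after) - 10 * 86400
print(classify(days_until_expiry(not_after, now=check_time)))
```

Because NGINX keeps serving the certificate it loaded at the last reload, the value worth checking is the one presented on the listening port, not the file on disk.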

How Netdata helps

Netdata collects stub_status metrics out of the box when the endpoint is reachable, surfacing active connections, accepts, handled, requests, and the reading/writing/waiting breakdown in real time. It correlates per-worker process CPU and RSS with connection state counts to help distinguish organic load from memory leaks or uneven distribution. Access log parsing is available for 5xx rate, 499 volume, and $request_time / $upstream_response_time percentiles without requiring external log shippers. Kernel networking metrics like TcpExtListenOverflows and listen queue depth are collected at the OS level and can be correlated directly with NGINX connection metrics. File descriptor usage per process and SSL certificate expiry are monitored independently of NGINX’s own reporting, catching silent limits before they become user-visible outages.