NGINX monitoring checklist: the signals every production server needs

NGINX is an event-driven server in which each worker process runs a single-threaded event loop. Most production failures follow predictable patterns: connection exhaustion, backend cascades, file descriptor limits, or silent kernel-level drops. This article maps the signals that expose those failures into four cumulative maturity levels: Survival, Operational, Mature, and Expert. Use it to audit your current coverage or to justify instrumentation work before the next incident.

Each level adds depth. Survival answers “Is it up?” Operational answers “Is it healthy?” Mature adds leading indicators. Expert adds the signals you instrument after your third postmortem. The tables below list each signal, why it matters, and the threshold that should trigger a response.

Level 1 — Survival

Level 1 is the absolute minimum. If you only have five minutes to instrument a new NGINX deployment, start here. Two details affect the capacity math: the default `worker_connections` is 512, not 1024, and each proxied request consumes at least two connection slots (one for the client, one for the upstream). The active connection count comes from `stub_status`, which requires the module to be compiled in and the endpoint to be enabled in the configuration.

Signal	Why it matters	Threshold
Worker process liveness	Master dead = total outage; workers crashing faster than respawn = reduced capacity	PAGE: master dead or zero workers sustained for more than 30 seconds
Listening port responsiveness	Binary check that the kernel accept queue and NGINX are functional	PAGE: TCP connect fails, or localhost connect exceeds 10 ms
Active connections	Fundamental capacity gauge from `stub_status`; approaching the limit means imminent drops	PAGE: active connections at or above 90% of `worker_connections * worker_processes`
HTTP 5xx rate	User-visible server errors passed through or generated by NGINX	PAGE: 5xx rate above 5% of total requests with a floor of at least 10 req/min sustained for more than 2 minutes
Error log severity	`[emerg]` and `[alert]` indicate catastrophic conditions; `[crit]` indicates serious degradation	TICKET: any `[emerg]` or `[alert]`; sustained `[crit]` burst; error rate above 10 per minute
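
The active-connection threshold above can be sketched in a few lines of Python. This is a minimal illustration, not a collector: it assumes you have already fetched the plain-text `stub_status` page by some means, and the function names, the `SAMPLE` text, and the worker settings used in the example are all illustrative.

```python
import re

# Illustrative stub_status page body; real output has the same four-line layout.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Parse the plain-text stub_status page into integer counters."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    totals = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = map(int, totals.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


def connection_utilization(active: int, worker_connections: int,
                           worker_processes: int) -> float:
    """Fraction of the worker_connections * worker_processes ceiling in use."""
    return active / (worker_connections * worker_processes)


def should_page(utilization: float, threshold: float = 0.90) -> bool:
    """Apply the Level 1 rule: page at or above 90% of capacity."""
    return utilization >= threshold


stats = parse_stub_status(SAMPLE)
# Assumed settings for the example: default worker_connections, 4 workers.
print(should_page(connection_utilization(stats["active"], 512, 4)))
```

Because proxied requests hold two slots each, the effective ceiling for proxy traffic is lower than this ratio suggests, so treat the result as an optimistic estimate.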

Level 2 — Operational

Level 2 adds latency decomposition and resource saturation. The most common misdiagnosis at this stage is confusing `$request_time` with backend latency. `$request_time` includes the time spent exchanging data with the client, so always pair it with `$upstream_response_time` to isolate the bottleneck.
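
As a rough sketch of that pairing, the Python below splits one access log line into upstream time and everything else. It assumes a hypothetical `log_format` that writes the two variables as `rt="$request_time" urt="$upstream_response_time"`; the field names `rt` and `urt` and the helper names are illustrative. It relies on the fact that `$upstream_response_time` may list several attempts separated by commas or colons, and is `-` when no upstream was contacted.

```python
import re

# Hypothetical log_format fragment assumed by this sketch:
#   rt="$request_time" urt="$upstream_response_time"
FIELD = re.compile(r'\b(rt|urt)="([^"]*)"')


def upstream_total(raw: str):
    """Sum the times of all upstream attempts; None if no upstream was used."""
    parts = (p.strip() for p in re.split(r"[,:]", raw))
    times = [float(p) for p in parts if p and p != "-"]
    return sum(times) if times else None


def split_latency(line: str) -> dict:
    """Separate time spent in upstreams from time spent elsewhere."""
    fields = dict(FIELD.findall(line))
    request_time = float(fields["rt"])
    upstream = upstream_total(fields.get("urt", "-"))
    if upstream is None:
        outside = request_time
    else:
        # Clamp at zero: the two timers are rounded independently.
        outside = max(0.0, request_time - upstream)
    return {
        "request_time": request_time,
        "upstream": upstream,
        "outside_upstream": outside,
    }


print(split_latency('GET /api 200 rt="0.250" urt="0.100, 0.120"'))
```

A large `outside_upstream` value with a small `upstream` value points at slow clients or NGINX itself rather than the backend.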

Signal	Why it matters	Threshold
Requests per second	Core throughput; sudden drop indicates outage, backend failure, or client-side issue	TICKET: sustained drop above 50% from baseline, or spike above 3x
Connection state breakdown	`Reading`, `Writing`, and `Waiting` reveal the nature of the load: slow clients, slow upstreams, or keepalive bloat	TICKET: `Reading` above 20% or `Writing` above 50% of active connections sustained
Dropped connections	`accepts - handled` gap is the canary for connection or FD exhaustion before hard failure	PAGE: gap rate increasing above zero sustained for more than 60 seconds
Request processing time	End-to-end latency from NGINX’s perspective; includes client send time	TICKET: P95 trending above 2x rolling baseline for more than 5 minutes
Upstream response time	Backend performance in isolation; the dominant factor in proxy deployments	TICKET: P95 trending up more than 20% from baseline
HTTP 4xx rate	Client errors, scanning, authentication issues, or misconfiguration	TICKET: sudden 2x increase without known cause
Client abandons (499)	Users closing connections before NGINX finishes; often precedes 5xx spikes	TICKET: sustained rate above 1% of requests
File descriptor usage	FD exhaustion produces silent connection drops and upstream connect failures	PAGE: above 95% of limit with “too many open files” in error log
Worker CPU utilization	Event loop saturation from SSL, compression, or regex overhead	TICKET: sustained above 80% of one core per worker
Worker RSS	Memory growth may indicate leaks, buffer bloat, or module overhead	TICKET: RSS growing without bound while connection count is stable
Disk space (log partitions)	Prevents a feedback loop where incident-driven log spikes fill the disk	TICKET: below 20% free on log or cache partitions
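
The dropped-connections rule in the table depends on the change in the `accepts - handled` gap over time, not on its absolute value. The sketch below shows one way to evaluate it from periodic `stub_status` samples. The `Sample` type and function names are illustrative, and the 60-second window comes from the threshold above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds, from any monotonic or wall clock
    accepts: int
    handled: int


def dropped_per_second(prev: Sample, curr: Sample) -> float:
    """Rate at which the accepts - handled gap grew between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr.accepts < prev.accepts:
        # Counters went backwards, for example after a restart: no usable rate.
        return 0.0
    prev_gap = prev.accepts - prev.handled
    curr_gap = curr.accepts - curr.handled
    return max(0, curr_gap - prev_gap) / elapsed


def sustained_drops(samples: list, min_duration: float = 60.0) -> bool:
    """True if every trailing interval shows new drops for min_duration seconds."""
    if len(samples) < 2:
        return False
    end = samples[-1].timestamp
    # Walk consecutive pairs from newest to oldest.
    for prev, curr in zip(reversed(samples[:-1]), reversed(samples[1:])):
        if dropped_per_second(prev, curr) <= 0:
            return False
        if end - prev.timestamp >= min_duration:
            return True
    return False


history = [Sample(0, 1000, 1000), Sample(30, 2000, 1995), Sample(60, 3000, 2991)]
print(sustained_drops(history))
```

Requiring every interval in the window to show growth avoids paging on a single burst that has already stopped.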

Level 3 — Mature

Level 3 introduces upstream visibility and kernel-level signals. The kernel accept queue and shared memory zones are invisible to `stub_status` but are frequent root causes of silent connection drops.
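
To make the accept queue visible, one option is to read the output of `ss -ltn`, where for listening sockets Recv-Q is the current queue length and Send-Q is the configured backlog. The following is a minimal sketch that parses captured `ss` text rather than running the command; the function name and the sample output are illustrative.

```python
# Illustrative `ss -ltn` output; column layout matches the real command.
SS_SAMPLE = """State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      511    0.0.0.0:80         0.0.0.0:*
LISTEN 420    511    0.0.0.0:443        0.0.0.0:*
"""


def listen_queue_fill(ss_output: str) -> dict:
    """Map each listening address to its Recv-Q / Send-Q ratio."""
    fill = {}
    for line in ss_output.splitlines():
        cols = line.split()
        if len(cols) < 4 or cols[0] != "LISTEN":
            continue  # header line or a non-listening socket
        recv_q, send_q, local = int(cols[1]), int(cols[2]), cols[3]
        if send_q <= 0:
            continue
        # Several sockets can share an address (reuseport): keep the worst one.
        fill[local] = max(fill.get(local, 0.0), recv_q / send_q)
    return fill


for address, ratio in listen_queue_fill(SS_SAMPLE).items():
    print(address, "PAGE" if ratio >= 0.80 else "ok")
```

A single high reading can be a momentary burst, so a real check would require the ratio to stay high across several samples before paging.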

Signal	Why it matters	Threshold
Upstream connect time	TCP/TLS handshake overhead to backends; near-zero values indicate keepalive reuse (available since NGINX 1.9.1)	TICKET: P95 above 100 ms for same-datacenter backends
Upstream header time	Time to first byte from upstream; isolates backend processing from body transfer (available since NGINX 1.7.10)	TICKET: approaching `proxy_read_timeout` or trending up
Per-upstream-server health	Partial upstream failures are diluted in aggregate metrics	TICKET: individual server generating errors or removed from rotation
Cache hit rate	Protects upstream from full traffic load; declining rate increases backend pressure	TICKET: drop above 10% from baseline
Shared memory zone errors	`limit_req_zone` exhaustion silently stops rate limiting; cache zones thrash	PAGE: any `could not allocate node` for rate or connection limit zones
Listen backlog depth	`ss` Recv-Q shows connections queued in the kernel but not yet accepted	PAGE: Recv-Q at or above 80% of Send-Q sustained
TcpExtListenOverflows	Confirmed kernel-level connection drops before NGINX sees them	PAGE: rate of increase above zero sustained for more than 60 seconds
Ephemeral port utilization	Upstream `TIME_WAIT` accumulation can exhaust ports even at moderate connection counts	TICKET: active plus `TIME_WAIT` above 60% of the ephemeral range per destination
Connection slot utilization	Hard cliff at 100%; proxy traffic uses two slots per request	PAGE: above 95% with corroborating admission loss (gap growing or listen overflows)
Reload frequency and success	Failed reloads leave old config active; frequent reloads cause worker accumulation	TICKET: reload failure, or frequency above 1 per minute sustained
SSL session cache hit rate	Low reuse drives CPU-heavy full handshakes	TICKET: below 70%
Request and response size	Anomalies may indicate data leakage, attacks, or buffer pressure	TICKET: unexpected large responses on non-download endpoints
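
The kernel counters in this table are cumulative, so the alert condition is a rate, not a value. On Linux they are exposed in `/proc/net/netstat` as a `TcpExt:` header line followed by a `TcpExt:` value line, where the counter appears as `ListenOverflows`. The sketch below parses that text and computes a per-second rate between two readings. The helper names are illustrative and the sample text is abbreviated to four counters.

```python
# Abbreviated, illustrative excerpt; the real file lists many more counters.
NETSTAT_SAMPLE = """TcpExt: SyncookiesSent SyncookiesRecv ListenOverflows ListenDrops
TcpExt: 3 0 12 12
IpExt: InNoRoutes InTruncatedPkts
IpExt: 0 0
"""


def parse_tcpext(netstat_text: str) -> dict:
    """Return the TcpExt counters from /proc/net/netstat text as a dict."""
    lines = netstat_text.splitlines()
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            names = header.split()[1:]
            numbers = values.split()[1:]
            return dict(zip(names, map(int, numbers)))
    raise ValueError("no TcpExt section found")


def counter_rate(prev: dict, curr: dict, name: str, elapsed: float) -> float:
    """Per-second increase of one counter; 0.0 if it reset or is missing."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    delta = curr.get(name, 0) - prev.get(name, 0)
    return max(0, delta) / elapsed


before = parse_tcpext(NETSTAT_SAMPLE)
after = dict(before, ListenOverflows=before["ListenOverflows"] + 30)
print(counter_rate(before, after, "ListenOverflows", 60.0))
```

The same `counter_rate` helper applies to `SyncookiesSent` for the SYN queue signal.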

Level 4 — Expert

Level 4 covers the edge cases that only appear after repeated incidents: per-worker imbalance, stale cache masking upstream failures, and the gap between `$upstream_status` and `$status`.

Signal	Why it matters	Threshold
Per-worker load imbalance	Uneven connection distribution hides saturation under average metrics	TICKET: one worker above 90% CPU or connections while others remain below 30%
Kernel conntrack utilization	iptables/nftables connection tracking can drop packets under high load	TICKET: above 80% of `nf_conntrack_max`
SYN queue overflow	`TcpExtSyncookiesSent` indicates SYN backlog exhaustion	TICKET: any nonzero rate
Per-location latency and error rate	Routing logic failures are invisible in global aggregates	TICKET: P95 or error rate spike isolated to a single location block
Upstream keepalive reuse	Ineffective pools generate `TIME_WAIT` and eventual port exhaustion	TICKET: apparent reuse rate below 70%
Rate limit zone capacity	Unique keys versus zone size; silent bypass when the zone fills	TICKET: unique keys approaching 85% of estimated zone capacity
Old worker accumulation	Long-lived connections prevent draining after reloads	TICKET: total worker count consistently above `worker_processes`
Stale cache serving rate	`proxy_cache_use_stale` can mask upstream failures indefinitely	TICKET: sustained `STALE` without corresponding upstream error visibility
Upstream vs client status	`$upstream_status` versus `$status` detects error masking via `error_page` or stale cache	TICKET: sustained discrepancies between upstream and client-facing codes
DNS resolution latency	Variable `proxy_pass` with `resolver` fails asynchronously after timeout	TICKET: sustained resolver errors correlated with 502 spikes
TLS certificate expiry	NGINX serves expired certs from memory until reloaded after disk renewal	TICKET: below 30 days; PAGE: expired and causing handshake failures
Access log write latency	Synchronous logging stalls the event loop under high error volume	TICKET: periodic `$request_time` spikes correlating with log flush
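
The upstream-versus-client status check can be illustrated with a short Python sketch. It assumes both `$status` and `$upstream_status` have already been extracted from a log line, and it treats a request as masked when the last upstream attempt returned a 5xx while the client received something below 500. That definition and the function names are assumptions made for this example. Like the upstream timing variables, `$upstream_status` may hold several values separated by commas or colons, or `-` when no upstream was contacted.

```python
import re


def final_upstream_status(raw: str):
    """Status code of the last upstream attempt, or None if there was none."""
    parts = (p.strip() for p in re.split(r"[,:]", raw))
    codes = [int(p) for p in parts if p.isdigit()]
    return codes[-1] if codes else None


def is_masked_error(status: int, upstream_status_raw: str) -> bool:
    """True when the upstream failed with 5xx but the client saw a non-5xx code."""
    upstream = final_upstream_status(upstream_status_raw)
    return upstream is not None and upstream >= 500 and status < 500


def masked_error_ratio(records) -> float:
    """Share of (status, upstream_status) pairs that hide an upstream failure."""
    records = list(records)
    if not records:
        return 0.0
    masked = sum(1 for status, raw in records if is_masked_error(status, raw))
    return masked / len(records)


print(masked_error_ratio([(200, "200"), (200, "502"), (502, "502"), (200, "-")]))
```

A retry that eventually succeeds, such as `502, 200`, is not counted as masked here because the final attempt was healthy. Track those separately if retry volume matters to you.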

How Netdata helps

Netdata collects `stub_status` metrics out of the box when the endpoint is reachable, surfacing active connections, accepts, handled, requests, and the reading/writing/waiting breakdown in real time. It correlates per-worker process CPU and RSS with connection state counts to help distinguish organic load from memory leaks or uneven distribution. Access log parsing is available for 5xx rate, 499 volume, and `$request_time` / `$upstream_response_time` percentiles without requiring external log shippers. Kernel networking metrics like `TcpExtListenOverflows` and listen queue depth are collected at the OS level and can be correlated directly with NGINX connection metrics. File descriptor usage per process and SSL certificate expiry are monitored independently of NGINX’s own reporting, catching silent limits before they become user-visible outages.

