NGINX monitoring checklist: the signals every production server needs
NGINX is event-driven: each worker process runs a single-threaded event loop. Most production failures follow predictable patterns: connection exhaustion, backend cascades, file descriptor limits, or silent kernel-level drops. This article maps the signals that expose those failures into four cumulative maturity levels: Survival, Operational, Mature, and Expert. Use it to audit your current coverage or to justify instrumentation work before the next incident.
Each level adds depth. Survival answers “Is it up?” Operational answers “Is it healthy?” Mature adds leading indicators. Expert adds the signals you instrument after your third postmortem. The tables below list each signal, why it matters, and the threshold that should trigger a response.
Level 1 — Survival
Level 1 is the absolute minimum. If you only have five minutes to instrument a new NGINX deployment, start here. Note that the default worker_connections is 512, not 1024, and each proxied request consumes at least two connection slots: one for the client and one for the upstream. The active connection count comes from stub_status, so ngx_http_stub_status_module must be compiled in and the stub_status directive enabled in a location block.
| Signal | Why it matters | Threshold |
|---|---|---|
| Worker process liveness | Master dead = total outage; workers crashing faster than respawn = reduced capacity | PAGE: master dead or zero workers sustained for more than 30 seconds |
| Listening port responsiveness | Binary check that the kernel accept queue and NGINX are functional | PAGE: TCP connect fails, or localhost connect exceeds 10 ms |
| Active connections | Fundamental capacity gauge from stub_status; approaching the limit means imminent drops | PAGE: active connections at or above 90% of worker_connections * worker_processes |
| HTTP 5xx rate | User-visible server errors passed through or generated by NGINX | PAGE: 5xx rate above 5% of total requests, with a floor of at least 10 req/min, sustained for more than 2 minutes |
| Error log severity | [emerg] and [alert] indicate catastrophic conditions; [crit] indicates serious degradation | TICKET: any [emerg] or [alert]; sustained [crit] burst; error rate above 10 per minute |
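
The active connections threshold above can be checked with a few lines of Python. This is a minimal sketch, assuming the standard stub_status text format; the function names and the sample figures are illustrative, and worker_connections and worker_processes must come from your own configuration. stub_status reports client connections, while worker_connections also covers upstream connections, so treat the ratio as an approximation rather than an exact slot count.

```python
import re
from dataclasses import dataclass


@dataclass
class StubStatus:
    active: int
    accepts: int
    handled: int
    requests: int
    reading: int
    writing: int
    waiting: int


# Matches the plain-text body served by the stub_status directive.
_PATTERN = re.compile(
    r"Active connections:\s*(\d+)\s+"
    r"server accepts handled requests\s+"
    r"(\d+)\s+(\d+)\s+(\d+)\s+"
    r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)"
)


def parse_stub_status(text: str) -> StubStatus:
    """Parse stub_status output into named integer fields."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError("unrecognized stub_status output")
    return StubStatus(*map(int, match.groups()))


def connection_utilization(active: int, worker_connections: int,
                           worker_processes: int) -> float:
    """Fraction of the configured connection limit currently in use."""
    return active / (worker_connections * worker_processes)


# Illustrative sample, not taken from a real server.
SAMPLE = """Active connections: 291 
server accepts handled requests
 16630948 16630948 31070465 
Reading: 6 Writing: 179 Waiting: 106 
"""

status = parse_stub_status(SAMPLE)
# Assumed configuration: 4 workers with the default 512 connections each.
if connection_utilization(status.active, 512, 4) >= 0.90:
    print("PAGE: active connections at or above 90% of the limit")
```

In practice the text would be fetched from whatever URL the stub_status location is exposed on, and the two limits read from the running configuration instead of being hard-coded.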
Level 2 — Operational
Level 2 adds latency decomposition and resource saturation. The most common misdiagnosis at this stage is confusing $request_time with backend latency. $request_time includes client send time; always pair it with $upstream_response_time to isolate the bottleneck.
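The split between $request_time and $upstream_response_time can be sketched in Python as follows. It assumes a hypothetical log_format that writes `rt=$request_time urt="$upstream_response_time"`; the field names rt and urt are inventions for this example. The upstream value is quoted because NGINX may log several comma- or colon-separated times when a request touches more than one upstream server, and logs "-" when no upstream was contacted.

```python
import re

# Hypothetical fields: rt=$request_time urt="$upstream_response_time"
_TIMES = re.compile(r'\brt=(\S+)\s+urt="([^"]*)"')


def upstream_total(raw):
    """Sum every upstream attempt; return None if no upstream was used."""
    values = [part.strip() for part in re.split(r"[,:]", raw)]
    numbers = [float(v) for v in values if v and v != "-"]
    return sum(numbers) if numbers else None


def split_latency(line):
    """Return (request_time, upstream_time, remainder) in seconds.

    The remainder is the part of $request_time not spent waiting on
    upstreams, such as client send and receive time. Returns None when
    the line lacks the assumed fields.
    """
    match = _TIMES.search(line)
    if match is None:
        return None
    request_time = float(match.group(1))
    upstream = upstream_total(match.group(2))
    if upstream is None:
        return request_time, None, None
    return request_time, upstream, max(request_time - upstream, 0.0)


line = 'GET /api 200 rt=1.250 urt="0.200"'
print(split_latency(line))
```

A request whose remainder dominates points toward slow clients or NGINX itself; one whose upstream share dominates points toward the backend.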
| Signal | Why it matters | Threshold |
|---|---|---|
| Requests per second | Core throughput; sudden drop indicates outage, backend failure, or client-side issue | TICKET: sustained drop above 50% from baseline, or spike above 3x |
| Connection state breakdown | Reading, Writing, and Waiting reveal the nature of the load: slow clients, slow upstreams, or keepalive bloat | TICKET: Reading above 20% or Writing above 50% of active connections sustained |
| Dropped connections | The gap between accepts and handled is the canary for connection or FD exhaustion before hard failure | PAGE: gap growing (rate above zero) for more than 60 seconds |
| Request processing time | End-to-end latency from NGINX’s perspective; includes client send time | TICKET: P95 trending above 2x rolling baseline for more than 5 minutes |
| Upstream response time | Backend performance in isolation; the dominant factor in proxy deployments | TICKET: P95 trending up more than 20% from baseline |
| HTTP 4xx rate | Client errors, scanning, authentication issues, or misconfiguration | TICKET: sudden 2x increase without known cause |
| Client abandons (499) | Users closing connections before NGINX finishes; often precedes 5xx spikes | TICKET: sustained rate above 1% of requests |
| File descriptor usage | FD exhaustion produces silent connection drops and upstream connect failures | PAGE: above 95% of limit with “too many open files” in error log |
| Worker CPU utilization | Event loop saturation from SSL, compression, or regex overhead | TICKET: sustained above 80% of one core per worker |
| Worker RSS | Memory growth may indicate leaks, buffer bloat, or module overhead | TICKET: RSS growing without bound while connection count is stable |
| Disk space (log partitions) | Prevents a feedback loop where incident-driven log spikes fill the disk | TICKET: below 20% free on log or cache partitions |
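
The dropped connections check in the table above depends on two counters sampled over time. The sketch below is one way to express it in Python, assuming samples are collected at a fixed interval as dictionaries with accepts and handled keys; the function names, the 15-second interval, and the rounding of "more than 60 seconds" to a whole number of intervals are assumptions for illustration.

```python
def gap_growth(samples):
    """Per-interval growth of (accepts - handled) across ordered samples."""
    gaps = [s["accepts"] - s["handled"] for s in samples]
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]


def should_page(samples, interval_seconds, sustain_seconds=60):
    """True when the gap grew in every interval of the sustain window."""
    # Ceiling division: number of intervals needed to cover the window.
    needed = -(-sustain_seconds // interval_seconds)
    growth = gap_growth(samples)
    if len(growth) < needed:
        return False  # not enough history yet
    return all(g > 0 for g in growth[-needed:])


# Illustrative counters sampled every 15 seconds.
history = [
    {"accepts": 1000, "handled": 1000},
    {"accepts": 1100, "handled": 1099},
    {"accepts": 1200, "handled": 1197},
    {"accepts": 1300, "handled": 1294},
    {"accepts": 1400, "handled": 1390},
]
print(should_page(history, interval_seconds=15))
```

Both counters reset when NGINX restarts, so a collector built on this idea would also need to discard history across a restart.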
Level 3 — Mature
Level 3 introduces upstream visibility and kernel-level signals. The kernel accept queue and shared memory zones are invisible to stub_status but are frequent root causes of silent connection drops.
| Signal | Why it matters | Threshold |
|---|---|---|
| Upstream connect time | TCP/TLS handshake overhead to backends; near-zero values indicate keepalive reuse (available since NGINX 1.9.1) | TICKET: P95 above 100 ms for same-datacenter backends |
| Upstream header time | Time to first byte from upstream; isolates backend processing from body transfer (available since NGINX 1.7.10) | TICKET: approaching proxy_read_timeout or trending up |
| Per-upstream-server health | Partial upstream failures are diluted in aggregate metrics | TICKET: individual server generating errors or removed from rotation |
| Cache hit rate | Protects upstream from full traffic load; declining rate increases backend pressure | TICKET: drop above 10% from baseline |
| Shared memory zone errors | limit_req_zone exhaustion silently stops rate limiting; cache zones thrash | PAGE: any “could not allocate node” error for rate or connection limit zones |
| Listen backlog depth | For listening sockets, ss reports the current accept queue length as Recv-Q and the backlog limit as Send-Q; a growing Recv-Q means connections are queued in the kernel but not yet accepted | PAGE: Recv-Q at or above 80% of Send-Q sustained |
| TcpExtListenOverflows | Confirmed kernel-level connection drops before NGINX sees them | PAGE: rate of increase above zero sustained for more than 60 seconds |
| Ephemeral port utilization | Upstream TIME_WAIT accumulation can exhaust ports even at moderate connection counts | TICKET: active plus TIME_WAIT above 60% of the ephemeral range per destination |
| Connection slot utilization | Hard cliff at 100%; proxy traffic uses two slots per request | PAGE: above 95% with corroborating admission loss (gap growing or listen overflows) |
| Reload frequency and success | Failed reloads leave old config active; frequent reloads cause worker accumulation | TICKET: reload failure, or frequency above 1 per minute sustained |
| SSL session cache hit rate | Low reuse drives CPU-heavy full handshakes | TICKET: below 70% |
| Request and response size | Anomalies may indicate data leakage, attacks, or buffer pressure | TICKET: unexpected large responses on non-download endpoints |
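
Kernel counters such as TcpExtListenOverflows are exposed on Linux in /proc/net/netstat as pairs of lines: a header line of counter names followed by a line of values with the same prefix. The following Python sketch parses that layout and computes a rate between two readings. The function names are illustrative, and the sample text is heavily abbreviated; the real file contains many more counters.

```python
def parse_netstat(text):
    """Map names like 'TcpExtListenOverflows' to integer counter values."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Prefix: name name ..." then "Prefix: 1 2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError("header and value lines do not match")
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + name] = int(number)
    return counters


def counter_rate(previous, current, seconds):
    """Increase per second between two readings of one counter."""
    return (current - previous) / seconds


# Abbreviated, illustrative sample of /proc/net/netstat.
SAMPLE = (
    "TcpExt: SyncookiesSent SyncookiesRecv ListenOverflows ListenDrops\n"
    "TcpExt: 3 0 12 12\n"
    "IpExt: InNoRoutes InTruncatedPkts\n"
    "IpExt: 0 0\n"
)

counters = parse_netstat(SAMPLE)
print(counters["TcpExtListenOverflows"])
```

On a live Linux host the text would come from reading /proc/net/netstat directly; the same parser also yields TcpExtSyncookiesSent.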
Level 4 — Expert
Level 4 covers the edge cases that only appear after repeated incidents: per-worker imbalance, stale cache masking upstream failures, and the gap between $upstream_status and $status.
| Signal | Why it matters | Threshold |
|---|---|---|
| Per-worker load imbalance | Uneven connection distribution hides saturation under average metrics | TICKET: one worker above 90% CPU or connections while others remain below 30% |
| Kernel conntrack utilization | iptables/nftables connection tracking can drop packets under high load | TICKET: above 80% of nf_conntrack_max |
| SYN queue overflow | TcpExtSyncookiesSent indicates SYN backlog exhaustion | TICKET: any nonzero rate |
| Per-location latency and error rate | Routing logic failures are invisible in global aggregates | TICKET: P95 or error rate spike isolated to a single location block |
| Upstream keepalive reuse | Ineffective pools generate TIME_WAIT and eventual port exhaustion | TICKET: apparent reuse rate below 70% |
| Rate limit zone capacity | Unique keys versus zone size; silent bypass when the zone fills | TICKET: unique keys approaching 85% of estimated zone capacity |
| Old worker accumulation | Long-lived connections prevent draining after reloads | TICKET: total worker count consistently above worker_processes |
| Stale cache serving rate | proxy_cache_use_stale can mask upstream failures indefinitely | TICKET: sustained STALE without corresponding upstream error visibility |
| Upstream vs client status | $upstream_status versus $status detects error masking via error_page or stale cache | TICKET: sustained discrepancies between upstream and client-facing codes |
| DNS resolution latency | When proxy_pass uses a variable, the name is resolved at request time through resolver, and failures surface only after the resolver timeout | TICKET: sustained resolver errors correlated with 502 spikes |
| TLS certificate expiry | NGINX keeps serving the certificate loaded in memory, even an expired one, until it is reloaded after renewal on disk | TICKET: fewer than 30 days to expiry; PAGE: expired and causing handshake failures |
| Access log write latency | Synchronous logging stalls the event loop under high error volume | TICKET: periodic $request_time spikes correlating with log flush |
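
The comparison of $upstream_status with $status in the table above can be sketched as a small Python check. It assumes both values have already been extracted from an access log entry; the function names are illustrative. Because $upstream_status may list several comma- or colon-separated codes when more than one upstream server was tried, the sketch compares only the last attempt, which is a simplifying assumption.

```python
import re


def last_upstream_status(raw):
    """Status of the final upstream attempt, or None if there was none."""
    parts = [p.strip() for p in re.split(r"[,:]", raw)]
    codes = [p for p in parts if p and p != "-"]
    return int(codes[-1]) if codes else None


def is_masked_error(status, upstream_status_raw):
    """True when the upstream failed but the client saw a non-5xx code.

    This is the pattern produced by error_page rewrites or by stale
    cache responses covering for a failing backend.
    """
    upstream = last_upstream_status(upstream_status_raw)
    return upstream is not None and upstream >= 500 and status < 500


print(is_masked_error(200, "502"))       # upstream failed, client saw 200
print(is_masked_error(200, "502, 200"))  # retry succeeded, nothing masked
```

Counting how often this returns true per minute gives a rate that can be compared against the client-facing 5xx rate.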
How Netdata helps
Netdata collects stub_status metrics out of the box when the endpoint is reachable, surfacing active connections, accepts, handled, requests, and the reading/writing/waiting breakdown in real time. It correlates per-worker process CPU and RSS with connection state counts to help distinguish organic load from memory leaks or uneven distribution. Access log parsing is available for 5xx rate, 499 volume, and $request_time / $upstream_response_time percentiles without requiring external log shippers. Kernel networking metrics like TcpExtListenOverflows and listen queue depth are collected at the OS level and can be correlated directly with NGINX connection metrics. File descriptor usage per process and SSL certificate expiry are monitored independently of NGINX’s own reporting, catching silent limits before they become user-visible outages.