NGINX access log performance: buffering, sampling, and the event loop

Every NGINX worker is a single-threaded event loop. By default, finishing a request triggers an immediate, synchronous write of the access log line. Under normal load the cost is negligible. Under incident conditions, when error rates and log volume spike, that synchronous write becomes a compounding failure: disk I/O stalls the event loop, responses slow down, clients time out, and the resulting errors and abandonments generate even more log lines.

This guide covers how unbuffered logging interacts with the worker event loop, the directives that move logging off the hot path, and the production signals that reveal when logging itself is the bottleneck.

What it is and why it matters

In NGINX, the worker process that serves the request also formats the log line and issues the write. Without explicit tuning, each completed request results in a blocking write(2) syscall and, when the page cache is under pressure, a wait on the disk. This creates a positive feedback loop during incidents: higher error rates produce more log volume, which increases disk I/O pressure and CPU time in write(2), slowing request processing and generating yet more errors.

The default behavior is safe for low-traffic deployments. In production, especially behind reverse proxies or ingress controllers handling thousands of requests per second, leaving access logging unbuffered makes the disk subsystem a direct participant in every request’s tail latency. Tuning trades immediacy, completeness, and CPU for event loop headroom.

How it works

When a worker finishes a request, it evaluates the access log configuration for the location. If logging is enabled, the worker formats the line and writes it to the destination. The mechanics of that write determine whether the worker immediately returns to the event loop or waits on the storage layer.

Synchronous writes and the event loop

By default, NGINX issues a write(2) syscall for every log line. The single-threaded worker cannot process other connections while the kernel copies data to the page cache or waits for the physical disk. On flash storage under light load this takes well under a millisecond. On spinning disks, during log rotation, or during concurrent error-log spikes, the same write can stall the worker for tens of milliseconds. Every connection assigned to that worker is delayed, not just the request being logged.

Disk I/O from logging is a silent performance killer. Error log volume during incidents can spike 100-1000x above baseline. Because error log writes are also synchronous, combined access and error log pressure can saturate the log partition’s IOPS or throughput.

Buffering and flush triggers

Adding buffer=size to the access_log directive accumulates log lines in a per-worker memory buffer. The buffer flushes to disk in a single write(2) when one of three conditions is met:

  • The next line does not fit in the remaining buffer space.
  • The oldest buffered data exceeds the flush=time threshold.
  • The worker process reopens its log files or shuts down.

Using buffer= alone reduces syscall count dramatically, but without flush= a low-traffic worker may hold lines in memory until the buffer fills or the worker exits. Setting flush=5s caps the visibility delay while preserving batching. The total memory cost is roughly buffer_size * worker_processes for each buffered log, typically negligible compared to connection buffers.
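A minimal sketch of a buffered access log, assuming the built-in combined format and a conventional log path. The 64k size and 5s interval are illustrative starting points, not values taken from the NGINX documentation:

```nginx
# http, server, or location context
# buffer=64k : batch log lines in a per-worker memory buffer
# flush=5s   : write buffered data out once it is older than 5 seconds
access_log /var/log/nginx/access.log combined buffer=64k flush=5s;
```

As a rough estimate, assuming lines average about 200 bytes, a 64k buffer holds a little over 300 lines, so a busy worker issues one write(2) per few hundred requests instead of one per request.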

Gzip compression

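The access_log directive accepts a gzip parameter that compresses buffered log data before writing. Specifying gzip enables buffering on its own: per the NGINX documentation the buffer defaults to 64K and the compression level to 1, and buffer= can be added to change the size. NGINX must be built with zlib for this to work. It trades CPU cycles in the worker for fewer bytes on disk. Under I/O-bound incidents, gzip logging reduces disk pressure, though it increases per-worker CPU utilization. Validate it against your workload before enabling in production.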
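A hedged example of compressed logging. The path, level, and sizes are assumptions chosen for illustration:

```nginx
# gzip=4 picks a compression level between 1 (fastest) and 9 (smallest output).
# buffer= is given explicitly to size the batch that gets compressed;
# flush= bounds how long lines can wait in memory.
access_log /var/log/nginx/access.log.gz combined gzip=4 buffer=128k flush=5s;
```

The resulting file is a gzip stream and can be read with zcat. Tools that tail the log line by line will not see plain text, which is worth checking against your shipping pipeline first.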

Variable paths and file descriptor caching

If the access log path contains variables, for example /var/log/nginx/$host-access.log, NGINX opens and closes the file on every write unless open_log_file_cache is configured. The directive accepts max, inactive, valid, and min_uses parameters. The default is off. Without caching, variable-path logging amplifies syscall cost because the file must be opened and closed per line. This is especially expensive at high connection rates. Per the NGINX documentation, buffered writes do not work for logs with variable paths, so the descriptor cache is the main lever available here.
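A sketch of variable-path logging with the descriptor cache enabled. The cache values mirror the example in the NGINX documentation; the server names are hypothetical:

```nginx
# http context
# max      : upper bound on cached descriptors (least recently used are closed on overflow)
# inactive : close a cached descriptor after this long without a write
# valid    : how often to re-check that the file still exists under the same name
# min_uses : uses within the inactive window needed for a descriptor to stay open
open_log_file_cache max=1000 inactive=20s valid=1m min_uses=2;

server {
    listen 80;
    server_name example.com www.example.com;   # hypothetical names
    access_log /var/log/nginx/$host-access.log combined;
}
```

The worker user needs permission to create files in the log directory. Because $host is derived from the request, a catch-all server could end up creating one file per arbitrary hostname, so restricting the variable to known values is a sensible precaution.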

Conditional logging and sampling

NGINX supports conditional logging via if=condition on the access_log directive. Combined with a map block, this enables status-code filtering, user-agent filtering, or request-rate sampling. For example, log all 5xx responses while sampling 1% of 200s. The line is discarded before it reaches the buffer, reducing both memory footprint and eventual disk I/O. The tradeoff is reduced forensic completeness for skipped requests.
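The 5xx-plus-sampling example above could be sketched as follows. The variable names $sampled and $log_this are hypothetical, and hashing $request_id (available since NGINX 1.11.0) is one assumed way to get a roughly uniform sample:

```nginx
# http context

# Put about 1% of requests into the sample by hashing the per-request ID.
split_clients "${request_id}" $sampled {
    1%  1;
    *   0;
}

# Decide per response status whether the line is written.
map $status $log_this {
    ~^5      1;          # always log 5xx
    ~^2      $sampled;   # log only the sampled share of 2xx
    default  1;          # log everything else (3xx, 4xx)
}

# A request is not logged when the if= value is "0" or an empty string.
access_log /var/log/nginx/access.log combined buffer=64k flush=5s if=$log_this;
```

When reading sampled logs, remember to scale 2xx counts by the sampling rate before comparing them with unsampled 5xx counts.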

Syslog and off-host shipping

Sending access logs to a syslog destination replaces the local disk write with a socket send. NGINX's built-in syslog support sends datagrams, either over UDP (port 514 by default) or to a unix domain socket; it does not speak TCP syslog. Sending UDP straight to a remote endpoint ties log delivery to network conditions and loses lines silently when packets are dropped. The safer pattern is to send to a local syslog daemon (rsyslog or syslog-ng) listening on UDP or a unix socket, and let that daemon forward over TCP if reliable delivery is needed. This decouples NGINX from remote network latency.
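A minimal sketch of the local-relay pattern, assuming the syslog daemon listens on /dev/log (common on Linux, but check your distribution). The facility and tag values are illustrative:

```nginx
# Send each log line as a datagram to the local syslog socket.
# Remote forwarding is the daemon's job, not the NGINX worker's.
access_log syslog:server=unix:/dev/log,facility=local7,tag=nginx,severity=info combined;
```

A file-based access_log can be kept alongside this one if a local copy is still needed.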

Thread pool limitations

The aio threads and thread_pool directives do not apply to access log writes. Thread pools offload read(2) and sendfile(2) calls and, when the aio_write directive is enabled, writes of temporary files. Log file writes remain synchronous to the worker. The only mitigations are buffering, compression, reducing log volume, or moving the destination off the hot path.

flowchart TD
    A[Request completes] --> M{"Conditional if="}
    M -->|False| N[Discard line]
    M -->|True| B{"access_log buffer= set?"}
    B -->|No| C["write(2) per request"]
    B -->|Yes| D[Copy to per-worker memory buffer]
    D --> E{"Flush trigger?<br/>capacity / time / worker exit"}
    E -->|No| F[Return to event loop]
    E -->|Yes| G["Batched write(2)"]
    C --> H[Disk I/O blocks event loop]
    G --> H
    H --> I[Slow responses]
    I --> J["Client timeouts & 499s"]
    J --> K[Log volume spikes]
    K --> L[Amplified disk pressure]
    L --> H

Where it shows up in production

The logging bottleneck rarely appears in isolation. It surfaces during other failures and is often misdiagnosed as upstream slowness or CPU saturation.

  • Error storm feedback loops: A backend cascade failure generates 502s and 504s. The resulting error log spike hits the same disk as the access log. Synchronous writes amplify the stall, slowing healthy requests and increasing the 499 rate.
  • Latency gaps with normal upstream metrics: $request_time spikes while $upstream_response_time remains flat. The gap is time lost inside NGINX, often to disk I/O or event loop contention from log flushing.
  • Variable-path vhost logging: Hosting platforms that log per-customer or per-domain without open_log_file_cache see syscall overhead dominate at high connection rates, even when traffic is otherwise cacheable.
  • Log rotation stalls: If logrotate renames log files without sending USR1 to NGINX, workers keep writing to the renamed file through their still-open descriptors. Unbuffered writes add filesystem metadata overhead on every line, and if the rotated file is later deleted, its disk space is not freed until the workers close it.
  • Ingress controller reload storms: Kubernetes ingress controllers that reload NGINX frequently compound the problem when old workers with long-lived connections flush buffers or write logs during shutdown, competing with new workers for disk I/O.
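To make the $request_time versus $upstream_response_time gap from the list above visible, both variables need to be in the log line. A sketch, where the format name timing and the field labels rt= and urt= are hypothetical:

```nginx
# http context
log_format timing '$remote_addr [$time_local] "$request" $status '
                  'rt=$request_time urt=$upstream_response_time';

access_log /var/log/nginx/timing.log timing buffer=64k flush=5s;
```

Both values are in seconds with millisecond resolution. $upstream_response_time can hold several comma-separated values when a request is retried against more than one upstream server, so parsers should allow for that.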

Tradeoffs and when to use it

Buffered logging with flush=time: Enable buffer= on every production access log. Use flush=5s or flush=10s to bound delay. The memory cost is small and the reduction in syscall pressure is substantial. Do not rely on buffering alone if your log shipping pipeline requires line-level immediacy for security alerting; in that case, maintain a separate, unbuffered or lightly buffered stream for errors.
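One way to keep an immediate stream for errors next to a buffered main log is two access_log directives at the same level. This is a sketch: the variable $is_error, the file names, and the choice of 4xx and 5xx as "errors" are assumptions:

```nginx
# http context
map $status $is_error {
    ~^[45]   1;
    default  0;
}

# Main log: batched, visible within about 5 seconds.
access_log /var/log/nginx/access.log combined buffer=64k flush=5s;

# Alerting stream: unbuffered, written only for 4xx and 5xx responses.
access_log /var/log/nginx/alerts.log combined if=$is_error;
```

The unbuffered stream still costs one write(2) per error response, so during an error storm it carries the same stall risk described earlier; narrowing the condition is a way to limit that.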

Gzip compression: Consider gzip when disk throughput or log partition capacity is the binding constraint. Measure per-worker CPU before and after enabling it. Avoid gzip logging if workers are already CPU-saturated from TLS or response compression workloads.

Conditional logging: Apply if= to skip health-check probes, static asset requests, or a sampled subset of 200s. Never silently drop 5xx lines or authentication failures unless you have an alternate telemetry path. The goal is to reduce volume without losing the signals you need for incident response.

Open log file cache: Mandatory for variable-path access logs. Size max= to the number of distinct values the variable produces. High-cardinality variables, such as thousands of virtual hosts, can exhaust the cache and trigger LRU thrashing, which reintroduces the open/close overhead the cache was meant to remove.

Syslog shipping: Use a local relay. NGINX's built-in syslog sends only datagrams (UDP or a unix domain socket), so reliable TCP delivery to a remote endpoint has to come from the relay rather than from NGINX itself. UDP syslog is non-blocking but can overflow the socket buffer under extreme volume, causing silent log line drops. Unix domain sockets to a local collector offer the best balance of reliability and latency for most deployments.

Signals to watch in production

| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk I/O pressure on log partition | Synchronous log writes stall the event loop when disk I/O saturates. | iowait > 10% or disk queue depth > 2 on the log partition during traffic spikes. |
| Error log message rate | Error volume can spike 100-1000x during incidents, compounding disk pressure. | Error lines per minute > 10x baseline sustained for > 2 minutes. |
| Request processing time | Periodic latency spikes may correlate with buffered log flushes or disk stalls. | P95 $request_time rises without corresponding rise in $upstream_response_time. |
| File descriptor usage per worker | Variable-path logs without open_log_file_cache consume an FD per write. | FD count grows faster than active connections, or error log shows accept4() failed (24: Too many open files). |
| Worker CPU utilization | Gzip log compression trades CPU for I/O; high CPU with gzip enabled may warrant tuning. | Per-worker CPU > 80% sustained after enabling gzip on access logs. |

How Netdata helps

  • Disk I/O latency per partition correlates with NGINX worker event loop stalls when log writes are synchronous. Spikes on the log partition during incident response are a leading indicator of the logging feedback loop.
  • Per-process file descriptor monitoring surfaces when variable-path logging or log file cache misses consume FDs faster than connection growth.
  • NGINX stub_status tracks active connections and request rates. Correlated with disk I/O metrics, it distinguishes logging bottlenecks from upstream failures.
  • Access log parsing tracks $request_time percentiles. A gap between $request_time and $upstream_response_time that widens during disk spikes points to event loop contention.
  • Error log rate monitoring captures the 100-1000x volume spikes that precede disk saturation, giving operators early warning before the feedback loop accelerates.
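The stub_status metrics mentioned above are only available if the endpoint is exposed. A minimal sketch, assuming NGINX was built with ngx_http_stub_status_module; the port and path are hypothetical and should match whatever your collector is configured to scrape:

```nginx
server {
    listen 127.0.0.1:8080;          # hypothetical local-only port

    location = /stub_status {
        stub_status;                # requires ngx_http_stub_status_module
        access_log off;             # keep scrapes out of the access log
        allow 127.0.0.1;
        deny all;
    }
}
```

Turning access_log off for this location keeps the monitoring scrapes themselves from adding to log volume.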