NGINX log disk full: when logging silently stops and how to recover

During an incident, upstream latency climbs and 5xx errors appear. You run tail -f /var/log/nginx/error.log and the cursor sits there. The last entry is hours old. Requests still return 200s. The server is serving, but it has stopped logging. Disk utilization shows /var/log at 100 percent.

This is the NGINX log disk-full failure mode. The process never signals the client or the operator that it can no longer write diagnostics. Visibility evaporates when it is most needed.

What this means

NGINX writes access logs synchronously and unbuffered by default. Every line is a write() syscall. When the partition returns ENOSPC, NGINX drops the log line and continues processing the request. No error is emitted to the client.
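
The ENOSPC path can be observed without filling a real disk. On Linux, /dev/full is a device that opens normally but fails every write with "No space left on device". The sketch below is only an illustration of that errno, not of NGINX internals; probe_write is a hypothetical helper name.

```shell
# probe_write TARGET: append one short line and report whether write() succeeded.
# cat is an external command, so its exit status reflects the failed write
# regardless of how the shell's builtins handle write errors.
probe_write() {
    if printf 'probe\n' | cat 2>/dev/null >> "$1"; then
        echo "write ok: $1"
    else
        echo "write failed: $1"
    fi
}

scratch=$(mktemp)
probe_write "$scratch"      # an ordinary file on a healthy filesystem
probe_write /dev/full       # every write here fails with ENOSPC on Linux
rm -f "$scratch"
```

The caller has to check the result and decide what to do with a failed write. NGINX, as described above, discards the line and carries on with the request.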

If the error log lives on the same full partition, the situation becomes self-sealing: NGINX can neither create new log files nor append a message explaining why writes are failing. The one file that would normally record the write failure is itself unwritable.

Synchronous log writes block the event loop. During an incident, error log volume can spike 100-1000x above baseline. That spike generates disk I/O pressure, stalls the event loop, and increases latency for all requests. More latency generates more errors, which generate more log writes. The server enters a positive feedback loop, and because the disk is full there is no record of it.

Buffered logging via access_log ... buffer=32k flush=5s batches writes and reduces syscall overhead, but it still blocks or drops when the underlying filesystem is full. Memory-buffered error logging (error_log memory:32m debug) avoids disk pressure entirely, yet it requires compilation with --with-debug and loses buffered entries on crash. For standard production binaries, disk space is the hard boundary.

flowchart TD
    A[Upstream failure or traffic spike] --> B[Error log volume spikes 100-1000x]
    B --> C[Log partition fills to 100%]
    C --> D[write returns ENOSPC]
    D --> E[Log line dropped silently]
    E --> F[Error log cannot record its own failure]
    F --> G[Operator sees stale or empty logs during active incident]
    G --> H[No visibility into root cause while traffic continues]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Unrotated logs after rotation without USR1 | Log file on disk is enormous, or a fresh rotated file stays at zero bytes while space is still consumed | lsof \| grep deleted for nginx log paths |
| Inode exhaustion | df -h shows free blocks but df -i shows 100%; NGINX cannot create new files | df -i on the log partition |
| Runaway error storm | Error log volume spikes 100-1000x above baseline; event loop slows; latency climbs | Error log growth rate versus baseline |
| Competing disk consumers on a shared partition | Journald, application logs, or package caches consume the same filesystem | Disk usage by subdirectory under /var/log |
| Buffered log flush blocked | Access log entries batch until the buffer fills, then vanish on flush to full disk | access_log buffering configuration and partition health |

Quick checks

# Block and inode utilization for the log filesystem
df -h /var/log/nginx/
df -i /var/log/nginx/

# Deleted log files still held open by nginx workers
lsof | grep deleted | grep -E 'nginx|/var/log/nginx'

# Master process file descriptors for log files
# PID file path varies by distribution
ls -l /proc/$(cat /var/run/nginx.pid)/fd/ | grep log

# Verify NGINX is still serving traffic (requires stub_status on localhost)
curl -s http://127.0.0.1/stub_status

# Error log modification time to confirm write stall
stat /var/log/nginx/error.log

# Log files by size and recency
ls -lt /var/log/nginx/ | head

# systemd journal footprint if /var/log is shared
du -sh /var/log/journal/ 2>/dev/null || true

How to diagnose it

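  1. Confirm disk exhaustion. Run df -h and df -i on the log partition. At 100 percent block usage, appends to existing logs fail with ENOSPC. At 100 percent inode usage, NGINX can still append to files it already has open but cannot create new ones, for example after a rotation.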
  2. Detect deleted-but-open files. Run lsof | grep deleted | grep /var/log/nginx. If workers hold deleted log file descriptors, the space is consumed by invisible inodes that df counts as used but ls no longer sees.
  3. Map file descriptors to paths. Use ls -l /proc/<PID>/fd/<FD> for the PIDs from lsof. The symlink target shows the deleted path.
  4. Verify log rotation state. Check whether the current on-disk file is growing. If logrotate created a new file but the postrotate script never sent USR1, the new file stays empty while NGINX continues writing to the deleted old inode.
  5. Correlate with incident timing. If the error log stopped growing when upstream latency spiked or the 5xx rate climbed, the fill is likely error-storm-driven.
  6. Check competing disk consumers. Run du -sh /var/log/* to identify heavy consumers like journald on the same partition.
  7. Determine if buffered logging masked the onset. If access_log uses buffer=, entries may have batched successfully until the buffer filled, then flushed to a full disk and vanished without earlier warning.
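
Steps 2 and 3 can be combined by reading /proc directly, which also works on hosts where lsof is not installed. This is a minimal sketch assuming Linux and GNU or BusyBox stat; list_deleted_fds is a hypothetical helper, and reading another user's /proc/<PID>/fd normally requires root.

```shell
# list_deleted_fds PID [PATH_PREFIX]
# Print "fd size_bytes target" for every descriptor of PID whose file has
# been unlinked. Linux appends " (deleted)" to the symlink target in /proc.
list_deleted_fds() {
    pid=$1
    prefix=${2:-/}
    for fd in /proc/"$pid"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$prefix"*" (deleted)")
                # stat -L follows the /proc link to the still-open inode
                size=$(stat -L -c %s "$fd" 2>/dev/null || echo 0)
                echo "${fd##*/} $size $target"
                ;;
        esac
    done
}
```

A typical call would loop over the NGINX PIDs, for example list_deleted_fds <PID> /var/log/nginx for each one. The size column shows how much space each deleted file will release once its descriptor is closed.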

Fixes

Reclaim space without restarting nginx

If lsof shows deleted log files held by NGINX, do not restart the service to free them. Send USR1 to reopen file descriptors at the currently configured paths:

sudo kill -USR1 $(cat /var/run/nginx.pid)

Wait several seconds for workers to close old file descriptors, then verify with df -h.
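
During an incident it is easy to signal a stale PID. The wrapper below is a hedged sketch that sends USR1 only when the PID file is readable and the process can be signalled; reopen_nginx_logs is a hypothetical name, and the default PID file path is the one used above, which varies by distribution.

```shell
# reopen_nginx_logs [PIDFILE]
# Send USR1 to the nginx master only if the PID file is readable and the
# process accepts signals from the current user.
reopen_nginx_logs() {
    pidfile=${1:-/var/run/nginx.pid}
    if [ ! -r "$pidfile" ]; then
        echo "pid file not readable: $pidfile" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is not running or not signalable" >&2
        return 1
    fi
    kill -USR1 "$pid"
}
```

Run it with sufficient privileges, then compare df -h output before and after to confirm that the space was released.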

If you recently changed log paths via nginx -s reload (HUP), sending USR1 afterward can recreate zero-sized files at the old paths due to an upstream bug (Issue #1069) affecting nginx 1.26.x. If you are on 1.26.x, verify behavior after sending USR1.

Recover log data before reopening

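To preserve the contents of a deleted log file before NGINX closes its FD, copy it out through /proc to a filesystem that still has free space, not to the full log partition:

sudo cp /proc/<PID>/fd/<FD> <path-on-another-filesystem>/access.log.recovered

Then send USR1. This recovers entries already written to the deleted inode, which no longer has a visible path. It does not recover lines still held in an NGINX memory buffer or lines that were dropped on ENOSPC.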
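
The /proc technique can be rehearsed safely in a scratch directory before it is needed on a production host. The sketch below assumes Linux; the temporary directory and the choice of descriptor 9 are arbitrary, and the shell itself stands in for the NGINX worker that holds the file open.

```shell
# Demonstration in a scratch directory; nothing here touches real logs.
workdir=$(mktemp -d)
printf 'line 1\nline 2\n' > "$workdir/access.log"

exec 9>>"$workdir/access.log"    # hold the file open, as a worker would
rm "$workdir/access.log"         # simulate a rotation that deleted the file

# The name is gone, but the inode is still reachable through /proc.
cp "/proc/$$/fd/9" "$workdir/access.log.recovered"

exec 9>&-                        # closing the last FD finally frees the space
cat "$workdir/access.log.recovered"
```

The order matters: once the last descriptor is closed, the inode is released and the data can no longer be copied.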

Emergency truncation

If the active log file is bloated and you cannot wait for rotation, truncate it in place:

# Destructive: zeroes the file immediately and destroys contents
sudo truncate -s 0 /var/log/nginx/access.log

This frees the space at once, but NGINX keeps writing to the same inode, so the file starts growing again right away. Truncation is a tourniquet, not a cure. Use it only when the log contents are expendable.
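
Truncating in place is safe for a writer that holds the file open in append mode, which is how NGINX opens its log files as far as the author of this sketch is aware: each write goes to the current end of file, so the next entry lands at offset zero rather than leaving a sparse hole. The scratch-directory demonstration below uses descriptor 8 as a stand-in for the NGINX log FD.

```shell
workdir=$(mktemp -d)
log="$workdir/access.log"

exec 8>>"$log"                 # append-mode FD, standing in for the nginx log FD
printf 'old entry\n' >&8
truncate -s 0 "$log"           # frees the blocks; the inode and FD are unchanged
printf 'new entry\n' >&8       # append-mode writes land at the new end of file

size_after=$(wc -c < "$log")
exec 8>&-
echo "size after truncate and one write: $size_after bytes"
```

The final size equals the length of the single new entry, which confirms that the old contents are gone and no hole was left behind.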

Redirect logs to survive the incident

If the partition is dead and you need visibility now, temporarily redirect the error log to tmpfs and reload:

error_log /dev/shm/nginx-error.log crit;

Apply with sudo nginx -s reload. A reload (HUP) is required because the path changed; USR1 only reopens files at the paths already loaded. Warning: tmpfs is volatile and backed by memory. Copy logs off the node before any restart.

Stop the feedback loop

If an error storm is actively filling the disk faster than you can free space, cut the volume at the source:

  • Set access_log off; in the affected server block and reload to eliminate access log writes immediately.
  • Raise the error_log level to crit temporarily to reduce error volume.

These changes apply on reload. Old workers may continue writing briefly until they drain.

Fix inode exhaustion

If df -i shows 100 percent but block space remains:

  • Clear NGINX cache directories, session file paths, or other small-file accumulations on the affected partition.
  • If the filesystem was created with too few inodes, migrate the data to a new filesystem. This cannot be fixed online.
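
Finding where the inodes went usually means counting files per directory. The following is a rough sketch built from standard tools; top_inode_dirs is a hypothetical helper, it counts regular files only, and it miscounts file names that contain newlines.

```shell
# top_inode_dirs ROOT [N]: the N directories under ROOT that hold the most
# regular files, without crossing filesystem boundaries (-xdev).
top_inode_dirs() {
    root=$1
    limit=${2:-10}
    find "$root" -xdev -type f 2>/dev/null |
        sed 's|/[^/]*$||' |          # strip the file name, keep the directory
        sort | uniq -c | sort -rn | head -n "$limit"
}
```

Calling top_inode_dirs /var/log 5 prints the five busiest directories with their file counts. On systems with a recent GNU coreutils, du --inodes gives a similar per-directory view.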

Prevention

  • Use signal-based rotation. Never use copytruncate for high-traffic NGINX deployments. Under high request rates, a request logged between the copy and truncate steps is lost. Always send kill -USR1 in the logrotate postrotate script.
  • Monitor both disk space and inode utilization on the log partition. Alert at 85 percent blocks and 80 percent inodes.
  • Cap competing log consumers. In /etc/systemd/journald.conf, set SystemMaxUse=1G to prevent journald from consuming a shared partition.
  • Set worker_shutdown_timeout so old workers do not linger indefinitely after reloads, holding old log file descriptors open.
  • Isolate logs on a dedicated partition or volume. Runaway access logs should not be able to take down the root filesystem or application data.
  • Use access_log ... buffer=32k flush=5s to reduce syscall pressure, but treat this as performance tuning, not protection against disk-full events. It only delays the symptom.
  • Review logrotate configuration to confirm that the postrotate script sends USR1 and that copytruncate is not set. After any log path change in NGINX configuration, verify that subsequent rotations reopen the correct files.
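
Where no monitoring agent is available, the block and inode thresholds from the list above can be checked with a short script run from cron. This is a sketch, not a replacement for real alerting; usage_pct and check_log_partition are hypothetical names, and the parsing assumes the Linux df -P and df -Pi column layout.

```shell
# usage_pct PATH [inodes]: used percentage of PATH's filesystem, blocks by default.
usage_pct() {
    if [ "${2:-blocks}" = "inodes" ]; then
        df -Pi "$1"
    else
        df -P "$1"
    fi | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_log_partition PATH [BLOCK_MAX] [INODE_MAX]
# Prints a warning and returns 1 when either threshold is reached.
check_log_partition() {
    blocks=$(usage_pct "$1")
    inodes=$(usage_pct "$1" inodes)
    status=0
    # Non-numeric values (some filesystems report "-" for inodes) are skipped.
    if [ "$blocks" -ge "${2:-85}" ] 2>/dev/null; then
        echo "WARN: blocks at ${blocks}% on $1"; status=1
    fi
    if [ "$inodes" -ge "${3:-80}" ] 2>/dev/null; then
        echo "WARN: inodes at ${inodes}% on $1"; status=1
    fi
    return "$status"
}
```

check_log_partition /var/log/nginx uses the 85 and 80 percent defaults. The non-zero exit status makes it easy to chain a notification command.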

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk space % on log partition | Direct cause of ENOSPC and silent log drops | >85% sustained |
| Inode % on log partition | Invisible cause of file creation failures | >80% |
| Error log line rate | Incident feedback loop that accelerates fill | >10x baseline spike |
| Active connections (stub_status) | Confirms NGINX is serving despite lost logs | Baseline deviation while logs are silent |
| Requests per second (stub_status) | Traffic may be dropping due to event loop stall | Sustained drop >50% from baseline |
| File descriptors per worker | Deleted-but-open files consume FDs and disk | Trending up without traffic growth |
| $request_time spikes matching log flush timing | Synchronous writes stalling the event loop | Periodic latency spikes correlating with flush intervals |
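
The error log line rate can be sampled by hand when no collector is in place. The sketch below counts lines twice over a fixed interval and compares the result with a baseline supplied by the operator; line_rate and is_error_storm are hypothetical helpers, and the tenfold factor mirrors the warning sign in the table.

```shell
# line_rate FILE SECONDS: average lines appended per second over the interval.
# Integer arithmetic only; a rotation or truncation mid-interval gives a
# negative or misleading result.
line_rate() {
    before=$(wc -l < "$1")
    sleep "$2"
    after=$(wc -l < "$1")
    echo $(( (after - before) / $2 ))
}

# is_error_storm RATE BASELINE: true when RATE exceeds ten times BASELINE.
is_error_storm() {
    [ "$1" -gt $(( $2 * 10 )) ]
}
```

For example, rate=$(line_rate /var/log/nginx/error.log 10) followed by is_error_storm "$rate" 5 flags anything above 50 lines per second. The baseline of 5 is made up for illustration and has to come from your own measurements.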

How Netdata helps

Netdata exposes this failure mode through:

  • Disk space and inode utilization alerts before ENOSPC hits.
  • Correlation of a flatline in nginx access log entries with sustained requests counter growth from stub_status, exposing silent log loss.
  • Error log line rate spikes alongside upstream response time increases to catch error storms that threaten to fill the disk.
  • Per-worker file descriptor usage monitoring to detect accumulation of deleted-but-open log files after failed rotations.
  • Log partition I/O wait correlated with nginx request latency, revealing when synchronous logging stalls the event loop.