NGINX log disk full: when logging silently stops and how to recover
During an incident, upstream latency climbs and 5xx errors appear. You run tail -f /var/log/nginx/error.log and the cursor sits there. The last entry is hours old. Requests still return 200s. The server is serving, but it has stopped logging. Disk utilization shows /var/log at 100 percent.
This is the NGINX log disk-full failure mode. The process never signals the client or the operator that it can no longer write diagnostics. Visibility evaporates when it is most needed.
What this means
NGINX writes access logs synchronously and unbuffered by default. Every line is a write() syscall. When a write fails with ENOSPC, NGINX drops the log line, attempts to report the failure in the error log, and continues processing the request. No error is emitted to the client.
If the error log lives on the same full partition, the situation becomes self-sealing. NGINX cannot open new files and cannot append diagnostic messages explaining why. The error log does not receive a write-failure message when the error log itself is on the full partition.
Synchronous log writes block the event loop. During an incident, error log volume can spike 100-1000x above baseline. That spike generates disk I/O pressure, stalls the event loop, and increases latency for all requests. More latency generates more errors, which generate more log writes. The server enters a positive feedback loop, and because the disk is full there is no record of it.
Buffered logging via access_log ... buffer=32k flush=5s batches writes and reduces syscall overhead, but it still blocks or drops when the underlying filesystem is full. Memory-buffered error logging (error_log memory:32m debug) avoids disk pressure entirely, yet it requires compilation with --with-debug and loses buffered entries on crash. For standard production binaries, disk space is the hard boundary.
flowchart TD
A[Upstream failure or traffic spike] --> B[Error log volume spikes 100-1000x]
B --> C[Log partition fills to 100%]
C --> D[write returns ENOSPC]
D --> E[Log line dropped silently]
E --> F[Error log cannot record its own failure]
F --> G[Operator sees stale or empty logs during active incident]
G --> H[No visibility into root cause while traffic continues]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Logs rotated without sending USR1 | Log file on disk is enormous, or a fresh rotated file stays at zero bytes while space is still consumed | lsof \| grep deleted for nginx log paths |
| Inode exhaustion | df -h shows free blocks but df -i shows 100%; NGINX cannot create new files | df -i on the log partition |
| Runaway error storm | Error log volume spikes 100-1000x above baseline; event loop slows; latency climbs | Error log growth rate versus baseline |
| Competing disk consumers on a shared partition | Journald, application logs, or package caches consume the same filesystem | Disk usage by subdirectory under /var/log |
| Buffered log flush blocked | Access log entries batch until the buffer fills, then vanish on flush to full disk | access_log buffering configuration and partition health |
Quick checks
# Block and inode utilization for the log filesystem
df -h /var/log/nginx/
df -i /var/log/nginx/
# Deleted log files still held open by nginx workers
lsof | grep deleted | grep -E 'nginx|/var/log/nginx'
# Master process file descriptors for log files
# PID file path varies by distribution
ls -l /proc/$(cat /var/run/nginx.pid)/fd/ | grep log
# Verify NGINX is still serving traffic (requires stub_status on localhost)
curl -s http://127.0.0.1/stub_status
# Error log modification time to confirm write stall
stat /var/log/nginx/error.log
# Log files by size and recency
ls -lt /var/log/nginx/ | head
# systemd journal footprint if /var/log is shared
du -sh /var/log/journal/ 2>/dev/null || true
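The modification-time check above is easier to compare against an incident timeline as a single number. The following is a minimal sketch, assuming GNU coreutils stat (BSD and macOS use different flags). The helper name log_age_seconds is hypothetical and not part of NGINX.

```shell
# log_age_seconds FILE: print how many seconds ago FILE was last modified.
# Assumption: GNU stat, where -c %Y prints the mtime as epoch seconds.
log_age_seconds() {
    mtime=$(stat -c %Y "$1") || return 1
    now=$(date +%s)
    echo $(( now - mtime ))
}

# Example (path taken from the checks above):
#   log_age_seconds /var/log/nginx/error.log
```

A large value while stub_status still shows request counters climbing suggests the write stall described above rather than a quiet server.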
How to diagnose it
- Confirm disk exhaustion. Run df -h and df -i on the log partition. If either blocks or inodes are at 100 percent, NGINX cannot write or open logs.
- Detect deleted-but-open files. Run lsof | grep deleted | grep /var/log/nginx. If workers hold deleted log file descriptors, the space is consumed by invisible inodes that df counts as used but ls no longer sees.
- Map file descriptors to paths. Use ls -l /proc/<PID>/fd/<FD> for the PIDs from lsof. The symlink target shows the deleted path.
- Verify log rotation state. Check whether the current on-disk file is growing. If logrotate created a new file but the postrotate script never sent USR1, the new file stays empty while NGINX continues writing to the deleted old inode.
- Correlate with incident timing. If the error log stopped growing when upstream latency spiked or the 5xx rate climbed, the fill is likely error-storm-driven.
- Check competing disk consumers. Run du -sh /var/log/* to identify heavy consumers like journald on the same partition.
- Determine if buffered logging masked the onset. If access_log uses buffer=, entries may have batched successfully until the buffer filled, then flushed to a full disk and vanished without earlier warning.
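The deleted-file and descriptor-mapping steps can also be done without lsof by reading /proc directly. This is a sketch under the assumption of a Linux /proc filesystem, where the kernel appends " (deleted)" to the symlink target of an unlinked file. The function name list_deleted_fds is hypothetical.

```shell
# list_deleted_fds PID: print "FD<TAB>TARGET" for each descriptor of PID
# whose target file has been unlinked. Assumes Linux /proc.
# Reading another user's /proc/<PID>/fd usually requires root.
list_deleted_fds() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # readlink fails for descriptors that vanish mid-scan; skip those
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s\t%s\n' "${fd##*/}" "$target" ;;
        esac
    done
}

# Example: scan every nginx process
#   for pid in $(pgrep nginx); do list_deleted_fds "$pid"; done
```

Any line that names a path under /var/log/nginx points at a log file that was removed or rotated while NGINX still holds it open.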
Fixes
Reclaim space without restarting nginx
If lsof shows deleted log files held by NGINX, do not restart the service to free them. Send USR1 to reopen file descriptors at the currently configured paths:
sudo kill -USR1 $(cat /var/run/nginx.pid)
Wait several seconds for workers to close old file descriptors, then verify with df -h.
If you recently changed log paths via nginx -s reload (HUP), sending USR1 afterward can recreate zero-sized files at the old paths due to an upstream bug (Issue #1069) affecting nginx 1.26.x. If you are on 1.26.x, verify behavior after sending USR1.
Recover log data before reopening
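To preserve the contents of a deleted log file before NGINX closes its FD, copy them out through /proc. If /var/log is the full partition, point the destination at a filesystem that still has free space:
sudo cp /proc/<PID>/fd/<FD> /var/log/nginx/access.log.recovered
Then send USR1. This recovers data already written to the deleted inode, which no longer has a visible path. Entries still held in an NGINX memory buffer (buffer=) are not part of the file and are not recovered this way.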
Emergency truncation
If the active log file is bloated and you cannot wait for rotation, truncate it in place:
# Destructive: zeroes the file immediately and destroys contents
sudo truncate -s 0 /var/log/nginx/access.log
This frees the space at once, but NGINX keeps writing to the same inode, so the file starts growing again right away. Truncation is a tourniquet, not a cure. Only use this when the log contents are expendable.
Redirect logs to survive the incident
If the partition is dead and you need visibility now, temporarily redirect the error log to tmpfs and reload:
error_log /dev/shm/nginx-error.log crit;
Apply with sudo nginx -s reload. This requires HUP because the path changed. Warning: tmpfs is volatile. Copy logs off the node before any restart.
Stop the feedback loop
If an error storm is actively filling the disk faster than you can free space, cut the volume at the source:
- Set access_log off; in the affected server block and reload to eliminate access log writes immediately.
- Raise the error_log level to crit temporarily to reduce error volume.
These changes apply on reload. Old workers may continue writing briefly until they drain.
Fix inode exhaustion
If df -i shows 100 percent but block space remains:
- Clear NGINX cache directories, session file paths, or other small-file accumulations on the affected partition.
- If the filesystem was created with too few inodes, migrate the data to a new filesystem. This cannot be fixed online.
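To find where the small files have piled up, it helps to count directory entries rather than bytes. The following is a rough sketch using only find, sed, sort, and uniq. The name inode_hogs is hypothetical, and paths that contain newlines will be miscounted.

```shell
# inode_hogs DIR [N]: print the N directories (default 10) under DIR that
# directly contain the most entries, as "COUNT PATH", largest first.
# -xdev keeps the scan on DIR's own filesystem.
inode_hogs() {
    find "$1" -xdev ! -path "$1" 2>/dev/null \
        | sed 's|/[^/]*$||' \
        | sort | uniq -c | sort -rn \
        | head -n "${2:-10}"
}

# Example:
#   inode_hogs /var/log 5
```

Each entry is attributed to its parent directory, so a cache or session directory holding millions of tiny files rises to the top even when du reports it as small.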
Prevention
- Use signal-based rotation. Never use copytruncate for high-traffic NGINX deployments. Under high request rates, a request logged between the copy and truncate steps is lost. Always send kill -USR1 in the logrotate postrotate script.
- Monitor both disk space and inode utilization on the log partition. Alert at 85 percent blocks and 80 percent inodes.
- Cap competing log consumers. In /etc/systemd/journald.conf, set SystemMaxUse=1G to prevent journald from consuming a shared partition.
- Set worker_shutdown_timeout so old workers do not linger indefinitely after reloads, holding old log file descriptors open.
- Isolate logs on a dedicated partition or volume. Runaway access logs should not be able to take down the root filesystem or application data.
- Use access_log ... buffer=32k flush=5s to reduce syscall pressure, but treat this as performance tuning, not protection against disk-full events. It only delays the symptom.
- Review logrotate configuration to ensure the postrotate script sends USR1 and that the configuration does not use copytruncate. After any log path change in NGINX configuration, verify that subsequent rotations reopen the correct files.
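Where no monitoring agent is in place, the 85 and 80 percent thresholds above can be checked from cron with a short script. This is a minimal sketch that assumes a df accepting -P and -i together (GNU coreutils does). The function names usage_pct and check_log_partition are hypothetical.

```shell
# usage_pct: read `df -P` or `df -iP` output on stdin and print column 5
# (Capacity or IUse%) of the first data row as a bare integer.
# Prints nothing when the column is not numeric (some filesystems show "-").
usage_pct() {
    awk 'NR == 2 { sub(/%$/, "", $5); if ($5 ~ /^[0-9]+$/) print $5 }'
}

# check_log_partition PATH [BLOCK_LIMIT] [INODE_LIMIT]
# Prints one WARN line per limit reached; returns 1 if any limit was reached.
check_log_partition() {
    path=$1 block_limit=${2:-85} inode_limit=${3:-80} status=0
    blocks=$(df -P "$path" 2>/dev/null | usage_pct)
    inodes=$(df -iP "$path" 2>/dev/null | usage_pct)
    if [ -n "$blocks" ] && [ "$blocks" -ge "$block_limit" ]; then
        echo "WARN blocks ${blocks}% >= ${block_limit}% on $path"
        status=1
    fi
    if [ -n "$inodes" ] && [ "$inodes" -ge "$inode_limit" ]; then
        echo "WARN inodes ${inodes}% >= ${inode_limit}% on $path"
        status=1
    fi
    return "$status"
}

# Example:
#   check_log_partition /var/log/nginx || logger -t nginx-disk "log partition near full"
```

Checking blocks and inodes in one pass matters because either one reaching 100 percent stops log writes, and df -h alone hides the inode case.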
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk space % on log partition | Direct cause of ENOSPC and silent log drops | >85% sustained |
| Inode % on log partition | Invisible cause of file creation failures | >80% |
| Error log line rate | Incident feedback loop that accelerates fill | >10x baseline spike |
| Active connections (stub_status) | Confirms NGINX is serving despite lost logs | Baseline deviation while logs are silent |
| Requests per second (stub_status) | Traffic may be dropping due to event loop stall | Sustained drop >50% from baseline |
| File descriptors per worker | Deleted-but-open files consume FDs and disk | Trending up without traffic growth |
| $request_time spikes matching log flush timing | Synchronous writes stalling the event loop | Periodic latency spikes correlating with flush intervals |
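The error log line rate signal can be sampled by hand when no collector is available. The sketch below counts lines appended during a short window and compares the result with a baseline that you supply. Both function names are hypothetical, and the default factor of 10 mirrors the warning sign in the table.

```shell
# line_rate FILE SECONDS: lines appended to FILE per second, measured over
# a SECONDS-long window (integer division). A negative result means the
# file was truncated or replaced during the window.
line_rate() {
    before=$(wc -l < "$1")
    sleep "$2"
    after=$(wc -l < "$1")
    echo $(( (after - before) / $2 ))
}

# is_spike RATE BASELINE [FACTOR]: succeed when RATE exceeds
# FACTOR x BASELINE (FACTOR defaults to 10).
is_spike() {
    [ "$1" -gt $(( $2 * ${3:-10} )) ]
}

# Example, with an assumed baseline of 2 lines per second:
#   rate=$(line_rate /var/log/nginx/error.log 10)
#   is_spike "$rate" 2 && echo "error log spike: ${rate} lines/s"
```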
How Netdata helps
Netdata exposes this failure mode through:
- Disk space and inode utilization alerts before ENOSPC hits.
- Correlation of a flatline in nginx access log entries with sustained requests counter growth from stub_status, exposing silent log loss.
- Error log line rate spikes alongside upstream response time increases to catch error storms that threaten to fill the disk.
- Per-worker file descriptor usage monitoring to detect accumulation of deleted-but-open log files after failed rotations.
- Log partition I/O wait correlated with nginx request latency, revealing when synchronous logging stalls the event loop.