MongoDB disk full: emergency recovery when mongod can't write the journal

When the filesystem backing the data or journal directory crosses a critical threshold, WiredTiger cannot allocate new journal extents. If mongod crashes or restarts, recovery replays journal files since the last checkpoint and requires free headroom to create or extend files during that replay. On a full disk, mongod hangs in recovery without binding to port 27017.

If the node is a standalone, there is no replica to fail over to. If it is a secondary, cluster redundancy is reduced while the member is down. Recovery is complicated by a counterintuitive storage engine behavior: WiredTiger reclaims space internally after deletes, but does not automatically shrink data files or return bytes to the operating system. A volume that reads 99% full after a massive delete remains 99% full at the filesystem level.

What this means

MongoDB relies on continuous journal writes for durability. At 100% filesystem utilization, journal appends fail. Depending on the exact failure path, mongod may crash immediately, enter a read-only state, or refuse to finish startup recovery.

WiredTiger recovery is non-optional after an unclean shutdown. During recovery, mongod replays all journal files written since the last checkpoint before accepting connections. If the disk is too full to create new files or extend existing ones during this replay, the process hangs in recovery without opening the client port. Treat filesystem utilization above 95% with less than 2 GB free as a pageable incident.
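The paging rule above can be sketched as a small shell check. This is a minimal illustration, not a production alert: the function names `is_pageable` and `check_mount` are hypothetical, and the wrapper assumes POSIX-format `df -Pk` output where column 4 is available KB and column 5 is capacity.

```shell
# is_pageable USED_PCT FREE_KB
# Exit status 0 when utilization is above 95% AND less than 2 GB is free.
is_pageable() {
  [ "$1" -gt 95 ] && [ "$2" -lt $((2 * 1024 * 1024)) ]
}

# check_mount PATH (hypothetical helper)
# Reads capacity and available KB from POSIX-format df output for PATH.
check_mount() {
  df -Pk "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5, $4 }' | {
    read -r used_pct free_kb
    if is_pageable "$used_pct" "$free_kb"; then echo "PAGE"; else echo "ok"; fi
  }
}

# Literal values so the rule can be checked without touching a real volume.
is_pageable 97 1048576 && echo "97% used, 1 GB free: page"
is_pageable 97 8388608 || echo "97% used, 8 GB free: no page yet"
```

On a real host, `check_mount /data/db` would apply the same rule to the data volume. Both conditions must hold, so a very large volume at 96% with hundreds of gigabytes free does not page.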

Space accounting is non-obvious. The oplog is a capped collection with a fixed maximum size. Index builds create temporary files excluded from dbStats() output. Large deletions drop logical dataSize but leave storageSize and underlying filesystem blocks allocated. Only a compact operation or an initial sync rewrites data files and returns space to the OS.
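One rough way to quantify the dead space described above is the share of a collection's file that WiredTiger reports as reusable. The sketch below assumes you have copied two numbers from db.collection.stats(): storageSize and the WiredTiger block-manager counter "file bytes available for reuse". The helper name `reusable_pct` is hypothetical. Because WiredTiger compresses blocks on disk, storageSize can sit below dataSize even for a bloated collection, which is why this counter may be a more direct signal than the storageSize-to-dataSize gap.

```shell
# reusable_pct STORAGE_BYTES REUSABLE_BYTES (hypothetical helper)
# Prints the share of the collection file that is free internally, as a whole percent.
reusable_pct() {
  awk -v s="$1" -v r="$2" 'BEGIN {
    if (s <= 0) { print 0; exit }
    printf "%d\n", (r / s) * 100
  }'
}

# Example: a 40 GiB file with 30 GiB marked reusable after a mass delete.
reusable_pct 42949672960 32212254720   # prints 75
```

A high percentage suggests that compact or a resync would return a meaningful amount of space. A low one suggests the volume is full of live data, and expanding it is the more realistic fix.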

flowchart TD
    A[Filesystem >95% full] --> B{mongod running?}
    B -->|No, recovery hangs| C[Free space for journal replay]
    B -->|Yes, but unstable| D[Find largest space consumer]
    D --> E{Replica set member?}
    E -->|Yes| F[Resync member after cleanup]
    E -->|No| G[Compact collections or expand volume]
    C --> H[Restart mongod]
    G --> H
    F --> H

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Organic data growth | `storageSize` climbs steadily and `df` shows gradual fill | `db.adminCommand({listDatabases:1})` and daily `df` trend |
| Uncompacted collections after mass deletes | `dataSize` drops but `df` is unchanged; `storageSize` far exceeds `dataSize` | `db.collection.stats()` comparing `storageSize` to `dataSize` |
| Oversized oplog | `local` database consumes far more disk than needed for the replication window | `rs.printReplicationInfo()` to compare configured size to actual window |
| Runaway index build temp files | Sudden spike during index creation; build may fail with space errors | `db.currentOp()` for active index builds and `du -sh` on the dbPath |
| Unclean shutdown with rollback files | Failover produced rollback data in the `rollback/` directory under dbPath | `du -sh <dbPath>/rollback` |

Quick checks

These commands are read-only and safe to run during an incident.

# Filesystem utilization for the data directory
df -h /data/db

# Journal directory size if located on the same volume
du -sh /data/db/journal

# Rollback directory footprint after a failover
du -sh /data/db/rollback

# Per-database size on disk as MongoDB reports it
mongosh --quiet --eval 'db.adminCommand({listDatabases:1}).databases.forEach(function(d){ print(d.name+": "+(d.sizeOnDisk/1024/1024).toFixed(1)+" MB") })'

# Per-collection storage bloat (dead space from deletes)
# Runs against the database mongosh connects to (test by default); add a database name after mongosh to target another
mongosh --quiet --eval 'db.getCollectionNames().forEach(function(c){ var s=db[c].stats(); print(c+": storage="+(s.storageSize/1024/1024).toFixed(1)+" MB data="+(s.dataSize/1024/1024).toFixed(1)+" MB") })'

# Oplog size and time window
mongosh --quiet --eval 'rs.printReplicationInfo()'

# Active long-running operations that may generate temp files
mongosh --quiet --eval 'db.currentOp({active:true, secs_running:{$gt:10}}).inprog.forEach(function(o){print(o.opid+" "+o.ns+" "+o.secs_running+"s")})'

# Recent log errors mentioning disk or journal pressure
grep -iE "journal|disk|no space" /var/log/mongodb/mongod.log | tail -20

How to diagnose it

1. Confirm the filesystem boundary. Run df -h on the data directory and, if applicable, the separate journal directory. If utilization is above 95% and free space is below 2 GB, expect an imminent crash or a startup that cannot complete recovery.
2. Quantify the MongoDB footprint versus total disk. Run db.adminCommand({listDatabases:1}) for per-database sizes on disk, then du -sh /data/db for the physical size of the whole directory. If the physical size far exceeds the sum of database sizes, investigate journal files, rollback data, and temporary files.
3. Identify uncompacted collections. For each large collection, compare db.collection.stats().storageSize to dataSize. A large gap indicates dead space from deleted documents.
4. Check db.currentOp() for active index builds or large transactions. Index builds write temporary files that dbStats does not count. Large transactions generate oversized oplog entries.
5. Determine replica set status. If the node is a secondary, a full initial sync after cleanup is often faster than compacting in place. If it is a standalone, you must recover locally.
6. If mongod will not start, examine the log for recovery progress. As a rough rule, free space should exceed the total size of the journal files to replay. If it does not, free space or move dbPath to a larger volume before restarting mongod.
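The rough rule in the last step, free space greater than the journal to replay, can be sketched as follows. The helper names are hypothetical, and the wrapper assumes the journal lives in a journal subdirectory on the same filesystem as dbPath. If the journal is on a separate volume, measure that volume instead.

```shell
# journal_fits JOURNAL_KB FREE_KB
# Exit status 0 when free space exceeds the journal size.
journal_fits() {
  [ "$2" -gt "$1" ]
}

# replay_headroom DBPATH (hypothetical helper)
# Measures DBPATH/journal with du and the free space on DBPATH's filesystem with df.
replay_headroom() {
  journal_kb=$(du -sk "$1/journal" | awk '{ print $1 }')
  free_kb=$(df -Pk "$1" | awk 'NR==2 { print $4 }')
  if journal_fits "$journal_kb" "$free_kb"; then
    echo "ok: journal=${journal_kb}KB free=${free_kb}KB"
  else
    echo "free more space: journal=${journal_kb}KB free=${free_kb}KB"
  fi
}

# Literal values: 300 MB of journal against 1 GB free.
journal_fits 307200 1048576 && echo "replay has headroom"
```

Running `replay_headroom /data/db` before restarting mongod gives a quick go or no-go signal. It is a heuristic only and does not measure what recovery will actually allocate.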

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Filesystem utilization (`df`) | Journal writes fail catastrophically at 100% full | PAGE when >95% and <2 GB free |
| `storageSize` vs `dataSize` | Reveals internal WiredTiger bloat invisible to `df` trends | `storageSize` exceeds `dataSize` by a wide margin after large deletes |
| Oplog size (`rs.printReplicationInfo`) | Fixed capped size is a baseline disk consumer | Window far larger than operational requirement (for example, >72 hours) |
| Active index builds (`db.currentOp`) | Builds create temp files outside normal accounting | Build running during a disk spike |
| Rollback directory size (`du -sh`) | Rollback data is not automatically cleaned | Unexpected growth after elections or failovers |
| Journal sync latency | Leading indicator of storage subsystem pressure | Average sync latency >30 ms sustained |

Fixes

Reclaim space from the oplog

If the oplog is larger than needed, reduce the capped collection size. In MongoDB 4.0 and later, run:

db.adminCommand({ replSetResizeOplog: 1, size: <new_size_in_MB> })

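The size is given in megabytes, with a minimum of 990 MB. Resizing shrinks the oplog but does not by itself return the freed space to the filesystem; run compact on the oplog.rs collection in the local database afterward to reclaim it, preferably on a secondary rather than the primary. Tradeoff: a smaller window increases the risk that a secondary falls off the oplog during maintenance. Only resize if the current window exceeds your operational safety margin by a large factor.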
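To pick a value for the size parameter, one rough approach is to scale the configured size by the ratio of the desired window to the current window. The sketch assumes a steady write rate and that you have read the configured size (MB) and current window (hours) from rs.printReplicationInfo(). The helper name `oplog_target_mb` is hypothetical.

```shell
# oplog_target_mb CURRENT_MB CURRENT_WINDOW_H TARGET_WINDOW_H (hypothetical helper)
# Scales the configured size by the ratio of windows, assuming a steady write rate,
# and never goes below the 990 MB minimum that replSetResizeOplog accepts.
oplog_target_mb() {
  target=$(( $1 * $3 / $2 ))
  if [ "$target" -lt 990 ]; then
    target=990
  fi
  echo "$target"
}

# Example: 51200 MB currently covers 240 hours; aim for a 48-hour window.
oplog_target_mb 51200 240 48   # prints 10240
```

Because write rates are rarely steady, measure the window during peak load and round the result up.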

Compact collections

For standalone instances or when you must preserve the current data files, run:

db.runCommand({ compact: "<collection>" })

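Warning: how much compact blocks depends on the server version. Before MongoDB 4.4 it blocks all operations on the database that contains the collection; from 4.4 onward it blocks only metadata operations such as dropping the collection or creating indexes. It can also require additional working space, so on a disk that is already critically full compact may fail. On a replica set, compact is not replicated, so run it on each member separately.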

Resync a replica set member

If the affected node is a secondary and local recovery is too slow, the fastest path to clean data files is an initial sync. Stop mongod, then either empty the data directory or point dbPath at a larger, empty volume, and restart. A member that starts with an empty data directory performs an initial sync and pulls a fresh copy from its sync source, typically the primary. Tradeoff: the sync source and the network absorb the sync load, and cluster redundancy is reduced until the sync completes.

Expand the underlying volume

On cloud block storage or LVM, expanding the filesystem is the safest fix if infrastructure allows. It requires no MongoDB-level operations other than a restart if the device was unmounted. Ensure the expanded space is visible to the filesystem before starting mongod.

What to avoid

Do not run mongod --repair on a disk that is already full. Repair rebuilds data and index files and requires significant temporary space. It also runs only while mongod is offline, which means downtime on a standalone node. Do not delete WiredTiger data files, journal files, or the mongod.lock file by hand. Doing so can cause unrecoverable data loss.

Prevention

Trend filesystem utilization and storageSize weekly. Project days until 90% using the current growth rate.
After any large delete operation, schedule a compact or plan a rolling resync to reclaim space.
Size the oplog to maintain a 24-48 hour window rather than over-provisioning disk for a much longer window you do not need.
Monitor db.currentOp() during index builds to catch temp file growth before it fills the disk.
Maintain at least 20% free space at all times, plus enough headroom to rebuild the largest index and compact the largest collection without crossing 90%.
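The projection in the first item can be sketched with integer arithmetic. This assumes linear growth and that you already have a daily growth figure from your own trend data. The helper name `days_until_pct` and the example numbers are illustrative only.

```shell
# days_until_pct TOTAL_KB USED_KB GROWTH_KB_PER_DAY TARGET_PCT (hypothetical helper)
# Prints whole days until usage reaches TARGET_PCT of the volume at linear growth.
days_until_pct() {
  limit_kb=$(( $1 * $4 / 100 ))
  if [ "$2" -ge "$limit_kb" ]; then
    echo 0                      # already at or past the target
  elif [ "$3" -le 0 ]; then
    echo "no growth"            # flat or shrinking usage never reaches the target
  else
    echo $(( (limit_kb - $2) / $3 ))
  fi
}

# Example: 500 GiB volume, 350 GiB used, growing 2 GiB per day, target 90%.
days_until_pct 524288000 367001600 2097152 90   # prints 50
```

Total and used KB can be taken from `df -Pk` on the data volume. Growth is rarely linear in practice, so treat the result as a lower bound on urgency, not a deadline.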

How Netdata helps

Netdata correlates host disk utilization with MongoDB storageSize and dataSize to distinguish organic growth from WiredTiger file bloat. It raises TICKET alerts at 80% disk utilization and PAGE alerts at 95% with less than 2 GB free, matching the failure threshold above. Per-collection metrics pinpoint which namespaces drive disk consumption. Journal sync latency spikes surface before application write timeouts. Oplog window hours and replication lag are displayed together, so you can right-size the oplog without guessing.
