MongoDB disk full: emergency recovery when mongod can’t write the journal

When the filesystem backing the data or journal directory crosses a critical threshold, WiredTiger cannot allocate new journal extents. If mongod crashes or restarts, recovery replays journal files since the last checkpoint and requires free headroom to create or extend files during that replay. On a full disk, mongod hangs in recovery without binding to port 27017.

If the node is a standalone, there is no replica to fail over to. If it is a secondary, cluster redundancy is reduced while the member is down. Recovery is complicated by a counterintuitive storage engine behavior: WiredTiger reclaims space internally after deletes, but does not automatically shrink data files or return bytes to the operating system. A volume that reads 99% full after a massive delete remains 99% full at the filesystem level.

What this means

MongoDB relies on continuous journal writes for durability. At 100% filesystem utilization, journal appends fail. Depending on the exact failure path, mongod may crash immediately, enter a read-only state, or refuse to finish startup recovery.

WiredTiger recovery is non-optional after an unclean shutdown. During recovery, mongod replays all journal files written since the last checkpoint before accepting connections. If the disk is too full to create new files or extend existing ones during this replay, the process hangs in recovery without opening the client port. Treat filesystem utilization above 95% with less than 2 GB free as a pageable incident.
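The paging rule above combines a percentage with an absolute floor, so that a very large volume at 96% with hundreds of gigabytes free does not page. A minimal sketch of that rule follows; `classifyDiskPressure` and its arguments are hypothetical names, and the byte counts would come from `df` or your monitoring agent.

```javascript
// Sketch only: encodes the ">95% used AND <2 GB free" rule from the text.
// classifyDiskPressure is a hypothetical helper, not a MongoDB or Netdata API.
const GIB = 1024 ** 3;

function classifyDiskPressure(totalBytes, freeBytes) {
  if (!(totalBytes > 0)) {
    throw new RangeError("totalBytes must be positive");
  }
  const usedPct = ((totalBytes - freeBytes) / totalBytes) * 100;
  // Both conditions must hold before the incident is treated as pageable.
  const page = usedPct > 95 && freeBytes < 2 * GIB;
  return { usedPct, page };
}
```

Requiring both conditions is a design choice taken from the threshold stated above: the percentage catches small volumes early, and the absolute floor keeps large volumes from paging while they still have ample room for journal replay.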

Space accounting is non-obvious. The oplog is a capped collection with a fixed maximum size. Index builds create temporary files excluded from dbStats() output. Large deletions drop logical dataSize but leave storageSize and underlying filesystem blocks allocated. For a collection you keep, only a compact operation or an initial sync returns that space to the OS; dropping a collection or database removes its files outright.
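One way to see the dead space directly is the WiredTiger block-manager counter `file bytes available for reuse`, which collection stats expose under the `wiredTiger` key. The sketch below summarizes it for one collection. `summarizeBloat` is a hypothetical helper, and it assumes the stats values arrive as plain JavaScript numbers.

```javascript
// Sketch only: report how much of a collection's data file is reusable dead space.
// Input is an object shaped like db.collection.stats() output.
function summarizeBloat(name, stats) {
  const blockManager = (stats.wiredTiger || {})["block-manager"] || {};
  const reusable = Number(blockManager["file bytes available for reuse"] || 0);
  const storage = Number(stats.storageSize || 0);
  const MB = 1024 * 1024;
  return {
    name,
    storageMB: storage / MB,
    reusableMB: reusable / MB,
    // Share of the on-disk file that holds no live data.
    reusablePct: storage > 0 ? (reusable / storage) * 100 : 0,
  };
}

// Possible use in mongosh (not executed here):
// db.getCollectionInfos({ type: "collection" })
//   .map(i => summarizeBloat(i.name, db.getCollection(i.name).stats()))
//   .sort((a, b) => b.reusableMB - a.reusableMB)
```

Collections with the highest `reusableMB` are the ones where compact or a resync would return the most space.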

flowchart TD
    A[Filesystem >95% full] --> B{mongod running?}
    B -->|No, recovery hangs| C[Free space for journal replay]
    B -->|Yes, but unstable| D[Find largest space consumer]
    D --> E{Replica set member?}
    E -->|Yes| F[Resync member after cleanup]
    E -->|No| G[Compact collections or expand volume]
    C --> H[Restart mongod]
    G --> H
    F --> H

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Organic data growth | storageSize climbs steadily and df shows gradual fill | db.adminCommand({listDatabases:1}) and daily df trend |
| Uncompacted collections after mass deletes | dataSize drops but df is unchanged; storageSize far exceeds dataSize | db.collection.stats() comparing storageSize to dataSize |
| Oversized oplog | local database consumes far more disk than needed for the replication window | rs.printReplicationInfo() to compare configured size to actual window |
| Runaway index build temp files | Sudden spike during index creation; build may fail with space errors | db.currentOp() for active index builds and du -sh on the dbPath |
| Unclean shutdown with rollback files | Failover produced rollback data in the rollback/ directory under dbPath | du -sh <dbPath>/rollback |

Quick checks

These commands are read-only and safe to run during an incident.

# Filesystem utilization for the data directory
df -h /data/db
# Journal directory size if located on the same volume
du -sh /data/db/journal
# Rollback directory footprint after a failover
du -sh /data/db/rollback
# Logical database sizes from MongoDB's perspective
mongosh --quiet --eval 'db.adminCommand({listDatabases:1}).databases.forEach(function(d){ print(d.name+": "+(d.sizeOnDisk/1024/1024).toFixed(1)+" MB") })'
# Per-collection storage bloat (dead space from deletes)
mongosh --quiet --eval 'db.getCollectionInfos({type:"collection"}).forEach(function(i){ var s=db.getCollection(i.name).stats(); print(i.name+": storage="+(s.storageSize/1024/1024).toFixed(1)+" MB data="+(s.dataSize/1024/1024).toFixed(1)+" MB") })'
# Oplog size and time window
mongosh --quiet --eval 'rs.printReplicationInfo()'
# Active long-running operations that may generate temp files
mongosh --quiet --eval 'db.currentOp({active:true, secs_running:{$gt:10}}).inprog.forEach(function(o){print(o.opid+" "+o.ns+" "+o.secs_running+"s")})'
# Recent log errors mentioning disk or journal pressure
grep -iE "journal|disk|no space" /var/log/mongodb/mongod.log | tail -20

How to diagnose it

  1. Confirm the filesystem boundary. Run df -h on the data directory and, if applicable, the separate journal directory. If utilization is above 95% and free space is below 2 GB, treat this as an imminent crash or unrecoverable startup.
  2. Quantify the MongoDB footprint versus total disk. Run db.adminCommand({listDatabases:1}) for logical sizes, then du -sh /data/db for physical size. If physical size far exceeds the sum of database sizes, investigate journal files, rollback data, and temporary files.
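  3. Identify uncompacted collections. For each large collection, compare db.collection.stats().storageSize to dataSize. A large gap indicates dead space from deleted documents. Because dataSize is the uncompressed size and storageSize is the compressed on-disk size, a storageSize below dataSize does not rule out bloat; the "file bytes available for reuse" counter in the wiredTiger section of the stats output is a more direct measure.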
  4. Check db.currentOp() for active index builds or large transactions. Index builds write temporary files outside dbStats accounting. Large transactions generate oversized oplog entries.
  5. Determine replica set status. If the node is a secondary, a full initial sync after cleanup is often faster than compacting in place. If it is a standalone, you must recover locally.
  6. If mongod will not start, examine the log for recovery progress. As a rough rule, free space should exceed the total size of journal files to replay. If not, free space or move dbPath before mongod can become available.
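Steps 2 and 6 above reduce to two comparisons. The sketch below is one way to express them. All names are hypothetical, and the inputs are assumed to come from `df` (free bytes), `du` on the journal directory and dbPath, and the `totalSize` field returned by `listDatabases`.

```javascript
// Sketch only: apply the rough rules from steps 2 and 6 to measured byte counts.
// diagnoseFootprint and its field names are hypothetical.
function diagnoseFootprint({ freeBytes, journalBytes, dbPathBytes, sizeOnDiskBytes }) {
  // Step 6 rule of thumb: free space should exceed the journal files to replay.
  const canReplay = freeBytes > journalBytes;
  const replayShortfall = Math.max(0, journalBytes - freeBytes);
  // Step 2: bytes under dbPath that listDatabases does not account for
  // (journal, rollback data, temporary files).
  const unaccountedBytes = Math.max(0, dbPathBytes - sizeOnDiskBytes);
  return { canReplay, replayShortfall, unaccountedBytes };
}
```

A non-zero `replayShortfall` is a lower bound on how much space to free before restarting mongod. It is not a guarantee that recovery will succeed.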

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Filesystem utilization (df) | Journal writes fail catastrophically at 100% full | PAGE when >95% and <2 GB free |
| storageSize vs dataSize | Reveals internal WiredTiger bloat invisible to df trends | storageSize exceeds dataSize by a wide margin after large deletes |
| Oplog size (rs.printReplicationInfo) | Fixed capped size is a baseline disk consumer | Window far larger than operational requirement (for example, >72 hours) |
| Active index builds (db.currentOp) | Builds create temp files outside normal accounting | Build running during a disk spike |
| Rollback directory size (du -sh) | Rollback data is not automatically cleaned | Unexpected growth after elections or failovers |
| Journal sync latency | Leading indicator of storage subsystem pressure | Average sync latency >30 ms sustained |

Fixes

Reclaim space from the oplog

If the oplog is larger than needed, reduce the capped collection size. In MongoDB 4.0 and later, run:

db.adminCommand({ replSetResizeOplog: 1, size: <new_size_in_MB> })

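Resizing lowers the cap but does not by itself return the freed space to the filesystem; run compact against the oplog.rs collection in the local database afterward to reclaim it. The minimum size the command accepts is 990 MB. Tradeoff: a smaller window increases the risk that a secondary falls off the oplog during maintenance. Only resize if the current window exceeds your operational safety margin by a large factor.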
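To pick a value for the size argument, one rough approach is to scale the current oplog size by the ratio of the window you want to the window you have. The sketch below does that under two assumptions: the write rate is roughly constant, and the oplog has already filled, so that the window reported by rs.printReplicationInfo() reflects the full size. `suggestOplogSizeMB` is a hypothetical helper.

```javascript
// Sketch only: linear estimate of an oplog size for a target window.
// Assumes a steady write rate and an oplog that has already wrapped.
const MIN_OPLOG_MB = 990; // smallest size replSetResizeOplog accepts

function suggestOplogSizeMB(currentSizeMB, currentWindowHours, targetWindowHours) {
  if (!(currentWindowHours > 0) || !(targetWindowHours > 0)) {
    throw new RangeError("window hours must be positive");
  }
  // Multiply before dividing to keep the arithmetic exact for typical inputs.
  const scaled = Math.ceil((currentSizeMB * targetWindowHours) / currentWindowHours);
  return Math.max(MIN_OPLOG_MB, scaled);
}

// Example: a 51200 MB oplog that covers 240 hours, targeting 48 hours.
// suggestOplogSizeMB(51200, 240, 48) returns 10240
```

Write bursts shorten the real window, so treat the result as a starting point and leave a margin above it.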

Compact collections

For standalone instances or when you must preserve the current data files, run:

db.runCommand({ compact: "<collection>" })

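Warning: the locking behavior of compact depends on the server version. Before MongoDB 4.4, compact blocks all operations on the database that contains the collection for the duration. From 4.4 onward it blocks only metadata operations such as dropping the collection or creating and dropping indexes, but it still adds I/O load. compact is not replicated, so run it on each replica set member that needs it. It also requires temporary working space. If the disk is already critically full, compact may fail.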

Resync a replica set member

If the affected node is a secondary and local recovery is too slow, the fastest path to clean data files is an initial sync. An initial sync runs only when mongod starts with an empty dbPath, so stop mongod, empty the data directory (or point dbPath at a new, larger volume), and restart. The member then pulls a fresh copy from its sync source, typically the primary. Merely freeing space and restarting lets the member catch up from the oplog but leaves the existing data files, and their dead space, in place. Tradeoff: the sync source and network absorb sync load, and cluster redundancy is reduced until the sync completes.

Expand the underlying volume

On cloud block storage or LVM, expanding the filesystem is the safest fix if infrastructure allows. It requires no MongoDB-level operations other than a restart if the device was unmounted. Ensure the expanded space is visible to the filesystem before starting mongod.

What to avoid

Do not run mongod --repair on a disk that is already full. Repair rebuilds data and index files and requires significant temporary space. It also runs offline, so a standalone node is unavailable for the duration. Do not delete WiredTiger data files or journal files to free space; doing so causes unrecoverable data loss. Leave the mongod.lock file in place as well.

Prevention

  • Trend filesystem utilization and storageSize weekly. Project days until 90% using the current growth rate.
  • After any large delete operation, schedule a compact or plan a rolling resync to reclaim space.
  • Size the oplog to maintain a 24-48 hour window rather than over-provisioning disk for a multi-day window you do not need.
  • Monitor db.currentOp() during index builds to catch temp file growth before it fills the disk.
  • Maintain at least 20% free space at all times, plus enough headroom to rebuild the largest index and compact the largest collection without crossing 90%.
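The projection in the first bullet can be sketched as a linear extrapolation between the oldest and newest utilization samples. `daysUntilThreshold` and the sample shape are hypothetical, and a straight line is only an assumption: bursty growth calls for a shorter sampling interval or a more careful fit.

```javascript
// Sketch only: linear projection of days until a utilization threshold.
// samples: [{ day: number, usedBytes: number }] ordered by day, at least two entries.
function daysUntilThreshold(samples, totalBytes, thresholdPct = 90) {
  if (samples.length < 2) {
    throw new RangeError("need at least two samples");
  }
  const first = samples[0];
  const last = samples[samples.length - 1];
  if (last.day <= first.day) {
    throw new RangeError("samples must span a positive number of days");
  }
  const limitBytes = (totalBytes * thresholdPct) / 100;
  if (last.usedBytes >= limitBytes) return 0; // already at or past the threshold
  const bytesPerDay = (last.usedBytes - first.usedBytes) / (last.day - first.day);
  if (bytesPerDay <= 0) return Infinity; // flat or shrinking usage never reaches it
  return (limitBytes - last.usedBytes) / bytesPerDay;
}

// Example: 1000-byte volume, 700 bytes used on day 0 and 770 on day 7.
// daysUntilThreshold([{ day: 0, usedBytes: 700 }, { day: 7, usedBytes: 770 }], 1000) returns 13
```

Feeding the function weekly `df` samples gives the "days until 90%" figure, which can then drive a ticket-level alert well before the paging threshold.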

How Netdata helps

Netdata correlates host disk utilization with MongoDB storageSize and dataSize to distinguish organic growth from WiredTiger file bloat. It raises TICKET alerts at 80% disk utilization and PAGE alerts at 95% with less than 2 GB free, matching the failure threshold above. Per-collection metrics pinpoint which namespaces drive disk consumption. Journal sync latency spikes surface before application write timeouts. Oplog window hours and replication lag are displayed together, so you can right-size the oplog without guessing.