MongoDB disk full: emergency recovery when mongod can’t write the journal
When the filesystem backing the data or journal directory crosses a critical threshold, WiredTiger cannot allocate new journal extents. If mongod crashes or restarts, recovery replays journal files since the last checkpoint and requires free headroom to create or extend files during that replay. On a full disk, mongod hangs in recovery without binding to port 27017.
If the node is a standalone, there is no replica to fail over to. If it is a secondary, cluster redundancy is reduced while the member is down. Recovery is complicated by a counterintuitive storage engine behavior: WiredTiger reclaims space internally after deletes, but does not automatically shrink data files or return bytes to the operating system. A volume that reads 99% full after a massive delete remains 99% full at the filesystem level.
What this means
MongoDB relies on continuous journal writes for durability. At 100% filesystem utilization, journal appends fail. Depending on the exact failure path, mongod may crash immediately, enter a read-only state, or refuse to finish startup recovery.
WiredTiger recovery is non-optional after an unclean shutdown. During recovery, mongod replays all journal files written since the last checkpoint before accepting connections. If the disk is too full to create new files or extend existing ones during this replay, the process hangs in recovery without opening the client port. Treat filesystem utilization above 95% with less than 2 GB free as a pageable incident.
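The paging rule above combines a percentage threshold with an absolute free-space floor. A minimal sketch of that check follows, written as plain JavaScript that should run in mongosh or Node. The function name `isPageable` and its parameters are hypothetical, and the byte counts are assumed to come from `df` or a monitoring agent.

```javascript
// Sketch of the paging rule: page when the filesystem is more than 95% used
// AND less than 2 GB remains free. Names here are illustrative only.
// Assumption: "2 GB" is interpreted as 2 GiB (2 * 1024^3 bytes).
const GIB = 1024 ** 3;

function isPageable(totalBytes, freeBytes, opts = {}) {
  const maxUsedFraction = opts.maxUsedFraction ?? 0.95;
  const minFreeBytes = opts.minFreeBytes ?? 2 * GIB;

  // Reject inputs that cannot describe a real filesystem.
  if (!(totalBytes > 0) || freeBytes < 0 || freeBytes > totalBytes) {
    throw new RangeError("expected totalBytes > 0 and 0 <= freeBytes <= totalBytes");
  }

  const usedFraction = (totalBytes - freeBytes) / totalBytes;
  // Both conditions must hold, as in the rule stated in the text.
  return usedFraction > maxUsedFraction && freeBytes < minFreeBytes;
}
```

For example, `isPageable(100 * GIB, 1 * GIB)` is true (99% used, 1 GiB free), while a 1 TiB volume at 96% does not page under this rule because roughly 41 GiB is still free. On large volumes the free-space floor is the condition that decides.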
Space accounting is non-obvious. The oplog is a capped collection with a fixed maximum size. Index builds create temporary files that dbStats output does not include. Large deletions reduce logical dataSize but leave storageSize and the underlying filesystem blocks allocated. Only a compact operation or an initial sync rewrites data files and returns space to the OS.
flowchart TD
A[Filesystem >95% full] --> B{mongod running?}
B -->|No, recovery hangs| C[Free space for journal replay]
B -->|Yes, but unstable| D[Find largest space consumer]
D --> E{Replica set member?}
E -->|Yes| F[Resync member after cleanup]
E -->|No| G[Compact collections or expand volume]
C --> H[Restart mongod]
G --> H
F --> H
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Organic data growth | storageSize climbs steadily and df shows gradual fill | db.adminCommand({listDatabases:1}) and daily df trend |
| Uncompacted collections after mass deletes | dataSize drops but df is unchanged; storageSize far exceeds dataSize | db.collection.stats() comparing storageSize to dataSize |
| Oversized oplog | local database consumes far more disk than needed for the replication window | rs.printReplicationInfo() to compare configured size to actual window |
| Runaway index build temp files | Sudden spike during index creation; build may fail with space errors | db.currentOp() for active index builds and du -sh on the dbPath |
| Rollback files left after a failover | Failover produced rollback data in the rollback/ directory under dbPath | du -sh <dbPath>/rollback |
Quick checks
These commands are read-only and safe to run during an incident.
# Filesystem utilization for the data directory
df -h /data/db
# Journal directory size if located on the same volume
du -sh /data/db/journal
# Rollback directory footprint after a failover
du -sh /data/db/rollback
# Logical database sizes from MongoDB's perspective
mongosh --quiet --eval 'db.adminCommand({listDatabases:1}).databases.forEach(function(d){ print(d.name+": "+(d.sizeOnDisk/1024/1024).toFixed(1)+" MB") })'
# Per-collection storage bloat (dead space from deletes)
mongosh --quiet --eval 'db.getCollectionNames().forEach(function(c){ var s=db[c].stats(); print(c+": storage="+(s.storageSize/1024/1024).toFixed(1)+" MB data="+(s.dataSize/1024/1024).toFixed(1)+" MB") })'
# Oplog size and time window
mongosh --quiet --eval 'rs.printReplicationInfo()'
# Active long-running operations that may generate temp files
mongosh --quiet --eval 'db.currentOp({active:true, secs_running:{$gt:10}}).inprog.forEach(function(o){print(o.opid+" "+o.ns+" "+o.secs_running+"s")})'
# Recent log errors mentioning disk or journal pressure
grep -iE "journal|disk|no space" /var/log/mongodb/mongod.log | tail -20
How to diagnose it
- Confirm the filesystem boundary. Run df -h on the data directory and, if applicable, the separate journal directory. If utilization is above 95% and free space is below 2 GB, treat this as an imminent crash or unrecoverable startup.
- Quantify the MongoDB footprint versus total disk. Run db.adminCommand({listDatabases:1}) for logical sizes, then du -sh /data/db for physical size. If physical size far exceeds the sum of database sizes, investigate journal files, rollback data, and temporary files.
- Identify uncompacted collections. For each large collection, compare db.collection.stats().storageSize to dataSize. A large gap indicates dead space from deleted documents.
- Check db.currentOp() for active index builds or large transactions. Index builds write temporary files outside dbStats accounting. Large transactions generate oversized oplog entries.
- Determine replica set status. If the node is a secondary, a full initial sync after cleanup is often faster than compacting in place. If it is a standalone, you must recover locally.
- If mongod will not start, examine the log for recovery progress. As a rough rule, free space should exceed the total size of journal files to replay. If it does not, free space or move dbPath before mongod can become available.
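The storageSize versus dataSize comparison in the steps above can be turned into a ranked list of compaction candidates. The sketch below is one way to do that over plain objects shaped like collStats output. The function names are hypothetical. It also assumes that, where a freeStorageSize field is present (newer servers report it when reusable space exists), that field is the better estimate. Because WiredTiger compresses data on disk, the storageSize minus dataSize fallback can understate dead space.

```javascript
// Estimate reclaimable bytes for one collStats-shaped object.
// Assumption: all sizes are in bytes (the default scale).
function estimateReclaimableBytes(stats) {
  // Prefer freeStorageSize when the server reports it (an assumption
  // about the server version; the field may be absent).
  if (typeof stats.freeStorageSize === "number") {
    return stats.freeStorageSize;
  }
  // Fallback from the text: gap between allocated and logical size.
  // dataSize is uncompressed, so this can understate the dead space.
  return Math.max(0, stats.storageSize - stats.dataSize);
}

// Return [{ ns, estimateBytes }] sorted with the largest estimate first.
function rankByReclaimable(statsList) {
  return statsList
    .map((s) => ({ ns: s.ns, estimateBytes: estimateReclaimableBytes(s) }))
    .sort((a, b) => b.estimateBytes - a.estimateBytes);
}
```

In mongosh, one plausible input is `db.getCollectionNames().map((c) => db[c].stats())`, although views in the list would need to be filtered out first because they have no storage statistics.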
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Filesystem utilization (df) | Journal writes fail catastrophically at 100% full | PAGE when >95% and <2 GB free |
| storageSize vs dataSize | Reveals internal WiredTiger bloat invisible to df trends | storageSize exceeds dataSize by a wide margin after large deletes |
| Oplog size (rs.printReplicationInfo) | Fixed capped size is a baseline disk consumer | Window far larger than operational requirement (for example, >72 hours) |
| Active index builds (db.currentOp) | Builds create temp files outside normal accounting | Build running during a disk spike |
| Rollback directory size (du -sh) | Rollback data is not automatically cleaned | Unexpected growth after elections or failovers |
| Journal sync latency | Leading indicator of storage subsystem pressure | Average sync latency >30 ms sustained |
Fixes
Reclaim space from the oplog
If the oplog is larger than needed, reduce the capped collection size. In MongoDB 4.0 and later, run:
db.adminCommand({ replSetResizeOplog: 1, size: <new_size_in_MB> })
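This lowers the oplog's maximum size on the member where you run it, but it does not by itself return the already allocated space to the operating system. To reclaim that space, run compact on the oplog.rs collection in the local database afterward. Tradeoff: a smaller window increases the risk that a secondary falls off the oplog during maintenance. Only resize if the current window exceeds your operational safety margin by a large factor.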
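One way to pick the new size is to scale the oplog data currently in use by the ratio of the target window to the observed window. The sketch below does that arithmetic. It assumes the write rate over the current window stays roughly constant, the function name and the 1.5 safety factor are hypothetical, and the 990 MB floor reflects the minimum that the MongoDB documentation gives for replSetResizeOplog. The inputs could be read from the usedMB and timeDiffHours fields of db.getReplicationInfo(), if your shell exposes them.

```javascript
// Minimum oplog size in MB per the replSetResizeOplog documentation
// (verify against your server version before relying on it).
const MIN_OPLOG_MB = 990;

// Suggest an oplog size in MB that should cover targetHours of writes,
// assuming the rate observed over the current window stays constant.
function suggestOplogSizeMB(usedMB, windowHours, targetHours, safetyFactor = 1.5) {
  if (!(usedMB > 0) || !(windowHours > 0) || !(targetHours > 0) || !(safetyFactor >= 1)) {
    throw new RangeError("usedMB, windowHours, targetHours must be > 0 and safetyFactor >= 1");
  }
  const mbPerHour = usedMB / windowHours;
  const suggested = Math.ceil(mbPerHour * targetHours * safetyFactor);
  // Never suggest less than the documented minimum.
  return Math.max(MIN_OPLOG_MB, suggested);
}
```

For example, an oplog holding 51200 MB across a 200 hour window writes about 256 MB per hour, so a 48 hour target with the 1.5 factor gives `suggestOplogSizeMB(51200, 200, 48)` equal to 18432. Bursty write patterns break the constant-rate assumption, so treat the result as a starting point rather than a safe minimum.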
Compact collections
For standalone instances or when you must preserve the current data files, run:
db.runCommand({ compact: "<collection>" })
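Warning: the locking behavior of compact depends on the server version. In MongoDB versions before 4.4, compact blocks all reads and writes on the database that contains the collection for the duration. From 4.4 onward, it blocks only certain metadata operations, such as dropping the collection or creating and dropping indexes. In either case it requires temporary working space, so if the disk is already critically full, compact may fail.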
Resync a replica set member
If the affected node is a secondary and local recovery is too slow, the fastest path to clean data files is an initial sync. After freeing enough space to allow mongod to start, or after wiping the data directory and pointing to a larger volume, restart mongod. It enters initial sync and pulls a fresh copy from the primary. Tradeoff: the primary and network absorb sync load, and cluster redundancy is reduced until the sync completes.
Expand the underlying volume
On cloud block storage or LVM, expanding the filesystem is the safest fix if infrastructure allows. It requires no MongoDB-level operations other than a restart if the device was unmounted. Ensure the expanded space is visible to the filesystem before starting mongod.
What to avoid
Do not run mongod --repair on a disk that is already full. Repair rebuilds data and index files and requires significant temporary space. It is also offline for standalone nodes. Do not delete WiredTiger data files, journal files, or the mongod.lock file. Doing so causes unrecoverable data loss.
Prevention
- Trend filesystem utilization and storageSize weekly. Project days until 90% using the current growth rate.
- After any large delete operation, schedule a compact or plan a rolling resync to reclaim space.
- Size the oplog to maintain a 24-48 hour window rather than over-provisioning disk for a multi-day window you do not need.
- Monitor db.currentOp() during index builds to catch temp file growth before it fills the disk.
- Maintain at least 20% free space at all times, plus enough headroom to rebuild the largest index and compact the largest collection without crossing 90%.
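The "days until 90%" projection in the first item is a simple linear extrapolation. A minimal sketch, assuming a constant daily growth rate derived from your own df trend (the function name and parameters are hypothetical):

```javascript
// Project how many days remain until used space reaches thresholdFraction
// of the volume, assuming linear growth at growthBytesPerDay.
function daysUntilThreshold(totalBytes, usedBytes, growthBytesPerDay, thresholdFraction = 0.9) {
  if (!(totalBytes > 0) || usedBytes < 0) {
    throw new RangeError("expected totalBytes > 0 and usedBytes >= 0");
  }
  const limitBytes = totalBytes * thresholdFraction;
  // Already at or past the threshold.
  if (usedBytes >= limitBytes) return 0;
  // Flat or shrinking usage never reaches the threshold under this model.
  if (!(growthBytesPerDay > 0)) return Infinity;
  return (limitBytes - usedBytes) / growthBytesPerDay;
}
```

With a 1000 GB volume, 700 GB used, and 10 GB of growth per day, the function returns 20 days. Growth is rarely linear around bulk loads or index builds, so recomputing the rate from a recent window is a reasonable precaution.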
How Netdata helps
Netdata correlates host disk utilization with MongoDB storageSize and dataSize to distinguish organic growth from WiredTiger file bloat. It raises TICKET alerts at 80% disk utilization and PAGE alerts at 95% with less than 2 GB free, matching the failure threshold above. Per-collection metrics pinpoint which namespaces drive disk consumption. Journal sync latency spikes surface before application write timeouts. Oplog window hours and replication lag are displayed together, so you can right-size the oplog without guessing.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB connection churn: high totalCreated rate and thread creation overhead
- MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling
- MongoDB connection storm spiral: reconnection floods after an election or deploy
- MongoDB exceeded memory limit for $group — aggregation spills and allowDiskUse
- MongoDB flow control throttling writes: when the primary slows itself down







