MongoDB checkpoint stall write freeze: when all writes stop with no error

Writes time out or hang while mongod is running, TCP port 27017 is open, and reads still return results from cache. The MongoDB logs are quiet, but db.serverStatus().opcounters shows write counts frozen. This is a WiredTiger checkpoint stall: the checkpoint process fell behind, dirty pages accumulated, and new writes blocked. The freeze lasts until the current checkpoint completes. If the I/O bottleneck remains, queued writes flood through and the next checkpoint stalls again.

This failure mode looks like a network or application issue, so operators often restart application pods or fail over load balancers while the real problem is storage I/O that cannot keep pace with dirty page flush demand. Do not kill the checkpoint or restart the node; that forces journal replay and makes the outage longer. Learn the signal pattern so you can diagnose it in seconds.

What this means

WiredTiger keeps data in an in-memory cache and flushes dirty pages to disk via a checkpoint. By default, a checkpoint runs every 60 seconds. If storage cannot complete the checkpoint within that interval, dirty data accumulates in the cache and journal files cannot be reclaimed until the checkpoint finishes. Once cache or journal pressure crosses WiredTiger safety limits, new writes stall or queue indefinitely while reads from cache may continue. When the checkpoint finally completes, blocked writes flush through together. If the I/O bottleneck remains, the next checkpoint also stalls, repeating the cycle.

flowchart TD
    A[Checkpoint exceeds 60 second interval] --> B[Dirty data accumulates in cache]
    B --> C[Journal files cannot be reclaimed]
    C --> D[WiredTiger blocks new writes]
    D --> E[Applications queue writes indefinitely]
    E --> F[Connection count rises]
    F --> G[Checkpoint eventually completes]
    G --> H[Queued writes execute at once]
    H --> I[Next checkpoint stalls if I/O bottleneck persists]

Common causes

Cause	What it looks like	First thing to check
Storage device saturation or failure	`iostat` shows `%util` near 100% and `await` above 50 ms	`iostat -x 1 5` on the data and journal volumes
Cloud storage burst credit exhaustion	Journal sync latency spikes 10-100x without any code change	Cloud volume burst balance metrics
RAID rebuild or degraded array	Checkpoint duration jumps after a disk replacement or failure event	RAID controller status and rebuild progress
Massive write burst	Dirty ratio climbs rapidly during bulk imports or migrations	`opcounters` write rate versus baseline

Quick checks

Run these on the affected node in mongosh and at the OS level.

// Latest checkpoint duration in milliseconds
var txn = db.serverStatus().wiredTiger.transaction;
print("Checkpoint duration (ms): " + txn["transaction checkpoint most recent time (msecs)"]);

// WiredTiger cache dirty ratio
var c = db.serverStatus().wiredTiger.cache;
var max = c["maximum bytes configured"];
var dirty = c["tracked dirty bytes in the cache"];
print("Dirty ratio: " + (100 * dirty / max).toFixed(1) + "%");

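// Average journal sync latency in microseconds
// (lifetime average: both counters are cumulative since mongod started)
var wt = db.serverStatus().wiredTiger.log;
var syncTime = wt["log sync time duration (usecs)"];
var syncOps = wt["log sync operations"];
// Guard against division by zero when no syncs have been recorded yet
print("Avg journal sync (us): " + (syncOps > 0 ? (syncTime / syncOps).toFixed(0) : "n/a"));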

// Write ops and latency to confirm a freeze
var lat = db.serverStatus().opLatencies;
print("Write ops: " + lat.writes.ops + ", total write latency (us): " + lat.writes.latency);

// Connection growth as writes pile up
var conn = db.serverStatus().connections;
print("Current connections: " + conn.current + ", available: " + conn.available);

# OS-level storage latency and utilization (run in a shell on the host, not in mongosh)
iostat -x 1 5

// Back in mongosh: count active operations running longer than 10 seconds
db.currentOp({ "active": true, "secs_running": { "$gt": 10 } }).inprog.length
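
Because the journal counters are cumulative, the single-sample average above can hide a recent spike behind hours of healthy history. The following is a minimal sketch of an interval average computed from two samples; `journalSyncDelta` is a hypothetical helper name, not part of mongosh, and the 10-second gap in the usage comment is an arbitrary choice.

```javascript
// Hypothetical helper: average journal sync latency (microseconds) between
// two db.serverStatus().wiredTiger.log samples. Number() is used so the
// arithmetic also works if the shell returns 64-bit values as Long objects.
function journalSyncDelta(prevLog, currLog) {
  const ops = Number(currLog["log sync operations"]) -
              Number(prevLog["log sync operations"]);
  const usecs = Number(currLog["log sync time duration (usecs)"]) -
                Number(prevLog["log sync time duration (usecs)"]);
  // No syncs in the interval, or counters went backwards after a restart:
  // there is no meaningful average to report.
  if (ops <= 0 || usecs < 0) return null;
  return usecs / ops;
}

// Usage in mongosh (sleep() is a mongosh built-in):
//   const a = db.serverStatus().wiredTiger.log;
//   sleep(10000);
//   const b = db.serverStatus().wiredTiger.log;
//   print("Avg journal sync over interval (us): " + journalSyncDelta(a, b));
```

A `null` result during a suspected stall is itself informative: it means no journal sync completed during the interval.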

How to diagnose it

Confirm the freeze pattern. Verify that writes are not completing while reads still work. Check db.serverStatus().opLatencies. If write latency is climbing or the write ops counter is flat under active load, writes are blocked.
Check checkpoint duration. Inspect db.serverStatus().wiredTiger.transaction["transaction checkpoint most recent time (msecs)"]. A value above 60,000 ms confirms the checkpoint is taking longer than the default interval.
Check the dirty ratio. Divide "tracked dirty bytes in the cache" by "maximum bytes configured" from db.serverStatus().wiredTiger.cache and express the result as a percentage. A value above 20% means dirty data is accumulating faster than the checkpoint can flush it.
Check journal sync latency. Calculate the average from db.serverStatus().wiredTiger.log using "log sync time duration (usecs)" divided by "log sync operations". Both counters are cumulative since startup, so compare two samples taken a few seconds apart to see the current latency rather than the lifetime average. Sustained values above 100,000 microseconds (100 ms) indicate the storage layer is struggling.
Inspect current operations. Run db.currentOp({ "active": true, "secs_running": { "$gt": 10 } }). Many write operations with high secs_running and no progress means they are queued behind the stall.
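Check storage health at the OS level. Run iostat -x 1 5 and look at the devices backing the data and journal volumes. %util near 100% with high await or w_await confirms a storage bottleneck. Recent sysstat releases report r_await and w_await separately instead of a single await column; w_await is the relevant one here.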
Correlate with application metrics. Look for write timeouts and connection pool exhaustion in application logs; these side effects often appear before the database stall is noticed.
Determine if the issue is transient or persistent. A one-time spike during a backup may resolve itself. Sustained elevation means the storage layer is undersized or failing.
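
The threshold checks in the steps above can be sketched as one function over a single `db.serverStatus()` document. This is an illustration under stated assumptions, not a definitive detector: `triageCheckpointStall` is a hypothetical name, the field names are the ones used in this guide and may differ on your MongoDB version, and the journal figure is the lifetime average from one sample.

```javascript
// Sketch: apply this guide's thresholds to one db.serverStatus() document.
// A field that is missing on your version evaluates to NaN and produces
// no finding rather than a false alarm.
function triageCheckpointStall(status) {
  const txn = status.wiredTiger.transaction;
  const cache = status.wiredTiger.cache;
  const log = status.wiredTiger.log;

  const checkpointMs = Number(txn["transaction checkpoint most recent time (msecs)"]);
  const dirtyPct = 100 * Number(cache["tracked dirty bytes in the cache"]) /
                   Number(cache["maximum bytes configured"]);
  const syncOps = Number(log["log sync operations"]);
  const syncUs = syncOps > 0
    ? Number(log["log sync time duration (usecs)"]) / syncOps
    : NaN;

  const findings = [];
  if (checkpointMs > 60000) findings.push("checkpoint exceeds the 60 s interval");
  if (dirtyPct > 20) findings.push("dirty ratio above 20%");
  if (syncUs > 100000) findings.push("average journal sync above 100 ms");
  return findings;
}

// Usage in mongosh:
//   triageCheckpointStall(db.serverStatus()).forEach((f) => print(f));
```

An empty result does not rule out a stall. The current operations and `iostat` steps still need a human look.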

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Checkpoint duration	Measures how long dirty data takes to reach disk	Sustained values above 60 seconds
WiredTiger cache dirty ratio	Indicates dirty data accumulation ahead of flush capacity	Sustained values above 20%
Journal sync latency	Reveals storage health before application latency spikes	Average above 100,000 microseconds (100 ms)
Write operation latency	Direct user-facing impact of the stall	Write p99 climbing while read latency stays flat
Connection count	Writes pile up and hold connections	Rapid growth coinciding with a write throughput drop
Current operation age	Shows queued operations waiting for the stall to clear	Many active writes with `secs_running` above 10 seconds
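
The connection count signal in the table is a comparison between two points in time, so it needs two samples. The following is a rough sketch, assuming two `db.serverStatus()` documents taken some seconds apart while the application is actively writing. `detectWritePileUp` is a hypothetical helper name, and an idle workload will also show a flat write counter.

```javascript
// Sketch: flag the pattern "write counters flat while connections grow"
// between two db.serverStatus() samples.
function detectWritePileUp(prev, curr) {
  // opcounters are cumulative; sum the three write operation types.
  const writes = (s) =>
    Number(s.opcounters.insert) +
    Number(s.opcounters.update) +
    Number(s.opcounters.delete);

  const writeDelta = writes(curr) - writes(prev);
  const connDelta = Number(curr.connections.current) -
                    Number(prev.connections.current);

  return {
    writeDelta,
    connDelta,
    // Only meaningful when the application is known to be sending writes.
    suspicious: writeDelta === 0 && connDelta > 0,
  };
}

// Usage in mongosh:
//   const a = db.serverStatus(); sleep(10000); const b = db.serverStatus();
//   printjson(detectWritePileUp(a, b));
```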

Fixes

Storage device saturation or failure

Do not restart mongod and do not attempt to kill the running checkpoint. Let the checkpoint complete. Interrupting it forces journal replay on recovery, which makes the outage longer. If the primary is on degraded storage and a healthy secondary exists, fail over to shift writes away from the bottleneck.

Warning: Only run rs.stepDown() on the primary. Ensure a healthy secondary exists and the replica set can elect a new primary with a majority.

// Step down the primary to force failover to a healthier secondary
rs.stepDown()

This triggers an election and brief write unavailability, but it moves write load off the failing device. After the stepdown, address the storage problem on the old primary before it can be elected again: replace failing disks, move to faster volumes, or resolve any I/O contention at the hypervisor level.
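
Before stepping down, it helps to check the warning's two conditions against `rs.status()` output. The following is a minimal sketch: `stepDownPrecheck` is a hypothetical helper, and it assumes every member votes, which `rs.status()` alone cannot confirm (voting configuration lives in `rs.conf()`). It also does not check replication lag on the secondaries.

```javascript
// Sketch: check for a healthy secondary and a reachable majority using the
// members array from rs.status().
function stepDownPrecheck(rsStatus) {
  const members = rsStatus.members || [];
  const healthy = members.filter((m) => Number(m.health) === 1);
  const healthySecondaries = healthy.filter((m) => m.stateStr === "SECONDARY");

  // Assumption: all members vote, so a majority is more than half of them.
  const majorityReachable = healthy.length > members.length / 2;

  return {
    healthySecondaries: healthySecondaries.map((m) => m.name),
    majorityReachable,
    ok: healthySecondaries.length > 0 && majorityReachable,
  };
}

// Usage in mongosh, on the primary:
//   const check = stepDownPrecheck(rs.status());
//   if (check.ok) { rs.stepDown(); } else { printjson(check); }
```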

Cloud storage burst credit exhaustion

If burst credits are depleted, baseline IOPS may be too low for your write workload. Increase the volume size or provisioned IOPS to raise the floor, then let the current checkpoint finish. Do not scale the MongoDB process vertically until the storage layer can sustain the checkpoint flush rate.

RAID rebuild or maintenance

If the stall correlates with a RAID rebuild, options are limited until the rebuild completes. Reduce write pressure by pausing batch jobs, throttling ingestion, or disabling non-critical writes. If the cluster cannot tolerate the latency, fail over to a secondary that is not undergoing rebuild.

Massive write burst

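Throttle the bulk load or migration at the application layer. The checkpoint mechanism is designed for steady-state traffic, not sustained write floods that exceed what the disk can absorb. Spread bulk loads over time in smaller, rate-limited batches. The load cannot be redirected to a secondary: in a replica set, only the primary accepts writes, and secondaries receive the same data through replication.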
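
One simple way to throttle at the application layer is to split the load into fixed-size batches and pause between them. The following is a minimal sketch: `inBatches` is a hypothetical helper, and the collection name `events`, the batch size of 1000, and the 200 ms pause in the usage comment are illustrative values to tune against your own checkpoint duration and dirty ratio.

```javascript
// Sketch: yield fixed-size slices of an array so a bulk load can pause
// between batches instead of writing everything at once.
function* inBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  for (let i = 0; i < docs.length; i += batchSize) {
    yield docs.slice(i, i + batchSize);
  }
}

// Usage in mongosh (sleep() is a mongosh built-in; db.events is hypothetical):
//   for (const batch of inBatches(docs, 1000)) {
//     db.events.insertMany(batch, { ordered: false });
//     sleep(200); // give checkpoints room to keep up
//   }
```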

Prevention

Monitor checkpoint duration and dirty ratio as leading indicators. A checkpoint duration creeping from 5 seconds toward 30 seconds signals shrinking headroom.
Size storage to sustain peak write rate plus periodic checkpoint flush. Avoid relying solely on burst-credit storage for write-heavy primaries.
Watch journal sync latency. It typically degrades 30 to 60 seconds before application-visible latency spikes.
Keep a healthy secondary on independent storage to provide a clean failover target if the primary’s storage degrades.
Avoid running bulk imports or large index builds during peak traffic. Schedule them when checkpoint duration is low and journal sync latency is stable.
Set application driver timeouts to fail fast during a stall rather than holding connections open indefinitely. This limits connection pile-up and reduces recovery time.
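
The fail-fast advice in the last item can be expressed as connection string options. The following is a sketch: `buildFailFastUri` is a hypothetical helper, the option names are standard MongoDB connection string options, and every value is an illustrative assumption rather than a recommendation. Check which options your driver version honors.

```javascript
// Sketch: build a MongoDB connection string with bounded waits so clients
// give up during a stall instead of holding connections open.
function buildFailFastUri(hosts, dbName) {
  const params = new URLSearchParams({
    serverSelectionTimeoutMS: "5000", // how long to look for a usable server
    socketTimeoutMS: "10000",         // how long a socket read/write may block
    waitQueueTimeoutMS: "2000",       // how long to wait for a pooled connection
    wtimeoutMS: "5000",               // how long to wait for write concern acknowledgement
    maxPoolSize: "50",                // cap on connections per client
  });
  return `mongodb://${hosts.join(",")}/${dbName}?${params.toString()}`;
}

// Usage (hostnames and database name are placeholders):
//   buildFailFastUri(["db1:27017", "db2:27017"], "app");
```

Bounding the pool size alongside the timeouts matters here: timeouts alone let clients retry and open new connections, which recreates the pile-up.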

How Netdata helps

Correlates MongoDB checkpoint duration with OS disk I/O utilization and latency to show whether the stall is a database or storage issue.
Alerts on WiredTiger cache dirty ratio thresholds before writes freeze.
Tracks journal sync latency as an early storage health signal, often warning before application timeouts trigger.
Visualizes WiredTiger ticket utilization and queue depths to help distinguish a checkpoint stall from a single bad query or a broader cache pressure cascade.
Monitors connection growth alongside write latency to surface the pile-up pattern that follows a stall.
Surfaces currentOp metrics to confirm operations are stuck waiting for storage rather than a specific lock.

