MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes

Latency jumps from milliseconds to seconds for both reads and writes. The slow query log shows no single offender, but connection count climbs as clients retry and time out. This is the cache pressure cascade. It starts in the storage engine and becomes a self-reinforcing spiral through replication, admission control, and connection handling. This guide covers the mechanism, how to confirm it under pressure, and how to stop it.

What this means

WiredTiger uses an in-memory cache separate from the OS page cache. The default maximum is the larger of 50% of (RAM minus 1 GB) or 256 MB. Writes land in cache as dirty pages; checkpoints flush them to disk every 60 seconds by default. Background eviction threads work to keep cache fill near 80% and, by default, dirty pages near 5% of the cache.
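As a worked example of the default sizing rule, the sketch below computes the cache size for a few RAM sizes. The function name `defaultCacheBytes` is hypothetical, and the arithmetic treats GB and MB as binary units, which is an assumption made for illustration.

```javascript
// Default WiredTiger cache size: the larger of 50% of (RAM - 1 GiB) or 256 MiB.
// defaultCacheBytes is an illustrative helper, not a MongoDB API.
const GiB = 1024 ** 3;
const MiB = 1024 ** 2;

function defaultCacheBytes(ramBytes) {
  return Math.max(0.5 * (ramBytes - GiB), 256 * MiB);
}

for (const ramGiB of [1, 4, 16, 64]) {
  const cacheGiB = defaultCacheBytes(ramGiB * GiB) / GiB;
  console.log(`${ramGiB} GiB RAM -> ${cacheGiB.toFixed(2)} GiB cache`);
}
```

On small hosts the 256 MB floor applies, so a 1 GiB machine still gets a quarter of its memory as cache.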

When dirty data accumulates faster than checkpoints and background eviction can flush, cache fill climbs toward 95%. WiredTiger then forces application threads to evict pages themselves. Every operation pauses to clean pages before proceeding. Because each operation holds a concurrency ticket longer, ticket availability drops. New operations queue behind globalLock. Timeouts trigger reconnections, which create more threads, consume more memory, and compete for the same tickets and cache. The spiral will not self-resolve.

flowchart TD
    A[Write volume exceeds flush rate] --> B[Cache dirty ratio rises]
    B --> C[Background eviction can't keep up]
    C --> D[Cache fill exceeds 95%]
    D --> E[Application threads forced to evict]
    E --> F[Operation latency spikes]
    F --> G[Tickets held longer]
    G --> H[Queue depths grow]
    H --> I[Application timeouts]
    I --> J[Reconnect storm]
    J --> K[More threads compete for cache and tickets]
    K --> E
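
The transition from background eviction to application-thread eviction in the flowchart can be sketched as a threshold check. The values below are WiredTiger's commonly documented defaults (80% and 95% for fill, 5% and 20% for dirty data) and are assumptions here, since a deployment may override them. The function and field names are hypothetical.

```javascript
// Assumed default eviction thresholds, as percentages of the configured cache size.
const THRESHOLDS = { fillTarget: 80, fillTrigger: 95, dirtyTarget: 5, dirtyTrigger: 20 };

// Returns which kind of eviction work the given fill and dirty percentages imply.
function evictionStage(fillPct, dirtyPct, t = THRESHOLDS) {
  if (fillPct >= t.fillTrigger || dirtyPct >= t.dirtyTrigger) {
    return "application-thread eviction";
  }
  if (fillPct >= t.fillTarget || dirtyPct >= t.dirtyTarget) {
    return "background eviction";
  }
  return "idle";
}

console.log(evictionStage(75, 2));  // idle
console.log(evictionStage(70, 18)); // background eviction
console.log(evictionStage(96, 10)); // application-thread eviction
```

Either ratio can push the cache into the next stage, which is why a moderate fill level with a high dirty ratio is still a problem.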

Common causes

Cause	What it looks like	First thing to check
Write volume overwhelming storage throughput	Dirty ratio rises steadily; checkpoint duration climbs; journal sync latency spikes first	`iostat -x 1` and WiredTiger checkpoint duration
Long-running snapshots pinning old versions	Cache fill high but dirty ratio moderate; `currentOp` shows old transactions or many `noCursorTimeout` cursors	`db.currentOp()` for transaction age and `metrics.cursor`
WiredTiger cache undersized for working set	Cache fill above 80% during normal load; high page fault rate; gradual growth over days	`db.serverStatus().wiredTiger.cache` fill ratio and OS page faults
Storage device degradation or burst credit exhaustion	Journal sync latency spikes before cache pressure; checkpoint duration jumps suddenly	Cloud storage burst balance, or `iostat` `%util` and `await`

Quick checks

Run these read-only checks to confirm the cascade is active.

// Check cache fill and dirty ratio
var c = db.serverStatus().wiredTiger.cache;
var max = c["maximum bytes configured"];
print("Fill: " + (100 * c["bytes currently in the cache"] / max).toFixed(1) + "%");
print("Dirty: " + (100 * c["tracked dirty bytes in the cache"] / max).toFixed(1) + "%");

// Check application-thread evictions
var c = db.serverStatus().wiredTiger.cache;
print("App-thread evictions: " + c["pages evicted by application threads"]);

// Check available tickets
// MongoDB <= 7.x
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read available: " + t.read.available + " / " + t.read.totalTickets);
print("Write available: " + t.write.available + " / " + t.write.totalTickets);
// MongoDB 8.0+: inspect db.serverStatus().queues.execution

// Check queue depths
var q = db.serverStatus().globalLock.currentQueue;
print("Queued readers: " + q.readers + ", writers: " + q.writers);

// Check operation latency
var l = db.serverStatus().opLatencies;
if (l.reads.ops > 0) print("Read avg (µs): " + (l.reads.latency / l.reads.ops).toFixed(0));
if (l.writes.ops > 0) print("Write avg (µs): " + (l.writes.latency / l.writes.ops).toFixed(0));

// Check long-running operations
db.currentOp({ active: true, secs_running: { $gt: 10 } }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});

# Check OS disk health
iostat -x 1 5

How to diagnose it

Confirm both read and write latency are elevated in opLatencies. If only reads are slow, suspect a missing index or plan regression rather than cache pressure.
Check cache fill ratio and dirty ratio. Fill above 80% with rising dirty ratio confirms pressure. Dirty ratio above 15% is concerning; above 20% is critical.
Check pages evicted by application threads. Any sustained nonzero delta means user threads are doing eviction work.
Check available tickets. In MongoDB 8.0+, inspect queues.execution; in earlier versions use wiredTiger.concurrentTransactions. Sustained availability below 25% of total confirms admission control saturation.
Check globalLock.currentQueue. Sustained nonzero readers and writers means operations are queuing behind ticket or lock contention.
Run db.currentOp() to find long-running operations, active multi-document transactions, or queued locks. Identify candidates for killOp.
Check checkpoint duration in db.serverStatus().wiredTiger.transaction. If the most recent checkpoint approaches or exceeds 60 seconds, storage throughput is the bottleneck.
Check journal sync latency in db.serverStatus().wiredTiger.log. Sustained averages above 30 ms point to disk subsystem trouble.
On replica sets, check replication lag and flow control status. If secondaries are lagging, flow control may throttle the primary, which can mask or mimic cache pressure. Use rs.printSecondaryReplicationInfo() for per-member lag, rs.printReplicationInfo() for the oplog window, and inspect db.serverStatus().flowControl.
Check OS disk metrics. High %util or await with low throughput indicates storage saturation or device degradation.
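
Several of the steps above depend on deltas rather than absolute counters, because serverStatus counters are cumulative since startup. A minimal sketch, assuming two serverStatus documents captured a known interval apart, might look like this. The function name `cascadeSignals` and the synthetic snapshots are hypothetical; the statistic names are the ones used in the quick checks and may differ between MongoDB versions.

```javascript
// prev and curr: two db.serverStatus() documents captured intervalSecs apart.
function cascadeSignals(prev, curr, intervalSecs) {
  const num = (v) => Number(v); // also converts Long values returned by the shell
  const cache = curr.wiredTiger.cache;
  const max = num(cache["maximum bytes configured"]);
  const evicted = "pages evicted by application threads";
  const queue = curr.globalLock.currentQueue;
  const signals = {
    fillPct: (100 * num(cache["bytes currently in the cache"])) / max,
    dirtyPct: (100 * num(cache["tracked dirty bytes in the cache"])) / max,
    appEvictionsPerSec:
      (num(cache[evicted]) - num(prev.wiredTiger.cache[evicted])) / intervalSecs,
    queued: num(queue.readers) + num(queue.writers),
  };
  // Ticket counters live here before MongoDB 8.0; skipped when absent.
  const tickets = curr.wiredTiger.concurrentTransactions;
  if (tickets) {
    signals.readTicketPct = (100 * num(tickets.read.available)) / num(tickets.read.totalTickets);
    signals.writeTicketPct = (100 * num(tickets.write.available)) / num(tickets.write.totalTickets);
  }
  return signals;
}

// Synthetic snapshots for illustration only; not real server output.
const snapshot = (inCache, dirty, appEvicted, readAvail, writeAvail, readers, writers) => ({
  wiredTiger: {
    cache: {
      "maximum bytes configured": 1000,
      "bytes currently in the cache": inCache,
      "tracked dirty bytes in the cache": dirty,
      "pages evicted by application threads": appEvicted,
    },
    concurrentTransactions: {
      read: { available: readAvail, totalTickets: 128 },
      write: { available: writeAvail, totalTickets: 128 },
    },
  },
  globalLock: { currentQueue: { readers, writers } },
});
const before = snapshot(900, 150, 100, 100, 60, 0, 2);
const after = snapshot(960, 210, 600, 16, 0, 12, 30);
console.log(cascadeSignals(before, after, 10));
```

In mongosh the two snapshots could come from calling db.serverStatus(), waiting with sleep(10000), and calling it again.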

Metrics and signals to monitor

Signal	Why it matters	Warning sign
WiredTiger cache dirty ratio	Reveals checkpoint stall risk before latency degrades	Sustained above 10%; critical above 20%
Pages evicted by application threads	Confirms user threads are doing eviction instead of serving queries	Any sustained nonzero rate
WiredTiger checkpoint duration	Long checkpoints block journal recycling and can freeze writes	Above 30 seconds sustained; above 60 seconds critical
Available read/write tickets	Direct measure of storage engine admission control	Below 25% of total sustained
`globalLock.currentQueue`	Shows operations waiting behind contention	Sustained above 20 and growing
`opLatencies` reads/writes	User-visible latency impact	Average doubles from baseline for more than 5 minutes
Journal sync latency	Leading indicator of storage health	Above 30 ms sustained
Replication lag / oplog window	Secondary falloff risk if the primary is overloaded	Lag above 10 seconds or approaching 50% of oplog window
Connection count and `totalCreated`	Reconnection storms amplify resource pressure	`current` grows while latency spikes; high `totalCreated` delta
Current longest-running operation age	A single bad query can trigger ticket exhaustion	Any non-background operation above 300 seconds
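
The warning signs in the table can be expressed as a small rule set. This is a sketch only: the field names on the input object are hypothetical, the thresholds are copied from the table, and the caller is assumed to pass values that are already averaged over a window so that "sustained" has meaning.

```javascript
// Evaluates already-averaged signals against the table's warning thresholds.
// Missing fields are simply skipped because comparisons with undefined are false.
function evaluateSignals(s) {
  const alerts = [];
  const add = (level, message) => alerts.push(`${level}: ${message}`);

  if (s.dirtyPct > 20) add("critical", "dirty ratio above 20%");
  else if (s.dirtyPct > 10) add("warning", "dirty ratio above 10%");

  if (s.appEvictionsPerSec > 0) add("warning", "application threads are evicting pages");

  if (s.checkpointSecs > 60) add("critical", "checkpoint duration above 60 s");
  else if (s.checkpointSecs > 30) add("warning", "checkpoint duration above 30 s");

  if (Math.min(s.readTicketPct, s.writeTicketPct) < 25) {
    add("warning", "ticket availability below 25%");
  }
  if (s.queued > 20) add("warning", "more than 20 queued operations");
  if (s.journalSyncMs > 30) add("warning", "journal sync latency above 30 ms");
  if (s.replLagSecs > 10) add("warning", "replication lag above 10 s");
  if (s.longestOpSecs > 300) add("warning", "operation running longer than 300 s");
  return alerts;
}

console.log(evaluateSignals({ dirtyPct: 18, appEvictionsPerSec: 50, checkpointSecs: 75 }));
```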

Fixes

Immediate relief

Warning: db.killOp() is disruptive. Do not kill replication operations or foreground index builds. Kill only unnecessary, user-initiated operations. Pause batch writes, bulk imports, or ETL jobs to reduce dirty page generation. If a lagging secondary is causing flow control to throttle the primary, redirect read traffic away from that secondary.
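
One way to shortlist candidates before calling db.killOp() is to filter the currentOp output and review the result by hand. The sketch below is a heuristic, not an authoritative rule: the function name `killCandidates` is hypothetical, and the checks for replication, index builds, and internal threads are assumptions about how those operations typically appear in currentOp output. It only selects operations; it does not kill anything.

```javascript
// inprog: the array from db.currentOp({ active: true }).inprog
// Returns long-running operations that look user-initiated (heuristic).
function killCandidates(inprog, minSecs = 60) {
  return inprog.filter((op) => {
    if (!op.active || (op.secs_running || 0) < minSecs) return false;
    const ns = op.ns || "";
    const desc = op.desc || "";
    const cmd = op.command || {};
    // Assumption: oplog readers and threads described as "repl..." are replication.
    const isReplication = ns.startsWith("local.oplog.") || /^repl/i.test(desc);
    // Assumption: index builds show a createIndexes command or an "Index Build" message.
    const isIndexBuild = "createIndexes" in cmd || (op.msg || "").startsWith("Index Build");
    // Assumption: operations without a client address are internal threads.
    const isInternal = !op.client;
    return !isReplication && !isIndexBuild && !isInternal;
  });
}

// In mongosh, review the shortlist before acting on it:
// killCandidates(db.currentOp({ active: true }).inprog).forEach((op) =>
//   print(op.opid + " | " + op.secs_running + "s | " + op.ns));
```

Keeping selection separate from the kill step leaves a human checkpoint between the two, which matters when the server is already under stress.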

Storage throughput bottleneck

If OS disk metrics show high %util or high await, the disk subsystem cannot keep up. On cloud block storage, check whether burst credits are depleted. If a disk device is failing, step down the primary to shift writes to a healthier member. Warning: Stepdown triggers an election and a brief write outage. Plan for application retry handling.

Cache and eviction tuning

If cache fill is consistently above 80% during normal load, increase wiredTigerCacheSizeGB. Applying the change through the configuration file requires a rolling restart. The tradeoff is less RAM for the OS page cache. If background eviction is persistently behind and CPU is available, increase WiredTiger eviction worker thread counts via the configString option under storage.wiredTiger.engineConfig.
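
Eviction worker counts are expressed as a WiredTiger configuration string. The sketch below builds such a string; the helper name `evictionConfigString` is hypothetical, and the 1 to 20 range is an assumption based on WiredTiger's documented limits for threads_min and threads_max. Verify both against the documentation for your version before applying anything.

```javascript
// Builds a WiredTiger eviction thread setting such as
// "eviction=(threads_min=4,threads_max=8)".
function evictionConfigString(threadsMin, threadsMax) {
  const valid =
    Number.isInteger(threadsMin) &&
    Number.isInteger(threadsMax) &&
    threadsMin >= 1 &&
    threadsMax <= 20 && // assumed upper bound
    threadsMin <= threadsMax;
  if (!valid) {
    throw new RangeError("expected integers with 1 <= threadsMin <= threadsMax <= 20");
  }
  return `eviction=(threads_min=${threadsMin},threads_max=${threadsMax})`;
}

console.log(evictionConfigString(4, 8));

// Possible uses of the resulting string (assumptions, test in staging first):
//   - as the configString value under storage.wiredTiger.engineConfig, or
//   - db.adminCommand({ setParameter: 1,
//       wiredTigerEngineRuntimeConfig: evictionConfigString(4, 8) })
```

More eviction threads only help when spare CPU and storage throughput exist; if the disk is the bottleneck, extra threads add contention without flushing faster.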

Snapshot pinning

Kill abandoned multi-document transactions and noCursorTimeout cursors that hold old snapshots open. Review application code for unbounded transactions and missing cursor closes. Applications may need to re-query after cursor death.
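
To find the operations that pin old snapshots, the currentOp output can be split into long-open transactions and idle cursors opened with noCursorTimeout. This is a sketch under assumptions: the function name `snapshotPinners` is hypothetical, and it relies on the `transaction.timeOpenMicros` and `cursor.noCursorTimeout` fields being present in the output of your MongoDB version.

```javascript
// ops: documents from $currentOp that include idle sessions and idle cursors.
function snapshotPinners(ops, maxTxnSecs = 60) {
  const oldTransactions = ops.filter(
    (op) => op.transaction && Number(op.transaction.timeOpenMicros || 0) / 1e6 > maxTxnSecs
  );
  const noTimeoutCursors = ops.filter(
    (op) => op.type === "idleCursor" && op.cursor && op.cursor.noCursorTimeout === true
  );
  return { oldTransactions, noTimeoutCursors };
}

// In mongosh, idle sessions and cursors can be included like this:
// const ops = db.getSiblingDB("admin").aggregate([
//   { $currentOp: { allUsers: true, idleSessions: true, idleCursors: true } }
// ]).toArray();
// printjson(snapshotPinners(ops));
```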

Oplog window pressure

If high write volume is shrinking the oplog window, resize the oplog with replSetResizeOplog (MongoDB 4.0+). This consumes more disk but prevents secondaries from falling off the oplog and requiring a full initial sync.
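
The target oplog size follows from simple arithmetic: peak oplog generation rate multiplied by the window you want to keep. The sketch below builds the command document for replSetResizeOplog. The helper name `oplogResizeCommand` is hypothetical, the 24-hour default is an illustrative choice, and the 990 MB floor is an assumption based on the documented minimum for this command.

```javascript
// peakOplogMBPerHour: measured oplog growth at the highest sustained write rate.
// Returns the command document; size is expressed in megabytes.
function oplogResizeCommand(peakOplogMBPerHour, windowHours = 24) {
  const MIN_SIZE_MB = 990; // assumed minimum accepted by replSetResizeOplog
  const sizeMB = Math.max(MIN_SIZE_MB, Math.ceil(peakOplogMBPerHour * windowHours));
  return { replSetResizeOplog: 1, size: sizeMB };
}

console.log(oplogResizeCommand(500)); // 500 MB/hour * 24 hours = 12000 MB

// In mongosh, run it against each member you want to resize, for example:
// const cmd = oplogResizeCommand(500);
// db.adminCommand({ replSetResizeOplog: 1, size: Double(cmd.size) });
```

Measure the rate during the busiest period rather than on average, since the window shrinks exactly when write volume peaks.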

Prevention

Monitor dirty ratio, not just cache fill. A cache at 75% fill with 2% dirty is healthy; a cache at 70% fill with 18% dirty is not.
Monitor ticket availability as a primary saturation signal, not just a symptom.
Alert on application-thread evictions and checkpoint duration before latency spikes.
Size the WiredTiger cache so peak working set stays below 70% fill and 5% dirty.
Cap transaction lifetime and audit regularly for noCursorTimeout cursors.
Size the oplog to maintain at least 24 hours of window during your highest sustained write rate.
Correlate OS disk I/O latency with MongoDB journal sync latency and checkpoint duration in the same dashboards.

How Netdata helps

Surfaces WiredTiger cache dirty ratio, application-thread eviction rate, and ticket availability together, exposing the cascade before it becomes critical.
Baselines opLatencies, checkpoint duration, and journal sync latency, and alerts on deviation from normal rather than on static thresholds alone.
Per-second resolution on queue depths and connection counts exposes reconnect storms within the first minute.
OS disk I/O latency shown alongside MongoDB storage engine metrics distinguishes storage saturation from engine misconfiguration.

The Netdata solution

MongoDB monitoring with Netdata

Netdata monitors MongoDB with per-second metrics and automatic dashboards. Watch WiredTiger cache pressure, oplog window, connection counts, checkpoint stalls, and replication health in one place, correlated with the underlying host.