MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes

Latency jumps from milliseconds to seconds for both reads and writes. The slow query log shows no single offender, but connection count climbs as clients retry and time out. This is the cache pressure cascade. It starts in the storage engine and becomes a self-reinforcing spiral through replication, admission control, and connection handling. This guide covers the mechanism, how to confirm it under pressure, and how to stop it.

What this means

WiredTiger uses an in-memory cache separate from the OS page cache. The default maximum is the larger of 50% of (RAM minus 1 GB) or 256 MB. Writes land in cache as dirty pages; checkpoints flush them to disk every 60 seconds by default. Background eviction threads keep cache fill near 80% and dirty pages under control.
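As a worked instance of the default sizing rule, the sketch below reads it as 50% of (RAM minus 1 GB) with a 256 MB floor. The function name is illustrative, and "RAM" is assumed to mean the memory visible to mongod, which may be a container limit rather than the host total.

```javascript
// Sketch of the default WiredTiger cache size rule described above.
// Assumption: ramBytes is the memory visible to mongod.
const GiB = 1024 * 1024 * 1024;
const MiB = 1024 * 1024;

function defaultCacheBytes(ramBytes) {
  // Larger of 50% of (RAM - 1 GB) and 256 MB.
  return Math.max(0.5 * (ramBytes - GiB), 256 * MiB);
}

console.log((defaultCacheBytes(16 * GiB) / GiB).toFixed(2) + " GiB"); // 7.50 GiB
console.log((defaultCacheBytes(1 * GiB) / MiB).toFixed(0) + " MiB");  // 256 MiB
```

On a 16 GiB host the default works out to 7.5 GiB; on very small hosts the 256 MB floor applies instead.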

When dirty data accumulates faster than checkpoints and background eviction can flush it, cache fill climbs toward 95%. WiredTiger then forces application threads to evict pages themselves, so every operation pauses to clean pages before proceeding. Because each operation holds a concurrency ticket longer, ticket availability drops and new operations queue, which shows up in globalLock.currentQueue. Timeouts trigger reconnections, which create more threads, consume more memory, and compete for the same tickets and cache. The spiral will not self-resolve.

flowchart TD
    A[Write volume exceeds flush rate] --> B[Cache dirty ratio rises]
    B --> C[Background eviction can't keep up]
    C --> D[Cache fill exceeds 95%]
    D --> E[Application threads forced to evict]
    E --> F[Operation latency spikes]
    F --> G[Tickets held longer]
    G --> H[Queue depths grow]
    H --> I[Application timeouts]
    I --> J[Reconnect storm]
    J --> K[More threads compete for cache and tickets]
    K --> E
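
The transition from background eviction to application-thread eviction can be sketched as a simple threshold check. The 80% and 95% fill values come from the text above; the 5% and 20% dirty values are assumed WiredTiger defaults and may differ on a tuned deployment. The function name and labels are illustrative.

```javascript
// Thresholds: 80% / 95% fill are taken from the text above; 5% / 20% dirty
// are assumed defaults and may have been changed on a tuned server.
const FILL_TARGET = 80, FILL_TRIGGER = 95;
const DIRTY_TARGET = 5, DIRTY_TRIGGER = 20;

function evictionState(fillPct, dirtyPct) {
  if (fillPct >= FILL_TRIGGER || dirtyPct >= DIRTY_TRIGGER) {
    return "application threads evicting";
  }
  if (fillPct >= FILL_TARGET || dirtyPct >= DIRTY_TARGET) {
    return "background eviction working";
  }
  return "below eviction targets";
}

console.log(evictionState(75, 2));  // below eviction targets
console.log(evictionState(85, 8));  // background eviction working
console.log(evictionState(96, 12)); // application threads evicting
```

Either ratio crossing its trigger is enough to start the cascade, which is why the dirty ratio needs watching even when fill looks moderate.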

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Write volume overwhelming storage throughput | Dirty ratio rises steadily; checkpoint duration climbs; journal sync latency spikes first | iostat -x 1 and WiredTiger checkpoint duration |
| Long-running snapshots pinning old versions | Cache fill high but dirty ratio moderate; currentOp shows old transactions or many noCursorTimeout cursors | db.currentOp() for transaction age and metrics.cursor |
| WiredTiger cache undersized for working set | Cache fill above 80% during normal load; high page fault rate; gradual growth over days | db.serverStatus().wiredTiger.cache fill ratio and OS page faults |
| Storage device degradation or burst credit exhaustion | Journal sync latency spikes before cache pressure; checkpoint duration jumps suddenly | Cloud storage burst balance, or iostat %util and await |

Quick checks

Run these read-only checks to confirm the cascade is active.

// Check cache fill and dirty ratio
var c = db.serverStatus().wiredTiger.cache;
var max = c["maximum bytes configured"];
print("Fill: " + (100 * c["bytes currently in the cache"] / max).toFixed(1) + "%");
print("Dirty: " + (100 * c["tracked dirty bytes in the cache"] / max).toFixed(1) + "%");

// Check application-thread evictions (cumulative counter since startup)
var c = db.serverStatus().wiredTiger.cache;
print("App-thread evictions: " + c["pages evicted by application threads"]);

// Check available tickets
// MongoDB <= 7.x
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read available: " + t.read.available + " / " + t.read.totalTickets);
print("Write available: " + t.write.available + " / " + t.write.totalTickets);
// MongoDB 8.0+: inspect db.serverStatus().queues.execution

// Check queue depths
var q = db.serverStatus().globalLock.currentQueue;
print("Queued readers: " + q.readers + ", writers: " + q.writers);

// Check operation latency (cumulative average since startup)
var l = db.serverStatus().opLatencies;
if (l.reads.ops > 0) print("Read avg (µs): " + (l.reads.latency / l.reads.ops).toFixed(0));
if (l.writes.ops > 0) print("Write avg (µs): " + (l.writes.latency / l.writes.ops).toFixed(0));

// Check long-running operations
db.currentOp({ active: true, secs_running: { $gt: 10 } }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});

# Check OS disk health (run from the system shell, not mongosh)
iostat -x 1 5
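
The eviction and latency counters above are cumulative since startup, so a single reading says little about what is happening right now. A minimal sketch of turning two samples into interval values follows. The function name is illustrative and the sample numbers are invented; only the field names come from the checks above.

```javascript
// Converts two serverStatus-shaped samples into per-interval values.
// Only subtraction and division are used on the counters.
function intervalStats(prev, curr, seconds) {
  const key = "pages evicted by application threads";
  const dOps = curr.opLatencies.writes.ops - prev.opLatencies.writes.ops;
  const dLat = curr.opLatencies.writes.latency - prev.opLatencies.writes.latency;
  return {
    appEvictionsPerSec: (curr.wiredTiger.cache[key] - prev.wiredTiger.cache[key]) / seconds,
    writeAvgMicros: dOps > 0 ? dLat / dOps : 0,
  };
}

// Invented sample data, 10 seconds apart.
const before = {
  wiredTiger: { cache: { "pages evicted by application threads": 1000 } },
  opLatencies: { writes: { ops: 5000, latency: 1000000 } },
};
const after = {
  wiredTiger: { cache: { "pages evicted by application threads": 1600 } },
  opLatencies: { writes: { ops: 5200, latency: 1900000 } },
};

console.log(intervalStats(before, after, 10));
```

With this sample the interval shows 60 application-thread evictions per second and an average write latency of 4500 µs. In mongosh, the two inputs could be two db.serverStatus() results taken a few seconds apart.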

How to diagnose it

  1. Confirm both read and write latency are elevated in opLatencies. If only reads are slow, suspect a missing index or plan regression rather than cache pressure.
  2. Check cache fill ratio and dirty ratio. Fill above 80% with rising dirty ratio confirms pressure. Dirty ratio above 15% is concerning; above 20% is critical.
  3. Check pages evicted by application threads. Any sustained nonzero delta means user threads are doing eviction work.
  4. Check available tickets. In MongoDB 8.0+, inspect queues.execution; in earlier versions use wiredTiger.concurrentTransactions. Sustained availability below 25% of total confirms admission control saturation.
  5. Check globalLock.currentQueue. Sustained nonzero readers and writers means operations are queuing behind ticket or lock contention.
  6. Run db.currentOp() to find long-running operations, active multi-document transactions, or queued locks. Identify candidates for killOp.
  7. Check checkpoint duration in db.serverStatus().wiredTiger.transaction. If the most recent checkpoint approaches or exceeds 60 seconds, storage throughput is the bottleneck.
  8. Check journal sync latency in db.serverStatus().wiredTiger.log. Sustained averages above 30 ms point to disk subsystem trouble.
  9. On replica sets, check replication lag and flow control status. If secondaries are lagging, flow control may throttle the primary, which can mask or mimic cache pressure. Use rs.printSecondaryReplicationInfo() for per-secondary lag, rs.printReplicationInfo() for the oplog window, and inspect db.serverStatus().flowControl.
  10. Check OS disk metrics. High %util or await with low throughput indicates storage saturation or device degradation.
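
Steps 2, 4, and 5 can be sketched as one function over a single serverStatus snapshot. This assumes the MongoDB 7.x and earlier ticket layout (wiredTiger.concurrentTransactions), uses the thresholds from the steps above, and the function name and sample values are illustrative. One snapshot cannot show that a condition is sustained, so repeated samples are still needed.

```javascript
// Applies the fill, dirty, ticket, and queue thresholds from the steps above
// to one serverStatus-shaped object and returns human-readable findings.
function diagnose(s) {
  const findings = [];
  const c = s.wiredTiger.cache;
  const max = c["maximum bytes configured"];
  const fill = 100 * c["bytes currently in the cache"] / max;
  const dirty = 100 * c["tracked dirty bytes in the cache"] / max;
  if (fill > 80) findings.push("cache fill " + fill.toFixed(1) + "% is above 80%");
  if (dirty > 20) findings.push("dirty ratio " + dirty.toFixed(1) + "% is critical");
  else if (dirty > 15) findings.push("dirty ratio " + dirty.toFixed(1) + "% is concerning");

  // Assumption: pre-8.0 layout; on 8.0+ read queues.execution instead.
  const t = s.wiredTiger.concurrentTransactions;
  ["read", "write"].forEach(function (kind) {
    if (t[kind].available < 0.25 * t[kind].totalTickets) {
      findings.push(kind + " tickets below 25% available");
    }
  });

  const q = s.globalLock.currentQueue;
  if (q.readers > 0 || q.writers > 0) {
    findings.push("queued readers: " + q.readers + ", writers: " + q.writers);
  }
  return findings;
}

// Invented snapshots for illustration.
const pressured = {
  wiredTiger: {
    cache: {
      "maximum bytes configured": 10e9,
      "bytes currently in the cache": 9.2e9,
      "tracked dirty bytes in the cache": 2.2e9,
    },
    concurrentTransactions: {
      read: { available: 20, totalTickets: 128 },
      write: { available: 5, totalTickets: 128 },
    },
  },
  globalLock: { currentQueue: { readers: 14, writers: 9 } },
};
const healthy = {
  wiredTiger: {
    cache: {
      "maximum bytes configured": 10e9,
      "bytes currently in the cache": 6e9,
      "tracked dirty bytes in the cache": 0.3e9,
    },
    concurrentTransactions: {
      read: { available: 120, totalTickets: 128 },
      write: { available: 120, totalTickets: 128 },
    },
  },
  globalLock: { currentQueue: { readers: 0, writers: 0 } },
};

diagnose(pressured).forEach(function (f) { console.log(f); });
```

In mongosh the input would be db.serverStatus(); an empty result from one call does not rule out pressure between samples.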

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| WiredTiger cache dirty ratio | Reveals checkpoint stall risk before latency degrades | Sustained above 10%; critical above 20% |
| Pages evicted by application threads | Confirms user threads are doing eviction instead of serving queries | Any sustained nonzero rate |
| WiredTiger checkpoint duration | Long checkpoints block journal recycling and can freeze writes | Above 30 seconds sustained; above 60 seconds critical |
| Available read/write tickets | Direct measure of storage engine admission control | Below 25% of total sustained |
| globalLock.currentQueue | Shows operations waiting behind contention | Sustained above 20 and growing |
| opLatencies reads/writes | User-visible latency impact | Average doubles from baseline for more than 5 minutes |
| Journal sync latency | Leading indicator of storage health | Above 30 ms sustained |
| Replication lag / oplog window | Secondary falloff risk if the primary is overloaded | Lag above 10 seconds or approaching 50% of oplog window |
| Connection count and totalCreated | Reconnection storms amplify resource pressure | current grows while latency spikes; high totalCreated delta |
| Current longest-running operation age | A single bad query can trigger ticket exhaustion | Any non-background operation above 300 seconds |
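
The replication lag row can be sketched as a check over an rs.status()-shaped object. The function name is illustrative and the sample data is invented; the cutoffs (10 seconds, 50% of the oplog window) come from the table, with "approaching 50%" simplified to "at or above 50%".

```javascript
// Returns secondaries whose lag exceeds 10 s or has consumed half the oplog window.
// status: shaped like rs.status(); oplogWindowSeconds: e.g. db.getReplicationInfo().timeDiff
function laggingSecondaries(status, oplogWindowSeconds) {
  const primary = status.members.find(function (m) { return m.stateStr === "PRIMARY"; });
  if (!primary) return []; // no primary visible, so lag cannot be computed here
  return status.members
    .filter(function (m) { return m.stateStr === "SECONDARY"; })
    .map(function (m) {
      const lag = (primary.optimeDate - m.optimeDate) / 1000; // Date difference is in ms
      return { name: m.name, lagSeconds: lag, windowUsedPct: 100 * lag / oplogWindowSeconds };
    })
    .filter(function (r) { return r.lagSeconds > 10 || r.windowUsedPct >= 50; });
}

// Invented sample: db2 is 300 seconds behind, db1 is 2 seconds behind.
const sampleStatus = {
  members: [
    { name: "db0:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T00:10:00Z") },
    { name: "db1:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:09:58Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:05:00Z") },
  ],
};

console.log(laggingSecondaries(sampleStatus, 7200));
```

Lowering the 50% cutoff gives earlier warning when the oplog window is already short.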

Fixes

Immediate relief

Warning: db.killOp() is disruptive. Do not kill replication operations or foreground index builds. Kill only unnecessary, user-initiated operations. Pause batch writes, bulk imports, or ETL jobs to reduce dirty page generation. If a lagging secondary is causing flow control to throttle the primary, redirect read traffic away from that secondary.
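
One way to shortlist killOp candidates is to filter currentOp output before touching anything. The sketch below is a heuristic, not a safety guarantee: it assumes client operations have a desc beginning with "conn", treats anything reading the oplog or running createIndexes as off-limits, and uses an illustrative function name and invented sample operations. Review the list by hand before calling db.killOp().

```javascript
// Heuristic filter over db.currentOp().inprog-shaped entries.
// Assumptions: client ops have desc starting with "conn"; oplog readers and
// createIndexes commands must never be selected.
function killCandidates(inprog, minSeconds) {
  return inprog
    .filter(function (op) {
      const fromClient = typeof op.desc === "string" && op.desc.indexOf("conn") === 0;
      const touchesOplog = typeof op.ns === "string" && op.ns.indexOf("local.oplog.") === 0;
      const isIndexBuild = !!(op.command && op.command.createIndexes);
      return fromClient && !touchesOplog && !isIndexBuild && op.secs_running > minSeconds;
    })
    .map(function (op) { return op.opid; });
}

// Invented sample operations.
const sampleOps = [
  { opid: 101, desc: "conn412", ns: "shop.orders", secs_running: 340, command: { aggregate: "orders" } },
  { opid: 102, desc: "conn7", ns: "local.oplog.rs", secs_running: 9000, command: { getMore: 1 } },
  { opid: 103, desc: "conn88", ns: "shop.items", secs_running: 600, command: { createIndexes: "items" } },
  { opid: 104, desc: "ReplBatcher", ns: "", secs_running: 9000 },
  { opid: 105, desc: "conn500", ns: "shop.orders", secs_running: 3, command: { find: "orders" } },
];

console.log(killCandidates(sampleOps, 60)); // [ 101 ]
```

In mongosh the input would be db.currentOp({ active: true }).inprog, and each reviewed opid would then be passed to db.killOp().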

Storage throughput bottleneck

If OS disk metrics show high %util or high await, the disk subsystem cannot keep up. On cloud block storage, check whether burst credits are depleted. If a disk device is failing, step down the primary to shift writes to a healthier member. Warning: Stepdown triggers an election and a brief write outage. Plan for application retry handling.

Cache and eviction tuning

If cache fill is consistently above 80% during normal load, increase the cache size (--wiredTigerCacheSizeGB, or storage.wiredTiger.engineConfig.cacheSizeGB in the configuration file). This requires a rolling restart. The tradeoff is less RAM for the OS page cache. If background eviction is persistently behind and CPU is available, increase WiredTiger eviction worker thread counts via storage.wiredTiger.engineConfig.

Snapshot pinning

Kill abandoned multi-document transactions and noCursorTimeout cursors that hold old snapshots open. Review application code for unbounded transactions and missing cursor closes. Applications may need to re-query after cursor death.
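
Finding what is pinning snapshots can be sketched as a filter over currentOp-style entries. The sketch assumes each entry may carry transaction.timeOpenMicros (for multi-document transactions) or cursor.noCursorTimeout (for idle cursors); the function name, the 60-second cutoff, and the sample entries are illustrative.

```javascript
// Selects entries that are likely to hold an old snapshot open:
// transactions open longer than maxOpenSeconds, or cursors that never time out.
function snapshotPinners(ops, maxOpenSeconds) {
  return ops.filter(function (op) {
    const oldTxn = !!op.transaction && op.transaction.timeOpenMicros > maxOpenSeconds * 1e6;
    const immortalCursor = !!op.cursor && op.cursor.noCursorTimeout === true;
    return oldTxn || immortalCursor;
  });
}

// Invented sample entries.
const sampleEntries = [
  { opid: 1, transaction: { timeOpenMicros: 900e6 } },                  // open for 900 s
  { opid: 2, transaction: { timeOpenMicros: 2e6 } },                    // open for 2 s
  { type: "idleCursor", cursor: { cursorId: 42, noCursorTimeout: true } },
  { opid: 3 },                                                          // ordinary operation
];

console.log(snapshotPinners(sampleEntries, 60).length); // 2
```

Idle cursors and idle sessions are not part of plain db.currentOp() output; one way to include them is the $currentOp aggregation stage on the admin database with idleCursors and idleSessions enabled.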

Oplog window pressure

If high write volume is shrinking the oplog window, resize the oplog with replSetResizeOplog (MongoDB 4.0+). This consumes more disk but prevents secondaries from falling off the oplog and requiring a full initial sync.
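
A rough way to pick the new oplog size is to scale the current size by the ratio of the target window to the observed window. This sketch assumes the oplog is already full and that the write rate stays roughly constant; the function name and numbers are illustrative.

```javascript
// Estimates the oplog size (MB) needed to reach a target window, assuming the
// window scales linearly with size at the current write rate.
function oplogSizeForWindowMB(currentSizeMB, observedWindowHours, targetWindowHours) {
  return Math.ceil(currentSizeMB * targetWindowHours / observedWindowHours);
}

// A 51200 MB oplog that currently covers 6 hours, sized for 24 hours.
console.log(oplogSizeForWindowMB(51200, 6, 24)); // 204800
```

The inputs could come from db.getReplicationInfo() (logSizeMB and timeDiffHours), measured during the highest sustained write rate. replSetResizeOplog takes its size in megabytes and is run against each member separately.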

Prevention

  • Monitor dirty ratio, not just cache fill. A cache at 75% fill with 2% dirty is healthy; a cache at 70% fill with 18% dirty is not.
  • Monitor ticket availability as a primary saturation signal, not just a symptom.
  • Alert on application-thread evictions and checkpoint duration before latency spikes.
  • Size the WiredTiger cache so peak working set stays below 70% fill and 5% dirty.
  • Cap transaction lifetime and audit regularly for noCursorTimeout cursors.
  • Size the oplog to maintain at least 24 hours of window during your highest sustained write rate.
  • Correlate OS disk I/O latency with MongoDB journal sync latency and checkpoint duration in the same dashboards.

How Netdata helps

  • Surfaces WiredTiger cache dirty ratio, application-thread eviction rate, and ticket availability together, exposing the cascade before it becomes critical.
  • Baselines opLatencies, checkpoint duration, and journal sync latency, and alerts on deviation from normal rather than on static thresholds alone.
  • Per-second resolution on queue depths and connection counts exposes reconnect storms within the first minute.
  • OS disk I/O latency shown alongside MongoDB storage engine metrics distinguishes storage saturation from engine misconfiguration.