MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches

Cache fill at 70% looks safe, but if dirty ratio is climbing past 15%, a latency spike is already forming. Dirty ratio measures modified pages not yet flushed to disk. While fill ratio tells you how much cache is in use, dirty ratio tells you how fast the storage engine is falling behind. It often precedes checkpoint stalls and eviction-driven latency spikes by minutes.

What this means

WiredTiger tracks dirty bytes against the configured maximum cache size. Dirty ratio equals tracked dirty bytes in the cache divided by maximum bytes configured. Checkpoints run every 60 seconds by default to flush these pages. When write volume exceeds flush capacity, dirty data accumulates.

At 20% dirty ratio (the default eviction_dirty_trigger), application threads are forced to evict dirty pages directly. User operations then pause to reconcile and flush pages before continuing. Dirty page eviction is significantly slower than clean page eviction. Once application threads start evicting, latency does not degrade gracefully; it typically jumps by one to two orders of magnitude as threads block on eviction work, tickets are held longer, and new operations queue.

A cache at 90% fill with 2% dirty is healthy. A cache at 70% fill with 18% dirty is approaching a stall.
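The contrast between fill and dirty ratio can be made concrete with a small sketch. It assumes an object shaped like db.serverStatus().wiredTiger.cache. The helper names (cacheRatios, classifyDirty) and the sample byte counts are illustrative only, and the 15% and 20% cut-offs simply mirror the figures used in this article rather than any built-in MongoDB classification.

```javascript
// Assumption: `cache` has the shape of db.serverStatus().wiredTiger.cache.
function cacheRatios(cache) {
  const max = cache["maximum bytes configured"];
  return {
    fillPct: 100 * cache["bytes currently in the cache"] / max,
    dirtyPct: 100 * cache["tracked dirty bytes in the cache"] / max,
  };
}

// Cut-offs taken from this article; tune them per deployment.
function classifyDirty(dirtyPct) {
  if (dirtyPct >= 20) return "critical"; // at or past the default eviction_dirty_trigger
  if (dirtyPct >= 15) return "warning";  // approaching the trigger
  return "ok";
}

// Hypothetical sample (not a real measurement): 10 GiB cache, 70% full, 18% dirty.
const GiB = 1024 ** 3;
const sample = {
  "maximum bytes configured": 10 * GiB,
  "bytes currently in the cache": 7 * GiB,
  "tracked dirty bytes in the cache": 1.8 * GiB,
};
const r = cacheRatios(sample);
console.log(r.fillPct.toFixed(1) + "% fill, " + r.dirtyPct.toFixed(1) + "% dirty: " + classifyDirty(r.dirtyPct));
```

In mongosh, passing db.serverStatus().wiredTiger.cache in place of the sample object should give live values.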

flowchart TD
    A[Dirty ratio climbs past 5%] --> B[Background eviction works harder]
    B --> C{Dirty ratio >15%?}
    C -->|No| D[Stable operation]
    C -->|Yes| E[Checkpoint duration increases]
    E --> F[Dirty ratio approaches 20%]
    F --> G[Application threads forced to evict]
    G --> H[Operation latency spikes]
    H --> I[Ticket exhaustion]
    I --> J[Connection pileup and cascading failure]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Write burst overwhelming checkpoint flush | Dirty ratio rises suddenly during bulk imports or migrations; journal sync latency climbs first | opcounters and db.serverStatus().wiredTiger.log for write volume |
| Storage degradation | Checkpoint duration increases while dirty ratio climbs; OS disk latency elevated | iostat -x 1 5 for %util and await |
| Long-running snapshots blocking eviction | Dirty ratio high but write volume normal; transactions or noCursorTimeout cursors open | db.currentOp() and db.serverStatus().transactions for active, long-held snapshots |
| Undersized WiredTiger cache | Dirty ratio trends upward over days; cache fill also climbing steadily | Cache fill and dirty ratio trended over 7 days |
| Container cgroup limit ignored | Cache sized to host RAM instead of container limit; dirty ratio rises even under light load | Container memory limit vs maximum bytes configured |

Quick checks

Run these from a host with mongosh access:

# WiredTiger cache dirty and fill ratios
mongosh --quiet --eval 'var c=db.serverStatus().wiredTiger.cache; var max=c["maximum bytes configured"]; print("Fill: " + (100*c["bytes currently in the cache"]/max).toFixed(1) + "%"); print("Dirty: " + (100*c["tracked dirty bytes in the cache"]/max).toFixed(1) + "%");'
# Application-thread evictions
mongosh --quiet --eval 'print("App-thread evictions: " + db.serverStatus().wiredTiger.cache["pages evicted by application threads"]);'
# Most recent checkpoint duration
mongosh --quiet --eval 'print("Checkpoint ms: " + db.serverStatus().wiredTiger.transaction["transaction checkpoint most recent time (msecs)"]);'
# Average journal sync latency
mongosh --quiet --eval 'var l=db.serverStatus().wiredTiger.log; print("Avg journal sync: " + (l["log sync time duration (usecs)"]/l["log sync operations"]).toFixed(0) + " µs");'
# Long-running operations
mongosh --quiet --eval 'db.currentOp({"active":true,"secs_running":{"$gt":10}}).inprog.forEach(function(o){print(o.opid+" | "+o.op+" | "+o.secs_running+"s | "+o.ns);});'
# Active transactions
mongosh --quiet --eval 'var t=db.serverStatus().transactions; print("Active: "+t.currentActive+", Open: "+t.currentOpen);'
# noTimeout cursors holding snapshots
mongosh --quiet --eval 'print("noTimeout cursors: " + db.serverStatus().metrics.cursor.open.noTimeout);'
# Storage latency and utilization (run on the database host itself)
iostat -x 1 5
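The journal sync check above divides two counters that accumulate from process start, so a recent slowdown can be hidden by a long healthy history. A minimal sketch of an interval-based average follows, assuming two copies of db.serverStatus().wiredTiger.log captured some seconds apart. The function name and the counter values are hypothetical.

```javascript
// Assumption: `before` and `after` are two snapshots of db.serverStatus().wiredTiger.log.
function intervalSyncLatencyUs(before, after) {
  const ops = after["log sync operations"] - before["log sync operations"];
  const usecs = after["log sync time duration (usecs)"] - before["log sync time duration (usecs)"];
  if (ops <= 0) return null; // no syncs in the window, or counters reset by a restart
  return usecs / ops;
}

// Hypothetical counters for illustration only.
const t0 = { "log sync operations": 1000, "log sync time duration (usecs)": 2000000 };
const t1 = { "log sync operations": 1100, "log sync time duration (usecs)": 5500000 };

const lifetimeUs = t1["log sync time duration (usecs)"] / t1["log sync operations"];
const windowUs = intervalSyncLatencyUs(t0, t1);
console.log("lifetime avg: " + lifetimeUs + " us, last window: " + windowUs + " us");
```

With these made-up numbers the lifetime average is 5 ms while the most recent window averages 35 ms, which is the kind of gap the cumulative figure hides.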

How to diagnose it

  1. Confirm dirty ratio and trend. Use the cache check. A single snapshot above 15% is concerning; a sustained climb toward 20% is critical. Compare with fill ratio. If fill is moderate but dirty is high, the problem is flush capacity, not cache size.
  2. Check if application threads are evicting. If pages evicted by application threads is incrementing, users are already feeling latency. This is the transition from saturation to active degradation.
  3. Measure checkpoint duration. If the most recent checkpoint duration approaches or exceeds 30 seconds, the flush pipeline is stressed. If it exceeds 60 seconds, checkpoints cannot keep up with the interval.
  4. Identify snapshot holders. Run db.currentOp() for operations running longer than 10 seconds. Correlate with db.serverStatus().transactions and metrics.cursor. Long-running transactions and noCursorTimeout cursors pin old snapshots, preventing WiredTiger from reclaiming pages.
  5. Validate storage health. Run iostat -x 1 5 and look for %util above 70% or await above 10 ms sustained. On cloud block storage, depleted burst credits often show up as a smooth increase in await before MongoDB's own signals spike.
  6. Correlate tickets and latency. Check wiredTiger.concurrentTransactions (newer MongoDB releases may report ticket counters under a different serverStatus section). If available write tickets drop below 25% of total and write latency in opLatencies rises, the cascade is active.
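Steps 2 and 3 both depend on change over time rather than a single reading. The sketch below shows one way to turn two serverStatus samples into an application-thread eviction rate and a checkpoint state. The sample wrapper ({ takenAtMs, status }), the function names, and the numbers are assumptions made for illustration; the 30 s and 60 s cut-offs come from step 3.

```javascript
// Assumption: each sample is { takenAtMs, status }, where `status` is shaped like db.serverStatus().
function appEvictionRate(prev, curr) {
  const key = "pages evicted by application threads";
  const pages = curr.status.wiredTiger.cache[key] - prev.status.wiredTiger.cache[key];
  const seconds = (curr.takenAtMs - prev.takenAtMs) / 1000;
  // A negative delta means the counter was reset (for example by a restart).
  return seconds > 0 && pages >= 0 ? pages / seconds : 0;
}

function checkpointState(status) {
  const ms = status.wiredTiger.transaction["transaction checkpoint most recent time (msecs)"];
  if (ms >= 60000) return "cannot keep up"; // longer than the default 60 s interval
  if (ms >= 30000) return "stressed";
  return "ok";
}

// Hypothetical samples taken 10 seconds apart.
function makeSample(takenAtMs, evicted, checkpointMs) {
  return {
    takenAtMs: takenAtMs,
    status: { wiredTiger: {
      cache: { "pages evicted by application threads": evicted },
      transaction: { "transaction checkpoint most recent time (msecs)": checkpointMs },
    } },
  };
}
const first = makeSample(0, 500, 28000);
const second = makeSample(10000, 800, 42000);

console.log(appEvictionRate(first, second) + " pages/s, checkpoint " + checkpointState(second.status));
```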

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Dirty ratio | Leading indicator of checkpoint stall and app-thread eviction | >15% sustained, or trending upward across days |
| Application-thread evictions | User-visible latency starts here; operations pause to flush | Any sustained nonzero rate after uptime >600s |
| Checkpoint duration | Measures flush pipeline health and I/O capacity | >30 seconds sustained, or climbing checkpoint-over-checkpoint |
| Journal sync latency | Direct storage health signal; often leads application latency by 30-60 seconds | >30 ms average sustained |
| Available write tickets | Concurrency saturation; held longer when eviction blocks threads | <25% of total sustained during peak |
| Write operation latency (opLatencies) | Confirms user impact from cache pressure | Sustained 2x baseline or p99 exceeding timeout threshold |

Fixes

Reduce write pressure immediately

If dirty ratio climbs during a batch job, migration, or bulk import, pause or throttle the workload. This is the fastest way to give checkpoints time to drain dirty pages. Do not restart mongod to clear cache; a restart empties the cache but loses diagnostic state and may trigger a cold-start performance hit.

Kill or resolve long-running snapshots

Use db.currentOp() to find operations running longer than expected. Kill unnecessary long-running queries or aggregations with db.killOp(opid).

Warning: killOp aborts the target operation. Verify the operation is safe to terminate; aborting a multi-document transaction or DDL operation can leave application state inconsistent.

Check db.serverStatus().transactions for active multi-document transactions. A single forgotten transaction can hold a snapshot open and block eviction. Abandoned noCursorTimeout cursors have the same effect; identify the application source and close them properly.
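Before killing anything, it can help to list candidates oldest first. The sketch below only filters and sorts; it assumes an array shaped like db.currentOp().inprog, and the function name, threshold, and sample operations are hypothetical. Deciding which opid is safe to pass to db.killOp() remains a manual step.

```javascript
// Assumption: `inprog` is shaped like the array in db.currentOp().inprog.
// `minSeconds` is an illustrative threshold; choose one that fits the workload.
function longRunningOps(inprog, minSeconds) {
  return inprog
    .filter((op) => op.active && (op.secs_running || 0) >= minSeconds)
    .sort((a, b) => b.secs_running - a.secs_running) // oldest first
    .map((op) => ({ opid: op.opid, op: op.op, ns: op.ns, secs: op.secs_running }));
}

// Hypothetical operations for illustration only.
const inprogSample = [
  { opid: 101, op: "query", ns: "shop.orders", active: true, secs_running: 45 },
  { opid: 102, op: "update", ns: "shop.carts", active: true, secs_running: 2 },
  { opid: 103, op: "command", ns: "shop.reports", active: true, secs_running: 310 },
  { opid: 104, op: "query", ns: "shop.users", active: false },
];
const candidates = longRunningOps(inprogSample, 10);
candidates.forEach((c) => console.log(c.opid + " | " + c.op + " | " + c.secs + "s | " + c.ns));
```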

Address storage degradation

Run iostat -x 1 5 on the host. If await is elevated and %util is high, the disk subsystem is the bottleneck. On cloud volumes, check for burst credit depletion.

If the primary is on degraded storage, step it down to shift writes to a healthier secondary.

Warning: Stepping down a primary triggers a replica set election and briefly interrupts writes. Coordinate the change and confirm replica set health first. Do not try to interrupt a checkpoint that is already running; let it complete.

Resize the cache if undersized

If dirty ratio trends upward over weeks and fill ratio is also climbing, the working set may have outgrown the cache. WiredTiger cache defaults to max(256 MB, 0.5 * (RAM - 1 GB)). In containers, it may incorrectly size to host RAM. Plan a rolling restart with --wiredTigerCacheSizeGB or storage.wiredTiger.engineConfig.cacheSizeGB set to an appropriate value for the workload and container limit. This requires a restart, so treat it as a scheduled change, not an incident fix.
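The default sizing rule is easy to check by hand. The sketch below applies the formula quoted above to a hypothetical 64 GiB host and an 8 GiB container limit; the function name and both memory sizes are assumptions chosen for illustration.

```javascript
// Default from the formula above: max(256 MB, 0.5 * (RAM - 1 GB)), here in binary units.
function defaultCacheBytes(ramBytes) {
  const MiB = 1024 ** 2;
  const GiB = 1024 ** 3;
  return Math.max(256 * MiB, 0.5 * (ramBytes - 1 * GiB));
}

const GiB = 1024 ** 3;
const hostRam = 64 * GiB;        // hypothetical host memory
const containerLimit = 8 * GiB;  // hypothetical container memory limit

const sizedFromHost = defaultCacheBytes(hostRam) / GiB;            // 31.5
const sizedFromLimit = defaultCacheBytes(containerLimit) / GiB;    // 3.5
console.log("default from host RAM: " + sizedFromHost + " GiB");
console.log("default from container limit: " + sizedFromLimit + " GiB");
console.log("host-derived cache exceeds container limit: " + (defaultCacheBytes(hostRam) > containerLimit));
```

In this made-up case a cache sized from host RAM would be 31.5 GiB inside an 8 GiB container, which is why an explicit cacheSizeGB matters there.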

Prevention

  • Graph dirty ratio, not just fill. Fill ratio at 75% is normal. Dirty ratio at 15% is not. Alert on dirty ratio >10% to get ahead of the trigger.
  • Alert on application-thread evictions. Any sustained rate means background eviction has lost the race. This is a better paging threshold than cache fill percentage.
  • Monitor checkpoint duration as a trend. A checkpoint that takes 55 seconds every 60 seconds looks stable but has zero margin. Trending duration over days gives earlier warning than single spikes.
  • Explicitly size cache in containerized deployments. Do not let WiredTiger default to host RAM. Set the cache size to fit within the container memory limit minus headroom for connections and overhead.
  • Review long-running operations proactively. Track the maximum operation age from currentOp continuously. Killing a runaway query before it holds a snapshot for minutes prevents cache pressure entirely.
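The checkpoint-margin point above can be expressed as two small calculations: the headroom left in the checkpoint interval, and the slope of checkpoint duration across successive samples. This is a sketch under stated assumptions. The function names are hypothetical, the 60 s interval is the default mentioned earlier, and the daily figures are invented for illustration.

```javascript
// Headroom left in the checkpoint interval, as a percentage.
function checkpointMarginPct(durationMs, intervalMs) {
  return 100 * (1 - durationMs / intervalMs);
}

// Least-squares slope of evenly spaced samples (units of y per sample).
function slopePerSample(values) {
  const n = values.length;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  values.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return den === 0 ? 0 : num / den;
}

// Hypothetical daily peak checkpoint durations, in seconds.
const dailyPeakSeconds = [20, 25, 30, 35];
const margin = checkpointMarginPct(55000, 60000);
const trend = slopePerSample(dailyPeakSeconds);

console.log("margin at 55 s: " + margin.toFixed(1) + "%");
console.log("trend: +" + trend + " s per day");
```

A positive slope with a shrinking margin is the pattern worth alerting on, even while every individual checkpoint still completes inside the interval.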

How Netdata helps

  • Correlates dirty ratio, application-thread eviction rate, checkpoint duration, and opLatencies on one dashboard.
  • Surfaces available WiredTiger read and write tickets alongside cache metrics to show when ticket exhaustion amplifies cache pressure.
  • Provides historical context on cache warm-up after restart to distinguish normal cold-start eviction from genuine pressure.
  • Alerts on dirty ratio thresholds and ticket utilization.