MongoDB noTimeout cursors causing cache pressure: pinned snapshots and silent eviction stalls

When the WiredTiger cache statistic "bytes currently in the cache" climbs, the dirty ratio trends toward 20%, and read latencies spike without a single slow query in the log, check metrics.cursor.open.noTimeout in db.serverStatus().

Each noTimeout cursor pins a WiredTiger snapshot indefinitely. Old document versions cannot be evicted while that snapshot is open, so the cache fills with unreachable history until background eviction falls behind and application threads are forced to clean up. The result is a silent cache pressure cascade that looks like a capacity problem but is actually a cursor lifecycle problem.

ETL pipelines, backup tools, and change streams often open cursors with noCursorTimeout() to traverse large collections without hitting the default idle timeout. When those cursors are abandoned, leaked, or left open longer than necessary, they continue holding snapshots long after the application has moved on. Unlike a slow query, there is no obvious offender in the logs. The pressure builds in the background until WiredTiger stalls application threads to evict pages, at which point everything slows down together.

What this means

WiredTiger uses multiversion concurrency control (MVCC). Every write creates a new version of a document in cache, and old versions are retained until no transaction or cursor needs them. A cursor opened with noCursorTimeout() bypasses the normal idle timeout and holds its snapshot open until it is exhausted, explicitly closed, or killed. While that snapshot is active, WiredTiger cannot evict the old page versions visible to it, so if the cursor traverses a large range or sits idle for hours, the pinned history accumulates.

The cache fill ratio rises, the dirty ratio climbs, and eventually the eviction threads cannot keep pace. When the cache hits the aggressive eviction threshold, application threads pause to evict pages themselves. That adds latency to every operation, depletes read and write tickets, and causes queue depths to grow. The cascade looks like storage saturation, but adding disk I/O or RAM will not fix it because the root cause is snapshot retention, not capacity.

flowchart TD
    A[Application opens noTimeout cursor] --> B[WiredTiger snapshot pinned]
    B --> C[Old document versions retained in cache]
    C --> D[Cache fill and dirty ratio climb]
    D --> E[Background eviction cannot free pinned pages]
    E --> F[Application threads forced to evict]
    F --> G[Latency spikes and queue depth grows]
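
The retention rule described above can be illustrated with a toy model. This is only a sketch of the idea, not WiredTiger's implementation: the function name reclaimableVersions and the use of integer "times" for versions and snapshots are assumptions made for illustration.

```javascript
// Toy model only: this is NOT how WiredTiger is implemented.
// Each version records the logical time at which it was written.
// A version is reclaimable once a newer version exists that every open
// snapshot can already see.
function reclaimableVersions(versionTimes, openSnapshotTimes) {
  const sorted = [...versionTimes].sort((a, b) => a - b);
  // With no open snapshots, only the newest version must be kept.
  const oldestSnapshot = openSnapshotTimes.length
    ? Math.min(...openSnapshotTimes)
    : Infinity;
  // Keep the newest version visible to the oldest snapshot, plus everything after it.
  let keepFrom = 0;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i] <= oldestSnapshot) keepFrom = i;
  }
  return sorted.slice(0, keepFrom);
}

// One document updated at logical times 1 through 5.
const versions = [1, 2, 3, 4, 5];
console.log(reclaimableVersions(versions, []).length);  // 4: no readers, keep only the newest
console.log(reclaimableVersions(versions, [2]).length); // 1: a cursor pinned at time 2 blocks the rest
```

In this model a single old snapshot keeps four of the five versions alive, which is the accumulation the flowchart describes.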

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| ETL or backup tool using noCursorTimeout() | open.noTimeout is steady and greater than 0; getmore opcounter is elevated; long-running getmore from one client host | db.currentOp() filtered to getmore, grouped by client |
| Change stream consumer left open | open.noTimeout and open.pinned both elevated; an aggregation cursor on the change stream namespace appears in currentOp for hours | db.currentOp() for aggregations with $changeStream |
| Application cursor leak after connection drop | open.noTimeout climbs but currentOp shows no active client for some cursors; connection churn correlates with cursor growth | db.serverStatus().metrics.cursor delta against connections.totalCreated |

Quick checks

Run these read-only checks to confirm whether noTimeout cursors are driving cache pressure.

Cursor counts and cache utilization:

mongosh --quiet --eval '
  // Take a single serverStatus sample so every value comes from the same instant.
  var s = db.serverStatus();
  var c = s.metrics.cursor;
  var wt = s.wiredTiger.cache;
  var max = wt["maximum bytes configured"];
  print("noTimeout cursors: " + c.open.noTimeout);
  print("Pinned cursors: " + c.open.pinned);
  print("Cache fill: " + (100 * wt["bytes currently in the cache"] / max).toFixed(1) + "%");
  print("Cache dirty: " + (100 * wt["tracked dirty bytes in the cache"] / max).toFixed(1) + "%");
  print("App-thread evictions: " + wt["pages evicted by application threads"]);
'

Long-running cursor operations:

mongosh --quiet --eval '
  db.currentOp({ "active": true, "secs_running": { "$gt": 60 } }).inprog.forEach(function(op) {
    if (op.op === "getmore") {
      print(op.opid + " | " + op.secs_running + "s | " + op.ns + " | " + op.client);
    }
  });
'

Eviction stall counters:

mongosh --quiet --eval '
  var wt = db.serverStatus().wiredTiger.cache;
  print("Eviction stalls: " + wt["pages selected for eviction unable to be evicted"]);
'

Queue depths and ticket availability:

mongosh --quiet --eval '
  // One sample, so queue depth and ticket counts describe the same moment.
  var s = db.serverStatus();
  printjson({
    queue: s.globalLock.currentQueue,
    tickets: s.wiredTiger.concurrentTransactions
  });
'

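Average latency since startup (the opLatencies counters are cumulative, so compare two samples taken some time apart to see a trend):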

mongosh --quiet --eval '
  var lat = db.serverStatus().opLatencies;
  print("Read avg (us): " + (lat.reads.latency / lat.reads.ops).toFixed(0));
  print("Write avg (us): " + (lat.writes.latency / lat.writes.ops).toFixed(0));
'
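
Because the opLatencies counters only ever grow, an average over a recent interval comes from the difference between two samples. The following is a minimal sketch of that arithmetic; the function name intervalLatency and the sample values are hypothetical, and it assumes the counters have been read as plain numbers.

```javascript
// Average latency in microseconds for the interval between two samples.
// Each sample is shaped like db.serverStatus().opLatencies.
function intervalLatency(prev, curr, kind) {
  const ops = curr[kind].ops - prev[kind].ops;
  const latency = curr[kind].latency - prev[kind].latency;
  // No operations in the interval (or counters reset by a restart): no average.
  if (ops <= 0) return null;
  return latency / ops;
}

// Hypothetical samples taken some time apart.
const before = {
  reads: { latency: 1000000, ops: 10000 },
  writes: { latency: 500000, ops: 1000 },
};
const after = {
  reads: { latency: 1900000, ops: 11000 },
  writes: { latency: 500000, ops: 1000 },
};
console.log(intervalLatency(before, after, "reads"));  // 900
console.log(intervalLatency(before, after, "writes")); // null
```

Here the lifetime read average would still be about 173 us (1900000 / 11000), while the interval average of 900 us exposes the recent spike.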

How to diagnose it

  1. Confirm the noTimeout count is elevated. Sample db.serverStatus().metrics.cursor.open.noTimeout. On an OLTP primary this is normally zero. Sustained nonzero values warrant investigation; values above 10 indicate high snapshot retention risk.
  2. Correlate with cache pressure. Check wiredTiger.cache for fill ratio above 80% and dirty ratio trending above 15%. If both are climbing while open.noTimeout is flat and nonzero, the cursors are likely pinning old versions.
  3. Identify the owning operations. Run db.currentOp() and look for getmore operations with high secs_running. Note the client IP, the namespace, and whether the operation is an aggregation (change streams show up here). The opid is what you need if you decide to kill the operation.
  4. Check for application-thread eviction. In wiredTiger.cache, if pages evicted by application threads is incrementing, the cache is already in crisis. This confirms that background eviction cannot keep up and user operations are paying the cost.
  5. Map the client to a workload. Cross-reference the client field with known ETL hosts, backup schedules, or application services. If the cursor opened at 02:00 and your backup job starts at 02:00, you have found the owner.
  6. Determine if the cursor is legitimate. A backup job that needs four hours to scan a terabyte collection may justify a noTimeout cursor, but it should run on a hidden secondary, not the primary. A change stream that has not consumed an event in an hour is likely abandoned.
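
Steps 3 and 5 above amount to grouping long-running getmore operations by client host. The following is a rough sketch of that grouping as a pure function over the inprog array; the name groupGetmoresByClient, the 60-second cutoff, and the sample operations are assumptions for illustration.

```javascript
// Summarize long-running getmore operations per client host.
// `inprog` is shaped like db.currentOp().inprog.
function groupGetmoresByClient(inprog, minSecs) {
  const byHost = new Map();
  for (const op of inprog) {
    if (op.op !== "getmore" || (op.secs_running || 0) < minSecs) continue;
    // op.client is "host:port"; drop the ephemeral port so one host groups together.
    const client = op.client || "unknown";
    const host = client.includes(":") ? client.slice(0, client.lastIndexOf(":")) : client;
    const entry = byHost.get(host) || { count: 0, maxSecs: 0, namespaces: new Set(), opids: [] };
    entry.count += 1;
    entry.maxSecs = Math.max(entry.maxSecs, op.secs_running);
    entry.namespaces.add(op.ns);
    entry.opids.push(op.opid);
    byHost.set(host, entry);
  }
  return byHost;
}

// Hypothetical currentOp output.
const sample = [
  { opid: 101, op: "getmore", secs_running: 7200, ns: "shop.orders", client: "10.0.0.5:51234" },
  { opid: 102, op: "getmore", secs_running: 90, ns: "shop.items", client: "10.0.0.5:51240" },
  { opid: 103, op: "query", secs_running: 300, ns: "shop.orders", client: "10.0.0.9:40000" },
  { opid: 104, op: "getmore", secs_running: 5, ns: "shop.orders", client: "10.0.0.9:40001" },
];
for (const [host, e] of groupGetmoresByClient(sample, 60)) {
  console.log(host, e.count, e.maxSecs + "s", [...e.namespaces].join(","), e.opids.join(","));
}
// prints: 10.0.0.5 2 7200s shop.orders,shop.items 101,102
```

In mongosh the same function could be fed db.currentOp().inprog directly; the host that dominates the output is the one to cross-reference against ETL and backup schedules.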

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| metrics.cursor.open.noTimeout | Each cursor pins a snapshot indefinitely, preventing old-version eviction | Sustained > 0; > 10 is critical |
| WiredTiger cache dirty ratio | Dirty pages accumulate when pinned snapshots block eviction | > 15% elevated; > 20% risks checkpoint stall |
| pages evicted by application threads | Application threads doing eviction work adds latency to queries and writes | Any sustained nonzero rate |
| opLatencies reads and writes | Latency grows as operations wait behind eviction work | Average sustained > 2x baseline |
| globalLock.currentQueue | Operations queue behind ticket-holding threads that are busy evicting | Sustained > 20 readers or writers |
| opcounters.getmore | High getmore rate with flat query rate suggests large cursor iteration | Spike correlating with cache fill growth |
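
The warning signs in the table can be checked mechanically once the values are collected. This is a sketch under assumptions: the function evaluateSignals and its input field names (noTimeout, dirtyPct, appEvictionsPerSec, and so on) are hypothetical, and the thresholds are simply the ones listed above, which should be tuned to your own baseline.

```javascript
// Apply the table's thresholds to one set of already-computed values.
// All field names on `s` are hypothetical; derive them from serverStatus samples.
function evaluateSignals(s) {
  const findings = [];
  if (s.noTimeout > 10) findings.push("critical: noTimeout cursors > 10");
  else if (s.noTimeout > 0) findings.push("warning: noTimeout cursors > 0");
  if (s.dirtyPct > 20) findings.push("critical: dirty ratio > 20%");
  else if (s.dirtyPct > 15) findings.push("warning: dirty ratio > 15%");
  if (s.appEvictionsPerSec > 0) findings.push("warning: application threads are evicting");
  if (s.readLatencyUs > 2 * s.baselineReadLatencyUs) findings.push("warning: read latency > 2x baseline");
  if (s.queuedReaders > 20 || s.queuedWriters > 20) findings.push("warning: queue depth > 20");
  return findings;
}

// Hypothetical reading from a stressed primary.
const reading = {
  noTimeout: 3,
  dirtyPct: 17.5,
  appEvictionsPerSec: 12,
  readLatencyUs: 900,
  baselineReadLatencyUs: 200,
  queuedReaders: 4,
  queuedWriters: 1,
};
evaluateSignals(reading).forEach((line) => console.log(line));
```

Note that "sustained" in the table implies evaluating several consecutive readings rather than a single one; this sketch only covers one reading.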

Fixes

Kill abandoned or leaked cursors

If currentOp shows a noTimeout cursor that should not be running, note its opid and terminate it.

# WARNING: This kills the operation. The client receives an error and must restart its work.
mongosh --quiet --eval 'db.killOp(<opid>)'

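Killing a cursor frees its snapshot immediately and allows eviction to proceed. This is safe for read-only cursors and change streams. The tradeoff is that the application must reopen the cursor and possibly re-scan data. A leaked cursor that is idle has no active operation for db.killOp() to target; close it by cursor ID with the killCursors command instead.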

Refactor ETL and backup jobs

Do not let long-running scans hold a single snapshot across an entire collection. Break the work into smaller ranges using an indexed field such as _id or a timestamp. Process each range with a fresh cursor that uses the default timeout. The tradeoff is more round-trips and slightly more complex checkpointing in the application, but each snapshot is short-lived and cache pressure stays bounded. Run large backups against a hidden secondary rather than the primary.
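
The range-based approach can be sketched as a loop that remembers the last _id it processed. This is an illustration rather than a drop-in ETL implementation: scanInRanges, fetchBatch, and the events collection named in the comment are hypothetical, and the in-memory array only stands in for a real collection so the example runs without a server.

```javascript
// Walk a collection in _id order using short-lived, bounded queries.
// fetchBatch(lastId, limit) must return up to `limit` documents with _id > lastId,
// sorted by _id ascending (lastId is null for the first batch).
function scanInRanges(fetchBatch, batchSize, handle, startAfter = null) {
  let lastId = startAfter;
  let processed = 0;
  for (;;) {
    const docs = fetchBatch(lastId, batchSize);
    if (docs.length === 0) break;
    for (const doc of docs) handle(doc);
    processed += docs.length;
    lastId = docs[docs.length - 1]._id; // checkpoint: persist this to resume later
    if (docs.length < batchSize) break; // a short batch means the end was reached
  }
  return { processed, lastId };
}

// In mongosh, fetchBatch could be wired to a real collection like this
// (the collection name is a placeholder):
//   const fetchBatch = (lastId, limit) =>
//     db.events.find(lastId === null ? {} : { _id: { $gt: lastId } })
//       .sort({ _id: 1 }).limit(limit).toArray();

// In-memory stand-in so the sketch runs without a server.
const data = [1, 2, 3, 4, 5, 6, 7].map((n) => ({ _id: n }));
const fakeFetch = (lastId, limit) =>
  data.filter((d) => lastId === null || d._id > lastId).slice(0, limit);

const seen = [];
const result = scanInRanges(fakeFetch, 3, (doc) => seen.push(doc._id));
console.log(result); // { processed: 7, lastId: 7 }
```

Each batch is fully read before the next query is issued, so no cursor outlives a single batch, and the saved lastId lets a failed job resume instead of restarting the scan.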

Fix change stream lifecycle

Change streams are legitimate long-lived cursors, but they should not remain open indefinitely without consuming events. Ensure your application closes change streams on shutdown, handles errors with resume tokens, and monitors open.noTimeout to detect orphaned streams. The tradeoff is adding reconnect logic, but it eliminates the risk of a forgotten stream pinning a snapshot for days.
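
One way to make the close unconditional is to wrap consumption in try/finally and persist the resume token after each event. The sketch below assumes a stream object with tryNext() (returning an event or null) and close(), which matches the cursor that mongosh returns from watch(); drainChangeStream and the saveToken callback are hypothetical names, and the fake stream exists only so the example runs without a server.

```javascript
// Drain the currently available events from a change stream, then always close it.
// saveToken is a hypothetical callback that persists the resume token somewhere durable.
function drainChangeStream(stream, handle, saveToken) {
  let count = 0;
  try {
    for (let event = stream.tryNext(); event !== null; event = stream.tryNext()) {
      handle(event);
      saveToken(event._id); // a change event's _id is its resume token
      count += 1;
    }
  } finally {
    stream.close(); // runs even if handle() throws, so the cursor is not orphaned
  }
  return count;
}

// Fake stream standing in for a real change stream cursor.
const pending = [{ _id: { _data: "token-1" } }, { _id: { _data: "token-2" } }];
const fakeStream = {
  closed: false,
  tryNext() { return pending.length ? pending.shift() : null; },
  close() { this.closed = true; },
};

let lastToken = null;
const handled = drainChangeStream(fakeStream, () => {}, (t) => { lastToken = t; });
console.log(handled, lastToken._data, fakeStream.closed); // 2 token-2 true
```

On restart, the saved token can be passed back through the resumeAfter option of watch() so the consumer continues where it stopped instead of keeping one stream open indefinitely.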

Pause the workload during a crisis

If the cache is already in a pressure cascade and you cannot immediately kill the cursor, pause the offending ETL job or restart the change stream consumer. This is a tactical fix, not a permanent one. The tradeoff is delayed analytics or backup completion, but it restores OLTP latency within minutes as eviction catches up.

Do not add RAM as the first response

Expanding the WiredTiger cache size only postpones the stall. The snapshot is still pinned, and the cache will eventually fill again. Fix the cursor lifecycle first. If the workload is legitimate and must run on the primary, only then consider whether the cache is undersized for the combined OLTP and snapshot load.

Prevention

  • Alert on open.noTimeout. Any sustained value above zero is abnormal for most OLTP deployments. Set a warning at > 0 and a critical threshold at > 10.
  • Bound long-running reads. Require ETL jobs to use range-based queries and standard cursor timeouts. If a job genuinely cannot finish within the idle timeout, it belongs on a secondary or needs explicit batching.
  • Audit change stream usage. Review application code for change streams that are opened without corresponding close handlers. Treat them like database connections: always close in a finally block or equivalent.
  • Watch the dirty ratio. Most teams monitor cache fill but miss the dirty ratio. A dirty ratio climbing toward 15% is often the first sign that snapshot pinning is blocking eviction. Correlate dirty ratio with open.noTimeout to catch the pattern early.

How Netdata helps

  • Correlates mongodb.cursor_open_noTimeout with mongodb.wiredtiger_cache_dirty_ratio and mongodb.wiredtiger_pages_evicted_by_application_threads on the same timeline.
  • Shows per-second getmore rates alongside cache metrics so you can tie cursor iteration to pressure spikes.
  • Alerts on sustained noTimeout cursor counts and application-thread evictions before latency degrades.