MongoDB noTimeout cursors causing cache pressure: pinned snapshots and silent eviction stalls
When the wiredTiger.cache statistic bytes currently in the cache climbs, the dirty ratio trends toward 20%, and read latencies spike without a single slow query in the log, check metrics.cursor.open.noTimeout.
Each noTimeout cursor pins a WiredTiger snapshot indefinitely. Old document versions cannot be evicted while that snapshot is open, so the cache fills with unreachable history until background eviction falls behind and application threads are forced to clean up. The result is a silent cache pressure cascade that looks like a capacity problem but is actually a cursor lifecycle problem.
ETL pipelines, backup tools, and change streams often open cursors with noCursorTimeout() to traverse large collections without hitting the default idle timeout. When those cursors are abandoned, leaked, or left open longer than necessary, they continue holding snapshots long after the application has moved on. Unlike a slow query, there is no obvious offender in the logs. The pressure builds in the background until WiredTiger stalls application threads to evict pages, at which point everything slows down together.
What this means
WiredTiger uses multiversion concurrency control (MVCC). Every write creates a new version of a document in cache. Old versions are retained until no transaction or cursor needs them. A cursor opened with noCursorTimeout() bypasses the normal idle timeout and holds its snapshot open until explicitly closed or the connection drops. While that snapshot is active, WiredTiger cannot evict the old page versions visible to it. If the cursor traverses a large range or sits idle for hours, the pinned history accumulates. The cache fill ratio rises, the dirty ratio climbs, and eventually the eviction threads cannot keep pace. When the cache hits the aggressive eviction threshold, application threads pause to evict pages themselves. That adds latency to every operation, depletes read and write tickets, and causes queue depths to grow. The cascade looks like storage saturation, but adding disk I/O or RAM will not fix it because the root cause is snapshot retention, not capacity.
flowchart TD
A[Application opens noTimeout cursor] --> B[WiredTiger snapshot pinned]
B --> C[Old document versions retained in cache]
C --> D[Cache fill and dirty ratio climb]
D --> E[Background eviction cannot free pinned pages]
E --> F[Application threads forced to evict]
F --> G[Latency spikes and queue depth grows]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| ETL or backup tool using noCursorTimeout() | open.noTimeout is steady and greater than 0; getmore opcounter is elevated; long-running getmore from one client host | db.currentOp() filtered to getmore, grouped by client |
| Change stream consumer left open | open.noTimeout and open.pinned both elevated; an aggregation cursor on the change stream namespace appears in currentOp for hours | db.currentOp() for aggregations with $changeStream |
| Application cursor leak after connection drop | open.noTimeout climbs but currentOp shows no active client for some cursors; connection churn correlates with cursor growth | db.serverStatus().metrics.cursor delta against connections.totalCreated |
Quick checks
Run these read-only checks to confirm whether noTimeout cursors are driving cache pressure.
Cursor counts and cache utilization:
mongosh --quiet --eval '
var c = db.serverStatus().metrics.cursor;
var wt = db.serverStatus().wiredTiger.cache;
var max = wt["maximum bytes configured"];
print("noTimeout cursors: " + c.open.noTimeout);
print("Pinned cursors: " + c.open.pinned);
print("Cache fill: " + (100 * wt["bytes currently in the cache"] / max).toFixed(1) + "%");
print("Cache dirty: " + (100 * wt["tracked dirty bytes in the cache"] / max).toFixed(1) + "%");
print("App-thread evictions: " + wt["pages evicted by application threads"]);
'
Long-running cursor operations:
mongosh --quiet --eval '
db.currentOp({ "active": true, "secs_running": { "$gt": 60 } }).inprog.forEach(function(op) {
if (op.op === "getmore") {
print(op.opid + " | " + op.secs_running + "s | " + op.ns + " | " + op.client);
}
});
'
Eviction stall counters:
mongosh --quiet --eval '
var wt = db.serverStatus().wiredTiger.cache;
print("Eviction stalls: " + wt["pages selected for eviction unable to be evicted"]);
'
Queue depths and ticket availability:
mongosh --quiet --eval '
printjson({
queue: db.serverStatus().globalLock.currentQueue,
// On MongoDB 7.0 and later, ticket counts are reported under queues.execution instead.
tickets: db.serverStatus().wiredTiger.concurrentTransactions
});
'
Average latency since startup (sample twice and compare to see the trend):
mongosh --quiet --eval '
var lat = db.serverStatus().opLatencies;
print("Read avg (us): " + (lat.reads.latency / lat.reads.ops).toFixed(0));
print("Write avg (us): " + (lat.writes.latency / lat.writes.ops).toFixed(0));
'
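The averages above are cumulative since the process started, so a recent spike can be diluted by days of healthy history. A minimal sketch, assuming two opLatencies samples taken a few seconds apart, computes the average for just that interval. The function name intervalLatency and the 10-second window are illustrative choices, not part of MongoDB.

```javascript
// Average latency (microseconds) between two opLatencies samples.
// `before` and `after` are the opLatencies sub-documents of two
// db.serverStatus() results; intervalLatency is an illustrative name.
function intervalLatency(before, after) {
  const out = {};
  for (const kind of ["reads", "writes"]) {
    const ops = Number(after[kind].ops) - Number(before[kind].ops);
    const micros = Number(after[kind].latency) - Number(before[kind].latency);
    // No operations in the interval: report null rather than divide by zero.
    out[kind] = ops > 0 ? micros / ops : null;
  }
  return out;
}

// In mongosh, sample twice and compare. The guard lets the file load outside mongosh.
if (typeof db !== "undefined" && typeof sleep === "function") {
  const first = db.serverStatus().opLatencies;
  sleep(10000); // 10-second window; adjust as needed
  const second = db.serverStatus().opLatencies;
  printjson(intervalLatency(first, second));
}
```

Comparing the interval value against a known baseline is what makes the "2x baseline" rule of thumb usable in practice.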
How to diagnose it
- Confirm the noTimeout count is elevated. Sample db.serverStatus().metrics.cursor.open.noTimeout. On an OLTP primary this is normally zero. Sustained nonzero values warrant investigation; values above 10 indicate high snapshot retention risk.
- Correlate with cache pressure. Check wiredTiger.cache for fill ratio above 80% and dirty ratio trending above 15%. If both are climbing while open.noTimeout is flat and nonzero, the cursors are likely pinning old versions.
- Identify the owning operations. Run db.currentOp() and look for getmore operations with high secs_running. Note the client IP, the namespace, and whether the operation is an aggregation (change streams show up here). The opid is what you need if you decide to kill the operation.
- Check for application-thread eviction. In wiredTiger.cache, if pages evicted by application threads is incrementing, the cache is already in crisis. This confirms that background eviction cannot keep up and user operations are paying the cost.
- Map the client to a workload. Cross-reference the client field with known ETL hosts, backup schedules, or application services. If the cursor opened at 02:00 and your backup job starts at 02:00, you have found the owner.
- Determine if the cursor is legitimate. A backup job that needs four hours to scan a terabyte collection may justify a noTimeout cursor, but it should run on a hidden secondary, not the primary. A change stream that has not consumed an event in an hour is likely abandoned.
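The first two steps can be folded into a single check. The sketch below assumes a serverStatus document as input and applies the thresholds stated in this guide (noTimeout above 0 and above 10, fill above 80%, dirty above 15%). The function name assessCursorPressure and the shape of its return value are hypothetical.

```javascript
// Classify one db.serverStatus() document using the thresholds from this guide.
// assessCursorPressure and its return shape are illustrative, not a MongoDB API.
function assessCursorPressure(status) {
  const cache = status.wiredTiger.cache;
  const max = Number(cache["maximum bytes configured"]);
  const noTimeout = Number(status.metrics.cursor.open.noTimeout);
  const fillPct = 100 * Number(cache["bytes currently in the cache"]) / max;
  const dirtyPct = 100 * Number(cache["tracked dirty bytes in the cache"]) / max;

  let level = "ok";
  if (noTimeout > 0) level = "warning";
  if (noTimeout > 10) level = "critical";

  // Cache pressure only points at cursors when noTimeout cursors exist.
  const cachePressure = fillPct > 80 && dirtyPct > 15;
  return { noTimeout, fillPct, dirtyPct, level, likelyPinned: noTimeout > 0 && cachePressure };
}

// In mongosh, run it against the live server.
if (typeof db !== "undefined") {
  printjson(assessCursorPressure(db.serverStatus()));
}
```

A single sample cannot show whether the ratios are climbing, so in practice this would be run on an interval and the results compared.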
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| metrics.cursor.open.noTimeout | Each cursor pins a snapshot indefinitely, preventing old-version eviction | Sustained > 0; > 10 is critical |
| WiredTiger cache dirty ratio | Dirty pages accumulate when pinned snapshots block eviction | > 15% elevated; > 20% risks checkpoint stall |
| pages evicted by application threads | Application threads doing eviction work adds latency to queries and writes | Any sustained nonzero rate |
| opLatencies reads and writes | Latency grows as operations wait behind eviction work | Average sustained > 2x baseline |
| globalLock.currentQueue | Operations queue behind ticket-holding threads that are busy evicting | Sustained > 20 readers or writers |
| opcounters.getmore | High getmore rate with flat query rate suggests large cursor iteration | Spike correlating with cache fill growth |
Fixes
Kill abandoned or leaked cursors
If currentOp shows a noTimeout cursor that should not be running, note its opid and terminate it.
# WARNING: This kills the operation. The client receives an error and must restart its work.
mongosh --quiet --eval 'db.killOp(<opid>)'
Killing a cursor frees its snapshot immediately and allows eviction to proceed. This is safe for read-only cursors and change streams. The tradeoff is that the application must reopen the cursor and possibly re-scan data.
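An abandoned cursor is usually idle rather than running a getmore, so it may not appear in an active-only currentOp listing. The sketch below assumes MongoDB 4.2 or later, where the $currentOp aggregation stage accepts idleCursors: true and reports each idle cursor with a cursor sub-document (cursorId, noCursorTimeout, lastAccessDate). The helper name findStaleNoTimeoutCursors and the one-hour cutoff are hypothetical, and allUsers: true needs the corresponding privilege.

```javascript
// Pick idle noTimeout cursors that nobody has touched for at least `maxIdleMs`.
// `ops` is the array returned by the $currentOp aggregation shown below.
function findStaleNoTimeoutCursors(ops, now, maxIdleMs) {
  return ops
    .filter((op) => op.type === "idleCursor" && op.cursor && op.cursor.noCursorTimeout)
    .filter((op) => now - new Date(op.cursor.lastAccessDate) >= maxIdleMs)
    .map((op) => ({
      ns: op.ns,
      cursorId: op.cursor.cursorId,
      lastAccess: op.cursor.lastAccessDate
    }));
}

if (typeof db !== "undefined") {
  // $currentOp must run against the admin database.
  const ops = db.getSiblingDB("admin")
    .aggregate([{ $currentOp: { allUsers: true, idleCursors: true } }])
    .toArray();
  printjson(findStaleNoTimeoutCursors(ops, new Date(), 60 * 60 * 1000));
  // To close one, run killCursors in the cursor's own database, for example:
  // db.getSiblingDB("<db>").runCommand({ killCursors: "<collection>", cursors: [<cursorId>] })
}
```

The listing step is read-only. The killCursors call is left as a comment because, like killOp, it forces the owning client to reopen its cursor.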
Refactor ETL and backup jobs
Do not let long-running scans hold a single snapshot across an entire collection. Break the work into smaller ranges using an indexed field such as _id or a timestamp. Process each range with a fresh cursor that uses the default timeout. The tradeoff is more round-trips and slightly more complex checkpointing in the application, but each snapshot is short-lived and cache pressure stays bounded. Run large backups against a hidden secondary rather than the primary.
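A minimal sketch of the range-based approach, assuming _id is the range key and that the job can process batches independently. The function name scanInRanges, the batch size, and the collection name in the usage note are hypothetical; the cursor methods (find, sort, limit, toArray) are the standard mongosh ones.

```javascript
// Scan a collection in _id order using many short-lived cursors.
// `coll` is a mongosh collection; `handle` receives each batch;
// `startAfter` lets a job resume from a previously saved _id.
function scanInRanges(coll, batchSize, handle, startAfter) {
  let last = startAfter;
  let total = 0;
  for (;;) {
    const filter = last === undefined ? {} : { _id: { $gt: last } };
    // Each find() is a fresh cursor with the default timeout, drained right away.
    const batch = coll.find(filter).sort({ _id: 1 }).limit(batchSize).toArray();
    if (batch.length === 0) return { total, last };
    handle(batch);
    total += batch.length;
    last = batch[batch.length - 1]._id; // persist this value to checkpoint the job
  }
}
```

In mongosh this could be called as scanInRanges(db.orders, 1000, (batch) => print(batch.length)), where db.orders stands in for the collection being exported. Because each batch is read under its own short-lived cursor, the job no longer sees one consistent point-in-time view of the whole collection, which is part of the tradeoff described above.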
Fix change stream lifecycle
Change streams are legitimate long-lived cursors, but they should not remain open indefinitely without consuming events. Ensure your application closes change streams on shutdown, handles errors with resume tokens, and monitors open.noTimeout to detect orphaned streams. The tradeoff is adding reconnect logic, but it eliminates the risk of a forgotten stream pinning a snapshot for days.
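The lifecycle described above can be sketched as a loop that always closes the stream and keeps the last resume token. This assumes a cursor exposing tryNext() and close(), as a mongosh watch() cursor does, and relies on the fact that a change event's _id is its resume token. The names consumeChangeStream, openStream, and shouldStop are hypothetical.

```javascript
// Consume a change stream and always close it, remembering the resume token.
// `openStream(token)` is a caller-supplied function that returns a cursor
// with tryNext() and close(); `shouldStop()` signals shutdown.
function consumeChangeStream(openStream, handle, shouldStop, resumeToken) {
  const stream = openStream(resumeToken);
  let token = resumeToken;
  try {
    while (!shouldStop()) {
      const event = stream.tryNext(); // null when no event is waiting
      if (event === null) {
        // Avoid a hot loop in mongosh, where sleep(ms) is available.
        if (typeof sleep === "function") sleep(200);
        continue;
      }
      handle(event);
      token = event._id; // the event _id is its resume token
    }
  } finally {
    stream.close(); // runs on normal exit and when handle() throws
  }
  return token;
}
```

In mongosh, openStream could be (t) => db.orders.watch([], t ? { resumeAfter: t } : {}), with db.orders as a placeholder collection. The sketch returns the token only on a clean stop, so a consumer that must survive crashes would also persist event._id inside handle.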
Pause the workload during a crisis
If the cache is already in a pressure cascade and you cannot immediately kill the cursor, pause the offending ETL job or restart the change stream consumer. This is a tactical fix, not a permanent one. The tradeoff is delayed analytics or backup completion, but it restores OLTP latency within minutes as eviction catches up.
Do not add RAM as the first response
Expanding the WiredTiger cache size only postpones the stall. The snapshot is still pinned, and the cache will eventually fill again. Fix the cursor lifecycle first. If the workload is legitimate and must run on the primary, only then consider whether the cache is undersized for the combined OLTP and snapshot load.
Prevention
- Alert on open.noTimeout. Any sustained value above zero is abnormal for most OLTP deployments. Set a warning at > 0 and a critical threshold at > 10.
- Bound long-running reads. Require ETL jobs to use range-based queries and standard cursor timeouts. If a job genuinely cannot finish within the idle timeout, it belongs on a secondary or needs explicit batching.
- Audit change stream usage. Review application code for change streams that are opened without corresponding close handlers. Treat them like database connections: always close in a finally block or equivalent.
- Watch the dirty ratio. Most teams monitor cache fill but miss the dirty ratio. A dirty ratio climbing toward 15% is often the first sign that snapshot pinning is blocking eviction. Correlate dirty ratio with open.noTimeout to catch the pattern early.
How Netdata helps
- Correlates mongodb.cursor_open_noTimeout with mongodb.wiredtiger_cache_dirty_ratio and mongodb.wiredtiger_pages_evicted_by_application_threads on the same timeline.
- Shows per-second getmore rates alongside cache metrics so you can tie cursor iteration to pressure spikes.
- Alerts on sustained noTimeout cursor counts and application-thread evictions before latency degrades.







