MongoDB long-running transactions: pinned snapshots and silent cache pressure
MongoDB latency climbs while write throughput looks normal and the working set is unchanged. Check WiredTiger cache: the dirty ratio is rising and application threads have started evicting pages. The cause is often invisible in standard cache metrics – a multi-document transaction that opened a snapshot and never let it go.
Multi-document transactions pin WiredTiger snapshots until commit or abort. That snapshot prevents the storage engine from evicting old document versions. A single transaction left open for minutes can silently pin enough pages to push the cache into eviction pressure. An inactive-but-open transaction is worse than an active one because it holds the snapshot for as long as it stays open while making no forward progress.
The default transactionLifetimeLimitSeconds is 60 seconds, after which the server's periodic cleanup aborts the transaction. Any transaction open longer than that is suspicious.
What this means
WiredTiger uses multiversion concurrency control. Every write creates a new version of a document in cache. Old versions are retained until no open transaction or cursor needs them. When a multi-document transaction begins, it takes a snapshot. As long as the transaction stays open, WiredTiger cannot evict any page version visible to that snapshot.
New writes continue to create versions. The cache fills with pinned history. Background eviction threads cannot free these pages. The dirty ratio climbs. Once the dirty ratio crosses the eviction threshold and background threads cannot keep up, application threads perform eviction directly. That adds latency to every operation that touches the storage engine, even reads on unrelated collections.
The damage is often caused by one forgotten transaction rather than a workload shift. Because the transaction may be inactive, it produces no opLatencies or ticket utilization signal on its own. The only direct evidence is in the transaction counters and currentOp.
flowchart TD
A[Transaction opens] --> B[Snapshot pinned]
B --> C[Old versions retained in cache]
C --> D[Cache fill rises]
D --> E[Background eviction cannot free pinned pages]
E --> F[Dirty ratio climbs]
F --> G[Application threads forced to evict]
G --> H[Latency spikes across operations]
H --> I[Transaction commits or aborts]
I --> J[Snapshot released]
J --> K[Eviction resumes normally]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Abandoned transaction after application crash or disconnect | currentInactive > 0 with old open time, no active client | db.currentOp for transactions with idle connections |
| Application logic holding a transaction open across external calls | currentActive > 0 for minutes with the same opid | Application logs and currentOp elapsed time |
| Lock contention triggering abort storms | totalAborted rising rapidly, queue depths growing | Lock wait times in serverStatus().locks |
| Transaction size overwhelming cache capacity | Aborted transactions correlating with cache pressure spikes | currentOp for large uncommitted transactions |
| DDL blocked by a transaction holding collection locks | Index builds or renames stalled on the same namespace | currentOp for operations waiting on locks |
Quick checks
# Check transaction counts and abort ratio
mongosh --quiet --eval 'var t=db.serverStatus().transactions; print("Active: "+t.currentActive+", Inactive: "+t.currentInactive+", Open: "+t.currentOpen+", Aborted: "+t.totalAborted+", Started: "+t.totalStarted);'
# List open transactions and their age
mongosh --quiet --eval 'db.currentOp({"transaction":{"$exists":true}}).inprog.forEach(function(o){print(o.opid+" | "+(o.transaction.timeOpenMicros/1000000).toFixed(1)+"s | "+(o.client||"internal"));})'
# Check cache pressure signals
mongosh --quiet --eval 'var c=db.serverStatus().wiredTiger.cache; var m=c["maximum bytes configured"]; print("Used: "+(100*c["bytes currently in the cache"]/m).toFixed(1)+"%, Dirty: "+(100*c["tracked dirty bytes in the cache"]/m).toFixed(1)+"%, AppEvict: "+c["pages evicted by application threads"]);'
# Check lock wait times by type
mongosh --quiet --eval 'var l=db.serverStatus().locks; for(var t in l){if(l[t].timeAcquiringMicros){for(var m in l[t].timeAcquiringMicros){print(t+" "+m+": "+l[t].timeAcquiringMicros[m]+" µs");}}}'
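The cache check above prints pages evicted by application threads as a cumulative counter, so a single reading says little; what matters is whether it is growing. A minimal sketch of turning two readings into a rate, assuming both arguments are db.serverStatus().wiredTiger.cache documents captured a known number of seconds apart (the function name and the null-on-reset behavior are illustrative choices, not part of MongoDB):

```javascript
// Sketch: per-second growth of "pages evicted by application threads"
// between two samples of db.serverStatus().wiredTiger.cache.
// appEvictionRate is a hypothetical helper name.
function appEvictionRate(prev, curr, elapsedSeconds) {
  const key = "pages evicted by application threads";
  if (!(elapsedSeconds > 0)) {
    throw new Error("elapsedSeconds must be positive");
  }
  const delta = Number(curr[key]) - Number(prev[key]);
  // A negative delta means the counter went backwards (for example after a
  // restart), so return null rather than a misleading negative rate.
  if (delta < 0) return null;
  return delta / elapsedSeconds;
}

// Example use in mongosh (requires a live server, shown only as a comment):
//   const a = db.serverStatus().wiredTiger.cache; sleep(30000);
//   const b = db.serverStatus().wiredTiger.cache;
//   print(appEvictionRate(a, b, 30));
```

Any sustained non-zero rate is the signal worth acting on; the absolute counter value depends on how long the process has been running.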
How to diagnose it
- Confirm cache pressure. Check wiredTiger.cache for used ratio above 80% or dirty ratio above 10%, and poll for increments in pages evicted by application threads.
- Check serverStatus().transactions. If currentOpen is elevated and currentInactive is greater than zero, a snapshot is pinned by an idle transaction.
- Run db.currentOp({"transaction": {"$exists": true}}) to identify the specific transactions, their opid, and their open duration in transaction.timeOpenMicros.
- Compute the abort ratio: totalAborted / totalStarted. A ratio above 50% indicates severe contention or repeated cache-pressure aborts.
- Check serverStatus().locks for growing timeAcquiringMicros. Collection-level waits suggest a transaction is blocking other operations on the same namespace.
- Review the MongoDB log for transaction abort messages near the time cache pressure began. Aborts due to cache pressure or lifetime limit expiry are logged.
- Correlate the transaction start time with the onset of cache pressure in your monitoring system. If the timelines align, the transaction is the root cause.
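The threshold checks in the steps above can be sketched as one function over a single serverStatus document. This is an illustration under stated assumptions, not an official diagnostic: the function name diagnose and the wording of the findings are made up here, while the field names and thresholds (80% used, 10% dirty, 50% abort ratio) are the ones used in this guide.

```javascript
// Sketch: apply this guide's thresholds to one db.serverStatus() document.
// diagnose is a hypothetical helper; it does not contact the server itself.
function diagnose(status) {
  const cache = status.wiredTiger.cache;
  const txn = status.transactions;

  const max = Number(cache["maximum bytes configured"]);
  const usedPct = (100 * Number(cache["bytes currently in the cache"])) / max;
  const dirtyPct = (100 * Number(cache["tracked dirty bytes in the cache"])) / max;

  // Lifetime abort ratio; guard against division by zero on a fresh server.
  const started = Number(txn.totalStarted);
  const abortRatio = started > 0 ? Number(txn.totalAborted) / started : 0;

  const findings = [];
  if (usedPct > 80) findings.push("cache used above 80%");
  if (dirtyPct > 10) findings.push("cache dirty above 10%");
  if (Number(txn.currentInactive) > 0) {
    findings.push("idle transaction may be pinning a snapshot");
  }
  if (abortRatio > 0.5) findings.push("abort ratio above 50%");

  return { usedPct, dirtyPct, abortRatio, findings };
}

// Example use in mongosh: printjson(diagnose(db.serverStatus()));
```

The function only flags candidates. Confirming the root cause still needs the currentOp and timeline correlation steps.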
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| currentInactive | Idle transactions hold snapshots silently with no forward progress | Sustained > 0 for longer than 60 seconds |
| currentOpen | Total open transactions consuming snapshot resources | Sustained > 10 in non-transaction-heavy workloads |
| Abort ratio (totalAborted / totalStarted) | High abort rate indicates contention or cache-pressure kills | Ratio > 50% |
| WiredTiger dirty ratio | Rises when pinned snapshots block eviction | > 10% sustained |
| pages evicted by application threads | Direct signal that users are feeling cache pressure | Any sustained increment |
| Lock wait times (timeAcquiringMicros) | Reveals whether transactions are blocking DDL or other writes | Growth in collection or global lock waits |
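Several warning signs in the table are "sustained" conditions rather than single readings. One way to evaluate that over polled samples is sketched below. The sample shape ({ t, value } with t in seconds) and the function name are assumptions made for illustration.

```javascript
// Sketch: true if `value` stayed above `threshold` across consecutive samples
// spanning at least `minSeconds`. Samples must be sorted by ascending t.
// sustainedAbove and the { t, value } shape are hypothetical.
function sustainedAbove(samples, threshold, minSeconds) {
  let runStart = null; // timestamp where the current above-threshold run began
  for (const s of samples) {
    if (s.value > threshold) {
      if (runStart === null) runStart = s.t;
      if (s.t - runStart >= minSeconds) return true;
    } else {
      runStart = null; // any dip to or below the threshold resets the run
    }
  }
  return false;
}

// Example: currentInactive polled every 10 seconds.
// sustainedAbove(samples, 0, 60) flags an idle transaction held for a minute.
```

Resetting on every dip keeps short-lived, healthy transactions from accumulating into a false alert.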
Fixes
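Kill abandoned or idle transactions. If currentOp shows a transaction with no active client and a high timeOpenMicros, use db.killOp(opid) to abort it. If the entry has no opid, which is typical for an idle session that is not running an operation, end its session instead with the killSessions command, using the lsid reported by currentOp. The snapshot releases as soon as the transaction aborts and eviction resumes. Warning: this is disruptive. Any work done by that transaction rolls back.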
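Before killing anything, it helps to list the candidates oldest first. A minimal sketch, assuming the input is the inprog array returned by db.currentOp({"transaction": {"$exists": true}}); the function name and the shape of the returned objects are illustrative, and the function only reports, it never kills:

```javascript
// Sketch: pick transactions open longer than maxOpenSeconds from a currentOp
// inprog array, sorted oldest first. findStaleTransactions is hypothetical.
function findStaleTransactions(inprog, maxOpenSeconds) {
  return inprog
    .filter((op) => op.transaction && op.transaction.timeOpenMicros !== undefined)
    .map((op) => ({
      opid: op.opid, // may be undefined for an idle session entry
      lsid: op.lsid, // session identifier as reported by currentOp
      client: op.client || "internal",
      active: op.active === true,
      openSeconds: Number(op.transaction.timeOpenMicros) / 1e6,
    }))
    .filter((t) => t.openSeconds > maxOpenSeconds)
    .sort((a, b) => b.openSeconds - a.openSeconds);
}

// Example use in mongosh:
//   const ops = db.currentOp({ transaction: { $exists: true } }).inprog;
//   printjson(findStaleTransactions(ops, 60));
```

Keeping selection separate from the kill step leaves room for a human to confirm that the client is really gone before any work is rolled back.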
Reduce transaction lifetime if transactions are legitimate but too large. Break large transactions into smaller batches that commit more frequently. Each smaller transaction releases its snapshot sooner, reducing the pinned window. The tradeoff is that you may need application-level idempotency to handle partial completion.
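The batching idea can be sketched independently of any driver. In the sketch below, runBatch is a hypothetical callback supplied by the application: it is assumed to open one transaction, apply one batch idempotently, and commit. Neither helper is a MongoDB API.

```javascript
// Sketch: split a large unit of work into smaller batches so that each
// transaction commits, and releases its snapshot, sooner.
function chunk(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// runBatch is hypothetical: an async function that wraps ONE batch in its own
// transaction. Batches run sequentially; a failure stops the loop, leaving
// earlier batches committed, which is why each batch must be idempotent.
async function runInBatches(items, batchSize, runBatch) {
  let applied = 0;
  for (const batch of chunk(items, batchSize)) {
    await runBatch(batch);
    applied += batch.length;
  }
  return applied;
}
```

The right batch size depends on document size and write rate. The goal is simply that no single transaction stays open anywhere near the lifetime limit.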
Address lock contention. If totalAborted is rising and lock waits are growing, identify the operation holding the contested lock via currentOp. Kill the blocker if it is non-essential, or reschedule DDL operations to maintenance windows when transactions are not active. The tradeoff is temporary disruption to the blocking workload.
Keep transactionLifetimeLimitSeconds at the default of 60 seconds. Raising this limit masks the root cause and allows bad transactions to compound cache pressure over longer periods. Treat it as a circuit breaker, not a performance tuning knob. The tradeoff is that legitimate long-running analytical transactions must be broken into smaller units.
Prevention
- Monitor currentInactive, not just currentActive. A transaction that is open but not executing is invisible to many workload dashboards yet is the most dangerous state for cache health.
- Alert on any transaction open longer than 60 seconds. Even one transaction crossing this threshold is worth investigating because it can pin enough pages to start eviction pressure.
- Review application code to ensure transactions are opened immediately before the first operation and committed or aborted as soon as the last operation completes. Never hold a transaction open across network calls, user input, or external API requests.
- Monitor the abort ratio trend. A rising totalAborted relative to totalStarted signals contention or cache pressure before latency degrades.
- Run DDL during maintenance windows. Because transactions hold collection locks, DDL such as index builds can queue behind them, extending transaction duration and snapshot lifetime.
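For the abort ratio trend, lifetime totals can hide a recent change because totalAborted and totalStarted accumulate since startup. A minimal sketch of an interval ratio, assuming prev and curr are two db.serverStatus().transactions documents taken some time apart (the function name is illustrative):

```javascript
// Sketch: abort ratio over the interval between two samples of
// db.serverStatus().transactions. intervalAbortRatio is hypothetical.
function intervalAbortRatio(prev, curr) {
  const started = Number(curr.totalStarted) - Number(prev.totalStarted);
  const aborted = Number(curr.totalAborted) - Number(prev.totalAborted);
  // No new transactions, or counters that went backwards (restart): no ratio.
  if (started <= 0 || aborted < 0) return null;
  // Note: aborts in this window may belong to transactions started earlier,
  // so the value is a trend indicator and can exceed 1 in short windows.
  return aborted / started;
}
```

Comparing this value across successive polling windows shows whether aborts are accelerating, which the lifetime ratio smooths away.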
How Netdata helps
- Correlate transactions.currentOpen and transactions.currentInactive with wiredTiger.cache.dirty_ratio and wiredTiger.cache.pages_evicted_by_app_threads on the same timeline to reveal snapshot pinning.
- Alert on sustained currentInactive greater than zero or on transaction duration anomalies.
- Track totalAborted rate relative to totalStarted to detect contention before it cascades into ticket exhaustion.
- Surface cache pressure cascade signals, including dirty ratio, eviction rates, and ticket utilization, alongside transaction metrics in a single view.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB connection churn: high totalCreated rate and thread creation overhead
- MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling
- MongoDB connection storm spiral: reconnection floods after an election or deploy
- MongoDB flow control throttling writes: when the primary slows itself down
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early







