MongoDB long-running transactions: pinned snapshots and silent cache pressure

MongoDB latency climbs while write throughput looks normal and the working set is unchanged. Check the WiredTiger cache: the dirty ratio is rising and application threads have started evicting pages. The cause is often invisible in standard cache metrics: a multi-document transaction that opened a snapshot and never let it go.

Multi-document transactions pin WiredTiger snapshots until commit or abort. That snapshot prevents the storage engine from evicting old document versions. A single transaction left open for minutes can silently pin enough pages to push the cache into eviction pressure. An inactive-but-open transaction is worse than an active one because it makes no forward progress and keeps holding the snapshot until something aborts it or the lifetime limit expires it.

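The default transactionLifetimeLimitSeconds is 60 seconds. A transaction that exceeds the limit is considered expired and is aborted by a periodic cleanup process, so any transaction open longer than that is suspicious.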

What this means

WiredTiger uses multiversion concurrency control. Every write creates a new version of a document in cache. Old versions are retained until no open transaction or cursor needs them. When a multi-document transaction begins, it takes a snapshot. As long as the transaction stays open, WiredTiger cannot evict any page version visible to that snapshot.

New writes continue to create versions. The cache fills with pinned history. Background eviction threads cannot free these pages. The dirty ratio climbs. Once the dirty ratio crosses the eviction threshold and background threads cannot keep up, application threads perform eviction directly. That adds latency to every operation that touches the storage engine, even reads on unrelated collections.
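The retention rule described above can be illustrated with a toy model. This is a minimal sketch and not WiredTiger's implementation: the class `ToyVersionStore` and all of its methods are hypothetical, and real eviction works on pages rather than per-key version chains.

```javascript
// Toy model of MVCC version retention. Hypothetical names; not WiredTiger internals.
class ToyVersionStore {
  constructor() {
    this.clock = 0;             // logical timestamp, bumped on every write
    this.versions = new Map();  // key -> array of { ts, value }, oldest first
    this.snapshots = new Map(); // snapshot id -> timestamp it was taken at
    this.nextSnapshotId = 1;
  }

  write(key, value) {
    this.clock += 1;
    const chain = this.versions.get(key) || [];
    chain.push({ ts: this.clock, value });
    this.versions.set(key, chain);
  }

  openSnapshot() {
    const id = this.nextSnapshotId++;
    this.snapshots.set(id, this.clock);
    return id;
  }

  closeSnapshot(id) {
    this.snapshots.delete(id);
  }

  // Drop versions no reader can need: per key, keep the newest version at or
  // before the oldest open snapshot, plus every version newer than that.
  reclaim() {
    const open = [...this.snapshots.values()];
    const oldest = open.length ? Math.min(...open) : this.clock;
    for (const [key, chain] of this.versions) {
      let keepFrom = 0;
      for (let i = 0; i < chain.length; i++) {
        if (chain[i].ts <= oldest) keepFrom = i;
      }
      this.versions.set(key, chain.slice(keepFrom));
    }
  }

  retainedVersions() {
    let count = 0;
    for (const chain of this.versions.values()) count += chain.length;
    return count;
  }
}
```

In this model, one write followed by an open snapshot and five more writes to the same key leaves all six versions retained after `reclaim()`. Closing the snapshot and reclaiming again leaves only the newest version.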

The damage is often caused by one forgotten transaction rather than a workload shift. Because the transaction may be inactive, it produces no opLatencies or ticket utilization signal on its own. The only direct evidence is in the transaction counters and currentOp.

flowchart TD
    A[Transaction opens] --> B[Snapshot pinned]
    B --> C[Old versions retained in cache]
    C --> D[Cache fill rises]
    D --> E[Background eviction cannot free pinned pages]
    E --> F[Dirty ratio climbs]
    F --> G[Application threads forced to evict]
    G --> H[Latency spikes across operations]
    H --> I[Transaction commits or aborts]
    I --> J[Snapshot released]
    J --> K[Eviction resumes normally]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Abandoned transaction after application crash or disconnect | currentInactive > 0 with old open time, no active client | db.currentOp for transactions with idle connections |
| Application logic holding a transaction open across external calls | currentActive > 0 for minutes with the same opid | Application logs and currentOp elapsed time |
| Lock contention triggering abort storms | totalAborted rising rapidly, queue depths growing | Lock wait times in serverStatus().locks |
| Transaction size overwhelming cache capacity | Aborted transactions correlating with cache pressure spikes | currentOp for large uncommitted transactions |
| DDL blocked by a transaction holding collection locks | Index builds or renames stalled on the same namespace | currentOp for operations waiting on locks |

Quick checks

# Check transaction counts and abort ratio
mongosh --quiet --eval 'var t=db.serverStatus().transactions; print("Active: "+t.currentActive+", Inactive: "+t.currentInactive+", Open: "+t.currentOpen+", Aborted: "+t.totalAborted+", Started: "+t.totalStarted);'
# List open transactions and their age
mongosh --quiet --eval 'db.currentOp({"transaction":{"$exists":true}}).inprog.forEach(function(o){print(o.opid+" | "+(o.transaction.timeOpenMicros/1000000).toFixed(1)+"s | "+(o.client||"internal"));})'
# Check cache pressure signals
mongosh --quiet --eval 'var c=db.serverStatus().wiredTiger.cache; var m=c["maximum bytes configured"]; print("Used: "+(100*c["bytes currently in the cache"]/m).toFixed(1)+"%, Dirty: "+(100*c["tracked dirty bytes in the cache"]/m).toFixed(1)+"%, AppEvict: "+c["pages evicted by application threads"]);'
# Check lock wait times by type
mongosh --quiet --eval 'var l=db.serverStatus().locks; for(var t in l){if(l[t].timeAcquiringMicros){for(var m in l[t].timeAcquiringMicros){print(t+" "+m+": "+l[t].timeAcquiringMicros[m]+" µs");}}}'

How to diagnose it

  1. Confirm cache pressure. Check wiredTiger.cache for used ratio above 80% or dirty ratio above 10%, and poll for increments in pages evicted by application threads.
  2. Check serverStatus().transactions. If currentOpen is elevated and currentInactive is greater than zero, a snapshot is pinned by an idle transaction.
  3. Run db.currentOp({"transaction": {"$exists": true}}) to identify the specific transactions, their opid, and their open duration in transaction.timeOpenMicros.
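  4. Compute the abort ratio: totalAborted / totalStarted. Both counters are cumulative since the server started, so compare the deltas between two samples to see the current rate. A ratio above 50% indicates severe contention or repeated cache-pressure aborts.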
  5. Check serverStatus().locks for growing timeAcquiringMicros. Collection-level waits suggest a transaction is blocking other operations on the same namespace.
  6. Review the MongoDB log for transaction abort messages near the time cache pressure began. Aborts due to cache pressure or lifetime limit expiry are logged.
  7. Correlate the transaction start time with the onset of cache pressure in your monitoring system. If the timelines align, the transaction is the root cause.
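The first checks in this list can be folded into one triage function. The sketch below is a minimal example under assumptions: `triageSnapshotPinning` and its result fields are hypothetical names, the input is expected to have the shape of `db.serverStatus()`, and the thresholds are the ones quoted in the steps above. The abort ratio here uses lifetime totals, so it only gives a coarse first reading.

```javascript
// Triage sketch over a document shaped like db.serverStatus().
// triageSnapshotPinning and its result fields are hypothetical names.
function triageSnapshotPinning(status) {
  const cache = status.wiredTiger.cache;
  const txn = status.transactions;

  // Number() is applied because mongosh may return 64-bit counters as Long objects.
  const max = Number(cache["maximum bytes configured"]);
  const usedPct = (100 * Number(cache["bytes currently in the cache"])) / max;
  const dirtyPct = (100 * Number(cache["tracked dirty bytes in the cache"])) / max;

  const started = Number(txn.totalStarted);
  const abortRatio = started > 0 ? Number(txn.totalAborted) / started : 0;

  const findings = [];
  if (usedPct > 80) findings.push("cache used ratio above 80%");
  if (dirtyPct > 10) findings.push("cache dirty ratio above 10%");
  if (Number(txn.currentInactive) > 0) findings.push("idle transaction open");
  if (abortRatio > 0.5) findings.push("abort ratio above 50%");

  return { usedPct, dirtyPct, abortRatio, findings };
}
```

In mongosh this could be called as `triageSnapshotPinning(db.serverStatus())`. An empty `findings` array does not rule out a pinned snapshot; it only means none of these thresholds were crossed at the moment of sampling.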

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| currentInactive | Idle transactions hold snapshots silently with no forward progress | Sustained > 0 for longer than 60 seconds |
| currentOpen | Total open transactions consuming snapshot resources | Sustained > 10 in non-transaction-heavy workloads |
| Abort ratio (totalAborted / totalStarted) | High abort rate indicates contention or cache-pressure kills | Ratio > 50% |
| WiredTiger dirty ratio | Rises when pinned snapshots block eviction | > 10% sustained |
| pages evicted by application threads | Direct signal that users are feeling cache pressure | Any sustained increment |
| Lock wait times (timeAcquiringMicros) | Reveals whether transactions are blocking DDL or other writes | Growth in collection or global lock waits |
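Several of these signals are cumulative counters, so a warning is easier to judge from the change between two samples than from a single reading. The following is a minimal sketch of that comparison. `compareSamples` is a hypothetical helper, both arguments are assumed to be `db.serverStatus()` documents taken some interval apart, and treating "inactive in both samples" as "sustained" is a simplification.

```javascript
// Compare two serverStatus()-shaped samples taken some interval apart.
// compareSamples is a hypothetical helper; thresholds follow the table above.
function compareSamples(prev, curr) {
  // Number() is applied because mongosh may return 64-bit counters as Long objects.
  const delta = (pick) => Number(pick(curr)) - Number(pick(prev));

  const started = delta((s) => s.transactions.totalStarted);
  const aborted = delta((s) => s.transactions.totalAborted);
  const appEvicted = delta(
    (s) => s.wiredTiger.cache["pages evicted by application threads"]
  );

  // A negative delta means the counters reset (for example after a restart);
  // the guards below simply skip the check in that case.
  const abortRatio = started > 0 ? aborted / started : 0;

  const warnings = [];
  if (abortRatio > 0.5) warnings.push("abort ratio above 50% in this interval");
  if (appEvicted > 0) warnings.push("application threads evicted pages");
  if (
    Number(prev.transactions.currentInactive) > 0 &&
    Number(curr.transactions.currentInactive) > 0
  ) {
    warnings.push("inactive transactions present in both samples");
  }
  return { abortRatio, appEvicted, warnings };
}
```

A monitoring loop would keep the previous sample, call `compareSamples(previous, db.serverStatus())` on each tick, and then replace the stored sample.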

Fixes

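Kill abandoned or idle transactions. If currentOp shows a transaction with no active client and a high timeOpenMicros, abort it. When the entry has an opid, use db.killOp(opid). Idle transactions are typically reported as idle sessions that carry an lsid but no opid; end those with the killSessions command, passing the session id from lsid. The snapshot is released once the abort completes, and eviction resumes. Warning: this is disruptive. Any work done by that transaction rolls back.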
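Choosing which transactions to end can be scripted before anything is actually killed. The sketch below only builds a list of proposed actions and never contacts the server. `selectKillActions` is a hypothetical helper, and it assumes that entries reported for idle sessions may carry an `lsid` but no `opid`, in which case `killSessions` is proposed instead of `killOp`.

```javascript
// Build a list of proposed kill actions from db.currentOp().inprog entries.
// selectKillActions is a hypothetical helper; it does not run any command.
function selectKillActions(inprog, minOpenSeconds) {
  const actions = [];
  for (const op of inprog) {
    // Only consider transactions that are open but not executing.
    if (!op.transaction || op.active !== false) continue;

    const openSeconds = Number(op.transaction.timeOpenMicros) / 1e6;
    if (openSeconds < minOpenSeconds) continue;

    if (op.opid !== undefined && op.opid !== null) {
      actions.push({ openSeconds, method: "killOp", arg: op.opid });
    } else if (op.lsid && op.lsid.id) {
      // Assumption: no opid means an idle session, addressed by session id.
      actions.push({
        openSeconds,
        method: "killSessions",
        arg: { killSessions: [{ id: op.lsid.id }] }
      });
    }
  }
  // Oldest transactions first, since they pin the most history.
  return actions.sort((a, b) => b.openSeconds - a.openSeconds);
}
```

After reviewing the list, a `killOp` action would be applied with `db.killOp(action.arg)` and a `killSessions` action with `db.adminCommand(action.arg)`. Keeping the selection separate from the execution leaves room for a human check, which matters because every kill rolls back work.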

Reduce transaction lifetime if transactions are legitimate but too large. Break large transactions into smaller batches that commit more frequently. Each smaller transaction releases its snapshot sooner, reducing the pinned window. The tradeoff is that you may need application-level idempotency to handle partial completion.
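A batching loop along these lines might look like the sketch below. `runInBatches` is a hypothetical helper, and `runBatch` is a caller-supplied function assumed to open, apply, and commit one short transaction per call. The returned `failedAt` offset is there so the caller can resume after a partial completion, which is where application-level idempotency comes in.

```javascript
// Split a large unit of work into short transactions.
// runInBatches is a hypothetical helper; runBatch is supplied by the caller
// and is expected to open, apply, and commit one transaction per call.
function runInBatches(items, batchSize, runBatch) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  let committed = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    try {
      runBatch(batch);
    } catch (error) {
      // Earlier batches stay committed; report where to resume from.
      return { committed, failedAt: start, error };
    }
    committed += batch.length;
  }
  return { committed, failedAt: null, error: null };
}
```

In mongosh, `runBatch` could wrap `session.startTransaction()`, the writes for that batch, and `session.commitTransaction()`. The batch size is a tuning choice: smaller batches release snapshots sooner but add commit overhead.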

Address lock contention. If totalAborted is rising and lock waits are growing, identify the operation holding the contested lock via currentOp. Kill the blocker if it is non-essential, or reschedule DDL operations to maintenance windows when transactions are not active. The tradeoff is temporary disruption to the blocking workload.

Keep transactionLifetimeLimitSeconds at the default of 60 seconds. Raising this limit masks the root cause and allows bad transactions to compound cache pressure over longer periods. Treat it as a circuit breaker, not a performance tuning knob. The tradeoff is that legitimate long-running analytical transactions must be broken into smaller units.
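To confirm that the limit has not been raised on a given node, the current value can be read with the getParameter command and compared against the default. The helper below is a small sketch: `checkLifetimeLimit` is a hypothetical name, and its input is assumed to be the document returned by `db.adminCommand({ getParameter: 1, transactionLifetimeLimitSeconds: 1 })`.

```javascript
// Compare the configured transaction lifetime limit against the default.
// checkLifetimeLimit is a hypothetical helper operating on a getParameter result.
const DEFAULT_LIFETIME_LIMIT_SECONDS = 60;

function checkLifetimeLimit(paramDoc) {
  const value = Number(paramDoc.transactionLifetimeLimitSeconds);
  return {
    value,
    raised: value > DEFAULT_LIFETIME_LIMIT_SECONDS
  };
}
```

A result with `raised: true` is a prompt to find out who changed the setting and why, not proof of a problem by itself.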

Prevention

  • Monitor currentInactive, not just currentActive. A transaction that is open but not executing is invisible to many workload dashboards yet is the most dangerous state for cache health.
  • Alert on any transaction open longer than 60 seconds. Even one transaction crossing this threshold is worth investigating because it can pin enough pages to start eviction pressure.
  • Review application code to ensure transactions are opened immediately before the first operation and committed or aborted as soon as the last operation completes. Never hold a transaction open across network calls, user input, or external API requests.
  • Monitor the abort ratio trend. A rising totalAborted relative to totalStarted signals contention or cache pressure before latency degrades.
  • Run DDL during maintenance windows. Transactions hold locks on the collections they touch, so DDL such as index builds can queue behind an open transaction, and new operations on the same namespace can then queue behind the pending DDL.
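The rule about keeping external work outside the transaction can be expressed as an ordering: do the slow call first, then open the transaction, write, and commit or abort right away. The sketch below assumes a mongosh-style session object whose `startTransaction`, `commitTransaction`, and `abortTransaction` methods behave synchronously; with a driver that returns promises, the calls would need to be awaited. `transactAfterExternalWork`, `fetchExternal`, and `applyWrites` are hypothetical names.

```javascript
// Keep slow external work outside the transaction window.
// transactAfterExternalWork is a hypothetical helper; fetchExternal and
// applyWrites are caller-supplied functions.
function transactAfterExternalWork(fetchExternal, session, applyWrites) {
  // The slow call runs while no transaction (and no snapshot) is open.
  const input = fetchExternal();

  session.startTransaction();
  try {
    applyWrites(input);
    session.commitTransaction();
  } catch (err) {
    // Release the snapshot immediately instead of leaving it to expire.
    session.abortTransaction();
    throw err;
  }
}
```

The transaction then stays open only for the duration of the writes themselves, however long the external call takes.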

How Netdata helps

  • Correlate transactions.currentOpen and transactions.currentInactive with wiredTiger.cache.dirty_ratio and wiredTiger.cache.pages_evicted_by_app_threads on the same timeline to reveal snapshot pinning.
  • Alert on sustained currentInactive greater than zero or on transaction duration anomalies.
  • Track totalAborted rate relative to totalStarted to detect contention before it cascades into ticket exhaustion.
  • Surface cache pressure cascade signals, including dirty ratio, eviction rates, and ticket utilization, alongside transaction metrics in a single view.