MongoDB WriteConflict errors: optimistic concurrency retries under contention

WriteConflict exceptions (error 112) in application logs, or unexplained write latency spikes, point to document-level contention under WiredTiger optimistic concurrency control. Outside of transactions, MongoDB retries single-document writes internally; the client sees slower responses rather than errors. Inside multi-document transactions, MongoDB aborts immediately and returns error 112. Either way, the root cause is concurrent writers targeting the same document.

What this means

WiredTiger uses optimistic concurrency control at the document level. Two concurrent writes to the same document do not block indefinitely; one proceeds and the other encounters a conflict.

For single-document writes outside a transaction, MongoDB applies wait-on-conflict semantics: the server retries internally with backoff. The call usually succeeds, but latency rises as retries accumulate. For multi-document transactions, MongoDB uses fail-on-conflict semantics. WiredTiger returns a WriteConflict immediately and aborts the transaction. The server does not auto-retry. The application or driver must detect the transient error and replay the entire transaction.

High WriteConflict rates mean hot-document contention. Causes include multiple writers on the same document, long-running transactions pinning snapshots, or application-level read-modify-write races.

flowchart TD
    A[Client sends write] --> B{Inside multi-doc transaction?}
    B -->|No| C[WiredTiger wait-on-conflict]
    C --> D[Server retries internally with backoff]
    D --> E[Client sees success or timeout]
    B -->|Yes| F[WiredTiger fail-on-conflict]
    F --> G[Server returns WriteConflict 112]
    G --> H[Driver or app must retry entire transaction]
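
On the transaction branch, the application has to recognise the failure as transient before replaying. The following is a minimal sketch of that check, assuming the error object mirrors the server's error response with a numeric `code` and an optional `errorLabels` array. The helper name `isRetryableTransactionError` is hypothetical and not part of any driver.

```javascript
// Hypothetical helper: decide whether a failed transaction should be replayed.
// Assumes `err` carries a numeric `code` and an optional `errorLabels` array,
// as in the server's error response.
const WRITE_CONFLICT_CODE = 112;

function isRetryableTransactionError(err) {
  if (!err) return false;
  const labels = Array.isArray(err.errorLabels) ? err.errorLabels : [];
  // The TransientTransactionError label is the general signal; the explicit
  // code check covers a WriteConflict reported without labels.
  return labels.includes("TransientTransactionError") || err.code === WRITE_CONFLICT_CODE;
}

// A WriteConflict is retryable; a duplicate key error (11000) is not.
console.log(isRetryableTransactionError({ code: 112, errorLabels: ["TransientTransactionError"] }));
console.log(isRetryableTransactionError({ code: 11000 }));
```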

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Hot-document updates | Error 112 in logs, or rising asserts.user, with slow writes targeting the same _id or shard key | db.currentOp() filtered by ns to find repeated access to one document |
| Long-running multi-document transactions | Rising transactions.totalAborted, transactions open longer than 60 seconds, growing queue depths | db.currentOp({ "transaction": { "$exists": true } }) for timeOpenMicros |
| Read-modify-write races | Application fetches a document, mutates it in memory, then writes it back without atomic operators | Profiler or logs for find followed by updateOne on the same ns without a version predicate |
| Transaction retry storms | Abort rate exceeds commit rate, latency spikes correlate with application error logs | db.serverStatus().transactions abort-to-commit ratio over time |

Quick checks

// User assertion rate (includes all user errors, not only WriteConflict)
var a = db.serverStatus().asserts;
print("User assertions: " + a.user);

// Transaction abort versus commit balance
var t = db.serverStatus().transactions;
print("Aborted: " + t.totalAborted + ", Committed: " + t.totalCommitted);

// Long-running transactions and their age
db.currentOp({ "transaction": { "$exists": true } }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.ns + " | " + (op.transaction.timeOpenMicros / 1000000) + "s");
});

// Writer queue depth
db.serverStatus().globalLock.currentQueue;

// Write ticket availability
// (newer releases, MongoDB 8.0 onward, may report this under queues.execution instead)
var tickets = db.serverStatus().wiredTiger.concurrentTransactions;
print("Write tickets available: " + tickets.write.available + " / " + tickets.write.totalTickets);

# Shell, not mongosh: check logs for WriteConflict evidence (path varies by installation)
grep -iE "writeconflict|error.*112" /var/log/mongodb/mongod.log | tail -20

// Recent slow operations for contention patterns
// Requires profiling enabled
db.system.profile.find().sort({ ts: -1 }).limit(20).pretty();
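
To turn the db.currentOp() output into a shortlist of contested namespaces, one option is to count in-progress writes per ns. This is a sketch only: `countWritesByNamespace` is a hypothetical helper, and it assumes each entry carries the `op` and `ns` fields that db.currentOp().inprog normally reports. Operations reported as "command" (findAndModify, for example) are not counted here.

```javascript
// Hypothetical helper: count in-progress write operations per namespace.
// `inprog` is assumed to be the array found at db.currentOp().inprog.
function countWritesByNamespace(inprog) {
  const writeOps = new Set(["insert", "update", "remove"]);
  const counts = {};
  for (const op of inprog) {
    // Skip entries with no namespace and anything that is not a plain write.
    if (!op.ns || !writeOps.has(op.op)) continue;
    counts[op.ns] = (counts[op.ns] || 0) + 1;
  }
  // Return [namespace, count] pairs, busiest namespace first.
  return Object.entries(counts).sort((x, y) => y[1] - x[1]);
}

// In mongosh this might be called as:
//   countWritesByNamespace(db.currentOp().inprog)
```

A namespace that stays at the top across several samples is a candidate for the hot-document checks that follow.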

How to diagnose it

  1. Quantify the error rate. Sample asserts.user at two points and compute the delta. A sustained rise signals growing user errors, but this counter includes all user assertions, not only WriteConflicts. For a direct count, sample metrics.operation.writeConflicts from db.serverStatus() the same way, and correlate with MongoDB logs for “WriteConflict” or code 112 to confirm the pattern.

  2. Determine whether the problem is inside transactions. Compare transactions.totalAborted against transactions.totalCommitted. An abort rate above baseline with growing currentOpen means transactions are colliding and retrying.

  3. Find the contested namespace. Use db.currentOp() to list active operations. Look for many write operations on the same ns, especially with identical query shapes or document keys. If multiple operations share the same planSummary and target the same _id or shard key, you have identified the hot document. Long-running transactions will show high timeOpenMicros.

  4. Inspect the slow query log and profiler. Look for update operations with a nonzero writeConflicts count or high lock wait time. Read-modify-write patterns appear as a find followed shortly by an updateOne on the same collection without an atomic operator such as $inc or $set. A COLLSCAN inside a transaction keeps the transaction open longer and raises collision probability.

  5. Check for cascading saturation. Write conflicts consume tickets and hold snapshots. Verify whether wiredTiger.concurrentTransactions.write.available has dropped below 25 percent of total, or whether globalLock.currentQueue.writers is nonzero. If tickets are exhausted, the problem has moved from document contention to system-wide queuing. If writers queue while tickets remain available, look for CPU or storage saturation instead.

  6. Correlate with cache pressure. Long-running transactions pin WiredTiger snapshots, which prevents eviction. Compute the dirty ratio and check application-thread evictions:

    var s = db.serverStatus().wiredTiger.cache;
    var dirtyRatio = s["tracked dirty bytes in the cache"] / s["maximum bytes configured"];
    var appEvictions = s["pages evicted by application threads"];
    print("Dirty ratio: " + dirtyRatio + ", App evictions: " + appEvictions);
    

    If the dirty ratio or application-thread eviction count is rising, the WriteConflict storm is causing secondary cache pressure.
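
Steps 1, 2, and 5 all reduce to arithmetic on db.serverStatus() output, which can be sketched as two small functions. The function names and the returned labels are hypothetical. The field paths are the ones named in the steps above, and the 25 percent threshold is the one from step 5.

```javascript
// Hypothetical helper for steps 1 and 2: per-second rates between two
// db.serverStatus() snapshots taken `seconds` apart.
// Number() is used because some counters may arrive as 64-bit wrapper objects.
function counterRates(before, after, seconds) {
  const perSecond = (a, b) => (Number(b) - Number(a)) / seconds;
  return {
    userAssertsPerSec: perSecond(before.asserts.user, after.asserts.user),
    abortsPerSec: perSecond(before.transactions.totalAborted, after.transactions.totalAborted),
    commitsPerSec: perSecond(before.transactions.totalCommitted, after.transactions.totalCommitted),
  };
}

// Hypothetical helper for step 5: separate ticket exhaustion from writer
// queuing that has some other cause (CPU or storage saturation).
function classifySaturation(status) {
  const write = status.wiredTiger.concurrentTransactions.write;
  const ticketRatio = Number(write.available) / Number(write.totalTickets);
  const queuedWriters = Number(status.globalLock.currentQueue.writers);
  if (ticketRatio < 0.25) return "ticket-exhaustion";
  if (queuedWriters > 0) return "queueing-with-tickets-available";
  return "ok";
}

// In mongosh, two snapshots might be collected as:
//   var s1 = db.serverStatus(); sleep(60000); var s2 = db.serverStatus();
//   printjson(counterRates(s1, s2, 60)); print(classifySaturation(s2));
```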

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| asserts.user rate | Rising rate signals growing user errors, including WriteConflict | Sustained increase from baseline |
| transactions.totalAborted vs totalCommitted | Reveals transaction-level retry storms | Abort rate consistently above 50 percent of commit rate |
| currentOp max transaction age | Long transactions pin snapshots and force retries on other writers | Any transaction open longer than 60 seconds |
| globalLock.currentQueue.writers | Queued writers indicate contention is becoming system-wide | Sustained nonzero queue |
| wiredTiger.concurrentTransactions.write.available | Ticket exhaustion turns document contention into global latency | Below 25 percent of total tickets sustained |
| Slow query log / system.profile rate | Retries increase operation duration | Sudden spike in slow writes |
| opLatencies.writes average | Average write latency rises when operations retry internally or abort | Latency doubling from baseline |
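
Several of these warning signs can be evaluated from two serverStatus snapshots. The sketch below is illustrative: `averageWriteLatencyMicros` and `warningSigns` are hypothetical names, and the baseline latency and the maximum transaction age are assumed to be supplied by the caller (the age from currentOp's timeOpenMicros, for instance). opLatencies.writes reports a cumulative latency in microseconds and a cumulative ops count, so the average over an interval is the ratio of the two deltas.

```javascript
// Average write latency (microseconds) over the interval between two snapshots.
function averageWriteLatencyMicros(before, after) {
  const ops = Number(after.opLatencies.writes.ops) - Number(before.opLatencies.writes.ops);
  if (ops <= 0) return null; // no writes completed in the interval
  const latency =
    Number(after.opLatencies.writes.latency) - Number(before.opLatencies.writes.latency);
  return latency / ops;
}

// Hypothetical helper: list which of the table's thresholds were crossed.
function warningSigns(before, after, baselineLatencyMicros, maxTransactionAgeSecs) {
  const signs = [];
  const aborted =
    Number(after.transactions.totalAborted) - Number(before.transactions.totalAborted);
  const committed =
    Number(after.transactions.totalCommitted) - Number(before.transactions.totalCommitted);
  // Abort rate above 50 percent of commit rate.
  if (aborted > 0 && aborted > 0.5 * committed) signs.push("abort-ratio");
  // Any transaction open longer than 60 seconds.
  if (maxTransactionAgeSecs > 60) signs.push("transaction-age");
  // Average write latency at least double the baseline.
  const avg = averageWriteLatencyMicros(before, after);
  if (avg !== null && avg >= 2 * baselineLatencyMicros) signs.push("write-latency");
  return signs;
}
```

A single crossing is only a snapshot. The table's thresholds refer to sustained conditions, so the check is meant to run repeatedly.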

Fixes

Hot-document contention

Replace read-modify-write loops with atomic operators. Use $inc or $set so the server applies the change in a single step. For updates that must read before writing, add a version predicate to updateOne or findOneAndUpdate so a stale write matches nothing instead of overwriting a concurrent change, and project only the fields needed so the gap between the read and the write stays as short as possible.
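
A version-checked write can be sketched as a helper that builds the filter and update documents. Everything named here is an assumption for illustration: the `version` field, the `versionedUpdate` helper, and the `orders` collection in the usage comment.

```javascript
// Hypothetical helper: build the filter and update for a version-checked write.
// Assumes each document carries a numeric `version` field.
function versionedUpdate(id, expectedVersion, changes) {
  if (Object.prototype.hasOwnProperty.call(changes, "version")) {
    // $set and $inc on the same field would be rejected by the server.
    throw new Error("changes must not set the version field directly");
  }
  return {
    // Matches only if nobody has bumped the version since the read.
    filter: { _id: id, version: expectedVersion },
    update: { $set: changes, $inc: { version: 1 } },
  };
}

// In mongosh this might be used as:
//   var spec = versionedUpdate(doc._id, doc.version, { status: "shipped" });
//   var res = db.orders.updateOne(spec.filter, spec.update);
//   if (res.matchedCount === 0) { /* another writer won: re-read and retry */ }
```

A matchedCount of zero tells the caller its copy was stale, which turns a silent lost update into an explicit retry.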

For inherently serial data such as global counters or leaderboards, distribute the hot document into N bucket documents selected by a hash or random value, then aggregate at read time. This spreads contention across multiple documents.
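
The bucket approach might look like the following sketch. The `_id` layout ("name:bucket"), the `counter` and `count` field names, and the `counters` collection in the usage comment are all assumptions made for illustration.

```javascript
// Hypothetical helper: build an upsert that increments one randomly chosen
// bucket document out of `bucketCount` for the given counter.
// `random` is injectable so the choice can be made deterministic in tests.
function bucketIncrement(counterName, bucketCount, amount, random = Math.random) {
  const bucket = Math.floor(random() * bucketCount);
  return {
    filter: { _id: counterName + ":" + bucket },
    update: {
      $inc: { count: amount },
      // Tag the bucket on first insert so reads can find every bucket.
      $setOnInsert: { counter: counterName },
    },
    options: { upsert: true },
  };
}

// Read side: sum the `count` field across all bucket documents.
function sumBuckets(bucketDocs) {
  return bucketDocs.reduce((total, doc) => total + doc.count, 0);
}

// In mongosh this might be used as:
//   var w = bucketIncrement("pageviews", 16, 1);
//   db.counters.updateOne(w.filter, w.update, w.options);
//   sumBuckets(db.counters.find({ counter: "pageviews" }).toArray());
```

More buckets means fewer collisions per write but more documents to read and sum, so N is a trade-off between write contention and read cost.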

Long-running transactions

Reduce transaction scope. Split large batch updates into smaller transactions that commit faster. Ensure transactionLifetimeLimitSeconds (60 seconds by default) is set appropriately for your workload so the server aborts runaway transactions.
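
Splitting a batch can be as simple as chunking the list of work items and running one transaction per chunk. This is a sketch under one important assumption: the batch does not need to be atomic as a whole, because each chunk commits independently. The `chunk` helper and the `runInTransaction` callback in the usage comment are hypothetical.

```javascript
// Hypothetical helper: split a list of work items into chunks of at most `size`.
function chunk(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Usage idea: one short transaction per chunk instead of one long one.
//   for (const ids of chunk(allIds, 100)) {
//     runInTransaction(ids); // hypothetical callback that opens and commits a transaction
//   }
```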

Warning: Killing operations aborts active work and can disrupt legitimate clients.

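If a transaction remains open after the application has finished, kill it with db.killOp(opid) to release its snapshot and locks. An idle transaction may appear in db.currentOp() without an opid; in that case, end its session with the killSessions command, passing the session id from the lsid shown for the transaction, which also aborts the open transaction.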

Retry storms in application code

Ensure the application uses jittered exponential backoff between transaction retries. Without backoff, multiple clients retry simultaneously after a conflict, creating thundering-herd behavior that amplifies the problem. Cap total retry attempts to prevent infinite loops if an underlying hot document stays contested. Do not rely on naive immediate retry loops.

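Ensure retries are idempotent. A transaction commits atomically, but when the outcome of a commit is unknown (for example, after a network error during commit), or when the retried code performs side effects outside the transaction, a replay can duplicate work unless the application tracks transaction state or relies on unique indexes.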
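
A capped retry loop with jittered exponential backoff might be sketched as follows. The helper names, the default base delay (50 ms), the cap (2000 ms), and the limit of five attempts are illustrative assumptions, not recommendations from MongoDB. Driver helpers such as withTransaction already retry transient errors, so a loop like this applies mainly to hand-rolled retry code.

```javascript
// "Full jitter": pick a random delay between 0 and an exponentially growing
// ceiling, so clients that conflicted together do not retry together.
function backoffDelayMs(attempt, baseMs = 50, capMs = 2000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumes transient failures carry the TransientTransactionError label.
const defaultIsRetryable = (err) =>
  Boolean(err) &&
  Array.isArray(err.errorLabels) &&
  err.errorLabels.includes("TransientTransactionError");

// Hypothetical wrapper: `runOnce` must start, run, and commit one whole
// transaction, because the entire transaction is replayed on each retry.
async function retryTransaction(runOnce, options = {}) {
  const { maxAttempts = 5, isRetryable = defaultIsRetryable, sleep = defaultSleep } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await runOnce();
    } catch (err) {
      const outOfAttempts = attempt + 1 >= maxAttempts;
      // Give up on non-transient errors or once the cap is reached.
      if (outOfAttempts || !isRetryable(err)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The cap on attempts bounds the total time a caller can spend on a document that stays contested, and the jitter spreads the retries of clients that collided at the same moment.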

Storage-layer pressure

If WriteConflicts correlate with ticket exhaustion or cache pressure, reduce concurrent write load temporarily. Pause batch jobs or throttle ingestion until ticket availability recovers.

Warning: Killing operations aborts active work.

Kill unnecessary long-running operations only as a last resort to free tickets immediately.

Prevention

  • Monitor asserts.user deltas continuously. Anomalous rates reveal contention before application timeouts trigger.
  • Track the ratio of transactions.totalAborted to totalCommitted. A rising ratio is an early warning that transactions in the workload are colliding more often.
  • Audit query patterns quarterly for read-modify-write sequences that could be replaced with atomic updates.
  • Keep transactions short and deterministic. Avoid transactions that scan large ranges or hold cursors open.
  • Monitor currentOp for operations approaching your transaction timeout threshold.
  • Set client-side transaction timeouts lower than the server’s transactionLifetimeLimitSeconds so applications fail fast rather than holding snapshots until the server aborts them.

How Netdata helps

  • Netdata charts asserts.user deltas, exposing WriteConflict storms that internal retries hide from application-level error counters.
  • Netdata correlates transaction abort rates with globalLock.currentQueue and WiredTiger ticket availability, helping distinguish document contention from storage saturation.
  • Netdata tracks opLatencies average write latency alongside queue depths, revealing retry-driven latency spikes before they trigger application timeouts.
  • Netdata monitors WiredTiger cache dirty ratio and application-thread evictions, alerting when transaction snapshots pin cache and amplify contention into a cascade.