MongoDB w:majority write concern timeout (wtimeout): replication lag and at-risk writes

Your application logs show write concern timeouts, or worse, the driver returned an error that the application swallowed before moving on. The write succeeded on the primary, but the cluster could not confirm it across a majority of data-bearing members before the deadline expired. If the primary fails before the write reaches a majority, that write can be rolled back and lost. This article explains how to diagnose the root cause (replication lag, a down secondary, or network saturation) and what to do before the next primary failure.

What this means

When a write uses w: "majority" with a wtimeout value, MongoDB returns success to the client only after the write is durable on the primary and acknowledged by a majority of voting, data-bearing members. If wtimeout (specified in milliseconds) expires first, MongoDB returns a write concern error. The write is not undone on the primary; it continues replicating in the background. Until a majority acknowledges it, however, the write is at risk of rollback if the primary fails. In a healthy cluster, the wtimeouts counter should not be increasing. Any sustained non-zero rate means your durability guarantees are compromised.
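
As a rough illustration of what the application sees, the sketch below inspects a raw write command reply for a write concern error. It assumes the reply shape the server uses for `writeConcernError` (code 64, `WriteConcernFailed`, with `errInfo.wtimeout` set to `true`). The helper name `classifyWriteReply`, the `orders` collection, and the 5000 ms timeout are hypothetical, and most drivers surface the same condition as an exception rather than a reply field.

```javascript
// Hypothetical helper: classify a raw write command reply.
// Assumes the server's writeConcernError shape (code 64, errInfo.wtimeout).
// Ordinary write errors (reply.writeErrors) are not examined here.
function classifyWriteReply(reply) {
  const wce = reply.writeConcernError;
  if (!wce) {
    return "acknowledged";
  }
  if (wce.code === 64 && wce.errInfo && wce.errInfo.wtimeout === true) {
    // Applied on the primary, but a majority did not confirm in time.
    return "applied-but-at-risk";
  }
  return "other-write-concern-error";
}

// In mongosh, a reply could come from a raw command such as:
//   const reply = db.runCommand({
//     insert: "orders",
//     documents: [{ sku: "abc", qty: 1 }],
//     writeConcern: { w: "majority", wtimeout: 5000 }
//   });
//   print(classifyWriteReply(reply));
```

The point of the "applied-but-at-risk" label is that the write should not be treated as failed and blindly retried, nor as safely committed.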

flowchart TD
    A[Application sends write with w majority] --> B[Primary applies write]
    B --> C{Majority ack within wtimeout}
    C -->|Yes| D[Write confirmed durable]
    C -->|No| E[Write concern timeout error]
    E --> F[Write applied but not yet majority committed]
    F --> G[At risk of rollback if primary crashes]
    B --> H[Replicates to secondaries]
    H --> C

Common causes

Cause	What it looks like	First thing to check
Secondary replication lag	`wtimeout` spikes alongside elevated replication lag during bulk writes or traffic surges	`rs.printSecondaryReplicationInfo()`
Secondary member down or unreachable	One or more members show `health: 0` in `rs.status()`, with a state of `DOWN` (displayed as `(not reachable/healthy)` in `stateStr`) or `UNKNOWN`; the set may still have a primary but with no redundancy margin	`rs.status().members[].stateStr`
Network latency or partition between members	Heartbeat messages show delays; replication throughput drops but member state may still read `SECONDARY`	`rs.status().members[].lastHeartbeatMessage`
Flow control throttling the primary	`serverStatus().flowControl.isLagged` is `true`; lag stabilizes because the primary is being slowed, yet writes still time out	`db.serverStatus().flowControl`
Oplog window too small for catch-up	The oplog window is shrinking; a lagged secondary is approaching the oldest oplog entry	`rs.printReplicationInfo()`

Quick checks

// Check cumulative write concern timeouts since process start
db.serverStatus().metrics.getLastError.wtimeouts

// Check replica set member states and health
rs.status().members.forEach(function(m) {
  print(m.name + " -> " + m.stateStr + " (health: " + m.health + ")");
});

// Check replication lag per secondary
rs.printSecondaryReplicationInfo()

// Check whether flow control is throttling writes
db.serverStatus().flowControl

// Check for operations currently waiting for locks
db.currentOp({ "waitingForLock": true }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.ns);
});

// Check oplog window and catch-up margin
rs.printReplicationInfo()
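
Because the `wtimeouts` counter is cumulative since process start, a rate is usually more telling than the raw value. The sketch below turns two samples into a per-minute rate. The helper name `wtimeoutRate` and the sample object shape (`atMs`, `wtimeouts`) are assumptions made for illustration.

```javascript
// Hypothetical helper: convert two samples of the cumulative counter into a rate.
// Number(String(v)) tolerates counters returned as 64-bit integer wrappers.
function wtimeoutRate(first, second) {
  const elapsedSec = (second.atMs - first.atMs) / 1000;
  if (elapsedSec <= 0) {
    throw new Error("samples must be taken in chronological order");
  }
  const delta = Number(String(second.wtimeouts)) - Number(String(first.wtimeouts));
  if (delta < 0) {
    // The counter went backwards, so the process restarted between samples.
    return { delta: null, perMinute: null, counterReset: true };
  }
  return { delta: delta, perMinute: (delta * 60) / elapsedSec, counterReset: false };
}

// mongosh usage (sample, wait 60 seconds, sample again):
//   function sample() {
//     return {
//       atMs: Date.now(),
//       wtimeouts: db.serverStatus().metrics.getLastError.wtimeouts
//     };
//   }
//   const a = sample(); sleep(60000); const b = sample();
//   printjson(wtimeoutRate(a, b));
```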

How to diagnose it

Confirm active timeouts. Sample db.serverStatus().metrics.getLastError.wtimeouts twice, 60 seconds apart, and compute the delta. A positive delta confirms the cluster is actively timing out majority writes.
Identify lagging or unhealthy secondaries. Run rs.status() and compare optimeDate between the primary and each secondary. Members in RECOVERING, DOWN, or STARTUP2 cannot acknowledge majority writes.
Determine if the issue is member loss or replication speed. If a secondary is healthy but lagged, check its disk I/O and oplog application rate. If the member is down, assess whether the remaining topology can still form a majority safely.
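Check flow control status. If isLagged is true, MongoDB is intentionally throttling writes on the primary to keep the majority commit point from falling further behind. Flow control (MongoDB 4.2 and later) targets the lag set by flowControlTargetLagSeconds, which defaults to 10 seconds. The write concern timeout is a side effect of that protection.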
Correlate with primary write volume. A spike in opcounters or large multi-document transactions can overwhelm secondary apply capacity and push lag beyond the wtimeout threshold.
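Inspect the oplog window. If the oplog window (the "log length start to end" value reported by rs.printReplicationInfo()) is shrinking, subtract the current lag from the window to estimate how long a secondary has before it falls off the oplog and requires a full initial sync.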
Verify network health between members. High round-trip time or packet loss reduces replication throughput without showing as a member state change.
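
The lag comparison and the window-minus-lag estimate from the steps above can be sketched as two small helpers. They operate on the document returned by `rs.status()` and on a window length in seconds. The helper names `secondaryLagSeconds` and `catchUpMarginSeconds` are hypothetical, and the code assumes `optimeDate` arrives as a JavaScript `Date`, as it does in mongosh.

```javascript
// Hypothetical helper: per-member lag in seconds, derived from rs.status().
function secondaryLagSeconds(status) {
  const primary = status.members.find(function (m) {
    return m.stateStr === "PRIMARY";
  });
  if (!primary) {
    return null; // no primary visible, so lag cannot be measured
  }
  return status.members
    .filter(function (m) {
      return m.stateStr !== "PRIMARY" && m.stateStr !== "ARBITER";
    })
    .map(function (m) {
      // Unreachable members report a meaningless optime, so skip the math.
      const canMeasure = m.health === 1 && m.optimeDate instanceof Date;
      return {
        name: m.name,
        stateStr: m.stateStr,
        lagSec: canMeasure
          ? (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000
          : null
      };
    });
}

// Hypothetical helper: seconds left before a lagged member falls off the oplog.
function catchUpMarginSeconds(oplogWindowSec, lagSec) {
  return oplogWindowSec - lagSec;
}

// mongosh usage:
//   const lags = secondaryLagSeconds(rs.status());
//   const windowSec = db.getReplicationInfo().timeDiff; // oplog window in seconds
//   lags.forEach(function (l) {
//     if (l.lagSec !== null) {
//       print(l.name + " margin: " + catchUpMarginSeconds(windowSec, l.lagSec) + "s");
//     }
//   });
```

A margin that keeps shrinking between samples is the signal to act before the member needs a full initial sync.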

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`metrics.getLastError.wtimeouts`	Direct count of writes that failed to achieve majority durability	Any sustained non-zero rate
Replication lag	When the members needed for a majority lag by more than the `wtimeout` value, majority writes will time out	Sustained lag > 10 seconds, or trending upward
Replica set member state	Lost members remove redundancy and can eliminate majority safety	Any data-bearing member in `DOWN`, `UNKNOWN`, or `RECOVERING` for > 2 minutes
Oplog window	Determines how long a secondary has to catch up before requiring full resync	Window < 12 hours
Flow control `isLagged`	Indicates the primary is being throttled to protect secondaries	`isLagged` is `true` and `timeAcquiringMicros` is growing
Primary write throughput (`opcounters`)	Surges in writes can outpace secondary apply rates	Sustained > 3x baseline without corresponding apply rate increase
Journal sync latency	Slow storage on the primary or a secondary delays durable acknowledgment	Average > 30 ms sustained
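
As one way to apply these thresholds, the sketch below checks a snapshot of already-collected values against the warning signs in the table. Every field name on the snapshot object (`wtimeoutsPerMinute`, `maxLagSec`, and so on) is hypothetical, and collecting the values is left to whatever monitoring pipeline is in place.

```javascript
// Hypothetical evaluator: thresholds mirror the table above.
// Missing fields compare as false, so partial snapshots are tolerated.
function evaluateSignals(s) {
  const warnings = [];
  if (s.wtimeoutsPerMinute > 0) {
    warnings.push("write concern timeouts are occurring");
  }
  if (s.maxLagSec > 10) {
    warnings.push("replication lag above 10 seconds");
  }
  if (s.unhealthyMemberSec > 120) {
    warnings.push("data-bearing member unhealthy for more than 2 minutes");
  }
  if (s.oplogWindowHours < 12) {
    warnings.push("oplog window below 12 hours");
  }
  if (s.flowControlIsLagged === true && s.timeAcquiringMicrosDelta > 0) {
    warnings.push("flow control is throttling the primary");
  }
  // Simplification: the apply rate on secondaries is not compared here.
  if (s.writeOpsPerSec > 3 * s.baselineWriteOpsPerSec) {
    warnings.push("write throughput above 3x baseline");
  }
  if (s.journalSyncMs > 30) {
    warnings.push("journal sync latency above 30 ms");
  }
  return warnings;
}

// Example snapshot with two problems: lag and a small oplog window.
const exampleWarnings = evaluateSignals({
  wtimeoutsPerMinute: 0,
  maxLagSec: 42,
  oplogWindowHours: 6,
  flowControlIsLagged: false,
  writeOpsPerSec: 1000,
  baselineWriteOpsPerSec: 900,
  journalSyncMs: 4
});
console.log(exampleWarnings);
```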

Fixes

Reduce replication lag

Pause bulk imports, batch jobs, or large transactions until secondaries catch up. If a long-running operation on a secondary is blocking oplog application, identify it with db.currentOp() on the secondary and terminate it with db.killOp() if safe.
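
A minimal sketch of the identification step, assuming the `inprog` array returned by `db.currentOp()`: it lists active operations that have run longer than a chosen threshold, longest first. The helper name `longRunningOps` and the 60-second threshold are hypothetical. Review each candidate before killing anything, since internal replication and system operations can appear in the same list.

```javascript
// Hypothetical helper: pick long-running active operations from currentOp output.
// Number(String(v)) tolerates values returned as 64-bit integer wrappers.
function longRunningOps(inprog, minSeconds) {
  return inprog
    .filter(function (op) {
      return op.active === true && Number(String(op.secs_running)) >= minSeconds;
    })
    .map(function (op) {
      return {
        opid: op.opid,
        op: op.op,
        ns: op.ns,
        secsRunning: Number(String(op.secs_running))
      };
    })
    .sort(function (a, b) {
      return b.secsRunning - a.secsRunning; // longest first
    });
}

// mongosh usage, run while connected to the lagging secondary:
//   longRunningOps(db.currentOp({ active: true }).inprog, 60).forEach(function (o) {
//     print(o.opid + " | " + o.op + " | " + o.ns + " | " + o.secsRunning + "s");
//   });
//   // After confirming an operation is safe to stop:
//   //   db.killOp(<opid>)
```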

Restore a failed secondary

If a member is down due to a crash or network partition, restore connectivity or restart the process. If the secondary has fallen off the oplog, it will enter RECOVERING and require a full initial sync.

Resize the oplog if the window is critically small

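On MongoDB 3.6 and later (WiredTiger storage engine), increase the oplog size with replSetResizeOplog to give lagged members more time to catch up without requiring a resync. The size is specified in megabytes, and the command must be run on each member whose oplog you want to resize.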
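
To pick a new size, one rough approach is to scale the current oplog size by the ratio of the desired window to the observed window. The sketch below does that arithmetic under the assumption that write volume stays roughly constant, so the window grows linearly with size. The helper name `targetOplogSizeMB` and the 24-hour target are illustrative choices.

```javascript
// Hypothetical helper: estimate the oplog size needed for a target window.
// Assumes steady write volume, so window length scales linearly with size.
function targetOplogSizeMB(currentSizeMB, currentWindowHours, targetWindowHours) {
  if (!(currentWindowHours > 0)) {
    throw new Error("current window must be greater than zero");
  }
  const scaled = Math.ceil((currentSizeMB * targetWindowHours) / currentWindowHours);
  // This sketch only ever proposes growing the oplog, never shrinking it.
  return Math.max(scaled, currentSizeMB);
}

// mongosh usage, on each member to be resized:
//   const info = db.getReplicationInfo();   // logSizeMB, timeDiffHours, ...
//   const sizeMB = targetOplogSizeMB(info.logSizeMB, info.timeDiffHours, 24);
//   db.adminCommand({ replSetResizeOplog: 1, size: Double(sizeMB) });
```

Measure the current window during peak write load where possible, since a window observed during a quiet period overstates how long the resized oplog will last.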

Address secondary storage saturation

If secondary disk I/O is the bottleneck, check OS-level storage latency. Consider moving reads off the lagged secondary or stepping down the primary to shift write load after verifying the new primary has healthy secondaries.

Application-side tradeoffs

Reducing write concern durability requirements can eliminate timeout errors, but it increases the risk of data loss during failover. Use this only as a temporary measure while you restore replication health.

Prevention

Size the oplog to maintain at least 24 hours of window during your highest observed write throughput.
Monitor replication lag with an alert threshold well below your wtimeout value.
Avoid running large index builds, bulk imports, or heavy aggregations on secondaries that serve production reads.
Track secondary disk I/O and WiredTiger cache pressure independently. A secondary can appear healthy until a load spike pushes it into lag.
Review application error handling to ensure write concern timeouts are logged and surfaced, not swallowed.

How Netdata helps

Surfaces metrics.getLastError.wtimeouts and replication lag on a unified timeline so you can see whether timeouts correlate with lag spikes or member state changes.
Tracks WiredTiger cache dirty ratio and ticket utilization on secondaries, revealing resource saturation that precedes visible replication lag by 10 to 30 minutes.
Alerts on oplog window shrinkage before a secondary falls off and requires a full resync.
Monitors flow control isLagged status to distinguish between a down secondary and a primary being throttled by replication back-pressure.
Correlates primary write throughput with secondary apply rates to catch capacity mismatches early.

