MongoDB too stale to catch up: secondary stuck in RECOVERING and how to resync
You check rs.status() during an incident and see a member stuck in RECOVERING with an errmsg reading error RS102 too stale to catch up. The node is alive but will never transition back to SECONDARY on its own. Its last replicated oplog entry is older than the oldest entry still available on the primary, so the history it needs has already been overwritten. Incremental replication is impossible from this state. The only path forward is a full initial sync, which on large datasets can take hours to days, adds significant read load to the sync source, and leaves the cluster with reduced redundancy until it completes. If the stale member is one of three data-bearing voting nodes, you are now one failure away from being unable to acknowledge majority writes. This guide covers how to confirm the condition, identify why the secondary fell off, and recover without pushing the remaining cluster members into the same trap.
What this means
The oplog is a capped collection that records every write operation sequentially. Secondaries tail this log and apply entries locally to maintain consistency. Because the oplog is capped, it overwrites its oldest entries once it reaches its configured maximum size. The time span from the oldest to the newest entry is the oplog window.
If a secondary’s last applied timestamp falls outside that window, the primary no longer has the operations the secondary needs to catch up. MongoDB places the member in RECOVERING and logs error RS102 too stale to catch up. The node cannot serve reads and cannot be elected primary, although a voting member in RECOVERING still casts its vote in elections. Its data files are frozen at an old point in time relative to the primary. There is no rollback, repair, or incremental catch-up mechanism that can bridge the gap. A full initial sync is required to reclone all data, rebuild indexes, and then apply the oplog entries written while the clone was running.
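The staleness condition above reduces to one timestamp comparison. The sketch below is illustrative only: isTooStale is a hypothetical helper, not a MongoDB API, and it assumes you have already read the secondary's last applied time (for example optimeDate from rs.status()) and the timestamp of the oldest entry still held in the sync source's oplog. The dates are invented.

```javascript
// Hypothetical helper (not part of MongoDB): decide whether a secondary's
// position has fallen out of the sync source's oplog window.
// Both arguments are JavaScript Date objects.
function isTooStale(secondaryLastApplied, sourceOldestOplogEntry) {
  // If the secondary's last applied entry is older than the oldest entry the
  // source still holds, the entries it needs next have been overwritten.
  return secondaryLastApplied.getTime() < sourceOldestOplogEntry.getTime();
}

// Example values, invented for illustration.
const oldestOnPrimary = new Date("2024-05-01T06:00:00Z");
const laggingButInWindow = isTooStale(new Date("2024-05-01T09:30:00Z"), oldestOnPrimary);
const fellOff = isTooStale(new Date("2024-05-01T02:15:00Z"), oldestOnPrimary);
console.log({ laggingButInWindow, fellOff });
```

A member that is hours behind but still inside the window can catch up on its own; only the second case requires a resync.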
flowchart TD
A[Write volume increases] --> B[Oplog window shrinks]
B --> C[Secondary falls behind]
C --> D[Oplog overwrites secondary's last position]
D --> E[Secondary enters RECOVERING]
E --> F[RS102 too stale to catch up]
F --> G[Full initial sync required]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Oplog window too small for peak write volume | Window drops to hours during bulk imports; rs.printReplicationInfo() shows far less than 24 hours | Compare the current oplog window to the peak replication lag trend |
| Secondary apply rate cannot keep up | Replication lag grows linearly under steady write load; secondary disk I/O is saturated | db.serverStatus().metrics.repl.apply on the secondary versus primary opcounters |
| Secondary offline longer than the window | Member returns from maintenance after being DOWN or unreachable for an extended period | rs.status() lastHeartbeatMessage and the secondary’s uptime |
| Sudden primary write burst | Primary opcounters or metrics.document spike sharply; flow control may engage | Primary write volume and db.serverStatus().flowControl |
Quick checks
Run these from a healthy primary, or from the affected secondary where noted. These checks are read-only.
# Confirm the error in the secondary's log (adjust the path if your systemLog.path differs)
# The exact wording varies by MongoDB version, so match the stable part of the message
grep -i "too stale" /var/log/mongodb/mongod.log
// Check member state and error message (the field carrying the message varies by version)
rs.status().members.forEach(function(m) {
  if (m.stateStr === 'RECOVERING') print(m.name + ": " + (m.errmsg || m.infoMessage || m.lastHeartbeatMessage));
});
// Check the primary's oplog window
rs.printReplicationInfo()
// Check replication lag across all secondaries
rs.printSecondaryReplicationInfo()
// On the secondary: check how fast it applies oplog entries
db.serverStatus().metrics.repl.apply
// On the primary: check current write volume
db.serverStatus().opcounters
// On the primary: check if flow control is throttling writes
db.serverStatus().flowControl
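The opcounters and metrics.repl.apply values are cumulative counters, so comparing the primary's write rate with the secondary's apply rate takes two samples of each. The sketch below shows one way to do that arithmetic. ratePerSecond and totalWrites are hypothetical helpers, the sample numbers are invented, and the comparison is only approximate because a write operation and an applied oplog entry are not always one-to-one.

```javascript
// Hypothetical helper: turn two cumulative counter samples into ops/second.
// Each sample is { takenAt: <epoch milliseconds>, count: <counter value> }.
function ratePerSecond(earlier, later) {
  const seconds = (later.takenAt - earlier.takenAt) / 1000;
  if (seconds <= 0) throw new Error("samples must be in chronological order");
  return (later.count - earlier.count) / seconds;
}

// Sum the write counters from a db.serverStatus().opcounters-shaped document.
// Number() is an assumption so that 64-bit wrapper values are also tolerated.
function totalWrites(opcounters) {
  return Number(opcounters.insert) + Number(opcounters.update) + Number(opcounters.delete);
}

// Invented samples taken ten minutes apart.
const t0 = Date.parse("2024-05-01T10:00:00Z");
const t1 = t0 + 10 * 60 * 1000;

const primaryRate = ratePerSecond(
  { takenAt: t0, count: totalWrites({ insert: 900000, update: 250000, delete: 50000 }) },
  { takenAt: t1, count: totalWrites({ insert: 1350000, update: 380000, delete: 70000 }) }
);
// "ops" samples as read from metrics.repl.apply on the secondary.
const applyRate = ratePerSecond({ takenAt: t0, count: 900000 }, { takenAt: t1, count: 1380000 });

const fallingBehind = applyRate < primaryRate;
console.log({ primaryRate, applyRate, fallingBehind });
```

A single ten-minute pair can be noisy; repeating the comparison across several intervals gives a better picture of whether the gap is sustained.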
How to diagnose it
- Confirm the stale state. Run rs.status() on the primary or another healthy member. Identify the affected node and note its stateStr, errmsg, and optimeDate. The errmsg should contain error RS102 too stale to catch up.
- Measure the oplog window. On the primary, run rs.printReplicationInfo(). The log length start to end value reports how much history the oplog currently covers, in seconds and in hours. If the window is smaller than the secondary’s downtime or lag, the secondary is genuinely stale.
- Determine whether the secondary was already struggling. On the secondary, inspect db.serverStatus().metrics.repl.apply. If the ops rate is consistently below the primary’s write rate averaged over the same window, the secondary was losing ground before it went stale. This points to a sustained capacity mismatch rather than a one-time outage.
- Check for a primary write burst. On the primary, compare opcounters and metrics.document to your baseline. A sudden spike in inserts, updates, or large document writes consumes oplog space faster than steady-state traffic and shrinks the window faster than the steady-state trend would predict.
- Inspect flow control status. On the primary, run db.serverStatus().flowControl. If isLagged is true and timeAcquiringMicros is growing, the primary was already throttling writes to protect lagged secondaries. Your replication headroom was exhausted before the secondary went stale.
- Assess cluster-wide risk. Check rs.printSecondaryReplicationInfo() for every other secondary. If any other member has lag approaching the oplog window, calculate its time-to-falloff as window - lag. A value under one hour means the cluster is at risk of cascading into multiple simultaneous resyncs.
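The time-to-falloff calculation in the last step can be sketched as a small function over an rs.status()-shaped document. timeToFalloff is a hypothetical helper, the member names and timestamps are invented, and the oplog window is assumed to be supplied in seconds from a separate rs.printReplicationInfo() reading.

```javascript
// Hypothetical helper: for each SECONDARY in an rs.status()-shaped document,
// compute its lag behind the primary and the time remaining before it would
// fall out of the oplog window (window - lag).
function timeToFalloff(status, windowSeconds) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) throw new Error("no primary in status document");
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => {
      const lagSeconds = (primary.optimeDate - m.optimeDate) / 1000;
      const remainingSeconds = windowSeconds - lagSeconds;
      // The one-hour threshold mirrors the guidance in the step above.
      return { name: m.name, lagSeconds, remainingSeconds, atRisk: remainingSeconds < 3600 };
    });
}

// Invented status document with a 6-hour oplog window.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-05-01T12:00:00Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-05-01T11:59:50Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-05-01T06:30:00Z") },
    { name: "db4:27017", stateStr: "RECOVERING", optimeDate: new Date("2024-04-30T20:00:00Z") },
  ],
};
const falloff = timeToFalloff(sampleStatus, 6 * 3600);
console.log(falloff);
```

In this invented example db3 is 5.5 hours behind a 6-hour window, so it has 30 minutes left and is flagged. The already-stale db4 is excluded because it is no longer a SECONDARY.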
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Oplog window | Hard limit on how long a secondary can be offline or lagged before it must resync | Window drops below 12 hours, or below 2x your longest expected maintenance window |
| Replication lag | Measures how close a secondary is to the end of the oplog window | Sustained lag above 30 seconds, or above 50% of the current oplog window |
| Secondary oplog apply rate | Reveals whether the secondary can keep up | Apply rate consistently lower than primary write rate over any 10-minute window |
| Flow control status | Indicates the primary is throttling writes to prevent secondaries from falling off | isLagged: true with increasing timeAcquiringMicros |
| Primary write rate | Bursts consume oplog space faster than steady-state and shrink the window | Sudden spike in opcounters or metrics.document without a capacity increase |
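The window and lag thresholds in the table can be expressed as a simple check. This is a sketch only: evaluateSignals and its parameter names are hypothetical, it evaluates a single sample rather than a sustained condition, and the thresholds are copied from the table above and should be tuned for your workload.

```javascript
// Hypothetical helper: apply the oplog window and replication lag warning
// thresholds from the table to one sample of measurements.
function evaluateSignals({ windowHours, lagSeconds, longestMaintenanceHours }) {
  const warnings = [];
  if (windowHours < 12) {
    warnings.push("oplog window below 12 hours");
  }
  if (windowHours < 2 * longestMaintenanceHours) {
    warnings.push("oplog window below 2x longest maintenance window");
  }
  if (lagSeconds > 30) {
    warnings.push("replication lag above 30 seconds");
  }
  if (lagSeconds > 0.5 * windowHours * 3600) {
    warnings.push("replication lag above 50% of oplog window");
  }
  return warnings;
}

// Invented measurements for two clusters.
const tight = evaluateSignals({ windowHours: 10, lagSeconds: 45, longestMaintenanceHours: 6 });
const comfortable = evaluateSignals({ windowHours: 48, lagSeconds: 5, longestMaintenanceHours: 4 });
console.log({ tight, comfortable });
```

An alerting system would normally require the condition to hold across several consecutive samples before firing, which this sketch leaves out.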
Fixes
There is no incremental or online repair for a stale secondary. Once RS102 is logged, the node must be fully resynced.
Perform a full initial sync
Warning: This procedure deletes the local data directory on the secondary. Verify the host name and confirm the dbPath from mongod.conf before proceeding.
- Stop mongod on the stale secondary.
- Remove all data files in the configured dbPath. Leave the configuration file, TLS material, and directory structure intact.
- Restart mongod with the same replica set name and configuration.
- The member will enter STARTUP2 and begin an initial sync. It will automatically select a healthy sync source, clone all databases, rebuild indexes, and then apply oplog entries generated during the clone.
- Monitor rs.status() until the member transitions to SECONDARY.
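The monitoring step can be reduced to reading one member's stateStr on each poll. The sketch below classifies that state. resyncPhase is a hypothetical helper, the member name is invented, and the status document is assumed to have the same shape as rs.status() output.

```javascript
// Hypothetical helper: report where a resyncing member is, based on its
// stateStr in an rs.status()-shaped document.
function resyncPhase(status, memberName) {
  const member = status.members.find((m) => m.name === memberName);
  if (!member) return "missing";
  switch (member.stateStr) {
    case "STARTUP2":
      return "initial sync in progress";
    case "SECONDARY":
      return "done";
    case "RECOVERING":
      // Can appear briefly as the sync finishes; if it persists, the member
      // may still be stale.
      return "recovering";
    default:
      return "other: " + member.stateStr;
  }
}

// Invented status snapshots taken at different points of the resync.
const during = { members: [{ name: "db3:27017", stateStr: "STARTUP2" }] };
const after = { members: [{ name: "db3:27017", stateStr: "SECONDARY" }] };
console.log(resyncPhase(during, "db3:27017"), "->", resyncPhase(after, "db3:27017"));
```

In a shell session this could be called repeatedly with a fresh rs.status() result and a pause between polls, stopping once it returns "done".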
Do not resync multiple members simultaneously. Removing more than one data-bearing node from service at a time reduces fault tolerance and can overload the sync source with clone traffic. If several members are stale, resync them one at a time and confirm each has returned to SECONDARY before starting the next.
If the stale node is a hidden or non-voting member used only for analytics or backups, the cluster can continue operating safely, but you still need to complete the resync before that node is useful again.
Resize the oplog to prevent immediate recurrence
If the oplog window is chronically short, increase it before or immediately after the resync so the new secondary does not fall off again.
db.adminCommand({ replSetResizeOplog: 1, size: <sizeInMB>, minRetentionHours: <hours> })
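The resize takes effect without a restart, but the command does not replicate, so run it on every member that needs the larger oplog. It cannot restore entries that were already overwritten; it only gives the oplog more room going forward. Also set replication.oplogSizeMB in mongod.conf to the new size, so that a member whose data directory is wiped for a resync creates its new oplog at that size.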
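One way to pick the size value is to scale the current oplog size by the ratio of the target window to the window observed at peak load. The sketch below does that and builds the command document. buildResizeCommand is a hypothetical helper, the linear scaling assumes write volume per hour stays at the observed peak, and the 990 MB floor reflects the documented minimum for replSetResizeOplog.

```javascript
// Hypothetical helper: scale the oplog size to reach a target window and
// build a replSetResizeOplog command document.
function buildResizeCommand(currentSizeMB, observedWindowHours, targetWindowHours, minRetentionHours) {
  if (observedWindowHours <= 0) throw new Error("observed window must be positive");
  // Assumption: oplog consumption is linear in time at the observed write rate.
  const scaled = Math.ceil(currentSizeMB * (targetWindowHours / observedWindowHours));
  // Never shrink below the current size or the documented 990 MB minimum.
  const size = Math.max(scaled, currentSizeMB, 990);
  const command = { replSetResizeOplog: 1, size };
  if (minRetentionHours !== undefined) command.minRetentionHours = minRetentionHours;
  return command;
}

// Invented example: a 20 GB oplog that covered only 6 hours at peak,
// resized to cover 24 hours.
const resize = buildResizeCommand(20480, 6, 24);
console.log(resize);
```

The returned document would be passed to db.adminCommand() on each member. Recent MongoDB documentation shows the size wrapped as Double(...) in mongosh, so check the form your shell version expects. Setting minRetentionHours lets the oplog grow past size to keep that much history, so confirm the disk headroom before adding it.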
Prevention
- Size the oplog for peak throughput, not average. Target an oplog window of at least 24 to 72 hours during your highest observed write rate. Use minRetentionHours to enforce a minimum time window even if size alone would allow a shorter one.
- Trend the window over time. Do not rely on the value set at deployment. As write volume grows, the same oplog size covers less history.
- Monitor lag as a fraction of the window. A lag of 30 seconds is harmless when the window is 48 hours, but critical when the window is 10 minutes.
- Match secondary hardware to primary write load. If a secondary cannot apply oplog entries as fast as the primary generates them, it will eventually fall off. Ensure secondaries have comparable disk I/O and CPU.
- Watch flow control. Active flow control signals that your replication buffer is already thin. Investigate why secondaries are lagging before the window collapses.
- Track document operation volume. Large documents and bulk operations consume oplog space faster than small updates. A spike in metrics.document is often the leading indicator of a shrinking window.
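Two of the points above are simple arithmetic: how the window shrinks at peak write rate, and how the same lag looks against different windows. The sketch below works both through. projectedWindowHours and lagFraction are hypothetical helpers, and the oplog size and write rates are invented; the 30 second, 48 hour, and 10 minute figures are the ones used in the list above.

```javascript
// Hypothetical helper: estimate the oplog window from oplog size and the rate
// at which oplog data is written (both in MB, rate per hour).
function projectedWindowHours(oplogSizeMB, oplogMBPerHour) {
  if (oplogMBPerHour <= 0) throw new Error("write rate must be positive");
  return oplogSizeMB / oplogMBPerHour;
}

// Hypothetical helper: express replication lag as a fraction of the window.
function lagFraction(lagSeconds, windowSeconds) {
  return lagSeconds / windowSeconds;
}

// Invented 50 GB oplog: comfortable on average, thin during a 10x burst.
const averageWindow = projectedWindowHours(51200, 1024);
const peakWindow = projectedWindowHours(51200, 10240);

// The same 30 seconds of lag against a 48-hour and a 10-minute window.
const fractionOfLongWindow = lagFraction(30, 48 * 3600);
const fractionOfShortWindow = lagFraction(30, 10 * 60);

console.log({ averageWindow, peakWindow, fractionOfLongWindow, fractionOfShortWindow });
```

Sizing against the average rate would suggest 50 hours of headroom here, while the burst leaves only 5, which is why the peak figure is the one to plan around.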
How Netdata helps
- Correlate oplog window shrinkage with primary write spikes to identify the causal burst.
- Alert when replication lag exceeds a configured fraction of the oplog window, before a secondary becomes unrecoverable.
- Compare secondary oplog application rate against primary opcounters to surface capacity mismatches.
- Track flow control status and WiredTiger ticket utilization on secondaries to catch apply bottlenecks.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early
- MongoDB monitoring checklist: the signals every production cluster needs
- MongoDB monitoring maturity model: from survival to expert
- MongoDB noTimeout cursors causing cache pressure: pinned snapshots and silent eviction stalls







