MongoDB too stale to catch up: secondary stuck in RECOVERING and how to resync

You check rs.status() during an incident and see a member stuck in RECOVERING with an errmsg reading error RS102 too stale to catch up. The node is alive but will never transition back to SECONDARY on its own. Its last replicated oplog entry is older than the oldest entry still available on the primary, so the history it needs has already been overwritten. Incremental replication is impossible from this state. The only path forward is a full initial sync, which on large datasets can take hours to days, adds significant read load to the sync source, and leaves the cluster with reduced redundancy until it completes. If the stale member is a voting node, you are now one failure away from losing majority. This guide covers how to confirm the condition, identify why the secondary fell off, and recover without pushing the remaining cluster members into the same trap.

What this means

The oplog is a capped collection that records every write operation sequentially. Secondaries tail this log and apply entries locally to maintain consistency. Because the oplog is capped, it overwrites its oldest entries once it reaches its configured maximum size. The time span from the oldest to the newest entry is the oplog window.

If a secondary’s last applied timestamp falls outside that window, the primary no longer has the operations the secondary needs to catch up. MongoDB places the member in RECOVERING and logs error RS102 too stale to catch up. The node cannot serve reads or be elected primary, although a voting member in RECOVERING can still cast a vote in elections. The data files are inconsistent with the primary. There is no rollback, repair, or incremental catch-up mechanism that can bridge the gap. A full initial sync is required to reclone all data, rebuild indexes, and apply the oplog entries written during the clone.
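The staleness condition described above comes down to one comparison. The following is a minimal sketch, not MongoDB's internal logic: the function names are hypothetical, and all timestamps are assumed to be plain seconds since the epoch, taken from the ts field of the first and last documents in the sync source's local.oplog.rs and from the secondary's last applied optime.

```javascript
// Hypothetical helpers; inputs are seconds since the epoch.
// oldestOplogSec / newestOplogSec describe the sync source's oplog.
function oplogWindowSeconds(oldestOplogSec, newestOplogSec) {
  return Math.max(0, newestOplogSec - oldestOplogSec);
}

// lastAppliedSec is the timestamp of the last entry the secondary applied.
// If that position is older than the oldest retained entry, the entries
// the secondary needs next have already been overwritten.
function isTooStale(lastAppliedSec, oldestOplogSec) {
  return lastAppliedSec < oldestOplogSec;
}

// Illustrative numbers only: a 10-hour window, and a secondary whose
// last applied entry is 10 minutes older than the oldest retained entry.
const oldest = 1700000000;
const newest = oldest + 10 * 3600;
const lastApplied = oldest - 600;

console.log("window (hours):", oplogWindowSeconds(oldest, newest) / 3600);
console.log("too stale:", isTooStale(lastApplied, oldest));
```

The comparison is deliberately strict: a secondary whose last applied entry is still the oldest entry in the oplog can, in principle, resume from it.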

flowchart TD
    A[Write volume increases] --> B[Oplog window shrinks]
    B --> C[Secondary falls behind]
    C --> D[Oplog overwrites secondary's last position]
    D --> E[Secondary enters RECOVERING]
    E --> F[RS102 too stale to catch up]
    F --> G[Full initial sync required]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Oplog window too small for peak write volume | Window drops to hours during bulk imports; rs.printReplicationInfo() shows far less than 24 hours | Compare the current oplog window to the peak replication lag trend |
| Secondary apply rate cannot keep up | Replication lag grows linearly under steady write load; secondary disk I/O is saturated | db.serverStatus().metrics.repl.apply on the secondary versus primary opcounters |
| Secondary offline longer than the window | Member returns from maintenance after being DOWN or unreachable for an extended period | rs.status() lastHeartbeatMessage and the secondary’s uptime |
| Sudden primary write burst | Primary opcounters or metrics.document spike sharply; flow control may engage | Primary write volume and db.serverStatus().flowControl |

Quick checks

Run these from a healthy primary, or from the affected secondary where noted. These checks are read-only.

# Confirm the exact error in the secondary's log (adjust path if your systemLog.path differs)
grep "RS102 too stale to catch up" /var/log/mongodb/mongod.log

// In mongosh: check member state and exact error message
rs.status().members.forEach(function(m) {
  if (m.stateStr === 'RECOVERING') print(m.name + ": " + m.errmsg);
});

// Check the primary's oplog window
rs.printReplicationInfo()

// Check replication lag across all secondaries
rs.printSecondaryReplicationInfo()

// On the secondary: check how fast it applies oplog entries
db.serverStatus().metrics.repl.apply

// On the primary: check current write volume
db.serverStatus().opcounters

// On the primary: check if flow control is throttling writes
db.serverStatus().flowControl

How to diagnose it

  1. Confirm the stale state. Run rs.status() on the primary or another healthy member. Identify the affected node and note its stateStr, errmsg, and optimeDate. The errmsg should contain error RS102 too stale to catch up.

  2. Measure the oplog window. On the primary, run rs.printReplicationInfo(). The log length start to end value reports how much history the oplog currently covers, in seconds with the equivalent in hours alongside. If the window is shorter than the secondary’s downtime or lag, the secondary is genuinely stale.

  3. Determine whether the secondary was already struggling. On the secondary, inspect db.serverStatus().metrics.repl.apply. If the ops rate is consistently below the primary’s write rate averaged over the same window, the secondary was losing ground before it went stale. This points to a sustained capacity mismatch rather than a one-time outage.

  4. Check for a primary write burst. On the primary, compare opcounters and metrics.document to your baseline. A sudden spike in inserts, updates, or large document writes consumes oplog space faster than steady-state traffic and shrinks the window non-linearly.

  5. Inspect flow control status. On the primary, run db.serverStatus().flowControl. If isLagged is true and timeAcquiringMicros is growing, the primary was already throttling writes to protect lagged secondaries. Your replication headroom was exhausted before the secondary went stale.

  6. Assess cluster-wide risk. Check rs.printSecondaryReplicationInfo() for every other secondary. If any other member has lag approaching the oplog window, calculate its time-to-falloff as window - lag. A value under one hour means the cluster is at risk of cascading into multiple simultaneous resyncs.
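The window - lag calculation from step 6 can be sketched as a small helper. This is an illustration under stated assumptions: the function name, the member names, and the input shape are hypothetical, and the window and lag values are assumed to have been read by hand from rs.printReplicationInfo() and rs.printSecondaryReplicationInfo() and converted to seconds. The one-hour threshold is the one used in step 6.

```javascript
// Hypothetical helper: classify each secondary by remaining headroom.
// windowSec: current oplog window on the primary, in seconds.
// members:   [{ name, lagSec }] with lag behind the primary, in seconds.
function timeToFalloff(windowSec, members, riskThresholdSec = 3600) {
  return members.map((m) => {
    const headroomSec = windowSec - m.lagSec;
    return {
      name: m.name,
      headroomSec,
      stale: headroomSec <= 0,
      atRisk: headroomSec > 0 && headroomSec < riskThresholdSec,
    };
  });
}

// Illustrative values: a 6-hour window, one healthy secondary and one
// secondary that is 5.5 hours behind.
const report = timeToFalloff(6 * 3600, [
  { name: "db2.example.net:27017", lagSec: 45 },
  { name: "db3.example.net:27017", lagSec: 5.5 * 3600 },
]);
console.log(report);
```

Treat the result as a snapshot. Both the window and the lag move with write volume, so a member with comfortable headroom now can lose it quickly during a burst.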

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Oplog window | Hard limit on how long a secondary can be offline or lagged before it must resync | Window drops below 12 hours, or below 2x your longest expected maintenance window |
| Replication lag | Measures how close a secondary is to the end of the oplog window | Sustained lag above 30 seconds, or above 50% of the current oplog window |
| Secondary oplog apply rate | Reveals whether the secondary can keep up | Apply rate consistently lower than primary write rate over any 10-minute window |
| Flow control status | Indicates the primary is throttling writes to prevent secondaries from falling off | isLagged: true with increasing timeAcquiringMicros |
| Primary write rate | Bursts consume oplog space faster than steady-state and shrink the window | Sudden spike in opcounters or metrics.document without a capacity increase |
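The apply-rate comparison in the table needs rates, while serverStatus reports cumulative counters. A rough sketch of that conversion follows, assuming two samples of db.serverStatus().opcounters from the primary and two samples of db.serverStatus().metrics.repl.apply from the secondary, taken the same number of seconds apart. The function names and sample values are hypothetical. The comparison is approximate, because one counted write on the primary does not always correspond to exactly one oplog entry.

```javascript
// Counters may come back as 64-bit wrapper objects in the shell, so
// coerce everything to a plain number before doing arithmetic.
function perSecond(before, after, elapsedSec) {
  return (Number(after) - Number(before)) / elapsedSec;
}

// Primary write rate from two opcounters samples (insert + update + delete).
function primaryWriteRate(opBefore, opAfter, elapsedSec) {
  const writes = (o) => Number(o.insert) + Number(o.update) + Number(o.delete);
  return perSecond(writes(opBefore), writes(opAfter), elapsedSec);
}

// Secondary apply rate from two metrics.repl.apply samples (ops counter).
function secondaryApplyRate(applyBefore, applyAfter, elapsedSec) {
  return perSecond(applyBefore.ops, applyAfter.ops, elapsedSec);
}

// Illustrative samples taken 600 seconds (10 minutes) apart.
const writeRate = primaryWriteRate(
  { insert: 1000, update: 500, delete: 100 },
  { insert: 601000, update: 300500, delete: 100 },
  600
);
const applyRate = secondaryApplyRate({ ops: 2000 }, { ops: 602000 }, 600);

console.log({ writeRate, applyRate, fallingBehind: applyRate < writeRate });
```

A single interval where the apply rate is lower is not alarming on its own. The warning sign in the table is the apply rate staying below the write rate across consecutive intervals.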

Fixes

There is no incremental or online repair for a stale secondary. Once RS102 is logged, the node must be fully resynced.

Perform a full initial sync

Warning: This procedure deletes the local data directory on the secondary. Verify the host name and confirm the dbPath from mongod.conf before proceeding.

  1. Stop mongod on the stale secondary.
  2. Remove all contents of the configured dbPath, including subdirectories. Leave the dbPath directory itself, the configuration file, and TLS material intact.
  3. Restart mongod with the same replica set name and configuration.
  4. The member will enter STARTUP2 and begin an initial sync. It will automatically select a healthy sync source, clone all databases, rebuild indexes, and then apply oplog entries generated during the clone.
  5. Monitor rs.status() until the member transitions to SECONDARY.
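For step 5, a small helper can reduce the rs.status() output to the one fact that matters during a resync. This is a sketch: the function name and host names are hypothetical, and it assumes only the members array with name and stateStr fields that rs.status() returns.

```javascript
// Hypothetical helper: report whether the resyncing member is back.
// status is an object shaped like the result of rs.status().
function resyncProgress(status, memberName) {
  const member = (status.members || []).find((m) => m.name === memberName);
  if (!member) {
    return { found: false, state: null, done: false };
  }
  return {
    found: true,
    state: member.stateStr,
    done: member.stateStr === "SECONDARY",
  };
}

// Sample document standing in for rs.status() while the member is cloning.
const sampleStatus = {
  members: [
    { name: "db1.example.net:27017", stateStr: "PRIMARY" },
    { name: "db2.example.net:27017", stateStr: "SECONDARY" },
    { name: "db3.example.net:27017", stateStr: "STARTUP2" },
  ],
};

console.log(resyncProgress(sampleStatus, "db3.example.net:27017"));
```

In mongosh, the same function could be called as resyncProgress(rs.status(), "<host:port of the resyncing member>") from any healthy member, repeated at whatever interval suits the size of the dataset.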

Do not resync multiple members simultaneously. Removing more than one data-bearing node from service at a time reduces fault tolerance and can overload the sync source with clone traffic. If several members are stale, resync them one at a time and confirm each has returned to SECONDARY before starting the next.

If the stale node is a hidden or non-voting member used only for analytics or backups, the cluster can continue operating safely, but you still need to complete the resync before that node is useful again.

Resize the oplog to prevent immediate recurrence

If the oplog window is chronically short, increase it before or immediately after the resync so the new secondary does not fall off again.

db.adminCommand({ replSetResizeOplog: 1, size: <sizeInMB>, minRetentionHours: <hours> })

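Run the command on each member whose oplog should grow, because it resizes only the oplog of the member it runs against. The resize takes effect without a restart, but it cannot bring back entries that were already overwritten; the window widens only as new writes accumulate. Then set the new size in mongod.conf under replication.oplogSizeMB so that an oplog created from scratch, for example on a member resynced from an emptied dbPath, starts at the larger size.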
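Choosing the size value can be approached as simple scaling. The sketch below assumes the write rate stays roughly the same as during the measurement, so the window grows linearly with oplog size; that is an approximation, not a guarantee. The function names are hypothetical. The 990 MB floor reflects the documented minimum for replSetResizeOplog as far as I am aware; verify it against the documentation for your server version.

```javascript
// Hypothetical helper: estimate the oplog size needed for a target window.
// currentSizeMB:      configured oplog size today
// currentWindowHours: window observed at peak write load
// targetWindowHours:  window you want to sustain at that same load
function requiredOplogSizeMB(currentSizeMB, currentWindowHours, targetWindowHours) {
  const scaled = currentSizeMB * (targetWindowHours / currentWindowHours);
  // Assumed minimum accepted by replSetResizeOplog (check your version).
  return Math.max(990, Math.ceil(scaled));
}

// Build the command document used with db.adminCommand().
function buildResizeCommand(sizeMB, minRetentionHours) {
  return { replSetResizeOplog: 1, size: sizeMB, minRetentionHours };
}

// Illustrative numbers: a 10 GB oplog that covers only 6 hours at peak,
// resized to cover 48 hours.
const sizeMB = requiredOplogSizeMB(10240, 6, 48);
const cmd = buildResizeCommand(sizeMB, 48);
console.log(cmd);
```

In mongosh, the size may need to be passed as a double, for example db.adminCommand({ ...cmd, size: Double(cmd.size) }). Check that the volume holding dbPath has room for the larger oplog before applying it.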

Prevention

  • Size the oplog for peak throughput, not average. Target an oplog window of at least 24 to 72 hours during your highest observed write rate. Use minRetentionHours to enforce a minimum time window even if size alone would allow a shorter one.
  • Trend the window over time. Do not rely on the value set at deployment. As data volume grows, the same oplog size covers less history.
  • Monitor lag as a fraction of the window. A lag of 30 seconds is harmless when the window is 48 hours, but critical when the window is 10 minutes.
  • Match secondary hardware to primary write load. If a secondary cannot apply oplog entries as fast as the primary generates them, it will eventually fall off. Ensure secondaries have comparable disk I/O and CPU.
  • Watch flow control. Active flow control signals that your replication buffer is already thin. Investigate why secondaries are lagging before the window collapses.
  • Track document operation volume. Large documents and bulk operations consume oplog space faster than small updates. A spike in metrics.document is often the leading indicator of a shrinking window.

How Netdata helps

  • Correlate oplog window shrinkage with primary write spikes to identify the causal burst.
  • Alert when replication lag exceeds a configured fraction of the oplog window, before a secondary becomes unrecoverable.
  • Compare secondary oplog application rate against primary opcounters to surface capacity mismatches.
  • Track flow control status and WiredTiger ticket utilization on secondaries to catch apply bottlenecks.