MongoDB oplog window collapse: secondaries falling off and forced full resync

A secondary transitions to RECOVERING and logs “too stale to catch up.” The oplog window compresses from 48 hours to 90 minutes while replication lag on one secondary climbs steadily. These are the signatures of oplog window collapse: a write surge turns over the oplog faster than secondaries can consume it, and the safety margin between window and lag evaporates.

Once a secondary falls behind the oldest entry in the primary’s oplog, its sync position no longer exists. Recovery requires a full initial sync, which can take hours to days depending on data size and network throughput. During recovery, the replica set runs with reduced redundancy, and the remaining members absorb the lost member’s read load. If another secondary is already near the edge, that extra load can push it toward the same fate.

This guide covers diagnosis, emergency mitigations, and prevention.

What this means

The oplog is a capped collection (local.oplog.rs) with a fixed maximum size. The oplog window is the time span between its oldest and newest entries. Higher write volume means entries accumulate faster, so the window shrinks.

Secondaries tail the primary’s oplog and apply entries. Replication lag is the time delta between the primary’s latest entry and the secondary’s last applied entry. As long as lag remains smaller than the oplog window, a secondary can catch up after a transient slowdown. Once lag exceeds the window, the secondary’s position has been overwritten. It enters RECOVERING and cannot serve reads or resume replication. It must be rebuilt with a full initial sync.

The critical metric is time-to-falloff: oplog_window - replication_lag. When this drops below your ability to respond, you are one network blip away from losing a member.
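
As a rough illustration of the time-to-falloff calculation, the sketch below takes objects shaped like the output of db.getReplicationInfo() (its timeDiff field, the window in seconds) and rs.status() (members[].name, stateStr, optimeDate). The function name timeToFalloff, the hostnames, and the sample numbers are hypothetical, and lag derived from optimeDate is only an approximation.

```javascript
// Time-to-falloff per secondary: oplog window minus replication lag.
// replInfo is shaped like db.getReplicationInfo() on the primary.
// status is shaped like rs.status().
function timeToFalloff(replInfo, status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) throw new Error("no primary in rs.status() output");
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => {
      // Date subtraction yields milliseconds; clamp small negative skew to 0.
      const lagSecs = Math.max(0, (primary.optimeDate - m.optimeDate) / 1000);
      return { member: m.name, lagSecs, runwaySecs: replInfo.timeDiff - lagSecs };
    });
}

// Made-up sample: a 90-minute window and one secondary an hour behind.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T12:00:00Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T11:59:50Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T11:00:00Z") },
  ],
};
console.log(timeToFalloff({ timeDiff: 5400 }, sampleStatus));
```

In mongosh the equivalent call would be timeToFalloff(db.getReplicationInfo(), rs.status()), run while connected to the primary. In the sample, db3 has 1800 seconds of runway left.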

flowchart TD
    A[Write surge on primary] --> B[Oplog turns over faster]
    B --> C[Oplog window shrinks]
    C --> D[Secondary replication lag grows]
    D --> E{Window minus lag near zero?}
    E -->|Yes| F[Secondary enters RECOVERING]
    F --> G[Forced full initial sync]
    E -->|No| H[Cluster remains at risk]
    H --> I[Remaining nodes absorb more load]
    I --> J[Cascade risk increases]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Bulk import or migration | Primary opcounters spike uniformly; oplog window drops sharply across all secondaries | db.serverStatus().opcounters delta on the primary |
| Large multi-document transactions | Disproportionate oplog consumption relative to operation count; single large entries | db.currentOp() for transactions open longer than 60 seconds |
| Secondary apply bottleneck | One secondary lags while others keep up; its disk or CPU is saturated | db.serverStatus().metrics.repl.apply on the lagging secondary |
| Chained replication topology | Remote secondaries fall off simultaneously even when the cluster holds needed entries | rs.status().members[].syncSourceHost and lag relative to primary |

Quick checks

Run these read-only commands to assess state.

# Oplog window, configured size, and time coverage
rs.printReplicationInfo()
# Per-secondary lag and last synced position
rs.printSecondaryReplicationInfo()
# Primary write volume; compute delta between two samples
db.serverStatus().opcounters
# On a lagging secondary, check apply throughput
db.serverStatus().metrics.repl.apply
# Whether flow control is throttling writes
db.serverStatus().flowControl
# Long-running transactions or bulk writes
db.currentOp({ "active": true, "secs_running": { "$gt": 10 } })

How to diagnose it

  1. Quantify the runway. Run rs.printReplicationInfo() to get the oplog window. Run rs.printSecondaryReplicationInfo() to get per-secondary lag. Compute time-to-falloff = window - lag for each secondary. If this is under one hour, treat it as an imminent PAGE.
  2. Confirm the trend is sustained. Brief lag spikes during batch jobs are normal and self-resolve. Look for a sustained downward trend in window size or sustained upward trend in lag over 10 minutes or more.
  3. Compare apply rate to write rate. On the primary, approximate write rate from the sum of insert, update, and delete in opcounters. On the secondary, derive db.serverStatus().metrics.repl.apply.ops delta. If secondary apply rate is consistently below 80% of the primary write rate, the gap will widen until falloff. Network saturation between nodes can also limit fetch throughput.
  4. Identify the write source. Use db.currentOp() to find bulk inserts, large aggregations with $out, index builds, or multi-document transactions. These generate large oplog entries or heavy secondary apply load. Large transactions serialize apply operations on secondaries, reducing concurrency.
  5. Inspect replication topology. If secondaries sync through an intermediate node, a lagging chain member can starve downstream nodes during sync source re-evaluation. Check the syncSourceHost value of each member in rs.status(). If a secondary is chained and the intermediate node is lagging, downstream secondaries may fall off despite the primary holding the needed entries. Repeated sync source changes in the logs indicate the secondary is hunting for a viable source.
  6. Check flow control status. In MongoDB 4.2+, db.serverStatus().flowControl.isLagged indicates the primary is throttling writes to protect secondaries. Active flow control means you are already in a degraded state where write throughput is artificially limited.
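
Step 3 can be sketched as a comparison of two counter samples taken a fixed interval apart. This is an approximation under stated assumptions: opcounters counts operations while metrics.repl.apply.ops counts applied oplog entries, so the two are not strictly one-to-one. The function names and sample numbers are hypothetical, and Number() is used because mongosh may return these counters as 64-bit Long values.

```javascript
// Sum of insert + update + delete from a db.serverStatus().opcounters sample.
function writeOps(opcounters) {
  return Number(opcounters.insert) + Number(opcounters.update) + Number(opcounters.delete);
}

// Compare primary write rate with secondary apply rate over one interval.
// applyOps* are db.serverStatus().metrics.repl.apply.ops samples from the
// lagging secondary, taken at the same two moments as the primary samples.
function compareRates(primaryBefore, primaryAfter, applyOpsBefore, applyOpsAfter, intervalSecs) {
  const writeRate = (writeOps(primaryAfter) - writeOps(primaryBefore)) / intervalSecs;
  const applyRate = (Number(applyOpsAfter) - Number(applyOpsBefore)) / intervalSecs;
  const ratio = writeRate > 0 ? applyRate / writeRate : null;
  // 0.8 is the 80% threshold used in this guide.
  return { writeRate, applyRate, ratio, fallingBehind: ratio !== null && ratio < 0.8 };
}

// Made-up samples taken 60 seconds apart.
const rateCheck = compareRates(
  { insert: 1000, update: 500, delete: 100 },
  { insert: 7000, update: 2900, delete: 700 },
  50000,
  56000,
  60
);
console.log(rateCheck);
```

With these sample numbers the primary writes about 150 ops/s and the secondary applies about 100 ops/s, so the check flags the secondary as falling behind.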

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Oplog window | Determines how long a secondary can be down before it must resync | < 12 hours (TICKET); < 2 hours (PAGE) |
| Replication lag | Distance from primary; must remain below the window | > 10 seconds sustained, or > 25% of window |
| Primary write rate (opcounters) | Drives oplog consumption rate | Sudden 3x+ spike sustained longer than 5 minutes |
| Secondary apply rate | Must equal or exceed primary write rate to prevent lag growth | Sustained < 80% of primary write rate |
| Flow control isLagged | Primary is throttling writes to prevent falloff | true with growing timeAcquiringMicros |
| WiredTiger cache dirty ratio | Write surges increase dirty data and checkpoint pressure, compounding lag | > 15% sustained |
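
The window and lag thresholds above can be combined into a single severity check. The sketch below is one possible encoding, not a prescribed alert policy: the function name and level names are hypothetical, the one-hour runway rule comes from the diagnosis steps, and mapping the 25% lag rule to TICKET is an assumption.

```javascript
const LEVELS = ["OK", "TICKET", "PAGE"];

// Classify replication risk from the oplog window (hours) and lag (seconds).
function classifyReplicationRisk({ windowHours, lagSecs }) {
  const windowSecs = windowHours * 3600;
  const findings = [];
  if (windowHours < 2) {
    findings.push({ level: "PAGE", reason: "oplog window under 2 hours" });
  } else if (windowHours < 12) {
    findings.push({ level: "TICKET", reason: "oplog window under 12 hours" });
  }
  if (windowSecs - lagSecs < 3600) {
    findings.push({ level: "PAGE", reason: "time-to-falloff under 1 hour" });
  }
  if (lagSecs > 0.25 * windowSecs) {
    // Assumption: treat the relative-lag warning as TICKET severity.
    findings.push({ level: "TICKET", reason: "lag above 25% of window" });
  }
  // Overall level is the worst individual finding.
  const level = findings.reduce(
    (worst, f) => (LEVELS.indexOf(f.level) > LEVELS.indexOf(worst) ? f.level : worst),
    "OK"
  );
  return { level, findings };
}

console.log(classifyReplicationRisk({ windowHours: 1.5, lagSecs: 1200 }));
console.log(classifyReplicationRisk({ windowHours: 48, lagSecs: 5 }));
```

Evaluating lag relative to the current window, rather than against a fixed number of seconds, keeps the alert meaningful as the window shrinks.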

Fixes

Buy time by reducing write volume

The fastest way to stop the window from shrinking is to cut the write rate on the primary. Pause batch imports, defer large migrations, throttle application write threads, and postpone non-urgent index builds. Every eliminated operation extends the oplog window.

If you have identified the specific offending operation and confirmed it is safe to abort, kill it immediately:

db.killOp(<opid>)

Warning: Aborting an in-flight index build, aggregation with $out, or large transaction can trigger rollback and spike disk I/O. Only kill ops you have validated.

Resize the oplog (MongoDB 3.6+, WiredTiger)

If the current size cannot support peak write volume, increase it live without restarting:

db.adminCommand({replSetResizeOplog: 1, size: <new_size>})

The size value is in megabytes, and the minimum configurable size is 990 MB. You must run this on each replica set member individually; the command does not propagate. After increasing the size, the window grows as new entries are written.

To survive restarts, persist the size in mongod.conf under replication.oplogSizeMB.
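
One way to pick a new size is to scale the current oplog churn to the window you want. The sketch below assumes usedMB and timeDiffHours were read from db.getReplicationInfo() during a peak write period, so the ratio reflects worst-case churn. The function name and sample numbers are hypothetical, and the result is a starting estimate rather than a guarantee.

```javascript
// Minimum oplog size accepted by replSetResizeOplog, in megabytes.
const MIN_OPLOG_MB = 990;

// Estimate the oplog size (MB) needed to hold targetHours of writes,
// given that usedMB of oplog currently spans timeDiffHours.
function requiredOplogMB(usedMB, timeDiffHours, targetHours) {
  if (timeDiffHours <= 0) throw new Error("oplog window must be positive");
  const needed = (usedMB * targetHours) / timeDiffHours;
  return Math.max(MIN_OPLOG_MB, Math.ceil(needed));
}

// Made-up sample: 50 GB of oplog covers only 90 minutes; target 24 hours.
console.log(requiredOplogMB(51200, 1.5, 24)); // 819200
// A quiet cluster still gets at least the 990 MB minimum.
console.log(requiredOplogMB(100, 48, 24)); // 990
```

The computed value is what you would pass as the size argument on each member. Confirm that every member has enough free disk before applying it.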

Reclaim space after shrinking

If you reduce the oplog size, disk space is not returned to the filesystem automatically because WiredTiger reuses space internally. You can run compact on the collection, or resync the member, to reclaim filesystem space.

Warning: compact blocks the collection and is disruptive. Do not run it during peak load or on the primary without understanding the lock impact.

Recover a secondary that has already fallen off

Once a secondary is “too stale to catch up,” there is no incremental recovery; the member needs a full initial sync. Initial sync starts when a member comes up with an empty data directory, so stop the mongod process, empty its dbPath, and restart it. The node then resyncs automatically without any change to replica set membership.

Removing and re-adding the member does not discard its stale data on its own. If you also want the member out of the configuration while it is rebuilt, empty its dbPath between these two commands:

rs.remove("host:port")
rs.add("host:port")

Plan for the duration to be roughly your data size divided by the slower of network and disk throughput. During initial sync, the secondary builds indexes while copying collections, which often takes longer than the network transfer. The secondary consumes significant I/O and does not serve reads until sync completes.
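
The planning estimate can be written out as simple arithmetic. The sketch below assumes the slower of network and disk throughput is the bottleneck and adds an optional multiplier for index builds. The function name, the indexOverhead parameter, and the sample figures are hypothetical; measure your own throughput before relying on the result.

```javascript
// Rough initial sync duration in hours.
// dataGB: data size to copy; networkMBps / diskMBps: sustained throughput.
// indexOverhead: hypothetical multiplier for index build time (1.0 = none).
function estimateInitialSyncHours(dataGB, networkMBps, diskMBps, indexOverhead = 1.0) {
  const bottleneckMBps = Math.min(networkMBps, diskMBps);
  if (bottleneckMBps <= 0) throw new Error("throughput must be positive");
  const copySecs = (dataGB * 1000) / bottleneckMBps; // 1 GB taken as 1000 MB
  return (copySecs * indexOverhead) / 3600;
}

// Made-up sample: 2 TB of data, 100 MB/s network, 200 MB/s disk.
console.log(estimateInitialSyncHours(2000, 100, 200).toFixed(1), "hours");
console.log(estimateInitialSyncHours(2000, 100, 200, 1.5).toFixed(1), "hours with index overhead");
```

With these sample figures the copy alone takes about 5.6 hours, and about 8.3 hours if index builds add 50%.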

Address chained replication bottlenecks

If geo-distributed secondaries fall off because an intermediate sync source lags behind the primary, force those secondaries to sync directly from the primary or from a closer, low-latency secondary:

db.adminCommand({replSetSyncFrom: "target_host:port"})

This override is temporary. The member reverts to default sync source selection if it restarts, if its connection to the target closes, or if the target falls too far behind another member. Monitor rs.status() to confirm the change and ensure lag begins to recover.
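
To decide which members need an override, you can scan rs.status() for secondaries whose sync source is itself lagging. The sketch below assumes members expose name, stateStr, optimeDate, and syncSourceHost. The function name, the 30-second threshold, and the hostnames are hypothetical.

```javascript
// Find secondaries that sync from a non-primary member which itself lags
// the primary by more than maxSourceLagSecs.
function findRiskyChains(status, maxSourceLagSecs) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) throw new Error("no primary in rs.status() output");
  const byName = new Map(status.members.map((m) => [m.name, m]));
  const lagSecs = (m) => Math.max(0, (primary.optimeDate - m.optimeDate) / 1000);
  return status.members
    .filter((m) => m.stateStr === "SECONDARY" && m.syncSourceHost && m.syncSourceHost !== primary.name)
    .filter((m) => byName.has(m.syncSourceHost))
    .map((m) => ({
      member: m.name,
      source: m.syncSourceHost,
      sourceLagSecs: lagSecs(byName.get(m.syncSourceHost)),
    }))
    .filter((r) => r.sourceLagSecs > maxSourceLagSecs);
}

// Made-up topology: db3 chains through db2, which is 60 seconds behind.
const chainStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T12:00:00Z"), syncSourceHost: "" },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T11:59:00Z"), syncSourceHost: "db1:27017" },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T11:58:30Z"), syncSourceHost: "db2:27017" },
  ],
};
console.log(findRiskyChains(chainStatus, 30));
```

Members returned by this check are candidates for the replSetSyncFrom override; in the sample only db3 is reported.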

Prevention

  • Trend the oplog window minimum. Do not rely on the current value. Track the minimum window observed during peak write periods over the last 30 days. If the trend is downward, resize the oplog before you hit the threshold.
  • Size for your worst day. The oplog should maintain at least 24 hours of window during the highest sustained write throughput your cluster has experienced. Busy clusters should target 48 to 72 hours.
  • Alert on lag as a fraction of window. A fixed lag threshold misses the point. Alert when replication lag exceeds 25% of the current oplog window.
  • Monitor secondary apply rate against primary write rate. If apply rate drops below write rate for more than one polling interval, investigate before lag accumulates.
  • Watch flow control activation. Flow control is a safety mechanism, but its presence indicates your cluster is operating at the edge of replication capacity.
  • Match secondary hardware to primary write load. Secondaries with slower disks or lower IOPS than the primary cannot apply operations fast enough during surges. Apply bottlenecks often show up as disk saturation before CPU.
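
The first bullet, trending the minimum window, can be sketched as follows. The sample shape ({ t, windowHours } with t in epoch milliseconds) and the function name are hypothetical. Comparing the minimum of the older half of the lookback with the newer half is a deliberately crude trend test, used here as a stand-in for whatever your monitoring system provides.

```javascript
const DAY_MS = 86400 * 1000;

// Minimum oplog window over the lookback period, plus a crude trend check.
function windowTrend(samples, nowMs, lookbackDays = 30) {
  const cutoff = nowMs - lookbackDays * DAY_MS;
  const recent = samples.filter((s) => s.t >= cutoff && s.t <= nowMs);
  if (recent.length === 0) return null;
  const mid = cutoff + (nowMs - cutoff) / 2;
  const minOf = (xs) => (xs.length ? Math.min(...xs.map((s) => s.windowHours)) : null);
  const olderMin = minOf(recent.filter((s) => s.t < mid));
  const newerMin = minOf(recent.filter((s) => s.t >= mid));
  return {
    minWindowHours: minOf(recent),
    olderMin,
    newerMin,
    // Shrinking if the newer half's minimum is below the older half's.
    shrinking: olderMin !== null && newerMin !== null && newerMin < olderMin,
  };
}

// Made-up samples; the day-5 sample falls outside the 30-day lookback.
const windowSamples = [
  { t: 5 * DAY_MS, windowHours: 20 },
  { t: 12 * DAY_MS, windowHours: 60 },
  { t: 20 * DAY_MS, windowHours: 52 },
  { t: 30 * DAY_MS, windowHours: 41 },
  { t: 38 * DAY_MS, windowHours: 36 },
];
console.log(windowTrend(windowSamples, 40 * DAY_MS));
```

Here the 30-day minimum is 36 hours and the newer half is lower than the older half, which is the signal to resize before a threshold is crossed.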

How Netdata helps

  • Correlate primary write throughput with oplog window duration to detect shrinkage before replication lag spikes.
  • Compare per-secondary replication lag with WiredTiger cache pressure to distinguish network delay from apply-side saturation.
  • Expose flow control throttling status to warn when secondaries are near their consumption limit.
  • Alert on cache dirty ratio rises that accompany write surges; primary pressure compounds replication lag.
  • Chart secondary apply rates against primary write rates at per-second resolution to catch falling-behind secondaries in real time.