MongoDB replica set member unhealthy: reading rs.status() states

rs.status() output is perspective-dependent and easy to misread. A member can show health: 1 while RECOVERING and unable to serve reads, or appear UNKNOWN from one node yet SECONDARY from another because of an asymmetric firewall rule. health: 0 alone does not mean the process is dead, and health: 1 alone does not mean the node is healthy.

This guide maps stateStr values to failure modes and shows how to distinguish transient startup states, replication lag spirals, and network partitions.

What this means

rs.status() returns a members array. During an incident, focus on stateStr, health, optimeDate, lastHeartbeat, lastHeartbeatRecv, pingMs, and lastHeartbeatMessage.

  • PRIMARY and SECONDARY are healthy operational states.
  • ARBITER is a healthy state. An arbiter should never transition to any other state.
  • STARTUP2 means initial sync is in progress. Expected for new members and some rolling restarts.
  • RECOVERING means the member is alive and can vote, but cannot serve reads or be elected primary. Appears during oplog catch-up, journal replay after unclean shutdown, or when a secondary has fallen too far behind.
  • DOWN and UNKNOWN mean the member is unreachable from the queried node’s perspective. rs.status() reports a DOWN member with the stateStr “(not reachable/healthy)”. The process may be down, the host may be offline, or a network partition or firewall rule may be blocking traffic.
  • ROLLBACK means the member is reverting writes that were not replicated to a majority before a failover. Data removed during rollback is written to files under the rollback/ directory inside the dbPath.
  • REMOVED means the member has been removed from the replica set configuration.

health is 1 if the last heartbeat to that member succeeded, and 0 if it failed. It reports reachability only: a member in RECOVERING still shows health: 1, so the field on its own does not tell you whether the member can serve traffic. Any data-bearing member in RECOVERING, ROLLBACK, DOWN, or UNKNOWN for more than two minutes warrants investigation.
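The state rules above can be sketched as a small helper that takes an rs.status() document and flags members that need a look. The function name, the verdict labels, and the returned shape are illustrative assumptions, not part of MongoDB. The sketch does not track how long a member has been in a state, so the two-minute grace period still has to be applied by whatever calls it.

```javascript
// Sketch: classify members of an rs.status() document by stateStr and health.
// All names below are hypothetical; only stateStr, health, and name come
// from the rs.status() output.
const HEALTHY_STATES = new Set(["PRIMARY", "SECONDARY", "ARBITER"]);
const TRANSIENT_STATES = new Set(["STARTUP", "STARTUP2"]);

function classifyMembers(status) {
  return status.members.map(function (m) {
    let verdict;
    if (HEALTHY_STATES.has(m.stateStr) && m.health === 1) {
      verdict = "ok";
    } else if (TRANSIENT_STATES.has(m.stateStr)) {
      // Expected while a member starts up or runs initial sync.
      verdict = "watch";
    } else {
      // RECOVERING, ROLLBACK, REMOVED, unreachable members, or a
      // normally healthy state reported with health: 0.
      verdict = "investigate";
    }
    return { name: m.name, stateStr: m.stateStr, health: m.health, verdict: verdict };
  });
}
```

Inside mongosh, the helper could be called as classifyMembers(rs.status()) after pasting the definition into the session.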

Note: rs.status() shows the view of the member you connect to. If member A cannot reach member B due to an outbound firewall rule, A reports B as DOWN while B may see itself as SECONDARY. Always check from multiple vantage points before concluding a node is dead.
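To make the comparison of vantage points concrete, the following sketch takes two rs.status() documents collected from different members and lists every member whose state the two nodes disagree on. The function name and the output fields are assumptions for illustration; how the two documents are collected (for example by connecting to each member in turn) is left to the operator.

```javascript
// Sketch: report members that two nodes see in different states.
// viewA and viewB are rs.status() documents taken from two different members.
function diffViews(viewA, viewB) {
  const stateByName = {};
  viewB.members.forEach(function (m) {
    stateByName[m.name] = m.stateStr;
  });

  const disagreements = [];
  viewA.members.forEach(function (m) {
    const other = stateByName[m.name];
    // Only compare members present in both views.
    if (other !== undefined && other !== m.stateStr) {
      disagreements.push({ member: m.name, fromA: m.stateStr, fromB: other });
    }
  });
  return disagreements;
}
```

An empty result means both nodes agree. A member that one node reports as unreachable while the other reports it as SECONDARY points toward a connectivity problem rather than a dead process.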

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Network partition or asymmetric firewall | Member is DOWN or UNKNOWN from some nodes but SECONDARY from itself or others | rs.status() from the affected member and from the primary; compare views |
| Secondary fell off the oplog window | stateStr: RECOVERING, lastHeartbeatMessage contains “too stale to catch up” | Oplog window from rs.printReplicationInfo() versus replication lag from rs.printSecondaryReplicationInfo() |
| Initial sync in progress | stateStr: STARTUP2, large initialSyncStatus subdocument visible in rs.status() | Sync source reachability and progress; syncSourceUnreachableSince if present |
| Rollback after failover | stateStr: ROLLBACK | MongoDB logs for rollback size; rollback/ directory under dbPath |
| Resource exhaustion slowing heartbeats | Member flaps between PRIMARY and SECONDARY; election events in logs | db.serverStatus().wiredTiger.cache and OS disk latency on the node |
| Stuck recovery after unclean shutdown | stateStr: RECOVERING for >2 minutes with no progress | Logs for journal replay progress; disk I/O utilization |

Quick checks

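Run these safe, read-only commands to orient yourself. Lines that follow a # comment are shell commands; lines that follow a // comment are JavaScript to run inside a mongosh session connected to the member.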

# List all members with state and health
mongosh --quiet --eval 'rs.status().members.forEach(function(m) { print(m.name + " -> " + m.stateStr + " (health: " + m.health + ")"); })'
# Check last heartbeat messages for explicit errors
mongosh --quiet --eval 'rs.status().members.forEach(function(m) { if (m.lastHeartbeatMessage) print(m.name + ": " + m.lastHeartbeatMessage); })'
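// Compare optimeDate lag directly
var status = rs.status();
var primary = status.members.filter(function(m) { return m.stateStr === 'PRIMARY'; })[0];
if (!primary) {
  // Without a visible primary there is no reference optimeDate to compare against
  print("No PRIMARY visible from this node; cannot compute lag");
} else {
  status.members.filter(function(m) { return m.stateStr === 'SECONDARY'; }).forEach(function(s) {
    print(s.name + " lag: " + ((primary.optimeDate - s.optimeDate) / 1000) + " sec");
  });
}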
# Check oplog window and replication overview
mongosh --quiet --eval 'rs.printReplicationInfo()'
# Look for rollback, stale, or election events in the log
grep -iE "rollback|too stale|election" /var/log/mongodb/mongod.log | tail -20
// Check if the node is reachable but unresponsive due to ticket or cache pressure
var s = db.serverStatus();
print("opcounters: " + JSON.stringify(s.opcounters));
print("queue: " + JSON.stringify(s.globalLock.currentQueue));

How to diagnose it

flowchart TD
    A[Member non-healthy] --> B{Connect directly?}
    B -->|No| C[Network or process down]
    B -->|Yes| D{Self-view matches?}
    D -->|No| E[Check firewall and ports]
    D -->|Yes| F[Read lastHeartbeatMessage]
    F --> G{RECOVERING?}
    G -->|Yes| H[Compare lag to oplog window]
    G -->|No| I[Investigate ROLLBACK or STARTUP2]
    H --> J{Lag > window?}
    J -->|Yes| K[Plan full resync]
    J -->|No| L[Check disk and cache]

  1. Run rs.status() from the current primary. Identify any member where stateStr is not PRIMARY, SECONDARY, or ARBITER.
  2. For each unhealthy member, record stateStr, health, lastHeartbeat, lastHeartbeatRecv, and pingMs. Read lastHeartbeatMessage first; it often contains the exact reason, such as “too stale to catch up”.
  3. Eliminate perspective bias. Connect directly to the unhealthy member and run rs.status(). If it sees itself as SECONDARY but the primary sees it as DOWN, the issue is network connectivity, not process failure. Verify bidirectional reachability on the replication port.
  4. Check replication lag. Compare optimeDate between the primary and the secondary. If lag is growing and approaching the oplog window, the secondary is on track to require a full initial sync.
  5. Check resource saturation on the affected member. High WiredTiger cache dirty ratio, application-thread evictions, or journal sync latency can slow oplog application enough to push a member into RECOVERING.
  6. Inspect MongoDB logs on the affected node for rollback, election, or initial sync progress. Rollback data is written to the rollback/ directory under dbPath.
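
The decision path in the steps above can be sketched as a single function. It is a simplification under stated assumptions: the function name, its parameters, and the returned messages are hypothetical, the oplog window is passed in as seconds by the caller, and selfView is null when a direct connection to the member fails. It relies on the self: true flag that rs.status() sets on the member you are connected to.

```javascript
// Sketch of the diagnosis flow. primaryView is rs.status() from the primary,
// selfView is rs.status() from the affected member (or null if unreachable).
function diagnoseMember(memberName, primaryView, selfView, oplogWindowSecs) {
  const seen = primaryView.members.filter(function (m) {
    return m.name === memberName;
  })[0];
  if (!seen) return "member not in the primary's configuration";
  if (!selfView) return "cannot connect directly: network or process down";

  const self = selfView.members.filter(function (m) {
    return m.self === true;
  })[0];
  if (!self || self.stateStr !== seen.stateStr) {
    return "views disagree: check firewall and ports";
  }

  if (self.stateStr === "RECOVERING") {
    const primary = primaryView.members.filter(function (m) {
      return m.stateStr === "PRIMARY";
    })[0];
    if (!primary) return "no primary visible: cannot compute lag";
    // Date subtraction yields milliseconds.
    const lagSecs = (primary.optimeDate - self.optimeDate) / 1000;
    return lagSecs > oplogWindowSecs
      ? "lag exceeds oplog window: plan full resync"
      : "lag within oplog window: check disk and cache";
  }

  if (self.stateStr === "ROLLBACK" || self.stateStr === "STARTUP2") {
    return "investigate " + self.stateStr + " in the logs";
  }
  return "no action: state is " + self.stateStr;
}
```

The returned string is only a pointer to the next step. lastHeartbeatMessage and the logs remain the authoritative source for the actual reason.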

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Member stateStr and health | Classifies liveness and service eligibility | Any data-bearing member not PRIMARY or SECONDARY for >2 minutes |
| Replication lag | Time between primary and secondary optimeDate | Sustained >10 seconds, or >25% of oplog window |
| Oplog window | Coverage hours of the oplog | <12 hours |
| lastHeartbeatMessage | Explicit error text from the replication layer | “too stale to catch up”, connection timeouts, or sync source unreachable |
| Election events | Repeated elections indicate instability | >1 per hour outside maintenance windows |
| WiredTiger cache dirty ratio | Resource exhaustion can stall replication | >15% sustained |
| Journal sync latency | Storage health leading indicator | >30 ms sustained |
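
As a rough illustration, the numeric thresholds in the table can be applied to a single sample of metrics. The input field names (lagSecs, oplogWindowHours, electionsLastHour, cacheDirtyPct, journalSyncMs) are hypothetical and would need to be mapped from whatever monitoring source is in use. The sketch checks one sample only, so the "sustained" qualifier has to be handled by the caller, for example by requiring several consecutive breaches.

```javascript
// Sketch: evaluate one metrics sample against the warning thresholds above.
// Field names on `sample` are assumptions for this example.
function evaluateSignals(sample) {
  const warnings = [];
  const windowSecs = sample.oplogWindowHours * 3600;

  // Lag: more than 10 seconds, or more than 25% of the oplog window.
  if (sample.lagSecs > 10 || sample.lagSecs > 0.25 * windowSecs) {
    warnings.push("replication lag");
  }
  if (sample.oplogWindowHours < 12) warnings.push("oplog window");
  if (sample.electionsLastHour > 1) warnings.push("elections");
  if (sample.cacheDirtyPct > 15) warnings.push("cache dirty ratio");
  if (sample.journalSyncMs > 30) warnings.push("journal sync latency");
  return warnings;
}
```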

Fixes

Oplog window exceeded or “too stale to catch up”

Once a secondary falls past the oplog window, it enters RECOVERING and must perform a full initial sync. In MongoDB 4.0 and later, increase the oplog size online with replSetResizeOplog to prevent other secondaries from falling off. The affected member still requires resync before it rejoins as a secondary.
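
A rough way to pick the new size is to scale the current oplog usage by the ratio of the target window to the observed window. The sketch below assumes the write rate stays constant, which real workloads rarely do, so treat the result as a starting point. The function name is hypothetical; usedMB, logSizeMB, and timeDiffHours are fields of the document returned by db.getReplicationInfo(), and 990 MB is the minimum size replSetResizeOplog accepts.

```javascript
// Sketch: estimate an oplog size (in MB) that would cover targetHours,
// assuming a constant write rate. `info` is the result of db.getReplicationInfo().
const MIN_OPLOG_MB = 990; // smallest size replSetResizeOplog accepts

function planOplogSizeMB(info, targetHours) {
  if (!(info.timeDiffHours > 0)) {
    throw new Error("oplog window unknown; cannot estimate write rate");
  }
  // MB written per hour, scaled to the target window.
  const needed = Math.ceil((info.usedMB * targetHours) / info.timeDiffHours);
  // Never propose shrinking the oplog or going below the minimum.
  return Math.max(needed, Math.ceil(info.logSizeMB), MIN_OPLOG_MB);
}
```

In mongosh, the result could then be applied on each member with db.adminCommand({ replSetResizeOplog: 1, size: Double(sizeMB) }), where sizeMB is the value returned above; the size is given in megabytes.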

Rollback

Do not restart the member. Let rollback finish. Inspect the rollback/ directory under the dbPath for data that may need manual re-application. To prevent future rollbacks, use w: "majority" write concern for operations that must survive a failover.
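
For reference, a write with majority write concern looks roughly like the sketch below. The helper name is hypothetical, collection stands for any mongosh or driver collection handle, and the 5000 ms wtimeout is an arbitrary example value rather than a recommendation.

```javascript
// Sketch: insert a document and wait for acknowledgement from a majority
// of data-bearing members. The wtimeout value is an example only.
function insertWithMajority(collection, doc) {
  return collection.insertOne(doc, {
    writeConcern: { w: "majority", wtimeout: 5000 }
  });
}
```

A write acknowledged this way has been replicated to a majority of members, so it is not among the writes a rollback reverts. If the wtimeout expires, the write may still have been applied on the primary; the error only means the majority acknowledgement did not arrive in time.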

Resource exhaustion causing RECOVERING

Kill unnecessary long-running operations with db.killOp(). Warning: killing a write operation may leave data partially updated. If storage latency is the root cause, reduce write throughput or step down the primary to shift workload to a member with healthier disks. Do not increase WiredTiger ticket limits; higher concurrency worsens queuing.
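
Before killing anything, it helps to list candidates and review them. The sketch below filters the output of db.currentOp() for active operations that have run longer than a threshold. The function name and the exclusion of the local, admin, and config databases are assumptions made here to avoid touching replication and internal operations; adjust them to the deployment.

```javascript
// Sketch: list long-running operations from a db.currentOp() result.
// Nothing is killed here; the caller reviews the list first.
function findKillCandidates(currentOpResult, minSecs) {
  return currentOpResult.inprog
    .filter(function (op) {
      return op.active === true &&
        op.secs_running >= minSecs &&
        op.op !== "none" &&
        // Skip internal namespaces such as local.oplog.rs (assumption).
        !/^(local|admin|config)\./.test(op.ns || "");
    })
    .map(function (op) {
      return { opid: op.opid, op: op.op, ns: op.ns, secsRunning: op.secs_running };
    });
}
```

In mongosh this could be called as findKillCandidates(db.currentOp(), 60), and an operation judged safe to stop is then passed to db.killOp() by its opid.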

Network partition or asymmetric firewall

Do not restart MongoDB. Verify bidirectional connectivity on the replication port between the affected member and the rest of the set. Firewall rules that block outbound connections from one member silently break replication even if inbound rules pass.

Initial sync in progress

STARTUP2 can last hours for large data sets. Monitor initialSyncStatus inside rs.status() for cloning progress. If syncSourceUnreachableSince is present, the member cannot reach its sync source and will remain stuck until connectivity is restored.
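
A minimal sketch for summarizing initial sync progress from an rs.status() document follows. The function name and output shape are hypothetical. The fields read from initialSyncStatus (approxTotalDataSize, approxTotalBytesCopied, failedInitialSyncAttempts, syncSourceUnreachableSince) vary by MongoDB version, so the code treats every one of them as optional.

```javascript
// Sketch: summarize initial sync progress from rs.status().
// Fields under initialSyncStatus are version-dependent and may be absent.
function summarizeInitialSync(status) {
  const s = status.initialSyncStatus;
  if (!s) return { inProgress: false };

  // Number() also handles 64-bit integer wrappers returned by the shell.
  const total = Number(s.approxTotalDataSize);
  const copied = Number(s.approxTotalBytesCopied);
  const percentCopied =
    Number.isFinite(total) && Number.isFinite(copied) && total > 0
      ? Math.round((copied / total) * 100)
      : null;

  return {
    inProgress: true,
    percentCopied: percentCopied,
    failedAttempts: s.failedInitialSyncAttempts === undefined ? null : s.failedInitialSyncAttempts,
    // A value here means the member cannot currently reach its sync source.
    sourceUnreachableSince: s.syncSourceUnreachableSince || null
  };
}
```

Run on the syncing member itself, for example as summarizeInitialSync(rs.status()). A percentCopied of null means the size estimates were not available, not that the sync has stalled.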

Prevention

  • Monitor oplog window trending, not just current lag. Size the oplog to maintain at least 24 hours of coverage during peak writes.
  • Monitor WiredTiger cache dirty ratio and ticket utilization. Cache pressure and disk stalls are common root causes of replication lag that leads to RECOVERING.
  • Verify bidirectional firewall rules between all replica set members before production cutover.
  • Prefer w: "majority" write concern to avoid rollback events.

How Netdata helps

  • Netdata’s MongoDB collector tracks replica set member state and replication lag, exposing stateStr transitions without manual rs.status() parsing.
  • Correlate member state changes with WiredTiger cache dirty ratio and ticket utilization on the same node to spot resource-induced replication stalls.
  • Track journal sync latency and checkpoint duration to detect storage-layer problems before they push a secondary into RECOVERING.
  • Alert on oplog window shrinkage and replication lag trends, not just absolute thresholds.
  • Visualize election events alongside connection churn to distinguish network blips from capacity issues.