MongoDB replica set member unhealthy: reading rs.status() states
rs.status() output is perspective-dependent and easy to misread. A member can show health: 1 while RECOVERING and unable to serve reads, or appear UNKNOWN from one node yet SECONDARY from another because of an asymmetric firewall rule. health: 0 alone does not mean the process is dead, and health: 1 alone does not mean the node is healthy.
This guide maps stateStr values to failure modes and shows how to distinguish transient startup states, replication lag spirals, and network partitions.
What this means
rs.status() returns a members array. During an incident, focus on stateStr, health, optimeDate, lastHeartbeat, lastHeartbeatRecv, pingMs, and lastHeartbeatMessage.
- PRIMARY and SECONDARY are healthy operational states.
- ARBITER is healthy and should never transition.
- STARTUP2 means initial sync is in progress. Expected for new members and some rolling restarts.
- RECOVERING means the member is alive and can vote, but cannot serve reads or be elected primary. Appears during oplog catch-up, journal replay after unclean shutdown, or when a secondary has fallen too far behind.
- DOWN and UNKNOWN mean the member is unreachable from the queried node’s perspective. The process may be down, the host offline, or a network partition or firewall block may exist.
- ROLLBACK means the member is reverting writes that were not replicated to a majority before a failover. Data removed during rollback is written to files under the rollback/ directory inside the dbPath.
- REMOVED means the member has been removed from the replica set configuration.
health is 1 if the last heartbeat succeeded, and 0 if it failed. Because a member in RECOVERING can still report health: 1, the health field alone does not tell you whether a member can serve traffic. Any data-bearing member in RECOVERING, ROLLBACK, DOWN, or UNKNOWN for more than two minutes warrants investigation.
Note: rs.status() shows the view of the member you connect to. If member A cannot reach member B due to an outbound firewall rule, A reports B as DOWN while B may see itself as SECONDARY. Always check from multiple vantage points before concluding a node is dead.
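The state descriptions above can be sketched as a small lookup. This is a minimal illustration, not part of MongoDB: the category labels and the classifyMembers helper are invented for this example, and the sample hosts are hypothetical. It keys on the numeric state field rather than stateStr, on the assumption that the shell may render state 8 (DOWN) as "(not reachable/healthy)" in stateStr.

```javascript
// Sketch: classify rs.status() members by numeric `state`.
// Category names are illustrative labels, not MongoDB terminology.
const STATE_CATEGORY = {
  1: "healthy",      // PRIMARY
  2: "healthy",      // SECONDARY
  7: "healthy",      // ARBITER
  0: "transient",    // STARTUP
  5: "transient",    // STARTUP2
  3: "degraded",     // RECOVERING
  9: "degraded",     // ROLLBACK
  6: "unreachable",  // UNKNOWN
  8: "unreachable",  // DOWN
  10: "removed",     // REMOVED
};

// Accepts any object shaped like rs.status() output.
function classifyMembers(status) {
  return status.members.map(function (m) {
    return {
      name: m.name,
      stateStr: m.stateStr,
      health: m.health,
      category: STATE_CATEGORY[m.state] || "other",
    };
  });
}

// Hand-written sample shaped like rs.status() (hypothetical hosts).
const sampleStatus = {
  members: [
    { name: "db1:27017", state: 1, stateStr: "PRIMARY", health: 1 },
    { name: "db2:27017", state: 3, stateStr: "RECOVERING", health: 1 },
    { name: "db3:27017", state: 8, stateStr: "(not reachable/healthy)", health: 0 },
  ],
};

const classified = classifyMembers(sampleStatus);
classified.forEach(function (c) {
  console.log(c.name + " -> " + c.category + " (health: " + c.health + ")");
});
```

In mongosh the same function can be called as classifyMembers(rs.status()). Note that db2 is classified as degraded even though its health is 1, which is the case the health field alone would miss.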
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Network partition or asymmetric firewall | Member is DOWN or UNKNOWN from some nodes but SECONDARY from itself or others | rs.status() from the affected member and from the primary; compare views |
| Secondary fell off the oplog window | stateStr: RECOVERING, lastHeartbeatMessage contains “too stale to catch up” | Oplog window from rs.printReplicationInfo() versus replication lag from rs.printSecondaryReplicationInfo() |
| Initial sync in progress | stateStr: STARTUP2, large initialSyncStatus subdocument visible in rs.status() | Sync source reachability and progress; syncSourceUnreachableSince if present |
| Rollback after failover | stateStr: ROLLBACK | MongoDB logs for rollback size; rollback/ directory under dbPath |
| Resource exhaustion slowing heartbeats | Member flaps between PRIMARY and SECONDARY; election events in logs | db.serverStatus().wiredTiger.cache and OS disk latency on the node |
| Stuck recovery after unclean shutdown | stateStr: RECOVERING for >2 minutes with no progress | Logs for journal replay progress; disk I/O utilization |
Quick checks
Run these safe, read-only commands to orient yourself.
# List all members with state and health
mongosh --quiet --eval 'rs.status().members.forEach(function(m) { print(m.name + " -> " + m.stateStr + " (health: " + m.health + ")"); })'
# Check last heartbeat messages for explicit errors
mongosh --quiet --eval 'rs.status().members.forEach(function(m) { if (m.lastHeartbeatMessage) print(m.name + ": " + m.lastHeartbeatMessage); })'
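// Compare optimeDate lag directly (run inside mongosh)
var status = rs.status();
var primary = status.members.filter(function(m) { return m.stateStr === 'PRIMARY'; })[0];
if (!primary) {
  // Without a visible primary there is no reference optime to compare against
  print("No PRIMARY visible from this member; cannot compute lag");
} else {
  status.members.filter(function(m) { return m.stateStr === 'SECONDARY'; }).forEach(function(s) {
    print(s.name + " lag: " + ((primary.optimeDate - s.optimeDate) / 1000) + " sec");
  });
}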
# Check oplog window and replication overview
mongosh --quiet --eval 'rs.printReplicationInfo()'
# Look for rollback, stale, or election events in the log
grep -iE "rollback|too stale|election" /var/log/mongodb/mongod.log | tail -20
// Check if the node is reachable but unresponsive due to ticket or cache pressure
var s = db.serverStatus();
print("opcounters: " + JSON.stringify(s.opcounters));
print("queue: " + JSON.stringify(s.globalLock.currentQueue));
How to diagnose it
flowchart TD
A[Member unhealthy] --> B{Connect directly?}
B -->|No| C[Network or process down]
B -->|Yes| D{Self-view matches?}
D -->|No| E[Check firewall and ports]
D -->|Yes| F[Read lastHeartbeatMessage]
F --> G{RECOVERING?}
G -->|Yes| H[Compare lag to oplog window]
G -->|No| I[Investigate ROLLBACK or STARTUP2]
H --> J{Lag > window?}
J -->|Yes| K[Plan full resync]
J -->|No| L[Check disk and cache]

- Run rs.status() from the current primary. Identify any member where stateStr is not PRIMARY, SECONDARY, or ARBITER.
- For each unhealthy member, record stateStr, health, lastHeartbeat, lastHeartbeatRecv, and pingMs. Read lastHeartbeatMessage first; it often contains the exact reason, such as “too stale to catch up”.
- Eliminate perspective bias. Connect directly to the unhealthy member and run rs.status(). If it sees itself as SECONDARY but the primary sees it as DOWN, the issue is network connectivity, not process failure. Verify bidirectional reachability on the replication port.
- Check replication lag. Compare optimeDate between the primary and the secondary. If lag is growing and approaching the oplog window, the secondary is on track to require a full initial sync.
- Check resource saturation on the affected member. High WiredTiger cache dirty ratio, application-thread evictions, or journal sync latency can slow oplog application enough to push a member into RECOVERING.
- Inspect MongoDB logs on the affected node for rollback, election, or initial sync progress. Rollback data is written to the rollback/ directory under dbPath.
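The decision path above can be sketched as a single function that takes two rs.status() views. This is a rough illustration under stated assumptions: diagnose and findMember are hypothetical helpers, the returned strings are just the flowchart labels, the oplog window is passed in as seconds by the caller, and the sample hosts and timestamps are invented.

```javascript
// Sketch of the diagnosis flow. Inputs are rs.status()-shaped objects:
// primaryView from the primary, selfView from the affected member
// (null when a direct connection to that member failed).
function findMember(status, name) {
  return status.members.find(function (m) { return m.name === name; }) || null;
}

function diagnose(primaryView, selfView, name, oplogWindowSec) {
  if (selfView === null) return "network or process down";
  const seenByPrimary = findMember(primaryView, name);
  const seenBySelf = findMember(selfView, name);
  if (!seenByPrimary || !seenBySelf) return "member missing from a view";
  // Perspective check: the two vantage points should agree on the state.
  if (seenByPrimary.stateStr !== seenBySelf.stateStr) return "check firewall and ports";
  if (seenBySelf.stateStr !== "RECOVERING") return "investigate " + seenBySelf.stateStr;
  const primary = primaryView.members.find(function (m) { return m.stateStr === "PRIMARY"; });
  if (!primary) return "no primary visible";
  const lagSec = (primary.optimeDate.getTime() - seenBySelf.optimeDate.getTime()) / 1000;
  return lagSec > oplogWindowSec ? "plan full resync" : "check disk and cache";
}

// Hand-written sample views (hypothetical hosts and timestamps).
const t0 = new Date("2024-01-01T12:00:00Z");
const twoHoursAgo = new Date(t0.getTime() - 7200 * 1000);
const primaryView = { members: [
  { name: "db1:27017", stateStr: "PRIMARY", optimeDate: t0 },
  { name: "db2:27017", stateStr: "RECOVERING", optimeDate: twoHoursAgo },
] };
const selfView = { members: [
  { name: "db1:27017", stateStr: "PRIMARY", optimeDate: t0 },
  { name: "db2:27017", stateStr: "RECOVERING", optimeDate: twoHoursAgo },
] };

// Two hours of lag against a one-hour oplog window.
console.log(diagnose(primaryView, selfView, "db2:27017", 3600));
```

The sketch deliberately stops at the first mismatch between views, because a disagreement points at connectivity and makes the lag comparison unreliable until it is resolved.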
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Member stateStr and health | Classifies liveness and service eligibility | Any data-bearing member not PRIMARY or SECONDARY for >2 minutes |
| Replication lag | Time between primary and secondary optimeDate | Sustained >10 seconds, or >25% of oplog window |
| Oplog window | Coverage hours of the oplog | <12 hours |
| lastHeartbeatMessage | Explicit error text from the replication layer | “too stale to catch up”, connection timeouts, or sync source unreachable |
| Election events | Repeated elections indicate instability | >1 per hour outside maintenance windows |
| WiredTiger cache dirty ratio | Resource exhaustion can stall replication | >15% sustained |
| Journal sync latency | Storage health leading indicator | >30 ms sustained |
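The warning thresholds in the table can be expressed as a simple check over one sample of metrics. This is only a sketch: the field names on the sample object (lagSec, oplogWindowHours, electionsLastHour, cacheDirtyPct, journalSyncMs) are hypothetical and would need to be mapped from whatever your monitoring exports, and the "sustained" qualifier is not modeled because it needs a time series rather than a single sample.

```javascript
// Thresholds copied from the table above.
const THRESHOLDS = {
  lagSec: 10,
  lagWindowFraction: 0.25,
  oplogWindowHours: 12,
  electionsPerHour: 1,
  cacheDirtyPct: 15,
  journalSyncMs: 30,
};

// Returns a list of warning strings for a single metrics sample.
function evaluateSignals(s) {
  const warnings = [];
  const windowSec = s.oplogWindowHours * 3600;
  if (s.lagSec > THRESHOLDS.lagSec) warnings.push("replication lag above 10 s");
  if (s.lagSec > THRESHOLDS.lagWindowFraction * windowSec) warnings.push("lag above 25% of oplog window");
  if (s.oplogWindowHours < THRESHOLDS.oplogWindowHours) warnings.push("oplog window below 12 h");
  if (s.electionsLastHour > THRESHOLDS.electionsPerHour) warnings.push("more than 1 election in the last hour");
  if (s.cacheDirtyPct > THRESHOLDS.cacheDirtyPct) warnings.push("cache dirty ratio above 15%");
  if (s.journalSyncMs > THRESHOLDS.journalSyncMs) warnings.push("journal sync latency above 30 ms");
  return warnings;
}

// Hypothetical sample values for illustration.
const sampleSignals = {
  lagSec: 45,
  oplogWindowHours: 8,
  electionsLastHour: 0,
  cacheDirtyPct: 18,
  journalSyncMs: 12,
};

evaluateSignals(sampleSignals).forEach(function (w) { console.log("WARN: " + w); });
```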
Fixes
Oplog window exceeded or “too stale to catch up”
Once a secondary falls past the oplog window, it enters RECOVERING and must perform a full initial sync. In MongoDB 4.0 and later, increase the oplog size online with replSetResizeOplog to prevent other secondaries from falling off. The affected member still requires resync before it rejoins as a secondary.
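As a minimal sketch of the resize step, the helper below builds the command document for replSetResizeOplog, whose size field is expressed in megabytes with a documented minimum of 990. The buildResizeOplogCommand helper and the 16384 MB target are illustrative assumptions, not recommendations.

```javascript
// replSetResizeOplog takes the new size in megabytes; 990 MB is the documented minimum.
const MIN_OPLOG_MB = 990;

// Hypothetical helper: validates the size and returns the command document.
function buildResizeOplogCommand(sizeMB) {
  if (typeof sizeMB !== "number" || !isFinite(sizeMB) || sizeMB < MIN_OPLOG_MB) {
    throw new RangeError("oplog size must be a number >= " + MIN_OPLOG_MB + " MB");
  }
  return { replSetResizeOplog: 1, size: sizeMB };
}

// Example target of 16 GB (illustrative value only).
const resizeCmd = buildResizeOplogCommand(16384);
console.log(JSON.stringify(resizeCmd));

// In mongosh, connected to the member being resized:
//   db.adminCommand(resizeCmd)
// Assumption: some shell versions expect `size` as a BSON double,
// in which case wrap it as Double(16384) before sending.
```

The resize applies to the member it is run on, so it likely needs to be repeated on each data-bearing member; check the procedure for your MongoDB version before running it in production.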
Rollback
Do not restart the member. Let rollback finish. Inspect the rollback/ directory under the dbPath for data that may need manual re-application. To prevent future rollbacks, use w: "majority" write concern for operations that must survive a failover.
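For reference, a majority write concern can be passed per operation as an options document. The sketch below only shows the shape of that document; the orders collection and the 5000 ms wtimeout are hypothetical values chosen for illustration.

```javascript
// Options document requesting majority acknowledgement.
// wtimeout bounds how long the write waits for that acknowledgement (milliseconds).
const majorityWrite = { writeConcern: { w: "majority", wtimeout: 5000 } };

console.log(JSON.stringify(majorityWrite));

// In mongosh (hypothetical collection name):
//   db.orders.insertOne({ sku: "abc", qty: 1 }, majorityWrite)
```

A write acknowledged this way has been replicated to a majority of voting members, so it is not among the writes a rollback would revert.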
Resource exhaustion causing RECOVERING
Kill unnecessary long-running operations with db.killOp(). Warning: killing a write operation may leave data partially updated. If storage latency is the root cause, reduce write throughput or step down the primary to shift workload to a member with healthier disks. Do not increase WiredTiger ticket limits; higher concurrency worsens queuing.
Network partition or asymmetric firewall
Do not restart MongoDB. Verify bidirectional connectivity on the replication port between the affected member and the rest of the set. Firewall rules that block outbound connections from one member silently break replication even if inbound rules pass.
Initial sync in progress
STARTUP2 can last hours for large data sets. Monitor initialSyncStatus inside rs.status() for cloning progress. If syncSourceUnreachableSince is present, the member cannot reach its sync source and will remain stuck until connectivity is restored.
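A rough way to read cloning progress from initialSyncStatus is sketched below. It assumes the approxTotalDataSize and approxTotalBytesCopied fields are present, which depends on the MongoDB version, so treat those names as assumptions and check the output of your own deployment. summarizeInitialSync and toNum are hypothetical helpers, and the sample values are invented.

```javascript
// Converts shell numeric wrappers (objects with toNumber()) or plain numbers.
function toNum(v) {
  if (v === undefined || v === null) return null;
  return typeof v.toNumber === "function" ? v.toNumber() : Number(v);
}

// Accepts an rs.status()-shaped object taken from the syncing member.
function summarizeInitialSync(status) {
  const s = status.initialSyncStatus;
  if (!s) return { inProgress: false };
  const total = toNum(s.approxTotalDataSize);    // assumed field name
  const copied = toNum(s.approxTotalBytesCopied); // assumed field name
  const percentCopied =
    total !== null && copied !== null && total > 0
      ? Math.round((copied / total) * 100)
      : null;
  return {
    inProgress: true,
    percentCopied: percentCopied,
    // Present only while the sync source cannot be reached.
    sourceUnreachableSince: s.syncSourceUnreachableSince || null,
  };
}

// Hand-written sample (values are illustrative).
const syncingStatus = {
  initialSyncStatus: { approxTotalDataSize: 1000, approxTotalBytesCopied: 250 },
};
const syncSummary = summarizeInitialSync(syncingStatus);
console.log(JSON.stringify(syncSummary));
```

In mongosh, run summarizeInitialSync(rs.status()) while connected to the member in STARTUP2. A non-null sourceUnreachableSince is the signal to check connectivity before waiting any longer.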
Prevention
- Monitor oplog window trending, not just current lag. Size the oplog to maintain at least 24 hours of coverage during peak writes.
- Monitor WiredTiger cache dirty ratio and ticket utilization. Cache pressure and disk stalls are common root causes of replication lag that leads to RECOVERING.
- Verify bidirectional firewall rules between all replica set members before production cutover.
- Prefer w: "majority" write concern to avoid rollback events.
How Netdata helps
- Netdata’s MongoDB collector tracks replica set member state and replication lag, exposing stateStr transitions without manual rs.status() parsing.
- Correlate member state changes with WiredTiger cache dirty ratio and ticket utilization on the same node to spot resource-induced replication stalls.
- Track journal sync latency and checkpoint duration to detect storage-layer problems before they push a secondary into RECOVERING.
- Alert on oplog window shrinkage and replication lag trends, not just absolute thresholds.
- Visualize election events alongside connection churn to distinguish network blips from capacity issues.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB connection churn: high totalCreated rate and thread creation overhead
- MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling
- MongoDB connection storm spiral: reconnection floods after an election or deploy
- MongoDB exceeded memory limit for $group — aggregation spills and allowDiskUse
- MongoDB flow control throttling writes: when the primary slows itself down







