MongoDB no primary / election storm: repeated elections and write outages

Applications log “not primary” errors. rs.status() shows a different PRIMARY than thirty seconds ago. MongoDB logs repeat "Starting an election" and "Stepping down". Each election costs 2-12 seconds of write unavailability. More than two in ten minutes is an election storm.

This pattern is more dangerous than a single failover because it creates rolling write outages that do not self-stabilize. Drivers reconnect, retry buffers fill, and application latency degrades even when a primary exists. Root causes usually fall into three categories: the primary is too slow to answer heartbeats, the network is dropping or delaying packets between members, or a misconfigured priority is forcing a healthy primary to step down. Clock skew between members is a less common fourth cause.

A single election is transient. More than one per hour outside maintenance is a serious stability concern. More than two within ten minutes is a ticket; escalate to a page only if cumulative no-primary time exceeds thirty seconds or applications are visibly failing.
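
The thresholds above can be expressed as a small classifier. This is a minimal sketch, not a recommended alerting rule: the function names, the input shapes, and the returned labels are illustrative assumptions, and only the numeric thresholds come from the text.

```javascript
// Thresholds taken from the text above; names and labels are illustrative.
const TEN_MINUTES_MS = 10 * 60 * 1000;
const ONE_HOUR_MS = 60 * 60 * 1000;

// Largest number of timestamps that fall inside any window of `windowMs`.
function maxEventsInWindow(timestampsMs, windowMs) {
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  let best = 0;
  let start = 0;
  for (let end = 0; end < sorted.length; end++) {
    while (sorted[end] - sorted[start] > windowMs) start++;
    best = Math.max(best, end - start + 1);
  }
  return best;
}

// electionTimesMs: epoch-millisecond timestamps of elections outside maintenance.
// noPrimarySeconds: cumulative time without a primary over the same period.
// appsFailing: whether applications are visibly failing.
function classifyElections(electionTimesMs, noPrimarySeconds, appsFailing) {
  const storm = maxEventsInWindow(electionTimesMs, TEN_MINUTES_MS) > 2;
  if (storm && (noPrimarySeconds > 30 || appsFailing)) return "page";
  if (storm) return "ticket";
  if (maxEventsInWindow(electionTimesMs, ONE_HOUR_MS) > 1) return "stability-concern";
  return electionTimesMs.length > 0 ? "transient" : "ok";
}
```

Three elections one minute apart with ten seconds of cumulative no-primary time would classify as a ticket under these rules; the same three elections with forty-five seconds without a primary would classify as a page.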

What this means

MongoDB replica sets use a consensus protocol to maintain a single primary. Members send heartbeats every two seconds. If a secondary does not hear from the primary for longer than electionTimeoutMillis (default ten seconds), it triggers an election. During an election, the replica set has no primary and rejects all writes.
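
To make the timing concrete, the sketch below derives how many heartbeat intervals fit inside one election timeout. It assumes an object shaped like rs.conf().settings whose fields arrive as plain numbers; the helper name heartbeatBudget is hypothetical, and the defaults are the ones stated above.

```javascript
// Defaults stated in the text: 2 s heartbeats, 10 s election timeout.
const DEFAULTS = { heartbeatIntervalMillis: 2000, electionTimeoutMillis: 10000 };

// settings: an object shaped like rs.conf().settings; missing fields fall back to the defaults.
function heartbeatBudget(settings = {}) {
  const s = { ...DEFAULTS, ...settings };
  return {
    electionTimeoutSeconds: s.electionTimeoutMillis / 1000,
    // Number of heartbeat intervals that fit inside one election timeout.
    heartbeatsPerTimeout: Math.floor(s.electionTimeoutMillis / s.heartbeatIntervalMillis),
  };
}

// In mongosh: printjson(heartbeatBudget(rs.conf().settings))
```

With the defaults, five heartbeat intervals fit inside the timeout, so a primary that stalls for only a few seconds usually survives, while a stall approaching ten seconds does not.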

An election storm occurs when the winning primary is unstable. It may step down because another member initiates a priority takeover, because heartbeats continue to time out, or because resource exhaustion on the new primary causes the same cycle to repeat. The cluster oscillates between primary and secondary states, and each transition forces applications to rediscover the topology.

Priority takeovers are a common trigger. In MongoDB 4.2+, a secondary with higher priority calls an election even when the current primary is healthy. If that secondary is under-provisioned, too stale, or suffers a network blip immediately after promotion, it steps down and the cycle repeats. The same happens when a primary is overloaded: resource saturation delays heartbeat processing and transmission, so secondaries miss the ten-second deadline and call an election.
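
A quick way to see whether a priority takeover is possible is to compare each secondary's configured priority with the current primary's. The following is a sketch under assumptions: it takes objects shaped like rs.conf() and rs.status(), matches members by host name, and the function name findTakeoverCandidates is hypothetical.

```javascript
// conf: shaped like rs.conf(); status: shaped like rs.status().
// Members are matched by conf.members[i].host === status.members[j].name.
function findTakeoverCandidates(conf, status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return []; // no primary right now, nothing to compare against
  const priorityByHost = new Map(conf.members.map((m) => [m.host, m.priority]));
  const primaryPriority = priorityByHost.get(primary.name);
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .filter((m) => priorityByHost.get(m.name) > primaryPriority)
    .map((m) => ({
      host: m.name,
      priority: priorityByHost.get(m.name),
      // How far this secondary's last applied operation trails the primary's.
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// In mongosh: printjson(findTakeoverCandidates(rs.conf(), rs.status()))
```

Any entry in the result is a member that can force the current primary to step down; an entry with a large lagSeconds is the kind of stale candidate described above.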

flowchart TD
    A[Primary under load or network loss] -->|Heartbeats missed| B[Election triggered]
    B --> C[No primary window<br/>2-12s write outage]
    C --> D[Member elected PRIMARY]
    D -->|Root cause persists| B
    D -->|Issue resolved| E[Stable replica set]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Primary resource exhaustion | CPU or disk saturation on the primary delays heartbeat responses past the ten-second election timeout; stepdowns correlate with load spikes | rs.status() lastHeartbeatMessage; OS CPU and iostat -x |
| Network instability or partition | Members intermittently report each other as DOWN; different nodes see different primaries; writes fail with transient “not primary” | Bidirectional ping and TCP connectivity between nodes on the replica set port |
| Misconfigured replica set priorities | A higher-priority secondary forces an election even though the current primary is healthy; if that secondary cannot sustain the load or is too stale, the cycle repeats | rs.conf() member priorities |
| Clock skew | Sporadic elections without clear network or load patterns; heartbeat timestamps drift between members | date -u or NTP status on all members |

Quick checks

# Check replica set member states and last heartbeat messages
mongosh --quiet --eval 'rs.status().members.forEach(m => print(m.name + " -> " + m.stateStr + " | health: " + m.health + " | lastHeartbeatMessage: " + (m.lastHeartbeatMessage || "n/a")))'
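# Check election timeout settings and per-member priorities
mongosh --quiet --eval 'var c = rs.conf(); printjson(c.settings); c.members.forEach(m => print(m.host + " priority: " + m.priority + " votes: " + m.votes))'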
# Check recent election and stepdown events
grep -iE "election|stepping down" /var/log/mongodb/mongod.log | tail -20
# Check primary operation latency
mongosh --quiet --eval 'db.serverStatus().opLatencies'
# Check WiredTiger cache pressure on the primary
mongosh --quiet --eval 'var c = db.serverStatus().wiredTiger.cache; var max = c["maximum bytes configured"]; print("dirty:", (100*c["tracked dirty bytes in the cache"]/max).toFixed(1), "% fill:", (100*c["bytes currently in the cache"]/max).toFixed(1), "%")'
# Check current replication lag for all secondaries (prints a notice if no primary is currently elected)
mongosh --quiet --eval 'var st = rs.status(); var p = st.members.filter(m => m.stateStr === "PRIMARY")[0]; if (!p) { print("no primary visible; lag cannot be computed") } else { st.members.filter(m => m.stateStr === "SECONDARY").forEach(s => print(s.name + " lag: " + ((p.optimeDate - s.optimeDate)/1000).toFixed(1) + "s")) }'
# Check OS disk latency on the primary
iostat -x 1 3
# Check network connectivity from primary to a secondary
ping -c 5 <secondary-host>
nc -zv <secondary-host> 27017
# Print UTC time on the current member; compare across all members
date -u +"%Y-%m-%d %H:%M:%S"

How to diagnose it

  1. Confirm the storm. Count election log lines. More than two "Starting an election" entries within ten minutes means the replica set is unstable. Check whether timestamps cluster around load spikes or form a regular interval.
  2. Map the sequence. Identify which member initiates each election and which member steps down. If the same secondary always starts the election, investigate that node first. If the primary steps down voluntarily, the logs show "Stepping down" with a reason.
  3. Check resource pressure on the stepping-down primary. Correlate election timestamps with CPU utilization, disk await, and WiredTiger cache pressure. If the primary is above 80% CPU or disk await spikes above 50 ms, heartbeat processing is likely being starved of resources. Look for lastHeartbeatMessage fields that mention slow responses.
  4. Test network paths. Run bidirectional latency and TCP checks between the stepping-down primary and the node that triggers the election. Packet loss or asymmetric routing causes intermittent heartbeat timeouts that are hard to see from a single ping. Check netstat -s for retransmits.
  5. Review priorities. Run rs.conf() and compare priority values. If a secondary has higher priority than the current primary, it forces a takeover. Ensure the higher-priority node is healthy and caught up before it takes over. A lagged secondary that wins an election but cannot maintain the load creates a loop.
  6. Verify clock synchronization. Compare date -u output across all members. Even a few seconds of skew can cause heartbeat logic to behave unpredictably. Ensure all nodes run NTP or chrony.
  7. Check for connection storms. After each election, application pools may reconnect en masse. High totalCreated rates on the new primary spike connection overhead and can trigger another stepdown. See the connection storm guide for details.
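
Steps 1 and 2 can be partly automated by extracting election and stepdown events from a member's log and looking at the gaps between them. The sketch below assumes MongoDB 4.4+ structured (JSON) log lines with a t.$date timestamp and a msg field, and it matches only the two phrases quoted in this guide; exact message wording may differ between versions. The function names are hypothetical.

```javascript
// Assumes MongoDB 4.4+ structured (JSON) log lines: {"t":{"$date":"..."},"msg":"...", ...}
function extractElectionEvents(logLines) {
  const events = [];
  for (const line of logLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip non-JSON lines (older log formats, partial writes)
    }
    if (!entry || typeof entry !== "object" || !entry.t || !entry.t.$date) continue;
    const msg = entry.msg || "";
    const kind = /starting an election/i.test(msg) ? "election"
      : /stepping down/i.test(msg) ? "stepdown"
      : null;
    if (kind) events.push({ kind, at: new Date(entry.t.$date), msg });
  }
  return events;
}

// Seconds between consecutive events of one kind. Near-constant gaps suggest a
// repeating loop (for example a priority takeover) rather than load spikes.
function gapsSeconds(events, kind) {
  const times = events
    .filter((e) => e.kind === kind)
    .map((e) => e.at.getTime())
    .sort((a, b) => a - b);
  return times.slice(1).map((t, i) => (t - times[i]) / 1000);
}
```

Each mongod logs the elections it starts itself, so running this against every member's log shows which node initiates the elections and which one steps down.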

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Election events | Direct measure of consensus instability | More than 2 elections in 10 minutes outside maintenance |
| Replica set member state | Flapping between PRIMARY and SECONDARY indicates oscillation | Any member transitions from PRIMARY to SECONDARY and back within 5 minutes |
| Replication lag | High lag prevents stable failovers and can cause priority takeover failures | Sustained lag above 10 seconds on voting members |
| Primary opLatencies | Rising latency means the primary is too slow to answer heartbeats | Read or write average latency doubles from baseline |
| Connection count and churn | Elections trigger reconnection floods that amplify load | totalCreated rate spikes after each election |
| WiredTiger cache dirty ratio | Resource exhaustion on the primary is a leading cause of missed heartbeats | Dirty ratio above 10% sustained |
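
Connection churn is a rate, so it needs two samples of the totalCreated counter. A minimal sketch, assuming each sample records the time it was taken and the value of db.serverStatus().connections.totalCreated as a plain number (convert first if your shell returns a Long type); the sample shape and function name are illustrative.

```javascript
// Each sample: { atMs: epoch ms when taken, totalCreated: connections.totalCreated at that time }
function connectionsCreatedPerSecond(earlier, later) {
  const seconds = (later.atMs - earlier.atMs) / 1000;
  if (seconds <= 0) throw new Error("samples must be in chronological order");
  const created = later.totalCreated - earlier.totalCreated;
  // The counter starts over when mongod restarts; report that case as null.
  return created < 0 ? null : created / seconds;
}

// In mongosh, a sample could be taken with:
//   ({ atMs: Date.now(), totalCreated: db.serverStatus().connections.totalCreated })
```

Comparing the rate just after an election with the rate during a quiet period shows whether reconnection floods are adding load to the new primary.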

Fixes

Reduce resource pressure on the primary

If the primary steps down because heartbeats time out under load, reduce the load or increase headroom. Pause batch writes, kill long-running operations, or, during a maintenance window, manually step the primary down so that a secondary with spare capacity takes over. Stepping down triggers another election, so do this only when the target secondary is healthy and replication lag is low. If the working set has grown beyond the cache, pressure returns until you resize or shard.
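
Before killing anything, it helps to list the candidates. The sketch below filters an object shaped like the result of db.currentOp() for active operations that have run longer than a threshold. The function name, the 60-second threshold in the usage comment, and the choice to skip the local database are assumptions made for illustration.

```javascript
// currentOp: shaped like db.currentOp(), i.e. { inprog: [ { opid, secs_running, op, ns, active }, ... ] }
function longRunningOps(currentOp, minSeconds) {
  return currentOp.inprog
    .filter((op) => op.active && op.secs_running >= minSeconds)
    // Skip the local database (oplog) so replication itself is never a kill candidate.
    .filter((op) => typeof op.ns === "string" && !op.ns.startsWith("local."))
    .map((op) => ({ opid: op.opid, op: op.op, ns: op.ns, secs: op.secs_running }))
    .sort((a, b) => b.secs - a.secs);
}

// In mongosh, review the list first:
//   longRunningOps(db.currentOp(), 60).forEach((o) => printjson(o));
// then stop only operations you have confirmed are safe to interrupt:
//   db.killOp(<opid>);
```

Sorting by running time puts the most likely culprits first, but the decision to kill an operation still needs a human check of what it is doing.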

Stabilize the network path

Fix firewall rules, security group ingress, or routing asymmetry between members. If the replica set spans availability zones with variable latency, increase electionTimeoutMillis via rs.reconfig() to tolerate tail latency. The tradeoff is slower failover during a real primary failure. Do not set the timeout lower than the round-trip time between your most distant members.
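
Raising the timeout is a one-field change to the replica set configuration. The helper below is a sketch: withElectionTimeout is a hypothetical name, the values 15000 and 120 in the usage comment are placeholders rather than recommendations, and the round-trip check is only the lower bound mentioned above.

```javascript
// cfg: shaped like rs.conf(). Returns a modified copy; the input is left untouched.
function withElectionTimeout(cfg, timeoutMs, maxRoundTripMs) {
  if (!Number.isInteger(timeoutMs) || timeoutMs <= 0) {
    throw new Error("timeoutMs must be a positive integer");
  }
  if (timeoutMs <= maxRoundTripMs) {
    throw new Error("timeout must exceed the worst round-trip time between members");
  }
  return { ...cfg, settings: { ...cfg.settings, electionTimeoutMillis: timeoutMs } };
}

// In mongosh, on the primary (placeholder values):
//   rs.reconfig(withElectionTimeout(rs.conf(), 15000, 120))
```

Building the new configuration from the current rs.conf() keeps every other setting as it was, so the reconfig changes only the election timeout.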

Eliminate priority takeover loops

Set all data-bearing voting members to the same priority unless you require a specific failover order. If you need a preferred primary, ensure that node has equal or better resources than the current primary and that its replication lag is near zero before it takes over. Lowering the current primary’s priority via rs.reconfig() can itself trigger an election if the change is applied while that node is primary.
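
Equalizing priorities can be sketched as a pure transformation of the configuration document. The function name equalizePriorities is hypothetical, and leaving arbiters, priority-0 members, and non-voting members unchanged is a design choice of this sketch, on the assumption that those members were made non-electable on purpose.

```javascript
// cfg: shaped like rs.conf(). Returns a copy with electable members set to one priority.
function equalizePriorities(cfg, priority = 1) {
  return {
    ...cfg,
    members: cfg.members.map((m) =>
      // Leave arbiters, priority-0 members, and non-voting members as they are.
      m.arbiterOnly || m.priority === 0 || m.votes === 0 ? m : { ...m, priority }
    ),
  };
}

// In mongosh, on the primary, once lag is low and load is quiet:
//   rs.reconfig(equalizePriorities(rs.conf()))
```

As noted above, applying a priority change while the affected node is primary can itself trigger an election, so this belongs in a maintenance window.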

Synchronize clocks

Ensure all members run NTP or chrony and that offset is below one second. This is a zero-tradeoff fix that eliminates a class of sporadic, hard-to-reproduce election triggers.
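
A rough cross-check of the one-second limit is to collect the epoch time each member reports and compare the spread. This is a sketch with hypothetical names and input shape. Because the samples are taken a moment apart, collection delay inflates the result, so treat it as an upper bound and rely on the NTP or chrony daemon's own offset report for a precise figure.

```javascript
// epochMsByHost: { "host:port": epoch milliseconds reported by that host },
// sampled as close together in time as possible.
function clockSkew(epochMsByHost, maxOffsetMs = 1000) {
  const values = Object.values(epochMsByHost);
  if (values.length < 2) throw new Error("need at least two members to compare");
  // Spread between the fastest and slowest clock in the sample.
  const spreadMs = Math.max(...values) - Math.min(...values);
  return { spreadMs, withinLimit: spreadMs < maxOffsetMs };
}
```

A spread well under the limit on repeated samples suggests clocks are not the cause; a spread above it is worth confirming directly on the hosts.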

Prevention

  • Alert on election events. More than one election per hour outside maintenance is a stability concern.
  • Monitor primary resource saturation (cache dirty ratio, ticket utilization, disk await) as a leading indicator of heartbeat timeout risk.
  • Keep replica set priorities equal unless you have a documented failover hierarchy and have verified that the preferred node can sustain the load.
  • Maintain network latency between members well below half the election timeout.
  • Use an odd number of voting members (or add an arbiter) to prevent ties where neither side holds a majority.
  • Monitor replication lag on electable members. A lagged secondary that forces a takeover creates instability.
  • Track connection churn after elections. A reconnection flood can spike load and trigger a second stepdown.
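
The advice about an odd number of voting members follows from majority arithmetic, sketched below. The function name is illustrative; the formula is the standard strict-majority rule.

```javascript
// votingMembers: number of members with a vote in the replica set.
function votingMajority(votingMembers) {
  if (!Number.isInteger(votingMembers) || votingMembers < 1) {
    throw new Error("votingMembers must be a positive integer");
  }
  // A primary needs votes from a strict majority of voting members.
  const majority = Math.floor(votingMembers / 2) + 1;
  return { majority, tolerableFailures: votingMembers - majority };
}
```

Three voting members tolerate one failure and so do four, which is why adding a fourth voter increases the chance of a split without improving fault tolerance; the fifth is what raises the tolerance to two.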

How Netdata helps

  • Election events correlated with replica set member state changes, showing when a primary steps down and which secondary promotes.
  • Primary resource saturation (WiredTiger cache dirty ratio, ticket utilization, opLatencies) exposed to identify heartbeat timeouts caused by resource exhaustion rather than network failure.
  • Replication lag per member to identify secondaries that are too stale to sustain a stable takeover.
  • Connection churn via totalCreated deltas, showing reconnection floods that follow each election.
  • OS-level disk await and CPU utilization cross-referenced with MongoDB signals to distinguish storage pressure from network partitions.