MongoDB not master error: writes hitting a non-primary node after failover

A node restart, network partition, or planned stepdown triggers a MongoDB election. Seconds later, application logs show NotWritablePrimary (code 10107) or the legacy string "not master and slaveOk=false". Writes fail against a node that used to be PRIMARY, even though the cluster has elected a new one.

This guide covers how to find the root cause and stop it from recurring.

What this means

MongoDB replica sets elect exactly one PRIMARY at a time. When a failover occurs, the old primary steps down and a secondary is promoted. Drivers use the seed list only for the initial connection; after that they monitor every member with periodic hello checks, detect the new primary, and refresh their connection pools automatically. Between stepdown and election completion, there is a brief window with no writable primary. Once the new primary is elected, drivers should route writes there.

If writes land on a non-primary node after the election has settled, the driver’s view of the topology is stale, the connection is pinned to a specific host, or the application timed out before discovery completed. The cause is almost always in the driver configuration, connection pool state, or timeout behavior.
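To make the "which role is this node in" question concrete, here is a minimal sketch that classifies a hello response document. The field names (setName, isWritablePrimary, secondary, primary) are those the hello command returns; the function name, the role labels, and the sample document are assumptions made for illustration.

```javascript
// Classify a `hello` response by replica set role.
// The returned labels are this sketch's own, not MongoDB state names.
function classifyHello(hello) {
  if (!hello.setName) return "not-in-replica-set";
  if (hello.isWritablePrimary === true) return "primary";
  if (hello.secondary === true) return "secondary";
  // Neither flag is set: the member is in another state
  // (for example RECOVERING or ROLLBACK) and cannot accept writes.
  return "other";
}

// Hypothetical response from a node that has just stepped down.
const steppedDownNode = {
  setName: "rs0",
  isWritablePrimary: false,
  secondary: true,
  primary: "db-1.example.internal:27017", // where writes should go instead
};

console.log(classifyHello(steppedDownNode), "-> current primary:", steppedDownNode.primary);
```

Any result other than "primary" means a write sent to that host fails, regardless of what the application believes about the topology.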

flowchart TD
  A[NotWritablePrimary error] --> B["Run db.runCommand({hello: 1}) on target host"]
  B --> C{isWritablePrimary?}
  C -->|false| D[Node is secondary or recovering]
  C -->|true| E[Driver topology stale or directConnection pinned]
  D --> F[Check rs.status for recent election]
  F --> G{Election occurred?}
  G -->|yes| H[Driver has not refreshed topology]
  G -->|no| I[Node stuck in RECOVERING or ROLLBACK]
  H --> J{Check driver URI}
  J -->|directConnection=true| K[Remove directConnection]
  J -->|serverSelectionTimeoutMS < 10s| L[Increase timeout to 30s]
  J -->|retryWrites=false| M[Enable retryable writes]
  E --> N[Restart application to rebuild connection pool]
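The flowchart can also be read as a small decision function. The sketch below is one possible translation, assuming the caller has already gathered the three inputs by hand; the function name and the shape of the input object are hypothetical.

```javascript
// Translate the triage flowchart into code.
// Inputs:
//   isWritablePrimary - value reported by hello on the target host
//   electionOccurred  - whether rs.status() or the logs show a recent election
//   uri               - options already extracted from the connection string
function triage({ isWritablePrimary, electionOccurred, uri = {} }) {
  if (isWritablePrimary) {
    return ["Driver topology stale or directConnection pinned: restart application to rebuild connection pool"];
  }
  if (!electionOccurred) {
    return ["Node stuck in RECOVERING or ROLLBACK"];
  }
  // An election happened and the driver has not caught up: inspect the URI.
  const actions = [];
  if (uri.directConnection === true) actions.push("Remove directConnection");
  if (typeof uri.serverSelectionTimeoutMS === "number" && uri.serverSelectionTimeoutMS < 10000) {
    actions.push("Increase serverSelectionTimeoutMS to 30000");
  }
  if (uri.retryWrites === false) actions.push("Enable retryable writes");
  if (actions.length === 0) actions.push("Driver has not refreshed topology");
  return actions;
}

console.log(triage({
  isWritablePrimary: false,
  electionOccurred: true,
  uri: { directConnection: true, serverSelectionTimeoutMS: 3000, retryWrites: false },
}));
```

Unlike the diagram, the function returns every URI finding at once, since a misconfigured connection string often has more than one of these problems.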

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Stale connection pool after planned primary switch | Errors point to the former primary; only applications that did not restart are affected | Run db.runCommand({ hello: 1 }) on the target node to confirm it is no longer primary |
| directConnection=true in the connection URI | Every write fails against the same seed host, even when other nodes are healthy | Application connection string for directConnection=true |
| Aggressive serverSelectionTimeoutMS or socket timeout | Driver raises MongoTimeoutException or NotWritablePrimary during brief elections lasting 2-12 seconds | Driver timeout settings; compare to the default 30,000 ms |
| Retryable writes disabled or older driver | Brief election window causes permanent write failures instead of a single automatic retry | URI for retryWrites=false or driver version |
| Transaction writes during an active election | Multi-document transaction fails and is not individually retried; only commit and abort are retryable | Application transaction retry logic |

Quick checks

// Verify the target node's current role
db.runCommand({ hello: 1 }).isWritablePrimary
// Check replica set member states
rs.status().members.forEach(function(m) {
  print(m.name + " -> " + m.stateStr);
});
# Look for recent elections in the log (run this in a shell on the database host, not in mongosh)
grep -iE "election|stepping down" /var/log/mongodb/mongod.log | tail -20
// Check current connections and churn
var c = db.serverStatus().connections;
print("Current: " + c.current + ", Available: " + c.available + ", Total created: " + c.totalCreated);
// Check write throughput on the current primary
db.serverStatus().opcounters
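// Check average write latency (cumulative since mongod started)
var lat = db.serverStatus().opLatencies;
// Guard against division by zero when no writes have been recorded yet
print("Write avg (µs): " + (lat.writes.ops > 0 ? lat.writes.latency / lat.writes.ops : "n/a"));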

How to diagnose it

  1. Identify the exact error and target host. Modern drivers return NotWritablePrimary (10107). Legacy drivers may return "not master". Note the host the application is targeting.
  2. Confirm the target node is not primary. Connect directly to that host and run db.runCommand({ hello: 1 }). If isWritablePrimary is false, the application is writing to a secondary or recovering node. If it is true, the node is primary right now, so the error was most likely raised during a stepdown that has since resolved, or it came from a different host than the one you checked.
  3. Check for a recent election. Search the MongoDB log for Starting an election, Stepping down, or VoteRequester. Elections typically complete in 2-12 seconds, but driver discovery also depends on heartbeatFrequencyMS. Cross-reference with rs.status(): the PRIMARY member's electionDate shows when it took over, and stateStr shows each member's current role.
  4. Inspect the connection URI for directConnection=true. This setting forces the driver into Single topology and pins all operations to the seed host, bypassing replica set discovery entirely. It is a frequent misconfiguration in Kubernetes StatefulSets where each pod exposes its own host.
  5. Compare timeout values to the election window. If serverSelectionTimeoutMS is set to a few seconds, the driver may time out before it discovers the new primary. The default is 30,000 ms. Values below 10,000 ms are risky during failover.
  6. Check if the application disables retryable writes or uses an older driver. Inspect the URI for retryWrites=false. With retryWrites=true, the driver automatically retries single-document writes once after a transient error. If disabled, the application sees the error immediately and must handle it itself.
  7. Evaluate connection pool state. After a planned primary switch, applications that do not restart may retain idle connections to the old primary. Monitor totalCreated over time: a sharp rise after the switch indicates the driver is discarding stale connections and rebuilding the pool. If totalCreated stays flat but errors persist, connections are likely pinned by directConnection=true or the driver has not yet attempted to create new ones.
  8. For transaction errors, verify application-level retry logic. Writes inside a multi-document transaction are not individually retryable. Only the commitTransaction and abortTransaction operations are retryable.
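Steps 4 through 6 all come down to reading the connection string, which can be automated. The following is a minimal sketch that flags the three risky options; the function names and the example URI are hypothetical, and the 10,000 ms threshold is the one this guide uses. It relies on connection string option names being case-insensitive and on reserved characters in credentials being percent-encoded, so the first "?" starts the options.

```javascript
// Extract connection string options into a lowercase-keyed object.
function parseUriOptions(uri) {
  const q = uri.indexOf("?");
  const params = new URLSearchParams(q === -1 ? "" : uri.slice(q + 1));
  const options = {};
  for (const [key, value] of params) options[key.toLowerCase()] = value;
  return options;
}

// Return a list of failover-related findings for a connection string.
function lintUri(uri) {
  const o = parseUriOptions(uri);
  const findings = [];
  if (o.directconnection === "true") {
    findings.push("directConnection=true pins all operations to the seed host");
  }
  if (o.serverselectiontimeoutms !== undefined && Number(o.serverselectiontimeoutms) < 10000) {
    findings.push("serverSelectionTimeoutMS below 10000 may expire during an election");
  }
  if (o.retrywrites === "false") {
    findings.push("retryWrites=false surfaces transient failover errors to the application");
  }
  return findings;
}

// Hypothetical URI showing all three problems at once.
const riskyUri =
  "mongodb://db-0.example.internal:27017/app?directConnection=true&serverSelectionTimeoutMS=3000&retryWrites=false";
lintUri(riskyUri).forEach((finding) => console.log("WARN:", finding));
```

A check like this could run in CI or at application startup, so a risky connection string is caught before the first failover rather than during it.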

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Replica set member state | Shows which node is PRIMARY; any write to another state fails | Target node shows SECONDARY, RECOVERING, or ROLLBACK |
| Election events | Elections change topology; drivers need time to refresh via heartbeats | Elections outside maintenance windows |
| Connection count and churn | Stale pools or reconnection storms after failover | totalCreated delta spiking after member state changes |
| Operation latency (opLatencies) | High latency triggers aggressive timeouts that abort topology discovery | Write average approaching or exceeding serverSelectionTimeoutMS |
| opcounters write rate | A near-zero write rate on the primary while applications error confirms misrouted traffic | Write opcounters flat on PRIMARY during reported write failures |
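Most of these signals are cumulative counters, so they only become meaningful as deltas between two samples. Here is a minimal sketch that derives rates from two serverStatus snapshots. The field paths (connections.totalCreated, opLatencies.writes, opcounters) are real serverStatus fields; the function name, the snapshot values, and the assumption that the values have been converted to plain numbers are illustrative.

```javascript
// Derive failover-relevant rates from two serverStatus snapshots.
function compareSnapshots(prev, curr, elapsedSeconds) {
  const created = curr.connections.totalCreated - prev.connections.totalCreated;
  const latencyMicros = curr.opLatencies.writes.latency - prev.opLatencies.writes.latency;
  const latencyOps = curr.opLatencies.writes.ops - prev.opLatencies.writes.ops;
  // Sum the write-type opcounters: insert, update, delete.
  const writes = ["insert", "update", "delete"].reduce(
    (sum, key) => sum + (curr.opcounters[key] - prev.opcounters[key]),
    0
  );
  return {
    connectionsCreatedPerSec: created / elapsedSeconds,
    writeOpsPerSec: writes / elapsedSeconds,
    // null when no writes completed in the interval
    avgWriteLatencyMicros: latencyOps > 0 ? latencyMicros / latencyOps : null,
  };
}

// Hypothetical snapshots taken 60 seconds apart.
const before = {
  connections: { totalCreated: 1000 },
  opLatencies: { writes: { latency: 5000, ops: 10 } },
  opcounters: { insert: 100, update: 50, delete: 0 },
};
const after = {
  connections: { totalCreated: 1600 },
  opLatencies: { writes: { latency: 11000, ops: 40 } },
  opcounters: { insert: 160, update: 50, delete: 0 },
};

console.log(compareSnapshots(before, after, 60));
```

A high connection creation rate combined with a near-zero write rate on the primary is the pattern the table describes: clients are reconnecting, but their writes are not arriving.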

Fixes

Remove directConnection=true

If the connection URI contains directConnection=true, remove it. This setting forces the driver to treat the seed host as the only node, disabling replica set topology discovery. The fix requires an application redeploy or restart.

Refresh stale connection pools

After a planned primary switch, restart application instances to force connection pools to rebuild against the new primary. Most drivers detect topology changes automatically, but if connections remain pinned, a restart clears them. Warning: restarting causes a brief capacity reduction and disrupts in-flight requests.

Extend serverSelectionTimeoutMS

Set serverSelectionTimeoutMS to at least 10,000 ms, and preferably leave it at the 30,000 ms default. Values below 10,000 ms often expire before the election completes and the driver refreshes its topology. Tradeoff: when no primary is reachable at all, each operation blocks for the full timeout before failing.

Enable retryable writes

Ensure the URI includes retryWrites=true. This is the default in current MongoDB drivers. It handles transient NotWritablePrimary errors during brief elections by retrying once. Tradeoff: a write that fails over waits for server selection and one extra round trip before it succeeds or surfaces the error.
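Where retryable writes cannot be enabled, for example with an older driver, the application has to handle the error itself. The following is a minimal sketch of such a handler, not a replacement for the driver feature. The helper names, attempt count, and delays are assumptions. Code 10107 is NotWritablePrimary; 13435 (NotPrimaryNoSecondaryOk) is assumed here to be the code behind the legacy "not master and slaveOk=false" message. Only wrap operations that are safe to run twice, because the first attempt may have been applied before the error surfaced.

```javascript
// Error codes this sketch treats as "the target node is not the primary".
const NOT_PRIMARY_CODES = new Set([10107, 13435]);

function isNotPrimaryError(err) {
  return Boolean(err) && NOT_PRIMARY_CODES.has(err.code);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation with exponential backoff while the error
// indicates a failover in progress. Other errors are rethrown immediately.
async function retryOnFailover(operation, { attempts = 5, baseDelayMs = 500 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (!isNotPrimaryError(err) || attempt >= attempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

With the defaults shown, the helper waits 500, 1,000, 2,000, and 4,000 ms between its five attempts, about 7.5 seconds in total. A call might look like `await retryOnFailover(() => collection.updateOne(filter, update))`, where `collection`, `filter`, and `update` come from the application.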

Add application-level transaction retries

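For multi-document transactions, implement retry logic in the application. Individual writes inside a transaction are not retried by the driver, and although commitTransaction and abortTransaction are retried once automatically, a failover in mid-transaction still aborts it. Rerun the whole transaction when the error carries the TransientTransactionError label, and retry only the commit when it carries UnknownTransactionCommitResult. Tradeoff: requires code changes.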
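One common pattern is to rerun the whole transaction when the error carries the TransientTransactionError label, and to retry only the commit when it carries UnknownTransactionCommitResult. The sketch below shows that pattern in driver-agnostic form. The function names and attempt limits are assumptions, and the error is assumed to expose a hasErrorLabel method as the Node.js driver's errors do. Many drivers also ship a withTransaction helper that applies similar logic, which is usually preferable to hand-rolled loops.

```javascript
// True when the error exposes hasErrorLabel and carries the given label.
function hasLabel(err, label) {
  return Boolean(err) && typeof err.hasErrorLabel === "function" && err.hasErrorLabel(label);
}

// Rerun the entire transaction body on transient errors.
async function runTransactionWithRetry(txnBody, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await txnBody();
    } catch (err) {
      if (!hasLabel(err, "TransientTransactionError") || attempt >= maxAttempts) throw err;
    }
  }
}

// Retry only the commit when its outcome is unknown.
async function commitWithRetry(commit, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await commit();
    } catch (err) {
      if (!hasLabel(err, "UnknownTransactionCommitResult") || attempt >= maxAttempts) throw err;
    }
  }
}
```

In use, `txnBody` would start a transaction on a session, perform its writes, and finish with `commitWithRetry(() => session.commitTransaction())`, aborting the transaction before any rerun.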

Fix Kubernetes StatefulSet routing

If the application connects to a single pod hostname, for example as a workaround for a headless service, switch to the full replica set seed list in the URI and remove directConnection=true. A headless Kubernetes service gives each StatefulSet pod its own stable DNS name, but if the application hardcodes one pod's name or uses a single-pod endpoint with directConnection=true, the driver never sees the other members. List the headless-service DNS name of every pod in the seed list.
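As an illustration, the seed list can be generated from the StatefulSet's own parameters. This sketch assumes the conventional pod DNS pattern `<statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local` with the default cluster.local cluster domain; the function name and all example values are hypothetical.

```javascript
// Build a replica set connection string that lists every StatefulSet pod.
function buildSeedListUri({ statefulSet, service, namespace, replicas, replicaSet, port = 27017 }) {
  const hosts = [];
  for (let ordinal = 0; ordinal < replicas; ordinal++) {
    // Stable per-pod DNS name provided by the headless service.
    hosts.push(`${statefulSet}-${ordinal}.${service}.${namespace}.svc.cluster.local:${port}`);
  }
  // No directConnection option: the driver discovers the primary itself.
  return `mongodb://${hosts.join(",")}/?replicaSet=${encodeURIComponent(replicaSet)}`;
}

console.log(buildSeedListUri({
  statefulSet: "mongo",
  service: "mongo-headless",
  namespace: "db",
  replicas: 3,
  replicaSet: "rs0",
}));
```

Listing every pod means the driver can still bootstrap discovery when any single pod is down or has just stepped down.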

Prevention

  • Never use directConnection=true in production replica set connections. Use the full seed list and let the driver discover the primary.
  • Set serverSelectionTimeoutMS to at least 10,000 ms, preferably 30,000 ms, to survive elections without timing out.
  • Keep drivers up to date and do not disable retryWrites.
  • Restart application instances or verify that connection pools refresh after planned primary maintenance to avoid stale connections to the old primary.
  • Monitor election events and alert when they occur outside of maintenance windows.
  • Verify that load balancers or proxies between the application and MongoDB do not pin connections to a single node, as this defeats driver topology discovery.
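The election alert in the list above needs a rule for what counts as "outside a maintenance window". A minimal sketch of that rule follows. The function name, the window format (ISO 8601 start and end, end exclusive), and the timestamps are assumptions; the election times would come from parsed logs or from rs.status().

```javascript
// Return the election timestamps that fall outside every maintenance window.
function electionsOutsideWindows(electionTimes, windows) {
  return electionTimes.filter((time) => {
    const ts = new Date(time).getTime();
    const insideSomeWindow = windows.some(
      (w) => ts >= new Date(w.start).getTime() && ts < new Date(w.end).getTime()
    );
    return !insideSomeWindow;
  });
}

// Hypothetical data: one planned stepdown and one unexpected election.
const observedElections = ["2024-05-01T02:15:00Z", "2024-05-01T14:40:00Z"];
const maintenanceWindows = [{ start: "2024-05-01T02:00:00Z", end: "2024-05-01T03:00:00Z" }];

electionsOutsideWindows(observedElections, maintenanceWindows).forEach((time) =>
  console.log("ALERT: unexpected election at", time)
);
```

Only the second election triggers an alert here, since the first falls inside the planned window.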

How Netdata helps

  • Correlate replica set member state changes with application error spikes to identify stale topology quickly.
  • Track MongoDB connection count and totalCreated churn to detect reconnection storms after failovers.
  • Monitor opLatencies write latency to catch aggressive timeout configurations before they cause mass write failures.
  • Alert on election events parsed from MongoDB logs to surface the root-cause timeline.
  • Visualize opcounters drops on the primary to confirm that write traffic is not reaching the new primary.