MongoDB not master error: writes hitting a non-primary node after failover
A node restart, network partition, or planned stepdown triggers a MongoDB election. Seconds later, application logs show NotWritablePrimary (code 10107) or the legacy string "not master and slaveOk=false". Writes fail against a node that used to be PRIMARY, even though the cluster has elected a new one.
This guide covers how to find the root cause and stop it from recurring.
What this means
MongoDB replica sets elect exactly one PRIMARY at a time. When a failover occurs, the old primary steps down and a secondary is promoted. Application drivers learn the initial topology from the replica set seed list, then track changes by monitoring each member with periodic hello heartbeats and refresh their connection pools automatically. Between stepdown and election completion, there is a brief window with no writable primary. After the new primary is elected, drivers should route writes there.
If writes land on a non-primary node after the election has settled, the driver’s view of the topology is stale, the connection is pinned to a specific host, or the application timed out before discovery completed. The cause is almost always in the driver configuration, connection pool state, or timeout behavior.
flowchart TD
A[NotWritablePrimary error] --> B["Run db.runCommand({ hello: 1 }) on target host"]
B --> C{isWritablePrimary?}
C -->|false| D[Node is secondary or recovering]
C -->|true| E[Driver topology stale or directConnection pinned]
D --> F[Check rs.status for recent election]
F --> G{Election occurred?}
G -->|yes| H[Driver has not refreshed topology]
G -->|no| I[Node stuck in RECOVERING or ROLLBACK]
H --> J{Check driver URI}
J -->|directConnection=true| K[Remove directConnection]
J -->|serverSelectionTimeoutMS < 10s| L[Increase timeout to 30s]
J -->|retryWrites=false| M[Enable retryable writes]
E --> N[Restart application to rebuild connection pool]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Stale connection pool after planned primary switch | Errors point to the former primary; only applications that did not restart are affected | Run db.runCommand({ hello: 1 }) on the target node to confirm it is no longer primary |
| directConnection=true in the connection URI | Every write fails against the same seed host, even when other nodes are healthy | Application connection string for directConnection=true |
| Aggressive serverSelectionTimeoutMS or socket timeout | Driver raises MongoTimeoutException or NotWritablePrimary during brief elections lasting 2-12 seconds | Driver timeout settings; compare to the default 30,000 ms |
| Retryable writes disabled or older driver | Brief election window causes permanent write failures instead of a single automatic retry | URI for retryWrites=false or driver version |
| Transaction writes during an active election | Multi-document transaction fails and is not individually retried; only commit and abort are retryable | Application transaction retry logic |
Quick checks
// Verify the target node's current role
db.runCommand({ hello: 1 }).isWritablePrimary
// Check replica set member states
rs.status().members.forEach(function(m) {
print(m.name + " -> " + m.stateStr);
});
# Look for recent elections in the log
grep -iE "election|stepping down" /var/log/mongodb/mongod.log | tail -20
// Check current connections and churn
var c = db.serverStatus().connections;
print("Current: " + c.current + ", Available: " + c.available + ", Total created: " + c.totalCreated);
// Check write throughput on the current primary
db.serverStatus().opcounters
// Check write latency distribution
var lat = db.serverStatus().opLatencies;
print("Write avg (µs): " + (lat.writes.latency / lat.writes.ops));
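The member-state check above can be wrapped in a small helper that also reports when the current primary was elected. This is a minimal sketch, assuming the input has the shape returned by rs.status() (a members array with name and stateStr, plus electionDate on the primary). The function name summarizeReplicaSet and the sample hostnames are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: summarize an rs.status()-shaped document.
function summarizeReplicaSet(status) {
  const members = status.members || [];
  const primary = members.find((m) => m.stateStr === "PRIMARY") || null;
  // Anything other than PRIMARY, SECONDARY, or ARBITER deserves a closer look.
  const expected = ["PRIMARY", "SECONDARY", "ARBITER"];
  const unhealthy = members
    .filter((m) => !expected.includes(m.stateStr))
    .map((m) => m.name + " -> " + m.stateStr);
  return {
    primary: primary ? primary.name : null,
    electedAt: primary && primary.electionDate ? primary.electionDate : null,
    unhealthy: unhealthy,
  };
}

// Sample data standing in for rs.status(); hostnames are made up.
const sampleStatus = {
  members: [
    { name: "db0.example.net:27017", stateStr: "SECONDARY" },
    { name: "db1.example.net:27017", stateStr: "PRIMARY", electionDate: new Date("2024-01-01T00:00:10Z") },
    { name: "db2.example.net:27017", stateStr: "RECOVERING" },
  ],
};
const summary = summarizeReplicaSet(sampleStatus);
console.log(summary);
```

In mongosh the same function could be called as summarizeReplicaSet(rs.status()). Comparing electedAt with the timestamp of the first application error shows whether the errors began before or after the new primary took over.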
How to diagnose it
- Identify the exact error and target host. Modern drivers return NotWritablePrimary (10107). Legacy drivers may return "not master". Note the host the application is targeting.
- Confirm the target node is not primary. Connect directly to that host and run db.runCommand({ hello: 1 }). If isWritablePrimary is false, the application is writing to a secondary or recovering node. If it is true, the node is writable now, so the error most likely reflects a brief stepdown that has since resolved or a driver still acting on a stale view of the topology.
- Check for a recent election. Search the MongoDB log for Starting an election, Stepping down, or VoteRequester. Elections typically complete in 2-12 seconds, but driver discovery depends on heartbeatFrequencyMS. Cross-reference with rs.status(): compare electionTime and stateStr across members to confirm when the new primary took over.
- Inspect the connection URI for directConnection=true. This setting forces the driver into Single topology and pins all operations to the seed host, bypassing replica set discovery entirely. It is a frequent misconfiguration in Kubernetes StatefulSets where each pod exposes its own host.
- Compare timeout values to the election window. If serverSelectionTimeoutMS is set to a few seconds, the driver may time out before it discovers the new primary. The default is 30,000 ms. Values below 10,000 ms are risky during failover.
- Check if the application disables retryable writes or uses an older driver. Inspect the URI for retryWrites=false. With retryWrites=true, the driver automatically retries single-document writes once after a transient error. If disabled, the application sees the error immediately and must handle it itself.
- Evaluate connection pool state. After a planned primary switch, applications that do not restart may retain idle connections to the old primary. Monitor totalCreated over time: a sharp rise after the switch indicates the driver is discarding stale connections and rebuilding the pool. If totalCreated stays flat but errors persist, connections are likely pinned by directConnection=true or the driver has not yet attempted to create new ones.
- For transaction errors, verify application-level retry logic. Writes inside a multi-document transaction are not individually retryable. Only the commitTransaction and abortTransaction operations are retryable.
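The three URI checks in the steps above (directConnection, serverSelectionTimeoutMS, retryWrites) can be automated. The following is a minimal sketch, assuming the options appear in the query string of a standard mongodb:// connection string. The function name auditMongoUri, the warning texts, and the sample hostnames are hypothetical; the 10,000 ms threshold is the one this guide uses.

```javascript
// Hypothetical helper: flag URI options that commonly cause post-failover write errors.
function auditMongoUri(uri) {
  const queryStart = uri.indexOf("?");
  const query = queryStart === -1 ? "" : uri.slice(queryStart + 1);
  const options = new Map();
  // Connection string option names are case-insensitive, so normalize the keys.
  for (const [key, value] of new URLSearchParams(query)) {
    options.set(key.toLowerCase(), value);
  }

  const warnings = [];
  if ((options.get("directconnection") || "").toLowerCase() === "true") {
    warnings.push("directConnection=true pins all operations to one host");
  }
  if (options.has("serverselectiontimeoutms")) {
    const timeoutMs = Number(options.get("serverselectiontimeoutms"));
    if (timeoutMs < 10000) {
      warnings.push("serverSelectionTimeoutMS=" + timeoutMs + " is below 10,000 ms");
    }
  }
  if ((options.get("retrywrites") || "").toLowerCase() === "false") {
    warnings.push("retryWrites=false disables the automatic single retry");
  }
  return warnings;
}

// Sample URIs; hostnames and database name are made up.
const riskyUri =
  "mongodb://db0.example.net:27017/app?directConnection=true&serverSelectionTimeoutMS=3000&retryWrites=false";
const cleanUri =
  "mongodb://db0.example.net:27017,db1.example.net:27017,db2.example.net:27017/app?replicaSet=rs0";

const riskyWarnings = auditMongoUri(riskyUri);
const cleanWarnings = auditMongoUri(cleanUri);
console.log(riskyWarnings);
console.log(cleanWarnings);
```

The check reads only the connection string. Options set in driver code rather than in the URI would not be seen and need a separate review.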
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Replica set member state | Shows which node is PRIMARY; any write to another state fails | Target node shows SECONDARY, RECOVERING, or ROLLBACK |
| Election events | Elections change topology; drivers need time to refresh via heartbeats | Elections outside maintenance windows |
| Connection count and churn | Stale pools or reconnection storms after failover | totalCreated delta spiking after member state changes |
| Operation latency (opLatencies) | High latency triggers aggressive timeouts that abort topology discovery | Write average approaching or exceeding serverSelectionTimeoutMS |
| opcounters write rate | A near-zero write rate on the primary while applications error confirms misrouted traffic | Write opcounters flat on PRIMARY during reported write failures |
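Two of the signals in the table, connection churn and a flat write rate on the primary, are deltas between samples rather than single readings. Here is a minimal sketch of that comparison, assuming two documents shaped like db.serverStatus() output captured some seconds apart on the current primary. The function name compareSamples and the sample numbers are hypothetical.

```javascript
// Hypothetical helper: compare two serverStatus()-shaped samples from the primary.
function compareSamples(before, after, elapsedSeconds) {
  // Number() is applied defensively in case counters are not plain numbers.
  const delta = (a, b) => Number(b) - Number(a);
  const created = delta(before.connections.totalCreated, after.connections.totalCreated);
  const writes =
    delta(before.opcounters.insert, after.opcounters.insert) +
    delta(before.opcounters.update, after.opcounters.update) +
    delta(before.opcounters.delete, after.opcounters.delete);
  return {
    connectionsCreatedPerSec: created / elapsedSeconds,
    writesPerSec: writes / elapsedSeconds,
    // No writes at all while applications report failures suggests misrouted traffic.
    writesFlat: writes === 0,
  };
}

// Made-up samples taken 60 seconds apart.
const sampleBefore = {
  connections: { totalCreated: 1000 },
  opcounters: { insert: 500, update: 200, delete: 50 },
};
const sampleAfter = {
  connections: { totalCreated: 1600 },
  opcounters: { insert: 500, update: 200, delete: 50 },
};
const churn = compareSamples(sampleBefore, sampleAfter, 60);
console.log(churn);
```

In this made-up case connections are being created at 10 per second while the write counters do not move, which matches the "reconnection storm with misrouted writes" pattern described in the table.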
Fixes
Remove directConnection=true
If the connection URI contains directConnection=true, remove it. This setting forces the driver to treat the seed host as the only node, disabling replica set topology discovery. The fix requires an application redeploy or restart.
Refresh stale connection pools
After a planned primary switch, restart application instances to force connection pools to rebuild against the new primary. Most drivers detect topology changes automatically, but if connections remain pinned, a restart clears them. Warning: restarting causes a brief capacity reduction and disrupts in-flight requests.
Extend serverSelectionTimeoutMS
Increase serverSelectionTimeoutMS to at least 10,000 ms, and preferably leave it at the default of 30,000 ms. Values below 10,000 ms often expire before the election completes and the driver refreshes its topology. Tradeoff: slower detection of permanently unreachable nodes.
Enable retryable writes
Ensure the URI includes retryWrites=true. This is the default in current MongoDB drivers. It handles transient NotWritablePrimary errors during brief elections by retrying once. Tradeoff: a small latency penalty for the retry handshake.
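Where retryable writes cannot be enabled, the application has to recognize the error and retry on its own. The following is a minimal sketch of that fallback, not the driver's built-in mechanism. The names isNotPrimaryError and writeWithOneRetry, the 500 ms pause, and the writeFn callback are hypothetical. Code 10107 comes from this guide; code 13435 (NotPrimaryNoSecondaryOk) and the message patterns are included on the assumption that some servers and drivers still surface them.

```javascript
// Error codes treated as "this node is not a writable primary".
// 10107 = NotWritablePrimary, 13435 = NotPrimaryNoSecondaryOk (assumed relevant here).
const NOT_PRIMARY_CODES = new Set([10107, 13435]);

// Hypothetical helper: decide whether an error looks like a not-primary error.
function isNotPrimaryError(err) {
  if (!err) return false;
  if (NOT_PRIMARY_CODES.has(err.code)) return true;
  // Legacy servers and drivers may only expose the message text.
  return typeof err.message === "string" && /not master|not primary/i.test(err.message);
}

// Hypothetical helper: run writeFn, and retry exactly once after a not-primary error.
// writeFn must be safe to run twice (idempotent), since the first attempt may
// have partially succeeded from the caller's point of view.
async function writeWithOneRetry(writeFn, delayMs = 500) {
  try {
    return await writeFn();
  } catch (err) {
    if (!isNotPrimaryError(err)) throw err;
    // Short pause to give the driver a chance to discover the new primary.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return writeFn();
  }
}
```

A caller would pass its own write as the callback, for example writeWithOneRetry(() => collection.insertOne(doc)), where collection and doc are the application's own objects. A single retry mirrors the behavior described for retryWrites=true; looping indefinitely would hide a genuinely misconfigured connection.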
Add application-level transaction retries
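For multi-document transactions, implement retry logic at the transaction level. Individual writes inside a transaction are not retryable, and only commitTransaction and abortTransaction are retried by the driver. When a transaction fails with the TransientTransactionError label, rerun the whole transaction; when a commit fails with the UnknownTransactionCommitResult label, retry the commit. Tradeoff: requires code changes.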
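A minimal sketch of application-level transaction retry, keyed on the error labels MongoDB attaches to transaction failures (TransientTransactionError and UnknownTransactionCommitResult). The function names, the attempt limits, and the runAttempt and commitFn callbacks are hypothetical; each callback stands in for the application's own transaction body or commit call.

```javascript
// Check for an error label whether the error exposes hasErrorLabel() or a plain array.
function hasLabel(err, label) {
  if (err && typeof err.hasErrorLabel === "function") return err.hasErrorLabel(label);
  return Boolean(err && Array.isArray(err.errorLabels) && err.errorLabels.includes(label));
}

// Rerun the whole transaction when it fails with a transient label.
// runAttempt is expected to start a transaction, do its writes, and commit.
async function runTransactionWithRetry(runAttempt, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runAttempt();
    } catch (err) {
      const retryable = hasLabel(err, "TransientTransactionError");
      if (!retryable || attempt >= maxAttempts) throw err;
    }
  }
}

// Retry only the commit when its outcome is unknown.
async function commitWithRetry(commitFn, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await commitFn();
    } catch (err) {
      const retryable = hasLabel(err, "UnknownTransactionCommitResult");
      if (!retryable || attempt >= maxAttempts) throw err;
    }
  }
}
```

Many drivers also provide a callback-style transaction helper that applies similar retry logic internally; where one is available, it is usually simpler than hand-written loops like these.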
Fix Kubernetes StatefulSet routing
If the application connects directly to a single pod hostname because of a headless service workaround, switch to the full replica set seed list in the URI and remove directConnection=true. A headless Kubernetes service returns pod IPs, but if the application hardcodes one pod’s DNS name or uses a single-pod endpoint, the driver never sees the other members. Use the StatefulSet headless service DNS names for all pods in the seed list.
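A seed list built from StatefulSet pod names might look like the output of the sketch below. It assumes the usual stable pod DNS pattern of pod-name.service.namespace.svc.cluster.local and the default cluster domain. The function name buildSeedListUri and all values in the example (mongo, mongo-headless, db, rs0) are hypothetical.

```javascript
// Hypothetical helper: build a replica set URI listing every StatefulSet pod.
// Assumes the default Kubernetes cluster domain "cluster.local".
function buildSeedListUri({ statefulSet, service, namespace, replicas, replicaSet, port = 27017 }) {
  const hosts = [];
  for (let i = 0; i < replicas; i++) {
    // StatefulSet pods are named <statefulSet>-<ordinal>.
    hosts.push(`${statefulSet}-${i}.${service}.${namespace}.svc.cluster.local:${port}`);
  }
  // No directConnection option: the driver discovers the primary on its own.
  return `mongodb://${hosts.join(",")}/?replicaSet=${encodeURIComponent(replicaSet)}`;
}

const seedUri = buildSeedListUri({
  statefulSet: "mongo",
  service: "mongo-headless",
  namespace: "db",
  replicas: 3,
  replicaSet: "rs0",
});
console.log(seedUri);
```

Listing every pod means the driver can still bootstrap discovery when any single pod is down, which is the property a single-pod hostname lacks.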
Prevention
- Never use directConnection=true in production replica set connections. Use the full seed list and let the driver discover the primary.
- Set serverSelectionTimeoutMS to at least 10,000 ms, preferably 30,000 ms, to survive elections without timing out.
- Keep drivers up to date and do not disable retryWrites.
- Restart application instances or verify that connection pools refresh after planned primary maintenance to avoid stale connections to the old primary.
- Monitor election events and alert when they occur outside of maintenance windows.
- Verify that load balancers or proxies between the application and MongoDB do not pin connections to a single node, as this defeats driver topology discovery.
How Netdata helps
- Correlate replica set member state changes with application error spikes to identify stale topology quickly.
- Track MongoDB connection count and totalCreated churn to detect reconnection storms after failovers.
- Monitor opLatencies write latency to catch aggressive timeout configurations before they cause mass write failures.
- Alert on election events parsed from MongoDB logs to surface the root-cause timeline.
- Visualize opcounters drops on the primary to confirm that write traffic is not reaching the new primary.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early
- MongoDB monitoring checklist: the signals every production cluster needs
- MongoDB monitoring maturity model: from survival to expert
- MongoDB noTimeout cursors causing cache pressure: pinned snapshots and silent eviction stalls
- MongoDB oplog window collapse: secondaries falling off and forced full resync







