MongoDB connection storm spiral: reconnection floods after an election or deploy
Connection count on a primary jumps from 200 to 4,000 in under a minute. Resident memory climbs, query latencies double, and application logs fill with timeout errors. The slow query log shows nothing unusual. Individual queries are not the problem. The database is drowning in threads.
This is a connection storm spiral. A trigger event, usually a replica set election, application deploy, or network blip, invalidates existing connections across your application fleet. Every driver reconnects at once. Each new connection costs MongoDB a dedicated thread and roughly 1 MB of stack memory. The resulting RSS spike and ticket contention slow down operations already in flight, causing more timeouts, which drive even more reconnections. Left unchecked, the feedback loop ends in an OOM kill or an unresponsive node.
The differentiator from normal pool growth is connection churn. A steady current count with a rapidly climbing totalCreated means threads are being created and destroyed faster than the server can sustain.
flowchart TD
A[Election, deploy, or network blip] --> B[Drivers invalidate pooled connections]
B --> C[Mass simultaneous reconnection]
C --> D[New thread per connection]
D --> E[Memory RSS spikes]
E --> F[Ticket contention and scheduling overhead]
F --> G[Operations slow and timeout]
G --> H[More reconnections]
H --> C
What this means
MongoDB uses a one-thread-per-connection model. When a trigger event causes mass reconnection, the server must create thousands of threads almost instantly. This consumes resident memory, increases kernel scheduling overhead, and floods the WiredTiger storage engine with concurrent operations competing for read and write tickets.
As tickets exhaust, new operations queue in globalLock.currentQueue. Latencies rise. Application drivers time out and retry, opening yet more connections. The cycle feeds itself until the node runs out of memory or file descriptors, or until the underlying trigger is resolved and reconnections stop.
A spike in current alone can be normal pool warmup. A sustained delta in totalCreated means the server is burning resources on thread lifecycle overhead. Treat totalCreated churn rate as the primary diagnostic signal.
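The churn signal described above can be sketched as a small helper that compares two samples of db.serverStatus().connections. The function name and the "more than half of new connections were churn" rule are illustrative assumptions; the 100-per-minute figure is the warning threshold from the signals table later in this guide.

```javascript
// Sketch: classify two connection samples taken `seconds` apart.
// `before` and `after` are db.serverStatus().connections documents.
// The storm rule below is an illustrative assumption, not a MongoDB default.
function classifyChurn(before, after, seconds) {
  const created = after.totalCreated - before.totalCreated;
  const perMinute = (created / seconds) * 60;
  const currentDelta = after.current - before.current;
  // Connections created beyond the net growth in `current` were opened
  // and closed again within the window: that part is churn.
  const churned = Math.max(0, created - Math.max(0, currentDelta));
  const storm = perMinute > 100 && churned > created / 2;
  return { created, perMinute, currentDelta, churned, storm };
}

// In mongosh (assumes a connected shell, which provides `db` and `sleep`):
//   const a = db.serverStatus().connections;
//   sleep(30000);
//   const b = db.serverStatus().connections;
//   printjson(classifyChurn(a, b, 30));
```

A pool warmup shows up as a large currentDelta with little churn, while a storm shows the opposite pattern.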
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Primary election or failover | Connection spike seconds after a new primary is elected; rs.status() shows a recent electionDate | rs.status() member states and MongoDB logs for "Starting an election" |
| Rolling application restart or deploy | Connections spike from many new application processes simultaneously; client IPs are distributed uniformly across the fleet | Application deployment timestamps and process start events |
| Network blip or DNS failure | Brief drop in network throughput followed by a flood; driver logs may show pool cleared events | Network latency and DNS resolution health between apps and database |
| Load balancer health check failure | Regular cadence of connection spikes matching the LB probe interval; many connections from the LB IP range | LB health check configuration and probe logs |
Quick checks
Run these read-only commands to confirm the storm and assess severity.
// Check connection count, available slots, and total churn
var c = db.serverStatus().connections;
print("Current: " + c.current + " Available: " + c.available + " Total created: " + c.totalCreated);
// Check WiredTiger ticket availability (MongoDB <=7.x)
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read tickets available: " + t.read.available + " / " + t.read.totalTickets);
print("Write tickets available: " + t.write.available + " / " + t.write.totalTickets);
// Check queue depth
var q = db.serverStatus().globalLock.currentQueue;
print("Total queued: " + q.total + " Readers: " + q.readers + " Writers: " + q.writers);
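// Count active operations per client host to find the heaviest sources
var counts = {};
db.currentOp({ active: true }).inprog.forEach(function(op) {
  // op.client is "host:port"; strip the port so connections group by host
  var host = (op.client || "internal").replace(/:\d+$/, "");
  counts[host] = (counts[host] || 0) + 1;
});
Object.keys(counts).sort(function(a, b) { return counts[b] - counts[a]; })
  .forEach(function(h) { print(h + " | " + counts[h] + " active ops"); });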
# Check OS-level RSS in MB (adjust if multiple mongod processes exist)
ps -o rss= -p $(pgrep -x mongod) | awk '{print "RSS: " $1/1024 " MB"}'
# Check for recent elections or stepdowns in the log
grep -iE "Starting an election|stepping down" /var/log/mongodb/mongod.log | tail -10
// Check server-side average latency
var lat = db.serverStatus().opLatencies;
var rOps = lat.reads.ops;
var wOps = lat.writes.ops;
print("Read avg ms: " + (rOps ? (lat.reads.latency / rOps / 1000).toFixed(2) : "N/A"));
print("Write avg ms: " + (wOps ? (lat.writes.latency / wOps / 1000).toFixed(2) : "N/A"));
// Check if throughput has collapsed
db.serverStatus().opcounters
How to diagnose it
- Confirm the trigger. Check rs.status() for a recent election, application deployment logs for a restart, or network metrics for a blip. The spiral almost always has an identifiable trigger within the last 1-2 minutes.
- Measure churn, not just count. Sample totalCreated twice, 30 seconds apart. A large delta with a stable or slowly changing current confirms a storm rather than legitimate growth.
- Correlate memory with connections. Compare db.serverStatus().mem.resident against the expected baseline of WiredTiger cache + (current connections x ~1 MB) + internal overhead. If RSS is significantly higher, thread stacks are the likely cause.
- Identify the top client sources. Use db.currentOp() to see which hosts are driving the most load. In a storm, you will see a broad distribution across your application fleet rather than a single misbehaving host.
- Check ticket exhaustion. If available read or write tickets have dropped below 25% of total, the storage engine is saturated and operations are queuing. This is what turns a reconnect burst into a latency spiral.
- Assess memory exhaustion risk. If RSS is within 1 GB of system memory or the available connection count is approaching zero, the node is at risk of OOM or file descriptor exhaustion.
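The memory baseline from the steps above is simple arithmetic, sketched here as a helper. The function names and the overhead figure are hypothetical, and the ~1 MB per connection is this guide's rough estimate rather than a measured value for your build.

```javascript
// Sketch: expected resident memory = cache + thread stacks + overhead.
// All values are in MB. STACK_MB_PER_CONNECTION is the guide's rough
// ~1 MB estimate; overheadMb is an assumption you must supply yourself.
const STACK_MB_PER_CONNECTION = 1;

function expectedRssMb(cacheMb, connections, overheadMb) {
  return cacheMb + connections * STACK_MB_PER_CONNECTION + overheadMb;
}

// Positive result: observed RSS exceeds the estimate by that many MB.
function rssExcessMb(observedRssMb, cacheMb, connections, overheadMb) {
  return observedRssMb - expectedRssMb(cacheMb, connections, overheadMb);
}

// In mongosh (assumes a connected shell; 1024 MB overhead is a placeholder):
//   const s = db.serverStatus();
//   const cacheMb = s.wiredTiger.cache["bytes currently in the cache"] / 1048576;
//   print(rssExcessMb(s.mem.resident, cacheMb, s.connections.current, 1024));
```

With a 16 GB cache and 1 GB of assumed overhead, going from 200 to 4,000 connections moves the estimate from about 17.2 GB to about 20.9 GB, which is why the connection count matters on nodes sized tightly around the cache.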
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Connection totalCreated churn rate | Thread creation and destruction consume CPU and memory; high churn with stable current is the defining storm signal | Delta > 100 per minute while current is flat |
| connections.current | Each connection allocates a thread and stack memory | Rapid 10x increase within seconds |
| Memory RSS | Thread stacks drive resident memory upward during a storm | RSS growth correlates with connection spike and exceeds expected baseline |
| WiredTiger ticket availability | More connections means more operations competing for storage engine admission | Available read or write tickets drops below 25% of total |
| globalLock.currentQueue depth | Operations queue when tickets or locks are scarce | Sustained total queue > 20 |
| opLatencies reads and writes | Server-side latency degradation from contention | Average latency doubles from baseline for > 5 minutes |
| opcounters throughput | Ticket exhaustion and thread overhead cause throughput collapse | Sustained drop > 50% from baseline |
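As a rough illustration, the ticket, queue, latency, and throughput thresholds from the table can be applied to a single snapshot as follows. The field names on the two input objects are hypothetical and would need to be filled in from serverStatus output; the function checks one point in time, so the "sustained" and "> 5 minutes" conditions remain the caller's job.

```javascript
// Sketch: apply the table's warning thresholds to one snapshot.
// Field names on `s` and `baseline` are hypothetical, not serverStatus paths.
function warningSigns(s, baseline) {
  const warnings = [];
  const lowTickets = (t) => t.available / t.totalTickets < 0.25;
  if (lowTickets(s.readTickets)) warnings.push("read tickets below 25%");
  if (lowTickets(s.writeTickets)) warnings.push("write tickets below 25%");
  if (s.queueTotal > 20) warnings.push("queue depth above 20");
  if (s.avgLatencyMs >= 2 * baseline.avgLatencyMs) warnings.push("latency doubled");
  if (s.opsPerSec < 0.5 * baseline.opsPerSec) warnings.push("throughput down more than 50%");
  return warnings;
}

// Example call with made-up numbers:
//   warningSigns(
//     { readTickets: { available: 10, totalTickets: 128 },
//       writeTickets: { available: 100, totalTickets: 128 },
//       queueTotal: 35, avgLatencyMs: 12, opsPerSec: 4000 },
//     { avgLatencyMs: 5, opsPerSec: 5000 });
```

Requiring two or more warnings before paging is one way to avoid reacting to a benign pool resize.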
Fixes
Immediate containment
The fastest way to stop the node from dying is to cap new connections dynamically without a restart:
db.adminCommand({setParameter: 1, maxIncomingConnections: <value>})
Choose a value just below connections.current to force immediate rejections. Rejected connections appear in application logs as connection failures, but this breaks the feedback loop before OOM.
Persist the change in mongod.conf after the incident. Do not restart mongod to apply this during a storm. Avoid reactive rs.stepDown() calls; an election invalidates connections and can worsen the flood.
Identify the heaviest client sources with db.currentOp() and coordinate with application owners to slow or pause restarts until the topology is stable.
Stabilize the topology
If the trigger was an election, determine why the primary stepped down. Check for heartbeat timeouts, disk stalls, or memory pressure that caused the original instability. Containing the storm without fixing what caused the election will simply restart the spiral on the next event.
Throttle client reconnect behavior
After the storm subsides, review application driver configuration. Large connection pool sizes multiply the impact of any trigger. Ensure that total possible connections across all application instances leaves substantial headroom below the server limit. A common target is to operate at less than 50% of maxIncomingConnections during normal load, leaving capacity for reconnection bursts.
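The headroom check above can be sketched as a short calculation. The function name and input field names are hypothetical; maxPoolSize refers to the driver pool-size setting, and the sketch ignores the small number of monitoring connections each driver instance also opens.

```javascript
// Sketch: compare fleet-wide worst-case connections with the server limit.
// Input field names are illustrative; monitoring connections are ignored.
function poolHeadroom({ instances, maxPoolSize, normalCurrent, maxIncomingConnections }) {
  const worstCase = instances * maxPoolSize; // every pool fully grown at once
  return {
    worstCase,
    worstCaseFits: worstCase < maxIncomingConnections,
    normalUtilization: normalCurrent / maxIncomingConnections,
    normalUnderHalf: normalCurrent < 0.5 * maxIncomingConnections,
  };
}

// Example with made-up numbers: 40 app instances, pools capped at 100,
// 200 connections in normal operation, server limit of 5000.
//   poolHeadroom({ instances: 40, maxPoolSize: 100,
//                  normalCurrent: 200, maxIncomingConnections: 5000 });
```

If worstCaseFits is false, a fleet-wide reconnect can hit the server limit on its own, so lowering the per-instance pool size or the instance count is worth considering before raising the limit.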
Prevention
- Monitor totalCreated delta, not just current. Churn is the leading indicator.
- Keep normal connection utilization below 50% of the configured maximum. This leaves headroom for reconnection floods.
- Track WiredTiger ticket availability continuously. Ticket exhaustion is what turns a reconnect burst into a spiral.
- Ensure replica set elections are rare outside of maintenance. Frequent elections indicate network instability, storage latency, or misconfigured electionTimeoutMillis.
- Review application logs for connectionPoolCleared events after any deploy or failover. These indicate driver-level reconnection behavior that may need tuning.
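One common way to tune reconnection behavior at the application level is to add randomized backoff to retries, so that a fleet does not retry in lockstep after a failover. The sketch below is a generic full-jitter exponential backoff, not a MongoDB driver API; the function name, base delay, and cap are assumptions to adjust for your application.

```javascript
// Sketch: full-jitter exponential backoff for application-level retries.
// The delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(attempt, baseMs = 100, capMs = 10000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Example: wait before the n-th retry of a failed operation.
//   const delay = backoffDelayMs(retryCount);
//   await new Promise((resolve) => setTimeout(resolve, delay));
```

Because each process draws a different delay, retries spread out over the backoff window instead of arriving as a single wave.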
How Netdata helps
- Netdata samples serverStatus every second. A totalCreated delta that dwarfs baseline churn is visible immediately.
- A single dashboard correlates mongodb.connections_current, system mem.rss, and mongodb.global_lock_current_queue_total, confirming the spiral in seconds.
- Ticket availability is trended automatically, so you can see when a reconnect burst crosses into storage engine saturation.
- Alerts on connection spikes coupled with rising queue depth or falling ticket availability reduce false positives from benign pool resizing.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early
- MongoDB monitoring checklist: the signals every production cluster needs
- MongoDB monitoring maturity model: from survival to expert
- MongoDB noTimeout cursors causing cache pressure: pinned snapshots and silent eviction stalls
- MongoDB oplog window collapse: secondaries falling off and forced full resync







