MongoDB connection storm spiral: reconnection floods after an election or deploy

Connection count on a primary jumps from 200 to 4,000 in under a minute. Resident memory climbs, query latencies double, and application logs fill with timeout errors. The slow query log shows nothing unusual. Individual queries are not the problem. The database is drowning in threads.

This is a connection storm spiral. A trigger event, usually a replica set election, application deploy, or network blip, invalidates existing connections across your application fleet. Every driver reconnects at once. Each new connection costs MongoDB a dedicated thread and roughly 1 MB of stack memory. The resulting RSS spike and ticket contention slow down operations already in flight, causing more timeouts, which drive even more reconnections. The feedback loop ends in an OOM kill or an unresponsive node.

The differentiator from normal pool growth is connection churn. A steady current count with a rapidly climbing totalCreated means threads are being created and destroyed faster than the server can sustain.

flowchart TD
    A[Election, deploy, or network blip] --> B[Drivers invalidate pooled connections]
    B --> C[Mass simultaneous reconnection]
    C --> D[New thread per connection]
    D --> E[Memory RSS spikes]
    E --> F[Ticket contention and scheduling overhead]
    F --> G[Operations slow and timeout]
    G --> H[More reconnections]
    H --> C

What this means

MongoDB uses a one-thread-per-connection model. When a trigger event causes mass reconnection, the server must create thousands of threads almost instantly. This consumes resident memory, increases kernel scheduling overhead, and floods the WiredTiger storage engine with concurrent operations competing for read and write tickets.

As tickets exhaust, new operations queue in globalLock.currentQueue. Latencies rise. Application drivers time out and retry, opening yet more connections. The cycle feeds itself until the node runs out of memory or file descriptors, or until the underlying trigger is resolved and reconnections stop.

A spike in current alone can be normal pool warmup. A sustained delta in totalCreated means the server is burning resources on thread lifecycle overhead. Treat totalCreated churn rate as the primary diagnostic signal.
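As a rough illustration of that distinction, the sketch below classifies two samples of serverStatus().connections taken a known number of seconds apart. The function name classifyChurn, the 100-per-minute threshold, and the 10% tolerance for "flat" are assumptions chosen for the example, not MongoDB defaults; tune them against your own baseline.

```javascript
// Classify connection behaviour from two samples of serverStatus().connections.
// Thresholds are illustrative assumptions, not server defaults.
function classifyChurn(first, second, elapsedSeconds, opts) {
  const o = Object.assign({ churnPerMinute: 100, flatTolerance: 0.10 }, opts);

  // Connections created per minute between the two samples.
  const createdPerMinute =
    (second.totalCreated - first.totalCreated) * 60 / elapsedSeconds;

  // Relative change in the number of open connections.
  const currentChange =
    Math.abs(second.current - first.current) / Math.max(first.current, 1);

  const highChurn = createdPerMinute > o.churnPerMinute;
  const flat = currentChange <= o.flatTolerance;

  let verdict = "steady";
  if (highChurn && flat) verdict = "storm";          // created and destroyed rapidly
  else if (highChurn) verdict = "growth-or-storm";   // current is moving too; keep sampling
  else if (!flat) verdict = "pool-resize";           // count changed without heavy churn

  return { createdPerMinute, currentChange, verdict };
}

// In mongosh, the samples could be collected like this:
//   const a = db.serverStatus().connections;
//   sleep(30000);
//   const b = db.serverStatus().connections;
//   printjson(classifyChurn(a, b, 30));
```

A single pair of samples is only a snapshot; repeating the measurement a few times gives a more reliable picture than one verdict.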

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Primary election or failover | Connection spike seconds after a new primary is elected; rs.status() shows a recent electionDate | rs.status() member states and MongoDB logs for "Starting an election" |
| Rolling application restart or deploy | Connections spike from many new application processes simultaneously; client IPs are distributed uniformly across the fleet | Application deployment timestamps and process start events |
| Network blip or DNS failure | Brief drop in network throughput followed by a flood; driver logs may show pool cleared events | Network latency and DNS resolution health between apps and database |
| Load balancer health check failure | Regular cadence of connection spikes matching the LB probe interval; many connections from the LB IP range | LB health check configuration and probe logs |

Quick checks

Run these read-only commands to confirm the storm and assess severity.

// Check connection count, available slots, and total churn
var c = db.serverStatus().connections;
print("Current: " + c.current + "  Available: " + c.available + "  Total created: " + c.totalCreated);

// Check WiredTiger ticket availability (MongoDB <=7.x)
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read tickets available: " + t.read.available + " / " + t.read.totalTickets);
print("Write tickets available: " + t.write.available + " / " + t.write.totalTickets);

// Check queue depth
var q = db.serverStatus().globalLock.currentQueue;
print("Total queued: " + q.total + "  Readers: " + q.readers + "  Writers: " + q.writers);

// Identify heaviest client sources among active operations
db.currentOp({ active: true }).inprog.forEach(function(op) {
  print((op.client || "internal") + " | " + op.op + " | " + (op.secs_running || 0) + "s");
});

# Check OS-level RSS in MB (adjust if multiple mongod processes exist)
ps -o rss= -p $(pgrep -x mongod) | awk '{print "RSS: " $1/1024 " MB"}'

# Check for recent elections or stepdowns in the log
grep -iE "Starting an election|stepping down" /var/log/mongodb/mongod.log | tail -10

// Check server-side average latency
var lat = db.serverStatus().opLatencies;
var rOps = lat.reads.ops;
var wOps = lat.writes.ops;
print("Read avg ms: " + (rOps ? (lat.reads.latency / rOps / 1000).toFixed(2) : "N/A"));
print("Write avg ms: " + (wOps ? (lat.writes.latency / wOps / 1000).toFixed(2) : "N/A"));

// Check if throughput has collapsed
db.serverStatus().opcounters

How to diagnose it

  1. Confirm the trigger. Check rs.status() for a recent election, application deployment logs for a restart, or network metrics for a blip. The spiral almost always has an identifiable trigger within the last 1-2 minutes.
  2. Measure churn, not just count. Sample totalCreated twice, 30 seconds apart. A large delta with a stable or slowly changing current confirms a storm rather than legitimate growth.
  3. Correlate memory with connections. Compare db.serverStatus().mem.resident against the expected baseline of WiredTiger cache + (current connections × ~1 MB) + internal overhead. If RSS is significantly higher than that baseline, thread stacks are the likely cause.
  4. Identify the top client sources. Use db.currentOp() to see which hosts are driving the most load. In a storm, you will see a broad distribution across your application fleet rather than a single misbehaving host.
  5. Check ticket exhaustion. If available read or write tickets have dropped below 25% of total, the storage engine is saturated and operations are queuing. This is what turns a reconnect burst into a latency spiral.
  6. Assess memory exhaustion risk. If RSS is within 1 GB of system memory or the available connection count is approaching zero, the node is at risk of OOM or file descriptor exhaustion.
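
The memory arithmetic in steps 3 and 6 can be sketched as a small helper. Everything here is an assumption made for illustration: the function name assessMemory, the 1 MB per-connection stack estimate taken from the text above, and in particular the overheadMB default, which is a placeholder you should replace with a figure observed on your own nodes. All inputs are in MB, which matches the unit of serverStatus().mem.resident.

```javascript
// Compare resident memory against the expected baseline:
//   WiredTiger cache + (connections x ~1 MB) + internal overhead.
// overheadMB is a placeholder assumption, not a measured value.
function assessMemory({ residentMB, cacheMB, connections, systemMB,
                        overheadMB = 1024, stackMBPerConn = 1 }) {
  const expectedMB = cacheMB + connections * stackMBPerConn + overheadMB;
  const excessMB = residentMB - expectedMB;   // positive: RSS above baseline
  const headroomMB = systemMB - residentMB;   // memory left before the host is full
  return {
    expectedMB,
    excessMB,
    headroomMB,
    oomRisk: headroomMB < 1024                // "within 1 GB of system memory"
  };
}

// In mongosh, the inputs could come from:
//   const s = db.serverStatus();
//   const cacheMB = s.wiredTiger.cache["bytes currently in the cache"] / 1024 / 1024;
//   assessMemory({ residentMB: s.mem.resident, cacheMB,
//                  connections: s.connections.current, systemMB: 16384 });
```

The systemMB value has to come from the host (or the container limit), since serverStatus does not know how much memory the node is actually allowed to use.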

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Connection totalCreated churn rate | Thread creation and destruction consume CPU and memory; high churn with stable current is the defining storm signal | Delta > 100 per minute while current is flat |
| connections.current | Each connection allocates a thread and stack memory | Rapid 10x increase within seconds |
| Memory RSS | Thread stacks drive resident memory upward during a storm | RSS growth correlates with connection spike and exceeds expected baseline |
| WiredTiger ticket availability | More connections means more operations competing for storage engine admission | Available read or write tickets drop below 25% of total |
| globalLock.currentQueue depth | Operations queue when tickets or locks are scarce | Sustained total queue > 20 |
| opLatencies reads and writes | Server-side latency degradation from contention | Average latency doubles from baseline for > 5 minutes |
| opcounters throughput | Ticket exhaustion and thread overhead cause throughput collapse | Sustained drop > 50% from baseline |
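
The warning signs for tickets, queue depth, latency, and throughput can be checked mechanically. The sketch below is one hedged way to do it; the function name evaluateSignals and the field names on the baseline and sample objects are hypothetical, and it looks at a single sample, so the "sustained" and "> 5 minutes" conditions still have to be enforced by whatever calls it repeatedly.

```javascript
// Evaluate one sample against the warning thresholds listed above.
// Field names on `baseline` and `sample` are hypothetical; populate them from
// serverStatus() (concurrentTransactions, globalLock.currentQueue, opLatencies,
// opcounters) in whatever way suits your tooling.
function evaluateSignals(baseline, sample) {
  const warnings = [];

  if (sample.readTicketsAvailable / sample.readTicketsTotal < 0.25) {
    warnings.push("read tickets below 25% of total");
  }
  if (sample.writeTicketsAvailable / sample.writeTicketsTotal < 0.25) {
    warnings.push("write tickets below 25% of total");
  }
  if (sample.queueTotal > 20) {
    warnings.push("queue depth above 20");
  }
  if (sample.readLatencyMs >= 2 * baseline.readLatencyMs) {
    warnings.push("read latency at least double baseline");
  }
  if (sample.writeLatencyMs >= 2 * baseline.writeLatencyMs) {
    warnings.push("write latency at least double baseline");
  }
  if (sample.opsPerSec < 0.5 * baseline.opsPerSec) {
    warnings.push("throughput down more than 50% from baseline");
  }
  return warnings;
}
```

An empty array means none of these thresholds were crossed in that sample; several warnings at once, together with high totalCreated churn, is the pattern described in this article.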

Fixes

Immediate containment

The fastest way to stop the node from dying is to cap new connections dynamically without a restart:

db.adminCommand({setParameter: 1, maxIncomingConnections: <value>})

Choose a value just below connections.current to force immediate rejections. Rejected connections appear in application logs as connection failures, but this breaks the feedback loop before OOM.

Persist the change in mongod.conf after the incident. Do not restart mongod to apply this during a storm. Avoid reactive rs.stepDown() calls; an election invalidates connections and can worsen the flood.

Identify the heaviest client sources with db.currentOp() and coordinate with application owners to slow or pause restarts until the topology is stable.
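
Reading raw db.currentOp() output during an incident is slow, so it can help to aggregate it by client host. The helper below is a minimal sketch; topClients is a hypothetical name, and it assumes op.client has the usual "host:port" form, with operations that have no client field counted as internal.

```javascript
// Count active operations per client host from currentOp().inprog.
function topClients(inprog, limit = 5) {
  const counts = new Map();
  for (const op of inprog) {
    // Strip the trailing ":port" so all connections from one host group together.
    const host = op.client ? op.client.replace(/:\d+$/, "") : "internal";
    counts.set(host, (counts.get(host) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])          // busiest host first
    .slice(0, limit)
    .map(([host, ops]) => ({ host, ops }));
}

// In mongosh:
//   printjson(topClients(db.currentOp({ active: true }).inprog));
```

A flat distribution across many hosts points at a fleet-wide reconnect; one host far ahead of the rest points at a single misbehaving client instead.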

Stabilize the topology

If the trigger was an election, determine why the primary stepped down. Check for heartbeat timeouts, disk stalls, or memory pressure that caused the original instability. Recovering from the storm without fixing that root cause simply sets up the next spiral.

Throttle client reconnect behavior

After the storm subsides, review application driver configuration. Large connection pool sizes multiply the impact of any trigger. Ensure that total possible connections across all application instances leaves substantial headroom below the server limit. A common target is to operate at less than 50% of maxIncomingConnections during normal load, leaving capacity for reconnection bursts.
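
The headroom check is simple arithmetic, sketched below under stated assumptions. poolHeadroom is a hypothetical helper; it compares the worst case (every application instance filling its pool) against the 50% target, which is stricter than measuring normal load, and it ignores the extra monitoring connections each driver opens per server.

```javascript
// Worst-case connection demand versus a utilization budget on the server limit.
function poolHeadroom({ instances, maxPoolSize, maxIncomingConnections,
                        targetUtilization = 0.5 }) {
  const worstCase = instances * maxPoolSize;   // every pool fully open
  const budget = Math.floor(maxIncomingConnections * targetUtilization);
  return {
    worstCase,
    budget,
    withinBudget: worstCase <= budget,
    // Largest per-instance pool size that still fits the budget.
    maxPoolSizeForBudget: Math.floor(budget / instances)
  };
}

// Example: 40 instances with a pool of 100 against a limit of 5000.
console.log(poolHeadroom({ instances: 40, maxPoolSize: 100, maxIncomingConnections: 5000 }));
```

In this example the fleet could open 4,000 connections against a budget of 2,500, so either the pool size drops to about 62 per instance or the server limit and memory sizing need revisiting.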

Prevention

  • Monitor totalCreated delta, not just current. Churn is the leading indicator.
  • Keep normal connection utilization below 50% of the configured maximum. This leaves headroom for reconnection floods.
  • Track WiredTiger ticket availability continuously. Ticket exhaustion is what turns a reconnect burst into a spiral.
  • Ensure replica set elections are rare outside of maintenance. Frequent elections indicate network instability, storage latency, or misconfigured electionTimeoutMillis.
  • Review application logs for connectionPoolCleared events after any deploy or failover. These indicate driver-level reconnection behavior that may need tuning.
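
If the application does not already log pool events, a small counter can expose them. This is a sketch that assumes the Node.js driver, where the MongoClient emits connection pool monitoring events such as connectionPoolCleared, connectionCreated, and connectionClosed; other drivers expose the same information through their own listener APIs. trackPoolEvents is a hypothetical helper that works with any object offering an on(name, handler) method.

```javascript
// Count connection pool events emitted by a driver client (or any emitter).
function trackPoolEvents(emitter) {
  const counts = {
    connectionPoolCleared: 0,
    connectionCreated: 0,
    connectionClosed: 0
  };
  for (const name of Object.keys(counts)) {
    emitter.on(name, () => { counts[name] += 1; });
  }
  return counts; // live object; read it periodically or export it as metrics
}

// Assumed usage with the Node.js driver (uri is a placeholder):
//   const { MongoClient } = require("mongodb");
//   const client = new MongoClient(uri);
//   const counts = trackPoolEvents(client);
//   setInterval(() => console.log(counts), 60000);
```

A burst of connectionPoolCleared followed by a large jump in connectionCreated right after a deploy or failover is the client-side view of the same storm the server sees as totalCreated churn.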

How Netdata helps

  • Netdata samples serverStatus every second. A totalCreated delta that dwarfs baseline churn is visible immediately.
  • A single dashboard correlates mongodb.connections_current, system mem.rss, and mongodb.global_lock_current_queue_total, confirming the spiral in seconds.
  • Ticket availability is trended automatically, so you can see when a reconnect burst crosses into storage engine saturation.
  • Alerts on connection spikes coupled with rising queue depth or falling ticket availability reduce false positives from benign pool resizing.