MongoDB connection storm spiral: reconnection floods after an election or deploy

Connection count on a primary jumps from 200 to 4,000 in under a minute. Resident memory climbs, query latencies double, and application logs fill with timeout errors. The slow query log shows nothing unusual. Individual queries are not the problem. The database is drowning in threads.

This is a connection storm spiral. A trigger event, usually a replica set election, application deploy, or network blip, invalidates existing connections across your application fleet. Every driver reconnects at once. Each new connection costs MongoDB a dedicated thread and roughly 1 MB of stack memory. The resulting RSS spike and ticket contention slow down operations already in flight, causing more timeouts, which drive even more reconnections. The feedback loop ends in an OOM kill or an unresponsive node.

The differentiator from normal pool growth is connection churn. A steady current count with a rapidly climbing totalCreated means threads are being created and destroyed faster than the server can sustain.

flowchart TD
    A[Election, deploy, or network blip] --> B[Drivers invalidate pooled connections]
    B --> C[Mass simultaneous reconnection]
    C --> D[New thread per connection]
    D --> E[Memory RSS spikes]
    E --> F[Ticket contention and scheduling overhead]
    F --> G[Operations slow and timeout]
    G --> H[More reconnections]
    H --> C

What this means

MongoDB uses a one-thread-per-connection model. When a trigger event causes mass reconnection, the server must create thousands of threads almost instantly. This consumes resident memory, increases kernel scheduling overhead, and floods the WiredTiger storage engine with concurrent operations competing for read and write tickets.

As tickets exhaust, new operations queue in globalLock.currentQueue. Latencies rise. Application drivers time out and retry, opening yet more connections. The cycle feeds itself until the node runs out of memory or file descriptors, or until the underlying trigger is resolved and reconnections stop.

A spike in current alone can be normal pool warmup. A sustained delta in totalCreated means the server is burning resources on thread lifecycle overhead. Treat totalCreated churn rate as the primary diagnostic signal.
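
The distinction between churn and growth can be sketched as a small helper. This is a minimal illustration, not part of MongoDB or any driver: the function name `classifyConnectionChurn`, the 10% "flat" tolerance, and the verdict labels are assumptions made for the example. The 100-per-minute limit is borrowed from the warning sign used later in this guide.

```javascript
// Hypothetical helper: compare two samples of serverStatus().connections.
// Thresholds are illustrative defaults, not MongoDB settings.
function classifyConnectionChurn(first, second, elapsedSeconds, churnPerMinuteLimit = 100, flatTolerance = 0.1) {
  // Connections created per minute between the two samples.
  const createdPerMinute = ((second.totalCreated - first.totalCreated) * 60) / elapsedSeconds;
  // Relative change in the open-connection count over the same interval.
  const currentDrift = Math.abs(second.current - first.current) / Math.max(first.current, 1);

  let verdict = "steady";
  if (createdPerMinute > churnPerMinuteLimit) {
    // Many new connections while `current` barely moves: connections are being
    // opened and closed in a loop (storm). Otherwise the pool is simply growing.
    verdict = currentDrift <= flatTolerance ? "storm" : "growth";
  }
  return { createdPerMinute, currentDrift, verdict };
}

// Example with made-up samples taken 30 seconds apart.
const sampleA = { current: 3900, totalCreated: 120000 };
const sampleB = { current: 4000, totalCreated: 126000 };
console.log(classifyConnectionChurn(sampleA, sampleB, 30).verdict);
```

In mongosh, the two input objects could be `db.serverStatus().connections` captured 30 seconds apart.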

Common causes

Cause	What it looks like	First thing to check
Primary election or failover	Connection spike seconds after a new primary is elected; `rs.status()` shows a recent `electionDate`	`rs.status()` member states and MongoDB logs for `"Starting an election"`
Rolling application restart or deploy	Connections spike from many new application processes simultaneously; client IPs are distributed uniformly across the fleet	Application deployment timestamps and process start events
Network blip or DNS failure	Brief drop in network throughput followed by a flood; driver logs may show pool cleared events	Network latency and DNS resolution health between apps and database
Load balancer health check failure	Regular cadence of connection spikes matching the LB probe interval; many connections from the LB IP range	LB health check configuration and probe logs

Quick checks

Run these read-only commands to confirm the storm and assess severity.

// Check connection count, available slots, and total churn
var c = db.serverStatus().connections;
print("Current: " + c.current + "  Available: " + c.available + "  Total created: " + c.totalCreated);

// Check WiredTiger ticket availability (MongoDB <=7.x)
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read tickets available: " + t.read.available + " / " + t.read.totalTickets);
print("Write tickets available: " + t.write.available + " / " + t.write.totalTickets);

// Check queue depth
var q = db.serverStatus().globalLock.currentQueue;
print("Total queued: " + q.total + "  Readers: " + q.readers + "  Writers: " + q.writers);

// Identify heaviest client sources among active operations
db.currentOp({ active: true }).inprog.forEach(function(op) {
  print((op.client || "internal") + " | " + op.op + " | " + (op.secs_running || 0) + "s");
});

# Check OS-level RSS in MB (adjust if multiple mongod processes exist)
ps -o rss= -p $(pgrep -x mongod) | awk '{print "RSS: " $1/1024 " MB"}'

# Check for recent elections or stepdowns in the log
grep -iE "Starting an election|stepping down" /var/log/mongodb/mongod.log | tail -10

// Check server-side average latency
var lat = db.serverStatus().opLatencies;
var rOps = lat.reads.ops;
var wOps = lat.writes.ops;
print("Read avg ms: " + (rOps ? (lat.reads.latency / rOps / 1000).toFixed(2) : "N/A"));
print("Write avg ms: " + (wOps ? (lat.writes.latency / wOps / 1000).toFixed(2) : "N/A"));

// Check if throughput has collapsed
db.serverStatus().opcounters

How to diagnose it

1. Confirm the trigger. Check rs.status() for a recent election, application deployment logs for a restart, or network metrics for a blip. The spiral almost always has an identifiable trigger within the preceding one to two minutes.
2. Measure churn, not just count. Sample totalCreated twice, 30 seconds apart. A large delta with a stable or slowly changing current confirms a storm rather than legitimate growth.
3. Correlate memory with connections. Compare db.serverStatus().mem.resident against the expected baseline of WiredTiger cache + (current connections × ~1 MB) + internal overhead. If RSS is significantly higher, thread stacks are the likely cause.
4. Identify the top client sources. Use db.currentOp() to see which hosts are driving the most load. In a storm, you will see a broad distribution across your application fleet rather than a single misbehaving host.
5. Check ticket exhaustion. If available read or write tickets have dropped below 25% of total, the storage engine is saturated and operations are queuing. This is what turns a reconnect burst into a latency spiral.
6. Assess memory exhaustion risk. If RSS is within 1 GB of total system memory or the available connection count is approaching zero, the node is at risk of OOM or file descriptor exhaustion.
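
The memory baseline and the 1 GB exhaustion margin from the steps above can be worked through as plain arithmetic. The sketch below is illustrative only: the function names and the 1024 MB default for "internal overhead" are assumptions, and the ~1 MB per connection figure is the rough estimate used in this guide rather than a measured value.

```javascript
const BYTES_PER_MB = 1024 * 1024;

// Hypothetical helper: expected resident memory in MB.
// overheadMB is an assumed placeholder; measure your own baseline.
function estimateMemoryBaselineMB(cacheBytes, currentConnections, stackMBPerConnection = 1, overheadMB = 1024) {
  return cacheBytes / BYTES_PER_MB + currentConnections * stackMBPerConnection + overheadMB;
}

// Hypothetical helper: compare observed RSS against the baseline and the host total.
function assessMemory(residentMB, baselineMB, systemMemoryMB, marginMB = 1024) {
  return {
    excessMB: residentMB - baselineMB,                  // RSS above the expected baseline
    oomRisk: systemMemoryMB - residentMB <= marginMB,   // within the margin of total memory
  };
}

// Example with made-up numbers: 8 GiB cache, 4,000 connections, 16 GiB host.
const baselineMB = estimateMemoryBaselineMB(8 * 1024 * BYTES_PER_MB, 4000);
console.log(baselineMB, assessMemory(15500, baselineMB, 16384));
```

In mongosh, `residentMB` would typically come from `db.serverStatus().mem.resident` (reported in MiB) and `cacheBytes` from `db.serverStatus().wiredTiger.cache["maximum bytes configured"]`.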

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Connection `totalCreated` churn rate	Thread creation and destruction consume CPU and memory; high churn with stable `current` is the defining storm signal	Delta > 100 per minute while `current` is flat
`connections.current`	Each connection allocates a thread and stack memory	Rapid 10x increase within seconds
Memory RSS	Thread stacks drive resident memory upward during a storm	RSS growth correlates with connection spike and exceeds expected baseline
WiredTiger ticket availability	More connections mean more operations competing for storage engine admission	Available read or write tickets drop below 25% of total
`globalLock.currentQueue` depth	Operations queue when tickets or locks are scarce	Sustained total queue > 20
`opLatencies` reads and writes	Server-side latency degradation from contention	Average latency doubles from baseline for > 5 minutes
`opcounters` throughput	Ticket exhaustion and thread overhead cause throughput collapse	Sustained drop > 50% from baseline
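
Several of the warning signs in the table can be checked mechanically against a baseline. The following is a rough sketch under stated assumptions: the function name `evaluateSignals`, the shape of the `snapshot` and `baseline` objects, and the warning labels are all hypothetical. A single snapshot also cannot show whether a condition is sustained, so a real check would evaluate several consecutive samples.

```javascript
// Hypothetical helper: apply the table's thresholds to one point-in-time snapshot.
// snapshot.tickets mirrors serverStatus().wiredTiger.concurrentTransactions.
function evaluateSignals(snapshot, baseline) {
  const warnings = [];

  for (const kind of ["read", "write"]) {
    const t = snapshot.tickets[kind];
    // Below 25% of total tickets available means the storage engine is saturating.
    if (t.totalTickets > 0 && t.available / t.totalTickets < 0.25) {
      warnings.push(kind + "-tickets-low");
    }
  }
  // Queue depth above 20 operations.
  if (snapshot.queueTotal > 20) warnings.push("queue-deep");
  // Average latency at least double the baseline.
  if (snapshot.avgLatencyMs >= 2 * baseline.avgLatencyMs) warnings.push("latency-doubled");
  // Throughput more than 50% below the baseline.
  if (snapshot.opsPerSecond < 0.5 * baseline.opsPerSecond) warnings.push("throughput-collapsed");

  return warnings;
}

// Example with made-up values.
const snapshot = {
  tickets: { read: { available: 10, totalTickets: 128 }, write: { available: 100, totalTickets: 128 } },
  queueTotal: 45,
  avgLatencyMs: 8,
  opsPerSecond: 900,
};
console.log(evaluateSignals(snapshot, { avgLatencyMs: 3, opsPerSecond: 2000 }));
```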

Fixes

Immediate containment

The fastest way to stop the node from dying is to cap new connections dynamically without a restart:

db.adminCommand({setParameter: 1, maxIncomingConnections: <value>})

Choose a value just below connections.current to force immediate rejections. Rejected connections appear in application logs as connection failures, but this breaks the feedback loop before OOM.

Persist the change in mongod.conf after the incident. Do not restart mongod to apply this during a storm. Avoid reactive rs.stepDown() calls; an election invalidates connections and can worsen the flood.

Identify the heaviest client sources with db.currentOp() and coordinate with application owners to slow or pause restarts until the topology is stable.
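
One way to rank client sources is to tally active operations per host. The sketch below is a minimal illustration: `topClientHosts` is a hypothetical helper name, and it assumes the `client` field holds an `ip:port` string, as in the quick check earlier in this guide. Operations without a `client` field are grouped as "internal".

```javascript
// Hypothetical helper: count operations per client host from a currentOp `inprog` array.
function topClientHosts(inprog, limit = 5) {
  const counts = new Map();
  for (const op of inprog) {
    // Strip the trailing ":port" so connections from one host are grouped together.
    const host = op.client ? op.client.replace(/:\d+$/, "") : "internal";
    counts.set(host, (counts.get(host) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])      // busiest hosts first
    .slice(0, limit)
    .map(([host, ops]) => ({ host, ops }));
}

// Example with a made-up inprog array.
const exampleOps = [
  { client: "10.0.0.1:50000" },
  { client: "10.0.0.1:50001" },
  { client: "10.0.0.2:60000" },
  {},
];
console.log(topClientHosts(exampleOps));
```

In mongosh, the input could be `db.currentOp({ active: true }).inprog`. A flat ranking with many hosts at similar counts is consistent with a fleet-wide storm; one dominant host points to a single misbehaving client instead.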

Stabilize the topology

If the trigger was an election, determine why the primary stepped down. Check for heartbeat timeouts, disk stalls, or memory pressure that caused the original instability. Riding out the storm without fixing the root cause of the election will simply restart the spiral on the next event.

Throttle client reconnect behavior

After the storm subsides, review application driver configuration. Large connection pool sizes multiply the impact of any trigger. Ensure that the total possible connections across all application instances leave substantial headroom below the server limit. A common target is to operate at less than 50% of maxIncomingConnections during normal load, leaving capacity for reconnection bursts.
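
The headroom target can be checked with simple arithmetic. This is a sketch under assumptions: `poolHeadroom` is a hypothetical helper, it uses the worst case of every pool being full as a conservative stand-in for normal load, and it ignores the extra monitoring connections that drivers open per server. `maxPoolSize` refers to the driver's pool size setting.

```javascript
// Hypothetical helper: compare worst-case fleet connections to a utilization budget.
function poolHeadroom(instances, maxPoolSize, maxIncomingConnections, targetUtilization = 0.5) {
  const worstCase = instances * maxPoolSize;                            // every pool completely full
  const budget = Math.floor(maxIncomingConnections * targetUtilization); // connections allowed at target
  return {
    worstCase,
    budget,
    withinBudget: worstCase <= budget,
    // Largest per-instance pool size that keeps the fleet inside the budget.
    suggestedMaxPoolSize: Math.floor(budget / instances),
  };
}

// Example with made-up numbers: 60 instances, pool size 100, server limit 8,000.
console.log(poolHeadroom(60, 100, 8000));
```

With these example numbers the fleet could open 6,000 connections against a budget of 4,000, so the pool size per instance would need to drop to 66 or the server limit would need to rise.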

Prevention

Monitor totalCreated delta, not just current. Churn is the leading indicator.
Keep normal connection utilization below 50% of the configured maximum. This leaves headroom for reconnection floods.
Track WiredTiger ticket availability continuously. Ticket exhaustion is what turns a reconnect burst into a spiral.
Ensure replica set elections are rare outside of maintenance. Frequent elections indicate network instability, storage latency, or misconfigured electionTimeoutMillis.
Review application logs for connectionPoolCleared events after any deploy or failover. These indicate driver-level reconnection behavior that may need tuning.

How Netdata helps

Netdata samples serverStatus every second. A totalCreated delta that dwarfs baseline churn is visible immediately.
A single dashboard correlates mongodb.connections_current, system mem.rss, and mongodb.global_lock_current_queue_total, confirming the spiral in seconds.
Ticket availability is trended automatically, so you can see when a reconnect burst crosses into storage engine saturation.
Alerts on connection spikes coupled with rising queue depth or falling ticket availability reduce false positives from benign pool resizing.

