MongoDB connection churn: high totalCreated rate and thread creation overhead
db.serverStatus().connections can show low current and a rapidly climbing totalCreated. That mismatch is connection churn: connections open and close rapidly instead of being reused. MongoDB uses a thread-per-connection model, so each cycle costs roughly a megabyte of thread stack, scheduling overhead, and file descriptor work. The result is rising RSS, CPU contention, and latency spikes that do not correlate with the active connection count.
For the broader mental model, see How MongoDB actually works in production: a mental model for operators. For the cascade after a failover, see MongoDB connection storm spiral: reconnection floods after an election or deploy.
What this means
db.serverStatus().connections reports three values:
- current: connections open right now.
- available: connection slots remaining before the server limit.
- totalCreated: cumulative connections created since the mongod process started.
A high delta on totalCreated while current stays flat means clients are not holding connections. They open, authenticate, possibly run one or a few operations, close, and repeat. On a thread-per-connection deployment, 500 connections created and destroyed 100 times per minute generates memory and scheduler pressure even though the active count never exceeds 500.
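The arithmetic behind that example can be sketched in a few lines. The function name threadCreations is hypothetical, and the inputs are the illustrative figures from the paragraph above, not measurements.

```javascript
// Hypothetical helper: how many connection threads the server must create
// when a fixed set of connections is recycled repeatedly.
function threadCreations(openConnections, cyclesPerMinute) {
  const perMinute = openConnections * cyclesPerMinute;
  return { perMinute, perSecond: perMinute / 60 };
}

// 500 connections, each closed and reopened 100 times per minute.
const rate = threadCreations(500, 100);
console.log(rate.perMinute);             // thread creations per minute
console.log(rate.perSecond.toFixed(1));  // thread creations per second
```

With these inputs the server creates 50,000 threads per minute while current never rises above 500, which is why current alone hides the problem.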
Churn is also a leading indicator of a connection storm spiral. Once latency rises, applications time out and reconnect more aggressively, which raises churn further. Catching the high totalCreated rate early stops the feedback loop before memory or ticket exhaustion forces an outage.
flowchart TD
A[Stable current connections] --> B[Rising totalCreated delta]
B --> C[Connection churn]
C --> D[Thread create/destroy]
D --> E[Memory RSS growth]
D --> F[CPU scheduling overhead]
F --> G[Operation latency spikes]
E --> H[Ticket contention]
H --> G
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Client created per request, common in FaaS/serverless handlers | totalCreated spikes with each request wave; current stays flat; many short-lived source IPs | Application logs and db.currentOp() grouped by client |
| Driver pool too large or idle timeout too aggressive | Rapid open/close cycles; driver pool metrics show high creation | Driver pool settings for size and idle behavior |
| Reconnect storm after election, deploy, or network blip | totalCreated surges after a topology event; correlates with election log entries | MongoDB logs for "Starting an election" or "Stepping down"; rs.status() |
| Monitoring or scraping tools opening a fresh connection per check | Steady, low-rate churn from a small set of hosts | Source IPs in currentOp; monitoring agent configuration |
| Load balancer or proxy health checks resetting TCP | Repeated short-lived connections from the LB IP; duration is seconds | ss output sorted by source IP and state |
Quick checks
Run these in order. All are read-only except where noted.
# Check current, available, active, and totalCreated
mongosh --quiet --eval 'JSON.stringify(db.serverStatus().connections)'
// Compute totalCreated delta over 60 seconds
var first = db.serverStatus().connections;
sleep(60000);
var second = db.serverStatus().connections;
print("current: " + second.current);
print("totalCreated delta / min: " + (second.totalCreated - first.totalCreated));
print("active: " + (second.active !== undefined ? second.active : "N/A"));
// Active vs current ratio and utilization against the server limit
var c = db.serverStatus().connections;
var util = 100 * c.current / (c.current + c.available);
print("utilization: " + util.toFixed(1) + "%");
print("active/current: " + (c.active !== undefined ? (c.active / c.current).toFixed(2) : "N/A"));
// Group active operations by client IP to find churn sources
var counts = {};
db.currentOp({ active: true }).inprog.forEach(function(op) {
var ip = (op.client || "unknown").split(":")[0];
counts[ip] = (counts[ip] || 0) + 1;
});
printjson(counts);
# Look for recent elections, connection errors, or resets in the logs
grep -iE "Starting an election|Stepping down|connection refused|error accepting" /var/log/mongodb/mongod.log | tail -20
# Compare open file descriptors to the process soft and hard limits (assumes one mongod)
PID=$(pgrep -x mongod)
ls /proc/$PID/fd | wc -l
grep "Max open files" /proc/$PID/limits
# Show established connections by source IP to spot repeat short-lived clients
# Strips the last :port; assumes IPv4 source addresses
ss -tnp | awk 'NR>1 {print $5}' | sed 's/:[^:]*$//' | sort | uniq -c | sort -rn | head
How to diagnose it
1. Confirm churn, not growth. Sample totalCreated twice over 60 seconds. If the delta is high while current is stable or only slightly changed, you have churn rather than legitimate pool growth.
2. Correlate with a trigger. Check MongoDB logs for elections, stepdowns, network errors, or application deployments. Churn that starts within seconds of an election points to a reconnect storm. Churn that tracks application request rate points to per-request client creation.
3. Identify the source hosts. Use db.currentOp() grouped by client to find which application instances or middleware are holding many short-lived connections. If the same IP appears repeatedly with new connection ports, that host is the culprit.
4. Check driver and application behavior. Verify whether the application creates a new MongoClient per request or per handler invocation. Verify whether monitoring agents authenticate on every scrape. Verify whether load balancer health checks open a new TCP connection each time.
5. Quantify impact. Correlate the churn window with:
   - mem.resident growth that outpaces your baseline plus WiredTiger cache size and connection overhead (~1MB per current connection).
   - opLatencies tail latency rising.
   - globalLock.currentQueue or wiredTiger.concurrentTransactions available tickets dropping.
   - File descriptor usage climbing toward ulimit -n.
6. Classify the root cause. Use the common causes table to decide whether the fix belongs in application code, driver configuration, infrastructure health checks, or the MongoDB network topology.
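The first step, telling churn apart from growth, can be expressed as a small classifier over two samples of db.serverStatus().connections. This is a minimal sketch: the function name, the two thresholds, and the sample numbers are assumptions to tune against your own baseline, not values from MongoDB.

```javascript
// Hypothetical thresholds; adjust them to your workload.
const CHURN_TURNOVER_PER_MIN = 1.0; // open set replaced at least once a minute
const GROWTH_FRACTION = 0.2;        // current grew by more than 20%

// first and second are { current, totalCreated } samples taken
// intervalSeconds apart.
function classifySamples(first, second, intervalSeconds) {
  const delta = second.totalCreated - first.totalCreated;
  // totalCreated counts from process start, so a drop means mongod restarted.
  if (delta < 0) return "restarted";
  const perMinute = (delta * 60) / intervalSeconds;
  const growth = (second.current - first.current) / Math.max(first.current, 1);
  if (growth > GROWTH_FRACTION) return "growth";
  const turnover = perMinute / Math.max(second.current, 1);
  return turnover >= CHURN_TURNOVER_PER_MIN ? "churn" : "stable";
}

// Illustrative samples, not real measurements.
const verdict = classifySamples(
  { current: 500, totalCreated: 1000 },
  { current: 505, totalCreated: 9000 },
  60
);
console.log(verdict);
```

A "growth" verdict suggests checking pool sizing and traffic first, while "churn" points to the source-host and driver checks in steps 3 and 4.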
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| connections.totalCreated delta | Direct measure of churn; more informative than current alone | Sustained increase, or high delta with flat current |
| connections.active / connections.current | Shows how many open connections are actually doing work | Ratio stays low while current is high; many idle connections |
| mem.resident | Each connection costs ~1MB of thread stack; churn drives RSS growth | RSS grows disproportionately to cache size and connection count |
| opLatencies reads and writes | User-visible latency impact | p99 sustained >2x baseline |
| globalLock.currentQueue | Operations queuing behind contention | Sustained total >20 |
| wiredTiger.concurrentTransactions.available | Ticket exhaustion from thread overhead | Read or write available tickets <25% of total |
| File descriptor utilization | Hard ceiling before connection rejections | >80% of ulimit -n |
| Election events | Common trigger for churn spikes | More than 1 per hour outside maintenance windows |
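The numeric warning signs in the table can be checked mechanically. The sketch below copies the thresholds from the table; the function name and the field names on the input object are hypothetical and would be filled in by whatever collector you use. A single snapshot cannot show that a condition is sustained, so treat a hit as a prompt to look at the trend.

```javascript
// Thresholds come from the table above; input field names are assumptions.
function warningSigns(m) {
  const warnings = [];
  if (m.p99LatencyMs > 2 * m.baselineP99LatencyMs) {
    warnings.push("opLatencies p99 above 2x baseline");
  }
  if (m.queueTotal > 20) {
    warnings.push("globalLock.currentQueue total above 20");
  }
  if (m.readTicketsAvailable < 0.25 * m.readTicketsTotal) {
    warnings.push("read tickets below 25% of total");
  }
  if (m.writeTicketsAvailable < 0.25 * m.writeTicketsTotal) {
    warnings.push("write tickets below 25% of total");
  }
  if (m.openFds > 0.8 * m.fdLimit) {
    warnings.push("file descriptors above 80% of limit");
  }
  if (m.electionsLastHour > 1) {
    warnings.push("more than 1 election in the last hour");
  }
  return warnings;
}

// Illustrative snapshot, not real data.
const signs = warningSigns({
  p99LatencyMs: 50, baselineP99LatencyMs: 20,
  queueTotal: 5,
  readTicketsAvailable: 100, readTicketsTotal: 128,
  writeTicketsAvailable: 10, writeTicketsTotal: 128,
  openFds: 900, fdLimit: 1024,
  electionsLastHour: 0,
});
console.log(signs);
```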
Fixes
Application creates a client per request
The fix is to reuse one MongoClient instance per application process. Creating a client per request, per FaaS invocation, or per HTTP request forces a full TCP handshake, authentication, and potentially a topology discovery cycle every time. Cache the client at module or process scope and share it across requests. This is the single most effective fix for churn.
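A minimal sketch of the caching pattern follows. The factory is injected so the example runs without a driver installed; in a Node.js service it might be something like () => new MongoClient(uri) from the official mongodb package, which is an assumption about your stack. The names makeClientProvider and getClient are hypothetical.

```javascript
// Process-wide cache for one client. createClient runs at most once.
function makeClientProvider(createClient) {
  let client = null;
  return function getClient() {
    if (client === null) {
      client = createClient(); // once per process, not once per request
    }
    return client;
  };
}

// Stand-in factory that counts how many clients get constructed.
let constructed = 0;
const getClient = makeClientProvider(() => ({ id: ++constructed }));

// Simulate three request handlers asking for a client.
const seen = [getClient(), getClient(), getClient()];
console.log(constructed); // prints 1
```

In a serverless handler, the equivalent is to create the provider at module scope, outside the handler function, so warm invocations reuse the same client.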
Driver pool sizing or idle behavior
If the driver pool is oversized or its idle timeout is aggressive, connections open and close unnecessarily. Reduce the maximum pool size to match actual concurrency, and set an idle timeout longer than typical request inter-arrival times so normal traffic keeps the pool warm.
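As an illustration, pool behavior can be pinned explicitly in the connection string. maxPoolSize, minPoolSize, and maxIdleTimeMS are standard MongoDB connection string options; the helper name, the host, and every numeric value below are placeholders rather than recommendations.

```javascript
// Append explicit pool settings to a connection string.
function withPoolOptions(baseUri, { maxPoolSize, minPoolSize, maxIdleTimeMS }) {
  const params = new URLSearchParams({
    maxPoolSize: String(maxPoolSize),
    minPoolSize: String(minPoolSize),
    maxIdleTimeMS: String(maxIdleTimeMS),
  });
  const separator = baseUri.includes("?") ? "&" : "?";
  return baseUri + separator + params.toString();
}

// Placeholder host and values: a pool capped near real concurrency, a small
// warm minimum, and an idle timeout well above the gap between requests.
const uri = withPoolOptions("mongodb://db.example.internal:27017/app", {
  maxPoolSize: 20,
  minPoolSize: 5,
  maxIdleTimeMS: 300000,
});
console.log(uri);
```

Check your driver's documentation for its defaults before changing these, since defaults differ between drivers and versions.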
Reconnect storm after a topology event
If churn followed an election or network blip, stabilize the cluster first:
- Check rs.status() for flapping member states.
- Review application retry configuration so clients back off rather than reconnect immediately.
- If connections are approaching the limit and memory is climbing, you can temporarily lower net.maxIncomingConnections so MongoDB rejects new connections cleanly rather than accepting them and crashing from OOM. Warning: this is disruptive and will reject client connections. Coordinate with application owners before applying.
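One common way to make application-level retries back off is exponential backoff with full jitter. The sketch below illustrates the technique and is not a driver feature; the function name, the 100 ms base, and the 30 s cap are hypothetical defaults.

```javascript
// Exponential backoff with full jitter: the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)). random is injectable for testing.
function backoffDelayMs(attempt, baseMs = 100, capMs = 30000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Delay ceilings double with each failed attempt until they reach the cap.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(attempt, backoffDelayMs(attempt));
}
```

The jitter matters as much as the exponent: without it, every client that lost its connection in the same election retries at the same instants and recreates the flood.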
Monitoring or load balancer churn
If health checks or monitoring scrapers are the source, reconfigure them to use persistent connections or reduce their frequency. Ensure health checks do not perform an expensive handshake on every TCP open. If a proxy sits between the application and MongoDB, verify its idle timeout is not shorter than the driver’s, which causes the proxy to sever connections the driver still considers valid.
OS file descriptor limits
If churn is combined with high connection counts, check that ulimit -n and the systemd LimitNOFILE setting give MongoDB enough descriptors. Verify the actual process limit in /proc/$PID/limits (where PID is your mongod), because systemd unit files often override shell ulimit.
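If you want to script that check, the "Max open files" row of /proc/$PID/limits can be parsed as below. This is a sketch for Linux hosts; the function name is hypothetical and the sample text is an illustrative excerpt, not output from a real server.

```javascript
// Parse the "Max open files" row of /proc/<pid>/limits.
// Columns are: limit name, soft limit, hard limit, units.
function parseMaxOpenFiles(limitsText) {
  const name = "Max open files";
  const line = limitsText.split("\n").find((l) => l.startsWith(name));
  if (!line) return null;
  const [soft, hard] = line.slice(name.length).trim().split(/\s+/);
  const toNumber = (v) => (v === "unlimited" ? Infinity : Number(v));
  return { soft: toNumber(soft), hard: toNumber(hard) };
}

// Illustrative excerpt; on a real host, read the file with
// fs.readFileSync(`/proc/${pid}/limits`, "utf8").
const sampleLimits = [
  "Limit                     Soft Limit           Hard Limit           Units",
  "Max open files            64000                64000                files",
].join("\n");

const limits = parseMaxOpenFiles(sampleLimits);
console.log(limits);
```

The soft limit is the one the running process actually hits, so compare the open descriptor count against that value.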
Prevention
- Alert on totalCreated delta, not just current or available.
- Track active / current so idle connections do not hide in the totals.
- Enforce a single MongoClient per application process.
- Size driver pools to real concurrency and avoid idle timeouts shorter than your traffic cadence.
- Test failover behavior under load to confirm clients back off instead of reconnecting as a thundering herd.
- Review infrastructure health checks quarterly to ensure they do not open fresh MongoDB connections per probe.
- Keep connection headroom: operate below 50% of the effective connection limit so a reconnect storm does not immediately hit the ceiling.
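The headroom and active / current items can be computed from a single db.serverStatus().connections document. In this sketch the effective limit is taken as current + available, as in the quick checks; the function name and the sample values are hypothetical.

```javascript
// conn is a db.serverStatus().connections document.
function headroom(conn, maxUtilization = 0.5) {
  const limit = conn.current + conn.available; // effective connection limit
  const utilization = limit > 0 ? conn.current / limit : 0;
  // active is not reported by every server version, so it may be missing.
  const activeRatio =
    conn.active !== undefined && conn.current > 0
      ? conn.active / conn.current
      : null;
  return { limit, utilization, activeRatio, ok: utilization < maxUtilization };
}

// Illustrative values, not real measurements.
const report = headroom({ current: 400, available: 600, active: 40 });
console.log(report);
```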
How Netdata helps
- Surfaces mongodb.connections_totalCreated as a rate without manual sampling.
- Correlates churn with mongodb.memory_resident, CPU, and mongodb.globalLock_currentQueue on the same timeline.
- Tracks mongodb.wiredTiger_concurrentTransactions_available to expose whether churn is translating into ticket contention.
- Thresholds on total-created rate and RSS growth catch this failure mode before connection count alarms fire.
- Per-second resolution catches short churn bursts that one-minute averages miss.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB connection storm spiral: reconnection floods after an election or deploy
- MongoDB flow control throttling writes: when the primary slows itself down
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early
- MongoDB monitoring checklist: the signals every production cluster needs
- MongoDB monitoring maturity model: from survival to expert







