MongoDB connection churn: high totalCreated rate and thread creation overhead

db.serverStatus().connections can show low current and a rapidly climbing totalCreated. That mismatch is connection churn: connections open and close rapidly instead of being reused. MongoDB uses a thread-per-connection model, so each cycle costs roughly a megabyte of thread stack, scheduling overhead, and file descriptor work. The result is rising RSS, CPU contention, and latency spikes that do not correlate with the active connection count.

For the broader mental model, see How MongoDB actually works in production: a mental model for operators. For the cascade after a failover, see MongoDB connection storm spiral: reconnection floods after an election or deploy.

What this means

db.serverStatus().connections reports three values:

  • current: connections open right now.
  • available: connection slots remaining before the server limit.
  • totalCreated: cumulative connections created since the mongod process started.

A high delta on totalCreated while current stays flat means clients are not holding connections. They open, authenticate, possibly run one or a few operations, close, and repeat. On a thread-per-connection deployment, 500 connections created and destroyed 100 times per minute generates memory and scheduler pressure even though the active count never exceeds 500.
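
To put a number on that example, here is a minimal arithmetic sketch. The function name is hypothetical, and the figures are the ones from the paragraph above.

```javascript
// Hypothetical helper: how many connection threads the server has to create
// per second when `poolSize` connections are each recycled
// `cyclesPerMinute` times per minute.
function threadCreationsPerSecond(poolSize, cyclesPerMinute) {
  return (poolSize * cyclesPerMinute) / 60;
}

// Figures from the example above: 500 connections, 100 cycles per minute.
const perSecond = threadCreationsPerSecond(500, 100);
console.log(Math.round(perSecond) + " thread creations per second");
```

The active count is still capped at 500, but the server is creating and tearing down several hundred threads every second, which is the work that current alone does not show.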

Churn is also a leading indicator of a connection storm spiral. Once latency rises, applications time out and reconnect more aggressively, which raises churn further. Catching a high totalCreated rate early lets you break the feedback loop before memory or ticket exhaustion forces an outage.

flowchart TD
    A[Stable current connections] --> B[Rising totalCreated delta]
    B --> C[Connection churn]
    C --> D[Thread create/destroy]
    D --> E[Memory RSS growth]
    D --> F[CPU scheduling overhead]
    F --> G[Operation latency spikes]
    E --> H[Ticket contention]
    H --> G

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Client created per request, common in FaaS/serverless handlers | totalCreated spikes with each request wave; current stays flat; many short-lived source IPs | Application logs and db.currentOp() grouped by client |
| Driver pool too large or idle timeout too aggressive | Rapid open/close cycles; driver pool metrics show high creation | Driver pool settings for size and idle behavior |
| Reconnect storm after election, deploy, or network blip | totalCreated surges after a topology event; correlates with election log entries | MongoDB logs for "Starting an election" or "Stepping down"; rs.status() |
| Monitoring or scraping tools opening a fresh connection per check | Steady, low-rate churn from a small set of hosts | Source IPs in currentOp; monitoring agent configuration |
| Load balancer or proxy health checks resetting TCP | Repeated short-lived connections from the LB IP; duration is seconds | ss output sorted by source IP and state |

Quick checks

Run these in order. All are read-only. Lines starting with # are shell commands; lines starting with // run inside mongosh.

# Check current, available, active, and totalCreated
mongosh --quiet --eval 'JSON.stringify(db.serverStatus().connections)'

// Compute totalCreated delta over 60 seconds
var first = db.serverStatus().connections;
sleep(60000);
var second = db.serverStatus().connections;
print("current: " + second.current);
print("totalCreated delta / min: " + (second.totalCreated - first.totalCreated));
// Test for undefined so that a legitimate value of 0 is not reported as N/A
print("active: " + (second.active !== undefined ? second.active : "N/A"));

// Active vs current ratio and utilization against the server limit
var c = db.serverStatus().connections;
var util = 100 * c.current / (c.current + c.available);
print("utilization: " + util.toFixed(1) + "%");
print("active/current: " + (c.active !== undefined ? (c.active / c.current).toFixed(2) : "N/A"));

// Group active operations by client IP to find churn sources
// Splits on the first colon; assumes IPv4 client addresses
var counts = {};
db.currentOp({ active: true }).inprog.forEach(function(op) {
  var ip = (op.client || "unknown").split(":")[0];
  counts[ip] = (counts[ip] || 0) + 1;
});
printjson(counts);

# Look for recent elections, connection errors, or resets in the logs
grep -iE "Starting an election|Stepping down|connection refused|error accepting" /var/log/mongodb/mongod.log | tail -20

# Compare open file descriptors to the process limit (assumes one mongod)
PID=$(pgrep -x mongod)
ls /proc/$PID/fd | wc -l
grep "Max open files" /proc/$PID/limits

# Show established connections by source IP to spot repeat short-lived clients
# Strips the last :port; assumes IPv4 source addresses
ss -tnp | awk 'NR>1 {print $5}' | sed 's/:[^:]*$//' | sort | uniq -c | sort -rn | head

How to diagnose it

  1. Confirm churn, not growth. Sample totalCreated twice over 60 seconds. If the delta is high while current is stable or only slightly changed, you have churn rather than legitimate pool growth.

  2. Correlate with a trigger. Check MongoDB logs for elections, stepdowns, network errors, or application deployments. Churn that starts within seconds of an election points to a reconnect storm. Churn that tracks application request rate points to per-request client creation.

  3. Identify the source hosts. Use db.currentOp() grouped by client to find which application instances or middleware are holding many short-lived connections. If the same IP appears repeatedly with new connection ports, that host is the culprit.

  4. Check driver and application behavior. Verify whether the application creates a new MongoClient per request or per handler invocation. Verify whether monitoring agents authenticate on every scrape. Verify whether load balancer health checks open a new TCP connection each time.

  5. Quantify impact. Correlate the churn window with:

    • mem.resident growth that outpaces your baseline plus WiredTiger cache size and connection overhead (~1MB per current connection).
    • opLatencies tail latency rising.
    • globalLock.currentQueue rising, or wiredTiger.concurrentTransactions available tickets dropping.
    • File descriptor usage climbing toward ulimit -n.

  6. Classify the root cause. Use the common causes table to decide whether the fix belongs in application code, driver configuration, infrastructure health checks, or the MongoDB network topology.
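
Step 3 can be sketched as a small helper that counts how many distinct source ports each host used across several samples. This is an illustrative sketch, not part of any MongoDB tooling: the function name and the sample data are made up, and it assumes each entry is an "ip:port" string like the client field reported by db.currentOp().

```javascript
// Hypothetical helper: given several samples of client strings ("ip:port"),
// count how many distinct source ports each host used across all samples.
function distinctPortsByHost(samples) {
  const portsByHost = new Map();
  for (const sample of samples) {
    for (const client of sample) {
      const idx = client.lastIndexOf(":");
      if (idx < 0) continue; // skip entries without a port
      const host = client.slice(0, idx);
      const port = client.slice(idx + 1);
      if (!portsByHost.has(host)) portsByHost.set(host, new Set());
      portsByHost.get(host).add(port);
    }
  }
  // Sort hosts by distinct port count, highest first.
  return [...portsByHost.entries()]
    .map(([host, ports]) => ({ host, distinctPorts: ports.size }))
    .sort((a, b) => b.distinctPorts - a.distinctPorts);
}

// Made-up data: 10.0.0.5 shows up on a new port in every sample,
// while 10.0.0.9 keeps one long-lived connection.
const portSamples = [
  ["10.0.0.5:50001", "10.0.0.9:41000"],
  ["10.0.0.5:50017", "10.0.0.9:41000"],
  ["10.0.0.5:50033", "10.0.0.9:41000"],
];
console.log(distinctPortsByHost(portSamples));
```

In mongosh, each sample could come from something like db.currentOp(true).inprog.map(op => op.client).filter(Boolean), where passing true also includes idle connections. A host whose distinct port count grows with every sample is reconnecting rather than reusing connections.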

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| connections.totalCreated delta | Direct measure of churn; more informative than current alone | Sustained increase, or high delta with flat current |
| connections.active / connections.current | Shows how many open connections are actually doing work | Ratio stays low while current is high; many idle connections |
| mem.resident | Each connection costs ~1MB of thread stack; churn drives RSS growth | RSS grows disproportionately to cache size and connection count |
| opLatencies reads and writes | User-visible latency impact | p99 sustained >2x baseline |
| globalLock.currentQueue | Operations queuing behind contention | Sustained total >20 |
| wiredTiger.concurrentTransactions.available | Ticket exhaustion from thread overhead | Read or write available tickets <25% of total |
| File descriptor utilization | Hard ceiling before connection rejections | >80% of ulimit -n |
| Election events | Common trigger for churn spikes | More than 1 per hour outside maintenance windows |
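
Two of the thresholds above are simple enough to check in code. The sketch below is hypothetical: the function name is invented, the sample document is made up, and it assumes a server version whose db.serverStatus() output includes globalLock.currentQueue.total and wiredTiger.concurrentTransactions with available and totalTickets fields. A single sample cannot show that a condition is sustained, so treat any hit as a prompt to look at the trend.

```javascript
// Hypothetical checker for the queue and ticket thresholds in the table.
// `status` is assumed to have the shape of db.serverStatus() output.
function warningSigns(status) {
  const warnings = [];

  // Operations queued behind contention: warn above 20.
  const queued = status.globalLock.currentQueue.total;
  if (queued > 20) {
    warnings.push("globalLock.currentQueue.total is " + queued);
  }

  // Ticket exhaustion: warn when fewer than 25% of tickets are available.
  const tickets = status.wiredTiger.concurrentTransactions;
  for (const kind of ["read", "write"]) {
    const t = tickets[kind];
    if (t.available < 0.25 * t.totalTickets) {
      warnings.push(kind + " tickets available: " + t.available + "/" + t.totalTickets);
    }
  }
  return warnings;
}

// Made-up sample: long queue and nearly exhausted read tickets.
const sampleStatus = {
  globalLock: { currentQueue: { total: 35, readers: 30, writers: 5 } },
  wiredTiger: {
    concurrentTransactions: {
      read: { out: 110, available: 18, totalTickets: 128 },
      write: { out: 10, available: 118, totalTickets: 128 },
    },
  },
};
console.log(warningSigns(sampleStatus));
```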

Fixes

Application creates a client per request

The fix is to reuse one MongoClient instance per application process. Creating a client per request, per FaaS invocation, or per HTTP request forces a full TCP handshake, authentication, and potentially a topology discovery cycle every time. Cache the client at module or process scope and share it across requests. This is the single most effective fix for churn.
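
The caching pattern can be sketched as follows. This is an illustrative sketch rather than driver code: createClientCache is a hypothetical helper, and the factory here returns a stand-in object so the example runs without a database. In a real Node.js application the factory would be something like () => new MongoClient(uri).

```javascript
// Hypothetical helper: wraps a factory so the client is built at most once
// per process and every caller receives the same instance.
function createClientCache(factory) {
  let client = null;
  return function getClient() {
    if (client === null) {
      client = factory();
    }
    return client;
  };
}

// Stand-in factory for illustration only; it counts how often it is called.
let clientsCreated = 0;
const getClient = createClientCache(() => ({ id: ++clientsCreated }));

// Simulate three requests handled by the same process.
const first = getClient();
const second = getClient();
const third = getClient();
console.log("clients created:", clientsCreated);
console.log("same instance:", first === second && second === third);
```

Defining the cache at module scope means warm serverless invocations reuse the same client too; only a cold start pays for a new connection.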

Driver pool sizing or idle behavior

If the driver pool is oversized or its idle timeout is aggressive, connections open and close unnecessarily. Reduce the maximum pool size to match actual concurrency, and set an idle timeout longer than typical request inter-arrival times so normal traffic keeps the pool warm.
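
One rough way to estimate "actual concurrency" is Little's Law: in-flight operations are approximately the arrival rate multiplied by the mean time per operation. The sketch below applies that estimate and appends the standard maxPoolSize and maxIdleTimeMS connection string options. The helper names, hostname, workload figures, headroom factor, and idle timeout are all assumptions for illustration.

```javascript
// Rough concurrency estimate using Little's Law, padded by a headroom factor.
function estimatePoolSize(opsPerSecond, meanLatencyMs, headroomFactor) {
  const inFlight = (opsPerSecond * meanLatencyMs) / 1000;
  return Math.max(1, Math.ceil(inFlight * headroomFactor));
}

// Append pool options to a connection string.
function withPoolOptions(uri, maxPoolSize, maxIdleTimeMS) {
  const separator = uri.includes("?") ? "&" : "?";
  return uri + separator +
    "maxPoolSize=" + maxPoolSize +
    "&maxIdleTimeMS=" + maxIdleTimeMS;
}

// Made-up workload: 400 ops/s per process at 20 ms each, with 2x headroom.
// 400 * 20 / 1000 = 8 operations in flight, so 16 with headroom.
const poolSize = estimatePoolSize(400, 20, 2);

// Idle timeout of 5 minutes, assumed to be well above the gap between requests.
const pooledUri = withPoolOptions(
  "mongodb://db.example.internal:27017/app", poolSize, 300000);
console.log(pooledUri);
```

Treat the result as a starting point and confirm it against the driver's own pool metrics under real traffic.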

Reconnect storm after a topology event

If churn followed an election or network blip, stabilize the cluster first:

  • Check rs.status() for flapping member states.
  • Review application retry configuration so clients back off rather than reconnect immediately.
  • If connections are approaching the limit and memory is climbing, you can temporarily lower net.maxIncomingConnections so MongoDB rejects new connections cleanly rather than accepting them and crashing from OOM. Warning: this is disruptive and will reject client connections. Coordinate with application owners before applying.
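
For application-level retry loops, "back off rather than reconnect immediately" is commonly implemented as exponential backoff with jitter. The sketch below shows one such scheme (full jitter). The function name and the base and cap values are assumptions, and drivers manage their own reconnection internally, so this applies only to retries the application itself performs.

```javascript
// Exponential backoff with full jitter: the delay ceiling doubles with each
// attempt up to `capMs`, and the actual delay is a random point below it.
// `random` is injectable so the behavior can be tested deterministically.
function backoffDelayMs(attempt, baseMs, capMs, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Illustrative values: 100 ms base, 10 s cap.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log("attempt " + attempt + ": wait up to " +
    Math.min(10000, 100 * 2 ** attempt) + " ms, chose " +
    backoffDelayMs(attempt, 100, 10000) + " ms");
}
```

The jitter matters as much as the exponent: without it, every client that lost its connection at the same moment retries at the same moment.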

Monitoring or load balancer churn

If health checks or monitoring scrapers are the source, reconfigure them to use persistent connections or reduce their frequency. Ensure health checks do not perform an expensive handshake on every TCP open. If a proxy sits between the application and MongoDB, verify its idle timeout is not shorter than the driver’s, which causes the proxy to sever connections the driver still considers valid.

OS file descriptor limits

If churn is combined with high connection counts, check that ulimit -n and the systemd LimitNOFILE setting give MongoDB enough descriptors. Verify the actual process limit in /proc/$PID/limits (where PID is your mongod), because systemd unit files often override shell ulimit.

Prevention

  • Alert on totalCreated delta, not just current or available.
  • Track active / current so idle connections do not hide in the totals.
  • Enforce a single MongoClient singleton per application process.
  • Size driver pools to real concurrency and avoid idle timeouts shorter than your traffic cadence.
  • Test failover behavior under load to confirm clients back off instead of reconnecting all at once as a thundering herd.
  • Review infrastructure health checks quarterly to ensure they do not open fresh MongoDB connections per probe.
  • Keep connection headroom: operate below 50% of the effective connection limit so a reconnect storm does not immediately hit the ceiling.

How Netdata helps

  • Surfaces mongodb.connections_totalCreated as a rate without manual sampling.
  • Correlates churn with mongodb.memory_resident, CPU, and mongodb.globalLock_currentQueue on the same timeline.
  • Tracks mongodb.wiredTiger_concurrentTransactions_available to expose whether churn is translating into ticket contention.
  • Thresholds on total-created rate and RSS growth catch this failure mode before connection count alarms fire.
  • Per-second resolution catches short churn bursts that one-minute averages miss.