
MONGODB · OPERATIONS PLAYBOOK

MongoDB at scale: the WiredTiger cache, the oplog window, and the election you didn't expect

A WiredTiger cache that hates giving RAM back, 60-second checkpoints, a fixed-size oplog, semaphore-based admission tickets, and a replica set that calls an election once the primary's heartbeats stay missing past the election timeout. Together they decide how MongoDB behaves under real traffic: the incidents they cause, the signals that catch them early, and how to recover.


MongoDB is famously easy to start and famously unforgiving once the working set, the write rate, or the secondary count grows past what the defaults assumed.

The defaults work. Until the WiredTiger cache fills, background eviction can't keep up, and application threads are forced to evict pages inline, adding latency to every operation at once. Until a write surge turns the oplog over faster than a secondary can consume it, the secondary falls off the window, and the log reads too stale to catch up on a node now stuck in RECOVERING. Until a checkpoint falls behind, the journal fills, and WiredTiger blocks every write with no error at all, just unbounded latency. Until every write ticket (128 by default through 6.x) is held by an operation waiting on slow disk, and every new write queues behind them. Until an election storm flips the primary back and forth and your writes fail with not master.

These guides are written for engineers who already run MongoDB, not for people learning what a document is. The goal is to give you the mental model of how the server actually behaves under load, the failure patterns that keep recurring, the monitoring story that catches problems before they page anyone, and the runbooks you wish someone had handed you before your last incident.

How MongoDB actually runs in production

MongoDB is not just a document store. It is a replica set tailing an oplog, fronted by a WiredTiger storage engine with a managed cache, periodic checkpoints, a write-ahead journal, and a fixed pool of admission tickets. Most production failures live between these layers, not inside any one of them.

drivers / connection pool

Application drivers, replica set peers, <code>mongos</code> routers, and monitoring tools. MongoDB uses one thread per connection (~1 MB stack each), so 10,000 connections is 10,000 threads. The total counts toward <code>maxIncomingConnections</code> and the file-descriptor limit — exceed either and new connections are refused.

CLIENT

mongos / replica set routing

In a sharded cluster, stateless <code>mongos</code> routers read chunk maps from config servers and fan out queries. In a replica set, the driver routes writes to the PRIMARY. A stale topology after an election sends writes to a former primary, which rejects them with <code>not master</code> (reported as <code>not primary</code> by newer releases).

ROUTING

admission control (tickets)

Every operation touching storage must acquire a read or write ticket — 128 of each by default (≤6.x), dynamically tuned in 7.0+, surfaced as <code>queues.execution</code> in 8.0. When tickets run out, operations queue. Ticket exhaustion is the most under-monitored cause of MongoDB latency crises.

TICKETS

WiredTiger cache

A managed buffer pool, by default 50% of (RAM minus 1 GB) with a 256 MB floor, holding documents and indexes uncompressed. Background eviction starts at 80% fill; dirty pages evict aggressively past a 20% dirty ratio. When background eviction falls behind, application threads evict inline and latency spikes 10–100x.

CACHE
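The default sizing rule is worth working through once, because "50% of RAM" overstates it on small hosts. A minimal sketch, assuming the documented default of the larger of 50% of (RAM minus 1 GB) and 256 MB; the function name is illustrative and is not a MongoDB API.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def default_wiredtiger_cache_bytes(ram_bytes: int) -> int:
    """Default internal cache: max(50% of (RAM - 1 GiB), 256 MiB)."""
    return max((ram_bytes - GIB) // 2, 256 * MIB)

if __name__ == "__main__":
    for ram_gib in (1, 4, 16, 64):
        cache = default_wiredtiger_cache_bytes(ram_gib * GIB)
        print(f"{ram_gib:>3} GiB RAM -> {cache / GIB:.2f} GiB cache")
```

An explicit <code>cacheSizeGB</code> setting overrides this default, so treat the result as the starting point for a host where nothing has been tuned.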

checkpoints + journal

Every 60 seconds a checkpoint flushes dirty pages to disk; the journal (write-ahead log) syncs every ~100 ms for crash recovery. If a checkpoint takes longer than the interval, dirty data accumulates and the journal fills — and WiredTiger freezes all writes until it drains.

PERSIST

oplog + replication

The PRIMARY records every write to a capped <code>local.oplog.rs</code> collection. Secondaries tail it and apply entries. The oplog window is your safety margin: if a secondary falls behind it, it cannot catch up and needs a multi-hour full resync. Flow control (4.2+) throttles the primary to protect the window.

REPLICA

elections + heartbeats

Members heartbeat every 2 seconds. If the primary's heartbeats stay missing past <code>electionTimeoutMillis</code> (default 10s), an election runs, typically a 2–12 second write outage. A former primary that accepted writes which never replicated to a majority must roll them back (the <code>ROLLBACK</code> state) when it rejoins, writing the discarded data to a rollback directory.

ELECT

OS memory + page cache

RSS should be roughly cache + ~1 MB per connection + ~1 GB overhead. The Linux OOM killer judges by RSS and targets <code>mongod</code> first. Transparent Huge Pages and swap both wreck latency; MongoDB should never swap.

KERNEL
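The RSS rule of thumb above is easy to turn into a number you can compare with what the OS reports. This is a rough sketch: the function names are illustrative, and the per-connection and overhead figures are the approximations from the text, not measured values.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def expected_rss_bytes(cache_bytes: int, connections: int) -> int:
    """Cache + ~1 MB of thread stack per connection + ~1 GB fixed overhead."""
    return cache_bytes + connections * MIB + GIB

def rss_drift(actual_rss_bytes: int, cache_bytes: int, connections: int) -> float:
    """Fraction by which observed RSS exceeds the estimate (negative if below)."""
    expected = expected_rss_bytes(cache_bytes, connections)
    return (actual_rss_bytes - expected) / expected

if __name__ == "__main__":
    drift = rss_drift(14 * GIB, cache_bytes=8 * GIB, connections=2000)
    print(f"RSS is {drift:+.0%} relative to the estimate")
```

A drift that keeps growing while the connection count stays flat points at something outside this model, such as allocator fragmentation.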

disk (checkpoints / journal / data)

Local NVMe, EBS, or a container volume holding data files, journal, and oplog. Journal sync latency and checkpoint duration set the floor on write durability. On cloud disks, depleted burst credits spike I/O latency 10–100x and stall everything above.

DISK

Why this matters: a latency spike can come from cache eviction, a checkpoint stall, a journal-sync delay, ticket exhaustion, a collection scan from a dropped index, replication lag, or a saturated disk. The symptom is the same — MongoDB is slow — but each layer has a different signal and a different fix.

The failures you'll actually see

Most MongoDB incidents fall into a small set of recurring patterns. Recognise the shape, and triage gets dramatically faster.

CRITICAL

The cache pressure cascade

Write volume exceeds the rate WiredTiger can flush dirty pages. The cache fills, background eviction can't keep up, and application threads start evicting pages inline — adding latency to every operation. Tickets are held longer, new operations queue, application timeouts trigger reconnections, and the reconnections create more threads competing for the same tickets. A self-reinforcing degradation spiral that affects reads and writes alike.

tracked dirty bytes / maximum bytes configured above 15%
pages evicted by application threads incrementing at a sustained rate
globalLock.currentQueue readers and writers growing
opLatencies rising on both reads and writes with connection count climbing
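The first signal is a ratio of two WiredTiger cache statistics. A minimal sketch, assuming you already have the <code>wiredTiger.cache</code> sub-document from <code>serverStatus</code> as a Python dict; the function name is hypothetical and the 15% threshold is the one quoted above.

```python
def cache_pressure(cache: dict, dirty_alert: float = 0.15) -> dict:
    """Compute fill and dirty ratios from serverStatus().wiredTiger.cache."""
    max_bytes = cache["maximum bytes configured"]
    fill = cache["bytes currently in the cache"] / max_bytes
    dirty = cache["tracked dirty bytes in the cache"] / max_bytes
    return {"fill": fill, "dirty": dirty, "dirty_alert": dirty > dirty_alert}

if __name__ == "__main__":
    # Hand-made sample values, not output from a real server.
    sample = {
        "maximum bytes configured": 10_000,
        "bytes currently in the cache": 8_000,
        "tracked dirty bytes in the cache": 2_000,
    }
    print(cache_pressure(sample))
```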


IMMINENT

The oplog window collapse

A write surge turns the oplog over faster than a secondary can consume it. The window shrinks, the secondary has less and less time to catch up, and once its position wraps past the oldest oplog entry it falls off entirely, enters RECOVERING, and needs a multi-hour full initial sync. The cluster loses a member, the survivors absorb more load, and the risk of a second secondary falling off rises.

oplog window shrinking from hours toward minutes (rs.printReplicationInfo())
replication lag increasing linearly, not stabilizing
secondary metrics.repl.apply rate below the primary's write rate
too stale to catch up in the log; a member stuck in RECOVERING
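When lag grows linearly, the time left before a secondary falls off the window is a one-line projection. A sketch under stated assumptions: the lag samples come from your own monitoring, the function name is hypothetical, and the window is treated as constant, so the estimate is optimistic if the window is shrinking at the same time.

```python
from typing import List, Optional, Tuple

def seconds_until_stale(
    lag_samples: List[Tuple[float, float]], window_seconds: float
) -> Optional[float]:
    """Project when replication lag will reach the oplog window.

    lag_samples holds (unix_time, lag_seconds) pairs, oldest first.
    Returns None when lag is flat or shrinking.
    """
    (t0, lag0), (t1, lag1) = lag_samples[0], lag_samples[-1]
    growth = (lag1 - lag0) / (t1 - t0)  # seconds of lag gained per second
    if growth <= 0:
        return None
    return (window_seconds - lag1) / growth

if __name__ == "__main__":
    samples = [(0, 60), (600, 360)]  # lag grew by 300s over 10 minutes
    eta = seconds_until_stale(samples, window_seconds=3600)
    print("no upward trend" if eta is None else f"about {eta / 60:.0f} min of margin left")
```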


CRITICAL

The connection storm spiral

A trigger — an election, a deploy, a network blip, a DNS failure — causes connection pools across every application instance to reconnect at once. Each new connection spawns a ~1 MB thread, RSS spikes, tickets contend, existing operations slow, more timeouts fire, and more reconnections follow. MongoDB eventually refuses new connections at maxIncomingConnections or runs out of file descriptors entirely.

connections.totalCreated rate spiking (churn) with current climbing fast
memory RSS spiking in step with the connection count
connection refused / error accepting new connection in the log
current / (current + available) above 80%
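Utilization and churn both fall out of the <code>connections</code> sub-document of <code>serverStatus</code>, given two samples. A minimal sketch; the function name and the sampling interval are assumptions, and the 80% threshold is the one listed above.

```python
def connection_health(prev: dict, curr: dict, interval_s: float) -> dict:
    """Compare two serverStatus().connections samples taken interval_s apart."""
    utilization = curr["current"] / (curr["current"] + curr["available"])
    created_per_s = (curr["totalCreated"] - prev["totalCreated"]) / interval_s
    return {
        "utilization": utilization,
        "created_per_s": created_per_s,
        "near_limit": utilization > 0.80,
    }

if __name__ == "__main__":
    # Hand-made sample values, not output from a real server.
    before = {"current": 800, "available": 200, "totalCreated": 1_000}
    after = {"current": 850, "available": 150, "totalCreated": 1_600}
    print(connection_health(before, after, interval_s=60))
```

A high <code>created_per_s</code> with a flat <code>current</code> is the churn case: the count looks stable while threads are being created and torn down constantly.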


IMMINENT

The checkpoint stall write freeze

The checkpoint process falls critically behind. Dirty data accumulates, the journal reaches its size limit, and WiredTiger blocks every new write until the checkpoint drains enough to recycle the journal. Writes simply stop — no error, just infinite latency — while reads may still serve from cache. When the checkpoint finally completes, all queued writes execute at once and can trigger the next stall.

transaction checkpoint most recent time (msecs) exceeding the 60s interval and growing
WiredTiger cache dirty ratio high and rising
journal sync latency spiking then flatlining
opLatencies.writes climbing toward infinity while reads continue
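The first signal is most useful when expressed as headroom against the interval rather than as a raw duration. A small sketch, assuming the 60-second interval described above; the function names and the 25% warning level are illustrative choices, not MongoDB defaults.

```python
def checkpoint_headroom(recent_ms: float, interval_s: float = 60.0) -> float:
    """Fraction of the checkpoint interval left over after the last checkpoint.

    recent_ms is 'transaction checkpoint most recent time (msecs)'.
    A negative result means the checkpoint overran the interval.
    """
    return 1.0 - recent_ms / (interval_s * 1000.0)

def checkpoint_at_risk(recent_ms: float, min_headroom: float = 0.25) -> bool:
    """True when less than min_headroom of the interval remains."""
    return checkpoint_headroom(recent_ms) < min_headroom

if __name__ == "__main__":
    for ms in (12_000, 55_000, 75_000):
        print(f"{ms / 1000:.0f}s checkpoint -> headroom {checkpoint_headroom(ms):+.0%}")
```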


WATCHFUL

The silent index regression

An index is accidentally dropped, a background build fails, or the planner picks a worse plan. Queries silently switch to collection scans. Latency rises gradually, proportional to collection growth, until scan I/O reaches a tipping point and overwhelms the storage subsystem. Nothing errors — the same queries that were fast last month are now the slowest thing on the box.

slow query log showing COLLSCAN on collections over 10,000 documents
metrics.queryExecutor.scanned / scannedObjects rate increasing
docsExamined / nreturned ratio climbing for specific queries
$indexStats showing a previously-busy index with zero recent ops
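The examined-to-returned ratio can be computed per slow-query or profiler entry. A sketch under assumptions: entries are dicts carrying the <code>planSummary</code>, <code>docsExamined</code>, and <code>nreturned</code> fields that the profiler records, the function names are hypothetical, and the limit of 100 is an arbitrary starting point to tune.

```python
from typing import List

def scan_ratio(entry: dict) -> float:
    """Documents examined per document returned (returned is floored at 1)."""
    return entry.get("docsExamined", 0) / max(entry.get("nreturned", 0), 1)

def suspicious_queries(entries: List[dict], ratio_limit: float = 100.0) -> List[dict]:
    """Entries that did a collection scan or examined far more than they returned."""
    return [
        e for e in entries
        if e.get("planSummary", "").startswith("COLLSCAN") or scan_ratio(e) > ratio_limit
    ]

if __name__ == "__main__":
    # Hand-made entries shaped like profiler documents.
    log = [
        {"ns": "app.orders", "planSummary": "COLLSCAN", "docsExamined": 50_000, "nreturned": 10},
        {"ns": "app.users", "planSummary": "IXSCAN { email: 1 }", "docsExamined": 12, "nreturned": 10},
    ]
    for e in suspicious_queries(log):
        print(e["ns"], e["planSummary"], f"ratio={scan_ratio(e):.0f}")
```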


ACTIVE

The election storm

The primary repeatedly steps down or loses elections, whether from resource exhaustion delaying heartbeats, network instability, or a priority-takeover loop. Each election is a 2–12 second write outage, members flip between PRIMARY and SECONDARY, and writes fail with not master until the driver rediscovers the primary. Worst case, a node that accepted writes before stepping down must roll them back, losing data.

rs.status() showing different members claiming PRIMARY over time
Starting an election / Stepping down repeating in the log
more than 2 elections within 10 minutes outside maintenance
intermittent write failures and connection resets during transitions
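The "more than 2 elections within 10 minutes" rule is a sliding-window count. A minimal sketch, assuming you have already extracted election timestamps (in seconds) from the log or from your metrics; the function name is hypothetical.

```python
from typing import Iterable

def election_storm(timestamps: Iterable[float], window_s: float = 600, limit: int = 2) -> bool:
    """True if any window of window_s seconds contains more than limit elections."""
    ts = sorted(timestamps)
    start = 0
    for end in range(len(ts)):
        # Slide the left edge forward until the window spans at most window_s.
        while ts[end] - ts[start] > window_s:
            start += 1
        if end - start + 1 > limit:
            return True
    return False

if __name__ == "__main__":
    print(election_storm([0, 100, 200]))    # three elections inside 10 minutes
    print(election_storm([0, 700, 1400]))   # spread out, one per window
```

Suppress the check during planned maintenance, since a rolling restart legitimately produces several elections in a short span.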


The Netdata solution

MongoDB monitoring with Netdata

Netdata monitors MongoDB with per-second metrics and automatic dashboards. Watch WiredTiger cache pressure, oplog window, connection counts, checkpoint stalls, and replication health in one place, correlated with the underlying host.


MongoDB monitoring maturity levels

MongoDB observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much your cluster matters. Most production replica sets should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is the cluster still functioning? You will not learn what broke, but you will learn that something broke before users do. Survival is enough for dev environments and non-critical workloads.

Process liveness (ping): Does mongod answer db.adminCommand({ping:1}) within a couple of seconds?
Replica set member state: Is there exactly one PRIMARY, and is every member PRIMARY or SECONDARY?
Replication lag: How far behind is each secondary, in seconds and as a fraction of the oplog window?
Oplog window (hours of coverage): How long can a secondary be offline before it needs a full resync?
Disk space on data + journal: Are you about to fill the disk and freeze writes?
Connection count vs available: Are you approaching maxIncomingConnections?
Slow query log enabled (slowms:100): Will a slow query actually be recorded when it happens?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production clusters should target. Survival tells you something is wrong; operational tells you what. With this coverage your team can usually diagnose an incident on its own: cache pressure, checkpoint stalls, replication lag, election churn, connection pressure.

WiredTiger cache fill AND dirty ratio: Fill near 80% with dirty above 15% is the cache-pressure warning.
opcounters (insert/query/update/delete): A sudden drop means something is blocking operations.
opLatencies (reads / writes): Average and approximate p99, expressed as a multiple of baseline.
globalLock.currentQueue depth: Are readers or writers queuing for the storage engine?
Oplog window trend: Is your safety margin shrinking month over month?
Memory RSS vs expected: Is RSS approaching the system limit (OOM-kill risk)?
Election events: Is the primary stepping down more than it should?
Page faults rate: Does the working set still fit in memory after warmup?
OS disk I/O latency: Is the storage device keeping up with checkpoints and journal?

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up. Tickets depleting during peaks, checkpoint duration creeping toward the interval, journal sync latency drifting, a noTimeout cursor pinning a snapshot, an index quietly going unused. None of these pages you on day one. They page you on day thirty.

WiredTiger ticket utilization (read + write): Available tickets below 25% means operations are about to queue.
Application-thread eviction rate: Any sustained nonzero rate means users are feeling the cache.
Checkpoint duration: A 55s checkpoint every 60s looks stable but has zero margin.
Journal sync latency: A storage-health signal that warns 30–60s before app latency.
scanned / scannedObjects vs returned: Rising ratios mean inefficient or missing indexes.
Connection churn (totalCreated delta): Stable count can still hide expensive create/destroy churn.
Cursor counts (especially noTimeout): noTimeout cursors pin snapshots and cause silent cache pressure.
currentOp longest-running operation: One runaway query holding a ticket makes the whole server look slow.
Flow control status (isLagged): Is the primary throttling itself to protect the oplog window?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after a specific incident proved you needed them. History-store activity, plan-cache evictions, jumbo-chunk growth, per-shard contention heat maps, tcmalloc fragmentation, oplog entry-size distribution. Most teams never need every signal here. Add the ones your incident history says you do.

Plan cache eviction events: Sudden plan changes that turn a fast query into a collection scan.
WiredTiger history store activity: Old-version retention pressure from long snapshots (replaced cache overflow in ~4.4).
Ticket utilization trended per-minute: A declining peak-time minimum forecasts the next ticket crisis.
Jumbo chunk count and growth: Chunks that can't split or migrate cause permanent shard imbalance.
Chunk migration latency / failures: moveChunk I/O pressure and range locks on busy shards.
Config server operation latency: Slow config servers stall splits and migrations cluster-wide.
tcmalloc fragmentation ratio: heap_size / current_allocated_bytes inflating RSS over time.
Oplog entry size distribution: Large multi-document transactions producing oversized oplog entries.

Operating mistakes worth avoiding

The traps MongoDB teams keep falling into. Each has a clear, well-known fix. Most teams only learn it after an incident.

Not monitoring the WiredTiger dirty ratio

Teams watch cache fill percentage and ignore the dirty ratio — yet dirty is the stronger leading indicator. It reveals checkpoint-stall risk 10–30 minutes before any latency degradation. Alert on <code>tracked dirty bytes / maximum bytes configured</code> above 15%, not on fill alone (75–80% fill is normal and healthy).

Never watching ticket utilization

Ticket exhaustion is a top cause of MongoDB latency crises and is the single most under-monitored signal. Teams debug "slow queries" when the real cause is every write ticket (128 by default through 6.x) held by operations waiting on slow disk. Graph available read and write tickets; alert when fewer than 25% remain available. Raising the ticket limit is almost never the right fix.
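The 25% rule can be checked directly against ticket counts. A sketch under assumptions: the input is the <code>wiredTiger.concurrentTransactions</code> sub-document that <code>serverStatus</code> reports on releases before 8.0, passed in as a dict; the function names are hypothetical, and on 8.0 the equivalent figures live under <code>queues.execution</code>, which this sketch does not parse.

```python
from typing import Dict, List

def ticket_availability(concurrent_transactions: dict) -> Dict[str, float]:
    """Fraction of read and write tickets currently available."""
    return {
        kind: concurrent_transactions[kind]["available"]
        / concurrent_transactions[kind]["totalTickets"]
        for kind in ("read", "write")
    }

def low_ticket_kinds(concurrent_transactions: dict, floor: float = 0.25) -> List[str]:
    """Ticket pools whose available fraction has dropped below floor."""
    avail = ticket_availability(concurrent_transactions)
    return [kind for kind, frac in avail.items() if frac < floor]

if __name__ == "__main__":
    # Hand-made sample values, not output from a real server.
    sample = {
        "read": {"out": 28, "available": 100, "totalTickets": 128},
        "write": {"out": 120, "available": 8, "totalTickets": 128},
    }
    print(ticket_availability(sample), low_ticket_kinds(sample))
```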

Sizing the oplog once and never trending it

Teams size the oplog at deployment and forget. As write volume grows organically, the window shrinks month by month, and the failure is discovered when a secondary needs maintenance and can't catch up. Trend the minimum window during peak writes; keep it above 2x your longest expected secondary downtime.
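The 2x rule reduces to simple arithmetic once you know the peak oplog generation rate. A minimal sketch; the function names are illustrative, and the peak rate is something you measure from your own trend data rather than a value MongoDB reports directly.

```python
def window_hours(oplog_gb: float, peak_gb_per_hour: float) -> float:
    """Hours of coverage the oplog provides at the peak write rate."""
    return oplog_gb / peak_gb_per_hour

def required_oplog_gb(
    peak_gb_per_hour: float, longest_downtime_hours: float, safety: float = 2.0
) -> float:
    """Oplog size that keeps the window at safety x the longest expected downtime."""
    return peak_gb_per_hour * longest_downtime_hours * safety

if __name__ == "__main__":
    peak, downtime, current = 5.0, 6.0, 50.0
    print(f"window at peak: {window_hours(current, peak):.1f} h")
    print(f"size needed:    {required_oplog_gb(peak, downtime):.0f} GB")
```

Recomputing this against each month's peak is what turns the one-time sizing decision into a trend.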

Trusting w:1 writes as durable

<code>w:1</code> means the primary acknowledged the write in memory — not that it reached disk or replicated. A crash or a rollback at failover loses those writes silently. Use <code>w:"majority"</code> for data you can't lose, and monitor <code>wtimeouts</code> so you know when durability isn't being met.

Leaving Transparent Huge Pages enabled and allowing swap

THP causes latency spikes and fragmentation; swap turns the process into a 1/1000th-speed zombie. Both are easy to miss. Confirm <code>cat /sys/kernel/mm/transparent_hugepage/enabled</code> shows <code>[never]</code>, set <code>vm.swappiness=1</code>, and protect <code>mongod</code> from the OOM killer with <code>oom_score_adj</code>.
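The THP file lists every mode and brackets the active one, so a health check has to parse for the brackets rather than look for the word "never". A small sketch; the function name is hypothetical, and the file is only read if it exists on the host.

```python
from pathlib import Path
from typing import Optional

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def active_thp_mode(text: str) -> Optional[str]:
    """Return the bracketed mode from text such as 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

if __name__ == "__main__":
    if THP_PATH.exists():
        mode = active_thp_mode(THP_PATH.read_text())
        verdict = "ok" if mode == "never" else "should be never"
        print(f"THP mode: {mode} ({verdict})")
    else:
        print("THP control file not present on this host")
```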

Monitoring connection count but not churn

A count of 500/10,000 looks fine — but if those 500 connections are created and destroyed 100 times a minute, thread-creation overhead is devastating. The <code>totalCreated</code> delta is more informative than <code>current</code> in many failure modes. Fix the pool, don't just raise the ceiling.

Assuming secondaries are healthy because lag is zero

Teams watch the primary obsessively and assume zero lag means secondaries are fine. But a secondary can have degraded storage, memory pressure, or competing read traffic that won't surface as lag until a load spike pushes it over, taking out both the secondary and your redundancy at once. Verify each secondary's own resource health.

Monitoring a sharded cluster blind to per-shard skew

Aggregate dashboards hide a hot shard sitting at 90% I/O while the others idle at 20%. Poor shard keys and jumbo chunks cause skew the balancer can't fix. Compare latency, throughput, and resource use per shard, and watch jumbo chunk count — aggregate alarms will never catch it.
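One way to surface skew is to compare each shard against the median instead of the cluster average, since the average is exactly what a hot shard distorts. A sketch under assumptions: the per-shard utilization values come from your own monitoring, the function name is hypothetical, and the factor of 2 is an arbitrary starting threshold.

```python
from statistics import median
from typing import Dict, List

def hot_shards(util_by_shard: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Shards whose utilization exceeds factor x the median across shards."""
    mid = median(util_by_shard.values())
    if mid <= 0:
        return []
    return sorted(name for name, u in util_by_shard.items() if u > factor * mid)

if __name__ == "__main__":
    # Hand-made I/O utilization figures for three shards.
    io_util = {"shard-a": 0.90, "shard-b": 0.20, "shard-c": 0.22}
    print(hot_shards(io_util))
```

The same comparison works for latency or throughput; run it per metric rather than blending them into one score.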

MongoDB runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

Start here
WiredTiger cache and eviction
Checkpoints, journal, and write freeze
Oplog, replication lag, and resync
Connections, threads, and storms
Tickets, queues, and lock contention
Slow queries, indexes, and plans
Elections, failover, and rollback
Disk, storage, and reclamation
Sharding, balancer, and mongos
Memory, RSS, and the OS
Security, auth, and exposure

WHERE TO GO NEXT

Setting up MongoDB monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.
