MongoDB monitoring checklist: the signals every production cluster needs

Production MongoDB failures are usually preceded by signals that are visible but often unmonitored: a climbing dirty cache ratio, a shrinking oplog window, or ticket counts approaching zero. This guide organizes the essential signals into four monitoring levels. Use them to audit your instrumentation or to triage gaps during an incident.

Each level builds on the previous one. If you are missing a survival signal, instrument it before adding expert metrics. The thresholds below are drawn from the fields that MongoDB reports in `serverStatus()` and `rs.status()` and from operational patterns observed across WiredTiger deployments.

Level 1: survival

The bare minimum.

Signal	Source / Command	What to watch for
Process liveness	`db.adminCommand({ping:1})` or TCP to port 27017	Failed ping: instance is down, OOM-killed, or network-partitioned. Success does not guarantee health; node may still be RECOVERING or stalled.
Replica set member state	`rs.status().members[].stateStr` and `.health`	Exactly one PRIMARY. Any data-bearing member in RECOVERING, ROLLBACK, DOWN, or UNKNOWN for >2 minutes is abnormal.
Replication lag	Compare `optimeDate` between PRIMARY and SECONDARY in `rs.status()`	Lag that exceeds the oplog window forces a full initial sync of that secondary. Exclude intentionally delayed members.
Disk space utilization	`df -h` on dbPath and journal paths	MongoDB exits if it cannot write the journal. WiredTiger does not return freed disk space to the filesystem after deletes; plan headroom accordingly.
Connection count	`db.serverStatus().connections.current` and `.available`	Each connection uses ~1 MB thread stack. Approaching the file-descriptor or `maxIncomingConnections` limit causes silent rejections or OOM.
Slow query log	MongoDB log or `db.setProfilingLevel(1, {slowms: 100})`	Queries >100 ms reveal missing indexes, COLLSCAN plans, and regressions. Correlate with `metrics.queryExecutor.scanned` for wasted work. Enabling the profiler adds write overhead; prefer logs in high-throughput deployments.
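
The member-state and replication-lag checks in the table above can be sketched as a small function over a document shaped like the output of `rs.status()`. This is a minimal sketch, not a definitive implementation: `checkReplicaSet`, `ABNORMAL_STATES`, and the return shape are hypothetical names introduced for illustration. It does not track how long a member has been in a state (the 2-minute allowance is left to the caller), and it assumes intentionally delayed members have been filtered out beforehand.

```javascript
// Sketch: evaluate a replica set status document shaped like rs.status().
// checkReplicaSet and ABNORMAL_STATES are hypothetical names, not MongoDB APIs.
const ABNORMAL_STATES = new Set(["RECOVERING", "ROLLBACK", "DOWN", "UNKNOWN"]);

function checkReplicaSet(status) {
  const members = status.members;
  const primaries = members.filter((m) => m.stateStr === "PRIMARY");
  const findings = [];

  // Level 1 rule: exactly one PRIMARY.
  if (primaries.length !== 1) {
    findings.push(`expected exactly one PRIMARY, found ${primaries.length}`);
  }

  // Flag unhealthy members; the caller decides how long to tolerate them.
  for (const m of members) {
    if (m.health !== 1 || ABNORMAL_STATES.has(m.stateStr)) {
      findings.push(`${m.name} is ${m.stateStr} (health ${m.health})`);
    }
  }

  // Replication lag in seconds: PRIMARY optimeDate minus SECONDARY optimeDate.
  const lagSeconds = {};
  if (primaries.length === 1) {
    const primaryTime = primaries[0].optimeDate.getTime();
    for (const m of members) {
      if (m.stateStr === "SECONDARY") {
        const lagMs = primaryTime - m.optimeDate.getTime();
        lagSeconds[m.name] = Math.max(0, lagMs / 1000);
      }
    }
  }

  return { findings, lagSeconds };
}
```

In a shell session this could be called as `checkReplicaSet(rs.status())`; comparing each value in `lagSeconds` against the current oplog window is a separate step.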

Level 2: operational

Standard for production traffic.

Signal	Source / Command	What to watch for
WiredTiger cache utilization	`db.serverStatus().wiredTiger.cache`: used/max and dirty/max	Used cache >80% with rising eviction is concerning. Dirty ratio >20% predicts checkpoint stalls before user-visible latency appears.
Operation throughput	`db.serverStatus().opcounters` deltas	Sustained drop >50% from baseline indicates blocking: ticket exhaustion, election, or disk stall. Spikes suggest retry storms.
Operation latency	`db.serverStatus().opLatencies`: reads, writes, commands	Watch p99 deviation from baseline, not absolute values. Server-side latency often spikes 30-60 seconds before application timeouts when storage degrades.
Queue depths	`db.serverStatus().globalLock.currentQueue`	Persistent or growing queues mean saturation. Queue <20 and stable may be acceptable. Unbounded growth predicts collapse.
Oplog window	`rs.printReplicationInfo()` or `db.getReplicationInfo()`	Safety margin for secondary catch-up. Keep the window above 24 hours at peak write load. Resize with `replSetResizeOplog` if it is trending down.
Memory RSS	`db.serverStatus().mem.resident` vs system RAM	Should approximate cache size + ~1 MB per connection + ~1 GB overhead. RSS within 1 GB of total RAM risks OOM kill.
Election events	MongoDB logs: “Starting an election”, “Stepping down”	Each election causes 2-12 seconds of write unavailability. >1 per hour outside maintenance indicates instability.
OS disk latency	`iostat -x`: `await`, `%util`	Storage saturation precedes MongoDB symptoms. Journal sync latency and checkpoint duration derive directly from this layer.
Page fault rate	`db.serverStatus().extra_info.page_faults` delta	Major page faults mean working set exceeds memory. Expected during cold start; after warmup, a rising trend signals cache pressure.
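
Two of the Level 2 signals above are derived values rather than raw fields: the cache ratios are quotients of WiredTiger cache statistics, and throughput is a delta between two `opcounters` samples. The sketch below illustrates both under stated assumptions. `cacheRatios` and `opRates` are hypothetical helper names, and `Number()` is applied defensively in case the shell returns 64-bit integer wrappers instead of plain numbers.

```javascript
// Sketch: cache used/dirty ratios from a serverStatus()-shaped document.
// cacheRatios is a hypothetical helper; the quoted keys are WiredTiger
// cache statistic names as they appear under wiredTiger.cache.
function cacheRatios(status) {
  const cache = status.wiredTiger.cache;
  const max = Number(cache["maximum bytes configured"]);
  return {
    used: Number(cache["bytes currently in the cache"]) / max,
    dirty: Number(cache["tracked dirty bytes in the cache"]) / max,
  };
}

// Sketch: per-second rates from two opcounters snapshots taken
// intervalSeconds apart. Counters restart from zero when mongod restarts,
// so a negative delta is reported as null instead of a bogus rate.
function opRates(prev, curr, intervalSeconds) {
  const rates = {};
  for (const op of Object.keys(curr)) {
    const delta = Number(curr[op]) - Number(prev[op] ?? 0);
    rates[op] = delta < 0 ? null : delta / intervalSeconds;
  }
  return rates;
}
```

A caller would compare `used` against 0.80 and `dirty` against 0.20, and compare each rate against its own baseline rather than a fixed number.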

Level 3: mature

Full coverage for systems with an SLO.

Signal	Source / Command	What to watch for
WiredTiger ticket utilization	`db.serverStatus().wiredTiger.concurrentTransactions` (7.x and earlier) or `queues.execution` (8.0+)	Fewer than 25% of tickets available means storage engine saturation. Fewer than 10 tickets available is critical; zero blocks all new operations.
Application-thread evictions	`db.serverStatus().wiredTiger.cache`: “pages evicted by application threads”	Sustained nonzero rate is abnormal. Background eviction cannot keep up and user threads are doing cleanup work, adding latency directly to operations.
Checkpoint duration	`db.serverStatus().wiredTiger.transaction`: “transaction checkpoint most recent time”	<10 s healthy. 10-30 s concerning. >60 s critical (exceeds default interval). Trend matters more than absolute value.
Journal sync latency	`db.serverStatus().wiredTiger.log`: “log sync time duration” / “log sync operations”	>30 ms sustained degrades write acknowledgment. >100 ms means writes are stalling. Often leads MongoDB symptoms by 30-60 seconds.
Document scan ratio	`db.serverStatus().metrics.queryExecutor`: scanned, scannedObjects	Scanned:returned ratio climbing toward 100:1 indicates inefficient scans. Correlate with slow query log for COLLSCAN.
Connection churn	`db.serverStatus().connections.totalCreated` delta	High `totalCreated` with stable `current` means connection pool thrashing. Thread creation spikes RSS and CPU.
Active vs idle connections	`db.serverStatus().connections.active` vs `.current`	Idle connections still consume ~1 MB RSS and file descriptors. Low active ratio suggests oversized driver pools.
Cursor counts	`db.serverStatus().metrics.cursor`: open.total, open.noTimeout	noTimeout cursors hold WiredTiger snapshots open, causing silent cache pressure. Target near zero for noTimeout.
Secondary apply rate	`db.serverStatus().metrics.repl.apply`: ops and batches	Must keep up with the primary write rate over any 10-minute window. An apply rate consistently below 80% of the primary write rate leads to unbounded lag.
Per-collection growth	`db.collection.stats().storageSize`	Tracks on-disk growth per collection. Use `storageSize`, not `dataSize`, for capacity planning.
Index usage stats	`db.collection.aggregate([{$indexStats: {}}])`	Unused indexes waste cache and add write amplification. Wait >24 hours after restart before concluding an index is unused.
Lock wait times	`db.serverStatus().locks`: timeAcquiringMicros	Collection-level waits indicate write hotspots. Global waits indicate DDL blocking all traffic.
Network throughput	`db.serverStatus().network`: bytesIn, bytesOut, numRequests	High bytesOut per request indicates large payloads or missing projections. Replication traffic competes with client traffic.
Long-running operations	`db.currentOp({active:true, secs_running:{$gt:10}})` max age	Maximum age of any active operation. A single runaway query holding a ticket for minutes can cascade to system-wide latency.
Assertion rates	`db.serverStatus().asserts` deltas	`regular` or `msg` assertions indicate server bugs or data corruption. `user` assertion spikes indicate application errors or auth issues.
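
The ticket and journal-sync rows above describe thresholds on derived values, which a short sketch may make concrete. `ticketLevel` and `journalSyncMs` are hypothetical helpers. The pool shape `{ out, available, totalTickets }` is the one reported under `wiredTiger.concurrentTransactions.read` and `.write`, and is assumed here to be the same under `queues.execution` in 8.0. The log statistic key `log sync time duration (usecs)` is assumed to carry its unit suffix as shown.

```javascript
// Sketch: classify ticket availability for one pool, using the thresholds
// from the table (fewer than 10 tickets: critical; under 25%: saturated).
function ticketLevel(pool) {
  const available = Number(pool.available);
  const total = Number(pool.totalTickets);
  if (available < 10) return "critical";
  if (available / total < 0.25) return "saturated";
  return "ok";
}

// Sketch: average journal sync latency in milliseconds between two
// wiredTiger.log snapshots. Returns null when no syncs happened (or the
// counters reset), because the average is undefined in that case.
function journalSyncMs(prevLog, currLog) {
  const ops =
    Number(currLog["log sync operations"]) -
    Number(prevLog["log sync operations"]);
  if (ops <= 0) return null;
  const usecs =
    Number(currLog["log sync time duration (usecs)"]) -
    Number(prevLog["log sync time duration (usecs)"]);
  return usecs / ops / 1000;
}
```

Because ticket totals can change at runtime in 7.0+, the percentage check reads `totalTickets` from the same sample instead of assuming a fixed pool size.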

Level 4: expert

Deep signals operators add after their third major incident.

Signal	Source / Command	What to watch for
History store activity	`db.serverStatus().wiredTiger.cache` legacy overflow stats	Cache overflow table was replaced by the history store in 4.4+. In 6.x+ the legacy counter is zero and no longer meaningful; monitor hazard pointers instead.
Hazard pointer count	`db.serverStatus().wiredTiger.transaction` or cache stats	High hazard pointers indicate snapshots are retained too long, preventing eviction.
Plan cache evictions	MongoDB logs and `serverStatus().metrics.query`	Sudden plan cache evictions can cause a query to switch from an index scan to a collection scan without warning.
Ticket trends per minute	`wiredTiger.concurrentTransactions` or `queues.execution` sampled per minute	In 7.0+ dynamic ticket adjustment changes total counts. Per-minute trends reveal whether the adaptive algorithm is keeping up or falling behind.
Chunk migration rate	`db.getSiblingDB("config").changelog`: moveChunk entries	Failed migrations or migration storms cause I/O pressure and range locks. Monitor latency and failure counts.
Jumbo chunk count	`db.getSiblingDB("config").chunks.find({jumbo:true})`	Jumbo chunks cannot be split or migrated, creating permanent imbalance that the balancer cannot resolve.
Per-collection lock heat map	Correlate `db.serverStatus().locks` with `db.currentOp()` by `ns`	Aggregate lock stats hide which collection is the hotspot. Cross-reference namespace-specific currentOps with lock type waits.
Config server latency	`db.adminCommand({ping:1})` against config server replica set	Config server stalls block chunk splits and balancer decisions. Elevated latency affects all cluster metadata operations.
Write amplification	`wiredTiger.block-manager` bytes written vs oplog bytes written	High ratios indicate index updates or document padding generating more I/O than logical write volume suggests.
TCMalloc fragmentation	`db.serverStatus().tcmalloc`: heap_size vs current_allocated_bytes	In 8.0, per-CPU caches may retain memory in RSS without returning it to the OS.
Secondary read latency	`db.serverStatus().opLatencies.reads` on secondaries vs primary	Elevated read latency on secondaries indicates replication apply threads competing with read traffic for disk and tickets.
Reconciliation errors	`db.serverStatus().wiredTiger.transaction`	Non-zero reconciliation errors indicate data integrity issues during page flush. Investigate immediately.
Oplog entry size distribution	`local.oplog.rs`	Large entries from multi-document transactions or bulk writes shrink the oplog window disproportionately and slow replication.
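
As one illustration of the Level 4 signals, the TCMalloc row compares heap size with allocated bytes. The sketch below computes the gap, assuming both fields sit under `tcmalloc.generic` in the `serverStatus()` output; the exact layout may differ between versions, and `tcmallocOverhead` is a hypothetical helper.

```javascript
// Sketch: memory held by the allocator but not currently allocated to
// MongoDB. Assumes heap_size and current_allocated_bytes are found under
// status.tcmalloc.generic; verify the layout on your server version.
function tcmallocOverhead(status) {
  const generic = status.tcmalloc.generic;
  const heap = Number(generic.heap_size);
  const allocated = Number(generic.current_allocated_bytes);
  const retainedBytes = heap - allocated;
  return {
    retainedBytes,
    // Fraction of the heap that is retained rather than in use.
    retainedRatio: heap > 0 ? retainedBytes / heap : 0,
  };
}
```

A retained ratio that keeps growing while allocated bytes stay flat is the pattern this row is meant to surface; what ratio counts as a problem depends on the workload.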

How Netdata helps

Netdata polls serverStatus() and rs.status() every second and computes deltas automatically, surfacing operation rates, replication lag, and latency windows without manual scripts.
WiredTiger cache fill ratio, dirty ratio, and available tickets are displayed together to spot cache-pressure cascades before application-thread evictions begin.
Per-node and per-replica-set views compare secondary apply rates, oplog windows, and member states side by side to identify a lagging secondary or hot shard.
Built-in alerts cover Level 1 and Level 2 signals such as disk space, replication lag, and cache utilization; clone and extend them to Level 3 thresholds like ticket availability and checkpoint duration.
Connection charts include both current and totalCreated rates to distinguish a stable pool from a reconnection storm driving RSS spikes.
