MongoDB monitoring maturity model: from survival to expert
If you run MongoDB in production, you need a progression: what to watch first to know the database is alive, what to add next to know why it is slowing down, and what to track finally to predict a cascade before it becomes an outage.
Treat the levels below as gates, not a shopping list. Automate and alert on Level 1 before you build Level 2 dashboards. If you are instrumenting Level 4 but lack a reliable page for disk space or member state, you have inverted your priorities.
The model applies to replica sets and sharded clusters running the WiredTiger storage engine.
flowchart TD
A[Level 1 Survival]
B[Level 2 Operational]
C[Level 3 Mature]
D[Level 4 Expert]
A --> B
B --> C
C --> D
Level 1 - survival
These six signals are the difference between knowing you have a problem and learning about it from users. They are mostly binary and should trigger pages when they breach.
- Process liveness. Confirm `mongod` or `mongos` accepts connections. Failure means the instance is down, unreachable, or degraded.
- Replica set member state. Exactly one PRIMARY must exist at steady state. Members in RECOVERING, ROLLBACK, or UNKNOWN for more than two minutes indicate replication instability.
- Replication lag. The gap between a secondary’s last applied oplog entry and the primary’s. If lag exceeds your oplog window, the secondary requires a full initial sync. Alert when lag exceeds 10 seconds sustained, or 50% of the oplog window.
- Disk space utilization. WiredTiger data files do not shrink automatically after deletes or drops; disk growth trends upward unless you run `compact`, which blocks the collection and consumes significant I/O. Crossing 90% risks immediate crash when the journal cannot append.
- Connection count. Each connection consumes roughly 1 MB of thread stack memory. A sustained climb toward `maxIncomingConnections` precedes memory pressure and rejection storms.
- Slow query log. Queries exceeding `slowms` reveal index regressions and resource contention without the overhead of active profiling.
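The member state and replication lag checks above can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: it assumes a dict shaped like the output of `replSetGetStatus` (a `members` list whose entries carry `name`, `stateStr`, and `optimeDate`), and the function name `check_replica_set` and the host names are hypothetical. It evaluates a single sample; the "sustained" and "more than two minutes" conditions are assumed to be handled by the alerting layer.

```python
from datetime import datetime, timedelta

LAG_ALERT_SECONDS = 10  # threshold taken from the text; tune per deployment
UNHEALTHY_STATES = ("RECOVERING", "ROLLBACK", "UNKNOWN")


def check_replica_set(status):
    """Return alert strings for one replSetGetStatus-shaped sample."""
    alerts = []
    members = status.get("members", [])
    primaries = [m for m in members if m.get("stateStr") == "PRIMARY"]
    if len(primaries) != 1:
        alerts.append(f"expected exactly one PRIMARY, found {len(primaries)}")
        return alerts  # lag is undefined without a single primary

    primary_optime = primaries[0]["optimeDate"]
    for m in members:
        state = m.get("stateStr")
        if state in UNHEALTHY_STATES:
            alerts.append(f"{m['name']} is {state}")
        elif state == "SECONDARY":
            # Lag is the gap between the primary's and the secondary's
            # last applied oplog entry.
            lag = (primary_optime - m["optimeDate"]).total_seconds()
            if lag > LAG_ALERT_SECONDS:
                alerts.append(f"{m['name']} lags primary by {lag:.0f}s")
    return alerts


# Hypothetical sample: one healthy secondary, one lagging by 45 seconds.
now = datetime(2024, 1, 1, 12, 0, 0)
sample = {
    "members": [
        {"name": "a.example:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "b.example:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=3)},
        {"name": "c.example:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=45)},
    ]
}
print(check_replica_set(sample))
```

In a real deployment the dict would come from running `replSetGetStatus` against a member through a driver; the sketch keeps that step out so the logic stays testable offline.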
Level 2 - operational
Once liveness is assured, the next question is whether the database can sustain its workload. Level 2 adds throughput, latency, and capacity signals that reveal trends before they become hard failures.
- WiredTiger cache fill and dirty ratio. Fill ratio above 80% is normal by design, but a dirty ratio climbing past 10-15% predicts checkpoint stalls minutes before application latency spikes. Dirty ratio above 20% is critical.
- Operation counters (`opcounters`). Derived deltas of insert, query, update, delete, getmore, and command rates expose workload shifts, retry storms, and election-induced write pauses.
- Operation latency (`opLatencies`). Server-side read, write, and command histograms. Rising p95 or p99 often precedes throughput collapse by several minutes.
- Queue depths (`globalLock.currentQueue`). Non-zero readers or writers mean operations are waiting. A queue that grows without clearing signals imminent saturation.
- Oplog window. The hours of coverage your capped oplog provides. Shrinkage during write bursts reduces the margin for secondary recovery and maintenance. Alert when the window falls below 12 hours.
- Memory RSS. Should approximate WiredTiger cache size plus ~1 MB per connection plus internal overhead. RSS approaching system memory invites OOM killer intervention.
- Election events. Repeated elections outside maintenance windows indicate network flapping, resource exhaustion delaying heartbeats, or priority takeover loops.
- OS disk I/O latency. Correlate with MongoDB-level signals. OS `await` explains whether WiredTiger checkpoint stalls are caused by storage degradation or external I/O contention.
- Page fault rate. Hard page faults mean the working set exceeds available memory. High sustained rates after cache warmup indicate undersized cache or unindexed scans.
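Two of the Level 2 signals are simple arithmetic over `serverStatus` output, sketched below. The sketch assumes the WiredTiger cache statistics are exposed under `wiredTiger.cache` with the keys `bytes currently in the cache`, `tracked dirty bytes in the cache`, and `maximum bytes configured`. The function names and sample numbers are hypothetical, and the 10% and 20% dirty thresholds are the ones suggested in the text.

```python
def cache_ratios(server_status):
    """Return (fill_ratio, dirty_ratio) relative to the configured cache size."""
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    fill = cache["bytes currently in the cache"] / max_bytes
    dirty = cache["tracked dirty bytes in the cache"] / max_bytes
    return fill, dirty


def classify_dirty(dirty_ratio, warn=0.10, critical=0.20):
    """Map a dirty ratio to a severity using the thresholds from the text."""
    if dirty_ratio >= critical:
        return "critical"
    if dirty_ratio >= warn:
        return "warning"
    return "ok"


def opcounter_rates(prev, curr, interval_seconds):
    """Per-second rates from two opcounters samples.

    Counters restart from zero when the process restarts, so a negative
    delta is reported as None instead of a misleading rate.
    """
    rates = {}
    for op, value in curr.items():
        delta = value - prev.get(op, 0)
        rates[op] = None if delta < 0 else delta / interval_seconds
    return rates


# Hypothetical samples with round numbers for readability.
status = {"wiredTiger": {"cache": {
    "maximum bytes configured": 1000,
    "bytes currently in the cache": 850,
    "tracked dirty bytes in the cache": 120,
}}}
fill, dirty = cache_ratios(status)
print(f"fill={fill:.0%} dirty={dirty:.0%} severity={classify_dirty(dirty)}")

prev = {"insert": 100, "query": 500}
curr = {"insert": 400, "query": 800}
print(opcounter_rates(prev, curr, interval_seconds=10))
```

Dividing by the configured maximum rather than by current cache usage keeps both ratios comparable to the eviction thresholds WiredTiger itself works with.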
Level 3 - mature
Level 3 instruments the internal machinery of WiredTiger, replication, and the query layer. These signals explain the root cause of most production incidents.
- WiredTiger ticket utilization. Available read and write tickets in `wiredTiger.concurrentTransactions` (MongoDB 7.x and earlier) or `queues.execution` (MongoDB 8.0+). Dropping below 25% of total means operations are queuing at the storage engine. Below 10 available is critical.
- WiredTiger eviction rates. Background eviction is healthy. Any sustained rate of pages evicted by application threads means user operations are pausing to free memory.
- Checkpoint duration. Measured via `transaction checkpoint most recent time (msecs)` in `wiredTiger.transaction`. Exceeding the 60-second interval means dirty data accumulates faster than the flush rate.
- Journal sync latency. Derived from `wiredTiger.log` totals. Sustained averages above 30 ms stall all durable writes and predict storage subsystem failure 30-60 seconds before application latency spikes. Above 100 ms is critical.
- Document and index efficiency ratios. `docsExamined` versus `nreturned` and `keysExamined` versus `nreturned` from the slow query log distinguish a missing index from a temporarily slow query.
- Connection churn (`totalCreated` delta). A stable `current` count with a rapidly incrementing `totalCreated` reveals connection pool thrashing, which burns CPU on thread creation and destroys latency.
- Active versus idle connection breakdown. High `current` with low `active` means memory is consumed by idle pooled connections that still hold file descriptors and stack space.
- Cursor counts, especially `noTimeout`. Open cursors with `noCursorTimeout` hold WiredTiger snapshots indefinitely. Even a handful can pin old cache versions and create mystery pressure. Alert if `open.noTimeout` exceeds 10.
- Replication oplog application rate. On secondaries, compare `metrics.repl.apply.ops` deltas to the primary’s write rate. Sustained application below ingestion guarantees the secondary will eventually fall off the oplog.
- Per-collection size growth. Track `storageSize` (on-disk, compressed) rather than `dataSize` (logical). Index builds and mass deletes do not shrink files without `compact`, which blocks the collection.
- Index usage statistics. `$indexStats` resets on restart. Zero-ops indexes after 24 hours of normal load are candidates for removal to reclaim cache and reduce write amplification.
- Lock wait times by type. Collection-level waits indicate a hotspot. Global or Metadata waits indicate DDL blocking production traffic.
- Network bytes in/out. High outbound replication bytes relative to inbound client bytes may indicate bulk ingestion on the primary or scatter-gather query overhead.
- `currentOp` longest-running age. Continuously tracking the maximum age of active operations catches runaway aggregations and collection scans before they cascade into ticket exhaustion.
- Assertion rates by type. `regular` and `msg` assertions indicate server-side bugs or data corruption. `user` assertion spikes point to application errors or auth failures.
- Flow control status. `isLagged` true with growing `timeAcquiringMicros` means the primary is throttling writes to prevent secondaries from falling off the oplog.
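The document and index efficiency ratios in the list above can be derived directly from slow query log lines. The sketch below assumes the structured JSON log format (MongoDB 4.4 and later), where a slow operation is logged with `msg` set to `Slow query` and an `attr` object carrying `docsExamined`, `keysExamined`, and `nreturned`. The function name `scan_efficiency` and the sample line are hypothetical.

```python
import json


def scan_efficiency(log_line):
    """Return examined-per-returned ratios for a slow query log line.

    Returns None for log lines that are not slow query entries.
    """
    entry = json.loads(log_line)
    if entry.get("msg") != "Slow query":
        return None
    attr = entry.get("attr", {})
    # Treat zero returned documents as one to avoid division by zero;
    # the ratio then equals the raw examined count.
    returned = max(attr.get("nreturned", 0), 1)
    return {
        "ns": attr.get("ns"),
        "plan": attr.get("planSummary"),
        "docs_per_returned": attr.get("docsExamined", 0) / returned,
        "keys_per_returned": attr.get("keysExamined", 0) / returned,
    }


# Hypothetical log line: a collection scan that reads 200000 documents
# to return 20.
line = json.dumps({
    "t": {"$date": "2024-01-01T12:00:00.000+00:00"},
    "s": "I", "c": "COMMAND", "id": 51803, "ctx": "conn42",
    "msg": "Slow query",
    "attr": {
        "type": "command", "ns": "shop.orders", "planSummary": "COLLSCAN",
        "keysExamined": 0, "docsExamined": 200000, "nreturned": 20,
        "durationMillis": 850,
    },
})
print(scan_efficiency(line))
```

A ratio near 1 on both fields suggests the query is well indexed and was slow for another reason, such as contention. A large `docs_per_returned` with zero keys examined, as in the sample, points at a missing index.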
Level 4 - expert
These signals are for operators who have already survived at least one cascading failure. They expose subtle regressions in query planning, allocator behavior, and sharding metadata that standard monitoring misses.
- WiredTiger history store activity. The history store replaced the cache overflow table in MongoDB 4.4. Monitor it to detect extreme MVCC snapshot retention and version explosion.
- WiredTiger hazard pointer count. Elevated hazard pointers indicate snapshots are retained longer than expected, preventing page eviction and accelerating cache pressure.
- Plan cache eviction events. Sudden plan cache invalidation causes query regressions where the optimizer selects collection scans over efficient indexes.
- Ticket utilization trended per-minute or finer. High-resolution trending reveals micro-bursts that five-minute averages smooth over. Essential for connecting latency spikes to brief admission-control saturation.
- Chunk migration latency and failure rate. In sharded clusters, failed migrations waste I/O and hold range locks. Sustained failures suggest config server pressure or storage saturation on destination shards.
- Jumbo chunk count and growth. Chunks flagged `jumbo` cannot migrate or split. They create permanent imbalance even when total chunk counts look even.
- Per-collection lock contention heat map. Correlate aggregate `locks` statistics with `currentOp` namespace filtering to identify which collections are driving wait times.
- Config server operation latency. Slow config servers delay chunk splits and balancer decisions. This is often the hidden bottleneck in sharded cluster metadata operations.
- Journal-to-oplog write amplification ratio. Compare journal bytes written to oplog bytes written. A rising ratio indicates inefficient write patterns or large multi-document transactions.
- `tcmalloc` memory fragmentation ratio. `generic.heap_size` divided by `generic.current_allocated_bytes` in `serverStatus.tcmalloc`. Persistent divergence suggests allocator fragmentation that inflates RSS beyond actual need.
- Secondary read latency versus primary. Divergence indicates the secondary is contending between oplog application and serving read traffic, or that its storage is degraded.
- WiredTiger reconciliation errors. Non-zero reconciliation errors are a data integrity signal that should never be ignored.
- Oplog entry size distribution. Large entries from bulk writes or multi-document transactions strain replication bandwidth and shrink the oplog window disproportionately to operation count.
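As an illustration of the allocator signal in the list above, the fragmentation ratio is a single division over `serverStatus` output. The sketch assumes the `tcmalloc.generic` subdocument is present with `heap_size` and `current_allocated_bytes`; it returns None when the section is missing rather than guessing. The function name and sample values are hypothetical.

```python
def tcmalloc_fragmentation(server_status):
    """Return heap_size / current_allocated_bytes, or None if unavailable."""
    generic = server_status.get("tcmalloc", {}).get("generic", {})
    heap = generic.get("heap_size")
    allocated = generic.get("current_allocated_bytes")
    if heap is None or not allocated:
        return None  # section missing or zero allocation reported
    return heap / allocated


# Hypothetical sample: 12 GB reserved by the allocator, 8 GB in use.
status = {"tcmalloc": {"generic": {
    "heap_size": 12_000_000_000,
    "current_allocated_bytes": 8_000_000_000,
}}}
print(tcmalloc_fragmentation(status))  # 1.5
```

A single reading says little, because the allocator holds freed memory for reuse by design. The ratio is more useful as a trend: a value that keeps drifting upward under steady load is the persistent divergence the bullet describes.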
How Netdata helps
Netdata collects MongoDB metrics from serverStatus, rs.status(), and the oplog. The collector surfaces WiredTiger cache fill, dirty ratio, ticket availability, and eviction rates together, making cache pressure patterns visible before latency spikes. Replication lag, oplog window, and member state are correlated per-node. Per-second resolution on opcounters, opLatencies, and queue depths distinguishes sustained workload shifts from transient retry storms. MongoDB signals are correlated with OS metrics such as disk await, memory RSS, page faults, and file descriptors on the same dashboard.







