MongoDB monitoring checklist: the signals every production cluster needs
Production MongoDB failures are usually preceded by signals that are visible but often unmonitored: a climbing dirty cache ratio, a shrinking oplog window, or ticket counts approaching zero. This guide organizes the essential signals into four monitoring levels. Use them to audit your instrumentation or to triage gaps during an incident.
Each level builds on the previous one. If you are missing a survival signal, instrument it before adding expert metrics. The thresholds below are drawn from the fields reported by MongoDB's serverStatus() and rs.status() commands and from operational patterns observed across WiredTiger deployments.
Level 1: survival
The bare minimum.
| Signal | Source / Command | What to watch for |
|---|---|---|
| Process liveness | db.adminCommand({ping:1}) or TCP to port 27017 | Failed ping: instance is down, OOM-killed, or network-partitioned. Success does not guarantee health; node may still be RECOVERING or stalled. |
| Replica set member state | rs.status().members[].stateStr and .health | Exactly one PRIMARY. Any data-bearing member in RECOVERING, ROLLBACK, DOWN, or UNKNOWN for >2 minutes is abnormal. |
| Replication lag | Compare optimeDate between PRIMARY and SECONDARY in rs.status() | A secondary whose lag exceeds the oplog window can no longer catch up and needs a full initial sync. Exclude intentionally delayed members. |
| Disk space utilization | df -h on dbPath and journal paths | MongoDB exits if it cannot write the journal. WiredTiger does not return freed disk space to the filesystem after deletes; plan headroom accordingly. |
| Connection count | db.serverStatus().connections.current and .available | Each connection uses ~1 MB thread stack. Approaching the file-descriptor or maxIncomingConnections limit causes silent rejections or OOM. |
| Slow query log | MongoDB log or db.setProfilingLevel(1, {slowms: 100}) | Queries >100 ms reveal missing indexes, COLLSCAN plans, and regressions. Correlate with metrics.queryExecutor.scanned for wasted work. Enabling the profiler adds write overhead; prefer logs in high-throughput deployments. |
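
The replication lag row above compares optimeDate between members. A minimal sketch of that comparison follows, written as a plain function so it can run in mongosh against rs.status() or in Node.js against a saved copy of the output. The function name and the excludeHosts parameter are hypothetical helpers for illustration, not part of MongoDB.

```javascript
// Compute per-secondary replication lag in seconds from an rs.status()-shaped
// document. `excludeHosts` is a hypothetical parameter for skipping members
// that are delayed on purpose.
function replicationLagSeconds(status, excludeHosts = []) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    // Without a primary there is nothing to compare against;
    // alert on member state instead.
    return null;
  }
  const lags = {};
  for (const m of status.members) {
    if (m.stateStr !== "SECONDARY" || excludeHosts.includes(m.name)) continue;
    const lagMs = new Date(primary.optimeDate) - new Date(m.optimeDate);
    // Clamp small negative values caused by sampling skew.
    lags[m.name] = Math.max(0, lagMs / 1000);
  }
  return lags;
}

// Hand-written sample standing in for a live rs.status() result.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", health: 1, optimeDate: new Date("2024-01-01T00:00:30Z") },
    { name: "db2:27017", stateStr: "SECONDARY", health: 1, optimeDate: new Date("2024-01-01T00:00:25Z") },
    { name: "db3:27017", stateStr: "SECONDARY", health: 1, optimeDate: new Date("2024-01-01T00:00:30Z") },
  ],
};
console.log(replicationLagSeconds(sampleStatus));
```

In mongosh the same function could be called as replicationLagSeconds(rs.status()). Returning null when no primary exists keeps "lag unknown" distinct from "lag is zero".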
Level 2: operational
Standard for production traffic.
| Signal | Source / Command | What to watch for |
|---|---|---|
| WiredTiger cache utilization | db.serverStatus().wiredTiger.cache: used/max and dirty/max | Used cache >80% with rising eviction is concerning. Dirty ratio >20% predicts checkpoint stalls before user-visible latency appears. |
| Operation throughput | db.serverStatus().opcounters deltas | Sustained drop >50% from baseline indicates blocking: ticket exhaustion, election, or disk stall. Spikes suggest retry storms. |
| Operation latency | db.serverStatus().opLatencies: reads, writes, commands | Watch p99 deviation from baseline, not absolute values. Server-side latency often spikes 30-60 seconds before application timeouts when storage degrades. |
| Queue depths | db.serverStatus().globalLock.currentQueue | Persistent or growing queues mean saturation. Queue <20 and stable may be acceptable. Unbounded growth predicts collapse. |
| Oplog window | rs.printReplicationInfo() or db.getReplicationInfo() | Safety margin for secondary catch-up. Minimum >24 hours during peak write. Resize with replSetResizeOplog if trending down. |
| Memory RSS | db.serverStatus().mem.resident vs system RAM | Should approximate cache size + ~1 MB per connection + ~1 GB overhead. RSS within 1 GB of total RAM risks OOM kill. |
| Election events | MongoDB logs: “Starting an election”, “Stepping down” | Each election causes 2-12 seconds of write unavailability. >1 per hour outside maintenance indicates instability. |
| OS disk latency | iostat -x: await, %util | Storage saturation precedes MongoDB symptoms. Journal sync latency and checkpoint duration derive directly from this layer. |
| Page fault rate | db.serverStatus().extra_info.page_faults delta | Major page faults mean working set exceeds memory. Expected during cold start; after warmup, a rising trend signals cache pressure. |
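
The cache utilization row above reduces to two ratios. The sketch below derives them from the wiredTiger.cache section, assuming the statistic names "bytes currently in the cache", "tracked dirty bytes in the cache", and "maximum bytes configured" as they appear in serverStatus() output. The function names are hypothetical, and the 80% and 20% thresholds are the ones from the table.

```javascript
// Derive used and dirty cache ratios from a serverStatus()-shaped document.
// Number() is applied because some shells return 64-bit counters as objects.
function cacheRatios(serverStatus) {
  const cache = serverStatus.wiredTiger.cache;
  const max = Number(cache["maximum bytes configured"]);
  return {
    usedRatio: Number(cache["bytes currently in the cache"]) / max,
    dirtyRatio: Number(cache["tracked dirty bytes in the cache"]) / max,
  };
}

// Apply the thresholds from the table: used >80%, dirty >20%.
function cacheWarnings(ratios) {
  const warnings = [];
  if (ratios.usedRatio > 0.8) warnings.push("cache used above 80%");
  if (ratios.dirtyRatio > 0.2) warnings.push("dirty cache above 20%");
  return warnings;
}

// Hand-written sample: 1 GiB cache, 900 MiB used, 100 MiB dirty.
const MiB = 1024 * 1024;
const sampleServerStatus = {
  wiredTiger: {
    cache: {
      "maximum bytes configured": 1024 * MiB,
      "bytes currently in the cache": 900 * MiB,
      "tracked dirty bytes in the cache": 100 * MiB,
    },
  },
};
console.log(cacheWarnings(cacheRatios(sampleServerStatus)));
```

A used-cache warning alone is only half of the condition in the table; pair it with the eviction trend before paging anyone.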
Level 3: mature
Full coverage for systems with an SLO.
| Signal | Source / Command | What to watch for |
|---|---|---|
| WiredTiger ticket utilization | db.serverStatus().wiredTiger.concurrentTransactions (7.x and earlier) or queues.execution (8.0+) | <25% of tickets available means storage engine saturation. <10 tickets available is critical; zero blocks all new operations. |
| Application-thread evictions | db.serverStatus().wiredTiger.cache: “pages evicted by application threads” | Sustained nonzero rate is abnormal. Background eviction cannot keep up and user threads are doing cleanup work, adding latency directly to operations. |
| Checkpoint duration | db.serverStatus().wiredTiger.transaction: “transaction checkpoint most recent time” | <10 s healthy. 10-30 s concerning. >60 s critical (exceeds default interval). Trend matters more than absolute value. |
| Journal sync latency | db.serverStatus().wiredTiger.log: “log sync time duration” / “log sync operations” | >30 ms sustained degrades write acknowledgment. >100 ms means writes are stalling. Often leads MongoDB symptoms by 30-60 seconds. |
| Document scan ratio | db.serverStatus().metrics.queryExecutor: scanned, scannedObjects | Scanned:returned ratio climbing toward 100:1 indicates inefficient scans. Correlate with slow query log for COLLSCAN. |
| Connection churn | db.serverStatus().connections.totalCreated delta | High totalCreated with stable current means connection pool thrashing. Thread creation spikes RSS and CPU. |
| Active vs idle connections | db.serverStatus().connections.active vs .current | Idle connections still consume ~1 MB RSS and file descriptors. Low active ratio suggests oversized driver pools. |
| Cursor counts | db.serverStatus().metrics.cursor: open.total, open.noTimeout | noTimeout cursors hold WiredTiger snapshots open, causing silent cache pressure. Target near zero for noTimeout. |
| Secondary apply rate | db.serverStatus().metrics.repl.apply: ops and batches | Must keep up with primary write rate over any 10-minute window. <80% consistently leads to unbounded lag. |
| Per-collection growth | db.collection.stats().storageSize | Tracks on-disk growth per collection. Use storageSize, not dataSize, for capacity planning. |
| Index usage stats | db.collection.aggregate([{$indexStats: {}}]) | Unused indexes waste cache and add write amplification. Wait >24 hours after restart before concluding an index is unused. |
| Lock wait times | db.serverStatus().locks: timeAcquiringMicros | Collection-level waits indicate write hotspots. Global waits indicate DDL blocking all traffic. |
| Network throughput | db.serverStatus().network: bytesIn, bytesOut, numRequests | High bytesOut per request indicates large payloads or missing projections. Replication traffic competes with client traffic. |
| Long-running operations | db.currentOp({active:true, secs_running:{$gt:10}}) max age | Maximum age of any active operation. A single runaway query holding a ticket for minutes can cascade to system-wide latency. |
| Assertion rates | db.serverStatus().asserts deltas | regular or msg assertions indicate server bugs or data corruption. user assertion spikes indicate application errors or auth issues. |
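
Two of the rows above need a small calculation before they can be alerted on: ticket availability is a ratio, and journal sync latency is a quotient of two cumulative counters, so it has to be computed from the difference between two samples. The sketch below assumes the 7.x-and-earlier concurrentTransactions layout (read and write objects with available and totalTickets) and assumes the duration statistic is reported in microseconds under the name "log sync time duration (usecs)". The function names are hypothetical.

```javascript
// Ticket availability per pool from serverStatus().wiredTiger.concurrentTransactions.
function ticketAvailability(concurrentTransactions) {
  const result = {};
  for (const kind of ["read", "write"]) {
    const pool = concurrentTransactions[kind];
    const available = Number(pool.available);
    result[kind] = {
      available,
      availableRatio: available / Number(pool.totalTickets),
    };
  }
  return result;
}

// Average journal sync latency in milliseconds between two samples of
// serverStatus().wiredTiger.log. Returns null if no syncs happened.
function avgJournalSyncMs(prevLog, currLog) {
  const ops = Number(currLog["log sync operations"]) - Number(prevLog["log sync operations"]);
  if (ops <= 0) return null;
  const usecs =
    Number(currLog["log sync time duration (usecs)"]) -
    Number(prevLog["log sync time duration (usecs)"]);
  return usecs / ops / 1000;
}

// Hand-written samples standing in for two consecutive polls.
const sampleTickets = {
  read: { out: 28, available: 100, totalTickets: 128 },
  write: { out: 120, available: 8, totalTickets: 128 },
};
const logBefore = { "log sync operations": 100, "log sync time duration (usecs)": 1000000 };
const logAfter = { "log sync operations": 150, "log sync time duration (usecs)": 3000000 };

console.log(ticketAvailability(sampleTickets));
console.log(avgJournalSyncMs(logBefore, logAfter));
```

In the sample, the write pool has 8 of 128 tickets available, which falls under both thresholds in the table, and the 50 syncs between polls averaged 40 ms, above the 30 ms sustained threshold.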
Level 4: expert
Deep signals operators add after their third major incident.
| Signal | Source / Command | What to watch for |
|---|---|---|
| History store activity | db.serverStatus().wiredTiger.cache legacy overflow stats | Cache overflow table was replaced by the history store in 4.4+. In 6.x+ the legacy counter is zero and no longer meaningful; monitor hazard pointers instead. |
| Hazard pointer count | db.serverStatus().wiredTiger.transaction or cache stats | High hazard pointers indicate snapshots are retained too long, preventing eviction. |
| Plan cache evictions | MongoDB logs and serverStatus().metrics.query | Sudden plan cache evictions can cause a query to switch from an index scan to a collection scan without warning. |
| Ticket trends per minute | wiredTiger.concurrentTransactions or queues.execution sampled per minute | In 7.0+ dynamic ticket adjustment changes total counts. Per-minute trends reveal whether the adaptive algorithm is keeping up or falling behind. |
| Chunk migration rate | db.getSiblingDB("config").changelog: moveChunk entries | Failed migrations or migration storms cause I/O pressure and range locks. Monitor latency and failure counts. |
| Jumbo chunk count | db.getSiblingDB("config").chunks.find({jumbo:true}) | Jumbo chunks cannot be split or migrated, creating permanent imbalance that the balancer cannot resolve. |
| Per-collection lock heat map | Correlate db.serverStatus().locks with db.currentOp() by ns | Aggregate lock stats hide which collection is the hotspot. Cross-reference namespace-specific currentOps with lock type waits. |
| Config server latency | db.adminCommand({ping:1}) against config server replica set | Config server stalls block chunk splits and balancer decisions. Elevated latency affects all cluster metadata operations. |
| Write amplification | wiredTiger.block-manager bytes written vs oplog bytes written | High ratios indicate index updates or document padding generating more I/O than logical write volume suggests. |
| TCMalloc fragmentation | db.serverStatus().tcmalloc: heap_size vs current_allocated_bytes | In 8.0, per-CPU caches may retain memory in RSS without returning it to the OS. |
| Secondary read latency | db.serverStatus().opLatencies.reads on secondaries vs primary | Elevated secondary reads indicate replication apply threads competing with read traffic for disk and tickets. |
| Reconciliation errors | db.serverStatus().wiredTiger.transaction | Non-zero reconciliation errors indicate data integrity issues during page flush. Investigate immediately. |
| Oplog entry size distribution | local.oplog.rs | Large entries from multi-document transactions or bulk writes shrink the oplog window disproportionately and slow replication. |
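
The TCMalloc row above compares heap_size with current_allocated_bytes. A minimal sketch of that comparison follows. It assumes both fields sit under a generic subsection of serverStatus().tcmalloc; check the layout on your version before relying on it. The function name is hypothetical.

```javascript
// Estimate memory retained by the allocator but not in use by MongoDB.
// Assumes a { generic: { heap_size, current_allocated_bytes } } layout.
function tcmallocRetained(tcmalloc) {
  const heap = Number(tcmalloc.generic.heap_size);
  const allocated = Number(tcmalloc.generic.current_allocated_bytes);
  const retainedBytes = heap - allocated;
  return {
    retainedBytes,
    retainedRatio: heap > 0 ? retainedBytes / heap : 0,
  };
}

// Hand-written sample: 8 GiB heap, 6 GiB allocated.
const GiB = 1024 * 1024 * 1024;
const sampleTcmalloc = {
  generic: { heap_size: 8 * GiB, current_allocated_bytes: 6 * GiB },
};
console.log(tcmallocRetained(sampleTcmalloc));
```

Tracking the ratio over time is more useful than a single reading, since a steady gap between heap and allocated bytes can be normal while a widening one explains RSS growth that cache size alone does not.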
How Netdata helps
- Netdata polls serverStatus() and rs.status() every second and computes deltas automatically, surfacing operation rates, replication lag, and latency windows without manual scripts.
- WiredTiger cache fill ratio, dirty ratio, and available tickets are displayed together to spot cache-pressure cascades before application-thread evictions begin.
- Per-node and per-replica-set views compare secondary apply rates, oplog windows, and member states side by side to identify a falling secondary or hot shard.
- Built-in alerts cover Level 1 and Level 2 signals such as disk space, replication lag, and cache utilization; clone and extend them to Level 3 thresholds like ticket availability and checkpoint duration.
- Connection charts include both current and totalCreated rates to distinguish a stable pool from a reconnection storm driving RSS spikes.







