MongoDB monitoring checklist: the signals every production cluster needs

Production MongoDB failures are usually preceded by signals that are visible but often unmonitored: a climbing dirty cache ratio, a shrinking oplog window, or ticket counts approaching zero. This guide organizes the essential signals into four monitoring levels. Use them to audit your instrumentation or to triage gaps during an incident.

Each level builds on the previous one. If you are missing a survival signal, instrument it before adding expert metrics. The thresholds below are based on the fields that db.serverStatus() and rs.status() expose and on operational patterns observed across WiredTiger deployments; treat them as starting points and tune them against your own baseline.

Level 1: survival

The bare minimum.

Signal | Source / Command | What to watch for
Process liveness | db.adminCommand({ping:1}) or TCP to port 27017 | Failed ping: instance is down, OOM-killed, or network-partitioned. Success does not guarantee health; node may still be RECOVERING or stalled.
Replica set member state | rs.status().members[].stateStr and .health | Exactly one PRIMARY. Any data-bearing member in RECOVERING, ROLLBACK, DOWN, or UNKNOWN for >2 minutes is abnormal.
Replication lag | Compare optimeDate between PRIMARY and SECONDARY in rs.status() | Lag > oplog window forces initial sync. Exclude intentionally delayed members.
Disk space utilization | df -h on dbPath and journal paths | MongoDB exits if it cannot write the journal. WiredTiger does not return freed disk space to the filesystem after deletes; plan headroom accordingly.
Connection count | db.serverStatus().connections.current and .available | Each connection uses ~1 MB thread stack. Approaching the file-descriptor or maxIncomingConnections limit causes silent rejections or OOM.
Slow query log | MongoDB log or db.setProfilingLevel(1, {slowms: 100}) | Queries >100 ms reveal missing indexes, COLLSCAN plans, and regressions. Correlate with metrics.queryExecutor.scanned for wasted work. Enabling the profiler adds write overhead; prefer logs in high-throughput deployments.
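
The member-state and replication-lag checks above can be sketched as one pure function over an rs.status()-shaped document. This is a minimal illustration, not a definitive implementation: the function name summarizeReplicaSet and the excludeNames parameter (for intentionally delayed members) are hypothetical, and the function flags abnormal states immediately, so the ">2 minutes" duration check is left to whatever calls it repeatedly.

```javascript
// Summarize an rs.status()-shaped document. Pure function, no server needed.
// excludeNames is a hypothetical parameter for intentionally delayed members.
function summarizeReplicaSet(status, excludeNames = []) {
  const members = (status.members || []).filter(
    (m) => !excludeNames.includes(m.name)
  );
  const primaries = members.filter((m) => m.stateStr === "PRIMARY");

  // Anything that is unreachable (health !== 1) or not in a steady state.
  const steadyStates = ["PRIMARY", "SECONDARY", "ARBITER"];
  const abnormal = members
    .filter((m) => m.health !== 1 || !steadyStates.includes(m.stateStr))
    .map((m) => ({ name: m.name, state: m.stateStr }));

  // Lag is only meaningful when there is exactly one PRIMARY to compare with.
  const lagSeconds = {};
  if (primaries.length === 1) {
    const primaryMs = new Date(primaries[0].optimeDate).getTime();
    for (const m of members) {
      if (m.stateStr !== "SECONDARY") continue;
      const memberMs = new Date(m.optimeDate).getTime();
      lagSeconds[m.name] = Math.max(0, (primaryMs - memberMs) / 1000);
    }
  }
  return { primaryCount: primaries.length, abnormal, lagSeconds };
}
```

In mongosh this could be called as summarizeReplicaSet(rs.status()). A primaryCount other than 1, a non-empty abnormal list, or a lag value approaching the oplog window are the conditions the table describes.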

Level 2: operational

Standard for production traffic.

Signal | Source / Command | What to watch for
WiredTiger cache utilization | db.serverStatus().wiredTiger.cache: used/max and dirty/max | Used cache >80% with rising eviction is concerning. Dirty ratio >20% predicts checkpoint stalls before user-visible latency appears.
Operation throughput | db.serverStatus().opcounters deltas | Sustained drop >50% from baseline indicates blocking: ticket exhaustion, election, or disk stall. Spikes suggest retry storms.
Operation latency | db.serverStatus().opLatencies: reads, writes, commands | opLatencies reports cumulative latency (in microseconds) and op counts, so compute averages from deltas and take p99 from latency histograms. Watch deviation from baseline, not absolute values. Server-side latency often spikes 30-60 seconds before application timeouts when storage degrades.
Queue depths | db.serverStatus().globalLock.currentQueue | Persistent or growing queues mean saturation. Queue <20 and stable may be acceptable. Unbounded growth predicts collapse.
Oplog window | rs.printReplicationInfo() or db.getReplicationInfo() | Safety margin for secondary catch-up. Minimum >24 hours during peak write. Resize with replSetResizeOplog if trending down.
Memory RSS | db.serverStatus().mem.resident (reported in MiB) vs system RAM | Should approximate cache size + ~1 MB per connection + ~1 GB overhead. RSS within 1 GB of total RAM risks OOM kill.
Election events | MongoDB logs: “Starting an election”, “Stepping down” | Each election causes 2-12 seconds of write unavailability. >1 per hour outside maintenance indicates instability.
OS disk latency | iostat -x: await, %util | Storage saturation precedes MongoDB symptoms. Journal sync latency and checkpoint duration derive directly from this layer.
Page fault rate | db.serverStatus().extra_info.page_faults delta | Major page faults mean working set exceeds memory. Expected during cold start; after warmup, a rising trend signals cache pressure.
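
Several Level 2 signals are ratios or deltas rather than raw values. The sketch below shows one way to derive the cache fill and dirty ratios and to turn cumulative counters such as opcounters into per-second rates. The function names cacheRatios and counterRates are hypothetical; the cache keys are the WiredTiger statistic names as they commonly appear in serverStatus output, so verify them against your server version.

```javascript
// Fill and dirty ratios from the serverStatus().wiredTiger.cache sub-document.
// Number() is used because mongosh may return 64-bit values as Long objects.
function cacheRatios(cache) {
  const max = Number(cache["maximum bytes configured"]);
  return {
    used: Number(cache["bytes currently in the cache"]) / max,
    dirty: Number(cache["tracked dirty bytes in the cache"]) / max,
  };
}

// Per-second rates from two samples of a cumulative counter document such as
// serverStatus().opcounters. A negative or missing delta (for example after a
// restart resets the counters) is reported as null instead of a bogus rate.
function counterRates(prev, curr, elapsedSeconds) {
  const rates = {};
  for (const key of Object.keys(curr)) {
    const delta = Number(curr[key]) - Number(prev[key]);
    rates[key] = delta >= 0 && elapsedSeconds > 0 ? delta / elapsedSeconds : null;
  }
  return rates;
}
```

A collector would keep the previous sample and its timestamp, call counterRates(prev.opcounters, curr.opcounters, seconds), and compare the result with a stored baseline to detect the >50% drop described above. The same helper applies to extra_info.page_faults and other monotonically increasing counters.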

Level 3: mature

Full coverage for systems with an SLO.

Signal | Source / Command | What to watch for
WiredTiger ticket utilization | db.serverStatus().wiredTiger.concurrentTransactions (7.x and earlier) or queues.execution (8.0+) | <25% available means storage engine saturation. <10 available is critical; zero blocks all new operations.
Application-thread evictions | db.serverStatus().wiredTiger.cache: “pages evicted by application threads” | Sustained nonzero rate is abnormal. Background eviction cannot keep up and user threads are doing cleanup work, adding latency directly to operations.
Checkpoint duration | db.serverStatus().wiredTiger.transaction: “transaction checkpoint most recent time” (reported in msecs) | <10 s healthy. 10-30 s concerning. >60 s critical (exceeds default interval). Trend matters more than absolute value.
Journal sync latency | db.serverStatus().wiredTiger.log: “log sync time duration” (reported in usecs) / “log sync operations” | >30 ms sustained degrades write acknowledgment. >100 ms means writes are stalling. Often leads MongoDB symptoms by 30-60 seconds.
Document scan ratio | db.serverStatus().metrics.queryExecutor: scanned, scannedObjects, compared with metrics.document.returned | Scanned:returned ratio climbing toward 100:1 indicates inefficient scans. Correlate with slow query log for COLLSCAN.
Connection churn | db.serverStatus().connections.totalCreated delta | High totalCreated with stable current means connection pool thrashing. Thread creation spikes RSS and CPU.
Active vs idle connections | db.serverStatus().connections.active vs .current | Idle connections still consume ~1 MB RSS and file descriptors. Low active ratio suggests oversized driver pools.
Cursor counts | db.serverStatus().metrics.cursor: open.total, open.noTimeout | noTimeout cursors hold WiredTiger snapshots open, causing silent cache pressure. Target near zero for noTimeout.
Secondary apply rate | db.serverStatus().metrics.repl.apply: ops and batches | Must keep up with primary write rate over any 10-minute window. <80% consistently leads to unbounded lag.
Per-collection growth | db.collection.stats().storageSize | Tracks on-disk growth per collection. Use storageSize, not dataSize, for capacity planning.
Index usage stats | db.collection.aggregate([{$indexStats: {}}]) | Unused indexes waste cache and add write amplification. Wait >24 hours after restart before concluding an index is unused.
Lock wait times | db.serverStatus().locks: timeAcquiringMicros | Collection-level waits indicate write hotspots. Global waits indicate DDL blocking all traffic.
Network throughput | db.serverStatus().network: bytesIn, bytesOut, numRequests | High bytesOut per request indicates large payloads or missing projections. Replication traffic competes with client traffic.
Long-running operations | db.currentOp({active:true, secs_running:{$gt:10}}) max age | Maximum age of any active operation. A single runaway query holding a ticket for minutes can cascade to system-wide latency.
Assertion rates | db.serverStatus().asserts deltas | regular or msg assertions indicate server bugs or data corruption. user assertion spikes indicate application errors or auth issues.
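
The ticket thresholds and the "time divided by operations" signals in this level reduce to a little arithmetic. The following is a sketch under assumptions: ticketAvailability and meanPerOp are hypothetical helper names, the ticket pool is assumed to have the { out, available, totalTickets } shape reported for each of the read and write pools, and the warning and critical labels simply encode the <25% and <10 thresholds from the table.

```javascript
// Classify one ticket pool (read or write) against the table's thresholds.
// Assumed shape: { out, available, totalTickets }.
function ticketAvailability(pool) {
  const available = Number(pool.available);
  const fraction = available / Number(pool.totalTickets);
  let level = "ok";
  if (available < 10) level = "critical";
  else if (fraction < 0.25) level = "warning";
  return { available, fraction, level };
}

// Mean cost per operation between two samples of a pair of cumulative
// counters, e.g. journal sync time over journal sync operations, or
// documents scanned over documents returned. Returns null when no
// operations happened or a counter went backwards (restart).
function meanPerOp(prevTotal, currTotal, prevOps, currOps) {
  const total = Number(currTotal) - Number(prevTotal);
  const ops = Number(currOps) - Number(prevOps);
  return ops > 0 && total >= 0 ? total / ops : null;
}
```

For journal sync latency, the totals would be the sync-time counter (in microseconds) and the ops would be the sync-operations counter, with the result divided by 1000 to compare against the 30 ms and 100 ms thresholds. For the scan ratio, pass metrics.queryExecutor.scannedObjects as the total and metrics.document.returned as the ops.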

Level 4: expert

Deep signals operators add after their third major incident.

Signal | Source / Command | What to watch for
History store activity | db.serverStatus().wiredTiger.cache legacy overflow stats | The cache overflow table was replaced by the history store in 4.4. In 6.x+ the legacy counter is zero and no longer meaningful; monitor hazard pointers instead.
Hazard pointer count | db.serverStatus().wiredTiger.transaction or cache stats | High hazard pointers indicate snapshots are retained too long, preventing eviction.
Plan cache evictions | MongoDB logs and serverStatus().metrics.query | Sudden plan cache evictions can cause a query to switch from an index scan to a collection scan without warning.
Ticket trends per minute | wiredTiger.concurrentTransactions or queues.execution sampled per minute | In 7.0+ dynamic ticket adjustment changes total counts. Per-minute trends reveal whether the adaptive algorithm is keeping up or falling behind.
Chunk migration rate | db.getSiblingDB("config").changelog: moveChunk entries | Failed migrations or migration storms cause I/O pressure and range locks. Monitor latency and failure counts.
Jumbo chunk count | db.getSiblingDB("config").chunks.find({jumbo:true}) | Jumbo chunks are not split automatically and are skipped by the balancer, creating an imbalance that persists until they are split or cleared manually.
Per-collection lock heat map | Correlate db.serverStatus().locks with db.currentOp() by ns | Aggregate lock stats hide which collection is the hotspot. Cross-reference namespace-specific currentOps with lock type waits.
Config server latency | db.adminCommand({ping:1}) against config server replica set | Config server stalls block chunk splits and balancer decisions. Elevated latency affects all cluster metadata operations.
Write amplification | wiredTiger.block-manager bytes written vs oplog bytes written | High ratios indicate index updates or document padding generating more I/O than logical write volume suggests.
TCMalloc fragmentation | db.serverStatus().tcmalloc.generic: heap_size vs current_allocated_bytes | In 8.0, per-CPU caches may retain memory in RSS without returning it to the OS.
Secondary read latency | db.serverStatus().opLatencies.reads on secondaries vs primary | Elevated secondary reads indicate replication apply threads competing with read traffic for disk and tickets.
Reconciliation errors | db.serverStatus().wiredTiger.transaction | Non-zero reconciliation errors indicate data integrity issues during page flush. Investigate immediately.
Oplog entry size distribution | local.oplog.rs | Large entries from multi-document transactions or bulk writes shrink the oplog window disproportionately and slow replication.
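
Two of the expert signals lend themselves to a short sketch: how much heap the allocator is holding beyond what is allocated, and how migration events break down by type. Both helpers below are hypothetical names. tcmallocRetained assumes the heap_size and current_allocated_bytes fields found under serverStatus().tcmalloc.generic, and countMigrationEvents assumes only that config.changelog documents carry a what field whose value starts with "moveChunk" for migration entries.

```javascript
// Bytes the allocator holds beyond what the process has allocated, from the
// serverStatus().tcmalloc.generic sub-document.
function tcmallocRetained(generic) {
  const heap = Number(generic.heap_size);
  const allocated = Number(generic.current_allocated_bytes);
  const retainedBytes = heap - allocated;
  return {
    retainedBytes,
    retainedFraction: heap > 0 ? retainedBytes / heap : null,
  };
}

// Count changelog entries per event type, keeping only moveChunk events.
// entries is an array of documents read from config.changelog.
function countMigrationEvents(entries) {
  const counts = {};
  for (const entry of entries) {
    const what = String(entry.what || "");
    if (!what.startsWith("moveChunk")) continue;
    counts[what] = (counts[what] || 0) + 1;
  }
  return counts;
}
```

In mongosh the second helper could be fed with db.getSiblingDB("config").changelog.find({ time: { $gt: since } }).toArray(), where since is a Date chosen by the operator. Tracking the counts per interval gives the migration rate, and any event type that signals an error gives the failure count.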

How Netdata helps

  • Netdata polls serverStatus() and rs.status() every second and computes deltas automatically, surfacing operation rates, replication lag, and latency windows without manual scripts.
  • WiredTiger cache fill ratio, dirty ratio, and available tickets are displayed together to spot cache-pressure cascades before application-thread evictions begin.
  • Per-node and per-replica-set views compare secondary apply rates, oplog windows, and member states side by side to identify a falling secondary or hot shard.
  • Built-in alerts cover Level 1 and Level 2 signals such as disk space, replication lag, and cache utilization; clone and extend them to Level 3 thresholds like ticket availability and checkpoint duration.
  • Connection charts include both current and totalCreated rates to distinguish a stable pool from a reconnection storm driving RSS spikes.