CockroachDB monitoring maturity model: from survival to expert

CockroachDB failures rarely announce themselves through a single metric. A slow disk turns into L0 compaction debt, which stalls writes, which drops Raft proposals, which makes ranges unavailable. A drifting clock raises transaction retry rates long before any node self-terminates. To run this database safely, you need layered observability that matches the system’s own layers: storage engine, Raft replication, distributed SQL, and transaction execution.

This guide organizes monitoring into four maturity levels. Level 1 keeps you alive during obvious outages. Level 2 catches workload and latency regressions. Level 3 adds the storage-engine and garbage-collection signals that precede most performance cliffs. Level 4 exposes the per-range, per-queue, and proposal-level details that explain why a seemingly healthy cluster suddenly degrades. Use the model to audit your current coverage and decide what to instrument next.

flowchart TD
    A[Level 1 Survival] --> B[Level 2 Operational]
    B --> C[Level 3 Mature]
    C --> D[Level 4 Expert]
    A -- liveness, ranges, disk, SQL probe, certs --> B
    B -- latency, retries, L0, clock, RPC --> C
    C -- read amp, stalls, cache, GC, intents, MVCC --> D
    D -- raft drops, hot ranges, closed ts, queue errors --> E[Predictive incident response]

Level 1: survival

Survival monitoring answers one question: is the cluster up? These signals are binary, cheap to collect, and unambiguous. If any of them fire outside a maintenance window, you have a production incident.

| Signal | Why it matters |
| --- | --- |
| Node liveness status | Tells you whether the cluster considers each node alive. A non-live node loses its leases and forces emergency lease transfers. |
| Range unavailability count | Any nonzero value means some keyspace cannot be read or written. System ranges being unavailable amplifies the impact. |
| Disk space per store | Running out of disk stalls compaction and can trigger a write-availability death spiral. |
| SQL synthetic probe | SELECT 1 from a monitoring client tests the full path: TCP, TLS, pgwire, SQL execution. |
| Certificate expiration timeline | CockroachDB uses mutual TLS everywhere. Expired certs cause immediate inter-node or client lockout. |
| Process alive | A basic binary check that the cockroach process is running, used as a coarse guard before finer metrics. |

With these six signals you will know when the cluster is broken, but not why it is slow or why it is about to break.
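As a rough illustration of how three of these checks could be evaluated from a single scrape, the sketch below parses Prometheus-formatted text and flags unavailable ranges, low free disk, and a node certificate close to expiry. The metric names (ranges_unavailable, capacity, capacity_available, security_certificate_expiration_node), the assumption that the certificate metric is a Unix timestamp in seconds, and both thresholds are assumptions to verify against your own cluster's /_status/vars output. The parser also assumes samples carry no trailing timestamp.

```python
def parse_prometheus_text(text):
    """Parse Prometheus exposition text into {metric_name: [(labels, value), ...]}.

    Assumes each sample line is 'name{labels} value' with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split the value off the right so spaces inside label values are safe.
        name_and_labels, _, value = line.rpartition(" ")
        name, _, labels = name_and_labels.partition("{")
        samples.setdefault(name, []).append((labels.rstrip("}"), float(value)))
    return samples


def survival_findings(samples, now_epoch, min_free_fraction=0.15, min_cert_days=30):
    """Return human-readable Level 1 findings; thresholds are illustrative."""
    findings = []

    # Any unavailable range is an incident. Metric name is an assumption.
    for labels, value in samples.get("ranges_unavailable", []):
        if value > 0:
            findings.append(f"{int(value)} unavailable ranges ({labels})")

    # Free-space fraction per store, matching the two metrics by label string.
    capacity = dict(samples.get("capacity", []))
    for labels, available in samples.get("capacity_available", []):
        total = capacity.get(labels, 0.0)
        if total > 0 and available / total < min_free_fraction:
            findings.append(f"low disk: {available / total:.0%} free ({labels})")

    # Assumed to be the certificate's expiry as Unix seconds; 0 means not set.
    for _, not_after in samples.get("security_certificate_expiration_node", []):
        if not_after > 0:
            days_left = (not_after - now_epoch) / 86400
            if days_left < min_cert_days:
                findings.append(f"node certificate expires in {days_left:.0f} days")

    return findings
```

In practice you would feed this the body of an HTTP GET against the metrics endpoint and pass time.time() as now_epoch. The liveness, SQL probe, and process checks need their own collectors and are left out here.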

Level 2: operational

Operational monitoring adds workload-facing and infrastructure signals. This is the layer that catches latency regressions, contention, and the first signs of storage-engine pressure. If your team runs CockroachDB in production, these should be on dashboards and alertable.

| Signal | Why it matters |
| --- | --- |
| SQL statement latency (P50, P99) | Client-visible performance. P99 is where SLO violations live. |
| KV read/write latency | Isolates storage and replication latency from SQL planning overhead. |
| Transaction commit latency | Captures multi-statement overhead, retries, and commit protocol cost. |
| Transaction restart rate by cause | writetooold points to contention, readwithinuncertainty to clock skew, txnpush to application-level conflicts. |
| SQL error rate by code | 40001 is contention, 53200 is resource exhaustion, XX000 is an internal fault. |
| CPU utilization per node | Raft ticking is per-range; sustained CPU above 70% leaves no headroom for bursts. |
| Memory RSS per node | Tracks process memory against host or cgroup limits. CGo allocations (Pebble cache) are not managed by the Go GC. |
| LSM L0 sublevel count | The single best leading indicator of storage-driven degradation. Between 10 and 20 sublevels, latency rises; past 20, write stalls are imminent. |
| WAL fsync latency | Directly measures the write-path fsync health that every Raft commit depends on. |
| Disk I/O utilization and latency | SSDs and cloud volumes throttle silently. Latency matters more than percent-utilization on multi-queue devices. |
| Under-replicated range count | Shows whether the cluster can heal replicas fast enough after a node loss. |
| Clock offset between node pairs | A node whose clock offset from a majority of its peers exceeds 80% of --max-offset terminates itself, so alert well before that point. |
| Inter-node RPC heartbeat latency | Captures network health as CockroachDB experiences it, including scheduling delay. |
| Active client connections per node | Connection storms consume goroutines and memory and can overwhelm surviving nodes after failover. |
| Admission control queue depth | Sustained queuing means the system is at capacity and is artificially adding latency to protect itself. |
| Job status | Stuck backups, schema changes, or imports block operations and consume I/O. |

Level 2 is where most production teams stop. It is enough to diagnose many incidents, but it misses the slow-burn failures that fill disks with garbage or let compaction debt accumulate for weeks.
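The thresholds in the Level 2 table translate fairly directly into alert bands. The sketch below is one possible encoding, not a recommended policy: the L0 bands and the 80% clock-offset limit come from the table above, while the 50% warning fraction, the 500 ms default for --max-offset, and all function names are assumptions for illustration. The restart helper works on two successive samples of per-cause restart counters, however you collect them.

```python
# Hints taken from the restart-cause row in the table above.
RESTART_HINTS = {
    "writetooold": "contention",
    "readwithinuncertainty": "clock skew",
    "txnpush": "application-level conflicts",
}


def l0_sublevel_band(sublevels):
    """Map an L0 sublevel count to a band: 10-20 warns, above 20 is critical."""
    if sublevels > 20:
        return "critical"
    if sublevels >= 10:
        return "warning"
    return "ok"


def clock_offset_band(offset_nanos, max_offset_nanos=500_000_000,
                      warn_fraction=0.5, fatal_fraction=0.8):
    """Compare an observed offset to --max-offset (assumed to be 500 ms here).

    fatal_fraction mirrors the 80% self-termination limit; warn_fraction is
    a hypothetical early-warning level.
    """
    fraction = abs(offset_nanos) / max_offset_nanos
    if fraction >= fatal_fraction:
        return "critical"
    if fraction >= warn_fraction:
        return "warning"
    return "ok"


def dominant_restart_cause(prev_counts, curr_counts):
    """Return (cause, hint) for the restart counter that grew the most.

    Both arguments map cause name to a cumulative counter value.
    Returns None when no counter increased between the two samples.
    """
    deltas = {c: curr_counts[c] - prev_counts.get(c, 0) for c in curr_counts}
    if not deltas:
        return None
    cause = max(deltas, key=deltas.get)
    if deltas[cause] <= 0:
        return None
    return cause, RESTART_HINTS.get(cause, "unknown")
```

For example, a window in which writetooold grows by 400 while the other causes barely move points at contention rather than clock skew, which changes who gets paged.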

Level 3: mature

Mature monitoring tracks the storage engine, runtime, and replication internals. These signals are leading indicators: they warn you before Level 2 latency spikes or Level 1 unavailability.

| Signal | Why it matters |
| --- | --- |
| LSM read amplification | Measures how many SSTables a read consults. Above 25, read-heavy workloads degrade. |
| Pebble write stall count | The storage engine has paused writes. Any stall in normal OLTP is abnormal. |
| Compaction throughput and backlog | If compaction cannot keep up with ingestion, L0 grows and stalls follow. |
| Pebble block cache hit ratio | A drop means the working set has outgrown cache or the cache is cold after restart. |
| SQL memory budget utilization | Approaching --max-sql-memory causes spills to disk or 53200 rejections. |
| Go GC pause duration and frequency | Pauses approaching the liveness heartbeat interval risk node liveness loss. |
| Goroutine count | Monotonic growth without workload growth indicates a leak. |
| Range count per node (replicas and leases) | Raft overhead scales with range count. Imbalance points to allocator or constraint problems. |
| Raft snapshot rate | High rates mean followers cannot keep up and are being rebuilt from full snapshots. |
| Lease transfer rate | Elevated transfers without planned maintenance mean nodes are flapping or thrashing. |
| Intent count and bytes per store | Growing unresolved intents block other transactions and add latency system-wide. |
| MVCC garbage bytes per store | Garbage accumulates silently when GC cannot keep up or protected timestamps block it. |
| KV read/write latency | Separates storage-layer slowdown from SQL-layer behavior. |
| Raft log commit latency | Isolates the fsync-bound write path that SQL and KV latency depend on. |
| Changefeed lag | Growing lag means CDC cannot keep up; stalled changefeeds create protected timestamps that block GC. |
| Protected timestamp record count and age | Records from CDC or backups prevent MVCC GC. Old records are a silent disk-filling risk. |
| File descriptor usage | FD exhaustion prevents new connections and SSTable opens. |
| Composite pattern alerts | Multi-signal correlations catch failure modes like LSM death spirals and protected-timestamp GC stalls. |
| Network throughput between nodes | Saturated NICs delay Raft and DistSQL shuffle, especially during snapshot storms. |

This level turns operators from firefighters into investigators. You stop asking “why is P99 high?” and start asking “why is read amplification rising while garbage bytes grow?”
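Two Level 3 ideas are easier to see in code: ratios should be computed from counter deltas rather than lifetime totals, and composite alerts combine several signals over the same window. The following is a minimal sketch under stated assumptions. The window keys (l0_sublevels, wal_fsync_p99_ms, mvcc_garbage_bytes, oldest_protected_ts_age_s) are hypothetical names for series you would assemble yourself, and every threshold except the L0 limit of 20 is a placeholder.

```python
def cache_hit_ratio(prev_hits, prev_misses, curr_hits, curr_misses):
    """Hit ratio over one interval, from two samples of cumulative counters.

    Returns None when there was no cache traffic (or a counter reset).
    """
    hits = curr_hits - prev_hits
    misses = curr_misses - prev_misses
    total = hits + misses
    if hits < 0 or misses < 0 or total == 0:
        return None
    return hits / total


def is_rising(series, min_increase):
    """True if the series grew by at least min_increase from first to last."""
    return len(series) >= 2 and series[-1] - series[0] >= min_increase


def composite_findings(window):
    """Evaluate multi-signal patterns over a dict of {name: [oldest..newest]}."""
    findings = []

    # Pattern 1: L0 is already deep and the write path keeps getting slower.
    l0 = window.get("l0_sublevels", [])
    fsync = window.get("wal_fsync_p99_ms", [])
    if l0 and l0[-1] > 20 and is_rising(fsync, 5.0):
        findings.append("possible LSM death spiral")

    # Pattern 2: garbage keeps growing while an old protected timestamp exists.
    garbage = window.get("mvcc_garbage_bytes", [])
    pts_age = window.get("oldest_protected_ts_age_s", [])
    if is_rising(garbage, 1 << 30) and pts_age and pts_age[-1] > 86400:
        findings.append("possible protected-timestamp GC stall")

    return findings
```

Neither pattern would fire on a single metric alone, which is the point: a deep L0 during a bulk import or an old protected timestamp during a long backup can each be benign in isolation.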

Level 4: expert

Expert monitoring targets the low-level signals that explain edge-case behavior. These metrics are noisy in isolation and expensive to collect, but they are decisive during deep incidents and capacity planning.

| Signal | Why it matters |
| --- | --- |
| Raft proposal drop rate | Dropped proposals mean writes are being silently retried; immediate tail-latency impact. |
| Per-range request rate distribution | Identifies hot ranges before single-node CPU saturation becomes obvious. |
| Intent resolution throughput | Tells you whether the cleanup system is keeping pace during an intent cascade. |
| Lease preference violation count | Critical for multi-region latency SLOs when zone configs cannot be satisfied. |
| Raft entry cache hit rate | Misses force disk reads for log entries, adding hidden latency to replication. |
| Compaction debt by LSM level | Shows exactly where in the LSM tree pressure is building. |
| KV write batch size distribution | Shifts in batch size precede write-pattern changes that stress the storage engine. |
| SQL plan cache hit rate | Misses cause repeated planning CPU burn and can indicate stats or version issues. |
| Cross-range transaction percentage | Higher percentages increase distributed coordination overhead and tail latency. |
| Admission control token exhaustion rate | Quantifies how often each queue is delaying work. |
| Closed timestamp lag | Affects follower read freshness and can indicate replication or clock issues. |
| Queue processor error counts | Split, merge, replicate, and GC queue failures reveal internal scheduling problems. |
| Node decommission progress rate | Ensures decommissions complete before patience or redundancy runs out. |
| Time-series data retention pressure | CockroachDB’s internal time-series store can itself become a storage burden. |
| Bulk data export activity | Unexpected EXPORT or BACKUP jobs may indicate exfiltration or runaway operations. |

These signals are best consumed through exploratory dashboards and ad-hoc queries rather than high-noise paging alerts. Their value is giving you the last few data points that separate one root cause from another.
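For the per-range request rate distribution, an ad-hoc analysis can be as simple as comparing each range to the median. The sketch below assumes you already have a mapping of range ID to queries per second from whatever source you trust; the function name, the 10x factor, and the 100 QPS floor are illustrative choices, not CockroachDB defaults.

```python
from statistics import median


def hot_ranges(qps_by_range, factor=10.0, min_qps=100.0):
    """Return (range_id, qps) pairs far above the typical range, hottest first.

    A range counts as hot when its QPS is at least `factor` times the median
    and also at least `min_qps`, so an idle cluster does not produce noise.
    """
    if not qps_by_range:
        return []
    typical = median(qps_by_range.values())
    threshold = max(min_qps, factor * typical)
    hot = [(rid, qps) for rid, qps in qps_by_range.items() if qps >= threshold]
    return sorted(hot, key=lambda item: item[1], reverse=True)
```

Using the median rather than the mean keeps a single very hot range from raising the baseline it is being compared against.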

Moving between levels

Do not try to implement all four levels at once. A practical progression:

  • Start with Level 1 plus Level 2’s L0 sublevel count and SQL latency. Those three classes of signals catch the majority of production incidents.
  • Add Level 2 clock offset, retry cause breakdown, and admission control queue depth before your first multi-region deployment.
  • Move to Level 3 when you have recurring latency tickets that Level 2 cannot explain. Focus on read amplification, compaction backlog, and MVCC garbage first.
  • Add Level 4 signals after you have experienced at least one incident where you needed per-range request rates or Raft proposal drop metrics to close the diagnosis.

Keep scrape intervals short enough to catch sub-second write stalls; a 30-second scrape will miss brief events that still hurt users. Reserve crdb_internal queries for manual diagnosis, not automated alerting, because they are unsupported and can be expensive during incidents.
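The scrape-interval point can be made concrete with a little arithmetic. The sketch below models a gauge that is only abnormal for the duration of an event and asks whether any scrape instant lands inside that event. It is a simplified model under the assumption that scrapes happen at exact multiples of the interval; the function names are hypothetical.

```python
import math


def gauge_scrape_sees_event(start_s, duration_s, interval_s, phase_s=0.0):
    """True if a scrape at phase_s + k * interval_s falls in [start, start + duration)."""
    # First scrape instant at or after the event starts.
    k = math.ceil((start_s - phase_s) / interval_s)
    next_scrape = phase_s + k * interval_s
    return next_scrape < start_s + duration_s


def detection_probability(duration_s, interval_s):
    """Chance that an event starting at a random time overlaps a scrape instant."""
    return min(1.0, duration_s / interval_s)
```

Under this model a 0.5-second stall is caught by a 30-second scrape less than 2% of the time and by a 1-second scrape about half the time. Cumulative counters soften the problem, because a counter that incremented during the stall still shows the increase at the next scrape, but they cannot tell you when inside the interval the event happened or how long it lasted.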

How Netdata helps

Netdata can reduce the gap between Level 2 and Level 3 monitoring without forcing you to build custom Prometheus recording rules for every storage-engine signal:

  • Netdata collects CockroachDB’s Prometheus-formatted metrics from /_status/vars out of the box and correlates them with node-level CPU, memory, disk I/O, and network on the same charts.
  • Per-second granularity makes brief Pebble write stalls, Raft commit latency spikes, and admission-control queue flushes visible instead of averaging them away.
  • Composite dashboards let you overlay L0 sublevel count, KV write latency, and disk I/O utilization to confirm an LSM compaction death spiral in one view.
  • Anomaly detection on slowly moving signals such as MVCC garbage bytes, protected timestamp age, and read amplification can surface the “silent but catastrophic” trends that static thresholds miss.
  • Correlating CockroachDB process RSS with Go GC pause duration and goroutine count helps distinguish a SQL memory leak from a storage-engine cache sizing issue.