CockroachDB monitoring maturity model: from survival to expert

CockroachDB failures rarely announce themselves through a single metric. A slow disk turns into L0 compaction debt, which stalls writes, which drops Raft proposals, which makes ranges unavailable. A drifting clock raises transaction retry rates long before any node self-terminates. To run this database safely, you need layered observability that matches the system’s own layers: storage engine, Raft replication, distributed SQL, and transaction execution.

This guide organizes monitoring into four maturity levels. Level 1 keeps you alive during obvious outages. Level 2 catches workload and latency regressions. Level 3 adds the storage-engine and garbage-collection signals that precede most performance cliffs. Level 4 exposes the per-range, per-queue, and proposal-level details that explain why a seemingly healthy cluster suddenly degrades. Use the model to audit your current coverage and decide what to instrument next.

flowchart TD
    A[Level 1 Survival] --> B[Level 2 Operational]
    B --> C[Level 3 Mature]
    C --> D[Level 4 Expert]
    A -- liveness, ranges, disk, SQL probe, certs --> B
    B -- latency, retries, L0, clock, RPC --> C
    C -- read amp, stalls, cache, GC, intents, MVCC --> D
    D -- raft drops, hot ranges, closed ts, queue errors --> E[Predictive incident response]

Level 1: survival

Survival monitoring answers one question: is the cluster up? These signals are binary, cheap to collect, and unambiguous. If any of them fire outside a maintenance window, you have a production incident.

Signal	Why it matters
Node liveness status	Tells you whether the cluster considers each node alive. A non-live node loses its leases and forces emergency lease transfers.
Range unavailability count	Any nonzero value means some keyspace cannot be read or written. System ranges being unavailable amplifies the impact.
Disk space per store	Running out of disk stalls compaction and can trigger a write-availability death spiral.
SQL synthetic probe	`SELECT 1` from a monitoring client tests the full path: TCP, TLS, pgwire, SQL execution.
Certificate expiration timeline	CockroachDB uses mutual TLS everywhere. Expired certs cause immediate inter-node or client lockout.
Process alive	A basic binary check that the `cockroach` process is running, used as a coarse guard before finer metrics.

With these six signals you will know when the cluster is broken, but not why it is slow or why it is about to break.
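As a minimal sketch of how two of these survival signals could be evaluated, the snippet below parses Prometheus text exposition and flags unavailable ranges and low disk space. The metric names (`ranges_unavailable`, `capacity`, `capacity_available`), the sample values, and the 15% free-space floor are assumptions for illustration; check them against what your own cluster exports.

```python
# Sample payload in Prometheus text format. Metric names and values are
# illustrative assumptions, not a verified list for any CockroachDB version.
SAMPLE = """\
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 0
ranges_unavailable{store="2"} 3
capacity{store="1"} 1000
capacity_available{store="1"} 120
capacity{store="2"} 1000
capacity_available{store="2"} 600
"""


def parse_metrics(text):
    """Return {metric_name: {labels_string: value}}.

    Handles only simple 'name{labels} value' lines without timestamps.
    """
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, value = line.rsplit(" ", 1)
        name, _, labels = name_and_labels.partition("{")
        out.setdefault(name, {})[labels.rstrip("}")] = float(value)
    return out


def survival_alerts(metrics, min_free_fraction=0.15):
    """Return human-readable alerts for two Level 1 signals.

    min_free_fraction is a hypothetical floor, not a documented threshold.
    """
    alerts = []
    for labels, count in metrics.get("ranges_unavailable", {}).items():
        if count > 0:
            alerts.append(f"unavailable ranges: {int(count)} ({labels})")
    for labels, total in metrics.get("capacity", {}).items():
        free = metrics.get("capacity_available", {}).get(labels)
        if free is not None and total > 0 and free / total < min_free_fraction:
            alerts.append(f"low disk: {free / total:.0%} free ({labels})")
    return alerts


if __name__ == "__main__":
    for alert in survival_alerts(parse_metrics(SAMPLE)):
        print(alert)
```

The parser is deliberately small. A production collector would use a maintained Prometheus client or an agent rather than hand-rolled parsing.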

Level 2: operational

Operational monitoring adds workload-facing and infrastructure signals. This is the layer that catches latency regressions, contention, and the first signs of storage-engine pressure. If your team runs CockroachDB in production, these should be on dashboards and alertable.

Signal	Why it matters
SQL statement latency (P50, P99)	Client-visible performance. P99 is where SLO violations live.
KV read/write latency	Isolates storage and replication latency from SQL planning overhead.
Transaction commit latency	Captures multi-statement overhead, retries, and commit protocol cost.
Transaction restart rate by cause	`writetooold` points to contention, `readwithinuncertainty` to clock skew, `txnpush` to app conflicts.
SQL error rate by code	`40001` is contention, `53200` is resource exhaustion, `XX000` is an internal fault.
CPU utilization per node	Raft ticking is per-range; sustained CPU above 70% leaves no headroom for bursts.
Memory RSS per node	Tracks process memory against host or cgroup limits. CGo allocations (Pebble cache) are not managed by the Go GC.
LSM L0 sublevel count	The single best leading indicator of storage-driven degradation. Between 10 and 20 sublevels, latency rises; past 20, write stalls are imminent.
WAL fsync latency	Directly measures the write-path fsync health that every Raft commit depends on.
Disk I/O utilization and latency	SSDs and cloud volumes throttle silently. Latency matters more than percent-utilization on multi-queue devices.
Under-replicated range count	Shows whether the cluster can heal replicas fast enough after a node loss.
Clock offset between node pairs	A node shuts itself down when its offset from at least half of the other nodes exceeds 80% of `--max-offset`, so alert well before drift reaches that point.
Inter-node RPC heartbeat latency	Captures network health as CockroachDB experiences it, including scheduling delay.
Active client connections per node	Connection storms consume goroutines and memory and can overwhelm surviving nodes after failover.
Admission control queue depth	Sustained queuing means the system is at capacity and is artificially adding latency to protect itself.
Job status	Stuck backups, schema changes, or imports block operations and consume I/O.

Level 2 is where most production teams stop. It is enough to diagnose many incidents, but it misses the slow-burn failures that fill disks with garbage or let compaction debt accumulate for weeks.
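Three of the Level 2 signals reduce to simple arithmetic, sketched below. The L0 bands (10 and 20) and the 80% clock limit come from the table above. The 500 ms value for `--max-offset`, the 50% warning fraction, and all function names are assumptions made for illustration; confirm the offset limit against your own node flags.

```python
def classify_l0_sublevels(sublevels, warn_at=10, critical_at=20):
    """Map an L0 sublevel count onto the bands described in the table."""
    if sublevels > critical_at:
        return "critical"  # write stalls are imminent
    if sublevels >= warn_at:
        return "warning"   # latency is likely rising
    return "ok"


def clock_offset_status(offset_ms, max_offset_ms=500.0,
                        warn_fraction=0.5, critical_fraction=0.8):
    """Compare observed offset with the configured maximum.

    max_offset_ms and warn_fraction are assumptions for this sketch.
    """
    used = abs(offset_ms) / max_offset_ms
    if used >= critical_fraction:
        return "critical"
    if used >= warn_fraction:
        return "warning"
    return "ok"


def restart_rates(prev, curr, interval_s):
    """Per-second restart rate by cause from two cumulative counter snapshots."""
    return {
        cause: (count - prev.get(cause, 0)) / interval_s
        for cause, count in curr.items()
    }


if __name__ == "__main__":
    print(classify_l0_sublevels(12))
    print(clock_offset_status(420))
    print(restart_rates({"writetooold": 100},
                        {"writetooold": 160, "txnpush": 30}, 60))
```

Alerting on the fraction of the offset budget consumed, rather than on raw milliseconds, keeps the rule valid if the maximum offset is ever reconfigured.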

Level 3: mature

Mature monitoring tracks the storage engine, runtime, and replication internals. These signals are leading indicators: they warn you before Level 2 latency spikes or Level 1 unavailability.

Signal	Why it matters
LSM read amplification	Measures how many SSTables a read consults. Above 25, read-heavy workloads degrade.
Pebble write stall count	The storage engine has paused writes. Any stall in normal OLTP is abnormal.
Compaction throughput and backlog	If compaction cannot keep up with ingestion, L0 grows and stalls follow.
Pebble block cache hit ratio	A drop means the working set has outgrown cache or the cache is cold after restart.
SQL memory budget utilization	Approaching `--max-sql-memory` causes spills to disk or 53200 rejections.
Go GC pause duration and frequency	Pauses approaching the node liveness heartbeat interval risk missed heartbeats and lost leases.
Goroutine count	Monotonic growth without workload growth indicates a leak.
Range count per node (replicas and leases)	Raft overhead scales with range count. Imbalance points to allocator or constraint problems.
Raft snapshot rate	High rates mean followers cannot keep up and are being rebuilt from full snapshots.
Lease transfer rate	Elevated transfers without planned maintenance mean nodes are flapping or thrashing.
Intent count and bytes per store	Growing unresolved intents block other transactions and add latency system-wide.
MVCC garbage bytes per store	Garbage accumulates silently when GC cannot keep up or protected timestamps block it.
KV read/write latency	Separates storage-layer slowdown from SQL-layer behavior.
Raft log commit latency	Isolates the fsync-bound write path that SQL and KV latency depend on.
Changefeed lag	Growing lag means CDC cannot keep up; stalled changefeeds create protected timestamps that block GC.
Protected timestamp record count and age	Records from CDC or backups prevent MVCC GC. Old records are a silent disk-filling risk.
File descriptor usage	FD exhaustion prevents new connections and SSTable opens.
Composite pattern alerts	Multi-signal correlations catch failure modes like LSM death spirals and protected-timestamp GC stalls.
Network throughput between nodes	Saturated NICs delay Raft and DistSQL shuffle, especially during snapshot storms.

This level turns operators from firefighters into investigators. You stop asking “why is P99 high?” and start asking “why is read amplification rising while garbage bytes grow?”
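The composite pattern alerts listed in the table can be sketched as predicates over a short window of samples. Everything below is hypothetical: the field names, the 2x latency factor, the 90% disk utilization cutoff, and the 24-hour protected-timestamp age. It shows one way to combine signals, not a rule shipped by CockroachDB or Netdata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoreSample:
    """One scrape of a single store. Field names are hypothetical."""
    l0_sublevels: float
    kv_write_p99_ms: float
    disk_util: float             # fraction of time the device was busy, 0..1
    mvcc_garbage_bytes: float
    oldest_pts_age_hours: float  # age of the oldest protected timestamp record


def non_decreasing(values: List[float]) -> bool:
    """True when each value is at least as large as the one before it."""
    return all(b >= a for a, b in zip(values, values[1:]))


def lsm_death_spiral(window: List[StoreSample]) -> bool:
    """L0 climbing, write latency inflating, and disk saturated, all at once."""
    if len(window) < 2:
        return False
    first, last = window[0], window[-1]
    return (
        non_decreasing([s.l0_sublevels for s in window])
        and last.l0_sublevels >= 10
        and last.kv_write_p99_ms >= 2 * first.kv_write_p99_ms
        and last.disk_util >= 0.9
    )


def pts_gc_stall(window: List[StoreSample], max_pts_age_hours: float = 24.0) -> bool:
    """Garbage keeps growing while an old protected timestamp is still held."""
    if len(window) < 2:
        return False
    garbage = [s.mvcc_garbage_bytes for s in window]
    return (
        non_decreasing(garbage)
        and garbage[-1] > garbage[0]
        and window[-1].oldest_pts_age_hours > max_pts_age_hours
    )


if __name__ == "__main__":
    window = [
        StoreSample(4, 5, 0.50, 1e9, 1),
        StoreSample(9, 8, 0.80, 1e9, 1),
        StoreSample(15, 20, 0.95, 1e9, 1),
    ]
    print("LSM death spiral:", lsm_death_spiral(window))
    print("PTS GC stall:", pts_gc_stall(window))
```

Requiring every condition at once is what keeps a composite alert quiet. Any one of these signals in isolation is often benign.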

Level 4: expert

Expert monitoring targets the low-level signals that explain edge-case behavior. These metrics are noisy in isolation and expensive to collect, but they are decisive during deep incidents and capacity planning.

Signal	Why it matters
Raft proposal drop rate	Dropped proposals mean writes are being silently retried, which shows up immediately in tail latency.
Per-range request rate distribution	Identifies hot ranges before single-node CPU saturation becomes obvious.
Intent resolution throughput	Tells you whether the cleanup system is keeping pace during an intent cascade.
Lease preference violation count	Critical for multi-region latency SLOs when zone configs cannot be satisfied.
Raft entry cache hit rate	Misses force disk reads for log entries, adding hidden latency to replication.
Compaction debt by LSM level	Shows exactly where in the LSM tree pressure is building.
KV write batch size distribution	Shifts in batch size precede write-pattern changes that stress the storage engine.
SQL plan cache hit rate	Misses cause repeated planning CPU burn and can indicate stats or version issues.
Cross-range transaction percentage	Higher percentages increase distributed coordination overhead and tail latency.
Admission control token exhaustion rate	Quantifies how often each queue is delaying work.
Closed timestamp lag	Affects follower read freshness and can indicate replication or clock issues.
Queue processor error counts	Split, merge, replicate, and GC queue failures reveal internal scheduling problems.
Node decommission progress rate	Ensures decommissions complete before patience or redundancy runs out.
Time-series data retention pressure	CockroachDB’s internal time-series store can itself become a storage burden.
Bulk data export activity	Unexpected EXPORT or BACKUP jobs may indicate exfiltration or runaway operations.

These signals are best consumed through exploratory dashboards and ad-hoc queries rather than high-noise paging alerts. Their value is giving you the last few data points that separate one root cause from another.
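For exploratory use, a per-range request rate distribution can be reduced to "which ranges take a disproportionate share of traffic". The sketch below assumes you already have a mapping of range ID to queries per second from some source. The input shape and the 20% share threshold are hypothetical.

```python
def hot_ranges(qps_by_range, share_threshold=0.2):
    """Return (range_id, share_of_total) for ranges at or above the threshold.

    Results are ordered from hottest to coolest. share_threshold is an
    illustrative cutoff, not a recommended value.
    """
    total = sum(qps_by_range.values())
    if total <= 0:
        return []
    ranked = sorted(qps_by_range.items(), key=lambda item: item[1], reverse=True)
    return [
        (range_id, qps / total)
        for range_id, qps in ranked
        if qps / total >= share_threshold
    ]


if __name__ == "__main__":
    sample = {1: 50, 2: 30, 3: 900, 4: 20}
    for range_id, share in hot_ranges(sample):
        print(f"range {range_id} serves {share:.0%} of requests")
```

Expressing hotness as a share of total traffic keeps the check meaningful as overall load grows, which an absolute QPS threshold would not.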

Moving between levels

Do not try to implement all four levels at once. A practical progression:

Start with Level 1 plus Level 2’s L0 sublevel count and SQL latency. Those three classes of signals catch the majority of production incidents.
Add Level 2 clock offset, retry cause breakdown, and admission control queue depth before your first multi-region deployment.
Move to Level 3 when you have recurring latency tickets that Level 2 cannot explain. Focus on read amplification, compaction backlog, and MVCC garbage first.
Add Level 4 signals after you have experienced at least one incident where you needed per-range request rates or Raft proposal drop metrics to close the diagnosis.
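The progression above can be turned into a small coverage audit. This sketch encodes only the signals named in the progression, plus the six Level 1 signals, under made-up identifiers. The signal names are labels for illustration, not metric names.

```python
# Priority signals per level, in the order the progression introduces them.
# The strings are hypothetical labels, not exported metric names.
PRIORITY_SIGNALS = {
    1: ["node liveness", "unavailable ranges", "disk space",
        "sql probe", "certificate expiry", "process alive"],
    2: ["l0 sublevel count", "sql statement latency", "clock offset",
        "restart rate by cause", "admission control queue depth"],
    3: ["read amplification", "compaction backlog", "mvcc garbage bytes"],
    4: ["per-range request rate", "raft proposal drop rate"],
}


def coverage(instrumented):
    """Fraction of each level's priority signals that are already collected."""
    have = set(instrumented)
    return {
        level: sum(signal in have for signal in signals) / len(signals)
        for level, signals in PRIORITY_SIGNALS.items()
    }


def next_gap(instrumented):
    """Lowest level with missing signals and what is missing, or None."""
    have = set(instrumented)
    for level in sorted(PRIORITY_SIGNALS):
        missing = [s for s in PRIORITY_SIGNALS[level] if s not in have]
        if missing:
            return level, missing
    return None


if __name__ == "__main__":
    current = PRIORITY_SIGNALS[1] + ["l0 sublevel count", "sql statement latency"]
    print(coverage(current))
    print(next_gap(current))
```

Reporting the lowest incomplete level first mirrors the advice to build upward rather than instrumenting all four levels at once.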

Keep scrape intervals short enough to catch sub-second write stalls; a 30-second scrape will miss brief events that still hurt users. Reserve `crdb_internal` queries for manual diagnosis, not automated alerting, because they are unsupported and can be expensive during incidents.
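A short worked example shows why the scrape interval matters. Assume a hypothetical cumulative counter of seconds spent in write stalls, and a single 0.5-second stall that falls entirely inside one scrape window. The share of the window reported as stalled depends only on the interval.

```python
def stalled_fraction(prev_stall_s, curr_stall_s, scrape_interval_s):
    """Share of the scrape window spent stalled, from a cumulative counter.

    The counter itself is a hypothetical input for this sketch.
    """
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return (curr_stall_s - prev_stall_s) / scrape_interval_s


if __name__ == "__main__":
    # The same 0.5-second stall, observed at three scrape intervals.
    for interval in (30, 10, 1):
        fraction = stalled_fraction(120.0, 120.5, interval)
        print(f"{interval:>2}s scrape: {fraction:.1%} of the window stalled")
```

At a 30-second interval the stall is diluted to under 2% of the window and is easy to dismiss as noise. At a 1-second interval the same event fills half the window.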

How Netdata helps

Netdata can reduce the gap between Level 2 and Level 3 monitoring without forcing you to build custom Prometheus recording rules for every storage-engine signal:

Netdata collects CockroachDB’s Prometheus-formatted metrics from `/_status/vars` out of the box and correlates them with node-level CPU, memory, disk I/O, and network on the same charts.
Per-second granularity makes brief Pebble write stalls, Raft commit latency spikes, and admission-control queue flushes visible instead of averaging them away.
Composite dashboards let you overlay L0 sublevel count, KV write latency, and disk I/O utilization to confirm an LSM compaction death spiral in one view.
Anomaly detection on slowly moving signals such as MVCC garbage bytes, protected timestamp age, and read amplification can surface the “silent but catastrophic” trends that static thresholds miss.
Correlating CockroachDB process RSS with Go GC pause duration and goroutine count helps distinguish a SQL memory leak from a storage-engine cache sizing issue.
