CockroachDB monitoring maturity model: from survival to expert
CockroachDB failures rarely announce themselves through a single metric. A slow disk turns into L0 compaction debt, which stalls writes, which drops Raft proposals, which makes ranges unavailable. A drifting clock raises transaction retry rates long before any node self-terminates. To run this database safely, you need layered observability that matches the system’s own layers: storage engine, Raft replication, distributed SQL, and transaction execution.
This guide organizes monitoring into four maturity levels. Level 1 keeps you alive during obvious outages. Level 2 catches workload and latency regressions. Level 3 adds the storage-engine and garbage-collection signals that precede most performance cliffs. Level 4 exposes the per-range, per-queue, and proposal-level details that explain why a seemingly healthy cluster suddenly degrades. Use the model to audit your current coverage and decide what to instrument next.
flowchart TD
A[Level 1 Survival] --> B[Level 2 Operational]
B --> C[Level 3 Mature]
C --> D[Level 4 Expert]
A -- liveness, ranges, disk, SQL probe, certs --> B
B -- latency, retries, L0, clock, RPC --> C
C -- read amp, stalls, cache, GC, intents, MVCC --> D
D -- raft drops, hot ranges, closed ts, queue errors --> E[Predictive incident response]
Level 1: survival
Survival monitoring answers one question: is the cluster up? These signals are binary, cheap to collect, and unambiguous. If any of them fire outside a maintenance window, you have a production incident.
| Signal | Why it matters |
|---|---|
| Node liveness status | Tells you whether the cluster considers each node alive. A non-live node loses its leases and forces emergency lease transfers. |
| Range unavailability count | Any nonzero value means some keyspace cannot be read or written. System ranges being unavailable amplifies the impact. |
| Disk space per store | Running out of disk stalls compaction and can trigger a write-availability death spiral. |
| SQL synthetic probe | SELECT 1 from a monitoring client tests the full path: TCP, TLS, pgwire, SQL execution. |
| Certificate expiration timeline | CockroachDB uses mutual TLS everywhere. Expired certs cause immediate inter-node or client lockout. |
| Process alive | A basic binary check that the cockroach process is running, used as a coarse guard before finer metrics. |
With these six signals you will know when the cluster is broken, but not why it is slow or why it is about to break.
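As a rough illustration of how cheap these checks are, the sketch below parses a few lines of Prometheus-style text and evaluates three of the survival signals. The metric names (`liveness_livenodes`, `ranges_unavailable`, `capacity`, `capacity_available`), the sample values, and the 15% free-space threshold are assumptions for the example; verify them against what your cluster actually exposes.

```python
from typing import Dict, List

# Sample scrape text. Metric names and values are illustrative assumptions.
SAMPLE = """\
# HELP ranges_unavailable Ranges without quorum
ranges_unavailable{store="1"} 0
liveness_livenodes 3
capacity{store="1"} 1000
capacity_available{store="1"} 120
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse a tiny subset of the Prometheus text format: name{labels} value."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        # Sum across label sets (for example, several stores on one node).
        values[name] = values.get(name, 0.0) + float(raw_value)
    return values


def survival_alerts(
    metrics: Dict[str, float],
    expected_nodes: int,
    min_free_fraction: float = 0.15,  # hypothetical threshold
) -> List[str]:
    """Return a list of Level 1 problems found in one scrape."""
    alerts: List[str] = []
    if metrics.get("liveness_livenodes", 0) < expected_nodes:
        alerts.append("node liveness: fewer live nodes than expected")
    if metrics.get("ranges_unavailable", 0) > 0:
        alerts.append("unavailable ranges present")
    capacity = metrics.get("capacity", 0)
    if capacity > 0 and metrics.get("capacity_available", 0) / capacity < min_free_fraction:
        alerts.append("disk: free space below threshold")
    return alerts


print(survival_alerts(parse_metrics(SAMPLE), expected_nodes=3))
```

With the sample above, only the disk check fires, because 120 of 1000 capacity units is under the assumed 15% floor. The SQL probe, certificate expiry, and process checks need a client connection or host access and are left out of this sketch.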
Level 2: operational
Operational monitoring adds workload-facing and infrastructure signals. This is the layer that catches latency regressions, contention, and the first signs of storage-engine pressure. If your team runs CockroachDB in production, these should be on dashboards and alertable.
| Signal | Why it matters |
|---|---|
| SQL statement latency (P50, P99) | Client-visible performance. P99 is where SLO violations live. |
| KV read/write latency | Isolates storage and replication latency from SQL planning overhead. |
| Transaction commit latency | Captures multi-statement overhead, retries, and commit protocol cost. |
| Transaction restart rate by cause | writetooold points to contention, readwithinuncertainty to clock skew, txnpush to app conflicts. |
| SQL error rate by code | 40001 is contention, 53200 is resource exhaustion, XX000 is an internal fault. |
| CPU utilization per node | Raft ticking is per-range, so CPU cost grows with range count; sustained CPU above 70% leaves little headroom for bursts. |
| Memory RSS per node | Tracks process memory against host or cgroup limits. CGo allocations (Pebble cache) are not managed by the Go GC. |
| LSM L0 sublevel count | The single best leading indicator of storage-driven degradation. Between 10 and 20 sublevels, latency rises; past 20, write stalls are imminent. |
| WAL fsync latency | Directly measures the write-path fsync health that every Raft commit depends on. |
| Disk I/O utilization and latency | SSDs and cloud volumes throttle silently. Latency matters more than percent-utilization on multi-queue devices. |
| Under-replicated range count | Shows whether the cluster can heal replicas fast enough after a node loss. |
| Clock offset between node pairs | A node whose clock drifts past 80% of --max-offset relative to at least half of its peers shuts itself down. |
| Inter-node RPC heartbeat latency | Captures network health as CockroachDB experiences it, including scheduling delay. |
| Active client connections per node | Connection storms consume goroutines and memory and can overwhelm surviving nodes after failover. |
| Admission control queue depth | Sustained queuing means the system is at capacity and is artificially adding latency to protect itself. |
| Job status | Stuck backups, schema changes, or imports block operations and consume I/O. |
Level 2 is where most production teams stop. It is enough to diagnose many incidents, but it misses the slow-burn failures that fill disks with garbage or let compaction debt accumulate for weeks.
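Several Level 2 rows reduce to simple classification rules. The sketch below encodes three of them: the L0 sublevel bands, the clock offset as a fraction of --max-offset, and the restart-cause hints from the table. The function names and the 500 ms maximum offset are assumptions for illustration; use the value your nodes are actually started with.

```python
def l0_sublevel_band(sublevels: int) -> str:
    """Bands follow the table: latency rises from about 10, stalls loom past 20."""
    if sublevels > 20:
        return "critical"
    if sublevels >= 10:
        return "warning"
    return "ok"


def clock_offset_ratio(offset_nanos: float, max_offset_nanos: float) -> float:
    """Fraction of --max-offset consumed by the measured offset."""
    return abs(offset_nanos) / max_offset_nanos


def restart_hint(cause: str) -> str:
    """Map a transaction restart cause to the likely source named in the table."""
    hints = {
        "writetooold": "contention",
        "readwithinuncertainty": "clock skew",
        "txnpush": "application-level conflicts",
    }
    return hints.get(cause, "unclassified")


MAX_OFFSET_NANOS = 500_000_000  # assumes --max-offset=500ms
ratio = clock_offset_ratio(-260_000_000, MAX_OFFSET_NANOS)

print(l0_sublevel_band(12), round(ratio, 2), restart_hint("readwithinuncertainty"))
```

Alerting on the ratio rather than on raw nanoseconds keeps the rule valid if the maximum offset is ever changed.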
Level 3: mature
Mature monitoring tracks the storage engine, runtime, and replication internals. These signals are leading indicators: they warn you before Level 2 latency spikes or Level 1 unavailability.
| Signal | Why it matters |
|---|---|
| LSM read amplification | Measures how many SSTables a read consults. Above 25, read-heavy workloads degrade. |
| Pebble write stall count | The storage engine has paused writes. Any stall in normal OLTP is abnormal. |
| Compaction throughput and backlog | If compaction cannot keep up with ingestion, L0 grows and stalls follow. |
| Pebble block cache hit ratio | A drop means the working set has outgrown cache or the cache is cold after restart. |
| SQL memory budget utilization | Approaching --max-sql-memory causes spills to disk or 53200 rejections. |
| Go GC pause duration and frequency | Pauses approaching the liveness heartbeat interval can make a node miss heartbeats and lose liveness. |
| Goroutine count | Monotonic growth without workload growth indicates a leak. |
| Range count per node (replicas and leases) | Raft overhead scales with range count. Imbalance points to allocator or constraint problems. |
| Raft snapshot rate | High rates mean followers cannot keep up and are being rebuilt from full snapshots. |
| Lease transfer rate | Elevated transfers without planned maintenance mean nodes are flapping or thrashing. |
| Intent count and bytes per store | Growing unresolved intents block other transactions and add latency system-wide. |
| MVCC garbage bytes per store | Garbage accumulates silently when GC cannot keep up or protected timestamps block it. |
| KV read/write latency | Separates storage-layer slowdown from SQL-layer behavior. |
| Raft log commit latency | Isolates the fsync-bound write path that SQL and KV latency depend on. |
| Changefeed lag | Growing lag means CDC cannot keep up; stalled changefeeds create protected timestamps that block GC. |
| Protected timestamp record count and age | Records from CDC or backups prevent MVCC GC. Old records are a silent disk-filling risk. |
| File descriptor usage | FD exhaustion prevents new connections and SSTable opens. |
| Composite pattern alerts | Multi-signal correlations catch failure modes like LSM death spirals and protected-timestamp GC stalls. |
| Network throughput between nodes | Saturated NICs delay Raft and DistSQL shuffle, especially during snapshot storms. |
This level turns operators from firefighters into investigators. You stop asking “why is P99 high?” and start asking “why is read amplification rising while garbage bytes grow?”
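The composite pattern alerts in the table can be sketched as trend checks combined with a threshold. The example below is one possible shape, not a reference implementation: it flags an LSM death spiral when L0 sublevels and KV write latency both trend upward while the disk stays busy, and a GC stall when garbage bytes grow while an old protected timestamp exists. The function names, the 90% disk-busy cutoff, and the one-day age limit are all hypothetical.

```python
from typing import Sequence


def slope(samples: Sequence[float]) -> float:
    """Least-squares slope per sample interval; 0.0 for fewer than two points."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def lsm_death_spiral(
    l0_sublevels: Sequence[float],
    kv_write_latency_ms: Sequence[float],
    disk_util_pct: Sequence[float],
    disk_busy_pct: float = 90.0,  # hypothetical cutoff
) -> bool:
    """True when L0 and write latency both rise and the disk never dips below the cutoff."""
    return (
        slope(l0_sublevels) > 0
        and slope(kv_write_latency_ms) > 0
        and len(disk_util_pct) > 0
        and min(disk_util_pct) >= disk_busy_pct
    )


def gc_stall(
    garbage_bytes: Sequence[float],
    oldest_protected_age_s: float,
    max_age_s: float = 86_400.0,  # hypothetical: one day
) -> bool:
    """True when MVCC garbage grows while an old protected timestamp exists."""
    return slope(garbage_bytes) > 0 and oldest_protected_age_s > max_age_s


print(lsm_death_spiral([4, 6, 9, 14], [3, 5, 9, 20], [95, 97, 99, 98]))
print(gc_stall([1e9, 2e9, 3e9], oldest_protected_age_s=200_000))
```

Requiring all conditions at once is what keeps such an alert quiet during a brief compaction burst, where L0 rises but latency and disk saturation do not follow.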
Level 4: expert
Expert monitoring targets the low-level signals that explain edge-case behavior. These metrics are noisy in isolation and expensive to collect, but they are decisive during deep incidents and capacity planning.
| Signal | Why it matters |
|---|---|
| Raft proposal drop rate | Dropped proposals mean writes are being silently retried; immediate tail-latency impact. |
| Per-range request rate distribution | Identifies hot ranges before single-node CPU saturation becomes obvious. |
| Intent resolution throughput | Tells you whether the cleanup system is keeping pace during an intent cascade. |
| Lease preference violation count | Critical for multi-region latency SLOs when zone configs cannot be satisfied. |
| Raft entry cache hit rate | Misses force disk reads for log entries, adding hidden latency to replication. |
| Compaction debt by LSM level | Shows exactly where in the LSM tree pressure is building. |
| KV write batch size distribution | Shifts in batch size precede write-pattern changes that stress the storage engine. |
| SQL plan cache hit rate | Misses cause repeated planning CPU burn and can indicate stats or version issues. |
| Cross-range transaction percentage | Higher percentages increase distributed coordination overhead and tail latency. |
| Admission control token exhaustion rate | Quantifies how often each queue is delaying work. |
| Closed timestamp lag | Affects follower read freshness and can indicate replication or clock issues. |
| Queue processor error counts | Split, merge, replicate, and GC queue failures reveal internal scheduling problems. |
| Node decommission progress rate | Ensures decommissions complete before patience or redundancy runs out. |
| Time-series data retention pressure | CockroachDB’s internal time-series store can itself become a storage burden. |
| Bulk data export activity | Unexpected EXPORT or BACKUP jobs may indicate exfiltration or runaway operations. |
These signals are best consumed through exploratory dashboards and ad-hoc queries rather than high-noise paging alerts. Their value is giving you the last few data points that separate one root cause from another.
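For the per-range request rate distribution, an ad-hoc query often boils down to asking which ranges take a disproportionate share of traffic. The sketch below assumes you already have a mapping from range ID to queries per second from whatever source you use; the `hot_ranges` name and the 20% share threshold are hypothetical.

```python
from typing import Dict, List


def hot_ranges(qps_by_range: Dict[int, float], share_threshold: float = 0.2) -> List[int]:
    """Return range IDs whose share of total QPS meets the threshold, hottest first."""
    total = sum(qps_by_range.values())
    if total <= 0:
        return []
    hot = [rid for rid, qps in qps_by_range.items() if qps / total >= share_threshold]
    return sorted(hot, key=lambda rid: qps_by_range[rid], reverse=True)


# Illustrative data: range 3 receives 80% of all requests.
print(hot_ranges({1: 10.0, 2: 10.0, 3: 80.0}))
```

A share-based rule adapts to overall load, so it stays useful on both quiet and busy clusters, which suits exploratory use better than a fixed QPS limit.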
Moving between levels
Do not try to implement all four levels at once. A practical progression:
- Start with Level 1 plus Level 2’s L0 sublevel count and SQL latency. Those three classes of signals catch the majority of production incidents.
- Add Level 2 clock offset, retry cause breakdown, and admission control queue depth before your first multi-region deployment.
- Move to Level 3 when you have recurring latency tickets that Level 2 cannot explain. Focus on read amplification, compaction backlog, and MVCC garbage first.
- Add Level 4 signals after you have experienced at least one incident where you needed per-range request rates or Raft proposal drop metrics to close the diagnosis.
Keep scrape intervals short enough to catch sub-second write stalls; a 30-second scrape will miss brief events that still hurt users. Reserve crdb_internal queries for manual diagnosis, not automated alerting, because they are unsupported and can be expensive during incidents.
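The effect of scrape interval is easy to show with synthetic data. The sketch below builds a two-minute, one-sample-per-second latency series containing a two-second stall, then compares what a 1-second and a 30-second scrape would see. All values are made up, and the cumulative "stalled seconds" counter is a hypothetical metric included only to show that counters survive slow scrapes better than gauges do.

```python
from typing import List, Sequence


def sample_every(series: Sequence[float], interval: int) -> List[float]:
    """Keep every interval-th point, as a scraper with that period would."""
    return list(series[::interval])


# Synthetic gauge: 5 ms latency with a two-second stall at t=44 and t=45.
latency_ms = [5.0] * 120
latency_ms[44] = latency_ms[45] = 900.0

per_second = sample_every(latency_ms, 1)
per_30s = sample_every(latency_ms, 30)  # sees only t=0, 30, 60, 90

# Hypothetical cumulative counter of seconds spent above 100 ms.
stall_counter: List[float] = []
count = 0
for value in latency_ms:
    if value > 100.0:
        count += 1
    stall_counter.append(count)
counter_30s = sample_every(stall_counter, 30)

print(max(per_second), max(per_30s), counter_30s)
```

The 30-second gauge samples never land on the stall, so the spike disappears entirely. The counter still shows that two stalled seconds occurred between the second and third scrape, but not exactly when or how severe they were.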
How Netdata helps
Netdata can reduce the gap between Level 2 and Level 3 monitoring without forcing you to build custom Prometheus recording rules for every storage-engine signal:
- Netdata collects CockroachDB’s Prometheus-formatted metrics from /_status/vars out of the box and correlates them with node-level CPU, memory, disk I/O, and network on the same charts.
- Per-second granularity makes brief Pebble write stalls, Raft commit latency spikes, and admission-control queue flushes visible instead of averaging them away.
- Composite dashboards let you overlay L0 sublevel count, KV write latency, and disk I/O utilization to confirm an LSM compaction death spiral in one view.
- Anomaly detection on slowly moving signals such as MVCC garbage bytes, protected timestamp age, and read amplification can surface the “silent but catastrophic” trends that static thresholds miss.
- Correlating CockroachDB process RSS with Go GC pause duration and goroutine count helps distinguish a SQL memory leak from a storage-engine cache sizing issue.







