
COCKROACHDB · OPERATIONS PLAYBOOK

CockroachDB without the pager: Raft ranges, clock skew, and an LSM that falls behind

A distributed SQL layer over a transactional key-value store, replicated by Raft across ranges, persisted to a Pebble LSM tree, and coordinated by synchronized clocks. Here's how that machinery holds up in production, where it cracks first, what to watch as traffic grows, and how to work each failure when it lands.


CockroachDB is famously easy to scale horizontally and famously unforgiving once write rate, range count, or clock drift pushes past what the defaults assumed.

The defaults work. Until writes outpace compaction, the L0 sublevel count climbs past 20, and Pebble stalls writes while read latency climbs steeply. Until a node's Go GC pause outlasts its liveness expiry and it loses its leases, then recovers, then loses them again. Until NTP drifts and a node logs "clock synchronization error: this node is more than 500ms away from at least half of the known nodes" and self-terminates. Until a sequential primary key funnels every write through one leaseholder while the rest of the cluster idles. Until a stalled changefeed holds a protected timestamp that silently blocks MVCC garbage collection and the disk fills with dead data.

These guides are written for engineers who already run CockroachDB, not for people learning what a range is. The goal is to give you the mental model of how the cluster actually behaves under load, the failure patterns that keep recurring, the monitoring story that catches problems before they page anyone, and the runbooks you wish someone had handed you before your last incident.

How CockroachDB actually runs in production

CockroachDB is not just a SQL database. It is a distributed SQL engine over a transactional KV store, where every write travels through Raft consensus to a quorum of replicas, lands in a Pebble LSM tree, and is ordered by clocks that must stay synchronized. Most production failures live between these layers, not inside any one of them.

SQL gateway + DistSQL

Application connections terminate on a gateway node over the PostgreSQL wire protocol; each costs a goroutine and memory. The gateway parses and plans the statement, then may distribute it across nodes as a flow of processors. A single analytical query can saturate inter-node bandwidth and consume <code>--max-sql-memory</code> on several nodes, spilling to disk or failing with SQLSTATE <code>53200</code> (out of memory).

GATEWAY

transaction layer (MVCC)

Serializable snapshot isolation over MVCC timestamps. Conflicting transactions are pushed or restarted with <code>TransactionRetryWithProtoRefreshError</code>. Uncommitted writes leave intents that other transactions must resolve — abandoned intents accumulate and add latency cluster-wide.

TXN

ranges + leaseholder + Raft

The keyspace is split into ~512 MiB ranges, each replicated (default 3x). One replica holds the lease and serves reads; one is the Raft leader. Every write is proposed through Raft and committed by a quorum. A node with 10,000 ranges runs 10,000 Raft state machines — a non-obvious CPU multiplier.

RAFT
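To make the Raft multiplier concrete, here is a rough back-of-the-envelope sketch. It assumes every range is full at about 512 MiB and that replicas are spread evenly, neither of which holds exactly on a real cluster, so treat the result as a lower bound. The function names are illustrative and not part of any CockroachDB tooling.

```python
import math

RANGE_SIZE_MIB = 512      # approximate range size quoted above
REPLICATION_FACTOR = 3    # default replication factor quoted above


def quorum(replicas: int) -> int:
    """Smallest majority of a Raft group with this many replicas."""
    return replicas // 2 + 1


def replicas_per_node(logical_gib: float, nodes: int,
                      replication: int = REPLICATION_FACTOR) -> int:
    """Rough replica count per node, assuming full ranges and even balance."""
    ranges = math.ceil(logical_gib * 1024 / RANGE_SIZE_MIB)
    return math.ceil(ranges * replication / nodes)


print(quorum(3), quorum(5))          # 2 3
print(replicas_per_node(15_000, 9))  # 10000
```

Under these assumptions, about 15,000 GiB of logical data on nine nodes already puts each node at the 10,000 Raft groups mentioned above.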

node liveness

Each node renews a liveness record on a short heartbeat. Miss the expiry — because of a GC pause, disk stall, or CPU starvation — and the cluster declares it dead, redistributing its leases. Flapping liveness is worse than a clean failure: it creates oscillating availability.

LIVENESS

Hybrid Logical Clocks

HLC combines wall-clock time with a logical counter and enforces a maximum offset (default 500 ms). Skew within the window causes <code>readwithinuncertainty</code> restarts; skew past 80% of max-offset makes a node self-terminate. Shared NTP failure can drift a quorum at once.

CLOCK
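The way an HLC combines wall time with a logical counter can be sketched in a few lines. This is the textbook hybrid logical clock algorithm, not CockroachDB's implementation; the class and method names are illustrative, and rejecting remote timestamps that are more than the maximum offset ahead is a simplifying assumption made for the example.

```python
import time
from dataclasses import dataclass

MAX_OFFSET_NS = 500_000_000  # 500 ms default quoted above


@dataclass(order=True, frozen=True)
class Timestamp:
    wall: int     # nanoseconds of wall-clock time
    logical: int  # tie-breaker when wall time does not advance


class HLC:
    def __init__(self, physical_clock=time.time_ns, max_offset_ns=MAX_OFFSET_NS):
        self._now_ns = physical_clock
        self._max_offset_ns = max_offset_ns
        self._last = Timestamp(0, 0)

    def now(self) -> Timestamp:
        """Timestamp for a local event; never moves backwards."""
        phys = self._now_ns()
        if phys > self._last.wall:
            self._last = Timestamp(phys, 0)
        else:
            self._last = Timestamp(self._last.wall, self._last.logical + 1)
        return self._last

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        phys = self._now_ns()
        if remote.wall - phys > self._max_offset_ns:
            raise ValueError("remote timestamp is beyond the maximum offset")
        if phys > self._last.wall and phys > remote.wall:
            self._last = Timestamp(phys, 0)
        elif remote.wall > self._last.wall:
            self._last = Timestamp(remote.wall, remote.logical + 1)
        elif remote.wall == self._last.wall:
            logical = max(self._last.logical, remote.logical) + 1
            self._last = Timestamp(self._last.wall, logical)
        else:
            self._last = Timestamp(self._last.wall, self._last.logical + 1)
        return self._last


clock = HLC(physical_clock=lambda: 1_000)  # frozen physical clock for the demo
a = clock.now()                            # Timestamp(wall=1000, logical=0)
b = clock.now()                            # Timestamp(wall=1000, logical=1)
c = clock.update(Timestamp(5_000, 7))      # Timestamp(wall=5000, logical=8)
print(a < b < c)                           # True
```

The logical counter is what keeps timestamps ordered when the physical clock stalls or a peer is slightly ahead; the maximum offset bounds how far ahead a peer is allowed to be.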

admission control

An internal flow-control system queues SQL, KV, and storage-write work to prevent overload. The <code>store-write</code> queue is tied directly to LSM L0 health — it begins shaping regular traffic at 5 sublevels. Sustained queuing means zero burst headroom.

ADMISSION

Pebble / LSM storage

Writes hit an in-memory memtable, flush to L0 SSTables, and compact down through L6. When compaction falls behind ingestion, L0 sublevels grow, read amplification rises nonlinearly, and Pebble eventually stalls writes — the single most common performance cliff.

STORAGE

disk (WAL + compaction)

Local NVMe, EBS, or a PD volume holds the WAL, SSTables, and snapshots. WAL <code>fsync</code> latency sits on the critical path of every write; a detected disk stall makes the node self-terminate. Free space below ~15% can starve compaction and trigger a death spiral.

DISK
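A minimal sketch of tiered free-space checks, using the thresholds that appear in this guide (20% as the survival-level warning, about 15% as the point where compaction is at risk, 10% as a death-spiral signal). The function names are hypothetical, and filesystem free space is only a stand-in here: on a real cluster the per-store capacity_available metric is the better source.

```python
import shutil


def free_fraction(store_path: str) -> float:
    """Fraction of the filesystem holding store_path that is still free."""
    usage = shutil.disk_usage(store_path)
    return usage.free / usage.total


def disk_state(free: float) -> str:
    """Map a free-space fraction to the tiers used in this guide."""
    if free < 0.10:
        return "critical"
    if free < 0.15:
        return "compaction at risk"
    if free < 0.20:
        return "warning"
    return "ok"


# "." stands in for a store directory; substitute the real store path.
print(disk_state(free_fraction(".")))
```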

Why this matters: a latency spike can come from LSM read amplification, lock contention, a hot range, a DistSQL shuffle, a clock-skew uncertainty restart, an admission-control queue, or a saturated disk. The symptom is the same — CockroachDB is slow — but each layer has a different signal and a different fix.

The failures you'll actually see

Most CockroachDB incidents fall into a small set of recurring patterns. Recognize the shape, and triage gets dramatically faster.

CRITICAL

The LSM compaction death spiral

Write rate exceeds disk compaction throughput. L0 SSTables accumulate, storage_l0_sublevels climbs past 10 then 20+, and read amplification rises — which makes compaction itself slower, a positive feedback loop. Eventually Pebble stalls writes, the node can't service its Raft log, loses leases, and appears partially unavailable. If several nodes hit this at once, the cluster goes down.

storage_l0_sublevels rising past 20 and not decreasing
storage_write_stalls incrementing (rate above 1/second)
KV write latency climbing from milliseconds to seconds
admission store-write queue deep, disk I/O pinned at 100%


CRITICAL

Memory pressure to GC thrashing to liveness loss

The Go heap grows from large queries or misconfigured memory budgets. GC runs more often and longer. During a pause the node can't process Raft heartbeats; if a pause outlasts the liveness record's expiry, the node loses liveness and its leases redistribute. It recovers, regains leases, and the cycle repeats: oscillating availability that's hard to pin down.

Go GC pause durations above 500 ms, GC CPU above 15%
node liveness flapping in lockstep with GC pauses
lease transfers spiking each time liveness drops
sys_rss approaching the cgroup limit


IMMINENT

The clock-skew crisis

NTP fails or a VM drifts. First the uncertainty interval widens, so readwithinuncertainty restarts climb and tail latency rises. If drift passes 80% of --max-offset (over 400 ms by default), the node logs "clock synchronization error: this node is more than 500ms away from at least half of the known nodes", self-terminates, and crash-loops until the clock is fixed. A shared NTP source can drift a quorum at once.

clock_offset_meannanos rising toward 400 ms on one or more nodes
readwithinuncertainty restart rate climbing (near-diagnostic)
a node self-terminating, then failing to rejoin
multiple nodes in the same NTP domain drifting together


ACTIVE

The transaction contention storm

Infrastructure is healthy — good liveness, zero unavailable ranges — but the workload creates serialized hot paths. Transactions collide on the same keys, retry with RETRY_WRITE_TOO_OLD, leave intents others must resolve, and the cluster spends its time waiting and retrying rather than doing work. Under load it becomes a positive feedback loop.

txn_restarts rising, dominated by writetooold
intentcount and intentbytes growing
SQL P99 latency rising while CPU stays moderate
contention isolated to specific tables or indexes


CRITICAL

Lost quorum and unavailable ranges

Simultaneous node failures, a partition bisecting a replica group, or a stuck Raft group leave ranges with no leaseholder or no quorum. ranges_unavailable goes nonzero and clients see replica unavailable for the affected keyspace. If the unavailable ranges back system metadata (meta, liveness, jobs), the impact is cluster-wide even though the count is small.

ranges_unavailable nonzero, sustained beyond brief lease transfers
replica unavailable and context deadline exceeded errors to clients
node liveness changes preceding the unavailability
ranges_underreplicated elevated and not healing


IMMINENT

The disk-full death spiral

A store runs low on space from data growth, MVCC garbage, or a protected timestamp that blocks GC. Below ~15% free, compaction can't stage its output and starts failing, which drives L0 growth and write stalls. Deleting data does not help immediately: tombstones only clear through compaction, which is exactly what's now broken. The store reports "store is full".

capacity_available below 10% of total and still falling
MVCC garbage bytes diverging from live bytes
compaction failing or stalled, L0 sublevels climbing
protected timestamp records present with no active backup or CDC


The Netdata solution

CockroachDB monitoring with Netdata

Netdata monitors CockroachDB with per-second metrics and automatic dashboards. Watch LSM compaction, Raft liveness, clock skew, hot ranges, and intent buildup so the distributed-systems failure modes in these runbooks surface early.


CockroachDB monitoring maturity levels

CockroachDB observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much your cluster matters. Most production clusters should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is the cluster still serving? You will not learn what broke, but you will learn that something broke before users do. Survival is enough for dev clusters and low-stakes workloads.

Node liveness: Does the cluster consider every node alive and renewing its heartbeat?
ranges_unavailable: Is any part of the keyspace unable to serve reads or writes?
Disk space per store: Is any store below 20% free (capacity_available)?
SELECT 1 synthetic probe: Can a client actually connect and execute over pgwire?
Certificate expiration: Will mutual TLS break with no grace period?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production clusters should target. Survival tells you something is wrong; operational tells you what. With this coverage your team can usually diagnose an incident on its own: storage debt, contention, clock skew, replication risk, connection pressure.

SQL statement latency (P50/P99): Per node, ideally per fingerprint, because a new slow query barely moves P99.
Transaction restart rate by cause: writetooold (schema), readwithinuncertainty (clock), txnpush (app).
SQL error rate by code class: XX000 is an internal error and a genuine fault; 40001 (serialization retry), 53200 (out of memory), and 08006 (connection failure) each point to a different cause.
storage_l0_sublevels per store: The earliest predictor of write stalls, with 10–30 min of warning.
clock_offset_meannanos: How close is any node to the self-termination threshold?
Under-replicated range count: Is the cluster's replication safety margin holding?
round_trip_latency between nodes: Is inter-node RPC health slowing Raft and DistSQL?
CPU and RSS per node: Headroom to absorb the loss of one node, per node not aggregate.
WAL fsync latency per store: The most direct write-path health signal.
Admission control queue depth: Is internal flow control throttling, i.e. are you at capacity?

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up. L0 creeping a sublevel a week, MVCC garbage diverging from live bytes, a protected timestamp aging for days, intents accumulating, a changefeed falling behind. None of these will page you on day one. They become page-out incidents on day thirty.

LSM read amplification per store: rocksdb_read_amplification above 25 means compaction debt.
Pebble write stall count: Any stall during normal workload is abnormal.
Block cache hit ratio: Has the working set outgrown --cache?
MVCC garbage bytes: Is GC keeping pace, or is dead data accumulating silently?
Intent count and bytes: Are abandoned transactions leaving unresolved work?
Protected timestamp count / age: Is a stalled job blocking GC and filling the disk?
Go GC pause duration / CPU: Is GC pressure trending toward a liveness threat?
Range count per node: Is Raft ticking overhead growing as a scaling dimension?
Raft snapshot and lease transfer rate: Is the cluster churning or healing cleanly?
Changefeed lag (if using CDC): Are consumers behind, and is GC at risk downstream?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after a specific incident proved you needed them. Raft proposal drops, per-range request distribution, intent resolution throughput, closed timestamp lag, queue processor errors. Most teams never need every signal here. Add the ones your incident history says you do.

Raft proposal drop rate: Dropped proposals are silently retried writes with latency cost.
Per-range request distribution: Which range is hot, i.e. over 10x the average QPS?
Intent resolution throughput: Is cleanup keeping pace during an intent cascade?
Closed timestamp lag: How fresh can follower reads be?
Queue processor error counts: Split, merge, replicate, and GC queue failures.
Disk stall detection metrics: storage_disk_stalled and storage_disk_slow before utilization reacts.
Admission token exhaustion: Are tokens running out under sustained overload?
SQL plan cache hit rate: Are plan regressions or churn hurting specific endpoints?
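The "over 10x the average QPS" rule for per-range request distribution can be sketched as follows. The input is a hypothetical mapping of range ID to queries per second; how you collect it is left open, and the function name is illustrative.

```python
from statistics import mean


def hot_ranges(qps_by_range: dict[int, float], factor: float = 10.0) -> list[int]:
    """Range IDs whose QPS exceeds `factor` times the average across ranges."""
    if not qps_by_range:
        return []
    average = mean(qps_by_range.values())
    return sorted(r for r, qps in qps_by_range.items() if qps > factor * average)


# 99 quiet ranges and one that takes most of the traffic.
sample = {range_id: 10.0 for range_id in range(1, 100)}
sample[100] = 5_000.0
print(hot_ranges(sample))  # [100]
```

Because the hot range itself raises the average, this test only works when there are many more ranges than the factor; on a small sample a median baseline may be a better choice.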

Operating mistakes worth avoiding

The traps CockroachDB teams keep falling into. Each has a clear, well-known fix. Most teams only learn it after an incident.

Ignoring L0 sublevel count until write stalls hit

Teams watch disk utilization and IOPS but not LSM tree health. <code>storage_l0_sublevels</code> gives 10–30 minutes of warning before Pebble stalls writes, and it's almost never instrumented. Alert when sublevels climb past 10, and treat 20+ and rising as an emergency.
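The alerting rule above can be written down as a small classifier. This is a sketch under simple assumptions: the input is a short window of recent storage_l0_sublevels readings for one store, oldest first, and "rising" just means the latest reading is above the first. The function name is illustrative.

```python
def l0_severity(samples: list[int]) -> str:
    """Classify recent storage_l0_sublevels readings for one store."""
    current = samples[-1]
    rising = len(samples) > 1 and samples[-1] > samples[0]
    if current >= 20 and rising:
        return "emergency"  # 20+ and still climbing
    if current > 10:
        return "alert"      # past 10: compaction is falling behind
    return "ok"


print(l0_severity([4, 5, 6]))     # ok
print(l0_severity([8, 12]))       # alert
print(l0_severity([15, 19, 22]))  # emergency
```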

Not monitoring clock offset proactively

NTP is set-and-forget for most teams, so the first sign of drift is a node self-terminating at 3 a.m. Meanwhile <code>readwithinuncertainty</code> restarts have been quietly inflating tail latency for days. Watch <code>clock_offset_meannanos</code> and ticket at 250 ms, well before the 400 ms self-termination threshold.
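The thresholds above reduce to simple arithmetic on the metric, which is reported in nanoseconds. A minimal sketch, assuming the default 500 ms maximum offset and the 250 ms ticket level suggested here; the function name and return labels are illustrative.

```python
MAX_OFFSET_NS = 500_000_000     # default --max-offset of 500 ms
TICKET_OFFSET_NS = 250_000_000  # ticket level suggested above


def clock_offset_state(offset_ns: int,
                       max_offset_ns: int = MAX_OFFSET_NS,
                       ticket_ns: int = TICKET_OFFSET_NS) -> str:
    """Classify one node's clock_offset_meannanos reading."""
    magnitude = abs(offset_ns)               # drift can be in either direction
    fatal_ns = max_offset_ns * 8 // 10       # 80% of max offset = 400 ms
    if magnitude >= fatal_ns:
        return "critical"
    if magnitude >= ticket_ns:
        return "ticket"
    return "ok"


print(clock_offset_state(100_000_000))   # ok
print(clock_offset_state(-300_000_000))  # ticket
print(clock_offset_state(410_000_000))   # critical
```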

Alarming on total retry rate without breaking down the cause

<code>writetooold</code> means contention (a schema problem), <code>readwithinuncertainty</code> means clock skew (an infra problem), and <code>txnpush</code> means transaction conflicts (an application problem). Treating <code>txn_restarts</code> as one number wastes diagnosis time — each cause needs a different response.
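A sketch of breaking restarts down by cause before alerting. The input is a hypothetical mapping of restart cause to count over some window, and the response strings simply restate the mapping described above.

```python
RESPONSES = {
    "writetooold": "contention: review schema and key design",
    "readwithinuncertainty": "clock skew: check NTP and clock offsets",
    "txnpush": "transaction conflicts: review application transactions",
}


def dominant_restart_cause(restarts: dict[str, int]) -> tuple[str, float, str]:
    """Return (cause, share of all restarts, suggested response)."""
    total = sum(restarts.values())
    if total == 0:
        return ("none", 0.0, "no restarts in this window")
    cause = max(restarts, key=restarts.get)
    response = RESPONSES.get(cause, "unclassified: investigate separately")
    return (cause, restarts[cause] / total, response)


window = {"writetooold": 900, "readwithinuncertainty": 50, "txnpush": 50}
cause, share, response = dominant_restart_cause(window)
print(cause, share, response)
```

Alerting on the share per cause, rather than on the total, routes the page to whoever can actually fix it.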

Watching only aggregate latency instead of per fingerprint

A new slow query among fast ones barely moves cluster P99 but kills the endpoint that runs it. Per-statement-fingerprint tracking catches plan regressions and missing indexes that aggregate <code>sql_service_latency</code> hides.
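A short worked example shows why the aggregate hides the problem. The numbers are made up for illustration: 10,000 fast statements at 5 ms and 50 executions of one new slow fingerprint at 2 seconds, with a simple nearest-rank percentile.

```python
import math


def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]


fast = [5.0] * 10_000  # existing statements, latency in ms
slow = [2_000.0] * 50  # one new slow fingerprint, latency in ms

print(p99(fast + slow))  # 5.0    (cluster-wide P99 does not move)
print(p99(slow))         # 2000.0 (P99 for the affected fingerprint)
```

The slow fingerprint accounts for about 0.5% of executions, so it sits entirely above the 99th percentile of the combined distribution.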

Not monitoring MVCC garbage and protected timestamps

Data gets deleted but nobody checks that GC actually runs. A stalled changefeed or hung backup holds a protected timestamp that blocks GC entirely, and the disk fills with several times the live data in dead MVCC versions and tombstones, completely silent until the store is full.

Using TCP health checks instead of /health?ready=1

A plain TCP check keeps routing traffic to nodes that are draining, write-stalled, or GC-thrashing. CockroachDB exposes <code>/health?ready=1</code>, which returns 503 when a node is impaired. Most deployments never wire their load balancer to it.
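A minimal readiness probe against that endpoint, using only the standard library. The host name is a placeholder, 8080 is assumed to be the node's HTTP port, and a secure cluster would need https plus certificate handling that this sketch leaves out. The `opener` parameter exists only so the function can be exercised without a live node.

```python
import urllib.error
import urllib.request


def node_is_ready(base_url: str, timeout: float = 2.0,
                  opener=urllib.request.urlopen) -> bool:
    """True only if /health?ready=1 answers 200; 503 and network errors are False."""
    url = base_url.rstrip("/") + "/health?ready=1"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError) for 503 responses.
        return False


if __name__ == "__main__":
    # Hypothetical node address; replace with a real one.
    print(node_is_ready("http://cockroach-node-1:8080"))
```

Most load balancers can express the same check natively as an HTTP health check on that path that expects a 200.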

Trusting cluster averages over per-node signals

One hot leaseholder or one overloaded region hides under healthy global metrics. A single node can be melting at 95% CPU while the cluster average looks fine. Always alert per node, never on the aggregate alone.
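A small illustration with made-up numbers: one node at 95% CPU among four moderately loaded ones. The 90% limit and the function name are assumptions for the example.

```python
from statistics import mean


def overloaded_nodes(cpu_by_node: dict[str, float], limit: float = 90.0) -> list[str]:
    """Nodes at or above the CPU limit, evaluated per node."""
    return sorted(node for node, cpu in cpu_by_node.items() if cpu >= limit)


cpu = {"n1": 95.0, "n2": 38.0, "n3": 41.0, "n4": 35.0, "n5": 36.0}
print(mean(cpu.values()))     # 49.0, which looks healthy
print(overloaded_nodes(cpu))  # ['n1']
```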

Treating recovery activity as automatically good

Snapshot and rebalance storms during healing compete with foreground traffic and can degrade the cluster further. Confirm that under-replication is actually decreasing, and throttle background work if recovery I/O is starving live queries.

CockroachDB runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

Start here
Storage engine, Pebble, and the LSM
Raft, node liveness, and availability
Clocks, HLC, and clock skew
Ranges, replication, and rebalancing
Hot ranges and load distribution
Transactions, intents, and contention
Disk, MVCC garbage, and capacity
Memory, GC, and CPU
SQL latency, connections, and admission control
Jobs, CDC, certificates, and security

WHERE TO GO NEXT

Setting up CockroachDB monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.
