CockroachDB Raft liveness failure cascade: slow node, lost leases, rolling unavailability

You notice brief, repeating SQL timeouts or ambiguous result errors across your application. In the CockroachDB DB Console, one node is still marked as running, yet its range lease count is oscillating and lease transfers are spiking. This is not a clean node crash. It is a Raft liveness failure cascade: a node has become slow enough to miss Raft heartbeats, but not dead enough to fail permanently. The cluster keeps trying to move leases away and back, creating rolling windows of unavailability.

Unlike a clean failure, where a node drops and the cluster redistributes leases once, a slow node flaps. Go GC pauses, disk stalls, CPU saturation, or network asymmetry prevent the node from processing Raft ticks and heartbeat responses inside the liveness expiry window. The cluster revokes its leases. If the node recovers moments later, it may reacquire leases, only to lose them again on the next stall. Each transfer introduces a small unavailability window, and in aggregate your cluster feels unstable even though no node is fully down.

The key operational distinction is speed, not state. A node can be up, responsive to SSH, and even answering some SQL queries while being too slow to maintain thousands of Raft consensus relationships. This article explains how to identify the root cause, break the cycle, and prevent recurrence.

What this means

CockroachDB divides the keyspace into ranges, each growing up to 512 MiB by default before it splits. Every range is a Raft consensus group with multiple replicas. One replica is the leaseholder, serving reads and coordinating writes, and one is the Raft leader driving consensus. In the common case, leaseholder and Raft leader are co-located. A node with tens of thousands of ranges runs that many Raft state machines concurrently, each requiring regular CPU for ticking and heartbeats.

When a node becomes slow, it may fail to renew its node liveness record or respond to Raft heartbeats before expiry. The default heartbeat interval is on the order of a few seconds, with an expiry roughly double that. If a pause exceeds that window, the cluster treats the node as not-live. Once liveness is lost, the node drops its leases. Other nodes must acquire them, causing a brief unavailability window for each affected range.
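The expiry arithmetic above can be sketched as a small model. This is a simplification under stated assumptions: the 3-second interval and 6-second expiry are illustrative placeholders (the text only says "a few seconds" and "roughly double"), the function names are hypothetical, and the model ignores the time the heartbeat write itself takes.

```python
def liveness_lapses(stall_start_s: float, stall_duration_s: float,
                    heartbeat_interval_s: float = 3.0,
                    expiry_s: float = 6.0) -> bool:
    """Return True if a stall lets the liveness record expire.

    Times are measured from the last successful heartbeat. The defaults
    are illustrative assumptions, not authoritative CockroachDB settings.
    """
    # The node cannot heartbeat until the stall ends, and it would not
    # heartbeat before its next scheduled tick anyway.
    stall_end = stall_start_s + stall_duration_s
    next_heartbeat = max(stall_end, heartbeat_interval_s)
    # The record from the last heartbeat is only valid until expiry_s.
    return next_heartbeat > expiry_s


def max_safe_stall(heartbeat_interval_s: float = 3.0,
                   expiry_s: float = 6.0) -> float:
    """Longest stall tolerated in the worst case, where the stall begins
    just before the next heartbeat was due."""
    return expiry_s - heartbeat_interval_s


print(liveness_lapses(2.9, 3.5))  # stall ends after the record expired
print(liveness_lapses(0.5, 1.0))  # stall ends well before the next tick
print(max_safe_stall())
```

The point of the model is that the tolerable pause is the gap between the interval and the expiry, not the expiry itself, because a stall can begin just before a heartbeat was due.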

If the slow node recovers and rejoins fully, the allocator may eventually move leases back, especially if it is less loaded than peers. If the underlying slowness persists, the node enters a flap cycle: lose leases, recover, regain leases, stall again. This is worse than a clean failure because the cluster never reaches a steady state. The cascade also strains healthy nodes, which must handle additional leaseholder work, send and receive more snapshots, and process rebalancing decisions while the impaired node oscillates. Under extreme conditions, the extra load can push a second node into saturation, widening the blast radius.

flowchart TD
    A[Node slows: GC disk or CPU] --> B[Missed Raft heartbeats]
    B --> C[Raft leader steps down]
    C --> D[Cluster transfers leases]
    D --> E[Brief unavailability per range]
    E --> F{Node recovers}
    F -->|Yes| G[Reacquires leases]
    G --> A
    F -->|No| H[Clean failure: leases settle]
    A --> I[Flapping liveness]
    I --> J[Rolling unavailability windows]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Go GC thrashing / memory pressure | RSS near limit, GC CPU climbing, GC pauses over 500 ms, liveness flapping every few minutes | sys_rss, sys_gc_pause_ns, sys_go_allocbytes |
| Disk stall or slow WAL fsync | storage_disk_stalled nonzero, WAL fsync P99 over 50 ms on SSD, Pebble write stalls appearing | storage_wal_fsync_latency, storage_disk_stalled, iostat |
| CPU saturation from Raft overhead | CPU over 80% sustained, RPC latency elevated on all pairs to this node, admission control elastic-cpu queue | sys_cpu_user_ns, round_trip_latency, queue depth |
| Network partition or asymmetry | One-way RPC timeouts, round_trip_latency over 5x baseline, Raft election messages in logs | round_trip_latency, node logs for partition errors |
| Pebble LSM write stall | storage_write_stalls nonzero, storage_l0_sublevels over 20, store-write admission queue backing up | storage_l0_sublevels, storage_write_stalls |

Quick checks

These commands are read-only and safe to run during an incident.

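# Check node liveness status (run repeatedly to spot flapping)
curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
d = json.load(sys.stdin)
# Liveness is reported per node ID at the top level; key casing may vary by version
live = d.get('livenessByNodeId') or d.get('liveness_by_node_id') or {}
for n in d['nodes']:
    nid = n['desc'].get('nodeId', n['desc'].get('node_id'))
    print(f'Node {nid}: liveness={live.get(str(nid), \"UNKNOWN\")}')"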
# Check lease transfer rate for unusual churn
curl -s http://localhost:8080/_status/vars | grep leases_transfers_success
# Check for unavailable ranges
curl -s http://localhost:8080/_status/vars | grep ranges_unavailable
# Check GC pause pressure and frequency
curl -s http://localhost:8080/_status/vars | grep -E 'sys_gc_pause_ns|sys_gc_count'
# Check disk stall and WAL fsync latency
curl -s http://localhost:8080/_status/vars | grep -E 'storage_disk_stalled|storage_wal_fsync_latency'
# Check inter-node RPC latency for asymmetry or spikes
curl -s http://localhost:8080/_status/vars | grep round_trip_latency
# Check CPU utilization of the CockroachDB process
top -p $(pgrep -x cockroach) -bn1 | tail -1
# Check disk I/O latency and utilization
iostat -xz 1 3
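
If you would rather collect several of these metrics in one pass than grep for each, the text served by /_status/vars can be parsed directly. The following is a minimal sketch, assuming the endpoint returns Prometheus text exposition format with simple labels such as `{store="1"}`. The `parse_vars` helper and the sample input are hypothetical, and the sample values are made up for illustration.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_vars(text, wanted):
    """Sum all samples of each wanted metric (for example, across stores).

    Matching is by exact metric name, so histogram series such as
    *_bucket are skipped unless requested by their full name.
    """
    totals = {}
    for line in text.splitlines():
        if not line or line.startswith('#'):  # skip HELP/TYPE comments
            continue
        m = _SAMPLE.match(line)
        if not m or m.group(1) not in wanted:
            continue
        totals[m.group(1)] = totals.get(m.group(1), 0.0) + float(m.group(3))
    return totals


# Made-up sample resembling the endpoint's output.
sample = """\
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 0
ranges_unavailable{store="2"} 3
leases_transfers_success{store="1"} 120
sys_gc_pause_ns 4.5e+07
"""
print(parse_vars(sample, {"ranges_unavailable", "leases_transfers_success"}))
```

In practice you would feed it the body fetched from the endpoint, for example by piping the curl output into a script that reads standard input. Summing across stores suits count-style metrics; latency series need per-store handling instead.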

How to diagnose it

  1. Confirm flapping versus clean failure. Run the liveness check. If the node’s epoch increments repeatedly while the process stays up, it is flapping. A clean failure shows one epoch jump and then stable loss.
  2. Map liveness events to resource events. Overlay node liveness transitions with Go GC pause duration, CPU utilization, and WAL fsync latency. If GC pauses approach the heartbeat interval and correlate exactly with liveness loss, the root cause is memory pressure.
  3. Check for disk stalls. Any nonzero storage_disk_stalled is definitive. Also look for storage_write_stalls counter increments. Disk stalls block Raft log writes, which prevents heartbeat responses.
  4. Inspect RPC latency. Elevated round_trip_latency to or from the affected node suggests network or scheduling delay. If the node is CPU-saturated, its end of the RPC will be slow even if the network is fine.
  5. Quantify the blast radius. Check ranges_unavailable and leases_transfers_success. Sustained nonzero unavailable ranges mean the cascade is affecting reads and writes. Spiking lease transfers without planned rebalancing confirm churn.
  6. Read logs for Raft messages. Look for Raft election timeouts, “pebble: write stall”, or “disk stall” messages near the liveness loss timestamps. These tie the cascade to a specific subsystem.
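
The distinction in step 1 can be expressed as a small classifier over periodic liveness observations. This is a sketch only: the `classify_liveness` function, the `(epoch, is_live)` sample shape, and the `min_flaps` threshold are assumptions for illustration, and how you collect the samples is left to your tooling.

```python
def classify_liveness(samples, min_flaps=2):
    """Classify one node's liveness history.

    samples: chronological list of (epoch, is_live) observations.
    min_flaps: epoch increments within the window that count as flapping
               (an arbitrary illustrative threshold).
    """
    if not samples:
        return "unknown"
    # Count how many times the epoch moved forward between observations.
    increments = sum(
        1 for (prev, _), (cur, _) in zip(samples, samples[1:]) if cur > prev
    )
    final_live = samples[-1][1]
    if increments >= min_flaps:
        return "flapping"          # repeated epoch bumps
    if increments == 1 and not final_live:
        return "clean failure"     # one jump, then stable loss
    if increments == 0 and final_live:
        return "stable"
    return "inconclusive"


print(classify_liveness([(5, True), (6, True), (7, True), (8, True)]))
print(classify_liveness([(5, True), (6, False), (6, False)]))
```

A longer observation window makes the result more trustworthy, because a single epoch bump followed by recovery falls into the inconclusive bucket until more samples arrive.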

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Node liveness status | Binary cluster membership | Transition to not-live, or epoch incrementing repeatedly |
| Range lease transfer rate | Measures cluster churn and instability | Over 10x baseline without operational cause |
| Range unavailability count | Direct user impact | Any nonzero value sustained for more than a few minutes |
| Go GC pause duration | GC pauses block Raft processing | Pauses approaching the heartbeat interval, or over 500 ms |
| WAL fsync latency / disk stall | Write path health directly affects Raft | storage_disk_stalled nonzero, or fsync P99 over 50 ms on SSD |
| Inter-node RPC latency | Captures network and scheduling health | Over 5x baseline sustained |
| CPU utilization | Raft ticking is per-range and CPU-intensive | Sustained over 80% |
| LSM L0 sublevel count | Storage engine backpressure stalls writes | Over 20 sustained with an upward trend |
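
The warning signs in the table can be turned into a simple threshold check. The sketch below is illustrative: the `warning_signs` function and its dictionary keys are hypothetical names, not CockroachDB metric names, and the inputs are assumed to be values you have already averaged over a window, since the function does not model the "sustained" qualifier.

```python
def warning_signs(current, baseline):
    """Return the table's warning signs that the current values trip.

    current / baseline: dicts with hypothetical keys, pre-aggregated
    over whatever window you treat as "sustained".
    """
    flags = []
    if current["lease_transfers_per_s"] > 10 * baseline["lease_transfers_per_s"]:
        flags.append("lease transfer churn")
    if current["ranges_unavailable"] > 0:
        flags.append("unavailable ranges")
    if current["gc_pause_ms"] > 500:
        flags.append("long GC pauses")
    if current["disk_stalled"] > 0 or current["wal_fsync_p99_ms"] > 50:
        flags.append("slow or stalled disk")
    if current["rpc_latency_ms"] > 5 * baseline["rpc_latency_ms"]:
        flags.append("elevated RPC latency")
    if current["cpu_fraction"] > 0.80:
        flags.append("CPU saturation")
    if current["l0_sublevels"] > 20:
        flags.append("L0 sublevel backlog")
    return flags


baseline = {"lease_transfers_per_s": 2.0, "rpc_latency_ms": 1.0}
sick = {
    "lease_transfers_per_s": 40.0, "ranges_unavailable": 0,
    "gc_pause_ms": 800, "disk_stalled": 0, "wal_fsync_p99_ms": 4,
    "rpc_latency_ms": 1.2, "cpu_fraction": 0.90, "l0_sublevels": 3,
}
print(warning_signs(sick, baseline))
```

Several flags firing together usually narrows the cause: in the sample above, churn plus GC pauses plus CPU points toward memory pressure rather than disk or network.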

Fixes

Go GC thrashing or memory pressure

Find the memory consumer. Check sys_go_allocbytes, sys_cgo_allocbytes, and sql_mem_root_current. If SQL memory is the culprit, cancel the expensive query. If --max-sql-memory or --cache is sized too aggressively for the container or host limit, reduce them. Trade-off: lowering the block cache increases Pebble read amplification and can raise KV latency.
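
To judge whether the configured sizes are too aggressive, a rough budget check can help. This sketch assumes a simple rule of thumb, not an official formula: the cache plus the SQL memory budget should fit within a target fraction of the host or cgroup limit. The `memory_budget` function is hypothetical, and the 0.75 default is borrowed from the RSS guidance in the Prevention section. Actual RSS will be higher than the configured sum because of Go heap overhead and other allocations.

```python
GIB = 1024 ** 3


def memory_budget(limit_bytes, cache_bytes, max_sql_memory_bytes,
                  target_fraction=0.75):
    """Compare configured memory against a target share of the limit.

    target_fraction is an assumed rule of thumb, not an official value.
    """
    configured = cache_bytes + max_sql_memory_bytes
    budget = target_fraction * limit_bytes
    return {
        "configured_gib": configured / GIB,
        "budget_gib": budget / GIB,
        "headroom_gib": (budget - configured) / GIB,
        "over_budget": configured > budget,
    }


# Example: 16 GiB container with 8 GiB cache and 8 GiB SQL memory.
print(memory_budget(16 * GIB, 8 * GIB, 8 * GIB))
```

A negative headroom figure suggests the flags leave too little room for GC spikes and the OS page cache on that host.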

Disk stall or slow fsync

If the underlying storage is a cloud volume that has exhausted burst credits, increase provisioned IOPS or throughput. If a physical disk is degrading, decommission the node. Separating the WAL onto a dedicated fast device removes the most latency-sensitive I/O from compaction contention. Trade-off: extra volume cost and reconfiguration.

CPU saturation from Raft overhead

Add nodes to reduce ranges per node. Raft ticking overhead scales with range count; a node with 50,000 ranges needs significantly more CPU than one with 5,000. Trade-off: rebalancing generates snapshot traffic and temporarily increases load on peers.
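
A back-of-the-envelope calculation shows how much adding nodes helps. The sketch assumes replicas are spread perfectly evenly, which real clusters only approximate. The function names, the replication factor of 3, and the per-node target are illustrative assumptions.

```python
def replicas_per_node(total_ranges, replication_factor, nodes):
    """Average replicas each node hosts, assuming perfectly even balance."""
    return total_ranges * replication_factor / nodes


def nodes_needed(total_ranges, replication_factor, target_per_node):
    """Smallest node count that keeps the average at or below the target."""
    total_replicas = total_ranges * replication_factor
    # Ceiling division without floating point.
    return -(-total_replicas // target_per_node)


# 50,000 ranges with 3 replicas each on a 3-node cluster:
print(replicas_per_node(50_000, 3, 3))
# Nodes required to halve the per-node replica count:
print(nodes_needed(50_000, 3, 25_000))
```

Because every replica participates in Raft, the per-node figure counts replicas rather than ranges, so the replication factor multiplies the ticking work.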

Pebble LSM write stall

Reduce the write rate immediately. Pause bulk imports, backups, or large schema backfills. Tighten admission control if necessary. If L0 is elevated but trending downward, allow compaction to catch up rather than restarting the node. Trade-off: pausing ingest impacts application throughput.

Network asymmetry

There is no in-cluster fix for a broken network path. Isolate the bad link, switch, or NIC. In asymmetric partitions, the node may see some peers but not others, which causes especially unpredictable lease behavior. Trade-off: rerouting traffic may require infrastructure changes outside the database.

Prevention

  • Monitor L0 sublevels, GC pause duration, and RPC latency as leading indicators. Do not wait for node liveness to alert you.
  • Keep steady-state CPU below 60-70% so per-range Raft work and GC have headroom.
  • Use /health?ready=1 for load balancer health checks, not simple TCP connects, to prevent routing traffic to a node that is alive but stalled.
  • Ensure NTP or cloud time services are healthy and monitor clock offset; clock skew causes transaction uncertainty restarts, which add load.
  • Size memory so RSS stays below 75% of the host or cgroup limit, leaving room for GC spikes and OS page cache.
  • Watch protected timestamp records and MVCC garbage bytes. Silent disk growth from blocked GC eventually leads to compaction stall, which triggers this cascade.

How Netdata helps

  • Netdata collects sys_gc_pause_ns, sys_rss, cpu.utilization, and storage_l0_sublevels with per-second resolution, letting you correlate GC spikes or storage stalls with lease transfer spikes on the same timeline.
  • The built-in CockroachDB collector exposes ranges_unavailable, leases_transfers_success, round_trip_latency, and storage_disk_stalled without requiring manual curl commands during an incident.
  • Composite charts let you overlay Raft heartbeat latency, disk fsync duration, and CPU saturation to quickly distinguish a network partition from a disk stall or GC thrash.
  • Anomaly detection on inter-node RPC paths flags latency deviations before they trigger liveness loss.