CockroachDB node liveness failure: heartbeats, lease redistribution, and flapping
Lease transfer spikes, briefly unavailable ranges, and client errors such as ambiguous results or connection resets indicate node liveness failure. In the logs, nodes transition to not-live and back within seconds. When the cluster decides a node cannot renew its liveness heartbeat, it redistributes leases. If the node recovers fast enough to renew but not fast enough to stay healthy, it flaps: an oscillating state more destructive than a clean outage because it repeatedly interrupts in-flight work and prevents stable failover.
CockroachDB tracks node membership through a liveness record stored in the KV layer. Each node renews its record on an interval. If a node fails to renew before the record expires, the cluster treats it as dead and moves its leases. Ranges for which it was leaseholder become temporarily unavailable until new leaseholders are established. Flapping occurs when the underlying issue is transient: a GC pause, a brief disk stall, or a CPU scheduler delay that lasts just long enough to miss a heartbeat, but not long enough to kill the process.
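The renew-or-expire rule can be sketched as a small model. This is an illustration only, not CockroachDB's implementation: the function name, the `ttl` parameter, and the timestamps are hypothetical, and the real heartbeat interval and expiry depend on your version and settings.

```python
from typing import List, Tuple


def not_live_windows(heartbeats: List[float], ttl: float, end: float) -> List[Tuple[float, float]]:
    """Return (start, stop) intervals during which the liveness record was expired.

    heartbeats: sorted times (seconds) at which the node renewed its record.
    ttl: how long one renewal keeps the record valid (hypothetical value).
    end: end of the observation period.
    """
    windows = []
    for i, renewed_at in enumerate(heartbeats):
        expiry = renewed_at + ttl
        # The record stays valid only if the next renewal lands before expiry.
        next_renewal = heartbeats[i + 1] if i + 1 < len(heartbeats) else end
        if next_renewal > expiry:
            windows.append((expiry, next_renewal))
    return windows


# A stall between t=6 and t=14 outlasts a 6-second validity window by 2 seconds.
print(not_live_windows([0, 3, 6, 14, 17], ttl=6, end=20))
```

In this toy run the node is not-live only from t=12 to t=14, which is the short gap that produces flapping rather than a clean outage.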
What this means
Liveness is binary: a node is live or not-live. When a node unexpectedly becomes not-live, the cluster immediately transfers leases away from it. Each transfer creates a small unavailability window for that range, typically sub-second but visible to clients as latency spikes or retry errors. If the node held many leases, the aggregate impact is a burst of unavailable ranges and a spike in under-replicated counts during reorganization.
Flapping is worse than a clean failure because the node never stays down long enough for clients to fail over cleanly. The cluster repeatedly moves leases away and back, which makes availability oscillate, inflates tail latency, and triggers cascading retries that add load to already stressed nodes. Common root causes include disk stalls, Go GC pauses longer than the heartbeat interval, CPU starvation that prevents Raft goroutines from running, network partitions that isolate the node from the liveness range, and clock skew. A node shuts itself down when its clock differs from at least half of the other nodes by 80% of --max-offset.
A node can heartbeat successfully while write-stalled on one store. Always correlate liveness status with storage, CPU, and network signals. If the liveness range itself loses quorum, the entire cluster can freeze because no node can renew its record.
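One way to tell flapping from a single clean failure is to count live to not-live transitions inside a sliding window. The sketch below assumes you already collect `(timestamp, is_live)` samples for a node; the function name and the 60-second window are hypothetical choices, not CockroachDB settings.

```python
from typing import List, Tuple


def max_drops_in_window(samples: List[Tuple[float, bool]], window: float) -> int:
    """Largest number of live -> not-live transitions within any `window` seconds."""
    # Collect the timestamps at which the node went from live to not-live.
    drops = []
    for (_, prev_live), (t, live) in zip(samples[:-1], samples[1:]):
        if prev_live and not live:
            drops.append(t)

    # Slide a window over the drop timestamps and keep the densest cluster.
    best = 0
    start = 0
    for i, t in enumerate(drops):
        while t - drops[start] > window:
            start += 1
        best = max(best, i - start + 1)
    return best


samples = [(0, True), (10, False), (15, True), (25, False), (30, True), (400, False)]
print(max_drops_in_window(samples, window=60))
```

Two or more drops inside a short window suggest a transient cause worth correlating with storage, CPU, and network signals; a single drop looks more like a clean failure.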
```mermaid
flowchart TD
    A[Disk stall, GC pause, CPU saturation, or network partition] --> B[Node misses liveness heartbeat]
    B --> C[Cluster marks node not-live]
    C --> D[Leases redistribute away from node]
    D --> E[Ranges briefly unavailable]
    E --> F{Node recovers before next expiry?}
    F -->|Yes| G[Node regains leases]
    G --> H[Flapping: oscillating availability]
    F -->|No| I[Sustained unavailability until operator intervenes]
```
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk stall | Node logs show disk stall detection; storage_disk_stalled is nonzero; WAL fsync latency spikes to hundreds of milliseconds | storage_disk_stalled and storage_wal_fsync_latency |
| GC pause thrashing | Go GC pause duration approaches or exceeds the heartbeat interval; RSS is near the container or host limit; GC CPU climbs above 15% | sys_gc_pause_ns, sys_gc_count, and process RSS |
| CPU starvation or overload | Sustained CPU above 80-90%; admission control queues show sustained waits; Raft ticking falls behind | sys_cpu_user_ns, sys_cpu_sys_ns, and admission_wait_durations |
| Network partition or latency spike | RPC round-trip latency between node pairs jumps above 5x baseline; one node is reachable from some peers but not others | round_trip_latency for each node pair |
| Clock skew | clock_offset_meannanos trends toward max-offset; readwithinuncertainty restart rate appears | Clock offset metrics and txn_restarts by cause |
| LSM write stall | storage_l0_sublevels climbs past 20; storage_write_stalls increments; the node cannot write Raft log entries | storage_l0_sublevels and storage_write_stalls |
Quick checks
Run these safe, read-only checks to orient yourself during an incident.
```shell
# Check which nodes are not-live or draining
curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
for n in json.load(sys.stdin)['nodes']:
    print(f'Node {n[\"desc\"][\"node_id\"]}: liveness={n.get(\"liveness\",{}).get(\"liveness\",\"UNKNOWN\")}')"

# Check for unavailable or under-replicated ranges
curl -s http://localhost:8080/_status/vars | grep -E 'ranges_unavailable|ranges_underreplicated'

# Check if the node has detected a disk stall
curl -s http://localhost:8080/_status/vars | grep storage_disk_stalled

# Check Go GC pressure (cumulative counters; compare over time)
curl -s http://localhost:8080/_status/vars | grep -E 'sys_gc_pause_ns|sys_gc_count'

# Check inter-node RPC latency
curl -s http://localhost:8080/_status/vars | grep round_trip_latency

# Check for write stalls and L0 pressure
curl -s http://localhost:8080/_status/vars | grep -E 'storage_write_stalls|storage_l0_sublevels'

# Check CockroachDB process CPU and memory usage
ps -p "$(pgrep -d ',' -x cockroach)" -o pid,pcpu,pmem,cmd
```
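If you want to compare several of these values at once instead of grepping one at a time, the /_status/vars output can be loaded into a dictionary. This is a minimal sketch that assumes each sample line ends with its value and carries no trailing timestamp; the helper name and the sample text are made up for illustration.

```python
from typing import Dict


def parse_vars(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {series: value}, skipping comments."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces stay intact.
        series, _, value = line.rpartition(" ")
        try:
            out[series] = float(value)
        except ValueError:
            continue  # Ignore lines that do not end in a number.
    return out


# Illustrative input only; real output has many more series and labels.
sample = """
# TYPE storage_disk_stalled gauge
storage_disk_stalled{store="1"} 0
ranges_unavailable{store="1"} 3
sys_gc_count 1234
"""
metrics = parse_vars(sample)
print(metrics['ranges_unavailable{store="1"}'])
```

In practice the text would come from the same endpoint the curl commands use, for example via `urllib.request.urlopen("http://localhost:8080/_status/vars").read().decode()`.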
How to diagnose it
1. Confirm the symptom and eliminate planned operations. Query /_status/nodes or crdb_internal.gossip_liveness to identify which nodes are not-live. Check if the node is draining or decommissioning. A graceful drain is expected; an unexpected transition is the problem.
2. Check for disk stalls. Check storage_disk_stalled. A nonzero value means the node has detected an unresponsive storage layer. Check storage_wal_fsync_latency. On SSDs, fsync P99 above 50ms is concerning; above 200ms is critical. Disk stalls prevent heartbeat and Raft log writes.
3. Check Go GC pressure. Compute GC pause trends from sys_gc_pause_ns and sys_gc_count. Pauses approaching the heartbeat interval freeze the process and cause missed heartbeats. Correlate with RSS and sql_mem_root_current to find the memory consumer.
4. Check CPU saturation. Review sys_cpu_user_ns and sys_cpu_sys_ns. Sustained CPU above 80-90% leaves insufficient cycles for per-range Raft ticking and heartbeat renewal. High system CPU with low user CPU points to I/O wait or kernel scheduling issues, not SQL execution load.
5. Check network health. Review round_trip_latency for the affected node against all peers. Spikes above 5x baseline indicate network congestion, NIC saturation, or a partition. Asymmetric partitions (where A reaches B but B cannot reach A) are particularly dangerous and can cause unpredictable liveness behavior.
6. Check clock skew. Review clock_offset_meannanos. Offsets approaching --max-offset (default 500ms) indicate dangerous drift and will eventually trigger self-termination. Check txn_restarts for readwithinuncertainty to confirm clock-driven restarts.
7. Check storage engine pressure. If storage_l0_sublevels is above 20 and climbing, or storage_write_stalls is nonzero, the node is write-stalled. During a stall, the node cannot process Raft log writes, which cascades into liveness loss and lease transfers.
8. Correlate with lease transfer and unavailable range metrics. A spike in successful lease transfers confirms the cluster is reacting to liveness loss. Sustained nonzero ranges_unavailable means the impact has moved from internal rebalancing to client-visible unavailability.
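The ordering of the steps above can be sketched as a single triage function. Everything here is an assumption for illustration: the snapshot keys are hypothetical normalized names (not CockroachDB metric names), the thresholds are the ones quoted in this guide, the 50% clock threshold stands in for "approaching", and the heartbeat interval must be supplied by you.

```python
def first_suspect(m: dict) -> str:
    """Return the first check that trips, in the order of the diagnosis steps."""
    if m.get("draining") or m.get("decommissioning"):
        return "planned operation"
    if m.get("disk_stalled", 0) > 0 or m.get("wal_fsync_p99_ms", 0) > 50:
        return "disk stall"
    # Simplification: flag a pause only once it reaches the full interval.
    if m.get("gc_pause_ms", 0) >= m.get("heartbeat_interval_ms", float("inf")):
        return "gc pause"
    if m.get("cpu_pct", 0) > 80:
        return "cpu saturation"
    if m.get("rtt_vs_baseline", 1) > 5:
        return "network"
    if m.get("clock_offset_ms", 0) > 0.5 * m.get("max_offset_ms", 500):
        return "clock skew"
    if m.get("l0_sublevels", 0) > 20 or m.get("write_stalls", 0) > 0:
        return "lsm write stall"
    return "no suspect"


# CPU is checked before storage engine pressure, so it is reported first.
print(first_suspect({"cpu_pct": 95, "l0_sublevels": 30}))
```

The function returns only the first match, so treat the result as a starting point: several causes often overlap, and a write stall can itself drive CPU and latency symptoms.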
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node liveness status | Binary cluster membership signal | Any unexpected transition to not-live, or draining without a planned operation |
| Range unavailability | Direct user impact | Nonzero value sustained for more than 5 minutes |
| Lease transfer rate | Confirms the cluster is evacuating a node | Rate more than 10x baseline without operational cause |
| WAL fsync latency and disk stall detection | Disk stalls prevent heartbeats and Raft writes | storage_disk_stalled nonzero, or WAL fsync P99 above 50ms on SSDs |
| Go GC pause duration and GC CPU | Long pauses freeze the process during heartbeat windows | Individual pauses above 500ms, or GC CPU above 15% sustained |
| RPC heartbeat latency | Network or scheduling issues isolate the node | Any node pair above 5x baseline sustained |
| Clock offset between nodes | Skew causes restarts and eventual self-termination | Offset above 50% of --max-offset (250ms default) |
| L0 sublevel count and write stalls | Write stalls block Raft log writes | L0 above 20 sustained, or any write stall during normal workload |
| Under-replicated range count | Measures healing progress after node loss | Sustained increase over 30 minutes without operational cause |
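Several warning signs in the table depend on a value staying above a threshold for some time rather than on a single sample. A minimal sketch of that "sustained" test, assuming you have `(timestamp_seconds, value)` samples in time order; the function name and sample data are hypothetical.

```python
from typing import Iterable, Tuple


def sustained(samples: Iterable[Tuple[float, float]], threshold: float, min_duration: float) -> bool:
    """True if the value stays above `threshold` for at least `min_duration` seconds."""
    run_start = None
    for t, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = t  # Start of a new run above the threshold.
            if t - run_start >= min_duration:
                return True
        else:
            run_start = None  # Any sample at or below the threshold resets the run.
    return False


# ranges_unavailable sampled once a minute; nonzero from t=60 through t=360.
unavailable = [(0, 0), (60, 2), (120, 2), (180, 1), (240, 3), (300, 1), (360, 4)]
print(sustained(unavailable, threshold=0, min_duration=300))
```

The same helper works for the 30-minute under-replicated check by passing `min_duration=1800`. It assumes regular sampling; gaps in the data are treated as if the value held.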
Fixes
Disk stall
Do not restart the node immediately. If the underlying volume is a cloud block store (EBS, PD), check for burst credit exhaustion or provider throttling. If storage_disk_stalled is active, the node may self-terminate to protect consistency. Replace or migrate the storage before rejoining the node. Rejoining a node with a still-stalled disk will cause repeated liveness loss.
GC pause thrashing
Identify the memory consumer by comparing sys_go_allocbytes, sys_cgo_allocbytes, and sql_mem_root_current. If a specific query is consuming the SQL memory budget, cancel the session. If the Pebble cache or SQL memory limit is sized too aggressively for the container or host memory, reduce --cache or --max-sql-memory. Restart is a last resort; the goal is to reduce heap pressure so GC pauses shrink below the heartbeat interval.
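Because sys_gc_pause_ns and sys_gc_count are cumulative, a pause estimate comes from the difference between two scrapes. The sketch below computes the mean pause over one interval; the function name and the sample numbers are illustrative, and a mean can hide a single long pause.

```python
def mean_gc_pause_ms(prev: dict, curr: dict) -> float:
    """Mean GC pause in milliseconds between two scrapes of cumulative counters."""
    d_count = curr["sys_gc_count"] - prev["sys_gc_count"]
    d_pause_ns = curr["sys_gc_pause_ns"] - prev["sys_gc_pause_ns"]
    if d_count <= 0:
        # No GC cycles in the interval, or the counters reset after a restart.
        return 0.0
    return d_pause_ns / d_count / 1e6


prev = {"sys_gc_count": 1000, "sys_gc_pause_ns": 5_000_000_000}
curr = {"sys_gc_count": 1010, "sys_gc_pause_ns": 5_400_000_000}
print(mean_gc_pause_ms(prev, curr))  # 10 cycles, 400 ms total pause
```

Track this value across successive scrapes while you reduce heap pressure; a falling trend indicates the change is working.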
CPU starvation
Reduce load by pausing bulk imports, backups, or heavy analytical queries. Check the per-node range count. If a node holds more than 50,000 ranges, Raft ticking alone can consume multiple cores. If admission control queues are deep, the system is protecting itself, but you need more CPU or fewer ranges. Add nodes or reduce per-node replica density.
Network partition
Verify network paths in both directions with ping and NIC error counters. Asymmetric partitions are harder to detect than clean drops; review round_trip_latency from both sides of each pair. Fix the network before restarting. Restarting into an ongoing partition triggers the same liveness loss.
Clock skew
Check NTP status with chronyc tracking or ntpstat on all nodes and fix the time source. Nodes that self-terminated because of excessive clock offset will not stay up until the clock is corrected. Correct the clock before restarting them.
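To judge how close a node is to trouble, express the observed offset as a fraction of --max-offset. This sketch uses the 50% warning level from this guide and the 80% level at which CockroachDB documents that a node shuts itself down. The function name is hypothetical, and clock_offset_meannanos is a mean, so it only approximates the per-peer comparison the node itself performs.

```python
def offset_status(offset_ns: float, max_offset_ms: float = 500.0) -> str:
    """Classify a clock offset (nanoseconds) against --max-offset (milliseconds)."""
    fraction = abs(offset_ns) / 1e6 / max_offset_ms
    if fraction >= 0.8:
        return "critical"  # At or past the documented self-termination level.
    if fraction > 0.5:
        return "warning"   # Past the warning level used in this guide.
    return "ok"


for offset_ns in (100_000_000, 300_000_000, -420_000_000):
    print(offset_ns, offset_status(offset_ns))
```

Pass your own `max_offset_ms` if the cluster was started with a non-default --max-offset.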
LSM write stall
Pause bulk writes and check if a single store is affected. If storage_l0_sublevels is decreasing, allow compaction to catch up. Do not restart a write-stalled node unless it is completely unrecoverable; restarting replays WAL and can extend the stall window. If the stall is cluster-wide, reduce ingestion rate and tighten admission control.
Prevention
- Monitor storage_l0_sublevels and storage_write_stalls as leading indicators. They provide minutes to tens of minutes of warning before liveness fails.
- Size --cache and --max-sql-memory to keep process RSS below 80% of the host or cgroup limit. Leave headroom for the OS page cache, Go GC, and burst allocations.
- Use /health?ready=1 for load balancer health checks, not simple TCP port checks. TCP checks route traffic to nodes that are listening but functionally impaired.
- Monitor clock offset proactively. Do not wait for nodes to self-terminate.
- Keep per-node range count below 50,000 to limit Raft CPU overhead.
- Monitor WAL fsync latency and disk stall detection as early signals of storage path degradation.
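The memory sizing rule above can be checked with simple arithmetic before a node starts. This is a rough sketch under a stated simplification: it counts only the configured --cache and --max-sql-memory sizes, so it is a necessary check rather than a full RSS prediction. The function name is hypothetical.

```python
GIB = 1024 ** 3


def memory_budget(limit_bytes: int, cache_bytes: int, sql_bytes: int, target: float = 0.8):
    """Return (fraction_of_limit, fits) for the configured cache and SQL memory."""
    configured = cache_bytes + sql_bytes
    fraction = configured / limit_bytes
    # Go runtime overhead and burst allocations are not counted here.
    return fraction, fraction <= target


print(memory_budget(64 * GIB, 16 * GIB, 16 * GIB))  # half of the limit
print(memory_budget(64 * GIB, 32 * GIB, 24 * GIB))  # over the 80% target
```

If the second element is False, reduce --cache or --max-sql-memory, or raise the host or cgroup limit, before relying on the node under load.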
How Netdata helps
- Correlates CockroachDB liveness transitions with host-level disk latency, fsync times, and CPU scheduler metrics to surface disk stalls and CPU starvation.
- Tracks Go GC pause patterns alongside node liveness status to expose GC thrashing that causes flapping.
- Monitors per-node RPC round-trip latency to distinguish network partitions from storage-driven heartbeat loss.
- Surfaces L0 sublevel growth and write stall counts on the same timeline as SQL latency and lease transfer rates, revealing storage pressure before it cascades.
- Supports composite alerts combining disk stall detection, lease transfer spikes, and unavailable range counts.
Related guides
- CockroachDB compaction backlog growing: when Pebble can’t keep pace with writes
- CockroachDB storage_l0_sublevels climbing: the earliest warning of write stalls
- CockroachDB LSM compaction death spiral: L0 sublevels, read amplification, and write stalls
- CockroachDB monitoring maturity model: from survival to expert
- CockroachDB Pebble write stalls: when the storage engine refuses writes
- How CockroachDB actually works in production: a mental model for operators







