CockroachDB node liveness failure: heartbeats, lease redistribution, and flapping

Lease transfer spikes, briefly unavailable ranges, and client errors such as ambiguous results or connection resets are the typical signature of node liveness failure. In the logs, nodes transition to not-live and back within seconds. When a node fails to renew its liveness record, the cluster redistributes its leases. If the node recovers fast enough to renew but not fast enough to stay healthy, it flaps: an oscillating state that is more destructive than a clean outage because it repeatedly interrupts in-flight work and prevents stable failover.

CockroachDB tracks node membership through a liveness record stored in the KV layer. Each node renews its record on an interval. If a node misses the expiry window, the cluster treats it as dead and moves its leases. Ranges for which it was leaseholder become temporarily unavailable until new leaseholders are established. Flapping occurs when the underlying issue is transient: a GC pause, a brief disk stall, or CPU scheduler delay that lasts just long enough to miss a heartbeat, but not long enough to kill the process.
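The renewal-and-expiry behaviour described above can be sketched as a small simulation. This is a simplified model, not CockroachDB's implementation: `simulate_liveness` is a hypothetical helper, and the heartbeat and expiry values in the usage note are illustrative placeholders rather than product defaults.

```python
def simulate_liveness(heartbeat_ms, expiry_ms, pauses, duration_ms, step_ms=100):
    """Return (start_ms, end_ms) spans during which the liveness record is expired.

    heartbeat_ms: time between renewal attempts (illustrative value).
    expiry_ms:    how long a renewed record stays valid (illustrative value).
    pauses:       (start_ms, end_ms) spans in which the process is frozen
                  (GC pause, disk stall, scheduler delay) and cannot renew.
    """
    def frozen(t):
        return any(start <= t < end for start, end in pauses)

    not_live = []
    expires_at = expiry_ms        # assume the record was renewed at t=0
    next_heartbeat = heartbeat_ms
    down_since = None
    for t in range(0, duration_ms + 1, step_ms):
        # A due renewal succeeds only when the process is not frozen.
        if t >= next_heartbeat and not frozen(t):
            expires_at = t + expiry_ms
            next_heartbeat = t + heartbeat_ms
        expired = t >= expires_at
        if expired and down_since is None:
            down_since = t
        elif not expired and down_since is not None:
            not_live.append((down_since, t))
            down_since = None
    if down_since is not None:
        not_live.append((down_since, duration_ms))
    return not_live
```

With a 3000 ms heartbeat and a 6000 ms expiry, a one-second pause produces no not-live span, while a pause from 4000 ms to 11000 ms produces a single span from 9000 ms to 11000 ms. That second case is the transient failure that flapping is built from: long enough to expire the record, short enough that the process survives.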

What this means

Liveness is binary: a node is live or not-live. When a node unexpectedly becomes not-live, the cluster immediately transfers leases away from it. Each transfer creates a small unavailability window for that range, typically sub-second but visible to clients as latency spikes or retry errors. If the node held many leases, the aggregate impact is a burst of unavailable ranges and a spike in under-replicated counts during reorganization.

Flapping is worse than a clean failure because the node never stays down long enough for clients to fail over cleanly. The cluster repeatedly moves leases away and back, which makes availability oscillate, inflates tail latency, and triggers cascading retries that add load to already stressed nodes. Common root causes include disk stalls, Go GC pauses longer than the heartbeat interval, CPU starvation that prevents Raft goroutines from running, network partitions that isolate the node from the liveness range, and clock skew that grows large enough, relative to --max-offset, for the node to shut itself down.
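To make the distinction between a clean outage and flapping concrete, the following sketch classifies a series of liveness observations. The function names and the threshold of two drops are assumptions chosen for illustration; pick a window and threshold that match your own polling interval.

```python
def count_drops(samples):
    """Count live -> not-live transitions.

    samples: chronological list of (timestamp_seconds, is_live) tuples.
    """
    return sum(
        1
        for (_, prev_live), (_, cur_live) in zip(samples, samples[1:])
        if prev_live and not cur_live
    )


def classify(samples, flap_threshold=2):
    """Label a window of observations as stable, recovered, down, or flapping."""
    if not samples:
        raise ValueError("need at least one observation")
    drops = count_drops(samples)
    if drops >= flap_threshold:
        return "flapping"
    if not samples[-1][1]:
        return "down"
    return "recovered" if drops else "stable"
```

A node that drops once and stays down is reported as `down`, which is the case clients can fail over from cleanly. Repeated drops inside the same window are reported as `flapping`, regardless of the node's state at the end of the window.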

A node can heartbeat successfully while write-stalled on one store. Always correlate liveness status with storage, CPU, and network signals. If the liveness range itself loses quorum, the entire cluster can freeze because no node can renew its record.

flowchart TD
    A[Disk stall, GC pause, CPU saturation, or network partition] --> B[Node misses liveness heartbeat]
    B --> C[Cluster marks node not-live]
    C --> D[Leases redistribute away from node]
    D --> E[Ranges briefly unavailable]
    E --> F{Node recovers before next expiry?}
    F -->|Yes| G[Node regains leases]
    G --> H[Flapping: oscillating availability]
    F -->|No| I[Sustained unavailability until operator intervenes]

Common causes

Cause	What it looks like	First thing to check
Disk stall	Node logs show disk stall detection; `storage_disk_stalled` is nonzero; WAL fsync latency spikes to hundreds of milliseconds	`storage_disk_stalled` and `storage_wal_fsync_latency`
GC pause thrashing	Go GC pause duration approaches or exceeds the heartbeat interval; RSS is near the container or host limit; GC CPU climbs above 15%	`sys_gc_pause_ns`, `sys_gc_count`, and process RSS
CPU starvation or overload	Sustained CPU above 80-90%; admission control queues show sustained waits; Raft ticking falls behind	`sys_cpu_user_ns`, `sys_cpu_sys_ns`, and `admission_wait_durations`
Network partition or latency spike	RPC round-trip latency between node pairs jumps above 5x baseline; one node is reachable from some peers but not others	`round_trip_latency` for each node pair
Clock skew	`clock_offset_meannanos` trends toward max-offset; `readwithinuncertainty` restart rate appears	Clock offset metrics and `txn_restarts` by cause
LSM write stall	`storage_l0_sublevels` climbs past 20; `storage_write_stalls` increments; the node cannot write Raft log entries	`storage_l0_sublevels` and `storage_write_stalls`

Quick checks

Run these safe, read-only checks to orient yourself during an incident.

# Check which nodes are not-live or draining
curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
data = json.load(sys.stdin)
# Assumes a top-level livenessByNodeId map; values may be numeric enum codes.
for node_id, status in sorted(data.get('livenessByNodeId', {}).items(), key=lambda kv: int(kv[0])):
    print(f'Node {node_id}: liveness={status}')"

# Check for unavailable or under-replicated ranges
curl -s http://localhost:8080/_status/vars | grep -E 'ranges_unavailable|ranges_underreplicated'

# Check if the node has detected a disk stall
curl -s http://localhost:8080/_status/vars | grep storage_disk_stalled

# Check Go GC pressure (cumulative counters; compare over time)
curl -s http://localhost:8080/_status/vars | grep -E 'sys_gc_pause_ns|sys_gc_count'

# Check inter-node RPC latency
curl -s http://localhost:8080/_status/vars | grep round_trip_latency

# Check for write stalls and L0 pressure
curl -s http://localhost:8080/_status/vars | grep -E 'storage_write_stalls|storage_l0_sublevels'

# Check CockroachDB process CPU and memory usage
ps -p "$(pgrep -d ',' -x cockroach)" -o pid,pcpu,pmem,cmd
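If you would rather collect all of these signals in one pass than run several grep commands, a small parser for the Prometheus text format served by /_status/vars is enough. This is a minimal sketch: `parse_vars` is a hypothetical helper, it assumes each sample line ends with the value (no trailing timestamp), and it matches exact metric names, so histogram series with suffixes such as `_bucket` need to be requested by their full name.

```python
def parse_vars(text, wanted):
    """Return {series: value} for samples whose metric name is in `wanted`.

    text:   body of a Prometheus text-format response.
    wanted: set of exact metric names, e.g. {"ranges_unavailable"}.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like: name{label="value",...} 123
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name in wanted:
            samples[series] = float(value)
    return samples
```

Feed it the response body from the curl commands above and the set of metric names you care about. Keeping the labels in the key preserves the per-store breakdown, which matters because a single stalled store can hide behind a healthy node-level view.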

How to diagnose it

Confirm the symptom and eliminate planned operations. Query /_status/nodes or crdb_internal.gossip_liveness to identify which nodes are not-live. Check if the node is draining or decommissioning. A graceful drain is expected; an unexpected transition is the problem.
Check for disk stalls. Inspect storage_disk_stalled first: a nonzero value means the node has detected an unresponsive storage layer. Then look at storage_wal_fsync_latency. On SSDs, fsync P99 above 50ms is concerning; above 200ms is critical. Disk stalls prevent heartbeat and Raft log writes.
Check Go GC pressure. Compute GC pause trends from sys_gc_pause_ns and sys_gc_count. Pauses approaching the heartbeat interval freeze the process and cause missed heartbeats. Correlate with RSS and sql_mem_root_current to find the memory consumer.
Check CPU saturation. Review sys_cpu_user_ns and sys_cpu_sys_ns. Sustained CPU above 80-90% leaves insufficient cycles for per-range Raft ticking and heartbeat renewal. High system CPU with low user CPU points to kernel-side work such as I/O handling or scheduling overhead, not SQL execution load.
Check network health. Review round_trip_latency for the affected node against all peers. Spikes above 5x baseline indicate network congestion, NIC saturation, or a partition. Asymmetric partitions (where A reaches B but B cannot reach A) are particularly dangerous and can cause unpredictable liveness behavior.
Check clock skew. Review clock_offset_meannanos. Offsets approaching --max-offset (default 500ms) indicate dangerous drift and will eventually trigger self-termination. Check txn_restarts for readwithinuncertainty to confirm clock-driven restarts.
Check storage engine pressure. If storage_l0_sublevels is above 20 and climbing, or storage_write_stalls is nonzero, the node is write-stalled. During a stall, the node cannot process Raft log writes, which cascades into liveness loss and lease transfers.
Correlate with lease transfer and unavailable range metrics. A spike in successful lease transfers confirms the cluster is reacting to liveness loss. Sustained nonzero ranges_unavailable means the impact has moved from internal rebalancing to client-visible unavailability.
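The GC step above relies on cumulative counters, so a single reading says little; the signal is in the difference between two readings. A minimal sketch of that calculation follows. The function names are hypothetical, and the result is an average over the interval, which can hide one long pause among many short ones.

```python
def mean_gc_pause_ms(before, after):
    """Average GC pause in milliseconds between two scrapes.

    before/after: dicts holding the cumulative counters
    'sys_gc_pause_ns' and 'sys_gc_count' from two points in time.
    """
    d_pause = after["sys_gc_pause_ns"] - before["sys_gc_pause_ns"]
    d_count = after["sys_gc_count"] - before["sys_gc_count"]
    if d_pause < 0 or d_count < 0:
        raise ValueError("counter went backwards; the process probably restarted")
    if d_count == 0:
        return 0.0
    return d_pause / d_count / 1e6


def gc_pause_fraction(before, after, interval_seconds):
    """Fraction of wall-clock time spent paused for GC during the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    d_pause = after["sys_gc_pause_ns"] - before["sys_gc_pause_ns"]
    return d_pause / (interval_seconds * 1e9)
```

Compare the mean pause against your heartbeat interval and watch the trend across several intervals rather than a single pair of scrapes.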

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Node liveness status	Binary cluster membership signal	Any unexpected transition to not-live, or draining without a planned operation
Range unavailability	Direct user impact	Nonzero value sustained for more than 5 minutes
Lease transfer rate	Confirms the cluster is evacuating a node	Rate more than 10x baseline without operational cause
WAL fsync latency and disk stall detection	Disk stalls prevent heartbeats and Raft writes	`storage_disk_stalled` nonzero, or WAL fsync P99 above 50ms on SSDs
Go GC pause duration and GC CPU	Long pauses freeze the process during heartbeat windows	Individual pauses above 500ms, or GC CPU above 15% sustained
RPC heartbeat latency	Network or scheduling issues isolate the node	Any node pair above 5x baseline sustained
Clock offset between nodes	Skew causes restarts and eventual self-termination	Offset above 50% of `--max-offset` (250ms default)
L0 sublevel count and write stalls	Write stalls block Raft log writes	L0 above 20 sustained, or any write stall during normal workload
Under-replicated range count	Measures healing progress after node loss	Sustained increase over 30 minutes without operational cause
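The warning signs in the table can be expressed as a single point-in-time check. The sketch below is illustrative only: the dictionary keys are hypothetical names for values you would derive from the metrics above, and it ignores the "sustained" qualifiers, which require evaluating the same check over a time window.

```python
def warning_signs(m):
    """Return the labels of every warning sign tripped by the readings in `m`."""
    checks = [
        ("disk stall detected", m["disk_stalled"] > 0),
        ("WAL fsync P99 above 50ms", m["wal_fsync_p99_ms"] > 50),
        ("GC pause above 500ms", m["max_gc_pause_ms"] > 500),
        ("RPC latency above 5x baseline",
         m["rtt_ms"] > 5 * m["rtt_baseline_ms"]),
        ("clock offset above 50% of max-offset",
         abs(m["clock_offset_ms"]) > 0.5 * m["max_offset_ms"]),
        ("L0 sublevels above 20", m["l0_sublevels"] > 20),
        ("write stall observed", m["write_stalls_delta"] > 0),
        ("lease transfers above 10x baseline",
         m["lease_transfers_per_s"] > 10 * m["lease_baseline_per_s"]),
    ]
    return [label for label, tripped in checks if tripped]
```

Returning every tripped label, rather than stopping at the first, matters here because these causes compound: a disk stall and a lease transfer spike appearing together tell a different story than either alone.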

Fixes

Disk stall

Do not restart the node immediately. If the underlying volume is a cloud block store (EBS, PD), check for burst credit exhaustion or provider throttling. If storage_disk_stalled is active, the node may self-terminate to protect consistency. Replace or migrate the storage before rejoining the node. Rejoining a node with a still-stalled disk will cause repeated liveness loss.

GC pause thrashing

Identify the memory consumer by comparing sys_go_allocbytes, sys_cgo_allocbytes, and sql_mem_root_current. If a specific query is consuming the SQL memory budget, cancel the session. If the Pebble cache or SQL memory limit is sized too aggressively for the container or host memory, reduce --cache or --max-sql-memory. Restart is a last resort; the goal is to reduce heap pressure so GC pauses shrink below the heartbeat interval.
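One way to compare those three readings is to express each as a share of process RSS and rank them. This is a rough sketch with a hypothetical helper name; the categories are not guaranteed to be disjoint, so treat the shares as a ranking aid rather than an exact breakdown.

```python
def rank_memory_consumers(rss_bytes, readings):
    """Rank memory readings by their share of process RSS, largest first.

    readings: dict such as {"sys_go_allocbytes": ..., "sys_cgo_allocbytes": ...,
              "sql_mem_root_current": ...} with values in bytes.
    """
    if rss_bytes <= 0:
        raise ValueError("rss_bytes must be positive")
    shares = [(name, value / rss_bytes) for name, value in readings.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)
```

If `sql_mem_root_current` ranks high, look for a specific session to cancel; if `sys_cgo_allocbytes` dominates, the cache sizing is the more likely lever.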

CPU starvation

Reduce load by pausing bulk imports, backups, or heavy analytical queries. Check the per-node range count. If a node holds more than 50,000 ranges, Raft ticking alone can consume multiple cores. If admission control queues are deep, the system is protecting itself, but you need more CPU or fewer ranges. Add nodes or reduce per-node replica density.

Network partition

Verify network paths in both directions with ping and NIC error counters. Asymmetric partitions are harder to detect than clean drops; review round_trip_latency from both sides of each pair. Fix the network before restarting. Restarting into an ongoing partition triggers the same liveness loss.
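Reviewing latency from both sides of each pair is tedious by eye, so here is a minimal sketch that flags pairs whose two directions disagree. The input shape, the function name, and the 5x ratio (borrowed from the baseline threshold used elsewhere in this guide) are assumptions for illustration.

```python
def asymmetric_pairs(rtt_ms, ratio=5.0):
    """Flag node pairs whose latency differs sharply by direction.

    rtt_ms: {(src, dst): latency_ms}, with None (or a missing key) meaning
            src currently has no successful measurement to dst.
    """
    flagged = []
    pairs = {tuple(sorted(key)) for key in rtt_ms}
    for a, b in sorted(pairs):
        ab = rtt_ms.get((a, b))
        ba = rtt_ms.get((b, a))
        if ab is None and ba is None:
            continue  # nothing measured either way: a clean drop or no data
        if ab is None or ba is None:
            flagged.append((a, b, "one direction unreachable"))
        elif max(ab, ba) > ratio * min(ab, ba):
            flagged.append((a, b, "latency differs by direction"))
    return flagged
```

Pairs with no data in either direction are skipped on purpose: that is the clean drop, which ordinary reachability alerts already catch.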

Clock skew

Check NTP status with chronyc tracking or ntpstat on all nodes and fix the time source. Nodes that self-terminated because of excessive clock offset will not stay up until the clock is corrected, so correct the clock before restarting them.

LSM write stall

Pause bulk writes and check whether only a single store is affected. If storage_l0_sublevels is decreasing, allow compaction to catch up. Do not restart a write-stalled node unless it is completely unrecoverable; restarting replays the WAL and can extend the stall window. If the stall is cluster-wide, reduce the ingestion rate and tighten admission control.
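Deciding whether compaction is catching up comes down to the direction of recent storage_l0_sublevels readings. The sketch below is a deliberately simple trend check; the function name, labels, and minimum of three points are assumptions, and noisy data may need smoothing before it is useful.

```python
def l0_trend(samples, min_points=3):
    """Classify chronological storage_l0_sublevels readings.

    Returns "draining", "climbing", "flat", "mixed", or "unknown"
    when there are too few points to judge.
    """
    if len(samples) < min_points:
        return "unknown"
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    if all(d == 0 for d in deltas):
        return "flat"
    if all(d <= 0 for d in deltas):
        return "draining"
    if all(d >= 0 for d in deltas):
        return "climbing"
    return "mixed"
```

A `draining` result supports waiting for compaction to finish, while `climbing` or `flat` at a high level suggests that write load still needs to come down.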

Prevention

Monitor storage_l0_sublevels and storage_write_stalls as leading indicators. They provide minutes to tens of minutes of warning before liveness fails.
Size --cache and --max-sql-memory to keep process RSS below 80% of the host or cgroup limit. Leave headroom for the OS page cache, Go GC, and burst allocations.
Use /health?ready=1 for load balancer health checks, not simple TCP port checks. TCP checks route traffic to nodes that are listening but functionally impaired.
Monitor clock offset proactively. Do not wait for nodes to self-terminate.
Keep per-node range count below 50,000 to limit Raft CPU overhead.
Monitor WAL fsync latency and disk stall detection as early signals of storage path degradation.
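The 80% memory guideline above is easy to check mechanically. A minimal sketch, assuming you already have the process RSS and the host or cgroup limit in bytes (`rss_headroom` is a hypothetical helper, and integer arithmetic is used to avoid rounding surprises):

```python
def rss_headroom(rss_bytes, limit_bytes, target_percent=80):
    """Bytes remaining under the target share of the limit.

    A negative result means the process is already over budget.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    budget = limit_bytes * target_percent // 100
    return budget - rss_bytes
```

A shrinking or negative result is a cue to revisit --cache and --max-sql-memory before GC pressure starts to threaten heartbeats.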

How Netdata helps

Correlates CockroachDB liveness transitions with host-level disk latency, fsync times, and CPU scheduler metrics to surface disk stalls and CPU starvation.
Tracks Go GC pause patterns alongside node liveness status to expose GC thrashing that causes flapping.
Monitors per-node RPC round-trip latency to distinguish network partitions from storage-driven heartbeat loss.
Surfaces L0 sublevel growth and write stall counts on the same timeline as SQL latency and lease transfer rates, revealing storage pressure before it cascades.
Supports composite alerts combining disk stall detection, lease transfer spikes, and unavailable range counts.
