CockroachDB replica unavailable: lost quorum and stuck Raft groups

ranges_unavailable > 0 means some portion of the keyspace has no leaseholder or has lost Raft quorum. Reads and writes to those ranges block until the cluster restores a leader and a valid leaseholder.

One unavailable range can halt a critical table. If the affected ranges include system metadata such as the liveness or meta ranges, the impact is cluster-wide. This guide covers symptom-to-root-cause diagnosis and safe response.

What this means

CockroachDB divides the keyspace into ranges, each up to approximately 512 MiB by default. Every range is replicated via Raft, and writes commit only after a quorum (a majority) of replicas acknowledges them. One replica holds the lease and serves reads and coordinates writes; one replica is the Raft leader and drives consensus. These are usually the same replica, but they are separate roles. If a node crashes, a network partition cuts replicas off from the majority, or a disk stall prevents a replica from participating, the remaining replicas may fail to elect a leader or maintain a lease. The range then becomes unavailable.
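
The quorum rule reduces to simple arithmetic. The following is a minimal sketch, assuming every replica of the range is a voting member; the function names are illustrative and are not CockroachDB APIs.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a Raft group with `replicas` voting members."""
    if replicas < 1:
        raise ValueError("a range needs at least one replica")
    return replicas // 2 + 1


def has_quorum(replicas: int, reachable: int) -> bool:
    """True if the reachable replicas can still elect a leader and commit writes."""
    return reachable >= quorum_size(replicas)


for rf in (3, 5):
    tolerated = rf - quorum_size(rf)
    print(f"replication factor {rf}: quorum={quorum_size(rf)}, tolerates {tolerated} lost replica(s)")
```

With three replicas, losing two at once leaves the range without quorum; with five, the range survives the loss of two.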

ranges_unavailable is emitted per-store, but the impact is per-keyspace. A low count can still be catastrophic if the range backs a hot table or a system range such as liveness, meta, or jobs. Because range lookups and node liveness depend on these system ranges, their unavailability can freeze the entire cluster even when user data ranges are intact.
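
Because the metric is reported once per store, a total has to be computed from the individual samples. The sketch below parses Prometheus text-format output such as that served by /_status/vars. The sample scrape and the store label values are made up for illustration, and each node reports only its own stores, so a cluster-wide total assumes you scrape every node.

```python
import re

# Matches one Prometheus text-format sample: name, optional {labels}, value.
SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)


def per_store(text: str, metric: str) -> dict[str, float]:
    """Return {label-set: value} for every sample of `metric` in the scrape."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = SAMPLE.match(line)
        if m and m.group("name") == metric:
            out[m.group("labels") or ""] = float(m.group("value"))
    return out


# Illustrative scrape; real output contains many more metrics and labels.
sample_scrape = '''\
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 0
ranges_unavailable{store="2"} 3
ranges_underreplicated{store="1"} 7
'''

stores = per_store(sample_scrape, "ranges_unavailable")
total = sum(stores.values())
print(stores, total)
```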

flowchart TD
    A[Disk stall or network partition] --> B[Node cannot renew liveness]
    B --> C[Raft quorum lost]
    C --> D[Lease expires]
    D --> E[ranges_unavailable increases]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Multi-node failure in a failure domain | Several nodes transition to not-live simultaneously; ranges_underreplicated also rises | Node liveness status via /_status/nodes or crdb_internal.gossip_liveness |
| Network partition bisecting replica groups | Nodes appear live individually but cannot reach each other; asymmetric RPC latency | Inter-node round_trip_latency and partition indicators in logs |
| Disk stall on a node holding critical replicas | Node may stay live briefly but storage_disk_stalled is nonzero; WAL fsync latency spikes | Disk stall gauge and store-level write stall metrics |
| Stuck Raft group unable to elect leader | Nodes are live and networked but a specific range has no leader; Raft leader-not-found errors in logs | Per-range status endpoints or log patterns for Raft leader elections |
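
The table can be read as a rough decision procedure. The sketch below is one possible encoding: the `Signals` fields, the `first_check` function, and the order in which rows are tested are assumptions made for illustration, not a rule from CockroachDB.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    not_live_nodes: int           # nodes currently reported not-live
    disk_stalled_stores: int      # stores with storage_disk_stalled nonzero
    peer_latency_elevated: bool   # round_trip_latency elevated or asymmetric
    leader_errors_in_logs: bool   # Raft leader-not-found style errors in logs


def first_check(s: Signals) -> str:
    """Map a signal snapshot to the table row to investigate first.

    The ordering is an assumption: the most specific signal (a stalled
    disk) is tested first, the least specific (log errors) last.
    """
    if s.disk_stalled_stores > 0:
        return "disk stall"
    if s.not_live_nodes > 1:
        return "multi-node failure"
    if s.peer_latency_elevated:
        return "network partition"
    if s.leader_errors_in_logs:
        return "stuck Raft group"
    return "no match: widen the investigation"


print(first_check(Signals(0, 0, True, True)))
```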

Quick checks

# Check how many ranges are unavailable (one line per store on this node; sum the values)
curl -s http://localhost:8080/_status/vars | grep ranges_unavailable

# Check node liveness and membership
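curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
data = json.load(sys.stdin)
# Key names may be camelCase or snake_case depending on version; try both.
live = data.get('livenessByNodeId') or data.get('liveness_by_node_id') or {}
for n in data['nodes']:
    nid = n['desc'].get('nodeId', n['desc'].get('node_id'))
    print(f'Node {nid}: liveness={live.get(str(nid), \"UNKNOWN\")}')"
# Note: the JSON layout varies by version; if every node prints UNKNOWN, inspect the raw response and adjust the key path.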

# Check node liveness via SQL
cockroach sql -e "SELECT node_id, epoch, expiration, draining, decommissioning, membership FROM crdb_internal.gossip_liveness;"

# Check under-replicated range count
curl -s http://localhost:8080/_status/vars | grep ranges_underreplicated

# Check for disk stall detection on any store
curl -s http://localhost:8080/_status/vars | grep storage_disk_stalled

# Check inter-node RPC latency for partition or delay
curl -s http://localhost:8080/_status/vars | grep round_trip_latency

# Check clock offset between nodes
curl -s http://localhost:8080/_status/vars | grep clock_offset_meannanos
# Note: metric names can differ between versions; if this returns nothing, grep for clock_offset to list all clock metrics

# Check for active write stalls
curl -s http://localhost:8080/_status/vars | grep storage_write_stalls
# Note: metric names can differ between versions; if this returns nothing, grep for write_stall to list related metrics

# Verify SQL connectivity from a client perspective
cockroach sql -e "SELECT 1"

How to diagnose it

  1. Quantify the scope. Sum ranges_unavailable across all stores. A nonzero value on any store means some portion of the keyspace is blocked. Cross-reference with ranges_underreplicated to distinguish between total replica loss and transient unavailability. Determine whether system ranges are affected, because unavailability in the liveness or meta ranges produces cluster-wide symptoms.

  2. Inspect node liveness. Query crdb_internal.gossip_liveness or /_status/nodes. If multiple nodes are not-live, check whether they failed at the same time; a shared timestamp points to a correlated failure such as a shared power domain, top-of-rack switch, or NTP outage.

  3. Check for disk stalls. If any store shows storage_disk_stalled nonzero, that node has detected an unresponsive disk and may self-terminate. Disk stalls block WAL fsync, which blocks Raft log commits and eventually lease renewal.

  4. Verify network symmetry. Review round_trip_latency between node pairs. Elevated or asymmetric latency indicates a partition or NIC saturation. Look for nodes that can reach the admin UI but report elevated latency to specific peers; this pattern often reveals an asymmetric partition. Asymmetric partitions are especially damaging because they can cause Raft leadership churn while preventing stable quorum.

  5. Review storage engine health. High storage_l0_sublevels or active storage_write_stalls on a node can make it unable to process Raft heartbeats or append log entries. This effectively removes the replica from quorum even if the process is still running and responds to RPCs.

  6. Scan logs for Raft leader errors. Look for Raft leader not found or related error patterns. These point to specific ranges where the group is stuck.

  7. Correlate with operations. Check whether the event started during a rolling restart, node decommission, backup, or schema change. Decommissioning should not raise ranges_unavailable; if it does, replicas are being moved too slowly or the decommissioning node failed before its ranges were fully rebalanced.
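
Step 2 asks whether several nodes failed at the same moment. A minimal sketch of that check follows, assuming you have already collected one failure timestamp per node (for example from log timestamps). The `correlated_groups` helper, the sample times, and the 30-second window are all hypothetical.

```python
from datetime import datetime, timedelta


def correlated_groups(failures: dict[int, datetime], window: timedelta) -> list[list[int]]:
    """Group node IDs whose failure time is within `window` of the previous failure."""
    groups: list[list[int]] = []
    last = None
    for node_id, ts in sorted(failures.items(), key=lambda kv: kv[1]):
        if last is not None and ts - last <= window:
            groups[-1].append(node_id)   # close enough to the previous failure
        else:
            groups.append([node_id])     # start a new group
        last = ts
    return groups


# Hypothetical failure times for three nodes.
failures = {
    1: datetime(2024, 1, 1, 12, 0, 5),
    2: datetime(2024, 1, 1, 12, 0, 9),
    3: datetime(2024, 1, 1, 14, 30, 0),
}
groups = correlated_groups(failures, window=timedelta(seconds=30))
print(groups)
```

A group with more than one node suggests a shared cause, such as a common power domain or switch, and is worth investigating before treating the failures as independent.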

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ranges_unavailable | Binary indicator of active data unavailability | Any nonzero value sustained for more than 5 minutes |
| ranges_underreplicated | Safety margin before quorum loss | Nonzero and not decreasing over a 10-minute window |
| Node liveness status | Determines whether the cluster can maintain quorum | Any node unexpectedly not-live |
| storage_disk_stalled | Direct signal that a store cannot write | Nonzero on any store |
| round_trip_latency | Reveals network partitions and scheduling delay | Any node pair above 5x baseline sustained |
| clock_offset_meannanos | Clock skew causes uncertainty restarts and eventual node death | Above 50% of --max-offset |
| storage_write_stalls | Storage engine is refusing writes, blocking Raft progress | Any nonzero during normal workload |
| storage_l0_sublevels | Leading indicator that compaction is falling behind | Sustained count above 10 |
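
Several of these thresholds can be evaluated mechanically against a single snapshot. The sketch below is illustrative only: `warning_signs` is a hypothetical helper, the snapshot is a plain dictionary rather than a real scrape, latency and offset values are assumed to be in nanoseconds, and the "sustained" conditions in the table are not modeled.

```python
def warning_signs(sample: dict[str, float],
                  max_offset_nanos: float,
                  baseline_rtt_nanos: float) -> list[str]:
    """Return the names of signals whose table threshold is exceeded in `sample`."""
    checks = [
        ("ranges_unavailable", sample.get("ranges_unavailable", 0) > 0),
        ("storage_disk_stalled", sample.get("storage_disk_stalled", 0) > 0),
        ("storage_write_stalls", sample.get("storage_write_stalls", 0) > 0),
        ("storage_l0_sublevels", sample.get("storage_l0_sublevels", 0) > 10),
        ("round_trip_latency",
         sample.get("round_trip_latency", 0) > 5 * baseline_rtt_nanos),
        # The offset can be negative, so compare its magnitude.
        ("clock_offset_meannanos",
         abs(sample.get("clock_offset_meannanos", 0)) > 0.5 * max_offset_nanos),
    ]
    return [name for name, tripped in checks if tripped]


# Made-up snapshot: 14 L0 sublevels, 2 ms round trip, -300 ms clock offset.
snapshot = {
    "ranges_unavailable": 0,
    "storage_l0_sublevels": 14,
    "round_trip_latency": 2_000_000,
    "clock_offset_meannanos": -300_000_000,
}
tripped = warning_signs(snapshot, max_offset_nanos=500_000_000, baseline_rtt_nanos=500_000)
print(tripped)
```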

Fixes

Multi-node failure

If multiple nodes are down due to a power, hypervisor, or rack failure, restore the nodes if possible. If the nodes are permanently lost, replace them and allow the cluster to up-replicate. Up-replication sends Raft snapshots, which consume disk I/O and network bandwidth and raise foreground latency. Do not restart remaining healthy nodes to force a new quorum; this extends unavailability and triggers further Raft elections.

Network partition

Resolve the partition at the network layer. Restarting CockroachDB processes does not fix a partition and usually extends the outage by forcing additional leader elections. Once connectivity is restored, Raft groups should re-establish leadership automatically. Monitor round_trip_latency to confirm symmetric, stable heartbeats before declaring the incident resolved.
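
Confirming symmetry means comparing both directions of every node pair. A minimal sketch follows, assuming you have per-direction latencies keyed by (source, destination) node ID; the `asymmetric_pairs` helper, the sample values, and the 3x ratio are assumptions for illustration.

```python
def asymmetric_pairs(rtt: dict[tuple[int, int], float],
                     ratio: float = 3.0) -> list[tuple[int, int]]:
    """Return pairs (a, b) with a < b that look asymmetric.

    A pair is flagged when one direction is missing entirely, or when the
    slower direction exceeds `ratio` times the faster one.
    """
    flagged = []
    pairs = {(min(a, b), max(a, b)) for a, b in rtt}
    for a, b in sorted(pairs):
        fwd, rev = rtt.get((a, b)), rtt.get((b, a))
        if fwd is None or rev is None:
            flagged.append((a, b))
        elif max(fwd, rev) > ratio * min(fwd, rev):
            flagged.append((a, b))
    return flagged


# Made-up latencies in milliseconds; node 3 is slow to reach node 1,
# and node 3 reports nothing for node 2.
rtt_ms = {
    (1, 2): 0.8, (2, 1): 0.9,
    (1, 3): 0.7, (3, 1): 45.0,
    (2, 3): 1.1,
}
suspects = asymmetric_pairs(rtt_ms)
print(suspects)
```

An empty result across several consecutive samples is a reasonable signal that heartbeats are symmetric again.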

Disk stall or failure

If storage_disk_stalled is nonzero, the node has detected an unresponsive disk and may self-terminate. If the node is still running but unresponsive, attempt a graceful termination. If the shutdown hangs due to the stalled disk, a forced kill may be required, but expect longer unavailability while other replicas campaign for leadership. After replacing the disk or resolving the storage issue, restart the node and monitor ranges_underreplicated until it reaches zero. Do not rejoin a node with a still-stalled disk; it will lose liveness again.
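
Watching ranges_underreplicated until it reaches zero can be reduced to a simple classification over a window of samples. The sketch below is an assumption about how to express that; `healing_state` is a hypothetical helper, and comparing only the first and last sample is a deliberately crude trend test.

```python
def healing_state(samples: list[float]) -> str:
    """Classify a window of ranges_underreplicated samples, oldest first."""
    if not samples:
        raise ValueError("need at least one sample")
    if samples[-1] == 0:
        return "healed"
    if len(samples) < 2:
        return "unknown"   # one nonzero sample says nothing about the trend
    # Crude trend test: compare the newest sample with the oldest.
    return "healing" if samples[-1] < samples[0] else "blocked"


print(healing_state([40, 31, 12, 0]))
print(healing_state([40, 31, 12]))
print(healing_state([40, 40, 41]))
```

A "blocked" result over a window of about ten minutes matches the warning sign for ranges_underreplicated and means up-replication is not making progress.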

Stuck Raft from storage pressure

If a live node cannot participate in quorum because of write stalls or L0 compaction debt, reduce write load immediately. Pause bulk imports, backups, or large schema backfills. See the related guides on compaction and write stalls for the recovery path. Do not issue manual range splits or forced Raft commands; these add coordination overhead while the storage engine is already saturated.

System range unavailability

If the liveness or meta ranges are unavailable, the cluster may freeze transaction processing even when user data ranges are healthy. This usually means a majority of system range replicas reside on failed or partitioned nodes. Recover those nodes first. The cluster cannot reassign leases, rebalance, or admit new nodes until the system ranges regain quorum.

Prevention

  • Monitor under-replication proactively. ranges_underreplicated should converge to zero after any node event. If it is flat or rising, healing is blocked.
  • Use /health?ready=1 for load balancer health checks. A simple TCP check routes traffic to nodes that are alive but impaired. The ready endpoint returns 503 during drains or quorum loss.
  • Maintain disk I/O headroom. Compaction throughput should be at least twice the sustained write ingestion rate. Rising storage_l0_sublevels is an early warning.
  • Synchronize clocks aggressively. Monitor clock_offset_meannanos across all node pairs. VMs are especially prone to drift after live migration.
  • Size for node loss. Keep per-node CPU and disk utilization low enough that losing one node does not push the survivors past 80% utilization.
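
The node-loss sizing rule can be checked with a short calculation. This is a sketch under a simplifying assumption: load from the lost node redistributes evenly across the survivors. The function names are illustrative, and the 80% ceiling comes from the guideline above.

```python
def utilization_after_loss(current: float, nodes: int, lost: int = 1) -> float:
    """Per-node utilization once `lost` nodes are gone, assuming even redistribution."""
    if not 0 < lost < nodes:
        raise ValueError("lost must be at least 1 and less than nodes")
    return current * nodes / (nodes - lost)


def max_safe_utilization(nodes: int, lost: int = 1, ceiling: float = 0.80) -> float:
    """Highest current utilization that keeps survivors at or below `ceiling`."""
    if not 0 < lost < nodes:
        raise ValueError("lost must be at least 1 and less than nodes")
    return ceiling * (nodes - lost) / nodes


for n in (3, 5, 9):
    print(f"{n} nodes: keep utilization at or below {max_safe_utilization(n):.1%}")
```

Under this assumption, a three-node cluster running at 60% per node would land near 90% on the two survivors, well past the 80% ceiling.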

How Netdata helps

  • Surfaces per-store ranges_unavailable alongside node-level disk stall detection, inter-node latency, and clock offset in one view. This correlation helps distinguish a storage failure from a network partition without switching contexts.
  • Per-second granularity can catch transient write stalls and Raft heartbeat gaps that longer scrape intervals miss.
  • Anomaly detection on storage_l0_sublevels and inter-node latency surfaces leading indicators before they escalate to quorum loss.
  • Synthetic SQL probes test the full client path, distinguishing internal cluster health from actual application connectivity.

Related guides

  • CockroachDB compaction backlog growing: when Pebble can’t keep pace with writes: /guides/cockroachdb/cockroachdb-compaction-backlog-growing/
  • CockroachDB storage_l0_sublevels climbing: the earliest warning of write stalls: /guides/cockroachdb/cockroachdb-l0-sublevels-high/
  • CockroachDB LSM compaction death spiral: L0 sublevels, read amplification, and write stalls: /guides/cockroachdb/cockroachdb-lsm-compaction-death-spiral/
  • CockroachDB monitoring maturity model: from survival to expert: /guides/cockroachdb/cockroachdb-monitoring-maturity-model/
  • CockroachDB Pebble write stalls: when the storage engine refuses writes: /guides/cockroachdb/cockroachdb-pebble-write-stalls/
  • How CockroachDB actually works in production: a mental model for operators: /guides/cockroachdb/how-cockroachdb-works-in-production/