CockroachDB range unavailable: diagnosing ranges_unavailable and recovering quorum

The Prometheus gauge ranges_unavailable is one of the most serious signals CockroachDB exposes. When it rises above zero and stays there, some slice of your keyspace has no leaseholder or has lost Raft quorum. Reads and writes to those ranges block or fail, and applications see errors, retries, or timeouts. If the unavailable ranges include system ranges, the impact is not limited to user tables: the cluster may lose the ability to heartbeat node liveness, execute schema changes, or route requests.

Your first job is to avoid reflexive restarts. Restarting a node that is slow because of a disk stall, or cycling a node on the wrong side of a network partition, can drop additional replicas out of quorum and make the situation worse. Instead, identify exactly which ranges are affected, which nodes hold their replicas, and whether the root cause is node death, storage failure, network partition, or a stuck Raft election.

This guide focuses on that diagnostic sequence and the decision tree that leads either to restoring quorum by bringing nodes back or, when replicas are permanently lost, to last-resort unsafe recovery.

What this means

CockroachDB divides the keyspace into ranges of approximately 512 MiB by default. Every range is replicated, typically to three replicas across failure domains. One replica holds the range lease (the leaseholder), serving reads and coordinating writes. One replica is the Raft leader, driving consensus; it is usually the same replica as the leaseholder. A write is not acknowledged until a quorum (a majority) of replicas commits the Raft log entry.
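The quorum rule can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and are not part of any CockroachDB API.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft group with `voters` voting replicas."""
    if voters < 1:
        raise ValueError("a Raft group needs at least one voter")
    return voters // 2 + 1


def has_quorum(voters: int, reachable: int) -> bool:
    """True if enough voting replicas are reachable to commit writes."""
    return reachable >= quorum_size(voters)


if __name__ == "__main__":
    for voters in (3, 5):
        tolerated = voters - quorum_size(voters)
        print(f"{voters} replicas: quorum={quorum_size(voters)}, "
              f"tolerates {tolerated} failure(s)")
```

With three replicas, losing one replica leaves the range available, while losing two makes it unavailable.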

ranges_unavailable counts ranges that have no leaseholder or whose replica set cannot establish a Raft quorum. Without quorum, the range cannot make progress. The metric is exposed per store, but you should aggregate it cluster-wide. A brief flicker during a lease transfer is usually invisible at monitoring granularity. A sustained nonzero value means a hard failure.
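A minimal sketch of the per-store aggregation, assuming the metric appears in Prometheus text format with a store label, as in the hypothetical SAMPLE below. The helper name sum_metric_by_store is an assumption for illustration, and the label parsing is deliberately naive (it does not handle commas inside label values).

```python
from collections import defaultdict

# Hypothetical excerpt of /_status/vars output, for illustration only.
SAMPLE = """\
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 0
ranges_unavailable{store="2"} 3
ranges_underreplicated{store="1"} 7
"""


def sum_metric_by_store(vars_text: str, metric: str = "ranges_unavailable") -> dict:
    """Sum one metric from Prometheus text output, keyed by its `store` label."""
    totals = defaultdict(float)
    for line in vars_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name, _, labels = name_and_labels.partition("{")
        if name != metric:
            continue
        store = "unknown"
        for pair in labels.rstrip("}").split(","):
            key, _, val = pair.partition("=")
            if key.strip() == "store":
                store = val.strip().strip('"')
        totals[store] += float(value)
    return dict(totals)


if __name__ == "__main__":
    per_store = sum_metric_by_store(SAMPLE)
    print(per_store, "total:", sum(per_store.values()))
```

Because each node reports only its own stores, a cluster-wide figure requires collecting this output from every node and combining the per-store results.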

System ranges that hold range metadata, node liveness records, or job states have amplified impact. If the liveness range itself loses quorum, nodes cannot renew their leases and the cluster can freeze even though most data ranges are technically intact.

flowchart TD
    A[ranges_unavailable nonzero] --> B{Nodes unexpectedly dead?}
    B -->|One| C[Under-replication. Check healing blocked]
    B -->|Multiple| D[Quorum loss. Check failure domain]
    B -->|No| E{Disk or write stalls?}
    E -->|Yes| F[Storage path failure. Check WAL fsync]
    E -->|No| G{Asymmetric RPC latency?}
    G -->|Yes| H[Network partition. Fix routing]
    G -->|No| I[Stuck Raft group. Check range logs]
    C --> J[Restore node or accelerate snapshots]
    D --> K[Restore nodes or unsafe recovery last resort]
    F --> L[Replace disk or node]
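
The same decision tree can be written as a small function, which is handy for a runbook script. This is a sketch under stated assumptions: the function name and its three inputs are hypothetical, and the caller is expected to derive them from the checks described in this guide.

```python
def classify_unavailability(dead_nodes: int,
                            disk_or_write_stalls: bool,
                            asymmetric_rpc_latency: bool) -> str:
    """Walk the flowchart from top to bottom and return the first matching branch."""
    if dead_nodes == 1:
        return "Under-replication: check whether healing is blocked"
    if dead_nodes > 1:
        return "Quorum loss: check failure domain"
    if disk_or_write_stalls:
        return "Storage path failure: check WAL fsync"
    if asymmetric_rpc_latency:
        return "Network partition: fix routing"
    return "Stuck Raft group: check range logs"


if __name__ == "__main__":
    print(classify_unavailability(0, True, False))
```

The order of the checks matters: node death is ruled out first, then storage, then the network, and a stuck Raft group is the conclusion only when nothing else matches.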

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Simultaneous multi-node failure | Multiple nodes not live; quorum lost across many ranges | Node liveness status and uptime |
| Network partition bisecting replicas | Nodes appear live but cannot reach each other; asymmetric RPC latency | round_trip_latency between node pairs |
| Disk stall or store failure on replica node | Node is live by heartbeat but one store is not processing writes; storage_disk_stalled nonzero | Per-store storage_disk_stalled and WAL fsync latency |
| Stuck Raft group / split vote | Specific range repeatedly fails to elect a leader; logs show Raft timeouts | Range-specific logs and raftlog_behind |
| Decommissioning too fast | Under-replication spikes during node removal; unavailable ranges appear before replicas finish moving | ranges_underreplicated trending up during decommission |

Quick checks

All of these are read-only and safe to run during an incident.

# Unavailable ranges per store
curl -s http://localhost:8080/_status/vars | grep ranges_unavailable

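# Node liveness overview (assumes liveness is a top-level map keyed by node ID; key casing may vary by version)
curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
d = json.load(sys.stdin)
live = d.get('livenessByNodeId') or d.get('liveness_by_node_id') or {}
for n in d['nodes']:
    desc = n['desc']
    nid = desc.get('nodeId', desc.get('node_id'))
    print(f'Node {nid}: {live.get(str(nid), \"UNKNOWN\")}')"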

# Under-replicated range count
curl -s http://localhost:8080/_status/vars | grep ranges_underreplicated

# Recent write stalls or disk stalls
curl -s http://localhost:8080/_status/vars | grep -E 'storage_write_stalls|storage_disk_stalled'

# Inter-node RPC latency to spot partitions
curl -s http://localhost:8080/_status/vars | grep round_trip_latency

# Raft snapshot rate (healing activity)
curl -s http://localhost:8080/_status/vars | grep 'range_snapshots'

# Node uptime to identify recent restarts
curl -s http://localhost:8080/_status/vars | grep sys_uptime

Note: /_status/nodes parsing assumes Python 3 is available on the node. If not, pipe to jq or inspect manually. The crdb_internal.ranges_no_leases SQL view can identify specific range IDs, but crdb_internal tables are unsupported and may change between versions.

How to diagnose it

  1. Confirm the scope. Check whether ranges_unavailable is localized to one store or spread across the cluster. If the sum is low but includes system ranges, treat it as a cluster-wide emergency. User-table unavailability hurts; system-range unavailability can halt the entire control plane.
  2. Map unavailable ranges to nodes. Cross-reference per-store unavailable counts with node liveness. If a node is reported as dead or not-live, its replicas are offline. If the node appears live but its store shows unavailable ranges, suspect a partial store failure or disk stall where the process heartbeats but cannot write.
  3. Check for network partitions. Review round_trip_latency between node pairs. Asymmetric latency or timeouts between nodes that both report as live strongly suggest a partition. In this case, both sides may claim to be healthy while neither can see a quorum of replicas for some ranges.
  4. Inspect storage health on suspect nodes. Elevated storage_write_stalls, nonzero storage_disk_stalled, or WAL fsync latency well above baseline indicate the store cannot accept writes. A node in this state may pass coarse health checks while being unable to participate in Raft consensus.
  5. Determine if the cluster is trying to heal. A rising ranges_underreplicated count alongside elevated range_snapshots_generated means the allocator is attempting to recover. If under-replication is flat or growing while snapshot rates are low, healing is probably blocked; typical causes are disk space exhaustion, zone constraint conflicts, or aggressive snapshot throttling.
  6. Review logs for Raft leader errors. Log messages such as “raft leader not found” or repeated election timeouts point to a stuck Raft group. This can happen when nodes are just responsive enough to avoid liveness eviction but too slow to win an election, creating a split-vote loop.
  7. Evaluate decommission state. If the event started during node decommission, verify whether the allocator finished moving ranges before the node shut down. An incomplete decommission leaves ranges under-replicated, and usually means the decommission is going too fast for the cluster’s disk or network capacity.
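
Step 3 can be sketched as a comparison of the two directions of each node pair. The input format here is an assumption: a dictionary of already-collected latencies keyed by (source, destination) node ID, with None standing for a timeout. The function name and the 5x ratio are hypothetical choices for illustration, not CockroachDB defaults.

```python
from typing import Dict, List, Optional, Tuple


def asymmetric_pairs(latency_ms: Dict[Tuple[int, int], Optional[float]],
                     ratio: float = 5.0) -> List[Tuple[int, int]]:
    """Return node pairs where one direction timed out or is far slower than the other."""
    flagged = []
    seen = set()
    for (src, dst) in latency_ms:
        pair = tuple(sorted((src, dst)))
        if pair in seen:
            continue
        seen.add(pair)
        forward = latency_ms.get((pair[0], pair[1]))
        reverse = latency_ms.get((pair[1], pair[0]))
        if forward is None or reverse is None:
            flagged.append(pair)  # missing or timed-out direction
        elif max(forward, reverse) > ratio * min(forward, reverse):
            flagged.append(pair)  # strongly asymmetric latency
    return sorted(flagged)


if __name__ == "__main__":
    sample = {
        (1, 2): 1.0, (2, 1): 1.2,
        (1, 3): 0.9, (3, 1): None,
        (2, 3): 1.0, (3, 2): 40.0,
    }
    print(asymmetric_pairs(sample))
```

In the sample, node 3 is the common member of both flagged pairs, which points the investigation at its network path rather than at nodes 1 and 2.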

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| ranges_unavailable | Active loss of availability | Any nonzero value sustained |
| ranges_underreplicated | Cluster safety margin and healing progress | Not decreasing over 10 minutes |
| Node liveness status | Whether the cluster considers nodes alive | Unexpected transitions to dead or draining |
| storage_disk_stalled | Store cannot write; node may self-terminate | Nonzero on any store |
| storage_write_stalls | Pebble refusing writes due to LSM debt | Rate above 1/sec sustained |
| round_trip_latency | Network partition or CPU scheduling delay | Above 5x baseline or asymmetric |
| raftlog_behind | Follower cannot keep up with log application | Above 10,000 entries sustained |
| range_snapshots_generated | Recovery I/O load | Elevated without operational cause |
| sql_failure_count | Client-visible errors | Spike in XX000 or 40001 codes |
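
Several warning signs above depend on a value being sustained rather than momentary. One way to express that, as a sketch: require every sample in a trailing window to exceed the threshold. The function name and the window sizes are hypothetical; choose a window that matches your scrape interval.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`.

    Samples are ordered oldest first. Too few samples counts as "not sustained".
    """
    if window < 1 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


if __name__ == "__main__":
    # ranges_unavailable: any nonzero value held for three consecutive scrapes
    print(sustained_above([0, 0, 2, 2, 2], threshold=0, window=3))
    # raftlog_behind: above 10,000 entries for three consecutive scrapes
    print(sustained_above([9000, 12000, 8000, 11000], threshold=10_000, window=3))
```

A single scrape above the threshold does not fire, which filters out brief flickers such as a lease transfer.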

Fixes

Restore failed nodes or network paths

If the outage stems from a temporarily crashed node, OOM kill, or network blip, bringing the node back online is the fastest and safest fix. Once the process restarts with a corrected clock and healthy storage, CockroachDB automatically rejoins the replica to the Raft group, replays the log, and restores quorum. For network partitions, restore connectivity between the separated groups. Do not cycle nodes blindly, as this can trigger further lease churn and extend the unavailability window.

Address disk stalls and store failures

If a store reports nonzero storage_disk_stalled or repeated write stalls, that node cannot participate in consensus even if its process is running and heartbeating. If the node is still partially responsive, attempt a graceful drain to move leases away. If the disk is physically failed or the store is corrupt, plan to replace the node or the store. After removal, let the cluster up-replicate from the remaining replicas.

Recover from stuck decommissioning

If unavailable ranges appeared during decommissioning, pause further node removals. Verify that ranges_underreplicated is decreasing and that snapshots are flowing. If the cluster is under-provisioned or snapshot throttling is too aggressive, the allocator cannot move data fast enough to maintain safety margins. You may need to add temporary nodes to serve as replica targets or temporarily raise snapshot rate limits, understanding that this increases I/O load on donor nodes.
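
The check described above (is under-replication falling, and are snapshots flowing?) can be sketched as follows. It assumes you have two sample series covering the same time window, ordered oldest first: the ranges_underreplicated gauge and the range_snapshots_generated counter. The function name, the status labels, and min_snapshots are hypothetical; a counter reset from a node restart inside the window would need separate handling.

```python
from typing import Sequence


def healing_status(underreplicated: Sequence[float],
                   snapshots_generated: Sequence[float],
                   min_snapshots: float = 1) -> str:
    """Classify recovery progress over one observation window."""
    if len(underreplicated) < 2 or len(snapshots_generated) < 2:
        raise ValueError("need at least two samples of each series")
    first, last = underreplicated[0], underreplicated[-1]
    snapshot_delta = snapshots_generated[-1] - snapshots_generated[0]
    if last == 0:
        return "healed"       # nothing left to up-replicate
    if last < first:
        return "healing"      # backlog is shrinking
    if snapshot_delta >= min_snapshots:
        return "attempting"   # snapshots flow but backlog is not shrinking yet
    return "blocked"          # flat or growing backlog and no snapshots


if __name__ == "__main__":
    print(healing_status([40, 30, 12], [100, 140, 180]))
    print(healing_status([40, 40, 45], [100, 100, 100]))
```

A "blocked" result is the cue to look at disk space, zone constraints, and snapshot rate limits before removing any further nodes.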

Last resort: unsafe recovery

If too many replicas are permanently lost and quorum cannot be restored, the remaining option is to run CockroachDB unsafe recovery commands to force a new replica set on the surviving nodes. This is a destructive operation that can cause data loss and should only be performed when no other path exists. Consult the official CockroachDB documentation for the exact procedure for your version. Do not run version-agnostic commands from memory.

Prevention

  • Monitor ranges_underreplicated and node liveness together. A node that is flapping between live and dead is a countdown to unavailable ranges.
  • Avoid maintenance windows that overlap across failure domains. Restarting or decommissioning multiple nodes concurrently is the most common preventable cause of quorum loss.
  • Monitor per-store disk health (storage_disk_stalled, WAL fsync latency) as aggressively as node-level CPU and memory. A stalled disk looks like a live node to coarse health checks.
  • Use /health?ready=1 for load balancer health checks, not simple TCP probes. A node that is draining or has lost quorum returns 503, preventing traffic from routing to an impaired node.
  • Keep NTP and clock synchronization healthy. Clock skew can cause nodes to self-terminate, which in turn removes replicas from quorum.
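
A readiness probe along the lines of the /health?ready=1 recommendation above might look like the sketch below. The base URL is a placeholder, the function name is hypothetical, and the opener parameter exists only so the probe can be exercised without a live node. Secure clusters would additionally need HTTPS and certificate handling, which is omitted here.

```python
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0,
             opener=urllib.request.urlopen) -> bool:
    """Return True only if the readiness endpoint answers HTTP 200.

    A 503 (raised by urllib as HTTPError), a refused connection, or a
    timeout are all treated as "not ready".
    """
    try:
        with opener(f"{base_url}/health?ready=1", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    # Placeholder address; replace with a real node's HTTP address.
    print(is_ready("http://localhost:8080"))
```

Treating every failure mode as "not ready" is the conservative choice for a load balancer: it is better to skip a healthy node briefly than to route traffic to an impaired one.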

How Netdata helps

  • Netdata collects ranges_unavailable, ranges_underreplicated, and node liveness at per-second resolution, catching transient spikes that 15-30 second Prometheus scrapes miss.
  • Correlate CockroachDB store metrics with node-level disk latency, I/O utilization, and network retransmits to distinguish disk stalls from network partitions in one dashboard.
  • Composite alerting on ranges_unavailable + storage_write_stalls + node liveness flapping surfaces the LSM compaction death spiral before write stalls cascade into quorum loss.
  • Per-node CPU and memory context shows whether a node lost liveness due to GC pressure or CPU saturation, helping you distinguish resource exhaustion from true node failure.