CockroachDB range unavailable: diagnosing ranges_unavailable and recovering quorum
The Prometheus gauge ranges_unavailable is the hardest of hard signals in CockroachDB. When it rises above zero and stays there, some slice of your keyspace has no leaseholder or has lost Raft quorum. Reads and writes to those ranges block or fail, and applications see errors, retries, or timeouts. If the unavailable ranges include system keyspaces, the impact is not limited to user tables. The cluster may lose the ability to heartbeat node liveness, execute schema changes, or route requests.
Your first job is to avoid reflexive restarts. Restarting a node that is slow because of a disk stall, or cycling a node on the wrong side of a network partition, can drop additional replicas out of quorum and make the situation worse. Instead, identify exactly which ranges are affected, which nodes hold their replicas, and whether the root cause is node death, storage failure, network partition, or a stuck Raft election.
This guide focuses on that diagnostic sequence and the decision tree that leads either to restoring quorum by bringing nodes back or, when replicas are permanently lost, to last-resort unsafe recovery.
What this means
CockroachDB divides the keyspace into ranges of approximately 512 MiB by default. Every range is replicated, typically to three replicas across failure domains. One replica acts as the leaseholder, serving reads and coordinating writes. One replica acts as the Raft leader, driving consensus; it is usually, but not necessarily, the same replica as the leaseholder. A write is not acknowledged until a quorum (a majority) of replicas commits the Raft log entry.
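The quorum rule above is simple majority arithmetic. The following is a minimal sketch of it; the function names are illustrative and not part of any CockroachDB API.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    if replicas < 1:
        raise ValueError("a range needs at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still exists."""
    return replicas - quorum_size(replicas)


for n in (3, 5):
    print(f"{n} replicas: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

With the typical three replicas, a range survives one lost replica; losing two at once leaves it without quorum.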
ranges_unavailable counts ranges that have no leaseholder or whose replica set cannot establish a Raft quorum. Without quorum, the range cannot make progress. The metric is exposed per store, but you should aggregate it cluster-wide. A brief flicker during a lease transfer is usually invisible at monitoring granularity. A sustained nonzero value means a hard failure.
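Because the gauge is reported per store, a cluster-wide view means summing the samples collected from every node. The sketch below does that for text in the Prometheus exposition format; the `store` label shown in the sample and the helper name `sum_metric` are assumptions for illustration, and the parser accepts any label set.

```python
import re

# Matches sample lines such as: ranges_unavailable{store="1"} 0
# Labels are optional and their exact names are not relied upon.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)'
)


def sum_metric(exposition_text: str, metric: str) -> float:
    """Sum every sample of `metric` found in Prometheus text output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group("name") == metric:
            total += float(match.group("value"))
    return total


# Hypothetical output concatenated from two nodes.
sample = """
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 0
ranges_unavailable{store="2"} 3
ranges_underreplicated{store="1"} 7
"""
print(sum_metric(sample, "ranges_unavailable"))
```

A single node only reports its own stores, so the input should contain the output of every node's metrics endpoint.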
System ranges that hold range metadata, node liveness records, or job states have amplified impact. If the liveness range itself loses quorum, nodes cannot renew their leases and the cluster can freeze even though most data ranges are technically intact.
flowchart TD
    A[ranges_unavailable nonzero] --> B{Nodes unexpectedly dead?}
    B -->|One| C[Under-replication. Check healing blocked]
    B -->|Multiple| D[Quorum loss. Check failure domain]
    B -->|No| E{Disk or write stalls?}
    E -->|Yes| F[Storage path failure. Check WAL fsync]
    E -->|No| G{Asymmetric RPC latency?}
    G -->|Yes| H[Network partition. Fix routing]
    G -->|No| I[Stuck Raft group. Check range logs]
    C --> J[Restore node or accelerate snapshots]
    D --> K[Restore nodes or unsafe recovery last resort]
    F --> L[Replace disk or node]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Simultaneous multi-node failure | Multiple nodes not live; quorum lost across many ranges | Node liveness status and uptime |
| Network partition bisecting replicas | Nodes appear live but cannot reach each other; asymmetric RPC latency | round_trip_latency between node pairs |
| Disk stall or store failure on replica node | Node is live by heartbeat but one store is not processing writes; storage_disk_stalled nonzero | Per-store storage_disk_stalled and WAL fsync latency |
| Stuck Raft group / split vote | Specific range repeatedly fails to elect a leader; logs show Raft timeouts | Range-specific logs and raftlog_behind |
| Decommissioning too fast | Under-replication spikes during node removal; unavailable ranges appear before replicas finish moving | ranges_underreplicated trending up during decommission |
Quick checks
All of these are read-only and safe to run during an incident.
# Unavailable ranges per store
curl -s http://localhost:8080/_status/vars | grep ranges_unavailable
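# Node liveness overview
# Field names are matched loosely because JSON key casing can differ between versions.
curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
data = json.load(sys.stdin)
live = data.get('livenessByNodeId', data.get('liveness_by_node_id', {}))
for n in data['nodes']:
    desc = n['desc']
    nid = desc.get('nodeId', desc.get('node_id'))
    print(f'Node {nid}: liveness={live.get(str(nid), \"UNKNOWN\")}')"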
# Under-replicated range count
curl -s http://localhost:8080/_status/vars | grep ranges_underreplicated
# Recent write stalls or disk stalls
curl -s http://localhost:8080/_status/vars | grep -E 'storage_write_stalls|storage_disk_stalled'
# Inter-node RPC latency to spot partitions
curl -s http://localhost:8080/_status/vars | grep round_trip_latency
# Raft snapshot rate (healing activity)
curl -s http://localhost:8080/_status/vars | grep 'range_snapshots'
# Node uptime to identify recent restarts
curl -s http://localhost:8080/_status/vars | grep sys_uptime
Note: /_status/nodes parsing assumes Python 3 is available on the node. If not, pipe to jq or inspect manually. The crdb_internal.ranges_no_leases SQL view can identify specific range IDs, but crdb_internal tables are unsupported and may change between versions.
How to diagnose it
- Confirm the scope. Check whether ranges_unavailable is localized to one store or spread across the cluster. If the sum is low but includes system ranges, treat it as a cluster-wide emergency. User-table unavailability hurts; system-range unavailability can halt the entire control plane.
- Map unavailable ranges to nodes. Cross-reference per-store unavailable counts with node liveness. If a node is reported as dead or not-live, its replicas are offline. If the node appears live but its store shows unavailable ranges, suspect a partial store failure or disk stall where the process heartbeats but cannot write.
- Check for network partitions. Review round_trip_latency between node pairs. Asymmetric latency or timeouts between nodes that both report as live strongly suggest a partition. In this case, both sides may claim to be healthy while neither can see a quorum of replicas for some ranges.
- Inspect storage health on suspect nodes. Elevated storage_write_stalls, nonzero storage_disk_stalled, or WAL fsync latency well above baseline indicate the store cannot accept writes. A node in this state may pass coarse health checks while being unable to participate in Raft consensus.
- Determine if the cluster is trying to heal. A rising ranges_underreplicated count alongside elevated range_snapshots_generated means the allocator is attempting to recover. If under-replication is flat or growing while snapshot rates are low, healing is blocked by disk space exhaustion, zone constraint conflicts, or aggressive snapshot throttling.
- Review logs for Raft leader errors. Log messages such as “raft leader not found” or repeated election timeouts point to a stuck Raft group. This can happen when nodes are just responsive enough to avoid liveness eviction but too slow to win an election, creating a split-vote loop.
- Evaluate decommission state. If the event started during node decommission, verify whether the allocator finished moving ranges before the node shut down. An incomplete decommission leaves ranges under-replicated. The decommission may be going too fast for the cluster’s disk or network capacity.
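The triage order from the flowchart earlier in this guide can be written down as a small function, which is handy for runbooks or incident tooling. This is only a sketch: the `Signals` fields are hypothetical inputs that you would fill in from the liveness, storage, and latency checks above.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    dead_nodes: int                # nodes unexpectedly not live
    disk_or_write_stalls: bool     # storage_disk_stalled or write stalls seen
    asymmetric_rpc_latency: bool   # round_trip_latency differs by direction


def triage(s: Signals) -> str:
    """Return the first matching branch, in the same order as the flowchart."""
    if s.dead_nodes == 1:
        return "under-replication: check whether healing is blocked"
    if s.dead_nodes > 1:
        return "quorum loss: check which failure domain was hit"
    if s.disk_or_write_stalls:
        return "storage path failure: check WAL fsync latency"
    if s.asymmetric_rpc_latency:
        return "network partition: fix routing"
    return "stuck Raft group: check range logs"


print(triage(Signals(dead_nodes=0, disk_or_write_stalls=True,
                     asymmetric_rpc_latency=False)))
```

Node death is checked first because it explains the most symptoms at once; storage and network checks only matter when every node still reports as live.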
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ranges_unavailable | Active loss of availability | Any nonzero value sustained |
| ranges_underreplicated | Cluster safety margin and healing progress | Not decreasing over 10 minutes |
| Node liveness status | Whether the cluster considers nodes alive | Unexpected transitions to dead or draining |
| storage_disk_stalled | Store cannot write; node may self-terminate | Nonzero on any store |
| storage_write_stalls | Pebble refusing writes due to LSM debt | Rate above 1/sec sustained |
| round_trip_latency | Network partition or CPU scheduling delay | Above 5x baseline or asymmetric |
| raftlog_behind | Follower cannot keep up with log application | Above 10,000 entries sustained |
| range_snapshots_generated | Recovery I/O load | Elevated without operational cause |
| sql_failure_count | Client-visible errors | Spike in XX000 or 40001 codes |
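The numeric warning signs in the table can be turned into a simple point-in-time check. The sketch below assumes you have already reduced each signal to a single number; the dictionary keys (for example `storage_write_stalls_per_sec`) are hypothetical names, and the "sustained" qualifier would need a time window that is not modelled here.

```python
from typing import Dict, List


def warning_signs(observed: Dict[str, float], baseline_rtt: float) -> List[str]:
    """Compare one snapshot of values against the thresholds in the table."""
    signs = []
    if observed.get("ranges_unavailable", 0) > 0:
        signs.append("ranges_unavailable is nonzero")
    if observed.get("storage_disk_stalled", 0) > 0:
        signs.append("storage_disk_stalled is nonzero on a store")
    if observed.get("storage_write_stalls_per_sec", 0) > 1:
        signs.append("write stall rate above 1/sec")
    if baseline_rtt > 0 and observed.get("round_trip_latency", 0) > 5 * baseline_rtt:
        signs.append("round_trip_latency above 5x baseline")
    if observed.get("raftlog_behind", 0) > 10_000:
        signs.append("raftlog_behind above 10,000 entries")
    return signs


snapshot = {
    "ranges_unavailable": 2,
    "round_trip_latency": 0.030,  # seconds, hypothetical reading
    "raftlog_behind": 10_000,     # exactly at the threshold, not above it
}
for sign in warning_signs(snapshot, baseline_rtt=0.005):
    print(sign)
```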
Fixes
Restore failed nodes or network paths
If the outage stems from a temporarily crashed node, OOM kill, or network blip, bringing the node back online is the fastest and safest fix. Once the process restarts with a corrected clock and healthy storage, CockroachDB automatically rejoins the replica to the Raft group, replays the log, and restores quorum. For network partitions, restore connectivity between the separated groups. Do not cycle nodes blindly, as this can trigger further lease churn and extend the unavailability window.
Address disk stalls and store failures
If a store reports nonzero storage_disk_stalled or repeated write stalls, that node cannot participate in consensus even if its process is running and heartbeating. If the node is still partially responsive, attempt a graceful drain to move leases away. If the disk is physically failed or the store is corrupt, plan to replace the node or the store. After removal, let the cluster up-replicate from the remaining replicas.
Recover from stuck decommissioning
If unavailable ranges appeared during decommissioning, pause further node removals. Verify that ranges_underreplicated is decreasing and that snapshots are flowing. If the cluster is under-provisioned or snapshot throttling is too aggressive, the allocator cannot move data fast enough to maintain safety margins. You may need to add temporary nodes to serve as replica targets or temporarily raise snapshot rate limits, understanding that this increases I/O load on donor nodes.
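To judge whether healing is progressing or blocked, compare the trend of ranges_underreplicated with snapshot activity over the same window. The following is a rough sketch: it assumes range_snapshots_generated behaves as a cumulative counter, that both series were sampled over the same period (for example 10 minutes), and the `min_snapshots` cutoff is an arbitrary illustrative value.

```python
from typing import Sequence


def healing_status(underreplicated: Sequence[int],
                   snapshots_generated: Sequence[int],
                   min_snapshots: int = 1) -> str:
    """Classify healing from samples ordered oldest to newest."""
    if len(underreplicated) < 2 or len(snapshots_generated) < 2:
        raise ValueError("need at least two samples of each series")
    if underreplicated[-1] == 0:
        return "healthy"
    change = underreplicated[-1] - underreplicated[0]
    snapshots = snapshots_generated[-1] - snapshots_generated[0]
    if change < 0:
        return "healing"      # backlog is shrinking
    if snapshots >= min_snapshots:
        return "attempting"   # snapshots flow, but backlog is not shrinking yet
    return "blocked"          # flat or growing backlog with no snapshot activity


print(healing_status([40, 42, 42], [900, 900, 900]))
```

A "blocked" result is the cue to look at disk space, zone constraints, and snapshot throttling before removing any more nodes.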
Last resort: unsafe recovery
If too many replicas are permanently lost and quorum cannot be restored, the remaining option is to run CockroachDB unsafe recovery commands to force a new replica set on the surviving nodes. This is a destructive operation that can cause data loss and should only be performed when no other path exists. Consult the official CockroachDB documentation for the exact procedure for your version. Do not run version-agnostic commands from memory.
Prevention
- Monitor ranges_underreplicated and node liveness together. A node that is flapping between live and dead is a countdown to unavailable ranges.
- Avoid maintenance windows that overlap across failure domains. Restarting or decommissioning multiple nodes concurrently is the most common preventable cause of quorum loss.
- Monitor per-store disk health (storage_disk_stalled, WAL fsync latency) as aggressively as node-level CPU and memory. A stalled disk looks like a live node to coarse health checks.
- Use /health?ready=1 for load balancer health checks, not simple TCP probes. A node that is draining or has lost quorum returns 503, preventing traffic from routing to an impaired node.
- Keep NTP and clock synchronization healthy. Clock skew can cause nodes to self-terminate, which in turn removes replicas from quorum.
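The advice about overlapping maintenance can be checked before a change window. The sketch below takes a hypothetical map of range ID to the node IDs holding its replicas (which you would have to export yourself) and reports which ranges would fall below a majority if a given set of nodes went down together.

```python
from typing import Dict, List, Set


def ranges_losing_quorum(placements: Dict[int, List[int]],
                         nodes_down: Set[int]) -> List[int]:
    """Return range IDs whose surviving replicas no longer form a majority."""
    at_risk = []
    for range_id, replicas in sorted(placements.items()):
        surviving = [node for node in replicas if node not in nodes_down]
        majority = len(replicas) // 2 + 1
        if len(surviving) < majority:
            at_risk.append(range_id)
    return at_risk


# Hypothetical placement of three ranges across five nodes.
placements = {1: [1, 2, 3], 2: [2, 3, 4], 3: [1, 4, 5]}
print(ranges_losing_quorum(placements, nodes_down={2, 3}))
```

Taking down one node at a time never appears in the result for fully replicated three-replica ranges, which is why serial maintenance is the safe default.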
How Netdata helps
- Netdata collects ranges_unavailable, ranges_underreplicated, and node liveness at per-second resolution, catching transient spikes that 15-30 second Prometheus scrapes miss.
- Correlate CockroachDB store metrics with node-level disk latency, I/O utilization, and network retransmits to distinguish disk stalls from network partitions in one dashboard.
- Composite alerting on ranges_unavailable + storage_write_stalls + node liveness flapping surfaces the LSM compaction death spiral before write stalls cascade into quorum loss.
- Per-node CPU and memory context shows whether a node lost liveness due to GC pressure or CPU saturation, helping you distinguish resource exhaustion from true node failure.
Related guides
- CockroachDB compaction backlog growing: when Pebble can’t keep pace with writes
- CockroachDB storage_l0_sublevels climbing: the earliest warning of write stalls
- CockroachDB LSM compaction death spiral: L0 sublevels, read amplification, and write stalls
- CockroachDB monitoring maturity model: from survival to expert
- CockroachDB node liveness failure: heartbeats, lease redistribution, and flapping
- CockroachDB Pebble write stalls: when the storage engine refuses writes
- CockroachDB replica unavailable: lost quorum and stuck Raft groups
- How CockroachDB actually works in production: a mental model for operators







