CockroachDB replica unavailable: lost quorum and stuck Raft groups

ranges_unavailable has gone nonzero. Clients are seeing “replica unavailable” errors. Some portion of your keyspace cannot be read or written. This is an active availability incident.

In CockroachDB, every range (a ~512 MiB slice of the keyspace) is replicated across multiple nodes. A write commits only after a majority of replicas (a quorum) acknowledge it. When too few replicas are reachable, the range loses quorum and cannot serve reads or writes. The ranges_unavailable metric tracks exactly this condition: ranges with no leaseholder or with lost Raft quorum.

The severity depends on which ranges are affected. A user-data range being unavailable means that specific data is inaccessible. A system range (meta ranges, liveness range) being unavailable can freeze the entire cluster, because nodes cannot renew liveness records or locate range metadata.

What this means

CockroachDB replicates each range to a configurable number of nodes (default 3). A Raft majority (2 of 3, or 3 of 5) must be reachable for the range to accept writes and elect a leader. When quorum is lost, the range enters an unavailable state until enough replicas come back.
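The majority arithmetic behind "2 of 3, or 3 of 5" can be sketched in a few lines. This is an illustrative calculation only; the function names are hypothetical and are not part of any CockroachDB API.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a Raft group with `replicas` voting members."""
    if replicas < 1:
        raise ValueError("replicas must be positive")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many voters can be unreachable while a majority still exists."""
    return replicas - quorum_size(replicas)


def has_quorum(replicas: int, reachable: int) -> bool:
    """True if enough replicas are reachable to commit writes and elect a leader."""
    return reachable >= quorum_size(replicas)


for rf in (3, 5):
    print(f"replication factor {rf}: quorum={quorum_size(rf)}, "
          f"tolerates {tolerated_failures(rf)} failure(s)")
```

With a replication factor of 3, a range survives one unreachable replica but not two; with 5, it survives two.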

Since v22.1, CockroachDB includes a per-replica circuit breaker. After approximately 60 seconds of failed Raft proposals, the breaker trips and returns a ReplicaUnavailableError to the client immediately, rather than hanging indefinitely. The threshold is controlled by kv.replica_circuit_breaker.slow_replication_threshold (default 1 minute). Before this feature existed, requests to unavailable ranges would hang with no fail-fast behavior.

The error string operators typically search for is “replica unavailable”. When you see ranges_unavailable nonzero in Prometheus or the DB Console Replication Dashboard, some keyspace is actively unreachable.

flowchart TD
    A[Multi-node failure or network partition] --> B{Quorum lost for affected ranges?}
    B -->|Yes| C[Raft cannot commit or elect leader]
    B -->|No| Z[Range still available]
    C --> D[No leaseholder can serve reads or writes]
    D --> E[ranges_unavailable greater than zero]
    D --> F[Circuit breaker trips after about 60s]
    F --> G[ReplicaUnavailableError to clients]
    E --> H{Which range type?}
    H -->|User data range| I[Specific keyspace unavailable]
    H -->|System range: liveness or meta| J[Cluster-wide freeze]
    J --> K[Nodes cannot renew liveness]
    K --> L[Cascading unavailability]

A critical distinction: if the unavailable ranges include system ranges (meta1, meta2, liveness, jobs), the impact is cluster-wide and catastrophic. Nodes cannot renew liveness records if the liveness range has lost quorum. Restarted nodes may not complete boot. A narrower case is quorum loss confined to the timeseries ranges: the DB Console itself may stop working even though SQL operations on user data continue normally.

Common causes

Cause	What it looks like	First thing to check
Multi-node failure	Multiple nodes not-live simultaneously; ranges_unavailable spikes	`/_status/nodes` for liveness status of all nodes
Network partition	Nodes alive but cannot reach each other; RPC latency spikes, or RPCs between some node pairs fail outright	`round_trip_latency` between node pairs
Stuck Raft group	One or few ranges unavailable while nodes are healthy	`/_status/range/{id}` for the specific range
System range quorum loss	Cluster-wide freeze; nodes cannot renew liveness or locate range metadata	Whether unavailable ranges back system metadata
Clock skew cascade	Nodes self-terminating; clock_offset near max-offset	`clock_offset_meannanos` on all nodes

Quick checks

# Check for unavailable ranges (canonical signal)
curl -s http://localhost:8080/_status/vars | grep ranges_unavailable

# Check under-replicated ranges (elevation means one failure from unavailability)
curl -s http://localhost:8080/_status/vars | grep ranges_underreplicated

# Node readiness (returns 503 when draining or quorum lost)
curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:8080/health?ready=1"

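# Node liveness status via API
# Liveness is read from the top-level map keyed by node ID; key casing is handled defensively, verify on your version.
curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
data = json.load(sys.stdin)
live = data.get('livenessByNodeId') or data.get('liveness_by_node_id') or {}
for n in data['nodes']:
    nid = n['desc'].get('nodeId', n['desc'].get('node_id'))
    print(f'Node {nid}: liveness={live.get(str(nid), \"UNKNOWN\")}')"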

# RPC latency between node pairs (partition indicator)
curl -s http://localhost:8080/_status/vars | grep round_trip_latency

# Clock offset (self-termination precursor)
curl -s http://localhost:8080/_status/vars | grep clock_offset

# Count ranges with no lease (admin SQL, expensive during incidents)
# cockroach sql -e "SELECT count(*) FROM crdb_internal.ranges_no_leases;"

# Liveness details via SQL (admin-only, version-sensitive)
# cockroach sql -e "SELECT node_id, epoch, expiration, draining, decommissioning, membership FROM crdb_internal.gossip_liveness;"
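Because `grep` matches substrings and prints one line per store, it can help to total a metric by exact name. The following is a minimal sketch that parses the Prometheus text returned by `/_status/vars`; the helper names and the sample payload are illustrative assumptions, not output captured from a real cluster.

```python
import re
from typing import Dict

# Matches sample lines such as: ranges_unavailable{store="1"} 0
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')


def metric_samples(vars_text: str, name: str) -> Dict[str, float]:
    """Return {label string: value} for every sample whose name is exactly `name`."""
    out: Dict[str, float] = {}
    for line in vars_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if m and m.group(1) == name:
            out[m.group(2) or ""] = float(m.group(3))
    return out


def metric_total(vars_text: str, name: str) -> float:
    """Sum a per-store metric across all stores reported by this node."""
    return sum(metric_samples(vars_text, name).values())


# Hypothetical payload, shaped like the Prometheus exposition format.
SAMPLE_VARS = """\
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 2
ranges_unavailable{store="2"} 1
ranges_underreplicated{store="1"} 7
"""

print(metric_total(SAMPLE_VARS, "ranges_unavailable"))
```

In practice the text would come from the same `curl` call shown above, run against each node in turn.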

How to diagnose it

Confirm the scope. Check ranges_unavailable on every node. The metric is per-store, so a single node’s view may differ from another’s. If the count is nonzero on multiple nodes for the same ranges, the issue is cluster-wide rather than localized.
Check node liveness. Query crdb_internal.gossip_liveness or hit /_status/nodes. If multiple nodes are not-live, you have a multi-node failure. If all nodes report live but ranges are still unavailable, suspect a network partition or a stuck Raft group.
Identify which ranges are unavailable. Use SELECT count(*) FROM crdb_internal.ranges_no_leases; to confirm the count. Then check the DB Console Problem Ranges page, or query /_status/range/{range_id} for specific ranges to see replica placement and Raft status. Note that crdb_internal tables are unsupported and may change between versions.
Determine if system ranges are affected. If unavailable ranges include meta ranges, the liveness range, or the jobs table, the impact is amplified. The cluster may freeze entirely. Check whether the liveness range specifically has lost quorum.
Check for network partition. Examine round_trip_latency between node pairs. A partition manifests as RPC latency dropping to zero (unreachable) or spiking dramatically. Asymmetric partitions (A reaches B but B cannot reach A) cause unpredictable Raft behavior and are harder to detect.
Check for clock skew. If nodes are self-terminating, examine clock_offset_meannanos. A node shuts itself down when its clock is out of sync with at least half of the other nodes by more than 80% of --max-offset (default 500ms, so above 400ms). Multiple nodes sharing NTP infrastructure can drift simultaneously, causing cascading quorum loss.
Check logs for Raft leader-not-found errors. Grep CockroachDB logs for leader election messages. Stuck Raft groups that cannot elect a leader will show repeated election attempts without success. This distinguishes a stuck group from a simple quorum loss.
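The branching in the steps above can be summarized as a small decision function. This is a sketch under stated assumptions: the `Observations` fields are hypothetical values an operator fills in after running the checks, the labels are illustrative, and the case of exactly one not-live node is folded into the final branches as a simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observations:
    """Hypothetical summary of what the diagnostic checks returned."""
    ranges_unavailable: int
    not_live_nodes: int
    system_ranges_affected: bool
    peer_unreachable: bool            # judged from round_trip_latency
    max_clock_offset_ns: int          # worst clock_offset_meannanos seen
    max_offset_ns: int = 500_000_000  # --max-offset default of 500ms


def triage(o: Observations) -> List[str]:
    """Map observations to the likely causes described in this guide."""
    if o.ranges_unavailable == 0:
        return ["no-unavailable-ranges"]
    findings: List[str] = []
    if o.system_ranges_affected:
        findings.append("system-ranges-affected")
    # Integer form of: |offset| > 80% of max offset.
    if abs(o.max_clock_offset_ns) * 10 > o.max_offset_ns * 8:
        findings.append("clock-skew")
    if o.not_live_nodes >= 2:
        findings.append("multi-node-failure")
    elif o.peer_unreachable:
        findings.append("network-partition")
    else:
        findings.append("stuck-raft-group")
    return findings


print(triage(Observations(3, 0, False, False, 0)))
```

A function like this does not replace the log and range-status inspection; it only records which branch of the runbook the evidence points to.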

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`ranges_unavailable`	Direct measure of unavailable keyspace	Any nonzero value is an emergency
`ranges_underreplicated`	Replication safety margin; one failure from unavailability	Sustained nonzero after maintenance window closes
Node liveness (`/_status/nodes`)	Whether cluster considers each node alive and participating	Unexpected transition to not-live; flapping between live and not-live
`round_trip_latency`	Inter-node network health as Raft experiences it	Pairs showing >5x baseline or unreachable
`clock_offset_meannanos`	Clock synchronization health	Any node above 250ms (50% of default max-offset)
`leases_transfers_success`	Lease churn indicates instability	>10x baseline without operational cause
Client error rate	Application-visible impact	Spike in unavailable or retry errors
`ranges` and `leases_count` per node	Load distribution	Imbalance suggests failed node load not redistributed
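The warning signs in the table can be expressed as simple threshold rules. The sketch below assumes the caller has already reduced each signal to a single number (for example a latency percentile for `round_trip_latency` and a per-second rate for `leases_transfers_success`); the function name and the dictionary shape are hypothetical.

```python
from typing import Dict, List, Tuple


def evaluate_signals(current: Dict[str, float],
                     baseline: Dict[str, float],
                     max_offset_ns: int = 500_000_000) -> List[Tuple[str, str]]:
    """Return (severity, signal) pairs for every rule from the table that fires."""
    alerts: List[Tuple[str, str]] = []
    # Any nonzero value is an emergency.
    if current.get("ranges_unavailable", 0) > 0:
        alerts.append(("critical", "ranges_unavailable"))
    # Warn above 50% of max offset (250ms with the 500ms default).
    if abs(current.get("clock_offset_meannanos", 0)) * 2 > max_offset_ns:
        alerts.append(("warning", "clock_offset_meannanos"))
    # Ratio rules: more than 5x baseline latency, more than 10x baseline lease transfers.
    for name, factor in (("round_trip_latency", 5), ("leases_transfers_success", 10)):
        base = baseline.get(name)
        if base and current.get(name, 0) > factor * base:
            alerts.append(("warning", name))
    return alerts


print(evaluate_signals(
    {"ranges_unavailable": 1, "clock_offset_meannanos": 300_000_000,
     "round_trip_latency": 12.0, "leases_transfers_success": 4.0},
    {"round_trip_latency": 2.0, "leases_transfers_success": 1.0},
))
```

Keeping the baseline as an explicit input makes the ratio rules testable and avoids hard-coding latency numbers that differ between clusters.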

Fixes

Multi-node failure: restore nodes

If multiple nodes have crashed or been killed, the fastest path to recovery is bringing them back online. Quorum is restored when enough replicas are reachable again. For a replication factor of 3 with 2 dead nodes, bringing one back restores quorum for ranges where the returning node holds a replica.
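To see why one returning node can be enough, consider a minimal model of replica placement. The range IDs, node IDs, and placement map below are made up for illustration; in a real cluster the placement would come from the range status endpoints.

```python
from typing import Dict, List, Set


def ranges_with_quorum(placement: Dict[int, List[int]], live_nodes: Set[int]) -> Set[int]:
    """Return the IDs of ranges for which a majority of replicas sit on live nodes."""
    available = set()
    for range_id, nodes in placement.items():
        reachable = sum(1 for n in nodes if n in live_nodes)
        if reachable >= len(nodes) // 2 + 1:
            available.add(range_id)
    return available


# Hypothetical 5-node cluster, replication factor 3, nodes 2 and 3 are down.
placement = {10: [1, 2, 3], 11: [2, 3, 4], 12: [1, 4, 5]}

before = ranges_with_quorum(placement, live_nodes={1, 4, 5})
after = ranges_with_quorum(placement, live_nodes={1, 2, 4, 5})  # node 2 restarted
print(sorted(before), sorted(after))
```

Ranges 10 and 11 each had two replicas on the dead nodes, so they regain a majority as soon as either dead node returns. Range 12 never lost quorum.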

Caveat: if multiple replicas survive for a range, simply bringing back the designated survivor may not restore availability. Non-designated survivors may still appear in meta2 records and DistSender caches, causing requests to hang. Targeted restarts of specific nodes holding surviving replicas may be needed.

Network partition: restore connectivity

If a partition is the root cause, restoring network connectivity resolves the issue. CockroachDB re-establishes Raft heartbeats and elects leaders automatically once connectivity returns. Check firewall rules, security groups, DNS resolution, and physical network health. Asymmetric partitions are harder to detect and may require packet capture or traceroute from both directions.

Stuck Raft groups

If nodes are healthy and the network is fine but specific ranges remain unavailable, the Raft group may be stuck in a state where it cannot elect a leader. This can happen after prolonged unavailability, corrupted Raft state, or version-specific bugs. Check /_status/range/{range_id} for the specific range’s replica list and Raft status.

If the range has sufficient replicas but still cannot elect a leader, this may require CockroachDB support intervention. Do not attempt manual Raft state manipulation.

System range unavailability: offline recovery

If system ranges (especially the liveness range) have lost quorum permanently, the cluster is effectively down. Standard recovery by restarting nodes will not work because nodes cannot complete boot without the liveness range.

The cockroach debug recover workflow is the official offline recovery path for permanent loss of quorum:

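# DESTRUCTIVE: Offline recovery for permanent quorum loss.
# Requires all nodes stopped. Follow the upstream RFC carefully.
# File names are placeholders; confirm arguments with --help on your version.
# 1. Collect replica information from all stores (run on every node, one file each)
cockroach debug recover collect-info --store=<store-dir> > node1-info.json

# 2. Generate a recovery plan from all collected info files
cockroach debug recover make-plan -o recover-plan.json node1-info.json node2-info.json

# 3. Apply the plan on each node (rewrites range descriptors to remove dead replicas)
cockroach debug recover apply-plan --store=<store-dir> recover-plan.json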

This procedure rewrites range descriptors to remove dead replicas and form a new quorum from surviving replicas. It is a last resort. The older cockroach debug unsafe-remove-dead-replicas command is still present, but the debug recover workflow is safer.

Warning: these commands operate on a stopped cluster and modify range metadata directly. Back up store directories before proceeding.

Decommission-induced unavailability

If ranges_unavailable spikes during a decommission, the decommission is proceeding too fast or the cluster lacks capacity to absorb the relocated replicas. Stop the decommission and let up-replication complete before removing more nodes. During a healthy decommission, ranges are moved before the node goes down, so ranges_unavailable should stay zero.

Prevention

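Maintain replication headroom. For a 3-node cluster with replication factor 3, losing one node leaves every range one failure from unavailability, and there is no spare node to re-replicate onto. Clusters with five or more nodes can up-replicate onto the remaining nodes after a failure, and a replication factor of 5 tolerates two simultaneous failures.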
Monitor ranges_underreplicated proactively. If under-replication persists after a maintenance window, investigate before another failure compounds the problem.
Use /health?ready=1 for load balancer health checks. A plain TCP check keeps routing traffic to impaired nodes, whereas the readiness endpoint returns 503 when the node is draining or quorum is lost.
Monitor clock offset continuously. NTP failures cause silent performance degradation long before causing node self-termination. Alert on clock_offset_meannanos well below the 500ms default max-offset.
Separate failure domains. Ensure replicas are spread across zones or racks. Verify zone configurations enforce correct placement constraints.
Gate decommissioning on replication health. Do not decommission multiple nodes simultaneously. Verify ranges_underreplicated returns to zero before removing the next node.
Watch lease transfer rates. Sustained elevated lease transfers without an operational cause indicate underlying node instability that can cascade into unavailability.
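Gating decommissioning on replication health can be automated with a small polling loop. This is a sketch, not a supported tool: `fetch` is a hypothetical callable that returns cluster-wide totals for the two metrics, and the timeout and poll interval are arbitrary defaults.

```python
import time
from typing import Callable, Dict


def wait_until_replicated(fetch: Callable[[], Dict[str, float]],
                          timeout_s: float = 600.0,
                          poll_s: float = 5.0,
                          sleep: Callable[[float], None] = time.sleep,
                          clock: Callable[[], float] = time.monotonic) -> bool:
    """Block until ranges_underreplicated is zero; return False on timeout.

    Raises RuntimeError if ranges_unavailable goes nonzero, because the
    decommission should be stopped rather than waited out.
    """
    deadline = clock() + timeout_s
    while True:
        metrics = fetch()
        if metrics["ranges_unavailable"] > 0:
            raise RuntimeError("ranges_unavailable is nonzero; stop the decommission")
        if metrics["ranges_underreplicated"] == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Usage with canned readings standing in for real metric scrapes.
readings = iter([
    {"ranges_unavailable": 0, "ranges_underreplicated": 12},
    {"ranges_unavailable": 0, "ranges_underreplicated": 3},
    {"ranges_unavailable": 0, "ranges_underreplicated": 0},
])
print(wait_until_replicated(lambda: next(readings), sleep=lambda s: None))
```

Indexing the metrics directly, rather than defaulting missing keys to zero, makes a failed scrape raise an error instead of being mistaken for a healthy cluster.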

How Netdata helps

Per-second ranges_unavailable collection catches nonzero values faster than typical 15-30 second scrape intervals, reducing detection latency during active incidents.
Correlation with node liveness shows whether ranges_unavailable spikes coincide with specific node deaths, narrowing diagnosis from “something is unavailable” to “these specific nodes went down.”
RPC latency between nodes is visible alongside range unavailability, making network partition diagnosis immediate rather than requiring separate tooling.
Clock offset monitoring provides early warning when NTP drift begins, often hours before it would cause quorum loss through node self-termination.
Lease transfer rate anomalies flagged by ML detection identify the oscillating pattern where nodes repeatedly gain and lose liveness, a common precursor to sustained unavailability.
Under-replication trends at per-second granularity show whether the cluster is healing or stalled, which determines whether you can wait or must intervene immediately.

CockroachDB monitoring with Netdata brings these signals together with per-second metrics and ML anomaly detection.
