Cassandra quorum loss: when too many replicas are down to satisfy the CL
Your application is throwing UnavailableException. Writes and reads fail immediately instead of timing out. The coordinator checked the token ring, counted live replicas, and rejected the request because it cannot satisfy the consistency level. Client-side retries will not fix this, because the condition is on the server side. The cluster is in quorum loss and will not recover until enough replicas are restored.
A request at QUORUM needs floor(RF/2) + 1 live replicas for the token range it touches. Once fewer than that remain, every request at QUORUM or stronger for that range fails instantly; with RF=3, that happens when two of the three replicas are down. Surviving nodes may still serve CL=ONE requests for ranges that have a live replica, but any operation requiring a majority is an outage for those partitions. This is a PAGE-level composite failure pattern that demands immediate node recovery followed by repair.
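The quorum arithmetic can be sketched in a few lines. This is a minimal illustration for a single datacenter at QUORUM; the function names `quorum` and `tolerated_down` are made up for this example and are not part of any Cassandra tooling.

```python
def quorum(rf: int) -> int:
    """Replicas that must be alive for QUORUM: floor(RF/2) + 1."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return rf // 2 + 1


def tolerated_down(rf: int) -> int:
    """Replicas of one token range that can be down while QUORUM still succeeds."""
    return rf - quorum(rf)


for rf in (3, 4, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerated down replicas={tolerated_down(rf)}")
```

Note that RF=4 tolerates only one down replica, the same as RF=3, because the quorum grows to three.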
flowchart TD
A[Client receives UnavailableException] --> B{Check nodetool status}
B -->|Multiple DN nodes| C[Quorum loss for affected ranges]
B -->|All UN| D[Check CL and RF settings]
C --> E{Nodes reachable?}
E -->|No| F[Hardware or JVM failure]
E -->|Yes| G[Network partition or GC flap]
F --> H[Restore one node at a time]
G --> H
H --> I[Run full repair after recovery]
What this means
Cassandra uses the phi accrual failure detector to mark nodes UP or DOWN via gossip. When a coordinator receives a request, it determines which replicas own the partition key and checks how many are alive. If the live count is insufficient for the requested consistency level, the coordinator throws UnavailableException immediately. It does not wait, queue, or retry.
This differs from TimeoutException. A timeout means enough replicas were alive to attempt the request but did not respond within read_request_timeout_in_ms or write_request_timeout_in_ms. Timeouts can self-resolve if a slow replica catches up. Unavailables persist until nodes are recovered or the consistency level is lowered.
In a typical RF=3 cluster running QUORUM, a token range loses its majority as soon as two of its three replicas are down. Every range that counts both down nodes among its replicas is unreachable at QUORUM. If an entire rack or availability zone hosts multiple replicas for the same ranges, the blast radius can be cluster-wide despite many nodes remaining alive.
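To make the blast radius concrete, here is a rough sketch under simplifying assumptions: one token per node, and each range replicated to its owner plus the next nodes clockwise on the ring (roughly what SimpleStrategy does). Real clusters with vnodes and rack-aware placement behave differently, and the node names are hypothetical.

```python
def replicas_for_ranges(ring, rf):
    """Simplified placement: the range owned by ring[i] lives on ring[i] and the next rf-1 nodes clockwise."""
    n = len(ring)
    return {ring[i]: [ring[(i + k) % n] for k in range(rf)] for i in range(n)}


def ranges_without_quorum(ring, rf, down):
    """Return the owners of ranges that have fewer than floor(rf/2) + 1 live replicas."""
    needed = rf // 2 + 1
    lost = []
    for owner, replicas in replicas_for_ranges(ring, rf).items():
        alive = sum(1 for node in replicas if node not in down)
        if alive < needed:
            lost.append(owner)
    return lost


ring = ["n1", "n2", "n3", "n4", "n5", "n6"]
print(ranges_without_quorum(ring, 3, {"n2", "n3"}))  # ['n1', 'n2']
print(ranges_without_quorum(ring, 3, {"n2", "n5"}))  # []
```

In this toy ring, two adjacent failures take out quorum for two ranges, while two failures that share no range leave every range with a majority.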
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Multiple simultaneous node failures | nodetool status shows several DN nodes in the same rack or DC | Run nodetool status from multiple coordinators to confirm the view is consistent |
| Bad rolling restart | Maintenance overlapped and too many nodes were taken down at once | Maintenance logs and uptime per node; check for nodes stuck in UJ (joining) or UL (leaving) states |
| Network partition isolating replicas | Nodes respond to ping but gossip marks them DN from the other side | nodetool status from nodes on both sides of the suspected partition |
| Cascading GC death spiral | Nodes rapidly flap UP and DOWN before going permanently DN | GC logs for pauses exceeding the phi threshold and heap usage on affected nodes |
Quick checks
Run these read-only commands to assess scope without making the situation worse.
# Cluster membership and DOWN node count
nodetool status
# Schema agreement; stuck migrations complicate recovery
nodetool describecluster
# UnavailableException frequency in system logs
grep -i "UnavailableException" /var/log/cassandra/system.log
# Heap, uptime, and native transport status on survivors
nodetool info
# Hint accumulation to estimate outage duration
du -sh /var/lib/cassandra/hints/
# Active streaming or repair compounding load
nodetool netstats
# Disk pressure on surviving nodes
df -h /var/lib/cassandra/data /var/lib/cassandra/commitlog
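If you want to count DN nodes per datacenter programmatically, the following sketch parses text in the general shape of nodetool status output. The sample below is abbreviated and hypothetical; column layout varies between Cassandra versions, so the parser relies only on the two-letter status/state code and the address that follows it.

```python
import re

# A node row starts with a status letter (U or D) and a state letter (N, L, J, or M).
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def down_nodes_by_dc(status_output):
    """Map each datacenter name to the addresses reported as down."""
    down = {}
    dc = None
    for line in status_output.splitlines():
        if line.startswith("Datacenter:"):
            dc = line.split(":", 1)[1].strip()
            continue
        match = NODE_LINE.match(line)
        if match and match.group(1).startswith("D"):
            down.setdefault(dc, []).append(match.group(2))
    return down


# Hypothetical, abbreviated sample; capture real input with: nodetool status > status.txt
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1  1.2 GiB  16      ?     aaaa     rack1
DN  10.0.0.2  1.1 GiB  16      ?     bbbb     rack1
DN  10.0.0.3  1.3 GiB  16      ?     cccc     rack2
"""

print(down_nodes_by_dc(SAMPLE))  # {'dc1': ['10.0.0.2', '10.0.0.3']}
```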
How to diagnose it
1. Confirm Unavailable versus Timeout. UnavailableException means the coordinator rejected the query immediately because insufficient replicas are alive. TimeoutException means enough replicas were alive but did not respond in time. Check application logs and Cassandra system logs for the exact exception class. Timeouts but no unavailables means replicas are slow, not missing.
2. Count DOWN nodes. Run nodetool status on multiple coordinators. Different views across coordinators indicate a network partition. A consistent count means those nodes are genuinely unreachable.
3. Map DOWN nodes to token ranges. With vnodes (num_tokens defaults to 256 before Cassandra 4.0 and to 16 from 4.0 onward), a single DOWN node affects all ranges where it is a replica. With multiple DOWN nodes, calculate whether any range has lost its majority. For RF=3, any range with two of its three replicas DN has lost quorum.
4. Check for recent topology changes. Review whether a rolling restart, bootstrap, decommission, or removenode was in progress. A node stuck in UJ (joining) or UL (leaving) state is not fully available and can tip the cluster into quorum loss if other failures coincide.
5. Correlate with resource exhaustion. On DOWN nodes (if reachable), check dmesg, disk SMART data, and GC logs. On surviving nodes, check thread pool pending tasks and dropped messages to confirm they are not about to fail under redirected load.
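The first diagnostic step, separating unavailables from timeouts, can be roughed out as a log classifier. This sketch assumes the log lines contain the exception class names UnavailableException, ReadTimeoutException, and WriteTimeoutException (as the Java driver names them); other clients may log different names, and the sample lines are invented.

```python
from collections import Counter


def classify(lines):
    """Count log lines that mention unavailable errors versus read/write timeouts."""
    counts = Counter()
    for line in lines:
        if "UnavailableException" in line:
            counts["unavailable"] += 1
        elif "ReadTimeoutException" in line or "WriteTimeoutException" in line:
            counts["timeout"] += 1
    return counts


def diagnose(counts):
    """Turn the counts into the first thing to check."""
    if counts["unavailable"]:
        return "replicas missing: check nodetool status for DN nodes"
    if counts["timeout"]:
        return "replicas slow: check load, GC, and disk on live nodes"
    return "no unavailable or timeout errors found"


# Hypothetical application log lines.
sample = [
    "ERROR UnavailableException: Not enough replicas available for query at consistency QUORUM",
    "ERROR WriteTimeoutException: timeout during write query",
    "INFO request completed",
]
print(diagnose(classify(sample)))  # replicas missing: check nodetool status for DN nodes
```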
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| DownEndpointCount | Nodes marked DOWN by the failure detector | Sustained > 0 is a TICKET; quorum is lost for a range once fewer than floor(RF/2) + 1 of its replicas are up |
| Client Request Unavailables | Immediate failures for insufficient replicas | Sustained spike in UnavailableException rate relative to baseline |
| Client Request Timeouts | Distinguishes slow replicas from missing replicas | High timeouts with zero unavailables means replicas are alive but overloaded |
| Node liveness (nodetool status) | Human-readable gossip state | Any DN status sustained beyond a restart window |
| Hinted handoff store size | Hints accumulate when replicas are DOWN | Growing /var/lib/cassandra/hints/ signals an extended outage |
| Thread pool pending tasks | Surviving nodes may saturate under extra load | Pending > 0 in MUTATION or READ stages for > 60 seconds |
Fixes
Restore nodes one at a time
Bring nodes back individually. Wait for gossip to converge and nodetool status to show UN before proceeding to the next. Bringing multiple nodes back concurrently can trigger hint replay storms and compaction debt that push recovering nodes back into failure.
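The one-at-a-time rule can be expressed as a small control loop. This is a sketch, not an operational tool: `start_node` and `get_state` are hypothetical callables you would supply (for example, wrappers around your service manager and a nodetool status parser), and the timeout values are arbitrary.

```python
import time


def wait_until_up(address, get_state, timeout_s=600, interval_s=10,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_state(address) until it returns "UN" or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if get_state(address) == "UN":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def restore_in_order(addresses, start_node, get_state, **wait_kwargs):
    """Start nodes one by one; stop at the first node that fails to reach UN."""
    restored = []
    for address in addresses:
        start_node(address)
        if not wait_until_up(address, get_state, **wait_kwargs):
            break
        restored.append(address)
    return restored


# Simulated run: each fake node reports DN twice, then UN.
polls = {}

def fake_state(address):
    polls[address] = polls.get(address, 0) + 1
    return "UN" if polls[address] >= 3 else "DN"

print(restore_in_order(["10.0.0.2", "10.0.0.3"], start_node=lambda a: None,
                       get_state=fake_state, sleep=lambda s: None))
```

Stopping at the first node that does not come back keeps a stuck node from being buried under further restarts.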
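Warning: If a node was down longer than gc_grace_seconds (default 864000 seconds, or 10 days), do not simply restart and repair. Tombstones may have been garbage-collected on surviving replicas while the down node still holds the original data, so bringing it back resurrects deleted rows. In this scenario, wipe the node's data and re-stream it from healthy replicas, for example by replacing the node in place (starting it with -Dcassandra.replace_address_first_boot set to its address), or remove it from the ring and bootstrap a replacement. Use nodetool removenode for a node that is down; nodetool decommission only works on a node that is still running. nodetool rebuild is normally used to populate a node from another datacenter.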
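The decision in the warning above reduces to one comparison. A minimal sketch, assuming you know how long the node was down; `recovery_action` is a hypothetical helper, and because gc_grace_seconds is set per table, the smallest value across your tables is the one to pass in.

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # 10 days, the Cassandra default


def recovery_action(down_seconds, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Return "rebuild" if the outage outlasted gc_grace_seconds, else "restart_and_repair"."""
    if down_seconds < 0:
        raise ValueError("down_seconds must not be negative")
    if down_seconds > gc_grace_seconds:
        return "rebuild"
    return "restart_and_repair"


DAY = 86_400
print(recovery_action(3 * DAY))   # restart_and_repair
print(recovery_action(12 * DAY))  # rebuild
```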
Run full repair after recovery
Once all nodes are UN, run a full repair across affected ranges. If your environment defaults to incremental repair, force a full comparison to ensure all token ranges are reconciled. Use nodetool repair -full -pr to repair each node’s primary ranges sequentially. Schedule this during low-traffic hours; repair is extremely I/O-intensive.
Heal network partitions
If the root cause is a network partition, restore connectivity first. After the network heals, run full repair. During a partition, each side may have accepted writes that the other side never saw, and Cassandra reconciles conflicting versions by write timestamp (last write wins). Repair is mandatory to bring the replicas back in sync.
Handle decommissioned nodes correctly
If a node was removed from the ring with nodetool decommission or nodetool removenode, its token ranges were reassigned to other nodes. Simply restarting the old hardware will fail because its token assignments no longer match the ring. You must wipe it and bootstrap it as a new node. If you are unsure whether the node was removed, check the ring topology with nodetool status before starting the process.
Prevention
- Rolling restarts: Restart exactly one node at a time. Verify UN status and that native transport is active before touching the next node.
- Sustained failure detection: Alert on DownEndpointCount > 0 sustained for more than 5 minutes. Brief blips during restarts are normal; sustained DN is not.
- Repair cadence: Run full repair on every node within gc_grace_seconds. Best practice is to complete a full cycle in half the gc_grace_seconds window.
- Failure domain isolation: Distribute replicas across racks and availability zones so a single switch or power event cannot take down multiple replicas for the same ranges.
- Phi threshold tuning: If nodes are falsely marked DOWN due to GC pauses or network latency, adjust phi_convict_threshold. Raise it cautiously to 10-12 on high-latency networks such as AWS EC2. Do not raise it above 12; this masks real failures.
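The "sustained for more than 5 minutes" rule can be sketched as a check over time-stamped samples of DownEndpointCount. This illustrates the logic only; it is not how any particular monitoring system evaluates alerts, and the sample format of (timestamp in seconds, down count) pairs is an assumption.

```python
def sustained_down(samples, window_s=300):
    """True if the down count has been above zero, without a gap, for more than window_s up to the last sample.

    samples: list of (timestamp_seconds, down_count) tuples sorted by time.
    """
    start = None
    for ts, count in samples:
        if count > 0:
            if start is None:
                start = ts  # beginning of the current unbroken DOWN streak
        else:
            start = None    # any healthy sample resets the streak
    if start is None:
        return False
    return samples[-1][0] - start > window_s


restart_blip = [(0, 1), (60, 0), (120, 1), (180, 1)]
real_outage = [(0, 0), (60, 1), (120, 1), (420, 1)]
print(sustained_down(restart_blip))  # False
print(sustained_down(real_outage))   # True
```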
How Netdata helps
- Correlate DownEndpointCount with ClientRequest Unavailables to confirm quorum loss within seconds of the first failure.
- Track JVM heap usage and GC pause duration to detect nodes entering a GC death spiral before gossip marks them DOWN.
- Monitor disk I/O utilization and pending compactions to catch resource exhaustion that precedes node failure.
- Alert on sustained node unavailability using baseline-aware thresholds that filter brief restart blips.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert







