Cassandra quorum loss: when too many replicas are down to satisfy the CL

Your application is throwing UnavailableException. Writes and reads fail immediately, not timing out. The coordinator checked the token ring, counted live replicas, and rejected the request because it cannot satisfy the consistency level. No client retry will fix this server-side. The cluster is in quorum loss and will not recover until enough replicas are restored.

When the count of down nodes for a token range exceeds floor(RF/2), every request at QUORUM or stronger fails instantly. Surviving nodes may still serve CL=ONE requests for ranges that have a live replica, but any operation requiring a majority is an outage for those partitions. This is a PAGE-level composite failure pattern that demands immediate node recovery followed by repair.

flowchart TD
    A[Client receives UnavailableException] --> B{Check nodetool status}
    B -->|Multiple DN nodes| C[Quorum loss for affected ranges]
    B -->|All UN| D[Check CL and RF settings]
    C --> E{Nodes reachable?}
    E -->|No| F[Hardware or JVM failure]
    E -->|Yes| G[Network partition or GC flap]
    F --> H[Restore one node at a time]
    G --> H
    H --> I[Run full repair after recovery]

What this means

Cassandra uses the phi accrual failure detector to mark nodes UP or DOWN via gossip. When a coordinator receives a request, it determines which replicas own the partition key and checks how many are alive. If the live count is insufficient for the requested consistency level, the coordinator throws UnavailableException immediately. It does not wait, queue, or retry.

This differs from TimeoutException. A timeout means enough replicas were alive to attempt the request but did not respond within read_request_timeout_in_ms or write_request_timeout_in_ms. Timeouts can self-resolve if a slow replica catches up. Unavailables persist until nodes are recovered or the consistency level is lowered.

In a typical RF=3 cluster running QUORUM, losing two nodes removes the majority. Every token range replicated to those two nodes is unreachable at QUORUM. If an entire rack or availability zone hosts multiple replicas for the same ranges, the blast radius can be cluster-wide despite many nodes remaining alive.

Common causes

CauseWhat it looks likeFirst thing to check
Multiple simultaneous node failuresnodetool status shows several DN nodes in the same rack or DCRun nodetool status from multiple coordinators to confirm the view is consistent
Bad rolling restartMaintenance overlapped and too many nodes were taken down at onceMaintenance logs and uptime per node; check for nodes stuck in UJ or L states
Network partition isolating replicasNodes respond to ping but gossip marks them DN from the other sidenodetool status from nodes on both sides of the suspected partition
Cascading GC death spiralNodes rapidly flap UP and DOWN before going permanently DNGC logs for pauses exceeding the phi threshold and heap usage on affected nodes

Quick checks

Run these read-only commands to assess scope without making the situation worse.

# Cluster membership and DOWN node count
nodetool status

# Schema agreement; stuck migrations complicate recovery
nodetool describecluster

# UnavailableException frequency in system logs
grep -i "UnavailableException" /var/log/cassandra/system.log

# Heap, uptime, and native transport status on survivors
nodetool info

# Hint accumulation to estimate outage duration
du -sh /var/lib/cassandra/hints/

# Active streaming or repair compounding load
nodetool netstats

# Disk pressure on surviving nodes
df -h /var/lib/cassandra/data /var/lib/cassandra/commitlog

How to diagnose it

  1. Confirm Unavailable versus Timeout. UnavailableException means the coordinator rejected the query immediately because insufficient replicas are alive. TimeoutException means enough replicas were alive but did not respond in time. Check application logs and Cassandra system logs for the exact exception class. Timeouts but no unavailables means replicas are slow, not missing.

  2. Count DOWN nodes. Run nodetool status on multiple coordinators. Different views across coordinators indicate a network partition. A consistent count means those nodes are genuinely unreachable.

  3. Map DOWN nodes to token ranges. With vnodes (default num_tokens=256), a single DOWN node affects all ranges where it is a replica. With multiple DOWN nodes, calculate whether any range has lost its majority. For RF=3, any range whose two replicas are both DN has lost quorum.

  4. Check for recent topology changes. Review whether a rolling restart, bootstrap, decommission, or removenode was in progress. A node stuck in UJ or L state is not fully available and can tip the cluster into quorum loss if other failures coincide.

  5. Correlate with resource exhaustion. On DOWN nodes (if reachable), check dmesg, disk SMART data, and GC logs. On surviving nodes, check thread pool pending tasks and dropped messages to confirm they are not about to fail under redirected load.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
DownEndpointCountNodes marked DOWN by the failure detectorSustained > 0 is a TICKET; quorum loss when count exceeds floor(RF/2) for affected ranges
Client Request UnavailablesImmediate failures for insufficient replicasSustained spike in UnavailableException rate relative to baseline
Client Request TimeoutsDistinguishes slow replicas from missing replicasHigh timeouts with zero unavailables means replicas are alive but overloaded
Node liveness (nodetool status)Human-readable gossip stateAny DN status sustained beyond a restart window
Hinted handoff store sizeHints accumulate when replicas are DOWNGrowing /var/lib/cassandra/hints/ signals an extended outage
Thread pool pending tasksSurviving nodes may saturate under extra loadPending > 0 in MUTATION or READ stages for > 60 seconds

Fixes

Restore nodes one at a time

Bring nodes back individually. Wait for gossip to converge and nodetool status to show UN before proceeding to the next. Bringing multiple nodes back concurrently can trigger hint replay storms and compaction debt that push recovering nodes back into failure.

Warning: If a node was down longer than gc_grace_seconds (default 10 days), do not simply restart and repair. Tombstones may have been garbage-collected on surviving replicas while the down node still holds original data, which resurrects deleted rows. In this scenario, wipe the data directory and rebuild the node from scratch using nodetool rebuild, or decommission it and bootstrap a replacement.

Run full repair after recovery

Once all nodes are UN, run a full repair across affected ranges. If your environment defaults to incremental repair, force a full comparison to ensure all token ranges are reconciled. Use nodetool repair -full -pr to repair each node’s primary ranges sequentially. Schedule this during low-traffic hours; repair is extremely I/O-intensive.

Heal network partitions

If the root cause is a network partition, restore connectivity first. After the network heals, run full repair. During a partition, timestamp-based conflict resolution may have allowed divergent writes on both sides. Repair is mandatory to reconcile inconsistencies.

Handle decommissioned nodes correctly

If a node was properly decommissioned with nodetool decommission or nodetool removenode, its tokens were redistributed. Simply restarting the old hardware will fail because its token assignments no longer match the ring. You must bootstrap it as a new node. If you are unsure whether the node was decommissioned, check the ring topology with nodetool status before starting the process.

Prevention

  • Rolling restarts: Restart exactly one node at a time. Verify UN status and that native transport is active before touching the next node.
  • Sustained failure detection: Alert on DownEndpointCount > 0 sustained for more than 5 minutes. Brief blips during restarts are normal; sustained DN is not.
  • Repair cadence: Run full repair on every node within gc_grace_seconds. Best practice is to complete a full cycle in half the gc_grace_seconds window.
  • Failure domain isolation: Distribute replicas across racks and availability zones so a single switch or power event cannot take down multiple replicas for the same ranges.
  • Phi threshold tuning: If nodes are falsely marked DOWN due to GC pauses or network latency, adjust phi_convict_threshold. Raise it cautiously to 10-12 on high-latency networks such as AWS EC2. Do not raise it above 12; this masks real failures.

How Netdata helps

  • Correlate DownEndpointCount with ClientRequest Unavailables to confirm quorum loss within seconds of the first failure.
  • Track JVM heap usage and GC pause duration to detect nodes entering a GC death spiral before gossip marks them DOWN.
  • Monitor disk I/O utilization and pending compactions to catch resource exhaustion that precedes node failure.
  • Alert on sustained node unavailability using baseline-aware thresholds that filter brief restart blips.