Cassandra UnavailableException: not enough replicas for the consistency level

Cassandra threw UnavailableException. The coordinator rejected the request immediately without contacting replicas. No mutation occurred, and retrying at the same consistency level against the same coordinator fails until enough replicas are UN in the coordinator’s gossip view. This is a topology problem, not a performance problem.

This exception is qualitatively different from TimeoutException. A timeout means enough replicas were alive but responded too slowly. An unavailable means the coordinator never sent the request because the topology could not satisfy the consistency level.

flowchart LR
    A[Client request] --> B[Coordinator]
    B --> C{Live replicas >= CL?}
    C -->|No| D["UnavailableException<br/>fail-fast, no waiting"]
    C -->|Yes| E[Send to replicas]
    E -->|Slow responses| F[TimeoutException]
    E -->|Fast enough| G[Success]

What this means

Before forwarding a request, the coordinator counts replicas marked UP in gossip for the partition’s token range. If the count is below the consistency level, it throws UnavailableException immediately. The exception payload includes the requested consistency level, the number required, and the number alive. For example, QUORUM with replication factor 3 requires 2 replicas; if only 1 is alive, the operation fails before any network round-trip.

Nodes marked DOWN, still joining, or not yet recognized after a restart do not count toward the consistency level. Because the request is rejected before any data is mutated, retrying at a lower consistency level or after topology recovery is safe. Retrying at the original consistency level continues to fail until missing replicas return to UN.
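The coordinator's check can be sketched as plain arithmetic. This is an illustration only, not Cassandra's implementation: `quorum` and `cl_satisfied` are hypothetical helper names, and the sketch covers only single-datacenter QUORUM.

```shell
# Illustrative only: hypothetical helpers, not part of Cassandra or nodetool.

# quorum RF -> replicas required for QUORUM: floor(RF / 2) + 1
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# cl_satisfied REQUIRED ALIVE -> exit status 0 if enough replicas are alive
cl_satisfied() {
    [ "$2" -ge "$1" ]
}

required=$(quorum 3)    # RF=3 requires 2 replicas
if cl_satisfied "$required" 1; then
    echo "proceed"
else
    echo "unavailable: required=$required alive=1"
fi
```

With RF=3 and one live replica, the sketch takes the "unavailable" branch, which mirrors the fail-fast rejection described above.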

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Node marked DOWN by gossip | nodetool status shows DN or the node is absent from the ring | nodetool status from multiple coordinators |
| Rolling restart too aggressive | Unavailables spike during deployments; nodes briefly show UJ or DN | Restart procedure: only one node at a time, wait for UN |
| Node joining or recovering | Peers do not yet count the new node as alive | nodetool gossipinfo for IS_ALIVE and STATUS |
| Replication factor too low for the consistency level | A single node failure breaks quorum; e.g. RF=2 with CL QUORUM | Keyspace replication factor against CL requirements |
| Network partition | Coordinators on different sides see different DOWN node sets | Compare nodetool status across multiple nodes |

Quick checks

Run these read-only commands to assess cluster topology and distinguish unavailability from saturation.

# Check node states across the cluster
nodetool status

# Check gossip liveness and state transitions
nodetool gossipinfo

# Verify the coordinator is accepting client connections
nodetool statusbinary

# Check for dropped messages that indicate overload, not unavailability
nodetool tpstats

# Inspect recent gossip events in system logs
grep -E "Gossiper|FailureDetector|DOWN|UP" /var/log/cassandra/system.log

# Check for long GC pauses that may have triggered false DOWN detection
grep "pause" /var/log/cassandra/gc.log | tail -20

How to diagnose it

  1. Confirm the exception type. Check application or driver logs. UnavailableException means the coordinator failed fast. ReadTimeoutException or WriteTimeoutException means replicas were alive but slow. The fix for a timeout is capacity tuning; the fix for an unavailable is topology recovery.

  2. Count live replicas. Run nodetool status from multiple nodes. Identify any replicas not in UN (Up Normal). Only UN nodes count toward the consistency level. DN (Down Normal), UJ (Up Joining), and leaving states do not count.

  3. Check the exception payload. The exception includes cl, required, and alive. Verify the math. If required is 2 and alive is 1, you are missing exactly one replica for that token range.

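  4. Map the missing replica to the token range. Use nodetool ring or nodetool describering <keyspace> to find which nodes own the token range for the failing partition. If you know the partition key, nodetool getendpoints <keyspace> <table> <key> lists its replicas directly. Any replica in that set that is not UN is a source of the unavailability.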

  5. Determine why the replica is not alive.

    • Check nodetool gossipinfo for the target node. Look for IS_ALIVE and STATUS. A node that recently restarted may be running but not yet fully in the ring.
    • Check GC logs on the missing node. Long pauses that exceed the phi accrual threshold (default phi_convict_threshold=8) cause the failure detector to mark the node DOWN even though the process has not crashed.
    • Check system logs for FSError, CorruptSSTableException, or JVM exit signals that indicate a crash or disk failure.

  6. Check for network partitions. Compare nodetool status from nodes on different sides of the network. If coordinators disagree on which nodes are DOWN, you have a partial partition. This can cause subsets of the cluster to each believe they lack quorum.

  7. Validate replication factor versus consistency level. If your keyspace uses a replication factor of 2 and the application requests QUORUM (which requires 2), a single node failure causes total unavailability for that partition. Verify the keyspace schema to ensure the replication factor provides fault tolerance for your consistency level.

  8. Correlate with operational events. If the incident coincides with a deployment or restart, verify that nodes were restarted one at a time and that each node reached UN before the next restart began. Starting a second node before the first has fully joined can transiently reduce available replicas below quorum.
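
Several of the steps above come down to finding nodes that are not UN. A minimal sketch, assuming the usual nodetool status layout where the two-letter state is the first column and the address the second; `not_un_nodes` is a hypothetical helper, and the sample rows are made up for illustration with documentation-range addresses.

```shell
# not_un_nodes: read `nodetool status` output on stdin and print
# "STATE ADDRESS" for every node whose state is not UN.
# Hypothetical helper; assumes state is column 1 and address is column 2.
not_un_nodes() {
    awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { print $1, $2 }'
}

# Made-up sample rows for illustration (not real cluster output).
not_un_nodes <<'EOF'
--  Address     Load      Tokens  Owns   Host ID  Rack
UN  192.0.2.11  1.1 GiB   16      33.3%  (id)     rack1
DN  192.0.2.12  1.0 GiB   16      33.3%  (id)     rack1
UJ  192.0.2.13  0.2 GiB   16      ?      (id)     rack1
EOF

# On a live node: nodetool status | not_un_nodes
```

Running the same filter on several coordinators and comparing the results gives a quick first look at whether they agree on cluster state.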

Metrics and signals to monitor

Correlate these Cassandra-internal signals with Netdata charts for disk I/O, network throughput, and JVM heap to distinguish topology failures from resource exhaustion.

| Signal | Why it matters | Warning sign |
|---|---|---|
| ClientRequest Unavailables (Read/Write) | Direct count of CL satisfaction failures | Any sustained rate above zero during steady state |
| FailureDetector DownEndpointCount | How many nodes are currently DOWN | DownEndpointCount > 0 sustained for more than 5 minutes |
| Gossip flapping (node state transitions) | Nodes briefly marked DOWN trigger transient unavailables | More than 3 UP/DOWN transitions in 30 minutes |
| ClientRequest Timeouts (Read/Write) | Distinguishes overload from topology loss | High timeouts with zero unavailables indicates slow replicas, not missing ones |
| Thread pool pending tasks (Read/Mutation) | Rules out internal saturation masquerading as unavailability | Sustained pending tasks correlate with timeouts, not unavailables |
| Schema versions | Persistent disagreement can indicate stuck or partitioned nodes | Multiple schema versions lasting more than 5 minutes |
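
Gossip flapping can be roughly estimated from system.log. The sketch below assumes Gossiper messages of the form "InetAddress <endpoint> is now UP" or "is now DOWN"; check that wording against the logs of your Cassandra version before relying on it. `gossip_transitions` is a hypothetical helper, and it counts every matching line it is given, so restrict the input to the time window you care about first.

```shell
# gossip_transitions: read system.log lines on stdin and print
# "<count> <endpoint>" for each endpoint, highest count first.
# Assumes log lines contain "InetAddress <endpoint> is now UP|DOWN".
gossip_transitions() {
    awk '/is now (UP|DOWN)/ {
        for (i = 1; i < NF; i++)
            if ($i == "InetAddress") { count[$(i + 1)]++; break }
    }
    END { for (ep in count) print count[ep], ep }' | sort -rn
}

# On a live node (path as used in the quick checks):
#   gossip_transitions < /var/log/cassandra/system.log
```

An endpoint whose count exceeds the flapping threshold in the table is a candidate for the GC and network checks described in the diagnosis steps.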

Fixes

Restore downed nodes

Investigate why the node is not UN and address the root cause. If the node crashed due to an OOM kill or JVM failure, restart it. If the node is recovering from a restart, wait for nodetool status to show UN on all peers before considering it available. If the node was marked DOWN due to long GC pauses, check nodetool info for heap usage and review GC logs for pause times before restarting. Restarting without reducing heap pressure or query load will likely trigger the same failure. Do not restart multiple nodes concurrently.

Repair after recovery

Once the node is back to UN, run nodetool repair on the affected keyspace or tables. If the outage was shorter than max_hint_window_in_ms (default 3 hours), hinted handoff may have preserved most writes on surviving coordinators, but repair is still required to guarantee consistency. For outages longer than the hint window, expired hints cannot recover data; repair is the only mechanism. Schedule repair during a low-traffic window; it generates significant disk and network I/O, and running it against an already stressed cluster prolongs recovery.
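
The hint-window comparison is simple arithmetic. A minimal sketch, assuming you know the outage duration in seconds; `hints_may_cover` is a hypothetical helper, and its default window is the 3-hour value mentioned above expressed in milliseconds.

```shell
# hints_may_cover OUTAGE_SECONDS [WINDOW_MS]
# Exit status 0 if the outage fits inside the hint window.
# Hypothetical helper; the default window is 3 hours (10800000 ms).
hints_may_cover() {
    outage_ms=$(( $1 * 1000 ))
    window_ms=${2:-10800000}
    [ "$outage_ms" -le "$window_ms" ]
}

if hints_may_cover 5400; then
    echo "outage within hint window: hints may help, repair still required"
else
    echo "outage exceeded hint window: repair is the only mechanism"
fi
```

Either branch ends in a repair; the check only tells you how much work the repair is likely to have to do.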

Adjust consistency level or replication factor

As an emergency measure, temporarily downgrade the application consistency level (for example, from QUORUM to ONE) to restore availability. The tradeoff is reduced durability and consistency guarantees. Long term, if your cluster uses a replication factor of 2 and requires tolerance to single-node failures, increase the replication factor to at least 3 per datacenter. A replication factor of 2 cannot tolerate any node loss for QUORUM or ALL operations.
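
To see why a replication factor of 2 gives no headroom, compare how many replica failures QUORUM tolerates at different replication factors. This is a sketch of the arithmetic only; `quorum_tolerance` is a hypothetical helper.

```shell
# quorum_tolerance RF -> number of replica failures QUORUM can absorb.
# QUORUM needs floor(RF / 2) + 1 replicas, so tolerance is RF minus that.
quorum_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for rf in 1 2 3 5; do
    echo "RF=$rf QUORUM=$(( rf / 2 + 1 )) tolerates=$(quorum_tolerance "$rf")"
done
```

RF=2 tolerates zero failures, the same as RF=1, while RF=3 tolerates one.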

Resolve network partitions

Identify and isolate the network fault (switch failure, AZ isolation, or firewall change). While the fault persists, prefer directing client traffic to the side of the partition that contains the majority of nodes, since that side is the most likely to still satisfy QUORUM. Restore connectivity, then verify all peers agree on cluster state with nodetool status. Restart isolated nodes if they do not automatically rejoin.
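
Checking that peers agree can be done by saving nodetool status output from two nodes and comparing the node states. A minimal sketch, assuming the state is the first column and the address the second; `node_states` and `views_agree` are hypothetical helpers, and the file names in the usage comment are placeholders.

```shell
# node_states FILE: reduce saved `nodetool status` output to sorted
# "ADDRESS STATE" lines. Hypothetical helper.
node_states() {
    awk '$1 ~ /^[UD][NLJM]$/ { print $2, $1 }' "$1" | sort
}

# views_agree FILE_A FILE_B: exit status 0 when both saved views report
# the same state for the same set of nodes.
views_agree() {
    [ "$(node_states "$1")" = "$(node_states "$2")" ]
}

# Usage, with output collected on two different nodes (placeholder names):
#   views_agree status-node-a.txt status-node-b.txt && echo "views agree"
```

Load and ownership columns are ignored on purpose, since they can differ slightly between views without indicating a partition.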