Cassandra UnavailableException: not enough replicas for the consistency level
Cassandra threw UnavailableException. The coordinator rejected the request immediately without contacting replicas. No mutation occurred, and retrying at the same consistency level against the same coordinator fails until enough replicas are UN in the coordinator’s gossip view. This is a topology problem, not a performance problem.
This exception is qualitatively different from TimeoutException. A timeout means enough replicas were alive but responded too slowly. An unavailable means the coordinator never sent the request because the topology could not satisfy the consistency level.
flowchart LR
A[Client request] --> B[Coordinator]
B --> C{Live replicas >= CL?}
C -->|No| D["UnavailableException<br/>fail-fast, no waiting"]
C -->|Yes| E[Send to replicas]
E -->|Slow responses| F[TimeoutException]
E -->|Fast enough| G[Success]
What this means
Before forwarding a request, the coordinator counts replicas marked UP in gossip for the partition’s token range. If the count is below the consistency level, it throws UnavailableException immediately. The exception payload includes the requested consistency level, the number required, and the number alive. For example, QUORUM with replication factor 3 requires 2 replicas; if only 1 is alive, the operation fails before any network round-trip.
Nodes marked DOWN, still joining, or not yet recognized after a restart do not count toward the consistency level. Because the request is rejected before any data is mutated, retrying at a lower consistency level or after topology recovery is safe. Retrying at the original consistency level continues to fail until missing replicas return to UN.
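The fail-fast check described above can be sketched in a few lines. This is a simplified single-datacenter model, not Cassandra's implementation: the `Unavailable` class, `required_replicas`, and `coordinator_check` are hypothetical names used only for illustration, and only a handful of consistency levels are covered.

```python
class Unavailable(Exception):
    """Illustrative stand-in for the exception payload (cl, required, alive)."""

    def __init__(self, cl: str, required: int, alive: int):
        super().__init__(f"cl={cl} required={required} alive={alive}")
        self.cl, self.required, self.alive = cl, required, alive


def required_replicas(cl: str, rf: int) -> int:
    """Replicas that must be alive for a consistency level (single-DC model)."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if cl in fixed:
        return fixed[cl]
    if cl == "QUORUM":
        return rf // 2 + 1  # majority of the replication factor
    if cl == "ALL":
        return rf
    raise ValueError(f"consistency level not modelled in this sketch: {cl}")


def coordinator_check(cl: str, rf: int, alive: int) -> int:
    """Reject before any replica is contacted if too few replicas are alive."""
    required = required_replicas(cl, rf)
    if alive < required:
        raise Unavailable(cl, required, alive)
    return required
```

With a replication factor of 3, `coordinator_check("QUORUM", 3, 2)` returns 2, while `coordinator_check("QUORUM", 3, 1)` raises with `required=2` and `alive=1`, matching the example above.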
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Node marked DOWN by gossip | nodetool status shows DN or the node is absent from the ring | nodetool status from multiple coordinators |
| Rolling restart too aggressive | Unavailables spike during deployments; nodes briefly show UJ or DN | Restart procedure: only one node at a time, wait for UN |
| Node joining or recovering | Peers do not yet count the new node as alive | nodetool gossipinfo for IS_ALIVE and STATUS |
| Replication factor too low for the consistency level | A single node failure breaks quorum; e.g. RF=2 with CL QUORUM | Keyspace replication factor against CL requirements |
| Network partition | Coordinators on different sides see different DOWN node sets | Compare nodetool status across multiple nodes |
Quick checks
Run these read-only commands to assess cluster topology and distinguish unavailability from saturation.
# Check node states across the cluster
nodetool status
# Check gossip liveness and state transitions
nodetool gossipinfo
# Verify the coordinator is accepting client connections
nodetool statusbinary
# Check for dropped messages that indicate overload, not unavailability
nodetool tpstats
# Inspect recent gossip events in system logs
grep -E "Gossiper|FailureDetector|DOWN|UP" /var/log/cassandra/system.log
# Check for long GC pauses that may have triggered false DOWN detection
grep "pause" /var/log/cassandra/gc.log | tail -20
How to diagnose it
1. Confirm the exception type. Check application or driver logs. UnavailableException means the coordinator failed fast. TimeoutException or WriteTimeoutException means replicas were alive but slow. The fix for timeout is capacity tuning; the fix for unavailable is topology recovery.
2. Count live replicas. Run nodetool status from multiple nodes. Identify any replicas not in UN (Up Normal). Only UN nodes count toward the consistency level. DN (Down Normal), UJ (Up Joining), and leaving states do not count.
3. Check the exception payload. The exception includes cl, required, and alive. Verify the math. If required is 2 and alive is 1, you are missing exactly one replica for that token range.
4. Map the missing replica to the token range. Use nodetool ring or nodetool describering <keyspace> to find which node owns the token range for the failing partition. If that node is not UN, it is the source of the unavailability.
5. Determine why the replica is not alive.
   - Check nodetool gossipinfo for the target node. Look for IS_ALIVE and STATUS. A node that recently restarted may be running but not yet fully in the ring.
   - Check GC logs on the missing node. Long pauses that exceed the phi accrual threshold (default phi_convict_threshold=8) cause the failure detector to mark the node DOWN even though the process has not crashed.
   - Check system logs for FSError, CorruptSSTableException, or JVM exit signals that indicate a crash or disk failure.
6. Check for network partitions. Compare nodetool status from nodes on different sides of the network. If coordinators disagree on which nodes are DOWN, you have a partial partition. This can cause subsets of the cluster to each believe they lack quorum.
7. Validate replication factor versus consistency level. If your keyspace uses a replication factor of 2 and the application requests QUORUM (which requires 2), a single node failure causes total unavailability for that partition. Verify the keyspace schema to ensure the replication factor provides fault tolerance for your consistency level.
8. Correlate with operational events. If the incident coincides with a deployment or restart, verify that nodes were restarted one at a time and that each node reached UN before the next restart began. Starting a second node before the first has fully joined can transiently reduce available replicas below quorum.
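Counting live replicas by eye is error-prone when you compare output from several coordinators. The sketch below extracts every node that is not UN from captured nodetool status text. It assumes each data row starts with a two-letter state code followed by the address; `non_un_nodes` and the abbreviated `SAMPLE` text are hypothetical and only illustrate the idea, so check the pattern against the output of your Cassandra version.

```python
import re

# Data rows start with a status letter (U/D) and a state letter (N/L/J/M),
# followed by whitespace and the node address.
ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def non_un_nodes(status_output: str) -> list[tuple[str, str]]:
    """Return (state, address) for every node row that is not UN."""
    found = []
    for line in status_output.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(1) != "UN":
            found.append((match.group(1), match.group(2)))
    return found


# Abbreviated, made-up sample; real output has more columns.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.1 GiB   256     ?     aaaa     rack1
DN  10.0.0.2   1.0 GiB   256     ?     bbbb     rack1
UJ  10.0.0.3   200 MiB   256     ?     cccc     rack1
"""

print(non_un_nodes(SAMPLE))
```

Running the same function on output captured from coordinators on different sides of the network and comparing the results is one way to spot the disagreement that indicates a partial partition.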
Metrics and signals to monitor
Correlate these Cassandra-internal signals with Netdata charts for disk I/O, network throughput, and JVM heap to distinguish topology failures from resource exhaustion.
| Signal | Why it matters | Warning sign |
|---|---|---|
| ClientRequest Unavailables (Read/Write) | Direct count of CL satisfaction failures | Any sustained rate above zero during steady state |
| FailureDetector DownEndpointCount | How many nodes are currently DOWN | DownEndpointCount > 0 sustained for more than 5 minutes |
| Gossip flapping (node state transitions) | Nodes briefly marked DOWN trigger transient unavailables | More than 3 UP/DOWN transitions in 30 minutes |
| ClientRequest Timeouts (Read/Write) | Distinguishes overload from topology loss | High timeouts with zero unavailables indicate slow replicas, not missing ones |
| Thread pool pending tasks (Read/Mutation) | Rules out internal saturation masquerading as unavailability | Sustained pending tasks correlate with timeouts, not unavailables |
| Schema versions | Persistent disagreement can indicate stuck or partitioned nodes | Multiple schema versions lasting more than 5 minutes |
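Two of the rules in the table lend themselves to a small sketch: separating topology loss from slow replicas using the unavailable and timeout counters, and flagging gossip flapping with the "more than 3 transitions in 30 minutes" threshold. The function names and return labels below are hypothetical, and how you collect the counter deltas and transition timestamps is left as an assumption.

```python
from datetime import datetime, timedelta


def classify(unavailables: int, timeouts: int) -> str:
    """Classify an interval from the per-interval counter deltas."""
    if unavailables > 0:
        return "topology"        # replicas missing from the coordinator's view
    if timeouts > 0:
        return "slow-replicas"   # replicas alive but responding too slowly
    return "healthy"


def is_flapping(transitions, now, window=timedelta(minutes=30), limit=3):
    """True if more than `limit` UP/DOWN transitions fall inside the window."""
    recent = [t for t in transitions if timedelta(0) <= now - t <= window]
    return len(recent) > limit


now = datetime(2024, 1, 1, 12, 0)
flaps = [now - timedelta(minutes=m) for m in (25, 20, 10, 5)]
print(classify(0, 12), is_flapping(flaps, now))
```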
Fixes
Restore downed nodes
Investigate why the node is not UN and address the root cause. If the node crashed due to an OOM kill or JVM failure, restart it. If the node is recovering from a restart, wait for nodetool status to show UN on all peers before considering it available. If the node was marked DOWN due to long GC pauses, check nodetool info for heap usage and review GC logs for pause times before restarting. Restarting without reducing heap pressure or query load will likely trigger the same failure. Do not restart multiple nodes concurrently.
Repair after recovery
Once the node is back to UN, run nodetool repair on the affected keyspace or tables. If the outage was shorter than max_hint_window_in_ms (default 3 hours), hinted handoff may have preserved most writes on surviving coordinators, but repair is still required to guarantee consistency. For outages longer than the hint window, expired hints cannot recover data; repair is the only mechanism. Schedule repair during a low-traffic window; it generates significant disk and network I/O, and running it against an already stressed cluster prolongs recovery.
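The hint-window reasoning above reduces to a single comparison. The sketch below assumes the 3-hour default mentioned in the text; `recovery_note` is a hypothetical helper, and the actual window should be read from your cluster configuration.

```python
from datetime import timedelta

# Default cited above for max_hint_window_in_ms; override if your cluster differs.
DEFAULT_HINT_WINDOW = timedelta(hours=3)


def recovery_note(outage: timedelta, hint_window: timedelta = DEFAULT_HINT_WINDOW) -> str:
    """Describe what hinted handoff can contribute after an outage."""
    if outage < hint_window:
        return "hints may cover most writes; repair is still required"
    return "hints expired; repair is the only recovery mechanism"


print(recovery_note(timedelta(hours=1)))
print(recovery_note(timedelta(hours=5)))
```

Both branches end in repair; the comparison only tells you how much data repair is likely to have to stream.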
Adjust consistency level or replication factor
As an emergency measure, temporarily downgrade the application consistency level (for example, from QUORUM to ONE) to restore availability. The tradeoff is reduced durability and consistency guarantees. Long term, if your cluster uses a replication factor of 2 and requires tolerance to single-node failures, increase the replication factor to at least 3 per datacenter. A replication factor of 2 cannot tolerate any node loss for QUORUM or ALL operations.
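The claim that a replication factor of 2 tolerates no node loss at QUORUM follows directly from the majority formula. A minimal sketch of the arithmetic, using the hypothetical helper `quorum_tolerance` and a single-datacenter model:

```python
def quorum_tolerance(rf: int) -> tuple[int, int]:
    """Return (replicas QUORUM requires, replicas that can be down)."""
    required = rf // 2 + 1
    return required, rf - required


for rf in (1, 2, 3, 5):
    required, tolerated = quorum_tolerance(rf)
    print(f"RF={rf}: QUORUM needs {required}, tolerates {tolerated} down")
```

Going from RF=2 to RF=3 leaves the QUORUM size at 2 but raises the tolerated losses from 0 to 1, which is why 3 is the smallest replication factor that survives a single node failure at QUORUM.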
Resolve network partitions
Identify and isolate the network fault (switch failure, AZ isolation, or firewall change). While the partition persists, prefer routing client traffic to the side that contains the majority of nodes, because that side is more likely to hold a quorum of replicas. Restore connectivity, then verify that all peers agree on cluster state by comparing nodetool status across nodes. Restart isolated nodes if they do not automatically rejoin.







