Cassandra ReadTimeoutException: diagnosing coordinator read timeouts

The driver throws ReadTimeoutException when the coordinator fails to gather enough replica responses within read_request_timeout_in_ms (default 5000 ms). This is a server-side timeout. It maps directly to the JMX metric ClientRequest,scope=Read,name=Timeouts.

This is not OperationTimedOutException, which fires on the driver’s socket timeout. It is also not UnavailableException, which means not enough replicas were alive to attempt the read. Here, replicas are alive but too slow. The read may have executed partially on some replicas, yet the coordinator could not assemble a response that met the consistency level within the window.

A timeout does not mean data loss, but the client received no results. The root cause usually lives on the replica nodes, not the coordinator.

What this means

On every read, the coordinator hashes the partition key, identifies the replica nodes that own the token range, and dispatches requests to enough replicas to satisfy the consistency level. If too few of those replicas respond in time, the coordinator waits until read_request_timeout_in_ms expires, then returns ReadTimeoutException to the client. A single slow replica is enough to cause this when its response is required to reach the consistency level.
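The response-count arithmetic behind this can be sketched as follows. This is a simplified model, not Cassandra's coordinator code: the names ConsistencyLevel, required_responses, and read_times_out are hypothetical, the model assumes the coordinator contacted exactly the replicas whose latencies are passed in, and it ignores speculative retry and the datacenter-aware consistency levels.

```python
from enum import Enum


class ConsistencyLevel(Enum):
    ONE = "ONE"
    TWO = "TWO"
    THREE = "THREE"
    QUORUM = "QUORUM"
    ALL = "ALL"


def required_responses(cl, replication_factor):
    """Number of replica responses needed before the read can be answered."""
    if cl is ConsistencyLevel.QUORUM:
        return replication_factor // 2 + 1
    if cl is ConsistencyLevel.ALL:
        return replication_factor
    fixed = {
        ConsistencyLevel.ONE: 1,
        ConsistencyLevel.TWO: 2,
        ConsistencyLevel.THREE: 3,
    }
    return fixed[cl]


def read_times_out(replica_latencies_ms, cl, replication_factor, timeout_ms=5000):
    """True if fewer than the required replicas answer within the window.

    Assumption: every latency in the list belongs to a replica the
    coordinator actually contacted for this read.
    """
    needed = required_responses(cl, replication_factor)
    answered_in_time = sum(1 for ms in replica_latencies_ms if ms <= timeout_ms)
    return answered_in_time < needed
```

With a replication factor of 3, one stalled replica is tolerated at QUORUM but not at ALL, which is why the same slow node can cause timeouts for some queries and not others.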

Common replica-side delays include long JVM GC pauses, disk I/O saturation, compaction backlog forcing reads to touch many SSTables, tombstone-heavy partitions burning CPU and I/O scanning dead data, and large partitions overwhelming the merge path. Cross-datacenter network latency can also contribute.

Because the timeout is server-side, tuning the client driver socket timeout will not fix it. Find the slow replica or expensive data access pattern and fix the resource contention behind it.

flowchart TD
    A[ReadTimeoutException] --> B{Node flapping?}
    B -->|Yes| C[GC pause death spiral]
    B -->|No| D{READ stage pending?}
    D -->|Yes| E[Replica saturation]
    D -->|No| F{Compaction growing?}
    F -->|Yes| G[Compaction backlog]
    F -->|No| H{Tombstone warnings?}
    H -->|Yes| I[Tombstone storm]
    H -->|No| J[Disk I/O or cross-node latency]
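The decision flow above can also be written as a small function, which may be convenient in a runbook script. This is only a sketch: the function name triage and its boolean parameters are hypothetical, and gathering the inputs (from nodetool status, nodetool tpstats, nodetool compactionstats, and the system log) is left to the caller.

```python
def triage(node_flapping, read_stage_pending, compaction_growing, tombstone_warnings):
    """Walk the flowchart top to bottom and return the first matching diagnosis."""
    if node_flapping:
        return "GC pause death spiral"
    if read_stage_pending:
        return "Replica saturation"
    if compaction_growing:
        return "Compaction backlog"
    if tombstone_warnings:
        return "Tombstone storm"
    return "Disk I/O or cross-node latency"


# Example: nodes are stable, ReadStage is empty, but pending compactions keep growing.
diagnosis = triage(
    node_flapping=False,
    read_stage_pending=False,
    compaction_growing=True,
    tombstone_warnings=True,
)
```

The checks run in the flowchart's order, so an earlier condition masks later ones: in the example, the compaction backlog is reported even though tombstone warnings are also present.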

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| GC pause on replica | Node flapping UP/DOWN in gossip; GCInspector warnings in logs | nodetool gcstats or GC logs |
| Compaction backlog | P50 stable but P99 climbing; SSTable count rising | nodetool compactionstats |
| Tombstone-heavy read | Log warnings about scanned tombstones; latency spikes on one table | nodetool tablehistograms for tombstone cells |
| Large partition | Extreme P999 latency on a specific table; possible GC pressure | nodetool tablehistograms for partition size |
| Disk I/O saturation | Reads and compaction compete for the same device | iostat -x 1 |
| Thread pool saturation | Replicas cannot keep up with read volume | nodetool tpstats |

Quick checks

Run these safe, read-only checks to narrow the scope before making changes.

# Check node liveness and schema agreement
nodetool status
nodetool describecluster

# Check for dropped read messages and thread pool backpressure
nodetool tpstats

# Check coordinator and local read latency percentiles
nodetool proxyhistograms

# Check compaction debt and active compactions
nodetool compactionstats

# Check per-table SSTable count and space used
nodetool tablestats <keyspace> | grep -E "SSTable count|Space used"

# Check per-table latency and tombstone distributions
nodetool tablehistograms <keyspace> <table>

# Check disk I/O saturation on data and commitlog devices
iostat -x 1

# Check GC pause duration and frequency
nodetool gcstats
grep -i "pause" /var/log/cassandra/gc.log | tail -20

How to diagnose it

  1. Confirm it is a server-side timeout. If the driver throws ReadTimeoutException, the coordinator timed out waiting for replicas. If it throws OperationTimedOutException, the client socket expired first. The fixes are different.

  2. Check cluster topology. Run nodetool status. If nodes are DOWN or flapping between UP and DOWN, the timeout may stem from missing replicas or GC pauses long enough to trigger gossip failure detection. Flapping nodes cannot satisfy reads for their token ranges reliably.

  3. Inspect garbage collection on all replicas. Use nodetool gcstats and parse GC logs. Pauses above 500 ms degrade latency. Pauses above 2000 ms often trigger gossip failure detection, which leads to flapping and cascading timeouts. If old-generation pauses are increasing and heap after full GC is above 75% of max, the node is under GC pressure.

  4. Check for replica overload. nodetool tpstats shows Active, Pending, and Blocked counts for each thread pool. Sustained Pending > 0 in ReadStage means reads are queuing locally on replicas. If Blocked is increasing, requests are encountering backpressure.

  5. Measure compaction debt. nodetool compactionstats shows pending tasks. If the pending count is trending upward over hours and LiveSSTableCount is growing, every read must consult more SSTables. This read amplification directly increases replica response time.

  6. Identify table-level offenders. Run nodetool tablehistograms <keyspace> <table> on the affected tables to get local read latency on each replica, and compare it against the coordinator read latency reported by nodetool proxyhistograms. If local latency is high on one node but normal on others, that node has a resource problem. If coordinator latency is high across the cluster, the data model or compaction strategy is the likely culprit. Look at the tombstone cell count and partition size columns.

  7. Check disk I/O. Run iostat -x 1 on the data and commitlog devices. On SSDs, await above 10 ms or %util above 80% sustained indicates saturation. On NVMe, prioritize queue depth and latency over %util. If commitlog and data share a device, write-path flushes and compaction reads will starve point lookups.

  8. Review system logs for tombstones. Search for Scanned over .* tombstones in /var/log/cassandra/system.log. Sustained warnings mean reads are doing enormous amounts of wasted work scanning delete markers. At the default tombstone_failure_threshold of 100000, Cassandra aborts the query entirely.
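The numeric thresholds from steps 3, 4, and 7 can be collected into one check per replica. The sketch below is illustrative only: ReplicaSnapshot and findings are hypothetical names, the values are assumed to have been gathered by hand or by your own tooling from nodetool gcstats, nodetool info, nodetool tpstats, and iostat, and the disk thresholds are the SSD figures given above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaSnapshot:
    max_gc_pause_ms: float
    heap_after_full_gc_ratio: float  # used heap / max heap, between 0.0 and 1.0
    read_stage_pending: int
    disk_await_ms: float
    disk_util_pct: float


def findings(s: ReplicaSnapshot) -> List[str]:
    """Return the diagnosis thresholds that this replica currently exceeds."""
    out = []
    if s.max_gc_pause_ms > 2000:
        out.append("GC pause long enough to risk gossip failure detection")
    elif s.max_gc_pause_ms > 500:
        out.append("GC pause degrading latency")
    if s.heap_after_full_gc_ratio > 0.75:
        out.append("heap pressure after full GC")
    if s.read_stage_pending > 0:
        out.append("reads queuing in ReadStage")
    if s.disk_await_ms > 10 or s.disk_util_pct > 80:
        out.append("disk saturation (SSD thresholds)")
    return out
```

Running the same check against every replica for the affected token range makes it easier to see whether one node stands out or the whole cluster is degraded.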

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| ClientRequest Read Timeouts | Direct count of coordinator read timeouts | Rate sustained above zero for > 60 seconds |
| DroppedMessage READ | Replicas discarding reads that sat in queue past timeout | Non-zero rate in nodetool tpstats |
| GC pause duration | Stop-the-world pauses freeze all replica threads | Max pause > 500 ms |
| Compaction PendingTasks | Growing read amplification as SSTables accumulate | Trending upward over 4+ hours |
| LiveSSTableCount per table | Each read may need to check more files | > 50 under STCS, or rising trend under any strategy |
| TombstoneScannedHistogram | Dead cells force reads to scan and discard garbage | Sustained log warnings or high histogram tail |
| ReadStage PendingTasks | Read requests queuing on the replica | Pending > 0 sustained > 60 seconds |
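Several of these warning signs are "sustained" conditions rather than single-sample thresholds. One way to express that, assuming you already have timestamped samples of a metric, is sketched below. The function name sustained_above and the (timestamp, value) sample shape are assumptions for illustration, not part of any Cassandra or Netdata API.

```python
def sustained_above(samples, threshold, min_duration_s):
    """Return True if the value stays above threshold for at least min_duration_s.

    samples: iterable of (timestamp_seconds, value) pairs, sorted by time.
    A single sample at or below the threshold resets the run.
    """
    run_start = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= min_duration_s:
                return True
        else:
            run_start = None
    return False


# ReadStage pending tasks sampled every 15 seconds.
pending = [(0, 1), (15, 2), (30, 1), (45, 3), (60, 1)]
alert = sustained_above(pending, threshold=0, min_duration_s=60)
```

Resetting on any sample at or below the threshold keeps brief bursts from alerting, at the cost of missing a problem that dips to zero for one sample.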

Fixes

If GC pauses are the trigger

Review heap usage with nodetool info. If heap after full GC stays above 75% of max, reduce pressure by disabling or shrinking the row cache, lowering the number of in-flight requests, or reducing batch sizes. If a specific query is reading huge partitions, identify it with nodetool toppartitions and fix the data model or add application-level caching.

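Warning: as an emergency measure on a spiraling node, you can run nodetool disablebinary to stop the node from accepting client connections and let the JVM stabilize. The node keeps serving replica requests from other coordinators while clients fail over to other nodes. Re-enable client traffic with nodetool enablebinary once heap usage recovers.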

If compaction backlog is the trigger

Temporarily increase compaction_throughput_mb_per_sec with nodetool setcompactionthroughput to let compaction catch up. Tradeoff: this consumes more disk I/O and may temporarily worsen read latency. Verify that disk space headroom exists. STCS can transiently need up to 100% additional space during a major compaction. If SSTable count is structurally high, plan a compaction strategy migration during a maintenance window.
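The disk headroom check is simple arithmetic. As a rough sketch, assuming you know the free space on the data volume and the on-disk size of the SSTables about to be compacted (the function name has_compaction_headroom and the extra_fraction parameter are hypothetical):

```python
def has_compaction_headroom(free_bytes, sstable_bytes, extra_fraction=1.0):
    """True if free space covers the transient copy written by compaction.

    extra_fraction=1.0 models the worst case named above, where STCS needs
    up to 100% additional space for the SSTables being compacted.
    """
    return free_bytes >= sstable_bytes * extra_fraction


GIB = 1024 ** 3
# 400 GiB of SSTables with only 300 GiB free: not safe for a major compaction.
safe = has_compaction_headroom(free_bytes=300 * GIB, sstable_bytes=400 * GIB)
```

Run the check before raising compaction throughput, since faster compaction also consumes the transient space sooner.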

If tombstones or large partitions are the trigger

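Ensure repair has completed within gc_grace_seconds for the affected table. Compaction cannot purge a tombstone until gc_grace_seconds has elapsed since the delete, and repair must propagate the delete to every replica before that point or deleted data can reappear. Once repair is current, run targeted compaction with nodetool compact <keyspace> <table>.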

Warning: this compaction is I/O-intensive and will spike disk utilization.

Fix the data model to avoid unbounded partition growth. For TTL-dominated workloads, migrate to TimeWindowCompactionStrategy.
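The timing rule relating repair and gc_grace_seconds can be sketched as two checks. This is a simplification: the function names and the safety_margin parameter are hypothetical, and real tombstone purging also depends on whether other SSTables still hold overlapping data. The 864000-second value is Cassandra's default gc_grace_seconds (10 days).

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra default: 10 days


def tombstone_past_gc_grace(deletion_time_s, now_s,
                            gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once the tombstone is old enough for compaction to consider purging it."""
    return now_s - deletion_time_s > gc_grace_seconds


def repair_is_overdue(seconds_since_last_repair,
                      gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS,
                      safety_margin=0.8):
    """True if the last completed repair is too old to finish another in time.

    safety_margin is an assumed buffer: flag the table once 80% of the
    gc_grace_seconds window has passed without a completed repair.
    """
    return seconds_since_last_repair > gc_grace_seconds * safety_margin
```

A table that fails the second check should be repaired before any compaction aimed at dropping tombstones.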

If disk I/O is the trigger

Separate commitlog and data directories onto different devices if they currently share one. Throttle or reschedule background repair and streaming to reduce contention. If the storage layer is undersized, add IOPS or migrate to SSD. Cassandra is fundamentally I/O-bound; spinning disks often cannot keep up with the combined load of compaction, flushes, and reads.

Prevention

  • Monitor compaction as a trend, not a number. Alert when PendingTasks is increasing over a 24-hour window, not just when it crosses an absolute threshold.
  • Alert on GC pause duration and heap after full GC. These move minutes or hours before client timeouts appear.
  • Monitor per-table SSTable count and tombstone scan histograms. Catching these early prevents the sudden P99 cliff.
  • Run repair on a schedule that completes well within gc_grace_seconds. Unrepaired tombstones accumulate silently until they destroy read performance.
  • Keep commitlog and data on isolated storage devices. This prevents write-path flushes from starving read I/O.
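For the first bullet, alerting on a trend rather than an absolute value, a least-squares slope over the last 24 hours of samples is one possible approach. The sketch below assumes hourly (timestamp, pending_tasks) samples collected by your monitoring; the names pending_tasks_slope and backlog_is_growing, and the default threshold, are illustrative assumptions.

```python
def pending_tasks_slope(samples):
    """Least-squares slope of (timestamp_seconds, pending_tasks) pairs, in tasks per hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return 0.0
    return cov / var * 3600


def backlog_is_growing(samples, min_slope_per_hour=0.5):
    """True if pending compactions rise faster than the assumed threshold."""
    return pending_tasks_slope(samples) > min_slope_per_hour


# 24 hours of hourly samples, rising by two pending tasks per hour.
day = [(hour * 3600, 10 + 2 * hour) for hour in range(25)]
growing = backlog_is_growing(day)
```

A slope-based alert fires on a steadily growing backlog even while the absolute count is still small, which is the situation a fixed threshold misses.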

How Netdata helps

Netdata collects ClientRequest Read Timeouts, GC pause duration, compaction pending tasks, and disk I/O latency on the same time axis. Use these to identify which replica resource triggered the timeout. Per-table SSTable counts and JVM heap usage trends are available without manual JMX polling. Alert on sustained thread pool pending tasks and dropped message rates; they fire before client timeouts become visible. Off-heap memory and process RSS metrics catch pressure that JVM heap metrics alone miss.