Cassandra ReadTimeoutException: diagnosing coordinator read timeouts
The driver throws ReadTimeoutException when the coordinator fails to gather enough replica responses within read_request_timeout_in_ms (default 5000 ms). This is a server-side timeout. It maps directly to the JMX metric ClientRequest,scope=Read,name=Timeouts.
This is not OperationTimedOutException, which fires on the driver’s socket timeout. It is also not UnavailableException, which means not enough replicas were alive to attempt the read. Here, replicas are alive but too slow. The read may have executed partially on some replicas, yet the coordinator could not assemble a response that met the consistency level within the window.
A read timeout does not mean data loss, but the client received no results. The root cause usually lives on the replica nodes, not the coordinator.
What this means
On every read, the coordinator hashes the partition key, identifies the replica nodes that own the token range, and dispatches requests to enough replicas to satisfy the consistency level. If too few of those replicas answer in time, the coordinator waits until read_request_timeout_in_ms expires, then returns ReadTimeoutException to the client.
Common replica-side delays include long JVM GC pauses, disk I/O saturation, compaction backlog forcing reads to touch many SSTables, tombstone-heavy partitions burning CPU and I/O scanning dead data, and large partitions overwhelming the merge path. Cross-datacenter network latency can also contribute.
Because the timeout is server-side, tuning the client driver socket timeout will not fix it. Find the slow replica or expensive data access pattern and fix the resource contention behind it.
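Since the timeout is enforced on the server, it can help to confirm what each node is actually configured with. The following is a minimal sketch, not an official tool: the function name configured_read_timeout is hypothetical, and the file location is an assumption that varies by install. Newer releases may spell the key as read_request_timeout with a unit suffix, so the pattern accepts both forms.

```shell
# Print the read timeout setting from a cassandra.yaml file.
# Accepts both read_request_timeout_in_ms and read_request_timeout.
configured_read_timeout() {
  # $1: path to cassandra.yaml (assumed location; it differs between installs)
  grep -E '^read_request_timeout(_in_ms)?:' "$1"
}
```

Typical use would be configured_read_timeout /etc/cassandra/cassandra.yaml on each node. No output means the key is commented out and the built-in default applies.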
flowchart TD
A[ReadTimeoutException] --> B{Node flapping?}
B -->|Yes| C[GC pause death spiral]
B -->|No| D{READ stage pending?}
D -->|Yes| E[Replica saturation]
D -->|No| F{Compaction growing?}
F -->|Yes| G[Compaction backlog]
F -->|No| H{Tombstone warnings?}
H -->|Yes| I[Tombstone storm]
H -->|No| J[Disk I/O or cross-node latency]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| GC pause on replica | Node flapping UP/DOWN in gossip; GCInspector warnings in logs | nodetool gcstats or GC logs |
| Compaction backlog | P50 stable but P99 climbing; SSTable count rising | nodetool compactionstats |
| Tombstone-heavy read | Log warnings about scanned tombstones; latency spikes on one table | nodetool tablestats for tombstones per slice |
| Large partition | Extreme P999 latency on a specific table; possible GC pressure | nodetool tablehistograms for partition size |
| Disk I/O saturation | Reads and compaction compete for the same device | iostat -x 1 |
| Thread pool saturation | Replicas cannot keep up with read volume | nodetool tpstats |
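The decision flow earlier in this guide can be written down as a small triage helper for runbooks. This is only a sketch: the function name triage and its yes/no arguments are hypothetical, and the answers still have to come from the checks described in this guide.

```shell
# triage NODE_FLAPPING READ_PENDING COMPACTION_GROWING TOMBSTONE_WARNINGS
# Each argument is "yes" or "no". Prints the first matching branch of the flowchart.
triage() {
  if [ "$1" = yes ]; then
    echo "GC pause death spiral"
  elif [ "$2" = yes ]; then
    echo "Replica saturation"
  elif [ "$3" = yes ]; then
    echo "Compaction backlog"
  elif [ "$4" = yes ]; then
    echo "Tombstone storm"
  else
    echo "Disk I/O or cross-node latency"
  fi
}
```

The order of the checks matters: a flapping node is evaluated first because it can produce every other symptom as a side effect.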
Quick checks
Run these safe, read-only checks to narrow the scope before making changes.
# Check node liveness and schema agreement
nodetool status
nodetool describecluster
# Check for dropped read messages and thread pool backpressure
nodetool tpstats
# Check coordinator and local read latency percentiles
nodetool proxyhistograms
# Check compaction debt and active compactions
nodetool compactionstats
# Check per-table SSTable count and space used
nodetool tablestats <keyspace> | grep -E "SSTable count|Space used"
# Check per-table latency, SSTables per read, and partition size distributions
nodetool tablehistograms <keyspace> <table>
# Check disk I/O saturation on data and commitlog devices
iostat -x 1
# Check GC pause duration and frequency
nodetool gcstats
grep -i "pause" /var/log/cassandra/gc.log | tail -20
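When running these checks across many nodes, it is easier to reduce the tpstats output to the two numbers that matter for reads. A minimal sketch, assuming the usual column order (Pool Name, Active, Pending, Completed, Blocked, All time blocked); the function name read_stage_backlog is hypothetical.

```shell
# Read `nodetool tpstats` output on stdin and print the ReadStage backlog.
# Assumed columns: Pool Name, Active, Pending, Completed, Blocked, All time blocked.
read_stage_backlog() {
  awk '$1 == "ReadStage" { print "pending=" $3 " blocked=" $5 }'
}
```

Use it as nodetool tpstats | read_stage_backlog. If your version prints the columns in a different order, adjust the field numbers.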
How to diagnose it
1. Confirm it is a server-side timeout. If the driver throws ReadTimeoutException, the coordinator timed out waiting for replicas. If it throws OperationTimedOutException, the client socket expired first. The fixes are different.
2. Check cluster topology. Run nodetool status. If nodes are DOWN or flapping between UP and DOWN, the timeout may stem from missing replicas or GC pauses long enough to trigger gossip failure detection. Flapping nodes cannot satisfy reads for their token ranges reliably.
3. Inspect garbage collection on all replicas. Use nodetool gcstats and parse GC logs. Pauses above 500 ms degrade latency. Pauses above 2000 ms often trigger gossip failure, which leads to flapping and cascading timeouts. If old-generation pauses are increasing and heap after full GC is above 75% of max, the node is under GC pressure.
4. Check for replica overload. nodetool tpstats shows Active, Pending, and Blocked counts for each thread pool. Sustained Pending > 0 in ReadStage means reads are queuing locally on replicas. If Blocked is increasing, requests are encountering backpressure.
5. Measure compaction debt. nodetool compactionstats shows pending tasks. If the pending count is trending upward over hours and LiveSSTableCount is growing, every read must consult more SSTables. This read amplification directly increases replica response time.
6. Identify table-level offenders. Run nodetool tablehistograms <keyspace> <table> on the affected tables. Compare coordinator read latency (from nodetool proxyhistograms) against local read latency. If local latency is high on one node but normal on others, that node has a resource problem. If coordinator latency is high across the cluster, the data model or compaction strategy is the likely culprit. Look at the partition size and cell count columns.
7. Check disk I/O. Run iostat -x 1 on the data and commitlog devices. On SSDs, await above 10 ms or %util above 80% sustained indicates saturation. On NVMe, prioritize queue depth and latency over %util. If commitlog and data share a device, write-path flushes and compaction reads will starve point lookups.
8. Review system logs for tombstones. Search for Scanned over .* tombstones in /var/log/cassandra/system.log. Sustained warnings mean reads are doing enormous amounts of wasted work scanning delete markers. At the default tombstone_failure_threshold of 100000, Cassandra aborts the query entirely.
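The GC thresholds from the diagnosis steps can be turned into a simple classifier for use in scripts. This is a sketch under the assumptions stated in this guide (500 ms and 2000 ms); the function name classify_gc_pause and the labels it prints are hypothetical.

```shell
# classify_gc_pause MILLISECONDS
# Takes a whole number of milliseconds and applies the 500 ms / 2000 ms thresholds.
classify_gc_pause() {
  if [ "$1" -gt 2000 ]; then
    echo "gossip-risk"          # long enough to risk gossip failure detection
  elif [ "$1" -gt 500 ]; then
    echo "latency-degrading"    # hurts read latency but the node likely stays UP
  else
    echo "ok"
  fi
}
```

Feed it the longest pause you read from nodetool gcstats or the GC log, rounded to a whole number of milliseconds.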
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ClientRequest Read Timeouts | Direct count of coordinator read timeouts | Rate sustained above zero for > 60 seconds |
| DroppedMessage READ | Replicas discarding reads that sat in queue past timeout | Non-zero rate in nodetool tpstats |
| GC pause duration | Stop-the-world pauses freeze all replica threads | Max pause > 500 ms |
| Compaction PendingTasks | Growing read amplification as SSTables accumulate | Trending upward over 4+ hours |
| LiveSSTableCount per table | Each read may need to check more files | > 50 under STCS, or rising trend under any strategy |
| TombstoneScannedHistogram | Dead cells force reads to scan and discard garbage | Sustained log warnings or high histogram tail |
| ReadStage PendingTasks | Read requests queuing on the replica | Pending > 0 sustained > 60 seconds |
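Several warning signs in the table depend on a value staying above zero for a period, not on a single spike. One way to express that, as a sketch: sample the value at a fixed interval, one number per line, and check whether the most recent samples are all above zero. The function name sustained_nonzero is hypothetical, and the sampling interval is your choice (with 10-second samples, 6 consecutive samples cover 60 seconds).

```shell
# sustained_nonzero N
# Reads one numeric sample per line on stdin (oldest first).
# Prints "sustained" if the last N samples are all above zero, otherwise "ok".
sustained_nonzero() {
  awk -v n="$1" '
    { run = ($1 > 0) ? run + 1 : 0 }   # length of the current streak above zero
    END { if (run >= n) print "sustained"; else print "ok" }
  '
}
```

A single zero sample resets the streak, which keeps short bursts from triggering an alert.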
Fixes
If GC pauses are the trigger
Review heap usage with nodetool info. If heap after full GC stays above 75% of max, reduce pressure by disabling or shrinking the row cache, lowering the number of in-flight requests, or reducing batch sizes. If a specific query is reading huge partitions, identify it with nodetool toppartitions and fix the data model or add application-level caching.
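Warning: as an emergency measure on a spiraling node, you can run nodetool disablebinary to stop the node from serving client (native transport) connections and let the JVM stabilize. The node still answers replica reads sent by other coordinators. Run nodetool enablebinary once it recovers.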
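To compare heap usage against the 75% figure above, the heap line from nodetool info can be reduced to a percentage. A minimal sketch, assuming the line has the form "Heap Memory (MB) : used / max"; the function name heap_used_percent is hypothetical. Note that nodetool info reports current heap use, which is only a rough proxy for heap after a full GC.

```shell
# Read `nodetool info` output on stdin and print used heap as a whole-number percentage.
# Assumes a line of the form: Heap Memory (MB) : <used> / <max>
heap_used_percent() {
  awk -F'[:/]' '/^Heap Memory/ && $3 + 0 > 0 { printf "%d\n", $2 / $3 * 100 }'
}
```

Use it as nodetool info | heap_used_percent and sample it several times; a single reading taken just before a collection will look worse than the node really is.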
If compaction backlog is the trigger
Temporarily increase compaction_throughput_mb_per_sec with nodetool setcompactionthroughput to let compaction catch up. Tradeoff: this consumes more disk I/O and may temporarily worsen read latency. Verify that disk space headroom exists. STCS can transiently need up to 100% additional space during a major compaction. If SSTable count is structurally high, plan a compaction strategy migration during a maintenance window.
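To see whether compaction is actually catching up after a throughput change, record the pending count at intervals. A minimal sketch that extracts it, assuming the output contains a line starting with "pending tasks:"; the function name compaction_pending is hypothetical.

```shell
# Read `nodetool compactionstats` output on stdin and print the total pending task count.
# Assumes a summary line of the form: pending tasks: <number>
compaction_pending() {
  awk -F': ' '/^pending tasks/ { print $2 + 0 }'
}
```

Use it as nodetool compactionstats | compaction_pending and log the value with a timestamp. The trend over hours matters more than any single reading.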
If tombstones or large partitions are the trigger
Ensure repair completes within gc_grace_seconds for the affected table, so that every replica has seen the delete before its tombstone expires. Compaction can purge a tombstone only after gc_grace_seconds has elapsed. Run targeted compaction with nodetool compact <keyspace> <table>.
Warning: this is I/O-intensive and will spike disk utilization. Fix the data model to avoid unbounded partition growth. For TTL-dominated workloads, migrate to TimeWindowCompactionStrategy.
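To judge whether a fix is working, count the tombstone warnings in the log before and after the change. A minimal sketch using the same search pattern as the diagnosis steps; the function name count_tombstone_warnings is hypothetical, and the exact warning text may differ between Cassandra versions.

```shell
# count_tombstone_warnings LOGFILE
# Prints the number of log lines that match the tombstone scan warning pattern.
count_tombstone_warnings() {
  # $1: path to the Cassandra system log
  grep -c 'Scanned over .* tombstones' "$1"
}
```

Use it as count_tombstone_warnings /var/log/cassandra/system.log. If the count stays at zero on a table you know is tombstone-heavy, check the wording of the warning in your version before trusting the result.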
If disk I/O is the trigger
Separate commitlog and data directories onto different devices if they currently share one. Throttle or reschedule background repair and streaming to reduce contention. If the storage layer is undersized, add IOPS or migrate to SSD. Cassandra is fundamentally I/O-bound; spinning disks often cannot keep up with the combined load of compaction, flushes, and reads.
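To find which device is saturated, the iostat output can be filtered against the 80% utilization figure used in this guide. This is a sketch: the function name saturated_devices is hypothetical, and it locates the %util column from the header line because column positions differ between sysstat versions.

```shell
# Read `iostat -x` output on stdin and print devices whose %util exceeds 80.
# The %util column index is taken from the header row that starts with "Device".
saturated_devices() {
  awk '
    $1 == "Device" || $1 == "Device:" {
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 > 80 { print $1, $col }
  '
}
```

Use it as iostat -x 1 5 | saturated_devices. A device that appears in every sample is saturated; one that appears once is probably absorbing a flush or a compaction burst.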
Prevention
- Monitor compaction as a trend, not a number. Alert when PendingTasks is increasing over a 24-hour window, not just when it crosses an absolute threshold.
- Alert on GC pause duration and heap after full GC. These move minutes or hours before client timeouts appear.
- Monitor per-table SSTable count and tombstone scan histograms. Catching these early prevents the sudden P99 cliff.
- Run repair on a schedule that completes well within gc_grace_seconds. Unrepaired tombstones accumulate silently until they destroy read performance.
- Keep commitlog and data on isolated storage devices. This prevents write-path flushes from starving read I/O.
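The repair rule above is simple arithmetic, so it is easy to check in a deployment script. A minimal sketch: the function name repair_fits_gc_grace is hypothetical, and it only checks that the repair interval is shorter than gc_grace_seconds, so leave your own margin for repairs that run long. The usual default for gc_grace_seconds is 864000 (10 days).

```shell
# repair_fits_gc_grace REPAIR_INTERVAL_DAYS GC_GRACE_SECONDS
# Prints "ok" if a full repair cycle is shorter than gc_grace_seconds, else "at-risk".
repair_fits_gc_grace() {
  interval_s=$(( $1 * 86400 ))   # convert days to seconds
  if [ "$interval_s" -lt "$2" ]; then
    echo "ok"
  else
    echo "at-risk"
  fi
}
```

For example, a weekly repair against the default is 604800 seconds against 864000, which fits. A 14-day cycle does not.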
How Netdata helps
Netdata collects ClientRequest Read Timeouts, GC pause duration, compaction pending tasks, and disk I/O latency on the same time axis. Use these to identify which replica resource triggered the timeout. Per-table SSTable counts and JVM heap usage trends are available without manual JMX polling. Alert on sustained thread pool pending tasks and dropped message rates; they fire before client timeouts become visible. Off-heap memory and process RSS metrics catch pressure that JVM heap metrics alone miss.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert
- Cassandra java.lang.OutOfMemoryError: Java heap space - causes and recovery
- Cassandra pending compactions growing: the compaction backlog runbook
- Cassandra Scanned over N tombstones warning: finding the offending query







