Cassandra read latency spikes: P99 vs P50 and proxyhistograms
Your application is timing out on Cassandra reads, but a quick glance at average latency looks acceptable. This is the percentile trap. In Cassandra, read latency follows a long-tail distribution: most requests are fast, but a fraction hit slow replicas, large partitions, or GC pauses and become orders of magnitude slower. The critical first step is knowing whether the spike lives at the coordinator or the replica, and whether it is systemic or isolated to the tail. nodetool proxyhistograms and nodetool tablehistograms answer exactly that.
When P50 is stable but P99 spikes, the cluster is not uniformly overloaded. A subset of requests is slow, which points to specific partitions, specific nodes, or transient JVM events. When both rise together, you are looking at systemic saturation. Do not trust average latency: it folds the fast majority and the slow tail into one number that describes neither. A handful of large-partition reads can nudge the average only slightly while nearly all requests remain fast, which masks the tail that is actually killing your clients.
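To make the percentile trap concrete, here is a minimal sketch that compares the mean with nearest-rank percentiles. The sample is synthetic (98 reads at 2 ms and 2 reads at 500 ms), and the function names `percentile`, `mean`, and `sample_latencies` are illustrative helpers, not part of any Cassandra tooling.

```shell
# Nearest-rank percentile of newline-separated numbers on stdin.
# $1 is an integer percentile such as 50 or 99.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Arithmetic mean of the same input, one decimal place.
mean() {
  awk '{ s += $1 } END { if (NR) printf "%.1f\n", s / NR }'
}

# Synthetic sample, illustrative only: 98 reads at 2 ms, 2 reads at 500 ms.
sample_latencies() {
  awk 'BEGIN { for (i = 1; i <= 98; i++) print 2; print 500; print 500 }'
}
```

Running `sample_latencies | mean` gives roughly 12 ms, while `sample_latencies | percentile 50` gives 2 and `sample_latencies | percentile 99` gives 500. The mean sits well above what most requests see and far below what the slowest requests see, so it describes neither group.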
What this means
nodetool proxyhistograms reports coordinator-level latency. It measures from the moment the coordinator receives the CQL read until it returns a response to the client. This includes network round-trips to replicas, replica-side processing, and result merging. It is the latency your application actually feels.
nodetool tablehistograms <keyspace> <table> reports local replica latency. It isolates disk I/O, bloom filter checks, SSTable merges, and tombstone application on the local node. It excludes the network hop.
If proxyhistograms is elevated but tablehistograms is low on the same node, the bottleneck is not local disk. It is either a slow remote replica, network degradation, or coordinator-side congestion. If both are elevated, the replica itself is struggling.
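The comparison above can be sketched as a small classifier. This is an illustration under stated assumptions, not a definitive rule: "elevated" is taken to mean more than a multiple of a baseline you supply, the default multiple of 3 is an arbitrary choice, and the function name and output labels are hypothetical. A slow local replica is reported as a local issue even when proxy latency still looks normal, since faster peers can mask it.

```shell
# classify_read_latency PROXY_P99 LOCAL_P99 PROXY_BASE LOCAL_BASE [FACTOR]
# All values use the same unit (nodetool prints microseconds).
# FACTOR is an assumed multiplier that defines "elevated"; it defaults to 3.
classify_read_latency() {
  awk -v proxy="$1" -v loc="$2" -v pbase="$3" -v lbase="$4" -v k="${5:-3}" 'BEGIN {
    proxy_high = (proxy + 0 > k * pbase)
    local_high = (loc + 0 > k * lbase)
    if (local_high)       print "local-replica"
    else if (proxy_high)  print "remote-replica-or-network"
    else                  print "within-baseline"
  }'
}
```

For example, `classify_read_latency 45000 900 5000 800` prints `remote-replica-or-network`, because coordinator latency is nine times its baseline while local latency is close to normal.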
flowchart TD
A[P99 read latency spikes] --> B{P50 stable?}
B -->|Yes| C[Tail-only problem]
B -->|No| D[Systemic slowdown]
C --> E{Proxy high, local low?}
E -->|Yes| F[Network or slow replica]
E -->|No| G[Local replica issue]
G --> H{Correlate with}
H -->|GC pauses| I[JVM pressure]
H -->|SSTables/Tombstones| J[Read amplification]
D --> K[Check disk I/O and saturation]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Hot or large partition | P99 spikes, P50 flat; spikes correlate with specific partition keys | nodetool toppartitions <keyspace> <table> 1000 and MaxPartitionSize |
| Tombstone accumulation | P99 high on specific tables; log warnings about scanned tombstones | grep "tombstone" /var/log/cassandra/system.log; TombstoneScannedHistogram |
| GC pause on a replica | P99 spikes isolated to one or two nodes; no disk I/O correlation | nodetool gcstats and GC logs on the slow replica |
| Compaction debt / read amplification | P99 climbs gradually over hours; affects all queries on affected tables | nodetool compactionstats; LiveSSTableCount; SSTablesPerReadHistogram |
| Slow replica or network path | Proxy latency high, local latency low on coordinator | nodetool status; compare tablehistograms across replicas |
| Disk I/O saturation | Both P50 and P99 rise together; reads and compactions compete | iostat -x 1; commitlog vs data device separation |
Quick checks
# Coordinator-level latency (client-visible, includes network to replicas)
nodetool proxyhistograms
# Local replica latency for a specific table (excludes network)
nodetool tablehistograms <keyspace> <table>
# Pending compactions and SSTable accumulation
nodetool compactionstats
nodetool tablestats <keyspace> | grep "SSTable count"
# Thread pool saturation and dropped reads
nodetool tpstats
# Recent GC pause behavior
nodetool gcstats
# Heap usage and native transport state
nodetool info | grep -E "Heap Memory|Native Transport"
# Tombstone warnings in the current log window
grep "tombstone" /var/log/cassandra/system.log
# Per-device I/O latency and saturation
iostat -x 1
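During an incident it helps to capture these outputs at the same moment so they can be compared later. The following is a minimal sketch of such a snapshot helper. The function name `snapshot_reads`, the output file names, and the `NODETOOL` override variable are all assumptions made for this example.

```shell
# snapshot_reads OUTDIR KEYSPACE TABLE
# Saves the output of several quick checks into OUTDIR, one file per command.
# NODETOOL is a hypothetical override for dry runs; it defaults to nodetool.
snapshot_reads() {
  out="$1"; ks="$2"; tbl="$3"
  nt="${NODETOOL:-nodetool}"
  mkdir -p "$out" || return 1
  date -u +%Y-%m-%dT%H:%M:%SZ         > "$out/captured_at.txt"
  "$nt" proxyhistograms               > "$out/proxyhistograms.txt" 2>&1
  "$nt" tablehistograms "$ks" "$tbl"  > "$out/tablehistograms.txt" 2>&1
  "$nt" compactionstats               > "$out/compactionstats.txt" 2>&1
  "$nt" tpstats                       > "$out/tpstats.txt" 2>&1
  "$nt" gcstats                       > "$out/gcstats.txt" 2>&1
  return 0
}
```

A call might look like `snapshot_reads /tmp/cass-snap my_keyspace my_table`, where the directory, keyspace, and table are placeholders. Setting `NODETOOL=echo` records the commands that would run without touching the node.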
How to diagnose it
- Establish the scope with nodetool proxyhistograms. Is the spike in reads, writes, or both? Is P50 stable while P99 and p999 climb?
- Isolate coordinator versus replica. Run nodetool tablehistograms on the coordinator node. If local latency is low but proxy is high, investigate remote replicas and network paths.
- Check node symmetry. Compare tablehistograms output across replicas for the same table. If one replica is an outlier, inspect that node for GC, disk degradation, or compaction backlog.
- Correlate with GC. Check nodetool gcstats and GC logs. G1GC stop-the-world pauses inflate P99 and p999 while leaving P50 untouched if the workload has idle cycles between pauses.
- Inspect the read path. Query TombstoneScannedHistogram and SSTablesPerReadHistogram per table. Values climbing toward tombstone_warn_threshold (default 1000) or high SSTable counts indicate read amplification.
- Identify hot partitions. Run nodetool toppartitions <keyspace> <table> 1000 to see if a single partition is dominating traffic.
- Verify disk I/O. Use iostat -x to check await and %util on data and commitlog devices. If both are elevated, the node is I/O-saturated.
- Check for load shedding. nodetool tpstats will show non-zero dropped READ or MUTATION messages if internal queues are expiring requests.
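The first two steps come down to reading specific cells out of the histogram output. Below is a minimal sketch that does this with awk. It assumes the proxyhistograms layout in which each row starts with a label such as `50%` or `99%` and the read latency in microseconds is the second column. Column order in tablehistograms differs between Cassandra versions, so check the header and pass the column number explicitly there. The function names are illustrative.

```shell
# hist_value LABEL [COLUMN]
# Prints one cell from histogram text on stdin, e.g. the 99% row, column 2.
# COLUMN defaults to 2, assumed here to be read latency in proxyhistograms.
hist_value() {
  awk -v label="$1" -v col="${2:-2}" '
    $1 == label { print $col; found = 1; exit }
    END { exit !found }'
}

# tail_ratio: prints P99 divided by P50 from column 2 of the same text.
# A large ratio with a normal P50 suggests a tail-only problem.
tail_ratio() {
  awk '$1 == "50%" { p50 = $2 } $1 == "99%" { p99 = $2 }
       END { if (p50 > 0) printf "%.1f\n", p99 / p50; else exit 1 }'
}
```

Typical use would be `nodetool proxyhistograms | hist_value 99%` on the coordinator, then the same call against tablehistograms output with the column that matches Read Latency on your version.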
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ClientRequest Read Latency p99/p999 | Client-visible tail latency | Sustained elevation above 3x rolling 1-hour baseline |
| ClientRequest Read Latency p50 | Median experience | Stable P50 with spiking P99 indicates a tail-only issue |
| proxyhistograms coordinator latencies | Includes network and all replicas | Elevated against local tablehistograms implies coordination or network problem |
| tablehistograms local latencies | Isolates replica-side processing | Elevated local latency points to disk, GC, or compaction on this node |
| TombstoneScannedHistogram | Dead data scanned per read | Sustained values near or above 1000 (default warn threshold) |
| SSTablesPerReadHistogram | Read amplification factor | Rising count means queries touch more files |
| GC pause duration | Stop-the-world blocking | Pauses > 2 seconds risk gossip failure; pauses correlate with p999 spikes |
| DroppedMessage scope=READ | Internal overload shedding | Any sustained non-zero rate indicates the node cannot keep up |
| Pending compactions | Compaction debt | Trending upward over hours signals eventual read amplification |
| Thread pool READ pending | Request queuing | Sustained > 0 for > 60 seconds means read saturation |
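The "sustained elevation above 3x baseline" warning sign can be sketched as a check over a series of P99 samples. This is one possible interpretation, not a prescribed alert rule: the baseline is taken as the mean of the older samples, "sustained" as every one of the most recent samples exceeding the multiple, and the function name and defaults are assumptions.

```shell
# sustained_breach [FACTOR] [RECENT]
# stdin: one P99 sample per line, oldest first.
# Baseline = mean of all samples except the last RECENT ones (an assumption).
# Prints "breach" only if every one of the last RECENT samples exceeds
# FACTOR x baseline, otherwise "ok". Exits 2 when there is too little data.
sustained_breach() {
  awk -v k="${1:-3}" -v m="${2:-3}" '
    { v[NR] = $1 + 0 }
    END {
      n = NR - m
      if (n < 1) exit 2
      for (i = 1; i <= n; i++) sum += v[i]
      base = sum / n
      for (i = n + 1; i <= NR; i++)
        if (v[i] <= k * base) { print "ok"; exit 0 }
      print "breach"
    }'
}
```

Requiring several consecutive samples above the limit keeps a single slow scrape from firing the alert, at the cost of reacting a few intervals later.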
Fixes
Hot or large partitions
Application-level caching for the hot key is the fastest relief. Rate-limit or buffer writes to the partition. Long-term, redesign the data model to split the partition key or add a bucketing suffix.
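A bucketing suffix can be derived deterministically from some per-row value, so writes to one logical key spread across several physical partitions. The sketch below uses the POSIX `cksum` CRC as a stable hash. The function name, the bucket count of 16, and the idea of hashing a row identifier are assumptions for illustration, and the right bucketing scheme depends on your access pattern.

```shell
# bucket_for ROW_ID [BUCKETS]
# Prints a stable bucket number in the range 0..BUCKETS-1 for ROW_ID.
# The bucket would become an extra partition key component, for example
# (logical_key, bucket), where both names are hypothetical.
bucket_for() {
  printf '%s' "$1" | cksum | awk -v n="${2:-16}" '{ print $1 % n }'
}
```

The trade-off is on the read side: fetching everything for the logical key now means querying each bucket and merging the results, so keep the bucket count as small as the write load allows.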
Tombstone storms
Run repair on every table at least once per gc_grace_seconds window (default 10 days). Tombstones become eligible for purging once that window has elapsed, and a replica that missed the delete and was not repaired in time can resurrect the data. Trigger targeted compaction with nodetool compact <keyspace> <table>. Warning: this is I/O-intensive and will compete with live traffic. Review delete patterns and consider TWCS for TTL-dominated tables.
GC pressure
Reduce batch statement sizes and avoid reading entire large partitions into memory. If row cache is enabled, consider disabling it; it is off by default for good reason and often wastes heap. If heap is undersized or oversized, adjust -Xmx. Avoid exceeding 16GB with G1GC, and never exceed 32GB, which disables compressed OOPs.
Compaction backlog
Temporarily increase compaction throughput if CPU and disk allow: nodetool setcompactionthroughput changes it at runtime, and compaction_throughput_mb_per_sec in cassandra.yaml sets the persistent value. Postpone non-urgent repairs that compete for I/O. If the table uses STCS and read amplification is chronic, plan a migration to LCS or UCS. Altering compaction strategy triggers a full recompaction, so time this carefully.
Slow replica or network
If one node is persistently slower than its peers, exclude it from client routing temporarily and investigate hardware health. Check for asymmetric network paths or cross-DC latency if using non-local consistency levels.
Disk I/O saturation
Separate commitlog and data directories onto different devices if they share a disk. Throttle background streaming and repair during peak hours. If SSD await exceeds 10ms sustained, the device is struggling.
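The await check can be sketched as a filter over `iostat -x` output. Column names and positions vary between sysstat versions (some print a single `await`, others split it into `r_await` and `w_await`), so this sketch finds the column by its header name instead of by position. The function name and defaults are assumptions, and a single sample above the limit is not the same as sustained saturation.

```shell
# slow_devices [LIMIT_MS] [COLUMN_NAME]
# stdin: text from iostat -x. Prints "device value" for each device whose
# COLUMN_NAME value exceeds LIMIT_MS. Defaults: 10 ms and r_await.
slow_devices() {
  awk -v limit="${1:-10}" -v want="${2:-r_await}" '
    NF == 0 { col = 0; next }            # blank line ends a device table
    $1 ~ /^Device/ {                     # header row: locate the column
      col = 0
      for (i = 2; i <= NF; i++) if ($i == want) col = i
      next
    }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }'
}
```

One way to use it is `iostat -x 1 5 | slow_devices 10 r_await`, which takes five one-second samples and lists any device that crossed 10 ms of read await in any of them.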
Prevention
- Monitor compaction pending tasks as a derivative, not a static value. A count that increases over 24 hours is a leading indicator.
- Track TombstoneScannedHistogram per table and alert before queries reach the 1000 tombstone warning threshold.
- Sample partition size distributions with nodetool toppartitions during normal operations to catch growth trends.
- Set relationship-based latency alerts (deviation from baseline) rather than fixed thresholds, because workload norms vary by compaction strategy and hardware.
- Keep nodetool proxyhistograms baseline context in mind; the tool resets on restart and provides no historical data, so external time-series storage is essential.
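Treating pending compactions as a derivative can be as simple as comparing the first and last sample in a window. The sketch below assumes you already record lines of the form `epoch_seconds pending_tasks`. The input format and the function name are assumptions for this example, and a least-squares fit would be more robust against noisy samples.

```shell
# pending_trend
# stdin: lines of "epoch_seconds pending_tasks", oldest first.
# Prints the average change in pending tasks per hour across the window.
pending_trend() {
  awk 'NR == 1 { t0 = $1; p0 = $2 }
       { t1 = $1; p1 = $2 }
       END {
         if (NR < 2 || t1 == t0) exit 1
         printf "%.2f\n", (p1 - p0) * 3600 / (t1 - t0)
       }'
}
```

A persistently positive value over a 24-hour window means compaction is falling behind, even if the absolute count still looks small.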
How Netdata helps
- Netdata collects JMX ClientRequest percentiles (p50 through p999) continuously, preserving history that nodetool proxyhistograms loses on restart.
- Correlate P99/p999 read latency spikes with GC pause duration charts on the same node to identify JVM pressure without parsing GC logs manually.
- Compare coordinator-level latency against per-node disk I/O await and thread pool saturation to distinguish network issues from local replica issues.
- Per-node anomaly detection flags when one replica’s read latency diverges from the cluster median, surfacing slow replicas before they trigger quorum timeouts.
- Netdata tracks DroppedMessage, PendingCompactions, and TombstoneScannedHistogram as first-class metrics, letting you build composite alerts that avoid false positives from single-signal thresholds.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert
- Cassandra Not enough space for compaction: STCS space amplification and recovery







