$ guides / cassandra / cassandra-read-latency-spikes ▌

Operations Guides

Cassandra read latency spikes: P99 vs P50 and proxyhistograms

Your application is timing out on Cassandra reads, but a quick glance at average latency looks acceptable. This is the percentile trap. In Cassandra, read latency follows a long-tail distribution: most requests are fast, but a fraction hit slow replicas, large partitions, or GC pauses and become orders of magnitude slower. The critical first step is knowing whether the spike lives at the coordinator or the replica, and whether it is systemic or isolated to the tail. nodetool proxyhistograms and nodetool tablehistograms answer exactly that.

When P50 is stable but P99 spikes, the cluster is not uniformly overloaded. A subset of requests is slow, which points to specific partitions, specific nodes, or transient JVM events. When both rise together, you are looking at systemic saturation. Do not trust average latency. A single large-partition read can distort the average while nearly all requests remain fast, masking the tail that is actually killing your clients.

What this means

nodetool proxyhistograms reports coordinator-level latency. It measures from the moment the coordinator receives the CQL read until it returns a response to the client. This includes network round-trips to replicas, replica-side processing, and result merging. It is the latency your application actually feels.

nodetool tablehistograms <keyspace> <table> reports local replica latency. It isolates disk I/O, bloom filter checks, SSTable merges, and tombstone application on the local node. It excludes the network hop.

If proxyhistograms is elevated but tablehistograms is low on the same node, the bottleneck is not local disk. It is either a slow remote replica, network degradation, or coordinator-side congestion. If both are elevated, the replica itself is struggling.

flowchart TD
    A[P99 read latency spikes] --> B{P50 stable?}
    B -->|Yes| C[Tail-only problem]
    B -->|No| D[Systemic slowdown]
    C --> E{Proxy high, local low?}
    E -->|Yes| F[Network or slow replica]
    E -->|No| G[Local replica issue]
    G --> H{Correlate with}
    H -->|GC pauses| I[JVM pressure]
    H -->|SSTables/Tombstones| J[Read amplification]
    D --> K[Check disk I/O and saturation]

Common causes

Cause	What it looks like	First thing to check
Hot or large partition	P99 spikes, P50 flat; spikes correlate with specific partition keys	`nodetool toppartitions <keyspace> <table> 1000` and `MaxPartitionSize`
Tombstone accumulation	P99 high on specific tables; log warnings about scanned tombstones	`grep "tombstone" /var/log/cassandra/system.log`; `TombstoneScannedHistogram`
GC pause on a replica	P99 spikes isolated to one or two nodes; no disk I/O correlation	`nodetool gcstats` and GC logs on the slow replica
Compaction debt / read amplification	P99 climbs gradually over hours; affects all queries on affected tables	`nodetool compactionstats`; `LiveSSTableCount`; `SSTablesPerReadHistogram`
Slow replica or network path	Proxy latency high, local latency low on coordinator	`nodetool status`; compare `tablehistograms` across replicas
Disk I/O saturation	Both P50 and P99 rise together; reads and compactions compete	`iostat -x 1`; commitlog vs data device separation

Quick checks

# Coordinator-level latency (client-visible, includes network to replicas)
nodetool proxyhistograms

# Local replica latency for a specific table (excludes network)
nodetool tablehistograms <keyspace> <table>

# Pending compactions and SSTable accumulation
nodetool compactionstats
nodetool tablestats <keyspace> | grep "SSTable count"

# Thread pool saturation and dropped reads
nodetool tpstats

# Recent GC pause behavior
nodetool gcstats

# Heap usage and native transport state
nodetool info | grep -E "Heap Memory|Native Transport"

# Tombstone warnings in the current log window
grep "tombstone" /var/log/cassandra/system.log

# Per-device I/O latency and saturation
iostat -x 1

How to diagnose it

Establish the scope with nodetool proxyhistograms. Is the spike in reads, writes, or both? Is P50 stable while P99 and p999 climb?
Isolate coordinator versus replica. Run nodetool tablehistograms on the coordinator node. If local latency is low but proxy is high, investigate remote replicas and network paths.
Check node symmetry. Compare tablehistograms output across replicas for the same table. If one replica is an outlier, inspect that node for GC, disk degradation, or compaction backlog.
Correlate with GC. Check nodetool gcstats and GC logs. G1GC stop-the-world pauses inflate P99 and p999 while leaving P50 untouched if the workload has idle cycles between pauses.
Inspect the read path. Query TombstoneScannedHistogram and SSTablesPerReadHistogram per table. Values climbing toward tombstone_warn_threshold (default 1000) or high SSTable counts indicate read amplification.
Identify hot partitions. Run nodetool toppartitions <keyspace> <table> 1000 to see if a single partition is dominating traffic.
Verify disk I/O. Use iostat -x to check await and %util on data and commitlog devices. If both are elevated, the node is I/O-saturated.
Check for load shedding. nodetool tpstats will show non-zero dropped READ or MUTATION messages if internal queues are expiring requests.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
ClientRequest Read Latency p99/p999	Client-visible tail latency	Sustained elevation above 3x rolling 1-hour baseline
ClientRequest Read Latency p50	Median experience	Stable P50 with spiking P99 indicates a tail-only issue
proxyhistograms coordinator latencies	Includes network and all replicas	Elevated against local `tablehistograms` implies coordination or network problem
tablehistograms local latencies	Isolates replica-side processing	Elevated local latency points to disk, GC, or compaction on this node
TombstoneScannedHistogram	Dead data scanned per read	Sustained values near or above 1000 (default warn threshold)
SSTablesPerReadHistogram	Read amplification factor	Rising count means queries touch more files
GC pause duration	Stop-the-world blocking	Pauses > 2 seconds risk gossip failure; pauses correlate with p999 spikes
DroppedMessage scope=READ	Internal overload shedding	Any sustained non-zero rate indicates the node cannot keep up
Pending compactions	Compaction debt	Trending upward over hours signals eventual read amplification
Thread pool READ pending	Request queuing	Sustained > 0 for > 60 seconds means read saturation

Fixes

Hot or large partitions

Application-level caching for the hot key is the fastest relief. Rate-limit or buffer writes to the partition. Long-term, redesign the data model to split the partition key or add a bucketing suffix.

Tombstone storms

Ensure repair has run within gc_grace_seconds (default 10 days); tombstones cannot be purged until all replicas have been repaired. Trigger targeted compaction with nodetool compact <keyspace> <table>. Warning: this is I/O-intensive and will compete with live traffic. Review delete patterns and consider TWCS for TTL-dominated tables.

GC pressure

Reduce batch statement sizes and avoid reading entire large partitions into memory. If row cache is enabled, consider disabling it; it is off by default for good reason and often wastes heap. If heap is undersized or oversized, adjust -Xmx. Avoid exceeding 16GB with G1GC, and never exceed 32GB, which disables compressed OOPs.

Compaction backlog

Temporarily increase compaction_throughput_mb_per_sec if CPU and disk allow. Postpone non-urgent repairs that compete for I/O. If the table uses STCS and read amplification is chronic, plan a migration to LCS or UCS. Altering compaction strategy triggers a full recompaction, so time this carefully.

Slow replica or network

If one node is persistently slower than its peers, exclude it from client routing temporarily and investigate hardware health. Check for asymmetric network paths or cross-DC latency if using non-local consistency levels.

Disk I/O saturation

Separate commitlog and data directories onto different devices if they share a disk. Throttle background streaming and repair during peak hours. If SSD await exceeds 10ms sustained, the device is struggling.

Prevention

Monitor compaction pending tasks as a derivative, not a static value. A count that increases over 24 hours is a leading indicator.
Track TombstoneScannedHistogram per table and alert before queries reach the 1000 tombstone warning threshold.
Sample partition size distributions with nodetool toppartitions during normal operations to catch growth trends.
Set relationship-based latency alerts (deviation from baseline) rather than fixed thresholds, because workload norms vary by compaction strategy and hardware.
Keep nodetool proxyhistograms baseline context in mind; the tool resets on restart and provides no historical data, so external time-series storage is essential.

How Netdata helps

Netdata collects JMX ClientRequest percentiles (p50 through p999) continuously, preserving history that nodetool proxyhistograms loses on restart.
Correlate P99/p999 read latency spikes with GC pause duration charts on the same node to identify JVM pressure without parsing GC logs manually.
Compare coordinator-level latency against per-node disk I/O await and thread pool saturation to distinguish network issues from local replica issues.
Per-node anomaly detection flags when one replica’s read latency diverges from the cluster median, surfacing slow replicas before they trigger quorum timeouts.
Netdata tracks DroppedMessage, PendingCompactions, and TombstoneScannedHistogram as first-class metrics, letting you build composite alerts that avoid false positives from single-signal thresholds.

The Netdata solution

Cassandra monitoring with Netdata

Netdata monitors Apache Cassandra with per-second metrics and automatic dashboards. Correlate GC pauses, compaction backlog, tombstone rates, pending hints, and disk usage across nodes to catch a creeping cluster before it tips over.

See Cassandra monitoring → Start monitoring free

Cassandra read latency spikes: P99 vs P50 and proxyhistograms

Cassandra read latency spikes: P99 vs P50 and proxyhistograms

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Hot or large partitions

Tombstone storms

GC pressure

Compaction backlog

Slow replica or network

Disk I/O saturation

Prevention

How Netdata helps

Related guides

Cassandra monitoring with Netdata