Cassandra read latency spikes: P99 vs P50 and proxyhistograms

Your application is timing out on Cassandra reads, but a quick glance at average latency looks acceptable. This is the percentile trap. In Cassandra, read latency follows a long-tail distribution: most requests are fast, but a fraction hit slow replicas, large partitions, or GC pauses and become orders of magnitude slower. The critical first step is knowing whether the spike lives at the coordinator or the replica, and whether it is systemic or isolated to the tail. nodetool proxyhistograms and nodetool tablehistograms answer exactly that.

When P50 is stable but P99 spikes, the cluster is not uniformly overloaded. A subset of requests is slow, which points to specific partitions, specific nodes, or transient JVM events. When both rise together, you are looking at systemic saturation. Do not trust average latency. A single large-partition read can distort the average while nearly all requests remain fast, masking the tail that is actually killing your clients.

What this means

nodetool proxyhistograms reports coordinator-level latency. It measures from the moment the coordinator receives the CQL read until it returns a response to the client. This includes network round-trips to replicas, replica-side processing, and result merging. It is the latency your application actually feels.

nodetool tablehistograms <keyspace> <table> reports local replica latency. It isolates disk I/O, bloom filter checks, SSTable merges, and tombstone application on the local node. It excludes the network hop.

If proxyhistograms is elevated but tablehistograms is low on the same node, the bottleneck is not local disk. It is either a slow remote replica, network degradation, or coordinator-side congestion. If both are elevated, the replica itself is struggling.

flowchart TD
    A[P99 read latency spikes] --> B{P50 stable?}
    B -->|Yes| C[Tail-only problem]
    B -->|No| D[Systemic slowdown]
    C --> E{Proxy high, local low?}
    E -->|Yes| F[Network or slow replica]
    E -->|No| G[Local replica issue]
    G --> H{Correlate with}
    H -->|GC pauses| I[JVM pressure]
    H -->|SSTables/Tombstones| J[Read amplification]
    D --> K[Check disk I/O and saturation]

Common causes

CauseWhat it looks likeFirst thing to check
Hot or large partitionP99 spikes, P50 flat; spikes correlate with specific partition keysnodetool toppartitions <keyspace> <table> 1000 and MaxPartitionSize
Tombstone accumulationP99 high on specific tables; log warnings about scanned tombstonesgrep "tombstone" /var/log/cassandra/system.log; TombstoneScannedHistogram
GC pause on a replicaP99 spikes isolated to one or two nodes; no disk I/O correlationnodetool gcstats and GC logs on the slow replica
Compaction debt / read amplificationP99 climbs gradually over hours; affects all queries on affected tablesnodetool compactionstats; LiveSSTableCount; SSTablesPerReadHistogram
Slow replica or network pathProxy latency high, local latency low on coordinatornodetool status; compare tablehistograms across replicas
Disk I/O saturationBoth P50 and P99 rise together; reads and compactions competeiostat -x 1; commitlog vs data device separation

Quick checks

# Coordinator-level latency (client-visible, includes network to replicas)
nodetool proxyhistograms

# Local replica latency for a specific table (excludes network)
nodetool tablehistograms <keyspace> <table>

# Pending compactions and SSTable accumulation
nodetool compactionstats
nodetool tablestats <keyspace> | grep "SSTable count"

# Thread pool saturation and dropped reads
nodetool tpstats

# Recent GC pause behavior
nodetool gcstats

# Heap usage and native transport state
nodetool info | grep -E "Heap Memory|Native Transport"

# Tombstone warnings in the current log window
grep "tombstone" /var/log/cassandra/system.log

# Per-device I/O latency and saturation
iostat -x 1

How to diagnose it

  1. Establish the scope with nodetool proxyhistograms. Is the spike in reads, writes, or both? Is P50 stable while P99 and p999 climb?
  2. Isolate coordinator versus replica. Run nodetool tablehistograms on the coordinator node. If local latency is low but proxy is high, investigate remote replicas and network paths.
  3. Check node symmetry. Compare tablehistograms output across replicas for the same table. If one replica is an outlier, inspect that node for GC, disk degradation, or compaction backlog.
  4. Correlate with GC. Check nodetool gcstats and GC logs. G1GC stop-the-world pauses inflate P99 and p999 while leaving P50 untouched if the workload has idle cycles between pauses.
  5. Inspect the read path. Query TombstoneScannedHistogram and SSTablesPerReadHistogram per table. Values climbing toward tombstone_warn_threshold (default 1000) or high SSTable counts indicate read amplification.
  6. Identify hot partitions. Run nodetool toppartitions <keyspace> <table> 1000 to see if a single partition is dominating traffic.
  7. Verify disk I/O. Use iostat -x to check await and %util on data and commitlog devices. If both are elevated, the node is I/O-saturated.
  8. Check for load shedding. nodetool tpstats will show non-zero dropped READ or MUTATION messages if internal queues are expiring requests.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
ClientRequest Read Latency p99/p999Client-visible tail latencySustained elevation above 3x rolling 1-hour baseline
ClientRequest Read Latency p50Median experienceStable P50 with spiking P99 indicates a tail-only issue
proxyhistograms coordinator latenciesIncludes network and all replicasElevated against local tablehistograms implies coordination or network problem
tablehistograms local latenciesIsolates replica-side processingElevated local latency points to disk, GC, or compaction on this node
TombstoneScannedHistogramDead data scanned per readSustained values near or above 1000 (default warn threshold)
SSTablesPerReadHistogramRead amplification factorRising count means queries touch more files
GC pause durationStop-the-world blockingPauses > 2 seconds risk gossip failure; pauses correlate with p999 spikes
DroppedMessage scope=READInternal overload sheddingAny sustained non-zero rate indicates the node cannot keep up
Pending compactionsCompaction debtTrending upward over hours signals eventual read amplification
Thread pool READ pendingRequest queuingSustained > 0 for > 60 seconds means read saturation

Fixes

Hot or large partitions

Application-level caching for the hot key is the fastest relief. Rate-limit or buffer writes to the partition. Long-term, redesign the data model to split the partition key or add a bucketing suffix.

Tombstone storms

Ensure repair has run within gc_grace_seconds (default 10 days); tombstones cannot be purged until all replicas have been repaired. Trigger targeted compaction with nodetool compact <keyspace> <table>. Warning: this is I/O-intensive and will compete with live traffic. Review delete patterns and consider TWCS for TTL-dominated tables.

GC pressure

Reduce batch statement sizes and avoid reading entire large partitions into memory. If row cache is enabled, consider disabling it; it is off by default for good reason and often wastes heap. If heap is undersized or oversized, adjust -Xmx. Avoid exceeding 16GB with G1GC, and never exceed 32GB, which disables compressed OOPs.

Compaction backlog

Temporarily increase compaction_throughput_mb_per_sec if CPU and disk allow. Postpone non-urgent repairs that compete for I/O. If the table uses STCS and read amplification is chronic, plan a migration to LCS or UCS. Altering compaction strategy triggers a full recompaction, so time this carefully.

Slow replica or network

If one node is persistently slower than its peers, exclude it from client routing temporarily and investigate hardware health. Check for asymmetric network paths or cross-DC latency if using non-local consistency levels.

Disk I/O saturation

Separate commitlog and data directories onto different devices if they share a disk. Throttle background streaming and repair during peak hours. If SSD await exceeds 10ms sustained, the device is struggling.

Prevention

  • Monitor compaction pending tasks as a derivative, not a static value. A count that increases over 24 hours is a leading indicator.
  • Track TombstoneScannedHistogram per table and alert before queries reach the 1000 tombstone warning threshold.
  • Sample partition size distributions with nodetool toppartitions during normal operations to catch growth trends.
  • Set relationship-based latency alerts (deviation from baseline) rather than fixed thresholds, because workload norms vary by compaction strategy and hardware.
  • Keep nodetool proxyhistograms baseline context in mind; the tool resets on restart and provides no historical data, so external time-series storage is essential.

How Netdata helps

  • Netdata collects JMX ClientRequest percentiles (p50 through p999) continuously, preserving history that nodetool proxyhistograms loses on restart.
  • Correlate P99/p999 read latency spikes with GC pause duration charts on the same node to identify JVM pressure without parsing GC logs manually.
  • Compare coordinator-level latency against per-node disk I/O await and thread pool saturation to distinguish network issues from local replica issues.
  • Per-node anomaly detection flags when one replica’s read latency diverges from the cluster median, surfacing slow replicas before they trigger quorum timeouts.
  • Netdata tracks DroppedMessage, PendingCompactions, and TombstoneScannedHistogram as first-class metrics, letting you build composite alerts that avoid false positives from single-signal thresholds.