Cassandra hot partition: when one key saturates a replica set

One or two nodes run hot while the rest of the cluster idles. Client P99 latency doubles or triples, but the average looks fine. Timeouts cluster on a subset of hosts, and per-node load is uneven in a way that token ownership in nodetool status does not explain. When you trace requests, a single partition key consumes a disproportionate share of reads or writes. This is a hot partition: the partitioner hashes the key to a single token, and the replicas that own that token's range are saturated.

Cassandra distributes data by partition key hash, not by load. When a viral entity, a coarse time-series bucket, or a poorly sharded status flag becomes popular, all traffic for that logical record lands on the same physical nodes. The affected replicas face elevated CPU, I/O, and GC pressure. If speculative retry is configured, the coordinator fans out requests to additional replicas when the primary ones stall, spreading saturation instead of isolating it. The rest of the ring stays healthy, so cluster-aggregated dashboards often miss hot partitions until clients time out.
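
To make the "hash, not load" point concrete, here is a minimal sketch of a token ring. It is a toy model under stated assumptions: MD5 stands in for Cassandra's Murmur3 partitioner, replica placement is a simple clockwise walk (roughly SimpleStrategy), and the node names, vnode count, and traffic mix are all hypothetical.

```python
import bisect
import hashlib
from collections import Counter


def token_for(key: str) -> int:
    """Stand-in hash. Cassandra's default partitioner uses Murmur3, not MD5."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes=8):
    """Give every node several tokens and return sorted (token, node) pairs."""
    return sorted((token_for(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))


def replicas_for(ring, key, rf=3):
    """Walk clockwise from the key's token and collect rf distinct nodes."""
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == rf:
            break
    return owners


ring = build_ring([f"node{i}" for i in range(1, 7)])
load = Counter()
# 900 requests for one viral key, 100 spread over 100 ordinary keys.
requests = ["viral-post"] * 900 + [f"post-{i}" for i in range(100)]
for key in requests:
    load[replicas_for(ring, key)[0]] += 1  # count only the first replica

print(replicas_for(ring, "viral-post"))  # always the same three nodes
print(load.most_common())
```

However busy those three nodes become, the mapping never changes: the hash of the key decides placement, so the only ways to move the load are to change the key or to reduce the traffic.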

The damage is not always partition size. A hot partition can be small but heavily trafficked, or both large and hot. Either way, the replica set serving it becomes the bottleneck for the request path. This guide shows how to confirm the pattern, find the key, and stop the cascade.

What this means

A hot partition is an access-pattern skew problem masquerading as a general performance problem. The coordinator sends requests to the replicas that own the partition’s token, so a single hot key concentrates its load on those nodes. Even at consistency level ONE, the coordinator must contact a replica for that token. If that replica is saturated, local read latency rises from queueing, context switching, and GC. Coordinator latency rises because the coordinator waits for the slowest replica it needs to satisfy the consistency level.

If the table uses speculative retry, the coordinator sends a redundant read to an additional replica when the first replica has not responded within a latency threshold. This improves perceived tail latency under normal conditions, but under hot-partition saturation it turns a single-replica bottleneck into a multi-replica CPU and I/O storm. The result is a latency tail that degrades the client experience even though most partitions in the table are fast.

Hot partitions also compound with large partition pathology. When a partition grows beyond tens of megabytes, reading it requires more heap for merge-sort across SSTables and more disk I/O for index traversal. A partition that is both large and hot consumes multiple resources simultaneously, making it harder for the replica to recover.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Skewed access on a single partition key | One or two nodes show high CPU and disk await while peers are calm; nodetool toppartitions shows a single key dominating traffic | nodetool toppartitions on the suspect table |
| Oversized partition | High local read latency on the replica, GC pressure, and compaction stalls for the table | nodetool tablehistograms for per-partition size percentiles |
| Speculative retry amplifying load | Elevated SpeculativeRetries counter and load spreading from one replica to many | nodetool tablestats or JMX speculative retry rate per table |
| Application thundering herd | A coordinated spike from many clients targeting the same key simultaneously | Client request rate versus rolling baseline |
| Poor bucketing in time-series or queue models | All writes for the current time window or status category land in one partition | Partition key design in the CQL schema |

Quick checks

Run these read-only commands on the affected nodes to confirm the pattern.

# Sample the hottest partition keys on a node (last argument is the sampling duration in milliseconds)
nodetool toppartitions <keyspace> <table> 1000

# Check per-table partition size percentiles
nodetool tablehistograms <keyspace> <table>

# View cluster topology and per-node load skew
nodetool status

# Inspect coordinator-level latency percentiles
nodetool proxyhistograms

# Check heap pressure
nodetool info | grep "Heap Memory"

# Review thread pool backpressure and dropped messages
nodetool tpstats

# Check for compaction or flush bottlenecks
nodetool compactionstats

# Review speculative retry activity per table
nodetool tablestats <keyspace> <table> | grep -i speculative

How to diagnose it

  1. Confirm replica isolation. Use nodetool status and per-node OS metrics to verify that only a subset of nodes is saturated. If all nodes are equally loaded, the problem is cluster-wide, not a hot partition.
  2. Find the hot partition. Run nodetool toppartitions on the affected replicas. Look for one partition key or a small set of keys that account for a disproportionate percentage of reads or writes.
  3. Check partition size. Run nodetool tablehistograms on the table. If the P99 or maximum partition size is above 10 MB, or if any partition exceeds 100 MB, the partition is oversized as well as hot. Large partitions amplify GC and compaction cost.
  4. Correlate with speculative retries. Check the SpeculativeRetries metric via JMX or nodetool tablestats. If speculative retries are elevated, the coordinator is multiplying requests and loading additional replicas.
  5. Validate client traffic. Compare the current request rate for the table against your baseline. A sudden step change suggests a thundering herd or a trending key.
  6. Inspect replica saturation signals. On the hot replicas, check nodetool tpstats for pending tasks in ReadStage or MutationStage, and review GC logs for long pauses. These confirm the node is struggling to keep up.
  7. Review the data model. Examine the partition key structure. Unbounded time buckets, single-row global counters, and queue-style patterns are common culprits.
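
Steps 2 and 3 come down to two small calculations, sketched below. This is an illustration, not a parser for nodetool output: the input dictionary is assumed to be per-key counts you have already extracted from a toppartitions sample, the function names are hypothetical, and the 25% share cutoff is an arbitrary assumption. The 10 MB and 100 MB limits are the ones quoted in step 3.

```python
MB = 1024 * 1024


def dominant_keys(sample_counts, min_share=0.25):
    """Return (key, share) pairs whose share of sampled requests is at least min_share."""
    total = sum(sample_counts.values())
    if total == 0:
        return []
    ranked = sorted(sample_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(key, count / total) for key, count in ranked if count / total >= min_share]


def size_verdict(p99_bytes, max_bytes):
    """Apply the step 3 thresholds: P99 above 10 MB or any partition above 100 MB."""
    if p99_bytes > 10 * MB or max_bytes > 100 * MB:
        return "oversized"
    return "ok"


# Hypothetical numbers for one table on one replica.
print(dominant_keys({"user:42": 900, "user:7": 50, "user:9": 50}))
print(size_verdict(p99_bytes=5 * MB, max_bytes=150 * MB))
```

A key that dominates the sample on a table whose sizes come back "ok" points to a small but heavily trafficked partition, which calls for load reduction or bucketing rather than a size cap.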

The failure cascade, as a Mermaid flowchart:

flowchart TD
    A[Single partition key receives disproportionate traffic] --> B[Replicas owning the token range saturate]
    B --> C[Coordinator latency spikes on reads and writes]
    C --> D{Speculative retry enabled?}
    D -->|Yes| E[Coordinator fans out to additional replicas]
    E --> F[CPU and I/O pressure spreads across the replica set]
    D -->|No| G[Requests timeout or queue on the slow replica]
    F --> H[Dropped messages and thread pool backpressure]
    G --> H
    H --> I[P99 latency tail degrades cluster-wide]

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Per-node request rate | Hot partitions create asymmetric load | One node handling more than 2x the cluster median traffic |
| Coordinator read/write latency (p99/p999) | Tail latency reflects replica saturation | P99 sustained above 3x baseline or approaching client timeout |
| Speculative retry rate | Indicates slow replicas and multiplies load | Greater than 5% of reads for a table |
| Partition size percentiles | Oversized partitions amplify GC and I/O | P99 partition size above 10 MB, or any partition near 100 MB |
| Thread pool pending tasks (Read/Mutation) | Saturation on hot replicas | Pending tasks greater than 0 sustained for more than 60 seconds |
| Dropped messages | The replica is shedding load | Any sustained non-zero rate of dropped reads or mutations |
| SSTable count per table | Read amplification grows when compaction lags | Greater than 50 for STCS, or L0 greater than 32 for LCS |
| GC pause duration | Hot partition reads can trigger heap pressure | Pauses greater than 500 ms sustained |
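
The first warning sign, a node handling more than 2x the cluster median, can be checked with a few lines of code. The sketch below assumes you already collect a per-node request rate from your metrics system; the function name and the sample rates are hypothetical.

```python
from statistics import median


def skewed_nodes(rates, factor=2.0):
    """Return nodes whose request rate exceeds factor times the cluster median."""
    mid = median(rates.values())
    return sorted(node for node, rate in rates.items() if mid > 0 and rate > factor * mid)


# Hypothetical requests per second, keyed by node.
rates = {"node1": 100, "node2": 110, "node3": 95, "node4": 420}
print(skewed_nodes(rates))  # ['node4']: the median is 105, so the cutoff is 210
```

Comparing against the median rather than the mean keeps the hot node from inflating its own baseline.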

Fixes

Reduce load on the hot partition

If the application can tolerate it, throttle or cache requests for the hot key at the client layer. Deduplicating concurrent requests for the same partition key before they reach Cassandra cuts replica load significantly. This is often the fastest way to stop the bleeding without a schema change.
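
One way to deduplicate concurrent requests is a "single-flight" wrapper: the first caller for a key performs the read, and callers that arrive while it is in flight wait for and share its result. The sketch below is a minimal thread-based version under stated assumptions. The class names are hypothetical, and `fetch` stands for whatever function your application uses to read the row from Cassandra.

```python
import threading


class _Call:
    """Shared state for one in-flight fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Collapse concurrent fetches for the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fetch):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fetch(key)  # the only call that reaches the database
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every waiting follower
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

A caller would use it as `flight.do("user:42", read_row)`, where `read_row` is your own query function. This only collapses requests that overlap in time within one process. Pairing it with a short-lived cache extends the benefit to requests that arrive just after the read completes, at the cost of serving slightly stale data.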

Tune speculative retry

If the table uses speculative retry and you see a high speculative retry rate, reduce or disable it temporarily. Under hot-partition saturation, speculative retry sends extra reads to additional replicas, turning a single-replica bottleneck into CPU and I/O pressure across the whole replica set. The tradeoff is slightly higher tail latency for affected reads until the root cause is fixed.
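
Speculative retry is a per-table option set with `ALTER TABLE ... WITH speculative_retry = '...'`. The helper below is a small sketch that builds that statement and rejects values outside a conservative subset: `NONE`, `ALWAYS`, a percentile such as `99PERCENTILE`, or a fixed delay such as `50ms`. The function name and the keyspace and table names are hypothetical. Newer Cassandra versions accept additional forms that this check deliberately leaves out.

```python
import re

_IDENT = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
_POLICY = re.compile(r"^(NONE|ALWAYS|\d+(\.\d+)?PERCENTILE|\d+(\.\d+)?MS)$", re.IGNORECASE)


def speculative_retry_cql(keyspace, table, policy):
    """Build an ALTER TABLE statement that sets the table's speculative_retry option."""
    for name in (keyspace, table):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if not _POLICY.match(policy):
        raise ValueError(f"unsupported speculative_retry value: {policy!r}")
    return f"ALTER TABLE {keyspace}.{table} WITH speculative_retry = '{policy}';"


# Hypothetical keyspace and table: disable speculative retry during the incident.
print(speculative_retry_cql("shop", "events", "NONE"))
```

Record the previous value before changing it (for example from `DESCRIBE TABLE` in cqlsh) so you can restore it once the hot partition is resolved.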

Redesign the partition key with bucketing

Long-term, fix the data model so the hot logical entity spreads across multiple partition keys. Append a deterministic hash or a time shard to the partition key to break one logical stream into many physical partitions. For example, a time-series table that buckets by hour can be redesigned to use a composite key with a sub-hour shard. The tradeoff is that reads must query multiple partitions and merge results client-side, so choose a bucket count that balances write spread against read complexity.
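
A minimal sketch of the bucketing idea follows. It assumes a hypothetical table whose partition key is `(entity_id, hour, bucket)`. The bucket count of 16, the function names, and the use of an event ID as the hash input are illustrative choices, not a recommendation.

```python
import hashlib

BUCKETS = 16  # hypothetical; more buckets spread writes further but widen every read


def bucket_for(event_id: str, buckets: int = BUCKETS) -> int:
    """Deterministic shard number derived from a stable hash of the event ID."""
    digest = hashlib.sha1(event_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def write_key(entity_id: str, hour: str, event_id: str, buckets: int = BUCKETS):
    """Partition key for a single write: one logical stream, one of many partitions."""
    return (entity_id, hour, bucket_for(event_id, buckets))


def read_keys(entity_id: str, hour: str, buckets: int = BUCKETS):
    """Every partition key a reader must query and merge for one entity and hour."""
    return [(entity_id, hour, b) for b in range(buckets)]


print(write_key("sensor-17", "2024-05-01T13", "evt-000123"))
print(len(read_keys("sensor-17", "2024-05-01T13")))  # 16 partitions to merge
```

A stable hash matters here: Python's built-in `hash()` is randomized per process for strings, so it would send the same event to different buckets across restarts. Changing the bucket count later also changes where new writes land, so readers must know which count was in effect for the time range they query.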

Bound partition size

If nodetool tablehistograms shows the hot partition is also oversized, split the data model to enforce a hard size ceiling. Avoid unbounded collections or wide rows inside a single partition. If you cannot change the schema immediately, purge old data or offload historical rows to reduce the partition footprint.

Prevention

  • Monitor per-node load skew, not just cluster aggregates. Aggregate metrics hide hot partitions because the healthy majority dilutes the signal. Alert when any node’s request rate or CPU deviates from the median by more than a set threshold.
  • Alert on partition size percentiles and speculative retry rate. These are leading indicators that surface modeling problems before the replica set saturates.
  • Design partition keys so that every partition stays bounded. Every partition key should include a natural shard or bucket that prevents unbounded growth or traffic concentration.
  • Review schemas for anti-patterns. Global counters, queue-style tables with delete-heavy access, and coarse time buckets are frequent sources of hot partitions.

How Netdata helps

  • Per-node Cassandra JMX metrics expose load skew that cluster-wide averages hide.
  • Correlate elevated SpeculativeRetries with per-node CPU utilization and disk latency to identify slow replicas and coordinator amplification.
  • Track per-table coordinator latency percentiles alongside OS-level metrics such as disk await and CPU iowait to distinguish hot-partition saturation from generic cluster overload.
  • Alert on per-node request rate deviations from the cluster median to catch asymmetric load before P99 latency spikes.