Cassandra hot partition: when one key saturates a replica set
One or two nodes run hot while the rest of the cluster idles. Client P99 latency doubles or triples, but the average looks fine. Timeouts cluster on a subset of hosts, and nodetool status shows uneven load that does not match token ring expectations. When you trace requests, a single partition key consumes a disproportionate share of reads or writes. This is a hot partition: the partitioner mapped one key to a narrow token range, and the replicas owning that range are saturated.
Cassandra distributes data by partition key hash, not by load. When a viral entity, a coarse time-series bucket, or a poorly sharded status flag becomes popular, all traffic for that logical record lands on the same physical nodes. The affected replicas face elevated CPU, I/O, and GC pressure. If speculative retry is configured, the coordinator fans out requests to additional replicas when the primary ones stall, spreading saturation instead of isolating it. The rest of the ring stays healthy, so cluster-aggregated dashboards often miss hot partitions until clients time out.
The damage is not always partition size. A hot partition can be small but heavily trafficked, or both large and hot. Either way, the replica set serving it becomes the bottleneck for the request path. This guide shows how to confirm the pattern, find the key, and stop the cascade.
What this means
A hot partition is an access-pattern skew problem masked as a performance problem. The coordinator sends requests to the replicas that own the partition’s token, so a single hot key creates a concentrated blast radius on those nodes. Even at consistency level ONE, the coordinator must contact a replica for that token. If that replica is saturated, local read latency rises from queueing, context switching, and GC. Coordinator latency rises because it waits for the slowest required replica.
If the table uses speculative retry, the coordinator fires parallel requests to additional replicas when the first response breaches a latency threshold. This improves perceived tail latency under normal conditions, but under hot-partition saturation it turns a single-replica bottleneck into a multi-replica CPU and I/O storm. The result is a latency tail that degrades the client experience even though most partitions in the table are fast.
Hot partitions also compound with large partition pathology. When a partition grows beyond tens of megabytes, reading it requires more heap for merge-sort across SSTables and more disk I/O for index traversal. A partition that is both large and hot consumes multiple resources simultaneously, making it harder for the replica to recover.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Skewed access on a single partition key | One or two nodes show high CPU and disk await while peers are calm; nodetool toppartitions shows a single key dominating traffic | nodetool toppartitions on the suspect table |
| Oversized partition | High local read latency on the replica, GC pressure, and compaction stalls for the table | nodetool tablehistograms for per-partition size percentiles |
| Speculative retry amplifying load | Elevated SpeculativeRetries counter and load spreading from one replica to many | nodetool tablestats or JMX speculative retry rate per table |
| Application thundering herd | A coordinated spike from many clients targeting the same key simultaneously | Client request rate versus rolling baseline |
| Poor bucketing in time-series or queue models | All writes for the current time window or status category land in one partition | Partition key design in the CQL schema |
Quick checks
Run these read-only commands on the affected nodes to confirm the pattern.
# Identify the hottest partition keys on a node
nodetool toppartitions <keyspace> <table> 1000
# Check per-table partition size percentiles
nodetool tablehistograms <keyspace> <table>
# View cluster topology and per-node load skew
nodetool status
# Inspect coordinator-level latency percentiles
nodetool proxyhistograms
# Check heap pressure
nodetool info | grep "Heap Memory"
# Review thread pool backpressure and dropped messages
nodetool tpstats
# Check for compaction or flush bottlenecks
nodetool compactionstats
# Review speculative retry activity per table
nodetool tablestats <keyspace> <table> | grep -i speculative
How to diagnose it
- Confirm replica isolation. Use
nodetool statusand per-node OS metrics to verify that only a subset of nodes is saturated. If all nodes are equally loaded, the problem is cluster-wide, not a hot partition. - Find the hot partition. Run
nodetool toppartitionson the affected replicas. Look for one partition key or a small set of keys that account for a disproportionate percentage of reads or writes. - Check partition size. Run
nodetool tablehistogramson the table. If the P99 or maximum partition size is above 10 MB, or if any partition exceeds 100 MB, the partition is oversized as well as hot. Large partitions amplify GC and compaction cost. - Correlate with speculative retries. Check the
SpeculativeRetriesmetric via JMX ornodetool tablestats. If speculative retries are elevated, the coordinator is multiplying requests and loading additional replicas. - Validate client traffic. Compare the current request rate for the table against your baseline. A sudden step change suggests a thundering herd or a trending key.
- Inspect replica saturation signals. On the hot replicas, check
nodetool tpstatsfor pending tasks inReadStageorMutationStage, and review GC logs for long pauses. These confirm the node is struggling to keep up. - Review the data model. Examine the partition key structure. Unbounded time buckets, single-row global counters, and queue-style patterns are common culprits.
flowchart TD
A[Single partition key receives disproportionate traffic] --> B[Replicas owning the token range saturate]
B --> C[Coordinator latency spikes on reads and writes]
C --> D{Speculative retry enabled?}
D -->|Yes| E[Coordinator fans out to additional replicas]
E --> F[CPU and I/O pressure spreads across the replica set]
D -->|No| G[Requests timeout or queue on the slow replica]
F --> H[Dropped messages and thread pool backpressure]
G --> H
H --> I[P99 latency tail degrades cluster-wide]Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Per-node request rate | Hot partitions create asymmetric load | One node handling more than 2x the cluster median traffic |
| Coordinator read/write latency (p99/p999) | Tail latency reflects replica saturation | P99 sustained above 3x baseline or approaching client timeout |
| Speculative retry rate | Indicates slow replicas and multiplies load | Greater than 5% of reads for a table |
| Partition size percentiles | Oversized partitions amplify GC and I/O | P99 partition size above 10 MB, or any partition near 100 MB |
| Thread pool pending tasks (Read/Mutation) | Saturation on hot replicas | Pending tasks greater than 0 sustained for more than 60 seconds |
| Dropped messages | The replica is shedding load | Any sustained non-zero rate of dropped reads or mutations |
| SSTable count per table | Read amplification grows when compaction lags | Greater than 50 for STCS, or L0 greater than 32 for LCS |
| GC pause duration | Hot partition reads can trigger heap pressure | Pauses greater than 500 ms sustained |
Fixes
Reduce load on the hot partition
If the application can tolerate it, throttle or cache requests for the hot key at the client layer. Deduplicating concurrent requests for the same partition key before they reach Cassandra cuts replica load significantly. This is often the fastest way to stop the bleeding without a schema change.
Tune speculative retry
If the table uses speculative retry and you see a high speculative retry rate, reduce or disable it temporarily. Under hot-partition saturation, speculative retry fans out requests to additional replicas, turning a localized bottleneck into cluster-wide CPU and I/O pressure. The tradeoff is slightly higher tail latency for affected reads until the root cause is fixed.
Redesign the partition key with bucketing
Long-term, fix the data model so the hot logical entity spreads across multiple partition keys. Append a deterministic hash or a time shard to the partition key to break one logical stream into many physical partitions. For example, a time-series table that buckets by hour can be redesigned to use a composite key with a sub-hour shard. The tradeoff is that reads must query multiple partitions and merge results client-side, so choose a bucket count that balances write spread against read complexity.
Bound partition size
If nodetool tablehistograms shows the hot partition is also oversized, split the data model to enforce a hard size ceiling. Avoid unbounded collections or wide rows inside a single partition. If you cannot change the schema immediately, purge old data or offload historical rows to reduce the partition footprint.
Prevention
- Monitor per-node load skew, not just cluster aggregates. Aggregate metrics hide hot partitions because the healthy majority dilutes the signal. Alert when any node’s request rate or CPU deviates from the median by more than a set threshold.
- Alert on partition size percentiles and speculative retry rate. These are leading indicators that surface modeling problems before the replica set saturates.
- Design partition keys for bounded cardinality. Every partition key should include a natural shard or bucket that prevents unbounded growth or traffic concentration.
- Review schemas for anti-patterns. Global counters, queue-style tables with delete-heavy access, and coarse time buckets are frequent sources of hot partitions.
How Netdata helps
- Per-node Cassandra JMX metrics expose load skew that cluster-wide averages hide.
- Correlate elevated
SpeculativeRetrieswith per-node CPU utilization and disk latency to identify slow replicas and coordinator amplification. - Track per-table coordinator latency percentiles alongside OS-level metrics such as disk
awaitand CPUiowaitto distinguish hot-partition saturation from generic cluster overload. - Alert on per-node request rate deviations from the cluster median to catch asymmetric load before P99 latency spikes.
Related guides
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS







