$ guides / cassandra / cassandra-hot-partition ▌

Operations Guides

Cassandra hot partition: when one key saturates a replica set

One or two nodes run hot while the rest of the cluster idles. Client P99 latency doubles or triples, but the average looks fine. Timeouts cluster on a subset of hosts, and nodetool status shows uneven load that does not match token ring expectations. When you trace requests, a single partition key consumes a disproportionate share of reads or writes. This is a hot partition: the partitioner mapped one key to a narrow token range, and the replicas owning that range are saturated.

Cassandra distributes data by partition key hash, not by load. When a viral entity, a coarse time-series bucket, or a poorly sharded status flag becomes popular, all traffic for that logical record lands on the same physical nodes. The affected replicas face elevated CPU, I/O, and GC pressure. If speculative retry is configured, the coordinator fans out requests to additional replicas when the primary ones stall, spreading saturation instead of isolating it. The rest of the ring stays healthy, so cluster-aggregated dashboards often miss hot partitions until clients time out.

The damage is not always partition size. A hot partition can be small but heavily trafficked, or both large and hot. Either way, the replica set serving it becomes the bottleneck for the request path. This guide shows how to confirm the pattern, find the key, and stop the cascade.

What this means

A hot partition is an access-pattern skew problem masked as a performance problem. The coordinator sends requests to the replicas that own the partition’s token, so a single hot key creates a concentrated blast radius on those nodes. Even at consistency level ONE, the coordinator must contact a replica for that token. If that replica is saturated, local read latency rises from queueing, context switching, and GC. Coordinator latency rises because it waits for the slowest required replica.

If the table uses speculative retry, the coordinator fires parallel requests to additional replicas when the first response breaches a latency threshold. This improves perceived tail latency under normal conditions, but under hot-partition saturation it turns a single-replica bottleneck into a multi-replica CPU and I/O storm. The result is a latency tail that degrades the client experience even though most partitions in the table are fast.

Hot partitions also compound with large partition pathology. When a partition grows beyond tens of megabytes, reading it requires more heap for merge-sort across SSTables and more disk I/O for index traversal. A partition that is both large and hot consumes multiple resources simultaneously, making it harder for the replica to recover.

Common causes

Cause	What it looks like	First thing to check
Skewed access on a single partition key	One or two nodes show high CPU and disk await while peers are calm; `nodetool toppartitions` shows a single key dominating traffic	`nodetool toppartitions` on the suspect table
Oversized partition	High local read latency on the replica, GC pressure, and compaction stalls for the table	`nodetool tablehistograms` for per-partition size percentiles
Speculative retry amplifying load	Elevated `SpeculativeRetries` counter and load spreading from one replica to many	`nodetool tablestats` or JMX speculative retry rate per table
Application thundering herd	A coordinated spike from many clients targeting the same key simultaneously	Client request rate versus rolling baseline
Poor bucketing in time-series or queue models	All writes for the current time window or status category land in one partition	Partition key design in the CQL schema

Quick checks

Run these read-only commands on the affected nodes to confirm the pattern.

# Identify the hottest partition keys on a node
nodetool toppartitions <keyspace> <table> 1000

# Check per-table partition size percentiles
nodetool tablehistograms <keyspace> <table>

# View cluster topology and per-node load skew
nodetool status

# Inspect coordinator-level latency percentiles
nodetool proxyhistograms

# Check heap pressure
nodetool info | grep "Heap Memory"

# Review thread pool backpressure and dropped messages
nodetool tpstats

# Check for compaction or flush bottlenecks
nodetool compactionstats

# Review speculative retry activity per table
nodetool tablestats <keyspace> <table> | grep -i speculative

How to diagnose it

Confirm replica isolation. Use nodetool status and per-node OS metrics to verify that only a subset of nodes is saturated. If all nodes are equally loaded, the problem is cluster-wide, not a hot partition.
Find the hot partition. Run nodetool toppartitions on the affected replicas. Look for one partition key or a small set of keys that account for a disproportionate percentage of reads or writes.
Check partition size. Run nodetool tablehistograms on the table. If the P99 or maximum partition size is above 10 MB, or if any partition exceeds 100 MB, the partition is oversized as well as hot. Large partitions amplify GC and compaction cost.
Correlate with speculative retries. Check the SpeculativeRetries metric via JMX or nodetool tablestats. If speculative retries are elevated, the coordinator is multiplying requests and loading additional replicas.
Validate client traffic. Compare the current request rate for the table against your baseline. A sudden step change suggests a thundering herd or a trending key.
Inspect replica saturation signals. On the hot replicas, check nodetool tpstats for pending tasks in ReadStage or MutationStage, and review GC logs for long pauses. These confirm the node is struggling to keep up.
Review the data model. Examine the partition key structure. Unbounded time buckets, single-row global counters, and queue-style patterns are common culprits.

flowchart TD
    A[Single partition key receives disproportionate traffic] --> B[Replicas owning the token range saturate]
    B --> C[Coordinator latency spikes on reads and writes]
    C --> D{Speculative retry enabled?}
    D -->|Yes| E[Coordinator fans out to additional replicas]
    E --> F[CPU and I/O pressure spreads across the replica set]
    D -->|No| G[Requests timeout or queue on the slow replica]
    F --> H[Dropped messages and thread pool backpressure]
    G --> H
    H --> I[P99 latency tail degrades cluster-wide]

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Per-node request rate	Hot partitions create asymmetric load	One node handling more than 2x the cluster median traffic
Coordinator read/write latency (p99/p999)	Tail latency reflects replica saturation	P99 sustained above 3x baseline or approaching client timeout
Speculative retry rate	Indicates slow replicas and multiplies load	Greater than 5% of reads for a table
Partition size percentiles	Oversized partitions amplify GC and I/O	P99 partition size above 10 MB, or any partition near 100 MB
Thread pool pending tasks (Read/Mutation)	Saturation on hot replicas	Pending tasks greater than 0 sustained for more than 60 seconds
Dropped messages	The replica is shedding load	Any sustained non-zero rate of dropped reads or mutations
SSTable count per table	Read amplification grows when compaction lags	Greater than 50 for STCS, or L0 greater than 32 for LCS
GC pause duration	Hot partition reads can trigger heap pressure	Pauses greater than 500 ms sustained

Fixes

Reduce load on the hot partition

If the application can tolerate it, throttle or cache requests for the hot key at the client layer. Deduplicating concurrent requests for the same partition key before they reach Cassandra cuts replica load significantly. This is often the fastest way to stop the bleeding without a schema change.

Tune speculative retry

If the table uses speculative retry and you see a high speculative retry rate, reduce or disable it temporarily. Under hot-partition saturation, speculative retry fans out requests to additional replicas, turning a localized bottleneck into cluster-wide CPU and I/O pressure. The tradeoff is slightly higher tail latency for affected reads until the root cause is fixed.

Redesign the partition key with bucketing

Long-term, fix the data model so the hot logical entity spreads across multiple partition keys. Append a deterministic hash or a time shard to the partition key to break one logical stream into many physical partitions. For example, a time-series table that buckets by hour can be redesigned to use a composite key with a sub-hour shard. The tradeoff is that reads must query multiple partitions and merge results client-side, so choose a bucket count that balances write spread against read complexity.

Bound partition size

If nodetool tablehistograms shows the hot partition is also oversized, split the data model to enforce a hard size ceiling. Avoid unbounded collections or wide rows inside a single partition. If you cannot change the schema immediately, purge old data or offload historical rows to reduce the partition footprint.

Prevention

Monitor per-node load skew, not just cluster aggregates. Aggregate metrics hide hot partitions because the healthy majority dilutes the signal. Alert when any node’s request rate or CPU deviates from the median by more than a set threshold.
Alert on partition size percentiles and speculative retry rate. These are leading indicators that surface modeling problems before the replica set saturates.
Design partition keys for bounded cardinality. Every partition key should include a natural shard or bucket that prevents unbounded growth or traffic concentration.
Review schemas for anti-patterns. Global counters, queue-style tables with delete-heavy access, and coarse time buckets are frequent sources of hot partitions.

How Netdata helps

Per-node Cassandra JMX metrics expose load skew that cluster-wide averages hide.
Correlate elevated SpeculativeRetries with per-node CPU utilization and disk latency to identify slow replicas and coordinator amplification.
Track per-table coordinator latency percentiles alongside OS-level metrics such as disk await and CPU iowait to distinguish hot-partition saturation from generic cluster overload.
Alert on per-node request rate deviations from the cluster median to catch asymmetric load before P99 latency spikes.

The Netdata solution

Cassandra monitoring with Netdata

Netdata monitors Apache Cassandra with per-second metrics and automatic dashboards. Correlate GC pauses, compaction backlog, tombstone rates, pending hints, and disk usage across nodes to catch a creeping cluster before it tips over.

See Cassandra monitoring → Start monitoring free

Cassandra hot partition: when one key saturates a replica set

Cassandra hot partition: when one key saturates a replica set

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Reduce load on the hot partition

Tune speculative retry

Redesign the partition key with bucketing

Bound partition size

Prevention

How Netdata helps

Related guides

Cassandra monitoring with Netdata