$ guides / cassandra / cassandra-large-partition ▌

Operations Guides

Cassandra large partition pathology: Compacting large partition warnings and reads

If system.log prints Compacting large partition ks/table:key (N bytes) and reads to the affected table are spiking, a single partition has crossed compaction_large_partition_warning_threshold_mb (default 100 MB). Oversized partitions force the node to deserialize a large in-memory index on reads, rewrite the entire partition during compaction, and move it atomically during streaming or repair. Left alone, the partition grows until it triggers GC pauses, gossip flapping, and cascading retries. Use this guide to confirm the offending key, measure the blast radius, and choose between immediate relief and a data-model fix.

What this means

A Cassandra partition is the set of rows that share the same partition key. When a partition grows past the threshold, three subsystems suffer:

Reads. Cassandra deserializes the partition’s IndexInfo into heap to locate rows. A large partition creates many IndexInfo objects and buffers. Modern Cassandra handles this better than 2.x did, but oversized partitions still produce GC pressure proportional to their width.
Compaction. Compacting a large partition reads and rewrites the whole partition. It consumes compaction threads and disk I/O, and can produce oversized SSTables that break the invariants of compaction strategies such as LCS.
Streaming and repair. Large partitions stream atomically. A single oversized partition can stall or fail streaming sessions during bootstrap, decommission, or repair.

The warning itself is harmless, but the partition rarely shrinks without intervention.

flowchart TD
    A[Large partition read] --> B[Deserialize IndexInfo into heap]
    B --> C[GC pressure and long pauses]
    C --> D[Gossip marks node DOWN]
    D --> E[Client retries and hint replay]
    E --> F[More heap pressure]
    F --> C
    B --> G[Read latency spikes]
    H[Large partition compaction] --> I[Slow oversized compaction]
    I --> J[Compaction backlog]
    J --> K[Read amplification rises]
    K --> G

Common causes

Cause	What it looks like	First thing to check
Unbounded time-series partition	Partition key is only `device_id` or `user_id`, with time as a clustering column; partitions accumulate indefinitely	`nodetool tablehistograms` Partition Size percentiles
Unbounded collection growth	Table uses `list`, `set`, or `map` columns that append without bound	Per-table max partition size and cell count from `tablestats` / `tablehistograms`
LCS with large partitions	SSTables grow past level targets and L0/L1 compaction stalls	`nodetool compactionstats` plus SSTable sizes in the data directory
Streaming or repair stall	Bootstrap, decommission, or rebuild hangs while transferring one key	`nodetool netstats` streaming progress
Denormalized aggregate without bucketing	All events for an entity accumulate in one partition	`nodetool toppartitions` for hot keys

Quick checks

Run these safe, read-only checks to scope the problem.

# Confirm the warning and identify affected tables
grep "Compacting large partition" /var/log/cassandra/system.log

# Partition size distribution for the affected table
nodetool tablehistograms <keyspace> <table>

# Compacted partition maximum and mean bytes
nodetool tablestats <keyspace>.<table>

# Currently running compactions and pending tasks
nodetool compactionstats

# Hot partitions being read or written right now (duration in ms)
nodetool toppartitions <keyspace> <table> 1000

# Recent GC pause behavior
grep -i "pause" /var/log/cassandra/gc.log* | tail -20

# Thread pool saturation that can accompany large-partition pressure
nodetool tpstats

On Cassandra 4.0 and later, run sstablepartitions against the relevant -Data.db files offline. It is non-destructive but I/O intensive.

How to diagnose it

Confirm the warning in logs. Note the keyspace, table, and partition key from the log line.
Measure partition size distribution. Use nodetool tablehistograms and watch the Partition Size column. If the maximum or p99 is above 100 MB and growing, you have an active data-model problem.
Find the offending keys. Use nodetool toppartitions during the incident, or sstablepartitions offline, to identify the exact partition keys.
Correlate with the read path. Check GC logs for long pauses, ClientRequest Read Latency p99/p999, and speculative retry rates. Large partitions usually show up as extreme tail latency with a stable p50.
Correlate with compaction. Check PendingCompactions, BytesCompacted, and whether any running compaction is making slow progress on a very large SSTable.
Check SSTable count. If SSTables per table are also growing, the large partition is contributing to a wider compaction backlog.
Decide severity. Partitions above 100 MB with sustained warnings warrant a ticket. Partitions above roughly 1 GB that correlate with GC pauses or streaming stalls should be paged and mitigated immediately.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Partition Size from `nodetool tablehistograms`	Direct measure of partition width	p99 or max above 100 MB and rising
`EstimatedPartitionSizeHistogram`	Tracks partition size distribution over time	Rightward shift or high max bucket
`Compacting large partition` log rate	Frequency of large partition events	Sustained or increasing warnings
GC pause duration	Heap pressure from `IndexInfo` deserialization	Old-gen pause above 1 second, or young pause above 200 ms
Read latency p99/p999	Client-visible impact	Sustained value above 3x baseline
Pending compactions	Compaction falling behind	Trending upward over hours
SSTable count per table	Read amplification	Above the expected value for the compaction strategy
Speculative retries	Masked slow reads from large partitions	Above 5 percent of reads

Fixes

Immediate relief

If the incident is active, reduce pressure before planning the model change.

Throttle or block the offending query at the application layer. A single repeated large-partition read can dominate heap.
Temporarily raise compaction throughput with nodetool setcompactionthroughput <MB_per_second> so the oversized SSTable finishes. Tradeoff: compaction will starve reads of I/O.
Do not restart the node as a fix. A restart may transiently clear heap, but the partition will trigger the same pressure as soon as reads resume.

Reduce read-side pressure

Add application caching for hot partitions so Cassandra stops repeatedly materializing them.
Consider raising column_index_size_in_kb. A larger value creates fewer, wider IndexInfo blocks, which reduces in-memory index size at the cost of slightly more I/O per read. Existing SSTables keep their old index until rewritten; schedule nodetool upgradesstables if you need it applied cluster-wide.

Fix the data model

The only durable fix is to bound partition size.

Remodel the partition key to include a bucketing dimension. Examples: (device_id, yyyyMMdd), (tenant_id, bucket), or a hash prefix that splits a hot entity across multiple partitions.
For time-series data, use TimeWindowCompactionStrategy with TTL, and ensure the partition key includes a time bucket narrow enough to keep partitions small.
Avoid unbounded collections. Replace lists, sets, or maps that grow forever with a separate table where each item is its own partition, or use frozen collections only when the cardinality is strictly bounded.
Review compaction strategy. LCS is particularly punishing for large partitions because a single large partition forces creation of oversized SSTables beyond the normal size target. If the workload produces unavoidable large partitions, evaluate whether STCS or UCS is more appropriate, but understand that no compaction strategy fixes an unbounded partition key.

Long-term guardrails

On Cassandra 4.1+, use the guardrails framework to set hard or soft limits on partition size and reject pathological writes at the cluster boundary.
Include sstablepartitions checks in schema-review CI so oversized partitions are caught before they reach production.

Prevention

Monitor partition size histograms weekly, not just during incidents.
Set application-level limits on rows per partition and collection cardinality.
Design partition keys with a natural upper bound. Keep the vast majority of partitions under 10 MB and all partitions under 100 MB.
Do not use Cassandra as a queue or an unbounded event log per partition.
Review new tables for partition key cardinality before deploying them.

How Netdata helps

Netdata correlates the signals that large partitions produce so you can separate partition pathology from generic load:

Chart JVM heap usage and GC pause duration on the same timeline to confirm that read spikes are causing memory pressure.
Overlay ClientRequest Read Latency p99 with pending compactions to see whether the large partition is also backing up compaction.
Track per-node disk I/O and SSTable counts to rule out hardware saturation before blaming the data model.
Monitor EstimatedPartitionSizeHistogram over time to catch partition growth before the warning log appears.
Alert on sustained Compacting large partition log patterns if log monitoring is configured.

The Netdata solution

Cassandra monitoring with Netdata

Netdata monitors Apache Cassandra with per-second metrics and automatic dashboards. Correlate GC pauses, compaction backlog, tombstone rates, pending hints, and disk usage across nodes to catch a creeping cluster before it tips over.

See Cassandra monitoring → Start monitoring free

Cassandra large partition pathology: Compacting large partition warnings and reads

Cassandra large partition pathology: Compacting large partition warnings and reads

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Immediate relief

Reduce read-side pressure

Fix the data model

Long-term guardrails

Prevention

How Netdata helps

Related guides

Cassandra monitoring with Netdata