Cassandra large partition pathology: Compacting large partition warnings and reads
If system.log prints Compacting large partition ks/table:key (N bytes) and reads to the affected table are spiking, a single partition has crossed compaction_large_partition_warning_threshold_mb (default 100 MB). Oversized partitions force the node to deserialize a large in-memory index on reads, rewrite the entire partition during compaction, and move it atomically during streaming or repair. Left alone, the partition grows until it triggers GC pauses, gossip flapping, and cascading retries. Use this guide to confirm the offending key, measure the blast radius, and choose between immediate relief and a data-model fix.
What this means
A Cassandra partition is the set of rows that share the same partition key. When a partition grows past the threshold, three subsystems suffer:
- Reads. Cassandra deserializes the partition’s
IndexInfointo heap to locate rows. A large partition creates manyIndexInfoobjects and buffers. Modern Cassandra handles this better than 2.x did, but oversized partitions still produce GC pressure proportional to their width. - Compaction. Compacting a large partition reads and rewrites the whole partition. It consumes compaction threads and disk I/O, and can produce oversized SSTables that break the invariants of compaction strategies such as LCS.
- Streaming and repair. Large partitions stream atomically. A single oversized partition can stall or fail streaming sessions during bootstrap, decommission, or repair.
The warning itself is harmless, but the partition rarely shrinks without intervention.
flowchart TD
A[Large partition read] --> B[Deserialize IndexInfo into heap]
B --> C[GC pressure and long pauses]
C --> D[Gossip marks node DOWN]
D --> E[Client retries and hint replay]
E --> F[More heap pressure]
F --> C
B --> G[Read latency spikes]
H[Large partition compaction] --> I[Slow oversized compaction]
I --> J[Compaction backlog]
J --> K[Read amplification rises]
K --> GCommon causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Unbounded time-series partition | Partition key is only device_id or user_id, with time as a clustering column; partitions accumulate indefinitely | nodetool tablehistograms Partition Size percentiles |
| Unbounded collection growth | Table uses list, set, or map columns that append without bound | Per-table max partition size and cell count from tablestats / tablehistograms |
| LCS with large partitions | SSTables grow past level targets and L0/L1 compaction stalls | nodetool compactionstats plus SSTable sizes in the data directory |
| Streaming or repair stall | Bootstrap, decommission, or rebuild hangs while transferring one key | nodetool netstats streaming progress |
| Denormalized aggregate without bucketing | All events for an entity accumulate in one partition | nodetool toppartitions for hot keys |
Quick checks
Run these safe, read-only checks to scope the problem.
# Confirm the warning and identify affected tables
grep "Compacting large partition" /var/log/cassandra/system.log
# Partition size distribution for the affected table
nodetool tablehistograms <keyspace> <table>
# Compacted partition maximum and mean bytes
nodetool tablestats <keyspace>.<table>
# Currently running compactions and pending tasks
nodetool compactionstats
# Hot partitions being read or written right now (duration in ms)
nodetool toppartitions <keyspace> <table> 1000
# Recent GC pause behavior
grep -i "pause" /var/log/cassandra/gc.log* | tail -20
# Thread pool saturation that can accompany large-partition pressure
nodetool tpstats
On Cassandra 4.0 and later, run sstablepartitions against the relevant -Data.db files offline. It is non-destructive but I/O intensive.
How to diagnose it
- Confirm the warning in logs. Note the keyspace, table, and partition key from the log line.
- Measure partition size distribution. Use
nodetool tablehistogramsand watch the Partition Size column. If the maximum or p99 is above 100 MB and growing, you have an active data-model problem. - Find the offending keys. Use
nodetool toppartitionsduring the incident, orsstablepartitionsoffline, to identify the exact partition keys. - Correlate with the read path. Check GC logs for long pauses,
ClientRequest Read Latencyp99/p999, and speculative retry rates. Large partitions usually show up as extreme tail latency with a stable p50. - Correlate with compaction. Check
PendingCompactions,BytesCompacted, and whether any running compaction is making slow progress on a very large SSTable. - Check SSTable count. If SSTables per table are also growing, the large partition is contributing to a wider compaction backlog.
- Decide severity. Partitions above 100 MB with sustained warnings warrant a ticket. Partitions above roughly 1 GB that correlate with GC pauses or streaming stalls should be paged and mitigated immediately.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
Partition Size from nodetool tablehistograms | Direct measure of partition width | p99 or max above 100 MB and rising |
EstimatedPartitionSizeHistogram | Tracks partition size distribution over time | Rightward shift or high max bucket |
Compacting large partition log rate | Frequency of large partition events | Sustained or increasing warnings |
| GC pause duration | Heap pressure from IndexInfo deserialization | Old-gen pause above 1 second, or young pause above 200 ms |
| Read latency p99/p999 | Client-visible impact | Sustained value above 3x baseline |
| Pending compactions | Compaction falling behind | Trending upward over hours |
| SSTable count per table | Read amplification | Above the expected value for the compaction strategy |
| Speculative retries | Masked slow reads from large partitions | Above 5 percent of reads |
Fixes
Immediate relief
If the incident is active, reduce pressure before planning the model change.
- Throttle or block the offending query at the application layer. A single repeated large-partition read can dominate heap.
- Temporarily raise compaction throughput with
nodetool setcompactionthroughput <MB_per_second>so the oversized SSTable finishes. Tradeoff: compaction will starve reads of I/O. - Do not restart the node as a fix. A restart may transiently clear heap, but the partition will trigger the same pressure as soon as reads resume.
Reduce read-side pressure
- Add application caching for hot partitions so Cassandra stops repeatedly materializing them.
- Consider raising
column_index_size_in_kb. A larger value creates fewer, widerIndexInfoblocks, which reduces in-memory index size at the cost of slightly more I/O per read. Existing SSTables keep their old index until rewritten; schedulenodetool upgradesstablesif you need it applied cluster-wide.
Fix the data model
The only durable fix is to bound partition size.
- Remodel the partition key to include a bucketing dimension. Examples:
(device_id, yyyyMMdd),(tenant_id, bucket), or a hash prefix that splits a hot entity across multiple partitions. - For time-series data, use
TimeWindowCompactionStrategywith TTL, and ensure the partition key includes a time bucket narrow enough to keep partitions small. - Avoid unbounded collections. Replace lists, sets, or maps that grow forever with a separate table where each item is its own partition, or use frozen collections only when the cardinality is strictly bounded.
- Review compaction strategy. LCS is particularly punishing for large partitions because a single large partition forces creation of oversized SSTables beyond the normal size target. If the workload produces unavoidable large partitions, evaluate whether STCS or UCS is more appropriate, but understand that no compaction strategy fixes an unbounded partition key.
Long-term guardrails
- On Cassandra 4.1+, use the guardrails framework to set hard or soft limits on partition size and reject pathological writes at the cluster boundary.
- Include
sstablepartitionschecks in schema-review CI so oversized partitions are caught before they reach production.
Prevention
- Monitor partition size histograms weekly, not just during incidents.
- Set application-level limits on rows per partition and collection cardinality.
- Design partition keys with a natural upper bound. Keep the vast majority of partitions under 10 MB and all partitions under 100 MB.
- Do not use Cassandra as a queue or an unbounded event log per partition.
- Review new tables for partition key cardinality before deploying them.
How Netdata helps
Netdata correlates the signals that large partitions produce so you can separate partition pathology from generic load:
- Chart JVM heap usage and GC pause duration on the same timeline to confirm that read spikes are causing memory pressure.
- Overlay
ClientRequest Read Latencyp99 with pending compactions to see whether the large partition is also backing up compaction. - Track per-node disk I/O and SSTable counts to rule out hardware saturation before blaming the data model.
- Monitor
EstimatedPartitionSizeHistogramover time to catch partition growth before the warning log appears. - Alert on sustained
Compacting large partitionlog patterns if log monitoring is configured.
Related guides
- Cassandra adding and removing nodes safely: vnodes, tokens, and cleanup
- Cassandra node stuck in joining (UJ): bootstrap diagnosis
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra clock skew: how NTP drift silently corrupts data
- Cassandra commitlog disk full: segment exhaustion and forced flushes
- Cassandra commitlog pending tasks: write-path I/O pressure
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped







