Cassandra large partition pathology: Compacting large partition warnings and reads

If system.log prints Compacting large partition ks/table:key (N bytes) and reads to the affected table are spiking, a single partition has crossed compaction_large_partition_warning_threshold_mb (default 100 MB). Oversized partitions force the node to deserialize a large in-memory index on reads, rewrite the entire partition during compaction, and move it atomically during streaming or repair. Left alone, the partition grows until it triggers GC pauses, gossip flapping, and cascading retries. Use this guide to confirm the offending key, measure the blast radius, and choose between immediate relief and a data-model fix.

What this means

A Cassandra partition is the set of rows that share the same partition key. When a partition grows past the threshold, three subsystems suffer:

  • Reads. Cassandra deserializes the partition’s IndexInfo into heap to locate rows. A large partition creates many IndexInfo objects and buffers. Modern Cassandra handles this better than 2.x did, but oversized partitions still produce GC pressure proportional to their width.
  • Compaction. Compacting a large partition reads and rewrites the whole partition. It consumes compaction threads and disk I/O, and can produce oversized SSTables that break the invariants of compaction strategies such as LCS.
  • Streaming and repair. Large partitions stream atomically. A single oversized partition can stall or fail streaming sessions during bootstrap, decommission, or repair.

The warning itself is harmless, but the partition rarely shrinks without intervention.

flowchart TD
    A[Large partition read] --> B[Deserialize IndexInfo into heap]
    B --> C[GC pressure and long pauses]
    C --> D[Gossip marks node DOWN]
    D --> E[Client retries and hint replay]
    E --> F[More heap pressure]
    F --> C
    B --> G[Read latency spikes]
    H[Large partition compaction] --> I[Slow oversized compaction]
    I --> J[Compaction backlog]
    J --> K[Read amplification rises]
    K --> G

Common causes

CauseWhat it looks likeFirst thing to check
Unbounded time-series partitionPartition key is only device_id or user_id, with time as a clustering column; partitions accumulate indefinitelynodetool tablehistograms Partition Size percentiles
Unbounded collection growthTable uses list, set, or map columns that append without boundPer-table max partition size and cell count from tablestats / tablehistograms
LCS with large partitionsSSTables grow past level targets and L0/L1 compaction stallsnodetool compactionstats plus SSTable sizes in the data directory
Streaming or repair stallBootstrap, decommission, or rebuild hangs while transferring one keynodetool netstats streaming progress
Denormalized aggregate without bucketingAll events for an entity accumulate in one partitionnodetool toppartitions for hot keys

Quick checks

Run these safe, read-only checks to scope the problem.

# Confirm the warning and identify affected tables
grep "Compacting large partition" /var/log/cassandra/system.log

# Partition size distribution for the affected table
nodetool tablehistograms <keyspace> <table>

# Compacted partition maximum and mean bytes
nodetool tablestats <keyspace>.<table>

# Currently running compactions and pending tasks
nodetool compactionstats

# Hot partitions being read or written right now (duration in ms)
nodetool toppartitions <keyspace> <table> 1000

# Recent GC pause behavior
grep -i "pause" /var/log/cassandra/gc.log* | tail -20

# Thread pool saturation that can accompany large-partition pressure
nodetool tpstats

On Cassandra 4.0 and later, run sstablepartitions against the relevant -Data.db files offline. It is non-destructive but I/O intensive.

How to diagnose it

  1. Confirm the warning in logs. Note the keyspace, table, and partition key from the log line.
  2. Measure partition size distribution. Use nodetool tablehistograms and watch the Partition Size column. If the maximum or p99 is above 100 MB and growing, you have an active data-model problem.
  3. Find the offending keys. Use nodetool toppartitions during the incident, or sstablepartitions offline, to identify the exact partition keys.
  4. Correlate with the read path. Check GC logs for long pauses, ClientRequest Read Latency p99/p999, and speculative retry rates. Large partitions usually show up as extreme tail latency with a stable p50.
  5. Correlate with compaction. Check PendingCompactions, BytesCompacted, and whether any running compaction is making slow progress on a very large SSTable.
  6. Check SSTable count. If SSTables per table are also growing, the large partition is contributing to a wider compaction backlog.
  7. Decide severity. Partitions above 100 MB with sustained warnings warrant a ticket. Partitions above roughly 1 GB that correlate with GC pauses or streaming stalls should be paged and mitigated immediately.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Partition Size from nodetool tablehistogramsDirect measure of partition widthp99 or max above 100 MB and rising
EstimatedPartitionSizeHistogramTracks partition size distribution over timeRightward shift or high max bucket
Compacting large partition log rateFrequency of large partition eventsSustained or increasing warnings
GC pause durationHeap pressure from IndexInfo deserializationOld-gen pause above 1 second, or young pause above 200 ms
Read latency p99/p999Client-visible impactSustained value above 3x baseline
Pending compactionsCompaction falling behindTrending upward over hours
SSTable count per tableRead amplificationAbove the expected value for the compaction strategy
Speculative retriesMasked slow reads from large partitionsAbove 5 percent of reads

Fixes

Immediate relief

If the incident is active, reduce pressure before planning the model change.

  • Throttle or block the offending query at the application layer. A single repeated large-partition read can dominate heap.
  • Temporarily raise compaction throughput with nodetool setcompactionthroughput <MB_per_second> so the oversized SSTable finishes. Tradeoff: compaction will starve reads of I/O.
  • Do not restart the node as a fix. A restart may transiently clear heap, but the partition will trigger the same pressure as soon as reads resume.

Reduce read-side pressure

  • Add application caching for hot partitions so Cassandra stops repeatedly materializing them.
  • Consider raising column_index_size_in_kb. A larger value creates fewer, wider IndexInfo blocks, which reduces in-memory index size at the cost of slightly more I/O per read. Existing SSTables keep their old index until rewritten; schedule nodetool upgradesstables if you need it applied cluster-wide.

Fix the data model

The only durable fix is to bound partition size.

  • Remodel the partition key to include a bucketing dimension. Examples: (device_id, yyyyMMdd), (tenant_id, bucket), or a hash prefix that splits a hot entity across multiple partitions.
  • For time-series data, use TimeWindowCompactionStrategy with TTL, and ensure the partition key includes a time bucket narrow enough to keep partitions small.
  • Avoid unbounded collections. Replace lists, sets, or maps that grow forever with a separate table where each item is its own partition, or use frozen collections only when the cardinality is strictly bounded.
  • Review compaction strategy. LCS is particularly punishing for large partitions because a single large partition forces creation of oversized SSTables beyond the normal size target. If the workload produces unavoidable large partitions, evaluate whether STCS or UCS is more appropriate, but understand that no compaction strategy fixes an unbounded partition key.

Long-term guardrails

  • On Cassandra 4.1+, use the guardrails framework to set hard or soft limits on partition size and reject pathological writes at the cluster boundary.
  • Include sstablepartitions checks in schema-review CI so oversized partitions are caught before they reach production.

Prevention

  • Monitor partition size histograms weekly, not just during incidents.
  • Set application-level limits on rows per partition and collection cardinality.
  • Design partition keys with a natural upper bound. Keep the vast majority of partitions under 10 MB and all partitions under 100 MB.
  • Do not use Cassandra as a queue or an unbounded event log per partition.
  • Review new tables for partition key cardinality before deploying them.

How Netdata helps

Netdata correlates the signals that large partitions produce so you can separate partition pathology from generic load:

  • Chart JVM heap usage and GC pause duration on the same timeline to confirm that read spikes are causing memory pressure.
  • Overlay ClientRequest Read Latency p99 with pending compactions to see whether the large partition is also backing up compaction.
  • Track per-node disk I/O and SSTable counts to rule out hardware saturation before blaming the data model.
  • Monitor EstimatedPartitionSizeHistogram over time to catch partition growth before the warning log appears.
  • Alert on sustained Compacting large partition log patterns if log monitoring is configured.