Cassandra compaction stuck: large partitions blocking a compaction thread
nodetool compactionstats shows a compaction on one table that has not moved past the same byte offset for hours. The progress percentage is frozen, the pending queue behind it is growing, and read latency on that table is creeping up. This is not a slow disk. A single large partition has monopolized a compaction thread, turning background maintenance into a bottleneck that threatens node stability.
While large SSTables take time to compact, there is a difference between slow progress and no progress. When a partition grows unbounded, the compaction thread spends disproportionate time merging it across multiple SSTables. The rest of the compaction queue stalls, SSTables accumulate, and tombstones cannot be purged until every SSTable containing that partition is included in the same compaction job.
What this means
Cassandra compacts SSTables using a finite thread pool. Each job reads input SSTables sequentially and writes a new merged SSTable. When a partition within those SSTables exceeds tens or hundreds of megabytes, the merge cost for that single partition dominates the job. The thread appears stuck at the same byte offset because it is processing one enormous partition while producing relatively few output bytes.
This differs from generic compaction debt. In a compaction death spiral, backlog grows because write rate exceeds overall compaction throughput. Here, throughput collapses because one thread is blocked by a single partition. The node may still have idle compaction threads, but the job containing the large partition holds up the SSTables it touched. Tombstones cannot be dropped until every SSTable containing older data for that partition is included in the same compaction job. If the partition is so large that the job cannot finish, tombstones are effectively trapped, and disk space is not reclaimed.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Single large partition (> 100 MB) | nodetool compactionstats shows the same completed bytes for hours; system logs contain “Compacting large partition” warnings | nodetool tablehistograms and nodetool tablestats for partition size outliers |
| Many moderately large partitions in one SSTable | Progress moves slowly but does advance; compactionhistory shows duration outliers for this table | nodetool compactionhistory for abnormally long past compactions on the same table |
| I/O throttling or cloud storage bottleneck | All compactions are slow, not just one; disk await is high but CPU is low | iostat -x and the configured compaction throughput limit |
| Tombstone-heavy large partition | Partition spans many SSTables and compaction cannot drop tombstones because not all SSTables are in the same job | Repair status and tombstone scan warnings in logs |
Quick checks
Run these read-only commands to confirm the symptom and scope the impact.
# Check active compactions and verify the byte offset is not moving
nodetool compactionstats
# Compare against historical compaction durations for this table
nodetool compactionhistory
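# Look for large-partition warnings that correlate with the stuck compaction
# (wording varies by version: older releases log "Compacting large partition", 3.0 and later log "Writing large partition")
grep -iE "(Compacting|Writing) large partition" /var/log/cassandra/system.log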
# Inspect partition size distribution and SSTable count for the affected table
nodetool tablehistograms <keyspace> <table>
nodetool tablestats <keyspace> <table>
# Sample the most active partitions over a 1000 ms window (hot partitions are often, but not always, the largest)
nodetool toppartitions <keyspace> <table> 1000
# Check if compaction threads are saturated or if other pools are affected
nodetool tpstats
# Verify whether the bottleneck is disk I/O or CPU/merge overhead
iostat -x 1
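When the log check above returns many lines, it can help to see which tables the warnings point at. The following is a minimal sketch, not an official tool: the function name is made up for illustration, and it assumes the warning text contains "large partition <keyspace>/<table>:<key>", which may not hold on every Cassandra version.

```shell
# Count large-partition warnings per table from log lines on stdin.
# Assumption: each warning contains "large partition <keyspace>/<table>:<key>".
# Prints "<count> <keyspace>/<table>", most frequent first.
large_partition_tables() {
  grep -oiE 'large partition [^ :]+' \
    | awk '{ count[$3]++ } END { for (t in count) print count[t], t }' \
    | sort -k1,1nr -k2,2
}

# Usage (log path taken from the check above):
#   large_partition_tables < /var/log/cassandra/system.log
```

A table that dominates this list and also appears in the frozen compactionstats row is the most likely culprit.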
How to diagnose it
Follow these steps before stopping anything. Large SSTables can spend extended time in validation and merge phases, so verify the freeze before intervening.
1. Confirm the compaction is truly stuck. Sample nodetool compactionstats twice, 10 to 15 minutes apart. If the completed bytes and progress percentage are identical, the thread is blocked.
2. Capture the compaction UUID, keyspace, and table. You will need the UUID if you decide to stop a specific compaction later.
3. Check nodetool compactionhistory for this table. If past compactions of similar total size completed in minutes but this one has run for hours, the job is pathological.
4. Search system logs for large partition warnings. Log lines referencing “Compacting large partition” pointing to the same table confirm the diagnosis.
5. Inspect partition sizes. Use nodetool tablehistograms to see the maximum partition size and distribution. If the max is approaching or exceeding 100 MB, you have found the culprit.
6. Check thread pool health. nodetool tpstats shows whether the CompactionExecutor has pending tasks building behind the active stuck job. If other request-stage pools also show pending tasks, the node is under broader pressure.
7. Correlate with disk metrics. Run iostat -x. If disk utilization is low and await is normal, the bottleneck is CPU or memory pressure from the merge itself, not storage.
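The first check above (sampling compactionstats twice and comparing) can be scripted. This is a sketch under one assumption: the compactionstats row for a job contains its compaction ID, so an unchanged row means unchanged completed bytes and progress. The function name and the temporary file paths are illustrative only.

```shell
# Return 0 (frozen) when the row for one compaction ID is identical
# in two saved `nodetool compactionstats` snapshots, 1 otherwise.
# Also returns 1 when the ID does not appear in the first snapshot.
compaction_frozen() {
  cf_id=$1
  cf_first=$2
  cf_second=$3
  cf_row_a=$(grep -F -- "$cf_id" "$cf_first" || true)
  cf_row_b=$(grep -F -- "$cf_id" "$cf_second" || true)
  [ -n "$cf_row_a" ] && [ "$cf_row_a" = "$cf_row_b" ]
}

# Usage:
#   nodetool compactionstats > /tmp/cs1.txt
#   sleep 900
#   nodetool compactionstats > /tmp/cs2.txt
#   compaction_frozen <uuid> /tmp/cs1.txt /tmp/cs2.txt && echo "no progress in 15 minutes"
```

Comparing only the row for the job in question avoids false negatives from lines that change for unrelated reasons, such as the pending task count.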
flowchart TD
A[Compaction frozen at same offset] --> B[Check logs for large partition warnings]
B --> C{Warnings found?}
C -->|Yes| D[Large partition blocking thread]
C -->|No| E[Check compactionhistory for duration outliers]
E --> F{Outliers match?}
F -->|Yes| G[Slow but valid large SSTable]
F -->|No| H[Check disk I/O and thread pools]
H --> I{I/O saturated?}
I -->|Yes| J[Storage bottleneck]
I -->|No| K[Investigate other causes]
D --> L[Assess partition size via tablehistograms]
L --> M[Stop compaction or let finish]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Pending compactions | A growing queue while one job is stuck means the backlog is compounding | PendingTasks trending upward over hours |
| SSTable count per table | Each flush adds an SSTable; without compaction, read amplification rises | LiveSSTableCount growing on the affected table |
| Partition size distribution | Partitions over 100 MB can cause OOM and compaction stalls | Max partition size in tablehistograms approaching 100 MB |
| CompactionExecutor pending | Shows compaction thread saturation directly | Pending tasks accumulating while active count is frozen |
| Disk I/O utilization | Distinguishes partition merge overhead from storage saturation | %util > 80% or await > 50 ms sustained |
| JVM heap usage after GC | Large partition merges allocate heavily in heap | Heap floor after old GC climbing above 75% of max |
Fixes
Stop a runaway compaction
If the compaction has been frozen for hours and pending tasks are building, stop it. Partial merge work is discarded, so the input SSTables must be re-compacted later. This provides immediate relief to the thread pool but does not eliminate the underlying large partition. You will still need to fix the data model to prevent the same stall on the next compaction.
# Stop ALL running compactions. Partial work is discarded.
nodetool stop COMPACTION
# Or stop only the stuck compaction by UUID from compactionstats
nodetool stop -id <uuid>
WARNING: nodetool stop COMPACTION without a UUID stops every active compaction. Do not treat this as a routine fix.
Do not disable auto-compaction to prevent recurrence. Disabling auto-compaction causes SSTable count to grow and read performance to degrade. It is only a temporary emergency measure.
Adjust compaction throughput
If the compaction is slow but progressing, and disk I/O headroom exists, temporarily raise the compaction throughput ceiling. This helps compaction catch up but will increase I/O contention with reads.
# View current limit (MB/s)
nodetool getcompactionthroughput
# Raise temporarily during a low-traffic window; revert afterward
nodetool setcompactionthroughput 128
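Reverting is easier if the previous limit is recorded before it is raised. The helper below is a sketch: the function name is made up, and it assumes the first output line mentioning "throughput" carries the limit as its first integer, which may differ between Cassandra versions.

```shell
# Extract the numeric limit from `nodetool getcompactionthroughput` output on stdin.
# Assumption: the first line containing "throughput" holds the limit as its first integer.
parse_throughput() {
  awk '/throughput/ {
         if (match($0, /[0-9]+/)) {
           print substr($0, RSTART, RLENGTH)
           exit
         }
       }'
}

# Usage:
#   previous=$(nodetool getcompactionthroughput | parse_throughput)
#   nodetool setcompactionthroughput 128
#   ... wait for the backlog to drain ...
#   nodetool setcompactionthroughput "$previous"
```

Check that the captured value looks sane before relying on it for the revert.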
Force a targeted compaction after cleanup
After you stop writes to the large partition via application changes, you may want to force a major compaction to consolidate existing SSTables and purge tombstones.
# WARNING: extremely I/O intensive and may temporarily double disk usage with STCS
nodetool compact <keyspace> <table>
Only run this after the data model is fixed and during a maintenance window. With STCS, major compaction requires temporary space roughly equal to the table size.
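Since a major compaction under STCS can need temporary space roughly equal to the table size, a free-space check before running it is a cheap safeguard. The following is a sketch with a hypothetical function name; the data directory path in the usage comment is an example and depends on your installation.

```shell
# Succeed only when free space is at least the table's on-disk size.
# Both arguments are sizes in KiB.
has_compaction_headroom() {
  hc_table_kib=$1
  hc_free_kib=$2
  [ "$hc_free_kib" -ge "$hc_table_kib" ]
}

# Usage (example path; adjust to your data_file_directories setting):
#   dir=/var/lib/cassandra/data/<keyspace>
#   table_kib=$(du -sk "$dir"/<table>-* | awk '{ s += $1 } END { print s }')
#   free_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
#   has_compaction_headroom "$table_kib" "$free_kib" && nodetool compact <keyspace> <table>
```

The du figure includes any snapshots stored under the table directory, so the check errs on the conservative side.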
Fix the data model
The only permanent fix is to prevent unbounded partition growth.
- Keep partitions under 10 MB where possible. Treat 100 MB as a practical ceiling: beyond it, heap pressure and OOM during compaction become likely.
- Redesign the partition key to add cardinality. For time-series workloads, include a time bucket in the partition key so no single partition grows forever.
- Avoid using Cassandra as a queue. Queue patterns with write-read-delete cycles create partitions that grow until they are unmanageable.
- If the table uses a legacy compaction strategy poorly suited to the access pattern, plan a migration. For TTL-heavy time-series data, use TWCS. Cassandra 5.0’s UCS uses shard parallelism that reduces the likelihood of a single partition stalling an entire compaction.
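The time-bucket advice above can be illustrated with a small sketch. Everything named here is hypothetical: the keyspace, table, and column names, and the choice of a one-day bucket. The right bucket width depends on your write rate and should keep each partition well under the size targets above.

```shell
# Days since the Unix epoch (UTC) for a timestamp in seconds.
# The application would write this value into the bucket column.
day_bucket() {
  echo $(( $1 / 86400 ))
}

# Print a CQL table sketch whose partition key combines the entity ID
# with the day bucket, so no single partition grows without bound.
# All names are hypothetical.
bucketed_schema_cql() {
  cat <<'EOF'
CREATE TABLE IF NOT EXISTS metrics.readings (
  sensor_id  text,
  day_bucket int,
  ts         timestamp,
  reading    double,
  PRIMARY KEY ((sensor_id, day_bucket), ts)
);
EOF
}

# Usage:
#   bucketed_schema_cql | cqlsh
#   day_bucket "$(date +%s)"
```

With this layout each sensor gets a new partition every day, at the cost of reads that span several days having to query one partition per bucket.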
Prevention
- Sample partition sizes weekly with nodetool tablehistograms or nodetool toppartitions. Trending max partition size is a leading indicator.
- Monitor the derivative of pending compactions, not just the absolute value. A steady increase over 24 hours signals that compaction is losing ground.
- Ensure repair runs within gc_grace_seconds. Tombstones become eligible for purging once gc_grace_seconds has elapsed, and a replica that missed the delete can resurrect the data if repair has not run by then.
- Separate commitlog and data directories onto different devices so that compaction I/O does not starve the write path.
- For tables with known large partition risk, implement application-level write guardrails that reject or split oversized batches.
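The rate-of-change idea above can be approximated without a monitoring stack. The sketch below assumes you collect samples yourself as "<epoch_seconds> <pending_tasks>" lines; the function name, the sample file path, and the way the pending count is parsed in the usage comment are illustrative assumptions.

```shell
# Read "<epoch_seconds> <pending_tasks>" samples on stdin and print the
# average change in pending tasks per hour between the first and last sample.
# Prints "n/a" when fewer than two distinct timestamps are available.
pending_rate_per_hour() {
  awk 'NF >= 2 {
         if (!seen) { t0 = $1; p0 = $2; seen = 1 }
         t1 = $1
         p1 = $2
       }
       END {
         if (!seen || t1 == t0) { print "n/a"; exit }
         printf "%.1f\n", (p1 - p0) * 3600 / (t1 - t0)
       }'
}

# Usage (run the first line from cron every few minutes):
#   echo "$(date +%s) $(nodetool compactionstats | awk '/pending tasks/ { print $NF; exit }')" >> /tmp/pending.log
#   pending_rate_per_hour < /tmp/pending.log
```

A value that stays positive across a full day matches the "losing ground" condition described above, even if the absolute pending count still looks modest.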
How Netdata helps
- Correlate pending compaction tasks with disk I/O utilization and JVM GC pause duration in one view to confirm whether a stuck compaction is causing broader pressure.
- Alert on the rate of change of pending compactions to detect backlog growth within minutes of a thread stalling.
- Track per-table SSTable count growth alongside read latency percentiles to visualize read amplification in real time.
- Monitor off-heap memory and process RSS to catch Linux OOM kills that can occur when large partition merges pressure the JVM beyond heap limits.
Related guides
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert
- Cassandra java.lang.OutOfMemoryError: Java heap space - causes and recovery
- Cassandra pending compactions growing: the compaction backlog runbook
- Cassandra too many SSTables per table: read amplification and how to fix it
- How Cassandra actually works in production: a mental model for operators







