Cassandra compaction stuck: large partitions blocking a compaction thread

nodetool compactionstats shows a compaction on one table that has not moved past the same byte offset for hours. The progress percentage is frozen, the pending queue behind it is growing, and read latency on that table is creeping up. This is not a slow disk. A single large partition has monopolized a compaction thread, turning background maintenance into a bottleneck that threatens node stability.

While large SSTables take time to compact, there is a difference between slow progress and no progress. When a partition grows unbounded, the compaction thread spends disproportionate time merging it across multiple SSTables. The rest of the compaction queue stalls, SSTables accumulate, and tombstones cannot be purged until every SSTable containing that partition is included in the same compaction job.

What this means

Cassandra compacts SSTables using a finite thread pool. Each job reads input SSTables sequentially and writes a new merged SSTable. When a partition within those SSTables grows to tens or hundreds of megabytes, the merge cost for that single partition dominates the job. The thread appears stuck at the same byte offset because it is processing one enormous partition while producing relatively few output bytes.

This differs from generic compaction debt. In a compaction death spiral, backlog grows because write rate exceeds overall compaction throughput. Here, throughput collapses because one thread is blocked by a single partition. The node may still have idle compaction threads, but the job containing the large partition holds up the SSTables it touched. Tombstones cannot be dropped until every SSTable containing older data for that partition is included in the same compaction job. If the partition is so large that the job cannot finish, tombstones are effectively trapped, and disk space is not reclaimed.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Single large partition (> 100 MB) | nodetool compactionstats shows the same completed bytes for hours; system logs contain “Compacting large partition” warnings | nodetool tablehistograms and nodetool tablestats for partition size outliers |
| Many moderately large partitions in one SSTable | Progress moves slowly but does advance; compactionhistory shows duration outliers for this table | nodetool compactionhistory for abnormally long past compactions on the same table |
| I/O throttling or cloud storage bottleneck | All compactions are slow, not just one; disk await is high but CPU is low | iostat -x and the configured compaction throughput limit |
| Tombstone-heavy large partition | Partition spans many SSTables and compaction cannot drop tombstones because not all SSTables are in the same job | Repair status and tombstone scan warnings in logs |

Quick checks

Run these read-only commands to confirm the symptom and scope the impact.

# Check active compactions and verify the byte offset is not moving
nodetool compactionstats

# Compare against historical compaction durations for this table
nodetool compactionhistory

# Look for large-partition warnings that correlate with the stuck compaction
# (the wording varies by version; some releases log "Writing large partition")
grep -iE "(Compacting|Writing) large partition" /var/log/cassandra/system.log

# Inspect partition size distribution and SSTable count for the affected table
nodetool tablehistograms <keyspace> <table>
nodetool tablestats <keyspace> <table>

# Sample the most frequently read or written partitions over a 1000 ms window
# (these are the most active partitions, which are not necessarily the largest)
nodetool toppartitions <keyspace> <table> 1000

# Check if compaction threads are saturated or if other pools are affected
nodetool tpstats

# Verify whether the bottleneck is disk I/O or CPU/merge overhead
iostat -x 1

How to diagnose it

Follow these steps before stopping anything. Large SSTables can spend extended time in validation and merge phases, so verify the freeze before intervening.

  1. Confirm the compaction is truly stuck. Sample nodetool compactionstats twice, 10 to 15 minutes apart. If the completed bytes and progress percentage are identical, the thread is blocked.

  2. Capture the compaction UUID, keyspace, and table. You will need the UUID if you decide to stop a specific compaction later.

  3. Check nodetool compactionhistory for this table. If past compactions of similar total size completed in minutes but this one has run for hours, the job is pathological.

  4. Search system logs for large partition warnings. Log lines referencing “Compacting large partition” pointing to the same table confirm the diagnosis.

  5. Inspect partition sizes. Use nodetool tablehistograms to see the maximum partition size and distribution. If the max is approaching or exceeding 100 MB, you have found the culprit.

  6. Check thread pool health. nodetool tpstats shows whether the CompactionExecutor has pending tasks building behind the active stuck job. If other request-stage pools also show pending tasks, the node is under broader pressure.

  7. Correlate with disk metrics. Run iostat -x. If disk utilization is low and await is normal, the bottleneck is CPU or memory pressure from the merge itself, not storage.
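
Step 1 can be scripted as a comparison of two saved snapshots. The following is a minimal sketch, not part of nodetool: compaction_stalled is a hypothetical helper, and it assumes each active compaction is reported on a single line of nodetool compactionstats output that includes its UUID. Under that assumption, a line that is byte-for-byte identical in both samples means completed bytes and progress have not changed.

```shell
# Hypothetical helper: exit 0 when the compactionstats line for the given
# compaction UUID is present and identical in two saved snapshots.
# Assumption: each active compaction appears on one line containing its UUID.
# Exit 2: UUID missing from the first snapshot.
# Exit 1: line changed, or the job is no longer listed in the second snapshot.
compaction_stalled() (
    id=$1; first=$2; second=$3
    before=$(grep -F -- "$id" "$first") || exit 2
    after=$(grep -F -- "$id" "$second") || exit 1
    [ "$before" = "$after" ]
)

# Example usage on the affected node (file names are arbitrary):
#   nodetool compactionstats > /tmp/cs1.txt
#   sleep 900                                  # 15 minutes between samples
#   nodetool compactionstats > /tmp/cs2.txt
#   compaction_stalled <uuid> /tmp/cs1.txt /tmp/cs2.txt && echo "stalled"
```

Comparing the whole line rather than parsing individual columns keeps the check independent of the exact column layout, which can differ between Cassandra versions.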

flowchart TD
    A[Compaction frozen at same offset] --> B[Check logs for large partition warnings]
    B --> C{Warnings found?}
    C -->|Yes| D[Large partition blocking thread]
    C -->|No| E[Check compactionhistory for duration outliers]
    E --> F{Outliers match?}
    F -->|Yes| G[Slow but valid large SSTable]
    F -->|No| H[Check disk I/O and thread pools]
    H --> I{I/O saturated?}
    I -->|Yes| J[Storage bottleneck]
    I -->|No| K[Investigate other causes]
    D --> L[Assess partition size via tablehistograms]
    L --> M[Stop compaction or let finish]
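
The same decision flow can be written as a small shell function for use in a runbook. This is only a sketch: diagnose_frozen_compaction is a hypothetical name, and the three yes/no answers still have to come from the manual checks described in the steps above.

```shell
# Hypothetical helper mirroring the flowchart. Each argument is "yes" or "no":
#   $1  large-partition warnings were found in the logs
#   $2  compactionhistory shows matching duration outliers
#   $3  disk I/O is saturated
diagnose_frozen_compaction() {
    if [ "$1" = yes ]; then
        echo "large partition blocking thread: assess size via tablehistograms"
    elif [ "$2" = yes ]; then
        echo "slow but valid large SSTable"
    elif [ "$3" = yes ]; then
        echo "storage bottleneck"
    else
        echo "investigate other causes"
    fi
}

# Example: no warnings, no history outliers, saturated disk
#   diagnose_frozen_compaction no no yes
```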

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Pending compactions | A growing queue while one job is stuck means the backlog is compounding | PendingTasks trending upward over hours |
| SSTable count per table | Each flush adds an SSTable; without compaction, read amplification rises | LiveSSTableCount growing on the affected table |
| Partition size distribution | Partitions over 100 MB cause OOM and compaction stalls | Max partition size in tablehistograms approaching 100 MB |
| CompactionExecutor pending | Shows compaction thread saturation directly | Pending tasks accumulating while active count is frozen |
| Disk I/O utilization | Distinguishes partition merge overhead from storage saturation | %util > 80% or await > 50 ms sustained |
| JVM heap usage after GC | Large partition merges allocate heavily in heap | Heap floor after old GC climbing above 75% of max |
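
As an illustration of turning the partition size signal into an automated check, the sketch below classifies a maximum partition size given in bytes. The helper name and the three labels are hypothetical. The 10 MB and 100 MB boundaries come from this guide, and the sketch assumes 1 MB = 1,048,576 bytes.

```shell
# Hypothetical helper: classify a max partition size given in bytes.
# Thresholds follow this guide: 10 MB target, 100 MB ceiling.
# Assumption: 1 MB is treated as 1048576 bytes.
classify_partition_bytes() (
    mb=$(( $1 / 1048576 ))
    if [ "$mb" -ge 100 ]; then
        echo critical
    elif [ "$mb" -ge 10 ]; then
        echo warning
    else
        echo ok
    fi
)

# Example: a 50 MB partition
#   classify_partition_bytes 52428800
```

The byte value would typically be read by hand from the maximum partition size reported by nodetool tablehistograms or nodetool tablestats.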

Fixes

Stop a runaway compaction

If the compaction has been frozen for hours and pending tasks are building, stop it. Partial merge work is discarded, so the input SSTables must be re-compacted later. This provides immediate relief to the thread pool but does not eliminate the underlying large partition. You will still need to fix the data model to prevent the same stall on the next compaction.

# Stop ALL running compactions. Partial work is discarded.
nodetool stop COMPACTION

# Or stop only the stuck compaction, using its UUID from compactionstats
nodetool stop -id <uuid>

WARNING: nodetool stop COMPACTION without a UUID stops every active compaction. Do not treat this as a routine fix.

Do not disable auto-compaction as a way to prevent recurrence. With auto-compaction off, SSTable count grows and read performance degrades, so treat it strictly as a temporary emergency measure.

Adjust compaction throughput

If the compaction is slow but progressing, and disk I/O headroom exists, temporarily raise the compaction throughput ceiling. This helps compaction catch up but will increase I/O contention with reads.

# View current limit (MB/s)
nodetool getcompactionthroughput

# Raise temporarily during a low-traffic window; revert afterward
nodetool setcompactionthroughput 128
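
To make the revert step harder to forget, the current limit can be recorded before it is raised. The sketch below is a hypothetical helper that extracts the first integer from the first line of its input. It assumes nodetool getcompactionthroughput reports the configured limit as the first number on its first output line, which should be verified on your Cassandra version before relying on it.

```shell
# Hypothetical helper: print the first integer found on the first line of stdin.
# Assumption: `nodetool getcompactionthroughput` prints the configured limit
# as the first number on its first output line.
first_number() {
    awk 'NR == 1 && match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }'
}

# Example usage on the affected node:
#   previous=$(nodetool getcompactionthroughput | first_number)
#   nodetool setcompactionthroughput 128
#   ... wait for the backlog to drain ...
#   nodetool setcompactionthroughput "$previous"
```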

Force a targeted compaction after cleanup

After you stop writes to the large partition via application changes, you may want to force a major compaction to consolidate existing SSTables and purge tombstones.

# WARNING: extremely I/O intensive and may temporarily double disk usage with STCS
nodetool compact <keyspace> <table>

Only run this after the data model is fixed and during a maintenance window. With STCS, major compaction requires temporary space roughly equal to the table size.
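
A pre-flight check for that space requirement might look like the sketch below. The helper name is hypothetical, and both arguments are plain byte counts. The commented commands for obtaining them assume GNU du and df and the default data directory /var/lib/cassandra/data, so adjust the path to your installation.

```shell
# Hypothetical helper: exit 0 when free space covers the table's on-disk size.
#   $1  table size in bytes
#   $2  free space in bytes on the data volume
# The 1x ratio follows the rough STCS guidance above; it is not an exact bound.
enough_space_for_major_compaction() {
    [ "$2" -ge "$1" ]
}

# Example usage (GNU du/df; data directory path is an assumption):
#   table_bytes=$(du -sbc /var/lib/cassandra/data/<keyspace>/<table>-*/ | tail -n 1 | cut -f1)
#   free_bytes=$(df --output=avail -B1 /var/lib/cassandra/data | tail -n 1 | tr -d ' ')
#   enough_space_for_major_compaction "$table_bytes" "$free_bytes" || echo "not enough headroom"
```

Because du also counts any snapshots or backups stored under the table directory, the measured size tends to err on the conservative side.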

Fix the data model

The only permanent fix is to prevent unbounded partition growth.

  • Keep partitions under 10 MB. Treat 100 MB as the practical ceiling beyond which OOM during compaction becomes likely.
  • Redesign the partition key to add cardinality. For time-series workloads, include a time bucket in the partition key so no single partition grows forever.
  • Avoid using Cassandra as a queue. Queue patterns with write-read-delete cycles create partitions that grow until they are unmanageable.
  • If the table uses a legacy compaction strategy poorly suited to the access pattern, plan a migration. For TTL-heavy time-series data, use TWCS. Cassandra 5.0’s UCS uses shard parallelism that reduces the likelihood of a single partition stalling an entire compaction.
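
The time-bucket advice comes down to simple arithmetic: choose a bucket length short enough that one bucket's worth of writes stays under the target partition size. The sketch below uses two hypothetical helpers and assumes a steady write rate and a known average row size, neither of which is given in this guide.

```shell
# Hypothetical helper: longest bucket (in seconds) that keeps one partition
# at or below a target size, assuming a steady write rate.
#   $1  rows written per second to one logical key (for example, one sensor)
#   $2  average row size in bytes
#   $3  target partition size in bytes
max_bucket_seconds() {
    echo $(( $3 / ($1 * $2) ))
}

# Hypothetical helper: bucket number for an epoch timestamp, intended to be
# stored as an additional partition key column.
#   $1  epoch seconds
#   $2  bucket length in seconds
bucket_id() {
    echo $(( $1 / $2 ))
}

# Example: 10 rows/s at 200 bytes each, 10 MB (10485760 bytes) target
#   max_bucket_seconds 10 200 10485760    # prints 5242
```

In this example an hourly bucket (3600 seconds) fits under the limit at roughly 7 MB per partition, while a daily bucket would not.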

Prevention

  • Sample partition sizes weekly with nodetool tablehistograms or nodetool toppartitions. Trending max partition size is a leading indicator.
  • Monitor the derivative of pending compactions, not just the absolute value. A steady increase over 24 hours signals that compaction is losing ground.
  • Ensure repair runs within gc_grace_seconds. Tombstones become eligible for purging once gc_grace_seconds has elapsed, and repairing inside that window ensures every replica has received the delete before the tombstone is dropped.
  • Separate commitlog and data directories onto different devices so that compaction I/O does not starve the write path.
  • For tables with known large partition risk, implement application-level write guardrails that reject or split oversized batches.
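
The rate-of-change signal mentioned above can be approximated from periodic samples. The sketch below is a hypothetical helper that reads lines of the form "epoch_seconds pending_tasks" and prints the average change per hour between the first and last sample. How the samples are collected (for example, a cron job recording the pending count alongside date +%s) is left as an assumption.

```shell
# Hypothetical helper: read "epoch_seconds pending_tasks" lines on stdin and
# print the average change in pending tasks per hour between the first and
# last sample. Prints "n/a" when fewer than two distinct timestamps exist.
pending_rate_per_hour() {
    awk 'NR == 1 { t0 = $1; p0 = $2 }
         { t1 = $1; p1 = $2 }
         END {
             if (NR < 2 || t1 == t0) { print "n/a"; exit }
             printf "%.1f\n", (p1 - p0) * 3600 / (t1 - t0)
         }'
}

# Example usage with a sample file (file name is arbitrary):
#   pending_rate_per_hour < /var/tmp/pending_samples.txt
```

A persistently positive result over a 24 hour window corresponds to the "losing ground" condition described in the list above.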

How Netdata helps

  • Correlate pending compaction tasks with disk I/O utilization and JVM GC pause duration in one view to confirm whether a stuck compaction is causing broader pressure.
  • Alert on the rate of change of pending compactions to detect backlog growth within minutes of a thread stalling.
  • Track per-table SSTable count growth alongside read latency percentiles to visualize read amplification in real time.
  • Monitor off-heap memory and process RSS to catch Linux OOM kills that can occur when large partition merges pressure the JVM beyond heap limits.