Cassandra Not enough space for compaction: STCS space amplification and recovery

A "Not enough space for compaction" message in system.log means STCS has hit a structural space-amplification limit. Disk usage may already be above 50%. Cassandra aborts the compaction, skips the tier, and leaves tombstones and old versions unmerged. SSTable count rises, read amplification increases, and free space stops being reclaimed. Left unchecked, this becomes a compaction death spiral that ends in write rejection or disk exhaustion. Recovery requires immediate free space, targeted cleanup, and a headroom plan that accounts for transient STCS amplification.

What this means

STCS compacts SSTables by reading all input files in a tier and writing a new merged file. The input and output sets coexist until the new SSTable is complete and the old ones are removed. If a tier contains four 50 GB SSTables, compaction can write up to ~200 GB of new data before deleting the sources. For large tiers, the extra transient disk usage can therefore equal the full size of the data being compacted. Treat running STCS above 50% disk utilization as dangerous: any large compaction can exhaust remaining space.
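
The arithmetic above can be sketched as a small shell helper. This is an illustration, not Cassandra's own space check: the function name stcs_tier_fits is hypothetical, sizes are whole gigabytes, and it assumes the worst case where the merged output is as large as the sum of its inputs (overwrites and purgeable tombstones can make the real output smaller).

```shell
# Usage: stcs_tier_fits <free_gb> <sstable_gb>...
# Worst-case check: the merged SSTable may be as large as all inputs combined,
# and the inputs stay on disk until the merge completes.
stcs_tier_fits() {
  local free_gb="$1"
  shift
  local total=0 size
  for size in "$@"; do
    total=$(( total + size ))
  done
  if [ "$total" -le "$free_gb" ]; then
    echo "fits: need ~${total} GB, have ${free_gb} GB"
  else
    echo "skipped: need ~${total} GB, have ${free_gb} GB"
    return 1
  fi
}

# Example: four 50 GB SSTables with 150 GB free
#   stcs_tier_fits 150 50 50 50 50
```

With 150 GB free, the four-file tier from the example does not fit, which is the situation that produces the log message.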

When Cassandra skips a compaction due to insufficient space, SSTables accumulate. Higher SSTable counts increase read amplification and leave tombstones in place. Because compaction is the only way to reclaim space from deletes and overwrites, a stall also halts reclamation. The result is a feedback loop: less free space means fewer compactions, which means more SSTables and worse performance.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| STCS headroom exceeded | "Not enough space for compaction" in logs; pending compactions rising while disk is above 50% full | df -h /var/lib/cassandra/data and nodetool info \| grep "Load" |
| Stale snapshots | Filesystem usage grows faster than Load; old backups never cleaned up | nodetool listsnapshots |
| Write rate exceeding compaction throughput | Pending compactions trending up; SSTable count growing; disk I/O saturated | nodetool compactionstats and iostat -x 1 |
| Overwrite workload amplifying space | Rapid disk growth on update-heavy tables; same partitions rewritten repeatedly | nodetool tablestats <keyspace> \| grep "Space used" |
| Hint accumulation after node outage | Hints directory large; a node was recently marked DOWN | du -sh /var/lib/cassandra/hints/ |

Quick checks

# Check filesystem free space on the data volume
df -h /var/lib/cassandra/data

# Check Cassandra's live data size estimate (excludes snapshots and hints)
nodetool info | grep "Load"

# Check pending and active compactions
nodetool compactionstats

# Check SSTable accumulation per table
nodetool tablestats | grep "SSTable count"

# Check for snapshots
nodetool listsnapshots

# Check hint file accumulation
du -sh /var/lib/cassandra/hints/

# Check disk I/O saturation on data and commitlog devices
iostat -x 1

# Check CompactionExecutor thread pool status
nodetool tpstats | grep -A1 "CompactionExecutor"
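
The first two checks can be combined into a single comparison. The sketch below is a hedged example: size_to_mib and major_compaction_fits are hypothetical helper names, and the commented usage assumes the Load line of nodetool info looks like "Load : 412.5 GiB" (older releases print GB instead of GiB, which the unit match also tolerates).

```shell
# Convert a nodetool-style size (value and unit, e.g. 412.5 GiB) to whole MiB.
size_to_mib() {
  awk -v v="$1" -v u="$2" 'BEGIN {
    if (u ~ /^T/) m = 1048576
    else if (u ~ /^G/) m = 1024
    else if (u ~ /^M/) m = 1
    else if (u ~ /^K/) m = 1 / 1024
    else m = 1 / 1048576          # plain bytes
    printf "%d\n", v * m
  }'
}

# Usage: major_compaction_fits <load_mib> <free_mib>
# A major compaction of the full dataset needs free space of at least Load.
major_compaction_fits() {
  if [ "$2" -ge "$1" ]; then
    echo "ok"
  else
    echo "insufficient"
  fi
}

# Example wiring (assumes the default data path):
#   set -- $(nodetool info | awk -F': *' '/^Load/ {print $2}')
#   load_mib=$(size_to_mib "$1" "$2")
#   free_mib=$(df -Pk /var/lib/cassandra/data | awk 'NR == 2 {print int($4 / 1024)}')
#   major_compaction_fits "$load_mib" "$free_mib"
```

An "insufficient" result does not mean every compaction will fail, only that a compaction covering the whole dataset cannot complete.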

How to diagnose it

  1. Confirm the exact log error. Search system.log for the verbatim message: grep "Not enough space for compaction" /var/log/cassandra/system.log. Note the timestamp and table name if included.
  2. Compare data size to free space. Run nodetool info | grep "Load" to see Cassandra’s live data size, then run df -h on the data mount. If free space is less than the Load value, a major compaction of the full dataset cannot complete. Regular tier compactions can also fail if the target tier is larger than available free space.
  3. Verify compaction is stalled. Run nodetool compactionstats. If pending tasks are high but no compactions are active, and logs show space errors, disk pressure is the direct blocker.
  4. Find snapshot bloat. Snapshots hold hard links to SSTables. When compaction removes an old SSTable, its disk blocks stay allocated for as long as a snapshot still links to the file, so snapshots directly amplify space consumption. Check snapshot directories under the data path.
  5. Quantify SSTable growth. Run nodetool tablestats <keyspace> <table> and check SSTable count. In STCS, sustained counts above 50 indicate compaction is falling behind.
  6. Check for storage exceptions. Run grep -i "FSError\|CorruptSSTable\|IOError" /var/log/cassandra/system.log. A failing disk can trigger write errors that resemble space exhaustion.
  7. Correlate write and flush pressure. In nodetool tpstats, rising completed tasks on MemtableFlushWriter with flat or slow CompactionExecutor activity means flush debt is outpacing compaction.

The resulting feedback loop, as a Mermaid flowchart:

flowchart TD
    A[Disk usage exceeds 50% with STCS] --> B{Cassandra selects large tier for compaction}
    B -->|Not enough free space| C[Compaction skipped]
    C --> D[SSTables accumulate]
    D --> E[Read amplification rises]
    E --> F[More disk consumed by old SSTables]
    F --> B
    D --> G[Pending compactions grow unchecked]
    G --> H[Compaction death spiral]
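
Step 5 of the diagnosis can be run across all tables at once. The following is a sketch, not a supported tool: flag_sstable_backlog is a hypothetical name, and it assumes nodetool tablestats prints "Keyspace : <name>", "Table: <name>", and "SSTable count: <n>" lines, which can differ between Cassandra versions.

```shell
# Usage: nodetool tablestats | flag_sstable_backlog [threshold]
# Prints "<keyspace>.<table> <count>" for each table above the threshold
# (default 50, the STCS warning level used in this guide).
flag_sstable_backlog() {
  local threshold="${1:-50}"
  awk -v limit="$threshold" '
    /^[ \t]*Keyspace ?:/     { ks = $NF }
    /^[ \t]*Table:/          { tbl = $NF }
    /^[ \t]*SSTable count:/  { if ($NF + 0 > limit) print ks "." tbl, $NF }
  '
}
```

Because the function only reads standard input, saved tablestats output from an earlier point in the incident can be piped through it to compare counts over time.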

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk space available (data dir) | STCS needs temporary space equal to the data size being compacted | < 50% free for STCS; < 30% for LCS; < 20% for TWCS |
| Pending compactions | Shows compaction backlog directly | Trending upward for more than 4 hours |
| SSTable count per table | Accumulates when compaction is skipped | STCS sustained count > 50 |
| Storage exceptions | Hardware errors can mimic space issues | Any non-zero rate |
| Disk I/O utilization | Saturated I/O slows compaction, preventing space reclamation | %util > 80% sustained |
| Commitlog pending tasks | Write-path pressure from disk exhaustion | Pending > 0 sustained |
| Client request timeouts | Reads suffer from accumulated SSTables | Rate > 0 sustained |
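
The disk-space thresholds in the table can be expressed as a simple check. This is a minimal sketch with hypothetical function names (min_free_pct, headroom_status, free_pct_of); the percentages are the ones stated above, and free_pct_of assumes the POSIX df -P layout where the fifth column is the used percentage.

```shell
# Minimum free-space percentage per compaction strategy, from this guide.
min_free_pct() {
  case "$1" in
    STCS) echo 50 ;;
    LCS)  echo 30 ;;
    TWCS) echo 20 ;;
    *)    echo "unknown strategy: $1" >&2; return 2 ;;
  esac
}

# Usage: headroom_status <strategy> <free_pct>
headroom_status() {
  local limit
  limit=$(min_free_pct "$1") || return 2
  if [ "$2" -lt "$limit" ]; then
    echo "WARN: ${2}% free, $1 wants at least ${limit}%"
    return 1
  fi
  echo "OK: ${2}% free"
}

# Free percentage of the filesystem that holds a path.
free_pct_of() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Example:
#   headroom_status STCS "$(free_pct_of /var/lib/cassandra/data)"
```

The non-zero exit status on a warning makes the check easy to call from cron or a monitoring wrapper.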

Fixes

Free disk space immediately

The fastest way to give compaction room is to remove snapshots. Snapshots are hard links to SSTables. They prevent the space from being reclaimed when compaction replaces the original files.

# List snapshots before deleting to confirm they are stale
nodetool listsnapshots

# Remove all snapshots. WARNING: only run if backups are verified or stale.
nodetool clearsnapshot --all
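
Before clearing anything, it can help to see which tables carry the most snapshot data. The sketch below uses a hypothetical helper, snapshot_usage, and assumes the usual on-disk layout in which each table directory under the data path contains a snapshots subdirectory. Because snapshot files are hard links, the sizes are an upper bound: only files whose live copies have already been compacted away are freed by clearing.

```shell
# Usage: snapshot_usage [data_dir]
# Lists every snapshots directory with its size in KiB, largest first.
snapshot_usage() {
  local data_dir="${1:-/var/lib/cassandra/data}"
  find "$data_dir" -type d -name snapshots -prune -exec du -sk {} + | sort -rn
}

# Example:
#   snapshot_usage | head -n 10
```

The "True size" column of nodetool listsnapshots gives a closer estimate of what clearing a snapshot actually frees.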

If the node is on a cloud volume, expand the filesystem after resizing the block device. Adding space is safer than running a major compaction on a nearly full disk, because the compaction itself can transiently double disk usage and push the node into disk-full failure.

Trigger targeted compaction, not a node-wide major compaction

Avoid a major compaction of every table when disk is low. Instead, target the largest or most bloated tables one at a time. Before running, check that free space exceeds the table's size so the compaction will fit:

# Check target table size
nodetool tablestats <keyspace> <table> | grep "Space used"

# Target a specific high-bloat table
nodetool compact <keyspace> <table>

Tradeoff: Even targeted compaction consumes temporary space. On an STCS table, nodetool compact merges all of that table's SSTables, so it can transiently need free space up to the table's size. Only run this after clearing snapshots or expanding storage. The operation is I/O intensive and will raise read latency while it runs. After starting, run nodetool compactionstats to confirm the job is active and progressing.

Increase compaction throughput temporarily

If disk I/O is not saturated and the bottleneck is compaction speed, raise the throttle:

# Increase compaction throughput (example: 128 MB/s)
nodetool setcompactionthroughput 128

Tradeoff: Higher throughput steals I/O bandwidth from reads and flushes. Monitor iostat -x 1 and client latency while the value is elevated. Return it to baseline once the backlog clears.

Stop background I/O consumers

Pause repair and streaming operations during recovery. Repair generates anti-compaction SSTables and consumes network and disk I/O. Check for active streams:

nodetool netstats

If a repair is already running, let it finish, or stop it and reschedule it for off-peak. Do not start new full repairs while fighting disk pressure.

Change compaction strategy (long-term)

If the workload is read-heavy or time-series, migrate away from STCS. LCS suits read-heavy tables: it provides steadier space usage and needs roughly 30% headroom. TWCS suits time-series data and needs roughly 20% headroom.

Tradeoff: Altering the compaction strategy on an existing table triggers a full recompaction. This requires significant temporary space and I/O. Plan the migration for a maintenance window when the node has adequate free space.
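
As a hedged illustration, the strategy change itself is a single CQL statement. The helper below (compaction_alter_cql, a hypothetical name) only builds the statement so it can be reviewed before anything runs; the keyspace and table names in the example are placeholders.

```shell
# Usage: compaction_alter_cql <keyspace> <table> <strategy_class>
# Prints the ALTER TABLE statement; it does not execute it.
compaction_alter_cql() {
  printf "ALTER TABLE %s.%s WITH compaction = {'class': '%s'};\n" "$1" "$2" "$3"
}

# Example (placeholder names), executed only after checking free space:
#   cqlsh -e "$(compaction_alter_cql shop orders LeveledCompactionStrategy)"
```

For TimeWindowCompactionStrategy the compaction map normally also sets compaction_window_unit and compaction_window_size, which this sketch leaves out.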

Prevention

  • Maintain per-strategy headroom. STCS requires greater than 50% free disk space. LCS requires greater than 30%. TWCS requires greater than 20%.
  • Automate snapshot cleanup. Retention scripts should run nodetool clearsnapshot after backup verification. Hard-linked snapshot data silently accumulates as compaction progresses.
  • Monitor compaction trends, not just absolute pending counts. Alert when pending compactions increase continuously over a 24-hour period.
  • Separate commitlog and data directories. Place them on independent volumes so commitlog growth does not steal headroom from compaction.
  • Provision for transient amplification, not just live data size. Size disks assuming STCS will need up to 100% additional space during major compaction.
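
The trend-based alert described above can be sketched as follows. The function name pending_trend is hypothetical, and "increase continuously" is interpreted here as every sample being strictly higher than the one before it; the sampling command in the comment assumes nodetool compactionstats prints a "pending tasks: <n>" line.

```shell
# Usage: pending_trend <sample>...   (oldest first, e.g. one per hour)
# Prints "rising" only if every sample is higher than the previous one.
pending_trend() {
  local prev="" cur
  if [ "$#" -lt 2 ]; then
    echo "insufficient"
    return 0
  fi
  for cur in "$@"; do
    if [ -n "$prev" ] && [ "$cur" -le "$prev" ]; then
      echo "not-rising"
      return 0
    fi
    prev="$cur"
  done
  echo "rising"
}

# Collecting one sample:
#   nodetool compactionstats | awk '/pending tasks:/ {print $3}'
```

A strict test like this ignores backlogs that plateau at a high level, so it works best alongside an absolute threshold rather than in place of one.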

How Netdata helps

  • Correlate data volume disk usage with JMX compaction pending tasks and per-table SSTable counts.
  • Alert on disk space crossing strategy-specific thresholds before compaction stalls.
  • Track read latency spikes alongside SSTable growth to surface the compaction death spiral before disk exhaustion.
  • Monitor JMX metrics like org.apache.cassandra.metrics:type=Compaction,name=PendingTasks and per-table LiveSSTableCount without manual nodetool polling.