Cassandra compaction death spiral: when writes outrun compaction throughput
P99 read latency climbs while disk utilisation on the data volume pins near 100%. nodetool compactionstats shows pending tasks rising hour over hour, and nodetool tablestats reports a growing SSTable count. Writes stay fast; reads slow down. This is the compaction death spiral: writes exceed compaction throughput, SSTables accumulate, and read amplification rises.
Unlike a sudden node crash, this failure is gradual. A background queue grows a little each day. Once disk I/O saturates, the cycle self-reinforces: compaction falls further behind, reads consult more files, latency spikes, and the backlog deepens. By the time client SLAs breach, recovery can take hours.
The root cause is almost always a capacity mismatch: writes generate SSTables faster than compaction can merge them. Watch the trend of pending compactions and SSTable counts, not just absolute values.
What this means
Cassandra is an LSM database. Writes append to memtables and commitlogs; memtables flush to immutable SSTables on disk. Compaction merges those SSTables in the background to bound read amplification and reclaim space.
When the incoming write rate exceeds the rate at which compaction can read, merge, and write new SSTables, files accumulate. Every read must consult more SSTables, bloom filters, and partition indexes, increasing disk I/O and CPU per read. As read latency climbs, application retries can add more load. Compaction itself slows because it must read a larger file set and because client reads saturate disk I/O. The result is a self-reinforcing cycle that ends in disk space exhaustion or unrecoverable latency.
flowchart TD
A[Write rate exceeds compaction throughput] --> B[SSTables accumulate]
B --> C[Read amplification rises]
C --> D[Disk I/O saturates]
D --> E[P99 read latency climbs]
E --> F[Compaction falls further behind]
F --> D
This pattern differs from a GC death spiral because the JVM heap may look healthy. The bottleneck is disk I/O and SSTable count, not memory pressure. Long GC pauses alongside these symptoms indicate a composite failure. See the related guides for GC-specific diagnosis.
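The capacity mismatch described in this section can be illustrated with a toy model. This is a sketch only: the rates are hypothetical, the function name is invented for illustration, and the model deliberately ignores the feedback effect where compaction itself slows as the backlog grows, so a real node degrades faster than this.

```python
def simulate_backlog(write_mb_s, compact_mb_s, hours, start_backlog_mb=0.0):
    """Return the uncompacted backlog (MB) at the end of each hour.

    Toy model: the backlog grows by the write rate, shrinks by the
    compaction rate, and can never drop below zero.
    """
    backlog = start_backlog_mb
    history = []
    for _ in range(hours):
        backlog = max(0.0, backlog + (write_mb_s - compact_mb_s) * 3600)
        history.append(backlog)
    return history


# Hypothetical rates: 40 MB/s of flushed data against 32 MB/s of compaction.
falling_behind = simulate_backlog(write_mb_s=40, compact_mb_s=32, hours=3)

# Same write rate once compaction can sustain 48 MB/s.
catching_up = simulate_backlog(40, 48, hours=3, start_backlog_mb=falling_behind[-1])

print(falling_behind)  # backlog grows by 28,800 MB every hour
print(catching_up)     # backlog drains at the same pace
```

Even an 8 MB/s shortfall adds roughly 28 GB of unmerged data per hour in this model, which is why the trend matters more than any single reading.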
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Write traffic spike | SSTable count and flush rate jump after a deployment, batch job, or traffic shift | Local write count in nodetool tablestats compared against baseline; nodetool proxyhistograms shows coordinator latency, not operation counts |
| Compaction throughput throttled too low | Pending tasks grow even though CPU and disk have headroom; active compaction throughput stays below device capability | nodetool compactionstats to see active progress versus the configured cap |
| Insufficient disk IOPS | iostat shows %util above 90% and await climbing on the data device; reads and compaction compete | iostat -x 1 on both data and commitlog devices |
| LCS on a write-heavy workload | L0 SSTable count exceeds 32 and keeps growing; L0 to L1 promotion cannot keep up | nodetool tablestats per table: SSTable count and the first (L0) entry of SSTables in each level |
| Repair or streaming active | Pending compactions spike during a repair window; repair adds anti-compaction SSTables and streaming I/O | nodetool netstats and the repair schedule |
| STCS disk space pressure | Disk usage is above 50% with STCS; compaction stalls because it cannot allocate temporary space for merging | df -h on the data volume and the Load value in nodetool info |
Quick checks
Run these read-only commands during the incident to confirm the spiral and rule out other failure modes. They do not mutate state.
# Compaction backlog and active progress
nodetool compactionstats
# SSTable count for a specific table
nodetool tablestats <keyspace> <table> | grep "SSTable count"
# Disk saturation on the data device
iostat -x 1 10
# Coordinator-level read and write latency
nodetool proxyhistograms
# Thread pool backpressure
nodetool tpstats
# Disk space headroom
df -h /var/lib/cassandra/data
# Heap usage to rule out GC pressure
nodetool info | grep "Heap Memory"
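To log these checks over time rather than reading them by eye, the two headline numbers can be pulled out of the command output. The sketch below assumes the output contains a "pending tasks: N" line and an "SSTable count: N" line; exact layout varies between Cassandra versions, and the sample strings are illustrative, not captured from a real node.

```python
import re


def parse_pending_tasks(compactionstats_output):
    """Return the total from the 'pending tasks: N' line, or None if absent."""
    match = re.search(r"^pending tasks:\s*(\d+)", compactionstats_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def parse_sstable_count(tablestats_output):
    """Return the first 'SSTable count: N' value, or None if absent."""
    match = re.search(r"SSTable count:\s*(\d+)", tablestats_output)
    return int(match.group(1)) if match else None


# Illustrative sample text (assumed layout, hypothetical keyspace and table).
SAMPLE_COMPACTIONSTATS = "pending tasks: 42\n- my_keyspace.my_table: 42\n"
SAMPLE_TABLESTATS = "\t\tTable: my_table\n\t\tSSTable count: 187\n"

print(parse_pending_tasks(SAMPLE_COMPACTIONSTATS))
print(parse_sstable_count(SAMPLE_TABLESTATS))

# On a live node the text would come from the read-only commands above, e.g.:
#   import subprocess
#   out = subprocess.run(["nodetool", "compactionstats"],
#                        capture_output=True, text=True).stdout
```

Both functions return None instead of raising when the expected line is missing, so a version with a different output layout shows up as a gap in the log rather than a crash.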
How to diagnose it
- Confirm compaction is falling behind. Run nodetool compactionstats every 15 minutes for a few hours and log the output. A monotonic increase over three or more samples confirms the node is falling behind rather than handling a transient spike.
- Identify disk saturation. Run iostat -x 1 on the data device. On SSD, sustained %util above 80% or await climbing above 10 ms indicates a disk bottleneck. Check both r_await and w_await; if compaction writes queue while read await spikes, the disk is the constraint.
- Correlate with read amplification. Run nodetool tablestats for affected keyspaces. If SSTable count grows continuously, reads are touching more files. A double-digit percentage growth per day is unsustainable.
- Check client impact. Run nodetool proxyhistograms. In a classic compaction spiral, write latency stays near baseline while read P99 climbs. Flat write P99 alongside climbing read P99 isolates the problem to the read path.
- Rule out GC pressure. Check nodetool info for heap usage. If heap is above 75% of max after GC, you may have a GC death spiral compounding the I/O problem.
- Assess disk runway. Run df -h. STCS can transiently need up to 100% additional space; running above 50% full is dangerous. With LCS or TWCS, maintain at least 30% free.
- Identify triggers. Check nodetool netstats for active streaming or repair. Look for recent traffic spikes, schema changes, or compaction_throughput_mb_per_sec set below disk capability.
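The first three checks above reduce to simple arithmetic on logged samples. The following is a minimal sketch using the thresholds from the steps (three or more rising samples, 80% utilisation, 10 ms await); the function names are hypothetical and the input values are made up for illustration.

```python
def is_falling_behind(pending_samples, min_samples=3):
    """True when there are enough samples and each one exceeds the previous."""
    if len(pending_samples) < min_samples:
        return False
    return all(a < b for a, b in zip(pending_samples, pending_samples[1:]))


def daily_growth_pct(count_yesterday, count_today):
    """Percentage change in SSTable count from one day to the next."""
    if count_yesterday <= 0:
        raise ValueError("count_yesterday must be positive")
    return (count_today - count_yesterday) * 100.0 / count_yesterday


def disk_is_bottleneck(util_pct, await_ms, util_limit=80.0, await_limit_ms=10.0):
    """Apply the SSD thresholds to one averaged iostat reading."""
    return util_pct > util_limit or await_ms > await_limit_ms


# Made-up readings taken every 15 minutes.
print(is_falling_behind([12, 15, 19, 26]))
# Made-up SSTable counts one day apart.
print(daily_growth_pct(200, 230))
# Made-up iostat averages for the data device.
print(disk_is_bottleneck(util_pct=93.0, await_ms=14.2))
```

The thresholds are parameters rather than constants because the 80% and 10 ms figures apply to SSDs; other storage needs its own baseline.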
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Pending compactions | The leading indicator of compaction debt | Trending upward continuously for more than 8 hours |
| Live SSTable count per table | Directly determines read amplification | Growing day over day; LCS L0 above 32 |
| Disk I/O utilization | Saturated disks cannot flush or compact | %util above 80% sustained or await trending up |
| Client read latency P99 | User-visible degradation | P99 above 3 times rolling baseline or approaching read_request_timeout_in_ms |
| Dropped messages | The node is shedding load internally | Sustained non-zero dropped read or write messages |
| Thread pool pending tasks | Internal backpressure is building | Pending in ReadStage or MutationStage above 0 sustained |
| Disk space available | Compaction needs temporary space to run | Below 50% free for STCS, below 30% for LCS or TWCS |
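Several rows of the table above can be expressed as a single evaluation pass. This is a sketch under stated assumptions: the NodeSignals fields are hypothetical names for values you would collect from your own monitoring, and the caller is responsible for the "sustained" part by passing averaged readings rather than instantaneous ones.

```python
from dataclasses import dataclass


@dataclass
class NodeSignals:
    pending_rising_hours: float   # hours the pending count has trended upward
    lcs_l0_sstables: int          # 0 for tables that do not use LCS
    disk_util_pct: float          # averaged %util on the data device
    read_p99_ms: float
    baseline_read_p99_ms: float   # rolling baseline for the same metric
    dropped_messages: int         # dropped read plus write messages
    stage_pending: int            # ReadStage plus MutationStage pending


def warning_signs(s):
    """Return the names of the table's warning signs that currently fire."""
    warnings = []
    if s.pending_rising_hours > 8:
        warnings.append("pending compactions rising for more than 8 hours")
    if s.lcs_l0_sstables > 32:
        warnings.append("LCS L0 above 32 SSTables")
    if s.disk_util_pct > 80:
        warnings.append("disk utilisation above 80%")
    if s.read_p99_ms > 3 * s.baseline_read_p99_ms:
        warnings.append("read P99 above 3 times baseline")
    if s.dropped_messages > 0:
        warnings.append("dropped messages")
    if s.stage_pending > 0:
        warnings.append("thread pool backpressure")
    return warnings


# Made-up readings for a node in trouble and a healthy one.
spiralling = NodeSignals(10, 41, 93.0, 48.0, 12.0, 0, 0)
healthy = NodeSignals(0, 4, 35.0, 11.0, 12.0, 0, 0)
print(warning_signs(spiralling))
print(warning_signs(healthy))
```

Returning a list of names rather than a single boolean keeps the signals separate, which matters because one firing alone is often a transient while several together point to the spiral.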
Fixes
Immediate relief
Throttle non-critical writes. Reduce write rate at the application layer. Pausing bulk loads or deferring non-essential mutations gives compaction room to catch up.
Cancel or defer repair and streaming. Repair generates anti-compaction SSTables and consumes disk I/O and network bandwidth. Defer scheduled repairs and avoid starting new ones until the backlog clears. If a repair is already running, abort it only if your operational procedures support a clean stop.
Raise compaction throughput. If the disk has headroom, increase the compaction I/O cap:
nodetool setcompactionthroughput 128
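Monitor iostat after the change. If read latency regresses, reduce the cap. The default is 64 MB/s in recent releases (Cassandra 3.x shipped with 16 MB/s), so confirm the current value with nodetool getcompactionthroughput. Raising the cap helps only if the storage device can deliver more.
Increase concurrent compactors. If CPU is not saturated, raise concurrent_compactors. A change in cassandra.yaml takes effect only after a restart, which is best avoided mid-incident; on Cassandra 4.0 and later, nodetool setconcurrentcompactors changes the value at runtime.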
Strategy and hardware mismatches
LCS on write-heavy workloads. If L0 is above 32 and growing, LCS cannot promote SSTables to L1 fast enough. LCS is designed for read-heavy workloads. Plan a migration to STCS or UCS for write-heavy tables. Schedule this during a maintenance window.
STCS space amplification. If disk usage is above 50% with STCS, compaction may stall because it cannot allocate temporary space for major compactions. Free space by removing old snapshots:
nodetool clearsnapshot --all
Warning: nodetool clearsnapshot --all deletes all snapshots irreversibly. Verify your backup and recovery procedures do not depend on them before running this.
If growth is sustained, add storage capacity.
Insufficient disk IOPS. If iostat shows saturation and the device is a spinning disk, the only durable fix is faster storage. SSDs are strongly recommended for Cassandra data directories. Ensure commitlog and data directories are on separate devices so sequential commitlog writes do not compete with compaction I/O.
What to avoid
Do not restart the node to “clear the backlog.” A restart flushes all memtables, creating a burst of small SSTables that adds to the compaction debt.
Do not trigger a manual major compaction to clear the backlog on a saturated node. It will consume massive temporary disk space and I/O, likely worsening the spiral.
Do not alter the compaction strategy on a node that is already I/O saturated unless you can tolerate the additional load.
Prevention
Watch the trend, not the absolute. A pending compaction count of 25 may be normal for your cluster. A count that rises from 15 to 25 over three days is a warning. Alert on the derivative of pending tasks over a 24-hour window.
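One way to alert on the derivative is to fit a least-squares line through the pending-task samples in the window and alert on its slope. The sketch below is an illustration, not a prescribed method: the sample values are invented and the alert threshold of 2 tasks per day is a hypothetical number you would tune per cluster.

```python
def pending_slope_per_hour(samples):
    """Least-squares slope of pending tasks over time.

    samples: list of (hours_since_window_start, pending_tasks) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_p = sum(p for _, p in samples) / n
    numerator = sum((t - mean_t) * (p - mean_p) for t, p in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


# Invented samples across one 24-hour window.
window = [(0, 15), (8, 16), (16, 18), (24, 19)]
slope = pending_slope_per_hour(window)

ALERT_TASKS_PER_DAY = 2.0  # hypothetical threshold
should_alert = slope * 24 > ALERT_TASKS_PER_DAY
print(round(slope, 3), should_alert)
```

A fitted slope is less sensitive to a single noisy sample than comparing only the first and last readings in the window.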
Maintain strategy-specific disk headroom. STCS can need up to 100% additional temporary space during major compaction. Never let STCS run above 50% disk utilization. For LCS and TWCS, keep at least 30% free.
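The headroom rule can be checked mechanically. This sketch encodes the limits stated above (50% used for STCS, 70% used for LCS and TWCS) and reads the filesystem with the standard library; the dictionary and function names are hypothetical, and the path in the usage comment is the one used in the quick checks, which may differ on your nodes.

```python
import shutil

# Maximum used percentage per strategy, from the headroom rule above.
MAX_USED_PCT = {"STCS": 50.0, "LCS": 70.0, "TWCS": 70.0}


def used_pct(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used * 100.0 / usage.total


def headroom_ok(strategy, used_percent):
    """True when disk usage is within the limit for the given strategy."""
    return used_percent <= MAX_USED_PCT[strategy]


print(headroom_ok("STCS", 48.0))
print(headroom_ok("STCS", 55.0))
print(headroom_ok("LCS", 65.0))

# On a node, the reading would come from the data volume, e.g.:
#   headroom_ok("STCS", used_pct("/var/lib/cassandra/data"))
```

An unknown strategy name raises KeyError on purpose: a strategy such as UCS has no limit in this document, and guessing one would hide the gap.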
Schedule maintenance operations off-peak. Run repair, streaming, and major topology changes during low-traffic windows. These operations compete for the same disk and network resources as compaction.
Monitor per-table SSTable counts. Aggregate cluster metrics hide per-table skew. A single table with an unbounded partition key can generate enough SSTables to saturate a node.
Size for compaction throughput, not just capacity. A node can have terabytes of free disk but still enter a death spiral if its IOPS are too low to merge SSTables at the incoming write rate.
Size compaction_throughput_mb_per_sec (renamed compaction_throughput in Cassandra 4.1 and later) to your storage capability. If your data devices are fast NVMe arrays, the default 64 MB/s cap may artificially throttle compaction.
How Netdata helps
- Correlate pending compaction tasks with disk utilization and I/O wait on the data volume to confirm when compaction starves reads of I/O.
- Track live SSTable count per table to catch read amplification before P99 latency breaches SLA.
- Overlay read latency P99 with compaction pending tasks to expose the lag between compaction debt and client impact.
- Monitor thread pool pending tasks for CompactionExecutor and ReadStage to detect internal backpressure before it surfaces as timeouts.
- Watch disk space percentage alongside Cassandra data load to account for snapshot accumulation and compaction temporary files.
Related guides
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert
- Cassandra java.lang.OutOfMemoryError: Java heap space - causes and recovery
- Cassandra pending compactions growing: the compaction backlog runbook
- How Cassandra actually works in production: a mental model for operators







