Cassandra compaction death spiral: when writes outrun compaction throughput

P99 read latency climbs while disk utilization on the data volume pins near 100%. nodetool compactionstats shows pending tasks rising hour over hour, and nodetool tablestats reports a growing SSTable count. Writes stay fast; reads slow down. This is the compaction death spiral: writes exceed compaction throughput, SSTables accumulate, and read amplification rises.

Unlike a sudden node crash, this failure is gradual. A background queue grows a little each day. Once disk I/O saturates, the cycle self-reinforces: compaction falls further behind, reads consult more files, latency spikes, and the backlog deepens. By the time client SLAs breach, recovery can take hours.

The root cause is almost always a capacity mismatch: writes generate SSTables faster than compaction can merge them. Watch the trend of pending compactions and SSTable counts, not just absolute values.

What this means

Cassandra is an LSM database. Writes append to memtables and commitlogs; memtables flush to immutable SSTables on disk. Compaction merges those SSTables in the background to bound read amplification and reclaim space.

When the incoming write rate exceeds the rate at which compaction can read, merge, and write new SSTables, files accumulate. Every read must consult more SSTables, bloom filters, and partition indexes, increasing disk I/O and CPU per read. As read latency climbs, application retries can add more load. Compaction itself slows because it must read a larger file set and because client reads saturate disk I/O. The result is a self-reinforcing cycle that ends in disk space exhaustion or unrecoverable latency.

flowchart TD
    A[Write rate exceeds compaction throughput] --> B[SSTables accumulate]
    B --> C[Read amplification rises]
    C --> D[Disk I/O saturates]
    D --> E[P99 read latency climbs]
    E --> F[Compaction falls further behind]
    F --> D
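
The loop in the flowchart can be illustrated with a toy model. This is a sketch, not a capacity planner: the function name, the rates, and the `penalty_per_gb` feedback term are all hypothetical. The penalty stands in for compaction slowing down as the backlog grows and reads compete for the same disk.

```python
def simulate_backlog(write_mb_s, compact_mb_s, hours, penalty_per_gb=0.0):
    """Return the uncompacted backlog in GB at the end of each hour.

    Toy assumption: the effective compaction rate shrinks as the backlog
    grows, which is what makes the cycle self-reinforcing.
    """
    backlog_gb = 0.0
    history = []
    for _ in range(hours):
        effective_mb_s = compact_mb_s / (1.0 + penalty_per_gb * backlog_gb)
        net_mb = (write_mb_s - effective_mb_s) * 3600
        backlog_gb = max(0.0, backlog_gb + net_mb / 1024)
        history.append(backlog_gb)
    return history


# Writes below compaction throughput: the backlog never forms.
steady = simulate_backlog(write_mb_s=40, compact_mb_s=64, hours=5)

# Writes slightly above throughput, with feedback: growth accelerates.
spiral = simulate_backlog(write_mb_s=70, compact_mb_s=64, hours=4, penalty_per_gb=0.01)
```

With no penalty the backlog grows linearly. With any positive penalty, each hour adds more than the last, which matches the gradual-then-sudden shape described above.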

This pattern differs from a GC death spiral because the JVM heap may look healthy. The bottleneck is disk I/O and SSTable count, not memory pressure. Long GC pauses alongside these symptoms indicate a composite failure. See the related guides for GC-specific diagnosis.

Common causes

Cause	What it looks like	First thing to check
Write traffic spike	SSTable count and flush rate jump after a deployment, batch job, or traffic shift	`nodetool proxyhistograms` for coordinator write latency and `nodetool tablestats` write counts, both compared against baseline
Compaction throughput throttled too low	Pending tasks grow even though CPU and disk have headroom; active compaction throughput stays below device capability	`nodetool compactionstats` to see active progress versus the configured cap
Insufficient disk IOPS	`iostat` shows `%util` above 90% and `await` climbing on the data device; reads and compaction compete	`iostat -x 1` on both data and commitlog devices
LCS on a write-heavy workload	L0 SSTable count exceeds 32 and keeps growing; L0 to L1 promotion cannot keep up	`nodetool tablestats` SSTable count per table, especially L0
Repair or streaming active	Pending compactions spike during a repair window; repair adds anti-compaction SSTables and streaming I/O	`nodetool netstats` and the repair schedule
STCS disk space pressure	Disk usage is above 50% with STCS; compaction stalls because it cannot allocate temporary space for merging	`df -h` on the data volume and `nodetool info` Load

Quick checks

Run these read-only commands during the incident to confirm the spiral and rule out other failure modes. They do not mutate state.

# Compaction backlog and active progress
nodetool compactionstats

# SSTable count for a specific table
nodetool tablestats <keyspace> <table> | grep "SSTable count"

# Disk saturation on the data device
iostat -x 1 10

# Coordinator-level read and write latency
nodetool proxyhistograms

# Thread pool backpressure
nodetool tpstats

# Disk space headroom
df -h /var/lib/cassandra/data

# Heap usage to rule out GC pressure
nodetool info | grep "Heap Memory"
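
To log these values over time rather than read them by eye, the command output can be parsed. The sketch below assumes that `nodetool compactionstats` prints a line starting with `pending tasks:` and that `nodetool tablestats` prints `SSTable count:` lines; the exact layout varies by Cassandra version, so treat the patterns and the function names as assumptions to verify against your own output.

```python
import re


def parse_pending_tasks(compactionstats_output):
    """Return the pending task total, or None if the line is absent."""
    match = re.search(r"^pending tasks:\s*(\d+)", compactionstats_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def parse_sstable_counts(tablestats_output):
    """Return every 'SSTable count' value in the order it appears."""
    return [int(n) for n in re.findall(r"SSTable count:\s*(\d+)", tablestats_output)]


# Illustrative text only, not captured from a real node.
sample_compactionstats = "pending tasks: 42\n- ks.events: 42\n"
sample_tablestats = "\t\tSSTable count: 17\n\t\tSpace used (live): 123456\n"

pending = parse_pending_tasks(sample_compactionstats)
counts = parse_sstable_counts(sample_tablestats)
```

In practice the input text could come from `subprocess.run(["nodetool", "compactionstats"], capture_output=True, text=True).stdout`, which is still read-only against the node.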

How to diagnose it

Confirm compaction is falling behind. Run nodetool compactionstats every 15 minutes for a few hours and log the output. A monotonic increase in pending tasks over three or more consecutive samples confirms the node is falling behind rather than handling a transient spike.
Identify disk saturation. Run iostat -x 1 on the data device. On SSD, sustained %util above 80% or await climbing above 10 ms indicates a disk bottleneck. Check both r_await and w_await; if compaction writes queue while read await spikes, the disk is the constraint.
Correlate with read amplification. Run nodetool tablestats for affected keyspaces. If SSTable count grows continuously, reads are touching more files. A double-digit percentage growth per day is unsustainable.
Check client impact. Run nodetool proxyhistograms. In a classic compaction spiral, write latency stays near baseline while read P99 climbs. Flat write P99 alongside climbing read P99 isolates the problem to the read path.
Rule out GC pressure. Check nodetool info for heap usage. If heap is above 75% of max after GC, you may have a GC death spiral compounding the I/O problem.
Assess disk runway. Run df -h. STCS can transiently need up to 100% additional space; running above 50% full is dangerous. With LCS or TWCS, maintain at least 30% free.
Identify triggers. Check nodetool netstats for active streaming or repair. Look for recent traffic spikes, schema changes, or compaction_throughput_mb_per_sec set below disk capability.
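
The first and third steps reduce to two small calculations, sketched here under stated assumptions. The function names are hypothetical, the three-sample minimum comes from the step above, and the daily growth figure is computed as a compound rate, which is one reasonable reading of "percentage growth per day".

```python
def is_falling_behind(pending_samples, min_samples=3):
    """True if pending tasks rose strictly across every consecutive sample."""
    if len(pending_samples) < min_samples:
        return False
    return all(later > earlier
               for earlier, later in zip(pending_samples, pending_samples[1:]))


def daily_growth_pct(count_start, count_end, days):
    """Compound daily growth rate of an SSTable count, in percent."""
    if count_start <= 0 or days <= 0:
        raise ValueError("count_start and days must be positive")
    return ((count_end / count_start) ** (1 / days) - 1) * 100


# Samples taken every 15 minutes (illustrative values).
falling_behind = is_falling_behind([18, 21, 25, 31])
transient = is_falling_behind([18, 30, 22, 19])

# SSTable count went from 100 to 121 over two days.
growth = daily_growth_pct(100, 121, days=2)
```

A strict monotonic test is deliberately conservative: one flat or falling sample resets the verdict, so it will not fire on a spike that compaction is already working through.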

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Pending compactions	The leading indicator of compaction debt	Trending upward continuously for more than 8 hours
Live SSTable count per table	Directly determines read amplification	Growing day over day; LCS L0 above 32
Disk I/O utilization	Saturated disks cannot flush or compact	`%util` above 80% sustained or `await` trending up
Client read latency P99	User-visible degradation	P99 above 3 times rolling baseline or approaching `read_request_timeout_in_ms`
Dropped messages	The node is shedding load internally	Sustained non-zero dropped read or write messages
Thread pool pending tasks	Internal backpressure is building	Sustained non-zero pending count in ReadStage or MutationStage
Disk space available	Compaction needs temporary space to run	Below 50% free for STCS, below 30% for LCS or TWCS
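
Several of the warning signs in the table are simple threshold comparisons. The sketch below encodes four of them for a single point-in-time sample; the `NodeSample` type and its field names are hypothetical, and the code cannot judge "sustained" on its own, so a real alert would require the condition to hold across several samples.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    read_p99_ms: float
    baseline_read_p99_ms: float
    disk_util_pct: float
    lcs_l0_sstables: int
    dropped_messages: int


def warning_signs(sample):
    """Return the table's warning signs that this one sample trips."""
    signs = []
    if sample.read_p99_ms > 3 * sample.baseline_read_p99_ms:
        signs.append("read P99 above 3x baseline")
    if sample.disk_util_pct > 80:
        signs.append("disk %util above 80%")
    if sample.lcs_l0_sstables > 32:
        signs.append("LCS L0 above 32 SSTables")
    if sample.dropped_messages > 0:
        signs.append("dropped messages non-zero")
    return signs


sample = NodeSample(read_p99_ms=45.0, baseline_read_p99_ms=10.0,
                    disk_util_pct=92.0, lcs_l0_sstables=12, dropped_messages=0)
tripped = warning_signs(sample)
```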

Fixes

Immediate relief

Throttle non-critical writes. Reduce write rate at the application layer. Pausing bulk loads or deferring non-essential mutations gives compaction room to catch up.

Cancel or defer repair and streaming. Repair generates anti-compaction SSTables and consumes disk I/O and network bandwidth. Defer scheduled repairs and avoid starting new ones until the backlog clears. If a repair is already running, abort it only if your operational procedures support a clean stop.

Raise compaction throughput. If the disk has headroom, increase the compaction I/O cap:

nodetool setcompactionthroughput 128

Monitor iostat after the change. If read latency regresses, reduce the cap. The default is 64 MB/s in recent releases (3.x releases shipped with 16 MB/s); raising it helps only if the storage device can deliver more.

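Increase concurrent compactors. If CPU is not saturated, raise concurrent_compactors in cassandra.yaml. A cassandra.yaml change takes effect only after a restart, which is best avoided mid-incident, so where your version provides it, prefer nodetool setconcurrentcompactors to apply the change at runtime.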

Strategy and hardware mismatches

LCS on write-heavy workloads. If L0 is above 32 and growing, LCS cannot promote SSTables to L1 fast enough. LCS is designed for read-heavy workloads. Plan a migration to STCS or UCS (UCS requires Cassandra 5.0 or later) for write-heavy tables. Schedule this during a maintenance window.

STCS space amplification. If disk usage is above 50% with STCS, compaction may stall because it cannot allocate temporary space for major compactions. Free space by removing old snapshots:

nodetool clearsnapshot --all

Warning: nodetool clearsnapshot --all deletes all snapshots irreversibly. Verify your backup and recovery procedures do not depend on them before running this.

If growth is sustained, add storage capacity.

Insufficient disk IOPS. If iostat shows saturation and the device is a spinning disk, the only durable fix is faster storage. SSDs are strongly recommended for Cassandra data directories. Ensure commitlog and data directories are on separate devices so sequential commitlog writes do not compete with compaction I/O.

What to avoid

Do not restart the node to “clear the backlog.” A restart flushes all memtables, creating a burst of small SSTables that adds to the compaction debt.

Do not trigger a manual major compaction to clear the backlog on a saturated node. It will consume massive temporary disk space and I/O, likely worsening the spiral.

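Do not alter the compaction strategy on a node that is already I/O saturated unless you can tolerate the additional load. Switching strategies causes existing SSTables to be recompacted under the new strategy, which adds I/O at the worst possible time.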

Prevention

Watch the trend, not the absolute. A pending compaction count of 25 may be normal for your cluster. A count that rises from 15 to 25 over three days is a warning. Alert on the derivative of pending tasks over a 24-hour window.
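
One way to express "the derivative of pending tasks" is a least-squares slope over the samples in the window. This is a minimal sketch with a hypothetical function name; it assumes samples are `(hours_since_start, pending_tasks)` pairs and leaves the alert threshold to you, since a normal slope depends on the cluster.

```python
def pending_slope_per_hour(samples):
    """Least-squares slope of pending tasks, in tasks per hour.

    samples: list of (hours_since_start, pending_tasks) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_p = sum(p for _, p in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (p - mean_p) for t, p in samples)
    return cov / var_t


# The example above: 15 pending tasks rising to 25 over three days.
slope = pending_slope_per_hour([(0, 15), (72, 25)])
tasks_per_day = slope * 24
```

A fitted slope is less sensitive to a single noisy sample than subtracting the first value from the last, which matters when pending counts fluctuate within the window.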

Maintain strategy-specific disk headroom. STCS can need up to 100% additional temporary space during major compaction. Never let STCS run above 50% disk utilization. For LCS and TWCS, keep at least 30% free.
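
The headroom rule can be checked mechanically. The sketch below hard-codes the free-space fractions from this guide; the table name and function name are hypothetical, and the byte values are illustrative.

```python
# Minimum free fraction of the data volume per strategy, from this guide.
REQUIRED_FREE_FRACTION = {"STCS": 0.50, "LCS": 0.30, "TWCS": 0.30}


def has_headroom(strategy, used_bytes, total_bytes):
    """True if the volume keeps the free space this guide recommends."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_bytes = total_bytes - used_bytes
    return free_bytes >= REQUIRED_FREE_FRACTION[strategy] * total_bytes


TB = 1024 ** 4
stcs_ok = has_headroom("STCS", used_bytes=int(0.4 * TB), total_bytes=TB)
stcs_risky = has_headroom("STCS", used_bytes=int(0.6 * TB), total_bytes=TB)
lcs_ok = has_headroom("LCS", used_bytes=int(0.6 * TB), total_bytes=TB)
```

The same 60% usage passes for LCS and fails for STCS, which is why a single cluster-wide disk alert threshold is not enough. On a live node, `shutil.disk_usage("/var/lib/cassandra/data")` from the standard library returns the `total` and `used` values this function expects.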

Schedule maintenance operations off-peak. Run repair, streaming, and major topology changes during low-traffic windows. These operations compete for the same disk and network resources as compaction.

Monitor per-table SSTable counts. Aggregate cluster metrics hide per-table skew. A single write-heavy table with unbounded partition growth can generate enough SSTables to saturate a node.

Size for compaction throughput, not just capacity. A node can have terabytes of free disk but still enter a death spiral if its IOPS are too low to merge SSTables at the incoming write rate.

Size compaction_throughput_mb_per_sec (renamed compaction_throughput in Cassandra 4.1) to your storage capability. If your data devices are fast NVMe arrays, the default 64 MB/s cap may artificially throttle compaction.

How Netdata helps

Correlate pending compaction tasks with disk utilization and I/O wait on the data volume to confirm when compaction starves reads of I/O.
Track live SSTable count per table to catch read amplification before P99 latency breaches SLA.
Overlay read latency P99 with compaction pending tasks to expose the lag between compaction debt and client impact.
Monitor thread pool pending tasks for CompactionExecutor and ReadStage to detect internal backpressure before it surfaces as timeouts.
Watch disk space percentage alongside Cassandra data load to account for snapshot accumulation and compaction temporary files.

