Cassandra tombstone storm: delete-heavy tables and read latency collapse

DELETEs and TTL expirations write tombstones. Read latency bifurcates: P50 stays flat while P99 spikes. Logs show queries scanning thousands of tombstones; eventually clients see queries abort after crossing tombstone_failure_threshold. Disk space does not shrink despite deletions. This is a tombstone storm.

Cassandra does not remove deleted data immediately. A DELETE inserts a tombstone marker that persists until gc_grace_seconds has elapsed and a compaction that includes it finds no older data left for it to shadow in other SSTables. Repair must reach every replica within that window; otherwise a replica that missed the delete can bring the data back after the tombstone is purged. When tombstones scatter across many SSTables, every read must scan and merge them, generating temporary heap objects and escalating GC pressure. Only a subset of partitions may be affected, which is why P50 stays flat while tail latency explodes.

If left unchecked, reads start crossing tombstone_failure_threshold (default 100000 tombstones scanned by a single query) and Cassandra aborts them. Before that, long GC pauses from tombstone merges can make peers mark the node DOWN through gossip failure detection.

What this means

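Tombstones are normal in Cassandra’s LSM storage engine. They are written on DELETE and TTL expiration. To be dropped during compaction, a tombstone must be older than gc_grace_seconds (default 864000 seconds, or 10 days) and must not shadow data in SSTables outside that compaction. Repair is what makes the drop safe: if a replica missed the delete and is not repaired within gc_grace_seconds, the deleted data can reappear once the tombstone is purged elsewhere. By default Cassandra does not check repair state before purging. Tables that set the only_purge_repaired_tombstones compaction option keep tombstones in unrepaired SSTables, so on those tables missing repair blocks purge outright.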
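The timing rule can be worked as simple arithmetic. The sketch below is illustrative only: purgeable_at and is_purgeable are hypothetical helper names, and passing the age check is necessary but not sufficient, since a compaction still has to pick up the SSTable that holds the tombstone.

```shell
#!/bin/sh
# Hypothetical helper: earliest epoch second at which a tombstone passes the
# gc_grace_seconds age check, given the time the delete was written.
purgeable_at() {
    deleted_at=$1              # epoch seconds when the DELETE was written
    gc_grace=${2:-864000}      # table gc_grace_seconds; default is 10 days
    echo $((deleted_at + gc_grace))
}

# Hypothetical helper: prints "yes" if the tombstone is old enough at time $3
# (defaults to the current time), otherwise "no".
is_purgeable() {
    now=${3:-$(date +%s)}
    if [ "$now" -ge "$(purgeable_at "$1" "$2")" ]; then
        echo yes
    else
        echo no
    fi
}
```

A delete written at epoch 1700000000 with the default setting passes the age check at 1700864000. The gc_grace_seconds value actually in effect for a table can be read from system_schema.tables.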

When tombstones accumulate across dozens of SSTables, read amplification skyrockets. Each replica serving the read consults every SSTable that might contain the partition, reads the tombstones, and merges them to determine which cells are live. That merge allocates temporary heap objects. Under heavy read load, the garbage collector spends more time collecting merge garbage, latency spikes, and the node may enter a GC death spiral.

flowchart TD
    A[Delete or TTL expiry writes tombstone] --> B[Tombstones scatter across SSTables]
    B --> C[Repair missing or wrong compaction strategy]
    C --> D[SSTable count rises]
    D --> E[Reads scan and merge dead data]
    E --> F[GC pressure from merge garbage]
    E --> G[P99 latency collapse]
    G --> H[Query abort at 100000 tombstones]
    F --> I[Long pauses trigger gossip failure]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Delete-heavy workload without TWCS | P99 read latency spikes on specific tables; tombstone warnings in logs | nodetool cfstats Droppable tombstone ratio |
| Repair not running within gc_grace_seconds | Tombstones cannot be purged safely, or never drop on tables using only_purge_repaired_tombstones; disk space flat | system_distributed.repair_history or nodetool repair_admin list |
| Wide partitions with range deletes | Single partition hits failure threshold; logs name the partition key | nodetool tablehistograms for partition size; nodetool cfstats for maximum tombstones per slice |
| Compaction falling behind | SSTable count growing; pending tasks trending upward | nodetool compactionstats and SSTable count |

Quick checks

# Check for tombstone scan warnings and query abortions
grep -E "tombstone cells for query|Scanned over .* tombstones|tombstone_(warn|failure)_threshold" /var/log/cassandra/system.log

# Check per-table tombstone ratio, tombstones per slice, and SSTable count
nodetool cfstats <keyspace>.<table> | grep -E "SSTable count|Droppable tombstone|tombstones per slice"

# View per-table partition size and cell count percentiles
nodetool tablehistograms <keyspace> <table>

# Check compaction backlog
nodetool compactionstats

# Verify repair has completed within gc_grace_seconds
nodetool repair_admin list
# On 3.x: cqlsh -e "SELECT * FROM system_distributed.repair_history;"

# Coarse check of current heap utilization; pair with GC logs for allocation pressure
nodetool info | grep "Heap Memory"

# Check if disk I/O is saturated and starving compaction
iostat -x 1
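
To see which tables generate the warnings, the log lines can be tallied per table. This is a rough sketch that assumes the warning wording used by recent 3.x and 4.x releases ("Read N live rows and M tombstone cells for query SELECT ... FROM keyspace.table ... (see tombstone_warn_threshold)"); the wording differs between versions, and tombstone_warnings_by_table is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count tombstone warnings per table and report the
# largest tombstone count seen. Reads the files given as arguments, or stdin.
tombstone_warnings_by_table() {
    awk '
    /tombstone cells for query/ {
        cells = 0; table = "unknown"
        for (i = 1; i <= NF; i++) {
            # "... and <M> tombstone cells ...": M is the field before "tombstone"
            if ($i == "tombstone" && $(i + 1) == "cells") cells = $(i - 1) + 0
            # assume the first identifier after FROM is keyspace.table
            if ($i == "FROM" && table == "unknown") table = $(i + 1)
        }
        count[table]++
        if (cells > max[table]) max[table] = cells
    }
    END {
        for (t in count)
            printf "%s warnings=%d max_cells=%d\n", t, count[t], max[t]
    }' "$@" | sort
}
```

For example, tombstone_warnings_by_table /var/log/cassandra/system.log prints one line per table, which narrows down where to run the per-table checks.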

How to diagnose it

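  1. Confirm the symptom in logs. Search system.log for tombstone scan warnings. Cassandra logs a warning when a single read scans more than tombstone_warn_threshold tombstones (default 1000); sustained warnings indicate active accumulation. Query abortions confirm the storm has reached critical severity.
  2. Identify affected tables. Run nodetool cfstats <keyspace>.<table> (nodetool tablestats on current versions, where cfstats is a deprecated alias). Look for a high Droppable tombstone ratio and elevated SSTable count. A ratio approaching 1.0 means most of what the table's SSTables hold is tombstones already past gc_grace_seconds.
  3. Inspect tombstone distribution. Compare the average and maximum tombstones per slice reported by nodetool cfstats, and use nodetool tablehistograms to check partition size and cell count percentiles. If the maximum is far above the average, a subset of partitions is carrying the load.
  4. Verify repair status. Check system_distributed.repair_history or nodetool repair_admin list. If the last successful repair is older than gc_grace_seconds, tombstones cannot be purged safely: a replica that missed a delete can resurrect the data once the tombstone is dropped. On tables with only_purge_repaired_tombstones enabled, unrepaired tombstones are never purged, so missing repair blocks purge entirely.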
  5. Correlate with GC behavior. Parse GC logs for long pauses. Tombstone-heavy reads allocate large temporary structures during SSTable merging. If GC pause duration is increasing and aligns with read latency spikes, merge garbage is pressuring the heap.
  6. Assess compaction health. Run nodetool compactionstats. If pending tasks are trending upward over hours, compaction is losing ground and cannot reclaim tombstones or disk.
  7. Check for I/O saturation. Use iostat -x on the data volume. If %util is high and await is elevated, compaction may be starved of disk bandwidth, preventing tombstone purge.
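
Step 2 can be scripted when there are many tables to review. The sketch below assumes the usual nodetool tablestats layout ("Keyspace : name", "Table: name", "SSTable count: N", "Droppable tombstone ratio: X"), which may vary by version. flag_tombstone_tables is a hypothetical helper, and the 0.5 ratio cutoff is an arbitrary illustrative default rather than a recommended value.

```shell
#!/bin/sh
# Hypothetical helper: read nodetool tablestats output on stdin and print the
# tables whose droppable tombstone ratio or SSTable count exceeds a cutoff.
flag_tombstone_tables() {
    max_ratio=${1:-0.5}      # illustrative cutoff, tune for your workload
    max_sstables=${2:-50}    # STCS warning level used in this guide
    awk -v r="$max_ratio" -v s="$max_sstables" '
    /^Keyspace/                          { ks = $NF }
    /^[ \t]*Table:/                      { tbl = $NF }
    /^[ \t]*SSTable count:/              { sst = $NF + 0 }
    /^[ \t]*Droppable tombstone ratio:/  {
        ratio = $NF + 0
        if (ratio > r || sst > s)
            printf "%s.%s sstables=%d droppable_ratio=%.2f\n", ks, tbl, sst, ratio
    }'
}
```

Typical use would be nodetool tablestats <keyspace> | flag_tombstone_tables 0.5 50, followed by the distribution and repair checks on whatever it prints.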

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Tombstone scan warnings | Direct measure of dead data scanned per query | Sustained warnings (reads scanning more than 1000 tombstones) |
| Droppable tombstone ratio | Share of a table's on-disk data that is tombstones eligible for purge | Ratio approaching 1.0 |
| Read latency P99 | Client-visible impact from scanning dead rows | P99 above 3x rolling baseline |
| GC pause duration | Tombstone merges create heap garbage | Pauses above 500 ms and increasing |
| Pending compactions | Compaction must run to purge tombstones | Trending upward over 4+ hours |
| Repair completion | Required before tombstones can be safely dropped | Last repair older than 80% of gc_grace_seconds |
| SSTable count per table | Read amplification rises with SSTable count | STCS above 50, LCS L0 above 32 |
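
The repair signal reduces to the age of the last successful repair as a percentage of gc_grace_seconds. A minimal sketch follows, assuming you already have the completion time of the last repair as epoch seconds (for example from system_distributed.repair_history); repair_age_pct and repair_status are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: age of the last successful repair as a whole-number
# percentage of gc_grace_seconds.
repair_age_pct() {
    last_repair=$1             # epoch seconds of last successful repair
    gc_grace=${2:-864000}      # table gc_grace_seconds; default is 10 days
    now=${3:-$(date +%s)}
    echo $(( (now - last_repair) * 100 / gc_grace ))
}

# Hypothetical helper: map the percentage onto the warning level in the table.
repair_status() {
    pct=$(repair_age_pct "$@")
    if [ "$pct" -ge 100 ]; then
        echo "CRITICAL ${pct}%"
    elif [ "$pct" -ge 80 ]; then
        echo "WARNING ${pct}%"
    else
        echo "OK ${pct}%"
    fi
}
```

The 80% level mirrors the warning sign above; an earlier alert level can be added in the same way if you want more lead time.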

Fixes

Run targeted compaction for immediate relief

If a specific table is saturated, run nodetool compact <keyspace> <table>. This forces a major compaction that can purge eligible tombstones and reduce SSTable count. WARNING: this temporarily doubles disk usage for the table while it writes the new SSTable, and is I/O-intensive. Run during low traffic; monitor disk space, I/O, and latency.
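
Because the rewrite can need roughly as much free space as the table already occupies, it is worth checking headroom first. The helpers below are a hypothetical sketch; the data directory layout in the usage note (/var/lib/cassandra/data/<keyspace>/<table>-*) is the common default and is an assumption about your install.

```shell
#!/bin/sh
# Hypothetical helpers for a pre-compaction disk headroom check.

# Size of a directory in KiB.
dir_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Free space in KiB on the filesystem that holds the given path.
volume_free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds only if free space exceeds the table's current on-disk size.
compaction_headroom_ok() {
    need_kb=$1
    avail_kb=$2
    [ "$avail_kb" -gt "$need_kb" ]
}
```

A possible use: compaction_headroom_ok "$(dir_size_kb <table data dir>)" "$(volume_free_kb /var/lib/cassandra/data)" && nodetool compact <keyspace> <table>. Snapshots and other tables compacting at the same time also consume space, so treat a pass as a minimum requirement.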

Complete repair to unblock tombstone purge

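If repair has not run within gc_grace_seconds, schedule it immediately. In Cassandra 4.0+, use incremental repair. Until repair completes, purging tombstones risks resurrecting deleted data on replicas that missed the delete, and tables with only_purge_repaired_tombstones enabled will not drop them at all even if compaction runs. After repair finishes, run compaction on the affected table.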

Switch to TWCS for TTL-dominated tables

If the workload is time-series with TTL, alter the table to use TimeWindowCompactionStrategy. TWCS drops entire SSTables once all data in a window expires, avoiding cross-window tombstone pollution. Altering strategy does not immediately rewrite existing SSTables; old SSTables compact under the previous strategy until rewritten. A manual major compaction may be needed to unify the layout, which is expensive. Plan for I/O cost and schedule outside peak hours.
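
The strategy change itself is a single ALTER TABLE statement. The sketch below only builds the statement so it can be reviewed before anything is applied; twcs_alter_cql is a hypothetical helper, and the one-day window default is illustrative, since the right window size depends on your TTL and write rate.

```shell
#!/bin/sh
# Hypothetical helper: print the CQL that switches a table to TWCS.
# Arguments: keyspace, table, window unit (MINUTES, HOURS or DAYS), window size.
twcs_alter_cql() {
    ks=$1
    table=$2
    unit=${3:-DAYS}
    size=${4:-1}
    printf "ALTER TABLE %s.%s WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': '%s', 'compaction_window_size': '%s'};\n" \
        "$ks" "$table" "$unit" "$size"
}
```

After reviewing the output, it could be applied with cqlsh -e "$(twcs_alter_cql <keyspace> <table> DAYS 1)".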

Address delete-heavy application patterns

Review whether the application uses Cassandra as a queue with continuous INSERT followed by DELETE. This anti-pattern scatters tombstones across SSTables that rarely compact together. Replace it with time-bucketed tables or soft-delete flags. If you must delete frequently, match the compaction strategy and repair cadence to the delete rate.

Increase compaction throughput temporarily

If compaction is I/O-starved, increase compaction_throughput_mb_per_sec or add dedicated disk IOPS. Ensure concurrent_compactors is sized for the CPU count. Compaction cannot purge tombstones if it cannot read and write SSTables fast enough.
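
The throttle can also be changed on a running node with nodetool setcompactionthroughput, which does not persist across restarts. The wrapper below is a sketch: boost_compaction is a hypothetical name, and the parsing assumes nodetool getcompactionthroughput prints the current value as the first number on its first line, which may differ by version.

```shell
#!/bin/sh
# Hypothetical helper: raise the compaction throttle on the local node and
# print the command that restores the previous value.
boost_compaction() {
    new_mbps=$1
    # Assumed output shape: "Current compaction throughput: 16 MB/s"
    old_mbps=$(nodetool getcompactionthroughput |
        awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /^[0-9.]+$/) { print $i; exit } }')
    nodetool setcompactionthroughput "$new_mbps"
    echo "restore with: nodetool setcompactionthroughput ${old_mbps}"
}
```

Calling boost_compaction 64 raises the limit to 64 and leaves a reminder of the earlier setting. Watch iostat and read latency while the higher limit is in effect, and restore the old value once pending compactions drain.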

Prevention

  • TWCS for TTL tables. TWCS drops whole expired windows cleanly, preventing tombstones from scattering across SSTables.
  • Repair monitoring. Alert when the last successful repair is older than 50% of gc_grace_seconds to prevent silent accumulation.
  • Per-table tombstone metrics. Monitor the tombstones-per-slice figures in nodetool tablestats or the system_views.tombstones_per_read virtual table (4.1+) to catch growth before it breaches thresholds.
  • Compaction headroom. Size disk IOPS and concurrent_compactors so compaction keeps pace with writes.
  • Avoid wide partitions with range deletes. Large partitions amplify the cost of tombstone scans and are harder to purge.

How Netdata helps

  • Correlates tombstone scan warnings with per-table P99 read latency spikes to isolate affected tables.
  • Tracks GC pause duration alongside read latency to reveal merge garbage pressure before it becomes a death spiral.
  • Monitors pending compactions and SSTable counts per table to catch compaction backlog early.
  • Surfaces repair timing relative to gc_grace_seconds to expose purge blockers before they silently accumulate.
  • Visualizes disk I/O saturation during compaction catch-up to spot bandwidth starvation.