Cassandra tombstone storm: delete-heavy tables and read latency collapse
DELETEs and TTL expirations write tombstones. Read latency bifurcates: P50 stays flat while P99 spikes. Logs show queries scanning thousands of tombstones; eventually clients see queries abort after crossing tombstone_failure_threshold. Disk space does not shrink despite deletions. This is a tombstone storm.
Cassandra does not remove deleted data immediately. A DELETE writes a tombstone marker that persists until compaction purges it, which is safe only after gc_grace_seconds has elapsed and repair has propagated the delete to every replica. When tombstones scatter across many SSTables, every read of an affected partition must scan and merge them, generating temporary heap objects and escalating GC pressure. Often only a subset of partitions is affected, which is why P50 stays flat while tail latency explodes.
If left unchecked, a single read eventually scans more tombstones than tombstone_failure_threshold (default 100000) and Cassandra aborts the query. Before that, long GC pauses from tombstone merges can cause gossip to mark the node DOWN.
What this means
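Tombstones are normal in Cassandra’s LSM storage engine. They are written on DELETE and TTL expiration. Compaction can drop a tombstone only once it is older than gc_grace_seconds (default 10 days) and no SSTable outside that compaction still holds older data for the same partition. Repair must also complete on every replica within that window. By default Cassandra does not verify this before purging, so a replica that missed the delete can resurrect the data; with the compaction option only_purge_repaired_tombstones enabled, tombstones in unrepaired SSTables are never purged.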
When tombstones accumulate across dozens of SSTables, read amplification skyrockets. The replica serving the read consults every SSTable that might contain the partition, reads the tombstones, and merges them to determine which cells are live. That merge allocates temporary heap objects. Under heavy read load, the garbage collector spends more time cleaning up merge debris, latency spikes, and the node may enter a GC death spiral.
flowchart TD
A[Delete or TTL expiry writes tombstone] --> B[Tombstones scatter across SSTables]
B --> C[Repair missing or wrong compaction strategy]
C --> D[SSTable count rises]
D --> E[Reads scan and merge dead data]
E --> F[GC pressure from merge garbage]
E --> G[P99 latency collapse]
G --> H[Query abort at 100000 tombstones]
F --> I[Long pauses trigger gossip failure]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Delete-heavy workload without TWCS | P99 read latency spikes on specific tables; tombstone warnings in logs | nodetool cfstats Droppable tombstone ratio |
| Repair not running within gc_grace_seconds | Tombstones linger past gc_grace_seconds; disk space flat | system_distributed.repair_history or nodetool repair_admin list |
| Wide partitions with range deletes | Single partition hits failure threshold; logs name the partition key | nodetool tablehistograms for partition size; nodetool tablestats for tombstones per slice |
| Compaction falling behind | SSTable count growing; pending tasks trending up | nodetool compactionstats and SSTable count |
Quick checks
# Check for tombstone scan warnings and query abortions
grep -E "Scanned over .* tombstones|tombstone_warn_threshold|tombstone_failure_threshold" /var/log/cassandra/system.log
# Check per-table tombstone ratio and SSTable count
nodetool cfstats <keyspace>.<table> | grep -E "SSTable count|Droppable tombstone"
# View per-table percentile histograms (SSTables per read, partition size, cell count)
nodetool tablehistograms <keyspace> <table>
# Check compaction backlog
nodetool compactionstats
# Verify repair has completed within gc_grace_seconds
nodetool repair_admin list
# On 3.x: cqlsh -e "SELECT * FROM system_distributed.repair_history;"
# Coarse check of current heap utilization; pair with GC logs for allocation pressure
nodetool info | grep "Heap Memory"
# Check if disk I/O is saturated and starving compaction
iostat -x 1
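The raw grep output above gets noisy on a busy node, so it can help to roll the warnings up per table. The following is a minimal sketch, not a definitive parser: the function name summarize_tombstone_warnings is hypothetical, and it assumes warning lines of the form "Read N live rows and M tombstone cells for query SELECT ... FROM ks.table ...", which varies between Cassandra versions. Adjust the matching if your logs differ.

```bash
# Hypothetical helper: summarize tombstone warnings per table from log lines on stdin.
# ASSUMPTION: warning lines look like
#   "... Read <live> live rows and <dead> tombstone cells for query SELECT ... FROM <ks>.<table> ..."
summarize_tombstone_warnings() {
  awk '
    /tombstone cells for query/ {
      dead = 0; table = "unknown"
      for (i = 2; i <= NF; i++) {
        # The number just before "tombstone cells" is the tombstone count.
        if ($i == "tombstone" && $(i + 1) == "cells") dead = $(i - 1) + 0
        # The token after FROM is the keyspace-qualified table name.
        if ($i == "FROM") table = $(i + 1)
      }
      count[table]++
      if (dead > max[table] + 0) max[table] = dead
    }
    END {
      for (t in count) printf "%s warnings=%d max_tombstones=%d\n", t, count[t], max[t]
    }
  ' | sort
}

# Example usage (log path as used in the checks above):
#   summarize_tombstone_warnings < /var/log/cassandra/system.log
```

Tables that show both a high warning count and a high maximum are the first candidates for the per-table checks that follow.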
How to diagnose it
- Confirm the symptom in logs. Search system.log for tombstone scan warnings. Sustained warnings above 1000 tombstones per read indicate active accumulation. Query abortions confirm the storm has reached critical severity.
- Identify affected tables. Run nodetool cfstats <keyspace>.<table> (nodetool tablestats is the current name for the same command). Look for a high Droppable tombstone ratio and elevated SSTable count. A ratio approaching 1.0 means nearly all data in live SSTables is dead.
- Inspect tombstone distribution. Use nodetool tablehistograms for partition size and cell count percentiles, and compare the average and maximum tombstones per slice reported by nodetool tablestats. If the maximum is far above the average, a subset of partitions is carrying the load.
- Verify repair status. Check system_distributed.repair_history or nodetool repair_admin list. If the last successful repair is older than gc_grace_seconds, tombstones cannot be purged safely: dropping them before every replica has seen the delete can resurrect deleted data.
- Correlate with GC behavior. Parse GC logs for long pauses. Tombstone-heavy reads allocate large temporary structures during SSTable merging. If GC pause duration is increasing and aligns with read latency spikes, merge garbage is pressuring the heap.
- Assess compaction health. Run nodetool compactionstats. If pending tasks are trending upward over hours, compaction is losing ground and cannot reclaim tombstones or disk.
- Check for I/O saturation. Use iostat -x on the data volume. If %util is high and await is elevated, compaction may be starved of disk bandwidth, preventing tombstone purge.
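The table-identification step can be scripted when a keyspace has many tables. This is a rough sketch under stated assumptions: flag_tombstone_tables is a hypothetical name, the parser assumes the "Keyspace :", "Table:", "SSTable count:" and "Droppable tombstone ratio:" labels appear in the stats output as this guide describes, and the default limits are illustrative rather than recommended values.

```bash
# Hypothetical helper: read `nodetool tablestats` output on stdin and print tables
# whose droppable tombstone ratio or SSTable count meets the given limits.
# ASSUMPTION: the field labels below match your Cassandra version's output.
flag_tombstone_tables() {
  local ratio_limit="${1:-0.5}" sstable_limit="${2:-50}"
  awk -v rl="$ratio_limit" -v sl="$sstable_limit" '
    /^[ \t]*Keyspace[ \t]*:/ { ks = $NF }
    /^[ \t]*Table:/          { tbl = $NF; sst = 0 }
    /^[ \t]*SSTable count:/  { sst = $NF + 0 }
    /^[ \t]*Droppable tombstone ratio:/ {
      ratio = $NF + 0
      if (ratio >= rl + 0 || sst >= sl + 0)
        printf "%s.%s sstables=%d droppable_ratio=%.2f\n", ks, tbl, sst, ratio
    }
  '
}

# Example usage:
#   nodetool tablestats <keyspace> | flag_tombstone_tables 0.5 50
```

The output is only a shortlist; confirm each flagged table against the log warnings and repair status before acting on it.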
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Tombstone scan warnings | Direct measure of dead data scanned per query | Sustained warnings above 1000 tombstones |
| Droppable tombstone ratio | Estimates the share of stored data that is tombstones eligible for purge | Ratio approaching 1.0 |
| Read latency P99 | Client-visible impact from scanning dead rows | P99 above 3x rolling baseline |
| GC pause duration | Tombstone merges create heap garbage | Pauses above 500 ms and increasing |
| Pending compactions | Compaction must run to purge tombstones | Trending upward over 4+ hours |
| Repair completion | Required before tombstones can be dropped safely | Last repair older than 80% of gc_grace_seconds |
| SSTable count per table | Read amplification rises with SSTable count | STCS above 50, LCS L0 above 32 |
Fixes
Run targeted compaction for immediate relief
If a specific table is saturated, run nodetool compact <keyspace> <table>. This forces a major compaction that can purge eligible tombstones and reduce SSTable count. WARNING: this temporarily doubles disk usage for the table while it writes the new SSTable, and is I/O-intensive. Run during low traffic; monitor disk space, I/O, and latency.
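Because a major compaction can need as much free space as the table already occupies, a quick headroom check before starting is cheap insurance. The sketch below is illustrative only: has_compaction_headroom and its 10% default margin are assumptions, not a Cassandra feature, and it takes the worst case where the rewritten SSTable is as large as the input.

```bash
# Hypothetical pre-flight check for a major compaction.
# Arguments: live table size in bytes, free bytes on the data volume,
# optional safety margin in percent (default 10, an arbitrary choice).
# Returns 0 (success) when free space covers the table size plus the margin.
has_compaction_headroom() {
  local table_bytes="$1" free_bytes="$2" margin_pct="${3:-10}"
  local needed=$(( table_bytes * (100 + margin_pct) / 100 ))
  [ "$free_bytes" -ge "$needed" ]
}

# Example usage, with sizes taken from your own tablestats and df output:
#   if has_compaction_headroom 500000000000 620000000000; then
#     nodetool compact <keyspace> <table>
#   else
#     echo "not enough free disk for a major compaction" >&2
#   fi
```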
Complete repair to unblock tombstone purge
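If repair has not run within gc_grace_seconds, schedule it immediately. In Cassandra 4.0+, use incremental repair. Until repair completes, purging tombstones is unsafe: a replica that missed the delete can resurrect the data, and tables with only_purge_repaired_tombstones enabled keep unrepaired tombstones indefinitely. After repair finishes, run compaction on the affected table.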
Switch to TWCS for TTL-dominated tables
If the workload is time-series with TTL, alter the table to use TimeWindowCompactionStrategy. TWCS drops entire SSTables once all data in a window expires, avoiding cross-window tombstone pollution. Altering strategy does not immediately rewrite existing SSTables; old SSTables compact under the previous strategy until rewritten. A manual major compaction may be needed to unify the layout, which is expensive. Plan for I/O cost and schedule outside peak hours.
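The strategy change itself is a single CQL statement. As a minimal sketch, the helper below only prints that statement so it can be reviewed before it is applied; twcs_alter_cql is a hypothetical name, and the window size and unit in the example are placeholders to be matched to your TTL and write rate.

```bash
# Hypothetical helper: print the CQL that switches a table to TWCS.
# Arguments: keyspace, table, window size (default 1), window unit (default DAYS).
# It does not contact the cluster; it only formats the statement.
twcs_alter_cql() {
  local ks="$1" tbl="$2" size="${3:-1}" unit="${4:-DAYS}"
  printf "ALTER TABLE %s.%s WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': '%s', 'compaction_window_size': %s};\n" \
    "$ks" "$tbl" "$unit" "$size"
}

# Example usage: inspect the statement, then apply it with cqlsh.
#   twcs_alter_cql <keyspace> <table> 1 DAYS
#   cqlsh -e "$(twcs_alter_cql <keyspace> <table> 1 DAYS)"
```

Printing first keeps the change reviewable, which matters here because the new strategy applies to future compactions while existing SSTables keep their old layout until rewritten.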
Address delete-heavy application patterns
Review whether the application uses Cassandra as a queue with continuous INSERT followed by DELETE. This anti-pattern scatters tombstones across SSTables that rarely compact together. Replace it with time-bucketed tables or soft-delete flags. If you must delete frequently, match the compaction strategy and repair cadence to the delete rate.
Increase compaction throughput temporarily
If compaction is I/O-starved, increase compaction_throughput_mb_per_sec or add dedicated disk IOPS. Ensure concurrent_compactors is sized for the CPU count. Compaction cannot purge tombstones if it cannot read and write SSTables fast enough.
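As a starting point for sizing, the comments in cassandra.yaml describe the default for concurrent_compactors as the smaller of the disk count and the core count, clamped between 2 and 8. The sketch below reproduces that rule of thumb; the function name is hypothetical and the result is a baseline to tune from, not a recommendation for every workload.

```bash
# Hypothetical helper: baseline for concurrent_compactors, following the default
# described in cassandra.yaml: min(cores, data disks), clamped to the range 2..8.
default_concurrent_compactors() {
  local cores="$1" disks="$2"
  local n=$(( cores < disks ? cores : disks ))
  if (( n < 2 )); then n=2; fi
  if (( n > 8 )); then n=8; fi
  echo "$n"
}

# Example usage:
#   default_concurrent_compactors "$(nproc)" 4
#
# Throughput can be inspected and raised at runtime without a restart
# (value in MB/s; 64 is an arbitrary example, 0 removes the throttle):
#   nodetool getcompactionthroughput
#   nodetool setcompactionthroughput 64
```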
Prevention
- TWCS for TTL tables. TWCS drops whole expired windows cleanly, preventing tombstones from scattering across SSTables.
- Repair monitoring. Alert at 50% of gc_grace_seconds to prevent silent accumulation.
- Per-table tombstone histograms. Monitor the tombstones-per-slice figures in nodetool tablestats or the system_views.tombstones_per_read virtual table (4.1+) to catch growth before it breaches thresholds.
- Compaction headroom. Size disk IOPS and concurrent_compactors so compaction keeps pace with writes.
- Avoid wide partitions with range deletes. Large partitions amplify the cost of tombstone scans and are harder to purge.
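The repair alert reduces to simple arithmetic on the age of the last successful repair. The sketch below is one possible mapping, assuming the 50% and 80% figures used in this guide as the warning and critical levels; repair_age_status is a hypothetical name, and how you obtain the last repair timestamp depends on your repair tooling.

```bash
# Hypothetical helper: classify repair age as a percentage of gc_grace_seconds.
# Arguments: last successful repair (epoch seconds), gc_grace_seconds,
# optional current time in epoch seconds (defaults to now).
# Levels (assumed from this guide): 50% warn, 80% critical, 100% window missed.
repair_age_status() {
  local last_repair="$1" gc_grace="$2" now="${3:-$(date +%s)}"
  local age=$(( now - last_repair ))
  local pct=$(( age * 100 / gc_grace ))
  if   (( pct >= 100 )); then echo "EXPIRED ${pct}%"
  elif (( pct >= 80 ));  then echo "CRIT ${pct}%"
  elif (( pct >= 50 ));  then echo "WARN ${pct}%"
  else                        echo "OK ${pct}%"
  fi
}

# Example usage with the default gc_grace_seconds of 10 days (864000 seconds):
#   repair_age_status <last_repair_epoch> 864000
```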
How Netdata helps
- Correlates tombstone scan warnings with per-table P99 read latency spikes to isolate affected tables.
- Tracks GC pause duration alongside read latency to reveal merge garbage pressure before it becomes a death spiral.
- Monitors pending compactions and SSTable counts per table to catch compaction backlog early.
- Surfaces repair timing relative to gc_grace_seconds to expose purge blockers before they silently accumulate.
- Visualizes disk I/O saturation during compaction catch-up to spot bandwidth starvation.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert







