Cassandra Scanned over N tombstones warning: finding the offending query
Your application latency may still look acceptable at the median, but system.log is filling with warnings like Read 500 live rows and 12000 tombstone cells for query SELECT ... or Scanned over 5000 tombstones. These messages mean a single read is sifting through thousands of delete markers to return a small result set. Each tombstone consumes CPU, disk I/O, and heap memory during the merge phase. Left alone, the same query will eventually cross tombstone_failure_threshold (default 100000), at which point the read is aborted and the client sees a read failure. The log line gives you the query text, but the real operational work is deciding whether the root cause is a missing repair, a compaction backlog, or a data model mismatch, and then confirming which partition is the actual source of the dead data.
This article walks through locating the offending query, validating table-level impact, and correlating tombstone load with repair and compaction state. It is grounded in the signals and commands you can run on a live cluster without restarting nodes.
What this means
Cassandra keeps delete markers, called tombstones, until compaction can safely purge them. A tombstone becomes eligible for purge once gc_grace_seconds (default 864000 seconds, or 10 days) has elapsed, and the purge is only safe if repair has propagated the delete to every replica within that window. Until then, every read that touches a deleted cell must scan the tombstone, hold it in memory during the cross-SSTable merge, and then discard it. The replica must also keep scanned tombstones in memory so it can return them to the coordinator, which needs them to reconcile responses from replicas that may still hold the deleted data. With workloads that generate a lot of tombstones, this can exhaust heap and trigger long GC pauses.
When a single query scans more than tombstone_warn_threshold (default 1000) tombstones, the node executing the read logs a WARN-level message. When the count crosses tombstone_failure_threshold (default 100000), Cassandra aborts the read. Both values are defined in cassandra.yaml. The warning log line includes the CQL query text, which names the keyspace and the table. Treat these log entries as first-class metrics: sustained warnings usually indicate a table that is accumulating tombstones faster than compaction can remove them.
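Before interpreting warning counts, it helps to confirm which thresholds a node is actually running with. The following is a minimal sketch, not an official tool: the function name show_tombstone_thresholds is hypothetical, and the default path /etc/cassandra/cassandra.yaml is an assumption that varies by install method.

```bash
#!/usr/bin/env bash
# Print the tombstone thresholds that are explicitly set in a cassandra.yaml file.
# Hypothetical helper; the default path is an assumption (tarball installs
# usually keep the file under conf/ instead).
show_tombstone_thresholds() {
  local yaml=${1:-/etc/cassandra/cassandra.yaml}
  # Match only active (uncommented) settings that start at column 0.
  grep -E '^(tombstone_warn_threshold|tombstone_failure_threshold):' "$yaml"
}
```

If the function prints nothing, neither setting is uncommented in that file and the node is most likely using the defaults described above.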
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Repair has not completed within gc_grace_seconds | Tombstone warnings appear across many partitions in the same table; disk space is not reclaimed after deletions | system_distributed.repair_history or nodetool tablestats to confirm last repair time |
| TTL-driven or time-series data without TWCS | Warnings cluster on the newest or oldest time windows; SSTable count in old windows remains high because compaction cannot drop whole windows | nodetool tablestats for SSTable count per table and the current compaction strategy |
| Wide partition with many deleted rows or range deletes | nodetool toppartitions shows a single partition receiving heavy read traffic; latency spikes are isolated to one token range | Partition size and read distribution for the affected table |
| Queue-like write/read/delete pattern | Warnings correlate with a table that acts as a queue: rows are inserted, read once, then deleted immediately; this generates a tombstone per row | Application query patterns visible in the warning log |
Quick checks
Run these read-only commands on the node emitting the warnings or on any node that owns the affected token range.
```bash
# Find recent tombstone warnings and the exact query text
grep -E "Scanned over .* tombstones|Read .* live rows and .* tombstone cells" /var/log/cassandra/system.log | tail -n 20

# Check per-table tombstone distribution
nodetool tablehistograms <keyspace> <table>

# Review table health, SSTable count, and repair status
nodetool tablestats <keyspace> | grep -A 20 "Table: <table>"

# Check coordinator-level latency for tail reads
nodetool proxyhistograms

# See if compaction is falling behind
nodetool compactionstats

# Identify hot partitions that may be driving the scans
nodetool toppartitions <keyspace> <table> 1000
```
What to look for:
- system.log reveals the exact query and the tombstone count. Extract the keyspace and table name first.
- tablehistograms includes a “Tombstone Cells” row. If the p99 is above your threshold, the table is systematically tombstone-heavy, not just unlucky.
- tablestats shows SSTable count. A high count means more files to scan during merge, which amplifies the cost of each tombstone.
- proxyhistograms exposes coordinator p99 latency. Tombstone scans inflate the tail latency before they show up at the median.
- compactionstats tells you whether compaction is keeping up or creating debt. Tombstones are only removed during compaction.
- toppartitions surfaces the specific partition keys that dominate read traffic. If one partition accounts for most reads and also carries many tombstones, you have found the hotspot.
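When the log holds many warnings, ranking them by tombstone count makes the worst query obvious. The sketch below assumes the warning format quoted at the top of this article (Read N live rows and M tombstone cells for query ...); the function name rank_tombstone_warnings is hypothetical, and any trailing text after the query, which varies by Cassandra version, is kept as part of the query column.

```bash
#!/usr/bin/env bash
# Rank tombstone warnings by tombstone cell count, highest first.
# Reads system.log-style lines on stdin and prints "<tombstones><TAB><query text>".
# Assumes the "Read N live rows and M tombstone cells for query ..." format.
rank_tombstone_warnings() {
  awk '
    match($0, /Read [0-9]+ live rows and [0-9]+ tombstone cells for query /) {
      hit   = substr($0, RSTART, RLENGTH)      # the matched prefix
      query = substr($0, RSTART + RLENGTH)     # everything after "for query "
      split(hit, parts, " ")                   # parts[6] is the tombstone count
      printf "%s\t%s\n", parts[6], query
    }' | sort -t "$(printf '\t')" -k1,1nr
}

# Example (log path is an assumption):
#   rank_tombstone_warnings < /var/log/cassandra/system.log | head -n 10
```

Lines that do not match the warning format are ignored, so the whole log can be piped in without pre-filtering.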
How to diagnose it
1. Isolate the query from the log.
   The warning contains the full CQL statement. Extract the keyspace, table name, and WHERE clause. If the query uses a broad range or an IN clause on a partition with many deleted rows, that is your immediate smoking gun. Note that if the table receives many different queries, the same table may trigger warnings from several query patterns.
2. Confirm table-level tombstone load.
   Use nodetool tablehistograms <keyspace> <table> and inspect the Tombstone Cells distribution. If the p99 is high even when warnings are sparse, the table has a background tombstone problem that has not yet triggered thresholds on every query. Compare this with the LiveSSTableCount from nodetool tablestats. A high SSTable count on a tombstone-heavy table means reads are merging many small files full of dead data.
3. Find the hot partition.
   Run nodetool toppartitions <keyspace> <table> 1000. Look for a small number of partitions that dominate read counts. Cross-reference these partitions with the query predicates from step 1. A query that ranges over a wide partition containing thousands of deleted cells will generate a tombstone warning even if the overall table is otherwise healthy. If the same partition appears at the top of both reads and writes, it is likely a queue-like hotspot.
4. Check repair status.
   Tombstones cannot be safely purged until repair has run. Query system_distributed.repair_history or use nodetool tablestats to check the repaired percentage in Cassandra 4.0+. If the last successful repair is older than half of gc_grace_seconds, the table is at risk of retaining tombstones indefinitely. Run repair before expecting compaction to clean anything. Repair is extremely I/O intensive; throttle it and run during low-traffic periods.
5. Check compaction health.
   Run nodetool compactionstats. If pending tasks are trending upward, compaction is not reclaiming tombstones fast enough. Correlate with SSTable count from nodetool tablestats. A growing SSTable count on a delete-heavy table is a leading indicator that tombstone purge is stalled. If compaction is current but SSTable count is still high, the tombstones may be spread too evenly across SSTables for efficient purge.
6. Correlate with latency and GC.
   Use nodetool proxyhistograms to see if coordinator p99 latency spikes coincide with tombstone warnings. Tombstone scans are CPU and heap intensive. If GC pause duration is also elevated, the read is likely promoting garbage during merge. This confirms the warnings are client-visible, not just log noise. If only p999 is spiking while p50 is flat, you are looking at a narrow hotspot rather than a table-wide failure.
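Whether pending tasks are "trending upward" is easier to judge from a series of samples than from a single reading. The sketch below pulls the pending task count out of nodetool compactionstats output; the function name pending_compactions is hypothetical, and it assumes the output contains a line of the form pending tasks: N, which may differ in your version.

```bash
#!/usr/bin/env bash
# Extract the pending task count from `nodetool compactionstats` output on stdin.
# Assumes a line that starts with "pending tasks: <N>".
pending_compactions() {
  awk -F': *' '/^pending tasks:/ { print $2; exit }'
}

# Example: sample every 5 minutes and watch the trend (interval is arbitrary).
#   while sleep 300; do
#     printf '%s %s\n' "$(date +%FT%T)" "$(nodetool compactionstats | pending_compactions)"
#   done
```

A count that keeps climbing across samples points to compaction debt; a count that oscillates around a low value suggests compaction is keeping up.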
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| TombstoneScannedHistogram p99 | Tracks the distribution of tombstones touched per read. A rising p99 means reads are doing more wasted work. | p99 > tombstone_warn_threshold sustained |
| TombstoneWarnings / TombstoneFailures counters | Increment each time a read crosses the warn or fail threshold. Treat them as error counters, not informational logs. | Any sustained increase in TombstoneWarnings |
| Coordinator read latency p99 | The user-visible cost of tombstone scans. Spikes here confirm client impact. | p99 > 3x baseline or > read_request_timeout_in_ms / 2 |
| Pending compactions | Tombstones are only removed during compaction. Backlog means tombstones accumulate. | Pending tasks trending upward over 4+ hours |
| Repair age vs gc_grace_seconds | Without repair, tombstones cannot be purged. This is a silent prerequisite for cleanup. | Last repair > 80% of gc_grace_seconds |
| SSTable count per table | More SSTables means more merge work and more tombstones to scan per read. | Count growing steadily regardless of strategy |
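The repair-age signal in the table is simple arithmetic once you have the timestamp of the last successful repair. A minimal sketch, assuming you can obtain that timestamp as Unix epoch seconds from your repair tooling; the function name repair_age_pct is hypothetical.

```bash
#!/usr/bin/env bash
# Print the age of the last successful repair as a whole-number percentage
# of gc_grace_seconds.
#   $1 = last successful repair, Unix epoch seconds (how you obtain it is up to you)
#   $2 = gc_grace_seconds for the table (defaults to 864000, i.e. 10 days)
#   $3 = "now" in epoch seconds (defaults to the current time; handy for testing)
repair_age_pct() {
  local last_repair_epoch=$1
  local gc_grace_seconds=${2:-864000}
  local now=${3:-$(date +%s)}
  echo $(( (now - last_repair_epoch) * 100 / gc_grace_seconds ))
}

# Example: flag the table when the repair age exceeds 80% of gc_grace_seconds.
#   [ "$(repair_age_pct "$last_repair")" -gt 80 ] && echo "repair overdue"
```

For example, a repair that finished 8 days ago on a table with the default gc_grace_seconds yields 80, which sits exactly on the warning sign listed above.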
Fixes
Catch up on repair
If repair has not run, tombstones cannot be removed safely, no matter how much compaction you trigger. Run nodetool repair <keyspace> <table> or schedule it via Reaper. After repair completes, allow at least one compaction cycle to run before declaring the issue resolved. Do not force compaction before repair; compaction cannot purge unrepaired tombstones safely.
Force compaction on the affected table
If repair is current but tombstones remain because they are spread across many SSTables, you can force a compaction:
```bash
nodetool compact <keyspace> <table>
```
Warning: This generates heavy read and write I/O. It also needs temporary disk space to write the new SSTable before deleting the old ones. Do not run this during peak traffic or on a node that is already I/O saturated. With STCS, major compaction can transiently need up to 100% additional disk space.
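A quick headroom check before forcing compaction can prevent filling the disk mid-run. The sketch below applies the worst case stated above (up to 100% additional space) as a simple comparison; the function name has_compaction_headroom is hypothetical, and the data directory paths in the comments are assumptions.

```bash
#!/usr/bin/env bash
# Succeed (exit 0) only if free space is at least the table's current on-disk
# size, i.e. the worst case of needing 100% additional space.
#   $1 = table size on disk, in KB
#   $2 = free space on the data volume, in KB
has_compaction_headroom() {
  local table_kb=$1 free_kb=$2
  [ "$free_kb" -ge "$table_kb" ]
}

# Example wiring (paths are assumptions; snapshots under the table directory
# inflate the du figure, which only makes the check more conservative):
#   table_kb=$(du -sk /var/lib/cassandra/data/<keyspace>/<table>-*/ | awk '{s+=$1} END {print s}')
#   free_kb=$(df -Pk /var/lib/cassandra/data | awk 'NR==2 {print $4}')
#   has_compaction_headroom "$table_kb" "$free_kb" && nodetool compact <keyspace> <table>
```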
Change the compaction strategy for TTL data
If the table holds time-series data with TTL, and you are using STCS or LCS, migrate to TimeWindowCompactionStrategy (TWCS). TWCS compacts entire time windows as units, allowing Cassandra to drop tombstones cleanly when a window expires. This is a schema change that triggers a full recompaction; plan it for a maintenance window.
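The strategy change itself is a single ALTER TABLE statement. The sketch below only builds the statement so it can be reviewed before it is run; the function name twcs_alter_statement is hypothetical, and the window unit and size shown are placeholders that should be chosen to match your TTL and write pattern.

```bash
#!/usr/bin/env bash
# Print a CQL statement that switches a table to TimeWindowCompactionStrategy.
#   $1 = keyspace, $2 = table
#   $3 = compaction_window_unit (defaults to DAYS)
#   $4 = compaction_window_size (defaults to 1)
twcs_alter_statement() {
  local keyspace=$1 table=$2 unit=${3:-DAYS} size=${4:-1}
  printf "ALTER TABLE %s.%s WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': '%s', 'compaction_window_size': %s};\n" \
    "$keyspace" "$table" "$unit" "$size"
}

# Example: review the statement, then apply it with cqlsh during a maintenance window.
#   twcs_alter_statement <keyspace> <table> DAYS 1
#   cqlsh -e "$(twcs_alter_statement <keyspace> <table> DAYS 1)"
```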
Fix the application pattern
If the offending query is scanning a wide partition with many deleted rows, the durable fix is data-model or application changes:
- Avoid using Cassandra as a queue (insert, read, delete).
- Replace row-level deletes with TTL where possible.
- Narrow the query range so the read touches fewer deleted cells.
- If the workload requires frequent deletes, ensure the partition key design distributes them so no single partition becomes a tombstone archive.
Temporary threshold adjustment
Raising tombstone_warn_threshold or tombstone_failure_threshold in cassandra.yaml can buy time to avoid query aborts, but it is a band-aid. If you raise the threshold, you must also increase monitoring sensitivity on TombstoneScannedHistogram so you still detect the problem before heap pressure causes a GC death spiral.
Prevention
- Monitor TombstoneScannedHistogram per table and alert when the p99 crosses your warn threshold. Do not wait for log warnings.
- Automate repair tracking. Alert when any table’s last successful repair exceeds 50% of gc_grace_seconds. This is the silent prerequisite for tombstone health.
- Match compaction strategy to workload. Use TWCS for time-series TTL data. Do not use STCS for tables with heavy, uniform deletes.
- Watch SSTable count trends. Rising SSTable count on a delete-heavy table is an early warning that compaction is losing the race against tombstone accumulation.
- Review nodetool toppartitions periodically. Sampling hotspot partitions before they become critical lets you catch skewed delete patterns early.
How Netdata helps
- Correlates TombstoneWarnings and TombstoneFailures JMX counters with per-table read latency spikes, so you can see whether tombstone scans are causing client-visible tail latency.
- Surfaces TombstoneScannedHistogram percentiles alongside coordinator and local read latency, letting you distinguish a table-wide tombstone problem from a single slow query.
- Tracks pending compactions and repair age in the same view, making it easy to verify whether tombstone purge is blocked by compaction debt or missing repair.
- Alerts on sustained increases in tombstone warnings before queries hit the failure threshold and abort.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert
- Cassandra java.lang.OutOfMemoryError: Java heap space - causes and recovery
- Cassandra pending compactions growing: the compaction backlog runbook
- Cassandra compaction stuck: large partitions blocking a compaction thread
- Cassandra too many SSTables per table: read amplification and how to fix it







