Cassandra Scanned over N tombstones warning: finding the offending query

Your application latency may still look acceptable at the median, but system.log is filling with warnings like "Read 500 live rows and 12000 tombstone cells for query SELECT ..." or "Scanned over 5000 tombstones". These messages mean a single read is sifting through thousands of delete markers to return a small result set. Each tombstone consumes CPU, disk I/O, and heap memory during the merge phase. Left alone, the same query will eventually cross tombstone_failure_threshold (default 100000) and be aborted. The log line gives you the query text, but the real operational work is deciding whether the root cause is a missing repair, a compaction backlog, or a data model mismatch, and then confirming which partition is the actual source of the dead data.

This article walks through locating the offending query, validating table-level impact, and correlating tombstone load with repair and compaction state. It is grounded in the signals and commands you can run on a live cluster without restarting nodes.

What this means

Cassandra keeps delete markers, called tombstones, until they can be safely purged by compaction. That purge is only safe after all replicas have been repaired within gc_grace_seconds (default 10 days). Until then, every read that touches a deleted cell must scan the tombstone, hold it in memory during the cross-SSTable merge, and then discard it. The replica must keep scanned tombstones in memory so it can return them to the coordinator, which uses them to reconcile the other replicas. With workloads that generate a lot of tombstones, this can exhaust heap and trigger long GC pauses.

When a single query scans more than tombstone_warn_threshold (default 1000) tombstones, the replica executing the read logs a warning. When the count crosses tombstone_failure_threshold (default 100000), Cassandra aborts the read. Both values are defined in cassandra.yaml. The warning log line includes the CQL query text, which names the keyspace and the table. Treat these log entries as first-class metrics: sustained warnings almost always indicate a table that is accumulating tombstones faster than compaction can remove them.
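As a rough illustration of how the two thresholds partition reads, the sketch below classifies a per-read tombstone count. The function name classify_tombstone_count is hypothetical, the defaults are the cassandra.yaml values quoted above, and this is not how Cassandra itself implements the check.

```shell
# Hypothetical helper: classify one read's tombstone count against the
# warn and failure thresholds (defaults taken from the text above).
# Usage: classify_tombstone_count <count> [warn_threshold] [failure_threshold]
classify_tombstone_count() (
  count=$1
  warn=${2:-1000}
  fail=${3:-100000}
  if [ "$count" -gt "$fail" ]; then
    echo "abort"   # read would be aborted
  elif [ "$count" -gt "$warn" ]; then
    echo "warn"    # read succeeds but is logged
  else
    echo "ok"
  fi
)
```

For example, classify_tombstone_count 12000 reports a read that is logged but still served with the default settings.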

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Repair has not completed within gc_grace_seconds | Tombstone warnings appear across many partitions in the same table; disk space is not reclaimed after deletions | system_distributed.repair_history for the last repair time, or nodetool tablestats for percent repaired |
| TTL-driven or time-series data without TWCS | Warnings cluster on the newest or oldest time windows; SSTable count in old windows remains high because compaction cannot drop whole windows | nodetool tablestats for SSTable count per table, and the table schema for the current compaction strategy |
| Wide partition with many deleted rows or range deletes | nodetool toppartitions shows a single partition receiving heavy read traffic; latency spikes are isolated to one token range | Partition size and read distribution for the affected table |
| Queue-like write/read/delete pattern | Warnings correlate with a table that acts as a queue: rows are inserted, read once, then deleted immediately; this generates a tombstone per row | Application query patterns visible in the warning log |

Quick checks

Run these read-only commands on the node emitting the warnings or on any node that owns the affected token range.

# Find recent tombstone warnings and the exact query text
grep -E "Scanned over .* tombstones|Read .* live rows and .* tombstone cells" /var/log/cassandra/system.log | tail -n 20

# Check per-table distributions: SSTables per read, partition size, cell count
nodetool tablehistograms <keyspace> <table>

# Review SSTable count, tombstones per slice, and percent repaired
nodetool tablestats <keyspace> | grep -A 20 "Table: <table>"

# Check coordinator-level latency for tail reads
nodetool proxyhistograms

# See if compaction is falling behind
nodetool compactionstats

# Identify hot partitions that may be driving the scans
nodetool toppartitions <keyspace> <table> 1000

What to look for:

  • system.log reveals the exact query and the tombstone count. Extract the keyspace and table name first.
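  • tablehistograms shows percentile distributions for SSTables touched per read, partition size, and cell count. Large partitions that are read across many SSTables amplify the cost of every tombstone.
  • tablestats shows SSTable count plus the average and maximum tombstones per slice over the last five minutes. If the average is high, the table is systematically tombstone-heavy, not just unlucky; a high maximum with a low average points at a few bad partitions. A high SSTable count means more files to scan during merge, which amplifies the cost of each tombstone.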
  • proxyhistograms exposes coordinator p99 latency. Tombstone scans inflate the tail latency before they show up at the median.
  • compactionstats tells you whether compaction is keeping up or creating debt. Tombstones are only removed during compaction.
  • toppartitions surfaces the specific partition keys that dominate read traffic. If one partition accounts for most reads and also carries many tombstones, you have found the hotspot.
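To turn the grep output into a per-table ranking, a small awk pass can help. This is a minimal sketch that assumes the "Read N live rows and M tombstone cells for query SELECT ... FROM <keyspace>.<table> ..." line shape quoted at the top of this article; the function name summarize_tombstone_warnings is hypothetical, and lines in any other format are skipped.

```shell
# Hypothetical helper: read system.log lines on stdin and print
# "<keyspace.table> <warning_count> <max_tombstones>" sorted by worst read.
summarize_tombstone_warnings() {
  awk '
    /tombstone cells for query/ {
      count = ""; table = ""
      for (i = 2; i <= NF; i++) {
        # The tombstone count is the field right before "tombstone cells".
        if ($i == "tombstone" && $(i + 1) == "cells") count = $(i - 1)
        # The table is the first identifier after FROM in the query text.
        if (toupper($i) == "FROM" && table == "") table = $(i + 1)
      }
      if (count == "" || table == "") next
      hits[table]++
      if (count + 0 > max[table] + 0) max[table] = count + 0
    }
    END {
      for (t in hits) printf "%s %d %d\n", t, hits[t], max[t]
    }
  ' | sort -k3,3nr
}
```

A typical invocation would be summarize_tombstone_warnings < /var/log/cassandra/system.log; the table at the top of the output is the first candidate for the diagnosis steps.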

How to diagnose it

  1. Isolate the query from the log.
    The warning contains the full CQL statement. Extract the keyspace, table name, and WHERE clause. If the query uses a broad range or an IN clause on a partition with many deleted rows, that is your immediate smoking gun. Note that if the table receives many different queries, the same table may trigger warnings from several query patterns.

  2. Confirm table-level tombstone load.
    Use nodetool tablestats <keyspace> and inspect the average and maximum tombstones per slice for the table, or read the p99 of the TombstoneScannedHistogram JMX metric. If the p99 is high even when warnings are sparse, the table has a background tombstone problem that has not yet triggered thresholds on every query. Compare this with the SSTable count in the same tablestats output (LiveSSTableCount over JMX). A high SSTable count on a tombstone-heavy table means reads are merging many small files full of dead data.

  3. Find the hot partition.
    Run nodetool toppartitions <keyspace> <table> 1000. Look for a small number of partitions that dominate read counts. Cross-reference these partitions with the query predicates from step 1. A query that ranges over a wide partition containing thousands of deleted cells will generate a tombstone warning even if the overall table is otherwise healthy. If the same partition appears at the top of both reads and writes, it is likely a queue-like hotspot.

  4. Check repair status.
    Tombstones cannot be safely purged until repair has run. Query system_distributed.repair_history for the last completed session, or use nodetool tablestats to check the repaired percentage in Cassandra 4.0+. If the last successful repair is older than half of gc_grace_seconds, schedule one now: once gc_grace_seconds passes without a repair, purging is no longer safe, because a replica that missed the delete can bring the data back. Run repair before relying on compaction to clean anything up. Repair is I/O intensive; throttle it and run it during low-traffic periods.

  5. Check compaction health.
    Run nodetool compactionstats. If pending tasks are trending upward, compaction is not reclaiming tombstones fast enough. Correlate with SSTable count from nodetool tablestats. A growing SSTable count on a delete-heavy table is a leading indicator that tombstone purge is stalled. If compaction is current but SSTable count is still high, the tombstones may be spread too evenly across SSTables for efficient purge.

  6. Correlate with latency and GC.
    Use nodetool proxyhistograms to see if coordinator p99 latency spikes coincide with tombstone warnings. Tombstone scans are CPU and heap intensive. If GC pause duration is also elevated, the read is likely generating large amounts of short-lived garbage during the merge. This confirms the warnings are client-visible, not just log noise. If only p99.9 is spiking while p50 is flat, you are looking at a narrow hotspot rather than a table-wide failure.
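The table-level checks in steps 2 and 5 can be condensed into one line per table. The sketch below assumes the "Table:", "SSTable count:", and "Maximum tombstones per slice (last five minutes):" labels printed by nodetool tablestats; the function name tablestats_tombstone_summary is hypothetical, and label wording can differ between Cassandra versions.

```shell
# Hypothetical helper: read `nodetool tablestats` output on stdin and print
# "<table> <sstable_count> <max_tombstones_per_slice>" for each table.
tablestats_tombstone_summary() {
  awk -F': *' '
    # Strip the leading indentation so $1 is the bare label.
    { sub(/^[ \t]+/, "") }
    $1 == "Table"         { table = $2; sst = "" }
    $1 == "SSTable count" { sst = $2 }
    $1 == "Maximum tombstones per slice (last five minutes)" && table != "" {
      print table, sst, $2
    }
  '
}
```

One way to use it is nodetool tablestats <keyspace> | tablestats_tombstone_summary. A table with both a high SSTable count and a high maximum is where reads are paying the most for dead data.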

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| TombstoneScannedHistogram p99 | Tracks the distribution of tombstones touched per read. A rising p99 means reads are doing more wasted work. | p99 > tombstone_warn_threshold sustained |
| TombstoneWarnings / TombstoneFailures counters | Increment each time a read crosses the warn or fail threshold. Treat them as error counters, not informational logs. | Any sustained increase in TombstoneWarnings |
| Coordinator read latency p99 | The user-visible cost of tombstone scans. Spikes here confirm client impact. | p99 > 3x baseline or > read_request_timeout_in_ms / 2 |
| Pending compactions | Tombstones are only removed during compaction. Backlog means tombstones accumulate. | Pending tasks trending upward over 4+ hours |
| Repair age vs gc_grace_seconds | Without repair, tombstones cannot be safely purged. This is a silent prerequisite for cleanup. | Last repair > 50% of gc_grace_seconds |
| SSTable count per table | More SSTables means more merge work and more tombstones to scan per read. | Count growing steadily regardless of strategy |
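The repair-age signal is simple arithmetic once you have the timestamp of the last successful repair. The sketch below is one way to express it; the function name repair_age_status and the "ok / at-risk / overdue" labels are hypothetical, the half-of-gc_grace_seconds cut-off follows the guidance in this article, and how you obtain the last repair time (for example from system_distributed.repair_history) is left to your tooling.

```shell
# Hypothetical helper: compare repair age with gc_grace_seconds.
# Usage: repair_age_status <last_repair_epoch> <now_epoch> [gc_grace_seconds]
repair_age_status() (
  last_repair_epoch=$1
  now_epoch=$2
  gc_grace=${3:-864000}   # default gc_grace_seconds: 10 days
  age=$(( now_epoch - last_repair_epoch ))
  pct=$(( age * 100 / gc_grace ))
  if [ "$age" -gt "$gc_grace" ]; then
    status="overdue"      # repair is older than gc_grace_seconds
  elif [ $(( age * 2 )) -gt "$gc_grace" ]; then
    status="at-risk"      # repair is older than half of gc_grace_seconds
  else
    status="ok"
  fi
  echo "${pct}% ${status}"
)
```

With the default gc_grace_seconds, a repair that finished six days ago is reported as at-risk (60% of the window used).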

Fixes

Catch up on repair

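If repair has not run within gc_grace_seconds, purging tombstones is unsafe: a replica that missed the delete can resurrect the data. On tables that set only_purge_repaired_tombstones, compaction will not drop unrepaired tombstones at all, no matter how often you trigger it. Run nodetool repair <keyspace> <table> or schedule it via Reaper. After repair completes, allow at least one compaction cycle to run before declaring the issue resolved. Do not force compaction before repair; compaction cannot purge unrepaired tombstones safely.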

Force compaction on the affected table

If repair is current but tombstones remain because they are spread across many SSTables, you can force a compaction:

nodetool compact <keyspace> <table>

Warning: This generates heavy read and write I/O. It also needs temporary disk space to write the new SSTable before deleting the old ones. Do not run this during peak traffic or on a node that is already I/O saturated. With STCS, major compaction can transiently need up to 100% additional disk space.
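A pre-flight check for that disk-space requirement can be as small as the sketch below. It assumes the worst case stated above (free space at least equal to the table's on-disk size); the function name has_compaction_headroom is hypothetical, and both sizes are expected in bytes.

```shell
# Hypothetical pre-flight check before `nodetool compact`.
# Usage: has_compaction_headroom <table_bytes> <free_bytes>
# Exit status 0 means the worst-case temporary space is available.
has_compaction_headroom() (
  table_bytes=$1
  free_bytes=$2
  # Worst case from the warning above: 100% of the table size in extra space.
  [ "$free_bytes" -ge "$table_bytes" ]
)
```

The table size could come from the "Space used (live)" line of nodetool tablestats and the free space from df on the data directory; treat a failing check as a reason to free space or compact a subset of SSTables first.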

Change the compaction strategy for TTL data

If the table holds time-series data with TTL, and you are using STCS or LCS, migrate to TimeWindowCompactionStrategy (TWCS). TWCS groups SSTables into time windows and compacts each window as a unit, so once everything in an old window has expired Cassandra can drop the whole SSTable instead of scanning its tombstones. Changing the strategy is a schema change that triggers recompaction of existing data; plan it for a maintenance window.
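The schema change itself is a single ALTER TABLE. The sketch below only builds the statement so you can review it before applying it; the function name twcs_alter_statement and the keyspace and table names in the usage line are hypothetical, and the window unit and size must be chosen to fit your TTL and write rate.

```shell
# Hypothetical helper: print an ALTER TABLE statement that switches a table
# to TimeWindowCompactionStrategy.
# Usage: twcs_alter_statement <keyspace> <table> [window_unit] [window_size]
twcs_alter_statement() (
  keyspace=$1
  table=$2
  unit=${3:-DAYS}   # MINUTES, HOURS, or DAYS
  size=${4:-1}
  printf "ALTER TABLE %s.%s WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': '%s', 'compaction_window_size': '%s'};\n" \
    "$keyspace" "$table" "$unit" "$size"
)
```

After reviewing the output, it could be applied with something like cqlsh -e "$(twcs_alter_statement metrics events DAYS 1)".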

Fix the application pattern

If the offending query is scanning a wide partition with many deleted rows, the durable fix is data-model or application changes:

  • Avoid using Cassandra as a queue (insert, read, delete).
  • Replace row-level deletes with TTL where possible.
  • Narrow the query range so the read touches fewer deleted cells.
  • If the workload requires frequent deletes, ensure the partition key design distributes them so no single partition becomes a tombstone archive.

Temporary threshold adjustment

Raising tombstone_warn_threshold or tombstone_failure_threshold in cassandra.yaml can buy time to avoid query aborts, but it is a band-aid. If you raise the threshold, you must also increase monitoring sensitivity on TombstoneScannedHistogram so you still detect the problem before heap pressure causes a GC death spiral.
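Before changing anything, it helps to record what each node is currently configured with. This is a minimal sketch; the function name show_tombstone_thresholds is hypothetical, and the configuration path shown in the usage line is an assumption that varies by installation.

```shell
# Hypothetical helper: print the active (uncommented) tombstone threshold
# settings from a cassandra.yaml file.
# Usage: show_tombstone_thresholds <path-to-cassandra.yaml>
show_tombstone_thresholds() {
  grep -E '^[[:space:]]*tombstone_(warn|failure)_threshold[[:space:]]*:' "$1"
}
```

For example, show_tombstone_thresholds /etc/cassandra/cassandra.yaml. If the command prints nothing, the settings are absent from the file and the defaults apply.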

Prevention

  • Monitor TombstoneScannedHistogram per table and alert when the p99 crosses your warn threshold. Do not wait for log warnings.
  • Automate repair tracking. Alert when any table’s last successful repair exceeds 50% of gc_grace_seconds. This is the silent prerequisite for tombstone health.
  • Match compaction strategy to workload. Use TWCS for time-series TTL data. Do not use STCS for tables with heavy, uniform deletes.
  • Watch SSTable count trends. Rising SSTable count on a delete-heavy table is an early warning that compaction is losing the race against tombstone accumulation.
  • Review nodetool toppartitions periodically. Sampling hotspot partitions before they become critical lets you catch skewed delete patterns early.
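Trend checks like the SSTable-count one above do not need a full monitoring stack to prototype. The sketch below assumes you already collect one numeric sample per interval (SSTable count or pending compactions), oldest first; the function name is_trending_up and its "rising" definition (never decreasing and ending higher than it started) are illustrative choices, not a standard.

```shell
# Hypothetical helper: read one numeric sample per line (oldest first) and
# report whether the series never decreases and ends higher than it started.
is_trending_up() {
  awk '
    NR == 1 { first = $1; prev = $1; next }
    $1 + 0 < prev + 0 { dipped = 1 }
    { prev = $1 }
    END {
      if (NR >= 2 && !dipped && prev + 0 > first + 0) print "rising"
      else print "not-rising"
    }
  '
}
```

For example, printf '%s\n' 10 12 12 15 | is_trending_up reports a rising series, while any dip between samples resets the verdict to not-rising.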

How Netdata helps

  • Correlates TombstoneWarnings and TombstoneFailures JMX counters with per-table read latency spikes, so you can see whether tombstone scans are causing client-visible tail latency.
  • Surfaces TombstoneScannedHistogram percentiles alongside coordinator and local read latency, letting you distinguish a table-wide tombstone problem from a single slow query.
  • Tracks pending compactions and repair age in the same view, making it easy to verify whether tombstone purge is blocked by compaction debt or missing repair.
  • Alerts on sustained increases in tombstone warnings before queries hit the failure threshold and abort.