Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
A query returns rows that were deleted weeks ago. Application logs show no errors. All nodes report UP in nodetool status. Compaction is running and disk usage looks normal. The data is back because a tombstone was compacted away before anti-entropy repair verified that every replica saw the delete. Once the tombstone is gone, live data on unrepaired replicas is treated as authoritative. The next repair or read repair streams it back to the tombstone-less nodes.
This article explains the mechanics, how to confirm the divergence, and how to stop it from recurring.
What this means
In Cassandra, a delete does not remove data immediately. It writes a tombstone marker that hides older values. During compaction, tombstones are eligible for purge only after gc_grace_seconds have elapsed since the delete. The default is 864000 seconds, or 10 days.
Cassandra cannot purge a tombstone until it knows every replica has seen it. Anti-entropy repair provides that guarantee by comparing Merkle trees across replicas and streaming differences. If repair has not completed within gc_grace_seconds, some replicas compact away the tombstone while others still retain the original live data. When repair or read repair eventually reconciles those replicas, the live data wins because the tombstone is gone.
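Hinted handoff does not prevent this. Hints replay missed mutations, but coordinators stop storing hints for a down node after max_hint_window_in_ms (default 3 hours). Deletes issued during a longer outage never reach that replica through hints, and only repair can deliver them before the gc_grace_seconds window closes.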
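The timing rule above can be written out as plain arithmetic. The sketch below is illustrative only: the function names and sample timestamps are assumptions, and all times are epoch seconds that you would supply yourself.

```shell
# Illustrative sketch; function names and sample values are assumptions.
GC_GRACE_DEFAULT=864000   # default gc_grace_seconds (10 days)

# Earliest epoch second at which compaction may purge a tombstone.
purge_eligible_at() {
  local delete_epoch="$1"
  local gc_grace="${2:-$GC_GRACE_DEFAULT}"
  echo $(( $delete_epoch + $gc_grace ))
}

# Exit 0 if a repair that started after the delete also finished
# before the tombstone became eligible for purge; exit 1 otherwise.
repair_covered_delete() {
  local delete_epoch="$1" repair_start="$2" repair_end="$3"
  local gc_grace="${4:-$GC_GRACE_DEFAULT}"
  local deadline
  deadline=$(purge_eligible_at "$delete_epoch" "$gc_grace")
  [ "$repair_start" -ge "$delete_epoch" ] && [ "$repair_end" -le "$deadline" ]
}
```

For example, `repair_covered_delete 1700000000 1700100000 1700500000` succeeds because the repair finished inside the 10-day window. The same call with a repair end of 1700900000 fails, which is the condition under which resurrection becomes possible.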
flowchart TD
A[Delete writes tombstone to replica set] --> B{Tombstone ages past gc_grace_seconds}
B -->|Repair incomplete| C[Compaction purges tombstone on repaired nodes]
B -->|Repair complete| D[Tombstone purged safely everywhere]
C --> E[Unrepaired replica still holds live data]
E --> F[Repair or read repair streams live data back]
F --> G[Deleted data resurrected as zombie rows]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Repair not running or incomplete | Deleted data returns after days or weeks with no prior errors | system_distributed.repair_history for last successful repair timestamp |
| Repair slower than the grace window | Repair starts but does not finish before tombstones expire on early token ranges | Repair cycle duration versus gc_grace_seconds |
| Node down longer than gc_grace_seconds | Hints expire before the node returns, so missed deletes are never propagated | nodetool status history and hint delivery status |
| gc_grace_seconds reduced without faster repair | Tombstones become eligible for purge sooner than repair reaches all replicas | DESCRIBE TABLE output for the per-table gc_grace_seconds value |
Quick checks
# Check repair history for last completion time
cqlsh -e "SELECT keyspace_name, columnfamily_name, finished_at FROM system_distributed.repair_history LIMIT 50;"
# Check current repair status (Cassandra 4.0+)
nodetool repair_admin list
# Check node liveness and cluster state
nodetool status
# Check hint delivery and backlog
nodetool statushandoff
# Check table-level gc_grace_seconds
cqlsh -e "DESCRIBE TABLE mykeyspace.mytable;"
# Check for tombstone scan warnings
grep -i "tombstone" /var/log/cassandra/system.log | tail -20
# Check compaction backlog and pending tasks
nodetool compactionstats
# Check table statistics for SSTable count and tombstone load
nodetool tablestats mykeyspace | grep -i -E "Table:|tombstone|SSTable count"
How to diagnose it
- Confirm replica divergence at low consistency. Run the suspect query at CL=ONE against individual replicas. If one replica returns the deleted row while another returns nothing, you have replica divergence. At CL=QUORUM, read repair may propagate live data to tombstone-less replicas or hide the issue depending on which replicas form the quorum.
- Check last repair time. Query system_distributed.repair_history for the affected keyspace and table. If the most recent successful repair finished before the delete occurred and is older than the table’s gc_grace_seconds, resurrection is possible.
- Compare repair cadence to gc_grace_seconds. Best practice is to complete a full repair cycle within gc_grace_seconds / 2. If your repair takes several days to finish and runs only once per week, the interval between the start of one cycle and the completion of the next can exceed the grace window.
- Review node downtime. Check logs or monitoring history for any node that was DOWN longer than gc_grace_seconds. A node that is down longer than the grace window will miss tombstones permanently because hinted handoff stops storing hints after max_hint_window_in_ms (default 3 hours).
- Inspect compaction history. Use nodetool compactionhistory to verify that compaction has run on the affected table since the delete. If compaction ran while repair was incomplete, tombstones were likely purged on repaired nodes only.
- Check for repair failures. Partial repairs are worse than no repairs because they create an illusion of safety. Verify that repair completed all token ranges, not just that it started.
- Compare tablestats across replicas. Use nodetool tablestats for the affected table on each node. A large gap in SSTable count or tombstone metrics between replicas indicates uneven compaction and a higher risk that one node has purged tombstones the others still need.
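The cadence comparison in the steps above is simple arithmetic, and a worked version makes the weekly-repair trap visible. This is a minimal sketch: the function names and the OK/TIGHT/UNSAFE labels are assumptions for illustration, and the inputs are values you would measure on your own cluster.

```shell
# Illustrative sketch; names and labels are assumptions.
# Worst case: time from the start of one repair cycle to the completion
# of the next, in seconds.
worst_case_repair_gap() {
  local interval="$1" duration="$2"
  echo $(( $interval + $duration ))
}

# Compare the worst-case gap with gc_grace_seconds and with half of it.
cadence_verdict() {
  local gc_grace="$1" interval="$2" duration="$3"
  local gap
  gap=$(worst_case_repair_gap "$interval" "$duration")
  if [ "$gap" -gt "$gc_grace" ]; then
    echo "UNSAFE"     # gap exceeds the grace window
  elif [ "$gap" -gt $(( $gc_grace / 2 )) ]; then
    echo "TIGHT"      # inside the window, but past the gc_grace_seconds / 2 target
  else
    echo "OK"
  fi
}
```

With the default gc_grace_seconds of 864000, a weekly repair (604800 seconds) that takes 4 days (345600 seconds) gives a worst-case gap of 950400 seconds, so `cadence_verdict 864000 604800 345600` prints UNSAFE. Repairing every 3 days with a 1-day duration gives 345600 seconds and prints OK.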
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Repair status vs gc_grace_seconds | Tombstones cannot be safely purged until all replicas are repaired | Last successful repair > 80% of gc_grace_seconds |
| Tombstone scan warnings | Tombstones are accumulating and reads are scanning excessive dead data | Sustained log warnings exceeding tombstone_warn_threshold (default 1000) |
| Pending compactions | Tombstones survive until a compaction event includes both the tombstone and the live data | Pending tasks trending upward over days |
| Node liveness duration | A node down longer than gc_grace_seconds misses tombstones permanently | Any node DOWN for longer than gc_grace_seconds |
| Hinted handoff backlog | Expired hints leave permanent inconsistency that only repair can fix | Hints directory growing or hint delivery failures increasing |
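The first signal in the table can be approximated as a percentage of the grace window. The sketch below assumes you already have the finish time of the last successful repair as epoch seconds (for example, converted from finished_at in system_distributed.repair_history). The function names and the WARN/OK output format are illustrative assumptions.

```shell
# Illustrative sketch; names and output format are assumptions.
# Age of the last successful repair as a percentage of gc_grace_seconds.
repair_age_percent() {
  local last_repair_epoch="$1" now_epoch="$2"
  local gc_grace="${3:-864000}"
  echo $(( ($now_epoch - $last_repair_epoch) * 100 / $gc_grace ))
}

# Print WARN when the age exceeds the threshold (default 80, per the table above).
repair_age_status() {
  local warn_pct="${4:-80}"
  local pct
  pct=$(repair_age_percent "$1" "$2" "${3:-864000}")
  if [ "$pct" -gt "$warn_pct" ]; then
    echo "WARN ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}
```

A call such as `repair_age_status "$last_repair" "$(date +%s)"` would use the defaults. Passing a fourth argument of 50 applies the stricter alerting threshold instead.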
Fixes
Run full repair immediately
Warning: Repair is I/O and network intensive and will impact production traffic.
The only way to reconcile zombie data is to run a full repair on the affected tables. This streams divergent data and re-aligns replicas. Schedule it during low-traffic hours and throttle streaming if needed.
# Repair a specific keyspace and table
nodetool repair -full mykeyspace mytable
If the dataset is very large, use sub-range repair to limit scope and duration:
# Repair a specific token range
nodetool repair -full -st <start_token> -et <end_token> mykeyspace mytable
Monitor nodetool netstats to track streaming progress and nodetool compactionstats to verify compaction resumes after repair completes.
Enable only_purge_repaired_tombstones
If you cannot guarantee a reliable repair cadence, enable the only_purge_repaired_tombstones compaction option. When set to true, Cassandra will only purge tombstones from SSTables that have been marked as repaired. This severs the link between compaction and resurrection, at the cost of higher transient disk usage because tombstones survive longer.
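One way to apply the option is through an ALTER TABLE statement. The helper below only prints the CQL so it can be reviewed before it is run. The function name is an assumption, and the keyspace, table, and compaction class are placeholders. Because the compaction map is set as a whole, the sketch restates the class rather than assuming it is preserved.

```shell
# Illustrative sketch; the function name is an assumption.
# Prints an ALTER TABLE statement that enables only_purge_repaired_tombstones.
# Arguments: keyspace, table, compaction strategy class currently in use.
purge_repaired_cql() {
  local keyspace="$1" table="$2" strategy="$3"
  printf "ALTER TABLE %s.%s WITH compaction = {'class': '%s', 'only_purge_repaired_tombstones': 'true'};\n" \
    "$keyspace" "$table" "$strategy"
}
```

After checking the output of `purge_repaired_cql mykeyspace mytable SizeTieredCompactionStrategy`, it could be passed to cqlsh with `cqlsh -e "$(purge_repaired_cql mykeyspace mytable SizeTieredCompactionStrategy)"`. Include any other compaction sub-options the table already uses, and confirm the result with DESCRIBE TABLE.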
Adjust repair cadence, not gc_grace_seconds
Never reduce gc_grace_seconds to solve a repair gap. A lower grace period shortens the repair window. If a node is down longer than the reduced gc_grace_seconds, missed deletes become permanent because the tombstones expire before repair can reconcile them.
Instead, increase repair frequency. Target completion within gc_grace_seconds / 2 to provide a safety margin for repair duration.
Address node downtime immediately
If a node was down longer than max_hint_window_in_ms (default 3 hours) or gc_grace_seconds, treat it as inconsistent. Do not assume hinted handoff recovered all missed deletes. Run a full repair on the token ranges owned by that node before it rejoins active production traffic.
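The two thresholds in this rule can be encoded as a small triage helper. This is a sketch under stated assumptions: the function name and the three labels are invented for illustration, the downtime is supplied in seconds from your own monitoring, and the defaults mirror the values quoted in this article.

```shell
# Illustrative sketch; names and labels are assumptions.
HINT_WINDOW_MS_DEFAULT=10800000   # max_hint_window_in_ms default (3 hours)
GC_GRACE_S_DEFAULT=864000         # gc_grace_seconds default (10 days)

# Classify an outage by comparing its length with both windows.
# Arguments: downtime in seconds, [hint window in ms], [gc_grace in seconds].
classify_downtime() {
  local downtime_s="$1"
  local hint_window_ms="${2:-$HINT_WINDOW_MS_DEFAULT}"
  local gc_grace_s="${3:-$GC_GRACE_S_DEFAULT}"
  local hint_window_s=$(( $hint_window_ms / 1000 ))
  if [ "$downtime_s" -gt "$gc_grace_s" ]; then
    echo "grace-exceeded"       # tombstones may already be purged elsewhere
  elif [ "$downtime_s" -gt "$hint_window_s" ]; then
    echo "repair-required"      # hints stopped being stored during the outage
  else
    echo "within-hint-window"   # hints may cover the gap; still verify delivery
  fi
}
```

For instance, a 6-hour outage (`classify_downtime 21600`) prints repair-required, while an 11-day outage (`classify_downtime 950400`) prints grace-exceeded. In both cases the node should be treated as inconsistent, as described above.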
Prevention
- Automate repair and monitor completion. Use a repair scheduler such as Reaper, or the unified repair scheduler available in Cassandra 5.1. Alert when the time since last successful repair exceeds 50% of gc_grace_seconds. Monitor completion of all token ranges, not just the start of the job.
- Keep repair duration inside the grace window. If a full cluster repair takes several days, run smaller sub-range repairs more frequently. The interval from the start of one complete repair cycle to the end of the next must remain below gc_grace_seconds.
- Consider only_purge_repaired_tombstones. If your repair automation is immature, this option is the safest defense against accidental resurrection.
- Repair after any extended outage. If a node is down longer than max_hint_window_in_ms, run repair immediately after recovery. Hints do not cover the full outage window.
- Do not reduce gc_grace_seconds without proven repair frequency. Some teams lower the grace period to reclaim disk space faster. This is dangerous unless you have verified repair completion times that fit comfortably inside the new window.
How Netdata helps
- Correlate repair lag with compaction pressure and pending tasks to identify when tombstones are at risk of premature purge.
- Alert on node availability drops that exceed gc_grace_seconds, flagging replicas that may have missed deletes.
- Track JVM heap usage and GC pause duration, which can delay repair completion and widen the vulnerability window.
- Monitor SSTable count growth per table as a leading indicator that compaction is outpacing repair reconciliation.
- Surface hinted handoff backlog and delivery failure rates, highlighting replicas that will need manual repair after recovery.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert
- Cassandra java.lang.OutOfMemoryError: Java heap space - causes and recovery
- Cassandra pending compactions growing: the compaction backlog runbook
- Cassandra compaction stuck: large partitions blocking a compaction thread
- Cassandra TombstoneOverwhelmingException: reads aborted by tombstone_failure_threshold







