Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones

A query returns rows that were deleted weeks ago. Application logs show no errors. All nodes report UP in nodetool status. Compaction is running and disk usage looks normal. The data is back because a tombstone was compacted away before anti-entropy repair verified that every replica saw the delete. Once the tombstone is gone, live data on unrepaired replicas is treated as authoritative. The next repair or read repair streams it back to the tombstone-less nodes.

This article explains the mechanics, how to confirm the divergence, and how to stop it from recurring.

What this means

In Cassandra, a delete does not remove data immediately. It writes a tombstone marker that hides older values. During compaction, tombstones are eligible for purge only after gc_grace_seconds have elapsed since the delete. The default is 864000 seconds, or 10 days.
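The purge rule is plain arithmetic on timestamps, which a small shell sketch can make concrete. The function names below are illustrative and not part of Cassandra, and the sketch models only the grace-period condition; compaction applies further checks before it actually drops a tombstone.

```shell
# Illustrative helpers (not Cassandra tooling); all times are Unix epoch seconds.
DEFAULT_GC_GRACE=864000   # 10 days

# Print the earliest epoch second at which compaction may drop the tombstone.
# Args: delete time, [gc_grace_seconds].
tombstone_purgeable_at() {
  echo $(( $1 + ${2:-$DEFAULT_GC_GRACE} ))
}

# Exit 0 once the grace period has elapsed.
# Args: delete time, current time, [gc_grace_seconds].
tombstone_is_purgeable() {
  [ "$2" -gt $(( $1 + ${3:-$DEFAULT_GC_GRACE} )) ]
}

# A delete written at epoch 1700000000 with the default grace period:
tombstone_purgeable_at 1700000000   # prints 1700864000
```

Any replica that has not received the delete by that time is a resurrection risk.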

Purging is safe only if every replica has already received the tombstone, but by default Cassandra does not verify this before compaction: it assumes that gc_grace_seconds was long enough. Anti-entropy repair is what makes that assumption true, by comparing Merkle trees across replicas and streaming differences. If repair has not completed within gc_grace_seconds, replicas that hold the tombstone compact it away while a replica that missed the delete still retains the original live data. When repair or read repair eventually reconciles those replicas, the live data wins because the tombstone is gone.

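Hinted handoff does not prevent this. Hints replay missed mutations, but a coordinator stores them only for max_hint_window_in_ms (default 3 hours). A replica that was unreachable for longer never receives the deletes it missed through hints, and only repair can deliver them.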

flowchart TD
    A[Delete writes tombstone to replica set] --> B{Tombstone ages past gc_grace_seconds}
    B -->|Repair incomplete| C[Compaction purges tombstone on replicas that hold it]
    B -->|Repair complete| D[Tombstone purged safely everywhere]
    C --> E[Unrepaired replica still holds live data]
    E --> F[Repair or read repair streams live data back]
    F --> G[Deleted data resurrected as zombie rows]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Repair not running or incomplete | Deleted data returns after days or weeks with no prior errors | system_distributed.repair_history for last successful repair timestamp |
| Repair slower than the grace window | Repair starts but does not finish before tombstones expire on early token ranges | Repair cycle duration versus gc_grace_seconds |
| Node down longer than gc_grace_seconds | Hints expire before the node returns, so missed deletes are never propagated | nodetool status history and hint delivery status |
| gc_grace_seconds reduced without faster repair | Tombstones become eligible for purge sooner than repair reaches all replicas | DESCRIBE TABLE output for the per-table gc_grace_seconds value |

Quick checks

# Check repair history for last completion time
cqlsh -e "SELECT keyspace_name, columnfamily_name, finished_at FROM system_distributed.repair_history LIMIT 50;"
# Check current repair status (Cassandra 4.0+)
nodetool repair_admin list
# Check node liveness and cluster state
nodetool status
# Check hint delivery and backlog
nodetool statushandoff
# Check table-level gc_grace_seconds
cqlsh -e "DESCRIBE TABLE mykeyspace.mytable;"
# Check for tombstone scan warnings
grep -i "tombstone" /var/log/cassandra/system.log | tail -20
# Check compaction backlog and pending tasks
nodetool compactionstats
# Check table statistics for SSTable count and tombstone load
nodetool tablestats mykeyspace | grep -i -E "Table:|tombstone|SSTable count"

How to diagnose it

  1. Confirm replica divergence at low consistency. Run the suspect query at CL=ONE against individual replicas. If one replica returns the deleted row while another returns nothing, you have replica divergence. At CL=QUORUM, read repair may propagate live data to tombstone-less replicas or hide the issue depending on which replicas form the quorum.
  2. Check last repair time. Query system_distributed.repair_history for the affected keyspace and table. If the most recent successful repair finished before the delete occurred and is older than the table’s gc_grace_seconds, resurrection is possible.
  3. Compare repair cadence to gc_grace_seconds. Best practice is to complete a full repair cycle within gc_grace_seconds / 2. If your repair takes several days to finish and runs only once per week, the interval between the start of one cycle and the completion of the next can exceed the grace window.
  4. Review node downtime. Check logs or monitoring history for any node that was DOWN longer than gc_grace_seconds. A node that is down longer than the grace window will miss tombstones permanently because hinted handoff stops storing hints after max_hint_window_in_ms (default 3 hours).
  5. Inspect compaction history. Use nodetool compactionhistory to verify that compaction has run on the affected table since the delete. If compaction ran after the grace window closed while repair was still incomplete, tombstones were likely purged on the replicas that held them, leaving the replicas that missed the delete with live data.
  6. Check for repair failures. Partial repairs are worse than no repairs because they create an illusion of safety. Verify that repair completed all token ranges, not just that it started.
  7. Compare tablestats across replicas. Use nodetool tablestats for the affected table on each node. A large gap in SSTable count or tombstone metrics between replicas indicates uneven compaction and a higher risk that one node has purged tombstones the others still need.
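Step 1 can be scripted roughly as follows. This is a sketch under assumptions: the keyspace, table, key value, and WHERE clause are placeholders, cqlsh is assumed to reach every replica address without authentication, and a coordinator usually, but not necessarily, serves a CL=ONE read from its own copy. The function name check_replicas is hypothetical.

```shell
# Sketch only: run the same read at CL=ONE through each replica of one partition.
# Args: keyspace, table, partition key value, WHERE clause for the SELECT.
check_replicas() {
  ks=$1; tbl=$2; key_val=$3; where=$4
  # nodetool getendpoints prints one replica address per line for the key.
  for host in $(nodetool getendpoints "$ks" "$tbl" "$key_val"); do
    echo "== replica $host =="
    # CONSISTENCY ONE asks the coordinator to answer from a single replica.
    cqlsh "$host" -e "CONSISTENCY ONE; SELECT * FROM $ks.$tbl WHERE $where;"
  done
}

# Usage (placeholder names and key):
# check_replicas mykeyspace mytable 42 "id = 42"
```

If one replica prints the row and another prints nothing, the replicas have diverged for that partition.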

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Repair status vs gc_grace_seconds | Tombstones cannot be safely purged until all replicas are repaired | Last successful repair > 80% of gc_grace_seconds |
| Tombstone scan warnings | Tombstones are accumulating and reads are scanning excessive dead data | Sustained log warnings exceeding tombstone_warn_threshold (default 1000) |
| Pending compactions | Tombstones survive until a compaction event includes both the tombstone and the live data | Pending tasks trending upward over days |
| Node liveness duration | A node down longer than gc_grace_seconds misses tombstones permanently | Any node DOWN for longer than gc_grace_seconds |
| Hinted handoff backlog | Expired hints leave permanent inconsistency that only repair can fix | Hints directory growing or hint delivery failures increasing |

Fixes

Run full repair immediately

Warning: Repair is I/O and network intensive and will impact production traffic.

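Run a full repair on the affected tables to bring the replicas back into agreement. Repair streams divergent data and re-aligns replicas, but it cannot restore a tombstone that has already been purged: rows that have resurfaced end up consistent on every replica and must be deleted again. Schedule the repair during low-traffic hours and throttle streaming if needed.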

# Repair a specific keyspace and table
nodetool repair -full mykeyspace mytable

If the dataset is very large, use sub-range repair to limit scope and duration:

# Repair a specific token range
nodetool repair -full -st <start_token> -et <end_token> mykeyspace mytable

Monitor nodetool netstats to track streaming progress and nodetool compactionstats to verify compaction resumes after repair completes.

Enable only_purge_repaired_tombstones

If you cannot guarantee a reliable repair cadence, enable the only_purge_repaired_tombstones compaction option. When set to true, Cassandra will only purge tombstones from SSTables that have been marked as repaired. This severs the link between compaction and resurrection, at the cost of higher transient disk usage because tombstones survive longer.
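The option is set per table as a compaction sub-option. The sketch below builds the statement so it can be reviewed before it is applied. SizeTieredCompactionStrategy is an assumed placeholder and purge_repaired_cql is a hypothetical helper; because ALTER TABLE replaces the whole compaction map, carry over the table's actual class and any existing sub-options.

```shell
# Sketch: print an ALTER TABLE statement that enables only_purge_repaired_tombstones.
# Args: keyspace, table, compaction class currently used by the table.
purge_repaired_cql() {
  printf "ALTER TABLE %s.%s WITH compaction = {'class': '%s', 'only_purge_repaired_tombstones': 'true'};\n" "$1" "$2" "$3"
}

# Review the generated statement first:
purge_repaired_cql mykeyspace mytable SizeTieredCompactionStrategy

# Then apply it against a live cluster:
# cqlsh -e "$(purge_repaired_cql mykeyspace mytable SizeTieredCompactionStrategy)"
```

With this option enabled, tombstones in SSTables that are never marked as repaired are kept, so disk usage depends on repair actually completing.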

Adjust repair cadence, not gc_grace_seconds

Never reduce gc_grace_seconds to solve a repair gap. A lower grace period shortens the repair window. If a node is down longer than the reduced gc_grace_seconds, missed deletes become permanent because the tombstones expire before repair can reconcile them.

Instead, increase repair frequency. Target completion within gc_grace_seconds / 2 to provide a safety margin for repair duration.
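A quick way to sanity-check a schedule is to add the interval between cycle starts to the cycle duration and compare the result with the grace window. The sketch below does that; the function name, argument order, and output labels are illustrative choices, not a Cassandra tool.

```shell
# Illustrative check of a repair schedule against the grace window.
# Args: seconds between cycle starts, seconds one full cycle takes, [gc_grace_seconds].
repair_cadence_check() {
  worst_gap=$(( $1 + $2 ))   # start of one cycle to the end of the next
  gc_grace=${3:-864000}
  if [ "$worst_gap" -ge "$gc_grace" ]; then
    echo "UNSAFE worst_gap=${worst_gap}s gc_grace=${gc_grace}s"
  elif [ "$2" -gt $(( gc_grace / 2 )) ]; then
    echo "TIGHT cycle=${2}s target=$(( gc_grace / 2 ))s"
  else
    echo "OK worst_gap=${worst_gap}s gc_grace=${gc_grace}s"
  fi
}

# Weekly repair (604800 s) that takes 4 days (345600 s), default grace period:
repair_cadence_check 604800 345600   # prints UNSAFE worst_gap=950400s gc_grace=864000s
```

In this example a weekly schedule looks frequent enough, yet the four-day cycle pushes the worst-case gap to 11 days, past the 10-day default.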

Address node downtime immediately

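If a node was down longer than max_hint_window_in_ms (default 3 hours), treat it as inconsistent. Do not assume hinted handoff recovered all missed deletes. Run a full repair on the token ranges owned by that node before it rejoins active production traffic. If the outage exceeded gc_grace_seconds, repair alone is not safe, because other replicas may already have purged the tombstones and the returning node's stale data would be streamed back to them; wipe and rebuild or replace the node instead.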

Prevention

  • Automate repair and monitor completion. Use a repair scheduler such as Reaper, or the unified repair scheduler available in Cassandra 5.1. Alert when the time since the last successful repair exceeds 50% of gc_grace_seconds. Monitor completion of all token ranges, not just the start of the job.
  • Keep repair duration inside the grace window. If a full cluster repair takes several days, run smaller sub-range repairs more frequently. The interval from the start of one complete repair cycle to the end of the next must remain below gc_grace_seconds.
  • Consider only_purge_repaired_tombstones. If your repair automation is immature, this option is the safest defense against accidental resurrection.
  • Repair after any extended outage. If a node is down longer than max_hint_window_in_ms, run repair immediately after recovery. Hints do not cover the full outage window.
  • Do not reduce gc_grace_seconds without proven repair frequency. Some teams lower the grace period to reclaim disk space faster. This is dangerous unless you have verified repair completion times that fit comfortably inside the new window.
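The repair-age alert from the first bullet can be expressed as a small rule. The sketch below treats 50% of gc_grace_seconds as a warning and 80% as critical, mirroring the percentages used in this article; the function name and output labels are hypothetical, and the timestamp of the last successful repair is assumed to come from your own monitoring or from system_distributed.repair_history.

```shell
# Illustrative alert rule on the age of the last successful repair.
# Args: last successful repair (epoch s), current time (epoch s), [gc_grace_seconds].
repair_age_status() {
  age=$(( $2 - $1 ))
  gc_grace=${3:-864000}
  pct=$(( age * 100 / gc_grace ))   # rounded down, for display only
  if [ $(( age * 100 )) -gt $(( gc_grace * 80 )) ]; then
    echo "CRITICAL ${pct}%"
  elif [ $(( age * 100 )) -gt $(( gc_grace * 50 )) ]; then
    echo "WARNING ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}

# Usage with the current time (the first argument is a placeholder timestamp):
# repair_age_status 1700000000 "$(date +%s)"
```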

How Netdata helps

  • Correlate repair lag with compaction pressure and pending tasks to identify when tombstones are at risk of premature purge.
  • Alert on node availability drops that exceed gc_grace_seconds, flagging replicas that may have missed deletes.
  • Track JVM heap usage and GC pause duration, which can delay repair completion and widen the vulnerability window.
  • Monitor SSTable count growth per table as a leading indicator that compaction is outpacing repair reconciliation.
  • Surface hinted handoff backlog and delivery failure rates, highlighting replicas that will need manual repair after recovery.