Cassandra TTL tombstone accumulation: time-series tables and TWCS

You set a TTL on every row expecting old data to vanish. Instead, disk usage climbs, read latency spikes, and your logs fill with tombstone warnings.

An expired TTL does not delete data immediately. Cassandra writes a tombstone that survives until compaction runs and the tombstone exceeds gc_grace_seconds (default 864000 seconds, or 10 days). In time-series workloads ingesting millions of TTL’d rows per hour, the wrong compaction strategy turns this bookkeeping into a production incident.
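The timeline above reduces to simple arithmetic. The sketch below is illustrative only: `earliest_drop_time` is a hypothetical helper name, and it models just the two delays named in the text (the TTL and gc_grace_seconds). It says nothing about when compaction actually runs.

```python
from datetime import datetime, timedelta, timezone

GC_GRACE_SECONDS_DEFAULT = 864_000  # 10 days, the default cited above


def earliest_drop_time(written_at, ttl_seconds, gc_grace_seconds=GC_GRACE_SECONDS_DEFAULT):
    """Earliest moment a TTL'd row could be removed (hypothetical helper).

    Models only TTL + gc_grace_seconds; real removal also needs compaction to run.
    """
    expires_at = written_at + timedelta(seconds=ttl_seconds)
    return expires_at + timedelta(seconds=gc_grace_seconds)


written = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(earliest_drop_time(written, ttl_seconds=30 * 86_400))
```

With a 30-day TTL and the default grace period, a row written on 1 January cannot leave disk before 10 February, so "30 days of retention" means at least 40 days of disk footprint.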

TimeWindowCompactionStrategy (TWCS) groups SSTables into time windows and drops entire files when every row inside has expired, bypassing the expensive cell-by-cell merging that SizeTieredCompactionStrategy (STCS) and LeveledCompactionStrategy (LCS) require. TWCS is not a set-and-forget fix, though. It suppresses tombstone compaction by default and fails silently when out-of-order writes land in old windows. Cassandra 5.0 introduces Unified Compaction Strategy (UCS), which includes a time-series profile and can be reconfigured at runtime without a full compaction. Understanding how tombstones accumulate under TWCS, and where it fails, is essential for running time-series tables at scale.

What it is and why it matters

A tombstone is Cassandra’s deletion marker. When a cell’s TTL expires, the node treats it as a delete and marks the cell with a tombstone that shadows the old value. That tombstone is not free. It occupies disk space, consumes heap during reads, and forces the read path to scan and discard dead data until it is eligible for removal.

Time-series tables are usually append-only: new data streams in by timestamp and expires on a fixed TTL. The storage engine is an LSM tree, so every flush creates a new SSTable. Under STCS or LCS, expired tombstones scatter across many SSTables that also contain live data. Compaction must read all of those files, merge them cell-by-cell, and rewrite the live data into new SSTables just to eliminate the dead cells. As write volume grows, compaction cannot keep up. Tombstones accumulate, read amplification explodes, and queries abort when they hit tombstone_failure_threshold (default 100000).

TWCS partitions SSTables into time windows. If every row in an SSTable has expired (and, when only_purge_repaired_tombstones is enabled, the SSTable has been repaired), the entire SSTable can be dropped as a single file. No merge, no rewrite, no per-cell tombstone compaction. This is efficient, but it relies on a strict contract: writes must arrive in time order, and rows written in the same window must expire at roughly the same time.

How it works

flowchart TD
    A[Write with TTL] --> B[Flush to SSTable]
    B --> C[TTL expires]
    C --> D[Tombstone in SSTable]
    D --> E{Compaction strategy}
    E -->|STCS / LCS| F[Tombstones mixed with live data]
    F --> G[Cell-by-cell merge to purge]
    E -->|TWCS| H[Time-windowed SSTables]
    H --> I[Window fully expired]
    I --> J[Whole file dropped]

TTL does not change the write path. The mutation is appended to the commitlog and memtable; on flush, the TTL is stored as an expiration timestamp in the SSTable. No background thread wakes when TTL expires. Expiration is lazy: it is discovered during reads or compaction.

STCS merges SSTables of similar size. An old SSTable containing mostly expired data is compacted with a newer SSTable containing mostly live data. The tombstones are purged only after reading and rewriting all the live data. In high-throughput time-series workloads, this creates a compaction death spiral: writes generate SSTables faster than compaction can rewrite them, so tombstones never clear.

LCS organizes SSTables into levels with strict size boundaries. Tombstones propagate through levels during compaction. LCS also lacks a fast path for bulk TTL expiration; it runs more frequent, smaller compactions. Write amplification is high for pure ingest workloads.

TWCS assigns each SSTable to a time window based on the maximum timestamp of the data it contains. Within the active window, SSTables are compacted using STCS. Once the window closes, no new SSTables should belong to it. When every cell in an SSTable has passed its TTL plus gc_grace_seconds (and, if only_purge_repaired_tombstones is enabled, the SSTable has been repaired), Cassandra deletes the SSTable file outright. This is the whole-file drop optimization.
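Window assignment is a rounding operation on a timestamp. The following is a minimal sketch of that bucketing, not Cassandra's implementation: `window_bounds_ms` is a hypothetical name, the input is assumed to be in milliseconds, and the unit names mirror the `compaction_window_unit` values MINUTES, HOURS, and DAYS.

```python
WINDOW_UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3_600, "DAYS": 86_400}


def window_bounds_ms(timestamp_ms, unit="DAYS", size=1):
    """Return (lower, upper) bounds in ms of the window containing timestamp_ms.

    Hypothetical helper: rounds the timestamp down to a multiple of the window length.
    """
    window_seconds = WINDOW_UNIT_SECONDS[unit] * size
    timestamp_s = timestamp_ms // 1000
    lower_s = timestamp_s - (timestamp_s % window_seconds)
    lower_ms = lower_s * 1000
    upper_ms = lower_ms + window_seconds * 1000 - 1  # inclusive upper bound
    return lower_ms, upper_ms


print(window_bounds_ms(1_700_000_000_000, unit="DAYS", size=1))
```

Two SSTables whose timestamps round to the same lower bound land in the same window. A single late write can therefore pull a freshly flushed SSTable into a window that should already be closed.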

TWCS suppresses single-SSTable tombstone compaction by default. It expects tombstones to be removed by whole-file expiration. If something prevents the whole-file drop, tombstones have no escape hatch unless you explicitly enable one. Tombstone compaction is disabled unless tombstone_threshold, tombstone_compaction_interval, or unchecked_tombstone_compaction is set to a non-default value. unchecked_tombstone_compaction defaults to false.

Out-of-order writes block whole-file drops. If read repair, a backfill with USING TIMESTAMP, or a mixed write path places even a single live cell into a newer SSTable that overlaps an old window, the old SSTable cannot drop. The newer SSTable blocks it. In that scenario, TWCS behaves worse than STCS: it refuses cell-level tombstone compaction by default, so tombstones accumulate indefinitely until you intervene.
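The blocking rule can be illustrated with a deliberately simplified model. The sketch assumes all SSTables overlap in token range and ignores memtables. The `SSTable` record and the `droppable` function are hypothetical names for illustration, not Cassandra classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    min_timestamp: int   # oldest write timestamp in the file
    max_timestamp: int   # newest write timestamp in the file
    fully_expired: bool  # every cell is past TTL + gc_grace_seconds


def droppable(sstables):
    """Names of expired SSTables that no live SSTable blocks (simplified model)."""
    live = [s for s in sstables if not s.fully_expired]
    # Oldest timestamp still held by any non-expired SSTable.
    oldest_live = min((s.min_timestamp for s in live), default=None)
    return [
        s.name
        for s in sstables
        if s.fully_expired and (oldest_live is None or s.max_timestamp < oldest_live)
    ]


old = SSTable("old-window", 0, 100, fully_expired=True)
fresh = SSTable("current-window", 200, 300, fully_expired=False)
backfill = SSTable("backfill", 50, 300, fully_expired=False)

print(droppable([old, fresh]))
print(droppable([old, fresh, backfill]))
```

In the second call, the backfilled SSTable holds a timestamp older than the newest data in the expired file, so the expired file can no longer be dropped on its own.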

Two options exist to force expiration, and they carry different levels of risk. Setting unchecked_tombstone_compaction: true allows Cassandra to run tombstone compaction on individual SSTables; safety checks for shadowing data are still performed internally. The second option, unsafe_aggressive_sstable_expiration, drops expired SSTables without checking whether they shadow data in other SSTables. It requires the JVM flag -Dcassandra.unsafe_aggressive_sstable_expiration=true and can cause data resurrection. Do not use it without understanding the shadowing risk.

If only_purge_repaired_tombstones is enabled, tombstone removal is gated behind successful repair. A fully expired window will not drop until repair completes. If repair runs slower than the TTL window, disk accumulates even with correct TWCS configuration.

Where it shows up in production

The first symptom is disk usage that does not decrease after the TTL period. You expect data from 30 days ago to be gone, but nodetool info shows Load holding steady or growing.

Range scans over historical windows spike in latency. The read path must scan and merge tombstones from dozens of SSTables. P50 latency may look fine while P99 explodes.

If tombstones per read exceed tombstone_warn_threshold (default 1000), Cassandra logs warnings. At tombstone_failure_threshold (default 100000), it aborts the query with TombstoneOverwhelmingException. This often hits dashboards or batch jobs that scan wide time ranges.
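The two thresholds split reads into three outcomes. The sketch below uses the default values quoted above. `classify_read` is a hypothetical helper, and treating each threshold as a strict "greater than" comparison is an assumption.

```python
TOMBSTONE_WARN_THRESHOLD = 1_000       # default cited above
TOMBSTONE_FAILURE_THRESHOLD = 100_000  # default cited above


def classify_read(tombstones_scanned,
                  warn=TOMBSTONE_WARN_THRESHOLD,
                  fail=TOMBSTONE_FAILURE_THRESHOLD):
    """Return 'ok', 'warn', or 'abort' for a single read (hypothetical helper)."""
    if tombstones_scanned > fail:   # assumed strict comparison
        return "abort"
    if tombstones_scanned > warn:
        return "warn"
    return "ok"


for count in (500, 5_000, 250_000):
    print(count, classify_read(count))
```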

GC pressure also increases. Tombstone-heavy reads allocate temporary objects during the merge phase. This triggers young generation churn and can promote garbage into the old generation, increasing GC pause frequency.

A specific tell for TWCS misconfiguration is high SSTable counts in old time windows. In a healthy TWCS table, each closed window should have roughly one SSTable after compaction. If old windows still contain multiple SSTables, you likely have overlapping SSTables from out-of-order writes.
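This check can be approximated offline. The sketch assumes you have already collected one timestamp (in seconds) per SSTable, for example from sstablemetadata output. `crowded_windows` is a hypothetical name.

```python
from collections import Counter


def crowded_windows(sstable_timestamps_s, window_seconds, now_s):
    """Map window start -> SSTable count for closed windows holding more than one SSTable."""
    current_window = now_s - (now_s % window_seconds)
    # Round each SSTable timestamp down to the start of its window.
    counts = Counter(ts - (ts % window_seconds) for ts in sstable_timestamps_s)
    return {
        start: n
        for start, n in sorted(counts.items())
        if start < current_window and n > 1  # skip the active window
    }


timestamps = [10, 20, 86_500, 172_900, 173_000]
print(crowded_windows(timestamps, window_seconds=86_400, now_s=173_500))
```

The active window is excluded because it is expected to hold several SSTables until it closes.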

Tradeoffs and when to use it

TWCS is purpose-built for append-only time-series data with time-ordered writes and uniform TTL. Use it when:

  • Data is inserted with the current timestamp and no backfills land in old windows.
  • TTL is roughly uniform and known at write time.
  • Queries read contiguous time ranges.
  • The workload is ingest-heavy, not update-heavy.

Do not use TWCS when:

  • You update old data or backfill historical windows. Out-of-order writes destroy the whole-file drop optimization.
  • TTLs vary wildly within the same table. If some rows expire in one hour and others in thirty days, windows cannot drop until the longest TTL passes.
  • You need efficient single-row lookups outside the time dimension. TWCS does not improve point-read performance; it optimizes time-range scans and expiration.
  • You are starting a new deployment on Cassandra 5.0 or newer. UCS is available and recommended for new tables; it includes a time-series profile and can be reconfigured at runtime without a full compaction.

If you run TWCS on a table with default_time_to_live, set unchecked_tombstone_compaction: true unless you are certain out-of-order writes are impossible. This provides a fallback path when whole-file drops are blocked.
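A configuration like this could be rendered as follows. This is a sketch only: `twcs_alter_statement` is a hypothetical helper, the keyspace and table names are placeholders, and identifiers are not escaped, so it is suitable only for trusted input. The compaction option names are the ones discussed in this article plus the standard TWCS window settings.

```python
def twcs_alter_statement(keyspace, table, window_unit="DAYS", window_size=1,
                         unchecked_tombstone_compaction=True):
    """Build an ALTER TABLE statement that enables TWCS with a tombstone fallback."""
    options = {
        "class": "TimeWindowCompactionStrategy",
        "compaction_window_unit": window_unit,
        "compaction_window_size": str(window_size),
        "unchecked_tombstone_compaction": str(unchecked_tombstone_compaction).lower(),
    }
    rendered = ", ".join(f"'{key}': '{value}'" for key, value in options.items())
    return f"ALTER TABLE {keyspace}.{table} WITH compaction = {{{rendered}}};"


# "metrics" and "samples" are placeholder names.
print(twcs_alter_statement("metrics", "samples"))
```

Review the generated statement before running it in cqlsh. Changing compaction options on a large table can trigger significant compaction activity.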

Reducing compaction_window_size does not retroactively split existing SSTables. Only new flushes respect the smaller window. Plan window sizing before the table grows.
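One way to reason about window sizing up front is to estimate how many windows will be on disk at steady state. The sketch below is a rough estimate: it assumes a uniform TTL, one SSTable per closed window, and that a window stays on disk for TTL plus gc_grace_seconds. `windows_on_disk` is a hypothetical helper.

```python
import math


def windows_on_disk(ttl_seconds, window_seconds, gc_grace_seconds=864_000):
    """Approximate count of windows retained before they become droppable."""
    return math.ceil((ttl_seconds + gc_grace_seconds) / window_seconds)


thirty_days = 30 * 86_400
print(windows_on_disk(thirty_days, window_seconds=86_400))  # daily windows
print(windows_on_disk(thirty_days, window_seconds=3_600))   # hourly windows
```

Under these assumptions, daily windows keep about 40 windows on disk for a 30-day TTL, while hourly windows keep about 960.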

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Tombstone scan warnings in system log | Reads are traversing excessive dead data | Sustained "Scanned over .* tombstones" entries |
| SSTable count in TWCS windows | Multiple SSTables in expired windows indicate blocked drops | Old windows still contain multiple SSTables |
| Pending compactions | Compaction backlog prevents tombstone purging | PendingTasks in nodetool compactionstats trending upward over 4+ hours |
| Repair status vs gc_grace_seconds | Tombstones cannot be purged without repair | Time since last repair > 80% of gc_grace_seconds |
| Disk space usage | Expired data should reclaim space; if not, tombstones are accumulating | Load flat or growing despite TTL |
| sstableexpiredblockers output | Identifies which SSTables prevent window drops | Any blockers listed for expired windows |
| Local read latency p99 | Tombstone merging is CPU and I/O intensive | p99 > 3x baseline or spiking on time-range queries |
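The repair signal is the simplest of these to turn into a check. The sketch below assumes you can obtain the time since the last successful repair from your own tooling. `repair_at_risk` is a hypothetical helper, and the 80% ratio is the warning level suggested above.

```python
def repair_at_risk(seconds_since_last_repair, gc_grace_seconds=864_000, ratio=0.8):
    """True when the time since the last repair exceeds ratio * gc_grace_seconds."""
    return seconds_since_last_repair > ratio * gc_grace_seconds


print(repair_at_risk(7 * 86_400))  # 7 days since repair, default 10-day grace
print(repair_at_risk(9 * 86_400))  # 9 days since repair
```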

How Netdata helps

  • Correlate tombstone warning logs with per-table read latency spikes and GC pause duration to confirm tombstones are the root cause.
  • Track LiveSSTableCount and pending compactions per table to detect when TWCS windows are not dropping.
  • Monitor JVM heap usage and GC pause trends to catch memory pressure from tombstone-heavy range scans before it triggers a GC death spiral.
  • Alert on repair schedule drift relative to gc_grace_seconds so unrepaired tombstones do not silently accumulate.