The GC death spiral
A trigger — a large partition read, a tombstone scan, an oversized batch — drives a long old-gen GC pause. During the pause the node can't gossip, so peers mark it DN. Clients retry elsewhere, hints accumulate, and when the node recovers it faces a flood of hint replay and read repair on top of the now-higher baseline. That load triggers the next pause. The canonical Cassandra failure mode.
- GCInspector pause warnings, G1 Old Generation pauses above 2s
- heap usage not recovering after GC (rising post-GC floor)
- node flapping UP/DOWN in gossip while reachable between pauses
- dropped mutations and client timeouts rising together







