Cassandra GC death spiral: long pauses, gossip flapping, and recovery

You are paged because a Cassandra node is flapping between UP and DOWN in nodetool status, client timeouts are rising, and system logs show GCInspector warnings. The node has not crashed. It is stuck in a GC death spiral: heap pressure produces long pauses, gossip marks the node DOWN, and the resulting retry and hint traffic creates even more heap pressure when the node recovers. It can start with a single large partition read, a misconfigured cache, or an oversized batch statement, and escalates until the node is effectively useless. Catch it early by watching the GC floor and gossip stability together, not just process uptime.

What this means

During normal operation, Cassandra nodes exchange gossip heartbeats every second. The phi accrual failure detector computes a suspicion level from these heartbeats. At the default phi_convict_threshold of 8, a node is marked DOWN after roughly 18 seconds of missed heartbeats.
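The 18-second figure can be reproduced with a short calculation. This is a sketch that assumes the exponential approximation commonly attributed to Cassandra's failure detector, where phi is roughly the elapsed silence divided by (mean heartbeat interval × ln 10). The helper name conviction_seconds is illustrative and not part of any Cassandra tool.

```shell
# Sketch: approximate seconds of silence before phi crosses the convict threshold.
# Assumption: phi ~= elapsed / (mean_interval * ln 10).
# $1 = phi_convict_threshold (default 8), $2 = mean heartbeat interval in seconds (default 1)
conviction_seconds() {
  awk -v phi="${1:-8}" -v mean="${2:-1}" \
    'BEGIN { printf "%.1f\n", phi * mean * log(10) }'
}

conviction_seconds 8 1    # default threshold with ~1 s heartbeats
conviction_seconds 12 1   # a raised threshold tolerates longer pauses
```

Under the same assumption, the conviction window scales linearly with both the threshold and the observed heartbeat interval, so a cluster with slower or jittery gossip convicts later than the nominal figure.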

When the JVM enters a long stop-the-world GC pause, it cannot gossip. Peers correctly conclude the node is dead and begin storing hinted handoffs. Clients time out and retry against other nodes. When the pause ends, the node returns to UP and must process the accumulated hint replay from every peer, plus the redirected client traffic. This burst often triggers another full GC before the node has fully recovered. The cycle repeats, with the interval between pauses shrinking until the node is either OOM-killed or permanently marked DOWN.

Cassandra 3.x and later include a local pause self-protection mechanism: if the local JVM detects a pause greater than 5 seconds, it temporarily skips marking other nodes DOWN to avoid cascading false convictions. This can mask the root cause on the impaired node while still allowing peers to convict it. The spiral is therefore visible first from the cluster view: gossip flapping, dropped mutations, and timeout spikes that correlate with GCInspector log warnings.

flowchart TD
    A[Heap pressure] --> B[Long GC pause]
    B --> C[Gossip heartbeat missed]
    C --> D[Phi accrual marks node DOWN]
    D --> E[Hints accumulate on peers]
    D --> F[Client retries flood other nodes]
    E --> G[Node recovers]
    F --> G
    G --> H[Hint replay and retry burst]
    H --> A

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Large partitions | GC spikes correlate with reads or writes; log warns about large partitions | grep -i "Writing large partition" /var/log/cassandra/system.log and nodetool toppartitions |
| Oversized batch statements | Coordinator heap spikes; batch size warnings in logs | grep -i "Batch" /var/log/cassandra/system.log for WARN lines |
| Bloated on-heap caches | Heap after GC remains high despite low traffic; low cache hit rate | nodetool info for cache size, entries, and hit rate |
| Undersized heap | GC floor climbs toward max; frequent old-gen collections | nodetool info heap used versus max |
| Excessive in-flight requests | Native transport request queue grows; pending tasks rise | nodetool tpstats Native-Transport-Requests pending |
| Memory leak | Heap after GC creeps up slowly regardless of workload | jmap -histo:live (triggers a full GC; use only if already impaired) |

Quick checks

Run these read-only commands to confirm the pattern.

# Gossip state: look for the target node oscillating between UN and DN
nodetool status

# Heap pressure and cache footprint
nodetool info | grep -Ei "Heap Memory|Key Cache|Row Cache"

# Recent GC pauses: inspect for durations > 200 ms; log format varies by JVM version
grep -i "pause" /var/log/cassandra/gc.log | tail -20

# Dropped messages and thread pool saturation
nodetool tpstats

# Recent large partition writes
grep -i "Writing large partition" /var/log/cassandra/system.log | tail -10

# Live GC utilization (run as the same UID as Cassandra)
jstat -gcutil $(pgrep -f CassandraDaemon) 1000

# Hint streaming activity
nodetool netstats

# Local pause self-protection activation
grep -i "local pause" /var/log/cassandra/system.log
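To pull only the long pauses out of system.log, a small filter can help. This is a sketch that assumes GCInspector lines contain the wording "<collector> GC in <N>ms"; the exact format can vary by Cassandra version, and long_gc_pauses is a hypothetical helper name.

```shell
# Sketch: print GCInspector pauses at or above a threshold in ms (default 2000).
# Assumption: log lines contain "GC in <N>ms"; adjust the pattern if yours differ.
# Reads log lines on stdin, so it never touches the node itself.
long_gc_pauses() {
  awk -v limit="${1:-2000}" '
    match($0, /GC in [0-9]+ms/) {
      # Skip the 6-character "GC in " prefix and the 2-character "ms" suffix.
      ms = substr($0, RSTART + 6, RLENGTH - 8) + 0
      if (ms >= limit) print ms " ms: " $0
    }'
}

# Usage (read-only):
#   long_gc_pauses 2000 < /var/log/cassandra/system.log | tail -20
```

Prefixing each match with its duration makes it easy to sort the output numerically and see whether pauses are growing over time.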

How to diagnose it

  1. Confirm flapping from the cluster perspective. Run nodetool status from multiple nodes. If the victim oscillates between UN and DN while the process remains running, the failure detector is convicting it based on missed heartbeats, not a crash. If only one peer sees the node as DOWN, suspect a network partition instead.

  2. Correlate with GC logs. Inspect /var/log/cassandra/gc.log for pauses greater than 2 seconds. Look for GCInspector warnings in system.log. If pauses are longer than the phi threshold window, gossip will reliably mark the node DOWN.

  3. Check the heap floor. Use nodetool info to compare heap used to max. If used heap exceeds 75% of max after a known GC event, the node is under severe memory pressure. Above 85% after GC, a death spiral is imminent.

  4. Check for dropped messages. Run nodetool tpstats and look at the Dropped section. Non-zero MUTATION drops confirm the replica is shedding writes. Non-zero READ drops confirm requests are expiring in queues.

  5. Identify the heap consumer. Check system logs for large partition or batch warnings. Check nodetool info for cache sizes. If the node is already impaired and you need object-level detail, run jmap -histo:live <cassandra_pid> | head -30. This command triggers a stop-the-world full GC itself and requires the same UID or root.

  6. Measure hint backlog. On the peers that stored hints, run du -sh /var/lib/cassandra/hints/. Run nodetool statushandoff to verify hint delivery is active. A large hints directory means the recovering node will face a secondary write burst.

  7. Check for local pause self-protection. Search system logs for the phrase Not marking nodes down due to local pause. If present, the local failure detector was masking the GC stall while remote peers still convicted the node.
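The heap thresholds from step 3 can be applied mechanically to nodetool info output. The sketch below assumes the line looks like "Heap Memory (MB) : used / max", and heap_pressure is an illustrative helper name. Note that nodetool info reports the instantaneous heap, so the result only approximates the post-GC floor when sampled shortly after a collection.

```shell
# Sketch: classify heap usage from `nodetool info` output read on stdin.
# Assumption: the line format is "Heap Memory (MB)   : <used> / <max>".
# Thresholds follow the text: above 75% is a warning, above 85% is critical.
heap_pressure() {
  awk -F: '/^Heap Memory/ {
    split($2, v, "/")                    # v[1] = used MB, v[2] = max MB
    pct = 100 * v[1] / v[2]
    state = (pct > 85) ? "critical" : ((pct > 75) ? "warning" : "ok")
    printf "%.0f%% %s\n", pct, state
  }'
}

# Usage (read-only):
#   nodetool info | heap_pressure
```

Taking the lowest value across several samples gives a closer estimate of the floor than any single reading.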

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| GC pause duration | Pauses freeze all threads, including gossip | > 500 ms sustained; any pause > 2 s |
| Heap usage after GC | Shows irreducible long-lived memory pressure | Used > 75% of max heap |
| Gossip state transitions | Flapping is the hallmark of this spiral | > 2 UP/DOWN transitions in 10 minutes |
| Dropped MUTATION messages | The node is shedding writes; data becomes inconsistent | Any sustained non-zero rate |
| Client request timeouts | Clients are waiting too long for replica responses | Timeout rate increasing alongside flapping |
| Hinted handoff store size | Replayed hints create a secondary write burst | Hints directory growing on peers |
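If no metrics pipeline is tracking gossip transitions yet, they can be counted from a peer's system.log. This sketch assumes the Gossiper logs lines of the form "InetAddress <address> is now UP" or "is now DOWN"; gossip_transitions is a hypothetical helper, and the time window is whatever slice of the log you feed it.

```shell
# Sketch: count UP/DOWN transitions per node from a peer's log lines on stdin.
# Assumption: lines contain "InetAddress <addr> is now UP|DOWN".
# Prints "<addr> <count>" sorted by address.
gossip_transitions() {
  awk '/InetAddress .* is now (UP|DOWN)/ {
    for (i = 1; i < NF; i++)
      if ($i == "InetAddress") count[$(i + 1)]++   # field after the keyword is the address
  }
  END { for (addr in count) print addr, count[addr] }' | sort
}

# Usage (read-only), pre-filtering to a ten-minute window by timestamp prefix:
#   grep '2024-01-01 10:0' /var/log/cassandra/system.log | gossip_transitions
```

A node showing more than two transitions in a ten-minute slice matches the warning sign in the table above.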

Fixes

Stop the feedback loop

If a node is actively spiraling, cut client traffic to it while keeping it in the ring so it can stabilize. Run nodetool disablebinary on the impaired node. This stops new CQL requests but allows gossip and internode messaging to continue, preventing unnecessary data movement. Once the node stabilizes, re-enable client traffic with nodetool enablebinary. If the recovered node is immediately re-overloaded by hint replay, temporarily lower the hint throttle on the peers storing hints: nodetool sethintedhandoffthrottlekb changes it at runtime, whereas editing hinted_handoff_throttle_in_kb in cassandra.yaml only takes effect after a restart.
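The sequence above can be wrapped in small helpers with a dry-run default, so the commands can be reviewed before anything is executed. This is a sketch: the function names and the DRY_RUN variable are illustrative, and the 256 KB/s throttle is an arbitrary example value, not a recommendation.

```shell
# Sketch: isolate an impaired node from clients and slow hint replay on peers.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# On the impaired node: stop CQL clients, keep gossip and internode traffic.
quiesce_node() {
  run nodetool disablebinary
  run nodetool statusbinary
}

# On the impaired node, once GC pauses have settled.
resume_node() {
  run nodetool enablebinary
}

# On each peer holding hints: lower the replay throttle at runtime (KB per second).
slow_hint_replay() {
  run nodetool sethintedhandoffthrottlekb "${1:-256}"
}
```

Keeping the node-side and peer-side steps in separate functions reflects that they run on different hosts; remember to restore the original throttle after the backlog has drained.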

Right-size memory

For most workloads using G1GC (the shipped default in Cassandra 5.x; 4.x ships with CMS enabled and G1 as a commented-out alternative in the JVM options files), keep the heap between 8 GB and 16 GB. Do not exceed 31 GB: above roughly 32 GB the JVM can no longer use compressed object pointers (CompressedOops), so a 31 GB heap typically outperforms a 48 GB heap. Set -Xms equal to -Xmx to avoid heap resize cost. Larger heaps also make G1 pause targets harder to meet, so adding memory can paradoxically worsen the spiral.
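A quick way to verify the -Xms and -Xmx advice is to scan the JVM options file. The sketch below is a plain text comparison under stated assumptions: the flags appear one per line and uncommented, and the file path in the usage line is only an example that varies by install and version. The helper name check_heap_flags is hypothetical.

```shell
# Sketch: report whether -Xms and -Xmx match in JVM options read on stdin.
# Assumption: flags are one per line; commented lines (starting with #) are ignored.
# Limitation: compares the literal values, so 16G and 16384M count as a mismatch.
check_heap_flags() {
  awk '
    /^-Xms/ { xms = substr($1, 5) }
    /^-Xmx/ { xmx = substr($1, 5) }
    END {
      if (xms == "" || xmx == "") print "unset: -Xms or -Xmx not found"
      else if (toupper(xms) == toupper(xmx)) print "ok: -Xms and -Xmx are both " xmx
      else print "mismatch: -Xms" xms " vs -Xmx" xmx
    }'
}

# Usage (read-only; example path, adjust for your install):
#   check_heap_flags < /etc/cassandra/jvm-server.options
```

An "unset" result usually means the heap is sized elsewhere, for example by the startup script's automatic calculation, which is worth confirming before changing anything.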

Remove heap pressure sources

Disable the row cache unless you have verified a high hit rate; it is disabled by default for good reason, and even with the default off-heap provider each hit deserializes cached rows onto the heap. Reduce key cache size if it dominates the old generation. Fix large partitions by redesigning the data model or rate-limiting access to known large keys; identify them with nodetool toppartitions. If large partitions are unavoidable, reduce driver page size and use token-aware load balancing to spread coordinator load. Eliminate large batch statements at the application layer, because they force the coordinator to hold entire batches in heap. Set batch_size_warn_threshold_in_kb and batch_size_fail_threshold_in_kb conservatively and alert on warnings.
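To see which tables produce the large partition warnings, the log lines can be tallied per table. This sketch assumes the warning reads "Writing large partition <keyspace>/<table>:<key> (<size>) to sstable ..."; the wording may differ between versions, and large_partition_tables is an illustrative helper name.

```shell
# Sketch: tally large-partition warnings per keyspace/table from log lines on stdin.
# Assumption: lines contain "Writing large partition <keyspace>/<table>:<key> ...".
# Prints "<count> <keyspace>/<table>", most frequent first.
large_partition_tables() {
  sed -n 's/.*Writing large partition \([^:]*\):.*/\1/p' |
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage (read-only):
#   large_partition_tables < /var/log/cassandra/system.log
```

The tables at the top of the list are the first candidates for a data model review or for the page-size reduction described above.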

Emergency recovery

Restarting the impaired node can temporarily free heap and break the spiral, but the spiral will return if the root cause remains. Use a restart only to buy time for a data model or configuration fix. Do not restart as a substitute for root-cause analysis.

Prevention

  • Monitor the GC floor (heap used immediately after an old GC), not just heap used. Alert if the post-GC floor trends upward or crosses 75% of max.
  • Alert on GCInspector pause warnings before pauses reach 2 seconds.
  • Keep row cache disabled unless you have proven it benefits your workload.
  • Monitor partition size distributions proactively with nodetool toppartitions or nodetool tablehistograms. Do not wait for the first 100 MB partition to crash the node.
  • Ensure repair runs within gc_grace_seconds so anti-entropy does not depend entirely on hint replay, which amplifies recovery load.
  • Audit application batch sizes and enforce batch_size_fail_threshold_in_kb.
  • Track pending compactions and flushes via nodetool compactionstats and nodetool tpstats; backlog here can spike heap under load.
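For the thread pool backlog in the last item, nodetool tpstats output can be reduced to just the rows that need attention. This is a sketch assuming the usual layout, a "Pool Name" table with Pending in the third column followed by a "Message type" table with Dropped in the second; tpstats_hotspots is a hypothetical helper name.

```shell
# Sketch: list thread pools with pending tasks and message types with drops,
# reading `nodetool tpstats` output on stdin.
# Assumption: "Pool Name" rows carry Pending in column 3 and "Message type"
# rows carry Dropped in column 2; non-numeric sub-header rows are skipped.
tpstats_hotspots() {
  awk '
    /^Pool Name/    { section = "pool"; next }
    /^Message type/ { section = "dropped"; next }
    section == "pool"    && $3 ~ /^[0-9]+$/ && $3 > 0 { print "pending", $1, $3 }
    section == "dropped" && $2 ~ /^[0-9]+$/ && $2 > 0 { print "dropped", $1, $2 }
  '
}

# Usage (read-only):
#   nodetool tpstats | tpstats_hotspots
```

Empty output means no pool has queued work and nothing has been dropped since the counters were last reset, which is the healthy baseline to alert against.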

How Netdata helps

  • Correlate JVM GC pause duration with Cassandra gossip DOWN events on the same timeline to confirm the spiral pattern.
  • Alert on GC pause duration, gossip flapping, and dropped mutations together.
  • Track per-thread-pool pending tasks to distinguish request overload from GC stalls.
  • Monitor process RSS, which includes off-heap memory, alongside JVM heap to catch Linux OOM kills before they happen.
  • Flag GCInspector log warnings and large partition alerts via log monitoring.