Cassandra GC death spiral: long pauses, gossip flapping, and recovery
You are paged because a Cassandra node is flapping between UP and DOWN in nodetool status, client timeouts are rising, and system logs show GCInspector warnings. The node has not crashed. It is stuck in a GC death spiral: heap pressure produces long pauses, gossip marks the node DOWN, and the resulting retry and hint traffic creates even more heap pressure when the node recovers. It can start with a single large partition read, a misconfigured cache, or an oversized batch statement, and escalates until the node is effectively useless. Catch it early by watching the GC floor and gossip stability together, not just process uptime.
What this means
During normal operation, Cassandra nodes exchange gossip heartbeats every second. The phi accrual failure detector computes a suspicion level from these heartbeats. At the default phi_convict_threshold of 8, a node is marked DOWN after roughly 18 seconds of missed heartbeats.
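The 18-second figure can be reproduced with a rough calculation. The sketch below assumes the simplified exponential model in which phi equals the time since the last heartbeat divided by the mean heartbeat interval times ln(10). The function name phi_convict_seconds is hypothetical, and real heartbeat arrival intervals vary, so treat the result as an estimate rather than a guarantee.

```shell
# Estimate seconds of silence before a peer convicts a node.
# Assumption: phi = elapsed / (mean_interval * ln 10)  (simplified model).
# Usage: phi_convict_seconds <phi_convict_threshold> <mean_heartbeat_interval_seconds>
phi_convict_seconds() {
    awk -v phi="$1" -v mean="$2" 'BEGIN { printf "%.1f\n", phi * mean * log(10) }'
}
```

Calling phi_convict_seconds 8 1 prints 18.4, and raising the threshold to 12 gives 27.6. Any stop-the-world pause approaching these values is long enough for peers to convict the node.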
When the JVM enters a long stop-the-world GC pause, it cannot gossip. Peers correctly conclude the node is dead and begin storing hinted handoffs. Clients time out and retry against other nodes. When the pause ends, the node returns to UP and must process the accumulated hint replay from every peer, plus the redirected client traffic. This burst often triggers another full GC before the node has fully recovered. The cycle repeats, with the interval between pauses shrinking until the node is either OOM killed or permanently marked DOWN.
Cassandra 3.x and later include a local pause self-protection mechanism: if the local JVM detects a pause greater than 5 seconds, it temporarily skips marking other nodes DOWN to avoid cascading false convictions. This can mask the root cause on the impaired node while still allowing peers to convict it. The spiral is therefore visible first from the cluster view: gossip flapping, dropped mutations, and timeout spikes that correlate with GCInspector log warnings.
flowchart TD
A[Heap pressure] --> B[Long GC pause]
B --> C[Gossip heartbeat missed]
C --> D[Phi accrual marks node DOWN]
D --> E[Hints accumulate on peers]
D --> F[Client retries flood other nodes]
E --> G[Node recovers]
F --> G
G --> H[Hint replay and retry burst]
H --> A
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Large partitions | GC spikes correlate with reads or writes; log warns about large partitions | grep -i "Writing large partition" /var/log/cassandra/system.log and nodetool toppartitions |
| Oversized batch statements | Coordinator heap spikes; batch size warnings in logs | grep -i "Batch" /var/log/cassandra/system.log for WARN lines |
| Bloated on-heap caches | Heap after GC remains high despite low traffic; low cache hit rate | nodetool info for cache size, entries, and hit rate |
| Undersized heap | GC floor climbs toward max; frequent old-gen collections | nodetool info heap used versus max |
| Excessive in-flight requests | Native transport request queue grows; pending tasks rise | nodetool tpstats Native-Transport-Requests pending |
| Memory leak | Heap after GC creeps up slowly regardless of workload | jmap -histo:live (triggers a full GC; use only if already impaired) |
Quick checks
Run these read-only commands to confirm the pattern.
# Gossip state: look for the target node oscillating between UN and DN
nodetool status
# Heap pressure and cache footprint
nodetool info | grep -Ei "Heap Memory|Key Cache|Row Cache"
# Recent GC pauses: inspect for durations > 200 ms; log format varies by JVM version
grep -i "pause" /var/log/cassandra/gc.log | tail -20
# Dropped messages and thread pool saturation
nodetool tpstats
# Recent large partition writes
grep -i "Writing large partition" /var/log/cassandra/system.log | tail -10
# Live GC utilization (run as the same UID as Cassandra)
jstat -gcutil $(pgrep -f CassandraDaemon) 1000
# Hint streaming activity
nodetool netstats
# Local pause self-protection activation
grep -i "local pause" /var/log/cassandra/system.log
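To turn the GCInspector check into a quick list of offending pauses, a small filter can help. This is a minimal sketch that assumes GCInspector lines contain the phrase "GC in <N>ms"; the exact log format can differ between Cassandra versions, and the function name long_gc_pauses is hypothetical.

```shell
# Print GCInspector pause durations (in ms) at or above a threshold, largest first.
# Assumption: matching log lines contain "GCInspector" and "GC in <N>ms".
# Usage: long_gc_pauses <threshold_ms> < /var/log/cassandra/system.log
long_gc_pauses() {
    grep 'GCInspector' \
        | sed -n 's/.* GC in \([0-9][0-9]*\)ms.*/\1/p' \
        | awk -v t="$1" '$1 + 0 >= t + 0' \
        | sort -rn
}
```

Running it with a threshold of 2000 lists only the pauses long enough to threaten gossip. An empty result at 2000 but many hits at 500 suggests the node is degrading but not yet spiraling.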
How to diagnose it
1. Confirm flapping from the cluster perspective. Run nodetool status from multiple nodes. If the victim oscillates between UN and DN while the process remains running, the failure detector is convicting it based on missed heartbeats, not a crash. If only one peer sees the node as DOWN, suspect a network partition instead.
2. Correlate with GC logs. Inspect /var/log/cassandra/gc.log for pauses greater than 2 seconds. Look for GCInspector warnings in system.log. If pauses are longer than the phi threshold window, gossip will reliably mark the node DOWN.
3. Check the heap floor. Use nodetool info to compare heap used to max. If used heap exceeds 75% of max after a known GC event, the node is under severe memory pressure. Above 85% after GC, a death spiral is imminent.
4. Check for dropped messages. Run nodetool tpstats and look at the Dropped section. Non-zero MUTATION drops confirm the replica is shedding writes. Non-zero READ drops confirm requests are expiring in queues.
5. Identify the heap consumer. Check system logs for large partition or batch warnings. Check nodetool info for cache sizes. If the node is already impaired and you need object-level detail, run jmap -histo:live <cassandra_pid> | head -30. This command triggers a stop-the-world full GC itself and requires the same UID or root.
6. Measure hint backlog. On the peers that stored hints, run du -sh /var/lib/cassandra/hints/. Run nodetool statushandoff to verify hint delivery is active. A large hints directory means the recovering node will face a secondary write burst.
7. Check for local pause self-protection. Search system logs for the phrase Not marking nodes down due to local pause. If present, the local failure detector was masking the GC stall while remote peers still convicted the node.
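The heap floor check can be scripted against nodetool info output. The sketch below assumes a line of the form "Heap Memory (MB) : <used> / <max>" and applies the 75% and 85% thresholds from the steps above; the function name heap_pressure and its labels are hypothetical. Because nodetool info reports current usage, the reading is only meaningful when sampled shortly after an old-generation collection.

```shell
# Classify heap usage from `nodetool info` output read on stdin.
# Assumption: a line shaped like "Heap Memory (MB)  : <used> / <max>".
# Usage: nodetool info | heap_pressure
heap_pressure() {
    awk -F'[:/]' '/^Heap Memory/ {
        used = $2 + 0
        max = $3 + 0
        if (max <= 0) next
        pct = used / max * 100
        if (pct > 85)      level = "imminent"
        else if (pct > 75) level = "severe"
        else               level = "ok"
        printf "%.0f%% %s\n", pct, level
    }'
}
```

The pattern anchors on lines that start with "Heap Memory", so the separate off-heap line in the same output is ignored.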
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| GC pause duration | Pauses freeze all threads, including gossip | > 500 ms sustained; any pause > 2 s |
| Heap usage after GC | Shows irreducible long-lived memory pressure | Used > 75% of max heap |
| Gossip state transitions | Flapping is the hallmark of this spiral | > 2 UP/DOWN transitions in 10 minutes |
| Dropped MUTATION messages | The node is shedding writes; data becomes inconsistent | Any sustained non-zero rate |
| Client request timeouts | Clients are waiting too long for replica responses | Timeout rate increasing alongside flapping |
| Hinted handoff store size | Replayed hints create a secondary write burst | Hints directory growing on peers |
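If no metrics pipeline is in place yet, gossip flapping can be approximated from a peer's system.log. This is a sketch under the assumption that the Gossiper logs lines of the form "InetAddress /10.0.0.5 is now DOWN" (newer versions may append a port to the address); the function name gossip_transitions is hypothetical. It counts transitions over whatever input it is given, so feed it only a recent slice of the log.

```shell
# Count UP/DOWN transitions per endpoint, most unstable endpoint first.
# Assumption: Gossiper lines contain "InetAddress <addr> is now UP|DOWN".
# Usage: tail -n 5000 /var/log/cassandra/system.log | gossip_transitions
gossip_transitions() {
    awk '/is now (UP|DOWN)/ {
        for (i = 1; i < NF; i++)
            if ($i == "InetAddress") count[$(i + 1)]++
    }
    END { for (addr in count) print count[addr], addr }' | sort -rn
}
```

An endpoint with more than two transitions in a ten-minute slice matches the warning sign in the table above.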
Fixes
Stop the feedback loop
If a node is actively spiraling, cut client traffic to it while keeping it in the ring so it can stabilize. Run nodetool disablebinary on the impaired node. This stops new CQL requests but leaves gossip and internode messaging running, so the node keeps its token ownership and no data movement is triggered. Once the node stabilizes, re-enable client traffic with nodetool enablebinary. If the recovered node is immediately re-overloaded by hint replay, slow the replay on the peers storing hints by lowering the hint throttle: at runtime with nodetool sethintedhandoffthrottlekb, or persistently via hinted_handoff_throttle_in_kb in cassandra.yaml, which takes effect on restart.
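The quiesce and resume steps can be wrapped in two small helpers so they are easy to run under pressure. This is a minimal sketch: the function names and the NODETOOL override variable are hypothetical conveniences, while disablebinary, enablebinary, and statusbinary are standard nodetool subcommands.

```shell
# NODETOOL can be overridden, for example to add host or credential flags.
NODETOOL=${NODETOOL:-nodetool}

# Stop accepting CQL clients on the impaired node; gossip keeps running.
quiesce_node() {
    $NODETOOL disablebinary && $NODETOOL statusbinary
}

# Re-admit clients once GC pauses and the heap floor have settled.
resume_node() {
    $NODETOOL enablebinary && $NODETOOL statusbinary
}
```

Run quiesce_node on the impaired node, watch GC pauses and the post-GC heap level for several minutes, and call resume_node only after both have settled.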
Right-size memory
For most workloads using G1GC (the shipped default in Cassandra 5.0; 4.x ships with CMS enabled and G1 available as an option in the JVM options files), keep the heap between 8 GB and 16 GB. Do not exceed 31 GB, which is the CompressedOops ceiling; a 31 GB heap typically outperforms a 48 GB heap because the JVM cannot use compressed pointers above that boundary. Set -Xms equal to -Xmx to avoid heap resize cost. Larger heaps make G1 pause targets harder to meet, so adding memory can paradoxically worsen the spiral.
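As a guard against accidentally crossing the compressed-pointer boundary, heap flags can be generated rather than typed by hand. The helper below is a hypothetical sketch: it takes a heap size in whole GB, refuses anything above 31, and prints matching -Xms and -Xmx flags to place in the JVM options file used by your Cassandra version.

```shell
# Emit matching -Xms/-Xmx flags for a heap size given in whole GB.
# Refuses sizes above the 31 GB compressed-pointer ceiling.
# Usage: heap_flags 16
heap_flags() {
    gb=$1
    if [ "$gb" -gt 31 ]; then
        echo "heap ${gb}G exceeds the 31 GB CompressedOops ceiling" >&2
        return 1
    fi
    printf '%s\n' "-Xms${gb}G" "-Xmx${gb}G"
}
```

For example, heap_flags 16 prints -Xms16G and -Xmx16G on separate lines, while heap_flags 48 fails with a message on stderr.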
Remove heap pressure sources
Disable the row cache unless you have verified a high hit rate; it is disabled by default for good reason, and even with the off-heap cache provider each hit materializes the row on the heap. Reduce key cache size if it dominates the old generation. Fix large partitions by redesigning the data model or rate-limiting access to known large keys; identify them with nodetool toppartitions. If large partitions are unavoidable, reduce driver page size and use token-aware load balancing to spread coordinator load. Eliminate large batch statements at the application layer, because they force the coordinator to hold entire batches in heap. Set batch_size_warn_threshold_in_kb and batch_size_fail_threshold_in_kb conservatively and alert on warnings.
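To see which tables produce the most large-partition warnings, the log lines can be grouped by table. This sketch assumes the warning has the shape "Writing large partition <keyspace>/<table>:<key> (...)", which may vary by version; the function name large_partition_tables is hypothetical.

```shell
# Count "Writing large partition" warnings per keyspace/table, busiest first.
# Assumption: the token after "partition" looks like <keyspace>/<table>:<key>.
# Usage: large_partition_tables < /var/log/cassandra/system.log
large_partition_tables() {
    awk '/Writing large partition/ {
        for (i = 1; i < NF; i++)
            if ($i == "partition") {
                split($(i + 1), part, ":")
                count[part[1]]++
                break
            }
    }
    END { for (t in count) print count[t], t }' | sort -rn
}
```

The tables at the top of the output are the first candidates for a data model review or for access rate limiting.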
Emergency recovery
Restarting the impaired node can temporarily free heap and break the spiral, but the spiral will return if the root cause remains. Use a restart only to buy time for a data model or configuration fix. Do not restart as a substitute for root-cause analysis.
Prevention
- Monitor the GC floor (heap used immediately after an old GC), not just heap used. Alert if the post-GC floor trends upward or crosses 75% of max.
- Alert on GCInspector pause warnings before pauses reach 2 seconds.
- Keep row cache disabled unless you have proven it benefits your workload.
- Monitor partition size distributions proactively with nodetool toppartitions or nodetool tablehistograms. Do not wait for the first 100 MB partition to crash the node.
- Ensure repair runs within gc_grace_seconds so anti-entropy does not depend entirely on hint replay, which amplifies recovery load.
- Audit application batch sizes and enforce batch_size_fail_threshold_in_kb.
- Track pending compactions and flushes via nodetool compactionstats and nodetool tpstats; backlog here can spike heap under load.
How Netdata helps
- Correlate JVM GC pause duration with Cassandra gossip DOWN events on the same timeline to confirm the spiral pattern.
- Alert on GC pause duration, gossip flapping, and dropped mutations together.
- Track per-thread-pool pending tasks to distinguish request overload from GC stalls.
- Monitor off-heap RSS alongside JVM heap to catch Linux OOM kills before they happen.
- Flag GCInspector log warnings and large partition alerts via log monitoring.







