Cassandra gossip flapping: nodes bouncing UP and DOWN

nodetool status shows a node flipping between UN and DN, or multiple nodes doing it in sequence. Each transition forces the cluster to replay hints, recalculate read repair, and propagate gossip state. When a node transitions more than three times in thirty minutes, you are dealing with gossip flapping. It is almost always JVM heap pressure or GC pauses misdiagnosed as a network problem.

Cassandra’s phi accrual failure detector uses a sliding window of heartbeat inter-arrival times. With the default phi_convict_threshold of 8, a node must miss gossip heartbeats for roughly 18 seconds before peers mark it DOWN. A single long GC pause can cross that threshold. When the pause ends, the node resumes gossip, peers mark it UP, hint delivery begins, and the cycle repeats. The result is a self-reinforcing spiral: redirected traffic and hint replay create more memory pressure, which triggers longer pauses.
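
The 18-second figure can be reproduced with a quick calculation. The sketch below assumes the exponential approximation commonly attributed to Cassandra's failure detector, phi = elapsed / (mean_interval × ln 10), and a mean heartbeat inter-arrival time of about one second. `phi_convict_seconds` is a hypothetical helper written for this guide, not a Cassandra tool.

```shell
# Hypothetical helper: approximate seconds of silence before a peer is convicted.
# Assumption: phi = elapsed / (mean_interval * ln 10), so
# elapsed = threshold * mean_interval * ln 10. awk's log() is the natural log.
phi_convict_seconds() {
  awk -v t="$1" -v m="$2" 'BEGIN { printf "%.1f\n", t * m * log(10) }'
}

phi_convict_seconds 8 1    # default threshold, ~1 s mean interval: prints 18.4
phi_convict_seconds 12 1   # an illustrative raised threshold: prints 27.6
```

The observed mean interval varies from cluster to cluster, so treat the result as an estimate rather than a hard cutoff.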

Because the failure detector is local to each node, different coordinators can hold different views of the ring during an incident. A flapping node degrades the entire cluster with gossip storms and inconsistent replica selection.

What this means

Gossip runs every second. Each exchange carries heartbeats, topology, and application state. When a node is marked DOWN, healthy coordinators store hinted handoffs for its token ranges. When it returns, they deliver those hints and re-evaluate replicas for read repair. If the node flaps rapidly, the cluster spends more time managing membership than serving requests.

The phi accrual detector adapts to network variance, but it cannot distinguish a network partition from a JVM frozen by garbage collection. To the failure detector, both look like missing heartbeats. Operators often restart networking or blame the cloud provider before checking GC logs. Check GC first.

flowchart TD
    A[Long GC pause freezes JVM] --> B[Missed gossip heartbeats > 18s]
    B --> C[Peers mark node DOWN]
    C --> D[Hints stored and read repair recalculated]
    D --> E[Node recovers and is marked UP]
    E --> F[Hint replay floods recovering node]
    F --> G[Heap pressure increases]
    G --> A

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| GC pauses exceeding phi threshold | Node flaps with no network errors; GC logs show pauses > 18 s; nodetool info shows heap near limit | GC logs and heap usage after full GC |
| Gossip thread pool saturation | GossipStage pending tasks stay above zero; schema state lingers | nodetool tpstats |
| Network partition or packet loss | Asymmetric views between nodes; some peers see DOWN while others see UP | nodetool status from multiple nodes |
| Clock skew | Timestamp validation failures in logs; schema disagreement after restarts | NTP synchronization on all nodes |
| File descriptor exhaustion | Node is reachable but inter-node messaging fails; open FDs near ulimit | /proc/<pid>/fd and limits |

Quick checks

Run these read-only commands to narrow the cause.

# Compare cluster views from two different nodes
nodetool status

# Check heap pressure
nodetool info | grep -i "Heap Memory"

# Check gossip and request thread pool saturation
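nodetool tpstats

# Check for stop-the-world pauses > 200 ms
# (assumes JDK 9+ unified GC logging, where pause lines end in a duration such as "12.345ms";
#  "$NF + 0" forces a numeric comparison on that last field)
grep -i "pause" /var/log/cassandra/gc.log* | awk '$NF + 0 > 200'

# Verify schema agreement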
nodetool describecluster

# Check file descriptor consumption vs limit
ls /proc/$(pgrep -f CassandraDaemon)/fd | wc -l
cat /proc/$(pgrep -f CassandraDaemon)/limits | grep "Max open files"

# Check commitlog and data device I/O saturation
iostat -x 1

How to diagnose it

  1. Separate flapping from true failure. A node DOWN for more than five minutes is a true failure. A node transitioning more than three times in thirty minutes is flapping. Apply sustained-duration filters to alerts so transient GC blips do not page.
  2. Compare views across the ring. Run nodetool status from at least two nodes. If one coordinator shows DN while another shows UN, suspect a network partition or asymmetric packet loss. If all agree on the flapping state, suspect GC or local saturation.
  3. Read GC logs before tcpdump. Look for stop-the-world pauses. With default phi_convict_threshold of 8, a pause longer than roughly 18 seconds will typically trigger a DOWN marking. Pauses longer than 2 seconds already risk gossip disruption. If pauses correlate with DOWN transitions, GC is the culprit.
  4. Check heap floor. nodetool info gives used and max. The critical number is heap used after a full GC, not the peak. If the post-GC floor is climbing toward 75% of max, the node is headed for a GC death spiral.
  5. Inspect thread pools. In nodetool tpstats, sustained pending tasks in GossipStage mean gossip tasks are queuing faster than they execute. This often accompanies mutation stage saturation or compaction overload.
  6. Validate schema agreement. Run nodetool describecluster. Persistent schema disagreement means gossip cannot propagate application state cleanly. If this correlates with flapping, the node may be too saturated to process migration messages.
  7. Check file descriptors and I/O. File descriptor exhaustion prevents socket communication, which starves gossip. Disk I/O saturation on the commitlog or data device can stall the JVM and mimic network unreachability.
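
The flapping rule from step 1 can be sketched as a log filter. This assumes system.log records liveness changes in the form `InetAddress /10.0.0.5:7000 is now DOWN` (the exact wording can differ between Cassandra versions) and that the caller has already narrowed the input to the last thirty minutes. `count_transitions` is a hypothetical helper.

```shell
# Hypothetical helper: count UP/DOWN transitions per peer from log lines on stdin.
# Assumption: each transition line ends with "<address> is now UP|DOWN",
# so the address is the fourth field from the end.
count_transitions() {
  awk '/InetAddress .* is now (UP|DOWN)/ { n[$(NF-3)]++ }
       END {
         for (a in n)
           printf "%s %d %s\n", a, n[a], (n[a] > 3 ? "FLAPPING" : "ok")
       }'
}

# Usage (input pre-filtered to the 30-minute window of interest):
#   count_transitions < recent-system.log
```

A peer with more than three transitions in the window is reported as FLAPPING; anything else is left as ok and should be judged by the five-minute DOWN rule instead.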

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Node liveness transitions | Direct measure of flapping frequency | > 3 UP/DOWN transitions in 30 minutes |
| GC pause duration | Pauses > ~18 s breach phi threshold and mark node DOWN | Max pause > 2,000 ms sustained |
| JVM heap after GC | Heap pressure drives the GC death spiral that causes flapping | Post-GC heap > 75% of max |
| GossipStage pending tasks | Gossip backlog prevents heartbeat processing and state propagation | Pending > 0 sustained > 60 s |
| Schema versions | Disagreement indicates gossip cannot propagate state | > 1 schema version for > 5 minutes |
| File descriptor usage | FD exhaustion blocks inter-node sockets | Open FDs > 80% of ulimit |
| Client request timeouts | Flapping causes replica unavailability and tail latency | Timeout rate > 0.1% sustained |
| Dropped messages | Load shedding confirms the node cannot keep up | Non-zero rate in MUTATION or READ |
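
Two of these warning signs, pause duration and post-GC heap, can be read straight from the GC log. The sketch below assumes JDK 11+ unified logging, where a pause line looks like `GC(42) Pause Young (Normal) (G1 Evacuation Pause) 4096M->2048M(8192M) 12.345ms`, and treats the value in parentheses as the heap maximum (which holds when -Xms equals -Xmx). `gc_pause_report` is a hypothetical helper, and the 2,000 ms and 75% thresholds are the ones from the table.

```shell
# Hypothetical helper: summarize pause lines from a unified-logging GC log on stdin.
# Assumption: lines contain "<before>M-><after>M(<capacity>M)" and end in "<n>ms".
gc_pause_report() {
  awk '/Pause/ {
    ms = $NF + 0                              # "12.345ms" -> 12.345
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+M->[0-9]+M\([0-9]+M\)$/) {
        split($i, p, /M->|M\(|M\)/)           # p[1]=before, p[2]=after, p[3]=capacity
        pct = 100 * p[2] / p[3]
        flag = (ms > 2000 || pct > 75) ? "WARN" : "ok"
        printf "%.0fms post-gc=%.0f%% %s\n", ms, pct, flag
      }
  }'
}

# Usage:
#   gc_pause_report < /var/log/cassandra/gc.log
```

Heap after a young collection is only an approximation of the post-full-GC floor, so a WARN here is a prompt to look closer, not a verdict.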

Fixes

GC pressure and heap exhaustion

If GC pauses correlate with flapping, identify the allocation trigger before restarting. Large partition reads, tombstone-heavy scans, and oversized batch statements are common causes. Use nodetool tablehistograms to find tables with extreme partition sizes.

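Disruptive: nodetool disablebinary shuts down the native transport, so the node stops accepting client (CQL) connections while it stays in the ring and keeps gossiping. Use it for temporary relief while you investigate, and re-enable client traffic with nodetool enablebinary afterwards.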

Tune the JVM for stable pauses. Keep -Xms equal to -Xmx. Do not set the heap above 16 GB with G1GC. If old-generation pauses remain long, consider reducing heap size or moving large off-heap consumers. Disable or shrink the row cache if it is enabled; it consumes old-generation space and is disabled by default for good reason.

Gossip and thread pool saturation

If nodetool tpstats shows sustained pending tasks in the gossip pool, the node is overloaded. Reduce non-essential background load such as repairs. If the mutation stage is also saturated, reduce application write rate or add capacity.
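
To watch one pool instead of scanning the whole table, the pending count can be pulled out of the tpstats output. This is a minimal sketch that assumes the usual column order (pool name, Active, Pending, ...); `stage_pending` is a hypothetical helper.

```shell
# Hypothetical helper: print the Pending column for one thread pool.
# Assumption: nodetool tpstats rows are "<Pool Name> <Active> <Pending> ...".
stage_pending() {
  awk -v stage="$1" '$1 == stage { print $3 }'
}

# Usage:
#   nodetool tpstats | stage_pending GossipStage
#   nodetool tpstats | stage_pending MutationStage
```

Sampling this every few seconds for a minute shows whether the backlog is sustained or just a momentary spike.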

Network and clock issues

When node views diverge, verify that peers are reachable on the inter-node storage port (7000 by default, or 7001 when inter-node TLS is enabled) and that the local process is listening on it. Ensure NTP is synchronized on every node. Cassandra validates gossip generation timestamps, so significant clock skew can cause heartbeat rejection and inconsistent state.
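
Before testing ports, it helps to know exactly which peers the two views disagree about. The sketch below compares two saved `nodetool status` outputs, assuming the standard layout where each node row starts with a two-letter state (UN, DN, ...) followed by the address. `diff_views` and the file names are hypothetical.

```shell
# Hypothetical helper: list peers whose state differs between two saved views.
# Assumption: node rows look like "UN  10.0.0.1  ..." in nodetool status output.
# Output: "<address> <state in first view> <state in second view>".
diff_views() {
  awk '$1 ~ /^[UD][NLJM]$/ {
         if (FNR == NR) a[$2] = $1
         else if (($2 in a) && a[$2] != $1) print $2, a[$2], $1
       }' "$1" "$2"
}

# Usage (after saving "nodetool status" from two different nodes):
#   diff_views status-node1.txt status-node2.txt
```

An empty result means both nodes agree, which points away from a partition and back toward GC or local saturation.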

File descriptor exhaustion

If open FDs exceed 80% of the ulimit, increase the limit to at least 100,000 and restart the Cassandra process.

Disruptive: Restarting the process drops all connections and triggers hints.

Then address the root cause, which is usually compaction debt creating too many SSTables. Each SSTable opens multiple files, so a high SSTable count directly drives FD growth.
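
The 80% threshold can be checked with a few lines of shell on Linux. This sketch reads the soft "Max open files" limit from /proc and compares it with the number of open descriptors; `fd_usage_pct` is a hypothetical helper, and reading another user's /proc/<pid>/fd usually requires running as that user or as root.

```shell
# Hypothetical helper: print open file descriptors as a percentage of the soft limit.
# Assumption: Linux /proc layout; the fourth field of the "Max open files" row
# in /proc/<pid>/limits is the soft limit.
fd_usage_pct() {
  pid=$1
  open=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
  awk -v o="$open" -v l="$limit" 'BEGIN { printf "%d\n", 100 * o / l }'
}

# Usage:
#   fd_usage_pct "$(pgrep -f CassandraDaemon)"
```

A value above 80 means the node is inside the warning band described above.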

Prevention

  • Alert on heap usage after full GC, not peak heap. A rising post-GC floor predicts flapping days before it happens.
  • Use sustained-duration filters on DOWN alerts. Require five minutes of continuous DOWN state before paging to exclude transient GC events.
  • Synchronize NTP across all nodes and monitor clock skew.
  • Maintain file descriptor headroom below 80% of ulimit.
  • Separate commitlog and data devices so I/O saturation on one path does not stall the other.
  • Track schema agreement after rolling restarts or topology changes.
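
The sustained-duration filter for DOWN alerts can be sketched as follows. The input format is an assumption made for illustration: one sample per line, `<epoch_seconds> <state>`, oldest first, with the state taken from nodetool status for the node in question. `should_page` is a hypothetical helper, and the default hold time is the five minutes recommended above.

```shell
# Hypothetical helper: decide whether a node has been DOWN long enough to page.
# Input on stdin: "<epoch_seconds> <state>" samples, oldest first.
# Prints "page" only if the trailing run of DN samples spans >= hold seconds.
should_page() {
  awk -v hold="${1:-300}" '
    $2 == "DN" { if (!down) { down = 1; start = $1 } last = $1; next }
    { down = 0 }
    END {
      r = (down && last - start >= hold) ? "page" : "hold"
      print r
    }'
}

# Usage:
#   should_page 300 < node-10.0.0.5-samples.txt
```

Any non-DN sample resets the timer, so a node that flaps every minute never pages through this rule and is caught by the transition-count alert instead.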

How Netdata helps

Netdata surfaces the signals above without manual JMX polling:

  • GC pause duration and node liveness transitions on one timeline, so you can confirm a pause breached the phi threshold without cross-referencing logs.
  • Thread pool backlog per stage, including GossipStage.
  • Heap usage after GC with relationship-based thresholds instead of noisy peak-heap gauges.
  • Per-node disk I/O await and file descriptor usage to rule out infrastructure causes before blaming the network.
  • Schema version counts and dropped message rates in real time.

Related guides

  • Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS: /guides/cassandra/cassandra-choosing-compaction-strategy/
  • Cassandra compaction death spiral: when writes outrun compaction throughput: /guides/cassandra/cassandra-compaction-death-spiral/
  • Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM: /guides/cassandra/cassandra-consistency-levels-explained/
  • Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones: /guides/cassandra/cassandra-data-resurrection-gc-grace/
  • Cassandra disk space exhaustion: emergency recovery when the data volume fills: /guides/cassandra/cassandra-disk-space-exhaustion/
  • Cassandra dropped mutations: silent write loss and load shedding: /guides/cassandra/cassandra-dropped-mutations/
  • Cassandra dropped reads and other messages: reading nodetool tpstats Dropped: /guides/cassandra/cassandra-dropped-reads-and-messages/
  • Cassandra GC death spiral: long pauses, gossip flapping, and recovery: /guides/cassandra/cassandra-gc-death-spiral/
  • Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses: /guides/cassandra/cassandra-gc-pauses-too-long/
  • Cassandra heap pressure: sizing the JVM heap and tuning G1GC: /guides/cassandra/cassandra-heap-pressure-tuning/
  • Cassandra monitoring checklist: the signals every production cluster needs: /guides/cassandra/cassandra-monitoring-checklist/
  • Cassandra monitoring maturity model: from survival to expert: /guides/cassandra/cassandra-monitoring-maturity-model/