Cassandra network partition and split brain: detection and reconciliation

A partial network partition does not always stop the cluster. Each isolated subset often continues to serve traffic, accept writes, and report itself healthy while marking the other side as DOWN. By the time you notice contradictory gossip views or resurrected data, the two sides have diverged for hours. A healed partition is not self-resolving. If both sides accepted writes, last-write-wins semantics combined with even modest clock skew can silently overwrite valid data. Cross-DC deployments are most vulnerable because WAN latency already strains the phi accrual failure detector and inter-DC gossip paths have more single points of failure.

Detecting a split brain requires correlating gossip state from multiple nodes, confirming schema disagreement, and checking whether writes have diverged beyond the hint window. Once detected, the only safe path to consistency is a full repair.

What this means

Cassandra uses a phi accrual failure detector with a default phi_convict_threshold of 8. At default settings, a node missing gossip heartbeats for roughly 18 seconds is marked DOWN by its peers. This decision is local to each node. During a network partition, the two halves form independent failure-detector views. Nodes on side A mark side B as DOWN, while side B does the same to side A. Gossip views do not converge until traffic resumes across the partition boundary.

While partitioned, both sides continue to handle reads and writes for the token ranges they believe they own. If your consistency level can be satisfied by the local side, clients see no errors. Writes destined for replicas on the far side are stored as hints locally, but only within max_hint_window_in_ms (default 3 hours). After that window expires, the coordinator stops storing hints, and writes that should have gone to the isolated replicas are permanently missed. When the network heals and the cluster reunites, Cassandra resolves conflicting mutations using wall-clock timestamps and last-write-wins semantics. If node clocks are not tightly synchronized, the write with the newer timestamp may not be the logically correct one. Data loss or reordering can occur without any error logged.

Because schema changes propagate through gossip, a prolonged partition can also leave the two sides on different schema versions. This blocks DDL and can cause query routing inconsistencies even after connectivity returns.

flowchart TD
    A[Network partition] --> B[Side A sees Side B as DOWN]
    A --> C[Side B sees Side A as DOWN]
    B --> D[Side A continues writes]
    C --> E[Side B continues writes]
    D --> F[Hints cover first 3h]
    E --> F
    F --> G[Partition heals]
    G --> H{Clock skew?}
    H -->|Yes| I[Last-write-wins overwrites valid data]
    H -->|No| J[Timestamp ordering resolves conflicts]
    G --> K[Full repair mandatory]
    K --> L[Reconcile divergent SSTables]

Common causes

CauseWhat it looks likeFirst thing to check
Cross-DC link degradation or cloud zone isolationNodes in one DC show all remote DC nodes as DOWN; local DC nodes are UP. Client writes at LOCAL_QUORUM succeed, but EACH_QUORUM fails.Inter-DC latency and packet loss from nodes on each side.
Firewall or security group change blocking internode gossipGossip on port 7000/7001 is filtered between subsets of nodes, but client port 9042 remains open. Nodes flip to DOWN in groups.nodetool status from nodes on both sides of the suspected boundary.
Asymmetric packet loss or routing failureOne direction of traffic is dropped. Some nodes see peers as DOWN while those peers still see them as UP, producing asymmetric flapping.nodetool gossipinfo and endpoint reachability from multiple vantage points.
Switch failure in a rack or AZAll nodes in one failure domain lose internode connectivity to the rest, forming a clean partition line.Network infrastructure logs and rack-level connectivity tests.

Quick checks

Run these read-only commands from nodes on each side of the suspected partition. Do not restart nodes or change topology during diagnosis; restarting can reset the local gossip view and complicate reconciliation.

# Compare cluster views from each side
nodetool status

# Inspect gossip state and endpoint liveness
nodetool gossipinfo

# Verify whether schema versions have diverged
nodetool describecluster

# Check for load shedding and backpressure
nodetool tpstats

# Measure hint accumulation on coordinators
du -sh /var/lib/cassandra/hints/

# Compare system clocks across nodes
date +%s

If nodetool status from one node shows peers as DOWN while those same peers show the first node as DOWN, you are looking at a partition, not a cascading node failure.

How to diagnose it

  1. Correlate gossip views. Collect nodetool status and nodetool gossipinfo from at least one node on each side of the suspected partition. In a true split brain, UP/DOWN assignments are contradictory: each side reports itself as UN and the other side as DN. Check nodetool describecluster; multiple schema versions confirm the cluster has formed isolated subgroups.

  2. Confirm client impact. Look for UnavailableExceptions in client metrics or JMX ClientRequest Unavailables if the consistency level requires replicas across the partition. If the application uses LOCAL_QUORUM and the partition aligns with a DC boundary, writes may appear healthy while silently diverging.

  3. Check the hint window. Inspect the hints directory size on coordinators (du -sh /var/lib/cassandra/hints/). If nodes have been partitioned longer than max_hint_window_in_ms (default 3 hours), hints are no longer being stored. Missed mutations from that point forward require full repair to recover.

  4. Measure clock skew. Compare system time across all nodes with date +%s or ntpstat. Even sub-second skew determines which writes survive last-write-wins reconciliation. If one side is running ahead, its writes will overwrite the other side’s data on heal, regardless of logical causality.

  5. Rule out GC-induced false partitions. Check JVM GC pause duration. Pauses longer than roughly 18 seconds at default phi settings trigger gossip failure detection. If the “partition” is actually a GC death spiral on individual nodes, the fix is heap tuning, not network repair. See the Cassandra GC death spiral guide.

  6. Assess write divergence. If both sides accepted writes for the same partition keys, the datasets have diverged. There is no automated merge on heal. Treat the cluster as inconsistent and plan a full repair.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Node liveness (DownEndpointCount , nodetool status)Each node’s view is local. Asymmetric DOWN counts suggest a partition rather than individual node failures.Same peers shown as DN from one node but UN from another, sustained for > 60 seconds.
Client request unavailables (ClientRequest Unavailables)Failures when the coordinator cannot reach enough replicas.Sustained unavailable rate > 0.1% of request rate, especially across DC boundaries.
Dropped messages (DroppedMessage MBean, nodetool tpstats)Coordinators drop mutations or reads when replicas are unreachable or queues expire.Non-zero sustained drop rate in MUTATION or READ while nodes appear UP locally.
Schema versions (SchemaVersions, nodetool describecluster)Schema propagates via gossip. Divergent schemas indicate isolated subgroups.More than one schema UUID reported for > 2 minutes outside of planned DDL.
Hinted handoff store sizeHints indicate replicas were unreachable. Growth beyond the hint window means silent inconsistency.Hints directory growing when all nodes should be UP, or large backlog after a partition.
GC pause durationLong pauses mimic network partitions by triggering gossip failure detection.Max pause > 18 seconds or sustained pauses > 2 seconds correlating with DOWN events.

Fixes

During an active partition

If the partition is active and both sides serve traffic, choose one side to keep online. Prefer the side with lower client latency, more recent repair history, or the smaller hint backlog. If possible, stop write traffic to one side to minimize divergence. Do not issue schema changes until the partition is resolved; they will propagate inconsistently.

On heal: mandatory full repair

Once connectivity is restored, do not assume the cluster is consistent. Gossip will converge, but divergent writes will not reconcile themselves.

  • Run a full repair. Execute nodetool repair across all affected keyspaces to run a full anti-entropy repair. Do not rely on incremental repair or read repair alone; they are not sufficient to reconcile SSTables that diverged during a split brain. Expect elevated disk I/O and CPU.
  • Repair within gc_grace_seconds. The default gc_grace_seconds is 10 days. If repair does not complete within this window, tombstones compacted on one side but not the other can lead to data resurrection.

Clock skew remediation

Before bringing a partitioned cluster back together, synchronize clocks across all nodes. If one side has drifted significantly, last-write-wins semantics will overwrite valid data during reconciliation. Ensure NTP is running and skew is held to sub-second levels.

When the hint window is exceeded

If the partition lasted longer than max_hint_window_in_ms (default 3 hours), mutations written after that window are permanently missing from the isolated replicas. After repair, audit application-level consistency if the lost window contained critical writes.

Prevention

  • Monitor per-node gossip views. Alert when nodetool status output differs across nodes for the same peer. A node that is DN from one vantage point but UN from another is a leading indicator of partition.
  • Tune failure detection for cross-DC. In multi-DC deployments, consider raising phi_convict_threshold to tolerate higher WAN latency and reduce false positives from transient inter-DC jitter. The default of 8 assumes LAN-like latencies.
  • Enforce NTP everywhere. Cassandra’s conflict resolution depends on synchronized clocks. Monitor clock skew as a first-class operational metric.
  • Schedule regular repair. A cluster that is already consistent has less to reconcile when a partition heals. Ensure repair completes well before gc_grace_seconds expires.
  • Separate consistency level strategy. Use LOCAL_QUORUM for routine operations to reduce cross-DC latency exposure, and reserve EACH_QUORUM for operations that truly require global majority acknowledgment.

How Netdata helps

  • Correlate node liveness across the fleet. Netdata collects FailureDetector and Gossiper JMX metrics from every node simultaneously, revealing asymmetric UP/DOWN views that single-node nodetool status misses.
  • Alert on unavailables and drops. The ClientRequest unavailable and timeout rates, plus DroppedMessage rates, surface client impact automatically without manual JMX polling.
  • Distinguish GC from network failure. By correlating JVM GC pause duration with gossip DOWN events, Netdata helps you determine whether a node was convicted due to a network partition or a local GC death spiral.
  • Track repair and hint status. Netdata monitors hinted handoff metrics and can surface when repair has not completed within the gc_grace_seconds window .