Cassandra repair overload: when anti-entropy repair causes the outage it prevents

A full nodetool repair started during peak traffic can spike P99 read latency from milliseconds to hundreds of milliseconds, trigger write timeouts, and cause nodes to flap between UP and DOWN in gossip. The repair job meant to prevent inconsistency becomes the cause of the outage.

Anti-entropy repair is a heavy distributed scan, not a background task. It reads all local data to build Merkle trees, exchanges hashes with replicas, and streams differing ranges. On a multi-terabyte node this means terabytes of sequential disk reads, heavy CPU hashing, and gigabits of network traffic. Without dedicated headroom, repair competes with the commitlog, memtable flushes, compaction, and client requests for disk bandwidth, CPU, and network. The result is thread pool backpressure, dropped messages, GC pressure from Merkle tree construction, and eventually gossip failure as the node becomes unresponsive.

The symptoms look like cascading failure, but they resolve when the repair load is removed.

flowchart TD
    A[Full repair starts] --> B[Merkle tree construction reads all local data]
    B --> C{Disk or network saturated?}
    C -->|Yes| D[Foreground reads and writes starved]
    D --> E[Thread pools back up]
    E --> F[Dropped messages and timeouts]
    F --> G[GC pressure from queued requests]
    G --> H[Gossip failure node marked DOWN]
    H --> I[Client unavailable exceptions]
    C -->|No| J[Repair completes normally]

What this means

Repair overload is resource contention, not hardware failure. Cassandra’s anti-entropy process treats the node’s entire dataset as a checksum source. Full repair generates I/O and network load comparable to a major compaction combined with a bootstrap stream. Every local SSTable is read to construct a Merkle tree, which is then exchanged with replicas and differenced. Missing data is streamed over the network. If the node lacks headroom, repair starves foreground reads and writes. The node is healthy but saturated.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Full repair during peak traffic | P99 latency and timeout rates rise within minutes of repair start; dropped mutations appear | nodetool netstats |
| Unthrottled streaming on dense nodes | Disk %util near 100%, commitlog pending tasks > 0, mutation stage pending sustained | iostat -x 1 and nodetool tpstats |
| Repair colliding with bootstrap or decommission | Network bandwidth saturated, streaming sessions from multiple sources, node marked DOWN | nodetool netstats and nodetool status |
| Anti-compaction backlog after repair | Compaction pending spikes after repair completes, SSTable count grows, read latency stays elevated | nodetool compactionstats |

Quick checks

Run these safe, read-only commands to confirm whether repair is the source of saturation.

# Check active repair and streaming sessions
nodetool netstats

# Check dropped messages and thread pool saturation
nodetool tpstats

# Check compaction backlog including anti-compaction
nodetool compactionstats

# Check disk I/O saturation on data and commitlog devices
iostat -x 1

# Check coordinator latency percentiles
nodetool proxyhistograms

# Check JVM heap usage
nodetool info | grep -i "Heap Memory"

# Check node liveness and schema agreement
nodetool status
nodetool describecluster

How to diagnose it

  1. Confirm repair is running. Run nodetool netstats to look for active streaming sessions labeled with repair ranges. On Cassandra 4.0+, run nodetool repair_admin list to see active repair sessions and their token ranges.
  2. Correlate the timeline. Compare the repair start time against the onset of client timeout and unavailable metrics. If the latency spike begins within minutes of repair initiation, the correlation is strong.
  3. Check for thread pool saturation. Run nodetool tpstats and look at MutationStage, ReadStage, and Native-Transport-Requests. Sustained Pending > 0 means requests are queuing. Blocked > 0 means the submitting thread is being backpressure-blocked because the queue is full.
  4. Inspect disk I/O. Run iostat -x 1 on the data directory device and the commitlog device. If %util is > 80% sustained or await is elevated beyond baseline, the disk is saturated. If commitlog and data share the same device, repair reads directly contend with commitlog writes.
  5. Check for dropped messages. In nodetool tpstats, any non-zero rate of dropped MUTATION or READ messages means the node is shedding load. Dropped mutations risk replica inconsistency; the write may have succeeded on some replicas before the coordinator timed out.
  6. Evaluate compaction state. Run nodetool compactionstats. If pending tasks are growing while repair is active, the node cannot keep up with both anti-compaction and normal compaction.
  7. Check JVM pressure. Run nodetool info for heap usage. If heap usage is high, check GC logs for pause duration. Pauses > 500ms degrade latency. Pauses > 2 seconds risk gossip failure and nodes being marked DOWN.
  8. Determine repair scope. Check whether the running repair is full or incremental. Full repair reads the entire dataset; incremental repair (4.0+) processes only unrepaired data and is significantly lighter. If you see anti-compaction creating a large number of new SSTables immediately after repair, expect a secondary compaction wave that can prolong the overload.
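Steps 3 and 5 can be partly automated. The following is a minimal sketch, not a definitive check: `tpstats_backlog` is a hypothetical helper name, and it assumes the usual `nodetool tpstats` column order (pool name, Active, Pending, Completed, Blocked for thread pools; message type, Dropped for dropped messages). Dropped counts are cumulative since node start, so compare two samples to get a rate.

```shell
# Reads `nodetool tpstats` output on stdin and prints only the rows that
# indicate backpressure or load shedding. Assumed column layout:
#   thread pools:     name  Active  Pending  Completed  Blocked  ...
#   dropped messages: type  Dropped ...
tpstats_backlog() {
  awk '
    # Thread pools named in the diagnosis steps: flag Pending or Blocked > 0.
    $1 ~ /^(MutationStage|ReadStage|Native-Transport-Requests)$/ {
      if ($3 + 0 > 0 || $5 + 0 > 0)
        printf "POOL %s pending=%s blocked=%s\n", $1, $3, $5
    }
    # Dropped message rows; newer versions suffix the type with _REQ.
    $1 ~ /^(MUTATION|READ)(_REQ)?$/ && $2 + 0 > 0 {
      printf "DROPPED %s count=%s\n", $1, $2
    }
  '
}
```

Typical use would be `nodetool tpstats | tpstats_backlog`, run twice a minute apart; no output means none of the watched pools or message types show pressure.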

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Client request latency (coordinator P99) | Direct measure of user-visible degradation | P99 > 3x rolling 1-hour baseline sustained > 5 min |
| Dropped messages (MUTATION/READ) | Node is overloaded and shedding load | Any sustained non-zero rate > 60 seconds |
| Thread pool pending tasks (MutationStage/ReadStage) | Backpressure before messages are dropped | Pending > 0 sustained > 60 seconds |
| Disk I/O utilization (%util and await) | Repair saturates disk bandwidth | %util > 80% or await > 10ms on SSD sustained |
| Active repair streaming sessions | Repair streams differences between replicas | Streaming sessions coinciding with latency spikes |
| Compaction pending tasks | Anti-compaction adds to background debt | Pending trending upward during or after repair |
| GC pause duration | Merkle trees and streaming pressure heap | Pause > 500ms; > 2s risks gossip failure |
| Node liveness (gossip state) | Extreme overload causes phi accrual failure | Node marked DOWN or flapping > 2 transitions in 10 min |
| Repair completion status | Partial repairs create false safety | Repair duration exceeds expected window without completion |
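The "sustained" qualifier in these warning signs matters: a single spike is not an alert. As a rough sketch of the latency rule (P99 above 3x baseline for 5 minutes), the hypothetical helper below reads one P99 sample per line, assumed to be taken once per minute from whatever metrics source you use, and succeeds only if the most recent samples all exceed the threshold.

```shell
# Usage: p99_sustained_breach <baseline_p99> [consecutive_samples]
# Reads P99 samples (one per line, oldest first) on stdin.
# Exit status 0 = the last N samples were all above 3x baseline.
p99_sustained_breach() {
  awk -v base="$1" -v need="${2:-5}" '
    # Count the current run of consecutive samples above 3x baseline;
    # any sample at or below the threshold resets the run.
    { if ($1 + 0 > 3 * base) run++; else run = 0 }
    END { exit (run >= need) ? 0 : 1 }
  '
}
```

With one-minute samples, the default of 5 consecutive breaches approximates the 5-minute window; adjust the count if your sampling interval differs.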

Fixes

Throttle or relocate the repair load

If repair is causing acute client impact, cap outbound streaming bandwidth first. Run nodetool setstreamthroughput <value> (megabits per second by default) to lower the cap at runtime; the change does not survive a restart. For a persistent change, lower stream_throughput_outbound_megabits_per_sec in cassandra.yaml (renamed stream_throughput_outbound in 4.1) and perform a rolling restart. If you use Cassandra Reaper, configure it more conservatively, for example with a lower repair intensity. If compaction is contending for disk bandwidth, temporarily lower its throughput cap with nodetool setcompactionthroughput <mb_per_sec>. This throttles background compaction to favor foreground traffic, at the cost of slower compaction catch-up. Longer term, move full repairs to off-peak windows.
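During an incident it helps to apply both caps in one step. The wrapper below is a sketch under stated assumptions: the function name and the `NODETOOL` override variable are hypothetical (the override exists only so the commands can be dry-run with `echo`), and the values you pass are yours to choose; nothing here implies a recommended number.

```shell
# Allow a dry run by overriding the binary, e.g. NODETOOL=echo.
NODETOOL="${NODETOOL:-nodetool}"

# Usage: throttle_repair_load <stream_megabits_per_sec> <compaction_mb_per_sec>
# Applies both runtime caps on the local node; neither persists across restarts.
throttle_repair_load() {
  stream_mbps="$1"
  compaction_mbs="$2"
  "$NODETOOL" setstreamthroughput "$stream_mbps" &&
    "$NODETOOL" setcompactionthroughput "$compaction_mbs"
}
```

For example, `throttle_repair_load 50 8` would cap streaming at 50 megabits per second and compaction at 8 MB/s on the node where it runs. Record the previous values (`nodetool getstreamthroughput`, `nodetool getcompactionthroughput`) so you can restore them after the repair finishes.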

Switch to incremental repair on Cassandra 4.0+

If you are running full repairs on 4.0 or later, migrate to incremental repair. Incremental repair tracks repaired SSTables via metadata and processes only unrepaired data written since the last cycle. It is what nodetool repair runs unless you pass -full, and it is significantly lighter than full repair. The tradeoff is that anti-compaction still creates additional SSTables, so monitor compaction pending after each run. Do not use incremental repair on versions earlier than 4.0: although it has been the nodetool default since 2.2, pre-4.0 incremental repair had bugs that could leave repaired state inconsistent across replicas.
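The version rule above can be encoded so that automation never runs incremental repair on an older node by accident. This is a minimal sketch: `repair_command_for_version` is a hypothetical helper that only prints the command it would choose, and it assumes a plain dotted version string such as the second field of `nodetool version` output.

```shell
# Usage: repair_command_for_version <cassandra_version> <keyspace>
# Prints the repair command matching the guidance above:
# incremental (the nodetool default) on 4.0+, explicit -full before that.
repair_command_for_version() {
  major=${1%%.*}   # "4.0.1" -> "4", "3.11.10" -> "3"
  if [ "$major" -ge 4 ]; then
    echo "nodetool repair $2"
  else
    echo "nodetool repair -full $2"
  fi
}
```

A caller might feed it with `repair_command_for_version "$(nodetool version | awk '{print $2}')" my_keyspace`, where `my_keyspace` is a placeholder.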

Use subrange repair with Reaper

Instead of a single full-range repair per node, use Reaper to orchestrate subrange repair. This divides a node’s token range into smaller segments, limiting per-session memory overhead and isolating failures to a single segment. If one segment fails, Reaper retries it without re-scanning the entire node. Subrange repair is more efficient for large clusters, and Reaper provides per-segment success visibility that nodetool repair alone does not. The tradeoff is longer total repair duration, but each segment imposes a smaller peak load and can be scheduled independently.
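To make the segmentation idea concrete, the sketch below splits one token range into equal segments and prints a `nodetool repair -st ... -et ...` command for each. It is an illustration of what an orchestrator does, not Reaper's actual algorithm: the function name is hypothetical, it assumes the range does not wrap around the ring, and it assumes the range width fits in signed 64-bit shell arithmetic (so it cannot split the whole Murmur3 ring in one call). It only prints commands; it does not run them.

```shell
# Usage: split_token_range <start_token> <end_token> <segments> <keyspace>
# Prints one subrange repair command per segment.
split_token_range() {
  start=$1; end=$2; n=$3; keyspace=$4
  width=$(( end - start ))
  # Rejects reversed ranges, ranges narrower than the segment count,
  # and widths that overflowed to a negative number.
  if [ "$n" -lt 1 ] || [ "$width" -lt "$n" ]; then
    echo "range is reversed, too small, or too wide to split" >&2
    return 1
  fi
  step=$(( width / n ))
  cur=$start
  i=1
  while [ "$i" -le "$n" ]; do
    # The last segment absorbs the rounding remainder and ends exactly at <end>.
    if [ "$i" -eq "$n" ]; then next=$end; else next=$(( cur + step )); fi
    echo "nodetool repair -full -st $cur -et $next $keyspace"
    cur=$next
    i=$(( i + 1 ))
  done
}
```

Smaller segments mean each session builds a smaller Merkle tree and a failed segment costs less to retry, which is the tradeoff described above: more sessions in total, lower peak load per session.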

Separate commitlog and data directories

If commitlog and data directories share a physical device, repair reads on the data directory contend directly with commitlog writes. Separate them onto dedicated volumes. This is not an immediate fix during an incident, but it eliminates a major contention path.
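A quick way to check for this contention path is to compare the filesystem backing each directory. The helper below is a rough sketch with a hypothetical name; it compares the source device reported by `df -P`, so two partitions or logical volumes carved from the same physical disk will look separate even though they still share I/O.

```shell
# Usage: same_device <dir_a> <dir_b>
# Exit status 0 = both directories are on the same filesystem/device.
same_device() {
  # In `df -P` output, line 2 column 1 is the backing device or filesystem.
  dev_a=$(df -P "$1" | awk 'NR == 2 { print $1 }')
  dev_b=$(df -P "$2" | awk 'NR == 2 { print $1 }')
  [ -n "$dev_a" ] && [ "$dev_a" = "$dev_b" ]
}
```

For example, `same_device /var/lib/cassandra/data /var/lib/cassandra/commitlog && echo "shared device"`, where both paths are assumed package defaults; substitute the data_file_directories and commitlog_directory values from your cassandra.yaml.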

Prevention

  • Schedule repairs during low-traffic windows. Never run full repairs during peak traffic: a full repair generates I/O and network load comparable to a major compaction combined with a bootstrap stream. For multi-DC clusters, schedule repairs sequentially by datacenter to limit concurrent cross-DC streaming load.
  • Prefer incremental repair on Cassandra 4.0+. This bounds repair cost to recent write volume rather than total dataset size. Verify that incremental repair completes successfully; partial runs can leave unrepaired SSTables that accumulate debt.
  • Use Reaper with subrange segmentation and conservative throughput limits. This avoids monolithic repair sessions and spreads load over time. Reaper also provides per-segment success and failure visibility that nodetool repair lacks.
  • Monitor repair completion, not just start. Repairs can silently fail to complete all token ranges. Alert when repair duration exceeds the expected window or when the time since the last successful repair approaches 80% of gc_grace_seconds. A repair that starts but does not finish every range is worse than no repair because it creates a false sense of safety.
  • Maintain disk headroom, for both I/O and space. Keep sustained disk I/O utilization below 70% so repair bursts do not starve foreground traffic. Separately, keep enough free space for background operations: a major compaction can transiently need up to 100% additional disk space, and running close to full risks space exhaustion during repair and compaction.
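The repair-age alert from the list above reduces to simple arithmetic. The sketch below assumes you already record the completion time of the last fully successful repair as a Unix timestamp (how you obtain it is up to your tooling); the function name is hypothetical, and 864000 seconds (10 days) is the Cassandra default for gc_grace_seconds, which your tables may override.

```shell
# Usage: repair_age_status <last_repair_epoch> <now_epoch> [gc_grace_seconds]
# Prints OK, WARNING (age >= 80% of gc_grace_seconds), or CRITICAL (age >= 100%).
repair_age_status() {
  last=$1
  now=$2
  grace=${3:-864000}               # default gc_grace_seconds: 10 days
  age=$(( now - last ))
  warn=$(( grace * 80 / 100 ))     # the 80% alert threshold
  if [ "$age" -ge "$grace" ]; then
    echo "CRITICAL"
  elif [ "$age" -ge "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}
```

With the default, the warning fires once the last successful repair is 691200 seconds (8 days) old, leaving two days to complete a repair before tombstones become eligible for purging.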

How Netdata helps

  • Correlate per-device disk I/O utilization (%util, await) with active repair streaming to identify saturation immediately.
  • Track JVM GC pause duration and heap usage to catch pressure from Merkle tree construction before gossip fails.
  • Monitor thread pool pending tasks for MutationStage, ReadStage, and Native-Transport-Requests to detect backpressure before messages are dropped.
  • Alert on dropped mutation and read rates, which are the first signals of repair-induced overload.
  • Overlay node gossip state (UP/DOWN transitions) with latency metrics to distinguish repair saturation from true hardware failures.