Cassandra repair overload: when anti-entropy repair causes the outage it prevents
A full nodetool repair started during peak traffic can spike P99 read latency from milliseconds to hundreds of milliseconds, trigger write timeouts, and cause nodes to flap between UP and DOWN in gossip. The repair job meant to prevent inconsistency becomes the cause of the outage.
Anti-entropy repair is a heavy distributed scan, not a background task. It reads all local data to build Merkle trees, exchanges hashes with replicas, and streams differing ranges. On a multi-terabyte node this means terabytes of sequential disk reads, heavy CPU hashing, and gigabits of network traffic. Without dedicated headroom, repair competes with the commitlog, memtable flushes, compaction, and client requests for disk bandwidth, CPU, and network. The result is thread pool backpressure, dropped messages, GC pressure from Merkle tree construction, and eventually gossip failure as the node becomes unresponsive.
The symptoms look like cascading failure, but they resolve when the repair load is removed.
flowchart TD
A[Full repair starts] --> B[Merkle tree construction reads all local data]
B --> C{Disk or network saturated?}
C -->|Yes| D[Foreground reads and writes starved]
D --> E[Thread pools back up]
E --> F[Dropped messages and timeouts]
F --> G[GC pressure from queued requests]
G --> H[Gossip failure node marked DOWN]
H --> I[Client unavailable exceptions]
C -->|No| J[Repair completes normally]
What this means
Repair overload is resource contention, not hardware failure. Cassandra’s anti-entropy process treats the node’s entire dataset as a checksum source. Full repair generates I/O and network load comparable to a major compaction combined with a bootstrap stream. Every local SSTable is read to construct a Merkle tree, which is then exchanged with replicas and differenced. Missing data is streamed over the network. If the node lacks headroom, repair starves foreground reads and writes. The node is healthy but saturated.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Full repair during peak traffic | P99 latency and timeout rates rise within minutes of repair start; dropped mutations appear | nodetool netstats |
| Unthrottled streaming on dense nodes | Disk %util near 100%, commitlog pending tasks > 0, mutation stage pending sustained | iostat -x 1 and nodetool tpstats |
| Repair colliding with bootstrap or decommission | Network bandwidth saturated, streaming sessions from multiple sources, node marked DOWN | nodetool netstats and nodetool status |
| Anti-compaction backlog after repair | Compaction pending spikes after repair completes, SSTable count grows, read latency stays elevated | nodetool compactionstats |
Quick checks
Run these safe, read-only commands to confirm whether repair is the source of saturation.
# Check active repair and streaming sessions
nodetool netstats
# Check dropped messages and thread pool saturation
nodetool tpstats
# Check compaction backlog including anti-compaction
nodetool compactionstats
# Check disk I/O saturation on data and commitlog devices
iostat -x 1
# Check coordinator latency percentiles
nodetool proxyhistograms
# Check JVM heap usage
nodetool info | grep -i "Heap Memory"
# Check node liveness and schema agreement
nodetool status
nodetool describecluster
How to diagnose it
- Confirm repair is running. Run nodetool netstats to look for active streaming sessions labeled with repair ranges. On Cassandra 4.0+, run nodetool repair_admin list to see active repair sessions and their token ranges.
- Correlate the timeline. Compare the repair start time against the onset of client timeout and unavailable metrics. If the latency spike begins within minutes of repair initiation, the correlation is strong.
- Check for thread pool saturation. Run nodetool tpstats and look at MutationStage, ReadStage, and Native-Transport-Requests. Sustained Pending > 0 means requests are queuing. Blocked > 0 means the submitting thread is being backpressure-blocked because the queue is full.
- Inspect disk I/O. Run iostat -x 1 on the data directory device and the commitlog device. If %util stays above 80% or await is elevated beyond baseline, the disk is saturated. If commitlog and data share the same device, repair reads directly contend with commitlog writes.
- Check for dropped messages. The Dropped counters in nodetool tpstats are cumulative since node start, so run the command twice and compare. If the MUTATION or READ count is increasing, the node is shedding load. Dropped mutations risk replica inconsistency; the write may have succeeded on some replicas before the coordinator timed out.
- Evaluate compaction state. Run nodetool compactionstats. If pending tasks are growing while repair is active, the node cannot keep up with both anti-compaction and normal compaction.
- Check JVM pressure. Run nodetool info for heap usage. If heap usage is high, check GC logs for pause duration. Pauses > 500ms degrade latency. Pauses > 2 seconds risk gossip failure and nodes being marked DOWN.
- Determine repair scope. Check whether the running repair is full or incremental. Full repair reads the entire dataset; incremental repair (4.0+) processes only unrepaired data and is significantly lighter. If you see anti-compaction creating a large number of new SSTables immediately after repair, expect a secondary compaction wave that can prolong the overload.
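The thread pool check above can be scripted. The following is a minimal sketch, assuming the tpstats pool table uses the column order Pool Name, Active, Pending, Completed, Blocked, All time blocked; verify this against your Cassandra version before relying on it. The function name flag_saturated_pools is hypothetical.

```shell
# Print every thread pool that has queued (Pending) or blocked tasks.
# Reads `nodetool tpstats` output on stdin.
# Assumed columns: Pool Name, Active, Pending, Completed, Blocked, All time blocked
flag_saturated_pools() {
  awk '
    /^Pool Name/     { in_pools = 1; next }   # header of the pool table
    /^Message type/  { in_pools = 0 }         # dropped-message table starts
    /^[[:space:]]*$/ { in_pools = 0 }         # blank line ends the pool table
    in_pools && NF >= 6 && ($3 > 0 || $5 > 0) {
      printf "%s pending=%s blocked=%s\n", $1, $3, $5
    }
  '
}

# Usage on a live node:
#   nodetool tpstats | flag_saturated_pools
```

A single snapshot only shows queuing at that instant. Run it several times a few seconds apart to tell a sustained backlog from a momentary one.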
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Client request latency (coordinator P99) | Direct measure of user-visible degradation | P99 > 3x rolling 1-hour baseline sustained > 5 min |
| Dropped messages (MUTATION/READ) | Node is overloaded and shedding load | Non-zero drop rate sustained > 60 seconds |
| Thread pool pending tasks (MutationStage/ReadStage) | Backpressure before messages are dropped | Pending > 0 sustained > 60 seconds |
| Disk I/O utilization (%util and await) | Repair saturates disk bandwidth | %util > 80% or await > 10ms on SSD sustained |
| Active repair streaming sessions | Repair streams differences between replicas | Streaming sessions coinciding with latency spikes |
| Compaction pending tasks | Anti-compaction adds to background debt | Pending trending upward during or after repair |
| GC pause duration | Merkle trees and streaming pressure heap | Pause > 500ms; > 2s risks gossip failure |
| Node liveness (gossip state) | Extreme overload causes phi accrual failure | Node marked DOWN or flapping > 2 transitions in 10 min |
| Repair completion status | Partial repairs create false safety | Repair duration exceeds expected window without completion |
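As an illustration of the disk utilization threshold in the table, the sketch below filters iostat -x output for devices above a %util limit. It assumes %util is the last column of each device row, which holds for the sysstat versions commonly shipped with Linux but should be checked locally. The function name flag_busy_devices is hypothetical.

```shell
# Print devices whose %util exceeds a threshold (default 80).
# Reads `iostat -x` output on stdin; assumes %util is the last column.
flag_busy_devices() {
  awk -v limit="${1:-80}" '
    /^Device/        { in_dev = 1; next }   # start of a device report
    /^[[:space:]]*$/ { in_dev = 0; next }   # blank line ends the report
    in_dev && ($NF + 0) > (limit + 0) { printf "%s util=%s\n", $1, $NF }
  '
}

# Usage: three one-second samples, flag anything above 80%
#   iostat -x 1 3 | flag_busy_devices 80
```

The first report that iostat prints is an average since boot, so judge saturation from the later samples.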
Fixes
Throttle or relocate the repair load
If repair is causing acute client impact, cap outbound streaming bandwidth. Run nodetool setstreamthroughput <value> (the value is in megabits per second) to reduce it dynamically; the change does not survive a restart. For a persistent change, lower stream_throughput_outbound_megabits_per_sec in cassandra.yaml (named stream_throughput_outbound from 4.1 onward) and perform a rolling restart. If you use Cassandra Reaper, configure more conservative per-segment throughput. If compaction is contending for disk bandwidth, temporarily lower its throughput cap with nodetool setcompactionthroughput <mb_per_sec>. This throttles background compaction to favor foreground traffic, at the cost of slower compaction catch-up. Move full repairs to off-peak windows.
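A minimal sketch of the throttling steps during an incident. It records the current caps before changing them so they can be restored afterwards. The run and throttle_repair_load helpers and the example values (100 megabits per second for streaming, 8 MB/s for compaction) are illustrative assumptions, not recommendations. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
# Print the command when DRY_RUN=1 (default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# $1: outbound streaming cap in megabits per second
# $2: compaction cap in megabytes per second
throttle_repair_load() {
  stream_mbits="$1"
  compaction_mb="$2"
  run nodetool getstreamthroughput        # note the current values first
  run nodetool getcompactionthroughput
  run nodetool setstreamthroughput "$stream_mbits"
  run nodetool setcompactionthroughput "$compaction_mb"
}

# Example (dry run): throttle_repair_load 100 8
# Example (apply):   DRY_RUN=0 throttle_repair_load 100 8
```

Restore the original values once client latency has recovered, otherwise the repair and the compaction backlog will take longer to finish.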
Switch to incremental repair on Cassandra 4.0+
If you are running full repairs on 4.0 or later, migrate to incremental repair. Incremental repair tracks repaired SSTables via metadata and processes only unrepaired data written since the last cycle. It is the default mode of nodetool repair (a full repair requires the -full flag) and is significantly lighter than full repair. The tradeoff is that anti-compaction still creates additional SSTables, so monitor compaction pending after each run. Do not use incremental repair on versions earlier than 4.0; pre-4.0 incremental repair had bugs that could cause data corruption.
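One way to watch the repaired/unrepaired split is the Percent repaired field that nodetool tablestats reports per table. The sketch below lists tables under a threshold. It assumes the output contains lines of the form "Table: <name>" followed later by "Percent repaired: <value>"; the function name low_repaired_tables, the keyspace name, and the 90% threshold are hypothetical.

```shell
# List tables whose "Percent repaired" is below a threshold (default 90).
# Reads `nodetool tablestats` output on stdin.
low_repaired_tables() {
  awk -v min="${1:-90}" '
    $1 == "Table:" { table = $2 }   # remember the table being described
    $1 == "Percent" && $2 == "repaired:" && ($3 + 0) < (min + 0) {
      printf "%s %s\n", table, $3
    }
  '
}

# Usage:
#   nodetool tablestats my_keyspace | low_repaired_tables 90
```

A table that stays at a low percentage across several incremental cycles suggests the repairs are not completing for its ranges.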
Use subrange repair with Reaper
Instead of a single full-range repair per node, use Reaper to orchestrate subrange repair. This divides a node’s token range into smaller segments, limiting per-session memory overhead and isolating failures to a single segment. If one segment fails, Reaper retries it without re-scanning the entire node. Subrange repair is more efficient for large clusters, and Reaper provides per-segment success visibility that nodetool repair alone does not. The tradeoff is longer total repair duration, but each segment imposes a smaller peak load and can be scheduled independently.
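To make the segmentation idea concrete, here is a simplified sketch that splits one token range into equal-width segments and prints a nodetool repair command for each, using the -st and -et (start and end token) options. It is not how Reaper works internally. It assumes the range does not wrap around the ring and that its width fits in a signed 64-bit integer. The function name subrange_commands, the keyspace, and the token values are hypothetical.

```shell
# Print one repair command per segment of the token range (start, end].
# Assumes start < end and that (end - start) fits in a signed 64-bit integer.
subrange_commands() {
  keyspace="$1"; start="$2"; end="$3"; segments="$4"
  width=$(( (end - start) / segments ))
  [ "$width" -gt 0 ] || return 1           # range too small to split this far
  i=0
  seg_start="$start"
  while [ "$i" -lt "$segments" ]; do
    if [ "$i" -eq $(( segments - 1 )) ]; then
      seg_end="$end"                       # last segment absorbs the remainder
    else
      seg_end=$(( seg_start + width ))
    fi
    printf 'nodetool repair -full -st %s -et %s %s\n' \
      "$seg_start" "$seg_end" "$keyspace"
    seg_start="$seg_end"
    i=$(( i + 1 ))
  done
}

# Example: subrange_commands my_keyspace -9000 9000 4
```

Printing the commands instead of running them keeps the sketch safe to try. A scheduler would run them one at a time, pause between segments, and retry only the segment that failed.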
Separate commitlog and data directories
If commitlog and data directories share a physical device, repair reads on the data directory contend directly with commitlog writes. Separate them onto dedicated volumes. This is not an immediate fix during an incident, but it eliminates a major contention path.
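A quick way to check for a shared volume is to compare the filesystem that backs each directory. This is a rough sketch using df -P. The function name same_device is hypothetical, and the paths in the usage comment are common package defaults that may differ from the directories set in your cassandra.yaml.

```shell
# Succeed (exit 0) when both paths are backed by the same filesystem.
same_device() {
  dev_a=$(df -P "$1" | awk 'NR == 2 { print $1 }')
  dev_b=$(df -P "$2" | awk 'NR == 2 { print $1 }')
  [ -n "$dev_a" ] && [ "$dev_a" = "$dev_b" ]
}

# Usage:
#   if same_device /var/lib/cassandra/data /var/lib/cassandra/commitlog; then
#     echo "commitlog and data share a filesystem"
#   fi
```

Different filesystems can still sit on the same physical disk (separate partitions or logical volumes), so a negative result should be confirmed with a tool such as lsblk.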
Prevention
- Schedule repairs during low-traffic windows. Never run full repairs during peak traffic. Full repair generates I/O and network traffic comparable to a major compaction combined with a bootstrap stream. For multi-DC clusters, schedule repairs sequentially by datacenter to avoid cross-DC streaming load.
- Prefer incremental repair on Cassandra 4.0+. This bounds repair cost to recent write volume rather than total dataset size. Verify that incremental repair completes successfully; partial runs can leave unrepaired SSTables that accumulate debt.
- Use Reaper with subrange segmentation and conservative throughput limits. This avoids monolithic repair sessions and spreads load over time. Reaper also provides per-segment success and failure visibility that nodetool repair lacks.
- Monitor repair completion, not just start. Repairs can silently fail to complete all token ranges. Alert when repair duration exceeds the expected window or when the time since the last successful repair approaches 80% of gc_grace_seconds. A repair that starts but does not finish every range is worse than no repair because it creates a false sense of safety.
- Maintain disk I/O and capacity headroom. Keep sustained disk utilization (%util) below 70% to absorb repair bursts without starving foreground traffic. Separately, keep free space in reserve: major compaction can transiently need up to 100% additional disk space, and a nearly full volume risks space exhaustion during background operations.
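The 80% rule for gc_grace_seconds is simple arithmetic. With the default gc_grace_seconds of 864000 (10 days), the warning point is 691200 seconds, or 8 days since the last completed repair. The sketch below encodes that check. Where the last-repair timestamp comes from (Reaper, a scheduler log, your own bookkeeping) is left as an assumption, and the function name repair_deadline_status is hypothetical.

```shell
# Classify the age of the last completed repair against gc_grace_seconds.
# $1: epoch seconds of the last completed repair
# $2: current epoch seconds
# $3: gc_grace_seconds for the table (default 864000, i.e. 10 days)
repair_deadline_status() {
  last_repair_epoch="$1"; now_epoch="$2"; gc_grace="${3:-864000}"
  age=$(( now_epoch - last_repair_epoch ))
  warn_at=$(( gc_grace * 80 / 100 ))       # 80% of the grace period
  if [ "$age" -ge "$gc_grace" ]; then
    echo "CRITICAL age=${age}s"
  elif [ "$age" -ge "$warn_at" ]; then
    echo "WARN age=${age}s"
  else
    echo "OK age=${age}s"
  fi
}

# Usage:
#   repair_deadline_status "$last_repair_epoch" "$(date +%s)" 864000
```

Use the smallest gc_grace_seconds among the repaired tables, since the deadline is per table.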
How Netdata helps
- Correlate per-device disk I/O utilization (%util, await) with active repair streaming to identify saturation immediately.
- Track JVM GC pause duration and heap usage to catch pressure from Merkle tree construction before gossip fails.
- Monitor thread pool pending tasks for MutationStage, ReadStage, and Native-Transport-Requests to detect backpressure before messages are dropped.
- Alert on dropped mutation and read rates, which are the first signals of repair-induced overload.
- Overlay node gossip state (UP/DOWN transitions) with latency metrics to distinguish repair saturation from true hardware failures.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra clock skew: how NTP drift silently corrupts data
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra gossip flapping: nodes bouncing UP and DOWN
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC







