Cassandra repair failing or stuck: partial repairs and how to verify completion
A nodetool repair that returns to the prompt and one that hangs at 99 percent are both dangerous, because the worst outcome is silent: a partial repair. A partial repair anti-compacts some token ranges and skips others, leaving replicas inconsistent while the run looks complete. Because incremental repair marks SSTables as repaired during anti-compaction, a session that fails mid-range leaves later ranges unrepaired while earlier ones are already marked done. No built-in alert fires when only forty percent of ranges were covered. If a replica in an unrepaired range is missing a tombstone, the deleted data can resurrect once gc_grace_seconds passes and the tombstone is purged from the other replicas. This guide shows how to verify completion, diagnose stuck sessions, and recover without causing a cascading I/O incident.
What this means
Cassandra anti-entropy repair compares Merkle trees across replicas and streams missing data. In Cassandra 3.11 and 4.x, incremental repair is the default. It separates SSTables into repaired and unrepaired sets by setting a RepairedAt timestamp. Only unrepaired SSTables participate in subsequent incremental Merkle tree comparisons. This is efficient, but creates a hazard: if a repair session fails after some ranges are anti-compacted, those ranges look finished to the next incremental run while downstream ranges remain unsynced. No built-in alert fires for partial coverage. A partial repair is worse than no repair because it trains teams to ignore the problem until tombstones expire and deleted data reappears on unrepaired replicas.
flowchart TD
A[Repair command returns] --> B{Active streams?}
B -->|Yes, progressing| C[Slow I/O, monitor]
B -->|Yes, stalled| D[Stuck repair session]
B -->|No streams| E[Check repair history]
D --> F{Validation failed in logs?}
F -->|Yes| G[Oversized partition or schema issue]
F -->|No| H[Deadlock or remote hang]
E -->|Incomplete or missing| I[Partial repair, rerun]
G --> J[Subrange repair with -st -et]
H --> K[Restart node or cancel session]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Repair session stuck waiting for remote Merkle tree | nodetool repair hangs at 99%, AntiEntropyStage active but no progress | nodetool tpstats AntiEntropyStage Active and Completed counts |
| Oversized partition or schema mismatch | “Validation failed” or “Failed creating a merkle tree” in system.log | grep -i -e "Validation failed" -e "Cannot get comparator" /var/log/cassandra/system.log |
| Concurrent repair sessions on the same node | Multiple repair commands launched together; both hang or fail | nodetool repair_admin list (4.0+) or check for multiple nodetool repair processes |
| AntiEntropyStage blocked by compaction or I/O saturation | Repair starts but progress halts under heavy disk load | nodetool compactionstats and iostat -x during the repair window |
| Mixed incremental and full repairs pre-4.0 | Data reappears, repair sessions abort with inconsistencies | Cassandra version and alternating repair types in history |
| Replica DOWN during repair window | Repair skips ranges owned by unreachable replicas | nodetool status for DN nodes during the repair |
Quick checks
Run these safe, read-only checks before making changes.
# Check active repair sessions and state (Cassandra 4.0+)
nodetool repair_admin list
# Check active streaming sessions and bytes transferred
nodetool netstats
# Check AntiEntropyStage thread pool for progress
nodetool tpstats
# Review recent repair history entries
cqlsh -e "SELECT * FROM system_distributed.repair_history LIMIT 50;"
# Search logs for validation failures or Merkle tree errors
grep -iE "repair|validation failed|merkle|Cannot get comparator" /var/log/cassandra/system.log
# Check compaction backlog that may starve repair I/O
nodetool compactionstats
# Check disk saturation on data and commitlog devices
iostat -x 1
# Verify cluster liveness during the repair window
nodetool status
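The log search above can be wrapped in a small helper that counts matches per error type, which makes it easier to compare nodes. This is a minimal sketch: the function name count_repair_errors is hypothetical, the patterns are the ones named in this guide, and it reads the log on stdin rather than assuming a log path.

```shell
# Count repair-related error lines, one total per pattern.
# Reads log text on stdin, so it works with a live file or rotated copies.
count_repair_errors() (
  log=$(cat)
  for pattern in "Validation failed" "Failed creating a merkle tree" "Cannot get comparator"; do
    # -c counts matching lines, -i ignores case, -F treats the pattern as a literal string.
    # grep exits non-zero on zero matches but still prints 0, hence the "|| true".
    n=$(printf '%s\n' "$log" | grep -icF -- "$pattern" || true)
    printf '%s: %s\n' "$pattern" "$n"
  done
)
```

A typical call would be `count_repair_errors < /var/log/cassandra/system.log`. Any non-zero count during a repair window is worth following up with the full grep to read the surrounding lines.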
How to diagnose it
- Determine if the repair is slow or actually stuck. On Cassandra 4.0+, run nodetool repair_admin list. If a session is ACTIVE and nodetool netstats shows streaming bytes increasing, the repair is progressing through slow I/O. If bytes are unchanged for more than thirty minutes, treat it as stuck.
- Inspect the AntiEntropyStage thread pool. Run nodetool tpstats. Look for the AntiEntropyStage pool. If Active is greater than zero but Completed has not incremented for more than thirty minutes, the repair stage is stalled.
- Read the logs for validation failures. Search system.log for Validation failed, Failed creating a merkle tree, or RuntimeException: Cannot get comparator. These errors usually indicate an oversized partition or a schema mismatch that prevents Merkle tree construction.
- Identify the offending table or partition. If validation failed, use nodetool tablestats <keyspace> to check for large partitions, or run nodetool toppartitions <keyspace> <table> <duration_ms> to sample the hottest partitions.
- Verify historical completion, not just start. Query system_distributed.repair_history for the node and keyspace. Look for sessions that terminated in a failed state, or for gaps where no successful entry exists within gc_grace_seconds. On Cassandra 4.0+, use nodetool repair_admin to inspect active and pending session state.
- Check for concurrent repairs. Ensure no other operator or automation has launched a second nodetool repair on the same node. Multiple concurrent repairs compete for the AntiEntropyStage and can hang the node.
- Confirm replica availability. Run nodetool status. If any replica was DOWN during the repair window, the coordinator may have skipped ranges owned by that node.
- Narrow the scope for large partitions. If a full-range repair consistently fails on the same table, split the job into subrange repairs using -st and -et token boundaries to reduce per-session partition exposure.
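The AntiEntropyStage check in the steps above can be sketched as two small helpers. The names aes_counts and aes_state are hypothetical, and the parsing assumes the usual nodetool tpstats column order (pool name, Active, Pending, Completed); confirm the layout on your version before relying on it.

```shell
# Print "Active Completed" for the AntiEntropyStage row of `nodetool tpstats` output on stdin.
aes_counts() {
  awk '$1 == "AntiEntropyStage" { print $2, $4 }'
}

# Compare an earlier Completed count with a later sample.
# Args: previous_completed current_active current_completed
# Prints "stalled" when work is active but Completed did not advance, otherwise "ok".
aes_state() (
  prev_completed=$1; cur_active=$2; cur_completed=$3
  if [ "$cur_active" -gt 0 ] && [ "$cur_completed" -eq "$prev_completed" ]; then
    echo stalled
  else
    echo ok
  fi
)
```

Take one sample with `nodetool tpstats | aes_counts`, wait the thirty minutes this guide uses as its threshold, take a second sample, and pass the numbers to aes_state. A "stalled" result only says the pool made no progress; the log and netstats checks are still needed to tell a deadlock from a remote hang.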
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| AntiEntropyStage Active / Completed (tpstats) | Direct view of repair execution | Active > 0 with Completed flat for > 30 min |
| Streaming progress (netstats) | Repair streams differences after Merkle comparison | Byte count unchanged for > 30 min while streams exist |
| repair_admin list session state (4.0+) | Authoritative state of incremental repair sessions | Any session stuck in ACTIVE beyond the expected window |
| system_distributed.repair_history | Proof that repair covered all ranges | No successful entry for a node/keyspace within 80% of gc_grace_seconds |
| Validation failure log entries | Merkle tree creation aborts the session | Any Validation failed error during a repair window |
| Disk I/O await and utilization | Repair competes for I/O with client traffic | await > 10 ms on SSD or > 50 ms on HDD sustained |
| Pending compactions | Anti-compaction output from repair adds compaction debt | Pending tasks trending upward during repair cycles |
Fixes
Stuck repair session
If nodetool netstats shows no progress and tpstats confirms the AntiEntropyStage pool is frozen, the safest recovery is to restart the Cassandra process on the stuck node. This clears the hung session state. This is disruptive: the node will be briefly DOWN. After restart, run a fresh repair during off-peak hours.
Validation failure from oversized partitions
First identify the table and partition key responsible via nodetool tablestats or nodetool toppartitions. Split the repair into subranges using -st and -et to reduce the amount of data processed in a single Merkle tree. If the partition is unbounded, fix the data model. As a workaround, add --full to the subrange repair to force a complete comparison within that narrowed window.
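Splitting a token range into equal subranges is simple arithmetic, sketched below as a dry-run helper that only prints the commands. The function name print_subrange_repairs and the keyspace argument are hypothetical, and the sketch assumes a non-wrapping range (start less than end) of signed 64-bit tokens, as used by the Murmur3 partitioner.

```shell
# Print one `nodetool repair -st .. -et ..` command per subrange of (start, end].
# Args: start_token end_token pieces keyspace
# Assumption: start < end (no wrap-around) and tokens fit in signed 64-bit integers.
print_subrange_repairs() (
  st=$1; et=$2; n=$3; ks=$4
  if [ "$n" -lt 1 ]; then
    echo "pieces must be at least 1" >&2
    exit 1
  fi
  # Divide before subtracting so the step cannot overflow 64-bit arithmetic.
  step=$(( et / n - st / n ))
  # Requiring step >= n guarantees rounding cannot push a boundary to or past end.
  if [ "$step" -lt "$n" ]; then
    echo "range too small for $n pieces" >&2
    exit 1
  fi
  cur=$st
  i=1
  while [ "$i" -le "$n" ]; do
    # The last piece always ends exactly at the requested end token.
    if [ "$i" -eq "$n" ]; then next=$et; else next=$(( cur + step )); fi
    echo "nodetool repair -st $cur -et $next $ks"
    cur=$next
    i=$(( i + 1 ))
  done
)
```

Printing instead of executing lets you review the boundaries, add --full where needed, and run the subranges one at a time so a validation failure points at a narrow token window.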
Concurrent repair deadlock
On Cassandra 4.0+, list the sessions with nodetool repair_admin list, then run nodetool repair_admin cancel --session <session_id> for every session except the one you want to keep. On earlier versions, killing the nodetool client does not stop the server-side repair; if the node is deadlocked, restart it. Never launch multiple nodetool repair invocations on the same node. Centralize scheduling with Reaper or a similar tool.
I/O saturation throttling repair
If repair is merely slow due to disk or network saturation, schedule it outside peak traffic windows. In cassandra.yaml, ensure stream_throughput_outbound_megabits_per_sec (renamed stream_throughput_outbound in Cassandra 4.1) is set to a value that leaves headroom for client traffic. Do not raise the compaction throughput limit to fix repair I/O; compaction and repair share the disk, but they are controlled by separate settings.
Partial repair cleanup
When you discover that a previous repair was partial, do not assume the next incremental run will fill the gaps cleanly, especially on versions before 4.0. On Cassandra 4.0+, run nodetool repair --full on the affected keyspace or subrange to force a complete comparison across all data regardless of RepairedAt state. For Cassandra 3.x, where incremental repair is unreliable, run full repairs exclusively and plan an upgrade. Be warned: full repair generates massive I/O and network load.
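To decide which ranges need the forced rerun, it helps to reduce repair history to the ranges whose latest attempt did not succeed. The sketch below assumes a hypothetical three-column export, one line per history row in the form "range_begin range_end status", ordered oldest to newest. That format is an assumption for illustration, not what cqlsh prints directly; the status values STARTED, SUCCESS and FAILED are the ones recorded in system_distributed.repair_history.

```shell
# Print ranges whose most recent history row is not SUCCESS.
# Input on stdin: "range_begin range_end status", ordered oldest to newest (assumed export format).
unrepaired_ranges() {
  awk '
    NF >= 3 { last[$1 " " $2] = $3 }        # later rows overwrite earlier ones per range
    END { for (r in last) if (last[r] != "SUCCESS") print r, last[r] }
  ' | sort
}
```

Each output line is a candidate for a subrange repair with --full. A row left in STARTED usually means the session died without recording an outcome, so treat it the same as FAILED.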
Prevention
- Automate repair scheduling with Reaper. Manual repairs and cron jobs are prone to silent failure. A scheduler tracks state and retries.
- Always run with -pr. This restricts each node to its primary token ranges, avoiding redundant work and reducing the chance of overlapping sessions.
- Verify completion via repair_admin or repair_history. Do not trust the return of the nodetool repair command alone.
- Alert on repair freshness. Trigger a ticket when any node or keyspace has gone longer than eighty percent of gc_grace_seconds without a successful repair.
- Run incremental repairs frequently. A cadence of every one to three days, plus a full repair every one to three weeks, limits the window for data divergence.
- Bound partition sizes. Review nodetool tablehistograms regularly for partitions approaching hundreds of megabytes.
- Schedule all repair operations off-peak. Repair generates massive disk and network I/O that can starve client traffic and mask its own failures under load.
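The freshness rule above is a single comparison, sketched here so it can be dropped into a cron job or alert script. The function name repair_freshness is hypothetical, and the timestamp of the last successful repair is assumed to come from your scheduler or from repair history as a Unix epoch in seconds.

```shell
# Args: last_success_epoch now_epoch gc_grace_seconds
# Prints "stale" when the last successful repair is older than 80% of gc_grace_seconds.
repair_freshness() (
  last=$1; now=$2; gc_grace=$3
  age=$(( now - last ))
  limit=$(( gc_grace * 80 / 100 ))
  if [ "$age" -gt "$limit" ]; then echo stale; else echo fresh; fi
)
```

With the default gc_grace_seconds of 864000 (ten days), the limit works out to 691200 seconds, or eight days. A call such as `repair_freshness "$last_success" "$(date +%s)" 864000` would therefore report stale once the last success is more than eight days old.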
How Netdata helps
- Correlate repair windows with per-device disk I/O await and network throughput to spot contention.
- Monitor JVM heap usage and GC pause duration during repairs to catch nodes approaching gossip failure before they flap.
- Track AntiEntropyStage thread pool saturation to identify stuck repairs before they time out.
- Alert on repair freshness by querying system_distributed.repair_history or parsing repair completion logs so cycles do not exceed gc_grace_seconds.
- Visualize pending compactions and SSTable growth to determine whether anti-compaction is overwhelming background compaction.