Cassandra repair failing or stuck: partial repairs and how to verify completion

Whether nodetool repair returns to the prompt or hangs at 99 percent, the worst outcome is the silent one: a partial repair. A partial repair anti-compacts some token ranges and skips others, so the data stays inconsistent while the run looks complete. Because incremental repair marks SSTables as repaired during anti-compaction, a session that fails mid-range leaves later ranges unrepaired while earlier ones are already marked done. No built-in alert fires when only forty percent of ranges were covered. If the unrepaired ranges contain tombstones, deleted data can resurrect once gc_grace_seconds passes and the tombstones are purged from the replicas that did receive them. This guide shows how to verify completion, diagnose stuck sessions, and recover without causing a cascading I/O incident.

What this means

Cassandra anti-entropy repair compares Merkle trees across replicas and streams missing data. In Cassandra 3.11 and 4.x, incremental repair is the default. It separates SSTables into repaired and unrepaired sets by recording a repairedAt timestamp in each SSTable's metadata, and only unrepaired SSTables participate in subsequent incremental Merkle tree comparisons. This is efficient, but it creates a hazard: if a repair session fails after some ranges are anti-compacted, those ranges look finished to the next incremental run while downstream ranges remain unsynced. A partial repair can be worse than no repair, because it looks like success and trains teams to stop checking until tombstones expire and deleted data reappears from unrepaired replicas.
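
One way to see the repaired and unrepaired split on disk is to read each SSTable's metadata. The sketch below assumes the sstablemetadata tool that ships with Cassandra is on the PATH and that it prints one "Repaired at:" line per SSTable, with 0 meaning never marked repaired. The function name count_repaired and the keyspace and table path in the usage comment are illustrative, not part of any Cassandra tooling.

```bash
#!/usr/bin/env bash
# Count repaired vs unrepaired SSTables from `sstablemetadata` output on stdin.
# Assumption: each SSTable contributes one "Repaired at: <millis>" line,
# where 0 means the SSTable has never been marked repaired.
count_repaired() {
  awk '
    /Repaired at:/ {
      sub(/.*Repaired at:[ \t]*/, "")   # keep only the value after the label
      if ($1 + 0 == 0) unrepaired++; else repaired++
    }
    END { printf "repaired=%d unrepaired=%d\n", repaired, unrepaired }
  '
}

# Hypothetical usage; adjust the data directory, keyspace, and table:
#   sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db | count_repaired
```

Comparing the counts before and after a repair run gives a rough indication of whether anti-compaction touched the table, though it does not say which token ranges were covered.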

flowchart TD
    A[Repair command returns] --> B{Active streams?}
    B -->|Yes, progressing| C[Slow I/O, monitor]
    B -->|Yes, stalled| D[Stuck repair session]
    B -->|No streams| E[Check repair history]
    D --> F{Validation failed in logs?}
    F -->|Yes| G[Oversized partition or schema issue]
    F -->|No| H[Deadlock or remote hang]
    E -->|Incomplete or missing| I[Partial repair, rerun]
    G --> J[Subrange repair with -st -et]
    H --> K[Restart node or cancel session]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Repair session stuck waiting for remote Merkle tree | nodetool repair hangs at 99%, AntiEntropyStage active but no progress | nodetool tpstats AntiEntropyStage Active and Completed counts |
| Oversized partition or schema mismatch | "Validation failed" or "Failed creating a merkle tree" in system.log | grep -i -e "Validation failed" -e "Cannot get comparator" /var/log/cassandra/system.log |
| Concurrent repair sessions on the same node | Multiple repair commands launched together; both hang or fail | nodetool repair_admin list (4.0+) or check for multiple nodetool repair processes |
| AntiEntropyStage blocked by compaction or I/O saturation | Repair starts but progress halts under heavy disk load | nodetool compactionstats and iostat -x during the repair window |
| Mixed incremental and full repairs pre-4.0 | Data reappears, repair sessions abort with inconsistencies | Cassandra version and alternating repair types in history |
| Replica DOWN during repair window | Repair skips ranges owned by unreachable replicas | nodetool status for DN nodes during the repair |

Quick checks

Run these safe, read-only checks before making changes.

# Check active repair sessions and state (Cassandra 4.0+)
nodetool repair_admin list

# Check active streaming sessions and bytes transferred
nodetool netstats

# Check AntiEntropyStage thread pool for progress
nodetool tpstats

# Review recent repair history entries
cqlsh -e "SELECT * FROM system_distributed.repair_history LIMIT 50;"

# Search logs for validation failures or Merkle tree errors
grep -iE "repair|validation failed|merkle|Cannot get comparator" /var/log/cassandra/system.log

# Check compaction backlog that may starve repair I/O
nodetool compactionstats

# Check disk saturation on data and commitlog devices
iostat -x 1

# Verify cluster liveness during the repair window
nodetool status

How to diagnose it

  1. Determine if the repair is slow or actually stuck. On Cassandra 4.0+, run nodetool repair_admin list. If a session is still listed in a non-terminal state (for example REPAIRING) and nodetool netstats shows streaming bytes increasing, the repair is progressing, just slowly. If the byte counts are unchanged for more than thirty minutes, treat it as stuck.
  2. Inspect the AntiEntropyStage thread pool. Run nodetool tpstats. Look for the AntiEntropyStage pool. If Active is greater than zero but Completed has not incremented for more than thirty minutes, the repair stage is stalled.
  3. Read the logs for validation failures. Search system.log for Validation failed, Failed creating a merkle tree, or RuntimeException: Cannot get comparator. These errors usually indicate an oversized partition or a schema mismatch that prevents Merkle tree construction.
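  4. Identify the offending table or partition. If validation failed, use nodetool tablestats <keyspace> and compare the "Compacted partition maximum bytes" value across tables to find oversized partitions. nodetool toppartitions <keyspace> <table> <duration_ms> samples the most frequently accessed partitions, which helps only when the large partition is also a hot one.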
  5. Verify historical completion, not just start. Query system_distributed.repair_history for the node and keyspace. Look for sessions that terminated in a failed state, or for gaps where no successful entry exists within gc_grace_seconds. On Cassandra 4.0+, use nodetool repair_admin to inspect active and pending session state.
  6. Check for concurrent repairs. Ensure no other operator or automation has launched a second nodetool repair on the same node. Multiple concurrent repairs compete for the AntiEntropyStage and can hang the node.
  7. Confirm replica availability. Run nodetool status. If any replica was DOWN during the repair window, the coordinator may have skipped ranges owned by that node.
  8. Narrow the scope for large partitions. If a full-range repair consistently fails on the same table, split the job into subrange repairs using -st and -et token boundaries to reduce per-session partition exposure.
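
Steps 1 and 2 come down to taking two samples and comparing them. The following is a minimal sketch for the AntiEntropyStage check, assuming the tpstats row layout is pool name, Active, Pending, Completed. The function names aes_counts and aes_verdict are made up for this example.

```bash
#!/usr/bin/env bash
# Print "Active Completed" for the AntiEntropyStage row of `nodetool tpstats`
# output read from stdin. Assumed column order: name, Active, Pending, Completed.
aes_counts() {
  awk '$1 == "AntiEntropyStage" { print $2, $4 }'
}

# Compare two "Active Completed" samples taken some minutes apart.
# Prints "stalled" if work is active but Completed did not advance.
aes_verdict() {
  local before=$1 after=$2
  local active_after=${after% *}
  local done_before=${before#* }
  local done_after=${after#* }
  if [ "$active_after" -gt 0 ] && [ "$done_after" -eq "$done_before" ]; then
    echo "stalled"
  elif [ "$active_after" -gt 0 ]; then
    echo "progressing"
  else
    echo "idle"
  fi
}

# Hypothetical usage on a live node:
#   first=$(nodetool tpstats | aes_counts)
#   sleep 1800                      # the thirty-minute window used in this guide
#   second=$(nodetool tpstats | aes_counts)
#   aes_verdict "$first" "$second"
```

An "idle" verdict only means the stage is not busy on this node; it says nothing about whether the repair finished successfully, which is what step 5 checks.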

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| AntiEntropyStage Active / Completed (tpstats) | Direct view of repair execution | Active > 0 with Completed flat for > 30 min |
| Streaming progress (netstats) | Repair streams differences after Merkle comparison | Byte count unchanged for > 30 min while streams exist |
| repair_admin list session state (4.0+) | Authoritative state of incremental repair sessions | Any session still listed as in progress beyond the expected window |
| system_distributed.repair_history | Proof that repair covered all ranges | No successful entry for a node/keyspace within 80% of gc_grace_seconds |
| Validation failure log entries | Merkle tree creation aborts the session | Any Validation failed error during a repair window |
| Disk I/O await and utilization | Repair competes for I/O with client traffic | await > 10 ms on SSD or > 50 ms on HDD sustained |
| Pending compactions | Anti-compaction output from repair adds compaction debt | Pending tasks trending upward during repair cycles |

Fixes

Stuck repair session

If nodetool netstats shows no progress and tpstats confirms the AntiEntropyStage pool is frozen, the safest recovery is to restart the Cassandra process on the stuck node. This clears the hung session state. This is disruptive: the node will be briefly DOWN. After restart, run a fresh repair during off-peak hours.

Validation failure from oversized partitions

First identify the table and partition key responsible via nodetool tablestats or nodetool toppartitions. Split the repair into subranges using -st and -et to reduce the amount of data processed in a single Merkle tree. If the partition is unbounded, fix the data model. As a workaround, add --full to the subrange repair to force a complete comparison within that narrowed window.
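
The subrange split can be sketched as follows. This is a dry run that only prints commands. It assumes the start and end tokens describe a single non-wrapping range that the node actually owns (start less than end), since a subrange that spans several of the node's ranges may be rejected. The function names split_range and print_subrange_repairs are hypothetical, as are the keyspace and table arguments.

```bash
#!/usr/bin/env bash
# Split the token range (start, end] into n roughly equal subranges and print
# one "start end" pair per line. Dividing before subtracting keeps every
# intermediate value inside signed 64-bit arithmetic.
split_range() {
  local st=$1 et=$2 n=$3
  # Subtract 1 so integer rounding can never push a boundary past the end token.
  local step=$(( et / n - st / n - 1 ))
  if [ "$n" -le 1 ] || [ "$step" -lt 1 ]; then
    echo "$st $et"                # range too small to split; keep it whole
    return
  fi
  local cur=$st next i=1
  while [ "$i" -lt "$n" ]; do
    next=$(( cur + step ))
    echo "$cur $next"
    cur=$next
    i=$(( i + 1 ))
  done
  echo "$cur $et"                 # last subrange absorbs the remainder
}

# Dry run: print one repair command per subrange instead of executing it.
print_subrange_repairs() {
  local keyspace=$1 table=$2 st=$3 et=$4 n=$5
  split_range "$st" "$et" "$n" | while read -r s e; do
    echo "nodetool repair --full -st $s -et $e $keyspace $table"
  done
}

# Hypothetical usage:
#   print_subrange_repairs my_keyspace my_table 0 1000000 4
```

Printing the commands first makes it easy to review the boundaries and run the subranges one at a time, checking each for validation failures before moving on.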

Concurrent repair deadlock

On Cassandra 4.0+, run nodetool repair_admin cancel on all but one session ID. On earlier versions, killing the nodetool client does not stop the server-side repair; if the node is deadlocked, restart it. Never launch multiple nodetool repair invocations on the same node. Centralize scheduling with Reaper or a similar tool.
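
A small dry-run helper for the 4.0+ case might look like this. It assumes the session IDs were copied by hand from nodetool repair_admin list and that cancel takes the ID through a --session option; confirm the exact syntax with nodetool help repair_admin on your version. The function name print_cancel_commands is illustrative.

```bash
#!/usr/bin/env bash
# Given repair session IDs, keep the first one and print a cancel command
# for each of the others. Nothing is executed.
# Assumption: `nodetool repair_admin cancel --session <id>` is the 4.0+ syntax.
print_cancel_commands() {
  local keep=$1
  shift
  echo "# keeping session $keep"
  local id
  for id in "$@"; do
    echo "nodetool repair_admin cancel --session $id"
  done
}

# Hypothetical usage with placeholder IDs:
#   print_cancel_commands <id-to-keep> <id-to-cancel> <another-id-to-cancel>
```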

I/O saturation throttling repair

If repair is merely slow due to disk or network saturation, schedule it outside peak traffic windows. In cassandra.yaml, ensure stream_throughput_outbound_megabits_per_sec (renamed stream_throughput_outbound in 4.1) is set to a value that leaves headroom for client traffic; nodetool setstreamthroughput changes it at runtime without a restart. Do not raise the compaction throughput limit to fix repair I/O: compaction and repair share the disk, but they are controlled by separate settings.

Partial repair cleanup

When you discover that a previous repair was partial, do not assume the next incremental run will fill the gaps cleanly, especially on versions before 4.0. On Cassandra 4.0+, run nodetool repair --full on the affected keyspace or subrange to force a complete comparison across all data regardless of RepairedAt state. For Cassandra 3.x, where incremental repair is unreliable, run full repairs exclusively and plan an upgrade. Be warned: full repair generates massive I/O and network load.

Prevention

  • Automate repair scheduling with Reaper. Manual repairs and cron jobs are prone to silent failure. A scheduler tracks state and retries.
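  • Use -pr for routine repairs, and run it on every node. -pr restricts each node to its primary token ranges, avoiding redundant work and reducing the chance of overlapping sessions, but the cluster is fully covered only once every node has completed its own -pr run.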
  • Verify completion via repair_admin or repair_history. Do not trust the return of the nodetool repair command alone.
  • Alert on repair freshness. Trigger a ticket when any node or keyspace has gone longer than eighty percent of gc_grace_seconds without a successful repair.
  • Run incremental repairs frequently on Cassandra 4.0+. A cadence of every one to three days, plus a full repair every one to three weeks, limits the window for data divergence.
  • Bound partition sizes. Review nodetool tablehistograms regularly for partitions approaching hundreds of megabytes.
  • Schedule all repair operations off-peak. Repair generates massive disk and network I/O that can starve client traffic and mask its own failures under load.
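
The freshness alert reduces to one comparison. The sketch below assumes you already have the epoch time of the last successful repair for a node and keyspace, for example derived from system_distributed.repair_history by your own tooling. The function name repair_is_stale and the variable last_ok are hypothetical.

```bash
#!/usr/bin/env bash
# Succeed (exit 0) when the last successful repair is older than 80% of
# gc_grace_seconds; fail (exit 1) otherwise. All arguments are in seconds.
repair_is_stale() {
  local last_success_epoch=$1 now_epoch=$2 gc_grace_seconds=$3
  local age=$(( now_epoch - last_success_epoch ))
  local threshold=$(( gc_grace_seconds * 80 / 100 ))
  [ "$age" -gt "$threshold" ]
}

# Hypothetical usage; 864000 is Cassandra's default gc_grace_seconds (10 days):
#   repair_is_stale "$last_ok" "$(date +%s)" 864000 && echo "repair overdue"
```

With the default of 864000 seconds the threshold is 691200 seconds, or eight days, which leaves two days to rerun a failed repair before tombstones become eligible for purging.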

How Netdata helps

  • Correlate repair windows with per-device disk I/O await and network throughput to spot contention.
  • Monitor JVM heap usage and GC pause duration during repairs to catch nodes approaching gossip failure before they flap.
  • Track AntiEntropyStage thread pool saturation to identify stuck repairs before they time out.
  • Alert on repair freshness by querying system_distributed.repair_history or parsing repair completion logs so cycles do not exceed gc_grace_seconds.
  • Visualize pending compactions and SSTable growth to determine whether anti-compaction is overwhelming background compaction.