Cassandra hints accumulating: hinted handoff backlog and replay storms

A coordinator node triggers a disk alert for /var/lib/cassandra/hints, or a replica returning from maintenance flaps UP/DOWN under a write burst it never requested. Hints let writes succeed when a replica is temporarily unreachable, but a large backlog turns that safety net into a secondary failure. Hints consume coordinator disk, expire after max_hint_window_in_ms, and can synchronize into a replay storm that overwhelms a recovering node.

The root cause is almost always a DOWN or unreachable node, but the damage spreads to coordinators holding the backlog and then back to the recovering replica. You may discover it as disk exhaustion, gossip flapping, or inconsistent reads after an outage that exceeded the hint window. Unlike dropped mutations, hints appear benign at first: the cluster accepts writes and clients see no errors. The cost is deferred to the moment the missing replica returns, when every coordinator that buffered mutations replays them at once. If the backlog is large and the target is fragile, that deferred cost can exceed the original outage.

What this means

When a write target is unreachable, the coordinator stores a hint locally instead of failing the write. From Cassandra 3.0 onward, hints are flat files in the hints directory rather than system table rows. Each file contains the target endpoint, the mutation timestamp, and the serialized mutation. Because hints buffer to disk rather than memory, they protect against data loss without unbounded heap growth. The tradeoff is coordinator-side disk usage and the eventual replay cost.
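Because hints are plain files, the backlog can be broken down per target directly from the directory listing. The sketch below is illustrative only: it assumes hint files are named `<target-host-id>-<timestamp>-<version>.hints`, so that the first 36 characters of each name are the target's host ID (a UUID). The function name `hints_per_target` is hypothetical. Verify the naming on your own version before relying on it.

```shell
# Count hint files per target, assuming (not guaranteed) that each file is
# named <host-id UUID>-<timestamp>-<version>.hints.
# Prints "<host-id> <file count>", one line per target, sorted by host ID.
hints_per_target() {
  for f in "$1"/*.hints; do
    [ -e "$f" ] || continue          # glob matched nothing: no hint files
    basename "$f" | cut -c1-36       # keep only the UUID prefix
  done | sort | uniq -c | awk '{print $2, $1}'
}

# Example (adjust the path if hints_directory is overridden):
#   hints_per_target /var/lib/cassandra/hints
```

The host IDs in the output can be matched against the Host ID column of nodetool status to see which replica each backlog belongs to.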

Hints accumulate until the target recovers, at which point coordinators deliver the backlog in bulk segments. Each coordinator throttles its own delivery with hinted_handoff_throttle_in_kb, and Cassandra scales that per-node rate down as the cluster grows, but the limit is a static setting that knows nothing about the target's actual capacity, and every coordinator holding hints starts replaying at about the same time. If the backlog is large enough, the combined replay can saturate the recovering node’s disk, heap, or mutation stage, causing it to miss gossip heartbeats and be marked DOWN again. Meanwhile, the hint files themselves are a disk space risk on every node that accepted writes for the missing replica.

flowchart TD
    A[Replica marked DOWN] --> B[Coordinators store hints]
    B --> C[Hints backlog grows]
    C --> D[Disk pressure on coordinators]
    C --> E[Target node recovers]
    E --> F[Simultaneous hint replay]
    F --> G[Replay storm overloads target]
    G --> H[Node re-marked DOWN]
    H --> I[Cascading backlog]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Replica down or unreachable | Hints directory grows steadily on coordinators; TotalHintsInProgress > 0 after recovery | nodetool status |
| Hint delivery failing | HintsFailed or HintsTimedOut increasing; target node overloaded | nodetool netstats on the target |
| Replay storm overwhelming a recovered node | Target returned but GC pauses, disk saturation, or dropped mutations spike, then it goes DOWN again | nodetool tpstats HintsDispatcher and target resource pressure |
| Outage exceeded max_hint_window_in_ms | Hints stopped growing after the configured window; permanent inconsistency for that period | Outage duration versus the configured hint window |
| Throttle mismatch across the cluster | Uneven backlogs and hot spots | Compare hinted_handoff_throttle_in_kb in cassandra.yaml across nodes |

Quick checks

# Check whether hint storage is enabled on this node
nodetool statushandoff

# Measure total hint backlog size (adjust path if hints_directory is overridden)
du -sh /var/lib/cassandra/hints/

# Count top-level entries in the hints directory to gauge backlog granularity
ls /var/lib/cassandra/hints/ | wc -l

# Identify down or unreachable replicas
nodetool status

# Check active internode sessions and streaming
nodetool netstats

# Inspect the HintsDispatcher thread pool for active replay
nodetool tpstats | grep -A2 HintsDispatcher

# Snapshot JVM heap usage (run repeatedly during replay; persistent growth signals pressure)
nodetool info | grep "Heap Memory"

# Check disk I/O saturation on the recovering node (runs until interrupted; stop with Ctrl-C)
iostat -x 1
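
A single du reading does not show whether the backlog is growing or draining. As a minimal sketch, the two hypothetical helpers below (`hints_snapshot` and `growth_kb_per_min`) take a size and file-count snapshot of the hints directory and turn two size readings into a growth rate. The 60-second interval in the usage comment is an arbitrary choice.

```shell
# Print "<size in KB> <file count>" for a hints directory.
hints_snapshot() {
  kb=$(du -sk "$1" | awk '{print $1}')
  files=$(find "$1" -type f | wc -l | tr -d ' ')
  echo "$kb $files"
}

# Growth rate in KB per minute from two size readings taken N seconds apart.
# Arguments: <before_kb> <after_kb> <interval_seconds>
growth_kb_per_min() {
  echo $(( ($2 - $1) * 60 / $3 ))
}

# Example:
#   a=$(hints_snapshot /var/lib/cassandra/hints); sleep 60
#   b=$(hints_snapshot /var/lib/cassandra/hints)
#   growth_kb_per_min "${a%% *}" "${b%% *}" 60
```

A positive rate while every node reports UP suggests hints are still being written for some target. A negative rate means the backlog is draining.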

How to diagnose it

  1. Confirm handoff is enabled with nodetool statushandoff, then measure the hints directory size with du. A growing directory when all nodes are expected healthy is abnormal.
  2. Identify the target replica using nodetool status. Look for a node that is DOWN or flapping between UP and DOWN.
  3. Correlate JMX metrics. TotalHintsInProgress indicates active delivery. Rising HintsFailed or HintsTimedOut indicates that the target is rejecting replayed hints or not acknowledging them in time.
  4. If the target recently recovered, check for a replay storm. On the target, watch nodetool tpstats for MutationStage pending tasks, monitor GC logs for long pauses, and use iostat to confirm write saturation.
  5. Compare the outage duration to max_hint_window_in_ms. The default is 3 hours. If the node was down longer than this window, hints are no longer stored and the missing data can only be recovered via repair.
  6. If delivery appears stuck despite an active target, inspect the hints directory for unconsumed files and review system logs for errors from the HintsService.
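
Step 5 can be sketched as a small check. The helpers below are hypothetical names: `hint_window_ms` reads max_hint_window_in_ms from a cassandra.yaml file and falls back to the 3-hour default (10800000 ms) when the key is absent or commented out, and `outage_vs_window` compares an outage duration in seconds against it. This assumes the setting uses the `max_hint_window_in_ms` key; releases that name or format the setting differently would need a different pattern.

```shell
# Read max_hint_window_in_ms from a cassandra.yaml file.
# Falls back to 10800000 ms (3 hours) if the key is not set.
hint_window_ms() {
  v=$(sed -n 's/^max_hint_window_in_ms:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
  echo "${v:-10800000}"
}

# Arguments: <outage_seconds> <window_ms>
# Prints "repair-required" if the outage outlasted the hint window,
# otherwise "hints-may-cover".
outage_vs_window() {
  if [ $(( $1 * 1000 )) -gt "$2" ]; then
    echo "repair-required"
  else
    echo "hints-may-cover"
  fi
}

# Example: a 4-hour outage against the configured window
#   outage_vs_window 14400 "$(hint_window_ms /etc/cassandra/cassandra.yaml)"
```

The /etc/cassandra/cassandra.yaml path in the example is an assumption; use the location from your own installation.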

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| TotalHintsInProgress | Hints actively being delivered to a recovered node | Sustained > 0 for hours after recovery |
| HintsFailed / HintsTimedOut | Delivery failing because the target is overloaded or unreachable | Non-zero rate sustained > 5 minutes |
| Hints directory size | Disk consumption on coordinators that can feed disk exhaustion | Growing over hours or approaching volume limits |
| HintsDispatcher pending tasks | Backlog of segments waiting to be replayed | Pending > 0 sustained |
| DownEndpointCount | Root cause: a down replica is why hints are being generated | Any node DOWN > 5 minutes |
| Disk space available on coordinators | Hints compete with SSTables, commitlog, and snapshots | Free space < 30% |
| GC pause duration on recovered node | Replay storms can push a node into GC death spiral | Pauses > 2 seconds during replay |
| SSTable count on recovered node | Replayed hints create new SSTables that need compaction | Rapid increase during hint ingestion |
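
The free-space signal in the table can be checked by hand on a coordinator. This is a rough sketch with hypothetical helper names: `free_pct_from_df` parses POSIX-format `df -P` output from standard input, and `disk_state` compares the result to a threshold that defaults to the 30% figure above. It assumes the filesystem name contains no spaces.

```shell
# Read `df -P <path>` output on stdin and print the percentage of free space.
# Column 5 of the second line is the used capacity, e.g. "78%".
free_pct_from_df() {
  awk 'NR==2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Arguments: <free_pct> [threshold_pct, default 30]
# Prints WARN when free space is below the threshold, otherwise OK.
disk_state() {
  if [ "$1" -lt "${2:-30}" ]; then echo WARN; else echo OK; fi
}

# Example:
#   disk_state "$(df -P /var/lib/cassandra/hints | free_pct_from_df)"
```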

Fixes

Restore the down replica. The only durable fix is a stable, healthy target. Do not force a sick node back into the ring. If it is flapping due to GC pressure or disk saturation, resolve those root causes first. Long GC pauses on the target are a sign of a GC death spiral; fix that before allowing replay. Only when the node is stable should you allow full-speed delivery.

Throttle the replay storm. Temporarily reduce hinted_handoff_throttle_in_kb to lower write pressure on the recovering node. This extends replay time but reduces the risk of re-failure. Lowering the throttle reduces the mutation rate arriving at the target, giving its flush writers and compaction executor time to keep up. Watch nodetool compactionstats on the target; if pending compactions begin to grow while replay is active, the throttle is still too high.
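
To make the tradeoff concrete, the sketch below estimates how long a backlog takes to drain at a given delivery rate. The function name `drain_seconds` is hypothetical, and the rate you pass in is an assumption you must supply: the effective rate reaching the target depends on cluster size and delivery settings, so measure it rather than equating it with the configured throttle.

```shell
# Estimate seconds needed to drain a hint backlog, rounded up.
# Arguments: <backlog_kb> <effective_rate_kb_per_second>
drain_seconds() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Example: a 10 GiB backlog (10485760 KB)
#   drain_seconds 10485760 1024   # about 2.8 hours
#   drain_seconds 10485760 256    # about 11.4 hours
#
# The throttle can be changed at runtime on each coordinator with:
#   nodetool sethintedhandoffthrottlekb 256
# As far as I know, this runtime change is not written back to
# cassandra.yaml, so it would not survive a restart of that node.
```

If the estimated drain time at the lower rate is longer than you can tolerate, truncating and repairing may be the more practical route.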

Truncate hints when the target cannot survive replay. If the node re-fails under replay, drop the backlog for that endpoint with nodetool truncatehints <target_ip> on the storing coordinators. Specify the exact endpoint to avoid dropping hints for other replicas. After truncation, run a full nodetool repair on the target to reconcile the missing data. Tradeoff: you lose the hinted mutations for that window, and repair is mandatory and I/O intensive.
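
Because truncation is destructive, it helps to print the plan before running anything. The following is a dry-run sketch only: `truncate_plan` is a hypothetical helper, and the use of ssh to reach each node is an assumption about your environment. It emits one truncatehints command per storing coordinator, scoped to the target endpoint, followed by a full repair on the target.

```shell
# Print (do not execute) the truncation and repair commands.
# Arguments: <target_ip> <coordinator_ip>...
truncate_plan() {
  target="$1"
  shift
  for c in "$@"; do
    # Scoped to one endpoint so hints for other replicas are kept
    echo "ssh $c nodetool truncatehints $target"
  done
  echo "ssh $target nodetool repair --full"
}

# Example:
#   truncate_plan 10.0.0.9 10.0.0.1 10.0.0.2
```

Review the printed commands, then run them by hand or pipe them to a shell once you are satisfied with the target and coordinator lists.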

Repair after a long outage. If the node was down longer than max_hint_window_in_ms, hints are already gone. Schedule nodetool repair on the affected tables immediately. Run it during low-traffic hours and throttle streaming to avoid saturating the cluster. In a cluster with many tables, schedule the repair by keyspace to limit streaming concurrency. Tradeoff: repair competes for disk I/O and network bandwidth.

Recover from disk pressure on coordinators. If hints are consuming enough space to threaten disk exhaustion, verify whether the target node has been decommissioned or is permanently gone. For a node that will not return, truncate its hints and plan a repair. For transient backlogs, freeing snapshots or increasing storage may be necessary, but the real fix is delivering the hints or removing them and repairing.

Prevention

  • Monitor hints directory size with the same urgency as data directory growth. Alert on sustained growth when no maintenance window is active.
  • Keep hinted_handoff_throttle_in_kb identical on every node to avoid creating bottlenecks.
  • Monitor file count in the hints directory, not just byte size, because a very large number of small files increases overhead and can stress filesystem limits.
  • Set max_hint_window_in_ms to a value aligned with your recovery time objective, and treat any outage approaching that window as a mandatory repair trigger.
  • During rolling restarts or upgrades, maintain only one node down at a time to prevent cross-node hint amplification.
  • Review your restart and decommission procedures. A rolling restart should never leave a node down long enough to accumulate gigabytes of hints.
  • If you anticipate an extended outage, consider proactively increasing the hint throttle on the remaining nodes so that delivery is faster when the node returns, provided the target hardware can absorb it.
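
The throttle-consistency check can be sketched in a few lines. Both helper names are hypothetical. `throttle_from_yaml` extracts hinted_handoff_throttle_in_kb from one node's cassandra.yaml, and `distinct_throttles` summarizes a list of "<host> <value>" lines. How you collect the per-node values (ssh, configuration management, and so on) is left as an assumption about your environment.

```shell
# Print the configured hinted_handoff_throttle_in_kb from a cassandra.yaml file.
# Prints nothing if the key is absent or commented out.
throttle_from_yaml() {
  sed -n 's/^hinted_handoff_throttle_in_kb:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Read "<host> <throttle_kb>" lines on stdin and print one summary line per
# distinct value, lowest first. More than one line means a mismatch.
distinct_throttles() {
  awk '{print $2}' | sort -n | uniq -c | awk '{print $2 " KB/s on " $1 " node(s)"}'
}

# Example:
#   printf 'n1 1024\nn2 1024\nn3 512\n' | distinct_throttles
```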

How Netdata helps

  • Correlate hints directory size with disk space utilization on coordinators to catch backlog-driven disk exhaustion before writes block.
  • Track TotalHintsInProgress and delivery outcome rates (HintsSucceeded, HintsFailed, HintsTimedOut) to distinguish healthy replay from a failing storm.
  • Overlay hint replay periods with GC pause duration and disk I/O wait on the recovered node to expose replay storms in real time.
  • Alert on node liveness transitions and DownEndpointCount so you detect the root cause before hints begin to accumulate.