Cassandra hints accumulating: hinted handoff backlog and replay storms
A coordinator node triggers a disk alert for /var/lib/cassandra/hints, or a replica returning from maintenance flaps UP/DOWN under a write burst it never requested. Hints let writes succeed when a replica is temporarily unreachable, but a large backlog turns that safety net into a secondary failure. Hints consume coordinator disk, stop being written once a replica has been down longer than max_hint_window_in_ms, and can synchronize into a replay storm that overwhelms a recovering node.
The root cause is almost always a DOWN or unreachable node, but the damage spreads to coordinators holding the backlog and then back to the recovering replica. You may discover it as disk exhaustion, gossip flapping, or inconsistent reads after an outage that exceeded the hint window. Unlike dropped mutations, hints appear benign at first: the cluster accepts writes and clients see no errors. The cost is deferred to the moment the missing replica returns, when every coordinator that buffered mutations replays them at once. If the backlog is large and the target is fragile, that deferred cost can exceed the original outage.
What this means
When a write target is unreachable, the coordinator stores a hint locally instead of failing the write. From Cassandra 3.0 onward, hints are flat files in the hints directory rather than system table rows. Each file contains the target endpoint, the mutation timestamp, and the serialized mutation. Because hints buffer to disk rather than memory, they protect against data loss without unbounded heap growth. The tradeoff is coordinator-side disk usage and the eventual replay cost.
Hints accumulate until the target recovers, at which point coordinators deliver the backlog in bulk segments. Delivery is throttled per coordinator by hinted_handoff_throttle_in_kb, but every coordinator acts simultaneously. If the backlog is large enough, the combined replay can saturate the recovering node’s disk, heap, or mutation stage, causing it to miss gossip heartbeats and be marked DOWN again. Meanwhile, the hint files themselves are a disk space risk on every node that accepted writes for the missing replica.
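To make the replay cost concrete, here is a minimal back-of-the-envelope sketch in shell. It assumes every coordinator holds a similar backlog and sustains a fixed effective delivery rate; the function name and the example numbers are illustrative, and real delivery rates depend on cluster size and on how healthy the target is.

```shell
# Rough replay estimate for a hint backlog.
# Assumption: each coordinator sustains a fixed effective delivery rate.
estimate_replay() (
    backlog_kb=$1      # backlog held by ONE coordinator, in KiB
    rate_kb_s=$2       # assumed effective delivery rate per coordinator, KiB/s
    coordinators=$3    # coordinators replaying at the same time

    if [ "$rate_kb_s" -le 0 ] || [ "$coordinators" -le 0 ]; then
        echo "rate and coordinator count must be positive" >&2
        exit 1
    fi

    # Ceiling division: seconds for one coordinator to drain its backlog.
    drain_seconds=$(( (backlog_kb + rate_kb_s - 1) / rate_kb_s ))
    # Combined write rate arriving at the recovering node.
    inbound_kb_s=$(( rate_kb_s * coordinators ))

    echo "drain_seconds=$drain_seconds inbound_kb_s=$inbound_kb_s"
)

# 2 GiB per coordinator, 1024 KiB/s each, 5 coordinators replaying together
estimate_replay 2097152 1024 5
# prints: drain_seconds=2048 inbound_kb_s=5120
```

The second number is the one to compare against what the recovering node can absorb: the per-coordinator rate looks modest, but the target sees the sum.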
flowchart TD
A[Replica marked DOWN] --> B[Coordinators store hints]
B --> C[Hints backlog grows]
C --> D[Disk pressure on coordinators]
C --> E[Target node recovers]
E --> F[Simultaneous hint replay]
F --> G[Replay storm overloads target]
G --> H[Node re-marked DOWN]
H --> I[Cascading backlog]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Replica down or unreachable | Hints directory grows steadily on coordinators; TotalHintsInProgress > 0 after recovery | nodetool status |
| Hint delivery failing | HintsFailed or HintsTimedOut increasing; target node overloaded | nodetool netstats on the target |
| Replay storm overwhelming a recovered node | Target returned but GC pauses, disk saturation, or dropped mutations spike, then it goes DOWN again | nodetool tpstats HintsDispatcher and target resource pressure |
| Outage exceeded max_hint_window_in_ms | Hints stopped growing after the configured window; permanent inconsistency for that period | Outage duration versus the configured hint window |
| Throttle mismatch across the cluster | Uneven backlogs and hot spots | Compare hinted_handoff_throttle_in_kb in cassandra.yaml across nodes |
Quick checks
# Check whether hint storage is enabled on this node
nodetool statushandoff
# Measure total hint backlog size (adjust path if hints_directory is overridden)
du -sh /var/lib/cassandra/hints/
# Count top-level entries in the hints directory to gauge backlog granularity
ls /var/lib/cassandra/hints/ | wc -l
# Identify down or unreachable replicas
nodetool status
# Check active internode sessions and streaming
nodetool netstats
# Inspect the HintsDispatcher thread pool for active replay
nodetool tpstats | grep -A2 HintsDispatcher
# Snapshot JVM heap usage (run repeatedly during replay; persistent growth signals pressure)
nodetool info | grep "Heap Memory"
# Check disk I/O saturation on the recovering node (runs until interrupted; stop with Ctrl-C)
iostat -x 1
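The size and count checks above can be combined into one helper that is easy to run repeatedly or feed into a collector. This is a sketch only: the function name is hypothetical, and the default path is the one used above, so pass a different path if hints_directory is overridden.

```shell
# Summarise a hints directory: total size in KiB and number of regular files.
hints_summary() (
    dir=${1:-/var/lib/cassandra/hints}

    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        exit 1
    fi

    # du -sk reports the total in 1 KiB blocks; keep only the number.
    size_kb=$(du -sk "$dir" | awk '{print $1}')
    # Count regular files anywhere under the directory.
    files=$(find "$dir" -type f | wc -l | tr -d ' ')

    echo "size_kb=$size_kb files=$files"
)
```

Run it as hints_summary, or hints_summary /path/to/hints, a few minutes apart. Two samples show whether the backlog is draining or still growing, which a single du cannot.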
How to diagnose it
- Confirm handoff is enabled with nodetool statushandoff, then measure the hints directory size with du. A growing directory when all nodes are expected healthy is abnormal.
- Identify the target replica using nodetool status. Look for a node that is DOWN or flapping between UP and DOWN.
- Correlate JMX metrics. TotalHintsInProgress indicates active delivery. HintsFailed or HintsTimedOut indicates the target is rejecting or missing replay.
- If the target recently recovered, check for a replay storm. On the target, watch nodetool tpstats for MutationStage pending tasks, monitor GC logs for long pauses, and use iostat to confirm write saturation.
- Compare the outage duration to max_hint_window_in_ms. The default is 3 hours. If the node was down longer than this window, hints are no longer stored and the missing data can only be recovered via repair.
- If delivery appears stuck despite an active target, inspect the hints directory for unconsumed files and review system logs for errors from the HintsService.
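The hint-window comparison in the steps above is simple arithmetic, sketched here as a shell helper. The function name and verdict strings are illustrative, and the default of 10800000 ms is the 3-hour default cited above, so pass your configured value if it differs.

```shell
# Decide whether hints could still cover an outage.
# Arguments: outage duration in seconds, hint window in milliseconds.
hint_window_verdict() (
    outage_s=$1
    window_ms=${2:-10800000}   # 3 hours, the default cited above

    outage_ms=$(( outage_s * 1000 ))
    if [ "$outage_ms" -gt "$window_ms" ]; then
        # Writes after the window closed were never hinted.
        echo "repair-required"
    else
        echo "hints-may-cover"
    fi
)

hint_window_verdict 7200     # 2-hour outage, prints: hints-may-cover
hint_window_verdict 14400    # 4-hour outage, prints: repair-required
```

A "hints-may-cover" verdict only means hints were being stored for the whole outage; it says nothing about whether delivery later succeeded.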
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| TotalHintsInProgress | Hints actively being delivered to a recovered node | Sustained > 0 for hours after recovery |
| HintsFailed / HintsTimedOut | Delivery failing because the target is overloaded or unreachable | Non-zero rate sustained > 5 minutes |
| Hints directory size | Disk consumption on coordinators that can feed disk exhaustion | Growing over hours or approaching volume limits |
| HintsDispatcher pending tasks | Backlog of segments waiting to be replayed | Pending > 0 sustained |
| DownEndpointCount | Root cause: a down replica is why hints are being generated | Any node DOWN > 5 minutes |
| Disk space available on coordinators | Hints compete with SSTables, commitlog, and snapshots | Free space < 30% |
| GC pause duration on recovered node | Replay storms can push a node into GC death spiral | Pauses > 2 seconds during replay |
| SSTable count on recovered node | Replayed hints create new SSTables that need compaction | Rapid increase during hint ingestion |
Fixes
Restore the down replica. The only durable fix is a stable, healthy target. Do not force a sick node back into the ring. If it is flapping due to GC pressure or disk saturation, resolve those root causes first. Treat long GC pauses as a possible GC death spiral and fix them before attempting replay. Only when the node is stable should you allow full-speed delivery.
Throttle the replay storm. Temporarily reduce hinted_handoff_throttle_in_kb to lower write pressure on the recovering node. This extends replay time but reduces the risk of re-failure. Lowering the throttle reduces the mutation rate arriving at the target, giving its flush writers and compaction executor time to keep up. Watch nodetool compactionstats on the target; if pending compactions begin to grow while replay is active, the throttle is still too high.
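One way to lower the throttle without a restart is nodetool sethintedhandoffthrottlekb, assuming your Cassandra version ships that subcommand. The sketch below only prints the commands unless DRY_RUN=0 is set; the function name, host addresses, and the value 256 are placeholders, not recommendations.

```shell
# Print (default) or run the throttle change on each coordinator.
# Set DRY_RUN=0 to execute; anything else only prints the commands.
set_hint_throttle() (
    throttle_kb=$1
    shift

    for host in "$@"; do
        cmd="nodetool -h $host sethintedhandoffthrottlekb $throttle_kb"
        if [ "${DRY_RUN:-1}" = "0" ]; then
            $cmd
        else
            echo "$cmd"
        fi
    done
)

# Hypothetical coordinator addresses
set_hint_throttle 256 10.0.0.11 10.0.0.12
```

A runtime change made this way does not touch cassandra.yaml, so restore the original value once replay completes to keep the setting identical across nodes.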
Truncate hints when the target cannot survive replay. If the node re-fails under replay, drop the backlog for that endpoint with nodetool truncatehints <target_ip> on the storing coordinators. Specify the exact endpoint to avoid dropping hints for other replicas. After truncation, run a full nodetool repair on the target to reconcile the missing data. Tradeoff: you lose the hinted mutations for that window, and repair is mandatory and I/O intensive.
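Because nodetool truncatehints with no endpoint drops every hint on the node it runs against, it is worth generating the commands from a helper that refuses an empty target. This is a sketch that only prints a plan for review; the function name and addresses are hypothetical.

```shell
# Print a truncate-then-repair plan for one target endpoint.
# Arguments: target endpoint, then the coordinators storing its hints.
plan_truncate() (
    target=$1
    shift

    # An empty endpoint would turn truncatehints into "drop everything".
    if [ -z "$target" ]; then
        echo "target endpoint required" >&2
        exit 1
    fi

    for coordinator in "$@"; do
        echo "nodetool -h $coordinator truncatehints $target"
    done
    # Repair is mandatory once the hinted mutations are discarded.
    echo "nodetool -h $target repair -full"
)

plan_truncate 10.0.0.9 10.0.0.11 10.0.0.12
```

Review the printed lines before running any of them; truncation cannot be undone.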
Repair after a long outage. If the node was down longer than max_hint_window_in_ms, writes made after the window closed were never hinted, so replay alone cannot bring the node back in sync. Schedule nodetool repair on the affected tables immediately. Run it during low-traffic hours and throttle streaming to avoid saturating the cluster. In a cluster with many tables, schedule the repair by keyspace to limit streaming concurrency. Tradeoff: repair competes for disk I/O and network bandwidth.
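Sequencing the repair one keyspace at a time can be sketched as a small plan generator. The function name, keyspace names, and the streaming cap of 100 are placeholders, and the units accepted by nodetool setstreamthroughput should be checked against your version's help output.

```shell
# Print a sequential, keyspace-by-keyspace repair plan with a streaming cap.
# Arguments: streaming cap, then one or more keyspace names.
plan_repair() (
    stream_cap=$1
    shift

    if [ "$#" -eq 0 ]; then
        echo "at least one keyspace is required" >&2
        exit 1
    fi

    # Cap streaming first so repair does not saturate the network.
    echo "nodetool setstreamthroughput $stream_cap"
    # One keyspace per command, run in order, limits concurrent streaming.
    for keyspace in "$@"; do
        echo "nodetool repair -full $keyspace"
    done
)

# Hypothetical keyspace names
plan_repair 100 orders inventory
```

Run the printed commands on the node that was down, one after the other, and wait for each repair to finish before starting the next.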
Recover from disk pressure on coordinators. If hints are consuming enough space to threaten disk exhaustion, verify whether the target node has been decommissioned or is permanently gone. For a node that will not return, truncate its hints and plan a repair. For transient backlogs, freeing snapshots or increasing storage may be necessary, but the real fix is delivering the hints or removing them and repairing.
Prevention
Monitor hints directory size with the same urgency as data directory growth. Alert on sustained growth when no maintenance window is active. Keep hinted_handoff_throttle_in_kb identical on every node to avoid creating bottlenecks. Monitor file count in the hints directory, not just byte size, because a very large number of small files increases overhead and can stress filesystem limits. Set max_hint_window_in_ms to a value aligned with your recovery time objective, and treat any outage approaching that window as a mandatory repair trigger. During rolling restarts or upgrades, maintain only one node down at a time to prevent cross-node hint amplification. Review your restart and decommission procedures. A rolling restart should never leave a node down long enough to accumulate gigabytes of hints. If you anticipate an extended outage, consider proactively increasing the hint throttle on the remaining nodes so that delivery is faster when the node returns, provided the target hardware can absorb it.
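Alerting on sustained growth needs two samples rather than one. A minimal sketch of the comparison, assuming you already collect the hints directory size in KiB at a fixed interval; the function name and the 20 percent threshold are illustrative only.

```shell
# Compare two hints-directory size samples and flag excessive growth.
# Arguments: previous size (KiB), current size (KiB), allowed growth in percent.
hints_growth_alert() (
    prev_kb=$1
    curr_kb=$2
    max_growth_pct=$3

    if [ "$prev_kb" -le 0 ]; then
        # No baseline: any hints at all are new since the last sample.
        if [ "$curr_kb" -gt 0 ]; then echo "ALERT new-backlog"; else echo "OK empty"; fi
        exit 0
    fi

    # Integer percentage change between the two samples.
    growth_pct=$(( (curr_kb - prev_kb) * 100 / prev_kb ))
    if [ "$growth_pct" -gt "$max_growth_pct" ]; then
        echo "ALERT growth_pct=$growth_pct"
    else
        echo "OK growth_pct=$growth_pct"
    fi
)

hints_growth_alert 1000 1500 20   # prints: ALERT growth_pct=50
hints_growth_alert 1000 1100 20   # prints: OK growth_pct=10
```

Suppress the alert during planned maintenance windows, when some hint growth is expected.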
How Netdata helps
- Correlate hints directory size with disk space utilization on coordinators to catch backlog-driven disk exhaustion before writes block.
- Track TotalHintsInProgress and delivery outcome rates (HintsSucceeded, HintsFailed, HintsTimedOut) to distinguish healthy replay from a failing storm.
- Overlay hint replay periods with GC pause duration and disk I/O wait on the recovered node to expose replay storms in real time.
- Alert on node liveness transitions and DownEndpointCount so you detect the root cause before hints begin to accumulate.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra clock skew: how NTP drift silently corrupts data
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills







