Cassandra hint overflow: max_hint_window expiry and silent data divergence

You restart a node after a four-hour outage. Gossip converges, nodetool status shows UN, and clients reconnect. Yet reads at consistency level ONE that land on that node return stale data, and no repair has run on it since it came back.

The problem is hint overflow: the outage lasted longer than max_hint_window_in_ms (default three hours), so coordinators stopped saving hints once the window expired. Writes accepted during the final hour of the outage are missing from that replica. Coordinator logs show no errors; write acknowledgments succeeded because other replicas responded. Only anti-entropy repair closes the gap. Without it, the gap persists on that replica indefinitely and surfaces as inconsistent reads or resurrected deletes.

What this means

Hinted handoff is a temporary durability mechanism. When a write’s target replica is unreachable, the coordinator stores a hint locally on disk at /var/lib/cassandra/hints/. When the replica returns, the coordinator replays those hints as mutations. Hints carry the original mutation timestamp, so replay is idempotent and will not overwrite newer data.
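To illustrate why replaying a hint cannot clobber newer data, here is a minimal Python sketch of timestamp-based last-write-wins reconciliation. It is a simplified model, not Cassandra's storage engine: the apply_mutation helper and the dict-backed replica are hypothetical, and tie-breaking on equal timestamps is ignored.

```python
def apply_mutation(store: dict, key: str, value: str, timestamp: int) -> None:
    """Keep only the value with the highest write timestamp (last-write-wins).

    Hypothetical helper: `store` maps key -> (value, timestamp).
    """
    current = store.get(key)
    if current is None or timestamp > current[1]:
        store[key] = (value, timestamp)


replica = {}
apply_mutation(replica, "user:42", "new-email", 200)  # live write after recovery
apply_mutation(replica, "user:42", "old-email", 100)  # older hint replayed later
apply_mutation(replica, "user:42", "old-email", 100)  # same hint replayed again
print(replica)  # {'user:42': ('new-email', 200)}
```

The older hint is discarded both times, which is the sense in which replay is idempotent and safe to retry.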

The mechanism is bounded by max_hint_window_in_ms (default: 10800000 ms, three hours; Cassandra 4.1 and later also accept the name max_hint_window with a duration value such as 3h). Once a node has been down longer than this window, coordinators stop creating hints for that replica. Writes continue to the remaining replicas, but nothing is queued for the down node. When the node recovers, coordinators replay whatever hints were stored during the window, but every write from the post-window period stays absent from that replica until you run a full anti-entropy repair.
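As a worked example of the window arithmetic, the following Python sketch computes how much of an outage went unhinted. The function name unhinted_gap is illustrative; the three-hour default comes from the setting described above.

```python
from datetime import timedelta


def unhinted_gap(outage: timedelta,
                 hint_window: timedelta = timedelta(hours=3)) -> timedelta:
    """Return the tail of the outage during which no hints were written."""
    return max(outage - hint_window, timedelta(0))


print(unhinted_gap(timedelta(hours=4)))  # 1:00:00
print(unhinted_gap(timedelta(hours=2)))  # 0:00:00
```

Any non-zero result means hint replay alone cannot make the replica consistent.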

The cluster appears healthy throughout. Coordinators do not fail writes as long as enough live replicas acknowledge them to satisfy the consistency level, so clients see normal acknowledgments. The missing data is only detectable when the under-replicated partition is read from the recovered node, or when a repair compares Merkle trees.

flowchart LR
    A[Replica DOWN] --> B[Coordinators store hints]
    B --> C{Outage > max_hint_window?}
    C -->|No| D[Replay all hints]
    D --> E[Replica consistent]
    C -->|Yes| F[Hint creation stops]
    F --> G[Post-window writes missing on replica]
    G --> H[Replica recovers]
    H --> I[Repair required]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Node down longer than max_hint_window_in_ms (default 3h) | Hints stop accumulating after the window expires; recovered replica misses post-window mutations | nodetool status for the current state, plus monitoring or system.log for how long the node was DOWN, compared against the window |
| Hinted handoff disabled | No hints stored even for brief outages; immediate divergence on any replica downtime | nodetool statushandoff |
| CASSANDRA-19495 (Cassandra 4.1.0-4.1.4) | Node recovers then fails again; no hints created on the second outage even though that outage is itself shorter than the window | nodetool version to confirm whether the fix is present |
| Coordinator disk saturated by hint backlog | Hints directory grows to tens of gigabytes; disk pressure triggers broader write-path degradation | df -h /var/lib/cassandra/hints/ or data root |
| Hint delivery throttle too low for backlog | Coordinators drain hints to the recovered node too slowly; replay stalls or drags on for hours | nodetool tpstats, HintsDispatcher Pending column |

Quick checks

# Is hinted handoff enabled?
nodetool statushandoff

# Current max hint window (Cassandra 4.0+)
nodetool getmaxhintwindow

# Hints directory size on coordinators
du -sh /var/lib/cassandra/hints/

# Hint delivery thread pool activity
nodetool tpstats | grep -E "Pool Name|HintsDispatcher"

# Down nodes
nodetool status

# Active streaming and hint delivery sessions
nodetool netstats

How to diagnose it

  1. Confirm the outage exceeded the hint window. nodetool status shows only the current state, so take the DOWN timestamp from your monitoring or from the "is now DOWN" entry in a peer node's system.log. Compare the duration against max_hint_window_in_ms (default 3h). If the node was down longer than the window, assume divergence.
  2. Verify hint accumulation stopped mid-outage. On coordinators, check the hints directory size with du -sh /var/lib/cassandra/hints/. If the directory stopped growing while the node was still DOWN, the window expired and hint creation ceased.
  3. Check the cumulative hint counter. Monitor JMX org.apache.cassandra.metrics:type=Storage,name=TotalHints Count. A rising counter during the outage indicates hint generation; a flat counter after the window indicates the cluster moved on without that replica.
  4. Inspect hint delivery after recovery. Run nodetool tpstats and look at HintsDispatcher. Active or Pending tasks indicate replay is in progress. Completed should increase over time.
  5. Validate repair status. Check system_distributed.repair_history, or nodetool repair_admin list (4.0+) if you use incremental repair, since that command reports incremental repair sessions. If no repair has run since the node recovered, the post-window gap is still present.
  6. Check for CASSANDRA-19495 if the node went down twice. In Cassandra 4.1.0 through 4.1.4, if a node recovers and then fails again, and the total elapsed time since the first downtime start exceeds max_hint_window, no hints are created on the second outage at all. If this applies, upgrade to 4.1.5+ or 5.0+ and run full repair.
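Step 4 can be scripted. The sketch below parses captured nodetool tpstats text and reports whether hint replay is still in progress. The SAMPLE text is illustrative rather than real output, and the parser assumes the Active, Pending and Completed columns follow the pool name in that order; verify against your version's layout before relying on it.

```python
from typing import Optional

# Illustrative sample only; real output has more pools and may differ in spacing.
SAMPLE = """\
Pool Name                    Active   Pending   Completed   Blocked  All time blocked
HintsDispatcher                   1         3          42         0                 0
MutationStage                     0         0       91234         0                 0
"""


def hints_dispatcher_stats(tpstats_output: str) -> Optional[dict]:
    """Extract Active/Pending/Completed for the HintsDispatcher pool."""
    for line in tpstats_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "HintsDispatcher":
            active, pending, completed = (int(x) for x in fields[1:4])
            return {"active": active, "pending": pending, "completed": completed}
    return None  # pool not present in the captured output


def replay_in_progress(stats: dict) -> bool:
    """Replay is considered ongoing while any task is active or queued."""
    return stats["active"] > 0 or stats["pending"] > 0


stats = hints_dispatcher_stats(SAMPLE)
print(stats, replay_in_progress(stats))
```

Sampling this periodically and checking that completed keeps rising gives the "Completed should increase over time" signal from step 4.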

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Storage.TotalHints Count | Cumulative hints written since restart; growth during a node outage confirms hint generation | Continuously increasing while a node is DOWN; sudden stop indicates window expiry |
| Hints directory size | Hints consume disk on coordinators; large backlogs risk disk exhaustion and I/O contention | du -sh /var/lib/cassandra/hints/ growing past 1 GB or trending upward for hours |
| HintsDispatcher pool (nodetool tpstats) | Tracks active hint delivery threads replaying to recovered nodes | Pending greater than 0 or Active greater than 0 sustained after recovery; Completed stalling |
| HintsService.HintsFailed / HintsTimedOut | Failed hint delivery means the target node is rejecting or missing the replay | Non-zero rate sustained for more than 5 minutes |
| Repair completion (repair_history) | The only mechanism that reconciles post-window divergence | No successful repair session after a node recovery from an extended outage |
| Client write timeouts / unavailables | May indicate the down node is affecting quorum or that coordinators are struggling | Sustained increase correlating with node DOWN events |
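The TotalHints signal can be checked mechanically. The sketch below takes periodic samples of the counter and returns the time of the last observed increase, which approximates when hint generation stopped. The sample format is an assumption about how your monitoring exports the metric. A flat counter can also mean that no writes targeted the down replica, and the counter resets on coordinator restart, so treat the result as a hint rather than proof.

```python
from typing import List, Optional, Tuple


def last_hint_increase(samples: List[Tuple[int, int]]) -> Optional[int]:
    """Return the timestamp of the last sample where the counter rose.

    `samples` is a time-ordered list of (epoch_seconds, total_hints).
    Returns None if the counter never increased across the samples.
    """
    last_increase = None
    for (_, prev_count), (ts, count) in zip(samples, samples[1:]):
        if count > prev_count:
            last_increase = ts
    return last_increase


# Counter rises until t=1200, then stays flat while the node is still DOWN.
samples = [(0, 100), (600, 160), (1200, 220), (1800, 220), (2400, 220)]
print(last_hint_increase(samples))  # 1200
```

If the returned timestamp is well before the node came back, compare it with the DOWN timestamp plus the hint window to confirm expiry.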

Fixes

Replica recovered but outage exceeded the window

Run full anti-entropy repair on the recovered node. In Cassandra 4.0+, run nodetool repair -full. If you use Reaper, schedule a full repair job. If you are uncertain whether the outage exceeded the window, run repair anyway; it is the only safe path. Do not rely on read repair to close the gap: it only touches partitions that clients happen to read, and only when the read consults more than one replica.

Hints are draining too slowly or overloading the recovered node

The default hinted_handoff_throttle_in_kb is 1024 KiB/s per delivery thread, and the stock cassandra.yaml notes that this rate is reduced in proportion to the number of nodes in the cluster, so each coordinator delivers more slowly as the cluster grows. For large backlogs, this can be too slow. In Cassandra 4.0+, increase the throttle dynamically with nodetool sethintedhandoffthrottlekb to drain faster, but monitor the target node for compaction and GC pressure. If the node struggles, reduce the throttle to prevent re-failure.
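A rough drain-time estimate helps decide whether the throttle needs raising. The sketch below assumes the scaling described in the comments of the stock cassandra.yaml, where the per-thread rate is divided by the number of other nodes in the cluster, and it assumes a single delivery thread per coordinator serves the recovered node. The function names and example figures are hypothetical; real drain time also depends on target load and hint compression.

```python
def effective_throttle_kib(throttle_kib: float, cluster_nodes: int) -> float:
    """Per-coordinator delivery rate in KiB/s after proportional reduction."""
    return throttle_kib / max(cluster_nodes - 1, 1)


def drain_hours(backlog_gib: float, throttle_kib: float = 1024,
                cluster_nodes: int = 6) -> float:
    """Hours for one coordinator to replay `backlog_gib` of hints."""
    rate_kib_s = effective_throttle_kib(throttle_kib, cluster_nodes)
    backlog_kib = backlog_gib * 1024 * 1024
    return backlog_kib / rate_kib_s / 3600


# 10 GiB of hints on one coordinator in a 6-node cluster, default throttle.
print(round(drain_hours(10), 1))                     # 14.2
# Same backlog with the throttle raised to 4096 KiB/s.
print(round(drain_hours(10, throttle_kib=4096), 1))  # 3.6
```

If the estimate runs to many hours, raising the throttle in steps while watching the target node is usually safer than one large jump.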

Coordinator disk pressure from hint accumulation

If the hints directory is filling the disk and the target node is still down, you have two options. The risky one is to discard the hints, either with nodetool truncatehints or by deleting the files from /var/lib/cassandra/hints/ and restarting the coordinator. This destroys the hints and requires full repair of the target node regardless. Only do this in a disk-exhaustion emergency. The safer path is to add storage or bring the target node back to drain hints normally.

Affected by CASSANDRA-19495

If you run Cassandra 4.1.0 through 4.1.4 and the node experienced a second outage, upgrade to 4.1.5 or 5.0+ before bringing the node back. After upgrade, run full repair.
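The version check can be automated across a fleet. This sketch classifies a release string, such as the value nodetool version prints after "ReleaseVersion:", against the affected range stated above (4.1.0 through 4.1.4). The range is taken from this article rather than verified independently, and the helper name is hypothetical.

```python
def affected_by_19495(release_version: str) -> bool:
    """True if the version falls in 4.1.0 through 4.1.4 (range per this article)."""
    core = release_version.strip().split("-")[0]  # drop suffixes like -SNAPSHOT
    nums = [int(p) for p in core.split(".")[:3]]
    nums += [0] * (3 - len(nums))                 # pad "4.1" to 4.1.0
    major, minor, patch = nums
    return (major, minor) == (4, 1) and 0 <= patch <= 4


for v in ("4.0.12", "4.1.3", "4.1.5", "5.0.2"):
    print(v, affected_by_19495(v))
```

Only 4.1.3 in this list is flagged; the others fall outside the stated range.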

Extending the window

If your mean time to recovery is consistently near or above three hours, raise max_hint_window_in_ms. In Cassandra 4.0+, use nodetool setmaxhintwindow to adjust the running JVM; that change lasts only until the next restart, so also set the value in cassandra.yaml if it should persist. A longer window increases disk usage on every coordinator and prolongs hint replay.

Prevention

  • Treat a node DOWN for more than 80% of max_hint_window_in_ms as a repair mandate. At the default window that is about 2 hours 24 minutes; if you cannot recover the node by then, schedule full repair before it rejoins or immediately after.
  • Monitor hints directory size on all coordinators. A runaway hint backlog indicates an extended outage or a very slow replica.
  • Automate repair scheduling with Reaper or equivalent to ensure repairs complete within gc_grace_seconds. Repair is the only safety net once hints expire.
  • Size the hint delivery throttle for your workload. After any outage longer than thirty minutes, check whether the default 1024 KiB/s throttle will drain hints before compaction or GC pressure builds. Adjust proactively with nodetool sethintedhandoffthrottlekb.
  • Do not disable hinted handoff. Running with hinted_handoff_enabled: false removes the safety net entirely and guarantees divergence on any replica downtime.
  • Upgrade past CASSANDRA-19495. If you run Cassandra 4.1.x, ensure you are on 4.1.5 or newer.
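The 80% rule from the first bullet can be expressed as a small policy function for alerting. This is a sketch with hypothetical names and return labels; the threshold fraction and default window mirror the values discussed above.

```python
from datetime import timedelta


def repair_action(down_for: timedelta,
                  hint_window: timedelta = timedelta(hours=3),
                  warn_fraction: float = 0.8) -> str:
    """Map a node's downtime to an operational action."""
    if down_for > hint_window:
        return "repair-required"    # hints no longer cover the whole outage
    if down_for >= hint_window * warn_fraction:
        return "schedule-repair"    # close to expiry, plan repair now
    return "hints-sufficient"       # replay should cover the outage


print(repair_action(timedelta(hours=1)))               # hints-sufficient
print(repair_action(timedelta(hours=2, minutes=30)))   # schedule-repair
print(repair_action(timedelta(hours=4)))               # repair-required
```

Wiring the schedule-repair state to an alert gives operators time to act before hint creation stops.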

How Netdata helps

  • Correlate Storage.TotalHints counter growth with node liveness transitions to identify when hint generation started and whether it stopped before recovery.
  • Monitor hints directory size growth rate alongside disk space alerts to catch coordinator disk exhaustion before accumulated hint files fill the volume.
  • Cross-reference HintsDispatcher activity with write latency spikes on recovered nodes to detect replay overload before it cascades into GC pressure.
  • Alert when a node remains DOWN for a sustained duration approaching max_hint_window_in_ms. This gives a proactive window to schedule repair before silent divergence begins.