Cassandra hint overflow: max_hint_window expiry and silent data divergence

You restart a node after a four-hour outage. Gossip converges, nodetool status shows UN, and clients reconnect. Reads at consistency level ONE return stale data. The node has never been repaired.

The problem is hint overflow: the outage lasted longer than max_hint_window_in_ms (default three hours), so coordinators stopped saving hints after the window expired. Writes accepted during the final hour of the outage are missing from that replica. Coordinator logs show no errors; write acknowledgments succeeded because other replicas responded. Only anti-entropy repair closes the gap. Without it, the missing data sits on that replica indefinitely, surfacing as inconsistent reads or resurrected deletes.

What this means

Hinted handoff is a temporary durability mechanism. When a write’s target replica is unreachable, the coordinator stores a hint locally on disk at /var/lib/cassandra/hints/. When the replica returns, the coordinator replays those hints as mutations. Hints carry the original mutation timestamp, so replay is idempotent and will not overwrite newer data.

The mechanism is bounded by max_hint_window_in_ms (default: 10800000 ms, three hours; Cassandra 4.1 and later also accept the name max_hint_window with a duration value such as 3h). Once a node has been down longer than this window, coordinators stop creating hints for that replica. Writes continue to the remaining replicas, but nothing is queued for the down node. When the node recovers, coordinators replay whatever hints they stored during the initial window, but every write from the post-window period stays absent from that replica until you run a full anti-entropy repair.
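As a rough illustration of the arithmetic, the sketch below computes how many seconds of an outage fall outside the hint window. The function name unhinted_gap_seconds is hypothetical, and the calculation assumes hint creation stops exactly when the window elapses.

```shell
# Hypothetical helper: seconds of an outage that no coordinator hinted for.
# $1 = outage length in seconds
# $2 = hint window in milliseconds (defaults to 10800000, i.e. three hours)
unhinted_gap_seconds() {
  gap_outage_s=$1
  gap_window_s=$(( ${2:-10800000} / 1000 ))
  if [ "$gap_outage_s" -gt "$gap_window_s" ]; then
    echo $(( gap_outage_s - gap_window_s ))
  else
    echo 0
  fi
}

# A four-hour outage against the default window leaves a one-hour gap.
unhinted_gap_seconds 14400   # prints 3600
```

Any non-zero result means the replica needs a full repair, however small the gap.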

The cluster appears healthy throughout. Coordinators do not fail writes as long as enough live replicas acknowledge to satisfy the consistency level, so clients see normal acknowledgments. The missing data becomes visible only when an affected partition is read from the recovered node, or when a repair compares Merkle trees.

flowchart LR
    A[Replica DOWN] --> B[Coordinators store hints]
    B --> C{Outage > max_hint_window?}
    C -->|No| D[Replay all hints]
    D --> E[Replica consistent]
    C -->|Yes| F[Hint creation stops]
    F --> G[Post-window writes lost]
    G --> H[Replica recovers]
    H --> I[Repair required]

Common causes

Cause	What it looks like	First thing to check
Node down longer than `max_hint_window_in_ms` (default 3h)	Hints stop accumulating after the window expires; recovered replica misses post-window mutations	`nodetool status` to confirm downtime duration versus the window
Hinted handoff disabled	No hints stored even for brief outages; immediate divergence on any replica downtime	`nodetool statushandoff`
CASSANDRA-19495 (Cassandra 4.1.0-4.1.4)	Node recovers then fails again; no hints created on the second outage even if that outage on its own is shorter than the window	`nodetool version` to confirm whether the fix is present
Coordinator disk saturated by hint backlog	Hints directory grows to tens of gigabytes; disk pressure triggers broader write-path degradation	`df -h /var/lib/cassandra/hints/` or data root
Hint delivery throttle too low for backlog	Recovered node cannot drain hints fast enough; replay stalls and the node may become overloaded	`nodetool tpstats` HintsDispatcher pending

Quick checks

# Is hinted handoff enabled?
nodetool statushandoff

# Current max hint window (Cassandra 4.0+)
nodetool getmaxhintwindow

# Hints directory size on coordinators
du -sh /var/lib/cassandra/hints/

# Hint delivery thread pool activity
nodetool tpstats | grep -A1 "HintsDispatcher"

# Down nodes
nodetool status

# Active streaming and hint delivery sessions
nodetool netstats

How to diagnose it

Confirm the outage exceeded the hint window. Check nodetool status and your monitoring for the DOWN timestamp. Compare the duration against max_hint_window_in_ms (default 3h). If the node was down longer than the window, assume divergence.
Verify hint accumulation stopped mid-outage. On coordinators, check the hints directory size with du -sh /var/lib/cassandra/hints/. If the directory stopped growing while the node was still DOWN, the window expired and hint creation ceased.
Check the cumulative hint counter. Monitor the Count attribute of the JMX MBean org.apache.cassandra.metrics:type=Storage,name=TotalHints. A rising counter during the outage indicates hint generation; a counter that goes flat while the node is still DOWN indicates the window expired and the cluster moved on without that replica.
Inspect hint delivery after recovery. Run nodetool tpstats and look at HintsDispatcher. Active or Pending tasks indicate replay is in progress. Completed should increase over time.
Validate repair status. Check system_distributed.repair_history or nodetool repair_admin list (4.0+). If no repair has run since the node recovered, the post-window gap is still present.
Check for CASSANDRA-19495 if the node went down twice. In Cassandra 4.1.0 through 4.1.4, if a node recovers and then fails again, and the total elapsed time since the first downtime start exceeds max_hint_window, no hints are created on the second outage at all. If this applies, upgrade to 4.1.5+ or 5.0+ and run full repair.
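The directory-size check in the steps above can be sketched as two small helpers. Both names (hints_dir_kb, hint_growth_state) are hypothetical, and the five-minute sampling interval in the usage comment is only an assumption.

```shell
# Hypothetical helper: size of a directory in KiB (du -sk is POSIX).
hints_dir_kb() {
  du -sk "$1" 2>/dev/null | awk '{ print $1 }'
}

# Hypothetical helper: compare two samples taken some minutes apart.
# $1 = earlier size in KiB, $2 = later size in KiB
hint_growth_state() {
  if [ "$2" -gt "$1" ]; then
    echo "growing"
  elif [ "$2" -lt "$1" ]; then
    echo "draining"
  else
    echo "flat"
  fi
}

# Usage on a coordinator while the replica is still DOWN:
#   before=$(hints_dir_kb /var/lib/cassandra/hints); sleep 300
#   after=$(hints_dir_kb /var/lib/cassandra/hints)
#   hint_growth_state "$before" "$after"
```

A "flat" result while the node is DOWN is consistent with window expiry, but it can also mean no writes targeted that replica during the interval, so confirm against the outage duration.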

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`Storage.TotalHints` Count	Cumulative hints written since restart; growth during a node outage confirms hint generation	Continuously increasing while a node is DOWN; sudden stop indicates window expiry
Hints directory size	Hints consume disk on coordinators; large backlogs risk disk exhaustion and I/O contention	`du -sh /var/lib/cassandra/hints/` growing past 1 GB or trending upward for hours
`HintsDispatcher` pool (`nodetool tpstats`)	Tracks active hint delivery threads replaying to recovered nodes	Pending greater than 0 or Active greater than 0 sustained after recovery; Completed stalling
`HintsService.HintsFailed` / `HintsTimedOut`	Failed hint delivery means the target node is rejecting or missing the replay	Non-zero rate sustained for more than 5 minutes
Repair completion (`repair_history`)	The only mechanism that reconciles post-window divergence	No successful repair session after a node recovery from an extended outage
Client write timeouts / unavailables	May indicate the down node is affecting quorum or that coordinators are struggling	Sustained increase correlating with node DOWN events
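To turn the HintsDispatcher row into something an alert script can consume, a filter along these lines may help. The function name is hypothetical, and the column order (Active, Pending, Completed) is assumed from the usual tpstats layout, so check the header line on your version.

```shell
# Hypothetical filter: reads `nodetool tpstats` output on stdin and prints
# the HintsDispatcher counters. Assumes columns 2-4 are Active, Pending,
# and Completed, as in the usual tpstats header.
hints_dispatcher_counters() {
  awk '$1 == "HintsDispatcher" {
    print "active=" $2 " pending=" $3 " completed=" $4
  }'
}

# Usage on a live node:
#   nodetool tpstats | hints_dispatcher_counters
```

Sampling this twice and comparing the completed value gives a simple stall check: pending above zero with no change in completed suggests replay is not progressing.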

Fixes

Replica recovered but outage exceeded the window

Run full anti-entropy repair on the recovered node. In Cassandra 4.0+, run nodetool repair -full. If you use Reaper, schedule a full repair job. If you are uncertain whether the outage exceeded the window, run repair anyway; it is the only safe path. Do not rely on read repair to close the gap.
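A minimal wrapper for that command might look like the following. The function name full_repair and the DRY_RUN switch are hypothetical conveniences for previewing the command before running it.

```shell
# Hypothetical wrapper around the full repair described above.
# $1 = keyspace to repair (omit to repair all keyspaces)
# Set DRY_RUN=1 to print the command instead of running it.
full_repair() {
  if [ -n "${1:-}" ]; then
    ${DRY_RUN:+echo} nodetool repair -full "$1"
  else
    ${DRY_RUN:+echo} nodetool repair -full
  fi
}

# Usage, run on the recovered node:
#   DRY_RUN=1 full_repair my_keyspace   # shows the command
#   full_repair my_keyspace             # runs it
```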

Hints are draining too slowly or overloading the recovered node

The default hinted_handoff_throttle_in_kb is 1024 KiB/s per delivery thread. For large backlogs, this can be too slow. In Cassandra 4.0+, increase the throttle dynamically with nodetool sethintedhandoffthrottlekb to drain faster, but monitor the target node for compaction and GC pressure. If the node struggles, reduce the throttle to prevent re-failure.
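To judge whether a throttle is adequate, a back-of-the-envelope drain estimate is useful. The helper below is a hypothetical sketch that divides the backlog by the configured rate. Cassandra may apply a lower effective rate than the configured value (the cassandra.yaml comments describe the throttle as reduced in proportion to cluster size), so treat the result as a best case.

```shell
# Hypothetical estimate: seconds needed to replay a hint backlog.
# $1 = backlog size in KiB (for example from `du -sk` on the hints directory)
# $2 = throttle in KiB/s per delivery thread (default 1024)
# $3 = number of delivery threads (default 1)
hint_drain_eta_seconds() {
  eta_rate=$(( ${2:-1024} * ${3:-1} ))
  # Round up so a partial second still counts.
  echo $(( ($1 + eta_rate - 1) / eta_rate ))
}

# 10 GiB of hints at the default throttle on one thread:
hint_drain_eta_seconds 10485760   # prints 10240 (about 2 h 50 min)
```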

Coordinator disk pressure from hint accumulation

If the hints directory is filling the disk and the target node is still down, you have two options. The risky one is to clear hints manually from /var/lib/cassandra/hints/ and restart the coordinator. This destroys the hints, so the target node then requires a full repair regardless of how long it was down. Only do this in a disk-exhaustion emergency. The safer path is to add storage or bring the target node back so the hints drain normally.
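Before choosing either path, it helps to know how close the volume is to full. The helper below is a hypothetical sketch built on df -P; the name volume_used_pct is not part of any Cassandra tooling.

```shell
# Hypothetical helper: percentage used on the filesystem holding a path.
# df -P prints one line per filesystem; column 5 is the capacity, e.g. "42%".
volume_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage on a coordinator:
#   volume_used_pct /var/lib/cassandra/hints
#
# If hints must be discarded, `nodetool truncatehints` drops them without
# deleting files by hand. The target replica still needs a full repair.
```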

Affected by CASSANDRA-19495

If you run Cassandra 4.1.0 through 4.1.4 and the node experienced a second outage, upgrade to 4.1.5 or 5.0+ before bringing the node back. After upgrade, run full repair.
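A quick way to test a version string against the range given above is sketched below. The function name is hypothetical, and the affected range (4.1.0 through 4.1.4) is taken from this guide rather than verified independently here.

```shell
# Hypothetical check: is a release inside the range this guide lists for
# CASSANDRA-19495 (4.1.0 through 4.1.4)?
# $1 = version string such as 4.1.3
affected_by_19495() {
  case "$1" in
    4.1.[0-4] | 4.1.[0-4]-*) echo "yes" ;;
    *) echo "no" ;;
  esac
}

# Usage (assumes `nodetool version` prints "ReleaseVersion: <version>"):
#   affected_by_19495 "$(nodetool version | awk '{ print $2 }')"
```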

Temporarily extending the window

If your mean time to recovery is consistently near or above three hours, raise max_hint_window_in_ms. In Cassandra 4.0+, use nodetool setmaxhintwindow to adjust the running JVM. A longer window increases disk usage on every coordinator and prolongs hint replay.
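Because the setting is expressed in milliseconds, a small conversion helper avoids miscounted zeros. The name hours_to_ms is hypothetical, and the six-hour value is only an example.

```shell
# Hypothetical helper: convert whole hours to the millisecond value that
# max_hint_window_in_ms and `nodetool setmaxhintwindow` expect.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

hours_to_ms 6   # prints 21600000

# Usage on each node (4.0+):
#   nodetool setmaxhintwindow "$(hours_to_ms 6)"
#   nodetool getmaxhintwindow
```

A value set through nodetool applies to the running process only; to keep it across restarts, set it in cassandra.yaml as well.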

Prevention

Treat a node DOWN for more than 80% of max_hint_window_in_ms as a repair mandate. With the default window that threshold is about 2 hours 24 minutes; if you cannot recover the node by then, schedule a full repair to run as soon as it rejoins.
Monitor hints directory size on all coordinators. A runaway hint backlog indicates an extended outage or a very slow replica.
Automate repair scheduling with Reaper or equivalent to ensure repairs complete within gc_grace_seconds. Repair is the only safety net once hints expire.
Size the hint delivery throttle for your workload. After any outage longer than thirty minutes, check whether the default 1024 KiB/s throttle will drain hints before compaction or GC pressure builds. Adjust proactively with nodetool sethintedhandoffthrottlekb.
Do not disable hinted handoff. Running with hinted_handoff_enabled: false removes the safety net entirely and guarantees divergence on any replica downtime.
Upgrade past CASSANDRA-19495. If you run Cassandra 4.1.x, ensure you are on 4.1.5 or newer.
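The 80% rule from the first item can be encoded as a simple policy check for runbooks or alert hooks. The function name and its output strings are hypothetical.

```shell
# Hypothetical policy check for the 80% rule above.
# $1 = seconds the node has been DOWN
# $2 = hint window in milliseconds (default 10800000)
repair_mandate() {
  rm_threshold_s=$(( ${2:-10800000} / 1000 * 80 / 100 ))
  if [ "$1" -gt "$rm_threshold_s" ]; then
    echo "repair-required"
  else
    echo "within-budget"
  fi
}

repair_mandate 9000   # prints repair-required (threshold is 8640 s)
```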

How Netdata helps

Correlate Storage.TotalHints counter growth with node liveness transitions to identify when hint generation started and whether it stopped before recovery.
Monitor hints directory size growth rate alongside disk space alerts to catch coordinator disk exhaustion before accumulated hint files fill the volume.
Cross-reference HintsDispatcher activity with write latency spikes on recovered nodes to detect replay overload before it cascades into GC pressure.
Alert when a node remains DOWN for a sustained duration approaching max_hint_window_in_ms. This gives a proactive window to schedule repair before silent divergence begins.

