Cassandra streaming failures: stalled bootstrap, decommission, and rebuild
Bootstrapping stuck in JOINING for six hours. Decommission streaming with zero byte progress after eight hours. Rebuild failed, leaving incomplete token ranges. These are streaming failures.
Streaming moves SSTables between nodes during topology changes and repair. When a session fails, the topology is left incomplete. When it stalls with no progress for more than thirty minutes, the node stays in a transitional state and may not handle traffic correctly. Root causes are usually network timeouts, source node bottlenecks, or configuration mismatches. Unlike client request timeouts, streaming failures do not always surface as explicit errors; a session can hang silently while the control channel stays open.
What this means
Cassandra uses streaming to transfer SSTable data during bootstrap, decommission, rebuild, and repair. In Cassandra 4.0+, Zero Copy Streaming (ZCS) transfers entire SSTable files directly off disk, bypassing object reification and reducing CPU overhead versus the partition-based path. A streaming session is a long-lived TCP connection with a control channel and one or more data channels.
A failed session is detected when the control channel closes or two consecutive keep-alive cycles fail. A stalled session is different: the connection is alive, but bytes stop moving. If no progress occurs for more than thirty minutes, the operation is frozen. The node remains in a transitional state (JOINING, LEAVING, or REBUILDING) and may not accept or redirect traffic correctly. There is no automatic abort timeout for a stalled bootstrap; the operator must choose between resume, restart, or a full wipe and retry.
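The distinction between a closed session and a stalled one can be made mechanical by comparing two observations of the same session. The following is a minimal Python sketch, not part of Cassandra: the `Sample` structure, the `classify` function, and the way samples are collected are hypothetical. Only the thirty-minute threshold comes from the text above.

```python
from dataclasses import dataclass

STALL_THRESHOLD_S = 30 * 60  # thirty minutes, the threshold used in this article


@dataclass
class Sample:
    """One observation of a streaming session (hypothetical structure)."""
    timestamp_s: float      # when the sample was taken, in seconds
    session_listed: bool    # is the session still shown as active?
    bytes_transferred: int  # cumulative bytes reported for the session


def classify(earlier: Sample, later: Sample) -> str:
    """Return 'closed', 'progressing', 'stalled', or 'inconclusive'."""
    if not later.session_listed:
        # The session is gone: it either failed or finished; check system.log.
        return "closed"
    if later.bytes_transferred > earlier.bytes_transferred:
        return "progressing"
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed > STALL_THRESHOLD_S:
        return "stalled"
    # No byte movement yet, but not enough time has passed to call it a stall.
    return "inconclusive"
```

A `closed` result does not by itself prove failure, since a session that completed normally also disappears; the log search in the quick checks settles which one it was.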
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Firewall or NAT idle timeout | Stream active but bytes frozen; session closes after ~10 min with keep-alive failure | Firewall idle timeout vs streaming keep-alive period |
| Throughput throttle too low for dataset size | Progress moves steadily but ETA is hours or days | stream_throughput_outbound_megabits_per_sec (default 200 Mbps) and data volume |
| Source node overload | Streaming pauses correlate with high compaction pending, GC pauses, or disk IO saturation on the source | nodetool tpstats, GC logs, and iostat on the source node |
| Zero Copy Streaming silently disabled | Cassandra 4.0+ falls back to slow partition-based streaming when internode_encryption is enabled; no warning in netstats | internode_encryption under server_encryption_options in cassandra.yaml |
| CDC or materialized view commitlog overhead | Streaming throughput far below network and disk capacity; high commitlog IO on target | Whether CDC or materialized views are enabled on the target keyspaces; streamed data may double-write to commitlog |
| Consistent bootstrap blocked by DOWN replica | Bootstrap fails immediately or hangs when a replica for the range is unreachable | nodetool status showing non-UN nodes |
| Deprecated timeout parameter ignored | Operator raised streaming_socket_timeout_in_ms but it has no effect in 5.0+; TCP timeout still 300s | Current streaming TCP user timeout parameter in cassandra.yaml |
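The firewall row in the table above reduces to a simple comparison. The sketch below is illustrative Python with a hypothetical function name. The 300-second keep-alive period is an assumption inferred from the "~10 min" figure and the two-cycle failure rule, so substitute the value from your own configuration.

```python
def firewall_outlives_keepalive(firewall_idle_timeout_s: int,
                                keepalive_period_s: int = 300,  # assumed value
                                cycles: int = 2) -> bool:
    """True if the firewall idle timeout is at least `cycles` keep-alive periods.

    A False result means the firewall may drop an idle streaming connection
    before the keep-alive mechanism has had a chance to detect a problem.
    """
    return firewall_idle_timeout_s >= keepalive_period_s * cycles
```

For example, a firewall with a 300-second idle timeout fails this check, while one with a one-hour timeout passes.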
Quick checks
# Check active streaming sessions and progress
nodetool netstats
# Check cluster topology and node states
nodetool status
# List active SSTable tasks including streaming (Cassandra 4.0+)
cqlsh -e "SELECT keyspace_name, table_name, task_id, kind, progress, total, unit FROM system_views.sstable_tasks;"
# Search logs for streaming errors or bootstrap state
grep -iE "streaming|bootstrap|decommission|Some data streaming failed" /var/log/cassandra/system.log
# Check streaming throughput trend via JMX
# TotalIncomingBytes / TotalOutgoingBytes should change between samples
# Verify streaming tunables in configuration
grep -E "stream_throughput_outbound|streaming_socket_timeout|streaming_keep_alive|tcp_user_timeout" /etc/cassandra/cassandra.yaml
# Check if internode encryption is enabled (disables ZCS)
grep -A 5 "server_encryption_options" /etc/cassandra/cassandra.yaml
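The last check can also be scripted when many nodes need auditing. The following Python sketch reads the text of cassandra.yaml and reports the internode_encryption setting without a YAML library. The function names are hypothetical, the parser handles only the simple indented layout of a stock configuration file, and the rule that any value other than none disables ZCS is taken from this article rather than verified against a specific Cassandra version.

```python
import re


def internode_encryption_mode(yaml_text: str) -> str:
    """Return the internode_encryption value, or 'none' if it is not set."""
    in_block = False
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing space
        if not line.strip():
            continue
        if not line.startswith((" ", "\t")):
            # A top-level key either opens or closes the block we care about.
            in_block = line.startswith("server_encryption_options:")
            continue
        if in_block:
            match = re.match(r"\s+internode_encryption:\s*(\S+)", line)
            if match:
                return match.group(1).strip("'\"")
    return "none"


def zcs_expected(mode: str) -> bool:
    """Per this article, ZCS is only expected when encryption is off."""
    return mode == "none"
```

A typical call would pass the file contents, for example `internode_encryption_mode(open("/etc/cassandra/cassandra.yaml").read())`.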
How to diagnose it
flowchart TD
A[Streaming stalled >30 min] --> B{nodetool netstats active?}
B -->|Yes| C{Bytes moving?}
B -->|No| D[Check logs for session close]
C -->|No| E[Check firewall vs keep-alive period]
C -->|Yes slowly| F[Check throttle and source node IO]
D --> G[Check for DOWN replicas]
E --> H[Fix network timeout]
F --> I[Raise throttle or relieve source]
G --> J[Resume or override bootstrap]
1. Confirm the operation and state. Run nodetool status to check if the node is JOINING, LEAVING, or NORMAL. Run nodetool netstats to confirm active streaming sessions and whether bytes are moving.
2. Determine if the stream is failed or stalled. A failed session disappears from nodetool netstats and usually leaves an error in system.log. A stalled session remains listed but shows no byte progress for more than thirty minutes. On Cassandra 4.0+, query system_views.sstable_tasks to confirm whether active streaming work is registered.
3. Check for network layer drops. Compare the firewall or NAT idle timeout against the streaming keep-alive period. If the firewall timeout is shorter than 600 seconds (two keep-alive cycles), the firewall drops the connection even when the stream is healthy but slow. This is a common cause of mysterious streaming interruptions.
4. Verify throughput expectations. Calculate the dataset size divided by the configured stream_throughput_outbound_megabits_per_sec (default 200 Mbps). If the result is many hours, the operation may not be stalled; it may simply be throttled. However, if nodetool netstats shows no byte increase over a 15-minute window while the session is open, the stream is stalled regardless of throttle.
5. Inspect the source node. Streaming competes with compaction, flushes, and client traffic on the source. Check the source node for pending compactions, blocked thread pools, long GC pauses, and disk IO wait (iostat -x). If the source is in a GC death spiral or compaction backlog, it cannot feed the stream.
6. Check for Zero Copy Streaming fallback. In Cassandra 4.0+, ZCS is enabled by default, but it is automatically disabled when internode_encryption is enabled. There is no warning in nodetool netstats. If encryption is on, expect slower partition-based streaming and plan longer windows.
7. Evaluate CDC and materialized view overhead. If CDC or materialized views are enabled, streamed data may be written through the commitlog on the target. This can significantly slow streaming compared to direct SSTable ingestion.
8. Review timeout configuration. In Cassandra 5.0 and later, streaming_socket_timeout_in_ms is absent. The equivalent control is the internode streaming TCP user timeout parameter. If an operator set the old parameter name, it has no effect. Similarly, the streaming keep-alive period controls the control-channel heartbeat.
9. Check for consistent bootstrap blocking. Since Cassandra 3.0, bootstrap requires all replicas for a token range to be available. If any replica is DOWN, the bootstrap will fail or hang. You can override this with -Dcassandra.consistent.rangemovement=false, but the joining node may miss data from the unavailable replica.
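The throughput expectation in the diagnosis steps is plain arithmetic: dataset size divided by the throttle. The Python sketch below works it through. The function name is hypothetical, and it assumes a megabit is 10^6 bits and that a single source node is the bottleneck, so treat the result as a rough lower bound rather than a prediction.

```python
def min_transfer_hours(dataset_gib: float, throttle_mbps: float = 200.0) -> float:
    """Lower-bound transfer time in hours for one throttled source.

    dataset_gib   -- data to stream, in GiB (2**30 bytes each)
    throttle_mbps -- outbound throttle in megabits per second
                     (200 is the default quoted in this article)
    """
    bits = dataset_gib * 1024**3 * 8
    seconds = bits / (throttle_mbps * 1_000_000)  # assumes 1 Mb = 10**6 bits
    return seconds / 3600


# Streaming 1 TiB from a single source at the default throttle takes
# roughly 12.2 hours, so a six-hour "stall" may just be a throttled stream.
hours = min_transfer_hours(1024)
print(f"{hours:.1f} h")
```

When several source nodes stream in parallel the wall-clock time can be shorter, which is why the byte counters in nodetool netstats remain the deciding signal.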
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| TotalIncomingBytes / TotalOutgoingBytes | Measures actual streaming throughput | Flat for > 30 min while session is active |
| system_views.sstable_tasks (4.0+) | Exposes active streaming tasks and byte progress | No rows for an ongoing bootstrap/rebuild |
| nodetool netstats | Shows per-session file and byte counts | Session listed but completed bytes unchanged |
| Streaming keep-alive failures | Control channel health | Session closes after ~10 min with no error |
| Disk IO await / %util on source and target | Streaming saturates sequential IO | await > 50 ms or %util > 90% sustained |
| GC pause duration on source node | Long pauses freeze streaming reads | Pauses > 2 s correlate with streaming stalls |
| Pending compactions on source node | Compaction debt steals disk IO from streaming | Pending tasks trending up during streaming |
| Node liveness (FailureDetector) | DOWN replica blocks consistent bootstrap | Any DN node during a bootstrap |
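Several of the warning signs in the table are numeric thresholds, so they can be evaluated together once the values have been collected. The Python sketch below is one way to do that. The metric key names and the function are hypothetical, the collection of the values is left out, and the thresholds are copied from the table rather than tuned for any particular hardware.

```python
from typing import Dict, List


def warning_signs(metrics: Dict[str, float]) -> List[str]:
    """Return the warning signs from the table that the given metrics trigger.

    Missing keys are treated as healthy (zero).
    """
    signs = []
    if metrics.get("minutes_without_byte_progress", 0) > 30:
        signs.append("streaming bytes flat for more than 30 min")
    if metrics.get("disk_await_ms", 0) > 50 or metrics.get("disk_util_pct", 0) > 90:
        signs.append("disk IO saturated")
    if metrics.get("max_gc_pause_s", 0) > 2:
        signs.append("long GC pauses on source")
    if metrics.get("down_nodes", 0) > 0:
        signs.append("replica down during bootstrap")
    return signs
```

An empty list does not prove the stream is healthy; it only means none of these particular thresholds were crossed in the sample provided.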
Fixes
Resume or restart a stalled bootstrap
On Cassandra 2.2 and later, a stalled bootstrap can often be resumed:
# Attempt to resume from saved state
nodetool bootstrap resume
If the node resumes but fails again, inspect the source node and network first. Restarting the node also resumes bootstrap automatically in most versions. If the bootstrap state is corrupted and you must start fresh, wipe the progress and restart:
# Wipe bootstrap state







