Cassandra node stuck in joining (UJ): bootstrap diagnosis
You add a node to the ring, run nodetool status, and see it stuck in UJ (Up/Joining) for hours. The cluster sees it in gossip, but it never transitions to UN (Up/Normal). Client drivers do not route traffic to it, so the expansion has not added usable capacity. Until the state changes, the node is a ghost member: visible to the ring and receiving writes for the token ranges it will own, but unable to serve reads for them.
Bootstrap streams the SSTables that belong to the new node’s assigned token ranges from current replica owners. The joining node serves no client reads until every byte is received, validated, and made available locally. If a stream stalls, fails silently, or the joining node is interrupted mid-transfer, it stays in UJ indefinitely. The most common root cause is not the joining node; it is the health and capacity of the source nodes serving the stream.
What this means
UJ means the node has passed gossip startup and token allocation, but has not finished ingesting its replica data. It holds a token assignment, so the cluster knows which ranges it will own, yet it cannot serve them. Streaming sessions are TCP-based, long-lived transfers of SSTable files. They are vulnerable to anything that interrupts sustained disk reads on the source side: disk saturation, GC pauses, corrupt files, or network blips.
Modern Cassandra versions persist partial bootstrap progress, so a restart can resume from the last checkpoint rather than starting over entirely. Even so, the stream will not complete until every source-side blockage is cleared. Because streaming reads raw SSTables from disk, it competes directly with compaction, client reads, and memtable flushes on the source node. When those consumers saturate the disk, the stream starves.
flowchart TD
A[Node enters UJ state] --> B[Select source replicas]
B --> C[Stream SSTables per range]
C --> D{Progress stalls?}
D -->|No| E[Continue until complete]
E --> F[Transition to UN]
D -->|Yes| G[Check source disk I/O]
G --> H[Check source GC and heap]
H --> I[Check for corrupt SSTables]
I --> J[Resume or restart join]
J --> C
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Source node disk saturation | nodetool netstats shows no byte increase between samples; source iostat shows high %util or await | iostat -x 1 on each source node |
| Source node GC pressure | Source node drops messages or flaps between UP and DOWN during the stream | GC logs and nodetool info heap usage on source |
| Corrupt SSTable on source | Stream fails repeatedly at the same file or token range; errors in system log | nodetool verify on the source replica |
| Too many parallel token ranges | High num_tokens creates many concurrent streams, overwhelming source heap or disk | nodetool netstats session count and num_tokens in cassandra.yaml |
| Thread pool saturation on source | nodetool tpstats on source shows pending or blocked MutationStage or ReadStage tasks | nodetool tpstats on source nodes |
| File descriptor exhaustion | Source or joining node cannot open new SSTable components | nodetool info FD count versus ulimit -n |
| Network or internode timeout | Session breaks with timeout errors in logs; network latency between nodes spikes | Connectivity and error logs on both sides |
Quick checks
# Confirm the node is still joining
nodetool status
# Inspect active streaming sessions and bytes received
nodetool netstats
# Check for backpressure in internal thread pools
nodetool tpstats
# Check source node disk saturation
iostat -x 1
# Review recent errors and timeouts
grep -iE "stream|timeout|corrupt" /var/log/cassandra/system.log
# Check heap and GC health on the source
nodetool info | grep -i "Heap Memory"
grep -i "pause" /var/log/cassandra/gc.log | tail -20
# Check compaction backlog on source
nodetool compactionstats
# Check file descriptor pressure
nodetool info | grep "File Descriptors"
ulimit -n
How to diagnose it
- Confirm UJ state with nodetool status and identify the streaming sources from nodetool netstats on the joining node.
- Sample nodetool netstats twice, spaced by a few minutes. If bytes received or files completed do not increment, the stream is stalled.
- Log into the source nodes identified in netstats. Run iostat -x 1 and check %util and await. If the disk backing the data directory is saturated, streaming reads are queued behind compaction and client traffic.
- On the source nodes, run nodetool tpstats. Sustained pending tasks in MutationStage, ReadStage, or CompactionExecutor mean the node is too loaded to serve streams promptly.
- Check the source node GC logs. Stop-the-world pauses longer than a few seconds can cause internode messaging timeouts, which tear down streaming sessions.
- Search system.log on both sides for CorruptSSTableException, FSError, or stream timeout messages. A single corrupt SSTable on a source replica can block an entire range transfer.
- If the joining node was restarted mid-bootstrap, check nodetool netstats for resumed progress. On versions that support resumable bootstrap, uncompleted ranges replay from the last checkpoint. Repeated restarts can still leave gaps or conflicting sessions. If sessions look inconsistent, restart the joining node only after all source nodes are stable.
- If the joining node has many concurrent sessions in nodetool netstats, check num_tokens in cassandra.yaml. A very high vnode count increases parallel stream count and can saturate source-side heap or I/O.
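The two-sample comparison in the second step can be scripted. The following is a minimal sketch, not an official tool: it compares two saved nodetool netstats snapshots and reports whether the progress lines changed. The function names `progress_lines` and `stream_state` are hypothetical, and the assumption that progress lines contain the word "Receiving" should be checked against the output of your Cassandra version.

```shell
#!/usr/bin/env bash
# Extract the lines that describe incoming stream progress from a snapshot.
# Assumption: those lines contain the word "Receiving" (case-insensitive).
progress_lines() {
  grep -i "receiving" "$1" || true
}

# Print "stalled" when the progress lines of both snapshots are identical,
# "progressing" otherwise. Two snapshots with no progress lines at all also
# count as stalled, because no stream is active.
stream_state() {
  local first="$1" second="$2"
  if [ "$(progress_lines "$first")" = "$(progress_lines "$second")" ]; then
    echo "stalled"
  else
    echo "progressing"
  fi
}

# Usage on the joining node (the 300-second gap is an example value):
#   nodetool netstats > /tmp/netstats.1
#   sleep 300
#   nodetool netstats > /tmp/netstats.2
#   stream_state /tmp/netstats.1 /tmp/netstats.2
```

Comparing whole lines instead of parsing byte counts keeps the sketch independent of the exact column layout, at the cost of not reporting how much progress was made.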
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Streaming incoming bytes | Direct measure of bootstrap progress | Flat for more than 30 minutes |
| Disk I/O await on source | High await means source disk cannot read SSTables fast enough | await greater than 50 ms sustained |
| GC pause duration on source | Long pauses break internode TCP sessions and stall streams | Pauses greater than 2 seconds |
| Pending compactions on source | Compaction competes for the same disk as streaming reads | Count trending upward during bootstrap |
| Thread pool pending tasks | Queued tasks mean the source cannot keep up with requests | Pending greater than 0 in MutationStage or ReadStage |
| Dropped messages on source | The node is shedding load; streams may be next | Any sustained non-zero rate |
| File descriptor usage | FD exhaustion prevents opening SSTable files | Usage greater than 80% of ulimit |
| Pending flushes | Write path saturation delays all disk operations | MemtableFlushWriter pending greater than 0 sustained |
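
The file descriptor row can also be checked directly from /proc on Linux. This is a sketch under stated assumptions: `fd_usage_pct` and `fd_usage_for_pid` are hypothetical helper names, the soft limit is taken from the "Max open files" line of /proc/PID/limits, and the caller must be allowed to read the process's fd directory (same user or root).

```shell
#!/usr/bin/env bash
# Integer percentage of used descriptors against the limit.
fd_usage_pct() {
  local used="$1" limit="$2"
  echo $(( used * 100 / limit ))
}

# Linux-only: count open descriptors of a process and compare them with
# its soft "Max open files" limit (fourth field of that line).
fd_usage_for_pid() {
  local pid="$1" used limit
  used=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  fd_usage_pct "$used" "$limit"
}

# Usage (assumes the Cassandra JVM command line contains "CassandraDaemon"):
#   fd_usage_for_pid "$(pgrep -f CassandraDaemon | head -1)"
```

A result above 80 matches the warning sign in the table. Reading the limit from the process itself avoids comparing against the ulimit of whatever shell you happen to be logged into.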
Fixes
Address source node disk saturation
If iostat shows the data device is saturated, streaming cannot proceed until I/O is freed. Pause non-critical repairs, reduce compaction throughput with nodetool setcompactionthroughput, or schedule the bootstrap during a lower-traffic window. Adding IOPS to the source node or moving the commitlog to a separate device are longer-term fixes. Do not raise streaming socket timeouts to mask the stall; the timeout is a symptom, and extending it without fixing the source disk will prolong the incident.
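As a sketch of the throttling step, the helper below prints the current compaction throughput cap and then lowers it on the node where it runs. `throttle_compaction` and the `NODETOOL` override are hypothetical names introduced here so the function can be dry-run; `getcompactionthroughput` and `setcompactionthroughput` are standard nodetool subcommands. The value 8 in the usage line is only an example, not a recommendation.

```shell
#!/usr/bin/env bash
# Allow a dry run by overriding the binary, e.g. NODETOOL=echo.
NODETOOL="${NODETOOL:-nodetool}"

# Show the current cap, then set a new one.
# The argument is a throughput in MB/s; 0 disables throttling.
throttle_compaction() {
  local mb_per_sec="$1"
  "$NODETOOL" getcompactionthroughput
  "$NODETOOL" setcompactionthroughput "$mb_per_sec"
}

# Usage on a saturated source node:
#   throttle_compaction 8
```

Printing the old value first gives you the number to restore once the joining node reaches UN.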
Reduce pressure on source nodes
If the source node is in a GC death spiral or its thread pools are saturated, stop adding load. Do not repeatedly trigger resume operations while the source is unhealthy; the stream will only fail again. Wait for the source node to return to a stable state, with no sustained pending tasks and normal GC, before allowing the join to continue.
Handle corrupt SSTables
If nodetool verify on a source node reports corruption, that SSTable must be replaced or repaired. nodetool verify reads every row and is expensive on large tables; run it during low traffic. If replication factor is greater than one, you can temporarily take the corrupt source node offline so the joining node streams from healthy replicas instead. After the new node joins, run a full repair on the affected range.
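Before paying for a full nodetool verify, the system log often already names the damaged file. The following is a sketch, assuming each CorruptSSTableException log line includes the absolute path of an SSTable component ending in .db; the function name `corrupt_sstables` is hypothetical, and the default log path is the one used in the quick checks.

```shell
#!/usr/bin/env bash
# List SSTable files named in CorruptSSTableException log lines,
# most frequently reported first, as "count path".
corrupt_sstables() {
  local log="${1:-/var/log/cassandra/system.log}"
  grep "CorruptSSTableException" "$log" \
    | grep -oE '/[^ :]+\.db' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}

# Usage on a source node:
#   corrupt_sstables
#   corrupt_sstables /path/to/another/system.log
```

A file that appears many times is a likely candidate for the stream that fails repeatedly at the same point.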
Resume or restart the joining node
If the stream failed but the joining node persists bootstrap state, a clean restart of the joining node will resume from the last checkpoint. Verify with nodetool netstats that progress continues. If the node does not support resumable bootstrap, you may need to wipe the data directory and restart the bootstrap from scratch after fixing the source-side issue.
WARNING: Wiping the data directory is destructive. Stop Cassandra, clear the data, commitlog, and saved_caches directories, and ensure the node is fully removed from the ring before you re-bootstrap.
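For the resumable case, a running node whose streams failed can also be asked to continue without a restart by using nodetool bootstrap resume. The guard below is a minimal sketch: `resume_if_joining` and the `NODETOOL` override are hypothetical names, and it assumes nodetool netstats reports the node state on a line beginning with "Mode:".

```shell
#!/usr/bin/env bash
# Allow substituting a stub for testing or dry runs.
NODETOOL="${NODETOOL:-nodetool}"

# Resume bootstrap only when the local node reports JOINING.
# Returns non-zero and resumes nothing in any other mode.
resume_if_joining() {
  local mode
  mode=$("$NODETOOL" netstats | awk '/^Mode:/ {print $2; exit}')
  if [ "$mode" = "JOINING" ]; then
    "$NODETOOL" bootstrap resume
  else
    echo "node mode is ${mode:-unknown}; not resuming" >&2
    return 1
  fi
}

# Usage on the joining node, after the source nodes are healthy again:
#   resume_if_joining
```

The mode check keeps the command from being fired at a node that has already reached UN or is not bootstrapping at all.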
Lower the parallel stream count
A high num_tokens value increases the number of token ranges and therefore the number of concurrent streaming sessions. If source nodes are running out of memory or saturating disk during the join, a lower num_tokens on the joining node can make large-node bootstraps stable. The change requires reconfiguring the joining node and re-bootstrapping it.
Run repair after recovery
Any bootstrap that was interrupted or resumed after a timeout may have missed writes, especially if hints were not delivered during the window. A node in UN is already serving traffic, so run nodetool repair as soon as it reaches UN to reconcile any inconsistencies.
Prevention
- Validate source node health before bootstrap. Check nodetool compactionstats, heap usage, and disk headroom.
- Schedule bootstrap during off-peak hours when source node I/O and GC are stable.
- Monitor source node disk latency and thread pools continuously during the operation.
- Keep num_tokens aligned with your heap and disk capacity. Very large nodes may need fewer vnodes.
- Verify SSTable integrity with nodetool verify before major topology changes.
- In containerized environments, use Pod Disruption Budgets to prevent mid-stream pod eviction.
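The first item can be turned into a simple gate before starting a bootstrap. This is a sketch only: `compaction_backlog_ok` and the `NODETOOL` override are hypothetical names, the threshold is yours to choose, and the parsing assumes nodetool compactionstats prints a line of the form "pending tasks: N".

```shell
#!/usr/bin/env bash
# Allow substituting a stub for testing or dry runs.
NODETOOL="${NODETOOL:-nodetool}"

# Succeed when the compaction backlog on this node is at or below the
# given maximum (default 0). Fail if the count cannot be read at all.
compaction_backlog_ok() {
  local max_pending="${1:-0}" pending
  pending=$("$NODETOOL" compactionstats | awk '/^pending tasks:/ {print $3; exit}')
  [ -n "$pending" ] && [ "$pending" -le "$max_pending" ]
}

# Usage on each prospective source node (the threshold 5 is an example):
#   compaction_backlog_ok 5 && echo "ok to stream from this node"
```

Treating an unreadable count as a failure is deliberate: a node that cannot answer compactionstats is not a node you want to stream from.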
How Netdata helps
- Correlate flat streaming throughput on the joining node with disk latency spikes on the source node in the same time window.
- Track GC pause duration on source nodes to preempt streaming timeouts before sessions break.
- Alert on sustained pending tasks in the MutationStage and CompactionExecutor during bootstrap operations.
- Monitor off-heap memory growth on source nodes to catch OOM risk from too many concurrent SSTable transfers.
- Surface file descriptor utilization per node to detect the approach of ulimit exhaustion during heavy streaming.