Cassandra node stuck in joining (UJ): bootstrap diagnosis

You add a node to the ring, run nodetool status, and see it stuck in UJ (Up/Joining) for hours. The cluster sees it in gossip, but it never transitions to UN (Up/Normal). Client drivers do not route traffic to it, so the expansion has not added usable capacity. Until the state changes, the node is a ghost member: visible to the ring and holding token ranges, but not yet serving reads for them.

Bootstrap streams the SSTables that belong to the new node’s assigned token ranges from current replica owners. The joining node serves no client reads until every byte is received, validated, and made available locally. If a stream stalls, fails silently, or the joining node is interrupted mid-transfer, it stays in UJ indefinitely. In most cases the root cause is not the joining node but the health and capacity of the source nodes serving the stream.

What this means

UJ means the node has passed gossip startup and token allocation, but has not finished ingesting its replica data. It holds a token assignment, so the cluster knows it owns ranges, yet it cannot serve them. Streaming sessions are TCP-based, long-lived transfers of SSTable files. They are vulnerable to anything that interrupts sustained disk reads on the source side: disk saturation, GC pauses, corrupt files, or network blips.

Modern Cassandra versions persist partial bootstrap progress, so a restart can resume from the last checkpoint rather than starting over entirely. Even so, the stream will not complete until every source-side blockage is cleared. Because streaming reads raw SSTables from disk, it competes directly with compaction, client reads, and memtable flushes on the source node. When those consumers saturate the disk, the stream starves.

flowchart TD
    A[Node enters UJ state] --> B[Select source replicas]
    B --> C[Stream SSTables per range]
    C --> D{Progress stalls?}
    D -->|No| E[Continue until complete]
    E --> F[Transition to UN]
    D -->|Yes| G[Check source disk I/O]
    G --> H[Check source GC and heap]
    H --> I[Check for corrupt SSTables]
    I --> J[Resume or restart join]
    J --> C

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Source node disk saturation | nodetool netstats shows no byte increase between samples; source iostat shows high %util or await | iostat -x 1 on each source node |
| Source node GC pressure | Source node drops messages or flaps between UP and DOWN during the stream | GC logs and nodetool info heap usage on source |
| Corrupt SSTable on source | Stream fails repeatedly at the same file or token range; errors in system log | nodetool verify on the source replica |
| Too many parallel token ranges | High num_tokens creates many concurrent streams, overwhelming source heap or disk | nodetool netstats session count and num_tokens in cassandra.yaml |
| Thread pool saturation on source | nodetool tpstats on source shows pending or blocked MutationStage or ReadStage tasks | nodetool tpstats on source nodes |
| File descriptor exhaustion | Source or joining node cannot open new SSTable components | Open FD count of the Cassandra process versus its open-files limit |
| Network or internode timeout | Session breaks with timeout errors in logs; network latency between nodes spikes | Connectivity and error logs on both sides |

Quick checks

# Confirm the node is still joining
nodetool status

# Inspect active streaming sessions and bytes received
nodetool netstats

# Check for backpressure in internal thread pools
nodetool tpstats

# Check source node disk saturation
iostat -x 1

# Review recent errors and timeouts
grep -iE "stream|timeout|corrupt" /var/log/cassandra/system.log

# Check heap and GC health on the source
nodetool info | grep -i "Heap Memory"
grep -i "pause" /var/log/cassandra/gc.log | tail -20

# Check compaction backlog on source
nodetool compactionstats

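# Check file descriptor pressure for the Cassandra process
# (Linux; assumes one Cassandra JVM per host; run as the cassandra user or root)
CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)
ls /proc/"$CASSANDRA_PID"/fd | wc -l
grep "Max open files" /proc/"$CASSANDRA_PID"/limits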

How to diagnose it

  1. Confirm UJ state with nodetool status and identify the streaming sources from nodetool netstats on the joining node.
  2. Sample nodetool netstats twice, a few minutes apart. If neither bytes received nor files completed increases between samples, the stream is stalled.
  3. Log into the source nodes identified in netstats. Run iostat -x 1 and check %util and await. If the disk backing the data directory is saturated, streaming reads are queued behind compaction and client traffic.
  4. On the source nodes, run nodetool tpstats. Sustained pending tasks in MutationStage, ReadStage, or CompactionExecutor mean the node is too loaded to serve streams promptly.
  5. Check the source node GC logs. Stop-the-world pauses longer than a few seconds can cause internode messaging timeouts, which tear down streaming sessions.
  6. Search system.log on both sides for CorruptSSTableException, FSError, or stream timeout messages. A single corrupt SSTable on a source replica can block an entire range transfer.
  7. If the joining node was restarted mid-bootstrap, check nodetool netstats for resumed progress. On versions that support resumable bootstrap, uncompleted ranges replay from the last checkpoint. Repeated restarts can still leave gaps or conflicting sessions. If sessions look inconsistent, restart the joining node only after all source nodes are stable.
  8. If the joining node has many concurrent sessions in nodetool netstats, check num_tokens in cassandra.yaml. A very high vnode count increases parallel stream count and can saturate source-side heap or I/O.
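
The two-sample comparison in step 2 can be scripted. The following is a minimal sketch, not a definitive tool: the function name stream_progress and the file paths are hypothetical, and it assumes that progress lines in nodetool netstats contain the word "Receiving" or "received", which you should confirm against the output of your Cassandra version.

```shell
#!/usr/bin/env bash
# Classify streaming progress from two saved `nodetool netstats` outputs.
# Assumption: progress lines contain "Receiving" or "received"; verify this
# against the output of your Cassandra version before relying on it.
# Prints one of: progressing, stalled, no-sessions.
stream_progress() {
  local before after
  before=$(grep -iE 'receiv' "$1" || true)
  after=$(grep -iE 'receiv' "$2" || true)
  if [ -z "$after" ]; then
    echo "no-sessions"
  elif [ "$before" = "$after" ]; then
    echo "stalled"
  else
    echo "progressing"
  fi
}

# Typical use on the joining node (not run here):
#   nodetool netstats > /tmp/netstats.1
#   sleep 300
#   nodetool netstats > /tmp/netstats.2
#   stream_progress /tmp/netstats.1 /tmp/netstats.2
```

Comparing the progress lines as text avoids parsing byte counts, whose layout differs between versions. A "stalled" result is only a prompt to start the source-side checks, not proof of failure.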

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Streaming incoming bytes | Direct measure of bootstrap progress | Flat for more than 30 minutes |
| Disk I/O await on source | High await means source disk cannot read SSTables fast enough | await greater than 50 ms sustained |
| GC pause duration on source | Long pauses break internode TCP sessions and stall streams | Pauses greater than 2 seconds |
| Pending compactions on source | Compaction competes for the same disk as streaming reads | Count trending upward during bootstrap |
| Thread pool pending tasks | Queued tasks mean the source cannot keep up with requests | Pending greater than 0 in MutationStage or ReadStage |
| Dropped messages on source | The node is shedding load; streams may be next | Any sustained non-zero rate |
| File descriptor usage | FD exhaustion prevents opening SSTable files | Usage greater than 80% of ulimit |
| Pending flushes | Write path saturation delays all disk operations | MemtableFlushWriter pending greater than 0 sustained |
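
The disk thresholds above can be checked mechanically against iostat output. This is a sketch under stated assumptions: the helper name over_threshold is hypothetical, and because column names and positions differ between sysstat versions (for example await versus r_await and w_await), the column is looked up by its header name rather than by position.

```shell
#!/usr/bin/env bash
# Read `iostat -x` output on stdin and print devices whose value in the
# named column exceeds a limit. The column index is taken from the most
# recent "Device" header line, so it adapts to the installed sysstat version.
over_threshold() {
  local column="$1" limit="$2"
  awk -v column="$column" -v limit="$limit" '
    /^Device/ { col = 0; for (i = 1; i <= NF; i++) if ($i == column) col = i; next }
    col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
  '
}

# Typical use on a source node (not run here):
#   iostat -x 1 3 | over_threshold r_await 50
#   iostat -x 1 3 | over_threshold %util 90   # 90 is an illustrative limit
```

A single sample over the limit is not a finding; the warning signs in the table refer to sustained values, so look for the same device appearing in consecutive reports.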

Fixes

Address source node disk saturation

If iostat shows the data device is saturated, streaming cannot proceed until I/O is freed. Pause non-critical repairs, reduce compaction throughput with nodetool setcompactionthroughput, or schedule the bootstrap during a lower-traffic window. Adding IOPS to the source node or moving the commitlog to a separate device are longer-term fixes. Do not raise streaming socket timeouts to mask the stall; the timeout is a symptom, and extending it without fixing the source disk will prolong the incident.
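
Throttling compaction on several source nodes at once might look like the sketch below. The function name set_compaction_cap, the host addresses, and the throughput values are illustrative assumptions; record each node's current setting with nodetool getcompactionthroughput first so you know what to restore.

```shell
#!/usr/bin/env bash
# Apply one compaction throughput cap (MB/s) to several source nodes.
# NODETOOL can be overridden, e.g. NODETOOL="echo nodetool" for a dry run.
NODETOOL="${NODETOOL:-nodetool}"

set_compaction_cap() {
  local mbps="$1"; shift
  local host
  for host in "$@"; do
    # -h selects the node to contact over JMX; remote JMX access is assumed.
    $NODETOOL -h "$host" setcompactionthroughput "$mbps"
  done
}

# Example (hypothetical addresses and values):
#   set_compaction_cap 8 10.0.0.11 10.0.0.12     # throttle during bootstrap
#   set_compaction_cap 64 10.0.0.11 10.0.0.12    # restore the recorded value
```

The setting is applied at runtime and is not persisted, so a restarted node falls back to the value in cassandra.yaml. Note that a value of 0 disables throttling entirely rather than stopping compaction.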

Reduce pressure on source nodes

If the source node is in a GC death spiral or thread pool saturation, stop increasing load. Do not repeatedly trigger resume operations while the source is unhealthy; the stream will only fail again. Wait for the source node to return to a stable state with zero pending tasks and normal GC before allowing the join to continue.
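
One way to check the "zero pending tasks" condition is to filter nodetool tpstats for the pools that matter during streaming. This is a rough sketch: the function name pending_pools is hypothetical, and it assumes the usual tpstats layout in which each pool row lists the pool name, then Active, then Pending.

```shell
#!/usr/bin/env bash
# Read `nodetool tpstats` output on stdin and print "<pool> <pending>" for
# each watched pool with a non-zero Pending count (assumed to be column 3).
# No output means none of the watched pools has queued work.
pending_pools() {
  awk '$1 ~ /^(MutationStage|ReadStage|CompactionExecutor|MemtableFlushWriter)$/ && $3 + 0 > 0 { print $1, $3 }'
}

# Typical use on a source node (not run here):
#   nodetool tpstats | pending_pools
```

Run it a few times over several minutes before letting the join continue; a single empty result can simply fall between bursts.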

Handle corrupt SSTables

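If nodetool verify on a source node reports corruption, that SSTable must be repaired or replaced before its range can stream. nodetool verify reads each SSTable in full to check its checksums, and with the extended option (-e) it also validates every cell, so it is expensive on large tables; run it during low traffic. nodetool scrub can rebuild a damaged SSTable by discarding the data it cannot read. If replication factor is greater than one, you can temporarily take the corrupt source node offline so the joining node streams from healthy replicas instead. With consistent range movement enabled (the default), bootstrap may refuse to proceed while a required source is down; starting the joining node with -Dcassandra.consistent.rangemovement=false lets it stream from another replica at the cost of possible inconsistency. After the new node joins, run a full repair on the affected range.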
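
Before running an expensive verify, a quick tally of corruption-related errors in system.log shows which source nodes deserve attention. The sketch below uses a hypothetical helper name, corruption_summary, and matches only exception class names, since the surrounding message text varies by version.

```shell
#!/usr/bin/env bash
# Read a Cassandra system.log on stdin and print "<exception>=<count>" for
# each corruption or filesystem error class found. Prints nothing if none.
corruption_summary() {
  grep -oE 'CorruptSSTableException|FSReadError|FSWriteError|FSError' \
    | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Typical use on each source node (not run here):
#   corruption_summary < /var/log/cassandra/system.log
```

A non-zero count for CorruptSSTableException on one node only is a strong hint about which replica to verify first.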

Resume or restart the joining node

If the stream failed but the joining node persists bootstrap state, you can resume from the last checkpoint either by running nodetool bootstrap resume on the joining node while it is still up, or by restarting it cleanly. Verify with nodetool netstats that progress continues. If your Cassandra version does not support resumable bootstrap, you may need to wipe the data directory and restart the bootstrap from scratch after fixing the source-side issue.

WARNING: Wiping the data directory is destructive. Stop Cassandra, clear the data, commitlog, and saved_caches directories, and ensure the node is fully removed from the ring before you re-bootstrap.

Lower the parallel stream count

A high num_tokens value increases the number of token ranges and therefore the number of concurrent streaming sessions. If source nodes are running out of memory or saturating disk during the join, a lower num_tokens on the joining node can make large-node bootstraps stable. The value cannot be changed on a node that already holds data, so lowering it means editing cassandra.yaml and re-bootstrapping the joining node from scratch.

Run repair after recovery

Any bootstrap that was interrupted or resumed after a timeout may have missed writes, especially if hints were not delivered during the window. Once the node reaches UN it starts serving reads immediately, so run nodetool repair on it as soon as possible to reconcile any inconsistencies.

Prevention

  • Validate source node health before bootstrap. Check nodetool compactionstats, heap usage, and disk headroom.
  • Schedule bootstrap during off-peak hours when source node I/O and GC are stable.
  • Monitor source node disk latency and thread pools continuously during the operation.
  • Keep num_tokens aligned with your heap and disk capacity. Very large nodes may need fewer vnodes.
  • Verify SSTable integrity with nodetool verify before major topology changes.
  • In containerized environments, use Pod Disruption Budgets to prevent mid-stream pod eviction.

How Netdata helps

  • Correlate flat streaming throughput on the joining node with disk latency spikes on the source node in the same time window.
  • Track GC pause duration on source nodes to preempt streaming timeouts before sessions break.
  • Alert on sustained pending tasks in the MutationStage and CompactionExecutor during bootstrap operations.
  • Monitor off-heap memory growth on source nodes to catch OOM risk from too many concurrent SSTable transfers.
  • Surface file descriptor utilization per node to detect the approach of ulimit exhaustion during heavy streaming.