Cassandra adding and removing nodes safely: vnodes, tokens, and cleanup

Expanding or contracting a Cassandra cluster redistributes token ownership, triggers bulk streaming, and leaves stale data that cleanup must reclaim. Skip cleanup after a bootstrap, run it while a node is joining, or misconfigure token allocation, and you waste disk space, create hot spots, and risk silent inconsistency.

This guide covers adding and removing nodes with virtual nodes (vnodes) enabled, the default for Cassandra 3.x through 5.x. Commands and paths assume standard packaged Cassandra on Linux. Perform topology changes one node at a time.

Before any operation, verify cluster health. All nodes must show status UN in nodetool status, schema version must agree in nodetool describecluster, and existing nodes need disk headroom for both streaming and cleanup.

How the token ring drives topology changes

Cassandra distributes data using a partitioner that maps partition keys to tokens on a contiguous ring. Each node owns the ranges between itself and its predecessor. With virtual nodes (vnodes), a single physical node claims many small ranges by holding multiple tokens, controlled by num_tokens in cassandra.yaml. The open-source default remains 256 through 5.0, though production guidance recommends 16 because higher counts degrade availability earlier and increase streaming overhead during topology changes.

When a new node joins, it claims tokens and streams the corresponding data from current owners. The allocator that picks these tokens determines whether the new node receives a balanced share. The default random allocator can produce skew, especially with low vnode counts. Since Cassandra 3.0, you can set allocate_tokens_for_keyspace to a keyspace name so the allocator respects replication factor and places tokens evenly. In Cassandra 4.0 and later, allocate_tokens_for_local_replication_factor supersedes it . Use the replica-aware allocator whenever possible. With the random allocator and num_tokens of 8 or 16, expect roughly 10% load variance.

flowchart LR
    A[Client write] --> B[Coordinator]
    B --> C[Hash to token]
    C --> D[Token ring]
    D --> E[Node 1 ranges]
    D --> F[Node 2 ranges]
    D --> G[Node 3 ranges]
    H[Bootstrap] --> I[New node claims tokens]
    I --> J[Stream from owners]
    K[Cleanup] --> L[Drop stale SSTables]

Prerequisites

  • One node at a time. Never bootstrap or decommission multiple nodes concurrently. Parallel streaming saturates network and disk I/O on source nodes and can cause token range collisions.
  • Healthy cluster. All existing nodes must be UN. Resolve any DOWN nodes, schema disagreement, or pending repairs first.
  • Disk headroom. Cleanup and streaming both require temporary space. With SizeTieredCompactionStrategy, keep at least 50% free disk space to survive streaming bursts and cleanup’s temporary SSTable duplication. For LeveledCompactionStrategy or TimeWindowCompactionStrategy, keep at least 30% free.
  • Network headroom. Streaming is bursty and competes with client traffic. Ensure cross-rack or cross-DC sustained utilization is below 50% before starting.
  • Schema agreement. Run nodetool describecluster and confirm a single schema version across all nodes. Topology changes block or confuse schema migrations.

Adding a node

  1. Configure the new node. In cassandra.yaml, set the seed list to match the existing cluster. Set num_tokens to match the operational standard of the cluster (open-source default is 256, but 16 is recommended). On Cassandra 4.0+, set allocate_tokens_for_local_replication_factor to your keyspace replication factor. On 3.x, use allocate_tokens_for_keyspace with a representative keyspace name. Do not use both settings simultaneously.

  2. Start Cassandra. The node automatically enters the JOINING state (UJ) and begins bootstrapping. It calculates its tokens, announces them via gossip, and streams the relevant ranges from current owners. Consistent bootstrap requires all replica ranges to be available. If a replica is down, bootstrap fails unless you override with -Dcassandra.consistent.rangemovement=false, which risks data loss and should only be used in emergencies.

  3. Monitor streaming progress. On the joining node and source nodes, run:

    nodetool netstats
    

    Wait until streaming completes and the node transitions to UN (Up/Normal). Depending on data volume, network speed, and compaction strategy, this can take minutes to hours. During this time, expect elevated disk I/O and network utilization on source nodes.

  4. Handle bootstrap failure. If bootstrap stalls or fails, Cassandra 2.2+ supports resuming:

    nodetool bootstrap resume
    

    To discard partial progress and restart fresh, stop the node and start it with -Dcassandra.reset_bootstrap_progress=true.

  5. Run cleanup on existing nodes. After the new node is UN, run the following on every pre-existing node, one node at a time:

    nodetool cleanup
    

    Without this step, old nodes continue to store data they no longer own, wasting disk space and distorting load metrics. Do not run cleanup until the joining node is fully UN; running it during bootstrap can cause data inconsistency because token ownership is still in flux.

  6. Verify space reclamation. Check disk usage before and after cleanup. Cleanup rewrites SSTables and creates new files before unlinking old ones, so transient disk usage can spike up to the size of the largest SSTable. Pre-existing snapshots also complicate this: they hold hard links to old SSTables, so space may not drop immediately until snapshots expire or are cleared.

Removing a node

Graceful removal with decommission

Use this when the node is healthy but being retired.

  1. Run decommission on the node itself:
    nodetool decommission
    
  2. Monitor streaming with nodetool netstats. The node transitions to UL (Up/Leaving) and then disappears from the ring when complete. Do not restart the node during decommission.
  3. No cleanup is required on remaining nodes. Decommission streams the data to its successors; remaining nodes do not hold stale token ranges.

Removing a dead node

Use this when the node is permanently offline and cannot be recovered.

  1. Confirm the node is truly dead. Check from multiple nodes with nodetool status to rule out a network partition. If the node is actually alive and you run removenode, you risk data loss.
  2. Get the host ID of the dead node from nodetool status.
  3. Run removenode from any surviving node:
    nodetool removenode <host_id>
    
  4. If replacing the node with the same IP, start the replacement with:
    -Dcassandra.replace_address_first_boot=<dead_node_ip>
    
  5. Run repair after replacement if the old node was down longer than max_hint_window_in_ms (default 3 hours) or if the replacement process exceeded that window. Without repair, the new node will have permanent data gaps because hints expired while the old node was offline.

Verifying the operation

After adding or removing a node, confirm the cluster state before declaring the operation complete.

  • Ring membership. nodetool status must show all expected nodes as UN with balanced load percentages. A newly added node should show a load value consistent with its token share. A removed node must no longer appear.
  • Schema agreement. nodetool describecluster must show a single schema version.
  • Streaming idle. nodetool netstats must show no active streams.
  • Disk space reconciled. After adding a node and running cleanup, nodetool info and OS-level df should show reduced load on pre-existing nodes.
  • Error logs. Search system.log for FSError, CorruptSSTableException, or streaming failures during the operation.

Common pitfalls

  • Running cleanup while a node is still joining. Cleanup rewrites SSTables based on current token ownership. If the joining node has not yet fully claimed its ranges, cleanup may discard data that still belongs in the ring. Always wait for UN status.
  • Insufficient disk space for cleanup. Cleanup temporarily doubles disk usage proportional to the largest SSTable. On nodes running above 50% disk utilization with STCS, cleanup can exhaust space and crash the node, potentially leaving behind incomplete SSTables. Maintain adequate headroom before expanding.
  • Using the default random token allocator with low vnodes. With num_tokens set to 8 or 16, random allocation can produce approximately 10% load skew. Set allocate_tokens_for_local_replication_factor (4.0+) or allocate_tokens_for_keyspace (3.x) to ensure even distribution.
  • Adding multiple nodes at once. Concurrent bootstraps create overlapping streaming sessions that saturate disk and network, and can cause token range collisions. Add nodes sequentially, waiting for UN status and cleanup between each.
  • Skipping repair after replacing a long-down node. If the node was unavailable longer than max_hint_window_in_ms, hints expired and the replacement will not receive the missing writes. Only nodetool repair can backfill the gap.
  • Mixing num_tokens values across the cluster. All nodes in the same datacenter must use the same num_tokens value. Changing from 256 to 16 (or vice versa) on a live cluster is not supported and requires a new datacenter transfer or full node rebuild.
  • Confusing snapshot growth with cleanup failure. Snapshots use hard links. After cleanup rewrites SSTables, snapshot-linked old files continue consuming space. If disk usage does not drop after cleanup, check nodetool listsnapshots and clear verified backups before panicking.

Signals to monitor during topology changes

SignalWhy it mattersWarning sign
Streaming progressStalled or failed streaming leaves the topology change incompletenodetool netstats shows no progress for > 30 minutes
Node gossip stateBootstrap and decommission transition through Joining/Leaving statesNode stuck in UJ or UL for longer than the data volume justifies
Disk spaceStreaming writes new SSTables and cleanup rewrites existing onesAvailable space drops below 30% during the operation
Client request latencyStreaming and cleanup compete for disk I/O and network bandwidthP99 read/write latency exceeds 3x baseline for > 5 minutes
Pending compactionsCleanup and anti-compaction generate new compaction tasksnodetool compactionstats shows pending tasks increasing monotonically for > 2 hours
Dropped messagesOverload from streaming can cause replicas to drop mutations or readsnodetool tpstats shows non-zero dropped MUTATION or READ
Repair statusReplacement nodes may miss data if hints expired before they startedLast repair time on replacement node older than max_hint_window_in_ms

How Netdata helps

  • Correlate streaming throughput with per-device disk I/O utilization to detect source-node saturation during bootstrap.
  • Alert on disk space runway before cleanup starts, preventing out-of-space crashes during SSTable rewrites.
  • Track JVM heap usage and GC pause duration on nodes acting as streaming sources; bootstrap bursts can trigger heap pressure and gossip flapping.
  • Monitor gossip state transitions (UN to UJ to UN) to detect bootstrap or decommission operations that have stalled mid-stream.
  • Correlate network bandwidth utilization with streaming sessions to identify link saturation before it impacts client traffic.
  • Monitor pending compactions after cleanup to ensure compaction debt does not accumulate and degrade read latency.
  • Cassandra node stuck in joining (UJ): bootstrap diagnosis: /guides/cassandra/cassandra-bootstrap-stuck/
  • Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS: /guides/cassandra/cassandra-choosing-compaction-strategy/
  • Cassandra clock skew: how NTP drift silently corrupts data: /guides/cassandra/cassandra-clock-skew-data-corruption/
  • Cassandra compaction death spiral: when writes outrun compaction throughput: /guides/cassandra/cassandra-compaction-death-spiral/
  • Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM: /guides/cassandra/cassandra-consistency-levels-explained/
  • Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones: /guides/cassandra/cassandra-data-resurrection-gc-grace/
  • Cassandra disk space exhaustion: emergency recovery when the data volume fills: /guides/cassandra/cassandra-disk-space-exhaustion/
  • Cassandra dropped mutations: silent write loss and load shedding: /guides/cassandra/cassandra-dropped-mutations/
  • Cassandra dropped reads and other messages: reading nodetool tpstats Dropped: /guides/cassandra/cassandra-dropped-reads-and-messages/
  • Cassandra GC death spiral: long pauses, gossip flapping, and recovery: /guides/cassandra/cassandra-gc-death-spiral/
  • Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses: /guides/cassandra/cassandra-gc-pauses-too-long/
  • Cassandra gossip flapping: nodes bouncing UP and DOWN: /guides/cassandra/cassandra-gossip-flapping/