Elasticsearch node left the cluster: fault detection, reallocation, and recovery

Your cluster health turned yellow and number_of_nodes dropped by one. The master logs a NODE_LEFT event, shards are unassigned, and the remaining nodes absorb extra load. In the next minute, the allocator decides whether to move data. Misread the cause and a transient restart becomes an expensive reallocation storm, or a genuine hardware failure goes unaddressed while replicas rebalance.

This guide covers how Elasticsearch decides a node is gone, what happens to its shards, and how to recover without deepening the incident.

What this means

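Elasticsearch 7.x and later use follower checks (master to node) and leader checks (node to master). By default each check has a 10-second timeout and runs at a 1-second interval (cluster.fault_detection.follower_check.timeout and cluster.fault_detection.follower_check.interval, with matching leader_check settings). After three consecutive failed follower checks (retry_count), the master removes the node; a node whose leader checks fail three times in a row instead starts looking for a new master. A hard TCP disconnect triggers immediate removal without waiting for retries.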
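The arithmetic behind the detection delay can be sketched as a small shell function. This is a rough upper bound under two assumptions that are not stated in the guide: every check runs to its full timeout, and the master waits one interval between failed attempts. The function name `detection_window` is made up for illustration.

```shell
# Rough upper bound, in seconds, on how long the master keeps a silent node
# before removing it. Assumes each check runs to its full timeout and that
# one interval passes between failed attempts.
detection_window() {
  retries=$1; timeout_s=$2; interval_s=$3
  echo $(( retries * timeout_s + (retries - 1) * interval_s ))
}

# Defaults: retry_count=3, timeout=10s, interval=1s
detection_window 3 10 1   # prints 32
```

With the defaults this lands close to the 30-second figure used in this guide; raising the timeout or retry count widens the window in proportion.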

Once the node is removed, its shards become unassigned. The master will reallocate them, but index.unassigned.node_left.delayed_timeout defaults to one minute to avoid unnecessary movement during quick restarts. If the node returns before the timeout expires, the cluster cancels pending relocations and reuses the local shard copy. If it does not return, the allocator places shards elsewhere and recovery begins, consuming network bandwidth and disk I/O on both source and target nodes and degrading latency on survivors.
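To see whether the delay timer is still holding shards back, the cluster health response includes a delayed_unassigned_shards counter. The sketch below assumes the same unauthenticated http://localhost:9200 endpoint used elsewhere in this guide; the helper names `extract_delayed` and `delayed_shards` are hypothetical.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# Pull the delayed_unassigned_shards number out of a compact health response on stdin.
extract_delayed() {
  sed -n 's/.*"delayed_unassigned_shards":\([0-9][0-9]*\).*/\1/p'
}

# Shards still waiting out the delay. A nonzero value means the timer is
# running and the allocator has not yet started moving data.
delayed_shards() {
  curl -s "${ES_URL}/_cluster/health?filter_path=delayed_unassigned_shards" | extract_delayed
}
```

Calling `delayed_shards` during the first minute after a departure should show a nonzero count that falls to zero once the node rejoins or the timeout expires.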

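If the departed node was master-eligible, the surviving master-eligible nodes must still form a quorum of the voting configuration. With the default cluster.auto_shrink_voting_configuration: true, Elasticsearch normally drops a departed node from the voting configuration on its own as long as at least three master-eligible nodes remain. The _cluster/voting_config_exclusions API is meant for the deliberate removal of master-eligible nodes, particularly half or more of them at once. After a departure, check that no stale entry remains that could block master elections if additional nodes leave.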

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| OOM kill | Node vanishes abruptly with no shutdown sequence; the process restarts with a new PID and zero JVM uptime | Kernel logs (`dmesg`) or journal for `Out of memory: Kill process` events |
| Long GC pause | The JVM freezes for tens of seconds; the node rejoins later with the same identity and continuous uptime | GC logs on the node, or historical old GC duration metrics, compared to the `NODE_LEFT` timestamp |
| Network partition | The Elasticsearch process is still running but logs transport disconnects; only some nodes lose reachability | Inter-node ping, `netstat`, and firewall or security-group logs |
| Hardware failure | The entire host stops responding; hypervisor or system logs show faults; multiple services on the host die | Host-level health events and BMC or hypervisor logs |

Quick checks

Run these read-only commands to size up the incident.

# Cluster membership and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,number_of_data_nodes,unassigned_shards'

# Unassigned shards and their reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED

# Allocation decision for the first unassigned shard
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'

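# Heap pressure and old-generation GC on remaining nodes
# (GC counters come from node stats; _cat/nodes has no GC columns)
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old&pretty'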

# Thread pool rejections indicating survivor overload
curl -s 'http://localhost:9200/_cat/thread_pool/write,search,get?v&h=node_name,name,active,queue,rejected'

# Master backlog from the membership change
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

How to diagnose it

Confirm the departure is genuine. Compare number_of_nodes and number_of_data_nodes against your baseline. Use _cat/nodes to identify the missing node by name. If the drop matches a planned rolling restart, treat recovery as routine.
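Check the master logs for the removal trigger. Look for NODE_LEFT, which the master log records as node-left entries with a reason such as disconnected or followers check retry count exceeded. A hard disconnect appears immediately; a follower-check timeout appears after roughly 30 seconds.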
Determine whether the node is coming back. A brief process restart resolves within minutes. An OOM-killed or hardware-failed node needs operator intervention.
Correlate with GC logs. On the departed node (or from its logs if it is unreachable), compare old-generation stop-the-world pause timestamps against the NODE_LEFT timestamp. A pause exceeding 10 seconds causes that follower check to fail; three consecutive failures trigger removal.
Inspect shard allocation state. Use _cluster/allocation/explain. If the reason is NODE_LEFT, the allocator is either still inside the delayed timeout or unable to find a target node.
Audit the voting configuration if the node was master-eligible. If departed nodes still make up half or more of the voting configuration, the cluster cannot elect a master.
Assess pressure on survivors. Remaining data nodes may now carry primaries that previously had replicas, or face concentrated query load. Check heap, thread pool queues, and disk watermarks for cascade signs.
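Two of the steps above lend themselves to a quick sketch: finding the removal reason in the master log and listing the committed voting configuration. The log file path is deployment-specific and is passed in as an argument; the endpoint and the helper names `node_left_events` and `voting_config` are assumptions for illustration.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# Print node-left lines from a master log file so the removal reason
# (for example "disconnected" or "followers check retry count exceeded")
# can be read off and compared against GC pause timestamps.
node_left_events() {
  grep -F 'node-left' "$1"
}

# Show which node IDs are in the last committed voting configuration.
voting_config() {
  curl -s "${ES_URL}/_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config&pretty"
}
```

A typical call would be `node_left_events /path/to/master.log`, with the path replaced by wherever your deployment writes Elasticsearch logs. If `voting_config` still lists the ID of a node that is gone for good, that is the entry to clean up.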

flowchart TD
    A[Node misses follower checks] --> B{Hard TCP disconnect?}
    B -->|Yes| C[Master removes node immediately]
    B -->|No| D["3 retries at 10s timeout<br/>~30s worst case"]
    D --> C
    C --> E[Shards become unassigned]
    E --> F{"Node rejoins<br/>within delayed timeout?"}
    F -->|Yes| G["Cancel relocation<br/>reuse local shard"]
    F -->|No| H[Allocate to surviving nodes]
    H --> I[Shard recovery begins]

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Node count / Data node count | Direct measure of cluster membership loss | Sustained drop below baseline for longer than `delayed_timeout` |
| Unassigned shard count | Tracks shard copies that are not allocated to any node | Increase sustained beyond the delayed timeout |
| JVM heap used percent | Heap pressure is the leading preventable cause of spurious departure | Sustained above 85 percent |
| Old GC collection time | Stop-the-world pauses longer than the fault detection window force removal | Individual pauses approaching 10 seconds |
| Thread pool rejections | Surviving nodes may lack capacity to cover missing shards | Sustained nonzero write or search rejections |
| Cluster health status | Lagging summary of user impact | Red for unassigned primaries; yellow sustained beyond restart window |
| Pending cluster tasks | Membership changes generate cluster state work | More than 100 tasks or any task older than 5 minutes |
| Disk used percent on survivors | Relocated shards need destination space | Approaching the low watermark (85 percent by default), past which new shards are refused, or the high watermark at 90 percent |
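The disk signal can be checked by hand with _cat/allocation. The following is a minimal sketch that flags nodes at or above the default low watermark of 85 percent; it assumes the same local endpoint as the quick checks and node names without spaces, and the helper names `nodes_over` and `disk_pressure` are hypothetical.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# Read "node disk.percent" pairs on stdin and print nodes at or above a threshold.
# Rows without a numeric percentage (such as the UNASSIGNED row) are skipped.
nodes_over() {
  awk -v limit="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 >= limit { print $1 }'
}

# Nodes at or above the default low watermark of 85 percent.
disk_pressure() {
  curl -s "${ES_URL}/_cat/allocation?h=node,disk.percent" | nodes_over 85
}
```

An empty result from `disk_pressure` suggests survivors still have room for relocated shards; adjust the threshold if your cluster overrides the default watermarks.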

Fixes

If the node will return quickly

Do nothing. The default one-minute delayed timeout exists for this case. If the node rejoins before timeout expiry and its shard copies are still in sync, the master cancels relocation automatically. Do not force reroutes or lower the timeout during rolling restarts.

If the node is permanently lost

Data-node shards begin recovering to other nodes once the timeout expires.

If recovery is blocked, run _cluster/allocation/explain to find disk watermark violations, awareness constraints, or corrupt copies.
Before forcing allocation, verify that target nodes have enough disk space. If survivors are already above the low watermark, new shards will be refused.
If shards are stuck in ALLOCATION_FAILED after exhausting max retries, run:
```
# Retry failed shard allocations
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"
```
See "Elasticsearch ALLOCATION_FAILED after max retries: reroute and corrupt shard recovery" for details.
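When recovery is blocked for one particular shard, the allocation explain API accepts a request body naming that shard copy instead of picking the first unassigned one. A sketch, with the endpoint assumed and the helper names `explain_body` and `explain_shard` invented for illustration:

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# JSON body asking about one shard copy: index name, shard number, primary true/false.
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}' "$1" "$2" "$3"
}

# Ask the master why this specific shard copy is unassigned.
explain_shard() {
  curl -s "${ES_URL}/_cluster/allocation/explain?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(explain_body "$1" "$2" "$3")"
}
```

For example, `explain_shard my-index 0 false` asks about the replica of shard 0 of a placeholder index named my-index. The response lists each candidate node along with the decider that rejected it.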

If the departed node was master-eligible and still appears in the voting configuration, exclude it:

# Exclude departed master-eligible node from voting configuration
# (node_names is a query parameter; this API takes no request body)
curl -X POST "localhost:9200/_cluster/voting_config_exclusions?node_names=departed-node-name"

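Exclusions persist until you clear them, and the list is capped by cluster.max_voting_config_exclusions (10 by default). Once the node has been decommissioned, clear the list with DELETE /_cluster/voting_config_exclusions so stale entries do not get in the way of later membership changes.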

If GC pressure caused the removal

Do not restart the node without fixing the root cause; it will leave again. Reduce heap pressure first:

Identify heavy tasks via GET /_tasks and cancel them with POST /_tasks/<task_id>/_cancel.

If a death spiral is forming, stop the reallocation storm temporarily. This is disruptive and delays recovery, but it prevents survivors from being overwhelmed:

# Stop new allocations while fixing root cause
curl -X PUT "localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'

Re-enable allocation after resolving heap pressure.
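Re-enabling can be done by resetting the transient override to null, which returns the setting to its default of all. The sketch below assumes the same local endpoint; `allocation_body` and `set_allocation` are hypothetical helper names.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# Body that sets cluster.routing.allocation.enable. Pass null (unquoted) to
# clear the transient override, or a quoted mode such as '"primaries"'.
allocation_body() {
  printf '{"transient":{"cluster.routing.allocation.enable":%s}}' "$1"
}

# Apply the setting through the cluster settings API.
set_allocation() {
  curl -s -X PUT "${ES_URL}/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$(allocation_body "$1")"
}
```

Run `set_allocation null` once heap pressure is under control. Passing `'"primaries"'` instead is a gentler middle ground that still lets unassigned primaries recover.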

For deeper analysis, see "Elasticsearch long GC pauses: old-generation stop-the-world and node drops" and "Elasticsearch heap pressure death spiral: GC, node removal, and the cascade".

If you need immediate allocation

If you have confirmed the node is permanently lost and cannot wait for the timeout, temporarily reduce index.unassigned.node_left.delayed_timeout to force earlier allocation. Reset it afterward to avoid unnecessary movement during future restarts.
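The timeout is a dynamic index setting, so it can be changed on live indices. A minimal sketch, assuming the local endpoint used in this guide and applying the value to all indices; the helper names `delayed_timeout_body` and `set_delayed_timeout` are made up for illustration.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# Settings body for index.unassigned.node_left.delayed_timeout.
delayed_timeout_body() {
  printf '{"settings":{"index.unassigned.node_left.delayed_timeout":"%s"}}' "$1"
}

# Apply the value to every index; "0" removes the delay, "1m" restores the default.
set_delayed_timeout() {
  curl -s -X PUT "${ES_URL}/_all/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(delayed_timeout_body "$1")"
}
```

`set_delayed_timeout 0` starts allocation right away, and `set_delayed_timeout 1m` puts the default back afterward. The same helper with a larger value such as 5m covers slow-restart environments.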

Prevention

Tune the delayed timeout. In Kubernetes or on hosts where restarts routinely exceed one minute, raise index.unassigned.node_left.delayed_timeout to match the expected restart window.
Debounce node-count alerts in Kubernetes. Pods churn during rolling updates, so alert on sustained drops rather than transient gaps.
Monitor heap and old GC. Long GC pauses are a leading cause of spurious node departures, so track the post-GC heap floor and old GC duration.
Use dedicated master nodes. Keep coordination workload off data nodes so that indexing pressure does not destabilize the membership layer.
Maintain allocation awareness. Rack or zone attributes help ensure that a single failure domain does not take down multiple copies of the same shard.
Test snapshot restores. If a node suffers unrecoverable hardware failure, rely on tested backups rather than hoping replicas cover the loss.
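The debouncing advice above can be illustrated with a small filter over node-count samples. This is a sketch of one possible rule, not a description of how any particular alerting system works: it fires only when the most recent N samples are all below the baseline. The function name `sustained_drop` is hypothetical.

```shell
# Read node-count samples (one per line, oldest first) on stdin. Print ALERT only
# if the most recent $2 samples are all below the baseline $1; otherwise print OK.
sustained_drop() {
  awk -v base="$1" -v need="$2" '
    { if ($1 + 0 < base) run = run + 1; else run = 0 }
    END { if (run >= need) print "ALERT"; else print "OK" }'
}
```

With a baseline of 3 nodes and samples taken every 30 seconds, `sustained_drop 3 4` would stay quiet through a pod restart shorter than two minutes and fire on anything longer.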

How Netdata helps

Node count correlation: Node count drops correlated with per-node JVM heap and GC metrics distinguish OOM kills from network blips.
Old GC pause detection: Old GC pause duration is surfaced before it exceeds the fault detection window.
Thread pool saturation: Write and search queue depths on surviving nodes show cascade overload.
Disk watermark proximity: Per-node disk usage shows whether survivors have headroom for relocated shards.
Cluster health composition: Unassigned shard trends and pending cluster task counts distinguish master backlog from restart transience.
