Elasticsearch node left the cluster: fault detection, reallocation, and recovery
Your cluster health turned yellow and number_of_nodes dropped by one. The master logs a NODE_LEFT event, shards are unassigned, and the remaining nodes absorb extra load. In the next minute, the allocator decides whether to move data. Misread the cause and a transient restart becomes an expensive reallocation storm, or a genuine hardware failure goes unaddressed while replicas rebalance.
This guide covers how Elasticsearch decides a node is gone, what happens to its shards, and how to recover without deepening the incident.
What this means
Elasticsearch 7.x and later use follower checks (master to node) and leader checks (node to master). Each check uses a 10-second timeout with a 1-second interval; after three consecutive failures the master removes the node. A hard TCP disconnect triggers immediate removal without waiting for retries.
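As a rough check on that arithmetic, the sketch below multiplies the timeout by the retry count to estimate the slow-path detection window, and wraps a read-only settings query in a function. The function names and the ES_URL variable are illustrative assumptions, and the estimate ignores the 1-second interval between checks.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: the function names and ES_URL are assumptions for illustration.
ES_URL="${ES_URL:-http://localhost:9200}"

# Estimate the slow-path detection window: timeout (seconds) x retry count.
detection_window_seconds() {
  local timeout_s="$1" retries="$2"
  echo $(( timeout_s * retries ))
}

# Read the effective fault detection settings, including defaults (read-only).
show_fault_detection_settings() {
  curl -s "${ES_URL}/_cluster/settings?include_defaults=true&filter_path=defaults.cluster.fault_detection&pretty"
}

detection_window_seconds 10 3   # defaults described above; prints 30
```

Run show_fault_detection_settings against a live cluster to see whether anyone has overridden the defaults. Values set explicitly appear under persistent or transient rather than defaults.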
Once the node is removed, the shard copies it held become unassigned. Where a lost primary has an in-sync replica on another node, the master promotes that replica immediately. The master will then reallocate the missing copies, but index.unassigned.node_left.delayed_timeout defaults to one minute to avoid unnecessary movement during quick restarts. If the node returns before the timeout expires, the cluster cancels pending relocations and reuses the local shard copy. If it does not return, the allocator places shards elsewhere and recovery begins, consuming network bandwidth and disk I/O on both source and target nodes and degrading latency on survivors.
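To see whether the cluster is still inside the delayed window, cluster health reports delayed_unassigned_shards alongside unassigned_shards. The sketch below classifies a health response read from standard input. The function name and the state labels are illustrative, and the sed-based parsing is an assumption that only suits this flat response; use a real JSON parser if one is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a _cluster/health JSON response on stdin and prints
# one of: unknown, none-unassigned, delayed, not-delayed.
delayed_allocation_state() {
  local json unassigned delayed
  json="$(cat)"
  # Extract the two counters; the leading quote keeps the first pattern from
  # matching inside "delayed_unassigned_shards".
  unassigned="$(printf '%s\n' "$json" | sed -n 's/.*"unassigned_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')"
  delayed="$(printf '%s\n' "$json" | sed -n 's/.*"delayed_unassigned_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')"
  if [ -z "$unassigned" ] || [ -z "$delayed" ]; then
    echo "unknown"
  elif [ "$unassigned" -eq 0 ]; then
    echo "none-unassigned"
  elif [ "$delayed" -gt 0 ]; then
    echo "delayed"        # at least some shards are still waiting out the timeout
  else
    echo "not-delayed"    # the timeout has passed; allocation is in progress or blocked
  fi
}

# Usage against a live cluster:
#   curl -s 'http://localhost:9200/_cluster/health' | delayed_allocation_state
```

A result of not-delayed with shards still unassigned is the cue to run the allocation explain API.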
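If the departed node was master-eligible, the surviving master-eligible nodes must still hold a quorum of the voting configuration. With the default cluster.auto_shrink_voting_configuration setting, Elasticsearch removes departed nodes from the voting configuration automatically as long as at least three nodes remain in it. Where it cannot shrink the configuration, the departed node stays in it, and you may need the _cluster/voting_config_exclusions API so it cannot block master elections if additional nodes leave.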
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| OOM kill | Node vanishes abruptly with no shutdown sequence; the process restarts with a new PID and zero JVM uptime | Kernel logs (dmesg) or journal for Out of memory: Kill process events |
| Long GC pause | The JVM freezes for tens of seconds; the node rejoins later with the same identity and continuous uptime | GC logs on the node, or historical old GC duration metrics, compared to the NODE_LEFT timestamp |
| Network partition | The Elasticsearch process is still running but logs transport disconnects; only some nodes lose reachability | Inter-node ping, netstat, and firewall or security-group logs |
| Hardware failure | The entire host stops responding; hypervisor or system logs show faults; multiple services on the host die | Host-level health events and BMC or hypervisor logs |
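For the OOM-kill row, a small filter over kernel log output is usually enough to confirm or rule out the cause. The sketch below assumes the Elasticsearch JVM appears as a process named java; the function name is illustrative. It matches both the older "Kill process" and newer "Killed process" wording, case-insensitively so that cgroup messages are included.

```shell
#!/usr/bin/env bash
# Hypothetical helper: filter kernel log lines on stdin for OOM kills of a java process.
find_java_oom_kills() {
  grep -iE 'out of memory: kill(ed)? process [0-9]+ \(java\)'
}

# Usage on the affected host (may require root):
#   dmesg -T | find_java_oom_kills
#   journalctl -k --since "1 hour ago" | find_java_oom_kills
```

No output means the kernel log retained no matching kill, which points toward the other rows in the table.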
Quick checks
Run these read-only commands to size up the incident.
# Cluster membership and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,number_of_data_nodes,unassigned_shards'
# Unassigned shards and their reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED
# Allocation decision for the first unassigned shard
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'
# Heap pressure and old GC on remaining nodes (node stats API)
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old&pretty'
# Thread pool rejections indicating survivor overload
curl -s 'http://localhost:9200/_cat/thread_pool/write,search,get?v&h=node_name,name,active,queue,rejected'
# Master backlog from the membership change
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
How to diagnose it
- Confirm the departure is genuine. Compare number_of_nodes and number_of_data_nodes against your baseline. Use _cat/nodes to identify the missing node by name. If the drop matches a planned rolling restart, treat recovery as routine.
- Check the master logs for the removal trigger. Look for NODE_LEFT. A hard disconnect appears immediately; a follower-check timeout appears after roughly 30 seconds.
- Determine whether the node is coming back. A brief process restart resolves within minutes. An OOM-killed or hardware-failed node needs operator intervention.
- Correlate with GC logs. On the departed node (or from its logs if it is unreachable), compare old-generation stop-the-world pause timestamps against the NODE_LEFT timestamp. A pause exceeding 10 seconds causes that follower check to fail; three consecutive failures trigger removal.
- Inspect shard allocation state. Use _cluster/allocation/explain. If the reason is NODE_LEFT, the allocator is either still inside the delayed timeout or unable to find a target node.
- Audit the voting configuration if the node was master-eligible. If multiple departed nodes accumulate without cleanup, the cluster may eventually be unable to elect a master.
- Assess pressure on survivors. Remaining data nodes may now carry primaries that previously had replicas, or face concentrated query load. Check heap, thread pool queues, and disk watermarks for cascade signs.
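The first step, naming the node that left, can be sketched as a comparison between a saved baseline of node names and the current _cat/nodes output. The function name and the file layout (one node name per line) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print names that are in the baseline file but not in the
# current file. Both files hold one node name per line.
missing_nodes() {
  local baseline="$1" current="$2"
  # -v invert, -x whole-line match, -F fixed strings, -f patterns from file
  grep -vxFf "$current" "$baseline"
}

# Usage against a live cluster (strips trailing padding from _cat output):
#   curl -s 'http://localhost:9200/_cat/nodes?h=name' | sed 's/[[:space:]]*$//' > current.txt
#   missing_nodes baseline.txt current.txt
```

Capturing the baseline file during a known-good period is the part that has to happen before the incident.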
flowchart TD
A[Node misses follower checks] --> B{Hard TCP disconnect?}
B -->|Yes| C[Master removes node immediately]
B -->|No| D[3 retries at 10s timeout, ~30s worst case]
D --> C
C --> E[Shards become unassigned]
E --> F{Node rejoins within delayed timeout?}
F -->|Yes| G[Cancel relocation, reuse local shard]
F -->|No| H[Allocate to surviving nodes]
H --> I[Shard recovery begins]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node count / Data node count | Direct measure of cluster membership loss | Sustained drop below baseline for longer than delayed_timeout |
| Unassigned shard count | Tracks shard copies (primary or replica) not currently allocated to any node | Increase sustained beyond the delayed timeout |
| JVM heap used percent | Heap pressure is the leading preventable cause of spurious departure | Sustained above 85 percent |
| Old GC collection time | Stop-the-world pauses longer than the fault detection window force removal | Individual pauses approaching 10 seconds |
| Thread pool rejections | Surviving nodes may lack capacity to cover missing shards | Sustained nonzero write or search rejections |
| Cluster health status | Lagging summary of user impact | Red for unassigned primaries; yellow sustained beyond restart window |
| Pending cluster tasks | Membership changes generate cluster state work | More than 100 tasks or any task older than 5 minutes |
| Disk used percent on survivors | Relocated shards need destination space | Approaching the high watermark at 90 percent |
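As one way to spot-check the heap row by hand, the sketch below lists nodes whose heap.percent exceeds a threshold, reading _cat/nodes output from standard input. The function name is illustrative. A single sample cannot show a sustained condition, so treat this as a point-in-time check, not an alert rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name heap.percent" rows on stdin and print the names
# of nodes strictly above the threshold (default 85).
nodes_over_heap() {
  local threshold="${1:-85}"
  awk -v t="$threshold" '$2 ~ /^[0-9]+$/ && $2 + 0 > t { print $1 }'
}

# Usage against a live cluster:
#   curl -s 'http://localhost:9200/_cat/nodes?h=name,heap.percent' | nodes_over_heap 85
```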
Fixes
If the node will return quickly
Do nothing. The default one-minute delayed timeout exists for this case. If the node rejoins before timeout expiry and its shard copies are still in sync, the master cancels relocation automatically. Do not force reroutes or lower the timeout during rolling restarts.
If the node is permanently lost
Data-node shards begin recovering to other nodes once the timeout expires.
If recovery is blocked, run _cluster/allocation/explain to find disk watermark violations, awareness constraints, or corrupt copies.
Before forcing allocation, verify that target nodes have enough disk space. If survivors are already above the low watermark, new shards will be refused.
If shards are stuck in ALLOCATION_FAILED after exhausting max retries, run:
# Retry failed shard allocations
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"
See Elasticsearch ALLOCATION_FAILED after max retries: reroute and corrupt shard recovery for details.
If the departed node was master-eligible, clean up the voting configuration:
# Exclude departed master-eligible node from voting configuration
# (Elasticsearch 7.8 and later take the node name as a query parameter, not a JSON body)
curl -X POST "localhost:9200/_cluster/voting_config_exclusions?node_names=departed-node-name"
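Exclusions are meant to be temporary. Once the node is permanently out of the cluster, clear the list with DELETE /_cluster/voting_config_exclusions. If exclusions accumulate and are never cleaned up, the cluster may eventually struggle to reconfigure membership.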
If GC pressure caused the removal
Do not restart the node without fixing the root cause; it will leave again. Reduce heap pressure first:
Identify heavy tasks via GET /_tasks and cancel them with POST /_tasks/<task_id>/_cancel.
If a death spiral is forming, stop the reallocation storm temporarily. This is disruptive and delays recovery, but it prevents survivors from being overwhelmed:
# Stop new allocations while fixing root cause
curl -X PUT "localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
Re-enable allocation after resolving heap pressure.
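Re-enabling can be sketched as the same settings call with the value set to null, which removes the override and restores the default (all). The helper names and the "reset" keyword below are illustrative assumptions; only the request body and endpoint are Elasticsearch's own.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: function names, ES_URL, and the "reset" keyword are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the settings body. "reset" maps to JSON null, which clears the transient override.
allocation_body() {
  if [ "$1" = "reset" ]; then
    printf '%s' '{"transient":{"cluster.routing.allocation.enable":null}}'
  else
    printf '{"transient":{"cluster.routing.allocation.enable":"%s"}}' "$1"
  fi
}

# Apply the setting to the cluster.
set_allocation() {
  curl -s -X PUT "${ES_URL}/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$(allocation_body "$1")"
}

# Usage:
#   set_allocation none    # pause allocation
#   set_allocation reset   # restore the default once heap pressure is resolved
```

Clearing with null rather than writing "all" avoids leaving a stale explicit override in the cluster settings.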
For deeper analysis, see Elasticsearch long GC pauses: old-generation stop-the-world and node drops and Elasticsearch heap pressure death spiral: GC, node removal, and the cascade.
If you need immediate allocation
If you have confirmed the node is permanently lost and cannot wait for the timeout, temporarily reduce index.unassigned.node_left.delayed_timeout to force earlier allocation. Reset it afterward to avoid unnecessary movement during future restarts.
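A minimal sketch of that change, assuming you want it applied to every index through _all: set the timeout to 0 to allocate immediately, then put it back to 1m (the default) afterward. The helper names and ES_URL are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: function names and ES_URL are assumptions for illustration.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the index settings body for a given timeout value, e.g. 0, 1m, 5m.
delayed_timeout_body() {
  printf '{"settings":{"index.unassigned.node_left.delayed_timeout":"%s"}}' "$1"
}

# Apply the timeout to all existing indices.
set_delayed_timeout() {
  curl -s -X PUT "${ES_URL}/_all/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(delayed_timeout_body "$1")"
}

# Usage:
#   set_delayed_timeout 0    # allocate missing replicas now
#   set_delayed_timeout 1m   # restore the default afterward
```

The same call with a larger value, such as 5m, covers environments where restarts routinely exceed one minute. Because this is an index-level setting, indices created later take their value from index templates rather than from this call.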
Prevention
- Tune the delayed timeout. In Kubernetes or on hosts where restarts routinely exceed one minute, raise index.unassigned.node_left.delayed_timeout to match the expected restart window.
- Debounce node-count alerts in Kubernetes. Pods churn during rolling updates, so alert on sustained drops rather than transient gaps.
- Monitor heap and old GC. The majority of spurious node departures come from GC pauses, so track the post-GC heap floor and old GC duration.
- Use dedicated master nodes. Keep coordination workload off data nodes so that indexing pressure does not destabilize the membership layer.
- Maintain allocation awareness. Rack or zone attributes help ensure that a single failure domain does not take down multiple copies of the same shard.
- Test snapshot restores. If a node suffers unrecoverable hardware failure, rely on tested backups rather than hoping replicas cover the loss.
How Netdata helps
- Node count correlation: Node count drops correlated with per-node JVM heap and GC metrics distinguish OOM kills from network blips.
- Old GC pause detection: Old GC pause duration is surfaced before it exceeds the fault detection window.
- Thread pool saturation: Write and search queue depths on surviving nodes show cascade overload.
- Disk watermark proximity: Per-node disk usage shows whether survivors have headroom for relocated shards.
- Cluster health composition: Unassigned shard trends and pending cluster task counts distinguish master backlog from restart transience.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
- Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert
- Elasticsearch long GC pauses: old-generation stop-the-world and node drops
- Elasticsearch node OOM-killed: heap ceiling, page cache, and container limits
- Elasticsearch ALLOCATION_FAILED after max retries: reroute and corrupt shard recovery







