Elasticsearch unassigned shards: reading allocation explain and fixing each reason
Yellow or red cluster health with unassigned_shards > 0 means the allocator cannot place one or more shard copies on any node. Missing primaries block queries and risk data loss; missing replicas only cost redundancy. Do not guess from the cluster color. The allocator already knows why it rejected every node. Ask it.
What this means
Unassigned primaries make their data unreachable. Affected indices return partial results or fail. Unassigned replicas remove redundancy; a second failure on those primaries drops the data. The master allocator evaluates every node through a chain of deciders: disk watermarks, allocation filters, awareness attributes, the same-shard rule, and retry limits. When every node is rejected, the shard stays UNASSIGNED until the blocking condition clears or you intervene.
flowchart TD
A[Unassigned shards detected] --> B[GET /_cluster/allocation/explain]
B --> C{unassigned_info.reason}
C -->|NODE_LEFT| D[Check node count and delayed_timeout]
C -->|ALLOCATION_FAILED| E[Check failed_allocations count]
E -->|>= 5| F[POST /_cluster/reroute?retry_failed=true]
C -->|Disk watermark| G[Check /_cat/allocation disk percent]
C -->|Filter / Awareness| H[Check /_cat/nodeattrs and index routing settings]
C -->|Same shard| I[Reduce replicas or add data nodes]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk watermark breached | Nodes above low/high/flood watermarks; new allocations blocked | GET /_cat/allocation?v |
| NODE_LEFT with delayed timeout | Node departed; replicas unassigned but waiting | GET /_cat/nodes and index.unassigned.node_left.delayed_timeout |
| ALLOCATION_FAILED at max retries | Shard copy corrupt or failed validation; never retries again | GET /_cluster/allocation/explain for failed_allocations count |
| Allocation filter or awareness mismatch | Index requires a node attribute that no node carries | GET /_cat/nodeattrs?v and index routing.allocation settings |
| Same-shard rule / insufficient nodes | Replica count equals or exceeds data node count | GET /_cat/nodes count vs replica count |
| Allocation explicitly disabled | cluster.routing.allocation.enable set to none or primaries | GET /_cluster/settings?flat_settings=true |
Quick checks
# Cluster health and unassigned count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_nodes'
# Unassigned shards with reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED
# The single most useful diagnostic
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'
# Disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'
# Current allocation enablement settings
curl -s 'http://localhost:9200/_cluster/settings?flat_settings=true&filter_path=**.cluster.routing.allocation.enable'
# Node awareness attributes
curl -s 'http://localhost:9200/_cat/nodeattrs?v'
How to diagnose it
- Confirm severity. An unassigned primary is an immediate incident. An unassigned replica is a ticket unless recovery stalls past your SLO.
- Run GET /_cluster/allocation/explain. Without a body, it explains the first unassigned shard it finds. For a specific shard, pass {"index":"name","shard":0,"primary":true}.
- Read unassigned_info.reason. Values like NODE_LEFT or CLUSTER_RECOVERED often self-heal. ALLOCATION_FAILED never self-heals after max retries.
- Read the can_allocate field and the decider list. The decider name tells you the rule that rejected every node: disk_threshold, filter, same_shard, awareness, throttling, etc.
- Check allocate_explanation and node_allocation_decisions for per-node rejections. Pass ?include_yes_decisions=true to also list the deciders that voted YES on each node, which shows which rules are not the blocker.
- Correlate with node count drops, disk usage, and recent cluster settings changes.
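The targeted request from the steps above can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: the function names explain_body and explain_shard, the ES_URL variable, and the index name my-index are assumptions made for illustration.

```shell
# Assumed base URL; override ES_URL for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the request body for one shard copy.
# $1 = index name, $2 = shard number, $3 = true for the primary, false for a replica.
# Values are inserted as-is, so this assumes a plain index name without quotes.
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}' "$1" "$2" "$3"
}

# Ask the allocator about that specific shard copy.
# Defined only; nothing is sent until you call it.
explain_shard() {
  curl -s -X POST "$ES_URL/_cluster/allocation/explain?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(explain_body "$1" "$2" "$3")"
}
```

For example, explain_shard my-index 0 true asks about the primary of shard 0, and explain_shard my-index 0 false asks about one of its replicas.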
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| unassigned_shards | Direct measure of stuck shards | Nonzero for more than 5 minutes outside maintenance |
| Disk used percent per node | Watermarks block allocation | Any node above the low watermark |
| number_of_nodes / number_of_data_nodes | Node loss triggers reallocation | Unexpected drop from baseline |
| relocating_shards | Recovery or rebalance storm | Sudden spike without planned change |
| cluster.routing.allocation.enable | Admin or automation may disable allocation | Value is not all |
| index.allocation.max_retries exceeded | Shards stuck forever without operator action | ALLOCATION_FAILED with failed_allocations >= 5 |
Fixes
Disk watermark breach
If a node crosses the low watermark (85%), the allocator stops sending new shards to it. At the high watermark (90%), Elasticsearch actively relocates shards away. At flood stage (95%), it sets index.blocks.read_only_allow_delete on indices with shards on that node.
Immediate response: free disk by deleting old indices, shrinking indices, or expanding storage. The flood-stage block should clear automatically once disk is freed. If it does not, or you need to unblock immediately after confirming sufficient space:
# Remove read-only block after freeing disk
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'
Tradeoff: deleting indices is destructive. Reducing replica count frees space but reduces redundancy. Recent 8.x versions support max_headroom watermarks for large disks.
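To see which nodes sit in which watermark band, the default thresholds quoted above (85, 90, 95) can be applied to the disk.percent column of _cat/allocation. The helper names watermark_state and classify_nodes are hypothetical, the thresholds are only the defaults (your cluster may override them or use absolute byte values), and the comparison is approximate because _cat reports whole percentages.

```shell
# Classify a whole-number disk-used percentage against the watermarks.
# $1 = percent used; $2..$4 optionally override low, high, flood (defaults 85/90/95).
# A value equal to a threshold is treated as breached, which is an approximation.
watermark_state() {
  pct=$1; low=${2:-85}; high=${3:-90}; flood=${4:-95}
  if [ "$pct" -ge "$flood" ]; then echo flood_stage
  elif [ "$pct" -ge "$high" ]; then echo high
  elif [ "$pct" -ge "$low" ]; then echo low
  else echo ok
  fi
}

# Read "node disk.percent" lines on stdin and print one verdict per node.
# Assumes node names contain no spaces.
classify_nodes() {
  while read -r node pct; do
    case "$pct" in
      ''|*[!0-9]*) continue ;;   # skip rows without a numeric percent (e.g. UNASSIGNED)
    esac
    printf '%s %s%% %s\n' "$node" "$pct" "$(watermark_state "$pct")"
  done
}
```

One way to feed it, assuming the same localhost endpoint used elsewhere in this guide: curl -s 'http://localhost:9200/_cat/allocation?h=node,disk.percent' | classify_nodes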
NODE_LEFT and delayed timeout
When a node leaves, replicas go unassigned. By default, the master waits index.unassigned.node_left.delayed_timeout (one minute) before reallocating, in case the node restarts. Once the timeout expires, recovery starts automatically. Lower the timeout to recover faster; raise it during rolling restarts to suppress unnecessary movement.
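Adjusting the timeout is a dynamic index settings update. The sketch below assumes the same localhost endpoint as the other commands in this guide; the function names and the 5m value are illustrative, not recommendations.

```shell
# Assumed base URL; override ES_URL for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the settings body. $1 is a time value such as 5m, 30s, or 0.
delayed_timeout_body() {
  printf '{"settings":{"index.unassigned.node_left.delayed_timeout":"%s"}}' "$1"
}

# Apply the timeout to an index or pattern.
# $1 = index name or pattern (for example _all), $2 = time value.
# Defined only; nothing is sent until you call it.
set_delayed_timeout() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(delayed_timeout_body "$2")"
}
```

For example, set_delayed_timeout _all 5m before a rolling restart, and set_delayed_timeout _all 1m afterwards to return to the default.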
ALLOCATION_FAILED and max retries
If a shard fails allocation, Elasticsearch retries up to index.allocation.max_retries (default 5). After that, the shard stays unassigned indefinitely, even if the root cause is fixed.
After fixing the underlying issue (disk space, hardware, corrupt translog), trigger a retry:
# Reset retry counter and attempt allocation again
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
Warning: retry_failed=true does not bypass allocation rules. If the disk is still full or the filter still mismatches, the retry fails again and burns another retry cycle. Always read the allocation explain output before retrying.
Allocation filters and awareness attributes
Stale index.routing.allocation.require.*, include.*, or exclude.* settings can pin shards to nodes that no longer exist. Forced awareness (cluster.routing.allocation.awareness.force.*) strands replicas when an attribute value is missing from the cluster. Verify node attributes with GET /_cat/nodeattrs?v, then update or remove the offending index setting. This is common after node replacement if the new node advertises different attributes.
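Inspecting and clearing a stale filter might look like the following sketch. The attribute name box_type and the index name my-index are hypothetical; substitute whatever the allocation explain output names. Setting a value to null removes the setting from the index.

```shell
# Assumed base URL; override ES_URL for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Show only the routing settings of an index ($1).
# Defined only; nothing is sent until you call it.
show_routing() {
  curl -s "$ES_URL/$1/_settings?filter_path=*.settings.index.routing&pretty"
}

# Build a body that removes one filter.
# $1 = require, include, or exclude; $2 = attribute name.
clear_filter_body() {
  printf '{"index.routing.allocation.%s.%s":null}' "$1" "$2"
}

# Remove the filter from an index. $1 = index, $2 = filter kind, $3 = attribute.
clear_filter() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(clear_filter_body "$2" "$3")"
}
```

For example, show_routing my-index followed by clear_filter my-index require box_type. Compare the attribute against GET /_cat/nodeattrs?v first, since removing a filter that is still intended moves shards onto the wrong nodes.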
Same-shard rule and shard limits
The same-shard rule forbids a primary and replica from sharing a node. If replica count equals or exceeds the data node count, at least one replica stays unassigned. Reduce replicas or add nodes. Also watch cluster.max_shards_per_node (default 1000 non-frozen shards). Hitting the limit rejects new shards with a maximum shards open error.
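The arithmetic behind the same-shard rule is simple: with N data nodes, each shard can hold one primary plus at most N - 1 replicas. A minimal sketch, with hypothetical helper names and index name my-index:

```shell
# Assumed base URL; override ES_URL for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Largest replica count that can be fully assigned on $1 data nodes.
max_replicas() {
  n=$1
  if [ "$n" -gt 0 ]; then
    echo $((n - 1))
  else
    echo 0
  fi
}

# Build the settings body for a new replica count ($1).
replicas_body() {
  printf '{"index":{"number_of_replicas":%s}}' "$1"
}

# Apply the replica count to an index. $1 = index, $2 = replica count.
# Defined only; nothing is sent until you call it.
set_replicas() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(replicas_body "$2")"
}
```

Read number_of_data_nodes from GET /_cluster/health, pass it to max_replicas, and compare the result with the index's current number_of_replicas before calling set_replicas my-index with a lower value. This ignores allocation filters and awareness rules, which can reduce the usable node count further.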
Last-resort manual allocation
If the only copies of a primary are gone, you may need allocate_empty_primary or allocate_stale_primary. Both require "accept_data_loss": true. These are destructive and should only be used when the original data is provably gone and recovery from snapshot is not faster.
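Both commands are sent through the cluster reroute API. The sketch below covers allocate_stale_primary only and defaults to dry_run=true, so the request is simulated unless you explicitly pass false. The function names, index name my-index, and node name node-1 are hypothetical; the node must be one that still holds a stale copy of the shard.

```shell
# Assumed base URL; override ES_URL for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the reroute body. $1 = index, $2 = shard number, $3 = node holding the stale copy.
stale_primary_body() {
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}' \
    "$1" "$2" "$3"
}

# Send the reroute command.
# $4 = dry_run flag; defaults to true so nothing changes unless you pass false.
# Defined only; nothing is sent until you call it.
reroute_stale_primary() {
  curl -s -X POST "$ES_URL/_cluster/reroute?dry_run=${4:-true}&pretty" \
    -H 'Content-Type: application/json' \
    -d "$(stale_primary_body "$1" "$2" "$3")"
}
```

Run reroute_stale_primary my-index 0 node-1 first and review the simulated result. Only after confirming that no snapshot or healthier copy exists would you repeat the call with false as the fourth argument, accepting that writes missing from the stale copy are lost.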
Prevention
- Project disk time-to-watermark and expand before the low watermark.
- Keep replica counts below the data node count.
- Audit allocation filter settings during node replacements and tier migrations.
- Watch shard density per node; avoid approaching cluster.max_shards_per_node.
- During rolling restarts, set cluster.routing.allocation.enable: none to prevent rebalancing storms, then re-enable to all.
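Toggling allocation around a restart is a cluster settings update. In the sketch below the function names are hypothetical and the value is passed as an argument, because the right choice depends on your procedure: the official rolling-restart instructions use primaries, which still lets primaries allocate while a node is down. The reset argument removes the override so the default (all) applies again.

```shell
# Assumed base URL; override ES_URL for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the cluster settings body.
# $1 = all, primaries, new_primaries, none, or reset (removes the override).
allocation_enable_body() {
  case "$1" in
    all|primaries|new_primaries|none)
      printf '{"persistent":{"cluster.routing.allocation.enable":"%s"}}' "$1" ;;
    reset)
      printf '{"persistent":{"cluster.routing.allocation.enable":null}}' ;;
    *)
      echo "unknown value: $1" >&2
      return 1 ;;
  esac
}

# Apply the setting. Defined only; nothing is sent until you call it.
set_allocation_enable() {
  body=$(allocation_enable_body "$1") || return 1
  curl -s -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$body"
}
```

A typical sequence is set_allocation_enable primaries before stopping a node and set_allocation_enable reset once it has rejoined. Forgetting the second step is exactly the "allocation explicitly disabled" cause listed earlier.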
How Netdata helps
- Correlate unassigned_shards with per-node disk usage to spot the blocking node.
- Alert on unexpected node count drops before NODE_LEFT events.
- Track JVM heap and old-generation GC pauses; long GC causes node removal and unassigned shards.
- Correlate relocating_shards spikes with network and disk I/O to spot recovery storms.
- Track changes to cluster.routing.allocation.enable to catch accidental allocation locks.
Related guides
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
- Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert
- Elasticsearch long GC pauses: old-generation stop-the-world and node drops
- How Elasticsearch actually works in production: a mental model for operators







