Elasticsearch unassigned shards: reading allocation explain and fixing each reason

Yellow or red cluster health with unassigned_shards > 0 means the allocator cannot place one or more shard copies on any node. Missing primaries block queries and risk data loss; missing replicas only cost redundancy. Do not guess from the cluster color. The allocator already knows why it rejected every node. Ask it.

What this means

Unassigned primaries make their data unreachable. Affected indices return partial results or fail. Unassigned replicas remove redundancy; a second failure on those primaries drops the data. The master allocator evaluates every node through a chain of deciders: disk watermarks, allocation filters, awareness attributes, the same-shard rule, and retry limits. When every node is rejected, the shard stays UNASSIGNED until the blocking condition clears or you intervene.

flowchart TD
    A[Unassigned shards detected] --> B[GET /_cluster/allocation/explain]
    B --> C{unassigned_info.reason}
    C -->|NODE_LEFT| D[Check node count and delayed_timeout]
    C -->|ALLOCATION_FAILED| E[Check unassigned_info.failed_allocation_attempts]
    E -->|>= 5| F[POST /_cluster/reroute?retry_failed=true]
    C -->|Disk watermark| G[Check /_cat/allocation disk percent]
    C -->|Filter / Awareness| H[Check /_cat/nodeattrs and index routing settings]
    C -->|Same shard| I[Reduce replicas or add data nodes]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Disk watermark breached | Nodes above low/high/flood watermarks; new allocations blocked | GET /_cat/allocation?v |
| NODE_LEFT with delayed timeout | Node departed; replicas unassigned but waiting | GET /_cat/nodes and index.unassigned.node_left.delayed_timeout |
| ALLOCATION_FAILED at max retries | Shard copy corrupt or failed validation; never retries again | GET /_cluster/allocation/explain for unassigned_info.failed_allocation_attempts |
| Allocation filter or awareness mismatch | Index requires a node attribute that no node carries | GET /_cat/nodeattrs?v and index routing.allocation settings |
| Same-shard rule / insufficient nodes | Replica count equals or exceeds data node count | GET /_cat/nodes count vs replica count |
| Allocation explicitly disabled | cluster.routing.allocation.enable set to none or primaries | GET /_cluster/settings?flat_settings=true |

Quick checks

# Cluster health and unassigned count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_nodes'

# Unassigned shards with reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED

# The single most useful diagnostic
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'

# Disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Current allocation enablement settings
curl -s 'http://localhost:9200/_cluster/settings?flat_settings=true&filter_path=**.cluster.routing.allocation.enable'

# Node awareness attributes
curl -s 'http://localhost:9200/_cat/nodeattrs?v'

How to diagnose it

  1. Confirm severity. An unassigned primary is an immediate incident. An unassigned replica is a ticket unless recovery stalls past your SLO.
  2. Run GET /_cluster/allocation/explain. Without a body, it explains an arbitrary unassigned shard, and returns an error if no shard is unassigned. For a specific shard, pass {"index":"name","shard":0,"primary":true}.
  3. Read unassigned_info.reason. Values like NODE_LEFT or CLUSTER_RECOVERED often self-heal. ALLOCATION_FAILED never self-heals after max retries.
  4. Read the can_allocate field and the decider list. The decider name tells you the rule that rejected every node: disk_threshold, filter, same_shard, awareness, throttling, max_retry, enable, etc.
  5. Check allocate_explanation and node_allocation_decisions for per-node rejections. Pass ?include_yes_decisions=true to also list the deciders that returned YES for each node, which shows what would still block the shard once the failing rule is fixed.
  6. Correlate with node count drops, disk usage, and recent cluster settings changes.
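
Steps 2 and 5 can be combined into a single request. The following is a minimal sketch, assuming an unauthenticated cluster on http://localhost:9200. The index name my-index is a placeholder, and explain_body and explain_shard are helper names made up for this example.

```shell
# Assumption: unauthenticated cluster on localhost; override with ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the request body for one shard copy (hypothetical helper).
# $1 = index name, $2 = shard number, $3 = true for the primary, false for a replica
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}' "$1" "$2" "$3"
}

# Explain one specific shard and include the deciders that returned YES.
explain_shard() {
  curl -s -X GET "$ES_URL/_cluster/allocation/explain?include_yes_decisions=true&pretty" \
    -H 'Content-Type: application/json' \
    -d "$(explain_body "$1" "$2" "$3")"
}

# Usage (placeholder index name):
#   explain_shard my-index 0 true
```

Pick the index and shard number from the _cat/shards output in Quick checks, so the explanation covers the shard you care about rather than whichever one the API selects.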

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| unassigned_shards | Direct measure of stuck shards | Nonzero for more than 5 minutes outside maintenance |
| Disk used percent per node | Watermarks block allocation | Any node above the low watermark |
| number_of_nodes / number_of_data_nodes | Node loss triggers reallocation | Unexpected drop from baseline |
| relocating_shards | Recovery or rebalance storm | Sudden spike without planned change |
| cluster.routing.allocation.enable | Admin or automation may disable allocation | Value is not all |
| index.allocation.max_retries exceeded | Shards stuck forever without operator action | ALLOCATION_FAILED with failed_allocation_attempts >= 5 |

Fixes

Disk watermark breach

If a node crosses the low watermark (85%), the allocator stops sending new shards to it. At the high watermark (90%), Elasticsearch actively relocates shards away. At flood stage (95%), it sets index.blocks.read_only_allow_delete on indices with shards on that node.
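
To see which stage each node is in, compare its disk.percent from _cat/allocation against the three thresholds. The sketch below assumes the default watermarks quoted above (85/90/95) and an unauthenticated cluster on localhost. watermark_stage and node_stages are hypothetical helper names. disk.percent is a rounded value, so treat results right at a boundary as approximate.

```shell
# Assumption: unauthenticated cluster on localhost; override with ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Classify a disk-used percentage against the watermarks.
# $1 = used percent (integer); $2 $3 $4 = optional low/high/flood overrides.
# Assumption: the defaults 85/90/95 apply unless the cluster settings changed them.
watermark_stage() {
  used=$1; low=${2:-85}; high=${3:-90}; flood=${4:-95}
  if [ "$used" -gt "$flood" ]; then echo flood_stage
  elif [ "$used" -gt "$high" ]; then echo high
  elif [ "$used" -gt "$low" ]; then echo low
  else echo ok
  fi
}

# Print "<node> <stage>" for every data node.
node_stages() {
  curl -s "$ES_URL/_cat/allocation?h=node,disk.percent" |
  while read -r node pct; do
    # Skip rows without a numeric percentage, such as the UNASSIGNED row.
    case "$pct" in ''|*[!0-9]*) continue ;; esac
    echo "$node $(watermark_stage "$pct")"
  done
}

# Usage:
#   node_stages
```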

Immediate response: free disk by deleting old indices, shrinking indices, or expanding storage. The flood-stage block should clear automatically once disk usage on the node falls back below the high watermark. If it does not, or you need to unblock immediately after confirming sufficient space:

# Remove read-only block after freeing disk
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'

Tradeoff: deleting indices is destructive. Reducing replica count frees space but reduces redundancy. Recent 8.x versions support max_headroom watermarks for large disks.

NODE_LEFT and delayed timeout

When a node leaves, replicas go unassigned. By default, the master waits index.unassigned.node_left.delayed_timeout (one minute) before reallocating, in case the node restarts. Once the timeout expires, recovery starts automatically. Lower the timeout to recover faster; raise it during rolling restarts to suppress unnecessary movement.
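
The timeout is a dynamic index setting, so it can be changed on live indices. The following is a minimal sketch, assuming an unauthenticated cluster on localhost. delayed_timeout_body and set_delayed_timeout are hypothetical helper names, and the 5m value is only an example.

```shell
# Assumption: unauthenticated cluster on localhost; override with ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the settings body (hypothetical helper). $1 = timeout such as 5m or 0
delayed_timeout_body() {
  printf '{"settings":{"index.unassigned.node_left.delayed_timeout":"%s"}}' "$1"
}

# Apply the timeout. $1 = index name or pattern (for example _all), $2 = timeout
set_delayed_timeout() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(delayed_timeout_body "$2")"
}

# Usage:
#   set_delayed_timeout _all 5m   # before a rolling restart
#   set_delayed_timeout _all 1m   # back to the default afterwards
```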

ALLOCATION_FAILED and max retries

If a shard fails allocation, Elasticsearch retries up to index.allocation.max_retries (default 5). After that, the shard stays unassigned indefinitely, even if the root cause is fixed.

After fixing the underlying issue (disk space, hardware, corrupt translog), trigger a retry:

# Reset retry counter and attempt allocation again
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

Warning: retry_failed=true does not bypass allocation rules. If the disk is still full or the filter still mismatches, the retry fails again and burns another retry cycle. Always read the allocation explain output before retrying.

Allocation filters and awareness attributes

Stale index.routing.allocation.require.*, include.*, or exclude.* settings can pin shards to nodes that no longer exist. Forced awareness (cluster.routing.allocation.awareness.force.*) strands replicas when an attribute value is missing from the cluster. Verify node attributes with GET /_cat/nodeattrs?v, then update or remove the offending index setting. This is common after node replacement if the new node advertises different attributes.
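
Setting an index setting to null removes it, which is usually the cleanest way to drop a stale filter. The sketch below assumes an unauthenticated cluster on localhost. The index name my-index and the attribute box_type are placeholders, and the three helper names are invented for this example.

```shell
# Assumption: unauthenticated cluster on localhost; override with ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Show every routing.allocation setting on an index. $1 = index name
show_index_filters() {
  curl -s "$ES_URL/$1/_settings/index.routing.allocation.*?flat_settings=true&pretty"
}

# Build a body that removes one setting by nulling it (hypothetical helper).
# $1 = full setting name
clear_setting_body() {
  printf '{"%s":null}' "$1"
}

# Remove one stale filter from an index. $1 = index name, $2 = full setting name
clear_index_filter() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(clear_setting_body "$2")"
}

# Usage (placeholder index and attribute):
#   show_index_filters my-index
#   clear_index_filter my-index index.routing.allocation.require.box_type
```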

Same-shard rule and shard limits

The same-shard rule forbids two copies of the same shard, primary or replica, from sharing a node. If replica count equals or exceeds the data node count, at least one replica stays unassigned. Reduce replicas or add nodes. Also watch cluster.max_shards_per_node (default 1000 non-frozen shards per data node). The limit applies cluster-wide as that value multiplied by the number of non-frozen data nodes; reaching it rejects requests that would create new shards with a maximum shards open error.
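
The replica arithmetic is simple enough to script. This sketch assumes an unauthenticated cluster on localhost and ignores awareness and filter rules, which can lower the usable count further. max_assignable_replicas, replicas_body, and set_replicas are hypothetical helper names, and my-index is a placeholder.

```shell
# Assumption: unauthenticated cluster on localhost; override with ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Highest replica count the same-shard rule can satisfy: data nodes minus one.
# $1 = number of data nodes
max_assignable_replicas() {
  n=$(( $1 - 1 ))
  if [ "$n" -lt 0 ]; then n=0; fi
  echo "$n"
}

# Build the settings body (hypothetical helper). $1 = replica count
replicas_body() {
  printf '{"index":{"number_of_replicas":%s}}' "$1"
}

# Change the replica count on an index. $1 = index name, $2 = replica count
set_replicas() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(replicas_body "$2")"
}

# Usage (placeholder index, three data nodes):
#   set_replicas my-index "$(max_assignable_replicas 3)"
```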

Last-resort manual allocation

If the only copies of a primary are gone, you may need allocate_empty_primary or allocate_stale_primary. Both require "accept_data_loss": true. These are destructive and should only be used when the original data is provably gone and recovery from snapshot is not faster.
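
Both commands go through the cluster reroute API. The sketch below shows allocate_stale_primary and defaults to dry_run=true, so the request is only simulated until you explicitly pass false. It assumes an unauthenticated cluster on localhost. my-index and node-1 are placeholders, and the helper names are invented for this example.

```shell
# Assumption: unauthenticated cluster on localhost; override with ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the reroute body (hypothetical helper).
# $1 = index name, $2 = shard number, $3 = node that holds the stale copy
stale_primary_body() {
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}' \
    "$1" "$2" "$3"
}

# Promote a stale copy to primary. $4 = dry_run flag, true unless set to false.
allocate_stale_primary() {
  curl -s -X POST "$ES_URL/_cluster/reroute?dry_run=${4:-true}&pretty" \
    -H 'Content-Type: application/json' \
    -d "$(stale_primary_body "$1" "$2" "$3")"
}

# Usage (placeholder index and node):
#   allocate_stale_primary my-index 0 node-1          # simulate only
#   allocate_stale_primary my-index 0 node-1 false    # apply, accepting data loss
```

Writes acknowledged after the stale copy fell out of sync are lost when it is promoted, so run the simulated form first and read the response before applying.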

Prevention

  • Project disk time-to-watermark and expand before the low watermark.
  • Keep replica counts below the data node count.
  • Audit allocation filter settings during node replacements and tier migrations.
  • Watch shard density per node; avoid approaching cluster.max_shards_per_node.
  • During rolling restarts, set cluster.routing.allocation.enable to primaries (or none) so replicas from the restarting node are not rebuilt elsewhere, then set it back to all once the node rejoins.
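
The rolling-restart toggle from the last bullet can be wrapped in a small helper that rejects invalid values before they reach the cluster. This is a minimal sketch, assuming an unauthenticated cluster on localhost and a persistent setting. allocation_enable_body and set_allocation are hypothetical helper names.

```shell
# Assumption: unauthenticated cluster on localhost; override with ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the cluster settings body (hypothetical helper).
# $1 = all, primaries, new_primaries, or none; anything else returns 1.
allocation_enable_body() {
  case "$1" in
    all|primaries|new_primaries|none) ;;
    *) echo "invalid allocation mode: $1" >&2; return 1 ;;
  esac
  printf '{"persistent":{"cluster.routing.allocation.enable":"%s"}}' "$1"
}

# Apply the mode cluster-wide. $1 = mode
set_allocation() {
  body=$(allocation_enable_body "$1") || return 1
  curl -s -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$body"
}

# Usage:
#   set_allocation primaries   # before stopping the node (none is stricter)
#   set_allocation all         # after the node has rejoined
```

Forgetting the second call is a common cause of the "Allocation explicitly disabled" row in Common causes, so pair the two calls in the same runbook step.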

How Netdata helps

  • Correlate unassigned_shards with per-node disk usage to spot the blocking node.
  • Alert on unexpected node count drops, the first sign that NODE_LEFT unassignments will follow.
  • Track JVM heap and old-generation GC pauses; long GC causes node removal and unassigned shards.
  • Correlate relocating_shards spikes with network and disk I/O to spot recovery storms.
  • Track changes to cluster.routing.allocation.enable to catch accidental allocation locks.