Elasticsearch ALLOCATION_FAILED after max retries: reroute and corrupt shard recovery

A shard that repeatedly fails allocation exhausts index.allocation.max_retries (default 5) and stays UNASSIGNED until an operator intervenes. Elasticsearch stops automatic placement. Cluster health is RED if the shard is a primary, YELLOW if it is a replica. Writes routed to the affected shard fail until its primary is assigned.

This state typically follows transient node restarts, disk pressure events, or translog corruption. After the fifth failed attempt, the allocator stops retrying. The shard will not move without explicit operator action: POST /_cluster/reroute?retry_failed=true for transient failures, or forced allocation with accepted data loss when no valid copy remains.

This guide covers distinguishing transient retry exhaustion from corrupt copies, safe rerouting, and recovery when no valid copy exists.

What this means

Elasticsearch’s allocator increments a per-shard failure counter when placement fails due to I/O errors, network timeouts, corrupt segment files, or disk watermark blocks. When the counter reaches index.allocation.max_retries (default 5), the shard is marked UNASSIGNED with reason ALLOCATION_FAILED. Automatic rebalancing and recovery skip it until the counter is reset.

Resetting the counter does not fix the underlying problem. If the failure was transient, retry may succeed. If the failure is persistent, such as a CorruptIndexException or TranslogCorruptedException, the shard will fail again until the corrupt data is removed or a different copy is used.
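
To see how close a shard is to the limit, the attempt counter can be read from the allocation explain response. This is a minimal sketch, assuming the response carries `unassigned_info.failed_allocation_attempts` and that `jq` is not installed, so the value is extracted with `sed`. The function name `failed_attempts` and the sample JSON are illustrative, not real cluster output.

```shell
# Hypothetical helper: extract failed_allocation_attempts from an
# allocation explain JSON document read on stdin.
# Assumes the field appears once and holds an integer.
failed_attempts() {
  tr -d '\n' | sed -n 's/.*"failed_allocation_attempts"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Trimmed, illustrative response (not captured from a real cluster)
sample='{"index":"logs","shard":0,"primary":true,"unassigned_info":{"reason":"ALLOCATION_FAILED","failed_allocation_attempts":5}}'
attempts=$(printf '%s' "$sample" | failed_attempts)
echo "failed attempts: $attempts"
```

Against a live cluster, the same function could be fed from the allocation explain call shown under Quick checks by piping its output into `failed_attempts`.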

flowchart TD
    A["Shard UNASSIGNED<br/>reason ALLOCATION_FAILED"] --> B{Allocation explain}
    B -->|"Too many failed attempts<br/>no corruption"| C["POST /_cluster/reroute<br/>?retry_failed=true"]
    B -->|Corrupt translog| D["Stop node, remove<br/>corrupted translog, restart"]
    B -->|"Corrupt index, no valid replica"| E{"Stale copy exists<br/>on target node?"}
    E -->|Yes| F["allocate_stale_primary<br/>accept_data_loss=true"]
    E -->|No| G["allocate_empty_primary<br/>accept_data_loss=true"]
    C --> H[Verify shard assigned]
    D --> H
    F --> H
    G --> H

Common causes

Cause	What it looks like	First thing to check
Transient failure exhausted retries	Node restart or brief network partition caused 5 failed allocation attempts; shard data is intact	`GET /_cluster/allocation/explain` for “too many failed allocation attempts”
Translog corruption	`TranslogCorruptedException` or “ignoring recovery of a corrupt translog entry” in node logs	Logs on the node that attempted allocation
Corrupt Lucene index	`CorruptIndexException` during recovery or validation	Node logs for index corruption errors
Disk watermark blocking target	Target node above high watermark (90%) and allocator cannot place the shard	`GET /_cat/allocation?v` for disk percentages
Missing or incompatible shard copy	Shard data was partially written or version mismatch prevents use	`GET /_cluster/allocation/explain` and `GET /_shard_stores`

Quick checks

These commands are safe and read-only. Run them before making any changes.

# Cluster health and unassigned shard count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_nodes'

# Unassigned shards with reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED

# Detailed explanation for a specific shard
curl -s 'http://localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '{"index": "<index_name>", "shard": <shard_num>, "primary": true}'

# Disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Active recoveries
curl -s 'http://localhost:9200/_cat/recovery?v&active_only&h=index,shard,stage,source_host,target_host,bytes_percent'

How to diagnose it

1. Confirm the shard state. Use GET /_cat/shards filtered to UNASSIGNED and note the unassigned.reason. ALLOCATION_FAILED means at least one allocation attempt failed; a shard that stays in this state has most likely hit the retry limit, which the next step confirms.
2. Run allocation explain. GET /_cluster/allocation/explain with the index, shard number, and primary flag returns the specific error message from the last failed attempt. Look for CorruptIndexException, TranslogCorruptedException, disk watermark denials, or node-leave events.
3. Check node logs. On the node that last attempted allocation, search logs for the shard ID. Translog corruption produces TranslogCorruptedException. Lucene corruption produces CorruptIndexException. I/O errors produce IOException.
4. Check disk watermarks. If /_cat/allocation shows the target node above the high watermark (90% by default), the allocator rejects new shards. Free disk space before retrying.
5. Determine copy availability. If the primary is corrupt, check whether a valid replica exists. If all copies are corrupt, use GET /_shard_stores or allocation explain to identify any node with a stale copy that can be promoted.
6. Classify the failure. If the error is transient, retry is likely to succeed. If the error is corruption or persistent I/O failure, retry will fail again and you must repair or replace the copy.
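
The classification step can be sketched as a small shell function that maps an error message to one of the recovery paths in this guide. The function name `classify_failure`, the category labels, and the sample messages are hypothetical. The fallback category only means that no known corruption or disk marker was found, not that a retry is guaranteed to work.

```shell
# Hypothetical helper: map error text from allocation explain or node logs
# to the recovery path described in this guide.
classify_failure() {
  case "$1" in
    *TranslogCorruptedException*) echo "corrupt_translog" ;;
    *CorruptIndexException*)      echo "corrupt_index" ;;
    *watermark*)                  echo "disk_watermark" ;;
    *)                            echo "retry_candidate" ;;
  esac
}

# Illustrative messages, not copied from a real cluster
classify_failure 'failed recovery, failure TranslogCorruptedException[translog corruption]'
classify_failure 'shard has exceeded the maximum number of retries [5] on failed allocation attempts'
```

A `retry_candidate` result points at the retry_failed reroute, `corrupt_translog` at translog recovery, and `corrupt_index` at replica promotion, snapshot restore, or forced allocation.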

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Unassigned shard count	Persistent unassigned primaries mean data unavailability	Any primary unassigned for >2 minutes
Cluster health status	RED means at least one primary shard is unassigned, so part of the data is unavailable	Status RED after expected data nodes have joined
Disk usage vs watermarks	Watermarks prevent allocation and trigger relocations	Any data node above the 85% low watermark
Node count	Unexpected loss causes allocation storms and retry exhaustion	Drop in `number_of_data_nodes`
Thread pool rejections	Saturated nodes cannot complete recovery operations	Sustained `write` or `search` rejections
JVM heap used percent	Heap pressure precedes GC pauses that trigger node removal	Sustained >85% with rising old GC time
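
As one way to act on the disk signal, the following sketch flags nodes at or above a disk usage threshold. It assumes input in the form produced by `GET /_cat/allocation?h=node,disk.percent` (node name, then integer percentage) and node names without spaces. The function name `nodes_over_watermark` and the node names are hypothetical.

```shell
# Hypothetical helper: read "node disk.percent" pairs on stdin and print the
# nodes at or above a threshold (default 85, the default low watermark).
# Rows without a numeric percentage, such as the UNASSIGNED row, are skipped.
nodes_over_watermark() {
  awk -v limit="${1:-85}" '$2 ~ /^[0-9]+$/ && $2 + 0 >= limit + 0 { print $1 }'
}

# Illustrative input in the shape of:
#   curl -s 'http://localhost:9200/_cat/allocation?h=node,disk.percent'
printf 'es-data-1 62\nes-data-2 91\nes-data-3 85\n' | nodes_over_watermark
```

Passing `90` as the first argument checks against the default high watermark instead.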

Fixes

Retry failed allocations for transient errors

If allocation explain shows no corruption and the failure was transient, reset the retry counter:

curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

This resets the failed_allocation_attempts counter to zero for all ALLOCATION_FAILED shards and triggers one allocation round. Verify with GET /_cat/shards or GET /_cluster/allocation/explain.

If the shard fails again immediately, stop and investigate. You can raise index.allocation.max_retries per index, but without fixing the root cause this only prolongs unavailability.
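
To check whether the retry helped, the remaining ALLOCATION_FAILED shards can be counted from `_cat/shards` output. This is a sketch that assumes the columns were requested as `h=index,shard,prirep,state,unassigned.reason` without the `v` header flag. The function name `count_allocation_failed` and the sample rows are hypothetical.

```shell
# Hypothetical helper: count ALLOCATION_FAILED shards in _cat/shards output
# requested with h=index,shard,prirep,state,unassigned.reason (no header row).
count_allocation_failed() {
  awk '$4 == "UNASSIGNED" && $5 == "ALLOCATION_FAILED" { n++ } END { print n + 0 }'
}

# Illustrative rows, not real cluster output
printf 'logs 0 p UNASSIGNED ALLOCATION_FAILED\nlogs 0 r UNASSIGNED ALLOCATION_FAILED\nlogs 1 p STARTED\n' \
  | count_allocation_failed
```

If the count drops after the reroute and then climbs back to its earlier value, the failure is probably persistent rather than transient.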

Recover from corrupt translog

If logs show TranslogCorruptedException, remove the corrupted translog and restart the node. You will lose any uncommitted operations in that translog.

1. Stop the Elasticsearch process on the affected node.
2. Delete the translog files for the specific shard under its path in the configured data directory.
3. Restart the node. Elasticsearch creates a fresh translog and attempts recovery.

As an alternative, the elasticsearch-shard remove-corrupted-data tool can clean corruption and emit a suggested POST /_cluster/reroute command with an updated allocation ID. Stop Elasticsearch before running this tool. Running it against a live node is unsupported.
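
Because the tool modifies shard data on disk, it can help to print the exact invocation for review before running it. The sketch below only prints the command. It assumes the tool's `--index` and `--shard-id` options and an install path taken from `ES_HOME`, with `/usr/share/elasticsearch` as an assumed default. The wrapper name `shard_tool_command` and the index name are hypothetical.

```shell
# Hypothetical wrapper: print the elasticsearch-shard invocation for one
# shard instead of running it, so it can be reviewed first.
# ES_HOME falls back to an assumed install path; adjust for your node.
shard_tool_command() {
  printf '%s/bin/elasticsearch-shard remove-corrupted-data --index %s --shard-id %s\n' \
    "${ES_HOME:-/usr/share/elasticsearch}" "$1" "$2"
}

# Illustrative index name and shard number
shard_tool_command logs-2024.01 0
```

Run the printed command on the stopped node as the user that owns the data directory, then apply the reroute command the tool suggests.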

Force allocation when no valid copy exists

When all copies of a primary are corrupt or missing and you cannot restore from snapshot, choose between stale or empty allocation. Both require accept_data_loss: true.

Allocate stale primary. Use this when a target node already has a stale copy on disk. Find candidates with GET /_shard_stores or allocation explain.

curl -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_stale_primary": {
      "index": "<index_name>",
      "shard": <shard_num>,
      "node": "<target_node_name>",
      "accept_data_loss": true
    }
  }]
}'

Allocate empty primary. Last resort. Creates an empty primary, discarding all previous data for that shard.

curl -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_empty_primary": {
      "index": "<index_name>",
      "shard": <shard_num>,
      "node": "<target_node_name>",
      "accept_data_loss": true
    }
  }]
}'

Warning: Both commands cause data loss. Prefer snapshot restore whenever available.
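
Given the data loss involved, one cautious pattern is to build the request body once, submit it with the reroute API's `dry_run=true` parameter, and only then send it for real. The following is a sketch under that assumption. The function name `forced_allocation_body`, the index name, and the node name are hypothetical, and the function does no escaping of its arguments.

```shell
# Hypothetical helper: build the reroute body for a forced primary allocation.
# $1 must be "stale" or "empty"; $2 index, $3 shard number, $4 node name.
forced_allocation_body() {
  case "$1" in
    stale|empty) ;;
    *) echo "usage: forced_allocation_body stale|empty <index> <shard> <node>" >&2
       return 1 ;;
  esac
  printf '{"commands":[{"allocate_%s_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3" "$4"
}

# Illustrative values
body=$(forced_allocation_body stale logs-2024.01 0 es-data-2)
echo "$body"

# Review the simulated result first, then resend without dry_run:
# curl -s -X POST 'http://localhost:9200/_cluster/reroute?dry_run=true' \
#   -H 'Content-Type: application/json' -d "$body"
```

Rejecting anything other than `stale` or `empty` keeps a typo from producing a malformed command name in a request that discards data.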

Prevention

Investigate first failures. Do not wait for max retries. If a shard fails allocation once, check allocation explain and node logs immediately.
Monitor leading indicators. Disk watermark proximity, node departures, and GC pauses are the root causes that drive allocation failures. Alert on them.
Maintain tested snapshots. When every copy of a shard is corrupt, snapshot restore is the only recovery path that keeps the shard's data, and even then writes made after the snapshot are lost.
Avoid overly aggressive retry limits. Raising index.allocation.max_retries masks disk, network, or hardware problems.
Watch translog size. Large translogs slow recovery and increase exposure to corruption. Monitor translog size and flush performance.
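
For environments without a monitoring agent, a scheduled check can serve as a basic early warning. This is a minimal sketch, assuming JSON in the shape returned by `/_cluster/health?filter_path=status,unassigned_shards`. The function name `health_is_not_red` and the alert message are hypothetical.

```shell
# Hypothetical check: read cluster health JSON on stdin, print the status,
# and return non-zero unless the status is green or yellow.
health_is_not_red() {
  status=$(tr -d '\n' | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p')
  echo "cluster status: ${status:-unknown}"
  [ "$status" = "green" ] || [ "$status" = "yellow" ]
}

# Illustrative response; a live check would pipe curl output in instead
printf '%s' '{"status":"red","unassigned_shards":3}' \
  | health_is_not_red || echo "alert: investigate unassigned primaries"
```

An unparseable or empty response is treated the same as RED, so an unreachable cluster also raises the alert.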

How Netdata helps

Correlate unassigned shards with node pressure. Per-node JVM heap, GC latency, disk utilization, and I/O wait charts help distinguish corruption from node saturation.
Alert on cluster health. Track cluster state, unassigned shard counts, and node count changes to surface ALLOCATION_FAILED without manual polling.
Catch the precursor. Sustained heap pressure and rising old GC times precede node drops that trigger allocation retries. Composite alerts on heap plus GC catch the cascade early.
Disk watermark visibility. Per-node disk usage charts show when a node approaches the 85% low watermark before the allocator starts rejecting placements.
