Elasticsearch ALLOCATION_FAILED after max retries: reroute and corrupt shard recovery
A shard that repeatedly fails allocation exhausts index.allocation.max_retries (default 5) and stays UNASSIGNED until an operator intervenes. Elasticsearch stops trying to place it automatically. Cluster health is RED if the shard is a primary, YELLOW if it is a replica. While a primary is unassigned, writes and searches that target that shard fail.
This state typically follows transient node restarts, disk pressure events, or translog corruption. After the fifth failed attempt, the allocator stops retrying. The shard will not move without explicit operator action: POST /_cluster/reroute?retry_failed=true for transient failures, or forced allocation with accepted data loss when no valid copy remains.
This guide covers distinguishing transient retry exhaustion from corrupt copies, safe rerouting, and recovery when no valid copy exists.
What this means
Elasticsearch increments a per-shard failure counter each time an allocation attempt fails, for example because of I/O errors, network timeouts during recovery, corrupt segment files, or a disk that fills up mid-recovery. A failed attempt leaves the shard UNASSIGNED with reason ALLOCATION_FAILED. Once the counter reaches index.allocation.max_retries (default 5), the allocator stops attempting to place the shard, and automatic rebalancing and recovery skip it until the counter is reset.
Resetting the counter does not fix the underlying problem. If the failure was transient, retry may succeed. If the failure is persistent, such as a CorruptIndexException or TranslogCorruptedException, the shard will fail again until the corrupt data is removed or a different copy is used.
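The interaction between the counter, the retry limit, and a reset can be illustrated with a toy model. This is a sketch for intuition only, not Elasticsearch code: the class name `ShardRetryState` and its methods are hypothetical, and real allocation involves many more deciders.

```python
from dataclasses import dataclass


@dataclass
class ShardRetryState:
    """Toy model of one shard's retry bookkeeping (hypothetical, not Elasticsearch code)."""
    max_retries: int = 5      # mirrors index.allocation.max_retries
    failed_attempts: int = 0
    assigned: bool = False

    def can_attempt(self) -> bool:
        # The allocator stops trying once the counter reaches the limit.
        return not self.assigned and self.failed_attempts < self.max_retries

    def attempt(self, copy_is_healthy: bool) -> bool:
        """Try one allocation and return True if the shard ends up assigned."""
        if not self.can_attempt():
            return self.assigned
        if copy_is_healthy:
            self.assigned = True
        else:
            self.failed_attempts += 1
        return self.assigned

    def retry_failed(self) -> None:
        # Models POST /_cluster/reroute?retry_failed=true: it resets the counter only.
        self.failed_attempts = 0


def run_until_stuck(state: ShardRetryState, copy_is_healthy: bool) -> int:
    """Attempt allocation until it succeeds or retries are exhausted."""
    attempts = 0
    while state.can_attempt():
        state.attempt(copy_is_healthy)
        attempts += 1
    return attempts
```

In this model, resetting the counter on a shard whose copy is still unhealthy simply buys another five failures, which is the behaviour described above for persistent corruption.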
flowchart TD
    A["Shard UNASSIGNED<br/>reason ALLOCATION_FAILED"] --> B{"Allocation explain"}
    B -->|"Too many failed attempts<br/>no corruption"| C["POST /_cluster/reroute<br/>?retry_failed=true"]
    B -->|"Corrupt translog"| D["Stop node, remove<br/>corrupted translog, restart"]
    B -->|"Corrupt index, no valid replica"| E{"Stale copy exists<br/>on target node?"}
    E -->|"Yes"| F["allocate_stale_primary<br/>accept_data_loss=true"]
    E -->|"No"| G["allocate_empty_primary<br/>accept_data_loss=true"]
    C --> H["Verify shard assigned"]
    D --> H
    F --> H
    G --> H
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Transient failure exhausted retries | Node restart or brief network partition caused 5 failed allocation attempts; shard data is intact | GET /_cluster/allocation/explain for “too many failed allocation attempts” |
| Translog corruption | TranslogCorruptedException or “ignoring recovery of a corrupt translog entry” in node logs | Logs on the node that attempted allocation |
| Corrupt Lucene index | CorruptIndexException during recovery or validation | Node logs for index corruption errors |
| Disk watermark blocking target | Target node above high watermark (90%) and allocator cannot place the shard | GET /_cat/allocation?v for disk percentages |
| Missing or incompatible shard copy | Shard data was partially written or version mismatch prevents use | GET /_cluster/allocation/explain and GET /_shard_stores |
Quick checks
These commands are safe and read-only. Run them before making any changes.
# Cluster health and unassigned shard count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_nodes'
# Unassigned shards with reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED
# Detailed explanation for a specific shard
curl -s 'http://localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '{"index": "<index_name>", "shard": <shard_num>, "primary": true}'
# Disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'
# Active recoveries
curl -s 'http://localhost:9200/_cat/recovery?v&active_only&h=index,shard,stage,source_host,target_host,bytes_percent'
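When many shards are unassigned, it can help to filter the shard list programmatically instead of with grep. The sketch below assumes the cat shards request was made with `format=json` added to the query string, so that each row is a JSON object keyed by the requested column names. The function name `failed_shards` is hypothetical.

```python
import json


def failed_shards(cat_shards_json: str) -> list[dict]:
    """Return the shards that are UNASSIGNED with reason ALLOCATION_FAILED.

    Expects the body of GET /_cat/shards?format=json with the columns
    index, shard, prirep, state, and unassigned.reason.
    """
    rows = json.loads(cat_shards_json)
    return [
        {
            "index": row["index"],
            "shard": int(row["shard"]),        # cat APIs return values as strings
            "primary": row["prirep"] == "p",   # "p" = primary, "r" = replica
        }
        for row in rows
        if row.get("state") == "UNASSIGNED"
        and row.get("unassigned.reason") == "ALLOCATION_FAILED"
    ]
```

Each returned dictionary has the three fields that the allocation explain request body needs, so the output can be fed directly into the detailed check for each shard.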
How to diagnose it
- Confirm the shard state. Use GET /_cat/shards filtered to UNASSIGNED and note the unassigned.reason. If it is ALLOCATION_FAILED, at least one allocation attempt has failed and the shard may have hit the retry limit.
- Run allocation explain. GET /_cluster/allocation/explain with the index, shard number, and primary flag returns the specific error message from the last failed attempt. Look for CorruptIndexException, TranslogCorruptedException, disk watermark denials, or node-leave events.
- Check node logs. On the node that last attempted allocation, search logs for the shard ID. Translog corruption produces TranslogCorruptedException. Lucene corruption produces CorruptIndexException. I/O errors produce IOException.
- Check disk watermarks. If /_cat/allocation shows the target node above the high watermark (90% by default), the allocator rejects new shards. Free disk space before retrying.
- Determine copy availability. If the primary is corrupt, check whether a valid replica exists. If all copies are corrupt, use GET /_shard_stores or allocation explain to identify any node with a stale copy that can be promoted.
- Classify the failure. If the error is transient, retry is likely to succeed. If the error is corruption or persistent I/O failure, retry will fail again and you must repair or replace the copy.
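The classification step can be sketched as a small function over the allocation explain response. This is a rough heuristic under stated assumptions: it expects the response to carry `unassigned_info.details` and `node_allocation_decisions[].deciders[]` with `decider` and `decision` fields, and it treats the decider names `disk_threshold` and `max_retry` as the ones to look for. The function name and the returned labels are hypothetical.

```python
def classify_failure(explain: dict) -> str:
    """Map an allocation explain response to a coarse failure class."""
    info = explain.get("unassigned_info") or {}
    details = info.get("details") or ""

    # Corruption is persistent: a retry alone will fail again.
    if "TranslogCorruptedException" in details:
        return "corrupt_translog"
    if "CorruptIndexException" in details:
        return "corrupt_index"

    # Collect the names of deciders that answered NO on any node.
    blocking = {
        decider.get("decider")
        for node in explain.get("node_allocation_decisions") or []
        for decider in node.get("deciders") or []
        if decider.get("decision") == "NO"
    }
    if "disk_threshold" in blocking:
        return "disk_watermark"
    if "max_retry" in blocking:
        return "retries_exhausted"
    return "unknown"
```

Corruption is checked first because a corrupt shard that has exhausted its retries also shows a blocking retry decider, and the corruption is what determines the fix.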
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Unassigned shard count | Persistent unassigned primaries mean data unavailability | Any primary unassigned for >2 minutes |
| Cluster health status | RED means at least one primary is unassigned, so part of the data is unavailable | Status RED after expected data nodes have joined |
| Disk usage vs watermarks | Watermarks prevent allocation and trigger relocations | Any data node above the 85% low watermark |
| Node count | Unexpected loss causes allocation storms and retry exhaustion | Drop in number_of_data_nodes |
| Thread pool rejections | Saturated nodes cannot complete recovery operations | Sustained write or search rejections |
| JVM heap used percent | Heap pressure precedes GC pauses that trigger node removal | Sustained >85% with rising old GC time |
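Several of the warning signs in the table can be evaluated from the cluster health response and per-node disk percentages. The sketch below assumes the health response contains `status`, `unassigned_shards`, and `number_of_data_nodes`, and that disk usage has already been collected into a mapping of node name to percent used. The function name, the `expected_data_nodes` parameter, and the message strings are hypothetical.

```python
def warning_signs(
    health: dict,
    disk_used_percent: dict[str, float],
    expected_data_nodes: int,
    low_watermark_percent: float = 85.0,
) -> list[str]:
    """Return human-readable warning signs derived from health and disk data."""
    signs = []
    if health.get("status") == "red":
        signs.append("cluster status is RED")
    if health.get("unassigned_shards", 0) > 0:
        signs.append(f"{health['unassigned_shards']} unassigned shard(s)")
    if health.get("number_of_data_nodes", 0) < expected_data_nodes:
        signs.append("data node count below expected")
    # Sort for stable output regardless of dictionary order.
    for node, percent in sorted(disk_used_percent.items()):
        if percent > low_watermark_percent:
            signs.append(f"{node} above low watermark ({percent:.0f}% used)")
    return signs
```

The sketch does not cover the time-based condition (a primary unassigned for more than 2 minutes), which needs state kept between polls.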
Fixes
Retry failed allocations for transient errors
If allocation explain shows no corruption and the failure was transient, reset the retry counter:
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
This resets the failed_allocation_attempts counter to zero for all ALLOCATION_FAILED shards and triggers one allocation round. Verify with GET /_cat/shards or GET /_cluster/allocation/explain.
If the shard fails again immediately, stop and investigate. You can raise index.allocation.max_retries per index, but without fixing the root cause this only prolongs unavailability.
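The same retry call can be issued from a script. The sketch below only builds the request, using the standard library; the function name `build_retry_request` is hypothetical, and the default base URL is the unauthenticated localhost address used in the curl examples, which may not match your cluster.

```python
import urllib.request


def build_retry_request(base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the POST /_cluster/reroute?retry_failed=true request without sending it."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    # No body is needed; the query parameter alone asks for a retry of failed shards.
    return urllib.request.Request(url, method="POST")
```

Passing the result to `urllib.request.urlopen` sends it. Keeping construction separate from sending makes it easy to log or review the exact call before it touches the cluster.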
Recover from corrupt translog
If logs show TranslogCorruptedException, remove the corrupted translog and restart the node. You will lose any uncommitted operations in that translog.
- Stop the Elasticsearch process on the affected node.
- Delete the translog files for the specific shard under its path in the configured data directory.
- Restart the node. Elasticsearch creates a fresh translog and attempts recovery.
As an alternative, the elasticsearch-shard remove-corrupted-data tool removes the corrupted data and prints a suggested POST /_cluster/reroute command with an updated allocation ID, to be run after the node restarts. Stop Elasticsearch on the affected node before running the tool; running it against a live node is unsupported.
Force allocation when no valid copy exists
When all copies of a primary are corrupt or missing and you cannot restore from snapshot, choose between stale or empty allocation. Both require accept_data_loss: true.
Allocate stale primary. Use this when a target node already has a stale copy on disk. Find candidates with GET /_shard_stores or allocation explain.
curl -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_stale_primary": {
      "index": "<index_name>",
      "shard": <shard_num>,
      "node": "<target_node_name>",
      "accept_data_loss": true
    }
  }]
}'
Allocate empty primary. Last resort. Creates an empty primary, discarding all previous data for that shard.
curl -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_empty_primary": {
      "index": "<index_name>",
      "shard": <shard_num>,
      "node": "<target_node_name>",
      "accept_data_loss": true
    }
  }]
}'
Warning: Both commands cause data loss. Prefer snapshot restore whenever available.
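The choice between the two commands can be made explicit in a helper that builds the reroute body. This is a minimal sketch: the function name `build_forced_allocation` and its parameters are hypothetical, and the explicit confirmation flag is a design choice of the sketch, not something Elasticsearch requires beyond accept_data_loss in the body.

```python
import json


def build_forced_allocation(
    index: str,
    shard: int,
    node: str,
    stale_copy_exists: bool,
    confirm_data_loss: bool = False,
) -> dict:
    """Build the /_cluster/reroute body for a forced primary allocation."""
    if not confirm_data_loss:
        # Refuse to build a destructive command unless the caller opts in.
        raise ValueError("forced allocation loses data; pass confirm_data_loss=True")
    # Prefer the stale copy when one exists; an empty primary discards everything.
    command = "allocate_stale_primary" if stale_copy_exists else "allocate_empty_primary"
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


def to_request_body(body: dict) -> bytes:
    """Serialize the body for an HTTP POST with Content-Type: application/json."""
    return json.dumps(body).encode("utf-8")
```

Requiring `confirm_data_loss=True` keeps a script from issuing either command by accident, in the same spirit as the accept_data_loss field.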
Prevention
- Investigate first failures. Do not wait for max retries. If a shard fails allocation once, check allocation explain and node logs immediately.
- Monitor leading indicators. Disk watermark proximity, node departures, and GC pauses are the root causes that drive allocation failures. Alert on them.
- Maintain tested snapshots. If every copy of a shard is corrupt, snapshot restore is the only recovery path that does not discard the shard's history; loss is limited to writes made after the last snapshot.
- Avoid overly aggressive retry limits. Raising index.allocation.max_retries masks disk, network, or hardware problems.
- Watch translog size. Large translogs slow recovery and increase exposure to corruption. Monitor translog size and flush performance.
How Netdata helps
- Correlate unassigned shards with node pressure. Per-node JVM heap, GC latency, disk utilization, and I/O wait charts help distinguish corruption from node saturation.
- Alert on cluster health. Track cluster state, unassigned shard counts, and node count changes to surface ALLOCATION_FAILED without manual polling.
- Catch the precursor. Sustained heap pressure and rising old GC times precede node drops that trigger allocation retries. Composite alerts on heap plus GC catch the cascade early.
- Disk watermark visibility. Per-node disk usage charts show when a node approaches the 85% low watermark before the allocator starts rejecting placements.
Related guides
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
- Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert
- Elasticsearch long GC pauses: old-generation stop-the-world and node drops
- Elasticsearch node OOM-killed: heap ceiling, page cache, and container limits
- Elasticsearch unassigned shards: reading allocation explain and fixing each reason
- How Elasticsearch actually works in production: a mental model for operators







