Elasticsearch ALLOCATION_FAILED after max retries: reroute and corrupt shard recovery

A shard that repeatedly fails allocation exhausts index.allocation.max_retries (default 5) and stays UNASSIGNED until an operator intervenes. Elasticsearch stops trying to place it automatically. Cluster health is RED if the shard is a primary, YELLOW if it is a replica. While a primary is unassigned, reads and writes that target that shard fail.

This state typically follows transient node restarts, disk pressure events, or translog corruption. After the fifth failed attempt, the allocator stops retrying. The shard will not move without explicit operator action: POST /_cluster/reroute?retry_failed=true for transient failures, or forced allocation with accepted data loss when no valid copy remains.

This guide covers distinguishing transient retry exhaustion from corrupt copies, safe rerouting, and recovery when no valid copy exists.

What this means

Elasticsearch’s allocator increments a per-shard failure counter when placement fails due to I/O errors, network timeouts, corrupt segment files, or disk watermark blocks. When the counter reaches index.allocation.max_retries (default 5), the shard is marked UNASSIGNED with reason ALLOCATION_FAILED. Automatic rebalancing and recovery skip it until the counter is reset.

Resetting the counter does not fix the underlying problem. If the failure was transient, retry may succeed. If the failure is persistent, such as a CorruptIndexException or TranslogCorruptedException, the shard will fail again until the corrupt data is removed or a different copy is used.

flowchart TD
    A["Shard UNASSIGNED<br/>reason ALLOCATION_FAILED"] --> B{Allocation explain}
    B -->|Too many failed attempts<br/>no corruption| C["POST /_cluster/reroute<br/>?retry_failed=true"]
    B -->|Corrupt translog| D["Stop node, remove<br/>corrupted translog, restart"]
    B -->|Corrupt index, no valid replica| E{"Stale copy exists<br/>on target node?"}
    E -->|Yes| F["allocate_stale_primary<br/>accept_data_loss=true"]
    E -->|No| G["allocate_empty_primary<br/>accept_data_loss=true"]
    C --> H[Verify shard assigned]
    D --> H
    F --> H
    G --> H

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Transient failure exhausted retries | Node restart or brief network partition caused 5 failed allocation attempts; shard data is intact | GET /_cluster/allocation/explain for “too many failed allocation attempts” |
| Translog corruption | TranslogCorruptedException or “ignoring recovery of a corrupt translog entry” in node logs | Logs on the node that attempted allocation |
| Corrupt Lucene index | CorruptIndexException during recovery or validation | Node logs for index corruption errors |
| Disk watermark blocking target | Target node above high watermark (90%) and allocator cannot place the shard | GET /_cat/allocation?v for disk percentages |
| Missing or incompatible shard copy | Shard data was partially written or version mismatch prevents use | GET /_cluster/allocation/explain and GET /_shard_stores |

Quick checks

These commands are safe and read-only. Run them before making any changes.

# Cluster health and unassigned shard count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_nodes'

# Unassigned shards with reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED

# Detailed explanation for a specific shard
curl -s 'http://localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '{"index": "<index_name>", "shard": <shard_num>, "primary": true}'

# Disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Active recoveries
curl -s 'http://localhost:9200/_cat/recovery?v&active_only&h=index,shard,stage,source_host,target_host,bytes_percent'
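
The disk usage check can be narrowed to the nodes that matter. The sketch below is a minimal filter, assuming the output comes from `_cat/allocation` with the columns requested as `h=disk.percent,node`; the function name `nodes_over_watermark` and the default threshold of 85 (the low watermark cited in this guide) are illustrative choices, not part of Elasticsearch.

```shell
# Print "node disk_percent" for every node at or above a disk usage threshold.
# Expects `_cat/allocation?h=disk.percent,node` output on stdin:
# one "<percent> <node>" pair per line.
nodes_over_watermark() {
  now_threshold=${1:-85}   # assumed default: the 85% low watermark
  awk -v limit="$now_threshold" '
    # Skip rows without a numeric percentage, such as the UNASSIGNED row.
    NF >= 2 && $1 ~ /^[0-9]+$/ && $1 + 0 >= limit + 0 { print $2, $1 }
  '
}

# Usage against a live cluster (read-only, not executed here):
#   curl -s 'http://localhost:9200/_cat/allocation?h=disk.percent,node' | nodes_over_watermark 85
```

An empty result means no node is at or above the threshold, so disk pressure is unlikely to be what blocks allocation.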

How to diagnose it

  1. Confirm the shard state. Use GET /_cat/shards filtered to UNASSIGNED and note the unassigned.reason. If it is ALLOCATION_FAILED, the shard has hit the retry limit.
  2. Run allocation explain. GET /_cluster/allocation/explain with the index, shard number, and primary flag returns the specific error message from the last failed attempt. Look for CorruptIndexException, TranslogCorruptedException, disk watermark denials, or node-leave events.
  3. Check node logs. On the node that last attempted allocation, search logs for the shard ID. Translog corruption produces TranslogCorruptedException. Lucene corruption produces CorruptIndexException. I/O errors produce IOException.
  4. Check disk watermarks. If /_cat/allocation shows the target node above the low watermark (85% by default), the allocator stops placing new shard copies on it; above the high watermark (90% by default), it also starts relocating shards away. Free disk space before retrying.
  5. Determine copy availability. If the primary is corrupt, check whether a valid replica exists. If all copies are corrupt, use GET /_shard_stores or allocation explain to identify any node with a stale copy that can be promoted.
  6. Classify the failure. If the error is transient, retry is likely to succeed. If the error is corruption or persistent I/O failure, retry will fail again and you must repair or replace the copy.
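
The classification step can be roughed out as a text match over the allocation explain response. This is a heuristic sketch, not an authoritative parser: it assumes the exception class names from steps 2 and 3 appear verbatim somewhere in the response, and it treats an ALLOCATION_FAILED shard with no corruption marker as a retry candidate. The function name and the category labels are made up for illustration.

```shell
# Classify an allocation explain response read from stdin.
# Corruption markers are checked first, because a shard that hit the retry
# limit reports ALLOCATION_FAILED regardless of the underlying error.
classify_allocation_failure() {
  caf_text=$(cat)
  case "$caf_text" in
    *TranslogCorruptedException*) echo corrupt_translog ;;
    *CorruptIndexException*)      echo corrupt_index ;;
    *ALLOCATION_FAILED*)          echo retry_candidate ;;
    *)                            echo unknown ;;
  esac
}

# Usage against a live cluster (read-only, not executed here):
#   curl -s 'http://localhost:9200/_cluster/allocation/explain' \
#     -H 'Content-Type: application/json' \
#     -d '{"index": "<index_name>", "shard": <shard_num>, "primary": true}' \
#     | classify_allocation_failure
```

A `retry_candidate` result only means no corruption marker was found in the response; confirm against the node logs before retrying.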

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Unassigned shard count | Persistent unassigned primaries mean data unavailability | Any primary unassigned for >2 minutes |
| Cluster health status | RED means at least one primary shard is unassigned and its data is unavailable | Status RED after expected data nodes have joined |
| Disk usage vs watermarks | Watermarks prevent allocation and trigger relocations | Any data node above 85% low watermark |
| Node count | Unexpected loss causes allocation storms and retry exhaustion | Drop in number_of_data_nodes |
| Thread pool rejections | Saturated nodes cannot complete recovery operations | Sustained write or search rejections |
| JVM heap used percent | Heap pressure precedes GC pauses that trigger node removal | Sustained >85% with rising old GC time |

Fixes

Retry failed allocations for transient errors

If allocation explain shows no corruption and the failure was transient, reset the retry counter:

curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

This resets the failed_allocation_attempts counter to zero for all ALLOCATION_FAILED shards and triggers one allocation round. Verify with GET /_cat/shards or GET /_cluster/allocation/explain.
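
To verify the retry, one option is to list the shards that are still stuck. The sketch below assumes `_cat/shards` output with the columns requested as `h=index,shard,prirep,state,unassigned.reason`; the function name `still_failed_shards` is illustrative.

```shell
# Print "index shard prirep" for each shard that is still UNASSIGNED
# with reason ALLOCATION_FAILED.
# Expects `_cat/shards?h=index,shard,prirep,state,unassigned.reason` on stdin.
still_failed_shards() {
  awk '$4 == "UNASSIGNED" && $5 == "ALLOCATION_FAILED" { print $1, $2, $3 }'
}

# Usage against a live cluster (read-only, not executed here):
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
#     | still_failed_shards
```

No output means no shard is currently held back by the retry limit. Shards may still be initializing, so check /_cat/recovery as well.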

If the shard fails again immediately, stop and investigate. You can raise index.allocation.max_retries per index, but without fixing the root cause this only prolongs unavailability.

Recover from corrupt translog

If logs show TranslogCorruptedException, remove the corrupted translog and restart the node. You will lose any uncommitted operations in that translog.

  1. Stop the Elasticsearch process on the affected node.
  2. Delete the translog files for the specific shard under its path in the configured data directory.
  3. Restart the node. Elasticsearch creates a fresh translog and attempts recovery.

As an alternative, the elasticsearch-shard remove-corrupted-data tool can remove the corrupted data and print a suggested POST /_cluster/reroute command with an updated allocation ID. Stop Elasticsearch on the affected node before running the tool; running it against a live node is unsupported.

Force allocation when no valid copy exists

When all copies of a primary are corrupt or missing and you cannot restore from snapshot, choose between stale or empty allocation. Both require accept_data_loss: true.

Allocate stale primary. Use this when a target node already has a stale copy on disk. Find candidates with GET /_shard_stores or allocation explain.

curl -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_stale_primary": {
      "index": "<index_name>",
      "shard": <shard_num>,
      "node": "<target_node_name>",
      "accept_data_loss": true
    }
  }]
}'

Allocate empty primary. Last resort. Creates an empty primary, discarding all previous data for that shard.

curl -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_empty_primary": {
      "index": "<index_name>",
      "shard": <shard_num>,
      "node": "<target_node_name>",
      "accept_data_loss": true
    }
  }]
}'

Warning: Both commands cause data loss. Prefer snapshot restore whenever available.
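
Because both commands are destructive, it can help to build the request body once and preview it with the reroute API's dry_run query parameter before sending it for real. The helper below is a sketch under assumptions: the name `forced_allocation_body` and the example values `my-index` and `node-1` are hypothetical, and the index and node names are assumed not to need JSON escaping.

```shell
# Print the JSON body for a forced-allocation reroute command.
# Arguments: <command> <index> <shard_number> <node_name>
forced_allocation_body() {
  fa_command=$1 fa_index=$2 fa_shard=$3 fa_node=$4
  # Only the two forced-allocation commands covered above are accepted.
  case "$fa_command" in
    allocate_stale_primary|allocate_empty_primary) ;;
    *) echo "unsupported command: $fa_command" >&2; return 1 ;;
  esac
  # The shard number is emitted unquoted, so it must be a plain integer.
  case "$fa_shard" in
    ''|*[!0-9]*) echo "shard must be a non-negative integer" >&2; return 1 ;;
  esac
  printf '{"commands":[{"%s":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' \
    "$fa_command" "$fa_index" "$fa_shard" "$fa_node"
}

# Preview without changing cluster state (not executed here):
#   forced_allocation_body allocate_stale_primary my-index 0 node-1 \
#     | curl -s -X POST 'http://localhost:9200/_cluster/reroute?dry_run=true' \
#         -H 'Content-Type: application/json' -d @-
```

Dropping `?dry_run=true` from the URL applies the command, with the data loss described above.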

Prevention

  • Investigate first failures. Do not wait for max retries. If a shard fails allocation once, check allocation explain and node logs immediately.
  • Monitor leading indicators. Disk watermark proximity, node departures, and GC pauses are the root causes that drive allocation failures. Alert on them.
  • Maintain tested snapshots. If corruption hits every copy of a shard, snapshot restore is the only recovery path that avoids forced allocation; you lose only the writes made after the snapshot.
  • Avoid overly aggressive retry limits. Raising index.allocation.max_retries masks disk, network, or hardware problems.
  • Watch translog size. Large translogs slow recovery and increase exposure to corruption. Monitor translog size and flush performance.

How Netdata helps

  • Correlate unassigned shards with node pressure. Per-node JVM heap, GC latency, disk utilization, and I/O wait charts help distinguish corruption from node saturation.
  • Alert on cluster health. Track cluster state, unassigned shard counts, and node count changes to surface ALLOCATION_FAILED without manual polling.
  • Catch the precursor. Sustained heap pressure and rising old GC times precede node drops that trigger allocation retries. Composite alerts on heap plus GC catch the cascade early.
  • Disk watermark visibility. Per-node disk usage charts show when a node approaches the 85% low watermark before the allocator starts rejecting placements.