Elasticsearch shard recovery stuck: throttling, translog replay, and concurrent limits

You restart a data node during a rollout. An hour later the cluster is still yellow. You check _cat/recovery and see a replica that has been pinned at 38 percent the whole time. The network is not saturated, the source node is healthy, and there are no obvious errors in the logs, yet the shard refuses to reach the STARTED state.

Stuck recovery usually comes down to four constraints: a bandwidth throttle capping transfer speed, a large translog that must replay sequentially, a concurrent recovery limit serializing work, or a disk watermark on the target node blocking allocation entirely. Less often, a corrupt shard copy sends recovery into a retry loop.

What this means

When Elasticsearch moves a shard, the recovery process copies Lucene segments from a source node to a target and replays the translog to bring the copy up to date. Until this finishes, the shard is not searchable and cluster health stays yellow or red. A recovery is stuck when the same stage and percentage sit unchanged for longer than the expected window, typically more than thirty minutes, or when it cycles through repeated retries. The throttle defaults to 40 MiB/s per node, concurrent recoveries per node default to two, and large uncommitted translogs can add hours of replay because replay is single-threaded per shard. Recovery traffic also competes with production indexing and search I/O, so a cluster already near disk or network limits may stall further.
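To put the 40 MiB/s default in perspective, the sketch below estimates a rough floor on segment-copy time for one shard that has the whole node budget to itself. The helper name min_copy_seconds and the 50 GiB shard size are invented for illustration; real recoveries take longer because of translog replay and competing I/O.

```shell
# min_copy_seconds SHARD_BYTES THROTTLE_MIB_PER_SEC
# Hypothetical helper: lower bound on copy time, rounded up to whole seconds.
min_copy_seconds() {
  throttle_bytes=$(( $2 * 1024 * 1024 ))
  echo $(( ($1 + throttle_bytes - 1) / throttle_bytes ))
}

# Example: a 50 GiB shard at the default 40 MiB/s throttle
min_copy_seconds $(( 50 * 1024 * 1024 * 1024 )) 40   # prints 1280 (about 21 minutes)
```

If a shard of that size has sat in the INDEX stage for hours, the throttle alone probably does not explain the delay.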

flowchart TD
    A[Stuck recovery in _cat/recovery] --> B{Stage?}
    B -->|INIT| C{Target disk >85%?}
    C -->|Yes| D[Disk watermark blocks allocation]
    C -->|No| E[Concurrent limit or delay timeout]
    B -->|INDEX| F{Bytes moving?}
    F -->|Yes| G[Bandwidth throttle or I/O bound]
    F -->|No| H[Corrupt shard or retry loop]
    B -->|TRANSLOG| I[Large translog replay]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Bandwidth throttling | bytes_percent creeps slowly; throughput stays below network capacity | GET /_cluster/settings?filter_path=*.indices.recovery.max_bytes_per_sec |
| Large translog replay | Recovery stuck in TRANSLOG stage; translog_ops_percent barely moves | GET /_nodes/stats/indices/translog for uncommitted size |
| Concurrent recovery limit | Recovery sits in INIT for more than 10 minutes while other recoveries run | GET /_cat/recovery?active_only count versus node limit |
| Disk watermark block | Recovery never leaves INIT; shard is unassigned | GET /_cat/allocation?v for target node disk percent |
| Retry loop from corruption | Recovery restarts repeatedly at low percentage; logs show shard failures | GET /_cluster/allocation/explain and node logs |

Quick checks

Run these read-only commands to triage.

# List active recoveries and their stages
curl -s 'http://localhost:9200/_cat/recovery?v&active_only&h=index,shard,time,type,stage,source_host,target_host,bytes_percent,translog_ops_percent'

# Explain why a specific shard is not allocating
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'

# Check disk usage per node for watermark blocks
curl -s 'http://localhost:9200/_cat/allocation?v'

# View current recovery throttle and concurrency defaults
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.indices.recovery.max_bytes_per_sec,*.cluster.routing.allocation.node_concurrent_recoveries'

# Check translog size on nodes
curl -s 'http://localhost:9200/_nodes/stats/indices/translog?filter_path=nodes.*.indices.translog'

# Check source and target node health
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m,disk.used_percent'

How to diagnose it

  1. Confirm the recovery is stuck. Run the _cat/recovery?active_only command, wait five minutes, then run it again. If neither bytes_percent nor translog_ops_percent changed, the recovery is blocked.
  2. Identify the stage. INDEX means segment files are copying. TRANSLOG means operations are replaying. INIT means the recovery is queued or preparing.
  3. If the stage is INIT and the recovery has been there for more than ten minutes, check the target node disk in _cat/allocation. If the node is above the low watermark (85 percent by default), Elasticsearch will not allocate new shards there. If it is above the flood stage (95 percent), indices are read-only and recovery cannot start.
  4. If the stage is INDEX and moving slowly, calculate actual throughput. Record bytes_recovered, wait five minutes, and subtract. The default indices.recovery.max_bytes_per_sec is 40 MiB/s per node. If multiple recoveries share the source or target, they split that budget. Also check disk I/O wait and network utilization on both nodes. Recovery traffic competes with production traffic, so a busy cluster may stall even below the theoretical throttle.
  5. If the stage is TRANSLOG, check the translog size. In _nodes/stats/indices/translog, large uncommitted_size_in_bytes means a long sequential replay. A translog larger than 5 GB per shard makes recovery extremely slow.
  6. Look for retry loops. If the recovery percentage resets frequently, use _cluster/allocation/explain to see if the shard is failing with ALLOCATION_FAILED. Check node logs for TranslogCorruptedException or FlushNotAllowedException.
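The throughput check in step 4 can be reduced to a small calculation. This sketch assumes you sampled bytes_recovered twice as raw byte counts (for example by adding the bytes_recovered column and bytes=b to the _cat/recovery call); recovery_mib_per_sec is a hypothetical helper name and the sample values are invented.

```shell
# recovery_mib_per_sec BYTES_FIRST BYTES_SECOND ELAPSED_SECONDS
# Hypothetical helper: average recovery throughput in MiB/s between two samples.
recovery_mib_per_sec() {
  awk -v a="$1" -v b="$2" -v t="$3" \
    'BEGIN { printf "%.1f\n", (b - a) / t / 1048576 }'
}

# Example: 10240 MiB recovered at the first sample, 16000 MiB five minutes later
recovery_mib_per_sec 10737418240 16777216000 300   # prints 19.2
```

A result well under 40 MiB/s with a single active recovery points at disk or network contention rather than the throttle; a result near 40 MiB/s suggests the throttle is the limit.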

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| bytes_percent in _cat/recovery | Measures segment copy progress | Flat for more than 30 minutes |
| translog_ops_percent | Measures translog replay progress | Flat or advancing slowly for more than 30 minutes |
| Disk usage vs watermark | Determines if allocation is blocked | Target node above 85 percent |
| uncommitted_size_in_bytes per shard | Indicates replay duration | Greater than 5 GB per shard |
| Recovery stage | Distinguishes copy, replay, or queue | INIT or INDEX stage longer than 10 minutes |
| Node I/O wait | Reveals resource contention | Rising during active recovery |

Fixes

Bandwidth throttling

If the recovery is making progress but slowly, raise indices.recovery.max_bytes_per_sec temporarily via a transient cluster setting. The default is 40 MiB/s per node, shared across all concurrent recoveries on that node. Raising it speeds up copying but consumes network and disk bandwidth that production traffic may need. Verify network and disk headroom on both source and target before applying.

# Temporarily raise throttle (example: 100 MiB/s)
# Reset to default after recovery completes
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}'
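Once recovery completes, the override can be removed by setting the same key to null, which restores the default. A minimal sketch, assuming the same localhost endpoint as the examples above; ES_URL, RESET_BODY, and reset_recovery_throttle are illustrative names, not part of Elasticsearch.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# JSON body that removes the transient override; null restores the default.
RESET_BODY='{"transient":{"indices.recovery.max_bytes_per_sec":null}}'

# Hypothetical wrapper: needs a reachable cluster when called.
reset_recovery_throttle() {
  curl -s -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' -d "$RESET_BODY"
}

# Call reset_recovery_throttle once _cat/recovery?active_only shows no rows.
```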

Large translog replay

For planned maintenance, flush indices beforehand with POST /_flush to truncate translogs. If a recovery is already stuck in TRANSLOG because of a large uncommitted log, wait it out. Translog replay is single-threaded per shard and cannot be parallelized. Avoid restarting nodes or primaries while replay is active, as this resets progress.

Concurrent recovery limits

If recoveries are queued because the node limit is reached, raise cluster.routing.allocation.node_concurrent_recoveries. The default is two; the setting is a shortcut that applies the same limit to incoming and to outgoing recoveries on each node. Increasing it raises I/O load across the cluster, so only raise it if disk and network headroom exist.
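A possible way to apply the change, written in the same transient style as the throttle example. The value 4 is only an example, and concurrency_body and set_concurrent_recoveries are hypothetical helper names.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# concurrency_body N -> JSON body for a transient override
concurrency_body() {
  printf '{"transient":{"cluster.routing.allocation.node_concurrent_recoveries":%d}}\n' "$1"
}

# set_concurrent_recoveries N -> apply the override (needs a reachable cluster)
set_concurrent_recoveries() {
  curl -s -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' -d "$(concurrency_body "$1")"
}

concurrency_body 4
# prints {"transient":{"cluster.routing.allocation.node_concurrent_recoveries":4}}
```

To revert, send the same key with a null value once the recovery queue has drained.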

Disk watermark block

If the target node is above the low watermark, free disk space by deleting old indices, shrinking indices, or expanding the filesystem. If the node hit flood stage (95 percent), Elasticsearch sets index.blocks.read_only_allow_delete on affected indices. Elasticsearch 7.4 and later release the block automatically once usage falls back below the high watermark; on older versions, or if the block persists after you free space, remove it manually so indexing and recovery resume. Target only affected indices; using _all clears the block on every index in the cluster.

# Clear read-only blocks after freeing disk space
# WARNING: this applies to all indices. Use specific index names if possible.
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index.blocks.read_only_allow_delete": null
}'
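To clear the block on specific indices rather than on _all, the same request can be scoped by index name. A sketch under the same localhost assumption; unblock_url and clear_read_only_block are hypothetical helpers, and the index names in the comment are placeholders.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# unblock_url INDEX -> settings URL for one index or a comma-separated list
unblock_url() {
  printf '%s/%s/_settings\n' "$ES_URL" "$1"
}

# clear_read_only_block INDEX -> remove the block (needs a reachable cluster)
clear_read_only_block() {
  curl -s -X PUT "$(unblock_url "$1")" \
    -H 'Content-Type: application/json' \
    -d '{"index.blocks.read_only_allow_delete": null}'
}

# Example with placeholder names:
# clear_read_only_block 'logs-2024.05.01,logs-2024.05.02'
```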

Retry loops from allocation failure

If a shard failed too many times and is stuck in ALLOCATION_FAILED, force a retry. This clears the allocation failure counter but does not fix the underlying cause.

# Retry failed shard allocations
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

Prevention

Flush before maintenance. Before planned node restarts, run POST /<index>/_flush on hot indices or POST /_flush globally to minimize translog size and replay time.

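Prevent rebalancing storms. During rolling restarts, set cluster.routing.allocation.enable: none so nodes rejoin without triggering recovery traffic. Also set cluster.routing.rebalance.enable: none to stop the cluster from rebalancing existing shards while the rolling restart is in progress. Restore both settings to all once the node has rejoined; otherwise replicas stay unassigned. Elastic's own rolling-restart procedure uses primaries rather than none for the allocation setting, which still lets new primaries allocate during the restart.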
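The sequence around a single node restart might look like the sketch below. It uses transient settings to match the other examples here; routing_body, put_settings, before_restart, and after_restart are hypothetical helper names, and passing null restores the default of all.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# routing_body JSON_VALUE -> body that sets both routing switches to the same value
routing_body() {
  printf '{"transient":{"cluster.routing.allocation.enable":%s,"cluster.routing.rebalance.enable":%s}}\n' "$1" "$1"
}

put_settings() {
  curl -s -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' -d "$1"
}

# Run before stopping the node: pause allocation, then flush to shrink translogs.
before_restart() {
  put_settings "$(routing_body '"none"')"
  curl -s -X POST "$ES_URL/_flush"
}

# Run after the node has rejoined: null removes the overrides.
after_restart() {
  put_settings "$(routing_body null)"
}
```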

Monitor disk runway. Keep nodes below 70 percent disk usage. Merges can temporarily spike usage, so headroom prevents watermark blocks during normal operations.
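A small classifier can make the thresholds explicit when scanning _cat/allocation output. This is a sketch: classify_disk is a hypothetical helper, it expects the integer disk.percent value, and the 90 percent high watermark is the Elasticsearch default rather than a figure from this article. Because disk.percent is rounded, the comparisons are deliberately conservative.

```shell
# classify_disk PERCENT -> label for the highest threshold reached
classify_disk() {
  if [ "$1" -ge 95 ]; then echo flood_stage
  elif [ "$1" -ge 90 ]; then echo high_watermark
  elif [ "$1" -ge 85 ]; then echo low_watermark
  elif [ "$1" -ge 70 ]; then echo above_target
  else echo ok
  fi
}

# Possible use against a live cluster (assumes the localhost endpoint):
# curl -s 'http://localhost:9200/_cat/allocation?h=node,disk.percent' |
#   while read -r node pct; do echo "$node $(classify_disk "$pct")"; done

classify_disk 72   # prints above_target
```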

Reset temporary throttles. If you raise recovery bandwidth for maintenance, revert the setting afterward. A permanently elevated throttle silently starves production traffic during the next unplanned recovery.

Respect delayed timeouts. By default, index.unassigned.node_left.delayed_timeout waits one minute before starting recovery. This prevents unnecessary shard dancing during brief restarts. Do not lower it aggressively.
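If a planned restart is expected to take longer than a minute, the timeout can be raised on existing indices instead of lowered. A sketch, assuming the localhost endpoint; the 5m value is only an example, and delayed_timeout_body and set_delayed_timeout are hypothetical helper names.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed endpoint

# delayed_timeout_body DURATION -> index settings body, e.g. 5m
delayed_timeout_body() {
  printf '{"settings":{"index.unassigned.node_left.delayed_timeout":"%s"}}\n' "$1"
}

# set_delayed_timeout DURATION -> apply to all existing indices (needs a reachable cluster)
set_delayed_timeout() {
  curl -s -X PUT "$ES_URL/_all/_settings" \
    -H 'Content-Type: application/json' -d "$(delayed_timeout_body "$1")"
}

delayed_timeout_body 5m
# prints {"settings":{"index.unassigned.node_left.delayed_timeout":"5m"}}
```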

How Netdata helps

  • Correlate disk I/O wait on source and target nodes with _cat/recovery stalls to distinguish throttle limits from disk saturation.
  • Alert on disk watermark proximity using per-node usage trends before recovery is blocked.
  • Track per-node network throughput to verify whether recovery traffic is hitting the 40 MiB/s node cap or the physical link limit.
  • Monitor JVM heap and GC pauses on nodes hosting large translogs; long pauses can stall recovery threads and trigger retries.