Elasticsearch shard recovery stuck: throttling, translog replay, and concurrent limits
You restart a data node during a rollout. An hour later the cluster is still yellow. You check _cat/recovery and see a replica that has been pinned at 38 percent the whole time. The network is not saturated, the source node is healthy, and there are no obvious errors in the logs, yet the shard refuses to reach the STARTED state.
Stuck recovery usually comes down to one of four constraints: a bandwidth throttle capping transfer speed, a large translog that must replay sequentially, a concurrent recovery limit serializing work, or a disk watermark on the target node blocking allocation entirely. Less often, a corrupt shard copy sends recovery into a retry loop.
What this means
When Elasticsearch moves a shard, the recovery process copies Lucene segments from a source node to a target and replays the translog to bring the copy up to date. Until this finishes, the recovering copy cannot serve searches and cluster health stays yellow (a replica is missing) or red (a primary is missing). A recovery is stuck when the same stage and percentage sit unchanged for longer than the expected window, typically more than thirty minutes, or when it cycles through repeated retries. The throttle defaults to 40 MiB/s per node, concurrent recoveries default to two incoming and two outgoing per node, and large uncommitted translogs can add hours of replay because replay is single-threaded per shard. Recovery traffic also competes with production indexing and search I/O, so a cluster already near disk or network limits may stall further.
flowchart TD
A[Stuck recovery in _cat/recovery] --> B{Stage?}
B -->|INIT| C{Target disk >85%?}
C -->|Yes| D[Disk watermark blocks allocation]
C -->|No| E[Concurrent limit or delay timeout]
B -->|INDEX| F{Bytes moving?}
F -->|Yes| G[Bandwidth throttle or I/O bound]
F -->|No| H[Corrupt shard or retry loop]
B -->|TRANSLOG| I[Large translog replay]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Bandwidth throttling | bytes_percent creeps slowly; throughput stays below network capacity | GET /_cluster/settings?include_defaults=true&filter_path=*.indices.recovery.max_bytes_per_sec |
| Large translog replay | Recovery stuck in TRANSLOG stage; translog_ops_percent barely moves | GET /_nodes/stats/indices/translog for uncommitted size |
| Concurrent recovery limit | Recovery sits in INIT for more than 10 minutes while other recoveries run | GET /_cat/recovery?active_only count versus node limit |
| Disk watermark block | Recovery never leaves INIT; shard is unassigned | GET /_cat/allocation?v for target node disk percent |
| Retry loop from corruption | Recovery restarts repeatedly at low percentage; logs show shard failures | GET /_cluster/allocation/explain and node logs |
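The decision tree and table above can be condensed into a small lookup. This is only an illustrative sketch: the function name likely_cause and its three arguments are hypothetical, it makes no API calls, and the 85 percent figure assumes the default low watermark.

```shell
# likely_cause STAGE TARGET_DISK_PCT BYTES_MOVING(yes|no)
# Hypothetical helper mirroring the decision tree above.
# STAGE is accepted in either case; TARGET_DISK_PCT must be an integer.
likely_cause() {
  stage=$1; disk_pct=$2; moving=$3
  case "$stage" in
    init|INIT)
      # Assumes the default low watermark of 85 percent.
      if [ "$disk_pct" -gt 85 ]; then
        echo "disk watermark blocks allocation"
      else
        echo "concurrent limit or delayed timeout"
      fi ;;
    index|INDEX)
      if [ "$moving" = "yes" ]; then
        echo "bandwidth throttle or I/O bound"
      else
        echo "corrupt shard or retry loop"
      fi ;;
    translog|TRANSLOG)
      echo "large translog replay" ;;
    *)
      echo "unknown stage: $stage" ;;
  esac
}
```

For example, `likely_cause init 91 no` points at the disk watermark, which you would then confirm with _cat/allocation before acting.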
Quick checks
Run these read-only commands to triage.
# List active recoveries and their stages
curl -s 'http://localhost:9200/_cat/recovery?v&active_only&h=index,shard,time,type,stage,source_host,target_host,bytes_percent,translog_ops_percent'
# Explain why a specific shard is not allocating
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'
# Check disk usage per node for watermark blocks
curl -s 'http://localhost:9200/_cat/allocation?v'
# View current recovery throttle and concurrency defaults
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.indices.recovery.max_bytes_per_sec,*.cluster.routing.allocation.node_concurrent_recoveries'
# Check translog size on nodes
curl -s 'http://localhost:9200/_nodes/stats/indices/translog?filter_path=nodes.*.indices.translog'
# Check source and target node health
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m,disk.used_percent'
How to diagnose it
- Confirm the recovery is stuck. Run the _cat/recovery?active_only command, wait five minutes, then run it again. If bytes_percent or translog_ops_percent did not change, the recovery is blocked.
- Identify the stage. INDEX means segment files are copying. TRANSLOG means operations are replaying. INIT means the recovery is queued or preparing.
- If the stage is INIT and the recovery has been there for more than ten minutes, check the target node disk in _cat/allocation. If the node is above the low watermark (85 percent by default), Elasticsearch will not allocate new shards there. If it is above the flood stage (95 percent), indices are read-only and recovery cannot start.
- If the stage is INDEX and moving slowly, calculate actual throughput. Record bytes_recovered, wait five minutes, and subtract. The default indices.recovery.max_bytes_per_sec is 40 MiB/s per node. If multiple recoveries share the source or target, they split that budget. Also check disk I/O wait and network utilization on both nodes. Recovery traffic competes with production traffic, so a busy cluster may stall even below the theoretical throttle.
- If the stage is TRANSLOG, check the translog size. In _nodes/stats/indices/translog, large uncommitted_size_in_bytes means a long sequential replay. A translog larger than 5 GB per shard makes recovery extremely slow.
- Look for retry loops. If the recovery percentage resets frequently, use _cluster/allocation/explain to see if the shard is failing with ALLOCATION_FAILED. Check node logs for TranslogCorruptedException or FlushNotAllowedException.
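The first and fourth steps above (compare two snapshots, then compute throughput) can be sketched as two small helpers. The function names and file layout are hypothetical, and the sketch assumes the snapshots were taken with the bytes=b parameter so that byte columns are plain integers.

```shell
# stalled_recoveries BEFORE_FILE AFTER_FILE
# Each file holds lines of:
#   index shard target_host bytes_recovered translog_ops_recovered
# Prints the recoveries whose counters are identical in both snapshots.
stalled_recoveries() {
  awk '
    NR == FNR { seen[$1 FS $2 FS $3] = $4 FS $5; next }
    {
      key = $1 FS $2 FS $3
      if ((key in seen) && seen[key] == ($4 FS $5)) print key
    }
  ' "$1" "$2"
}

# throughput_mib_s BYTES_BEFORE BYTES_AFTER SECONDS
# Average copy rate in MiB/s between two bytes_recovered samples.
throughput_mib_s() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.1f\n", (b - a) / s / 1048576 }'
}

# Usage sketch (not executed here):
#   q='active_only&bytes=b&h=index,shard,target_host,bytes_recovered,translog_ops_recovered'
#   curl -s "http://localhost:9200/_cat/recovery?$q" > before.txt
#   sleep 300
#   curl -s "http://localhost:9200/_cat/recovery?$q" > after.txt
#   stalled_recoveries before.txt after.txt
```

A result close to 40.0 from throughput_mib_s suggests the default throttle is the limit; a much lower figure points toward disk or network contention instead.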
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| bytes_percent in _cat/recovery | Measures segment copy progress | Flat for more than 30 minutes |
| translog_ops_percent | Measures translog replay progress | Flat or advancing slowly for more than 30 minutes |
| Disk usage vs watermark | Determines if allocation is blocked | Target node above 85 percent |
| uncommitted_size_in_bytes per shard | Indicates replay duration | Greater than 5 GB per shard |
| Recovery stage | Distinguishes copy, replay, or queue | INIT longer than 10 minutes, or INDEX with no byte progress |
| Node I/O wait | Reveals resource contention | Rising during active recovery |
Fixes
Bandwidth throttling
If the recovery is making progress but slowly, raise indices.recovery.max_bytes_per_sec temporarily via a transient cluster setting. The default is 40 MiB/s per node, shared across all concurrent recoveries on that node. Raising it speeds up copying but consumes network and disk bandwidth that production traffic may need. Verify network and disk headroom on both source and target before applying.
# Temporarily raise throttle (example: 100 MiB/s)
# Reset to default after recovery completes
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
"transient": {
"indices.recovery.max_bytes_per_sec": "100mb"
}
}'
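To judge whether a higher throttle is worth the risk, it helps to estimate how long the remaining copy would take at a given setting. The sketch below is a rough lower bound only: copy_eta_seconds is a hypothetical helper, and it assumes the per-node budget is split evenly across concurrent recoveries and that nothing else is the bottleneck. The second function shows one way to reset the transient override by setting it to null.

```shell
# copy_eta_seconds REMAINING_BYTES THROTTLE_MIB_S CONCURRENT_RECOVERIES
# Lower bound on copy time, assuming an even split of the node budget.
copy_eta_seconds() {
  awk -v r="$1" -v t="$2" -v n="$3" \
    'BEGIN { printf "%d\n", r / 1048576 / (t / n) }'
}

# reset_recovery_throttle [BASE_URL]
# Setting the value to null removes the transient override,
# so the cluster falls back to its default throttle.
reset_recovery_throttle() {
  curl -s -X PUT "${1:-http://localhost:9200}/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"transient": {"indices.recovery.max_bytes_per_sec": null}}'
}
```

With 50 GiB left to copy, a 40 MiB/s throttle, and two recoveries sharing the node, `copy_eta_seconds 53687091200 40 2` gives 2560 seconds, or about 43 minutes at best.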
Large translog replay
For planned maintenance, flush indices beforehand with POST /_flush to truncate translogs. If a recovery is already stuck in TRANSLOG because of a large uncommitted log, wait it out. Translog replay is single-threaded per shard and cannot be parallelized. Avoid restarting nodes or primaries while replay is active, as this resets progress.
Concurrent recovery limits
If recoveries are queued because the node limit is reached, raise cluster.routing.allocation.node_concurrent_recoveries. This setting is a shortcut for node_concurrent_incoming_recoveries and node_concurrent_outgoing_recoveries, which each default to two per node. Increasing it raises I/O load across the cluster, so only raise it if disk and network headroom exist.
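Before raising the limit, it is worth confirming that a node is actually at it. A minimal sketch, assuming input lines of the form "source_host target_host" taken from _cat/recovery; the function name recoveries_per_node is hypothetical.

```shell
# recoveries_per_node < lines of "source_host target_host"
# Prints "host outgoing incoming" for each host, sorted by host name.
# Recoveries without a peer source (for example from a snapshot) show
# up under whatever placeholder _cat/recovery prints in source_host.
recoveries_per_node() {
  awk '
    { out[$1]++; inc[$2]++; hosts[$1]; hosts[$2] }
    END { for (h in hosts) printf "%s %d %d\n", h, out[h], inc[h] }
  ' | sort
}

# Usage sketch (not executed here):
#   curl -s 'http://localhost:9200/_cat/recovery?active_only&h=source_host,target_host' \
#     | recoveries_per_node
```

Compare each count with the configured limit. If no node is at its limit, the queueing has another cause and raising the setting will not help.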
Disk watermark block
If the target node is above the low watermark, free disk space by deleting old indices, shrinking indices, or expanding the filesystem. If the node hit flood stage (95 percent), Elasticsearch sets index.blocks.read_only_allow_delete on every index that has a shard on that node. Elasticsearch 7.4 and later release the block automatically once usage falls back below the high watermark; on older versions, remove it manually after freeing space so indexing and recovery resume. Target only affected indices; using _all clears the block on every index in the cluster.
# Clear read-only blocks after freeing disk space
# WARNING: this applies to all indices. Use specific index names if possible.
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
"index.blocks.read_only_allow_delete": null
}'
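A narrower alternative to _all is to find the nodes over the threshold first and then clear the block index by index. The helpers below are a sketch with hypothetical names; the input format assumes _cat/allocation was called with h=node,disk.percent, and the comparison is "at or above" to err on the cautious side.

```shell
# nodes_over_pct THRESHOLD < lines of "node disk.percent"
# Prints nodes whose disk usage is at or above THRESHOLD percent.
# Rows without a numeric percentage (such as UNASSIGNED) are skipped.
nodes_over_pct() {
  awk -v t="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 >= t + 0 { print $1, $2 }'
}

# clear_block_url BASE_URL INDEX
# Builds the settings URL for a single index instead of _all.
clear_block_url() {
  printf '%s/%s/_settings\n' "$1" "$2"
}

# Usage sketch (not executed here; my-index is a placeholder):
#   curl -s 'http://localhost:9200/_cat/allocation?h=node,disk.percent' | nodes_over_pct 85
#   curl -X PUT "$(clear_block_url http://localhost:9200 my-index)" \
#     -H 'Content-Type: application/json' \
#     -d '{"index.blocks.read_only_allow_delete": null}'
```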
Retry loops from allocation failure
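If a shard has exhausted its allocation attempts (index.allocation.max_retries, five by default) and is stuck in ALLOCATION_FAILED, force a retry. This resets the failure counter but does not fix the underlying cause, so resolve whatever _cluster/allocation/explain reports first or the shard will fail again.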
# Retry failed shard allocations
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
Prevention
Flush before maintenance. Before planned node restarts, run POST /<index>/_flush on hot indices or POST /_flush globally to minimize translog size and replay time.
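Prevent rebalancing storms. During rolling restarts, set cluster.routing.allocation.enable: none so that shards on the restarting node are not reallocated elsewhere while it is down. The official rolling-restart procedure uses primaries instead, which holds back replicas in the same way while still letting new primaries allocate. Also set cluster.routing.rebalance.enable: none to stop the cluster from rebalancing existing shards while the rolling restart is in progress. Reset both settings once the node has rejoined; until allocation is re-enabled, replicas stay unassigned and the cluster stays yellow.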
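The disable-then-restore sequence might look like the sketch below. The function names are hypothetical, and the choice of a persistent rather than transient setting is an assumption that follows the official rolling-restart procedure; adjust it to match how your cluster manages settings.

```shell
# allocation_body VALUE
# Builds the request body. VALUE "null" removes the override, which
# restores the default (all); any other value is sent as a string.
allocation_body() {
  if [ "$1" = "null" ]; then v=null; else v="\"$1\""; fi
  printf '{"persistent":{"cluster.routing.allocation.enable":%s}}\n' "$v"
}

# set_allocation VALUE [BASE_URL]
set_allocation() {
  curl -s -X PUT "${2:-http://localhost:9200}/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$(allocation_body "$1")"
}

# Usage sketch (not executed here):
#   set_allocation none     # before stopping the node
#   ...restart the node and wait for it to rejoin...
#   set_allocation null     # restore the default afterward
```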
Monitor disk runway. Keep nodes below 70 percent disk usage. Merges can temporarily spike usage, so headroom prevents watermark blocks during normal operations.
Reset temporary throttles. If you raise recovery bandwidth for maintenance, revert the setting afterward. A permanently elevated throttle silently starves production traffic during the next unplanned recovery.
Respect delayed timeouts. By default, index.unassigned.node_left.delayed_timeout waits one minute before starting recovery. This prevents unnecessary shard dancing during brief restarts. Do not lower it aggressively.
How Netdata helps
- Correlate disk I/O wait on source and target nodes with _cat/recovery stalls to distinguish throttle limits from disk saturation.
- Alert on disk watermark proximity using per-node usage trends before recovery is blocked.
- Track per-node network throughput to verify whether recovery traffic is hitting the 40 MiB/s node cap or the physical link limit.
- Monitor JVM heap and GC pauses on nodes hosting large translogs; long pauses can stall recovery threads and trigger retries.