Elasticsearch ILM stuck: indices not rolling over, shrinking, or deleting
Disk usage climbs steadily. Old indices that should have been deleted remain. Shard count grows, and the cluster approaches cluster.max_shards_per_node. ILM-managed indices sit in one phase for hours or days. This is the ILM stuck pattern: silent accumulation that turns into a disk watermark crisis, heap pressure, or a storm of unassigned shards when the cluster runs out of room.
ILM polls every ten minutes by default (the indices.lifecycle.poll_interval setting). When an index cannot advance, it sits. Because the failure is gradual, it rarely pages until a secondary limit is breached. Detect the stuck state early and fix the root cause before accumulation triggers cascading failures.
What this means
ILM moves indices through phases (hot, warm, cold, frozen, delete) and actions (rollover, shrink, force merge, allocate, delete) in discrete steps. If a step fails or blocks, the index stays there until the condition clears or an operator intervenes.
When ILM stops, indices accumulate. Each retained index consumes shards, heap metadata, and file descriptors, growing the cluster state. Over days, disk watermarks trigger, JVM heap pressure rises, and search latency degrades. By the time the flood stage blocks writes, the root cause is often dozens of stuck indices that could have been caught earlier.
flowchart TD
A[Index meets ILM condition] --> B{ILM poll executes}
B -->|Alias missing| C[Stuck in check-rollover-ready]
B -->|Disk full| D[Stuck in shrink]
B -->|Follower active| E[Waiting for retention leases]
B -->|Snapshot active| F[Stuck in delete]
C --> G[Indices accumulate]
D --> G
E --> G
F --> G
G --> H[Shard count grows]
H --> I[Disk or heap crisis]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Rollover alias misconfiguration | Index stuck in check-rollover-ready; error mentions the write alias | Verify the index has exactly one write alias configured |
| Rollover conditions never met | Index age exceeds max_age but never rolls; low-volume index | Compare index size, document count, and age against the policy criteria |
| Insufficient disk for shrink | Stuck in shrink action; errors about disk or target node | Target node disk usage in _cat/allocation |
| Unassigned shards blocking migration | Stuck in allocate or searchable_snapshot; cluster health yellow or red | _cluster/allocation/explain for the specific index |
| CCR retention lease blocking leader | waiting-for-shard-history-retention-leases on the leader index | Whether follower indices are still active |
| Snapshot blocking delete | Delete action stuck; snapshot may be running on the index | _snapshot/_status for active snapshot operations |
| ILM auto-retry loop | Index in ERROR state with retry count climbing but no progress | _ilm/explain output for repeated identical errors |
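The table can double as a first-pass triage rule. The Python sketch below maps one index entry from the _ilm/explain response to the most likely row. The function name likely_cause and the matching rules are assumptions made for illustration, not part of Elasticsearch, and the result is a hint about where to look first rather than a diagnosis.

```python
# Hypothetical triage helper: maps an _ilm/explain entry to a row of the
# causes table. The matching rules are illustrative assumptions.
CAUSE_BY_ACTION = {
    "rollover": "Rollover alias misconfiguration or conditions never met",
    "shrink": "Insufficient disk for shrink",
    "allocate": "Unassigned shards blocking migration",
    "searchable_snapshot": "Unassigned shards blocking migration",
    "delete": "Snapshot blocking delete",
}


def likely_cause(entry: dict) -> str:
    """Return the most likely cause for one index entry from _ilm/explain."""
    # In the ERROR step, the step that actually failed is in failed_step.
    step = entry.get("failed_step") or entry.get("step") or ""
    # Lease-related waits point at CCR regardless of the current action.
    if "leases" in step:
        return "CCR retention lease blocking leader"
    return CAUSE_BY_ACTION.get(entry.get("action", ""), "Unknown: read step_info")
```

An index in ERROR with a high retry count can match any of these rows, so the retry count is worth reading alongside the result.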
Quick checks
# List all ILM-managed indices with errors only
curl -s 'http://localhost:9200/*/_ilm/explain?only_errors=true&only_managed=true&pretty'
# Check ILM health report for stagnating indices (Elasticsearch 8.x)
curl -s 'http://localhost:9200/_health_report/ilm'
# Check cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty'
# Check disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'
# Check for active snapshots that may block deletes
curl -s 'http://localhost:9200/_snapshot/_status'
# Explain the first unassigned shard
curl -s 'http://localhost:9200/_cluster/allocation/explain'
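When there are many indices, the first command is easier to read as one line per index. The following is a minimal Python sketch that assumes the same unsecured local node as the curl commands; BASE_URL, fetch_json, and error_rows are hypothetical names, and a secured cluster would need authentication and TLS handling added.

```python
import json
import urllib.request

# Assumption: an unsecured local node, as in the curl commands above.
BASE_URL = "http://localhost:9200"


def fetch_json(path: str) -> dict:
    """GET a path from Elasticsearch and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path, timeout=10) as resp:
        return json.load(resp)


def error_rows(explain: dict) -> list:
    """Flatten an _ilm/explain response into (index, failed_step, reason)."""
    rows = []
    for name, info in sorted(explain.get("indices", {}).items()):
        reason = (info.get("step_info") or {}).get("reason", "")
        rows.append((name, info.get("failed_step", ""), reason))
    return rows


# Usage against a live cluster:
# path = "/*/_ilm/explain?only_errors=true&only_managed=true"
# for row in error_rows(fetch_json(path)):
#     print(*row, sep="\t")
```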
How to diagnose it
- Run the filtered ILM explain query to identify stuck indices. Focus on indices in ERROR or steps that have not progressed after the expected poll interval.
- For each stuck index, read phase, action, step, step_time, and failed_step from the _ilm/explain output. Common stuck states include check-rollover-ready, waiting-for-shard-history-retention-leases, and shrink-related steps.
- If the index is stuck in rollover, verify the write alias and that the index name follows the rollover pattern: it must end in a hyphen and a number, such as -000001.
- If the index is stuck waiting for retention leases, check whether follower clusters still have active follower indices. Leader indices cannot shrink or delete until followers unfollow.
- If the index is stuck in shrink or allocate, check disk headroom on target nodes. Shrink requires enough space to hold a second copy of the index temporarily, and the target shard count must be a factor of the current count and strictly less than it.
- If the index is stuck in delete, verify no snapshot is currently capturing it. An active snapshot blocks deletion.
- After fixing the root cause, issue POST /<index>/_ilm/retry to move the index forward. Do not retry before resolving the underlying problem.
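Indices that are blocked without being in ERROR do not show up in the only_errors query, so it can help to look at how long each index has been in its current step. The sketch below assumes the parsed _ilm/explain response, which reports the step start time in epoch milliseconds as step_time_millis. The max_polls threshold is a hypothetical choice. An index legitimately waiting in check-rollover-ready for its conditions will also appear, so treat the result as a list to review, not as an alert.

```python
import time
from typing import Optional

POLL_INTERVAL_MS = 10 * 60 * 1000  # ILM default poll interval (ten minutes)


def stale_indices(explain: dict, now_ms: Optional[int] = None, max_polls: int = 2) -> list:
    """Return managed indices whose current step started more than
    max_polls poll intervals ago (max_polls is an illustrative threshold)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    limit = max_polls * POLL_INTERVAL_MS
    stale = []
    for name, info in explain.get("indices", {}).items():
        started = info.get("step_time_millis")
        if info.get("managed") and started is not None and now_ms - started > limit:
            stale.append(name)
    return sorted(stale)
```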
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ILM stuck index count | Direct measure of policy execution failure | Any ERROR state sustained longer than 20 minutes |
| Index count growth rate | Accumulation leads to shard and cluster state bloat | Monotonic increase over 48 hours |
| Disk usage per node | Shrink and rollover need headroom; deletes free space | Any node above the 85% low watermark |
| Shard count per node | Unmanaged growth stresses heap and file descriptors | Approaching cluster.max_shards_per_node |
| Cluster health status | Unassigned shards block ILM allocate and migrate actions | Yellow or red sustained longer than 5 minutes |
| Pending cluster tasks | Master overload slows ILM state transitions | More than 20 tasks or any task older than 30 seconds |
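Two of these thresholds are simple to evaluate from API responses. The sketch below assumes the parsed bodies of GET _cluster/pending_tasks and GET _cat/allocation?format=json; the function names are hypothetical, and the default limits are the warning signs from the table, not values enforced by Elasticsearch.

```python
def pending_task_warnings(pending: dict, max_tasks: int = 20, max_age_ms: int = 30_000) -> list:
    """Check a _cluster/pending_tasks response against the table's limits."""
    tasks = pending.get("tasks", [])
    warnings = []
    if len(tasks) > max_tasks:
        warnings.append(f"{len(tasks)} pending tasks (limit {max_tasks})")
    oldest = max((t.get("time_in_queue_millis", 0) for t in tasks), default=0)
    if oldest > max_age_ms:
        warnings.append(f"oldest task queued for {oldest} ms (limit {max_age_ms})")
    return warnings


def nodes_over_low_watermark(allocation: list, low_pct: float = 85.0) -> list:
    """Return node names from _cat/allocation?format=json above low_pct."""
    over = []
    for row in allocation:
        pct = row.get("disk.percent")  # a string, or None on the UNASSIGNED row
        if pct is not None and float(pct) > low_pct:
            over.append(row.get("node"))
    return over
```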
Fixes
Rollover alias misconfiguration
ILM rollover requires exactly one write index per alias, and the index setting index.lifecycle.rollover_alias must name that alias. If the alias is missing, points to multiple indices without one marked as the write index, or was manually removed, rollover cannot proceed. Check with GET /<index>/_alias or GET _alias/<alias>. Restore the alias mapping, then retry: POST /<index>/_ilm/retry.
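The write-index rule can be checked from the response of GET _alias/<alias>. The following is a sketch under the assumption that the response has already been parsed into a dict of index name to alias metadata; write_indices is a hypothetical helper name. Rollover is healthy when it returns exactly one index.

```python
def write_indices(alias_response: dict, alias: str) -> list:
    """Return the indices that act as the write index for an alias.

    alias_response is the parsed body of GET _alias/<alias>: a mapping of
    index name to {"aliases": {alias: {...}}}.
    """
    members = {
        index: body["aliases"][alias]
        for index, body in alias_response.items()
        if alias in body.get("aliases", {})
    }
    explicit = sorted(i for i, meta in members.items() if meta.get("is_write_index") is True)
    if explicit:
        return explicit
    # An alias on a single index accepts writes unless is_write_index is false.
    if len(members) == 1:
        (index, meta), = members.items()
        if meta.get("is_write_index") is None:
            return [index]
    return []
```

An empty result for an alias that spans several indices means no index is marked with is_write_index, which is the state to repair before retrying.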
Rollover conditions never met
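Rollover fires as soon as any one max_* condition is met, so a low-volume index that never reaches max_size or max_docs should still roll once max_age passes. If it has not, check whether the policy sets max_age at all, whether min_* conditions (Elasticsearch 8.4 and later) are holding it back, and whether the index is empty, since ILM skips rollover for empty indices by default. Update the ILM policy or trigger a manual rollover: POST /<alias>/_rollover. Retry ILM only if the index is in the ERROR step.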
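To see which limits an index has actually reached, compare the policy's max_* values with the observed figures. The sketch below is deliberately simplified: it assumes both sides are already plain numbers in matching units (seconds for max_age, bytes for max_size), leaves unit parsing out, and uses the hypothetical name met_conditions. Rollover is due when the returned list is non-empty, because any one max_* condition is enough.

```python
def met_conditions(limits: dict, observed: dict) -> list:
    """Return the names of max_* limits the index has reached.

    Both dicts are keyed by condition name and hold plain numbers in
    matching units. Keys that do not start with max_ are ignored.
    """
    return sorted(
        name
        for name, limit in limits.items()
        if name.startswith("max_") and observed.get(name, 0) >= limit
    )
```

For example, a 45-day-old index with 1,200 documents under a policy of max_age 30 days and max_docs 1,000,000 returns only max_age, which is sufficient for rollover.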
Insufficient disk for shrink
Shrink requires enough temporary disk space on the target node for a complete second copy, and the target shard count must be a factor of the current count and strictly less than it. Check _cat/allocation?v. Free disk or add nodes, then retry.
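Both preconditions can be checked before retrying. In the sketch below the function names are hypothetical, and the headroom rule (the node stays under the 85% low watermark after receiving a full copy) is a conservative assumption rather than the exact check Elasticsearch performs. The shrink API also requires the target shard count to divide the current count evenly, which the first function includes.

```python
def valid_shrink_target(current_shards: int, target_shards: int) -> bool:
    """Target must be lower than the current count and divide it evenly."""
    return 0 < target_shards < current_shards and current_shards % target_shards == 0


def has_shrink_headroom(index_bytes: int, node_used_bytes: int,
                        node_total_bytes: int, low_watermark: float = 0.85) -> bool:
    """Assumed rule: the node stays below the low watermark after it
    receives a complete second copy of the index."""
    return (node_used_bytes + index_bytes) / node_total_bytes < low_watermark
```

For a 6-shard index, targets of 3, 2, or 1 pass the first check; 4 fails because it does not divide 6.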
Unassigned shards blocking migration
allocate and searchable_snapshot wait for green or yellow health. Use _cluster/allocation/explain to identify disk watermarks, allocation filters, or awareness attributes blocking assignment. Resolve the blocker, then retry.
CCR retention lease blocking leader
If a leader index is stuck waiting for retention leases, verify follower cluster status. The leader cannot shrink or delete until followers unfollow. If a follower is offline, wait for the lease to expire or unfollow from the follower cluster when it is available. Proceeding after lease expiration can create data gaps on the follower.
Snapshot blocking delete
An active snapshot blocks deletion. Check _snapshot/_status. Wait for it to complete, or cancel it if safe. Once inactive, retry the delete action.
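To confirm that a running snapshot actually includes the stuck index, inspect the indices map of each entry in the _snapshot/_status response. This is a minimal sketch over the parsed response body; snapshots_holding is a hypothetical helper name.

```python
def snapshots_holding(status: dict, index: str) -> list:
    """Return (repository, snapshot, state) for each running snapshot in a
    GET _snapshot/_status response that includes the given index."""
    return [
        (snap.get("repository"), snap.get("snapshot"), snap.get("state"))
        for snap in status.get("snapshots", [])
        if index in snap.get("indices", {})
    ]
```

An empty list means no currently running snapshot includes the index, so the delete is being held by something else.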
ILM retry loops
If an index is in ERROR with a climbing retry count and no progress, auto-retry will not self-heal a structural problem. Fix the root cause before issuing a manual retry. Repeated retries waste master cycles and delay recovery.
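The retry counter is exposed in the _ilm/explain response as failed_step_retry_count, so loops can be listed directly. The sketch below assumes the parsed response; the function name retry_loops and the min_retries threshold of 5 are hypothetical and should be tuned to how long a transient failure normally takes to clear in your cluster.

```python
def retry_loops(explain: dict, min_retries: int = 5) -> list:
    """Return (index, failed_step, retries) for indices in ERROR whose
    auto-retry counter has reached min_retries (illustrative threshold)."""
    loops = []
    for name, info in explain.get("indices", {}).items():
        retries = info.get("failed_step_retry_count", 0)
        if info.get("step") == "ERROR" and retries >= min_retries:
            loops.append((name, info.get("failed_step"), retries))
    # Highest retry count first: those have been failing the longest.
    return sorted(loops, key=lambda row: row[2], reverse=True)
```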
Prevention
- Prefer data streams over manual rollover aliases. Data streams manage the write alias automatically, eliminating the most common rollover failure source.
- Monitor GET /<index>/_ilm/explain for errors proactively instead of waiting for disk or heap alerts.
- Keep disk usage below 70% on hot nodes to leave headroom for merge and shrink temporary overhead.
- Validate ILM policies in a non-production environment before applying them to production indices.
- Ensure shrink actions target a shard count that is a factor of the current count and strictly less than it, and that destination tiers have adequate disk.
How Netdata helps
- Tracks disk usage per node and index count trends to surface accumulation from ILM failures.
- Correlates JVM heap pressure with shard count growth to warn before heap pressure becomes critical.
- Alerts on disk watermark proximity so you can intervene before the flood stage blocks writes.
- Retains Elasticsearch metrics long term, making it easy to spot when index creation outpaces deletion.
- Surfaces cluster health, pending tasks, and thread pool rejections alongside system disk and memory metrics.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) - flood stage recovery







