Elasticsearch ILM stuck: indices not rolling over, shrinking, or deleting

Disk usage climbs steadily. Old indices that should have been deleted remain. Shard count grows, and the cluster approaches the limit set by cluster.max_shards_per_node. The _ilm/explain output shows indices stuck in one phase for hours or days. This is the ILM stuck pattern: silent accumulation that becomes a disk watermark crisis, heap pressure, or an unassigned-shard storm when the cluster runs out of room.

ILM polls every ten minutes by default. When an index cannot advance, it sits. Because the failure is gradual, it rarely pages until a secondary limit is breached. Detect the stuck state early and fix the root cause before accumulation triggers cascading failures.

What this means

ILM moves indices through phases (hot, warm, cold, frozen, delete) and actions (rollover, shrink, force merge, allocate, delete) in discrete steps. If a step fails or blocks, the index stays there until the condition clears or an operator intervenes.

When ILM stops, indices accumulate. Each retained index consumes shards, heap metadata, and file descriptors, growing the cluster state. Over days, disk watermarks trigger, JVM heap pressure rises, and search latency degrades. By the time the flood stage blocks writes, the root cause is often dozens of stuck indices that could have been caught earlier.

flowchart TD
    A[Index meets ILM condition] --> B{ILM poll executes}
    B -->|Alias missing| C[Stuck in check-rollover-ready]
    B -->|Disk full| D[Stuck in shrink]
    B -->|Follower active| E[Waiting for retention leases]
    B -->|Snapshot active| F[Stuck in delete]
    C --> G[Indices accumulate]
    D --> G
    E --> G
    F --> G
    G --> H[Shard count grows]
    H --> I[Disk or heap crisis]

Common causes

Cause	What it looks like	First thing to check
Rollover alias misconfiguration	Index stuck in `check-rollover-ready`; error mentions the write alias	Verify the index has exactly one write alias configured
Rollover conditions never met	Index age exceeds `max_age` but never rolls; low-volume index	Compare index size, document count, and age against the policy criteria
Insufficient disk for shrink	Stuck in shrink action; errors about disk or target node	Target node disk usage in `_cat/allocation`
Unassigned shards blocking migration	Stuck in `allocate` or `searchable_snapshot`; cluster health yellow or red	`_cluster/allocation/explain` for the specific index
CCR retention lease blocking leader	`wait-for-shard-history-leases` on the leader index	Whether follower indices are still active
Snapshot blocking delete	Delete action stuck; snapshot may be running on the index	`_snapshot/_status` for active snapshot operations
ILM auto-retry loop	Index in ERROR state with retry count climbing but no progress	`_ilm/explain` output for repeated identical errors

Quick checks

# List all ILM-managed indices with errors only
curl -s 'http://localhost:9200/*/_ilm/explain?only_errors=true&only_managed=true&pretty'

# Check ILM health report for stagnating indices (Elasticsearch 8.x)
curl -s 'http://localhost:9200/_health_report/ilm'

# Check cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Check disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Check for active snapshots that may block deletes
curl -s 'http://localhost:9200/_snapshot/_status'

# Explain the first unassigned shard
curl -s 'http://localhost:9200/_cluster/allocation/explain'

How to diagnose it

Run the filtered ILM explain query to identify stuck indices. Focus on indices in ERROR or steps that have not progressed after the expected poll interval.
For each stuck index, read phase, action, step, step_time, failed_step, and step_info (which carries the error message) from the _ilm/explain output. Common stuck states include check-rollover-ready, wait-for-shard-history-leases, and shrink-related steps.
If the index is stuck in rollover, verify that the write alias exists, that the index.lifecycle.rollover_alias setting names it, and that the index name ends in a number (for example -000001) so rollover can generate the next name.
If the index is stuck waiting for retention leases, check whether follower clusters still have active follower indices. Leader indices cannot shrink or delete until followers unfollow.
If the index is stuck in shrink or allocate, check disk headroom on target nodes. Shrink requires enough space to hold a second copy of the index temporarily, and the target shard count must be strictly less than the current count and a factor of it.
If the index is stuck in delete, verify no snapshot is currently capturing it. An active snapshot blocks deletion.
After fixing the root cause, issue POST /<index>/_ilm/retry to move the index forward. Do not retry before resolving the underlying problem.
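The final step can be scripted so that retries are only ever issued deliberately. The following is a minimal sketch, not a definitive tool: the helper name `retry_ilm`, the `ES_URL` variable, and the `DRY_RUN` switch are assumptions made for illustration, and the cluster is assumed to be the unauthenticated local node used in the quick checks. It reads index names from standard input and, by default, only prints the requests it would send.

```shell
# Base URL of the cluster (assumption: local, unauthenticated node).
ES_URL="${ES_URL:-http://localhost:9200}"

# retry_ilm: hypothetical helper. Reads one index name per line from stdin.
# DRY_RUN=1 (the default) prints each request; DRY_RUN=0 actually sends it.
retry_ilm() {
  while IFS= read -r index || [ -n "$index" ]; do
    # Skip blank lines so a trailing newline does not produce an empty request.
    [ -n "$index" ] || continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "POST ${ES_URL}/${index}/_ilm/retry"
    else
      curl -s -X POST "${ES_URL}/${index}/_ilm/retry"
      echo
    fi
  done
}
```

Reviewing the dry-run output first, then rerunning with `DRY_RUN=0` (for example `printf 'logs-000012\n' | DRY_RUN=0 retry_ilm`), keeps the retry tied to indices whose root cause has already been fixed.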

Metrics and signals to monitor

Signal	Why it matters	Warning sign
ILM stuck index count	Direct measure of policy execution failure	Any ERROR state sustained longer than 20 minutes
Index count growth rate	Accumulation leads to shard and cluster state bloat	Monotonic increase over 48 hours
Disk usage per node	Shrink and rollover need headroom; deletes free space	Any node above the 85% low watermark
Shard count per node	Unmanaged growth stresses heap and file descriptors	Approaching `cluster.max_shards_per_node`
Cluster health status	Unassigned shards block ILM allocate and migrate actions	Yellow or red sustained longer than 5 minutes
Pending cluster tasks	Master overload slows ILM state transitions	More than 20 tasks or any task older than 30 seconds
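The 20-minute threshold in the first row corresponds to two default poll intervals. The `_ilm/explain` response includes a `step_time_millis` field for each index, so the age of the current step can be derived with simple arithmetic. The sketch below assumes two hypothetical helper names, `step_age_minutes` and `is_stagnant`; extracting `step_time_millis` from the JSON response is left to whatever JSON tooling is available.

```shell
# step_age_minutes NOW_MS STEP_MS
# Prints how many whole minutes the index has spent in its current step.
step_age_minutes() {
  echo $(( ($1 - $2) / 60000 ))
}

# is_stagnant NOW_MS STEP_MS [THRESHOLD_MIN]
# Succeeds when the step is older than the threshold (default: 20 minutes).
is_stagnant() {
  [ "$(step_age_minutes "$1" "$2")" -gt "${3:-20}" ]
}
```

The current time in milliseconds can be obtained portably with `now_ms=$(( $(date +%s) * 1000 ))`. A longer threshold may suit steps that legitimately take time, such as shrink or force merge on large indices.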

Fixes

Rollover alias misconfiguration

ILM rollover requires exactly one write index per alias, and the index's index.lifecycle.rollover_alias setting must name that alias. If the alias is missing, points to multiple indices without a single write index, or was manually removed, rollover cannot proceed. Check with GET /<index>/_alias or GET _alias/<alias>. Restore the alias mapping, then retry: POST /<index>/_ilm/retry.
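One way to check the write index without parsing JSON is the `_cat/aliases` API with the `alias`, `index`, and `is_write_index` columns. This is a sketch under assumptions: `count_write_indices` and `check_write_alias` are hypothetical helper names, and `ES_URL` points at an unauthenticated local node.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# count_write_indices: hypothetical helper. Reads rows of the form
# "<alias> <index> <is_write_index>" on stdin and prints how many rows
# are flagged as the write index.
count_write_indices() {
  awk '$3 == "true" { n++ } END { print n + 0 }'
}

# check_write_alias ALIAS: fetches the rows for one alias and counts them.
check_write_alias() {
  curl -s "${ES_URL}/_cat/aliases/$1?h=alias,index,is_write_index" | count_write_indices
}
```

A result of 1 is the healthy case. A result of 0 needs a closer look at the raw rows: an alias that points to a single index with `is_write_index` unset (shown as `-`) still treats that index as the write index, whereas an alias that points to several indices with none flagged cannot roll over.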

Rollover conditions never met

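Low-volume indices may never reach max_size or max_docs. Rollover fires as soon as any one max_* condition is met, so such an index should still roll at max_age. If max_age has passed and the index has not rolled, check whether the policy sets min_* conditions that are not yet satisfied or whether the index is empty; recent 8.x releases skip rollover of empty indices by default. Update the ILM policy or trigger a manual rollover: POST /<alias>/_rollover. Then retry ILM.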
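Before forcing a rollover, the rollover API can evaluate conditions without acting when called with `dry_run=true`. The sketch below assumes hypothetical helper names (`rollover_body`, `preview_rollover`), an unauthenticated local node in `ES_URL`, and a single `max_age` condition chosen for illustration.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# rollover_body MAX_AGE: prints a request body with one max_age condition.
rollover_body() {
  printf '{"conditions":{"max_age":"%s"}}\n' "$1"
}

# preview_rollover ALIAS MAX_AGE: asks Elasticsearch to evaluate the
# condition against the alias without performing the rollover.
preview_rollover() {
  curl -s -X POST "${ES_URL}/$1/_rollover?dry_run=true&pretty" \
    -H 'Content-Type: application/json' \
    -d "$(rollover_body "$2")"
}
```

For example, `preview_rollover logs 7d` reports whether the condition is currently met for the write index behind the `logs` alias (an example name). Dropping `dry_run=true` performs the rollover.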

Insufficient disk for shrink

Shrink requires enough temporary disk space on the target node for a complete second copy. The target shard count must be strictly less than the current count and must divide it evenly (an 8-shard index can shrink to 4, 2, or 1). Check _cat/allocation?v. Free disk or add nodes, then retry.
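Disk headroom can be screened quickly by asking `_cat/allocation` for only the `disk.percent` and `node` columns. This is a rough sketch: `nodes_over_disk` and `check_disk` are hypothetical helper names, `ES_URL` is assumed to be an unauthenticated local node, and node names are assumed to contain no spaces.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# nodes_over_disk [THRESHOLD]: reads "<disk.percent> <node>" rows on stdin
# and prints the nodes at or above the threshold (default: 85).
# Rows without a numeric percentage, such as UNASSIGNED, are ignored.
nodes_over_disk() {
  awk -v t="${1:-85}" '$1 ~ /^[0-9]+$/ && $1 + 0 >= t { print $2 " " $1 "%" }'
}

# check_disk [THRESHOLD]: fetches per-node disk usage and applies the filter.
check_disk() {
  curl -s "${ES_URL}/_cat/allocation?h=disk.percent,node" | nodes_over_disk "$1"
}
```

Running `check_disk 70` before a shrink lists every node that is already too full to comfortably take a temporary second copy; empty output means all nodes are below the threshold.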

Unassigned shards blocking migration

The allocate and searchable_snapshot actions cannot complete while the index's shards remain unassigned. Use _cluster/allocation/explain to identify disk watermarks, allocation filters, or awareness attributes blocking assignment. Resolve the blocker, then retry.
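To find which shards of a stuck index are unassigned, `_cat/shards` can be limited to that index and to the `index`, `shard`, `prirep`, `state`, and `unassigned.reason` columns. The helper names `unassigned_shards` and `list_unassigned` below are hypothetical, and `ES_URL` is assumed to be an unauthenticated local node.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# unassigned_shards: reads "<index> <shard> <prirep> <state> <reason>" rows
# on stdin and prints index, shard number, p/r flag, and reason for each
# shard whose state is UNASSIGNED.
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# list_unassigned INDEX: fetches the shard rows for one index and filters them.
list_unassigned() {
  curl -s "${ES_URL}/_cat/shards/$1?h=index,shard,prirep,state,unassigned.reason" | unassigned_shards
}
```

Each line of output identifies a shard that can then be passed to `_cluster/allocation/explain` with a request body naming the `index`, `shard`, and whether it is the `primary`.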

CCR retention lease blocking leader

If a leader index is stuck waiting for retention leases, verify follower cluster status. The leader cannot shrink or delete until followers unfollow. If a follower is offline, wait for the lease to expire or unfollow from the follower cluster when it is available. Proceeding after lease expiration can create data gaps on the follower.

Snapshot blocking delete

An active snapshot blocks deletion. Check _snapshot/_status. Wait for it to complete, or cancel it if safe. Once inactive, retry the delete action.
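A quick gate before retrying the delete is to confirm that `_snapshot/_status` reports no running snapshots. The sketch below is deliberately crude: it is a text match on the response, not a JSON parser, and the helper names `no_active_snapshots` and `snapshots_idle` as well as the `ES_URL` variable are assumptions for illustration.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# no_active_snapshots: hypothetical helper. Reads a `_snapshot/_status`
# response on stdin and succeeds only when its "snapshots" array is empty.
# Whitespace is stripped first so pretty-printed responses also match.
no_active_snapshots() {
  tr -d ' \t\r\n' | grep -q '"snapshots":\[\]'
}

# snapshots_idle: succeeds when the cluster reports no running snapshots.
snapshots_idle() {
  curl -s "${ES_URL}/_snapshot/_status" | no_active_snapshots
}
```

Used as `snapshots_idle && echo "safe to retry delete"`, this avoids retrying while a snapshot is still in flight. When a snapshot is running, the full response lists the indices it covers, which shows whether the stuck index is actually involved.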

ILM retry loops

If an index is in ERROR with a climbing retry count (failed_step_retry_count in the _ilm/explain output) and no progress, auto-retry will not self-heal a structural problem. Fix the root cause before issuing a manual retry. Repeated retries waste master cycles and delay recovery.

Prevention

Prefer data streams over manual rollover aliases. Data streams manage the write alias automatically, eliminating the most common rollover failure source.
Monitor GET /<index>/_ilm/explain for errors proactively instead of waiting for disk or heap alerts.
Keep disk usage below 70% on hot nodes to leave headroom for merge and shrink temporary overhead.
Validate ILM policies in a non-production environment before applying them to production indices.
Ensure shrink actions target a shard count that is strictly less than the current count and a factor of it, and that destination tiers have adequate disk.
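The shrink rule in the last item can be checked before a policy is applied. Elasticsearch requires the target primary shard count to be a factor of the source count, so the check is a comparison plus a modulo. The helper name `valid_shrink_target` is hypothetical.

```shell
# valid_shrink_target SOURCE_SHARDS TARGET_SHARDS
# Succeeds when the target is at least 1, strictly less than the source,
# and divides the source count evenly.
valid_shrink_target() {
  [ "$2" -ge 1 ] && [ "$2" -lt "$1" ] && [ $(( $1 % $2 )) -eq 0 ]
}
```

For example, `valid_shrink_target 8 4` succeeds, while `valid_shrink_target 8 3` fails because 3 does not divide 8, and `valid_shrink_target 8 8` fails because the count would not decrease.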

How Netdata helps

Tracks disk usage per node and index count trends to surface accumulation from ILM failures.
Correlates JVM heap pressure with shard count growth to warn before heap pressure becomes critical.
Alerts on disk watermark proximity so you can intervene before the flood stage blocks writes.
Long-term retention of Elasticsearch metrics makes it easy to spot when index creation exceeds deletions.
Surfaces cluster health, pending tasks, and thread pool rejections alongside system disk and memory metrics.

