Elasticsearch ILM stuck: indices not rolling over, shrinking, or deleting

Disk usage climbs steadily. Old indices that should have been deleted remain. Shard count grows, and the cluster approaches cluster.max_shards_per_node. In ILM, indices are stuck in one phase for hours or days. This is the ILM stuck pattern: silent accumulation that becomes a disk watermark crisis, heap pressure, or an unassigned-shard storm when the cluster runs out of room.

ILM polls every ten minutes by default. When an index cannot advance, it sits. Because the failure is gradual, it rarely pages until a secondary limit is breached. Detect the stuck state early and fix the root cause before accumulation triggers cascading failures.

What this means

ILM moves indices through phases (hot, warm, cold, frozen, delete) and actions (rollover, shrink, force merge, allocate, delete) in discrete steps. If a step fails or blocks, the index stays there until the condition clears or an operator intervenes.

When ILM stops, indices accumulate. Each retained index consumes shards, heap metadata, and file descriptors, growing the cluster state. Over days, disk watermarks trigger, JVM heap pressure rises, and search latency degrades. By the time the flood stage blocks writes, the root cause is often dozens of stuck indices that could have been caught earlier.

flowchart TD
    A[Index meets ILM condition] --> B{ILM poll executes}
    B -->|Alias missing| C[Stuck in check-rollover-ready]
    B -->|Disk full| D[Stuck in shrink]
    B -->|Follower active| E[Waiting for retention leases]
    B -->|Snapshot active| F[Stuck in delete]
    C --> G[Indices accumulate]
    D --> G
    E --> G
    F --> G
    G --> H[Shard count grows]
    H --> I[Disk or heap crisis]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Rollover alias misconfiguration | Index stuck in check-rollover-ready; error mentions the write alias | Verify the index has exactly one write alias configured |
| Rollover conditions never met | Index age exceeds max_age but never rolls; low-volume index | Compare index size, document count, and age against the policy criteria |
| Insufficient disk for shrink | Stuck in shrink action; errors about disk or target node | Target node disk usage in _cat/allocation |
| Unassigned shards blocking migration | Stuck in allocate or searchable_snapshot; cluster health yellow or red | _cluster/allocation/explain for the specific index |
| CCR retention lease blocking leader | wait-for-shard-history-leases on the leader index | Whether follower indices are still active |
| Snapshot blocking delete | Delete action stuck; snapshot may be running on the index | _snapshot/_status for active snapshot operations |
| ILM auto-retry loop | Index in ERROR state with retry count climbing but no progress | _ilm/explain output for repeated identical errors |

Quick checks

# List all ILM-managed indices with errors only
curl -s 'http://localhost:9200/*/_ilm/explain?only_errors=true&only_managed=true&pretty'

# Check ILM health report for stagnating indices (Elasticsearch 8.x)
curl -s 'http://localhost:9200/_health_report/ilm'

# Check cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Check disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Check for active snapshots that may block deletes
curl -s 'http://localhost:9200/_snapshot/_status'

# Explain the first unassigned shard
curl -s 'http://localhost:9200/_cluster/allocation/explain'

How to diagnose it

  1. Run the filtered ILM explain query to identify stuck indices. Focus on indices in ERROR or steps that have not progressed after the expected poll interval.
  2. For each stuck index, read phase, action, step, step_time, and failed_step from the _ilm/explain output. Common stuck states include check-rollover-ready, wait-for-shard-history-leases, and shrink-related steps.
  3. If the index is stuck in rollover, verify the write alias and that the index name follows the rollover pattern.
  4. If the index is stuck waiting for retention leases, check whether follower clusters still have active follower indices. Leader indices cannot shrink or delete until followers unfollow.
  5. If the index is stuck in shrink or allocate, check disk headroom on target nodes. Shrink requires enough space to hold a second copy of the index temporarily, and the target shard count must be a factor of the current count (and lower than it).
  6. If the index is stuck in delete, verify no snapshot is currently capturing it. An active snapshot blocks deletion.
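  7. After fixing the root cause, issue POST /<index>/_ilm/retry to move the index forward. Retry applies only to indices in the ERROR step; an index that is merely waiting on a condition advances on a later poll once the condition clears. Do not retry before resolving the underlying problem.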
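
The first two steps can be sketched in Python. This is a minimal illustration, not an official tool: it assumes the _ilm/explain response has already been fetched and parsed into a dict, and the function name, the 20-minute threshold, and the ignore_steps parameter are hypothetical choices. Steps such as check-rollover-ready can legitimately wait a long time, so treat age-based hits as candidates to inspect rather than confirmed failures.

```python
def find_stuck_indices(explain, now_millis, max_step_age_minutes=20,
                       ignore_steps=("complete",)):
    """Summarise managed indices that are in ERROR or have sat in one step too long.

    `explain` is the parsed JSON body of GET */_ilm/explain.
    `now_millis` is the current time in epoch milliseconds.
    """
    stuck = []
    for name, info in explain.get("indices", {}).items():
        if not info.get("managed"):
            continue  # not under ILM control
        step = info.get("step")
        if step in ignore_steps:
            continue  # finished policies stay in "complete" indefinitely
        step_time = info.get("step_time_millis")
        age_minutes = None
        if step_time is not None:
            age_minutes = (now_millis - step_time) / 60_000
        in_error = step == "ERROR"
        too_old = age_minutes is not None and age_minutes > max_step_age_minutes
        if in_error or too_old:
            stuck.append({
                "index": name,
                "phase": info.get("phase"),
                "action": info.get("action"),
                "step": step,
                "failed_step": info.get("failed_step"),
                "age_minutes": age_minutes,
                "reason": (info.get("step_info") or {}).get("reason"),
            })
    return sorted(stuck, key=lambda row: row["index"])
```

Feeding it json.loads() of the explain response and int(time.time() * 1000) gives one row per suspect index, with the fields that step 2 asks you to read.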

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ILM stuck index count | Direct measure of policy execution failure | Any ERROR state sustained longer than 20 minutes |
| Index count growth rate | Accumulation leads to shard and cluster state bloat | Monotonic increase over 48 hours |
| Disk usage per node | Shrink and rollover need headroom; deletes free space | Any node above the 85% low watermark |
| Shard count per node | Unmanaged growth stresses heap and file descriptors | Approaching cluster.max_shards_per_node |
| Cluster health status | Unassigned shards block ILM allocate and migrate actions | Yellow or red sustained longer than 5 minutes |
| Pending cluster tasks | Master overload slows ILM state transitions | More than 20 tasks or any task older than 30 seconds |
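
As a rough illustration of the shard-count signal, the sketch below estimates how close the cluster is to its shard limit from a parsed _cluster/health response. It assumes the limit is cluster.max_shards_per_node multiplied by the number of data nodes and that the setting is at its default of 1000. The function name is hypothetical, and the shard total is an approximation, so read the result as a trend indicator.

```python
def shard_limit_usage(health, max_shards_per_node=1000):
    """Return the approximate fraction of the cluster-wide shard limit in use.

    `health` is the parsed JSON body of GET _cluster/health.
    Returns None when the response reports no data nodes.
    """
    data_nodes = health.get("number_of_data_nodes", 0)
    if data_nodes <= 0:
        return None
    # Approximation: count assigned, initializing, and unassigned shards.
    shards = (health.get("active_shards", 0)
              + health.get("initializing_shards", 0)
              + health.get("unassigned_shards", 0))
    return shards / (max_shards_per_node * data_nodes)
```

A value that creeps toward 1.0 over several days, while ingest volume is flat, points at indices that are not being shrunk or deleted.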

Fixes

Rollover alias misconfiguration

ILM rollover requires the rollover alias to resolve to exactly one write index. If the alias is missing, was manually removed, or spans several indices with none marked is_write_index: true, rollover cannot proceed. Also confirm that index.lifecycle.rollover_alias on the index names that alias. Check with GET /<index>/_alias or GET _alias/<alias>. Restore the alias mapping, then retry: POST /<index>/_ilm/retry.
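
The alias check can be sketched as a small function over the parsed GET _alias/<alias> response. This is an illustration under assumptions: the function name is hypothetical, and it treats an alias that points at a single index without an explicit is_write_index: false as writable, which matches how a single-index alias normally behaves.

```python
def diagnose_rollover_alias(alias_response, alias):
    """Return None if the alias has a usable write index, else a short problem description.

    `alias_response` is the parsed JSON body of GET _alias/<alias>,
    shaped like {"<index>": {"aliases": {"<alias>": {...}}}}.
    """
    members = {
        index: body["aliases"][alias]
        for index, body in alias_response.items()
        if alias in body.get("aliases", {})
    }
    if not members:
        return "alias not found on any index"
    writers = [i for i, props in members.items() if props.get("is_write_index") is True]
    if len(writers) == 1:
        return None
    if len(members) == 1:
        only_props = next(iter(members.values()))
        if only_props.get("is_write_index") is not False:
            return None  # a single-index alias acts as the write index
    return f"no write index among {len(members)} indices"
```

A non-None result tells you which alias repair to make before calling _ilm/retry.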

Rollover conditions never met

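Rollover fires when any one max_* condition is met, so a policy that sets only max_size or max_docs may never roll a low-volume index. On recent 8.x releases, every min_* condition in the policy must also hold and empty indices are skipped by default, which can keep an index in place even after max_age has passed. Update the ILM policy (for example, add max_age or relax a min_* condition) or trigger a manual rollover: POST /<alias>/_rollover. If the index is in the ERROR step, retry ILM afterwards.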
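
To make the condition logic concrete, here is a minimal sketch of how rollover readiness combines conditions: any max-type condition may trigger it, and all min-type conditions must hold. The metric keys (age_seconds, docs, size_bytes) and the function name are hypothetical, normalized stand-ins for the policy's real settings, which use unit strings such as 30d or 50gb.

```python
def rollover_ready(observed, max_conditions, min_conditions=None):
    """Return True when rollover conditions are satisfied for the observed metrics.

    All three arguments are dicts keyed by the same hypothetical metric names,
    with numeric values (seconds, document counts, bytes).
    """
    min_conditions = min_conditions or {}
    # Any single max-type condition being reached is enough to trigger rollover.
    any_max = any(observed[metric] >= limit for metric, limit in max_conditions.items())
    # Every min-type condition must be satisfied as well.
    all_min = all(observed[metric] >= limit for metric, limit in min_conditions.items())
    return any_max and all_min
```

For a small index that is 40 days old, a policy with only size and document limits never becomes ready, while adding a 30-day age limit makes it ready immediately.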

Insufficient disk for shrink

Shrink requires enough temporary disk space on the target node for a complete second copy, and the target shard count must be a factor of the current count (and lower than it). Check _cat/allocation?v. Free disk or add nodes, then retry.
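
Both shrink preconditions can be checked before retrying. The sketch below is a rough heuristic, not what Elasticsearch itself computes: it assumes rows from _cat/allocation?format=json&bytes=b (where disk.used and disk.total are byte counts), uses a hypothetical 85% ceiling taken from the low watermark mentioned earlier, and the function names are illustrative.

```python
def shrink_target_problems(current_shards, target_shards):
    """Return a list of reasons why the target shard count is invalid (empty if fine)."""
    problems = []
    if target_shards < 1:
        problems.append("target shard count must be at least 1")
    elif target_shards >= current_shards:
        problems.append("target shard count must be lower than the current count")
    elif current_shards % target_shards != 0:
        problems.append("target shard count must be a factor of the current count")
    return problems


def nodes_with_shrink_room(allocation_rows, index_bytes, ceiling=0.85):
    """Return node names that could hold an extra copy of the index and stay under `ceiling`."""
    candidates = []
    for row in allocation_rows:
        used, total = row.get("disk.used"), row.get("disk.total")
        if used is None or total is None:
            continue  # the UNASSIGNED row carries no disk figures
        used, total = int(used), int(total)
        if total > 0 and (used + index_bytes) / total < ceiling:
            candidates.append(row.get("node"))
    return sorted(candidates)
```

An empty candidate list suggests freeing disk or adding nodes before the retry has any chance of succeeding.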

Unassigned shards blocking migration

allocate and searchable_snapshot wait for green or yellow health. Use _cluster/allocation/explain to identify disk watermarks, allocation filters, or awareness attributes blocking assignment. Resolve the blocker, then retry.

CCR retention lease blocking leader

If a leader index is stuck waiting for retention leases, verify follower cluster status. The leader cannot shrink or delete until followers unfollow. If a follower is offline, wait for the lease to expire or unfollow from the follower cluster when it is available. Proceeding after lease expiration can create data gaps on the follower.

Snapshot blocking delete

An active snapshot blocks deletion. Check _snapshot/_status. Wait for it to complete, or cancel it if safe. Once inactive, retry the delete action.
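
A short sketch of the snapshot check: given the parsed _snapshot/_status response, list the snapshots that include the stuck index. It assumes each entry under snapshots carries repository, snapshot, state, and an indices map keyed by index name; the function name is hypothetical.

```python
def snapshots_covering(status, index):
    """Return (repository, snapshot, state) tuples for snapshots that include `index`.

    `status` is the parsed JSON body of GET _snapshot/_status, which
    reports currently running snapshots when called without a path.
    """
    hits = []
    for snap in status.get("snapshots", []):
        if index in snap.get("indices", {}):
            hits.append((snap.get("repository"), snap.get("snapshot"), snap.get("state")))
    return hits
```

An empty list means no running snapshot references the index, so the delete is probably blocked by something else.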

ILM retry loops

If an index is in ERROR with a climbing retry count and no progress, auto-retry will not self-heal a structural problem. Fix the root cause before issuing a manual retry. Repeated retries waste master cycles and delay recovery.
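
One way to spot a structural problem is to group the indices that keep retrying by the step and error type they share. The sketch below assumes the _ilm/explain output includes failed_step_retry_count and step_info.type for indices in ERROR; the function name and the threshold of three retries are hypothetical.

```python
from collections import defaultdict


def group_retry_loops(explain, min_retries=3):
    """Group ERROR-state indices by (failed_step, error type) once retries reach `min_retries`.

    `explain` is the parsed JSON body of GET */_ilm/explain.
    """
    groups = defaultdict(list)
    for name, info in explain.get("indices", {}).items():
        if info.get("step") != "ERROR":
            continue
        if info.get("failed_step_retry_count", 0) < min_retries:
            continue
        error_type = (info.get("step_info") or {}).get("type")
        groups[(info.get("failed_step"), error_type)].append(name)
    return {key: sorted(names) for key, names in groups.items()}
```

A single large group usually means one root cause, such as a broken alias or a full tier, rather than many unrelated failures.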

Prevention

  • Prefer data streams over manual rollover aliases. Data streams manage the write alias automatically, eliminating the most common rollover failure source.
  • Monitor GET /<index>/_ilm/explain for errors proactively instead of waiting for disk or heap alerts.
  • Keep disk usage below 70% on hot nodes to leave headroom for merge and shrink temporary overhead.
  • Validate ILM policies in a non-production environment before applying them to production indices.
  • Ensure shrink actions target a shard count that is a factor of, and lower than, the current count and that destination tiers have adequate disk.

How Netdata helps

  • Tracks disk usage per node and index count trends to surface accumulation from ILM failures.
  • Correlates JVM heap pressure with shard count growth to warn before heap pressure becomes critical.
  • Alerts on disk watermark proximity so you can intervene before the flood stage blocks writes.
  • Long-term retention of Elasticsearch metrics makes it easy to spot when index creation exceeds deletions.
  • Surfaces cluster health, pending tasks, and thread pool rejections alongside system disk and memory metrics.