Elasticsearch disk full: emergency recovery and freeing space safely

Writes fail with a cluster_block_exception such as TOO_MANY_REQUESTS/12/index read-only / allow delete (api). Cluster health is red or yellow, and data nodes are pinned above 90 percent disk. Indexers buffer or drop data while the cluster attempts to relocate shards onto disks that are already full. The goal is to recover without corrupting metadata or amplifying the pressure.

What this means

Elasticsearch uses three disk watermarks, listed here with their default values. Low (85 percent) stops allocation of new shards to the node. High (90 percent) starts relocating shards off the node. Flood stage (95 percent) forces every index with a shard on that node into index.blocks.read_only_allow_delete: true, which blocks writes to those indices.
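As a rough illustration of the three thresholds, the sketch below maps a node's disk.percent value to the stage it has reached. The function name classify_disk is hypothetical, and the 85/90/95 values are the defaults quoted above; a cluster can override them, so treat the result as an approximation.

```shell
# Map an integer disk.percent value to the watermark stage it has reached.
# Assumes the default thresholds (85/90/95); adjust if your cluster overrides them.
# A value equal to a threshold is treated as having reached it (conservative).
classify_disk() {
  pct=$1
  if [ "$pct" -ge 95 ]; then
    echo "flood_stage"
  elif [ "$pct" -ge 90 ]; then
    echo "high"
  elif [ "$pct" -ge 85 ]; then
    echo "low"
  else
    echo "ok"
  fi
}

# Example: classify_disk 92   prints "high"
```

Feeding it the disk.percent column from _cat/allocation gives a quick per-node label during triage.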

In 7.4 and later, including 8.x, Elasticsearch removes the flood-stage block automatically once disk usage drops below the high watermark. Until then, the block persists and the affected indices stay write-locked. Earlier versions never remove the block on their own.

Background merges amplify the risk. Lucene creates new segments before deleting old ones, temporarily requiring roughly 2x the segment size on disk. A node at 80 percent can spike to 90 percent during a large merge even without new ingest.
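The merge arithmetic can be sketched with integer math. The helper below is hypothetical and assumes the worst case where the merged output is as large as the segments being merged, so the extra space needed equals the size of those segments.

```shell
# Estimate peak disk percent while a merge rewrites merge_gb of segments.
# All sizes are whole GB; the result is rounded down.
# Assumption: the old segments stay on disk until the merge completes.
merge_peak_percent() {
  used_gb=$1
  total_gb=$2
  merge_gb=$3
  echo $(( (used_gb + merge_gb) * 100 / total_gb ))
}

# 800 GB used on a 1000 GB disk, merging 100 GB of segments:
# merge_peak_percent 800 1000 100   prints 90
```

This reproduces the 80 to 90 percent jump described above and shows why the steady-state target needs headroom for the largest expected merge.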

Deleting individual documents does not reclaim space until the next merge. Deleting an entire index via the Delete Index API removes the shard directory immediately. Reducing replica count also frees space immediately on nodes that drop a copy.

flowchart TD
    A[Disk >95% or write rejections] --> B[Check _cat/allocation]
    B --> C{Fastest safe win?}
    C --> D[Delete old time-series indices via API]
    C --> E[Reduce replica count]
    D --> F{Disk below 90%?}
    E --> F
    F -->|No| G[Check ILM and snapshots]
    F -->|Yes| H[Clear read_only_allow_delete block]
    G --> F

Common causes

Cause	What it looks like	First thing to check
Time-series accumulation without ILM cleanup	Old indices dominate disk; shard count grows daily	`GET /_cat/indices?v&h=index,store.size&s=store.size:desc`
Replica over-allocation on full nodes	High disk usage with moderate primary data; many replica shards	`GET /_cat/allocation?v&h=node,disk.percent,shards`
Merge temporarily doubling segment size	Sudden disk spike on an active node without new ingest	`GET /_cat/nodes?v&h=name,segments.count,merges.current`
Snapshot or repository bloat on local storage	Snapshot repository on the same mount as data paths	`GET /_snapshot/_status` and OS-level `df -h`
ILM stuck in a transition step	Indices that should be deleted or shrunk stay open	`GET /*/_ilm/explain?only_errors=true&only_managed=true`

Quick checks

Run these read-only commands to assess scope before destructive action.

# Per-node disk usage and watermark state
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total,shards'

# Cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_data_nodes'

# Check which indices carry the flood-stage write block
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete'

# Largest indices by store size
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc'

# Active merges and segment count per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,merges.current'

# ILM-managed indices with errors
curl -s 'http://localhost:9200/*/_ilm/explain?only_errors=true&only_managed=true'

How to diagnose it

Identify the fullest nodes with _cat/allocation. Note which nodes crossed 95 percent (flood stage) and which are above 90 percent (high watermark).
Check _cluster/health for unassigned shards. If the allocator is moving shards off full nodes, target nodes may also be climbing toward their watermarks.
Inspect _all/_settings for read_only_allow_delete. If the block is present, writes stop even if other nodes have space.
List the largest indices. In time-series clusters, the oldest logs or metrics indices are usually the biggest and safest to drop.
Check _cat/nodes for active merges. High merges.current means temporary segment duplication may be causing the spike.
Verify ILM status. Indices stuck in a delete phase will refill the disk after recovery unless you fix the underlying error.
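The first diagnostic step can be scripted. The sketch below is one possible filter, not part of Elasticsearch: nodes_over is a hypothetical helper that reads "disk.percent node" lines on stdin and prints the nodes at or above a threshold. It assumes node names contain no spaces and skips rows without a numeric percentage, such as the UNASSIGNED row.

```shell
# Print "node percent" for every node whose disk.percent is >= the threshold.
# Input format (stdin): one "disk.percent node" pair per line, no header.
nodes_over() {
  threshold=$1
  awk -v t="$threshold" '$1 ~ /^[0-9]+$/ && $1 + 0 >= t + 0 { print $2, $1 }'
}

# Usage against a live cluster (same URL as the quick checks above):
#   curl -s 'http://localhost:9200/_cat/allocation?h=disk.percent,node' | nodes_over 90
```

Running it once with 95 and once with 90 separates flood-stage nodes from those that are only above the high watermark.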

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Disk usage percent per node (`_cat/allocation`)	Watermarks trigger allocation changes and write blocks	Sustained >85% on any data node
`index.blocks.read_only_allow_delete`	Flood stage fired and writes are blocked	Block present on indices receiving writes
Segment count and merge activity	Merges temporarily require 2x segment disk space	Segment count growing with active merges on a node near 90% disk
Unassigned shard count	Relocations may be blocked by disk watermarks	Unassigned shards increasing while disk >90%
Indexing rate	Confirms whether writes are actually failing	Drops to zero while ingest pipelines are active
ILM execution errors	Old indices may not be deleting as expected	Indices stuck in delete/warm/shrink phase

Fixes

Delete old or unused indices

This is the fastest way to reclaim space. Target old time-series indices first.

# WARNING: Irreversible. Deletes the entire index and all its data.
curl -X DELETE 'http://localhost:9200/<index>'

Do not manually remove files under the data path (for example $DATA_DIR/nodes/0/indices/ in 7.x, or $DATA_DIR/indices/ in 8.x) while Elasticsearch is running. Doing so corrupts cluster state and risks data loss.
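Before deleting anything, it can help to produce a dry-run list of candidates. The sketch below assumes daily indices whose names end in a YYYY.MM.DD date (for example logs-2024.05.01); that naming scheme and the helper name old_indices are assumptions, not something Elasticsearch enforces. It only prints names and deletes nothing.

```shell
# Print index names whose trailing YYYY.MM.DD date is strictly older than the cutoff.
# Input (stdin): one index name per line. Names without a date suffix are ignored.
old_indices() {
  cutoff=$1
  awk -v c="$cutoff" '
    match($0, /[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/) {
      d = substr($0, RSTART, RLENGTH)
      # Appending "" forces a string comparison; zero-padded dates sort correctly as text.
      if ((d "") < (c "")) print $0
    }'
}

# Dry run against a live cluster (the logs-* pattern is an example):
#   curl -s 'http://localhost:9200/_cat/indices/logs-*?h=index' | old_indices 2024.05.01
```

Review the printed list by hand, then pass each name to the DELETE request shown above.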

Reduce replica count

Lowering replicas drops entire shard copies.

# WARNING: Reduces redundancy. Use only during emergency recovery.
curl -X PUT 'http://localhost:9200/<index>/_settings' -H 'Content-Type: application/json' -d '
{
  "index.number_of_replicas": 0
}'

Space is freed immediately on every node that hosted a removed replica. Re-add replicas after the cluster is stable and disk usage is back below the low watermark.
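To judge which indices are worth the redundancy trade-off, the space held by replicas can be estimated as total store size minus primary store size. The helper replica_bytes below is a hypothetical sketch that reads "index pri.store.size store.size" lines in bytes; the estimate only covers replicas that are currently assigned.

```shell
# Print "index replica_bytes" where replica_bytes = store.size - pri.store.size.
# Input (stdin): "index pri.store.size store.size" per line, sizes in bytes.
# Rows with non-numeric sizes (for example closed indices) are skipped.
replica_bytes() {
  awk '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { printf "%s %.0f\n", $1, $3 - $2 }'
}

# Usage against a live cluster:
#   curl -s 'http://localhost:9200/_cat/indices?h=index,pri.store.size,store.size&bytes=b' | replica_bytes
```

Indices with the largest second column return the most space when their replica count is lowered.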

Do not start force-merges on a full disk

Force-merging to a single segment temporarily requires roughly 2x the index size on disk. If a merge runs out of space mid-operation, the shard can fail. If a merge is already running, monitor merges.current and wait for it to finish. Lucene reclaims the space automatically once the new segment replaces the old ones.

Clear the flood-stage read-only block

In 7.4 and later, including 8.x, the block clears automatically when disk usage falls below the high watermark (90 percent). If it does not clear, if you run an older version, or if you need to unblock writes immediately after deletions:

# Run this after disk is actually below 90%.
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index.blocks.read_only_allow_delete": null
}'

If disk is still above the high watermark, Elasticsearch reapplies the block on its next check.
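A small guard can prevent clearing the block too early. The helper all_below is a hypothetical sketch: it reads disk.percent values on stdin and succeeds only if at least one numeric value was seen and all of them are below the limit, so an empty or failed response counts as unsafe.

```shell
# Exit 0 only if every numeric value on stdin is below the limit
# and at least one value was read; exit 1 otherwise.
all_below() {
  limit=$1
  awk -v l="$limit" '
    $1 ~ /^[0-9]+$/ { n++; if ($1 + 0 >= l + 0) bad = 1 }
    END { exit (bad || n == 0) }'
}

# Usage against a live cluster, before sending the settings request above:
#   if curl -s 'http://localhost:9200/_cat/allocation?h=disk.percent' | all_below 90; then
#     echo "all nodes below 90%, safe to clear the block"
#   fi
```

The limit of 90 assumes the default high watermark; use your cluster's configured value if it differs.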

Address ILM and snapshot cleanup

If ILM is stuck due to disk pressure, free space first, then retry:

curl -X POST 'http://localhost:9200/<index>/_ilm/retry'

Failed or partial snapshots can consume repository space. If the snapshot repository shares a filesystem with data nodes, delete old snapshots via the Delete Snapshot API. Do not manually delete files inside the repository path.

Prevention

Maintain sustained disk usage below 70 percent to absorb merge spikes without crossing watermarks.
Verify ILM policies delete indices. Check GET /*/_ilm/explain for stuck transitions.
Size shards so losing one or two replicas does not push nodes past 90 percent.
Plan capacity for merge overhead, not just raw data size.
Monitor per-node disk, not cluster-wide averages. One hot node triggers the cascade.

How Netdata helps

Per-node disk usage charts show which data nodes are approaching watermarks before the cluster blocks writes.
Disk I/O wait and throughput metrics help distinguish temporary merge storms from sustained ingest growth.
JVM heap and GC metrics correlate with segment metadata growth and give early warning before disk pressure forces emergency deletions.
Alerts on cluster health and node-level disk percent provide the lead time to act before flood stage.

