Elasticsearch disk full: emergency recovery and freeing space safely
Writes fail with TOO_MANY_REQUESTS/12/index read-only / allow delete (api). Cluster health is red or yellow and data nodes are pinned above 90 percent disk. Indexers buffer or drop data while the cluster attempts shard relocations onto already-full disks. Recover without corrupting metadata or amplifying pressure.
What this means
Elasticsearch uses three disk watermarks, shown here with their default values. Low (85 percent) stops allocation of new shards to the node. High (90 percent) starts relocating shards off the node. Flood stage (95 percent) forces every index with a shard on that node into index.blocks.read_only_allow_delete: true, which rejects writes to those indices.
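As a rough illustration, the three tiers can be written as a small shell helper. This is a sketch, not how Elasticsearch evaluates watermarks internally: classify_disk is a hypothetical name, the thresholds are the defaults quoted above, and the input is assumed to be the integer disk.percent value reported by _cat/allocation.

```shell
# Classify an integer disk usage percentage against the watermarks.
# Defaults (85/90/95) are the values quoted above; pass overrides as
# arguments 2-4 if your cluster uses different settings.
classify_disk() {
  local pct=$1 low=${2:-85} high=${3:-90} flood=${4:-95}
  if   [ "$pct" -ge "$flood" ]; then echo "flood"
  elif [ "$pct" -ge "$high"  ]; then echo "high"
  elif [ "$pct" -ge "$low"   ]; then echo "low"
  else echo "ok"
  fi
}

classify_disk 96   # prints: flood
classify_disk 87   # prints: low
```

If the cluster overrides the defaults, GET _cluster/settings?include_defaults=true shows the values actually in effect.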
In 7.4 and later (including 8.x), Elasticsearch removes the flood-stage block automatically once disk usage drops below the high watermark. Until then, or until you clear it manually on earlier versions, the block persists and the affected indices stay write-locked.
Background merges amplify the risk. Lucene creates new segments before deleting old ones, temporarily requiring roughly 2x the segment size on disk. A node at 80 percent can spike to 90 percent during a large merge even without new ingest.
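The arithmetic behind that spike can be sketched in a few lines of shell. The helper name merge_peak_pct is hypothetical, and the calculation assumes the worst case where the merged output is about as large as the segments being merged, so old and new copies coexist until the merge completes.

```shell
# Project peak disk usage (percent) while a merge holds both the old
# segments and the newly written segment on disk.
# Args: used_gb total_gb merging_segments_gb (integers)
merge_peak_pct() {
  local used=$1 total=$2 merging=$3
  echo $(( (used + merging) * 100 / total ))
}

# A 1000 GB node at 80% that merges 100 GB of segments:
merge_peak_pct 800 1000 100   # prints: 90
```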
Deleting individual documents does not reclaim space until the next merge. Deleting an entire index via the Delete Index API removes the shard directory immediately. Reducing replica count also frees space immediately on nodes that drop a copy.
flowchart TD
A[Disk >95% or write rejections] --> B[Check _cat/allocation]
B --> C{Fastest safe win?}
C --> D[Delete old time-series indices via API]
C --> E[Reduce replica count]
D --> F[Disk below 90%?]
E --> F
F -->|No| G[Check ILM and snapshots]
F -->|Yes| H[Clear read_only_allow_delete block]
G --> H
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Time-series accumulation without ILM cleanup | Old indices dominate disk; shard count grows daily | GET /_cat/indices?v&h=index,store.size&s=store.size:desc |
| Replica over-allocation on full nodes | High disk usage with moderate primary data; many replica shards | GET /_cat/allocation?v&h=node,disk.percent,shards |
| Merge temporarily doubling segment size | Sudden disk spike on an active node without new ingest | GET /_cat/nodes?v&h=name,segments.count,merges.current |
| Snapshot or repository bloat on local storage | Snapshot repository on the same mount as data paths | GET /_snapshot/_status and OS-level df -h |
| ILM stuck in a transition step | Indices that should be deleted or shrunk stay open | GET /*/_ilm/explain?only_errors=true&only_managed=true |
Quick checks
Run these read-only commands to assess scope before destructive action.
# Per-node disk usage and watermark state
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total,shards'
# Cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_data_nodes'
# Check which indices carry the flood-stage write block
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete'
# Largest indices by store size
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc'
# Active merges and segment count per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,merges.current'
# ILM-managed indices with errors
curl -s 'http://localhost:9200/*/_ilm/explain?only_errors=true&only_managed=true'
How to diagnose it
- Identify the fullest nodes with _cat/allocation. Note which nodes crossed 95 percent (flood stage) and which are above 90 percent (high watermark).
- Check _cluster/health for unassigned shards. If the allocator is moving shards off full nodes, target nodes may also be climbing toward their watermarks.
- Inspect _all/_settings for read_only_allow_delete. If the block is present, writes stop even if other nodes have space.
- List the largest indices. In time-series clusters, the oldest logs or metrics indices are usually the biggest and safest to drop.
- Check _cat/nodes for active merges. High merges.current means temporary segment duplication may be causing the spike.
- Verify ILM status. Indices stuck in a delete phase will refill the disk after recovery unless you fix the underlying error.
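The first step can be scripted as a filter over _cat/allocation output. This is a minimal sketch: nodes_over is a hypothetical helper name, and it assumes the input has exactly two columns (node and disk.percent) with no header row.

```shell
# Print "node percent" for every node at or above the given threshold.
# Rows without a numeric percent (such as the UNASSIGNED row) are skipped.
nodes_over() {
  local threshold=$1
  awk -v t="$threshold" '$2 ~ /^[0-9]+$/ && $2 + 0 >= t { print $1, $2 }'
}

# Usage against a live cluster (header omitted because ?v is not set):
#   curl -s 'http://localhost:9200/_cat/allocation?h=node,disk.percent' | nodes_over 90
```

Running it once with 95 and once with 90 separates the nodes that triggered the write block from those that are only relocating shards.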
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk usage percent per node (_cat/allocation) | Watermarks trigger allocation changes and write blocks | Sustained >85% on any data node |
| index.blocks.read_only_allow_delete | Flood stage fired and writes are blocked | Block present on indices receiving writes |
| Segment count and merge activity | Merges temporarily require 2x segment disk space | Segment count growing with active merges on a node near 90% disk |
| Unassigned shard count | Relocations may be blocked by disk watermarks | Unassigned shards increasing while disk >90% |
| Indexing rate | Confirms whether writes are actually failing | Drops to zero while ingest pipelines are active |
| ILM execution errors | Old indices may not be deleting as expected | Indices stuck in delete/warm/shrink phase |
Fixes
Delete old or unused indices
This is the fastest way to reclaim space. Target old time-series indices first.
# WARNING: Irreversible. Deletes the entire index and all its data.
curl -X DELETE 'http://localhost:9200/<index>'
Do not manually remove files from the data path (for example $DATA_DIR/nodes/0/indices/ on 7.x) while Elasticsearch is running. Doing so leaves the cluster state referencing shard data that no longer exists and risks data loss.
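To build a candidate list before deleting anything, a dry-run filter like the following can help. It is a sketch under assumptions: indices_older_than is a hypothetical helper, and it assumes index names end in a YYYY.MM.DD date suffix (for example logs-2024.04.30), which may not match your naming scheme. It only prints names and never issues a DELETE.

```shell
# Read index names on stdin and print those whose trailing YYYY.MM.DD
# suffix sorts before the cutoff date. Names without such a suffix are
# ignored. Lexicographic comparison works because the format is zero-padded.
indices_older_than() {
  local cutoff=$1   # e.g. 2024.05.01
  awk -v c="$cutoff" '
    match($1, /[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/) {
      if ((substr($1, RSTART) "") < (c "")) print $1
    }'
}

# Usage against a live cluster:
#   curl -s 'http://localhost:9200/_cat/indices?h=index' | indices_older_than 2024.05.01
```

Review the printed list by hand, then pass each name to the Delete Index API shown above.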
Reduce replica count
Lowering replicas drops entire shard copies.
# WARNING: Reduces redundancy. Use only during emergency recovery.
curl -X PUT 'http://localhost:9200/<index>/_settings' -H 'Content-Type: application/json' -d '
{
"index.number_of_replicas": 0
}'
Space is freed immediately on every node that hosted a removed replica. Re-add replicas after the cluster is stable and disk usage is back below the low watermark.
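To decide which indices are worth the loss of redundancy, you can estimate how much disk each one's replicas occupy as total store size minus primary store size. The helper below is a sketch; replica_bytes is a hypothetical name and it expects three columns (index, store.size, pri.store.size) in bytes.

```shell
# For each "index store_bytes primary_bytes" row on stdin, print the index
# and the approximate bytes held by its replica copies. Rows with missing
# sizes (for example closed indices) are skipped.
replica_bytes() {
  awk '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { printf "%s %.0f\n", $1, $2 - $3 }'
}

# Usage against a live cluster, largest replica footprint first:
#   curl -s 'http://localhost:9200/_cat/indices?h=index,store.size,pri.store.size&bytes=b' \
#     | replica_bytes | sort -k2,2nr | head
```

The figure is cluster-wide, so the space freed on any single node depends on which replica shards that node hosts.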
Do not start force-merges on a full disk
Force-merging to a single segment temporarily requires roughly 2x the index size on disk. If a merge runs out of space mid-operation, the shard can fail. If a merge is already running, monitor merges.current and wait for it to finish. Lucene reclaims the space automatically once the new segment replaces the old ones.
Clear the flood-stage read-only block
In 7.4 and later (including 8.x), the block clears automatically when disk usage falls below the high watermark (90 percent). If it does not clear, if you run an earlier version, or if you need to unblock writes immediately after deletions:
# Run this after disk is actually below 90%.
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
"index.blocks.read_only_allow_delete": null
}'
If disk usage is still above the flood-stage watermark, Elasticsearch reapplies the block on its next disk check.
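One way to avoid clearing the block too early is to gate the request on every node being below the high watermark. The following is a sketch only: all_below is a hypothetical helper, it expects two columns (node and disk.percent) on stdin, and it deliberately fails when it sees no usable rows so that a failed curl does not count as "safe".

```shell
# Succeed only if at least one node row was read and every node's disk
# percent is strictly below the threshold.
all_below() {
  local threshold=$1
  awk -v t="$threshold" '
    $2 ~ /^[0-9]+$/ { seen = 1; if ($2 + 0 >= t) bad = 1 }
    END { exit (bad || !seen) }'
}

# Usage against a live cluster:
#   if curl -s 'http://localhost:9200/_cat/allocation?h=node,disk.percent' | all_below 90; then
#     curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' \
#       -d '{"index.blocks.read_only_allow_delete": null}'
#   fi
```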
Address ILM and snapshot cleanup
If ILM is stuck due to disk pressure, free space first, then retry:
curl -X POST 'http://localhost:9200/<index>/_ilm/retry'
Failed or partial snapshots can consume repository space. If the snapshot repository shares a filesystem with data nodes, delete old snapshots via the Delete Snapshot API. Do not manually delete files inside the repository path.
Prevention
- Maintain sustained disk usage below 70 percent to absorb merge spikes without crossing watermarks.
- Verify ILM policies delete indices. Check GET /*/_ilm/explain for stuck transitions.
- Size shards so losing one or two replicas does not push nodes past 90 percent.
- Plan capacity for merge overhead, not just raw data size.
- Monitor per-node disk, not cluster-wide averages. One hot node triggers the cascade.
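For capacity planning against the 70 percent target, it helps to turn a percentage into an amount of data to remove or move. The helper below is a rough sketch with a hypothetical name (gb_to_free) and integer GB inputs; it ignores merge overhead and ongoing ingest.

```shell
# Print how many GB must be freed on a node to bring usage down to the
# target percentage. Prints 0 if the node is already at or below target.
# Args: used_gb total_gb target_pct (integers)
gb_to_free() {
  local used=$1 total=$2 target_pct=$3
  local need=$(( used - total * target_pct / 100 ))
  if [ "$need" -lt 0 ]; then need=0; fi
  echo "$need"
}

# A 1000 GB node holding 920 GB:
gb_to_free 920 1000 90   # prints: 20  (to get back under the high watermark)
gb_to_free 920 1000 70   # prints: 220 (to reach the sustained target)
```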
How Netdata helps
- Per-node disk usage charts show which data nodes are approaching watermarks before the cluster blocks writes.
- Disk I/O wait and throughput metrics help distinguish temporary merge storms from sustained ingest growth.
- JVM heap and GC metrics correlate with segment metadata growth and give early warning before disk pressure forces emergency deletions.
- Alerts on cluster health and node-level disk percent provide the lead time to act before flood stage.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery
- Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
- Elasticsearch high disk watermark [90%] exceeded: shard relocation and the cascade
- Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor
- Elasticsearch this action would add too many shards: max_shards_per_node limit
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert







