Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
Writes fail with cluster_block_exception or FORBIDDEN/12/index read-only / allow delete (api). Kibana becomes unreachable; Logstash and Beats buffer or drop data. The cluster did not fail at once. It crossed a sequence of thresholds that turned single-node disk pressure into a cluster-wide write outage. With homogeneous disk sizes, every data node likely hit the thresholds within minutes, leaving no relocation target and no relief valve.
Elasticsearch uses three disk watermarks. The low watermark (85% by default) stops new shards from being allocated to the node, except primaries of newly created indices. The high watermark (90%) triggers shard relocation, which consumes I/O and disk on both source and target. The flood stage (95%) forces every index with a shard on the affected node into a read_only_allow_delete state. Writes stop, but relocation traffic may still run, compounding the pressure.
This guide covers how to identify the active threshold, why the cascade happened, and how to recover without triggering additional relocations or I/O storms.
What this means
The disk allocator evaluates three thresholds per data path:
- Low watermark (85%): No new shards are allocated to the node, with one exception: primaries of newly created indices can still be placed there, but their replicas cannot. Existing shards stay.
- High watermark (90%): Elasticsearch begins relocating shards off the node to peers with free space. This generates segment file copies and translog replays, increasing I/O load and temporarily raising disk usage on the target node.
- Flood stage (95%): Elasticsearch sets index.blocks.read_only_allow_delete: true on every index that has a shard on the affected node. All writes to those indices are rejected.
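In Elasticsearch 8.x, max_headroom settings cap the free space a percentage-based watermark can require, which matters on large disks: without a cap, a 10 TB disk at the 85% low watermark would need 1.5 TB free. Built-in defaults when percentages are used without explicit headroom values are: low 200 GB, high 150 GB, flood stage 100 GB.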
Since Elasticsearch 7.4, the flood-stage block is automatically removed once disk usage on the node drops below the high watermark. On older versions, you must clear it manually after freeing space.
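The interaction between percentage watermarks and max_headroom is easier to see with numbers. The following is a minimal sketch, not Elasticsearch's own implementation: it assumes the required free space is the smaller of the percentage-derived value and the headroom cap, and the names Watermark, required_free_gb, and active_stage are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Watermark:
    name: str
    percent: int          # used-disk percentage at which the watermark fires
    max_headroom_gb: int  # cap on the free space the watermark may require


# Defaults quoted in this guide, ordered from most to least severe.
DEFAULTS = (
    Watermark("flood_stage", 95, 100),
    Watermark("high", 90, 150),
    Watermark("low", 85, 200),
)


def required_free_gb(total_gb: float, wm: Watermark) -> float:
    """Free space the watermark requires: percentage-based, capped by headroom."""
    pct_based = total_gb * (100 - wm.percent) / 100
    return min(pct_based, wm.max_headroom_gb)


def active_stage(total_gb: float, used_gb: float, watermarks=DEFAULTS) -> str:
    """Return the most severe watermark the node has crossed, or 'ok'."""
    free_gb = total_gb - used_gb
    for wm in watermarks:  # most severe first
        if free_gb < required_free_gb(total_gb, wm):
            return wm.name
    return "ok"


if __name__ == "__main__":
    # A 1 TB disk behaves as the percentages suggest.
    print(active_stage(1000, 910))
    # A 10 TB disk at 90% used still has more free space than any headroom cap.
    print(active_stage(10000, 9000))
```

On small disks the percentages dominate; on large disks the headroom caps decide, so a node can sit above 90% used without crossing any watermark.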
flowchart TD
A[Disk below 85%] --> B[Low watermark 85%]
B --> C[No new shards allocated]
C --> D[High watermark 90%]
D --> E[Shard relocation starts]
E --> F[Target nodes approach watermark]
F --> G[Flood stage 95%]
G --> H[Indices set read-only]
H --> I[Automatic unblock below 90% in 7.4+]
H --> J[Manual unblock required in older versions]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| ILM not deleting old indices | Disk grows steadily; indices older than retention still exist | GET /_cat/indices?v&s=creation.date:desc |
| Data growth outpacing capacity | Linear disk increase across all nodes with no sudden spike | GET /_cat/allocation?v for per-node trends |
| Merge operations temporarily doubling disk | Spike during heavy indexing or force merge | GET /_nodes/stats/indices/merge |
| Uneven shard distribution | One node near 95% while others sit at 60% | GET /_cat/allocation?v disk percent column |
Quick checks
Run these read-only commands to assess state.
# Disk usage per node and active shard counts
curl -s 'http://localhost:9200/_cat/allocation?v'
# Current watermark thresholds and max_headroom settings
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk.watermark.*'
# Active flood-stage read-only blocks
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete'
# Active shard relocations
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node&s=state:asc'
# ILM errors for indices that should have been deleted
curl -s 'http://localhost:9200/*/_ilm/explain?only_errors=true&only_managed=true'
# Merge activity that may be spiking disk usage
curl -s 'http://localhost:9200/_nodes/stats/indices/merge?filter_path=nodes.*.indices.merges'
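The read-only block check returns nested JSON that is tedious to scan by eye on a cluster with many indices. The sketch below lists blocked indices from an index settings response that has already been fetched and parsed. It assumes the usual response shape (index name, then settings, then index, then blocks); the function name blocked_indices and the sample index names are hypothetical.

```python
from __future__ import annotations

import json


def blocked_indices(settings_response: dict) -> list[str]:
    """Return the names of indices carrying the flood-stage block, sorted."""
    blocked = []
    for index_name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        # Setting values come back as strings such as "true".
        value = str(blocks.get("read_only_allow_delete", "false")).lower()
        if value == "true":
            blocked.append(index_name)
    return sorted(blocked)


# Illustrative payload only; real index names will differ.
SAMPLE = json.loads("""
{
  "logs-2024.05.01": {"settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}},
  "logs-2024.05.02": {"settings": {"index": {"blocks": {"read_only_allow_delete": "false"}}}},
  "metrics-000007":  {"settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}}
}
""")

if __name__ == "__main__":
    for name in blocked_indices(SAMPLE):
        print(name)
```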
How to diagnose it
- Identify affected nodes. Use GET /_cat/allocation?v&s=disk.percent:desc and note which nodes are above 85%, 90%, and 95%. If most data nodes are above 85%, you are in a homogeneous storage cascade.
- Check for read-only blocks. Use GET /_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete. Any index with true has been touched by flood stage on at least one hosting node.
- Determine if relocation is making it worse. Use GET /_cat/shards?v&h=index,shard,prirep,state,node and look for RELOCATING shards. Heavy relocation while disk is near 90% means the allocator is copying large segment files, adding temporary disk overhead on both sides.
- Find data that can be removed. Use GET /_cat/indices?v&s=store.size:desc to find the largest indices. Correlate with creation.date to identify old indices that ILM should have deleted.
- Check ILM status. Use GET /*/_ilm/explain?only_errors=true to see if indices are stuck in a transition step (for example, waiting for force merge or shrink). Stuck ILM is a common root cause of unbounded disk growth.
- Verify raw filesystem usage. In _cat/allocation, disk.indices counts only shard data, while disk.used and disk.percent reflect the whole filesystem. Non-Elasticsearch files on the same volume (logs, snapshots, temporary files) count toward the watermark, so a large gap between disk.used and disk.indices points to them. Check df -h on the node to confirm total usage.
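The first and last steps above can be combined into one pass over the allocation data. This is a rough sketch that assumes the rows come from _cat/allocation with format=json and bytes=b, so that sizes are plain byte counts in string form. It uses the default percentages only and ignores max_headroom; the function names and the "more than half the nodes" rule for a homogeneous cascade are illustrative assumptions.

```python
from __future__ import annotations

LOW, HIGH, FLOOD = 85, 90, 95  # default percentage watermarks


def stage(disk_percent: int) -> str:
    """Map a used-disk percentage to the watermark it has crossed."""
    if disk_percent >= FLOOD:
        return "flood_stage"
    if disk_percent >= HIGH:
        return "high"
    if disk_percent >= LOW:
        return "low"
    return "ok"


def summarize(rows: list[dict]) -> dict:
    """Classify each node and estimate disk used by non-shard data."""
    nodes = {}
    for row in rows:
        if row.get("disk.percent") is None:
            continue  # the UNASSIGNED row carries no disk figures
        used = int(row["disk.used"])
        shard_data = int(row["disk.indices"])
        nodes[row["node"]] = {
            "stage": stage(int(row["disk.percent"])),
            "non_shard_bytes": used - shard_data,
        }
    over_low = sum(1 for n in nodes.values() if n["stage"] != "ok")
    return {"nodes": nodes, "homogeneous_cascade": over_low > len(nodes) / 2}


# Illustrative rows; real output has more columns (shards, host, ip, ...).
SAMPLE_ROWS = [
    {"node": "es-data-1", "disk.percent": "96", "disk.used": "960", "disk.indices": "700"},
    {"node": "es-data-2", "disk.percent": "91", "disk.used": "910", "disk.indices": "900"},
    {"node": "es-data-3", "disk.percent": "62", "disk.used": "620", "disk.indices": "610"},
    {"node": "UNASSIGNED", "disk.percent": None, "disk.used": None, "disk.indices": None},
]

if __name__ == "__main__":
    print(summarize(SAMPLE_ROWS))
```

A node whose non_shard_bytes is large relative to its shard data is a candidate for the df -h check before any index is deleted.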
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk usage per node (_cat/allocation) | Watermarks trigger allocation changes and read-only blocks | Any data node above 85% |
| index.blocks.read_only_allow_delete | Direct evidence of flood stage impact | Block present on any index |
| Shard relocations (relocating_shards) | Indicates high watermark has fired and I/O storm is active | Sustained nonzero relocations with disk near 90% |
| ILM explain errors | Stuck ILM causes indices to accumulate beyond retention | Indices stuck in a phase for more than 24 hours |
| Merge activity (merges.current) | Merges temporarily require disk for old and new segments | merges.current persistently at max thread count |
| Disk percent trend | Predicts time to watermark | Projected to cross 85% within 7 days |
Fixes
Immediate response when flood stage is active
Warning: The following commands delete data or reduce redundancy. Verify what you are deleting before running them.
Free disk space. The fastest relief is to delete old indices that are no longer needed:
DELETE /<old-index>
If you cannot delete indices, temporarily reduce replica count to free space, at the cost of redundancy:
PUT /<index>/_settings {"number_of_replicas": 0}
Clear the read-only block. In Elasticsearch 7.4 and later, the block is automatically removed once disk drops below the high watermark. If it does not clear automatically, or if you are on an older version, remove it manually after freeing space:
PUT /_all/_settings {"index.blocks.read_only_allow_delete": null}
Stop unnecessary relocations. If you are in a relocation storm and the cluster is unstable, temporarily disable rebalancing to stop the I/O burn. This setting stops balance-driven moves only; relocations forced off a node that is above the high watermark can continue:
PUT /_cluster/settings {"transient": {"cluster.routing.rebalance.enable": "none"}}
Re-enable after the incident:
PUT /_cluster/settings {"transient": {"cluster.routing.rebalance.enable": "all"}}
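Before deleting anything, it can help to draft the deletion list offline and review it. The sketch below is a dry-run planner under stated assumptions: it takes index records with hypothetical keys name, creation_ms, and store_bytes (which could be derived from _cat/indices output), picks the oldest indices first, and stops once a target number of bytes would be freed. It does not call Elasticsearch. Store size is cluster-wide and includes replicas, so the space freed on one specific node depends on shard placement.

```python
from __future__ import annotations


def plan_deletions(
    indices: list[dict],
    bytes_to_free: int,
    protected: frozenset[str] = frozenset(),
) -> tuple[list[str], int]:
    """Pick oldest indices until the target is met. Returns (names, bytes freed)."""
    plan: list[str] = []
    freed = 0
    for idx in sorted(indices, key=lambda i: i["creation_ms"]):
        if freed >= bytes_to_free:
            break
        if idx["name"] in protected:
            continue  # never propose indices the operator has excluded
        plan.append(idx["name"])
        freed += idx["store_bytes"]
    return plan, freed


# Illustrative records only.
SAMPLE_INDICES = [
    {"name": "logs-000003", "creation_ms": 3, "store_bytes": 300},
    {"name": "logs-000001", "creation_ms": 1, "store_bytes": 100},
    {"name": "logs-000002", "creation_ms": 2, "store_bytes": 200},
]

if __name__ == "__main__":
    names, freed = plan_deletions(SAMPLE_INDICES, bytes_to_free=250)
    print("would delete:", names, "freeing", freed, "bytes")
```

Treat the result as a proposal to verify against retention requirements and snapshots, not as a list to delete unreviewed.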
ILM not keeping pace
If old indices still exist, ILM may be stuck:
- Find stuck indices: GET /*/_ilm/explain?only_errors=true&only_managed=true
- Read the error message for the specific index. Common issues include missing write aliases, insufficient disk for shrink, or snapshot policies blocking deletion.
- Fix the root cause, then retry the policy: POST /<index>/_ilm/retry
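When many indices are stuck, a short summary of the explain output makes it easier to group failures by cause before retrying. The following sketch assumes an already-parsed ILM explain response in which a failed index reports step as ERROR, along with failed_step and step_info.reason. The function name ilm_errors and the sample payload are hypothetical.

```python
from __future__ import annotations


def ilm_errors(explain_response: dict) -> list[dict]:
    """Summarize indices whose current ILM step is ERROR, sorted by index name."""
    errors = []
    for name, info in explain_response.get("indices", {}).items():
        if info.get("step") != "ERROR":
            continue
        errors.append(
            {
                "index": name,
                "policy": info.get("policy"),
                "failed_step": info.get("failed_step"),
                "reason": info.get("step_info", {}).get("reason"),
            }
        )
    return sorted(errors, key=lambda e: e["index"])


# Illustrative payload; policy names and reasons are made up.
SAMPLE_EXPLAIN = {
    "indices": {
        "logs-000002": {
            "managed": True,
            "policy": "logs-policy",
            "phase": "hot",
            "step": "ERROR",
            "failed_step": "check-rollover-ready",
            "step_info": {"type": "illegal_argument_exception", "reason": "example reason"},
        },
        "logs-000003": {"managed": True, "policy": "logs-policy", "phase": "hot", "step": "complete"},
    }
}

if __name__ == "__main__":
    for err in ilm_errors(SAMPLE_EXPLAIN):
        print(err["index"], err["failed_step"], err["reason"])
```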
Uneven shard distribution
If one node is at 95% while others are at 60%, the cluster has a hot spot. Long term, use index-level allocation filters or shard allocation awareness to spread large shards. Short term, manual reroute is disruptive and should be used sparingly.
Merge-induced spikes
If disk usage spiked during a force merge or heavy indexing, do not trigger additional merges. Reduce index.merge.scheduler.max_thread_count temporarily if I/O is saturated, or wait for the current merge to finish. Do not force merge indices when disk is already near the high watermark.
Prevention
- Monitor per-node disk, not cluster averages. A cluster-wide average of 70% can hide a node at 92%. Alert when any data node crosses 80%.
- Project time-to-watermark. Track daily disk growth and alert when a node is projected to cross the low watermark within 14 days.
- Verify ILM deletion. Periodically audit /_ilm/explain to ensure indices are actually being deleted. A common failure mode is a policy that rolls over but never reaches the delete phase.
- Leave merge headroom. Merges can temporarily require disk space equivalent to the full size of the segments being merged. Plan capacity so that peak usage during merges stays below 85%.
- Test your snapshot restore. If you must delete indices to recover from flood stage, ensure your snapshot repository is healthy and restore-tested.
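The time-to-watermark projection above can be approximated with simple linear extrapolation. This is a minimal sketch that assumes one disk-percent reading per day per node, oldest first, and steady growth; real growth is often bursty, so treat the result as a rough early warning. The function names are hypothetical, and the 14-day window mirrors the suggestion in this section.

```python
from __future__ import annotations


def days_until_watermark(daily_pct: list[float], threshold_pct: float = 85.0) -> float | None:
    """Estimate days until the threshold is crossed.

    Returns 0.0 if already at or above the threshold, and None if usage is
    flat or shrinking (no crossing projected).
    """
    if len(daily_pct) < 2:
        raise ValueError("need at least two daily samples")
    current = daily_pct[-1]
    if current >= threshold_pct:
        return 0.0
    # Average daily growth between the first and last sample.
    growth_per_day = (current - daily_pct[0]) / (len(daily_pct) - 1)
    if growth_per_day <= 0:
        return None
    return (threshold_pct - current) / growth_per_day


def should_alert(daily_pct: list[float], window_days: float = 14.0) -> bool:
    """True when the low watermark is projected to be crossed within the window."""
    days = days_until_watermark(daily_pct)
    return days is not None and days <= window_days


if __name__ == "__main__":
    history = [70.0, 71.0, 72.0, 73.0]  # illustrative readings for one node
    print(days_until_watermark(history), should_alert(history))
```

Run the projection per node rather than on a cluster average, for the same reason the first bullet gives for alerting.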
How Netdata helps
- Per-node disk utilization alongside I/O wait, merge activity, and indexing latency in one view. Spot whether a relocation storm or merge backlog is the immediate trigger.
- Custom threshold alerts on any node crossing 85%, 90%, or 95% disk usage catch homogeneous cascades before flood stage.
- Correlation of disk pressure with thread pool rejections and search latency distinguishes capacity problems from query-driven I/O spikes.
- Disk usage trends project days-until-watermark for proactive capacity planning.