Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only

Writes fail with cluster_block_exception or FORBIDDEN/12/index read-only / allow delete (api). Kibana becomes unreachable; Logstash and Beats buffer or drop data. The cluster did not fail at once. It crossed a sequence of thresholds that turned single-node disk pressure into a cluster-wide write outage. With homogeneous disk sizes, every data node likely hit the thresholds within minutes, leaving no relocation target and no relief valve.

Elasticsearch uses three disk watermarks. The low watermark (85% by default) stops new shard allocation. The high watermark (90%) triggers shard relocation, which consumes I/O and disk on both source and target. The flood stage (95%) forces every index with a shard on the affected node into a read_only_allow_delete state. Writes stop, but relocation traffic may still run, compounding the pressure.

This guide covers how to identify the active threshold, why the cascade happened, and how to recover without triggering additional relocations or I/O storms.

What this means

The disk allocator evaluates three thresholds per data path:

  • Low watermark (85%): Elasticsearch stops allocating shards to the node. The exception is primary shards of newly created indices, which can still be placed there; their replicas cannot. Existing shards stay.
  • High watermark (90%): Elasticsearch begins relocating shards off the node to peers with free space. This generates segment file copies and translog replays, increasing I/O load and temporarily raising disk usage on the target node.
  • Flood stage (95%): Elasticsearch sets index.blocks.read_only_allow_delete: true on every index that has a shard on the affected node. All writes to those indices are rejected.
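
As a rough illustration of the three thresholds, the sketch below maps a node's used-disk percentage to the stage it falls in. The function name `watermark_stage` is hypothetical, the thresholds are the defaults quoted above, and treating the boundary value as "at the watermark" is a simplification rather than the allocator's exact rule.

```shell
# Classify a used-disk percentage against the percentage watermarks.
# $1 = used percent; $2-$4 = optional low/high/flood overrides (defaults 85/90/95).
# Hypothetical helper for illustration only; it does not query the cluster.
watermark_stage() {
  awk -v u="$1" -v l="${2:-85}" -v h="${3:-90}" -v f="${4:-95}" 'BEGIN {
    if (u >= f)      print "flood_stage"   # indices on this node go read-only
    else if (u >= h) print "high"          # shards start relocating away
    else if (u >= l) print "low"           # no further shards allocated here
    else             print "ok"
  }'
}
```

For example, `watermark_stage 91` prints `high`, and `watermark_stage 88 80 85 90` applies custom thresholds of 80/85/90.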

In Elasticsearch 8.x, max_headroom settings cap the amount of free space a percentage watermark can require, so a very large disk does not have to keep 15% of its capacity idle. Built-in defaults when percentages are used without explicit headroom values are: low 200 GB, high 150 GB, flood stage 100 GB.
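
To see what the headroom values mean on a large disk, the sketch below computes the free space a node must keep and the used-percent at which a watermark effectively trips. It assumes the required free space is the smaller of the percentage-derived value and the max_headroom value; the function names are hypothetical.

```shell
# Free space (GB) a node must keep before a percentage watermark trips,
# assuming the rule: min(total * (100 - pct) / 100, max_headroom).
# $1 = disk size in GB, $2 = watermark percent, $3 = max_headroom in GB
required_free_gb() {
  awk -v total="$1" -v pct="$2" -v cap="$3" 'BEGIN {
    by_pct = total * (100 - pct) / 100
    free = (by_pct < cap) ? by_pct : cap
    printf "%.1f\n", free
  }'
}

# Used-percent at which the watermark effectively trips on that disk.
effective_pct() {
  awk -v total="$1" -v pct="$2" -v cap="$3" 'BEGIN {
    by_pct = total * (100 - pct) / 100
    free = (by_pct < cap) ? by_pct : cap
    printf "%.2f\n", (total - free) * 100 / total
  }'
}
```

Under that assumption, `effective_pct 1000 85 200` prints `85.00` because 15% of a 1000 GB disk is below the 200 GB cap, while `effective_pct 10000 85 200` prints `98.00` because the cap takes over on a 10000 GB disk.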

Since Elasticsearch 7.4, the flood-stage block is automatically removed once disk usage on the node drops below the high watermark. On older versions, you must clear it manually after freeing space.

flowchart TD
    A[Disk below 85%] --> B[Low watermark 85%]
    B --> C[No new shards allocated]
    C --> D[High watermark 90%]
    D --> E[Shard relocation starts]
    E --> F[Target nodes approach watermark]
    F --> G[Flood stage 95%]
    G --> H[Indices set read-only]
    H --> I[Automatic unblock below 90% in 7.4+]
    H --> J[Manual unblock required in older versions]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| ILM not deleting old indices | Disk grows steadily; indices older than retention still exist | GET /_cat/indices?v&h=index,creation.date.string,store.size&s=creation.date (oldest first) |
| Data growth outpacing capacity | Linear disk increase across all nodes with no sudden spike | GET /_cat/allocation?v for per-node trends |
| Merge operations temporarily doubling disk | Spike during heavy indexing or force merge | GET /_nodes/stats/indices/merge |
| Uneven shard distribution | One node near 95% while others sit at 60% | GET /_cat/allocation?v disk percent column |

Quick checks

Run these read-only commands to assess state.

# Disk usage per node and active shard counts
curl -s 'http://localhost:9200/_cat/allocation?v'

# Current watermark thresholds and max_headroom settings
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk.watermark.*'

# Active flood-stage read-only blocks
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete'

# Active shard relocations
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node&s=state:asc'

# ILM errors for indices that should have been deleted
curl -s 'http://localhost:9200/*/_ilm/explain?only_errors=true&only_managed=true'

# Merge activity that may be spiking disk usage
curl -s 'http://localhost:9200/_nodes/stats/indices/merge?filter_path=nodes.*.indices.merges'

How to diagnose it

  1. Identify affected nodes. Use GET /_cat/allocation?v&s=disk.percent:desc and note which nodes are above 85%, 90%, and 95%. If most data nodes are above 85%, you are in a homogeneous storage cascade.
  2. Check for read-only blocks. Use GET /_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete. Any index with true has been touched by flood stage on at least one hosting node.
  3. Determine if relocation is making it worse. Use GET /_cat/shards?v&h=index,shard,prirep,state,node and look for RELOCATING shards; the node column shows the source and target of a shard in flight. Heavy relocation while disk is near 90% means the allocator is copying large segment files, adding temporary disk overhead on both sides.
  4. Find data that can be removed. Use GET /_cat/indices?v&s=store.size:desc to find the largest indices. Correlate with creation.date to identify old indices that ILM should have deleted.
  5. Check ILM status. Use GET /*/_ilm/explain?only_errors=true to see if indices are stuck in a transition step (for example, waiting for force merge or shrink). Stuck ILM is a common root cause of unbounded disk growth.
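  6. Verify raw filesystem usage. In _cat/allocation, disk.indices counts shard data only, while disk.used and disk.percent reflect the whole volume as reported by the operating system. Non-Elasticsearch files on the same volume (logs, snapshots, temporary files) count toward the watermark, so a large gap between disk.used and disk.indices points to them. Check df -h on the node to confirm total usage.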
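
Steps 1 and 6 can be combined in a small filter over `_cat/allocation` output. This is a sketch, not a supported tool: the function name `flag_nodes` is hypothetical, and it assumes the columns are requested explicitly as `node,disk.percent,disk.indices,disk.used` with `bytes=gb` so the values are plain integers.

```shell
# Reads `_cat/allocation?v&h=node,disk.percent,disk.indices,disk.used&bytes=gb`
# output on stdin and prints one line per node at or above the low watermark.
# $1-$3 = optional low/high/flood overrides (defaults 85/90/95).
flag_nodes() {
  awk -v low="${1:-85}" -v high="${2:-90}" -v flood="${3:-95}" '
    NR == 1 { next }                      # skip the header row
    NF < 4 || $2 !~ /^[0-9]+$/ { next }   # skip rows without disk figures (e.g. UNASSIGNED)
    {
      stage = ($2 >= flood) ? "flood_stage" : ($2 >= high) ? "high" : ($2 >= low) ? "low" : ""
      # other_gb = disk.used - disk.indices, i.e. space not taken by shard data
      if (stage != "") printf "%s %s%% %s other_gb=%d\n", $1, $2, stage, $4 - $3
    }'
}
```

A typical invocation pipes curl into it: curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.indices,disk.used&bytes=gb' | flag_nodes. A large other_gb value on a flagged node suggests the pressure comes from files outside Elasticsearch.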

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk usage per node (_cat/allocation) | Watermarks trigger allocation changes and read-only blocks | Any data node above 85% |
| index.blocks.read_only_allow_delete | Direct evidence of flood stage impact | Block present on any index |
| Shard relocations (relocating_shards) | Indicates high watermark has fired and I/O storm is active | Sustained nonzero relocations with disk near 90% |
| ILM explain errors | Stuck ILM causes indices to accumulate beyond retention | Indices stuck in a phase for more than 24 hours |
| Merge activity (merges.current) | Merges temporarily require disk for old and new segments | merges.current persistently at max thread count |
| Disk percent trend | Predicts time to watermark | Projected to cross 85% within 7 days |
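
The disk percent trend signal reduces to simple arithmetic once you have a growth rate. The sketch below assumes linear growth, which is only an approximation for bursty ingest; the function name `days_to_watermark` is hypothetical.

```shell
# Days until a node reaches a target used-percent, assuming linear growth.
# $1 = current used %, $2 = growth in percentage points per day,
# $3 = target % (defaults to the 85% low watermark)
days_to_watermark() {
  awk -v cur="$1" -v rate="$2" -v target="${3:-85}" 'BEGIN {
    if (cur >= target) { print 0; exit }        # already at or past the target
    if (rate <= 0)     { print "never"; exit }  # flat or shrinking usage
    printf "%.1f\n", (target - cur) / rate
  }'
}
```

For example, a node at 78% growing half a point per day gives `days_to_watermark 78 0.5`, which prints `14.0`.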

Fixes

Immediate response when flood stage is active

Warning: The following commands delete data or reduce redundancy. Verify what you are deleting before running them.

  1. Free disk space. The fastest relief is to delete old indices that are no longer needed:

    DELETE /<old-index>
    

    If you cannot delete indices, temporarily reduce replica count to free space, at the cost of redundancy:

    PUT /<index>/_settings
    {"number_of_replicas": 0}
    
  2. Clear the read-only block. In Elasticsearch 7.4 and later, the block is automatically removed once disk drops below the high watermark. If it does not clear automatically, or if you are on an older version, remove it manually after freeing space:

    PUT /_all/_settings
    {"index.blocks.read_only_allow_delete": null}
    
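  3. Stop unnecessary relocations. If you are in a relocation storm and the cluster is unstable, temporarily disable rebalancing to reduce the I/O load. This setting pauses balancing moves only; relocations forced by the high watermark continue until the node drops back below it: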

    PUT /_cluster/settings
    {"transient": {"cluster.routing.rebalance.enable": "none"}}
    

    Re-enable after the incident:

    PUT /_cluster/settings
    {"transient": {"cluster.routing.rebalance.enable": "all"}}
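
Before running the DELETE in step 1, it can help to produce a reviewable list of candidates instead of deleting directly. The sketch below is a dry run only: the function name `deletion_candidates` is hypothetical, and it assumes input from `_cat/indices` with the columns `index,creation.date,store.size` and `bytes=gb`, where creation.date is epoch milliseconds.

```shell
# Reads `_cat/indices?h=index,creation.date,store.size&bytes=gb` output on stdin
# and prints indices older than N days. Nothing is deleted; review the list first.
# $1 = retention in days, $2 = "now" in epoch seconds (defaults to the current time)
deletion_candidates() {
  awk -v days="$1" -v now="${2:-$(date +%s)}" '
    $1 ~ /^\./ { next }            # leave dot-prefixed (system, hidden) indices alone
    $2 !~ /^[0-9]+$/ { next }      # skip a header row or malformed lines
    {
      age_days = (now - $2 / 1000) / 86400
      if (age_days > days) printf "%s age_days=%d size_gb=%s\n", $1, age_days, $3
    }'
}
```

A possible invocation: curl -s 'http://localhost:9200/_cat/indices?h=index,creation.date,store.size&bytes=gb' | deletion_candidates 30. Dot-prefixed names are skipped deliberately, so indices in that namespace need a separate decision.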
    

ILM not keeping pace

If old indices still exist, ILM may be stuck:

  1. Find stuck indices:
    GET /*/_ilm/explain?only_errors=true&only_managed=true
    
  2. Read the error message for the specific index. Common issues include missing write aliases, insufficient disk for shrink, or snapshot policies blocking deletion.
  3. Fix the root cause, then retry the policy:
    POST /<index>/_ilm/retry
    

Uneven shard distribution

If one node is at 95% while others are at 60%, the cluster has a hot spot. Long term, use index-level allocation filters or shard allocation awareness to spread large shards. Short term, manual reroute is disruptive and should be used sparingly.

Merge-induced spikes

If disk usage spiked during a force merge or heavy indexing, do not trigger additional merges. Reduce index.merge.scheduler.max_thread_count temporarily if I/O is saturated, or wait for the current merge to finish. Do not force merge indices when disk is already near the high watermark.

Prevention

  • Monitor per-node disk, not cluster averages. A cluster-wide average of 70% can hide a node at 92%. Alert when any data node crosses 80%.
  • Project time-to-watermark. Track daily disk growth and alert when a node is projected to cross the low watermark within 14 days.
  • Verify ILM deletion. Periodically audit /_ilm/explain to ensure indices are actually being deleted. A common failure mode is a policy that rolls over but never reaches the delete phase.
  • Leave merge headroom. Merges can temporarily require disk space equivalent to the full size of the segments being merged. Plan capacity so that peak usage during merges stays below 85%.
  • Test your snapshot restore. If you must delete indices to recover from flood stage, ensure your snapshot repository is healthy and restore-tested.
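
The merge headroom guideline can be checked with a quick worst-case estimate. The sketch below assumes, as stated above, that a merge may need extra space equal to the full size of the segments being merged; the function name `merge_peak_pct` is hypothetical.

```shell
# Worst-case used-percent while a merge is in flight.
# $1 = disk size GB, $2 = currently used GB, $3 = size of segments being merged GB
merge_peak_pct() {
  awk -v total="$1" -v used="$2" -v merging="$3" 'BEGIN {
    printf "%.1f\n", (used + merging) * 100 / total
  }'
}
```

For example, `merge_peak_pct 1000 800 100` prints `90.0`: a node that sits at 80% would touch the high watermark while merging 100 GB of segments, so it has less headroom than its steady-state figure suggests.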

How Netdata helps

  • Per-node disk utilization alongside I/O wait, merge activity, and indexing latency in one view. Spot whether a relocation storm or merge backlog is the immediate trigger.
  • Custom threshold alerts on any node crossing 85%, 90%, or 95% disk usage catch homogeneous cascades before flood stage.
  • Correlation of disk pressure with thread pool rejections and search latency distinguishes capacity problems from query-driven I/O spikes.
  • Disk usage trends project days-until-watermark for proactive capacity planning.