Elasticsearch disk watermark tuning: thresholds, max_headroom, and multiple data paths

Elasticsearch uses disk watermarks to decide when a node is too full to accept shards. The default percentages (85%, 90%, 95%) suit small disks. Applied on their own to modern multi-terabyte volumes, they leave hundreds of gigabytes free while still blocking allocation and triggering expensive rebalancing. Elasticsearch 8.x introduced max_headroom to cap the free-space requirement on large disks, but the interaction between percentages, absolute bytes, and max_headroom is not obvious. If you run multiple data paths, watermarks apply per path rather than per node, so a single nearly full path can cause node-wide allocation restrictions. This article explains how the allocator evaluates disk space, how to tune thresholds without causing relocation storms, and why multiple data paths complicate the picture.

What it is and why it matters

Elasticsearch monitors disk usage on every data node through three thresholds: low, high, and flood_stage. The defaults are 85%, 90%, and 95% used.

  • Low: the allocator stops placing shards on the node, except primary shards of newly created indices.
  • High: Elasticsearch begins actively relocating existing shards away.
  • Flood stage: Elasticsearch sets index.blocks.read_only_allow_delete, blocking writes until disk usage drops below the high watermark.

These thresholds prevent a single node from filling up and corrupting shards, but they are not a substitute for capacity planning. In production, watermarks surface as seemingly random shard relocations, yellow cluster health during rebalancing, or sudden write rejections returning HTTP 429 with cluster_block_exception.

For clusters with large disks, the percentages alone create a paradox: a 10TB node at 85% still has 1.5TB free, yet the cluster refuses to allocate new shards there. Elasticsearch 8.x addresses this with max_headroom settings, which cap the percentage-derived free-space requirement at a fixed number of bytes.
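To make the arithmetic concrete, here is a minimal sketch of how much free space each default percentage leaves on disks of different sizes. It assumes binary units (1TB = 1024GB), so the figures differ slightly from round decimal estimates, and it deliberately ignores max_headroom. The function name is illustrative, not part of any Elasticsearch API.

```python
GB = 1024 ** 3
TB = 1024 ** 4

# Default watermark percentages, expressed as percent of disk used.
DEFAULT_WATERMARKS = {"low": 85, "high": 90, "flood_stage": 95}


def free_bytes_at(total_bytes: int, used_percent: int) -> int:
    """Free space left when a disk of total_bytes is used_percent full."""
    return total_bytes * (100 - used_percent) // 100


for size_tb in (1, 4, 10):
    total = size_tb * TB
    row = ", ".join(
        f"{name}: {free_bytes_at(total, pct) / GB:.0f}GB free"
        for name, pct in DEFAULT_WATERMARKS.items()
    )
    print(f"{size_tb}TB disk -> {row}")
```

The free space left unused grows linearly with disk size, which is the gap that max_headroom is meant to close.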

How it works

The allocator evaluates watermarks independently for each data path. Even on a single-path node, the mechanism is identical. The settings live under cluster.routing.allocation.disk.watermark. Each level accepts either a percentage of used disk or an absolute byte value of free space. You cannot mix formats across the three levels: either all are percentages, or all are absolute byte values.

Semantics invert between modes:

  • Percentage mode: the value represents used space (default low: 85% used).
  • Absolute byte mode: the value represents free space required (low: the amount of free space that must remain).

This inversion means a safe byte-mode configuration requires the low threshold to be numerically larger than the high threshold. For example, low: "500GB" and high: "200GB" means the node must keep 500GB free at the low threshold and 200GB free at the high threshold.
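Because the inverted ordering is easy to get wrong, a small pre-flight check can help before applying byte-mode values. The sketch below is an assumption-laden helper, not Elasticsearch code: it handles only the b, kb, mb, gb, tb, and pb suffixes with binary multipliers, treats equal values as a mistake, and uses a 100GB flood-stage value purely as a placeholder.

```python
import re

# Binary multipliers for the suffixes this sketch understands.
_UNITS = {
    "b": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
    "tb": 1024 ** 4,
    "pb": 1024 ** 5,
}


def parse_bytes(value: str) -> int:
    """Parse a value such as '500GB' into bytes (case-insensitive)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(b|kb|mb|gb|tb|pb)\s*", value.lower())
    if match is None:
        raise ValueError(f"not a byte-size value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def check_byte_watermarks(low: str, high: str, flood_stage: str) -> None:
    """Raise unless the free-space values are ordered low > high > flood_stage."""
    low_b, high_b, flood_b = (parse_bytes(v) for v in (low, high, flood_stage))
    if not (low_b > high_b > flood_b):
        raise ValueError(
            "byte-mode watermarks express free space, so expected "
            f"low > high > flood_stage, got {low}, {high}, {flood_stage}"
        )


# Passes: more free space is required at low than at high.
check_byte_watermarks("500GB", "200GB", "100GB")
```

Swapping the first two arguments raises ValueError, which is the mistake someone used to percentage ordering is most likely to make.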

In 8.x, when a watermark uses a percentage, a sibling max_headroom setting caps the absolute free-space requirement. The defaults are:

  • low.max_headroom: 200GB
  • high.max_headroom: 150GB
  • flood_stage.max_headroom: 100GB
  • flood_stage.frozen.max_headroom: 20GB

These defaults are conditional. They apply when the parent watermark is left unset. If you explicitly set cluster.routing.allocation.disk.watermark.low: "85%", the 200GB default disappears unless you also explicitly set low.max_headroom. This is a frequent source of surprise: an operator configures a percentage on a large volume and expects a reasonable cap, but without the explicit max_headroom sibling, the allocator requires the full percentage-derived free space (15% of the volume for an 85% watermark).
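The conditional behavior described above can be modeled in a few lines. This is a simplified sketch of the rule as stated in this article, not the allocator's actual code: it assumes binary units and uses None to mean "setting left unset". The function and constant names are hypothetical.

```python
from typing import Optional

GB = 1024 ** 3
TB = 1024 ** 4

# Defaults quoted above. The headroom defaults apply only when the
# percentage watermark itself is left unset.
DEFAULT_PERCENT = {"low": 85, "high": 90, "flood_stage": 95}
DEFAULT_MAX_HEADROOM = {"low": 200 * GB, "high": 150 * GB, "flood_stage": 100 * GB}


def required_free_bytes(
    total_bytes: int,
    level: str,
    percent: Optional[int] = None,       # None: watermark left unset
    max_headroom: Optional[int] = None,  # None: max_headroom left unset
) -> int:
    """Free space a path must keep to stay under the given watermark level."""
    if percent is None:
        percent = DEFAULT_PERCENT[level]
        if max_headroom is None:
            max_headroom = DEFAULT_MAX_HEADROOM[level]
    from_percent = total_bytes * (100 - percent) // 100
    if max_headroom is None:
        # Explicit percentage without an explicit cap: no headroom limit.
        return from_percent
    return min(from_percent, max_headroom)


disk = 10 * TB
print(required_free_bytes(disk, "low") // GB)                                   # all defaults
print(required_free_bytes(disk, "low", percent=85) // GB)                       # explicit %, no cap
print(required_free_bytes(disk, "low", percent=85, max_headroom=200 * GB) // GB)  # explicit both
```

On a 10TB path, the first and third calls return the 200GB cap, while the second returns the full 15% of the volume.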

Watermark changes apply dynamically through the cluster settings API, but the allocator relies on disk usage reports that refresh every 30 seconds by default (cluster.info.update.interval). Do not expect instant allocation after raising a threshold; wait for the next refresh cycle.

At the low watermark, Elasticsearch stops allocating shards to the node, with one documented exception: primary shards of newly created indices can still be placed there, although their replicas cannot. Existing shards stay in place. At the high watermark, the allocator begins relocating shards away. Relocation consumes disk I/O and network bandwidth, degrading indexing and search performance on both source and target nodes.

At the flood stage, Elasticsearch sets index.blocks.read_only_allow_delete on every index that has a shard on the affected node. Writes to those indices return HTTP 429 with cluster_block_exception. In 7.4 and later, the block is removed automatically once disk usage drops below the high watermark. If the disk cannot free itself, delete data or move shards, then either wait for automatic removal or clear the block manually:

PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

A separate frozen-tier watermark exists at cluster.routing.allocation.disk.watermark.flood_stage.frozen. It has its own max_headroom default and operates independently of the regular flood stage.

flowchart TD
    A[Disk usage check per path] --> B{Below low?}
    B -->|Yes| C[Normal allocation]
    B -->|No| D{Below high?}
    D -->|Yes| E[No new shards]
    D -->|No| F{Below flood?}
    F -->|Yes| G[Relocate shards away]
    F -->|No| H[Read-only block]
    H --> I{Below high again?}
    I -->|Yes| J[Block released]
    I -->|No| H
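
The decision flow can also be written as a small function. This is a sketch under simplifying assumptions: it works on used-percent only (no byte mode or max_headroom), and the exact boundary handling (strictly below versus at the threshold) is this sketch's choice rather than a statement about the allocator.

```python
def disk_stage(
    used_percent: float,
    low: float = 85.0,
    high: float = 90.0,
    flood: float = 95.0,
) -> str:
    """Classify one data path's usage against the three watermarks."""
    if used_percent < low:
        return "normal allocation"
    if used_percent < high:
        return "no new shards"
    if used_percent < flood:
        return "relocate shards away"
    return "read-only block"


def block_released(used_percent: float, high: float = 90.0) -> bool:
    """A flood-stage block lifts only once usage is back below the high watermark."""
    return used_percent < high


for pct in (70, 87, 92, 96):
    print(pct, "->", disk_stage(pct))
```

Note the asymmetry: the block is applied at the flood stage but released at the high watermark, so a node hovering between the two stays blocked.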

Where it shows up in production

Large disks and asymmetric volumes. On a node with a 4TB data volume, an 85% low watermark on its own requires 600GB of free space; with the default max_headroom in effect, the requirement is capped at 200GB. If you also mount a 500GB volume on the same node, that path reaches the watermark much sooner, at 75GB free. The allocator does not average across paths; the fullest path triggers the node-wide restriction. Multiple data paths are deprecated since 7.13.0 but remain functional in 8.x.
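
A short model shows why averaging across paths hides the problem. The sketch below is a simplified illustration of the behavior described above, not the allocator's implementation. The mount points and free-space figures are invented, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

GB = 1024 ** 3
TB = 1024 ** 4


@dataclass
class DataPath:
    mount: str
    total_bytes: int
    free_bytes: int

    @property
    def used_percent(self) -> float:
        return 100.0 * (self.total_bytes - self.free_bytes) / self.total_bytes


def fullest_path(paths: List[DataPath]) -> DataPath:
    """The path with the highest used percentage."""
    return max(paths, key=lambda p: p.used_percent)


def combined_used_percent(paths: List[DataPath]) -> float:
    """Usage if all paths were pooled; shown only for contrast."""
    total = sum(p.total_bytes for p in paths)
    free = sum(p.free_bytes for p in paths)
    return 100.0 * (total - free) / total


def node_over_watermark(paths: List[DataPath], watermark_percent: float = 85.0) -> bool:
    """The node is restricted as soon as its fullest path crosses the watermark."""
    return fullest_path(paths).used_percent >= watermark_percent


paths = [
    DataPath("/data1", 4 * TB, 2 * TB),     # half empty
    DataPath("/data2", 500 * GB, 50 * GB),  # 90% used
]
print(f"pooled usage: {combined_used_percent(paths):.1f}%")
print(f"fullest path: {fullest_path(paths).mount} at {fullest_path(paths).used_percent:.1f}%")
print("node restricted:", node_over_watermark(paths))
```

Monitoring that reports only a node-level total would show this node as comfortably below the low watermark even though it is already restricted.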

Ingest bursts and ILM delays. A sudden spike in indexing, a failed ILM delete action, or a large force merge can push a node from 80% to 95% within minutes. Because the flood stage applies to every index with a shard on the node, a single full disk can block writes to dozens of indices, not just the one causing growth.
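
For a rough sense of how much warning an ingest burst leaves, a linear projection is often enough. The sketch below assumes a constant growth rate, which real workloads rarely have, and uses used-percent thresholds only. The function name and the example figures are hypothetical.

```python
def hours_until_watermark(
    used_bytes: float,
    total_bytes: float,
    growth_bytes_per_hour: float,
    watermark_percent: float,
) -> float:
    """Hours until usage reaches the watermark, assuming linear growth."""
    threshold_bytes = total_bytes * watermark_percent / 100
    if used_bytes >= threshold_bytes:
        return 0.0  # already at or past the watermark
    if growth_bytes_per_hour <= 0:
        return float("inf")  # flat or shrinking usage never reaches it
    return (threshold_bytes - used_bytes) / growth_bytes_per_hour


GB = 1024 ** 3
total = 4096 * GB
used = total * 0.80
burst_rate = 200 * GB  # bytes per hour during a hypothetical burst
for name, pct in (("low", 85), ("high", 90), ("flood_stage", 95)):
    hours = hours_until_watermark(used, total, burst_rate, pct)
    print(f"{name}: {hours:.1f}h")
```

Comparing the projected time against how long an ILM delete or a manual cleanup actually takes gives a usable alerting threshold.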

Frozen tiers. If you run a frozen tier, the default flood_stage.frozen.max_headroom of 20GB means searchable snapshot nodes can hit the read-only block while still having significant space on their cache volumes. Operators must configure both the regular and frozen watermarks.

Managed service constraints. On AWS OpenSearch, disk watermarks are hardcoded and cannot be changed via the API. Tuning guidance does not apply there; the only remedy is scaling storage or adding nodes.

Tradeoffs and when to use it

Use absolute byte values when you know the physical capacity and growth rate of your volumes. Absolute values express free space directly, which simplifies capacity planning. However, if you expand a volume later, the watermark does not scale with it. Percentages scale automatically but create excessive headroom on large disks. Most production clusters with volumes over 2TB should use percentages combined with explicit max_headroom values.
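
One way to avoid forgetting a max_headroom sibling is to generate the settings body from a single helper that refuses incomplete input. The sketch below only builds the JSON body for PUT _cluster/settings and does not send it. The helper name is hypothetical, and the headroom values simply reuse the defaults listed earlier as placeholders.

```python
import json

PREFIX = "cluster.routing.allocation.disk.watermark"
LEVELS = ("low", "high", "flood_stage")


def watermark_settings(percent: dict, max_headroom: dict, persistent: bool = True) -> dict:
    """Build a cluster settings body pairing every percentage with a max_headroom."""
    missing = [lvl for lvl in LEVELS if lvl not in percent or lvl not in max_headroom]
    if missing:
        raise ValueError(f"percentage and max_headroom required for: {missing}")
    body = {}
    for lvl in LEVELS:
        body[f"{PREFIX}.{lvl}"] = f"{percent[lvl]}%"
        body[f"{PREFIX}.{lvl}.max_headroom"] = max_headroom[lvl]
    return {"persistent" if persistent else "transient": body}


settings = watermark_settings(
    percent={"low": 85, "high": 90, "flood_stage": 95},
    max_headroom={"low": "200GB", "high": "150GB", "flood_stage": "100GB"},
)
print(json.dumps(settings, indent=2))
```

Keeping all six values in one request also makes the effective configuration easy to review in version control.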

Raising watermarks during an incident is valid temporary relief, but it requires two API calls. First, raise the thresholds:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

Then clear existing read-only blocks:

PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

Warning: Raising watermarks does not by itself remove blocks that are already in place. Automatic release happens only once usage is below the new high watermark and the next disk usage refresh has run. If you use transient settings for the override, remember that they do not survive a full cluster restart and that recent Elasticsearch documentation no longer recommends them. Persistent settings are safer for long-lived tuning, but during an incident a transient override avoids leaving a permanently unsafe configuration behind. Either way, lower the thresholds again after the incident.

Lowering watermarks below defaults increases safety but raises the risk of premature relocation storms. If you set the high watermark to 80%, a large merge that temporarily doubles segment size can push the node over the limit and trigger unnecessary rebalancing. Keep merge overhead in mind: Lucene needs temporary space for old and new segments during a merge, so maintain a buffer above your watermark.
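
A rough check of merge headroom can be sketched as follows. It assumes the worst case in which the merged output is as large as its inputs and the old segments are not deleted until the merge finishes. Both are simplifications, and the function names and figures are illustrative.

```python
def peak_used_percent(total_bytes: float, used_bytes: float, merging_bytes: float) -> float:
    """Peak usage while old and new segments coexist during a merge."""
    return 100.0 * (used_bytes + merging_bytes) / total_bytes


def merge_crosses_watermark(
    total_bytes: float,
    used_bytes: float,
    merging_bytes: float,
    high_percent: float = 90.0,
) -> bool:
    """True if the temporary merge overhead would reach the high watermark."""
    return peak_used_percent(total_bytes, used_bytes, merging_bytes) >= high_percent


GB = 1024 ** 3
total, used, merging = 2000 * GB, 1500 * GB, 200 * GB  # 75% used, 200GB being merged
print(f"peak during merge: {peak_used_percent(total, used, merging):.0f}%")
print("crosses 80% high watermark:", merge_crosses_watermark(total, used, merging, 80.0))
print("crosses 90% high watermark:", merge_crosses_watermark(total, used, merging, 90.0))
```

In this example the same merge is harmless under the default high watermark but triggers relocation if the threshold is lowered to 80%.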

If you still use multiple data paths, be aware that shard balancing across paths is not supported. A single filled path triggers the watermark for the entire node, even if other paths are empty. Elastic’s recommended migration is to use a spanned filesystem (LVM, RAID, or Storage Spaces) or run one node per physical disk.

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| _cat/allocation disk percent per node | Shows distance to watermark per node | Any node above 80%, or uneven usage across the cluster |
| _cluster/settings for watermark and max_headroom | Confirms effective thresholds after dynamic updates | Missing max_headroom after explicit percentage settings |
| relocating_shards count | High rate indicates watermark-driven rebalancing | Sustained nonzero relocation with no planned maintenance |
| index.blocks.read_only_allow_delete | Evidence of flood stage impact | Present on any write-active index |
| ILM execution status | Stuck ILM causes disk growth that hits watermarks | Indices stuck in delete or shrink phase |
| Merge activity and segment count | Merges temporarily increase disk usage | Sustained merge activity with disk above 85% |
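
The first signal is straightforward to check from a script. The sketch below parses rows shaped like the output of GET _cat/allocation?format=json. The sample rows are made up, and the field names (node, disk.percent) reflect that API as commonly returned, so verify them against your version before relying on this.

```python
import json

# Invented sample in the shape of _cat/allocation?format=json output.
# Values arrive as strings; the UNASSIGNED row carries no disk figures.
SAMPLE = json.loads("""
[
  {"node": "es-data-1", "shards": "120", "disk.percent": "62"},
  {"node": "es-data-2", "shards": "131", "disk.percent": "87"},
  {"node": "es-data-3", "shards": "118", "disk.percent": "79"},
  {"node": "UNASSIGNED", "shards": "3", "disk.percent": null}
]
""")


def nodes_above(rows, threshold_percent: int = 80):
    """Return (node, percent) pairs above the threshold, fullest first."""
    flagged = []
    for row in rows:
        pct = row.get("disk.percent")
        if pct is None:
            continue  # skip rows without disk data, such as UNASSIGNED
        if int(pct) > threshold_percent:
            flagged.append((row["node"], int(pct)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


for node, pct in nodes_above(SAMPLE):
    print(f"{node}: {pct}% used")
```

The same loop can compare the highest and lowest percentages to flag uneven usage across nodes.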

How Netdata helps

  • Per-node disk utilization charts show exactly which node is approaching a watermark, so you do not rely on cluster-wide averages.
  • Historical disk growth rates let you project time-to-watermark and catch ILM failures before they cause allocation blocks.
  • Correlating disk usage with shard relocation counts and indexing rate distinguishes normal growth from merge spikes or uneven shard distribution.
  • Alerts on disk utilization approaching watermarks reduce MTTR during write outages.