Elasticsearch disk I/O saturation: merges, fsync, and page-cache starvation

When Elasticsearch data nodes show climbing I/O wait and rising indexing and search latency while CPU is not the bottleneck, throughput falls and thread pool queues grow even though the cluster stays green. This pattern usually traces to one of three disk pressures: background segment merges rewriting data faster than storage can absorb it, translog fsync overhead from durability guarantees, or OS page-cache eviction forcing searches to read from disk. This guide shows how to tell them apart, confirm the bottleneck with safe read-only checks, and relieve pressure.

What this means

When I/O wait is the primary bottleneck, threads that would otherwise be running are blocked on disk operations. Elasticsearch intensifies this through its write path: refreshes create new Lucene segments, flushes perform a Lucene commit and trim the translog, and background merges rewrite segments to reclaim deletions. The read path assumes hot segment files live in the OS page cache. If merges, fsyncs, or cache misses saturate the disk, latency rises across both indexing and search even though cluster health stays green.

flowchart TD
    A[High I/O wait] --> B{High writes?}
    B -->|Yes| C[Merge storm or translog fsync]
    B -->|No| D[Page cache misses]
    C --> E[Check merges.current and segment count]
    D --> F[Check OS page cache vs index size]
    E --> G[Reduce refresh rate or use async durability]
    F --> H[Isolate node or add memory]

Common causes

Cause	What it looks like	First thing to check
Merge storm	High write throughput, segment count growing, `merges.current` persistently at max	`_cat/nodes` for `merges.current` and `segments.count`
Translog fsync pressure	High write operations, low bytes per operation, indexing latency spikes	`_nodes/stats/indices/translog` and `index.translog.durability`
Page-cache starvation	High reads, low CPU, dataset larger than RAM, elevated fetch latency	OS `free -m` and `iostat` read throughput
External I/O consumers	Backup agents or log shippers on data nodes	`pidstat` or `iostat` showing non-ES disk consumers

Quick checks

Run these read-only checks to narrow the cause before making changes.

# Check OS I/O wait and per-disk throughput
iostat -xz 1 5

# Check ES-reported disk stats (Linux only)
curl -s 'http://localhost:9200/_nodes/stats/fs?filter_path=nodes.*.fs.io_stats'

# Check current merges and segment counts
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,merges.current,merges.current_size'

# Check translog size and durability setting
curl -s 'http://localhost:9200/_nodes/stats/indices/translog?filter_path=nodes.*.indices.translog'
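# include_defaults=true is needed: an unset durability only appears under "defaults", not "settings"
curl -s 'http://localhost:9200/<index>/_settings?include_defaults=true&filter_path=*.*.index.translog.durability'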

# Check refresh and flush latency
curl -s 'http://localhost:9200/_nodes/stats/indices/refresh,flush?filter_path=nodes.*.indices.refresh,nodes.*.indices.flush'
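
To turn the `_cat/nodes` check into a quick filter, a small awk function can flag nodes whose concurrent merges sit at or above a limit you choose. This is a minimal sketch: the function name `flag_busy_mergers` and the limit are illustrative assumptions, not Elasticsearch defaults, and the input is read from stdin so saved output works as well as a live call.

```shell
# Flag nodes whose merges.current is at or above a limit.
# Input (stdin): _cat/nodes rows with columns name, segments.count, merges.current
# (request them with h=name,segments.count,merges.current and without ?v, so there is no header).
# The function name and the limit are illustrative, not Elasticsearch defaults.
flag_busy_mergers() {
  limit="$1"
  awk -v limit="$limit" '$3 + 0 >= limit + 0 { print $1 " segments=" $2 " merges=" $3 }'
}

# Example with saved output instead of a live cluster:
printf 'es-data-1 412 4\nes-data-2 96 0\n' | flag_busy_mergers 4
```

Against a live node, the same function could be fed from `curl -s 'http://localhost:9200/_cat/nodes?h=name,segments.count,merges.current'`. A single flagged sample means little; repeat the check over several minutes before concluding that merges are falling behind.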

How to diagnose it

Confirm the bottleneck is disk, not CPU or memory. Run iostat -xz 1. If r_await or w_await (a single await column on older sysstat versions) is far above the device baseline while user and system CPU remain low, the disk is saturated. On single-queue devices, sustained %util above 90 corroborates this; on NVMe, rely on await instead.
Correlate with merge activity. Query _cat/nodes?v&h=name,merges.current,segments.count. If merges.current stays pinned at or above the configured max_thread_count (the limit applies per shard, while the counter is summed per node) and segment count climbs, the merge scheduler cannot keep up.
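Check translog pressure. Query _nodes/stats/indices/translog. If uncommitted_size_in_bytes is large and growing, flushes are falling behind. If index.translog.durability is request (the default), the translog is fsynced before each index, bulk, delete, or update request is acknowledged, so a high rate of small requests turns directly into fsync-bound write IOPS.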
Evaluate page-cache effectiveness. Run free -m. If buffered and cached memory are small relative to total index size on disk, and iostat shows high read throughput during search, the working set does not fit in RAM.
Look for external disk consumers. Run pidstat -d 1 or read /proc/<pid>/io for per-process counters; /proc/diskstats is per device and cannot attribute I/O to a process. ES fs.io_stats aggregates I/O from all system processes, so sibling containers or backup agents inflate the same counters. Compare pidstat output with ES process disk activity to identify foreign consumers.
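
The "high write operations, low bytes per operation" signature of fsync pressure can be checked at the device level by comparing two samples of /proc/diskstats. The sketch below assumes the standard Linux column layout (device name in column 3, writes completed in column 8, sectors written in column 10, with 512-byte sectors); the function name `avg_write_bytes` is hypothetical. Like fs.io_stats, these counters cover every process writing to the device.

```shell
# Average bytes per completed write for one device between two /proc/diskstats samples.
# Args: device name, first snapshot file, second snapshot file.
# Columns used: $3 device, $8 writes completed, $10 sectors written (512 bytes each).
avg_write_bytes() {
  dev="$1"
  awk -v dev="$dev" '
    $3 == dev && !seen { w0 = $8; s0 = $10; seen = 1; next }  # first snapshot
    $3 == dev          { w1 = $8; s1 = $10 }                  # second snapshot
    END {
      dw = w1 - w0
      if (dw <= 0) { print "no writes"; exit }
      printf "%d\n", (s1 - s0) * 512 / dw
    }' "$2" "$3"
}

# Typical use (device name and paths are examples):
#   cat /proc/diskstats > /tmp/ds1; sleep 10; cat /proc/diskstats > /tmp/ds2
#   avg_write_bytes sda /tmp/ds1 /tmp/ds2
```

An average of a few kilobytes per write at a high operation rate is consistent with many small synced writes, while merge-dominated workloads tend to show much larger writes. Treat the number as a hint to correlate with the translog and merge checks, not as proof on its own.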

Metrics and signals to monitor

Signal	Why it matters	Warning sign
OS I/O wait percentage	Primary saturation indicator	Sustained above 20% on HDD or 30% on SSD with rising latency
`fs.io_stats` counters	ES-reported disk I/O (Linux only)	Write operation rate growing faster than indexing rate
`merges.current`	Concurrent merge work	Persistently at `max_thread_count` with growing segment count
`translog.uncommitted_size_in_bytes`	Flush health and recovery window	Above 512 MB (default threshold) and growing
Segment count per shard	Merge backlog	Above 100 per shard on actively searched indices
OS page cache available	Read path efficiency	Buffered/cached memory smaller than the hot working set
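
For the page-cache signal, a rough ratio of cached memory to on-disk index size is often enough to see whether the working set can plausibly fit. The sketch below sums the Cached and Buffers lines of /proc/meminfo (both in kB) and divides by an index size you supply. The function name `cache_to_index_pct` is an assumption, and Cached also counts non-Elasticsearch files, so read the result as an upper bound.

```shell
# Page cache (Cached + Buffers, kB) as a percentage of on-disk index size.
# Args: index size in kB. Stdin: contents of /proc/meminfo.
cache_to_index_pct() {
  index_kb="$1"
  awk -v idx="$index_kb" '
    $1 == "Cached:" || $1 == "Buffers:" { cache += $2 }   # exact match skips SwapCached
    END { if (idx > 0) printf "%d\n", cache * 100 / idx }'
}

# Typical use (the data path is an example; adjust to your path.data):
#   cache_to_index_pct "$(du -sk /var/lib/elasticsearch | cut -f1)" < /proc/meminfo
```

A low percentage is only a problem when searches actually touch most of the data. If queries hit a small hot subset, compare against the size of that subset instead of the whole data directory.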

Fixes

Merge storms

Increase index.refresh_interval temporarily on heavy-write indices. The default is 1s; raising it to 30s reduces segment creation rate. Each refresh creates a new searchable segment, and the default interval can create segments faster than the merge scheduler can consolidate them. This trades near-real-time visibility for lower merge pressure.
Force-merge read-only indices to shrink the segment count: POST /<index>/_forcemerge?max_num_segments=1. Warning: this is I/O-intensive and will saturate disk while it runs. Do not force-merge indices that are still receiving writes.
On spinning disks, set index.merge.scheduler.max_thread_count: 1. The default formula, Math.max(1, Math.min(4, processors / 2)), assumes SSDs. This setting is per-shard, so many shards on a node still create parallel merge work.
Free disk space if the node is near the low watermark. Merges require temporary space for both old and new segments, and running out of headroom stalls them.
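
The refresh interval and merge thread changes above are both dynamic index settings, applied with PUT /<index>/_settings. As a sketch, the helper below only builds the request body so it can be reviewed before anything is sent; the name `index_setting_body` is hypothetical and the curl call is shown as a comment.

```shell
# Print the JSON body for a dynamic index-settings update.
# $1: setting name under "index"; $2: a JSON literal (quote strings yourself).
# Hypothetical helper: it only builds text and sends nothing.
index_setting_body() {
  printf '{"index":{"%s":%s}}\n' "$1" "$2"
}

index_setting_body refresh_interval '"30s"'
index_setting_body merge.scheduler.max_thread_count 1

# To apply one of these, send the body to the index, for example:
#   curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
#     -H 'Content-Type: application/json' \
#     -d "$(index_setting_body refresh_interval '"30s"')"
```

Passing `null` as the value, as in `index_setting_body refresh_interval null`, produces a body that resets the setting to its default once the heavy-write window is over.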

Translog fsync pressure

Switch index.translog.durability from request to async only if losing up to the sync_interval window of data on crash is acceptable. The default sync_interval is 5s. This batches fsyncs and sharply reduces write IOPS, but unsynced acknowledged writes are lost on a hard crash.
Avoid setting index.translog.flush_threshold_size arbitrarily high. The long-standing default is 512 MB, but it can differ by version, so confirm the effective value with include_defaults=true on the index settings. Values well above the default delay flushes, extend recovery time, and increase translog disk usage. Monitor translog size after any change; it should stabilize below the flush threshold.
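
Before switching to async durability, it helps to put a number on the loss window. The sketch below multiplies a sustained indexing rate by the sync interval to estimate how many acknowledged documents could be unsynced on one node at the moment of a hard crash. The function name `docs_at_risk` and the example rate are assumptions, and the estimate ignores replicas on other nodes that may still hold the data.

```shell
# Rough estimate of acknowledged-but-unsynced documents on a hard crash
# when index.translog.durability is async.
# $1: sustained indexing rate in docs/s (your own measurement)
# $2: index.translog.sync_interval in whole seconds
docs_at_risk() {
  echo $(( $1 * $2 ))
}

docs_at_risk 2000 5   # 2000 docs/s with the 5s default interval

# If that exposure is acceptable, durability is a dynamic index setting:
#   curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
#     -H 'Content-Type: application/json' \
#     -d '{"index":{"translog.durability":"async"}}'
```

Sending `"translog.durability":"request"` in the same way restores per-request fsync.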

Page-cache starvation

Keep the Elasticsearch heap at or below approximately 30 GB and leave the remainder for the OS page cache. On nodes with more than 64 GB of total memory, do not split RAM 50/50: a heap above roughly 32 GB loses compressed object pointers and takes memory away from the cache. Elasticsearch relies heavily on the OS page cache for search; starvation manifests as high disk reads despite low indexing volume.
Move backup agents, log shippers, and other memory-heavy processes off data nodes so they do not compete for page cache or disk I/O.
If the dataset far exceeds RAM, add nodes or migrate older indices to warm or cold tiers rather than relying on cache residency.
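
The heap guidance above reduces to simple arithmetic. The sketch below assumes the commonly cited rule of giving the heap half of RAM on smaller nodes (an assumption, not stated in this guide) and applies the 30 GB cap from the text. The function name `split_memory_gb` is hypothetical, and the remainder it prints also has to cover the OS and any other processes.

```shell
# Suggest a heap/page-cache split in whole GB: half of RAM for heap, capped at 30 GB.
# $1: total node RAM in GB.
split_memory_gb() {
  total="$1"
  heap=$(( total / 2 ))
  if [ "$heap" -gt 30 ]; then
    heap=30
  fi
  echo "heap=${heap}g remainder=$(( total - heap ))g"
}

split_memory_gb 32
split_memory_gb 128
```

On a 128 GB node the cap leaves 98 GB outside the heap, which is where the benefit of large-memory data nodes comes from.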

Prevention

Size storage for merge overhead. Temporary segment copies during large merges can require free space equal to the source segments. Plan headroom so merges do not trigger disk watermarks at the worst possible time.
Monitor segment count growth as a leading indicator. Do not wait for search latency to spike; trending segment count per shard predicts merge backlog before it saturates disk.
Use ILM to roll over, shrink, and delete indices. Preventing indefinite segment accumulation is more effective than tuning merge threads after the fact.
Validate merge concurrency against hardware. The default thread cap assumes SSDs. If you run on spinning disks, set max_thread_count to 1 before saturation appears.
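
To see what the default formula quoted in the merge-storm fixes yields on a given node, it can be evaluated directly. This sketch mirrors Math.max(1, Math.min(4, processors / 2)) with integer division; the function name `default_merge_threads` is hypothetical and it assumes a whole-number processor count.

```shell
# Evaluate Math.max(1, Math.min(4, processors / 2)) for a whole-number processor count.
# $1: number of processors available to the node.
default_merge_threads() {
  n=$(( $1 / 2 ))        # integer division, as in the Java expression
  if [ "$n" -gt 4 ]; then n=4; fi   # Math.min(4, ...)
  if [ "$n" -lt 1 ]; then n=1; fi   # Math.max(1, ...)
  echo "$n"
}

default_merge_threads 16
```

Any node with eight or more processors reaches the cap of 4 merge threads per shard, which is the case where spinning disks are most likely to need an explicit max_thread_count of 1.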

How Netdata helps

Netdata collects OS-level disk metrics (I/O wait, throughput, queue depth) per device. This isolates real storage pressure from ES fs.io_stats, which conflates all system processes.
The Elasticsearch collector surfaces _nodes/stats, so you can overlay merges.current, indexing.index_time_in_millis, and search.query_time_in_millis against disk saturation on the same charts.
Page cache metrics show available memory, cache, and buffers alongside ES search latency, revealing cold-cache behavior after restarts or eviction.
Alerts on sustained I/O wait per disk, translog growth, and segment count anomalies give early warning before write and search thread pool queues fill.

