Elasticsearch disk I/O saturation: merges, fsync, and page-cache starvation

When Elasticsearch data nodes show climbing I/O wait and rising indexing and search latency while CPU is not saturated, the disk is usually the bottleneck. The cluster stays green, but throughput falls and thread pool queues grow. This pattern usually traces to one of three disk pressures: background segment merges rewriting data faster than storage can absorb, translog fsync overhead from durability guarantees, or OS page-cache eviction forcing searches to read from disk. This guide shows how to tell them apart, confirm the bottleneck with safe read-only checks, and relieve pressure.

What this means

When I/O wait is the primary bottleneck, threads that would otherwise be running are blocked on disk operations. Elasticsearch intensifies this through its write path: refreshes create new Lucene segments, flushes perform a Lucene commit and trim the translog, and background merges rewrite segments to consolidate them and reclaim deleted documents. The read path assumes hot segment files live in the OS page cache. If merges, fsyncs, or cache misses saturate the disk, latency rises across both indexing and search even though cluster health stays green.

flowchart TD
    A[High I/O wait] --> B{High writes?}
    B -->|Yes| C[Merge storm or translog fsync]
    B -->|No| D[Page cache misses]
    C --> E[Check merges.current and segment count]
    D --> F[Check OS page cache vs index size]
    E --> G[Reduce refresh rate or use async durability]
    F --> H[Isolate node or add memory]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Merge storm | High write throughput, segment count growing, merges.current persistently at max | _cat/nodes for merges.current and segments.count |
| Translog fsync pressure | High write operations, low bytes per operation, indexing latency spikes | _nodes/stats/indices/translog and index.translog.durability |
| Page-cache starvation | High reads, low CPU, dataset larger than RAM, elevated fetch latency | OS free -m and iostat read throughput |
| External I/O consumers | Backup agents or log shippers on data nodes | pidstat or iostat showing non-ES disk consumers |

Quick checks

Run these read-only checks to narrow the cause before making changes.

# Check OS I/O wait and per-disk throughput
iostat -xz 1 5

# Check ES-reported disk stats (Linux only)
curl -s 'http://localhost:9200/_nodes/stats/fs?filter_path=nodes.*.fs.io_stats'

# Check current merges and segment counts
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,merges.current,merges.current_size'

# Check translog size and durability setting
curl -s 'http://localhost:9200/_nodes/stats/indices/translog?filter_path=nodes.*.indices.translog'
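# include_defaults=true is needed because durability is omitted from the response unless it was set explicitly
curl -s 'http://localhost:9200/<index>/_settings?include_defaults=true&filter_path=*.*.index.translog.durability'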

# Check refresh and flush latency
curl -s 'http://localhost:9200/_nodes/stats/indices/refresh,flush?filter_path=nodes.*.indices.refresh,nodes.*.indices.flush'

How to diagnose it

  1. Confirm the bottleneck is disk, not CPU or memory. Run iostat -xz 1. If await is far above the device baseline while user and system CPU remain low, the disk is saturated. On single-queue devices, sustained %util above 90 corroborates this; on NVMe, rely on await instead.
  2. Correlate with merge activity. Query _cat/nodes?v&h=name,merges.current,segments.count. If merges.current stays at the configured max_thread_count and segment count climbs, the merge scheduler cannot keep up.
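  3. Check translog pressure. Query _nodes/stats/indices/translog. If uncommitted_size_in_bytes is large and growing, flushes are falling behind. If index.translog.durability is request (the default) and writes arrive as many small requests, each request triggers an fsync, so fsync overhead is likely dominating write IOPS.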
  4. Evaluate page-cache effectiveness. Run free -m. If buffered and cached memory are small relative to total index size on disk, and iostat shows high read throughput during search, the working set does not fit in RAM.
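  5. Look for external disk consumers. Run pidstat -d 1, or read per-process counters from /proc/<pid>/io; /proc/diskstats is device-level and cannot attribute I/O to a process. ES fs.io_stats reports device-level counters as well, so it includes I/O from every process on the device, and sibling containers or backup agents inflate the same numbers. Compare the pidstat figures for the Elasticsearch process with the device totals to identify foreign consumers.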
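
The steps above can be condensed into a small decision sketch. It is a rough illustration rather than a definitive classifier: the NodeSample fields, the classify function, and every cut-off (await at 3x baseline, CPU below 70%, page cache under half the index size) are assumptions made for this example and should be tuned to your hardware.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """Hand-collected readings for one data node (field names are hypothetical)."""
    await_ms: float           # iostat await for the data device
    baseline_await_ms: float  # await on the same device when healthy
    cpu_busy_pct: float       # user + system CPU
    merges_current: int       # from _cat/nodes
    max_thread_count: int     # configured merge scheduler thread cap
    segments_growing: bool    # segment count rising between samples
    translog_growing: bool    # uncommitted_size_in_bytes rising between samples
    page_cache_mb: int        # buff/cache from free -m
    index_size_mb: int        # total index size on disk
    read_heavy: bool          # iostat shows mostly read throughput


def classify(s: NodeSample) -> str:
    # Step 1: confirm the disk is the bottleneck (assumed cut-offs)
    if s.await_ms < 3 * s.baseline_await_ms or s.cpu_busy_pct > 70:
        return "not disk-bound"
    # Step 2: merges pinned at the thread cap while segments pile up
    if s.merges_current >= s.max_thread_count and s.segments_growing:
        return "merge storm"
    # Step 3: translog keeps growing
    if s.translog_growing:
        return "translog fsync pressure"
    # Step 4: read-heavy I/O with a page cache far smaller than the index
    if s.read_heavy and s.page_cache_mb < 0.5 * s.index_size_mb:
        return "page-cache starvation"
    # Step 5: nothing inside Elasticsearch explains the load
    return "check external I/O consumers"
```

The order of the checks mirrors the numbered steps, so a node matching several conditions is reported under the first one; in practice more than one pressure can be present at once.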

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| OS I/O wait percentage | Primary saturation indicator | Sustained above 20% on HDD or 30% on SSD with rising latency |
| fs.io_stats counters | ES-reported disk I/O (Linux only) | Write operation rate growing faster than indexing rate |
| merges.current | Concurrent merge work | Persistently at max_thread_count with growing segment count |
| translog.uncommitted_size_in_bytes | Flush health and recovery window | Above 512 MB (default threshold) and growing |
| Segment count per shard | Merge backlog | Above 100 per shard on actively searched indices |
| OS page cache available | Read path efficiency | Buffered/cached memory smaller than the hot working set |
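
The fs.io_stats counters are cumulative, so a warning sign such as "write operation rate growing" only becomes visible after differencing two samples. The sketch below shows one way to do that, assuming you have already extracted the fs.io_stats.total object from two _nodes/stats/fs responses taken a known number of seconds apart. The io_rates function name and the returned keys are hypothetical.

```python
def io_rates(prev: dict, curr: dict, interval_s: float) -> dict:
    """Turn two cumulative fs.io_stats.total samples into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    write_ops = curr["write_operations"] - prev["write_operations"]
    write_kb = curr["write_kilobytes"] - prev["write_kilobytes"]
    return {
        "write_ops_per_s": write_ops / interval_s,
        "write_kb_per_s": write_kb / interval_s,
        # Many operations with few kilobytes each is the fsync-style pattern
        "kb_per_write_op": write_kb / write_ops if write_ops else 0.0,
    }


# Example: two samples taken 60 seconds apart (values are made up)
before = {"write_operations": 1000, "write_kilobytes": 8000}
after = {"write_operations": 7000, "write_kilobytes": 32000}
print(io_rates(before, after, 60))
```

A high write_ops_per_s combined with a low kb_per_write_op matches the "high write operations, low bytes per operation" signature listed for translog fsync pressure, while large kilobytes per operation points more toward merges.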

Fixes

Merge storms

  • Increase index.refresh_interval temporarily on heavy-write indices. The default is 1s; raising it to 30s reduces segment creation rate. Each refresh creates a new searchable segment, and the default interval can create segments faster than the merge scheduler can consolidate them. This trades near-real-time visibility for lower merge pressure.
  • Force-merge read-only indices to shrink the segment count: POST /<index>/_forcemerge?max_num_segments=1. Warning: this is I/O-intensive and will saturate disk while it runs. Do not force-merge indices that are still receiving writes.
  • On spinning disks, set index.merge.scheduler.max_thread_count: 1. The default formula, Math.max(1, Math.min(4, processors / 2)), assumes SSDs. This setting is per-shard, so many shards on a node still create parallel merge work.
  • Free disk space if the node is near the low watermark. Merges require temporary space for both old and new segments, and running out of headroom stalls them.
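
As a minimal sketch of the first and third bullets, the snippet below renders the default thread formula in Python and builds a settings body that could be sent with PUT /<index>/_settings. The function names are hypothetical, and integer division is assumed for processors / 2. Confirm on your version that index.merge.scheduler.max_thread_count can be updated on a live index before relying on it.

```python
import json


def default_merge_threads(processors: int) -> int:
    """Python rendering of Math.max(1, Math.min(4, processors / 2))."""
    return max(1, min(4, processors // 2))


def merge_relief_settings(refresh_interval: str = "30s",
                          spinning_disk: bool = False) -> dict:
    """Build an index settings body that lowers merge pressure."""
    settings = {"index": {"refresh_interval": refresh_interval}}
    if spinning_disk:
        # One merge thread per shard on HDDs, as recommended above
        settings["index"]["merge"] = {"scheduler": {"max_thread_count": 1}}
    return settings


print(default_merge_threads(16))
print(json.dumps(merge_relief_settings(spinning_disk=True)))
```

Because the thread cap applies per shard, a node with 8 processors and 10 actively merging shards could in principle run up to 4 x 10 merge threads, which is why the cap matters more on shard-dense nodes.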

Translog fsync pressure

  • Switch index.translog.durability from request to async only if losing acknowledged writes from the last sync_interval (default 5s) on a hard crash is acceptable. Async mode fsyncs in the background on that interval instead of on every request, which sharply reduces write IOPS.
  • Avoid setting index.translog.flush_threshold_size arbitrarily high. The default is 512 MB. Values well above this delay flushes, extend recovery time, and increase translog disk usage. Monitor translog size after any change; it should stabilize below the flush threshold.
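
To make the durability trade-off concrete, the following sketch estimates how many acknowledged documents could be at risk under async durability and builds the corresponding settings body. It is an approximation under stated assumptions: the function names are hypothetical, the indexing rate is something you measure yourself, and the estimate ignores replicas, which may still hold the data if only one node crashes.

```python
def async_durability_exposure(docs_per_s: float,
                              sync_interval_s: float = 5.0) -> float:
    """Approximate acknowledged-but-unsynced documents at the moment of a hard crash."""
    return docs_per_s * sync_interval_s


def translog_durability_settings(durability: str = "async") -> dict:
    """Build an index settings body for index.translog.durability."""
    if durability not in ("request", "async"):
        raise ValueError("durability must be 'request' or 'async'")
    return {"index": {"translog": {"durability": durability}}}


# Example: 2,000 docs/s with the default 5s sync_interval
print(async_durability_exposure(2000))
print(translog_durability_settings())
```

If the estimated exposure is larger than what the upstream pipeline can replay, keeping request durability and reducing fsync count by sending larger bulk requests is the safer direction.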

Page-cache starvation

  • Cap the Elasticsearch heap at approximately 30 GB, and at no more than half of RAM on smaller nodes, leaving the remainder for the OS page cache. On nodes with more than 64 GB of total memory this means not splitting system RAM 50/50: the extra memory is more useful as page cache. Elasticsearch relies heavily on the OS page cache for search; starvation manifests as high disk reads despite low indexing volume.
  • Move backup agents, log shippers, and other memory-heavy processes off data nodes so they do not compete for page cache or disk I/O.
  • If the dataset far exceeds RAM, add nodes or migrate older indices to warm or cold tiers rather than relying on cache residency.
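
The sizing rule in the first bullet can be written as simple arithmetic. This sketch assumes a 30 GB heap cap and treats everything outside the heap as available page cache, which overstates the cache on nodes running other processes. The function names are hypothetical.

```python
def memory_split_gb(total_ram_gb: float, heap_cap_gb: float = 30.0) -> tuple:
    """Return (heap_gb, page_cache_gb): heap is half of RAM, capped at heap_cap_gb."""
    heap = min(total_ram_gb / 2, heap_cap_gb)
    return heap, total_ram_gb - heap


def cache_coverage(page_cache_gb: float, hot_index_gb: float) -> float:
    """Fraction of the hot working set that could stay resident in page cache."""
    if hot_index_gb <= 0:
        raise ValueError("hot_index_gb must be positive")
    return page_cache_gb / hot_index_gb


heap, cache = memory_split_gb(128)
print(heap, cache)
print(cache_coverage(cache, 400))
```

On a 128 GB node the split leaves 98 GB for page cache; against a 400 GB hot working set that covers roughly a quarter of the data, which is the situation where adding nodes or moving older indices to warm or cold tiers pays off.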

Prevention

  • Size storage for merge overhead. Temporary segment copies during large merges can require free space equal to the source segments. Plan headroom so merges do not trigger disk watermarks at the worst possible time.
  • Monitor segment count growth as a leading indicator. Do not wait for search latency to spike; trending segment count per shard predicts merge backlog before it saturates disk.
  • Use ILM to roll over, shrink, and delete indices. Preventing indefinite segment accumulation is more effective than tuning merge threads after the fact.
  • Validate merge concurrency against hardware. The default thread cap assumes SSDs. If you run on spinning disks, set max_thread_count to 1 before saturation appears.
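
The headroom rule in the first bullet can be checked ahead of time with a small calculation. The sketch below assumes the merged copy needs as much space as its source segments and uses 85% as the low watermark, which is the Elasticsearch default. The function name and parameters are hypothetical.

```python
def merge_headroom_ok(disk_total_gb: float, disk_used_gb: float,
                      merge_source_gb: float,
                      low_watermark_pct: float = 85.0) -> bool:
    """True if writing a merged copy keeps peak usage below the low watermark."""
    if disk_total_gb <= 0:
        raise ValueError("disk_total_gb must be positive")
    # Old and new segments coexist until the merge completes
    peak_used_gb = disk_used_gb + merge_source_gb
    return peak_used_gb / disk_total_gb * 100 < low_watermark_pct


# Example: 1 TB disk, 100 GB of segments about to be merged
print(merge_headroom_ok(1000, 700, 100))
print(merge_headroom_ok(1000, 800, 100))
```

Running this against the largest shard on each node gives a conservative view of whether a large merge or force-merge could push the node past the watermark.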

How Netdata helps

  • Netdata collects OS-level disk metrics (I/O wait, throughput, queue depth) per device. This isolates real storage pressure from ES fs.io_stats, which conflates all system processes.
  • The Elasticsearch collector surfaces _nodes/stats, so you can overlay merges.current, indexing.index_time_in_millis, and search.query_time_in_millis against disk saturation on the same charts.
  • Page cache metrics show available memory, cache, and buffers alongside ES search latency, revealing cold-cache behavior after restarts or eviction.
  • Alerts on sustained I/O wait per disk, translog growth, and segment count anomalies give early warning before write and search thread pool queues fill.