Elasticsearch disk I/O saturation: merges, fsync, and page-cache starvation
When Elasticsearch data nodes show climbing I/O wait and rising indexing and search latency while CPU stays low, the cluster can remain green even as throughput falls and thread pool queues grow. This pattern usually traces to one of three disk pressures: background segment merges writing data faster than storage can absorb, translog fsync overhead from durability guarantees, or OS page-cache eviction forcing searches to read from disk. This guide shows how to tell them apart, confirm the bottleneck with safe read-only checks, and relieve pressure.
What this means
When I/O wait is the primary bottleneck, threads that would otherwise be running are blocked on disk operations. Elasticsearch intensifies this through its write path: refreshes create new Lucene segments, flushes commit segments and roll the translog, and background merges rewrite segments to consolidate them and reclaim deletions. The read path assumes hot segment files live in the OS page cache. If merges, fsyncs, or cache misses saturate the disk, latency rises across both indexing and search even though cluster health stays green.
flowchart TD
A[High I/O wait] --> B{High writes?}
B -->|Yes| C[Merge storm or translog fsync]
B -->|No| D[Page cache misses]
C --> E[Check merges.current and segment count]
D --> F[Check OS page cache vs index size]
E --> G[Reduce refresh rate or use async durability]
F --> H[Isolate node or add memory]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Merge storm | High write throughput, segment count growing, merges.current persistently at max | _cat/nodes for merges.current and segments.count |
| Translog fsync pressure | High write operations, low bytes per operation, indexing latency spikes | _nodes/stats/indices/translog and index.translog.durability |
| Page-cache starvation | High reads, low CPU, dataset larger than RAM, elevated fetch latency | OS free -m and iostat read throughput |
| External I/O consumers | Backup agents or log shippers on data nodes | pidstat or iostat showing non-ES disk consumers |
Quick checks
Run these read-only checks to narrow the cause before making changes.
# Check OS I/O wait and per-disk throughput
iostat -xz 1 5
# Check ES-reported disk stats (Linux only)
curl -s 'http://localhost:9200/_nodes/stats/fs?filter_path=nodes.*.fs.io_stats'
# Check current merges and segment counts
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,merges.current,merges.current_size'
# Check translog size and durability setting
curl -s 'http://localhost:9200/_nodes/stats/indices/translog?filter_path=nodes.*.indices.translog'
curl -s 'http://localhost:9200/<index>/_settings?include_defaults=true&filter_path=*.settings.index.translog.durability,*.defaults.index.translog.durability'
# Check refresh and flush latency
curl -s 'http://localhost:9200/_nodes/stats/indices/refresh,flush?filter_path=nodes.*.indices.refresh,nodes.*.indices.flush'
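If `iostat` is not installed on a node, the I/O wait share can be approximated from two samples of the aggregate `cpu` line in `/proc/stat`. The following is a minimal sketch, assuming a Linux host and `awk`; the helper name `iowait_pct` is hypothetical, and it only reads counters.

```shell
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Field layout: cpu user nice system idle iowait irq softirq steal ...
# $1: first sample, $2: second sample (taken a few seconds later)
iowait_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Sum the deltas of user..steal (fields 2-9) as total elapsed CPU time
    for (i = 2; i <= 9; i++) total += y[i] - x[i]
    if (total <= 0) { print 0; exit }
    # Field 6 is iowait
    printf "%d\n", (y[6] - x[6]) * 100 / total
  }'
}

# Example usage (not run here):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

The result is a node-wide average, so it can hide a single saturated device; per-disk `await` from `iostat` remains the better signal when it is available.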
How to diagnose it
- Confirm the bottleneck is disk, not CPU or memory. Run `iostat -xz 1`. If `await` is far above the device baseline while user and system CPU remain low, the disk is saturated. On single-queue devices, sustained `%util` above 90 corroborates this; on NVMe, rely on `await` instead.
- Correlate with merge activity. Query `_cat/nodes?v&h=name,merges.current,segments.count`. If `merges.current` stays at the configured `max_thread_count` and segment count climbs, the merge scheduler cannot keep up.
- Check translog pressure. Query `_nodes/stats/indices/translog`. If `uncommitted_size_in_bytes` is large and growing, or if `index.translog.durability` is `request`, fsync overhead is likely dominating write IOPS.
- Evaluate page-cache effectiveness. Run `free -m`. If buffered and cached memory are small relative to total index size on disk, and `iostat` shows high read throughput during search, the working set does not fit in RAM.
- Look for external disk consumers. Run `pidstat -d 1` or inspect `/proc/diskstats` attribution. ES `fs.io_stats` aggregates I/O from all system processes, so sibling containers or backup agents inflate the same counters. Compare `pidstat` output with ES process disk activity to identify foreign consumers.
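The diagnosis steps above can be sketched as a small triage helper. This is an illustration only: the function name `triage_io`, the argument order, and the simple write-versus-read comparison are assumptions, not fixed rules, and external disk consumers still need to be ruled out separately with `pidstat`.

```shell
# Hypothetical triage helper mirroring the diagnosis steps.
# All inputs are integers you collect yourself (iostat, _cat/nodes, free -m, du):
#   $1 read KB/s            $2 write KB/s
#   $3 merges.current       $4 configured merge max_thread_count
#   $5 page cache in MB     $6 index size on disk in MB
triage_io() {
  if [ "$2" -gt "$1" ]; then
    # Write-dominated: either merges are pinned at the cap or fsyncs dominate
    if [ "$3" -ge "$4" ]; then
      echo "merge-storm"
    else
      echo "translog-fsync"
    fi
  elif [ "$5" -lt "$6" ]; then
    # Read-dominated and the cache cannot hold the index
    echo "page-cache-starvation"
  else
    echo "inconclusive"
  fi
}

# Example: triage_io 100 5000 4 4 8000 20000   prints "merge-storm"
```

Treat the output as a hint about where to look next rather than a verdict; a node can suffer from more than one pressure at once.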
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| OS I/O wait percentage | Primary saturation indicator | Sustained above 20% on HDD or 30% on SSD with rising latency |
| fs.io_stats counters | ES-reported disk I/O (Linux only) | Write operation rate growing faster than indexing rate |
| merges.current | Concurrent merge work | Persistently at max_thread_count with growing segment count |
| translog.uncommitted_size_in_bytes | Flush health and recovery window | Above 512 MB (default threshold) and growing |
| Segment count per shard | Merge backlog | Above 100 per shard on actively searched indices |
| OS page cache available | Read path efficiency | Buffered/cached memory smaller than the hot working set |
Fixes
Merge storms
- Increase `index.refresh_interval` temporarily on heavy-write indices. The default is `1s`; raising it to `30s` reduces segment creation rate. Each refresh creates a new searchable segment, and the default interval can create segments faster than the merge scheduler can consolidate them. This trades near-real-time visibility for lower merge pressure.
- Force-merge read-only indices to shrink the segment count: `POST /<index>/_forcemerge?max_num_segments=1`. Warning: this is I/O-intensive and will saturate disk while it runs. Do not force-merge indices that are still receiving writes.
- On spinning disks, set `index.merge.scheduler.max_thread_count: 1`. The default formula, `Math.max(1, Math.min(4, processors / 2))`, assumes SSDs. This setting is per-shard, so many shards on a node still create parallel merge work.
- Free disk space if the node is near the low watermark. Merges require temporary space for both old and new segments, and running out of headroom stalls them.
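To see what the default merge thread cap works out to on a given node, the formula above can be reproduced in shell. This is a minimal sketch: the helper name `default_merge_threads` is hypothetical, and using `nproc` as the processor count assumes `node.processors` has not been overridden. The commented `curl` call shows one way to apply the refresh interval change through the index settings API; it is not run by the block.

```shell
# Hypothetical helper: max(1, min(4, processors / 2)) with integer division.
# $1: processor count visible to the Elasticsearch node
default_merge_threads() {
  half=$(( $1 / 2 ))
  if [ "$half" -gt 4 ]; then half=4; fi
  if [ "$half" -lt 1 ]; then half=1; fi
  echo "$half"
}

# Example usage (not run here):
#   default_merge_threads "$(nproc)"
#
# Raising the refresh interval on one index (changes cluster state, so review
# before running; <index> is a placeholder):
#   curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
#     -H 'Content-Type: application/json' \
#     -d '{"index.refresh_interval": "30s"}'
```

On a 16-core node the helper prints 4, so with many shards per node the number of concurrent merges can still be high on a single disk.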
Translog fsync pressure
- Switch `index.translog.durability` from `request` to `async` only if losing up to the `sync_interval` window of data on crash is acceptable. The default `sync_interval` is `5s`. This batches fsyncs and sharply reduces write IOPS, but unsynced acknowledged writes are lost on a hard crash.
- Avoid setting `index.translog.flush_threshold_size` arbitrarily high. The default is `512 MB`. Values well above this delay flushes, extend recovery time, and increase translog disk usage. Monitor translog size after any change; it should stabilize below the flush threshold.
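The "stabilize below the flush threshold" check can be sketched as a simple comparison. The helper name `translog_status` is hypothetical. One caveat the sketch makes explicit: the flush threshold applies per shard, while `_nodes/stats` sums uncommitted bytes across every shard on the node, so per-shard values (for example from `<index>/_stats/translog?level=shards`) are the fairer input.

```shell
# Default index.translog.flush_threshold_size (512 MB) expressed in bytes
FLUSH_THRESHOLD_BYTES=$(( 512 * 1024 * 1024 ))

# Hypothetical helper.
# $1: uncommitted_size_in_bytes for one shard
# $2: optional threshold in bytes (defaults to 512 MB)
translog_status() {
  threshold=${2:-$FLUSH_THRESHOLD_BYTES}
  if [ "$1" -gt "$threshold" ]; then
    echo "above-threshold"
  else
    echo "ok"
  fi
}

# Switching one index to async durability (accepts data loss on crash, so
# review before running; <index> is a placeholder):
#   curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
#     -H 'Content-Type: application/json' \
#     -d '{"index.translog.durability": "async"}'
```

A single reading above the threshold is normal just before a flush; what matters is whether the value keeps growing across several samples.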
Page-cache starvation
- Allocate up to approximately 30 GB to the Elasticsearch heap, and leave the remainder for the OS page cache. Do not split system RAM 50/50 on nodes with more than 64 GB total memory. Elasticsearch relies heavily on the OS page cache for search; starvation manifests as high disk reads despite low indexing volume.
- Move backup agents, log shippers, and other memory-heavy processes off data nodes so they do not compete for page cache or disk I/O.
- If the dataset far exceeds RAM, add nodes or migrate older indices to warm or cold tiers rather than relying on cache residency.
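The heap guidance above can be turned into a quick sizing calculation. This is a sketch under stated assumptions: the helper name `heap_split` is hypothetical, and it combines the roughly 30 GB cap from this guide with the common rule of giving the heap no more than half of RAM on smaller nodes. The second number is only an upper bound on page cache, since the OS and off-heap JVM memory also take a share.

```shell
# Hypothetical helper.
# $1: total RAM in GB (integer)
# Prints "<heap_gb> <remaining_gb>", where heap = min(RAM / 2, 30).
heap_split() {
  heap=$(( $1 / 2 ))
  if [ "$heap" -gt 30 ]; then heap=30; fi
  echo "$heap $(( $1 - heap ))"
}

# Example usage:
#   heap_split 128    prints "30 98"
```

Comparing the second number with the size of the actively searched indices gives a rough idea of whether the hot working set can stay cached.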
Prevention
- Size storage for merge overhead. Temporary segment copies during large merges can require free space equal to the source segments. Plan headroom so merges do not trigger disk watermarks at the worst possible time.
- Monitor segment count growth as a leading indicator. Do not wait for search latency to spike; trending segment count per shard predicts merge backlog before it saturates disk.
- Use ILM to roll over, shrink, and delete indices. Preventing indefinite segment accumulation is more effective than tuning merge threads after the fact.
- Validate merge concurrency against hardware. The default thread cap assumes SSDs. If you run on spinning disks, set `max_thread_count` to 1 before saturation appears.
How Netdata helps
- Netdata collects OS-level disk metrics (I/O wait, throughput, queue depth) per device. This isolates real storage pressure from ES `fs.io_stats`, which conflates all system processes.
- The Elasticsearch collector surfaces `_nodes/stats`, so you can overlay `merges.current`, `indexing.index_time_in_millis`, and `search.query_time_in_millis` against disk saturation on the same charts.
- Page cache metrics show available memory, cache, and buffers alongside ES search latency, revealing cold-cache behavior after restarts or eviction.
- Alerts on sustained I/O wait per disk, translog growth, and segment count anomalies give early warning before write and search thread pool queues fill.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch authentication failures: audit logs, brute force, and credential drift
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch exposed without authentication: open clusters and snapshot exfiltration







