Elasticsearch merge storms: segment explosion, I/O saturation, and refresh tuning
Search latency climbs, indexing slows, and heap usage rises while cluster health stays green. The cause is often a merge storm. Background Lucene segment consolidation has fallen behind, leaving nodes with hundreds or thousands of small segments. Each extra segment adds search overhead, consumes file descriptors, and increases memory pressure. This guide covers how merge storms develop, how to confirm the diagnosis, and how to fix them without making things worse.
What this means
Documents accumulate in an in-memory buffer until refresh writes them to a new immutable Lucene segment. By default, Elasticsearch refreshes every second. A background merge scheduler combines segments to keep search efficient and reclaim space from deleted documents. When ingest outpaces merge capacity, segments accumulate. The scatter-gather read path must check each segment, so search slows. Segment metadata consumes heap, visible as segments.memory. File descriptors rise because each segment spans multiple files. Eventually merge threads run at full concurrency, disk I/O saturates, and indexing slows because it competes with merges for the same disks.
flowchart TD
A[High ingest rate] --> B[Default 1s refresh]
B --> C[Many small Lucene segments]
C --> D[Merge scheduler falls behind]
D --> E[Segment count grows]
E --> F[Search latency rises]
E --> G[Heap climbs from segment metadata]
E --> H[File descriptors increase]
D --> I[Disk I/O saturation]
I --> J[Indexing latency rises]
F --> K[User-facing slowdown]
G --> L[Circuit breaker pressure]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Aggressive refresh_interval on hot indices | Segment count climbs steadily during ingest; refresh time increases | GET /<index>/_settings for refresh_interval |
| Default merge threads on spinning disks | merges.current pinned at max; high I/O wait on HDD nodes | index.merge.scheduler.max_thread_count |
| Force merge or ILM merge saturating I/O | Sudden I/O spike coinciding with ILM window; background merges stall | Active force merge tasks and merge size |
| Disk watermark pressure blocking allocation | Disk above 85%; merges need temp space; relocations start | GET /_cat/allocation |
| Bulk load with refresh disabled and no follow-up force merge | Thousands of segments on old indices; heap pressure from metadata | Segment count on read-only time-series indices |
Quick checks
Run these safe, read-only commands to assess cluster state.
# Segment counts per index, sorted by highest primary segment count
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size,pri.segments.count&s=pri.segments.count:desc' | head -20
# Node-level segment memory, segment count, and current merges
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory,merges.current'
# Detailed merge statistics per node
curl -s 'http://localhost:9200/_nodes/stats/indices/merges?filter_path=nodes.*.indices.merges'
# Refresh and flush total time to spot I/O slowdown
curl -s 'http://localhost:9200/_nodes/stats/indices/refresh,flush?filter_path=nodes.*.indices.refresh,nodes.*.indices.flush'
# Write and search thread pool queues and rejections
curl -s 'http://localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
# JVM heap percent and segment memory per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,segments.memory'
# File descriptor usage per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,file_desc.current,file_desc.max,file_desc.percent'
# Disk usage and shard distribution
curl -s 'http://localhost:9200/_cat/allocation?v'
# Indexing latency requires two samples; capture totals to compute delta
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing.index_total,nodes.*.indices.indexing.index_time_in_millis'
# Check disk I/O wait and queue depth at the OS level
iostat -xz 1 5
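The indexing counters above are cumulative, so a single reading says little. A minimal sketch of the two-sample calculation follows, assuming you have already extracted `index_total` and `index_time_in_millis` for one node at two points in time. The function name and the sample values are hypothetical.

```shell
# Average indexing latency (ms per operation) between two samples.
# Arguments: index_total and index_time_in_millis from sample 1,
# then the same two counters from sample 2.
indexing_latency_ms() {
  awk -v c1="$1" -v t1="$2" -v c2="$3" -v t2="$4" 'BEGIN {
    ops = c2 - c1
    # No new operations (or a counter reset after restart): nothing to average.
    if (ops <= 0) { print "n/a"; exit }
    printf "%.2f\n", (t2 - t1) / ops
  }'
}

# Made-up counter values taken some time apart on the same node
indexing_latency_ms 1000000 400000 1060000 460000
```

Comparing this figure against a baseline captured during normal operation is what makes it useful; the absolute value depends heavily on document size and mappings.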
How to diagnose it
- Confirm segment explosion. Use `_cat/indices` and look for `pri.segments.count` above 100 per shard on active indices, or a monotonic rise over hours. Because `pri.segments.count` is a total across primaries, divide it by `pri` to get a per-shard figure. Time-series indices that are no longer written should have far fewer.
- Check merge concurrency. Use `_cat/nodes` or `_nodes/stats/indices/merges`. If `merges.current` is continuously at `max_thread_count` (default `max(1, min(4, processors/2))` on SSD), the scheduler is saturated.
- Correlate with refresh rate. Check `refresh_interval` via `_settings`. The default of `1s` is aggressive for high-throughput indexing. Also check whether `refresh.total_time_in_millis` is growing faster between samples than it does at baseline.
- Check I/O saturation. Use OS-level `iostat -xz` or `_nodes/stats/fs` (`fs.io_stats` on Linux). Sustained high wait percentage or queue depth indicates disk-bound merges.
- Measure heap impact. Check `segments.memory` in `_cat/nodes`. Growing segment metadata contributes to old-generation pressure and can push the node toward circuit breaker trips.
- Review file descriptors. High segment counts drive `file_desc.current` upward. Elasticsearch recommends a limit of at least 65,535. Approaching the limit causes cryptic I/O errors.
- Identify interfering operations. Check if a force merge or ILM action is running. Large `merges.current_size` values suggest a big merge is consuming I/O. Force merges block background merges on the same shard.
- Check disk headroom. Merges require temporary free space roughly equal to the size of the segments being merged. If nodes are above 80%, a large merge can push them past the 90% high watermark and trigger relocations.
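The first check in the list can be sketched as a small filter. It assumes `pri.segments.count` is the total across all primaries of an index, so it divides by the primary shard count before comparing against the threshold. The function name and the sample index names are hypothetical.

```shell
# Reads "index pri pri.segments.count" rows on stdin (no header) and prints
# indices whose average segments per primary shard exceed the threshold
# (first argument, default 100).
segments_per_shard() {
  awk -v limit="${1:-100}" '$2 > 0 {
    per_shard = $3 / $2
    if (per_shard > limit) printf "%s %.1f\n", $1, per_shard
  }'
}

# Sample rows standing in for the output of:
#   curl -s 'http://localhost:9200/_cat/indices?h=index,pri,pri.segments.count'
printf '%s\n' 'logs-hot 4 1200' 'logs-old 2 2' 'metrics 1 95' | segments_per_shard 100
```

Running the same filter at intervals and comparing the output gives a rough view of whether the count is rising monotonically.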
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| pri.segments.count | Each segment adds search and metadata overhead. | >100 per shard sustained on active indices. |
| merges.current | Indicates whether the scheduler is keeping up. | Continuously at max_thread_count. |
| segments.memory | Segment metadata lives in heap. | Growing trend or consuming >10% of heap. |
| refresh.total_time_in_millis | Slow refresh creates backlog and more segments. | Sustained average >1s or 3x baseline. |
| indexing.index_time_in_millis / index_total | Merge I/O competes with the write path. | Latency >2x baseline with stable ingest. |
| search.query_time_in_millis / query_total | Scatter-gather latency rises with segment count. | Sustained >5x baseline. |
| file_desc.percent | Thousands of segments exhaust file descriptors. | >80% of max_file_descriptors. |
| disk.used_percent | Merges need temporary space; watermark blocks allocation. | >85% or approaching high watermark. |
Fixes
Reduce refresh frequency on hot indices
Set index.refresh_interval to 30s or higher on indices receiving heavy writes. This reduces the rate of segment creation, giving the merge scheduler room to catch up. Tradeoff: documents become searchable less frequently. Do not change this on indices that require near-real-time visibility without confirming the business requirement.
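As a rough sketch, the change can be applied through the index settings API. The helper names, the `ES_URL` variable, and the index name `logs-hot` are placeholders for illustration; the apply step is left commented out so nothing is changed by accident.

```shell
# Build the settings body for a given refresh interval (for example 30s).
refresh_body() {
  printf '{"index":{"refresh_interval":"%s"}}\n' "$1"
}

# Apply the interval to one index. Arguments: index name, interval.
# ES_URL is an assumed environment variable with a localhost default.
set_refresh_interval() {
  curl -s -X PUT "${ES_URL:-http://localhost:9200}/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(refresh_body "$2")"
}

# Print the body that would be sent
refresh_body 30s
# set_refresh_interval logs-hot 30s   # uncomment to apply to a real index
```

Setting `refresh_interval` to `null` through the same endpoint should restore the default once the backlog has drained.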
Force merge read-only indices
For time-series indices that are no longer written, run:
# Consolidate segments on a read-only index
POST /<index>/_forcemerge?max_num_segments=1
This reduces segment count, lowers heap usage, and improves search performance. Warning: never force merge a live index receiving writes. The operation is resource-intensive and requires temporary free space roughly equal to the size of the segments being merged. Ensure adequate disk headroom before starting, and run during low-traffic windows.
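Before starting, it can help to estimate whether the temporary space fits under the high watermark. The sketch below assumes the worst case from the paragraph above, where the merge needs extra space equal to the full size of the segments being merged. The function names and figures are hypothetical, and all three sizes must use the same unit.

```shell
# Projected disk usage (%) if a merge temporarily needs extra space.
# Arguments: disk total, disk used, size of segments to merge (same unit,
# total must be greater than zero).
projected_disk_percent() {
  awk -v total="$1" -v used="$2" -v extra="$3" 'BEGIN {
    printf "%.1f\n", (used + extra) * 100 / total
  }'
}

# Succeeds when the projection stays below the high watermark
# (fourth argument, default 90).
merge_fits() {
  awk -v p="$(projected_disk_percent "$1" "$2" "$3")" -v hw="${4:-90}" \
    'BEGIN { exit !(p < hw) }'
}

# Made-up example: 1000 GB disk, 800 GB used, 120 GB index to merge
projected_disk_percent 1000 800 120
if merge_fits 1000 800 120; then echo "ok to merge"; else echo "not enough headroom"; fi
```

In this example the node would pass the 90% high watermark during the merge, so freeing space or merging a smaller index first would be the safer choice.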
Tune merge scheduler for storage type
On spinning disks, set index.merge.scheduler.max_thread_count: 1. The default max(1, min(4, processors/2)) is optimized for SSDs. On HDDs, higher concurrency causes random I/O thrashing that slows both merges and searches. Apply this via index templates so new indices inherit the setting.
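To see what the default works out to on a given node, the formula can be evaluated directly. This is a minimal sketch that assumes integer division for `processors/2`; the function name is hypothetical.

```shell
# Default merge thread count from the formula max(1, min(4, processors / 2)),
# assuming integer division.
default_merge_threads() {
  awk -v p="$1" 'BEGIN {
    n = int(p / 2)
    if (n > 4) n = 4
    if (n < 1) n = 1
    print n
  }'
}

# Show the default for a few processor counts
for cores in 1 2 6 16; do
  echo "$cores processors -> $(default_merge_threads "$cores") merge threads"
done
```

On an HDD-backed node with 8 or more processors this yields 4 concurrent merge threads, which is the case where lowering the setting to 1 matters most.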
Free disk space and clear blocks
If nodes are above the high watermark (90%), delete old indices to free space immediately. Merges will stall or fail if the disk cannot accommodate temporary segment copies. Reducing replica count is an emergency option, but it lowers availability and risks data loss if another node fails. If flood stage (95%) triggered index.blocks.read_only_allow_delete, remove the block after freeing space:
# Clear read-only blocks after freeing disk space
PUT /_all/_settings
{"index.blocks.read_only_allow_delete": null}
Throttle ingest temporarily
If the cluster is I/O saturated and you cannot add capacity immediately, reduce client-side bulk concurrency or increase batch sizes to lower the request rate. This is temporary pressure relief, not a long-term fix. It buys time for the merge backlog to drain.
Prevention
- Monitor segment count trends per node and per index. Do not wait for search latency to spike.
- Use ILM to force merge indices after rollover and before they transition to warm or cold tiers.
- Match `refresh_interval` to the business requirement. Hot logging indices rarely need 1s visibility; 30s is usually sufficient.
- Provision disk with merge headroom. A node at 80% can hit 90% during a large merge.
- Verify `index.merge.scheduler.max_thread_count` is appropriate for the storage medium.
- Keep file descriptor limits well above current usage. 65,535 is the recommended minimum.
How Netdata helps
Netdata correlates per-node disk I/O wait with indexing latency to highlight merge saturation. It tracks segments.memory alongside JVM heap usage to show metadata-driven heap pressure. Alerts on file descriptor percentage and disk watermark proximity fire before they become hard limits. Search and indexing latency appear on the same timeline as segment count growth, and thread pool queue depths and rejections are monitored to catch backlog before it cascades.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch authentication failures: audit logs, brute force, and credential drift
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch exposed without authentication: open clusters and snapshot exfiltration







