Elasticsearch merge storms: segment explosion, I/O saturation, and refresh tuning

Search latency climbs, indexing slows, and heap usage rises while cluster health stays green. The cause is often a merge storm. Background Lucene segment consolidation has fallen behind, leaving nodes with hundreds or thousands of small segments. Each extra segment adds search overhead, consumes file descriptors, and increases memory pressure. This guide covers how merge storms develop, how to confirm the diagnosis, and how to fix them without making things worse.

What this means

Documents accumulate in an in-memory buffer until a refresh writes them to a new immutable Lucene segment. By default, Elasticsearch refreshes every second on indices that have received a search request in the last 30 seconds. A background merge scheduler combines segments to keep search efficient and to reclaim space from deleted documents. When ingest outpaces merge capacity, segments accumulate. Every search on a shard must visit each of its segments, so search slows. Segment metadata consumes heap, visible as segments.memory. File descriptor usage rises because each segment spans multiple files. Eventually merge threads run at full concurrency, disk I/O saturates, and indexing slows because it competes with merges for the same disks.
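
To make the effect of the refresh interval concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which every refresh on a continuously written shard produces exactly one new segment and no merges run in between; real shards can produce more than one segment per refresh, so treat the numbers as an order-of-magnitude guide. The function name is illustrative.

```python
def segments_created_per_hour(refresh_interval_s: float) -> float:
    """Estimate new segments per shard per hour under continuous ingest.

    Simplifying assumption: one new segment per refresh, no merging.
    """
    if refresh_interval_s <= 0:
        raise ValueError("refresh_interval_s must be positive")
    return 3600.0 / refresh_interval_s


for interval in (1, 30):
    rate = segments_created_per_hour(interval)
    print(f"{interval}s refresh: roughly {rate:.0f} new segments per shard per hour")
```

Under this model, moving from a 1s to a 30s interval cuts the number of segments the merge scheduler has to absorb by a factor of 30.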

flowchart TD
    A[High ingest rate] --> B[Default 1s refresh]
    B --> C[Many small Lucene segments]
    C --> D[Merge scheduler falls behind]
    D --> E[Segment count grows]
    E --> F[Search latency rises]
    E --> G[Heap climbs from segment metadata]
    E --> H[File descriptors increase]
    D --> I[Disk I/O saturation]
    I --> J[Indexing latency rises]
    F --> K[User-facing slowdown]
    G --> L[Circuit breaker pressure]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Aggressive refresh_interval on hot indices | Segment count climbs steadily during ingest; refresh time increases | GET /<index>/_settings for refresh_interval |
| Default merge threads on spinning disks | merges.current pinned at max; high I/O wait on HDD nodes | index.merge.scheduler.max_thread_count |
| Force merge or ILM merge saturating I/O | Sudden I/O spike coinciding with ILM window; background merges stall | Active force merge tasks and merge size |
| Disk watermark pressure blocking allocation | Disk above 85%; merges need temp space; relocations start | GET /_cat/allocation |
| Bulk load with refresh disabled and no follow-up force merge | Thousands of segments on old indices; heap pressure from metadata | Segment count on read-only time-series indices |

Quick checks

Run these safe, read-only commands to assess cluster state.

# Segment counts per index, sorted by highest primary segment count
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size,pri.segments.count&s=pri.segments.count:desc' | head -20
# Node-level segment memory, segment count, and current merges
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory,merges.current'
# Detailed merge statistics per node
curl -s 'http://localhost:9200/_nodes/stats/indices/merge?filter_path=nodes.*.indices.merges'
# Refresh and flush total time to spot I/O slowdown
curl -s 'http://localhost:9200/_nodes/stats/indices/refresh,flush?filter_path=nodes.*.indices.refresh,nodes.*.indices.flush'
# Write and search thread pool queues and rejections
curl -s 'http://localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
# JVM heap percent and segment memory per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,segments.memory'
# File descriptor usage per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,file_desc.current,file_desc.max,file_desc.percent'
# Disk usage and shard distribution
curl -s 'http://localhost:9200/_cat/allocation?v'
# Indexing latency requires two samples; capture totals to compute delta
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing.index_total,nodes.*.indices.indexing.index_time_in_millis'
# Check disk I/O wait and queue depth at the OS level
iostat -xz 1 5
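
The indexing counters above are cumulative since node start, so a single reading says little about current latency. The sketch below shows the two-sample delta calculation. The sample values are invented for illustration, and the function name is hypothetical; only the field names index_total and index_time_in_millis come from the stats response.

```python
from typing import Optional


def avg_indexing_latency_ms(before: dict, after: dict) -> Optional[float]:
    """Average time per indexing operation between two stats samples.

    Returns None when no documents were indexed between the samples.
    """
    ops = after["index_total"] - before["index_total"]
    millis = after["index_time_in_millis"] - before["index_time_in_millis"]
    if ops <= 0:
        return None
    return millis / ops


# Invented example values for one node, taken some time apart.
sample_t0 = {"index_total": 1_000_000, "index_time_in_millis": 400_000}
sample_t1 = {"index_total": 1_060_000, "index_time_in_millis": 436_000}

print(avg_indexing_latency_ms(sample_t0, sample_t1))
```

The same delta approach applies to the refresh and search counters: divide the change in total time by the change in operation count.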

How to diagnose it

  1. Confirm segment explosion. Use _cat/indices and divide pri.segments.count by pri, because the column is the total across all primary shards of the index. Look for more than 100 segments per shard on active indices, or a monotonic rise over hours. Time-series indices that are no longer written should have far fewer.
  2. Check merge concurrency. Use _cat/nodes or _nodes/stats/indices/merge. If merges.current is continuously at index.merge.scheduler.max_thread_count (default max(1, min(4, processors/2)), which is tuned for SSDs), the scheduler is saturated.
  3. Correlate with refresh rate. Check refresh_interval via _settings. The default of 1s is aggressive for high-throughput indexing. Also check whether refresh.total_time_in_millis is growing.
  4. Check I/O saturation. Use OS-level iostat -xz or _nodes/stats/fs (fs.io_stats on Linux). Sustained high wait percentage or queue depth indicates disk-bound merges.
  5. Measure heap impact. Check segments.memory in _cat/nodes. Growing segment metadata contributes to old-generation pressure and can push the node toward circuit breaker trips.
  6. Review file descriptors. High segment counts drive file_desc.current upward. ES recommends a minimum of 65,536. Hitting the limit causes I/O failures such as "Too many open files" errors.
  7. Identify interfering operations. Check if a force merge or ILM action is running. Large merges.current_size values suggest a big merge is consuming I/O. Force merges block background merges on the same shard.
  8. Check disk headroom. Merges require temporary free space roughly equal to the size of the segments being merged. If nodes are above 80%, a large merge can push them past the 90% high watermark and trigger relocations.
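
Steps 1 and 2 above can be sketched as a few small helpers. This is a minimal illustration, not an official tool: the function names are hypothetical, and the integer division mirrors the max(1, min(4, processors/2)) default quoted in step 2. Because the thread limit is an index-level setting applied per shard, comparing it with a node-level merges.current is only a rough signal.

```python
def default_merge_threads(processors: int) -> int:
    """Default merge thread count as quoted above: max(1, min(4, processors/2))."""
    return max(1, min(4, processors // 2))


def segments_per_primary(pri_segments_count: int, primaries: int) -> float:
    """Average segments per primary shard from the _cat/indices columns."""
    return pri_segments_count / primaries


def merges_saturated(merges_current_samples: list, max_threads: int) -> bool:
    """True when every sampled merges.current value is at or above the limit."""
    return bool(merges_current_samples) and all(
        sample >= max_threads for sample in merges_current_samples
    )


limit = default_merge_threads(processors=16)
print("default merge threads:", limit)
print("segments per shard:", segments_per_primary(pri_segments_count=840, primaries=6))
print("saturated:", merges_saturated([4, 4, 5, 4], limit))
```

Sampling merges.current several times over a few minutes, rather than once, avoids flagging a short burst of merges as saturation.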

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| pri.segments.count | Each segment adds search and metadata overhead. | >100 per shard sustained on active indices. |
| merges.current | Indicates whether the scheduler is keeping up. | Continuously at max_thread_count. |
| segments.memory | Segment metadata lives in heap. | Growing trend or consuming >10% of heap. |
| refresh.total_time_in_millis | Slow refresh creates backlog and more segments. | Sustained average >1s or 3x baseline. |
| indexing.index_time_in_millis / index_total | Merge I/O competes with the write path. | Latency >2x baseline with stable ingest. |
| search.query_time_in_millis / query_total | Scatter-gather latency rises with segment count. | Sustained >5x baseline. |
| file_desc.percent | Thousands of segments exhaust file descriptors. | >80% of max_file_descriptors. |
| disk.used_percent | Merges need temporary space; watermark blocks allocation. | >85% or approaching high watermark. |
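
As a rough illustration of how the baseline-relative and percentage thresholds above might be evaluated together, the sketch below compares a current reading against a recorded baseline. The dictionary keys and sample values are hypothetical; the multipliers and percentages are the warning signs listed in the table.

```python
# Warning multipliers relative to baseline, taken from the table above.
RATIO_LIMITS = {
    "indexing_latency_ms": 2.0,
    "search_latency_ms": 5.0,
    "refresh_latency_ms": 3.0,
}
# Absolute percentage thresholds, taken from the table above.
PERCENT_LIMITS = {
    "file_desc_percent": 80.0,
    "disk_used_percent": 85.0,
}


def warning_signs(current: dict, baseline: dict) -> list:
    """Return the names of signals that exceed their warning threshold."""
    flagged = []
    for name, multiplier in RATIO_LIMITS.items():
        if baseline[name] > 0 and current[name] > multiplier * baseline[name]:
            flagged.append(name)
    for name, limit in PERCENT_LIMITS.items():
        if current[name] > limit:
            flagged.append(name)
    return flagged


# Invented readings for illustration only.
baseline = {"indexing_latency_ms": 0.6, "search_latency_ms": 10.0, "refresh_latency_ms": 250.0}
current = {
    "indexing_latency_ms": 1.5,
    "search_latency_ms": 40.0,
    "refresh_latency_ms": 900.0,
    "file_desc_percent": 41.0,
    "disk_used_percent": 87.0,
}
flagged = warning_signs(current, baseline)
print(flagged)
```

A single flagged signal is weak evidence. Several flagged together, alongside a rising segment count, point more strongly toward a merge backlog.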

Fixes

Reduce refresh frequency on hot indices

Set index.refresh_interval to 30s or higher on indices receiving heavy writes. This reduces the rate of segment creation, giving the merge scheduler room to catch up. Tradeoff: newly indexed documents can take up to the full interval to become searchable. Do not change this on indices that require near-real-time visibility without confirming the business requirement.
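
A minimal sketch of the settings update follows, using only the Python standard library. It builds the request without sending it so the body can be reviewed first. The base URL matches the commands used elsewhere in this guide, while the index name logs-app-000001 and the helper name are placeholders.

```python
import json
import urllib.request


def refresh_interval_request(base_url: str, index: str, interval: str) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# "logs-app-000001" is a hypothetical index name.
req = refresh_interval_request("http://localhost:9200", "logs-app-000001", "30s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply the change, pass the request to urllib.request.urlopen(req).
```

Setting the value back to null in the same request body restores the index default.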

Force merge read-only indices

For time-series indices that are no longer written, run:

# Consolidate segments on a read-only index
POST /<index>/_forcemerge?max_num_segments=1

This reduces segment count, lowers heap usage, and improves search performance. Warning: never force merge a live index receiving writes. The operation is resource-intensive and requires temporary free space roughly equal to the size of the segments being merged. Ensure adequate disk headroom before starting, and run during low-traffic windows.
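
The disk headroom check can be sketched as simple arithmetic. This assumes, as stated above, that the merge temporarily needs about as much extra space as the segments being merged, and it uses the 90% high watermark mentioned in this guide. The function names and the example sizes are illustrative.

```python
HIGH_WATERMARK_PCT = 90.0  # high disk watermark referenced in this guide
GIB = 1024 ** 3


def projected_disk_pct(total_bytes: int, used_bytes: int, merge_input_bytes: int) -> float:
    """Peak disk usage percentage if the merge needs a full temporary copy."""
    return 100.0 * (used_bytes + merge_input_bytes) / total_bytes


def safe_to_force_merge(total_bytes: int, used_bytes: int, merge_input_bytes: int) -> bool:
    """True when the projected peak stays below the high watermark."""
    return projected_disk_pct(total_bytes, used_bytes, merge_input_bytes) < HIGH_WATERMARK_PCT


# Invented example: a 1000 GiB data disk that is 80% full.
total, used = 1000 * GIB, 800 * GIB
print(safe_to_force_merge(total, used, merge_input_bytes=50 * GIB))   # peak 85%
print(safe_to_force_merge(total, used, merge_input_bytes=120 * GIB))  # peak 92%
```

When several shards of the index live on the same node, use the combined size of those shards as the merge input.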

Tune merge scheduler for storage type

On spinning disks, set index.merge.scheduler.max_thread_count: 1. The default max(1, min(4, processors/2)) is optimized for SSDs. On HDDs, higher concurrency causes random I/O thrashing that slows both merges and searches. Apply this via index templates so new indices inherit the setting.
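
One way to express this as a composable index template body is sketched below. The template shape (index_patterns, priority, template.settings) follows the composable template API; the pattern logs-hdd-*, the priority value, and the helper name are assumptions chosen for illustration.

```python
import json


def merge_thread_template(index_patterns: list, max_thread_count: int, priority: int = 100) -> dict:
    """Body for PUT /_index_template/<name> that pins the merge thread count."""
    return {
        "index_patterns": list(index_patterns),
        "priority": priority,
        "template": {
            "settings": {
                "index.merge.scheduler.max_thread_count": max_thread_count
            }
        },
    }


# "logs-hdd-*" is a hypothetical pattern for indices stored on spinning disks.
template_body = merge_thread_template(["logs-hdd-*"], max_thread_count=1)
print(json.dumps(template_body, indent=2))
```

Templates only affect indices created after the template exists, so existing indices on HDD nodes still need the setting applied directly.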

Free disk space and clear blocks

If nodes are above the high watermark (90%), delete old indices to free space immediately. Merges will stall or fail if the disk cannot accommodate temporary segment copies. Reducing replica count is an emergency option, but it lowers availability and risks data loss if another node fails. If flood stage (95%) triggered index.blocks.read_only_allow_delete, remove the block after freeing space:

# Clear read-only blocks after freeing disk space
PUT /_all/_settings
{"index.blocks.read_only_allow_delete": null}

Throttle ingest temporarily

If the cluster is I/O saturated and you cannot add capacity immediately, reduce client-side bulk concurrency or increase batch sizes to lower the request rate. This is temporary pressure relief that buys time for the merge backlog to drain, not a long-term fix.

Prevention

  • Monitor segment count trends per node and per index. Do not wait for search latency to spike.
  • Use ILM to force merge indices after rollover and before they transition to warm or cold tiers.
  • Match refresh_interval to the business requirement. Hot logging indices rarely need 1s visibility; 30s is usually sufficient.
  • Provision disk with merge headroom. A node at 80% can hit 90% during a large merge.
  • Verify index.merge.scheduler.max_thread_count is appropriate for the storage medium.
  • Keep file descriptor limits well above current usage. 65,536 is the recommended minimum.
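
The ILM recommendation above could look roughly like the policy body below, which force merges in the hot phase once an index has rolled over. This is a sketch under stated assumptions: the rollover thresholds (1d, 50gb) are placeholder values rather than recommendations, and the helper name is hypothetical.

```python
import json


def forcemerge_after_rollover_policy(max_age: str, max_primary_shard_size: str) -> dict:
    """Body for PUT /_ilm/policy/<name>: roll over, then force merge to one segment."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Placeholder rollover conditions; tune for your data.
                        "rollover": {
                            "max_age": max_age,
                            "max_primary_shard_size": max_primary_shard_size,
                        },
                        # Runs only after the index has rolled over.
                        "forcemerge": {"max_num_segments": 1},
                    }
                }
            }
        }
    }


policy_body = forcemerge_after_rollover_policy("1d", "50gb")
print(json.dumps(policy_body, indent=2))
```

Because the force merge runs on the hot nodes in this layout, the disk headroom and I/O cautions from the force merge section still apply.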

How Netdata helps

Netdata correlates per-node disk I/O wait with indexing latency to highlight merge saturation. It tracks segments.memory alongside JVM heap usage to show metadata-driven heap pressure. Alerts on file descriptor percentage and disk watermark proximity fire before they become hard limits. Search and indexing latency appear on the same timeline as segment count growth, and thread pool queue depths and rejections are monitored to catch backlog before it cascades.