Elasticsearch too many segments per shard: search slowdown and force-merge

Searches that used to complete in tens of milliseconds are now taking hundreds. CPU on data nodes climbs without a traffic spike. Cluster health stays green, but query latency degrades and timeouts increase. The problem is invisible at the cluster level: shards have accumulated too many Lucene segments, and every query scans every segment in every relevant shard. When a shard holds more than roughly 100 segments, the background merge policy has fallen behind. Search slows, file descriptor usage rises, and merge I/O competes with indexing and queries. Since Elasticsearch 7.7, most segment metadata moved off-heap, so the node may not run out of JVM heap, but search performance degrades all the same.

What this means

A Lucene segment is an immutable chunk of an inverted index. During the write path, documents collect in an in-memory buffer; on refresh, Elasticsearch writes a new segment that is immediately searchable. On flush, segments become durable and the translog is truncated. Background merges combine small segments into larger ones, reclaim space from deleted documents, and improve search performance. Healthy shards typically hold 10 to 50 segments. When the count exceeds 100 per shard, merges are not keeping up with the creation rate.

Elasticsearch uses a scatter-gather model: the coordinating node broadcasts the query to one copy of every relevant shard. Each shard executes the query against its local segments and returns results. The slowest shard determines overall latency. When segment counts are high, every shard scan opens more files, iterates more data structures, and consumes more CPU. The node also holds more open file descriptors, which can approach system limits. Merge operations themselves consume disk I/O and CPU, so a merge backlog can saturate storage and slow indexing even further.

flowchart TD
    A[High indexing rate or frequent refresh] --> B[Many small segments created]
    C[Update-heavy workload] --> B
    B --> D[Merge backlog]
    D --> E[Segment count exceeds 100 per shard]
    E --> F[Search latency increases]
    E --> G[File descriptor pressure]
    D --> H[Disk I/O saturation]
    H --> I[Indexing slowdown]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Refresh interval too aggressive | Segment count rising on actively indexed indices; high refresh throughput | /_nodes/stats/indices/refresh for rising total time |
| Merge I/O saturation | Segment count growing despite steady load; merge time increasing | /_nodes/stats/indices/merges for current, plus OS iostat |
| Indexing burst outpacing merge threads | Many small segments; merges persistently active | /_nodes/stats/indices/merges for current and total |
| Update-heavy workload with deleted docs | Merges are large and slow; space not reclaimed after deletes | /_nodes/stats/indices/merges for total_size_in_bytes |
| Force merge blocking background merges | Segment count flat or rising while a force merge is running | /_nodes/stats/indices/merges current metric |

Quick checks

Read-only checks for segment density, merge health, and search impact.

# Segment counts per index, sorted by primary segment count descending
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size,pri.segments.count&s=pri.segments.count:desc' | head -20
# Node-level segment count and segment memory
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory'
# Node-level merge concurrency and cumulative merges
curl -s 'http://localhost:9200/_nodes/stats/indices/merges?filter_path=nodes.*.indices.merges.current,nodes.*.indices.merges.total'
# Detailed merge stats: current merges, total time, and size
curl -s 'http://localhost:9200/_nodes/stats/indices/merges?filter_path=nodes.*.indices.merges'
# Search latency: query and fetch phase cumulative counters
curl -s 'http://localhost:9200/_nodes/stats/indices/search?filter_path=nodes.*.indices.search.query_total,nodes.*.indices.search.query_time_in_millis,nodes.*.indices.search.fetch_total,nodes.*.indices.search.fetch_time_in_millis'
# Refresh and flush performance
curl -s 'http://localhost:9200/_nodes/stats/indices/refresh,flush?filter_path=nodes.*.indices.refresh,nodes.*.indices.flush'
# File descriptor utilization per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,file_desc.current,file_desc.max,file_desc.percent'
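The first check reports pri.segments.count as a total across all primary shards of an index, so a per-shard figure needs one more division. The following is a minimal sketch, assuming the same _cat/indices request was made with format=json (which returns every value as a string). The function name, the sample rows, and the decision to skip rows with missing values are illustrative assumptions, not part of Elasticsearch.

```python
import json


def segments_per_primary(cat_indices_rows, threshold=100):
    """Return (index, average segments per primary shard) pairs above threshold, worst first."""
    flagged = []
    for row in cat_indices_rows:
        pri = row.get("pri")
        seg = row.get("pri.segments.count")
        # Assumption: rows with missing values (for example closed indices) are skipped.
        if not pri or not seg or int(pri) == 0:
            continue
        avg = int(seg) / int(pri)
        if avg > threshold:
            flagged.append((row["index"], avg))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical rows shaped like `_cat/indices?format=json&h=index,pri,pri.segments.count`.
sample = json.loads("""[
  {"index": "logs-a", "pri": "2", "pri.segments.count": "460"},
  {"index": "logs-b", "pri": "5", "pri.segments.count": "150"},
  {"index": "logs-c", "pri": "1", "pri.segments.count": "101"}
]""")

print(segments_per_primary(sample))
```

An average hides skew between shards; if one index is flagged, _cat/shards with the segments.count column shows the count for each individual shard.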

How to diagnose it

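  1. Confirm segment counts per shard. Use _cat/indices with pri.segments.count and sort descending. The column is the total across all primary shards of the index, so divide it by pri to get the per-shard average. Healthy shards show 10 to 50 segments. Anything above 100 per shard suggests a merge backlog.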
  2. Check if merges are keeping up. Use _nodes/stats/indices/merges. If current is persistently non-zero and segment counts continue to rise, the node cannot consolidate segments faster than they are created.
  3. Inspect merge throughput and cost. Use _nodes/stats/indices/merges to see total_time_in_millis and total_size_in_bytes. Rapid growth in merge time without a corresponding drop in segment count indicates I/O-bound merges.
  4. Correlate with search latency. Sample _nodes/stats/indices/search twice, 30 seconds apart. Compute delta(query_time_in_millis) / delta(query_total) to get the average query-phase latency over that interval. Elevated latency with stable query complexity points to segment scan overhead.
  5. Evaluate refresh behavior. Check _nodes/stats/indices/refresh for total_time_in_millis. If refresh time is high or refresh_interval is set very low, the node is creating segments faster than the merge policy can absorb them.
  6. Check file descriptor pressure. Use _cat/nodes with file_desc.percent. Each segment consists of multiple files. Segment explosion drives file descriptor growth, which can approach the per-process limit.
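The latency calculation in step 4 works on cumulative counters, so it needs two samples and a subtraction. A minimal sketch, assuming both samples are the parsed JSON of _nodes/stats/indices/search; the function name and the sample numbers are made up for illustration.

```python
def avg_query_latency_ms(before, after):
    """Average query-phase latency per node, in ms, between two node stats samples."""
    result = {}
    for node_id, node in after["nodes"].items():
        prev = before["nodes"].get(node_id)
        if prev is None:
            continue  # node joined between the two samples
        new_search = node["indices"]["search"]
        old_search = prev["indices"]["search"]
        d_total = new_search["query_total"] - old_search["query_total"]
        d_time = new_search["query_time_in_millis"] - old_search["query_time_in_millis"]
        # No queries in the interval, or counters went backwards after a restart.
        result[node_id] = d_time / d_total if d_total > 0 else None
    return result


# Hypothetical samples taken 30 seconds apart.
sample_before = {"nodes": {"n1": {"indices": {"search": {
    "query_total": 1000, "query_time_in_millis": 20000}}}}}
sample_after = {"nodes": {"n1": {"indices": {"search": {
    "query_total": 1500, "query_time_in_millis": 95000}}}}}

print(avg_query_latency_ms(sample_before, sample_after))
```

The counters are per shard-level query, not per client request, so the result is best compared against the same node's own baseline rather than against end-to-end response times.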

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| pri.segments.count per index | Direct measure of per-shard segment density | >100 per shard |
| segments.count per node | Node-wide load and file descriptor pressure | Growing monotonically |
| segments.memory per node | Memory consumed by segment-related structures | Sustained upward trend |
| Search query latency | User-facing impact of segment scan overhead | >2x baseline sustained |
| merges.current | Merge concurrency saturation | Persistently non-zero while segment counts rise |
| Merge total time | Background I/O and CPU cost | Growing faster than segment count drops |
| file_desc.percent | Segments consume multiple files per shard | >80% of system limit |

Fixes

Reduce refresh frequency on write-heavy indices

For indices still receiving writes, increase index.refresh_interval to 30s or higher during bulk loads. This reduces the creation rate of small segments and gives the TieredMergePolicy room to consolidate them. Restore 1s only after ingestion completes and segment counts normalize. Do not set refresh_interval: -1 on live production search indices unless you can tolerate delayed searchability.
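The interval is a dynamic index setting, so it can be changed through the update index settings API without reopening the index. The sketch below only builds the request and does not send it. The helper name, the base URL, and the index name logs-bulk are placeholders chosen for illustration.

```python
import json
import urllib.request


def refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT /<index>/_settings request for refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical target: a local node and an index named logs-bulk.
req = refresh_interval_request("http://localhost:9200", "logs-bulk", "30s")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
# Sending is left to the caller, for example: urllib.request.urlopen(req)
```

Calling the same helper with "1s" after ingestion completes restores the value the text recommends.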

Tune merge concurrency for storage type

If merges are falling behind on spinning disks, reduce index.merge.scheduler.max_thread_count to 1 to lower random I/O contention. On SSDs, the default scheduler concurrency is tuned for available cores. Raising the thread count above the default rarely helps and usually increases contention with indexing and search.
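For reference, the default that the scheduler derives from the core count can be reproduced in a few lines. This sketch follows the formula given in the Elasticsearch index merge settings documentation, max(1, min(4, processors / 2)) with integer division; treat it as an assumption to verify against the documentation for your version.

```python
def default_merge_threads(allocated_processors):
    """Documented default for index.merge.scheduler.max_thread_count (verify per version)."""
    return max(1, min(4, allocated_processors // 2))


for cores in (1, 2, 8, 32):
    print(cores, default_merge_threads(cores))
```

The cap at four threads per shard is why adding cores beyond eight does not raise the default any further.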

Force-merge read-only indices only

For time-series indices that are no longer written to, force merge to one segment per shard:

# WARNING: blocking and resource-intensive. Only for read-only indices.
POST /<index>/_forcemerge?max_num_segments=1

This can take hours on large shards and will spike CPU and disk I/O while it runs. Never run force merge on an index that is still receiving writes. On a live write index, force merge creates very large segments that the background merge policy will not efficiently recombine with future small segments. This wastes disk space and CPU and can trigger subsequent merge storms. After a force merge, new writes create new segments alongside the force-merged giant, leaving you with worse fragmentation than before.
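One way to enforce the read-only rule in automation is to refuse to build the force-merge request unless the index carries a write block. The sketch below is an illustrative guard and not an Elasticsearch feature: it assumes the parsed response of GET /<index>/_settings, and it treats index.blocks.write set to true as a stand-in for "no longer written to". The function names and the index names are hypothetical.

```python
def is_write_blocked(settings_response, index):
    """True if the index settings show index.blocks.write set to true."""
    index_settings = settings_response.get(index, {}).get("settings", {}).get("index", {})
    write_block = index_settings.get("blocks", {}).get("write", "false")
    return str(write_block).lower() == "true"  # settings values come back as strings


def forcemerge_path(settings_response, index):
    """Return the force-merge request path, or raise if the index may still take writes."""
    if not is_write_blocked(settings_response, index):
        raise ValueError(f"{index} has no write block; refusing to force merge")
    return f"/{index}/_forcemerge?max_num_segments=1"


# Hypothetical settings responses for two indices.
settings = {
    "logs-2024.01.01": {"settings": {"index": {"blocks": {"write": "true"}}}},
    "logs-current": {"settings": {"index": {"number_of_shards": "3"}}},
}
print(forcemerge_path(settings, "logs-2024.01.01"))
```

A write block is a stronger guarantee than simply observing that indexing has stopped, because it rejects late-arriving documents instead of silently creating new segments next to the merged one.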

Address over-sharding

If segment counts are high because shards are undersized, use the shrink API to reduce the primary shard count on read-only indices, or reindex into a new index with fewer, larger shards. Target shard sizes in the 10 to 50 GB range. Fewer shards means fewer total segments and lower fixed overhead.
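The shrink API only accepts a target shard count that is a factor of the source count, which narrows the choices. A minimal sketch of picking a target under that constraint, using the 50 GB upper bound from the text as the default ceiling; the function name and the example sizes are illustrative.

```python
def shrink_target_shards(source_shards, index_size_gb, max_shard_gb=50):
    """Smallest factor of source_shards that keeps each shard at or below max_shard_gb."""
    for candidate in range(1, source_shards + 1):
        if source_shards % candidate == 0 and index_size_gb / candidate <= max_shard_gb:
            return candidate
    # Even the current shard count exceeds the ceiling, so shrinking would not help.
    return source_shards


print(shrink_target_shards(12, 120))  # 120 GB index with 12 primaries
print(shrink_target_shards(8, 30))    # 30 GB index with 8 primaries
```

When no factor gives a useful size, reindexing into a new index is the remaining option, since it is not bound by the factor rule.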

Prevention

  • Set refresh_interval to -1 during large bulk imports into indices that do not need to serve fresh search results during the load, and restore it afterward.
  • Use ILM to transition time-series indices to read-only and force-merge them automatically once writes stop.
  • Monitor pri.segments.count and segments.count per node as leading indicators, not just lagging symptoms.
  • Size disks with merge headroom in mind. A large merge can temporarily require disk space for both old and new segments.
  • Avoid creating new indices faster than merges can keep up. Consolidate time-series granularity where possible, for example daily instead of hourly indices.
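The ILM item above can be sketched as a policy body. This is a minimal example, assuming a warm phase with the readonly and forcemerge actions; the seven-day min_age and the helper name are placeholders, and the dictionary would be sent as the JSON body of PUT _ilm/policy/<policy-name>.

```python
import json


def readonly_forcemerge_policy(min_age="7d"):
    """Build an ILM policy body that write-blocks and force-merges indices in the warm phase."""
    return {
        "policy": {
            "phases": {
                "warm": {
                    "min_age": min_age,  # placeholder: set to when writes have stopped
                    "actions": {
                        "readonly": {},
                        "forcemerge": {"max_num_segments": 1},
                    },
                }
            }
        }
    }


policy = readonly_forcemerge_policy()
print(json.dumps(policy, indent=2))
```

Without a rollover action in an earlier phase, min_age is measured from index creation, so the value has to be longer than the period during which the index still receives writes.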

How Netdata helps

Netdata correlates Elasticsearch segment counts with per-node search latency and CPU to isolate segment pressure from query load. It tracks file descriptor utilization per node alongside segment growth, surfaces OS disk I/O wait alongside merge activity to distinguish merge saturation from generic disk pressure, and charts JVM heap and old GC behavior against segment memory trends. Alert on monotonically rising segment counts per node before query performance degrades.