Elasticsearch search latency high: query phase, fetch phase, and the slow shard

Elasticsearch P95 search latency jumped from 50 ms to multiple seconds. Cluster health is green, search thread pool queues are growing, and users report timeouts. Elasticsearch executes every search as a two-phase scatter-gather across targeted shards, so the slowest shard sets the floor for response time. The _nodes/stats API reports cumulative counters, not per-shard or per-query latencies, meaning one bad shard or one expensive query can be invisible to cluster-wide averages while destroying tail latency. Determine whether time is lost in the query phase, the fetch phase, or the coordinating node merge step, then isolate the outlier.

What this means

When a search arrives at a coordinating node, it executes in two phases. In the query phase, the coordinating node broadcasts the request to one copy of every relevant shard. Each shard runs the query against its local Lucene index and returns document IDs plus sort values. In the fetch phase, the coordinating node merges shard-level results, selects the top N hits, and requests the actual _source documents from the relevant shards. The response returns only after the slowest participating shard finishes each phase.

Latency is a function of the maximum, not the average. A query hitting one hundred shards is gated by the single slowest shard in each phase. The _nodes/stats/indices/search counters (query_time_in_millis, query_total, fetch_time_in_millis, fetch_total) report cumulative totals across all shards and all queries since node startup. A single pathological query or cold node will not move the mean enough to explain a P99 spike, but it will dominate user experience. For tail latency, the slow log is essential.

flowchart TD
    Coord[Coordinating node] -->|1. Query phase| S1[Shard 1]
    Coord -->|1. Query phase| S2[Shard 2]
    Coord -->|1. Query phase| S3[Shard N]
    S1 -->|doc IDs + scores| Merge[Merge and rank]
    S2 -->|doc IDs + scores| Merge
    S3 -->|doc IDs + scores| Merge
    Merge -->|2. Fetch phase| S1
    Merge -->|2. Fetch phase| S3
    S1 -->|_source| Resp[Response to client]
    S3 -->|_source| Resp
    S3 -.->|Slowest shard gates overall latency| Merge

Common causes

CauseWhat it looks likeFirst thing to check
Cold OS page cache after restartLatency spike after node restart; low CPU, high I/O waitNode uptime and OS cached memory relative to index size
Expensive query patternsQuery-phase latency jumps; CPU spikes on specific nodes; tail latency spikes while average looks normalSlow log query phase; _nodes/hot_threads
Too many Lucene segmentsGradual latency increase; search thread pool queues growing; segment metadata heap risingpri.segments.count per index
Excessive shard fan-outLatency scales with number of shards queried; mild, even load across many nodesShard count for the target index
Coordinating node merge pressureFetch latency or merge time high; coordinating node heap/CPU up; data nodes look normalNode roles and request circuit breaker trips
Disk-bound fetch phaseFetch latency high; I/O wait elevated; large _source documentsSlow log fetch phase; average document size

Quick checks

Run these safe, read-only commands to orient the investigation.

# Check query-phase and fetch-phase averages
curl -s 'http://localhost:9200/_nodes/stats/indices/search?filter_path=nodes.*.indices.search.query_total,nodes.*.indices.search.query_time_in_millis,nodes.*.indices.search.fetch_total,nodes.*.indices.search.fetch_time_in_millis'

# Check search thread pool saturation
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected&s=queue:desc'

# Check which indices have the most segments
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size,pri.segments.count&s=pri.segments.count:desc' | head -20

# Check what is consuming CPU right now
curl -s 'http://localhost:9200/_nodes/hot_threads'

<!-- TODO: verify gc.old.count and gc.old.time are valid _cat/nodes headers in your ES version -->
# Check old-generation GC pause accumulation
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,gc.old.count,gc.old.time'

# Check for long-running search tasks
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

How to diagnose it

  1. Determine which phase is slow. Collect _nodes/stats/indices/search output twice, 30 to 60 seconds apart. Compute delta(query_time_in_millis) / delta(query_total) and delta(fetch_time_in_millis) / delta(fetch_total). If query latency dominates, the problem is shard-side Lucene execution. If fetch latency dominates, the problem is retrieving and transferring document source.

  2. Determine whether the issue is cluster-wide or index-specific. Correlate node-level latency with index stats to see if one or two indices account for the majority of search time. If the problem is isolated to specific indices, inspect their mappings, query patterns, and shard counts.

  3. Inspect the slow log. The slow log defaults to disabled (threshold -1). Enable index-level slow logs for both query and fetch phases to capture individual requests that exceed your baseline. The slow log reveals the exact queries and shards driving tail latency, which averages from the stats API cannot show.

  4. Check segment counts. High segment counts force Lucene to merge many small immutable segment result sets instead of searching a few large ones. Check _cat/nodes?v&h=name,segments.memory for segment heap overhead. If actively searched indices have more than roughly 100 segments per shard, background merges are falling behind. Check _cat/indices?v&h=index,merges.current&s=merges.current:desc for indices with active merge pressure.

  5. Check for cold page cache. If nodes restarted recently, the OS page cache will be empty and search latency will be elevated until hot segment files are reloaded into memory. This is expected behavior, not a fault, but it can last minutes to hours depending on dataset size relative to RAM. Check free -m or /proc/meminfo for low cached memory compared to total index size on the node.

  6. Inspect coordinating node resources. If fetch latency is high but data nodes look healthy, the coordinating node merging large result sets is the bottleneck. Large result sets, deep pagination, or high-cardinality aggregations spike heap and CPU while merging sorted hit lists. Check the request circuit breaker and heap usage on coordinating nodes.

  7. Correlate with OS I/O wait. Run iostat -xz 1 on data nodes. High I/O wait with low CPU suggests searches are disk-bound due to page cache misses or slow storage. This is especially common after restarts or when the dataset exceeds available RAM.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Query latency (delta)Shard-side execution costSustained increase >2x baseline
Fetch latency (delta)Document retrieval and transfer costSustained increase >2x baseline
Search thread pool queue depthPrecursor to rejections and visible latency spikesQueue growing and not draining between samples
Segment count per shardMore segments = more files to scan per query>100 segments on actively searched shards
Segments.memoryHeap overhead from segment metadataGrowing linearly with segment count
JVM old GC collection timeStop-the-world pauses stall search threadsPause duration >5 seconds or frequency increasing
OS page cache effectivenessLucene relies on the kernel page cache for segment accessCached memory low relative to total index size on node
Disk I/O waitIndicates disk-bound searches when cache misses occuriowait >20% sustained

Fixes

Cold page cache after restart

Do not restart the node again. Wait for the working set to reload into the OS page cache naturally. If you need to accelerate warmup, run representative searches against hot indices to pull segments into cache. Ensure JVM heap is sized to no more than 50% of physical RAM so the OS has sufficient memory for page cache.

Expensive query patterns

Replace leading wildcards and regex queries with prefix queries, edge_ngram analysis at index time, or keyword sub-fields. Avoid script-based sorting and script_score if native numeric or geo fields can achieve the same ordering. Reduce nested aggregation depth, and aggregate on keyword fields rather than analyzed text fields.

Too many segments

For indices that are no longer written to, run POST /<index>/_forcemerge?max_num_segments=1. This reduces query-phase scan overhead and segment metadata heap usage. Do not force-merge indices that are still actively receiving writes; oversized segments degrade write performance and make future merges expensive. For active write indices, increase index.refresh_interval to reduce the rate of small segment creation.

Excessive shard fan-out

Reduce the number of shards per query by shrinking indices with the Shrink API or reindexing into fewer, larger shards. Avoid creating indices with more shards than your ingest volume justifies. Note that adding replicas increases search throughput but does not reduce the query-phase latency of a single request.

Coordinating node merge pressure

Reduce the size parameter in searches and limit aggregation bucket counts. If large aggregations are required, route them through dedicated coordinating nodes with additional heap. Monitor the request circuit breaker on the coordinating node for trips caused by oversized result sets.

Disk-bound fetch phase

If I/O wait is high and the page cache is warm, consider moving to faster storage. Reduce stored _source size by trimming unnecessary fields at ingest time, or use _source filtering in requests to avoid transferring large documents across the network.

Prevention

Enable slow logs for both query and fetch phases on production indices so tail latency is captured before it becomes an incident. Monitor segment counts and schedule force-merge as part of your ILM policy for rolled indices. Size shards to avoid excessive fan-out; as a guideline, avoid querying more than a few hundred shards in a single request. Ensure the OS page cache has sufficient headroom by limiting JVM heap to 50% of physical RAM (maximum roughly 26 to 30 GB). Review query patterns in development for leading wildcards, deep nesting, and script usage.

How Netdata helps

  • Correlate Elasticsearch search latency with OS-level disk I/O wait, CPU utilization, and memory page cache metrics on the same node to distinguish CPU-bound from I/O-bound searches.
  • Track JVM heap usage and old-generation GC pause times alongside search thread pool queue depth to catch heap-pressure-induced latency before it triggers rejections or node departures.
  • Alert on per-node segment memory growth and file descriptor utilization, giving early warning of segment explosion that precedes query latency spikes.
  • Visualize search thread pool queue depth and rejection rates in real time to identify saturation before it drives up tail latency.