Elasticsearch search latency high: query phase, fetch phase, and the slow shard
Elasticsearch P95 search latency jumped from 50 ms to multiple seconds. Cluster health is green, search thread pool queues are growing, and users report timeouts. Elasticsearch executes every search as a two-phase scatter-gather across targeted shards, so the slowest shard sets the floor for response time. The _nodes/stats API reports cumulative counters, not per-shard or per-query latencies, meaning one bad shard or one expensive query can be invisible to cluster-wide averages while destroying tail latency. Determine whether time is lost in the query phase, the fetch phase, or the coordinating node merge step, then isolate the outlier.
What this means
When a search arrives at a coordinating node, it executes in two phases. In the query phase, the coordinating node broadcasts the request to one copy of every relevant shard. Each shard runs the query against its local Lucene index and returns document IDs plus sort values. In the fetch phase, the coordinating node merges shard-level results, selects the top N hits, and requests the actual _source documents from the relevant shards. The response returns only after the slowest participating shard finishes each phase.
Latency is a function of the maximum, not the average. A query hitting one hundred shards is gated by the single slowest shard in each phase. The _nodes/stats/indices/search counters (query_time_in_millis, query_total, fetch_time_in_millis, fetch_total) report cumulative totals across all shards and all queries since node startup. A single pathological query or cold node will not move the mean enough to explain a P99 spike, but it will dominate user experience. For tail latency, the slow log is essential.
flowchart TD
Coord[Coordinating node] -->|1. Query phase| S1[Shard 1]
Coord -->|1. Query phase| S2[Shard 2]
Coord -->|1. Query phase| S3[Shard N]
S1 -->|doc IDs + scores| Merge[Merge and rank]
S2 -->|doc IDs + scores| Merge
S3 -->|doc IDs + scores| Merge
Merge -->|2. Fetch phase| S1
Merge -->|2. Fetch phase| S3
S1 -->|_source| Resp[Response to client]
S3 -->|_source| Resp
S3 -.->|Slowest shard gates overall latency| MergeCommon causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Cold OS page cache after restart | Latency spike after node restart; low CPU, high I/O wait | Node uptime and OS cached memory relative to index size |
| Expensive query patterns | Query-phase latency jumps; CPU spikes on specific nodes; tail latency spikes while average looks normal | Slow log query phase; _nodes/hot_threads |
| Too many Lucene segments | Gradual latency increase; search thread pool queues growing; segment metadata heap rising | pri.segments.count per index |
| Excessive shard fan-out | Latency scales with number of shards queried; mild, even load across many nodes | Shard count for the target index |
| Coordinating node merge pressure | Fetch latency or merge time high; coordinating node heap/CPU up; data nodes look normal | Node roles and request circuit breaker trips |
| Disk-bound fetch phase | Fetch latency high; I/O wait elevated; large _source documents | Slow log fetch phase; average document size |
Quick checks
Run these safe, read-only commands to orient the investigation.
# Check query-phase and fetch-phase averages
curl -s 'http://localhost:9200/_nodes/stats/indices/search?filter_path=nodes.*.indices.search.query_total,nodes.*.indices.search.query_time_in_millis,nodes.*.indices.search.fetch_total,nodes.*.indices.search.fetch_time_in_millis'
# Check search thread pool saturation
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected&s=queue:desc'
# Check which indices have the most segments
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size,pri.segments.count&s=pri.segments.count:desc' | head -20
# Check what is consuming CPU right now
curl -s 'http://localhost:9200/_nodes/hot_threads'
<!-- TODO: verify gc.old.count and gc.old.time are valid _cat/nodes headers in your ES version -->
# Check old-generation GC pause accumulation
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,gc.old.count,gc.old.time'
# Check for long-running search tasks
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'
How to diagnose it
Determine which phase is slow. Collect
_nodes/stats/indices/searchoutput twice, 30 to 60 seconds apart. Computedelta(query_time_in_millis) / delta(query_total)anddelta(fetch_time_in_millis) / delta(fetch_total). If query latency dominates, the problem is shard-side Lucene execution. If fetch latency dominates, the problem is retrieving and transferring document source.Determine whether the issue is cluster-wide or index-specific. Correlate node-level latency with index stats to see if one or two indices account for the majority of search time. If the problem is isolated to specific indices, inspect their mappings, query patterns, and shard counts.
Inspect the slow log. The slow log defaults to disabled (threshold
-1). Enable index-level slow logs for both query and fetch phases to capture individual requests that exceed your baseline. The slow log reveals the exact queries and shards driving tail latency, which averages from the stats API cannot show.Check segment counts. High segment counts force Lucene to merge many small immutable segment result sets instead of searching a few large ones. Check
_cat/nodes?v&h=name,segments.memoryfor segment heap overhead. If actively searched indices have more than roughly 100 segments per shard, background merges are falling behind. Check_cat/indices?v&h=index,merges.current&s=merges.current:descfor indices with active merge pressure.Check for cold page cache. If nodes restarted recently, the OS page cache will be empty and search latency will be elevated until hot segment files are reloaded into memory. This is expected behavior, not a fault, but it can last minutes to hours depending on dataset size relative to RAM. Check
free -mor/proc/meminfofor low cached memory compared to total index size on the node.Inspect coordinating node resources. If fetch latency is high but data nodes look healthy, the coordinating node merging large result sets is the bottleneck. Large result sets, deep pagination, or high-cardinality aggregations spike heap and CPU while merging sorted hit lists. Check the
requestcircuit breaker and heap usage on coordinating nodes.Correlate with OS I/O wait. Run
iostat -xz 1on data nodes. High I/O wait with low CPU suggests searches are disk-bound due to page cache misses or slow storage. This is especially common after restarts or when the dataset exceeds available RAM.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Query latency (delta) | Shard-side execution cost | Sustained increase >2x baseline |
| Fetch latency (delta) | Document retrieval and transfer cost | Sustained increase >2x baseline |
| Search thread pool queue depth | Precursor to rejections and visible latency spikes | Queue growing and not draining between samples |
| Segment count per shard | More segments = more files to scan per query | >100 segments on actively searched shards |
| Segments.memory | Heap overhead from segment metadata | Growing linearly with segment count |
| JVM old GC collection time | Stop-the-world pauses stall search threads | Pause duration >5 seconds or frequency increasing |
| OS page cache effectiveness | Lucene relies on the kernel page cache for segment access | Cached memory low relative to total index size on node |
| Disk I/O wait | Indicates disk-bound searches when cache misses occur | iowait >20% sustained |
Fixes
Cold page cache after restart
Do not restart the node again. Wait for the working set to reload into the OS page cache naturally. If you need to accelerate warmup, run representative searches against hot indices to pull segments into cache. Ensure JVM heap is sized to no more than 50% of physical RAM so the OS has sufficient memory for page cache.
Expensive query patterns
Replace leading wildcards and regex queries with prefix queries, edge_ngram analysis at index time, or keyword sub-fields. Avoid script-based sorting and script_score if native numeric or geo fields can achieve the same ordering. Reduce nested aggregation depth, and aggregate on keyword fields rather than analyzed text fields.
Too many segments
For indices that are no longer written to, run POST /<index>/_forcemerge?max_num_segments=1. This reduces query-phase scan overhead and segment metadata heap usage. Do not force-merge indices that are still actively receiving writes; oversized segments degrade write performance and make future merges expensive. For active write indices, increase index.refresh_interval to reduce the rate of small segment creation.
Excessive shard fan-out
Reduce the number of shards per query by shrinking indices with the Shrink API or reindexing into fewer, larger shards. Avoid creating indices with more shards than your ingest volume justifies. Note that adding replicas increases search throughput but does not reduce the query-phase latency of a single request.
Coordinating node merge pressure
Reduce the size parameter in searches and limit aggregation bucket counts. If large aggregations are required, route them through dedicated coordinating nodes with additional heap. Monitor the request circuit breaker on the coordinating node for trips caused by oversized result sets.
Disk-bound fetch phase
If I/O wait is high and the page cache is warm, consider moving to faster storage. Reduce stored _source size by trimming unnecessary fields at ingest time, or use _source filtering in requests to avoid transferring large documents across the network.
Prevention
Enable slow logs for both query and fetch phases on production indices so tail latency is captured before it becomes an incident. Monitor segment counts and schedule force-merge as part of your ILM policy for rolled indices. Size shards to avoid excessive fan-out; as a guideline, avoid querying more than a few hundred shards in a single request. Ensure the OS page cache has sufficient headroom by limiting JVM heap to 50% of physical RAM (maximum roughly 26 to 30 GB). Review query patterns in development for leading wildcards, deep nesting, and script usage.
How Netdata helps
- Correlate Elasticsearch search latency with OS-level disk I/O wait, CPU utilization, and memory page cache metrics on the same node to distinguish CPU-bound from I/O-bound searches.
- Track JVM heap usage and old-generation GC pause times alongside search thread pool queue depth to catch heap-pressure-induced latency before it triggers rejections or node departures.
- Alert on per-node segment memory growth and file descriptor utilization, giving early warning of segment explosion that precedes query latency spikes.
- Visualize search thread pool queue depth and rejection rates in real time to identify saturation before it drives up tail latency.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch authentication failures: audit logs, brute force, and credential drift
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch exposed without authentication: open clusters and snapshot exfiltration







