Elasticsearch expensive queries: leading wildcards, regex, deep pagination, and scripts
Search latency spikes and data-node CPU saturation usually trace to one of four query patterns: leading wildcards or unbounded regex, deep from+size pagination, runtime Painless scripts, and deep aggregations. Elasticsearch uses a scatter-gather read path: the coordinating node broadcasts every query to one copy of every target shard. A single expensive query fans out, and the slowest shard sets overall latency. Coordinating-node heap pressure rises when it merges large intermediate result sets during the fetch phase or aggregation reduce phase.
What this means
Each shard executes the query against its local Lucene index and returns document IDs plus sort values. The coordinating node merges those results and requests the actual _source in the fetch phase.
Leading wildcards and regex prevent Lucene from seeking into the sorted term dictionary by prefix. A pattern starting with * or .* forces a full scan of the field's term dictionary on every target shard. Deep pagination with from+size forces every shard to collect and sort its top from+size hits, which the coordinating node must hold in heap while merging. Runtime scripts execute per-document and cannot use index structures; they invoke the Painless engine once for every matching document. Deep aggregations can generate enough buckets to trip the request circuit breaker before the coordinating node can reduce them.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Leading wildcard or regex | CPU spikes on data nodes; search latency rises without thread pool queue growth | Slow log for wildcard or regexp queries; /_nodes/hot_threads |
| Deep from+size pagination | Coordinating-node heap spikes; request circuit breaker trips; errors beyond index.max_result_window | /_tasks for large from values; task descriptions |
| Painless scripts | High process CPU with low search throughput; script compilation cache pressure | Slow log for script queries; /_nodes/stats/script |
| Deep aggregations | request or parent breaker trips; too_many_buckets_exception; heap grows during merge | Query body for high terms.size or nested sub-aggs; /_nodes/stats/breaker |
Quick checks
# List running search tasks to spot long-running queries
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'
# Sample hot threads to see what is consuming CPU
curl -s 'http://localhost:9200/_nodes/hot_threads'
# Check search thread pool saturation
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'
# Check request and parent circuit breakers
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers.request,nodes.*.breakers.parent'
# Sample JVM heap and old GC behavior
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old'
# Check script compilation and cache pressure
curl -s 'http://localhost:9200/_nodes/stats/script?filter_path=nodes.*.script'
How to diagnose it
1. Confirm resource impact. Use GET /_cat/nodes?v&h=name,cpu,heap.percent,ram.percent,disk.used.percent to spot data nodes with high CPU or heap. Cross-check with GET /_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected for query-phase saturation. Queue growth without rejections points to slow queries; rejections point to outright overload.
2. Capture in-flight queries with GET /_tasks?detailed=true&actions=*search*. Look for:
   - Long-running searches with high running_time_in_nanos.
   - Large from values in the task description.
   - Script mentions in the task description or query source.
   - Aggregation contexts with high size parameters.
3. Sample hot threads with GET /_nodes/hot_threads. CPU-heavy regex or wildcard work appears in MultiTermQuery or automaton construction stacks. Script work shows Painless execution frames referencing org.elasticsearch.painless. Deep pagination and aggregation merge time often appears in priority-queue or collection classes on the coordinating node.
4. Enable or lower the slow log threshold temporarily to tie a query string to a latency spike. Set it per index:
   curl -X PUT 'localhost:9200/<index>/_settings' -H 'Content-Type: application/json' -d'
   { "index.search.slowlog.threshold.query.warn": "10s" }'
   Then tail the slow log on the data nodes. Revert the threshold when finished.
5. Check for aggregation bucket explosions. If responses fail with too_many_buckets_exception, the query exceeded search.max_buckets (default 65,536). Check the current limit with:
   curl -s 'localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.max_buckets'
   Do not raise the limit to hide the symptom. Reduce aggregation scope, add a filter query to narrow the dataset, or paginate through buckets with a composite aggregation.
6. Map the query pattern to the resource signal. Heap spikes concentrated on the coordinating node point to deep pagination or large aggregations. CPU spikes spread across data nodes point to wildcards, regex, or scripts. Circuit breaker trips on request or parent confirm memory-heavy merge operations.
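The task checks in step 2 can be scripted. The following is a minimal Python sketch that works on an already-parsed GET /_tasks?detailed=true response. The function name, the 5-second default, and the regular expression are assumptions for illustration; the description text varies between versions, so treat the from and script detection as heuristics.

```python
import re

# Heuristic: pull the "from" value out of the JSON source embedded in the
# task description. The description layout is not a stable contract.
FROM_PATTERN = re.compile(r'"from"\s*:\s*(\d+)')


def summarize_search_tasks(tasks_response, min_seconds=5.0):
    """Return search tasks running for at least min_seconds, slowest first."""
    findings = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds < min_seconds:
                continue
            description = task.get("description", "")
            match = FROM_PATTERN.search(description)
            findings.append({
                "task_id": task_id,
                "node": node.get("name", node_id),
                "seconds": round(seconds, 1),
                "from": int(match.group(1)) if match else None,
                "mentions_script": '"script"' in description,
            })
    return sorted(findings, key=lambda f: f["seconds"], reverse=True)
```

Feed it the JSON body returned by the tasks API (for example, the output of the curl command in Quick checks passed through json.loads) and review the slowest entries first.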
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Search latency (query and fetch phases) | Expensive queries slow every shard | Sustained >5x baseline |
| Search thread pool queue / rejected | Saturation from expensive work | Queue >100 sustained or rejections >0/min |
| JVM heap used percent | Scripts and aggregations consume heap | Sustained >85% or post-GC floor rising |
| Circuit breaker trips (request, parent) | Queries allocating too much memory | Any delta >0 on request or parent |
| Slow log entries | Direct evidence of slow queries | Entries appearing at the warn threshold |
| CPU utilization per node | Regex, wildcards, and scripts burn CPU | Sustained >80% on data nodes |
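The "any delta >0" rule for breaker trips needs two samples, because the tripped counter is cumulative since node start. A rough Python sketch, assuming both arguments are parsed responses from /_nodes/stats/breaker taken some interval apart (the function name is hypothetical):

```python
def breaker_trip_deltas(previous, current, breakers=("request", "parent")):
    """Return {(node_id, breaker): new_trips} for breakers that tripped
    between two samples of /_nodes/stats/breaker."""
    deltas = {}
    for node_id, node in current.get("nodes", {}).items():
        before = previous.get("nodes", {}).get(node_id, {}).get("breakers", {})
        for name in breakers:
            now = node.get("breakers", {}).get(name, {}).get("tripped", 0)
            then = before.get(name, {}).get("tripped", 0)
            # A node restart resets the counter; a lower value is not
            # reported as a trip here.
            if now > then:
                deltas[(node_id, name)] = now - then
    return deltas
```

An empty result means no new trips in the interval; any entry is worth correlating with the slow log for the same window.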
Fixes
Leading wildcards and regex
Avoid *foo patterns, especially on analyzed text fields. They force a full term dictionary scan on every shard. If pattern matching is required, shift cost to ingest time:
- Index the field with an ngram or edge_ngram analyzer so the query becomes a cheap term lookup.
- Store a reversed copy of the field and rewrite suffix wildcards as prefix queries.
- Prefer simple prefix queries when the fixed part of the pattern is at the start (foo*).
- Consider the wildcard field type for heavy wildcard workloads. It speeds up wildcard queries at the cost of increased index size and ingest overhead.
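The reversed-copy approach can be sketched as a small query rewrite in the application. This Python example assumes a hypothetical field (here filename_reversed) that is populated at ingest with the character-reversed value; that field name and the helper are illustrative, not part of any Elasticsearch API. It also ignores analysis and case normalization, which must match whatever the ingest side does.

```python
def suffix_wildcard_to_prefix(pattern, reversed_field):
    """Rewrite a pure suffix pattern such as '*.log' into a prefix query
    against a field that stores the reversed value."""
    if not pattern.startswith("*"):
        raise ValueError("pattern must start with '*'")
    suffix = pattern[1:]
    if not suffix or any(ch in suffix for ch in "*?"):
        raise ValueError("only a single leading '*' is supported")
    # '*.log' matches values ending in '.log', which is the same as the
    # reversed value starting with 'gol.'.
    return {"prefix": {reversed_field: {"value": suffix[::-1]}}}


query = suffix_wildcard_to_prefix("*.log", "filename_reversed")
# query == {"prefix": {"filename_reversed": {"value": "gol."}}}
```

Patterns with wildcards on both sides cannot be rewritten this way; those are the cases where ngram indexing or the wildcard field type is the better fit.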
Deep pagination
Do not page deeply with from+size. Elasticsearch rejects requests beyond index.max_result_window (default 10,000). The coordinating node must build a priority queue across all shards for every page; cost grows linearly with from.
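To make the cost concrete: each shard has to collect its top from + size hits, and the coordinating node may receive that many entries from every shard before discarding all but size of them. A small Python sketch of the arithmetic (the function name is hypothetical, and the figures are upper bounds, since a shard with fewer matches returns fewer entries):

```python
def pagination_window_cost(from_, size, shards):
    """Upper bound on entries collected for one from+size page."""
    per_shard = from_ + size
    return {
        "per_shard_entries": per_shard,
        "coordinator_entries": per_shard * shards,
        "returned_hits": size,
    }


cost = pagination_window_cost(from_=9990, size=10, shards=20)
# cost["per_shard_entries"] == 10000
# cost["coordinator_entries"] == 200000, all to return 10 hits
```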
For deep pagination, use search_after with a Point in Time (PIT). Open a PIT:
curl -X POST 'localhost:9200/<index>/_pit?keep_alive=1m'
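Then pass the PIT id and the last hit's sort values in search_after. Searches that use a PIT get an implicit _shard_doc tiebreaker, which keeps ordering deterministic when sort values collide; listing it explicitly makes the expected shape of search_after obvious. Do not use _id as the tiebreaker, because the _id field is restricted from use in sorting:
curl -X GET 'localhost:9200/_search' -H 'Content-Type: application/json' -d'
{
"pit": { "id": "<pit_id>", "keep_alive": "1m" },
"sort": [ { "@timestamp": "asc" }, { "_shard_doc": "asc" } ],
"search_after": [ 1234567890000, 4294967298 ]
}'
The PIT preserves a consistent view of the index across pages. A response can return an updated PIT id; use the most recent one for the next request. Delete the PIT explicitly when the client is done: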
curl -X DELETE 'localhost:9200/_pit' -H 'Content-Type: application/json' -d'
{ "id": "<pit_id>" }'
Warning: Abandoned PITs hold segments open and leak heap. Avoid the Scroll API for new workloads.
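The paging loop itself is short. Below is a minimal Python sketch that stays independent of any HTTP client: search is an assumed callable that sends a body to GET /_search and returns the parsed JSON response. Opening and deleting the PIT is left to the caller, ideally in a try/finally so the PIT is removed even on errors.

```python
def iterate_with_search_after(search, pit_id, sort, page_size=1000, keep_alive="1m"):
    """Yield every hit of a PIT search, one page at a time.

    `search` is a caller-supplied function: request body in, parsed
    response dict out. It is an assumption of this sketch, not a library API.
    """
    search_after = None
    while True:
        body = {
            "size": page_size,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
            "sort": sort,
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The PIT id can change between responses; carry the latest forward.
        pit_id = response.get("pit_id", pit_id)
        # The next page starts after the sort values of the last hit.
        search_after = hits[-1]["sort"]
```

Because it is a generator, the caller can stop early without fetching the remaining pages.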
Scripts
Painless executes per-document and cannot use the inverted index. Accessing _source inside a script is slower than reading doc_values; prefer doc where possible. Better yet, move logic out of the query entirely:
- Use keyword fields for sorting instead of script-based sorts.
- Use copy_to for multi-field search instead of runtime concatenation.
- Prefer index-time enrichment over runtime scripts.
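Scripts are compiled and cached per unique source string, so a script with literal values baked into its source is compiled again for every distinct value. Pass variable values through params to keep the source constant. Monitor /_nodes/stats/script for high compilations or cache_evictions. If the compilation rate approaches script.max_compilations_rate, the cluster is compiling scripts too often. Stored scripts (/_scripts) also reduce compilation overhead by giving every caller the same source.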
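Because the cache is keyed on the source string, the way the application builds the script matters. The Python sketch below builds a script filter whose source never changes, with the variable part passed in params. The price field and the function name are hypothetical, and the script assumes every document has a numeric price value. A filter this simple would be better served by a range query; the point here is only the parameter handling.

```python
def price_above_filter(threshold):
    """Build a script query with a constant source and a variable param."""
    return {
        "script": {
            "script": {
                "lang": "painless",
                # Constant source: compiled once, then served from the cache.
                "source": "doc['price'].value > params.threshold",
                "params": {"threshold": threshold},
            }
        }
    }


a = price_above_filter(10)
b = price_above_filter(250)
# Same source string for both, so only one compilation is needed.
```

The anti-pattern is string formatting the threshold into source, which produces a new cache entry for every distinct value.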
Aggregations and guardrails
Reduce the size parameter on terms aggregations. Do not fetch all buckets at once. Paginate through buckets with a composite aggregation, passing the after key from the previous response. Do not raise search.max_buckets indefinitely to hide abusive queries. Treat every too_many_buckets_exception as a production incident.
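Paging with a composite aggregation follows the same pattern as search_after: send the after_key from the previous response as after in the next request. A minimal Python sketch, again assuming a caller-supplied search callable that takes a request body and returns the parsed response; the aggregation and source names are arbitrary.

```python
def iterate_composite_buckets(search, field, page_size=1000, agg_name="by_value"):
    """Yield all buckets of a single-source composite terms aggregation."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"value": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after
        response = search({"size": 0, "aggs": {agg_name: {"composite": composite}}})
        result = response["aggregations"][agg_name]
        yield from result["buckets"]
        # after_key is absent once there are no more buckets to return.
        after = result.get("after_key")
        if after is None or not result["buckets"]:
            return
```

Each request holds at most page_size buckets in memory, which keeps the response well under search.max_buckets regardless of field cardinality.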
The indices.query.bool.max_clause_count node setting is deprecated and has no effect in Elasticsearch 8.x. Elasticsearch sizes the clause limit dynamically from the JVM heap size and the size of the search thread pool.
Prevention
- Set index.search.slowlog.threshold.query.warn to a duration that captures genuinely abusive queries (for example, 10s). Apply it through an index template so new indices inherit the threshold. Ensure the slow log appender is writing to disk.
- Review application query generation. Ban leading wildcards and unbounded regex at the application layer or API gateway if possible. Reject from values above a safe threshold before they reach the cluster.
- Use search.max_buckets as a guardrail, not a target. Investigate every violation.
- For clusters running heavy analytics, provision dedicated coordinating nodes and monitor their heap and CPU separately from data nodes.
- Delete abandoned PITs and scroll contexts promptly.
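An application-layer guard along these lines can be quite small. The Python sketch below inspects a search request body before it is sent and reports leading wildcards, leading .* or .+ regex, and oversized from + size windows. The function names and the 10,000 cap are assumptions; it only looks at wildcard and regexp clauses, so patterns inside query_string queries are not covered.

```python
MAX_WINDOW = 10_000  # hypothetical application-level cap on from + size


def _has_leading_wildcard(kind, pattern):
    if kind == "wildcard":
        return pattern.startswith(("*", "?"))
    return pattern.startswith((".*", ".+"))  # regexp


def find_query_violations(body, max_window=MAX_WINDOW):
    """Return a list of human-readable problems found in a search body."""
    violations = []
    # Elasticsearch defaults: from=0, size=10.
    window = body.get("from", 0) + body.get("size", 10)
    if window > max_window:
        violations.append(f"from + size = {window} exceeds {max_window}")

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in ("wildcard", "regexp") and isinstance(value, dict):
                    for field, spec in value.items():
                        # Accept both {"field": {"value": "..."}} and {"field": "..."}.
                        pattern = spec.get("value", "") if isinstance(spec, dict) else spec
                        if isinstance(pattern, str) and _has_leading_wildcard(key, pattern):
                            violations.append(f"leading {key} pattern on {field}: {pattern}")
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(body.get("query", {}))
    return violations
```

Whether to reject the request or only log the violation is a policy choice; logging first makes it easier to find the callers that need to change.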
How Netdata helps
- Correlate per-node search latency charts with search thread pool queued operations to confirm query-phase saturation from expensive queries.
- Watch JVM heap used percent alongside per-node CPU to distinguish heap-bound aggregations from CPU-bound term dictionary scans.
- Alert on search thread pool rejections to catch expensive queries before they cascade into cluster-wide failures.
- Use per-node CPU asymmetry charts to spot coordinating-node overload or hot-sharded indices.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch authentication failures: audit logs, brute force, and credential drift
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch exposed without authentication: open clusters and snapshot exfiltration







