Elasticsearch expensive queries: leading wildcards, regex, deep pagination, and scripts

Search latency spikes and data-node CPU saturation usually trace to one of four query patterns: leading wildcards or unbounded regex, deep from+size pagination, runtime Painless scripts, and deep aggregations. Elasticsearch uses a scatter-gather read path: the coordinating node broadcasts every query to one copy of every target shard. A single expensive query fans out, and the slowest shard sets overall latency. Coordinating-node heap pressure rises when it merges large intermediate result sets during the fetch phase or aggregation reduce phase.

What this means

Each shard executes the query against its local Lucene index and returns document IDs plus sort values. The coordinating node merges those results and requests the actual _source in the fetch phase.

Leading wildcards and regex prevent Lucene from seeking directly into the sorted term dictionary. A pattern starting with * or .* has no fixed prefix to seek to, so every target shard must scan the field's entire term dictionary. Deep pagination with from+size forces every shard to collect and sort a window of from + size hits, all of which the coordinating node must hold in heap while merging. Runtime scripts execute per document and cannot use index structures; the Painless engine is invoked once for every matching document. Deep aggregations can generate enough buckets to trip the request circuit breaker before the coordinating node can reduce them.
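
To make the deep pagination cost concrete, here is a minimal sketch of the arithmetic. The function name and the shard count are assumptions for illustration, and the result is an upper bound, since a shard with fewer matches returns fewer entries.

```python
def merge_window_entries(num_shards: int, from_: int, size: int) -> int:
    """Upper bound on the (doc ID, sort value) entries the coordinating node merges.

    Each shard returns up to from_ + size entries, and the coordinating
    node has to hold all of them while it picks the requested page.
    """
    if num_shards < 1 or from_ < 0 or size < 0:
        raise ValueError("num_shards must be >= 1; from_ and size must be >= 0")
    return num_shards * (from_ + size)


# Hypothetical 20-shard index, 10 hits per page.
first_page = merge_window_entries(20, 0, 10)     # 200 entries
deep_page = merge_window_entries(20, 9990, 10)   # 200000 entries
```

The same ten hits cost a thousand times more merge work at page 1000 than at page 1, which is why the heap signal shows up on the coordinating node.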

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Leading wildcard or regex | CPU spikes on data nodes; search latency rises without thread pool queue growth | Slow log for wildcard or regexp queries; /_nodes/hot_threads |
| Deep from+size pagination | Coordinating-node heap spikes; request circuit breaker trips; errors beyond index.max_result_window | /_tasks for large from values; task descriptions |
| Painless scripts | High process CPU with low search throughput; script compilation cache pressure | Slow log for script queries; /_nodes/stats/script |
| Deep aggregations | request or parent breaker trips; too_many_buckets_exception; heap grows during merge | Query body for high terms.size or nested sub-aggs; /_nodes/stats/breaker |

Quick checks

# List running search tasks to spot long-running queries
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

# Sample hot threads to see what is consuming CPU
curl -s 'http://localhost:9200/_nodes/hot_threads'

# Check search thread pool saturation
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'

# Check request and parent circuit breakers
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers.request,nodes.*.breakers.parent'

# Sample JVM heap and old GC behavior
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old'

# Check script compilation and cache pressure
curl -s 'http://localhost:9200/_nodes/stats/script?filter_path=nodes.*.script'

How to diagnose it

  1. Confirm resource impact. Use GET /_cat/nodes?v&h=name,cpu,heap.percent,ram.percent,disk.used_percent to spot data nodes with high CPU or heap. Cross-check with GET /_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected for query-phase saturation. Queue growth without rejections points to slow queries; rejections point to outright overload.

  2. Capture in-flight queries with GET /_tasks?detailed=true&actions=*search*. Look for:

    • Long-running searches with high running_time_in_nanos.
    • Large from values in the task description.
    • Script mentions in the task description or query source.
    • Aggregation contexts with high size parameters.

  3. Sample hot threads with GET /_nodes/hot_threads. CPU-heavy regex or wildcard work appears in MultiTermQuery or automaton construction stacks. Script work shows Painless execution frames referencing org.elasticsearch.painless. Deep pagination and aggregation merge time often appears in priority-queue or collection classes on the coordinating node.

  4. Enable or lower the slow log threshold temporarily to tie a query string to a latency spike. Set it per index:

    curl -X PUT 'localhost:9200/<index>/_settings' -H 'Content-Type: application/json' -d'
    {
      "index.search.slowlog.threshold.query.warn": "10s"
    }'
    

    Then tail the slow log on the data nodes. Revert the threshold when finished.

  5. Check for aggregation bucket explosions. If responses fail with too_many_buckets_exception, the query exceeded search.max_buckets (default 65,536). Check the current limit with:

    curl -s 'localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.max_buckets'
    

    Do not raise the limit to hide the symptom. Reduce aggregation scope, add a filter query to narrow the dataset, or paginate through buckets with a composite aggregation.

  6. Map the query pattern to the resource signal. Heap spikes concentrated on the coordinating node point to deep pagination or large aggregations. CPU spikes spread across data nodes point to wildcards, regex, or scripts. Circuit breaker trips on request or parent confirm memory-heavy merge operations.
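
Step 2 above can be partly automated. The following is a minimal sketch that filters an already-parsed /_tasks?detailed=true response for long-running searches. The function name and the 5-second default are assumptions; the field names (nodes, tasks, running_time_in_nanos, description) follow the task management API response.

```python
def slow_search_tasks(tasks_response: dict, min_seconds: float = 5.0) -> list:
    """Return (task_id, seconds, description) tuples, slowest first."""
    slow = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds >= min_seconds:
                slow.append((task_id, seconds, task.get("description", "")))
    # Slowest tasks first, so the likely culprit is at the top.
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hand-written sample in the shape of a /_tasks response (not real output).
sample = {
    "nodes": {
        "node-a": {
            "tasks": {
                "node-a:101": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 42_000_000_000,
                    "description": "indices[logs-*], source[{\"from\":9990}]",
                },
                "node-a:102": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 3_000_000,
                    "description": "indices[logs-*], source[{}]",
                },
            }
        }
    }
}
slowest = slow_search_tasks(sample)
```

The description string of each returned task is where large from values, script sources, and aggregation sizes become visible.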

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Search latency (query and fetch phases) | Expensive queries slow every shard | Sustained >5x baseline |
| Search thread pool queue / rejected | Saturation from expensive work | Queue >100 sustained or rejections >0/min |
| JVM heap used percent | Scripts and aggregations consume heap | Sustained >85% or post-GC floor rising |
| Circuit breaker trips (request, parent) | Queries allocating too much memory | Any delta >0 on request or parent |
| Slow log entries | Direct evidence of slow queries | Entries appearing at the warn threshold |
| CPU utilization per node | Regex, wildcards, and scripts burn CPU | Sustained >80% on data nodes |
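
The circuit breaker signal is a counter, so the useful number is the change between two samples. As a rough sketch, assuming two parsed /_nodes/stats/breaker responses taken some interval apart (the function name is hypothetical; the tripped field is part of the breaker stats):

```python
def breaker_trip_deltas(previous: dict, current: dict,
                        breakers=("request", "parent")) -> dict:
    """Map (node_id, breaker_name) to the number of new trips since `previous`."""
    deltas = {}
    for node_id, node in current.get("nodes", {}).items():
        before = previous.get("nodes", {}).get(node_id, {}).get("breakers", {})
        for name in breakers:
            now = node.get("breakers", {}).get(name, {}).get("tripped", 0)
            then = before.get(name, {}).get("tripped", 0)
            # Counters reset when a node restarts; a lower value is ignored here.
            if now > then:
                deltas[(node_id, name)] = now - then
    return deltas


# Hand-written samples in the shape of the stats response (not real output).
earlier = {"nodes": {"n1": {"breakers": {"request": {"tripped": 2},
                                         "parent": {"tripped": 0}}}}}
later = {"nodes": {"n1": {"breakers": {"request": {"tripped": 5},
                                       "parent": {"tripped": 0}}}}}
new_trips = breaker_trip_deltas(earlier, later)
```

Any non-empty result matches the warning sign in the table and is worth correlating with the slow log for the same interval.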

Fixes

Leading wildcards and regex

Avoid *foo patterns, especially on analyzed text fields. They force a full term dictionary scan on every shard. If pattern matching is required, shift cost to ingest time:

  • Index the field with an ngram or edge_ngram analyzer so the query becomes a cheap term lookup.
  • Store a reversed copy of the field and rewrite suffix wildcards as prefix queries.
  • Prefer a simple prefix query when the fixed part of the pattern comes first (foo*).
  • Consider the wildcard field type for heavy wildcard workloads. It speeds up wildcard queries at the cost of increased index size and ingest overhead.
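
The reversed-copy approach can be sketched as a small query rewrite. This assumes a hypothetical subfield named <field>.reversed that was indexed through a reverse token filter, and it assumes the query string already matches whatever normalization (such as lowercasing) the index applies. The function name is illustrative only.

```python
def suffix_wildcard_to_prefix(field: str, pattern: str,
                              reversed_suffix: str = ".reversed") -> dict:
    """Rewrite a '*suffix' wildcard into a prefix query on a reversed subfield."""
    if not pattern.startswith("*") or len(pattern) < 2:
        raise ValueError("expected a pattern of the form *suffix")
    suffix = pattern[1:]
    if any(ch in suffix for ch in "*?"):
        raise ValueError("only a single leading * is supported in this sketch")
    # Reversing the suffix turns an expensive suffix match into a cheap prefix seek.
    return {"prefix": {field + reversed_suffix: {"value": suffix[::-1]}}}


query = suffix_wildcard_to_prefix("host", "*.example.com")
# {'prefix': {'host.reversed': {'value': 'moc.elpmaxe.'}}}
```

The rewrite only pays off if the reversed subfield exists in the mapping before the data is indexed; existing documents would need a reindex.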

Deep pagination

Do not page deeply with from+size. Elasticsearch rejects requests where from + size exceeds index.max_result_window (default 10,000). For every page, each shard must build a priority queue of from + size entries, and the coordinating node must merge the entries from all shards; cost grows linearly with from and with the number of shards.

For deep pagination, use search_after with a Point in Time (PIT). Open a PIT:

curl -X POST 'localhost:9200/<index>/_pit?keep_alive=1m'

Then pass the PIT id and the last hit's sort values in search_after. Every PIT search gets an implicit _shard_doc tiebreaker appended to the sort, so ordering stays deterministic when sort values collide; it can also be listed explicitly, as here. Avoid sorting on _id, which Elasticsearch restricts from use in sorting:

curl -X GET 'localhost:9200/_search' -H 'Content-Type: application/json' -d'
{
  "pit": { "id": "<pit_id>", "keep_alive": "1m" },
  "sort": [ { "@timestamp": "asc" }, { "_shard_doc": "asc" } ],
  "search_after": [ 1234567890000, 4294967298 ]
}'

The PIT preserves a consistent view of the index across pages. Delete the PIT explicitly when the client is done:

curl -X DELETE 'localhost:9200/_pit' -H 'Content-Type: application/json' -d'
{ "id": "<pit_id>" }'

Warning: Abandoned PITs keep old segments from being deleted, which holds disk space, file handles, and heap until keep_alive expires. Avoid the Scroll API for new workloads.
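
The paging loop itself can be sketched as follows. The search argument is a hypothetical callable that sends a request body to /_search and returns the parsed JSON response; it stands in for whatever HTTP client the application uses. The loop relies only on documented response fields: hits.hits, each hit's sort array, and the top-level pit_id.

```python
from typing import Callable, Iterator


def iter_hits(search: Callable[[dict], dict], pit_id: str, sort: list,
              page_size: int = 1000, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit of a PIT search, one search_after page at a time."""
    search_after = None
    while True:
        body = {
            "size": page_size,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
            "sort": sort,
            "track_total_hits": False,  # skip counting; only the hits are needed
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return  # an empty page means the result set is exhausted
        yield from hits
        # Resume after the last hit; the PIT id can change between requests.
        search_after = hits[-1]["sort"]
        pit_id = response.get("pit_id", pit_id)
```

Each request costs the shards only page_size entries regardless of how far the client has paged, which is the property from+size lacks. The caller is still responsible for opening and deleting the PIT.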

Scripts

Painless executes per-document and cannot use the inverted index. Accessing _source inside a script is slower than reading doc_values; prefer doc where possible. Better yet, move logic out of the query entirely:

  • Use keyword fields for sorting instead of script-based sorts.
  • Use copy_to for multi-field search instead of runtime concatenation.
  • Prefer index-time enrichment over runtime scripts.

Scripts are compiled and cached per unique source string. Monitor /_nodes/stats/script for high compilations or cache_evictions. If the compilation rate approaches script.max_compilations_rate, the cluster is compiling scripts too often. Keep the source string constant by passing changing values through params, and use stored scripts (/_scripts) to reduce compilation overhead.
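
Because the cache is keyed on the source string, how the application builds its script objects decides how often compilation happens. A small sketch of the difference, using a hypothetical price field and helper names chosen for illustration:

```python
def inlined_script(factor: float) -> dict:
    # Anti-pattern: the value is baked into the source, so every distinct
    # factor produces a new source string and a new compilation.
    return {"lang": "painless", "source": f"doc['price'].value * {factor}"}


def parameterized_script(factor: float) -> dict:
    # The source stays constant; only params changes between requests,
    # so the compiled script can be reused from the cache.
    return {
        "lang": "painless",
        "source": "doc['price'].value * params.factor",
        "params": {"factor": factor},
    }


factors = [0.9, 0.8, 0.75]
inlined_sources = {inlined_script(f)["source"] for f in factors}
shared_sources = {parameterized_script(f)["source"] for f in factors}
# Three distinct sources for the inlined form, one for the parameterized form.
```

A stored script takes the same idea one step further: the request references the script by id and sends only params.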

Aggregations and guardrails

Reduce the size parameter on terms aggregations. Do not fetch all buckets at once. Paginate through buckets with a composite aggregation, passing the after key from the previous response. Do not raise search.max_buckets indefinitely to hide abusive queries. Treat every too_many_buckets_exception as a production incident.
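
Paginating with a composite aggregation follows the same pattern as search_after: request a bounded page, then resume from the returned after_key. A minimal sketch, where search is again a hypothetical callable that posts a body to /_search and returns parsed JSON, and the aggregation name is arbitrary:

```python
from typing import Callable, Iterator


def iter_composite_buckets(search: Callable[[dict], dict], sources: list,
                           agg_name: str = "buckets_page",
                           page_size: int = 1000) -> Iterator[dict]:
    """Yield all buckets of a composite aggregation, page_size at a time."""
    after_key = None
    while True:
        composite = {"size": page_size, "sources": sources}
        if after_key is not None:
            composite["after"] = after_key
        # size 0: only the aggregation result is needed, not the hits.
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        agg = search(body)["aggregations"][agg_name]
        buckets = agg.get("buckets", [])
        if not buckets:
            return
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            return  # no key to resume from, so this was the last page
```

A caller would pass sources such as [{"host": {"terms": {"field": "host.keyword"}}}], where host.keyword is an assumed field name. No single response ever carries more than page_size buckets, so the request stays far below search.max_buckets.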

The indices.query.bool.max_clause_count node setting is deprecated and has no effect in Elasticsearch 8.x. Elasticsearch now sizes the clause limit dynamically from the search thread pool size and the heap allocated to the JVM.

Prevention

  • Set index.search.slowlog.threshold.query.warn to a duration that captures genuinely abusive queries (for example, 10s). Apply it through an index template so new indices inherit the threshold. Ensure the slow log appender is writing to disk.
  • Review application query generation. Ban leading wildcards and unbounded regex at the application layer or API gateway if possible. Reject from values above a safe threshold before they reach the cluster.
  • Use search.max_buckets as a guardrail, not a target. Investigate every violation.
  • For clusters running heavy analytics, provision dedicated coordinating nodes and monitor their heap and CPU separately from data nodes.
  • Delete abandoned PITs and scroll contexts promptly.
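
An application-layer or gateway check for the query rules above could look roughly like this. It is a heuristic sketch under stated assumptions, not a complete policy: the function name, the thresholds, and the list of leading patterns are illustrative, and wildcards embedded in query_string or simple_query_string text are not inspected.

```python
WILDCARD_LEADERS = ("*", "?")
REGEXP_LEADERS = (".*", ".+")


def find_violations(body: dict, max_window: int = 10_000) -> list:
    """Return reasons to reject a search body; an empty list means it passes."""
    violations = []
    window = body.get("from", 0) + body.get("size", 10)  # 10 is the default size
    if window > max_window:
        violations.append(f"from + size = {window} exceeds {max_window}")

    def check_patterns(kind, fields):
        leaders = WILDCARD_LEADERS if kind == "wildcard" else REGEXP_LEADERS
        for field, spec in fields.items():
            # Accept both {"field": "pat"} and {"field": {"value": "pat"}}.
            if isinstance(spec, dict):
                pattern = spec.get("value", spec.get("wildcard", ""))
            else:
                pattern = spec
            if isinstance(pattern, str) and pattern.startswith(leaders):
                violations.append(f"leading {kind} pattern on {field}: {pattern}")

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in ("wildcard", "regexp") and isinstance(value, dict):
                    check_patterns(key, value)
                else:
                    walk(value)  # descend into bool, nested, and similar clauses
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(body.get("query", {}))
    return violations
```

Rejecting the request with the list of reasons gives the calling team something actionable, and it keeps the expensive pattern from ever reaching a data node.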

How Netdata helps

  • Correlate per-node search latency charts with search thread pool queued operations to confirm query-phase saturation from expensive queries.
  • Watch JVM heap used percent alongside per-node CPU to distinguish heap-bound aggregations from CPU-bound term dictionary scans.
  • Alert on search thread pool rejections to catch expensive queries before they cascade into cluster-wide failures.
  • Use per-node CPU asymmetry charts to spot coordinating-node overload or hot-sharded indices.