Elasticsearch coordinating node overload: aggregation merge, heap spikes, and 429s

HTTP 429 or 503 responses appear on search requests while data nodes look healthy. Heap spikes on one node while others stay flat. The slow log shows heavy aggregation queries. That node is the coordinator, and it is running out of heap during the reduce phase.

Every node can act as a coordinating node. For each search, the coordinator broadcasts the query to relevant shards, collects partial results, and merges them. Aggregations compute locally per shard and reduce in memory on the coordinator. High-cardinality terms aggregations, deep pagination, or large fetch sizes force the coordinator to hold massive intermediate structures in heap. If the estimate exceeds the circuit breaker limit, the request is rejected. If the breaker is too slow, the node may suffer long GC pauses or disconnect from the cluster.

The key symptom is asymmetry: the coordinator shows high heap, breaker trips, and search thread pool queues, while data nodes handling the same query show normal heap and CPU.

What this means

Elasticsearch search follows a scatter-gather model. The coordinating node sends the query phase to every target shard. Each shard executes locally, builds partial aggregation states, and returns document IDs with sort values. The coordinator reduces these partial results into a single response. During the reduce phase, aggregation buckets from every shard are materialized simultaneously in heap. The slowest shard determines overall query latency, and the coordinator must hold all intermediate state until the last shard responds.

High-cardinality terms aggregations are the most common trigger. A terms aggregation on a field with millions of unique values forces each shard to return a large bucket list. Nested sub-aggregations multiply memory cost. Deep pagination and large fetch sizes add document sources on top of aggregation structures. If the estimated memory exceeds the request circuit breaker limit (default 60% of JVM heap), Elasticsearch rejects the query with HTTP 429. If the parent circuit breaker trips (default 95% of heap with real memory tracking), the node is protecting itself from OOM. Repeated trips on the coordinator while data nodes remain calm point directly to reduce-phase pressure.

flowchart LR
    Client -->|1. Search| Coord[Coordinating Node]
    Coord -->|2. Query phase| S1[Shard 1]
    Coord -->|2. Query phase| S2[Shard 2]
    Coord -->|2. Query phase| SN[Shard N]
    S1 -->|3. Partial results| Coord
    S2 -->|3. Partial results| Coord
    SN -->|3. Partial results| Coord
    Coord -->|4. Response| Client

Common causes

CauseWhat it looks likeFirst thing to check
High-cardinality terms aggregation with many shardsrequest or parent breaker trips on the coordinator; slow log shows terms aggs on fields with millions of unique values; data nodes show normal heapSlow log for aggregation type and target field
Deep pagination with large size or fromHeap spikes correlate with large size parameters; fetch latency is elevated while query latency is normalQuery parameters for from and size
Large fetch sizes combined with aggregationsCompound memory pressure from holding both aggregation buckets and full document sources at the coordinatorRequest payload for size and aggregation depth
Too many shards per queryCoordinator merges results from hundreds of shards; fixed overhead per shard dominatesShard count for target indices via _cat/shards
Permissive breaker limits masking pressureBreaker trips are rare but the coordinator experiences long GC pauses or OOMsBreaker estimated_size_in_bytes vs limit_size_in_bytes

Quick checks

# Compare heap asymmetry across nodes to spot the coordinator under pressure
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,cpu,load_1m'

# Inspect circuit breaker trips and current estimated sizes
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'

# Check search thread pool queues and rejections on the coordinator
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'

# List active search tasks to identify heavy queries
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

# Verify JVM heap and GC behavior per node
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc'

# Count shards for the index being queried to assess fan-out
curl -s 'http://localhost:9200/_cat/shards/<index>?v&h=index,shard,prirep,state'

# Sample hot threads to see what is consuming CPU on the coordinator
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10'

How to diagnose it

  1. Confirm asymmetry. Run _cat/nodes and sort by heap.percent. One node significantly higher than others while serving the same traffic is likely the coordinator absorbing merge overhead.
  2. Identify the breaker. Run _nodes/stats/breaker and compare tripped counters. Look for request or parent breakers incrementing on the coordinator. Compare estimated_size_in_bytes to limit_size_in_bytes.
  3. Correlate with queries. Run _tasks?detailed=true&actions=*search* to see active searches. Look for large size, deep from, or heavy aggregations. Cross-reference with the slow log.
  4. Check shard fan-out. Use _cat/shards to see how many shards the query hits. Many small indices or excessive shards multiply merge overhead.
  5. Verify GC behavior. Use _nodes/stats/jvm to check for frequent old-generation GC or long pauses on the coordinator while data nodes are stable.
  6. Check search queues. Use _cat/thread_pool/search to confirm queue depth grows on the coordinator before rejections begin.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
heap.percent on coordinating nodesHeap spikes are the primary symptom of merge pressureSustained >75% while data nodes are <50%
request breaker trippedIndicates single-request aggregation structures are too largeAny delta > 0 during normal traffic
parent breaker trippedLast line of defense; node is near OOMAny delta > 0 on coordinator
Old GC collection timeLong old GC pauses cause node removalPauses >10s correlated with heap spikes
thread_pool.search.queuePrecursor to rejection and latencySustained queue growth on coordinator
thread_pool.search.rejectedDirect measure of failed searchesNonzero rate sustained >5 minutes
Query latency vs. end-to-end latencyDistinguishes shard slowness from coordinator merge costShard query latency normal but total latency high
Shard count per queryMerge overhead scales with shard countQueries hitting >100 shards regularly

Fixes

Reduce aggregation cardinality

Rewrite queries to avoid high-cardinality terms aggregations where possible. Pre-filter with a selective query clause, or use sampler and diversified_sampler aggregations to reduce the bucket count returned to the coordinator.

Limit fetch size and pagination depth

Lower the size parameter. Avoid deep from + size pagination for large result sets; use search_after instead.

Reduce shards per query

Target fewer indices or use the shrink API to reduce shard count for read-only indices. Avoid broad wildcard patterns that fan out to hundreds of shards.

Add dedicated coordinating nodes

Deploy nodes with no data, master, or ingest roles to handle client requests. This isolates merge heap pressure from data nodes. Size heap generously, staying under the compressed OOPs threshold (usually 30 GB or less).

Cancel abusive queries

Find long-running searches via _tasks and cancel them. This terminates the query for the client but protects the cluster:

# Cancel a specific search task (disruptive: terminates the query)
curl -X POST 'http://localhost:9200/_tasks/<task_id>/_cancel'

Block reallocation during recovery

If a coordinating node drops out due to GC and triggers a reallocation storm, pause allocation until the root query is fixed. This prevents shard recovery:

# Temporarily disable shard reallocation (disruptive: prevents recovery)
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":"none"}}'

Re-enable after fixing the query.

Prevention

  • Size coordinating capacity independently. Deploy dedicated coordinating-only nodes for heavy analytics, and monitor their heap, thread pools, and circuit breakers separately from data nodes.
  • Set slow log thresholds. Configure index.search.slowlog.threshold.query.warn to catch expensive aggregations before they trip breakers.
  • Limit shard fan-out. Design index patterns so routine queries hit a manageable number of shards. Use time-based aliases or data streams that exclude cold indices from active searches.
  • Monitor the heap floor. Track the post-old-GC heap minimum on coordinating nodes. A rising floor predicts merge pressure before breakers trip.
  • Stage-test aggregation queries. Run high-cardinality aggregations and deep pagination against representative data and shard counts before releasing to production.

How Netdata helps

  • Real-time JVM heap per node, surfacing coordinator asymmetry without manual _cat/nodes cross-referencing.
  • search thread pool queue depth and rejections per node, highlighting the bottleneck before clients see HTTP 429.
  • Circuit breaker tripped counters and estimated vs. limit sizes from _nodes/stats/breaker.
  • Old-generation GC pause duration alongside search latency to show when merge pressure causes stop-the-world pauses.
  • Alerts on sustained heap usage above baseline per node, distinguishing the overloaded coordinator from healthy data nodes.