$ guides / elasticsearch / elasticsearch-coordinating-node-overload ▌

Operations Guides

Elasticsearch coordinating node overload: aggregation merge, heap spikes, and 429s

HTTP 429 or 503 responses appear on search requests while data nodes look healthy. Heap spikes on one node while others stay flat. The slow log shows heavy aggregation queries. That node is the coordinator, and it is running out of heap during the reduce phase.

Every node can act as a coordinating node. For each search, the coordinator broadcasts the query to relevant shards, collects partial results, and merges them. Aggregations compute locally per shard and reduce in memory on the coordinator. High-cardinality terms aggregations, deep pagination, or large fetch sizes force the coordinator to hold massive intermediate structures in heap. If the estimate exceeds the circuit breaker limit, the request is rejected. If the breaker is too slow, the node may suffer long GC pauses or disconnect from the cluster.

The key symptom is asymmetry: the coordinator shows high heap, breaker trips, and search thread pool queues, while data nodes handling the same query show normal heap and CPU.

What this means

Elasticsearch search follows a scatter-gather model. The coordinating node sends the query phase to every target shard. Each shard executes locally, builds partial aggregation states, and returns document IDs with sort values. The coordinator reduces these partial results into a single response. During the reduce phase, aggregation buckets from every shard are materialized simultaneously in heap. The slowest shard determines overall query latency, and the coordinator must hold all intermediate state until the last shard responds.

High-cardinality terms aggregations are the most common trigger. A terms aggregation on a field with millions of unique values forces each shard to return a large bucket list. Nested sub-aggregations multiply memory cost. Deep pagination and large fetch sizes add document sources on top of aggregation structures. If the estimated memory exceeds the request circuit breaker limit (default 60% of JVM heap), Elasticsearch rejects the query with HTTP 429. If the parent circuit breaker trips (default 95% of heap with real memory tracking), the node is protecting itself from OOM. Repeated trips on the coordinator while data nodes remain calm point directly to reduce-phase pressure.

flowchart LR
    Client -->|1. Search| Coord[Coordinating Node]
    Coord -->|2. Query phase| S1[Shard 1]
    Coord -->|2. Query phase| S2[Shard 2]
    Coord -->|2. Query phase| SN[Shard N]
    S1 -->|3. Partial results| Coord
    S2 -->|3. Partial results| Coord
    SN -->|3. Partial results| Coord
    Coord -->|4. Response| Client

Common causes

Cause	What it looks like	First thing to check
High-cardinality terms aggregation with many shards	`request` or `parent` breaker trips on the coordinator; slow log shows `terms` aggs on fields with millions of unique values; data nodes show normal heap	Slow log for aggregation type and target field
Deep pagination with large `size` or `from`	Heap spikes correlate with large `size` parameters; fetch latency is elevated while query latency is normal	Query parameters for `from` and `size`
Large fetch sizes combined with aggregations	Compound memory pressure from holding both aggregation buckets and full document sources at the coordinator	Request payload for `size` and aggregation depth
Too many shards per query	Coordinator merges results from hundreds of shards; fixed overhead per shard dominates	Shard count for target indices via `_cat/shards`
Permissive breaker limits masking pressure	Breaker trips are rare but the coordinator experiences long GC pauses or OOMs	Breaker `estimated_size_in_bytes` vs `limit_size_in_bytes`

Quick checks

# Compare heap asymmetry across nodes to spot the coordinator under pressure
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,cpu,load_1m'

# Inspect circuit breaker trips and current estimated sizes
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'

# Check search thread pool queues and rejections on the coordinator
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'

# List active search tasks to identify heavy queries
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

# Verify JVM heap and GC behavior per node
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc'

# Count shards for the index being queried to assess fan-out
curl -s 'http://localhost:9200/_cat/shards/<index>?v&h=index,shard,prirep,state'

# Sample hot threads to see what is consuming CPU on the coordinator
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10'

How to diagnose it

Confirm asymmetry. Run _cat/nodes and sort by heap.percent. One node significantly higher than others while serving the same traffic is likely the coordinator absorbing merge overhead.
Identify the breaker. Run _nodes/stats/breaker and compare tripped counters. Look for request or parent breakers incrementing on the coordinator. Compare estimated_size_in_bytes to limit_size_in_bytes.
Correlate with queries. Run _tasks?detailed=true&actions=*search* to see active searches. Look for large size, deep from, or heavy aggregations. Cross-reference with the slow log.
Check shard fan-out. Use _cat/shards to see how many shards the query hits. Many small indices or excessive shards multiply merge overhead.
Verify GC behavior. Use _nodes/stats/jvm to check for frequent old-generation GC or long pauses on the coordinator while data nodes are stable.
Check search queues. Use _cat/thread_pool/search to confirm queue depth grows on the coordinator before rejections begin.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`heap.percent` on coordinating nodes	Heap spikes are the primary symptom of merge pressure	Sustained >75% while data nodes are <50%
`request` breaker `tripped`	Indicates single-request aggregation structures are too large	Any delta > 0 during normal traffic
`parent` breaker `tripped`	Last line of defense; node is near OOM	Any delta > 0 on coordinator
Old GC collection time	Long old GC pauses cause node removal	Pauses >10s correlated with heap spikes
`thread_pool.search.queue`	Precursor to rejection and latency	Sustained queue growth on coordinator
`thread_pool.search.rejected`	Direct measure of failed searches	Nonzero rate sustained >5 minutes
Query latency vs. end-to-end latency	Distinguishes shard slowness from coordinator merge cost	Shard query latency normal but total latency high
Shard count per query	Merge overhead scales with shard count	Queries hitting >100 shards regularly

Fixes

Reduce aggregation cardinality

Rewrite queries to avoid high-cardinality terms aggregations where possible. Pre-filter with a selective query clause, or use sampler and diversified_sampler aggregations to reduce the bucket count returned to the coordinator.

Limit fetch size and pagination depth

Lower the size parameter. Avoid deep from + size pagination for large result sets; use search_after instead.

Reduce shards per query

Target fewer indices or use the shrink API to reduce shard count for read-only indices. Avoid broad wildcard patterns that fan out to hundreds of shards.

Add dedicated coordinating nodes

Deploy nodes with no data, master, or ingest roles to handle client requests. This isolates merge heap pressure from data nodes. Size heap generously, staying under the compressed OOPs threshold (usually 30 GB or less).

Cancel abusive queries

Find long-running searches via _tasks and cancel them. This terminates the query for the client but protects the cluster:

# Cancel a specific search task (disruptive: terminates the query)
curl -X POST 'http://localhost:9200/_tasks/<task_id>/_cancel'

Block reallocation during recovery

If a coordinating node drops out due to GC and triggers a reallocation storm, pause allocation until the root query is fixed. This prevents shard recovery:

# Temporarily disable shard reallocation (disruptive: prevents recovery)
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":"none"}}'

Re-enable after fixing the query.

Prevention

Size coordinating capacity independently. Deploy dedicated coordinating-only nodes for heavy analytics, and monitor their heap, thread pools, and circuit breakers separately from data nodes.
Set slow log thresholds. Configure index.search.slowlog.threshold.query.warn to catch expensive aggregations before they trip breakers.
Limit shard fan-out. Design index patterns so routine queries hit a manageable number of shards. Use time-based aliases or data streams that exclude cold indices from active searches.
Monitor the heap floor. Track the post-old-GC heap minimum on coordinating nodes. A rising floor predicts merge pressure before breakers trip.
Stage-test aggregation queries. Run high-cardinality aggregations and deep pagination against representative data and shard counts before releasing to production.

How Netdata helps

Real-time JVM heap per node, surfacing coordinator asymmetry without manual _cat/nodes cross-referencing.
search thread pool queue depth and rejections per node, highlighting the bottleneck before clients see HTTP 429.
Circuit breaker tripped counters and estimated vs. limit sizes from _nodes/stats/breaker.
Old-generation GC pause duration alongside search latency to show when merge pressure causes stop-the-world pauses.
Alerts on sustained heap usage above baseline per node, distinguishing the overloaded coordinator from healthy data nodes.

The Netdata solution

Elasticsearch monitoring with Netdata

Netdata monitors Elasticsearch with per-second metrics and ML anomaly detection. Correlate JVM heap pressure, shard counts, disk watermarks, mapping growth, and merge activity with cluster and node health in one view.

See Elasticsearch monitoring → Start monitoring free

Elasticsearch coordinating node overload: aggregation merge, heap spikes, and 429s

Elasticsearch coordinating node overload: aggregation merge, heap spikes, and 429s

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Reduce aggregation cardinality

Limit fetch size and pagination depth

Reduce shards per query

Add dedicated coordinating nodes

Cancel abusive queries

Block reallocation during recovery

Prevention

How Netdata helps

Related guides

Elasticsearch monitoring with Netdata