Elasticsearch all shards failed: diagnosing search_phase_execution_exception

You run a search and Elasticsearch returns search_phase_execution_exception with reason all shards failed. Every shard copy involved in the query returned a failure to the coordinating node. The outer error is a container; the actual root cause lives in the per-shard failure reasons inside the response body. Do not assume the cluster is down. This error fires on clusters with green health and stable nodes when a query is malformed, a mapping is incompatible, or a resource limit is breached uniformly across every target shard.

The read path uses a two-phase scatter-gather. In the query phase, the coordinating node broadcasts the query to one copy of each relevant shard; each shard executes the query locally and returns document IDs and sort values. In the fetch phase, the coordinating node retrieves the full documents for the merged top hits. When every shard copy fails, the coordinating node has no partial result to merge and throws the composite exception. The first diagnostic step is to determine whether the failure is data unavailability, a query error, or resource exhaustion.

What this means

  • Cluster health green does not prevent this error. Green only means all primaries and replicas are assigned. It says nothing about query correctness or node resource headroom. A malformed query against a healthy cluster produces all shards failed.
  • The error is a composite signal. It can represent unassigned primaries (cluster red), a bad query that fails identically on every shard, a mapping mismatch, or a uniform resource limit breach such as circuit breakers, thread pool rejections, or disk flood-stage blocks.
  • The per-shard reason is the diagnosis. If every shard reports the same query-side error, such as query_shard_exception, illegal_argument_exception, script_exception, or mapper_parsing_exception, the problem is the query or the mapping it targets. If the reason references an unassigned shard or a tripped circuit breaker, the problem is infrastructure.

flowchart TD
    A["search_phase_execution_exception: all shards failed"] --> B{"Read per-shard failure reasons"}
    B -->|"mapper_parsing_exception or query error"| C["Fix query / mapping"]
    B -->|es_rejected_execution_exception| D["Check thread pool saturation"]
    B -->|circuit_breaking_exception| E["Check heap and circuit breakers"]
    B -->|"unassigned or node_left"| F["Check cluster health and allocation explain"]
    B -->|read_only_allow_delete| G["Check disk watermark and clear blocks"]
    C --> H["Cancel bad tasks and reissue"]
    D --> I["Reduce concurrency or scale out"]
    E --> J["Reduce aggregation cardinality / add heap"]
    F --> K["Retry failed or restore nodes"]
    G --> L["Free disk and remove blocks"]
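
The triage above can be sketched as a small shell function that maps a per-shard failure type to the first follow-up check. This is a minimal sketch, not an exhaustive mapping: the function name next_step_for_failure and the hint strings are made up for illustration, and the exception types listed for each branch are common examples rather than a complete list.

```shell
#!/bin/sh
# Map a per-shard failure "type" string to the first follow-up check.
# Hypothetical helper: the name and hint texts are illustrative only.
next_step_for_failure() {
  case "$1" in
    query_shard_exception|illegal_argument_exception|script_exception|mapper_parsing_exception)
      echo "query: fix the query or mapping" ;;
    es_rejected_execution_exception)
      echo "threadpool: check search thread pool saturation" ;;
    circuit_breaking_exception)
      echo "breaker: check heap and circuit breakers" ;;
    no_shard_available_action_exception|node_disconnected_exception|node_not_connected_exception)
      echo "availability: check cluster health and allocation explain" ;;
    cluster_block_exception)
      echo "block: check disk watermarks and index blocks" ;;
    *)
      # Anything unrecognized still needs a human to read the reason text.
      echo "unknown: read the full reason text" ;;
  esac
}
```

Calling next_step_for_failure circuit_breaking_exception prints the breaker hint; any type not listed falls through to the "unknown" branch instead of guessing.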

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Unassigned or initializing primary shards | Cluster health red; affected indices return complete query failure | GET /_cluster/health and GET /_cluster/allocation/explain |
| Query or mapping error hitting every shard identically | Cluster health green; error text references unknown fields, missing .keyword sub-field, or script errors | The per-shard reason block in the error response |
| Circuit breaker tripped (request or parent) | Heap pressure sustained above 85 percent; heavy aggregations fail uniformly | GET /_nodes/stats/breaker for tripped counters and estimated_size_in_bytes |
| Search thread pool saturation | Query load spikes; rejections inside shard failures | GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected |
| Disk flood-stage watermark / read-only block | Disk above 95 percent; indexing also failing | GET /_cat/allocation?v and index settings for index.blocks.read_only_allow_delete |
| High-cardinality aggregation exceeding memory limits | Deep terms aggregations fail across all shards; request breaker may trip | Slow log and GET /_nodes/stats/breaker |

Quick checks

# Check cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_nodes'

# Check search thread pool saturation
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected'

# Check circuit breaker trips and estimated sizes
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers.parent,nodes.*.breakers.request'

# Check disk watermark proximity and blocks
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total'

# Check JVM heap pressure and old-generation GC activity on data nodes
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old'

# Check for active read-only blocks on indices
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete'
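
The thread pool check prints one row per node, which is tedious to scan on a large cluster. The following is a rough sketch of a filter that keeps only the nodes showing a non-empty queue or any rejections. The function name flag_search_pool_pressure is hypothetical. It assumes the exact column order requested above (node_name, active, queue, rejected), a header row from ?v, and node names without spaces.

```shell
#!/bin/sh
# Hypothetical helper: reads `_cat/thread_pool/search?v&h=node_name,active,queue,rejected`
# output on stdin and prints only nodes with queued or rejected searches.
# Assumes the four columns arrive in that order and node names contain no spaces.
flag_search_pool_pressure() {
  # NR > 1 skips the header row; $3 is queue, $4 is rejected.
  awk 'NR > 1 && ($3 > 0 || $4 > 0) { print $1 ": queue=" $3 " rejected=" $4 }'
}

# Usage against a live cluster (not executed here):
#   curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected' \
#     | flag_search_pool_pressure
```

Note that rejected is cumulative since node start, so a non-zero value only proves rejections happened at some point; compare two runs to see whether the counter is still climbing.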

How to diagnose it

  1. Read the shard-level failure reasons. The error body contains a failed_shards array; each entry names the shard, index, and node, and includes a reason object with type and reason fields. If every shard reports the same parsing or scripting error, the problem is the query.
  2. Check cluster health. Run GET /_cluster/health. If the status is red, unassigned primaries are the cause. Use GET /_cluster/allocation/explain to find the allocation block. See Elasticsearch cluster health red: unassigned primaries and how to recover and Elasticsearch unassigned shards: reading allocation explain and fixing each reason.
  3. Check for circuit breaker trips. Run GET /_nodes/stats/breaker. If parent.tripped or request.tripped are increasing, the query is consuming too much heap. Look for heavy aggregations, high-cardinality terms, or loading fielddata on text fields. See Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix and Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes.
  4. Check search thread pool saturation. Run GET /_cat/thread_pool/search. If rejected is increasing, the search queue is full. This surfaces as es_rejected_execution_exception inside shard failures. Reduce query concurrency or add data nodes.
  5. Check disk watermark and index blocks. Run GET /_cat/allocation. If any node is above 95 percent disk usage, Elasticsearch sets index.blocks.read_only_allow_delete. Free disk space and clear the block. See Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks.
  6. Check the slow log for expensive queries. If shard failures correlate with a specific query pattern, the slow log reveals the aggregation or script that breached limits. Cancel the task if it is still running via POST /_tasks/{task_id}/_cancel.
  7. Check for node departures mid-query. If shard failures reference a node that left the cluster, correlate with GET /_cat/nodes and GC logs. Long GC pauses make fault detection checks time out; with default settings each check times out after 10 seconds, and a node is removed after three consecutive failed checks. See Elasticsearch long GC pauses: old-generation stop-the-world and node drops.
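
Step 1 is easier when the response is large if you reduce it to the distinct failure types first. The sketch below assumes the error response body was saved to a file, and it uses plain text tools rather than a JSON parser, so treat it as a quick triage aid only. The function name distinct_failure_types is hypothetical. It lists every "type" value in the body (including root_cause and any nested caused_by entries) except the outer search_phase_execution_exception wrapper.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct failure types found in a saved
# search error response. Text-based, not a real JSON parser.
distinct_failure_types() {
  # Strip whitespace so pretty-printed and compact JSON look the same,
  # pull out every "type":"..." pair, drop the outer wrapper, de-duplicate.
  tr -d ' \n\t' < "$1" \
    | grep -o '"type":"[^"]*"' \
    | sed 's/"type":"\([^"]*\)"/\1/' \
    | grep -v '^search_phase_execution_exception$' \
    | sort -u
}

# Usage (not executed here):
#   curl -s 'http://localhost:9200/my-index/_search' -H 'Content-Type: application/json' \
#     -d @query.json > response.json
#   distinct_failure_types response.json
```

A single line of output means every shard failed the same way, which usually points at the query or mapping. Several different types suggest an infrastructure problem affecting nodes unevenly. The index name my-index and the file names in the usage comment are placeholders.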

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Cluster health status | Unassigned primaries make data unavailable for queries | Red sustained for more than 2 minutes |
| Search thread pool rejections | Direct cause of shard-level query rejections under load | Sustained rate above 0 per minute for more than 5 minutes |
| Circuit breaker parent or request trips | Memory limits prevent query execution or aggregation completion | Any delta greater than 0 per interval |
| JVM heap used percent | Heap pressure triggers breakers and long GC pauses | Sustained above 85 percent |
| Disk usage vs watermarks | Flood stage blocks writes and can cause query failures | Above 95 percent or read_only_allow_delete block present |
| Unassigned shard count | Missing primaries guarantee all shards failed on affected indices | Any unassigned primary |
| Search latency (query phase) | Slow shards drag down the scatter-gather response | Sustained above 5 times baseline |
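
The rejection and breaker-trip signals are cumulative counters, so the warning signs above are about the change between two samples, not the raw value. As a minimal sketch of that arithmetic, the hypothetical helper counter_delta below computes the increase between a previous and a current reading. It assumes that a counter going backwards means the node restarted and the counter reset to zero.

```shell
#!/bin/sh
# Hypothetical helper: increase of a cumulative counter between two samples.
# Assumption: a lower current value means the node restarted and the counter
# started again from zero, so the current value is the whole delta.
counter_delta() {
  prev=$1
  curr=$2
  if [ "$curr" -lt "$prev" ]; then
    echo "$curr"
  else
    echo $((curr - prev))
  fi
}
```

For example, counter_delta 120 125 reports 5 new rejections in the interval, while counter_delta 120 3 reports 3 because the counter was reset in between.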

Fixes

Unassigned primaries. Use GET /_cluster/allocation/explain to find the specific blocker. If shards are stuck in ALLOCATION_FAILED after max retries, run POST /_cluster/reroute?retry_failed=true. If disk watermarks are the cause, free space or add nodes.

Query or mapping errors. Fix the query client-side. Use .keyword sub-fields for aggregations and sorting instead of analyzed text fields. Verify that field names in the query match the current mapping. If a bad query is still running, identify it via GET /_tasks?detailed=true&actions=*search* and cancel it.

Circuit breaker trips. Reduce aggregation cardinality by lowering the size parameter or adding pre-filters. If fielddata is the culprit, reindex with a keyword multi-field. Do not raise breaker limits to mask the problem; this risks OOM. See Elasticsearch heap pressure death spiral: GC, node removal, and the cascade.

Thread pool saturation. Add data nodes or reduce concurrent search load. Do not increase the search queue size blindly; larger queues increase memory pressure and delay rejection without fixing throughput.

Disk watermark / read-only blocks. Delete old indices or reduce replica count to free space. After freeing disk, remove the block with PUT /_all/_settings {"index.blocks.read_only_allow_delete": null}. In 7.4 and later, Elasticsearch releases the block automatically once disk usage on the node falls below the high watermark, so the manual request is mainly needed on older versions. In either case, the block only goes away after space has actually been freed.
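
The settings request above is written in console form. A curl equivalent might look like the sketch below, which is an illustration under assumptions rather than a recommended tool: the function name clear_read_only_block and the ES_URL and APPLY variables are hypothetical, the default URL matches the localhost address used in the quick checks, and the function only prints the request unless APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical helper: remove index.blocks.read_only_allow_delete from an
# index (default: _all). Dry-run by default; set APPLY=1 to send the request.
clear_read_only_block() {
  base="${ES_URL:-http://localhost:9200}"   # assumed endpoint, override via ES_URL
  target="${1:-_all}"
  body='{"index.blocks.read_only_allow_delete": null}'
  if [ "${APPLY:-0}" = "1" ]; then
    curl -s -X PUT "$base/$target/_settings" \
      -H 'Content-Type: application/json' -d "$body"
  else
    # Dry run: show what would be sent without touching the cluster.
    echo "PUT $base/$target/_settings $body"
  fi
}
```

Running clear_read_only_block with no arguments prints the request for _all, and APPLY=1 clear_read_only_block my-index sends it for a single index (my-index is a placeholder). The dry-run default is a deliberate choice, since clearing the block before disk space is freed only lets the node hit flood stage again.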

Prevention

  • Monitor leading indicators, not just cluster health. Green status does not mean queries will succeed. Track JVM heap floor, search thread pool queue depth, and disk growth rate.
  • Enforce query review. Catch expensive aggregations and text-field sorts in development.
  • Cap shard count. Too many shards amplify the blast radius of any query error and increase heap pressure from segment metadata. Use ILM to roll over and delete on schedule.
  • Set slow log thresholds. Configure index.search.slowlog.threshold.query.warn to catch pathological queries before they trigger breakers.
  • Maintain disk headroom. Keep data nodes below 70 percent disk usage. Merges temporarily require extra space, and flood-stage blocks are a hard stop.

How Netdata helps

  • Correlate JVM heap, old GC pauses, and circuit breaker trips to catch heap pressure before it causes shard failures.
  • Alert on search thread pool queue depth and rejections before queries fail.
  • Track disk usage against watermark thresholds to warn before flood-stage blocks.
  • Monitor unassigned shards and cluster health transitions alongside node departures and GC activity.
  • Surface search latency spikes against query rates to distinguish capacity exhaustion from a single bad query.