Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes

When a search or indexing request returns HTTP 429 with CircuitBreakingException: [parent] Data too large, data for [<http_request>] would be [X], which is larger than the limit of [Y], the parent circuit breaker has rejected the operation. This is Elasticsearch protecting the node from an OutOfMemoryError, not a client-side rate limit.

Since version 7.0, the parent breaker tracks real memory usage by default, so it can trip even when every child breaker is within its own limit. A parent trip means the node is under genuine heap pressure. The first job is to determine quickly whether the cause is a single abusive query or structural memory exhaustion.

What this means

The parent circuit breaker guards total JVM heap consumption. When indices.breaker.total.use_real_memory is true (the default since Elasticsearch 7.0), the breaker adds the actual heap utilization to the bytes the current operation would reserve. If the sum exceeds the limit, Elasticsearch rejects the operation, usually with HTTP 429. When long-lived child usage such as fielddata outweighs transient usage such as request and in-flight bytes, the response is HTTP 500 instead. The error message includes real usage, new bytes reserved, and a per-child breakdown inside usages[].

This breakdown shows which child breaker holds the most tracked memory. Because the parent uses real memory, it can trip when fielddata or segment metadata fills the heap, even if no single request is large. With real-memory tracking enabled, the parent limit (indices.breaker.total.limit) defaults to 95% of the JVM heap. The 70% default applies only when use_real_memory is false.
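The parent check described above reduces to one comparison. Below is a minimal POSIX shell sketch of it. It assumes use_real_memory is true and models the decision offline. The function name parent_would_trip is hypothetical, and all sizes are plain byte counts that you supply, not values read from a live node.

```shell
#!/bin/sh
# Hypothetical helper: models the parent-breaker decision with use_real_memory=true.
# $1 = heap max (bytes), $2 = real heap usage (bytes),
# $3 = new bytes the operation would reserve, $4 = parent limit (whole % of heap).
parent_would_trip() {
  limit=$(( $1 * $4 / 100 ))   # parent limit expressed in bytes
  total=$(( $2 + $3 ))         # real usage plus the new reservation
  if [ "$total" -gt "$limit" ]; then
    echo "trip: $total > $limit"
  else
    echo "ok: $total <= $limit"
  fi
}

# Example: 30 GiB heap, 95% parent limit, 28 GiB already used, 1 GiB reservation.
GIB=$((1024 * 1024 * 1024))
parent_would_trip $((30 * GIB)) $((28 * GIB)) $GIB 95
# prints: trip: 31138512896 > 30601641984
```

The request in the example is small relative to the heap. It still trips the breaker because real usage already sits close to the limit, which is why a parent trip often points to systemic pressure rather than to the request that happened to be rejected.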

The use_real_memory setting is static. Changing it requires a node restart, and disabling it masks real heap pressure. All other breaker limits are dynamic and can be updated via PUT /_cluster/settings. Do not raise the parent threshold as a routine fix; it is the last defense before cascading heap pressure failure.

flowchart TD
    A[Parent breaker trips<br/>HTTP 429] --> B{Check _nodes/stats/breaker}
    B -->|estimated_size near limit,<br/>no single large usage| C[Systemic heap pressure]
    B -->|usages shows large fielddata| D[Fielddata cache load]
    B -->|Single request oversized| E[Abusive query or bulk]
    C --> F[Check segments.memory<br/>and shard count]
    D --> G[Clear fielddata cache,<br/>fix text field mappings]
    E --> H[Cancel task via _tasks,<br/>reduce batch size]
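The flowchart's first decision depends on which child dominates usages[] in the exception. The sketch below pulls the largest child out of the message text. It assumes the layout usages [name=<bytes>/<human size>, ...], which is how 7.x and later print the breakdown. The function name largest_child_usage is hypothetical, and the sample message uses made-up numbers.

```shell
#!/bin/sh
# Hypothetical helper: reads a parent CircuitBreakingException message on stdin
# and prints "<child> <bytes>" for the largest entry in its usages [...] list.
# Assumes comma-separated entries of the form name=<bytes>/<human-readable size>.
largest_child_usage() {
  sed -n 's/.*usages \[\([^]]*\)\].*/\1/p' |  # keep only the usages list
    tr ',' '\n' |                             # one child breaker per line
    awk -F'[=/]' 'NF >= 2 { gsub(/ /, "", $1); print $1, $2 }' |
    sort -k2,2nr |                            # largest byte count first
    head -n 1
}

# Illustrative message; the numbers are invented, not from a real cluster.
sample='[parent] Data too large, data for [<http_request>] would be [1020591736/973.3mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1020591736/973.3mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=812345678/774.7mb, in_flight_requests=0/0b, accounting=5138548/4.9mb]'
printf '%s\n' "$sample" | largest_child_usage
```

A dominant fielddata entry sends you down the fielddata branch, and a dominant request entry points at a single query. If every entry is small compared with real usage, follow the systemic branch.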

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Fielddata cache on text fields | Error usages[] shows a large fielddata component; aggregations or sorts on analyzed text fields | GET /_nodes/stats/indices/fielddata?fields=* |
| Oversized bulk or indexing batches | Trips during ingest spikes; write thread pool queues and rejections increase | GET /_cat/thread_pool/write and bulk request sizes |
| Runaway search or aggregation | Coordinating node heap spikes; single query reserves massive request structures | GET /_tasks?detailed=true&actions=*search* |
| Structural heap pressure | Post-GC heap floor rising across nodes; segments.memory growing steadily | GET /_cat/nodes?h=name,segments.memory,heap.percent |

Quick checks

Run these read-only commands to triage without changing cluster state.

# Check which breaker tripped and by how much
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'

# Check JVM heap percent
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'

# Check old GC collection count and time
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors.old'

# Check fielddata cache size and eviction churn
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,fielddata.memory_size,fielddata.evictions'

# Check segment metadata heap usage
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.memory'

# Check for expensive in-flight search tasks
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

# Check write thread pool rejections under ingest load
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
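The heap check above returns one row per node. A small filter can flag the nodes that exceed the 85% heap warning level used elsewhere in this guide. This is a sketch that assumes node names contain no spaces, and nodes_over_heap_pct is a hypothetical name. A single snapshot is not a sustained reading, so repeat the check over several minutes before acting on it.

```shell
#!/bin/sh
# Hypothetical filter: reads "<node name> <heap.percent>" lines on stdin
# (for example from _cat/nodes?h=name,heap.percent) and prints nodes whose
# heap usage exceeds the threshold given as $1.
nodes_over_heap_pct() {
  awk -v limit="$1" '$2 + 0 > limit + 0 { print $1, $2 "%" }'
}

# Live usage (requires a reachable cluster):
# curl -s 'http://localhost:9200/_cat/nodes?h=name,heap.percent' | nodes_over_heap_pct 85
```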

How to diagnose it

  1. Confirm the breaker. Query /_nodes/stats/breaker and verify that the parent breaker's tripped count has incremented. Compare estimated_size_in_bytes to limit_size_in_bytes and note the current limit. If the limit has already been raised above its default, the cluster is running without adequate headroom. If estimated_size_in_bytes stays consistently above 70% of limit_size_in_bytes, the node is operating with thin margins even when it is not actively tripping.

  2. Read the error breakdown. If usages[] in the exception shows a large fielddata component, the cause is aggregations or sorts on text fields. If the request component is large, a single query is at fault. A large in_flight_requests component points to oversized bulk or HTTP payloads. If all of these are small but real usage is high, the heap is consumed by segments or other long-lived structures.

  3. Find the abusive task. Check /_tasks?detailed=true&actions=*search* for long-running searches. The tasks API reports running time, not memory, so look for a task that has run far longer or covers far more data than its peers. Note that task's ID and cancel it. If no single task stands out, the pressure is systemic.

  4. Inspect the heap floor. Look at heap.percent and old GC counts. If heap usage is sustained above 85% and old GC frequency is climbing, the node is entering a heap pressure death spiral. Check segments.memory and total shard count per node.

  5. Correlate with workload. If the trip coincides with bulk ingest, check thread_pool.write queue and rejection rates. If it coincides with a new query deployment, review the slow log for expensive aggregations.
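Step 4 asks whether old GC frequency is climbing, and answering that takes two samples rather than one. The sketch below turns two readings of collection_count and collection_time_in_millis for the old collector (from _nodes/stats/jvm) into two figures: old collections per minute, and the share of wall-clock time spent in old GC. The name old_gc_rate is hypothetical, and you type the inputs in by hand.

```shell
#!/bin/sh
# Hypothetical helper: compares two samples of the old-generation collector.
# $1 = first collection_count,  $2 = first collection_time_in_millis,
# $3 = second collection_count, $4 = second collection_time_in_millis,
# $5 = seconds between the two samples.
old_gc_rate() {
  per_min=$(( ($3 - $1) * 60 / $5 ))             # old collections per minute
  time_pct=$(( ($4 - $2) * 100 / ($5 * 1000) ))  # % of wall time spent in old GC
  echo "old_gc_per_min=$per_min old_gc_time_pct=$time_pct"
}

# Example: 6 old collections taking 3 s in total over a 60 s window.
old_gc_rate 10 1000 16 4000 60
# prints: old_gc_per_min=6 old_gc_time_pct=5
```

Integer arithmetic truncates fractions, which is acceptable for spotting a trend across repeated samples.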

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| breakers.parent.tripped (delta) | Count of parent rejections; any sustained delta means the node is near OOM | Delta > 0 per minute |
| jvm.mem.heap_used_percent | Current heap utilization; sustained elevation precedes breaker trips | Sustained > 85% |
| Post-GC heap floor | Minimum heap after old GC; a rising floor means long-lived objects are accumulating | Trending upward over days |
| fielddata.memory_size_in_bytes | Text field data loaded into heap; should be minimal with doc_values | > 25% of heap |
| segments.memory | Segment metadata lives in heap; grows with shard and segment count | Steady growth without index deletion |
| thread_pool.write.rejected | Write pool pushback; often precedes parent trips under ingest load | Sustained nonzero rate |
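The post-GC heap floor is easiest to judge by comparing windows rather than single points. The sketch below assumes you already record one heap-used percentage taken just after each old GC, oldest first and one value per line. It compares the lowest value in the earlier half of the series with the lowest value in the later half. heap_floor_trend and this halving heuristic are illustrative choices, not an Elasticsearch or Netdata feature.

```shell
#!/bin/sh
# Hypothetical helper: reads post-GC heap percentages (oldest first) on stdin and
# reports whether the floor (minimum) of the later half exceeds that of the earlier half.
heap_floor_trend() {
  awk '
    { v[NR] = $1 + 0 }
    END {
      if (NR < 2) { print "insufficient data"; exit }
      half = int(NR / 2)
      early = v[1]
      for (i = 2; i <= half; i++) if (v[i] < early) early = v[i]
      late = v[half + 1]
      for (i = half + 2; i <= NR; i++) if (v[i] < late) late = v[i]
      if (late > early) print "rising " early " -> " late
      else print "not rising " early " -> " late
    }'
}
```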

Fixes

Immediate relief for runaway queries

Find and cancel the abusive task:

# List active search tasks
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*&filter_path=nodes.*.tasks'
# Cancel by task ID (format <node_id>:<task_number>)
curl -X POST 'http://localhost:9200/_tasks/<task_id>/_cancel'

Cancel only the specific task. Mass cancellation requests can generate enough internal memory traffic to trip breakers on remote nodes.
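To pick a single candidate instead of cancelling broadly, you can rank top-level search tasks by running time. This sketch makes two assumptions. First, it relies on the _cat/tasks columns task_id, parent_task_id and running_time_ns; confirm them with _cat/tasks?help on your version. Second, it assumes top-level tasks show - as their parent. The function name is hypothetical. Running time is only a proxy, because the tasks API does not report per-task memory.

```shell
#!/bin/sh
# Hypothetical filter: reads "<task_id> <parent_task_id> <running_time_ns>" lines
# and prints the ID of the longest-running top-level task (parent shown as "-").
longest_parent_task() {
  awk '$2 == "-" { print $1, $3 }' | sort -k2,2nr | head -n 1 | cut -d' ' -f1
}

# Live usage (requires a reachable cluster):
# id=$(curl -s 'http://localhost:9200/_cat/tasks?actions=*search*&h=task_id,parent_task_id,running_time_ns' | longest_parent_task)
# curl -X POST "http://localhost:9200/_tasks/$id/_cancel"
```

Targeting the parent task is deliberate. Cancelling a search's parent task also cancels its shard-level child tasks, so one request replaces many.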

Fielddata pressure

Clear the cache to recover headroom:

curl -X POST 'http://localhost:9200/_cache/clear?fielddata=true'

Expect temporary query latency degradation until hot data reloads.

Then fix the mapping. Use _nodes/stats/indices/fielddata?fields=* to identify the offending text fields, and replace text-field aggregations and sorts with keyword sub-fields. The fielddata breaker defaults to 40% of heap, which is 12 GB on a 30 GB heap. Fielddata can therefore reach 22 GB on that heap only if the fielddata limit has been raised. At that point fielddata alone consumes most of the parent budget and leaves little margin for everything else. Any significant fielddata usage in modern Elasticsearch indicates a mapping problem.
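To find which nodes carry the fielddata load, compare each node's fielddata size with its maximum heap. This sketch assumes _cat/nodes output requested with bytes=b, so that both columns are plain byte counts. It uses the 25%-of-heap warning level from the metrics table, and fielddata_heavy_nodes is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical filter: reads "<name> <fielddata bytes> <heap max bytes>" lines
# and prints nodes where fielddata exceeds $1 percent of the heap.
fielddata_heavy_nodes() {
  awk -v limit="$1" '$3 > 0 {
    pct = $2 * 100 / $3
    if (pct > limit + 0) printf "%s %d%%\n", $1, pct
  }'
}

# Live usage (requires a reachable cluster):
# curl -s 'http://localhost:9200/_cat/nodes?h=name,fielddata.memory_size,heap.max&bytes=b' | fielddata_heavy_nodes 25
```

For the 22 GB on 30 GB case above, this filter would report 73%.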

Bulk indexing pressure

Reduce bulk batch size. A common starting point is 1,000 to 5,000 documents per batch. Document sizes vary, though, so cap each request by payload size as well as by document count. Large in-flight requests add temporary heap pressure, and combined with existing segment or fielddata load they can trip the parent even for moderate batches. Also check for very large individual documents that inflate request overhead.
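One way to cut an existing bulk payload into smaller requests is to split the NDJSON file on line boundaries. This sketch assumes every action line is followed by exactly one source line, which holds for index, create and update actions. Delete actions have no source line and would break the pairing. split_bulk is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: splits an NDJSON bulk body into files of at most $2 documents.
# $1 = input file, $2 = documents per batch, $3 = output directory.
# Assumes two lines per document (action line + source line).
split_bulk() {
  mkdir -p "$3"
  split -l $(( $2 * 2 )) "$1" "$3/batch_"
}

# Sending each chunk (requires a reachable cluster):
# split_bulk docs.ndjson 2000 batches
# for f in batches/batch_*; do
#   curl -s -H 'Content-Type: application/x-ndjson' -X POST \
#     'http://localhost:9200/_bulk' --data-binary "@$f"
# done
```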

Structural heap pressure

If segments.memory is large and shard count per node is high, reduce the shard burden:

  • Close or delete old indices.
  • Force-merge read-only indices to 1 segment only if they are no longer written and you accept the I/O cost.
  • Shrink indices with the shrink API.
  • Add data nodes to redistribute shards.

Check _cat/allocation to confirm per-node shard distribution before expanding the cluster.
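Before adding nodes, check whether shards are simply unevenly spread. This sketch prints the nodes that hold more shards than the per-node mean. It assumes three things:
- _cat/allocation output is requested as h=shards,node.
- Node names contain no spaces.
- Unassigned shards appear on a row whose node column is UNASSIGNED, which the filter skips.

The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: reads "<shards> <node>" lines and prints nodes whose
# shard count is above the mean across assigned nodes.
nodes_above_mean_shards() {
  awk '$2 != "UNASSIGNED" { count[$2] = $1 + 0; total += $1; n++ }
       END {
         if (n == 0) exit
         mean = total / n
         for (node in count) if (count[node] > mean) print node, count[node]
       }' | sort
}

# Live usage (requires a reachable cluster):
# curl -s 'http://localhost:9200/_cat/allocation?h=shards,node' | nodes_above_mean_shards
```

If the counts are roughly even but still high everywhere, rebalancing will not help, and the options listed above for reducing the shard burden apply.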

Do not disable real-memory tracking. Setting indices.breaker.total.use_real_memory: false masks heap pressure and risks OutOfMemoryError crashes. It is also a static setting that requires a restart, so it is not a viable incident response.

Temporary dynamic tuning

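If a child breaker such as fielddata or request is also tripping, you can raise its limit dynamically as a stopgap while you fix the root cause. Raising child limits does not relieve the parent breaker, which still caps total heap use: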

curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": {
    "indices.breaker.fielddata.limit": "45%",
    "indices.breaker.request.limit": "65%"
  }
}'

Tradeoff: higher limits delay rejection but increase OOM risk. Once the root cause is fixed, revert these changes by setting each key back to null. Do not raise the parent limit.

Prevention

  • Monitor the post-GC heap floor, not just the peak. A rising floor is the best leading indicator of the heap pressure death spiral.
  • Keep fielddata near zero by using doc_values and keyword fields for aggregations and sorting.
  • Size bulk batches conservatively and monitor indexing pressure if you run Elasticsearch 7.9 or later.
  • Use ILM to delete or rollup old indices before shard count and segment overhead exhaust heap.
  • Keep JVM heap at or below 31 GB to stay within compressed OOPs, and leave at least 50% of system RAM for the OS page cache.
  • Alert on breakers.parent.estimated_size_in_bytes consistently above 70% of limit_size_in_bytes.
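The alert rule in the last bullet reduces to an integer ratio check. This sketch assumes you feed it the parent breaker's estimated_size_in_bytes and limit_size_in_bytes from _nodes/stats/breaker. The rule says "consistently above", so evaluate the check over several consecutive samples rather than alerting on a single one. parent_breaker_alert is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical check: $1 = estimated_size_in_bytes, $2 = limit_size_in_bytes,
# $3 = alert threshold as a whole percent of the limit (for example 70).
parent_breaker_alert() {
  pct=$(( $1 * 100 / $2 ))
  if [ "$pct" -gt "$3" ]; then
    echo "ALERT ${pct}% of parent limit"
  else
    echo "OK ${pct}% of parent limit"
  fi
}
```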

How Netdata helps

Netdata correlates elasticsearch.breakers.parent.tripped with per-node JVM heap usage and old-generation GC pauses to distinguish a bad query from systemic pressure. Per-node charts for elasticsearch.fielddata.memory_size_in_bytes and elasticsearch.segments.memory_in_bytes identify the heap consumer without manual API sampling. Alert on jvm.mem.heap_used_percent and GC collection time for a leading indicator before the breaker fires. Thread pool rejection rates and shard counts per node surface gradual accumulation that eventually triggers trips.