Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
When a node drops from _cat/nodes while network tests pass, check its JVM GC logs. Stop-the-world pauses longer than the 10-second fault-detection timeout make the node fail its checks, and after repeated failures the master removes it. Survivors then absorb recovery traffic, their heap climbs, and they begin missing fault-detection checks too. This feedback loop is the heap pressure death spiral. It masquerades as network instability because operators check connectivity while the real problem is memory saturation.
What this means
By default, Elasticsearch follower/leader fault detection uses a 10-second timeout, 1-second interval, and 3 retries before removing a node. A GC pause over 10 seconds causes one failed check; three consecutive failures trigger removal. A hard TCP disconnect removes the node immediately.
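As a rough worked example of the arithmetic above, the sketch below estimates how long a node has to stay unresponsive before it is removed. It assumes each failed check consumes the full timeout and that consecutive checks are separated by the check interval; the function name is hypothetical, and the exact scheduling inside Elasticsearch may differ slightly.

```shell
#!/bin/sh
# Rough estimate only (an assumption, not Elasticsearch's exact scheduling):
# every failed check burns the full timeout, and consecutive checks are
# separated by the check interval.
# usage: removal_window_seconds <timeout_s> <retries> <interval_s>
removal_window_seconds() {
  timeout_s=$1
  retries=$2
  interval_s=$3
  echo $(( retries * timeout_s + (retries - 1) * interval_s ))
}
```

With the defaults quoted above, `removal_window_seconds 10 3 1` prints 32. The corresponding settings are cluster.fault_detection.follower_check.timeout, cluster.fault_detection.follower_check.retry_count, and cluster.fault_detection.follower_check.interval, so the same function can be reused if those values have been changed on your cluster.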
When a node is removed, the master waits for index.unassigned.node_left.delayed_timeout (default 1 minute) before reassigning shards. If heap pressure caused the removal, reallocation begins after the delay. Recovery copies segments and rebuilds in-memory structures on the targets, consuming heap, CPU, network, and disk. If targets already run above 85% heap, recovery traffic pushes their own GC pauses past the fault-detection threshold. The cascade accelerates.
The root cause is always excessive long-lived heap occupancy. The usual suspects are too many shards per node (segment metadata overhead), mapping explosions (bloated cluster state), fielddata loading on text fields, large aggregation result sets, or unbounded bulk request buffering.
flowchart TD
A[Heap sustains above 85%] --> B[Frequent old GC]
B --> C[Pause exceeds 10s]
C --> D[Fault detection fails]
D --> E[Master removes node]
E --> F[Shard reallocation]
F --> G[Recovery traffic on survivors]
G --> H[Heap pressure rises]
H --> B
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Too many shards / segment metadata | segments.memory growing; shard count approaching 1000 per node | _cat/nodes?v&h=name,segments.memory,segments.count |
| Mapping explosion | Pending tasks climbing; cluster state bloated | _cluster/stats?filter_path=indices.mappings.total_field_count |
| Fielddata on text fields | fielddata.memory_size high; fielddata circuit breaker trips | _nodes/stats/indices/fielddata?fields=* |
| Expensive aggregations | Query latency spikes; request breaker trips | _tasks?detailed=true&actions=*search* |
| Oversized bulk indexing | Write queues high; indexing pressure near limit | _nodes/stats/indexing_pressure |
Quick checks
# Check heap utilization and old GC activity per node
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old'
# Check segment metadata overhead in heap
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.memory,segments.count'
# Check fielddata cache size and evictions
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,fielddata.memory_size,fielddata.evictions'
# Check circuit breaker trips and estimated sizes
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'
# Check active search tasks that may be consuming heap
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'
# Check node count to confirm departures
curl -s 'http://localhost:9200/_cluster/health?filter_path=number_of_nodes,number_of_data_nodes'
# Check total field count across all indices
curl -s 'http://localhost:9200/_cluster/stats?filter_path=indices.mappings.total_field_count'
How to diagnose it
1. Confirm heap pressure is sustained. Look for heap_used_percent above 85% for multiple minutes, not a transient spike. The critical signal is the post-GC floor: if the minimum heap after old GC is climbing, long-lived objects are accumulating.
2. Correlate GC pauses with node departures. On the removed node, JVM GC logs should show old-generation pauses exceeding 10 seconds immediately before the node left the cluster. Master logs will show the removal shortly after the fault-detection timeout.
3. Identify the heap consumer. Compare segments.memory, fielddata.memory_size, query_cache.memory_size, and request_cache.memory_size via _nodes/stats/indices/segments,fielddata,query_cache,request_cache. Segment metadata and fielddata are the most common culprits in death spirals.
4. Check for circuit breaker trips. A rising tripped count on the parent breaker means the node is near OOM. request breaker trips suggest large aggregations; fielddata trips indicate text-field aggregations.
5. Find expensive queries. Use _tasks?detailed=true&actions=*search* to identify long-running searches. Cancel abusive tasks with POST /_tasks/<task_id>/_cancel.
6. Measure cluster state bloat. A large cluster state consumes heap on every node. Check total field count with _cluster/stats. If field count grows without bound, dynamic mapping is the driver.
7. Confirm the cascade on survivors. After a node drops, check whether remaining nodes show rising heap, growing old GC times, and new circuit breaker trips as recovery traffic arrives.
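The first diagnostic step (sustained pressure and a rising post-GC floor) can be approximated from a series of heap_used_percent samples. The sketch below is a crude heuristic, not an Elasticsearch feature: both function names are hypothetical, samples are assumed to arrive one per line with the oldest first, and the floor is approximated as the lowest sample in each half of the series.

```shell
#!/bin/sh
# Exit 0 if the last <count> samples on stdin are all above <threshold>.
# usage: sustained_above <threshold> <count>
sustained_above() {
  awk -v t="$1" -v n="$2" '
    { v[NR] = $1 }
    END {
      if (NR < n) exit 1                   # not enough samples yet
      for (i = NR - n + 1; i <= NR; i++)
        if (v[i] + 0 <= t + 0) exit 1      # a sample at or below threshold
      exit 0
    }'
}

# Print "<older_floor> <newer_floor>" and exit 0 if the floor has risen.
# The floor is approximated as the minimum sample in each half of the series.
# usage: heap_floor_rising   (samples on stdin, at least 4)
heap_floor_rising() {
  awk '
    { v[NR] = $1 + 0 }
    END {
      if (NR < 4) exit 1
      half = int(NR / 2)
      a = v[1]
      for (i = 2; i <= half; i++) if (v[i] < a) a = v[i]
      b = v[half + 1]
      for (i = half + 2; i <= NR; i++) if (v[i] < b) b = v[i]
      printf "%d %d\n", a, b
      if (b > a) exit 0
      exit 1
    }'
}
```

For example, feeding the samples 30, 70, 32, 75, 45, 74, 50, 76 to `heap_floor_rising` prints `30 45`: the peaks look unchanged, but the floor has moved up by 15 points, which is the pattern that signals accumulating long-lived objects.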
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| jvm.mem.heap_used_percent | Hard ceiling; sustained pressure triggers old GC | Sustained above 85% |
| Old GC pause duration | Stop-the-world pauses cause fault-detection failures | Individual pause exceeds 10 seconds |
| breakers.parent.tripped | Last line of defense before OOM | Any delta greater than 0 |
| segments.memory | Segment metadata lives in old generation | Growing in step with heap floor |
| Node count | Direct indicator of cascade progression | Unplanned drop |
| Unassigned shard count | Precedes recovery load on survivors | Rising after node departure |
| Thread pool queue depth | Precursor to rejection and latency | write or search queue sustained above 50% of max |
| Fielddata cache size | Text-field aggregations waste heap | Above 10% of heap or any evictions |
Fixes
Stop the cascade immediately
WARNING: The next command leaves shards unassigned. Cluster health will turn yellow or red until you re-enable allocation.
Prevent new shard allocation and rebalancing to stop recovery traffic from pushing surviving nodes over their own heap limits:
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
"transient": {"cluster.routing.allocation.enable": "none"}
}'
Cancel any expensive queries that are pinning heap:
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'
curl -X POST 'http://localhost:9200/_tasks/<task_id>/_cancel'
Reduce heap consumers
Fielddata. If _nodes/stats/indices/fielddata?fields=* shows large fielddata usage, identify the offending text field and add a keyword sub-field. Update queries to use the keyword field for aggregations and sorting.
Segment metadata. If segments.memory is high, reduce the shard count per node. Close or delete old indices, shrink existing indices, or add data nodes. Force-merge indices that are no longer written to, but never force-merge live indices under heap pressure because the operation itself consumes significant heap and I/O.
Mapping explosion. Cap field growth with index.mapping.total_fields.limit and disable dynamic mapping on indices that receive unstructured JSON. If the cluster state is already bloated, removing indices or fields is the only immediate relief.
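A minimal sketch of the two mapping guardrails just described, using the update index settings and update mapping APIs. The helper names are hypothetical, the index name and limit are supplied by the caller, and ES_URL defaults to the same localhost address used elsewhere in this guide.

```shell
#!/bin/sh
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the settings body that caps the number of mapped fields.
# usage: fields_limit_body <limit>
fields_limit_body() {
  printf '{"index.mapping.total_fields.limit": %d}\n' "$1"
}

# Apply the cap to one index (hypothetical helper).
# usage: cap_fields <index> <limit>
cap_fields() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(fields_limit_body "$2")"
}

# Stop new fields from being added to the mapping of one index.
# With "dynamic": false, unknown fields stay in _source but are not indexed.
# usage: disable_dynamic_mapping <index>
disable_dynamic_mapping() {
  curl -s -X PUT "$ES_URL/$1/_mapping" \
    -H 'Content-Type: application/json' \
    -d '{"dynamic": false}'
}
```

A call such as `cap_fields my-index 500` followed by `disable_dynamic_mapping my-index` would apply both (the index name and limit here are placeholders). Setting dynamic to "strict" instead of false rejects documents that contain unmapped fields, which is the harsher option if you want producers to notice.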
Circuit breaker tuning. Temporarily lowering indices.breaker.fielddata.limit can force earlier rejection of bad queries, giving the cluster breathing room. Do not raise the parent breaker limit. That invites OOM and a deeper spiral.
Add capacity
If the cluster is simply undersized for the data volume, add data nodes to spread shards and reduce per-node heap occupancy. Do not increase heap above roughly 26 GB per node. That figure is a conservative ceiling that keeps the JVM safely within compressed ordinary object pointers (compressed OOPs); the hard cutoff is closer to 32 GB, and past it object references grow from 4 to 8 bytes, so much of the extra heap is consumed by larger pointers. It is better to add nodes than to grow heap past the compressed-OOPs threshold.
After heap stabilizes below 75% and GC pauses stop, re-enable allocation:
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
"transient": {"cluster.routing.allocation.enable": "all"}
}'
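Before re-enabling allocation, it can help to confirm that every node really is under the 75% mark. The sketch below assumes input rows of the form `name heap.percent` (the output of _cat/nodes with those two columns and no header) and node names without spaces; the function name is hypothetical.

```shell
#!/bin/sh
# Reads "name heap_percent" rows on stdin.
# Prints the name of every node at or above <limit> and exits non-zero
# if any such node exists; exits 0 when all nodes are below the limit.
# usage: all_nodes_below <limit>
all_nodes_below() {
  awk -v limit="$1" '
    $2 + 0 >= limit + 0 { print $1; hot = 1 }
    END { exit (hot + 0) }'
}
```

One way to use it is `curl -s 'http://localhost:9200/_cat/nodes?h=name,heap.percent' | all_nodes_below 75` and to run the re-enable command only when it exits 0. A single snapshot can catch a node at the bottom of its sawtooth, so repeat the check a few times before trusting it.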
Prevention
- Monitor the heap floor, not just the peak. A sawtooth between 30% and 75% is normal; a floor that rises from 30% to 60% over days is a leading indicator of the death spiral.
- Alert on composite conditions: heap_used_percent above 85% combined with increasing old GC time or parent circuit breaker trips. Heap percentage alone creates false positives from transient bursts.
- Keep sustained heap below 75%. Maintain headroom for bursts and recovery traffic.
- Monitor segments.memory and segments.count per node. Time-series indices that are closed or read-only should be force-merged to one segment to reduce metadata overhead.
- Monitor total field count across all indices. Set index.mapping.total_fields.limit and reject dynamic mapping on unstructured data sources.
- Ensure fielddata cache is near zero in normal operation. Any significant fielddata indicates a mapping or query design problem.
- During rolling restarts, set cluster.routing.allocation.enable: none and rely on index.unassigned.node_left.delayed_timeout to prevent unnecessary reallocation.
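The composite alert condition from the list above can be expressed as a small predicate. This is a sketch under stated assumptions: the inputs are taken from two consecutive polls of _nodes/stats (current heap_used_percent, the increase in old-generation collection_time_in_millis, and the increase in the parent breaker's tripped counter), and the function name and optional minimum GC delta are hypothetical.

```shell
#!/bin/sh
# Exit 0 (alert) only when heap is above 85% AND at least one corroborating
# signal is present: old GC time grew by more than min_gc_ms between polls,
# or the parent breaker tripped at least once between polls.
# usage: composite_alert <heap_pct> <old_gc_delta_ms> <parent_trip_delta> [min_gc_ms]
composite_alert() {
  heap_pct=$1
  gc_delta_ms=$2
  trip_delta=$3
  min_gc_ms=${4:-0}
  # Heap percentage alone is not enough to alert.
  [ "$heap_pct" -gt 85 ] || return 1
  [ "$gc_delta_ms" -gt "$min_gc_ms" ] || [ "$trip_delta" -gt 0 ]
}
```

For instance, `composite_alert 90 1500 0` succeeds, while `composite_alert 90 0 0` and `composite_alert 70 5000 3` do not. Requiring the heap condition plus one corroborating signal is what filters out transient bursts.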
How Netdata helps
- Charts heap_used_percent, old GC duration, and node count together so you can see the cascade form before the cluster turns red.
- Collects segments.memory and fielddata cache per node to identify heap consumers without running ad-hoc _nodes/stats queries.
- Supports composite alerts that fire only when heap is sustained above 85% alongside rising old GC time or circuit breaker trips, reducing false positives from transient spikes.
- Surfaces per-node thread pool queue depths and rejection rates so you can spot survivors being overwhelmed by recovery traffic.
Related guides
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert
- How Elasticsearch actually works in production: a mental model for operators







