Elasticsearch heap pressure death spiral: GC, node removal, and the cascade

When a node drops from _cat/nodes but network tests pass, check its JVM GC logs. Stop-the-world pauses longer than the 10-second fault-detection timeout make the node fail its checks, and after enough consecutive failures the master removes it. Survivors then absorb recovery traffic, their heap climbs, and they begin missing fault-detection checks too. This feedback loop is the heap pressure death spiral. It masquerades as network instability because operators check connectivity while the real problem is memory saturation.

What this means

By default, Elasticsearch follower/leader fault detection uses a 10-second timeout, 1-second interval, and 3 retries before removing a node. A GC pause over 10 seconds causes one failed check; three consecutive failures trigger removal. A hard TCP disconnect removes the node immediately.
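As a rough worked example of the arithmetic above, the sketch below estimates how long a node must stay unresponsive before it is removed. It assumes each failed check waits the full timeout and the next check starts one interval later, which simplifies the real scheduler. The function name removal_window_seconds is hypothetical.

```shell
# Rough estimate of the unresponsive time needed before removal.
# Assumption: every check waits the full timeout, and the next check
# starts one interval after the previous one fails.
removal_window_seconds() {
  timeout=$1    # seconds a single check may take (default 10)
  interval=$2   # seconds between checks (default 1)
  retries=$3    # consecutive failures before removal (default 3)
  echo $(( retries * timeout + (retries - 1) * interval ))
}

removal_window_seconds 10 1 3   # prints 32
```

Under these assumptions a single 12-second pause fails one check but does not by itself remove the node; an outage of roughly half a minute does.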

When a node is removed, the master waits for index.unassigned.node_left.delayed_timeout (default 1 minute) before reassigning shards. If heap pressure caused the removal, reallocation begins after the delay. Recovery copies segments and rebuilds in-memory structures on the targets, consuming heap, CPU, network, and disk. If targets already run above 85% heap, recovery traffic pushes their own GC pauses past the fault-detection threshold. The cascade accelerates.
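The delay is a dynamic per-index setting. Here is a minimal sketch of changing it on all existing indices, for example to give a paused node time to rejoin before recovery traffic starts. The ES_URL variable and the function name are assumptions, and the 5m value in the example is illustrative, not a recommendation.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: same endpoint as the other commands

# set_delayed_timeout VALUE: VALUE is a time value such as 5m.
# Defined here but not called automatically.
set_delayed_timeout() {
  curl -s -X PUT "$ES_URL/_all/_settings" \
    -H 'Content-Type: application/json' \
    -d "{\"settings\": {\"index.unassigned.node_left.delayed_timeout\": \"$1\"}}"
}

# Example: set_delayed_timeout 5m
```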

The root cause is almost always excessive long-lived heap occupancy. The usual suspects are too many shards per node (segment metadata overhead), mapping explosions (bloated cluster state), fielddata loading on text fields, large aggregation result sets, or unbounded bulk request buffering.

flowchart TD
    A[Heap sustains above 85%] --> B[Frequent old GC]
    B --> C[Pause exceeds 10s]
    C --> D[Fault detection fails]
    D --> E[Master removes node]
    E --> F[Shard reallocation]
    F --> G[Recovery traffic on survivors]
    G --> H[Heap pressure rises]
    H --> B

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Too many shards / segment metadata | segments.memory growing; shard count approaching 1000 per node | _cat/nodes?v&h=name,segments.memory,segments.count |
| Mapping explosion | Pending tasks climbing; cluster state bloated | _cluster/stats?filter_path=indices.mappings.total_field_count |
| Fielddata on text fields | fielddata.memory_size high; fielddata circuit breaker trips | _nodes/stats/indices/fielddata?fields=* |
| Expensive aggregations | Query latency spikes; request breaker trips | _tasks?detailed=true&actions=*search* |
| Oversized bulk indexing | Write queues high; indexing pressure near limit | _nodes/stats/indexing_pressure |

Quick checks

# Check heap utilization and old GC activity per node
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old'

# Check segment metadata overhead in heap
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.memory,segments.count'

# Check fielddata cache size and evictions
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,fielddata.memory_size,fielddata.evictions'

# Check circuit breaker trips and estimated sizes
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'

# Check active search tasks that may be consuming heap
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

# Check node count to confirm departures
curl -s 'http://localhost:9200/_cluster/health?filter_path=number_of_nodes,number_of_data_nodes'

# Check total field count across all indices
curl -s 'http://localhost:9200/_cluster/stats?filter_path=indices.mappings.total_field_count'

How to diagnose it

  1. Confirm heap pressure is sustained. Look for heap_used_percent above 85% for multiple minutes, not a transient spike. The critical signal is the post-GC floor: if the minimum heap after old GC is climbing, long-lived objects are accumulating.

  2. Correlate GC pauses with node departures. On the removed node, JVM GC logs should show old-generation pauses exceeding 10 seconds immediately before the node left the cluster. Master logs will show the removal shortly after the fault-detection timeout.

  3. Identify the heap consumer. Compare segments.memory, fielddata.memory_size, query_cache.memory_size, and request_cache.memory_size via _nodes/stats/indices/segments,fielddata,query_cache,request_cache. Segment metadata and fielddata are the most common culprits in death spirals.

  4. Check for circuit breaker trips. A rising tripped count on the parent breaker means the node is near OOM. Trips on the request breaker suggest large aggregations; trips on the fielddata breaker indicate aggregations or sorting on text fields.

  5. Find expensive queries. Use _tasks?detailed=true&actions=*search* to identify long-running searches. Cancel abusive tasks with POST /_tasks/<task_id>/_cancel.

  6. Measure cluster state bloat. A large cluster state consumes heap on every node. Check total field count with _cluster/stats. If field count grows without bound, dynamic mapping is the driver.

  7. Confirm the cascade on survivors. After a node drops, check whether remaining nodes show rising heap, growing old GC times, and new circuit breaker trips as recovery traffic arrives.
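The post-GC floor check from step 1 can be approximated from a series of heap_used_percent samples. The sketch below compares the lowest sample in the first half of the series with the lowest in the second half. The function name heap_floor_trend and the 10-point threshold are assumptions, and sampled minima only approximate the true post-GC floor recorded in GC logs.

```shell
# heap_floor_trend: reads heap_used_percent samples on stdin
# (one per line, oldest first) and reports whether the floor is rising.
# Assumption: a floor increase of 10 points or more counts as "rising".
heap_floor_trend() {
  awk '
    { v[NR] = $1 + 0 }
    END {
      if (NR < 4) { print "insufficient-data"; exit }
      half = int(NR / 2)
      lo1 = v[1]
      for (i = 2; i <= half; i++) if (v[i] < lo1) lo1 = v[i]
      lo2 = v[half + 1]
      for (i = half + 2; i <= NR; i++) if (v[i] < lo2) lo2 = v[i]
      if (lo2 - lo1 >= 10) print "rising " lo1 " -> " lo2
      else print "stable " lo1 " -> " lo2
    }'
}

# Floor moves from 33 to 48 while peaks stay in a similar range.
printf '%s\n' 35 70 33 72 48 78 55 82 | heap_floor_trend   # prints: rising 33 -> 48
```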

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| jvm.mem.heap_used_percent | Hard ceiling; sustained pressure triggers old GC | Sustained above 85% |
| Old GC pause duration | Stop-the-world pauses cause fault-detection failures | Individual pause exceeds 10 seconds |
| breakers.parent.tripped | Last line of defense before OOM | Any delta greater than 0 |
| segments.memory | Segment metadata lives in old generation | Growing in step with heap floor |
| Node count | Direct indicator of cascade progression | Unplanned drop |
| Unassigned shard count | Precedes recovery load on survivors | Rising after node departure |
| Thread pool queue depth | Precursor to rejection and latency | write or search queue sustained above 50% of max |
| Fielddata cache size | Text-field aggregations waste heap | Above 10% of heap or any evictions |

Fixes

Stop the cascade immediately

WARNING: The next command leaves shards unassigned. Cluster health will turn yellow or red until you re-enable allocation.

Prevent new shard allocation and rebalancing to stop recovery traffic from pushing surviving nodes over their own heap limits:

curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {"cluster.routing.allocation.enable": "none"}
}'

Cancel any expensive queries that are pinning heap:

curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

curl -X POST 'http://localhost:9200/_tasks/<task_id>/_cancel'

Reduce heap consumers

Fielddata. If _nodes/stats/indices/fielddata?fields=* shows large fielddata usage, identify the offending text field and add a keyword sub-field. Update queries to use the keyword field for aggregations and sorting.
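Here is a minimal sketch of adding a keyword sub-field to an existing text field through the update mapping API. The ES_URL variable, the function name, and the index and field names in the example are hypothetical. If the existing field uses non-default parameters such as a custom analyzer, the request body would need to repeat them. Existing documents are not indexed into the new sub-field until they are reindexed or updated.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: same endpoint as the other commands

# add_keyword_subfield INDEX FIELD: adds FIELD.keyword as a multi-field.
# Defined here but not called automatically.
add_keyword_subfield() {
  curl -s -X PUT "$ES_URL/$1/_mapping" \
    -H 'Content-Type: application/json' \
    -d "{\"properties\": {\"$2\": {\"type\": \"text\", \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}}}}}"
}

# Example with hypothetical names: add_keyword_subfield logs-app message
# Aggregations and sorts would then target message.keyword.
```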

Segment metadata. If segments.memory is high, reduce the shard count per node. Close or delete old indices, shrink existing indices, or add data nodes. Force-merge indices that are no longer written to, but never force-merge live indices under heap pressure because the operation itself consumes significant heap and I/O.

Mapping explosion. Cap field growth with index.mapping.total_fields.limit and disable dynamic mapping on indices that receive unstructured JSON. If the cluster state is already bloated, removing indices or fields is the only immediate relief.
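Both controls are per-index and can be applied to an existing index. Here is a minimal sketch, with ES_URL, the function names, and the example index name as assumptions. Note that "dynamic": false keeps unmapped fields in _source without indexing them, while "strict" rejects the document outright.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: same endpoint as the other commands

# cap_fields INDEX LIMIT: set a per-index ceiling on mapped fields.
cap_fields() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "{\"index.mapping.total_fields.limit\": $2}"
}

# disable_dynamic_mapping INDEX: stop new fields from being added to the mapping.
disable_dynamic_mapping() {
  curl -s -X PUT "$ES_URL/$1/_mapping" \
    -H 'Content-Type: application/json' \
    -d '{"dynamic": false}'
}

# Examples with a hypothetical index name (not called automatically):
#   cap_fields logs-app 2000
#   disable_dynamic_mapping logs-app
```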

Circuit breaker tuning. Temporarily lowering indices.breaker.fielddata.limit can force earlier rejection of bad queries, giving the cluster breathing room. Do not raise the parent breaker limit. That invites OOM and a deeper spiral.
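Here is a sketch of lowering the fielddata breaker through the cluster settings API. The default limit is 40% of heap. The ES_URL variable, the function name, and the 30% value in the example are assumptions chosen for illustration.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: same endpoint as the other commands

# lower_fielddata_breaker LIMIT: LIMIT is a heap percentage such as 30%.
# Defined here but not called automatically.
lower_fielddata_breaker() {
  curl -s -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "{\"transient\": {\"indices.breaker.fielddata.limit\": \"$1\"}}"
}

# Example: lower_fielddata_breaker 30%
```

Sending the same request with the value null in place of the percentage restores the default once the cluster is stable.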

Add capacity

If the cluster is simply undersized for the data volume, add data nodes to spread shards and reduce per-node heap occupancy. Do not increase heap beyond the compressed ordinary object pointers (compressed OOPs) threshold. The exact cutoff varies by JVM and platform and sits a little below 32 GB; roughly 26 GB is a safe ceiling on most systems. Above the threshold the JVM falls back to full 64-bit object pointers, which increases effective memory usage. It is better to add nodes than to grow heap past the compressed-OOPs threshold.
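To confirm whether each node is still running with compressed OOPs at its current heap size, the nodes info API reports it per node. Here is a minimal sketch, with ES_URL and the function name as assumptions.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: same endpoint as the other commands

# check_compressed_oops: lists each node's max heap and whether the JVM
# is using compressed ordinary object pointers.
# Defined here but not called automatically.
check_compressed_oops() {
  curl -s "$ES_URL/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_max_in_bytes,nodes.*.jvm.using_compressed_ordinary_object_pointers"
}
```

A node reporting false here has a heap above the threshold for its JVM.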

After heap stabilizes below 75% and GC pauses stop, re-enable allocation:

curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {"cluster.routing.allocation.enable": "all"}
}'

Prevention

  • Monitor the heap floor, not just the peak. A sawtooth between 30% and 75% is normal; a floor that rises from 30% to 60% over days is a leading indicator of the death spiral.
  • Alert on composite conditions: heap_used_percent above 85% combined with increasing old GC time or parent circuit breaker trips. Heap percentage alone creates false positives from transient bursts.
  • Keep sustained heap below 75%. Maintain headroom for bursts and recovery traffic.
  • Monitor segments.memory and segments.count per node. Time-series indices that are closed or read-only should be force-merged to one segment to reduce metadata overhead.
  • Monitor total field count across all indices. Set index.mapping.total_fields.limit and reject dynamic mapping on unstructured data sources.
  • Ensure fielddata cache is near zero in normal operation. Any significant fielddata indicates a mapping or query design problem.
  • During rolling restarts, set cluster.routing.allocation.enable: none and rely on index.unassigned.node_left.delayed_timeout to prevent unnecessary reallocation.
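The composite alert condition from the list above can be sketched as a small decision function. It assumes two consecutive samples of the cumulative old GC time (jvm.gc.collectors.old.collection_time_in_millis) and of the parent breaker trip count, and it checks a single heap sample rather than a sustained window. The function name composite_alert is hypothetical.

```shell
# composite_alert HEAP_PCT GC_MS_PREV GC_MS_NOW TRIPS_PREV TRIPS_NOW
# Fires only when heap is above 85% AND either old GC time or
# parent breaker trips increased between the two samples.
composite_alert() {
  heap=$1; gc_prev=$2; gc_now=$3; trips_prev=$4; trips_now=$5
  if [ "$heap" -gt 85 ] && { [ "$gc_now" -gt "$gc_prev" ] || [ "$trips_now" -gt "$trips_prev" ]; }; then
    echo "ALERT"
  else
    echo "ok"
  fi
}

composite_alert 91 120000 185000 0 0   # heap high and old GC time growing: ALERT
composite_alert 91 120000 120000 0 0   # heap high, no GC growth or trips: ok
```

A production rule would also require the heap condition to hold for several minutes, as described above, before firing.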

How Netdata helps

  • Charts heap_used_percent, old GC duration, and node count together so you can see the cascade form before the cluster turns red.
  • Collects segments.memory and fielddata cache per node to identify heap consumers without running ad-hoc _nodes/stats queries.
  • Supports composite alerts that fire only when heap is sustained above 85% alongside rising old GC time or circuit breaker trips, reducing false positives from transient spikes.
  • Surfaces per-node thread pool queue depths and rejection rates so you can spot survivors being overwhelmed by recovery traffic.