Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor

Your Elasticsearch alert fires: jvm.mem.heap_used_percent has crossed 75 percent and is holding there. You pull up the graph and see a jagged sawtooth climbing toward the ceiling. The first instinct is to add heap or restart the node. Both are usually wrong.

The sawtooth is normal. Elasticsearch runs on the JVM with a young generation that fills with short-lived objects and empties on young garbage collections. The peak of the tooth is noise. The signal that matters is the post-GC floor: the minimum heap used immediately after a collection. In a healthy node, the floor stays between roughly 30 and 50 percent of max heap, and young GC dominates. When the floor trends upward, old generation objects are accumulating. Old GC pauses stop the world, and once a pause exceeds the cluster fault detection timeout, the master removes the node and triggers shard reallocation. That reallocation places more heap pressure on the survivors, beginning a death spiral.
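A minimal sketch of how the floor could be pulled out of a sampled heap_used_percent series, assuming you already export samples at a fixed interval. The function names and the 5-point threshold are illustrative choices, not part of Elasticsearch or any monitoring product.

```python
from typing import List, Sequence


def post_gc_floors(samples: Sequence[float]) -> List[float]:
    """Return the local minima of a heap_used_percent series.

    A sample counts as a floor when it is lower than the sample before it
    and no higher than the sample after it, i.e. the bottom of a tooth.
    """
    floors = []
    for prev, cur, nxt in zip(samples, samples[1:], samples[2:]):
        if cur < prev and cur <= nxt:
            floors.append(cur)
    return floors


def floor_is_rising(floors: Sequence[float], min_increase: float = 5.0) -> bool:
    """Compare the mean floor of the first half with the second half.

    min_increase is an assumed threshold in percentage points.
    """
    if len(floors) < 4:
        return False  # too few collections to call a trend
    half = len(floors) // 2
    early = sum(floors[:half]) / half
    late = sum(floors[half:]) / (len(floors) - half)
    return late - early >= min_increase
```

Sampling coarser than the collection interval will miss some floors, so treat the result as an approximation and confirm against GC logs when it matters.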

What this means

Elasticsearch typically runs with G1GC, which divides the heap into regions assigned to the young or the old generation. Short-lived objects created during indexing and search live in the young generation and are collected frequently and cheaply. Long-lived structures such as segment metadata, fielddata caches, cluster state, and large aggregation buffers end up in the old generation. Reclaiming them is more expensive: G1 marks old regions concurrently and cleans them in mixed collections, and when it cannot keep up it falls back to a long stop-the-world full collection.

Sustained heap_used_percent above 75 percent is a ticket-level concern. Above 85 percent combined with rising old GC time or circuit breaker trips, it is a page. Above 90 percent, the node is at risk of being dropped from the cluster.
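The thresholds above can be sketched as a small classifier. This assumes the heap value passed in is already a sustained reading rather than a single sample; the function name and the "critical" label for the above-90 case are hypothetical.

```python
def heap_severity(heap_used_percent: float,
                  old_gc_time_rising: bool = False,
                  breaker_tripped: bool = False) -> str:
    """Map a sustained heap reading to the severity levels described above."""
    if heap_used_percent > 90:
        # Node is at risk of being dropped from the cluster.
        return "critical"
    if heap_used_percent > 85 and (old_gc_time_rising or breaker_tripped):
        return "page"
    if heap_used_percent > 75:
        return "ticket"
    return "ok"
```

Note that 88 percent with no GC or breaker signal stays at "ticket": heap percentage alone does not justify a page under these rules.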

Set -Xms equal to -Xmx, and keep the maximum at or below about 26 GB so the JVM can keep using compressed ordinary object pointers (compressed oops). More heap is not always better: Lucene relies on the operating system page cache, so leave roughly half of physical RAM for the OS.
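As a rough illustration of that sizing rule, the sketch below takes the smaller of half the physical RAM and the 26 GB figure from the text. The helper names are hypothetical, and the exact compressed-oops threshold varies by JVM and platform, so treat the cap as an assumption.

```python
from typing import List


def recommended_heap_gb(physical_ram_gb: float, oops_cap_gb: float = 26.0) -> float:
    """Half of RAM for the heap, capped at the assumed compressed-oops limit."""
    return min(physical_ram_gb / 2, oops_cap_gb)


def jvm_heap_options(heap_gb: float) -> List[str]:
    """Render identical -Xms and -Xmx flags so the heap never resizes."""
    heap_mb = int(heap_gb * 1024)
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]
```

On a 128 GB host this yields 26 GB of heap and leaves the remainder to the page cache, which is the intended outcome rather than wasted memory.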

flowchart TD
    A[Young GC reclaims ephemeral objects] --> B[Post-GC floor stable]
    C[Segment metadata fielddata cluster state] --> D[Post-GC floor rises]
    D --> E[Old GC fires frequently]
    E --> F[Pause exceeds fault detection timeout]
    F --> G[Master removes node]
    G --> H[Shard reallocation]
    H --> I[Survivor heap pressure increases]
    I --> D

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Shard count and segment metadata | segments.memory grows linearly with shard count; node holds thousands of shards | _cat/nodes?v&h=name,segments.count,segments.memory |
| Fielddata cache on text fields | fielddata.memory_size is significant or evictions are nonzero; fielddata breaker trips | _nodes/stats/indices/fielddata?fields=* |
| Cluster state or mapping explosion | Pending cluster tasks grow; master node heap is high | _cluster/stats?filter_path=indices.mappings.field_types |
| High-cardinality aggregations | Heap spikes during queries; request breaker trips; slow queries in log | _nodes/stats/breaker and /_tasks?detailed=true&actions=*search* |
| Oversized bulk batches | Transient heap spikes; indexing latency rises | _nodes/stats/indices/indexing |

Quick checks

Run these read-only commands against the cluster to characterize the pressure.

# Heap percent and max heap per node (GC counters are not cat columns; see the next call)
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'
# Detailed heap usage and cumulative GC time in millis
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc'
# Segment metadata heap per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory'
# Fielddata cache size and evictions
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,fielddata.memory_size,fielddata.evictions'
# Circuit breaker estimated sizes and trip counts
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'
# Active search tasks that may be consuming heap
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'
# Mapping breadth by field type; sum counts as a proxy for cluster state weight
curl -s 'http://localhost:9200/_cluster/stats?filter_path=indices.mappings.field_types'
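If you collect the _nodes/stats/jvm response programmatically, a sketch like the following flattens it into one row per node. It assumes the response has already been parsed from JSON into a dict; the function name and output keys are illustrative. Because the filter_path used above drops the node name, the node ID is used as a fallback key.

```python
from typing import Any, Dict


def summarize_jvm(stats: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Extract heap percent and GC counters per node from _nodes/stats/jvm."""
    summary = {}
    for node_id, node in stats.get("nodes", {}).items():
        jvm = node["jvm"]
        collectors = jvm["gc"]["collectors"]
        summary[node.get("name", node_id)] = {
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "young_count": collectors["young"]["collection_count"],
            "old_count": collectors["old"]["collection_count"],
            # Cumulative since node start, not a per-pause duration.
            "old_time_ms": collectors["old"]["collection_time_in_millis"],
        }
    return summary
```

The counters are cumulative since node start, so a single snapshot says little; store successive summaries and compare them.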

How to diagnose it

  1. Confirm the floor is rising. Compare the minimum heap_used_percent observed over a 10-15 minute window, or sample _nodes/stats/jvm shortly after an old GC event. If the minimum is climbing, you have structural accumulation, not a transient spike.

  2. Identify the dominant heap consumer. Query _nodes/stats/indices/segments,fielddata,query_cache,request_cache and compare segments.memory_in_bytes, fielddata.memory_size_in_bytes, and the memory_size_in_bytes of the query and request caches. Usually one category dominates.

  3. If segments.memory is high, check shard density. Use _cat/allocation?v for the shard count per node and _cat/segments/<index>?v for the segments in each shard; segments.count in _cat/nodes is the node-wide segment total, not a shard count. If a node holds more than a few hundred shards or individual indices show more than 100 segments per shard, segment metadata is the driver. Significant metadata moved off-heap in recent releases, but per-segment-per-field overhead still accumulates.

  4. If fielddata is high, find the offending field. Query _nodes/stats/indices/fielddata?fields=*&filter_path=nodes.*.indices.fielddata.fields. This usually means a text field is being aggregated or sorted. Fielddata is disabled by default on text fields; if it is enabled, that is the problem. Add a keyword sub-field and change the query to use it.

  5. If cache sizes are high, evaluate the working set. Large query or request caches reduce headroom for in-flight operations. Consider whether the cache hit ratio justifies the memory cost, especially on actively written indices where refreshes invalidate cached segments.

  6. If cluster state is large, check mapping breadth. Sum count values from _cluster/stats?filter_path=indices.mappings.field_types. Master nodes need adequate heap to serialize and publish state. If the total field count is growing without bound, you have a mapping explosion.

  7. Correlate with old GC behavior. In _nodes/stats/jvm, check jvm.gc.collectors.old.collection_count and jvm.gc.collectors.old.collection_time_in_millis. If the cumulative time is climbing rapidly, old collections are getting longer or more frequent. For individual pause duration, inspect the node’s GC logs for multi-second collections.

  8. Check for breaker trips. Any delta greater than zero on breakers.parent.tripped means the node is already rejecting operations to protect itself from OOM. breakers.fielddata.tripped points directly to text-field misuse.
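Steps 7 and 8 both depend on deltas between two snapshots, because the GC and breaker counters are cumulative. A minimal sketch, assuming you have saved two parsed responses for the same node; the function names are hypothetical, and the inputs are the jvm.gc.collectors.old object and the breakers object from _nodes/stats.

```python
from typing import Any, Dict


def old_gc_delta(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, float]:
    """Difference between two jvm.gc.collectors.old snapshots of one node."""
    count = curr["collection_count"] - prev["collection_count"]
    time_ms = curr["collection_time_in_millis"] - prev["collection_time_in_millis"]
    # Average only; one long pause can hide behind several short ones.
    avg_pause_ms = time_ms / count if count > 0 else 0.0
    return {"collections": count, "time_ms": time_ms, "avg_pause_ms": avg_pause_ms}


def new_breaker_trips(prev: Dict[str, Any], curr: Dict[str, Any]) -> Dict[str, int]:
    """Return breakers whose tripped counter increased between snapshots."""
    trips = {}
    for name, breaker in curr.items():
        delta = breaker["tripped"] - prev.get(name, {}).get("tripped", 0)
        if delta > 0:
            trips[name] = delta
    return trips
```

A node restart resets both counters, which shows up here as a negative GC delta; discard that interval rather than interpreting it.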

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| heap_used_percent sustained | Instantaneous memory pressure | >75% for more than 5 minutes |
| Post-GC heap floor | Structural accumulation of long-lived objects | Rising over hours or days |
| Old GC collection count and cumulative time | Stop-the-world frequency and duration | Increasing trend; individual pauses >5 seconds |
| segments.memory | Persistent per-segment-per-field overhead | Growing linearly with shard count |
| fielddata.memory_size and evictions | Expensive text-field aggregations | >10% of heap or any evictions |
| breakers.parent.tripped | Node is near OOM | Any delta > 0 |
| Pending cluster tasks | Master coordination health | >20 tasks or any task older than 30 seconds |
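The pending-task thresholds in the last row could be checked with a sketch like this one. It assumes the input is the parsed response of GET /_cluster/pending_tasks, an endpoint the text above does not name, where each task carries time_in_queue_millis; the function name is hypothetical.

```python
from typing import Any, Dict


def pending_tasks_unhealthy(response: Dict[str, Any],
                            max_tasks: int = 20,
                            max_age_ms: int = 30_000) -> bool:
    """True when the queue is too long or any task has waited too long."""
    tasks = response.get("tasks", [])
    if len(tasks) > max_tasks:
        return True
    return any(task["time_in_queue_millis"] > max_age_ms for task in tasks)
```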

Fixes

Shard and segment metadata reduction

Close or delete unused indices and enforce retention with ILM. Warning: closing an index makes it unavailable for search; deleting is irreversible.

Reduce shard count by shrinking existing indices with the _shrink API or reindexing into fewer shards. For read-only indices, force merge to one segment with POST /<index>/_forcemerge?max_num_segments=1. Warning: do not force merge indices that are still receiving writes; the operation is I/O intensive, and the call blocks until the merge completes. Target shard sizes between 10 and 50 GB.
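When planning a shrink, the target shard count must be a factor of the source shard count. The sketch below lists the factors that land inside the 10 to 50 GB range; the function name is hypothetical, and it assumes data is spread evenly across shards, which real indices only approximate.

```python
from typing import List, Tuple


def shrink_candidates(source_shards: int, index_size_gb: float,
                      min_gb: float = 10.0,
                      max_gb: float = 50.0) -> List[Tuple[int, float]]:
    """Return (target_shards, approx_shard_size_gb) pairs valid for _shrink."""
    candidates = []
    for target in range(1, source_shards):
        if source_shards % target != 0:
            continue  # _shrink only accepts factors of the source count
        shard_gb = index_size_gb / target
        if min_gb <= shard_gb <= max_gb:
            candidates.append((target, shard_gb))
    return candidates
```

An empty result means no shrink target fits the range, in which case reindexing into a freshly chosen shard count is the remaining option.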

Fielddata and mapping fixes

Stop aggregating or sorting on text fields. Use keyword sub-fields instead. If fielddata is explicitly enabled in a mapping, remove it. You can set indices.fielddata.cache.size to place a hard cap, but evictions under that cap are a sign of a mapping problem, not a healthy state.
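To find mappings where fielddata was explicitly enabled, a sketch like the following walks the mappings object of one index, as returned under <index>.mappings by GET /<index>/_mapping. The function name is hypothetical, and the field names in any mapping you feed it are your own.

```python
from typing import Any, Dict, List


def text_fields_with_fielddata(mappings: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return dotted paths of text fields that have fielddata set to true."""
    found = []
    for name, spec in mappings.get("properties", {}).items():
        path = f"{prefix}{name}"
        if spec.get("type") == "text" and spec.get("fielddata") is True:
            found.append(path)
        # Multi-fields live under "fields" and can be text as well.
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec.get("type") == "text" and sub_spec.get("fielddata") is True:
                found.append(f"{path}.{sub_name}")
        # Object and nested fields carry their own "properties".
        found.extend(text_fields_with_fielddata(spec, prefix=f"{path}."))
    return found
```

Each path it returns is a candidate for a keyword sub-field plus a query change, after which the fielddata flag can be removed.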

Query and aggregation tuning

Cancel heavy tasks identified via /_tasks using POST /_tasks/{task_id}/_cancel. Warning: this aborts in-flight user requests.

Reduce aggregation cardinality; high-cardinality terms aggregations consume disproportionate heap. For heavy terms aggregations, consider tuning execution_hint (for example, map versus global_ordinals) based on your cardinality and shard layout.

Reduce size parameters or replace deep paging with search_after.
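A minimal sketch of search_after paging, assuming each page is requested with the same query and sort and that the sort ends in a unique tiebreaker field. The helper name is hypothetical; the only Elasticsearch behavior relied on is that every hit in a sorted response carries a sort array.

```python
from typing import Any, Dict, List, Optional


def next_page_body(query: Dict[str, Any], sort: List[Dict[str, str]], size: int,
                   last_hit: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a search body; pass the previous page's last hit to continue."""
    body: Dict[str, Any] = {"query": query, "sort": sort, "size": size}
    if last_hit is not None:
        # Resume strictly after the sort values of the last hit seen.
        body["search_after"] = last_hit["sort"]
    return body
```

Unlike from plus size, each request costs the same regardless of how deep the page is, so the coordinating node never has to hold and sort the skipped hits.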

Cluster state and master relief

Pause rapid index creation or ILM churn if pending tasks are backing up. Cap field growth with index.mapping.total_fields.limit and index.mapping.depth.limit. If cluster state size is large, deploy dedicated master nodes with sufficient heap; three master-eligible nodes is the standard minimum.
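To track whether field growth is under control, the counts from _cluster/stats?filter_path=indices.mappings.field_types can be summed and recorded over time. A sketch, assuming the response is already parsed; the function name is hypothetical.

```python
from typing import Any, Dict


def total_mapped_fields(cluster_stats: Dict[str, Any]) -> int:
    """Sum the per-type field counts reported by _cluster/stats."""
    field_types = (cluster_stats.get("indices", {})
                   .get("mappings", {})
                   .get("field_types", []))
    return sum(field_type["count"] for field_type in field_types)
```

The total spans every index, so it is a proxy for cluster state weight and not something to compare directly with the per-index index.mapping.total_fields.limit.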

Emergency braking during a cascade

If nodes are already dropping and reallocating, set cluster.routing.allocation.enable: none to stop the rebalancing storm while you fix the root cause. Warning: this stops all shard allocation, including recovery of replicas, and can turn a yellow cluster red if nodes stay offline. Re-enable allocation only after the floor stabilizes.
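The request body for this toggle can be sketched as below, to be sent with PUT /_cluster/settings. Using the persistent scope is an assumption here, and the function name is hypothetical. Passing None serializes to null, which removes the override and restores the default of allowing all allocation.

```python
import json
from typing import Any, Dict, Optional


def allocation_settings(enable: Optional[str]) -> Dict[str, Any]:
    """Body for PUT /_cluster/settings; "none" freezes, None resets."""
    return {"persistent": {"cluster.routing.allocation.enable": enable}}


freeze_body = json.dumps(allocation_settings("none"))
restore_body = json.dumps(allocation_settings(None))
```

Keeping the restore body next to the freeze body makes it harder to leave allocation disabled after the incident.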

Lower indices.breaker.fielddata.limit temporarily to force earlier rejection of abusive queries while you fix mappings. Warning: this can cause legitimate aggregations to trip the breaker immediately.

Prevention

  • Monitor the post-GC floor, not just the peak. Alert when the floor rises above 50 percent or trends upward over a week.
  • Track shard count per node and segment count per index. Keep shard counts well below the per-node default limit of 1000, and comfortably below 200 per node if possible.
  • Review mappings before index creation. Disable dynamic mapping or set strict limits to prevent mapping explosions.
  • Size the heap correctly. Set -Xms equal to -Xmx and keep the max at or below 26 GB to preserve compressed OOPs.
  • Leave half of physical RAM for the OS page cache. Do not starve Lucene by overallocating heap.
  • Schedule maintenance operations such as snapshots and force merges outside peak traffic windows.

How Netdata helps

Netdata surfaces jvm.mem.heap_used_percent and GC metrics per node, so you can read the floor without manual GC log inspection. Correlate heap usage with indexing rate, search rate, and thread pool rejections to distinguish ingest pressure from query pressure. Alert on composite conditions such as sustained heap above 85 percent combined with rising old GC cumulative time to suppress noise from normal young-GC oscillation. Long-term retention of per-node segments.memory and fielddata sizes exposes gradual floor rise that point-in-time API checks often miss.