Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor
Your Elasticsearch alert fires: jvm.mem.heap_used_percent has crossed 75 percent and is holding there. You pull up the graph and see a jagged sawtooth climbing toward the ceiling. The first instinct is to add heap or restart the node. Both are usually wrong.
The sawtooth is normal. Elasticsearch runs on the JVM with a young generation that fills with short-lived objects and empties on young garbage collections. The peak of the tooth is noise. The signal that matters is the post-GC floor: the minimum heap used immediately after a collection. In a healthy node, the floor stays between roughly 30 and 50 percent of max heap, and young GC dominates. When the floor trends upward, old generation objects are accumulating. Old GC pauses stop the world, and once a pause exceeds the cluster fault detection timeout, the master removes the node and triggers shard reallocation. That reallocation places more heap pressure on the survivors, beginning a death spiral.
What this means
Elasticsearch runs on the JVM and typically uses G1GC, which divides the heap into young and old regions. Short-lived objects created during indexing and search live in the young generation and are collected frequently and cheaply. Long-lived structures such as segment metadata, fielddata caches, cluster state, and large aggregation buffers live in the old generation and require stop-the-world collections to reclaim.
Sustained heap_used_percent above 75 percent is a ticket-level concern. Above 85 percent combined with rising old GC time or circuit breaker trips, it is a page. Above 90 percent, the node is at risk of being dropped from the cluster.
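As a rough illustration, the thresholds above can be written as a small shell function. This is a sketch rather than an official alerting rule: the function name `classify_heap`, the `ok` and `critical` labels, and the convention of passing 1 or 0 for the two flags are assumptions made for this example. It expects an integer heap percentage that the caller has already confirmed is sustained.

```shell
# classify_heap HEAP_PERCENT [OLD_GC_RISING] [BREAKER_TRIPPED]
# HEAP_PERCENT is an integer; the two flags are 1 (true) or 0 (false).
# Labels are hypothetical: ok, ticket, page, critical.
classify_heap() {
  heap=$1
  gc=${2:-0}
  brk=${3:-0}
  if [ "$heap" -gt 90 ]; then
    # Node is at risk of being dropped from the cluster.
    echo "critical"
  elif [ "$heap" -gt 85 ] && { [ "$gc" -eq 1 ] || [ "$brk" -eq 1 ]; }; then
    # High heap combined with rising old GC time or breaker trips.
    echo "page"
  elif [ "$heap" -gt 75 ]; then
    echo "ticket"
  else
    echo "ok"
  fi
}
```

For example, `classify_heap 88 1 0` prints `page`, while `classify_heap 88 0 0` prints `ticket` because neither corroborating signal is present.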
Configure Elasticsearch with -Xms equal to -Xmx, and keep the max at or below 26 GB so the JVM can keep using compressed ordinary object pointers (compressed oops). More heap is not always better; Lucene relies on the operating system page cache, so leave roughly half of physical RAM for the OS.
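The sizing rule reduces to simple arithmetic: take half of physical RAM and cap the result at 26 GB. The sketch below assumes RAM is given in whole gigabytes; the function names `recommended_heap_gb` and `heap_jvm_options` are hypothetical, and the output is meant to be pasted into a custom JVM options file such as one under `config/jvm.options.d/`.

```shell
# recommended_heap_gb PHYSICAL_RAM_GB
# Prints half of physical RAM in GB (integer division), capped at 26.
recommended_heap_gb() {
  half=$(( $1 / 2 ))
  if [ "$half" -gt 26 ]; then
    half=26
  fi
  echo "$half"
}

# heap_jvm_options PHYSICAL_RAM_GB
# Prints matching -Xms and -Xmx lines so min and max heap are equal.
heap_jvm_options() {
  h=$(recommended_heap_gb "$1")
  printf '%s\n' "-Xms${h}g" "-Xmx${h}g"
}
```

With 64 GB of RAM, `heap_jvm_options 64` prints `-Xms26g` and `-Xmx26g`; with 32 GB it prints `-Xms16g` and `-Xmx16g`.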
```mermaid
flowchart TD
A[Young GC reclaims ephemeral objects] --> B[Post-GC floor stable]
C[Segment metadata fielddata cluster state] --> D[Post-GC floor rises]
D --> E[Old GC fires frequently]
E --> F[Pause exceeds fault detection timeout]
F --> G[Master removes node]
G --> H[Shard reallocation]
H --> I[Survivor heap pressure increases]
I --> D
```

Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Shard count and segment metadata | segments.memory grows linearly with shard count; node holds thousands of shards | _cat/nodes?v&h=name,segments.count,segments.memory |
| Fielddata cache on text fields | fielddata.memory_size is significant or evictions are nonzero; fielddata breaker trips | _nodes/stats/indices/fielddata?fields=* |
| Cluster state or mapping explosion | Pending cluster tasks grow; master node heap is high | _cluster/stats?filter_path=indices.mappings.field_types |
| High-cardinality aggregations | Heap spikes during queries; request breaker trips; slow queries in log | _nodes/stats/breaker and /_tasks?detailed=true&actions=*search* |
| Oversized bulk batches | Transient heap spikes; indexing latency rises | _nodes/stats/indices/indexing |
Quick checks
Run these read-only commands against the cluster to characterize the pressure.
```shell
# Heap percent and GC counters per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,gc.young.count,gc.young.time,gc.old.count,gc.old.time'

# Detailed heap usage and cumulative GC time in millis
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc'

# Segment metadata heap per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory'

# Fielddata cache size and evictions
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,fielddata.memory_size,fielddata.evictions'

# Circuit breaker estimated sizes and trip counts
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'

# Active search tasks that may be consuming heap
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'

# Mapping breadth by field type; sum counts as a proxy for cluster state weight
curl -s 'http://localhost:9200/_cluster/stats?filter_path=indices.mappings.field_types'
```
How to diagnose it
1. Confirm the floor is rising. Compare the minimum `heap_used_percent` observed over a 10-15 minute window, or sample `_nodes/stats/jvm` shortly after an old GC event. If the minimum is climbing, you have structural accumulation, not a transient spike.
2. Identify the dominant heap consumer. Query `_nodes/stats/indices/segments,fielddata,query_cache,request_cache` and compare `segments.memory`, `fielddata.memory_size`, and cache sizes. One category will dominate.
3. If segments.memory is high, check shard density. Use `_cat/nodes?v&h=name,segments.count`. If a node holds more than a few hundred shards or individual indices show more than 100 segments per shard, segment metadata is the driver. Significant metadata moved off-heap in recent releases, but per-segment-per-field overhead still accumulates.
4. If fielddata is high, find the offending field. Query `_nodes/stats/indices/fielddata?fields=*&filter_path=nodes.*.indices.fielddata.fields`. This usually means a `text` field is being aggregated or sorted. Fielddata is disabled by default on text fields; if it is enabled, that is the problem. Add a `keyword` sub-field and change the query to use it.
5. If cache sizes are high, evaluate the working set. Large query or request caches reduce headroom for in-flight operations. Consider whether the cache hit ratio justifies the memory cost, especially on actively written indices where refreshes invalidate cached segments.
6. If cluster state is large, check mapping breadth. Sum `count` values from `_cluster/stats?filter_path=indices.mappings.field_types`. Master nodes need adequate heap to serialize and publish state. If the total field count is growing without bound, you have a mapping explosion.
7. Correlate with old GC behavior. In `_nodes/stats/jvm`, check `jvm.gc.collectors.old.collection_count` and `jvm.gc.collectors.old.collection_time_in_millis`. If the cumulative time is climbing rapidly, old collections are getting longer or more frequent. For individual pause duration, inspect the node's GC logs for multi-second collections.
8. Check for breaker trips. Any delta greater than zero on `breakers.parent.tripped` means the node is already rejecting operations to protect itself from OOM. `breakers.fielddata.tripped` points directly to text-field misuse.
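Confirming that the floor is rising can be sketched with a small awk-based helper. This assumes you have already exported heap samples as lines of `epoch_seconds heap_used_percent` from your monitoring system; the function names `floor_per_window` and `floor_trend` are hypothetical, and comparing only the first and last window is a deliberately crude trend test.

```shell
# floor_per_window [WINDOW_SECONDS]
# Reads "epoch_seconds heap_used_percent" lines on stdin and prints
# "window_start min_percent" per fixed window, in order of first appearance.
floor_per_window() {
  awk -v w="${1:-900}" '
    NF < 2 { next }                        # skip blank or malformed lines
    {
      b = int($1 / w)                      # window index for this sample
      v = $2 + 0                           # force numeric comparison
      if (!(b in min)) { order[++n] = b; min[b] = v }
      else if (v < min[b]) min[b] = v
    }
    END { for (i = 1; i <= n; i++) print order[i] * w, min[order[i]] }
  '
}

# floor_trend
# Reads floor_per_window output and compares the last floor to the first.
floor_trend() {
  awk '
    { v = $2 + 0; if (NR == 1) first = v; last = v }
    END {
      if (NR < 2) print "insufficient-data"
      else if (last > first) print "rising"
      else print "stable-or-falling"
    }
  '
}
```

A typical invocation would be `floor_per_window 900 < samples.txt | floor_trend`, where `samples.txt` is a placeholder for your exported data. Peaks in the input are ignored; only the minimum in each window contributes to the result.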
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| heap_used_percent sustained | Instantaneous memory pressure | >75% for more than 5 minutes |
| Post-GC heap floor | Structural accumulation of long-lived objects | Rising over hours or days |
| Old GC collection count and cumulative time | Stop-the-world frequency and duration | Increasing trend; individual pauses >5 seconds |
| segments.memory | Persistent per-segment-per-field overhead | Growing linearly with shard count |
| fielddata.memory_size and evictions | Expensive text-field aggregations | >10% of heap or any evictions |
| breakers.parent.tripped | Node is near OOM | Any delta > 0 |
| Pending cluster tasks | Master coordination health | >20 tasks or any task older than 30 seconds |
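Several of these signals are cumulative counters, so the warning sign is the change between two samples rather than the absolute value. The helpers below are a sketch with hypothetical names; they treat a decreasing counter as a reset (for example, after a node restart), and the average pause they compute is only an approximation that does not replace GC logs for individual pause durations.

```shell
# counter_delta PREVIOUS CURRENT
# Prints the increase of a cumulative counter between two samples.
# If CURRENT is lower than PREVIOUS, assume the counter was reset and
# report CURRENT as the delta.
counter_delta() {
  if [ "$2" -lt "$1" ]; then
    echo "$2"
  else
    echo $(( $2 - $1 ))
  fi
}

# avg_old_gc_pause_ms PREV_COUNT PREV_MILLIS CUR_COUNT CUR_MILLIS
# Approximates the mean old GC duration between two samples of
# collection_count and collection_time_in_millis. Prints 0 when no
# old collection happened in the interval.
avg_old_gc_pause_ms() {
  dc=$(counter_delta "$1" "$3")
  dt=$(counter_delta "$2" "$4")
  if [ "$dc" -eq 0 ]; then
    echo 0
  else
    echo $(( dt / dc ))
  fi
}
```

For instance, `avg_old_gc_pause_ms 10 20000 12 32000` prints `6000`, meaning two old collections averaged about six seconds each over the interval, which is above the 5-second warning sign.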
Fixes
Shard and segment metadata reduction
Close or delete unused indices and enforce retention with ILM. Warning: closing an index makes it unavailable for search; deleting is irreversible.
Reduce shard count by shrinking existing indices with the _shrink API or reindexing into fewer shards. For read-only indices, force merge to one segment with POST /<index>/_forcemerge?max_num_segments=1. Warning: do not force merge indices that are still receiving writes; the operation is I/O intensive, and the API call blocks until the merge completes. Target shard sizes between 10 and 50 GB.
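The `_shrink` API only accepts a target shard count that is a factor of the source index's primary shard count. The sketch below picks the smallest such factor that keeps each shard at or under a size cap, defaulting to the 50 GB upper bound mentioned above. The function name `shrink_target` is hypothetical, and it assumes the index size is known in whole gigabytes.

```shell
# shrink_target SOURCE_SHARDS INDEX_SIZE_GB [MAX_SHARD_GB]
# Prints the smallest factor of SOURCE_SHARDS for which
# INDEX_SIZE_GB / factor <= MAX_SHARD_GB (default 50).
# Falls back to SOURCE_SHARDS when no factor satisfies the cap.
shrink_target() {
  src=$1
  size_gb=$2
  max_gb=${3:-50}
  f=1
  while [ "$f" -le "$src" ]; do
    if [ $(( src % f )) -eq 0 ] && [ "$size_gb" -le $(( f * max_gb )) ]; then
      echo "$f"
      return 0
    fi
    f=$(( f + 1 ))
  done
  echo "$src"
}

# Example request once a target is chosen (index names are placeholders):
#   curl -s -X POST 'http://localhost:9200/<source>/_shrink/<target>' \
#     -H 'Content-Type: application/json' \
#     -d '{"settings": {"index.number_of_shards": 3}}'
```

For a 120 GB index with 12 primary shards, `shrink_target 12 120` prints `3`. Before shrinking, the source index must be made read-only and a copy of every shard must reside on a single node.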
Fielddata and mapping fixes
Stop aggregating or sorting on text fields. Use keyword sub-fields instead. If fielddata is explicitly enabled in a mapping, remove it. You can set indices.fielddata.cache.size to place a hard cap, but evictions under that cap are a sign of a mapping problem, not a healthy state.
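Adding a `keyword` sub-field is a mapping update with a multi-field definition. The helper below is a sketch that only prints the request body; the function name `keyword_subfield_body`, the default field name `message`, and the `ignore_above` value of 256 are assumptions for illustration, and it presumes the existing field is a plain `text` field with default settings.

```shell
# keyword_subfield_body [FIELD]
# Prints a mapping body that keeps FIELD as text and adds FIELD.keyword.
keyword_subfield_body() {
  field=${1:-message}
  cat <<EOF
{
  "properties": {
    "$field": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword", "ignore_above": 256 }
      }
    }
  }
}
EOF
}

# Example application (index name is a placeholder):
#   keyword_subfield_body message | curl -s -X PUT \
#     'http://localhost:9200/<index>/_mapping' \
#     -H 'Content-Type: application/json' -d @-
```

Documents indexed before the mapping change do not populate the new sub-field until they are reindexed or updated, so aggregations on `message.keyword` only see newer documents at first.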
Query and aggregation tuning
Cancel heavy tasks identified via /_tasks using POST /_tasks/{task_id}/_cancel. Warning: this aborts in-flight user requests.
Reduce aggregation cardinality; high-cardinality terms aggregations consume disproportionate heap. For heavy terms aggregations, consider tuning execution_hint (for example, map versus global_ordinals) based on your cardinality and shard layout.
Reduce size parameters or replace deep paging with search_after.
Cluster state and master relief
Pause rapid index creation or ILM churn if pending tasks are backing up. Cap field growth with index.mapping.total_fields.limit and index.mapping.depth.limit. If cluster state size is large, deploy dedicated master nodes with sufficient heap; three master-eligible nodes is the standard minimum.
Emergency braking during a cascade
If nodes are already dropping and reallocating, set cluster.routing.allocation.enable: none to stop the rebalancing storm while you fix the root cause. Warning: this stops all shard allocation, including recovery of replicas, and can turn a yellow cluster red if nodes stay offline. Re-enable allocation only after the floor stabilizes.
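The allocation toggle is a cluster settings update. The sketch below prints the request body for each valid value of `cluster.routing.allocation.enable` and adds a `reset` keyword that sets the value to `null`, which restores the default. The function name `allocation_body` and the choice of `persistent` over `transient` scope are assumptions for this example.

```shell
# allocation_body all|primaries|new_primaries|none|reset
# Prints a _cluster/settings body for the requested allocation mode.
allocation_body() {
  case "$1" in
    all|primaries|new_primaries|none)
      printf '{"persistent": {"cluster.routing.allocation.enable": "%s"}}\n' "$1"
      ;;
    reset)
      # null removes the override and restores the default (all).
      printf '{"persistent": {"cluster.routing.allocation.enable": null}}\n'
      ;;
    *)
      echo "usage: allocation_body all|primaries|new_primaries|none|reset" >&2
      return 1
      ;;
  esac
}

# Example application:
#   allocation_body none | curl -s -X PUT \
#     'http://localhost:9200/_cluster/settings' \
#     -H 'Content-Type: application/json' -d @-
```

Run it with `none` during the cascade and with `reset` once the floor has stabilized, so the override does not linger in cluster settings.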
Lower indices.breaker.fielddata.limit temporarily to force earlier rejection of abusive queries while you fix mappings. Warning: this can cause legitimate aggregations to trip the breaker immediately.
Prevention
- Monitor the post-GC floor, not just the peak. Alert when the floor rises above 50 percent or trends upward over a week.
- Track shard count per node and segment count per index. Keep shard counts well below the per-node default limit of 1000, and comfortably below 200 per node if possible.
- Review mappings before index creation. Disable dynamic mapping or set strict limits to prevent mapping explosions.
- Size the heap correctly. Set `-Xms` equal to `-Xmx` and keep the max at or below 26 GB to preserve compressed OOPs.
- Leave half of physical RAM for the OS page cache. Do not starve Lucene by overallocating heap.
- Schedule maintenance operations such as snapshots and force merges outside peak traffic windows.
How Netdata helps
Netdata surfaces jvm.mem.heap_used_percent and GC metrics per node, so you can read the floor without manual GC log inspection. Correlate heap usage with indexing rate, search rate, and thread pool rejections to distinguish ingest pressure from query pressure. Alert on composite conditions such as sustained heap above 85 percent combined with rising old GC cumulative time to suppress noise from normal young-GC oscillation. Long-term retention of per-node segments.memory and fielddata sizes exposes gradual floor rise that point-in-time API checks often miss.
Related guides
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert
- How Elasticsearch actually works in production: a mental model for operators







