Elasticsearch long GC pauses: old-generation stop-the-world and node drops
In Elasticsearch 8.x, nodes can drop out of the cluster without logging errors. The master logs node-left with reason disconnected, while the departed node shows no ERROR entries because its JVM was frozen in an old-generation stop-the-world GC pause. A single pause longer than 10 seconds fails a fault-detection check; roughly 30 seconds of total unresponsiveness triggers removal. Once the master reallocates shards, remaining nodes face additional heap pressure and the cascade continues.
The node logs nothing during the pause because every thread, including logging and network I/O, is suspended. Without correlating GC metrics to node departures, the symptom looks like a network problem or sudden crash. In reality, it is almost always structural heap pressure.
Old-gen GC pauses are not a root cause. They are the final warning that the heap is full of long-lived objects the collector cannot reclaim quickly enough. This guide shows how to distinguish an isolated allocation spike from a rising trend that will bring down the cluster, and how to stop the cascade without masking it behind longer timeouts.
What this means
Elasticsearch 8.x uses G1GC by default with its bundled JDK. Old-generation collections reclaim long-lived objects such as segment metadata, fielddata, and cluster state structures. Under normal conditions these pauses are brief. Under heap pressure, old GC cannot keep up, pauses lengthen into seconds or tens of seconds, and the JVM stops every thread, including those used for cluster coordination and transport.
The cluster coordination layer performs follower and leader checks with a default 10-second timeout and three retries before node removal. A GC pause longer than 10 seconds fails one check; roughly 30 seconds of total unresponsiveness causes the master to remove the node. A hard TCP disconnect triggers immediate removal. After removal, the master reallocates that node’s shards to remaining nodes, which generates additional indexing, search, and merge load on peers already near their limits. This is the heap pressure death spiral.
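The timeout arithmetic above can be sketched as a small model. This is a simplified approximation, not Elasticsearch's implementation: it assumes a check is sent at the instant the pause begins, that each check fails exactly at its timeout, and that the next check goes out one interval later. The function names are hypothetical; the defaults mirror the 10-second timeout and three retries described above, plus the default 1-second check interval.

```python
def failed_checks(pause_s, timeout_s=10.0, interval_s=1.0):
    """Count consecutive checks that time out during one stop-the-world pause.

    Simplified model: the first check is sent when the pause starts, each
    check fails after timeout_s, and the next one is sent interval_s later.
    """
    failures = 0
    elapsed = timeout_s  # moment the first check times out
    while elapsed <= pause_s:
        failures += 1
        elapsed += interval_s + timeout_s
    return failures


def node_removed(pause_s, retry_count=3, timeout_s=10.0, interval_s=1.0):
    """True if the pause is long enough to exhaust every retry."""
    return failed_checks(pause_s, timeout_s, interval_s) >= retry_count


for pause in (8, 12, 25, 35):
    print(pause, failed_checks(pause), node_removed(pause))
```

Under these assumptions a 12-second pause costs one check, while a 35-second pause exhausts all three and the node is removed. Several shorter pauses in quick succession can have the same effect, which this model does not capture.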
flowchart TD
A[Heap pressure] --> B[Frequent old-gen GC]
B --> C[Stop-the-world pause]
C --> D{Pause > 10s?}
D -->|Yes| E[Fault detection timeout]
E --> F[Node marked failed]
F --> G[Shard reallocation]
G --> H[More heap pressure on peers]
H --> B
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Too many shards / segment metadata bloat | segments.memory climbing; high shard count; heap floor rising | GET /_cat/nodes?v&h=name,segments.memory,segments.count |
| Fielddata loading on text fields | fielddata.memory_size_in_bytes high; slow logs show aggregations on text | GET /_nodes/stats/indices/fielddata?fields=*&filter_path=nodes.*.indices.fielddata.fields |
| Mapping explosion or bloated cluster state | Pending cluster tasks growing; master heap elevated | GET /_cluster/pending_tasks and GET /_cluster/stats?filter_path=indices.mappings |
| Expensive aggregations or oversized bulk requests | parent or request circuit breaker trips; large tasks in /_tasks | GET /_tasks?detailed=true&actions=*search* |
| Inadequate heap for sustained workload | heap_used_percent >85% with rising old GC count and time | GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc |
Quick checks
# Per-node heap usage and cumulative GC time per collector (young and old)
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.*.collection_time_in_millis'
# Cluster health and node count for recent departures
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,number_of_nodes'
# Segment metadata pressure per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.memory,segments.count'
# Currently running search tasks
curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*'
# Hot threads to see what is consuming CPU
curl -s 'http://localhost:9200/_nodes/hot_threads'
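For repeated checks it can help to rank nodes by heap usage rather than read raw JSON. The sketch below assumes the unfiltered response of GET /_nodes/stats/jvm has already been parsed into a dict; `summarize_jvm` and the sample values are hypothetical, and the field names follow the node stats layout used in the commands above.

```python
def summarize_jvm(stats):
    """Return (name, heap %, old GC count, old GC ms) rows, highest heap first."""
    rows = []
    for node in stats.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        old = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        rows.append((
            node.get("name", "unknown"),  # absent if filter_path dropped it
            jvm.get("mem", {}).get("heap_used_percent", 0),
            old.get("collection_count", 0),
            old.get("collection_time_in_millis", 0),
        ))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented sample shaped like a node stats response.
sample_stats = {
    "nodes": {
        "a1": {"name": "es-data-1", "jvm": {
            "mem": {"heap_used_percent": 62},
            "gc": {"collectors": {"old": {
                "collection_count": 4, "collection_time_in_millis": 900}}}}},
        "b2": {"name": "es-data-2", "jvm": {
            "mem": {"heap_used_percent": 91},
            "gc": {"collectors": {"old": {
                "collection_count": 57, "collection_time_in_millis": 184000}}}}},
    }
}

for row in summarize_jvm(sample_stats):
    print(row)
```

The GC counters are cumulative since JVM start, so a single snapshot only shows which node has spent the most time in old collections overall, not whether it is pausing now.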
How to diagnose it
- Correlate node departures with GC. On the removed node, compare the departure timestamp with old-generation collection time spikes under jvm.gc.collectors.
- Decide whether this is an isolated pause or a rising trend. An isolated spike suggests a single oversized allocation or one-off query. A steady increase in collection count and collection time over hours means structural heap pressure that will not self-resolve.
- Identify the dominant heap consumer. Query GET /_nodes/stats/indices/segments,fielddata,query_cache,request_cache,completion and compare memory sizes. If segment memory dominates, shard count or field count is the driver. If fielddata dominates, mappings are the driver. If request cache or query cache dominate, search patterns are the issue.
- Inspect cluster state pressure. High pending tasks on the master combined with master-node heap spikes point to mapping explosion or excessive index churn. Check GET /_cluster/stats?filter_path=indices.mappings for runaway field counts and compare with the master’s jvm.mem.heap_used_percent.
- Find expensive in-flight operations. Use GET /_tasks?detailed=true&actions=*search* to look for long-running aggregations, scrolls, or bulk operations that coincide with the GC window. Cancel them with POST /_tasks/{task_id}/_cancel if safe. Warning: Cancelling tasks aborts in-flight queries and can return errors to clients.
- Verify fault-detection configuration. Query GET /_cluster/settings?include_defaults=true and inspect cluster.fault_detection.*. If follower_check.timeout or leader_check.timeout have been raised above the 10-second default, the cluster is likely masking chronic heap pressure instead of fixing it.
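The "dominant heap consumer" step can be sketched as a comparison over one node's entry from the indices stats response. This is a minimal illustration, assuming the node's JSON is already parsed; `dominant_consumer` and the sample numbers are hypothetical, and a reported size is only an approximation of the heap each structure actually holds.

```python
# Stats section -> field holding its size in bytes.
CONSUMERS = {
    "segments": "memory_in_bytes",
    "fielddata": "memory_size_in_bytes",
    "query_cache": "memory_size_in_bytes",
    "request_cache": "memory_size_in_bytes",
    "completion": "size_in_bytes",
}


def dominant_consumer(node_stats):
    """Return (largest section, {section: bytes}) for one node's stats."""
    indices = node_stats.get("indices", {})
    sizes = {
        section: indices.get(section, {}).get(field, 0)
        for section, field in CONSUMERS.items()
    }
    return max(sizes, key=sizes.get), sizes


# Invented values for a single node.
sample_node = {"indices": {
    "segments": {"memory_in_bytes": 120_000_000},
    "fielddata": {"memory_size_in_bytes": 2_400_000_000},
    "query_cache": {"memory_size_in_bytes": 300_000_000},
    "request_cache": {"memory_size_in_bytes": 50_000_000},
    "completion": {"size_in_bytes": 0},
}}

top, sizes = dominant_consumer(sample_node)
print(top, sizes[top])
```

In this sample fielddata dominates, which by the reasoning above would point at mappings rather than shard count.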
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| jvm.mem.heap_used_percent | Sustained elevation precedes old GC storms | >75% sustained |
| Old-generation GC collection time | Measures stop-the-world duration | Rate increasing; single pause >10 s |
| Old-generation GC collection count | Frequency of old collections | Rising trend over hours |
| Post-GC heap floor | Long-lived object accumulation | Minimum heap after GC creeps upward |
| segments.memory | Segment metadata lives in old generation | Growing linearly with shard count |
| breakers.parent.estimated_size_in_bytes vs limit_size_in_bytes | Proximity to OOM protection | Ratio consistently >70% |
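Because the GC counters are cumulative, the signals in this table come from differences between two samples rather than from raw values. The sketch below shows one way to derive them; the function names are hypothetical and the inputs are assumed to be the old collector and parent breaker objects from two node stats polls.

```python
def old_gc_delta(prev, curr):
    """Difference between two samples of the old collector's counters.

    Returns None when a counter went backwards, which usually means the
    JVM restarted between samples.
    """
    count = curr["collection_count"] - prev["collection_count"]
    millis = curr["collection_time_in_millis"] - prev["collection_time_in_millis"]
    if count < 0 or millis < 0:
        return None
    average = millis / count if count else 0.0
    return {"collections": count, "gc_millis": millis, "avg_pause_millis": average}


def breaker_ratio(parent):
    """Estimated size as a fraction of the parent breaker limit."""
    limit = parent["limit_size_in_bytes"]
    return parent["estimated_size_in_bytes"] / limit if limit else 0.0


previous = {"collection_count": 10, "collection_time_in_millis": 2000}
current = {"collection_count": 14, "collection_time_in_millis": 50000}
print(old_gc_delta(previous, current))
print(breaker_ratio({"estimated_size_in_bytes": 6_000_000_000,
                     "limit_size_in_bytes": 8_000_000_000}))
```

The average hides outliers: four collections totalling 48 seconds could be four 12-second pauses or one 45-second pause, so a high average is a reason to check the GC log rather than a measurement of any single pause.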
Fixes
Immediate stabilization
If nodes are dropping in a cascade, stop the rebalancing storm before fixing the root cause. Set cluster.routing.allocation.enable: none to prevent the master from relocating shards onto already stressed peers. Warning: This stops all shard allocation and relocation, including replica recovery. Re-enable it after stabilization; the cluster will not self-heal while it is set.
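As an illustration, the request body for PUT /_cluster/settings can be built and validated before it is sent. This sketch only constructs the JSON; `allocation_body` is a hypothetical helper, and using the persistent scope is an assumption about how you want the setting to survive restarts.

```python
import json

# Values accepted by cluster.routing.allocation.enable.
VALID_MODES = {"all", "primaries", "new_primaries", "none"}


def allocation_body(mode):
    """Body for PUT /_cluster/settings; mode=None resets to the default."""
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"unknown allocation mode: {mode!r}")
    return {"persistent": {"cluster.routing.allocation.enable": mode}}


print(json.dumps(allocation_body("none")))  # pause allocation
print(json.dumps(allocation_body(None)))    # restore the default afterwards
```

Sending null for the setting removes the override, which is the step that is easy to forget once the cluster looks stable again.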
Cancel expensive in-flight operations identified via /_tasks. This is disruptive to the affected queries, but it can free heap immediately. Reducing indices.breaker.fielddata.limit can force earlier rejection of fielddata-heavy queries, trading query failures for heap headroom.
Structural fixes
When segment memory is high, reduce shard density. Close or delete old indices, implement ILM policies, or use the shrink API to reduce the shard count on heavy nodes. Force-merge read-only indices to one segment to cut segment metadata, but never force-merge a live index receiving writes because the I/O cost can worsen pressure.
For fielddata issues, change mappings to use keyword sub-fields for aggregations and sorting instead of loading fielddata on text fields. If the cluster state is bloated, enforce index.mapping.total_fields.limit and audit dynamic mapping on unstructured input. Each new field increases cluster state size, which is held in heap on every node.
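A mapping audit along these lines can be sketched as a walk over the properties object returned by GET /<index>/_mapping. The helper name and sample mapping are hypothetical, and the rule is a heuristic: a text field without a keyword sub-field is only a problem if something later aggregates or sorts on it.

```python
def risky_text_fields(properties, prefix=""):
    """Yield (field path, reason) for text fields that may load fielddata."""
    for name, spec in properties.items():
        path = prefix + name
        if spec.get("type") == "text":
            sub_fields = spec.get("fields", {}).values()
            if spec.get("fielddata") is True:
                yield path, "fielddata enabled"
            elif not any(sub.get("type") == "keyword" for sub in sub_fields):
                yield path, "no keyword sub-field"
        if "properties" in spec:  # object field: recurse into its children
            yield from risky_text_fields(spec["properties"], path + ".")


# Invented mapping for illustration.
sample_mapping = {"properties": {
    "message": {"type": "text", "fielddata": True},
    "title": {"type": "text",
              "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
    "user": {"properties": {"bio": {"type": "text"}}},
}}

for path, reason in risky_text_fields(sample_mapping["properties"]):
    print(path, reason)
```

Here title is the pattern to aim for: full-text search runs against the text field while aggregations and sorting target title.keyword.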
For query-driven pressure, reduce bulk batch sizes and limit aggregation cardinality at the application layer. Do not simply increase heap unless the current size is below the compressed-OOP threshold (typically somewhere below 32 GB, depending on the JVM and platform). Adding heap delays the problem without fixing the consumer, and heaps above that threshold lose compressed object pointers, so every reference takes more space. Ensure -Xms equals -Xmx so the JVM does not resize the heap under pressure.
Configuration pitfalls
Do not increase cluster.fault_detection.follower_check.timeout or leader_check.timeout to tolerate GC pauses. This hides symptoms, does not fix underlying heap pressure, and delays detection of genuine node failures, extending the window of cluster instability.
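Checking for raised timeouts can be automated. The sketch below assumes the response of GET /_cluster/settings?include_defaults=true&flat_settings=true parsed into a dict, so keys are full dotted setting names; `raised_timeouts` and `to_seconds` are hypothetical helpers, and only the ms, s, m, and h units are handled.

```python
import re

DEFAULT_TIMEOUT_S = 10.0
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
TIMEOUT_KEYS = (
    "cluster.fault_detection.follower_check.timeout",
    "cluster.fault_detection.leader_check.timeout",
)


def to_seconds(value):
    """Convert a time value such as '10s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value.strip())
    if not match:
        raise ValueError(f"unsupported time value: {value!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def raised_timeouts(response):
    """Return fault-detection timeouts set above the 10-second default."""
    # Later scopes override earlier ones: defaults < persistent < transient.
    merged = {**response.get("defaults", {}),
              **response.get("persistent", {}),
              **response.get("transient", {})}
    return {key: merged[key] for key in TIMEOUT_KEYS
            if key in merged and to_seconds(merged[key]) > DEFAULT_TIMEOUT_S}


sample_response = {
    "persistent": {"cluster.fault_detection.follower_check.timeout": "30s"},
    "transient": {},
    "defaults": {"cluster.fault_detection.leader_check.timeout": "10s"},
}
print(raised_timeouts(sample_response))
```

An empty result means both timeouts are at or below the default; anything returned is worth tracing back to the change that introduced it.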
Prevention
Monitor the post-GC heap floor trend, not just the peak percentage. A rising floor is the best leading indicator that the death spiral is approaching. Keep sustained heap usage below 75% and ensure old GC pauses stay well under the 10-second fault-detection timeout. Maintain shard count per node below 500-800 and monitor segment memory weekly for growth.
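One rough way to track the floor from ordinary polling is to take the minimum heap_used_percent in each window of samples and fit a line through those minima. This is a sketch under stated assumptions: the window must be long enough to contain at least one collection, the sample values are invented, and both function names are hypothetical.

```python
def heap_floor(samples, window):
    """Minimum of each consecutive, non-overlapping window of samples."""
    return [min(samples[i:i + window])
            for i in range(0, len(samples) - window + 1, window)]


def slope(values):
    """Least-squares slope of values against their index (change per step)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Invented heap_used_percent samples: peaks stay similar, troughs creep up.
samples = [40, 70, 45, 72, 50, 75, 55, 78]
floors = heap_floor(samples, window=2)
print(floors, slope(floors))
```

In this sample the peaks barely move while the floor rises by five points per window, which is the pattern a peak-only alert would miss.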
Review slow logs regularly for queries that load fielddata or build large aggregation structures. After any restart, allow time for OS page cache warming before declaring latency anomalies, but watch for heap pressure that outlasts the warmup window. In container deployments, ensure the cgroup memory limit matches the JVM heap plus native and off-heap overhead so the Linux OOM killer does not terminate the process before GC can complete.
How Netdata helps
- Correlates old-generation GC time and collection count rate with node reachability and cluster health on the same timeline, making the pause-to-removal pattern visible.
- Tracks jvm.mem.heap_used_percent and estimates the post-GC floor per node without manual delta calculations.
- Alerts on composite conditions such as sustained heap greater than 85% combined with rising old GC time, which reduces noise from transient spikes.
- Surfaces per-node segment memory, shard counts, and circuit breaker utilization alongside GC metrics to reveal gradual heap consumers before they cause pauses.
- Maps thread pool rejections and fault-detection events to the nodes experiencing GC pressure, clarifying whether a node drop is a cause or a symptom.
Related guides
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert
- How Elasticsearch actually works in production: a mental model for operators







