Elasticsearch node OOM-killed: heap ceiling, page cache, and container limits

An Elasticsearch node leaves the cluster, restarts seconds later via systemd or a supervisor, and is killed again. Kernel logs show the OOM-killer terminated the Java process. heap.percent often looks reasonable right up until the kill.

The JVM heap is only one component of resident set size. Off-heap allocations, memory-mapped Lucene segments, and co-located processes all compete for the same memory budget. In containers, the cgroup limit is the hard boundary, not the host’s physical RAM.

Setting -Xmx caps the JVM heap, not the process RSS. Elasticsearch uses off-heap direct buffers for network I/O via Netty 4. The JVM also allocates Metaspace, the JIT code cache, and thread stacks outside the heap. Lucene reads index segments through memory-mapped files, which are served from the OS page cache. That cache drives search performance, but resident mapped pages also count toward RSS.

In a containerized deployment, the OOM-killer triggers when the cgroup’s total memory usage reaches the container limit. This can happen even when heap.percent is below 75% because the heap is not the only consumer.

The parent circuit breaker defaults to 95% of JVM heap with real memory tracking. It rejects operations that would push heap usage too high, but it does not account for Lucene mmap regions, direct ByteBuffers, or memory used by other processes sharing the cgroup. Consequently, the breaker may never trip before the kernel kills the process.
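To make the gap concrete, here is a minimal arithmetic sketch. It assumes the 95% default mentioned above and takes the heap size and container limit in bytes; the function name breaker_budget is hypothetical and only for illustration.

```sh
# Sketch: where the parent breaker trips versus how much memory it never sees.
# $1 = JVM heap size in bytes, $2 = container memory limit in bytes.
breaker_budget() (
  heap_bytes=$1
  limit_bytes=$2
  # The breaker only tracks heap, so its threshold is a fraction of -Xmx.
  breaker_bytes=$(( heap_bytes * 95 / 100 ))
  # Everything else (direct buffers, Metaspace, thread stacks, mmap pages,
  # other processes in the cgroup) has to fit in what is left.
  untracked_bytes=$(( limit_bytes - heap_bytes ))
  echo "breaker_limit_bytes=$breaker_bytes untracked_budget_bytes=$untracked_bytes"
)

# Example: 4 GiB heap inside a 6 GiB container.
# breaker_budget 4294967296 6442450944
```

With a 4 GiB heap in a 6 GiB container, only 2 GiB remains for everything the breaker does not track, so the kernel limit can be reached while the breaker is still far from its threshold.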

flowchart TD
    A[Bulk indexing or aggregations] --> B[JVM heap fills]
    B --> C[Circuit breaker may trip]
    A --> D[Netty direct buffers grow]
    A --> E[Lucene mmap segments expand]
    D --> F[Container RSS hits memory limit]
    E --> F
    B --> F
    F --> G[Kernel OOM-killer sends SIGKILL]
    G --> H[Node exits 137]
    H --> I[Master removes node]
    I --> J[Shard reallocation starts]
    J --> K[Remaining nodes absorb load]
    K --> A

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Container memory limit too tight | Node restarts in a loop with exit code 137; dmesg shows oom-killer | Container memory limit vs -Xmx plus headroom |
| Heap sized above 50% of available RAM | Frequent OOM despite moderate heap percent; search latency high from cold page cache | _cat/nodes heap.max vs container or host total memory |
| Off-heap pressure from segments and buffers | RSS grows steadily while heap stays flat; many open file descriptors | _cat/nodes segments.count and segments.memory |
| Startup RSS spike | Node killed during bootstrap before handling traffic | Service logs for early exit, dmesg timestamp vs start time |
| Co-located services in pod or on host | ES process alone fits budget, but total RSS exceeds limit | Per-process RSS with ps or container sidecar metrics |

Quick checks

# Confirm kernel OOM-killer killed the Java process
dmesg | grep -i "killed process"

# Same check via journalctl if dmesg is empty or rotated
journalctl -k | grep -i "killed process"

# Check JVM heap max and current usage
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'

# Check segment count and reported segment memory per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory'

# Inspect circuit breaker state
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'

# Check for restart loops in systemd logs
systemctl status elasticsearch --no-pager

# Show process RSS (in KiB) on the host; -d, joins multiple PIDs for ps -p
ps -o pid,rss,comm -p "$(pgrep -d, -f org.elasticsearch.bootstrap.Elasticsearch)"

# Read container memory limit from inside the pod/container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || cat /sys/fs/cgroup/memory.max 2>/dev/null
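The raw cgroup value needs some interpretation: cgroup v2 reports the literal string max when no limit is set, and cgroup v1 reports a very large number. The helper below is a sketch that normalizes both. The function name cgroup_memory_limit and the 2^60 cutoff for "effectively unlimited" are assumptions of this example, not kernel-defined constants.

```sh
# Print the container memory limit in bytes, or "unlimited".
# Optional $1 = cgroup mount point (defaults to /sys/fs/cgroup).
cgroup_memory_limit() (
  root=${1:-/sys/fs/cgroup}
  if [ -r "$root/memory.max" ]; then                       # cgroup v2
    value=$(cat "$root/memory.max")
  elif [ -r "$root/memory/memory.limit_in_bytes" ]; then   # cgroup v1
    value=$(cat "$root/memory/memory.limit_in_bytes")
  else
    echo "unknown"
    return 1
  fi
  # v2 prints "max" when unlimited; v1 prints a huge byte count instead.
  # Treating anything at or above 2^60 as unlimited is this sketch's assumption.
  if [ "$value" = "max" ] || [ "$value" -ge 1152921504606846976 ]; then
    echo "unlimited"
  else
    echo "$value"
  fi
)

# Usage inside the container:
# cgroup_memory_limit
```

If this prints unlimited, the process is bounded by host RAM rather than by a container limit.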

How to diagnose it

  1. Verify the OOM kill. Run dmesg | grep -i "out of memory" or journalctl -k | grep -i "killed process". Look for lines naming the Java PID and reporting anon-rss, and note the timestamp. If the node runs in a container, run these commands on the host, not inside the container.

  2. Confirm the restart pattern. Check systemctl status elasticsearch or the container runtime for exit code 137 (128 + SIGKILL 9). Rapid uptime resets in _cat/nodes indicate the supervisor is respawning the process.

  3. Compare heap to limit. Query _cat/nodes?v&h=name,heap.max,heap.percent. Convert heap.max to the same unit as the container limit or host RAM. If heap.max exceeds 50% of the limit, the configuration violates the headroom guideline.

  4. Measure off-heap growth. Check _cat/nodes?v&h=name,segments.count,segments.memory. High segment count increases mmap pressure and file descriptor usage. Then compare heap_committed_in_bytes from _nodes/stats/jvm?filter_path=nodes.*.jvm.mem with the process RSS reported by ps. The difference is memory held outside the heap.

  5. Check circuit breaker history. Query _nodes/stats/breaker. If the parent breaker's tripped count is zero, the OOM was most likely caused by memory the breaker does not track. If it tripped repeatedly, heap pressure preceded the kill but was not the only factor.

  6. Identify co-located consumers. On Kubernetes, check the pod spec for sidecar containers. On bare metal or VMs, sum RSS across all processes. Non-ES consumers can push total usage over the limit even when ES itself is sized correctly.
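Steps 2 and 3 above come down to two small calculations, sketched here as shell helpers. The function names are hypothetical. Both sizes are expected in bytes, which _cat/nodes can return when the bytes=b parameter is added to the request.

```sh
# Step 2: an exit code above 128 means "terminated by signal (code - 128)".
exit_signal() (
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    echo "none"
  fi
)

# Step 3: heap.max as an integer percentage of the memory limit.
# $1 = heap.max in bytes, $2 = container limit or host RAM in bytes.
heap_percent_of_limit() (
  echo $(( $1 * 100 / $2 ))
)

# Usage:
# exit_signal 137                               # signal 9 is SIGKILL
# curl -s 'http://localhost:9200/_cat/nodes?h=name,heap.max&bytes=b'
# heap_percent_of_limit 6442450944 8589934592   # 6 GiB heap, 8 GiB limit
```

A result above 50 from heap_percent_of_limit indicates the heap is larger than the headroom guideline allows.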

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| jvm.mem.heap_used_percent | Largest controllable memory consumer | Sustained >75% |
| breakers.parent.tripped | Indicates heap pressure before OOM | Any delta > 0 |
| segments.memory | Segment metadata and mmap pressure | Growing without index growth |
| process.open_file_descriptors | Proxies for segment count and mmap regions | >80% of max |
| Container or host memory usage vs limit | Hard boundary for OOM-killer | Usage >80% of limit |
| Node uptime / restart frequency | Catches supervisor respawn loops | Unexpected restart within 10 minutes |
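The "usage vs limit" signal can be checked by hand with a short helper. This is a sketch only: the function name memory_pressure is hypothetical, the 80% default mirrors the warning sign in the table, and on cgroup v2 the inputs would typically come from memory.current and memory.max.

```sh
# Report memory usage as a percentage of the limit and flag it above a threshold.
# $1 = current usage in bytes, $2 = limit in bytes, $3 = threshold percent (default 80).
memory_pressure() (
  usage_bytes=$1
  limit_bytes=$2
  threshold=${3:-80}
  percent=$(( usage_bytes * 100 / limit_bytes ))
  if [ "$percent" -gt "$threshold" ]; then
    echo "WARN ${percent}%"
  else
    echo "OK ${percent}%"
  fi
)

# Usage on cgroup v2 (only valid when memory.max holds a number, not "max"):
# memory_pressure "$(cat /sys/fs/cgroup/memory.current)" "$(cat /sys/fs/cgroup/memory.max)"
```

Because cgroup usage includes reclaimable page cache, a WARN result is a prompt to look closer rather than proof that a kill is imminent.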

Fixes

Raise the container memory limit

If the container limit is artificially low, increase it. Do not raise -Xmx to consume all the extra space. Keep -Xmx at no more than 50% of the container limit, capped at roughly 26-30 GB to keep compressed OOPs enabled.
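The sizing rule can be written as a small calculation. The sketch below assumes the conservative 26 GiB end of the range as the default cap; the function name recommended_heap_bytes and the optional cap argument are illustrative choices, not Elasticsearch settings.

```sh
# Suggest a heap size: half the memory limit, capped for compressed OOPs.
# $1 = container limit in bytes, $2 = cap in GiB (default 26, an assumption).
recommended_heap_bytes() (
  limit_bytes=$1
  cap_gib=${2:-26}
  half=$(( limit_bytes / 2 ))
  cap=$(( cap_gib * 1024 * 1024 * 1024 ))
  if [ "$half" -lt "$cap" ]; then
    echo "$half"
  else
    echo "$cap"
  fi
)

# Usage:
# recommended_heap_bytes 17179869184     # 16 GiB limit
# recommended_heap_bytes 137438953472    # 128 GiB limit, hits the cap
```

The result is an upper bound for -Xms and -Xmx. Workloads with heavy off-heap use may need a smaller heap than this suggests.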

Lower -Xmx to free headroom

If you cannot raise the limit, reduce -Xmx. This requires a rolling restart. A smaller heap gives more room to the OS page cache and off-heap allocations. Tradeoff: young GC frequency rises and heavy aggregation loads are more likely to trip the parent circuit breaker.

Reduce segment and shard pressure

High segment counts increase off-heap memory and file descriptor usage. Force merge read-only indices to reduce segments. Delete old indices or close them. Warning: force merge is I/O-intensive and temporarily doubles disk usage for the segments involved.
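As a sketch of the force merge call, the helper below builds the request URL for the _forcemerge API with max_num_segments. The function name forcemerge_url, the ES_HOST variable, and the index name in the usage line are hypothetical. Merging down to a single segment is an assumption that suits read-only indices.

```sh
# Build a force merge URL for one index.
# $1 = index name, $2 = target segment count (default 1).
# ES_HOST is a hypothetical override for the cluster address.
forcemerge_url() (
  index=$1
  max_segments=${2:-1}
  host=${ES_HOST:-http://localhost:9200}
  echo "$host/$index/_forcemerge?max_num_segments=$max_segments"
)

# Usage (run only against indices that no longer receive writes):
# curl -s -X POST "$(forcemerge_url my-old-index)"
```

Running merges one index at a time keeps the extra I/O and disk usage bounded.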

Isolate co-located workloads

Move monitoring agents, log shippers, and sidecars out of the Elasticsearch pod or off the host. If that is impossible, size their memory and subtract it from the available budget before setting -Xmx.
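Subtracting co-located memory from the budget can be sketched as follows. Both function names are hypothetical, RSS values are taken in KiB as ps reports them, and the usage line assumes the Elasticsearch process shows up with the command name java.

```sh
# Sum a column of RSS values (KiB) read from stdin.
sum_rss_kib() {
  awk '{ total += $1 } END { printf "%d\n", total }'
}

# Memory left for Elasticsearch after other processes take their share.
# $1 = memory limit in bytes, $2 = RSS of co-located processes in KiB.
budget_after_colocated() (
  echo $(( $1 - $2 * 1024 ))
)

# Usage: total RSS of everything that is not the java process, then the budget.
# others=$(ps -eo rss=,comm= | awk '$2 != "java" { print $1 }' | sum_rss_kib)
# budget_after_colocated 8589934592 "$others"
```

The remaining budget, not the raw container limit, is the figure to halve when choosing -Xmx.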

Correct CPU container detection

If running in a container with CPU limits, set -XX:ActiveProcessorCount to match the limit. Thread pools sized for too many cores allocate excessive thread stacks, adding to RSS. This also requires a rolling restart.
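One way to derive the value is from the cgroup v2 cpu.max file, which holds a quota and a period in microseconds, or max in place of the quota when there is no limit. The helper below is a sketch: the function name is hypothetical, and rounding fractional CPUs up is an assumption of this example.

```sh
# Turn a cpu.max value ("<quota> <period>" or "max <period>") into a JVM flag.
active_processor_flag() (
  set -- $1              # split the two fields into $1 (quota) and $2 (period)
  quota=$1
  period=$2
  if [ "$quota" = "max" ]; then
    echo "no CPU limit set"
  else
    # Integer ceiling of quota / period, so 1.5 CPUs becomes 2.
    echo "-XX:ActiveProcessorCount=$(( (quota + period - 1) / period ))"
  fi
)

# Usage inside the container:
# active_processor_flag "$(cat /sys/fs/cgroup/cpu.max)"
```

The printed flag is then added to the node's JVM options before the rolling restart.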

Prevention

  • Size heap to half the budget. Set -Xms and -Xmx to no more than 50% of the memory available to the node, with a ceiling of roughly 26-30 GB.
  • Leave headroom for page cache. Elasticsearch relies on the OS page cache for Lucene segment access. Starving the page cache increases search latency and does not prevent OOM.
  • Monitor total memory usage, not just heap. Heap percentage is a sawtooth that hides off-heap growth. Track process or container memory usage against the limit.
  • Account for startup spikes. Some versions briefly allocate extra memory during bootstrap. Size container limits to handle startup, not just steady state.
  • Watch for respawn loops. A supervisor restarting the process after exit 137 creates a flapping node that triggers unnecessary shard reallocation. Alert on unexpected node uptime resets.

How Netdata helps

  • Correlates elasticsearch.jvm_heap_used_percent with system RAM and cgroup memory usage, revealing when RSS diverges from heap.
  • Surfaces kernel OOM-killer events from system logs without manual dmesg searches.
  • Tracks elasticsearch.thread_pool_queued_operations and elasticsearch.breaker_tripped to identify memory pressure before the kernel intervenes.
  • Alerts on node uptime drops and process restarts, catching supervisor respawn loops that mask chronic OOM kills.
  • Monitors per-process RSS and open file descriptors to expose segment-related off-heap growth.