Docker JVM memory tuning: heap, off-heap, and the cgroup mismatch
Your Java container is OOMKilled at 02:00. Heap usage is 60%. Docker reports exit code 137 and OOMKilled: true. The JVM never threw an OutOfMemoryError.
In a container, the kernel enforces memory limits through cgroups, but the JVM heap is only one component of process RSS. Off-heap memory, metaspace, thread stacks, direct byte buffers, and GC overhead all count against the same cgroup limit. When total RSS crosses that limit, the kernel kills the container without a JVM-level error. In some JDK and kernel combinations, the JVM fails to detect the cgroup limit entirely and sizes the heap against host RAM, which guarantees an OOM kill.
This guide shows how to determine whether an OOM kill was caused by off-heap pressure, cgroup detection failure, or oversizing, and how to set explicit limits that prevent recurrence.
What this means
Docker writes memory limits to cgroup v1 memory.limit_in_bytes or cgroup v2 memory.max. Since Java 10, the JVM enables -XX:+UseContainerSupport by default to read these limits and size the heap ergonomically. The default allocates 25% of the detected limit to max heap via -XX:MaxRAMPercentage.
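To see what the ergonomics actually computed for a given limit, print the final flags before deploying. An example check, using eclipse-temurin:21-jre as a stand-in for your own image:
# Ergonomic max heap for a 2 GB limit; expect roughly 512 MB (25%)
docker run --rm -m 2g eclipse-temurin:21-jre \
  java -XX:+PrintFlagsFinal -version | grep -E 'MaxHeapSize|MaxRAMPercentage'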
The kernel’s OOM killer evaluates total RSS: heap, metaspace, code cache, thread stacks, direct byte buffers allocated via NIO or Netty, JNI library allocations, and GC working memory. A container with a 2 GB limit and a 1.5 GB heap can still be killed if a Netty client allocates 600 MB of direct buffers during a traffic spike.
A second failure mode is cgroup detection failure. On Linux kernel 6.12 and later, changes to /proc/cgroups caused JDK 21.0.9 and earlier to misread the container environment as having no memory controller enabled. The JVM falls back to host memory, sets a massive heap, and the container is OOM-killed shortly after startup. Similar detection gaps occur when the container’s cgroup scope lacks an explicit limit and the JVM reads a parent systemd slice instead.
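Unified logging exposes the JVM's cgroup detection steps directly, which makes this failure mode quick to confirm. Both checks below are read-only and assume a JDK 11+ runtime; <your_image> is a placeholder for an image with java on its PATH:
# Trace cgroup detection with the same image and limit as production
docker run --rm -m 2g <your_image> java -Xlog:os+container=trace -version 2>&1 | head -40
# Print the OS metrics the JVM detected, including the memory limit
docker exec <container_id> java -XshowSettings:system -version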
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Cgroup detection failure (JDK/kernel regression) | Max heap is 25% of host RAM, not the container limit; container dies within seconds of start | Effective -Xmx or MaxRAMPercentage output vs memory.max |
| Off-heap exhaustion (Netty, gRPC, NIO direct buffers) | Heap usage healthy; container RSS at limit; OOMKilled under I/O load | docker stats RSS vs expected heap; presence of Netty or gRPC clients |
| Heap sized equal to container limit | Stable for hours, then OOMKilled during GC or class loading | -Xmx value vs docker inspect memory limit |
| Missing container memory limit | Intermittent host-level OOM; container killed unpredictably | docker inspect --format '{{.HostConfig.Memory}}' |
| Ancestor cgroup limit confusion | JVM respects a limit, but it is the parent slice limit, not the container limit | memory.max in container scope vs parent scope |
Quick checks
Run these read-only checks in order. They confirm whether the kill was OOM, what limit the kernel enforced, and what the JVM thought it could use.
# Check OOMKilled status and exit code
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container_id>
# Live memory usage and limit for all containers
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Check configured container memory limit
docker inspect --format '{{.HostConfig.Memory}}' <container_id>
# Inspect cgroup memory limit directly (cgroup v2; run inside the container)
cat /sys/fs/cgroup/memory.max
# Inspect cgroup memory limit directly (cgroup v1; run inside the container)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# Check kernel OOM kill log (run on the host)
dmesg | grep -i "oom\|killed process" | tail -20
# Stream recent OOM events from Docker
docker events --filter event=oom --since 1h
# Memory breakdown inside container (cgroup v2)
cat /sys/fs/cgroup/memory.stat
# Check configured JVM flags (works even on stopped containers)
docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' <container_id>
# Check effective JVM flags of the running process
docker exec <container_id> cat /proc/1/cmdline | tr '\0' ' '
How to diagnose it
- Confirm the kill was OOM. Check docker inspect for OOMKilled: true and exit code 137. If OOMKilled is false, the container received SIGKILL from an external source. If true, proceed to memory accounting.
- Verify the JVM saw the container limit. Look at the effective max heap. If it is sized to host RAM (for example, 25% of 64 GB on a node with a 2 GB container limit), the JVM failed cgroup detection. This is common on kernel 6.12+ with JDK 21.0.9 and earlier, or when the container scope has no explicit memory.max.
- Compare heap to limit. If -Xmx is set explicitly, ensure it is not equal to the container limit. Size -Xmx to no more than 75% of the container limit, leaving headroom for off-heap components.
- Account for off-heap memory. Check whether the application uses Netty, gRPC, or OpenTelemetry. These allocate direct byte buffers outside the heap. If docker stats shows RSS significantly higher than heap usage, off-heap pressure is the gap; see the Native Memory Tracking sketch after this list.
- Check for ancestor limit confusion. On systemd hosts, read memory.max from the container’s cgroup path. If it reads max, the JVM may be reading the parent slice limit. Ensure Docker or Kubernetes sets an explicit limit on the container scope.
- Correlate timing. If OOM kills align with traffic spikes, batch jobs, or JIT compilation warm-up, the container limit may be correctly sized for idle state but insufficient for peak RSS.
- Check restart loops. A container with a restart policy that hits its memory limit repeatedly will show a climbing RestartCount alongside exit code 137. This is a crash loop, not a one-time spike.
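To break the off-heap gap down by category, Native Memory Tracking attributes JVM native allocations to metaspace, threads, code cache, and direct buffers. A sketch, assuming the container was started with NMT enabled (it is off by default and adds measurable overhead):
# Enable tracking at startup (requires a restart), for example via:
#   JAVA_TOOL_OPTIONS="-XX:NativeMemoryTracking=summary"
# Then query the running JVM (PID 1 inside the container)
docker exec <container_id> jcmd 1 VM.native_memory summary
NMT covers JVM-managed native memory; allocations made directly by JNI libraries show up only as a residual difference between RSS and the NMT total.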
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Container memory usage % of limit | The kernel OOM killer evaluates this, not the JVM heap | Sustained usage >80% of limit |
| OOMKilled status | Binary confirmation that the cgroup limit was breached | Any true in production |
| Container RSS from docker stats | Includes heap, off-heap, and cache; this is the kernel’s view | RSS near limit while heap metrics are low |
| JVM heap usage | Growth inside the heap is invisible to cgroup metrics until allocated | Heap consistently >80% of -Xmx |
| Container restart count | OOM crash loops surface as rapid restarts | Restart count increasing with exit code 137 |
| memory.stat anon bytes | Non-reclaimable anonymous memory drives OOM risk | Steady anon growth without traffic increase |
| cgroup memory.max vs JVM max heap | Reveals detection failures where the JVM ignored the limit | Heap sized to host RAM instead of container limit |
Fixes
If the cause is cgroup detection failure
Upgrade the JDK. JDK 21.0.10+ includes the fix for the kernel 6.12+ /proc/cgroups regression that caused the JVM to ignore container limits. JDK 17 users should verify their build contains the backport.
If upgrading is not immediately possible, replace -XX:MaxRAMPercentage with explicit -Xmx and -Xms values that fit comfortably inside the container limit. Explicit flags bypass ergonomic detection entirely and are the safest workaround. Do not disable container support with -XX:-UseContainerSupport; this removes cgroup detection entirely and usually causes larger heap sizing problems.
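A minimal sketch of the workaround, with illustrative values (my-java-app is a placeholder; size the heap against your own container limit):
# Pin the heap explicitly so ergonomic detection is taken out of the picture
docker run -d --memory 2g \
  -e JAVA_TOOL_OPTIONS="-Xms1g -Xmx1g" \
  my-java-app:latest
JAVA_TOOL_OPTIONS is picked up by the JVM at launch, so no entrypoint change is needed.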
If the cause is off-heap exhaustion
Set explicit off-heap limits: -XX:MaxDirectMemorySize, -XX:MaxMetaspaceSize, and -XX:ReservedCodeCacheSize. Without these, native allocations can grow unbounded.
Pad the container memory limit. A practical starting formula is: container limit >= -Xmx + -XX:MaxDirectMemorySize + metaspace headroom + GC overhead. For a 2 GB heap with 700 MB direct memory, set the container limit to at least 3 GB.
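Putting the formula into a concrete run command (values are illustrative; measure your peak direct buffer usage first, and my-java-app is a placeholder):
# 2 GB heap + 700 MB direct + 256 MB metaspace, with headroom inside a 3 GB limit
docker run -d --memory 3g \
  -e JAVA_TOOL_OPTIONS="-Xmx2g -XX:MaxDirectMemorySize=700m -XX:MaxMetaspaceSize=256m" \
  my-java-app:latest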
Profile direct buffer usage if you use Netty or gRPC. Native memory leaks in direct buffers are invisible to standard JVM heap dumps and will keep pushing RSS toward the limit until the kernel kills the container.
If the cause is heap oversizing
Set -Xmx to no more than 75% of the container memory limit. The remaining 25% covers metaspace, thread stacks, code cache, direct buffers, and JVM native overhead.
Do not set both -Xmx and -XX:MaxRAMPercentage. If both are present, -Xmx silently wins and the percentage flag is ignored, which can mislead operators who expect percentage-based sizing to be active during an incident.
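If you want sizing that scales with the container limit, set only the percentage flag. A sketch, again with a placeholder image:
# 75% of the cgroup limit goes to heap; the rest is off-heap headroom
docker run -d --memory 2g \
  -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0" \
  my-java-app:latest
# Effective max heap here is roughly 1.5 GB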
If the cause is missing or incorrect ancestor limit
Ensure Docker or Kubernetes sets an explicit memory limit on the container’s cgroup scope. On cgroup v2, verify that memory.max in the container’s scope is a concrete number, not max. If the parent systemd slice enforces a lower effective limit than the orchestrator intended, adjust the workload configuration so the container scope receives the correct bound.
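To confirm which scope actually carries the limit, walk the hierarchy from the host. The paths below assume cgroup v2 with the systemd cgroup driver; adjust for your layout:
# Resolve the container's cgroup path
cat /proc/$(docker inspect --format '{{.State.Pid}}' <container_id>)/cgroup
# Container scope limit: should be a concrete byte count, not "max"
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max
# Parent slice limit: if lower, it is the effective bound
cat /sys/fs/cgroup/system.slice/memory.max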
Prevention
- Test cgroup detection after every JDK or kernel upgrade. Start a test container with a known memory limit and verify that the ergonomically sized heap matches MaxRAMPercentage of that limit, not host RAM. A minimal smoke test is sketched after this list.
- Monitor RSS, not just heap. Application metrics from JMX show heap state; cgroup metrics show RSS. The gap between them is off-heap pressure.
- Configure explicit limits for every Java container. Set -Xmx, -XX:MaxDirectMemorySize, -XX:MaxMetaspaceSize, and a container memory.max. Implicit defaults hide sizing errors.
- Set container restart policies to surface loops early. A container that OOMs and restarts repeatedly is easier to spot than one that dies once and stays stopped.
- Document your JDK and kernel compatibility matrix. Note which JDK builds are verified against your host kernel and cgroup version to avoid deploying known-bad combinations into production.
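A minimal detection smoke test for the first item above, assuming eclipse-temurin:21-jre as the image under test:
#!/usr/bin/env bash
# Verify the ergonomic heap tracks the container limit, not host RAM
LIMIT_MB=512
HEAP_BYTES=$(docker run --rm -m "${LIMIT_MB}m" eclipse-temurin:21-jre \
  java -XX:+PrintFlagsFinal -version 2>/dev/null |
  awk '/MaxHeapSize/ {print $4; exit}')
HEAP_MB=$((HEAP_BYTES / 1024 / 1024))
echo "limit=${LIMIT_MB}MiB ergonomic_max_heap=${HEAP_MB}MiB"
# Default MaxRAMPercentage is 25, so anything near or above half the limit
# means the JVM sized against host RAM and detection is broken
if [ "${HEAP_MB}" -gt $((LIMIT_MB / 2)) ]; then
  echo "FAIL: cgroup detection broken"; exit 1
fi
echo "OK: heap respects the container limit"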
How Netdata helps
Netdata collects per-container cgroup memory (memory.current, memory.stat) and process RSS. Chart the gap between RSS and heap usage to isolate off-heap growth.
OOM kill alerts and restart count spikes expose crash loops before you inspect logs.
CPU throttling charts let you exclude CPU pressure as a confounding factor during memory incidents.
Related guides
- Docker container high memory usage: how to diagnose it
- Docker container memory leak: how to find one and prove it
- Docker container keeps restarting: causes, checks, and fixes
- Docker container exits immediately: how to diagnose it
- Docker CPU throttling: the hidden cause of container latency
- Docker exit code 1: application errors and how to find them