Docker OOMKilled: causes, detection, and prevention
A container exits with code 137 and restarts. The application loses in-memory state. Dependent services start failing. The restart loop begins. This is the OOMKilled pattern, and it is one of the most common and most misdiagnosed failure modes in Docker environments.
This article covers how to confirm an OOM kill, distinguish it from an external SIGKILL, understand why it happened, and prevent recurrence. It also covers the JVM-in-container memory mismatch, which is responsible for a large share of OOM kills in Java workloads.
What this means
When a container exceeds its cgroup memory limit, the Linux kernel OOM killer terminates one or more processes in that cgroup. Docker records this in the container's state as `OOMKilled: true`. The container exits with code 137 (128 + signal 9, SIGKILL).
The key distinction: exit code 137 means the process received SIGKILL, but SIGKILL can come from two sources. The kernel OOM killer sends it when memory is exhausted. An external actor (operator, orchestrator, or another process) can also send SIGKILL directly. These look identical at the exit code level. The `OOMKilled` flag in `docker inspect` is what separates them.
OOM kills cause immediate, hard termination. There is no graceful shutdown, no flush of in-memory state, no chance for the application to clean up. For stateful workloads, this means data loss or corruption. For stateless workloads, it means a restart and a brief service interruption.
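The pattern is easy to reproduce end to end in a throwaway container. A minimal sketch, assuming a Linux host with the `alpine` image available; `tail /dev/zero` buffers its input forever, so it exhausts a small limit within seconds:

```bash
# Start a container with a deliberately tiny memory limit and a
# command that allocates without bound
docker run --name oom-demo --memory 64m alpine tail /dev/zero

# The command exits once the kernel kills the process; confirm why
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' oom-demo
# Expected output: OOMKilled=true ExitCode=137

docker rm oom-demo   # clean up
```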
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Memory limit set too low | OOMKilled shortly after startup or under normal load | docker inspect limit vs actual usage under load |
| Application memory leak | OOMKilled after hours or days of running; memory grows steadily | Memory usage trend over time |
| JVM heap + overhead exceeds limit | OOMKilled in Java containers; JVM ignores cgroup limit | JVM flags; heap + metaspace + native overhead vs limit |
| Traffic spike causing abnormal allocation | OOMKilled during peak load only | Correlate OOM time with request rate |
| Child process OOM where parent survives | Container appears running but degraded; child process killed | Check if PID 1 is still running but a worker process died |
| System-wide memory pressure (no per-container limit) | Multiple containers OOM killed simultaneously | Host memory usage; which containers have no limit set |
Quick checks
```bash
# Confirm OOMKilled status for a specific container
docker inspect --format '{{.State.OOMKilled}}' <container_id>

# Check exit code alongside OOMKilled
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' <container_id>

# Find all containers that have been OOM killed
for c in $(docker ps -aq); do
  oom=$(docker inspect --format '{{.State.OOMKilled}}' "$c")
  [ "$oom" = "true" ] && docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}}' "$c"
done

# Check memory limit and current usage for a running container
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" <container_id>

# Check memory limit set on a container
docker inspect --format '{{.HostConfig.Memory}}' <container_id>
# Returns 0 if no limit is set; otherwise bytes

# Check kernel OOM events in dmesg
dmesg | grep -i "oom\|out of memory\|killed process"

# Check restart count (OOM kills cause restarts if a restart policy is set)
docker inspect --format '{{.RestartCount}}' <container_id>

# Check cgroup memory limit directly (cgroups v1)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes

# Check cgroup memory current usage (cgroups v1)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.usage_in_bytes
```
For the dmesg output, look for lines containing `oom-kill` or `Killed process <pid> (<name>)`. These confirm the kernel OOM killer acted and identify which process was killed.
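On hosts running cgroups v2 (the default on most recent distributions), the v1 paths above do not exist. A sketch of the equivalents, assuming Docker's systemd cgroup driver; with the cgroupfs driver the path layout differs:

```bash
# cgroups v2 equivalents (use the full container ID; the scope path
# shown here is an assumption that depends on your cgroup driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max      # limit ("max" means unlimited)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.current  # current usage in bytes
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.events   # includes an oom_kill counter
```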
How to diagnose it
Step 1: Confirm the kill source.
Run `docker inspect --format '{{.State.OOMKilled}}' <container_id>`. If `true`, the kernel OOM killer acted. If `false` but the exit code is 137, the SIGKILL came from outside the cgroup. Check orchestrator logs, operator history, or any watchdog processes.
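A minimal sketch that wraps this decision into one check (`<container_id>` is a placeholder):

```bash
# Classify the kill source for a single container
c=<container_id>
oom=$(docker inspect --format '{{.State.OOMKilled}}' "$c")
code=$(docker inspect --format '{{.State.ExitCode}}' "$c")
if [ "$code" = "137" ] && [ "$oom" = "true" ]; then
  echo "$c: kernel OOM kill - check memory limit and usage"
elif [ "$code" = "137" ]; then
  echo "$c: external SIGKILL - check orchestrator logs, operators, watchdogs"
else
  echo "$c: exit code $code - not a SIGKILL"
fi
```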
Step 2: Check whether a memory limit was set.
Run `docker inspect --format '{{.HostConfig.Memory}}' <container_id>`. A value of 0 means no per-container limit was set. In this case, the container was killed because the host ran out of memory. Check host-level memory pressure and which other containers or processes were competing.
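If the limit is 0, shift to host-level checks. A sketch, assuming a Linux host with PSI available (kernels 4.20+):

```bash
free -h                      # overall host memory and swap
cat /proc/pressure/memory    # PSI: sustained nonzero "full" values indicate real memory stalls
docker stats --no-stream     # which containers are the biggest consumers right now
```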
Step 3: Check the memory usage trend before the kill.
If you have metrics, look at memory usage in the minutes before the OOM kill. Steady growth over hours suggests a leak. A sudden spike suggests a traffic event or abnormal input. Flat usage near the limit suggests the limit is simply too low for the workload.
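If you have no metrics pipeline, even a crude sampler is enough to tell these three patterns apart. A sketch (`<container_id>` and the log file name are placeholders):

```bash
# Append a timestamped memory sample for one container every 60 seconds;
# leave it running and inspect mem-trend.log later
while true; do
  echo "$(date -Is) $(docker stats --no-stream --format '{{.MemUsage}}' <container_id>)"
  sleep 60
done >> mem-trend.log
```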
Step 4: For JVM workloads, check the full memory footprint.
The JVM heap is not the only memory the JVM uses. A container limit of 2GB with `-Xmx1800m` will OOM kill because the JVM also needs metaspace, code cache, thread stacks, and native memory. The total footprint is typically heap + 20-30% overhead at minimum. Check what JVM flags are set inside the container:
```bash
docker exec <container_id> ps aux | grep java
# Look for -Xmx, -Xms, and -XX:MaxMetaspaceSize flags
```
If no explicit heap flags are set, the JVM may be sizing itself based on host memory rather than the cgroup limit. This is a common source of OOM kills in Java containers. JVM versions that are not cgroup-aware will read total host RAM and set heap accordingly, then exceed the container limit.
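Two ways to confirm what the JVM actually decided, assuming `java` is on the container's PATH:

```bash
# Print the max heap the JVM computed (works on JDK 8+)
docker exec <container_id> java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

# On JDK 10+, log what the JVM detected from the container's cgroup
docker exec <container_id> java -Xlog:os+container=trace -version
```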
Step 5: Check for child process OOM.
If PID 1 in the container is a process manager or shell, the OOM killer may have killed a child worker process rather than PID 1. The container stays “running” but is degraded. Check:
```bash
docker inspect --format '{{.State.OOMKilled}}' <container_id>
# May still be true even if the container is currently running

docker logs <container_id> --tail 50
# Look for worker process crash messages
```
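To confirm which process died, compare the container's PID 1 (as seen from the host) with the PID the kernel logged:

```bash
# Host PID of the container's PID 1
docker inspect --format '{{.State.Pid}}' <container_id>

# Processes currently running inside the container
docker top <container_id>

# The PID the kernel actually killed
dmesg | grep -i 'killed process' | tail -5
```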
Step 6: Correlate with dmesg.
The kernel logs the OOM kill event with the process name, PID, and memory statistics. This is the most authoritative source:
```bash
dmesg | grep -i "oom\|killed process" | tail -20
```
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| `OOMKilled` flag (`docker inspect`) | Confirms memory kill vs external SIGKILL | Any `true` value in production |
| Container memory usage vs limit | Leading indicator before OOM kill | Usage above 75% of limit |
| Container restart count | OOM kills cause restarts if a restart policy is set | Increasing count, especially with exit code 137 |
| Exit code | Distinguishes OOM (137) from graceful stop (143) or app error (1) | Code 137 without `OOMKilled=true` needs separate investigation |
| Host memory usage | System-wide pressure can kill containers without per-container limits | Host memory above 85% |
| `dmesg` OOM events | Kernel-level confirmation and process details | Any OOM kill event |
| Memory usage trend (rate of change) | Distinguishes leak from undersized limit | Steady growth over hours without plateau |
| JVM heap flags (for Java containers) | Ensures the JVM is not sizing its heap against host RAM | No `-Xmx` set, or `-Xmx` plus overhead exceeds container limit |
Fixes
If the limit is too low for the workload
Increase the memory limit with headroom. A container running at 95% of its limit has no buffer for GC pauses, traffic spikes, or normal allocation variance. A reasonable starting point is to set the limit at 150% of the observed peak usage under normal load.
```bash
# Update a running container's memory limit (takes effect immediately)
docker update --memory 2g --memory-swap 2g <container_id>
```
Note: setting `--memory-swap` equal to `--memory` disables swap for the container; set it higher if some swap is acceptable.
If you are managing containers via Compose or `docker run` commands, update the `mem_limit` or `--memory` flag and redeploy, as in the sketch below.
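A minimal Compose sketch; the service and image names are placeholders:

```yaml
services:
  app:
    image: example/app:latest   # hypothetical image
    mem_limit: 2g
    memswap_limit: 2g           # equal to mem_limit disables swap, matching the note above
```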
If the cause is a JVM memory mismatch
Set explicit heap bounds and ensure the JVM is cgroup-aware:
- Use `-Xmx` and `-Xms` to bound the heap explicitly. Leave at least 25-30% of the container limit for non-heap memory.
- Use `-XX:+UseContainerSupport` so the JVM reads cgroup limits rather than host RAM. This is enabled by default in JDK 10 and later, and was backported to JDK 8u191+ (JDK-8146115). JVMs older than these lack full container awareness, which makes explicit heap bounds essential.
- Set `-XX:MaxMetaspaceSize` to cap metaspace growth.
Example: for a 2GB container limit, a conservative JVM configuration might be `-Xmx1200m -Xms1200m -XX:MaxMetaspaceSize=256m`. This leaves roughly 550MB for code cache, thread stacks, and native overhead.
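One way to apply these flags without rebuilding the image is the `JAVA_TOOL_OPTIONS` environment variable, which HotSpot JVMs pick up at startup (the image name is a placeholder):

```bash
docker run -d --memory 2g --memory-swap 2g \
  -e JAVA_TOOL_OPTIONS="-Xms1200m -Xmx1200m -XX:MaxMetaspaceSize=256m" \
  example/java-app:latest
# Alternative on JDK 10+ / 8u191+: size the heap as a fraction of the
# cgroup limit instead of an absolute value, e.g. -XX:MaxRAMPercentage=60.0
```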
If the cause is a memory leak
A memory limit increase buys time but does not fix a leak. The container will OOM kill again, just later.
Steps:
- Capture a heap dump or memory profile before the next OOM kill.
- For the JVM: use `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof` to capture a dump automatically. Note that this fires on a Java-level `OutOfMemoryError`, not on a kernel OOM kill; SIGKILL leaves no chance to write anything, which is another reason to bound the heap below the container limit so the JVM hits its own ceiling first.
- Analyze the dump to identify the leak source.
- Fix the application code or configuration causing unbounded allocation.
In the short term, increasing the limit and adding a restart policy gives the application more runway while the fix is developed.
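A containerized JVM also needs its dump written somewhere that survives the restart. A sketch using a bind mount (image name and paths are assumptions):

```bash
# Write the heap dump to a host directory so it outlives the container
docker run -d --memory 2g --memory-swap 2g \
  -v /var/dumps:/dumps \
  -e JAVA_TOOL_OPTIONS="-Xmx1200m -Xms1200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/heap.hprof" \
  example/java-app:latest
```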
If the cause is system-wide memory pressure
If no per-container limit is set and the host runs out of memory, the kernel OOM killer chooses victims based on its own heuristics. This is unpredictable. The fix is to set explicit memory limits on all containers so the kernel can make informed decisions, and to ensure the sum of all container limits does not exceed available host memory.
```bash
# Find containers with no memory limit set
for c in $(docker ps -q); do
  limit=$(docker inspect --format '{{.HostConfig.Memory}}' "$c")
  name=$(docker inspect --format '{{.Name}}' "$c")
  [ "$limit" = "0" ] && echo "No limit: $name"
done
```
If a child process was OOM killed but PID 1 survived
The container is running but degraded. The application may be silently broken. Options:
- Configure the container so that a child process death causes PID 1 to exit (and trigger a restart).
- Add a health check that detects the degraded state and marks the container unhealthy (see the sketch after this list).
- Use a process supervisor inside the container that restarts the child process.
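For the health-check option, a minimal Dockerfile sketch; the `/healthz` endpoint and port are hypothetical, and the probe should exercise the worker, not just PID 1:

```dockerfile
# Mark the container unhealthy if the worker stops answering
# (assumes curl is present in the image)
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -fsS http://localhost:8080/healthz || exit 1
```

Note that plain Docker only reports the unhealthy state; acting on it (restarting or replacing the container) is up to your orchestrator or a watchdog.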
Prevention
Set explicit memory limits on every production container. A container without a limit can consume all host memory and trigger system-wide OOM kills affecting unrelated workloads.
Add headroom above observed peak usage. Measure peak memory usage under realistic load, then set the limit at 150% of that value. This absorbs GC pauses, traffic spikes, and normal variance.
Configure log rotation. Unbounded container logs consume disk, not memory, but disk exhaustion can cause containers to crash in ways that look like other failures. Keep this separate from OOM investigation.
For JVM workloads, always set `-Xmx` explicitly. Do not rely on the JVM’s automatic sizing. Verify the JVM is reading cgroup limits by checking the effective heap size at startup in container logs.
Monitor memory usage trend, not just current usage. A container at 60% memory usage that is growing 5% per hour will OOM kill in roughly 8 hours. Trend monitoring catches this before the kill.
Alert before the kill, not after. Set alerts at 75-80% of the memory limit. By the time OOMKilled fires, the damage is done.
Test memory limits under load before production deployment. Run load tests against the container with the production memory limit set. Confirm the container survives peak load with headroom remaining.
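A sketch of that test, assuming an HTTP service; `example/app`, the port, and the `hey` load generator are stand-ins for your own service and tooling:

```bash
# Run the service with the production limit, then apply sustained load
docker run -d --name loadtest --memory 2g --memory-swap 2g -p 8080:8080 example/app:latest
hey -z 5m -c 50 http://localhost:8080/         # 5 minutes of load, 50 concurrent requests

docker stats --no-stream loadtest              # peak usage should leave clear headroom
docker inspect --format '{{.State.OOMKilled}}' loadtest   # must be false
docker rm -f loadtest
```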
Document memory budgets for JVM containers. For each Java service, record: heap max, metaspace max, expected native overhead, and container limit. Review this when the JVM version or application changes.
How Netdata helps
Netdata collects cgroup-level memory metrics per container, which makes it practical to catch OOM conditions before they happen:
- Container memory usage vs limit: Netdata tracks `memory.usage` and `memory.limit` per container cgroup, so you can see the ratio trending toward 100% before the kill occurs.
- OOMKilled events: Netdata surfaces the `OOMKilled` state change as a container state event, which can trigger alerts.
- Restart count tracking: Rising restart counts correlated with exit code 137 in Netdata’s container state charts confirm a recurring OOM kill pattern.
- Host memory pressure: Netdata’s host memory charts show system-wide pressure, which helps distinguish per-container OOM from system-wide exhaustion.
- Anomaly detection on memory trends: Netdata’s anomaly advisor can flag containers whose memory usage is growing at an unusual rate, giving earlier warning than threshold-based alerts alone.
Related guides
- Docker exit code 137: OOMKilled or SIGKILL? - detailed breakdown of how to distinguish the two causes of exit code 137
- Docker container high memory usage: how to diagnose it - step-by-step guide for profiling and diagnosing high memory consumption
- Docker container keeps restarting: causes, checks, and fixes - broader restart loop diagnosis, including OOM as one cause
- Docker monitoring checklist: the signals every production host needs - full signal inventory for production Docker hosts
- Docker container high CPU usage: causes and fixes - CPU-side resource pressure that often accompanies memory issues
- Docker CPU throttling: the hidden cause of container latency - CPU quota effects that can be confused with memory-related slowdowns