Docker OOMKilled: causes, detection, and prevention
A container exits with code 137 and restarts. The application loses in-memory state. Dependent services start failing. The restart loop begins. This is the OOMKilled pattern, and it is one of the most common and most misdiagnosed failure modes in Docker environments.
This article covers how to confirm an OOM kill, distinguish it from an external SIGKILL, understand why it happened, and prevent recurrence. It also covers the JVM-in-container memory mismatch, which is responsible for a large share of OOM kills in Java workloads.
What this means
When a container exceeds its cgroup memory limit, the Linux kernel OOM killer terminates one or more processes in that cgroup. Docker records this in the container's state as `OOMKilled: true`. The container exits with code 137 (128 + signal 9, SIGKILL).
The key distinction: exit code 137 means the process received SIGKILL, but SIGKILL can come from two sources. The kernel OOM killer sends it when memory is exhausted. An external actor (operator, orchestrator, or another process) can also send SIGKILL directly. These look identical at the exit code level. The `OOMKilled` flag in `docker inspect` is what separates them.
OOM kills cause immediate, hard termination. There is no graceful shutdown, no flush of in-memory state, no chance for the application to clean up. For stateful workloads, this means data loss or corruption. For stateless workloads, it means a restart and a brief service interruption.
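The pattern is easy to reproduce end to end in a throwaway container. A minimal sketch, assuming a Linux host with the `alpine` image available; `tail /dev/zero` buffers its input forever, so it exhausts a small limit within seconds:

```bash
# Start a container with a deliberately tiny memory limit and a
# command that allocates without bound
docker run --name oom-demo --memory 64m alpine tail /dev/zero

# The command exits once the kernel kills the process; confirm why
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' oom-demo
# Expected output: OOMKilled=true ExitCode=137

docker rm oom-demo   # clean up
```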
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Memory limit set too low | OOMKilled shortly after startup or under normal load | docker inspect limit vs actual usage under load |
| Application memory leak | OOMKilled after hours or days of running; memory grows steadily | Memory usage trend over time |
| JVM heap + overhead exceeds limit | OOMKilled in Java containers; JVM ignores cgroup limit | JVM flags; heap + metaspace + native overhead vs limit |
| Traffic spike causing abnormal allocation | OOMKilled during peak load only | Correlate OOM time with request rate |
| Child process OOM where parent survives | Container appears running but degraded; child process killed | Check if PID 1 is still running but a worker process died |
| System-wide memory pressure (no per-container limit) | Multiple containers OOM killed simultaneously | Host memory usage; which containers have no limit set |
Quick checks
```bash
# Confirm OOMKilled status for a specific container
docker inspect --format '{{.State.OOMKilled}}' <container_id>

# Check exit code alongside OOMKilled
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' <container_id>

# Find all containers that have been OOM killed
for c in $(docker ps -aq); do
  oom=$(docker inspect --format '{{.State.OOMKilled}}' "$c")
  [ "$oom" = "true" ] && docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}}' "$c"
done

# Check memory limit and current usage for a running container
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" <container_id>

# Check memory limit set on a container
docker inspect --format '{{.HostConfig.Memory}}' <container_id>
# Returns 0 if no limit is set; otherwise bytes

# Check kernel OOM events in dmesg
dmesg | grep -i "oom\|out of memory\|killed process"

# Check restart count (OOM kills cause restarts if a restart policy is set)
docker inspect --format '{{.RestartCount}}' <container_id>

# Check cgroup memory limit directly (cgroups v1)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes

# Check cgroup memory current usage (cgroups v1)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.usage_in_bytes
```
For the dmesg output, look for lines containing `oom-kill` or `Killed process <pid> (<name>)`. These confirm the kernel OOM killer acted and identify which process was killed.
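On hosts running cgroups v2 (the default on most recent distributions), the v1 paths above do not exist. A sketch of the equivalents, assuming Docker's systemd cgroup driver; with the cgroupfs driver the path layout differs:

```bash
# cgroups v2 equivalents (use the full container ID; the scope path
# shown here is an assumption that depends on your cgroup driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max      # limit ("max" means unlimited)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.current  # current usage in bytes
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.events   # includes an oom_kill counter
```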
How to diagnose it
Step 1: Confirm the kill source.
Run `docker inspect --format '{{.State.OOMKilled}}' <container_id>`. If `true`, the kernel OOM killer acted. If `false` but the exit code is 137, the SIGKILL came from outside the cgroup. Check orchestrator logs, operator history, or any watchdog processes.
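A minimal sketch that wraps this decision into one check (`<container_id>` is a placeholder):

```bash
# Classify the kill source for a single container
c=<container_id>
oom=$(docker inspect --format '{{.State.OOMKilled}}' "$c")
code=$(docker inspect --format '{{.State.ExitCode}}' "$c")
if [ "$code" = "137" ] && [ "$oom" = "true" ]; then
  echo "$c: kernel OOM kill - check memory limit and usage"
elif [ "$code" = "137" ]; then
  echo "$c: external SIGKILL - check orchestrator logs, operators, watchdogs"
else
  echo "$c: exit code $code - not a SIGKILL"
fi
```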
Step 2: Check whether a memory limit was set.
Run `docker inspect --format '{{.HostConfig.Memory}}' <container_id>`. A value of 0 means no per-container limit was set. In this case, the container was killed because the host ran out of memory. Check host-level memory pressure and which other containers or processes were competing.
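If the limit is 0, shift to host-level checks. A sketch, assuming a Linux host with PSI available (kernels 4.20+):

```bash
free -h                      # overall host memory and swap
cat /proc/pressure/memory    # PSI: sustained nonzero "full" values indicate real memory stalls
docker stats --no-stream     # which containers are the biggest consumers right now
```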
Step 3: Check the memory usage trend before the kill.
If you have metrics, look at memory usage in the minutes before the OOM kill. Steady growth over hours suggests a leak. A sudden spike suggests a traffic event or abnormal input. Flat usage near the limit suggests the limit is simply too low for the workload.
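If you have no metrics pipeline, even a crude sampler is enough to tell these three patterns apart. A sketch (`<container_id>` and the log file name are placeholders):

```bash
# Append a timestamped memory sample for one container every 60 seconds;
# leave it running and inspect mem-trend.log later
while true; do
  echo "$(date -Is) $(docker stats --no-stream --format '{{.MemUsage}}' <container_id>)"
  sleep 60
done >> mem-trend.log
```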
Step 4: For JVM workloads, check the full memory footprint.
The JVM heap is not the only memory the JVM uses. A container limit of 2GB with `-Xmx1800m` will OOM kill because the JVM also needs metaspace, code cache, thread stacks, and native memory. The total footprint is typically heap + 20-30% overhead at minimum. Check what JVM flags are set inside the container:
```bash
docker exec <container_id> ps aux | grep java
# Look for -Xmx, -Xms, and -XX:MaxMetaspaceSize flags
```
If no explicit heap flags are set, the JVM may be sizing itself based on host memory rather than the cgroup limit. This is a common source of OOM kills in Java containers. JVM versions that are not cgroup-aware will read total host RAM and set heap accordingly, then exceed the container limit.
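Two ways to confirm what the JVM actually decided, assuming `java` is on the container's PATH:

```bash
# Print the max heap the JVM computed (works on JDK 8+)
docker exec <container_id> java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

# On JDK 10+, log what the JVM detected from the container's cgroup
docker exec <container_id> java -Xlog:os+container=trace -version
```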
Step 5: Check for child process OOM.
If PID 1 in the container is a process manager or shell, the OOM killer may have killed a child worker process rather than PID 1. The container stays “running” but is degraded. Check:
```bash
docker inspect --format '{{.State.OOMKilled}}' <container_id>
# May still be true even if the container is currently running

docker logs <container_id> --tail 50
# Look for worker process crash messages
```
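To confirm which process died, compare the container's PID 1 (as seen from the host) with the PID the kernel logged:

```bash
# Host PID of the container's PID 1
docker inspect --format '{{.State.Pid}}' <container_id>

# Processes currently running inside the container
docker top <container_id>

# The PID the kernel actually killed
dmesg | grep -i 'killed process' | tail -5
```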
Step 6: Correlate with dmesg.
The kernel logs the OOM kill event with the process name, PID, and memory statistics. This is the most authoritative source:
```bash
dmesg | grep -i "oom\|killed process" | tail -20
```
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| `OOMKilled` flag (`docker inspect`) | Confirms memory kill vs external SIGKILL | Any `true` value in production |
| Container memory usage vs limit | Leading indicator before OOM kill | Usage above 75% of limit |
| Container restart count | OOM kills cause restarts if a restart policy is set | Increasing count, especially with exit code 137 |
| Exit code | Distinguishes OOM (137) from graceful stop (143) or app error (1) | Code 137 without `OOMKilled=true` needs separate investigation |
| Host memory usage | System-wide pressure can kill containers without per-container limits | Host memory above 85% |
| `dmesg` OOM events | Kernel-level confirmation and process details | Any OOM kill event |
| Memory usage trend (rate of change) | Distinguishes leak from undersized limit | Steady growth over hours without plateau |
| JVM heap flags (for Java containers) | Ensures the JVM is not sizing its heap against host RAM | No `-Xmx` set, or `-Xmx` plus overhead exceeds container limit |
Fixes
If the limit is too low for the workload
Increase the memory limit with headroom. A container running at 95% of its limit has no buffer for GC pauses, traffic spikes, or normal allocation variance. A reasonable starting point is to set the limit at 150% of the observed peak usage under normal load.
```bash
# Update a running container's memory limit (takes effect immediately)
docker update --memory 2g --memory-swap 2g <container_id>
```
Note: setting `--memory-swap` equal to `--memory` disables swap for the container; set it higher if some swap is acceptable.
If you are managing containers via Compose or `docker run` commands, update the `mem_limit` or `--memory` flag and redeploy, as in the sketch below.
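A minimal Compose sketch; the service and image names are placeholders:

```yaml
services:
  app:
    image: example/app:latest   # hypothetical image
    mem_limit: 2g
    memswap_limit: 2g           # equal to mem_limit disables swap, matching the note above
```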
If the cause is a JVM memory mismatch
Set explicit heap bounds and ensure the JVM is cgroup-aware:
- Use `-Xmx` and `-Xms` to bound the heap explicitly. Leave at least 25-30% of the container limit for non-heap memory.
- Use `-XX:+UseContainerSupport` so the JVM reads cgroup limits rather than host RAM. This is enabled by default in JDK 10 and later, and was backported to JDK 8u191+ (JDK-8146115). JVMs older than these lack full container awareness, which makes explicit heap bounds essential.
- Set `-XX:MaxMetaspaceSize` to cap metaspace growth.
Example: for a 2GB container limit, a conservative JVM configuration might be `-Xmx1200m -Xms1200m -XX:MaxMetaspaceSize=256m`. This leaves roughly 550MB for code cache, thread stacks, and native overhead.
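One way to apply these flags without rebuilding the image is the `JAVA_TOOL_OPTIONS` environment variable, which HotSpot JVMs pick up at startup (the image name is a placeholder):

```bash
docker run -d --memory 2g --memory-swap 2g \
  -e JAVA_TOOL_OPTIONS="-Xms1200m -Xmx1200m -XX:MaxMetaspaceSize=256m" \
  example/java-app:latest
# Alternative on JDK 10+ / 8u191+: size the heap as a fraction of the
# cgroup limit instead of an absolute value, e.g. -XX:MaxRAMPercentage=60.0
```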
If the cause is a memory leak
A memory limit increase buys time but does not fix a leak. The container will OOM kill again, just later.
Steps:
- Capture a heap dump or memory profile before the next OOM kill.
- For the JVM: use `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof` to capture a dump automatically. Note that this fires on a Java-level `OutOfMemoryError`, not on a kernel OOM kill; SIGKILL leaves no chance to write anything, which is another reason to bound the heap below the container limit so the JVM hits its own ceiling first.
- Analyze the dump to identify the leak source.
- Fix the application code or configuration causing unbounded allocation.
In the short term, increasing the limit and adding a restart policy gives the application more runway while the fix is developed.
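A containerized JVM also needs its dump written somewhere that survives the restart. A sketch using a bind mount (image name and paths are assumptions):

```bash
# Write the heap dump to a host directory so it outlives the container
docker run -d --memory 2g --memory-swap 2g \
  -v /var/dumps:/dumps \
  -e JAVA_TOOL_OPTIONS="-Xmx1200m -Xms1200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/heap.hprof" \
  example/java-app:latest
```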
If the cause is system-wide memory pressure
If no per-container limit is set and the host runs out of memory, the kernel OOM killer chooses victims based on its own heuristics. This is unpredictable. The fix is to set explicit memory limits on all containers so the kernel can make informed decisions, and to ensure the sum of all container limits does not exceed available host memory.
```bash
# Find containers with no memory limit set
for c in $(docker ps -q); do
  limit=$(docker inspect --format '{{.HostConfig.Memory}}' "$c")
  name=$(docker inspect --format '{{.Name}}' "$c")
  [ "$limit" = "0" ] && echo "No limit: $name"
done
```
If a child process was OOM killed but PID 1 survived
The container is running but degraded. The application may be silently broken. Options:
- Configure the container so that a child process death causes PID 1 to exit (and trigger a restart).
- Add a health check that detects the degraded state and marks the container unhealthy (see the sketch after this list).
- Use a process supervisor inside the container that restarts the child process.
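For the health-check option, a minimal Dockerfile sketch; the `/healthz` endpoint and port are hypothetical, and the probe should exercise the worker, not just PID 1:

```dockerfile
# Mark the container unhealthy if the worker stops answering
# (assumes curl is present in the image)
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -fsS http://localhost:8080/healthz || exit 1
```

Note that plain Docker only reports the unhealthy state; acting on it (restarting or replacing the container) is up to your orchestrator or a watchdog.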
Prevention
Set explicit memory limits on every production container. A container without a limit can consume all host memory and trigger system-wide OOM kills affecting unrelated workloads.
Add headroom above observed peak usage. Measure peak memory usage under realistic load, then set the limit at 150% of that value. This absorbs GC pauses, traffic spikes, and normal variance.
Configure log rotation. Unbounded container logs consume disk, not memory, but disk exhaustion can cause containers to crash in ways that look like other failures. Keep this separate from OOM investigation.
For JVM workloads, always set `-Xmx` explicitly. Do not rely on the JVM’s automatic sizing. Verify the JVM is reading cgroup limits by checking the effective heap size at startup in container logs.
Monitor memory usage trend, not just current usage. A container at 60% memory usage that is growing 5% per hour will OOM kill in roughly 8 hours. Trend monitoring catches this before the kill.
Alert before the kill, not after. Set alerts at 75-80% of the memory limit. By the time OOMKilled fires, the damage is done.
Test memory limits under load before production deployment. Run load tests against the container with the production memory limit set. Confirm the container survives peak load with headroom remaining.
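A sketch of that test, assuming an HTTP service; `example/app`, the port, and the `hey` load generator are stand-ins for your own service and tooling:

```bash
# Run the service with the production limit, then apply sustained load
docker run -d --name loadtest --memory 2g --memory-swap 2g -p 8080:8080 example/app:latest
hey -z 5m -c 50 http://localhost:8080/         # 5 minutes of load, 50 concurrent requests

docker stats --no-stream loadtest              # peak usage should leave clear headroom
docker inspect --format '{{.State.OOMKilled}}' loadtest   # must be false
docker rm -f loadtest
```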
Document memory budgets for JVM containers. For each Java service, record: heap max, metaspace max, expected native overhead, and container limit. Review this when the JVM version or application changes.
How Netdata helps
Netdata collects cgroup-level memory metrics per container, which makes it practical to catch OOM conditions before they happen:
- Container memory usage vs limit: Netdata tracks `memory.usage` and `memory.limit` per container cgroup, so you can see the ratio trending toward 100% before the kill occurs.
- OOMKilled events: Netdata surfaces the `OOMKilled` state change as a container state event, which can trigger alerts.
- Restart count tracking: Rising restart counts correlated with exit code 137 in Netdata’s container state charts confirm a recurring OOM kill pattern.
- Host memory pressure: Netdata’s host memory charts show system-wide pressure, which helps distinguish per-container OOM from system-wide exhaustion.
- Anomaly detection on memory trends: Netdata’s anomaly advisor can flag containers whose memory usage is growing at an unusual rate, giving earlier warning than threshold-based alerts alone.
Related guides
- Docker exit code 137: OOMKilled or SIGKILL? - detailed breakdown of how to distinguish the two causes of exit code 137
- Docker container high memory usage: how to diagnose it - step-by-step guide for profiling and diagnosing high memory consumption
- Docker container keeps restarting: causes, checks, and fixes - broader restart loop diagnosis, including OOM as one cause
- Docker monitoring checklist: the signals every production host needs - full signal inventory for production Docker hosts
- Docker container high CPU usage: causes and fixes - CPU-side resource pressure that often accompanies memory issues
- Docker CPU throttling: the hidden cause of container latency - CPU quota effects that can be confused with memory-related slowdowns