Docker container memory leak: how to find one and prove it
Memory that only ever climbs is easy to spot. The harder problem is proving whether the growth is a leak, unbounded caching, or a limit set below the working set. During an incident, operators need to decide in minutes whether to page an on-call developer or bump a cgroup limit. This guide shows how to use cgroup memory.stat, process-level RSS, and container restart patterns to build a defensible diagnosis. You will be able to separate anonymous memory growth from reclaimable cache, identify whether the leak lives in application heap or runtime overhead, and present evidence that justifies either a code fix or a capacity change.
What this means
A memory leak inside a Docker container manifests as a monotonic increase in non-reclaimable memory within the container’s cgroup. The cgroup reports total usage via memory.current (v2) or memory.usage_in_bytes (v1), but this total mixes anonymous pages (heap, stack, runtime data), file-backed pages (page cache), and kernel slab allocations. Only anonymous and slab growth that survives workload idle periods indicates a leak; file growth is usually reclaimable under pressure. The kernel OOM killer operates at the cgroup level. When the container exceeds memory.max, it kills a process inside the cgroup, often PID 1, producing exit code 137 and a restart if a restart policy is configured. A leak turns this into a sawtooth pattern: grow, kill, restart, grow again.
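To see how those components add up for a specific container, a minimal sketch like the one below reads the cgroup v2 files directly. It assumes the systemd cgroup driver (the same path layout used in the quick checks that follow); on a cgroup v1 host or with the cgroupfs driver the paths differ.
# Decompose a container's cgroup v2 memory usage into anon, file, and slab
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <container_name>)
CG=/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope
echo "total: $(cat ${CG}/memory.current) bytes"
awk '$1=="anon" || $1=="file" || $1=="slab" {print $1": "$2" bytes"}' ${CG}/memory.stat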
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Application heap leak | anon in memory.stat grows steadily while traffic is flat | Heap dump or runtime profiler for the main process |
| Native / off-heap leak (JVM, Node) | Total RSS exceeds reported heap size; OOMKilled with heap well below limit | Native memory tracking or runtime-specific off-heap metrics |
| Thread or connection leak | pids.current rises with memory; many threads in docker top | Thread count and connection pool limits inside the container |
| Aggressive unbounded cache | file dominates memory.stat; no OOM kills, but high total usage | Application cache configuration and buffer sizes |
| JVM metaspace / code cache exhaustion | OOMKilled after long uptime despite stable heap | JVM metaspace and code cache utilization |
Quick checks
# Check cgroup v2 memory breakdown and limit
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <container_name>)
echo "=== current / limit ==="
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
echo "=== stat breakdown ==="
grep -E 'anon|file|slab' /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.stat
What to look for: anon should plateau after warmup; file may be large but variable. If anon increases by hundreds of MB per hour with flat traffic, the container is leaking.
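To turn that observation into a growth rate you can cite, a small sampling loop helps; this is a sketch, assuming CONTAINER_ID is set as above and cgroup v2 with the systemd driver.
# Sample anon every 60 seconds and print the growth between samples
STAT=/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.stat
prev=$(awk '$1=="anon" {print $2}' "$STAT")
while sleep 60; do
  cur=$(awk '$1=="anon" {print $2}' "$STAT")
  echo "$(date -Is) anon=${cur}B delta=$(( (cur - prev) / 1024 / 1024 )) MiB/min"
  prev=$cur
done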
# Check OOM history and restart count
docker inspect --format '{{.Name}} RestartCount={{.RestartCount}} OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' <container_name>
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.events
What to look for: RestartCount climbing, oom_kill greater than zero, and ExitCode 137 confirm an OOM-driven restart loop.
# Map container PID to host /proc and read RSS
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' <container_name>)
grep -E 'VmRSS|VmSize' /proc/${CONTAINER_PID}/status
What to look for: VmRSS should track the cgroup anon growth. If it does not, the leak may be in a sidecar process or in kernel slab.
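For a one-off comparison, both numbers can be printed together; this sketch assumes CONTAINER_ID and CONTAINER_PID are set as in the checks above.
# Print cgroup anon next to the main process RSS (both in kB)
anon_kib=$(( $(awk '$1=="anon" {print $2}' /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.stat) / 1024 ))
rss_kib=$(awk '$1=="VmRSS:" {print $2}' /proc/${CONTAINER_PID}/status)
echo "cgroup anon: ${anon_kib} kB, VmRSS of PID ${CONTAINER_PID}: ${rss_kib} kB"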
# Check for PID leaks inside the cgroup
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.current
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max
docker top <container_name> -o pid,ppid,stat,comm | tail -n +2 | wc -l
What to look for: Rising PID count alongside memory growth suggests a thread or fork leak.
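To distinguish thread growth from process growth, compare the main process's Threads count with the total tasks in the cgroup; a sketch, assuming CONTAINER_PID and CONTAINER_ID are set as above on a cgroup v2 host.
# Threads of the main process vs. all tasks in the container's cgroup
grep '^Threads:' /proc/${CONTAINER_PID}/status
wc -l < /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cgroup.threads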
# Review recent OOM killer decisions
dmesg | grep -i "oom-kill\|killed process" | tail -10
What to look for: The container’s cgroup path in the OOM message confirms it was the victim, not another process on the host.
How to diagnose it
1. Confirm the container is dying from memory pressure, not an application crash. Check docker inspect for OOMKilled and ExitCode 137. If OOMKilled is false, the kill came from outside or the application exited voluntarily. Memory leak diagnosis is moot if the cause is a segfault or an external docker kill.
2. Separate anon from file cache using memory.stat. Read memory.stat inside the cgroup. If anon is flat and file is growing, the workload is caching aggressively or reading large files. This is reclaimable and usually not a leak. If anon grows monotonically, the leak is in heap, stack, or runtime native memory.
3. Map cgroup memory to the container’s main process. Get the container PID and read /proc/<PID>/status. If VmRSS tracks the cgroup anon growth closely, the leak is in the main application process. If VmRSS is much smaller, suspect a sidecar process, shared library mapping accounting, or kernel slab growth.
4. Determine whether the growth correlates with workload. Compare memory growth to request rate, job queue depth, or connection count. If memory climbs when traffic is flat, it is a leak. If it climbs only under load and plateaus afterward, it may be a legitimate working set that exceeds the limit.
5. Use runtime-specific tooling to isolate heap from native memory. For JVM containers, compare heap usage to total RSS. If heap is stable but RSS climbs, the leak is off-heap (metaspace, direct buffers, thread stacks). For Go or Node, use runtime memory profiles to distinguish heap growth from runtime overhead. This step turns suspicion into a developer ticket.
6. Capture evidence before the next OOM kill. OOM kills send SIGKILL, which does not allow cleanup. Trigger a heap dump or memory profile while the container is near its limit but still alive, or raise memory.max temporarily to buy time for profiling; a sketch of this follows the list. Document the anon growth rate and the process RSS to prove the leak is reproducible.
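The commands below are one way to do that for a JVM workload; they assume the container is named app, that the JVM runs as PID 1 inside it, and that jcmd is available in the image. For other runtimes, substitute the runtime's own profiler or dump mechanism.
# Buy time by raising the limit, then capture a heap dump while the process is still alive
docker update --memory 3g --memory-swap 3g app
docker exec app jcmd 1 GC.heap_dump /tmp/leak.hprof
docker cp app:/tmp/leak.hprof ./leak-$(date +%Y%m%d-%H%M).hprof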
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| cgroup anon bytes | Non-reclaimable application memory; the true leak indicator | Steady growth over multiple hours with flat traffic |
| cgroup file bytes | Page cache; reclaimable under pressure | Sudden spike without corresponding I/O increase may indicate misconfigured buffers |
| cgroup slab bytes | Kernel allocations inside the cgroup | Growth without bound indicates kernel-side leak or many small objects |
| Container memory usage vs limit | Proximity to OOM kill | Greater than 80% of limit sustained; no headroom for bursts |
| OOM kill events (memory.events) | Confirms kernel is killing due to cgroup limit | Any nonzero oom_kill count in production |
| Container restart count | Crash-loop indicator caused by repeated OOMs | Increasing restart count with ExitCode 137 |
| Process VmRSS from /proc/<pid>/status | Maps cgroup memory to a specific process | RSS tracks cgroup anon growth closely |
| Container PID count | Thread or fork leaks consume memory | pids.current growing alongside memory |
Fixes
If the cause is an application heap leak
There is no quick operational fix. Capture a heap dump or runtime profile, then restart the container to reclaim memory. If the leak is slow, temporarily raise the memory limit to extend the time between restarts while the code is fixed.
If the cause is native or off-heap memory (JVM, Node)
For JVM containers, set -Xmx to roughly 75% of the container memory limit and cap metaspace and code cache explicitly. For Node or Go, inspect native memory allocations and external buffers. If the runtime is misconfigured, a code deploy may not be necessary; a configuration change can stop the leak.
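One illustrative way to apply that sizing is through JAVA_TOOL_OPTIONS at container start; the values below are placeholders for a 2 GiB limit rather than universal recommendations, and my-java-app is a stand-in image name. MaxRAMPercentage is a container-aware alternative to hard-coding -Xmx.
# Example JVM sizing inside a 2 GiB container (illustrative values)
docker run -d --memory 2g --memory-swap 2g \
  -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:MaxMetaspaceSize=256m -XX:ReservedCodeCacheSize=240m" \
  my-java-app:latest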
If the cause is an unbounded thread or connection pool
Limit the pool size in application configuration. If you cannot change the config, set pids.max on the container to prevent a fork or thread bomb from exhausting host PIDs, though this will cause application errors once the limit is hit.
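Docker exposes the cgroup pids controller through the --pids-limit flag; a short sketch with a placeholder value of 512 and a stand-in image name:
# Cap the number of tasks (processes and threads) the container may create
docker run -d --pids-limit 512 my-app:latest
# On recent Docker versions the limit can also be changed on a running container
docker update --pids-limit 512 <container_name>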
If the cause is aggressive caching
Move cache data to a volume or a dedicated cache service so it does not compete with application memory. Alternatively, tune cache eviction policies so that file memory does not pressure the host into reclaiming cache needed by other containers.
If the memory limit is simply too low
Increase memory.max or the Docker --memory flag only after proving the growth is not unbounded. Giving more memory to a true leak only delays the restart. Document the working set size under normal load and set the limit to working set plus a burst margin.
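Once the working set is documented, the change itself is a single command; the 1536m figure is a placeholder for your measured working set plus margin.
# Raise the limit to the observed working set plus a burst margin (value is a placeholder)
docker update --memory 1536m --memory-swap 1536m <container_name>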
Prevention
- Monitor the anon trend, not just total cgroup memory. Total memory includes cache, which fluctuates and causes false alarms.
- Set container memory limits with at least a 20% buffer above the observed working set. A container running perpetually above 80% of its limit has no margin for spikes.
- Configure meaningful health checks that fail before memory reaches the limit, giving the orchestrator a chance to replace the container gracefully.
- For JVM workloads, always size the heap and metaspace to leave headroom inside the cgroup. Heap equal to the container limit guarantees an OOM kill.
- Run automated checks for containers with nonzero restart counts and ExitCode 137. A restart policy can hide a leak for days by making the container appear running.
- Review application connection pool and thread pool limits during deployment. Unbounded pools are the most common source of slow memory leaks in containerized services.
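The restart-count check in the list above is easy to automate; a sketch that flags containers which have restarted and last exited with code 137:
# Flag containers that have restarted and last exited with 137 (likely OOM)
docker ps -a --format '{{.Names}}' | while read -r name; do
  docker inspect --format '{{.Name}} RestartCount={{.RestartCount}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' "$name"
done | awk '/RestartCount=[1-9]/ && /ExitCode=137/'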
How Netdata helps
- Netdata breaks down container memory into anon, file, and slab from cgroup metrics, so you can alert on the component that actually indicates a leak.
- Container memory usage is shown against the cgroup limit, making it easy to spot when a container is heading for an OOM kill before the kernel acts.
- OOM kill events and container restart counts are surfaced per container, correlating the sawtooth restart pattern with memory saturation.
- Process RSS for the container’s PID 1 is tracked on the host, helping you map cgroup growth to the application process quickly.
Related guides
- See Docker container high memory usage: how to diagnose it for broader memory pressure scenarios that are not leaks.
- See Docker OOMKilled: causes, detection, and prevention for detailed OOM mechanics and cgroup behavior.
- See Docker container keeps restarting: causes, checks, and fixes when restart counts climb but the cause is not yet confirmed as memory.
- See Docker exit code 137: OOMKilled or SIGKILL? to distinguish OOM kills from external SIGKILL.
- See Docker monitoring checklist: the signals every production host needs for baseline monitoring coverage.




