Docker monitoring checklist: the signals every production host needs
Production Docker incidents rarely look like Docker problems at first. They show up as application latency, deployment failures, or hosts that suddenly refuse to schedule containers. By the time you notice, the daemon may be hung, a log file has filled the disk, or a container has been silently throttled into unusable latency. This checklist groups the essential production signals into three priority tiers: must-have alerts that keep the host alive, should-have metrics that expose resource pressure before it becomes an outage, and nice-to-have security and internal signals for mature environments. Every signal includes where to read it, whether from the raw cgroup filesystem or the Docker API, so you can instrument hosts without guessing paths.
What this means
Docker is not a single process to monitor. It is a stack of interdependent components: dockerd, containerd, runc, a storage driver, and a network bridge. Each layer emits its own failure signals. A storage driver deadlock can hang the daemon even though containers keep running. A container can report moderate CPU usage while the CFS bandwidth controller throttles it into p99 latency spikes. A missing log rotation configuration can fill the disk while all container-level metrics look normal. This checklist maps each signal to its authoritative source so you can distinguish between a healthy host, a degraded one, and a host about to fail.
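A quick way to see that layering in practice is to probe each component independently. A minimal sketch, assuming the default systemd unit names and socket paths on a systemd-based distribution:

```bash
# Probe each layer of the Docker stack independently.
systemctl is-active docker containerd                     # service state of dockerd and containerd
curl -s --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping; echo  # daemon API
sudo ctr --address /run/containerd/containerd.sock version  # containerd API, bypassing dockerd
runc --version                                            # OCI runtime binary
docker info --format '{{.Driver}} / {{.CgroupDriver}}'    # storage driver and cgroup driver in use
```

If the daemon API hangs but `ctr` still answers, the problem sits in dockerd or its storage driver rather than in containerd or the kernel.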
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk exhaustion cascade | Container creation fails with “no space left on device”; existing containers log write errors | docker system df -v and df -h /var/lib/docker/ |
| Daemon hang | dockerd process exists but docker ps hangs; orchestrator marks node unhealthy | curl --unix-socket /var/run/docker.sock http://localhost/_ping |
| OOM kill crash loop | Container restarts repeatedly with exit code 137 and OOMKilled: true | docker inspect and cgroup v2 memory.events |
| CPU throttling storm | Application latency rises while average CPU usage looks moderate | cgroup v2 cpu.stat for nr_throttled |
| Container death spiral | Restart count climbs; container flaps between running and exited | docker inspect exit code and restart count |
| Network black hole | Containers run but cannot resolve names or reach services | docker exec <id> nslookup <target> |
Quick checks
Run these commands to get a snapshot of host health before you set up continuous monitoring.
```bash
# Check daemon responsiveness (should return OK in under 1 second)
time curl -s --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping

# Check container states for dead or restarting containers
docker ps -a --format '{{.State}}' | sort | uniq -c

# Check restart counts for all containers
docker inspect --format '{{.Name}} {{.RestartCount}}' $(docker ps -aq)

# Check for OOM-killed containers
docker ps -aq | xargs -I{} docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}}' {}

# Check Docker disk usage and reclaimable space
docker system df

# Check CPU throttling across containers (cgroup v2, systemd driver)
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  echo "$(basename "$cg"): $(grep -E 'nr_throttled|throttled_usec' "$cg/cpu.stat")"
done

# Check memory OOM events across containers (cgroup v2, systemd driver)
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  echo "$(basename "$cg"): $(grep oom_kill "$cg/memory.events")"
done

# Check daemon file descriptor usage against its limit
ls /proc/$(pgrep dockerd)/fd | wc -l
grep "open files" /proc/$(pgrep dockerd)/limits

# Check for privileged containers
docker ps -q | xargs -I{} docker inspect --format '{{if .HostConfig.Privileged}}PRIVILEGED: {{.Name}}{{end}}' {} | grep PRIVILEGED
```
cgroup paths vary by cgroup driver and distribution. With the systemd driver, the default on cgroup v2 hosts, the paths above use system.slice/docker-<id>.scope. If your host uses the cgroupfs driver, look under /sys/fs/cgroup/docker/<id>/ instead. Verify with systemd-cgls or ls /sys/fs/cgroup/system.slice/.
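If you are unsure which layout a host uses, you can derive the path from a running container instead of guessing. A minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; the container argument is a placeholder:

```bash
#!/usr/bin/env bash
# Resolve the cgroup v2 directory for one container without assuming a driver.
# Usage: ./cgpath.sh <container-id-or-name>
cid="$1"
pid=$(docker inspect --format '{{.State.Pid}}' "$cid")
# cgroup v2 exposes a single "0::<path>" line in /proc/<pid>/cgroup.
rel=$(awk -F: '$1 == 0 {print $3}' "/proc/$pid/cgroup")
echo "cgroup dir: /sys/fs/cgroup$rel"
cat "/sys/fs/cgroup$rel/cpu.stat"
```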
How to diagnose it
Use this flow during an incident or when onboarding a new host. A consolidated triage sketch follows the list.
- Verify the daemon is responsive. Run `curl --unix-socket /var/run/docker.sock http://localhost/_ping` with a 5-second timeout. If it hangs, you have a daemon deadlock or storage driver hang. Check if container processes are still alive with `ps aux | grep containerd-shim`.
- Check for disk exhaustion. Run `docker system df` and `df -h /var/lib/docker/`. If the filesystem is over 80% full, disk pressure is likely the root cause of secondary failures.
- Inspect container states. Run `docker ps -a`. Any container in `dead` state indicates storage corruption. Containers stuck in `restarting` indicate a crash loop.
- Classify crashes. For any stopped container, check `docker inspect` for `ExitCode` and `OOMKilled`. Exit code 137 with `OOMKilled: true` means memory exhaustion. Exit code 1 means an application error. Exit code 139 means a segfault.
- Check for throttling before blaming the application. If latency is high but CPU usage looks moderate, read `cpu.stat` in the container cgroup. Increasing `nr_throttled` means the CFS quota is too low.
- Validate network from inside the container. Run `docker exec <id> cat /etc/resolv.conf` to confirm the embedded DNS at `127.0.0.11`, then test resolution with `nslookup`. Network errors in `docker stats` indicate veth or bridge issues.
- Audit security exposure. Scan for privileged containers, docker socket mounts, and added capabilities. Unexpected changes here can indicate compromise or misconfiguration.
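To run the first four steps in one pass, a read-only triage sketch; note the docker commands themselves will hang if the daemon is deadlocked, which is itself a finding:

```bash
#!/usr/bin/env bash
# One-pass triage: daemon, disk, container states, crash classification.
echo "== daemon ping =="
curl -s --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping \
  || echo "DAEMON UNRESPONSIVE"
echo; echo "== disk =="
docker system df; df -h /var/lib/docker/
echo "== container states =="
docker ps -a --format '{{.State}}' | sort | uniq -c
echo "== stopped containers: exit code and OOM flag =="
docker ps -aq --filter status=exited | while read -r id; do
  docker inspect --format '{{.Name}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$id"
done
```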
Metrics and signals to monitor
Must-have: availability and survival
These five signals tell you whether the daemon is working, whether disk is filling, and whether containers are running without crashing.
| Signal | Alert when | Why it matters | Read it from |
|---|---|---|---|
| Docker daemon responsiveness | /_ping fails or response time exceeds 5 seconds | A hung daemon leaves containers running but unmanageable; orchestrators lose the node | curl --unix-socket /var/run/docker.sock http://localhost/_ping |
| Container state distribution | Any container in dead state, or unexpected restarting state | Dead containers signal storage driver corruption; restarting containers signal crash loops | docker ps -a --format '{{.State}}' \| sort \| uniq -c |
| Container restart count | Nonzero for stable long-running containers, or delta greater than 5 in 10 minutes | Crash loops waste resources, flood logs, and hide root causes | docker inspect --format '{{.RestartCount}}' <id> |
| Container OOM killed status | OOMKilled is true | The kernel killed the container for exceeding its memory limit, risking data loss | docker inspect --format '{{.State.OOMKilled}}' <id>; cgroup v2 memory.events oom_kill counter |
| Docker disk usage | Greater than 80 percent of the /var/lib/docker filesystem | Disk exhaustion prevents image pulls, container creation, and daemon state updates | docker system df and df -h /var/lib/docker/ |
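Until these are wired into a monitoring agent, a cron-able sketch covering the two highest-value signals; the 5-second timeout and 80 percent disk threshold are the suggested values from the table, not hard rules:

```bash
#!/usr/bin/env bash
# Alert sketch: daemon responsiveness and /var/lib/docker disk usage.
if ! curl -sf --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping >/dev/null; then
  echo "ALERT: docker daemon did not answer /_ping within 5s"
fi
used=$(df --output=pcent /var/lib/docker/ | tail -1 | tr -dc '0-9')
if [ "$used" -gt 80 ]; then
  echo "ALERT: /var/lib/docker filesystem at ${used}% (threshold 80%)"
fi
```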
Must-have: resource utilization
These signals catch resource pressure inside containers before it becomes an outage.
| Signal | Alert when | Why it matters | Read it from |
|---|---|---|---|
| Container CPU usage | Sustained usage greater than 80 percent of limit or host capacity | Indicates CPU-bound work, runaway processes, or contention | docker stats --no-stream; cgroup v2 cpu.stat usage_usec |
| Container CPU throttling | nr_throttled increasing in cpu.stat | The CFS bandwidth controller is pausing processes, causing silent latency spikes even when average CPU looks moderate | cgroup v2 cpu.stat fields nr_throttled and throttled_usec |
| Container memory usage | Usage greater than 80 percent of limit; anon memory steadily growing | OOM kills are imminent; steadily growing anonymous memory indicates a leak | docker stats --no-stream; cgroup v2 memory.current and memory.stat field anon |
| Container network errors | rx_errors, tx_errors, or rx_dropped increasing | Packet loss causes application retries, timeouts, and degraded performance | API /containers/<id>/stats networks object, or docker exec <id> cat /proc/net/dev |
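A sketch that snapshots two of these signals, memory usage against the limit and NIC error counters, assuming the systemd cgroup driver paths used in the quick checks; containers whose image lacks cat are skipped silently:

```bash
#!/usr/bin/env bash
# Memory usage vs limit per container cgroup (cgroup v2, systemd driver).
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  cur=$(cat "$cg/memory.current"); lim=$(cat "$cg/memory.max")
  # memory.max reads "max" when no limit is set; skip those.
  [ "$lim" != "max" ] && echo "$(basename "$cg"): memory at $((100 * cur / lim))% of limit"
done
# NIC error and drop counters as seen from inside each container.
docker ps -q | while read -r id; do
  echo "== $(docker inspect --format '{{.Name}}' "$id")"
  docker exec "$id" cat /proc/net/dev 2>/dev/null \
    | awk 'NR > 2 {print $1, "rx_errs=" $4, "rx_drop=" $5, "tx_errs=" $12}'
done
```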
Should-have: storage, I/O, and daemon internals
Monitor these to catch disk growth, I/O contention, and daemon stress before they cascade.
| Signal | Alert when | Why it matters | Read it from |
|---|---|---|---|
| Container block I/O | Sustained I/O wait or bandwidth near device limits | I/O-heavy containers starve neighbors on shared storage | API blkio_stats; cgroup v2 io.stat |
| Container log file size | Any single json-file log exceeds 1 GB without rotation | Unbounded container logs are the leading cause of disk exhaustion on Docker hosts | ls -lh /var/lib/docker/containers/<id>/<id>-json.log |
| Docker daemon file descriptors | Usage greater than 80 percent of process limit | FD exhaustion blocks API connections, log streaming, and container operations | ls /proc/$(pgrep dockerd)/fd \| wc -l and /proc/$(pgrep dockerd)/limits |
| Container health check status | Status is unhealthy or FailingStreak is greater than 0 | The application may be deadlocked or failing even though the container is running | docker inspect --format '{{.State.Health.Status}}' <id> |
| Container exit codes | Nonzero exit codes on stable containers, especially 137 or 139 | Classifies the failure mode: OOM, segfault, or application error | docker inspect --format '{{.State.ExitCode}}' <id> |
| Docker daemon errors | Any panic or fatal message; sustained error rate above baseline | Reveals storage driver corruption, internal bugs, and resource exhaustion | journalctl -u docker.service -p err --since "1 hour ago" |
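A sketch covering the log-size and health-check rows; the 1 GB threshold matches the table, and the paths assume the default json-file logging driver:

```bash
# Flag json-file logs over the 1 GB threshold (default json-file driver paths).
find /var/lib/docker/containers -name '*-json.log' -size +1G -exec ls -lh {} \;
# Health status and failing streak for every running container.
docker ps -q | while read -r id; do
  docker inspect --format \
    '{{.Name}} {{if .State.Health}}{{.State.Health.Status}} streak={{.State.Health.FailingStreak}}{{else}}no-healthcheck{{end}}' "$id"
done
```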
Nice-to-have: security and deep internals
Add these after you have coverage of the tiers above.
| Signal | Alert when | Why it matters | Read it from |
|---|---|---|---|
| Privileged container count | Any privileged container that is not a known infrastructure agent | Privileged mode disables most isolation and enables host compromise | docker inspect --format '{{.HostConfig.Privileged}}' <id> |
| Docker socket mounts | Any container mounting /var/run/docker.sock unexpectedly | Socket access is equivalent to root on the host | docker inspect --format '{{json .Mounts}}' <id> |
| Container capability additions | SYS_ADMIN, NET_ADMIN, or SYS_PTRACE added | Dangerous capabilities significantly weaken container isolation | docker inspect --format '{{.HostConfig.CapAdd}}' <id> |
| Docker daemon goroutine count | Greater than 10,000 sustained, or growing without bound | Indicates goroutine leaks or internal deadlock forming | curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=1 (if debug enabled); approximate via /proc/$(pgrep dockerd)/status Threads |
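A sketch that sweeps the three security signals in one loop; treat any hit as something to explain rather than an automatic incident:

```bash
#!/usr/bin/env bash
# Security sweep: privileged mode, added capabilities, docker.sock mounts.
docker ps -q | while read -r id; do
  docker inspect --format \
    '{{.Name}} privileged={{.HostConfig.Privileged}} capadd={{.HostConfig.CapAdd}}' "$id"
  docker inspect --format \
    '{{.Name}}{{range .Mounts}}{{if eq .Source "/var/run/docker.sock"}} mounts docker.sock{{end}}{{end}}' "$id" \
    | grep 'mounts docker.sock'
done
```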
Fixes
Apply fixes based on the signal category.
If disk is the bottleneck
Truncate the largest unrotated log files safely while containers run: truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log. Reclaim space with docker image prune for dangling images (add -a to remove all unused images) and docker volume prune for unused volumes. Set log-opts in /etc/docker/daemon.json with max-size and max-file to prevent recurrence.
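A sketch of that rotation change; the 50m/3 values are illustrative, the write overwrites any existing daemon.json (merge by hand if you already have one), and the new limits apply only to containers created after the restart:

```bash
# Illustrative /etc/docker/daemon.json with json-file rotation defaults.
# WARNING: this overwrites the file; merge by hand if daemon.json already exists.
cat > /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}
EOF
systemctl restart docker   # required; limits apply only to containers created afterwards
```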
If containers are crash-looping
Check docker inspect for exit code and OOMKilled. If OOM, raise the memory limit or fix the leak. For exit code 1, read docker logs. Break the restart loop temporarily with docker update --restart=no <id> while you debug.
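A small sketch that applies these exit-code rules to every stopped container:

```bash
#!/usr/bin/env bash
# Classify stopped containers by exit code (137 = OOM/SIGKILL, 139 = segfault).
docker ps -aq --filter status=exited | while read -r id; do
  read -r name code oom < <(docker inspect --format \
    '{{.Name}} {{.State.ExitCode}} {{.State.OOMKilled}}' "$id")
  case "$code" in
    0)   echo "$name: clean exit" ;;
    137) if [ "$oom" = "true" ]; then echo "$name: OOM-killed, raise limit or fix leak"
         else echo "$name: SIGKILL from outside (host OOM killer or manual kill)"; fi ;;
    139) echo "$name: segfault" ;;
    *)   echo "$name: application error (exit $code), read docker logs" ;;
  esac
done
```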
If CPU throttling is causing latency
Calculate the throttle percentage from nr_throttled / nr_periods in cpu.stat. Raise the CPU limit or switch to cpuset pinning for latency-sensitive workloads instead of CFS quotas.
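A sketch of the calculation and both remedies; the container name app is a placeholder, and the cgroup path assumes the systemd driver:

```bash
# Throttle ratio for one container; "app" is a placeholder name.
cg=/sys/fs/cgroup/system.slice/docker-$(docker inspect --format '{{.Id}}' app).scope
awk '/nr_periods/ {p=$2} /nr_throttled/ {t=$2}
     END {if (p > 0) printf "throttled in %.1f%% of CFS periods\n", 100 * t / p}' "$cg/cpu.stat"
# Remedy 1: raise the CFS quota.
docker update --cpus 2 app
# Remedy 2: pin to dedicated cores for latency-sensitive work.
docker update --cpuset-cpus 2,3 app
```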
If the daemon is hung
Confirm container processes are still alive via ps or ctr -n moby containers list. If live-restore is enabled, restart dockerd with systemctl restart docker; running containers will survive. Without live-restore, a restart kills all containers.
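Before restarting, confirm the live-restore setting. A sketch that asks the daemon first and falls back to the config file, since docker info itself hangs when dockerd is deadlocked:

```bash
# Ask the daemon whether live-restore is on; fall back to the config file
# because docker info also hangs when dockerd is deadlocked.
timeout 5 docker info --format '{{.LiveRestoreEnabled}}' \
  || grep -s live-restore /etc/docker/daemon.json
# Only restart once you know what will happen to running containers:
systemctl restart docker
```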
Prevention
- Configure log rotation in `/etc/docker/daemon.json` with `max-size` and `max-file` defaults.
- Set memory limits that leave headroom for native allocations. For JVM workloads, set `-Xmx` to roughly 75 percent of the container limit.
- Enable meaningful health checks in every production image.
- Automate cleanup of exited containers and dangling images with a scheduled `docker system prune` or equivalent.
- Avoid `--privileged` and docker socket mounts in production workloads. Drop capabilities and run as non-root.
- Set `net.netfilter.nf_conntrack_max` to at least 262144 on busy hosts and monitor utilization.
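For the cleanup item, a sketch of a nightly prune via cron; the until filter and the schedule are illustrative:

```bash
# /etc/cron.d/docker-prune -- illustrative nightly cleanup at 03:00.
# The until filter keeps anything created in the last 24 hours.
0 3 * * * root docker system prune -f --filter "until=24h" >> /var/log/docker-prune.log 2>&1
```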
How Netdata helps
- Correlates container CPU usage with throttling metrics on the same chart, exposing the silent cause of latency spikes.
- Reads cgroup v2 `memory.events` to alert immediately on OOM kills without waiting for `docker inspect`.
- Tracks per-container disk usage and log growth to catch storage pressure before the host filesystem fills.
- Monitors dockerd health, API latency, and file descriptor usage alongside container metrics to detect daemon stress.
Related guides
- Docker container high CPU usage: causes and fixes
- Docker container high memory usage: how to diagnose it
- Docker container keeps restarting: causes, checks, and fixes
- Docker container memory leak: how to find one and prove it
- Docker container running but unhealthy: how to diagnose health check failures
- Docker CPU throttling: the hidden cause of container latency
- Docker daemon not responding: how to troubleshoot a hung dockerd
- Docker disk space full: how to troubleshoot /var/lib/docker
- Docker DNS not working inside containers
- Docker exit code 137: OOMKilled or SIGKILL?
- Docker log rotation: preventing json-file logs from filling disk
- Docker logs taking too much disk space: how to fix log growth