Docker monitoring checklist: the signals every production host needs

Production Docker incidents rarely look like Docker problems at first. They show up as application latency, deployment failures, or hosts that suddenly refuse to schedule containers. By the time you notice, the daemon may be hung, a log file has filled the disk, or a container has been silently throttled into unusable latency. This checklist groups the essential production signals into three priority tiers: must-have alerts that keep the host alive and catch resource pressure before it becomes an outage, should-have metrics covering storage, I/O, and daemon internals, and nice-to-have security and deep-internal signals for mature environments. Every signal notes where to read it, whether from the raw cgroup filesystem or the Docker API, so you can instrument hosts without guessing paths.

What this means

Docker is not a single process to monitor. It is a stack of interdependent components: dockerd, containerd, runc, a storage driver, and a network bridge. Each layer emits its own failure signals. A storage driver deadlock can hang the daemon even though containers keep running. A container can report moderate CPU usage while the CFS bandwidth controller throttles it into p99 latency spikes. A missing log rotation configuration can fill the disk while all container-level metrics look normal. This checklist maps each signal to its authoritative source so you can distinguish between a healthy host, a degraded one, and a host about to fail.

Common causes

Cause: Disk exhaustion cascade
  What it looks like: Container creation fails with "no space left on device"; existing containers log write errors
  First thing to check: docker system df -v and df -h /var/lib/docker/

Cause: Daemon hang
  What it looks like: dockerd process exists but docker ps hangs; orchestrator marks the node unhealthy
  First thing to check: curl --unix-socket /var/run/docker.sock http://localhost/_ping

Cause: OOM kill crash loop
  What it looks like: Container restarts repeatedly with exit code 137 and OOMKilled: true
  First thing to check: docker inspect and cgroup v2 memory.events

Cause: CPU throttling storm
  What it looks like: Application latency rises while average CPU usage looks moderate
  First thing to check: cgroup v2 cpu.stat for nr_throttled

Cause: Container death spiral
  What it looks like: Restart count climbs; container flaps between running and exited
  First thing to check: docker inspect exit code and restart count

Cause: Network black hole
  What it looks like: Containers run but cannot resolve names or reach services
  First thing to check: docker exec <id> nslookup <target>

Quick checks

Run these commands to get a snapshot of host health before you set up continuous monitoring.

# Check daemon responsiveness (should return OK in under 1 second)
time curl -s --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping

# Check container states for dead or restarting containers
docker ps -a --format '{{.State}}' | sort | uniq -c

# Check restart counts for all containers
docker inspect --format '{{.Name}} {{.RestartCount}}' $(docker ps -aq)

# Check for OOM-killed containers
docker ps -aq | xargs -I{} docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}}' {}

# Check Docker disk usage and reclaimable space
docker system df

# Check CPU throttling across containers (cgroup v2, systemd driver)
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  echo "$(basename $cg): $(grep -E 'nr_throttled|throttled_usec' $cg/cpu.stat)"
done

# Check memory OOM events across containers (cgroup v2, systemd driver)
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  echo "$(basename $cg): $(grep oom_kill $cg/memory.events)"
done

# Check daemon file descriptor usage against its limit
ls /proc/$(pgrep dockerd)/fd | wc -l
cat /proc/$(pgrep dockerd)/limits | grep "open files"

# Check for privileged containers
docker ps -q | xargs -I{} docker inspect --format '{{if .HostConfig.Privileged}}PRIVILEGED: {{.Name}}{{end}}' {} | grep PRIVILEGED

cgroup paths vary by driver and distribution. For systemd, which is the default on cgroup v2, the paths above use system.slice/docker-<id>.scope. If your host uses the cgroupfs driver, look under /sys/fs/cgroup/docker/<id>/ instead. Verify with systemd-cgls or ls /sys/fs/cgroup/system.slice/.
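
If you are not sure which driver a host uses, the daemon reports it directly. The CgroupVersion field requires Docker 20.10 or later:

# Confirm the cgroup driver and version before hard-coding paths
docker info --format 'driver={{.CgroupDriver}} version={{.CgroupVersion}}'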

How to diagnose it

Use this flow during an incident or when onboarding a new host; a scripted version of the first four steps follows the list.

  1. Verify the daemon is responsive. Run curl --unix-socket /var/run/docker.sock http://localhost/_ping with a 5-second timeout. If it hangs, you have a daemon deadlock or storage driver hang. Check if container processes are still alive with ps aux | grep containerd-shim.
  2. Check for disk exhaustion. Run docker system df and df -h /var/lib/docker/. If the filesystem is over 80% full, disk pressure is likely the root cause of secondary failures.
  3. Inspect container states. Run docker ps -a. Any container in the dead state points to storage driver corruption; containers stuck in restarting are crash-looping.
  4. Classify crashes. For any stopped container, check docker inspect for ExitCode and OOMKilled. Exit code 137 with OOMKilled: true means memory exhaustion; 137 without it usually means an external SIGKILL, such as a stop timeout. Exit code 139 means a segfault (128 plus SIGSEGV). Other nonzero codes, such as 1, are application errors; read the logs.
  5. Check for throttling before blaming the application. If latency is high but CPU usage looks moderate, read cpu.stat in the container cgroup. A climbing nr_throttled counter means the CFS quota is too low for the workload.
  6. Validate network from inside the container. Run docker exec <id> cat /etc/resolv.conf to confirm the embedded DNS at 127.0.0.11, then test resolution with nslookup. Rising error counters in the stats API's networks object point to veth or bridge issues.
  7. Audit security exposure. Scan for privileged containers, docker socket mounts, and added capabilities. Unexpected changes here can indicate compromise or misconfiguration.
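
A minimal sketch that scripts the first four steps, assuming the default /var/lib/docker root and the thresholds from this checklist; adjust both to your environment.

# Step 1: daemon responsiveness within a 5-second budget
if ! curl -sf --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping >/dev/null; then
  echo "FAIL: daemon unresponsive; check containerd-shim processes next"
  exit 1
fi
# Step 2: disk pressure on the Docker root (-P keeps df output parseable)
df -P /var/lib/docker/ | awk 'NR==2 && $5+0 > 80 {print "WARN: Docker filesystem at " $5}'
# Step 3: container state distribution
docker ps -a --format '{{.State}}' | sort | uniq -c
# Step 4: classify every stopped container
for id in $(docker ps -aq --filter status=exited --filter status=dead); do
  docker inspect --format '{{.Name}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$id"
done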

Metrics and signals to monitor

Must-have: availability and survival

These five signals tell you whether the daemon is working, whether disk is filling, and whether containers are running without crashing.

Signal: Docker daemon responsiveness
  Alert when: /_ping fails or response time exceeds 5 seconds
  Why it matters: A hung daemon leaves containers running but unmanageable; orchestrators lose the node
  Read it from: curl --unix-socket /var/run/docker.sock http://localhost/_ping

Signal: Container state distribution
  Alert when: Any container in the dead state, or an unexpected restarting state
  Why it matters: Dead containers signal storage driver corruption; restarting containers signal crash loops
  Read it from: docker ps -a --format '{{.State}}' | sort | uniq -c

Signal: Container restart count
  Alert when: Nonzero for stable long-running containers, or a delta greater than 5 in 10 minutes
  Why it matters: Crash loops waste resources, flood logs, and hide root causes
  Read it from: docker inspect --format '{{.RestartCount}}' <id>

Signal: Container OOM killed status
  Alert when: OOMKilled is true
  Why it matters: The kernel killed the container for exceeding its memory limit, risking data loss
  Read it from: docker inspect --format '{{.State.OOMKilled}}' <id>; cgroup v2 memory.events oom_kill counter

Signal: Docker disk usage
  Alert when: Greater than 80 percent of the /var/lib/docker filesystem
  Why it matters: Disk exhaustion prevents image pulls, container creation, and daemon state updates
  Read it from: docker system df and df -h /var/lib/docker/
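
Restart-count deltas need two samples. A sketch assuming a 10-minute cron cadence and writable state in /var/tmp, both arbitrary choices:

#!/bin/sh
# Flag containers whose restart count jumped by more than 5 since the last run.
# Alerts go to stderr; the new state file captures stdout.
state=/var/tmp/docker-restarts.prev
touch "$state"
docker ps -aq | while read -r id; do
  now=$(docker inspect --format '{{.RestartCount}}' "$id")
  prev=$(awk -v id="$id" '$1 == id {print $2}' "$state")
  if [ -n "$prev" ] && [ "$((now - prev))" -gt 5 ]; then
    echo "ALERT: $id restarted $((now - prev)) times since the last sample" >&2
  fi
  echo "$id $now"
done > "$state.next" && mv "$state.next" "$state"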

Must-have: resource utilization

These signals catch resource pressure inside containers before it becomes an outage.

Signal: Container CPU usage
  Alert when: Sustained usage greater than 80 percent of the limit or host capacity
  Why it matters: Indicates CPU-bound work, runaway processes, or contention
  Read it from: docker stats --no-stream; cgroup v2 cpu.stat usage_usec

Signal: Container CPU throttling
  Alert when: nr_throttled increasing in cpu.stat
  Why it matters: The CFS bandwidth controller is pausing processes, causing silent latency spikes even when average CPU looks moderate
  Read it from: cgroup v2 cpu.stat fields nr_throttled and throttled_usec

Signal: Container memory usage
  Alert when: Usage greater than 80 percent of the limit; anon memory steadily growing
  Why it matters: OOM kills are imminent; steadily growing anonymous memory indicates a leak
  Read it from: docker stats --no-stream; cgroup v2 memory.current and memory.stat field anon

Signal: Container network errors
  Alert when: rx_errors, tx_errors, or rx_dropped increasing
  Why it matters: Packet loss causes application retries, timeouts, and degraded performance
  Read it from: API /containers/<id>/stats networks object, or docker exec <id> cat /proc/net/dev
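
To turn the memory signal into a concrete check, compare memory.current against memory.max per container. A sketch assuming cgroup v2 with the systemd driver, as in the quick checks:

# Report each container's memory use as a percentage of its limit
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  cur=$(cat "$cg/memory.current")
  max=$(cat "$cg/memory.max")
  [ "$max" = "max" ] && continue  # no limit configured; nothing to compare against
  echo "$(basename "$cg"): $((100 * cur / max))% of limit"
done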

Should-have: storage, I/O, and daemon internals

Monitor these to catch disk growth, I/O contention, and daemon stress before they cascade.

Signal: Container block I/O
  Alert when: Sustained I/O wait or bandwidth near device limits
  Why it matters: I/O-heavy containers starve neighbors on shared storage
  Read it from: API blkio_stats; cgroup v2 io.stat

Signal: Container log file size
  Alert when: Any single json-file log exceeds 1 GB without rotation
  Why it matters: Unbounded container logs are the leading cause of disk exhaustion on Docker hosts
  Read it from: ls -lh /var/lib/docker/containers/<id>/<id>-json.log

Signal: Docker daemon file descriptors
  Alert when: Usage greater than 80 percent of the process limit
  Why it matters: FD exhaustion blocks API connections, log streaming, and container operations
  Read it from: ls /proc/$(pgrep dockerd)/fd | wc -l and /proc/$(pgrep dockerd)/limits

Signal: Container health check status
  Alert when: Status is unhealthy, or FailingStreak is greater than 0
  Why it matters: The application may be deadlocked or failing even though the container is running
  Read it from: docker inspect --format '{{.State.Health.Status}}' <id>

Signal: Container exit codes
  Alert when: Nonzero exit codes on stable containers, especially 137 or 139
  Why it matters: Classifies the failure mode: OOM, segfault, or application error
  Read it from: docker inspect --format '{{.State.ExitCode}}' <id>

Signal: Docker daemon errors
  Alert when: Any panic or fatal message; sustained error rate above baseline
  Why it matters: Reveals storage driver corruption, internal bugs, and resource exhaustion
  Read it from: journalctl -u docker.service -p err --since "1 hour ago"
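
For the log-size signal, a one-liner surfaces the worst offenders; run it as root for complete results:

# Largest container logs; anything approaching 1 GB without rotation is a red flag
du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -5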

Nice-to-have: security and deep internals

Add these after you have coverage of the tiers above.

Signal: Privileged container count
  Alert when: Any privileged container that is not a known infrastructure agent
  Why it matters: Privileged mode disables most isolation and enables host compromise
  Read it from: docker inspect --format '{{.HostConfig.Privileged}}' <id>

Signal: Docker socket mounts
  Alert when: Any container mounting /var/run/docker.sock unexpectedly
  Why it matters: Socket access is equivalent to root on the host
  Read it from: docker inspect --format '{{json .Mounts}}' <id>

Signal: Container capability additions
  Alert when: SYS_ADMIN, NET_ADMIN, or SYS_PTRACE added
  Why it matters: Dangerous capabilities significantly weaken container isolation
  Read it from: docker inspect --format '{{.HostConfig.CapAdd}}' <id>

Signal: Docker daemon goroutine count
  Alert when: Greater than 10,000 sustained, or growing without bound
  Why it matters: Indicates goroutine leaks or an internal deadlock forming
  Read it from: curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=1 (if debug is enabled); approximate via /proc/$(pgrep dockerd)/status Threads
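
The goroutine signal is the only one that needs the debug endpoint. When debug is off, the daemon's kernel thread count is a rough proxy; many goroutines share one thread, so treat it as a trend, not a count:

# Rough proxy: kernel threads of the daemon process
grep Threads /proc/$(pgrep dockerd)/status
# Exact total, only if "debug": true is set in daemon.json
curl -s --unix-socket /var/run/docker.sock \
  'http://localhost/debug/pprof/goroutine?debug=1' | head -1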

Fixes

Apply fixes based on the signal category.

If disk is the bottleneck

Truncate the largest unrotated log files safely while containers run: truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log. Reclaim space with docker image prune for dangling images (add -a to also remove all images not used by a container) and docker volume prune for unused volumes. Set log-opts in /etc/docker/daemon.json with max-size and max-file to prevent recurrence; a sketch follows.
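
A minimal rotation policy looks like the following; the 100m/3 values are example choices, not recommendations, and json-file settings apply only to containers created after the restart:

# Caution: this overwrites an existing daemon.json; merge keys by hand if one exists
cat > /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF
systemctl restart docker  # existing containers keep their old log settings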

If containers are crash-looping

Check docker inspect for exit code and OOMKilled. If OOM, raise the memory limit or fix the leak. For exit code 1, read docker logs. Break the restart loop temporarily with docker update --restart=no <id> while you debug.
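
To find loopers in bulk rather than one at a time, a short sweep helps; the threshold of 5 is an arbitrary example:

# List containers with suspicious restart counts
for id in $(docker ps -aq); do
  rc=$(docker inspect --format '{{.RestartCount}}' "$id")
  if [ "$rc" -gt 5 ]; then
    docker inspect --format '{{.Name}}: {{.RestartCount}} restarts, last exit {{.State.ExitCode}}' "$id"
    # docker update --restart=no "$id"  # uncomment to break the loop while debugging
  fi
done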

If CPU throttling is causing latency

Calculate the throttle percentage from nr_throttled / nr_periods in cpu.stat. Raise the CPU limit or switch to cpuset pinning for latency-sensitive workloads instead of CFS quotas.
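
The same calculation, done host-wide in one pass; this assumes cgroup v2 with the systemd driver, as in the quick checks above:

# Percentage of CFS periods in which each container was throttled
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  awk -v name="$(basename "$cg")" '
    $1 == "nr_periods"   { p = $2 }
    $1 == "nr_throttled" { t = $2 }
    END { if (p > 0) printf "%s: %.1f%% throttled (%d of %d periods)\n", name, 100 * t / p, t, p }
  ' "$cg/cpu.stat"
done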

If the daemon is hung

Confirm container processes are still alive with ps aux | grep containerd-shim, or through containerd directly with ctr -n moby tasks ls. If live-restore is enabled, restart dockerd with systemctl restart docker; running containers will survive. Without live-restore, a restart kills all containers.
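
Check live-restore before you restart. The API query hangs if the daemon is hung, so fall back to the config file; note that live-restore can also be set as a dockerd flag in the systemd unit:

# Ask the daemon (hangs if the daemon is hung)
docker info --format '{{.LiveRestoreEnabled}}'
# Fallback that needs no API
grep -s live-restore /etc/docker/daemon.json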

Prevention

  • Configure log rotation in /etc/docker/daemon.json with max-size and max-file defaults.
  • Set memory limits that leave headroom for native allocations. For JVM workloads, set -Xmx to roughly 75 percent of the container limit (for example, -Xmx1536m under a 2 GiB limit).
  • Enable meaningful health checks in every production image.
  • Automate cleanup of exited containers and dangling images with a scheduled docker system prune or equivalent.
  • Avoid --privileged and docker socket mounts in production workloads. Drop capabilities and run as non-root.
  • Set net.netfilter.nf_conntrack_max to at least 262144 on busy hosts and monitor utilization (see the sketch after this list).
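
For the conntrack item, the kernel exposes current utilization next to the limit; persisting the change through sysctl.d is one common approach (the file name below is an arbitrary example):

# Compare current entries against the limit (requires the nf_conntrack module)
cat /proc/sys/net/netfilter/nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
# Persist the higher limit from the checklist above
echo 'net.netfilter.nf_conntrack_max = 262144' > /etc/sysctl.d/90-conntrack.conf
sysctl --system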

How Netdata helps

  • Correlates container CPU usage with throttling metrics on the same chart, exposing the silent cause of latency spikes.
  • Reads cgroup v2 memory.events to alert immediately on OOM kills without waiting for docker inspect.
  • Tracks per-container disk usage and log growth to catch storage pressure before the host filesystem fills.
  • Monitors dockerd health, API latency, and file descriptor usage alongside container metrics to detect daemon stress.