Docker monitoring checklist: the signals every production host needs

Production Docker incidents rarely look like Docker problems at first. They show up as application latency, deployment failures, or hosts that suddenly refuse to schedule containers. By the time you notice, the daemon may be hung, a log file has filled the disk, or a container has been silently throttled into unusable latency. This checklist groups the essential production signals into three priority tiers: must-have alerts that keep the host alive and catch resource pressure before it becomes an outage, should-have metrics covering storage, I/O, and daemon internals, and nice-to-have security and deep-internal signals for mature environments. Every signal notes where to read it, whether from the raw cgroup filesystem or the Docker API, so you can instrument hosts without guessing paths.

What this means

Docker is not a single process to monitor. It is a stack of interdependent components: dockerd, containerd, runc, a storage driver, and a network bridge. Each layer emits its own failure signals. A storage driver deadlock can hang the daemon even though containers keep running. A container can report moderate CPU usage while the CFS bandwidth controller throttles it into p99 latency spikes. A missing log rotation configuration can fill the disk while all container-level metrics look normal. This checklist maps each signal to its authoritative source so you can distinguish between a healthy host, a degraded one, and a host about to fail.

Common causes

Cause: Disk exhaustion cascade
  What it looks like: Container creation fails with "no space left on device"; existing containers log write errors
  First thing to check: docker system df -v and df -h /var/lib/docker/

Cause: Daemon hang
  What it looks like: dockerd process exists but docker ps hangs; orchestrator marks the node unhealthy
  First thing to check: curl --unix-socket /var/run/docker.sock http://localhost/_ping

Cause: OOM kill crash loop
  What it looks like: Container restarts repeatedly with exit code 137 and OOMKilled: true
  First thing to check: docker inspect and cgroup v2 memory.events

Cause: CPU throttling storm
  What it looks like: Application latency rises while average CPU usage looks moderate
  First thing to check: cgroup v2 cpu.stat for nr_throttled

Cause: Container death spiral
  What it looks like: Restart count climbs; container flaps between running and exited
  First thing to check: docker inspect exit code and restart count

Cause: Network black hole
  What it looks like: Containers run but cannot resolve names or reach services
  First thing to check: docker exec <id> nslookup <target>

Quick checks

Run these commands to get a snapshot of host health before you set up continuous monitoring.

# Check daemon responsiveness (should return OK in under 1 second)
time curl -s --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping

# Check container states for dead or restarting containers
docker ps -a --format '{{.State}}' | sort | uniq -c

# Check restart counts for all containers
docker inspect --format '{{.Name}} {{.RestartCount}}' $(docker ps -aq)

# Check for OOM-killed containers
docker ps -aq | xargs -I{} docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}}' {}

# Check Docker disk usage and reclaimable space
docker system df

# Check CPU throttling across containers (cgroup v2, systemd driver)
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  echo "$(basename $cg): $(grep -E 'nr_throttled|throttled_usec' $cg/cpu.stat)"
done

# Check memory OOM events across containers (cgroup v2, systemd driver)
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  echo "$(basename $cg): $(grep oom_kill $cg/memory.events)"
done

# Check daemon file descriptor usage against its limit
ls /proc/$(pgrep dockerd)/fd | wc -l
cat /proc/$(pgrep dockerd)/limits | grep "open files"

# Check for privileged containers
docker ps -q | xargs -I{} docker inspect --format '{{if .HostConfig.Privileged}}PRIVILEGED: {{.Name}}{{end}}' {} | grep PRIVILEGED

cgroup paths vary by driver and distribution. For systemd, which is the default on cgroup v2, the paths above use system.slice/docker-<id>.scope. If your host uses the cgroupfs driver, look under /sys/fs/cgroup/docker/<id>/ instead. Verify with systemd-cgls or ls /sys/fs/cgroup/system.slice/.
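
If you are not sure which driver a host uses, the daemon reports it directly. The CgroupVersion field requires Docker 20.10 or later:

# Confirm the cgroup driver and version before hard-coding paths
docker info --format 'driver={{.CgroupDriver}} version={{.CgroupVersion}}'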

How to diagnose it

Use this flow during an incident or when onboarding a new host; a scripted version of the first four steps follows the list.

  1. Verify the daemon is responsive. Run curl --unix-socket /var/run/docker.sock http://localhost/_ping with a 5-second timeout. If it hangs, you have a daemon deadlock or storage driver hang. Check if container processes are still alive with ps aux | grep containerd-shim.
  2. Check for disk exhaustion. Run docker system df and df -h /var/lib/docker/. If the filesystem is over 80% full, disk pressure is likely the root cause of secondary failures.
  3. Inspect container states. Run docker ps -a. Any container in the dead state points to storage driver corruption; containers stuck in restarting are crash-looping.
  4. Classify crashes. For any stopped container, check docker inspect for ExitCode and OOMKilled. Exit code 137 with OOMKilled: true means memory exhaustion; 137 without it usually means an external SIGKILL, such as a stop timeout. Exit code 139 means a segfault (128 plus SIGSEGV). Other nonzero codes, such as 1, are application errors; read the logs.
  5. Check for throttling before blaming the application. If latency is high but CPU usage looks moderate, read cpu.stat in the container cgroup. A climbing nr_throttled counter means the CFS quota is too low for the workload.
  6. Validate network from inside the container. Run docker exec <id> cat /etc/resolv.conf to confirm the embedded DNS at 127.0.0.11, then test resolution with nslookup. Rising error counters in the stats API's networks object point to veth or bridge issues.
  7. Audit security exposure. Scan for privileged containers, docker socket mounts, and added capabilities. Unexpected changes here can indicate compromise or misconfiguration.
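
A minimal sketch that scripts the first four steps, assuming the default /var/lib/docker root and the thresholds from this checklist; adjust both to your environment.

# Step 1: daemon responsiveness within a 5-second budget
if ! curl -sf --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping >/dev/null; then
  echo "FAIL: daemon unresponsive; check containerd-shim processes next"
  exit 1
fi
# Step 2: disk pressure on the Docker root (-P keeps df output parseable)
df -P /var/lib/docker/ | awk 'NR==2 && $5+0 > 80 {print "WARN: Docker filesystem at " $5}'
# Step 3: container state distribution
docker ps -a --format '{{.State}}' | sort | uniq -c
# Step 4: classify every stopped container
for id in $(docker ps -aq --filter status=exited --filter status=dead); do
  docker inspect --format '{{.Name}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$id"
done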

Metrics and signals to monitor

Must-have: availability and survival

These five signals tell you whether the daemon is working, whether disk is filling, and whether containers are running without crashing.

Signal: Docker daemon responsiveness
  Alert when: /_ping fails or response time exceeds 5 seconds
  Why it matters: A hung daemon leaves containers running but unmanageable; orchestrators lose the node
  Read it from: curl --unix-socket /var/run/docker.sock http://localhost/_ping

Signal: Container state distribution
  Alert when: Any container in the dead state, or an unexpected restarting state
  Why it matters: Dead containers signal storage driver corruption; restarting containers signal crash loops
  Read it from: docker ps -a --format '{{.State}}' | sort | uniq -c

Signal: Container restart count
  Alert when: Nonzero for stable long-running containers, or a delta greater than 5 in 10 minutes
  Why it matters: Crash loops waste resources, flood logs, and hide root causes
  Read it from: docker inspect --format '{{.RestartCount}}' <id>

Signal: Container OOM killed status
  Alert when: OOMKilled is true
  Why it matters: The kernel killed the container for exceeding its memory limit, risking data loss
  Read it from: docker inspect --format '{{.State.OOMKilled}}' <id>; cgroup v2 memory.events oom_kill counter

Signal: Docker disk usage
  Alert when: Greater than 80 percent of the /var/lib/docker filesystem
  Why it matters: Disk exhaustion prevents image pulls, container creation, and daemon state updates
  Read it from: docker system df and df -h /var/lib/docker/
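
Restart-count deltas need two samples. A sketch assuming a 10-minute cron cadence and writable state in /var/tmp, both arbitrary choices:

#!/bin/sh
# Flag containers whose restart count jumped by more than 5 since the last run.
# Alerts go to stderr; the new state file captures stdout.
state=/var/tmp/docker-restarts.prev
touch "$state"
docker ps -aq | while read -r id; do
  now=$(docker inspect --format '{{.RestartCount}}' "$id")
  prev=$(awk -v id="$id" '$1 == id {print $2}' "$state")
  if [ -n "$prev" ] && [ "$((now - prev))" -gt 5 ]; then
    echo "ALERT: $id restarted $((now - prev)) times since the last sample" >&2
  fi
  echo "$id $now"
done > "$state.next" && mv "$state.next" "$state"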

Must-have: resource utilization

These signals catch resource pressure inside containers before it becomes an outage.

Signal: Container CPU usage
  Alert when: Sustained usage greater than 80 percent of the limit or host capacity
  Why it matters: Indicates CPU-bound work, runaway processes, or contention
  Read it from: docker stats --no-stream; cgroup v2 cpu.stat usage_usec

Signal: Container CPU throttling
  Alert when: nr_throttled increasing in cpu.stat
  Why it matters: The CFS bandwidth controller is pausing processes, causing silent latency spikes even when average CPU looks moderate
  Read it from: cgroup v2 cpu.stat fields nr_throttled and throttled_usec

Signal: Container memory usage
  Alert when: Usage greater than 80 percent of the limit; anon memory steadily growing
  Why it matters: OOM kills are imminent; steadily growing anonymous memory indicates a leak
  Read it from: docker stats --no-stream; cgroup v2 memory.current and memory.stat field anon

Signal: Container network errors
  Alert when: rx_errors, tx_errors, or rx_dropped increasing
  Why it matters: Packet loss causes application retries, timeouts, and degraded performance
  Read it from: API /containers/<id>/stats networks object, or docker exec <id> cat /proc/net/dev
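
To turn the memory signal into a concrete check, compare memory.current against memory.max per container. A sketch assuming cgroup v2 with the systemd driver, as in the quick checks:

# Report each container's memory use as a percentage of its limit
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  cur=$(cat "$cg/memory.current")
  max=$(cat "$cg/memory.max")
  [ "$max" = "max" ] && continue  # no limit configured; nothing to compare against
  echo "$(basename "$cg"): $((100 * cur / max))% of limit"
done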

Should-have: storage, I/O, and daemon internals

Monitor these to catch disk growth, I/O contention, and daemon stress before they cascade.

Signal: Container block I/O
  Alert when: Sustained I/O wait or bandwidth near device limits
  Why it matters: I/O-heavy containers starve neighbors on shared storage
  Read it from: API blkio_stats; cgroup v2 io.stat

Signal: Container log file size
  Alert when: Any single json-file log exceeds 1 GB without rotation
  Why it matters: Unbounded container logs are the leading cause of disk exhaustion on Docker hosts
  Read it from: ls -lh /var/lib/docker/containers/<id>/<id>-json.log

Signal: Docker daemon file descriptors
  Alert when: Usage greater than 80 percent of the process limit
  Why it matters: FD exhaustion blocks API connections, log streaming, and container operations
  Read it from: ls /proc/$(pgrep dockerd)/fd | wc -l and /proc/$(pgrep dockerd)/limits

Signal: Container health check status
  Alert when: Status is unhealthy, or FailingStreak is greater than 0
  Why it matters: The application may be deadlocked or failing even though the container is running
  Read it from: docker inspect --format '{{.State.Health.Status}}' <id>

Signal: Container exit codes
  Alert when: Nonzero exit codes on stable containers, especially 137 or 139
  Why it matters: Classifies the failure mode: OOM, segfault, or application error
  Read it from: docker inspect --format '{{.State.ExitCode}}' <id>

Signal: Docker daemon errors
  Alert when: Any panic or fatal message; sustained error rate above baseline
  Why it matters: Reveals storage driver corruption, internal bugs, and resource exhaustion
  Read it from: journalctl -u docker.service -p err --since "1 hour ago"
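
For the log-size signal, a one-liner surfaces the worst offenders; run it as root for complete results:

# Largest container logs; anything approaching 1 GB without rotation is a red flag
du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -5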

Nice-to-have: security and deep internals

Add these after you have coverage of the tiers above.

Signal: Privileged container count
  Alert when: Any privileged container that is not a known infrastructure agent
  Why it matters: Privileged mode disables most isolation and enables host compromise
  Read it from: docker inspect --format '{{.HostConfig.Privileged}}' <id>

Signal: Docker socket mounts
  Alert when: Any container mounting /var/run/docker.sock unexpectedly
  Why it matters: Socket access is equivalent to root on the host
  Read it from: docker inspect --format '{{json .Mounts}}' <id>

Signal: Container capability additions
  Alert when: SYS_ADMIN, NET_ADMIN, or SYS_PTRACE added
  Why it matters: Dangerous capabilities significantly weaken container isolation
  Read it from: docker inspect --format '{{.HostConfig.CapAdd}}' <id>

Signal: Docker daemon goroutine count
  Alert when: Greater than 10,000 sustained, or growing without bound
  Why it matters: Indicates goroutine leaks or an internal deadlock forming
  Read it from: curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=1 (if debug is enabled); approximate via /proc/$(pgrep dockerd)/status Threads
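
The goroutine signal is the only one that needs the debug endpoint. When debug is off, the daemon's kernel thread count is a rough proxy; many goroutines share one thread, so treat it as a trend, not a count:

# Rough proxy: kernel threads of the daemon process
grep Threads /proc/$(pgrep dockerd)/status
# Exact total, only if "debug": true is set in daemon.json
curl -s --unix-socket /var/run/docker.sock \
  'http://localhost/debug/pprof/goroutine?debug=1' | head -1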

Fixes

Apply fixes based on the signal category.

If disk is the bottleneck

Truncate the largest unrotated log files safely while containers run: truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log. Reclaim space with docker image prune for dangling images (add -a to also remove all images not used by a container) and docker volume prune for unused volumes. Set log-opts in /etc/docker/daemon.json with max-size and max-file to prevent recurrence; a sketch follows.
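
A minimal rotation policy looks like the following; the 100m/3 values are example choices, not recommendations, and json-file settings apply only to containers created after the restart:

# Caution: this overwrites an existing daemon.json; merge keys by hand if one exists
cat > /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF
systemctl restart docker  # existing containers keep their old log settings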

If containers are crash-looping

Check docker inspect for exit code and OOMKilled. If OOM, raise the memory limit or fix the leak. For exit code 1, read docker logs. Break the restart loop temporarily with docker update --restart=no <id> while you debug.
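
To find loopers in bulk rather than one at a time, a short sweep helps; the threshold of 5 is an arbitrary example:

# List containers with suspicious restart counts
for id in $(docker ps -aq); do
  rc=$(docker inspect --format '{{.RestartCount}}' "$id")
  if [ "$rc" -gt 5 ]; then
    docker inspect --format '{{.Name}}: {{.RestartCount}} restarts, last exit {{.State.ExitCode}}' "$id"
    # docker update --restart=no "$id"  # uncomment to break the loop while debugging
  fi
done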

If CPU throttling is causing latency

Calculate the throttle percentage from nr_throttled / nr_periods in cpu.stat. Raise the CPU limit or switch to cpuset pinning for latency-sensitive workloads instead of CFS quotas.
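
The same calculation, done host-wide in one pass; this assumes cgroup v2 with the systemd driver, as in the quick checks above:

# Percentage of CFS periods in which each container was throttled
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
  awk -v name="$(basename "$cg")" '
    $1 == "nr_periods"   { p = $2 }
    $1 == "nr_throttled" { t = $2 }
    END { if (p > 0) printf "%s: %.1f%% throttled (%d of %d periods)\n", name, 100 * t / p, t, p }
  ' "$cg/cpu.stat"
done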

If the daemon is hung

Confirm container processes are still alive with ps aux | grep containerd-shim, or through containerd directly with ctr -n moby tasks ls. If live-restore is enabled, restart dockerd with systemctl restart docker; running containers will survive. Without live-restore, a restart kills all containers.
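
Check live-restore before you restart. The API query hangs if the daemon is hung, so fall back to the config file; note that live-restore can also be set as a dockerd flag in the systemd unit:

# Ask the daemon (hangs if the daemon is hung)
docker info --format '{{.LiveRestoreEnabled}}'
# Fallback that needs no API
grep -s live-restore /etc/docker/daemon.json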

Prevention

  • Configure log rotation in /etc/docker/daemon.json with max-size and max-file defaults.
  • Set memory limits that leave headroom for native allocations. For JVM workloads, set -Xmx to roughly 75 percent of the container limit (for example, -Xmx1536m under a 2 GiB limit).
  • Enable meaningful health checks in every production image.
  • Automate cleanup of exited containers and dangling images with a scheduled docker system prune or equivalent.
  • Avoid --privileged and docker socket mounts in production workloads. Drop capabilities and run as non-root.
  • Set net.netfilter.nf_conntrack_max to at least 262144 on busy hosts and monitor utilization (see the sketch after this list).
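
For the conntrack item, the kernel exposes current utilization next to the limit; persisting the change through sysctl.d is one common approach (the file name below is an arbitrary example):

# Compare current entries against the limit (requires the nf_conntrack module)
cat /proc/sys/net/netfilter/nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
# Persist the higher limit from the checklist above
echo 'net.netfilter.nf_conntrack_max = 262144' > /etc/sysctl.d/90-conntrack.conf
sysctl --system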

How Netdata helps

  • Correlates container CPU usage with throttling metrics on the same chart, exposing the silent cause of latency spikes.
  • Reads cgroup v2 memory.events to alert immediately on OOM kills without waiting for docker inspect.
  • Tracks per-container disk usage and log growth to catch storage pressure before the host filesystem fills.
  • Monitors dockerd health, API latency, and file descriptor usage alongside container metrics to detect daemon stress.