PLAYBOOK: Monitoring Docker
SECTION 0 — Operator’s Mental Model
Internal Architecture
Docker is a layered runtime system. The operator must understand three distinct layers that process every container lifecycle event:
dockerd (the Docker daemon) — Accepts API requests via a Unix socket (/var/run/docker.sock) or TCP. Manages images, volumes, networks, and the high-level container lifecycle. It is a single long-running Go process. If it hangs or crashes, all management-plane operations stop, but running containers continue (in the default configuration).
containerd — The actual container runtime manager. dockerd delegates container creation, execution, and lifecycle to containerd via gRPC. containerd manages container processes through shim processes (one shim per running container). The shim is what survives a containerd restart — this is how Docker achieves “live restore” of running containers.
runc (or alternative OCI runtime) — The low-level binary that sets up Linux namespaces, cgroups, seccomp profiles, and actually exec’s the container’s entrypoint process. runc exits after container start; the shim takes over supervision.
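A quick way to see all three layers on a live host is to list their processes side by side. A minimal sketch, assuming a systemd-based host with Docker installed in the default locations:
# dockerd: the management plane
pgrep -a dockerd
# containerd: the runtime manager dockerd delegates to
pgrep -a containerd | grep -v shim
# one shim process per running container
pgrep -a -f containerd-shim
# the OCI runtime binary invoked at container start
runc --version
docker info --format 'OCI runtime in use: {{.DefaultRuntime}}'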
Resource Competition
Docker containers compete for every host resource, but the competition is mediated through Linux kernel mechanisms:
- CPU: Controlled via cgroups cpu/cpuset controllers. Without limits, all containers compete equally with host processes via CFS scheduling. CPU throttling is the silent killer — containers get their work done, just slowly, and throttling metrics are buried.
- Memory: Controlled via cgroups memory controller. The OOM killer operates at the cgroup level. A container hitting its memory limit gets OOM-killed by the kernel, not by Docker. Docker just reports the corpse.
- Disk I/O: Controlled via cgroups blkio controller, but rarely configured in practice. Containers sharing a storage driver (overlay2) share the same underlying filesystem I/O path.
- Network: Containers on the default bridge network go through iptables NAT. Every published port creates iptables rules. At scale, iptables rule evaluation becomes a CPU tax on every packet.
- Storage: Images, writable container layers, volumes, build cache, and logs all compete for the same filesystem. This is the single most common resource exhaustion vector.
- PIDs: Each container gets its own PID namespace, but host PIDs are finite. Containers without PID limits can fork-bomb the host.
- File Descriptors: The dockerd process itself holds file descriptors for every container’s log stream, every API connection, and every event subscription. containerd and shims hold their own sets.
- Inotify watches: The Docker daemon and containers that watch filesystems consume kernel inotify watches from a shared pool.
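Several of these shared pools can be checked directly from the host before any container-level digging. A minimal sketch of the host-level limits containers silently compete for (standard Linux paths; adjust for your distribution):
# total PID space shared by all containers and host processes
cat /proc/sys/kernel/pid_max
# system-wide file descriptor usage: allocated, free, max
cat /proc/sys/fs/file-nr
# inotify watch and instance limits shared across the host
cat /proc/sys/fs/inotify/max_user_watches /proc/sys/fs/inotify/max_user_instances
# conntrack usage vs limit (bridge-network NAT entries land here)
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
# cgroup controllers available to Docker (cgroup v2 hosts)
cat /sys/fs/cgroup/cgroup.controllers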
Characteristic Failure Archetypes
Storage exhaustion — The #1 cause of Docker host incidents. Images accumulate, build cache grows, container writable layers fill with logs or temp files, volumes are never cleaned. The host filesystem fills, and everything breaks at once: new containers can’t start, running containers can’t write, the daemon can’t update its state database.
Daemon hang/deadlock — dockerd is a complex Go process with many goroutines, internal locks, and external dependencies (containerd, storage driver, network plugins). When it hangs, docker ps blocks, health checks stop updating, orchestrators lose visibility, and operators panic — even though containers keep running.
Container resource starvation — A container without memory limits slowly eats host RAM until the kernel OOM killer shoots something — often not the offending container, but whatever looks sacrificially appropriate to the kernel’s heuristics. CPU throttling is even more insidious: the container appears “fine” but is being throttled by cgroup CFS bandwidth enforcement, causing latency spikes that look like application bugs.
Networking failures — Docker’s bridge networking relies on iptables, Linux bridge, veth pairs, and an embedded DNS server. Any of these can break independently. The embedded DNS resolver (at 127.0.0.11 inside containers) is a single-threaded component and a common bottleneck. iptables rule corruption after daemon restarts is a recurring class of incidents.
Zombie/orphan accumulation — Containers whose PID 1 does not properly reap child processes accumulate zombie processes. Docker added the --init flag (tini) to solve this, but many deployments don’t use it. Zombies consume PID table entries but no other resources — until the PID limit is hit.
Log storage explosion — The default json-file log driver writes unbounded logs to /var/lib/docker/containers/<id>/<id>-json.log. Without max-size/max-file configuration, a chatty container fills the disk. This is the most commonly encountered Docker storage incident in the wild.
Image pull storms — Multiple containers or hosts pulling images simultaneously from a registry can exhaust network bandwidth, registry rate limits, or local disk I/O. Deployments without image pre-pulling or registry mirrors are vulnerable.
Overlay filesystem corruption — The overlay2 storage driver maintains layer metadata that can become inconsistent after unclean shutdowns, disk errors, or kernel bugs. Symptoms: containers fail to start with cryptic “layer not found” or “invalid argument” errors.
Deployment Variants That Change Monitoring
- Standalone Docker vs Swarm mode: Swarm adds Raft consensus, service mesh (routing mesh with IPVS), internal load balancing, and overlay networks with VXLAN tunneling. Each adds its own failure modes and signals.
- Docker with systemd cgroup driver vs cgroupfs: Affects how cgroup hierarchies are organized and how resource metrics are collected. Kubernetes mandates systemd; standalone Docker typically uses cgroupfs.
- Docker with live-restore enabled: When live-restore: true, containers survive daemon restarts. This changes daemon failure behavior significantly — monitoring must account for containers running without daemon supervision.
- Storage driver choice: overlay2 (modern default) vs devicemapper (legacy, different failure modes) vs btrfs/zfs (different space accounting). overlay2 is >95% of production deployments now.
- Logging driver: json-file (default, local), journald, syslog, fluentd, awslogs, etc. Changes where container logs appear and what log-related metrics are available.
- User namespace remapping: When enabled, changes file ownership semantics and can affect volume permissions. Affects how you interpret file-level metrics.
- Rootless Docker: Runs without root privileges. Changes which host metrics are accessible and how resource limits are applied.
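Most of these variants can be read straight from docker info, which is a reasonable first step before choosing collection methods. A minimal sketch; the template field names assume a current Docker Engine and may differ on older releases:
docker info --format 'Storage driver:  {{.Driver}}'
docker info --format 'Logging driver:  {{.LoggingDriver}}'
docker info --format 'Cgroup driver:   {{.CgroupDriver}} (v{{.CgroupVersion}})'
docker info --format 'Live restore:    {{.LiveRestoreEnabled}}'
docker info --format 'Swarm state:     {{.Swarm.LocalNodeState}}'
# security options include name=rootless and name=userns when those modes are active
docker info --format 'Security opts:   {{.SecurityOptions}}'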
SECTION 1 — Signal Catalog
DOMAIN: Availability
SIGNAL: Docker Daemon Responsiveness
WHAT IT IS: Whether the Docker daemon (dockerd) is accepting and responding to API requests within a reasonable time.
SOURCE:
Unix socket /var/run/docker.sock — the Docker Engine API. The /_ping endpoint is the lightweight health check. The systemd unit docker.service tracks process state.
HOW TO COLLECT IT MANUALLY:
# Lightweight ping (should return "OK" in <100ms)
curl -s --unix-socket /var/run/docker.sock http://localhost/_ping
# With timeout to detect hangs
curl -s --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping
# Check systemd unit state
systemctl is-active docker
# Check process existence and state
ps aux | grep dockerd | grep -v grep
WHAT IT TELLS YOU:
If this fails or is slow, the Docker management plane is down or degraded. Running containers continue operating, but you cannot start, stop, inspect, or manage containers. Orchestrators (Kubernetes, Swarm, Nomad) lose control of the node. Health checks that rely on docker inspect or docker exec stop working.
SEVERITY:
- PAGE — If /_ping does not respond within 10 seconds or the process is absent. This is a complete loss of container management capability on this host.
- TICKET — If /_ping responds but consistently takes >1 second. The daemon is under stress or partially locked.
THRESHOLDS:
- Response time to /_ping should be under 100ms under normal conditions. Sustained >500ms indicates daemon stress. No response within 5 seconds is a hang.
- Any absence of the dockerd process is critical.
FAILURE MODES DETECTED:
- Daemon crash (process gone)
- Daemon deadlock (process alive, not responding — the worst case because it looks alive to process monitors)
- containerd communication failure (daemon up but unable to manage containers)
- Storage driver lock (daemon up but blocked on I/O operations)
NUANCES & GOTCHAS:
- /_ping can return OK even when the daemon is partially degraded — it’s a shallow health check. A hung docker ps with a healthy /_ping means the daemon is alive but one of its internal subsystems is locked.
- systemd will report the service as active even during a deadlock because the process exists.
- With live-restore: true, a daemon restart is operationally different from a crash — containers survive. Without it, all containers die when the daemon does.
- Docker Desktop (macOS/Windows) has a VM layer between the daemon and the host — monitoring the daemon requires entering the VM.
CORRELATES WITH:
- If the daemon is unresponsive AND container processes are still running (visible via ps), this is a daemon hang, not a host failure.
- If the daemon is unresponsive AND containerd is also unresponsive (ctr version fails), the issue is deeper — possibly a kernel or storage driver problem.
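The correlations above fold into one triage sequence. A minimal sketch that separates “daemon gone”, “daemon hung”, and “runtime-level problem”; the timeouts are illustrative, not canonical:
# 1. Is the process even there?
pgrep -x dockerd >/dev/null || echo "dockerd process absent"
# 2. Shallow API check; a hang here means the daemon or socket is wedged
timeout 5 curl -s --unix-socket /var/run/docker.sock http://localhost/_ping || echo "ping failed or hung"
# 3. Deeper check; ping OK but this hangs means an internal subsystem is locked
timeout 10 docker ps -q >/dev/null || echo "docker ps hung or failed"
# 4. Is containerd itself responsive?
timeout 5 ctr version >/dev/null 2>&1 || echo "containerd unresponsive"
# 5. Are container processes still alive despite all of the above?
pgrep -a -f containerd-shim | head -5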
SIGNAL: Containerd Responsiveness
WHAT IT IS: Whether the containerd runtime is operational and responding to requests from dockerd.
SOURCE:
containerd’s gRPC socket, typically at /run/containerd/containerd.sock. The ctr CLI talks to containerd directly.
HOW TO COLLECT IT MANUALLY:
# Check containerd is responding
ctr version
# Check containerd's namespace list (Docker uses the "moby" namespace)
ctr -n moby containers list
# Check systemd unit
systemctl is-active containerd
# Check containerd process
ps aux | grep containerd | grep -v grep
WHAT IT TELLS YOU: containerd is the actual supervisor of running containers via shim processes. If containerd is down but shims are alive, running containers continue. If containerd hangs, dockerd will eventually hang too — it cannot complete any container lifecycle operation.
SEVERITY:
- PAGE — If containerd is unresponsive and dockerd is also becoming unresponsive. Cascade failure in progress.
- TICKET — If containerd is slow but still responding. Usually indicates storage driver or snapshot issues.
THRESHOLDS:
- ctr version should respond within 1 second. No response within 5 seconds indicates a problem.
FAILURE MODES DETECTED:
- containerd crash (rare but possible — usually a Go runtime panic)
- containerd blocked on storage operations (snapshot driver issues)
- containerd blocked on shim communication (usually indicates a stuck container)
NUANCES & GOTCHAS:
- containerd serves multiple namespaces. Docker uses the “moby” namespace. Kubernetes uses a different namespace (“k8s.io”). Issues in one namespace don’t necessarily affect the other.
- containerd can appear healthy while individual shim processes are stuck. ctr -n moby tasks list shows task states.
CORRELATES WITH:
- containerd hang + specific container stuck in “stopping” state = likely a shim or mount issue for that container
- containerd hang + disk I/O saturation = storage driver problem
SIGNAL: Container State Distribution
WHAT IT IS: The count of containers in each state: running, paused, stopped (exited), restarting, created, dead, and removal-in-progress.
SOURCE:
Docker Engine API endpoint /containers/json?all=true. Each container has a State.Status field.
HOW TO COLLECT IT MANUALLY:
# Count by state
docker ps -a --format '{{.Status}}' | awk '{print $1}' | sort | uniq -c | sort -rn
# Specifically running count
docker ps -q | wc -l
# All containers count
docker ps -aq | wc -l
# Restarting containers (the danger signal)
docker ps --filter "status=restarting" --format '{{.Names}} {{.Status}}'
# Dead containers
docker ps --filter "status=dead" --format '{{.Names}} {{.Status}}'
WHAT IT TELLS YOU: The health of the workload running on this Docker host. Key patterns:
- Many “restarting” containers = crash loops, misconfiguration, or dependency failures
- “Dead” containers = containers that could not be removed, usually due to storage driver issues
- Running count lower than expected = something has crashed and is not restarting
- Large number of stopped containers accumulating = no cleanup policy, slow storage consumption
SEVERITY:
- PAGE — Any container in “dead” state (indicates storage/runtime corruption). Any critical-path container not in “running” state.
- TICKET — Containers in restart loops (restarting state). Unexpected stopped containers.
- INFO — Normal stopped containers from batch jobs, CI/CD runs.
THRESHOLDS:
- “Dead” containers: any nonzero count is abnormal and requires investigation.
- “Restarting” containers: any nonzero count for non-transient containers requires investigation.
- Stopped container accumulation: more than 100 stopped containers suggests missing cleanup.
FAILURE MODES DETECTED:
- Application crash loops (restarting)
- Storage driver corruption (dead state)
- Configuration errors (immediate exit after start)
- OOM kills (exit code 137, containers that were running and stopped unexpectedly)
- Dependency failures (containers exiting because a linked service is unavailable)
NUANCES & GOTCHAS:
- A container in “restarting” state with a restart policy will flap between “restarting” and “running.” The RestartCount field is more informative than the current state.
- Exit code 137 = SIGKILL (typically OOM kill). Exit code 143 = SIGTERM (graceful shutdown). Exit code 139 = SIGSEGV. These are crucial for diagnosis.
- Containers with restart: always mask crashes — the container keeps coming back, but every restart is data worth tracking.
- docker ps itself can hang if the daemon is hung. The hang of the monitoring command is itself a signal.
CORRELATES WITH:
- Restarting containers + OOM events in dmesg = containers being OOM-killed and restarting
- Dead containers + disk full = overlay2 metadata corruption from disk pressure
- Sudden drop in running count + daemon responsive = either orchestrator-initiated drain or mass crash
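Exit codes and the OOMKilled flag are what turn a state count into a diagnosis. A minimal sketch that lists every non-running container with both (output fields are illustrative):
# name, status, exit code, and whether PID 1 was OOM-killed, for every exited or dead container
for id in $(docker ps -aq --filter "status=exited" --filter "status=dead"); do
  docker inspect --format '{{.Name}} {{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$id"
done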
SIGNAL: Container Restart Count
WHAT IT IS: The cumulative number of times a container has been restarted by its restart policy since creation.
SOURCE:
Docker Engine API: /containers/<id>/json → RestartCount field. Also available in docker inspect.
HOW TO COLLECT IT MANUALLY:
# Show restart counts for all running containers
docker inspect --format '{{.Name}} {{.RestartCount}}' $(docker ps -q)
# Show containers with nonzero restart counts
docker inspect --format '{{.Name}} {{.RestartCount}} {{.State.ExitCode}}' $(docker ps -aq) | awk '$2 > 0'
WHAT IT TELLS YOU: A nonzero and increasing restart count means the container is crash-looping. The rate of restarts is more important than the absolute count — a container that restarted 50 times last month is different from one that restarted 50 times in the last hour.
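Because the rate matters more than the total, a simple approach is to sample RestartCount twice and diff. A minimal sketch; the 60-second window is an arbitrary choice:
# snapshot restart counts, wait, snapshot again, print containers whose count moved
docker inspect --format '{{.Name}} {{.RestartCount}}' $(docker ps -aq) | sort > /tmp/restarts.before
sleep 60
docker inspect --format '{{.Name}} {{.RestartCount}}' $(docker ps -aq) | sort > /tmp/restarts.after
# lines unique to the "after" snapshot are containers that restarted during the window
comm -13 /tmp/restarts.before /tmp/restarts.after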
SEVERITY:
- PAGE — Restart count increasing faster than once per minute for any production container.
- TICKET — Any nonzero restart count for a container that should be stable (non-batch workloads).
- INFO — Low restart counts on development/CI containers.
THRESHOLDS:
- Rate-based: alert on the restart rate rather than the absolute count. A single restart could be a transient issue; sustained restarts indicate a persistent problem.
- The combination of restart count + exit code is more informative than either alone.
FAILURE MODES DETECTED:
- Application bugs causing crashes
- Resource exhaustion (OOM) causing kills
- Misconfiguration (bad environment variables, missing mounts)
- Dependency unavailability (database down, DNS not resolving)
- Health check failures causing orchestrator-driven restarts
NUANCES & GOTCHAS:
- Docker’s exponential backoff for restarts means a high restart count may span a long time. Check State.StartedAt to determine recency.
- restart: unless-stopped behaves differently from restart: always across daemon restarts. Know which policy is in use.
- Kubernetes has its own restart counter (pod restarts) which is separate from Docker’s RestartCount. Don’t conflate them.
CORRELATES WITH:
- Restart count increasing + exit code 137 = OOM kill loop, check memory limits and actual usage
- Restart count increasing + exit code 1 = application error, check container logs
- Restart count increasing + exit code 126/127 = binary not found or not executable, configuration problem
DOMAIN: Resource Utilization — CPU
SIGNAL: Container CPU Usage (User + System)
WHAT IT IS: The total CPU time consumed by all processes within a container’s cgroup, expressed as a percentage of allocated or available CPU time.
SOURCE:
cgroup v1: /sys/fs/cgroup/cpu,cpuacct/docker/<container-id>/cpuacct.usage (total nanoseconds) and cpuacct.stat (user/system ticks).
cgroup v2: /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.stat (usage_usec, user_usec, system_usec).
Docker Engine API: /containers/<id>/stats stream provides cpu_stats and precpu_stats for calculating percentage.
HOW TO COLLECT IT MANUALLY:
# Via docker stats (live, simple)
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Via cgroup v2 directly (more precise)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
# Via Docker API (raw stats for calculation)
curl -s --unix-socket /var/run/docker.sock http://localhost/containers/<id>/stats?stream=false | python3 -m json.tool
WHAT IT TELLS YOU: How much compute the container is actually consuming. This must be interpreted alongside CPU limits — a container using 100% of 0.5 CPU cores is in a very different situation from one using 100% of 16 cores.
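The stats endpoint returns raw counters; the percentage docker stats shows is derived from the delta between cpu_stats and precpu_stats. A minimal sketch of that calculation from a single non-streaming sample (on the very first read precpu may be empty, in which case two samples are needed):
curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/<id>/stats?stream=false" | python3 -c '
import sys, json
s = json.load(sys.stdin)
cpu_delta = s["cpu_stats"]["cpu_usage"]["total_usage"] - s["precpu_stats"]["cpu_usage"]["total_usage"]
sys_delta = s["cpu_stats"].get("system_cpu_usage", 0) - s["precpu_stats"].get("system_cpu_usage", 0)
ncpus = s["cpu_stats"].get("online_cpus") or len(s["cpu_stats"]["cpu_usage"].get("percpu_usage", [])) or 1
if sys_delta > 0:
    print("CPU%% = %.2f" % (cpu_delta / sys_delta * ncpus * 100))
else:
    print("need two samples (precpu empty on first read)")
'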
SEVERITY:
- TICKET — Container consistently using >90% of its CPU limit (about to hit throttling). Container using more than its expected baseline by >2x without traffic increase.
- PLAN — Sustained CPU growth trend over days/weeks.
- INFO — Normal CPU utilization within expected ranges.
THRESHOLDS:
- If CPU limits are set: sustained usage >80% of the limit signals impending throttling.
- If no CPU limits: compare against host CPU capacity. A single container using >50% of total host CPU is concerning unless expected.
- Rate of change: sudden 3x+ increase without corresponding traffic increase warrants investigation.
FAILURE MODES DETECTED:
- CPU-bound processing bottleneck
- Infinite loops or runaway computation
- Garbage collection storms (JVM, Go, .NET)
- Cryptomining malware (steady high CPU with no corresponding business traffic)
NUANCES & GOTCHAS:
- docker stats CPU percentage can exceed 100% on multi-core hosts — 200% means two full cores.
- CPU usage alone doesn’t tell you about throttling. A container can show moderate CPU usage while being heavily throttled if the burst/limit ratio is tight.
- cgroup v1 and v2 expose CPU metrics differently. Most modern distributions use cgroup v2.
- Short bursts of high CPU (compiling, startup) are normal and shouldn’t trigger alerts. Sustained high CPU is the signal.
- System CPU vs user CPU: high system CPU in a container usually indicates heavy I/O (system calls), not compute work.
CORRELATES WITH:
- High CPU + high CPU throttling = container is CPU-limited, needs limit increase or optimization
- High CPU + low request rate = processing inefficiency, possible runaway
- High CPU + high memory = might be swap thrashing (if swap is enabled in cgroup)
SIGNAL: Container CPU Throttling
WHAT IT IS: The amount of time a container’s processes were prevented from running by the CFS bandwidth controller because they exceeded their CPU quota within a scheduling period.
SOURCE:
cgroup v1: /sys/fs/cgroup/cpu,cpuacct/docker/<container-id>/cpu.stat — fields nr_throttled (count of periods throttled) and throttled_time (total nanoseconds throttled).
cgroup v2: /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.stat — fields nr_throttled and throttled_usec.
HOW TO COLLECT IT MANUALLY:
# For cgroup v2
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
# Look at: nr_periods, nr_throttled, throttled_usec
# Calculate throttle percentage
# throttled_pct = (nr_throttled / nr_periods) * 100
# Quick view across all containers
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
echo "=== $(basename $cg) ==="
grep -E "nr_periods|nr_throttled|throttled" $cg/cpu.stat
done
WHAT IT TELLS YOU: This is the most under-monitored signal in Docker deployments. CPU throttling directly causes latency — the container’s processes are literally paused by the kernel. Application developers see unexplained latency spikes, p99 degradation, and timeout errors. The container doesn’t look “overloaded” in CPU percentage because it’s being prevented from using more CPU.
SEVERITY:
- PAGE — Throttle percentage >50% for any latency-sensitive container. The container is spending more time waiting than working.
- TICKET — Throttle percentage >25% for production containers. Noticeable latency impact.
- PLAN — Any nonzero throttling for containers that are supposed to be performant.
THRESHOLDS:
- Throttle percentage = nr_throttled / nr_periods * 100
- 5% throttling is noticeable in latency-sensitive applications
- 25% throttling causes significant p99 latency degradation
- 50% throttling indicates the CPU limit is fundamentally too low for the workload
FAILURE MODES DETECTED:
- Incorrect CPU limits (too restrictive for the workload)
- Bursty workloads hitting CFS period boundaries (100ms default period)
- GC pauses that consume the entire CPU quota in a burst
- CPU limits set based on average usage without accounting for burst needs
NUANCES & GOTCHAS:
- The CFS period problem: Linux CFS enforces CPU limits over 100ms periods by default. A container with a 50ms quota (0.5 CPU) that does all its work in a 50ms burst at the start of each period will be throttled for the remaining 50ms — even though it averaged 50% utilization. This is the #1 cause of “unexplained latency” in containerized applications.
- Multi-threaded applications are hit harder — all threads share the quota. A JVM with 8 GC threads can consume the entire period’s quota during a GC pause.
- Some operators “fix” this by removing CPU limits entirely. This trades throttling problems for noisy-neighbor problems. The correct fix is usually increasing the limit or adjusting cpu.cfs_period_us.
- Kubernetes 1.20+ supports cpuManagerPolicy: static for guaranteed pods, which bypasses CFS entirely by pinning to cores.
- cgroup v2 improved the CFS bandwidth controller with burst support (cpu.max.burst), partially addressing the bursty workload problem.
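The period/quota mechanics described in the gotchas above are visible directly in cgroup v2’s cpu.max, and the throttle ratio falls out of cpu.stat. A minimal sketch, assuming cgroup v2 and the systemd cgroup layout used elsewhere in this section:
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
CG=/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope
# "50000 100000" means a 50ms quota per 100ms period, i.e. --cpus=0.5; "max 100000" means unlimited
cat ${CG}/cpu.max
# throttle ratio = periods in which the quota ran out / total enforced periods
awk '/nr_periods/ {p=$2} /nr_throttled/ {t=$2} END {if (p>0) printf "throttled in %.1f%% of periods\n", t/p*100}' ${CG}/cpu.stat
# loosening the quota without restarting the container (value illustrative)
# docker update --cpus 1.0 <name>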
CORRELATES WITH:
- CPU throttling + application latency spikes = the throttling IS the cause of the latency
- CPU throttling + moderate CPU usage percentage = the limit is too low, not the workload too high
- CPU throttling + GC metrics showing frequent collection = GC is consuming the CPU budget
DOMAIN: Resource Utilization — Memory
SIGNAL: Container Memory Usage
WHAT IT IS: The current memory consumption of all processes within a container’s cgroup, including RSS (resident set size), cache/page cache, and kernel memory.
SOURCE:
cgroup v1: /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes and memory.stat
cgroup v2: /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.current and memory.stat
Docker API: /containers/<id>/stats → memory_stats.usage and memory_stats.limit
HOW TO COLLECT IT MANUALLY:
# Via docker stats
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Via cgroup v2 directly
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.stat
# Key fields in memory.stat:
# anon = anonymous pages (heap, stack — actual application memory)
# file = page cache (can be reclaimed)
# slab = kernel slab allocations
# Show limit
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
WHAT IT TELLS YOU: How much memory the container is using relative to its limit (if set) or to host memory (if not). The critical distinction is between reclaimable memory (page cache) and non-reclaimable memory (anonymous pages, kernel memory). Many operators over-alert on total memory because they include cache — when the kernel needs that memory, it reclaims it.
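A rough non-reclaimable (“working set”) number can be derived from memory.current minus the inactive page cache, or from the anon line in memory.stat. A minimal sketch under cgroup v2; the subtraction is an approximation, not an exact OOM predictor:
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
CG=/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope
CURRENT=$(cat ${CG}/memory.current)
INACTIVE_FILE=$(awk '$1=="inactive_file" {print $2}' ${CG}/memory.stat)
ANON=$(awk '$1=="anon" {print $2}' ${CG}/memory.stat)
LIMIT=$(cat ${CG}/memory.max)
echo "current=${CURRENT} anon=${ANON} approx_working_set=$((CURRENT - INACTIVE_FILE)) limit=${LIMIT}"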
SEVERITY:
- PAGE — Container memory usage (excluding reclaimable cache) >95% of its limit. OOM kill is imminent.
- TICKET — Container memory usage >80% of limit with upward trend.
- PLAN — Steady memory growth over days (potential leak).
- INFO — Stable memory usage within expected range.
THRESHOLDS:
- Usage as percentage of limit: >80% warrants attention, >90% is urgent
- Rate of growth: memory that only goes up (never plateaus) is a leak until proven otherwise
- The ratio of anon to file in memory.stat matters: high anon that keeps growing = probable leak
FAILURE MODES DETECTED:
- Memory leaks (steadily increasing anon memory)
- Cache pressure (high file/cache memory competing with application needs)
- OOM kill risk (approaching memory limit)
- Kernel memory leaks (rare but devastating — slab growth)
- Connection/goroutine/thread leaks (each consumes memory)
NUANCES & GOTCHAS:
- docker stats shows usage / limit, but usage includes page cache. The “real” usage that will trigger OOM is the non-reclaimable portion. Use memory.stat’s anon + kernel fields for the non-reclaimable number.
- cgroup v1’s memory.usage_in_bytes includes cache. memory.usage_in_bytes - memory.stat[total_inactive_file] was the canonical “real” usage formula, but it’s imprecise. cgroup v2 is much cleaner.
- JVM applications are especially tricky: the JVM pre-allocates heap (-Xmx), so RSS jumps to the heap max at startup and stays there. Growth within the JVM is invisible to cgroup metrics — you need JVM-level metrics for that.
- memory.max = max means no limit is set. The container can consume all host memory.
- Swap accounting: if memory.swap.max is set (cgroup v2), the container can use swap. Memory pressure signals change meaning when swap is involved.
CORRELATES WITH:
- Memory approaching limit + OOM events in dmesg = containers being killed
- Memory growth + increasing container restart count = OOM crash loop
- High file memory + high disk read I/O = healthy cache behavior, not a problem
- High anon memory + no corresponding workload increase = memory leak
SIGNAL: Container OOM Kill Events
WHAT IT IS: Occurrences of the kernel’s Out-of-Memory killer terminating processes within a container’s cgroup because memory usage hit the cgroup limit.
SOURCE:
cgroup v1: /sys/fs/cgroup/memory/docker/<container-id>/memory.oom_control — oom_kill counter.
cgroup v2: /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.events — oom_kill counter.
Kernel log: dmesg / journalctl -k — messages containing “Killed process” or “oom-kill” with cgroup path.
Docker events: docker events --filter event=oom.
HOW TO COLLECT IT MANUALLY:
# Via cgroup v2
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.events
# Look for: oom_kill N
# Via kernel log
dmesg | grep -i "oom\|killed process" | tail -20
# Via Docker events (streaming — run in background)
docker events --filter event=oom --since 1h
# Check if a stopped container was OOM-killed
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <name>
WHAT IT TELLS YOU: The kernel forcefully terminated a process because its cgroup exceeded the memory limit. This is not a graceful shutdown — the process is killed with SIGKILL. Data loss, corrupted state, and incomplete transactions are likely. Exit code 137 is the telltale.
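Docker’s OOMKilled flag (checked above) only reflects kills of PID 1, so a fuller sweep combines it with the cgroup counter, which also catches child-process kills. A minimal sketch across all running containers, using the cgroup v2 paths from the collection steps above:
for id in $(docker ps -q); do
  full=$(docker inspect --format '{{.Id}}' "$id")
  name=$(docker inspect --format '{{.Name}}' "$id")
  # kernel-level counter catches child-process kills that Docker's OOMKilled flag misses
  kills=$(awk '$1=="oom_kill" {print $2}' /sys/fs/cgroup/system.slice/docker-${full}.scope/memory.events 2>/dev/null)
  echo "${name} cgroup_oom_kill=${kills:-n/a} OOMKilled=$(docker inspect --format '{{.State.OOMKilled}}' "$id")"
done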
SEVERITY:
- PAGE — Any OOM kill in production. Every OOM kill risks data corruption and user-visible impact.
- TICKET — OOM kills in staging/dev environments that indicate a limit needs adjustment before reaching production.
THRESHOLDS:
- Any nonzero OOM kill count is abnormal for a correctly configured container. This is a binary signal.
FAILURE MODES DETECTED:
- Memory limit too low for the workload
- Memory leak in the application
- Burst memory usage (e.g., large query result sets, file processing)
- Fork/exec bombs consuming memory
- JVM heap misconfigured relative to container limit (common: -Xmx set equal to container limit, leaving no room for JVM metaspace, threads, native memory)
NUANCES & GOTCHAS:
- The OOM killer may not kill the container’s PID 1. It kills the process with the highest oom_score_adj + memory usage. In a multi-process container, a child process might be killed while PID 1 survives, leaving the container in a degraded state without triggering a restart.
- docker inspect only shows OOMKilled: true if PID 1 was killed. Child OOM kills are invisible to Docker.
- In Kubernetes, pod OOM kills show as OOMKilled reason in pod status. But kubectl describe pod may show Reason: OOMKilled even when it was a child process.
- The kernel’s OOM killer considers oom_score_adj. Some critical system processes are protected, which means container processes are preferentially killed.
- cgroup v2’s memory.events also has oom (count of times OOM was triggered, which may not result in a kill if memory.oom.group is not set) vs oom_kill (actual kills). These are different counters.
CORRELATES WITH:
- OOM kills + container restart count increasing = OOM crash loop
- OOM kills + no memory limit set = host-level OOM, much worse — the kernel is choosing victims across the entire system
- OOM kills + swap usage = swap exhausted too, or swap not available
DOMAIN: Resource Utilization — Storage
SIGNAL: Docker Data Directory Disk Usage
WHAT IT IS: The total disk space consumed by Docker’s data directory, which stores images, containers, volumes, build cache, and runtime state.
SOURCE:
The Docker data directory, typically /var/lib/docker/. The docker system df command provides a breakdown. The underlying filesystem’s usage via df.
HOW TO COLLECT IT MANUALLY:
# Overall Docker disk usage breakdown
docker system df
# Detailed breakdown (warning: slow on large deployments)
docker system df -v
# Filesystem usage of Docker's data directory
df -h /var/lib/docker/
# Breakdown by type
du -sh /var/lib/docker/overlay2/ # Image and container layers
du -sh /var/lib/docker/volumes/ # Named volumes
du -sh /var/lib/docker/containers/ # Container metadata and logs
du -sh /var/lib/docker/buildkit/ # Build cache
# Dangling images (unused, reclaimable)
docker images --filter "dangling=true" -q | wc -l
# Unused volumes
docker volume ls --filter "dangling=true" -q | wc -l
WHAT IT TELLS YOU:
Whether Docker is approaching disk exhaustion. When the Docker data directory’s filesystem fills up, the failure is catastrophic and cascading: containers can’t write, new containers can’t start, images can’t be pulled, the daemon’s internal database (in /var/lib/docker/) can’t be updated, and the daemon may hang or crash.
SEVERITY:
- PAGE — Filesystem containing /var/lib/docker/ at >90% utilization. Active risk of cascading failure.
- TICKET — >80% utilization or >50GB of reclaimable space (dangling images + unused volumes + build cache).
- PLAN — Steady growth trend that projects 90% within 2 weeks.
THRESHOLDS:
- Filesystem usage: >90% is critical, >80% is warning
- Reclaimable space: when reclaimable space > 20% of total Docker disk usage, a cleanup is overdue
- Container log sizes: any single container log file >1GB without log rotation configured
- Build cache: >10GB suggests no automated cache cleanup
FAILURE MODES DETECTED:
- Disk exhaustion from accumulated images (no image cleanup policy)
- Disk exhaustion from container logs (no log rotation)
- Disk exhaustion from orphaned volumes (volumes from deleted containers)
- Disk exhaustion from build cache (CI/CD hosts)
- Overlay2 metadata growth from many layers
- Large container writable layers (application writing to container filesystem instead of volumes)
NUANCES & GOTCHAS:
- docker system df shows “RECLAIMABLE” space — this is what you can recover with docker system prune. But prune will delete ALL stopped containers, ALL unused networks, dangling images (and all unused images with -a), and optionally ALL unused volumes. In production, this is dangerous — those stopped containers might contain debug data from recent incidents. A more targeted cleanup is sketched after this list.
- The overlay2 directory structure uses hard links for layer sharing. du may over-count or under-count depending on the filesystem and du flags. docker system df is the authoritative source.
- Container log files live in /var/lib/docker/containers/<id>/ and are NOT cleaned up by docker system prune. They’re only removed when the container is removed.
- Volume data persists even after the container using it is removed. Orphan volumes accumulate silently.
- Build cache can grow to hundreds of gigabytes on CI/CD hosts. docker builder prune is separate from docker system prune.
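A safer alternative to a blanket docker system prune is to reclaim the individually harmless categories first. A minimal sketch, assuming you have verified nothing below is still needed; the retention windows and size budgets are illustrative:
# dangling images only (untagged layers left behind by rebuilds)
docker image prune -f
# stopped containers older than 24h, not everything that is merely stopped
docker container prune -f --filter "until=24h"
# BuildKit build cache beyond a size budget
docker builder prune -f --keep-storage 10GB
# unused networks
docker network prune -f
# volumes are the most dangerous category: list and review before removing anything
docker volume ls --filter "dangling=true"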
CORRELATES WITH:
- Disk usage high + many stopped containers = need container cleanup policy
- Disk usage high + large overlay2 directory = too many image layers, need image cleanup
- Disk usage high + large containers/ directory = log rotation not configured
- Sudden disk usage spike = likely a container writing large files to its writable layer
SIGNAL: Container Writable Layer Size
WHAT IT IS: The amount of data written to a container’s writable (top) layer in the overlay2 filesystem. This is data written to the container’s filesystem that is not on a volume.
SOURCE:
Docker Engine API: /containers/<id>/json → SizeRw field (requires size=true query parameter). Also visible via docker ps --size.
HOW TO COLLECT IT MANUALLY:
# Show writable layer size for all running containers
# WARNING: --size is expensive, triggers a filesystem walk for each container
docker ps --size --format "table {{.Names}}\t{{.Size}}"
# For a specific container
docker inspect --size --format '{{.SizeRw}}' <name>
WHAT IT TELLS YOU: Containers should ideally write nothing to their writable layer — all persistent or large data should go to volumes. A growing writable layer indicates the application is writing temp files, logs, or data to the container filesystem. This data is lost on container restart and consumes overlay2 space.
SEVERITY:
- TICKET — Any container with writable layer >1GB. Indicates architectural issue or misconfiguration.
- PLAN — Containers with writable layers >100MB growing steadily.
THRESHOLDS:
- >100MB: worth investigating
- >1GB: definitely misconfigured — either logs, temp files, or data should be on a volume
- Any sustained growth in writable layer size over time
FAILURE MODES DETECTED:
- Application writing logs to filesystem instead of stdout/stderr
- Application storing temp files that aren’t cleaned up
- Application writing data that should be on a volume
- Disk pressure from accumulating writable layers
NUANCES & GOTCHAS:
- docker ps --size and docker inspect --size are expensive operations — they trigger a filesystem walk. Do not poll this frequently.
- The writable layer is specific to overlay2. Other storage drivers have different mechanisms.
- Deleted files in lower layers still appear in the writable layer as whiteout files — the writable layer size can grow even if the container’s df shows low usage.
CORRELATES WITH:
- Large writable layer + no log rotation + no volume mounts for /var/log = log accumulation
- Large writable layer + high disk I/O = container writing heavily to overlay2 instead of volumes
SIGNAL: Container Log File Size
WHAT IT IS: The size of log files generated by the Docker logging driver for each container.
SOURCE:
For the default json-file driver: /var/lib/docker/containers/<container-id>/<container-id>-json.log
The docker logs command reads from these files.
HOW TO COLLECT IT MANUALLY:
# Size of all container log files
find /var/lib/docker/containers/ -name "*-json.log" -exec ls -lh {} \;
# Total log space
du -sh /var/lib/docker/containers/*/
# For a specific container
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
ls -lh /var/lib/docker/containers/${CONTAINER_ID}/${CONTAINER_ID}-json.log
# Check log rotation config for a container
docker inspect --format '{{.HostConfig.LogConfig}}' <name>
WHAT IT TELLS YOU:
Whether container logs are consuming excessive disk space. The default json-file driver has NO size limit unless configured with max-size and max-file options. A single chatty container can fill a disk.
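Rotation is a daemon/host configuration, not something the container controls. A minimal sketch of the two common ways to cap json-file logs; the sizes are illustrative, and a daemon.json change requires a daemon restart and only applies to containers created afterwards:
# per-container, at run time
docker run -d --log-driver json-file --log-opt max-size=10m --log-opt max-file=3 <image>
# host-wide default via /etc/docker/daemon.json (merge with existing settings, don't blindly overwrite)
cat <<'EOF' > /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
EOF
systemctl restart docker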
SEVERITY:
- PAGE — Any single container log file >10GB or total log space causing filesystem pressure.
- TICKET — Any container log file >1GB without rotation configured.
- PLAN — Log growth rate that projects disk pressure within 2 weeks.
THRESHOLDS:
- Individual log files >1GB without max-size configured: requires immediate configuration fix
- Total container log space >20% of Docker data directory: cleanup and rotation needed
FAILURE MODES DETECTED:
- Missing log rotation configuration (the most common Docker storage incident)
- Application-level logging too verbose (debug logging in production)
- Error loops generating massive log output
- Health check output generating logs (health check results go to container logs)
NUANCES & GOTCHAS:
- docker logs for a container with a multi-gigabyte log file can OOM the daemon or the client, or both. Always use --tail N.
- Log rotation in Docker (max-size and max-file in log-opts) performs rotation at the Docker daemon level. The application inside the container doesn’t know its logs are being rotated.
- If the logging driver is journald, logs go to the journal and are subject to journal rotation. If it’s syslog, logs go to the syslog daemon. The file-based metrics above only apply to json-file and local drivers.
- Docker’s local logging driver is more efficient than json-file and supports rotation by default — but is not the default.
CORRELATES WITH:
- Large log files + disk usage high = logs are a primary contributor to disk pressure
- Large log files + container restart loops = error output flooding logs on each restart
DOMAIN: Resource Utilization — Network
SIGNAL: Container Network I/O
WHAT IT IS: The bytes and packets transmitted and received on each network interface within a container’s network namespace.
SOURCE:
Container’s network namespace: /proc/<container-pid>/net/dev
Docker API: /containers/<id>/stats → networks object, containing rx_bytes, tx_bytes, rx_packets, tx_packets, rx_errors, tx_errors, rx_dropped, tx_dropped per interface.
HOW TO COLLECT IT MANUALLY:
# Via docker stats
docker stats --no-stream --format "table {{.Name}}\t{{.NetIO}}"
# Via container's /proc
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' <name>)
cat /proc/${CONTAINER_PID}/net/dev
# Via nsenter for full view
nsenter -t ${CONTAINER_PID} -n ip -s link show
# Via Docker API
curl -s --unix-socket /var/run/docker.sock \
http://localhost/containers/<id>/stats?stream=false | \
python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps(d['networks'], indent=2))"
WHAT IT TELLS YOU: The volume and quality of network communication for a container. Rates and errors matter more than absolute values. Dropped packets and errors indicate network infrastructure problems — veth pair issues, bridge saturation, iptables bottlenecks, or upstream network problems.
SEVERITY:
- PAGE — rx_errors or tx_errors increasing for any container. Packet drops increasing for production containers.
- TICKET — Unusual traffic patterns (e.g., sudden 10x increase in tx_bytes).
- INFO — Normal traffic volume and patterns.
THRESHOLDS:
- Errors/drops: any nonzero rate in production requires investigation
- Traffic volume: deviation of >3x from the rolling 1-hour average
- Packet rate: extremely high packet rates (>100k pps per container) stress the kernel network stack
FAILURE MODES DETECTED:
- Network interface saturation
- veth pair issues (one side of the pipe failing)
- iptables rule corruption causing drops
- DNS resolution failures (visible as connection errors, not network errors per se)
- Container networking plugin failures
NUANCES & GOTCHAS:
- Containers with --network host share the host’s network namespace — their traffic appears on host interfaces, not container-specific ones.
- In overlay networks (Swarm, Weave, etc.), additional encapsulation (VXLAN) adds overhead. The reported bytes don’t include encapsulation overhead.
- docker stats aggregates across all container interfaces. A container with multiple networks has separate stats per interface in the API.
- Loopback traffic (container talking to itself, e.g., sidecar pattern) appears in lo stats, not on the external interface.
CORRELATES WITH:
- High tx_bytes + no corresponding rx_bytes = container is a producer/publisher
- rx_errors increasing + container timeouts = upstream network issue
- Sudden traffic drop to zero + container still running = network namespace issue or networking plugin failure
SIGNAL: Docker Bridge Network Connection Count
WHAT IT IS: The number of established network connections tracked by the Linux connection tracker (conntrack) for Docker’s network namespaces and iptables NAT rules.
SOURCE:
Host kernel: /proc/sys/net/netfilter/nf_conntrack_count (current) and /proc/sys/net/netfilter/nf_conntrack_max (limit).
iptables NAT table: iptables -t nat -L -n -v.
Per-container: nsenter -t <pid> -n ss -s or nsenter -t <pid> -n cat /proc/net/nf_conntrack.
HOW TO COLLECT IT MANUALLY:
# Host-level conntrack usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max) * 100" | bc
# Per-container connection count
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' <name>)
nsenter -t ${CONTAINER_PID} -n ss -s
# Docker's iptables rules count
iptables -t nat -L DOCKER -n | wc -l
WHAT IT TELLS YOU: Docker’s default bridge networking uses iptables NAT for every published port. Each connection through NAT creates a conntrack entry. When conntrack table fills up, new connections are silently dropped — there’s no error, no reject, just silence. This is one of the most infuriating failure modes in Docker networking because it’s invisible to the application.
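The silent-drop failure mode can be confirmed and mitigated from the host. A minimal sketch; the new limit is illustrative and should be sized to memory and workload:
# current utilization as a percentage
awk -v cur="$(cat /proc/sys/net/netfilter/nf_conntrack_count)" -v max="$(cat /proc/sys/net/netfilter/nf_conntrack_max)" 'BEGIN {printf "conntrack %.1f%% used (%d / %d)\n", cur/max*100, cur, max}'
# confirmation that drops are already happening
dmesg | grep -c "nf_conntrack: table full" || true
# temporary relief (persist via /etc/sysctl.d/ if the new value sticks)
sysctl -w net.netfilter.nf_conntrack_max=262144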
SEVERITY:
- PAGE — conntrack usage >80% of max. Silent connection drops are imminent or occurring.
- TICKET — conntrack usage >60% of max. Growth trend analysis needed.
THRESHOLDS:
- conntrack count / conntrack max: >80% is critical
- Default nf_conntrack_max is often 65536 on smaller hosts — far too low for busy Docker hosts
- Rate of conntrack entry creation: sudden spikes indicate connection storms or SYN floods
FAILURE MODES DETECTED:
- conntrack table exhaustion causing silent connection drops
- iptables rule proliferation from many published ports
- Connection storms from misconfigured health checks or retry loops
- SYN flood attacks consuming conntrack entries
NUANCES & GOTCHAS:
- conntrack entries persist for the conntrack timeout period (default: 432000 seconds = 5 days for established TCP). Short-lived connections accumulate entries that linger.
- Containers with --network host bypass Docker’s iptables NAT but still use the host’s conntrack.
- docker-proxy (the userland proxy) runs for every published port when userland-proxy: true (the default). Each proxy process consumes a file descriptor and a goroutine. With many published ports, this adds up.
- When docker-proxy is disabled ("userland-proxy": false), only iptables handles port forwarding. This is faster but has subtly different behavior for hairpin NAT.
CORRELATES WITH:
- High conntrack count + dropped packets on host interfaces = conntrack exhaustion
- High conntrack count + many published ports + many containers = architectural issue, consider overlay networking or host networking
- conntrack full + dmesg shows “nf_conntrack: table full, dropping packet” = confirmed exhaustion
SIGNAL: Docker DNS Resolution (Embedded DNS)
WHAT IT IS: The health and performance of Docker’s embedded DNS server (at 127.0.0.11 inside containers), which resolves container names to IP addresses for inter-container communication.
SOURCE:
Inside the container: /etc/resolv.conf shows the DNS configuration. The embedded DNS is the resolver at 127.0.0.11. DNS resolution failures appear in container logs or application error metrics.
HOW TO COLLECT IT MANUALLY:
# Check DNS config inside a container
docker exec <name> cat /etc/resolv.conf
# Test DNS resolution from within a container
docker exec <name> nslookup <other-container-name>
# Test with explicit timing
docker exec <name> sh -c "time nslookup <other-container-name>"
# Check if embedded DNS is responding
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' <name>)
nsenter -t ${CONTAINER_PID} -n dig @127.0.0.11 <other-container-name>
# Check for DNS-related errors in container logs
docker logs <name> 2>&1 | grep -i "dns\|resolve\|lookup\|NXDOMAIN"
WHAT IT TELLS YOU: Container-to-container name resolution is working correctly. DNS failures in Docker are a common source of mysterious connectivity issues — containers can’t reach each other by name, health checks fail, service discovery breaks.
SEVERITY:
- PAGE — DNS resolution failing for any container on a user-defined network. Inter-container communication is broken.
- TICKET — DNS resolution slow (>100ms) consistently.
THRESHOLDS:
- Resolution time: should be <10ms for container-to-container names, <100ms for external names
- Any NXDOMAIN for a container name that should exist is a failure
FAILURE MODES DETECTED:
- Embedded DNS server hung (rare but devastating)
- Container not attached to the correct network (can’t resolve names on other networks)
- Container name conflicts
- DNS response containing stale IP after container restart (brief window)
- Upstream DNS failures leaking through to containers
- /etc/resolv.conf misconfigured (search domains, ndots settings causing unnecessary lookups)
NUANCES & GOTCHAS:
- Docker’s embedded DNS only works on user-defined networks, NOT the default bridge. On the default bridge, containers must use --link (deprecated) or external DNS. A quick way to verify this is sketched after this list.
- The ndots setting in /etc/resolv.conf causes a common performance problem: if ndots:5 (the Kubernetes default), any name with fewer than 5 dots triggers search domain appending, causing 4-6x DNS queries for every resolution. Docker’s default is saner (ndots:0 or not set), but Kubernetes overrides this.
- Docker’s embedded DNS is a Go process embedded in dockerd. If dockerd is under memory pressure, DNS resolution can slow down.
- DNS responses are cached briefly. After a container restart, other containers may briefly resolve the old IP address.
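The user-defined-network caveat above is easy to demonstrate end to end. A minimal sketch using two throwaway containers; the network name, container name, and image are arbitrary:
# embedded DNS only exists on user-defined networks, so create one
docker network create monitoring-dns-test
docker run -d --rm --name dns-target --network monitoring-dns-test alpine sleep 300
# resolution by container name should succeed here...
docker run --rm --network monitoring-dns-test alpine nslookup dns-target
# ...and fail (or fall through to external DNS) on the default bridge
docker run --rm alpine nslookup dns-target
# cleanup
docker rm -f dns-target && docker network rm monitoring-dns-test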
CORRELATES WITH:
- DNS failures + containers on default bridge network = expected, default bridge doesn’t support embedded DNS
- DNS failures + dockerd under resource pressure = daemon degradation affecting DNS
- Intermittent DNS failures + container restarts = stale DNS cache
DOMAIN: Resource Utilization — PIDs and File Descriptors
SIGNAL: Container PID Count
WHAT IT IS: The number of processes (including threads appearing as lightweight processes) within a container’s PID namespace/cgroup.
SOURCE:
cgroup v1: /sys/fs/cgroup/pids/docker/<container-id>/pids.current and pids.max
cgroup v2: /sys/fs/cgroup/system.slice/docker-<container-id>.scope/pids.current and pids.max
Container’s /proc: docker exec <name> ps aux | wc -l
HOW TO COLLECT IT MANUALLY:
# Via cgroup v2
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <name>)
echo "Current: $(cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.current)"
echo "Max: $(cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max)"
# Via Docker top
docker top <name> | wc -l
# Check for zombies
docker top <name> -o pid,ppid,stat,comm | grep Z
WHAT IT TELLS YOU: Whether a container is spawning processes normally or has a process/thread leak. Containers running JVMs, Node.js, or Python with many threads/workers will have higher PID counts. A steadily increasing PID count that never decreases is a leak.
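If the workload’s normal process count is known, a per-container ceiling keeps a fork leak from reaching the host PID table. A minimal sketch; the limit value is illustrative:
# cap the container at 512 processes/threads; the 513th fork fails inside the container only
docker run -d --pids-limit 512 --name pid-capped <image>
# confirm the cgroup picked it up
CONTAINER_ID=$(docker inspect --format '{{.Id}}' pid-capped)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max
# existing containers can be adjusted without a restart
docker update --pids-limit 512 <name>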
SEVERITY:
- PAGE — PID count approaching the pids.max limit. Container will fail to fork new processes.
- TICKET — Zombie process count >10 in any container.
- PLAN — Steadily increasing PID count over hours/days.
THRESHOLDS:
- If pids.max is set: >80% utilization warrants investigation
- Zombie processes: >5 indicates PID 1 is not reaping children
- Rate of PID growth: any sustained upward trend is abnormal for a steady-state application
FAILURE MODES DETECTED:
- Fork bomb (rapid PID count increase)
- Process/thread leak (slow PID count increase)
- Zombie process accumulation (PID 1 not handling SIGCHLD)
- Thread pool exhaustion in runtime (JVM, Go)
NUANCES & GOTCHAS:
- Linux threads appear as lightweight processes in cgroups. A JVM with 200 threads will show ~200 PIDs. This is normal.
- If pids.max is “max” (unlimited), there’s no container-level limit — but the host-level PID limit (/proc/sys/kernel/pid_max, default 32768 or 4194304) still applies.
- Zombies (Z state) consume PID table entries but no CPU/memory. They’re only a problem if they accumulate enough to exhaust PID space.
- Using --init (Docker’s tini) as PID 1 solves zombie accumulation for most containers.
- Short-lived child processes in shell scripts create and destroy PIDs rapidly — the count fluctuates and averages can be misleading. Look at peak count.
CORRELATES WITH:
- High PID count + high CPU = legitimate compute work with many workers/threads
- High PID count + low CPU = thread/process leak (threads waiting/blocked)
- Zombie count increasing + single-process container = PID 1 doesn’t handle SIGCHLD
SIGNAL: Docker Daemon File Descriptor Usage
WHAT IT IS: The number of file descriptors held open by the dockerd process itself.
SOURCE:
procfs: /proc/<dockerd-pid>/fd/ — count of entries.
System-wide: /proc/sys/fs/file-nr — system-wide FD usage.
HOW TO COLLECT IT MANUALLY:
# dockerd's PID
DOCKERD_PID=$(pgrep -x dockerd)
# Current FD count
ls /proc/${DOCKERD_PID}/fd/ | wc -l
# FD limits for the process
cat /proc/${DOCKERD_PID}/limits | grep "Max open files"
# System-wide FD usage
cat /proc/sys/fs/file-nr
# Format: allocated free max
# What FDs are open (useful for debugging)
ls -la /proc/${DOCKERD_PID}/fd/ | head -50
WHAT IT TELLS YOU: dockerd holds file descriptors for every container’s log stream (json-file driver), every API connection, every event listener, and internal operations. Many running containers + active API consumers = high FD usage. If dockerd hits its FD limit, it cannot accept new connections, open new log files, or manage new containers.
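Turning the raw numbers into a utilization figure takes one comparison against the process limit. A minimal sketch:
DOCKERD_PID=$(pgrep -x dockerd)
OPEN=$(ls /proc/${DOCKERD_PID}/fd/ | wc -l)
# soft limit is the fourth column of the "Max open files" line; "unlimited" makes the ratio moot
LIMIT=$(awk '/Max open files/ {print $4}' /proc/${DOCKERD_PID}/limits)
echo "dockerd FDs: ${OPEN} open, soft limit ${LIMIT}"
[ "${LIMIT}" != "unlimited" ] && awk -v o="$OPEN" -v l="$LIMIT" 'BEGIN {printf "utilization %.1f%%\n", o/l*100}'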
SEVERITY:
- PAGE — dockerd FD usage >80% of its ulimit. Daemon functionality is at risk.
- TICKET — FD usage growing without corresponding container growth.
THRESHOLDS:
- FD count relative to the process’s Max open files limit: >80% is critical
- Growth rate: FDs should roughly correlate with container count. Disproportionate growth indicates a leak.
- Typical baseline: ~50-100 FDs per running container, plus base overhead (~200-500 for the daemon itself)
FAILURE MODES DETECTED:
- FD leak in dockerd (rare but has occurred in specific Docker versions)
- Too many concurrent API clients (monitoring tools, CI/CD pipelines, orchestrators all querying the daemon)
- Log driver FD accumulation
- Event stream consumers not disconnecting
NUANCES & GOTCHAS:
- The systemd unit for Docker may set LimitNOFILE=infinity or a specific value. Check both the systemd unit and the actual /proc/PID/limits.
- Each docker logs --follow invocation holds a FD. Monitoring tools that tail many containers’ logs via the API can consume significant FDs.
- containerd has its own separate FD pool with similar concerns.
CORRELATES WITH:
- High FD count + many docker logs --follow clients = monitoring/log collection contributing
- High FD count + many containers = normal scaling, but verify limits are adequate
- FD count growth without container growth = possible FD leak
DOMAIN: Throughput
SIGNAL: Container Create/Start/Stop Rate
WHAT IT IS: The rate at which containers are being created, started, stopped, and destroyed on this Docker host.
SOURCE:
Docker events stream: docker events. Each container lifecycle event is emitted with timestamp, container ID, and event type.
Docker Engine API: /events endpoint (streaming).
HOW TO COLLECT IT MANUALLY:
# Watch events in real-time
docker events --format '{{.Time}} {{.Type}} {{.Action}} {{.Actor.Attributes.name}}'
# Count events in the last hour by type
docker events --since 1h --until 0s --format '{{.Action}}' | sort | uniq -c | sort -rn
# Specifically container lifecycle events
docker events --filter type=container --since 1h --until 0s \
--format '{{.Action}}' | sort | uniq -c | sort -rn
WHAT IT TELLS YOU: The operational tempo of the Docker host. High create/destroy rates are normal for CI/CD runners and batch processing but abnormal for long-running application servers. Unexpected high rates indicate crash loops, orchestrator instability, or scaling storms.
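Crash loops show up in the event stream as repeated die/start pairs for the same name. A minimal sketch that counts die events per container over the last hour:
# "die" events per container name in the last hour; repeat offenders are crash-looping
docker events --filter type=container --filter event=die --since 1h --until 0s \
  --format '{{.Actor.Attributes.name}}' | sort | uniq -c | sort -rn | head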
SEVERITY:
- TICKET — Container lifecycle event rate >10x normal baseline. Indicates thrashing, crash loops, or scaling storms.
- INFO — Normal lifecycle event rates consistent with workload type.
THRESHOLDS:
- Workload-dependent. CI/CD hosts may create/destroy hundreds of containers per hour normally.
- Application hosts should have near-zero create/destroy rates during steady state.
- Any sustained create/stop cycle for the same container name = crash loop.
FAILURE MODES DETECTED:
- Container crash loops
- Orchestrator thrashing (scheduling/descheduling rapidly)
- Scaling storms (autoscaler overreacting)
- Deployment rollout issues
NUANCES & GOTCHAS:
- docker events is a streaming API. Historical events are available with --since but only from daemon memory — events from before the daemon’s last restart are lost.
- High container churn causes overlay2 disk I/O from creating/destroying writable layers. This is a side-effect that amplifies disk I/O signals.
- Each container create/destroy involves multiple iptables rule modifications if the container has published ports. High churn + many ports = iptables lock contention.
CORRELATES WITH:
- High create/stop rate + restart count increasing = crash loops
- High create/stop rate + high disk I/O = overlay2 churn
- High create/stop rate + iptables lock contention = networking bottleneck from lifecycle events
SIGNAL: Image Pull Rate and Latency
WHAT IT IS: The frequency and duration of Docker image pull operations from registries.
SOURCE:
Docker events: docker events --filter event=pull.
dockerd logs: journal or syslog entries for pull operations.
Registry responses: HTTP response codes and transfer times.
HOW TO COLLECT IT MANUALLY:
# Watch pull events
docker events --filter event=pull --format '{{.Time}} {{.Actor.ID}}'
# Time a pull operation
time docker pull <image>:<tag>
# Check dockerd logs for pull-related messages
journalctl -u docker.service | grep -i "pull\|download\|layer" | tail -20
WHAT IT TELLS YOU: Whether image pulls are succeeding and how fast. Slow or failing pulls delay container starts, rollouts, and scaling. Pull failures cascade — if a node can’t pull an image, all containers requiring that image can’t start.
SEVERITY:
- PAGE — Pull failures for images needed by critical containers. Prevents scaling and recovery.
- TICKET — Pull latency >5 minutes for regularly-used images. Indicates registry or network issues.
- PLAN — Increasing pull latency trend.
THRESHOLDS:
- Pull failures: any failure for a production image is actionable
- Pull duration: highly dependent on image size and network. Baseline your environment. >3x baseline is anomalous.
- Registry rate limiting: Docker Hub limits anonymous pulls to 100/6h, authenticated to 200/6h. Check X-RateLimit-Remaining headers.
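The remaining Docker Hub quota can be read without consuming a pull by issuing a HEAD request for a manifest with an anonymous token. A minimal sketch against the ratelimitpreview/test image Docker Hub publishes for this purpose; exact header names vary slightly across proxies, hence the loose grep:
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')
curl -sI -H "Authorization: Bearer ${TOKEN}" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit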
FAILURE MODES DETECTED:
- Registry unavailability
- Registry rate limiting (Docker Hub especially)
- Network connectivity to registry
- Image not found (tag deleted or misconfigured)
- TLS certificate issues with private registries
- Authentication token expiry for private registries
NUANCES & GOTCHAS:
- Docker caches image layers locally. If most layers are cached, a “pull” only downloads the changed layers. Monitoring pull duration without knowing cache state is misleading.
- docker pull downloads only a few layers concurrently (3 by default, tunable via max-concurrent-downloads). Large images with many layers can be slow even on fast networks.
- Private registries may have their own rate limits, storage quotas, and authentication mechanisms.
- Manifest list pulls (multi-architecture images) involve additional round trips.
CORRELATES WITH:
- Pull failures + many nodes pulling simultaneously = registry rate limiting or bandwidth saturation
- Pull failures + network errors elsewhere = upstream network issue
- Slow pulls + high disk I/O = storage can’t keep up with writing layers
DOMAIN: Internal State
SIGNAL: Docker Daemon Goroutine Count
WHAT IT IS: The number of active goroutines in the dockerd Go process. Goroutines are lightweight threads in Go; each concurrent operation spawns goroutines.
SOURCE:
Docker debug API: /debug/pprof/goroutine (available when debug mode is enabled).
procfs: /proc/<dockerd-pid>/status → Threads field (approximate — Go runtime multiplexes goroutines across OS threads, but thread count correlates).
HOW TO COLLECT IT MANUALLY:
# If Docker daemon is in debug mode (check daemon.json for "debug": true)
curl -s --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=1 | head -5
# Thread count (correlated but not identical to goroutine count)
DOCKERD_PID=$(pgrep -x dockerd)
grep Threads /proc/${DOCKERD_PID}/status
# Full goroutine dump (for debugging — large output)
curl -s --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=2 > /tmp/goroutines.txt
wc -l /tmp/goroutines.txt
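To catch the slow upward trend that signals a leak, a minimal sampler can log the total from the debug=1 profile, whose first line reads "goroutine profile: total N". A sketch, assuming debug mode stays enabled:
# Sample the goroutine total once a minute; tail or plot the log to spot a trend
while true; do
  TOTAL=$(curl -s --unix-socket /var/run/docker.sock "http://localhost/debug/pprof/goroutine?debug=1" | head -1 | awk '{print $NF}')
  echo "$(date -Is) ${TOTAL}" >> /tmp/dockerd-goroutines.log
  sleep 60
done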
WHAT IT TELLS YOU: Goroutine count reflects the internal concurrency of the daemon. It grows with container count, API client count, and event subscriptions. A goroutine count that grows without corresponding container/client growth is a goroutine leak — a precursor to daemon memory growth and eventual instability.
SEVERITY:
- TICKET — Goroutine count growing without container count growth. Possible goroutine leak.
- PLAN — Goroutine count consistently >10,000 on a host with <100 containers.
- INFO — Normal goroutine count proportional to workload.
THRESHOLDS:
- Rough baseline: 100-200 base goroutines + ~50-100 per running container + ~10-50 per active API client
- >10,000 goroutines on a small deployment warrants investigation
- Sustained upward trend without corresponding workload increase = goroutine leak
FAILURE MODES DETECTED:
- Goroutine leak (specific Docker versions have had these)
- Excessive API client connections (monitoring tools, scripts in infinite loops)
- Event stream consumers accumulating goroutines
- Container lifecycle operations stuck (each stuck operation holds goroutines)
NUANCES & GOTCHAS:
- Debug mode must be enabled to access pprof endpoints. In production, many deployments don’t enable this, losing valuable introspection.
- The /debug/pprof/goroutine endpoint with debug=2 produces a full goroutine stack dump — invaluable for diagnosing daemon hangs, but the output can be megabytes.
- Go’s garbage collector is affected by goroutine count — more goroutines = more GC work = more CPU overhead.
CORRELATES WITH:
- High goroutine count + high daemon CPU = GC pressure from goroutine proliferation
- High goroutine count + daemon slow to respond = too many concurrent operations
- Goroutine count stable at high value = stuck operations holding goroutines (check for containers stuck in “stopping”)
SIGNAL: Docker Daemon Memory Usage
WHAT IT IS: The resident memory (RSS) consumption of the dockerd process itself, separate from container memory.
SOURCE:
procfs: /proc/<dockerd-pid>/status → VmRSS.
Also: /proc/<dockerd-pid>/statm
HOW TO COLLECT IT MANUALLY:
DOCKERD_PID=$(pgrep -x dockerd)
# RSS in kB
grep VmRSS /proc/${DOCKERD_PID}/status
# More detail
grep -E "VmPeak|VmSize|VmRSS|VmData|VmStk|VmSwap" /proc/${DOCKERD_PID}/status
# Quick view
ps -p ${DOCKERD_PID} -o pid,rss,vsz,%mem,etime
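For trend detection (the leak signature is growth that never plateaus), a minimal sampler can log RSS periodically — a sketch:
# Log dockerd RSS every 5 minutes; review for monotonic growth
while true; do
  echo "$(date -Is) $(grep VmRSS /proc/$(pgrep -x dockerd)/status)" >> /tmp/dockerd-rss.log
  sleep 300
done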
WHAT IT TELLS YOU: The Docker daemon’s own memory consumption. dockerd should use a relatively stable amount of memory proportional to the number of containers, images, and networks it manages. Growing daemon memory without corresponding growth in managed objects indicates a memory leak.
SEVERITY:
- TICKET — dockerd RSS >2GB on a host with <50 containers. Or any sustained growth trend.
- PLAN — dockerd RSS gradually increasing over weeks.
- INFO — Stable memory usage proportional to managed objects.
THRESHOLDS:
- Baseline: ~200-500MB for a typical deployment with <50 containers
- >1GB: investigate, especially if container count is low
- >2GB: likely a leak or an excessive number of cached objects
- Any monotonic growth that doesn’t plateau
FAILURE MODES DETECTED:
- Memory leak in dockerd (has occurred in multiple Docker versions)
- Excessive event history accumulation
- Large number of images/layers in local cache (metadata is held in memory)
- Goroutine leak (each goroutine’s stack consumes memory)
NUANCES & GOTCHAS:
- dockerd’s Go runtime requests memory from the OS in chunks and doesn’t always return it promptly. A temporary spike may not decrease immediately. Sustained high RSS is more meaningful than transient spikes.
- Docker stores container metadata, image metadata, and network state in memory. Hosts with thousands of images will have higher baseline memory usage.
- docker system prune reduces not just disk usage but also daemon memory (by reducing cached metadata).
CORRELATES WITH:
- Daemon memory growth + goroutine count growth = goroutine leak
- Daemon memory growth + image count growth = need image cleanup
- Daemon memory high + daemon slow to respond = GC pressure or approaching host memory limits
DOMAIN: Errors
SIGNAL: Container Exit Codes
WHAT IT IS: The exit code of the main process (PID 1) in a container when it stops. Exit codes encode the reason for termination.
SOURCE:
Docker Engine API: /containers/<id>/json → State.ExitCode.
Docker CLI: docker inspect --format '{{.State.ExitCode}}' <container>.
HOW TO COLLECT IT MANUALLY:
# Exit codes for all stopped containers
docker ps -a --filter "status=exited" --format "{{.Names}}\t{{.Status}}" | head -20
# Specific exit code
docker inspect --format '{{.Name}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}}' $(docker ps -aq)
# Group by exit code
docker inspect --format '{{.State.ExitCode}}' $(docker ps -aq --filter "status=exited") | sort | uniq -c | sort -rn
WHAT IT TELLS YOU: Why containers are stopping. The exit code is the first diagnostic signal for any container failure.
Key exit codes:
- 0: Clean exit (normal)
- 1: Generic application error
- 2: Shell built-in misuse (Bash-specific)
- 126: Command not executable (permission issue)
- 127: Command not found (bad entrypoint/CMD, missing binary)
- 128+N: Killed by signal N (128+9=137 SIGKILL, 128+15=143 SIGTERM, 128+11=139 SIGSEGV)
- 137: SIGKILL — OOM kill or docker kill
- 139: SIGSEGV — Segmentation fault (application crash, native code bug)
- 143: SIGTERM — Graceful shutdown request (normal during docker stop)
- 255: Exit status out of range (often a scripting error)
SEVERITY:
- PAGE — Exit code 137 (OOM kill) or 139 (segfault) for production containers.
- TICKET — Exit code 1 for containers that should be stable. Exit codes 126/127 (misconfiguration).
- INFO — Exit code 0 (normal completion) or 143 (graceful stop).
THRESHOLDS:
- Any nonzero unexpected exit code for a long-running container is actionable.
- Rate of abnormal exits: more than 1 per hour for any single container warrants investigation.
FAILURE MODES DETECTED:
- Application crashes (exit code 1, segfault 139)
- OOM kills (exit code 137 + OOMKilled=true)
- Configuration errors (126, 127)
- Graceful shutdowns taking too long and being SIGKILL’d (exit 137 after docker stop instead of the expected 143)
- Native library crashes (139)
NUANCES & GOTCHAS:
- Exit code 137 doesn’t always mean OOM kill. docker kill also sends SIGKILL, resulting in exit code 137. Check State.OOMKilled to distinguish.
- Exit code 143 from docker stop is normal — Docker sends SIGTERM, waits the stop timeout (default 10s), then sends SIGKILL. If the container exits with 143, it handled SIGTERM. If it exits with 137 after a stop, it didn’t handle SIGTERM in time.
- Some applications don’t set meaningful exit codes. A process crashing with exit code 1 could be anything from a null pointer to a config error.
- In multi-process containers, the exit code reflects only PID 1. Child process crashes may not change PID 1’s exit code.
CORRELATES WITH:
- Exit code 137 + OOMKilled=true + memory at limit = OOM kill, increase memory limit or fix leak
- Exit code 137 + OOMKilled=false = killed externally (operator or orchestrator)
- Exit code 127 + recent image update = entrypoint binary changed or moved in new image
- Exit code 1 + restart count increasing = application crash loop
SIGNAL: Docker Daemon Error Logs
WHAT IT IS: Error-level log entries from the dockerd process, indicating operational problems.
SOURCE:
systemd journal: journalctl -u docker.service
Syslog: /var/log/syslog or /var/log/messages (depending on distribution)
Docker’s log location is determined by the logging setup — most modern distros use journald.
HOW TO COLLECT IT MANUALLY:
# Recent errors from Docker daemon
journalctl -u docker.service --priority=err --since "1 hour ago"
# All Docker daemon logs (including warnings)
journalctl -u docker.service --since "1 hour ago" --no-pager | tail -100
# containerd errors
journalctl -u containerd.service --priority=err --since "1 hour ago"
# Kernel-level messages about Docker/containers
dmesg | grep -iE "docker|container|overlay|cgroup|oom" | tail -20
WHAT IT TELLS YOU: Daemon-level errors indicate systemic problems: storage driver issues, network configuration failures, containerd communication problems, image layer corruption, or internal errors.
SEVERITY:
- PAGE — Errors indicating daemon instability: “panic”, “fatal”, storage driver errors, containerd communication failures.
- TICKET — Warnings about degraded functionality: “failed to cleanup”, “context deadline exceeded”, specific container operation failures.
- INFO — Transient errors during normal operations (e.g., container not found during rapid create/destroy).
THRESHOLDS:
- Error rate: any sustained increase in daemon error rate requires investigation
- Specific patterns: “panic” or “fatal” = immediate investigation
- Storage driver errors: any = immediate investigation
FAILURE MODES DETECTED:
- Storage driver corruption
- containerd communication failure
- Network plugin errors
- Resource exhaustion (FDs, memory, disk)
- Internal bugs (panics, data races)
- Image layer corruption
- Volume driver failures
NUANCES & GOTCHAS:
- Docker logs can be extremely verbose in debug mode. Filter by priority/level.
- Some errors are transient and expected during rapid container lifecycle operations (races between delete and inspect). Sustained errors of the same type are the signal.
- “context deadline exceeded” errors often indicate the daemon is under load or the storage driver is slow, not necessarily a bug.
- containerd errors may appear in a separate journal unit (containerd.service). Check both.
CORRELATES WITH:
- Daemon errors + daemon slow to respond = systemic daemon issue
- Storage driver errors + high disk I/O = disk performance affecting Docker
- Network errors + container connectivity issues = networking subsystem failure
DOMAIN: Saturation
SIGNAL: Docker Daemon API Request Queue Depth
WHAT IT IS: The number of concurrent API requests being processed by the Docker daemon. The daemon uses goroutines for request handling, and some operations take internal locks that serialize access.
SOURCE:
Not directly exposed as a metric. Inferred from daemon responsiveness (latency of /_ping, docker ps, and docker info). Debug pprof goroutine dumps show blocked goroutines.
HOW TO COLLECT IT MANUALLY:
# Measure response time of docker info (heavier than _ping, touches more subsystems)
time docker info > /dev/null 2>&1
# Measure docker ps response time
time docker ps > /dev/null 2>&1
# If debug mode is on, check for blocked goroutines
curl -s --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=1 | \
grep -c "semacquire\|lock\|waiting"
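The API can also be timed directly over the Unix socket, which removes docker CLI startup overhead from the measurement — a sketch using curl’s timing output:
# Raw /_ping latency in seconds (bypasses the docker CLI)
curl -s -o /dev/null -w '_ping: %{time_total}s\n' --unix-socket /var/run/docker.sock http://localhost/_ping
# Same for the heavier container-list endpoint
curl -s -o /dev/null -w 'containers/json: %{time_total}s\n' --unix-socket /var/run/docker.sock http://localhost/containers/json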
WHAT IT TELLS YOU: When the daemon is saturated, all Docker operations slow down — container starts, stops, inspections, log reads, everything. This is the choke point of the entire Docker runtime on a host.
SEVERITY:
- PAGE — docker ps takes >5 seconds. Daemon is saturated or deadlocked.
- TICKET — docker ps takes >1 second consistently.
THRESHOLDS:
- docker info should complete in <1 second. >3 seconds indicates daemon stress.
- docker ps should complete in <500ms for <100 containers. Longer indicates load or lock contention.
FAILURE MODES DETECTED:
- Daemon lock contention (multiple concurrent lifecycle operations)
- Storage driver saturation (every container operation touches the storage driver)
- Too many concurrent API clients (monitoring, CI/CD, scripts)
- Internal deadlock (daemon bug — rare but catastrophic)
NUANCES & GOTCHAS:
- Some operations take an exclusive daemon lock (container create, network create). Others are read-only (inspect, ps). A heavy write workload blocks readers.
- docker ps with --size is dramatically slower than without (triggers a per-container filesystem walk).
- Monitoring tools that query the Docker API can themselves contribute to daemon saturation if polling too frequently.
CORRELATES WITH:
- Slow daemon + high container lifecycle rate = lifecycle operations saturating the daemon
- Slow daemon + high disk I/O = storage driver bottleneck
- Slow daemon + high goroutine count = concurrent operations overwhelming the daemon
SECTION 2 — Composite Failure Patterns
PATTERN: Storage Exhaustion Cascade
SIGNALS INVOLVED:
- Docker data directory disk usage at >90%
- Container log file sizes growing
- docker system df showing high reclaimable space
- New container starts failing with “no space left on device”
- Existing containers logging write errors
- Daemon errors about overlay2 operations
NARRATIVE:
The filesystem containing /var/lib/docker/ fills up. This triggers a cascade: containers can’t write to their writable layers or logs, new containers can’t be created (no space for writable layer), the daemon can’t update its internal database, image pulls fail, builds fail. If the daemon was in the middle of a container operation when space ran out, metadata can become inconsistent, leaving containers in “dead” state that can’t be removed without manual intervention.
SEVERITY: PAGE — Affects all containers on the host. Recovery requires manual cleanup.
DISTINGUISHING FEATURES:
- “no space left on device” errors in daemon logs
- docker ps may still work (reads from memory), but docker create, docker pull, and docker run fail
- Running containers may continue if they’re only reading or writing to volumes on different filesystems
COMMON CAUSES:
- No container log rotation configured (json-file driver with no max-size)
- Accumulation of unused images over months
- Build cache growth on CI/CD hosts
- Orphaned volumes from deleted containers
- Containers writing large files to writable layers instead of volumes
FIRST RESPONSE:
- Check what’s consuming space: du -sh /var/lib/docker/*/ and docker system df
- If log files dominate: truncate the largest log files (truncate -s 0 <logfile>) — this is safe even while the container is running (the log file is opened with O_APPEND); a sketch for finding the largest logs follows this list
- If reclaimable images dominate: docker image prune -a --filter "until=48h" (removes images not used in 48 hours — safer than a blanket prune)
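To find which logs to truncate first, a quick scan helps — a sketch assuming the default json-file driver and default data-root:
# Largest container log files under /var/lib/docker/containers/
find /var/lib/docker/containers/ -name "*-json.log" -printf '%s %p\n' 2>/dev/null | sort -rn | head -10
# Per-container directory view
du -sh /var/lib/docker/containers/* 2>/dev/null | sort -rh | head -10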
PATTERN: OOM Kill Crash Loop
SIGNALS INVOLVED:
- Container restart count increasing rapidly
- Exit code 137 on stopped containers
- State.OOMKilled: true in container inspect
- cgroup memory usage at limit
- dmesg showing OOM kill messages with the cgroup path
- Container memory usage shows a sawtooth pattern (grow → kill → restart → grow)
NARRATIVE: A container hits its memory limit, the kernel OOM-kills it, Docker’s restart policy restarts it, it loads its application, memory grows back to the limit, and the cycle repeats. Each restart may cause data loss, connection drops, and recovery overhead. The restart policy makes this look like the container is “flapping” rather than fundamentally broken.
SEVERITY: PAGE — User-facing impact on every cycle. Data loss risk on every kill.
DISTINGUISHING FEATURES:
- Exit code is specifically 137 (not 143)
- OOMKilled field is true (distinguishes from external SIGKILL)
- Memory usage reaches exactly the cgroup limit before each restart
- dmesg shows “Memory cgroup out of memory: Killed process” with the container’s cgroup path
COMMON CAUSES:
- Memory limit too low for the application’s actual needs
- Memory leak in the application (limit is correct, application is broken)
- JVM heap (-Xmx) set too close to container memory limit (no room for native memory, metaspace, thread stacks)
- Application loading large datasets into memory
- Connection/session accumulation without bounds
FIRST RESPONSE:
- Check docker inspect for OOMKilled and memory limit configuration (a cgroup-level confirmation is sketched after this list)
- Check dmesg | grep -i oom for kernel OOM messages with details
- Temporarily increase memory limit by 2x to stop the crash loop and allow diagnosis
- Examine application memory profiling (heap dumps for JVM, memory profiler for Python/Go/Node)
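The kernel’s own counters confirm the loop independently of Docker. A minimal sketch, assuming cgroup v2 with the systemd cgroup driver (paths differ under cgroup v1):
# oom_kill in memory.events increments on every kernel OOM kill in this cgroup
CID=$(docker inspect --format '{{.Id}}' <container>)
CG=/sys/fs/cgroup/system.slice/docker-${CID}.scope
cat ${CG}/memory.events   # look at the oom_kill counter
cat ${CG}/memory.max      # configured limit in bytes, or "max" if unlimited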
PATTERN: Daemon Hang (Silent Deadlock)
SIGNALS INVOLVED:
- docker ps blocks indefinitely (doesn’t time out, just hangs)
- /_ping may or may not respond (depends on which internal lock is held)
- dockerd process is alive and consuming some CPU
- Running containers continue operating normally
- Orchestrator marks the node as unhealthy (can’t communicate with daemon)
- Goroutine dump (if accessible) shows goroutines blocked on mutexes
NARRATIVE: The Docker daemon enters an internal deadlock or gets stuck waiting for a resource that never becomes available. The process doesn’t crash — it just stops making progress. This is the most operationally painful failure mode because the daemon appears alive to process monitors, running containers are fine, but no management operations work. Orchestrators can’t drain the node, operators can’t inspect containers, and automated recovery systems are confused.
SEVERITY: PAGE — Complete loss of container management capability with no self-recovery.
DISTINGUISHING FEATURES:
- Process is alive (not a crash)
- Some API endpoints may work while others hang (depends on which lock is held)
- Containers continue running (they’re supervised by containerd shims, not dockerd)
- docker info and docker ps hang, but ctr -n moby containers list may work (bypasses dockerd)
COMMON CAUSES:
- Storage driver operations blocking indefinitely (NFS mounts, iSCSI timeouts, overlay2 bugs)
- containerd shim process stuck during container stop/remove
- Volume driver plugin not responding
- Network plugin not responding
- Internal race conditions in specific Docker versions
FIRST RESPONSE:
- Confirm containers are still running: ctr -n moby containers list or check container processes directly via ps aux
- If live-restore is enabled: restart dockerd (systemctl restart docker). Containers survive. (The sketch after this list shows how to check live-restore without the API.)
- If live-restore is NOT enabled: a restart will kill all containers. Weigh the impact.
- Before restart: capture a goroutine dump if debug is enabled (curl --max-time 5 --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=2 > /tmp/goroutine-dump.txt). This is invaluable for root cause analysis.
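Whether live-restore is enabled can be confirmed without touching the hung API by reading the daemon configuration — a sketch assuming the default config path:
# live-restore in daemon.json (absent means disabled by default)
grep -i "live-restore" /etc/docker/daemon.json 2>/dev/null || echo "live-restore not set in daemon.json (disabled by default)"
# it can also be passed as a dockerd command-line flag
ps -o args= -p $(pgrep -x dockerd) | tr ' ' '\n' | grep -i "live-restore"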
PATTERN: Network Namespace Leak
SIGNALS INVOLVED:
- ip netns list or /var/run/docker/netns/ shows a growing number of network namespaces
- Host-level network interface count increasing (ip link show | wc -l)
- veth pair accumulation (orphan veth interfaces)
- iptables rule count growing
- Eventual failure to create new containers with network errors
NARRATIVE: When containers are destroyed, their network namespaces, veth pairs, and iptables rules should be cleaned up. When cleanup fails (daemon crash during removal, networking plugin bug, race conditions), these network resources leak. Each leaked namespace and veth pair consumes kernel memory and file descriptors. Eventually, the host runs out of network interfaces, the iptables rules become unmanageably large, and new containers can’t get network connectivity.
SEVERITY: TICKET — Slow degradation, but eventual hard failure if not addressed.
DISTINGUISHING FEATURES:
- Network interface count (ip link show | wc -l) grows over time without container count growing
- Orphan veth interfaces with no corresponding container
- /var/run/docker/netns/ contains namespaces for containers that no longer exist
COMMON CAUSES:
- Docker daemon crash during container removal
- Forced container removal (docker rm -f) race conditions
- Docker networking plugin bugs
- Host reboots without clean container shutdown
FIRST RESPONSE:
- Compare network namespaces with running containers: ls /var/run/docker/netns/ | wc -l vs docker ps -q | wc -l (a comparison sketch follows this list)
- Identify orphan veth interfaces: ip link show type veth and cross-reference with container network setup
- Clean up: docker network prune removes unused networks. For deeper cleanup, a daemon restart with live-restore: true allows cleanup without affecting running containers.
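A quick way to enumerate the orphans is to diff the on-disk namespaces against the sandbox keys of running containers — a sketch:
# Namespaces on disk with no corresponding running container (candidate leaked netns)
docker ps -q | xargs -r -I{} docker inspect --format '{{.NetworkSettings.SandboxKey}}' {} | xargs -r -n1 basename | sort -u > /tmp/netns-in-use
ls /var/run/docker/netns/ | sort -u > /tmp/netns-on-disk
comm -13 /tmp/netns-in-use /tmp/netns-on-disk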
PATTERN: CPU Throttling Latency Amplification
SIGNALS INVOLVED:
- Container CPU throttling percentage elevated (>10% of periods throttled)
- Application latency (p95/p99) increasing
- Container CPU usage percentage appears moderate (not near 100%)
- Application error rates may increase (timeouts)
- No corresponding traffic increase
NARRATIVE: A container with CPU limits is being throttled by the CFS bandwidth controller. The application’s processes are periodically paused for the remainder of the CFS period (100ms by default). This causes latency spikes that are invisible in average metrics but devastating to tail latency. The operator sees latency degradation, checks CPU usage, sees it at 60%, and concludes CPU isn’t the problem — but it is. The throttling, not the utilization, is the cause.
SEVERITY: TICKET — User-visible latency impact but no data loss.
DISTINGUISHING FEATURES:
- CPU usage percentage is moderate (30-70%), NOT near 100%
- cpu.stat shows nr_throttled > 0 and increasing
- Latency degradation is bursty, not steady (aligns with CFS period boundaries)
- Removing CPU limits eliminates the latency issue
COMMON CAUSES:
- CPU limits set based on average usage without accounting for burst needs
- Multi-threaded applications (JVM, Go) where all threads share the CFS quota
- GC pauses consuming the entire CFS period’s budget in a burst
- CPU limits copy-pasted from development configs to production
FIRST RESPONSE:
- Check nr_throttled and throttled_usec in cpu.stat for the container’s cgroup (see the sketch after this list)
- Calculate throttle percentage: nr_throttled / nr_periods * 100
- Increase the CPU limit by 2x and observe the latency impact
- Consider using cpuset (pinning to cores) for latency-sensitive workloads instead of CFS limits
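A minimal computation of the throttle percentage from cpu.stat, assuming cgroup v2 with the systemd cgroup driver:
CID=$(docker inspect --format '{{.Id}}' <container>)
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2} END {if (p > 0) printf "throttled in %.1f%% of CFS periods\n", t * 100 / p}' \
  /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.stat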
PATTERN: Image Layer Corruption
SIGNALS INVOLVED:
- Container start failures with “layer not found” or “invalid argument” errors
- docker pull failing with checksum or layer errors
- Docker daemon logs showing overlay2 errors
- docker system df may show inconsistent numbers
- Multiple containers failing to start simultaneously (if they share layers)
NARRATIVE: The overlay2 storage driver’s metadata or layer data has become inconsistent. This can happen after an unclean shutdown (power loss, kernel panic), disk errors, or rarely from storage driver bugs. The daemon tries to assemble the container’s filesystem from layers but finds a layer is missing or corrupted. No container using the affected image can start.
SEVERITY: PAGE — Prevents container starts. May affect many containers sharing the corrupted layer.
DISTINGUISHING FEATURES:
- Error messages reference specific layer digests or diffs
- Multiple containers using the same base image all fail simultaneously
- docker inspect on the image shows the layer, but the filesystem data is missing or corrupt
- Re-pulling the image fixes the issue (the layer is re-downloaded)
COMMON CAUSES:
- Unclean host shutdown (power loss, kernel panic) while Docker was writing layers
- Disk errors (bad sectors, full disk during write)
- Docker daemon crash during image pull or container creation
- Manual intervention in the /var/lib/docker/overlay2/ directory
- Filesystem corruption (particularly on filesystems with journaling disabled)
FIRST RESPONSE:
- Identify the corrupted layer from error messages
- Remove the affected image: docker rmi <image> (may fail if containers reference it)
- Re-pull the image: docker pull <image>
- If removal fails: docker system prune with careful scope, or restart the daemon
- Check filesystem health: dmesg | grep -i "error\|corrupt\|ext4"
SECTION 3 — Capacity & Saturation Leading Indicators
RESOURCE: Disk Space for Docker Data Directory
LEADING INDICATORS:
- Reclaimable space growing (dangling images, stopped containers, unused volumes) — docker system df RECLAIMABLE column
- Container log file growth rate (predict disk fill time)
- Image pull frequency without corresponding image cleanup
- Build cache size (on CI/CD hosts)
DEGRADATION CURVE: Cliff-edge. Docker operates normally until the filesystem is full, then fails catastrophically. There is no graceful degradation — writes fail, daemon operations fail, containers error.
RUNWAY ESTIMATION:
# Calculate log growth rate (bytes per hour)
# Take two measurements of /var/lib/docker/containers/ size, 1 hour apart, subtract
# Free space / growth rate = hours until full
# Quick estimate
df /var/lib/docker/ | awk 'NR==2 {print "Available: "$4" KB"}'
docker system df | grep "Local Volumes\|Build Cache\|Images"
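A literal version of that arithmetic, as a rough sketch: two samples of the containers directory an hour apart, divided into free space (shorten the interval for a faster, noisier estimate):
A=$(du -s /var/lib/docker/containers 2>/dev/null | awk '{print $1}')   # KB
sleep 3600
B=$(du -s /var/lib/docker/containers 2>/dev/null | awk '{print $1}')
FREE=$(df --output=avail /var/lib/docker | tail -1)                    # KB
RATE=$((B - A))                                                        # KB per hour
[ "${RATE}" -gt 0 ] && echo "~$((FREE / RATE)) hours until full at the current growth rate"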
HEADROOM DEFINITION: Minimum 20% free space on the Docker data directory filesystem. This accounts for burst container creation, image pulls, and transient log growth. Less than 20% free should trigger automated cleanup.
RESOURCE: Host Memory (for containers without limits)
LEADING INDICATORS:
- Total container memory usage approaching host physical memory
- Swap usage beginning (if swap is enabled)
- Page cache declining (kernel reclaiming cache for application memory)
- Memory pressure indicators in /proc/pressure/memory
DEGRADATION CURVE: Graceful degradation followed by cliff-edge. As memory pressure increases, the kernel reclaims page cache (degrading I/O performance), then increases swapping (if enabled, degrading everything), then invokes the OOM killer (cliff-edge — processes are killed with no warning).
RUNWAY ESTIMATION:
# Current memory pressure
cat /proc/pressure/memory
# Available memory (includes reclaimable cache)
grep -E "MemTotal|MemAvailable|SwapTotal|SwapFree" /proc/meminfo
# Total container memory usage
docker stats --no-stream --format '{{.MemUsage}}' | awk -F'/' '{print $1}'
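Summing per-container usage is easier from the cgroup files than from docker stats output (which mixes units) — a sketch assuming cgroup v2 with the systemd cgroup driver:
# Total memory charged to Docker containers, in GiB
cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.current 2>/dev/null | \
  awk '{s += $1} END {printf "%.1f GiB across all containers\n", s / 1024 / 1024 / 1024}'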
HEADROOM DEFINITION:
MemAvailable (not MemFree) should be >20% of MemTotal. MemFree is misleadingly low because the kernel intentionally uses free memory for page cache. MemAvailable accounts for reclaimable cache.
RESOURCE: conntrack Table
LEADING INDICATORS:
- conntrack count / conntrack max ratio increasing
- New connection establishment rate increasing
- Number of published ports across all containers
- Container network error/drop counters increasing
DEGRADATION CURVE: Cliff-edge. When the conntrack table fills, new connections are silently dropped. Existing connections continue working. There is no graceful degradation — the transition from “working” to “dropping” is instant and silent (no log message by default, only a kernel counter).
RUNWAY ESTIMATION:
# Current utilization
echo "$(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc -l
# New connection rate (take two measurements, subtract)
cat /proc/sys/net/netfilter/nf_conntrack_count
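A rough new-entry rate, plus a check for the silent-drop symptom — a sketch (the count is net of expirations, so treat it as a lower bound):
# Net conntrack growth over 10 seconds
A=$(cat /proc/sys/net/netfilter/nf_conntrack_count); sleep 10; B=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
echo "net change: $(( (B - A) / 10 )) entries/sec"
# The only default symptom of a full table is this kernel message
dmesg | grep -i "nf_conntrack: table full" | tail -5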
HEADROOM DEFINITION:
conntrack utilization should stay below 70%. If consistently above, increase nf_conntrack_max (sysctl net.netfilter.nf_conntrack_max). Default is often 65536 — busy Docker hosts need 262144 or higher.
RESOURCE: PID Space
LEADING INDICATORS:
- Total container PID count approaching host pid_max
- Individual containers’ PID counts growing
- Zombie process accumulation across containers
- Fork failures in container logs
DEGRADATION CURVE: Cliff-edge. When PIDs are exhausted, no process on the host can fork — not just containers, but system services, SSH, everything. The host becomes unmanageable.
RUNWAY ESTIMATION:
# Total PIDs on host
ls /proc/ | grep -c '^[0-9]'
# PID max
cat /proc/sys/kernel/pid_max
# Container PIDs specifically
for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
echo "$(basename $cg): $(cat $cg/pids.current 2>/dev/null || echo N/A)"
done
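Zombie accumulation (a leading indicator above) can be spotted host-wide with ps — a quick sketch that also shows which parent PIDs are accumulating them:
# Host-wide zombie count, and the parents responsible
ps -eo stat | grep -c '^Z'
ps -eo ppid,stat | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head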
HEADROOM DEFINITION:
Total system PID usage should stay below 80% of pid_max. Every container should have a pids.max limit set in production. Default pid_max is 32768 on many systems — should be increased to 4194304 on container hosts.
RESOURCE: Docker Daemon File Descriptors
LEADING INDICATORS:
- dockerd FD count growth rate
- Number of docker logs --follow consumers
- Number of active API connections
- Container count increase
DEGRADATION CURVE: Cliff-edge. When dockerd hits its FD limit, it fails with “too many open files” on the next operation requiring an FD — which is nearly every operation (opening log files, accepting connections, communicating with containerd).
RUNWAY ESTIMATION:
DOCKERD_PID=$(pgrep -x dockerd)
CURRENT=$(ls /proc/${DOCKERD_PID}/fd/ 2>/dev/null | wc -l)
MAX=$(grep "Max open files" /proc/${DOCKERD_PID}/limits | awk '{print $4}')
echo "Using ${CURRENT} of ${MAX} ($(echo "scale=1; ${CURRENT}*100/${MAX}" | bc)%)"
HEADROOM DEFINITION:
dockerd should use <50% of its FD limit during normal operations. If consistently above 50%, increase the limit in the systemd unit (LimitNOFILE).
SECTION 4 — Operational Edge Cases
Behaviors That Look Alarming But Are Normal:
High memory usage reported by docker stats for JVM containers: JVMs pre-allocate heap at startup. A JVM container may immediately show 80% memory usage and stay there forever. This is by design — the JVM allocated its heap. The actual live data usage is invisible without JVM metrics.
Container CPU >100%: On a multi-core host, a container using 2 full cores shows 200% CPU. This is correct cgroup accounting, not a bug.
Rapid container create/destroy on CI/CD hosts: CI/CD runners legitimately create and destroy hundreds of containers per hour. The concern is disk cleanup, not the lifecycle rate itself.
docker system df showing large “RECLAIMABLE” for images: If you have many tagged images for different versions, they share layers. The reclaimable space counts only the unique layers — much less than the sum of the total image sizes. This is overlay2 working correctly.
containerd memory growing slowly: containerd caches metadata. Its memory grows with the number of containers it has managed since its last restart. A periodic containerd restart (with live-restore) is acceptable.
High read I/O on overlay2: Reading files from container images involves traversing overlay layers. First access of a file requires traversing the layer stack. This is a performance tax of overlay2, not a bug.
Behaviors That Look Normal But Are Silently Catastrophic:
Container using 60% CPU and latency is fine — But if throttle percentage is 30%, latency will be terrible at p99 and you’re just not measuring it yet. Tail latency from throttling is invisible to averages and medians.
Container memory stable at 70% of limit — But if the anon (non-reclaimable) portion is growing 0.1% per hour, you have a slow leak that will OOM-kill the container in ~300 hours. The total usage looks stable because the kernel is reclaiming cache to keep the total under the limit.
Container running fine with no alarming restart count — But the exit code from the last run was 139 (SIGSEGV) and it was restarted once. A segfault that happens to self-recover is a ticking time bomb.
Low disk usage on the Docker data directory — But /var/lib/docker/ is on a different filesystem from the volumes, and the volume filesystem is full. docker system df doesn’t report volume destination filesystem usage.
Docker daemon responding to /_ping — But docker ps hangs. The daemon is partially deadlocked. The ping endpoint doesn’t take the same locks as container listing.
Cold Start / Warmup Behaviors:
- After a daemon restart, the first docker ps is slow because the daemon is rebuilding its internal state from disk.
- Container first-start pulls image layers, initializes the overlay mount, and sets up networking — all slower than subsequent restarts from the same image.
- JVM containers show high CPU for 30-120 seconds during JIT compilation warmup. This is not a performance problem — it’s the JVM optimizing.
- Containers restoring large checkpoint/snapshot data on start will show high disk I/O and memory growth that levels off.
Known Instrumentation Limitations:
- docker stats CPU percentage has a resolution of one reporting interval (typically 1 second). Sub-second CPU spikes are averaged away.
- cgroup memory metrics include kernel allocations for the cgroup that are not directly visible to ps or application-level tools. The memory total from the cgroup will be higher than the sum of process RSS.
- Network metrics from docker stats are cumulative counters, not rates. You must compute rates from two samples. The counter resets on container restart.
- There is no native Docker metric for request latency or application-level performance. Docker monitors the container, not the application inside it.
Interactions With Adjacent Systems:
- systemd and Docker cgroups: systemd manages cgroup hierarchies. Docker creates cgroups under systemd’s scope. If systemd’s DefaultMemoryAccounting or DefaultCPUAccounting is disabled, some cgroup metrics may not be populated.
- Kubernetes and Docker: When Kubernetes uses Docker as its runtime (now deprecated in favor of containerd), Kubernetes adds its own layer of resource management. Pod resource limits map to container cgroup limits, but Kubernetes adds additional overhead containers (pause containers) for each pod.
- NFS/network storage and Docker volumes: Docker volumes on NFS mounts introduce network latency into every I/O operation. Stale NFS mounts can hang the Docker daemon (blocking in kernel space, which Go cannot timeout).
- iptables and firewall managers: Docker manipulates iptables rules. If another tool (firewalld, ufw, custom scripts) also manages iptables, they can overwrite each other’s rules, causing networking failures that appear randomly and are difficult to diagnose.
SECTION 5 — Security & Integrity Signals
SIGNAL: Privileged Container Detection
WHAT IT IS: Whether any container is running in privileged mode, which disables most security isolation features (capabilities, seccomp, AppArmor, device access restrictions).
SOURCE:
Docker API: /containers/<id>/json → HostConfig.Privileged.
HOW TO COLLECT IT MANUALLY:
# Check all running containers for privileged mode
docker ps -q | xargs -I{} docker inspect --format '{{.Name}} Privileged={{.HostConfig.Privileged}}' {}
# Filter for privileged containers
docker ps -q | xargs -I{} docker inspect --format '{{if .HostConfig.Privileged}}PRIVILEGED: {{.Name}}{{end}}' {} | grep PRIVILEGED
WHAT IT TELLS YOU: A privileged container has full access to the host’s devices, can load kernel modules, modify kernel parameters, and access all host resources. It is effectively root on the host. In production, privileged containers should be extremely rare and limited to specific infrastructure tools (monitoring agents, storage drivers).
SEVERITY:
- PAGE — Any unexpected privileged container. This is a potential host compromise vector.
- INFO — Known infrastructure containers running privileged (with documented justification).
SIGNAL: Docker Socket Mount Detection
WHAT IT IS:
Whether any container has the Docker socket (/var/run/docker.sock) mounted as a volume, giving it full control over the Docker daemon and, by extension, the host.
SOURCE:
Docker API: /containers/<id>/json → Mounts array. Look for mounts with Source: /var/run/docker.sock.
HOW TO COLLECT IT MANUALLY:
# Find containers with docker socket mounted
docker ps -q | xargs -I{} docker inspect --format '{{.Name}} {{range .Mounts}}{{if eq .Source "/var/run/docker.sock"}}DOCKER_SOCKET_MOUNTED{{end}}{{end}}' {} | grep DOCKER_SOCKET_MOUNTED
# More detailed view
docker ps -q | xargs -I{} docker inspect --format '{{.Name}}: {{range .Mounts}}{{.Source}}->{{.Destination}} {{end}}' {} | grep docker.sock
WHAT IT TELLS YOU: A container with the Docker socket can create, destroy, and manage all containers on the host. It can pull images, start privileged containers, mount host filesystems, and execute commands in any other container. This is equivalent to root access on the host and is a critical security concern.
SEVERITY:
- PAGE — Any container with Docker socket mounted that isn’t a known, audited management tool (CI/CD runners, monitoring agents — and even these should be scrutinized).
SIGNAL: Container Running as Root
WHAT IT IS: Whether the main process in a container is running as UID 0 (root).
SOURCE:
Docker API: /containers/<id>/json → Config.User. Also inspectable via docker top.
HOW TO COLLECT IT MANUALLY:
# Check user for running containers
docker ps -q | xargs -I{} docker inspect --format '{{.Name}} User={{.Config.User}}' {}
# Empty User field means root
# Check the actual runtime user of each container's PID 1 (via its host-side PID)
for c in $(docker ps -q); do
echo "$(docker inspect --format '{{.Name}}' $c): $(ps -o user= -p $(docker inspect --format '{{.State.Pid}}' $c))"
done
WHAT IT TELLS YOU: Running as root inside a container is less dangerous than running as root on the host (due to namespace isolation), but it increases the impact of container escape vulnerabilities. If a container escape exploit exists, a root-inside-container process becomes root on the host.
SEVERITY:
- TICKET — Production containers running as root without documented necessity.
- INFO — Infrastructure containers that require root (monitoring agents, networking tools).
SIGNAL: Sensitive Host Path Mounts
WHAT IT IS: Whether any container has host directories mounted that could allow access to sensitive data or system modification.
SOURCE:
Docker API: /containers/<id>/json → Mounts array, specifically bind mounts.
HOW TO COLLECT IT MANUALLY:
# Show all bind mounts for all running containers
docker ps -q | xargs -I{} docker inspect --format '{{.Name}}: {{range .Mounts}}{{if eq .Type "bind"}}[{{.Source}} -> {{.Destination}} RW={{.RW}}] {{end}}{{end}}' {}
# Specifically check for sensitive paths
docker ps -q | xargs -I{} docker inspect --format '{{.Name}} {{range .Mounts}}{{.Source}} {{end}}' {} | grep -E "/(etc|root|var/run|proc|sys|dev|boot|lib/modules)"
WHAT IT TELLS YOU: Containers with access to sensitive host paths can read secrets (certificates, keys), modify system configuration, access device nodes, or interfere with other processes.
SEVERITY:
- PAGE — Unexpected mounts of
/,/etc,/root,/var/run,/proc/sysrq-trigger, or/dev. - TICKET — Any host bind mount that grants write access to directories outside the container’s expected scope.
SIGNAL: Container Capability Escalation
WHAT IT IS: Whether containers are running with additional Linux capabilities beyond the Docker default set, or with capabilities that enable privilege escalation.
SOURCE:
Docker API: /containers/<id>/json → HostConfig.CapAdd and HostConfig.CapDrop.
HOW TO COLLECT IT MANUALLY:
# Show capabilities for all running containers
docker ps -q | xargs -I{} docker inspect --format '{{.Name}} CapAdd={{.HostConfig.CapAdd}} CapDrop={{.HostConfig.CapDrop}}' {}
# Highlight dangerous capabilities
docker ps -q | xargs -I{} docker inspect --format '{{.Name}} {{.HostConfig.CapAdd}}' {} | grep -iE "SYS_ADMIN|SYS_PTRACE|NET_ADMIN|DAC_OVERRIDE|SYS_RAWIO|SYS_MODULE"
WHAT IT TELLS YOU:
Capabilities like SYS_ADMIN, SYS_PTRACE, and NET_ADMIN significantly weaken container isolation. SYS_ADMIN alone is nearly equivalent to --privileged. These should be audited and justified.
SEVERITY:
- PAGE —
SYS_ADMIN,SYS_MODULE, orSYS_RAWIOon any non-infrastructure container. - TICKET —
NET_ADMIN,SYS_PTRACE,DAC_OVERRIDEwithout documented justification.
SIGNAL: Unusual Container Network Activity
WHAT IT IS: Network traffic patterns from containers that deviate from expected behavior — unexpected outbound connections, connections to unusual ports, high traffic volumes to unknown destinations.
SOURCE:
Container network namespace: nsenter -t <pid> -n ss -tnp for connection listing.
Host-level: conntrack -L filtered by container IP.
HOW TO COLLECT IT MANUALLY:
# Active connections from a container
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' <name>)
nsenter -t ${CONTAINER_PID} -n ss -tnp
# All listening ports in a container
nsenter -t ${CONTAINER_PID} -n ss -tlnp
# Connection count by destination port
nsenter -t ${CONTAINER_PID} -n ss -tn | awk '{print $5}' | rev | cut -d: -f1 | rev | sort | uniq -c | sort -rn
WHAT IT TELLS YOU: Unexpected outbound connections may indicate compromised containers (reverse shells, data exfiltration, cryptominer command-and-control). Unexpected listening ports may indicate unauthorized services.
SEVERITY:
- PAGE — Outbound connections to unknown IPs on unusual ports (IRC ports, tor, known C2 ports). Unexpected listening ports.
- TICKET — Unusual traffic volume patterns.
SECTION 6 — Monitoring Maturity Levels
LEVEL 1 — SURVIVAL
The absolute minimum. You know whether Docker is working or completely broken.
- Docker daemon responsiveness (/_ping check)
- Docker data directory filesystem usage percentage
- Container state: is each expected container running? (process-level check)
- Host memory, CPU, disk space (basic host monitoring)
With these four signals, you will know:
- If the daemon is down
- If disk is full (the #1 Docker failure cause)
- If your containers are running
- If the host itself is struggling
LEVEL 2 — OPERATIONAL
What a professional team monitors in production. Missing these is an operational gap.
Everything from Level 1, plus:
- Container restart counts (crash detection)
- Container exit codes (failure classification)
- Container memory usage vs limits (OOM prevention)
- Container CPU usage (resource accountability)
- Container log file sizes (disk protection)
- Docker daemon error logs (operational issues)
- Image and volume disk usage breakdown (docker system df)
- Container network errors and drops
- conntrack table utilization
- OOM kill events from dmesg / cgroup memory.events
- Privileged container and Docker socket mount audit
LEVEL 3 — MATURE
Full operational visibility. Senior SRE instrumentation.
Everything from Level 2, plus:
- CPU throttling metrics (nr_throttled, throttled_usec)
- Container memory breakdown (anon vs file vs slab)
- Daemon goroutine count and memory usage
- Daemon file descriptor utilization
- Container PID count and zombie detection
- Container writable layer sizes
- Network namespace count (leak detection)
- Image pull rate and latency
- Container lifecycle event rate
- DNS resolution health from within containers
- Composite failure pattern detection (multi-signal correlation)
- Capacity trend analysis for all saturation resources
LEVEL 4 — EXPERT
Deep signals added after painful incidents. Often-missed but high-value.
Everything from Level 3, plus:
- CFS period-level CPU analysis (not just throttle counts, but burst behavior relative to CFS period boundaries)
- Overlay2 layer count per image and per container (deep layers = slower filesystem operations)
- Docker daemon pprof profiling (goroutine stacks, heap profile, mutex contention)
- containerd shim process health (per-container — a stuck shim blocks operations on that container only)
- iptables rule count and NAT table size (grows with published ports × containers)
- Container seccomp, AppArmor, and capability audit (security posture monitoring)
- Docker daemon API request latency distribution (not just is-it-up, but how-fast)
- Memory pressure stall information (/proc/pressure/memory) correlated with container memory events
- Kernel memory (kmem) usage within container cgroups (rare but devastating leaks)
- Per-container I/O accounting (blkio cgroup stats — IOPS and bandwidth per container)
- Volume driver latency (for network-attached volumes — NFS, EBS, etc.)
- Build cache breakdown and age analysis (CI/CD hosts)
- Docker socket access audit log (who is calling the API and when)
SECTION 7 — What Most Teams Get Wrong
1. No container log rotation — the #1 Docker operational failure across the industry.
The default json-file log driver has NO size limit. A single chatty container fills the disk in hours or days. This is such a common failure that it should be the first thing configured on any Docker host, yet the majority of Docker deployments encountered in incident reports have no log rotation. Add "log-opts": {"max-size": "10m", "max-file": "3"} to daemon.json or use the local log driver. This one configuration change prevents the single most common Docker incident class.
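A minimal daemon.json sketch with rotation configured (merge into any existing /etc/docker/daemon.json rather than overwriting; a daemon restart is required, and only containers created afterwards pick up the new log options):
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}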
2. CPU throttling is invisible.
Most teams monitor CPU utilization but not CPU throttling. A container can show 40% average CPU usage while being throttled 30% of the time, causing severe tail latency. The CFS bandwidth controller is the source of more “unexplained latency” postmortems in containerized environments than any other single mechanism. Teams that don’t monitor nr_throttled and throttled_usec are blind to this entire failure class.
3. Memory monitoring uses the wrong metric.
Teams alert on total cgroup memory usage, which includes reclaimable page cache. This causes false alarms (cache is large but reclaimable) and missed true alarms (anon memory is slowly leaking but total usage looks stable because cache is being reclaimed to compensate). The correct metric is anon + kernel memory from memory.stat, not memory.current.
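Reading the breakdown directly from memory.stat makes the distinction concrete — a sketch assuming cgroup v2 with the systemd cgroup driver:
# anon is non-reclaimable application memory; file is mostly reclaimable page cache
CID=$(docker inspect --format '{{.Id}}' <container>)
grep -E '^(anon|file|slab) ' /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.stat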
4. No conntrack monitoring.
When the conntrack table fills, connections are silently dropped. There is no error message, no log entry (without explicit kernel configuration), no TCP RST — just silence. The client times out. Most teams discover conntrack exhaustion during incidents, not before. Any Docker host with significant network traffic must monitor nf_conntrack_count vs nf_conntrack_max.
5. Docker socket security is ignored.
The Docker socket is equivalent to root access. Any container with the socket mounted can compromise the entire host. Many CI/CD setups, monitoring tools, and “Docker-in-Docker” patterns mount the socket without understanding the security implications. Teams routinely mount it for convenience and never audit which containers have it.
6. No daemon health monitoring beyond process existence.
Teams check “is dockerd running?” but not “is dockerd responsive?”. The daemon hang/deadlock failure mode — where the process is alive but not functioning — is missed by process monitors (systemd, uptime checks). A functional health check must actually query the API (/_ping) with a timeout, not just check for the process.
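A functional check is only a few lines — a sketch that fails when the daemon is alive but unresponsive, suitable for a cron or systemd-timer probe:
# Returns non-zero if /_ping does not answer "OK" within 2 seconds
if curl -s --max-time 2 --unix-socket /var/run/docker.sock http://localhost/_ping | grep -q '^OK$'; then
  echo "dockerd responsive"
else
  echo "dockerd NOT responsive (process may still be running)"; exit 1
fi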
7. JVM memory in containers is misconfigured.
JVM containers are set with -Xmx equal to the container memory limit, leaving no headroom for metaspace, thread stacks, native memory, JNI allocations, or direct byte buffers. The JVM consumes more memory than its heap, and the container gets OOM-killed despite the heap being within limits. -Xmx should be set to ~75% of the container memory limit, with explicit -XX:MaxMetaspaceSize and -XX:ReservedCodeCacheSize.
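A hedged sizing example — the image name is hypothetical and your image must actually honor JAVA_TOOL_OPTIONS, but the ~75% rule is the point; on JDK 10+ -XX:MaxRAMPercentage achieves the same thing without hard-coding:
# Hypothetical: 2g container limit, heap capped at ~75%, explicit metaspace/code-cache bounds
docker run -d --memory=2g \
  -e JAVA_TOOL_OPTIONS="-Xmx1536m -XX:MaxMetaspaceSize=256m -XX:ReservedCodeCacheSize=128m" \
  my-jvm-app:latest
# Alternative on JDK 10+: -XX:MaxRAMPercentage=75.0 instead of a fixed -Xmx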
8. No overlay2 / storage driver health monitoring.
Teams monitor disk space but not storage driver health. Overlay2 metadata corruption after unclean shutdowns causes container start failures that are difficult to diagnose. The only indication is error messages in daemon logs and container create failures — but without monitoring these log patterns, teams discover corruption only when new deployments fail.
9. Zombie process accumulation is unmonitored.
Containers without --init (tini) and with PID 1 processes that don’t handle SIGCHLD accumulate zombie processes. This is invisible until PID space is exhausted — and then every process on the host fails to fork. Monitoring zombie count per container and across the host is rarely done but prevents a catastrophic, host-wide failure.
10. Stopped container accumulation is unmanaged.
Stopped containers consume disk space (writable layer + log files) and clutter docker ps -a. Without a cleanup policy (either --rm on transient containers, or periodic docker container prune), hosts accumulate thousands of stopped containers over months. Each one holds its log file and writable layer. Teams discover this when they’re already in a disk crisis.




