PLAYBOOK: Monitoring Docker

SECTION 0 — Operator’s Mental Model

Docker is not a single thing you monitor — it is a stack of interdependent components, each with its own failure modes. Understanding this stack is essential to reasoning about any signal.

THE DOCKER STACK (bottom-up):

  1. runc — The low-level OCI runtime that actually creates containers by wiring together Linux namespaces, cgroups, and filesystem mounts. runc is invoked per operation and exits once the container process is running; a per-container shim (containerd-shim) remains to supervise it. If runc hangs during a create or exec, that operation is stuck.

  2. containerd — A daemon that manages the complete container lifecycle: pulling images, creating containers, starting/executing/stopping them. It calls runc for actual execution. containerd maintains the container state database. If containerd crashes, running containers survive but become unmanageable until containerd restarts and reconnects.

  3. dockerd (Docker Daemon) — The user-facing API server. It translates docker CLI commands into containerd gRPC calls. It also handles volume management, network management (via libnetwork), image building, and the HTTP API. dockerd is stateful — its database tracks container metadata, network configurations, and volume mappings. If dockerd hangs, all management operations hang; running containers continue but you cannot inspect, stop, or create new ones.

  4. Storage Driver — Typically overlay2, but could be devicemapper, btrfs, zfs, or aufs. This layer manages the copy-on-write filesystem that gives each container its apparent root filesystem. It consumes disk space for image layers, container writable layers, and metadata. Storage driver health directly impacts container I/O performance and disk exhaustion risk.

  5. libnetwork / CNM — Docker’s network stack. It manages bridges, veth pairs, iptables rules, DNS resolution via embedded DNS server, and network namespaces. Each container gets a veth pair bridged to docker0 (or custom networks). Network misconfigurations manifest as connectivity loss, DNS failures, or port conflicts.
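
Each layer can be probed independently from the host. A minimal sketch, assuming a systemd host where dockerd and containerd run as the docker and containerd units and runc is on the PATH:

# Per-layer health pass (unit names and paths are assumptions for a typical systemd install)
systemctl is-active docker containerd        # dockerd and containerd service state
pgrep -x dockerd >/dev/null && echo "dockerd process present"
pgrep -x containerd >/dev/null && echo "containerd process present"
runc --version                               # the OCI runtime that actually creates containers
docker info --format 'storage: {{.Driver}}  logging: {{.LoggingDriver}}'
docker network ls                            # networks managed by libnetwork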

WHAT DOCKER IS DOING AT ALL TIMES:

  • Event loop — dockerd processes API requests, container events, health checks, and internal housekeeping
  • Container supervision — monitoring container processes, collecting exit codes, restarting if configured
  • Log routing — capturing stdout/stderr from containers and routing to configured log drivers
  • Network maintenance — managing iptables rules, DNS resolution, load balancing for container networks
  • Image management — tracking layers, handling pulls/pushes, garbage collection of unused layers
  • Volume I/O — proxying filesystem operations from containers to mounted volumes

RESOURCES DOCKER COMPETES FOR:

  • Disk I/O — image pulls, container writes, log rotation, overlay operations. When starved: daemon operations stall, container I/O slows, builds hang.
  • Disk space — /var/lib/docker stores images, containers, volumes, logs. When starved: cannot pull images, containers fail to start, daemon may crash.
  • File descriptors — each container, network socket, and API connection uses FDs. When starved: cannot create containers, API becomes unresponsive.
  • Memory — image caching, container metadata, log buffering, network state. When starved: OOM killer may kill dockerd (catastrophic).
  • CPU — image extraction, overlay operations, log processing. When starved: slow daemon response, delayed health checks.
  • Network ports — port binding for containers, API socket. When starved: port conflicts, failed container starts.
  • IP addresses — bridge network allocation. When starved: cannot create containers on the default network.
  • iptables rules — DNAT for port publishing, network isolation. When starved: rule table exhaustion, network failures.
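
These contention points can be spot-checked from the host; a rough sketch, assuming the default /var/lib/docker data root and the sysstat package for iostat:

# Quick spot-check of the resources above
df -h /var/lib/docker                                           # disk space
iostat -x 1 3                                                   # disk I/O utilization
sudo ls /proc/$(pgrep -x dockerd)/fd | wc -l                    # file descriptors held by dockerd
sudo cat /proc/$(pgrep -x dockerd)/limits | grep "open files"   # FD limit for the daemon
free -h                                                         # memory headroom
ss -ltn                                                         # listening TCP ports
sudo iptables -t nat -S DOCKER | wc -l                          # DNAT rules for published ports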

CHARACTERISTIC FAILURE ARCHETYPES:

  1. “The daemon wedged” — dockerd becomes unresponsive while containers keep running. Cannot inspect, stop, or create. Often caused by storage driver hangs or deadlocks in internal state management.

  2. “The disk filled silently” — /var/lib/docker grows until exhaustion. Sources: unbounded container logs, dangling images, orphaned volumes, build cache accumulation.

  3. “The container death spiral” — A container repeatedly crashes and restarts (if restart policy permits), consuming resources, flooding logs, potentially masking the root cause.

  4. “The network black hole” — DNS resolution fails inside containers, or inter-container networking breaks due to iptables corruption, DNS server issues, or bridge misconfiguration.

  5. “The zombie apocalypse” — Containers in “dead” or “removing” state that cannot be cleaned up, often after daemon crashes during container removal. They consume resources and block names/IDs.

  6. “The resource leak” — File descriptors, IP addresses, or network namespaces leak over time, eventually hitting system limits.

  7. “The OOM cascade” — A container exceeds its memory limit, OOM killer acts, but the workload is stateful or causes dependent services to fail. If dockerd itself is OOM killed, all management capability is lost.

  8. “The image layer corruption” — Image layers become corrupted on disk, causing container starts to fail with cryptic errors.
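
A first-pass triage sketch that maps onto these archetypes (the alpine image and the external DNS name are arbitrary placeholders):

# Archetype triage: daemon, disk, crash loops, zombies, DNS, OOM
timeout 5 curl -s --unix-socket /var/run/docker.sock http://localhost/_ping >/dev/null || echo "daemon wedged or down"
df -h /var/lib/docker | tail -1                                  # disk filled silently
docker ps -a --filter status=restarting                         # container death spiral
docker ps -a --filter status=dead --filter status=removing      # zombie containers
docker run --rm alpine nslookup docker.com || echo "possible network/DNS black hole"
sudo dmesg -T | grep -i "killed process" | tail -5               # recent OOM kills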

DEPLOYMENT VARIANTS THAT CHANGE MONITORING:

  • Rootless Docker — Daemon runs as non-root user. Different filesystem paths (~/.local/share/docker), different resource limits, cannot bind privileged ports (<1024). Many metrics still accessible but paths change.

  • Docker in Docker (DinD) — Runs inside a container with Docker socket or daemon. Adds a layer of complexity; disk exhaustion in outer container kills inner daemon; signal propagation is complex.

  • Rootful vs rootless containers — Rootful containers have more capabilities and thus more failure modes and security exposure.

  • Storage driver differences — overlay2 is most common, but devicemapper has different failure modes (thin pool exhaustion), zfs/btrfs have their own storage pool management.

  • Log driver configuration — json-file (default) causes disk pressure; journald has different failure modes; gcplogs/awslogs/etc. depend on external services.


SECTION 1 — Signal Catalog


AVAILABILITY DOMAIN


SIGNAL: Docker Daemon Process Health

WHAT IT IS: Whether the dockerd process is running and responsive to API requests. This is the most fundamental signal — if dockerd is down or hung, all container management is impossible.

SOURCE:

  • Process: dockerd (PID typically visible in ps aux | grep dockerd)
  • Unix socket: /var/run/docker.sock (or /run/docker.sock)
  • HTTP API: the Unix socket above, or TCP port 2375 (plaintext) / 2376 (TLS) if configured

HOW TO COLLECT IT MANUALLY:

# Check process is running
pgrep -x dockerd && echo "ALIVE" || echo "DOWN"

# Check daemon responsiveness via socket
docker info > /dev/null 2>&1 && echo "RESPONSIVE" || echo "UNRESPONSIVE"

# Direct socket probe (no docker CLI needed)
curl --unix-socket /var/run/docker.sock http://localhost/_ping
# Returns "OK" if daemon is responsive

# Query daemon version info via the socket
curl --unix-socket /var/run/docker.sock http://localhost/version
# Returns JSON with daemon version info
# (if the API is exposed over TCP, production setups typically use TLS on port 2376 with client certificates)

WHAT IT TELLS YOU: If process is gone, daemon crashed or was killed. If process exists but socket is unresponsive, daemon is hung (storage deadlock, internal panic, or blocked I/O). A hung daemon is worse than a crashed one — containers keep running but you cannot manage them, and the daemon cannot be gracefully recovered without potentially affecting running containers.

SEVERITY:

  • PAGE: Process missing OR socket unresponsive (either means immediate operational impact)
  • TICKET: Process exists but response time > 5 seconds (indicates daemon stress)
  • INFO: Normal operation

THRESHOLDS:

  • Binary: process must exist AND socket must respond
  • Response time: socket should respond within 1 second under normal conditions
  • Any failure to respond within 30 seconds indicates a hang requiring intervention
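
These thresholds translate directly into a watchdog script; a minimal sketch (socket path and exit codes are assumptions — adjust for rootless Docker):

SOCK=/var/run/docker.sock
pgrep -x dockerd >/dev/null || { echo "PAGE: dockerd process missing"; exit 2; }
start=$(date +%s%N)
timeout 30 curl -sf --unix-socket "$SOCK" http://localhost/_ping >/dev/null \
  || { echo "PAGE: no /_ping response within 30s (probable hang)"; exit 2; }
ms=$(( ( $(date +%s%N) - start ) / 1000000 ))
[ "$ms" -gt 5000 ] && { echo "TICKET: /_ping took ${ms}ms"; exit 1; }
echo "OK: /_ping answered in ${ms}ms"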

FAILURE MODES DETECTED:

  • Daemon crash (process termination)
  • Daemon hang/deadlock (process exists, no response)
  • Storage driver unresponsiveness (hangs on I/O)
  • Socket file deletion or corruption

NUANCES & GOTCHAS:

  • Socket file may exist briefly after daemon death; always probe the socket, don’t just check file existence
  • Daemon may be slow to respond during heavy operations (image pulls, builds) — distinguish transient slowness from hang
  • In DinD setups, outer container health doesn’t guarantee inner daemon health
  • Rootless Docker uses a different socket path: $XDG_RUNTIME_DIR/docker.sock (typically /run/user/<uid>/docker.sock)

CORRELATES WITH:

  • Docker Daemon Response Latency — if latency is climbing before failure, indicates progressive stress
  • Host Disk I/O Utilization — high I/O often precedes daemon hangs
  • Docker Daemon Goroutine Count — rapid growth may indicate deadlock forming

SIGNAL: Docker Daemon Response Latency

WHAT IT IS: The time it takes for the Docker daemon to respond to API requests. This measures daemon processing overhead and system load impact on Docker operations.

SOURCE:

  • Docker API endpoints via socket
  • Any simple query like docker version, docker info, or /_ping

HOW TO COLLECT IT MANUALLY:

# Time a simple API call
time docker version > /dev/null 2>&1

# More precise measurement
start=$(date +%s%N); docker info > /dev/null 2>&1; end=$(date +%s%N)
echo "Latency: $(( (end - start) / 1000000 )) ms"

# Using curl directly
time curl --unix-socket /var/run/docker.sock http://localhost/_ping

WHAT IT TELLS YOU: Rising latency indicates the daemon is under stress — heavy I/O, many concurrent operations, or internal lock contention. If latency exceeds several seconds, container management operations (start, stop, logs) will be noticeably delayed, and automated systems (health checks, orchestration) may time out.

SEVERITY:

  • TICKET: Latency > 2 seconds sustained over 5 minutes
  • PLAN: Latency > 500ms sustained over 15 minutes (early warning)
  • INFO: Baseline tracking (typically <100ms on healthy systems)

THRESHOLDS:

  • Normal: < 100ms for simple queries
  • Degraded: > 500ms indicates daemon stress
  • Critical: > 5 seconds indicates severe contention or approaching hang

FAILURE MODES DETECTED:

  • Daemon overload from too many concurrent operations
  • Storage driver I/O bottleneck
  • Internal lock contention (database, state management)
  • Impending daemon hang

NUANCES & GOTCHAS:

  • First call after daemon start may be slower (warmup)
  • Image-related operations take much longer; use simple queries like /_ping for consistent measurement
  • Latency naturally spikes during large image pulls or intensive builds
  • Container count affects list-operation latency; hundreds of containers cause measurable slowdown

CORRELATES WITH:

  • Container Count — more containers means more internal state to traverse
  • Host Disk I/O Utilization — I/O contention directly impacts daemon latency
  • Docker Daemon Goroutine Count — high goroutine count with high latency suggests thread starvation

SIGNAL: Container Count by State

WHAT IT IS: The number of containers in each state: running, paused, exited/stopped, dead. This provides a snapshot of workload health and identifies stuck containers.

SOURCE:

  • Docker API: GET /containers/json?all=true
  • Command: docker ps -a --format '{{.State}}'

HOW TO COLLECT IT MANUALLY:

# Count by state
docker ps -a --format '{{.State}}' | sort | uniq -c

# Or via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true" | \
  jq -r '.[].State' | sort | uniq -c

# JSON with full breakdown
docker ps -a --format '{{json .}}' | jq -s 'group_by(.State) | map({state: .[0].State, count: length})'

WHAT IT TELLS YOU: Running count indicates active workload. High exited count with few running may indicate crash loops or workload completion. Dead containers indicate failed cleanup — they consume resources and cannot be removed normally. Paused containers are intentionally frozen but consume disk space.

SEVERITY:

  • PAGE: Any containers in “dead” state (indicates failed removal requiring intervention)
  • TICKET: Rapidly growing exited count (>50% increase in 1 hour without corresponding job completions)
  • PLAN: Paused containers accumulating without cleanup policy
  • INFO: Normal state distribution tracking

THRESHOLDS:

  • Dead containers: any nonzero is abnormal
  • Exited containers: compare to historical baseline; sudden growth indicates problems
  • Running containers: track against capacity limits
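
A minimal check built on these thresholds (alert delivery left to the caller):

# Page on dead containers, then print the state distribution
dead=$(docker ps -aq --filter status=dead | wc -l)
[ "$dead" -gt 0 ] && echo "PAGE: $dead container(s) in dead state"
docker ps -a --format '{{.State}}' | sort | uniq -c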

FAILURE MODES DETECTED:

  • Dead containers: daemon crash during container removal, resource cleanup failure
  • Exited container accumulation: crash loops, missing cleanup jobs, disk space consumption
  • No running containers: workload failure or intentional shutdown

NUANCES & GOTCHAS:

  • Exited containers may be intentional (batch jobs, one-off tasks) — correlate with workload type
  • Dead containers cannot be removed with docker rm alone; may require manual cleanup of /var/lib/docker/containers entries
  • Container count directly affects daemon API response times for list operations
  • In orchestrated environments (Swarm, K8s), the orchestrator manages container lifecycle — exited containers may be expected

CORRELATES WITH:

  • Container Restart Count — high restarts + high exited = crash loop
  • Docker Disk Usage — exited containers consume space in /var/lib/docker
  • Log Volume — exited containers may leave behind large log files

SIGNAL: Container Restart Count

WHAT IT IS: The number of times a container has been restarted due to crashing or being killed. This signal identifies unstable workloads before they cause broader impact.

SOURCE:

  • Docker API: GET /containers/{id}/json
  • Inspect field: .RestartCount

HOW TO COLLECT IT MANUALLY:

# Check restart count for specific container
docker inspect --format '{{.RestartCount}}' <container_id>

# List all containers with restart counts > 0 (RestartCount is not a `docker ps` format field)
docker ps -aq | xargs -r docker inspect --format '{{.Id}} {{.RestartCount}} {{.Name}}' | \
  awk '$2 > 0 {print}'

# Via API: RestartCount is only in the inspect endpoint, not the list endpoint
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true" | jq -r '.[].Id' | while read id; do
    curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | \
      jq -r 'select(.RestartCount > 0) | "\(.Id[:12]) \(.RestartCount) \(.Name)"'
  done

WHAT IT TELLS YOU: Any nonzero restart count means the container crashed or was killed and Docker restarted it (if restart policy permits). Rising restart counts indicate an unstable application, resource exhaustion, or configuration problem. Frequent restarts waste resources, flood logs, and may indicate a workload that cannot run successfully.

SEVERITY:

  • PAGE: Restart count increasing by >5 in 10 minutes for any container
  • TICKET: Restart count > 3 for any container in last hour
  • PLAN: Any container with restart count > 0 tracked over time
  • INFO: Baseline restart patterns for known-unstable services

THRESHOLDS:

  • Normal: restart count = 0 or stable (intentional restarts)
  • Warning: restart count increasing faster than 1/hour for sustained period
  • Critical: restart count increasing faster than 1/minute (crash loop)
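
Because RestartCount is cumulative, crash-loop detection needs two samples; a sketch using a snapshot file (the file path and the page threshold of 5 are assumptions):

# Compare restart counts against the previous run; flag jumps of 5 or more
SNAP=/tmp/docker-restarts.prev
docker ps -q | while read id; do
  echo "$id $(docker inspect --format '{{.RestartCount}}' "$id")"
done | sort > /tmp/docker-restarts.now
[ -f "$SNAP" ] && join /tmp/docker-restarts.now "$SNAP" | \
  awk '$2 - $3 >= 5 {print "PAGE: container " $1 " restarted " $2 - $3 " times since last sample"}'
mv /tmp/docker-restarts.now "$SNAP"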

FAILURE MODES DETECTED:

  • Application crash (code bugs, unhandled errors)
  • OOM kill (memory limit exceeded)
  • Health check failure (if configured with restart on unhealthy)
  • Resource starvation (CPU throttling causing timeout)
  • Dependency failure (container dies when required service unavailable)

NUANCES & GOTCHAS:

  • Restart count persists across daemon restart — it’s stored in container metadata
  • Manual restarts (docker restart) increment the count — distinguish manual vs automatic
  • “Always” restart policy will restart even manually stopped containers after daemon restart
  • Container with restart policy “no” will show 0 restarts regardless of crash frequency

CORRELATES WITH:

  • Container Exit Codes — restart + nonzero exit indicates crash pattern
  • Container OOM Killed — restart + OOM indicates memory exhaustion
  • Daemon Memory/Disk Pressure — restarts during resource pressure may indicate starvation

SIGNAL: Container Exit Codes

WHAT IT IS: The exit code of the container’s main process. Exit codes indicate why a container stopped and are essential for distinguishing crashes from graceful shutdowns.

SOURCE:

  • Docker API: GET /containers/{id}/json
  • Inspect field: .State.ExitCode

HOW TO COLLECT IT MANUALLY:

# Exit code for specific container
docker inspect --format '{{.State.ExitCode}}' <container_id>

# All containers with nonzero exit codes
docker ps -a --format '{{.ID}} {{.State}} {{.Names}}' | \
  while read id state name; do
    exit_code=$(docker inspect --format '{{.State.ExitCode}}' "$id")
    [ "$exit_code" != "0" ] && echo "$id $exit_code $state $name"
  done

# Via API: the list endpoint does not expose .State.ExitCode, so inspect each exited container
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true&filters={\"status\":[\"exited\"]}" | \
  jq -r '.[].Id' | while read id; do
    curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | \
      jq -r '"\(.Id[:12]) ExitCode:\(.State.ExitCode) \(.Name)"'
  done

WHAT IT TELLS YOU: Exit code 0 = graceful exit (successful completion or intentional stop). Exit code 1 = application error. Exit code 137 = SIGKILL (often OOM). Exit code 139 = segfault. Exit code 143 = SIGTERM (normal stop signal). Understanding exit codes enables proper alerting and incident classification.

SEVERITY:

  • TICKET: Exit code 1 (application error) for any production container
  • TICKET: Exit code 139 (segfault) — indicates serious application bug
  • PLAN: Exit code 137 without OOM indication (may need memory tuning)
  • INFO: Exit code 0 or 143 (normal shutdown)

THRESHOLDS:

  • Exit 0: Normal/expected
  • Exit 1: Application error — needs investigation
  • Exit 137: SIGKILL received — investigate OOM or external kill
  • Exit 139: Segmentation fault — application bug
  • Exit 143: SIGTERM — usually normal (orchestration, manual stop)
  • Other nonzero: Application-specific, needs documentation
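
Codes above 128 follow the 128+signal convention, which a small helper can decode; a sketch (the function name is hypothetical):

decode_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL - check OOMKilled" ;;
    139) echo "SIGSEGV - application bug" ;;
    143) echo "SIGTERM - normal stop" ;;
    *)   [ "$1" -gt 128 ] && echo "killed by signal $(( $1 - 128 ))" || echo "application exit code $1" ;;
  esac
}
decode_exit "$(docker inspect --format '{{.State.ExitCode}}' <container_id>)"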

FAILURE MODES DETECTED:

  • Application unhandled exceptions (exit 1)
  • Memory exhaustion/OOM kill (exit 137)
  • Memory corruption/segfault (exit 139)
  • Hard timeout kills (exit 137 from external)
  • Graceful shutdown (exit 143)

NUANCES & GOTCHAS:

  • Exit code 137 can be OOM kill OR external SIGKILL — check OOMKilled field to distinguish
  • Exit code 143 (SIGTERM) is normal in orchestrated environments during scaling/deployments
  • Custom exit codes are application-defined; document what your applications use
  • Exit codes are limited to 8 bits (0–255); an application exiting with a larger value wraps around (e.g., exit 256 is reported as 0)

CORRELATES WITH:

  • Container OOM Killed — confirms memory as cause for exit 137
  • Container Restart Count — exit code + restarts indicates crash pattern
  • Application Logs — for root cause of exit 1

SIGNAL: Container OOM Killed Status

WHAT IT IS: A boolean indicating whether the container was killed by the OOM (Out of Memory) killer. Critical for distinguishing memory exhaustion from other causes of container death.

SOURCE:

  • Docker API: GET /containers/{id}/json
  • Inspect field: .State.OOMKilled

HOW TO COLLECT IT MANUALLY:

# Check specific container
docker inspect --format '{{.State.OOMKilled}}' <container_id>

# Find all OOM-killed containers
for c in $(docker ps -aq); do
  oom=$(docker inspect --format '{{.State.OOMKilled}}' "$c")
  [ "$oom" = "true" ] && echo "$c was OOM killed"
done

# Via API: OOMKilled is only in the inspect endpoint, not the list endpoint
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true" | jq -r '.[].Id' | while read id; do
    curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | \
      jq -r 'select(.State.OOMKilled == true) | .Id[:12]'
  done

WHAT IT TELLS YOU: When true, the container exceeded its memory limit and the kernel OOM killer terminated it. This indicates either: memory limit is too low for the workload, the application has a memory leak, or the workload experienced an abnormal memory spike. OOM kills cause data loss for in-memory state and may cause cascading failures in dependent services.

SEVERITY:

  • PAGE: OOM killed = true for stateful/critical production containers
  • TICKET: OOM killed = true for any production container
  • PLAN: Repeated OOM kills for same container (needs memory tuning)

THRESHOLDS:

  • Any true value in production is abnormal and requires investigation
  • Development/test containers may have intentionally low limits

FAILURE MODES DETECTED:

  • Memory limit undersized for workload
  • Application memory leak
  • Memory spike from abnormal input/load
  • JVM/container memory mismatch (heap + metaspace + overhead > limit)

NUANCES & GOTCHAS:

  • OOMKilled is set at container death; it may be reset if container restarts
  • Containers without memory limits can still be OOM killed if system memory is exhausted
  • JVM applications need careful tuning: heap + metaspace + code cache + native overhead must fit within limit
  • OOM kills don’t always mean the guilty container was killed — the kernel may kill any process in the cgroup

CORRELATES WITH:

  • Container Memory Usage — approaching limit before OOM is leading indicator
  • Container Exit Codes — exit 137 + OOMKilled confirms memory cause
  • Host Memory Pressure — system-wide OOM may kill containers without per-container limits

THROUGHPUT DOMAIN


SIGNAL: Container Operations Rate

WHAT IT IS: The rate of container lifecycle operations: creates, starts, stops, removes, and dies. This measures the velocity of container churn on the host.

SOURCE:

  • Docker events API: GET /events
  • Event types: create, start, stop, die, destroy

HOW TO COLLECT IT MANUALLY:

# Stream events in real-time
docker events --filter 'type=container' --format '{{.Action}} {{.Actor.ID}}'

# Sample the event stream via the API for a bounded window (timeout or Ctrl-C ends it)
timeout 60 curl -s -N --unix-socket /var/run/docker.sock \
  "http://localhost/events?filters={\"type\":[\"container\"]}"

# Or count recent events by action over a fixed window
docker events --since 1h --until 0s --filter 'type=container' --format '{{.Action}}' | \
  sort | uniq -c

WHAT IT TELLS YOU: High container operation rates indicate dynamic workloads (CI/CD, batch jobs, serverless-on-containers). Excessive churn causes daemon stress, disk pressure (image layers, log files), and may indicate runaway processes or orchestration issues. Unusual patterns (many stops without starts) indicate workload problems.

SEVERITY:

  • TICKET: Operation rate >10x baseline sustained for >15 minutes
  • PLAN: Trending increase in operation rate over days (capacity planning)
  • INFO: Baseline operation patterns

THRESHOLDS:

  • Compare to historical baseline for the host
  • Normal varies by workload: CI runners may see 100s/hour; stable services may see 1/week
  • Any unexplained sudden spike warrants investigation

FAILURE MODES DETECTED:

  • Orchestration instability (repeated rescheduling)
  • Failed deployments (create → die loops)
  • Runaway processes creating containers
  • CI/CD queue backup clearing suddenly

NUANCES & GOTCHAS:

  • Events are ephemeral; if you’re not listening, you miss them
  • daemon restart resets event stream; some events may be lost
  • Rate calculation requires persistent counting over time windows
  • Differentiate user-initiated operations from daemon/orchestration-initiated

CORRELATES WITH:

  • Container Restart Count — high restarts + high operation rate = instability
  • Docker Daemon Response Latency — high churn often increases latency
  • Docker Disk Usage — high create rate without cleanup = disk growth

SIGNAL: Image Pull Rate

WHAT IT IS: The frequency of image pull operations. This measures dependency on external registries and can indicate deployment activity or configuration problems causing re-pulls.

SOURCE:

  • Docker events: events with action=pull
  • Registry API response times

HOW TO COLLECT IT MANUALLY:

# Monitor pull events
docker events --filter 'type=image' --filter 'event=pull' --format '{{.Time}} {{.Actor.Attributes.name}}'

# Count pulls in last hour
docker events --since 1h --until 0s --filter 'type=image' --filter 'event=pull' --format '.' | wc -l

WHAT IT TELLS YOU: High pull rates indicate active deployments or problems with image caching. If images are being re-pulled that should be cached, it indicates either image tag instability (latest tag always changes), cache invalidation, or disk cleanup removing cached layers. Pull failures block container starts.

SEVERITY:

  • TICKET: Pull rate significantly above deployment frequency (indicates cache problems)
  • TICKET: Any pull failures in production
  • PLAN: Trending increase in pull rate (may need local registry or larger cache)

THRESHOLDS:

  • Baseline depends on deployment frequency
  • More than 1 pull per unique deployment may indicate caching issue
  • Sustained pulls without corresponding new container creates = waste

FAILURE MODES DETECTED:

  • Image cache thrashing (images removed and re-pulled repeatedly)
  • Registry availability issues
  • Network connectivity problems
  • Tag instability (latest changes frequently)

NUANCES & GOTCHAS:

  • Pulling same image twice for different containers should hit cache; if not, cache is not working
  • Large image pulls can saturate network bandwidth
  • Registry rate limiting may cause pull failures during high-activity periods
  • Image digest vs tag pulls behave differently for caching

CORRELATES WITH:

  • Docker Disk Usage (Images) — high pulls may increase disk usage
  • Network Bandwidth — pulls consume bandwidth
  • Container Create Rate — creates should correlate with pulls for new images

LATENCY DOMAIN


SIGNAL: Container Start Latency

WHAT IT IS: The time from container create request to container running state. This includes image pull (if not cached), filesystem setup, and process start.

SOURCE:

  • Docker events: timestamp difference between ‘create’ and ‘start’ events
  • Docker API: container creation/start timestamps

HOW TO COLLECT IT MANUALLY:

# Time a container start
time docker run --rm alpine:latest echo "test"

# Measure start latency for a specific container via events
docker events --filter 'container=<id>' --format '{{.Time}} {{.Action}}'
# Calculate difference between create and start timestamps

# Via inspection: compare creation and start timestamps
docker inspect --format '{{.Created}} {{.State.StartedAt}}' <container_id>

WHAT IT TELLS YOU: High start latency impacts application scaling speed, deployment rollback time, and overall system responsiveness. Slow starts may be caused by: image pull time (large images, slow network), storage driver performance (overlay operations), host resource contention, or application initialization time.

SEVERITY:

  • TICKET: Start latency >30 seconds for any container
  • PLAN: Start latency trending upward over time
  • INFO: Baseline start times per image type

THRESHOLDS:

  • Small images (alpine, distroless): should start in <2 seconds (excluding app init)
  • Large images (>1GB): 10-60 seconds depending on cache status
  • Any start >60 seconds indicates problem (unless expected for image size)
  • Compare to baseline for each image type
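
Per-container latency can be derived from inspect timestamps; a sketch assuming GNU date (second resolution, and it excludes any pull that happened before create):

created=$(docker inspect --format '{{.Created}}' <container_id>)
started=$(docker inspect --format '{{.State.StartedAt}}' <container_id>)
echo "start latency: $(( $(date -d "$started" +%s) - $(date -d "$created" +%s) ))s"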

FAILURE MODES DETECTED:

  • Large unoptimized images causing slow pulls
  • Storage driver performance degradation
  • Network/registry issues causing slow pulls
  • Resource contention on host
  • Application slow initialization

NUANCES & GOTCHAS:

  • First start of an image includes pull time; subsequent starts use cache
  • Container start latency is different from application ready time — app may take longer to become functional
  • Health check grace period should account for start latency
  • Very slow starts may trigger health check failures before app is ready

CORRELATES WITH:

  • Image Size — larger images have longer start latency
  • Host Disk I/O — high I/O slows overlay operations during start
  • Storage Driver Performance — slow overlay setup during start directly lengthens start latency
  • Network Latency (to registry) — affects pull time

ERRORS DOMAIN


SIGNAL: Docker Daemon Errors in Logs

WHAT IT IS: Error-level messages in the Docker daemon logs indicating internal failures, misconfigurations, or operational problems.

SOURCE:

  • Journal: journalctl -u docker (for systemd-managed Docker)
  • Log file: /var/log/docker.log (depending on configuration)
  • Daemon stderr/stdout

HOW TO COLLECT IT MANUALLY:

# View recent daemon errors
journalctl -u docker.service -p err --since "1 hour ago"

# Watch for errors in real-time
journalctl -u docker.service -p err -f

# Search for specific error patterns
journalctl -u docker.service --since "1 day ago" | grep -iE "(error|fatal|panic|fail)"

# Count daemon log entries by level
journalctl -u docker.service --since "1 hour ago" | grep -oP '(?<=level=)\w+' | sort | uniq -c

WHAT IT TELLS YOU: Daemon errors indicate problems that may affect container operations. Common error types include: storage driver failures, network setup errors, image layer corruption, API errors, and internal panics. A sudden increase in error rate often precedes or accompanies operational problems.

SEVERITY:

  • PAGE: Any panic/fatal in daemon logs
  • PAGE: Errors indicating data corruption or unrecoverable state
  • TICKET: Any error rate increase above baseline
  • PLAN: Warnings trending upward

THRESHOLDS:

  • Any panic or fatal: immediate investigation
  • >10 errors/hour sustained: needs investigation (adjust based on baseline)
  • Error rate increase >2x over baseline: early warning
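
A sketch of an hourly check against these numbers, assuming the default logfmt-style daemon log output:

errors=$(journalctl -u docker.service --since "1 hour ago" --no-pager | grep -c 'level=error')
panics=$(journalctl -u docker.service --since "1 hour ago" --no-pager | grep -cE 'panic|level=fatal')
[ "$panics" -gt 0 ] && echo "PAGE: $panics panic/fatal entries in the last hour"
[ "$errors" -gt 10 ] && echo "TICKET: $errors error-level entries in the last hour"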

FAILURE MODES DETECTED:

  • Storage driver corruption
  • Network configuration failures
  • Image layer corruption
  • Daemon internal errors
  • API handler failures
  • Resource exhaustion

NUANCES & GOTCHAS:

  • Some errors are transient and may not indicate ongoing problems
  • Log format varies by Docker version and configuration
  • Daemon restart causes many “normal” errors during state recovery
  • Some errors are in library dependencies (containerd, runc) and may have different formats

CORRELATES WITH:

  • Container Operations Rate — errors during high operation rate may indicate overload
  • Docker Daemon Response Latency — errors + latency often correlate
  • Host Resource Metrics — errors during resource pressure indicate causation

SIGNAL: Container Creation Failures

WHAT IT IS: Failed attempts to create containers, indicating image problems, resource constraints, or configuration errors.

SOURCE:

  • Docker events: create events with error field
  • Docker API: POST /containers/create returns error response
  • Daemon logs

HOW TO COLLECT IT MANUALLY:

# Attempt container creation and capture error
docker create <image> 2>&1 || echo "CREATE_FAILED"

# Monitor creation failures in daemon logs
journalctl -u docker.service | grep -i "failed to create"

# Via API (example failed create)
curl -s --unix-socket /var/run/docker.sock \
  -X POST "http://localhost/containers/create" \
  -H "Content-Type: application/json" \
  -d '{"Image":"nonexistent"}' | jq .

# Note: failed creates do not emit container events, so rely on the API error
# response and daemon log entries above rather than the event stream

WHAT IT TELLS YOU: Creation failures block deployments and scaling. Common causes: image not found (missing pull), image pull failure, invalid configuration, resource constraints (disk space, memory), port conflicts, name conflicts, and volume mount failures.

SEVERITY:

  • TICKET: Any creation failure for production workload
  • PAGE: Creation failure rate >10% of attempts sustained
  • PLAN: Occasional failures in development (expected for some scenarios)

THRESHOLDS:

  • Normal: near 0% failure rate for production workloads
  • >5% failure rate sustained: needs investigation
  • Any failure for critical service: immediate attention

FAILURE MODES DETECTED:

  • Missing images (not pulled)
  • Invalid container configuration
  • Resource exhaustion (disk, memory, FDs)
  • Port conflicts
  • Name conflicts
  • Volume mount failures
  • Network attachment failures

NUANCES & GOTCHAS:

  • Creation failure doesn’t always log clearly; check both API response and daemon logs
  • Some failures are expected in CI/CD (testing failure scenarios)
  • Name conflicts from previous containers not cleaned up
  • Image config may be invalid in ways that only manifest at create time

CORRELATES WITH:

  • Docker Disk Usage — disk exhaustion causes creation failures
  • Image Pull Failures — missing images cause creation failures
  • Container Count — name conflicts more likely with many containers

SATURATION DOMAIN


SIGNAL: Docker Disk Usage (System-Wide)

WHAT IT IS: Total disk space consumed by Docker: images, containers, volumes, and build cache. This is the top-level view of Docker’s disk footprint.

SOURCE:

  • Command: docker system df
  • Docker API: GET /system/df

HOW TO COLLECT IT MANUALLY:

# Human-readable summary
docker system df

# Verbose breakdown
docker system df -v

# JSON output for parsing
docker system df --format '{{json .}}'

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq .

# Raw directory size (fallback if API unavailable)
sudo du -sh /var/lib/docker/

WHAT IT TELLS YOU: Docker disk usage grows over time if not managed. Images accumulate (old versions), exited containers leave behind writable layers and logs, volumes accumulate orphaned data, and build cache grows. When /var/lib/docker fills, Docker cannot function — cannot pull images, cannot create containers, may crash the daemon.

SEVERITY:

  • PAGE: Usage >90% of /var/lib/docker partition
  • TICKET: Usage >75% or growing faster than 1GB/day
  • PLAN: Usage >50% (capacity planning)
  • INFO: Baseline tracking

THRESHOLDS:

  • Compare to total space available on /var/lib/docker partition
  • Warning at 70% used
  • Critical at 85% used
  • Monitor growth rate: >5GB/day sustained requires cleanup
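
A sketch of the partition check behind these thresholds, assuming GNU df and the default data root:

use=$(df --output=pcent /var/lib/docker | tail -1 | tr -dc '0-9')
if   [ "$use" -ge 85 ]; then echo "CRITICAL: /var/lib/docker partition at ${use}%"
elif [ "$use" -ge 70 ]; then echo "WARNING: /var/lib/docker partition at ${use}%"
else echo "OK: /var/lib/docker partition at ${use}%"
fi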

FAILURE MODES DETECTED:

  • Image accumulation (no cleanup of old versions)
  • Container log growth (unbounded logs)
  • Orphaned volumes (no automatic cleanup)
  • Build cache bloat
  • General disk exhaustion

NUANCES & GOTCHAS:

  • docker system df shows reclaimable space, not just used space
  • Some storage drivers (overlay2) may not report exact reclaimable due to layer sharing
  • Build cache can consume significant space on CI runners
  • Volume data is tracked separately and survives container removal; prune operations leave volumes alone unless explicitly told otherwise
  • Running docker system prune can recover space but may be destructive

CORRELATES WITH:

  • Host Disk Usage — Docker disk usage contributes to host usage
  • Container Count — more containers = more disk usage
  • Image Count — more images = more disk usage
  • Log Driver Configuration — json-file logs stored in container directories

SIGNAL: Docker Disk Usage by Images

WHAT IT IS: Disk space consumed by container images. This is often the largest component of Docker disk usage.

SOURCE:

  • Command: docker system df (Images line)
  • Docker API: GET /system/df (Images field)

HOW TO COLLECT IT MANUALLY:

# Image disk usage summary
docker system df | grep Images

# Detailed image sizes
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq '.Images[] | {Repository: .Repository, Size: .Size}'

# Sort images by size
docker images --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | sort -hr | head -20

WHAT IT TELLS YOU: Image disk usage reflects how many images are cached and their sizes. Large images, old image versions, and rarely-used images waste disk space. High image usage with low active usage indicates cleanup is needed.

SEVERITY:

  • TICKET: Image usage >50GB or growing without corresponding workload increase
  • PLAN: Largest images should be reviewed for optimization
  • INFO: Baseline image footprint

THRESHOLDS:

  • Depends on available disk and workload needs
  • Compare active images (in use by running containers) to total images
  • If active/total ratio <20%, cleanup needed
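
A rough sketch of the active/total ratio above (counts images, not bytes):

total=$(docker images -q | sort -u | wc -l)
active=$(docker ps -aq | xargs -r docker inspect --format '{{.Image}}' | sort -u | wc -l)
echo "images in use by containers: $active of $total"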

FAILURE MODES DETECTED:

  • Image accumulation without cleanup
  • Bloated images (unnecessary files, wrong base image)
  • Duplicate images (different tags, same content)
  • Unused development/test images

NUANCES & GOTCHAS:

  • Shared layers mean total image sizes may not sum correctly
  • <none> images (dangling) are usually safe to remove
  • Some images may be base layers for others — removal cascades
  • Registry mirrors/local registries may pre-cache images

CORRELATES WITH:

  • Docker Disk Usage (Total) — images often largest component
  • Image Pull Rate — high pulls may increase image usage
  • Container Count — more unique running images = more image storage

SIGNAL: Docker Disk Usage by Containers

WHAT IT IS: Disk space consumed by container writable layers. Each running or stopped container has a writable layer that consumes disk space.

SOURCE:

  • Command: docker system df (Containers line)
  • Docker API: GET /system/df (Containers field)

HOW TO COLLECT IT MANUALLY:

# Container disk usage summary
docker system df | grep Containers

# Container sizes (including writable layer)
docker ps -a --size --format 'table {{.ID}}\t{{.Names}}\t{{.Size}}'

# Via API (container sizes require additional call)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true&size=true" | \
  jq '.[] | {Names: .Names[0], SizeRw: .SizeRw, SizeRootFs: .SizeRootFs}'

# Writable layer sizes only
docker ps -a --size --format '{{.ID}} {{.Size}}' | grep -v '0B'

WHAT IT TELLS YOU: Container disk usage reflects: number of containers, size of writable layers (how much the container has written), and log file sizes (for json-file log driver). Growing container usage indicates containers writing data or log accumulation.

SEVERITY:

  • TICKET: Container disk usage growing without cleanup
  • TICKET: Individual container writable layer >10GB (may indicate log/file bloat)
  • PLAN: Track growth trend for capacity planning

THRESHOLDS:

  • Normal: each container’s writable layer <1GB (depends on workload)
  • Warning: individual container >5GB or total growing >1GB/day
  • Cleanup: many stopped containers with nontrivial sizes

FAILURE MODES DETECTED:

  • Containers writing large amounts to their writable layer (logs, temp files)
  • Exited containers accumulating without cleanup
  • Log files growing (json-file driver)
  • Memory-heavy containers writing to tmpfs

NUANCES & GOTCHAS:

  • Reported size is the writable layer; json-file logs live under /var/lib/docker/containers/<id>/ and are not counted in it
  • SizeRw is writable layer only; SizeRootFs includes image layers
  • Stopped containers still consume disk space
  • Containers with volume mounts don’t count volume data in container size

CORRELATES WITH:

  • Container Count — more containers = more potential disk usage
  • Log Configuration — json-file logs stored in container directory
  • Docker Disk Usage (Total) — containers contribute to total

SIGNAL: Docker Disk Usage by Volumes

WHAT IT IS: Disk space consumed by Docker volumes. Volumes persist data independently of containers and can grow without bound if not monitored.

SOURCE:

  • Command: docker system df (Volumes line, v1.13+)
  • Command: docker volume ls + du inspection
  • Docker API: GET /system/df (Volumes field)
  • Filesystem: /var/lib/docker/volumes/

HOW TO COLLECT IT MANUALLY:

# Volume usage summary
docker system df | grep Volumes

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq '.Volumes[] | {Name: .Name, UsageData: .UsageData}'

# List volumes with sizes (requires inspection)
docker volume ls --format '{{.Name}}' | while read vol; do
  size=$(docker run --rm -v "$vol":/data alpine du -sh /data 2>/dev/null | cut -f1)
  echo "$vol: $size"
done

# Direct filesystem inspection
sudo du -sh /var/lib/docker/volumes/*

# Find largest volumes
sudo du -s /var/lib/docker/volumes/*/ | sort -n | tail -10

WHAT IT TELLS YOU: Volume usage reflects persistent data growth. Database volumes, log volumes, and data volumes can grow over time. Orphaned volumes (not attached to any container) waste space. Volume growth must be monitored for capacity planning.

SEVERITY:

  • TICKET: Volume usage >75% of available disk
  • TICKET: Rapid growth rate (>5GB/day) without explanation
  • PLAN: Volume growth trend for capacity planning
  • INFO: Baseline volume usage per service

THRESHOLDS:

  • Compare to disk space available
  • Growth rate depends on workload type (databases vs config volumes)
  • Orphaned volumes: any significant number is waste

FAILURE MODES DETECTED:

  • Database growth without limits
  • Log accumulation in mounted volumes
  • Orphaned volumes from deleted containers
  • Backup/snapshot volumes accumulating
  • Volume data corruption (can’t measure directly, but growth anomalies may indicate)

NUANCES & GOTCHAS:

  • Volumes are NOT automatically cleaned up by docker system prune unless the --volumes flag is passed
  • Named volumes vs anonymous volumes have different cleanup behaviors
  • Volume driver (local, NFS, cloud) affects how size is reported and measured
  • Some volumes may be mounted but not actively used (zombie data)

CORRELATES WITH:

  • Docker Disk Usage (Total) — volumes often largest persistent usage
  • Container Count — orphaned volumes when containers removed
  • Application-specific metrics (database size, etc.)

SIGNAL: Docker Disk Usage by Build Cache

WHAT IT IS: Disk space consumed by Docker’s build cache, which stores intermediate layers from image builds to speed up subsequent builds.

SOURCE:

  • Command: docker system df (Build Cache line)
  • Docker API: GET /system/df (BuildCache field, API v1.39+)

HOW TO COLLECT IT MANUALLY:

# Build cache summary
docker system df | grep "Build Cache"

# Detailed build cache info
docker builder prune --dry-run

# Via API (v1.39+)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq '.BuildCache'

# Direct inspection (buildkit)
sudo du -sh /var/lib/docker/buildkit/

WHAT IT TELLS YOU: Build cache grows with each build operation. On CI/CD runners that build many images, cache can consume significant space. While cache speeds up builds, unlimited growth wastes disk space.

SEVERITY:

  • TICKET: Build cache >20GB or >20% of Docker disk usage
  • PLAN: Regular cleanup schedule needed for build-heavy systems
  • INFO: Baseline cache size

THRESHOLDS:

  • Depends on build frequency
  • On build systems, 10-30GB is often normal
  • On non-build systems, any cache is potentially stale
  • Cache hit rate should be monitored; large cache with low hit rate is waste

FAILURE MODES DETECTED:

  • Unbounded cache growth on build servers
  • Stale cache causing build inconsistencies
  • Cache corruption causing build failures
  • Cache consuming space needed for production images

NUANCES & GOTCHAS:

  • BuildKit uses different cache storage than legacy builder
  • Cache is invalidated by Dockerfile changes, not just cleanup
  • docker builder prune is separate from docker system prune
  • Cache entries have TTL and last-used timestamps for selective cleanup

CORRELATES WITH:

  • Build Frequency — more builds = more cache
  • Docker Disk Usage (Total) — cache contributes to total
  • Build Time — large cache should correlate with faster builds

SIGNAL: Dangling Images Count

WHAT IT IS: The number of images that are not tagged and not referenced by any container. These are typically intermediate layers or images left over from builds.

SOURCE:

  • Command: docker images -f "dangling=true"
  • Docker API: GET /images/json with filters

HOW TO COLLECT IT MANUALLY:

# Count dangling images
docker images -f "dangling=true" -q | wc -l

# List with sizes
docker images -f "dangling=true"

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/images/json?filters={\"dangling\":[\"true\"]}" | \
  jq 'length'

# Size of dangling images
docker images -f "dangling=true" --format '{{.Size}}'

WHAT IT TELLS YOU: Dangling images are usually safe to remove. They accumulate from: failed builds (intermediate layers), builds that overwrite tags (old image becomes dangling), and image pulls that replace existing images. High dangling image count indicates cleanup is needed.

SEVERITY:

  • PLAN: Dangling images >10GB or >100 images
  • INFO: Baseline tracking

THRESHOLDS:

  • Small number is normal and expected
  • >100 dangling images or >10GB indicates cleanup needed
  • Rapid accumulation indicates frequent image changes

FAILURE MODES DETECTED:

  • Build detritus accumulation
  • Tag churn (pushing to same tag repeatedly)
  • Incomplete cleanup after image deletions

NUANCES & GOTCHAS:

  • Dangling images may still be used as cache for builds
  • Removing dangling images during builds can cause failures
  • Some dangling images are legitimate intermediate layers needed for builds
  • Filter carefully: some tools use <none> as legitimate placeholder

CORRELATES WITH:

  • Build Frequency — more builds = more dangling images
  • Docker Disk Usage (Images) — dangling images contribute
  • Image Pull/Push Rate — high rate = more dangling

SIGNAL: Orphaned Volumes Count

WHAT IT IS: The number of volumes that exist but are not referenced by any container. These volumes persist data that may no longer be needed.

SOURCE:

  • Command: docker volume ls -q cross-referenced with container mounts
  • Docker API: GET /volumes and GET /containers/json

HOW TO COLLECT IT MANUALLY:

# Find volumes not used by any container
docker volume ls -q | while read vol; do
  count=$(docker ps -a --filter volume=$vol -q | wc -l)
  [ $count -eq 0 ] && echo "$vol (orphaned)"
done

# Simpler: use docker system df -v to show volume usage
docker system df -v | grep -A 100 "Volumes space usage"

# Via API - get all volumes, then check container mounts
curl -s --unix-socket /var/run/docker.sock "http://localhost/volumes" | jq -r '.Volumes[].Name'
curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/json?all=true" | jq -r '.[].Mounts[].Name' | sort -u

WHAT IT TELLS YOU: Orphaned volumes consume disk space and may contain sensitive data. They’re created when containers are removed without the -v flag. Database data, uploaded files, and configuration can be stranded in orphaned volumes.

SEVERITY:

  • TICKET: Orphaned volume count >10 or total size >20GB
  • PLAN: Regular orphaned volume cleanup policy needed
  • INFO: Baseline orphaned volume tracking

THRESHOLDS:

  • Any orphaned volumes represent potential waste
  • Size matters more than count — one 100GB orphaned volume is worse than 100 1MB volumes
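
A sketch that sizes unreferenced volumes directly, assuming the local volume driver and the default data root (the dangling filter matches volumes not referenced by any container):

docker volume ls -qf dangling=true | while read vol; do
  sudo du -sh "/var/lib/docker/volumes/$vol/_data" 2>/dev/null
done | sort -hr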

FAILURE MODES DETECTED:

  • Data loss risk (containers removed without proper data migration)
  • Disk space waste
  • Security risk (sensitive data in forgotten volumes)
  • Compliance issues (PII in untracked volumes)

NUANCES & GOTCHAS:

  • Some volumes are intentionally standalone (data-only containers pattern, now deprecated)
  • Named volumes are more likely intentional; anonymous volumes more likely orphaned
  • Orphaned volumes may be needed for disaster recovery — don’t auto-delete
  • Volume drivers may not support all query operations

CORRELATES WITH:

  • Container Creation/Deletion Rate — high churn = more orphaned volumes
  • Docker Disk Usage (Volumes) — orphaned volumes contribute
  • Host Disk Usage — direct correlation

RESOURCE UTILIZATION DOMAIN


SIGNAL: Container CPU Usage

WHAT IT IS: The CPU time consumed by each container relative to host CPU capacity. This measures computational load per container.

SOURCE:

  • Docker API: GET /containers/{id}/stats
  • File: /sys/fs/cgroup/cpu/docker/<container_id>/cpuacct.usage (cgroups v1)
  • File: /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat (cgroups v2 with the systemd cgroup driver; includes usage_usec)

HOW TO COLLECT IT MANUALLY:

# Live stats for all containers
docker stats --no-stream

# Specific container
docker stats <container_id> --no-stream

# Via API (JSON, continuous stream)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq '.cpu_stats'

# Direct cgroups v1
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpuacct.usage

# Direct cgroups v2 (systemd cgroup driver; with the cgroupfs driver the path is /sys/fs/cgroup/docker/<container_id>/)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat

# Calculate CPU percentage (cgroups)
# delta_usage / (delta_time * cpu_count * 1e9) * 100

WHAT IT TELLS YOU: CPU usage indicates how much processing a container is doing. High usage may indicate: heavy workload, CPU-bound application, inefficient code, or resource contention. Containers hitting their CPU quota (if set) will be throttled.

SEVERITY:

  • TICKET: Container CPU >80% sustained for >15 minutes
  • PLAN: Container CPU trending upward over time
  • INFO: Baseline CPU usage per container

THRESHOLDS:

  • Normal varies by workload type
  • Sustained >80% on multi-core: may need more resources or optimization
  • Sustained >95% single-core: application may be bottlenecked
  • Compare to CPU limit if set; throttling occurs at 100% of limit
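
When stream=false (and one-shot is not set), the stats endpoint reports both the current and previous sample, so the same formula docker stats uses can be applied in one call; a sketch:

curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq -r '
    ((.cpu_stats.cpu_usage.total_usage - .precpu_stats.cpu_usage.total_usage)
     / (.cpu_stats.system_cpu_usage - .precpu_stats.system_cpu_usage))
    * (.cpu_stats.online_cpus // 1) * 100'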

FAILURE MODES DETECTED:

  • CPU-bound application (needs optimization or more resources)
  • Runaway process (infinite loop, crypto mining)
  • Resource contention (multiple containers competing)
  • CPU throttling (if quota set, container being limited)

NUANCES & GOTCHAS:

  • docker stats reports CPU percentage relative to one core: 100% means one core fully used, and values can exceed 100% on multi-core hosts — divide by core count to compare against total host capacity
  • CPU usage is cumulative; calculate rate of change for percentage
  • Throttling metrics (if CPU quota set) are more important than raw usage

CORRELATES WITH:

  • Container CPU Throttling — throttling + high usage = quota too low
  • Container Memory Usage — CPU + memory patterns indicate workload type
  • Host CPU Usage — container CPU contributes to host total

SIGNAL: Container CPU Throttling

WHAT IT IS: The amount of time a container’s CPU usage was throttled because it exceeded its CPU quota. This indicates containers hitting CPU limits.

SOURCE:

  • Docker API: GET /containers/{id}/stats (cpu_stats.throttling_data)
  • File: /sys/fs/cgroup/cpu/docker/<container_id>/cpu.stat (throttled_time, nr_throttled)

HOW TO COLLECT IT MANUALLY:

# Via API stats
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.cpu_stats.throttling_data'

# Direct cgroups v1
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.stat | grep throttle

# cgroups v2 (systemd cgroup driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat

# Calculate throttling percentage
# throttled_time_delta / (time_delta * cpu_count * 1e9) * 100

WHAT IT TELLS YOU: Throttling means the container wanted more CPU than its quota allows. This causes application slowdown, increased latency, and potential timeout failures. Any sustained throttling indicates the CPU limit is too low for the workload.

SEVERITY:

  • TICKET: Any sustained throttling (throttling time increasing)
  • PLAN: Occasional throttling during peak loads
  • INFO: Baseline throttling patterns

THRESHOLDS:

  • Throttling time = 0: normal, no issues
  • Any increasing throttling: quota needs adjustment
  • Throttling >10% of container uptime: significant impact
  • Burst throttling acceptable if latency SLAs allow
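
Because the throttling counters are cumulative, the simplest check is the share of enforcement periods that were throttled; a sketch using a single stats sample (containers without a CPU quota report zero periods):

curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq -r '
    .cpu_stats.throttling_data |
    if .periods > 0 then (.throttled_periods / .periods * 100 | tostring) + "% of periods throttled"
    else "no CPU quota set" end'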

FAILURE MODES DETECTED:

  • CPU quota too low for workload
  • CPU burst patterns (sporadic high CPU needs)
  • Application latency caused by throttling
  • Cascading delays from throttled services

NUANCES & GOTCHAS:

  • Throttling metrics are cumulative; track rate of change
  • Containers without CPU quota will never show throttling
  • Throttling can cause “noisy neighbor” issues to become worse
  • Some workloads (batch jobs) tolerate throttling better than latency-sensitive ones

CORRELATES WITH:

  • Container CPU Usage — high usage + throttling = needs more quota
  • Application Latency — throttling often causes latency spikes
  • Container Restart Count — if throttling causes timeouts, may cause restarts

SIGNAL: Container Memory Usage

WHAT IT IS: The memory currently allocated to a container, including cache, RSS, and other memory types. Critical for detecting memory exhaustion before OOM kill.

SOURCE:

  • Docker API: GET /containers/{id}/stats (memory_stats)
  • File: /sys/fs/cgroup/memory/docker/<container_id>/memory.usage_in_bytes (cgroups v1)
  • File: /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.current (cgroups v2 with the systemd cgroup driver)

HOW TO COLLECT IT MANUALLY:

# Live stats for all containers
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"

# Specific container
docker stats <container_id> --no-stream

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.memory_stats'

# Direct cgroups v1
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.stat

# cgroups v2 (systemd cgroup driver; with the cgroupfs driver the path is /sys/fs/cgroup/docker/<container_id>/)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.stat

WHAT IT TELLS YOU: Memory usage indicates how much RAM a container is using. If usage approaches the limit (if set), OOM kill is imminent. Growing memory usage may indicate a memory leak. Cache memory can be reclaimed, but RSS (resident set) cannot.

SEVERITY:

  • PAGE: Memory usage >90% of limit (if set) sustained
  • TICKET: Memory usage >75% of limit or growing trend
  • PLAN: Memory usage trend for capacity planning
  • INFO: Baseline memory patterns

THRESHOLDS:

  • Compare to container memory limit (if set)
  • Warning at >75% of limit
  • Critical at >90% of limit
  • Without limit, compare to host memory and other containers
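
A sketch of the percent-of-limit check, subtracting page cache the way docker stats does (field names differ between cgroups v1 and v2; with no limit set, the reported limit is host memory):

curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq -r '
    .memory_stats as $m |
    (($m.usage - ($m.stats.cache // $m.stats.inactive_file // 0)) / $m.limit * 100
     | tostring) + "% of limit"'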

FAILURE MODES DETECTED:

  • Memory leak (continuously growing usage)
  • Memory limit undersized for workload
  • Cache pressure (application caching aggressively)
  • Memory spike (sudden allocation)

NUANCES & GOTCHAS:

  • Total memory includes cache; cache is reclaimable
  • RSS (resident set size) is the more critical metric
  • Java applications: heap + metaspace + native overhead should fit under limit with buffer
  • Memory usage may spike during garbage collection; look at sustained usage

CORRELATES WITH:

  • Container OOM Killed — confirms memory exhaustion
  • Host Memory Usage — container contributes to host pressure
  • Container Restart Count — memory issues often cause restarts

SIGNAL: Container Network I/O

WHAT IT IS: Bytes received and transmitted per container, measuring network throughput and identifying network-heavy workloads.

SOURCE:

  • Docker API: GET /containers/{id}/stats (networks.{interface}.rx_bytes, tx_bytes)
  • File: /sys/class/net/<interface>/statistics/rx_bytes, tx_bytes (inside container namespace)
  • File: /proc/<pid>/net/dev (container process network namespace)

HOW TO COLLECT IT MANUALLY:

# Live stats
docker stats --no-stream

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.networks'

# Inside container network namespace
docker exec <container_id> cat /proc/net/dev

# From host: enter the container's network namespace via its PID
pid=$(docker inspect --format '{{.State.Pid}}' <container_id>)
sudo nsenter -t "$pid" -n cat /proc/net/dev
# (alternatively, locate the host-side veth peer and read its /sys/class/net counters)

# Calculate rates (need two samples)

WHAT IT TELLS YOU: Network I/O shows how much data a container is sending/receiving. High network usage may indicate: data-intensive application, log shipping, database replication, or abnormal activity (exfiltration, DDoS participation). Compare to expected bandwidth.

SEVERITY:

  • TICKET: Network I/O >10x baseline for container
  • TICKET: Transmit significantly higher than receive (may indicate data exfiltration)
  • PLAN: Network I/O growth trend
  • INFO: Baseline network patterns

THRESHOLDS:

  • Depends on application type (web server vs batch processor)
  • Compare to historical baseline for the container
  • Compare to network interface capacity
  • Any unexplained spike warrants investigation
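
Because the byte counters are cumulative, a rate needs two samples; a sketch over a 10-second window (eth0 and the window length are assumptions):

s1=$(curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/<id>/stats?stream=false")
sleep 10
s2=$(curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/<id>/stats?stream=false")
rx=$(( ( $(echo "$s2" | jq '.networks.eth0.rx_bytes') - $(echo "$s1" | jq '.networks.eth0.rx_bytes') ) / 10 ))
tx=$(( ( $(echo "$s2" | jq '.networks.eth0.tx_bytes') - $(echo "$s1" | jq '.networks.eth0.tx_bytes') ) / 10 ))
echo "rx ${rx} B/s  tx ${tx} B/s"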

FAILURE MODES DETECTED:

  • Network saturation (bandwidth limit reached)
  • Abnormal traffic patterns (security issue)
  • Misconfigured logging (excessive log shipping)
  • Data synchronization storms

NUANCES & GOTCHAS:

  • Bytes are cumulative; calculate rate from delta
  • Multiple network interfaces (eth0, eth1 for multi-network) may need separate tracking
  • Container network interfaces are veth pairs; host-side counters in different location
  • Network errors and drops are more important than raw throughput

CORRELATES WITH:

  • Container Network Errors — high I/O + errors = saturation or problem
  • Host Network Usage — container contributes to host total
  • Container CPU Usage — high network often correlates with CPU (encryption, compression)

SIGNAL: Container Network Errors

WHAT IT IS: Network errors (dropped packets, frame errors, carrier losses) for container network interfaces. Errors indicate connectivity problems.

SOURCE:

  • Docker API: GET /containers/{id}/stats (networks.{interface}.rx_dropped, tx_dropped, rx_errors, tx_errors)
  • File: /proc/net/dev inside container namespace

HOW TO COLLECT IT MANUALLY:

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.networks.eth0 | {rx_errors, tx_errors, rx_dropped, tx_dropped}'

# Inside container
docker exec <container_id> cat /proc/net/dev

# Host-side veth statistics (find veth interface name first)
docker exec <container_id> ip link show eth0
# Match with host ip link show, then:
cat /sys/class/net/vethXXX/statistics/rx_errors
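
# Error-rate sketch: errors and drops as a share of received packets (assumes jq; eth0 illustrative)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.networks.eth0 | {err_pct: (100 * (.rx_errors + .rx_dropped) / ([.rx_packets, 1] | max))}'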

WHAT IT TELLS YOU: Network errors indicate connectivity problems: packet loss, interface errors, buffer overflows, or hardware issues. Any nonzero error rate is abnormal and causes application-level retries, timeouts, and degraded performance.

SEVERITY:

  • TICKET: Any sustained network error rate (>0 errors/minute)
  • TICKET: Error rate as percentage of packets >0.1%
  • INFO: Baseline (should be zero)

THRESHOLDS:

  • Normal: errors = 0, or very small (<0.01% of packets)
  • Any increasing error count: investigate
  • Errors >0.1% of packets: significant impact

FAILURE MODES DETECTED:

  • Network interface saturation
  • veth pair buffer overflow
  • Physical network issues (on host)
  • MTU mismatch causing dropped packets
  • Firewall/rule issues

NUANCES & GOTCHAS:

  • Errors are cumulative; track rate of change
  • Dropped packets may be normal if QoS/traffic shaping is in effect
  • Container-to-container traffic on same bridge doesn’t hit physical network
  • DNS errors don’t show up in interface statistics

CORRELATES WITH:

  • Container Network I/O — high I/O + errors = saturation
  • Application Latency — network errors cause latency and timeouts
  • Host Network Errors — if host has errors, containers will too

SIGNAL: Container Block I/O

WHAT IT IS: Disk read and write bytes/operations per container. This measures storage I/O consumption and identifies disk-heavy workloads.

SOURCE:

  • Docker API: GET /containers/{id}/stats (blkio_stats)
  • File: /sys/fs/cgroup/blkio/docker/<container_id>/blkio.throttle.io_service_bytes (cgroups v1)
  • File: /sys/fs/cgroup/system.slice/docker-<container_id>.scope/io.stat (cgroups v2 with the systemd cgroup driver)

HOW TO COLLECT IT MANUALLY:

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.blkio_stats'

# Direct cgroups v1
cat /sys/fs/cgroup/blkio/docker/<container_id>/blkio.throttle.io_service_bytes
# Format: major:minor operation bytes

# cgroups v2 (systemd cgroup driver; the path differs with the cgroupfs driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/io.stat

# Calculate rates (need two samples)
# Example: sum bytes for Read/Write operations
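# Sketch (assumes jq; op casing differs between cgroups v1 and v2, hence the downcase)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '[.blkio_stats.io_service_bytes_recursive[]? | select((.op | ascii_downcase) == "read" or (.op | ascii_downcase) == "write") | .value] | add'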

WHAT IT TELLS YOU: Block I/O shows disk activity per container. High I/O may indicate: database workloads, log writing, file processing, or inefficient caching. I/O-heavy containers can starve other containers and impact host performance.

SEVERITY:

  • TICKET: Block I/O >80% of device bandwidth sustained
  • TICKET: Sustained I/O wait causing latency
  • PLAN: I/O patterns for capacity planning
  • INFO: Baseline I/O per container type

THRESHOLDS:

  • Compare to storage device capacity (IOPS, bandwidth)
  • Sustained high I/O on shared storage affects all containers
  • Device saturation varies by storage type (SSD vs HDD, local vs network)

FAILURE MODES DETECTED:

  • Disk-intensive workload (may need dedicated storage)
  • I/O throttling (if limits set)
  • Log flooding (excessive writes)
  • Database working set not fitting in memory (excessive reads)

NUANCES & GOTCHAS:

  • Blkio stats may not include all I/O (depends on cgroup configuration)
  • I/O to volumes depends on volume driver and may not be fully attributed
  • OverlayFS adds overhead; container I/O may be higher than reported
  • Async I/O may have different accounting than sync I/O

CORRELATES WITH:

  • Host Disk I/O — container I/O contributes to host total
  • Container Memory Usage — low memory = more swap/disk I/O
  • Container Latency — high I/O wait = high latency

SIGNAL: Docker Daemon File Descriptor Count

WHAT IT IS: The number of open file descriptors used by the Docker daemon process. FDs are used for: API connections, container stdio, network sockets, and internal state.

SOURCE:

  • Process: /proc/$(pgrep dockerd)/fd (count of entries)
  • Command: ls /proc/$(pgrep dockerd)/fd | wc -l

HOW TO COLLECT IT MANUALLY:

# Count FDs for dockerd
sudo ls /proc/$(pgrep dockerd)/fd | wc -l

# FD limit
sudo cat /proc/$(pgrep dockerd)/limits | grep "open files"

# Via /proc directly (note: FDSize is the size of the fd table, not the count of open fds)
sudo cat /proc/$(pgrep dockerd)/status | grep -i fd

# Detailed breakdown
sudo ls -l /proc/$(pgrep dockerd)/fd | head -20
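
# Percent-of-limit sketch (assumes a single dockerd process)
PID=$(pgrep -o dockerd)
USED=$(sudo ls /proc/$PID/fd | wc -l)
LIMIT=$(sudo awk '/Max open files/ {print $4}' /proc/$PID/limits)
echo "dockerd FDs: $USED / $LIMIT ($(( USED * 100 / LIMIT ))%)"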

WHAT IT TELLS YOU: FD count indicates daemon resource usage. Each container uses multiple FDs (stdio, network, logs). High FD usage approaching the limit causes “too many open files” errors, failed operations, and daemon instability.

SEVERITY:

  • PAGE: FD count >90% of limit
  • TICKET: FD count >75% of limit
  • PLAN: FD growth trend (leak detection)
  • INFO: Baseline FD usage

THRESHOLDS:

  • Compare to process FD limit (often 65535 or higher)
  • Warning at >50% of limit
  • Critical at >80% of limit
  • Investigate any sustained growth

FAILURE MODES DETECTED:

  • FD leak (opened but not closed)
  • Too many containers for current limit
  • API connection leak (clients not closing properly)
  • Log file FD accumulation

NUANCES & GOTCHAS:

  • FD count includes network sockets, not just files
  • Each docker logs -f consumes an FD
  • FD limit can be increased, but indicates underlying issue if growing
  • System-wide FD limits also matter

CORRELATES WITH:

  • Container Count — more containers = more FDs
  • API Connection Count — active API connections consume FDs
  • Docker Daemon Errors — FD exhaustion causes errors

SIGNAL: Docker Daemon Goroutine Count

WHAT IT IS: The number of goroutines currently active in the Docker daemon (written in Go). This indicates concurrent operations and potential thread starvation.

SOURCE:

  • Debug endpoint: GET /debug/vars (if enabled)
  • Prometheus metrics: GET /metrics (if enabled)
  • Process threads: /proc/$(pgrep dockerd)/status (Threads field, approximate)

HOW TO COLLECT IT MANUALLY:

# If debug endpoint enabled (usually disabled in production)
curl -s --unix-socket /var/run/docker.sock http://localhost/debug/vars | jq '.num_goroutine'

# If Prometheus metrics enabled (served on the metrics-addr set in daemon.json, e.g. 127.0.0.1:9323)
curl -s http://127.0.0.1:9323/metrics | grep go_goroutines

# Approximate via process threads
cat /proc/$(pgrep dockerd)/status | grep Threads

# Via pprof (if exposed)
curl -s --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=1

WHAT IT TELLS YOU: High goroutine count indicates many concurrent operations. Rapidly growing goroutine count indicates a goroutine leak (operations blocked indefinitely). Extremely high counts cause memory pressure and daemon slowdown.

SEVERITY:

  • TICKET: Goroutine count >10,000 sustained
  • TICKET: Goroutine count growing without bound
  • PLAN: Track goroutine baseline during normal and peak operations
  • INFO: Baseline goroutine patterns

THRESHOLDS:

  • Normal varies by workload: typically hundreds to thousands
  • >10,000 indicates a potential issue
  • Sustained growth without corresponding workload = leak

FAILURE MODES DETECTED:

  • Goroutine leak (operations blocked on I/O or locks)
  • Daemon overload (too many concurrent operations)
  • Internal deadlock (goroutines waiting indefinitely)
  • Memory pressure from goroutine stacks

NUANCES & GOTCHAS:

  • Debug endpoints are often disabled in production for security
  • Goroutine count includes idle/background goroutines, not just active
  • Sudden spikes during heavy operations are normal
  • Goroutine count != thread count; Go runtime multiplexes

CORRELATES WITH:

  • Docker Daemon Response Latency — high goroutines + latency = potential deadlock
  • Docker Daemon Memory — goroutines consume stack memory
  • Container Operations Rate — high operations = more goroutines

SIGNAL: Docker Daemon Memory Usage

WHAT IT IS: Memory consumed by the Docker daemon process itself (not containers). This is separate from container memory and indicates daemon resource footprint.

SOURCE:

  • Process: /proc/$(pgrep dockerd)/status (VmRSS, VmSize)
  • Command: ps -o rss,vsz -p $(pgrep dockerd)

HOW TO COLLECT IT MANUALLY:

# RSS and VSZ
ps -o rss,vsz -p $(pgrep dockerd)

# Detailed memory from /proc
cat /proc/$(pgrep dockerd)/status | grep -E 'Vm|Rss'

# From smaps (more detailed)
sudo cat /proc/$(pgrep dockerd)/smaps_rollup

# Using pmap
sudo pmap $(pgrep dockerd) | tail -1
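
# Trend sketch: append RSS (KiB) with a timestamp, e.g. from cron (the log path is illustrative)
echo "$(date -Is) $(ps -o rss= -p "$(pgrep -o dockerd)")" | sudo tee -a /var/log/dockerd-rss.log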

WHAT IT TELLS YOU: Daemon memory usage should be relatively stable. Growing memory indicates a memory leak. Very high daemon memory can cause OOM (catastrophic: daemon dies while containers keep running). Daemon memory includes: image metadata, container state, network state, plugin data.

SEVERITY:

  • PAGE: Daemon memory approaching host OOM threshold
  • TICKET: Daemon memory growing >100MB/day without workload change
  • PLAN: Track daemon memory trend
  • INFO: Baseline daemon memory (typically 100-500MB depending on scale)

THRESHOLDS:

  • Normal: depends on scale; typically 100MB-1GB for moderate deployments
  • Warning: >1GB or growing trend
  • Critical: approaching host memory limit (daemon OOM is catastrophic)

FAILURE MODES DETECTED:

  • Memory leak in daemon
  • Excessive image/container metadata
  • Plugin memory consumption
  • Large log buffer accumulation

NUANCES & GOTCHAS:

  • Daemon memory doesn’t include container memory (that’s in cgroups)
  • Go’s garbage collector means some fluctuation is normal
  • Memory usage correlates with number of images, containers, and networks
  • Daemon restart clears most accumulated memory (but is disruptive)

CORRELATES WITH:

  • Container Count — more containers = more daemon memory
  • Image Count — more images = more metadata memory
  • Docker Disk Usage — disk operations may buffer in memory

INTERNAL STATE DOMAIN


SIGNAL: Docker Storage Driver Status

WHAT IT IS: The health and performance characteristics of Docker’s storage driver (typically overlay2). Storage driver issues directly impact container operations.

SOURCE:

  • Command: docker info (Storage Driver section)
  • File: /proc/mounts (overlay mounts)
  • File: /sys/fs/overlay/ (overlay-specific stats, if available)

HOW TO COLLECT IT MANUALLY:

# Check storage driver
docker info | grep -A5 "Storage Driver"

# Verify overlay mounts
mount | grep overlay

# Check backing filesystem
df -h /var/lib/docker

# For overlay2, check layer directories
ls /var/lib/docker/overlay2/

# Disk usage of overlay storage
du -sh /var/lib/docker/overlay2/

# Check for xfs quota (if used)
xfs_quota -x -c 'df -h' /var/lib/docker
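
# Space and inode headroom on the backing filesystem (inode exhaustion is easy to miss; GNU df)
df --output=pcent,ipcent /var/lib/docker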

WHAT IT TELLS YOU: Storage driver health is essential for container operations. Problems include: disk exhaustion, inode exhaustion, mount failures, and performance degradation. overlay2 is most common; other drivers (devicemapper, btrfs) have different failure modes.

SEVERITY:

  • PAGE: Storage driver errors in daemon log
  • TICKET: Disk usage >80% of backing filesystem
  • PLAN: Track storage growth trend
  • INFO: Baseline storage driver metrics

THRESHOLDS:

  • Backing filesystem: warning at 70%, critical at 85%
  • Inode usage: warning at 70%, critical at 85%
  • Any storage driver errors require investigation

FAILURE MODES DETECTED:

  • Disk space exhaustion
  • Inode exhaustion (many small files)
  • Mount failures
  • Layer corruption
  • Performance degradation (slow container starts)

NUANCES & GOTCHAS:

  • Different storage drivers have very different characteristics
  • overlay2 requires backing filesystem support (preferably xfs with pquota)
  • devicemapper has thin pool that can fill independently
  • btrfs/zfs have their own volume management

CORRELATES WITH:

  • Docker Disk Usage — storage driver stores all data
  • Container Start Latency — storage performance affects start time
  • Docker Daemon Errors — storage issues log errors

SIGNAL: Docker Network Bridge Status

WHAT IT IS: The state of Docker’s default bridge (docker0) and custom networks. Network issues cause container connectivity problems.

SOURCE:

  • Command: docker network ls
  • Command: docker network inspect bridge
  • File: /sys/class/net/docker0/ (bridge interface stats)
  • Command: ip link show docker0, brctl show

HOW TO COLLECT IT MANUALLY:

# List networks
docker network ls

# Inspect default bridge
docker network inspect bridge

# Bridge interface statistics
ip -s link show docker0

# Bridge details (if brctl available)
brctl show docker0

# Via /sys
cat /sys/class/net/docker0/operstate
cat /sys/class/net/docker0/carrier

# Check iptables rules for Docker
iptables -t nat -L DOCKER -n -v
iptables -L DOCKER -n -v
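
# IP pool utilization sketch: configured subnet vs. containers attached to the default bridge (assumes jq)
docker network inspect bridge --format '{{json .IPAM.Config}}' | jq .
docker network inspect bridge --format '{{len .Containers}} containers attached'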

WHAT IT TELLS YOU: Bridge network status indicates container networking health. Problems include: bridge interface down, IP address exhaustion, iptables rule corruption, and veth pair issues. Containers on a broken bridge cannot communicate.

SEVERITY:

  • PAGE: Bridge interface operstate != “up” and containers exist
  • TICKET: IP allocation approaching subnet limit
  • PLAN: Network configuration drift
  • INFO: Baseline network configuration

THRESHOLDS:

  • Bridge should be “up” when containers are running
  • IP pool utilization >80% indicates approaching exhaustion
  • Any carrier=0 with active containers = problem

FAILURE MODES DETECTED:

  • Bridge interface down
  • IP address exhaustion (subnet full)
  • iptables rules corrupted
  • veth pair orphaning
  • MTU mismatch

NUANCES & GOTCHAS:

  • Custom networks (overlay, macvlan) have different failure modes
  • Docker creates/destroys iptables rules dynamically
  • Bridge network is default; custom networks may be primary in production
  • IP conflicts can occur with manually assigned IPs

CORRELATES WITH:

  • Container Network Errors — bridge issues cause container errors
  • Container Network I/O — broken bridge = no I/O
  • Docker Daemon Errors — network issues log errors

SIGNAL: Docker Events Stream Liveness

WHAT IT IS: Whether the Docker events stream is producing events as expected. The events stream is the heartbeat of container activity.

SOURCE:

  • Docker API: GET /events
  • Command: docker events

HOW TO COLLECT IT MANUALLY:

# Check events are flowing (timeout after 5 seconds)
timeout 5 docker events --filter 'type=container' --format 'event received'

# Via API with timeout: count event lines seen in a 5-second window (includes the last minute of history)
timeout 5 curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/events?since=$(date -d '1 minute ago' +%s)" | wc -l

# Check for recent events
docker events --since 1m --until 0s | head -5
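
# Active probe sketch: generate a create event and confirm it arrives (assumes the alpine image is cached)
timeout 10 docker events --filter 'event=create' --format '{{.Type}} {{.Action}}' &
docker run --rm alpine true > /dev/null
wait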

WHAT IT TELLS YOU: The events stream should produce events when containers are created, started, stopped, etc. If the stream is silent when activity is expected, or the API hangs, there may be daemon internal issues. Some monitoring systems depend on the events stream.

SEVERITY:

  • TICKET: Events stream unresponsive or hanging
  • INFO: Baseline event rates

THRESHOLDS:

  • Events stream should respond within seconds
  • No events during known activity = problem
  • API hanging on events request = daemon issue

FAILURE MODES DETECTED:

  • Daemon internal state corruption
  • Events buffer overflow
  • API handler deadlock

NUANCES & GOTCHAS:

  • Low activity systems may have long event silences (normal)
  • Daemon restart clears event buffer
  • Events are not persisted; only available while streaming
  • Multiple event subscribers are supported; one shouldn’t block others

CORRELATES WITH:

  • Docker Daemon Response Latency — hanging events = daemon stress
  • Container Operations Rate — should correlate with events

SIGNAL: Docker API Health Check Endpoint

WHAT IT IS: A simple endpoint that verifies the daemon’s HTTP API is responding. This is the simplest daemon health check.

SOURCE:

  • Docker API: GET /_ping

HOW TO COLLECT IT MANUALLY:

# Simple ping
curl --unix-socket /var/run/docker.sock http://localhost/_ping
# Returns: OK

# With timing
time curl --unix-socket /var/run/docker.sock http://localhost/_ping

# Via TCP (if configured)
curl http://localhost:2375/_ping

WHAT IT TELLS YOU: The /_ping endpoint returns “OK” if the daemon is minimally functional. It’s the lightest-weight check for daemon liveness. A response means the HTTP handler is working, but doesn’t guarantee full functionality.

SEVERITY:

  • PAGE: /_ping not responding for >30 seconds
  • TICKET: /_ping response time >5 seconds
  • INFO: Baseline response time

THRESHOLDS:

  • Response time <1 second = normal
  • Response time >5 seconds = degraded
  • No response = critical

FAILURE MODES DETECTED:

  • Daemon process dead
  • Daemon hung (internal deadlock)
  • Socket file removed/corrupted

NUANCES & GOTCHAS:

  • /_ping is very lightweight; it may respond even when daemon is stressed
  • It doesn’t verify container operations work
  • TCP socket (2375/2376) is often disabled or secured; unix socket is preferred
  • Always check response content (should be “OK”), not just HTTP status

CORRELATES WITH:

  • Docker Daemon Process Health — /_ping is the responsiveness check
  • Docker Daemon Response Latency — more detailed latency measurement

REPLICATION/CONSISTENCY DOMAIN


SIGNAL: Container Health Check Status

WHAT IT IS: The result of container health checks (if configured). Health checks verify the application inside the container is functional, not just running.

SOURCE:

  • Docker API: GET /containers/{id}/json (State.Health field)
  • Command: docker inspect --format '{{.State.Health.Status}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# Health status for specific container
docker inspect --format '{{json .State.Health}}' <container_id> | jq .

# List all containers with health status
docker ps --format '{{.ID}} {{.Status}}' | grep -E '\((healthy|unhealthy)\)'

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/json" | jq '.State.Health'

# Check last 5 health check results
docker inspect --format '{{range .State.Health.Log}}{{.End}}: {{.ExitCode}} {{.Output}}{{"\n"}}{{end}}' <container_id> | tail -5
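
# The built-in health filter is an alternative to parsing status strings
docker ps --filter health=unhealthy --format '{{.ID}} {{.Names}} {{.Status}}'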

WHAT IT TELLS YOU: Health status shows whether the container is actually healthy, not just running. Status is: starting, healthy, or unhealthy. Unhealthy containers (with restart policy) will be restarted. Health check output may reveal why the check failed.

SEVERITY:

  • PAGE: Container health status = “unhealthy” for critical service
  • TICKET: Container health status = “unhealthy” for any production service
  • PLAN: Health check failure frequency
  • INFO: Health check timing and success rate

THRESHOLDS:

  • Healthy = normal
  • Unhealthy = immediate attention
  • “Starting” for longer than start_period = problem

FAILURE MODES DETECTED:

  • Application not responding (web server, database)
  • Dependency failure (cannot connect to required service)
  • Resource starvation (too slow to respond)
  • Configuration error (wrong health check command)
  • Application deadlock

NUANCES & GOTCHAS:

  • Health checks must be configured in image or at run time; not all containers have them
  • Health check interval matters: frequent checks add load
  • Health check command runs inside the container
  • Container may be “running” but “unhealthy” — different conditions
  • Restart on unhealthy can cause loops if health check is misconfigured

CORRELATES WITH:

  • Container Restart Count — unhealthy + restart policy = restarts
  • Container Application Logs — health check failures often logged
  • Container CPU/Memory Usage — resource starvation causes health failures

SIGNAL: Docker Daemon Version and API Version

WHAT IT IS: The version of Docker daemon and its API. Version mismatches between client and daemon cause errors.

SOURCE:

  • Docker API: GET /version
  • Command: docker version

HOW TO COLLECT IT MANUALLY:

# Full version info
docker version

# Via API
curl -s --unix-socket /var/run/docker.sock http://localhost/version | jq .

# Just daemon version
docker version --format '{{.Server.Version}}'

# API version
docker version --format '{{.Server.APIVersion}}'

WHAT IT TELLS YOU: Version tracking is important for: compatibility (client/daemon mismatch), security (known vulnerabilities), and feature availability. Version drift across hosts can cause inconsistencies.

SEVERITY:

  • TICKET: Version mismatch between clients and daemon
  • TICKET: Daemon version has known security vulnerabilities
  • PLAN: Version consistency across fleet
  • INFO: Fleet version tracking

THRESHOLDS:

  • Major version mismatches often cause errors
  • Patch version differences usually compatible
  • Track CVEs for Docker versions

FAILURE MODES DETECTED:

  • Client/daemon incompatibility
  • Missing features in older versions
  • Security vulnerabilities in outdated versions

NUANCES & GOTCHAS:

  • API version is more important than release version for compatibility
  • Docker ships with multiple API versions; daemon negotiates with client
  • Downgrades are not supported
  • Version output includes OS/Arch and other metadata

CORRELATES WITH:

  • Docker Daemon Errors — version mismatch errors
  • Container Creation Failures — API incompatibility

SECURITY DOMAIN


SIGNAL: Privileged Container Count

WHAT IT IS: The number of containers running with the --privileged flag, which gives them full access to the host.

SOURCE:

  • Docker API: GET /containers/json (HostConfig.Privileged field)
  • Command: docker inspect --format '{{.HostConfig.Privileged}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# Find all privileged containers
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  if [ "$(docker inspect --format '{{.HostConfig.Privileged}}' $id)" = "true" ]; then
    echo "PRIVILEGED: $id $name"
  fi
done

# Via API (the list endpoint's HostConfig omits Privileged, so inspect each container; assumes jq)
curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/json" | jq -r '.[].Id' | \
while read -r id; do
  priv=$(curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | jq -r '.HostConfig.Privileged')
  [ "$priv" = "true" ] && echo "PRIVILEGED: $id"
done

# Count privileged containers (docker ps --format cannot read HostConfig; use inspect)
docker ps -q | xargs -r docker inspect --format '{{.HostConfig.Privileged}}' | grep -c true

WHAT IT TELLS YOU: Privileged containers have essentially host-level access. They can: access all devices, modify kernel parameters, load kernel modules, and potentially escape container isolation. Any privileged container is a security risk that should be justified and minimized.

SEVERITY:

  • TICKET: Any new privileged container in production
  • TICKET: Privileged container count increasing
  • PLAN: Audit all privileged containers for necessity
  • INFO: Baseline privileged container list

THRESHOLDS:

  • Target: zero privileged containers
  • Any privileged container requires documented justification
  • Unexpected privileged container = security incident

FAILURE MODES DETECTED:

  • Container escape risk
  • Host compromise via privileged container
  • Unauthorized privileged containers (malicious actor)

NUANCES & GOTCHAS:

  • Some legitimate use cases: Docker-in-Docker, system monitoring, hardware access
  • Use capability dropping instead of privileged when possible
  • Privileged bypasses most security controls
  • Also check for specific capabilities that may be excessive (SYS_ADMIN, etc.)

CORRELATES WITH:

  • Container Capabilities List — fine-grained capability audit
  • Container Volume Mounts — privileged + host mounts = high risk

SIGNAL: Containers with Host Network

WHAT IT IS: The number of containers using host network mode (--network host), which bypasses Docker’s network isolation.

SOURCE:

  • Docker API: GET /containers/json (HostConfig.NetworkMode field)
  • Command: docker inspect --format '{{.HostConfig.NetworkMode}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# Find containers with host network
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  net=$(docker inspect --format '{{.HostConfig.NetworkMode}}' $id)
  if [ "$net" = "host" ]; then
    echo "HOST_NETWORK: $id $name"
  fi
done

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json" | \
  jq -r '.[] | select(.HostConfig.NetworkMode == "host") | .Id[:12]'

# Count (docker ps --format cannot read HostConfig; use inspect)
docker ps -aq | xargs -r docker inspect --format '{{.HostConfig.NetworkMode}}' | grep -cx host

WHAT IT TELLS YOU: Host network mode gives containers direct access to the host’s network interfaces. The container shares the host’s IP address and can bind to any port. This bypasses network isolation and can cause port conflicts.

SEVERITY:

  • TICKET: Any new host-network container in production
  • PLAN: Audit host-network containers for necessity
  • INFO: Baseline host-network container list

THRESHOLDS:

  • Target: minimize host-network containers
  • Any host-network container requires documented justification
  • Unexpected host-network = security concern

FAILURE MODES DETECTED:

  • Port conflicts with host services
  • Network isolation bypass
  • Unauthorized network access
  • Service masquerading (container appears as host)

NUANCES & GOTCHAS:

  • Some legitimate use cases: high-performance networking, port ranges, network diagnostics
  • Host network is less risky than privileged, but still weakens isolation
  • Container can still be limited by other controls (capabilities, seccomp)
  • Port bindings don’t apply to host-network containers

CORRELATES WITH:

  • Privileged Container Count — both weaken isolation
  • Container Capabilities List — network-related capabilities

SIGNAL: Containers with Host Path Mounts

WHAT IT IS: Containers that have directories from the host filesystem mounted inside them. Sensitive host paths (/, /etc, /var/run/docker.sock) create security risks.

SOURCE:

  • Docker API: GET /containers/{id}/json (Mounts field)
  • Command: docker inspect --format '{{json .Mounts}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# List all mounts for all containers
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' $id | sed "s/^/$name: /"
done

# Find containers mounting docker socket
docker ps --format '{{.ID}}' | while read id; do
  if docker inspect --format '{{range .Mounts}}{{if eq .Destination "/var/run/docker.sock"}}YES{{end}}{{end}}' $id | grep -q YES; then
    echo "DOCKER_SOCKET: $id"
  fi
done

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json" | \
  jq -r '.[] | {Id: .Id[:12], Mounts: [.Mounts[]?.Source]}'

WHAT IT TELLS YOU: Volume mounts allow containers to access host filesystem paths. Mounting sensitive paths (docker.sock, /etc, /root, /) gives containers host-level access. Docker socket mount allows container to control Docker daemon (effectively root on host).

SEVERITY:

  • PAGE: Container mounting /var/run/docker.sock from untrusted source
  • TICKET: Any container mounting sensitive host paths (/, /etc, /root, /var/lib/docker)
  • PLAN: Audit all host mounts for necessity and minimal access
  • INFO: Baseline mount inventory

THRESHOLDS:

  • Docker socket mount: high risk, requires strong justification
  • /etc mount: can read secrets, modify host config
  • / mount: full filesystem access
  • Any write mount to sensitive path: critical

FAILURE MODES DETECTED:

  • Container escape via docker socket
  • Credential theft (reading /etc/shadow, /root/.ssh)
  • Host modification (writing to /etc, /bin)
  • Docker daemon control (via socket mount)

NUANCES & GOTCHAS:

  • Many tools require docker socket mount (CI/CD, monitoring)
  • Read-only mounts reduce risk but don’t eliminate it
  • Named volumes are safer than bind mounts
  • Also check /run/docker.sock; /var/run is typically a symlink to /run, so both paths reach the same socket

CORRELATES WITH:

  • Privileged Container Count — combined with sensitive mounts = critical
  • Container Capabilities List — mounts + capabilities compound risk

SIGNAL: Container Capabilities List

WHAT IT IS: The Linux capabilities assigned to each container. Excessive capabilities weaken isolation and increase security risk.

SOURCE:

  • Docker API: GET /containers/{id}/json (HostConfig.CapAdd, HostConfig.CapDrop)
  • Command: docker inspect --format '{{json .HostConfig.CapAdd}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# List capabilities for container
docker inspect --format '{{json .HostConfig.CapAdd}}' <container_id> | jq .
docker inspect --format '{{json .HostConfig.CapDrop}}' <container_id> | jq .

# Find containers with added capabilities
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  caps=$(docker inspect --format '{{json .HostConfig.CapAdd}}' $id)
  if [ "$caps" != "null" ] && [ "$caps" != "[]" ]; then
    echo "$name: $caps"
  fi
done

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/json" | \
  jq '{Added: .HostConfig.CapAdd, Dropped: .HostConfig.CapDrop}'

WHAT IT TELLS YOU: Docker drops most capabilities by default. Added capabilities increase container power and risk. Dangerous capabilities include: SYS_ADMIN (many admin operations), NET_ADMIN (network configuration), SYS_PTRACE (debugging, can bypass isolation), ALL (all capabilities).

SEVERITY:

  • TICKET: Container with SYS_ADMIN, NET_ADMIN, SYS_PTRACE, or ALL capabilities
  • PLAN: Audit capability additions for necessity
  • INFO: Baseline capability inventory

THRESHOLDS:

  • Default capabilities are relatively safe
  • SYS_ADMIN is particularly dangerous (almost as bad as privileged)
  • NET_ADMIN can modify firewall rules
  • ALL = effectively privileged
  • Any capability addition requires justification

FAILURE MODES DETECTED:

  • Container escape via dangerous capabilities
  • Host manipulation (mount, network, kernel)
  • Privilege escalation

NUANCES & GOTCHAS:

  • Some applications legitimately need specific capabilities (e.g., NET_ADMIN for VPN software)
  • CapDrop is good practice even without CapAdd
  • Seccomp and AppArmor interact with capabilities
  • Capability meanings are complex; review Linux capability documentation

CORRELATES WITH:

  • Privileged Container Count — similar risk profile
  • Container Security Profile (seccomp, AppArmor)

SIGNAL: Docker Daemon Audit Logs

WHAT IT IS: Security-relevant events in Docker daemon logs or system audit logs: API access, authentication attempts, configuration changes.

SOURCE:

  • Journal: journalctl -u docker (filtered for security events)
  • File: /var/log/audit/audit.log (if auditd configured for Docker)
  • Docker API: GET /events (filtered for security-relevant actions)

HOW TO COLLECT IT MANUALLY:

# Docker daemon logs with security focus
journalctl -u docker.service | grep -iE "(auth|denied|forbidden|unauthorized|security)"

# If auditd rules for Docker are configured
ausearch -m avc -c dockerd
ausearch -m USER_LOGIN -c docker

# Docker events for sensitive actions
docker events --filter 'event=create' --filter 'event=attach' --since 1h

# Check for container privilege changes
journalctl -u docker.service --since "1 day ago" | grep -i privileged

WHAT IT TELLS YOU: Security audit logs reveal: unauthorized access attempts, privilege escalation, unusual API calls, and configuration changes. These should be monitored and alerted on for security incidents.

SEVERITY:

  • PAGE: Evidence of unauthorized access or privilege escalation
  • TICKET: Any authentication failures or denied operations
  • PLAN: Regular security log review
  • INFO: Baseline security event patterns

THRESHOLDS:

  • Any authentication failure: investigate
  • Any privilege escalation: investigate
  • Unusual API patterns: investigate
  • Audit log gaps: investigate

FAILURE MODES DETECTED:

  • Unauthorized API access
  • Container escape attempts
  • Malicious container creation
  • Configuration tampering

NUANCES & GOTCHAS:

  • Docker doesn’t have built-in user authentication; relies on TLS or socket access
  • Audit logging may require additional configuration
  • Docker Content Trust provides image verification (separate signal)
  • Swarm mode adds additional auth/audit capabilities

CORRELATES WITH:

  • Privileged Container Count — new privileged containers should have audit trail
  • Container Creation Failures — repeated failures may be attacks

SECTION 2 — Composite Failure Patterns


PATTERN: Disk Exhaustion Cascade

SIGNALS INVOLVED:

  • Docker Disk Usage approaching 100%
  • Container creation failures increasing
  • Image pull failures
  • Daemon latency increasing
  • Log write failures in containers

NARRATIVE: Docker disk usage grows gradually (images, logs, containers, volumes) until it nears the filesystem limit. As free space dwindles, writes become slower. Image pulls fail. Container creates fail. Running containers may crash if they can’t write logs or data. The daemon may become unresponsive during storage operations. Recovery requires disk cleanup but cleanup operations themselves may fail without working space.

SEVERITY: PAGE — system approaching complete unavailability

DISTINGUISHING FEATURES:

  • Disk usage is the primary indicator
  • Multiple failure types appear simultaneously
  • Failures are all storage-related

COMMON CAUSES:

  • Unbounded container log growth (json-file without max-size/max-file)
  • Image accumulation without cleanup
  • Orphaned volumes growing
  • Build cache bloat on CI runners
  • Application data growth in volumes

FIRST RESPONSE:

  1. Identify largest disk consumers: docker system df -v
  2. Quick reclaim: docker system prune -f (images, build cache)
  3. Identify and remove large log files in container directories
  4. If critical, stop non-essential containers to free space
  5. Schedule root cause analysis for log/image management
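
For step 3, the largest json-file logs are usually the fastest reclaim; a sketch (assumes the default json-file log driver):

sudo du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -rh | head -10
# Reclaim in place with truncate -s 0; deleting the file does not free space while dockerd holds it open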

PATTERN: Container Death Spiral

SIGNALS INVOLVED:

  • Container restart count increasing rapidly
  • Exit codes 1 or 137 appearing repeatedly
  • Health check failures
  • Container start latency increasing (if many simultaneous restarts)
  • Daemon errors in logs

NARRATIVE: A container crashes (application error, OOM, or resource issue) and Docker restarts it (if restart policy permits). The container crashes again quickly, restarts again. Each restart consumes resources. If multiple containers are in this state, they can overwhelm the daemon, cause disk pressure (logs), and mask the root cause. The system appears “running” but is non-functional.

SEVERITY: PAGE if affecting critical service; TICKET otherwise

DISTINGUISHING FEATURES:

  • Restart count climbing rapidly
  • Containers are briefly “running” then “exited” repeatedly
  • Exit codes consistent (same failure cause)

COMMON CAUSES:

  • Application bug causing immediate crash
  • OOM kill (memory limit too low)
  • Missing dependencies (config, secrets, other services)
  • Invalid container configuration
  • Health check too aggressive (kills before app ready)

FIRST RESPONSE:

  1. Identify affected container(s) and their exit codes
  2. Check container logs: docker logs <container_id>
  3. If OOM, check memory usage and limits
  4. If application error, check application-level logs
  5. Consider pausing restart policy temporarily: docker update --restart=no <container_id>
  6. Fix root cause before re-enabling restarts
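
A sweep like the following (a sketch) surfaces exit codes, restart counts, and OOM flags across exited containers:

docker ps -a --filter status=exited --format '{{.ID}} {{.Names}}' | while read -r id name; do
  docker inspect --format "$name: exit={{.State.ExitCode}} restarts={{.RestartCount}} oom={{.State.OOMKilled}}" "$id"
done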

PATTERN: Daemon Hang

SIGNALS INVOLVED:

  • Daemon API unresponsive (/_ping fails or times out)
  • Daemon process still running
  • Containers still running (workload not affected)
  • docker commands hang
  • Daemon latency spike before hang

NARRATIVE: The Docker daemon becomes unresponsive while containers continue running. All management operations hang: cannot inspect, create, stop, or get logs. This is typically caused by internal deadlock, storage driver hang, or extreme resource contention. Running workloads are unaffected but unmanageable. Recovery may require daemon restart (which briefly affects containers) or in extreme cases, host reboot.

SEVERITY: PAGE — operational capability lost

DISTINGUISHING FEATURES:

  • Daemon process exists but is unresponsive
  • Containers are still running (key difference from daemon crash)
  • Often preceded by latency increase

COMMON CAUSES:

  • Storage driver deadlock (overlay2 bug, filesystem issue)
  • Internal daemon deadlock (bug in Docker)
  • Extreme I/O contention causing storage operations to hang
  • File descriptor exhaustion
  • Kernel-level issue affecting cgroups/namespaces

FIRST RESPONSE:

  1. Confirm containers are still running: ps aux | grep containerd-shim (one shim process per running container)
  2. Check daemon process: ps aux | grep dockerd
  3. Check storage and system health: df -h, iostat
  4. Attempt graceful daemon restart: systemctl restart docker
  5. If graceful restart hangs, may need kill -9 on dockerd
  6. In extreme cases, host reboot (last resort)
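
Before restarting, it is worth capturing the daemon's goroutine stacks for later analysis; dockerd dumps them when it receives SIGUSR1 and logs where the dump was written (sketch):

sudo kill -USR1 "$(pgrep -o dockerd)"
sudo journalctl -u docker.service --since '2 min ago' | grep -i 'goroutine stacks'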

PATTERN: Network Partition / DNS Failure

SIGNALS INVOLVED:

  • Container network errors increasing
  • Application errors for external service calls
  • DNS resolution failures inside containers
  • docker0 bridge or network interface issues
  • Health check failures for network-dependent services

NARRATIVE: Containers lose network connectivity or DNS resolution fails. Applications cannot reach databases, APIs, or other services. This may be caused by Docker network misconfiguration, iptables corruption, embedded DNS server failure, or external network issues. Containers appear healthy but are functionally broken.

SEVERITY: PAGE if affecting production services

DISTINGUISHING FEATURES:

  • Containers running and “healthy” but application failing
  • DNS errors in application logs
  • Network errors correlate with application errors
  • May affect all containers or only specific networks

COMMON CAUSES:

  • Docker embedded DNS server issue
  • iptables rules corrupted
  • Bridge interface misconfiguration
  • External DNS server unreachable
  • Network driver issue (overlay networks in Swarm)
  • MTU mismatch causing packet drops

FIRST RESPONSE:

  1. Test connectivity from inside container: docker exec <id> ping -c 3 <external_host>
  2. Test DNS resolution: docker exec <id> nslookup <hostname>
  3. Check bridge interface: ip link show docker0
  4. Check iptables: iptables -t nat -L DOCKER -n
  5. Restart Docker networking: systemctl restart docker (brief impact)
  6. If DNS issue, may need to restart containers to reinitialize DNS client
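
For step 2, containers on user-defined networks resolve through Docker's embedded DNS at 127.0.0.11; querying it directly separates embedded-DNS failures from upstream ones (sketch, assumes nslookup exists in the image):

docker exec <id> cat /etc/resolv.conf
docker exec <id> nslookup <hostname> 127.0.0.11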

PATTERN: Resource Throttling Storm

SIGNALS INVOLVED:

  • Container CPU throttling increasing
  • Container memory usage near limits
  • Application latency increasing
  • Health check failures due to slowness
  • High CPU/memory usage on host

NARRATIVE: Containers are hitting their CPU or memory limits and being throttled. This causes application slowdown, which causes health check failures, which may trigger restarts. If multiple containers are affected, they may be competing for host resources. The system appears running but is degraded.

SEVERITY: TICKET for gradual onset; PAGE for sudden severe degradation

DISTINGUISHING FEATURES:

  • Throttling metrics are primary indicator
  • Degradation correlates with resource pressure
  • May be gradual (creeping workload) or sudden (traffic spike)

COMMON CAUSES:

  • Resource limits set too low for workload
  • Traffic increase exceeding capacity
  • Inefficient code causing high resource usage
  • Memory leak causing increasing memory usage
  • Multiple containers competing for host resources

FIRST RESPONSE:

  1. Identify throttled containers: docker stats
  2. Check throttling metrics per container
  3. Compare usage to limits
  4. Increase limits if justified: docker update --cpus/--memory
  5. Investigate root cause of increased resource usage
  6. Consider horizontal scaling if workload increased
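
Per-container throttle counters live in the cpu cgroup; the ratio of nr_throttled to nr_periods shows how often the quota was hit (sketch; paths differ between cgroups v1 and v2):

# cgroups v1
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.stat
# cgroups v2 (systemd cgroup driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat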

PATTERN: Zombie Container Accumulation

SIGNALS INVOLVED:

  • Container count increasing (especially exited/dead)
  • Disk usage growing (container layers, logs)
  • Dead containers appearing
  • Daemon operations slowing (more state to manage)

NARRATIVE: Containers are being created but not properly cleaned up. Exited containers accumulate. Some containers may be in “dead” state (failed removal). This consumes disk space, slows daemon operations, and may exhaust IP addresses or container name space. Cleanup requires manual intervention.

SEVERITY: TICKET for accumulation; PAGE if dead containers blocking operations

DISTINGUISHING FEATURES:

  • Exited container count growing
  • Dead containers appearing
  • No corresponding container removals

COMMON CAUSES:

  • Missing cleanup automation
  • Deployment process not cleaning old containers
  • Removal failures leaving containers in dead state
  • Orchestration system not tracking all containers

FIRST RESPONSE:

  1. Identify exited containers: docker ps -a --filter status=exited
  2. Identify dead containers: docker ps -a --filter status=dead
  3. Remove exited containers: docker container prune -f
  4. For dead containers, may need manual cleanup of /var/lib/docker/containers
  5. Investigate why cleanup isn’t happening automatically
  6. Implement cleanup automation if missing

SECTION 3 — Capacity & Saturation Leading Indicators


RESOURCE: Disk Space on /var/lib/docker

LEADING INDICATORS:

  • Docker Disk Usage growth rate (>1GB/day sustained)
  • Image count increasing
  • Container log sizes growing
  • Build cache size increasing
  • Volume sizes growing

DEGRADATION CURVE: Sudden cliff-edge. System functions normally until ~95% full, then degrades rapidly. At 100%, daemon may crash or become unresponsive.

RUNWAY ESTIMATION:

total_space_gb = $(df -BG --output=size /var/lib/docker | tail -1 | tr -d ' G')
current_usage_gb = $(df -BG --output=used /var/lib/docker | tail -1 | tr -d ' G')
daily_growth_gb = [calculated from trend]
days_to_full = (total_space_gb * 0.95 - current_usage_gb) / daily_growth_gb

HEADROOM DEFINITION:

  • Minimum 20% free space on /var/lib/docker filesystem
  • Or minimum 50GB free, whichever is larger
  • Growth rate should not exceed 2% per day of available space

RESOURCE: File Descriptors for Docker Daemon

LEADING INDICATORS:

  • FD count trending upward
  • Container count increasing
  • API connections not being released
  • Log file handles accumulating

DEGRADATION CURVE: Graceful until limit approached, then sudden failures. “Too many open files” errors appear. New containers cannot be created. API connections fail.

RUNWAY ESTIMATION:

current_fd = $(ls /proc/$(pgrep dockerd)/fd | wc -l)
fd_limit = $(cat /proc/$(pgrep dockerd)/limits | grep "open files" | awk '{print $4}')
fd_growth_per_day = [calculated from trend]
days_to_limit = (fd_limit * 0.8 - current_fd) / fd_growth_per_day

HEADROOM DEFINITION:

  • Keep FD usage below 50% of limit
  • Investigate any FD growth without corresponding workload increase
  • Consider increasing limit if legitimate growth

RESOURCE: Container IP Address Pool (Default Bridge)

LEADING INDICATORS:

  • Container count on default network increasing
  • IP allocation approaching subnet limit
  • Network creation failures

DEGRADATION CURVE: Graceful until pool exhausted, then container creation fails. Default bridge is typically 172.17.0.0/16 (65534 addresses). Custom networks have their own pools.

RUNWAY ESTIMATION:

# Count containers on default bridge
containers_on_bridge = $(docker network inspect bridge --format '{{range .Containers}}{{.Name}} {{end}}' | wc -w)
# Default pool size (varies)
pool_size = 65534
# Rough estimate
percent_used = containers_on_bridge / pool_size * 100

HEADROOM DEFINITION:

  • Keep IP pool usage below 50%
  • Use custom networks to distribute load
  • Consider smaller subnet allocation per network

RESOURCE: Daemon Memory

LEADING INDICATORS:

  • Daemon memory usage trending upward
  • Container/image count increasing
  • No memory recovery after cleanup operations

DEGRADATION CURVE: Gradual degradation as memory pressure increases. Go GC may cause pauses. In extreme cases, OOM kills daemon (catastrophic).

RUNWAY ESTIMATION:

daemon_memory_mb = $(ps -o rss= -p $(pgrep dockerd)) / 1024   # ps reports RSS in KiB
daily_growth_mb = [calculated from trend]
host_memory_mb = $(free -m | awk '/Mem:/ {print $2}')
days_to_oom = (host_memory_mb * 0.8 - daemon_memory_mb) / daily_growth_mb

HEADROOM DEFINITION:

  • Daemon memory should be stable; growth indicates leak
  • Should not exceed 1GB under normal operation
  • If growing, investigate and consider daemon restart during maintenance

SECTION 4 — Operational Edge Cases


Behaviors that look alarming but are normal:

  1. High disk usage after large deployment — Pulling many large images consumes space; this is expected. Monitor cleanup afterward.

  2. Container in “created” state — Containers exist in “created” state before “running”. This is normal during startup.

  3. Occasional container restart count increment — If restart policy is “always” or “on-failure”, some restarts are expected. Investigate patterns, not single events.

  4. Dangling images after build — Builds create intermediate images that become dangling. This is normal; cleanup is scheduled.

  5. Network interface flapping during container start/stop — veth interfaces are created/destroyed with containers. Brief carrier losses during this are normal.

  6. Daemon memory usage varying — Go’s garbage collector causes memory to fluctuate. Look for sustained growth, not variation.

  7. CPU spikes during image operations — Image pulls and builds are CPU-intensive. Transient spikes are expected.


Behaviors that look normal but are silently catastrophic:

  1. Stable container count with growing exited containers — Running containers are fine, but exited containers accumulating indicates cleanup failure. Eventually causes disk exhaustion.

  2. Low CPU usage but high throttling — Container appears idle but is being throttled. Application is running slowly but not crashing.

  3. Memory usage stable at 99% of limit — Container is technically within limit but has no buffer for spikes. One traffic burst causes OOM.

  4. Container “running” but health check not configured — Container appears healthy but application may be dead. Without health check, there’s no signal.

  5. Network errors at low rate — Small error rate (0.01%) seems negligible but causes application-level retries, latency variance, and occasional failures.

  6. Daemon responding but slow — Technically “up” but 5-second latency makes it unusable for automation and orchestration.


Cold start, warmup, and initialization behaviors:

  1. First container start after daemon boot is slow — Daemon initializes storage driver, network, and caches. First operation is slower.

  2. First API call after daemon start has latency spike — Internal initialization happens on first request. Subsequent calls are faster.

  3. Image pull before first container start — If image not cached, container start includes pull time. First start is much slower than subsequent.

  4. Volume initialization — First use of a named volume may include filesystem initialization (especially for certain volume drivers).

  5. Network creation — First container on a new network triggers network creation. Brief delay.

  6. Health check “starting” period — Containers with health checks start in “starting” state before becoming “healthy”. This is intentional, not a failure.


Signals critical during incidents but rarely proactively monitored:

  1. Container exit codes — Only examined after something breaks. Pattern analysis could predict issues.

  2. Dead containers — Discovered during incident investigation. Should be monitored proactively.

  3. Docker socket mounts — Security risk, often only audited after security incident.

  4. Container capability additions — Security-relevant but often invisible.

  5. Events stream gaps — During incident, realize events weren’t being captured.

  6. Image pull failures — Often assumed to “just work” until deployment fails.


Known instrumentation limitations:

  1. Blkio statistics incomplete — Depends on cgroup configuration; may not capture all I/O.

  2. Network stats per container — Requires mapping veth pairs; not all tools do this correctly.

  3. Memory usage includes cache — “Used” memory includes reclaimable cache; actual memory pressure is different.

  4. CPU percentage calculation — Requires sampling over time; single sample gives cumulative, not rate.

  5. Container log size — Not directly exposed in API; must inspect filesystem.

  6. Internal daemon state — Much internal state is invisible; only exposed via debug endpoints (often disabled).


Interactions with adjacent systems:

  1. Docker + systemd — systemd manages dockerd; systemd timeouts can kill daemon during slow operations. Daemon restart affects all containers briefly.

  2. Docker + iptables/nftables — Docker manages iptables rules; external firewall changes can conflict. Firewall flush can break Docker networking.

  3. Docker + NFS/Network Storage — /var/lib/docker on NFS is unsupported and problematic. Volumes can be NFS, but storage driver shouldn’t be.

  4. Docker + log aggregation — Log driver choice affects what’s visible in docker logs. journald logs may not appear in docker logs.

  5. Docker + orchestration (Swarm/K8s) — Orchestration systems may restart containers, making restart counts misleading. Orchestration adds its own signals.

  6. Docker + monitoring agents — Agents running in containers have different visibility than agents on host. cgroups v2 changes many metric paths.


SECTION 5 — Security & Integrity Signals


SIGNAL: Docker Socket Access

WHAT IT IS: Detection of processes accessing the Docker socket, which provides full control over Docker daemon.

SOURCE:

  • File: /proc/<pid>/fd/ (scan for socket references)
  • Audit: auditd rules for /var/run/docker.sock

HOW TO COLLECT IT MANUALLY:

# Find processes with Docker socket open
sudo lsof /var/run/docker.sock

# Via ss: unix sockets bound to the Docker socket path (shows the daemon side of each
# connection; mapping peer inodes back to client processes takes extra digging)
sudo ss -xp src /var/run/docker.sock

# If auditd configured
ausearch -f /var/run/docker.sock

WHAT IT TELLS YOU: Any process with access to Docker socket can control Docker daemon, effectively giving root access. Unexpected processes with socket access are security incidents.

SEVERITY:

  • PAGE: Unknown/unauthorized process accessing socket
  • TICKET: New process granted socket access
  • INFO: Baseline authorized processes

THRESHOLDS:

  • Only known, authorized processes should access socket
  • Any unexpected access = incident

FAILURE MODES DETECTED:

  • Unauthorized Docker control
  • Privilege escalation via Docker
  • Malicious container management

NUANCES & GOTCHAS:

  • Root user always has potential access
  • Container with socket mount appears as process inside container
  • CI/CD systems often need socket access

SIGNAL: Unauthorized API Access Attempts

WHAT IT IS: Failed authentication or authorization attempts against Docker API (if TLS/auth enabled).

SOURCE:

  • Daemon logs: journalctl -u docker.service
  • TLS access logs (if configured)

HOW TO COLLECT IT MANUALLY:

# Look for auth failures in daemon logs
journalctl -u docker.service | grep -iE "(unauthorized|forbidden|denied|auth)"

# If TCP socket enabled, check for connection attempts
ss -tlnp | grep 2375
ss -tlnp | grep 2376

WHAT IT TELLS YOU: Repeated unauthorized access attempts indicate scanning or attack. Any successful unauthorized access is a breach.

SEVERITY:

  • PAGE: Successful unauthorized access
  • TICKET: Repeated failed access attempts
  • INFO: Baseline access patterns

THRESHOLDS:

  • Any successful unauthorized access = incident
  • 10 failed attempts from single source in 1 minute = suspicious

FAILURE MODES DETECTED:

  • Brute force attempts
  • Credential compromise
  • Misconfigured access control

NUANCES & GOTCHAS:

  • Unix socket has no built-in auth (file permissions only)
  • TCP socket should be secured with TLS
  • Swarm mode adds additional auth mechanisms

SIGNAL: Sensitive Environment Variables

WHAT IT IS: Containers with sensitive data (passwords, API keys, tokens) in environment variables, which are visible in inspection and process listing.

SOURCE:

  • Docker API: GET /containers/{id}/json (Config.Env)
  • Command: docker inspect --format '{{.Config.Env}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# List environment variables for container
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' <container_id>

# Find containers with potentially sensitive env vars
docker ps --format '{{.ID}}' | while read id; do
  env=$(docker inspect --format '{{range .Config.Env}}{{.}}{{end}}' $id)
  if echo "$env" | grep -qiE "(password|secret|token|key|api_key)"; then
    echo "SENSITIVE_ENV: $id"
  fi
done

# Via /proc inside container
docker exec <id> cat /proc/1/environ | tr '\0' '\n'

WHAT IT TELLS YOU: Environment variables with sensitive values are exposed via docker inspect, /proc filesystem, and process listing. This is a security risk. Secrets should use Docker secrets or external secret management.

SEVERITY:

  • TICKET: Sensitive data in environment variables
  • PLAN: Migrate to Docker secrets or external secret management
  • INFO: Audit of sensitive data handling

THRESHOLDS:

  • Target: zero sensitive data in environment variables
  • Any secret-like names in env vars = review needed

FAILURE MODES DETECTED:

  • Credential exposure via inspection
  • Credential exposure via logs
  • Credential exposure via process listing
  • Non-compliant secret handling

NUANCES & GOTCHAS:

  • Some legacy applications require env-based config
  • Docker secrets only available in Swarm or with specific run flags
  • Environment is the least secure option for secrets

SIGNAL: Container Image Provenance

WHAT IT IS: The source and trust status of container images. Untrusted or unknown-origin images are security risks.

SOURCE:

  • Docker API: GET /images/{name}/json (RepoTags, RepoDigests)
  • Docker Content Trust: docker trust inspect
  • Image labels and annotations

HOW TO COLLECT IT MANUALLY:

# List images and their sources
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.ID}}'

# Check if image has verified signature (DCT)
docker trust inspect --pretty <image>:<tag>

# Check image labels for provenance
docker inspect --format '{{json .Config.Labels}}' <image>:<tag> | jq .

# Find unsigned images (if DCT enabled)
export DOCKER_CONTENT_TRUST=1
docker images --format '{{.Repository}}:{{.Tag}}' | while read img; do
  docker trust inspect "$img" > /dev/null 2>&1 || echo "UNTRUSTED: $img"
done

WHAT IT TELLS YOU: Images from untrusted sources or without verified signatures may contain vulnerabilities or malicious code. Running unsigned images is risky. Images should come from trusted registries with content trust.

SEVERITY:

  • TICKET: Unsigned/unverified images in production
  • TICKET: Images from untrusted registries
  • PLAN: Implement Docker Content Trust
  • INFO: Image provenance audit

THRESHOLDS:

  • Production: only signed images from trusted registries
  • Development: some flexibility but track sources
  • Any :latest tag in production = review needed

FAILURE MODES DETECTED:

  • Malicious images from untrusted sources
  • Compromised images in trusted registry
  • Supply chain attacks
  • Unpinned images changing unexpectedly

NUANCES & GOTCHAS:

  • Docker Content Trust requires explicit enablement
  • Some registries have their own signing mechanisms
  • Digest-pinned images are more verifiable than tag-pinned
  • Base image vulnerabilities affect all derived images

SECTION 6 — Monitoring Maturity Levels


LEVEL 1 — SURVIVAL

The absolute minimum to know if Docker is alive and not on fire:

  1. Docker Daemon Process Health — Is dockerd running?
  2. Docker API Health Check Endpoint — Is daemon responding?
  3. Docker Disk Usage (Total) — Is /var/lib/docker filling up?
  4. Container Count by State — How many are running/exited?
  5. Container Restart Count — Is anything crash-looping?
  6. Container OOM Killed Status — Did anything die from memory issues?

These 6 signals tell you: is the daemon working, is there disk space, and are containers running without crashing. Without these, you are flying blind.


LEVEL 2 — OPERATIONAL

What a competent team running Docker in production monitors:

  1. All Level 1 signals
  2. Docker Disk Usage breakdown — Images, containers, volumes, build cache separately
  3. Container CPU Usage — Per-container CPU consumption
  4. Container Memory Usage — Per-container memory vs limits
  5. Container Network I/O — Basic throughput per container
  6. Docker Daemon Response Latency — Is the daemon slow?
  7. Container Health Check Status — Are containers actually healthy?
  8. Docker Daemon Errors in Logs — Any errors in daemon logs?
  9. Container Exit Codes — Why did containers stop?
  10. Image Pull Rate/Failures — Are deployments working?

These signals give you visibility into resource consumption, performance, and reliability. You can detect and diagnose most common issues.


LEVEL 3 — MATURE

Full coverage: internals, leading indicators, composite patterns:

  1. All Level 2 signals
  2. Container CPU Throttling — Are containers being limited?
  3. Container Network Errors — Packet loss and errors
  4. Container Block I/O — Disk usage per container
  5. Docker Daemon File Descriptor Count — Approaching limits?
  6. Docker Daemon Memory Usage — Daemon memory footprint
  7. Docker Storage Driver Status — Health of overlay2/etc
  8. Docker Network Bridge Status — Network infrastructure health
  9. Dangling Images Count — Cleanup needed?
  10. Orphaned Volumes Count — Data volumes not in use
  11. Container Start Latency — How long to start containers
  12. Container Operations Rate — Churn rate
  13. Docker Events Stream Liveness — Are events flowing?
  14. Docker Build Cache Size — Build cache consumption

At this level you have leading indicators, can predict capacity issues, and have detailed performance visibility.
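
Several of these have one-line collectors; a rough sketch for dangling images, orphaned volumes, and per-category disk usage including build cache:

# Dangling images and unused (orphaned) volumes
docker images -f dangling=true -q | wc -l
docker volume ls -f dangling=true -q | wc -l

# Per-category disk usage, including build cache
docker system df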


LEVEL 4 — EXPERT

Deep signals that experienced operators add after incidents:

  1. All Level 3 signals
  2. Docker Daemon Goroutine Count — Internal concurrency health
  3. Privileged Container Count — Security risk tracking
  4. Containers with Host Network — Network isolation bypass
  5. Containers with Host Path Mounts — Filesystem exposure
  6. Container Capabilities List — Capability audit
  7. Docker Socket Access — Who can control Docker
  8. Sensitive Environment Variables — Secret exposure
  9. Container Image Provenance — Trust and verification
  10. Docker Daemon Version/API Version — Fleet consistency
  11. Per-container log file sizes — Disk consumption from logs
  12. Network namespace leaks — Leaked ns after container stop
  13. Storage driver performance metrics — Overlay2-specific stats
  14. Container density per host — Host packing efficiency
  15. Layer sharing efficiency — How much layer reuse

At this level you have security observability, can detect subtle resource leaks, understand performance deeply, and have comprehensive audit capability.
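
A rough audit one-liner for the privileged and host-namespace signals, built from standard inspect fields:

# Flag privileged containers and host network/PID namespace usage
docker ps -q | xargs -r docker inspect --format \
  '{{.Name}} privileged={{.HostConfig.Privileged}} net={{.HostConfig.NetworkMode}} pid={{.HostConfig.PidMode}}' \
  | grep -E 'privileged=true|net=host|pid=host'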


SECTION 7 — What Most Teams Get Wrong


1. Not monitoring container logs disk consumption

Container logs (json-file driver) are stored under /var/lib/docker/containers/<container-id>/ and can grow unbounded. Most teams monitor total disk usage but not the specific contribution of logs. A verbose application can fill the disk with logs while all other metrics look normal.

What to do: Monitor per-container log file sizes directly, or configure log rotation (max-size, max-file) and monitor the total size of /var/lib/docker/containers.
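
A quick way to see the per-container log contribution, plus per-container rotation flags; the sizes are illustrative.

# Largest container log files first (json-file driver default location)
du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -rh | head -10

# Per-container rotation at run time; daemon-wide defaults belong in /etc/docker/daemon.json
docker run -d --log-opt max-size=50m --log-opt max-file=3 <image>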


2. Ignoring exited containers until disk exhaustion

Exited containers consume disk space (writable layers, logs) but don’t show up in docker ps (only docker ps -a). Teams often have hundreds of exited containers accumulating, then hit disk issues suddenly.

What to do: Monitor exited container count and total size. Implement automated cleanup. Alert on accumulation rate.
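
To see what has accumulated (the size shown is the writable layer, not shared image layers):

# Count exited containers and show their writable-layer sizes
docker ps -a -f status=exited -q | wc -l
docker ps -a -s -f status=exited --format 'table {{.Names}}\t{{.Size}}\t{{.Status}}'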


3. Not tracking container restart counts until something breaks

Restart count is a leading indicator of instability, but most teams only look at it during incidents. A container restarting occasionally is invisible until it’s restarting constantly.

What to do: Alert on any restart count increase for production containers. Track restart rate over time.
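
A minimal check for nonzero restart counts across running containers:

# List any running container whose restart count is above zero
docker ps -q | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' | awk '$2 > 0'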


4. Assuming “running” means “healthy”

Containers can be in “running” state while the application is deadlocked, waiting for missing dependencies, or otherwise non-functional. Without health checks, you have no visibility.

What to do: Configure meaningful health checks for all containers. Monitor health check status, not just container state.
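
Health state is exposed both as a docker ps filter and in inspect output; a sketch, assuming jq is available:

# Containers currently reporting unhealthy (only containers that define a HEALTHCHECK have health state)
docker ps --filter health=unhealthy --format '{{.Names}}\t{{.Status}}'

# Last health probe output for a specific container
docker inspect --format '{{json .State.Health.Log}}' <container> | jq '.[-1]'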


5. Not monitoring Docker daemon latency

A daemon that responds in 5 seconds instead of 50ms is technically “up” but causes orchestration timeouts, slow deployments, and operational frustration. Most teams only check if the daemon process exists.

What to do: Monitor daemon API response latency. Alert on degradation, not just failure.
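
One lightweight probe is the daemon's /_ping endpoint over the local socket (default socket path assumed):

# Time the daemon's ping endpoint; alert on sustained increases, not one-off spikes
curl --unix-socket /var/run/docker.sock -s -o /dev/null -w 'ping: %{time_total}s\n' http://localhost/_ping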


6. Missing network errors because they’re rare

A 0.01% packet error rate seems trivial but causes application-level retries, latency variance, and occasional failures. These errors are often invisible in high-level metrics.

What to do: Monitor network error counters, not just throughput. Any nonzero error rate warrants investigation.
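
Error and drop counters do not appear in docker stats output, but the stats API exposes them; a sketch, assuming jq is available (host-network containers have no .networks section):

# Per-interface error/drop counters for one container
curl --unix-socket /var/run/docker.sock -s \
  "http://localhost/containers/<container>/stats?stream=false" \
  | jq '.networks | map_values({rx_errors, tx_errors, rx_dropped, tx_dropped})'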


7. Not monitoring CPU throttling

Containers with CPU quotas can be throttled without appearing to use much CPU. The application is slow but metrics show low usage. This is confusing and often misdiagnosed.

What to do: Monitor throttling metrics (nr_throttled, throttled_time) for containers with CPU quotas.
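
The counters live in the container's CPU cgroup; the exact path depends on cgroup version and cgroup driver, so both common layouts are sketched:

# cgroup v1 (cgroupfs driver): nr_periods, nr_throttled, throttled_time
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat

# cgroup v2 (systemd driver): nr_throttled, throttled_usec
cat /sys/fs/cgroup/system.slice/docker-<full-container-id>.scope/cpu.stat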


8. Blind trust of images from Docker Hub

Many teams pull images directly from Docker Hub without verification. These images may be outdated, vulnerable, or in rare cases, malicious.

What to do: Use approved base images from trusted registries. Pin to digests, not tags. Implement vulnerability scanning. Consider Docker Content Trust.
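
Digest pinning was sketched earlier under the image provenance signal; for scanning, any image scanner works. trivy is shown purely as an example and is assumed to be installed separately.

# Scan an image for known CVEs (example scanner; substitute your own)
trivy image <image>:<tag>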


9. Not understanding memory metrics

Docker memory “usage” includes cache, which is reclaimable. Teams often see high memory usage and panic, or see low memory usage and miss OOM risk because they don’t understand the metrics.

What to do: Monitor RSS (resident set size) alongside total usage, and understand how your runtime (JVM, etc.) reports memory.
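
The stats API separates cache from RSS; a sketch assuming jq is available. The field names shown are the cgroup v1 ones; under cgroup v2 the equivalents are file/anon.

# Usage vs cache vs RSS for one container
curl --unix-socket /var/run/docker.sock -s \
  "http://localhost/containers/<container>/stats?stream=false" \
  | jq '{usage: .memory_stats.usage, cache: .memory_stats.stats.cache, rss: .memory_stats.stats.rss}'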


10. Monitoring containers but not the daemon

Teams often have detailed container monitoring but nothing about the Docker daemon itself. Daemon issues affect all containers but are invisible to container-level monitoring.

What to do: Monitor daemon health, latency, memory, FDs, and errors with the same rigor as containers.
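
The daemon can expose Prometheus metrics about itself; a sketch of enabling and spot-checking it. The address is illustrative, and older releases also require the experimental flag for metrics-addr.

# /etc/docker/daemon.json (then restart dockerd):
#   { "metrics-addr": "127.0.0.1:9323" }
curl -s http://127.0.0.1:9323/metrics | grep -E '^(engine_daemon|go_goroutines|process_open_fds)' | head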


11. No alerting on “dead” containers

Containers in “dead” state cannot be removed normally and require manual intervention. They’re rare enough that teams don’t notice them until they accumulate.

What to do: Alert on any container in “dead” state. They indicate prior daemon or cleanup issues.
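
status=dead is a valid docker ps filter, so the check is a one-liner:

# Any container stuck in the dead state
docker ps -a --filter status=dead --format '{{.Names}}\t{{.Status}}'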


12. Assuming disk cleanup will always work

Teams rely on docker system prune for disk management but don’t test it. When disk is critically full, cleanup operations themselves may fail due to lack of working space.

What to do: Monitor disk usage at lower thresholds (70-80%). Don’t wait until 95% to clean up. Test cleanup procedures.
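
A conservative cleanup sketch that can be rehearsed before it is needed; the age filter is illustrative.

# Remove stopped containers, dangling images, unused networks, and build cache older than 24h
docker system prune -f --filter "until=24h"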


13. Not monitoring build cache on CI runners

CI/CD runners that build images accumulate build cache rapidly. This is often discovered only when the runner runs out of disk space and builds start failing.

What to do: Monitor build cache size. Implement regular cache pruning. Consider cache limits.
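
With BuildKit, the cache can be inspected and capped directly; the size limit is illustrative.

# Inspect BuildKit cache usage, then prune it down to a target size
docker buildx du
docker builder prune -f --keep-storage 10GB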


14. Missing security signals entirely

Most Docker monitoring is performance-focused. Security-relevant signals (privileged containers, socket mounts, capability additions) are invisible until a security incident.

What to do: Add security signals to monitoring. Audit privileged containers, sensitive mounts, and capability additions regularly.
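
Two quick audits for socket mounts and added capabilities, built from standard inspect fields:

# Containers with the Docker socket mounted
docker ps -q | xargs -r docker inspect --format '{{.Name}} {{range .Mounts}}{{.Source}} {{end}}' | grep docker.sock

# Containers with added capabilities (an empty list prints as [])
docker ps -q | xargs -r docker inspect --format '{{.Name}} {{.HostConfig.CapAdd}}' | grep -v '\[\]$'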


15. No fleet-wide version tracking

Docker versions drift across hosts. This causes inconsistent behavior, API incompatibilities, and vulnerability exposure. Most teams don’t track versions systematically.

What to do: Monitor Docker versions across all hosts. Alert on version drift. Track CVEs for Docker versions.
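
Per-host collection is one command; feed it through whatever fleet tooling you already have (ssh loop, Ansible, etc.):

# Engine and API version for this host
docker version --format 'server={{.Server.Version}} api={{.Server.APIVersion}} os={{.Server.Os}}/{{.Server.Arch}}'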