PLAYBOOK: Monitoring Docker

SECTION 0 — Operator’s Mental Model

Docker is not a single thing you monitor — it is a stack of interdependent components, each with its own failure modes. Understanding this stack is essential to reasoning about any signal.

THE DOCKER STACK (bottom-up):

  1. runc — The low-level OCI runtime that actually creates containers by wiring together Linux namespaces, cgroups, and filesystem mounts. runc is invoked per operation and exits once the container process is running; a per-container shim (containerd-shim) remains to supervise it. If runc hangs during a create or exec, that operation is stuck.

  2. containerd — A daemon that manages the complete container lifecycle: pulling images, creating containers, starting/executing/stopping them. It calls runc for actual execution. containerd maintains the container state database. If containerd crashes, running containers survive but become unmanageable until containerd restarts and reconnects.

  3. dockerd (Docker Daemon) — The user-facing API server. It translates docker CLI commands into containerd gRPC calls. It also handles volume management, network management (via libnetwork), image building, and the HTTP API. dockerd is stateful — its database tracks container metadata, network configurations, and volume mappings. If dockerd hangs, all management operations hang; running containers continue but you cannot inspect, stop, or create new ones.

  4. Storage Driver — Typically overlay2, but could be devicemapper, btrfs, zfs, or aufs. This layer manages the copy-on-write filesystem that gives each container its apparent root filesystem. It consumes disk space for image layers, container writable layers, and metadata. Storage driver health directly impacts container I/O performance and disk exhaustion risk.

  5. libnetwork / CNM — Docker’s network stack. It manages bridges, veth pairs, iptables rules, DNS resolution via embedded DNS server, and network namespaces. Each container gets a veth pair bridged to docker0 (or custom networks). Network misconfigurations manifest as connectivity loss, DNS failures, or port conflicts.
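
Each layer can be probed independently from the host. A minimal sketch, assuming a systemd host where dockerd and containerd run as the docker and containerd units and runc is on the PATH:

# Per-layer health pass (unit names and paths are assumptions for a typical systemd install)
systemctl is-active docker containerd        # dockerd and containerd service state
pgrep -x dockerd >/dev/null && echo "dockerd process present"
pgrep -x containerd >/dev/null && echo "containerd process present"
runc --version                               # the OCI runtime that actually creates containers
docker info --format 'storage: {{.Driver}}  logging: {{.LoggingDriver}}'
docker network ls                            # networks managed by libnetwork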

WHAT DOCKER IS DOING AT ALL TIMES:

  • Event loop — dockerd processes API requests, container events, health checks, and internal housekeeping
  • Container supervision — monitoring container processes, collecting exit codes, restarting if configured
  • Log routing — capturing stdout/stderr from containers and routing to configured log drivers
  • Network maintenance — managing iptables rules, DNS resolution, load balancing for container networks
  • Image management — tracking layers, handling pulls/pushes, garbage collection of unused layers
  • Volume I/O — proxying filesystem operations from containers to mounted volumes

RESOURCES DOCKER COMPETES FOR:

  • Disk I/O — image pulls, container writes, log rotation, overlay operations. When starved: daemon operations stall, container I/O slows, builds hang.
  • Disk space — /var/lib/docker stores images, containers, volumes, logs. When starved: cannot pull images, containers fail to start, daemon may crash.
  • File descriptors — each container, network socket, and API connection uses FDs. When starved: cannot create containers, API becomes unresponsive.
  • Memory — image caching, container metadata, log buffering, network state. When starved: OOM killer may kill dockerd (catastrophic).
  • CPU — image extraction, overlay operations, log processing. When starved: slow daemon response, delayed health checks.
  • Network ports — port binding for containers, API socket. When starved: port conflicts, failed container starts.
  • IP addresses — bridge network allocation. When starved: cannot create containers on the default network.
  • iptables rules — DNAT for port publishing, network isolation. When starved: rule table exhaustion, network failures.
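
These contention points can be spot-checked from the host; a rough sketch, assuming the default /var/lib/docker data root and the sysstat package for iostat:

# Quick spot-check of the resources above
df -h /var/lib/docker                                           # disk space
iostat -x 1 3                                                   # disk I/O utilization
sudo ls /proc/$(pgrep -x dockerd)/fd | wc -l                    # file descriptors held by dockerd
sudo cat /proc/$(pgrep -x dockerd)/limits | grep "open files"   # FD limit for the daemon
free -h                                                         # memory headroom
ss -ltn                                                         # listening TCP ports
sudo iptables -t nat -S DOCKER | wc -l                          # DNAT rules for published ports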

CHARACTERISTIC FAILURE ARCHETYPES:

  1. “The daemon wedged” — dockerd becomes unresponsive while containers keep running. Cannot inspect, stop, or create. Often caused by storage driver hangs or deadlocks in internal state management.

  2. “The disk filled silently” — /var/lib/docker grows until exhaustion. Sources: unbounded container logs, dangling images, orphaned volumes, build cache accumulation.

  3. “The container death spiral” — A container repeatedly crashes and restarts (if restart policy permits), consuming resources, flooding logs, potentially masking the root cause.

  4. “The network black hole” — DNS resolution fails inside containers, or inter-container networking breaks due to iptables corruption, DNS server issues, or bridge misconfiguration.

  5. “The zombie apocalypse” — Containers in “dead” or “removing” state that cannot be cleaned up, often after daemon crashes during container removal. They consume resources and block names/IDs.

  6. “The resource leak” — File descriptors, IP addresses, or network namespaces leak over time, eventually hitting system limits.

  7. “The OOM cascade” — A container exceeds its memory limit, OOM killer acts, but the workload is stateful or causes dependent services to fail. If dockerd itself is OOM killed, all management capability is lost.

  8. “The image layer corruption” — Image layers become corrupted on disk, causing container starts to fail with cryptic errors.
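
A first-pass triage sketch that maps onto these archetypes (the alpine image and the external DNS name are arbitrary placeholders):

# Archetype triage: daemon, disk, crash loops, zombies, DNS, OOM
timeout 5 curl -s --unix-socket /var/run/docker.sock http://localhost/_ping >/dev/null || echo "daemon wedged or down"
df -h /var/lib/docker | tail -1                                  # disk filled silently
docker ps -a --filter status=restarting                         # container death spiral
docker ps -a --filter status=dead --filter status=removing      # zombie containers
docker run --rm alpine nslookup docker.com || echo "possible network/DNS black hole"
sudo dmesg -T | grep -i "killed process" | tail -5               # recent OOM kills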

DEPLOYMENT VARIANTS THAT CHANGE MONITORING:

  • Rootless Docker — Daemon runs as non-root user. Different filesystem paths (~/.local/share/docker), different resource limits, cannot bind privileged ports (<1024). Many metrics still accessible but paths change.

  • Docker in Docker (DinD) — Runs inside a container with Docker socket or daemon. Adds a layer of complexity; disk exhaustion in outer container kills inner daemon; signal propagation is complex.

  • Rootful vs rootless containers — Rootful containers have more capabilities and thus more failure modes and security exposure.

  • Storage driver differences — overlay2 is most common, but devicemapper has different failure modes (thin pool exhaustion), zfs/btrfs have their own storage pool management.

  • Log driver configuration — json-file (default) causes disk pressure; journald has different failure modes; gcplogs/awslogs/etc. depend on external services.


SECTION 1 — Signal Catalog


AVAILABILITY DOMAIN


SIGNAL: Docker Daemon Process Health

WHAT IT IS: Whether the dockerd process is running and responsive to API requests. This is the most fundamental signal — if dockerd is down or hung, all container management is impossible.

SOURCE:

  • Process: dockerd (PID typically visible in ps aux | grep dockerd)
  • Unix socket: /var/run/docker.sock (or /run/docker.sock)
  • HTTP API: the Unix socket above, or TCP port 2375 (plaintext) / 2376 (TLS) if configured

HOW TO COLLECT IT MANUALLY:

# Check process is running
pgrep -x dockerd && echo "ALIVE" || echo "DOWN"

# Check daemon responsiveness via socket
docker info > /dev/null 2>&1 && echo "RESPONSIVE" || echo "UNRESPONSIVE"

# Direct socket probe (no docker CLI needed)
curl --unix-socket /var/run/docker.sock http://localhost/_ping
# Returns "OK" if daemon is responsive

# Query daemon version info via the socket
curl --unix-socket /var/run/docker.sock http://localhost/version
# Returns JSON with daemon version info
# (if the API is exposed over TCP, production setups typically use TLS on port 2376 with client certificates)

WHAT IT TELLS YOU: If process is gone, daemon crashed or was killed. If process exists but socket is unresponsive, daemon is hung (storage deadlock, internal panic, or blocked I/O). A hung daemon is worse than a crashed one — containers keep running but you cannot manage them, and the daemon cannot be gracefully recovered without potentially affecting running containers.

SEVERITY:

  • PAGE: Process missing OR socket unresponsive (either means immediate operational impact)
  • TICKET: Process exists but response time > 5 seconds (indicates daemon stress)
  • INFO: Normal operation

THRESHOLDS:

  • Binary: process must exist AND socket must respond
  • Response time: socket should respond within 1 second under normal conditions
  • Any failure to respond within 30 seconds indicates a hang requiring intervention
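
These thresholds translate directly into a watchdog script; a minimal sketch (socket path and exit codes are assumptions — adjust for rootless Docker):

SOCK=/var/run/docker.sock
pgrep -x dockerd >/dev/null || { echo "PAGE: dockerd process missing"; exit 2; }
start=$(date +%s%N)
timeout 30 curl -sf --unix-socket "$SOCK" http://localhost/_ping >/dev/null \
  || { echo "PAGE: no /_ping response within 30s (probable hang)"; exit 2; }
ms=$(( ( $(date +%s%N) - start ) / 1000000 ))
[ "$ms" -gt 5000 ] && { echo "TICKET: /_ping took ${ms}ms"; exit 1; }
echo "OK: /_ping answered in ${ms}ms"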

FAILURE MODES DETECTED:

  • Daemon crash (process termination)
  • Daemon hang/deadlock (process exists, no response)
  • Storage driver unresponsiveness (hangs on I/O)
  • Socket file deletion or corruption

NUANCES & GOTCHAS:

  • Socket file may exist briefly after daemon death; always probe the socket, don’t just check file existence
  • Daemon may be slow to respond during heavy operations (image pulls, builds) — distinguish transient slowness from hang
  • In DinD setups, outer container health doesn’t guarantee inner daemon health
  • Rootless Docker uses a different socket path: $XDG_RUNTIME_DIR/docker.sock (typically /run/user/<uid>/docker.sock)

CORRELATES WITH:

  • Docker Daemon Response Latency — if latency is climbing before failure, indicates progressive stress
  • Host Disk I/O Utilization — high I/O often precedes daemon hangs
  • Docker Daemon Goroutine Count — rapid growth may indicate deadlock forming

SIGNAL: Docker Daemon Response Latency

WHAT IT IS: The time it takes for the Docker daemon to respond to API requests. This measures daemon processing overhead and system load impact on Docker operations.

SOURCE:

  • Docker API endpoints via socket
  • Any simple query like docker version, docker info, or /_ping

HOW TO COLLECT IT MANUALLY:

# Time a simple API call
time docker version > /dev/null 2>&1

# More precise measurement
start=$(date +%s%N); docker info > /dev/null 2>&1; end=$(date +%s%N)
echo "Latency: $(( (end - start) / 1000000 )) ms"

# Using curl directly
time curl --unix-socket /var/run/docker.sock http://localhost/_ping

WHAT IT TELLS YOU: Rising latency indicates the daemon is under stress — heavy I/O, many concurrent operations, or internal lock contention. If latency exceeds several seconds, container management operations (start, stop, logs) will be noticeably delayed, and automated systems (health checks, orchestration) may time out.

SEVERITY:

  • TICKET: Latency > 2 seconds sustained over 5 minutes
  • PLAN: Latency > 500ms sustained over 15 minutes (early warning)
  • INFO: Baseline tracking (typically <100ms on healthy systems)

THRESHOLDS:

  • Normal: < 100ms for simple queries
  • Degraded: > 500ms indicates daemon stress
  • Critical: > 5 seconds indicates severe contention or approaching hang

FAILURE MODES DETECTED:

  • Daemon overload from too many concurrent operations
  • Storage driver I/O bottleneck
  • Internal lock contention (database, state management)
  • Impending daemon hang

NUANCES & GOTCHAS:

  • First call after daemon start may be slower (warmup)
  • Image-related operations take much longer; use simple queries like /_ping for consistent measurement
  • Latency naturally spikes during large image pulls or intensive builds
  • Container count affects list-operation latency; hundreds of containers cause measurable slowdown

CORRELATES WITH:

  • Container Count — more containers means more internal state to traverse
  • Host Disk I/O Utilization — I/O contention directly impacts daemon latency
  • Docker Daemon Goroutine Count — high goroutine count with high latency suggests thread starvation

SIGNAL: Container Count by State

WHAT IT IS: The number of containers in each state: running, paused, exited/stopped, dead. This provides a snapshot of workload health and identifies stuck containers.

SOURCE:

  • Docker API: GET /containers/json?all=true
  • Command: docker ps -a --format '{{.State}}'

HOW TO COLLECT IT MANUALLY:

# Count by state
docker ps -a --format '{{.State}}' | sort | uniq -c

# Or via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true" | \
  jq -r '.[].State' | sort | uniq -c

# JSON with full breakdown
docker ps -a --format '{{json .}}' | jq -s 'group_by(.State) | map({state: .[0].State, count: length})'

WHAT IT TELLS YOU: Running count indicates active workload. High exited count with few running may indicate crash loops or workload completion. Dead containers indicate failed cleanup — they consume resources and cannot be removed normally. Paused containers are intentionally frozen but consume disk space.

SEVERITY:

  • PAGE: Any containers in “dead” state (indicates failed removal requiring intervention)
  • TICKET: Rapidly growing exited count (>50% increase in 1 hour without corresponding job completions)
  • PLAN: Paused containers accumulating without cleanup policy
  • INFO: Normal state distribution tracking

THRESHOLDS:

  • Dead containers: any nonzero is abnormal
  • Exited containers: compare to historical baseline; sudden growth indicates problems
  • Running containers: track against capacity limits
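
A minimal check built on these thresholds (alert delivery left to the caller):

# Page on dead containers, then print the state distribution
dead=$(docker ps -aq --filter status=dead | wc -l)
[ "$dead" -gt 0 ] && echo "PAGE: $dead container(s) in dead state"
docker ps -a --format '{{.State}}' | sort | uniq -c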

FAILURE MODES DETECTED:

  • Dead containers: daemon crash during container removal, resource cleanup failure
  • Exited container accumulation: crash loops, missing cleanup jobs, disk space consumption
  • No running containers: workload failure or intentional shutdown

NUANCES & GOTCHAS:

  • Exited containers may be intentional (batch jobs, one-off tasks) — correlate with workload type
  • Dead containers cannot be removed with docker rm alone; may require manual cleanup of /var/lib/docker/containers entries
  • Container count directly affects daemon API response times for list operations
  • In orchestrated environments (Swarm, K8s), the orchestrator manages container lifecycle — exited containers may be expected

CORRELATES WITH:

  • Container Restart Count — high restarts + high exited = crash loop
  • Docker Disk Usage — exited containers consume space in /var/lib/docker
  • Log Volume — exited containers may leave behind large log files

SIGNAL: Container Restart Count

WHAT IT IS: The number of times a container has been restarted due to crashing or being killed. This signal identifies unstable workloads before they cause broader impact.

SOURCE:

  • Docker API: GET /containers/{id}/json
  • Inspect field: .RestartCount

HOW TO COLLECT IT MANUALLY:

# Check restart count for specific container
docker inspect --format '{{.RestartCount}}' <container_id>

# List all containers with restart counts > 0 (RestartCount is not a `docker ps` format field)
docker ps -aq | xargs -r docker inspect --format '{{.Id}} {{.RestartCount}} {{.Name}}' | \
  awk '$2 > 0 {print}'

# Via API: RestartCount is only in the inspect endpoint, not the list endpoint
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true" | jq -r '.[].Id' | while read id; do
    curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | \
      jq -r 'select(.RestartCount > 0) | "\(.Id[:12]) \(.RestartCount) \(.Name)"'
  done

WHAT IT TELLS YOU: Any nonzero restart count means the container crashed or was killed and Docker restarted it (if restart policy permits). Rising restart counts indicate an unstable application, resource exhaustion, or configuration problem. Frequent restarts waste resources, flood logs, and may indicate a workload that cannot run successfully.

SEVERITY:

  • PAGE: Restart count increasing by >5 in 10 minutes for any container
  • TICKET: Restart count > 3 for any container in last hour
  • PLAN: Any container with restart count > 0 tracked over time
  • INFO: Baseline restart patterns for known-unstable services

THRESHOLDS:

  • Normal: restart count = 0 or stable (intentional restarts)
  • Warning: restart count increasing faster than 1/hour for sustained period
  • Critical: restart count increasing faster than 1/minute (crash loop)
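
Because RestartCount is cumulative, crash-loop detection needs two samples; a sketch using a snapshot file (the file path and the page threshold of 5 are assumptions):

# Compare restart counts against the previous run; flag jumps of 5 or more
SNAP=/tmp/docker-restarts.prev
docker ps -q | while read id; do
  echo "$id $(docker inspect --format '{{.RestartCount}}' "$id")"
done | sort > /tmp/docker-restarts.now
[ -f "$SNAP" ] && join /tmp/docker-restarts.now "$SNAP" | \
  awk '$2 - $3 >= 5 {print "PAGE: container " $1 " restarted " $2 - $3 " times since last sample"}'
mv /tmp/docker-restarts.now "$SNAP"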

FAILURE MODES DETECTED:

  • Application crash (code bugs, unhandled errors)
  • OOM kill (memory limit exceeded)
  • Health check failure (if configured with restart on unhealthy)
  • Resource starvation (CPU throttling causing timeout)
  • Dependency failure (container dies when required service unavailable)

NUANCES & GOTCHAS:

  • Restart count persists across daemon restart — it’s stored in container metadata
  • Manual restarts (docker restart) increment the count — distinguish manual vs automatic
  • “Always” restart policy will restart even manually stopped containers after daemon restart
  • Container with restart policy “no” will show 0 restarts regardless of crash frequency

CORRELATES WITH:

  • Container Exit Codes — restart + nonzero exit indicates crash pattern
  • Container OOM Killed — restart + OOM indicates memory exhaustion
  • Daemon Memory/Disk Pressure — restarts during resource pressure may indicate starvation

SIGNAL: Container Exit Codes

WHAT IT IS: The exit code of the container’s main process. Exit codes indicate why a container stopped and are essential for distinguishing crashes from graceful shutdowns.

SOURCE:

  • Docker API: GET /containers/{id}/json
  • Inspect field: .State.ExitCode

HOW TO COLLECT IT MANUALLY:

# Exit code for specific container
docker inspect --format '{{.State.ExitCode}}' <container_id>

# All containers with nonzero exit codes
docker ps -a --format '{{.ID}} {{.State}} {{.Names}}' | \
  while read id state name; do
    exit_code=$(docker inspect --format '{{.State.ExitCode}}' "$id")
    [ "$exit_code" != "0" ] && echo "$id $exit_code $state $name"
  done

# Via API: the list endpoint does not expose .State.ExitCode, so inspect each exited container
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true&filters={\"status\":[\"exited\"]}" | \
  jq -r '.[].Id' | while read id; do
    curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | \
      jq -r '"\(.Id[:12]) ExitCode:\(.State.ExitCode) \(.Name)"'
  done

WHAT IT TELLS YOU: Exit code 0 = graceful exit (successful completion or intentional stop). Exit code 1 = application error. Exit code 137 = SIGKILL (often OOM). Exit code 139 = segfault. Exit code 143 = SIGTERM (normal stop signal). Understanding exit codes enables proper alerting and incident classification.

SEVERITY:

  • TICKET: Exit code 1 (application error) for any production container
  • TICKET: Exit code 139 (segfault) — indicates serious application bug
  • PLAN: Exit code 137 without OOM indication (may need memory tuning)
  • INFO: Exit code 0 or 143 (normal shutdown)

THRESHOLDS:

  • Exit 0: Normal/expected
  • Exit 1: Application error — needs investigation
  • Exit 137: SIGKILL received — investigate OOM or external kill
  • Exit 139: Segmentation fault — application bug
  • Exit 143: SIGTERM — usually normal (orchestration, manual stop)
  • Other nonzero: Application-specific, needs documentation
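
Codes above 128 follow the 128+signal convention, which a small helper can decode; a sketch (the function name is hypothetical):

decode_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL - check OOMKilled" ;;
    139) echo "SIGSEGV - application bug" ;;
    143) echo "SIGTERM - normal stop" ;;
    *)   [ "$1" -gt 128 ] && echo "killed by signal $(( $1 - 128 ))" || echo "application exit code $1" ;;
  esac
}
decode_exit "$(docker inspect --format '{{.State.ExitCode}}' <container_id>)"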

FAILURE MODES DETECTED:

  • Application unhandled exceptions (exit 1)
  • Memory exhaustion/OOM kill (exit 137)
  • Memory corruption/segfault (exit 139)
  • Hard timeout kills (exit 137 from external)
  • Graceful shutdown (exit 143)

NUANCES & GOTCHAS:

  • Exit code 137 can be OOM kill OR external SIGKILL — check OOMKilled field to distinguish
  • Exit code 143 (SIGTERM) is normal in orchestrated environments during scaling/deployments
  • Custom exit codes are application-defined; document what your applications use
  • Exit codes are limited to 8 bits (0–255); an application exiting with a larger value wraps around (e.g., exit 256 is reported as 0)

CORRELATES WITH:

  • Container OOM Killed — confirms memory as cause for exit 137
  • Container Restart Count — exit code + restarts indicates crash pattern
  • Application Logs — for root cause of exit 1

SIGNAL: Container OOM Killed Status

WHAT IT IS: A boolean indicating whether the container was killed by the OOM (Out of Memory) killer. Critical for distinguishing memory exhaustion from other causes of container death.

SOURCE:

  • Docker API: GET /containers/{id}/json
  • Inspect field: .State.OOMKilled

HOW TO COLLECT IT MANUALLY:

# Check specific container
docker inspect --format '{{.State.OOMKilled}}' <container_id>

# Find all OOM-killed containers
for c in $(docker ps -aq); do
  oom=$(docker inspect --format '{{.State.OOMKilled}}' "$c")
  [ "$oom" = "true" ] && echo "$c was OOM killed"
done

# Via API: OOMKilled is only in the inspect endpoint, not the list endpoint
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true" | jq -r '.[].Id' | while read id; do
    curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | \
      jq -r 'select(.State.OOMKilled == true) | .Id[:12]'
  done

WHAT IT TELLS YOU: When true, the container exceeded its memory limit and the kernel OOM killer terminated it. This indicates either: memory limit is too low for the workload, the application has a memory leak, or the workload experienced an abnormal memory spike. OOM kills cause data loss for in-memory state and may cause cascading failures in dependent services.

SEVERITY:

  • PAGE: OOM killed = true for stateful/critical production containers
  • TICKET: OOM killed = true for any production container
  • PLAN: Repeated OOM kills for same container (needs memory tuning)

THRESHOLDS:

  • Any true value in production is abnormal and requires investigation
  • Development/test containers may have intentionally low limits

FAILURE MODES DETECTED:

  • Memory limit undersized for workload
  • Application memory leak
  • Memory spike from abnormal input/load
  • JVM/container memory mismatch (heap + metaspace + overhead > limit)

NUANCES & GOTCHAS:

  • OOMKilled is set at container death; it may be reset if container restarts
  • Containers without memory limits can still be OOM killed if system memory is exhausted
  • JVM applications need careful tuning: heap + metaspace + code cache + native overhead must fit within limit
  • OOM kills don’t always mean the guilty container was killed — the kernel may kill any process in the cgroup

CORRELATES WITH:

  • Container Memory Usage — approaching limit before OOM is leading indicator
  • Container Exit Codes — exit 137 + OOMKilled confirms memory cause
  • Host Memory Pressure — system-wide OOM may kill containers without per-container limits

THROUGHPUT DOMAIN


SIGNAL: Container Operations Rate

WHAT IT IS: The rate of container lifecycle operations: creates, starts, stops, removes, and dies. This measures the velocity of container churn on the host.

SOURCE:

  • Docker events API: GET /events
  • Event types: create, start, stop, die, destroy

HOW TO COLLECT IT MANUALLY:

# Stream events in real-time
docker events --filter 'type=container' --format '{{.Action}} {{.Actor.ID}}'

# Sample the event stream via the API for a bounded window (timeout or Ctrl-C ends it)
timeout 60 curl -s -N --unix-socket /var/run/docker.sock \
  "http://localhost/events?filters={\"type\":[\"container\"]}"

# Or count recent events by action over a fixed window
docker events --since 1h --until 0s --filter 'type=container' --format '{{.Action}}' | \
  sort | uniq -c

WHAT IT TELLS YOU: High container operation rates indicate dynamic workloads (CI/CD, batch jobs, serverless-on-containers). Excessive churn causes daemon stress, disk pressure (image layers, log files), and may indicate runaway processes or orchestration issues. Unusual patterns (many stops without starts) indicate workload problems.

SEVERITY:

  • TICKET: Operation rate >10x baseline sustained for >15 minutes
  • PLAN: Trending increase in operation rate over days (capacity planning)
  • INFO: Baseline operation patterns

THRESHOLDS:

  • Compare to historical baseline for the host
  • Normal varies by workload: CI runners may see 100s/hour; stable services may see 1/week
  • Any unexplained sudden spike warrants investigation

FAILURE MODES DETECTED:

  • Orchestration instability (repeated rescheduling)
  • Failed deployments (create → die loops)
  • Runaway processes creating containers
  • CI/CD queue backup clearing suddenly

NUANCES & GOTCHAS:

  • Events are ephemeral; if you’re not listening, you miss them
  • daemon restart resets event stream; some events may be lost
  • Rate calculation requires persistent counting over time windows
  • Differentiate user-initiated operations from daemon/orchestration-initiated

CORRELATES WITH:

  • Container Restart Count — high restarts + high operation rate = instability
  • Docker Daemon Response Latency — high churn often increases latency
  • Docker Disk Usage — high create rate without cleanup = disk growth

SIGNAL: Image Pull Rate

WHAT IT IS: The frequency of image pull operations. This measures dependency on external registries and can indicate deployment activity or configuration problems causing re-pulls.

SOURCE:

  • Docker events: events with action=pull
  • Registry API response times

HOW TO COLLECT IT MANUALLY:

# Monitor pull events
docker events --filter 'type=image' --filter 'event=pull' --format '{{.Time}} {{.Actor.Attributes.name}}'

# Count pulls in last hour
docker events --since 1h --until 0s --filter 'type=image' --filter 'event=pull' --format '.' | wc -l

WHAT IT TELLS YOU: High pull rates indicate active deployments or problems with image caching. If images are being re-pulled that should be cached, it indicates either image tag instability (latest tag always changes), cache invalidation, or disk cleanup removing cached layers. Pull failures block container starts.

SEVERITY:

  • TICKET: Pull rate significantly above deployment frequency (indicates cache problems)
  • TICKET: Any pull failures in production
  • PLAN: Trending increase in pull rate (may need local registry or larger cache)

THRESHOLDS:

  • Baseline depends on deployment frequency
  • More than 1 pull per unique deployment may indicate caching issue
  • Sustained pulls without corresponding new container creates = waste

FAILURE MODES DETECTED:

  • Image cache thrashing (images removed and re-pulled repeatedly)
  • Registry availability issues
  • Network connectivity problems
  • Tag instability (latest changes frequently)

NUANCES & GOTCHAS:

  • Pulling same image twice for different containers should hit cache; if not, cache is not working
  • Large image pulls can saturate network bandwidth
  • Registry rate limiting may cause pull failures during high-activity periods
  • Image digest vs tag pulls behave differently for caching

CORRELATES WITH:

  • Docker Disk Usage (Images) — high pulls may increase disk usage
  • Network Bandwidth — pulls consume bandwidth
  • Container Create Rate — creates should correlate with pulls for new images

LATENCY DOMAIN


SIGNAL: Container Start Latency

WHAT IT IS: The time from container create request to container running state. This includes image pull (if not cached), filesystem setup, and process start.

SOURCE:

  • Docker events: timestamp difference between ‘create’ and ‘start’ events
  • Docker API: container creation/start timestamps

HOW TO COLLECT IT MANUALLY:

# Time a container start
time docker run --rm alpine:latest echo "test"

# Measure start latency for a specific container via events
docker events --filter 'container=<id>' --format '{{.Time}} {{.Action}}'
# Calculate difference between create and start timestamps

# Via inspection: compare creation and start timestamps
docker inspect --format '{{.Created}} {{.State.StartedAt}}' <container_id>

WHAT IT TELLS YOU: High start latency impacts application scaling speed, deployment rollback time, and overall system responsiveness. Slow starts may be caused by: image pull time (large images, slow network), storage driver performance (overlay operations), host resource contention, or application initialization time.

SEVERITY:

  • TICKET: Start latency >30 seconds for any container
  • PLAN: Start latency trending upward over time
  • INFO: Baseline start times per image type

THRESHOLDS:

  • Small images (alpine, distroless): should start in <2 seconds (excluding app init)
  • Large images (>1GB): 10-60 seconds depending on cache status
  • Any start >60 seconds indicates problem (unless expected for image size)
  • Compare to baseline for each image type
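
Per-container latency can be derived from inspect timestamps; a sketch assuming GNU date (second resolution, and it excludes any pull that happened before create):

created=$(docker inspect --format '{{.Created}}' <container_id>)
started=$(docker inspect --format '{{.State.StartedAt}}' <container_id>)
echo "start latency: $(( $(date -d "$started" +%s) - $(date -d "$created" +%s) ))s"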

FAILURE MODES DETECTED:

  • Large unoptimized images causing slow pulls
  • Storage driver performance degradation
  • Network/registry issues causing slow pulls
  • Resource contention on host
  • Application slow initialization

NUANCES & GOTCHAS:

  • First start of an image includes pull time; subsequent starts use cache
  • Container start latency is different from application ready time — app may take longer to become functional
  • Health check grace period should account for start latency
  • Very slow starts may trigger health check failures before app is ready

CORRELATES WITH:

  • Image Size — larger images have longer start latency
  • Host Disk I/O — high I/O slows overlay operations during start
  • Storage Driver Performance — slow overlay setup during start directly lengthens start latency
  • Network Latency (to registry) — affects pull time

ERRORS DOMAIN


SIGNAL: Docker Daemon Errors in Logs

WHAT IT IS: Error-level messages in the Docker daemon logs indicating internal failures, misconfigurations, or operational problems.

SOURCE:

  • Journal: journalctl -u docker (for systemd-managed Docker)
  • Log file: /var/log/docker.log (depending on configuration)
  • Daemon stderr/stdout

HOW TO COLLECT IT MANUALLY:

# View recent daemon errors
journalctl -u docker.service -p err --since "1 hour ago"

# Watch for errors in real-time
journalctl -u docker.service -p err -f

# Search for specific error patterns
journalctl -u docker.service --since "1 day ago" | grep -iE "(error|fatal|panic|fail)"

# Count daemon log entries by level
journalctl -u docker.service --since "1 hour ago" | grep -oP '(?<=level=)\w+' | sort | uniq -c

WHAT IT TELLS YOU: Daemon errors indicate problems that may affect container operations. Common error types include: storage driver failures, network setup errors, image layer corruption, API errors, and internal panics. A sudden increase in error rate often precedes or accompanies operational problems.

SEVERITY:

  • PAGE: Any panic/fatal in daemon logs
  • PAGE: Errors indicating data corruption or unrecoverable state
  • TICKET: Any error rate increase above baseline
  • PLAN: Warnings trending upward

THRESHOLDS:

  • Any panic or fatal: immediate investigation
  • >10 errors/hour sustained: needs investigation (adjust based on baseline)
  • Error rate increase >2x over baseline: early warning
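
A sketch of an hourly check against these numbers, assuming the default logfmt-style daemon log output:

errors=$(journalctl -u docker.service --since "1 hour ago" --no-pager | grep -c 'level=error')
panics=$(journalctl -u docker.service --since "1 hour ago" --no-pager | grep -cE 'panic|level=fatal')
[ "$panics" -gt 0 ] && echo "PAGE: $panics panic/fatal entries in the last hour"
[ "$errors" -gt 10 ] && echo "TICKET: $errors error-level entries in the last hour"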

FAILURE MODES DETECTED:

  • Storage driver corruption
  • Network configuration failures
  • Image layer corruption
  • Daemon internal errors
  • API handler failures
  • Resource exhaustion

NUANCES & GOTCHAS:

  • Some errors are transient and may not indicate ongoing problems
  • Log format varies by Docker version and configuration
  • Daemon restart causes many “normal” errors during state recovery
  • Some errors are in library dependencies (containerd, runc) and may have different formats

CORRELATES WITH:

  • Container Operations Rate — errors during high operation rate may indicate overload
  • Docker Daemon Response Latency — errors + latency often correlate
  • Host Resource Metrics — errors during resource pressure indicate causation

SIGNAL: Container Creation Failures

WHAT IT IS: Failed attempts to create containers, indicating image problems, resource constraints, or configuration errors.

SOURCE:

  • Docker events: create events with error field
  • Docker API: POST /containers/create returns error response
  • Daemon logs

HOW TO COLLECT IT MANUALLY:

# Attempt container creation and capture error
docker create <image> 2>&1 || echo "CREATE_FAILED"

# Monitor creation failures in daemon logs
journalctl -u docker.service | grep -i "failed to create"

# Via API (example failed create)
curl -s --unix-socket /var/run/docker.sock \
  -X POST "http://localhost/containers/create" \
  -H "Content-Type: application/json" \
  -d '{"Image":"nonexistent"}' | jq .

# Note: failed creates do not emit container events, so rely on the API error
# response and daemon log entries above rather than the event stream

WHAT IT TELLS YOU: Creation failures block deployments and scaling. Common causes: image not found (missing pull), image pull failure, invalid configuration, resource constraints (disk space, memory), port conflicts, name conflicts, and volume mount failures.

SEVERITY:

  • TICKET: Any creation failure for production workload
  • PAGE: Creation failure rate >10% of attempts sustained
  • PLAN: Occasional failures in development (expected for some scenarios)

THRESHOLDS:

  • Normal: near 0% failure rate for production workloads
  • >5% failure rate sustained: needs investigation
  • Any failure for critical service: immediate attention

FAILURE MODES DETECTED:

  • Missing images (not pulled)
  • Invalid container configuration
  • Resource exhaustion (disk, memory, FDs)
  • Port conflicts
  • Name conflicts
  • Volume mount failures
  • Network attachment failures

NUANCES & GOTCHAS:

  • Creation failure doesn’t always log clearly; check both API response and daemon logs
  • Some failures are expected in CI/CD (testing failure scenarios)
  • Name conflicts from previous containers not cleaned up
  • Image config may be invalid in ways that only manifest at create time

CORRELATES WITH:

  • Docker Disk Usage — disk exhaustion causes creation failures
  • Image Pull Failures — missing images cause creation failures
  • Container Count — name conflicts more likely with many containers

SATURATION DOMAIN


SIGNAL: Docker Disk Usage (System-Wide)

WHAT IT IS: Total disk space consumed by Docker: images, containers, volumes, and build cache. This is the top-level view of Docker’s disk footprint.

SOURCE:

  • Command: docker system df
  • Docker API: GET /system/df

HOW TO COLLECT IT MANUALLY:

# Human-readable summary
docker system df

# Verbose breakdown
docker system df -v

# JSON output for parsing
docker system df --format '{{json .}}'

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq .

# Raw directory size (fallback if API unavailable)
sudo du -sh /var/lib/docker/

WHAT IT TELLS YOU: Docker disk usage grows over time if not managed. Images accumulate (old versions), exited containers leave behind writable layers and logs, volumes accumulate orphaned data, and build cache grows. When /var/lib/docker fills, Docker cannot function — cannot pull images, cannot create containers, may crash the daemon.

SEVERITY:

  • PAGE: Usage >90% of /var/lib/docker partition
  • TICKET: Usage >75% or growing faster than 1GB/day
  • PLAN: Usage >50% (capacity planning)
  • INFO: Baseline tracking

THRESHOLDS:

  • Compare to total space available on /var/lib/docker partition
  • Warning at 70% used
  • Critical at 85% used
  • Monitor growth rate: >5GB/day sustained requires cleanup
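
A sketch of the partition check behind these thresholds, assuming GNU df and the default data root:

use=$(df --output=pcent /var/lib/docker | tail -1 | tr -dc '0-9')
if   [ "$use" -ge 85 ]; then echo "CRITICAL: /var/lib/docker partition at ${use}%"
elif [ "$use" -ge 70 ]; then echo "WARNING: /var/lib/docker partition at ${use}%"
else echo "OK: /var/lib/docker partition at ${use}%"
fi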

FAILURE MODES DETECTED:

  • Image accumulation (no cleanup of old versions)
  • Container log growth (unbounded logs)
  • Orphaned volumes (no automatic cleanup)
  • Build cache bloat
  • General disk exhaustion

NUANCES & GOTCHAS:

  • docker system df shows reclaimable space, not just used space
  • Some storage drivers (overlay2) may not report exact reclaimable due to layer sharing
  • Build cache can consume significant space on CI runners
  • Volume data is tracked separately and survives container removal; prune operations leave volumes alone unless explicitly told otherwise
  • Running docker system prune can recover space but may be destructive

CORRELATES WITH:

  • Host Disk Usage — Docker disk usage contributes to host usage
  • Container Count — more containers = more disk usage
  • Image Count — more images = more disk usage
  • Log Driver Configuration — json-file logs stored in container directories

SIGNAL: Docker Disk Usage by Images

WHAT IT IS: Disk space consumed by container images. This is often the largest component of Docker disk usage.

SOURCE:

  • Command: docker system df (Images line)
  • Docker API: GET /system/df (Images field)

HOW TO COLLECT IT MANUALLY:

# Image disk usage summary
docker system df | grep Images

# Detailed image sizes
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq '.Images[] | {Repository: .Repository, Size: .Size}'

# Sort images by size
docker images --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | sort -hr | head -20

WHAT IT TELLS YOU: Image disk usage reflects how many images are cached and their sizes. Large images, old image versions, and rarely-used images waste disk space. High image usage with low active usage indicates cleanup is needed.

SEVERITY:

  • TICKET: Image usage >50GB or growing without corresponding workload increase
  • PLAN: Largest images should be reviewed for optimization
  • INFO: Baseline image footprint

THRESHOLDS:

  • Depends on available disk and workload needs
  • Compare active images (in use by running containers) to total images
  • If active/total ratio <20%, cleanup needed
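
A rough sketch of the active/total ratio above (counts images, not bytes):

total=$(docker images -q | sort -u | wc -l)
active=$(docker ps -aq | xargs -r docker inspect --format '{{.Image}}' | sort -u | wc -l)
echo "images in use by containers: $active of $total"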

FAILURE MODES DETECTED:

  • Image accumulation without cleanup
  • Bloated images (unnecessary files, wrong base image)
  • Duplicate images (different tags, same content)
  • Unused development/test images

NUANCES & GOTCHAS:

  • Shared layers mean total image sizes may not sum correctly
  • <none> images (dangling) are usually safe to remove
  • Some images may be base layers for others — removal cascades
  • Registry mirrors/local registries may pre-cache images

CORRELATES WITH:

  • Docker Disk Usage (Total) — images often largest component
  • Image Pull Rate — high pulls may increase image usage
  • Container Count — more unique running images = more image storage

SIGNAL: Docker Disk Usage by Containers

WHAT IT IS: Disk space consumed by container writable layers. Each running or stopped container has a writable layer that consumes disk space.

SOURCE:

  • Command: docker system df (Containers line)
  • Docker API: GET /system/df (Containers field)

HOW TO COLLECT IT MANUALLY:

# Container disk usage summary
docker system df | grep Containers

# Container sizes (including writable layer)
docker ps -a --size --format 'table {{.ID}}\t{{.Names}}\t{{.Size}}'

# Via API (container sizes require additional call)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json?all=true&size=true" | \
  jq '.[] | {Names: .Names[0], SizeRw: .SizeRw, SizeRootFs: .SizeRootFs}'

# Writable layer sizes only
docker ps -a --size --format '{{.ID}} {{.Size}}' | grep -v '0B'

WHAT IT TELLS YOU: Container disk usage reflects: number of containers, size of writable layers (how much the container has written), and log file sizes (for json-file log driver). Growing container usage indicates containers writing data or log accumulation.

SEVERITY:

  • TICKET: Container disk usage growing without cleanup
  • TICKET: Individual container writable layer >10GB (may indicate log/file bloat)
  • PLAN: Track growth trend for capacity planning

THRESHOLDS:

  • Normal: each container’s writable layer <1GB (depends on workload)
  • Warning: individual container >5GB or total growing >1GB/day
  • Cleanup: many stopped containers with nontrivial sizes

FAILURE MODES DETECTED:

  • Containers writing large amounts to their writable layer (logs, temp files)
  • Exited containers accumulating without cleanup
  • Log files growing (json-file driver)
  • Memory-heavy containers writing to tmpfs

NUANCES & GOTCHAS:

  • Reported size is the writable layer; json-file logs live under /var/lib/docker/containers/<id>/ and are not counted in it
  • SizeRw is writable layer only; SizeRootFs includes image layers
  • Stopped containers still consume disk space
  • Containers with volume mounts don’t count volume data in container size

CORRELATES WITH:

  • Container Count — more containers = more potential disk usage
  • Log Configuration — json-file logs stored in container directory
  • Docker Disk Usage (Total) — containers contribute to total

SIGNAL: Docker Disk Usage by Volumes

WHAT IT IS: Disk space consumed by Docker volumes. Volumes persist data independently of containers and can grow without bound if not monitored.

SOURCE:

  • Command: docker system df (Volumes line, v1.13+)
  • Command: docker volume ls + du inspection
  • Docker API: GET /system/df (Volumes field)
  • Filesystem: /var/lib/docker/volumes/

HOW TO COLLECT IT MANUALLY:

# Volume usage summary
docker system df | grep Volumes

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq '.Volumes[] | {Name: .Name, UsageData: .UsageData}'

# List volumes with sizes (requires inspection)
docker volume ls --format '{{.Name}}' | while read vol; do
  size=$(docker run --rm -v "$vol":/data alpine du -sh /data 2>/dev/null | cut -f1)
  echo "$vol: $size"
done

# Direct filesystem inspection
sudo du -sh /var/lib/docker/volumes/*

# Find largest volumes
sudo du -s /var/lib/docker/volumes/*/ | sort -n | tail -10

WHAT IT TELLS YOU: Volume usage reflects persistent data growth. Database volumes, log volumes, and data volumes can grow over time. Orphaned volumes (not attached to any container) waste space. Volume growth must be monitored for capacity planning.

SEVERITY:

  • TICKET: Volume usage >75% of available disk
  • TICKET: Rapid growth rate (>5GB/day) without explanation
  • PLAN: Volume growth trend for capacity planning
  • INFO: Baseline volume usage per service

THRESHOLDS:

  • Compare to disk space available
  • Growth rate depends on workload type (databases vs config volumes)
  • Orphaned volumes: any significant number is waste

FAILURE MODES DETECTED:

  • Database growth without limits
  • Log accumulation in mounted volumes
  • Orphaned volumes from deleted containers
  • Backup/snapshot volumes accumulating
  • Volume data corruption (can’t measure directly, but growth anomalies may indicate)

NUANCES & GOTCHAS:

  • Volumes are NOT automatically cleaned up by docker system prune unless the --volumes flag is passed
  • Named volumes vs anonymous volumes have different cleanup behaviors
  • Volume driver (local, NFS, cloud) affects how size is reported and measured
  • Some volumes may be mounted but not actively used (zombie data)

CORRELATES WITH:

  • Docker Disk Usage (Total) — volumes often largest persistent usage
  • Container Count — orphaned volumes when containers removed
  • Application-specific metrics (database size, etc.)

SIGNAL: Docker Disk Usage by Build Cache

WHAT IT IS: Disk space consumed by Docker’s build cache, which stores intermediate layers from image builds to speed up subsequent builds.

SOURCE:

  • Command: docker system df (Build Cache line)
  • Docker API: GET /system/df (BuildCache field, API v1.39+)

HOW TO COLLECT IT MANUALLY:

# Build cache summary
docker system df | grep "Build Cache"

# Detailed build cache info
docker builder prune --dry-run

# Via API (v1.39+)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/system/df" | jq '.BuildCache'

# Direct inspection (buildkit)
sudo du -sh /var/lib/docker/buildkit/

WHAT IT TELLS YOU: Build cache grows with each build operation. On CI/CD runners that build many images, cache can consume significant space. While cache speeds up builds, unlimited growth wastes disk space.

SEVERITY:

  • TICKET: Build cache >20GB or >20% of Docker disk usage
  • PLAN: Regular cleanup schedule needed for build-heavy systems
  • INFO: Baseline cache size

THRESHOLDS:

  • Depends on build frequency
  • On build systems, 10-30GB is often normal
  • On non-build systems, any cache is potentially stale
  • Cache hit rate should be monitored; large cache with low hit rate is waste

FAILURE MODES DETECTED:

  • Unbounded cache growth on build servers
  • Stale cache causing build inconsistencies
  • Cache corruption causing build failures
  • Cache consuming space needed for production images

NUANCES & GOTCHAS:

  • BuildKit uses different cache storage than legacy builder
  • Cache is invalidated by Dockerfile changes, not just cleanup
  • docker builder prune is separate from docker system prune
  • Cache entries have TTL and last-used timestamps for selective cleanup

CORRELATES WITH:

  • Build Frequency — more builds = more cache
  • Docker Disk Usage (Total) — cache contributes to total
  • Build Time — large cache should correlate with faster builds

SIGNAL: Dangling Images Count

WHAT IT IS: The number of images that are not tagged and not referenced by any container. These are typically intermediate layers or images left over from builds.

SOURCE:

  • Command: docker images -f "dangling=true"
  • Docker API: GET /images/json with filters

HOW TO COLLECT IT MANUALLY:

# Count dangling images
docker images -f "dangling=true" -q | wc -l

# List with sizes
docker images -f "dangling=true"

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/images/json?filters={\"dangling\":[\"true\"]}" | \
  jq 'length'

# Size of dangling images
docker images -f "dangling=true" --format '{{.Size}}'

WHAT IT TELLS YOU: Dangling images are usually safe to remove. They accumulate from: failed builds (intermediate layers), builds that overwrite tags (old image becomes dangling), and image pulls that replace existing images. High dangling image count indicates cleanup is needed.

SEVERITY:

  • PLAN: Dangling images >10GB or >100 images
  • INFO: Baseline tracking

THRESHOLDS:

  • Small number is normal and expected
  • >100 dangling images or >10GB indicates cleanup needed
  • Rapid accumulation indicates frequent image changes

FAILURE MODES DETECTED:

  • Build detritus accumulation
  • Tag churn (pushing to same tag repeatedly)
  • Incomplete cleanup after image deletions

NUANCES & GOTCHAS:

  • Dangling images may still be used as cache for builds
  • Removing dangling images during builds can cause failures
  • Some dangling images are legitimate intermediate layers needed for builds
  • Filter carefully: some tools use <none> as legitimate placeholder

CORRELATES WITH:

  • Build Frequency — more builds = more dangling images
  • Docker Disk Usage (Images) — dangling images contribute
  • Image Pull/Push Rate — high rate = more dangling

SIGNAL: Orphaned Volumes Count

WHAT IT IS: The number of volumes that exist but are not referenced by any container. These volumes persist data that may no longer be needed.

SOURCE:

  • Command: docker volume ls -q cross-referenced with container mounts
  • Docker API: GET /volumes and GET /containers/json

HOW TO COLLECT IT MANUALLY:

# Find volumes not used by any container
docker volume ls -q | while read vol; do
  count=$(docker ps -a --filter volume=$vol -q | wc -l)
  [ $count -eq 0 ] && echo "$vol (orphaned)"
done

# Simpler: use docker system df -v to show volume usage
docker system df -v | grep -A 100 "Volumes space usage"

# Via API - get all volumes, then check container mounts
curl -s --unix-socket /var/run/docker.sock "http://localhost/volumes" | jq -r '.Volumes[].Name'
curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/json?all=true" | jq -r '.[].Mounts[].Name' | sort -u

WHAT IT TELLS YOU: Orphaned volumes consume disk space and may contain sensitive data. They’re created when containers are removed without the -v flag. Database data, uploaded files, and configuration can be stranded in orphaned volumes.

SEVERITY:

  • TICKET: Orphaned volume count >10 or total size >20GB
  • PLAN: Regular orphaned volume cleanup policy needed
  • INFO: Baseline orphaned volume tracking

THRESHOLDS:

  • Any orphaned volumes represent potential waste
  • Size matters more than count — one 100GB orphaned volume is worse than 100 1MB volumes
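
A sketch that sizes unreferenced volumes directly, assuming the local volume driver and the default data root (the dangling filter matches volumes not referenced by any container):

docker volume ls -qf dangling=true | while read vol; do
  sudo du -sh "/var/lib/docker/volumes/$vol/_data" 2>/dev/null
done | sort -hr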

FAILURE MODES DETECTED:

  • Data loss risk (containers removed without proper data migration)
  • Disk space waste
  • Security risk (sensitive data in forgotten volumes)
  • Compliance issues (PII in untracked volumes)

NUANCES & GOTCHAS:

  • Some volumes are intentionally standalone (data-only containers pattern, now deprecated)
  • Named volumes are more likely intentional; anonymous volumes more likely orphaned
  • Orphaned volumes may be needed for disaster recovery — don’t auto-delete
  • Volume drivers may not support all query operations

CORRELATES WITH:

  • Container Creation/Deletion Rate — high churn = more orphaned volumes
  • Docker Disk Usage (Volumes) — orphaned volumes contribute
  • Host Disk Usage — direct correlation

RESOURCE UTILIZATION DOMAIN


SIGNAL: Container CPU Usage

WHAT IT IS: The CPU time consumed by each container relative to host CPU capacity. This measures computational load per container.

SOURCE:

  • Docker API: GET /containers/{id}/stats
  • File: /sys/fs/cgroup/cpu/docker/<container_id>/cpuacct.usage (cgroups v1)
  • File: /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat (cgroups v2 with the systemd cgroup driver; includes usage_usec)

HOW TO COLLECT IT MANUALLY:

# Live stats for all containers
docker stats --no-stream

# Specific container
docker stats <container_id> --no-stream

# Via API (JSON, continuous stream)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq '.cpu_stats'

# Direct cgroups v1
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpuacct.usage

# Direct cgroups v2 (systemd cgroup driver; with the cgroupfs driver the path is /sys/fs/cgroup/docker/<container_id>/)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat

# Calculate CPU percentage (cgroups)
# delta_usage / (delta_time * cpu_count * 1e9) * 100

WHAT IT TELLS YOU: CPU usage indicates how much processing a container is doing. High usage may indicate: heavy workload, CPU-bound application, inefficient code, or resource contention. Containers hitting their CPU quota (if set) will be throttled.

SEVERITY:

  • TICKET: Container CPU >80% sustained for >15 minutes
  • PLAN: Container CPU trending upward over time
  • INFO: Baseline CPU usage per container

THRESHOLDS:

  • Normal varies by workload type
  • Sustained >80% on multi-core: may need more resources or optimization
  • Sustained >95% single-core: application may be bottlenecked
  • Compare to CPU limit if set; throttling occurs at 100% of limit
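
When stream=false (and one-shot is not set), the stats endpoint reports both the current and previous sample, so the same formula docker stats uses can be applied in one call; a sketch:

curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq -r '
    ((.cpu_stats.cpu_usage.total_usage - .precpu_stats.cpu_usage.total_usage)
     / (.cpu_stats.system_cpu_usage - .precpu_stats.system_cpu_usage))
    * (.cpu_stats.online_cpus // 1) * 100'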

FAILURE MODES DETECTED:

  • CPU-bound application (needs optimization or more resources)
  • Runaway process (infinite loop, crypto mining)
  • Resource contention (multiple containers competing)
  • CPU throttling (if quota set, container being limited)

NUANCES & GOTCHAS:

  • docker stats reports CPU percentage relative to one core: 100% means one core fully used, and values can exceed 100% on multi-core hosts — divide by core count to compare against total host capacity
  • CPU usage is cumulative; calculate rate of change for percentage
  • Throttling metrics (if CPU quota set) are more important than raw usage

CORRELATES WITH:

  • Container CPU Throttling — throttling + high usage = quota too low
  • Container Memory Usage — CPU + memory patterns indicate workload type
  • Host CPU Usage — container CPU contributes to host total

SIGNAL: Container CPU Throttling

WHAT IT IS: The amount of time a container’s CPU usage was throttled because it exceeded its CPU quota. This indicates containers hitting CPU limits.

SOURCE:

  • Docker API: GET /containers/{id}/stats (cpu_stats.throttling_data)
  • File: /sys/fs/cgroup/cpu/docker/<container_id>/cpu.stat (throttled_time, nr_throttled)

HOW TO COLLECT IT MANUALLY:

# Via API stats
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.cpu_stats.throttling_data'

# Direct cgroups v1
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.stat | grep throttle

# cgroups v2 (systemd cgroup driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat

# Calculate throttling percentage
# throttled_time_delta / (time_delta * cpu_count * 1e9) * 100

WHAT IT TELLS YOU: Throttling means the container wanted more CPU than its quota allows. This causes application slowdown, increased latency, and potential timeout failures. Any sustained throttling indicates the CPU limit is too low for the workload.

SEVERITY:

  • TICKET: Any sustained throttling (throttling time increasing)
  • PLAN: Occasional throttling during peak loads
  • INFO: Baseline throttling patterns

THRESHOLDS:

  • Throttling time = 0: normal, no issues
  • Any increasing throttling: quota needs adjustment
  • Throttling >10% of container uptime: significant impact
  • Burst throttling acceptable if latency SLAs allow
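
Because the throttling counters are cumulative, the simplest check is the share of enforcement periods that were throttled; a sketch using a single stats sample (containers without a CPU quota report zero periods):

curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq -r '
    .cpu_stats.throttling_data |
    if .periods > 0 then (.throttled_periods / .periods * 100 | tostring) + "% of periods throttled"
    else "no CPU quota set" end'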

FAILURE MODES DETECTED:

  • CPU quota too low for workload
  • CPU burst patterns (sporadic high CPU needs)
  • Application latency caused by throttling
  • Cascading delays from throttled services

NUANCES & GOTCHAS:

  • Throttling metrics are cumulative; track rate of change
  • Containers without CPU quota will never show throttling
  • Throttling can cause “noisy neighbor” issues to become worse
  • Some workloads (batch jobs) tolerate throttling better than latency-sensitive ones

CORRELATES WITH:

  • Container CPU Usage — high usage + throttling = needs more quota
  • Application Latency — throttling often causes latency spikes
  • Container Restart Count — if throttling causes timeouts, may cause restarts

SIGNAL: Container Memory Usage

WHAT IT IS: The memory currently allocated to a container, including cache, RSS, and other memory types. Critical for detecting memory exhaustion before OOM kill.

SOURCE:

  • Docker API: GET /containers/{id}/stats (memory_stats)
  • File: /sys/fs/cgroup/memory/docker/<container_id>/memory.usage_in_bytes (cgroups v1)
  • File: /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.current (cgroups v2 with the systemd cgroup driver)

HOW TO COLLECT IT MANUALLY:

# Live stats for all containers
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"

# Specific container
docker stats <container_id> --no-stream

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.memory_stats'

# Direct cgroups v1
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.stat

# cgroups v2 (systemd cgroup driver; with the cgroupfs driver the path is /sys/fs/cgroup/docker/<container_id>/)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.stat

WHAT IT TELLS YOU: Memory usage indicates how much RAM a container is using. If usage approaches the limit (if set), OOM kill is imminent. Growing memory usage may indicate a memory leak. Cache memory can be reclaimed, but RSS (resident set) cannot.

SEVERITY:

  • PAGE: Memory usage >90% of limit (if set) sustained
  • TICKET: Memory usage >75% of limit or growing trend
  • PLAN: Memory usage trend for capacity planning
  • INFO: Baseline memory patterns

THRESHOLDS:

  • Compare to container memory limit (if set)
  • Warning at >75% of limit
  • Critical at >90% of limit
  • Without limit, compare to host memory and other containers
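
A sketch of the percent-of-limit check, subtracting page cache the way docker stats does (field names differ between cgroups v1 and v2; with no limit set, the reported limit is host memory):

curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | jq -r '
    .memory_stats as $m |
    (($m.usage - ($m.stats.cache // $m.stats.inactive_file // 0)) / $m.limit * 100
     | tostring) + "% of limit"'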

FAILURE MODES DETECTED:

  • Memory leak (continuously growing usage)
  • Memory limit undersized for workload
  • Cache pressure (application caching aggressively)
  • Memory spike (sudden allocation)

NUANCES & GOTCHAS:

  • Total memory includes cache; cache is reclaimable
  • RSS (resident set size) is the more critical metric
  • Java applications: heap + metaspace + native overhead should fit under limit with buffer
  • Memory usage may spike during garbage collection; look at sustained usage

CORRELATES WITH:

  • Container OOM Killed — confirms memory exhaustion
  • Host Memory Usage — container contributes to host pressure
  • Container Restart Count — memory issues often cause restarts

SIGNAL: Container Network I/O

WHAT IT IS: Bytes received and transmitted per container, measuring network throughput and identifying network-heavy workloads.

SOURCE:

  • Docker API: GET /containers/{id}/stats (networks.{interface}.rx_bytes, tx_bytes)
  • File: /sys/class/net/<interface>/statistics/rx_bytes, tx_bytes (inside container namespace)
  • File: /proc/<pid>/net/dev (container process network namespace)

HOW TO COLLECT IT MANUALLY:

# Live stats
docker stats --no-stream

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.networks'

# Inside container network namespace
docker exec <container_id> cat /proc/net/dev

# From host: enter the container's network namespace via its PID
pid=$(docker inspect --format '{{.State.Pid}}' <container_id>)
sudo nsenter -t "$pid" -n cat /proc/net/dev
# (alternatively, locate the host-side veth peer and read its /sys/class/net counters)

# Calculate rates (need two samples)

WHAT IT TELLS YOU: Network I/O shows how much data a container is sending/receiving. High network usage may indicate: data-intensive application, log shipping, database replication, or abnormal activity (exfiltration, DDoS participation). Compare to expected bandwidth.

SEVERITY:

  • TICKET: Network I/O >10x baseline for container
  • TICKET: Transmit significantly higher than receive (may indicate data exfiltration)
  • PLAN: Network I/O growth trend
  • INFO: Baseline network patterns

THRESHOLDS:

  • Depends on application type (web server vs batch processor)
  • Compare to historical baseline for the container
  • Compare to network interface capacity
  • Any unexplained spike warrants investigation
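
Because the byte counters are cumulative, a rate needs two samples; a sketch over a 10-second window (eth0 and the window length are assumptions):

s1=$(curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/<id>/stats?stream=false")
sleep 10
s2=$(curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/<id>/stats?stream=false")
rx=$(( ( $(echo "$s2" | jq '.networks.eth0.rx_bytes') - $(echo "$s1" | jq '.networks.eth0.rx_bytes') ) / 10 ))
tx=$(( ( $(echo "$s2" | jq '.networks.eth0.tx_bytes') - $(echo "$s1" | jq '.networks.eth0.tx_bytes') ) / 10 ))
echo "rx ${rx} B/s  tx ${tx} B/s"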

FAILURE MODES DETECTED:

  • Network saturation (bandwidth limit reached)
  • Abnormal traffic patterns (security issue)
  • Misconfigured logging (excessive log shipping)
  • Data synchronization storms

NUANCES & GOTCHAS:

  • Bytes are cumulative; calculate rate from delta
  • Multiple network interfaces (eth0, eth1 for multi-network) may need separate tracking
  • Container network interfaces are veth pairs; host-side counters in different location
  • Network errors and drops are more important than raw throughput

CORRELATES WITH:

  • Container Network Errors — high I/O + errors = saturation or problem
  • Host Network Usage — container contributes to host total
  • Container CPU Usage — high network often correlates with CPU (encryption, compression)

SIGNAL: Container Network Errors

WHAT IT IS: Network errors (dropped packets, frame errors, carrier losses) for container network interfaces. Errors indicate connectivity problems.

SOURCE:

  • Docker API: GET /containers/{id}/stats (networks.{interface}.rx_dropped, tx_dropped, rx_errors, tx_errors)
  • File: /proc/net/dev inside container namespace

HOW TO COLLECT IT MANUALLY:

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.networks.eth0 | {rx_errors, tx_errors, rx_dropped, tx_dropped}'

# Inside container
docker exec <container_id> cat /proc/net/dev

# Host-side veth statistics (find veth interface name first)
docker exec <container_id> ip link show eth0
# Match with host ip link show, then:
cat /sys/class/net/vethXXX/statistics/rx_errors
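
# Error-rate sketch: errors and drops as a share of received packets (assumes jq; eth0 illustrative)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.networks.eth0 | {err_pct: (100 * (.rx_errors + .rx_dropped) / ([.rx_packets, 1] | max))}'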

WHAT IT TELLS YOU: Network errors indicate connectivity problems: packet loss, interface errors, buffer overflows, or hardware issues. Any nonzero error rate is abnormal and causes application-level retries, timeouts, and degraded performance.

SEVERITY:

  • TICKET: Any sustained network error rate (>0 errors/minute)
  • TICKET: Error rate as percentage of packets >0.1%
  • INFO: Baseline (should be zero)

THRESHOLDS:

  • Normal: errors = 0, or very small (<0.01% of packets)
  • Any increasing error count: investigate
  • Errors >0.1% of packets: significant impact

FAILURE MODES DETECTED:

  • Network interface saturation
  • veth pair buffer overflow
  • Physical network issues (on host)
  • MTU mismatch causing dropped packets
  • Firewall/rule issues

NUANCES & GOTCHAS:

  • Errors are cumulative; track rate of change
  • Dropped packets may be normal if QoS/traffic shaping is in effect
  • Container-to-container traffic on same bridge doesn’t hit physical network
  • DNS errors don’t show up in interface statistics

CORRELATES WITH:

  • Container Network I/O — high I/O + errors = saturation
  • Application Latency — network errors cause latency and timeouts
  • Host Network Errors — if host has errors, containers will too

SIGNAL: Container Block I/O

WHAT IT IS: Disk read and write bytes/operations per container. This measures storage I/O consumption and identifies disk-heavy workloads.

SOURCE:

  • Docker API: GET /containers/{id}/stats (blkio_stats)
  • File: /sys/fs/cgroup/blkio/docker/<container_id>/blkio.throttle.io_service_bytes (cgroups v1)
  • File: /sys/fs/cgroup/system.slice/docker-<container_id>.scope/io.stat (cgroups v2 with the systemd cgroup driver)

HOW TO COLLECT IT MANUALLY:

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '.blkio_stats'

# Direct cgroups v1
cat /sys/fs/cgroup/blkio/docker/<container_id>/blkio.throttle.io_service_bytes
# Format: major:minor operation bytes

# cgroups v2 (systemd cgroup driver; the path differs with the cgroupfs driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/io.stat

# Calculate rates (need two samples)
# Example: sum bytes for Read/Write operations
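# Sketch (assumes jq; op casing differs between cgroups v1 and v2, hence the downcase)
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false" | \
  jq '[.blkio_stats.io_service_bytes_recursive[]? | select((.op | ascii_downcase) == "read" or (.op | ascii_downcase) == "write") | .value] | add'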

WHAT IT TELLS YOU: Block I/O shows disk activity per container. High I/O may indicate: database workloads, log writing, file processing, or inefficient caching. I/O-heavy containers can starve other containers and impact host performance.

SEVERITY:

  • TICKET: Block I/O >80% of device bandwidth sustained
  • TICKET: Sustained I/O wait causing latency
  • PLAN: I/O patterns for capacity planning
  • INFO: Baseline I/O per container type

THRESHOLDS:

  • Compare to storage device capacity (IOPS, bandwidth)
  • Sustained high I/O on shared storage affects all containers
  • Device saturation varies by storage type (SSD vs HDD, local vs network)

FAILURE MODES DETECTED:

  • Disk-intensive workload (may need dedicated storage)
  • I/O throttling (if limits set)
  • Log flooding (excessive writes)
  • Database working set not fitting in memory (excessive reads)

NUANCES & GOTCHAS:

  • Blkio stats may not include all I/O (depends on cgroup configuration)
  • I/O to volumes depends on volume driver and may not be fully attributed
  • OverlayFS adds overhead; container I/O may be higher than reported
  • Async I/O may have different accounting than sync I/O

CORRELATES WITH:

  • Host Disk I/O — container I/O contributes to host total
  • Container Memory Usage — low memory = more swap/disk I/O
  • Container Latency — high I/O wait = high latency

SIGNAL: Docker Daemon File Descriptor Count

WHAT IT IS: The number of open file descriptors used by the Docker daemon process. FDs are used for: API connections, container stdio, network sockets, and internal state.

SOURCE:

  • Process: /proc/$(pgrep dockerd)/fd (count of entries)
  • Command: ls /proc/$(pgrep dockerd)/fd | wc -l

HOW TO COLLECT IT MANUALLY:

# Count FDs for dockerd
sudo ls /proc/$(pgrep dockerd)/fd | wc -l

# FD limit
sudo cat /proc/$(pgrep dockerd)/limits | grep "open files"

# Via /proc directly (note: FDSize is the size of the fd table, not the count of open fds)
sudo cat /proc/$(pgrep dockerd)/status | grep -i fd

# Detailed breakdown
sudo ls -l /proc/$(pgrep dockerd)/fd | head -20
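
# Percent-of-limit sketch (assumes a single dockerd process)
PID=$(pgrep -o dockerd)
USED=$(sudo ls /proc/$PID/fd | wc -l)
LIMIT=$(sudo awk '/Max open files/ {print $4}' /proc/$PID/limits)
echo "dockerd FDs: $USED / $LIMIT ($(( USED * 100 / LIMIT ))%)"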

WHAT IT TELLS YOU: FD count indicates daemon resource usage. Each container uses multiple FDs (stdio, network, logs). High FD usage approaching the limit causes “too many open files” errors, failed operations, and daemon instability.

SEVERITY:

  • PAGE: FD count >90% of limit
  • TICKET: FD count >75% of limit
  • PLAN: FD growth trend (leak detection)
  • INFO: Baseline FD usage

THRESHOLDS:

  • Compare to process FD limit (often 65535 or higher)
  • Warning at >50% of limit
  • Critical at >80% of limit
  • Investigate any sustained growth

FAILURE MODES DETECTED:

  • FD leak (opened but not closed)
  • Too many containers for current limit
  • API connection leak (clients not closing properly)
  • Log file FD accumulation

NUANCES & GOTCHAS:

  • FD count includes network sockets, not just files
  • Each docker logs -f consumes an FD
  • FD limit can be increased, but indicates underlying issue if growing
  • System-wide FD limits also matter

CORRELATES WITH:

  • Container Count — more containers = more FDs
  • API Connection Count — active API connections consume FDs
  • Docker Daemon Errors — FD exhaustion causes errors

SIGNAL: Docker Daemon Goroutine Count

WHAT IT IS: The number of goroutines currently active in the Docker daemon (written in Go). This indicates concurrent operations and potential thread starvation.

SOURCE:

  • Debug endpoint: GET /debug/vars (if enabled)
  • Prometheus metrics: GET /metrics (if enabled)
  • Process threads: /proc/$(pgrep dockerd)/status (Threads field, approximate)

HOW TO COLLECT IT MANUALLY:

# If debug endpoint enabled (usually disabled in production)
curl -s --unix-socket /var/run/docker.sock http://localhost/debug/vars | jq '.num_goroutine'

# If Prometheus metrics enabled (served on the metrics-addr set in daemon.json, e.g. 127.0.0.1:9323)
curl -s http://127.0.0.1:9323/metrics | grep go_goroutines

# Approximate via process threads
cat /proc/$(pgrep dockerd)/status | grep Threads

# Via pprof (if exposed)
curl -s --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=1

WHAT IT TELLS YOU: High goroutine count indicates many concurrent operations. Rapidly growing goroutine count indicates a goroutine leak (operations blocked indefinitely). Extremely high counts cause memory pressure and daemon slowdown.

SEVERITY:

  • TICKET: Goroutine count >10,000 sustained
  • TICKET: Goroutine count growing without bound
  • PLAN: Track goroutine baseline during normal and peak operations
  • INFO: Baseline goroutine patterns

THRESHOLDS:

  • Normal varies by workload: typically hundreds to thousands
  • >10,000 indicates a potential issue
  • Sustained growth without corresponding workload = leak

FAILURE MODES DETECTED:

  • Goroutine leak (operations blocked on I/O or locks)
  • Daemon overload (too many concurrent operations)
  • Internal deadlock (goroutines waiting indefinitely)
  • Memory pressure from goroutine stacks

NUANCES & GOTCHAS:

  • Debug endpoints are often disabled in production for security
  • Goroutine count includes idle/background goroutines, not just active
  • Sudden spikes during heavy operations are normal
  • Goroutine count != thread count; Go runtime multiplexes

CORRELATES WITH:

  • Docker Daemon Response Latency — high goroutines + latency = potential deadlock
  • Docker Daemon Memory — goroutines consume stack memory
  • Container Operations Rate — high operations = more goroutines

SIGNAL: Docker Daemon Memory Usage

WHAT IT IS: Memory consumed by the Docker daemon process itself (not containers). This is separate from container memory and indicates daemon resource footprint.

SOURCE:

  • Process: /proc/$(pgrep dockerd)/status (VmRSS, VmSize)
  • Command: ps -o rss,vsz -p $(pgrep dockerd)

HOW TO COLLECT IT MANUALLY:

# RSS and VSZ
ps -o rss,vsz -p $(pgrep dockerd)

# Detailed memory from /proc
cat /proc/$(pgrep dockerd)/status | grep -E 'Vm|Rss'

# From smaps (more detailed)
sudo cat /proc/$(pgrep dockerd)/smaps_rollup

# Using pmap
sudo pmap $(pgrep dockerd) | tail -1
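
# Trend sketch: append RSS (KiB) with a timestamp, e.g. from cron (the log path is illustrative)
echo "$(date -Is) $(ps -o rss= -p "$(pgrep -o dockerd)")" | sudo tee -a /var/log/dockerd-rss.log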

WHAT IT TELLS YOU: Daemon memory usage should be relatively stable. Growing memory indicates a memory leak. Very high daemon memory can cause OOM (catastrophic: daemon dies while containers keep running). Daemon memory includes: image metadata, container state, network state, plugin data.

SEVERITY:

  • PAGE: Daemon memory approaching host OOM threshold
  • TICKET: Daemon memory growing >100MB/day without workload change
  • PLAN: Track daemon memory trend
  • INFO: Baseline daemon memory (typically 100-500MB depending on scale)

THRESHOLDS:

  • Normal: depends on scale; typically 100MB-1GB for moderate deployments
  • Warning: >1GB or growing trend
  • Critical: approaching host memory limit (daemon OOM is catastrophic)

FAILURE MODES DETECTED:

  • Memory leak in daemon
  • Excessive image/container metadata
  • Plugin memory consumption
  • Large log buffer accumulation

NUANCES & GOTCHAS:

  • Daemon memory doesn’t include container memory (that’s in cgroups)
  • Go’s garbage collector means some fluctuation is normal
  • Memory usage correlates with number of images, containers, and networks
  • Daemon restart clears most accumulated memory (but is disruptive)

CORRELATES WITH:

  • Container Count — more containers = more daemon memory
  • Image Count — more images = more metadata memory
  • Docker Disk Usage — disk operations may buffer in memory

INTERNAL STATE DOMAIN


SIGNAL: Docker Storage Driver Status

WHAT IT IS: The health and performance characteristics of Docker’s storage driver (typically overlay2). Storage driver issues directly impact container operations.

SOURCE:

  • Command: docker info (Storage Driver section)
  • File: /proc/mounts (overlay mounts)
  • File: /sys/fs/overlay/ (overlay-specific stats, if available)

HOW TO COLLECT IT MANUALLY:

# Check storage driver
docker info | grep -A5 "Storage Driver"

# Verify overlay mounts
mount | grep overlay

# Check backing filesystem
df -h /var/lib/docker

# For overlay2, check layer directories
ls /var/lib/docker/overlay2/

# Disk usage of overlay storage
du -sh /var/lib/docker/overlay2/

# Check for xfs quota (if used)
xfs_quota -x -c 'df -h' /var/lib/docker
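
# Space and inode headroom on the backing filesystem (inode exhaustion is easy to miss; GNU df)
df --output=pcent,ipcent /var/lib/docker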

WHAT IT TELLS YOU: Storage driver health is essential for container operations. Problems include: disk exhaustion, inode exhaustion, mount failures, and performance degradation. overlay2 is most common; other drivers (devicemapper, btrfs) have different failure modes.

SEVERITY:

  • PAGE: Storage driver errors in daemon log
  • TICKET: Disk usage >80% of backing filesystem
  • PLAN: Track storage growth trend
  • INFO: Baseline storage driver metrics

THRESHOLDS:

  • Backing filesystem: warning at 70%, critical at 85%
  • Inode usage: warning at 70%, critical at 85%
  • Any storage driver errors require investigation

FAILURE MODES DETECTED:

  • Disk space exhaustion
  • Inode exhaustion (many small files)
  • Mount failures
  • Layer corruption
  • Performance degradation (slow container starts)

NUANCES & GOTCHAS:

  • Different storage drivers have very different characteristics
  • overlay2 requires backing filesystem support (preferably xfs with pquota)
  • devicemapper has thin pool that can fill independently
  • btrfs/zfs have their own volume management

CORRELATES WITH:

  • Docker Disk Usage — storage driver stores all data
  • Container Start Latency — storage performance affects start time
  • Docker Daemon Errors — storage issues log errors

SIGNAL: Docker Network Bridge Status

WHAT IT IS: The state of Docker’s default bridge (docker0) and custom networks. Network issues cause container connectivity problems.

SOURCE:

  • Command: docker network ls
  • Command: docker network inspect bridge
  • File: /sys/class/net/docker0/ (bridge interface stats)
  • Command: ip link show docker0, brctl show

HOW TO COLLECT IT MANUALLY:

# List networks
docker network ls

# Inspect default bridge
docker network inspect bridge

# Bridge interface statistics
ip -s link show docker0

# Bridge details (if brctl available)
brctl show docker0

# Via /sys
cat /sys/class/net/docker0/operstate
cat /sys/class/net/docker0/carrier

# Check iptables rules for Docker
iptables -t nat -L DOCKER -n -v
iptables -L DOCKER -n -v
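
# IP pool utilization sketch: configured subnet vs. containers attached to the default bridge (assumes jq)
docker network inspect bridge --format '{{json .IPAM.Config}}' | jq .
docker network inspect bridge --format '{{len .Containers}} containers attached'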

WHAT IT TELLS YOU: Bridge network status indicates container networking health. Problems include: bridge interface down, IP address exhaustion, iptables rule corruption, and veth pair issues. Containers on a broken bridge cannot communicate.

SEVERITY:

  • PAGE: Bridge interface operstate != “up” and containers exist
  • TICKET: IP allocation approaching subnet limit
  • PLAN: Network configuration drift
  • INFO: Baseline network configuration

THRESHOLDS:

  • Bridge should be “up” when containers are running
  • IP pool utilization >80% indicates approaching exhaustion
  • Any carrier=0 with active containers = problem

FAILURE MODES DETECTED:

  • Bridge interface down
  • IP address exhaustion (subnet full)
  • iptables rules corrupted
  • veth pair orphaning
  • MTU mismatch

NUANCES & GOTCHAS:

  • Custom networks (overlay, macvlan) have different failure modes
  • Docker creates/destroys iptables rules dynamically
  • Bridge network is default; custom networks may be primary in production
  • IP conflicts can occur with manually assigned IPs

CORRELATES WITH:

  • Container Network Errors — bridge issues cause container errors
  • Container Network I/O — broken bridge = no I/O
  • Docker Daemon Errors — network issues log errors

SIGNAL: Docker Events Stream Liveness

WHAT IT IS: Whether the Docker events stream is producing events as expected. The events stream is the heartbeat of container activity.

SOURCE:

  • Docker API: GET /events
  • Command: docker events

HOW TO COLLECT IT MANUALLY:

# Check events are flowing (timeout after 5 seconds)
timeout 5 docker events --filter 'type=container' --format 'event received'

# Via API with timeout: count event lines seen in a 5-second window (includes the last minute of history)
timeout 5 curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/events?since=$(date -d '1 minute ago' +%s)" | wc -l

# Check for recent events
docker events --since 1m --until 0s | head -5
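
# Active probe sketch: generate a create event and confirm it arrives (assumes the alpine image is cached)
timeout 10 docker events --filter 'event=create' --format '{{.Type}} {{.Action}}' &
docker run --rm alpine true > /dev/null
wait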

WHAT IT TELLS YOU: The events stream should produce events when containers are created, started, stopped, etc. If the stream is silent when activity is expected, or the API hangs, there may be daemon internal issues. Some monitoring systems depend on the events stream.

SEVERITY:

  • TICKET: Events stream unresponsive or hanging
  • INFO: Baseline event rates

THRESHOLDS:

  • Events stream should respond within seconds
  • No events during known activity = problem
  • API hanging on events request = daemon issue

FAILURE MODES DETECTED:

  • Daemon internal state corruption
  • Events buffer overflow
  • API handler deadlock

NUANCES & GOTCHAS:

  • Low activity systems may have long event silences (normal)
  • Daemon restart clears event buffer
  • Events are not persisted; only available while streaming
  • Multiple event subscribers are supported; one shouldn’t block others

CORRELATES WITH:

  • Docker Daemon Response Latency — hanging events = daemon stress
  • Container Operations Rate — should correlate with events

SIGNAL: Docker API Health Check Endpoint

WHAT IT IS: A simple endpoint that verifies the daemon’s HTTP API is responding. This is the simplest daemon health check.

SOURCE:

  • Docker API: GET /_ping

HOW TO COLLECT IT MANUALLY:

# Simple ping
curl --unix-socket /var/run/docker.sock http://localhost/_ping
# Returns: OK

# With timing
time curl --unix-socket /var/run/docker.sock http://localhost/_ping

# Via TCP (if configured)
curl http://localhost:2375/_ping

WHAT IT TELLS YOU: The /_ping endpoint returns “OK” if the daemon is minimally functional. It’s the lightest-weight check for daemon liveness. A response means the HTTP handler is working, but doesn’t guarantee full functionality.

SEVERITY:

  • PAGE: /_ping not responding for >30 seconds
  • TICKET: /_ping response time >5 seconds
  • INFO: Baseline response time

THRESHOLDS:

  • Response time <1 second = normal
  • Response time >5 seconds = degraded
  • No response = critical

FAILURE MODES DETECTED:

  • Daemon process dead
  • Daemon hung (internal deadlock)
  • Socket file removed/corrupted

NUANCES & GOTCHAS:

  • /_ping is very lightweight; it may respond even when daemon is stressed
  • It doesn’t verify container operations work
  • TCP socket (2375/2376) is often disabled or secured; unix socket is preferred
  • Always check response content (should be “OK”), not just HTTP status

CORRELATES WITH:

  • Docker Daemon Process Health — /_ping is the responsiveness check
  • Docker Daemon Response Latency — more detailed latency measurement

REPLICATION/CONSISTENCY DOMAIN


SIGNAL: Container Health Check Status

WHAT IT IS: The result of container health checks (if configured). Health checks verify the application inside the container is functional, not just running.

SOURCE:

  • Docker API: GET /containers/{id}/json (State.Health field)
  • Command: docker inspect --format '{{.State.Health.Status}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# Health status for specific container
docker inspect --format '{{json .State.Health}}' <container_id> | jq .

# List all containers with health status
docker ps --format '{{.ID}} {{.Status}}' | grep -E '\((healthy|unhealthy)\)'

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/json" | jq '.State.Health'

# Check last 5 health check results
docker inspect --format '{{range .State.Health.Log}}{{.End}}: {{.ExitCode}} {{.Output}}{{"\n"}}{{end}}' <container_id> | tail -5
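
# The built-in health filter is an alternative to parsing status strings
docker ps --filter health=unhealthy --format '{{.ID}} {{.Names}} {{.Status}}'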

WHAT IT TELLS YOU: Health status shows whether the container is actually healthy, not just running. Status is: starting, healthy, or unhealthy. Unhealthy containers (with restart policy) will be restarted. Health check output may reveal why the check failed.

SEVERITY:

  • PAGE: Container health status = “unhealthy” for critical service
  • TICKET: Container health status = “unhealthy” for any production service
  • PLAN: Health check failure frequency
  • INFO: Health check timing and success rate

THRESHOLDS:

  • Healthy = normal
  • Unhealthy = immediate attention
  • “Starting” for longer than start_period = problem

FAILURE MODES DETECTED:

  • Application not responding (web server, database)
  • Dependency failure (cannot connect to required service)
  • Resource starvation (too slow to respond)
  • Configuration error (wrong health check command)
  • Application deadlock

NUANCES & GOTCHAS:

  • Health checks must be configured in image or at run time; not all containers have them
  • Health check interval matters: frequent checks add load
  • Health check command runs inside the container
  • Container may be “running” but “unhealthy” — different conditions
  • Restart on unhealthy can cause loops if health check is misconfigured

CORRELATES WITH:

  • Container Restart Count — unhealthy + restart policy = restarts
  • Container Application Logs — health check failures often logged
  • Container CPU/Memory Usage — resource starvation causes health failures

SIGNAL: Docker Daemon Version and API Version

WHAT IT IS: The version of Docker daemon and its API. Version mismatches between client and daemon cause errors.

SOURCE:

  • Docker API: GET /version
  • Command: docker version

HOW TO COLLECT IT MANUALLY:

# Full version info
docker version

# Via API
curl -s --unix-socket /var/run/docker.sock http://localhost/version | jq .

# Just daemon version
docker version --format '{{.Server.Version}}'

# API version
docker version --format '{{.Server.APIVersion}}'

WHAT IT TELLS YOU: Version tracking is important for: compatibility (client/daemon mismatch), security (known vulnerabilities), and feature availability. Version drift across hosts can cause inconsistencies.

SEVERITY:

  • TICKET: Version mismatch between clients and daemon
  • TICKET: Daemon version has known security vulnerabilities
  • PLAN: Version consistency across fleet
  • INFO: Fleet version tracking

THRESHOLDS:

  • Major version mismatches often cause errors
  • Patch version differences usually compatible
  • Track CVEs for Docker versions

FAILURE MODES DETECTED:

  • Client/daemon incompatibility
  • Missing features in older versions
  • Security vulnerabilities in outdated versions

NUANCES & GOTCHAS:

  • API version is more important than release version for compatibility
  • Docker ships with multiple API versions; daemon negotiates with client
  • Downgrades are not supported
  • Version output includes OS/Arch and other metadata

CORRELATES WITH:

  • Docker Daemon Errors — version mismatch errors
  • Container Creation Failures — API incompatibility

SECURITY DOMAIN


SIGNAL: Privileged Container Count

WHAT IT IS: The number of containers running with the --privileged flag, which gives them full access to the host.

SOURCE:

  • Docker API: GET /containers/json (HostConfig.Privileged field)
  • Command: docker inspect --format '{{.HostConfig.Privileged}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# Find all privileged containers
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  if [ "$(docker inspect --format '{{.HostConfig.Privileged}}' $id)" = "true" ]; then
    echo "PRIVILEGED: $id $name"
  fi
done

# Via API (the list endpoint's HostConfig omits Privileged, so inspect each container; assumes jq)
curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/json" | jq -r '.[].Id' | \
while read -r id; do
  priv=$(curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/$id/json" | jq -r '.HostConfig.Privileged')
  [ "$priv" = "true" ] && echo "PRIVILEGED: $id"
done

# Count privileged containers (docker ps --format cannot read HostConfig; use inspect)
docker ps -q | xargs -r docker inspect --format '{{.HostConfig.Privileged}}' | grep -c true

WHAT IT TELLS YOU: Privileged containers have essentially host-level access. They can: access all devices, modify kernel parameters, load kernel modules, and potentially escape container isolation. Any privileged container is a security risk that should be justified and minimized.

SEVERITY:

  • TICKET: Any new privileged container in production
  • TICKET: Privileged container count increasing
  • PLAN: Audit all privileged containers for necessity
  • INFO: Baseline privileged container list

THRESHOLDS:

  • Target: zero privileged containers
  • Any privileged container requires documented justification
  • Unexpected privileged container = security incident

FAILURE MODES DETECTED:

  • Container escape risk
  • Host compromise via privileged container
  • Unauthorized privileged containers (malicious actor)

NUANCES & GOTCHAS:

  • Some legitimate use cases: Docker-in-Docker, system monitoring, hardware access
  • Use capability dropping instead of privileged when possible
  • Privileged bypasses most security controls
  • Also check for specific capabilities that may be excessive (SYS_ADMIN, etc.)

CORRELATES WITH:

  • Container Capabilities List — fine-grained capability audit
  • Container Volume Mounts — privileged + host mounts = high risk

SIGNAL: Containers with Host Network

WHAT IT IS: The number of containers using host network mode (--network host), which bypasses Docker’s network isolation.

SOURCE:

  • Docker API: GET /containers/json (HostConfig.NetworkMode field)
  • Command: docker inspect --format '{{.HostConfig.NetworkMode}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# Find containers with host network
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  net=$(docker inspect --format '{{.HostConfig.NetworkMode}}' $id)
  if [ "$net" = "host" ]; then
    echo "HOST_NETWORK: $id $name"
  fi
done

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json" | \
  jq -r '.[] | select(.HostConfig.NetworkMode == "host") | .Id[:12]'

# Count (docker ps --format cannot read HostConfig; use inspect)
docker ps -aq | xargs -r docker inspect --format '{{.HostConfig.NetworkMode}}' | grep -cx host

WHAT IT TELLS YOU: Host network mode gives containers direct access to the host’s network interfaces. The container shares the host’s IP address and can bind to any port. This bypasses network isolation and can cause port conflicts.

SEVERITY:

  • TICKET: Any new host-network container in production
  • PLAN: Audit host-network containers for necessity
  • INFO: Baseline host-network container list

THRESHOLDS:

  • Target: minimize host-network containers
  • Any host-network container requires documented justification
  • Unexpected host-network = security concern

FAILURE MODES DETECTED:

  • Port conflicts with host services
  • Network isolation bypass
  • Unauthorized network access
  • Service masquerading (container appears as host)

NUANCES & GOTCHAS:

  • Some legitimate use cases: high-performance networking, port ranges, network diagnostics
  • Host network is less risky than privileged, but still weakens isolation
  • Container can still be limited by other controls (capabilities, seccomp)
  • Port bindings don’t apply to host-network containers

CORRELATES WITH:

  • Privileged Container Count — both weaken isolation
  • Container Capabilities List — network-related capabilities

SIGNAL: Containers with Host Path Mounts

WHAT IT IS: Containers that have directories from the host filesystem mounted inside them. Sensitive host paths (/, /etc, /var/run/docker.sock) create security risks.

SOURCE:

  • Docker API: GET /containers/{id}/json (Mounts field)
  • Command: docker inspect --format '{{json .Mounts}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# List all mounts for all containers
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' $id | sed "s/^/$name: /"
done

# Find containers mounting docker socket
docker ps --format '{{.ID}}' | while read id; do
  if docker inspect --format '{{range .Mounts}}{{if eq .Destination "/var/run/docker.sock"}}YES{{end}}{{end}}' $id | grep -q YES; then
    echo "DOCKER_SOCKET: $id"
  fi
done

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json" | \
  jq -r '.[] | {Id: .Id[:12], Mounts: [.Mounts[]?.Source]}'

WHAT IT TELLS YOU: Volume mounts allow containers to access host filesystem paths. Mounting sensitive paths (docker.sock, /etc, /root, /) gives containers host-level access. Docker socket mount allows container to control Docker daemon (effectively root on host).

SEVERITY:

  • PAGE: Container mounting /var/run/docker.sock from untrusted source
  • TICKET: Any container mounting sensitive host paths (/, /etc, /root, /var/lib/docker)
  • PLAN: Audit all host mounts for necessity and minimal access
  • INFO: Baseline mount inventory

THRESHOLDS:

  • Docker socket mount: high risk, requires strong justification
  • /etc mount: can read secrets, modify host config
  • / mount: full filesystem access
  • Any write mount to sensitive path: critical

FAILURE MODES DETECTED:

  • Container escape via docker socket
  • Credential theft (reading /etc/shadow, /root/.ssh)
  • Host modification (writing to /etc, /bin)
  • Docker daemon control (via socket mount)

NUANCES & GOTCHAS:

  • Many tools require docker socket mount (CI/CD, monitoring)
  • Read-only mounts reduce risk but don’t eliminate it
  • Named volumes are safer than bind mounts
  • Also check /run/docker.sock; /var/run is typically a symlink to /run, so both paths reach the same socket

CORRELATES WITH:

  • Privileged Container Count — combined with sensitive mounts = critical
  • Container Capabilities List — mounts + capabilities compound risk

SIGNAL: Container Capabilities List

WHAT IT IS: The Linux capabilities assigned to each container. Excessive capabilities weaken isolation and increase security risk.

SOURCE:

  • Docker API: GET /containers/{id}/json (HostConfig.CapAdd, HostConfig.CapDrop)
  • Command: docker inspect --format '{{json .HostConfig.CapAdd}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# List capabilities for container
docker inspect --format '{{json .HostConfig.CapAdd}}' <container_id> | jq .
docker inspect --format '{{json .HostConfig.CapDrop}}' <container_id> | jq .

# Find containers with added capabilities
docker ps --format '{{.ID}} {{.Names}}' | while read id name; do
  caps=$(docker inspect --format '{{json .HostConfig.CapAdd}}' $id)
  if [ "$caps" != "null" ] && [ "$caps" != "[]" ]; then
    echo "$name: $caps"
  fi
done

# Via API
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/json" | \
  jq '{Added: .HostConfig.CapAdd, Dropped: .HostConfig.CapDrop}'

WHAT IT TELLS YOU: Docker drops most capabilities by default. Added capabilities increase container power and risk. Dangerous capabilities include: SYS_ADMIN (many admin operations), NET_ADMIN (network configuration), SYS_PTRACE (debugging, can bypass isolation), ALL (all capabilities).

SEVERITY:

  • TICKET: Container with SYS_ADMIN, NET_ADMIN, SYS_PTRACE, or ALL capabilities
  • PLAN: Audit capability additions for necessity
  • INFO: Baseline capability inventory

THRESHOLDS:

  • Default capabilities are relatively safe
  • SYS_ADMIN is particularly dangerous (almost as bad as privileged)
  • NET_ADMIN can modify firewall rules
  • ALL = effectively privileged
  • Any capability addition requires justification

FAILURE MODES DETECTED:

  • Container escape via dangerous capabilities
  • Host manipulation (mount, network, kernel)
  • Privilege escalation

NUANCES & GOTCHAS:

  • Some applications legitimately need specific capabilities (e.g., NET_ADMIN for VPN software)
  • CapDrop is good practice even without CapAdd
  • Seccomp and AppArmor interact with capabilities
  • Capability meanings are complex; review Linux capability documentation

CORRELATES WITH:

  • Privileged Container Count — similar risk profile
  • Container Security Profile (seccomp, AppArmor)

SIGNAL: Docker Daemon Audit Logs

WHAT IT IS: Security-relevant events in Docker daemon logs or system audit logs: API access, authentication attempts, configuration changes.

SOURCE:

  • Journal: journalctl -u docker (filtered for security events)
  • File: /var/log/audit/audit.log (if auditd configured for Docker)
  • Docker API: GET /events (filtered for security-relevant actions)

HOW TO COLLECT IT MANUALLY:

# Docker daemon logs with security focus
journalctl -u docker.service | grep -iE "(auth|denied|forbidden|unauthorized|security)"

# If auditd rules for Docker are configured
ausearch -m avc -c dockerd
ausearch -m USER_LOGIN -c docker

# Docker events for sensitive actions
docker events --filter 'event=create' --filter 'event=attach' --since 1h

# Check for container privilege changes
journalctl -u docker.service --since "1 day ago" | grep -i privileged

WHAT IT TELLS YOU: Security audit logs reveal: unauthorized access attempts, privilege escalation, unusual API calls, and configuration changes. These should be monitored and alerted on for security incidents.

SEVERITY:

  • PAGE: Evidence of unauthorized access or privilege escalation
  • TICKET: Any authentication failures or denied operations
  • PLAN: Regular security log review
  • INFO: Baseline security event patterns

THRESHOLDS:

  • Any authentication failure: investigate
  • Any privilege escalation: investigate
  • Unusual API patterns: investigate
  • Audit log gaps: investigate

FAILURE MODES DETECTED:

  • Unauthorized API access
  • Container escape attempts
  • Malicious container creation
  • Configuration tampering

NUANCES & GOTCHAS:

  • Docker doesn’t have built-in user authentication; relies on TLS or socket access
  • Audit logging may require additional configuration
  • Docker Content Trust provides image verification (separate signal)
  • Swarm mode adds additional auth/audit capabilities

CORRELATES WITH:

  • Privileged Container Count — new privileged containers should have audit trail
  • Container Creation Failures — repeated failures may be attacks

SECTION 2 — Composite Failure Patterns


PATTERN: Disk Exhaustion Cascade

SIGNALS INVOLVED:

  • Docker Disk Usage approaching 100%
  • Container creation failures increasing
  • Image pull failures
  • Daemon latency increasing
  • Log write failures in containers

NARRATIVE: Docker disk usage grows gradually (images, logs, containers, volumes) until it nears the filesystem limit. As free space dwindles, writes become slower. Image pulls fail. Container creates fail. Running containers may crash if they can’t write logs or data. The daemon may become unresponsive during storage operations. Recovery requires disk cleanup but cleanup operations themselves may fail without working space.

SEVERITY: PAGE — system approaching complete unavailability

DISTINGUISHING FEATURES:

  • Disk usage is the primary indicator
  • Multiple failure types appear simultaneously
  • Failures are all storage-related

COMMON CAUSES:

  • Unbounded container log growth (json-file without max-size/max-file)
  • Image accumulation without cleanup
  • Orphaned volumes growing
  • Build cache bloat on CI runners
  • Application data growth in volumes

FIRST RESPONSE:

  1. Identify largest disk consumers: docker system df -v
  2. Quick reclaim: docker system prune -f (images, build cache)
  3. Identify and remove large log files in container directories
  4. If critical, stop non-essential containers to free space
  5. Schedule root cause analysis for log/image management
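
For step 3, the largest json-file logs are usually the fastest reclaim; a sketch (assumes the default json-file log driver):

sudo du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -rh | head -10
# Reclaim in place with truncate -s 0; deleting the file does not free space while dockerd holds it open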

PATTERN: Container Death Spiral

SIGNALS INVOLVED:

  • Container restart count increasing rapidly
  • Exit codes 1 or 137 appearing repeatedly
  • Health check failures
  • Container start latency increasing (if many simultaneous restarts)
  • Daemon errors in logs

NARRATIVE: A container crashes (application error, OOM, or resource issue) and Docker restarts it (if restart policy permits). The container crashes again quickly, restarts again. Each restart consumes resources. If multiple containers are in this state, they can overwhelm the daemon, cause disk pressure (logs), and mask the root cause. The system appears “running” but is non-functional.

SEVERITY: PAGE if affecting critical service; TICKET otherwise

DISTINGUISHING FEATURES:

  • Restart count climbing rapidly
  • Containers are briefly “running” then “exited” repeatedly
  • Exit codes consistent (same failure cause)

COMMON CAUSES:

  • Application bug causing immediate crash
  • OOM kill (memory limit too low)
  • Missing dependencies (config, secrets, other services)
  • Invalid container configuration
  • Health check too aggressive (kills before app ready)

FIRST RESPONSE:

  1. Identify affected container(s) and their exit codes
  2. Check container logs: docker logs <container_id>
  3. If OOM, check memory usage and limits
  4. If application error, check application-level logs
  5. Consider pausing restart policy temporarily: docker update --restart=no <container_id>
  6. Fix root cause before re-enabling restarts
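
A sweep like the following (a sketch) surfaces exit codes, restart counts, and OOM flags across exited containers:

docker ps -a --filter status=exited --format '{{.ID}} {{.Names}}' | while read -r id name; do
  docker inspect --format "$name: exit={{.State.ExitCode}} restarts={{.RestartCount}} oom={{.State.OOMKilled}}" "$id"
done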

PATTERN: Daemon Hang

SIGNALS INVOLVED:

  • Daemon API unresponsive (/_ping fails or times out)
  • Daemon process still running
  • Containers still running (workload not affected)
  • docker commands hang
  • Daemon latency spike before hang

NARRATIVE: The Docker daemon becomes unresponsive while containers continue running. All management operations hang: cannot inspect, create, stop, or get logs. This is typically caused by internal deadlock, storage driver hang, or extreme resource contention. Running workloads are unaffected but unmanageable. Recovery may require daemon restart (which briefly affects containers) or in extreme cases, host reboot.

SEVERITY: PAGE — operational capability lost

DISTINGUISHING FEATURES:

  • Daemon process exists but is unresponsive
  • Containers are still running (key difference from daemon crash)
  • Often preceded by latency increase

COMMON CAUSES:

  • Storage driver deadlock (overlay2 bug, filesystem issue)
  • Internal daemon deadlock (bug in Docker)
  • Extreme I/O contention causing storage operations to hang
  • File descriptor exhaustion
  • Kernel-level issue affecting cgroups/namespaces

FIRST RESPONSE:

  1. Confirm containers are still running: ps aux | grep containerd-shim (one shim process per running container)
  2. Check daemon process: ps aux | grep dockerd
  3. Check storage and system health: df -h, iostat
  4. Attempt graceful daemon restart: systemctl restart docker
  5. If graceful restart hangs, may need kill -9 on dockerd
  6. In extreme cases, host reboot (last resort)
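
Before restarting, it is worth capturing the daemon's goroutine stacks for later analysis; dockerd dumps them when it receives SIGUSR1 and logs where the dump was written (sketch):

sudo kill -USR1 "$(pgrep -o dockerd)"
sudo journalctl -u docker.service --since '2 min ago' | grep -i 'goroutine stacks'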

PATTERN: Network Partition / DNS Failure

SIGNALS INVOLVED:

  • Container network errors increasing
  • Application errors for external service calls
  • DNS resolution failures inside containers
  • docker0 bridge or network interface issues
  • Health check failures for network-dependent services

NARRATIVE: Containers lose network connectivity or DNS resolution fails. Applications cannot reach databases, APIs, or other services. This may be caused by Docker network misconfiguration, iptables corruption, embedded DNS server failure, or external network issues. Containers appear healthy but are functionally broken.

SEVERITY: PAGE if affecting production services

DISTINGUISHING FEATURES:

  • Containers running and “healthy” but application failing
  • DNS errors in application logs
  • Network errors correlate with application errors
  • May affect all containers or only specific networks

COMMON CAUSES:

  • Docker embedded DNS server issue
  • iptables rules corrupted
  • Bridge interface misconfiguration
  • External DNS server unreachable
  • Network driver issue (overlay networks in Swarm)
  • MTU mismatch causing packet drops

FIRST RESPONSE:

  1. Test connectivity from inside container: docker exec <id> ping -c 3 <external_host>
  2. Test DNS resolution: docker exec <id> nslookup <hostname>
  3. Check bridge interface: ip link show docker0
  4. Check iptables: iptables -t nat -L DOCKER -n
  5. Restart Docker networking: systemctl restart docker (brief impact)
  6. If DNS issue, may need to restart containers to reinitialize DNS client
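
For step 2, containers on user-defined networks resolve through Docker's embedded DNS at 127.0.0.11; querying it directly separates embedded-DNS failures from upstream ones (sketch, assumes nslookup exists in the image):

docker exec <id> cat /etc/resolv.conf
docker exec <id> nslookup <hostname> 127.0.0.11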

PATTERN: Resource Throttling Storm

SIGNALS INVOLVED:

  • Container CPU throttling increasing
  • Container memory usage near limits
  • Application latency increasing
  • Health check failures due to slowness
  • High CPU/memory usage on host

NARRATIVE: Containers are hitting their CPU or memory limits and being throttled. This causes application slowdown, which causes health check failures, which may trigger restarts. If multiple containers are affected, they may be competing for host resources. The system appears running but is degraded.

SEVERITY: TICKET for gradual onset; PAGE for sudden severe degradation

DISTINGUISHING FEATURES:

  • Throttling metrics are primary indicator
  • Degradation correlates with resource pressure
  • May be gradual (creeping workload) or sudden (traffic spike)

COMMON CAUSES:

  • Resource limits set too low for workload
  • Traffic increase exceeding capacity
  • Inefficient code causing high resource usage
  • Memory leak causing increasing memory usage
  • Multiple containers competing for host resources

FIRST RESPONSE:

  1. Identify throttled containers: docker stats
  2. Check throttling metrics per container
  3. Compare usage to limits
  4. Increase limits if justified: docker update --cpus/--memory
  5. Investigate root cause of increased resource usage
  6. Consider horizontal scaling if workload increased
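
Per-container throttle counters live in the cpu cgroup; the ratio of nr_throttled to nr_periods shows how often the quota was hit (sketch; paths differ between cgroups v1 and v2):

# cgroups v1
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.stat
# cgroups v2 (systemd cgroup driver)
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat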

PATTERN: Zombie Container Accumulation

SIGNALS INVOLVED:

  • Container count increasing (especially exited/dead)
  • Disk usage growing (container layers, logs)
  • Dead containers appearing
  • Daemon operations slowing (more state to manage)

NARRATIVE: Containers are being created but not properly cleaned up. Exited containers accumulate. Some containers may be in “dead” state (failed removal). This consumes disk space, slows daemon operations, and may exhaust IP addresses or container name space. Cleanup requires manual intervention.

SEVERITY: TICKET for accumulation; PAGE if dead containers blocking operations

DISTINGUISHING FEATURES:

  • Exited container count growing
  • Dead containers appearing
  • No corresponding container removals

COMMON CAUSES:

  • Missing cleanup automation
  • Deployment process not cleaning old containers
  • Removal failures leaving containers in dead state
  • Orchestration system not tracking all containers

FIRST RESPONSE:

  1. Identify exited containers: docker ps -a --filter status=exited
  2. Identify dead containers: docker ps -a --filter status=dead
  3. Remove exited containers: docker container prune -f
  4. For dead containers, may need manual cleanup of /var/lib/docker/containers
  5. Investigate why cleanup isn’t happening automatically
  6. Implement cleanup automation if missing

SECTION 3 — Capacity & Saturation Leading Indicators


RESOURCE: Disk Space on /var/lib/docker

LEADING INDICATORS:

  • Docker Disk Usage growth rate (>1GB/day sustained)
  • Image count increasing
  • Container log sizes growing
  • Build cache size increasing
  • Volume sizes growing

DEGRADATION CURVE: Sudden cliff-edge. System functions normally until ~95% full, then degrades rapidly. At 100%, daemon may crash or become unresponsive.

RUNWAY ESTIMATION:

total_space_gb = $(df -BG --output=size /var/lib/docker | tail -1 | tr -d ' G')
current_usage_gb = $(df -BG --output=used /var/lib/docker | tail -1 | tr -d ' G')
daily_growth_gb = [calculated from trend]
days_to_full = (total_space_gb * 0.95 - current_usage_gb) / daily_growth_gb

HEADROOM DEFINITION:

  • Minimum 20% free space on /var/lib/docker filesystem
  • Or minimum 50GB free, whichever is larger
  • Growth rate should not exceed 2% per day of available space

RESOURCE: File Descriptors for Docker Daemon

LEADING INDICATORS:

  • FD count trending upward
  • Container count increasing
  • API connections not being released
  • Log file handles accumulating

DEGRADATION CURVE: Graceful until limit approached, then sudden failures. “Too many open files” errors appear. New containers cannot be created. API connections fail.

RUNWAY ESTIMATION:

current_fd = $(ls /proc/$(pgrep dockerd)/fd | wc -l)
fd_limit = $(cat /proc/$(pgrep dockerd)/limits | grep "open files" | awk '{print $4}')
fd_growth_per_day = [calculated from trend]
days_to_limit = (fd_limit * 0.8 - current_fd) / fd_growth_per_day

HEADROOM DEFINITION:

  • Keep FD usage below 50% of limit
  • Investigate any FD growth without corresponding workload increase
  • Consider increasing limit if legitimate growth

RESOURCE: Container IP Address Pool (Default Bridge)

LEADING INDICATORS:

  • Container count on default network increasing
  • IP allocation approaching subnet limit
  • Network creation failures

DEGRADATION CURVE: Graceful until pool exhausted, then container creation fails. Default bridge is typically 172.17.0.0/16 (65534 addresses). Custom networks have their own pools.

RUNWAY ESTIMATION:

# Count containers on default bridge
containers_on_bridge = $(docker network inspect bridge --format '{{range .Containers}}{{.Name}} {{end}}' | wc -w)
# Default pool size (varies)
pool_size = 65534
# Rough estimate
percent_used = containers_on_bridge / pool_size * 100

HEADROOM DEFINITION:

  • Keep IP pool usage below 50%
  • Use custom networks to distribute load
  • Consider smaller subnet allocation per network

RESOURCE: Daemon Memory

LEADING INDICATORS:

  • Daemon memory usage trending upward
  • Container/image count increasing
  • No memory recovery after cleanup operations

DEGRADATION CURVE: Gradual degradation as memory pressure increases. Go GC may cause pauses. In extreme cases, OOM kills daemon (catastrophic).

RUNWAY ESTIMATION:

daemon_memory_mb = $(ps -o rss= -p $(pgrep dockerd)) / 1024   # ps reports RSS in KiB
daily_growth_mb = [calculated from trend]
host_memory_mb = $(free -m | awk '/Mem:/ {print $2}')
days_to_oom = (host_memory_mb * 0.8 - daemon_memory_mb) / daily_growth_mb

HEADROOM DEFINITION:

  • Daemon memory should be stable; growth indicates leak
  • Should not exceed 1GB under normal operation
  • If growing, investigate and consider daemon restart during maintenance

SECTION 4 — Operational Edge Cases


Behaviors that look alarming but are normal:

  1. High disk usage after large deployment — Pulling many large images consumes space; this is expected. Monitor cleanup afterward.

  2. Container in “created” state — Containers exist in “created” state before “running”. This is normal during startup.

  3. Occasional container restart count increment — If restart policy is “always” or “on-failure”, some restarts are expected. Investigate patterns, not single events.

  4. Dangling images after build — Builds create intermediate images that become dangling. This is normal; cleanup is scheduled.

  5. Network interface flapping during container start/stop — veth interfaces are created/destroyed with containers. Brief carrier losses during this are normal.

  6. Daemon memory usage varying — Go’s garbage collector causes memory to fluctuate. Look for sustained growth, not variation.

  7. CPU spikes during image operations — Image pulls and builds are CPU-intensive. Transient spikes are expected.


Behaviors that look normal but are silently catastrophic:

  1. Stable container count with growing exited containers — Running containers are fine, but exited containers accumulating indicates cleanup failure. Eventually causes disk exhaustion.

  2. Low CPU usage but high throttling — Container appears idle but is being throttled. Application is running slowly but not crashing.

  3. Memory usage stable at 99% of limit — Container is technically within limit but has no buffer for spikes. One traffic burst causes OOM.

  4. Container “running” but health check not configured — Container appears healthy but application may be dead. Without health check, there’s no signal.

  5. Network errors at low rate — Small error rate (0.01%) seems negligible but causes application-level retries, latency variance, and occasional failures.

  6. Daemon responding but slow — Technically “up” but 5-second latency makes it unusable for automation and orchestration.


Cold start, warmup, and initialization behaviors:

  1. First container start after daemon boot is slow — Daemon initializes storage driver, network, and caches. First operation is slower.

  2. First API call after daemon start has latency spike — Internal initialization happens on first request. Subsequent calls are faster.

  3. Image pull before first container start — If image not cached, container start includes pull time. First start is much slower than subsequent.

  4. Volume initialization — First use of a named volume may include filesystem initialization (especially for certain volume drivers).

  5. Network creation — First container on a new network triggers network creation. Brief delay.

  6. Health check “starting” period — Containers with health checks start in “starting” state before becoming “healthy”. This is intentional, not a failure.


Signals critical during incidents but rarely proactively monitored:

  1. Container exit codes — Only examined after something breaks. Pattern analysis could predict issues.

  2. Dead containers — Discovered during incident investigation. Should be monitored proactively.

  3. Docker socket mounts — Security risk, often only audited after security incident.

  4. Container capability additions — Security-relevant but often invisible.

  5. Events stream gaps — During incident, realize events weren’t being captured.

  6. Image pull failures — Often assumed to “just work” until deployment fails.


Known instrumentation limitations:

  1. Blkio statistics incomplete — Depends on cgroup configuration; may not capture all I/O.

  2. Network stats per container — Requires mapping veth pairs; not all tools do this correctly.

  3. Memory usage includes cache — “Used” memory includes reclaimable cache; actual memory pressure is different.

  4. CPU percentage calculation — Requires sampling over time; single sample gives cumulative, not rate.

  5. Container log size — Not directly exposed in API; must inspect filesystem.

  6. Internal daemon state — Much internal state is invisible; only exposed via debug endpoints (often disabled).


Interactions with adjacent systems:

  1. Docker + systemd — systemd manages dockerd; systemd timeouts can kill daemon during slow operations. Daemon restart affects all containers briefly.

  2. Docker + iptables/nftables — Docker manages iptables rules; external firewall changes can conflict. Firewall flush can break Docker networking.

  3. Docker + NFS/Network Storage — /var/lib/docker on NFS is unsupported and problematic. Volumes can be NFS, but storage driver shouldn’t be.

  4. Docker + log aggregation — Log driver choice affects what’s visible in docker logs. journald logs may not appear in docker logs.

  5. Docker + orchestration (Swarm/K8s) — Orchestration systems may restart containers, making restart counts misleading. Orchestration adds its own signals.

  6. Docker + monitoring agents — Agents running in containers have different visibility than agents on host. cgroups v2 changes many metric paths.


SECTION 5 — Security & Integrity Signals


SIGNAL: Docker Socket Access

WHAT IT IS: Detection of processes accessing the Docker socket, which provides full control over Docker daemon.

SOURCE:

  • File: /proc/<pid>/fd/ (scan for socket references)
  • Audit: auditd rules for /var/run/docker.sock

HOW TO COLLECT IT MANUALLY:

# Find processes with Docker socket open
sudo lsof /var/run/docker.sock

# Via ss: unix sockets bound to the Docker socket path (shows the daemon side of each
# connection; mapping peer inodes back to client processes takes extra digging)
sudo ss -xp src /var/run/docker.sock

# If auditd configured
ausearch -f /var/run/docker.sock

WHAT IT TELLS YOU: Any process with access to Docker socket can control Docker daemon, effectively giving root access. Unexpected processes with socket access are security incidents.

SEVERITY:

  • PAGE: Unknown/unauthorized process accessing socket
  • TICKET: New process granted socket access
  • INFO: Baseline authorized processes

THRESHOLDS:

  • Only known, authorized processes should access socket
  • Any unexpected access = incident

FAILURE MODES DETECTED:

  • Unauthorized Docker control
  • Privilege escalation via Docker
  • Malicious container management

NUANCES & GOTCHAS:

  • Root user always has potential access
  • Container with socket mount appears as process inside container
  • CI/CD systems often need socket access

SIGNAL: Unauthorized API Access Attempts

WHAT IT IS: Failed authentication or authorization attempts against Docker API (if TLS/auth enabled).

SOURCE:

  • Daemon logs: journalctl -u docker.service
  • TLS access logs (if configured)

HOW TO COLLECT IT MANUALLY:

# Look for auth failures in daemon logs
journalctl -u docker.service | grep -iE "(unauthorized|forbidden|denied|auth)"

# If TCP socket enabled, check for connection attempts
ss -tlnp | grep 2375
ss -tlnp | grep 2376

WHAT IT TELLS YOU: Repeated unauthorized access attempts indicate scanning or attack. Any successful unauthorized access is a breach.

SEVERITY:

  • PAGE: Successful unauthorized access
  • TICKET: Repeated failed access attempts
  • INFO: Baseline access patterns

THRESHOLDS:

  • Any successful unauthorized access = incident
  • 10 failed attempts from single source in 1 minute = suspicious

FAILURE MODES DETECTED:

  • Brute force attempts
  • Credential compromise
  • Misconfigured access control

NUANCES & GOTCHAS:

  • Unix socket has no built-in auth (file permissions only)
  • TCP socket should be secured with TLS
  • Swarm mode adds additional auth mechanisms

SIGNAL: Sensitive Environment Variables

WHAT IT IS: Containers with sensitive data (passwords, API keys, tokens) in environment variables, which are visible in inspection and process listing.

SOURCE:

  • Docker API: GET /containers/{id}/json (Config.Env)
  • Command: docker inspect --format '{{.Config.Env}}' <container_id>

HOW TO COLLECT IT MANUALLY:

# List environment variables for container
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' <container_id>

# Find containers with potentially sensitive env vars
docker ps --format '{{.ID}}' | while read id; do
  env=$(docker inspect --format '{{range .Config.Env}}{{.}}{{end}}' $id)
  if echo "$env" | grep -qiE "(password|secret|token|key|api_key)"; then
    echo "SENSITIVE_ENV: $id"
  fi
done

# Via /proc inside container
docker exec <id> cat /proc/1/environ | tr '\0' '\n'

WHAT IT TELLS YOU: Environment variables with sensitive values are exposed via docker inspect, /proc filesystem, and process listing. This is a security risk. Secrets should use Docker secrets or external secret management.

SEVERITY:

  • TICKET: Sensitive data in environment variables
  • PLAN: Migrate to Docker secrets or external secret management
  • INFO: Audit of sensitive data handling

THRESHOLDS:

  • Target: zero sensitive data in environment variables
  • Any secret-like names in env vars = review needed

FAILURE MODES DETECTED:

  • Credential exposure via inspection
  • Credential exposure via logs
  • Credential exposure via process listing
  • Non-compliant secret handling

NUANCES & GOTCHAS:

  • Some legacy applications require env-based config
  • Docker secrets only available in Swarm or with specific run flags
  • Environment is the least secure option for secrets

SIGNAL: Container Image Provenance

WHAT IT IS: The source and trust status of container images. Untrusted or unknown-origin images are security risks.

SOURCE:

  • Docker API: GET /images/{name}/json (RepoTags, RepoDigests)
  • Docker Content Trust: docker trust inspect
  • Image labels and annotations

HOW TO COLLECT IT MANUALLY:

# List images and their sources
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.ID}}'

# Check if image has verified signature (DCT)
docker trust inspect --pretty <image>:<tag>

# Check image labels for provenance
docker inspect --format '{{json .Config.Labels}}' <image>:<tag> | jq .

# Find unsigned images (if DCT enabled)
export DOCKER_CONTENT_TRUST=1
docker images --format '{{.Repository}}:{{.Tag}}' | while read img; do
  docker trust inspect "$img" > /dev/null 2>&1 || echo "UNTRUSTED: $img"
done

WHAT IT TELLS YOU: Images from untrusted sources or without verified signatures may contain vulnerabilities or malicious code. Running unsigned images is risky. Images should come from trusted registries with content trust.

SEVERITY:

  • TICKET: Unsigned/unverified images in production
  • TICKET: Images from untrusted registries
  • PLAN: Implement Docker Content Trust
  • INFO: Image provenance audit

THRESHOLDS:

  • Production: only signed images from trusted registries
  • Development: some flexibility but track sources
  • Any :latest tag in production = review needed

FAILURE MODES DETECTED:

  • Malicious images from untrusted sources
  • Compromised images in trusted registry
  • Supply chain attacks
  • Unpinned images changing unexpectedly

NUANCES & GOTCHAS:

  • Docker Content Trust requires explicit enablement
  • Some registries have their own signing mechanisms
  • Digest-pinned images are more verifiable than tag-pinned
  • Base image vulnerabilities affect all derived images

SECTION 6 — Monitoring Maturity Levels


LEVEL 1 — SURVIVAL

The absolute minimum to know if Docker is alive and not on fire:

  1. Docker Daemon Process Health — Is dockerd running?
  2. Docker API Health Check Endpoint — Is daemon responding?
  3. Docker Disk Usage (Total) — Is /var/lib/docker filling up?
  4. Container Count by State — How many are running/exited?
  5. Container Restart Count — Is anything crash-looping?
  6. Container OOM Killed Status — Did anything die from memory issues?

These 6 signals tell you: is the daemon working, is there disk space, and are containers running without crashing. Without these, you are flying blind.


LEVEL 2 — OPERATIONAL

What a competent team running Docker in production monitors:

  1. All Level 1 signals
  2. Docker Disk Usage breakdown — Images, containers, volumes, build cache separately
  3. Container CPU Usage — Per-container CPU consumption
  4. Container Memory Usage — Per-container memory vs limits
  5. Container Network I/O — Basic throughput per container
  6. Docker Daemon Response Latency — Is the daemon slow?
  7. Container Health Check Status — Are containers actually healthy?
  8. Docker Daemon Errors in Logs — Any errors in daemon logs?
  9. Container Exit Codes — Why did containers stop?
  10. Image Pull Rate/Failures — Are deployments working?

These signals give you visibility into resource consumption, performance, and reliability. You can detect and diagnose most common issues.


LEVEL 3 — MATURE

Full coverage: internals, leading indicators, composite patterns:

  1. All Level 2 signals
  2. Container CPU Throttling — Are containers being limited?
  3. Container Network Errors — Packet loss and errors
  4. Container Block I/O — Disk usage per container
  5. Docker Daemon File Descriptor Count — Approaching limits?
  6. Docker Daemon Memory Usage — Daemon memory footprint
  7. Docker Storage Driver Status — Health of overlay2/etc
  8. Docker Network Bridge Status — Network infrastructure health
  9. Dangling Images Count — Cleanup needed?
  10. Orphaned Volumes Count — Data volumes not in use
  11. Container Start Latency — How long to start containers
  12. Container Operations Rate — Churn rate
  13. Docker Events Stream Liveness — Are events flowing?
  14. Docker Build Cache Size — Build cache consumption

At this level you have leading indicators, can predict capacity issues, and have detailed performance visibility.
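
Several of these have one-line collectors; a rough sketch for dangling images, orphaned volumes, and per-category disk usage including build cache:

# Dangling images and unused (orphaned) volumes
docker images -f dangling=true -q | wc -l
docker volume ls -f dangling=true -q | wc -l

# Per-category disk usage, including build cache
docker system df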


LEVEL 4 — EXPERT

Deep signals that experienced operators add after incidents:

  1. All Level 3 signals
  2. Docker Daemon Goroutine Count — Internal concurrency health
  3. Privileged Container Count — Security risk tracking
  4. Containers with Host Network — Network isolation bypass
  5. Containers with Host Path Mounts — Filesystem exposure
  6. Container Capabilities List — Capability audit
  7. Docker Socket Access — Who can control Docker
  8. Sensitive Environment Variables — Secret exposure
  9. Container Image Provenance — Trust and verification
  10. Docker Daemon Version/API Version — Fleet consistency
  11. Per-container log file sizes — Disk consumption from logs
  12. Network namespace leaks — Leaked ns after container stop
  13. Storage driver performance metrics — Overlay2-specific stats
  14. Container density per host — Host packing efficiency
  15. Layer sharing efficiency — How much layer reuse

At this level you have security observability, can detect subtle resource leaks, understand performance deeply, and have comprehensive audit capability.
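
A rough audit one-liner for the privileged and host-namespace signals, built from standard inspect fields:

# Flag privileged containers and host network/PID namespace usage
docker ps -q | xargs -r docker inspect --format \
  '{{.Name}} privileged={{.HostConfig.Privileged}} net={{.HostConfig.NetworkMode}} pid={{.HostConfig.PidMode}}' \
  | grep -E 'privileged=true|net=host|pid=host'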


SECTION 7 — What Most Teams Get Wrong


1. Not monitoring container logs disk consumption

Container logs (json-file driver) are stored under /var/lib/docker/containers/<container-id>/ and can grow unbounded. Most teams monitor total disk usage but not the specific contribution of logs. A verbose application can fill the disk with logs while all other metrics look normal.

What to do: Monitor per-container log file sizes directly, or configure log rotation (max-size, max-file) and monitor the total size of /var/lib/docker/containers.
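
A quick way to see the per-container log contribution, plus per-container rotation flags; the sizes are illustrative.

# Largest container log files first (json-file driver default location)
du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -rh | head -10

# Per-container rotation at run time; daemon-wide defaults belong in /etc/docker/daemon.json
docker run -d --log-opt max-size=50m --log-opt max-file=3 <image>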


2. Ignoring exited containers until disk exhaustion

Exited containers consume disk space (writable layers, logs) but don’t show up in docker ps (only docker ps -a). Teams often have hundreds of exited containers accumulating, then hit disk issues suddenly.

What to do: Monitor exited container count and total size. Implement automated cleanup. Alert on accumulation rate.
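
To see what has accumulated (the size shown is the writable layer, not shared image layers):

# Count exited containers and show their writable-layer sizes
docker ps -a -f status=exited -q | wc -l
docker ps -a -s -f status=exited --format 'table {{.Names}}\t{{.Size}}\t{{.Status}}'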


3. Not tracking container restart counts until something breaks

Restart count is a leading indicator of instability, but most teams only look at it during incidents. A container restarting occasionally is invisible until it’s restarting constantly.

What to do: Alert on any restart count increase for production containers. Track restart rate over time.
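
A minimal check for nonzero restart counts across running containers:

# List any running container whose restart count is above zero
docker ps -q | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' | awk '$2 > 0'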


4. Assuming “running” means “healthy”

Containers can be in “running” state while the application is deadlocked, waiting for missing dependencies, or otherwise non-functional. Without health checks, you have no visibility.

What to do: Configure meaningful health checks for all containers. Monitor health check status, not just container state.
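
Health state is exposed both as a docker ps filter and in inspect output; a sketch, assuming jq is available:

# Containers currently reporting unhealthy (only containers that define a HEALTHCHECK have health state)
docker ps --filter health=unhealthy --format '{{.Names}}\t{{.Status}}'

# Last health probe output for a specific container
docker inspect --format '{{json .State.Health.Log}}' <container> | jq '.[-1]'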


5. Not monitoring Docker daemon latency

A daemon that responds in 5 seconds instead of 50ms is technically “up” but causes orchestration timeouts, slow deployments, and operational frustration. Most teams only check if the daemon process exists.

What to do: Monitor daemon API response latency. Alert on degradation, not just failure.
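
One lightweight probe is the daemon's /_ping endpoint over the local socket (default socket path assumed):

# Time the daemon's ping endpoint; alert on sustained increases, not one-off spikes
curl --unix-socket /var/run/docker.sock -s -o /dev/null -w 'ping: %{time_total}s\n' http://localhost/_ping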


6. Missing network errors because they’re rare

A 0.01% packet error rate seems trivial but causes application-level retries, latency variance, and occasional failures. These errors are often invisible in high-level metrics.

What to do: Monitor network error counters, not just throughput. Any nonzero error rate warrants investigation.
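
Error and drop counters do not appear in docker stats output, but the stats API exposes them; a sketch, assuming jq is available (host-network containers have no .networks section):

# Per-interface error/drop counters for one container
curl --unix-socket /var/run/docker.sock -s \
  "http://localhost/containers/<container>/stats?stream=false" \
  | jq '.networks | map_values({rx_errors, tx_errors, rx_dropped, tx_dropped})'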


7. Not monitoring CPU throttling

Containers with CPU quotas can be throttled without appearing to use much CPU. The application is slow but metrics show low usage. This is confusing and often misdiagnosed.

What to do: Monitor throttling metrics (nr_throttled, throttled_time) for containers with CPU quotas.
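
The counters live in the container's CPU cgroup; the exact path depends on cgroup version and cgroup driver, so both common layouts are sketched:

# cgroup v1 (cgroupfs driver): nr_periods, nr_throttled, throttled_time
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat

# cgroup v2 (systemd driver): nr_throttled, throttled_usec
cat /sys/fs/cgroup/system.slice/docker-<full-container-id>.scope/cpu.stat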


8. Blind trust of images from Docker Hub

Many teams pull images directly from Docker Hub without verification. These images may be outdated, vulnerable, or in rare cases, malicious.

What to do: Use approved base images from trusted registries. Pin to digests, not tags. Implement vulnerability scanning. Consider Docker Content Trust.
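
Digest pinning was sketched earlier under the image provenance signal; for scanning, any image scanner works. trivy is shown purely as an example and is assumed to be installed separately.

# Scan an image for known CVEs (example scanner; substitute your own)
trivy image <image>:<tag>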


9. Not understanding memory metrics

Docker memory “usage” includes cache, which is reclaimable. Teams often see high memory usage and panic, or see low memory usage and miss OOM risk because they don’t understand the metrics.

What to do: Monitor RSS (resident set size) alongside total usage, and understand how your runtime (JVM, etc.) reports memory.
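
The stats API separates cache from RSS; a sketch assuming jq is available. The field names shown are the cgroup v1 ones; under cgroup v2 the equivalents are file/anon.

# Usage vs cache vs RSS for one container
curl --unix-socket /var/run/docker.sock -s \
  "http://localhost/containers/<container>/stats?stream=false" \
  | jq '{usage: .memory_stats.usage, cache: .memory_stats.stats.cache, rss: .memory_stats.stats.rss}'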


10. Monitoring containers but not the daemon

Teams often have detailed container monitoring but nothing about the Docker daemon itself. Daemon issues affect all containers but are invisible to container-level monitoring.

What to do: Monitor daemon health, latency, memory, FDs, and errors with the same rigor as containers.
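
The daemon can expose Prometheus metrics about itself; a sketch of enabling and spot-checking it. The address is illustrative, and older releases also require the experimental flag for metrics-addr.

# /etc/docker/daemon.json (then restart dockerd):
#   { "metrics-addr": "127.0.0.1:9323" }
curl -s http://127.0.0.1:9323/metrics | grep -E '^(engine_daemon|go_goroutines|process_open_fds)' | head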


11. No alerting on “dead” containers

Containers in “dead” state cannot be removed normally and require manual intervention. They’re rare enough that teams don’t notice them until they accumulate.

What to do: Alert on any container in “dead” state. They indicate prior daemon or cleanup issues.
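
status=dead is a valid docker ps filter, so the check is a one-liner:

# Any container stuck in the dead state
docker ps -a --filter status=dead --format '{{.Names}}\t{{.Status}}'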


12. Assuming disk cleanup will always work

Teams rely on docker system prune for disk management but don’t test it. When disk is critically full, cleanup operations themselves may fail due to lack of working space.

What to do: Monitor disk usage at lower thresholds (70-80%). Don’t wait until 95% to clean up. Test cleanup procedures.
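
A conservative cleanup sketch that can be rehearsed before it is needed; the age filter is illustrative.

# Remove stopped containers, dangling images, unused networks, and build cache older than 24h
docker system prune -f --filter "until=24h"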


13. Not monitoring build cache on CI runners

CI/CD runners that build images accumulate build cache rapidly. This is often discovered only when the runner runs out of disk space and builds start failing.

What to do: Monitor build cache size. Implement regular cache pruning. Consider cache limits.
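
With BuildKit, the cache can be inspected and capped directly; the size limit is illustrative.

# Inspect BuildKit cache usage, then prune it down to a target size
docker buildx du
docker builder prune -f --keep-storage 10GB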


14. Missing security signals entirely

Most Docker monitoring is performance-focused. Security-relevant signals (privileged containers, socket mounts, capability additions) are invisible until a security incident.

What to do: Add security signals to monitoring. Audit privileged containers, sensitive mounts, and capability additions regularly.
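
Two quick audits for socket mounts and added capabilities, built from standard inspect fields:

# Containers with the Docker socket mounted
docker ps -q | xargs -r docker inspect --format '{{.Name}} {{range .Mounts}}{{.Source}} {{end}}' | grep docker.sock

# Containers with added capabilities (an empty list prints as [])
docker ps -q | xargs -r docker inspect --format '{{.Name}} {{.HostConfig.CapAdd}}' | grep -v '\[\]$'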


15. No fleet-wide version tracking

Docker versions drift across hosts. This causes inconsistent behavior, API incompatibilities, and vulnerability exposure. Most teams don’t track versions systematically.

What to do: Monitor Docker versions across all hosts. Alert on version drift. Track CVEs for Docker versions.
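
Per-host collection is one command; feed it through whatever fleet tooling you already have (ssh loop, Ansible, etc.):

# Engine and API version for this host
docker version --format 'server={{.Server.Version}} api={{.Server.APIVersion}} os={{.Server.Os}}/{{.Server.Arch}}'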