$ guides / docker ▌

DOCKER · OPERATIONS PLAYBOOK

Running Docker in production, without surprises

What Docker is doing under the hood, where it tends to break, what to monitor as your operation matures, and which mistakes to avoid before they become incidents.

> Start with the monitoring checklist → # Jump to the full guide list

Docker is easy to start with and surprisingly easy to run blind.

A container can be "up" while the application inside it is broken. The Docker daemon can be alive but too wedged to answer docker ps. A host can look healthy until /var/lib/docker fills and every container on it starts failing at once. CPU usage can look normal while CFS throttling quietly destroys latency. Logs can grow for weeks and then take the host down in an afternoon.

These guides are written for engineers who already run Docker, not for people learning what a container is. The goal is to give you the mental model, the failure patterns, the monitoring story, and the runbooks you wish someone had handed you before your last incident.

How Docker actually runs in production

Docker is not one thing. It is a stack of cooperating components, and most production failures happen between these layers, not inside any single one of them.

docker CLI / API clients

Where requests come in. Your CLI, your CI/CD system, your orchestrator, anything talking to the Docker socket.

USER

dockerd

The management plane. Handles images, networks, volumes, lifecycle, and log routing.

DAEMON

containerd

The runtime manager underneath dockerd. Owns container execution and lifecycle.

RUNTIME

containerd-shim

Per-container supervisor. Lets your containers survive a daemon restart.

SUPERVISOR

runc

OCI runtime. Sets up namespaces, cgroups, mounts, and seccomp policy when a container starts.

OCI

Linux kernel

The real workhorse. cgroups, namespaces, overlay filesystems, bridges, iptables. Docker is mostly a friendly interface to these.

KERNEL

your container process

PID 1 inside the container, plus its children. The workload itself.

CONTAINER

Why this matters: a container can keep running while the daemon is hung. The daemon can answer /_ping while docker ps blocks. A memory kill comes from the kernel, not Docker. A network problem may live in iptables, in embedded DNS, in conntrack, or in the application, and each one looks different from the outside.

The failures you'll actually see

Most Docker incidents are not exotic. They cluster into a small set of recurring patterns. Recognise the shape, and triage gets dramatically faster.

IMMINENT

The disk filled silently

Images, writable layers, volumes, build cache, metadata, and (often) logs all share /var/lib/docker. When that filesystem fills, every container on the host starts failing at once.

new containers fail to start
image pulls fail
running containers cannot write
containers stuck in dead or removing state

Investigate →

ACTIVE

The container death spiral

A container crashes, the restart policy brings it back, it crashes again. Logs flood, restart count climbs, and the actual root cause hides behind the restart loop.

high restart count
exit code 137 / 143 / 139 / 1
OOMKilled = true
health check failing

Investigate →

CRITICAL

The daemon wedged

dockerd is up but unresponsive. systemd reports it active, the process exists, but docker ps hangs and you cannot manage anything.

docker ps / inspect hangs
running containers keep going
storage driver stalls
internal lock contention

Investigate →

IMMINENT

The network black hole

Bridges, veth pairs, iptables/NAT, embedded DNS, overlay networks. They all have to line up. When one piece is wrong the container looks healthy while the path around it is broken.

DNS resolution fails inside containers
container-to-container traffic blocked
published ports unreachable
conntrack table fills under load

Investigate →

WATCHFUL

The hidden CPU throttle

Average CPU usage looks normal while the kernel keeps pausing the container against its CFS quota. Latency rises; the CPU graph is calm.

p95 / p99 latency climbs
request timeouts under load
slow health checks
throttled time keeps rising

Investigate →

ACTIVE

The OOM cascade

A container hits its memory limit. The kernel kills a process. Docker reports it after the fact, often after corrupted state, interrupted transactions, or restarts that hide the root cause.

exit code 137
OOMKilled = true
child killed, PID 1 still alive
approaching memory limit

Investigate →

Docker monitoring maturity levels

Docker monitoring works in four practical levels. Each level is a complete operation, not a stepping stone you must climb. Pick the level that matches how much your Docker reliability matters and how much investment your team has the bandwidth for. Most production teams should aim for the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is Docker still working? You will not learn what broke or why, but you will learn that something broke before users do. Survival is enough for hobby clusters, dev environments, and small teams running a handful of containers where Docker reliability is not in the critical path. It is not enough if Docker runs your customers, your pipeline, or your revenue.

Docker daemon responsiveness Can the Docker API answer basic requests at all?
Docker data directory disk usage Is /var/lib/docker close to full?
Container state Are expected containers running, or exited, dead, restarting?
Container restart count Is anything quietly crash-looping?
OOMKilled status Did the kernel just kill a container for memory pressure?
Host CPU, memory, disk, I/O Is the host itself under pressure?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production Docker hosts should target. Once survival signals tell you something is wrong, operational signals tell you what. With this coverage your team can usually diagnose an incident on its own: restart loops, OOM kills, disk filling, network drops, daemon slowness. If you only invest in one level above survival, this is the one to invest in.

Container exit codes Why did a container actually stop?
Memory usage vs limits How close is each container to its OOM ceiling?
CPU usage per container Which workload is consuming the host?
CPU throttling Are CFS quotas the real cause of latency?
Container log file size Are logs about to fill the host?
Daemon response latency Is dockerd getting slow before it hangs?
Daemon error logs Storage, network, or runtime errors surfacing?
Image and volume disk breakdown What is actually consuming /var/lib/docker?
Image pull failures Are deployments blocked on registry or network?
Container network errors Are packets dropping or connections failing?

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up. Memory creeping toward a limit, daemon latency drifting upward, conntrack tables slowly filling, container start latency growing, lifecycle event rates climbing under invisible load. None of these will page you on day one. They turn into pages on day thirty if no one is watching. Mature monitoring is for teams that have already been bitten by leading-indicator failures and want to spot the next one early.

Memory breakdown: anon, file, slab Is growth real workload memory, cache, or kernel memory?
Writable layer growth Is the application writing into the container layer instead of a volume?
Container PID count and zombies Is PID exhaustion or missing child reaping creeping in?
Daemon file descriptor usage Is dockerd approaching its FD limits?
Daemon goroutine count Is internal concurrency growing abnormally?
DNS health from inside containers Does resolution work where the application lives?
conntrack utilisation Are new connections at risk of being dropped?
Storage driver health Are overlay2 errors appearing in the daemon logs?
Lifecycle event rate Is container churn quietly stressing the host?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals are reactive, not aspirational. Each one tends to enter your monitoring stack the day after a specific incident proved you needed it. Daemon pprof captures, conntrack auditing, Pressure Stall Information, AppArmor drift detection, sub-second cgroup analysis. Most teams never need every signal at this level. Add the ones your incident history tells you to add. Adopting the full list without a reason is a way to spend engineering time on noise.

Daemon pprof profiles Heap, goroutine, mutex captures during pathological events.
containerd shim health Per-container supervisor responsiveness and exit signals.
overlay2 layer count per image Layer fan-out causing storage driver pressure.
iptables and NAT rule count Rule scale impacting packet processing latency.
cgroup period-level CPU analysis Sub-second throttling patterns hidden by averages.
Pressure Stall Information (PSI) Kernel signal when CPU, memory, or I/O are blocking work.
seccomp, AppArmor, capability audit Drift in container security posture over time.
Docker socket access audit Who and what is talking to /var/run/docker.sock?
Sensitive environment variable scan Secrets that ended up in container env by accident.

Operating mistakes worth avoiding

The traps teams keep falling into. Each of these has a clear, well-known fix. Most teams only learn it after an incident.

No log rotation

The default json-file driver grows unbounded. One noisy container can fill the host. Configure rotation early, not after the first incident.

Alerting on CPU usage but not CPU throttling

High CPU is obvious. Throttling is dangerous because it looks like application latency while average CPU appears acceptable.

Treating total memory as the whole story

Container memory includes reclaimable cache. Split anonymous, file, slab, and what the application actually owns.

Restart policies as a substitute for reliability

Restart policies keep workloads alive but hide repeated failures. Restart count is a signal, not noise.

Running Docker without disk hygiene

Old images, unused volumes, build cache, stopped containers, logs. They all accumulate. Cleanup policies and disk monitoring are not optional.

Exposing the Docker socket casually

Mounting /var/run/docker.sock into a container gives that container control over the host. Treat it like root access.

Docker runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

▸

Start here

▸

Container failures

▸

Logs and storage

▸

Networking

▸ DNS not working in containers →

WHERE TO GO NEXT

Setting up Docker monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.

> Start with the checklist > Back to Operations Guides