Docker
How Docker actually runs in production, where it tends to break, what to monitor as your operation matures, and runbooks for the failures you'll actually see.
Production guides for engineers who get paged. What breaks first, which signals matter, how to move from symptom to root cause without guessing. We publish one technology at a time, each as a self-contained playbook.
Built for the technologies you actually run
Most monitoring content explains what a technology is. These guides explain how it actually fails.
Production systems break in patterns. The disk fills. The daemon wedges. The kernel kills a process the application thought it owned. Latency climbs while CPU graphs look fine. The same handful of failure modes show up across teams, across companies, across years.
Each section here is built around those patterns: a hub page that explains how the technology fails in production, plus focused runbooks for each common symptom. No fluff. No "what is a container" intros. Just the things you wish someone had written down before your last incident.
Each section is a self-contained operations playbook for one technology. Open one to dig in, or check what's coming next.
How the control plane and node-side runtime actually behave in production, where clusters tend to break, and runbooks for the failures you'll actually see.
How MVCC, WAL, and autovacuum actually behave together under load, where the system tends to break, and runbooks for the connection, lock, vacuum, replication, and checkpoint failures you'll actually see.
How workers, connection slots, and upstreams actually behave under load, where NGINX tends to break, and runbooks for the connection, FD, backend-cascade, buffer, and TLS failures you'll actually see.
Fork/COW storms, event-loop wedges, replication backlog overflows, memory pressure spirals.
ISR shrinks, leaderless partitions, consumer lag, controller overload, broker disk exhaustion.
Heap pressure death spirals, unassigned shards, disk watermark cascades, mapping explosions, and the cluster-health states you'll actually debug.
Content roadmap for upcoming sections. Each will follow the same incident-first structure: mental model, failure patterns, monitoring maturity, and runbooks.