Operations Guides
Production guides for engineers who get paged. What breaks first, which signals matter, how to move from symptom to root cause without guessing. We publish one technology at a time, each as a self-contained playbook.
Built for the technologies you actually run
Most monitoring content explains what a technology is. These guides explain how it fails in production.
Production systems break in patterns. The disk fills. The daemon wedges. The kernel kills a process the application thought it owned. Latency climbs while CPU graphs look fine. The same handful of failure modes show up across teams, across companies, across years.
Each section here is built around those patterns: a hub page that explains how the technology fails in production, plus focused runbooks for each common symptom. No fluff. No "what is a container" intros. Just the things you wish someone had written down before your last incident.
Sections
Each section is a self-contained operations playbook for one technology. Open one to dig in, or check what's coming next.
Docker
How Docker runs in production, where it tends to break, what to monitor as your operation matures, and runbooks for the failures you are most likely to see.
Kubernetes
Cluster state, kubelet, kube-proxy, API server, CoreDNS. Control plane and worker failure patterns.
PostgreSQL
Connection exhaustion, lock cascades, autovacuum starvation, transaction ID wraparound, checkpoint storms.
NGINX
Connection and FD exhaustion, backend cascade failure, buffer spill to disk, SSL CPU saturation.
Redis
Fork/COW storms, event-loop wedge, replication backlog overflow, memory pressure spirals.
Kafka
ISR shrinks, leaderless partitions, consumer lag, controller overload, broker disk exhaustion.
What's coming next
Content roadmap for upcoming sections. Each will follow the same incident-first structure: mental model, failure patterns, monitoring maturity, and runbooks.