
OPERATIONS · INCIDENT-FIRST

Operations Guides

Production guides for engineers who get paged. What breaks first, which signals matter, how to move from symptom to root cause without guessing. We publish one technology at a time, each as a self-contained playbook.

Built for the technologies you actually run

Most monitoring content explains what a technology is. These guides explain how it actually fails.

Production systems break in patterns. The disk fills. The daemon wedges. The kernel kills a process the application thought it owned. Latency climbs while CPU graphs look fine. The same handful of failure modes show up across teams, across companies, across years.

Each section here is built around those patterns: a hub page that explains how the technology fails in production, plus focused runbooks for each common symptom. No fluff. No "what is a container" intros. Just the things you wish someone had written down before your last incident.

Sections

Each section is a self-contained operations playbook for one technology. Open one to dig in, or check what's coming next.

Available now

Docker

How Docker runs in production, where it tends to break, what to monitor as your operation matures, and runbooks for the failures you'll actually see.

Start with a common failure

Open the Docker playbook→

Available now

Kubernetes

How the control plane and node-side runtime behave in production, where clusters tend to break, and runbooks for the failures you'll actually see.

Start with a common failure

Open the Kubernetes playbook→

Available now

PostgreSQL

How MVCC, WAL, and autovacuum behave together under load, where PostgreSQL tends to break, and runbooks for the connection, lock, vacuum, replication, and checkpoint failures you'll actually see.

Start with a common failure

Open the PostgreSQL playbook→

Available now

NGINX

How workers, connection slots, and upstreams behave under load, where NGINX tends to break, and runbooks for the connection, FD, backend-cascade, buffer, and TLS failures you'll actually see.

Start with a common failure

Open the NGINX playbook→

Available now

Redis

Fork/COW storms, event-loop wedge, replication backlog overflow, memory pressure spirals.

Start with a common failure

Open the Redis playbook→

Available now

Kafka

ISR shrinks, leaderless partitions, consumer lag, controller overload, broker disk exhaustion.

Start with a common failure

Open the Kafka playbook→

Available now

Elasticsearch

Heap pressure death spirals, unassigned shards, disk watermark cascades, mapping explosions, and the cluster-health states you'll actually debug.

Start with a common failure

Open the Elasticsearch playbook→

What's coming next

Content roadmap covering published and planned sections. Each follows the same incident-first structure: mental model, failure patterns, monitoring maturity, and runbooks.

Status         Technology     Focus
Available now  Docker         How Docker actually fails in production
Available now  Kubernetes     Cluster state, kubelet, kube-proxy, API server, CoreDNS
Available now  PostgreSQL     Connections, locks, autovacuum, replication, wraparound, checkpoints
Available now  NGINX          Connection exhaustion, FD limits, backend cascade, SSL CPU
Available now  Redis          Fork storms, event-loop wedge, replication backlog
Available now  Kafka          ISR shrinks, leaderless partitions, controller overload
Available now  Elasticsearch  Heap pressure, shard overallocation, watermark cascades
Planned        MySQL          Redo log, connection exhaustion, replication lag
Planned        MongoDB        Cache pressure, oplog window collapse, connection storms
Planned        ClickHouse     Merge debt, memory exhaustion, replication lag, Keeper saturation