$ guides / kubernetes ▌

KUBERNETES · OPERATIONS PLAYBOOK

Running Kubernetes in production, without the 3 a.m. surprises

What the control plane is doing under the hood, where clusters tend to break, what to monitor as your operation matures, and which mistakes to stop making before the next incident.

> Start with the monitoring checklist → # Jump to the full guide list

Kubernetes is easy to demo and hard to operate.

A pod can be Running while the application inside it is broken. A node can report Ready while its kubelet is too wedged to start anything new. The API server can pass /healthz while etcd is too slow to answer real writes. A service can have endpoints and still be unreachable because conntrack is full, an iptables sync is stalled, or NetworkPolicy quietly denied the packet. A pod evicted for memory pressure tells you something — a pod evicted because another pod broke a node tells you something different.

These guides are written for engineers who already run Kubernetes, not for people learning what a pod is. The goal is to give you the mental model of the control plane, the failure patterns that keep recurring, the monitoring story that catches issues before they page anyone, and the runbooks you wish someone had handed you before your last incident.

How Kubernetes actually runs in production

Kubernetes is not one thing. It is a stack of cooperating components, and most production failures happen between these layers, not inside any single one of them.

kubectl / API clients

Where requests come in. Your CLI, controllers, operators, dashboards, CI systems, anything talking to the API server.

USER

kube-apiserver

The control-plane front door. Authenticates, authorizes, admits, validates, and writes. The only thing that talks to etcd.

API

etcd

Cluster state of record. Quorum-based, fsync-bound. Every Kubernetes object lives here.

STATE

scheduler & controllers

Decision makers. Watch desired state, compute placement, drive reconciliation toward actual state.

CONTROL

kubelet

Per-node agent. Watches the API for assigned pods, pulls images, supervises the runtime, reports node status.

NODE

runtime + CNI + CSI

containerd or CRI-O runs pods; the CNI plugin wires the network; the CSI driver provides storage.

RUNTIME

Linux kernel

The real workhorse. cgroups, namespaces, iptables/IPVS, conntrack, overlayfs. Kubernetes is mostly a friendly interface to these.

KERNEL

your pod processes

PID 1 inside the container, plus children. The workload itself.

POD

Why this matters: a pod can be Running while the application is broken. NodeReady can be true while the kubelet's PLEG is stalling. The API server can answer /healthz while etcd is too slow to commit writes. A 'service not reachable' might live in kube-proxy, endpoints, DNS, NetworkPolicy, or the application — and each one looks different from the outside.

The failures you'll actually see

Most Kubernetes incidents are not exotic. They cluster into a small set of recurring patterns. Recognise the shape, and triage gets dramatically faster.

CRITICAL

The control-plane bottleneck

The API server is up but slow. kubectl responses crawl, controllers stop reconciling, the scheduler queue drains too slowly. Underneath it is usually etcd latency, a misbehaving admission webhook, or APF starvation of legitimate traffic.

kubectl latency climbs
controller queue depth grows
etcd backend commit duration spikes
APF rejected requests increase

Investigate →

IMMINENT

The node death by eviction

One node hits MemoryPressure or DiskPressure. The kubelet evicts pods. Replacement pods land on other nodes, push them over their thresholds, and the cascade keeps going. The cluster looks alive while replicas vanish.

evictions across multiple nodes
NodeNotReady flapping
pods rescheduled onto pressure
PLEG response time rising

Investigate →

ACTIVE

The pod that won't run

A workload never reaches Ready. The pod is stuck in Pending, ContainerCreating, ImagePullBackOff, or CrashLoopBackOff. Each state has a different root cause and a different runbook.

replicas stuck below desired
restartCount climbing
ImagePullBackOff or ErrImagePull
PodScheduled = false

Investigate →

IMMINENT

The silent network black hole

Services with endpoints that aren't reachable. Pod-to-pod traffic dropping. Conntrack tables full, iptables sync stalled, CNI plugin in an inconsistent state, NetworkPolicy denying silently. Most of these look healthy from outside the data path.

service connections time out
conntrack utilisation climbs
iptables-restore latency rises
pods reach external but not cluster IPs

Investigate →

WATCHFUL

The DNS chase

Resolution inside pods is slow or intermittent. Applications hit a 5-second resolver timeout. Upstream services get classified as flaky when the real problem is CoreDNS, the cluster DNS service IP, or ndots.

5-second tail latency on outbound calls
CoreDNS request rate spikes
kube-dns service endpoint unhealthy
pod resolv.conf points wrong

Investigate →

CRITICAL

The certificate clock

kubelet or API-server certs expire without warning. Nodes drop out one at a time. kubectl stops authenticating. There is no preceding load event, no obvious trigger, just a date on a certificate nobody was watching.

x509: certificate has expired
node NotReady, one at a time
kubelet TLS handshake errors
control plane refuses connections

Investigate →

Kubernetes monitoring maturity levels

Kubernetes monitoring works in four practical levels. Each level is a complete operation, not a stepping stone you must climb. Pick the level that matches how much your cluster's reliability matters and how much investment your team has the bandwidth for. Most production clusters should aim for the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is the cluster still functioning? You will not learn what broke or why, but you will learn that something broke before users do. Survival is enough for hobby clusters, dev environments, and teams running stateless workloads where Kubernetes reliability is not in the critical path.

API server availability Does /readyz answer, and how fast?
etcd availability Are all members healthy and reporting?
Node Ready count How many nodes are flagged Ready right now?
Pod state distribution Running vs Pending vs Failed across the cluster.
Workload deployment readiness Are the critical deployments at desired replicas?
Node disk and memory utilisation Is any node close to eviction thresholds?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production clusters should target. Once survival signals tell you something is wrong, operational signals tell you what. With this coverage your team can usually diagnose an incident on its own: scheduling failures, evictions, control-plane latency, network drops, image pull issues. If you only invest in one level above survival, this is the one to invest in.

API server request latency p99 Is the control plane slowing down before it fails?
etcd backend commit duration Is the cluster state store healthy?
Pending pods by failure reason Why are pods not being scheduled?
Pod restart count and exit reasons Where are the crash loops, and why?
Eviction events with reason Memory, disk, PID, or something else?
Node conditions (Ready, Disk, Memory, PID) Which pressure signals are firing?
Service endpoint readiness Are services backed by healthy pods?
Image pull failure rate per node Are deployments blocked at the registry layer?
kubelet error log rate Runtime, network, volume issues surfacing?
PVC bind state Are persistent volumes attaching successfully?

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up. APF throttling drifting upward, kubelet certificates approaching expiry, conntrack tables filling, iptables sync time creeping, controller queue depth growing under invisible load. None of these will page you on day one. They turn into pages on day thirty if no one is watching.

APF throttling per priority level Is legitimate traffic being shed before failure?
PLEG relisting duration Is the kubelet keeping pace with pod events?
kubelet certificate expiry Days until silent failure?
conntrack utilisation per node Are new connections at risk of being dropped?
iptables sync time per node Is kube-proxy struggling to apply rules?
DNS query latency from pods Is CoreDNS slowing down inside the cluster?
Volume attach / mount duration How long does CSI take to bring storage up?
etcd disk fsync p99 Is the underlying disk fast enough for etcd?
Controller queue depth Are controllers keeping up with reconciliation?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals are reactive, not aspirational. Each one tends to enter your stack the day after a specific incident proved you needed it. kubelet pprof captures, scheduler attempt latency by predicate, audit log forensics, network policy hit-miss tracking, etcd compaction frequency. Most teams never need every signal at this level. Add the ones your incident history tells you to add.

kubelet pprof captures Heap, goroutine, mutex profiles during pathological events.
Scheduler attempt latency by predicate Which scheduling step is the bottleneck?
Webhook latency p99 Which admission webhook is slowing the API server?
etcd compaction and defrag history Is the keyspace growing pathologically?
Audit log analysis Who is changing what, when, and from where?
NetworkPolicy hit/miss telemetry Are policies allowing or denying as intended?
Pod startup phase breakdown Image pull, mount, network, sandbox, runtime — where is the time going?
Service account token rotation lifecycle Are bound tokens rotating and being honoured?

Operating mistakes worth avoiding

The traps teams keep falling into. Each has a clear, well-known fix. Most teams only learn it after an incident.

No resource requests on workloads

Pods without requests get evicted first under pressure and are scheduled greedily by the scheduler. Set CPU and memory requests for every production workload, not just limits.

Watching only API server availability

The API server can answer /readyz while etcd is gasping. Watch etcd commit duration and the API-server-to-etcd latency separately.

Ignoring webhook failurePolicy and timeouts

One slow admission webhook can stall the entire control plane. Set failurePolicy and timeoutSeconds explicitly, alert on webhook latency, and exclude critical namespaces from webhooks that need them.

Treating Node Ready as healthy

Ready only means the kubelet is reporting. The node may still be degraded (PLEG slow, runtime stalls, disk pressure). Watch every node condition individually, not just the Ready summary.

No NetworkPolicy default-deny

In a multi-tenant or microservice cluster, default-allow is a blast-radius hazard. Default-deny per namespace and explicitly allow the traffic each workload needs.

Skipping certificate rotation drills

Most clusters auto-rotate kubelet and apiserver certs. A few don't. The day they expire silently, you find out which kind you have. Verify rotation works on a non-production cluster before you need it.

Kubernetes runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

▸

Start here

▸

Pod lifecycle

▸

Node health

▸

Networking and DNS

▸

Storage

WHERE TO GO NEXT

Setting up Kubernetes monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.

> Start with the checklist > Back to Operations Guides