How the Kubernetes control plane works: a mental model for operators
The Kubernetes control plane is often drawn as a single box labeled “Master.” In production, that abstraction fails the moment kubectl hangs, a node disappears, or a Deployment stays pending. The control plane is not a monolith. It is a pipeline of specialized components that hand off state through a single API server, and understanding those handoffs is what lets you triage a cluster outage without guessing.
At the center is kube-apiserver. Every other component except etcd is either a client of the API server or a plugin that the kubelet calls on the node; etcd sits behind the API server, which is its only client. When you run kubectl apply, the request enters the API server, is validated and persisted to etcd, and then reaches every interested component as a watch event. The scheduler, controllers, kubelets, and kube-proxy are all reactive watchers. They do not talk to each other directly. They talk to the API server, compare what they see to what they expect, and write corrections back.
This article walks the component graph and the request path. After reading it, you should be able to map a symptom, such as Pods stuck Pending or nodes flapping NotReady, to the component responsible, and know which signals to inspect first.
flowchart LR
A[Client] -->|mutate| B[kube-apiserver]
B -->|write| C[etcd]
C -->|watch| B
B -->|unscheduled| D[scheduler]
B -->|assigned| E[kubelet]
B -->|state drift| F[controllers]
D -->|bind| B
F -->|reconcile| B
E -->|status| B

kube-apiserver: the front door
kube-apiserver is a stateless HTTP/HTTPS front-end. Every kubectl command, controller action, kubelet report, and operator request flows through it. It does not schedule pods or run containers. It authenticates, authorizes, validates, and persists.
Inside the API server, every request walks a pipeline. Authentication confirms identity. API Priority and Fairness (APF), enabled by default since Kubernetes 1.20 and stable since 1.29, then classifies the request into a flow schema and priority level and queues it if necessary, which prevents a single noisy client from starving the control plane. Authorization checks the request against the configured authorizers, usually RBAC. Admission controllers mutate and validate the object. The REST storage layer transforms it for etcd, and the etcd client writes it. For reads, the watch cache serves LIST and WATCH from memory when possible, avoiding etcd reads.
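The write path through these stages can be pictured as a chain of functions in which the first rejection ends the request. The following is a minimal Python sketch, not the API server's actual code: `Request`, `TOKENS`, `RBAC`, `STORE`, and the stage functions are hypothetical names invented for illustration, the status codes are only HTTP-like, and APF and the watch cache are left out for brevity.

```python
from dataclasses import dataclass


class Rejected(Exception):
    """Raised when a stage refuses the request; carries an HTTP-like status."""

    def __init__(self, status: int, reason: str):
        super().__init__(f"{status} {reason}")
        self.status = status


@dataclass
class Request:
    token: str
    verb: str
    resource: str
    obj: dict
    user: str = ""


# Hypothetical stand-ins for real identity data, RBAC rules, and etcd.
TOKENS = {"secret-abc": "alice"}
RBAC = {("alice", "create", "deployments")}
STORE = {}


def authenticate(req: Request) -> None:
    user = TOKENS.get(req.token)
    if user is None:
        raise Rejected(401, "Unauthorized")
    req.user = user


def authorize(req: Request) -> None:
    if (req.user, req.verb, req.resource) not in RBAC:
        raise Rejected(403, "Forbidden")


def admit(req: Request) -> None:
    # Mutating step: default a label. Validating step: require a name.
    req.obj.setdefault("labels", {}).setdefault("managed-by", "sketch")
    if not req.obj.get("name"):
        raise Rejected(422, "name is required")


def persist(req: Request) -> None:
    STORE[f"/{req.resource}/{req.obj['name']}"] = req.obj


PIPELINE = [authenticate, authorize, admit, persist]


def handle(req: Request) -> int:
    """Run every stage in order; the first rejection ends the request."""
    try:
        for stage in PIPELINE:
            stage(req)
    except Rejected as err:
        return err.status
    return 201
```

The point of the shape is diagnostic: the status a client sees tells you how far the request got, so a 401 or 403 never reached admission or storage.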
The API server scales horizontally in an active-active arrangement behind a load balancer. Each instance maintains its own watch cache. If one instance is unhealthy, the others continue serving traffic, but their caches are independent, so brief inconsistencies during high churn are normal.
etcd: the source of truth
etcd is the only component that stores persistent cluster state. The API server is its sole client. Every write is a serialized transaction that must be fsynced to the write-ahead log before it is acknowledged.
etcd runs the Raft consensus protocol. A majority of members, the quorum, must elect the leader and acknowledge every write before it commits. That is why clusters use an odd member count such as 3, 5, or 7: adding a fourth member to a three-member cluster raises the quorum from 2 to 3 without letting the cluster survive any additional failures. Losing a majority means quorum loss: etcd stops accepting writes, and the cluster becomes read-only at best or entirely unavailable. etcd disk I/O is one of the most common root causes of control plane slowness. Because every write waits for a WAL fsync, a slow disk or a noisy neighbor on the same volume creates a latency cascade that backs up the API server's inflight requests.
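The majority rule is simple arithmetic, and working through it shows why even member counts buy nothing. A small Python sketch follows; the function names are illustrative and not part of any etcd API.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority still remains."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)} failures")
```

Going from 3 to 4 members, or from 5 to 6, raises the quorum but leaves the number of tolerated failures unchanged, so the extra member only adds replication cost.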
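etcd also enforces a database size quota, 2 GB by default and often raised, up to the commonly recommended maximum of 8 GB. When the database exceeds the quota, etcd raises a NOSPACE alarm and rejects all writes. Compaction and defragmentation reclaim space, but they are separate operations: compaction discards old revisions, defragmentation returns the freed space to the filesystem, and the alarm must then be disarmed before writes resume. etcd v3.4 reaches end of support in May 2026; supported production versions are v3.5 and v3.6.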
Scheduler and controllers: the reconcile loop
Kubernetes is a distributed state machine. The desired state lives in etcd as API objects. The current state lives on the nodes as running containers. Two components close the gap: the scheduler and the controller manager.
kube-controller-manager runs dozens of controller loops (Deployment, ReplicaSet, node, service account, and others) inside a single binary. Each controller watches specific resource types via the API server and uses an internal workqueue to process events. Only one instance is active at a time. The others stand by, and leader election through the Lease API decides which one takes over. If the workqueue depth grows and stays high, the controller is falling behind and actual state is lagging desired state.
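A toy model helps explain why workqueue depth is a useful signal. The Python sketch below assumes a simplified queue that only deduplicates keys and a reconcile function that compares replica counts; `WorkQueue`, `reconcile`, and `drain` are hypothetical names, and real controllers add rate limiting, retries, and parallel workers that are omitted here.

```python
from collections import deque


class WorkQueue:
    """FIFO of object keys; a key that is already waiting is not queued twice."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, key: str) -> None:
        if key not in self._pending:
            self._pending.add(key)
            self._queue.append(key)

    def get(self) -> str:
        key = self._queue.popleft()
        self._pending.discard(key)
        return key

    def __len__(self) -> int:
        return len(self._queue)


def reconcile(key: str, desired: dict, actual: dict) -> int:
    """Level-triggered: compare current state, not the event that woke us up."""
    want = desired.get(key, 0)
    have = actual.get(key, 0)
    if have != want:
        actual[key] = want  # stand-in for creating or deleting Pods
    return want - have


def drain(queue: WorkQueue, desired: dict, actual: dict) -> list:
    """Process every queued key once and report the correction applied."""
    changes = []
    while len(queue):
        key = queue.get()
        changes.append((key, reconcile(key, desired, actual)))
    return changes
```

Because five events for the same key collapse into one queue entry, a depth that keeps growing means many distinct objects are waiting, not that one object is noisy.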
kube-scheduler watches for unscheduled Pods, those with an empty spec.nodeName. It filters nodes that cannot run the Pod and scores the remainder. The scheduler maintains an internal queue with three sub-queues: activeQ, backoffQ, and unschedulableQ. Like the controller manager, only one scheduler instance is active via leader election. If the scheduler is unhealthy or starved of API server concurrency, new Pods accumulate in Pending.
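The filter-then-score flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scheduler's plugin framework: `Node`, `Pod`, and the scoring rule (prefer the node with the most CPU left after placement) are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_m: int    # millicores still allocatable
    free_mem_mi: int   # MiB still allocatable
    schedulable: bool = True


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int


def filter_nodes(pod: Pod, nodes: List[Node]) -> List[Node]:
    """Filter phase: drop nodes that cannot run the Pod at all."""
    return [
        n for n in nodes
        if n.schedulable and n.free_cpu_m >= pod.cpu_m and n.free_mem_mi >= pod.mem_mi
    ]


def score(pod: Pod, node: Node) -> int:
    """Score phase (assumed rule): more CPU left after placement is better."""
    return node.free_cpu_m - pod.cpu_m


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Return the chosen node name, or None if the Pod must stay Pending."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score(pod, n)).name
```

A `None` result corresponds to the Pod landing in the unschedulable queue: the scheduler is healthy, but no node passes the filters.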
cloud-controller-manager runs cloud-specific loops separately from the core binary. It is typically absent from on-premises clusters.
Kubelet and the node boundary
The kubelet is the node agent. It receives Pod specifications through a watch on the API server and turns them into running containers via the container runtime interface (CRI). Its sync loop reconciles desired state against what is actually running.
Inside the kubelet, the Pod Lifecycle Event Generator (PLEG) polls the container runtime to detect container starts, stops, and deaths. If a PLEG relist takes longer than three minutes, the kubelet reports PLEG as unhealthy and the node goes NotReady. The kubelet also manages volumes through the Container Storage Interface (CSI), image pulls, probe execution, and pod evictions under resource pressure. Network setup goes through the Container Network Interface (CNI), which the container runtime invokes when it creates the Pod sandbox.
The kubelet reports node conditions and Pod status back to the API server. It also renews a Lease object in the kube-node-lease namespace as a heartbeat. If lease renewals and status updates stop, the node lifecycle controller in kube-controller-manager sets the node's Ready condition to Unknown after a grace period, and kubectl shows the node as NotReady. A kubelet may be up to three minor versions older than the API server, but it must never be newer.
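Both rules in this paragraph reduce to small comparisons. The Python sketch below is illustrative only: the function names are hypothetical, and the 40-second grace period is an assumed value chosen for the example, so check your controller manager's configuration for the real one.

```python
from datetime import datetime, timedelta, timezone


def lease_is_stale(renew_time: datetime, now: datetime,
                   grace: timedelta = timedelta(seconds=40)) -> bool:
    """True when the last heartbeat is older than the grace period (assumed 40 s)."""
    return now - renew_time > grace


def kubelet_skew_ok(apiserver_minor: int, kubelet_minor: int, max_older: int = 3) -> bool:
    """The kubelet may trail the API server by up to max_older minor versions, never lead it."""
    return 0 <= apiserver_minor - kubelet_minor <= max_older


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(lease_is_stale(now - timedelta(seconds=10), now))  # recent heartbeat
print(kubelet_skew_ok(apiserver_minor=31, kubelet_minor=28))
```

The staleness check is why a node that is running fine but cannot reach the API server looks identical, from the control plane's side, to a node that has crashed.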
kube-proxy, which also runs on every node, is not part of the core control plane but completes the service abstraction. It watches Services and EndpointSlices and programs iptables or IPVS rules so that traffic to a ClusterIP reaches a healthy backend Pod.
The request lifecycle from kubectl to container
When you run kubectl apply -f deployment.yaml, the following chain begins:
- The API server receives the request, authenticates and authorizes it, runs admission webhooks, and writes the Deployment object to etcd.
- etcd confirms the write and emits a watch event.
- The Deployment controller sees the new Deployment, creates a ReplicaSet, and writes it to the API server.
- The ReplicaSet controller sees the ReplicaSet, creates Pod objects, and writes them.
- The scheduler sees unscheduled Pods, selects a node for each, and binds the Pod by creating a Binding through the API server, which sets spec.nodeName.
- The kubelet on the assigned node sees the Pod via its watch, mounts any CSI volumes, calls the CRI to create the Pod sandbox (the runtime invokes the CNI plugin to set up networking at this point), pulls the image, and starts the container.
- The kubelet reports Pod status back to the API server.
- The EndpointSlice controller sees the Pod is Ready and updates endpoints.
- kube-proxy programs rules so traffic to the Service ClusterIP reaches the new Pod.
Every step is asynchronous. The API server broadcasts events, and each consumer reacts independently. There is no central orchestrator directing the sequence. This design is resilient but means that failure at any hop produces distinct symptoms.
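The chain above can be simulated with an in-memory store that fans out change events to registered watchers. The Python sketch below is a toy under stated assumptions: `ApiServer`, the callback names, and the fixed node name `node-1` are hypothetical, and images, volumes, networking, and Services are left out entirely.

```python
from collections import deque


class ApiServer:
    """In-memory stand-in: stores objects and fans out change events."""

    def __init__(self):
        self.objects = {}
        self.watchers = []
        self.events = deque()
        self.log = []

    def watch(self, kind, callback):
        self.watchers.append((kind, callback))

    def write(self, kind, name, obj, actor):
        self.objects[(kind, name)] = obj
        self.log.append(f"{actor} wrote {kind}/{name}")
        self.events.append((kind, name, obj))

    def run(self):
        # Deliver events until no watcher has anything left to correct.
        while self.events:
            kind, name, obj = self.events.popleft()
            for watched_kind, callback in self.watchers:
                if watched_kind == kind:
                    callback(self, name, obj)


def deployment_controller(api, name, dep):
    rs = f"{name}-rs"
    if ("ReplicaSet", rs) not in api.objects:
        api.write("ReplicaSet", rs, {"replicas": dep["replicas"]}, "deployment-controller")


def replicaset_controller(api, name, rs):
    for i in range(rs["replicas"]):
        pod = f"{name}-{i}"
        if ("Pod", pod) not in api.objects:
            api.write("Pod", pod, {"nodeName": "", "phase": "Pending"}, "replicaset-controller")


def scheduler(api, name, pod):
    if not pod["nodeName"]:
        api.write("Pod", name, {**pod, "nodeName": "node-1"}, "scheduler")


def kubelet(api, name, pod):
    if pod["nodeName"] == "node-1" and pod["phase"] == "Pending":
        api.write("Pod", name, {**pod, "phase": "Running"}, "kubelet")


def demo():
    api = ApiServer()
    api.watch("Deployment", deployment_controller)
    api.watch("ReplicaSet", replicaset_controller)
    api.watch("Pod", scheduler)
    api.watch("Pod", kubelet)
    api.write("Deployment", "web", {"replicas": 2}, "kubectl")
    api.run()
    return api
```

Calling `demo().log` yields the sequence of writes in order. No component in the sketch calls another; each one only reads events and writes objects, which mirrors the absence of a central orchestrator.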
How this mental model helps in incident triage
When something breaks, the symptom tells you which component to inspect first.
- kubectl hangs or returns 500s: Start with API server latency, then etcd disk latency. If etcd WAL fsync is slow, the API server is waiting, and every mutating request queues up.
- 429 Too Many Requests: Check APF queue depth and inflight request counts. A single misbehaving controller can starve legitimate traffic.
- Pods stuck Pending: Check scheduler health, pending pod queues, and node allocatable resources. If the scheduler is healthy but the unschedulableQ is growing, the cluster is out of capacity or constrained by affinity rules.
- Pods stuck ContainerCreating: Look at the kubelet. Check PLEG relist duration, CRI operation latency, image pull failures, volume mount errors, and CNI health.
- Nodes flapping NotReady: Check kubelet logs for PLEG timeouts, container runtime disconnections, API server connectivity, and certificate expiration.
- Service timeouts after a rollout: Check kube-proxy sync duration, conntrack table utilization, and whether stale endpoint rules are sending traffic to terminated Pods.
Understanding the pipeline tells you whether the problem is in serialization (API server or etcd), reconciliation (controllers or scheduler), or execution (kubelet, runtime, CNI).
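The symptom list above is effectively a lookup table, and writing it down as one makes a handy runbook seed. The Python sketch below copies the symptoms and checks from the list; the `TRIAGE` and `LAYER` names and the `triage` helper are hypothetical, and kube-proxy is marked unclassified because the layer breakdown above does not place it.

```python
# Symptom -> (first component to inspect, first signals to check).
TRIAGE = {
    "kubectl hangs or returns 500s": (
        "kube-apiserver", ["API server latency", "etcd disk latency"]),
    "429 Too Many Requests": (
        "kube-apiserver", ["APF queue depth", "inflight request counts"]),
    "Pods stuck Pending": (
        "scheduler", ["scheduler health", "pending pod queues", "node allocatable resources"]),
    "Pods stuck ContainerCreating": (
        "kubelet", ["PLEG relist duration", "CRI operation latency",
                    "image pull failures", "volume mount errors", "CNI health"]),
    "Nodes flapping NotReady": (
        "kubelet", ["PLEG timeouts", "container runtime disconnections",
                    "API server connectivity", "certificate expiration"]),
    "Service timeouts after a rollout": (
        "kube-proxy", ["kube-proxy sync duration", "conntrack table utilization",
                       "stale endpoint rules"]),
}

# Component -> pipeline layer, following the breakdown in the text.
LAYER = {
    "kube-apiserver": "serialization",
    "etcd": "serialization",
    "controllers": "reconciliation",
    "scheduler": "reconciliation",
    "kubelet": "execution",
}


def triage(symptom: str) -> dict:
    """Return the first component, its layer, and the first checks for a symptom."""
    component, checks = TRIAGE[symptom]
    return {
        "component": component,
        "layer": LAYER.get(component, "unclassified"),
        "checks": checks,
    }
```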
Signals to watch in production
| Signal | Why it matters | Warning sign |
|---|---|---|
| etcd WAL fsync latency | Every cluster write blocks on this | p99 > 100 ms sustained |
| API server mutating request latency | End-to-end write health | p99 > 1 s sustained |
| APF queue depth | Priority level saturation | Non-zero for system or leader-election flows |
| Controller workqueue depth | Reconciliation lag | Growing for more than 5 minutes |
| Scheduler pending pods | Scheduling backlog | unschedulableQ growing |
| Kubelet PLEG relist duration | Runtime responsiveness on the node | > 10 s, approaching the 3 min threshold |
| kube-proxy sync duration | Service rule staleness | Exceeding the configured sync period |
| etcd database size | Storage quota risk | > 75 % of configured quota |
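
The numeric thresholds in the table can be encoded as a simple check. The Python sketch below covers only the rows with a fixed numeric limit; the dictionary keys are hypothetical labels invented for this example, not real metric names from any exporter, and the "sustained" qualifier is not modeled.

```python
# Warning thresholds taken from the table; keys are illustrative labels only.
THRESHOLDS = {
    "etcd_wal_fsync_p99_seconds": 0.100,
    "apiserver_mutating_p99_seconds": 1.0,
    "kubelet_pleg_relist_seconds": 10.0,
    "etcd_db_size_quota_ratio": 0.75,
}


def warnings(sample: dict) -> list:
    """Return the names of signals whose value exceeds its warning threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if sample.get(name, 0) > limit
    )


sample = {
    "etcd_wal_fsync_p99_seconds": 0.250,
    "apiserver_mutating_p99_seconds": 0.4,
    "kubelet_pleg_relist_seconds": 2.0,
    "etcd_db_size_quota_ratio": 0.80,
}
print(warnings(sample))
```

In practice each check would run against a window of samples rather than a single reading, since the table's thresholds apply to sustained conditions.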
How Netdata helps
Netdata surfaces the cross-component correlations that make this mental model actionable.
- It collects API server request latency and etcd WAL fsync latency on the same timeline, so you can see whether a control plane slowdown starts in the storage layer or in admission control.
- It tracks per-node kubelet PLEG relist duration and CRI operation latency, letting you distinguish a runtime problem from an API server connectivity issue.
- It monitors APF queue depth and inflight request counts, exposing saturation before clients are flooded with 429 responses.
- It alerts on etcd leader changes and database size growth rate, providing runway before a quota alarm.
- It maps controller workqueue depth and scheduler pending queues to deployment and scaling health.