How the Kubernetes control plane works: a mental model for operators

The Kubernetes control plane is often drawn as a single box labeled “Master.” In production, that abstraction fails the moment kubectl hangs, a node disappears, or a Deployment's Pods stay Pending. The control plane is not a monolith. It is a pipeline of specialized components that hand off state through a single API server, and understanding those handoffs is what lets you triage a cluster outage without guessing.

At the center is kube-apiserver. It is the only component that talks to etcd. Everything else is either a client of the API server or a plugin that the kubelet or container runtime calls on the node. When you run kubectl apply, the request enters the API server, is validated, and is persisted to etcd. The change is then delivered as a watch event to every component watching that resource type. The scheduler, controllers, kubelets, and kube-proxy are all reactive watchers. They do not talk to each other directly. They talk to the API server, compare what they see to what they expect, and write corrections back.

This article walks the component graph and the request path. After reading it, you should be able to map a symptom, such as Pods stuck Pending or nodes flapping NotReady, to the specific component that is actually responsible, and know which signals to inspect first.

flowchart LR
    A[Client] -->|mutate| B[kube-apiserver]
    B -->|write| C[etcd]
    C -->|watch| B
    B -->|unscheduled| D[scheduler]
    B -->|assigned| E[kubelet]
    B -->|state drift| F[controllers]
    D -->|bind| B
    F -->|reconcile| B
    E -->|status| B

kube-apiserver: the front door

kube-apiserver is a stateless HTTP/HTTPS front-end. Every kubectl command, controller action, kubelet report, and operator request flows through it. It does not schedule pods or run containers. It authenticates, authorizes, validates, and persists.

Inside the API server, every request walks a pipeline. Authentication confirms identity. API Priority and Fairness (APF), enabled by default since Kubernetes 1.20 and stable since 1.29, classifies the request into a flow schema and priority level and queues it when that level is saturated, which prevents a single noisy client from starving the control plane. Authorization checks RBAC. Admission controllers mutate and validate the object. The REST storage layer transforms it for etcd, and the etcd client writes it. On the read side, the watch cache serves LIST and WATCH from memory when possible, avoiding etcd reads.
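The staged pipeline can be pictured as a chain of functions where the first rejection stops the request. The sketch below is a toy model, not the API server's actual handler chain: the stage functions, the `KNOWN_USERS` and `RBAC` sets, and the `STORE` dictionary standing in for etcd are all hypothetical, and APF and the watch cache are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str
    verb: str
    resource: str
    obj: dict = field(default_factory=dict)


class Rejected(Exception):
    """Raised by a stage to stop the request with an HTTP-like status code."""

    def __init__(self, status, reason):
        super().__init__(f"{status} {reason}")
        self.status = status


# Hypothetical in-memory stand-ins for authenticators, RBAC rules, and etcd.
KNOWN_USERS = {"alice", "bob"}
RBAC = {("alice", "create", "deployments")}
STORE = {}


def authenticate(req):
    if req.user not in KNOWN_USERS:
        raise Rejected(401, "Unauthorized")


def authorize(req):
    if (req.user, req.verb, req.resource) not in RBAC:
        raise Rejected(403, "Forbidden")


def mutate(req):
    # Mutating admission: default a field the client omitted.
    req.obj.setdefault("replicas", 1)


def validate(req):
    # Validating admission: reject objects that break an invariant.
    if req.obj["replicas"] < 0:
        raise Rejected(422, "replicas must be >= 0")


def persist(req):
    # Only requests that survive every earlier stage reach storage.
    STORE[(req.resource, req.obj["name"])] = dict(req.obj)


PIPELINE = [authenticate, authorize, mutate, validate, persist]


def handle(req):
    """Run each stage in order; the first rejection short-circuits the rest."""
    try:
        for stage in PIPELINE:
            stage(req)
    except Rejected as err:
        return err.status
    return 201


print(handle(Request("alice", "create", "deployments", {"name": "web"})))
print(handle(Request("bob", "create", "deployments", {"name": "web2"})))
print(handle(Request("mallory", "create", "deployments", {"name": "web3"})))
```

The useful property for triage is that each stage fails with a different status, so the response code already hints at which stage rejected the request.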

The API server scales horizontally in an active-active arrangement behind a load balancer. Each instance maintains its own watch cache. If one instance is unhealthy, the others continue serving traffic, but their caches are independent, so brief inconsistencies during high churn are normal.

etcd: the source of truth

etcd is the only component that stores persistent cluster state. The API server is its sole client. Every write is a serialized transaction that must be fsynced to the write-ahead log before it is acknowledged.

etcd runs the Raft consensus protocol. A majority of members (a quorum) must agree to elect a leader and to commit each write. Adding a member to make the count even raises the quorum size without increasing the number of failures the cluster survives, which is why production clusters use an odd member count: 3, 5, or 7. Losing a majority means quorum loss, and the cluster becomes read-only or entirely unavailable. etcd disk I/O is the single most common root cause of control plane slowness. Because every write waits for a WAL fsync, slow disk or a noisy neighbor on the same volume creates a latency cascade that backs up the API server’s inflight requests.
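The preference for odd member counts follows from simple majority arithmetic. As a small illustration (the function names are made up for this sketch), the quorum for n members is floor(n/2) + 1, and the number of members that can fail is whatever remains:

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can be lost while a majority still remains."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the extra member adds replication cost without adding resilience.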

etcd also enforces a database size quota, 2 GB by default and commonly raised, with 8 GB as the suggested maximum. When the database exceeds the quota, etcd raises a NOSPACE alarm and rejects writes until space is reclaimed and the alarm is disarmed. Compaction and defragmentation both help reclaim space, but they are separate operations: compaction discards old revisions, and defragmentation returns the freed space to the filesystem. etcd v3.4 reaches end of support in May 2026; supported production versions are v3.5 and v3.6.
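To make the quota concrete, the following sketch estimates how full the database is and how long until it reaches the quota at a constant growth rate. It is a back-of-the-envelope calculation, not an etcd API: `quota_status` and its parameters are hypothetical names, and real growth is rarely linear.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def quota_status(db_bytes, quota_bytes, growth_bytes_per_hour):
    """Return (fraction of quota used, hours until the quota is reached).

    Hours is None when the database is not growing.
    """
    used = db_bytes / quota_bytes
    if growth_bytes_per_hour <= 0:
        return used, None
    remaining = max(quota_bytes - db_bytes, 0)
    return used, remaining / growth_bytes_per_hour


# Illustrative numbers: 1.5 GiB used of a 2 GiB quota, growing 64 MiB per hour.
used, hours_left = quota_status(1.5 * GIB, 2 * GIB, 64 * MIB)
print(f"{used:.0%} of quota used, about {hours_left:.1f} hours of runway")
```

Tracking the runway rather than only the absolute size gives time to compact and defragment before the alarm fires.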

Scheduler and controllers: the reconcile loop

Kubernetes is a distributed state machine. The desired state lives in etcd as API objects. The current state lives on the nodes as running containers. Two components close the gap: the scheduler and the controller manager.

kube-controller-manager runs dozens of controller loops (deployment, replica set, node, service account, and others) inside a single binary. Each controller watches specific resource types via the API server and uses an internal workqueue to process events. Only one instance is active at a time. The others wait via leader election using the Lease API. If the workqueue depth grows and stays high, the controller is falling behind, and actual state lags further behind desired state.
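The watch-plus-workqueue pattern can be sketched in a few lines. This is a simplified model, not the client-go workqueue: the `WorkQueue` class, the `desired` and `actual` dictionaries, and the `reconcile` function are hypothetical. It shows two properties the text relies on: repeated events for the same key collapse into one queue entry, and the handler compares desired with actual state instead of trusting the event payload.

```python
from collections import deque


class WorkQueue:
    """FIFO of keys that drops duplicates while a key is still waiting."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, key):
        if key not in self._pending:
            self._pending.add(key)
            self._queue.append(key)

    def get(self):
        key = self._queue.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._queue)


# Hypothetical state: replicas wanted per workload versus replicas running.
desired = {"web": 3, "api": 2}
actual = {"web": 1, "api": 2}


def reconcile(key):
    """Level-triggered: look at current state, not at what the event said."""
    want = desired.get(key, 0)
    have = actual.get(key, 0)
    if have != want:
        actual[key] = want  # stand-in for creating or deleting Pods
    return want - have


queue = WorkQueue()
for event_key in ["web", "web", "api", "web"]:  # a burst of watch events
    queue.add(event_key)

depth_before = len(queue)
corrections = {}
while len(queue):
    key = queue.get()
    corrections[key] = reconcile(key)

print(depth_before, corrections, actual)
```

Queue depth here is the number of distinct keys waiting, which is why a depth that keeps growing signals that reconciliation cannot keep up with the rate of change.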

kube-scheduler watches for unscheduled Pods, those with an empty spec.nodeName. It filters nodes that cannot run the Pod and scores the remainder. The scheduler maintains an internal queue with three sub-queues: activeQ, backoffQ, and unschedulableQ. Like the controller manager, only one scheduler instance is active via leader election. If the scheduler is unhealthy or starved of API server concurrency, new Pods accumulate in Pending.
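The filter-then-score flow can be illustrated with a toy placement function. The real scheduler runs many plugins for each phase; the `Node` fields, the `feasible` check, and the headroom-based `score` below are simplifying assumptions chosen only to show the two phases.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_m: int    # free CPU in millicores
    free_mem_mi: int   # free memory in MiB
    schedulable: bool = True


def feasible(node, cpu_m, mem_mi):
    """Filter phase: drop nodes that cannot run the Pod at all."""
    return node.schedulable and node.free_cpu_m >= cpu_m and node.free_mem_mi >= mem_mi


def score(node, cpu_m, mem_mi):
    """Score phase (toy heuristic): prefer the most headroom after placement."""
    return (node.free_cpu_m - cpu_m, node.free_mem_mi - mem_mi)


def schedule(nodes, cpu_m, mem_mi):
    """Return the chosen node name, or None if the Pod must stay Pending."""
    candidates = [n for n in nodes if feasible(n, cpu_m, mem_mi)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, cpu_m, mem_mi)).name


nodes = [
    Node("node-a", free_cpu_m=500, free_mem_mi=1024),
    Node("node-b", free_cpu_m=2000, free_mem_mi=4096),
    Node("node-c", free_cpu_m=4000, free_mem_mi=8192, schedulable=False),
]
print(schedule(nodes, cpu_m=1000, mem_mi=512))
print(schedule(nodes, cpu_m=8000, mem_mi=512))
```

The second call returns None because no node passes the filter phase, which corresponds to a Pod that stays Pending even though the scheduler itself is healthy.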

cloud-controller-manager runs cloud-specific loops separately from the core binary. It is typically absent from on-premises clusters that have no cloud provider integration.

Kubelet and the node boundary

The kubelet is the node agent. It receives Pod specifications through a watch on the API server and turns them into running containers via the container runtime interface (CRI). Its sync loop reconciles desired state against what is actually running.

Inside the kubelet, the Pod Lifecycle Event Generator (PLEG) polls the container runtime to detect container starts, stops, and deaths. If a relist takes longer than three minutes, the kubelet reports PLEG as unhealthy and the node goes NotReady. The kubelet also manages volumes through the Container Storage Interface (CSI), image pulls, probe execution, and pod evictions under resource pressure. Pod network setup is delegated to Container Network Interface (CNI) plugins, which the container runtime invokes when it creates the Pod sandbox.

The kubelet reports node conditions and Pod status back to the API server. It also renews a Lease object in the kube-node-lease namespace as a heartbeat. If lease renewals stop, the node controller in kube-controller-manager sets the node's Ready condition to Unknown after a grace period, and kubectl shows the node as NotReady. A kubelet may be up to three minor versions older than the API server, but it must never be newer.
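Both rules in this paragraph reduce to simple comparisons. The sketch below is illustrative only: the function names are hypothetical, and the 40-second grace period is an assumed value for the example, since the actual grace period is configurable and varies by version.

```python
from datetime import datetime, timedelta, timezone

# Assumed grace period for this example only.
GRACE = timedelta(seconds=40)


def lease_expired(renew_time, now, grace=GRACE):
    """True when the last heartbeat is older than the grace period."""
    return now - renew_time > grace


def kubelet_skew_ok(apiserver_minor, kubelet_minor, max_older=3):
    """The kubelet may trail the API server by up to max_older minors, never lead it."""
    return 0 <= apiserver_minor - kubelet_minor <= max_older


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=10)
stale = now - timedelta(seconds=90)

print(lease_expired(fresh, now), lease_expired(stale, now))
print(kubelet_skew_ok(31, 28), kubelet_skew_ok(31, 27), kubelet_skew_ok(30, 31))
```

Note that the staleness check runs on the control plane side, so a node can be marked NotReady because of a network partition even when the kubelet itself is healthy.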

kube-proxy, which also runs on every node, is not part of the core control plane but completes the service abstraction. It watches Services and EndpointSlices and programs iptables or IPVS rules so that traffic to a ClusterIP reaches a healthy backend Pod.

The request lifecycle from kubectl to container

When you run kubectl apply -f deployment.yaml, the following chain begins:

1. The API server receives the request, authenticates and authorizes it, runs admission controllers and webhooks, and writes the Deployment object to etcd.
2. etcd confirms the write, and the API server delivers a watch event to subscribers.
3. The Deployment controller sees the new Deployment, creates a ReplicaSet, and writes it to the API server.
4. The ReplicaSet controller sees the ReplicaSet, creates Pod objects, and writes them.
5. The scheduler sees unscheduled Pods, selects a node for each, and binds the Pod to it, which sets spec.nodeName.
6. The kubelet on the assigned node sees the Pod via its watch, mounts any CSI volumes, calls the CRI to create the Pod sandbox (during which the runtime invokes the CNI plugin to set up networking), pulls the image, and starts the container.
7. The kubelet reports Pod status back to the API server.
8. If a Service selects the Pod, the EndpointSlice controller sees that the Pod is Ready and updates the Service's EndpointSlices.
9. kube-proxy programs rules so traffic to the Service ClusterIP reaches the new Pod.

Every step is asynchronous. The API server broadcasts events, and each consumer reacts independently. There is no central orchestrator directing the sequence. This design is resilient but means that failure at any hop produces distinct symptoms.
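The absence of a central orchestrator is easier to see in a small simulation. In the sketch below, a toy `ApiServer` class stores objects and queues an event for every write, and each component is just a handler that reacts to one kind of object. All class, function, and object names are hypothetical, the node name is hard-coded, and the EndpointSlice and kube-proxy steps are omitted.

```python
from collections import defaultdict, deque


class ApiServer:
    """Toy store: every write is recorded and queued as an event for watchers."""

    def __init__(self):
        self.objects = {}
        self.watchers = defaultdict(list)
        self.events = deque()
        self.log = []

    def watch(self, kind, handler):
        self.watchers[kind].append(handler)

    def write(self, kind, name, obj, actor):
        self.objects[(kind, name)] = obj
        self.log.append(f"{actor} wrote {kind}/{name}")
        self.events.append((kind, name, obj))

    def run(self):
        # Deliver events until no component has anything left to react to.
        while self.events:
            kind, name, obj = self.events.popleft()
            for handler in self.watchers[kind]:
                handler(self, name, obj)


def deployment_controller(api, name, dep):
    rs = f"{name}-rs"
    if ("ReplicaSet", rs) not in api.objects:
        api.write("ReplicaSet", rs, {"replicas": dep["replicas"]}, "deployment-controller")


def replicaset_controller(api, name, rs):
    for i in range(rs["replicas"]):
        pod = f"{name}-{i}"
        if ("Pod", pod) not in api.objects:
            api.write("Pod", pod, {"nodeName": None, "phase": "Pending"}, "replicaset-controller")


def scheduler(api, name, pod):
    if pod["nodeName"] is None:
        api.write("Pod", name, {**pod, "nodeName": "node-a"}, "scheduler")


def kubelet(api, name, pod):
    if pod["nodeName"] == "node-a" and pod["phase"] == "Pending":
        api.write("Pod", name, {**pod, "phase": "Running"}, "kubelet")


api = ApiServer()
api.watch("Deployment", deployment_controller)
api.watch("ReplicaSet", replicaset_controller)
api.watch("Pod", scheduler)
api.watch("Pod", kubelet)

api.write("Deployment", "web", {"replicas": 2}, "kubectl")
api.run()
print("\n".join(api.log))
```

No handler calls another handler. Removing the `scheduler` registration leaves the Pods in Pending with an empty nodeName, and removing the `kubelet` registration leaves them bound but never Running, which mirrors how a failure at one hop produces its own distinct symptom.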

How this mental model helps in incident triage

When something breaks, the symptom tells you which component to inspect first.

kubectl hangs or returns 500s: Start with API server latency, then etcd disk latency. If etcd WAL fsync is slow, the API server is waiting, and every mutating request queues up.
429 Too Many Requests: Check APF queue depth and inflight request counts. A single misbehaving controller can starve legitimate traffic.
Pods stuck Pending: Check scheduler health, pending pod queues, and node allocatable resources. If the scheduler is healthy but the unschedulableQ is growing, the cluster is out of capacity or constrained by affinity rules.
Pods stuck ContainerCreating: Look at the kubelet. Check PLEG relist duration, CRI operation latency, image pull failures, volume mount errors, and CNI health.
Nodes flapping NotReady: Check kubelet logs for PLEG timeouts, container runtime disconnections, API server connectivity, and certificate expiration.
Service timeouts after a rollout: Check kube-proxy sync duration, conntrack table utilization, and whether stale endpoint rules are sending traffic to terminated Pods.

Understanding the pipeline tells you whether the problem is in serialization (API server or etcd), reconciliation (controllers or scheduler), or execution (kubelet, runtime, CNI).
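One way to keep this mapping at hand is to encode it as a lookup table, for example in a runbook helper. The sketch below copies the symptoms and checks from the list above. The `TRIAGE` and `first_check` names are hypothetical, and filing the Service timeout case under the execution layer is an assumption of this example, since kube-proxy is not named in the three layers.

```python
# symptom -> (pipeline layer, checks in the order listed above)
TRIAGE = {
    "kubectl hangs or returns 500s": (
        "serialization",
        ["API server latency", "etcd disk latency"],
    ),
    "429 Too Many Requests": (
        "serialization",
        ["APF queue depth", "inflight request counts"],
    ),
    "Pods stuck Pending": (
        "reconciliation",
        ["scheduler health", "pending pod queues", "node allocatable resources"],
    ),
    "Pods stuck ContainerCreating": (
        "execution",
        ["PLEG relist duration", "CRI operation latency", "image pull failures",
         "volume mount errors", "CNI health"],
    ),
    "Nodes flapping NotReady": (
        "execution",
        ["PLEG timeouts in kubelet logs", "container runtime disconnections",
         "API server connectivity", "certificate expiration"],
    ),
    # Assumption: kube-proxy is grouped with node-level execution here.
    "Service timeouts after a rollout": (
        "execution",
        ["kube-proxy sync duration", "conntrack table utilization",
         "stale endpoint rules"],
    ),
}


def first_check(symptom):
    """Return the layer and the first thing to inspect for a known symptom."""
    layer, checks = TRIAGE[symptom]
    return layer, checks[0]


print(first_check("Pods stuck Pending"))
```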

Signals to watch in production

Signal	Why it matters	Warning sign
etcd WAL fsync latency	Every cluster write blocks on this	p99 > 100 ms sustained
API server mutating request latency	End-to-end write health	p99 > 1 s sustained
APF queue depth	Priority level saturation	Non-zero for system or leader-election flows
Controller workqueue depth	Reconciliation lag	Growing for more than 5 minutes
Scheduler pending pods	Scheduling backlog	unschedulableQ growing
Kubelet PLEG relist duration	Runtime responsiveness on the node	> 10 s, approaching the 3 min threshold
kube-proxy sync duration	Service rule staleness	Exceeding the configured sync period
etcd database size	Storage quota risk	> 75 % of configured quota
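The numeric thresholds in the table can be expressed as a small rule set. This is a minimal sketch: the metric keys are hypothetical labels rather than real exporter metric names, only the rows with a fixed numeric threshold are included, and the "sustained" qualifier is not modeled, so a real alert would also require the condition to hold over a time window.

```python
# Warning thresholds taken from the table above; keys are hypothetical labels.
THRESHOLDS = {
    "etcd_wal_fsync_p99_seconds": 0.100,
    "apiserver_mutating_p99_seconds": 1.0,
    "kubelet_pleg_relist_seconds": 10.0,
    "etcd_db_quota_fraction": 0.75,
}


def warnings(sample):
    """Return the sorted names of signals whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if sample.get(name, 0) > limit
    )


# Illustrative readings, not real measurements.
sample = {
    "etcd_wal_fsync_p99_seconds": 0.250,
    "apiserver_mutating_p99_seconds": 0.400,
    "kubelet_pleg_relist_seconds": 2.0,
    "etcd_db_quota_fraction": 0.80,
}
print(warnings(sample))
```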

How Netdata helps

Netdata surfaces the cross-component correlations that make this mental model actionable.

It collects API server request latency and etcd WAL fsync latency on the same timeline, so you can see whether a control plane slowdown starts in the storage layer or in admission control.
It tracks per-node kubelet PLEG relist duration and CRI operation latency, letting you distinguish a runtime problem from an API server connectivity issue.
It monitors APF queue depth and inflight request counts, exposing saturation before clients are flooded with 429 responses.
It alerts on etcd leader changes and database size growth rate, providing runway before a quota alarm.
It maps controller workqueue depth and scheduler pending queues to deployment and scaling health.
