Kubernetes node CPU saturation: load, throttling, and runqueue depth
Application latency climbs and pods slow down. kubectl top nodes reports 70 percent CPU, so you assume headroom exists. It does not. CPU percent is a time-average that masks micro-bursts, runqueue backlog, and CFS throttling. A container can throttle to a crawl while node utilization looks comfortable, and a node can show 50 percent utilization with every runnable thread queued behind a noisy neighbor. Distinguish node-level CPU contention from limit-induced throttling using runqueue depth, CFS bandwidth metrics, and Pressure Stall Information (PSI).
What this means
Kubernetes node CPU utilization is an aggregate average that hides two distinct failure modes.
First, CFS throttling. The Linux Completely Fair Scheduler enforces CPU limits via a quota per 100ms period. When a container exhausts its quota, the kernel halts its threads until the next period. The container experiences CPU starvation even if the node has idle cores. Multi-threaded containers exhaust the same quota faster in wall-clock time than single-threaded ones. A four-thread process with a one-core limit can burn its 100ms quota in roughly 25ms of wall-clock time and then throttle for the remainder of the period.
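The quota arithmetic above can be sketched in shell. This is a worst-case estimate that assumes every thread is fully CPU-bound for the whole period. `cfs_burn` is a hypothetical helper name, and the 100ms default mirrors the period described above.

```bash
# Estimate how long a container runs before CFS throttles it within one period.
# Assumption: every thread is 100% CPU-bound (worst case).
# $1 = CPU limit in millicores, $2 = busy threads, $3 = period in ms (default 100)
cfs_burn() {
  awk -v l="$1" -v t="$2" -v p="${3:-100}" 'BEGIN {
    quota = l * p / 1000     # CPU-milliseconds allowed per period
    run = quota / t          # wall-clock ms until the quota is exhausted
    if (run > p) run = p     # quota outlasts the period: no throttling
    printf "run_ms=%.1f throttled_ms=%.1f\n", run, p - run
  }'
}
```

Calling `cfs_burn 1000 4` reproduces the four-thread, one-core case: about 25ms of execution followed by 75ms of throttling in each period.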
Second, runqueue depth and pressure. When runnable threads exceed physical cores, the kernel queues them in per-CPU runqueues. Deep queues mean tasks wait. The kernel exposes this through load average and, on cgroup v2 nodes, through Pressure Stall Information (PSI). PSI reports the percentage of wall-clock time tasks spend waiting for CPU: some means one or more tasks stalled, and full means all non-idle tasks stalled.
PSI cannot distinguish a pod throttled by its own CPU limit from one starved by neighbors. Correlate PSI with CFS throttling metrics to disambiguate.
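Before PSI can be correlated with throttling data, the values have to be pulled out of the kernel's text format. A minimal sketch, assuming the usual `/proc/pressure/cpu` layout of `some` and `full` lines carrying `avg10=`, `avg60=`, `avg300=`, and `total=` fields. `psi_field` is a hypothetical helper name.

```bash
# Extract one field from PSI text read on stdin.
# $1 = line type (some|full), $2 = field name (avg10|avg60|avg300|total)
psi_field() {
  awk -v kind="$1" -v field="$2" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")               # "avg10=1.23" -> kv[1]="avg10", kv[2]="1.23"
        if (kv[1] == field) print kv[2]
      }
    }'
}
```

On a node with PSI enabled, `psi_field some avg10 < /proc/pressure/cpu` prints the ten-second stall percentage, which can then be compared with the container's throttling ratio.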
Kubernetes v1.36 graduates PSI metrics to GA. The kubelet exposes them when the node runs cgroup v2. Before v1.36, the kubelet emitted zero-valued PSI metrics even when the underlying OS had PSI disabled, which triggered false alarms. In v1.36, the kubelet detects OS-level PSI support before emitting metrics. Even with kernel 4.20 or newer, some distributions compile PSI support but boot with `psi=0`, disabling it at the host level. Verify host-level PSI directly before trusting the metric.
There is no canonical runqueue depth metric in Kubernetes. The kernel's count of runnable tasks, nr_running, is visible as `procs_running` in `/proc/stat` and as `runq-sz` in `sar -q`, but kubelet and cAdvisor do not expose it. Load average is the closest proxy, though on Linux it also counts tasks in uninterruptible sleep, which are usually blocked on I/O. To track runqueue depth, instrument node_exporter or collect nr_running via a node-level agent DaemonSet.
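As a rough illustration of collecting runqueue depth by hand, the sketch below reads the `procs_running` line from `/proc/stat` and divides it by the core count. `runq_per_core` is a hypothetical helper name, and the value is a single instantaneous sample that also counts the process doing the read, so treat it as indicative only.

```bash
# Runnable tasks per core from /proc/stat content on stdin.
# $1 = number of CPU cores (for example, the output of nproc)
runq_per_core() {
  awk -v cores="$1" '$1 == "procs_running" { printf "%.2f\n", $2 / cores }'
}
```

On a node this would be invoked as `runq_per_core "$(nproc)" < /proc/stat`. Values that stay well above 1.00 across repeated samples suggest more runnable threads than cores.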
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| CPU limit too low | Container throttled despite idle node cores | CFS throttled periods ratio for the container |
| Node overcommit | High load average, slow scheduling, latency in many pods | Total CPU requests vs allocatable on the node |
| Noisy neighbor | One burstable pod spiking, others degraded | Per-pod CPU usage and node PSI some pressure |
| System overhead | Kubelet, container runtime, or kernel threads consuming cores | Host-level CPU usage outside pod cgroups |
| Burstable pod burst | Sudden latency spike after scale-up or batch start | CFS throttling ratio and load average trend |
Quick checks
```bash
# Check node CPU utilization (requires metrics-server)
kubectl top nodes
# Verify cgroup v2 (required for PSI)
stat -fc %T /sys/fs/cgroup
# Check host-level PSI (requires kernel 4.20+ with CONFIG_PSI)
cat /proc/pressure/cpu
# Check node PSI via Summary API (Kubernetes v1.36+, cgroup v2)
kubectl get --raw /api/v1/nodes/<node-name>/proxy/stats/summary | jq '.node.cpu.psi'
# Check CFS throttling metrics from cAdvisor
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/cadvisor | grep container_cpu_cfs_throttled_periods_total
# Check load average on the node
cat /proc/loadavg
# PromQL: CFS throttling ratio per container
# rate(container_cpu_cfs_throttled_periods_total[5m])
# /
# rate(container_cpu_cfs_periods_total[5m])
```
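Without Prometheus, a rough throttling ratio can be computed straight from the cAdvisor text output. The sketch below is an approximation under stated assumptions: it divides the raw counters, so it gives the ratio since the counters started rather than a five-minute rate, and it expects stdin already filtered to one container. `throttle_ratio` is a hypothetical helper name.

```bash
# Cumulative CFS throttling ratio from Prometheus text lines on stdin.
# Assumption: input is pre-filtered to a single container and each line has labels.
throttle_ratio() {
  awk '
    # Strip "name{labels} " so $1 becomes the sample value; a trailing timestamp is ignored.
    /^container_cpu_cfs_throttled_periods_total/ { sub(/^.*[}] */, ""); throttled += $1 }
    /^container_cpu_cfs_periods_total/           { sub(/^.*[}] */, ""); periods += $1 }
    END { if (periods > 0) printf "%.3f\n", throttled / periods; else print "n/a" }'
}
```

A possible invocation pipes the cAdvisor endpoint shown above through `grep 'pod="<pod-name>"'` and then into `throttle_ratio`. A result above 0.05 is a reason to look at the rate-based PromQL query.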
How to diagnose it
1. Confirm the symptom is CPU-bound. High `cpu.some` PSI with low `memory.some` and `io.some` points to CPU. Without PSI, check whether load average trends upward while disk and memory metrics stay flat.
2. Check for CFS throttling. Query cAdvisor Prometheus metrics for `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`. Calculate the throttling ratio. A ratio above 0.05 sustained for five minutes means the container is regularly hitting its limit. Multi-threaded containers exhaust the fixed quota faster in wall-clock time, so they throttle sooner than single-threaded workloads.
3. Distinguish limit-induced from contention-induced pressure. If the container is throttled, its limit is too low for the workload and the fix is vertical scaling. If the container is not throttled but node-level PSI `cpu.some` is elevated, the node is oversubscribed and the stall is external to the pod.
4. Check node-level PSI or load average. On cgroup v2 nodes, use the Summary API or host `/proc/pressure/cpu`. `some` pressure above 10 percent means tasks are stalling for CPU. `full` pressure above 5 percent means all non-idle tasks are stalled simultaneously, indicating severe contention. Without PSI, use load average. Load average above the core count for more than five minutes indicates queuing.
5. Find the consumers. Identify top consumers with `kubectl top pods --all-namespaces` or cAdvisor `container_cpu_usage_seconds_total`. Compare usage to requests. Burstable pods with low or no CPU requests can consume excess CPU and starve neighbors.
6. Check system overhead. Reserve CPU for kubelet, the container runtime, and system daemons via `--kube-reserved` and `--system-reserved`. Without these, system processes compete with pods for scheduling slots. If node-level pressure persists after accounting for pods, check kubelet and runtime CPU usage.
7. Correlate with runqueue depth. There is no native Kubernetes metric for `nr_running`. Use node_exporter `node_load1` as a proxy, or collect `nr_running` directly via a node-level agent. If load average is high but CFS throttling is low, the node has too many runnable threads for its core count.
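The order of checks above can be condensed into a small triage function. This is a sketch, not a definitive classifier: the thresholds are the ones quoted in this guide, the inputs are assumed to be gathered separately, and `classify_cpu` and its output labels are hypothetical names.

```bash
# Classify a CPU symptom from four already-collected numbers.
# $1 = CFS throttling ratio (0-1), $2 = PSI some avg10 (%), $3 = load1, $4 = core count
classify_cpu() {
  awk -v r="$1" -v p="$2" -v l="$3" -v c="$4" 'BEGIN {
    if (r > 0.05)    print "limit-throttled"          # container hits its own limit
    else if (p > 10) print "node-contention"          # tasks stall waiting for CPU
    else if (l > c)  print "runqueue-oversubscribed"  # more runnable threads than cores
    else             print "look-elsewhere"           # likely application or dependency
  }'
}
```

For example, `classify_cpu 0.01 25 3 8` reports node contention: the container is barely throttled, yet tasks on the node spend a quarter of the time waiting for CPU.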
```mermaid
flowchart TD
A[High application latency or slow pods] --> B{Check CFS throttling ratio}
B -->|Ratio > 0.05| C[CPU limit too low]
B -->|Ratio near zero| D{Check node PSI cpu.some}
D -->|avg10 > 10%| E[Node CPU contention / noisy neighbor]
D -->|avg10 near zero| F{Check load average vs cores}
F -->|Load > core count| G[Runqueue deep: oversubscription]
F -->|Load normal| H[Application or dependency issue]
C --> I[Raise CPU limit or remove limit]
E --> J[Rebalance pods or add nodes]
G --> K[Reduce requests or scale out]
```
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total ratio | Reveals containers hitting CPU limits | Ratio > 0.05 sustained for > 5 min |
| node/cpu PSI some avg10 | Measures wall-clock time tasks wait for CPU | avg10 > 10% on latency-sensitive nodes |
| node/cpu PSI full avg10 | Measures time all non-idle tasks are stalled simultaneously | Any sustained non-zero value is critical |
| node_load1 (via node_exporter) | Proxies for runnable and uninterruptible threads | Load1 > core count for > 5 minutes |
| container_cpu_usage_seconds_total | Identifies top CPU consumers | Usage significantly above requests for burstable pods |
| kube_node_status_allocatable vs requested CPU | Shows scheduling headroom | Requests > 80% of allocatable sustained |
Fixes
If the cause is CPU limit misconfiguration
Raise the container’s CPU limit to match actual peak usage. Removing the limit eliminates throttling but reduces predictability. Kubernetes enforces CPU limits through CFS quota, and because the period is fixed at 100ms, multi-threaded containers exhaust the quota rapidly in wall-clock time. Kubernetes rejects a CPU limit set below the request, so a valid pod spec can still throttle: size the limit against observed peak usage across all threads, not against the request.
If the cause is node overcommitment
Add nodes or reduce total CPU requests. Check kubectl describe node for allocated resources. If requests exceed roughly 80 percent of allocatable, bursts from burstable pods create runqueue pressure. Move latency-sensitive workloads to nodes with lower request density, or convert them to Guaranteed QoS.
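The 80 percent check is simple arithmetic once both quantities are in the same unit. A minimal sketch, assuming CPU quantities written either in millicores (`3500m`) or in whole or fractional cores (`4`, `1.5`). `to_millicores` and `request_pct` are hypothetical helper names, and the inputs would be read off the allocated resources section of `kubectl describe node`.

```bash
# Convert a Kubernetes CPU quantity ("500m", "2", "1.5") to millicores.
to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;
  esac
}

# Percentage of allocatable CPU already claimed by requests.
# $1 = total CPU requests on the node, $2 = allocatable CPU
request_pct() {
  awk -v r="$(to_millicores "$1")" -v a="$(to_millicores "$2")" \
    'BEGIN { printf "%.1f\n", r * 100 / a }'
}
```

With 3500m requested on a node that has 4 cores allocatable, `request_pct 3500m 4` prints 87.5, which is above the rough 80 percent mark and leaves little room for bursts.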
If the cause is a noisy neighbor
Identify the offending pod with kubectl top pod or cAdvisor metrics. Set CPU requests so the scheduler places it on a node with sufficient headroom. Consider pod anti-affinity to separate heavy consumers from latency-sensitive workloads.
If the cause is system overhead
Configure --kube-reserved and --system-reserved on the kubelet to protect node-agent CPU. If kubelet or the container runtime consumes unexpected CPU, check for PLEG delays, image pull storms, or goroutine leaks.
If the cause is missing runqueue visibility
Deploy node_exporter or a node-level monitoring DaemonSet to expose node_load1 and nr_running (node_exporter reports the latter as node_procs_running). Without runqueue depth, CPU percent alone leaves queuing invisible.
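Where no exporter is deployed yet, `/proc/loadavg` gives both signals in one read. The sketch below assumes the standard five-field layout, in which the fourth field is `runnable/total` scheduling entities. `loadavg_summary` is a hypothetical helper name.

```bash
# Summarize /proc/loadavg content read on stdin.
# $1 = number of CPU cores
loadavg_summary() {
  awk -v cores="$1" '{
    split($4, rt, "/")   # fourth field looks like "9/812": runnable/total entities
    printf "load1=%.2f per_core=%.2f runnable=%d\n", $1, $1 / cores, rt[1]
  }'
}
```

On a node this could be run as `loadavg_summary "$(nproc)" < /proc/loadavg`. A `per_core` value above 1.00 that persists for several minutes matches the queuing condition described in this guide.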
Prevention
- Enable PSI metrics by running cgroup v2 on all nodes. Kubernetes v1.36 exposes PSI via the Summary API and cAdvisor. Do not rely solely on CPU utilization percentage.
- Monitor CFS throttling ratio as a first-class SLO for latency-sensitive services. Alert when the ratio exceeds 5 percent for more than five minutes.
- Right-size CPU requests and limits using vertical pod autoscaler or historical usage data.
- Maintain node CPU request headroom below 80 percent of allocatable to absorb bursts.
- Reserve CPU for system and kubelet overhead explicitly via kubelet flags.
- Track load average or runqueue depth per node, not just cluster aggregates.
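The "sustained for five minutes" condition used in these alerts can be illustrated with a few lines of shell. This is a sketch that assumes one sample per line on stdin, oldest first, taken at a fixed interval such as once per minute. `sustained_above` is a hypothetical helper name, and a real deployment would express the same rule in its alerting system.

```bash
# Succeed (exit 0) when the most recent N samples all exceed the threshold.
# $1 = threshold, $2 = number of consecutive trailing samples required
sustained_above() {
  awk -v t="$1" -v n="$2" '
    { if ($1 > t) run++; else run = 0 }   # length of the current run above the threshold
    END { exit ((run >= n) ? 0 : 1) }'
}
```

With one throttling-ratio sample per minute, `sustained_above 0.05 5` succeeds only when the last five samples all exceed 5 percent, so a single short burst does not trigger it.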
How Netdata helps
- Correlates node CPU utilization, PSI `some`/`full`, and CFS throttling metrics in a single view.
- Exposes per-container cgroup v2 CPU pressure and throttled time without manual cAdvisor scraping.
- Alerts on CFS throttling ratio and PSI thresholds.
- Visualizes per-node load average alongside pod CPU usage to distinguish node contention from limit-induced starvation.