Kubernetes node CPU saturation: load, throttling, and runqueue depth

Application latency climbs and pods slow down. kubectl top nodes reports 70 percent CPU, so you assume headroom exists. It does not. CPU percent is a time-average that masks micro-bursts, runqueue backlog, and CFS throttling. A container can throttle to a crawl while node utilization looks comfortable, and a node can show 50 percent utilization with every runnable thread queued behind a noisy neighbor. Distinguish node-level CPU contention from limit-induced throttling using runqueue depth, CFS bandwidth metrics, and Pressure Stall Information (PSI).

What this means

Kubernetes node CPU utilization is an aggregate average that hides two distinct failure modes.

First, CFS throttling. The Linux Completely Fair Scheduler enforces CPU limits via a quota per 100ms period. When a container exhausts its quota, the kernel halts its threads until the next period. The container experiences CPU starvation even if the node has idle cores. Multi-threaded containers exhaust the same quota faster in wall-clock time than single-threaded ones. A four-thread process with a one-core limit can burn its 100ms quota in roughly 25ms of wall-clock time and then throttle for the remainder of the period.
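The arithmetic behind the four-thread example can be sketched as follows. This is a simplified model, assuming every thread is CPU-bound and runs on its own core for the whole period; the function name and parameters are illustrative and not part of any kernel or Kubernetes API.

```python
def cfs_throttle_window(limit_cores: float, busy_threads: int,
                        period_ms: float = 100.0) -> tuple:
    """Return (run_ms, throttled_ms) per CFS period under a simplified model.

    Assumes each busy thread is CPU-bound and scheduled on its own core.
    """
    if limit_cores <= 0 or busy_threads <= 0:
        raise ValueError("limit_cores and busy_threads must be positive")
    quota_ms = limit_cores * period_ms                 # CPU time allowed per period
    run_ms = min(period_ms, quota_ms / busy_threads)   # wall-clock time until the quota is spent
    return run_ms, period_ms - run_ms


run_ms, throttled_ms = cfs_throttle_window(limit_cores=1.0, busy_threads=4)
print(f"runs {run_ms:.0f} ms, throttled {throttled_ms:.0f} ms per period")
```

Real workloads rarely keep every thread busy for the full period, so treat the result as an upper bound on throttled time rather than a prediction.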

Second, runqueue depth and pressure. When runnable threads exceed physical cores, the kernel queues them in per-CPU runqueues. Deep queues mean tasks wait. The kernel exposes this through load average and, on cgroup v2 nodes, through Pressure Stall Information (PSI). PSI reports the percentage of wall-clock time tasks spend waiting for CPU: some means one or more tasks stalled, and full means all non-idle tasks stalled.

PSI cannot distinguish a pod throttled by its own CPU limit from one starved by neighbors. Correlate PSI with CFS throttling metrics to disambiguate.
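To work with PSI values programmatically, the pressure file can be parsed into a small dictionary. The sketch below assumes the standard kernel layout of a PSI file (a `some` line and a `full` line, each with `avg10`, `avg60`, `avg300`, and `total` fields); the function name and the sample numbers are made up for illustration.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a PSI file such as /proc/pressure/cpu."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),     # percent of the last 10 s stalled
            "avg60": float(values["avg60"]),
            "avg300": float(values["avg300"]),
            "total_us": int(values["total"]),    # cumulative stall time in microseconds
        }
    return result


# Sample text in the kernel's PSI format; the numbers are invented.
sample = (
    "some avg10=12.50 avg60=8.10 avg300=3.02 total=123456789\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])
```

On a host with PSI enabled, the same function could be fed the live file, for example `parse_psi(open("/proc/pressure/cpu").read())`.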

Kubernetes v1.36 graduates PSI metrics to GA. The kubelet exposes them when the node runs cgroup v2. Before v1.36, the kubelet emitted zero-valued PSI metrics even when the underlying OS had PSI disabled, which triggered false alarms. In v1.36, the kubelet detects OS-level PSI support before emitting metrics. A kernel at 4.20 or newer is not sufficient on its own: some distributions compile PSI support but leave it disabled by default, or boot with `psi=0`, so it stays off at the host level until the kernel is booted with `psi=1`. Verify host-level PSI directly before trusting the metric.

There is no canonical runqueue depth metric in Kubernetes. The kernel's runnable-task count (nr_running) is visible on the host as procs_running in /proc/stat and as runq-sz in sar -q, but kubelet and cAdvisor do not expose it. Load average is the closest proxy, though it also counts tasks in uninterruptible sleep, which are typically blocked on I/O. To track runqueue depth, scrape node_exporter, which exports procs_running as node_procs_running, or collect the value via a node-level agent DaemonSet.
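A node-level agent could read the runnable-task count along these lines. This is a minimal sketch, assuming the agent has access to the host's /proc/stat; the function names are hypothetical and the sample text is abbreviated and invented.

```python
def procs_running(stat_text: str) -> int:
    """Return the procs_running value from the contents of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "procs_running":
            return int(parts[1])
    raise ValueError("procs_running not found in input")


def runnable_per_core(stat_text: str, cores: int) -> float:
    """Runnable tasks per core; values above 1.0 suggest tasks are queuing."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return procs_running(stat_text) / cores


# Abbreviated, invented /proc/stat contents.
sample = (
    "cpu  100 0 50 1000 0 0 0 0 0 0\n"
    "ctxt 12345\n"
    "procs_running 12\n"
    "procs_blocked 1\n"
)
print(runnable_per_core(sample, cores=8))
```

The value is an instantaneous sample, so a real collector would average several readings before comparing against the core count.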

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| CPU limit too low | Container throttled despite idle node cores | CFS throttled periods ratio for the container |
| Node overcommit | High load average, slow scheduling, latency in many pods | Total CPU requests vs allocatable on the node |
| Noisy neighbor | One burstable pod spiking, others degraded | Per-pod CPU usage and node PSI some pressure |
| System overhead | Kubelet, container runtime, or kernel threads consuming cores | Host-level CPU usage outside pod cgroups |
| Burstable pod burst | Sudden latency spike after scale-up or batch start | CFS throttling ratio and load average trend |

Quick checks

# Check node CPU utilization (requires metrics-server)
kubectl top nodes

# Verify cgroup v2 (required for kubelet PSI metrics); prints cgroup2fs on v2, tmpfs on v1
stat -fc %T /sys/fs/cgroup

# Check host-level PSI (requires kernel 4.20+ with CONFIG_PSI)
cat /proc/pressure/cpu

# Check node PSI via Summary API (Kubernetes v1.36+, cgroup v2)
kubectl get --raw /api/v1/nodes/<node-name>/proxy/stats/summary | jq '.node.cpu.psi'

# Check CFS throttling metrics from cAdvisor
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/cadvisor | grep container_cpu_cfs_throttled_periods_total

# Check load average on the node
cat /proc/loadavg

# PromQL: CFS throttling ratio per container
# rate(container_cpu_cfs_throttled_periods_total[5m])
#   /
# rate(container_cpu_cfs_periods_total[5m])
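
Outside Prometheus, the same ratio can be approximated from two scrapes of the cAdvisor counters. The sketch below assumes the two counter values were already extracted for a single container; `CfsSample` and `throttle_ratio` are illustrative names, and the sample numbers are invented.

```python
from typing import NamedTuple


class CfsSample(NamedTuple):
    periods: float             # container_cpu_cfs_periods_total
    throttled_periods: float   # container_cpu_cfs_throttled_periods_total


def throttle_ratio(earlier: CfsSample, later: CfsSample) -> float:
    """Fraction of CFS periods in the window during which the container was throttled."""
    d_periods = later.periods - earlier.periods
    d_throttled = later.throttled_periods - earlier.throttled_periods
    if d_periods <= 0 or d_throttled < 0:
        # No elapsed periods, or a counter reset: treat the window as having no data.
        return 0.0
    return d_throttled / d_periods


ratio = throttle_ratio(CfsSample(1000, 40), CfsSample(4000, 340))
print(f"{ratio:.2%} of periods throttled")
```

Returning 0.0 on a counter reset is a simplification; PromQL's rate() compensates for resets instead of discarding the window.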

How to diagnose it

  1. Confirm the symptom is CPU-bound. High cpu.some PSI with low memory.some and io.some points to CPU. Without PSI, check whether load average trends upward while disk and memory metrics stay flat.

  2. Check for CFS throttling. Query cAdvisor Prometheus metrics for container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. Calculate the throttling ratio. A ratio above 0.05 sustained for five minutes means the container is regularly hitting its limit. Multi-threaded containers exhaust the fixed quota faster in wall-clock time, so they throttle sooner than single-threaded workloads.

  3. Distinguish limit-induced from contention-induced pressure. If the container is throttled, its CPU limit is too low for the workload, and the fix is vertical scaling: raise the limit. Kubernetes rejects a limit below the request, so the comparison that matters is limit against peak demand. If the container is not throttled but node-level PSI cpu.some is elevated, the node is oversubscribed and the stall is external to the pod.

  4. Check node-level PSI or load average. On cgroup v2 nodes, use the Summary API or host /proc/pressure/cpu. A some value above 10 percent means tasks are stalling for CPU. A full value above 5 percent means all non-idle tasks are stalled simultaneously, indicating severe contention. The kernel defines CPU full only per cgroup and reports it as zero in the system-wide /proc/pressure/cpu, so read full from pod or container pressure rather than from the host file. Without PSI, use load average. Load average above the core count for more than five minutes indicates queuing.

  5. Find the consumers. Identify top consumers with kubectl top pods --all-namespaces or cAdvisor container_cpu_usage_seconds_total. Compare usage to requests. BestEffort pods, and Burstable pods with no CPU limit, can consume CPU well beyond their requests and starve neighbors.

  6. Check system overhead. The kubelet, the container runtime, and system daemons need CPU reserved via --kube-reserved and --system-reserved (kubeReserved and systemReserved in the kubelet configuration file). Without these reservations, system processes compete with pods for CPU time. If node-level pressure persists after accounting for pods, check kubelet and runtime CPU usage.

  7. Correlate with runqueue depth. There is no native Kubernetes metric for nr_running. Use node_exporter node_load1 as a proxy, or collect the runnable-task count directly (node_exporter node_procs_running, or a node-level agent). If load average is high but CFS throttling is low, the node has too many runnable threads for its core count.
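
The load average comparison used in steps 4 and 7 can be sketched as a per-core ratio. The example assumes the standard /proc/loadavg layout (1, 5, and 15 minute averages first); the function name and the sample line are illustrative.

```python
def load_per_core(loadavg_text: str, cores: int) -> dict:
    """Normalize the 1, 5, and 15 minute load averages by core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    one, five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    return {"load1": one / cores, "load5": five / cores, "load15": fifteen / cores}


# Invented /proc/loadavg contents for an 8-core node.
sample = "9.12 6.40 3.05 11/812 40321"
ratios = load_per_core(sample, cores=8)

# load1 above 1.0 with load5 below 1.0 suggests a recent burst, not sustained queuing.
print(ratios["load1"] > 1.0, ratios["load5"] > 1.0)
```

On a live node the core count could come from `os.cpu_count()`. Because load average also counts tasks in uninterruptible sleep, a high ratio still needs confirmation from PSI or the runnable-task count.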

flowchart TD
    A[High application latency or slow pods] --> B{Check CFS throttling ratio}
    B -->|Ratio > 0.05| C[CPU limit too low]
    B -->|Ratio near zero| D{Check node PSI cpu.some}
    D -->|avg10 > 10%| E[Node CPU contention / noisy neighbor]
    D -->|avg10 near zero| F{Check load average vs cores}
    F -->|Load > core count| G[Runqueue deep: oversubscription]
    F -->|Load normal| H[Application or dependency issue]
    C --> I[Raise CPU limit or remove limit]
    E --> J[Rebalance pods or add nodes]
    G --> K[Reduce requests or scale out]
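
The flowchart can also be written as a small triage function. This is a sketch using the thresholds from this guide (0.05 throttling ratio, 10 percent PSI some avg10, load above core count); the function name and return labels are hypothetical, and the "near zero" branches are simplified to "not above the threshold".

```python
from typing import Optional


def classify_cpu_saturation(throttle_ratio: float,
                            psi_some_avg10: Optional[float],
                            load1: float,
                            cores: int) -> str:
    """Walk the decision tree: throttling first, then node PSI, then load average."""
    if throttle_ratio > 0.05:
        return "cpu_limit_too_low"            # raise or remove the CPU limit
    if psi_some_avg10 is not None and psi_some_avg10 > 10.0:
        return "node_contention"              # rebalance pods or add nodes
    if load1 > cores:
        return "runqueue_oversubscription"    # reduce requests or scale out
    return "application_or_dependency"        # look beyond CPU


# Pass psi_some_avg10=None on nodes where PSI is unavailable.
print(classify_cpu_saturation(0.01, 14.2, load1=5.0, cores=8))
```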

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total ratio | Reveals containers hitting CPU limits | Ratio > 0.05 sustained for > 5 min |
| node/cpu PSI some avg10 | Measures wall-clock time tasks wait for CPU | avg10 > 10% on latency-sensitive nodes |
| node/cpu PSI full avg10 | Measures time all non-idle tasks are stalled simultaneously | Any sustained non-zero value is critical |
| node_load1 (via node_exporter) | Proxies for runnable and uninterruptible threads | Load1 > core count for > 5 minutes |
| container_cpu_usage_seconds_total | Identifies top CPU consumers | Usage significantly above requests for burstable pods |
| kube_node_status_allocatable vs requested CPU | Shows scheduling headroom | Requests > 80% of allocatable sustained |

Fixes

If the cause is CPU limit misconfiguration

Raise the container’s CPU limit to match actual peak usage. Removing the limit eliminates throttling but reduces predictability. Kubernetes enforces CPU limits through the CFS quota, and the default period is 100ms, so multi-threaded containers exhaust the quota rapidly in wall-clock time. The API server rejects a limit below the request, so size the limit against peak demand rather than against the request; a limit equal to a request that reflects only average usage still throttles under bursts.

If the cause is node overcommitment

Add nodes or reduce total CPU requests. Check kubectl describe node for allocated resources. If requests exceed roughly 80 percent of allocatable, bursts from burstable pods create runqueue pressure. Move latency-sensitive workloads to nodes with lower request density, or convert them to Guaranteed QoS.
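
The 80 percent check can be sketched as a request-density calculation over the CPU quantities shown by kubectl describe node. The parser below handles only plain core counts ("2", "0.5") and millicores ("500m"); the function names and the sample values are illustrative.

```python
def parse_cpu_quantity(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0   # millicores
    return float(quantity)                   # whole or fractional cores


def request_density(pod_requests: list, allocatable: str) -> float:
    """Total CPU requests on a node as a fraction of its allocatable CPU."""
    total = sum(parse_cpu_quantity(r) for r in pod_requests)
    return total / parse_cpu_quantity(allocatable)


# Invented requests for four pods on a node with 7910m allocatable CPU.
density = request_density(["500m", "2", "1500m", "2500m"], "7910m")
print(f"{density:.0%} of allocatable requested; over 80%: {density > 0.8}")
```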

If the cause is a noisy neighbor

Identify the offending pod with kubectl top pod or cAdvisor metrics. Set CPU requests so the scheduler places it on a node with sufficient headroom. Consider pod anti-affinity to separate heavy consumers from latency-sensitive workloads.

If the cause is system overhead

Configure --kube-reserved and --system-reserved on the kubelet to protect node-agent CPU. If kubelet or the container runtime consumes unexpected CPU, check for PLEG delays, image pull storms, or goroutine leaks.

If the cause is missing runqueue visibility

Deploy node_exporter or a node-level monitoring DaemonSet to expose node_load1 and the runnable-task count (node_procs_running in node_exporter). Without runqueue depth, CPU percent alone leaves a blind spot.

Prevention

  • Enable PSI metrics by running cgroup v2 on all nodes. Kubernetes v1.36 exposes PSI via the Summary API and cAdvisor. Do not rely solely on CPU utilization percentage.
  • Monitor CFS throttling ratio as a first-class SLO for latency-sensitive services. Alert when the ratio exceeds 5 percent for more than five minutes.
  • Right-size CPU requests and limits using vertical pod autoscaler or historical usage data.
  • Maintain node CPU request headroom below 80 percent of allocatable to absorb bursts.
  • Reserve CPU for system and kubelet overhead explicitly via kubelet flags.
  • Track load average or runqueue depth per node, not just cluster aggregates.
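
The "exceeds 5 percent for more than five minutes" condition is a sustained-threshold check, not a single-sample comparison. A minimal sketch, assuming samples arrive as time-ordered (timestamp in seconds, value) pairs; the function name is hypothetical and the sample data is invented.

```python
def sustained_above(samples: list, threshold: float, min_duration_s: float) -> bool:
    """True if the latest unbroken run of values above threshold spans min_duration_s.

    samples: list of (timestamp_s, value) tuples sorted by timestamp.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp   # a new breach begins here
        else:
            breach_start = None            # any dip resets the run
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration_s


# Throttling ratio sampled once a minute for five minutes.
steady = [(t, 0.08) for t in range(0, 301, 60)]
dipped = [(0, 0.08), (60, 0.08), (120, 0.02), (180, 0.08), (240, 0.08), (300, 0.08)]
print(sustained_above(steady, 0.05, 300), sustained_above(dipped, 0.05, 300))
```

Resetting on any dip keeps one-off spikes from paging anyone, at the cost of missing breaches that flap around the threshold.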

How Netdata helps

  • Correlates node CPU utilization, PSI some/full, and CFS throttling metrics in a single view.
  • Exposes per-container cgroup v2 CPU pressure and throttled time without manual cAdvisor scraping.
  • Alerts on CFS throttling ratio and PSI thresholds.
  • Visualizes per-node load average alongside pod CPU usage to distinguish node contention from limit-induced starvation.