Kubernetes kubelet memory leak: detection and OOM cycle

Kubelet memory growth ends in one of two ways: the process hits its cgroup limit, or the node runs out of memory. Either way, the kernel OOM killer sends SIGKILL. Systemd restarts kubelet, but the new process has cold caches and immediately runs a full reconciliation pass: relisting all containers, re-syncing every pod status, and re-attaching every volume. On a busy node, that burst spikes CPU and memory, which can push the fresh kubelet back over the edge and create a Ready/NotReady flap cycle.

Detect the leak before the kill, distinguish it from a workload leak, and break the cycle without worsening the node state.

What this means

Kubelet resident memory should scale with pod count and then flatten once the pod count is stable. A leak shows up as monotonic RSS growth over days or weeks that does not track workload growth.

When kubelet RSS hits its limit, the kernel sends SIGKILL. The kill surfaces as exit code 137; the systemd journal records it as status=9/KILL. After restart, kubelet rebuilds its entire state. That full reconciliation generates a burst of CRI ListContainers calls and API server status updates. If the node was already near memory pressure, the burst can trigger evictions or a second OOM. The node may appear to recover briefly, then fail again.
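The 137 figure follows the shell convention of 128 plus the signal number, and SIGKILL is signal 9. A minimal sketch of that arithmetic, where the function name `signal_from_exit_code` is a hypothetical helper and not part of any Kubernetes tooling:

```shell
#!/bin/sh
# Hypothetical helper: map a shell-style exit code to the signal that
# terminated the process. Codes above 128 conventionally mean 128 + signal.
signal_from_exit_code() {
    code=$1
    if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        echo $((code - 128))
    else
        # Not a signal-style exit code.
        echo 0
    fi
}
```

For example, `signal_from_exit_code 137` prints `9`, and `signal_from_exit_code 143` prints `15` (SIGTERM), which helps separate a forced kill from a graceful stop.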

During restart, existing pods usually keep running, but probes, evictions, and new scheduling stop. If the leak is a known bug, the only immediate relief is to break the cycle manually.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Container GC leak (kubernetes#131905) | Memory grows steadily; exited containers accumulate faster than they are collected | crictl ps -a vs running pod count; compare across kubelet versions |
| PodCertsManager race (kubernetes#133501) | Memory grows when PodCertificateRequest feature gate is enabled; stale CSR entries in queue | Feature gate status and pending CSRs |
| EventedPLEG panic leak (kubernetes#132266) | Goroutine count climbs with memory under node pressure; CRI DeadlineExceeded in logs | EventedPLEG feature gate and goroutine trend |
| Pod controller leak (kubernetes#131906) | Memory rises after periods of heavy pod churn; deleted pod objects retained | Pod creation/deletion rate vs memory slope |
| cgroup v1 + Linux 5.15 runc leak | Progressive growth only on cgroup v1 nodes with kernel 5.15 | kubelet_cgroup_version metric and kernel version |
| Oversubscribed node eviction loop | MemoryPressure oscillates; evicted pods reschedule and immediately pressure the node again | Node allocatable memory vs actual pod working set |

Quick checks

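Run these from the affected node. All are read-only. The curl commands assume the kubelet endpoint on port 10250 accepts the request; if it returns 401 Unauthorized, supply a bearer token or client certificate.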

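# Check kubelet RSS (KiB) on the node; -x matches the exact process name,
# -d, joins multiple PIDs with commas so ps -p accepts them
ps -o rss,vsz,comm -p "$(pgrep -d, -x kubelet)"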

# Check kubelet goroutine count from metrics
curl -sk https://localhost:10250/metrics | grep kubelet_goroutines

# Check node memory pressure condition
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")]}'

# Check PLEG relist duration
curl -sk https://localhost:10250/metrics | grep kubelet_pleg_relist_duration_seconds

# Check kubelet eviction counters
curl -sk https://localhost:10250/metrics | grep kubelet_evictions_total

# Check for kernel OOM kills of kubelet
dmesg -T | grep -i "oom-killer\|killed process.*kubelet"

# Check kubelet container GC state
crictl ps -a | wc -l

# Check cgroup version and kernel
curl -sk https://localhost:10250/metrics | grep kubelet_cgroup_version
uname -r

What good and bad look like:

  • Healthy kubelet RSS should flatten after cluster warm-up. If RSS grows by more than 100MB per day with stable pod count, treat it as a leak.
  • PLEG relist p99 under 1-2 seconds is normal. Sustained values above 10 seconds indicate the runtime or kubelet is stressed. Above 180 seconds the node will go NotReady.
  • kubelet_evictions_total with eviction_signal="memory.available" should be zero under normal load. Any sustained increase means the node is already in pressure.

How to diagnose it

  1. Confirm the leak is in kubelet, not a workload. Workload leaks show up as high container_memory_working_set_bytes for specific pod cgroups. A kubelet leak shows up as high RSS in the kubelet process itself (/proc/$(pgrep kubelet)/status or the kubelet cgroup). If pod working sets are low but node memory is falling, look at system processes.

  2. Determine the leak pattern.

    • If crictl ps -a shows hundreds of exited containers while running pods are few, suspect the container GC leak.
    • If the node runs a Kubernetes version older than 1.35 with PodCertificateRequest enabled and the CSR queue is backing up, suspect the PodCertsManager race.
    • If kubelet_pleg_relist_duration_seconds is elevated and goroutines are climbing under pressure, suspect EventedPLEG.
    • If the node uses cgroup v1 on kernel 5.15 and memory grows with no other explanation, suspect the runc interaction.
  3. Identify the OOM cycle stage.

    • Pre-OOM: kubelet RSS is climbing but the process is still running. Node may show MemoryPressure=False or flapping.
    • Active OOM: dmesg shows the kernel killed kubelet. Systemd restarts it within seconds.
    • Reconciliation storm: after restart, kubelet_pod_worker_duration_seconds and kubelet_pleg_relist_duration_seconds spike for 30-90 seconds. API server request rate from the node jumps.
    • Recovery or re-OOM: if the node has enough headroom, metrics settle. If not, the process is killed again.
  4. Check for collateral eviction damage. While kubelet was down, the node may have crossed the hard eviction threshold (memory.available<100Mi on Linux). Look for Evicted pods and OOMKilled containers. OOMKilled is a kernel cgroup kill of a container, while Evicted is a kubelet-initiated pod termination. Both can end with exit code 137, but the root causes differ.

  5. Verify control plane impact. During reconciliation, kubelet floods the API server with status updates. Check apiserver_request_total from the node or APF queue depth if you see API latency spikes coinciding with node recovery.
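Steps 1 and 3 above both reduce to reading text the node already exposes. A minimal sketch, assuming the usual `/proc/<pid>/status` layout and the kernel's "Killed process <pid> (<name>)" log wording; both function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read /proc/<pid>/status text on stdin and print
# the VmRSS value in KiB. Exits non-zero if no VmRSS line is present.
rss_kib_from_status() {
    awk '$1 == "VmRSS:" { print $2; found = 1 } END { exit !found }'
}

# Hypothetical helper: read kernel log text on stdin and count OOM kills
# whose victim was the kubelet process. Prints 0 when there are none.
count_kubelet_oom_kills() {
    grep -c -E 'Killed process [0-9]+ \(kubelet\)' || true
}
```

On a node this could be used as `rss_kib_from_status < /proc/"$(pgrep -o -x kubelet)"/status` and `dmesg | count_kubelet_oom_kills`. A non-zero kill count together with a low current RSS suggests the node is in the post-restart stage rather than pre-OOM.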

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Kubelet RSS | Direct indicator of process memory health | Growing >100MB/day with stable pod count |
| kubelet_goroutines | Goroutine leaks precede memory exhaustion | Count >500 or monotonic growth over days |
| kubelet_pleg_relist_duration_seconds | Runtime stress or internal goroutine stalls | p99 >10s or trending toward 180s |
| kubelet_evictions_total{eviction_signal="memory.available"} | Node-level pressure response | Sustained increase over an hour |
| Node MemoryPressure condition | Precedes OOM and kubelet instability | True for any extended period |
| kubelet_pod_worker_duration_seconds | Sync loop falling behind | Sustained elevation after kubelet restart |
| kubelet_eviction_stats_age_seconds | Lag between stat collection and eviction trigger | High values suggest stat pipeline delays |
| kubelet_cgroup_version | Determines whether cgroup v1 leak applies | Value of 1 on kernel 5.15 is a risk factor |
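To compare these signals against their warning thresholds in a script, the value has to be pulled out of the metrics text first. The sketch below assumes the Prometheus text exposition format without trailing timestamps; `metric_value` is a hypothetical helper, and the metric names are used exactly as they appear in the table.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first sample whose series
# (name plus optional {labels}) starts with the given prefix.
# Reads Prometheus text exposition format on stdin.
metric_value() {
    awk -v want="$1" '
        /^#/ { next }    # skip HELP and TYPE comment lines
        index($0, want) == 1 {
            rest = substr($0, length(want) + 1)
            # Require a name boundary so "foo" does not match "foo_total".
            if (rest ~ /^[ {]/) { print $NF; found = 1; exit }
        }
        END { exit !found }'
}
```

Piping the output of one of the curl checks into `metric_value kubelet_goroutines` yields a single number that can be compared with the 500 threshold. For histogram metrics such as the PLEG relist duration, this only extracts individual bucket or sum samples; computing a p99 still needs a monitoring system.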

Fixes

If the cause is a known kubelet bug

For the container GC leak (#131905) and pod controller leak (#131906), no upstream fix is currently planned. Workarounds:

  • Schedule periodic graceful kubelet restarts during maintenance windows to reclaim leaked memory.
  • Lower --image-gc-high-threshold so image garbage collection starts at a lower disk usage and runs more often, which may slow the leak rate.

For the PodCertsManager race (#133501):

  • Upgrade to Kubernetes v1.35.0 or later where the fix is included.
  • If you cannot upgrade, disable the PodCertificateRequest feature gate.

For the EventedPLEG panic leak (#132266):

  • Disable EventedPLEG by setting --feature-gates=EventedPLEG=false. This reverts to Generic PLEG polling.

For the cgroup v1 + 5.15 kernel runc leak:

  • Migrate the node pool to cgroup v2, or upgrade the kernel to a version where the runc interaction is fixed.
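Before planning a migration, it helps to confirm which nodes actually have the risky combination. A minimal sketch, assuming the filesystem type of /sys/fs/cgroup is `cgroup2fs` on cgroup v2 nodes and something else (typically `tmpfs`) on v1; the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: flag the cgroup v1 + kernel 5.15 combination.
# Arguments: filesystem type of /sys/fs/cgroup, kernel release string.
is_cgroup_v1_on_515() {
    fstype=$1
    release=$2
    # cgroup2fs means the unified (v2) hierarchy is mounted.
    [ "$fstype" != "cgroup2fs" ] || return 1
    case "$release" in
        5.15|5.15.*|5.15-*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a node, `is_cgroup_v1_on_515 "$(stat -fc %T /sys/fs/cgroup)" "$(uname -r)" && echo "at risk"` prints a marker only when both conditions hold.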

If the node is in an active OOM cycle

  1. Cordon the node immediately to prevent new scheduling during recovery.
  2. If workloads can tolerate migration, drain the node. This reduces the reconciliation load when kubelet restarts.
  3. Restart kubelet manually during a low-traffic window if it has not already been restarted by systemd. This briefly interrupts pod lifecycle operations.
  4. After restart, watch kubelet_pleg_relist_duration_seconds and kubelet_pod_worker_duration_seconds. If they do not settle within 2 minutes, the node is still resource-starved.
  5. If the leak is uncorrectable, replace the node and retire the instance.
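Steps 1 to 3 can be sketched as a small wrapper. This is an illustration under assumptions, not a vetted runbook: `run` and `break_oom_cycle` are hypothetical names, the script assumes both `kubectl` (with API access) and `systemctl` are usable from where it runs, and the drain flags shown are common choices that may not suit every workload. DRY_RUN defaults to 1 so that nothing executes by accident.

```shell
#!/bin/sh
# Sketch of steps 1-3. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper. Argument: node name. Set DRAIN=1 to include step 2.
break_oom_cycle() {
    node=$1
    run kubectl cordon "$node"
    if [ "${DRAIN:-0}" = "1" ]; then
        run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    fi
    run systemctl restart kubelet
}
```

Calling `break_oom_cycle node-a` in dry-run mode prints the commands in order, which is a cheap way to review them before setting `DRY_RUN=0`. Steps 4 and 5 stay manual because they depend on watching the metrics settle.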

If the cause is node oversubscription

Lower pod density or increase node memory. Configure kube-reserved and system-reserved so kubelet has a protected memory budget. The eviction manager acts on observed memory usage, not on the requests the scheduler used for placement. If actual usage exceeds allocatable because limits are not enforced, the OOM killer can strike before kubelet has a chance to evict.
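Node allocatable memory is capacity minus kube-reserved, system-reserved, and the hard eviction threshold. A minimal sketch of that arithmetic with all values in Mi; the function names are hypothetical and the figures in the example are illustrative, not recommendations:

```shell
#!/bin/sh
# Hypothetical helper: node allocatable memory in Mi.
# Arguments (all in Mi): capacity kube_reserved system_reserved eviction_hard
allocatable_mi() {
    echo $(( $1 - $2 - $3 - $4 ))
}

# Hypothetical helper: headroom between allocatable and the observed
# pod working set, both in Mi. A negative result means pods already
# use more memory than the node advertises as allocatable.
headroom_mi() {
    echo $(( $1 - $2 ))
}
```

For a 16384 Mi node with 1024 Mi kube-reserved, 512 Mi system-reserved, and the 100 Mi hard eviction threshold, `allocatable_mi 16384 1024 512 100` prints `14748`. Comparing that with the summed pod working set shows how close the node is to the eviction loop described above.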

Prevention

  • Monitor kubelet RSS as a time-series and alert on slope, not just absolute value. A flat baseline with daily sawtooth (Go GC) is healthy. A rising baseline is not.
  • Track kubelet_goroutines with the same discipline. Goroutine growth is the earliest leading indicator of several leak paths.
  • Size kube-reserved memory to include kubelet steady-state footprint plus headroom for reconciliation bursts. On dense nodes this can be 500MB to 1GB.
  • Enable memory pressure alerts at the node level before the hard eviction threshold is crossed.
  • Audit feature gates after upgrades. New alpha gates like PodCertificateRequest and EventedPLEG have introduced leak paths in recent versions.
  • If you run cgroup v1 nodes on kernel 5.15, treat memory growth as a known defect and plan migration.

How Netdata helps

  • Correlates kubelet process RSS with node-level memory pressure and container cgroup usage, making it easy to distinguish a kubelet leak from a workload leak.
  • Tracks kubelet_goroutines and kubelet_pleg_relist_duration_seconds together so you can spot EventedPLEG-related goroutine accumulation before the node goes NotReady.
  • Surfaces OOM kill events alongside node readiness transitions, mapping the exact timeline of the OOM cycle.
  • Alerts on sustained growth in kubelet memory and on node MemoryPressure, giving you runway before eviction begins.

The OOM cycle as a Mermaid flowchart:

flowchart TD
    A[Kubelet memory leak] --> B{Memory limit reached}
    B -->|Kernel OOM killer| C[Kubelet killed SIGKILL]
    C --> D[Systemd restarts kubelet]
    D --> E[Full reconciliation storm]
    E --> F[CRI list & API sync burst]
    F --> G[CPU & memory spike]
    G --> H{Leak still present?}
    H -->|Yes| B
    H -->|No| I[Node stabilizes]