Kubernetes kubelet memory leak: detection and OOM cycle
Kubelet memory growth ends in one of two ways: the process hits its cgroup limit or the node runs out of memory. Either way, the kernel OOM killer sends SIGKILL. Systemd restarts kubelet, but the new process has cold caches and immediately runs a full reconciliation pass: relisting all containers, re-syncing every pod status, and re-attaching every volume. On a busy node, that burst spikes CPU and memory, which can push the fresh kubelet back over the edge and create a Ready/NotReady flap cycle.
This guide covers how to detect the leak before the kill, distinguish it from a workload leak, and break the cycle without worsening the node state.
What this means
Kubelet resident memory should correlate with pod count and then flatten. A leak manifests as monotonic RSS growth over days or weeks that is uncorrelated with workload growth.
When kubelet RSS hits its limit, the kernel sends SIGKILL. The exit is logged as exit code 137. After restart, kubelet rebuilds its entire state. That full reconciliation generates a burst of CRI ListContainers calls and API server status updates. If the node was already near memory pressure, the burst can trigger evictions or a second OOM. The node may appear to recover briefly, then fail again.
During restart, existing pods usually keep running, but probes, evictions, and new scheduling stop. If the leak is a known bug, the only immediate relief is to break the cycle manually.
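One way to confirm the cycle described above is to count how often the kernel has already killed kubelet. The following is a minimal sketch, assuming the kernel log uses the usual "Killed process <pid> (kubelet)" wording; the function name `kubelet_oom_kills` is hypothetical.

```bash
# Count kernel OOM kills of kubelet in log text read from stdin.
# Assumes lines such as: "Out of memory: Killed process 1234 (kubelet) ..."
kubelet_oom_kills() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -Eic 'killed process [0-9]+ \(kubelet\)' || true
}

# Usage on the node (read-only):
#   dmesg -T | kubelet_oom_kills
#   journalctl -k --since "24 hours ago" | kubelet_oom_kills
```

More than one kill in a short window, together with Ready/NotReady transitions on the node, suggests the flap cycle rather than a one-off OOM.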
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Container GC leak (kubernetes#131905) | Memory grows steadily; exited containers accumulate faster than they are collected | crictl ps -a vs running pod count; compare across kubelet versions |
| PodCertsManager race (kubernetes#133501) | Memory grows when PodCertificateRequest feature gate is enabled; stale CSR entries in queue | Feature gate status and pending CSRs |
| EventedPLEG panic leak (kubernetes#132266) | Goroutine count climbs with memory under node pressure; CRI DeadlineExceeded in logs | EventedPLEG feature gate and goroutine trend |
| Pod controller leak (kubernetes#131906) | Memory rises after periods of heavy pod churn; deleted pod objects retained | Pod creation/deletion rate vs memory slope |
| cgroup v1 + Linux 5.15 runc leak | Progressive growth only on cgroup v1 nodes with kernel 5.15 | kubelet_cgroup_version metric and kernel version |
| Oversubscribed node eviction loop | MemoryPressure oscillates; evicted pods reschedule and immediately pressure the node again | Node allocatable memory vs actual pod working set |
Quick checks
Run these from the affected node. All are read-only.
# Check kubelet RSS on the node
# -x matches the exact process name and -o picks the oldest match, so ps gets a single PID
ps -p "$(pgrep -o -x kubelet)" -o rss,vsz,comm
# Check kubelet goroutine count from metrics
curl -sk https://localhost:10250/metrics | grep kubelet_goroutines
# Check node memory pressure condition
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")]}'
# Check PLEG relist duration
curl -sk https://localhost:10250/metrics | grep kubelet_pleg_relist_duration_seconds
# Check kubelet eviction counters
curl -sk https://localhost:10250/metrics | grep kubelet_evictions_total
# Check for kernel OOM kills of kubelet
dmesg -T | grep -i "oom-killer\|killed process.*kubelet"
# Check kubelet container GC state
crictl ps -a | wc -l
# Check cgroup version and kernel
curl -sk https://localhost:10250/metrics | grep kubelet_cgroup_version
uname -r
What good and bad look like:
- Healthy kubelet RSS should flatten after cluster warm-up. If RSS grows by more than 100MB per day with stable pod count, treat it as a leak.
- PLEG relist p99 under 1-2 seconds is normal. Sustained values above 10 seconds indicate the runtime or kubelet is stressed. Above 180 seconds the node will go NotReady.
- kubelet_evictions_total with eviction_signal="memory.available" should be zero under normal load. Any sustained increase means the node is already in pressure.
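The 100MB-per-day rule of thumb can be checked with a small amount of sampling. This is a rough sketch, not a definitive method: the helper name `rss_growth_mb_per_day` and the log path in the usage comment are assumptions, and the slope is computed only from the first and last sample rather than a proper regression.

```bash
# Estimate kubelet RSS growth in MB/day from samples on stdin.
# Input: one "epoch_seconds rss_kb" pair per line, oldest first.
rss_growth_mb_per_day() {
  awk '
    NR == 1 { t0 = $1; r0 = $2 }   # first sample
            { t1 = $1; r1 = $2 }   # overwritten until the last sample
    END {
      if (NR < 2 || t1 <= t0) {
        print "need at least two samples spanning time" > "/dev/stderr"
        exit 1
      }
      # kB delta -> MB, seconds delta -> days
      printf "%.1f\n", ((r1 - r0) / 1024) / ((t1 - t0) / 86400)
    }'
}

# Collect one sample (for example from cron or a systemd timer on the node):
#   echo "$(date +%s) $(ps -o rss= -p "$(pgrep -o -x kubelet)")" >> /var/tmp/kubelet-rss.log
# Then estimate the slope:
#   rss_growth_mb_per_day < /var/tmp/kubelet-rss.log
```

Compare the result against pod count over the same period; growth that tracks pod count is expected, growth with a flat pod count is the leak signature.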
How to diagnose it
1. Confirm the leak is in kubelet, not a workload. Workload leaks show up as high container_memory_working_set_bytes for specific pod cgroups. A kubelet leak shows up as high RSS in the kubelet process itself (/proc/$(pgrep kubelet)/status or the kubelet cgroup). If pod working sets are low but node memory is falling, look at system processes.
2. Determine the leak pattern.
   - If crictl ps -a shows hundreds of exited containers while running pods are few, suspect the container GC leak.
   - If the node runs Kubernetes 1.35 or older with PodCertificateRequest enabled and the CSR queue is backing up, suspect the PodCertsManager race.
   - If kubelet_pleg_relist_duration_seconds is elevated and goroutines are climbing under pressure, suspect EventedPLEG.
   - If the node uses cgroup v1 on kernel 5.15 and memory grows with no other explanation, suspect the runc interaction.
3. Identify the OOM cycle stage.
   - Pre-OOM: kubelet RSS is climbing but the process is still running. Node may show MemoryPressure=False or flapping.
   - Active OOM: dmesg shows the kernel killed kubelet. Systemd restarts it within seconds.
   - Reconciliation storm: after restart, kubelet_pod_worker_duration_seconds and kubelet_pleg_relist_duration_seconds spike for 30-90 seconds. API server request rate from the node jumps.
   - Recovery or re-OOM: if the node has enough headroom, metrics settle. If not, the process is killed again.
4. Check for collateral eviction damage. While kubelet was down, the node may have crossed the hard eviction threshold (memory.available<100Mi on Linux). Look for Evicted pods and OOMKilled containers. OOMKilled is a kernel cgroup kill of a container, while Evicted is a kubelet-initiated pod termination. Both report exit code 137 but have different root causes.
5. Verify control plane impact. During reconciliation, kubelet floods the API server with status updates. Check apiserver_request_total from the node or APF queue depth if you see API latency spikes coinciding with node recovery.
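For the first check above (confirming the leak is in the kubelet process itself), the relevant fields can be pulled straight from the process status file. A minimal sketch; the helper name `proc_mem_kb` is hypothetical. RssAnon is included because Go heap growth is anonymous memory, so a kubelet leak would be expected to show up there rather than in RssFile.

```bash
# Print selected memory fields (in kB) from a /proc/<pid>/status file.
# VmRSS is total resident memory; RssAnon and RssFile split it into
# anonymous and file-backed parts.
proc_mem_kb() {
  awk '/^(VmRSS|RssAnon|RssFile):/ { sub(":", "", $1); print $1, $2 }' "$1"
}

# Usage on the node:
#   proc_mem_kb "/proc/$(pgrep -o -x kubelet)/status"
```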
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Kubelet RSS | Direct indicator of process memory health | Growing >100MB/day with stable pod count |
| kubelet_goroutines | Goroutine leaks precede memory exhaustion | Count >500 or monotonic growth over days |
| kubelet_pleg_relist_duration_seconds | Runtime stress or internal goroutine stalls | p99 >10s or trending toward 180s |
| kubelet_evictions_total{eviction_signal="memory.available"} | Node-level pressure response | Sustained increase over an hour |
| Node MemoryPressure condition | Precedes OOM and kubelet instability | True for any extended period |
| kubelet_pod_worker_duration_seconds | Sync loop falling behind | Sustained elevation after kubelet restart |
| kubelet_eviction_stats_age_seconds | Lag between stat collection and eviction trigger | High values suggest stat pipeline delays |
| kubelet_cgroup_version | Determines whether cgroup v1 leak applies | Value of 1 on kernel 5.15 is a risk factor |
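Without a Prometheus server at hand, a coarse view of PLEG health can be derived from the histogram's _sum and _count series in the raw metrics output. This sketch assumes the standard Prometheus text format; the function name `pleg_relist_mean_seconds` is hypothetical. It yields the mean relist duration since kubelet started, which is not the p99 used in the table above and will hide short spikes.

```bash
# Mean PLEG relist duration (seconds) from Prometheus text on stdin.
# Uses the histogram _sum and _count series; the value is the last field.
pleg_relist_mean_seconds() {
  awk '
    /^kubelet_pleg_relist_duration_seconds_sum/   { sum = $NF }
    /^kubelet_pleg_relist_duration_seconds_count/ { count = $NF }
    END {
      if (count > 0) {
        printf "%.3f\n", sum / count
      } else {
        print "no relist samples found" > "/dev/stderr"
        exit 1
      }
    }'
}

# Usage on the node (add credentials if your kubelet requires them):
#   curl -sk https://localhost:10250/metrics | pleg_relist_mean_seconds
```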
Fixes
If the cause is a known kubelet bug
For the container GC leak (#131905) and pod controller leak (#131906), no upstream fix is currently planned. Workarounds:
- Schedule periodic graceful kubelet restarts during maintenance windows to reclaim leaked memory.
- Tune --image-gc-high-threshold to trigger more frequent garbage collection cycles, which may slow the leak rate.
For the PodCertsManager race (#133501):
- Upgrade to Kubernetes v1.35.0 or later where the fix is included.
- If you cannot upgrade, disable the PodCertificateRequest feature gate.
For the EventedPLEG panic leak (#132266):
- Disable EventedPLEG by setting --feature-gates=EventedPLEG=false. This reverts to Generic PLEG polling.
For the cgroup v1 + 5.15 kernel runc leak:
- Migrate the node pool to cgroup v2, or upgrade the kernel to a version where the runc interaction is fixed.
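Before changing a feature gate, it helps to know how it is currently set on the node. The sketch below parses a comma-separated feature-gates string; the function name `gate_state` is hypothetical. It only covers gates passed on the command line. Gates can also be set through the featureGates field of the kubelet configuration file, which this does not inspect, and an "unset" result means the default for the running version applies.

```bash
# Look up one gate in a comma-separated feature-gates string.
# Prints "true", "false", or "unset" (gate not listed).
gate_state() {
  local gate=$1
  local gates=$2
  local pair
  local IFS=','
  for pair in $gates; do
    if [ "${pair%%=*}" = "$gate" ]; then
      echo "${pair#*=}"
      return 0
    fi
  done
  echo "unset"
}

# Usage on the node:
#   flags=$(tr '\0' '\n' < "/proc/$(pgrep -o -x kubelet)/cmdline" | sed -n 's/^--feature-gates=//p')
#   gate_state EventedPLEG "$flags"
```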
If the node is in an active OOM cycle
- Cordon the node immediately to prevent new scheduling during recovery.
- If workloads can tolerate migration, drain the node. This reduces the reconciliation load when kubelet restarts.
- Restart kubelet manually during a low-traffic window if it has not already been restarted by systemd. This briefly interrupts pod lifecycle operations.
- After restart, watch kubelet_pleg_relist_duration_seconds and kubelet_pod_worker_duration_seconds. If they do not settle within 2 minutes, the node is still resource-starved.
- If the leak is uncorrectable, replace the node and retire the instance.
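The cordon, drain, and restart steps above can be sketched as a small script. This is an illustration under stated assumptions, not a recommended runbook: the function names `run` and `break_oom_cycle` and the DRY_RUN and DRAIN variables are hypothetical, and it assumes kubectl and systemctl are both usable from where it runs (in practice kubectl is often run from an admin host and systemctl on the node). It defaults to a dry run so nothing changes unless you opt in.

```bash
# Print commands instead of executing them unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Recovery sequence for a node in an active OOM cycle.
break_oom_cycle() {
  local node=$1
  # Stop new scheduling first.
  run kubectl cordon "$node"
  # Optional: only if workloads tolerate migration (set DRAIN=1).
  if [ "${DRAIN:-0}" = "1" ]; then
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  fi
  # Must run on the node itself; briefly interrupts pod lifecycle operations.
  run systemctl restart kubelet
}

# Usage:
#   break_oom_cycle <node-name>                    # prints the plan only
#   DRY_RUN=0 DRAIN=1 break_oom_cycle <node-name>  # executes it
```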
If the cause is node oversubscription
Lower pod density or increase node memory. Ensure kube-reserved and system-reserved are configured so kubelet has a protected memory budget. The eviction manager operates independently of scheduling requests. If actual memory usage exceeds allocatable because limits are not enforced, the OOM killer strikes before kubelet can evict.
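To see how much memory the reservations actually leave for pods, the allocatable figure can be worked out by hand: capacity minus kube-reserved, system-reserved, and the hard eviction threshold. A minimal sketch with all values in Mi; the function name `allocatable_mi` and the numbers in the usage comment are illustrative assumptions, and the fourth argument defaults to the 100Mi hard threshold mentioned earlier.

```bash
# Allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold.
# All arguments are in Mi.
allocatable_mi() {
  local capacity=$1
  local kube_reserved=$2
  local system_reserved=$3
  local eviction_hard=${4:-100}
  echo $(( capacity - kube_reserved - system_reserved - eviction_hard ))
}

# Example: 16 GiB node, 1 GiB kube-reserved, 512 Mi system-reserved
#   allocatable_mi 16384 1024 512
```

If the sum of actual pod working sets regularly approaches this number, the node is oversubscribed regardless of what the scheduler's request accounting says.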
Prevention
- Monitor kubelet RSS as a time-series and alert on slope, not just absolute value. A flat baseline with daily sawtooth (Go GC) is healthy. A rising baseline is not.
- Track kubelet_goroutines with the same discipline. Goroutine growth is the earliest leading indicator of several leak paths.
- Size kube-reserved memory to include kubelet steady-state footprint plus headroom for reconciliation bursts. On dense nodes this can be 500MB to 1GB.
- Enable memory pressure alerts at the node level before the hard eviction threshold is crossed.
- Audit feature gates after upgrades. New alpha gates like PodCertificateRequest and EventedPLEG have introduced leak paths in recent versions.
- If you run cgroup v1 nodes on kernel 5.15, treat memory growth as a known defect and plan migration.
How Netdata helps
- Correlates kubelet process RSS with node-level memory pressure and container cgroup usage, making it easy to distinguish a kubelet leak from a workload leak.
- Tracks kubelet_goroutines and kubelet_pleg_relist_duration_seconds together so you can spot EventedPLEG-related goroutine accumulation before the node goes NotReady.
- Surfaces OOM kill events alongside node readiness transitions, mapping the exact timeline of the OOM cycle.
- Alerts on sustained growth in kubelet memory and on node MemoryPressure, giving you runway before eviction begins.
Related guides
- See Kubernetes eviction cascade: when one node failure takes down the cluster for how memory pressure spreads.
- See Kubernetes kubelet not responding: PLEG, runtime, and certificate issues for PLEG and runtime diagnostics.
- See Kubernetes kubelet certificate expired: detection, rotation, and recovery for CSR and certificate issues.
- See Kubernetes monitoring checklist: the signals every production cluster needs for baseline monitoring coverage.
flowchart TD
A[Kubelet memory leak] --> B{Memory limit reached}
B -->|Kernel OOM killer| C[Kubelet killed SIGKILL]
C --> D[Systemd restarts kubelet]
D --> E[Full reconciliation storm]
E --> F[CRI list & API sync burst]
F --> G[CPU & memory spike]
G --> H{Leak still present?}
H -->|Yes| B
H -->|No| I[Node stabilizes]





