Kubernetes kubelet not responding: PLEG, runtime, and certificate issues

A Kubernetes node flipping to NotReady while containers keep running is one of the most confusing production failure modes. The kubelet is the node agent that reconciles API server intent with running containers. When it stops responding or reports unhealthy subsystems, the control plane marks the node NotReady and reschedules workloads, even though the data plane may still serve traffic.

This guide covers three failure domains: Pod Lifecycle Event Generator (PLEG) stalls, container runtime disconnections, and kubelet certificate expiration or rotation failures. It shows how to distinguish these symptoms, run safe, targeted diagnostics, and apply fixes without resorting to blind node reboots.

What this means

The kubelet reconciles desired pod state from the API server with actual container state through its sync loop. It relies on the PLEG to observe runtime changes, the Container Runtime Interface (CRI) to manage containers, and valid TLS certificates to authenticate with the API server.

When PLEG stalls, the kubelet cannot detect container starts, stops, or deaths. The Healthy() check fails if the elapsed time since the last successful relist() exceeds three minutes. Once PLEG is unhealthy, the kubelet skips pod synchronization and the node goes NotReady.
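As a quick sanity check, the relist histogram's _sum and _count series give a rough mean relist latency. This is a sketch that assumes local access to port 10250; many clusters also require a bearer token on the metrics endpoint.

```shell
# Approximate mean PLEG relist latency from the kubelet metrics endpoint.
# Assumes local access to port 10250; add an Authorization header if required.
curl -sk https://localhost:10250/metrics \
  | awk '/^kubelet_pleg_relist_duration_seconds_sum/ {s=$2} /^kubelet_pleg_relist_duration_seconds_count/ {c=$2} END {if (c > 0) printf "mean relist: %.3fs over %d relists\n", s/c, c}'
```

A mean creeping toward tens of seconds is an early warning long before the three-minute threshold trips.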

When the container runtime socket becomes unresponsive, the kubelet cannot execute lifecycle operations. crictl commands hang, PLEG cannot relist, and pod status updates stop. The runtime may still manage existing containers, but the kubelet is blind to them.

When the kubelet’s client certificate expires or rotation fails, the kubelet loses API server authentication. The node status becomes stale and eventually shows Unknown. Existing pods keep running autonomously, but the node is unmanaged and no new pods can be scheduled there.
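A quick way to tell whether the certificate is inside a renewal window is openssl's -checkend flag, which takes a horizon in seconds. The path below is the common kubelet default and may differ on your distribution.

```shell
# Exit 0 if the kubelet client certificate is still valid 7 days (604800s) from now.
# The certificate path is the common default; adjust for your distribution.
CERT=/var/lib/kubelet/pki/kubelet-client-current.pem
if openssl x509 -checkend 604800 -noout -in "$CERT"; then
  echo "certificate valid for at least 7 more days"
else
  echo "certificate expires within 7 days (or is already expired)"
fi
```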

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| PLEG unhealthy / slow relist | Node NotReady with "PLEG is not healthy"; crictl ps is slow | PLEG relist duration and runtime CPU and disk I/O |
| Container runtime hung | crictl info hangs; runtime socket exists but commands fail | Runtime process status and socket permissions |
| Certificate expired or rotation failed | Node Ready=Unknown; kubelet logs show certificate or 401 errors | Certificate TTL and pending CSRs |
| Kubelet OOM or resource starvation | Kubelet process restarting; dmesg shows OOM kill | Kubelet RSS and node memory pressure |
| API server connectivity loss | Node Ready=Unknown; lease renewals failing | Network path and API server health from the node |

Quick checks

# Check node Ready condition and reason
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
# Check kubelet process and systemd state
systemctl is-active kubelet
pgrep -x kubelet
# Check PLEG relist and CRI operation latency (the metrics endpoint may require a bearer token)
curl -sk https://localhost:10250/metrics | grep kubelet_pleg_relist_duration_seconds
curl -sk https://localhost:10250/metrics | grep kubelet_runtime_operations_duration_seconds
# Test container runtime directly via crictl (use /run/crio/crio.sock for CRI-O)
time crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
# Search kubelet logs for PLEG, runtime, or certificate errors
journalctl -u kubelet --since "10 minutes ago" | grep -iE "pleg.*unhealthy|runtime|certificate|connection refused"
# Check client certificate expiry
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# Look for unapproved CSRs
kubectl get csr | grep Pending
# Find the API server address from kubelet configuration
grep server /etc/kubernetes/kubelet.conf

# Test API server reachability from the node (200 or 401 indicates the server is up)
curl -k -s -o /dev/null -w "%{http_code}" https://<apiserver-host>:6443/healthz
# Check kubelet CPU and memory usage (pgrep -x avoids matching e.g. kubelet wrapper scripts)
ps -p "$(pgrep -x kubelet)" -o %cpu,rss,cmd
# Verify heartbeat lease freshness
kubectl get lease <node-name> -n kube-node-lease -o jsonpath='{.spec.renewTime}'

How to diagnose it

  1. Confirm the node condition. Run kubectl get node and inspect the Ready condition. Note whether it is False or Unknown, and capture the reason message. False with “PLEG is not healthy” points to runtime or kubelet slowness; Unknown often points to API server or certificate issues.
  2. Verify kubelet process liveness. Use systemctl is-active kubelet and pgrep. If the process is missing, check dmesg and journalctl -k for OOM kills or panics.
  3. Probe kubelet’s /healthz endpoint. If the process is alive but the endpoint times out or returns non-200, the kubelet may be internally deadlocked or severely resource-starved.
  4. Read kubelet logs. Search for “PLEG is not healthy”, certificate errors, or runtime connection failures. This usually isolates the failure to one of the three domains.
  5. Test the container runtime directly. Run time crictl ps. If this hangs, the root cause is runtime slowness. If it is fast, the bottleneck is inside the kubelet.
  6. Inspect PLEG metrics. If crictl ps is fast but kubelet_pleg_relist_duration_seconds is high, investigate kubelet CPU starvation, excessive pod density, or a goroutine leak that prevents the relist goroutine from completing.
  7. Check certificate TTL. Inspect /var/lib/kubelet/pki/kubelet-client-current.pem. If it is expired or near expiry, look for pending CSRs and verify that kube-controller-manager is approving them.
  8. Test API server reachability. If the API server is unreachable, fix the network path first. If it is reachable but returns 401 Unauthorized, the issue is certificate-based.
  9. Check node resource pressure. Examine MemoryPressure, DiskPressure, and PIDPressure conditions. High pressure can slow or crash the kubelet.
  10. Capture diagnostics before restart. If you suspect a leak or hang, collect a goroutine dump before restarting, because the restart destroys the evidence.
# Requires kubelet debugging handlers; most clusters also need a bearer token
curl -sk https://localhost:10250/debug/pprof/goroutine?debug=1 > /tmp/kubelet_goroutines.txt

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| kubelet_pleg_relist_duration_seconds | Predicts node NotReady before the hard 3-minute threshold | p99 above 10-30 seconds |
| Node Ready condition | Binary cluster-level health indicator | False or Unknown sustained > 1 minute |
| kubelet_runtime_operations_duration_seconds | Runtime slowness cascades to PLEG and pod lifecycle | p99 above 5 seconds for list operations |
| kubelet_certificate_manager_client_ttl_seconds | Silent failure until expiration causes total API auth loss | Below 7 days, or kubelet_certificate_manager_client_expiration_renew_errors > 0 |
| Kubelet RSS memory | OOM kills cause full reconciliation storms and brief outages | Sustained growth above 500MB without pod count increase |
| Kubelet goroutine count (go_goroutines) | Leading indicator of memory leaks and scheduler pressure | Above 500 or steady growth over days |
| rest_client_request_duration_seconds | Lease renewal and status updates depend on fast API writes | p99 above 5 seconds or 5xx errors |
| Node lease renewTime age | Direct measure of heartbeat freshness | Age approaching 40 seconds (default lease duration) |
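Heartbeat lease age can be computed directly. This sketch assumes GNU date is available for RFC 3339 parsing; <node-name> is a placeholder.

```shell
# Age of the node's heartbeat lease in seconds; values near 40s mean missed renewals.
# Assumes GNU date; <node-name> is a placeholder for your node.
RENEW=$(kubectl get lease <node-name> -n kube-node-lease -o jsonpath='{.spec.renewTime}')
echo "lease age: $(( $(date +%s) - $(date -d "$RENEW" +%s) ))s"
```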

Fixes

If the cause is PLEG unhealthy or slow runtime

Restart containerd if it is hung. On systemd-managed nodes, systemctl restart containerd usually preserves existing containers via shims, but verify workload tolerance before restarting. If orphaned shims block state queries, drain the node and restart containerd; avoid manually killing shims, which can leave defunct containers. Reduce pod density or container churn if the node is overloaded. Investigate disk I/O saturation; high iowait slows runtime state reads.
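To confirm iowait pressure without installing extra tooling, a rough sample can be taken from /proc/stat. This is a Linux-only sketch for triage, not a substitute for proper disk monitoring.

```shell
# Sample CPU iowait percentage over ~2 seconds from /proc/stat (Linux only).
# Fields: cpu user nice system idle iowait ... (remaining fields discarded)
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 2
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2-u1)+(n2-n1)+(s2-s1)+(i2-i1)+(w2-w1) ))
[ "$total" -gt 0 ] && echo "iowait: $(( 100 * (w2 - w1) / total ))%"
```

Sustained iowait above roughly 20-30% on a node with a slow relist is strong evidence of a storage bottleneck.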

If the cause is runtime disconnection

Check that the runtime process is running with systemctl status containerd. Verify the socket file exists at the path configured in --container-runtime-endpoint and that the kubelet has read-write access. If the runtime is running but unresponsive, restart it. Review runtime logs with journalctl -u containerd for crashes, thin pool exhaustion, or snapshotter errors.
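A minimal socket sanity check, assuming the containerd default path (swap in /run/crio/crio.sock for CRI-O):

```shell
# Verify the CRI socket exists and is actually a socket file, then show its ownership.
# Path is the containerd default; use /run/crio/crio.sock for CRI-O.
SOCK=/run/containerd/containerd.sock
if [ -S "$SOCK" ]; then
  ls -l "$SOCK"
else
  echo "runtime socket missing or not a socket: $SOCK"
fi
```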

If the cause is certificate expiration

Check for pending CSRs with kubectl get csr and approve them manually if necessary: kubectl certificate approve <csr-name>. Ensure kube-controller-manager is running and the CSR auto-approval mechanism is functional. If the certificate expired while the API server was unreachable, restore connectivity first.
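If many CSRs are stuck, they can be approved in bulk. This sketch assumes jq is installed; always review the list produced by the first two commands before piping it into approval.

```shell
# List CSRs that have no approval/denial condition yet, then approve them.
# Requires jq; inspect the names before approving blindly.
kubectl get csr -o json \
  | jq -r '.items[] | select((.status.conditions // []) | length == 0) | .metadata.name' \
  | xargs -r kubectl certificate approve
```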

If the kubelet still has a valid bootstrap kubeconfig, restarting the kubelet may trigger a fresh bootstrap or certificate renewal. Verify NTP synchronization on the node; clock skew causes certificate validation failures.

If the cause is kubelet OOM or resource starvation

Cordon the node to prevent new scheduling, then drain or delete excess pods to relieve pressure. Restarting the kubelet temporarily restores responsiveness, but expect a reconciliation spike in CRI calls and API requests when it returns. Increase --kube-reserved or --system-reserved to protect kubelet resources. If memory grows monotonically over days without a corresponding increase in pod count, capture a heap profile and plan an upgrade if a leak is confirmed in your version.
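Before any restart, the kubelet's pprof endpoint can capture a heap profile for later analysis. This assumes debugging handlers are enabled; most clusters also require a bearer token.

```shell
# Capture a kubelet heap profile before restarting; the restart destroys the evidence.
# Assumes kubelet debugging handlers are enabled; add a bearer token if required.
curl -sk https://localhost:10250/debug/pprof/heap > /tmp/kubelet_heap.pb.gz
ls -lh /tmp/kubelet_heap.pb.gz
```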

If the cause is API server partition

Fix the network path or control plane issue. Do not restart kubelet blindly during an API server outage. The kubelet keeps existing pods alive while disconnected. Restarting it only adds a reconciliation thundering herd when connectivity returns. Focus on restoring API server health or load balancer connectivity.

Prevention

  • PLEG latency alerts. Alert on PLEG relist p99 well below the 3-minute hard limit. A threshold of 30-60 seconds gives time to investigate before the node goes NotReady.
  • Scrape kubelet metrics. Collect kubelet_pleg_relist_duration_seconds, kubelet_runtime_operations_duration_seconds, kubelet_certificate_manager_client_ttl_seconds, and go_goroutines from the authenticated metrics endpoint.
  • Certificate rotation verification. Do not assume auto-rotation works. Monitor certificate TTL and pending CSRs proactively.
  • Resource reservations. Set --kube-reserved and --system-reserved high enough to prevent kubelet from being starved or OOM-killed by workloads.
  • Runtime health checks. Monitor container runtime latency independently of kubelet. A periodic time crictl ps or runtime-specific metrics catch runtime degradation early.
  • Disk and inode monitoring. Keep node filesystem usage below 70% and monitor inode consumption separately. Configure container log rotation and image GC thresholds aggressively enough to avoid DiskPressure.
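The runtime health check mentioned above can be as simple as a cron-driven probe. The 10-second threshold and socket path here are assumptions to tune for your environment.

```shell
#!/bin/sh
# Periodic runtime health probe (run from cron or a systemd timer).
# Logs a warning if listing containers fails or takes longer than 10 seconds.
# Threshold and socket path are assumptions; tune for your environment.
if ! timeout 10 crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps > /dev/null 2>&1; then
  logger -t runtime-health "crictl ps failed or exceeded 10s"
fi
```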

How Netdata helps

  • Correlate node iowait and disk latency with PLEG relist duration spikes to confirm runtime storage bottlenecks.
  • Track kubelet and containerd process memory and CPU usage to detect resource starvation or leaks before OOM kills.
  • Monitor system memory, swap, and OOM killer events alongside kubelet eviction metrics to distinguish node pressure from application limits.
  • Visualize API server connectivity and network latency from the node to identify partition events.
  • Alert on certificate expiration windows by monitoring file ages or integrating with kubelet metrics.