Kubernetes kubelet not responding: PLEG, runtime, and certificate issues
A Kubernetes node flipping to NotReady while containers keep running is one of the most confusing production failure modes. The kubelet is the node agent that reconciles API server intent with running containers. When it stops responding or reports unhealthy subsystems, the control plane marks the node NotReady and reschedules workloads, even though the data plane may still serve traffic.
This guide covers three failure domains: Pod Lifecycle Event Generator (PLEG) stalls, container runtime disconnections, and kubelet certificate expiration or rotation failures. It shows how to distinguish their symptoms, run safe, targeted diagnostics, and apply fixes without resorting to blind node reboots.
What this means
The kubelet reconciles desired pod state from the API server with actual container state through its sync loop. It relies on the PLEG to observe runtime changes, the Container Runtime Interface (CRI) to manage containers, and valid TLS certificates to authenticate with the API server.
When PLEG stalls, the kubelet cannot detect container starts, stops, or deaths. The Healthy() check fails if the elapsed time since the last successful relist() exceeds three minutes. Once PLEG is unhealthy, the kubelet skips pod synchronization and the node goes NotReady.
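When the threshold trips, the kubelet logs a distinctive message. A quick way to confirm, assuming systemd-managed kubelet logs (the exact wording varies slightly by kubelet version):

```bash
# Look for the PLEG health failure message in recent kubelet logs
journalctl -u kubelet --since "30 minutes ago" | grep -i "PLEG is not healthy"
# Representative output (wording varies by version):
# skipping pod synchronization - [PLEG is not healthy: pleg was last seen
# active 3m4.2s ago; threshold is 3m0s]
```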
When the container runtime socket becomes unresponsive, the kubelet cannot execute lifecycle operations. crictl commands hang, PLEG cannot relist, and pod status updates stop. The runtime may still manage existing containers, but the kubelet is blind to them.
When the kubelet’s client certificate expires or rotation fails, the kubelet loses API server authentication. The node status becomes stale and eventually shows Unknown. Existing pods keep running autonomously, but the node is unmanaged and no new pods can be scheduled there.
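On kubeadm-provisioned nodes, rotated client certificates accumulate under /var/lib/kubelet/pki, with kubelet-client-current.pem as a symlink to the newest one. A quick inspection, assuming that layout:

```bash
# kubelet-client-current.pem should be a symlink to the most recent rotated cert
ls -l /var/lib/kubelet/pki/
# Confirm the identity and expiry of the active client certificate
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -enddate
```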
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| PLEG unhealthy / slow relist | Node NotReady with PLEG is not healthy; crictl ps is slow | PLEG relist duration and runtime CPU and disk I/O |
| Container runtime hung | crictl info hangs; runtime socket exists but commands fail | Runtime process status and socket permissions |
| Certificate expired or rotation failed | Node Ready=Unknown; kubelet logs show certificate or 401 errors | Certificate TTL and pending CSRs |
| Kubelet OOM or resource starvation | Kubelet process restarting; dmesg shows OOM kill | Kubelet RSS and node memory pressure |
| API server connectivity loss | Node Ready=Unknown; lease renewals failing | Network path and API server health from the node |
Quick checks
# Check node Ready condition and reason
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
# Check kubelet process and systemd state
systemctl is-active kubelet
pgrep -x kubelet
# Check PLEG relist and CRI operation latency
curl -sk https://localhost:10250/metrics | grep kubelet_pleg_relist_duration_seconds
curl -sk https://localhost:10250/metrics | grep kubelet_runtime_operations_duration_seconds
# Test container runtime directly via crictl (use /run/crio/crio.sock for CRI-O)
time crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
# Search kubelet logs for PLEG, runtime, or certificate errors
journalctl -u kubelet --since "10 minutes ago" | grep -iE "pleg.*unhealthy|runtime|certificate|connection refused"
# Check client certificate expiry
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# Look for unapproved CSRs
kubectl get csr | grep Pending
# Find the API server address from kubelet configuration
grep server /etc/kubernetes/kubelet.conf
# Test API server reachability from the node (200 or 401 indicates the server is up)
curl -k -s -o /dev/null -w "%{http_code}" https://<apiserver-host>:6443/healthz
# Check kubelet CPU and memory usage
ps -p $(pgrep kubelet) -o %cpu,rss,cmd
# Verify heartbeat lease freshness
kubectl get lease <node-name> -n kube-node-lease -o jsonpath='{.spec.renewTime}'
How to diagnose it
- Confirm the node condition. Run kubectl get node and inspect the Ready condition. Note whether it is False or Unknown, and capture the reason message. False with "PLEG is not healthy" points to runtime or kubelet slowness; Unknown often points to API server or certificate issues.
- Verify kubelet process liveness. Use systemctl is-active kubelet and pgrep. If the process is missing, check dmesg and journalctl -k for OOM kills or panics.
- Probe the kubelet's /healthz endpoint. If the process is alive but the endpoint times out or returns non-200, the kubelet may be internally deadlocked or severely resource-starved. See the probe example after this list.
- Read kubelet logs. Search for "PLEG is not healthy", certificate errors, or runtime connection failures. This usually isolates the failure to one of the three domains.
- Test the container runtime directly. Run time crictl ps. If this hangs, the root cause is runtime slowness. If it is fast, the bottleneck is inside the kubelet.
- Inspect PLEG metrics. If crictl ps is fast but kubelet_pleg_relist_duration_seconds is high, investigate kubelet CPU starvation, excessive pod density, or a goroutine leak that prevents the relist goroutine from completing.
- Check certificate TTL. Inspect /var/lib/kubelet/pki/kubelet-client-current.pem. If it is expired or near expiry, look for pending CSRs and verify that kube-controller-manager is approving them.
- Test API server reachability. If the API server is unreachable, fix the network path first. If it is reachable but returns 401 Unauthorized, the issue is certificate-based.
- Check node resource pressure. Examine MemoryPressure, DiskPressure, and PIDPressure conditions. High pressure can slow or crash the kubelet.
- Capture diagnostics before restart. If you suspect a leak or hang, collect a goroutine dump before restarting, because the restart destroys the evidence.
curl -sk https://localhost:10250/debug/pprof/goroutine?debug=1 > /tmp/kubelet_goroutines.txt
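For the /healthz probe mentioned above, note that the kubelet serves it over plain HTTP on a separate localhost port (10248 by default, set by healthzPort), distinct from the authenticated 10250 API:

```bash
# Probe kubelet liveness; expect "ok" within a second or two
curl -s --max-time 5 http://localhost:10248/healthz; echo
# Cross-check with systemd: a running process plus a hung endpoint
# suggests an internal deadlock rather than a crash
systemctl show kubelet -p MainPID,ActiveState,SubState
```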
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kubelet_pleg_relist_duration_seconds | Predicts node NotReady before the hard 3-minute threshold | p99 above 10-30 seconds |
| Node Ready condition | Binary cluster-level health indicator | False or Unknown sustained > 1 minute |
| kubelet_runtime_operations_duration_seconds | Runtime slowness cascades to PLEG and pod lifecycle | p99 above 5 seconds for list operations |
| kubelet_certificate_manager_client_ttl_seconds | Silent failure until expiration causes total API auth loss | Below 7 days, or kubelet_certificate_manager_client_expiration_renew_errors > 0 |
| Kubelet RSS memory | OOM kills cause full reconciliation storms and brief outages | Sustained growth above 500MB without pod count increase |
| Kubelet goroutine count (go_goroutines) | Leading indicator of memory leaks and scheduler pressure | Above 500 or steady growth over days |
| rest_client_request_duration_seconds | Lease renewal and status updates depend on fast API writes | p99 above 5 seconds or 5xx errors |
| Node lease renewTime age | Direct measure of heartbeat freshness | Age approaching 40 seconds (default lease duration) |
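The duration metrics above are Prometheus histograms, so percentiles come from histogram_quantile() in your monitoring stack; at the node you can still compute a rough average from the raw _sum and _count series. A sketch (the 10250 endpoint may require a bearer token depending on your kubelet authentication settings):

```bash
# Rough average PLEG relist latency since kubelet start
# (for p99, use PromQL: histogram_quantile(0.99,
#  rate(kubelet_pleg_relist_duration_seconds_bucket[5m])))
curl -sk https://localhost:10250/metrics | awk '
  /^kubelet_pleg_relist_duration_seconds_sum/   {s = $2}
  /^kubelet_pleg_relist_duration_seconds_count/ {c = $2}
  END { if (c > 0) printf "avg relist: %.3fs over %s relists\n", s/c, c }'
```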
Fixes
If the cause is PLEG unhealthy or slow runtime
Restart containerd if it is hung. On systemd-managed nodes, systemctl restart containerd usually preserves existing containers via shims, but verify workload tolerance before restarting. If orphaned shims block state queries, drain the node and restart containerd; avoid manually killing shims, which can leave defunct containers. Reduce pod density or container churn if the node is overloaded. Investigate disk I/O saturation; high iowait slows runtime state reads.
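A cautious restart sequence, sketched for a containerd node (cordon first so a slow recovery does not attract new pods; <node-name> is a placeholder):

```bash
# Keep new pods off the node while the runtime recovers
kubectl cordon <node-name>
sudo systemctl restart containerd
# Existing containers should survive via their shims; verify before uncordoning
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
kubectl uncordon <node-name>
```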
If the cause is runtime disconnection
Check that the runtime process is running with systemctl status containerd. Verify the socket file exists at the path configured in --container-runtime-endpoint and that the kubelet has read-write access. If the runtime is running but unresponsive, restart it. Review runtime logs with journalctl -u containerd for crashes, thin pool exhaustion, or snapshotter errors.
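To verify the kubelet and the runtime agree on the socket, compare the configured endpoint against what actually exists on disk. A sketch, assuming a containerd node (on newer kubelets the endpoint lives in the config file as containerRuntimeEndpoint rather than a CLI flag):

```bash
# What endpoint is the kubelet configured to use?
ps -o args= -p "$(pgrep -x kubelet)" | tr ' ' '\n' | grep container-runtime-endpoint
grep containerRuntimeEndpoint /var/lib/kubelet/config.yaml 2>/dev/null
# Does the socket actually exist, and is the runtime healthy?
ls -l /run/containerd/containerd.sock
systemctl status containerd --no-pager
journalctl -u containerd --since "1 hour ago" | grep -iE "error|panic|snapshotter"
```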
If the cause is certificate expiration
Check for pending CSRs with kubectl get csr and approve them manually if necessary: kubectl certificate approve <csr-name>. Ensure kube-controller-manager is running and the CSR auto-approval mechanism is functional. If the certificate expired while the API server was unreachable, restore connectivity first.
If the kubelet uses a bootstrap kubeconfig, restarting it may trigger a new bootstrap or renewal. Verify NTP synchronization on the node; clock skew causes certificate validation failures.
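A minimal recovery sketch: list CSRs, then approve the pending kubelet ones after confirming the requestor is the node you expect:

```bash
kubectl get csr --sort-by=.metadata.creationTimestamp
# Approve a specific CSR after reviewing its requestor
kubectl certificate approve <csr-name>
# Or approve everything currently Pending (review the list first!)
kubectl get csr | awk '$NF == "Pending" {print $1}' | xargs -r kubectl certificate approve
```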
If the cause is kubelet OOM or resource starvation
Cordon the node to prevent new scheduling, then drain or delete excess pods to relieve pressure. Restarting the kubelet temporarily restores responsiveness, but expect a reconciliation spike in CRI calls and API requests when it returns. Increase --kube-reserved or --system-reserved to protect kubelet resources. If memory grows monotonically over days without a corresponding increase in pod count, capture a heap profile and plan an upgrade if a leak is confirmed in your version.
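On kubeadm nodes the reservations live in /var/lib/kubelet/config.yaml. A hypothetical sketch (the sizes are placeholders, and appending like this assumes kubeReserved/systemReserved are not already set in the file):

```bash
# If you suspect a leak, grab a heap profile first; a restart destroys the evidence
curl -sk https://localhost:10250/debug/pprof/heap > /tmp/kubelet_heap.pprof
# Reserve headroom for the kubelet and system daemons (values are examples)
sudo tee -a /var/lib/kubelet/config.yaml <<'EOF'
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
EOF
sudo systemctl restart kubelet
```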
If the cause is API server partition
Fix the network path or control plane issue. Do not restart kubelet blindly during an API server outage. The kubelet keeps existing pods alive while disconnected. Restarting it only adds a reconciliation thundering herd when connectivity returns. Focus on restoring API server health or load balancer connectivity.
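Once connectivity is restored, confirm recovery by watching the node's heartbeat lease age return to normal (run from a workstation that can reach the API server; <node-name> is a placeholder):

```bash
# renewTime should advance roughly every 10 seconds once the kubelet reconnects
watch -n 5 "kubectl get lease <node-name> -n kube-node-lease -o jsonpath='{.spec.renewTime}'"
```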
Prevention
- PLEG latency alerts. Alert on PLEG relist p99 well below the 3-minute hard limit. A threshold of 30-60 seconds gives time to investigate before the node goes NotReady.
- Scrape kubelet metrics. Collect kubelet_pleg_relist_duration_seconds, kubelet_runtime_operations_duration_seconds, kubelet_certificate_manager_client_ttl_seconds, and go_goroutines from the authenticated metrics endpoint.
- Certificate rotation verification. Do not assume auto-rotation works. Monitor certificate TTL and pending CSRs proactively.
- Resource reservations. Set --kube-reserved and --system-reserved high enough to prevent the kubelet from being starved or OOM-killed by workloads.
- Runtime health checks. Monitor container runtime latency independently of the kubelet. A periodic time crictl ps or runtime-specific metrics catch runtime degradation early; see the probe sketch after this list.
- Disk and inode monitoring. Keep node filesystem usage below 70% and monitor inode consumption separately. Configure container log rotation and image GC thresholds aggressively enough to avoid DiskPressure.
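A hypothetical periodic probe along those lines (endpoint and threshold are placeholders; run it from cron or a systemd timer):

```bash
#!/usr/bin/env bash
# Hypothetical runtime latency probe: log a warning when crictl ps is slow
ENDPOINT="unix:///run/containerd/containerd.sock"
BUDGET_MS=2000
start=$(date +%s%3N)   # GNU date, millisecond resolution
if ! timeout 10 crictl --runtime-endpoint "$ENDPOINT" ps >/dev/null 2>&1; then
  logger -t runtime-probe "crictl ps failed or timed out"
  exit 1
fi
elapsed=$(( $(date +%s%3N) - start ))
if (( elapsed > BUDGET_MS )); then
  logger -t runtime-probe "crictl ps took ${elapsed}ms (budget ${BUDGET_MS}ms)"
fi
```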
How Netdata helps
- Correlate node iowait and disk latency with PLEG relist duration spikes to confirm runtime storage bottlenecks.
- Track kubelet and containerd process memory and CPU usage to detect resource starvation or leaks before OOM kills.
- Monitor system memory, swap, and OOM killer events alongside kubelet eviction metrics to distinguish node pressure from application limits.
- Visualize API server connectivity and network latency from the node to identify partition events.
- Alert on certificate expiration windows by monitoring file ages or integrating with kubelet metrics.