Kubernetes node NotReady: kubelet, runtime, and network diagnosis
When a Kubernetes node becomes NotReady, existing containers usually keep running, but the cluster stops scheduling new pods, removes endpoints from Services, and eventually evicts workloads after the pod eviction timeout. Root causes fall into three domains: kubelet health, container runtime responsiveness, and CNI or control plane connectivity.
What this means
Kubernetes marks a node NotReady when the kubelet Ready condition is False, or when the node controller has not received a heartbeat within --node-monitor-grace-period (default 40 seconds). The node receives the node.kubernetes.io/not-ready:NoSchedule taint. If the condition persists longer than the pod eviction timeout (default 5 minutes), the controller manager marks pods on the node for rescheduling.
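A quick way to confirm the taint is present is to read it off the node object directly; `<node-name>` is a placeholder:

```bash
# Show taints on the node; a NotReady node typically carries
# node.kubernetes.io/not-ready (NoSchedule, and NoExecute once eviction starts)
kubectl get node <node-name> -o jsonpath='{range .spec.taints[*]}{.key}{"="}{.effect}{"\n"}{end}'
```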
The Ready condition aggregates sub-conditions: MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable. If any pressure condition is True, the node may reject new pods or transition to NotReady even when the kubelet process is still running. NetworkUnavailable, typically set when the CNI plugin has not finished configuring the node’s pod network, also blocks readiness.
A key distinction is whether the node reports itself NotReady (kubelet knows it is unhealthy) or the status shows Unknown (the control plane has not heard from the kubelet). The former points to kubelet-internal, runtime, or resource issues. The latter points to network partitions, certificate failures, or kubelet process death.
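One way to make that distinction quickly is to print every condition with its status and reason; a sketch using a placeholder node name:

```bash
# Ready=False with a kubelet-set reason means the node reported the problem itself;
# Ready=Unknown with reason NodeStatusUnknown means heartbeats stopped arriving
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```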
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| PLEG unhealthy or runtime stall | Ready=False with message “PLEG is not healthy”; crictl ps is slow | time crictl ps and kubelet CPU/disk I/O |
| Container runtime down | crictl info fails or hangs; runtime process missing | systemctl status containerd and the runtime socket |
| CNI not initialized | NetworkUnavailable=True; pods stuck in ContainerCreating | CNI DaemonSet pods and /etc/cni/net.d/ |
| Resource pressure | MemoryPressure/DiskPressure/PIDPressure=True; evictions active | free -m, df -h, dmesg, and /proc/sys/kernel/pid_max |
| API server or certificate failure | Node Unknown; kubelet logs show connection or cert errors | API connectivity from node and certificate expiry |
| Kubelet OOM or crash | Kubelet process missing; dmesg shows OOM kill | systemctl is-active kubelet and pgrep kubelet |
Quick checks
```bash
# Inspect node conditions and the Reason/Message fields
kubectl describe node <node-name>

# Check Ready status programmatically
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

# Verify the kubelet process is running
systemctl is-active kubelet
pgrep -x kubelet

# Read recent kubelet logs for PLEG, runtime, or certificate errors
journalctl -u kubelet -n 200

# Test container runtime responsiveness directly
time crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps

# Check runtime process status
systemctl status containerd
pgrep -x containerd

# Inspect CNI configuration on the node
ls -la /etc/cni/net.d/

# Check CNI plugin pods scheduled to the node
kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name>

# Check kubelet client certificate expiration
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# TODO: verify certificate path on non-kubeadm distributions

# Check for pending certificate signing requests
kubectl get csr | grep Pending

# Check node pressure conditions from the API
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'

# Quick node resource inspection
free -m
df -h /var/lib/kubelet /var/lib/containerd
cat /proc/sys/kernel/pid_max
ls -d /proc/[0-9]* | wc -l
```
How to diagnose it
1. Run `kubectl describe node <node-name>`. The Reason and Message fields in the Ready condition reveal whether the kubelet reported itself unhealthy or the controller lost contact. PLEG messages point to the runtime, NetworkUnavailable points to CNI, and pressure conditions point to resources. A silent Unknown status points to connectivity or process failure.
2. Verify the kubelet is running with `systemctl is-active kubelet` and `pgrep -x kubelet`. If the process is missing, check `dmesg` and `journalctl -k` for OOM kills. Investigate systemd failures or crash loops before proceeding.
3. Read `journalctl -u kubelet -n 500` for the specific failure. Search for `PLEG is not healthy`, `Container runtime network not ready`, `connection refused`, `certificate has expired`, or eviction manager threshold breaches. Correlate the message with the runtime, network, resource, or control plane path.
4. Test container runtime responsiveness directly with `time crictl ps`. If it hangs or takes more than a few seconds, the runtime is the bottleneck: check `journalctl -u containerd` for errors and count shim processes with `pgrep -c containerd-shim`. A large shim count relative to running pods suggests leaked shims. A fast response points to kubelet CPU starvation or an internal goroutine leak rather than a runtime issue.
5. Validate CNI health. Confirm the CNI DaemonSet pod is Running on the node; if it is missing, check whether it was evicted. Check that `/etc/cni/net.d/` contains a valid configuration and that the node has an allocated pod CIDR.
6. For Unknown status, test API connectivity from the node. Check certificate expiry with `openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates` and look for pending CSRs with `kubectl get csr`. Expired certificates or unapproved CSRs block kubelet authentication. If the API server is unreachable, investigate network paths, DNS, and load balancers.
7. Check resource pressure. Inspect `MemAvailable` in `free -m`, nodefs and imagefs with `df -h`, and PID usage against `/proc/sys/kernel/pid_max`. Review `dmesg` for OOM kills that struck before kubelet eviction. Identify top consumers and reclaim resources, or cordon the node to prevent new scheduling.
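The steps above can be collapsed into a first-pass triage script. This is a minimal sketch, assuming containerd, kubectl access from the node, and a kubeadm-style certificate path; adjust paths and patterns for your distribution:

```bash
#!/usr/bin/env bash
# First-pass NotReady triage; run on or against the affected node
set -u
node="${1:?usage: $0 <node-name>}"

echo "== Ready condition =="
kubectl get node "$node" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" / "}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}'

echo "== kubelet process =="
systemctl is-active kubelet || true

echo "== runtime responsiveness (a slow 'crictl ps' implicates the runtime) =="
time crictl ps >/dev/null

echo "== recent kubelet errors =="
journalctl -u kubelet -n 200 --no-pager \
  | grep -Ei 'pleg|network not ready|connection refused|certificate|eviction' | tail -n 20

echo "== kubelet client certificate expiry =="
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate
```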
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node Ready condition | Primary cluster-visible health indicator | False or Unknown sustained for more than 1 minute |
| PLEG relist duration | Leading indicator before NotReady; hard failure at 3 minutes | p99 sustained above 10-30 seconds |
| Container runtime operation latency | Foundation for PLEG and pod lifecycle; slow runtime cascades to NotReady | list_containers or list_podsandbox p99 above 1 second |
| Node pressure conditions | Trigger eviction and can force NotReady | Any MemoryPressure, DiskPressure, or PIDPressure True |
| Pod evictions | Active shedding of workload due to resource pressure | Any eviction event on production nodes |
| Kubelet certificate TTL | Silent lockout when expired | Less than 7 days remaining or non-zero rotation errors |
| API server request latency from kubelet | Affects lease renewal; high latency causes Unknown status | p99 above 5 seconds or lease renewal gaps |
| NetworkUnavailable condition | CNI plugin has not initialized pod networking | True for more than 30 seconds after node boot |
| Kubelet sync loop duration | Reflects kubelet’s ability to reconcile state | Sustained above 30 seconds |
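Several of these signals are exposed on the kubelet's Prometheus endpoint. One way to spot-check PLEG latency without a full monitoring stack, assuming your RBAC role allows the `nodes/proxy` subresource:

```bash
# Fetch PLEG relist duration histogram buckets through the API server proxy
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" \
  | grep kubelet_pleg_relist_duration_seconds
```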
Fixes
If the cause is PLEG or runtime slowness
If `time crictl ps` is slow, the runtime is the bottleneck. Inspect `journalctl -u containerd` for errors. Count shim processes: a high count relative to running pods suggests leaked shims. High disk I/O wait also stalls runtime operations. Restarting the runtime service may recover the node, but expect brief disruption to pod operations. Reduce pod density if the node is overloaded, and review garbage collection settings for exited containers.
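A sketch of those runtime-side checks; the comparison is illustrative, `iostat` requires the sysstat package, and the restart should be treated as a disruptive last step:

```bash
# Compare shim processes against running pod sandboxes; a large gap
# suggests leaked shims
echo "shims: $(pgrep -c containerd-shim), ready sandboxes: $(crictl pods --state ready -q | wc -l)"

# High I/O wait stalls runtime operations
iostat -x 1 3

# Last resort: restart the runtime; containers keep running, but expect a
# brief window of failed pod create/delete operations
systemctl restart containerd
```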
If the cause is CNI or network initialization
Ensure the CNI DaemonSet pod is Running on the affected node and has not been evicted. Verify `/etc/cni/net.d/` contains a valid configuration and CNI binaries exist in `/opt/cni/bin/`. Check kubelet logs for `FailedCreatePodSandBox` errors that name a specific CNI plugin. If the node lacks a pod CIDR, check the cluster IPAM and node controller for `CIDRNotAvailable` events. Restarting the CNI pod is often safe and can clear transient initialization failures.
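A sketch of those CNI checks; the pod name is a placeholder:

```bash
# Confirm the node was allocated a pod CIDR
kubectl get node <node-name> -o jsonpath='{.spec.podCIDR}{"\n"}'

# Verify a CNI config and the plugin binaries exist on the node
ls /etc/cni/net.d/ /opt/cni/bin/

# Recreate the CNI pod via its DaemonSet; often clears transient init failures
kubectl -n kube-system delete pod <cni-pod-name>
```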
If the cause is resource pressure
For MemoryPressure, identify top consumers with `ps aux --sort=-%mem` and evict or reschedule them. For DiskPressure, inspect container logs and writable layer storage under `/var/lib/containerd` and `/var/lib/kubelet`. Run `crictl rmi` to remove unused images if imagefs is full. Ensure log rotation is configured via `containerLogMaxSize` and `containerLogMaxFiles` so that container logs do not exhaust nodefs. For PIDPressure, raise `pid_max` if it is too low for the workload density, and enforce per-pod PID limits through kubelet configuration to prevent a single workload from exhausting the node.
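A sketch of the reclamation commands, assuming containerd's default state directories:

```bash
# Top memory consumers
ps aux --sort=-%mem | head -n 15

# See what is filling the kubelet and runtime filesystems
du -xsh /var/lib/containerd /var/lib/kubelet 2>/dev/null

# Remove images not referenced by any container to free imagefs
crictl rmi --prune

# Compare live PID usage against the kernel limit
echo "pids: $(ls -d /proc/[0-9]* | wc -l) / max: $(cat /proc/sys/kernel/pid_max)"
```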
If the cause is API server connectivity or certificates
Fix network paths between the node and API server, including DNS resolution and load balancer health. If the kubelet client certificate is expired, check for pending CSRs with `kubectl get csr` and approve them. Restarting kubelet can force an immediate rotation attempt. Verify NTP synchronization; clock skew causes certificate validation failures even with valid certificates.
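A sketch of the certificate-path recovery, with the CSR name as a placeholder:

```bash
# Approve any pending CSRs that block kubelet certificate rotation
kubectl get csr
kubectl certificate approve <csr-name>

# Clock skew breaks certificate validation even with valid certs
timedatectl status

# Force an immediate rotation attempt
systemctl restart kubelet
```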
If the cause is kubelet crash or OOM
If `dmesg` shows the kubelet was OOM killed, the node is oversubscribed or kubelet has a memory leak. Reduce workload density or plan an upgrade if a leak is known for your version. Capture a goroutine dump and heap profile from the kubelet debug endpoint before restarting to confirm a leak.
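If the kubelet's debugging handlers are enabled, the profiles can be pulled through the API server proxy; a sketch, assuming `nodes/proxy` access:

```bash
# Capture goroutine and heap profiles before restarting the kubelet
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/debug/pprof/goroutine?debug=2" > kubelet-goroutines.txt
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/debug/pprof/heap" > kubelet-heap.pprof
```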
Prevention
- Alert on PLEG relist duration trending above 10 seconds. This indicates runtime or kubelet stress long before the 3-minute hard deadline forces NotReady.
- Monitor kubelet client certificate TTL at 30, 7, and 1 day to catch silent rotation failures before they cause Unknown status.
- Configure `containerLogMaxSize`, `containerLogMaxFiles`, and image GC thresholds before disk pressure hits.
- Require resource requests and limits on production workloads to prevent overcommit that leads to memory and PID exhaustion.
- Monitor the container runtime independently of kubelet health by timing `crictl info` from a node-level check (see the sketch after this list).
- Ensure CNI DaemonSets have tolerations and resource guarantees so they are not evicted before workload pods.
- Set `pid_max` to a production-appropriate value and apply per-pod PID limits to contain fork-heavy workloads.
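A minimal node-level probe along those lines, suitable for cron or a systemd timer; the five-second timeout, seven-day window, and certificate path are assumptions to tune per environment:

```bash
#!/usr/bin/env bash
# Node health probe: runtime responsiveness and kubelet certificate TTL
set -u

# Warn if the runtime cannot answer within 5 seconds
timeout 5 crictl info >/dev/null 2>&1 \
  || echo "WARN: container runtime unresponsive after 5s"

# Warn when less than 7 days (604800 s) of certificate validity remain
cert=/var/lib/kubelet/pki/kubelet-client-current.pem
if [ -f "$cert" ]; then
  openssl x509 -in "$cert" -noout -checkend 604800 \
    || echo "WARN: kubelet client certificate expires within 7 days"
fi
```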
How Netdata helps
Netdata collects kubelet PLEG latency alongside node disk I/O wait and container runtime CPU usage to distinguish runtime slowness from kubelet starvation. Node pressure conditions, eviction events, and OOM kills share a timeline to show whether resource saturation preceded the NotReady transition. Kubelet process CPU and memory usage tracked against pod counts help identify capacity-driven degradation before sync loops stall. Certificate TTL and API server connectivity latency from the node perspective provide early warning for Unknown-state transitions. Container runtime and CNI-related signals can be viewed alongside network traffic and DNS latency to isolate initialization failures.
Related guides
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes kubelet not responding: PLEG, runtime, and certificate issues
- Kubernetes monitoring checklist: the signals every production cluster needs
- Kubernetes pod CrashLoopBackOff: causes, diagnosis, and fixes
- Kubernetes pod ImagePullBackOff: registry, auth, and network diagnosis