Kubernetes node NotReady: kubelet, runtime, and network diagnosis
When a Kubernetes node becomes NotReady, existing containers usually keep running, but the cluster stops scheduling new pods, removes endpoints from Services, and eventually evicts workloads after the pod eviction timeout. Root causes fall into three domains: kubelet health, container runtime responsiveness, and CNI or control plane connectivity.
What this means
Kubernetes marks a node NotReady when the kubelet Ready condition is False, or when the node controller has not received a heartbeat within --node-monitor-grace-period (default 40 seconds). The node receives the node.kubernetes.io/not-ready:NoSchedule taint. If the condition persists longer than the pod eviction timeout (default 5 minutes), the controller manager marks pods on the node for rescheduling.
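A quick way to confirm the taint is present is to read it off the node object directly; `<node-name>` is a placeholder:

```bash
# Show taints on the node; a NotReady node typically carries
# node.kubernetes.io/not-ready (NoSchedule, and NoExecute once eviction starts)
kubectl get node <node-name> -o jsonpath='{range .spec.taints[*]}{.key}{"="}{.effect}{"\n"}{end}'
```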
The Ready condition aggregates sub-conditions: MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable. If any pressure condition is True, the node may reject new pods or transition to NotReady even when the kubelet process is still running. NetworkUnavailable, typically set when the CNI plugin has not finished configuring the node’s pod network, also blocks readiness.
A key distinction is whether the node reports itself NotReady (kubelet knows it is unhealthy) or the status shows Unknown (the control plane has not heard from the kubelet). The former points to kubelet-internal, runtime, or resource issues. The latter points to network partitions, certificate failures, or kubelet process death.
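One way to make that distinction quickly is to print every condition with its status and reason; a sketch using a placeholder node name:

```bash
# Ready=False with a kubelet-set reason means the node reported the problem itself;
# Ready=Unknown with reason NodeStatusUnknown means heartbeats stopped arriving
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```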
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| PLEG unhealthy or runtime stall | Ready=False with message “PLEG is not healthy”; crictl ps is slow | time crictl ps and kubelet CPU/disk I/O |
| Container runtime down | crictl info fails or hangs; runtime process missing | systemctl status containerd and the runtime socket |
| CNI not initialized | NetworkUnavailable=True; pods stuck in ContainerCreating | CNI DaemonSet pods and /etc/cni/net.d/ |
| Resource pressure | MemoryPressure/DiskPressure/PIDPressure=True; evictions active | free -m, df -h, dmesg, and /proc/sys/kernel/pid_max |
| API server or certificate failure | Node Unknown; kubelet logs show connection or cert errors | API connectivity from node and certificate expiry |
| Kubelet OOM or crash | Kubelet process missing; dmesg shows OOM kill | systemctl is-active kubelet and pgrep kubelet |
Quick checks
```bash
# Inspect node conditions and the Reason/Message fields
kubectl describe node <node-name>

# Check Ready status programmatically
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

# Verify the kubelet process is running
systemctl is-active kubelet
pgrep -x kubelet

# Read recent kubelet logs for PLEG, runtime, or certificate errors
journalctl -u kubelet -n 200

# Test container runtime responsiveness directly
time crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps

# Check runtime process status
systemctl status containerd
pgrep -x containerd

# Inspect CNI configuration on the node
ls -la /etc/cni/net.d/

# Check CNI plugin pods scheduled to the node
kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name>

# Check kubelet client certificate expiration
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# TODO: verify certificate path on non-kubeadm distributions

# Check for pending certificate signing requests
kubectl get csr | grep Pending

# Check node pressure conditions from the API
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'

# Quick node resource inspection
free -m
df -h /var/lib/kubelet /var/lib/containerd
cat /proc/sys/kernel/pid_max
ls -d /proc/[0-9]* | wc -l
```
How to diagnose it
1. Run `kubectl describe node <node-name>`. The Reason and Message fields in the Ready condition reveal whether the kubelet reported itself unhealthy or the controller lost contact. PLEG messages point to the runtime, NetworkUnavailable points to CNI, and pressure conditions point to resources. A silent Unknown status points to connectivity or process failure.
2. Verify the kubelet is running with `systemctl is-active kubelet` and `pgrep -x kubelet`. If the process is missing, check `dmesg` and `journalctl -k` for OOM kills. Investigate systemd failures or crash loops before proceeding.
3. Read `journalctl -u kubelet -n 500` for the specific failure. Search for `PLEG is not healthy`, `Container runtime network not ready`, `connection refused`, `certificate has expired`, or eviction manager threshold breaches. Correlate the message with the runtime, network, resource, or control plane path.
4. Test container runtime responsiveness directly with `time crictl ps`. If it hangs or takes more than a few seconds, the runtime is the bottleneck: check `journalctl -u containerd` for errors and count shim processes with `pgrep -c containerd-shim`. A large shim count relative to running pods suggests leaked shims. A fast response points to kubelet CPU starvation or an internal goroutine leak rather than a runtime issue.
5. Validate CNI health. Confirm the CNI DaemonSet pod is Running on the node; if it is missing, check whether it was evicted. Check that `/etc/cni/net.d/` contains a valid configuration and that the node has an allocated pod CIDR.
6. For Unknown status, test API connectivity from the node. Check certificate expiry with `openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates` and look for pending CSRs with `kubectl get csr`. Expired certificates or unapproved CSRs block kubelet authentication. If the API server is unreachable, investigate network paths, DNS, and load balancers.
7. Check resource pressure. Inspect `MemAvailable` in `free -m`, nodefs and imagefs with `df -h`, and PID usage against `/proc/sys/kernel/pid_max`. Review `dmesg` for OOM kills that struck before kubelet eviction. Identify top consumers and reclaim resources, or cordon the node to prevent new scheduling.
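The steps above can be collapsed into a first-pass triage script. This is a minimal sketch, assuming containerd, kubectl access from the node, and a kubeadm-style certificate path; adjust paths and patterns for your distribution:

```bash
#!/usr/bin/env bash
# First-pass NotReady triage; run on or against the affected node
set -u
node="${1:?usage: $0 <node-name>}"

echo "== Ready condition =="
kubectl get node "$node" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" / "}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}'

echo "== kubelet process =="
systemctl is-active kubelet || true

echo "== runtime responsiveness (a slow 'crictl ps' implicates the runtime) =="
time crictl ps >/dev/null

echo "== recent kubelet errors =="
journalctl -u kubelet -n 200 --no-pager \
  | grep -Ei 'pleg|network not ready|connection refused|certificate|eviction' | tail -n 20

echo "== kubelet client certificate expiry =="
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate
```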
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node Ready condition | Primary cluster-visible health indicator | False or Unknown sustained for more than 1 minute |
| PLEG relist duration | Leading indicator before NotReady; hard failure at 3 minutes | p99 sustained above 10-30 seconds |
| Container runtime operation latency | Foundation for PLEG and pod lifecycle; slow runtime cascades to NotReady | list_containers or list_podsandbox p99 above 1 second |
| Node pressure conditions | Trigger eviction and can force NotReady | Any MemoryPressure, DiskPressure, or PIDPressure True |
| Pod evictions | Active shedding of workload due to resource pressure | Any eviction event on production nodes |
| Kubelet certificate TTL | Silent lockout when expired | Less than 7 days remaining or non-zero rotation errors |
| API server request latency from kubelet | Affects lease renewal; high latency causes Unknown status | p99 above 5 seconds or lease renewal gaps |
| NetworkUnavailable condition | CNI plugin has not initialized pod networking | True for more than 30 seconds after node boot |
| Kubelet sync loop duration | Reflects kubelet’s ability to reconcile state | Sustained above 30 seconds |
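Several of these signals are exposed on the kubelet's Prometheus endpoint. One way to spot-check PLEG latency without a full monitoring stack, assuming your RBAC role allows the `nodes/proxy` subresource:

```bash
# Fetch PLEG relist duration histogram buckets through the API server proxy
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" \
  | grep kubelet_pleg_relist_duration_seconds
```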
Fixes
If the cause is PLEG or runtime slowness
If `time crictl ps` is slow, the runtime is the bottleneck. Inspect `journalctl -u containerd` for errors. Count shim processes: a high count relative to running pods suggests leaked shims. High disk I/O wait also stalls runtime operations. Restarting the runtime service may recover the node, but expect brief disruption to pod operations. Reduce pod density if the node is overloaded, and review garbage collection settings for exited containers.
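A sketch of those runtime-side checks; the comparison is illustrative, `iostat` requires the sysstat package, and the restart should be treated as a disruptive last step:

```bash
# Compare shim processes against running pod sandboxes; a large gap
# suggests leaked shims
echo "shims: $(pgrep -c containerd-shim), ready sandboxes: $(crictl pods --state ready -q | wc -l)"

# High I/O wait stalls runtime operations
iostat -x 1 3

# Last resort: restart the runtime; containers keep running, but expect a
# brief window of failed pod create/delete operations
systemctl restart containerd
```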
If the cause is CNI or network initialization
Ensure the CNI DaemonSet pod is Running on the affected node and has not been evicted. Verify `/etc/cni/net.d/` contains a valid configuration and CNI binaries exist in `/opt/cni/bin/`. Check kubelet logs for `FailedCreatePodSandBox` errors that name a specific CNI plugin. If the node lacks a pod CIDR, check the cluster IPAM and node controller for `CIDRNotAvailable` events. Restarting the CNI pod is often safe and can clear transient initialization failures.
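A sketch of those CNI checks; the pod name is a placeholder:

```bash
# Confirm the node was allocated a pod CIDR
kubectl get node <node-name> -o jsonpath='{.spec.podCIDR}{"\n"}'

# Verify a CNI config and the plugin binaries exist on the node
ls /etc/cni/net.d/ /opt/cni/bin/

# Recreate the CNI pod via its DaemonSet; often clears transient init failures
kubectl -n kube-system delete pod <cni-pod-name>
```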
If the cause is resource pressure
For MemoryPressure, identify top consumers with `ps aux --sort=-%mem` and evict or reschedule them. For DiskPressure, inspect container logs and writable layer storage under `/var/lib/containerd` and `/var/lib/kubelet`. Run `crictl rmi` to remove unused images if imagefs is full. Ensure log rotation is configured via `containerLogMaxSize` and `containerLogMaxFiles` so that container logs do not exhaust nodefs. For PIDPressure, raise `pid_max` if it is too low for the workload density, and enforce per-pod PID limits through kubelet configuration to prevent a single workload from exhausting the node.
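A sketch of the reclamation commands, assuming containerd's default state directories:

```bash
# Top memory consumers
ps aux --sort=-%mem | head -n 15

# See what is filling the kubelet and runtime filesystems
du -xsh /var/lib/containerd /var/lib/kubelet 2>/dev/null

# Remove images not referenced by any container to free imagefs
crictl rmi --prune

# Compare live PID usage against the kernel limit
echo "pids: $(ls -d /proc/[0-9]* | wc -l) / max: $(cat /proc/sys/kernel/pid_max)"
```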
If the cause is API server connectivity or certificates
Fix network paths between the node and API server, including DNS resolution and load balancer health. If the kubelet client certificate is expired, check for pending CSRs with `kubectl get csr` and approve them. Restarting kubelet can force an immediate rotation attempt. Verify NTP synchronization; clock skew causes certificate validation failures even with valid certificates.
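A sketch of the certificate-path recovery, with the CSR name as a placeholder:

```bash
# Approve any pending CSRs that block kubelet certificate rotation
kubectl get csr
kubectl certificate approve <csr-name>

# Clock skew breaks certificate validation even with valid certs
timedatectl status

# Force an immediate rotation attempt
systemctl restart kubelet
```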
If the cause is kubelet crash or OOM
If `dmesg` shows the kubelet was OOM killed, the node is oversubscribed or kubelet has a memory leak. Reduce workload density or plan an upgrade if a leak is known for your version. Capture a goroutine dump and heap profile from the kubelet debug endpoint before restarting to confirm a leak.
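If the kubelet's debugging handlers are enabled, the profiles can be pulled through the API server proxy; a sketch, assuming `nodes/proxy` access:

```bash
# Capture goroutine and heap profiles before restarting the kubelet
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/debug/pprof/goroutine?debug=2" > kubelet-goroutines.txt
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/debug/pprof/heap" > kubelet-heap.pprof
```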
Prevention
- Alert on PLEG relist duration trending above 10 seconds. This indicates runtime or kubelet stress long before the 3-minute hard deadline forces NotReady.
- Monitor kubelet client certificate TTL at 30, 7, and 1 day to catch silent rotation failures before they cause Unknown status.
- Configure `containerLogMaxSize`, `containerLogMaxFiles`, and image GC thresholds before disk pressure hits.
- Require resource requests and limits on production workloads to prevent overcommit that leads to memory and PID exhaustion.
- Monitor the container runtime independently of kubelet health by timing `crictl info` from a node-level check (see the sketch after this list).
- Ensure CNI DaemonSets have tolerations and resource guarantees so they are not evicted before workload pods.
- Set `pid_max` to a production-appropriate value and apply per-pod PID limits to contain fork-heavy workloads.
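A minimal node-level probe along those lines, suitable for cron or a systemd timer; the five-second timeout, seven-day window, and certificate path are assumptions to tune per environment:

```bash
#!/usr/bin/env bash
# Node health probe: runtime responsiveness and kubelet certificate TTL
set -u

# Warn if the runtime cannot answer within 5 seconds
timeout 5 crictl info >/dev/null 2>&1 \
  || echo "WARN: container runtime unresponsive after 5s"

# Warn when less than 7 days (604800 s) of certificate validity remain
cert=/var/lib/kubelet/pki/kubelet-client-current.pem
if [ -f "$cert" ]; then
  openssl x509 -in "$cert" -noout -checkend 604800 \
    || echo "WARN: kubelet client certificate expires within 7 days"
fi
```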
How Netdata helps
Netdata collects kubelet PLEG latency alongside node disk I/O wait and container runtime CPU usage to distinguish runtime slowness from kubelet starvation. Node pressure conditions, eviction events, and OOM kills share a timeline to show whether resource saturation preceded the NotReady transition. Kubelet process CPU and memory usage tracked against pod counts help identify capacity-driven degradation before sync loops stall. Certificate TTL and API server connectivity latency from the node perspective provide early warning for Unknown-state transitions. Container runtime and CNI-related signals can be viewed alongside network traffic and DNS latency to isolate initialization failures.
Related guides
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes kubelet not responding: PLEG, runtime, and certificate issues
- Kubernetes monitoring checklist: the signals every production cluster needs
- Kubernetes pod CrashLoopBackOff: causes, diagnosis, and fixes
- Kubernetes pod ImagePullBackOff: registry, auth, and network diagnosis