Kubernetes PLEG is not healthy: runtime stalls and node degradation

A node suddenly flips to NotReady with the message “PLEG is not healthy.” Containers on the node keep running, but the control plane evicts workloads and reschedules them elsewhere. New pods cannot start, and existing pods run without health checks or status updates. This is one of the most common kubelet failure modes in production.

The Pod Lifecycle Event Generator (PLEG) is the kubelet subsystem that polls the container runtime every second and emits events when containers start, stop, or change state. When the runtime becomes slow or unresponsive, the PLEG relist loop stalls. If the elapsed time since the last successful relist exceeds three minutes, kubelet declares PLEG unhealthy, marks the node NotReady, and skips pod synchronization.

A PLEG stall does not kill running containers. The node-level data plane often continues working, but the control plane treats the node as dead. Service endpoints are removed. The eviction manager may begin terminating pods. The real damage is control-plane isolation, not container runtime failure.

flowchart TD
    A[GenericPLEG relist fires every 1s] --> B{CRI responds?}
    B -->|Slow or blocked| C[Relist duration climbs]
    C --> D[Exceeds 3m threshold]
    D --> E[Kubelet marks PLEG unhealthy]
    E --> F[Node Ready becomes False]
    E --> G[Pod sync skipped]
    F --> H[Controllers evict pods]
    G --> I[New pods stall]
    B -->|Fast| J[Normal operation]

What this means

GenericPLEG reconciles container runtime state with kubelet’s internal view by calling the container runtime interface (CRI) to list all containers and pod sandboxes. By default, this relist fires every second. Each relist iterates over the returned state, computes deltas against the previous snapshot, and generates lifecycle events. These events feed kubelet’s main sync loop, which decides whether to start, stop, or restart containers.
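The delta step can be pictured as a diff between two snapshots. The following is only a toy sketch: the snapshot format (one `<container-id> <state>` pair per line), the function name `relist_diff`, and the event labels are hypothetical and are not kubelet's internal types or event names.

```shell
# Toy relist: compare the previous and current snapshots and print one
# line per difference. Event labels are illustrative only.
relist_diff() {
  # $1 = previous snapshot file, $2 = current snapshot file
  awk '
    FILENAME == ARGV[1] { old[$1] = $2; next }   # load previous snapshot
    {
      cur[$1] = $2
      if (!($1 in old))        print $1, "appeared", $2
      else if (old[$1] != $2)  print $1, "changed", old[$1] "->" $2
    }
    END {
      # anything in the old snapshot that is now missing was removed
      for (id in old) if (!(id in cur)) print id, "removed"
    }
  ' "$1" "$2"
}
```

Every relist has to walk the full container list to produce this diff, which is why a slow list call from the runtime delays every event behind it.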

The health check is hardcoded to a three-minute threshold. Kubelet calls Healthy() every ten seconds. If the time since the last successful relist exceeds three minutes, kubelet logs an error similar to PLEG is not healthy: pleg was last seen active Xs ago; threshold is 3m0s. It then sets the node Ready condition to False and enters a backoff loop, skipping pod synchronization until PLEG recovers.
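To see how far past the threshold a node has drifted, the elapsed time and threshold can be pulled out of that log message. This is a minimal sketch that assumes the message text quoted above; the function name `pleg_last_seen` is hypothetical.

```shell
# Print "<elapsed> <threshold>" for each PLEG health error read on stdin.
# Lines that do not contain the message produce no output.
pleg_last_seen() {
  sed -n 's/.*pleg was last seen active \([0-9a-z.]*\) ago; threshold is \([0-9a-z.]*\).*/\1 \2/p'
}

# Usage on a node (assumes kubelet runs as a systemd unit named "kubelet"):
#   journalctl -u kubelet --since "10 minutes ago" | pleg_last_seen
```

A steadily growing first column across successive lines suggests the relist loop is still stuck rather than recovering.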

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Container runtime I/O pressure or overload | crictl ps takes seconds or hangs; high node disk I/O wait | Runtime process CPU and disk metrics |
| Serial updateCache() bottleneck at scale | PLEG degrades only during mass pod churn (rollouts, scale-out); p99 relist climbs with pod count | Number of simultaneous pod changes |
| Hanging mount or container inspection stall | A single pod with an NFS hard mount or stuck volume blocks the per-call timeout chain; consecutive stalled calls can accumulate past the 3m PLEG threshold | mount output for hung NFS; kubelet logs for TaskExit or DEADLINE_EXCEEDED |
| Container runtime bug | PLEG recovers only after runtime restart; correlates with a specific runtime version | Runtime version and bug trackers |
| Kubelet resource starvation | PLEG relist duration trends up as node CPU or memory pressure rises; no single runtime call is obviously slow | Node CPU/memory pressure conditions and kubelet cgroup usage |

Quick checks

Run these from the affected node or from a bastion with cluster access. Prefer read-only commands first.

# Check node Ready condition and reason (Ready is usually the last condition listed)
kubectl describe node <node-name> | grep -A 8 "Conditions:"

# Check PLEG relist duration from kubelet metrics (authenticated port)
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep kubelet_pleg_relist_duration_seconds

# If the read-only port is enabled, you can also use:
curl -s http://localhost:10255/metrics | grep kubelet_pleg_relist_duration_seconds

# Time a direct runtime list to confirm if the runtime itself is slow.
# This requires crictl to be configured for the node's CRI endpoint.
time crictl ps
time crictl pods

# Check kubelet logs for the exact PLEG error
journalctl -u kubelet --since "10 minutes ago" | grep -iE "pleg is not healthy|pleg.*(unhealthy|timeout)"

# Check container runtime logs for hangs or errors
journalctl -u containerd --since "10 minutes ago" | grep -iE "error|timeout|deadline"
# or for CRI-O:
journalctl -u crio --since "10 minutes ago" | grep -iE "error|timeout|deadline"

# Look for hung mounts that could block container inspection
mount | grep nfs
# Check for processes stuck in uninterruptible sleep
ps aux | awk '$8 ~ /^D/'

# Check node resource pressure
kubectl describe node <node-name> | grep -E "MemoryPressure|DiskPressure|PIDPressure"

# Check kubelet memory and CPU usage
ps -C kubelet -o pid,rss,vsz,pcpu,comm

Destructive or disruptive: Restarting the container runtime or kubelet can abort in-flight container operations. Do not restart services until read-only checks confirm the bottleneck.
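When the runtime is truly hung, crictl itself may never return and can tie up your shell. A small wrapper with a hard time limit keeps the check read-only and bounded. This is a sketch that assumes GNU coreutils `timeout` is installed on the node; the function name `probe_runtime` and the 10-second limit are hypothetical choices.

```shell
# Run a command with a hard time limit and report how it ended.
# Relies on coreutils `timeout`, which exits with status 124 on expiry.
probe_runtime() {
  limit=$1
  shift
  rc=0
  timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
  case $rc in
    0)   echo "ok" ;;
    124) echo "timeout" ;;
    *)   echo "failed (exit $rc)" ;;
  esac
}

# Usage on a node (crictl must already point at the node's CRI endpoint):
#   probe_runtime 10 crictl ps
#   probe_runtime 10 crictl pods
```

A "timeout" result points at the runtime, while "failed" more often indicates a crictl configuration problem than a stall.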

How to diagnose it

  1. Confirm PLEG breached the threshold. Check kubectl describe node for Ready=False with a message mentioning PLEG. Verify in kubelet logs that the error text includes “PLEG is not healthy” and note the timestamp. If the node shows Ready=Unknown without a PLEG message, suspect API server connectivity loss instead.

  2. Verify container runtime responsiveness. On the node, run time crictl ps and time crictl pods. If these commands take more than a few seconds, the runtime is the likely bottleneck. PLEG is often the first victim of runtime slowdown because it queries the runtime more aggressively than any other component.

  3. Check PLEG metric trends. Inspect kubelet_pleg_relist_duration_seconds histogram percentiles. A healthy node typically shows p99 under two seconds. If p99 is climbing toward thirty seconds or higher, the node is on the path to a three-minute failure. Also check kubelet_pleg_relist_interval_seconds; intervals significantly above one second indicate that prior relists are still running.

  4. Identify the class of bottleneck.

    • If crictl ps is slow and node disk I/O wait is high, the runtime is under storage pressure.
    • If PLEG degrades only during deployment rollouts or HPA scale-out with hundreds of pods, the serial updateCache() work per changed pod is the likely cause.
    • If kubelet logs show container start/stop timeouts correlated with a specific pod, inspect that pod’s volume mounts. A single NFS hard mount that hangs during container inspection can block the entire relist.
    • If node CPU is saturated and kubelet itself is being throttled, the PLEG goroutine may not get enough scheduler time.

  5. Check for container runtime-specific bugs. Search runtime release notes for PLEG-related regressions. Both containerd and CRI-O have had releases with slow ListContainers under I/O pressure or event-handling bugs after init container exit.

  6. Determine correlation with scale events. Ask whether the failure coincided with a large Deployment rollout, a CronJob spike, or cluster autoscaling. PLEG unhealthy during pod scale-out but not at steady state strongly suggests the serial update cache bottleneck or a runtime event-processing limit.
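For step 3, when no Prometheus server is at hand, a rough p99 can be read straight from the raw histogram buckets. This is a sketch under stated assumptions: the input is Prometheus text exposition format with cumulative buckets in ascending `le` order, and the function name `pleg_p99_upper_bound` is hypothetical. The result is only the upper bound of the bucket containing the 99th percentile, and it covers every relist since kubelet started, not a recent window.

```shell
# Print the smallest bucket bound ("le") that covers 99% of all observed
# PLEG relist durations. Reads kubelet metrics text on stdin.
pleg_p99_upper_bound() {
  awk '
    index($1, "kubelet_pleg_relist_duration_seconds_bucket{") == 1 {
      le = $1
      sub(/^.*le="/, "", le)   # strip everything up to the bound
      sub(/".*$/, "", le)      # strip everything after the bound
      n++
      bound[n] = le
      cum[n] = $2 + 0          # cumulative count for this bucket
    }
    END {
      if (n == 0 || cum[n] == 0) { print "no samples"; exit }
      target = 0.99 * cum[n]   # last bucket (+Inf) holds the total
      for (i = 1; i <= n; i++)
        if (cum[i] >= target) { print bound[i]; exit }
    }
  '
}

# Usage from a machine with cluster access:
#   kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | pleg_p99_upper_bound
```

Because the counters are cumulative over the process lifetime, a recent degradation shows up here later than it would in a rate-based query.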

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| kubelet_pleg_relist_duration_seconds | Direct measurement of PLEG loop latency | p99 > 5 s is degraded; p99 > 30 s is critical |
| kubelet_pleg_relist_interval_seconds | Shows whether relists are backing up | Intervals consistently > 1 s indicate queuing |
| kubelet_runtime_operations_duration_seconds for list_containers and list_podsandbox | CRI-level latency that feeds into PLEG | p99 > 1 s predicts PLEG stalls |
| Node Ready condition / PLEG health | Binary node availability | Any transition to False or Unknown |
| Kubelet CPU and memory usage | Kubelet resource starvation can starve PLEG | CPU throttling or RSS approaching limits |
| Node disk I/O utilization and iowait | Slow storage delays runtime state queries | Disk utilization > 80% or iowait > 10% |
| kubelet_running_pods, kubelet_running_containers | Density drives serial cache work | Sudden spikes above baseline |
| kubelet_runtime_operations_errors_total | Repeated failures can block relist | Any sustained non-zero error rate on list operations |
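The warning thresholds for kubelet_pleg_relist_duration_seconds can be encoded in a small helper for ad hoc scripts. This is a sketch only: the function name `classify_relist_p99` is hypothetical, and it expects a p99 value in seconds that you have already obtained from your monitoring system.

```shell
# Classify a p99 relist duration (seconds) using the 5 s and 30 s
# warning levels: prints "ok", "degraded", or "critical".
classify_relist_p99() {
  awk -v p99="$1" 'BEGIN {
    v = p99 + 0                      # force numeric comparison
    if (v > 30)      print "critical"
    else if (v > 5)  print "degraded"
    else             print "ok"
  }'
}

# Usage:
#   classify_relist_p99 12.4
```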

Fixes

If the cause is container runtime pressure

Runtime storage I/O saturation is the most common root cause. If the runtime data directory shares a disk with heavy write workloads, container state queries slow down.

  • Move containerd or CRI-O snapshot storage to a dedicated local SSD if possible.
  • If the runtime process is stuck but not crashed, a runtime restart may recover it. This is disruptive: existing containers will survive in most cases, but in-flight operations will abort. Coordinate the restart during a maintenance window if the node is partially functional.
  • Reduce concurrent image pulls and container churn to lower disk pressure.

If the cause is serial cache bottleneck

When many pods change simultaneously, PLEG calls updateCache() serially for each changed pod. Under heavy churn, the cumulative latency can exceed the three-minute threshold.

  • Reduce per-node pod density. The default --max-pods of 110 may be too high for nodes with slow storage or high churn.
  • Throttle rolling updates by reducing both maxSurge and maxUnavailable so that fewer pods change at once.
  • If you are scaling out hundreds of pods at once, spread the load across more nodes or use pod anti-affinity to reduce per-node churn.
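Rollout pacing is set in a Deployment's rollingUpdate strategy. The sketch below builds a merge patch for it; the function name `rollout_patch` is hypothetical, the values 1 and 0 are illustrative rather than recommended, and only integer values are handled.

```shell
# Build a patch that sets integer maxSurge/maxUnavailable on a Deployment.
# Percent values such as 25% would need JSON string quoting and are not handled.
rollout_patch() {
  printf '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":%d,"maxUnavailable":%d}}}}\n' "$1" "$2"
}

# Usage (replace the placeholder with a real Deployment name):
#   kubectl patch deployment <deployment-name> --type=merge -p "$(rollout_patch 1 0)"
```

With one surge pod and zero unavailable, the rollout replaces pods one at a time, which trades rollout speed for lower per-node churn.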

If the cause is hanging mounts or storage

A single pod with a stuck NFS mount or a volume that hangs during inspection can block the entire relist for the duration of the CRI call timeout.

  • Identify the offending pod from kubelet or runtime logs. Evict or delete the pod if possible.
  • Use soft mount options for NFS with reasonable timeo and retrans values, understanding that soft mounts can fail I/O operations and may not be safe for all workloads.
  • If the mount is already stuck, you may need to force a lazy unmount or reboot the node.
  • Review pod specifications to avoid mounting remote filesystems that can become unreachable.

Destructive or disruptive: A lazy unmount (umount -l <mountpoint>) and node reboot are disruptive. They can terminate running workloads and leave orphaned mountpoints.
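Before touching any mount, it helps to list which NFS mounts on the node would block indefinitely. This is a minimal sketch that assumes input in /proc/mounts format and treats any NFS mount without a soft or softerr option as a hard mount; the function name `hard_nfs_mounts` is hypothetical.

```shell
# Print mount points of NFS filesystems that are not mounted soft.
# Input is /proc/mounts format: device mountpoint fstype options dump pass
hard_nfs_mounts() {
  awk '$3 ~ /^nfs4?$/ && ("," $4 ",") !~ /,soft(err)?,/ { print $2 }'
}

# Usage on a node:
#   hard_nfs_mounts < /proc/mounts
```

Reading /proc/mounts only consults the kernel's mount table, so this check should not hang on an unreachable server the way commands that stat the mount point can.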

If the cause is kubelet resource starvation

On dense nodes, kubelet itself can be CPU-throttled or memory-constrained, leaving insufficient resources for the PLEG goroutine.

  • Increase kube-reserved and system-reserved so kubelet has dedicated CPU and memory headroom.
  • If kubelet runs inside a cgroup or container, raise its CPU and memory limits.
  • Reduce node density or move to larger instance types.
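To confirm that kubelet is actually being throttled before changing reservations, the throttling counters in its cgroup's cpu.stat file can be turned into a percentage. This is a sketch: the function name `throttle_percent` is hypothetical, and the cgroup path in the usage line is an assumption that depends on how kubelet is launched on your nodes.

```shell
# Print the percentage of CFS periods in which the cgroup was throttled.
# Input is the content of a cgroup cpu.stat file.
throttle_percent() {
  awk '
    $1 == "nr_periods"   { periods = $2 + 0 }
    $1 == "nr_throttled" { throttled = $2 + 0 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "no data"
    }
  '
}

# Usage (path is an assumption; adjust to the kubelet cgroup on your node):
#   throttle_percent < /sys/fs/cgroup/system.slice/kubelet.service/cpu.stat
```

A result of "no data" usually means no CPU limit is set on that cgroup, in which case throttling is not the cause and plain CPU contention is the more likely suspect.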

If the cause is a runtime bug

If the issue correlates with a specific containerd or CRI-O version and read-only checks show no resource pressure, treat it as a bug.

  • Upgrade the container runtime to a patched version.
  • As a temporary mitigation, restart the runtime or cordon and drain the node.
  • Monitor upstream issue trackers for regressions that match your version and symptom pattern.

Prevention

PLEG unhealthy is almost never a sudden failure; it is the end of a trend. The node typically warns you with elevated relist latency minutes or hours before crossing the three-minute threshold.

  • Alert early on relist latency. Alert on kubelet_pleg_relist_duration_seconds p99 > 10 s for more than two minutes. This gives you time to investigate before the hard threshold trips.
  • Cap pod density per node. Account for DaemonSets and sidecars when calculating effective pod count. Dense nodes with frequent churn are the highest risk.
  • Use fast local storage for the container runtime. Network-attached storage for runtime state adds variable latency to every container list operation.
  • Avoid hard-mount NFS in pods. Prefer soft mounts, or use object storage and local caches instead.
  • Do not rely on raising runtime-request-timeout as a fix. The default is two minutes. Doubling it does not solve the underlying stall, and a single hung call can then outlast the three-minute PLEG threshold on its own.
  • Keep runtime versions current. Containerd and CRI-O bugs that affect event handling and list performance are fixed regularly.
  • Maintain kubelet resource headroom. Treat kubelet as a first-class workload on the node, not overhead that can be starved.

How Netdata helps

Netdata collects kubelet metrics from each node and correlates them with system-level signals.

  • PLEG latency: Charts for kubelet_pleg_relist_duration_seconds percentiles show degradation before the node goes NotReady.
  • Runtime and CRI operation latency: CRI operation latency is tracked alongside PLEG metrics to separate runtime slowdown from kubelet starvation.
  • Node resource pressure: Disk I/O wait, memory pressure, and CPU throttling are visualized on the same timeline as PLEG errors.
  • Early alerting: Anomaly detection on relist interval and duration can trigger before the three-minute threshold is breached.