Kubernetes container runtime shim failures: containerd, CRI-O troubleshooting
Pods stuck in ContainerCreating, nodes flapping NotReady, and PLEG timeouts that clear only after a node reboot usually point to the container runtime shim layer, not the kubelet or network. The shim sits between the kubelet and the low-level runtime. When it hangs, crashes, or leaks, the kubelet cannot enumerate containers, start sandboxes, or reap terminated pods. Existing containers may keep running, but the node stops accepting new work.
This guide covers how to distinguish a shim failure from a CNI or kubelet issue, identify orphaned shims before they exhaust node PIDs, and recover without unnecessary reboots. It focuses on containerd and CRI-O.
What this means
The kubelet drives pod lifecycle through the CRI (Container Runtime Interface) over a local Unix socket. containerd listens on /run/containerd/containerd.sock; CRI-O listens on /run/crio/crio.sock. The runtime delegates supervision to a monitor process: containerd-shim for containerd (one per pod with the current v2 shim) or conmon for CRI-O (one per container). The monitor maintains OCI runtime state and reports exit codes back to the runtime daemon.
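As a small illustration of the socket layout described above, the following sketch reports which runtime endpoint is present on a node. The function name detect_cri_endpoint and its optional base-directory argument are hypothetical, introduced only for this example; the socket paths are the ones named above.

```shell
# Print the CRI endpoint URI for whichever runtime socket path exists.
# detect_cri_endpoint is a hypothetical helper; the base directory argument
# exists only so the function can be exercised outside a real node.
detect_cri_endpoint() {
  local base="${1:-/run}"
  local sock
  for sock in "$base/containerd/containerd.sock" "$base/crio/crio.sock"; do
    # -e checks that the path exists; use -S to require an actual Unix socket.
    if [ -e "$sock" ]; then
      echo "unix://$sock"
      return 0
    fi
  done
  echo "no CRI socket found under $base" >&2
  return 1
}
```

On a node, the result could be passed straight to crictl, for example: crictl --runtime-endpoint "$(detect_cri_endpoint)" info. If both sockets exist, this sketch prefers containerd, which is an arbitrary choice.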
The kubelet’s PLEG (Pod Lifecycle Event Generator) periodically asks the runtime to list all containers and sandboxes. If a shim or monitor is hung, orphaned, or slow to respond, that enumeration stalls. When a PLEG relist exceeds its deadline, the kubelet marks the node NotReady. The runtime daemon may still be running, but it is blocked on monitor I/O or state.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Hung or orphaned shim/monitor | PLEG relist duration climbs; node flaps NotReady; containerd-shim or conmon count exceeds running containers | ps monitor count vs crictl ps -a count |
| Runtime socket unresponsive | crictl hangs or returns connection errors; kubelet logs show CRI timeout | crictl info and socket file presence |
| Runtime daemon crash or deadlock | Node Ready goes False; no new pods start; existing pods may still run | systemctl status containerd or crio |
| PID exhaustion from shim accumulation | fork/exec ... resource temporarily unavailable; node cannot spawn processes | Running PID count vs /proc/sys/kernel/pid_max |
| Cgroup driver mismatch | Containers start but limits are ignored; unexpected OOM kills | kubelet and runtime cgroup driver configs |
Quick checks
These commands are read-only unless noted.
# Check CRI socket exists and is accessible
ls -la /run/containerd/containerd.sock /run/crio/crio.sock 2>/dev/null
# Test runtime responsiveness (containerd)
time crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
# Test runtime responsiveness (CRI-O)
time crictl --runtime-endpoint unix:///run/crio/crio.sock info
# Count containerd shim processes
ps aux | grep -c '[c]ontainerd-shim'
# Count CRI-O monitor processes
ps aux | grep -c '[c]onmon'
# List all containers via CRI, independent of kubelet
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a
# Check runtime service health
systemctl status containerd
systemctl status crio
# Check kubelet logs for PLEG timeout or relist errors
journalctl -u kubelet --since "5 minutes ago" | grep -iE 'pleg|relist'
# Query PLEG relist duration from metrics (requires kubelet or API server access)
kubectl get --raw "/api/v1/nodes/<node_name>/proxy/metrics" | grep kubelet_pleg_relist_duration_seconds
# Compare task count (processes plus threads) to the system limit; threads consume PIDs too
echo "tasks: $(ps -eLo tid= | wc -l) / $(cat /proc/sys/kernel/pid_max)"
# Check runtime daemon logs for errors
journalctl -u containerd --since "5 minutes ago" | grep -iE 'error|fail|shim'
journalctl -u crio --since "5 minutes ago" | grep -iE 'error|fail|conmon'
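Because crictl can block indefinitely when the socket is unresponsive, it may help to wrap the responsiveness checks above in a deadline. The sketch below assumes coreutils timeout is installed; probe_with_timeout is a hypothetical helper name, not part of crictl.

```shell
# Run a command with a deadline and classify the outcome.
# probe_with_timeout is a hypothetical helper; it relies on coreutils `timeout`,
# which exits with status 124 when the deadline is hit.
probe_with_timeout() {
  local secs="$1"; shift
  local rc=0
  timeout "$secs" "$@" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0)   echo "ok" ;;
    124) echo "timeout" ;;
    *)   echo "error" ;;
  esac
}

# Example (on a node):
#   probe_with_timeout 10 crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
```

A "timeout" result suggests a hung runtime, while "error" more likely points to a missing socket or a stopped daemon. The 10-second deadline is illustrative.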
How to diagnose it
- Isolate the scope. If only one node is affected, suspect a local runtime or monitor failure. If many nodes fail simultaneously, look for a control plane, network, or cluster-wide DaemonSet issue.
- Verify the CRI socket. Run crictl info against the runtime endpoint. If the command hangs or returns a connection error, the runtime is not accepting CRI requests. This is a hard failure. Check that the socket file exists.
- Compare shim/monitor count to container count. For containerd, count containerd-shim processes with ps. For CRI-O, count conmon processes. Compare to the number of containers reported by crictl ps -a. A large discrepancy indicates orphaned monitors holding PID and memory resources.
- Check PLEG metrics. Query kubelet_pleg_relist_duration_seconds from the node metrics endpoint or via the API server proxy. If the p99 climbs above 10 seconds, the runtime is slow to enumerate containers. This is the leading indicator before a NotReady transition.
- Inspect runtime logs. For containerd, read journalctl -u containerd. For CRI-O, read journalctl -u crio. Look for OOM kills, segfaults, storage driver errors, or repeated shim or conmon start failures.
- Check node PID and fd saturation. If the node is near pid_max or the runtime has too many open file descriptors, new shims cannot be spawned. Look for resource temporarily unavailable in runtime or kubelet logs.
- Check cgroup driver alignment. The kubelet and the runtime must both use systemd or both use cgroupfs. A mismatch causes containers to start while resource limits are silently ignored, which can lead to unexpected OOM kills and runtime stress.
- Correlate with kubelet CRI metrics. Look at kubelet_runtime_operations_duration_seconds and kubelet_runtime_operations_errors_total. High latency or errors on list_containers or list_podsandbox confirm the runtime is the bottleneck, not the kubelet sync loop.
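For the PLEG metrics step, a rough first reading can be taken without a metrics backend. The sketch below computes the mean relist duration from the histogram's _sum and _count series in the Prometheus text output. This is a lifetime average since the kubelet started, so it is a much coarser signal than the p99 mentioned above; pleg_mean_relist is a hypothetical helper name.

```shell
# Read Prometheus text-format metrics on stdin and print the mean PLEG relist
# duration in seconds, derived from the histogram's _sum and _count series.
# pleg_mean_relist is a hypothetical helper name.
pleg_mean_relist() {
  awk '
    /^kubelet_pleg_relist_duration_seconds_sum/   { sum = $NF }
    /^kubelet_pleg_relist_duration_seconds_count/ { count = $NF }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else { print "no samples"; exit 1 }
    }'
}

# Example:
#   kubectl get --raw "/api/v1/nodes/<node_name>/proxy/metrics" | pleg_mean_relist
```

A proper percentile still requires the bucket series and a query engine; this only helps decide whether a node deserves a closer look.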
flowchart TD
A[Shim process hangs or orphans] --> B[Runtime slows on container enumeration]
B --> C[PLEG relist duration exceeds deadline]
C --> D[Kubelet reports node NotReady]
D --> E[Scheduler stops sending pods]
D --> F[Kubelet cannot start or terminate containers]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| PLEG relist duration | Measures how fast the runtime can list containers; drives node readiness | p99 > 10s sustained |
| CRI operation latency | Kubelet view of runtime responsiveness | p99 > 5s for list_containers or list_podsandbox |
| CRI operation errors | Direct indicator of failed CRI calls | Any sustained error rate > 0 |
| Node Ready condition | Aggregates PLEG and runtime health | Transition to False or Unknown for > 1 minute |
| Shim/monitor process count | Orphaned shims leak PIDs and memory | Count > 1.5x running container count |
| PID pressure | Prevents new shim and container creation | PIDPressure=True or usage > 90% of pid_max |
| Kubelet CPU/memory | Resource-starved kubelet cannot drive CRI | CPU throttling or RSS approaching limit |
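Two of the warning signs in the table reduce to simple integer comparisons, which makes them easy to script. The following is a minimal sketch using the 1.5x and 90% thresholds from the table; the function names are hypothetical, and the thresholds should be tuned to your own baseline.

```shell
# Hypothetical helpers that apply two thresholds from the table using integer math.

# Succeeds (exit 0) when monitor processes exceed 1.5x the running container count.
shim_ratio_exceeded() {
  local monitors="$1" containers="$2"
  [ $((monitors * 10)) -gt $((containers * 15)) ]
}

# Succeeds (exit 0) when PID usage is above 90% of pid_max.
pid_usage_exceeded() {
  local used="$1" max="$2"
  [ $((used * 100)) -gt $((max * 90)) ]
}

# Example (on a containerd node):
#   shim_ratio_exceeded "$(ps aux | grep -c '[c]ontainerd-shim')" "$(crictl ps -q | wc -l)" \
#     && echo "possible orphaned shims"
```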
Fixes
If the cause is a hung or orphaned shim/monitor
WARNING: Killing shims or monitors can leave containers in an unknown state. Target only confirmed orphans. Prefer targeted cleanup over a runtime restart.
Identify the specific process. For containerd, find containerd-shim processes with no corresponding container in crictl ps -a. For CRI-O, find conmon processes with no corresponding container. Cordon the node, then kill the orphan process. After the process exits, the runtime may reap the container. If a pod remains stuck in Terminating, force-delete the pod object from the API server:
kubectl delete pod <pod_name> -n <namespace> --grace-period=0 --force
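The orphan identification described above can be sketched for containerd as a comparison between shim command lines and the IDs the runtime still knows about. This assumes shims are launched with an "-id <sandbox id>" argument, as containerd-shim-runc-v2 typically is, and that the ID file holds full, untruncated IDs. The helper name orphan_shim_ids is hypothetical.

```shell
# Print the -id value of each containerd-shim command line on stdin that does
# not appear in the known-ID file (one ID per line).
# orphan_shim_ids is a hypothetical helper; the "-id <sandbox id>" argument
# layout is an assumption about how the shim was launched.
orphan_shim_ids() {
  local known_ids_file="$1"
  awk '
    FILENAME == ARGV[1] { known[$1] = 1; next }   # first input: known IDs
    /containerd-shim/ {
      for (i = 1; i < NF; i++)
        if ($i == "-id" && !($(i + 1) in known)) print $(i + 1)
    }' "$known_ids_file" -
}

# Example (on a node):
#   crictl pods -q > /tmp/known-ids
#   ps -eo pid,args | orphan_shim_ids /tmp/known-ids
```

Treat the output as candidates only. Confirm each one against crictl and the runtime logs before killing anything, since a sandbox created between the two commands would also show up here.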
If the cause is runtime daemon failure
Restarting the runtime daemon is disruptive. Cordon the node first to prevent new pod scheduling. Only restart after read-only checks confirm the daemon is not responding to CRI.
# Disruptive: restart the runtime daemon
systemctl restart containerd
# or
systemctl restart crio
Existing containers managed by independent shim or monitor processes may survive the restart, but new pods will be blocked until the runtime recovers. Verify recovery with crictl info before uncordoning.
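The verify-before-uncordon step can be scripted as a bounded retry loop. This is a sketch under assumptions: wait_for_cri is a hypothetical helper, and the attempt count and delay in the example are illustrative values, not recommendations.

```shell
# Retry a probe command until it succeeds or the attempts run out.
# wait_for_cri is a hypothetical helper; arguments are attempts, delay in
# seconds, then the probe command and its arguments.
wait_for_cri() {
  local attempts="$1" delay="$2"; shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "runtime responded on attempt $i"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "runtime still unresponsive after $attempts attempts" >&2
  return 1
}

# Example (on a node, after restarting the runtime):
#   wait_for_cri 12 5 timeout 10 crictl info && kubectl uncordon <node_name>
```

Chaining the uncordon behind the probe keeps the node out of scheduling if the runtime never comes back.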
If the cause is PID exhaustion
Increase pid_max for immediate relief. This change is not persistent.
# Disruptive if workloads are spawning rapidly; increases system-wide limit
echo 4194304 > /proc/sys/kernel/pid_max
Persist the change in /etc/sysctl.conf or a drop-in under /etc/sysctl.d/. Then identify and clean up orphaned shim or monitor processes. Set kubelet --pod-max-pids to limit per-pod process explosions. Review workloads for fork bombs or runaway thread pools.
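Persisting the limit might look like the sketch below, which writes a drop-in file. The function name persist_pid_max and the file name 90-pid-max.conf are hypothetical choices; the directory argument defaults to /etc/sysctl.d, and writing there requires root.

```shell
# Write a sysctl drop-in that persists kernel.pid_max across reboots.
# persist_pid_max and the file name 90-pid-max.conf are hypothetical choices;
# the optional second argument overrides the target directory.
persist_pid_max() {
  local value="$1" dir="${2:-/etc/sysctl.d}"
  printf 'kernel.pid_max = %s\n' "$value" > "$dir/90-pid-max.conf"
}

# Example (as root): write the drop-in, then reload all sysctl configuration.
#   persist_pid_max 4194304 && sysctl --system
```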
If the cause is cgroup driver mismatch
Align the kubelet and runtime configurations so both specify the same cgroup driver. Restart the runtime and kubelet after changing the driver. A mismatched driver causes containers to run without effective resource limits, which amplifies memory and CPU pressure and can cascade into runtime instability.
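A quick way to check alignment is to read both settings and compare them. The sketch below assumes the kubelet is configured through a KubeletConfiguration file with a cgroupDriver key and that containerd selects the systemd driver through SystemdCgroup = true in its runc options. The helper names are hypothetical, and file locations vary by distribution, so paths are passed in as arguments.

```shell
# Hypothetical helpers that read the configured cgroup driver from config files.
# File paths are passed in because locations vary between distributions.

# Kubelet: print the value of the cgroupDriver key, or "unset" if it is absent.
kubelet_cgroup_driver() {
  awk -F': *' '/^cgroupDriver:/ { gsub(/[" ]/, "", $2); print $2; found = 1 }
               END { if (!found) print "unset" }' "$1"
}

# containerd: SystemdCgroup = true selects systemd;
# this sketch reports cgroupfs otherwise.
containerd_cgroup_driver() {
  if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"; then
    echo "systemd"
  else
    echo "cgroupfs"
  fi
}

# Example (paths are common defaults, not guaranteed):
#   kubelet_cgroup_driver /var/lib/kubelet/config.yaml
#   containerd_cgroup_driver /etc/containerd/config.toml
```

If the kubelet reports "unset", the driver comes from a command-line flag or the built-in default, so check the kubelet's arguments as well. For CRI-O, the analogous setting is cgroup_manager in the CRI-O configuration.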
Prevention
- Monitor PLEG relist duration and CRI operation latency per node pool. Alert on sustained deviation from baseline before the node transitions to NotReady.
- Monitor the ratio of shim/monitor processes to running containers. Automated alerts on orphans prevent PID exhaustion.
- Keep kubelet, container runtime, and kernel versions within the supported skew window for your Kubernetes version.
- Cordon the node before any runtime restart. Do not treat a runtime restart as a harmless first response.
- Enforce per-pod PID limits and maintain node-level PID headroom to absorb shim leaks.
How Netdata helps
- Correlate kubelet_pleg_relist_duration_seconds spikes with node CPU, memory, and disk I/O to distinguish runtime slowness from resource pressure.
- Track kubelet_runtime_operations_errors_total to surface CRI failures without manual log diving.
- Monitor PID usage and PIDPressure conditions alongside process counts to catch leaks before they exhaust the node.
- Overlay container runtime daemon CPU and memory with kubelet metrics to pinpoint whether the runtime or the shim is the bottleneck.






