Kubernetes node PIDPressure: detection and remediation

PID exhaustion is a cliff-edge failure: once the kernel cannot fork, containers fail to start, health checks fail, and ssh to the node may hang. Kubernetes surfaces this through the PIDPressure node condition, but many clusters ship without PID-based eviction thresholds. Without them, the first symptom is usually EAGAIN or ENOMEM from fork failures, not a kubelet eviction.

This guide shows how to detect PIDPressure before it triggers an outage, distinguish between application leaks, runtime shim accumulation, and kernel limits, and remediate the root cause. You will correlate node-level PID utilization with specific pods, validate kubelet cgroup enforcement, and configure thresholds that provide lead time.

What this means

The kubelet monitors node-level PID capacity through the pid.available eviction signal, computed from node.stats.rlimit.maxpid - node.stats.rlimit.curproc. When available PIDs drop below the configured hard eviction threshold, the kubelet sets PIDPressure=True, and the control plane's node controller taints the node with node.kubernetes.io/pid-pressure:NoSchedule. The scheduler stops placing new pods, and the kubelet eviction manager may remove pods.
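The subtraction above can be approximated by hand on a node. The following is a minimal sketch, not the kubelet's implementation: it assumes the current task count is the number after the slash in the fourth field of /proc/loadavg, and it uses kernel.pid_max alone as the ceiling (the kubelet may also take kernel.threads-max into account). The function name pid_available and its file-path parameters are illustrative.

```shell
#!/bin/sh
# Sketch: approximate the pid.available signal on a Linux node.
# Assumption: the fourth field of /proc/loadavg looks like "running/total",
# where "total" counts every existing task (processes and threads).

# pid_available PID_MAX_FILE LOADAVG_FILE
# Prints: "<max> <current> <available> <used_percent>"
pid_available() {
    max=$(cat "$1") || return 1
    # Keep the part after the slash in the fourth field.
    cur=$(awk '{ split($4, parts, "/"); print parts[2] }' "$2") || return 1
    echo "$max $cur $((max - cur)) $((cur * 100 / max))"
}

# Hypothetical usage on a real node:
#   pid_available /proc/sys/kernel/pid_max /proc/loadavg
```

The percentage is truncated integer arithmetic, which is enough for a quick headroom check but not for alerting near a threshold.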

Two caveats. First, eviction is reactive and periodic. A rapid PID spike can exceed the limit before the next housekeeping cycle triggers eviction. Second, many cluster distributions ship without a default evictionHard value for pid.available. If you have not configured one, PIDPressure may never fire, and the first symptom will be fork failures at the kernel level.

PIDs are node-global. Every container process, thread, runtime shim, and zombie consumes a PID from the same kernel pool. When the pool is empty, fork() returns EAGAIN or ENOMEM. Inside Kubernetes, this manifests as failed container starts, probe failures, and cascading pod restarts that worsen the pressure.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Zombie process accumulation inside pods | PID count climbs steadily; ps shows <defunct> entries | `ps aux |
| containerd-shim leak | PIDs consumed by runtime after container exit | pgrep -c containerd-shim compared to running container count |
| Per-pod thread or subprocess explosion | A single pod consumes thousands of PIDs via unbounded thread pools or shell loops | pstree -p <pid> or /proc/<pid>/status inside the pod |
| Low kernel.pid_max | Node hits 32768 limit despite idle CPU/memory | cat /proc/sys/kernel/pid_max |
| Missing or ineffective PodPidsLimit | Kubelet config sets a limit, but the container cgroup does not reflect it | cat /sys/fs/cgroup/pids/pids.max inside a running container |
| Workload fork bomb or CI runner | Sudden PID spike correlated with a specific deployment or job | Process tree on the node sorted by thread count |

Quick checks

# Check PIDPressure status across the cluster
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'

# Describe a specific node for condition details
kubectl describe node <node-name> | grep -A 5 PIDPressure

# Compare tasks in use (processes and threads both consume PIDs) against the kernel limit
echo "limit: $(cat /proc/sys/kernel/pid_max); used: $(awk '{split($4, a, "/"); print a[2]}' /proc/loadavg)"

# Check kubelet eviction flags for pid thresholds (only shows flags, not config file values)
ps aux | grep kubelet | grep -o 'eviction-hard=[^ ]*'

# Inspect the container cgroup PID limit from inside a pod (cgroup v1)
cat /sys/fs/cgroup/pids/pids.max

# Inspect current cgroup PID usage (cgroup v1)
cat /sys/fs/cgroup/pids/pids.current

# Find zombie processes on the node
ps aux | awk '$8 ~ /^Z/ {print $0}'

# Count container runtime shims
pgrep -c containerd-shim

# Check kubelet eviction metrics for pid signals (requires accessible metrics endpoint)
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep -E 'eviction.*pid'

On cgroup v2 hosts, the PID limit path is /sys/fs/cgroup/pids.max and usage is /sys/fs/cgroup/pids.current.
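Because the path differs between cgroup versions, a small helper can pick the right one. This is a sketch under stated assumptions: it is meant to run inside a container, it assumes the cgroup filesystem is mounted at /sys/fs/cgroup unless another root is passed, and the function name read_cgroup_pids is illustrative.

```shell
#!/bin/sh
# Sketch: print "current max" for the PID controller of the cgroup we run in.
# Assumption: cgroup v1 exposes the controller under <root>/pids, while
# cgroup v2 exposes pids.max directly under <root>.

# read_cgroup_pids [CGROUP_ROOT]
read_cgroup_pids() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/pids/pids.max" ]; then
        dir="$root/pids"        # cgroup v1 layout
    elif [ -f "$root/pids.max" ]; then
        dir="$root"             # cgroup v2 (unified) layout
    else
        echo "no pids controller found under $root" >&2
        return 1
    fi
    echo "$(cat "$dir/pids.current") $(cat "$dir/pids.max")"
}

# Hypothetical usage inside a pod:
#   read_cgroup_pids
```

A second value of max means the cgroup has no PID limit at all.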

How to diagnose it

flowchart TD
    A[PIDPressure=True or fork failures] --> B{Node PIDs near pid_max?}
    B -->|Yes| C[Identify top PID consumers]
    B -->|No| D[Check kubelet eviction config]
    C --> E{Zombies or shims?}
    E -->|Zombies| F[Fix app reaping or restart pod]
    E -->|Shims| G[Restart runtime or drain node]
    E -->|Thread leak| H[Reduce pod parallelism]
    D --> I[Configure pid.available evictionHard]
  1. Confirm the node condition and taint. Use kubectl get nodes to check PIDPressure. If the status is True, verify the taint node.kubernetes.io/pid-pressure:NoSchedule is present. This confirms kubelet has detected the problem, but it does not reveal how many PIDs remain or which workload is responsible.

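  2. Determine absolute headroom. On the node, compare the task count (the number after the slash in the fourth field of /proc/loadavg, which includes threads) against /proc/sys/kernel/pid_max. Counting the numeric directories in /proc sees only processes and undercounts, because every thread also consumes a PID. If utilization is above 80% of the limit, the node is in immediate danger regardless of what kubelet reports. If the limit is 32768, expect exhaustion on any node running more than a few dozen pods with multi-threaded runtimes.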

  3. Identify top consumers by parent process. List processes grouped by container runtime shim or pod cgroup. On the host, inspect pids.current files under /sys/fs/cgroup/pids/kubepods/ (exact path varies by cgroup version, cgroup driver, and QoS class). If your runtime is containerd, ctr -n k8s.io tasks ls lists each container's task with its init PID, and ctr -n k8s.io tasks ps <container-id> lists the processes inside one container.

  4. Look for zombies. Zombie processes hold PIDs in the kernel task table until their parent calls waitpid(). Run ps aux | awk '$8 ~ /^Z/' on the node. If zombies cluster under a specific container, the application inside that pod is failing to reap children. Restarting the pod provides immediate relief but does not fix the code.

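  5. Check for orphaned runtime shims. containerd runs a containerd-shim process for its workloads: older shims run one per container, while containerd-shim-runc-v2 typically runs one per pod sandbox. If a shim is orphaned, it continues to consume a PID. Compare pgrep -c containerd-shim against the number of running containers (crictl ps -q | wc -l) or, for the v2 shim, ready pod sandboxes (crictl pods --state ready -q | wc -l). A large gap indicates shim leakage.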

  6. Verify kubelet and cgroup enforcement. Check whether PodPidsLimit is configured in the kubelet config (podPidsLimit in /var/lib/kubelet/config.yaml) or startup flags (--pod-max-pids). Then enter a pod and run cat /sys/fs/cgroup/pids/pids.max (cgroup v1) or cat /sys/fs/cgroup/pids.max (cgroup v2). If the limit is max or a value far larger than PodPidsLimit, enforcement is not reaching the container. This is common with dockershim or cri-dockerd; verify that your runtime and cgroup driver support PID limits.

  7. Inspect eviction configuration. If PIDPressure is not firing despite high usage, check kubelet’s --eviction-hard and --eviction-soft settings. If pid.available is absent, kubelet will not evict for PID pressure. Add an explicit threshold such as pid.available<100 or a percentage, depending on node density.

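  8. Correlate with events and logs. Eviction events are recorded against pods, not nodes, so list them with kubectl get events -A --field-selector reason=Evicted and look for messages that name PIDs as the scarce resource. For the node itself, kubectl get events -A --field-selector involvedObject.kind=Node,reason=NodeHasInsufficientPID shows when the kubelet flipped the condition. Read kubelet logs with journalctl -u kubelet | grep -i pid to find the exact eviction signal value at the time of the event.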
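Step 3 can be sketched as a small ranking helper that walks a cgroup subtree and sorts by pids.current. This is an illustration under assumptions: the function name top_pid_cgroups is hypothetical, and the root you pass depends on your cgroup version and driver (for example the kubepods path mentioned in step 3).

```shell
#!/bin/sh
# Sketch: rank cgroups under ROOT by their current PID usage.
# Note: PID accounting is hierarchical, so a parent cgroup's count includes
# its children and the top-level kubepods cgroup will usually rank first.

# top_pid_cgroups ROOT [N]
# Prints up to N lines of "<pids.current><TAB><cgroup directory>".
top_pid_cgroups() {
    root=$1
    n=${2:-10}
    find "$root" -type f -name pids.current 2>/dev/null |
    while IFS= read -r f; do
        # A cgroup can vanish between find and cat; skip it if so.
        count=$(cat "$f" 2>/dev/null) || continue
        printf '%s\t%s\n' "$count" "$(dirname "$f")"
    done | sort -rn | head -n "$n"
}

# Hypothetical usage on a cgroup v1 node:
#   top_pid_cgroups /sys/fs/cgroup/pids/kubepods 15
```

The directory names in the output contain the pod UID, which you can match against kubectl get pods -A -o custom-columns=NAME:.metadata.name,UID:.metadata.uid to find the owning pod.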

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Node condition PIDPressure | Binary indicator from kubelet | status=True sustained for >1 minute |
| Node PIDs used vs pid_max | Measures true system headroom before fork fails | Utilization >80% of pid_max |
| Kubelet eviction metrics with pid signal | Confirms kubelet is actively shedding load | Any non-zero eviction rate for pid |
| Per-pod PID usage | Identifies noisy neighbors before they exhaust the node | Any pod approaching its PodPidsLimit |
| Zombie process count | Zombies consume PIDs but no other resources | Sustained >0 on production nodes |
| containerd-shim count | Orphaned shims leak node-level PIDs | Count growing while pod count is flat |
| Container cgroup pids.current vs pids.max | Validates that enforcement is working | pids.current within 10% of pids.max, or pids.max unlimited |
| kernel.pid_max | System-wide ceiling that is easy to misconfigure | Value < 65536 on dense nodes |
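The cgroup row in the table reduces to a three-way check. A minimal sketch, assuming the two values were read from pids.current and pids.max; the function name pid_limit_status and the output labels are illustrative.

```shell
#!/bin/sh
# Sketch: classify a cgroup's PID usage against its limit.

# pid_limit_status CURRENT MAX
# Prints "unlimited" when MAX is the literal string "max",
# "warn" when CURRENT is at or above 90% of MAX (within 10% of the limit),
# and "ok" otherwise.
pid_limit_status() {
    current=$1
    limit=$2
    if [ "$limit" = "max" ]; then
        echo unlimited
    elif [ $((current * 10)) -ge $((limit * 9)) ]; then
        echo warn
    else
        echo ok
    fi
}

# Hypothetical usage inside a cgroup v2 container:
#   pid_limit_status "$(cat /sys/fs/cgroup/pids.current)" "$(cat /sys/fs/cgroup/pids.max)"
```

Multiplying both sides avoids fractional arithmetic, which POSIX shell does not support.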

Fixes

If the cause is zombie or leaking processes inside pods

Fix the application to reap child processes properly. If the code cannot be changed immediately, setting a stricter PodPidsLimit contains the blast radius to that pod rather than the node. To relieve pressure immediately, delete the offending pod. This is disruptive to the workload but safe for the node.
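To decide which pod to fix or delete, it helps to group zombies by the parent that is failing to reap them. A minimal sketch, assuming procps-style ps output with the parent PID and state columns; the function name zombies_by_parent is illustrative.

```shell
#!/bin/sh
# Sketch: count zombie processes per parent PID.
# Input on stdin: one "PPID STAT" pair per line, as printed by
#   ps -e -o ppid= -o stat=
# Output: "<zombie count> <parent pid>", largest count first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Hypothetical usage on the node:
#   ps -e -o ppid= -o stat= | zombies_by_parent | head
```

For each parent PID in the output, /proc/<ppid>/cgroup on the node shows which pod cgroup that process belongs to.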

If the cause is container runtime shim accumulation

Orphaned shims often require a container runtime restart to clear. Cordon the node and drain its workloads before restarting containerd (systemctl restart containerd). Verify shim counts drop before uncordoning. If the leak is due to a runtime bug, upgrade to a patched version.

If the cause is low kernel.pid_max

Raise the limit immediately:

# Apply now
sysctl -w kernel.pid_max=131072

# Persist
echo "kernel.pid_max = 131072" > /etc/sysctl.d/99-pid.conf
sysctl --system

Nodes running more than 50 pods or hosting Java, Node.js, or heavily threaded workloads should use at least 131072. The change takes effect immediately and does not require a reboot. Existing processes keep their current PIDs and still count toward usage; only new allocations can draw on the larger range.

If the cause is kubelet/cgroup enforcement gap

If PodPidsLimit is set in kubelet configuration but containers show pids.max = max inside their cgroup, the runtime is not propagating the limit. This is common with dockershim or cri-dockerd. Migrate to containerd or CRI-O, then verify enforcement by running cat /sys/fs/cgroup/pids/pids.max inside a new pod.

If the cause is a fork bomb or runaway subprocess

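Identify the parent process using ps -eo pid,ppid,nlwp,comm --sort=-nlwp | head -20 on the node, which lists the processes with the most threads first. Delete the pod immediately. To prevent recurrence, lower PodPidsLimit on the nodes that run that workload (it is a kubelet-wide setting and cannot be set per namespace or priority class), and review application logic for unbounded fork, exec, or thread creation.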

Prevention

  • Explicitly configure evictionHard (and optionally evictionSoft) for pid.available in kubelet configuration. A starting point is pid.available<100 on smaller nodes, or a percentage on larger ones.
  • Set PodPidsLimit in KubeletConfiguration to a value that matches your density. Verify it is reflected in container cgroups.
  • Tune kernel.pid_max to at least 131072 during node provisioning if the node will host more than a few dozen pods.
  • Monitor per-pod PID usage in staging and CI. Applications that leak PIDs should be caught before production deployment.
  • Ensure container images and init systems reap zombie processes correctly. Avoid running an application as PID 1 unless it reaps its children; otherwise start it under a minimal init such as tini.
  • Use containerd or CRI-O rather than dockershim to ensure PodPidsLimit is enforced at the cgroup level.
  • Include PID headroom in capacity planning alongside CPU and memory.
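The first two items can be spot-checked during provisioning. The sketch below is a naive text match, not a YAML parser: it only reports whether a pid.available key and a podPidsLimit key appear in the file, without distinguishing evictionHard from evictionSoft or validating the values. The function name check_kubelet_pid_settings is illustrative.

```shell
#!/bin/sh
# Sketch: report whether a kubelet config file mentions the PID settings.
# Assumption: the file is block-style YAML with one key per line.

# check_kubelet_pid_settings CONFIG_FILE
# Returns 0 only when both keys are present.
check_kubelet_pid_settings() {
    cfg=$1
    status=0
    if grep -Eq '^[[:space:]]*"?pid\.available"?[[:space:]]*:' "$cfg"; then
        echo "pid.available threshold: present"
    else
        echo "pid.available threshold: MISSING"
        status=1
    fi
    if grep -Eq '^[[:space:]]*podPidsLimit[[:space:]]*:' "$cfg"; then
        echo "podPidsLimit: present"
    else
        echo "podPidsLimit: MISSING"
        status=1
    fi
    return $status
}

# Hypothetical usage on a node:
#   check_kubelet_pid_settings /var/lib/kubelet/config.yaml
```

A present key is not proof of enforcement; confirm the limit in a container cgroup as described in the diagnosis steps.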

How Netdata helps

  • Netdata tracks per-cgroup pids.current and pids.max, surfacing which pods are approaching their PID limit.
  • The process state chart highlights zombie counts per node, making it obvious when an application is failing to reap children.
  • Node-level process count metrics, correlated with the Kubernetes PIDPressure condition, let you distinguish a true kernel shortage from a kubelet configuration gap.
  • Alerts on per-container PID saturation and node-wide pid_max utilization fire before kubelet eviction, providing lead time to cordon or drain the node.