Kubernetes node PIDPressure: detection and remediation
PID exhaustion is a cliff-edge failure: once the kernel cannot fork, containers fail to start, health checks fail, and ssh to the node may hang. Kubernetes surfaces this through the PIDPressure node condition, but many clusters ship without PID-based eviction thresholds. Without them, the first symptom is usually EAGAIN or ENOMEM from fork failures, not a kubelet eviction.
This guide shows how to detect PIDPressure before it triggers an outage, distinguish between application leaks, runtime shim accumulation, and kernel limits, and remediate the root cause. You will correlate node-level PID utilization with specific pods, validate kubelet cgroup enforcement, and configure thresholds that provide lead time.
What this means
The kubelet monitors node-level PID capacity through the pid.available eviction signal, computed from node.stats.rlimit.maxpid - node.stats.rlimit.curproc. When available PIDs drop below the configured hard eviction threshold, kubelet sets PIDPressure=True and taints the node with node.kubernetes.io/pid-pressure:NoSchedule. The scheduler stops placing new pods, and the kubelet eviction manager may remove pods.
Two caveats. First, eviction is reactive and periodic. A rapid PID spike can exceed the limit before the next housekeeping cycle triggers eviction. Second, many cluster distributions ship without a default evictionHard value for pid.available. If you have not configured one, PIDPressure may never fire, and the first symptom will be fork failures at the kernel level.
PIDs are node-global. Every container process, thread, runtime shim, and zombie consumes a PID from the same kernel pool. When the pool is empty, fork() returns EAGAIN or ENOMEM. Inside Kubernetes, this manifests as failed container starts, probe failures, and cascading pod restarts that worsen the pressure.
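The `pid.available` arithmetic described above can be approximated by hand on a node. The following is a minimal sketch, not a copy of kubelet's accounting: the helper names `read_task_count` and `pid_headroom` are hypothetical, and it assumes that the total task count in the fourth field of `/proc/loadavg` is a reasonable stand-in for the kubelet's current-process figure.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for estimating PID headroom; not part of kubelet.

# Print the total number of tasks (processes plus threads) from a
# loadavg-format file. The fourth field has the form "runnable/total".
read_task_count() {
  local file="${1:-/proc/loadavg}"
  awk '{ split($4, parts, "/"); print parts[2] }' "$file"
}

# pid_headroom MAX CUR
# Mirrors the subtraction behind pid.available (max minus current)
# and adds an integer utilization percentage.
pid_headroom() {
  local max="$1" cur="$2"
  echo "available=$(( max - cur )) used_pct=$(( cur * 100 / max ))"
}
```

On a node, `pid_headroom "$(cat /proc/sys/kernel/pid_max)" "$(read_task_count)"` prints both values. Counting tasks rather than processes matters because every thread consumes a PID.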
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Zombie process accumulation inside pods | PID count climbs steadily; ps shows <defunct> entries | `ps aux |
| containerd-shim leak | PIDs consumed by runtime after container exit | pgrep -c containerd-shim compared to running container count |
| Per-pod thread or subprocess explosion | A single pod consumes thousands of PIDs via unbounded thread pools or shell loops | pstree -p <pid> or /proc/<pid>/status inside the pod |
| Low kernel.pid_max | Node hits 32768 limit despite idle CPU/memory | cat /proc/sys/kernel/pid_max |
| Missing or ineffective PodPidsLimit | Kubelet config sets a limit, but the container cgroup does not reflect it | cat /sys/fs/cgroup/pids/pids.max inside a running container |
| Workload fork bomb or CI runner | Sudden PID spike correlated with a specific deployment or job | Process tree on the node sorted by thread count |
Quick checks
# Check PIDPressure status across the cluster
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'
# Describe a specific node for condition details
kubectl describe node <node-name> | grep -A 5 PIDPressure
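# Compare used PIDs against the kernel limit (threads consume PIDs too, so count tasks as well as processes)
echo "limit: $(cat /proc/sys/kernel/pid_max); processes: $(ls /proc | grep -E '^[0-9]+$' | wc -l); tasks: $(ps -eL --no-headers | wc -l)"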
# Check kubelet eviction flags for pid thresholds (only shows flags, not config file values)
ps aux | grep kubelet | grep -o 'eviction-hard=[^ ]*'
# Inspect the container cgroup PID limit from inside a pod (cgroup v1)
cat /sys/fs/cgroup/pids/pids.max
# Inspect current cgroup PID usage (cgroup v1)
cat /sys/fs/cgroup/pids/pids.current
# Find zombie processes on the node
ps aux | awk '$8 ~ /^Z/ {print $0}'
# Count container runtime shims
pgrep -c containerd-shim
# Check kubelet eviction metrics for pid signals (requires accessible metrics endpoint)
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep -E 'eviction.*pid'
On cgroup v2 hosts, the PID limit path is /sys/fs/cgroup/pids.max and usage is /sys/fs/cgroup/pids.current.
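Because the path differs between cgroup v1 and v2, a small wrapper can pick the right one. This is a sketch under stated assumptions: the function name `cgroup_pids_usage` is hypothetical, it is meant to run inside a container, and it treats the presence of `cgroup.controllers` at the cgroup mount point as the marker for cgroup v2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report pids.current and pids.max for the current
# cgroup, whichever cgroup version the host uses.
cgroup_pids_usage() {
  local root="${1:-/sys/fs/cgroup}" dir
  if [ -f "$root/cgroup.controllers" ]; then
    dir="$root"        # cgroup v2: unified hierarchy
  else
    dir="$root/pids"   # cgroup v1: dedicated pids controller
  fi
  if [ ! -r "$dir/pids.max" ]; then
    echo "pids controller not visible under $dir" >&2
    return 1
  fi
  echo "current=$(cat "$dir/pids.current") max=$(cat "$dir/pids.max")"
}
```

Called with no arguments inside a pod, it reads from `/sys/fs/cgroup`. A result of `max=max` means no PID limit is applied to that cgroup.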
How to diagnose it
flowchart TD
A[PIDPressure=True or fork failures] --> B{Node PIDs near pid_max?}
B -->|Yes| C[Identify top PID consumers]
B -->|No| D[Check kubelet eviction config]
C --> E{Zombies or shims?}
E -->|Zombies| F[Fix app reaping or restart pod]
E -->|Shims| G[Restart runtime or drain node]
E -->|Thread leak| H[Reduce pod parallelism]
D --> I[Configure pid.available evictionHard]

1. Confirm the node condition and taint. Use `kubectl describe node <node-name>` to check `PIDPressure`. If the status is `True`, verify the taint `node.kubernetes.io/pid-pressure:NoSchedule` is present. This confirms kubelet has detected the problem, but it does not reveal how many PIDs remain or which workload is responsible.
2. Determine absolute headroom. On the node, compare `ls /proc | grep -E '^[0-9]+$' | wc -l` against `/proc/sys/kernel/pid_max`. That command counts processes only; threads also consume PIDs, so `ps -eL --no-headers | wc -l` gives the truer figure. If utilization is above 80% of the limit, the node is in immediate danger regardless of what kubelet reports. If the limit is 32768, expect exhaustion on any node running more than a few dozen pods with multi-threaded runtimes.
3. Identify top consumers by parent process. List processes grouped by container runtime shim or pod cgroup. On the host, inspect `pids.current` files under `/sys/fs/cgroup/pids/kubepods/` (exact path varies by cgroup version and QoS class). If your runtime is containerd, `ctr -n k8s.io tasks ps <container-id>` lists the processes inside a single container.
4. Look for zombies. Zombie processes hold PIDs in the kernel task table until their parent calls `waitpid()`. Run `ps aux | awk '$8 ~ /^Z/'` on the node. If zombies cluster under a specific container, the application inside that pod is failing to reap children. Restarting the pod provides immediate relief but does not fix the code.
5. Check for orphaned runtime shims. containerd keeps a `containerd-shim` process alive for the containers it manages. If a shim is orphaned, it continues to consume a PID. Compare `pgrep -c containerd-shim` against the number of running containers from `crictl ps -q | wc -l`. A shim count well above the running container count indicates shim leakage (newer shims may serve a whole pod, so a healthy count can be lower than the container count).
6. Verify kubelet and cgroup enforcement. Check whether `PodPidsLimit` is configured in the kubelet config (`/var/lib/kubelet/config.yaml`) or startup flags. Then enter a pod and run `cat /sys/fs/cgroup/pids/pids.max` (or `cat /sys/fs/cgroup/pids.max` on cgroup v2). If the limit is `max` or a value far larger than `PodPidsLimit`, enforcement is not reaching the container. This is common with dockershim or cri-dockerd; verify that your runtime and cgroup driver support PID limits.
7. Inspect eviction configuration. If PIDPressure is not firing despite high usage, check kubelet's `--eviction-hard` and `--eviction-soft` settings. If `pid.available` is absent, kubelet will not evict for PID pressure. Add an explicit threshold such as `pid.available<100` or a percentage, depending on node density.
8. Correlate with events and logs. Eviction events are recorded against pods, so check `kubectl get events -A --field-selector reason=Evicted` for evicted pods, and `kubectl get events -A --field-selector involvedObject.kind=Node,reason=NodeHasInsufficientPID` for the node-level condition change. Read kubelet logs with `journalctl -u kubelet | grep -i pid` to find the exact eviction signal value at the time of the event.
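The consumer-ranking and zombie steps above can be sketched as two small filters. Both function names are hypothetical, and the cgroup root passed to `top_pid_cgroups` is an assumption that varies by cgroup version and driver.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for ranking PID consumers on a node.

# top_pid_cgroups ROOT [COUNT]
# Rank cgroups under ROOT by pids.current, largest first.
# Parent cgroups include their descendants, so they sort to the top.
top_pid_cgroups() {
  local root="$1" count="${2:-10}"
  find "$root" -name pids.current -type f 2>/dev/null |
    while read -r f; do
      printf '%s %s\n' "$(cat "$f")" "$(dirname "$f")"
    done |
    sort -rn | head -n "$count"
}

# zombies_by_parent
# Reads "STAT PPID" lines on stdin and prints "<zombie count> <ppid>",
# busiest parent first.
zombies_by_parent() {
  awk '$1 ~ /^Z/ { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

For example, `top_pid_cgroups /sys/fs/cgroup/pids/kubepods 20` ranks pod cgroups on a cgroup v1 host, and `ps -eo stat=,ppid= | zombies_by_parent` shows which parent processes are failing to reap children.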
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node condition PIDPressure | Binary indicator from kubelet | status=True sustained for >1 minute |
| Node PIDs used vs pid_max | Measures true system headroom before fork fails | Utilization >80% of pid_max |
| Kubelet eviction metrics with pid signal | Confirms kubelet is actively shedding load | Any non-zero eviction rate for pid |
| Per-pod PID usage | Identifies noisy neighbors before they exhaust the node | Any pod approaching its PodPidsLimit |
| Zombie process count | Zombies consume PIDs but no other resources | Sustained >0 on production nodes |
| containerd-shim count | Orphaned shims leak node-level PIDs | Count growing while pod count is flat |
| Container cgroup pids.current vs pids.max | Validates that enforcement is working | pids.current within 10% of pids.max, or pids.max unlimited |
| kernel.pid_max | System-wide ceiling that is easy to misconfigure | Value < 65536 on dense nodes |
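The cgroup warning sign in the table reduces to one comparison. The sketch below is illustrative only: `cgroup_pids_status` is a hypothetical name, and it reads "within 10% of pids.max" as usage at or above 90% of the limit.

```shell
#!/usr/bin/env bash
# cgroup_pids_status CURRENT MAX
# Classify a cgroup's PID usage against the warning signs in the table.
cgroup_pids_status() {
  local cur="$1" max="$2"
  if [ "$max" = "max" ]; then
    echo "unlimited"      # no PID limit is enforced on this cgroup
  elif [ $(( cur * 10 )) -ge $(( max * 9 )) ]; then
    echo "near-limit"     # usage is at or above 90% of pids.max
  else
    echo "ok"
  fi
}
```

Inside a container on a cgroup v1 host, `cgroup_pids_status "$(cat /sys/fs/cgroup/pids/pids.current)" "$(cat /sys/fs/cgroup/pids/pids.max)"` would apply the check to live values.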
Fixes
If the cause is zombie or leaking processes inside pods
Fix the application to reap child processes properly. If the code cannot be changed immediately, setting a stricter PodPidsLimit contains the blast radius to that pod rather than the node. To relieve pressure immediately, delete the offending pod. This is disruptive to the workload but safe for the node.
If the cause is container runtime shim accumulation
Orphaned shims often require a container runtime restart to clear. Cordon the node and drain its workloads before restarting containerd (systemctl restart containerd). Verify shim counts drop before uncordoning. If the leak is due to a runtime bug, upgrade to a patched version.
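The order of operations can be written down as a reviewable plan before anything is executed. This sketch only prints commands and does not run them. The function name `runtime_restart_plan` is hypothetical, and the drain flags shown are common choices, not a requirement of this procedure.

```shell
#!/usr/bin/env bash
# runtime_restart_plan NODE
# Print the cordon, drain, restart, verify, and uncordon sequence.
# kubectl lines run from an admin workstation; the systemctl and pgrep
# lines run on the node itself.
runtime_restart_plan() {
  local node="$1"
  cat <<EOF
kubectl cordon ${node}
kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data
systemctl restart containerd
pgrep -c containerd-shim
kubectl uncordon ${node}
EOF
}
```

Printing the plan first makes it easy to confirm the node name and flags, and to check the shim count before the uncordon step.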
If the cause is low kernel.pid_max
Raise the limit immediately:
# Apply now
sysctl -w kernel.pid_max=131072
# Persist
echo "kernel.pid_max = 131072" > /etc/sysctl.d/99-pid.conf
sysctl --system
Nodes running more than 50 pods or hosting Java, Node.js, or heavily threaded workloads should use at least 131072. The change takes effect immediately without a reboot: running processes keep their existing PIDs, and new processes and threads can be allocated from the larger range.
If the cause is kubelet/cgroup enforcement gap
If PodPidsLimit is set in kubelet configuration but containers show pids.max = max inside their cgroup, the runtime is not propagating the limit. This is common with dockershim or cri-dockerd. Migrate to containerd or CRI-O, then verify enforcement by running cat /sys/fs/cgroup/pids/pids.max inside a new pod.
If the cause is a fork bomb or runaway subprocess
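Identify the processes with the most threads using ps aux --sort=-nlwp | head -20 on the node, then map the top entry to its pod. Delete the pod immediately. To prevent recurrence, lower PodPidsLimit on the nodes that run that workload (it is a kubelet-wide setting that applies to every pod on the node, not something scoped to a priority class or namespace), and review application logic for unbounded fork, exec, or thread creation.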
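When many processes share one command name, totalling threads per command can make the culprit easier to spot. This is a rough sketch: `threads_by_command` is a hypothetical name, and it keys on the first word of the command name only.

```shell
#!/usr/bin/env bash
# threads_by_command [COUNT]
# Reads "NLWP PID COMM" lines on stdin and prints total threads per
# command name, largest first. Only the first word of COMM is used.
threads_by_command() {
  awk '{ total[$3] += $1 } END { for (c in total) print total[c], c }' |
    sort -rn | head -n "${1:-10}"
}
```

On a node, `ps -eo nlwp=,pid=,comm= | threads_by_command 5` lists the five command names that hold the most threads.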
Prevention
- Explicitly configure `evictionHard` (and optionally `evictionSoft`) for `pid.available` in kubelet configuration. A starting point is `pid.available<100` on smaller nodes, or a percentage on larger ones.
- Set `PodPidsLimit` in `KubeletConfiguration` to a value that matches your density. Verify it is reflected in container cgroups.
- Tune `kernel.pid_max` to at least 131072 during node provisioning if the node will host more than a few dozen pods.
- Monitor per-pod PID usage in staging and CI. Applications that leak PIDs should be caught before production deployment.
- Ensure container images and init systems reap zombie processes correctly. Do not run a process as PID 1 unless it reaps its children; otherwise front it with a minimal init such as `tini`.
- Use containerd or CRI-O rather than dockershim to ensure `PodPidsLimit` is enforced at the cgroup level.
- Include PID headroom in capacity planning alongside CPU and memory.
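A kubelet configuration fragment covering the first two items might look like the output of the sketch below. The helper name `kubelet_pid_settings` and the values 4096 and 10% are illustrative assumptions, not recommendations. The other eviction signals are listed with their documented Linux defaults because, on many kubelet versions, supplying `evictionHard` replaces the default map instead of merging with it.

```shell
#!/usr/bin/env bash
# kubelet_pid_settings [POD_PIDS_LIMIT] [PID_AVAILABLE_THRESHOLD]
# Print a KubeletConfiguration fragment with a per-pod PID limit and an
# explicit pid.available hard eviction threshold. Values are placeholders.
kubelet_pid_settings() {
  local pod_limit="${1:-4096}" hard="${2:-10%}"
  cat <<EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: ${pod_limit}
evictionHard:
  # Restated defaults, so they are not lost when this map is set.
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
  # The PID threshold this guide recommends adding explicitly.
  pid.available: "${hard}"
EOF
}
```

Merge the printed fields into the existing kubelet config file instead of overwriting it, then restart kubelet and confirm the limit appears in a new pod's `pids.max`.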
How Netdata helps
- Netdata tracks per-cgroup `pids.current` and `pids.max`, surfacing which pods are approaching their PID limit.
- The process state chart highlights zombie counts per node, making it obvious when an application is failing to reap children.
- Node-level process count metrics, correlated with the Kubernetes `PIDPressure` condition, let you distinguish a true kernel shortage from a kubelet configuration gap.
- Alerts on per-container PID saturation and node-wide `pid_max` utilization fire before kubelet eviction, providing lead time to cordon or drain the node.
Related guides
- See Kubernetes eviction cascade: when one node failure takes down the cluster for how PID pressure can trigger broader cluster instability.
- See Kubernetes kubelet not responding: PLEG, runtime, and certificate issues when PID exhaustion causes kubelet health checks to fail.
- See Kubernetes kubelet memory leak: detection and OOM cycle for another kubelet-related resource exhaustion pattern.
- See Kubernetes conntrack exhaustion: dropped connections under load to differentiate PID pressure from connection tracking limits.






