Kubernetes eviction cascade: when one node failure takes down the cluster
You see pods entering Evicted status across multiple nodes. Nodes flap between Ready and MemoryPressure or DiskPressure. The scheduler keeps placing replacements, but the new pods are evicted again before they become ready. Workloads never stabilize, and every remediation attempt seems to make the cluster more volatile.
This is a node-pressure eviction cascade. It happens when the scheduler’s view of capacity diverges from the kubelet’s view. One node under pressure evicts pods; those pods land on other nodes that are also overcommitted; those nodes tip into pressure and evict more pods. The result is a cluster-wide feedback loop that looks like a resource shortage but is often a scheduling and configuration problem.
What this means
Kubernetes node-pressure eviction is driven by the kubelet, not the scheduler. The kubelet monitors eviction signals including memory.available, nodefs.available, and imagefs.available. When a hard threshold is crossed, the kubelet immediately selects pods to kill: BestEffort first, then Burstable pods whose usage exceeds requests, and finally Guaranteed pods and Burstable pods within requests, ranked by Priority. Node-pressure eviction ignores PodDisruptionBudgets, and for hard thresholds it ignores terminationGracePeriodSeconds.
The scheduler places pods based on requests; the kubelet evicts based on actual working set. If workloads use more memory or disk than requested, the scheduler stacks nodes to what looks like a safe level while the nodes run hot. A spike, node failure, or rolling update pushes one node over its threshold. Controllers recreate evicted pods and the scheduler places them elsewhere, often onto nodes that were also near capacity. Those nodes hit thresholds, evict pods, and the cascade continues.
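To make the divergence concrete, the following sketch prints the scheduler's view and the kubelet's view of a single node side by side. The function name request_gap and the sample figures are hypothetical; real inputs would come from the node's allocated requests and its measured working set.

```shell
# request_gap REQUESTS_MIB WORKING_SET_MIB ALLOCATABLE_MIB
# Prints what the scheduler counts (requests) next to what the kubelet
# measures (working set) for the same node. All inputs are MiB values you
# collect yourself; the figures in the example call are made up.
request_gap() {
  awk -v req="$1" -v use="$2" -v alloc="$3" 'BEGIN {
    printf "scheduler sees %.0f%% requested, kubelet sees %.0f%% used, gap %d MiB\n",
      100 * req / alloc, 100 * use / alloc, use - req
  }'
}

request_gap 9000 14500 15000
# prints: scheduler sees 60% requested, kubelet sees 97% used, gap 5500 MiB
```

A node like this still looks schedulable at 60% requested, yet it is only a few hundred MiB of working set away from its eviction threshold.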
flowchart TD
A[Node A crosses memory pressure threshold] --> B[Kubelet evicts BestEffort and oversubscribed Burstable pods]
B --> C[Controllers recreate pods and the scheduler places them on other nodes]
C --> D[Nodes B and C were already near actual capacity]
D --> E[Nodes B and C cross thresholds and begin evicting]
E --> F[More rescheduling creates more pressure]
F --> G[Cluster-wide eviction thrashing]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Scheduler overplacement (requests far below usage) | Burstable pods evicted despite apparently healthy cluster request headroom | kubectl describe node for allocated requests versus actual utilization |
| Missing eviction hysteresis | Node flaps between pressure and recovery, evicting and re-admitting the same pods | Kubelet configuration for evictionMinimumReclaim |
| cgroup v2 memory.available shrink | Elevated evictions after moving to cgroup v2 | stat -fc %T /sys/fs/cgroup |
| Unbounded emptyDir usage | DiskPressure triggers before memory pressure, often affecting build or cache workloads | Pod specs for emptyDir volumes without sizeLimit |
| Sudden node loss or hotspot | One node goes down or a DaemonSet updates, shifting load that pushes remaining nodes over the edge | Node failure events followed by a spike in pending pods |
Quick checks
# Check node pressure conditions
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: [.status.conditions[] | select(.type | test("Pressure|Ready"))]}'
# Compare allocated requests to actual node resources
kubectl describe nodes | grep -A 6 "Allocated resources"
# List recent evictions
kubectl get events --all-namespaces --field-selector reason=Evicted --sort-by='.lastTimestamp'
# Determine cgroup version on the node (cgroup2fs means v2, tmpfs means v1)
stat -fc %T /sys/fs/cgroup
# Check pod QoS distribution on a pressured node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> \
-o json | jq -r '.items[].status.qosClass' | sort | uniq -c
# Find emptyDir volumes without sizeLimit
kubectl get pods --all-namespaces -o json | jq -r \
'.items[] | select(any(.spec.volumes[]?; .emptyDir != null and .emptyDir.sizeLimit == null)) | "\(.metadata.namespace)/\(.metadata.name)"'
# Review kubelet eviction configuration (run on the node; the path may differ by distribution)
grep -A 12 eviction /var/lib/kubelet/config.yaml
How to diagnose it
- Identify the first node that flipped. The node with the most evictions is not necessarily the root cause. Use kubectl get events --field-selector reason=Evicted sorted by time to find the earliest pressure transition.
- Confirm the eviction signal. Check the node’s conditions for MemoryPressure, DiskPressure, or PIDPressure. Note that PIDPressure=True prevents scheduling but does not itself trigger eviction; however, it often co-occurs with other pressures.
- Check for scheduler overplacement. On the first affected node, compare the sum of pod memory requests to actual memory working set. If working set is significantly higher than requests, the scheduler placed too many pods on the node. This is the most common root cause of cascades.
- Trace pod landing zones. Find the names of pods evicted from the first node, then check which nodes their replacements were scheduled to. If those destination nodes then reported pressure conditions within minutes, you have confirmed a cascade.
- Look for cgroup version changes. If the cluster runs cgroup v2, the memory.available calculation no longer subtracts slab reclaimable memory. Identical workloads now report lower available memory, which can push previously safe nodes over the default 100Mi threshold.
- Review kubelet hysteresis settings. If evictionMinimumReclaim is at its default of 0 for the active signal, the kubelet stops evicting as soon as the signal clears the threshold. This allows the scheduler to place new pods back onto the node immediately, producing a ping-pong oscillation.
- Inspect emptyDir and ephemeral storage. Check whether evicted pods used large emptyDir volumes. Without a sizeLimit, emptyDir writes count against nodefs and can trigger disk pressure before memory is constrained.
- Check for node-name reuse. If a worker was terminated and replaced with a node of the same name before the pod-eviction-timeout (default 5 minutes) elapsed, the node lifecycle controller may have behaved incorrectly, delaying expected evictions and causing a sudden backlog of rescheduling.
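The first step above can be sketched as a small pipeline that reduces the eviction events to the earliest one seen on each node. The helper names are hypothetical. The sketch also assumes that kubelet-generated events carry the node name in .source.host and a populated .lastTimestamp, which is worth verifying on your cluster version.

```shell
# Override KUBECTL (for example KUBECTL=echo) to preview the command only.
KUBECTL="${KUBECTL:-kubectl}"

# List eviction events as "timestamp node namespace pod", one per line.
# Assumption: the kubelet records the node name in .source.host.
list_evictions() {
  "$KUBECTL" get events --all-namespaces \
    --field-selector reason=Evicted \
    --no-headers \
    -o custom-columns='TIME:.lastTimestamp,NODE:.source.host,NS:.involvedObject.namespace,POD:.involvedObject.name'
}

# Keep only the earliest eviction per node, oldest node first.
# RFC 3339 timestamps in UTC sort correctly as plain strings.
first_eviction_per_node() {
  sort -k1,1 | awk '!seen[$2]++ { print $1, $2, $3 "/" $4 }'
}

# Usage: list_evictions | first_eviction_per_node
```

The node at the top of the output is the best candidate for the first node that flipped. The pods listed for it are the ones to follow when tracing landing zones.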
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node condition MemoryPressure / DiskPressure | Direct triggers for kubelet eviction | Any transition to True |
| memory.available relative to threshold | Default hard threshold is 100Mi on Linux; cgroup v2 lowers the reported value | memory.available trending toward the threshold |
| Node allocatable vs requests vs actual usage | Reveals scheduler overplacement that drives cascades | Actual usage exceeds allocatable, or requests are far below peak usage |
| kubelet evictions (events or metrics) | Count of pods evicted by signal | Sustained evictions across multiple nodes after a single node failure |
| Scheduler pending pods | Backlog waiting to be placed | Sudden spike after node failure or rolling update |
| Node controller eviction rate | Zone-aware eviction pacing | Elevated rate when more than 55% of nodes in a zone are unhealthy |
| nodefs / imagefs available and inodes | Disk pressure can fire before memory pressure | Available space trending toward 10% or inodes toward 5% |
| emptyDir consumption | Unbounded scratch space fills nodefs | Pods with large emptyDir volumes and no sizeLimit |
| PID utilization per node | PID exhaustion compounds other pressures | PIDs approaching the node limit |
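
For the memory.available row, a rough way to reason about headroom is the relationship the kubelet uses: memory.available is node capacity minus working set, and working set is usage minus inactive file pages. The sketch below applies only that arithmetic. The function name and sample numbers are hypothetical, and how the inputs are gathered differs between cgroup v1 and v2 and is not shown.

```shell
# memory_headroom_mib CAPACITY_MIB USAGE_MIB INACTIVE_FILE_MIB THRESHOLD_MIB
#   working set      = usage - inactive_file
#   memory.available = capacity - working set
# Prints memory.available and its distance from the eviction threshold.
memory_headroom_mib() {
  awk -v cap="$1" -v use="$2" -v inactive="$3" -v thr="$4" 'BEGIN {
    avail = cap - (use - inactive)
    printf "memory.available=%dMi headroom=%dMi\n", avail, avail - thr
  }'
}

memory_headroom_mib 16000 15400 300 100
# prints: memory.available=900Mi headroom=800Mi
```

Alerting on the headroom figure rather than on raw free memory keeps the alert aligned with the value the kubelet actually compares against its threshold.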
Fixes
If the cause is resource pressure from overplacement
Do not simply tighten eviction thresholds. That treats the symptom and can make the cluster more aggressive without fixing the underlying gap. Instead:
- Cordon nodes that are cycling between pressure and recovery to stop the rescheduling loop.
- Audit Burstable workloads and raise memory requests to reflect actual peak working set. The scheduler cannot protect nodes from what it cannot see.
- Drain cordoned nodes safely after workloads are rescheduled elsewhere.
- Add node capacity before uncordoning.
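The cordon, drain, and uncordon steps map onto standard kubectl subcommands. The wrapper functions below are a hypothetical convenience, and the 10 minute drain timeout is an arbitrary placeholder. Setting KUBECTL=echo prints each command instead of running it, which is a cheap way to review the sequence first.

```shell
# Override KUBECTL (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Stop new pods from landing on a node that keeps flipping in and out of
# pressure. Running pods are left alone.
cordon_node() {
  "$KUBECTL" cordon "$1"
}

# Move remaining workloads off a cordoned node. Unlike node-pressure
# eviction, drain respects PodDisruptionBudgets.
drain_node() {
  "$KUBECTL" drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=10m
}

# Return the node to service, ideally only after capacity has been added.
uncordon_node() {
  "$KUBECTL" uncordon "$1"
}

# Usage: cordon_node <node>; drain_node <node>; uncordon_node <node>
```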
If the cause is missing hysteresis
Set evictionMinimumReclaim in the kubelet configuration for the signals you are hitting. For example, forcing memory.available to reclaim at least 200Mi above the threshold prevents pods from being scheduled straight back onto a node that just evicted them.
If you customize eviction thresholds in KubeletConfiguration, note that specifying evictionHard or evictionSoft overrides the defaults entirely. Define all thresholds you need explicitly; otherwise omitted defaults fall back to zero and disable protection.
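As an illustration, the snippet below writes a KubeletConfiguration fragment to a scratch file for review before it is merged into the node's real configuration. The hard thresholds restate the Linux defaults (the 15% imagefs value is the upstream default and is not taken from this guide), and the evictionMinimumReclaim values are placeholders rather than recommendations. The file name is hypothetical.

```shell
# Write an example fragment to a scratch file. Every hard threshold is
# listed explicitly because setting evictionHard replaces the defaults.
cat > kubelet-eviction-example.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
# Placeholder values: reclaim this much beyond the threshold before the
# kubelet stops evicting.
evictionMinimumReclaim:
  memory.available: "200Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"
EOF
```

With these values the kubelet would keep evicting until memory.available reaches roughly 300Mi instead of stopping the moment it climbs back above 100Mi.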
If the cause is cgroup v2 accounting drift
On clusters using cgroup v2, expect lower reported memory.available for the same workloads. Re-baseline your nodes with the new accounting. Reduce per-node workload density or add memory to restore safe headroom.
If the cause is emptyDir or ephemeral storage
Enforce sizeLimit on all emptyDir volumes in pod specs. Ensure container logs are rotated so that nodefs does not fill from log output.
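A pod spec along the following lines bounds scratch usage. The pod name, image, and sizes are hypothetical and only show where sizeLimit and the ephemeral-storage request and limit go.

```shell
# Write an example manifest to a scratch file for review.
cat > scratch-pod-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: build-cache-example
spec:
  containers:
    - name: builder
      image: busybox:1.36
      command: ["sh", "-c", "sleep 3600"]
      resources:
        requests:
          ephemeral-storage: "1Gi"   # counted by the scheduler
        limits:
          ephemeral-storage: "4Gi"   # cap on container-level ephemeral usage
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "2Gi"             # cap on this volume alone
EOF
```

A pod that exceeds either cap is evicted on its own, which is usually preferable to the whole node reaching DiskPressure.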
If the cause is node failure or zone disruption
The node-lifecycle controller modulates eviction rates based on zone health. Rate limiting helps during partial disruptions. If an entire zone fails, evicted pods must reschedule elsewhere; ensure other zones have headroom.
Ensure the cluster autoscaler can scale out quickly, and review PodDisruptionBudget settings. Conservative PDB settings (maxUnavailable: 0) can block node drains and cluster autoscaler scale-down during recovery.
Prevention
- Treat resource requests as a scheduling contract. If a workload consistently exceeds its memory request, the contract is broken and the scheduler will overplace nodes.
- Configure evictionMinimumReclaim for all hard thresholds to create hysteresis and prevent immediate re-scheduling onto recovering nodes.
- Monitor the gap between node allocatable and actual utilization, not just requested capacity.
- Set sizeLimit on emptyDir volumes and enforce ephemeral storage limits.
- Maintain cluster headroom so that losing one node or one availability zone does not push the remaining fleet above 80% actual utilization.
- Validate kubelet behavior on cgroup v2 nodes after any Kubernetes upgrade before promoting changes to production clusters.
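The headroom rule above is simple arithmetic that can be checked ahead of time. The sketch below assumes equally sized nodes and that displaced workloads spread evenly over the survivors, which real clusters only approximate. The function name and figures are hypothetical.

```shell
# utilization_after_loss NODES PER_NODE_ALLOCATABLE_MIB TOTAL_WORKING_SET_MIB LOST_NODES
# Prints fleet-wide utilization in percent if LOST_NODES nodes disappear
# and their workloads are packed onto the remaining nodes.
utilization_after_loss() {
  awk -v n="$1" -v cap="$2" -v use="$3" -v lost="$4" 'BEGIN {
    remaining = (n - lost) * cap
    if (remaining <= 0) { print "no capacity left"; exit 1 }
    printf "%.1f\n", 100 * use / remaining
  }'
}

utilization_after_loss 10 16000 112000 1   # prints: 77.8
utilization_after_loss 10 16000 112000 2   # prints: 87.5
```

In this example the fleet survives the loss of one node under the 80% line but not the loss of two, which is the kind of margin to know before an incident rather than during one.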
How Netdata helps
Netdata surfaces node-level signals that precede API-visible Kubernetes conditions:
- Per-node charts for memory.available, disk usage, and inode utilization identify the first node to tip before the kubelet begins evicting.
- Correlating kubelet events with system memory and disk metrics on the same timeline reveals whether a spike was caused by workload growth, emptyDir bloat, or log accumulation.
- Per-cgroup memory charts let you validate whether a cgroup v2 configuration has shifted memory accounting headroom for your actual workloads.
- Alerts on memory.available and disk pressure can fire before Kubernetes conditions propagate to the control plane, giving time to cordon a node before a cascade begins.
Related guides
- Kubernetes node DiskPressure: detection, eviction, and recovery
- Kubernetes node NotReady: kubelet, runtime, and network diagnosis
- Kubernetes pod stuck ContainerCreating: volume, network, and image issues
- Kubernetes pod CrashLoopBackOff: causes, diagnosis, and fixes
- Kubernetes monitoring checklist: the signals every production cluster needs
- Kubernetes kubelet not responding: PLEG, runtime, and certificate issues






