Kubernetes eviction cascade: when one node failure takes down the cluster

You see pods entering Evicted status across multiple nodes. Nodes flap between Ready and MemoryPressure or DiskPressure. The scheduler keeps placing replacements, but the new pods are evicted again before they become ready. Workloads never stabilize, and every remediation attempt seems to make the cluster more volatile.

This is a node-pressure eviction cascade. It happens when the scheduler’s view of capacity diverges from the kubelet’s view. One node under pressure evicts pods; those pods land on other nodes that are also overcommitted; those nodes tip into pressure and evict more pods. The result is a cluster-wide feedback loop that looks like a resource shortage but is often a scheduling and configuration problem.

What this means

Kubernetes node-pressure eviction is driven by the kubelet, not the scheduler. The kubelet monitors eviction signals including memory.available, nodefs.available, and imagefs.available. When a hard threshold is crossed, the kubelet immediately selects pods to kill. Pods whose usage exceeds their requests go first: this covers every BestEffort pod and any Burstable pod over its request, ranked by Priority and then by how far usage exceeds the request. Guaranteed pods and Burstable pods within their requests are evicted last, ranked by Priority. Node-pressure eviction ignores PodDisruptionBudgets, and for hard thresholds it also ignores terminationGracePeriodSeconds, so pods are killed with no grace period.
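
The ranking described above can be sketched in a few lines of Python. This is a simplified model of memory-pressure ranking only, not the kubelet's implementation; the Pod dataclass, its field names, and the sample pods are illustrative assumptions.

```python
from dataclasses import dataclass

MI = 1024 * 1024


@dataclass
class Pod:
    name: str
    priority: int        # pod priority value (higher means more important)
    memory_request: int  # bytes; 0 for BestEffort pods
    memory_usage: int    # bytes; current working set


def eviction_order(pods):
    """Return pods in the order this simplified ranking would evict them."""
    def key(pod):
        overage = pod.memory_usage - pod.memory_request
        exceeds = overage > 0
        # Pods over their request go first, then lower priority,
        # then the largest overage above the request.
        return (0 if exceeds else 1, pod.priority, -overage)
    return sorted(pods, key=key)


pods = [
    Pod("guaranteed-db", 1000, 512 * MI, 400 * MI),
    Pod("burstable-api", 1000, 256 * MI, 600 * MI),
    Pod("besteffort-job", 0, 0, 100 * MI),
    Pod("burstable-cache", 1000, 256 * MI, 300 * MI),
]
print([p.name for p in eviction_order(pods)])
```

In this sample the BestEffort job goes first because it has the lowest priority among pods over their request, and the pod running within its request goes last.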

The scheduler places pods based on requests; the kubelet evicts based on actual working set. If workloads use more memory or disk than requested, the scheduler stacks nodes to what looks like a safe level while the nodes run hot. A spike, node failure, or rolling update pushes one node over its threshold. Controllers recreate evicted pods and the scheduler places them elsewhere, often onto nodes that were also near capacity. Those nodes hit thresholds, evict pods, and the cascade continues.

flowchart TD
    A[Node A crosses memory pressure threshold] --> B[Kubelet evicts BestEffort and oversubscribed Burstable pods]
    B --> C[Controllers recreate pods and the scheduler places them on other nodes]
    C --> D[Nodes B and C were already near actual capacity]
    D --> E[Nodes B and C cross thresholds and begin evicting]
    E --> F[More rescheduling creates more pressure]
    F --> G[Cluster-wide eviction thrashing]
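
The loop in the diagram can be reproduced with a toy simulation. This is a sketch under stated assumptions, not a model of the real scheduler or kubelet: every node has the same capacity, the "scheduler" picks the node with the lowest summed requests, a node that reported pressure is skipped for the current and the following round, and the function name and data layout are invented for illustration.

```python
def simulate_cascade(nodes, capacity, threshold, max_rounds=6):
    """Return the name of the node that evicted a pod at each step.

    nodes maps a node name to a list of (request, usage) tuples, one per pod.
    The input dict is modified in place.
    """
    evictions = []
    recent = set()  # nodes that reported pressure in the previous round
    for _round in range(max_rounds):
        pending, pressured = [], set()
        # Kubelet view: evict while actual free memory is below the threshold.
        for name in sorted(nodes):
            pods = nodes[name]
            while pods and capacity - sum(u for _, u in pods) < threshold:
                victim = max(pods, key=lambda p: p[1] - p[0])  # largest overage
                pods.remove(victim)
                pending.append(victim)
                pressured.add(name)
                evictions.append(name)
        if not pending:
            break  # nothing was evicted, so the cluster is stable
        # Scheduler view: only requests matter, never actual usage.
        for pod in pending:
            candidates = [
                n for n in sorted(nodes)
                if n not in pressured | recent
                and sum(r for r, _ in nodes[n]) + pod[0] <= capacity
            ]
            if candidates:
                target = min(candidates, key=lambda n: sum(r for r, _ in nodes[n]))
                nodes[target].append(pod)
            # With no candidate the pod would stay Pending; it is dropped here.
        recent = pressured
    return evictions


cluster = {
    "A": [(100, 320), (100, 320), (100, 320)],
    "B": [(100, 300), (100, 300)],
    "C": [(100, 300), (100, 300)],
}
print(simulate_cascade(cluster, capacity=1000, threshold=100))
```

With these numbers only 700 of 3000 units are requested, so the scheduler sees ample room, yet no node can actually hold the displaced pod, and it bounces from A to B to C and back again.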

Common causes

Cause	What it looks like	First thing to check
Scheduler overplacement (requests far below usage)	Burstable pods evicted despite apparently healthy cluster request headroom	`kubectl describe node` for allocated requests, compared with `kubectl top node` for actual utilization
Missing eviction hysteresis	Node flaps between pressure and recovery, evicting and re-admitting the same pods	Kubelet configuration for `evictionMinimumReclaim`
Lower reported memory.available on cgroup v2	Elevated evictions after moving to cgroup v2	`stat -fc %T /sys/fs/cgroup` on the node (`cgroup2fs` means v2)
Unbounded emptyDir usage	DiskPressure triggers before memory pressure, often affecting build or cache workloads	Pod specs for `emptyDir` volumes without `sizeLimit`
Sudden node loss or hotspot	One node goes down or a DaemonSet updates, shifting load that pushes remaining nodes over the edge	Node failure events followed by a spike in pending pods

Quick checks

# Check node pressure conditions
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: [.status.conditions[] | select(.type | test("Pressure|Ready"))]}'

# Compare allocated requests with actual usage (kubectl top requires metrics-server)
kubectl describe nodes | grep -A 6 "Allocated resources"
kubectl top nodes

# List recent evictions across all namespaces
kubectl get events --all-namespaces --field-selector reason=Evicted --sort-by='.lastTimestamp'

# Determine cgroup version (run on the node; cgroup2fs means v2, tmpfs means v1)
stat -fc %T /sys/fs/cgroup

# Check pod QoS distribution on a pressured node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> \
  -o json | jq -r '.items[].status.qosClass' | sort | uniq -c

# Find pods with at least one emptyDir volume that has no sizeLimit
kubectl get pods --all-namespaces -o json | jq -r \
  '.items[] | select(any(.spec.volumes[]?; .emptyDir != null and .emptyDir.sizeLimit == null)) | "\(.metadata.namespace)/\(.metadata.name)"'

# Review kubelet eviction configuration (run on the node; the config path may differ per distribution)
grep -i -A 12 eviction /var/lib/kubelet/config.yaml

How to diagnose it

Identify the first node that flipped. The node with the most evictions is not necessarily the root cause. Sort the output of kubectl get events --field-selector reason=Evicted by time to find the earliest eviction, then compare it with the lastTransitionTime of the pressure conditions on each node.
Confirm the eviction signal. Check the node’s conditions for MemoryPressure, DiskPressure, or PIDPressure. Note that PIDPressure=True prevents scheduling but does not itself trigger eviction; however, it often co-occurs with other pressures.
Check for scheduler overplacement. On the first affected node, compare the sum of pod memory requests to actual memory working set. If working set is significantly higher than requests, the scheduler placed too many pods on the node. This is the most common root cause of cascades.
Trace pod landing zones. Find the names of pods evicted from the first node, then check which nodes their replacements were scheduled to. If those destination nodes then reported pressure conditions within minutes, you have confirmed a cascade.
Look for cgroup version changes. If the cluster runs cgroup v2, the memory.available calculation no longer subtracts slab reclaimable memory. Identical workloads now report lower available memory, which can push previously safe nodes over the default 100Mi threshold.
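Review kubelet hysteresis settings. If evictionMinimumReclaim is at its default of 0 for the active signal, the kubelet stops evicting as soon as the signal clears the threshold, which leaves the node with almost no headroom. Once the pressure condition clears (the kubelet holds it for evictionPressureTransitionPeriod, 5 minutes by default), the scheduler places new pods back onto the node and it tips over again, producing a ping-pong oscillation.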
Inspect emptyDir and ephemeral storage. Check whether evicted pods used large emptyDir volumes. Disk-backed emptyDir volumes consume nodefs, and without a sizeLimit their growth is unbounded, so they can trigger disk pressure before memory is constrained.
Check for node-name reuse. If a worker was terminated and replaced with a node of the same name before the 5-minute eviction delay elapsed (historically pod-eviction-timeout; on current versions, the default 300-second toleration for the not-ready and unreachable taints), the node lifecycle controller may have behaved incorrectly, delaying expected evictions and causing a sudden backlog of rescheduling.
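
The first step, finding which node evicted first, can be sketched as a small Python helper over the JSON form of the eviction events. It assumes each kubelet event carries the node name in source.host and a firstTimestamp or lastTimestamp field; verify that against your own event output. The function name and the sample data are illustrative.

```python
import json


def first_eviction_per_node(events_json):
    """Return (node, timestamp) pairs ordered by each node's earliest Evicted event."""
    earliest = {}
    for item in json.loads(events_json).get("items", []):
        if item.get("reason") != "Evicted":
            continue
        node = (item.get("source") or {}).get("host")
        ts = item.get("firstTimestamp") or item.get("lastTimestamp")
        if not node or not ts:
            continue
        # UTC timestamps in the same RFC 3339 format sort correctly as strings.
        if node not in earliest or ts < earliest[node]:
            earliest[node] = ts
    return sorted(earliest.items(), key=lambda kv: kv[1])


SAMPLE = json.dumps({"items": [
    {"reason": "Evicted", "source": {"host": "node-b"}, "firstTimestamp": "2024-05-01T10:07:30Z"},
    {"reason": "Evicted", "source": {"host": "node-a"}, "firstTimestamp": "2024-05-01T10:02:10Z"},
    {"reason": "Evicted", "source": {"host": "node-a"}, "firstTimestamp": "2024-05-01T10:03:00Z"},
    {"reason": "Scheduled", "source": {"host": "node-c"}, "firstTimestamp": "2024-05-01T10:00:00Z"},
]})
print(first_eviction_per_node(SAMPLE))
```

In practice the input would be the output of kubectl get events --all-namespaces --field-selector reason=Evicted -o json. A gap of only a few minutes between the first and second node in the result is consistent with a cascade rather than independent failures.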

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Node condition `MemoryPressure` / `DiskPressure`	Direct triggers for kubelet eviction	Any transition to True
`memory.available` relative to threshold	Default hard threshold is 100Mi on Linux; cgroup v2 lowers the reported value	`memory.available` trending toward the threshold
Node allocatable vs requests vs actual usage	Reveals scheduler overplacement that drives cascades	Actual usage exceeds allocatable, or requests are far below peak usage
kubelet evictions (events or metrics)	Count of pods evicted by signal	Sustained evictions across multiple nodes after a single node failure
Scheduler pending pods	Backlog waiting to be placed	Sudden spike after node failure or rolling update
Node controller eviction rate	Zone-aware eviction pacing	Evictions slowed or paused because 55% or more of the nodes in a zone are unhealthy
`nodefs` / `imagefs` available and inodes	Disk pressure can fire before memory pressure	nodefs available trending toward 10%, imagefs toward 15%, or inodes toward 5%
emptyDir consumption	Unbounded scratch space fills nodefs	Pods with large emptyDir volumes and no `sizeLimit`
PID utilization per node	PID exhaustion compounds other pressures	PIDs approaching the node limit
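
As a rough illustration of the "trending toward the threshold" warning signs, the sketch below compares observed signal values against the default Linux hard thresholds. The margin multiplier, the function name, and the observed values are assumptions for the example; a real alert would also look at the rate of change.

```python
MI = 1024 ** 2

# Default hard eviction thresholds on Linux.
DEFAULT_HARD_THRESHOLDS = {
    "memory.available": 100 * MI,  # bytes
    "nodefs.available": 10.0,      # percent of filesystem capacity
    "imagefs.available": 15.0,     # percent of filesystem capacity
    "nodefs.inodesFree": 5.0,      # percent of inodes
}


def signals_at_risk(observed, margin=2.0, thresholds=DEFAULT_HARD_THRESHOLDS):
    """Return signals whose observed value is below margin times the threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if name in observed and observed[name] < limit * margin
    )


observed = {
    "memory.available": 150 * MI,
    "nodefs.available": 35.0,
    "imagefs.available": 20.0,
    "nodefs.inodesFree": 60.0,
}
print(signals_at_risk(observed))
```

Here memory.available and imagefs.available are flagged because both sit within twice their threshold, while nodefs space and inodes are comfortably clear.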

Fixes

If the cause is resource pressure from overplacement

Do not simply tighten eviction thresholds. That treats the symptom and can make the cluster more aggressive without fixing the underlying gap. Instead:

Cordon nodes that are cycling between pressure and recovery to stop the rescheduling loop.
Audit Burstable workloads and raise memory requests to reflect actual peak working set. The scheduler cannot protect nodes from what it cannot see.
Drain cordoned nodes one at a time, and only once other nodes have real headroom to absorb their pods.
Add node capacity before uncordoning.

If the cause is missing hysteresis

Set evictionMinimumReclaim in the kubelet configuration for the signals you are hitting. For example, forcing memory.available to reclaim at least 200Mi above the threshold prevents pods from being scheduled straight back onto a node that just evicted them.
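
The effect of a minimum reclaim can be shown with a small model: once a threshold is crossed, eviction continues until the signal reaches the threshold plus the minimum reclaim. This is a sketch of that rule only; the function name, the fixed eviction order, and the pod sizes are assumptions for illustration.

```python
MI = 1024 ** 2


def pods_evicted(available, pod_usages, threshold, min_reclaim=0):
    """Count pods evicted, in the given order, before reclaim is satisfied."""
    if available >= threshold:
        return 0  # threshold not crossed, nothing to reclaim
    target = threshold + min_reclaim
    evicted = 0
    for usage in pod_usages:
        if available >= target:
            break
        available += usage  # evicting a pod frees its working set
        evicted += 1
    return evicted


usages = [50 * MI, 120 * MI, 200 * MI]
print(pods_evicted(80 * MI, usages, threshold=100 * MI))
print(pods_evicted(80 * MI, usages, threshold=100 * MI, min_reclaim=200 * MI))
```

Without a minimum reclaim the node stops after one eviction at 130Mi available, barely above the threshold. With 200Mi of minimum reclaim it keeps going until it passes 300Mi, which costs more pods up front but leaves room for the next spike.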

If you customize eviction thresholds in KubeletConfiguration, note that specifying evictionHard replaces the default hard thresholds entirely rather than merging with them. Define every threshold you need explicitly; otherwise the omitted defaults are set to zero, which disables protection for those signals.

If the cause is cgroup v2 accounting drift

On clusters using cgroup v2, expect lower reported memory.available for the same workloads. Re-baseline your nodes with the new accounting. Reduce per-node workload density or add memory to restore safe headroom.
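
When re-baselining, it helps to reproduce the arithmetic behind memory.available: node capacity minus the working set, where the working set is memory usage minus inactive file cache. The sketch below follows that formula; where the usage figure comes from on your nodes, the helper names, and the sample memory.stat content are assumptions for illustration.

```python
MI = 1024 ** 2
GI = 1024 ** 3


def parse_memory_stat(text):
    """Parse 'key value' lines from a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def memory_available(capacity, usage, inactive_file):
    """capacity - working set, where working set = usage - inactive_file."""
    working_set = max(usage - inactive_file, 0)
    return capacity - working_set


SAMPLE_STAT = """anon 6442450944
file 2147483648
inactive_file 1073741824
active_file 1073741824
"""

stats = parse_memory_stat(SAMPLE_STAT)
available = memory_available(8 * GI, 7 * GI + 512 * MI, stats["inactive_file"])
print(available // MI, "Mi available")
```

Running the same calculation on a node before and after a cgroup change shows how much headroom the accounting difference actually removes for your workloads.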

If the cause is emptyDir or ephemeral storage

Enforce sizeLimit on all emptyDir volumes in pod specs. Ensure container logs are rotated so that nodefs does not fill from log output.
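
One way to enforce this is to check pod specs automatically, for example in CI or a periodic audit. The sketch below scans a pod list in the JSON shape returned by kubectl get pods -o json and reports emptyDir volumes without a sizeLimit; the function name and sample pods are illustrative assumptions.

```python
import json


def unbounded_emptydirs(pod_list_json):
    """Return 'namespace/pod:volume' for each emptyDir volume lacking sizeLimit."""
    offenders = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        for vol in pod.get("spec", {}).get("volumes") or []:
            empty_dir = vol.get("emptyDir")
            # An emptyDir declared as {} is present but has no sizeLimit.
            if empty_dir is not None and not empty_dir.get("sizeLimit"):
                offenders.append(
                    f'{meta.get("namespace", "default")}/{meta.get("name")}:{vol.get("name")}'
                )
    return offenders


SAMPLE_PODS = json.dumps({"items": [
    {"metadata": {"namespace": "ci", "name": "builder"},
     "spec": {"volumes": [{"name": "scratch", "emptyDir": {}},
                          {"name": "cache", "emptyDir": {"sizeLimit": "2Gi"}}]}},
    {"metadata": {"namespace": "web", "name": "frontend"},
     "spec": {"volumes": [{"name": "config", "configMap": {"name": "web-config"}}]}},
]})
print(unbounded_emptydirs(SAMPLE_PODS))
```

Reporting the volume name as well as the pod makes it easier to fix the right entry in the owning Deployment or Job template.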

If the cause is node failure or zone disruption

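The node lifecycle controller paces evictions based on zone health. By default it evicts from failed nodes at 0.1 nodes per second; when at least 55% of the nodes in a zone are unhealthy, it drops to 0.01 nodes per second, or stops evicting entirely in clusters of 50 nodes or fewer. This rate limiting helps during partial disruptions. If an entire zone fails, evicted pods must reschedule elsewhere; ensure other zones have headroom.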
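
The pacing rule for a partially disrupted zone can be written as a small function. This sketch covers only the partial-disruption case and assumes the documented kube-controller-manager defaults (node-eviction-rate 0.1, secondary-node-eviction-rate 0.01, unhealthy-zone-threshold 0.55, large-cluster-size-threshold 50); the function and parameter names are illustrative.

```python
def zone_eviction_rate(unhealthy, total, cluster_size,
                       node_eviction_rate=0.1,
                       secondary_rate=0.01,
                       unhealthy_zone_threshold=0.55,
                       large_cluster_threshold=50):
    """Return the node eviction rate (nodes per second) for one zone."""
    if total == 0:
        return 0.0
    if unhealthy / total < unhealthy_zone_threshold:
        return node_eviction_rate  # zone is healthy enough: normal rate
    if cluster_size <= large_cluster_threshold:
        return 0.0                 # small cluster: evictions stop
    return secondary_rate          # large cluster: evictions slow down


print(zone_eviction_rate(unhealthy=2, total=10, cluster_size=100))
print(zone_eviction_rate(unhealthy=6, total=10, cluster_size=100))
print(zone_eviction_rate(unhealthy=6, total=10, cluster_size=30))
```

At the reduced rate a zone with many failed nodes takes far longer to drain, which buys time for recovery but also delays rescheduling, so headroom in the other zones still matters.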

Ensure the cluster autoscaler can scale out quickly enough to absorb rescheduled pods. Review PodDisruptionBudgets as well: they do not stop node-pressure eviction, but conservative settings (maxUnavailable: 0) block drains and cluster autoscaler scale-down during recovery.

Prevention

Treat resource requests as a scheduling contract. If a workload consistently exceeds its memory request, the contract is broken and the scheduler will overplace nodes.
Configure evictionMinimumReclaim for all hard thresholds to create hysteresis and prevent immediate re-scheduling onto recovering nodes.
Monitor the gap between node allocatable and actual utilization, not just requested capacity.
Set sizeLimit on emptyDir volumes and enforce ephemeral storage limits.
Maintain cluster headroom so that losing one node or one availability zone does not push the remaining fleet above 80% actual utilization.
Validate kubelet behavior on cgroup v2 nodes after any Kubernetes upgrade before promoting changes to production clusters.
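
The headroom rule can be checked with simple arithmetic: total actual usage divided by the capacity that remains after a loss. The sketch below assumes uniform units (for example GiB of memory) and that all displaced load must fit on the surviving nodes; the function names, the 80% default, and the sample fleet are illustrative.

```python
def utilization_after_loss(nodes, lost):
    """nodes maps name -> (capacity, usage); lost is a set of node names removed."""
    remaining = sum(cap for name, (cap, _) in nodes.items() if name not in lost)
    total_usage = sum(usage for _, usage in nodes.values())
    if remaining == 0:
        return float("inf")  # nothing left to host the load
    return total_usage / remaining


def survives_any_single_loss(nodes, limit=0.80):
    """True if losing any one node keeps fleet utilization at or below limit."""
    return all(utilization_after_loss(nodes, {name}) <= limit for name in nodes)


fleet = {name: (64, 40) for name in ("n1", "n2", "n3", "n4")}
print(round(utilization_after_loss(fleet, {"n1"}), 3))
print(survives_any_single_loss(fleet))
```

The same function handles a zone failure by passing every node in that zone as the lost set. In the sample fleet each node looks fine at 62.5% usage, but losing one pushes the remainder to about 83%, above the target.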

How Netdata helps

Netdata surfaces node-level signals that precede API-visible Kubernetes conditions:

Per-node charts for memory.available, disk usage, and inode utilization identify the first node to tip before the kubelet begins evicting.
Correlating kubelet events with system memory and disk metrics on the same timeline reveals whether a spike was caused by workload growth, emptyDir bloat, or log accumulation.
Per-cgroup memory charts let you validate whether a cgroup v2 configuration has shifted memory accounting headroom for your actual workloads.
Alerts on memory.available and disk pressure can fire before Kubernetes conditions propagate to the control plane, giving time to cordon a node before a cascade begins.
