Kubernetes eviction cascade: when one node failure takes down the cluster

You see pods entering Evicted status across multiple nodes. Nodes flap between Ready and MemoryPressure or DiskPressure. The scheduler keeps placing replacements, but the new pods are evicted again before they become ready. Workloads never stabilize, and every remediation attempt seems to make the cluster more volatile.

This is a node-pressure eviction cascade. It happens when the scheduler’s view of capacity diverges from the kubelet’s view. One node under pressure evicts pods; those pods land on other nodes that are also overcommitted; those nodes tip into pressure and evict more pods. The result is a cluster-wide feedback loop that looks like a resource shortage but is often a scheduling and configuration problem.

What this means

Kubernetes node-pressure eviction is driven by the kubelet, not the scheduler. The kubelet monitors eviction signals including memory.available, nodefs.available, and imagefs.available. When a hard threshold is crossed, the kubelet immediately selects pods to kill: BestEffort first, then Burstable pods whose usage exceeds requests, and finally Guaranteed pods and Burstable pods within requests, ranked by Priority. Node-pressure eviction ignores PodDisruptionBudgets, and for hard thresholds it ignores terminationGracePeriodSeconds.
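
The ranking described above can be sketched in a few lines. This is a simplified model rather than the kubelet's own code: the Pod dataclass, its field names, and the sample values are assumptions made for illustration, and only memory is considered.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int      # pod priority value (hypothetical sample data)
    request_mib: int   # memory request; 0 for BestEffort
    usage_mib: int     # memory working set


def eviction_order(pods):
    """Return pods in the approximate order they would be evicted."""
    def key(pod):
        exceeds = pod.usage_mib > pod.request_mib
        return (
            0 if exceeds else 1,                  # pods over their request go first
            pod.priority,                         # then lower priority first
            -(pod.usage_mib - pod.request_mib),   # then larger overage first
        )
    return sorted(pods, key=key)


pods = [
    Pod("guaranteed-db", priority=1000, request_mib=2048, usage_mib=1900),
    Pod("burstable-api", priority=500, request_mib=512, usage_mib=1400),
    Pod("besteffort-job", priority=0, request_mib=0, usage_mib=300),
    Pod("burstable-ok", priority=500, request_mib=1024, usage_mib=700),
]

for pod in eviction_order(pods):
    print(pod.name)
```

With this sample data the BestEffort job goes first, the Burstable pod running above its request second, and the two pods that stay within their requests last. Because priority is checked before overage, a high-priority pod above its request can outlive a low-priority one.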

The scheduler places pods based on requests; the kubelet evicts based on actual working set. If workloads use more memory or disk than requested, the scheduler stacks nodes to what looks like a safe level while the nodes run hot. A spike, node failure, or rolling update pushes one node over its threshold. Controllers recreate evicted pods and the scheduler places them elsewhere, often onto nodes that were also near capacity. Those nodes hit thresholds, evict pods, and the cascade continues.
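
A small worked example makes the two views concrete. The node size, pod counts, and per-pod figures below are invented for illustration, and the kubelet side is reduced to capacity minus what is in use.

```python
def scheduler_free_mib(allocatable_mib, pod_requests_mib):
    """Headroom as the scheduler sees it: allocatable minus summed requests."""
    return allocatable_mib - sum(pod_requests_mib)


def kubelet_available_mib(capacity_mib, pod_working_sets_mib, system_mib):
    """Simplified memory.available: capacity minus everything actually in use."""
    return capacity_mib - sum(pod_working_sets_mib) - system_mib


# Hypothetical 16 GiB node running 20 pods that request 512Mi but use 740Mi.
capacity, allocatable, system = 16384, 15360, 900
requests = [512] * 20
working_sets = [740] * 20

print("scheduler sees free:", scheduler_free_mib(allocatable, requests), "Mi")
print("kubelet sees free:", kubelet_available_mib(capacity, working_sets, system), "Mi")
```

The scheduler sees 5120Mi free, room for ten more of these pods. The kubelet sees 684Mi, less than one more pod's working set, so the next placement pushes the node past the 100Mi threshold.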

flowchart TD
    A[Node A crosses memory pressure threshold] --> B[Kubelet evicts BestEffort and oversubscribed Burstable pods]
    B --> C[Controllers recreate pods and the scheduler places them on other nodes]
    C --> D[Nodes B and C were already near actual capacity]
    D --> E[Nodes B and C cross thresholds and begin evicting]
    E --> F[More rescheduling creates more pressure]
    F --> G[Cluster-wide eviction thrashing]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Scheduler overplacement (requests far below usage) | Burstable pods evicted despite apparently healthy cluster request headroom | kubectl describe node for allocated requests versus actual utilization |
| Missing eviction hysteresis | Node flaps between pressure and recovery, evicting and re-admitting the same pods | Kubelet configuration for evictionMinimumReclaim |
| cgroup v2 memory.available shrink | Elevated evictions after moving to cgroup v2 | stat -fc %T /sys/fs/cgroup |
| Unbounded emptyDir usage | DiskPressure triggers before memory pressure, often affecting build or cache workloads | Pod specs for emptyDir volumes without sizeLimit |
| Sudden node loss or hotspot | One node goes down or a DaemonSet updates, shifting load that pushes remaining nodes over the edge | Node failure events followed by a spike in pending pods |

Quick checks

# Check node pressure conditions
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: [.status.conditions[] | select(.type | test("Pressure|Ready"))]}'

# Compare allocated requests to actual node usage (kubectl top needs metrics-server)
kubectl describe nodes | grep -A 6 "Allocated resources"
kubectl top nodes

# List recent evictions in every namespace, oldest first
kubectl get events --all-namespaces --field-selector reason=Evicted --sort-by='.lastTimestamp'

# Determine cgroup version on the node (cgroup2fs = v2, tmpfs = v1);
# cgroup v2 changes memory.available accounting
stat -fc %T /sys/fs/cgroup

# Check pod QoS distribution on a pressured node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> \
  -o json | jq -r '.items[].status.qosClass' | sort | uniq -c

# Find pods with at least one emptyDir volume that has no sizeLimit
# (any() tests each volume on its own, avoiding cross-volume false positives and duplicates)
kubectl get pods --all-namespaces -o json | jq -r \
  '.items[] | select(any(.spec.volumes[]?; .emptyDir != null and .emptyDir.sizeLimit == null)) | "\(.metadata.namespace)/\(.metadata.name)"'

# Review kubelet eviction configuration (run on the node; the path may differ by distribution)
grep -A 12 eviction /var/lib/kubelet/config.yaml

How to diagnose it

  1. Identify the first node that flipped. The node with the most evictions is not necessarily the root cause. Use kubectl get events --field-selector reason=Evicted sorted by time to find the earliest pressure transition.
  2. Confirm the eviction signal. Check the node’s conditions for MemoryPressure, DiskPressure, or PIDPressure. Note that PIDPressure=True prevents scheduling but does not itself trigger eviction; however, it often co-occurs with other pressures.
  3. Check for scheduler overplacement. On the first affected node, compare the sum of pod memory requests to actual memory working set. If working set is significantly higher than requests, the scheduler placed too many pods on the node. This is the most common root cause of cascades.
  4. Trace pod landing zones. Find the names of pods evicted from the first node, then check which nodes their replacements were scheduled to. If those destination nodes then reported pressure conditions within minutes, you have confirmed a cascade.
  5. Look for cgroup version changes. If the cluster runs cgroup v2, the memory.available calculation no longer subtracts slab reclaimable memory. Identical workloads now report lower available memory, which can push previously safe nodes below the default 100Mi threshold.
  6. Review kubelet hysteresis settings. If evictionMinimumReclaim is at its default of 0 for the active signal, the kubelet stops evicting as soon as the signal clears the threshold. This allows the scheduler to place new pods back onto the node immediately, producing a ping-pong oscillation.
  7. Inspect emptyDir and ephemeral storage. Check whether evicted pods used large emptyDir volumes. Without a sizeLimit, emptyDir writes count against nodefs and can trigger disk pressure before memory is constrained.
  8. Check for node-name reuse. If a worker was terminated and replaced with a node of the same name before the pod-eviction-timeout (default 5 minutes) elapsed, the node lifecycle controller may have behaved incorrectly, delaying expected evictions and causing a sudden backlog of rescheduling.
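
Steps 1 and 4 amount to ordering eviction events by time rather than by count. A minimal sketch, assuming the events have already been reduced to a hypothetical {time, node} shape (extracting that from real Event objects is left out, and the sample records are invented):

```python
from collections import Counter
from datetime import datetime


def first_eviction_per_node(events):
    """Return (node, earliest eviction time) pairs, earliest node first."""
    first = {}
    for event in events:
        when = datetime.fromisoformat(event["time"])
        node = event["node"]
        if node not in first or when < first[node]:
            first[node] = when
    return sorted(first.items(), key=lambda item: item[1])


# Hypothetical eviction records, deliberately out of order.
events = [
    {"time": "2024-05-01T10:07:30+00:00", "node": "node-b"},
    {"time": "2024-05-01T10:02:10+00:00", "node": "node-a"},
    {"time": "2024-05-01T10:08:05+00:00", "node": "node-b"},
    {"time": "2024-05-01T10:09:40+00:00", "node": "node-c"},
    {"time": "2024-05-01T10:08:50+00:00", "node": "node-b"},
]

timeline = first_eviction_per_node(events)
busiest = Counter(event["node"] for event in events).most_common(1)[0][0]

print("first to evict:", timeline[0][0])
print("most evictions:", busiest)
```

In this sample node-b has the most evictions, but node-a evicted first and is the better place to start looking. The gaps between consecutive entries in the timeline show how quickly the pressure spread.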

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Node condition MemoryPressure / DiskPressure | Direct triggers for kubelet eviction | Any transition to True |
| memory.available relative to threshold | Default hard threshold is 100Mi on Linux; cgroup v2 lowers the reported value | memory.available trending toward the threshold |
| Node allocatable vs requests vs actual usage | Reveals scheduler overplacement that drives cascades | Actual usage exceeds allocatable, or requests are far below peak usage |
| kubelet evictions (events or metrics) | Count of pods evicted by signal | Sustained evictions across multiple nodes after a single node failure |
| Scheduler pending pods | Backlog waiting to be placed | Sudden spike after node failure or rolling update |
| Node controller eviction rate | Zone-aware eviction pacing | Evictions slow to the secondary rate (or stop in small clusters) once at least 55% of a zone's nodes are unhealthy |
| nodefs / imagefs available and inodes | Disk pressure can fire before memory pressure | Available space trending toward 10% (nodefs) or 15% (imagefs), or inodes toward 5% |
| emptyDir consumption | Unbounded scratch space fills nodefs | Pods with large emptyDir volumes and no sizeLimit |
| PID utilization per node | PID exhaustion compounds other pressures | PIDs approaching the node limit |

Fixes

If the cause is resource pressure from overplacement

Do not simply tighten eviction thresholds. That treats the symptom and can make the cluster more aggressive without fixing the underlying gap. Instead:

  • Cordon nodes that are cycling between pressure and recovery to stop the rescheduling loop.
  • Audit Burstable workloads and raise memory requests to reflect actual peak working set. The scheduler cannot protect nodes from what it cannot see.
  • Drain cordoned nodes only after confirming that other nodes have real headroom for the displaced pods, so the drain does not start a new cascade.
  • Add node capacity before uncordoning.

If the cause is missing hysteresis

Set evictionMinimumReclaim in the kubelet configuration for the signals you are hitting. For example, forcing memory.available to reclaim at least 200Mi above the threshold prevents pods from being scheduled straight back onto a node that just evicted them.
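
The effect of a minimum reclaim can be illustrated with a toy model. The function below is a simplification (the real kubelet re-evaluates signals between evictions), and the pod sizes are invented.

```python
def evict_until_clear(available_mib, threshold_mib, min_reclaim_mib, pod_usages_mib):
    """Evict pods in the given order until available >= threshold + min reclaim.

    Returns the resulting available memory and the list of evicted pod sizes.
    """
    target = threshold_mib + min_reclaim_mib
    evicted = []
    for usage in pod_usages_mib:
        if available_mib >= target:
            break
        available_mib += usage   # memory freed by evicting this pod
        evicted.append(usage)
    return available_mib, evicted


candidates = [50, 120, 200, 300]  # hypothetical pod working sets, in eviction order

print(evict_until_clear(80, 100, 0, candidates))    # no hysteresis
print(evict_until_clear(80, 100, 200, candidates))  # reclaim 200Mi past the threshold
```

Without a minimum reclaim the node stops at 130Mi available, only 30Mi clear of the threshold, so a single new pod tips it back into pressure. With 200Mi of reclaim it evicts more pods up front but settles at 450Mi, trading extra evictions for stability.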

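If you customize eviction thresholds in KubeletConfiguration, note that specifying evictionHard replaces the built-in hard thresholds entirely instead of merging with them. Define every hard threshold you need explicitly; otherwise the omitted signals fall back to zero and lose their protection. evictionSoft has no built-in defaults, so soft thresholds exist only for the signals you list.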
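
A quick lint for this pitfall might look like the sketch below. It assumes the kubelet configuration has already been parsed into a dict, and the DEFAULT_HARD_SIGNALS table reflects the commonly documented Linux defaults; verify both the values and the override behavior against your Kubernetes version.

```python
# Commonly documented Linux defaults (an assumption; confirm for your version).
DEFAULT_HARD_SIGNALS = {
    "memory.available": "100Mi",
    "nodefs.available": "10%",
    "imagefs.available": "15%",
    "nodefs.inodesFree": "5%",
}


def missing_hard_thresholds(kubelet_config):
    """List default signals left out of an explicit evictionHard block."""
    configured = kubelet_config.get("evictionHard")
    if configured is None:
        return []  # nothing overridden, so the built-in defaults still apply
    return sorted(signal for signal in DEFAULT_HARD_SIGNALS if signal not in configured)


# Hypothetical config that raises the memory threshold and forgets the rest.
config = {"evictionHard": {"memory.available": "500Mi"}}
print(missing_hard_thresholds(config))
```

For this config the check reports the three disk-related signals, which would otherwise be left without a hard threshold.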

If the cause is cgroup v2 accounting drift

On clusters using cgroup v2, expect lower reported memory.available for the same workloads. Re-baseline your nodes with the new accounting. Reduce per-node workload density or add memory to restore safe headroom.
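
Re-baselining means recomputing memory.available from the node's own counters and comparing the result with the threshold. The sketch below uses the documented relationship (capacity minus working set, where working set is usage minus inactive file pages); the two sets of readings are invented to show the comparison, not measured values.

```python
def memory_available_mib(capacity_mib, usage_mib, inactive_file_mib):
    """memory.available = capacity - working set, working set = usage - inactive_file."""
    working_set = max(usage_mib - inactive_file_mib, 0)
    return capacity_mib - working_set


def headroom_mib(available_mib, threshold_mib=100):
    """Distance between memory.available and the hard eviction threshold."""
    return available_mib - threshold_mib


# Hypothetical readings for the same workload under two accounting baselines.
baseline_old = memory_available_mib(16384, usage_mib=15200, inactive_file_mib=1400)
baseline_new = memory_available_mib(16384, usage_mib=15900, inactive_file_mib=1400)

print("old headroom:", headroom_mib(baseline_old), "Mi")
print("new headroom:", headroom_mib(baseline_new), "Mi")
```

If the recomputed headroom is smaller than one typical pod's working set, the node is too dense for the new accounting and should carry fewer pods or more memory.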

If the cause is emptyDir or ephemeral storage

Enforce sizeLimit on all emptyDir volumes in pod specs. Ensure container logs are rotated so that nodefs does not fill from log output.
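
One way to enforce this is a check in CI or an admission step. The sketch below assumes pod manifests are available as dicts in the usual Pod JSON shape; the function name and the sample pod are hypothetical.

```python
def unbounded_emptydirs(pod):
    """Return names of emptyDir volumes in a pod spec that have no sizeLimit."""
    names = []
    for volume in pod.get("spec", {}).get("volumes") or []:
        empty_dir = volume.get("emptyDir")
        if empty_dir is not None and not empty_dir.get("sizeLimit"):
            names.append(volume["name"])
    return names


# Hypothetical build pod with one bounded and one unbounded scratch volume.
pod = {
    "metadata": {"namespace": "ci", "name": "builder-0"},
    "spec": {
        "volumes": [
            {"name": "cache", "emptyDir": {}},
            {"name": "scratch", "emptyDir": {"sizeLimit": "2Gi"}},
            {"name": "config", "configMap": {"name": "builder-config"}},
        ]
    },
}

print(unbounded_emptydirs(pod))
```

Only the cache volume is reported here; the bounded scratch volume and the configMap volume pass.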

If the cause is node failure or zone disruption

The node-lifecycle controller modulates eviction rates based on zone health. Rate limiting helps during partial disruptions. If an entire zone fails, evicted pods must reschedule elsewhere; ensure other zones have headroom.
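
The pacing rule can be sketched as a small function. The default values mirror the documented kube-controller-manager flags (node-eviction-rate, secondary-node-eviction-rate, unhealthy-zone-threshold, large-cluster-size-threshold); the function itself is a simplification that ignores special cases such as every zone being unhealthy at once.

```python
def node_eviction_rate(
    unhealthy_nodes,
    total_nodes,
    primary_rate=0.1,            # nodes per second in a healthy zone
    secondary_rate=0.01,         # reduced rate for an unhealthy zone
    unhealthy_threshold=0.55,    # fraction of unhealthy nodes that marks a zone unhealthy
    large_cluster_threshold=50,  # at or below this size, evictions stop instead of slowing
):
    """Return the eviction rate (nodes per second) under this simplified rule."""
    if total_nodes == 0:
        return 0.0
    if unhealthy_nodes / total_nodes < unhealthy_threshold:
        return primary_rate
    if total_nodes <= large_cluster_threshold:
        return 0.0
    return secondary_rate


print(node_eviction_rate(10, 100))  # healthy zone
print(node_eviction_rate(60, 100))  # unhealthy zone, large cluster
print(node_eviction_rate(6, 10))    # unhealthy zone, small cluster
```

The controller slows down as a zone degrades, which protects against mass eviction during a network partition. It does nothing for node-pressure evictions, which the kubelet performs on its own.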

Ensure the cluster autoscaler can scale out quickly enough to absorb rescheduled pods. Review PodDisruptionBudgets as well: conservative settings (maxUnavailable: 0) can block cluster autoscaler scale-down and node drains during recovery.

Prevention

  • Treat resource requests as a scheduling contract. If a workload consistently exceeds its memory request, the contract is broken and the scheduler will overplace nodes.
  • Configure evictionMinimumReclaim for all hard thresholds to create hysteresis and prevent immediate re-scheduling onto recovering nodes.
  • Monitor the gap between node allocatable and actual utilization, not just requested capacity.
  • Set sizeLimit on emptyDir volumes and enforce ephemeral storage limits.
  • Maintain cluster headroom so that losing one node or one availability zone does not push the remaining fleet above 80% actual utilization.
  • Validate kubelet behavior on cgroup v2 nodes after any Kubernetes upgrade before promoting changes to production clusters.
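
The headroom rule is easy to check with arithmetic. The sketch below assumes the lost nodes' load lands entirely on the survivors; the node sizes and usage figures are invented.

```python
def utilization_after_loss(node_capacity_mib, node_usage_mib, lost_indexes):
    """Fraction of surviving capacity in use once the lost nodes' load moves over."""
    total_usage = sum(node_usage_mib)
    remaining_capacity = sum(
        capacity
        for index, capacity in enumerate(node_capacity_mib)
        if index not in lost_indexes
    )
    return total_usage / remaining_capacity


# Hypothetical fleet: five 16 GiB nodes, each using about 11000Mi.
capacities = [16384] * 5
usages = [11000] * 5

before = utilization_after_loss(capacities, usages, lost_indexes=set())
after = utilization_after_loss(capacities, usages, lost_indexes={0})

print(f"before: {before:.1%}, after losing one node: {after:.1%}")
```

A fleet that looks comfortable at about 67% actual utilization lands near 84% after losing a single node, above the 80% guideline. Running the same calculation with all nodes of one zone removed covers the zone-failure case.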

How Netdata helps

Netdata surfaces node-level signals that precede API-visible Kubernetes conditions:

  • Per-node charts for memory.available, disk usage, and inode utilization identify the first node to tip before the kubelet begins evicting.
  • Correlating kubelet events with system memory and disk metrics on the same timeline reveals whether a spike was caused by workload growth, emptyDir bloat, or log accumulation.
  • Per-cgroup memory charts let you validate whether a cgroup v2 configuration has shifted memory accounting headroom for your actual workloads.
  • Alerts on memory.available and disk pressure can fire before Kubernetes conditions propagate to the control plane, giving time to cordon a node before a cascade begins.