Kubernetes node MemoryPressure: detection, eviction order, and prevention
Before adding RAM, determine whether kubelet is evicting because workloads are genuinely starving or because memory requests are misaligned with reality.
What this means
Kubelet evaluates memory.available against an eviction threshold. On Linux the default hard threshold is memory.available < 100Mi. Kubelet derives this from cgroup stats, not free -m. It measures working-set memory (RSS plus active file-backed pages) and subtracts that from total capacity. When the threshold is crossed, kubelet sets the node condition MemoryPressure=True and adds the taint node.kubernetes.io/memory-pressure:NoSchedule. New pods are blocked from scheduling until the condition clears.
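The arithmetic behind memory.available can be sketched in shell. This is a minimal sketch, not kubelet's code: the function name memory_available_bytes is hypothetical and the sample numbers are invented. It assumes working set is cgroup memory usage minus inactive file-backed pages, as the Kubernetes node-pressure eviction documentation describes.

```shell
#!/bin/sh
# Sketch of the memory.available arithmetic (all arguments in bytes):
#   $1 = node memory capacity, $2 = cgroup memory usage, $3 = inactive_file
memory_available_bytes() {
  capacity=$1
  usage=$2
  inactive_file=$3
  working_set=$((usage - inactive_file))
  # Working set is floored at zero.
  if [ "$working_set" -lt 0 ]; then
    working_set=0
  fi
  echo $((capacity - working_set))
}

# Invented sample: 8 GiB node, 7.5 GiB in use, 1 GiB of inactive page cache.
memory_available_bytes 8589934592 8053063680 1073741824
```

On a cgroup v2 node the inputs would come from /sys/fs/cgroup/memory.current and the inactive_file line of /sys/fs/cgroup/memory.stat; on cgroup v1 from memory.usage_in_bytes and total_inactive_file. Capacity is MemTotal in /proc/meminfo, which is reported in kB.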
Once the threshold is crossed, kubelet’s eviction manager kills pods to reclaim memory. The eviction order is not random. Kubelet first targets pods whose usage exceeds their requests: BestEffort pods (which have no requests, so any usage counts) and Burstable pods running above their requests. Guaranteed pods and Burstable pods whose usage is within requests come last. Within each group, kubelet ranks pods by priority ascending (lower priority first), then by how far usage exceeds the request. Guaranteed pods are the last to be touched, but they are not immune.
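The ranking can be approximated with awk and sort. This is a sketch under assumptions, not the eviction manager itself: the function name and the four-column input format are hypothetical, and it follows the order documented for node-pressure eviction (usage above request first, then pod priority, then how far usage exceeds the request).

```shell
#!/bin/sh
# Input on stdin, one pod per line (hypothetical format, memory in Mi):
#   <pod-name> <priority> <usage> <request>
# Output: pod names, first eviction candidate first.
rank_eviction_candidates() {
  awk '{
    over = $3 - $4
    exceeds = (over > 0) ? 0 : 1   # 0 sorts first: usage is above request
    printf "%d %d %.0f %s\n", exceeds, $2, over, $1
  }' | sort -k1,1n -k2,2n -k3,3nr | awk '{ print $4 }'
}

# Invented sample pods.
printf '%s\n' \
  'guaranteed-db 1000 500 500' \
  'besteffort-job 0 300 0' \
  'burst-api 100 900 400' \
  'burst-ok 0 100 400' | rank_eviction_candidates
```

With the sample data, the BestEffort pod and the Burstable pod above its request are listed before the two pods that stay within their requests.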
This mechanism is separate from the kernel OOM killer. OOMKill fires immediately when a container exceeds its cgroup memory limit with no grace period. Kubelet eviction is a node-level, signal-driven process that respects pod priority and QoS.
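A small helper makes the distinction concrete. It is a sketch: the function name is hypothetical, and it assumes you have already read the pod's status.reason and the container's lastState.terminated.reason (for example with kubectl and jsonpath).

```shell
#!/bin/sh
# $1 = pod .status.reason, $2 = container lastState.terminated.reason
classify_pod_failure() {
  if [ "$1" = "Evicted" ]; then
    echo "kubelet eviction: node-level pressure"
  elif [ "$2" = "OOMKilled" ]; then
    echo "kernel OOM kill: container hit its cgroup memory limit"
  else
    echo "neither: look at other termination reasons"
  fi
}

classify_pod_failure "Evicted" ""
classify_pod_failure "" "OOMKilled"
```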
flowchart TD
A["memory.available < eviction-hard threshold"] --> B{"Kernel OOM or kubelet eviction?"}
B -->|"Container exceeds cgroup limit"| C["Kernel OOMKilled immediate"]
B -->|"Node-level pressure"| D["Kubelet sets MemoryPressure=True"]
D --> E["Taint node.kubernetes.io/memory-pressure:NoSchedule added"]
E --> F{"Eviction order by QoS"}
F --> G["BestEffort pods first"]
F --> H["Burstable where usage > request"]
F --> I["Guaranteed and Burstable where usage <= request last"]
G --> J["Reclaim memory"]
H --> J
I --> J
J --> K{"Pressure clears?"}
K -->|"No"| F
K -->|"Yes"| L["Remove condition after transition period"]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Actual workload memory growth | Working set climbs steadily | kubectl top nodes and kubectl top pods |
| Missing or low memory requests | Scheduler overcommits the node | kubectl describe node allocated resources |
| Memory leak in application | Pod restart count climbing with OOMKilled | Pod status and lastState.terminated.reason |
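| Kernel page cache accumulation | free -m shows little free memory while MemAvailable stays high; kubelet can still report pressure because active page cache counts toward working set | /proc/meminfo on the node |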
| Sudden memory spike or thundering herd | MemoryPressure flips rapidly then clears | Node memory utilization graph over time |
| Kubelet eviction threshold too tight | Node enters pressure under normal load | Kubelet config --eviction-hard |
Quick checks
# Check node conditions for MemoryPressure
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'
# Check taint
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[?(@.key=="node.kubernetes.io/memory-pressure")].effect}{"\n"}{end}'
# Check kubelet eviction configuration (path varies by distribution)
grep -A5 'evictionHard' /var/lib/kubelet/config.yaml
# Check actual memory vs allocatable
kubectl describe node <node-name> | grep -A5 "Allocated resources"
# Check for evicted pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>,status.phase=Failed | grep Evicted
# Check system memory composition
grep -E "MemTotal|MemAvailable|MemFree" /proc/meminfo
# Check kubelet eviction metrics (requires node metrics access)
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep kubelet_evictions
# Check pod OOMKilled status
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
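To turn the raw eviction metric into a per-signal summary, the metrics text can be filtered with awk. This is a sketch: the function name is hypothetical and the sample lines are invented. It assumes the counter carries an eviction_signal label as its first label, and the prefix match tolerates a _total suffix in case your scrape path adds one.

```shell
#!/bin/sh
# Read Prometheus text format on stdin and print "<signal> <count>" per sample.
evictions_by_signal() {
  awk -F'"' 'index($0, "kubelet_evictions") == 1 {
    n = split($0, fields, " ")   # the last whitespace field is the value
    print $2, fields[n]          # $2 is the first quoted label value
  }'
}

# Invented sample of kubelet metrics output.
printf '%s\n' \
  '# TYPE kubelet_evictions counter' \
  'kubelet_evictions{eviction_signal="memory.available"} 3' \
  'kubelet_evictions{eviction_signal="nodefs.available"} 1' | evictions_by_signal
```

Against a live node, pipe the output of the kubectl get --raw metrics command into evictions_by_signal instead of the sample text.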
How to diagnose it
- Verify the node condition and taint. Confirm that MemoryPressure=True and the node.kubernetes.io/memory-pressure:NoSchedule taint are present. Together they block new scheduling.
- Distinguish eviction from OOMKill. Inspect lastState.terminated.reason on the container and status.reason on the pod. OOMKilled means the container hit its cgroup limit. Evicted means kubelet removed the pod due to node pressure. OOMKilled points to a limit or leak; Evicted points to node capacity or request alignment.
- Inspect node memory composition. Compare MemAvailable to MemFree in /proc/meminfo. If MemFree is low while MemAvailable is still high, page cache holds most of the memory and the kernel can reclaim it. Kubelet evaluates working set, not MemFree, and working set counts active file-backed pages as used, so kubelet can report pressure while MemAvailable still looks comfortable.
- Compare requested versus actual usage. kubectl describe node shows allocated resources; kubectl top pods shows real utilization. If actual usage far exceeds requests, the scheduler overcommitted the node.
- Identify evicted pods. Query events with reason=Evicted and correlate with pod QoS classes. Guaranteed pods being evicted means severe oversubscription.
- Check kubelet logs. Look for eviction manager: log lines naming the signal memory.available and the selected pod. This confirms the threshold crossed and the victim chosen.
- Verify eviction threshold configuration. Inspect --eviction-hard. Specifying this flag or config field replaces the entire default set. If you tighten a threshold while the node is already violating it, kubelet may evict aggressively.
- Look for memory leaks. A climbing restart count with OOMKilled means a container is hitting its cgroup limit and contributing to pressure. Fix the leak or right-size the limit.
- Check scheduler versus kubelet divergence. The scheduler places pods against allocatable = capacity - kube-reserved - system-reserved - eviction-hard, and it counts requests, not usage. If workloads consume more than requests but less than limits, the scheduler sees room while kubelet sees pressure. Accurate requests close this gap.
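The allocatable formula and the resulting divergence can be worked through with invented numbers. A minimal sketch: the function name and every value below are assumptions for illustration, all in Mi.

```shell
#!/bin/sh
# allocatable = capacity - kube-reserved - system-reserved - eviction-hard
node_allocatable_mi() {
  echo $(($1 - $2 - $3 - $4))
}

# Invented 16 GiB node with 512Mi reserved twice and a 100Mi hard threshold.
allocatable=$(node_allocatable_mi 16384 512 512 100)
requested=12000     # sum of pod memory requests (invented)
working_set=15000   # sum of pod working sets (invented)

echo "allocatable: ${allocatable}Mi"
echo "scheduler still sees $((allocatable - requested))Mi free"
echo "actual usage is $((working_set - requested))Mi above requests"
```

In this example the scheduler would keep placing pods on the node even though real consumption is already close to allocatable.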
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kube_node_status_condition{condition="MemoryPressure"} | Binary indicator of node pressure | Value == 1 |
| node_memory_MemAvailable_bytes vs kubelet-reported available | Actual node memory state vs kubelet’s view | MemAvailable trending toward eviction threshold |
| kubelet_evictions | Count of kubelet-driven evictions, labeled by eviction signal | Counter incrementing |
| kube_pod_container_status_terminated_reason{reason="OOMKilled"} | Cgroup-level OOM events | Rate increasing |
| container_memory_working_set_bytes per pod | Actual pod memory consumption | Sustained usage above request |
| Node allocatable vs requested memory ratio | Scheduling headroom | Requests approaching 80% of allocatable |
| MemoryPressure condition age | Persistence of pressure | Condition True for > 5 minutes |
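The requests-to-allocatable warning sign reduces to integer arithmetic. A sketch with hypothetical function names and invented inputs; the 80% cut-off is the warning level from the table, not a Kubernetes default.

```shell
#!/bin/sh
# Percent of allocatable memory already claimed by requests (integer math).
request_ratio_pct() {
  echo $(($1 * 100 / $2))
}

# $1 = sum of memory requests in Mi, $2 = node allocatable in Mi
check_headroom() {
  pct=$(request_ratio_pct "$1" "$2")
  if [ "$pct" -ge 80 ]; then
    echo "warning: requests at ${pct}% of allocatable"
  else
    echo "ok: requests at ${pct}% of allocatable"
  fi
}

check_headroom 13000 15260
```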
Fixes
If the cause is resource pressure
Cordon the node to stop new scheduling. Identify highest-memory consumers with kubectl top pods. For Burstable pods exceeding requests, scale horizontally or increase requests and limits. If Guaranteed pods are evicted, the node needs more physical memory or fewer pods. Do not tune eviction thresholds to mask genuine capacity exhaustion.
If the cause is configuration drift
Adjust kubelet --eviction-hard if thresholds are unrealistically tight. Set --eviction-minimum-reclaim to create hysteresis and avoid flapping. Warning: specifying --eviction-hard via flag or kubelet config replaces the entire default set. If you tighten a threshold while the node is already violating it, kubelet may evict aggressively.
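Because a custom evictionHard replaces the defaults, a config fragment should restate every signal you want to keep. The snippet below writes an illustrative fragment to a scratch file. It is a sketch: the file name is hypothetical, the thresholds shown are the commonly documented Linux defaults, and the minimum-reclaim value is an assumption to adjust for your node size.

```shell
#!/bin/sh
# Write an illustrative KubeletConfiguration fragment to a scratch file.
cat > kubelet-eviction-example.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"   # adjust to suit node size
  nodefs.available: "10%"     # restated: omitted signals lose their defaults
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionMinimumReclaim:
  memory.available: "100Mi"   # assumed value: reclaim this much past the threshold
EOF

# List the signals the fragment sets.
grep -E '^  [a-z]+\.[A-Za-z]+:' kubelet-eviction-example.yaml
```

Merge the fields into the node's kubelet config file (the path varies by distribution) and restart kubelet. The related evictionPressureTransitionPeriod field controls how long kubelet waits before clearing the condition.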
If the cause is application behavior
Set memory requests to steady-state usage. Set limits with headroom for spikes. Use LimitRange defaults to prevent BestEffort pods. For leaks, fix the application; VerticalPodAutoscaler can keep requests aligned with observed usage, but it does not stop a leak.
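A LimitRange that gives every container a default memory request and limit might look like the following. It is a sketch: the object name, namespace, file name, and sizes are all hypothetical and should be replaced with values derived from your own steady-state measurements.

```shell
#!/bin/sh
# Write an illustrative LimitRange manifest to a scratch file.
cat > limitrange-example.yaml <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-defaults          # hypothetical name
  namespace: example-namespace   # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi            # applied when a container sets no request
      default:
        memory: 512Mi            # applied when a container sets no limit
EOF

# Validate client-side before applying, if kubectl is installed:
# kubectl apply --dry-run=client -f limitrange-example.yaml
```

With a default request in place, containers that omit memory settings no longer land in the BestEffort class.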
If the cause is working set confusion
On nodes with heavy file I/O, active page cache inflates working set. This is expected; do not disable eviction. Provision more memory or reduce workload density.
Prevention
- Set realistic memory requests and limits for every pod. Use LimitRange policies to enforce defaults.
- Monitor the gap between container_memory_working_set_bytes and pod memory requests. A growing gap signals overcommit before eviction triggers.
- Configure kubelet reserved resources with --kube-reserved and --system-reserved so the scheduler accounts for node overhead.
- Tune eviction thresholds based on node size. Set --eviction-minimum-reclaim to avoid rapid condition flapping.
- Enable node-problem-detector or equivalent to catch memory pressure trends before they reach the eviction threshold.
- Use Guaranteed QoS with requests equal to limits, and assign PriorityClasses so critical pods are evicted last. Guaranteed does not prevent OOMKill if the limit itself is wrong.
- Do not rely on free -m for capacity planning. Use MemAvailable and kubelet working-set metrics.
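Watching the gap between working set and requests needs both values in the same unit. The helpers below are a sketch with hypothetical names; they handle only the Ki, Mi, and Gi suffixes, which is an assumption about how your quantities are written.

```shell
#!/bin/sh
# Convert a Kubernetes memory quantity (Ki, Mi, Gi only) to whole Mi.
to_mi() {
  case $1 in
    *Ki) echo $((${1%Ki} / 1024)) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $((${1%Gi} * 1024)) ;;
    *) echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Positive result: working set is above the request by that many Mi.
usage_over_request_mi() {
  echo $(($(to_mi "$1") - $(to_mi "$2")))
}

usage_over_request_mi 1536Mi 1Gi   # invented pod: working set vs request
```

Feed it the memory column from kubectl top pods and the request from the pod spec; a value that keeps growing is the early overcommit signal described above.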
How Netdata helps
- Correlates node_memory_MemAvailable_bytes with kubelet_evictions to show the exact moment pressure turns into pod death.
- Tracks kube_node_status_condition transitions and taint application latency to catch scheduling blocks immediately.
- Visualizes per-pod container_memory_working_set_bytes against requests and limits to surface overcommit before eviction triggers.
- Alerts on sustained MemoryPressure conditions alongside system-level MemAvailable trends, distinguishing kubelet eviction from kernel OOM.
- Aggregates node-level memory composition (cache, buffers, RSS) so you can see whether pressure is from active workloads or reclaimable page cache.






