Kubernetes pod liveness probe killing healthy containers

A container that is processing requests, not OOMKilled, and not crashed can still be restarted repeatedly by the kubelet because a liveness probe failed. The application is alive, but the probe says it is not. This usually shows up as a pod stuck in CrashLoopBackOff with Liveness probe failed events, even though application logs show no fatal error. The restarts waste resources, break active connections, and can trigger cascading load on the cluster as other pods absorb the shifted traffic.

After reading this guide, you will be able to distinguish a genuinely unhealthy container from a falsely failing liveness probe, identify whether the root cause is probe configuration, resource pressure, or kubelet execution lag, and fix it without guessing.

What this means

A liveness probe is meant to detect containers that are deadlocked or otherwise unable to recover. The kubelet executes the probe at a configured interval. If the probe fails enough consecutive times, the kubelet kills and restarts the container. When the container is actually healthy but the probe fails anyway, the restart is a false positive.

The failure mechanism is straightforward but unforgiving. The kubelet runs each probe in its own goroutine. If the probe exceeds its timeout, returns a non-success status, or cannot execute at all, that attempt counts as a failure. After failureThreshold consecutive failures, the container is restarted. The probe gives no partial credit. A container under GC pressure, a kubelet that is CPU-starved, or an endpoint that is simply slow to respond can all trigger a restart.
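To get a feel for how little time a slow container has, the rule above can be turned into rough arithmetic. This is a minimal sketch, assuming every attempt in the failing run times out; `restart_window` is a hypothetical helper name, and the result is an approximation that ignores kubelet scheduling jitter.

```shell
# Approximate seconds from the start of the first failing probe to the
# restart decision, assuming each attempt in the run hits its timeout.
# Arguments: periodSeconds failureThreshold timeoutSeconds
restart_window() {
  local period=$1 threshold=$2 timeout=$3
  # (threshold - 1) full periods separate the first and last attempt,
  # then the last attempt has to time out as well.
  echo $(( period * (threshold - 1) + timeout ))
}

# Kubernetes defaults: periodSeconds=10, failureThreshold=3, timeoutSeconds=1
restart_window 10 3 1   # prints 21
```

With the defaults, a container that stays slow for a little over 20 seconds is restarted.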

Startup probes, if configured, gate liveness probes. Until the startup probe succeeds, the kubelet does not evaluate the liveness probe at all. If your application has a slow startup phase and you rely only on initialDelaySeconds, you are exposed to this failure mode: liveness probing begins as soon as the fixed delay expires, whether or not the application has finished starting.

Common causes

Cause	What it looks like	First thing to check
Aggressive probe config	Container restarts seconds after starting; `CrashLoopBackOff` appears quickly	Probe `timeoutSeconds`, `periodSeconds`, and `failureThreshold`
Slow application startup	Restarts happen only during rollouts or cold starts; pod never stays Running long enough	Whether a `startupProbe` is defined
GC pause or CPU throttling	Probe failures correlate with application GC logs or CPU spikes	Node CPU pressure and container CFS throttling metrics
Wrong probe target	Immediate, permanent failures from pod creation	Whether the probe port and path match the running application
Kubelet probe execution lag	Intermittent failures when the node is heavily loaded	Kubelet CPU usage and sync loop duration
Memory pressure masking	Container responds to probes slowly before eventually being OOMKilled	Container memory usage versus its limit

Quick checks

# Check pod events for liveness probe failures
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'

# Check container restart count and last termination reason
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState.terminated.reason}'

# Inspect probe configuration directly
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}'

# Compare liveness and readiness configs
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}'

# Check if a startupProbe is configured
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].startupProbe}'

# Check node conditions (memory, disk, and PID pressure) and kubelet readiness
kubectl describe node <node-name> | grep -A 8 "Conditions:"

# Check container CPU throttling on the node (cgroup v1, cgroupfs driver, Burstable QoS; the path differs with cgroup v2 or the systemd driver)
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.stat

# Check kubelet logs for probe results on the node
journalctl -u kubelet --since "10 minutes ago" | grep -i "probe.*failed\|liveness"
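The raw counters in cpu.stat are easier to judge as a percentage. The sketch below is one way to compute it, assuming the file contains `nr_periods` and `nr_throttled` lines; `throttle_pct` is a hypothetical helper name, and the sample numbers are invented for illustration.

```shell
# Print the percentage of CFS periods in which the cgroup was throttled.
# Argument: path to a cpu.stat file
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }' "$1"
}

# Illustrative run against made-up counters
sample=$(mktemp)
printf 'nr_periods 2000\nnr_throttled 150\nthrottled_time 9000000000\n' > "$sample"
throttle_pct "$sample"   # prints 7.5
rm -f "$sample"
```

On a real node, pass the cpu.stat path from the check above instead of the sample file.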

How to diagnose it

Follow this flow to confirm whether the container is truly unhealthy or the probe is lying.

Confirm the container is actually healthy. Check application logs for panics or fatal errors. If the application is serving traffic, processing messages, or completing work, it is likely healthy and the probe is a false positive.
Check pod events for the exact probe failure reason. Look for Liveness probe failed events. The message usually states whether it was a timeout, a connection refused, or an HTTP error code. This determines which branch to follow.
Inspect the probe configuration. Look at timeoutSeconds, periodSeconds, failureThreshold, and initialDelaySeconds. If timeoutSeconds is 1 and your application has tail latency above 1 second under load, that is the problem.
Determine if the failure is during startup. If restarts only happen during pod creation or deployment rollouts, the application startup time exceeds the probe window. The fix is a startupProbe, not a larger initialDelaySeconds.
Check for node and container resource pressure. Look at node CPU and memory conditions. Check if the container is being CPU-throttled or is near its memory limit. A throttled container cannot respond to probes quickly. A container nearing its memory limit may experience slow allocations or GC pressure.
Check kubelet health on the node. If the kubelet is under CPU pressure or its sync loop is delayed, probe execution itself can lag. Check kubelet_sync_loop_duration_seconds and kubelet CPU usage on the node.
Correlate restart times with application behavior. If restarts align with GC pauses, batch job spikes, or traffic surges, the probe is too sensitive for the application’s normal operating envelope.
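Verify the probe endpoint manually. kubectl exec into the pod and curl the probe endpoint locally (if the image ships curl). If it responds correctly inside the container but fails from the kubelet, suspect networking or port binding issues, such as an application listening only on 127.0.0.1 while the kubelet probes the pod IP.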
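The branching on the event message in the flow above can be sketched as a small classifier. The matched substrings are assumptions based on commonly seen kubelet event text; exact wording varies with Kubernetes version and probe type, and `classify_probe_failure` is a hypothetical helper name.

```shell
# Map a "Liveness probe failed" event message to a diagnosis branch.
# Argument: the event message text
classify_probe_failure() {
  case "$1" in
    *"context deadline exceeded"*|*"Client.Timeout exceeded"*|*"i/o timeout"*)
      echo "timeout" ;;      # check latency, throttling, timeoutSeconds
    *"connection refused"*)
      echo "refused" ;;      # check probe port and listen address
    *"statuscode"*)
      echo "http-status" ;;  # check what the endpoint itself reports
    *)
      echo "other" ;;
  esac
}

classify_probe_failure 'Liveness probe failed: HTTP probe failed with statuscode: 503'
```

A `timeout` result points to the resource-pressure and kubelet steps, `refused` to the probe target, and `http-status` to the health endpoint's own logic.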

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Container restart count	Direct indicator of probe-induced restarts	Restart count increasing steadily
Container waiting reason `CrashLoopBackOff`	The kubelet is backing off between repeated restarts of the container	A container enters or remains in `CrashLoopBackOff`
Kubelet probe failure rate	Shows raw liveness probe failures	`prober_probe_total` with `result="failed"` increasing
Container CPU throttling	Throttled containers respond slowly to probes	Ratio of `container_cpu_cfs_throttled_periods_total` to `container_cpu_cfs_periods_total` above 5%
Container memory usage	Near-limit memory causes slowdowns or OOM	Usage trending above 80% of limit
Node CPU utilization	High node CPU delays kubelet probe goroutines	Node CPU sustained above 80%
Kubelet sync loop duration	Slow sync delays all kubelet operations, including probes	`kubelet_sync_loop_duration_seconds` p99 above 10 seconds
Kubelet PLEG relist duration	PLEG lag can delay pod state propagation and probe scheduling	`kubelet_pleg_relist_duration_seconds` approaching the 3-minute threshold

Fixes

If the cause is aggressive probe configuration

Increase timeoutSeconds from the default of 1 to a value that covers your application’s p99 internal latency under load, typically 3 to 5 seconds. Increase periodSeconds only if you need to reduce probe overhead; lowering it detects failures sooner but increases kubelet load. Ensure failureThreshold (default 3) allows for transient slowness without immediately restarting. Do not set failureThreshold to an artificially high number to mask a real problem; if the container is actually deadlocked, you want it restarted.
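One way to apply such a change without editing manifests by hand is a JSON patch against the workload. The sketch below only builds the patch. It assumes the liveness probe lives on the first container of a Deployment, and `liveness_patch` and `my-app` are hypothetical names.

```shell
# Emit a JSON patch that sets timeoutSeconds and failureThreshold on the
# first container's liveness probe in a Deployment pod template.
# Arguments: timeoutSeconds failureThreshold
liveness_patch() {
  local timeout=$1 threshold=$2
  local base=/spec/template/spec/containers/0/livenessProbe
  printf '[{"op":"replace","path":"%s/timeoutSeconds","value":%d},{"op":"replace","path":"%s/failureThreshold","value":%d}]\n' \
    "$base" "$timeout" "$base" "$threshold"
}

liveness_patch 5 3

# To apply it (my-app is a placeholder Deployment name):
#   kubectl patch deployment my-app --type=json -p "$(liveness_patch 5 3)"
```

Patching the pod template triggers a rollout, so treat this as a regular deployment change rather than a hot fix.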

If the cause is slow startup

Add a startupProbe that checks the same endpoint as the liveness probe. Set its failureThreshold multiplied by periodSeconds to cover the worst-case startup duration. The kubelet will not run the liveness probe until the startup probe succeeds. This is the correct mechanism for slow-starting applications. Relying on a large initialDelaySeconds creates a fixed delay that does not adapt if the container starts faster or slower than expected.
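The budget arithmetic is simple enough to script. This sketch rounds the worst-case startup time up to a whole number of probe periods and prints a matching stanza; the `/healthz` path, port 8080, and both function names are placeholders, not values from any particular application.

```shell
# Smallest failureThreshold whose budget (failureThreshold * periodSeconds)
# covers the worst-case startup time.
# Arguments: worst_case_seconds periodSeconds
startup_threshold() {
  local worst=$1 period=$2
  echo $(( (worst + period - 1) / period ))   # integer ceiling division
}

# Print a startupProbe stanza for the given budget.
# The path and port below are placeholders.
startup_probe_yaml() {
  local worst=$1 period=$2
  cat <<EOF
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: ${period}
  failureThreshold: $(startup_threshold "$worst" "$period")
EOF
}

# A 95-second worst case with a 10-second period needs failureThreshold 10
startup_probe_yaml 95 10
```

Measure the worst case on a cold node with cold caches, and consider adding headroom before computing the threshold.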

If the cause is resource pressure

Raise the container’s CPU limit if CFS throttling is occurring during probe execution. Raise the memory limit if the container is approaching it and experiencing GC pressure or OOM kills. Ensure the container has resource requests set so the scheduler does not pack it onto an already saturated node. For language runtimes with stop-the-world GC, tune the runtime’s memory settings to reduce pause duration.

If the cause is probe misconfiguration

Verify that the probe port matches the port the application is actually listening on. Verify that the HTTP path returns a status in the 200-399 range and does so quickly. Do not point a liveness probe at an endpoint that depends on downstream services, databases, or external APIs. A liveness probe should test whether the container itself is alive, not whether the entire dependency chain is healthy. Keep readiness probes separate: readiness should catch dependency failures, while liveness should catch deadlocks.

If the cause is kubelet or node pressure

If the node is CPU-saturated, the kubelet may not schedule probe goroutines promptly. If the node is under memory pressure, the kubelet may be busy evicting pods. If PLEG is unhealthy or the sync loop is slow, probe execution is delayed. Cordon the node if necessary and investigate why the kubelet cannot keep up. On dense nodes, reduce pod count or increase node resources.

Prevention

Define a startupProbe for any application that takes more than a few seconds to become ready.
Define liveness probes that check only internal application state, never external dependencies.
Monitor container restart counts per deployment and alert when they increase.
Set resource requests and limits based on observed startup and steady-state profiles, not guesswork.
Review probe configurations in CI before deployment; enforce minimum timeoutSeconds and appropriate failureThreshold values.
Monitor kubelet probe latency and node CPU saturation as leading indicators.
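A CI gate for probe settings can be as small as the following sketch. The minimums are hypothetical policy values, not Kubernetes requirements, and `check_liveness_policy` is an illustrative name; how you extract the two numbers from your manifests is left to your pipeline.

```shell
# Hypothetical policy minimums; tune them to your own latency profile.
MIN_TIMEOUT_SECONDS=2
MIN_FAILURE_THRESHOLD=3

# Return non-zero and explain why if the probe settings violate the policy.
# Arguments: timeoutSeconds failureThreshold
check_liveness_policy() {
  local timeout=$1 threshold=$2 status=0
  if [ "$timeout" -lt "$MIN_TIMEOUT_SECONDS" ]; then
    echo "timeoutSeconds=$timeout is below minimum $MIN_TIMEOUT_SECONDS"
    status=1
  fi
  if [ "$threshold" -lt "$MIN_FAILURE_THRESHOLD" ]; then
    echo "failureThreshold=$threshold is below minimum $MIN_FAILURE_THRESHOLD"
    status=1
  fi
  return "$status"
}

# The Kubernetes default timeout of 1 second fails this example policy
check_liveness_policy 1 3 || echo "policy check failed"
```

Against a live cluster, the inputs could come from the same jsonpath queries shown under Quick checks.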

How Netdata helps

Correlate container restart spikes with CPU throttling, memory pressure, and node saturation on the same timeline to distinguish probe failures from real crashes.
Monitor kubelet probe failure rates alongside pod health transitions without aggregating away per-pod behavior.
Track container CFS throttling and memory usage alongside pod phase changes to confirm resource pressure as the root cause.
Alert on node CPU, memory, and PID pressure that delays kubelet probe execution before it kills containers.

Related guides

How the Kubernetes control plane works: a mental model for operators: /guides/kubernetes/how-kubernetes-control-plane-works/
Kubernetes API server slow or unresponsive: causes and fixes: /guides/kubernetes/kubernetes-api-server-slow/
Kubernetes API server memory pressure: OOM cycle and tuning: /guides/kubernetes/kubernetes-api-server-memory-pressure/
Kubernetes conntrack exhaustion: dropped connections under load: /guides/kubernetes/kubernetes-conntrack-exhaustion/
Kubernetes API server etcd latency: detection and cascading failures: /guides/kubernetes/kubernetes-api-server-etcd-latency/

flowchart TD
    A[Liveness probe fails] --> B{During startup?}
    B -->|Yes| C[Check startupProbe config]
    B -->|No| D{Timeout or error?}
    D -->|Timeout| E[Check resource pressure<br/>CPU throttling, GC pauses]
    D -->|Connection refused| F[Check probe port and path]
    E --> G{Node or kubelet slow?}
    G -->|Yes| H[Check kubelet CPU and sync loop]
    G -->|No| I[Increase timeoutSeconds<br/>and failureThreshold]
    F --> J[Fix probe target]
    C --> K[Add or tune startupProbe]
    H --> L[Relieve node pressure]
    I --> M[Monitor restart count]
    J --> M
    K --> M
    L --> M

