Kubernetes pod liveness probe killing healthy containers

A container that is processing requests, not OOMKilled, and not crashed can still be restarted repeatedly by the kubelet because a liveness probe failed. The application is alive, but the probe says it is not. This usually shows up as a pod stuck in CrashLoopBackOff with Liveness probe failed events, even though application logs show no fatal error. The restarts waste resources, break active connections, and can trigger cascading load on the cluster as other pods absorb the shifted traffic.

After reading this guide, you will be able to distinguish a genuinely unhealthy container from a falsely failing liveness probe, identify whether the root cause is probe configuration, resource pressure, or kubelet execution lag, and fix it without guessing.

What this means

A liveness probe is meant to detect containers that are deadlocked or otherwise unable to recover. The kubelet executes the probe at a configured interval. If the probe fails enough consecutive times, the kubelet kills and restarts the container. When the container is actually healthy but the probe fails anyway, the restart is a false positive.

The failure mechanism is straightforward but unforgiving. The kubelet runs each probe in its own goroutine. If the probe exceeds its timeout, returns a non-success status, or cannot execute at all, that attempt counts as a failure. After failureThreshold consecutive failures, the container is restarted. The probe gives no partial credit. A container under GC pressure, a kubelet that is CPU-starved, or an endpoint that is simply slow to respond can all trigger a restart.
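
As a rough worked example of the threshold arithmetic above, the sketch below estimates how long a container that has stopped answering keeps running before the kubelet restarts it. The helper name restart_window and the formula (failureThreshold - 1) x periodSeconds + timeoutSeconds are simplifying assumptions for illustration; real timing also depends on probe scheduling jitter and kubelet load.

```shell
# restart_window PERIOD THRESHOLD TIMEOUT
# Approximate seconds from the start of the first failing probe to the restart
# decision, assuming every attempt fails by timing out (hypothetical helper).
restart_window() {
  period=$1
  threshold=$2
  timeout=$3
  echo $(( (threshold - 1) * period + timeout ))
}

# Kubernetes defaults: periodSeconds=10, failureThreshold=3, timeoutSeconds=1
restart_window 10 3 1   # prints 21
# Same period and threshold with timeoutSeconds=5
restart_window 10 3 5   # prints 25
```

With the defaults, roughly 20 seconds of sustained slowness is enough to trigger a restart, which is why a throttled or paused container can be killed while still doing useful work.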

Startup probes, if configured, gate liveness probes. Until the startup probe succeeds, the kubelet does not evaluate the liveness probe at all. If your application has a slow startup phase and you rely only on initialDelaySeconds, any start that runs longer than that fixed delay exposes you to this failure mode.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Aggressive probe config | Container restarts seconds after starting; CrashLoopBackOff appears quickly | Probe timeoutSeconds, periodSeconds, and failureThreshold |
| Slow application startup | Restarts happen only during rollouts or cold starts; pod never stays Running long enough | Whether a startupProbe is defined |
| GC pause or CPU throttling | Probe failures correlate with application GC logs or CPU spikes | Node CPU pressure and container CFS throttling metrics |
| Wrong probe target | Immediate, permanent failures from pod creation | Whether the probe port and path match the running application |
| Kubelet probe execution lag | Intermittent failures when the node is heavily loaded | Kubelet CPU usage and sync loop duration |
| Memory pressure masking | Container responds to probes slowly before eventually being OOMKilled | Container memory usage versus its limit |

Quick checks

# Check pod events for liveness probe failures
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'

# Check container restart count and last termination reason
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState.terminated.reason}'

# Inspect probe configuration directly
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}'

# Compare liveness and readiness configs
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}'

# Check if a startupProbe is configured
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].startupProbe}'

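# Check node conditions (MemoryPressure, DiskPressure, PIDPressure, Ready)
kubectl describe node <node-name> | grep -A 8 "Conditions:"

# Node conditions do not report CPU pressure; check node CPU usage separately (requires metrics-server)
kubectl top node <node-name>

# Check container CPU throttling on the node
# (cgroup v1, cgroupfs driver, Burstable QoS; the path differs under the systemd driver or cgroup v2)
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.stat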

# Check kubelet logs for probe results on the node
journalctl -u kubelet --since "10 minutes ago" | grep -i "probe.*failed\|liveness"

How to diagnose it

Follow this flow to confirm whether the container is truly unhealthy or the probe is lying.

  1. Confirm the container is actually healthy. Check application logs for panics or fatal errors. If the application is serving traffic, processing messages, or completing work, it is likely healthy and the probe is a false positive.

  2. Check pod events for the exact probe failure reason. Look for Liveness probe failed events. The message usually states whether it was a timeout, a connection refused, or an HTTP error code. This determines which branch to follow.

  3. Inspect the probe configuration. Look at timeoutSeconds, periodSeconds, failureThreshold, and initialDelaySeconds. If timeoutSeconds is 1 and your application has tail latency above 1 second under load, that is the problem.

  4. Determine if the failure is during startup. If restarts only happen during pod creation or deployment rollouts, the application startup time exceeds the probe window. The fix is a startupProbe, not a larger initialDelaySeconds.

  5. Check for node and container resource pressure. Look at node CPU and memory conditions. Check if the container is being CPU-throttled or is near its memory limit. A throttled container cannot respond to probes quickly. A container nearing its memory limit may experience slow allocations or GC pressure.

  6. Check kubelet health on the node. If the kubelet is under CPU pressure or its sync loop is delayed, probe execution itself can lag. Check kubelet_sync_loop_duration_seconds and kubelet CPU usage on the node.

  7. Correlate restart times with application behavior. If restarts align with GC pauses, batch job spikes, or traffic surges, the probe is too sensitive for the application’s normal operating envelope.

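  8. Verify the probe endpoint manually. Use kubectl exec to open a shell in the pod and curl the probe endpoint locally. If it responds correctly inside the container but fails from the kubelet, suspect networking or port binding issues, for example an application that listens only on 127.0.0.1 while the kubelet sends HTTP probes to the pod IP.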
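
Step 2 of this flow can be partly automated. The sketch below sorts a probe failure message into the branches used above. The function name classify_probe_failure is hypothetical, and the matched substrings are assumptions based on commonly seen kubelet event messages; wording can differ between Kubernetes versions and probe types, so anything unmatched is reported as unknown.

```shell
# classify_probe_failure MESSAGE
# Map a "Liveness probe failed" event message to a diagnosis branch.
# The substrings are assumptions, not a stable kubelet interface.
classify_probe_failure() {
  case "$1" in
    *"context deadline exceeded"*|*"Client.Timeout"*) echo "timeout" ;;
    *"connection refused"*)                           echo "connection-refused" ;;
    *"statuscode"*)                                   echo "http-error" ;;
    *)                                                echo "unknown" ;;
  esac
}

# Example with a sample message (not taken from a real cluster)
classify_probe_failure 'Liveness probe failed: HTTP probe failed with statuscode: 503'
```

A timeout points toward resource pressure or kubelet lag, a refused connection toward the probe port or bind address, and an HTTP error toward what the endpoint itself checks.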

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Container restart count | Direct indicator of probe-induced restarts | Restart count increasing steadily |
| Container waiting reason CrashLoopBackOff | The container is being restarted repeatedly, with growing back-off between attempts | Pod enters or remains in CrashLoopBackOff |
| Kubelet probe failure rate | Shows raw liveness probe failures | prober_probe_total with result="failed" increasing |
| Container CPU throttling | Throttled containers respond slowly to probes | Ratio of container_cpu_cfs_throttled_periods_total to container_cpu_cfs_periods_total above 5% |
| Container memory usage | Near-limit memory causes slowdowns or OOM | Usage trending above 80% of limit |
| Node CPU utilization | High node CPU delays kubelet probe goroutines | Node CPU sustained above 80% |
| Kubelet sync loop duration | Slow sync delays all kubelet operations, including probes | kubelet_sync_loop_duration_seconds p99 above 10 seconds |
| Kubelet PLEG relist duration | PLEG lag can delay pod state propagation and probe scheduling | kubelet_pleg_relist_duration_seconds approaching the 3-minute threshold |
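
When no metrics pipeline is available, the throttling ratio can be approximated directly from a container's cpu.stat file on the node. This is a minimal sketch: the helper name throttle_pct and the sample file are hypothetical, and it assumes the file exposes nr_periods and nr_throttled counters. These counters are cumulative since the cgroup was created, so the result is a lifetime average and not a current rate.

```shell
# throttle_pct FILE
# Print throttled CFS periods as a percentage of all CFS periods.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"
    }
  ' "$1"
}

# Example with made-up counters
printf 'nr_periods 200\nnr_throttled 15\nthrottled_time 123456789\n' > sample-cpu.stat
throttle_pct sample-cpu.stat   # prints 7.5
```

A value above the 5% warning level in the table suggests the CPU limit is worth revisiting before touching the probe settings.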

Fixes

If the cause is aggressive probe configuration

Increase timeoutSeconds from the default of 1 to a value that covers your application’s p99 latency on the probe endpoint under load, typically 3 to 5 seconds. Increase periodSeconds only if you need to reduce probe overhead; lowering it detects failures sooner, and therefore restarts the container sooner, but increases kubelet load. Ensure failureThreshold allows for transient slowness without immediately restarting. Do not set failureThreshold to an artificially high number to mask a real problem; if the container is actually deadlocked, you want it restarted.
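
One way to apply such a change is a patch file, sketched below. Everything specific here is a placeholder chosen for illustration: the deployment name my-app, the container name app, the /healthz path, port 8080, and the numeric values. The sketch also assumes the existing probe is already an httpGet probe; patching an httpGet handler onto a tcpSocket or exec probe would produce an invalid spec.

```shell
# Write a strategic merge patch that relaxes the liveness probe.
# All names and values are placeholders, not recommendations.
cat > liveness-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
      - name: app              # must match the existing container name
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          timeoutSeconds: 5    # sized to cover p99 probe latency under load
          failureThreshold: 3  # tolerates transient slowness, still catches deadlocks
EOF

# Apply against a real cluster (commented out so the sketch has no side effects):
# kubectl patch deployment my-app --patch-file liveness-patch.yaml
```

Changing the pod template triggers a rollout, so expect one round of restarts when the patch is applied.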

If the cause is slow startup

Add a startupProbe that checks the same endpoint as the liveness probe. Choose failureThreshold and periodSeconds so that their product covers the worst-case startup duration. The kubelet will not run the liveness probe until the startup probe succeeds. This is the correct mechanism for slow-starting applications. Relying on a large initialDelaySeconds creates a fixed delay that does not adapt if the container starts faster or slower than expected.
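
The sizing rule can be sketched as a small calculation. The helper name startup_threshold, the 95-second worst case, and the probe path and port are assumptions for illustration; measure your own worst-case startup and add headroom as you see fit.

```shell
# startup_threshold WORST_CASE_SECONDS PERIOD_SECONDS
# Smallest failureThreshold whose budget (threshold * period) covers startup.
startup_threshold() {
  worst_case=$1
  period=$2
  echo $(( (worst_case + period - 1) / period ))   # integer ceiling division
}

threshold=$(startup_threshold 95 10)   # 10 attempts, a 100-second budget

# Print a startupProbe fragment using the computed value (placeholder path and port)
cat <<EOF
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: $threshold
EOF
```

Because the startup probe stops running as soon as it succeeds, a generous budget costs nothing for containers that start quickly.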

If the cause is resource pressure

Raise the container’s CPU limit if CFS throttling is occurring during probe execution. Raise the memory limit if the container is approaching it and experiencing GC pressure or OOM kills. Ensure the container has resource requests set so the scheduler does not pack it onto an already saturated node. For language runtimes with stop-the-world GC, tune the runtime’s memory settings to reduce pause duration.

If the cause is probe misconfiguration

Verify that the probe port matches the port the application is actually listening on. Verify that the HTTP path returns a status in the 200-399 range and does so quickly. Do not point a liveness probe at an endpoint that depends on downstream services, databases, or external APIs. A liveness probe should test whether the container itself is alive, not whether the entire dependency chain is healthy. Keep readiness probes separate: readiness should catch dependency failures, while liveness should catch deadlocks.
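
The separation between the two probes might look like the fragment below. The /livez and /readyz paths, port 8080, and the timeout values are hypothetical; the point is only that the two probes hit different endpoints with different responsibilities.

```shell
# Write a container-level fragment with separate liveness and readiness endpoints.
# Paths, port, and values are placeholders for illustration.
cat > probes-fragment.yaml <<'EOF'
livenessProbe:
  httpGet:
    path: /livez      # answers from in-process state only, no downstream calls
    port: 8080
  timeoutSeconds: 3
readinessProbe:
  httpGet:
    path: /readyz     # may check the database and other dependencies
    port: 8080
  timeoutSeconds: 3
EOF
```

With this split, a database outage removes the pod from service endpoints through readiness but does not restart a process that is otherwise fine.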

If the cause is kubelet or node pressure

If the node is CPU-saturated, the kubelet may not schedule probe goroutines promptly. If the node is under memory pressure, the kubelet may be busy evicting pods. If PLEG is unhealthy or the sync loop is slow, probe execution is delayed. Cordon the node if necessary and investigate why the kubelet cannot keep up. On dense nodes, reduce pod count or increase node resources.

Prevention

  • Define a startupProbe for any application that takes more than a few seconds to become ready.
  • Define liveness probes that check only internal application state, never external dependencies.
  • Monitor container restart counts per deployment and alert when they increase.
  • Set resource requests and limits based on observed startup and steady-state profiles, not guesswork.
  • Review probe configurations in CI before deployment; enforce minimum timeoutSeconds and appropriate failureThreshold values.
  • Monitor kubelet probe latency and node CPU saturation as leading indicators.
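
The CI review item above could start as a simple text check, sketched below. The function name check_min_timeout and the file names are hypothetical. It is a naive line scan, not a YAML parser: it assumes one "timeoutSeconds: N" per line and cannot see probes that omit the field and fall back to the default of 1 second.

```shell
# check_min_timeout MANIFEST MINIMUM
# Exit non-zero if any timeoutSeconds value in MANIFEST is below MINIMUM.
check_min_timeout() {
  awk -v min="$2" '
    $1 == "timeoutSeconds:" && ($2 + 0) < (min + 0) {
      printf "%s:%d: timeoutSeconds %s is below minimum %s\n", FILENAME, FNR, $2, min
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example: a sample manifest fragment that should be rejected
printf 'livenessProbe:\n  timeoutSeconds: 1\n  failureThreshold: 3\n' > bad-probe.yaml
check_min_timeout bad-probe.yaml 2 || echo "probe config rejected"
```

A policy engine or admission controller is a more robust place for this rule; the sketch only shows the shape of the check.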

How Netdata helps

  • Correlate container restart spikes with CPU throttling, memory pressure, and node saturation on the same timeline to distinguish probe failures from real crashes.
  • Monitor kubelet probe failure rates alongside pod health transitions without aggregating away per-pod behavior.
  • Track container CFS throttling and memory usage alongside pod phase changes to confirm resource pressure as the root cause.
  • Alert on node CPU, memory, and PID pressure that delays kubelet probe execution before it kills containers.

Diagnosis flowchart (Mermaid source):

flowchart TD
    A[Liveness probe fails] --> B{During startup?}
    B -->|Yes| C[Check startupProbe config]
    B -->|No| D{Timeout or error?}
    D -->|Timeout| E["Check resource pressure<br/>CPU throttling, GC pauses"]
    D -->|Connection refused| F[Check probe port and path]
    E --> G{Node or kubelet slow?}
    G -->|Yes| H[Check kubelet CPU and sync loop]
    G -->|No| I["Increase timeoutSeconds<br/>and failureThreshold"]
    F --> J[Fix probe target]
    C --> K[Add or tune startupProbe]
    H --> L[Relieve node pressure]
    I --> M[Monitor restart count]
    J --> M
    K --> M
    L --> M