Kubernetes pod CrashLoopBackOff: causes, diagnosis, and fixes

CrashLoopBackOff means a container in a Pod has terminated after starting, and the kubelet is delaying the next restart with exponential backoff. The status describes behavior, not root cause. Underlying failures include application panics, OOM kills, misconfigured liveness probes, missing secrets, or node-level resource pressure.

Use pod status, previous container logs, node conditions, and kubelet events to narrow the cause. Monitor restart rate, node pressure, and probe failures to catch loops before they degrade capacity.

What this means

CrashLoopBackOff is the Waiting state reason reported by kubectl get pods when the kubelet delays the next restart attempt to avoid restart storms. For Deployments, the restart policy is Always, so any container termination triggers a restart. If the container cannot stay running, the kubelet enters the backoff loop.

The state is a symptom, not a diagnosis. The root cause lives in the container's last termination reason, previous logs, node resource conditions, or probe configuration. The backoff delay roughly doubles with each failure, typically capping at five minutes, and it resets only after the container runs successfully for a sustained period (about ten minutes by default).
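
To watch the loop and the growing delay directly, follow the pod and its events while it restarts. The pod and namespace names below are placeholders.

# Watch the STATUS column cycle between Running, Error, and CrashLoopBackOff
kubectl get pod ${POD_NAME} -n ${NAMESPACE} -w

# The kubelet records the increasing delay in "Back-off restarting failed container" events
kubectl get events -n ${NAMESPACE} --field-selector involvedObject.name=${POD_NAME} --watch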

Common causes

Cause | What it looks like | First thing to check
Application error or misconfiguration | Exit code 1; stack traces or config errors in --previous logs | kubectl logs <pod> --previous
OOM kill | Last State: OOMKilled; exit code 137; node may show MemoryPressure | kubectl describe pod and node memory conditions
Liveness probe failure | Events show Unhealthy; container restarts without crash logs | Probe config in pod spec and application health endpoint
Missing dependency or secret | Container exits immediately; volume or env errors in events | kubectl describe pod events and volume mounts
Node resource pressure | Eviction events, DiskPressure/MemoryPressure, or system OOM kills | kubectl describe node and dmesg
Read-only filesystem or permission error | Silent exit with no stack trace; app cannot write required path | --previous logs for permission denied errors

Quick checks

# Confirm the waiting reason and last exit code
kubectl get pod ${POD_NAME} -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}{"\n"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'

# Inspect the last terminated container's logs
kubectl logs ${POD_NAME} -c ${CONTAINER_NAME} --previous

# Check pod events for probe failures, mount errors, or back-off messages
kubectl get events --field-selector involvedObject.name=${POD_NAME}

# Check node conditions for pressure that may be evicting or starving the pod
kubectl describe node ${NODE_NAME} | grep -E "MemoryPressure|DiskPressure|PIDPressure"

# Look for system-level OOM kills that preceded the container restart
dmesg -T | grep -i "out of memory\|oom-killer"

# Compare pod resource usage against its limits (requires metrics-server)
kubectl top pod ${POD_NAME} -n ${NAMESPACE}

# Check if image pull or volume mount errors preceded the crash loop
kubectl describe pod ${POD_NAME} | grep -A 5 "Events:"

# Verify liveness and startup probe configuration
kubectl get pod ${POD_NAME} -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].startupProbe}{"\n"}'

How to diagnose it

  1. Identify the failing container and node. Note which container is in CrashLoopBackOff and which node it is running on. If the pod has multiple containers, the status block names the failing container.
  2. Check the last terminated state. kubectl describe pod shows the exit code, signal, and reason under Last State. OOMKilled means memory limits or node pressure. Exit code 1 indicates an application-level error. Exit code 137 means SIGKILL, which includes OOM but can also be an explicit kill. Steps 2 through 4 are consolidated into a short snippet after this list.
  3. Read the previous container logs. kubectl logs --previous shows application panics, missing files, or connection errors. If the logs are empty, the container may have crashed before initializing logging or the image entrypoint may be wrong.
  4. Read the pod events. The events block in kubectl describe pod includes kubelet messages about probe failures, volume mount problems, image pull issues, and back-off timing. Probe failures often appear here even when application logs look clean.
  5. Check node conditions. A node with MemoryPressure, DiskPressure, or PIDPressure evicts or starves pods. Even without explicit eviction, memory pressure increases the chance of cgroup or system OOM kills.
  6. Check for node-level OOM kills. Run dmesg on the node. If the kernel OOM killer fired, it may have killed the container process or a dependent process. This happens when actual memory usage on the node exceeds allocatable memory, even if individual container limits are not breached.
  7. Evaluate probe configuration. If the events show Unhealthy liveness probe failures, compare the probe’s initialDelaySeconds, timeoutSeconds, and failureThreshold to the application’s actual startup time. A missing startupProbe can cause liveness probes to kill a slow-starting container before it finishes initialization.
  8. Check resource limits versus usage. If the container is OOMKilled, verify whether the memory limit is simply too low or whether the application has a leak. If the container is not OOMKilled but the node has memory pressure, the application may be failing due to memory starvation or throttling.
  9. Check for volume and image issues. If logs are absent and the exit code is non-descriptive, look for FailedMount or FailedAttachVolume events. A container that cannot write to a read-only root filesystem or a projected volume that failed to mount exits immediately without application logs.
  10. Correlate with runtime and kubelet health. If multiple pods on the same node are in CrashLoopBackOff simultaneously, check kubelet logs, container runtime responsiveness (crictl info), and PLEG relist duration. Runtime slowness can cause probes to time out and containers to be restarted.
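
Steps 2 through 4 can be gathered in one pass. The snippet below is a minimal sketch; pod, container, and namespace names are placeholders, and it assumes kubectl access to the affected cluster.

# Last termination reason and exit code for every container in the pod
kubectl get pod ${POD_NAME} -n ${NAMESPACE} -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'

# Tail of the previous container's logs
kubectl logs ${POD_NAME} -c ${CONTAINER_NAME} -n ${NAMESPACE} --previous --tail=50

# Recent events, including probe failures, mount errors, and back-off messages
kubectl get events -n ${NAMESPACE} --field-selector involvedObject.name=${POD_NAME} --sort-by=.lastTimestamp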

Metrics and signals to monitor

Signal | Why it matters | Warning sign
Container restart rate | Restart storms destabilize service capacity | Restart count increases by more than 5 in 10 minutes
OOM kill events | Cgroup or system OOM kills produce CrashLoopBackOff | lastState.terminated.reason equals OOMKilled or dmesg shows OOM kills
Liveness and readiness probe failures | Probe failures trigger restarts independent of application crashes | kubelet_prober_probe_total failures increase, or events show Unhealthy
Node MemoryPressure and DiskPressure | Pressure causes evictions and OOM kills that appear as pod crashes | Node condition MemoryPressure or DiskPressure is True
Pod startup latency | Slow-starting containers are killed by probes before initialization finishes | kubelet_pod_start_duration_seconds p99 above 30 seconds
CPU throttling | Throttled containers may miss probe deadlines | container_cpu_cfs_throttled_periods_total ratio above 25% sustained
Image pull errors | Image pull failures block pod starts and can masquerade as crashes | Pods in ImagePullBackOff or CRI pull image errors increase
Volume mount latency | Stuck mounts block container startup and can trigger probe timeouts | storage_operation_duration_seconds above 2 minutes or pods stuck in ContainerCreating

Fixes

If the cause is an application error or misconfiguration

Fix the code, configuration file, environment variable, or secret reference. Use kubectl logs --previous to identify the exact error. Validate changes in staging before rolling out. If the error is a missing command-line flag or bad entrypoint, correct the container args or the Dockerfile CMD.
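
If the crash traces back to a bad flag or a broken configuration rollout, rolling back first and then rolling forward with the corrected image is usually the fastest safe path. The deployment and image names below are placeholders.

# Roll back to the previous ReplicaSet while the fix is prepared
kubectl rollout undo deployment/my-app -n ${NAMESPACE}

# Roll forward with the corrected image once it has passed staging
kubectl set image deployment/my-app app=registry.example.com/my-app:v1.2.4 -n ${NAMESPACE}
kubectl rollout status deployment/my-app -n ${NAMESPACE}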

If the cause is an OOM kill

Raise the container memory limit if the working set is legitimately larger than the limit. If the application has a memory leak, fix the leak rather than endlessly raising the limit. For JVM workloads, ensure the heap size leaves headroom for native memory and thread stacks inside the container limit. If the node itself is under MemoryPressure, cordon the node and drain workloads to relieve pressure. Drain is disruptive; plan for pod rescheduling.
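
If the working set legitimately needs more room, the limit can be raised in place. A minimal sketch, assuming a Deployment named my-app with a container named app (placeholder names):

# Compare observed usage against the current limit first (requires metrics-server)
kubectl top pod -l app=my-app -n ${NAMESPACE}

# Raise the memory request and limit for the container
kubectl set resources deployment/my-app -c app \
  --requests=memory=512Mi --limits=memory=1Gi -n ${NAMESPACE}

# For JVM workloads, also cap the heap below the new limit, for example with
# -XX:MaxRAMPercentage=75.0, so native memory and thread stacks still fit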

If the cause is a liveness probe failure

Separate liveness from readiness checks. Liveness should detect deadlocks, not temporary load. Add a startupProbe if the application takes longer than the liveness probe’s initial delay to become healthy. Increase timeoutSeconds if the health endpoint is slow, but ensure the endpoint itself is lightweight. Do not point liveness probes at downstream dependencies.
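
A common pattern is to gate the liveness probe behind a startup probe so a slow initializer is not killed mid-boot. A minimal sketch using a strategic merge patch; the Deployment name, container name, port, and /healthz path are placeholders for your own values.

cat > probe-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
      - name: app
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 30   # tolerates up to ~150s of startup before liveness takes over
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
EOF
kubectl patch deployment my-app -n ${NAMESPACE} --patch-file probe-patch.yaml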

If the cause is node resource pressure

Identify the pressure signal: memory, disk, or PID. For disk pressure, clean up container logs, unused images, and orphaned volumes. For memory pressure, evict or reschedule heavy pods, or scale the node pool. Set kube-reserved and system-reserved so the eviction manager has accurate headroom calculations.
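
If a node has to come out of rotation to relieve pressure, cordon it first, then drain. The node name is a placeholder; drain evicts pods, so plan for rescheduling.

# Stop new pods from being scheduled onto the node
kubectl cordon ${NODE_NAME}

# Evict existing pods; DaemonSet pods stay, emptyDir contents are lost
kubectl drain ${NODE_NAME} --ignore-daemonsets --delete-emptydir-data

# For disk pressure, reclaim space by pruning unused images
# (run on the node itself; requires crictl)
crictl rmi --prune

# Return the node to service once pressure clears
kubectl uncordon ${NODE_NAME}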

If the cause is a missing dependency or volume

Verify that ConfigMaps, Secrets, and persistent volumes referenced in the spec exist and are bound. Check that projected volumes are mounting correctly and that the application has permissions to read them. If the container needs a writable path, ensure the volume is mounted at the correct path and that the security context allows writes.
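
A quick way to verify the references is to list what the pod expects and confirm each object exists and is bound. Names are placeholders.

# Volumes the pod references and the ConfigMap, Secret, or PVC behind each one
kubectl get pod ${POD_NAME} -n ${NAMESPACE} -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.configMap.name}{.secret.secretName}{.persistentVolumeClaim.claimName}{"\n"}{end}'

# Confirm the referenced objects exist and PVCs report Bound
kubectl get configmap,secret,pvc -n ${NAMESPACE}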

If the cause is an image issue

Use explicit image tags instead of latest to avoid silent binary mismatches. Verify the image architecture matches the node architecture. If the image is missing shared libraries, rebuild the image or use a base image that includes the required dependencies.
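
To rule out an architecture mismatch, compare each node's reported architecture with the platforms published for the image. The image reference is a placeholder, and docker manifest inspect assumes a local Docker CLI with access to the registry.

# Architecture reported by each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture

# Platforms included in the image manifest (multi-arch images list several)
docker manifest inspect registry.example.com/my-app:v1.2.4 | grep -A2 '"platform"'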

Prevention

  • Set realistic resource requests and limits based on measured usage profiles, not guesses. Revisit them after every major release.
  • Use startup probes for any container that needs more than a few seconds to initialize. Gate liveness probes behind the startup probe.
  • Keep liveness probes simple and independent of external services. A liveness probe should fail only when the process itself is deadlocked.
  • Monitor node pressure conditions proactively. Memory and disk pressure cause cascading restarts that look like application failures.
  • Rotate container logs and limit emptyDir size to prevent disk pressure from silently evicting pods.
  • Validate pod specs in CI for missing volume mounts, incorrect secret names, and broken image references before they reach the cluster; a server-side dry run, sketched after this list, catches many of these early.
  • Configure PodDisruptionBudgets to prevent node maintenance or autoscaling events from concentrating pods and causing resource pressure spikes.
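
A minimal CI validation step, assuming the pipeline can reach a cluster (or a disposable one) for server-side validation; kubeconform is mentioned as one example of an offline schema validator and is optional. The k8s/ manifest directory is a placeholder.

# Server-side dry run: the API server validates the manifests without persisting them
kubectl apply --dry-run=server -f k8s/

# Offline schema validation when no cluster is reachable from CI
kubeconform -strict -summary k8s/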

How Netdata helps

  • Correlates pod restart spikes with node-level memory pressure, disk I/O latency, and CPU throttling to distinguish app bugs from infrastructure saturation.
  • Surfaces OOM kill events and per-cgroup memory usage versus limits without manual kubectl and dmesg queries across nodes.
  • Shows container runtime operation latency and PLEG relist duration to identify kubelet or runtime slowness causing probe timeouts.
  • Tracks per-container CPU throttling rates and disk wait times to reveal resource contention that triggers health-check failures.
  • Alerts on node conditions such as MemoryPressure and DiskPressure before they cascade into eviction loops.