You’ve deployed your application to Kubernetes, but a quick check on your pods reveals the dreaded `CrashLoopBackOff` status. Your pod is stuck in a restart loop, and your service is down. This isn’t an error in itself, but a status indicating that Kubernetes is trying to start a container, the container crashes, and Kubernetes waits an exponentially increasing amount of time before trying again. This back-off mechanism prevents a faulty pod from overwhelming the cluster with constant restart attempts.
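That quick check typically looks something like this; the pod name and timings below are placeholders, and the combination of a climbing RESTARTS count with the `CrashLoopBackOff` status is the tell-tale sign:

```
$ kubectl get pods
NAME                      READY   STATUS             RESTARTS   AGE
my-app-7d9c6bf54-x2k8p    0/1     CrashLoopBackOff   6          9m
```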
For any SRE or developer, seeing `CrashLoopBackOff` is a call to action. It means an underlying issue is preventing your container from running successfully. This guide will walk you through a systematic flowchart to diagnose the root cause and provide quick fixes for the most common problems.
What is `CrashLoopBackOff`?
When a pod’s container terminates, Kubernetes, based on the pod’s `restartPolicy` (which defaults to `Always`), attempts to restart it. If the container starts, runs, and then crashes again, Kubernetes enters a crash loop. To avoid consuming excessive resources, the kubelet introduces an exponential back-off delay between restart attempts, starting at 10 seconds and capped at five minutes. During this waiting period, the pod’s status shows as `CrashLoopBackOff`.
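For reference, `restartPolicy` is a pod-level field. A minimal pod manifest showing the default looks like the sketch below; the names and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # placeholder name
spec:
  restartPolicy: Always        # the default; OnFailure and Never are the alternatives
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0.0   # placeholder image
```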
It’s important to distinguish this from other states:
- `ImagePullBackOff`: Kubernetes cannot pull the container image from the registry. This happens before the container can even attempt to start.
- `Pending`: The pod has been accepted by the Kubernetes system, but one or more of its containers has not been created. This is often a scheduling issue.
- `Running`: The pod is bound to a node, and all its containers have been created. At least one container is still running, or is in the process of starting or restarting.
- `Succeeded` / `Failed`: All containers in the pod have terminated in success or failure and will not be restarted.
`CrashLoopBackOff` specifically tells you the container is starting but is exiting prematurely with an error.
The `CrashLoopBackOff` Troubleshooting Flowchart
When faced with a restarting pod, your goal is to find out why the container is crashing. Follow these steps methodically to pinpoint the issue.
Step 1: Get a High-Level Overview
Your first step should be to use the `kubectl describe pod` command for your specific pod. It provides a wealth of information, including the pod’s current state, recent events, and restart count.
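For example (the pod name is a placeholder; add `-n <namespace>` if the pod isn’t in your current namespace):

```bash
kubectl describe pod my-app-7d9c6bf54-x2k8p
```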
Look closely at these sections in the output:
- State: This will show the current status. For a crashing pod, you might see `Terminated` with an `Exit Code`.
- Last State: This shows the state of the container from its previous termination. This is key. It will often contain a reason and an exit code.
- Events: This is a log of events related to the pod. Look for messages from the scheduler, kubelet, and other components. You might see warnings about failed probes or resource issues. (You can also pull the events up on their own, as shown below.)
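If the `describe` output is long, a quick way to list just the pod’s events, oldest first, is a field-selector query like this (pod name is a placeholder):

```bash
kubectl get events --field-selector involvedObject.name=my-app-7d9c6bf54-x2k8p \
  --sort-by=.lastTimestamp
```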
The `Exit Code` in the `Last State` section is your first major clue.
Step 2: Analyze Container Exit Codes
The exit code tells you why the container terminated. Here are some of the most common ones:
- Exit Code 0: The container exited successfully. If it keeps restarting, it might be a short-lived job that isn’t configured correctly for a deployment that expects a long-running process.
- Exit Code 1: General application error. This is a catch-all for unhandled exceptions or generic failures within your application. This is your cue to check the application logs.
- Exit Code 137 (128 + 9): The container was terminated by a `SIGKILL` signal. In Kubernetes, this almost always means the container exceeded its memory limit, triggering the OOM (Out-of-Memory) Killer. The `describe` output might explicitly state the reason as `OOMKilled`.
- Exit Code 139 (128 + 11): Segmentation Fault. The container tried to access a memory address that was not assigned to it, often due to a bug in the application code.
- Exit Code 143 (128 + 15): The container received a `SIGTERM` signal, indicating a graceful shutdown request. If this leads to a crash loop, it could be that your liveness probe is misconfigured and Kubernetes is terminating the pod because it thinks it’s unhealthy.
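If you prefer to pull the last exit code directly rather than scanning the full `describe` output, a jsonpath query along these lines works; the pod name is a placeholder and the `[0]` index assumes a single-container pod:

```bash
kubectl get pod my-app-7d9c6bf54-x2k8p \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```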
Step 3: Check the Container Logs
If the exit code suggests an application-level problem (like Exit Code 1), your next step is to inspect the logs from the container. Since the pod is crashing, you need to retrieve logs from the previous failed instance using the `--previous` flag with the `kubectl logs` command.
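For example (pod and container names are placeholders):

```bash
# Logs from the previous, crashed instance of the container
kubectl logs my-app-7d9c6bf54-x2k8p --previous

# For multi-container pods, name the container explicitly with -c
kubectl logs my-app-7d9c6bf54-x2k8p -c my-app --previous
```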
These logs are your direct line to the application’s standard output and error streams. Look for stack traces, error messages, or any indication of what the application was doing right before it stopped. If the logs are empty, it could mean the application is crashing before it has a chance to log anything, which might point to a configuration issue.
Step 4: Review Configuration and Dependencies
Many `CrashLoopBackOff` issues stem from incorrect configuration.
Application Configuration
- ConfigMaps & Secrets: Are you mounting a ConfigMap or Secret? Ensure it exists and the pod has the correct permissions to access it. A typo in a volume mount path or key name can cause the application to fail on startup. You can use `kubectl describe` on the ConfigMap or Secret to verify its contents.
- Environment Variables: Double-check environment variables passed to the container. A missing or incorrect database connection string, API key, or feature flag can easily cause a crash.
- Command & Arguments: Review the `command` and `args` in your pod spec (see the sketch after this list). A typo or an incorrect path to an executable will cause an immediate exit.
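Here is a rough sketch of how these pieces typically appear in a pod spec; the names, keys, and paths are hypothetical, so substitute your own:

```yaml
containers:
  - name: my-app
    image: registry.example.com/my-app:1.0.0      # placeholder image
    command: ["/app/server"]                       # must exist inside the image
    args: ["--config", "/etc/my-app/config.yaml"]
    env:
      - name: DATABASE_URL                         # hypothetical variable
        valueFrom:
          secretKeyRef:
            name: my-app-secrets                   # Secret must exist in this namespace
            key: database-url                      # key name must match exactly
    volumeMounts:
      - name: config
        mountPath: /etc/my-app
volumes:
  - name: config
    configMap:
      name: my-app-config                          # ConfigMap must exist in this namespace
```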
Resource Availability
- Persistent Volumes (PVs): If your application relies on a Persistent Volume, ensure the `PersistentVolumeClaim` is bound and the volume is accessible. Issues with storage provisioning or access permissions can prevent an application from starting.
- Network Dependencies: Can your pod reach other services, databases, or external APIs it depends on? DNS issues or restrictive Network Policies can cause connection timeouts that lead to crashes. (A quick check for both is sketched after this list.)
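Two quick checks covering both points; the claim name, service hostname, and image tag are placeholders:

```bash
# The claim should report STATUS Bound, not Pending
kubectl get pvc my-app-data

# Spot-check in-cluster DNS resolution from a throwaway pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-database.default.svc.cluster.local
```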
Step 5: Investigate Health Probes
Kubernetes uses liveness and readiness probes to determine a pod’s health. Misconfigured probes are a very common cause of `CrashLoopBackOff`.
- Liveness Probe: This probe checks if the container is still running. If the liveness probe fails, the kubelet kills the container and restarts it.
- Readiness Probe: This probe checks if the container is ready to serve traffic. If it fails, the pod’s IP is removed from the service’s endpoints. While a readiness probe failure doesn’t directly cause a crash loop, it can indicate the same underlying problem that might eventually cause a liveness probe to fail.
Check your probe configuration in the pod spec:
- `initialDelaySeconds`: Does your application take longer to start than the initial delay? The probe might be checking for health before the app is ready, causing it to fail and restart the container.
- `timeoutSeconds`: Is the timeout too short? A slow network or a heavy startup process could cause the health check to time out.
- The Check Itself: Is the HTTP endpoint (`/healthz`), TCP port, or command you’re using for the probe correct? A typo in the path or port will lead to consistent failures.
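For context, a liveness probe with deliberately generous timing might look like this excerpt from a container spec; the endpoint, port, and values are illustrative:

```yaml
livenessProbe:
  httpGet:
    path: /healthz         # must match an endpoint your app actually serves
    port: 8080
  initialDelaySeconds: 30  # give the app time to finish starting
  periodSeconds: 10
  timeoutSeconds: 5        # allow for slow responses
  failureThreshold: 3      # consecutive failures before the container is restarted
```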
If you suspect a probe issue, you can temporarily remove the `livenessProbe` from your deployment manifest and re-apply it to see if the pod stabilizes.
Quick Fixes for Common Scenarios
- Problem: `OOMKilled` (Exit Code 137)
  - Fix: The container needs more memory. Increase `resources.limits.memory` in your pod spec (see the sketch after this list). Analyze your application’s memory footprint to set a realistic limit.
- Problem: Application Error (Exit Code 1)
  - Fix: Analyze logs from the previous container instance. The error is inside your application. Debug the code, fix the bug, build a new container image, and redeploy.
- Problem: Liveness Probe Fails
  - Fix: Adjust the probe settings. Increase `initialDelaySeconds` to give your app more time to start up. Increase `timeoutSeconds` if the check is slow. Verify the `httpGet` path, `tcpSocket` port, or `exec` command is correct.
- Problem: ConfigMap or Secret Not Found
  - Fix: Ensure the referenced ConfigMap or Secret exists in the same namespace as the pod. Check for typos in the names within your deployment manifest. Verify RBAC permissions allow the pod’s ServiceAccount to access these resources.
- Problem: Missing Dependencies
  - Fix: Check if your container image includes all necessary libraries and binaries. If you’re connecting to a database or another service, ensure the hostname is correct and resolvable via DNS within the cluster.
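For the `OOMKilled` case above, the memory settings live under the container’s `resources` block. The values below are placeholders to adjust based on your application’s real footprint:

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the container
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this gets the container OOM-killed (exit code 137)
```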
By following this diagnostic flow, you can move from the generic `CrashLoopBackOff` status to a specific, actionable root cause. This methodical approach saves time and reduces the frustration of debugging in a complex distributed system like Kubernetes.
Ready to gain deeper, real-time insights into your Kubernetes clusters and prevent issues like `CrashLoopBackOff` before they impact users? Get started with Netdata for free, open-source, high-granularity monitoring. Sign up today.