Docker container keeps restarting: causes, checks, and fixes
A container that restarts every few seconds is not self-healing. It is a crash loop that wastes CPU, floods logs, and masks the real failure. Docker’s restart policy can hide whether the application is OOM-killed, segfaulting, or waiting for a dependency that never arrives. This guide shows how to read the signals, map exit codes to causes, and stop the loop before it degrades the host.
What this means
When a container’s main process exits or is killed, Docker increments RestartCount in the container metadata and starts a new instance if the restart policy allows. The count persists across daemon restarts. A container that crashes immediately on every start creates a death spiral: each cycle truncates ephemeral state, writes new log data, and adds overhead to the daemon and storage driver. The container may appear to be “running” for a few seconds before it disappears again, which makes the failure easy to miss if you only check whether the process exists.
Restart loops are not always caused by the application itself. OOM kills, CPU throttling, misconfigured health checks, missing dependencies, and daemon stress can all produce the same symptom. The key is to distinguish the cause before you try to fix it.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| OOM kill | Exit code 137, OOMKilled true, memory at limit | docker inspect for OOMKilled and memory limit |
| Application crash or segfault | Exit code 1 or 139, logs show panic or stack trace | docker logs --tail for crash output |
| SIGTERM race or health check kill | Exit code 143, sometimes followed by 137 | Health check config and stop timeout |
| Missing dependency | Exit code 1 immediately on start, logs show connection refused | Reachability of linked services from the container |
| Misconfiguration | Exit code 126 or 127, image or entrypoint changed | docker inspect for entrypoint and image tag |
| CPU throttling or resource starvation | Slow starts, health checks timeout, exit 1 | docker stats and cgroup cpu.stat for throttling |
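As a rough triage helper, the exit-code column of the table can be folded into a small shell function. This is a sketch: the hint strings are illustrative summaries of the table, not Docker output.

```shell
# Map a container exit code to a likely cause, following the table above.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
classify_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error: check logs for crash or missing dependency" ;;
    126) echo "command not executable: check entrypoint permissions" ;;
    127) echo "command not found: check entrypoint path and image tag" ;;
    137) echo "SIGKILL: check OOMKilled in docker inspect" ;;
    139) echo "SIGSEGV: native crash, check logs for stack trace" ;;
    143) echo "SIGTERM: health check kill or stop timeout" ;;
    *)   echo "unclassified exit code $1" ;;
  esac
}

# Example: feed it the exit code from docker inspect
classify_exit 137
```

Pairing this with `docker inspect --format '{{.State.ExitCode}}'` gives a one-line first guess before deeper digging.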
Quick checks
Run these checks in order. They are all read-only unless noted.
# Check restart count, exit code, and OOM status
docker inspect --format '{{.RestartCount}} {{.State.ExitCode}} {{.State.OOMKilled}}' <container_id>
# List containers currently in a restart loop
docker ps --filter "status=restarting" --format '{{.Names}} {{.Status}}'
# Read recent logs without crashing the client on multi-GB files
docker logs --tail 100 <container_id>
# Check memory usage vs limit
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" <container_id>
# Check CPU throttling (cgroup v2 path; for v1 use /sys/fs/cgroup/cpu,cpuacct/docker/<id>/cpu.stat)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <container_id>)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
# Check for OOM events in the kernel log
dmesg | grep -i "oom\|killed process"
# Stream OOM events from the daemon
docker events --filter event=oom --since 1h
# Check log file size on disk (json-file driver)
ls -lh /var/lib/docker/containers/${CONTAINER_ID}/${CONTAINER_ID}-json.log
# Check health check state
docker inspect --format '{{json .State.Health}}' <container_id>
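The cpu.stat check above is easier to read as a ratio of throttled periods to total periods. A minimal sketch, using a sample file in place of the real cgroup path since the values here are illustrative:

```shell
# Stand-in for /sys/fs/cgroup/system.slice/docker-<id>.scope/cpu.stat;
# point the awk command at the real path on a live host.
cat > /tmp/cpu.stat.sample <<'EOF'
usage_usec 8120000
user_usec 6100000
system_usec 2020000
nr_periods 1000
nr_throttled 240
throttled_usec 950000
EOF

# Fraction of CFS scheduler periods in which the container was throttled
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2}
     END { if (p > 0) printf "throttled in %.1f%% of periods\n", 100*t/p }' \
    /tmp/cpu.stat.sample
```

A sustained double-digit percentage on a latency-sensitive container is a strong hint that health checks are timing out because of throttling, not because the application is broken.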
How to diagnose it
Follow this flow to classify the failure before you change anything.
1. Confirm the loop is active. Run docker ps --filter status=restarting or compare RestartCount over a short window. A stable count means the restarts have stopped. A climbing count means the loop is active.
2. Read the exit code. This is the single most important signal.
   - 137 means SIGKILL. Check OOMKilled. If true, the kernel killed the container for exceeding its memory limit. If false, something else sent SIGKILL.
   - 139 means SIGSEGV. The application crashed in native code.
   - 143 means SIGTERM. This is normal for a graceful stop, but in a restart loop it usually means a health check failed or the stop timeout was reached.
   - 1 is a generic application error. The process exited on its own because of a bug, a missing config file, or a dependency that is unreachable.
   - 126/127 means the command is not executable or not found. This is usually a broken image or a changed entrypoint.
3. Check recency. docker inspect --format '{{.State.StartedAt}}' tells you whether the restarts are happening back-to-back. If the container runs for 30 seconds and then exits, it may be a slow dependency timeout. If it exits within 1 second, it is likely a crash on startup.
4. Read the logs. Use docker logs --tail 100 to avoid loading multi-gigabyte log files into the client. Look for stack traces, connection errors, or “out of memory” messages.
5. Check resources. If the exit code is 137 but OOMKilled is false, look at CPU throttling. High nr_throttled in cpu.stat with moderate CPU usage means the CFS bandwidth controller is pausing the container, which can cause health checks to time out and trigger a restart. Also check whether /var/lib/docker/ is full; disk pressure can cause writes to fail and the application to crash.
6. Check dependencies. If the logs show connection refused or DNS errors, test reachability from inside the container namespace. A missing database or a stale DNS entry can cause the app to exit immediately.
7. Check health checks. A container with a short start_period may be marked unhealthy before it finishes initializing. The orchestrator or an operator then sends SIGTERM, producing exit code 143 and a restart. Verify the health check command, the interval, and the start_period.
8. Pause the restart policy if needed. If the container restarts so fast that you cannot inspect it, run docker update --restart=no <container_id> to freeze the state. Then start it manually, let it fail, and inspect the logs and filesystem before it restarts.
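The freeze-and-inspect approach can be run as a short command sequence. This is a sketch against a live daemon (not runnable here); <container_id> is a placeholder, and the policy restored at the end is one reasonable choice, not a requirement.

```shell
docker update --restart=no <container_id>            # freeze the restart policy
docker start <container_id>                          # start it once, manually
docker wait <container_id>                           # blocks until exit, then prints the exit code
docker logs --tail 100 <container_id>                # read the final output at leisure
docker update --restart=on-failure:5 <container_id>  # restore a bounded policy when done
```

docker wait is useful here because it both pauses the shell until the crash and hands you the exit code to classify.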
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Container RestartCount | Identifies crash loops | Increasing by more than 1 in 10 minutes for a stable workload |
| Container exit code | Classifies the failure | Any nonzero code for a long-running container |
| OOMKilled status | Distinguishes OOM from external SIGKILL | true |
| Container memory usage vs limit | Predicts OOM before it happens | Sustained usage above 80% of limit |
| CPU throttling (nr_throttled) | Silent cause of slow starts and health failures | Any sustained increase for latency-sensitive workloads |
| Container health check status | Catches “running but broken” states | unhealthy |
| Docker daemon response latency | Rules out daemon stress causing timeouts | Sustained latency above 1 second |
| Container log file size | Restart loops generate massive logs | Above 1 GB without rotation |
| Container start latency | Slow starts can trigger health-check kills | Consistently above 30 seconds |
Fixes
If the cause is OOM kill
Temporarily raise the memory limit to stop the crash loop: docker update --memory <new_limit> <container_id>. Then investigate whether the limit is simply too low or the application has a leak. For JVM workloads, set the heap to roughly 75% of the container limit and leave headroom for metaspace and native memory.
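The 75% rule can be computed directly from the container limit. A minimal sketch with a hard-coded limit; on a live host the value would come from docker inspect as shown in the comment.

```shell
# The limit would normally come from:
#   docker inspect --format '{{.HostConfig.Memory}}' <container_id>
# Hard-coded here for illustration.
limit_bytes=536870912                                 # 512 MiB container limit

# Heap at ~75% of the limit, leaving headroom for metaspace and native memory
heap_mb=$(( limit_bytes * 75 / 100 / 1024 / 1024 ))
jvm_flag="-Xmx${heap_mb}m"
echo "$jvm_flag"
```

Newer JVMs can do this internally via -XX:MaxRAMPercentage=75.0, which avoids recomputing the flag when the container limit changes.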
If the cause is an application crash or segfault
Capture the stack trace from docker logs --tail. Exit code 139 indicates a segfault; look for incompatible native libraries or corrupt input data. Exit code 1 usually means an unhandled exception or a missing config file.
If the cause is a SIGTERM race or health check failure
Increase the health check start_period so the application has enough time to initialize before the first probe. Increase the Docker stop timeout if the application needs more than the default 10 seconds to shut down gracefully. Verify the health check command is correct and not overly aggressive.
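On plain Docker these knobs can be set at run time. A sketch only: the image name, port, and /healthz path are hypothetical, and the values should be tuned to the application's real startup time.

```shell
docker run -d \
  --health-cmd 'curl -fsS http://localhost:8080/healthz || exit 1' \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 60s \
  --stop-timeout 30 \
  myapp:latest
```

The start period suppresses failure-counting during initialization, and the raised stop timeout gives the process 30 seconds to exit on SIGTERM before Docker escalates to SIGKILL.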
If the cause is a missing dependency
Fix the startup order. Do not rely on restart policies to retry until a dependency is ready. Use explicit orchestration dependencies or init containers. Verify DNS resolution and network connectivity from inside the container.
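One way to make the startup order explicit is a small wait loop in the entrypoint. A sketch: the host, port, and deadline are illustrative, and it relies on bash's /dev/tcp so it needs no extra tools in the image.

```shell
# Block until a TCP dependency accepts connections, or give up after a deadline.
wait_for() {
  local host=$1 port=$2 deadline=$3 start
  start=$(date +%s)
  # retry until the TCP connect succeeds or the deadline passes
  while ! (echo > "/dev/tcp/$host/$port") 2>/dev/null; do
    if [ $(( $(date +%s) - start )) -ge "$deadline" ]; then
      echo "dependency $host:$port not reachable after ${deadline}s" >&2
      return 1
    fi
    sleep 1
  done
  echo "dependency $host:$port is up"
}

# e.g. in an entrypoint, before exec-ing the app:
#   wait_for db 5432 60 || exit 1
```

Failing fast with a clear message is far easier to diagnose than an exit code 1 loop driven by the restart policy.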
If the cause is misconfiguration
Check the entrypoint and command in docker inspect. If the image tag was recently updated, verify the binary path and permissions did not change. Exit codes 126 and 127 point directly to permission or path issues.
If the cause is CPU throttling or resource starvation
Increase the CPU quota with docker update --cpus, or remove the limit if noisy-neighbor risk is managed elsewhere. If the application uses a garbage-collected runtime, GC pauses can consume the entire CFS quota in a burst. Consider pinning latency-sensitive workloads to specific cores with cpuset instead of using CFS limits.
Prevention
- Configure log rotation for the json-file driver by setting max-size and max-file in daemon.json, or switch to the local log driver.
- Monitor RestartCount and alert on any increase for containers that should be stable.
- Set memory limits with headroom. Monitor anonymous memory growth in memory.stat, not just total usage, to catch leaks early.
- Configure health checks with realistic start_period values. Do not let a health check kill a container that is still starting.
- Clean up stopped containers and unused images automatically. Exited containers consume disk space and clutter state.
- Treat restart policies as a safety net, not a fix. A container that needs to restart repeatedly is broken and should be investigated.
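The rotation settings mentioned above fit in a small daemon.json fragment (typically /etc/docker/daemon.json; the size and file-count values here are illustrative and the daemon must be restarted to apply them):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

Note that log options set in daemon.json apply only to containers created after the change; existing containers keep the options they were created with.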
How Netdata helps
Netdata correlates container RestartCount spikes with memory usage, CPU throttling, and OOM kill events on the same timeline so you can see whether a restart loop is caused by resource pressure or an application bug. Alerts on exit code 137 with OOMKilled: true distinguish memory kills from external SIGKILL, and daemon API latency charts help rule out Docker stress as the cause of health-check timeouts. Log size monitoring for json-file drivers also catches runaway growth before it fills the disk.