Docker container keeps restarting: causes, checks, and fixes

A container that restarts every few seconds is not self-healing. It is a crash loop that wastes CPU, floods logs, and masks the real failure. Docker’s restart policy can hide whether the application is OOM-killed, segfaulting, or waiting for a dependency that never arrives. This guide shows how to read the signals, map exit codes to causes, and stop the loop before it degrades the host.

What this means

When a container’s main process exits or is killed, Docker increments RestartCount in the container metadata and starts a new instance if the restart policy allows. The count persists across daemon restarts. A container that crashes immediately on every start creates a death spiral: each cycle truncates ephemeral state, writes new log data, and adds overhead to the daemon and storage driver. The container may appear to be “running” for a few seconds before it disappears again, which makes the failure easy to miss if you only check whether the process exists.
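
The policy and the counter are both visible in the container metadata. A quick read-only check, using the same docker inspect format syntax as the commands later in this guide:

# Show the restart policy, its retry cap, and how many times the container has restarted
docker inspect --format '{{.HostConfig.RestartPolicy.Name}} max-retries={{.HostConfig.RestartPolicy.MaximumRetryCount}} restarts={{.RestartCount}}' <container_id>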

Restart loops are not always caused by the application itself. OOM kills, CPU throttling, misconfigured health checks, missing dependencies, and daemon stress can all produce the same symptom. The key is to distinguish the cause before you try to fix it.

Common causes

Cause | What it looks like | First thing to check
OOM kill | Exit code 137, OOMKilled true, memory at limit | docker inspect for OOMKilled and memory limit
Application crash or segfault | Exit code 1 or 139, logs show panic or stack trace | docker logs --tail for crash output
SIGTERM race or health check kill | Exit code 143, sometimes followed by 137 | Health check config and stop timeout
Missing dependency | Exit code 1 immediately on start, logs show connection refused | Reachability of linked services from the container
Misconfiguration | Exit code 126 or 127, image or entrypoint changed | docker inspect for entrypoint and image tag
CPU throttling or resource starvation | Slow starts, health checks time out, exit 1 | docker stats and cgroup cpu.stat for throttling

Quick checks

Run these checks in order. They are all read-only unless noted.

# Check restart count, exit code, and OOM status
docker inspect --format '{{.RestartCount}} {{.State.ExitCode}} {{.State.OOMKilled}}' <container_id>

# List containers currently in a restart loop
docker ps --filter "status=restarting" --format '{{.Names}} {{.Status}}'

# Read recent logs without crashing the client on multi-GB files
docker logs --tail 100 <container_id>

# Check memory usage vs limit
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" <container_id>

# Check CPU throttling (cgroup v2 path; for v1 use /sys/fs/cgroup/cpu,cpuacct/docker/<id>/cpu.stat)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <container_id>)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat

# Check for OOM events in the kernel log
dmesg | grep -i "oom\|killed process"

# Stream OOM events from the daemon
docker events --filter event=oom --since 1h

# Check log file size on disk (json-file driver)
ls -lh /var/lib/docker/containers/${CONTAINER_ID}/${CONTAINER_ID}-json.log

# Check health check state
docker inspect --format '{{json .State.Health}}' <container_id>

How to diagnose it

Follow this flow to classify the failure before you change anything. A small script that pulls the key signals together appears after the list.

  1. Confirm the loop is active. Run docker ps --filter status=restarting or compare RestartCount over a short window. A stable count means the restarts have stopped. A climbing count means the loop is active.

  2. Read the exit code. This is the single most important signal.

    • 137 means SIGKILL. Check OOMKilled. If true, the kernel killed the container for exceeding its memory limit. If false, something else sent SIGKILL.
    • 139 means SIGSEGV. The application crashed in native code.
    • 143 means SIGTERM. This is normal for a graceful stop, but in a restart loop it usually means a health check failed or the stop timeout was reached.
    • 1 is a generic application error. The process exited on its own because of a bug, a missing config file, or a dependency that is unreachable.
    • 126/127 means the command is not executable or not found. This is usually a broken image or a changed entrypoint.
  3. Check recency. docker inspect --format '{{.State.StartedAt}} {{.State.FinishedAt}}' shows how long each run lasts and whether the restarts are happening back-to-back. If the container runs for 30 seconds and then exits, it may be a slow dependency timeout. If it exits within 1 second, it is likely a crash on startup.

  4. Read the logs. Use docker logs --tail 100 to avoid loading multi-gigabyte log files into the client. Look for stack traces, connection errors, or “out of memory” messages.

  5. Check resources. If the exit code is 137 but OOMKilled is false, look at CPU throttling. High nr_throttled in cpu.stat with moderate CPU usage means the CFS bandwidth controller is pausing the container, which can cause health checks to time out and trigger a restart. Also check whether /var/lib/docker/ is full; disk pressure can cause writes to fail and the application to crash.

  6. Check dependencies. If the logs show connection refused or DNS errors, test reachability from inside the container namespace. A missing database or a stale DNS entry can cause the app to exit immediately.

  7. Check health checks. A container with a short start_period may be marked unhealthy before it finishes initializing. The orchestrator or an operator then sends SIGTERM, producing exit code 143 and a restart. Verify the health check command, the interval, and the start_period.

  8. Pause the restart policy if needed. If the container restarts so fast that you cannot inspect it, run docker update --restart=no <container_id> to freeze the state. Then start it manually, let it fail, and inspect the logs and filesystem before it restarts.
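
The first few steps can be scripted. Below is a minimal triage sketch; it assumes bash and GNU date on the host and a container that has exited at least once, and the duration threshold and messages are illustrative, not Docker conventions:

#!/usr/bin/env bash
# Classify the last exit of one container: exit code, OOM status, run duration.
# Usage: ./triage.sh <container_id>
set -euo pipefail
ID="$1"

read -r EXIT_CODE OOM STARTED FINISHED < <(docker inspect --format \
  '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.StartedAt}} {{.State.FinishedAt}}' "$ID")

# How long the last run lasted, in seconds (GNU date assumed)
DURATION=$(( $(date -d "$FINISHED" +%s) - $(date -d "$STARTED" +%s) ))
echo "exit=$EXIT_CODE oom=$OOM duration=${DURATION}s"

case "$EXIT_CODE" in
  137)
    if [ "$OOM" = "true" ]; then
      echo "OOM kill: raise the memory limit or fix the leak"
    else
      echo "External SIGKILL: check health checks, stop timeout, or another operator"
    fi
    ;;
  139) echo "SIGSEGV: native crash, check logs and library versions" ;;
  143) echo "SIGTERM: graceful stop requested, check health checks and stop timeout" ;;
  126|127) echo "Command not executable or not found: check image and entrypoint" ;;
  0) echo "Clean exit: the process finished on its own" ;;
  *) echo "Application error: read the logs" ;;
esac

# Near-instant exits usually mean a crash on startup (illustrative threshold)
if [ "$DURATION" -le 1 ]; then
  echo "Exits almost immediately: likely a startup crash or missing dependency"
fi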

Metrics and signals to monitor

Signal | Why it matters | Warning sign
Container RestartCount | Identifies crash loops | Increasing by more than 1 in 10 minutes for a stable workload
Container exit code | Classifies the failure | Any nonzero code for a long-running container
OOMKilled status | Distinguishes OOM from external SIGKILL | true
Container memory usage vs limit | Predicts OOM before it happens | Sustained usage above 80% of limit
CPU throttling (nr_throttled) | Silent cause of slow starts and health failures | Any sustained increase for latency-sensitive workloads
Container health check status | Catches “running but broken” states | unhealthy
Docker daemon response latency | Rules out daemon stress causing timeouts | Sustained latency above 1 second
Container log file size | Restart loops generate massive logs | Above 1 GB without rotation
Container start latency | Slow starts can trigger health-check kills | Consistently above 30 seconds

Fixes

If the cause is OOM kill

Temporarily raise the memory limit to stop the crash loop: docker update --memory <new_limit> <container_id>. Then investigate whether the limit is simply too low or the application has a leak. For JVM workloads, set the heap to roughly 75% of the container limit and leave headroom for metaspace and native memory.
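
A sketch of both steps with illustrative values; MaxRAMPercentage is available on recent JDKs:

# Stop the crash loop: raise the memory limit (keep memory and swap limits consistent)
docker update --memory 1g --memory-swap 1g <container_id>

# Size the JVM heap relative to the container limit, leaving ~25% headroom
# for metaspace, thread stacks, and native memory
java -XX:MaxRAMPercentage=75.0 -jar app.jar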

If the cause is an application crash or segfault

Capture the stack trace from docker logs --tail. Exit code 139 indicates a segfault; look for incompatible native libraries or corrupt input data. Exit code 1 usually means an unhandled exception or a missing config file.
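
One way to surface the crash output without paging through the whole log; the grep pattern is only a starting point and assumes common runtime keywords:

# docker logs sends the container's stderr to the client's stderr, so merge streams before filtering
docker logs --tail 200 <container_id> 2>&1 | grep -iE "panic|segfault|traceback|fatal|exception"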

If the cause is a SIGTERM race or health check failure

Increase the health check start_period so the application has enough time to initialize before the first probe. Increase the Docker stop timeout if the application needs more than the default 10 seconds to shut down gracefully. Verify the health check command is correct and not overly aggressive.
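
For containers started with docker run, the probe timing and stop timeout can be set on the command line. The values below are illustrative, and the health command assumes the image ships curl and exposes /health on port 8080:

# Give the app 60s before failed probes count, and 30s to shut down on SIGTERM
docker run -d \
  --health-cmd 'curl -fsS http://localhost:8080/health || exit 1' \
  --health-interval 10s \
  --health-timeout 3s \
  --health-retries 3 \
  --health-start-period 60s \
  --stop-timeout 30 \
  <image>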

If the cause is a missing dependency

Fix the startup order. Do not rely on restart policies to retry until a dependency is ready. Use explicit orchestration dependencies or init containers. Verify DNS resolution and network connectivity from inside the container.
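
A quick reachability test from inside the container's own network namespace; it assumes the image includes bash, getent, and coreutils timeout, and the host name db and port 5432 are placeholders:

# Does the dependency's name resolve inside the container?
docker exec <container_id> getent hosts db

# Is the TCP port reachable? Uses bash's /dev/tcp in case the image has no curl or nc
docker exec <container_id> bash -c 'timeout 3 bash -c "exec 3<>/dev/tcp/db/5432" && echo open || echo closed'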

If the cause is misconfiguration

Check the entrypoint and command in docker inspect. If the image tag was recently updated, verify the binary path and permissions did not change. Exit codes 126 and 127 point directly to permission or path issues.
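
The relevant inspect fields show exactly what Docker tried to run; the binary path in the last command is a placeholder, and the final check assumes the image has a shell:

# What image and entrypoint/command is the container configured with?
docker inspect --format 'image={{.Config.Image}} entrypoint={{.Config.Entrypoint}} cmd={{.Config.Cmd}}' <container_id>

# Does the binary still exist and is it executable in that image?
docker run --rm --entrypoint sh <image> -c 'ls -l /usr/local/bin/myapp'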

If the cause is CPU throttling or resource starvation

Increase the CPU quota with docker update --cpus, or remove the limit if noisy-neighbor risk is managed elsewhere. If the application uses a garbage-collected runtime, GC pauses can consume the entire CFS quota in a burst. Consider pinning latency-sensitive workloads to specific cores with cpuset instead of using CFS limits.
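
Both options in sketch form, with illustrative values:

# Raise the CFS quota to two CPUs' worth of time per period
docker update --cpus 2 <container_id>

# Or pin latency-sensitive work to dedicated cores
docker update --cpuset-cpus 0,1 <container_id>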

Prevention

  • Configure log rotation for the json-file driver by setting max-size and max-file in daemon.json, or switch to the local log driver; a daemon.json example follows this list.
  • Monitor RestartCount and alert on any increase for containers that should be stable.
  • Set memory limits with headroom. Monitor anonymous memory growth in memory.stat, not just total usage, to catch leaks early.
  • Configure health checks with realistic start_period values. Do not let a health check kill a container that is still starting.
  • Clean up stopped containers and unused images automatically. Exited containers consume disk space and clutter state.
  • Treat restart policies as a safety net, not a fix. A container that needs to restart repeatedly is broken and should be investigated.
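
A minimal /etc/docker/daemon.json sketch for the first item above; the sizes are illustrative, the change requires a daemon restart, and it applies to containers created afterwards:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}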

How Netdata helps

Netdata correlates container RestartCount spikes with memory usage, CPU throttling, and OOM kill events on the same timeline so you can see whether a restart loop is caused by resource pressure or an application bug. Alerts on exit code 137 with OOMKilled: true distinguish memory kills from external SIGKILL, and daemon API latency charts help rule out Docker stress as the cause of health-check timeouts. Log size monitoring for json-file drivers also catches runaway growth before it fills the disk.