Docker container running but unhealthy: how to diagnose health check failures
The container is up. docker ps shows it running. But the status is (unhealthy) and your load balancer or orchestrator has stopped sending traffic. This means Docker’s health check command is returning a non-zero exit code, or timing out, while the container’s main process stays alive. The container is running, but Docker does not consider it ready.
An unhealthy state is not a crash. In Swarm, it triggers replacement. In Compose, depends_on with condition: service_healthy blocks downstream services. Even on a single host, an unhealthy mark often precedes a restart loop that buries the real error in noise. You need to distinguish between a broken application, a broken probe, and a broken runtime configuration.
This guide shows you how to inspect the health state, read the captured check output, identify whether the cause is slow startup, a missing binary, resource pressure, or a misconfigured probe, and fix it without guessing.
What this means
Docker’s health check runs a command inside the container via docker exec. Exit code 0 means healthy. Any non-zero exit means unhealthy. After consecutive failures equal to --retries (default 3), Docker marks the container as unhealthy. There is no recovering state. A single passing check resets the streak and flips the status back to healthy.
The health check does not restart the container on its own. A restart policy or orchestrator may act on the unhealthy state, but the transition itself is purely informational. The main process continues running. This distinction matters because the root cause is often not a crash but a probe that cannot verify readiness.
Key mechanics to keep in mind:
- The first check fires `--interval` seconds after container start, not immediately.
- If the command exceeds `--timeout`, Docker sends SIGKILL to the check process. This counts as a failure.
- `--start-period` grants a grace window where failures do not count toward the retry limit. Successful checks during this period mark the container as healthy immediately.
- `--start-interval` (Docker Compose v2.20.2+ and Engine 25.0+) controls probe frequency during `start_period` separately from the normal interval.
- Only the first 4096 bytes of stdout/stderr from the health check command are captured and stored in the container state.
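These parameters map directly onto the `HEALTHCHECK` instruction. A sketch with explicit timing values (the endpoint, port, and timings are illustrative and assume `curl` exists in the image; `--start-interval` requires Engine 25.0+):

```dockerfile
# Probe every 30s; kill the check after 5s; mark unhealthy after 3 failures.
# Grant a 60s grace window, probing every 5s during it.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
            --start-period=60s --start-interval=5s \
  CMD curl -fsS http://localhost:8080/health || exit 1
```

The shell form (`CMD` without a JSON array) runs through `/bin/sh -c`, which is what makes the `|| exit 1` fallback work.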
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow application startup | Container stays in starting state, then flips to unhealthy shortly after start_period ends | docker inspect FailingStreak and StartedAt timestamps |
| Missing health check binary | OCI runtime exec failed: executable file not found in health output | Run the same command with docker exec inside the container |
| Dependency not ready | Health check fails until database or API comes up; intermittent early failures | Application logs and dependency service status |
| Resource pressure (CPU throttling, memory pressure) | Health check times out or runs too slowly; the container is sluggish | docker stats, cgroup cpu.stat, and memory usage |
| Misconfigured probe command | Shell syntax used in exec form, wrong port, or hardcoded endpoint that moved | Dockerfile HEALTHCHECK or Compose healthcheck.test syntax |
| Signal handling bug | docker inspect shows unhealthy but docker ps still shows healthy after SIGHUP/SIGUSR1 | Recent signal delivery to the container (docker/for-linux#454) |
| Zombie health check accumulation | Hundreds of <defunct> processes inside the container from spawned shells | docker top or ps inside the container for defunct processes |
Quick checks
```bash
# Check current health state and full failure history
docker inspect --format='{{json .State.Health}}' <container_id> | jq .

# List only containers showing health status
docker ps --format '{{.ID}} {{.Status}}' | grep -E '\((healthy|unhealthy|starting)\)'

# Read the last health check output (first 4096 bytes captured)
docker inspect --format='{{range .State.Health.Log}}{{.End}} exit={{.ExitCode}} output={{.Output}}{{"\n"}}{{end}}' <container_id> | tail -n 5

# Verify the health check command exists inside the container
docker exec <container_id> sh -c 'command -v curl || command -v wget || command -v nc'

# Check if the application port is actually listening inside the container
docker exec <container_id> ss -tlnp | grep <app_port>

# Check container resource pressure
docker stats --no-stream <container_id>

# Check CPU throttling for the container's cgroup (cgroup v2; use the full container ID)
grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat

# Check for defunct/zombie processes from health checks
docker top <container_id> -o pid,ppid,stat,comm | grep Z
```
How to diagnose it
1. Confirm the state and failure streak. Run `docker inspect --format='{{json .State.Health}}' <container_id>`. Look at `Status` (`starting`, `healthy`, or `unhealthy`) and `FailingStreak`. A high streak means the check has failed repeatedly. If the status is `starting` and the streak is climbing, the `start_period` may be too short.
2. Read the captured output. Inspect the `.State.Health.Log` array. The most recent entries show `ExitCode` and `Output`. Exit code 0 is healthy. Exit code 1 is a failed probe. If the output shows `OCI runtime exec failed: ... executable file not found`, the health check binary is missing from the image. If it shows connection refused, the application is not listening yet or is listening on the wrong port.
3. Reproduce the command manually. Copy the exact health check command from the image or runtime configuration and run it inside the container with `docker exec -it <container_id> <command>`. If you used shell syntax (`CMD-SHELL` or `HEALTHCHECK CMD curl ... || exit 1`) but defined it in exec form (`["curl", "...", "||", "exit", "1"]`), the command will fail silently because the pipe and logical operators are passed as arguments, not interpreted by a shell.
4. Check if the application is actually ready. A health check that curls `localhost:8080/health` will fail if the application binds to `127.0.0.1` but the check resolves `localhost` to `::1`, or if the app listens on a Unix socket instead of TCP. Verify from inside the container with `ss -tlnp` and `curl -v <endpoint>`.
5. Check for resource pressure causing timeouts. A container that is CPU-throttled or under memory pressure may be unable to complete the health check within `--timeout`. Check `cpu.stat` for `nr_throttled` and `throttled_usec`. If throttling is increasing, the CFS bandwidth controller is pausing the health check process. Also check if memory usage is near its limit; heavy GC pressure can stall the application.
6. Review timing parameters. Check the configured `Interval`, `Timeout`, `StartPeriod`, and `Retries`. If the application needs 45 seconds to initialize but `StartPeriod` is 30 seconds, the container can become unhealthy immediately after the grace window closes. If the check command normally takes 5 seconds but `Timeout` is 3 seconds, every check will be killed with SIGKILL and count as a failure.
7. Check for known daemon-level bugs. If the container was recently sent `SIGHUP` or `SIGUSR1`, a known bug (docker/for-linux#454) can corrupt the reported health status in `docker inspect`, showing `unhealthy` while `docker ps` still shows `healthy`. Avoid sending arbitrary signals to containers with active health checks. Also note that during `start_period`, successful checks do not reset the `FailingStreak` counter due to a confirmed bug (docker/compose#11131). This means a container can become unhealthy immediately after `start_period` ends even if checks were passing throughout.
8. Correlate with restarts and application logs. Check `RestartCount` and application logs. An unhealthy container with a restart policy of `unless-stopped` or `always` may enter a restart loop. If the root cause is a missing dependency or OOM risk, the logs will show it before the health check ever fires.
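The exec-form pitfall in step 3 is reproducible without Docker. A POSIX shell interprets `||`, while an exec-style invocation passes it through as a literal argument:

```shell
# Shell form (CMD-SHELL): sh interprets ||, so the fallback runs on failure
sh -c 'false || echo "fallback ran"'
# prints: fallback ran

# Exec form equivalent: "||" is just another argument to echo, never evaluated
echo false "||" echo fallback
# prints: false || echo fallback
```

This is why an exec-form health check containing `||` never reports failure the way you expect: the operators reach the first binary as arguments instead of being evaluated.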
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Container health status | Tells you the application is actually ready, not just running | unhealthy state on any production container |
| Health check FailingStreak | Counts consecutive failures toward the unhealthy threshold | Streak > 1 during steady state, or increasing during starting |
| Container CPU throttling (nr_throttled) | Throttled containers run health checks too slowly, causing timeouts | Any increasing nr_throttled or throttled_usec |
| Container memory usage vs limit | Memory pressure causes GC stalls and slow responses | Usage > 80% of limit; anon memory growing continuously |
| Container restart count | Unhealthy containers with restart policies enter crash loops | RestartCount > 0 and increasing |
| Docker daemon response latency | A slow daemon delays docker exec health checks | /_ping or docker ps latency > 1 second sustained |
| Application listen port status | A common probe failure is checking an endpoint before the app binds | Port not listening inside the container namespace |
Fixes
If the cause is slow startup
Set --start-period to at least twice the expected initialization time. If you use Docker Compose v2.20.2+ or Engine 25.0+, set --start-interval to probe more frequently during startup without affecting steady-state interval. Do not rely on the default interval alone.
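In Compose, the same tuning looks roughly like this (the check command, port, and timings are illustrative; `start_interval` needs Compose v2.20.2+ and Engine 25.0+):

```yaml
services:
  app:
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:8080/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 90s    # ~2x a measured 45s cold start
      start_interval: 5s   # fast probing only during the grace window
```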
If the cause is a missing binary
Minimal images (Alpine, Debian slim, distroless) often lack `curl`, `wget`, or `bash`. Either install the required binary in the image or replace the health check with a command that exists, such as a compiled health checker or a language-native check if the runtime is present. Avoid `CMD ["curl", "...", "||", "exit", "1"]` in exec form; use `CMD-SHELL` for shell syntax, or the exec array form only when running a single binary directly.
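On Alpine-based images, BusyBox `wget` is usually present even when `curl` is not; a sketch (endpoint and port are illustrative):

```dockerfile
# BusyBox wget ships with Alpine; -q suppresses output, -O /dev/null discards the body
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD wget -q -O /dev/null http://localhost:8080/health || exit 1
```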
If the cause is dependency failure
If the health check fails because a database or downstream API is not ready, the application should handle backoff and readiness internally. Do not make the health check depend on external services unless that is the intended semantics. If using Compose, ensure depends_on with condition: service_healthy is set on the consumer, not the dependency.
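In Compose, the gating condition goes on the consumer; a sketch (service names and the Postgres check are illustrative):

```yaml
services:
  api:
    depends_on:
      db:
        condition: service_healthy   # declared on the consumer, not on db
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
```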
If the cause is resource pressure
Increase the container’s CPU or memory limits, or optimize the application. If CPU throttling is the cause, raising the limit is the fastest fix. For memory, ensure JVM heap settings and other runtime memory pools leave headroom under the cgroup limit. See Docker container high CPU usage: causes and fixes and Docker CPU throttling: the hidden cause of container latency.
If the cause is a misconfigured probe
Verify the endpoint, port, and protocol. Check that the health check uses the same IP family as the application (IPv4 vs IPv6). Ensure the timeout is long enough for the check to complete under load. Reduce retries only after confirming the check is reliable; the default of 3 is usually correct.
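One way to rule out an IP-family mismatch is to pin the probe to an explicit address instead of `localhost`; a sketch (port and endpoint are illustrative):

```dockerfile
# 127.0.0.1 forces IPv4; use http://[::1]:8080/health instead if the app binds only to IPv6
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -fsS http://127.0.0.1:8080/health || exit 1
```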
If the cause is a daemon bug
Do not send SIGHUP or SIGUSR1 to containers with active health checks. If you observe status desync between docker inspect and docker ps, compare timestamps and upgrade Docker if you are affected by known signal bugs.
If the cause is zombie accumulation
Ensure the health check command does not spawn background processes. If the command runs through /bin/sh -c, a child process that ignores SIGTERM can become a zombie. Use a direct exec form or add --init to the container so that tini reaps child processes properly.
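With Compose, `init: true` runs tini as PID 1 so defunct children spawned by the check are reaped; a sketch (service name and check are illustrative):

```yaml
services:
  web:
    init: true   # tini as PID 1 reaps zombie children left by health check shells
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 30s
```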
Prevention
- Configure health checks for every production container. A running container without a health check may be deadlocked and you will not know until traffic fails.
- Set `start_period` generously. Measure your application's cold-start time under realistic load and set the grace window to at least 2x that value.
- Set `timeout` based on the worst-case observed execution time of the check, not the best case.
- Pin health check binaries in your image build. Do not assume `curl` or `wget` exists in slim or distroless images.
- Monitor `FailingStreak` proactively. Do not wait for the `unhealthy` state.
- Avoid sending arbitrary signals to containers. If you must reload configuration, prefer an in-application mechanism over Unix signals when health checks are active.
- Test health check behavior in CI. Start the container and inspect `.State.Health` before declaring the image valid.
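That CI check can be a small polling loop around `docker inspect`; a minimal sketch, where the function name, container name, and poll budget are illustrative:

```shell
# Return 0 once the container reports healthy, 1 after N failed polls
wait_healthy() {
  container=$1
  tries=${2:-12}   # default: 12 polls x 5s = 60s budget
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    [ "$status" = "healthy" ] && return 0
    i=$((i + 1))
    sleep 5
  done
  echo "container $container never became healthy" >&2
  return 1
}
```

In a CI job, run it right after starting the container, e.g. `wait_healthy myapp 24 || exit 1`.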
How Netdata helps
- Correlate container health status with per-container CPU throttling charts to spot probes timing out due to CFS bandwidth limits.
- Monitor container memory usage, RSS, and `memory.stat` breakdowns to catch pressure that stalls health checks before an OOM kill.
- Track container restart counts and exit codes alongside health state changes to identify restart loops caused by unhealthy states.
- Alert on Docker daemon API latency spikes that delay `docker exec` health check execution.
- Visualize application latency and request error rates next to container resource saturation to distinguish probe failures from real application issues.
Related guides
- Docker container high CPU usage: causes and fixes
- Docker container high memory usage: how to diagnose it
- Docker container keeps restarting: causes, checks, and fixes
- Docker container memory leak: how to find one and prove it
- Docker CPU throttling: the hidden cause of container latency
- Docker daemon not responding: how to troubleshoot a hung dockerd
- Docker disk space full: how to troubleshoot /var/lib/docker
- Docker DNS not working inside containers
- Docker exit code 137: OOMKilled or SIGKILL?
- Docker log rotation: preventing json-file logs from filling disk
- Docker logs taking too much disk space: how to fix log growth
- Docker monitoring checklist: the signals every production host needs




