Docker container running but unhealthy: how to diagnose health check failures

The container is up. docker ps shows it running. But the status is (unhealthy) and your load balancer or orchestrator has stopped sending traffic. This means Docker’s health check command is returning a non-zero exit code, or timing out, while the container’s main process stays alive. The container is running, but Docker does not consider it ready.

An unhealthy state is not a crash. In Swarm, it triggers replacement. In Compose, depends_on with condition: service_healthy blocks downstream services. Even on a single host, an unhealthy mark often precedes a restart loop that buries the real error in noise. You need to distinguish between a broken application, a broken probe, and a broken runtime configuration.

This guide shows you how to inspect the health state, read the captured check output, identify whether the cause is slow startup, a missing binary, resource pressure, or a misconfigured probe, and fix it without guessing.

What this means

Docker’s health check runs a command inside the container via docker exec. Exit code 0 means healthy. Any non-zero exit means unhealthy. After consecutive failures equal to --retries (default 3), Docker marks the container as unhealthy. There is no recovering state. A single passing check resets the streak and flips the status back to healthy.

The health check does not restart the container on its own. A restart policy or orchestrator may act on the unhealthy state, but the transition itself is purely informational. The main process continues running. This distinction matters because the root cause is often not a crash but a probe that cannot verify readiness.

Key mechanics to keep in mind:

  • The first check fires --interval seconds after container start, not immediately.
  • If the command exceeds --timeout, Docker sends SIGKILL to the check process. This counts as a failure.
  • --start-period grants a grace window where failures do not count toward the retry limit. Successful checks during this period mark the container as healthy immediately.
  • --start-interval (Docker Compose v2.20.2+ and Engine 25.0+) controls probe frequency during start_period separately from the normal interval.
  • Only the first 4096 bytes of stdout/stderr from the health check command are captured and stored in the container state.
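
To make these parameters concrete, here is a typical Dockerfile configuration. The /health endpoint and port 8080 are illustrative assumptions; substitute your application's own readiness endpoint.

# Probe every 30s, kill any check running longer than 5s, and give the app
# 60s of startup grace before failures count toward the 3-failure limit
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1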

Common causes

  • Slow application startup. Looks like: the container sits in the starting state, then flips to unhealthy shortly after start_period ends. First check: docker inspect for FailingStreak and the StartedAt timestamp.
  • Missing health check binary. Looks like: OCI runtime exec failed: executable file not found in the health output. First check: run the same command inside the container with docker exec.
  • Dependency not ready. Looks like: the health check fails until a database or API comes up; intermittent early failures. First check: application logs and the dependency service's status.
  • Resource pressure (CPU throttling, memory pressure). Looks like: the health check times out or runs too slowly; the container is sluggish. First check: docker stats, cgroup cpu.stat, and memory usage.
  • Misconfigured probe command. Looks like: shell syntax used in exec form, a wrong port, or a hardcoded endpoint that moved. First check: the Dockerfile HEALTHCHECK or Compose healthcheck.test syntax.
  • Signal handling bug. Looks like: docker inspect shows unhealthy while docker ps still shows healthy after a SIGHUP/SIGUSR1. First check: recent signal delivery to the container (docker/for-linux#454).
  • Zombie health check accumulation. Looks like: hundreds of <defunct> processes inside the container, left behind by spawned shells. First check: docker top or ps inside the container for defunct processes.

Quick checks

# Check current health state and full failure history
docker inspect --format='{{json .State.Health}}' <container_id> | jq .

# List only containers showing health status
docker ps --format '{{.ID}} {{.Status}}' | grep -E '\((healthy|unhealthy|health: starting)\)'

# Read the last health check output (first 4096 bytes captured)
docker inspect --format='{{range .State.Health.Log}}{{.End}} exit={{.ExitCode}} output={{.Output}}{{"\n"}}{{end}}' <container_id> | tail -n 5

# Verify the health check command exists inside the container
docker exec <container_id> sh -c 'command -v curl || command -v wget || command -v nc'

# Check if the application port is actually listening inside the container
docker exec <container_id> ss -tlnp | grep <app_port>

# Check container resource pressure
docker stats --no-stream <container_id>

# Check CPU throttling for the container's cgroup (cgroup v2; use the full container ID)
grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.stat

# Check for defunct/zombie processes from health checks
docker top <container_id> -o pid,ppid,stat,comm | grep Z
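
# Dump the configured health check parameters (Test, Interval, Timeout, StartPeriod, Retries)
docker inspect --format='{{json .Config.Healthcheck}}' <container_id> | jq .

# Watch health state transitions in real time
docker events --filter event=health_status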

How to diagnose it

  1. Confirm the state and failure streak. Run docker inspect --format='{{json .State.Health}}' <container_id>. Look at Status (starting, healthy, or unhealthy) and FailingStreak. A high streak means the check has failed repeatedly. If the status is starting and the streak is climbing, the start_period may be too short.

  2. Read the captured output. Inspect the .State.Health.Log array. The most recent entries show ExitCode and Output. Exit code 0 is healthy. Exit code 1 is a failed probe. If the output shows OCI runtime exec failed: ... executable file not found, the health check binary is missing from the image. If it shows connection refused, the application is not listening yet or is listening on the wrong port.

  3. Reproduce the command manually. Copy the exact health check command from the image or runtime configuration and run it inside the container with docker exec -it <container_id> <command>. If the command relies on shell operators (curl ... || exit 1) but is defined in exec form (["curl", "...", "||", "exit", "1"]), it fails silently: || and exit are passed to the binary as literal arguments rather than interpreted by a shell. Use shell form (HEALTHCHECK CMD in a Dockerfile, CMD-SHELL in Compose) whenever operators are involved; see the Dockerfile sketch after this list.

  4. Check if the application is actually ready. A health check that curls localhost:8080/health will fail if the application binds to 127.0.0.1 but the check resolves localhost to ::1, or if the app listens on a Unix socket instead of TCP. Verify from inside the container with ss -tlnp and curl -v <endpoint>.

  5. Check for resource pressure causing timeouts. A container that is CPU-throttled or under memory pressure may be unable to complete the health check within --timeout. Check cpu.stat for nr_throttled and throttled_usec. If throttling is increasing, the CFS bandwidth controller is pausing the health check process. Also check if memory usage is near its limit; heavy GC pressure can stall the application.

  6. Review timing parameters. Check the configured Interval, Timeout, StartPeriod, and Retries. If the application needs 45 seconds to initialize but StartPeriod is 30 seconds, the container can become unhealthy immediately after the grace window closes. If the check command normally takes 5 seconds but Timeout is 3 seconds, every check will be killed with SIGKILL and count as a failure.

  7. Check for known daemon-level bugs. If the container was recently sent SIGHUP or SIGUSR1, a known bug (docker/for-linux#454) can corrupt the reported health status in docker inspect, showing unhealthy while docker ps still shows healthy. Avoid sending arbitrary signals to containers with active health checks.

    Also note that during start_period, successful checks do not reset the FailingStreak counter due to a confirmed bug (docker/compose#11131). This means a container can become unhealthy immediately after start_period ends even if checks were passing throughout.

  8. Correlate with restarts and application logs. Check RestartCount and application logs. An unhealthy container with a restart policy of unless-stopped or always may enter a restart loop. If the root cause is a missing dependency or OOM risk, the logs will show it before the health check ever fires.
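
To make step 3 concrete, here is the exec-form pitfall next to its shell-form fix in a Dockerfile. The endpoint and port are illustrative assumptions; the failure mode is not, because exec form never invokes a shell.

# Broken: exec form passes "||" and "exit" to curl as literal arguments
HEALTHCHECK CMD ["curl", "-f", "http://localhost:8080/health", "||", "exit", "1"]

# Working: shell form, so a shell interprets the operators
# (Compose equivalent: test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"])
HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1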

Metrics and signals to monitor

  • Container health status. Why it matters: tells you the application is actually ready, not just running. Warning sign: an unhealthy state on any production container.
  • Health check FailingStreak. Why it matters: counts consecutive failures toward the unhealthy threshold. Warning sign: a streak above 1 in steady state, or climbing while starting.
  • Container CPU throttling (nr_throttled). Why it matters: throttled containers run health checks too slowly, causing timeouts. Warning sign: any increase in nr_throttled or throttled_usec.
  • Container memory usage vs. limit. Why it matters: memory pressure causes GC stalls and slow responses. Warning sign: usage above 80% of the limit; anonymous memory growing continuously.
  • Container restart count. Why it matters: unhealthy containers with restart policies enter crash loops. Warning sign: RestartCount above 0 and increasing.
  • Docker daemon response latency. Why it matters: a slow daemon delays docker exec health checks. Warning sign: /_ping or docker ps latency above 1 second, sustained.
  • Application listen port status. Why it matters: a common probe failure is checking an endpoint before the app binds. Warning sign: the port not listening inside the container's network namespace.

Fixes

If the cause is slow startup

Set --start-period to at least twice the expected initialization time. If you use Docker Compose v2.20.2+ or Engine 25.0+, set --start-interval to probe more frequently during startup without affecting the steady-state interval. Do not rely on the default interval alone.
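
For example, in Compose, with values that are illustrative rather than prescriptive (measure your own cold start):

healthcheck:
  test: ["CMD-SHELL", "curl -fsS http://localhost:8080/health || exit 1"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 90s      # roughly 2x a measured 45s cold start
  start_interval: 2s     # faster probing during startup (Compose v2.20.2+, Engine 25.0+)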

If the cause is a missing binary

Minimal images (Alpine, Debian slim, distroless) often lack curl, wget, or bash. Either install the required binary in the image or replace the health check with a command that exists, such as a compiled health checker or a language-native check if the runtime is present. Avoid CMD ["curl", "...", "||", "exit", "1"] in exec form; use CMD-SHELL for shell syntax, and reserve the exec array form for running a single binary directly.
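
If the image ships a language runtime but no HTTP client, a native one-liner avoids installing packages. A sketch assuming a Node.js runtime and an HTTP /health endpoint on port 8080:

# Exit 0 on HTTP 200, 1 on any other status or on connection error
HEALTHCHECK CMD ["node", "-e", "require('http').get('http://localhost:8080/health', r => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"]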

If the cause is dependency failure

If the health check fails because a database or downstream API is not ready, the application should handle backoff and readiness internally. Do not make the health check depend on external services unless that is the intended semantics. If using Compose, ensure depends_on with condition: service_healthy is set on the consumer, not the dependency.
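
In Compose that looks like the sketch below, assuming a Postgres dependency (pg_isready ships in the official postgres image):

services:
  api:
    depends_on:
      db:
        condition: service_healthy   # the consumer waits; the dependency only defines the check
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10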

If the cause is resource pressure

Increase the container’s CPU or memory limits, or optimize the application. If CPU throttling is the cause, raising the limit is the fastest fix. For memory, ensure JVM heap settings and other runtime memory pools leave headroom under the cgroup limit. See Docker container high CPU usage: causes and fixes and Docker CPU throttling: the hidden cause of container latency.
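
On a single host, limits can often be raised in place without recreating the container; the values here are placeholders to adjust for your workload:

# Raise the CPU quota and memory limit of a running container
docker update --cpus 2 --memory 1g --memory-swap 1g <container_id>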

If the cause is a misconfigured probe

Verify the endpoint, port, and protocol. Check that the health check uses the same IP family as the application (IPv4 vs IPv6). Ensure the timeout is long enough for the check to complete under load. Reduce retries only after confirming the check is reliable; the default of 3 is usually correct.
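
One low-risk hardening step is to probe an explicit address instead of a hostname, pinning the IP family the application actually binds (endpoint assumed for illustration):

# 127.0.0.1 cannot silently resolve to ::1 the way localhost can
HEALTHCHECK CMD curl -fsS http://127.0.0.1:8080/health || exit 1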

If the cause is a daemon bug

Do not send SIGHUP or SIGUSR1 to containers with active health checks. If you observe status desync between docker inspect and docker ps, compare timestamps and upgrade Docker if you are affected by known signal bugs.

If the cause is zombie accumulation

Ensure the health check command does not spawn background processes that outlive the check. Zombies are children that exited but were never reaped: if the container's PID 1 does not reap orphans, every health check shell can leave a <defunct> entry behind. Run the check as a single binary in exec form, or start the container with --init so Docker's tini-based init process reaps children properly.
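
Enabling Docker's built-in init process is a one-flag change at run time, or init: true per service in Compose:

# docker-init (tini-based) runs as PID 1 and reaps orphaned children
docker run -d --init --name myapp <image>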

Prevention

  • Configure health checks for every production container. A running container without a health check may be deadlocked and you will not know until traffic fails.
  • Set start_period generously. Measure your application’s cold-start time under realistic load and set the grace window to at least 2x that value.
  • Set timeout based on the worst-case observed execution time of the check, not the best case.
  • Pin health check binaries in your image build. Do not assume curl or wget exists in slim or distroless images.
  • Monitor FailingStreak proactively. Do not wait for the unhealthy state.
  • Avoid sending arbitrary signals to containers. If you must reload configuration, prefer an in-application mechanism over Unix signals when health checks are active.
  • Test health check behavior in CI. Start the container and poll .State.Health before declaring the image valid, as in the sketch after this list.
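
A minimal CI gate, assuming the image is tagged myimage and defines a health check; it waits up to 60 seconds for healthy and dumps the health log otherwise:

docker run -d --name healthgate myimage
for i in $(seq 1 30); do
  status=$(docker inspect --format='{{.State.Health.Status}}' healthgate)
  [ "$status" = "healthy" ] && exit 0
  [ "$status" = "unhealthy" ] && break
  sleep 2
done
docker inspect --format='{{json .State.Health}}' healthgate   # show why it never passed
exit 1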

How Netdata helps

  • Correlate container health status with per-container CPU throttling charts to spot probes timing out due to CFS bandwidth limits.
  • Monitor container memory usage, RSS, and memory.stat breakdowns to catch pressure that stalls health checks before an OOM kill.
  • Track container restart counts and exit codes alongside health state changes to identify restart loops caused by unhealthy states.
  • Alert on Docker daemon API latency spikes that delay docker exec health check execution.
  • Visualize application latency and request error rates next to container resource saturation to distinguish probe failures from real application issues.