Docker daemon not responding: how to troubleshoot a hung dockerd

A hung Docker daemon is one of the more disorienting production failures you can face. The process is still running, your containers are still serving traffic, but every docker command hangs. You cannot inspect, stop, or create containers. Deployments stall. Automation times out.

This guide covers the diagnostic ladder from socket probe to storage driver investigation, explains when to wait versus when to restart, and describes what to avoid when you are not sure what is wrong.

What this means

A hung daemon is distinct from a crashed daemon. When dockerd crashes, the process is gone and you get immediate errors. When it hangs, the process exists but the Unix socket stops responding to API requests. Running containers continue because they are managed by containerd and runc, not by dockerd directly. The daemon is the management plane; the workload plane keeps running without it.

The practical consequence: you have lost all operational control. You cannot get logs, exec into containers, stop or start anything, or run health checks through the Docker API. Automated systems that poll the daemon will time out and may take disruptive action.

Common internal causes include storage driver deadlocks (overlay2 blocking on a filesystem operation), internal goroutine deadlocks, file descriptor exhaustion, and extreme I/O contention that causes storage operations to block indefinitely.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Storage driver hang | Daemon process alive, socket unresponsive, high I/O wait on host | iostat, df -h /var/lib/docker, daemon logs for storage errors |
| File descriptor exhaustion | “Too many open files” in daemon logs before hang | ls /proc/$(pgrep dockerd)/fd \| wc -l vs FD limit |
| Internal deadlock (goroutine leak) | Daemon unresponsive, no obvious resource pressure | Daemon logs for panics; goroutine count if debug endpoint is enabled |
| Disk space exhaustion | Daemon slow then unresponsive, write errors in logs | docker system df, df -h /var/lib/docker |
| Kernel-level cgroup or namespace issue | Daemon hangs after kernel upgrade or cgroup change | dmesg, kernel logs, containerd status |

Quick checks

Run these in order. They are read-only and safe.

# 1. Is the dockerd process alive?
pgrep -x dockerd && echo "PROCESS ALIVE" || echo "PROCESS GONE"

# 2. Is the socket responding? (fastest daemon liveness check)
time curl --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping

# 3. Is containerd still running? (workloads depend on this, not dockerd)
pgrep -x containerd && echo "containerd ALIVE" || echo "containerd GONE"

# 4. Are containers still running at the containerd level?
# ctr is the containerd CLI; -n moby is the namespace Docker uses
sudo ctr -n moby containers list

# 5. Check disk space on the Docker data directory
df -h /var/lib/docker

# 6. Check I/O wait on the host
iostat -x 2 3

# 7. Count open file descriptors for dockerd
sudo ls /proc/$(pgrep -x dockerd)/fd | wc -l

# 8. Check the FD limit for the daemon process
sudo grep "open files" /proc/$(pgrep -x dockerd)/limits

# 9. Read recent daemon logs for errors
journalctl -u docker.service -p err --since "30 minutes ago"

# 10. Look for storage driver or lock errors specifically
journalctl -u docker.service --since "1 hour ago" | grep -iE "(storage|overlay|lock|deadlock|panic|fatal)"

A /_ping that returns OK in under one second means the HTTP handler is alive. A /_ping that hangs or times out confirms the daemon is wedged, not just slow.

If ctr -n moby containers list returns your running containers, the workload plane is intact. That is important: it means a daemon restart is lower risk than a full host reboot.

How to diagnose it

Step 1. Confirm the hang is real, not just slow. Run time curl --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping. If it returns in under a second, the daemon is alive but possibly stressed. If it hits the 5-second timeout, retry once or twice; repeated timeouts confirm a genuine hang rather than a transient stall.

Step 2. Check containerd separately. Run sudo ctr -n moby containers list. If this responds, containerd is healthy and your running containers are intact. This tells you a daemon restart will not kill running workloads if live-restore is enabled (see below).

Step 3. Check disk space. Run df -h /var/lib/docker. Disk exhaustion is a common cause of daemon hangs because storage driver operations block when writes fail. If the filesystem is above 90%, treat this as a disk pressure incident first. See the Docker disk space full guide.

Step 4. Check file descriptors. Compare sudo ls /proc/$(pgrep dockerd)/fd | wc -l against the limit from /proc/$(pgrep dockerd)/limits. If you are above 80% of the limit, FD exhaustion is likely. Each container consumes multiple FDs; a large container count or API connection leak can exhaust the limit.
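To put the count and the limit side by side, a small helper can compute the usage percentage directly. This is a sketch, not a Docker tool: fd_usage is a name invented here, and it reads the standard Linux /proc files for any PID.

```shell
#!/bin/sh
# fd_usage PID: report open file descriptors against the process's
# soft "open files" limit, as a count and a percentage.
fd_usage() {
  pid="$1"
  count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  # Field 4 of the "Max open files" line is the soft limit.
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  pct=$((count * 100 / limit))
  echo "$count of $limit FDs in use (${pct}%)"
}

# Demonstrate on the current shell; for the daemon run (as root):
#   fd_usage "$(pgrep -x dockerd)"
fd_usage $$
```

Reading /proc/PID/fd for another user's process requires root, which is why the dockerd invocation above needs sudo.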

Step 5. Check I/O wait. Run iostat -x 2 3. Sustained high %iowait on the device backing /var/lib/docker points to a storage driver hang. The overlay2 driver can deadlock when the backing filesystem is under extreme pressure or has errors.

Step 6. Read daemon logs.

journalctl -u docker.service --since "2 hours ago" | grep -iE "(panic|fatal|deadlock|storage|overlay|lock)"

A panic or fatal message before the hang is the clearest diagnostic signal. Storage driver errors or lock contention messages narrow the cause further.

Step 7. Decide: wait or restart. If the hang is caused by a transient I/O spike (disk pressure relieved, I/O wait dropping), the daemon may recover on its own. Give it 2-5 minutes after the underlying condition clears. If there is no improvement, proceed to restart.
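A simple poll loop makes the wait-versus-restart decision less tedious to watch. The sketch below assumes curl and the default socket path; wait_for_daemon is a hypothetical helper, and the tries/interval defaults (30 checks, 10 seconds apart, roughly the 2-5 minute window) are arbitrary.

```shell
#!/bin/sh
# wait_for_daemon [SOCKET] [TRIES] [INTERVAL]: poll /_ping until the
# daemon answers or TRIES attempts are exhausted. All args optional.
wait_for_daemon() {
  sock="${1:-/var/run/docker.sock}"
  tries="${2:-30}"
  interval="${3:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fs --max-time 5 --unix-socket "$sock" \
        http://localhost/_ping >/dev/null 2>&1; then
      echo "daemon recovered"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt remains.
    [ "$i" -lt "$tries" ] && sleep "$interval"
  done
  echo "still hung after $tries checks"
  return 1
}

# Example: wait_for_daemon /var/run/docker.sock 30 10
```

If the function returns 1, the underlying condition has not cleared and a restart is the next step.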

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| /_ping response time | Fastest daemon liveness check | No response within 5 seconds |
| Daemon process existence | Distinguishes hang from crash | Process gone unexpectedly |
| Disk usage on /var/lib/docker | Disk exhaustion causes storage hangs | Above 80% |
| Host I/O wait (%iowait) | High I/O wait precedes storage driver hangs | Sustained above 30% |
| Daemon FD count vs limit | FD exhaustion causes “too many open files” and hangs | Above 75% of limit |
| Daemon log error rate | Errors precede and accompany hangs | Any panic or fatal; error rate spike |
| containerd process health | Confirms workload plane is intact during daemon hang | Process gone |
| Daemon response latency trend | Rising latency is an early warning before full hang | Above 500ms sustained |

Fixes

If the cause is disk pressure

Free space before attempting a daemon restart. With a hung daemon, the Docker CLI will not work. Use filesystem-level commands:

# Check what is consuming space (/var/lib/docker is root-owned, so use sudo)
sudo du -sh /var/lib/docker/containers/
sudo du -sh /var/lib/docker/overlay2/
sudo du -sh /var/lib/docker/volumes/

# Remove large log files directly (identify the container ID first)
# /var/lib/docker/containers/<container-id>/<container-id>-json.log
# Truncate rather than delete to avoid breaking the running log handle
sudo truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log
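Identifying which log files are worth truncating is easier with a size-sorted listing. The helper below is a sketch: largest_logs is a name invented here, and it relies on GNU find's -printf.

```shell
#!/bin/sh
# largest_logs [DIR] [N]: list the N largest *-json.log files under DIR,
# size in bytes first, largest at the top.
largest_logs() {
  find "${1:-/var/lib/docker/containers}" -name '*-json.log' \
      -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn | head -n "${2:-5}"
}

# Against the live tree this needs root:
#   sudo sh -c 'find /var/lib/docker/containers -name "*-json.log" -printf "%s\t%p\n"' | sort -rn | head -5
```

The paths it prints are exactly what the truncate command above expects.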

Once you have freed enough space, attempt a daemon restart. After recovery, configure log rotation (max-size and max-file in daemon.json or per container) to prevent recurrence.

If the cause is FD exhaustion

Increase the FD limit for the daemon process. The correct place to set this is the systemd unit override:

# Create an override for the docker service
sudo systemctl edit docker.service

Add:

[Service]
LimitNOFILE=1048576

Then restart the daemon. Also investigate why FDs are accumulating: persistent docker logs -f sessions, API connections not being closed, or a very high container count are the usual causes.
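To see which kind of descriptor is accumulating, tally the daemon's FDs by what they point at. This is a sketch using the standard /proc fd symlinks (socket:[inode], pipe:[inode], anon_inode:[type], or a file path); fd_breakdown is a hypothetical helper, not a Docker command.

```shell
#!/bin/sh
# fd_breakdown PID: tally a process's open FDs by target type
# (socket, pipe, anon_inode, regular files), highest count first.
fd_breakdown() {
  ls -l "/proc/$1/fd" 2>/dev/null \
    | awk 'NR > 1 {print $NF}' \
    | sed -e 's/:.*//' -e 's|^/.*|file|' \
    | sort | uniq -c | sort -rn
}

# Demonstrate on the current shell; for the daemon run (as root):
#   fd_breakdown "$(pgrep -x dockerd)"
fd_breakdown $$
```

A socket count that dwarfs everything else points at leaked API connections; a large file count points at log handles or layer files.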

If the cause is a storage driver hang or internal deadlock

This requires a daemon restart. Before restarting, verify live-restore is configured so running containers survive the restart:

# Check if live-restore is enabled. Note: docker info will itself hang
# against a hung daemon, so bound it with timeout and fall back to
# reading the config file directly
timeout 5 docker info 2>/dev/null | grep -i "live restore" || \
  grep -s live-restore /etc/docker/daemon.json

If live-restore is enabled (value true in daemon.json), a daemon restart will not stop running containers. If it is not enabled, a daemon restart will stop all containers.

# Attempt graceful restart first
sudo systemctl restart docker.service

# If systemctl restart hangs (daemon is fully wedged), send SIGTERM directly
sudo kill -TERM $(pgrep -x dockerd)

# Wait 30 seconds. If still alive, use SIGKILL (last resort before reboot)
sudo kill -9 $(pgrep -x dockerd)
# Then start the daemon again
sudo systemctl start docker.service

Using kill -9 on dockerd is disruptive. Without live-restore, it will leave containers in an inconsistent state. With live-restore, containerd maintains the containers and dockerd reconnects on startup.

If the daemon fails to start after a kill, check for leftover lock files or corrupted state in /var/lib/docker. Daemon logs on startup will indicate what is failing.

If the daemon will not restart cleanly

Check for storage driver corruption:

# Check for overlay mount issues
mount | grep overlay | head -20

# Check backing filesystem health
sudo dmesg | grep -iE "(ext4|xfs|overlay|error)" | tail -20

If the backing filesystem has errors, you may need filesystem repair (fsck for ext4, xfs_repair for xfs) before the daemon can start. This requires unmounting, which means stopping all containers first. At this point you are in a host-level recovery scenario.

Prevention

Configure live-restore. Add "live-restore": true to /etc/docker/daemon.json. This is the single most important setting for reducing the blast radius of a daemon restart.
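The corresponding daemon.json fragment is minimal; merge the key into any existing file rather than overwriting it, and restart the daemon once for it to take effect:

```json
{
  "live-restore": true
}
```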

Set log rotation defaults. Add to /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}

Unbounded logs are a leading cause of disk exhaustion, which causes daemon hangs.

Monitor disk at 70%, not 90%. By the time you are at 90%, cleanup operations may themselves fail. Alert at 70% and clean up before it becomes an emergency.

Raise the FD limit proactively. If you run more than a few dozen containers per host, the default FD limit may be insufficient. Set LimitNOFILE in the systemd unit before you hit the limit.

Monitor daemon response latency, not just process existence. A daemon that responds in 5 seconds is operationally broken even though the process is alive. Alert on latency above 500ms sustained, not just on process absence.

Implement automated disk cleanup. Run docker system prune -f on a schedule (daily on most hosts, more frequently on CI runners). Do not rely on manual cleanup.
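One way to schedule this is a cron drop-in. The path, schedule, and log destination below are examples, not fixed conventions; note that without --volumes, prune never touches named volumes.

```shell
# /etc/cron.d/docker-prune (example): nightly prune of stopped
# containers, dangling images, build cache, and unused networks.
0 3 * * * root /usr/bin/docker system prune -f >> /var/log/docker-prune.log 2>&1
```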

How Netdata helps

  • Daemon liveness and latency. Netdata can monitor the /_ping endpoint response time, giving you a latency trend before the daemon fully hangs, not just a binary up/down signal.
  • Disk pressure correlation. Netdata’s disk usage charts for /var/lib/docker correlate directly with daemon hang risk. Watching disk usage growth rate alongside I/O wait gives early warning of the disk-exhaustion-to-hang pattern.
  • Host I/O wait. The system.io and per-disk I/O wait charts show when the storage layer is under pressure, which is the leading indicator for overlay2 storage driver hangs.
  • File descriptor tracking. Netdata tracks per-process FD counts, so you can alert when dockerd is approaching its FD limit before exhaustion causes failures.
  • Container state counts. Tracking running, exited, and dead container counts over time surfaces accumulation patterns that contribute to daemon stress.