Docker daemon not responding: how to troubleshoot a hung dockerd
A hung Docker daemon is one of the more disorienting production failures you can face. The process is still running, your containers are still serving traffic, but every docker command hangs. You cannot inspect, stop, or create containers. Deployments stall. Automation times out.
This guide covers the diagnostic ladder from socket probe to storage driver investigation, explains when to wait versus when to restart, and describes what to avoid when you are not sure what is wrong.
What this means
A hung daemon is distinct from a crashed daemon. When dockerd crashes, the process is gone and you get immediate errors. When it hangs, the process exists but the Unix socket stops responding to API requests. Running containers continue because they are managed by containerd and runc, not by dockerd directly. The daemon is the management plane; the workload plane keeps running without it.
The practical consequence: you have lost all operational control. You cannot get logs, exec into containers, stop or start anything, or run health checks through the Docker API. Automated systems that poll the daemon will time out and may take disruptive action.
Common internal causes include storage driver deadlocks (overlay2 blocking on a filesystem operation), internal goroutine deadlocks, file descriptor exhaustion, and extreme I/O contention that causes storage operations to block indefinitely.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Storage driver hang | Daemon process alive, socket unresponsive, high I/O wait on host | iostat, df -h /var/lib/docker, daemon logs for storage errors |
| File descriptor exhaustion | “Too many open files” in daemon logs before hang | ls /proc/$(pgrep dockerd)/fd \| wc -l vs FD limit |
| Internal deadlock (goroutine leak) | Daemon unresponsive, no obvious resource pressure | Daemon logs for panics; goroutine count if debug endpoint is enabled |
| Disk space exhaustion | Daemon slow then unresponsive, write errors in logs | docker system df, df -h /var/lib/docker |
| Kernel-level cgroup or namespace issue | Daemon hangs after kernel upgrade or cgroup change | dmesg, kernel logs, containerd status |
Quick checks
Run these in order. They are read-only and safe.
# 1. Is the dockerd process alive?
pgrep -x dockerd && echo "PROCESS ALIVE" || echo "PROCESS GONE"
# 2. Is the socket responding? (fastest daemon liveness check)
time curl --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping
# 3. Is containerd still running? (workloads depend on this, not dockerd)
pgrep -x containerd && echo "containerd ALIVE" || echo "containerd GONE"
# 4. Are containers still running at the containerd level?
# ctr is the containerd CLI; -n moby is the namespace Docker uses
sudo ctr -n moby containers list
# 5. Check disk space on the Docker data directory
df -h /var/lib/docker
# 6. Check I/O wait on the host
iostat -x 2 3
# 7. Count open file descriptors for dockerd
sudo ls /proc/$(pgrep dockerd)/fd | wc -l
# 8. Check the FD limit for the daemon process
sudo cat /proc/$(pgrep dockerd)/limits | grep "open files"
# 9. Read recent daemon logs for errors
journalctl -u docker.service -p err --since "30 minutes ago"
# 10. Look for storage driver or lock errors specifically
journalctl -u docker.service --since "1 hour ago" | grep -iE "(storage|overlay|lock|deadlock|panic|fatal)"
A /_ping that returns OK in under one second means the HTTP handler is alive. A /_ping that hangs or times out confirms the daemon is wedged, not just slow.
If ctr -n moby containers list returns your running containers, the workload plane is intact. That is important: it means a daemon restart is lower risk than a full host reboot.
How to diagnose it
Step 1. Confirm the hang is real, not just slow.
Run time curl --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping. If it returns in under a second, the daemon is alive but possibly stressed. If it hits the 5-second timeout (or hangs indefinitely when run without --max-time), you have a genuine hang.
Step 2. Check containerd separately.
Run sudo ctr -n moby containers list. If this responds, containerd is healthy and your running containers are intact. This tells you a daemon restart will not kill running workloads if live-restore is enabled (see below).
Step 3. Check disk space.
Run df -h /var/lib/docker. Disk exhaustion is a common cause of daemon hangs because storage driver operations block when writes fail. If the filesystem is above 90%, treat this as a disk pressure incident first. See the Docker disk space full guide.
Step 4. Check file descriptors.
Compare sudo ls /proc/$(pgrep dockerd)/fd | wc -l against the limit from /proc/$(pgrep dockerd)/limits. If you are above 80% of the limit, FD exhaustion is likely. Each container consumes multiple FDs; a large container count or API connection leak can exhaust the limit.
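If you want that comparison as a single number, the following sketch (assuming exactly one dockerd process) prints current usage as a percentage of the soft limit:
# Rough FD usage for dockerd as a percentage of its soft limit
PID=$(pgrep -x dockerd)
USED=$(sudo ls /proc/$PID/fd | wc -l)
LIMIT=$(awk '/Max open files/ {print $4}' /proc/$PID/limits)
echo "dockerd FDs: $USED / $LIMIT ($((USED * 100 / LIMIT))%)"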
Step 5. Check I/O wait.
Run iostat -x 2 3. Sustained high %iowait on the device backing /var/lib/docker points to a storage driver hang. The overlay2 driver can deadlock when the backing filesystem is under extreme pressure or has errors.
Step 6. Read daemon logs.
journalctl -u docker.service --since "2 hours ago" | grep -iE "(panic|fatal|deadlock|storage|overlay|lock)"
A panic or fatal message before the hang is the clearest diagnostic signal. Storage driver errors or lock contention messages narrow the cause further.
Step 7. Decide: wait or restart. If the hang is caused by a transient I/O spike (disk pressure relieved, I/O wait dropping), the daemon may recover on its own. Give it 2-5 minutes after the underlying condition clears. If there is no improvement, proceed to restart.
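If you choose to wait, a simple probe loop saves you from re-running the check by hand. This is a sketch; the 30-second interval and 10 probes are arbitrary:
# Probe /_ping every 30 seconds for roughly 5 minutes; stop early on recovery
for i in $(seq 1 10); do
  if curl -s --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping >/dev/null; then
    echo "daemon recovered on probe $i"; break
  fi
  echo "probe $i: still unresponsive"; sleep 30
done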
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| /_ping response time | Fastest daemon liveness check | No response within 5 seconds |
| Daemon process existence | Distinguishes hang from crash | Process gone unexpectedly |
| Disk usage on /var/lib/docker | Disk exhaustion causes storage hangs | Above 80% |
| Host I/O wait (%iowait) | High I/O wait precedes storage driver hangs | Sustained above 30% |
| Daemon FD count vs limit | FD exhaustion causes “too many open files” and hangs | Above 75% of limit |
| Daemon log error rate | Errors precede and accompany hangs | Any panic or fatal; error rate spike |
| containerd process health | Confirms workload plane is intact during daemon hang | Process gone |
| Daemon response latency trend | Rising latency is an early warning before full hang | Above 500ms sustained |
Fixes
If the cause is disk pressure
Free space before attempting a daemon restart. With a hung daemon, the Docker CLI will not work. Use filesystem-level commands:
# Check what is consuming space
du -sh /var/lib/docker/containers/
du -sh /var/lib/docker/overlay2/
du -sh /var/lib/docker/volumes/
# Remove large log files directly (identify the container ID first)
# /var/lib/docker/containers/<container-id>/<container-id>-json.log
# Truncate rather than delete to avoid breaking the running log handle
sudo truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log
Once you have freed enough space, attempt a daemon restart. After recovery, configure log rotation (max-size and max-file in daemon.json or per container) to prevent recurrence.
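The daemon-wide daemon.json settings are shown under Prevention below. The per-container form looks like this; the image name and size values are placeholders to adjust for your log volume:
# Per-container log rotation with the json-file driver (example values)
docker run -d \
  --log-driver json-file \
  --log-opt max-size=50m \
  --log-opt max-file=3 \
  nginx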
If the cause is FD exhaustion
Increase the FD limit for the daemon process. The correct place to set this is the systemd unit override:
# Create an override for the docker service
sudo systemctl edit docker.service
Add:
[Service]
LimitNOFILE=1048576
Then restart the daemon. Also investigate why FDs are accumulating: persistent docker logs -f sessions, API connections not being closed, or a very high container count are the usual causes.
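Applying the override and confirming it took effect looks roughly like this:
# Reload systemd so it picks up the override, then restart the daemon
sudo systemctl daemon-reload
sudo systemctl restart docker.service
# Verify the new limit on the freshly started process
cat /proc/$(pgrep -x dockerd)/limits | grep "open files"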
If the cause is a storage driver hang or internal deadlock
This requires a daemon restart. Before restarting, verify live-restore is configured so running containers survive the restart:
# Check if live-restore is enabled (timeout 5 guards against the hung daemon)
timeout 5 docker info 2>/dev/null | grep -i "live restore" || \
  grep live-restore /etc/docker/daemon.json
If live-restore is enabled (value true in daemon.json), a daemon restart will not stop running containers. If it is not enabled, a daemon restart will stop all containers.
# Attempt graceful restart first
sudo systemctl restart docker.service
# If systemctl restart hangs (daemon is fully wedged), send SIGTERM directly
sudo kill -TERM $(pgrep dockerd)
# Wait 30 seconds. If still alive, use SIGKILL (last resort before reboot)
sudo kill -9 $(pgrep dockerd)
# Then start the daemon again
sudo systemctl start docker.service
Using kill -9 on dockerd is disruptive. Without live-restore, it will leave containers in an inconsistent state. With live-restore, containerd maintains the containers and dockerd reconnects on startup.
If the daemon fails to start after a kill, check for leftover lock files or corrupted state in /var/lib/docker. Daemon logs on startup will indicate what is failing.
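A reasonable starting point for that investigation (paths are the defaults and may differ on your host):
# Follow the daemon's startup attempt in real time
sudo journalctl -u docker.service -f
# A stale pid file left by the killed daemon can block startup on some setups
ls -l /var/run/docker.pid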
If the daemon will not restart cleanly
Check for storage driver corruption:
# Check for overlay mount issues
mount | grep overlay | head -20
# Check backing filesystem health
sudo dmesg | grep -iE "(ext4|xfs|overlay|error)" | tail -20
If the backing filesystem has errors, you may need filesystem repair (fsck for ext4, xfs_repair for xfs) before the daemon can start. This requires unmounting, which means stopping all containers first. At this point you are in a host-level recovery scenario.
Prevention
Configure live-restore. Add "live-restore": true to /etc/docker/daemon.json. This is the single most important setting for reducing the blast radius of a daemon restart.
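A minimal daemon.json with only this setting, plus one way to apply it without restarting (dockerd reloads live-restore on SIGHUP; confirm against the docs for your Docker version):
{
  "live-restore": true
}
# Ask the running daemon to reload its configuration
sudo kill -SIGHUP $(pgrep -x dockerd)
# Verify
docker info | grep -i "live restore"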
Set log rotation defaults. Add to /etc/docker/daemon.json:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}
Unbounded logs are a leading cause of disk exhaustion, which causes daemon hangs.
Monitor disk at 70%, not 90%. By the time you are at 90%, cleanup operations may themselves fail. Alert at 70% and clean up before it becomes an emergency.
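If your monitoring cannot express that threshold directly, a minimal cron-able check is sketched below; the 70% threshold and the path are the assumptions to adjust:
# Exit non-zero and print a warning when /var/lib/docker usage exceeds 70%
USAGE=$(df --output=pcent /var/lib/docker | tail -1 | tr -dc '0-9')
[ "$USAGE" -le 70 ] || { echo "WARNING: /var/lib/docker at ${USAGE}%"; exit 1; }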
Raise the FD limit proactively. If you run more than a few dozen containers per host, the default FD limit may be insufficient. Set LimitNOFILE in the systemd unit before you hit the limit.
Monitor daemon response latency, not just process existence. A daemon that responds in 5 seconds is operationally broken even though the process is alive. Alert on latency above 500ms sustained, not just on process absence.
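A bare-bones probe you could wire into cron or a monitoring agent, assuming the default socket path and a 500ms threshold:
# Measure /_ping latency in milliseconds and warn above 500ms
MS=$(curl -s -o /dev/null --max-time 5 --unix-socket /var/run/docker.sock \
  -w '%{time_total}' http://localhost/_ping | awk '{printf "%d", $1 * 1000}')
[ "$MS" -lt 500 ] && echo "ping ${MS}ms OK" || echo "WARNING: ping ${MS}ms"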
Implement automated disk cleanup. Run docker system prune -f on a schedule (daily on most hosts, more frequently on CI runners). Do not rely on manual cleanup.
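One way to schedule it is a cron.d entry; the 03:00 schedule and the 24-hour filter are assumptions to tune for your churn rate:
# /etc/cron.d/docker-prune: remove stopped containers, unused networks,
# dangling images, and build cache older than 24 hours, every day at 03:00
0 3 * * * root docker system prune -f --filter "until=24h"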
How Netdata helps
- Daemon liveness and latency. Netdata can monitor the /_ping endpoint response time, giving you a latency trend before the daemon fully hangs, not just a binary up/down signal.
- Disk pressure correlation. Netdata’s disk usage charts for /var/lib/docker correlate directly with daemon hang risk. Watching disk usage growth rate alongside I/O wait gives early warning of the disk-exhaustion-to-hang pattern.
- Host I/O wait. The system.io and per-disk I/O wait charts show when the storage layer is under pressure, which is the leading indicator for overlay2 storage driver hangs.
- File descriptor tracking. Netdata tracks per-process FD counts, so you can alert when dockerd is approaching its FD limit before exhaustion causes failures.
- Container state counts. Tracking running, exited, and dead container counts over time surfaces accumulation patterns that contribute to daemon stress.
Related guides
- Docker disk space full: how to troubleshoot /var/lib/docker - disk exhaustion is the most common trigger for daemon hangs
- Docker container keeps restarting: causes, checks, and fixes - crash loops stress the daemon and can contribute to hangs
- Docker OOMKilled: causes, detection, and prevention - OOM events can cascade into daemon instability
- Docker monitoring checklist: the signals every production host needs - the full signal set for production Docker hosts
- Docker log rotation: preventing json-file logs from filling disk - log growth is a primary path to disk exhaustion and daemon hangs
- Docker container high memory usage: how to diagnose it - memory pressure on the host can affect daemon stability