# Docker image cleanup: safe pruning strategies for production hosts
When `df -h /var/lib/docker` shows 87% utilization, the urge to run `docker system prune -a` is strong. On production hosts, that is a mistake. Cleanup is not about finding the single command that reclaims the most space. It is about knowing exactly what each flag deletes, what it leaves behind, and which filters prevent a 3 a.m. image re-pull because a base layer was removed.
This guide covers the safe pruning hierarchy: dangling images, unused tagged images, build cache, and the dangerous flags that touch volumes or running workloads. The goal is to reclaim space, automate cleanup safely, and avoid outage patterns caused by aggressive pruning.
## What this means
Docker stores images, container writable layers, volumes, build cache, and logs under `/var/lib/docker`. Run `docker system df` for the authoritative breakdown of usage and reclaimable space. Use it to avoid chasing image bloat when the real problem is container logs or orphaned volumes.
There are three distinct cleanup scopes. `docker image prune` removes images only. `docker builder prune` removes build cache only. `docker system prune` removes stopped containers, unused networks, dangling images, and build cache. The `-a` flag expands image removal from dangling-only to all unused images. The `--volumes` flag adds volume destruction. These flags are not additive conveniences. They change the blast radius from safe intermediates to everything not currently running.
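That blast-radius distinction can be enforced mechanically instead of remembered at 3 a.m. A minimal sketch of a guard wrapper — the `guarded_prune` function name and its echo-only behavior are ours, not part of Docker:

```shell
#!/bin/sh
# Sketch: reject the widest blast-radius flags before they ever reach
# the daemon. Echo-only for illustration; swap the final echo for a
# real `docker system prune "$@"` once the policy matches your hosts.
guarded_prune() {
  for arg in "$@"; do
    case "$arg" in
      --volumes)
        echo "refusing --volumes: it destroys anonymous volumes" >&2
        return 1 ;;
      -a|--all)
        echo "refusing -a/--all: it removes every image not in use" >&2
        return 1 ;;
    esac
  done
  echo "would run: docker system prune $*"
}

guarded_prune -f --filter "until=240h"
```

Running cleanup through a wrapper like this turns "someone typed the wrong flag" into a refused command with an explanation, which is cheap insurance on stateful hosts.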
## Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Dangling build artifacts | Many `<none>:<none>` images after builds or tag overwrites | `docker images -f "dangling=true"` |
| Old tagged image versions | Repositories with dozens of tags accumulated over releases | `docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"` sorted by creation date |
| Build cache accumulation | CI runners with 100+ GB under `/var/lib/docker/buildkit` | `docker system df` Build Cache line, or `du -sh /var/lib/docker/buildkit/` |
| Exited containers holding references | High exited count preventing image removal | `docker ps -a --filter "status=exited" -q \| wc -l` |
| Confusion between prune commands | Scheduled jobs using `docker system prune` when `docker image prune` was sufficient | Review cron or systemd timer definitions |
## Quick checks
Use these read-only commands to assess the situation before running any destructive operation.
```shell
# Overall breakdown of images, containers, volumes, and build cache
docker system df

# Dangling images with sizes
docker images -f "dangling=true" --format "table {{.ID}}\t{{.Size}}\t{{.CreatedAt}}"

# Largest images
docker images --format "table {{.Size}}\t{{.Repository}}:{{.Tag}}" | sort -hr | head -20

# Exited containers that may hold image references
docker ps -a --filter "status=exited" -q | wc -l

# Build cache size on disk
du -sh /var/lib/docker/buildkit/

# Dead containers blocking cleanup
docker ps -a --filter "status=dead" --format "{{.ID}} {{.Names}}"
```
## How to diagnose it
1. **Establish baseline space usage.** Run `docker system df`. It shows total usage and reclaimable space for images, containers, local volumes, and build cache. Use it to avoid chasing image bloat when the real problem is container logs or orphaned volumes.
2. **Count dangling images.** These are untagged layers with no container reference and the safest cleanup target. Run `docker images -f "dangling=true"`. If the count or total size is significant, start here.
3. **Identify unused tagged images.** Compare your image list against running and stopped containers. An image referenced by a stopped container is considered in-use and survives `docker image prune -a`. However, once that stopped container is removed, the next prune deletes the image and it must be re-pulled.
4. **Inspect build cache separately.** On CI/CD hosts, build cache can dominate disk usage. Check the Build Cache line from `docker system df` or inspect `/var/lib/docker/buildkit/` directly. Build cache cleanup has different failure modes than image cleanup.
5. **Check for dead containers.** Containers in the `dead` state cannot be removed by standard prune commands and often indicate prior daemon or storage driver issues. Handle them manually before normal cleanup.
6. **Review daemon logs for storage errors.** If overlay2 is already unhealthy, prune operations may fail with errors such as "rw layer snapshot not found" or hang indefinitely. Check `journalctl -u docker.service` for recent storage driver errors before proceeding.
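Step 1 is easy to script once you use the machine-readable template output of `docker system df`. A sketch — the `reclaimable_gb` helper name is ours, and the awk assumes values reported in GB:

```shell
#!/bin/sh
# Sum reclaimable space (in GB) across all categories. Expects lines
# shaped like "Images<TAB>12.4GB (70%)"; the gsub strips the unit and
# the trailing percentage before summing.
reclaimable_gb() {
  awk -F'\t' '{ gsub(/GB.*/, "", $2); sum += $2 } END { printf "%.1f\n", sum }'
}

# On a live host you would feed it real data:
#   docker system df --format '{{.Type}}\t{{.Reclaimable}}' | reclaimable_gb
printf 'Images\t12.4GB (70%%)\nBuild Cache\t31.0GB (100%%)\n' | reclaimable_gb
```

Piping a fixed sample through the function, as above, lets you verify the parsing offline before pointing it at a production daemon.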
## Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Docker disk usage by images | Often the largest component of Docker disk consumption | >50 GB and most images not referenced by running containers |
| Dangling images count | Safest cleanup target; rapid accumulation indicates build or tag churn | >10 GB or rapid accumulation between builds |
| Build cache size | Separate from image layers; on CI hosts it can exceed image usage | >20 GB or >20% of total Docker disk usage |
| Docker daemon response latency | Pruning is I/O-heavy; a stressed daemon may hang during cleanup | `/_ping` or `docker info` takes >5 seconds |
| Container count by state | Exited containers hold image references and consume writable layers | Exited count growing without automated cleanup |
| Docker disk usage total | Docker fails catastrophically when the filesystem fills | >80% utilization of the `/var/lib/docker` filesystem |
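The utilization threshold in the last row is simple to check from cron. A sketch, assuming portable `df -P` output; the `over_threshold` function name is ours, and it reads stdin so the parsing is testable offline:

```shell
#!/bin/sh
# Exit non-zero when the filesystem in `df -P` output exceeds a
# percentage threshold. Live usage:
#   df -P /var/lib/docker | over_threshold 80 || <fire your alert here>
over_threshold() {
  limit="$1"
  # Column 5 of the last line is "Use%"; strip the % sign and compare.
  awk -v limit="$limit" '{ pct = $5 } END { gsub(/%/, "", pct); exit (pct + 0 > limit) }'
}

df -P / | over_threshold 80 || echo "filesystem above 80% -- start cleanup"
```

Keeping the threshold at 80% rather than 90% matters because, as noted above, the cleanup operations themselves need free space and a responsive daemon to run.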
## Fixes
### If the cause is dangling images
Run the safest prune command:
```shell
docker image prune -f
```
This removes untagged intermediate layers with no container references. It does not touch tagged images, running containers, stopped containers, volumes, or networks.
### If the cause is unused tagged images
Remove images that are not referenced by any container and are older than a safe threshold:
```shell
docker image prune -a -f --filter "until=240h"
```
The `--filter "until=240h"` flag limits removal to images older than 10 days (240 hours). Do not run `docker image prune -a` without a filter in production. Without a filter, it removes every image not referenced by a container, including base images like `ubuntu:latest` or `alpine:latest`. Any workload whose containers had been removed when the prune ran must re-pull its image on the next start.
**Tradeoff:** Images referenced by stopped containers survive this command, but once those containers are removed, the next prune deletes the image and it must be re-pulled.
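Because `until` takes hours, retention policies written in days invite off-by-24 mistakes. A tiny helper keeps the policy readable — the `until_days` name is ours:

```shell
#!/bin/sh
# Express retention in days; emit the hour-based filter string the
# prune commands expect (10 days -> "until=240h").
until_days() {
  echo "until=$(( $1 * 24 ))h"
}

# On a host (shown, not executed here):
#   docker image prune -a -f --filter "$(until_days 10)"
until_days 10    # prints until=240h
```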
### If the cause is build cache bloat
Target the build cache independently.
```shell
docker builder prune --filter "until=168h"
```
Build cache can grow to hundreds of gigabytes on CI runners. Be aware that `docker builder prune` can hang indefinitely on very large caches (there are reports of 180+ GB caches stuck for days). If the command hangs, you may need to stop the Docker daemon and manually clean `/var/lib/docker/buildkit/`.
**Note:** The `--keep-storage` flag is deprecated. The actual replacement flag is `--reserved-space`, though some versions incorrectly suggest `--max-storage` in the deprecation message.
### If stopped containers are holding image references
Remove old stopped containers first, then re-evaluate image usage:
```shell
docker container prune --filter "until=24h"
```
Stopped containers consume disk space for their writable layers and log files, and they pin images. Removing them may reveal additional reclaimable image space.
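Before pruning, it can help to see which images the stopped containers are actually pinning. A sketch — the `pinned_images` name is ours, and the function reads one image reference per line so it can be tested offline:

```shell
#!/bin/sh
# Rank the images pinned by stopped containers, most-pinned first.
# Live usage:
#   docker ps -a --filter "status=exited" --format '{{.Image}}' | pinned_images
pinned_images() {
  sort | uniq -c | sort -rn
}

printf 'nginx:1.25\nredis:7\nnginx:1.25\n' | pinned_images
```

An image near the top of this list with many exited containers behind it is a strong hint that pruning the containers first will unlock meaningful image space.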
### If container logs are the real disk consumer
`docker system prune` does not touch container log files. On hosts using the `json-file` log driver, logs live in `/var/lib/docker/containers/<id>/<id>-json.log` and can grow without bound. If disk pressure is critical, truncate the largest log files manually:
```shell
# <id> is the full container ID
truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log
```
This is safe while the container is running because the file is opened with `O_APPEND`. Then configure log rotation in `daemon.json` to prevent recurrence.
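Truncating one log at a time does not scale when dozens of containers misbehave. A batch sketch under the same `O_APPEND` safety argument — the `truncate_big_logs` name is ours, and it assumes GNU `truncate` and standard `find -size` syntax:

```shell
#!/bin/sh
# Truncate every json-file log above a size threshold under a
# containers directory. Safe on running containers for the same
# O_APPEND reason as the manual command above.
truncate_big_logs() {
  root="$1"    # e.g. /var/lib/docker/containers
  size="$2"    # find(1) syntax, e.g. +500M
  find "$root" -name '*-json.log' -size "$size" -exec truncate -s 0 {} +
}

# Deliberate invocation only -- not something to leave in cron blindly:
#   truncate_big_logs /var/lib/docker/containers +500M
```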
### Avoid the nuclear option
Do not run `docker system prune -a --volumes` on production hosts running databases, queues, or any stateful workload. This command removes unused images, stopped containers, networks, build cache, and anonymous volumes. IBM documented cases where this command extended production outages by destroying data volumes during incident response. Named volumes are excluded by default, but anonymous volumes attached to stopped containers are destroyed.
## Prevention
- **Configure log rotation.** Add `"log-opts": {"max-size": "10m", "max-file": "3"}` to `/etc/docker/daemon.json`. Unbounded json-file logs are the most common cause of Docker disk exhaustion.
- **Schedule filtered pruning, not blanket pruning.** A cron job running `docker image prune -a -f --filter "until=240h"` is safer than `docker system prune -a`.
- **Monitor growth rate, not just absolute usage.** Alert when `/var/lib/docker` exceeds 70% or grows faster than 5 GB per day. Waiting until 90% leaves no runway if the cleanup operations themselves fail.
- **Separate CI runner cleanup.** On build hosts, run `docker builder prune` on a different schedule than `docker image prune`. Build cache has different retention needs and a different risk profile.
- **Do not use `--volumes` in automation** unless the host is strictly stateless. Volume cleanup should be a deliberate, audited operation.
- **Test cleanup on a representative non-production host first.** Verify that your filters and age thresholds remove what you expect and preserve what you need.
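The log-rotation snippet from the first bullet has to land inside valid JSON, and a typo in `daemon.json` can stop the daemon from starting. A sketch that assembles the file and syntax-checks it before it goes anywhere near `/etc/docker` — the temp-file staging is our convention, and the check assumes `python3` is available:

```shell
#!/bin/sh
# Stage the log-rotation config and validate it as JSON. Copy to
# /etc/docker/daemon.json and restart the daemon to apply; existing
# containers keep their old log settings until recreated.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
python3 -m json.tool "$tmp" > /dev/null && echo "daemon.json is valid JSON"
```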
## How Netdata helps
- Correlate Docker disk usage with host filesystem utilization to isolate whether images, containers, volumes, or build cache are growing.
- Alert on Docker daemon response latency spikes before and after heavy prune operations to detect daemon stress or storage driver contention.
- Track container count by state to detect exited container accumulation that blocks image removal.
- Monitor disk fill rate on the `/var/lib/docker` filesystem to trigger cleanup before exhaustion.
## Related guides
- If `docker ps` or other commands hang during cleanup, see Docker commands hang: docker ps, inspect, and exec freezes.
- If the daemon becomes unresponsive after pruning, see Docker daemon not responding: how to troubleshoot a hung dockerd.
- For broader disk exhaustion patterns, see Docker disk space full: how to troubleshoot /var/lib/docker.
- If containers are crashing after image re-pulls, see Docker container exits immediately: how to diagnose it and Docker exit code 1: application errors and how to find them.
- For resource pressure inside containers, see Docker container high CPU usage: causes and fixes, Docker container high memory usage: how to diagnose it, and Docker CPU throttling: the hidden cause of container latency.
- For restart loops and health issues, see Docker container keeps restarting: causes, checks, and fixes and Docker container running but unhealthy: how to diagnose health check failures.
- For memory-specific diagnosis, see Docker container memory leak: how to find one and prove it.
- For DNS issues that may appear after container recreation, see Docker DNS not working inside containers.