Docker image cleanup: safe pruning strategies for production hosts

When df -h /var/lib/docker shows 87% utilization, the urge to run docker system prune -a is strong. On production hosts, that is a mistake. Cleanup is not about finding the single command that reclaims the most space. It is about knowing exactly what each flag deletes, what it leaves behind, and which filters prevent a 3 a.m. image re-pull because a base layer was removed.

This guide covers the safe pruning hierarchy: dangling images, unused tagged images, build cache, and the dangerous flags that touch volumes or running workloads. The goal is to reclaim space, automate cleanup safely, and avoid outage patterns caused by aggressive pruning.

What this means

Docker stores images, container writable layers, volumes, build cache, and logs under /var/lib/docker. Run docker system df for the authoritative breakdown of usage and reclaimable space. Use it to avoid chasing image bloat when the real problem is container logs or orphaned volumes.
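For per-image, per-container, and per-volume detail, add the verbose flag:

# Verbose breakdown: lists every image, container, and volume with its size
docker system df -v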

There are three distinct cleanup scopes. docker image prune removes images only. docker builder prune removes build cache only. docker system prune removes stopped containers, unused networks, dangling images, and build cache. The -a flag expands image removal from dangling-only to all unused images. The --volumes flag adds volume destruction. These flags are not additive conveniences. They change the blast radius from safe intermediates to everything not currently running.
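Put side by side, from safest to most destructive (summarizing the behavior described above; verify against your Docker version):

docker image prune                  # dangling images only
docker image prune -a               # all images not referenced by any container
docker builder prune                # build cache only
docker system prune                 # stopped containers, unused networks, dangling images, build cache
docker system prune -a --volumes    # adds all unused images and anonymous volumes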

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Dangling build artifacts | Many <none>:<none> images after builds or tag overwrites | docker images -f "dangling=true" |
| Old tagged image versions | Repositories with dozens of tags accumulated over releases | docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" sorted by creation date |
| Build cache accumulation | CI runners with 100+ GB under /var/lib/docker/buildkit | docker system df Build Cache line, or du -sh /var/lib/docker/buildkit/ |
| Exited containers holding references | High exited count preventing image removal | docker ps -a --filter "status=exited" -q \| wc -l |
| Confusion between prune commands | Scheduled jobs using docker system prune when docker image prune was sufficient | Review cron or systemd timer definitions |

Quick checks

Use these read-only commands to assess the situation before running any destructive operation.

# Overall breakdown of images, containers, volumes, and build cache
docker system df

# Dangling images with sizes
docker images -f "dangling=true" --format "table {{.ID}}\t{{.Size}}\t{{.CreatedAt}}"

# Largest images
docker images --format "{{.Size}}\t{{.Repository}}:{{.Tag}}" | sort -hr | head -20

# Exited containers that may hold image references
docker ps -a --filter "status=exited" -q | wc -l

# Build cache size on disk
du -sh /var/lib/docker/buildkit/

# Dead containers blocking cleanup
docker ps -a --filter "status=dead" --format "{{.ID}} {{.Names}}"

How to diagnose it

  1. Establish baseline space usage. Run docker system df. It shows total usage and reclaimable space for images, containers, local volumes, and build cache, and tells you immediately whether images are actually the problem. (The six checks in this list are combined into a single read-only script after the list.)

  2. Count dangling images. These are untagged layers with no container reference and the safest cleanup target. Run docker images -f "dangling=true". If the count or total size is significant, start here.

  3. Identify unused tagged images. Compare your image list against running and stopped containers. An image referenced by a stopped container is considered in-use and survives docker image prune -a. But once that stopped container is removed, the next prune deletes the image, and it must be re-pulled.

  4. Inspect build cache separately. On CI/CD hosts, build cache can dominate disk usage. Check the Build Cache line from docker system df or inspect /var/lib/docker/buildkit/ directly. Build cache cleanup has different failure modes than image cleanup.

  5. Check for dead containers. Containers in dead state cannot be removed by standard prune commands and often indicate prior daemon or storage driver issues. Handle them manually before normal cleanup.

  6. Review daemon logs for storage errors. If overlay2 is already unhealthy, prune operations may fail with errors such as “rw layer snapshot not found” or hang indefinitely. Check journalctl -u docker.service for recent storage driver errors before proceeding.
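The script below chains the read-only checks from this list into one pass (a sketch; the file name is illustrative, and it assumes bash plus a systemd host for journalctl):

#!/usr/bin/env bash
# docker-disk-triage.sh -- read-only assessment; changes nothing on the host
set -euo pipefail

echo "== Disk breakdown =="
docker system df

echo "== Dangling images =="
docker images -f "dangling=true" --format "table {{.ID}}\t{{.Size}}\t{{.CreatedAt}}"

echo "== Exited containers =="
docker ps -a --filter "status=exited" -q | wc -l

echo "== Dead containers =="
docker ps -a --filter "status=dead" --format "{{.ID}} {{.Names}}"

echo "== Recent storage driver errors =="
journalctl -u docker.service --since "24 hours ago" | grep -iE "overlay|snapshot|layer" | tail -20 || true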

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Docker disk usage by images | Often the largest component of Docker disk consumption | >50 GB and most images not referenced by running containers |
| Dangling images count | Safest cleanup target; rapid accumulation indicates build or tag churn | >10 GB or rapid accumulation between builds |
| Build cache size | Separate from image layers; on CI hosts it can exceed image usage | >20 GB or >20% of total Docker disk usage |
| Docker daemon response latency | Pruning is I/O-heavy; a stressed daemon may hang during cleanup | /_ping or docker info takes >5 seconds |
| Container count by state | Exited containers hold image references and consume writable layers | Exited count growing without automated cleanup |
| Docker disk usage total | Docker fails catastrophically when the filesystem fills | >80% utilization of the /var/lib/docker filesystem |
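To sample daemon latency directly, time the /_ping endpoint from the table over the default Unix socket:

# A healthy daemon answers /_ping in milliseconds
time curl -sf --unix-socket /var/run/docker.sock http://localhost/_ping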

Fixes

If the cause is dangling images

Run the safest prune command:

docker image prune -f

This removes untagged intermediate layers with no container references. It does not touch tagged images, running containers, stopped containers, volumes, or networks.
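To see exactly what will be deleted before committing, list the dangling set first, then confirm the space came back:

docker images -f "dangling=true"   # the exact set the prune targets
docker image prune -f
docker system df                   # verify the reclaimed space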

If the cause is unused tagged images

Remove images that are not referenced by any container and are older than a safe threshold:

docker image prune -a -f --filter "until=240h"

The --filter "until=240h" flag limits removal to images older than 10 days. Do not run docker image prune -a without a filter in production. Without a filter, it removes every image not referenced by any container, including base images like ubuntu:latest or alpine:latest, and any pruned image that is needed again must be re-pulled.

Tradeoff: Images referenced by stopped containers survive this command, but once those containers are removed, the next prune deletes the image and it must be re-pulled.
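There is no dry-run flag for docker image prune. A rough preview of the age-based candidates, assuming GNU date, is a loop like this (it lists every image past the threshold; the prune itself additionally skips images referenced by any container):

# List images created more than 10 days ago
cutoff=$(date -d '10 days ago' +%s)
docker images --format '{{.ID}} {{.Repository}}:{{.Tag}}' | while read -r id ref; do
  created=$(docker inspect -f '{{.Created}}' "$id")
  [ "$(date -d "$created" +%s)" -lt "$cutoff" ] && echo "candidate: $ref ($id)"
done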

If the cause is build cache bloat

Target the build cache independently.

docker builder prune --filter "until=168h"

Build cache can grow to hundreds of gigabytes on CI runners. Be aware that docker builder prune can hang indefinitely on very large partitions (reports of 180+ GB cache stuck for days). If the command hangs, you may need to stop the Docker daemon and manually clean /var/lib/docker/buildkit/.
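If the prune does hang, the last-resort path looks roughly like this (a destructive sketch: it discards all build cache, not just old entries, and assumes a systemd host):

systemctl stop docker
rm -rf /var/lib/docker/buildkit/*   # destroys ALL build cache
systemctl start docker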

Note: The --keep-storage flag is deprecated. The actual replacement flag is --reserved-space, though some versions incorrectly suggest --max-storage in the deprecation message.

If stopped containers are holding image references

Remove old stopped containers first, then re-evaluate image usage:

docker container prune --filter "until=24h"

Stopped containers consume disk space for their writable layers and log files, and they pin images. Removing them may reveal additional reclaimable image space.
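A typical sequence, with a re-check in between (thresholds are illustrative):

docker container prune -f --filter "until=24h"    # clear stopped containers older than 24h
docker system df                                  # image reclaimable space often grows here
docker image prune -a -f --filter "until=240h"    # then prune the images they were pinning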

If container logs are the real disk consumer

docker system prune does not touch container log files. On hosts using the json-file log driver, logs live in /var/lib/docker/containers/<id>/<id>-json.log and can grow without bound. If disk pressure is critical, find the largest log files and truncate them manually.
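First, surface the biggest offenders (assumes the default json-file layout):

# Largest container log files, biggest first
du -ah /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -hr | head -10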

# <id> is the full container ID
truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log

This is safe while the container is running because the file is opened with O_APPEND. Then configure log rotation in daemon.json to prevent recurrence.
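A minimal /etc/docker/daemon.json that enables rotation looks like this (note: log-opts apply only to containers created after the daemon restarts):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}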

Avoid the nuclear option

Do not run docker system prune -a --volumes on production hosts running databases, queues, or any stateful workload. This command removes unused images, stopped containers, networks, build cache, and anonymous volumes. IBM documented cases where this command extended production outages by destroying data volumes during incident response. Named volumes are excluded by default, but anonymous volumes attached to stopped containers are destroyed.
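Before any --volumes operation, audit what is at risk. Anonymous volumes typically carry 64-character hex names, so a rough split looks like this (a heuristic, not a guarantee):

# Volumes not attached to any container
docker volume ls -f dangling=true
# Likely anonymous volumes (64-hex-char names)
docker volume ls --format '{{.Name}}' | grep -E '^[0-9a-f]{64}$'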

Prevention

  • Configure log rotation. Add "log-opts": {"max-size": "10m", "max-file": "3"} to /etc/docker/daemon.json. Unbounded json-file logs are the most common cause of Docker disk exhaustion.
  • Schedule filtered pruning, not blanket pruning. A cron job running docker image prune -a -f --filter "until=240h" is safer than docker system prune -a (see the sketch after this list).
  • Monitor growth rate, not just absolute usage. Alert when /var/lib/docker exceeds 70% or grows faster than 5 GB per day. Waiting until 90% leaves no runway for cleanup operations themselves to fail.
  • Separate CI runner cleanup. On build hosts, run docker builder prune on a different schedule than docker image prune. Build cache has different retention needs and risk profiles.
  • Do not use --volumes in automation unless the host is strictly stateless. Volume cleanup should be a deliberate, audited operation.
  • Test cleanup on a representative non-production host first. Verify that your filters and age thresholds remove what you expect and preserve what you need.
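As referenced above, a filtered nightly prune via cron is a reasonable starting point (file path, schedule, and threshold are illustrative):

# /etc/cron.d/docker-prune -- illustrative path and schedule
30 3 * * * root docker image prune -a -f --filter "until=240h" >>/var/log/docker-prune.log 2>&1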

How Netdata helps

  • Correlate Docker disk usage with host filesystem utilization to isolate whether images, containers, volumes, or build cache are growing.
  • Alert on Docker daemon response latency spikes before and after heavy prune operations to detect daemon stress or storage driver contention.
  • Track container count by state to detect exited container accumulation that blocks image removal.
  • Monitor disk fill rate on the /var/lib/docker filesystem to trigger cleanup before exhaustion.