Docker image cleanup: safe pruning strategies for production hosts

When df -h /var/lib/docker shows 87% utilization, the urge to run docker system prune -a is strong. On production hosts, that is a mistake. Cleanup is not about finding the single command that reclaims the most space. It is about knowing exactly what each flag deletes, what it leaves behind, and which filters prevent a 3 a.m. image re-pull because a base layer was removed.

This guide covers the safe pruning hierarchy: dangling images, unused tagged images, build cache, and the dangerous flags that touch volumes or running workloads. The goal is to reclaim space, automate cleanup safely, and avoid outage patterns caused by aggressive pruning.

What this means

Docker stores images, container writable layers, volumes, build cache, and logs under /var/lib/docker. Run docker system df for the authoritative breakdown of usage and reclaimable space. Use it to avoid chasing image bloat when the real problem is container logs or orphaned volumes.
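For per-image, per-container, and per-volume detail, add the verbose flag:

# Verbose breakdown: lists every image, container, and volume with its size
docker system df -v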

There are three distinct cleanup scopes. docker image prune removes images only. docker builder prune removes build cache only. docker system prune removes stopped containers, unused networks, dangling images, and build cache. The -a flag expands image removal from dangling-only to all unused images. The --volumes flag adds volume destruction. These flags are not additive conveniences. They change the blast radius from safe intermediates to everything not currently running.
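Put side by side, from safest to most destructive (summarizing the behavior described above; verify against your Docker version):

docker image prune                  # dangling images only
docker image prune -a               # all images not referenced by any container
docker builder prune                # build cache only
docker system prune                 # stopped containers, unused networks, dangling images, build cache
docker system prune -a --volumes    # adds all unused images and anonymous volumes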

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Dangling build artifacts | Many <none>:<none> images after builds or tag overwrites | docker images -f "dangling=true" |
| Old tagged image versions | Repositories with dozens of tags accumulated over releases | docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" sorted by creation date |
| Build cache accumulation | CI runners with 100+ GB under /var/lib/docker/buildkit | docker system df Build Cache line, or du -sh /var/lib/docker/buildkit/ |
| Exited containers holding references | High exited count preventing image removal | docker ps -a --filter "status=exited" -q \| wc -l |
| Confusion between prune commands | Scheduled jobs using docker system prune when docker image prune was sufficient | Review cron or systemd timer definitions |

Quick checks

Use these read-only commands to assess the situation before running any destructive operation.

# Overall breakdown of images, containers, volumes, and build cache
docker system df

# Dangling images with sizes
docker images -f "dangling=true" --format "table {{.ID}}\t{{.Size}}\t{{.CreatedAt}}"

# Largest images
docker images --format "{{.Size}}\t{{.Repository}}:{{.Tag}}" | sort -hr | head -20

# Exited containers that may hold image references
docker ps -a --filter "status=exited" -q | wc -l

# Build cache size on disk
du -sh /var/lib/docker/buildkit/

# Dead containers blocking cleanup
docker ps -a --filter "status=dead" --format "{{.ID}} {{.Names}}"

How to diagnose it

  1. Establish baseline space usage. Run docker system df. It shows total usage and reclaimable space for images, containers, local volumes, and build cache, and tells you immediately whether images are actually the problem. (The six checks in this list are combined into a single read-only script after the list.)

  2. Count dangling images. These are untagged layers with no container reference and the safest cleanup target. Run docker images -f "dangling=true". If the count or total size is significant, start here.

  3. Identify unused tagged images. Compare your image list against running and stopped containers. An image referenced by a stopped container is considered in-use and survives docker image prune -a. But once that stopped container is removed, the next prune deletes the image, and it must be re-pulled.

  4. Inspect build cache separately. On CI/CD hosts, build cache can dominate disk usage. Check the Build Cache line from docker system df or inspect /var/lib/docker/buildkit/ directly. Build cache cleanup has different failure modes than image cleanup.

  5. Check for dead containers. Containers in dead state cannot be removed by standard prune commands and often indicate prior daemon or storage driver issues. Handle them manually before normal cleanup.

  6. Review daemon logs for storage errors. If overlay2 is already unhealthy, prune operations may fail with errors such as “rw layer snapshot not found” or hang indefinitely. Check journalctl -u docker.service for recent storage driver errors before proceeding.
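The script below chains the read-only checks from this list into one pass (a sketch; the file name is illustrative, and it assumes bash plus a systemd host for journalctl):

#!/usr/bin/env bash
# docker-disk-triage.sh -- read-only assessment; changes nothing on the host
set -euo pipefail

echo "== Disk breakdown =="
docker system df

echo "== Dangling images =="
docker images -f "dangling=true" --format "table {{.ID}}\t{{.Size}}\t{{.CreatedAt}}"

echo "== Exited containers =="
docker ps -a --filter "status=exited" -q | wc -l

echo "== Dead containers =="
docker ps -a --filter "status=dead" --format "{{.ID}} {{.Names}}"

echo "== Recent storage driver errors =="
journalctl -u docker.service --since "24 hours ago" | grep -iE "overlay|snapshot|layer" | tail -20 || true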

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Docker disk usage by images | Often the largest component of Docker disk consumption | >50 GB and most images not referenced by running containers |
| Dangling images count | Safest cleanup target; rapid accumulation indicates build or tag churn | >10 GB or rapid accumulation between builds |
| Build cache size | Separate from image layers; on CI hosts it can exceed image usage | >20 GB or >20% of total Docker disk usage |
| Docker daemon response latency | Pruning is I/O-heavy; a stressed daemon may hang during cleanup | /_ping or docker info takes >5 seconds |
| Container count by state | Exited containers hold image references and consume writable layers | Exited count growing without automated cleanup |
| Docker disk usage total | Docker fails catastrophically when the filesystem fills | >80% utilization of the /var/lib/docker filesystem |
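To sample daemon latency directly, time the /_ping endpoint from the table over the default Unix socket:

# A healthy daemon answers /_ping in milliseconds
time curl -sf --unix-socket /var/run/docker.sock http://localhost/_ping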

Fixes

If the cause is dangling images

Run the safest prune command:

docker image prune -f

This removes untagged intermediate layers with no container references. It does not touch tagged images, running containers, stopped containers, volumes, or networks.
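To see exactly what will be deleted before committing, list the dangling set first, then confirm the space came back:

docker images -f "dangling=true"   # the exact set the prune targets
docker image prune -f
docker system df                   # verify the reclaimed space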

If the cause is unused tagged images

Remove images that are not referenced by any container and are older than a safe threshold:

docker image prune -a -f --filter "until=240h"

The --filter "until=240h" flag limits removal to images older than 10 days. Do not run docker image prune -a without a filter in production. Without a filter, it removes every image not referenced by any container, including base images like ubuntu:latest or alpine:latest, and any pruned image that is needed again must be re-pulled.

Tradeoff: Images referenced by stopped containers survive this command, but once those containers are removed, the next prune deletes the image and it must be re-pulled.
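There is no dry-run flag for docker image prune. A rough preview of the age-based candidates, assuming GNU date, is a loop like this (it lists every image past the threshold; the prune itself additionally skips images referenced by any container):

# List images created more than 10 days ago
cutoff=$(date -d '10 days ago' +%s)
docker images --format '{{.ID}} {{.Repository}}:{{.Tag}}' | while read -r id ref; do
  created=$(docker inspect -f '{{.Created}}' "$id")
  [ "$(date -d "$created" +%s)" -lt "$cutoff" ] && echo "candidate: $ref ($id)"
done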

If the cause is build cache bloat

Target the build cache independently.

docker builder prune --filter "until=168h"

Build cache can grow to hundreds of gigabytes on CI runners. Be aware that docker builder prune can hang indefinitely on very large partitions (reports of 180+ GB cache stuck for days). If the command hangs, you may need to stop the Docker daemon and manually clean /var/lib/docker/buildkit/.
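If the prune does hang, the last-resort path looks roughly like this (a destructive sketch: it discards all build cache, not just old entries, and assumes a systemd host):

systemctl stop docker
rm -rf /var/lib/docker/buildkit/*   # destroys ALL build cache
systemctl start docker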

Note: The --keep-storage flag is deprecated. The actual replacement flag is --reserved-space, though some versions incorrectly suggest --max-storage in the deprecation message.

If stopped containers are holding image references

Remove old stopped containers first, then re-evaluate image usage:

docker container prune --filter "until=24h"

Stopped containers consume disk space for their writable layers and log files, and they pin images. Removing them may reveal additional reclaimable image space.
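A typical sequence, with a re-check in between (thresholds are illustrative):

docker container prune -f --filter "until=24h"    # clear stopped containers older than 24h
docker system df                                  # image reclaimable space often grows here
docker image prune -a -f --filter "until=240h"    # then prune the images they were pinning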

If container logs are the real disk consumer

docker system prune does not touch container log files. On hosts using the json-file log driver, logs live in /var/lib/docker/containers/<id>/<id>-json.log and can grow without bound. If disk pressure is critical, find the largest log files and truncate them manually.
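First, surface the biggest offenders (assumes the default json-file layout):

# Largest container log files, biggest first
du -ah /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -hr | head -10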

# <id> is the full container ID
truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log

This is safe while the container is running because the file is opened with O_APPEND. Then configure log rotation in daemon.json to prevent recurrence.
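A minimal /etc/docker/daemon.json that enables rotation looks like this (note: log-opts apply only to containers created after the daemon restarts):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}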

Avoid the nuclear option

Do not run docker system prune -a --volumes on production hosts running databases, queues, or any stateful workload. This command removes unused images, stopped containers, networks, build cache, and anonymous volumes. IBM documented cases where this command extended production outages by destroying data volumes during incident response. Named volumes are excluded by default, but anonymous volumes attached to stopped containers are destroyed.
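Before any --volumes operation, audit what is at risk. Anonymous volumes typically carry 64-character hex names, so a rough split looks like this (a heuristic, not a guarantee):

# Volumes not attached to any container
docker volume ls -f dangling=true
# Likely anonymous volumes (64-hex-char names)
docker volume ls --format '{{.Name}}' | grep -E '^[0-9a-f]{64}$'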

Prevention

  • Configure log rotation. Add "log-opts": {"max-size": "10m", "max-file": "3"} to /etc/docker/daemon.json. Unbounded json-file logs are the most common cause of Docker disk exhaustion.
  • Schedule filtered pruning, not blanket pruning. A cron job running docker image prune -a -f --filter "until=240h" is safer than docker system prune -a (see the sketch after this list).
  • Monitor growth rate, not just absolute usage. Alert when /var/lib/docker exceeds 70% or grows faster than 5 GB per day. Waiting until 90% leaves no runway for cleanup operations themselves to fail.
  • Separate CI runner cleanup. On build hosts, run docker builder prune on a different schedule than docker image prune. Build cache has different retention needs and risk profiles.
  • Do not use --volumes in automation unless the host is strictly stateless. Volume cleanup should be a deliberate, audited operation.
  • Test cleanup on a representative non-production host first. Verify that your filters and age thresholds remove what you expect and preserve what you need.
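As referenced above, a filtered nightly prune via cron is a reasonable starting point (file path, schedule, and threshold are illustrative):

# /etc/cron.d/docker-prune -- illustrative path and schedule
30 3 * * * root docker image prune -a -f --filter "until=240h" >>/var/log/docker-prune.log 2>&1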

How Netdata helps

  • Correlate Docker disk usage with host filesystem utilization to isolate whether images, containers, volumes, or build cache are growing.
  • Alert on Docker daemon response latency spikes before and after heavy prune operations to detect daemon stress or storage driver contention.
  • Track container count by state to detect exited container accumulation that blocks image removal.
  • Monitor disk fill rate on the /var/lib/docker filesystem to trigger cleanup before exhaustion.