Kubernetes pod ImagePullBackOff: registry, auth, and network diagnosis

ImagePullBackOff means the kubelet cannot pull a required image. After each ErrImagePull failure, the kubelet retries with exponential backoff capped at five minutes. When serializeImagePulls is true (the default), a single slow pull blocks every subsequent pull on that node. Read the exact error from the CRI in the pod events, test the registry directly from the node, and fix the root cause instead of blindly recreating pods.

What this means

The kubelet asks the container runtime to pull any image not cached locally. The runtime resolves the registry, authenticates, downloads layers, and unpacks them into node storage. A failure at any step returns a CRI error that the kubelet surfaces as a pod event. kubectl get pod shows only the state; the reason lives in the events.

Common causes

  • Invalid image or tag. Event says "not found" or "manifest unknown". First check: kubectl describe pod Events.
  • Missing registry credentials. Event says "unauthorized" or "authentication required". First check: imagePullSecrets on the pod or service account.
  • Network or DNS failure. Event says "dial tcp: i/o timeout" or "no such host". First check: DNS and TCP path from the node to the registry.
  • Node disk pressure. Event says "no space left on device". First check: df -h and the DiskPressure condition.
  • Registry rate limiting. Event says "429 Too Many Requests" or "toomanyrequests". First check: registry status and pull error rate.
  • Runtime not responding. Event contains "rpc error: code = Unknown desc = ...". First check: crictl info and runtime socket health.

Quick checks

# Find pods in ImagePullBackOff or ErrImagePull
kubectl get pods -A --field-selector status.phase=Pending -o json | \
  jq '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff" or .status.containerStatuses[]?.state.waiting.reason == "ErrImagePull") | {namespace: .metadata.namespace, name: .metadata.name}'

# Inspect pod events for the exact CRI error
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}

# Find recent pull failures across the cluster
kubectl get events -A --field-selector reason=Failed | grep -iE "pull|image"

# Check pull secrets attached to the pod
kubectl get pod ${POD_NAME} -o jsonpath='{.spec.imagePullSecrets[*].name}'

# Check pull secrets attached to the service account
kubectl get serviceaccount ${SA_NAME} -n ${NAMESPACE} -o jsonpath='{.imagePullSecrets[*].name}'

# Verify secret type; must be kubernetes.io/dockerconfigjson
kubectl get secret ${SECRET_NAME} -o jsonpath='{.type}'

# Inspect the registry and credentials stored in the secret
kubectl get secret ${SECRET_NAME} -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths'

# Reproduce the pull directly on the affected node
crictl pull ${IMAGE_REFERENCE}

# Pull kubelet metrics via the API server (substitute node name)
kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/metrics" | \
  grep 'kubelet_runtime_operations_errors_total.*pull_image'

# Check node conditions
kubectl get node ${NODE_NAME} -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'

# Check node filesystem utilization (paths vary by runtime config)
df -h /var/lib/kubelet /var/lib/containerd /var/lib/docker

# Recent kubelet logs for pull activity
journalctl -u kubelet --since "30 minutes ago" --no-pager | grep -iE "pulling image|pulled|error"

How to diagnose it

  1. Read the pod event message. Run kubectl describe pod. The event text from the CRI is the primary signal. Look for not found, unauthorized, timeout, certificate signed by unknown authority, no space left on device, or rpc error.
  2. Verify the image reference. A typo or deleted tag produces manifest unknown. Test directly on the node with crictl pull. If the node succeeds but the pod fails, suspect imagePullSecrets or a node-specific network issue.
  3. Check authentication for private registries. Confirm imagePullSecrets is set on the pod or on its service account. Verify the secret type is kubernetes.io/dockerconfigjson; an Opaque secret will not work. Ensure the registry server string inside the secret matches the registry hostname exactly, including any port. Existing pods must be recreated to pick up a new service account secret.
  4. Test the network path from the node. Resolve the registry hostname with nslookup or dig, then open a TCP connection with nc -zv ${REGISTRY_HOST} 443. If the node is in a private subnet without external egress, public registries are unreachable without NAT, VPC endpoints, or a pull-through mirror.
  5. Inspect node disk space and pressure. Run df -h against nodefs and imagefs. Check the node for DiskPressure. If utilization is high or the condition is True, the runtime cannot unpack new layers. Use crictl images to identify large or unused images. If imagefs is a separate filesystem, ensure both it and nodefs have free space. Clean up logs, unused images, or emptyDir data.
  6. Review kubelet and runtime metrics. Check kubelet_image_pull_duration_seconds and kubelet_runtime_operations_errors_total{operation_type="pull_image"}. A spike in duration points to registry or network degradation. A spike in errors points to auth failures or missing images.
  7. Determine the scope. If every pod on a single node fails, suspect node disk, network, or runtime health. If pods across the cluster fail for the same image, suspect a registry outage, expired credentials, or a deleted tag.
  8. Reset backoff after the fix. ImagePullBackOff waits up to five minutes between retries. Deleting the pod (or letting the controller recreate it) resets the backoff immediately once the root cause is resolved.
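The exact-hostname requirement in step 3 can be checked mechanically. A minimal sketch, using a placeholder image reference and a simulated decoded secret (in practice, feed in the base64-decoded output of the kubectl get secret command above):

```shell
# Sketch: verify that the registry host in an image reference has a
# matching key in the pull secret's auths map. The image reference and
# decoded .dockerconfigjson below are placeholder examples.
IMAGE="registry.example.com:5000/team/app:1.4.2"

# The host (including any port) is everything before the first "/".
# Note: bare Docker Hub references like "nginx:1.25" carry no host.
REGISTRY_HOST=${IMAGE%%/*}

# Simulated decoded secret; in practice, base64-decode .dockerconfigjson.
DOCKERCONFIG='{"auths":{"registry.example.com:5000":{"auth":"ZGVwbG95OnMzY3JldA=="}}}'

# The auths key must match the host, including the port, exactly.
if printf '%s' "$DOCKERCONFIG" | grep -q "\"$REGISTRY_HOST\""; then
  echo "secret covers $REGISTRY_HOST"
else
  echo "no auths entry for $REGISTRY_HOST"
fi
```

If the host comparison fails even though credentials exist, the secret was likely created for a different server string (for example, with or without the port).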

Metrics and signals to monitor

  • kubelet_runtime_operations_errors_total{operation_type="pull_image"}: counts pull failures at the CRI layer. Warning sign: a sustained increase above baseline.
  • kubelet_image_pull_duration_seconds: measures registry and network performance. Warning sign: p99 exceeds the baseline for the image size.
  • Node DiskPressure condition: image extraction needs writable disk. Warning sign: the condition is True.
  • kubelet_evictions_total with a disk signal: the node is critically low on space. Warning sign: any disk-triggered eviction event.
  • Pending pods with a waiting reason: user-visible impact. Warning sign: pods in ErrImagePull or ImagePullBackOff for more than five minutes.
  • Container runtime operation latency: a slow runtime can stall pulls. Warning sign: crictl commands hang or time out.

Fixes

If the image reference is wrong

Update the workload spec with the correct tag or digest. Push the image if it is missing from the registry. Avoid mutable tags like latest if you need reproducible deployments.
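To make "pin by digest" concrete, here is a minimal sketch; the repository name and digest value are placeholders:

```shell
# Sketch: the same image referenced three ways. Only the digest form is
# immutable; any tag, including a versioned one, can be re-pushed.
# Repository name and digest below are placeholders.
MUTABLE="registry.example.com/team/app:latest"
TAGGED="registry.example.com/team/app:1.4.2"
PINNED="registry.example.com/team/app@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

# A digest reference always contains "@sha256:"; a cheap CI guard can
# reject anything else in production manifests.
case "$PINNED" in
  *@sha256:*) echo "pinned" ;;
  *)          echo "not pinned: $PINNED" ;;
esac
```

The digest of a running container is visible in pod status under .status.containerStatuses[*].imageID, which is a convenient source for pinning an image that is already known to work.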

If authentication fails

Create a secret of type kubernetes.io/dockerconfigjson and attach it to the pod imagePullSecrets or to the default service account in the namespace. Existing pods must be recreated to pick up a new service account secret. Ensure the registry server string in the .dockerconfigjson matches the registry hostname exactly. On managed clusters, verify the node identity or IAM role has registry read permissions.
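To see what the secret must contain, this sketch assembles a .dockerconfigjson payload by hand; the registry, username, and password are placeholder values:

```shell
# Sketch: hand-assemble the payload a kubernetes.io/dockerconfigjson
# secret carries. All three values below are placeholders.
REGISTRY=registry.example.com
USERNAME=deploy-bot
PASSWORD=s3cret

# "auth" is base64 of "username:password"; the top-level auths key must
# be the registry host exactly as written in the image reference.
AUTH=$(printf '%s:%s' "$USERNAME" "$PASSWORD" | base64)
printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}\n' \
  "$REGISTRY" "$USERNAME" "$PASSWORD" "$AUTH" > dockerconfig.json

# Round-trip check: the auth field must decode back to username:password.
printf '%s' "$AUTH" | base64 -d; echo
```

In practice, kubectl create secret docker-registry regcred --docker-server="$REGISTRY" --docker-username="$USERNAME" --docker-password="$PASSWORD" builds the same structure for you, with the correct secret type already set.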

Inspect the secret content to confirm the registry and credentials:

kubectl get secret ${SECRET_NAME} -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths'

If the network or DNS is unreachable

Fix DNS resolution on the node or cluster DNS. Open egress to the registry endpoint. For private nodes, configure NAT, VPC endpoints, or a pull-through registry mirror inside the network.

Test from the node:

nslookup ${REGISTRY_HOST}
nc -zv ${REGISTRY_HOST} 443
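If you want one probe that separates transport problems from auth problems, this sketch (with a placeholder hostname) requests the registry's /v2/ endpoint. Any HTTP status, including the 401 that anonymous requests normally receive, proves DNS, TCP, and TLS all work:

```shell
# Sketch: probe a registry's HTTP API. Any HTTP status means the network
# path is fine; "000" means the request failed before HTTP (DNS, TCP, TLS).
probe_registry() {
  code=$(curl -sSo /dev/null -m 10 -w '%{http_code}' "https://$1/v2/" 2>/dev/null)
  if [ -z "$code" ] || [ "$code" = "000" ]; then
    echo "transport failure (DNS, TCP, or TLS) for $1"
  else
    echo "HTTP $code from $1"
  fi
}

# registry.example.com is a placeholder; substitute your registry host.
probe_registry registry.example.com
```

Run it from the affected node so the probe crosses the same network path the kubelet uses.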

If disk pressure blocks extraction

Free space on nodefs and imagefs. Remove unused images with crictl rmi. Warning: This deletes images globally on the node; ensure no other workload needs them.

Expand the node disk if image storage consistently exceeds 80%. Configure container log rotation with containerLogMaxSize and containerLogMaxFiles to prevent /var/log/pods from filling the filesystem.
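The log-rotation settings live in the kubelet configuration file (KubeletConfiguration); a minimal fragment with illustrative values:

```yaml
# Fragment of the kubelet config file (often /var/lib/kubelet/config.yaml).
# Values are illustrative; tune them to your log volume and disk size.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi
containerLogMaxFiles: 5
```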

If the registry is rate limiting or down

Authenticate to raise pull limits. Mirror critical images to a private registry. If pulls time out because of image size, increase available bandwidth or use a closer registry mirror.

If serialized pulls stall the node

When serializeImagePulls is true, the kubelet pulls images one at a time. A single large image can block every other pod on the node. You can set serializeImagePulls: false in the kubelet configuration to enable parallel pulls. Warning: This requires restarting the kubelet, which briefly disrupts pods on the node. It also increases network bandwidth and disk I/O contention.
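A kubelet configuration fragment for parallel pulls; maxParallelImagePulls (available only in newer Kubernetes releases) caps concurrency so a single node cannot saturate the registry or its own disk:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false
# Optional cap on concurrent pulls; requires a Kubernetes version that
# supports maxParallelImagePulls. The value here is illustrative.
maxParallelImagePulls: 5
```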

Prevention

Pin images by digest or an immutable tag in production. Pre-pull images onto nodes during provisioning, or run a local registry mirror to reduce the external dependency. Monitor node disk usage trends and configure container log rotation before nodefs fills. Validate imagePullSecrets and pull access from a test pod in CI before deployment, and rotate registry credentials before they expire. Set imagePullPolicy: IfNotPresent where appropriate to avoid unnecessary registry round-trips, but pair it with explicit tags rather than latest. If you disable serializeImagePulls, isolate large-image workloads with node affinity or taints to limit I/O contention.

How Netdata helps

  • Correlate kubelet_runtime_operations_errors_total spikes with node-level disk and network saturation to distinguish registry slowdown from node resource exhaustion.
  • Alert on DiskPressure conditions and nodefs usage trends before they block image extraction.
  • Track pod status transitions to surface pods entering ImagePullBackOff within seconds of the first failure.
  • Monitor CRI operation latency alongside kubelet sync loop duration to detect when serialized image pulls are stalling the node.