Kubernetes pod ImagePullBackOff: registry, auth, and network diagnosis
ImagePullBackOff means the kubelet cannot pull a required image. After each ErrImagePull failure, the kubelet retries with exponential backoff capped at five minutes. When serializeImagePulls is true, a single slow pull blocks every subsequent pull on that node. Read the exact error from the CRI in pod events, test the registry directly from the node, and fix the root cause without blindly recreating pods.
What this means
The kubelet asks the container runtime to pull any image not cached locally. The runtime resolves the registry, authenticates, downloads layers, and unpacks them into node storage. A failure at any step returns a CRI error that the kubelet surfaces as a pod event. kubectl get pod shows only the state; the reason lives in the events.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Invalid image or tag | Event says not found or manifest unknown | Events in kubectl describe pod |
| Missing registry credentials | Event says unauthorized or authentication required | imagePullSecrets on pod or service account |
| Network or DNS failure | Event says dial tcp: i/o timeout or no such host | DNS and TCP path from node to registry |
| Node disk pressure | Event says no space left on device | df -h and DiskPressure condition |
| Registry rate limiting | Event says 429 Too Many Requests or toomanyrequests | Registry status and pull error rate |
| Runtime not responding | Event contains rpc error: code = Unknown desc = ... | crictl info and runtime socket health |
Quick checks
# Find pods in ImagePullBackOff or ErrImagePull
kubectl get pods -A --field-selector status.phase=Pending -o json | \
jq '.items[] | select(any(.status.containerStatuses[]?; .state.waiting.reason == "ImagePullBackOff" or .state.waiting.reason == "ErrImagePull")) | {namespace: .metadata.namespace, name: .metadata.name}'
# Inspect pod events for the exact CRI error
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}
# Find recent pull failures across the cluster
kubectl get events -A --field-selector reason=Failed | grep -iE "pull|image"
# Check pull secrets attached to the pod
kubectl get pod ${POD_NAME} -o jsonpath='{.spec.imagePullSecrets[*].name}'
# Check pull secrets attached to the service account
kubectl get serviceaccount ${SA_NAME} -n ${NAMESPACE} -o jsonpath='{.imagePullSecrets[*].name}'
# Verify secret type; must be kubernetes.io/dockerconfigjson
kubectl get secret ${SECRET_NAME} -o jsonpath='{.type}'
# Inspect the registry and credentials stored in the secret
kubectl get secret ${SECRET_NAME} -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths'
# Reproduce the pull directly on the affected node
crictl pull ${IMAGE_REFERENCE}
# Pull kubelet metrics via the API server (substitute node name)
kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/metrics" | \
grep 'kubelet_runtime_operations_errors_total.*pull_image'
# Check node conditions
kubectl get node ${NODE_NAME} -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
# Check node filesystem utilization (paths vary by runtime config)
df -h /var/lib/kubelet /var/lib/containerd /var/lib/docker
# Recent kubelet logs for pull activity
journalctl -u kubelet --since "30 minutes ago" --no-pager | grep -iE "pulling image|pulled|error"
How to diagnose it
- Read the pod event message. Run `kubectl describe pod`. The event text from the CRI is the primary signal. Look for `not found`, `unauthorized`, `timeout`, `certificate signed by unknown authority`, `no space left on device`, or `rpc error`.
- Verify the image reference. A typo or deleted tag produces `manifest unknown`. Test directly on the node with `crictl pull`. If the node succeeds but the pod fails, suspect `imagePullSecrets` or a node-specific network issue.
- Check authentication for private registries. Confirm `imagePullSecrets` is set on the pod or on its service account. Verify the secret type is `kubernetes.io/dockerconfigjson`; an `Opaque` secret will not work. Ensure the registry server string inside the secret matches the registry hostname exactly, including any port. Existing pods must be recreated to pick up a new service account secret.
- Test the network path from the node. Resolve the registry hostname with `nslookup` or `dig`, then open a TCP connection with `nc -zv ${REGISTRY_HOST} 443`. If the node is in a private subnet without external egress, public registries are unreachable without NAT, VPC endpoints, or a pull-through mirror.
- Inspect node disk space and pressure. Run `df -h` against nodefs and imagefs. Check the node for `DiskPressure`. If utilization is high or the condition is True, the runtime cannot unpack new layers. Use `crictl images` to identify large or unused images. If imagefs is a separate filesystem, ensure both it and nodefs have free space. Clean up logs, unused images, or emptyDir data.
- Review kubelet and runtime metrics. Check `kubelet_image_pull_duration_seconds` and `kubelet_runtime_operations_errors_total{operation_type="pull_image"}`. A spike in duration points to registry or network degradation. A spike in errors points to auth failures or missing images.
- Determine the scope. If every pod on a single node fails, suspect node disk, network, or runtime health. If pods across the cluster fail for the same image, suspect a registry outage, expired credentials, or a deleted tag.
- Reset backoff after the fix. ImagePullBackOff waits up to five minutes between retries. Deleting the pod (or letting the controller recreate it) resets the backoff immediately once the root cause is resolved.
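The scope check above can be scripted. A minimal sketch, assuming `jq` is installed on the machine running `kubectl`; the node names in the output come from your cluster:

```shell
# Group image-pull failures by node: one failing node suggests a local
# disk/network/runtime problem; many nodes suggest a registry-side problem.
kubectl get pods -A -o json | jq -r '
  [ .items[]
    | { node: .spec.nodeName,
        reason: (.status.containerStatuses[]?.state.waiting.reason // empty) }
    | select(.reason == "ImagePullBackOff" or .reason == "ErrImagePull") ]
  | group_by(.node)
  | .[]
  | "\(.[0].node): \(length) pod(s) failing pulls"'
```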
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| `kubelet_runtime_operations_errors_total{operation_type="pull_image"}` | Counts pull failures at the CRI layer | Sustained increase above baseline |
| `kubelet_image_pull_duration_seconds` | Measures registry and network performance | p99 exceeds baseline for image size |
| Node DiskPressure condition | Image extraction needs writable disk | Condition is True |
| `kubelet_evictions_total` with disk signal | Node is critically low on space | Any disk-triggered eviction event |
| Pending pods with waiting reason | User-visible impact | Pods in ErrImagePull or ImagePullBackOff for more than 5 minutes |
| Container runtime operation latency | Slow runtime can stall pulls | crictl commands hang or time out |
Fixes
If the image reference is wrong
Update the workload spec with the correct tag or digest. Push the image if it is missing from the registry. Avoid mutable tags like latest if you need reproducible deployments.
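To pin by digest, you can resolve the tag's digest on a node that already has the image and point the workload at it. A sketch: the deployment and container names are placeholders, and `crictl inspecti` output layout can vary by runtime version.

```shell
# Resolve the immutable digest reference for a cached image (run on a node).
DIGEST_REF=$(crictl inspecti --output json "${IMAGE_REFERENCE}" \
  | jq -r '.status.repoDigests[0]')

# Pin the workload to the digest instead of the mutable tag.
kubectl set image "deployment/${DEPLOYMENT}" "${CONTAINER}=${DIGEST_REF}"
```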
If authentication fails
Create a secret of type kubernetes.io/dockerconfigjson and attach it to the pod imagePullSecrets or to the default service account in the namespace. Existing pods must be recreated to pick up a new service account secret. Ensure the registry server string in the .dockerconfigjson matches the registry hostname exactly. On managed clusters, verify the node identity or IAM role has registry read permissions.
Inspect the secret content to confirm the registry and credentials:
kubectl get secret ${SECRET_NAME} -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths'
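A sketch of the full fix; the secret name `regcred`, the registry host, and the app label are placeholders for your values:

```shell
# Create a kubernetes.io/dockerconfigjson secret; --docker-server must match
# the registry hostname used in the image reference, including any port.
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username="${REGISTRY_USER}" \
  --docker-password="${REGISTRY_PASSWORD}" \
  -n "${NAMESPACE}"

# Attach it to the default service account so new pods inherit it.
kubectl patch serviceaccount default -n "${NAMESPACE}" \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'

# Recreate affected pods; their controller brings them back with the secret.
kubectl delete pod -n "${NAMESPACE}" -l app="${APP_LABEL}"
```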
If the network or DNS is unreachable
Fix DNS resolution on the node or cluster DNS. Open egress to the registry endpoint. For private nodes, configure NAT, VPC endpoints, or a pull-through registry mirror inside the network.
Test from the node:
nslookup ${REGISTRY_HOST}
nc -zv ${REGISTRY_HOST} 443
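If TCP connects but pulls still fail with certificate errors, inspecting the served certificate from the node can help. A sketch, assuming `openssl` is installed on the node:

```shell
# Show the registry certificate's issuer and validity window. A private CA
# issuer here means the node's runtime needs that CA in its trust store;
# expired dates explain "certificate signed by unknown authority" style errors.
echo | openssl s_client -connect "${REGISTRY_HOST}:443" \
    -servername "${REGISTRY_HOST}" 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```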
If disk pressure blocks extraction
Free space on nodefs and imagefs. Remove unused images with crictl rmi. Warning: This deletes images globally on the node; ensure no other workload needs them.
Expand the node disk if image storage consistently exceeds 80%. Configure container log rotation with containerLogMaxSize and containerLogMaxFiles to prevent /var/log/pods from filling the filesystem.
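Both the log-rotation knobs and image garbage collection live in the kubelet configuration. A fragment sketch with illustrative thresholds; tune them to your node size:

```yaml
# KubeletConfiguration fragment: rotate container logs and let image GC
# reclaim imagefs before DiskPressure triggers. Values are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 3
imageGCHighThresholdPercent: 80   # start GC above 80% imagefs usage
imageGCLowThresholdPercent: 70    # GC down to 70%
```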
If the registry is rate limiting or down
Authenticate to raise pull limits. Mirror critical images to a private registry. If pulls time out due to image size, increase bandwidth or use a closer registry mirror.
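On containerd-based nodes, a pull-through mirror can be configured per upstream registry. A fragment sketch for /etc/containerd/config.toml using the older inline mirrors form (newer containerd releases prefer per-host hosts.toml files via config_path); mirror.internal.example is a placeholder:

```toml
# Try the internal mirror first, fall back to the upstream registry.
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://mirror.internal.example", "https://registry-1.docker.io"]
```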
If serialized pulls stall the node
When serializeImagePulls is true, the kubelet pulls images one at a time. A single large image can block every other pod on the node. You can set serializeImagePulls: false in the kubelet configuration to enable parallel pulls. Warning: This requires restarting the kubelet, which briefly disrupts pods on the node. It also increases network bandwidth and disk I/O contention.
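A kubelet configuration fragment sketch; maxParallelImagePulls is available in newer Kubernetes releases (1.27+) and bounds the concurrency that parallel pulls introduce:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false
# Bound parallelism so one node does not saturate disk I/O or the registry.
maxParallelImagePulls: 3
```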
Prevention
Pin images by digest or immutable tag in production. Pre-pull images onto nodes during provisioning, or use a local registry mirror to reduce external dependency. Monitor node disk usage trends and configure container log rotation before nodefs fills. Validate imagePullSecrets and registry credentials in CI before deployment. Set imagePullPolicy: IfNotPresent where appropriate to avoid unnecessary registry round-trips, but pair it with explicit tags rather than latest. Rotate registry credentials before expiry and validate pull access from a test pod in CI. If you disable serializeImagePulls, isolate large-image workloads with node affinity or taints to limit I/O contention.
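Pre-pulling can be implemented with a small DaemonSet that pulls the critical image on every node. A sketch; the image name and DaemonSet name are placeholders, and the init container assumes the image provides a `true` binary:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-critical
spec:
  selector:
    matchLabels:
      app: prepull-critical
  template:
    metadata:
      labels:
        app: prepull-critical
    spec:
      initContainers:
        # The pull is the point: the init container exits immediately,
        # leaving the image cached on the node.
        - name: prepull
          image: registry.example.com/app:v1.2.3
          command: ["true"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
```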
How Netdata helps
- Correlate `kubelet_runtime_operations_errors_total` spikes with node-level disk and network saturation to distinguish registry slowdown from node resource exhaustion.
- Alert on DiskPressure conditions and nodefs usage trends before they block image extraction.
- Track pod status transitions to surface pods entering ImagePullBackOff within seconds of the first failure.
- Monitor CRI operation latency alongside kubelet sync loop duration to detect when serialized image pulls are stalling the node.