Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas

A Deployment rollout that stalls is a silent capacity leak. The old ReplicaSet scales down, the new ReplicaSet stops halfway, and kubectl rollout status blocks indefinitely. Kubernetes does not automatically recover. The controller sets ProgressDeadlineExceeded only after progressDeadlineSeconds elapses, and takes no corrective action. You need to distinguish between a Pod lifecycle blockage, a readiness probe or gate failure, and a rare controller bug that freezes the rollout entirely.

A Pod can be running and passing liveness probes but remain unready because of a missing readiness gate or a failing readiness probe. The controller tracks this, but will not fix it. This guide shows how to isolate the cause.

What this means

A Deployment rollout creates a new ReplicaSet and shifts replicas from the old one. A Pod counts toward readyReplicas only when its Ready condition is True, which requires every container to pass its readiness probe and every condition type listed in .spec.readinessGates to report True in .status.conditions. Pods that are terminating (kubectl shows them as Terminating once a deletion timestamp is set; it is not a value of .status.phase) are no longer counted in availableReplicas, but they continue to consume node resources until fully removed.

progressDeadlineSeconds (default 600s) is the controller’s patience timer. If the rollout does not make progress within this window, the Progressing condition becomes False with reason ProgressDeadlineExceeded. Kubernetes does not roll back or restart Pods automatically. If the Deployment is paused, the deadline is not evaluated. After a rollout completes, the condition stays True with reason NewReplicaSetAvailable indefinitely, even if ready replicas later crash or become unschedulable. The controller will not fire ProgressDeadlineExceeded for post-rollout replica shortfall.
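
The condition states described above can be turned into a small triage helper. The following is a minimal sketch, not part of kubectl: the function name classify_progressing and its output strings are hypothetical, and the DeploymentPaused reason used for the paused case is an assumption about how the controller labels a paused Deployment.

```shell
# classify_progressing STATUS REASON
# Maps the Deployment's Progressing condition to a short verdict.
# Hypothetical helper: the name and messages are illustrative only.
classify_progressing() {
  status="${1:-}"
  reason="${2:-}"
  case "$status/$reason" in
    # Assumption: a paused Deployment reports reason DeploymentPaused.
    */DeploymentPaused)             echo "paused: deadline is not evaluated" ;;
    False/ProgressDeadlineExceeded) echo "stalled: deadline exceeded, no automatic rollback" ;;
    True/NewReplicaSetAvailable)    echo "complete: later replica loss will not change this condition" ;;
    True/*)                         echo "in progress: deadline not yet exceeded" ;;
    *)                              echo "unknown: inspect the Deployment status directly" ;;
  esac
}

# Possible usage against a live cluster (placeholders as elsewhere in this guide):
#   s=$(kubectl get deployment <name> -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}')
#   r=$(kubectl get deployment <name> -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}')
#   classify_progressing "$s" "$r"
```

The "complete" verdict is the one to treat with care: it only says the rollout finished at some point, not that the replicas are healthy now.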

Common causes

Cause	What it looks like	First thing to check
Readiness probe failure or misconfiguration	New Pods are `Running` but `Ready=False`; events show probe failures	`kubectl describe pod` for probe events and `lastState`
Missing readiness gate condition	`ContainersReady=True`, `Ready=False`; no matching custom condition in Pod status	Pod spec for `readinessGates` and status for corresponding condition types
Resource exhaustion or scheduling block	New Pods stuck in `Pending`; no nodes pass predicates	`kubectl describe pod` events and node allocatable resources
Stale ReplicaSet annotation during scale (maxSurge=0)	Deployment frozen mid-rollout; no error events	Active ReplicaSet annotation `deployment.kubernetes.io/desired-replicas` vs Deployment `spec.replicas`
Restrictive maxUnavailable / maxSurge	Rollout advances one Pod at a time or halts entirely	`kubectl get deployment -o jsonpath='{.spec.strategy.rollingUpdate}'`
Post-rollout replica loss	Rollout previously succeeded; `availableReplicas` dropped below `spec.replicas`	`kubectl get deployment` for `availableReplicas` and Pod status

Quick checks

# Check Deployment conditions (type, status, reason)
kubectl get deployment <name> -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'

# Check ready, updated, and unavailable replica counts
kubectl get deployment <name>

# Check Pod readiness and phase for the new ReplicaSet
kubectl get pods -l pod-template-hash=<new-hash> -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,READY:.status.conditions[?(@.type=="Ready")].status'

# Check for readiness gate conditions
kubectl get pod <pod> -o json | jq '.status.conditions[] | {type: .type, status: .status}'

# Check ReplicaSet desired-replicas annotation against Deployment spec
kubectl get rs -l app=<label> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.deployment\.kubernetes\.io/desired-replicas}{"\t"}{.spec.replicas}{"\n"}{end}'
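# Columns: ReplicaSet name, desired-replicas annotation, the ReplicaSet's own spec.replicas.
# Compare the annotation with the Deployment's spec.replicas; a mismatch there indicates the stale-annotation bug.
# The ReplicaSet's own spec.replicas legitimately differs from the annotation mid-rollout.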

# Check events (most rollout events are on Pods and ReplicaSets, not the Deployment)
kubectl get events --field-selector involvedObject.name=<pod-or-rs-name> --sort-by='.lastTimestamp'

# Verify rollout strategy limits
kubectl get deployment <name> -o jsonpath='{"maxUnavailable: "}{.spec.strategy.rollingUpdate.maxUnavailable}{" maxSurge: "}{.spec.strategy.rollingUpdate.maxSurge}{"\n"}'

How to diagnose it

Confirm the stall pattern. Run kubectl get deployment <name>. If readyReplicas is below spec.replicas and updatedReplicas is not increasing, the rollout is stalled. If unavailableReplicas is non-zero, Pods are failing to become ready.
Check the Progressing condition. Run kubectl get deployment <name> -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}'. If status is False and reason is ProgressDeadlineExceeded, the controller has marked the rollout as stuck. If the condition is still True, the deadline has not yet elapsed.
Inspect the new ReplicaSet’s Pods. Identify the new pod-template-hash and list those Pods. If they are Pending, the issue is scheduling or image pulling. If they are Running but not Ready, the issue is probes, gates, or container startup.
Differentiate container readiness from Pod readiness. Run kubectl get pod <pod> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' to list each condition with its status. If ContainersReady is True but Ready is False, examine .spec.readinessGates. Then check .status.conditions for the matching gate type. If the gate condition is absent, the Pod will never become ready.
Evaluate readiness probes. In kubectl describe pod, look for Unhealthy events with Readiness probe failed. Verify that the probe port, path, and scheme match the listening interface inside the container. If the application has a slow startup, initialDelaySeconds or a startupProbe may be needed.
Check for the stale-annotation bug. If the Deployment uses maxSurge=0 and was scaled during the rollout, compare the active ReplicaSet’s deployment.kubernetes.io/desired-replicas annotation with deployment.spec.replicas. If they differ, the controller is in an infinite loop and the rollout is frozen.
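Validate strategy arithmetic. When given as percentages, maxUnavailable rounds down and maxSurge rounds up. The API rejects a strategy that explicitly sets both to zero. If maxSurge is zero and a percentage maxUnavailable rounds down to zero, the controller is expected to treat maxUnavailable as 1, so the rollout proceeds one Pod at a time instead of blocking. A rollout with maxUnavailable: 0 can still halt when the surge Pods never become ready, because the controller may not remove old Pods and has no surge headroom left. Ensure the strategy allows movement at your replica count.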
Distinguish rollout stalls from post-rollout decay. If updatedReplicas equals spec.replicas and the Progressing condition shows NewReplicaSetAvailable, the rollout is complete. If availableReplicas then drops, this is a workload health or node problem, not a rollout stall. The controller will not surface ProgressDeadlineExceeded for this.
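
The rounding rule in the strategy arithmetic step can be checked before a rollout. This is a minimal sketch that assumes both limits are given as whole-number percentages; resolve_percent and strategy_headroom are hypothetical helper names, and the sketch only reports the resolved values rather than predicting what the controller does with them.

```shell
# resolve_percent REPLICAS PERCENT MODE
# MODE "up" rounds up (as for maxSurge); anything else rounds down (as for maxUnavailable).
resolve_percent() {
  product=$(( $1 * $2 ))
  if [ "${3:-down}" = "up" ]; then
    echo $(( (product + 99) / 100 ))   # integer ceiling of product/100
  else
    echo $(( product / 100 ))          # integer floor of product/100
  fi
}

# strategy_headroom REPLICAS SURGE_PERCENT UNAVAILABLE_PERCENT
# Prints the resolved limits and flags the case where both come out as zero.
strategy_headroom() {
  surge=$(resolve_percent "$1" "$2" up)
  unavailable=$(resolve_percent "$1" "$3" down)
  echo "maxSurge=$surge maxUnavailable=$unavailable"
  if [ "$surge" -eq 0 ] && [ "$unavailable" -eq 0 ]; then
    echo "warning: both limits resolve to zero; review the strategy"
  fi
}

# Example: 10 replicas with 25% / 25% resolves to maxSurge=3 maxUnavailable=2.
#   strategy_headroom 10 25 25
```

Running it for the smallest replica count you expect in production is usually the most informative case, since that is where rounding down bites first.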

flowchart TD
    A[Deployment readyReplicas < spec.replicas] --> B{Pod phase?}
    B -->|Pending| C[Check scheduling, resources, PVC]
    B -->|Running, not Ready| D{ContainersReady?}
    D -->|False| E[Check readiness probes, crashes]
    D -->|True| F[Check readiness gates]
    B -->|Terminating| G[Wait or check terminationGracePeriodSeconds]
    C --> H[Fix capacity, taints, or image pull]
    E --> I[Fix probe config or app health]
    F --> J[Restore external controller or remove gate]
    G --> K[Check maxSurge=0 stale annotation bug]
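
The flowchart above can also be expressed as a shell function for use in scripts. This is a sketch under stated assumptions: next_check is a hypothetical name, and because Terminating is not a value of .status.phase, the sketch takes the Pod's metadata.deletionTimestamp as a fourth argument and treats any non-empty value as terminating.

```shell
# next_check PHASE CONTAINERS_READY READY [DELETION_TIMESTAMP]
# Returns the next thing to inspect for one Pod, mirroring the flowchart branches.
next_check() {
  phase="${1:-}"
  containers_ready="${2:-}"
  ready="${3:-}"
  deletion_timestamp="${4:-}"
  if [ -n "$deletion_timestamp" ]; then
    echo "terminating: wait or check terminationGracePeriodSeconds"
  elif [ "$phase" = "Pending" ]; then
    echo "pending: check scheduling, resources, PVC, image pull"
  elif [ "$phase" != "Running" ]; then
    echo "other: inspect the Pod with kubectl describe pod"
  elif [ "$ready" = "True" ]; then
    echo "ready: no action needed for this Pod"
  elif [ "$containers_ready" != "True" ]; then
    echo "containers not ready: check readiness probes and crashes"
  else
    echo "gated: check readiness gates"
  fi
}

# Possible usage (placeholders as elsewhere in this guide):
#   next_check \
#     "$(kubectl get pod <pod> -o jsonpath='{.status.phase}')" \
#     "$(kubectl get pod <pod> -o jsonpath='{.status.conditions[?(@.type=="ContainersReady")].status}')" \
#     "$(kubectl get pod <pod> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" \
#     "$(kubectl get pod <pod> -o jsonpath='{.metadata.deletionTimestamp}')"
```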

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`status.readyReplicas` vs `spec.replicas`	Direct measure of rollout completion	Gap persists longer than normal startup time
`status.availableReplicas` vs `spec.replicas`	Tracks Pods that are ready for minReadySeconds	Drop after rollout completes indicates silent degradation
`Progressing` condition reason and status	Only automatic controller signal for stalled rollouts	`ProgressDeadlineExceeded` or condition stuck at `ReplicaSetUpdated`
Pod `Ready` false with `ContainersReady` true	Indicates readiness gate blockage	Any Pod in this state for more than 2 minutes
Container restart count	Repeated restarts prevent readiness	Restart count increasing across new ReplicaSet Pods
Pod phase `Pending`	Scheduling, image, or volume stall	Pending duration exceeding 5 minutes
kube-controller-manager `workqueue_depth`	Reconciliation backlog in the controller	Depth increasing while rollout is active
Node `MemoryPressure` or `DiskPressure`	Pressure evictions kill new Pods before they become ready	Pressure condition true on nodes hosting new Pods

Fixes

If the cause is readiness probe misconfiguration

Edit the container spec to correct the probe endpoint, increase timeoutSeconds, or add a startupProbe to cover slow initialization. If the application startup is legitimately longer than progressDeadlineSeconds, increase progressDeadlineSeconds in the Deployment spec.
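
When sizing a startupProbe, the useful number is the startup budget, which is roughly failureThreshold multiplied by periodSeconds. The sketch below compares that budget and progressDeadlineSeconds against a measured cold-start time. check_startup_budget is a hypothetical helper, all inputs are in seconds, and it ignores initialDelaySeconds for simplicity.

```shell
# check_startup_budget COLD_START FAILURE_THRESHOLD PERIOD_SECONDS PROGRESS_DEADLINE
# All values are whole seconds (FAILURE_THRESHOLD is a count).
check_startup_budget() {
  cold_start="$1"
  failure_threshold="$2"
  period="$3"
  deadline="$4"
  # Approximate time the startupProbe allows before the container is restarted.
  budget=$(( failure_threshold * period ))
  if [ "$cold_start" -gt "$budget" ]; then
    echo "startupProbe too short: budget ${budget}s < cold start ${cold_start}s"
  elif [ "$cold_start" -ge "$deadline" ]; then
    echo "raise progressDeadlineSeconds: cold start ${cold_start}s >= deadline ${deadline}s"
  else
    echo "ok: budget ${budget}s covers cold start ${cold_start}s"
  fi
}

# Example: a 120s cold start, failureThreshold 30, periodSeconds 10, default 600s deadline.
#   check_startup_budget 120 30 10 600
```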

If the cause is a missing readiness gate condition

Identify the external controller that writes the condition (for example, an ingress controller or service mesh). If that controller is down, restore it. If the gate is not essential, remove it from the Pod template as a workaround.
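
To see which gate is blocking, it can help to list the gates whose condition is absent or not True. The following is a minimal sketch: missing_gates is a hypothetical helper that takes two space-separated strings, and it assumes condition types contain no spaces, equals signs, or shell glob characters. The gate names in the tests and examples are made up.

```shell
# missing_gates "GATE_TYPE ..." "TYPE=STATUS ..."
# Prints one line per readiness gate that is absent from the conditions or not True.
missing_gates() {
  gates="${1:-}"
  conditions="${2:-}"
  for gate in $gates; do
    state="absent"
    for pair in $conditions; do
      # Split each TYPE=STATUS pair on the first "=".
      if [ "${pair%%=*}" = "$gate" ]; then
        state="${pair#*=}"
      fi
    done
    if [ "$state" != "True" ]; then
      echo "$gate: $state"
    fi
  done
}

# Possible usage (placeholders as elsewhere in this guide):
#   gates=$(kubectl get pod <pod> -o jsonpath='{.spec.readinessGates[*].conditionType}')
#   conds=$(kubectl get pod <pod> -o jsonpath='{range .status.conditions[*]}{.type}={.status} {end}')
#   missing_gates "$gates" "$conds"
```

An "absent" result points at the external controller never having written the condition, while "False" means it is running and actively reporting the Pod as not ready.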

If the cause is resource or scheduling pressure

Add node capacity, reduce resource requests, or resolve taints that block the new Pods. If the stall is due to an unschedulable Pod, kubectl describe pod will show the specific predicate failure.

If the cause is the stale-annotation bug (maxSurge=0)

On affected Kubernetes versions, if scaling during a rolling update with maxSurge=0 triggers the infinite loop, work around it by forcing the annotation to match: scale the Deployment to a different value and back, or patch the ReplicaSet annotation directly.
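
The annotation comparison and the direct patch can be sketched as follows. annotation_fix is a hypothetical helper that only prints a suggested kubectl annotate command and never runs it; whether the controller keeps the patched value is not guaranteed, so review the command before applying it.

```shell
# annotation_fix RS_NAME ANNOTATION_VALUE DEPLOYMENT_REPLICAS
# Compares the ReplicaSet's desired-replicas annotation with the Deployment's
# spec.replicas and prints a suggested command when they differ.
annotation_fix() {
  rs_name="$1"
  annotation_value="${2:-}"
  deployment_replicas="$3"
  if [ "$annotation_value" = "$deployment_replicas" ]; then
    echo "ok: annotation matches spec.replicas ($deployment_replicas)"
  else
    echo "kubectl annotate rs $rs_name deployment.kubernetes.io/desired-replicas=$deployment_replicas --overwrite"
  fi
}

# Possible usage (placeholders as elsewhere in this guide):
#   want=$(kubectl get deployment <name> -o jsonpath='{.spec.replicas}')
#   have=$(kubectl get rs <rs-name> -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/desired-replicas}')
#   annotation_fix <rs-name> "$have" "$want"
```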

If the cause is a too-restrictive strategy

Adjust maxUnavailable or maxSurge so that at least one of them is non-zero. For a single-replica Deployment, maxUnavailable: 0 and maxSurge: 1 is a common safe pattern. For larger Deployments, ensure maxUnavailable does not round down to zero unless maxSurge compensates.

If the cause is post-rollout replica loss

This is not fixed by rollout parameters. Investigate Pod crashes, node evictions, or CSI volume failures. Cordon failing nodes and trigger a new rollout only after the underlying issue is resolved.

Prevention

Set progressDeadlineSeconds deliberately, slightly above your application’s known cold-start time, instead of accepting the 600s default when it is too short or too long for the workload. Validate readiness probes in a staging environment before promotion. Monitor the kube-controller-manager workqueue_depth to detect reconciliation lag before it shows up as stalled replicas. If you must use maxSurge=0 on a Kubernetes version that predates the fix for the stale-annotation issue, avoid scaling the Deployment during an active rollout. Set explicit resource requests and limits to prevent scheduling stalls and evictions. If you use readiness gates, separately monitor the health of the external controllers that maintain those conditions. Alert on availableReplicas dropping below spec.replicas independently of rollout status, because Kubernetes does not re-fire ProgressDeadlineExceeded after a rollout finishes.
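
The last recommendation, alerting on available replicas regardless of rollout status, can be sketched as a check suitable for a cron job or a monitoring script. replica_alert is a hypothetical helper. It treats an empty availableReplicas as zero, on the assumption that the field may be omitted from the Deployment status when no replicas are available.

```shell
# replica_alert DESIRED AVAILABLE PROGRESSING_REASON
# Prints "ok" or an alert line that distinguishes post-rollout loss from a rollout shortfall.
replica_alert() {
  desired="$1"
  available="${2:-0}"   # empty or missing is treated as zero
  reason="${3:-}"
  if [ "$available" -ge "$desired" ]; then
    echo "ok"
  elif [ "$reason" = "NewReplicaSetAvailable" ]; then
    echo "alert: post-rollout replica loss ($available/$desired available)"
  else
    echo "alert: rollout shortfall ($available/$desired available, reason=$reason)"
  fi
}

# Possible usage (placeholders as elsewhere in this guide):
#   replica_alert \
#     "$(kubectl get deployment <name> -o jsonpath='{.spec.replicas}')" \
#     "$(kubectl get deployment <name> -o jsonpath='{.status.availableReplicas}')" \
#     "$(kubectl get deployment <name> -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}')"
```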

How Netdata helps

Correlates node-level CPU, memory, and disk pressure with Pod scheduling failures and evictions that stall rollouts.
Surfaces container restart loops and OOM kills that prevent new replicas from reaching the ready state.
Tracks network latency and conntrack utilization to identify infrastructure-level causes of readiness probe timeouts.
Brings together control-plane signals like API server latency and controller workqueue depth so you can distinguish a controller backlog from an application-level stall.
Provides per-Pod resource usage to validate whether resource limits are causing startup delays.

Related guides

If API server latency is causing controllers to lag, see Kubernetes API server slow or unresponsive: causes and fixes.
For dropped connections that affect readiness probes, see Kubernetes conntrack exhaustion: dropped connections under load.
If node pressure is evicting new Pods, see Kubernetes eviction cascade: when one node failure takes down the cluster.
For DNS-related readiness failures, see Kubernetes DNS resolution failures inside pods.
If scheduling is the blocker, see Kubernetes DaemonSet pods Pending: scheduling and tolerations.

The Netdata solution

Kubernetes monitoring with Netdata

Netdata monitors Kubernetes with per-second metrics across the control plane, nodes, and every pod, with ML anomaly detection and zero per-pod configuration. Correlate API-server and etcd latency, kubelet PLEG stalls, scheduling pressure, and OOMKills in one place.

