Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas

A Deployment rollout that stalls is a silent capacity leak. The old ReplicaSet scales down, the new ReplicaSet stops halfway, and kubectl rollout status blocks until the progress deadline expires, then exits with an error. Kubernetes does not automatically recover. The controller sets ProgressDeadlineExceeded only after progressDeadlineSeconds elapses, and takes no corrective action. You need to distinguish between a Pod lifecycle blockage, a readiness probe or gate failure, and a rare controller bug that freezes the rollout entirely.

A Pod can be Running and passing liveness probes yet remain unready because a readiness probe is failing or a readiness gate condition was never set. The Deployment controller tracks this, but will not fix it. This guide shows how to isolate the cause.

What this means

A Deployment rollout creates a new ReplicaSet and shifts replicas from the old one. A Pod counts toward readyReplicas only when its Ready condition is True, which requires every container to pass its readiness probe and every condition type listed in .spec.readinessGates to report True in .status.conditions. Terminating Pods (those with a deletionTimestamp set; Terminating is a displayed status, not a Pod phase) are no longer counted in availableReplicas, but they continue to consume node resources until fully removed.

progressDeadlineSeconds (default 600s) is the controller’s patience timer. If the rollout does not make progress within this window, the Progressing condition becomes False with reason ProgressDeadlineExceeded. Kubernetes does not roll back or restart Pods automatically. If the Deployment is paused, the deadline is not evaluated. After a rollout completes, the condition stays True with reason NewReplicaSetAvailable indefinitely, even if ready replicas later crash or become unschedulable. The controller will not fire ProgressDeadlineExceeded for post-rollout replica shortfall.
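As a minimal sketch of how a script might interpret the Progressing condition described above, the function below reads the condition's status and reason with kubectl and maps them to a short label. The function name rollout_state and the labels it prints are illustrative assumptions, not part of kubectl.

```shell
# Hypothetical helper: classify a Deployment by its Progressing condition.
# Usage: rollout_state <deployment> [namespace]
rollout_state() {
  local name="$1" ns="${2:-default}" status reason
  status="$(kubectl get deployment "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}')"
  reason="$(kubectl get deployment "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}')"
  case "$status/$reason" in
    False/ProgressDeadlineExceeded) echo "stalled" ;;      # deadline elapsed without progress
    True/NewReplicaSetAvailable)    echo "complete" ;;     # stays this way even if replicas later fail
    /)                              echo "unknown" ;;      # condition not present
    *)                              echo "in-progress" ;;  # any other status/reason pair
  esac
}
```

A "complete" result says nothing about current replica health, so it should be paired with a check on availableReplicas.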

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Readiness probe failure or misconfiguration | New Pods are Running but Ready=False; events show probe failures | kubectl describe pod for probe events and lastState |
| Missing readiness gate condition | ContainersReady=True, Ready=False; no matching custom condition in Pod status | Pod spec for readinessGates and status for corresponding condition types |
| Resource exhaustion or scheduling block | New Pods stuck in Pending; no nodes pass predicates | kubectl describe pod events and node allocatable resources |
| Stale ReplicaSet annotation during scale (maxSurge=0) | Deployment frozen mid-rollout; no error events | Active ReplicaSet annotation deployment.kubernetes.io/desired-replicas vs Deployment spec.replicas |
| Restrictive maxUnavailable / maxSurge | Rollout advances one Pod at a time or halts entirely | kubectl get deployment -o jsonpath='{.spec.strategy.rollingUpdate}' |
| Post-rollout replica loss | Rollout previously succeeded; availableReplicas dropped below spec.replicas | kubectl get deployment for availableReplicas and Pod status |

Quick checks

# Check Deployment conditions (type, status, reason)
kubectl get deployment <name> -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'

# Check ready, up-to-date, and available replica counts
kubectl get deployment <name>

# Check Pod readiness and phase for the new ReplicaSet
kubectl get pods -l pod-template-hash=<new-hash> -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,READY:.status.conditions[?(@.type=="Ready")].status'

# Check for readiness gate conditions
kubectl get pod <pod> -o json | jq '.status.conditions[] | {type: .type, status: .status}'

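# Check each ReplicaSet's desired-replicas annotation (third column is the ReplicaSet's own spec.replicas)
kubectl get rs -l app=<label> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.deployment\.kubernetes\.io/desired-replicas}{"\t"}{.spec.replicas}{"\n"}{end}'
# Compare the annotation with the Deployment's spec.replicas, not the ReplicaSet's
kubectl get deployment <name> -o jsonpath='{.spec.replicas}{"\n"}'
# An annotation on the active ReplicaSet that differs from the Deployment's spec.replicas indicates the stale-annotation bug.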

# Check events (most rollout events are on Pods and ReplicaSets, not the Deployment)
kubectl get events --field-selector involvedObject.name=<pod-or-rs-name> --sort-by='.lastTimestamp'

# Verify rollout strategy limits
kubectl get deployment <name> -o jsonpath='{"maxUnavailable: "}{.spec.strategy.rollingUpdate.maxUnavailable}{" maxSurge: "}{.spec.strategy.rollingUpdate.maxSurge}{"\n"}'

How to diagnose it

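  1. Confirm the stall pattern. Run kubectl get deployment <name> -o yaml and compare .status with .spec.replicas (the default table output shows the same counts as READY, UP-TO-DATE, and AVAILABLE). If readyReplicas is below spec.replicas and updatedReplicas is not increasing, the rollout is stalled. If unavailableReplicas is non-zero, some Pods have not become available.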

  2. Check the Progressing condition. Run kubectl get deployment <name> -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}'. If status is False and reason is ProgressDeadlineExceeded, the controller has marked the rollout as stuck. If the condition is still True, the deadline has not yet elapsed.

  3. Inspect the new ReplicaSet’s Pods. Identify the new pod-template-hash and list those Pods. If they are Pending, the issue is scheduling or image pulling. If they are Running but not Ready, the issue is probes, gates, or container startup.

  4. Differentiate container readiness from Pod readiness. Run kubectl get pod <pod> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' to list each condition with its status. If ContainersReady is True but Ready is False, examine .spec.readinessGates. Then check .status.conditions for the matching gate type. If the gate condition is absent, the Pod will never become ready.

  5. Evaluate readiness probes. In kubectl describe pod, look for Unhealthy events with Readiness probe failed. Verify that the probe port, path, and scheme match the listening interface inside the container. If the application has a slow startup, initialDelaySeconds or a startupProbe may be needed.

  6. Check for the stale-annotation bug. If the Deployment uses maxSurge=0 and was scaled during the rollout, compare the active ReplicaSet’s deployment.kubernetes.io/desired-replicas annotation with deployment.spec.replicas. If they differ, the controller is in an infinite loop and the rollout is frozen.

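  7. Validate strategy arithmetic. maxUnavailable rounds down when computed as a percentage, and maxSurge rounds up. The API rejects a strategy in which both are explicitly zero, and if a percentage maxUnavailable rounds down to zero while maxSurge is zero, the controller treats maxUnavailable as 1, so the rollout crawls one Pod at a time rather than stopping. When maxUnavailable is zero, every step depends on a surge Pod becoming Ready first, so a new Pod that cannot be scheduled or never passes readiness halts the rollout entirely. Ensure the strategy leaves room to move.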

  8. Distinguish rollout stalls from post-rollout decay. If updatedReplicas equals spec.replicas and the Progressing condition shows NewReplicaSetAvailable, the rollout is complete. If availableReplicas then drops, this is a workload health or node problem, not a rollout stall. The controller will not surface ProgressDeadlineExceeded for this.
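The rounding rules in step 7 can be checked with a few lines of shell arithmetic. This is a sketch only: resolve_value and strategy_budget are hypothetical helper names, and the output shows how the configured values resolve for a given replica count, not what the controller will do next.

```shell
# Hypothetical helper: resolve an integer or percentage against a replica count.
# mode "up" rounds up (maxSurge); mode "down" rounds down (maxUnavailable).
resolve_value() {
  local value="$1" replicas="$2" mode="$3" pct
  case "$value" in
    *%)
      pct="${value%\%}"                          # strip the trailing percent sign
      if [ "$mode" = "up" ]; then
        echo $(( (pct * replicas + 99) / 100 ))  # integer ceiling
      else
        echo $(( pct * replicas / 100 ))         # integer floor
      fi
      ;;
    *) echo "$value" ;;                          # plain integers pass through
  esac
}

# Usage: strategy_budget <replicas> <maxSurge> <maxUnavailable>
strategy_budget() {
  local replicas="$1" surge unavailable
  surge="$(resolve_value "$2" "$replicas" up)"
  unavailable="$(resolve_value "$3" "$replicas" down)"
  if [ "$surge" -eq 0 ] && [ "$unavailable" -eq 0 ]; then
    # Flag the tightest case so it can be reviewed by hand.
    echo "surge=0 unavailable=0 (both resolve to zero)"
  else
    echo "surge=$surge unavailable=$unavailable"
  fi
}
```

For example, strategy_budget 10 25% 25% prints surge=3 unavailable=2, while strategy_budget 3 0 25% reports that both values resolve to zero.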

flowchart TD
    A[Deployment readyReplicas < spec.replicas] --> B{Pod phase?}
    B -->|Pending| C[Check scheduling, resources, PVC]
    B -->|Running, not Ready| D{ContainersReady?}
    D -->|False| E[Check readiness probes, crashes]
    D -->|True| F[Check readiness gates]
    B -->|Terminating| G[Wait or check terminationGracePeriodSeconds]
    C --> H[Fix capacity, taints, or image pull]
    E --> I[Fix probe config or app health]
    F --> J[Restore external controller or remove gate]
    G --> K[Check maxSurge=0 stale annotation bug]
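The first level of the flowchart can also be written as a small lookup, which may be handy inside a triage script. This is a sketch under assumptions: next_check is a hypothetical name, and the caller is expected to have already read the Pod's phase, its ContainersReady status, and whether a deletionTimestamp is set.

```shell
# Hypothetical helper mirroring the flowchart for a Pod that is not Ready.
# Usage: next_check <phase> <containersReady: True|False> [terminating: true|false]
next_check() {
  local phase="$1" containers_ready="$2" terminating="${3:-false}"
  if [ "$terminating" = "true" ]; then
    echo "wait or check terminationGracePeriodSeconds"
  elif [ "$phase" = "Pending" ]; then
    echo "check scheduling, resources, PVC"
  elif [ "$phase" = "Running" ] && [ "$containers_ready" = "True" ]; then
    echo "check readiness gates"
  elif [ "$phase" = "Running" ]; then
    echo "check readiness probes, crashes"
  else
    echo "inspect pod status"   # phases the flowchart does not cover
  fi
}
```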

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| status.readyReplicas vs spec.replicas | Direct measure of rollout completion | Gap persists longer than normal startup time |
| status.availableReplicas vs spec.replicas | Tracks Pods that are ready for minReadySeconds | Drop after rollout completes indicates silent degradation |
| Progressing condition reason and status | Only automatic controller signal for stalled rollouts | ProgressDeadlineExceeded or condition stuck at ReplicaSetUpdated |
| Pod Ready false with ContainersReady true | Indicates readiness gate blockage | Any Pod in this state for more than 2 minutes |
| Container restart count | Repeated restarts prevent readiness | Restart count increasing across new ReplicaSet Pods |
| Pod phase Pending | Scheduling, image, or volume stall | Pending duration exceeding 5 minutes |
| kube-controller-manager workqueue_depth | Reconciliation backlog in the controller | Depth increasing while rollout is active |
| Node MemoryPressure or DiskPressure | Pressure evictions kill new Pods before they become ready | Pressure condition true on nodes hosting new Pods |
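The availableReplicas signal can be polled from a script when no monitoring pipeline is in place yet. The helper below is a minimal sketch; replica_gap is a hypothetical name, and the non-zero exit status on a shortfall is a design choice made here so the function can drive a cron job or CI gate.

```shell
# Hypothetical helper: report the gap between desired and available replicas.
# Returns non-zero when the Deployment is short of replicas.
# Usage: replica_gap <deployment> [namespace]
replica_gap() {
  local name="$1" ns="${2:-default}" desired available
  desired="$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.spec.replicas}')"
  available="$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.status.availableReplicas}')"
  # The status field is omitted when its value is zero, so default empty output to 0.
  desired="${desired:-0}"
  available="${available:-0}"
  echo "desired=$desired available=$available gap=$(( desired - available ))"
  [ "$available" -ge "$desired" ]
}
```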

Fixes

If the cause is readiness probe misconfiguration

Edit the container spec to correct the probe endpoint, increase timeoutSeconds, or add a startupProbe to cover slow initialization. If the application startup is legitimately longer than progressDeadlineSeconds, increase progressDeadlineSeconds in the Deployment spec.

If the cause is a missing readiness gate condition

Identify the external controller that writes the condition (for example, an ingress controller or service mesh). If that controller is down, restore it. If the gate is not essential, remove it from the Pod template as a workaround.

If the cause is resource or scheduling pressure

Add node capacity, reduce resource requests, or resolve taints that block the new Pods. If the stall is due to an unschedulable Pod, kubectl describe pod will show the specific predicate failure.

If the cause is the stale-annotation bug (maxSurge=0)

On affected Kubernetes versions, if scaling during a rolling update with maxSurge=0 triggers the infinite loop, work around it by forcing the annotation to match: scale the Deployment to a different value and back, or patch the ReplicaSet annotation directly.
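Patching the annotation directly might look like the sketch below. It assumes you have already identified the active ReplicaSet by name; sync_desired_replicas is a hypothetical helper, and because the Deployment controller manages this annotation itself, the change should be treated as a temporary nudge and verified afterwards.

```shell
# Hypothetical helper: overwrite a ReplicaSet's desired-replicas annotation
# with the Deployment's current spec.replicas.
# Usage: sync_desired_replicas <deployment> <replicaset> [namespace]
sync_desired_replicas() {
  local deploy="$1" rs="$2" ns="${3:-default}" desired
  desired="$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}')"
  [ -n "$desired" ] || { echo "could not read spec.replicas for $deploy" >&2; return 1; }
  kubectl annotate rs "$rs" -n "$ns" --overwrite \
    "deployment.kubernetes.io/desired-replicas=$desired"
}
```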

If the cause is a too-restrictive strategy

Adjust maxUnavailable or maxSurge so the rollout has room to move; the API requires at least one of them to be non-zero. For a single-replica Deployment, maxUnavailable: 0 and maxSurge: 1 is a common safe pattern. For larger Deployments, remember that a percentage maxUnavailable rounds down, so when it resolves to zero make sure maxSurge compensates and the cluster has capacity for the surge Pods.
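One way to apply such a change without editing the manifest is a merge patch. The wrapper below is a sketch; set_rolling_strategy is a hypothetical name, and it accepts integer values only, since percentage values would need to be quoted as JSON strings.

```shell
# Hypothetical helper: set integer maxSurge and maxUnavailable on a Deployment.
# Usage: set_rolling_strategy <deployment> <maxSurge> <maxUnavailable> [namespace]
set_rolling_strategy() {
  local name="$1" surge="$2" unavailable="$3" ns="${4:-default}"
  kubectl patch deployment "$name" -n "$ns" --type merge -p \
    "{\"spec\":{\"strategy\":{\"rollingUpdate\":{\"maxSurge\":$surge,\"maxUnavailable\":$unavailable}}}}"
}
```

Calling set_rolling_strategy <name> 1 0 would apply the single-replica pattern described above.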

If the cause is post-rollout replica loss

This is not fixed by rollout parameters. Investigate Pod crashes, node evictions, or CSI volume failures. Cordon failing nodes and trigger a new rollout only after the underlying issue is resolved.

Prevention

  • Set progressDeadlineSeconds slightly above your application’s known cold-start time instead of accepting the 600s default when that is too short or too long for your workload.
  • Validate readiness probes in a staging environment before promotion.
  • Monitor the kube-controller-manager workqueue_depth to detect reconciliation lag before it becomes visible as stalled replicas.
  • Avoid scaling Deployments during active rollouts if you must use maxSurge=0 on Kubernetes versions prior to the fix for the stale-annotation issue.
  • Set explicit resource requests and limits to prevent scheduling stalls and evictions.
  • If you use readiness gates, separately monitor the health of the external controllers that maintain those conditions.
  • Alert on availableReplicas dropping below spec.replicas independently of rollout status, because Kubernetes does not re-fire ProgressDeadlineExceeded after a rollout finishes.

How Netdata helps

  • Correlates node-level CPU, memory, and disk pressure with Pod scheduling failures and evictions that stall rollouts.
  • Surfaces container restart loops and OOM kills that prevent new replicas from reaching the ready state.
  • Tracks network latency and conntrack utilization to identify infrastructure-level causes of readiness probe timeouts.
  • Brings together control-plane signals like API server latency and controller workqueue depth so you can distinguish a controller backlog from an application-level stall.
  • Provides per-Pod resource usage to validate whether resource limits are causing startup delays.