
Helm Chart Rollback Failures: Release Stuck in Pending-Rollback and Revision Conflicts

A step-by-step playbook to diagnose and resolve stuck Helm releases, from failed hooks to corrupted release secrets

It’s a scenario that every Kubernetes operator dreads. A production deployment has gone wrong, you confidently initiate a Helm rollback, and then… nothing. The command hangs, and a quick check reveals the dreaded pending-rollback status. Your application is now in a broken state, and you’re blocked from deploying any new fixes. This is more than a minor inconvenience; it’s a critical failure that can leave your services unstable and your deployment pipeline paralyzed.

These rollback failures often stem from a few common causes: Kubernetes resources failing to become ready, misbehaving deployment hooks, or inconsistencies in Helm’s release state data. This is particularly true in Helm 3, where the removal of Tiller shifted release management to a client-side model using Secrets for storage. Understanding how to navigate these issues is essential for maintaining a resilient continuous delivery process. This guide provides a step-by-step playbook for diagnosing why your Helm rollback failed and how to recover your stuck release.

Why Helm Rollbacks Fail: The Common Culprits

When a release is in the pending-rollback state, it means Helm has initiated the rollback process but is now waiting for a specific condition to be met before it can mark the release as deployed. If that condition never gets met, the release remains stuck indefinitely. Let’s break down the most frequent reasons this happens.

The Timeout Trap: Waiting for Ready Resources

One of the most powerful features of Helm is its ability to wait for deployed resources to become healthy. When you use the --wait flag during a rollback, Helm monitors the Kubernetes resources it’s managing. It waits until all Pods, PVCs, Services, and the minimum required replicas of a Deployment or StatefulSet are in a ready state.

The failure point is the --timeout flag, which defaults to 5 minutes. If your application’s pods fail to start correctly within this window, the rollback will fail and get stuck. Common reasons for pod failures include:

  • ImagePullBackOff: Kubernetes can’t pull the container image from the registry.
  • CrashLoopBackOff: The application container is starting and then immediately crashing.
  • Failed Probes: Liveness or readiness probes are failing, preventing Kubernetes from marking the pod as ready.
  • Resource Constraints: The node lacks sufficient CPU or memory to schedule the pod.
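
If any of these conditions persists past the timeout window, a rollback invoked with --wait, as in the sketch below (the release name and revision number are illustrative), will fail and leave the release stuck in pending-rollback:

    # Roll back to revision 3 and wait up to 10 minutes for resources to become ready
    helm rollback my-release 3 --wait --timeout 10m -n my-namespace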

How to Investigate: Your primary tools here are Kubernetes command-line utilities. Start by checking the state of the pods in your release’s namespace. Getting detailed information and viewing the logs from a failing pod will almost always reveal the root cause of the failure.
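
A typical first pass, assuming the release runs in a namespace called my-namespace, looks something like this:

    # List pods and their status; look for ImagePullBackOff, CrashLoopBackOff, or Pending
    kubectl get pods -n my-namespace

    # Inspect scheduling events, probe failures, and resource issues for a failing pod
    kubectl describe pod <pod-name> -n my-namespace

    # View the application logs from the failing pod
    kubectl logs <pod-name> -n my-namespace

    # If the container keeps crashing, inspect the previous instance's logs
    kubectl logs <pod-name> -n my-namespace --previous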

Misbehaving Deployment Hooks

Helm hooks allow you to run jobs or other resources at specific points in a release lifecycle, such as before or after a rollback (pre-rollback, post-rollback). These are often used for database migrations, data cleanup, or other setup tasks.

A rollback can get stuck if a hook-managed Job fails to complete successfully or hangs indefinitely. Helm will wait for the hook to finish before proceeding, and if it never does, the release remains in a pending state.

How to Investigate: Check for any active or failed Jobs in your namespace. Describing a specific job and checking the logs of the pod created by that job can reveal why it might have failed. If a hook is the problem, you can sometimes bypass it during the rollback using the --no-hooks flag as an emergency measure.
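
A quick way to check, again with an illustrative namespace, release name, and revision:

    # Find hook Jobs that are still running or have failed
    kubectl get jobs -n my-namespace

    # Drill into a suspect Job and the pod it created
    kubectl describe job <job-name> -n my-namespace
    kubectl logs job/<job-name> -n my-namespace

    # Emergency measure: retry the rollback while skipping pre/post-rollback hooks
    helm rollback my-release 3 --no-hooks -n my-namespace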

Helm’s Release State: The Secret in the Namespace

Helm 3 stores the state and history of each release in a Kubernetes Secret within the release’s namespace. Each revision of a release gets its own Secret, following the naming convention sh.helm.release.v1.<release-name>.v<revision>.

This storage mechanism is robust, but it can be a source of trouble, especially if:

  • A migration from Helm v2 was incomplete: Helm v2 used a central component called Tiller, which stored release information differently. If old Tiller ConfigMaps or Secrets still exist, or if the migration wasn’t clean, you can encounter conflicts.
  • A Secret becomes corrupted: While rare, manual edits or cluster issues can lead to a corrupted release Secret, making it unreadable by Helm.
  • Permissions are incorrect: The user or service account running Helm might lack the necessary RBAC permissions to read or write the release Secrets in the target namespace.

You can inspect these secrets directly using kubectl to see what Helm is working with. Any discrepancy here can lead to failures.
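
The sketch below assumes Helm 3’s storage format (a gzipped, base64-encoded payload under the release key) and that gzip and jq are available; the names and revision numbers are illustrative, and the exact decoding steps may differ slightly between Helm versions:

    # List the release secrets Helm keeps for this release
    kubectl get secrets -n my-namespace -l owner=helm,name=my-release

    # Decode a specific revision's record and check the status Helm has on file
    kubectl get secret sh.helm.release.v1.my-release.v3 -n my-namespace \
      -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip -c | jq .info.status

    # Confirm the account running Helm can actually read and write these secrets
    kubectl auth can-i get secrets -n my-namespace
    kubectl auth can-i create secrets -n my-namespace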

Your Step-by-Step Recovery Playbook

When faced with a pending-rollback status, follow these steps methodically, starting with the least invasive options.

Step 1: Gather Intelligence with Helm Commands

Before making any changes, get a clear picture of the situation using Helm’s built-in commands.

  • helm status <release-name>: This is your first command. It provides a detailed summary of the release’s current state, including any notes from the chart that might offer clues.
  • helm history <release-name>: This command lists all the revisions for your release, showing their status (deployed, failed, superseded). This helps you confirm you are rolling back to a known-good revision number. Examples of both commands follow this list.
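
In practice, with an illustrative release name and namespace:

    # Summarize the current state of the release
    helm status my-release -n my-namespace

    # List every revision and its status (deployed, failed, superseded, pending-rollback)
    helm history my-release -n my-namespace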

Step 2: Investigate the Kubernetes Layer

As discussed earlier, most hangs are due to underlying Kubernetes resource issues. Dig into the events and states of your workloads: review the latest events in the namespace for errors, then check the status of the key workloads managed by the chart, such as Deployments, StatefulSets, and Jobs. Look for any resource that is not in a Ready or Completed state and start your investigation there.
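
For example, assuming the same illustrative namespace:

    # Recent events in the namespace, sorted oldest to newest
    kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp

    # Status of the key workloads managed by the chart
    kubectl get deployments,statefulsets,jobs -n my-namespace

    # Check rollout progress of a specific Deployment (blocks until it finishes or fails)
    kubectl rollout status deployment/<deployment-name> -n my-namespace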

Step 3: Attempt a More Assertive Rollback

If you’ve identified a resource that is stuck but not fundamentally broken, you can try to force the rollback through.

  • Increase the timeout: If your pods simply take a long time to become ready, a longer timeout may be all you need; pass the --timeout flag with a higher value (e.g., 15m).
  • Use the --force flag: This is more aggressive. It forces the rollback through by deleting and recreating resources when an update fails, which can resolve stateful conflicts where Kubernetes refuses to apply a change. Use it with caution, as it can be disruptive. Both options are sketched after this list.
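
With an illustrative release name and revision number, the two options look like this; --force in particular should be reserved for resources you can tolerate being recreated:

    # Give slow-starting pods more time to become ready
    helm rollback my-release 3 --wait --timeout 15m -n my-namespace

    # More aggressive: delete and recreate resources that refuse to update (disruptive)
    helm rollback my-release 3 --force -n my-namespace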

Step 4: Manual Intervention (The Last Resort)

If automated rollbacks continue to fail, you may need to intervene manually. Proceed with caution.

Scenario A: Uninstalling the Stuck Release

Sometimes, the cleanest path forward is to remove the release and start over. First, attempt a normal uninstall. If the uninstall hangs (often due to finalizers on resources), you may need to manually delete the problematic Kubernetes objects first. Once the objects are gone, you can run the uninstall command again. This will also clear out the release history secrets, giving you a clean slate.
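
A sketch of that sequence, with an illustrative release, namespace, and resource placeholder; clear finalizers only once you understand why the resource is refusing to terminate:

    # Attempt a normal uninstall first
    helm uninstall my-release -n my-namespace

    # If it hangs on a resource with finalizers, clear them manually (use with extreme care)
    kubectl patch <kind>/<name> -n my-namespace --type=merge -p '{"metadata":{"finalizers":[]}}'

    # Retry the uninstall to remove the remaining objects and the release history secrets
    helm uninstall my-release -n my-namespace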

Scenario B: Manually Fixing the Release History

This is an advanced and risky procedure. Always back up the secret before editing or deleting it. If you’ve identified that a specific revision is causing the failure, you can manually delete its associated Secret. First, back up the failed secret to a file. Then, delete the failed secret from the cluster. By deleting the secret for the failed revision, you effectively erase it from Helm’s history. This often “unsticks” the release, allowing you to attempt a fresh upgrade or rollback.
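
With an illustrative release name, namespace, and failed revision number, the procedure looks like this:

    # Back up the failed revision's secret before touching it
    kubectl get secret sh.helm.release.v1.my-release.v4 -n my-namespace -o yaml > my-release-v4-backup.yaml

    # Remove the failed revision from Helm's history
    kubectl delete secret sh.helm.release.v1.my-release.v4 -n my-namespace

    # Verify that Helm no longer sees the bad revision
    helm history my-release -n my-namespace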

Preventing Future Rollback Failures

Recovering a stuck release is stressful. The best strategy is prevention.

  • Use --dry-run: Before any upgrade or rollback, run the command with --dry-run. It will render the templates and show you exactly what will be sent to the Kubernetes API, helping you catch errors early.
  • Implement Robust Probes: Ensure your charts have well-configured liveness and readiness probes. This allows Helm’s --wait functionality to work reliably.
  • Manage History: Use the --history-max flag in your upgrade commands to limit the number of revisions Helm keeps. This prevents an unbounded pile of release secrets from accumulating in the namespace. Both flags appear in the example after this list.
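
For example, with an illustrative release name and local chart path:

    # Preview exactly what would be sent to the Kubernetes API, without changing the cluster
    helm upgrade my-release ./my-chart -n my-namespace --dry-run

    # Keep only the last 10 revisions so the release history stays manageable
    helm upgrade my-release ./my-chart -n my-namespace --history-max 10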

By combining these preventative measures with a clear troubleshooting playbook, you can turn a potential crisis into a manageable incident and keep your deployments flowing smoothly.

For a holistic view of your Kubernetes cluster’s health, which can help you spot the resource issues that cause rollbacks to fail, explore what Netdata can do for you.