Kubernetes admission webhook death spiral: detection and recovery

You deploy a mutating webhook to enforce policy. Later, a node drains and the webhook pod is evicted. Now no pods can start, including the webhook’s own replacement. The cluster does not crash, but it stops moving. Every kubectl apply that touches a matched resource hangs. Horizontal autoscalers freeze. Rolling updates stall. This is the admission webhook death spiral: a circular dependency in which the webhook must admit the very pod that runs it.

A related failure mode is quieter but equally destructive. Several webhooks each time out at 10 seconds. Even with failurePolicy: Ignore, Kubernetes accumulates those timeouts against a global admission ceiling of roughly 30 seconds. Once the ceiling is breached, pod creation fails with “context deadline exceeded” and no single webhook is named in the error. The cluster appears to reject pods for no reason.

What this means

Kubernetes admission webhooks are synchronous HTTP calls from the API server to an external service. Mutating webhooks run sequentially in sorted name order. Validating webhooks run afterward, in parallel. Webhook latency adds directly to the API server’s mutating request latency. When a webhook fails or slows down, requests queue in the API server’s inflight pool and in API Priority and Fairness (APF) queues. If the backlog grows faster than it drains, the cluster enters a positive feedback loop: controllers retry failed operations, generating more admission load, which deepens the backlog.

There are two distinct triggers. The first is a self-hosting deadlock: a webhook running inside the cluster matches its own pod resource, so its absence prevents its own recovery. The second is cumulative timeout saturation: the sum of per-webhook timeouts on the pod creation path exceeds the global request context deadline. In this case, even webhooks configured with failurePolicy: Ignore cause admission to fail because the request context expires before the API server can finish calling all webhooks.

flowchart TD
    A[Webhook pod terminates] --> B[API server blocks CREATE on matched resources]
    B --> C[Webhook Deployment tries to reschedule]
    C --> D{Webhook endpoint ready?}
    D -->|No endpoints| E[Pod creation times out]
    E --> B
    D -->|Endpoints ready| F[Admission succeeds]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Self-referential webhook deadlock | Webhook pod is 0/1 Ready; all pod creation hangs; events show “failed calling webhook” | kubectl get endpoints <webhook-svc> -n <ns> |
| Cumulative timeout ceiling exceeded | Pod creation fails with “context deadline exceeded” after ~30s; no webhook is named in the error | Number of webhooks and timeoutSeconds on the pod CREATE path |
| Network policy blocking API server egress | Webhook pod is Running and Ready; connections from the API server time out silently | Default-deny NetworkPolicies in the webhook namespace |
| Webhook service scaled to zero or crashlooping | Endpoint has no addresses; failurePolicy: Fail rejects all matching mutations | Webhook Deployment status and recent pod events |
| Kyverno or operator retry storm | Webhook configuration reconciliation loops rapidly; API server inflight requests spike | Kyverno webhook configuration age and controller logs |

Quick checks

# Check per-webhook admission latency (histogram; expect multiple lines)
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds

# List failure policies across all webhook configurations
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o yaml | grep -B5 failurePolicy

# Check if the webhook service has ready endpoints
kubectl get endpoints -n <webhook-namespace> <webhook-service-name>

# Find recent pod creation failures tied to webhooks
kubectl get events --all-namespaces --field-selector reason=FailedCreate

# Check API server inflight mutating requests
kubectl get --raw /metrics | grep apiserver_current_inflight_requests

# Check APF queue depth for mutating traffic
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests
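
The raw histogram output is long and hard to read by eye. As a rough aid, the sketch below averages latency per webhook from the _sum and _count series. It assumes the metric carries a name label holding the webhook name; the function name webhook_avg_latency is made up for this example. An average hides tail latency, so treat a high value as a pointer to look closer, not as a p99.

```shell
# Summarize average webhook latency from API server metrics text on stdin.
# Prints one line per webhook: "<name> <average seconds> <call count>".
webhook_avg_latency() {
  awk '
    /^apiserver_admission_webhook_admission_duration_seconds_(sum|count)\{/ {
      # Extract the value of the name label from the label set.
      if (!match($0, /name="[^"]*"/)) next
      name = substr($0, RSTART + 6, RLENGTH - 7)
      # The sample value is the last field on the line.
      if ($0 ~ /_seconds_sum\{/) sum[name] += $NF
      else cnt[name] += $NF
    }
    END {
      for (n in cnt)
        if (cnt[n] > 0) printf "%s %.3f %d\n", n, sum[n] / cnt[n], cnt[n]
    }
  ' | sort
}

# Usage:
# kubectl get --raw /metrics | webhook_avg_latency
```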

How to diagnose it

  1. Confirm that mutating requests are stalled. Run kubectl run test --image=busybox --restart=Never --dry-run=server. If the command returns a 5xx or hangs (cancel it with Ctrl-C), mutating admission is blocked. Check apiserver_request_duration_seconds for POST/PUT/PATCH verbs. If mutating latency is elevated but etcd latency and LIST latency are normal, webhooks are the likely bottleneck.

  2. Identify which webhook is slow or failing. Inspect apiserver_admission_webhook_admission_duration_seconds. A webhook whose p99 approaches its configured timeoutSeconds is the culprit. If multiple webhooks are near their limits, suspect cumulative timeout saturation.

  3. Check for the self-hosting deadlock. Compare the webhook’s namespaceSelector and objectSelector against the namespace and labels of the webhook’s own pods. If the webhook pod matches its own configuration, and the webhook pod is not Ready, you have a circular dependency.

  4. Verify webhook endpoint health. kubectl get endpoints -n <ns> <svc> should show one or more ready addresses. If it shows <none>, the backing pods are not Ready or do not exist. Check the Deployment for ImagePullBackOff, CrashLoopBackOff, or resource pressure evictions.

  5. Inspect the network path from the API server. The API server typically reaches webhooks via the cluster network. If a default-deny NetworkPolicy exists in the webhook namespace, traffic from the API server to the webhook pods may be dropped without logging. Check for NetworkPolicies that select the webhook pods but do not explicitly allow ingress from the API server.

  6. Calculate cumulative timeout exposure. List the mutating webhooks that match pod CREATE and add up their timeoutSeconds values. If the sum approaches or exceeds 30 seconds, you are hitting the global ceiling. Even with failurePolicy: Ignore, the API server waits out each webhook’s full timeout before moving on, so the request context can expire before the chain completes.

  7. Correlate with APF saturation. If apiserver_flowcontrol_current_inqueue_requests is non-zero for mutating traffic and apiserver_request_total{code="429"} is increasing, the death spiral has progressed to active request rejection. Controllers retrying against 429s will amplify load.
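
The timeout arithmetic in step 6 can be sketched as a small helper. This is an illustration, not a definitive tool: the function name timeout_budget is hypothetical, the default ceiling of 30 is the figure used in this article, and the kubectl listing in the usage comment covers every mutating webhook, so it over-counts if some of them do not match pod CREATE.

```shell
# Sum timeoutSeconds across webhooks given on stdin as "<name> <timeoutSeconds>"
# and compare the total with a ceiling ($1, default 30).
timeout_budget() {
  ceiling="${1:-30}"
  awk -v ceiling="$ceiling" '
    NF >= 2 { total += $2; n++ }
    END {
      status = (total >= ceiling) ? "AT_OR_OVER_CEILING" : "ok"
      printf "webhooks=%d total=%ds ceiling=%ds %s\n", n, total, ceiling, status
    }
  '
}

# Usage:
# kubectl get mutatingwebhookconfigurations \
#   -o jsonpath='{range .items[*].webhooks[*]}{.name}{" "}{.timeoutSeconds}{"\n"}{end}' \
#   | timeout_budget 30
```

A worst-case sum is deliberately pessimistic: it only becomes real latency when every webhook on the path is timing out at once, which is exactly the failure mode described here.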

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| apiserver_admission_webhook_admission_duration_seconds | Direct measurement of webhook latency added to every mutating request | p99 > 1s or approaching timeoutSeconds |
| apiserver_request_duration_seconds (mutating verbs) | Total mutating latency; webhooks are the most common avoidable component | p99 > 5s sustained |
| apiserver_current_inflight_requests (mutating) | Tracks how many mutating requests are blocked behind slow webhooks | > 80% of --max-mutating-requests-inflight |
| apiserver_flowcontrol_current_inqueue_requests | APF queuing indicates admission path saturation | Queue depth > 0 for mutating priority levels |
| apiserver_request_total{code="429"} | Confirms APF or inflight saturation caused by webhook backlog | Sustained rate above baseline |
| apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error"} | Counts failed webhook calls when failurePolicy: Fail | Any sustained increase |
| Webhook endpoint ready addresses | A webhook with no endpoints and failurePolicy: Fail blocks all matching creations | Endpoint count == 0 for > 60s |
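
As one example of turning a warning sign into a check, the sketch below computes mutating inflight usage against a limit. It assumes the gauge exposes a request_kind="mutating" series and uses 200 as the default limit, which is the upstream default for --max-mutating-requests-inflight; pass your own value if the flag is set differently. The function name mutating_inflight_pct is hypothetical.

```shell
# Report mutating inflight requests as a percentage of a limit.
# stdin: API server metrics text; $1: the limit (default 200).
mutating_inflight_pct() {
  limit="${1:-200}"
  awk -v limit="$limit" '
    /^apiserver_current_inflight_requests\{/ && /request_kind="mutating"/ { cur = $NF; seen = 1 }
    END {
      if (!seen) { print "no mutating inflight sample found"; exit 1 }
      pct = 100 * cur / limit
      # Flag anything above the 80% warning threshold.
      printf "mutating inflight %d/%d (%.0f%%)%s\n", cur, limit, pct, (pct > 80 ? " WARNING" : "")
    }
  '
}

# Usage:
# kubectl get --raw /metrics | mutating_inflight_pct 200
```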

Fixes

If the cause is a self-referential deadlock

You must break the cycle by letting the webhook’s own pod through admission. The patch below makes the webhook fail open, so while the webhook is down, policy is not enforced on any resource it matches.

# WARNING: This makes the webhook fail open; calls that error or time out are ignored. Restore Fail once the webhook pod is healthy.
# NOTE: Replace 'mutatingwebhookconfiguration' with 'validatingwebhookconfiguration' if needed.
# NOTE: Verify the webhook index; /webhooks/0 is the first webhook in the configuration.
kubectl patch mutatingwebhookconfiguration <NAME> \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

After the webhook pod becomes Ready, restore failurePolicy: Fail. To prevent recurrence, add a namespaceSelector that excludes the webhook namespace, or use an objectSelector that excludes the webhook’s own pods.
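
One way to express the namespace exclusion is a JSON patch that sets a NotIn selector on the kubernetes.io/metadata.name label, which Kubernetes sets automatically on namespaces in recent versions. The sketch below only builds the patch string. The function name exclude_namespace_patch and the namespace webhook-system are placeholders, and the patch overwrites any namespaceSelector already present on that webhook entry, so merge by hand if one exists.

```shell
# Print a JSON patch that sets a namespaceSelector excluding one namespace.
# $1: namespace to exclude; $2: index of the webhook entry (default 0).
# Note: this replaces any namespaceSelector already set on that entry.
exclude_namespace_patch() {
  ns="$1"; idx="${2:-0}"
  printf '[{"op":"add","path":"/webhooks/%s/namespaceSelector","value":{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":["%s"]}]}}]\n' "$idx" "$ns"
}

# Usage (<NAME> is a placeholder):
# kubectl patch mutatingwebhookconfiguration <NAME> --type=json \
#   -p="$(exclude_namespace_patch webhook-system 0)"
```

If a controller owns the webhook configuration, it may reconcile the patch away; in that case the exclusion likely has to be set through that controller’s own configuration.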

If the cause is cumulative timeout saturation

Reduce the total timeout budget on the pod admission path. Lower timeoutSeconds to 5 or less for non-critical webhooks, remove webhooks from the pod CREATE path if they do not need to intercept pods, or temporarily patch one or more webhooks to failurePolicy: Ignore until the count can be reduced. There is no API server flag to raise the global ~30-second ceiling. The fix must reduce the aggregate timeout.

If the cause is a network policy

Add an ingress rule to the webhook namespace that permits traffic from the API server. Because the API server often runs on the host network or uses a special source IP, match on the CIDR block of the control plane nodes rather than on pod labels.
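
A minimal sketch of such a rule follows, written as a helper that prints a NetworkPolicy manifest. Every value is an assumption to adapt: the function name webhook_ingress_policy, the policy name, an app label on the webhook pods, the control plane CIDR, and the serving port. Whether the API server’s traffic actually arrives with a source address inside that CIDR depends on the CNI and on how the control plane is hosted, so verify before relying on it.

```shell
# Print a NetworkPolicy allowing ingress to the webhook pods from one CIDR.
# $1: namespace, $2: value of the pods' "app" label (assumed to exist),
# $3: control plane CIDR, $4: webhook serving port.
webhook_ingress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apiserver-to-webhook
  namespace: $1
spec:
  podSelector:
    matchLabels:
      app: $2
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: $3
      ports:
        - protocol: TCP
          port: $4
EOF
}

# Usage (all four values are placeholders):
# webhook_ingress_policy webhook-system my-webhook 10.0.0.0/24 8443 | kubectl apply -f -
```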

If the cause is a crashing or scaled-down webhook service

Scale the webhook Deployment to at least one Ready replica. If the image is bad or the pod is in CrashLoopBackOff, patch the webhook configuration to failurePolicy: Ignore until the application is fixed.

If the cause is a Kyverno or operator retry loop

Some policy controllers fall into tight retry loops when they cannot reconcile their own webhook configuration, and each retry adds mutating load to the API server. If Kyverno is the source, break the loop by deleting the conflicting webhook configuration and restarting the Kyverno controller pods.

# WARNING: This deletes a live admission configuration. Ensure you can restore it.
kubectl delete validatingwebhookconfiguration <kyverno-config-name>
kubectl delete mutatingwebhookconfiguration <kyverno-config-name>
kubectl rollout restart deployment <kyverno-deployment> -n <kyverno-namespace>

Prevention

  • Exclude the webhook namespace. Use namespaceSelector in MutatingWebhookConfiguration and ValidatingWebhookConfiguration to exclude kube-system, the webhook’s own namespace, and any namespace used by the control plane.
  • Scope webhooks narrowly. Use objectSelector, rules, and scope so that a webhook only intercepts the resource types and namespaces it absolutely needs.
  • Keep timeouts short. Set timeoutSeconds to 10 or less. If a webhook needs more than 10 seconds, redesign it; synchronous admission is not the right place for long-running work.
  • Count webhooks on the pod path. Audit how many mutating webhooks run on pod CREATE. Three webhooks at 10 seconds each already reach the ~30-second ceiling.
  • Run critical webhooks outside the cluster. If a webhook is required for cluster operation, host it outside the cluster it guards so that its availability never depends on its own admission decisions.
  • Prefer ValidatingAdmissionPolicy. Where feasible, replace validating webhooks with ValidatingAdmissionPolicy, which evaluates CEL expressions inside the API server and eliminates the external HTTP call entirely.
  • Test failure modes. During chaos engineering or maintenance windows, verify that a webhook pod eviction does not block its own rescheduling.

How Netdata helps

  • Correlates apiserver_admission_webhook_admission_duration_seconds with overall mutating request latency, making it obvious when a webhook is the bottleneck.
  • Surfaces apiserver_current_inflight_requests and apiserver_flowcontrol_current_inqueue_requests so you see the backlog before it triggers mass 429s.
  • Alerts on apiserver_request_total spikes for 5xx and 429 response codes on mutating verbs, providing early warning of admission path saturation.
  • Tracks webhook endpoint availability if you monitor the backing service, linking pod health to API server latency.