Kubernetes admission webhook death spiral: detection and recovery
You deploy a mutating webhook to enforce policy. Later, a node drains and the webhook pod evicts. Now no pods can start, including the webhook’s own replacement. The cluster does not crash, but it stops moving. Every kubectl apply hangs. Horizontal autoscalers freeze. Rolling updates stall. This is the admission webhook death spiral: a circular dependency where the webhook must admit a pod it itself requires.
A related failure mode is quieter but equally destructive. Several webhooks each time out at 10 seconds. Even with failurePolicy: Ignore, Kubernetes accumulates those timeouts against a global admission ceiling of roughly 30 seconds. Once the ceiling is breached, pod creation fails with “context deadline exceeded” and no single webhook is named in the error. The cluster appears to reject pods for no reason.
What this means
Kubernetes admission webhooks are synchronous HTTP calls from the API server to an external service. Mutating webhooks run sequentially in sorted name order. Validating webhooks run afterward, in parallel. Webhook latency adds directly to the API server’s mutating request latency. When a webhook fails or slows down, requests queue in the API server’s inflight pool and in API Priority and Fairness (APF) queues. If the backlog grows faster than it drains, the cluster enters a positive feedback loop: controllers retry failed operations, generating more admission load, which deepens the backlog.
There are two distinct triggers. The first is a self-hosting deadlock: a webhook running inside the cluster matches its own pod resource, so its absence prevents its own recovery. The second is cumulative timeout saturation: the sum of per-webhook timeouts on the pod creation path exceeds the global request context deadline. In this case, even webhooks configured with failurePolicy: Ignore cause admission to fail because the request context expires before the API server can finish calling all webhooks.
flowchart TD
A[Webhook pod terminates] --> B[API server blocks CREATE on matched resources]
B --> C[Webhook Deployment tries to reschedule]
C --> D{Webhook endpoint ready?}
D -->|No endpoints| E[Pod creation times out]
E --> B
D -->|Endpoints ready| F[Admission succeeds]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Self-referential webhook deadlock | Webhook pod is 0/1 Ready; all pod creation hangs; events show “failed calling webhook” | kubectl get endpoints <webhook-svc> -n <ns> |
| Cumulative timeout ceiling exceeded | Pod creation fails with “context deadline exceeded” after ~30s; no webhook is named in the error | Number of webhooks and timeoutSeconds on the pod CREATE path |
| Network policy blocking API server egress | Webhook pod is Running and Ready; connections from the API server time out silently | Default-deny NetworkPolicies in the webhook namespace |
| Webhook service scaled to zero or crashlooping | Endpoint has no addresses; failurePolicy: Fail rejects all matching mutations | Webhook Deployment status and recent pod events |
| Kyverno or operator retry storm | Webhook configuration reconciliation loops rapidly; API server inflight requests spike | Kyverno webhook configuration age and controller logs |
Quick checks
# Check per-webhook admission latency (histogram; expect multiple lines)
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds
# List failure policies across all webhook configurations
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o yaml | grep -B5 failurePolicy
# Check if the webhook service has ready endpoints
kubectl get endpoints -n <webhook-namespace> <webhook-service-name>
# Find recent pod creation failures tied to webhooks
kubectl get events --all-namespaces --field-selector reason=FailedCreate
# Check API server inflight mutating requests
kubectl get --raw /metrics | grep apiserver_current_inflight_requests
# Check APF queue depth for mutating traffic
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests
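The raw histogram output from the first check is hard to read at a glance. As a rough sketch, the function below reduces the cumulative _sum and _count series to a mean latency per webhook. The function name webhook_mean_latency is hypothetical, and the result is an average since the API server started, not a p99, so treat it as a first-pass ranking rather than a precise measurement.

```shell
# Sketch: mean admission latency per webhook, computed from the cumulative
# _sum and _count series. Reads Prometheus text-format metrics on stdin and
# prints "name<TAB>mean_seconds<TAB>calls", slowest first.
# webhook_mean_latency is a hypothetical helper name.
webhook_mean_latency() {
  awk '
    /^apiserver_admission_webhook_admission_duration_seconds_(sum|count)\{/ {
      # Pull the webhook name out of the label set, wherever it appears.
      if (!match($0, /name="[^"]*"/)) next
      name = substr($0, RSTART + 6, RLENGTH - 7)
      value = $NF
      if ($0 ~ /_sum\{/) sum[name] += value
      else               count[name] += value
    }
    END {
      for (n in count)
        if (count[n] > 0)
          printf "%s\t%.3f\t%d\n", n, sum[n] / count[n], count[n]
    }
  ' | sort -t "$(printf '\t')" -k2,2nr
}
```

A possible invocation is kubectl get --raw /metrics | webhook_mean_latency. A webhook whose mean sits close to its timeoutSeconds is almost certainly timing out on most calls.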
How to diagnose it
1. Confirm that mutating requests are stalled. Run a server-side dry run:
   kubectl run test --image=busybox --restart=Never --dry-run=server
   If the command hangs or returns a 5xx, mutating admission is blocked; cancel a hung command with Ctrl-C. Then check apiserver_request_duration_seconds for the POST, PUT, and PATCH verbs. If mutating latency is elevated but etcd latency and LIST latency are normal, webhooks are the likely bottleneck.
2. Identify which webhook is slow or failing. Inspect apiserver_admission_webhook_admission_duration_seconds. A webhook whose p99 approaches its configured timeoutSeconds is the likely culprit. If multiple webhooks are near their limits, suspect cumulative timeout saturation.
3. Check for the self-hosting deadlock. Compare the webhook’s namespaceSelector and objectSelector against the namespace and labels of the webhook’s own pods. If the webhook pod matches its own configuration, and the webhook pod is not Ready, you have a circular dependency.
4. Verify webhook endpoint health. kubectl get endpoints -n <ns> <svc> should show one or more ready addresses. If it shows <none>, the backing pods are not Ready or do not exist. Check the Deployment for ImagePullBackOff, CrashLoopBackOff, or resource pressure evictions.
5. Inspect the network path from the API server. The API server typically reaches webhooks via the cluster network. If a default-deny NetworkPolicy exists in the webhook namespace, the API server’s connections to the webhook pods may be dropped without logging. Check for NetworkPolicies that do not explicitly allow ingress from the API server.
6. Calculate cumulative timeout exposure. List the mutating webhooks that match pod CREATE and add up their timeoutSeconds values. If the sum approaches or exceeds 30 seconds, you are hitting the global ceiling. Even with failurePolicy: Ignore, the API server waits for the full timeout before proceeding, and the request context expires.
7. Correlate with APF saturation. If apiserver_flowcontrol_current_inqueue_requests is non-zero for mutating traffic and apiserver_request_total{code="429"} is increasing, the death spiral has progressed to active request rejection. Controllers retrying against 429s will amplify load.
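The cumulative timeout calculation can be scripted. The sketch below assumes jq is installed; both function names are hypothetical. The rule matching is deliberately simplified: it looks only at operations and resources, ignores namespaceSelector and objectSelector, and does not expand patterns such as */*, so it may over- or under-count. The default ceiling of 30 is the approximate figure used in this guide, not a value read from the cluster.

```shell
# Hypothetical helper: print "name<TAB>timeoutSeconds" for every mutating
# webhook whose rules cover pod CREATE. Requires jq. Simplified matching:
# selectors and patterns like "*/*" are not evaluated.
list_pod_create_webhooks() {
  kubectl get mutatingwebhookconfigurations -o json | jq -r '
    .items[].webhooks[]?
    | select(any(.rules[]?;
        any(.operations[]?; . == "CREATE" or . == "*")
        and any(.resources[]?; . == "pods" or . == "*")))
    | [.name, (.timeoutSeconds // 10)] | @tsv'
}

# Sum the second column of the TSV on stdin and compare it with a ceiling
# given in seconds (first argument, default 30 as assumed in this guide).
sum_timeout_budget() {
  local ceiling="${1:-30}"
  awk -F '\t' -v ceiling="$ceiling" '
    NF >= 2 { total += $2; printf "%-50s %3ds\n", $1, $2 }
    END {
      status = (total >= ceiling) ? "AT_RISK" : "OK"
      printf "total=%ds ceiling=%ds status=%s\n", total, ceiling, status
    }'
}
```

A possible invocation is list_pod_create_webhooks | sum_timeout_budget 30. The last line of output summarizes the worst case, in which every webhook on the path runs to its full timeout.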
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| apiserver_admission_webhook_admission_duration_seconds | Direct measurement of webhook latency added to every mutating request | p99 > 1s or approaching timeoutSeconds |
| apiserver_request_duration_seconds (mutating verbs) | Total mutating latency; webhooks are the most common avoidable component | p99 > 5s sustained |
| apiserver_current_inflight_requests (mutating) | Tracks how many mutating requests are blocked behind slow webhooks | > 80% of --max-mutating-requests-inflight |
| apiserver_flowcontrol_current_inqueue_requests | APF queuing indicates admission path saturation | Queue depth > 0 for mutating priority levels |
| apiserver_request_total{code="429"} | Confirms APF or inflight saturation caused by webhook backlog | Sustained rate above baseline |
| apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error"} | Counts failed webhook calls when failurePolicy: Fail | Any sustained increase |
| Webhook endpoint ready addresses | A webhook with no endpoints and failurePolicy: Fail blocks all matching creations | Endpoint count == 0 for > 60s |
Fixes
If the cause is a self-referential deadlock
You must break the circle by removing the admission requirement for the webhook’s own pod. This is disruptive because it temporarily disables admission control for every resource that webhook matches.
# WARNING: This makes the webhook fail open. While it is unreachable,
# matching requests are admitted without its checks or mutations.
# For a validating webhook, use 'validatingwebhookconfiguration' instead.
# Verify the webhook index first; /webhooks/0 is the first webhook in the configuration.
kubectl patch mutatingwebhookconfiguration <NAME> \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
After the webhook pod becomes Ready, restore failurePolicy: Fail. To prevent recurrence, add a namespaceSelector that excludes the webhook namespace, or use an objectSelector that excludes the webhook’s own pods.
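One way to add the namespace exclusion is a JSON patch, sketched below. The helper name exclude_namespace_patch is hypothetical. It relies on the kubernetes.io/metadata.name label, which recent Kubernetes versions set on every namespace automatically; verify that your cluster does so before depending on it. The "add" operation overwrites any namespaceSelector already present at that webhook index, so merge by hand if one exists.

```shell
# Hypothetical helper: print a JSON patch that excludes one namespace from
# the webhook at the given index.
# ASSUMPTION: namespaces carry the kubernetes.io/metadata.name label.
# NOTE: "add" replaces any namespaceSelector already set on that webhook.
exclude_namespace_patch() {
  local index="$1"
  local namespace="$2"
  printf '[{"op":"add","path":"/webhooks/%s/namespaceSelector","value":{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":["%s"]}]}}]\n' \
    "$index" "$namespace"
}
```

A possible invocation is kubectl patch mutatingwebhookconfiguration <NAME> --type=json -p "$(exclude_namespace_patch 0 <webhook-namespace>)". If an operator manages the webhook configuration, it may revert a manual patch, in which case the exclusion belongs in the operator’s own settings.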
If the cause is cumulative timeout saturation
Reduce the total timeout budget on the pod admission path. Lower timeoutSeconds to 5 or less for non-critical webhooks, remove webhooks from the pod CREATE path if they do not need to intercept pods, or temporarily patch one or more webhooks to failurePolicy: Ignore until the count can be reduced. There is no API server flag to raise the global ~30-second ceiling. The fix must reduce the aggregate timeout.
If the cause is a network policy
Add a NetworkPolicy ingress rule in the webhook namespace that permits traffic from the API server to the webhook pods on their serving port. Because the API server often runs on the host network or uses a special source IP, match on the CIDR block of the control plane nodes rather than on pod labels.
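A minimal sketch of such a policy follows, written as a shell function that renders the manifest so the namespace, CIDR, and port can be filled in per cluster. Everything specific here is an assumption: the helper name, the policy name, the app: webhook pod label, and the default port 8443 are placeholders to replace with your webhook’s real values. On managed control planes the source address seen by the pods may not be a control plane node IP at all, so confirm the CIDR before relying on it.

```shell
# Hypothetical helper: render a NetworkPolicy that allows ingress to the
# webhook pods from a control plane CIDR.
# ASSUMPTIONS: pods are labeled app=webhook; the serving port defaults to 8443.
render_allow_apiserver_policy() {
  local namespace="$1"
  local cidr="$2"
  local port="${3:-8443}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apiserver-to-webhook
  namespace: ${namespace}
spec:
  podSelector:
    matchLabels:
      app: webhook
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: ${cidr}
      ports:
        - protocol: TCP
          port: ${port}
EOF
}
```

A possible invocation is render_allow_apiserver_policy <webhook-namespace> <control-plane-cidr> 8443 | kubectl apply -f -. Reviewing the rendered YAML before applying it is a cheap safeguard.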
If the cause is a crashing or scaled-down webhook service
Scale the webhook Deployment to at least one Ready replica. If the image is bad or the pod is in CrashLoopBackOff, patch the webhook configuration to failurePolicy: Ignore until the application is fixed.
If the cause is a Kyverno or operator retry loop
Some policy controllers enter exponential retry loops when they cannot reconcile their own webhook configuration. If Kyverno is the source, break the loop by deleting the conflicting webhook configuration and restarting the Kyverno controller pods.
# WARNING: This deletes a live admission configuration. Ensure you can restore it.
kubectl delete validatingwebhookconfiguration <kyverno-config-name>
kubectl delete mutatingwebhookconfiguration <kyverno-config-name>
kubectl rollout restart deployment <kyverno-deployment> -n <kyverno-namespace>
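Since the delete commands above remove live configuration, it may be worth exporting each object first. The helper below is a hypothetical sketch: it writes the object to a file and refuses to report success if the export is empty. The export includes server-set metadata such as resourceVersion, which typically has to be removed before the object can be re-created from the file.

```shell
# Hypothetical helper: save a webhook configuration to a YAML file before
# deleting it. Prints the file path on success.
backup_webhook_config() {
  local kind="$1"
  local name="$2"
  local dir="${3:-.}"
  local file="${dir}/${kind}-${name}.yaml"
  kubectl get "$kind" "$name" -o yaml > "$file" || return 1
  # Refuse to report success on an empty export.
  [ -s "$file" ] || { echo "backup is empty: $file" >&2; return 1; }
  echo "$file"
}
```

A possible invocation is backup_webhook_config validatingwebhookconfiguration <kyverno-config-name> /tmp, run once per configuration before the corresponding delete.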
Prevention
- Exclude the webhook namespace. Use namespaceSelector in MutatingWebhookConfiguration and ValidatingWebhookConfiguration to exclude kube-system, the webhook’s own namespace, and any namespace used by the control plane.
- Scope webhooks narrowly. Use objectSelector, rules, and scope so that a webhook only intercepts the resource types and namespaces it absolutely needs.
- Keep timeouts short. Set timeoutSeconds to 10 or less. If a webhook needs more than 10 seconds, redesign it; synchronous admission is not the right place for long-running work.
- Count webhooks on the pod path. Audit how many mutating webhooks run on pod CREATE. Three webhooks at 10 seconds each already reach the roughly 30-second global ceiling, and a fourth exceeds it.
- Run critical webhooks outside the cluster. If a webhook is required for cluster operation, deploy it outside the cluster it guards and reference it by URL in the webhook’s clientConfig, which eliminates the circular dependency.
- Prefer ValidatingAdmissionPolicy. Where feasible, replace validating webhooks with ValidatingAdmissionPolicy, which evaluates CEL expressions inside the API server and eliminates the external HTTP call entirely.
- Test failure modes. During chaos engineering or maintenance windows, verify that a webhook pod eviction does not block its own rescheduling.
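Parts of this checklist can be audited periodically. The sketch below lists every webhook with its failurePolicy and timeoutSeconds, then flags entries with a long timeout or a fail-closed policy for manual review. Both function names are hypothetical, and the default threshold of 10 seconds simply mirrors the guidance above. A fail-closed flag is not a defect by itself; it marks the webhooks whose selectors deserve a check against their own namespace.

```shell
# Hypothetical helper: print "name<TAB>failurePolicy<TAB>timeoutSeconds"
# for every mutating and validating webhook in the cluster.
list_webhooks() {
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.timeoutSeconds}{"\n"}{end}{end}'
}

# Flag rows on stdin whose timeout exceeds the threshold (first argument,
# default 10 seconds) or whose failurePolicy is Fail.
flag_risky_webhooks() {
  local max="${1:-10}"
  awk -F '\t' -v max="$max" '
    {
      flags = ""
      if ($3 + 0 > max) flags = flags " long-timeout"
      if ($2 == "Fail") flags = flags " fail-closed"
      if (flags != "") print $1 ":" flags
    }'
}
```

A possible invocation is list_webhooks | flag_risky_webhooks 10. Webhooks that produce no output line have a short timeout and fail open.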
How Netdata helps
- Correlates apiserver_admission_webhook_admission_duration_seconds with overall mutating request latency, making it obvious when a webhook is the bottleneck.
- Surfaces apiserver_current_inflight_requests and apiserver_flowcontrol_current_inqueue_requests so you see the backlog before it triggers mass 429s.
- Alerts on apiserver_request_total spikes for 5xx and 429 response codes on mutating verbs, providing early warning of admission path saturation.
- Tracks webhook endpoint availability if you monitor the backing service, linking pod health to API server latency.
Related guides
- Kubernetes API server etcd latency: detection and cascading failures
- Kubernetes API server rate limiting: APF priority levels and starvation
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes eviction cascade: when one node failure takes down the cluster
- Kubernetes kubelet not responding: PLEG, runtime, and certificate issues






