Kubernetes API server slow or unresponsive: causes and fixes
When kubectl hangs, controllers log context deadline exceeded, and deployments stall, the Kubernetes API server is usually the bottleneck. It is the single funnel for every read and write to cluster state. Slowness propagates to scheduling, pod lifecycle, service discovery, and external automation.
This article covers operational causes and gives a step-by-step diagnostic flow to run during an incident. Use it to distinguish etcd latency, admission webhook stalls, request saturation, and memory pressure.
What this means
The API server is a stateless HTTP front-end to etcd. Every request passes through authentication, authorization, admission control, and then storage. Slowness means one of these stages is blocked. Unresponsiveness means the process is OOM-killed, deadlocked, or unable to reach its backing store.
In practice, you see three symptom classes:
- Elevated latency: kubectl commands take seconds, controller reconciliation lags, and scheduling delays grow.
- Saturation: the API server returns 429 (Too Many Requests) as inflight limits or API Priority and Fairness (APF) queues fill.
- Outright failure: the process crashes, restarts, or stops passing /livez, causing all cluster operations to halt.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| etcd disk latency | Mutating requests slow; /readyz/etcd fails; WAL fsync p99 above 100 ms | etcd_disk_wal_fsync_duration_seconds |
| Admission webhook timeout | Mutating latency spikes for specific resources; latency plateaus at the webhook timeout value | apiserver_admission_webhook_admission_duration_seconds |
| Inflight / APF saturation | 429 errors; APF queues full; all controllers lagging | apiserver_current_inflight_requests and APF queue depth |
| Memory pressure / OOM | Process restarts; LIST latency spikes after restart; RSS near container limit | process_resident_memory_bytes vs limit |
| Re-list storm | LIST rate spikes; CPU and memory burst; watches reconnecting en masse | apiserver_request_total{verb="LIST"} |
| Certificate expiry | Sudden 401s; nodes NotReady; TLS handshake errors | kubeadm certs check-expiration |
Quick checks
These checks are read-only and safe to run during an active incident.
# Check if the API server process is alive and ready
kubectl get --raw '/livez?verbose'
kubectl get --raw '/readyz?verbose'
# Check etcd cluster health from a control plane node
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
endpoint health
# Check per-webhook admission latency
kubectl get --raw '/metrics' | grep ^apiserver_admission_webhook_admission_duration_seconds
# Check inflight requests and APF queue depth
kubectl get --raw '/metrics' | grep ^apiserver_current_inflight_requests
kubectl get --raw '/metrics' | grep ^apiserver_flowcontrol_current_inqueue_requests
# Check API server memory and container restarts on the node
crictl ps --name kube-apiserver -q | xargs crictl stats
# Check 5xx and 429 error rates
kubectl get --raw '/metrics' | grep 'apiserver_request_total.*code="5'
kubectl get --raw '/metrics' | grep 'apiserver_request_total.*code="429"'
# Check control plane certificate expiry
kubeadm certs check-expiration
# Look for a LIST rate spike that indicates a re-list storm
kubectl get --raw '/metrics' | grep 'apiserver_request_total.*verb="LIST"'
How to diagnose it
Follow this flow to isolate the root cause.
- Scope the impact. In an HA deployment, check whether one instance or all instances are affected. Bypass the load balancer and call /readyz?verbose on each API server directly. If only one instance is degraded, remove it from rotation and investigate locally.
- Distinguish hung from overloaded. If /livez fails, the process is likely OOM-killed or deadlocked. Check container restart counts and kernel OOM logs (dmesg | grep -i oom). If /livez passes but /readyz fails, inspect the specific sub-checks (often etcd or poststarthook).
- Check etcd WAL fsync latency. Query the etcd metrics endpoint for etcd_disk_wal_fsync_duration_seconds. If p99 is above 100 ms sustained, disk I/O is the root cause. Every etcd write blocks on this fsync, so mutating API latency cannot be lower than this value.
- Check admission webhook latency. If apiserver_admission_webhook_admission_duration_seconds is elevated and correlates with total mutating latency, identify the specific webhook by name. Check its Deployment endpoints, pod readiness, and recent logs. A single slow webhook with failurePolicy: Fail can freeze all mutations for matched resources.
- Check for request saturation. If apiserver_current_inflight_requests is near the configured limit (default 400 read-only, 200 mutating) or APF queue depth is growing, find the noisy client. Break down request rates by user and user-agent, using the API audit log or APF flow-schema metrics, to identify runaway controllers or CI/CD pipelines; see the sketch after this list.
- Check memory and OOM patterns. If the API server container has restarted and memory was near the limit beforehand, you are likely in an OOM/re-list cycle. The replacement instance starts with cold caches; all clients re-list simultaneously, causing a memory spike that triggers another OOM. Check process_resident_memory_bytes trends.
- Check auth errors and certificates. A sudden spike in 401s from known components suggests certificate expiry or service account token rotation failure. Use kubeadm certs check-expiration or openssl x509 -in on the relevant cert files.
- Correlate by verb and resource. High LIST latency with normal GET latency points to large un-paginated list calls or a re-list storm. High mutating latency with normal reads points to etcd or webhooks. High latency across all verbs with normal etcd and webhooks points to CPU or memory pressure.
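The sketch below shows one way to run the first and fifth steps from a control plane node. The instance IPs are placeholders, and the audit-log query assumes JSON audit logging is enabled and written to /var/log/kubernetes/audit.log; adjust paths and endpoints to your environment.

```bash
# Per-instance readiness, bypassing the load balancer. Works when each API
# server's serving certificate includes its node IP in its SANs (the kubeadm default).
for ip in 10.0.0.10 10.0.0.11 10.0.0.12; do
  echo "== ${ip} =="
  kubectl get --raw '/readyz?verbose' --server="https://${ip}:6443"
done

# Noisiest clients by user and user-agent, from the JSON audit log.
jq -r 'select(.stage=="ResponseComplete") | "\(.user.username)\t\(.userAgent)"' \
  /var/log/kubernetes/audit.log | sort | uniq -c | sort -rn | head -20
```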
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| apiserver_request_duration_seconds p99 | Primary server-side latency indicator | Mutating p99 above 1 s sustained; LIST p99 above 5 s |
| etcd_disk_wal_fsync_duration_seconds p99 | Root cause of most write latency | p99 above 100 ms |
| apiserver_admission_webhook_admission_duration_seconds | Adds synchronous latency to every mutation | p99 above 200 ms for any single webhook |
| apiserver_current_inflight_requests | Measures global request saturation | Above 80% of configured limit |
| apiserver_flowcontrol_current_inqueue_requests | Early warning before APF rejects traffic | Queue depth above 0 for system or leader-election priority levels |
| process_resident_memory_bytes | Memory pressure leads to OOM and re-list cascades | Above 80% of container limit |
| apiserver_request_total{code="429"} | Confirms APF or inflight rejection | Sustained rate above 5% of total requests |
| apiserver_storage_objects | Object growth drives cache size and etcd storage | Any resource type growing unboundedly |
Fixes
If the cause is etcd latency
Run etcdctl check perf to validate disk performance against etcd requirements. Warning: this command writes load-test data. Do not run it on a production etcd cluster during an active incident.
If the disk is shared with other workloads, move etcd data to a dedicated SSD or NVMe volume. Check etcd_server_leader_changes_seen_total. If leader changes are increasing, the disk is too slow for the default Raft heartbeat interval. Schedule compaction and defragmentation during maintenance windows, one member at a time.
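A hedged sketch of these follow-up checks, assuming kubeadm default certificate paths and a local etcd member on 127.0.0.1:2379:

```bash
# Wrapper so the TLS flags are written once (kubeadm default paths assumed).
etcd_ctl() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key "$@"
}

# Leader churn: a rising counter means the disk is too slow for the Raft heartbeat.
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
  https://127.0.0.1:2379/metrics | grep ^etcd_server_leader_changes_seen_total

# Maintenance window only, one member at a time; defragmentation briefly blocks the member.
etcd_ctl defrag
etcd_ctl endpoint status --write-out=table   # confirm DB size dropped and the leader is stable
```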
If the cause is admission webhooks
Identify the slow webhook from apiserver_admission_webhook_admission_duration_seconds. Check its Deployment endpoints and pod logs. If the webhook is non-critical, you can temporarily change its failurePolicy to Ignore to unblock mutations. Narrow its rules and namespaceSelector so it matches fewer requests. Scale the webhook horizontally or increase its CPU and memory if it is overloaded.
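For example, a rough way to rank webhooks by cumulative admission time and then fail open a non-critical one; the namespace, service, configuration name, and webhook index below are placeholders for whatever your metrics point to:

```bash
# Rank webhooks by cumulative admission time (a rough ranking, not p99).
kubectl get --raw '/metrics' \
  | grep '^apiserver_admission_webhook_admission_duration_seconds_sum' \
  | sort -k2 -rn | head

# Check the backing service of a suspect webhook.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
kubectl get endpoints -n example-ns example-webhook-svc    # placeholder names

# Temporary mitigation for a NON-critical webhook: fail open instead of
# blocking every matched mutation.
kubectl patch mutatingwebhookconfiguration example-webhook \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
```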
If the cause is inflight or APF saturation
Identify the client causing the load from per-client request rates, using the API audit log or APF flow-schema metrics such as apiserver_flowcontrol_dispatched_requests_total. If the traffic is legitimate, increase --max-requests-inflight and --max-mutating-requests-inflight only if the node has CPU and memory headroom. Tune APF PriorityLevelConfiguration concurrency shares to protect system and leader-election flows. Isolate noisy operators into dedicated FlowSchemas with lower priority, as in the sketch below.
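A minimal sketch of isolating a noisy client into the built-in workload-low priority level. The ServiceAccount, namespace, and FlowSchema names are hypothetical; on clusters older than 1.29 use the flowcontrol.apiserver.k8s.io/v1beta3 API instead of v1.

```bash
# Placeholder names throughout: point the subject at the client your audit log
# or APF metrics identified.
kubectl apply -f - <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: noisy-operator
spec:
  priorityLevelConfiguration:
    name: workload-low          # built-in low-priority level
  matchingPrecedence: 1000      # lower values are matched first
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: example-operator
        namespace: operators
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]
EOF

# Watch queueing and rejections per flow schema after the change.
kubectl get --raw '/metrics' | grep 'apiserver_flowcontrol_rejected_requests_total'
```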
If the cause is memory pressure or OOM
Increase the API server container memory limit immediately. If running Go 1.19+, set GOMEMLIMIT to roughly 90% of the container limit to trigger more aggressive garbage collection before the kernel OOM killer fires. Review apiserver_storage_objects and remove unneeded CRDs or stale objects. Increase --watch-cache-sizes for high-churn resources only if memory allows.
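On a kubeadm control plane, one way to apply GOMEMLIMIT is through the static pod manifest. The manifest path is the kubeadm default, and the 7GiB value is illustrative, sized at roughly 90% of an 8Gi container limit:

```bash
# Edit /etc/kubernetes/manifests/kube-apiserver.yaml on each control plane node;
# the kubelet restarts the static pod automatically when the file changes.
# Illustrative fragment to add under the kube-apiserver container:
#
#   env:
#   - name: GOMEMLIMIT
#     value: "7GiB"
#
# Confirm the variable is set after the restart (node name is a placeholder).
kubectl -n kube-system get pod kube-apiserver-<node-name> \
  -o jsonpath='{.spec.containers[0].env}'
```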
If the cause is a re-list storm
If the storm follows an API server restart, allow caches to warm up. If it is spontaneous, check for 410 Gone watch errors indicating cache overflow. Increase the watch cache size for the affected resource type via --watch-cache-sizes. Ensure clients use bookmark watches to reduce cache misses.
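A sketch of the relevant checks and flag, assuming a kubeadm static pod layout; the node name, resource, and cache size are illustrative:

```bash
# Reflector-based clients (controllers, kubelets) log this when their watch
# falls out of the API server's cache window and the server returns 410 Gone.
kubectl -n kube-system logs kube-controller-manager-<node-name> \
  | grep -i 'too old resource version'

# The watch cache is raised per resource via an API server flag
# (format resource[.group]#size; the size here is illustrative):
#   --watch-cache-sizes=pods#5000

# Watch whether the LIST burst is subsiding.
kubectl get --raw '/metrics' | grep 'apiserver_request_total' | grep 'verb="LIST"'
```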
If the cause is certificate expiry
Renew expired certificates with kubeadm certs renew all on kubeadm-managed clusters, or via your certificate management pipeline. Restart the API server and affected kubelets to load the new certificates. Verify that renewal automation covers control plane, etcd, webhook CA, and front-proxy certificates.
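For a kubeadm-managed control plane, the renewal and restart sequence looks roughly like this (paths are kubeadm defaults):

```bash
kubeadm certs check-expiration
kubeadm certs renew all

# Static control plane pods only load certificates at startup. Moving the manifest
# out of the directory and back forces the kubelet to recreate the pod; repeat for
# any other control plane component whose certificates were renewed.
mkdir -p /tmp/manifests-backup
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/manifests-backup/ && sleep 20 \
  && mv /tmp/manifests-backup/kube-apiserver.yaml /etc/kubernetes/manifests/

# Verify the new expiry date on disk.
openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt
```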
Prevention
- Monitor etcd disk latency directly. Slow disk is the root cause of most API server write latency, yet many teams only notice it after a leader election storm.
- Alert on webhook latency and fail-opens. A slow or ignored webhook degrades every mutating request silently. Monitor apiserver_admission_webhook_fail_open_count.
- Size memory for burst headroom. Post-restart re-list storms can spike API server memory 2-3x above baseline. Size limits for the burst, not the steady state.
- Review APF flow schemas quarterly. Misclassification can starve leader election and kubelet heartbeats while a runaway operator fills the catch-all queue.
- Automate certificate expiry checks. Alert at 30 days and renew with buffer time for troubleshooting.
- Enforce object cleanup policies. Remove completed Jobs, stale Events, and unused CRDs before they inflate etcd and watch caches beyond capacity.
How Netdata helps
- Correlates API server request latency with etcd WAL fsync and host disk I/O latency on a single timeline, making etcd cascades obvious.
- Surfaces APF queue depth, inflight requests, and 429 rates alongside per-verb latency so you can see saturation before clients fail.
- Tracks API server memory and container restart counts to catch OOM cycles and memory leaks early.
- Maps request rate spikes by verb and resource to help identify noisy neighbors and re-list storms.