Kubernetes API server rate limiting: APF priority levels and starvation
Your API server is running. /healthz returns 200. /readyz passes. Yet nodes drop to NotReady, the scheduler stops placing pods, and controller logs fill with context deadline exceeded. The cluster is not down, but it is frozen. This pattern often points to API Priority and Fairness (APF) starvation: low-priority traffic consumes the API server’s concurrency budget, and critical control plane requests queue or get rejected.
APF is enabled by default in Kubernetes 1.20+. It classifies every API request into a priority level via FlowSchema rules, then schedules requests against a per-level concurrency limit. When a priority level exhausts its seats, requests queue. If the queue fills, the server returns HTTP 429. When the queue grows in system or leader-election, kubelets cannot renew leases, controllers cannot write status, and the cluster degrades from the inside out. This guide shows how to confirm APF starvation, identify the culprit, and fix the allocation without turning the API server into a free-for-all.
What this means
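APF layers fair queuing on top of the older global --max-requests-inflight and --max-mutating-requests-inflight limits; the sum of those two flags now serves as the server's total concurrency budget. Two built-in API resources in the flowcontrol.apiserver.k8s.io group control APF: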
- FlowSchema assigns incoming requests to a priority level based on user, verb, resource, namespace, or source.
- PriorityLevelConfiguration defines the concurrency share and queue size for each level.
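The default levels include exempt (no limits; used for the system:masters group), system (node traffic from the system:nodes group), leader-election, workload-high (built-in controllers and the scheduler), workload-low (other service accounts), global-default (all remaining users), and catch-all (requests that no other FlowSchema matches). Recent releases also ship a node-high level for node heartbeats.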
Each non-exempt level receives an effective concurrency limit proportional to its nominalConcurrencyShares relative to the total shares across all levels. For example, in a cluster where the server concurrency limit is 600 and total shares are 100, a level with 10 shares gets an effective limit of 60 concurrent requests.
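The share arithmetic can be sketched in a few lines of shell. The function name effective_limit is hypothetical, and rounding up is an assumption about how fractional results are handled; the sketch also ignores seat borrowing between levels, which newer releases support.

```bash
# effective_limit SERVER_LIMIT LEVEL_SHARES TOTAL_SHARES
# Prints the level's slice of the server concurrency limit, rounded up.
effective_limit() {
  server_limit=$1
  level_shares=$2
  total_shares=$3
  # Integer ceiling division: ceil(server_limit * level_shares / total_shares)
  echo $(( (server_limit * level_shares + total_shares - 1) / total_shares ))
}

effective_limit 600 10 100   # the example above: prints 60
```

Running it with the shares from your own PriorityLevelConfiguration objects gives a quick estimate of how many seats each level can count on.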
Starvation happens when lower-priority traffic, such as a runaway operator or CI pipeline, consumes its own seats plus any available headroom, leaving system or leader-election without capacity. Those critical requests then sit in queue until they time out or are rejected. The symptoms look like a slow control plane, but the root cause is distribution, not total volume.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Runaway controller or operator | workload-low executing at 100% of its limit; system queue depth rising | apiserver_flowcontrol_current_executing_requests by priority_level |
| Insufficient concurrency for critical levels | leader-election or system queues grow during normal load | prioritylevelconfigurations shares and total cluster shares |
| Misconfigured FlowSchema | Kubelet or controller traffic classified into catch-all | flowschemas matching rules for critical users |
| Thundering herd after recovery | All priority levels show queue spikes simultaneously | Request rate by flow schema |
| Global API server saturation | apiserver_current_inflight_requests near the hard limit; 429s across all levels | Global inflight vs --max-requests-inflight (read-only) and --max-mutating-requests-inflight (mutating) |
Quick checks
```bash
# Check APF queue depth by priority level
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests

# Check concurrency utilization per priority level
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_executing_requests

# Check APF rejected requests by priority level and flow schema
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total

# Check 429 rate from the API server
kubectl get --raw /metrics | grep 'apiserver_request_total.*code="429"'

# Check global inflight requests
kubectl get --raw /metrics | grep apiserver_current_inflight_requests

# View current APF configuration
kubectl get prioritylevelconfigurations -o custom-columns=NAME:.metadata.name,CONCURRENCY:.spec.limited.nominalConcurrencyShares
kubectl get flowschemas
```
A healthy cluster shows zero sustained queue depth in system and leader-election, a 429 rate near zero, and inflight requests well below the hard limit.
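Raw grep output lists one series per flow schema, which makes it hard to see the per-level total at a glance. The following is a minimal sketch that sums queue depth by priority level; the function name queue_depth_by_level is hypothetical, and it assumes the metric carries a priority_level label and that samples have no trailing timestamp.

```bash
# queue_depth_by_level: reads Prometheus text-format metrics on stdin and
# prints "<priority_level> <queued requests>", summed across flow schemas.
# Usage against a live cluster:
#   kubectl get --raw /metrics | queue_depth_by_level
queue_depth_by_level() {
  awk '
    /^apiserver_flowcontrol_current_inqueue_requests\{/ {
      # Extract the value of the priority_level label (16 = length of priority_level=")
      if (match($0, /priority_level="[^"]*"/)) {
        level = substr($0, RSTART + 16, RLENGTH - 17)
        total[level] += $NF   # last field is the sample value
      }
    }
    END { for (level in total) print level, total[level] }
  ' | sort
}
```

Any non-zero line for system or leader-election that persists across several runs is the signal to keep digging.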
How to diagnose it
1. Confirm APF is actively throttling. Check apiserver_flowcontrol_rejected_requests_total and apiserver_request_total{code="429"}. If 429s are present, APF is the bottleneck. If absent, look at etcd latency or admission webhooks instead.
2. Identify which priority levels are queuing. Check apiserver_flowcontrol_current_inqueue_requests by priority_level. Any sustained queue depth in system or leader-election is critical. Queueing in workload-low or catch-all is expected under load and is APF working as designed.
3. Find the level consuming all concurrency. Compare apiserver_flowcontrol_current_executing_requests against apiserver_flowcontrol_request_concurrency_limit for each priority level. If workload-low is at 100% while system is queuing, a noisy neighbor is starving critical traffic.
4. Pinpoint the specific client or flow. Use apiserver_flowcontrol_rejected_requests_total broken down by flow_schema, or inspect audit logs for the user-agent and username generating the flood. A single flow schema dominating the request count indicates a runaway controller, aggressive CI job, or misconfigured operator.
5. Distinguish local saturation from global overload. Check apiserver_current_inflight_requests. If global inflight is well below --max-requests-inflight but APF is rejecting traffic, the issue is share misallocation. If inflight is at the global limit, the server is universally overloaded.
6. Correlate with downstream impact. Check node Ready conditions and controller logs. If kubelets miss heartbeats or the scheduler times out on leader election, APF starvation is already causing cluster-wide degradation. This confirms urgency.
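The comparison of executing requests against each level's limit can be sketched with awk. The function name level_utilization is hypothetical. The sketch accepts either apiserver_flowcontrol_request_concurrency_limit or apiserver_flowcontrol_nominal_limit_seats as the limit series, on the assumption that newer releases expose the latter; check which one your API server actually reports.

```bash
# level_utilization: reads Prometheus text-format metrics on stdin and prints
# "<priority_level> <executing> <limit> <percent>" for each limited level.
# Usage against a live cluster:
#   kubectl get --raw /metrics | level_utilization
level_utilization() {
  awk '
    # Return the value of label `name` from a metric line, or "" if absent.
    function label(line, name,   pat) {
      pat = name "=\"[^\"]*\""
      if (!match(line, pat)) return ""
      return substr(line, RSTART + length(name) + 2, RLENGTH - length(name) - 3)
    }
    /^apiserver_flowcontrol_current_executing_requests\{/ {
      running[label($0, "priority_level")] += $NF   # sum across flow schemas
    }
    /^apiserver_flowcontrol_(request_concurrency_limit|nominal_limit_seats)\{/ {
      limit[label($0, "priority_level")] = $NF
    }
    END {
      for (l in limit) if (limit[l] > 0)
        printf "%s %d %d %.0f%%\n", l, running[l], limit[l], 100 * running[l] / limit[l]
    }
  ' | sort
}
```

A level pinned near 100% while system or leader-election shows queued requests matches the noisy-neighbor pattern described in step 3.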
```mermaid
flowchart TD
    A[Runaway controller floods workload-low] --> B[APF concurrency exhausted]
    B --> C[System and leader-election requests queue]
    C --> D[Kubelet heartbeats delayed]
    D --> E[Nodes marked NotReady]
    C --> F[Controller updates timeout]
    F --> G[Scheduling and reconciliation lag]
```

Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| APF queue depth (system / leader-election) | Critical control plane traffic is waiting instead of executing | Queue depth > 0 sustained for more than 30 seconds |
| APF rejected requests (system / leader-election) | Critical traffic is being dropped with 429 | Any non-zero rate in these levels |
| APF concurrency utilization per level | How close each level is to its effective limit | > 80% of limit sustained |
| 429 response rate | Active throttling by APF | > 5% of total API requests |
| Inflight requests (mutating / read-only) | Global API server saturation | > 80% of --max-mutating-requests-inflight (mutating) or --max-requests-inflight (read-only) |
| Controller timeout errors | Downstream impact of queue delays | context deadline exceeded in controller or kubelet logs |
Fixes
If the cause is a runaway controller or operator
Identify the offending client from audit logs or the flow_schema label on rejected requests. Throttle the client at the source: add client-side rate limits, reduce polling frequency, or fix the reconciliation loop. Do not raise APF limits to absorb bad behavior; the client will keep growing until it hits the next ceiling.
If the cause is insufficient concurrency shares
Edit the PriorityLevelConfiguration for system and leader-election to increase nominalConcurrencyShares. Remember that shares are relative to the total across all levels; increasing shares for one level reduces the effective limit of others unless you also raise the server concurrency limit via --max-requests-inflight and --max-mutating-requests-inflight. Ensure the API server’s CPU, memory, and etcd backing can handle the additional load before raising global limits.
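As a rough sketch, a share change can be applied as a JSON merge patch. The helper shares_patch is hypothetical and the value 20 is purely illustrative. One assumption to verify on your version: the API server periodically reconciles the built-in configuration objects, and the apf.kubernetes.io/autoupdate-spec annotation is the documented way to keep a manual edit from being overwritten.

```bash
# shares_patch SHARES: prints a JSON merge patch that sets nominalConcurrencyShares.
shares_patch() {
  printf '{"spec":{"limited":{"nominalConcurrencyShares":%d}}}\n' "$1"
}

# Usage against a live cluster (not executed here):
#   kubectl annotate prioritylevelconfiguration leader-election \
#     apf.kubernetes.io/autoupdate-spec=false --overwrite
#   kubectl patch prioritylevelconfiguration leader-election \
#     --type merge -p "$(shares_patch 20)"
```

After patching, re-check the effective limits of the other levels, since their slice of the budget shrinks when total shares grow.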
If the cause is misconfigured flow schemas
Ensure that kubelet, controller-manager, and scheduler traffic match dedicated high-priority flow schemas. The default schemas cover built-in components, but custom controllers or infrastructure agents often fall into catch-all. Create specific FlowSchema resources for these components, matching on their service account or user group, and assign them to workload-high or a custom high-priority level.
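A dedicated FlowSchema for one service account might look like the manifest below. This is a sketch under stated assumptions: the helper render_flowschema and the names passed to it are hypothetical, the apiVersion assumes a cluster that serves flowcontrol.apiserver.k8s.io/v1 (older clusters use a beta version), and the matchingPrecedence value is illustrative. It only needs to be lower than the precedence of whichever schema currently captures the traffic.

```bash
# render_flowschema NAME NAMESPACE SERVICE_ACCOUNT
# Prints a FlowSchema manifest routing one service account to workload-high.
render_flowschema() {
  cat <<EOF
apiVersion: flowcontrol.apiserver.k8s.io/v1   # assumption: adjust to the version your cluster serves
kind: FlowSchema
metadata:
  name: $1
spec:
  priorityLevelConfiguration:
    name: workload-high
  matchingPrecedence: 1000   # illustrative; lower values are evaluated first
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        namespace: $2
        name: $3
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
      clusterScope: true
EOF
}

# Usage (hypothetical names):
#   render_flowschema infra-agent monitoring infra-agent | kubectl apply -f -
```

After applying, confirm the classification by checking that the new schema name appears in the flow_schema label of apiserver_flowcontrol_current_executing_requests.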
If the cause is a thundering herd
If the traffic is legitimate but bursty, add jitter to client retry logic and ensure exponential backoff respects 429 responses. Temporarily increasing concurrency shares can provide relief, but the permanent fix is client behavior.
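For shell-based automation such as CI jobs, capped exponential backoff with jitter can be sketched as follows. The function name jittered_delay and its parameters are hypothetical, and the "full jitter" strategy (a random wait between zero and the capped exponential bound) is one common choice, not something the API server mandates.

```bash
# jittered_delay ATTEMPT BASE_SECONDS MAX_SECONDS
# Prints a random whole number of seconds in [0, min(MAX, BASE * 2^ATTEMPT)].
jittered_delay() {
  awk -v n="$1" -v base="$2" -v max="$3" -v pid="$$" 'BEGIN {
    cap = base * 2 ^ n                      # exponential growth per attempt
    if (cap > max) cap = max                # never wait longer than MAX_SECONDS
    srand(); t = srand(); srand(t + pid)    # seed from clock plus shell PID
    print int(rand() * (cap + 1))
  }'
}

# Usage in a retry loop (not executed here):
#   sleep "$(jittered_delay "$attempt" 1 30)"
```

If a 429 response carries a Retry-After header, wait at least that long before applying the jittered delay.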
If cluster stability is at risk
As a last resort, you can temporarily move a critical service account to the exempt priority level. This bypasses all queuing and can destabilize the API server if the client floods requests. Revert immediately after recovery. Long-term exemptions defeat the purpose of APF.
Prevention
- Review APF configuration quarterly and after adding major operators. New controllers change the request mix.
- Monitor system and leader-election queue depth as a leading indicator, not a lagging one.
- Ensure every critical controller has a dedicated FlowSchema resource. Do not let important traffic fall into catch-all.
- Document which service accounts and user groups each FlowSchema matches. Stale selectors silently reclassify traffic after deployments change.
- Set client-side rate limits and backoff on all custom controllers and automation.
- Test APF behavior under load. A deployment of 1,000 replicas should not push workload-low into a state that starves leader-election.
How Netdata helps
- Correlate APF queue depth with API server request latency to distinguish queuing delays from etcd latency.
- Alert on sustained queue depth in system or leader-election before nodes transition to NotReady.
- Track 429 spikes alongside etcd disk latency and webhook latency to isolate the true bottleneck.
- Visualize per-priority-level concurrency utilization to spot noisy neighbors before they cause cluster-wide impact.
- Monitor controller workqueue depth as a downstream signal that APF throttling delays reconciliation.
Related guides
- Kubernetes API server etcd latency: detection and cascading failures
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes kubelet not responding: PLEG, runtime, and certificate issues
- Kubernetes node NotReady: kubelet, runtime, and network diagnosis
- Kubernetes monitoring checklist: the signals every production cluster needs






