Kubernetes API server FlowSchemas and PriorityLevels: design and tuning

Before Kubernetes 1.20, the API server protected itself with two global hard limits: --max-requests-inflight and --max-mutating-requests-inflight. Every request, whether a kubelet heartbeat or a runaway controller LIST, competed for the same pool. API Priority and Fairness (APF), enabled by default since 1.20, replaces that coarse model with a two-stage classification and fair-queuing system. It separates requests into priority levels, isolates flows within each level, and rejects or queues traffic before it can starve critical control plane operations.

Understanding APF is not optional for operators running production clusters. A misconfigured hierarchy can silently throttle leader election, delay node status updates, or allow a single namespace to monopolize API server capacity. This article explains how FlowSchemas match requests, how PriorityLevels divide concurrency, what the built-in defaults actually protect, and how to design custom rules that keep the control plane responsive under load.

What APF is and why it matters

APF is implemented in the flowcontrol.apiserver.k8s.io API group. The stable version is v1, which graduated to GA in Kubernetes 1.29 and became the default. The v1beta3 version was removed in Kubernetes 1.32; clusters upgrading to 1.32 must convert any remaining v1beta3 manifests to v1 or the upgrade will block.

APF is configured through two built-in API kinds: FlowSchema and PriorityLevelConfiguration. A FlowSchema classifies an inbound request by its attributes. A PriorityLevelConfiguration defines how much concurrency that class receives and whether overflowing requests are queued or rejected. Together they create a traffic-shaping layer in the API server's handler chain that runs after authentication, before the request is authorized, handled, and sent on to etcd.

With APF enabled, the --max-requests-inflight and --max-mutating-requests-inflight flags no longer act as two separate hard limits. Their sum sets the API server's total concurrency, which APF then divides among the priority levels. APF is the primary scheduling mechanism.

flowchart LR
    A[API request] --> B{FlowSchema match}
    B -->|precedence ascending| C[PriorityLevel]
    C --> D{Concurrency available?}
    D -->|Yes| E[Execute]
    D -->|No, Queue| F[Shuffle-shard queue]
    F --> E
    D -->|No, Reject| G[HTTP 429]

How FlowSchemas classify traffic

A FlowSchema contains a list of rules that match requests by user, group, verb, resource, APIGroup, namespace, or non-resource URL. The API server evaluates all FlowSchemas in ascending numeric order of matchingPrecedence; the first match wins. This means precedence is not a suggestion. It is the arbitration order, and a catch-all rule with a low precedence number will shadow more specific rules placed after it.
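The first-match rule can be illustrated with a minimal sketch. The Request and FlowSchema classes, the lambda matchers, and the schema names below are hypothetical stand-ins for the real rules list, not the API server's implementation. The tie-break on name for equal precedence follows what the Kubernetes documentation describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Request:
    user: str
    verb: str
    resource: str


@dataclass
class FlowSchema:
    name: str
    matching_precedence: int
    priority_level: str
    matches: Callable[[Request], bool]  # hypothetical stand-in for spec.rules


def classify(request: Request, schemas: List[FlowSchema]) -> Optional[FlowSchema]:
    """Return the first matching schema in ascending matchingPrecedence order."""
    ordered = sorted(schemas, key=lambda s: (s.matching_precedence, s.name))
    for schema in ordered:
        if schema.matches(request):
            return schema
    return None


# A broad rule with a LOW number is evaluated first and shadows the specific one.
schemas = [
    FlowSchema("ci-agents", 500, "ci-level", lambda r: r.user.startswith("ci-")),
    FlowSchema("broad-fallback", 50, "fallback-level", lambda r: True),
]
req = Request(user="ci-runner-7", verb="list", resource="pods")
print(classify(req, schemas).name)
```

Moving the broad rule to a high number such as 9000 lets the specific ci-agents rule match first.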

Each FlowSchema specifies a priorityLevelConfiguration.name and an optional distinguisherMethod, whose type is ByUser or ByNamespace. When a distinguisher is set, APF splits matched requests into separate flows, one per user or per namespace, and fair queuing keeps any single flow from consuming all the slots within a priority level. Without a distinguisher, all requests matched by the FlowSchema form a single flow.
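As a rough illustration of how a distinguisher separates traffic, the sketch below derives a flow identity from the FlowSchema name and the chosen attribute. The flow_key function and its tuple representation are assumptions made for this example. The real API server computes flow identities internally.

```python
from typing import Optional, Tuple


def flow_key(schema_name: str, distinguisher: Optional[str],
             user: str, namespace: str) -> Tuple[str, str]:
    """Return a hypothetical flow identity for a matched request."""
    if distinguisher == "ByUser":
        return (schema_name, user)
    if distinguisher == "ByNamespace":
        return (schema_name, namespace)
    if distinguisher is None:
        # No distinguisher: every request matched by the schema shares one flow.
        return (schema_name, "")
    raise ValueError(f"unknown distinguisher type: {distinguisher}")
```

With ByNamespace, requests from team-a and team-b get different keys and so cannot crowd each other out. With no distinguisher they collapse into the same flow.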

The exempt priority level is special. Any FlowSchema that points to exempt bypasses queuing entirely. This is appropriate for system:masters traffic, but applying it broadly removes all backpressure and allows a single client to saturate the API server.

How PriorityLevels share concurrency

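A PriorityLevelConfiguration defines concurrency through spec.limited.nominalConcurrencyShares. The v1beta1 and v1beta2 APIs called this field assuredConcurrencyShares, and this article uses ACS as shorthand for a level's shares. With APF enabled, the server's total concurrency is the sum of the two inflight flags, so the nominal concurrency limit for a level is calculated as:

effective_limit = ceil((ACS / sum_of_all_ACS) * (max-requests-inflight + max-mutating-requests-inflight))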
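A small worked example may make the share arithmetic concrete. It assumes total server concurrency is the sum of the two inflight flags, as the Kubernetes documentation describes, and uses the kube-apiserver flag defaults of 400 and 200. The level names and share values are hypothetical, and exempt levels are left out because they hold no shares.

```python
from typing import Dict


def nominal_limits(shares: Dict[str, int],
                   max_requests_inflight: int = 400,
                   max_mutating_requests_inflight: int = 200) -> Dict[str, int]:
    """Split total server concurrency across levels in proportion to shares."""
    total_concurrency = max_requests_inflight + max_mutating_requests_inflight
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("at least one level needs a positive share")
    # Integer ceiling division avoids floating-point rounding surprises.
    return {
        name: -((-total_concurrency * share) // total_shares)
        for name, share in shares.items()
    }


limits = nominal_limits({"tenant-critical": 40, "tenant-batch": 100, "fallback": 20})
print(limits)
```

Because every level's limit depends on the sum of all shares, adding a new level with a large share shrinks the limits of every existing level.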

Each level also defines a limitResponse. If type is Queue, APF applies shuffle-sharding fair queuing with configurable queues, queueLengthLimit, and handSize. If type is Reject, the API server returns HTTP 429 immediately when concurrency is exhausted.
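The idea behind shuffle-sharding can be sketched as follows: each flow is dealt a small "hand" of candidate queues, and a request joins the shortest queue in that hand. This is a simplified illustration, not the API server's code. The SHA-256 hash and the function names are assumptions chosen for the example.

```python
import hashlib
from typing import List


def deal_hand(flow_id: str, num_queues: int, hand_size: int) -> List[int]:
    """Deterministically pick hand_size distinct queue indices for a flow."""
    if not 0 < hand_size <= num_queues:
        raise ValueError("hand_size must be between 1 and num_queues")
    digest = hashlib.sha256(flow_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    deck = list(range(num_queues))
    hand = []
    for _ in range(hand_size):
        # Use successive remainders of the hash to draw cards without replacement.
        value, index = divmod(value, len(deck))
        hand.append(deck.pop(index))
    return hand


def pick_queue(hand: List[int], queue_lengths: List[int]) -> int:
    """Choose the shortest queue among the flow's hand."""
    return min(hand, key=lambda q: queue_lengths[q])
```

A larger handSize makes it less likely that a quiet flow is stuck behind a noisy one, because the two flows rarely share every queue in their hands.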

The built-in default PriorityLevelConfiguration objects and their ACS values are:

| PriorityLevelConfiguration | ACS |
| --- | --- |
| node-high | 40 |
| system | 30 |
| workload-high | 40 |
| workload-low | 100 |
| leader-election | 10 |
| global-default | 20 |
| catch-all | 5 |
| exempt | n/a (type: Exempt, bypasses queuing) |

workload-low receives the largest share because it carries traffic from ordinary service accounts, which is typically the bulk of workload requests. node-high, system, and workload-high are deliberately kept high enough to protect kubelet, internal control plane, and built-in controller traffic.

There is a subtle quirk in the v1 API: when spec.limited.nominalConcurrencyShares is unspecified, it defaults to 30, but when it is explicitly set to 0, the value stays 0. Under v1beta3, an explicit 0 was treated as unset and promoted to 30. Operators who previously set 0 expecting that promotion will receive zero seats instead.
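The difference in defaulting can be written out as a short sketch. The function names are hypothetical, and the v1beta3 behavior is modeled from the description above rather than copied from the API server's defaulting code.

```python
from typing import Optional

DEFAULT_SHARES = 30


def effective_shares_v1(value: Optional[int]) -> int:
    """v1: only an unspecified value is defaulted; an explicit 0 is preserved."""
    return DEFAULT_SHARES if value is None else value


def effective_shares_v1beta3(value: Optional[int]) -> int:
    """v1beta3 (as described above): 0 is treated like unset and becomes 30."""
    return DEFAULT_SHARES if value is None or value == 0 else value
```

When converting manifests, a field set to 0 is worth reviewing by hand: removing the field restores the default of 30, while leaving it in place gives the level no nominal seats.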

Built-in defaults and their limits

The default FlowSchema hierarchy, from highest to lowest precedence, maps traffic to the priority levels above:

| FlowSchema | Precedence | PriorityLevel |
| --- | --- | --- |
| exempt | 1 | exempt |
| probes | 2 | exempt |
| system-leader-election | 100 | leader-election |
| endpoint-controller | 150 | workload-high |
| workload-leader-election | 200 | leader-election |
| system-node-high | 400 | node-high |
| system-nodes | 500 | system |
| kube-controller-manager | 800 | workload-high |
| kube-scheduler | 800 | workload-high |
| kube-system-service-accounts | 900 | workload-high |
| service-accounts | 9000 | workload-low |
| global-default | 9900 | global-default |
| catch-all | 10000 | catch-all |

The catch-all FlowSchema has a precedence of 10000 and maps to the catch-all PriorityLevel. That level has only 5 ACS and uses type: Reject, which means it does not queue requests. Any request that no other FlowSchema matches lands here and is rejected with HTTP 429 as soon as the level's few seats are in use. With the default configuration intact, the global-default FlowSchema at precedence 9900 absorbs most otherwise unmatched traffic, so requests typically reach catch-all only when that default has been changed or misconfigured. The catch-all level exists as a mandatory backstop, and production workloads should never rely on it.

Where it shows up in production

Clusters upgrading to Kubernetes 1.32 must ensure no active flowcontrol.apiserver.k8s.io/v1beta3 objects remain. Managed providers such as AKS and GKE block upgrades if v1beta3 FlowSchemas or PriorityLevelConfigurations are detected in stored configuration. You can list the objects with kubectl get flowschemas and kubectl get prioritylevelconfigurations before initiating the upgrade. Because kubectl returns objects in the server's preferred API version, also search manifests, Helm charts, and GitOps repositories for apiVersion: flowcontrol.apiserver.k8s.io/v1beta3.

There is also an open issue (#132233) where APF’s work estimator charges approximately one seat per 100 objects regardless of individual object size. A large LIST response of tens to hundreds of megabytes can exhaust API server memory even when APF correctly gates CPU-bound concurrency, because the estimator does not yet account for response byte size. Operators with large objects should monitor API server memory and prefer paginated LIST calls (limit and continue) or watches over unbounded LIST calls.
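The gap between seat cost and memory cost can be shown with a rough calculation. The one-seat-per-100-objects figure comes from the paragraph above. The cap of 10 seats per request and both function names are assumptions made for this sketch, not confirmed values.

```python
def estimate_list_seats(object_count: int,
                        objects_per_seat: int = 100,
                        max_seats: int = 10) -> int:
    """Approximate the seats charged to a LIST; max_seats is an assumed cap."""
    seats = -(-object_count // objects_per_seat)  # ceiling division
    return max(1, min(seats, max_seats))


def approx_response_bytes(object_count: int, avg_object_bytes: int) -> int:
    """Rough response size; serialization overhead is ignored."""
    return object_count * avg_object_bytes


small = (estimate_list_seats(500), approx_response_bytes(500, 1024))
large = (estimate_list_seats(500), approx_response_bytes(500, 100 * 1024))
print(small, large)
```

Both requests are charged the same number of seats even though the second response is 100 times larger, which is why memory has to be monitored separately from APF metrics.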

Designing custom FlowSchemas

The built-in defaults cover the core control plane, but custom controllers, CI/CD agents, and multi-tenant workloads usually need explicit rules. Follow these principles when adding custom FlowSchemas:

  • Never rely on catch-all for production traffic. Create a custom fallback FlowSchema with a reasonable precedence and assign it to a Queue-based PriorityLevel that has enough ACS to absorb baseline load.
  • Protect critical paths first. Leader election, node status updates, and kube-system controllers should have FlowSchemas with lower matchingPrecedence numbers than general workload traffic, so that they are matched first.
  • Use exempt sparingly. Setting a FlowSchema to exempt removes all flow control. This is appropriate for system:masters, but dangerous for authenticated service accounts or unauthenticated traffic.
  • Leave gaps in precedence. Insert custom FlowSchemas at values the defaults do not use, such as 300, 1000, or 5000, so you do not have to reorder the entire hierarchy later.
  • Distinguish noisy tenants. Use ByNamespace or ByUser distinguisher methods to prevent a single controller or namespace from consuming all the slots in a shared priority level.
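Some of these principles can be checked mechanically before a manifest is applied. The sketch below assumes FlowSchema manifests have already been loaded into Python dictionaries, for example from YAML. The lint_flow_schema function and its messages are hypothetical, and the set of default precedences is taken from the table of built-in FlowSchemas in this article.

```python
from typing import Dict, List

# Precedence values used by the built-in FlowSchemas listed in this article.
DEFAULT_PRECEDENCES = {1, 2, 100, 150, 200, 400, 500, 800, 900, 9000, 9900, 10000}


def lint_flow_schema(manifest: Dict) -> List[str]:
    """Return human-readable findings for one custom FlowSchema manifest."""
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    spec = manifest.get("spec", {})
    level = spec.get("priorityLevelConfiguration", {}).get("name")
    findings = []
    if level == "exempt":
        findings.append(f"{name}: points at exempt, which removes all flow control")
    if level == "catch-all":
        findings.append(f"{name}: points at catch-all, which rejects instead of queuing")
    if spec.get("matchingPrecedence") in DEFAULT_PRECEDENCES:
        findings.append(f"{name}: matchingPrecedence collides with a built-in FlowSchema")
    if "distinguisherMethod" not in spec:
        findings.append(f"{name}: no distinguisherMethod, all requests share one flow")
    return findings


example = {
    "metadata": {"name": "ci-agents"},
    "spec": {
        "matchingPrecedence": 800,
        "priorityLevelConfiguration": {"name": "exempt"},
    },
}
for finding in lint_flow_schema(example):
    print(finding)
```

A check like this fits naturally into CI for the repository that holds the cluster's manifests. Not every finding is an error, since some FlowSchemas legitimately omit a distinguisher.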

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| apiserver_flowcontrol_rejected_requests_total | Rejections mean requests are denied, not delayed. Sustained rejections in system levels indicate starvation. | Non-zero rate for system, leader-election, or node-high priority levels. |
| apiserver_flowcontrol_current_inqueue_requests | Queue depth shows how many requests are waiting per priority level. | Depth > 0 for system or leader-election levels; depth > 100 for any level. |
| apiserver_flowcontrol_current_executing_requests | Actual concurrency consumption versus the effective limit. | Sustained > 80% of the level’s effective concurrency limit. |
| apiserver_flowcontrol_request_wait_duration_seconds | Time spent queued before execution. | p99 > 1 second sustained for critical priority levels. |
| apiserver_flowcontrol_nominal_limit_seats | The theoretical concurrency limit per priority level. | Use to compute utilization ratios against current executing requests. |
| apiserver_request_total{code="429"} | APF rejections appear as HTTP 429 to clients. | Sustained 429s from kubelet, controller-manager, or scheduler service accounts. |

You can also inspect runtime state through the debug endpoint /debug/api_priority_and_fairness/dump_priority_levels, which exposes active and waiting request counts per priority level.
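The warning signs in the table can be turned into a simple evaluation over sampled metric values. The sketch below assumes the per-level values have already been scraped into dictionaries keyed by priority level. The evaluate_levels function is hypothetical, and it checks a single sample, so the "sustained" condition would still need to be applied over a time window by the monitoring system.

```python
from typing import Dict, List

CRITICAL_LEVELS = ("system", "leader-election", "node-high")


def evaluate_levels(executing: Dict[str, float],
                    nominal_limit: Dict[str, float],
                    inqueue: Dict[str, float],
                    rejected_per_second: Dict[str, float]) -> List[str]:
    """Flag priority levels whose current sample crosses a warning threshold."""
    warnings = []
    for level in sorted(nominal_limit):
        limit = nominal_limit[level]
        if limit > 0 and executing.get(level, 0) / limit > 0.8:
            warnings.append(f"{level}: executing requests above 80% of nominal limit")
        depth = inqueue.get(level, 0)
        if depth > 100 or (level in CRITICAL_LEVELS and depth > 0):
            warnings.append(f"{level}: queue depth {depth:g}")
        if level in CRITICAL_LEVELS and rejected_per_second.get(level, 0) > 0:
            warnings.append(f"{level}: requests are being rejected")
    return warnings
```

Feeding it one sample per scrape interval and alerting only when the same warning repeats across several intervals approximates the sustained conditions in the table.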

How Netdata helps

  • Correlates apiserver_flowcontrol_rejected_requests_total with apiserver_request_total{code="429"} to confirm whether client throttling originates from APF or from other layers.
  • Tracks apiserver_flowcontrol_current_inqueue_requests per priority level so you can see which traffic class is backing up before controllers time out.
  • Surfaces API server memory and LIST latency together to catch oversized LIST responses that APF’s work estimator cannot yet limit by byte size.
  • Alerts on sustained queue depth in system or leader-election priority levels, providing early warning before node heartbeats or leader locks fail.