Kubernetes scheduler not scheduling pods: queue depth and failure reasons
Pods stay Pending for many reasons, but the scheduler process being down is rarely one. More often, pods accumulate in internal queues because the cluster is out of capacity, a control plane dependency stalls the binding cycle, or a filter plugin rejects every candidate node. Distinguishing “unschedulable” (no node fits) from “not scheduling” (the scheduler cannot keep up or the binding cycle is failing) prevents wasted node scaling when the real problem is an etcd latency spike or a volume affinity conflict.
This guide focuses on stable scheduler metrics and queueing mechanics as of Kubernetes v1.32+.
What this means
The Kubernetes scheduler maintains three internal queues for unscheduled pods. The activeQ holds pods ready for immediate scheduling attempts. The backoffQ holds pods that failed scheduling and are waiting out an exponential delay, which by default starts at 1 second and is capped at 10 seconds (podInitialBackoffSeconds and podMaxBackoffSeconds). The unschedulableQ holds pods that are parked until a relevant cluster change triggers a re-evaluation. As of v1.32, QueueingHint is enabled by default and drives event-driven requeueing: each plugin subscribes to specific event types and evaluates whether a change could make a parked pod schedulable. If QueueingHints do not fire for your workload, pods remain in unschedulableQ until the periodic flush goroutine moves them, by default once a pod has been parked for about 5 minutes.
A scheduling attempt has two phases: the scheduling cycle selects a node, and the binding cycle persists that decision by sending a Binding request to the API server, which writes it to etcd. If either fails, the pod returns to the queues. A pod that cannot find a feasible node increments scheduler_schedule_attempts_total{result="unschedulable"}. A pod that hits an internal fault, most often during binding (etcd timeout, API server throttle, plugin panic), increments scheduler_schedule_attempts_total{result="error"}. These are different failure classes. Queue depth alone does not tell you which class you are dealing with.
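Queue depth and attempt results both come from the same metrics endpoint, so one small helper can summarize either. The sketch below is illustrative, not part of any Kubernetes tooling: the function name sum_by_label is hypothetical, and it assumes the Prometheus text format on stdin with the queue and result labels described above.

```shell
# Sum a metric's samples grouped by one label value.
# $1: metric name, $2: label name. Reads Prometheus text exposition on stdin.
sum_by_label() {
  awk -v metric="$1" -v label="$2" '
    index($0, metric "{") == 1 {
      key = label "=\""
      start = index($0, key)
      if (start == 0) next
      # Take everything after label=" up to the closing quote.
      rest = substr($0, start + length(key))
      value = substr(rest, 1, index(rest, "\"") - 1)
      sum[value] += $NF
    }
    END { for (v in sum) printf "%s %d\n", v, sum[v] }
  ' | sort
}

# Usage (endpoint taken from the quick checks; adjust host, port, and auth):
#   curl -sk https://localhost:10259/metrics | sum_by_label scheduler_pending_pods queue
#   curl -sk https://localhost:10259/metrics | sum_by_label scheduler_schedule_attempts_total result
```

The attempt counters are cumulative since the scheduler started, so a single reading shows totals rather than the current rate.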
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Cluster capacity exhaustion | FailedScheduling events with “Insufficient cpu” or “Insufficient memory”; scheduler_pending_pods in unschedulableQ growing | kubectl describe nodes Allocated resources section |
| Control plane latency (etcd/API server) | scheduler_schedule_attempts_total{result="error"} increasing; rapid FailedScheduling events with identical timestamps; binding timeouts | etcd_disk_wal_fsync_duration_seconds and API server mutating latency |
| Volume or node affinity conflict | FailedScheduling with “volume node affinity conflict” or “0/N nodes available” for pods with PVCs | StorageClass volumeBindingMode and PV node affinity labels |
| Node pressure or taints | FailedScheduling mentioning taints, or nodes with MemoryPressure, DiskPressure, or PIDPressure | kubectl describe node <name> Conditions and taints |
| Backoff queue flooding | scheduler_pending_pods in backoffQ growing after transient failures such as CSI delays | scheduler_schedule_attempts_total{result="unschedulable"} rate and pod event history |
| Scheduler throughput bottleneck | scheduler_pending_pods in activeQ growing; scheduling latency p99 trending up | Scheduler CPU and the rate of pod creation versus scheduled rate |
Quick checks
# Count pending pods and scheduler queue depths
kubectl get pods -A --field-selector=status.phase=Pending --no-headers | wc -l
# Requires access to the scheduler metrics endpoint. Adjust host, port, and TLS for your environment.
curl -sk https://localhost:10259/metrics | grep scheduler_pending_pods
# Check scheduler health and leader election
kubectl get pods -n kube-system -l component=kube-scheduler
kubectl get lease -n kube-system kube-scheduler -o yaml
# Read the scheduling failure reason for a specific pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 30 "Events:"
# Check scheduling attempt results
curl -sk https://localhost:10259/metrics | grep scheduler_schedule_attempts_total
# Check which filter plugins are rejecting pods
curl -sk https://localhost:10259/metrics | grep scheduler_unschedulable_pods
# Check node pressure conditions and taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: [.status.conditions[] | select(.type | test("Pressure|Ready")) | {type: .type, status: .status}], taints: .spec.taints}'
# Check etcd fsync latency on the control plane. Adjust scheme, host, port, and TLS for your cluster.
# Port 2379 is the etcd client port and usually requires TLS client certificates; kubeadm clusters
# typically expose metrics without TLS on http://127.0.0.1:2381/metrics instead.
curl -s http://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
# Check API server mutating latency and APF queue depth
kubectl get --raw /metrics | grep 'apiserver_request_duration_seconds_bucket' | grep -E 'verb="POST"|verb="PUT"|verb="PATCH"|verb="DELETE"'
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests
How to diagnose it
- Confirm the scheduler is running and holds the leader lease. In HA clusters, only the leader schedules. Check the kube-scheduler Lease in the kube-system namespace.
- Quantify the backlog. Use scheduler_pending_pods broken down by queue. If unschedulableQ is growing, the scheduler cannot find a feasible node. If activeQ is growing, the scheduler is not processing attempts fast enough.
- Read pod events for aggregated failures. The scheduler reports all filter failures in the event message (for example, “8 Insufficient cpu, 1 Insufficient memory”). A bare “0 nodes available” with no breakdown may indicate a PreEnqueue plugin rejection, which does not apply the Unschedulable condition to the pod object.
- Check attempt results. scheduler_schedule_attempts_total{result="error"} indicates control plane faults. If this is rising while result="unschedulable" is flat, the issue is binding-cycle latency, not capacity.
- Inspect plugin rejections. scheduler_unschedulable_pods increments for every plugin that rejects a pod, so summing all plugins over-counts. Query it by plugin name to identify the bottleneck (for example, NodeResourcesFit, VolumeBinding, InterPodAffinity).
- Check for backoff flooding. A high backoffQ with a high retry rate means pods are failing for transient reasons and re-entering with exponential delays.
- Verify QueueingHint behavior. scheduler_pod_scheduled_after_flush_total spiking means pods are leaving unschedulableQ because of the periodic flush rather than an event-driven requeue. This signals that QueueingHints are not matching your workload changes.
- Correlate with control plane latency. Check etcd WAL fsync p99 and API server mutating latency. Slow etcd writes stall the binding cycle and create retry loops that amplify queue depth.
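Deciding whether the error result is rising while unschedulable stays flat needs two readings, because the counters only ever increase. A minimal sketch, assuming two saved scrapes of the scheduler metrics endpoint; the function name attempt_deltas and the file names in the usage comment are hypothetical.

```shell
# Print how much the unschedulable and error counters grew between two scrapes.
# $1: earlier scrape file, $2: later scrape file (Prometheus text format).
attempt_deltas() {
  awk '
    index($0, "scheduler_schedule_attempts_total{") == 1 {
      if (match($0, /result="[^"]*"/)) {
        r = substr($0, RSTART + 8, RLENGTH - 9)
        # Subtract values from the earlier file, add values from the later one.
        if (FILENAME == ARGV[1]) delta[r] -= $NF
        else delta[r] += $NF
      }
    }
    END {
      printf "unschedulable=%d error=%d\n", delta["unschedulable"], delta["error"]
    }
  ' "$1" "$2"
}

# Usage: save two scrapes a minute apart, then compare them.
#   curl -sk https://localhost:10259/metrics > scrape-1.txt
#   sleep 60
#   curl -sk https://localhost:10259/metrics > scrape-2.txt
#   attempt_deltas scrape-1.txt scrape-2.txt
```

A growing error delta next to an unschedulable delta of zero points at the binding cycle rather than capacity. A scheduler restart between scrapes resets the counters and makes the deltas meaningless.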
flowchart TD
A[Pods pending above baseline] --> B{Scheduler has leader?}
B -->|No leader| C[Fix leader election or scheduler health]
B -->|Leader OK| D{Check scheduler_pending_pods queue}
D -->|activeQ growing| E[Scheduler throughput bottleneck<br/>Check CPU and pod creation rate]
D -->|backoffQ growing| F[Transient failures retrying<br/>Check plugin rejections and provisioning]
D -->|unschedulableQ growing| G[Hard constraint failures]
G --> H{Read pod Events}
H -->|Insufficient resources| I[Scale nodes or reduce requests]
H -->|Volume/node affinity| J[Check PVC binding mode and PV labels]
H -->|Taints or pressure| K[Clear node conditions or add tolerations]
F --> L{Check schedule_attempts result}
L -->|error rate up| M[Correlate etcd and API server latency]
L -->|unschedulable only| N[Fix root cause of rejection]

Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| scheduler_pending_pods by queue | Distinguishes capacity (unschedulableQ) from throughput (activeQ) backlog | Any queue sustained > 0 for longer than 5 minutes during normal operations |
| scheduler_schedule_attempts_total{result="unschedulable"} | Count of hard constraint failures | Sustained rate above baseline |
| scheduler_schedule_attempts_total{result="error"} | Count of binding-cycle or internal faults | Any sustained non-zero rate |
| scheduler_scheduling_attempt_duration_seconds | End-to-end latency of a scheduling attempt (scheduling algorithm plus binding) | p99 greater than 1 second in large clusters |
| scheduler_unschedulable_pods by plugin | Identifies which filter is rejecting pods | One plugin dominating rejections |
| scheduler_pod_scheduled_after_flush_total | Pods exiting queue via timeout instead of event | Spikes indicate QueueingHint misses |
| etcd_disk_wal_fsync_duration_seconds | etcd latency directly stalls binding writes | p99 greater than 100 ms |
| apiserver_request_duration_seconds (mutating) | API server slowness delays binding and status updates | p99 greater than 1 second sustained |
| apiserver_flowcontrol_current_inqueue_requests | APF queuing delays scheduler traffic | Queue depth greater than 0 for scheduler’s priority level |
| Node pressure conditions | Pressure taints exclude nodes silently | MemoryPressure, DiskPressure, or PIDPressure True |
Fixes
If the cause is resource exhaustion
Add nodes or reduce resource requests. Check that DaemonSet overhead has not consumed all allocatable capacity. Verify cluster-autoscaler is not capped at max node count. If requests are set much lower than actual usage, the scheduler over-commits and the kubelet evicts later; align requests with measured usage.
If the cause is control plane latency
Investigate etcd disk I/O. Every binding write requires an etcd fsync; if WAL fsync p99 exceeds 100 ms, binding cycles time out and retry. Ensure the scheduler’s API traffic is not being throttled by APF; its flow schema should have sufficient concurrency. Check admission webhook latency, since binding requests may trigger validating webhooks.
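To compare WAL fsync latency against the 100 ms threshold without a Prometheus server, the histogram buckets can be read directly. The sketch below is a rough approximation under stated assumptions: the function name histogram_quantile_le is hypothetical, the histogram has a single series with buckets listed in ascending order, and the result is the smallest bucket upper bound that covers the requested quantile, which is coarser than the interpolation Prometheus performs.

```shell
# Report the smallest bucket upper bound (le) that covers quantile $2.
# $1: histogram base name, $2: quantile between 0 and 1. Reads metrics on stdin.
histogram_quantile_le() {
  awk -v metric="$1" -v q="$2" '
    index($0, metric "_bucket{") == 1 {
      if (match($0, /le="[^"]*"/)) {
        le = substr($0, RSTART + 4, RLENGTH - 5)
        # The +Inf bucket holds the total observation count.
        if (le == "+Inf") { total = $NF + 0; next }
        n++; bound[n] = le; count[n] = $NF + 0
      }
    }
    END {
      if (total == 0) { print "no samples"; exit }
      for (i = 1; i <= n; i++) {
        if (count[i] >= q * total) { print bound[i]; exit }
      }
      print "+Inf"
    }
  '
}

# Usage (adjust the etcd metrics endpoint for your cluster):
#   curl -s http://localhost:2379/metrics | histogram_quantile_le etcd_disk_wal_fsync_duration_seconds 0.99
```

Because the buckets accumulate from process start, this reflects lifetime behavior. A result above 0.1 is worth investigating, but a recent spike can hide behind a long healthy history, so a rate-based query in a monitoring system remains the better signal.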
If the cause is volume or affinity constraints
For PVCs using volumeBindingMode: Immediate, the PV may be provisioned in a zone that does not match the pod’s node selectors. Create a new StorageClass with volumeBindingMode: WaitForFirstConsumer and recreate the PVC; you cannot change volumeBindingMode on an existing StorageClass. Verify that node labels and pod nodeAffinity rules are not mutually exclusive.
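Since the binding mode cannot be edited in place, the replacement StorageClass has to be created under a new name. A minimal sketch of such a manifest, generated from the shell; the function name print_wffc_storageclass is hypothetical, and the StorageClass name and provisioner in the usage comment are placeholders to replace with your own values.

```shell
# Print a StorageClass manifest that delays volume binding until a pod is scheduled.
# $1: new StorageClass name, $2: provisioner used by your existing StorageClass.
print_wffc_storageclass() {
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: $1
provisioner: $2
volumeBindingMode: WaitForFirstConsumer
EOF
}

# Usage (placeholder name and provisioner):
#   print_wffc_storageclass standard-wffc example.com/csi-driver | kubectl apply -f -
```

Any parameters, reclaimPolicy, or allowedTopologies set on the original StorageClass would need to be copied over by hand; this sketch carries only the fields discussed here.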
If the cause is node pressure or taints
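Clear the pressure condition by evicting overcommitted workloads, cleaning disk, or adding nodes. Review taints on newly scaled nodes. Remember that MemoryPressure, DiskPressure, and PIDPressure exclude nodes even if raw resource requests appear to fit: the node lifecycle controller converts each condition into a NoSchedule taint (for example, node.kubernetes.io/memory-pressure), and the TaintToleration filter plugin rejects pods that do not tolerate it. The NodeUnschedulable plugin only filters cordoned nodes.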
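Pressure conditions reach the scheduler as taints with the keys node.kubernetes.io/memory-pressure, node.kubernetes.io/disk-pressure, and node.kubernetes.io/pid-pressure, so listing those taints shows which nodes are currently excluded. The sketch below assumes tab-separated input of the form "node name, tab, space-separated key:effect pairs"; the function name pressure_tainted_nodes and that input layout are choices made for this example.

```shell
# Print "node taint" for every pressure taint found in the input.
# Input lines: <node-name><TAB><key:effect> <key:effect> ...
pressure_tainted_nodes() {
  awk -F '\t' '
    {
      n = split($2, taints, " ")
      for (i = 1; i <= n; i++) {
        if (taints[i] ~ /^node\.kubernetes\.io\/(memory|disk|pid)-pressure:/) print $1, taints[i]
      }
    }
  '
}

# Usage: one way to produce the expected input with kubectl jsonpath.
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{":"}{.effect}{" "}{end}{"\n"}{end}' | pressure_tainted_nodes
```

Empty output means no node carries a pressure taint at that moment, in which case other taints or the filters covered in the sections above are more likely explanations.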
If the cause is backoff queue flooding
Identify the transient failure (CSI volume attachment delay, network provisioning, image pull) and fix it at the source. If the failure is expected and temporary, you may need to raise the scheduler’s maximum backoff, podMaxBackoffSeconds in the KubeSchedulerConfiguration (default 10 seconds), so retries do not hammer the active queue.
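The backoff ceiling is set in the scheduler's configuration file through the podInitialBackoffSeconds and podMaxBackoffSeconds fields. The sketch below prints only those fields as a fragment; the function name print_scheduler_backoff_config is hypothetical, the value 30 in the usage comment is an arbitrary illustration rather than a recommendation, and a working configuration would also need your existing settings such as clientConnection and profiles.

```shell
# Print a KubeSchedulerConfiguration fragment with a custom backoff ceiling.
# $1: maximum backoff in seconds (the default is 10).
print_scheduler_backoff_config() {
  cat <<EOF
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: $1
EOF
}

# Usage: review the fragment, then merge it into the file passed to kube-scheduler --config.
#   print_scheduler_backoff_config 30
```

A higher ceiling reduces retry pressure but also delays recovery once the transient failure clears, so it trades scheduling latency for queue stability.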
Prevention
Monitor scheduler_pending_pods per queue. Alert on scheduler_schedule_attempts_total{result="error"} to catch control plane degradation before queues back up. Ensure API Priority and Fairness flow schemas classify scheduler and leader-election traffic in high-priority levels with dedicated concurrency. Use WaitForFirstConsumer for all dynamic provisioning StorageClasses. Review scheduler plugin rejection trends after every major deployment to spot emerging affinity or resource mismatches.
How Netdata helps
- Correlates scheduler_pending_pods with apiserver_request_duration_seconds and etcd_disk_wal_fsync_duration_seconds to distinguish capacity exhaustion from control plane latency.
- Surfaces node pressure conditions alongside pod scheduling failures to identify taint-related rejections without manual node inspection.
- Tracks APF queue depth and concurrency utilization per priority level to detect when scheduler binding requests are being throttled.
- Alerts on composite patterns, such as pending pods increasing while etcd fsync latency spikes, signaling a binding-cycle cascade before queues flood.
Related guides
- See Kubernetes API server etcd latency: detection and cascading failures
- See Kubernetes API server rate limiting: APF priority levels and starvation
- See Kubernetes API server slow or unresponsive: causes and fixes
- See Kubernetes conntrack exhaustion: dropped connections under load
- See Kubernetes controller-manager leader election failures
- See Kubernetes DNS resolution failures inside pods
- See Kubernetes eviction cascade: when one node failure takes down the cluster
- See Kubernetes kube-proxy iptables sync stall: causes and recovery
- See Kubernetes kube-proxy IPVS: stale rules and session affinity issues
- See Kubernetes kubelet certificate expired: detection, rotation, and recovery
- See Kubernetes kubelet memory leak: detection and OOM cycle
- See Kubernetes kubelet not responding: PLEG, runtime, and certificate issues






