Kubernetes scheduler not scheduling pods: queue depth and failure reasons

Pods stay Pending for many reasons, but the scheduler process being down is rarely one. More often, pods accumulate in internal queues because the cluster is out of capacity, a control plane dependency stalls the binding cycle, or a filter plugin rejects every candidate node. Distinguishing “unschedulable” (no node fits) from “not scheduling” (the scheduler cannot keep up or the binding cycle is failing) prevents wasted node scaling when the real problem is an etcd latency spike or a volume affinity conflict.

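This guide covers scheduler metrics and queueing mechanics in Kubernetes v1.32 and later. Some of the metrics referenced are alpha-stability, so confirm their names and labels against your release.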

What this means

The Kubernetes scheduler maintains three internal queues for unscheduled pods. The activeQ holds pods ready for immediate scheduling attempts. The backoffQ holds pods that failed scheduling and are waiting out an exponential delay, which by default starts at 1 second and is capped at 10 seconds. The unschedulableQ holds pods that are parked until a relevant cluster change triggers a re-evaluation. As of v1.32, QueueingHint drives event-driven requeueing: each plugin subscribes to specific event types and evaluates whether a change could make a parked pod schedulable. If no QueueingHint fires for your workload, pods remain in unschedulableQ until the periodic flush goroutine moves them, by default after they have been parked for 5 minutes.
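
The backoff arithmetic can be sketched in shell. This is an approximation that assumes the default settings of a 1-second initial backoff and a 10-second cap; backoff_delay is a hypothetical helper written for illustration, not part of kubectl or the scheduler.

```shell
# Approximate backoff delay (in seconds) before retrying a pod that has
# failed the given number of scheduling attempts.
# Usage: backoff_delay <attempts> [initial_seconds] [max_seconds]
# Defaults are assumptions: initial 1 second, cap 10 seconds.
backoff_delay() (
  attempts=$1
  initial=${2:-1}
  max=${3:-10}
  delay=$initial
  i=1
  # Double the delay once per attempt after the first, stopping at the cap.
  while [ "$i" -lt "$attempts" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge "$max" ]; then
      break
    fi
    i=$((i + 1))
  done
  # Never report more than the cap, even if initial itself exceeds it.
  if [ "$delay" -gt "$max" ]; then
    delay=$max
  fi
  echo "$delay"
)
```

Under these assumptions, attempts 1 through 5 yield 1, 2, 4, 8, and 10 seconds, so a pod that keeps failing is retried roughly every 10 seconds once it reaches the cap.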

A scheduling attempt has two phases: the scheduling cycle selects a node, and the binding cycle persists that decision to etcd. If either fails, the pod returns to the queues. A pod that cannot find a feasible node increments scheduler_schedule_attempts_total{result="unschedulable"}. A pod that encounters an internal fault during binding (etcd timeout, API server throttle, plugin panic) increments scheduler_schedule_attempts_total{result="error"}. These are different failure classes. Queue depth alone does not tell you which class you are dealing with.
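
To separate the two failure classes, it can help to total the attempt counter by its result label. The following is a minimal sketch that assumes the scheduler's metrics text is piped in on stdin; attempts_by_result is a hypothetical helper name.

```shell
# Sum scheduler_schedule_attempts_total across profiles, grouped by result.
# Reads Prometheus text-format metrics on stdin.
attempts_by_result() {
  awk '
    # Only sample lines for this metric (skips # HELP and # TYPE lines).
    index($0, "scheduler_schedule_attempts_total{") == 1 {
      if (match($0, /result="[^"]*"/)) {
        # Strip the leading result=" (8 chars) and the trailing quote.
        result = substr($0, RSTART + 8, RLENGTH - 9)
        total[result] += $NF
      }
    }
    END { for (r in total) printf "%s %d\n", r, total[r] }
  ' | sort
}

# Example (endpoint, TLS, and authentication are environment-specific):
# curl -sk https://localhost:10259/metrics | attempts_by_result
```

These counters are cumulative since the scheduler started, so compare two snapshots taken a minute or so apart to see which result is actually rising.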

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Cluster capacity exhaustion | FailedScheduling events with “Insufficient cpu” or “Insufficient memory”; scheduler_pending_pods in unschedulableQ growing | kubectl describe nodes, Allocated resources section |
| Control plane latency (etcd/API server) | scheduler_schedule_attempts_total{result="error"} increasing; rapid FailedScheduling events with identical timestamps; binding timeouts | etcd_disk_wal_fsync_duration_seconds and API server mutating latency |
| Volume or node affinity conflict | FailedScheduling with “volume node affinity conflict” or “0/N nodes available” for pods with PVCs | StorageClass volumeBindingMode and PV node affinity labels |
| Node pressure or taints | FailedScheduling mentioning taints, or nodes with MemoryPressure, DiskPressure, or PIDPressure | kubectl describe node <name>, Conditions and taints |
| Backoff queue flooding | scheduler_pending_pods in backoffQ growing after transient failures such as CSI delays | scheduler_schedule_attempts_total{result="unschedulable"} rate and pod event history |
| Scheduler throughput bottleneck | scheduler_pending_pods in activeQ growing; scheduling latency p99 trending up | Scheduler CPU and the rate of pod creation versus scheduled rate |

Quick checks

# Count pending pods and scheduler queue depths
kubectl get pods -A --field-selector=status.phase=Pending --no-headers | wc -l
# Requires access to the scheduler metrics endpoint. Adjust host, port, TLS, and authentication for your environment.
curl -sk https://localhost:10259/metrics | grep scheduler_pending_pods

# Check scheduler health and leader election
kubectl get pods -n kube-system -l component=kube-scheduler
kubectl get lease -n kube-system kube-scheduler -o yaml

# Read the scheduling failure reason for a specific pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 30 "Events:"

# Check scheduling attempt results
curl -sk https://localhost:10259/metrics | grep scheduler_schedule_attempts_total

# Check which filter plugins are rejecting pods
curl -sk https://localhost:10259/metrics | grep scheduler_unschedulable_pods

# Check node pressure conditions and taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: [.status.conditions[] | select(.type | test("Pressure|Ready")) | {type: .type, status: .status}], taints: .spec.taints}'

# Check etcd fsync latency on the control plane. Adjust scheme, host, port, and TLS for your cluster.
# The URL below assumes an unauthenticated HTTP listener; many clusters require client
# certificates on 2379 or expose metrics on a separate --listen-metrics-urls address.
curl -s http://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds

# Check API server mutating latency and APF queue depth
kubectl get --raw /metrics | grep 'apiserver_request_duration_seconds_bucket' | grep -E 'verb="POST"|verb="PUT"|verb="PATCH"|verb="DELETE"'
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests

How to diagnose it

  1. Confirm the scheduler is running and holds the leader lease. In HA clusters, only the leader schedules. Check the kube-scheduler Lease in the kube-system namespace.
  2. Quantify the backlog. Use scheduler_pending_pods broken down by its queue label, whose values are active, backoff, unschedulable, and gated. If unschedulableQ is growing, the scheduler cannot find a feasible node. If activeQ is growing, the scheduler is not processing attempts fast enough.
  3. Read pod events for aggregated failures. The scheduler reports all filter failures in the event message (for example, “8 Insufficient cpu, 1 Insufficient memory”). A bare “0 nodes available” with no breakdown may indicate a PreEnqueue plugin rejection, which does not apply the Unschedulable condition to the pod object.
  4. Check attempt results. scheduler_schedule_attempts_total{result="error"} indicates control plane faults. If this is rising while result="unschedulable" is flat, the issue is binding-cycle latency, not capacity.
  5. Inspect plugin rejections. scheduler_unschedulable_pods increments for every plugin that rejects a pod, so summing all plugins over-counts. Query it by plugin name to identify the bottleneck (for example, NodeResourcesFit, VolumeBinding, InterPodAffinity).
  6. Check for backoff flooding. A high backoffQ with a high retry rate means pods are failing for transient reasons and re-entering with exponential delays.
  7. Verify QueueingHint behavior. scheduler_pod_scheduled_after_flush_total spiking means pods are leaving unschedulableQ because of the periodic flush rather than an event-driven requeue. This signals that QueueingHints are not matching your workload changes.
  8. Correlate with control plane latency. Check etcd WAL fsync p99 and API server mutating latency. Slow etcd writes stall the binding cycle and create retry loops that amplify queue depth.
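
Step 5 can be sketched as a small filter that ranks plugins by the number of pods they are currently rejecting. It assumes the scheduler's metrics text arrives on stdin; top_rejecting_plugins is a hypothetical helper name.

```shell
# Rank filter plugins by the scheduler_unschedulable_pods gauge,
# summed across profiles. Reads Prometheus text-format metrics on stdin.
top_rejecting_plugins() {
  awk '
    index($0, "scheduler_unschedulable_pods{") == 1 {
      if (match($0, /plugin="[^"]*"/)) {
        # Strip the leading plugin=" (8 chars) and the trailing quote.
        plugin = substr($0, RSTART + 8, RLENGTH - 9)
        pods[plugin] += $NF
      }
    }
    END { for (p in pods) printf "%s %d\n", p, pods[p] }
  ' | sort -k2,2nr -k1,1
}

# Example (endpoint, TLS, and authentication are environment-specific):
# curl -sk https://localhost:10259/metrics | top_rejecting_plugins
```

Because one pod can be counted under several plugins, read the output as a ranking rather than adding the rows together.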

flowchart TD
    A[Pods pending above baseline] --> B{Scheduler has leader?}
    B -->|No leader| C[Fix leader election or scheduler health]
    B -->|Leader OK| D{Check scheduler_pending_pods queue}
    D -->|activeQ growing| E[Scheduler throughput bottleneck<br/>Check CPU and pod creation rate]
    D -->|backoffQ growing| F[Transient failures retrying<br/>Check plugin rejections and provisioning]
    D -->|unschedulableQ growing| G[Hard constraint failures]
    G --> H{Read pod Events}
    H -->|Insufficient resources| I[Scale nodes or reduce requests]
    H -->|Volume/node affinity| J[Check PVC binding mode and PV labels]
    H -->|Taints or pressure| K[Clear node conditions or add tolerations]
    F --> L{Check schedule_attempts result}
    L -->|error rate up| M[Correlate etcd and API server latency]
    L -->|unschedulable only| N[Fix root cause of rejection]
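
The queue branch of this decision tree can be approximated from a single metrics snapshot. The sketch below assumes the queue label values active, backoff, and unschedulable, and it uses the deepest queue as a stand-in for "growing", which a single snapshot cannot actually show; triage_queues is a hypothetical helper name.

```shell
# Point at the decision-tree branch that matches the deepest scheduler queue.
# Reads Prometheus text-format metrics on stdin.
triage_queues() {
  awk '
    index($0, "scheduler_pending_pods{") == 1 {
      if (match($0, /queue="[^"]*"/)) {
        # Strip the leading queue=" (7 chars) and the trailing quote.
        q = substr($0, RSTART + 7, RLENGTH - 8)
        depth[q] += $NF
      }
    }
    END {
      a = depth["active"] + 0
      b = depth["backoff"] + 0
      u = depth["unschedulable"] + 0
      if (a == 0 && b == 0 && u == 0) { print "no backlog"; exit }
      if (u >= a && u >= b)
        printf "unschedulable (%d): hard constraint failures, read pod events\n", u
      else if (b >= a)
        printf "backoff (%d): transient failures retrying, check schedule_attempts result\n", b
      else
        printf "active (%d): throughput bottleneck, check scheduler CPU and pod creation rate\n", a
    }
  '
}

# Example (endpoint, TLS, and authentication are environment-specific):
# curl -sk https://localhost:10259/metrics | triage_queues
```

Running it twice a few minutes apart gives a rough view of which queue is trending upward.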

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| scheduler_pending_pods by queue | Distinguishes capacity (unschedulableQ) from throughput (activeQ) backlog | Any queue sustained > 0 for longer than 5 minutes during normal operations |
| scheduler_schedule_attempts_total{result="unschedulable"} | Count of hard constraint failures | Sustained rate above baseline |
| scheduler_schedule_attempts_total{result="error"} | Count of binding-cycle or internal faults | Any sustained non-zero rate |
| scheduler_scheduling_attempt_duration_seconds | Latency of a full scheduling attempt (scheduling algorithm plus binding) | p99 greater than 1 second in large clusters |
| scheduler_unschedulable_pods by plugin | Identifies which filter is rejecting pods | One plugin dominating rejections |
| scheduler_pod_scheduled_after_flush_total | Pods leaving unschedulableQ via the periodic flush instead of an event | Spikes indicate QueueingHint misses |
| etcd_disk_wal_fsync_duration_seconds | etcd latency directly stalls binding writes | p99 greater than 100 ms |
| apiserver_request_duration_seconds (mutating) | API server slowness delays binding and status updates | p99 greater than 1 second sustained |
| apiserver_flowcontrol_current_inqueue_requests | APF queuing delays scheduler traffic | Queue depth greater than 0 for scheduler’s priority level |
| Node pressure conditions | Pressure taints exclude nodes silently | MemoryPressure, DiskPressure, or PIDPressure True |

Fixes

If the cause is resource exhaustion

Add nodes or reduce resource requests. Check that DaemonSet overhead has not consumed all allocatable capacity. Verify cluster-autoscaler is not capped at max node count. If requests are set much lower than actual usage, the scheduler over-commits and the kubelet evicts later; align requests with measured usage.

If the cause is control plane latency

Investigate etcd disk I/O. Every binding write requires an etcd fsync; if WAL fsync p99 exceeds 100 ms, binding cycles time out and retry. Ensure the scheduler’s API traffic is not being throttled by APF; its flow schema should have sufficient concurrency. Check admission webhook latency, since binding requests may trigger validating webhooks.
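
Without a Prometheus server, a rough bound on WAL fsync p99 can be read from the raw histogram buckets. This sketch assumes the metrics text of a single etcd member on stdin, with buckets listed in ascending order as the exposition format normally emits them; fsync_p99_upper_bound is a hypothetical helper name.

```shell
# Print the smallest histogram bucket bound (in seconds) that covers at
# least 99% of observed WAL fsyncs, i.e. an upper bound on p99.
# Reads Prometheus text-format metrics on stdin.
fsync_p99_upper_bound() {
  awk '
    index($0, "etcd_disk_wal_fsync_duration_seconds_bucket{") == 1 {
      if (match($0, /le="[^"]*"/)) {
        n++
        # Strip the leading le=" (4 chars) and the trailing quote.
        bound[n] = substr($0, RSTART + 4, RLENGTH - 5)
        count[n] = $NF + 0
        # The +Inf bucket holds the total number of observations.
        if (bound[n] == "+Inf") total = count[n]
      }
    }
    END {
      if (total == 0) { print "no samples"; exit }
      for (i = 1; i <= n; i++) {
        if (count[i] >= 0.99 * total) { print bound[i]; exit }
      }
    }
  '
}

# Example (scheme, host, port, and TLS are environment-specific):
# curl -s http://localhost:2379/metrics | fsync_p99_upper_bound
```

The result covers every fsync since the etcd process started, so a recent latency spike can be hidden by a long healthy history. A bound comfortably below 0.1 is reassuring; a bound above it calls for a windowed quantile from your monitoring system.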

If the cause is volume or affinity constraints

For PVCs using volumeBindingMode: Immediate, the PV may be provisioned in a zone that does not match the pod’s node selectors. Create a new StorageClass with volumeBindingMode: WaitForFirstConsumer and recreate the PVC; you cannot change volumeBindingMode on an existing StorageClass. Verify that node labels and pod nodeAffinity rules are not mutually exclusive.
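
To find StorageClasses that still bind immediately, a short filter over kubectl's column output is enough. This is a sketch; immediate_storage_classes is a hypothetical helper name and the column names in the example are arbitrary.

```shell
# Print the names of StorageClasses whose volumeBindingMode is Immediate.
# Expects two whitespace-separated columns on stdin: name and binding mode.
immediate_storage_classes() {
  awk '$2 == "Immediate" { print $1 }'
}

# Example:
# kubectl get storageclass --no-headers \
#   -o custom-columns=NAME:.metadata.name,MODE:.volumeBindingMode \
#   | immediate_storage_classes
```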

If the cause is node pressure or taints

Clear the pressure condition by evicting overcommitted workloads, cleaning disk, or adding nodes. Review taints on newly scaled nodes. Remember that MemoryPressure, DiskPressure, and PIDPressure are translated into NoSchedule taints (for example, node.kubernetes.io/memory-pressure), so the TaintToleration filter plugin excludes those nodes even if raw resource requests appear to fit.
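
Nodes currently carrying a pressure taint can be listed with a filter over kubectl's column output. The sketch assumes the standard taint keys node.kubernetes.io/memory-pressure, node.kubernetes.io/disk-pressure, and node.kubernetes.io/pid-pressure; pressure_tainted_nodes is a hypothetical helper name.

```shell
# Print nodes that carry a memory, disk, or PID pressure taint.
# Expects two whitespace-separated columns on stdin:
# node name and a comma-separated list of taint keys.
pressure_tainted_nodes() {
  awk '$2 ~ /node\.kubernetes\.io\/(memory|disk|pid)-pressure/ { print $1, $2 }'
}

# Example:
# kubectl get nodes --no-headers \
#   -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key' \
#   | pressure_tainted_nodes
```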

If the cause is backoff queue flooding

Identify the transient failure (CSI volume attachment delay, network provisioning) and fix it at the source. If the failure is expected and temporary, consider raising the scheduler’s maximum backoff (podMaxBackoffSeconds in KubeSchedulerConfiguration, 10 seconds by default) so retries do not hammer the active queue.
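
As a sketch, the backoff settings can be rendered as a KubeSchedulerConfiguration fragment. The function name render_scheduler_config and the example value of 30 seconds are illustrative assumptions; a real configuration file passed to the scheduler with --config usually carries further settings, such as clientConnection, that you would merge these fields into.

```shell
# Emit a KubeSchedulerConfiguration fragment with a custom backoff ceiling.
# Usage: render_scheduler_config [max_backoff_seconds]  (defaults to 10)
render_scheduler_config() {
  cat <<EOF
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: ${1:-10}
EOF
}

# Example: print a fragment with a 30-second ceiling for review.
# render_scheduler_config 30
```

A higher ceiling lowers retry pressure but also delays pods whose blocker has already cleared, so change it in small steps.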

Prevention

Monitor scheduler_pending_pods per queue. Alert on scheduler_schedule_attempts_total{result="error"} to catch control plane degradation before queues back up. Ensure API Priority and Fairness flow schemas classify scheduler and leader-election traffic in high-priority levels with dedicated concurrency. Use WaitForFirstConsumer for all dynamic provisioning StorageClasses. Review scheduler plugin rejection trends after every major deployment to spot emerging affinity or resource mismatches.

How Netdata helps

  • Correlates scheduler_pending_pods with apiserver_request_duration_seconds and etcd_disk_wal_fsync_duration_seconds to distinguish capacity exhaustion from control plane latency.
  • Surfaces node pressure conditions alongside pod scheduling failures to identify taint-related rejections without manual node inspection.
  • Tracks APF queue depth and concurrency utilization per priority level to detect when scheduler binding requests are being throttled.
  • Alerts on composite patterns, such as pending pods increasing while etcd fsync latency spikes, signaling a binding-cycle cascade before queues flood.