Kubernetes pod readiness probe failures: traffic exclusion and debug

You check a Deployment and see pods in Running phase but not Ready. Traffic to the Service drops. The containers have not restarted. This is a readiness probe failure. Unlike liveness, readiness does not restart the container. It removes the pod from Service endpoints. The container may be starting slowly, temporarily overloaded, or waiting on a dependency that should not be in the probe path. This guide explains how Kubernetes excludes traffic, how to distinguish readiness from liveness failures, and how to debug the root cause.

What this means

A readiness probe failure is a traffic exclusion event, not a container lifecycle event. When the probe fails failureThreshold consecutive times, the kubelet sets the Pod Ready condition to False. The EndpointSlice controller then marks the pod's endpoint as not ready for every matching Service (in the legacy Endpoints API, the IP moves to notReadyAddresses), and kube-proxy stops routing to it. The container continues running and the kubelet keeps probing. Once the probe passes successThreshold consecutive times, the pod becomes Ready again and resumes receiving traffic.

The default probe fields are: initialDelaySeconds: 0, periodSeconds: 10, timeoutSeconds: 1, successThreshold: 1, failureThreshold: 3. With these defaults, a pod is marked NotReady after three consecutive failures 10 seconds apart, which is roughly 20 to 30 seconds after its endpoint starts failing.
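
The arithmetic behind that window can be sketched in a few lines of shell. This is a rough illustration under a simplified model: it assumes probes fire exactly every periodSeconds and that a failing probe takes at most timeoutSeconds. The function name `notready_window` is made up for this example, and real kubelet timing adds jitter.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the seconds between an endpoint starting to
# fail and the kubelet marking the pod NotReady.
# Simplified model (assumption): probes fire exactly every periodSeconds.
notready_window() {
  local period=$1 failures=$2 timeout=$3
  # Best case: the endpoint breaks just before a probe fires, so only
  # (failures - 1) further periods are needed.
  local best=$(( (failures - 1) * period ))
  # Worst case: it breaks just after a successful probe, and the final
  # failing probe runs to its full timeout.
  local worst=$(( failures * period + timeout ))
  echo "${best} ${worst}"
}

# Kubernetes defaults: periodSeconds=10, failureThreshold=3, timeoutSeconds=1
notready_window 10 3 1
```

With the defaults this prints `20 31`: the best-case and worst-case detection delay in seconds. Lowering periodSeconds shortens the window but increases probe load on the kubelet.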

Readiness probes should test whether the pod itself can serve traffic. They should not test downstream dependencies. A probe that checks a database or cache will fail during a dependency outage and remove every pod from traffic, compounding the failure.

Under high node load, kubelet probe workers may fall behind, causing intermittent NotReady flaps that do not align exactly with periodSeconds.

Common causes

Cause	What it looks like	First thing to check
Slow startup vs short `initialDelaySeconds`	Pod `Running` then flaps `NotReady` briefly after scheduling	`kubectl describe pod` events for `Unhealthy` right after start
Downstream dependency checked by probe	All pods become `NotReady` simultaneously during an outage	Whether the probe endpoint queries external databases or caches
Kubelet probe execution delay	Intermittent `NotReady` under high node load	Node `MemoryPressure`, `DiskPressure`, or high kubelet CPU
`timeoutSeconds` too short for load	Probe failures spike during traffic bursts	Probe latency versus configured `timeoutSeconds`
PID pressure blocking exec probes	exec probe failures on dense nodes or fork-heavy workloads	Node `PIDPressure` condition and `pid_max` headroom
Probe endpoint misconfiguration	HTTP 404/500 in application logs but pod stays `Running`	Pod spec `readinessProbe` path and port

Quick checks

# Check if the pod is Running but not Ready
kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# List recent Unhealthy events for the pod
kubectl get events --field-selector involvedObject.name=<pod-name>,reason=Unhealthy

# View the configured readiness probe
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].readinessProbe}'

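# Check if the pod IP is present in the Service endpoints
kubectl get endpoints <service-name>

# Same check against EndpointSlices, which record a ready condition per endpoint
kubectl get endpointslices -l kubernetes.io/service-name=<service-name> -o yaml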

# Check node pressure conditions that delay probe execution
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'

# Read kubelet logs for probe execution timing and failures (requires node access)
journalctl -u kubelet --since "10 minutes ago" | grep -i "probe\|readiness"

# Check kubelet probe metrics directly on the node (requires node access and a token authorized for kubelet metrics)
curl -sk -H "Authorization: Bearer <token>" https://localhost:10250/metrics/probes | grep prober_probe_total

# Check container restart counts to distinguish readiness from liveness
# For multi-container pods this returns counts for all containers
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].restartCount}'
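
To run the first check across a whole namespace instead of one pod at a time, a small filter over `kubectl get pods` output can help. The sketch below is one possible approach, not part of kubectl: the function name `running_not_ready` is hypothetical, the sample pod names are made up, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print pods that are Running but have fewer ready containers than total.
running_not_ready() {
  awk '$3 == "Running" { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Sample input (made up) standing in for live kubectl output.
# Against a cluster:
#   kubectl get pods -n <namespace> --no-headers | running_not_ready
printf '%s\n' \
  'web-7d9c-abcde   1/1   Running            0   5m' \
  'web-7d9c-fghij   0/1   Running            0   5m' \
  'job-x            0/1   CrashLoopBackOff   4   5m' | running_not_ready
```

On the sample input this prints only `web-7d9c-fghij`, the one pod that is Running with an unready container.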

How to diagnose it

flowchart TD
    A[Pod Running but Not Ready] --> B{Container restarted?}
    B -->|Yes| C[Liveness or startup failure]
    B -->|No| D[Readiness failure]
    D --> E{Events show Unhealthy?}
    E -->|Yes| F{Probe checks dependency?}
    E -->|No| G[Kubelet execution delay]
    F -->|Yes| H[Remove external checks]
    F -->|No| I{Fails only at startup?}
    I -->|Yes| J[Add startup probe]
    I -->|No| K[Check node CPU / PID pressure]
    G --> K

Confirm the symptom is readiness, not liveness. Check restartCount. If it is zero, the kubelet has not restarted the container, so readiness is the likely cause. If restarts are increasing, investigate liveness or startup probe failures, OOMKills, or node pressure evictions.
Inspect pod events. kubectl describe pod shows Unhealthy events with the probe type. Look at the timestamps. If failures start immediately after the container starts, the probe is firing before the application is ready.
Verify traffic exclusion. Check the Service endpoints. If the pod IP is absent while the pod is Running, the readiness failure is working as designed. If the IP is still present, check the Service publishNotReadyAddresses setting or CNI state. If endpoint updates lag during an API server slowdown, see the related guide below.
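Test the probe endpoint manually. For an httpGet probe, exec into the target pod and curl the probe path and port on localhost; testing from inside the pod also sidesteps any network policy that blocks pod-to-pod traffic. If it returns 500, the application is reporting itself unready. If it times out, the application is overloaded or deadlocked. If it succeeds on localhost while the probe still fails, check whether the application listens only on 127.0.0.1, because the kubelet probes the pod IP.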
Check node pressure. On dense nodes, kubelet may fall behind on probe execution. Check MemoryPressure, DiskPressure, and PIDPressure conditions. Also check kubelet CPU usage. If the node is throttled, probes are a casualty.
Review probe type and overhead. exec probes fork a process inside the container on every check. At high pod density, this adds measurable CPU and PID consumption. tcpSocket probes connect from the node network namespace to the pod IP. They confirm the port is open but do not validate application health. httpGet probes are the most common and should return 200-399.
Check for downstream dependencies in the probe. If the probe endpoint checks a database connection, a cache, or an external API, remove those checks. Readiness should reflect the pod’s own ability to serve, not the health of the platform.
Verify gRPC probes. If using gRPC probes (enabled by default since Kubernetes 1.24, stable since 1.27), verify the application implements the gRPC Health Checking Protocol and returns SERVING.
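
The manual endpoint test in the steps above can be made less ambiguous by applying the same rule the kubelet uses for httpGet probes: any status from 200 through 399 counts as success. The following is a minimal sketch; `probe_outcome` is a hypothetical name, and the commented kubectl line assumes curl exists in the container image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code to the httpGet probe outcome.
# Kubernetes treats any status >= 200 and < 400 as success.
probe_outcome() {
  local code=$1
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo "success"
  else
    echo "failure"
  fi
}

# Against a cluster (assumes curl is in the image; <port> and <path> are
# placeholders; -m 1 mirrors the default timeoutSeconds of 1):
#   code=$(kubectl exec <pod-name> -- curl -s -o /dev/null -m 1 \
#            -w '%{http_code}' "http://localhost:<port><path>")
#   probe_outcome "$code"
probe_outcome 204
probe_outcome 503
```

curl reports `000` when the request times out or the connection is refused, which this function also classifies as a failure.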

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Pod Ready condition	Binary gate for traffic inclusion	`Ready=False` while `Phase=Running`
`prober_probe_total{probe_type="Readiness",result="failed"}`	Direct count of readiness probe failures	Sustained rate above baseline
Service endpoints	Ground truth for routing	Pod IP missing from ready addresses
Kubelet CPU usage	Probe execution competes for resources	Sustained spike above node baseline
Node PID pressure	Prevents exec probe forks	`PIDPressure=True`
Pod restart count	Distinguishes readiness from liveness	`restartCount=0` with `Ready=False` points to readiness rather than liveness
Kubelet sync loop duration	Stalled sync delays status updates	Elevated duration trending above the node status update interval
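
For a quick look at the probe counter without a full monitoring stack, the kubelet metrics text can be summed on the command line. This sketch assumes the counter carries the label values `probe_type="Readiness"` and `result="failed"`; verify them against your kubelet version. The function name `readiness_failures` and the sample metric lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum failed readiness probe counts from kubelet
# Prometheus text on stdin. Label values are assumptions (see lead-in).
readiness_failures() {
  awk '/^prober_probe_total/ && /probe_type="Readiness"/ && /result="failed"/ { sum += $NF }
       END { print sum + 0 }'
}

# Sample input (made up) standing in for real kubelet metrics output.
printf '%s\n' \
  'prober_probe_total{container="app",pod="web-1",probe_type="Readiness",result="failed"} 3' \
  'prober_probe_total{container="app",pod="web-2",probe_type="Readiness",result="failed"} 4' \
  'prober_probe_total{container="app",pod="web-1",probe_type="Liveness",result="failed"} 9' \
  'prober_probe_total{container="app",pod="web-1",probe_type="Readiness",result="successful"} 100' \
  | readiness_failures
```

On the sample input this prints `7`. Because the metric is a cumulative counter, compare two readings taken some time apart to get a rate; a single value only shows the total since the kubelet started tracking the container.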

Fixes

If the cause is slow startup

Add a startupProbe to protect slow-starting containers. The startup probe disables liveness and readiness checks until the container has started. Then increase initialDelaySeconds or failureThreshold on the readiness probe to match worst-case startup time. Do not rely on a 1-second timeoutSeconds if the application needs several seconds to respond during initialization.
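
A startup probe gives the container failureThreshold × periodSeconds to come up, so the threshold can be derived from the worst-case startup time. The sketch below shows that calculation and prints a matching probe fragment. The function name `startup_failure_threshold`, the 125-second startup time, the `/healthz` path, and port 8080 are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: smallest failureThreshold such that
# failureThreshold * periodSeconds covers the worst-case startup time.
startup_failure_threshold() {
  local worst_case=$1 period=$2
  # Integer ceiling division.
  echo $(( (worst_case + period - 1) / period ))
}

# Example values (assumptions): 125 s worst-case startup, probe every 10 s.
threshold=$(startup_failure_threshold 125 10)

# Print a startupProbe fragment; /healthz and port 8080 are placeholders.
cat <<EOF
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: ${threshold}
EOF
```

With these inputs the fragment gets `failureThreshold: 13`, a 130-second budget. Err on the generous side: a container that exhausts its startup probe budget is restarted, which turns a slow start into a restart loop.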

If the cause is downstream dependency checks

Remove database, cache, and external API checks from the readiness probe endpoint. Move those to a separate metrics or health endpoint used by your monitoring system. Readiness should return success as long as the pod can accept and queue requests. If the dependency is required to serve traffic, use a circuit breaker inside the application instead of failing the readiness probe.

If the cause is kubelet resource pressure

Reduce pod density on the affected node or increase kubelet resource reservations. Switch exec probes to httpGet probes where possible to eliminate fork overhead. If you must use exec probes, ensure the node has sufficient PID headroom. Raising pid_max requires a node sysctl change and a kubelet restart; plan for disruption.

If the cause is probe misconfiguration

Set timeoutSeconds lower than periodSeconds but high enough for the application under load. For HTTP probes, ensure the path returns the correct status code. Avoid using tcpSocket probes when application-level health matters, because an open port does not mean the application is ready to serve. For gRPC services, use the native gRPC probe instead of an HTTP wrapper.

Prevention

Size readiness probes for worst-case startup, not average. Use startup probes for any container that takes more than a few seconds to initialize.
Keep external dependencies out of readiness checks. This prevents cascading traffic exclusion during platform outages.
Monitor prober_probe_total failure rate at the node level. A rising failure rate across many pods indicates node pressure, not application bugs.
Reserve kubelet CPU and memory headroom on dense nodes. Probe execution degrades when kubelet is resource-starved.
Review probe timeout and period during load testing. A timeout that works at low load may fail at high load.

How Netdata helps

Netdata correlates pod readiness transitions with kubelet CPU, memory, and sync loop latency on the same node to surface kubelet-side root causes. It tracks node pressure conditions alongside probe failure events, visualizes EndpointSlice membership changes against pod phase shifts to confirm traffic exclusion timing, and monitors container restart counts to distinguish readiness failures from liveness failures without manual kubectl checks.

Related guides

If readiness failures block a Deployment from progressing, see Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas.
If pods are NotReady because DNS resolution is failing inside the probe, see Kubernetes DNS resolution failures inside pods.
If tcpSocket probes are timing out due to connection tracking issues, see Kubernetes conntrack exhaustion: dropped connections under load.
If kubelet status updates are lagging because the API server is slow, see Kubernetes API server slow or unresponsive: causes and fixes.

The Netdata solution

Kubernetes monitoring with Netdata

Netdata monitors Kubernetes with per-second metrics across the control plane, nodes, and every pod, with ML anomaly detection and zero per-pod configuration. Correlate API-server and etcd latency, kubelet PLEG stalls, scheduling pressure, and OOMKills in one place.
