Kubernetes pod readiness probe failures: traffic exclusion and debug
You check a Deployment and see pods in Running phase but not Ready. Traffic to the Service drops. The containers have not restarted. This is a readiness probe failure. Unlike liveness, readiness does not restart the container. It removes the pod from Service endpoints. The container may be starting slowly, temporarily overloaded, or waiting on a dependency that should not be in the probe path. This guide explains how Kubernetes excludes traffic, how to distinguish readiness from liveness failures, and how to debug the root cause.
What this means
A readiness probe failure is a traffic exclusion event, not a container lifecycle event. When the probe fails failureThreshold consecutive times, the kubelet sets the Pod Ready condition to False. The EndpointSlice controller then marks the pod's endpoint as not ready in every matching Service, and kube-proxy stops routing new connections to it. The container continues running. The kubelet keeps probing. If the probe later passes successThreshold consecutive times, the pod is added back to endpoints.
The default probe fields are: initialDelaySeconds: 0, periodSeconds: 10, timeoutSeconds: 1, successThreshold: 1, failureThreshold: 3. With these defaults, a pod is marked NotReady roughly 20 to 30 seconds after its probe endpoint starts failing.
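That estimate follows from the defaults. As a rough sketch, assuming probes fire exactly every periodSeconds and ignoring kubelet jitter, the window between the endpoint starting to fail and the pod being marked NotReady can be bounded as follows. The function name is illustrative, not part of any Kubernetes tooling.

```shell
# not_ready_window PERIOD FAILURE_THRESHOLD TIMEOUT
# Prints "<best>-<worst>" seconds. This is a simplified model that assumes
# probes run exactly every PERIOD seconds with no kubelet scheduling delay.
not_ready_window() {
  period=$1
  threshold=$2
  timeout=$3
  # Best case: the endpoint breaks just before a probe fires.
  best=$(( period * (threshold - 1) ))
  # Worst case: it breaks just after a successful probe, and the final
  # failing probe runs until it times out.
  worst=$(( period * threshold + timeout ))
  echo "${best}-${worst}"
}

not_ready_window 10 3 1   # Kubernetes defaults; prints 20-31
```

Raising failureThreshold or periodSeconds widens this window, which trades slower traffic exclusion for fewer flaps.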
Readiness probes should test whether the pod itself can serve traffic. They should not test downstream dependencies. A probe that checks a database or cache will fail during a dependency outage and remove every pod from traffic, compounding the failure.
Under high node load, kubelet probe workers may fall behind, causing intermittent NotReady flaps that do not align exactly with periodSeconds.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow startup vs short initialDelaySeconds | Pod Running then flaps NotReady briefly after scheduling | kubectl describe pod events for Unhealthy right after start |
| Downstream dependency checked by probe | All pods become NotReady simultaneously during an outage | Whether the probe endpoint queries external databases or caches |
| Kubelet probe execution delay | Intermittent NotReady under high node load | Node MemoryPressure, DiskPressure, or high kubelet CPU |
| timeoutSeconds too short for load | Probe failures spike during traffic bursts | Probe latency versus configured timeoutSeconds |
| PID pressure blocking exec probes | exec probe failures on dense nodes or fork-heavy workloads | Node PIDPressure condition and pid_max headroom |
| Probe endpoint misconfiguration | HTTP 404/500 in application logs but pod stays Running | Pod spec readinessProbe path and port |
Quick checks
# Check if the pod is Running but not Ready
kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# List recent Unhealthy events for the pod
kubectl get events --field-selector involvedObject.name=<pod-name>,reason=Unhealthy
# View the configured readiness probe
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].readinessProbe}'
# Check if the pod IP is present in the Service endpoints
kubectl get endpoints <service-name>
# Check node pressure conditions that delay probe execution
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
# Read kubelet logs for probe execution timing and failures (requires node access)
journalctl -u kubelet --since "10 minutes ago" | grep -i "probe\|readiness"
# Check kubelet probe metrics directly on the node (requires node access and a bearer token authorized to read kubelet metrics)
curl -sk -H "Authorization: Bearer <token>" https://localhost:10250/metrics | grep prober_probe_total
# Check container restart counts to distinguish readiness from liveness
# For multi-container pods this returns counts for all containers
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].restartCount}'
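EndpointSlices record a ready condition per endpoint, which can be easier to read than scanning for a missing IP. The following is a minimal sketch, assuming the standard kubernetes.io/service-name label on EndpointSlices; the function name and the my-service placeholder are illustrative.

```shell
# endpoint_ready_states SERVICE [NAMESPACE]
# Prints one line per endpoint: first address, then its ready condition.
endpoint_ready_states() {
  service=$1
  namespace=${2:-default}
  kubectl get endpointslices -n "$namespace" \
    -l "kubernetes.io/service-name=${service}" \
    -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\t"}{.conditions.ready}{"\n"}{end}'
}

# Example (requires cluster access); "my-service" is a placeholder:
# endpoint_ready_states my-service
```

A pod that is Running but failing readiness should appear here with its ready condition set to false.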
How to diagnose it
flowchart TD
A[Pod Running but Not Ready] --> B{Container restarted?}
B -->|Yes| C[Liveness or startup failure]
B -->|No| D[Readiness failure]
D --> E{Events show Unhealthy?}
E -->|Yes| F{Probe checks dependency?}
E -->|No| G[Kubelet execution delay]
F -->|Yes| H[Remove external checks]
F -->|No| I{Fails only at startup?}
I -->|Yes| J[Add startup probe]
I -->|No| K[Check node CPU / PID pressure]
G --> K

- Confirm the symptom is readiness, not liveness. Check restartCount. If it is zero, the kubelet has not restarted the container, so readiness is the likely cause. If restarts are increasing, investigate liveness or startup probe failures, OOMKills, or node pressure evictions.
- Inspect pod events. kubectl describe pod shows Unhealthy events with the probe type. Look at the timestamps. If failures start immediately after the container starts, the probe is firing before the application is ready.
- Verify traffic exclusion. Check the Service endpoints. If the pod IP is absent while the pod is Running, the readiness failure is working as designed. If the IP is still present, check the Service publishNotReadyAddresses setting or CNI state. If endpoint updates lag during an API server slowdown, see the related guide below.
- Test the probe endpoint manually. For an httpGet probe, exec into the target pod and curl the path and port from localhost. If it returns 500, the application is reporting itself unready. If it times out, the application is overloaded or deadlocked. If network policies block pod-to-pod traffic, test from inside the same pod.
- Check node pressure. On dense nodes, kubelet may fall behind on probe execution. Check MemoryPressure, DiskPressure, and PIDPressure conditions. Also check kubelet CPU usage. If the node is throttled, probes are a casualty.
- Review probe type and overhead. exec probes fork a process inside the container on every check. At high pod density, this adds measurable CPU and PID consumption. tcpSocket probes connect from the node network namespace to the pod IP. They confirm the port is open but do not validate application health. httpGet probes are the most common and should return 200-399.
- Check for downstream dependencies in the probe. If the probe endpoint checks a database connection, a cache, or an external API, remove those checks. Readiness should reflect the pod’s own ability to serve, not the health of the platform.
- If using gRPC probes (enabled by default since Kubernetes 1.24, stable since 1.27), verify the application implements the gRPC Health Checking Protocol and returns SERVING.
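The first branch of this diagnosis can be sketched as a small shell helper. This is an illustration only: the function name and output labels are made up for this example, and it treats any nonzero restart count as suspect, whereas in practice you would compare the count against an earlier reading to see whether it is still increasing.

```shell
# classify_pod PHASE READY RESTARTS
# PHASE is .status.phase, READY is the Ready condition status,
# RESTARTS is the summed restartCount across containers.
classify_pod() {
  phase=$1
  ready=$2
  restarts=$3
  if [ "$phase" != "Running" ]; then
    echo "not-running"
  elif [ "$ready" = "True" ]; then
    echo "ready"
  elif [ "$restarts" -gt 0 ]; then
    echo "check-liveness-or-startup"
  else
    echo "readiness-failure"
  fi
}

# Example wiring against a live cluster (not run here); <pod-name> is a placeholder:
# phase=$(kubectl get pod <pod-name> -o jsonpath='{.status.phase}')
# ready=$(kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
# restarts=$(kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].restartCount}' \
#   | tr ' ' '\n' | awk '{ s += $1 } END { print s + 0 }')
# classify_pod "$phase" "$ready" "$restarts"
```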
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Pod Ready condition | Binary gate for traffic inclusion | Ready=False while Phase=Running |
| prober_probe_total{probe_type="Readiness",result="failed"} | Direct count of readiness probe failures | Sustained rate above baseline |
| Service endpoints | Ground truth for routing | Pod IP missing from ready addresses |
| Kubelet CPU usage | Probe execution competes for resources | Sustained spike above node baseline |
| Node PID pressure | Prevents exec probe forks | PIDPressure=True |
| Pod restart count | Distinguishes readiness from liveness | restartCount=0 confirms readiness failure |
| Kubelet sync loop duration | Stalled sync delays status updates | Elevated duration trending above the node status update interval |
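To turn the raw kubelet metrics into a single number, a filter along these lines may help. It is a sketch that assumes the kubelet labels readiness probes as probe_type="Readiness" and failures as result="failed"; inspect the raw output on your kubelet version before relying on those label values. The function name is illustrative.

```shell
# count_readiness_failures: reads Prometheus text format on stdin and prints
# the summed prober_probe_total value for failed readiness probes.
# The label values matched below are assumptions; verify them on your nodes.
count_readiness_failures() {
  awk '
    index($0, "prober_probe_total{") == 1 &&
    index($0, "probe_type=\"Readiness\"") > 0 &&
    index($0, "result=\"failed\"") > 0 { sum += $NF }
    END { print sum + 0 }
  '
}

# Example (requires node access and a valid token; <token> is a placeholder):
# curl -sk -H "Authorization: Bearer <token>" https://localhost:10250/metrics \
#   | count_readiness_failures
```

The counter is cumulative, so sample it twice and compare the values to see whether failures are still accumulating.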
Fixes
If the cause is slow startup
Add a startupProbe to protect slow-starting containers. The kubelet holds off liveness and readiness checks until the startup probe succeeds. If a startup probe is not an option, increase initialDelaySeconds or failureThreshold on the readiness probe to match worst-case startup time. Do not rely on a 1-second timeoutSeconds if the application needs several seconds to respond during initialization.
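A minimal sketch of such a configuration is shown below. The pod name, image, port, /healthz path, and all threshold values are placeholders chosen for illustration, not recommendations; size them from your own worst-case startup measurements.

```shell
# Write an illustrative Pod manifest to a temporary file.
# Name, image, port, path, and thresholds are all hypothetical.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
    ports:
    - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30   # allows up to 30 * 10 = 300 s to start
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
EOF

# Validate client-side before applying (requires kubectl):
# kubectl apply --dry-run=client -f "$manifest"
```

The startup budget is failureThreshold multiplied by periodSeconds. If the startup probe exhausts that budget, the kubelet kills the container and applies the pod's restart policy, so set the budget above the slowest start you expect.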
If the cause is downstream dependency checks
Remove database, cache, and external API checks from the readiness probe endpoint. Move those to a separate metrics or health endpoint used by your monitoring system. Readiness should return success as long as the pod can accept and queue requests. If the dependency is required to serve traffic, use a circuit breaker inside the application instead of failing the readiness probe.
If the cause is kubelet resource pressure
Reduce pod density on the affected node or increase kubelet resource reservations. Switch exec probes to httpGet probes where possible to eliminate fork overhead. If you must use exec probes, ensure the node has sufficient PID headroom. Raising pid_max requires a node sysctl change and a kubelet restart; plan for disruption.
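Before changing pid_max, it helps to know how much headroom the node actually has. The following is a rough sketch for Linux nodes, with illustrative function names. It uses the total count of scheduling entities from /proc/loadavg as an approximation of PIDs in use, and it does not account for per-pod or cgroup PID limits.

```shell
# pid_headroom_pct USED MAX: prints the percentage of PID space still free.
pid_headroom_pct() {
  used=$1
  max=$2
  echo $(( (max - used) * 100 / max ))
}

# node_pid_headroom: Linux only, run on the node itself.
# The fourth field of /proc/loadavg is "runnable/total" scheduling entities;
# the total counts processes and threads, which all consume PIDs.
node_pid_headroom() {
  max=$(cat /proc/sys/kernel/pid_max)
  used=$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f2)
  pid_headroom_pct "$used" "$max"
}

# Example (on the node):
# node_pid_headroom
```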
If the cause is probe misconfiguration
Set timeoutSeconds lower than periodSeconds but high enough for the application under load. For HTTP probes, ensure the path returns the correct status code. Avoid using tcpSocket probes when application-level health matters, because an open port does not mean the application is ready to serve. For gRPC services, use the native gRPC probe instead of an HTTP wrapper.
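To check the status code and timeout together, a sketch like the following can reproduce a single httpGet probe by hand. It assumes curl is present in the container image, and the helper names, pod name, port, and path are placeholders. Because the request runs inside the pod rather than from the kubelet, it will not reveal node-to-pod network problems.

```shell
# http_probe_passes CODE: succeeds when the status is in the 200-399 range
# that counts as a passing httpGet probe.
http_probe_passes() {
  code=$1
  [ "$code" -ge 200 ] && [ "$code" -lt 400 ]
}

# probe_once POD PORT PATH TIMEOUT: requests the probe path from inside the pod.
# Prints the HTTP status code, or 000 if the request fails or times out.
probe_once() {
  kubectl exec "$1" -- curl -s -o /dev/null -w '%{http_code}' \
    --max-time "$4" "http://localhost:$2$3"
}

# Example (requires cluster access); pod name, port, and path are placeholders:
# code=$(probe_once my-pod 8080 /healthz 1)
# if http_probe_passes "$code"; then echo pass; else echo "fail ($code)"; fi
```

Passing the configured timeoutSeconds as the last argument shows whether the endpoint answers within the probe's budget under current load.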
Prevention
- Size readiness probes for worst-case startup, not average. Use startup probes for any container that takes more than a few seconds to initialize.
- Keep external dependencies out of readiness checks. This prevents cascading traffic exclusion during platform outages.
- Monitor prober_probe_total failure rate at the node level. A rising failure rate across many pods indicates node pressure, not application bugs.
- Reserve kubelet CPU and memory headroom on dense nodes. Probe execution degrades when kubelet is resource-starved.
- Review probe timeout and period during load testing. A timeout that works at low load may fail at high load.
How Netdata helps
Netdata correlates pod readiness transitions with kubelet CPU, memory, and sync loop latency on the same node to surface kubelet-side root causes. It tracks node pressure conditions alongside probe failure events, visualizes EndpointSlice membership changes against pod phase shifts to confirm traffic exclusion timing, and monitors container restart counts to distinguish readiness failures from liveness failures without manual kubectl checks.
Related guides
- If readiness failures block a Deployment from progressing, see Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas.
- If pods are NotReady because DNS resolution is failing inside the probe, see Kubernetes DNS resolution failures inside pods.
- If tcpSocket probes are timing out due to connection tracking issues, see Kubernetes conntrack exhaustion: dropped connections under load.
- If kubelet status updates are lagging because the API server is slow, see Kubernetes API server slow or unresponsive: causes and fixes.






