Kubernetes pod readiness probe failures: traffic exclusion and debugging

You check a Deployment and see pods in Running phase but not Ready. Traffic to the Service drops. The containers have not restarted. This is a readiness probe failure. Unlike liveness, readiness does not restart the container. It removes the pod from Service endpoints. The container may be starting slowly, temporarily overloaded, or waiting on a dependency that should not be in the probe path. This guide explains how Kubernetes excludes traffic, how to distinguish readiness from liveness failures, and how to debug the root cause.

What this means

A readiness probe failure is a traffic exclusion event, not a container lifecycle event. When the probe fails failureThreshold consecutive times, the kubelet sets the Pod Ready condition to False. The EndpointSlice controller then marks the pod's endpoint as not ready in the EndpointSlices of every matching Service, and kube-proxy stops routing to it. The container continues running and the kubelet keeps probing. Once the probe passes successThreshold consecutive times, the pod is marked ready again and returns to the Service's ready endpoints.

The default probe fields are: initialDelaySeconds: 0, periodSeconds: 10, timeoutSeconds: 1, successThreshold: 1, failureThreshold: 3. These defaults mean a pod can be marked NotReady within roughly 30 seconds of a failing endpoint.
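
As a rough worked example of the timing above, the sketch below bounds how long a persistently failing endpoint takes to flip a pod to NotReady. The function name not_ready_window is hypothetical, and the model is a simplifying assumption: it ignores initialDelaySeconds, kubelet scheduling jitter, and the time the endpoint change takes to propagate.

```shell
# Rough bounds, in seconds, on the time from an endpoint starting to fail
# until the kubelet marks the pod NotReady. Hypothetical helper; simplified model.
# Usage: not_ready_window <periodSeconds> <timeoutSeconds> <failureThreshold>
not_ready_window() {
  period=$1
  timeout=$2
  threshold=$3
  # Best case: a probe fires right as the failure begins and each probe fails fast.
  best=$(( (threshold - 1) * period ))
  # Worst case: the failure begins just after a successful probe, and the last
  # failing probe runs to its full timeout.
  worst=$(( threshold * period + timeout ))
  echo "$best $worst"
}

not_ready_window 10 1 3   # defaults: prints "20 31"
```

With the defaults, the window is therefore about 20 to 31 seconds under this model, which is where the "roughly 30 seconds" figure comes from.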

Readiness probes should test whether the pod itself can serve traffic. They should not test downstream dependencies. A probe that checks a database or cache will fail during a dependency outage and remove every pod from traffic, compounding the failure.

Under high node load, kubelet probe workers may fall behind, causing intermittent NotReady flaps that do not align exactly with periodSeconds.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow startup vs short initialDelaySeconds | Pod Running then flaps NotReady briefly after scheduling | kubectl describe pod events for Unhealthy right after start |
| Downstream dependency checked by probe | All pods become NotReady simultaneously during an outage | Whether the probe endpoint queries external databases or caches |
| Kubelet probe execution delay | Intermittent NotReady under high node load | Node MemoryPressure, DiskPressure, or high kubelet CPU |
| timeoutSeconds too short for load | Probe failures spike during traffic bursts | Probe latency versus configured timeoutSeconds |
| PID pressure blocking exec probes | exec probe failures on dense nodes or fork-heavy workloads | Node PIDPressure condition and pid_max headroom |
| Probe endpoint misconfiguration | HTTP 404/500 in application logs but pod stays Running | Pod spec readinessProbe path and port |

Quick checks

# Check if the pod is Running but not Ready
kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# List recent Unhealthy events for the pod
kubectl get events --field-selector involvedObject.name=<pod-name>,reason=Unhealthy

# View the configured readiness probe
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].readinessProbe}'

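# Check if the pod IP is present in the Service endpoints
kubectl get endpoints <service-name>
# The same check against EndpointSlices, which the EndpointSlice controller maintains
kubectl get endpointslices -l kubernetes.io/service-name=<service-name>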

# Check node pressure conditions that delay probe execution
# Prints every node condition (MemoryPressure, DiskPressure, PIDPressure, Ready) with its status
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Read kubelet logs for probe execution timing and failures (requires node access)
journalctl -u kubelet --since "10 minutes ago" | grep -i "probe\|readiness"

# Check kubelet probe metrics directly on the node (requires node access and kubelet auth)
# <token> is a bearer token authorized to read kubelet metrics
curl -sk -H "Authorization: Bearer <token>" https://localhost:10250/metrics | grep prober_probe_total

# Check container restart counts to distinguish readiness from liveness
# For multi-container pods this returns counts for all containers
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].restartCount}'

How to diagnose it

flowchart TD
    A[Pod Running but Not Ready] --> B{Container restarted?}
    B -->|Yes| C[Liveness or startup failure]
    B -->|No| D[Readiness failure]
    D --> E{Events show Unhealthy?}
    E -->|Yes| F{Probe checks dependency?}
    E -->|No| G[Kubelet execution delay]
    F -->|Yes| H[Remove external checks]
    F -->|No| I{Fails only at startup?}
    I -->|Yes| J[Add startup probe]
    I -->|No| K[Check node CPU / PID pressure]
    G --> K
  1. Confirm the symptom is readiness, not liveness. Check restartCount. If it is zero, the kubelet has not restarted the container, so readiness is the likely cause. If restarts are increasing, investigate liveness or startup probe failures, OOMKills, or node pressure evictions.
  2. Inspect pod events. kubectl describe pod shows Unhealthy events with the probe type. Look at the timestamps. If failures start immediately after the container starts, the probe is firing before the application is ready.
  3. Verify traffic exclusion. Check the Service endpoints. If the pod IP is absent while the pod is Running, the readiness failure is working as designed. If the IP is still present, check the Service publishNotReadyAddresses setting or CNI state. If endpoint updates lag during an API server slowdown, see the related guide below.
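  4. Test the probe endpoint manually. For an httpGet probe, exec into the target pod and curl the path and port from localhost. If it returns a status outside 200-399 (for example 500), the application is reporting itself unready. If it times out, the application is overloaded or deadlocked. Testing from inside the pod also sidesteps network policies that block pod-to-pod traffic. Note that the kubelet probes the pod IP rather than localhost, so also confirm the application listens on the pod IP and not only on 127.0.0.1.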
  5. Check node pressure. On dense nodes, kubelet may fall behind on probe execution. Check MemoryPressure, DiskPressure, and PIDPressure conditions. Also check kubelet CPU usage. If the node is throttled, probes are a casualty.
  6. Review probe type and overhead. exec probes fork a process inside the container on every check. At high pod density, this adds measurable CPU and PID consumption. tcpSocket probes connect from the node network namespace to the pod IP. They confirm the port is open but do not validate application health. httpGet probes are the most common and should return 200-399.
  7. Check for downstream dependencies in the probe. If the probe endpoint checks a database connection, a cache, or an external API, remove those checks. Readiness should reflect the pod’s own ability to serve, not the health of the platform.
  8. If using gRPC probes (enabled by default since Kubernetes 1.24, stable since 1.27), verify the application implements the gRPC Health Checking Protocol and returns SERVING.
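
Step 4 can be sketched as a small shell helper. The function name probe_verdict and the <container>, <port>, and <path> placeholders are illustrative, and the commented command assumes the container image ships curl, which many minimal images do not.

```shell
# Classify an HTTP status code the way an httpGet probe result is evaluated:
# 200-399 is success, anything else (including curl's 000 on timeout) is failure.
probe_verdict() {
  code=$1
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo success
  else
    echo failure
  fi
}

# Example against a live pod (not run here; placeholders must be filled in):
#   code=$(kubectl exec <pod-name> -c <container> -- \
#     curl -s -o /dev/null -w '%{http_code}' --max-time 1 http://localhost:<port><path>)
#   probe_verdict "$code"

probe_verdict 204   # prints "success"
probe_verdict 503   # prints "failure"
```

Setting --max-time to the probe's timeoutSeconds keeps the manual test comparable to what the kubelet sees under the same load.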

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Pod Ready condition | Binary gate for traffic inclusion | Ready=False while Phase=Running |
| prober_probe_total{probe_type="Readiness",result="failed"} | Direct count of readiness probe failures | Sustained rate above baseline |
| Service endpoints | Ground truth for routing | Pod IP missing from ready addresses |
| Kubelet CPU usage | Probe execution competes for resources | Sustained spike above node baseline |
| Node PID pressure | Prevents exec probe forks | PIDPressure=True |
| Pod restart count | Distinguishes readiness from liveness | Ready=False with restartCount not increasing points to readiness |
| Kubelet sync loop duration | Stalled sync delays status updates | Elevated duration trending above the node status update interval |

Fixes

If the cause is slow startup

Add a startupProbe to protect slow-starting containers. The startup probe disables liveness and readiness checks until the container has started. Then increase initialDelaySeconds or failureThreshold on the readiness probe to match worst-case startup time. Do not rely on a 1-second timeoutSeconds if the application needs several seconds to respond during initialization.
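
A startup probe gives the container up to failureThreshold × periodSeconds to start before it is restarted. The sketch below sizes failureThreshold from a worst-case startup time. The function name startup_failure_threshold is hypothetical, and the deployment name, the container name app, the path /healthz, and port 8080 in the commented patch are placeholder assumptions.

```shell
# Smallest failureThreshold such that failureThreshold * periodSeconds
# covers the worst-case startup time (integer ceiling division).
# Usage: startup_failure_threshold <worstCaseStartupSeconds> <periodSeconds>
startup_failure_threshold() {
  startup=$1
  period=$2
  echo $(( (startup + period - 1) / period ))
}

startup_failure_threshold 120 5   # prints "24", i.e. a 120-second startup budget

# Applying the result with a strategic merge patch (not run here; placeholders assumed):
#   kubectl patch deployment <deployment-name> --type=strategic -p \
#     '{"spec":{"template":{"spec":{"containers":[{"name":"app","startupProbe":{"httpGet":{"path":"/healthz","port":8080},"periodSeconds":5,"failureThreshold":24}}]}}}}'
```

Measure worst-case startup under realistic conditions (cold caches, busy node) before picking the input, and leave some margin on top of it.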

If the cause is downstream dependency checks

Remove database, cache, and external API checks from the readiness probe endpoint. Move those to a separate metrics or health endpoint used by your monitoring system. Readiness should return success as long as the pod can accept and queue requests. If the dependency is required to serve traffic, use a circuit breaker inside the application instead of failing the readiness probe.

If the cause is kubelet resource pressure

Reduce pod density on the affected node or increase kubelet resource reservations. Switch exec probes to httpGet probes where possible to eliminate fork overhead. If you must use exec probes, ensure the node has sufficient PID headroom. Raising pid_max requires a node sysctl change and a kubelet restart; plan for disruption.
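
To gauge PID headroom before relying on exec probes, something like the following can be run on the node. It assumes a Linux node where /proc/loadavg and /proc/sys/kernel/pid_max are readable, and pid_usage_pct is a hypothetical helper name. Per-pod PID limits, if configured, are a separate constraint that this sketch does not cover.

```shell
# Percentage of the kernel PID space in use, given a task count and pid_max.
# Usage: pid_usage_pct <tasksInUse> <pidMax>
pid_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# On a Linux node, the fourth field of /proc/loadavg is "running/total" tasks.
if [ -r /proc/loadavg ] && [ -r /proc/sys/kernel/pid_max ]; then
  total=$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f2)
  pid_max=$(cat /proc/sys/kernel/pid_max)
  echo "tasks=$total pid_max=$pid_max used=$(pid_usage_pct "$total" "$pid_max")%"
fi
```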

If the cause is probe misconfiguration

Set timeoutSeconds lower than periodSeconds but high enough for the application under load. For HTTP probes, ensure the path returns the correct status code. Avoid using tcpSocket probes when application-level health matters, because an open port does not mean the application is ready to serve. For gRPC services, use the native gRPC probe instead of an HTTP wrapper.
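
The timing rule above can be checked mechanically. In this sketch, check_probe_timing is a hypothetical helper. The commented command reads the values from the first container of a live pod and assumes both fields are present in the pod spec as returned by the API server.

```shell
# Succeeds when timeoutSeconds is strictly lower than periodSeconds,
# otherwise prints a warning and returns non-zero.
# Usage: check_probe_timing <periodSeconds> <timeoutSeconds>
check_probe_timing() {
  if [ "$2" -lt "$1" ]; then
    echo "ok: timeoutSeconds=$2 < periodSeconds=$1"
  else
    echo "warning: timeoutSeconds=$2 is not lower than periodSeconds=$1"
    return 1
  fi
}

check_probe_timing 10 1          # prints "ok: timeoutSeconds=1 < periodSeconds=10"
check_probe_timing 5 5 || true   # prints the warning; "|| true" keeps "set -e" scripts running

# Against a live pod (not run here):
#   set -- $(kubectl get pod <pod-name> -o \
#     jsonpath='{.spec.containers[0].readinessProbe.periodSeconds} {.spec.containers[0].readinessProbe.timeoutSeconds}')
#   check_probe_timing "$1" "$2"
```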

Prevention

  • Size readiness probes for worst-case startup, not average. Use startup probes for any container that takes more than a few seconds to initialize.
  • Keep external dependencies out of readiness checks. This prevents cascading traffic exclusion during platform outages.
  • Monitor prober_probe_total failure rate at the node level. A rising failure rate across many pods indicates node pressure, not application bugs.
  • Reserve kubelet CPU and memory headroom on dense nodes. Probe execution degrades when kubelet is resource-starved.
  • Review probe timeout and period during load testing. A timeout that works at low load may fail at high load.

How Netdata helps

Netdata correlates pod readiness transitions with kubelet CPU, memory, and sync loop latency on the same node to surface kubelet-side root causes. It tracks node pressure conditions alongside probe failure events, visualizes EndpointSlice membership changes against pod phase shifts to confirm traffic exclusion timing, and monitors container restart counts to distinguish readiness failures from liveness failures without manual kubectl checks.