Kubernetes monitoring checklist: the signals every production cluster needs
Production Kubernetes failures announce themselves early: PLEG relist latency climbs before a node goes NotReady; etcd WAL fsync duration creeps toward the heartbeat timeout; conntrack utilization sits at 90% until a traffic spike drops SYN packets. This checklist maps signals to failure modes across four layers: node and container runtime, control plane, workload state, and network data plane. Use it to wire a greenfield pipeline or audit a brownfield one.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| PLEG stall / runtime degradation | Node NotReady with “PLEG is not healthy” | time crictl ps and kubelet_pleg_relist_duration_seconds |
| Admission webhook deadlock | Mutating API requests time out; cluster appears frozen | apiserver_admission_webhook_admission_duration_seconds |
| etcd disk latency cascade | API mutating latency spikes; leader elections | etcd_disk_wal_fsync_duration_seconds |
| Conntrack exhaustion | Random connection timeouts; DNS failures on one node | nf_conntrack_count vs nf_conntrack_max |
| Silent kube-proxy watch death | Progressive service failure on one node; healthz still 200 | kubeproxy_sync_proxy_rules_last_timestamp_seconds |
| Resource pressure with invisible overcommit | MemoryPressure oscillating; BestEffort pods evicted | Actual memory usage vs requests per node |
Quick checks
Run these commands to validate health across layers before relying on dashboards.
# Control plane liveness
kubectl get --raw /livez
# etcd leader and disk health (run from etcd node)
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table
# Node Ready conditions and pressure
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): Ready=\(.status.conditions[] | select(.type=="Ready") | .status), MemoryPressure=\(.status.conditions[]? | select(.type=="MemoryPressure") | .status)"'
# Pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Conntrack utilization on a node
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc
# kube-proxy sync freshness
curl -s http://127.0.0.1:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# CRI runtime responsiveness
time crictl ps
# API server error rate
kubectl get --raw /metrics | grep 'apiserver_request_total' | grep -E 'code="5[0-9]{2}"'
# Critical service endpoints
kubectl get endpoints --all-namespaces -o json | jq -r '.items[] | select((.subsets // []) | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
How to diagnose it
When an alert fires, follow this flow to isolate the layer.
- Control plane liveness. kubectl get --raw /livez must return 200. If it fails, check etcd leader stability (etcd_server_has_leader) and disk latency (etcd_disk_wal_fsync_duration_seconds). A missing leader or fsync p99 over 100 ms explains downstream symptoms.
- Node Ready conditions. Any node with Ready=Unknown for more than one minute is past its grace period. Check kubelet_pleg_relist_duration_seconds. If it is above 60 seconds, the container runtime is the bottleneck.
- Workload scheduling. Sustained pending pods mean either capacity exhaustion (check node allocatable vs requests) or scheduling constraints (taints, affinity, ResourceQuotas). Look at scheduler_pending_pods: unschedulableQ indicates hard constraints; backoffQ indicates transient failures.
- Network programming. If pods are Running but services are unreachable, verify kubeproxy_sync_proxy_rules_last_timestamp_seconds is recent. Then check nf_conntrack_count against nf_conntrack_max. If utilization is above 90%, new connections drop.
- Resource pressure per node. MemoryPressure=True or DiskPressure=True means the eviction manager is actively killing pods. Cross-reference with kubelet_evictions_total and actual memory utilization from /proc/meminfo or cgroup stats. If actual usage exceeds requests, you have invisible overcommit.
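If no dashboard is available, the raw numbers behind the first two steps can be pulled straight from the cluster. A minimal sketch, assuming your credentials allow the nodes/proxy subresource; <node-name> is a placeholder for the node under investigation.
# Kubelet PLEG and CRI operation latency histograms, via the API server node proxy
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep -E 'kubelet_pleg_relist_duration_seconds|kubelet_runtime_operations_duration_seconds' | grep -v '^#'
# API server view of etcd write latency
kubectl get --raw /metrics | grep '^etcd_request_duration_seconds' | head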
Metrics and signals to monitor
Node and container runtime
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node Ready condition | Node can run pods | Ready=False or Unknown for more than 60 seconds |
| kubelet_pleg_relist_duration_seconds | Precedes almost every unexpected NotReady | p99 above 30 seconds |
| kubelet_runtime_operations_duration_seconds | Slow CRI calls stall pod lifecycle | p99 for list_containers or list_podsandbox above 5 seconds |
| kubelet_evictions_total | Active load shedding | Any increase on production workloads |
| Node MemoryPressure / DiskPressure / PIDPressure | Incompressible resources hit cliff-edge failures | Any condition True |
| Container OOM kill events | Memory limit enforcement or node exhaustion | Exit code 137 with reason OOMKilled |
| Image pull errors | Blocks pod startup | ImagePullBackOff for more than 5 minutes |
| kubelet_certificate_manager_client_ttl_seconds | Silent rotation failure leads to node disconnection | TTL below 7 days with rotation errors |
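The OOM and eviction rows translate directly into kubectl checks when you need names rather than counts. A sketch using jq; adjust selectors to your environment.
# Containers whose last termination was an OOM kill
kubectl get pods --all-namespaces -o json | jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.lastState.terminated.reason == "OOMKilled") | "\($p.metadata.namespace)/\($p.metadata.name) container=\(.name) restarts=\(.restartCount)"'
# Recent evictions across the cluster
kubectl get events --all-namespaces --field-selector reason=Evicted --sort-by=.lastTimestamp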
Control plane
| Signal | Why it matters | Warning sign |
|---|---|---|
| API server /livez and /readyz | Binary health and traffic readiness | Non-200 sustained for more than 15 seconds |
| apiserver_request_duration_seconds | Control plane temperature; verb breakdown matters | Mutating p99 above 1 second; LIST p99 above 30 seconds (exclude WATCH from SLO calculations) |
| apiserver_request_total by code | Throttling, auth failure, or etcd issues | Sustained 5xx rate above 0; 429 rate above 5% |
| etcd_request_duration_seconds | API server view of etcd; floor for all writes | Write p99 above 100 ms sustained |
| etcd_disk_wal_fsync_duration_seconds | Root cause of most etcd instability | p99 above 50 ms; approaching the heartbeat timeout |
| etcd_server_leader_changes_seen_total | Disk or network stress causing Raft elections | More than one per hour without maintenance |
| etcd_debugging_mvcc_db_total_size_in_bytes | Approaching quota triggers read-only alarm | Above 75% of --quota-backend-bytes |
| apiserver_admission_webhook_admission_duration_seconds | Synchronous external dependency in the mutation path | Per-webhook p99 above 1 second |
| apiserver_flowcontrol_current_inqueue_requests | APF starvation of critical flows | Queue depth above 0 for system or leader-election levels |
| apiserver_current_inflight_requests | Request processing saturation | Above 80% of configured max for mutating or read-only |
| API server memory / CPU vs limits | OOM or throttling causes crash loops | RSS above 80% of limit; CFS throttled periods increasing |
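Several of these can be spot-checked ad hoc from the API server metrics endpoint and etcdctl. A sketch, with etcd TLS flags omitted for brevity.
# Inflight requests and APF queue depth per priority level
kubectl get --raw /metrics | grep -E '^apiserver_current_inflight_requests|^apiserver_flowcontrol_current_inqueue_requests'
# Per-webhook admission latency histogram buckets
kubectl get --raw /metrics | grep '^apiserver_admission_webhook_admission_duration_seconds_bucket' | head -20
# etcd database size, leader, and raft state per member (run where etcdctl can reach the cluster)
etcdctl endpoint status --cluster -w table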
Workload and cluster state
| Signal | Why it matters | Warning sign |
|---|---|---|
| scheduler_pending_pods by queue | Capacity or constraint failure | unschedulableQ growing for more than 5 minutes |
| workqueue_depth | Controllers falling behind | Sustained depth above 0 for node, deployment, or replicaset queues |
| Deployment readyReplicas vs spec.replicas | Rollout health and capacity | readyReplicas below minAvailable from PDB for more than 5 minutes |
| Pod phase distribution | Coarse cluster health | Unknown pods appearing; Pending sustained |
| Container restart count | Crashes, OOM kills, or probe failures | Increase above 5 in 10 minutes for production pods |
| container_cpu_cfs_throttled_periods_total | Invisible latency from CFS quota denial | Throttle ratio above 25% sustained for latency-sensitive workloads |
| Service endpoint count | Zero endpoints means zero traffic capacity | Any production Service with zero ready endpoints |
| CoreDNS latency and SERVFAIL rate | DNS failure cascades to all discovery | SERVFAIL rate above 1%; p99 latency above 500 ms |
| PVC status | Volume binding blocks stateful startup | PVC Pending or Lost for more than 5 minutes |
| apiserver_storage_objects | Object bloat drives etcd size and API memory | Any resource type growing unboundedly; events above 50,000 |
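CPU throttling and rollout lag are easy to spot-check without a full pipeline. A sketch; <node-name> is a placeholder, and cAdvisor metrics come through the same node proxy used earlier.
# CFS throttling counters from cAdvisor; throttle ratio = throttled_periods / periods
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | grep -E 'container_cpu_cfs_(periods|throttled_periods)_total' | grep -v '^#' | head
# Deployments whose ready replicas lag the spec
kubectl get deployments --all-namespaces -o json | jq -r '.items[] | select((.status.readyReplicas // 0) < (.spec.replicas // 1)) | "\(.metadata.namespace)/\(.metadata.name): \(.status.readyReplicas // 0)/\(.spec.replicas // 1) ready"'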
Network and data plane
| Signal | Why it matters | Warning sign |
|---|---|---|
| kube-proxy process liveness | Stale rules and new Service failures | Process missing on any schedulable node |
| kubeproxy_sync_proxy_rules_duration_seconds | Rule programming latency | p99 above 10 seconds in iptables mode; above 1 second and growing |
| kubeproxy_sync_proxy_rules_last_timestamp_seconds | Silent watch death despite healthy process | Age above 2 times the sync period |
| Conntrack utilization (nf_conntrack_count / nf_conntrack_max) | New connections dropped at 100% utilization | Above 75%; any nonzero drop counter |
| Conntrack drops (conntrack -S) | Confirmed packet loss from table exhaustion | Any increment in the drop field |
| iptables rule count (iptables mode) | O(n) traversal and lock contention | Rule count above 20,000 |
| IPVS virtual/real server counts (IPVS mode) | Proxy programming correctness | Virtual server count diverging from Service count |
| kube-proxy API watch errors | Stale state even if process is running | Elevated rest_client_requests_total with 5xx or 429 |
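On a suspect node, the data plane rows reduce to a handful of commands. A sketch for iptables and IPVS modes; note the IPVS virtual server count will not map one-to-one to Services, since each ClusterIP, port, and NodePort gets its own entry.
# Confirmed conntrack drops (any increment in the drop or early_drop counters is packet loss)
conntrack -S
# iptables rule count (iptables mode)
iptables-save | wc -l
# IPVS virtual server count (IPVS mode) vs the number of Services
ipvsadm -Ln | grep -cE '^(TCP|UDP)'
kubectl get services --all-namespaces --no-headers | wc -l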
Fixes
If the cause is node or runtime pressure
Cordon the node. Identify evicted pods: kubectl get events --field-selector reason=Evicted. Check whether evicted workloads have memory requests matching actual usage. If BestEffort pods consume unbounded memory, enforce limits via LimitRange. If the runtime is slow, check for defunct or hung containerd-shim processes.
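A minimal LimitRange sketch that assigns default requests and limits to containers that declare none; the name, namespace, and values are illustrative and should be tuned per workload.
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults      # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:             # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                    # applied when a container sets no limits
      cpu: "1"
      memory: 512Mi
EOF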
If the cause is control plane saturation
Identify the traffic source: API server audit logs record the user agent for every request, and the APF metrics broken down by flow_schema show which class of client dominates. If APF throttles critical controllers, increase concurrency shares for the system and leader-election priority levels. If a single operator floods the API, reduce its concurrency or add client-side rate limits. If admission webhooks are the bottleneck, verify webhook endpoint health. As a temporary mitigation only, you can change failurePolicy to Ignore, but this bypasses policy enforcement and can allow invalid or insecure objects into the cluster.
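To see where APF pressure lands and which webhooks sit in the request path, a few read-only checks against standard API server metrics and resources:
# Requests rejected or queued by API Priority and Fairness, by priority level
kubectl get --raw /metrics | grep -E '^apiserver_flowcontrol_(rejected_requests_total|current_inqueue_requests)'
# FlowSchemas map client traffic onto priority levels
kubectl get flowschemas
# Webhook configurations in the request path
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations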
If the cause is etcd latency
Move etcd data directories to dedicated SSD storage with low fsync latency. Verify etcd does not share disks with log-heavy workloads. Check compaction effectiveness: if the gap between etcd_debugging_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes is large, schedule defragmentation during a maintenance window, one member at a time. Increase --quota-backend-bytes if the database approaches its limit.
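A sketch of the etcd maintenance steps; TLS flags are omitted, <member-endpoint> is a placeholder, and the fio benchmark should run on an equivalent disk or during a maintenance window because it generates write load.
# Size, leader, and raft state per member
etcdctl endpoint status --cluster -w table
# Defragment one member at a time, followers first
etcdctl --endpoints=<member-endpoint> defrag
# Clear a NOSPACE alarm after raising the quota or compacting
etcdctl alarm list
etcdctl alarm disarm
# Benchmark fdatasync latency on the etcd data disk; p99 should stay well under 10 ms
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check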
If the cause is network or conntrack exhaustion
Immediately increase nf_conntrack_max on affected nodes: sysctl -w net.netfilter.nf_conntrack_max=<2x current>. This is a temporary fix; investigate the root cause to avoid masking a connection leak. Inspect connection states with conntrack -L to identify TIME_WAIT or UDP accumulation. If kube-proxy sync duration in iptables mode exceeds 10 seconds at scale, plan a migration to IPVS or nftables mode. For UDP services in IPVS mode, reduce the UDP timeout to prevent dead backend stickiness.
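The immediate mitigation and the follow-up inspection, as shell. The new ceiling below is illustrative (roughly double the current max, per the guidance above); note that kube-proxy re-applies its own value from --conntrack-max-per-core at startup, so align that flag too or a restart will revert the change.
# Raise the ceiling now; persist via /etc/sysctl.d/ once the root cause is understood
sysctl -w net.netfilter.nf_conntrack_max=524288
# Confirm whether flows are already being dropped
conntrack -S
# Which TCP states dominate the table (TIME_WAIT buildup points at the leaking client)
conntrack -L -p tcp 2>/dev/null | awk '{print $4}' | sort | uniq -c | sort -rn | head
# UDP entry count (DNS-heavy nodes accumulate these)
conntrack -L -p udp 2>/dev/null | wc -l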
If the cause is workload misconfiguration
Set resource requests and limits for all production pods. Remove CPU limits to avoid CFS throttling if your cluster can tolerate the burst, but keep memory limits to contain blast radius. Configure PodDisruptionBudgets with minAvailable below total replicas to allow drains. Ensure liveness probes detect deadlocks while readiness probes detect temporary inability to serve traffic. Do not configure them identically.
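A minimal PodDisruptionBudget sketch; the name, selector, and threshold are illustrative, and minAvailable must stay below the total replica count or voluntary drains will block.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # illustrative name
spec:
  minAvailable: 2               # keep below spec.replicas so drains can proceed
  selector:
    matchLabels:
      app: web                  # illustrative label
EOF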
Prevention
- Monitor leading indicators, not just lagging ones. PLEG relist duration, etcd WAL fsync, and conntrack ratio warn early. Node Ready and pod eviction are late.
- Forward Kubernetes Events to persistent storage. Events have a one-hour TTL. Without external forwarding, you lose the incident narrative.
- Alert on certificate TTL at 30 days. Control plane, kubelet, and webhook certificates expire on predictable schedules; auto-rotation fails silently.
- Enforce resource hygiene. Use LimitRange and ResourceQuota to prevent BestEffort pods from entering the cluster. Without requests, the scheduler sees zero demand and over-packs nodes.
- Right-size kube-proxy for scale. If you run more than a few thousand Services, iptables mode degrades. Size for IPVS or nftables before you hit the cliff.
- Audit admission webhooks quarterly. Verify failurePolicy, namespaceSelector exclusions for kube-system, and timeout values. A webhook matching all resources with failurePolicy: Fail is a cluster-wide single point of failure.
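For the webhook audit, this read-only check lists every webhook that fails closed, with its timeout and namespace scoping; a sketch using jq.
# Webhooks with failurePolicy: Fail are potential cluster-wide single points of failure
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json | jq -r '.items[] | .metadata.name as $cfg | .webhooks[]? | select(.failurePolicy == "Fail") | "\($cfg)/\(.name): timeout=\(.timeoutSeconds // 10)s namespaceSelector=\(.namespaceSelector // "none")"'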
How Netdata helps
- Correlates node-level PSI stall metrics with kubelet PLEG relist latency and CRI operation duration to surface runtime slowdown before nodes go NotReady.
- Tracks conntrack utilization, API server request latency, and APF rejection rates on one timeline to expose network and control plane saturation.
- Surfaces container-level CPU throttling, memory pressure, and OOM kills alongside pod restart counts and deployment readiness gaps.
- Maps etcd disk latency and leader changes to scheduler pending pods and controller work queue depth to show control plane cascades.





