Kubernetes monitoring checklist: the signals every production cluster needs
Production Kubernetes failures announce themselves early: PLEG relist latency climbs before a node goes NotReady; etcd WAL fsync duration creeps toward the heartbeat timeout; conntrack utilization sits at 90% until a traffic spike drops SYN packets. This checklist maps signals to failure modes across four layers: node and container runtime, control plane, workload state, and network data plane. Use it to wire a greenfield pipeline or audit a brownfield one.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| PLEG stall / runtime degradation | Node NotReady with “PLEG is not healthy” | time crictl ps and kubelet_pleg_relist_duration_seconds |
| Admission webhook deadlock | Mutating API requests time out; cluster appears frozen | apiserver_admission_webhook_admission_duration_seconds |
| etcd disk latency cascade | API mutating latency spikes; leader elections | etcd_disk_wal_fsync_duration_seconds |
| Conntrack exhaustion | Random connection timeouts; DNS failures on one node | nf_conntrack_count vs nf_conntrack_max |
| Silent kube-proxy watch death | Progressive service failure on one node; healthz still 200 | kubeproxy_sync_proxy_rules_last_timestamp_seconds |
| Resource pressure with invisible overcommit | MemoryPressure oscillating; BestEffort pods evicted | Actual memory usage vs requests per node |
Quick checks
Run these commands to validate health across layers before relying on dashboards.
# Control plane liveness
kubectl get --raw /livez
# etcd leader and disk health (run from etcd node)
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table
# Node Ready conditions and pressure
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): Ready=\(.status.conditions[] | select(.type=="Ready") | .status), MemoryPressure=\(.status.conditions[]? | select(.type=="MemoryPressure") | .status)"'
# Pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Conntrack utilization on a node
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc
# kube-proxy sync freshness
curl -s http://127.0.0.1:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# CRI runtime responsiveness
time crictl ps
# API server error rate
kubectl get --raw /metrics | grep 'apiserver_request_total' | grep -E 'code="5[0-9]{2}"'
# Critical service endpoints
kubectl get endpoints --all-namespaces -o json | jq -r '.items[] | select((.subsets // []) | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
How to diagnose it
When an alert fires, follow this flow to isolate the layer.
- Control plane liveness. kubectl get --raw /livez must return 200. If it fails, check etcd leader stability (etcd_server_has_leader) and disk latency (etcd_disk_wal_fsync_duration_seconds). A missing leader or fsync p99 over 100 ms explains downstream symptoms.
- Node Ready conditions. Any node with Ready=Unknown for more than one minute is past its grace period. Check kubelet_pleg_relist_duration_seconds. If it is above 60 seconds, the container runtime is the bottleneck.
- Workload scheduling. Sustained pending pods mean either capacity exhaustion (check node allocatable vs requests) or scheduling constraints (taints, affinity, ResourceQuotas). Look at scheduler_pending_pods: unschedulableQ indicates hard constraints; backoffQ indicates transient failures.
- Network programming. If pods are Running but services are unreachable, verify kubeproxy_sync_proxy_rules_last_timestamp_seconds is recent. Then check nf_conntrack_count against nf_conntrack_max. If utilization is above 90%, new connections drop.
- Resource pressure per node. MemoryPressure=True or DiskPressure=True means the eviction manager is actively killing pods. Cross-reference with kubelet_evictions_total and actual memory utilization from /proc/meminfo or cgroup stats. If actual usage exceeds requests, you have invisible overcommit.
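If no dashboard is available, the raw numbers behind the first two steps can be pulled straight from the cluster. A minimal sketch, assuming your credentials allow the nodes/proxy subresource; <node-name> is a placeholder for the node under investigation.
# Kubelet PLEG and CRI operation latency histograms, via the API server node proxy
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep -E 'kubelet_pleg_relist_duration_seconds|kubelet_runtime_operations_duration_seconds' | grep -v '^#'
# API server view of etcd write latency
kubectl get --raw /metrics | grep '^etcd_request_duration_seconds' | head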
Metrics and signals to monitor
Node and container runtime
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node Ready condition | Node can run pods | Ready=False or Unknown for more than 60 seconds |
| kubelet_pleg_relist_duration_seconds | Precedes almost every unexpected NotReady | p99 above 30 seconds |
| kubelet_runtime_operations_duration_seconds | Slow CRI calls stall pod lifecycle | p99 for list_containers or list_podsandbox above 5 seconds |
| kubelet_evictions_total | Active load shedding | Any increase on production workloads |
| Node MemoryPressure / DiskPressure / PIDPressure | Incompressible resources hit cliff-edge failures | Any condition True |
| Container OOM kill events | Memory limit enforcement or node exhaustion | Exit code 137 with reason OOMKilled |
| Image pull errors | Blocks pod startup | ImagePullBackOff for more than 5 minutes |
| kubelet_certificate_manager_client_ttl_seconds | Silent rotation failure leads to node disconnection | TTL below 7 days with rotation errors |
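The OOM and eviction rows translate directly into kubectl checks when you need names rather than counts. A sketch using jq; adjust selectors to your environment.
# Containers whose last termination was an OOM kill
kubectl get pods --all-namespaces -o json | jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.lastState.terminated.reason == "OOMKilled") | "\($p.metadata.namespace)/\($p.metadata.name) container=\(.name) restarts=\(.restartCount)"'
# Recent evictions across the cluster
kubectl get events --all-namespaces --field-selector reason=Evicted --sort-by=.lastTimestamp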
Control plane
| Signal | Why it matters | Warning sign |
|---|---|---|
| API server /livez and /readyz | Binary health and traffic readiness | Non-200 sustained for more than 15 seconds |
| apiserver_request_duration_seconds | Control plane temperature; verb breakdown matters | Mutating p99 above 1 second; LIST p99 above 30 seconds (exclude WATCH from SLO calculations) |
| apiserver_request_total by code | Throttling, auth failure, or etcd issues | Sustained 5xx rate above 0; 429 rate above 5% |
| etcd_request_duration_seconds | API server view of etcd; floor for all writes | Write p99 above 100 ms sustained |
| etcd_disk_wal_fsync_duration_seconds | Root cause of most etcd instability | p99 above 50 ms; approaching the heartbeat timeout |
| etcd_server_leader_changes_seen_total | Disk or network stress causing Raft elections | More than one per hour without maintenance |
| etcd_debugging_mvcc_db_total_size_in_bytes | Approaching quota triggers read-only alarm | Above 75% of --quota-backend-bytes |
| apiserver_admission_webhook_admission_duration_seconds | Synchronous external dependency in the mutation path | Per-webhook p99 above 1 second |
| apiserver_flowcontrol_current_inqueue_requests | APF starvation of critical flows | Queue depth above 0 for system or leader-election levels |
| apiserver_current_inflight_requests | Request processing saturation | Above 80% of configured max for mutating or read-only |
| API server memory / CPU vs limits | OOM or throttling causes crash loops | RSS above 80% of limit; CFS throttled periods increasing |
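Several of these can be spot-checked ad hoc from the API server metrics endpoint and etcdctl. A sketch, with etcd TLS flags omitted for brevity.
# Inflight requests and APF queue depth per priority level
kubectl get --raw /metrics | grep -E '^apiserver_current_inflight_requests|^apiserver_flowcontrol_current_inqueue_requests'
# Per-webhook admission latency histogram buckets
kubectl get --raw /metrics | grep '^apiserver_admission_webhook_admission_duration_seconds_bucket' | head -20
# etcd database size, leader, and raft state per member (run where etcdctl can reach the cluster)
etcdctl endpoint status --cluster -w table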
Workload and cluster state
| Signal | Why it matters | Warning sign |
|---|---|---|
| scheduler_pending_pods by queue | Capacity or constraint failure | unschedulableQ growing for more than 5 minutes |
| workqueue_depth | Controllers falling behind | Sustained depth above 0 for node, deployment, or replicaset queues |
| Deployment readyReplicas vs spec.replicas | Rollout health and capacity | readyReplicas below minAvailable from PDB for more than 5 minutes |
| Pod phase distribution | Coarse cluster health | Unknown pods appearing; Pending sustained |
| Container restart count | Crashes, OOM kills, or probe failures | Increase above 5 in 10 minutes for production pods |
| container_cpu_cfs_throttled_periods_total | Invisible latency from CFS quota denial | Throttle ratio above 25% sustained for latency-sensitive workloads |
| Service endpoint count | Zero endpoints means zero traffic capacity | Any production Service with zero ready endpoints |
| CoreDNS latency and SERVFAIL rate | DNS failure cascades to all discovery | SERVFAIL rate above 1%; p99 latency above 500 ms |
| PVC status | Volume binding blocks stateful startup | PVC Pending or Lost for more than 5 minutes |
| apiserver_storage_objects | Object bloat drives etcd size and API memory | Any resource type growing unboundedly; events above 50,000 |
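CPU throttling and rollout lag are easy to spot-check without a full pipeline. A sketch; <node-name> is a placeholder, and cAdvisor metrics come through the same node proxy used earlier.
# CFS throttling counters from cAdvisor; throttle ratio = throttled_periods / periods
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | grep -E 'container_cpu_cfs_(periods|throttled_periods)_total' | grep -v '^#' | head
# Deployments whose ready replicas lag the spec
kubectl get deployments --all-namespaces -o json | jq -r '.items[] | select((.status.readyReplicas // 0) < (.spec.replicas // 1)) | "\(.metadata.namespace)/\(.metadata.name): \(.status.readyReplicas // 0)/\(.spec.replicas // 1) ready"'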
Network and data plane
| Signal | Why it matters | Warning sign |
|---|---|---|
| kube-proxy process liveness | Stale rules and new Service failures | Process missing on any schedulable node |
| kubeproxy_sync_proxy_rules_duration_seconds | Rule programming latency | p99 above 10 seconds in iptables mode; above 1 second and growing |
| kubeproxy_sync_proxy_rules_last_timestamp_seconds | Silent watch death despite healthy process | Age above 2 times the sync period |
| Conntrack utilization (nf_conntrack_count / nf_conntrack_max) | New connections dropped at 100% utilization | Above 75%; any nonzero drop counter |
| Conntrack drops (conntrack -S) | Confirmed packet loss from table exhaustion | Any increment in the drop field |
| iptables rule count (iptables mode) | O(n) traversal and lock contention | Rule count above 20,000 |
| IPVS virtual/real server counts (IPVS mode) | Proxy programming correctness | Virtual server count diverging from Service count |
| kube-proxy API watch errors | Stale state even if process is running | Elevated rest_client_requests_total with 5xx or 429 |
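On a suspect node, the data plane rows reduce to a handful of commands. A sketch for iptables and IPVS modes; note the IPVS virtual server count will not map one-to-one to Services, since each ClusterIP, port, and NodePort gets its own entry.
# Confirmed conntrack drops (any increment in the drop or early_drop counters is packet loss)
conntrack -S
# iptables rule count (iptables mode)
iptables-save | wc -l
# IPVS virtual server count (IPVS mode) vs the number of Services
ipvsadm -Ln | grep -cE '^(TCP|UDP)'
kubectl get services --all-namespaces --no-headers | wc -l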
Fixes
If the cause is node or runtime pressure
Cordon the node. Identify evicted pods: kubectl get events --field-selector reason=Evicted. Check whether evicted workloads have memory requests matching actual usage. If BestEffort pods consume unbounded memory, enforce limits via LimitRange. If the runtime is slow, check for defunct or hung containerd-shim processes.
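A minimal LimitRange sketch that assigns default requests and limits to containers that declare none; the name, namespace, and values are illustrative and should be tuned per workload.
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults      # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:             # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                    # applied when a container sets no limits
      cpu: "1"
      memory: 512Mi
EOF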
If the cause is control plane saturation
Identify the traffic source: API server audit logs record the user agent for every request, and the APF metrics broken down by flow_schema show which class of client dominates. If APF throttles critical controllers, increase concurrency shares for the system and leader-election priority levels. If a single operator floods the API, reduce its concurrency or add client-side rate limits. If admission webhooks are the bottleneck, verify webhook endpoint health. As a temporary mitigation only, you can change failurePolicy to Ignore, but this bypasses policy enforcement and can allow invalid or insecure objects into the cluster.
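To see where APF pressure lands and which webhooks sit in the request path, a few read-only checks against standard API server metrics and resources:
# Requests rejected or queued by API Priority and Fairness, by priority level
kubectl get --raw /metrics | grep -E '^apiserver_flowcontrol_(rejected_requests_total|current_inqueue_requests)'
# FlowSchemas map client traffic onto priority levels
kubectl get flowschemas
# Webhook configurations in the request path
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations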
If the cause is etcd latency
Move etcd data directories to dedicated SSD storage with low fsync latency. Verify etcd does not share disks with log-heavy workloads. Check compaction effectiveness: if the gap between etcd_debugging_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes is large, schedule defragmentation during a maintenance window, one member at a time. Increase --quota-backend-bytes if the database approaches its limit.
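A sketch of the etcd maintenance steps; TLS flags are omitted, <member-endpoint> is a placeholder, and the fio benchmark should run on an equivalent disk or during a maintenance window because it generates write load.
# Size, leader, and raft state per member
etcdctl endpoint status --cluster -w table
# Defragment one member at a time, followers first
etcdctl --endpoints=<member-endpoint> defrag
# Clear a NOSPACE alarm after raising the quota or compacting
etcdctl alarm list
etcdctl alarm disarm
# Benchmark fdatasync latency on the etcd data disk; p99 should stay well under 10 ms
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check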
If the cause is network or conntrack exhaustion
Immediately increase nf_conntrack_max on affected nodes: sysctl -w net.netfilter.nf_conntrack_max=<2x current>. This is a temporary fix; investigate the root cause to avoid masking a connection leak. Inspect connection states with conntrack -L to identify TIME_WAIT or UDP accumulation. If kube-proxy sync duration in iptables mode exceeds 10 seconds at scale, plan a migration to IPVS or nftables mode. For UDP services in IPVS mode, reduce the UDP timeout to prevent dead backend stickiness.
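The immediate mitigation and the follow-up inspection, as shell. The new ceiling below is illustrative (roughly double the current max, per the guidance above); note that kube-proxy re-applies its own value from --conntrack-max-per-core at startup, so align that flag too or a restart will revert the change.
# Raise the ceiling now; persist via /etc/sysctl.d/ once the root cause is understood
sysctl -w net.netfilter.nf_conntrack_max=524288
# Confirm whether flows are already being dropped
conntrack -S
# Which TCP states dominate the table (TIME_WAIT buildup points at the leaking client)
conntrack -L -p tcp 2>/dev/null | awk '{print $4}' | sort | uniq -c | sort -rn | head
# UDP entry count (DNS-heavy nodes accumulate these)
conntrack -L -p udp 2>/dev/null | wc -l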
If the cause is workload misconfiguration
Set resource requests and limits for all production pods. Remove CPU limits to avoid CFS throttling if your cluster can tolerate the burst, but keep memory limits to contain blast radius. Configure PodDisruptionBudgets with minAvailable below total replicas to allow drains. Ensure liveness probes detect deadlocks while readiness probes detect temporary inability to serve traffic. Do not configure them identically.
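A minimal PodDisruptionBudget sketch; the name, selector, and threshold are illustrative, and minAvailable must stay below the total replica count or voluntary drains will block.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # illustrative name
spec:
  minAvailable: 2               # keep below spec.replicas so drains can proceed
  selector:
    matchLabels:
      app: web                  # illustrative label
EOF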
Prevention
- Monitor leading indicators, not just lagging ones. PLEG relist duration, etcd WAL fsync, and conntrack ratio warn early. Node Ready and pod eviction are late.
- Forward Kubernetes Events to persistent storage. Events have a one-hour TTL. Without external forwarding, you lose the incident narrative.
- Alert on certificate TTL at 30 days. Control plane, kubelet, and webhook certificates expire on predictable schedules; auto-rotation fails silently.
- Enforce resource hygiene. Use LimitRange and ResourceQuota to prevent BestEffort pods from entering the cluster. Without requests, the scheduler sees zero demand and over-packs nodes.
- Right-size kube-proxy for scale. If you run more than a few thousand Services, iptables mode degrades. Size for IPVS or nftables before you hit the cliff.
- Audit admission webhooks quarterly. Verify failurePolicy, namespaceSelector exclusions for kube-system, and timeout values. A webhook matching all resources with failurePolicy: Fail is a cluster-wide single point of failure.
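For the webhook audit, this read-only check lists every webhook that fails closed, with its timeout and namespace scoping; a sketch using jq.
# Webhooks with failurePolicy: Fail are potential cluster-wide single points of failure
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json | jq -r '.items[] | .metadata.name as $cfg | .webhooks[]? | select(.failurePolicy == "Fail") | "\($cfg)/\(.name): timeout=\(.timeoutSeconds // 10)s namespaceSelector=\(.namespaceSelector // "none")"'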
How Netdata helps
- Correlates node-level PSI stall metrics with kubelet PLEG relist latency and CRI operation duration to surface runtime slowdown before nodes go NotReady.
- Tracks conntrack utilization, API server request latency, and APF rejection rates on one timeline to expose network and control plane saturation.
- Surfaces container-level CPU throttling, memory pressure, and OOM kills alongside pod restart counts and deployment readiness gaps.
- Maps etcd disk latency and leader changes to scheduler pending pods and controller work queue depth to show control plane cascades.





