Kubernetes monitoring checklist: the signals every production cluster needs

Production Kubernetes failures announce themselves early: PLEG relist latency climbs before a node goes NotReady; etcd WAL fsync duration creeps toward the heartbeat timeout; conntrack utilization sits at 90% until a traffic spike drops SYN packets. This checklist maps signals to failure modes across four layers: node and container runtime, control plane, workload state, and network data plane. Use it to wire a greenfield pipeline or audit a brownfield one.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| PLEG stall / runtime degradation | Node NotReady with “PLEG is not healthy” | time crictl ps and kubelet_pleg_relist_duration_seconds |
| Admission webhook deadlock | Mutating API requests time out; cluster appears frozen | apiserver_admission_webhook_admission_duration_seconds |
| etcd disk latency cascade | API mutating latency spikes; leader elections | etcd_disk_wal_fsync_duration_seconds |
| Conntrack exhaustion | Random connection timeouts; DNS failures on one node | nf_conntrack_count vs nf_conntrack_max |
| Silent kube-proxy watch death | Progressive service failure on one node; healthz still 200 | kubeproxy_sync_proxy_rules_last_timestamp_seconds |
| Resource pressure with invisible overcommit | MemoryPressure oscillating; BestEffort pods evicted | Actual memory usage vs requests per node |

Quick checks

Run these commands to validate health across layers before relying on dashboards.

# Control plane liveness
kubectl get --raw /livez

# etcd leader and disk health (run from etcd node)
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table

# Node Ready conditions and pressure
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): Ready=\(.status.conditions[] | select(.type=="Ready") | .status), MemoryPressure=\(.status.conditions[]? | select(.type=="MemoryPressure") | .status)"'

# Pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Conntrack utilization on a node
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc

# kube-proxy sync freshness
curl -s http://127.0.0.1:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds

# CRI runtime responsiveness
time crictl ps

# API server error rate
kubectl get --raw /metrics | grep 'apiserver_request_total' | grep 'code="5"'

# Critical service endpoints
kubectl get endpoints --all-namespaces -o json | jq -r '.items[] | select((.subsets // []) | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"'

How to diagnose it

When an alert fires, follow this flow to isolate the layer.

  1. Control plane liveness. kubectl get --raw /livez must return 200. If it fails, check etcd leader stability (etcd_server_has_leader) and disk latency (etcd_disk_wal_fsync_duration_seconds). A missing leader or fsync p99 over 100 ms explains downstream symptoms.
  2. Node Ready conditions. Any node with Ready=Unknown for more than one minute is past its grace period. Check kubelet_pleg_relist_duration_seconds. If it is above 60 seconds, the container runtime is the bottleneck.
  3. Workload scheduling. Sustained pending pods mean either capacity exhaustion (check node allocatable vs requests) or scheduling constraints (taints, affinity, ResourceQuotas). Look at scheduler_pending_pods: unschedulableQ indicates hard constraints; backoffQ indicates transient failures.
  4. Network programming. If pods are Running but services are unreachable, verify kubeproxy_sync_proxy_rules_last_timestamp_seconds is recent. Then check nf_conntrack_count against nf_conntrack_max. If utilization is above 90%, new connections drop.
  5. Resource pressure per node. MemoryPressure=True or DiskPressure=True means the eviction manager is actively killing pods. Cross-reference with kubelet_evictions_total and actual memory utilization from /proc/meminfo or cgroup stats. If actual usage exceeds requests, you have invisible overcommit.
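
If you want to confirm steps 2 and 5 by hand rather than from dashboards, the checks below are one way to do it; the node name worker-1 is a placeholder for whichever node the alert points at.

# Recent evictions, newest last
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp | tail -n 20

# Requested vs allocatable resources on the suspect node (replace worker-1)
kubectl describe node worker-1 | grep -A 7 "Allocated resources"

# PLEG relist latency straight from the kubelet, via the API server proxy
kubectl get --raw /api/v1/nodes/worker-1/proxy/metrics | grep kubelet_pleg_relist_duration_seconds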

Metrics and signals to monitor

Node and container runtime

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Node Ready condition | Node can run pods | Ready=False or Unknown for more than 60 seconds |
| kubelet_pleg_relist_duration_seconds | Precedes almost every unexpected NotReady | p99 above 30 seconds |
| kubelet_runtime_operations_duration_seconds | Slow CRI calls stall pod lifecycle | p99 for list_containers or list_podsandbox above 5 seconds |
| kubelet_evictions_total | Active load shedding | Any increase on production workloads |
| Node MemoryPressure / DiskPressure / PIDPressure | Incompressible resources hit cliff-edge failures | Any condition True |
| Container OOM kill events | Memory limit enforcement or node exhaustion | Exit code 137 with reason OOMKilled |
| Image pull errors | Blocks pod startup | ImagePullBackOff for more than 5 minutes |
| kubelet_certificate_manager_client_ttl_seconds | Silent rotation failure leads to node disconnection | TTL below 7 days with rotation errors |
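
Two of these signals are easy to spot-check from the CLI; this is a sketch that assumes jq is installed, not a replacement for scraping the kubelet.

# Containers whose last termination was an OOM kill
kubectl get pods -A -o json | jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.lastState.terminated.reason == "OOMKilled") | "\($p.metadata.namespace)/\($p.metadata.name)/\(.name)"'

# Pods stuck pulling images
kubectl get pods -A --no-headers | grep -E 'ImagePullBackOff|ErrImagePull'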

Control plane

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| API server /livez and /readyz | Binary health and traffic readiness | Non-200 sustained for more than 15 seconds |
| apiserver_request_duration_seconds | Control plane temperature; verb breakdown matters | Mutating p99 above 1 second; LIST p99 above 30 seconds (exclude WATCH from SLO calculations) |
| apiserver_request_total by code | Throttling, auth failure, or etcd issues | Sustained 5xx rate above 0; 429 rate above 5% |
| etcd_request_duration_seconds | API server view of etcd; floor for all writes | Write p99 above 100 ms sustained |
| etcd_disk_wal_fsync_duration_seconds | Root cause of most etcd instability | p99 above 50 ms; approaching the heartbeat timeout |
| etcd_server_leader_changes_seen_total | Disk or network stress causing Raft elections | More than one per hour without maintenance |
| etcd_debugging_mvcc_db_total_size_in_bytes | Approaching quota triggers read-only alarm | Above 75% of --quota-backend-bytes |
| apiserver_admission_webhook_admission_duration_seconds | Synchronous external dependency in the mutation path | Per-webhook p99 above 1 second |
| apiserver_flowcontrol_current_inqueue_requests | APF starvation of critical flows | Queue depth above 0 for system or leader-election levels |
| apiserver_current_inflight_requests | Request processing saturation | Above 80% of configured max for mutating or read-only |
| API server memory / CPU vs limits | OOM or throttling causes crash loops | RSS above 80% of limit; CFS throttled periods increasing |
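
Most of these signals come from the API server's own /metrics endpoint, so they can be spot-checked without a full monitoring stack; the greps below are a rough sketch, not an SLO query.

# APF saturation: queued and rejected requests per priority level
kubectl get --raw /metrics | grep -E 'apiserver_flowcontrol_current_inqueue_requests|apiserver_flowcontrol_rejected_requests_total'

# In-flight request saturation (compare against --max-requests-inflight and --max-mutating-requests-inflight)
kubectl get --raw /metrics | grep '^apiserver_current_inflight_requests'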

Workload and cluster state

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| scheduler_pending_pods by queue | Capacity or constraint failure | unschedulableQ growing for more than 5 minutes |
| workqueue_depth | Controllers falling behind | Sustained depth above 0 for node, deployment, or replicaset queues |
| Deployment readyReplicas vs spec.replicas | Rollout health and capacity | readyReplicas below minAvailable from PDB for more than 5 minutes |
| Pod phase distribution | Coarse cluster health | Unknown pods appearing; Pending sustained |
| Container restart count | Crashes, OOM kills, or probe failures | Increase above 5 in 10 minutes for production pods |
| container_cpu_cfs_throttled_periods_total | Invisible latency from CFS quota denial | Throttle ratio above 25% sustained for latency-sensitive workloads |
| Service endpoint count | Zero endpoints means zero traffic capacity | Any production Service with zero ready endpoints |
| CoreDNS latency and SERVFAIL rate | DNS failure cascades to all discovery | SERVFAIL rate above 1%; p99 latency above 500 ms |
| PVC status | Volume binding blocks stateful startup | PVC Pending or Lost for more than 5 minutes |
| apiserver_storage_objects | Object bloat drives etcd size and API memory | Any resource type growing unboundedly; events above 50,000 |
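
Two quick CLI spot checks for this table; a sketch assuming jq is available, with the restart threshold matching the warning sign above.

# Containers with more than 5 restarts
kubectl get pods -A -o json | jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.restartCount > 5) | "\($p.metadata.namespace)/\($p.metadata.name)/\(.name): \(.restartCount) restarts"'

# PVCs that are not Bound
kubectl get pvc -A --no-headers | grep -v Bound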

Network and data plane

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| kube-proxy process liveness | Stale rules and new Service failures | Process missing on any schedulable node |
| kubeproxy_sync_proxy_rules_duration_seconds | Rule programming latency | p99 above 10 seconds in iptables mode; above 1 second and growing |
| kubeproxy_sync_proxy_rules_last_timestamp_seconds | Silent watch death despite healthy process | Age above 2 times the sync period |
| Conntrack utilization (nf_conntrack_count / nf_conntrack_max) | New connections dropped at 100% utilization | Above 75%; any nonzero drop counter |
| Conntrack drops (conntrack -S) | Confirmed packet loss from table exhaustion | Any increment in the drop field |
| iptables rule count (iptables mode) | O(n) traversal and lock contention | Rule count above 20,000 |
| IPVS virtual/real server counts (IPVS mode) | Proxy programming correctness | Virtual server count diverging from Service count |
| kube-proxy API watch errors | Stale state even if process is running | Elevated rest_client_requests_total with 5xx or 429 |
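
On a suspect node, the data-plane side of this table can be checked directly. The commands below assume root on the node and the conntrack and ipvsadm tools installed; the Service count is only a rough baseline, since every port and NodePort adds its own IPVS virtual server.

# iptables mode: total programmed rules
iptables-save | wc -l

# Conntrack drop counters (reported per CPU)
conntrack -S | grep -oE '\bdrop=[0-9]+'

# IPVS mode: programmed virtual servers vs Services in the cluster
ipvsadm -Ln | grep -cE '^(TCP|UDP)'
kubectl get svc -A --no-headers | wc -l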

Fixes

If the cause is node or runtime pressure

Cordon the node. Identify evicted pods: kubectl get events -A --field-selector reason=Evicted. Check whether evicted workloads have memory requests that match actual usage. If BestEffort pods consume unbounded memory, enforce default requests and limits via LimitRange. If the runtime is slow, check for defunct or hung containerd-shim processes.
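
A minimal LimitRange sketch is below; the namespace and values are placeholders to size for your workloads, and the defaults only apply to containers that do not set their own requests or limits.

# Placeholder namespace and values; adjust before applying
kubectl apply -n production -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
EOF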

If the cause is control plane saturation

Identify the traffic source: the API server audit log records each client's user agent, and the API Priority and Fairness metrics break requests down by flow schema and priority level. If APF throttles critical controllers, increase concurrency shares for the system and leader-election priority levels. If a single operator floods the API, reduce its concurrency or add client-side rate limits. If admission webhooks are the bottleneck, verify webhook endpoint health. As a temporary mitigation only, you can change failurePolicy to Ignore, but this bypasses policy enforcement and can allow invalid or insecure objects into the cluster.
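
To find the dominant client class or the slow webhook, the API server's own metrics are usually enough; a rough sketch, not a dashboard query.

# Requests dispatched per flow schema and priority level (which class of client dominates)
kubectl get --raw /metrics | grep apiserver_flowcontrol_dispatched_requests_total

# Per-webhook admission latency (histogram sum and count)
kubectl get --raw /metrics | grep -E 'apiserver_admission_webhook_admission_duration_seconds_(sum|count)'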

If the cause is etcd latency

Move etcd data directories to dedicated SSD storage with low fsync latency. Verify etcd does not share disks with log-heavy workloads. Check compaction effectiveness: if the gap between etcd_debugging_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes is large, schedule defragmentation during a maintenance window, one member at a time. Increase --quota-backend-bytes if the database approaches its limit.
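
A defragmentation pass might look like the sketch below; run it from an etcd node, one member at a time, and expect etcdctl to need the usual --cacert/--cert/--key flags in a TLS-secured cluster.

# On-disk vs in-use size per member
etcdctl endpoint status --cluster -w table

# Defragment only the local member, then move to the next one
etcdctl defrag --endpoints=https://127.0.0.1:2379

# If the NOSPACE alarm fired, clear it after reclaiming space or raising the quota
etcdctl alarm list
etcdctl alarm disarm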

If the cause is network or conntrack exhaustion

Immediately increase nf_conntrack_max on affected nodes: sysctl -w net.netfilter.nf_conntrack_max=<2x current>. This is a temporary fix; investigate the root cause to avoid masking a connection leak. Inspect connection states with conntrack -L to identify TIME_WAIT or UDP accumulation. If kube-proxy sync duration in iptables mode exceeds 10 seconds at scale, plan a migration to IPVS or nftables mode. For UDP services in IPVS mode, reduce the UDP timeout to prevent dead backend stickiness.
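
The follow-up investigation and the persistent version of the sysctl change could look like this; the limit value is an example rather than a recommendation, and the conntrack tool must be installed on the node.

# What is filling the table: UDP entries and TIME_WAIT accumulation are the usual suspects
conntrack -L -p udp 2>/dev/null | wc -l
conntrack -L 2>/dev/null | grep -c TIME_WAIT

# Make the raised limit survive reboots
echo 'net.netfilter.nf_conntrack_max = 524288' > /etc/sysctl.d/90-conntrack.conf
sysctl --system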

If the cause is workload misconfiguration

Set resource requests and limits for all production pods. Remove CPU limits to avoid CFS throttling if your cluster can tolerate the burst, but keep memory limits to contain blast radius. Configure PodDisruptionBudgets with minAvailable below total replicas so node drains can proceed. Ensure liveness probes detect deadlocks while readiness probes detect a temporary inability to serve traffic; do not configure them identically.
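
For the PodDisruptionBudget, a minimal sketch for a hypothetical three-replica Deployment labeled app=web; minAvailable: 2 keeps the service up while still letting a drain evict one pod at a time.

# Hypothetical selector and replica count; adjust to your Deployment
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF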

Prevention

  • Monitor leading indicators, not just lagging ones. PLEG relist duration, etcd WAL fsync, and conntrack ratio warn early. Node Ready and pod eviction are late.
  • Forward Kubernetes Events to persistent storage. Events have a one-hour TTL. Without external forwarding, you lose the incident narrative.
  • Alert on certificate TTL at 30 days. Control plane, kubelet, and webhook certificates expire on predictable schedules; auto-rotation fails silently.
  • Enforce resource hygiene. Use LimitRange and ResourceQuota to prevent BestEffort pods from entering the cluster. The scheduler sees zero requests and over-packs nodes.
  • Right-size kube-proxy for scale. If you run more than a few thousand Services, iptables mode degrades. Size for IPVS or nftables before you hit the cliff.
  • Audit admission webhooks quarterly. Verify failurePolicy, namespaceSelector exclusions for kube-system, and timeout values. A webhook matching all resources with failurePolicy: Fail is a cluster-wide single point of failure.
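
For the webhook and certificate audits above, two quick checks; the jq filter is a sketch that assumes jq is installed, and the certificate command applies to kubeadm-built control planes.

# Webhooks that fail closed: each one is a potential cluster-wide single point of failure
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json | jq -r '.items[] | .metadata.name as $c | .webhooks[]? | select(.failurePolicy == "Fail") | "\($c)/\(.name): timeout=\(.timeoutSeconds)s"'

# Control plane certificate expiry (kubeadm clusters)
kubeadm certs check-expiration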

How Netdata helps

  • Correlates node-level PSI stall metrics with kubelet PLEG relist latency and CRI operation duration to surface runtime slowdown before nodes go NotReady.
  • Tracks conntrack utilization, API server request latency, and APF rejection rates on one timeline to expose network and control plane saturation.
  • Surfaces container-level CPU throttling, memory pressure, and OOM kills alongside pod restart counts and deployment readiness gaps.
  • Maps etcd disk latency and leader changes to scheduler pending pods and controller work queue depth to show control plane cascades.