Kubernetes monitoring maturity model: from survival to expert
Kubernetes failures rarely announce themselves. A slow etcd disk cascades into API latency, which backs up controller workqueues, which delays pod scheduling, which triggers autoscaling, which amplifies the load that caused the original latency spike. Without the right signals, you will debug the autoscaling event while the real problem is a WAL fsync that crossed 100ms ten minutes earlier.
A monitoring maturity model is not a tooling checklist. It is a coverage framework that tells you which signals you are missing and what those gaps cost you during an incident. The four levels below map the progression from “Is the cluster on fire?” to “Why did that single pod take three milliseconds longer to start on node twelve?” Each level assumes the previous one and adds signals that shorten your mean time to detection and your mean time to root cause.
flowchart TB
subgraph Expert["Level 4 - Expert"]
direction TB
E1["Watch delivery latency"]
E2["etcd peer RTT / WAL fsync distribution"]
E3["Per-flow-schema APF metrics"]
E4["CRI / runtime internal metrics"]
E5["Post-GC memory baseline / object churn"]
end
subgraph Mature["Level 3 - Mature"]
direction TB
M1["APF concurrency / watch rates"]
M2["GC pause / request termination"]
M3["Webhook fail-open / audit drops"]
M4["PV inode usage / conntrack state"]
M5["Endpoint-to-rule consistency"]
end
subgraph Operational["Level 2 - Operational"]
direction TB
O1["Latency by verb / request rate"]
O2["etcd fsync / DB size"]
O3["Controller workqueue / scheduler queues"]
O4["Webhook latency / APF queue depth"]
O5["Node pressure / PLEG relist"]
end
subgraph Survival["Level 1 - Survival"]
direction TB
S1["API server /livez /readyz"]
S2["Node Ready / etcd health"]
S3["CrashLoopBackOff / 5xx rate"]
S4["Certificate expiry / conntrack usage"]
end
Survival --> Operational --> Mature --> Expert
Survival
What it gives you: Binary health. At this level you know whether the control plane is responding, whether nodes are reachable, and whether workloads are crash-looping. You will catch total outages and certificate expirations before they become lockouts.
What stays missing: You cannot distinguish a slow API server from a dead one. You cannot see etcd disk pressure building, admission webhook latency climbing, or a kubelet PLEG stall that is about to mark a healthy node as NotReady. Most partial failures and all leading indicators are invisible.
Signals it adds:
- API server /livez and /readyz endpoint status
- etcd health endpoint and leader existence
- Node Ready condition across the fleet
- Container restart counts and CrashLoopBackOff detection
- Cluster-wide 5xx error rate from the API server
- Control plane and kubelet certificate expiration windows
- kube-proxy process liveness and healthz status
- Conntrack table utilization versus nf_conntrack_max
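The conntrack signal in the list above reduces to a ratio of two kernel counters. The following is a minimal sketch, assuming a Linux node that exposes the standard procfs files `nf_conntrack_count` and `nf_conntrack_max`. The function names and the 75% and 90% thresholds are illustrative assumptions, not recommended values.

```python
from pathlib import Path

# Standard Linux procfs locations for the netfilter conntrack counters.
CONNTRACK_COUNT = Path("/proc/sys/net/netfilter/nf_conntrack_count")
CONNTRACK_MAX = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_utilization(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def classify(utilization: float, warn: float = 0.75, crit: float = 0.90) -> str:
    """Map a utilization ratio to a status. Thresholds are illustrative."""
    if utilization >= crit:
        return "critical"
    if utilization >= warn:
        return "warning"
    return "ok"


def read_node_utilization() -> float:
    """Read the live counters from procfs on the current node."""
    count = int(CONNTRACK_COUNT.read_text().strip())
    maximum = int(CONNTRACK_MAX.read_text().strip())
    return conntrack_utilization(count, maximum)
```

On a node, `classify(read_node_utilization())` would give a coarse status. Keeping the arithmetic separate from the file reads makes the logic easy to test away from a cluster.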
Operational
What it gives you: Coverage of the critical path. You can diagnose most production incidents because you are watching the verbs, resources, and components that carry user traffic and control plane state. You will spot saturation before it becomes rejection and latency before it becomes timeout.
What stays missing: You see queue depth but not queue fairness. You see API latency but not whether it is caused by a webhook, etcd, or RBAC evaluation. You see node pressure but not whether the kubelet sync loop is falling behind. Composite patterns remain hidden until they fully mature.
Signals it adds:
- API request latency broken down by verb and resource type
- Request rate by response code, including 429 rejections
- Inflight requests versus configured limits
- etcd WAL fsync latency and database size against quota
- Admission webhook latency per webhook configuration
- API Priority and Fairness queue depth per priority level
- Controller workqueue depth and scheduler pending pod queues
- Node allocatable versus requested resources
- kube-proxy sync duration and iptables or IPVS rule count
- Kubelet PLEG relist duration and sync loop latency
- CoreDNS request latency and error rate
- Container image pull duration and runtime operation errors
- Pod eviction events and OOM kill detection
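Most of the latency signals above are exposed as Prometheus-style histograms (for example, the API server publishes request durations as cumulative buckets in `apiserver_request_duration_seconds`). As a rough illustration of how a percentile is estimated from such buckets, here is a sketch that uses linear interpolation within the matching bucket, in the spirit of PromQL's `histogram_quantile`. The function name and the sample bucket values are assumptions made for this example.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs and should
    include the (math.inf, total_count) bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the +Inf bucket; report the
                # highest finite bound instead.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets for one verb: (upper bound in seconds, cumulative count).
get_latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, get_latency)
```

Because the estimate assumes an even spread inside each bucket, its accuracy depends on bucket boundaries. A p99 that lands in a wide bucket is only a coarse bound.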
Mature
What it gives you: Leading indicators and composite pattern detection. You stop reacting to symptoms and start observing the mechanisms that produce them. You can detect a re-list storm, a webhook death spiral, or conntrack exhaustion before user traffic is affected.
What stays missing: The finest-grained forensic signals remain unavailable. You know that GC pressure is rising but not which heap profile to inspect. You know that APF is rejecting requests but not which specific flow schema is starving leader election. You know rules are stale but not the exact iptables chain that is orphaned.
Signals it adds:
- APF concurrency utilization and rejection count per priority level
- API server watch count, watch event rate, and event sizes
- Go GC pause duration and memory fragmentation ratio
- API request termination rate and response size distribution
- Admission webhook fail-open event count
- Audit log throughput and drop rate
- etcd compaction effectiveness and MVCC revision rate
- Per-pod CPU CFS throttling ratio
- PersistentVolume utilization and inode exhaustion
- Node pod count versus kubelet max-pods limit
- kube-proxy endpoint-to-rule consistency and conntrack state distribution
- Kubelet running versus desired pod count
- Volume mount latency and stuck attach detection
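Several signals in this list are ratios derived from pairs of counters. The CFS throttling ratio is a representative case: it is the share of enforcement periods in which a container was throttled. The sketch below assumes two samples of the counters that cAdvisor exposes as `container_cpu_cfs_periods_total` and `container_cpu_cfs_throttled_periods_total` (the cgroup `cpu.stat` fields `nr_periods` and `nr_throttled`). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CfsSample:
    periods: int            # enforcement periods elapsed (monotonic counter)
    throttled_periods: int  # periods in which the container was throttled


def throttling_ratio(earlier: CfsSample, later: CfsSample) -> float:
    """Return the fraction of periods throttled between two samples."""
    delta_periods = later.periods - earlier.periods
    delta_throttled = later.throttled_periods - earlier.throttled_periods
    if delta_periods < 0 or delta_throttled < 0:
        # A counter went backwards, typically after a container restart.
        raise ValueError("counter reset between samples")
    if delta_periods == 0:
        # No CFS periods elapsed, so nothing could have been throttled.
        return 0.0
    return delta_throttled / delta_periods


# Hypothetical samples taken some interval apart for one container.
ratio = throttling_ratio(CfsSample(1000, 100), CfsSample(1600, 250))
```

Working from deltas rather than raw totals matters here: a container that was throttled heavily an hour ago but not recently would otherwise keep reporting a high lifetime ratio.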
Expert
What it gives you: Deep forensic capability and subtle failure detection. After your third major incident, these are the signals that explain why a node behaved differently from its peers, why a certificate rotation failed silently for six hours, or why a single Service caused a cross-AZ latency spike. You can reconstruct causality from metrics alone.
What stays missing: Nothing operationally critical. The gap at this level is cost and complexity. Cardinality is high, baselining is workload-specific, and the operational burden of maintaining thresholds across hundreds of fine-grained signals is significant.
Signals it adds:
- Client-side watch delivery latency and per-flow-schema request rate
- etcd peer round-trip time and WAL fsync duration distribution
- RBAC evaluation latency per request path
- Envelope encryption DEK cache fill and miss rate
- HTTP/2 GOAWAY event rate and TLS handshake latency
- Cross-instance API server load distribution in HA deployments
- Aggregated API server health and audit webhook latency
- etcd Raft proposal latency and failed proposal count
- Per-resource-type object churn rate
- Post-GC baseline memory trend and goroutine growth rate
- kubelet per-pod sync latency and detailed PLEG percentiles
- Container runtime internal operation latency by CRI type
- kube-proxy iptables-restore execution time and rule hit counters
- IPVS connection table size and UDP conntrack age distribution
- kubelet file descriptor and inotify watch consumption
- Inter-node behavioral comparison for configuration drift
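The inter-node comparison in the list above can be approached in several ways. One possible sketch, not a prescribed method, flags nodes whose value for a single metric sits far from the fleet median, measured in median absolute deviations (MAD). The function name, the 3.5 cutoff, and the sample values are assumptions chosen for illustration.

```python
from statistics import median


def outlier_nodes(values: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return nodes whose metric deviates strongly from the fleet median."""
    if len(values) < 3:
        return []  # too few nodes to define "normal" behavior
    center = median(values.values())
    deviations = {node: abs(v - center) for node, v in values.items()}
    mad = median(deviations.values())
    if mad == 0:
        # At least half the fleet is identical; any difference stands out.
        return sorted(node for node, d in deviations.items() if d > 0)
    # 0.6745 scales the MAD so scores are comparable to standard deviations
    # for roughly normal data.
    return sorted(
        node for node, d in deviations.items() if 0.6745 * d / mad > threshold
    )


# Hypothetical PLEG relist p99 values in seconds, keyed by node name.
relist_p99 = {
    "node-1": 1.0,
    "node-2": 1.1,
    "node-3": 0.9,
    "node-4": 1.0,
    "node-12": 4.0,
}
suspects = outlier_nodes(relist_p99)
```

A median-based score is used instead of mean and standard deviation because a single badly behaved node would otherwise inflate the spread and hide itself.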
Choosing the right level for your team
Small clusters running a single tenant with forgiving SLAs can operate at Survival plus a handful of Operational signals. The cost of missing a controller workqueue depth metric is low when you have twenty nodes and one operator.
Move to Operational when you have more than one person on call, when you run admission webhooks, or when you manage your own etcd. The breakpoint is usually the first time you spend an hour debugging API latency without knowing whether etcd or a webhook is the bottleneck.
Move to Mature when you run multi-tenant namespaces, autoscaling, or stateful workloads. The composite patterns at this level, such as conntrack exhaustion, re-list storms, and APF starvation, are the ones that cause 3 a.m. pages in clusters with real traffic.
Move to Expert after a major incident that you could not fully explain, or when you operate at a scale where the default thresholds in managed services no longer apply. Do not jump to Expert for novelty. The cardinality of per-flow-schema and per-chain metrics can overwhelm your monitoring backend if you are not prepared to curate them.
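The guidance in this section can be read as a small decision rule. The sketch below encodes it directly, returning the highest level whose trigger applies. The `ClusterProfile` fields and the function name are hypothetical, and the rule is only as precise as the prose it mirrors, so treat the result as a starting point rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class ClusterProfile:
    oncall_engineers: int = 1
    runs_admission_webhooks: bool = False
    manages_own_etcd: bool = False
    multi_tenant: bool = False
    uses_autoscaling: bool = False
    runs_stateful_workloads: bool = False
    unexplained_major_incident: bool = False
    beyond_managed_defaults: bool = False  # scale where default thresholds fail


def recommend_level(profile: ClusterProfile) -> str:
    """Return the highest maturity level whose trigger applies."""
    if profile.unexplained_major_incident or profile.beyond_managed_defaults:
        return "Expert"
    if (
        profile.multi_tenant
        or profile.uses_autoscaling
        or profile.runs_stateful_workloads
    ):
        return "Mature"
    if (
        profile.oncall_engineers > 1
        or profile.runs_admission_webhooks
        or profile.manages_own_etcd
    ):
        return "Operational"
    return "Survival"


level = recommend_level(ClusterProfile(oncall_engineers=3, uses_autoscaling=True))
```

The checks run from the top level down, so a cluster that matches several triggers gets the most demanding recommendation.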
How Netdata helps
- Netdata collects kubelet, kube-proxy, and API server metrics at one-second resolution, catching brief PLEG stalls and sync latency spikes that longer scrape intervals miss.
- Parent-child streaming aggregates per-node signals into fleet-wide views without losing granular detail, so you can compare conntrack utilization or sync duration across every node from a single dashboard.
- Pre-built collectors for Kubernetes /metrics endpoints surface APF queue depths, workqueue lengths, and container runtime operation latencies without manual PromQL construction.
- High-resolution CPU and memory correlation on the same timeline as kubelet and API server metrics makes it easier to distinguish “kubelet is slow” from “the node is throttled” during an incident.






