Kubernetes monitoring maturity model: from survival to expert

Kubernetes failures rarely announce themselves. A slow etcd disk cascades into API latency, which backs up controller workqueues, which delays pod scheduling, which triggers autoscaling, which amplifies the load that caused the original latency spike. Without the right signals, you will debug the autoscaling event while the real problem is WAL fsync latency that crossed 100 ms ten minutes earlier.

A monitoring maturity model is not a tooling checklist. It is a coverage framework that tells you which signals you are missing and what those gaps cost you during an incident. The four levels below map the progression from “Is the cluster on fire?” to “Why did that single pod take three milliseconds longer to start on node twelve?” Each level assumes the previous one and adds signals that shorten your mean time to detection and your mean time to root cause.

flowchart TB
    subgraph Expert["Level 4 - Expert"]
        direction TB
        E1["Watch delivery latency"]
        E2["etcd peer RTT / WAL fsync distribution"]
        E3["Per-flow-schema APF metrics"]
        E4["CRI / runtime internal metrics"]
        E5["Post-GC memory baseline / object churn"]
    end
    subgraph Mature["Level 3 - Mature"]
        direction TB
        M1["APF concurrency / watch rates"]
        M2["GC pause / request termination"]
        M3["Webhook fail-open / audit drops"]
        M4["PV inode usage / conntrack state"]
        M5["Endpoint-to-rule consistency"]
    end
    subgraph Operational["Level 2 - Operational"]
        direction TB
        O1["Latency by verb / request rate"]
        O2["etcd fsync / DB size"]
        O3["Controller workqueue / scheduler queues"]
        O4["Webhook latency / APF queue depth"]
        O5["Node pressure / PLEG relist"]
    end
    subgraph Survival["Level 1 - Survival"]
        direction TB
        S1["API server /livez /readyz"]
        S2["Node Ready / etcd health"]
        S3["CrashLoopBackOff / 5xx rate"]
        S4["Certificate expiry / conntrack usage"]
    end
    Survival --> Operational --> Mature --> Expert

Survival

What it gives you: Binary health. At this level you know whether the control plane is responding, whether nodes are reachable, and whether workloads are crash-looping. You will catch total outages and certificate expirations before they become lockouts.

What stays missing: You cannot distinguish a slow API server from a dead one. You cannot see etcd disk pressure building, admission webhook latency climbing, or a kubelet PLEG stall that is about to mark a healthy node as NotReady. Most partial failures and all leading indicators are invisible.

Signals it adds:

API server /livez and /readyz endpoint status
etcd health endpoint and leader existence
Node Ready condition across the fleet
Container restart counts and CrashLoopBackOff detection
Cluster-wide 5xx error rate from the API server
Control plane and kubelet certificate expiration windows
kube-proxy process liveness and healthz status
Conntrack table utilization versus nf_conntrack_max
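
As a rough sketch of the last signal in this list, the snippet below computes conntrack table utilization from the two standard Linux procfs counters. The function names and the 75% and 90% thresholds are illustrative assumptions, not values taken from any Kubernetes component; choose thresholds that match your own traffic profile.

```python
from pathlib import Path

# Standard Linux netfilter procfs entries. They exist only when the
# nf_conntrack kernel module is loaded on the node.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_utilization(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table in use (0.0 to 1.0)."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def classify(utilization: float, warn: float = 0.75, crit: float = 0.90) -> str:
    """Map utilization to a status. Thresholds are assumptions for illustration."""
    if utilization >= crit:
        return "critical"
    if utilization >= warn:
        return "warning"
    return "ok"


def read_conntrack_utilization() -> float:
    """Read both counters from procfs on the local node."""
    count = int(COUNT_PATH.read_text().strip())
    maximum = int(MAX_PATH.read_text().strip())
    return conntrack_utilization(count, maximum)
```

Keeping the arithmetic separate from the procfs read makes the calculation easy to check without a live node.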

Operational

What it gives you: Coverage of the critical path. You can diagnose most production incidents because you are watching the verbs, resources, and components that carry user traffic and control plane state. You will spot saturation before it becomes rejection and latency before it becomes timeout.

What stays missing: You see queue depth but not queue fairness. You see API latency but not whether it is caused by a webhook, etcd, or RBAC evaluation. You see node pressure but not whether the kubelet sync loop is falling behind. Composite failure patterns remain hidden until they have fully developed.

Signals it adds:

API request latency broken down by verb and resource type
Request rate by response code, including 429 rejections
Inflight requests versus configured limits
etcd WAL fsync latency and database size against quota
Admission webhook latency per webhook configuration
API Priority and Fairness queue depth per priority level
Controller workqueue depth and scheduler pending pod queues
Node allocatable versus requested resources
kube-proxy sync duration and iptables or IPVS rule count
Kubelet PLEG relist duration and sync loop latency
CoreDNS request latency and error rate
Container image pull duration and runtime operation errors
Pod eviction events and OOM kill detection
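
Latency signals such as etcd WAL fsync and API request duration are exposed as Prometheus histograms, so a percentile has to be estimated from cumulative buckets. The sketch below shows one way to do that with linear interpolation inside the matching bucket, similar in spirit to PromQL's histogram_quantile. The function name is hypothetical and the bucket counts are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +Inf, as in the
    Prometheus exposition format.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must end with a +Inf upper bound")

    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations, no meaningful quantile

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound instead of infinity.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative (made-up) cumulative counts for a WAL fsync histogram, in seconds.
wal_fsync_buckets = [
    (0.008, 900), (0.016, 960), (0.032, 990),
    (0.064, 998), (0.128, 1000), (math.inf, 1000),
]
p99_seconds = histogram_quantile(0.99, wal_fsync_buckets)
```

The estimate is only as precise as the bucket boundaries, which is one reason the Expert level later asks for the full fsync duration distribution rather than a single percentile.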

Mature

What it gives you: Leading indicators and composite pattern detection. You stop reacting to symptoms and start observing the mechanisms that produce them. You can detect a re-list storm, a webhook death spiral, or conntrack exhaustion before user traffic is affected.

What stays missing: The finest-grained forensic signals remain unavailable. You know that GC pressure is rising but not which heap profile to inspect. You know that APF is rejecting requests but not which specific flow schema is starving leader election. You know rules are stale but not the exact iptables chain that is orphaned.

Signals it adds:

APF concurrency utilization and rejection count per priority level
API server watch count, watch event rate, and event sizes
Go GC pause duration and memory fragmentation ratio
API request termination rate and response size distribution
Admission webhook fail-open event count
Audit log throughput and drop rate
etcd compaction effectiveness and MVCC revision rate
Per-pod CPU CFS throttling ratio
PersistentVolume utilization and inode exhaustion
Node pod count versus kubelet max-pods limit
kube-proxy endpoint-to-rule consistency and conntrack state distribution
Kubelet running versus desired pod count
Volume mount latency and stuck attach detection
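
Several Mature signals are ratios of two counters rather than raw values. As a minimal sketch, the per-pod CFS throttling ratio can be derived from two samples of the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. The CfsSample type and the function name are hypothetical, and how you scrape the two samples is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CfsSample:
    """One scrape of the two cAdvisor CFS counters for a container."""
    periods: float            # container_cpu_cfs_periods_total
    throttled_periods: float  # container_cpu_cfs_throttled_periods_total


def throttling_ratio(earlier: CfsSample, later: CfsSample) -> Optional[float]:
    """Fraction of CFS periods in which the container was throttled.

    Returns None when a counter went backwards, which usually means the
    container restarted between the two scrapes.
    """
    delta_periods = later.periods - earlier.periods
    delta_throttled = later.throttled_periods - earlier.throttled_periods
    if delta_periods < 0 or delta_throttled < 0:
        return None  # counter reset: skip this interval
    if delta_periods == 0:
        return 0.0  # container was idle or has no CPU limit
    return delta_throttled / delta_periods
```

Working from counter deltas rather than lifetime totals keeps the ratio sensitive to recent behavior, which is what makes it useful as a leading indicator.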

Expert

What it gives you: Deep forensic capability and subtle failure detection. These are the signals you wish you had by your third major incident: they explain why a node behaved differently from its peers, why a certificate rotation failed silently for six hours, or why a single Service caused a cross-AZ latency spike. You can reconstruct causality from metrics alone.

What stays missing: Nothing operationally critical. The gap at this level is cost and complexity. Cardinality is high, baselining is workload-specific, and the operational burden of maintaining thresholds across hundreds of fine-grained signals is significant.

Signals it adds:

Client-side watch delivery latency and per-flow-schema request rate
etcd peer round-trip time and WAL fsync duration distribution
RBAC evaluation latency per request path
Envelope encryption DEK cache fill and miss rate
HTTP/2 GOAWAY event rate and TLS handshake latency
Cross-instance API server load distribution in HA deployments
Aggregated API server health and audit webhook latency
etcd Raft proposal latency and failed proposal count
Per-resource-type object churn rate
Post-GC baseline memory trend and goroutine growth rate
kubelet per-pod sync latency and detailed PLEG percentiles
Container runtime internal operation latency by CRI type
kube-proxy iptables-restore execution time and rule hit counters
IPVS connection table size and UDP conntrack age distribution
kubelet file descriptor and inotify watch consumption
Inter-node behavioral comparison for configuration drift
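
Inter-node behavioral comparison comes down to asking which node deviates from its peers on the same metric. One possible approach, shown below as a sketch, is a modified z-score based on the median absolute deviation, which a single bad node cannot skew the way it skews a mean. The function name, the node names, and the sample values are hypothetical; the 0.6745 constant and the 3.5 cutoff are the commonly cited defaults for this statistic, not values from the source.

```python
from statistics import median
from typing import Dict, List


def peer_outliers(values: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return the nodes whose value deviates strongly from the fleet median."""
    if len(values) < 3:
        return []  # too few peers for a meaningful comparison

    fleet_median = median(values.values())
    mad = median(abs(v - fleet_median) for v in values.values())
    if mad == 0:
        # More than half the fleet reports the same value, so the score is
        # undefined. This sketch reports nothing rather than guessing.
        return []

    return sorted(
        node
        for node, value in values.items()
        if abs(0.6745 * (value - fleet_median) / mad) > threshold
    )


# Hypothetical p99 PLEG relist durations per node, in seconds.
pleg_p99 = {
    "node-01": 0.9,
    "node-02": 1.0,
    "node-03": 1.1,
    "node-04": 1.0,
    "node-12": 4.0,
}
suspects = peer_outliers(pleg_p99)
```

The same function applies to any per-node metric in this list, such as file descriptor consumption or iptables-restore execution time, provided the nodes being compared are expected to behave alike.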

Choosing the right level for your team

Small clusters running a single tenant with forgiving SLAs can operate at Survival plus a handful of Operational signals. The cost of missing a controller workqueue depth metric is low when you have twenty nodes and one operator.

Move to Operational when you have more than one person on-call, when you run admission webhooks, or when you manage your own etcd. The breakpoint is usually the first time you spend an hour debugging API latency without knowing whether etcd or a webhook is the bottleneck.

Move to Mature when you run multi-tenant namespaces, autoscaling, or stateful workloads. The composite patterns at this level, such as conntrack exhaustion, re-list storms, and APF starvation, are the ones that cause 3 a.m. pages in clusters with real traffic.

Move to Expert after a major incident that you could not fully explain, or when you operate at a scale where the default thresholds in managed services no longer apply. Do not jump to Expert for novelty. The cardinality of per-flow-schema and per-chain metrics can overwhelm your monitoring backend if you are not prepared to curate them.
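
The guidance above can be condensed into a small decision function. This is only a sketch of the triggers named in this section, with hypothetical field names; it ignores budget, staffing, and backend capacity, which matter just as much in practice.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    """Hypothetical description of a team and its cluster."""
    on_call_people: int = 1
    runs_admission_webhooks: bool = False
    self_managed_etcd: bool = False
    multi_tenant: bool = False
    autoscaling: bool = False
    stateful_workloads: bool = False
    unexplained_major_incident: bool = False
    outgrown_managed_defaults: bool = False


def recommended_level(profile: TeamProfile) -> str:
    """Return the highest maturity level whose trigger applies."""
    if profile.unexplained_major_incident or profile.outgrown_managed_defaults:
        return "Expert"
    if profile.multi_tenant or profile.autoscaling or profile.stateful_workloads:
        return "Mature"
    if (
        profile.on_call_people > 1
        or profile.runs_admission_webhooks
        or profile.self_managed_etcd
    ):
        return "Operational"
    # Small single-tenant clusters: Survival plus a few Operational signals.
    return "Survival"
```

Because each level assumes the previous one, a result of "Mature" means collecting the Survival and Operational signals as well.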

How Netdata helps

Netdata collects kubelet, kube-proxy, and API server metrics at one-second resolution, catching brief PLEG stalls and sync latency spikes that longer scrape intervals miss.
Parent-child streaming aggregates per-node signals into fleet-wide views without losing granular detail, so you can compare conntrack utilization or sync duration across every node from a single dashboard.
Pre-built collectors for Kubernetes /metrics endpoints surface APF queue depths, workqueue lengths, and container runtime operation latencies without manual PromQL construction.
High-resolution CPU and memory correlation on the same timeline as kubelet and API server metrics makes it easier to distinguish “kubelet is slow” from “the node is throttled” during an incident.

