Kafka monitoring maturity model: from survival to expert

Broker logs and a single health check are not enough for production Kafka. Monitoring everything at once creates noise and hides the signals that prevent outages. This model gives you a prioritized path: start with telemetry that prevents total data loss, then add layers that catch degradation before it becomes an incident.

Use this as an onboarding checklist for new clusters, an incident reference, and a roadmap when you have instrumentation budget. Signals come from Kafka JMX, OS metrics, and admin APIs. In KRaft mode (mandatory in Kafka 4.0+), replace ZooKeeper-specific signals with the corresponding KRaft quorum metrics.

Start at Level 1. Most clusters should reach Level 2 within weeks of launch and Level 3 within the first quarter. Level 4 is for large clusters or strict SLOs.

flowchart TD
    L1["Level 1: survival<br/>Prevent total outage"]
    L2["Level 2: operational<br/>Explain degradation"]
    L3["Level 3: mature<br/>Catch silent failures"]
    L4["Level 4: expert<br/>Expose edge cases"]

    L1 --> L2
    L2 --> L3
    L3 --> L4

Each level assumes the previous one is fully instrumented and alerted. Do not add Level 2 signals until Level 1 paging is reliable. Level 3 and 4 signals are useful only when the team has time to investigate leading indicators instead of fighting outages.

Level 1 – survival

Level 1 is the minimum. These signals distinguish a running process from a broker that is actually serving requests. If any fire outside a maintenance window, you have an active or imminent outage.

Broker process liveness. Dead brokers make their leader partitions unreachable and stop replication by followers. Verify process presence, systemd state, and the configured listener port (default 9092). JMX connectivity confirms the broker is responsive, not just running.
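UnderReplicatedPartitions. Nonzero means at least one follower has fallen out of the in-sync replica set, so the affected partitions have fewer up-to-date copies than configured and less tolerance for another failure. This is the most important Kafka metric.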
OfflinePartitionsCount. Nonzero means partitions have no leader and are unavailable for reads and writes.
ActiveControllerCount. Exactly one broker (or controller node in KRaft mode) must report 1, so the cluster-wide sum is 1. Zero means no metadata operations can proceed; more than one risks split-brain.
Consumer lag for critical groups. Growing lag means consumers are falling behind producers. If lag exceeds retention, data is lost from the consumer’s perspective.
Disk space on log.dirs volumes. Kafka does not degrade gracefully when disk is full. It shuts down the log directory or crashes.
JVM heap utilization and Full GC. Heap exhaustion or multi-second Full GC pauses cause broker unavailability and ZooKeeper session expiry in ZK mode.
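The Level 1 conditions above can be sketched as a small evaluation function. This is a minimal illustration, not a production alerting rule: the `BrokerSnapshot` shape, the function names, and the 85% disk threshold are all assumptions made for the example, and the snapshot values are expected to come from whatever JMX or OS collector you already run.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BrokerSnapshot:
    """One scrape of Level 1 signals for a single broker (hypothetical shape)."""
    broker_id: int
    under_replicated_partitions: int
    offline_partitions_count: int
    active_controller_count: int
    disk_used_fraction: float  # 0.0-1.0 for the fullest log.dirs volume


def level1_alerts(snapshots: Sequence[BrokerSnapshot],
                  disk_threshold: float = 0.85) -> List[str]:
    """Return alert messages for the Level 1 conditions.

    disk_threshold is an assumed value; tune it to your retention and growth rate.
    """
    alerts: List[str] = []
    for s in snapshots:
        if s.under_replicated_partitions > 0:
            alerts.append(
                f"broker {s.broker_id}: "
                f"{s.under_replicated_partitions} under-replicated partitions")
        if s.offline_partitions_count > 0:
            alerts.append(
                f"broker {s.broker_id}: "
                f"{s.offline_partitions_count} offline partitions")
        if s.disk_used_fraction >= disk_threshold:
            alerts.append(
                f"broker {s.broker_id}: "
                f"log.dirs volume {s.disk_used_fraction:.0%} full")
    # ActiveControllerCount is judged across the cluster, not per broker.
    controllers = sum(s.active_controller_count for s in snapshots)
    if controllers != 1:
        alerts.append(
            f"cluster: {controllers} active controllers (expected exactly 1)")
    return alerts


def lag_is_growing(samples: Sequence[int]) -> bool:
    """True when every lag sample is larger than the one before it."""
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:]))
```

Summing `active_controller_count` across snapshots reflects the rule that the check is cluster-wide: a single broker reporting 0 is normal, while a cluster total of 0 or 2 is not. `lag_is_growing` is deliberately strict; a real rule would likely tolerate short dips within a longer upward trend.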

Level 2 – operational

Level 2 adds signals that explain why degradation is happening. Without these, Level 1 metrics reveal incidents only after impact has started.

IsrShrinksPerSec and IsrExpandsPerSec. Sustained shrinks mean replicas are falling behind. Paired shrinks and expansions indicate ISR flapping from intermittent follower health issues.
RequestHandlerAvgIdlePercent. Below 0.3 means I/O thread saturation; below 0.1 means active overload and request queuing.
NetworkProcessorAvgIdlePercent. Below 0.3 indicates network thread saturation, which blocks all connections including metadata requests.
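Produce and FetchConsumer request latency breakdown. TotalTimeMs alone is not actionable. Break it down into RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs. LocalTimeMs spikes indicate disk I/O problems. On produce requests with acks=all, RemoteTimeMs spikes point to slow followers. On fetch requests, RemoteTimeMs includes the time the broker waits for fetch.min.bytes to accumulate (up to fetch.max.wait.ms), so a nonzero value there is normal.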
BytesInPerSec and BytesOutPerSec per broker. Inbound volume plus replication and consumer fan-out tells you when network egress is approaching NIC capacity.
MessagesInPerSec. Divide BytesInPerSec by MessagesInPerSec to get average message size; shifts in that ratio or in the message rate itself reveal producer behavior changes.
Disk I/O latency (await). Elevated await on log directories drives LocalTimeMs spikes and thread blocking.
Open file descriptors versus limit. Hitting the FD limit causes immediate segment-open and connection-accept failures. Production deployments should set limits above 100,000.
Connection count per broker. Connection storms exhaust FDs and network threads, especially with TLS enabled.
Consumer group state. Groups stuck in PreparingRebalance or CompletingRebalance for more than 5 minutes indicate a rebalance storm.
UnderMinIsrPartitionCount. Nonzero means partitions are actively rejecting acks=all produce requests. This confirms write-path impact that UnderReplicatedPartitions alone cannot measure.
UncleanLeaderElectionsPerSec. Any nonzero value means an out-of-sync replica was elected leader, so treat it as confirmed data loss. Alert on OneMinuteRate or delta(Count), not the cumulative Count.
FailedProduceRequestsPerSec and FailedFetchRequestsPerSec. Measure producer-visible and consumer-visible broker errors directly.
ZooKeeper request latency (ZK mode) or KRaft quorum health. Metadata store latency slows controller operations, ISR updates, and leader elections.
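Several of the Level 2 rules above reduce to simple arithmetic on scraped values. The sketch below shows four of them: classifying idle percent against the 0.3 and 0.1 thresholds, finding the dominant component in a latency breakdown, deriving average message size, and computing delta(Count) for a cumulative counter. The function names and the returned labels are hypothetical, chosen only for this example; the inputs are assumed to be values you have already collected from JMX.

```python
from typing import Dict, Optional

# Latency components named in the breakdown above; values are milliseconds.
COMPONENTS = (
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
)


def classify_idle(idle_fraction: float) -> str:
    """Map an idle fraction (0.0-1.0) to the thresholds used in this model."""
    if idle_fraction < 0.1:
        return "overloaded"
    if idle_fraction < 0.3:
        return "saturated"
    return "ok"


def dominant_component(breakdown: Dict[str, float]) -> str:
    """Return the component contributing the most time; missing keys count as 0."""
    return max(COMPONENTS, key=lambda name: breakdown.get(name, 0.0))


def average_message_size(bytes_in_per_sec: float,
                         messages_in_per_sec: float) -> Optional[float]:
    """BytesInPerSec / MessagesInPerSec, or None when no messages arrive."""
    if messages_in_per_sec <= 0:
        return None
    return bytes_in_per_sec / messages_in_per_sec


def counter_delta(previous: int, current: int) -> int:
    """Increase in a cumulative Count between two scrapes.

    A lower current value is assumed to mean the broker restarted and the
    counter reset, so the whole current value is treated as new.
    """
    if current < previous:
        return current
    return current - previous
```

For example, `dominant_component({"LocalTimeMs": 40.0, "RemoteTimeMs": 5.0})` returns `"LocalTimeMs"`, which by the rule above points at disk I/O. Alerting on `counter_delta(...) > 0` for the unclean election Count fires once per new event instead of staying red forever on the cumulative value.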

Level 3 – mature

Level 3 introduces leading indicators. Watch queues, internal thread pools, and OS resources to find problems before they reach the request path. This catches silent failures like a dead log cleaner or page cache thrashing that standard health checks miss.

RequestQueueSize and ResponseQueueSize. A RequestQueueSize that stays above 50% of queued.max.requests (default 500), or a ResponseQueueSize that keeps growing, indicates saturation before idle percent drops.
Produce and Fetch purgatory size. Growing produce purgatory means acks=all requests are stuck waiting for replication. Unbounded fetch purgatory growth can indicate stalled consumers.
ControllerEventQueueSize. A growing queue means the controller is falling behind on leadership changes and ISR updates. Values above 1000 events suggest metadata propagation is stalling. In large clusters, queue depth during broker failures is proportional to partition count, so plan headroom accordingly.
LeaderElectionRateAndTimeMs. Slow or continuous elections indicate controller pressure or broker flapping.
LogFlushRateAndTimeMs. p99 flush times above 500ms indicate disk write degradation or RAID cache failure.
Partition count and leader count per broker. Severe imbalance creates hot-spot brokers. Leadership skew drives disproportionate produce and fetch traffic.
Per-topic BytesInPerSec and BytesOutPerSec. Detects rogue producers and traffic shifts that aggregate-level metrics hide.
GC pause duration distribution. Young GC pauses over 200ms or