Kafka monitoring checklist: the signals every production cluster needs

Kafka failures follow predictable paths: an ISR shrinks, a controller queue backs up, a disk fills while the cleaner thread hangs, or a consumer rebalance storm hides behind healthy broker metrics. You need to know which signals matter and when they justify a 3 AM page.

This checklist organizes broker-side signals into four levels. Each builds on the last: Level 1 prevents data loss. Level 2 prevents surprises. Level 3 exposes leading indicators. Level 4 catches silent killers. Use it to audit dashboards, tune alert severity, and justify instrumentation.

ZooKeeper mode and KRaft mode (KRaft is mandatory in Kafka 4.0+) share similar failure modes. Where the metric source or the threshold differs between the two, the relevant row says so.

How to use this checklist

Do not advance to Level 3 until Level 2 is fully instrumented and on-call trusts the alerts. “Page” thresholds wake an operator. “Ticket” thresholds are for daytime investigation. “Plan” items inform capacity roadmaps.

flowchart TD
    L1["Level 1 - Survival"]
    L2["Level 2 - Operational"]
    L3["Level 3 - Mature"]
    L4["Level 4 - Expert"]
    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1 - Survival

The absolute minimum. Missing any of these means you cannot tell whether the cluster is currently losing data.

| Signal | Why it matters | Page threshold |
| --- | --- | --- |
| Broker liveness | A dead broker takes the partitions it leads offline until leadership fails over. | Process or port unreachable for over 60 s, with prior uptime over 600 s. |
| UnderReplicatedPartitions | The ISR is below the replication factor. The durability window is open. | Sustained nonzero for over 5 min while UnderMinIsrPartitionCount is above 0, no broker uptime under 600 s, and no reassignment in progress. |
| OfflinePartitionsCount | Partitions have no leader. Producers and consumers fail. | Any nonzero value sustained for over 60 s with no broker or controller uptime under 600 s. |
| ActiveControllerCount | Exactly one broker must be the active controller. | Cluster-wide sum not equal to 1 for over 2 min with visible data-plane impact (offline partitions or election storm). |
| Consumer lag (critical groups) | Unprocessed messages breach SLAs or may be lost to retention. | Growing monotonically, or lag-as-time exceeds the application SLA. |
| Disk space on log.dirs | Full disk is a cliff-edge failure. Kafka marks the log directory offline. | 90% utilized, or runway below 4 hours at current growth. |
| JVM heap / Full GC | Pressure causes long pauses, ISR shrinks, and session timeouts. | Heap above 80% after GC, or any Full GC pause above 5 s. |
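
The disk runway threshold is simple arithmetic: free bytes divided by the recent growth rate. The following is a minimal Python sketch, assuming two usage samples for the log.dirs volume taken from your own collector. `DiskSample`, `runway_hours`, and `should_page_disk` are hypothetical names, not part of Kafka or any monitoring library, and the 90% and 4-hour defaults are the thresholds from the Level 1 table.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One usage reading for the volume that holds log.dirs (hypothetical type)."""
    timestamp_s: float  # Unix time of the reading
    used_bytes: int     # bytes in use at that time


def runway_hours(older: DiskSample, newer: DiskSample, capacity_bytes: int) -> float:
    """Hours until the volume fills at the growth rate between the two samples.

    Returns infinity when usage is flat or shrinking.
    """
    elapsed_s = newer.timestamp_s - older.timestamp_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in chronological order")
    grown_bytes = newer.used_bytes - older.used_bytes
    if grown_bytes <= 0:
        return float("inf")
    free_bytes = max(capacity_bytes - newer.used_bytes, 0)
    # free / (grown / elapsed) gives seconds of runway; convert to hours.
    return free_bytes * elapsed_s / grown_bytes / 3600.0


def should_page_disk(
    older: DiskSample,
    newer: DiskSample,
    capacity_bytes: int,
    max_utilization: float = 0.90,   # "90% utilized" from the table
    min_runway_hours: float = 4.0,   # "runway below 4 hours" from the table
) -> bool:
    """Page when either the utilization or the runway condition is met."""
    utilization = newer.used_bytes / capacity_bytes
    runway = runway_hours(older, newer, capacity_bytes)
    return utilization >= max_utilization or runway < min_runway_hours
```

For example, a volume that is 70% full and growing by 10% of capacity per hour has a 3-hour runway, so this sketch pages even though utilization alone looks safe. A two-sample rate is noisy; in practice you would likely smooth the growth rate over a longer window.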

Level 2 - Operational

Missing any of these means surprise incidents. This level adds latency, saturation, error rates, and signals that distinguish a sick broker from maintenance noise.

| Signal | Why it matters | Page threshold |
| --- | --- | --- |
| UnderMinIsrPartitionCount | Producers with acks=all are actively rejected. | Nonzero for over 2 min with no broker uptime under 600 s and no reassignment in progress. |
| IsrShrinksPerSec / IsrExpandsPerSec | Velocity of durability degradation or recovery. | Sustained shrinks for over 5 min outside maintenance without matching expands. |
| RequestHandlerAvgIdlePercent | Best single indicator of broker processing saturation. | Sustained below 0.1 for over 5 min. |
| NetworkProcessorAvgIdlePercent | Network thread saturation affects all clients, including metadata requests. | Sustained below 0.1 for over 2 min. |
| Produce request latency (p99 and breakdown) | High total time is not actionable without knowing which stage is slow. | p99 approaching request.timeout.ms (default 30 s) or sustained 3x baseline. |
| FetchConsumer request latency | Consumer-visible read path health. | FetchConsumer LocalTimeMs spiking for tail consumers that should hit the page cache. |
| BytesInPerSec / BytesOutPerSec | Bandwidth utilization and traffic imbalance. | Sustained above 70% of NIC capacity. |
| MessagesInPerSec | Record-level throughput for detecting message size changes. | Baseline deviation above 50%. |
| Disk I/O latency (await) | Kafka performance is bounded by disk I/O. | Above 100 ms sustained for over 5 min with confirmed broker impact (queue growth or idle percent drop). |
| Open file descriptors | Two per log segment plus one per connection. Hitting the limit is immediate failure. | Above 95% of limit with rising trend and corroborated errors. |
| Connection count | Storms exhaust file descriptors or saturate network threads. | Above 2x normal baseline. |
| Consumer group state | Rebalance storms pause consumption and grow lag. | PreparingRebalance or CompletingRebalance for over 5 min, or Empty unexpectedly. |
| UncleanLeaderElectionsPerSec | Confirmed data loss. A leader was elected from outside the ISR. | Any delta above 0 or OneMinuteRate above 0. |
| FailedProduceRequestsPerSec | Direct measurement of producer-visible broker errors. | Sustained nonzero outside known maintenance. |
| FailedFetchRequestsPerSec | Consumer-visible and replication-visible failures. | Sustained nonzero outside metadata transitions. |
| ZooKeeper request latency (ZK mode only) | ZK latency directly slows controller operations. | p99 above 1 s with session expirations or controller instability. |
| KRaft quorum health (KRaft mode only) | Raft quorum loss freezes metadata operations. | current-leader is -1 for over 120 s with metadata-plane impact. |
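
Most thresholds at this level share one shape: a condition must hold continuously for some duration, and the alert is suppressed during restarts or reassignments. The following is a minimal Python sketch of that pattern. `SustainedCondition` and `is_suppressed` are hypothetical helpers, not part of any alerting product, and resetting the timer while suppressed is a design assumption of the sketch rather than something the checklist prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedCondition:
    """Fires only after a breach has lasted longer than hold_seconds."""
    hold_seconds: float
    breach_started_at: Optional[float] = None  # None means "not currently breached"

    def update(self, now_s: float, breached: bool, suppressed: bool = False) -> bool:
        """Feed one evaluation; return True when the alert should fire."""
        if not breached or suppressed:
            # Assumption: a recovery or a suppression window restarts the clock.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now_s
        return now_s - self.breach_started_at > self.hold_seconds


def is_suppressed(
    min_broker_uptime_s: float,
    reassignment_in_progress: bool,
    warmup_s: float = 600.0,  # the "uptime under 600 s" guard used in the tables
) -> bool:
    """True while a recent restart or a reassignment explains the signal."""
    return min_broker_uptime_s < warmup_s or reassignment_in_progress


# Usage: the UnderMinIsrPartitionCount rule ("nonzero for over 2 min").
under_min_isr = SustainedCondition(hold_seconds=120.0)
fired = under_min_isr.update(
    now_s=0.0,
    breached=True,  # UnderMinIsrPartitionCount > 0
    suppressed=is_suppressed(min_broker_uptime_s=86_400.0, reassignment_in_progress=False),
)
```

The first call only starts the timer, so `fired` is False here; the alert fires on a later evaluation once more than 120 seconds of uninterrupted breach have elapsed.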

Level 3 - Mature

Leading indicators, balance metrics, and internal signals that expose problems before they become incidents.

| Signal | Why it matters | Page threshold |
| --- | --- | --- |
| RequestQueueSize | Pressure gauge between network and I/O threads. | Approaching queued.max.requests (default 500). |
| ResponseQueueSize | Network threads cannot drain responses fast enough. | Consistently elevated above baseline. |
| Produce / Fetch purgatory size | acks=all requests or long-poll fetches are stuck waiting. | Produce purgatory above 2x baseline sustained for over 5 min. |
| ControllerEventQueueSize | Sequential processing of leadership and ISR changes. | Growing continuously above 1000 events. |
| LeaderElectionRateAndTimeMs | Rate and duration of leadership changes. | Continuous elections without stabilization, or p99 above 1 s. |
| LogFlushRateAndTimeMs | fsync latency even when relying on OS flush. | p99 above 2 s. |
| Partition count and leader count per broker | Imbalance creates hot spots. | Partition count deviation above 20% from mean, or LeaderCount deviation above 30%. |
| Per-topic BytesInPerSec / BytesOutPerSec | Detects rogue producers and traffic shifts. | Baseline deviation above 50% for a single topic. |
| GC pause duration distribution | Young GC above 200 ms or any Full GC indicates unhealthy heap pressure. | Full GC above 5 s, or more than 2 Full GCs in 10 min. |
| Page cache pressure (pgmajfault) | Cache misses force reads to disk and explode tail latency. | Rate above 2x baseline correlated with consumer latency degradation. |
| Consumer lag as time estimate | Absolute offset lag is meaningless without produce rate context. | Lag-as-time approaching retention boundary or SLA. |
| Consumer rebalance rate | Frequent rebalances indicate client instability. | Above 3 rebalances per hour outside deployments. |
| Authentication failure rate | Brute force or credential rotation issues. | Sustained burst from an unknown source. |
| Log cleaner dirty ratio | A dead cleaner thread causes unbounded disk growth on compacted topics. | Above 50% sustained, or DeadThreadCount above 0 combined with disk growth. |
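
Two of the rows above are derived values rather than raw metrics: lag as time, and per-broker deviation from the cluster mean. The following is a minimal Python sketch of both. It assumes lag-as-time can be approximated as offset lag divided by the current produce rate, which is a simplification (a more exact estimate would compare message timestamps). All function names are hypothetical.

```python
from typing import Dict, List, Optional


def lag_as_seconds(offset_lag: int, produce_rate_per_s: float) -> Optional[float]:
    """Approximate how far behind a consumer group is, in seconds.

    Returns None when there is lag but no produce rate to scale it by,
    because the estimate is undefined in that case.
    """
    if offset_lag <= 0:
        return 0.0
    if produce_rate_per_s <= 0:
        return None
    return offset_lag / produce_rate_per_s


def deviation_from_mean(counts: Dict[str, int]) -> Dict[str, float]:
    """Relative deviation of each broker's count from the cluster mean."""
    if not counts:
        raise ValueError("need at least one broker")
    mean = sum(counts.values()) / len(counts)
    if mean == 0:
        return {broker: 0.0 for broker in counts}
    return {broker: abs(count - mean) / mean for broker, count in counts.items()}


def imbalanced_brokers(counts: Dict[str, int], max_deviation: float) -> List[str]:
    """Brokers whose deviation exceeds the limit (0.20 for partition count)."""
    deviations = deviation_from_mean(counts)
    return sorted(b for b, d in deviations.items() if d > max_deviation)


# Usage: 12,000 messages of lag at 200 messages/s is about one minute behind.
print(lag_as_seconds(12_000, 200.0))
# Usage: broker "3" holds 160 partitions against a mean of 120.
print(imbalanced_brokers({"1": 100, "2": 100, "3": 160}, max_deviation=0.20))
```

The same 12,000-message lag would be ten minutes behind on a topic that receives 20 messages per second, which is why the offset count alone is a poor alerting signal.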

Level 4 - Expert

Deep signals that experienced operators add after painful incidents. These catch silent failures, network-layer issues, and security events that standard broker metrics miss.

| Signal | Why it matters | Page threshold |
| --- | --- | --- |
| OfflineLogDirectoryCount | A log directory