Kafka monitoring checklist: the signals every production cluster needs
Kafka failures follow predictable paths: an ISR shrinks, a controller queue backs up, a disk fills while the cleaner thread hangs, or a consumer rebalance storm hides behind healthy broker metrics. You need to know which signals matter and when they justify a 3 AM page.
This checklist organizes broker-side signals into four levels. Each builds on the last: Level 1 prevents data loss. Level 2 prevents surprises. Level 3 exposes leading indicators. Level 4 catches silent killers. Use it to audit dashboards, tune alert severity, and justify instrumentation.
ZooKeeper mode and KRaft mode (mandatory in Kafka 4.0+) share most failure modes. Where the metric source or threshold differs between the two, the relevant row says which mode it applies to.
How to use this checklist
Do not advance to Level 3 until Level 2 is fully instrumented and on-call trusts the alerts. “Page” thresholds wake an operator. “Ticket” thresholds are for daytime investigation. “Plan” items inform capacity roadmaps.
flowchart TD
L1["Level 1 - Survival"]
L2["Level 2 - Operational"]
L3["Level 3 - Mature"]
L4["Level 4 - Expert"]
L1 --> L2
L2 --> L3
L3 --> L4

Level 1 - Survival
The absolute minimum. Missing any of these means you cannot tell whether the cluster is currently losing data.
| Signal | Why it matters | Page threshold |
|---|---|---|
| Broker liveness | A dead broker stops serving the partitions it leads. Any partition without another in-sync replica to take over goes offline. | Process or port unreachable for over 60 s, with prior uptime over 600 s. |
| UnderReplicatedPartitions | At least one partition has fewer in-sync replicas than its replication factor, so its tolerance for further broker failures is reduced. | Sustained nonzero for over 5 min while UnderMinIsrPartitionCount is above 0, no broker uptime under 600 s, and no reassignment in progress. |
| OfflinePartitionsCount | Partitions have no leader. Producers and consumers fail. | Any nonzero value sustained for over 60 s with no broker or controller uptime under 600 s. |
| ActiveControllerCount | Exactly one node must be the active controller (a broker in ZooKeeper mode, a controller node in KRaft mode). | Cluster-wide sum not equal to 1 for over 2 min with visible data-plane impact (offline partitions or election storm). |
| Consumer lag (critical groups) | Unprocessed messages breach SLAs or may be lost to retention. | Growing monotonically, or lag-as-time exceeds the application SLA. |
| Disk space on log.dirs | Full disk is a cliff-edge failure. Kafka marks the log directory offline. | 90% utilized, or runway below 4 hours at current growth. |
| JVM heap / Full GC | Pressure causes long pauses, ISR shrinks, and session timeouts. | Heap above 80% after GC, or any Full GC pause above 5 s. |
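
The suppression conditions in the table above (recent restarts, reassignments in progress) and the disk runway rule can be expressed as small pure functions. The following is a minimal sketch, not a reference implementation: `ClusterSnapshot` and its fields are hypothetical names for values your metrics pipeline would have to aggregate, and the growth rate is assumed to be measured elsewhere.

```python
from dataclasses import dataclass

MIN_UPTIME_S = 600            # brokers younger than this are assumed to be restarting
URP_SUSTAIN_S = 5 * 60        # "sustained nonzero for over 5 min"
DISK_PAGE_UTILIZATION = 0.90  # 90% utilized
DISK_PAGE_RUNWAY_H = 4.0      # runway below 4 hours


@dataclass
class ClusterSnapshot:
    """Hypothetical pre-aggregated inputs; not part of any Kafka API."""
    urp_nonzero_for_s: float        # how long UnderReplicatedPartitions has been > 0
    under_min_isr_count: int        # current UnderMinIsrPartitionCount
    min_broker_uptime_s: float      # uptime of the most recently started broker
    reassignment_in_progress: bool


def should_page_urp(snap: ClusterSnapshot) -> bool:
    """Page only when the condition is sustained, harmful, and unexplained."""
    sustained = snap.urp_nonzero_for_s > URP_SUSTAIN_S
    harmful = snap.under_min_isr_count > 0
    explained = (snap.min_broker_uptime_s < MIN_UPTIME_S
                 or snap.reassignment_in_progress)
    return sustained and harmful and not explained


def disk_runway_hours(used_bytes: float, total_bytes: float,
                      growth_bytes_per_hour: float) -> float:
    """Hours until the volume fills at the current growth rate."""
    if growth_bytes_per_hour <= 0:
        return float("inf")  # flat or shrinking usage: no projected exhaustion
    return max(total_bytes - used_bytes, 0) / growth_bytes_per_hour


def should_page_disk(used_bytes: float, total_bytes: float,
                     growth_bytes_per_hour: float) -> bool:
    """Apply the disk row: 90% utilized, or less than 4 hours of runway."""
    utilization = used_bytes / total_bytes
    runway = disk_runway_hours(used_bytes, total_bytes, growth_bytes_per_hour)
    return utilization >= DISK_PAGE_UTILIZATION or runway < DISK_PAGE_RUNWAY_H
```

Keeping the "explained" conditions in one place makes it easier to review why a page was suppressed during a rolling restart.
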
Level 2 - Operational
Missing any of these means surprise incidents. This level adds latency, saturation, error rates, and signals that distinguish a sick broker from maintenance noise.
| Signal | Why it matters | Page threshold |
|---|---|---|
| UnderMinIsrPartitionCount | Producers with acks=all are actively rejected. | Nonzero for over 2 min with no broker uptime under 600 s and no reassignment in progress. |
| IsrShrinksPerSec / IsrExpandsPerSec | Velocity of durability degradation or recovery. | Sustained shrinks for over 5 min outside maintenance without matching expands. |
| RequestHandlerAvgIdlePercent | Best single indicator of broker processing saturation. | Sustained below 0.1 for over 5 min. |
| NetworkProcessorAvgIdlePercent | Network thread saturation affects all clients, including metadata requests. | Sustained below 0.1 for over 2 min. |
| Produce request latency (p99 and breakdown) | High total time is not actionable without knowing which stage is slow. | p99 approaching request.timeout.ms (default 30 s) or sustained 3x baseline. |
| FetchConsumer request latency | Consumer-visible read path health. | FetchConsumer LocalTimeMs spiking for consumers that read the tail of the log and should be served from the page cache. |
| BytesInPerSec / BytesOutPerSec | Bandwidth utilization and traffic imbalance. | Sustained above 70% of NIC capacity. |
| MessagesInPerSec | Record-level throughput. Compared against BytesInPerSec, it reveals changes in average message size. | Baseline deviation above 50%. |
| Disk I/O latency (await) | Kafka performance is bounded by disk I/O. | Above 100 ms sustained for over 5 min with confirmed broker impact (queue growth or idle percent drop). |
| Open file descriptors | Two per log segment plus one per connection. Hitting the limit is immediate failure. | Above 95% of limit with rising trend and corroborated errors. |
| Connection count | Storms exhaust file descriptors or saturate network threads. | Above 2x normal baseline. |
| Consumer group state | Rebalance storms pause consumption and grow lag. | PreparingRebalance or CompletingRebalance for over 5 min, or Empty unexpectedly. |
| UncleanLeaderElectionsPerSec | Data loss is likely. A leader was elected from outside the ISR, and any records it had not replicated are gone. | Any delta above 0 or OneMinuteRate above 0. |
| FailedProduceRequestsPerSec | Direct measurement of producer-visible broker errors. | Sustained nonzero outside known maintenance. |
| FailedFetchRequestsPerSec | Consumer-visible and replication-visible failures. | Sustained nonzero outside metadata transitions. |
| ZooKeeper request latency (ZK mode only) | ZK latency directly slows controller operations. | p99 above 1 s with session expirations or controller instability. |
| KRaft quorum health (KRaft mode only) | Raft quorum loss freezes metadata operations. | current-leader is -1 for over 120 s with metadata-plane impact. |
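
Many rows above use the pattern "sustained below X for over N minutes", while others fire on any counter increase. One way to evaluate both is sketched below. It assumes samples arrive as `(timestamp_seconds, value)` pairs from whatever metrics store you use; the function names are illustrative and not part of any Kafka or monitoring library.

```python
from typing import Iterable, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, value)


def sustained_below(samples: Iterable[Sample], threshold: float,
                    duration_s: float) -> bool:
    """True if the value has stayed below threshold, up to the latest
    sample, for longer than duration_s."""
    ordered = sorted(samples)
    if not ordered:
        return False
    breach_start = None
    for ts, value in ordered:
        if value < threshold:
            if breach_start is None:
                breach_start = ts  # start of the current breach
        else:
            breach_start = None    # any recovery resets the window
    if breach_start is None:
        return False
    return ordered[-1][0] - breach_start > duration_s


def counter_delta(previous: float, current: float) -> float:
    """Increase of a monotonic counter between two scrapes.

    Assumption: a drop means the broker restarted and the counter reset,
    so the current value is treated as the increase."""
    return current - previous if current >= previous else current


def unclean_election_detected(previous: float, current: float) -> bool:
    """Any increase in the unclean election count is a page."""
    return counter_delta(previous, current) > 0
```

For example, `sustained_below(idle_samples, 0.1, 300)` would apply the RequestHandlerAvgIdlePercent row, and the same call with `120` would apply the NetworkProcessorAvgIdlePercent row.
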
Level 3 - Mature
Leading indicators, balance metrics, and internal signals that expose problems before they become incidents.
| Signal | Why it matters | Page threshold |
|---|---|---|
| RequestQueueSize | Pressure gauge between network and I/O threads. | Approaching queued.max.requests (default 500). |
| ResponseQueueSize | Network threads cannot drain responses fast enough. | Consistently elevated above baseline. |
| Produce / Fetch purgatory size | acks=all requests or long-poll fetches are stuck waiting. | Produce purgatory above 2x baseline sustained for over 5 min. |
| ControllerEventQueueSize | Sequential processing of leadership and ISR changes. | Growing continuously above 1000 events. |
| LeaderElectionRateAndTimeMs | Rate and duration of leadership changes. | Continuous elections without stabilization, or p99 above 1 s. |
| LogFlushRateAndTimeMs | Tracks fsync latency, which still matters when flushing is left to the OS. | p99 above 2 s. |
| Partition count and leader count per broker | Imbalance creates hot spots. | Partition count deviation above 20% from mean, or LeaderCount deviation above 30%. |
| Per-topic BytesInPerSec / BytesOutPerSec | Detects rogue producers and traffic shifts. | Baseline deviation above 50% for a single topic. |
| GC pause duration distribution | Young GC above 200 ms or any Full GC indicates unhealthy heap pressure. | Full GC above 5 s, or more than 2 Full GCs in 10 min. |
| Page cache pressure (pgmajfault) | Cache misses force reads to disk and explode tail latency. | Rate above 2x baseline correlated with consumer latency degradation. |
| Consumer lag as time estimate | Absolute offset lag is meaningless without produce rate context. | Lag-as-time approaching retention boundary or SLA. |
| Consumer rebalance rate | Frequent rebalances indicate client instability. | Above 3 rebalances per hour outside deployments. |
| Authentication failure rate | Brute force or credential rotation issues. | Sustained burst from an unknown source. |
| Log cleaner dirty ratio | A dead cleaner thread causes unbounded disk growth on compacted topics. | Above 50% sustained, or DeadThreadCount above 0 combined with disk growth. |
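
Two rows in this table are simple arithmetic: converting offset lag into a time estimate, and measuring how far per-broker counts deviate from the mean. The sketch below shows one way to compute both. It assumes a roughly steady produce rate (a timestamp-based lag measurement would be more accurate when traffic is bursty), and the function names are illustrative only.

```python
from statistics import mean
from typing import Mapping


def lag_seconds(lag_offsets: float, produce_rate_per_s: float) -> float:
    """Estimate how far behind a consumer group is, in seconds.

    Assumption: the produce rate is steady, so lag / rate approximates
    the age of the oldest unread record."""
    if lag_offsets <= 0:
        return 0.0
    if produce_rate_per_s <= 0:
        return float("inf")  # backlog exists but nothing sets a drain scale
    return lag_offsets / produce_rate_per_s


def max_deviation_from_mean(counts: Mapping[str, int]) -> float:
    """Largest relative deviation of any broker's count from the mean.

    Returns a fraction, so 0.25 means one broker is 25% away from the mean."""
    avg = mean(counts.values())
    if avg == 0:
        return 0.0
    return max(abs(c - avg) for c in counts.values()) / avg


def partitions_imbalanced(counts: Mapping[str, int]) -> bool:
    """Apply the 20% partition count deviation threshold."""
    return max_deviation_from_mean(counts) > 0.20
```

The same `max_deviation_from_mean` helper can be compared against 0.30 for LeaderCount.
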
Level 4 - Expert
Deep signals that experienced operators add after painful incidents. These catch silent failures, network-layer issues, and security events that standard broker metrics miss.
| Signal | Why it matters | Page threshold |
|---|---|---|
| OfflineLogDirectoryCount | A log directory |







