Kafka monitoring checklist: the signals every production cluster needs

Kafka failures follow predictable paths: an ISR shrinks, a controller queue backs up, a disk fills while the cleaner thread hangs, or a consumer rebalance storm hides behind healthy broker metrics. You need to know which signals matter and when they justify a 3 AM page.

This checklist organizes broker-side signals into four levels. Each level builds on the last: Level 1 prevents data loss, Level 2 prevents surprises, Level 3 exposes leading indicators, and Level 4 catches silent killers. Use it to audit dashboards, tune alert severity, and justify instrumentation.

Clusters running in ZooKeeper mode and in KRaft mode (the only supported mode from Kafka 4.0 onward) share most failure modes. Where the metric source or threshold differs between the two, the row says so.

How to use this checklist

Do not advance to Level 3 until Level 2 is fully instrumented and on-call trusts the alerts. “Page” thresholds wake an operator. “Ticket” thresholds are for daytime investigation. “Plan” items inform capacity roadmaps.

flowchart TD
    L1["Level 1 - Survival"]
    L2["Level 2 - Operational"]
    L3["Level 3 - Mature"]
    L4["Level 4 - Expert"]
    L1 --> L2
    L2 --> L3
    L3 --> L4
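
The three severity tiers can be encoded directly in alert routing. The sketch below is a minimal illustration, not a prescribed implementation: `Severity`, `classify`, and the ticket and plan values in `DISK_TIERS` are hypothetical names and numbers. Only the 90% page value is taken from the Level 1 disk row.

```python
from enum import Enum


class Severity(Enum):
    PAGE = "page"      # wakes an operator
    TICKET = "ticket"  # daytime investigation
    PLAN = "plan"      # input to the capacity roadmap
    OK = "ok"          # no action


def classify(value, page_at, ticket_at, plan_at):
    """Return the most severe tier whose threshold `value` meets or exceeds.

    Assumes higher values are worse and page_at >= ticket_at >= plan_at.
    """
    if value >= page_at:
        return Severity.PAGE
    if value >= ticket_at:
        return Severity.TICKET
    if value >= plan_at:
        return Severity.PLAN
    return Severity.OK


# Disk utilization as a fraction of capacity.
# 0.90 comes from the checklist; 0.80 and 0.70 are illustrative assumptions.
DISK_TIERS = {"page_at": 0.90, "ticket_at": 0.80, "plan_at": 0.70}

if __name__ == "__main__":
    for utilization in (0.93, 0.85, 0.72, 0.40):
        print(utilization, classify(utilization, **DISK_TIERS).value)
```

Keeping the tiers in data rather than in alert expressions makes it easier to review them with on-call before promoting a signal to a page.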

Level 1 - Survival

The absolute minimum. Missing any of these means you cannot tell whether the cluster is currently losing data.

Signal	Why it matters	Page threshold
Broker liveness	A dead broker leaves every partition it led unavailable until a new leader is elected, and shrinks the ISR of every partition it replicated.	Process or port unreachable for over 60 s, with prior uptime over 600 s.
UnderReplicatedPartitions	At least one partition has fewer in-sync replicas than its replication factor. The durability window is open.	Sustained nonzero for over 5 min while UnderMinIsrPartitionCount is above 0, no broker uptime under 600 s, and no reassignment in progress.
OfflinePartitionsCount	Partitions have no leader. Producers and consumers fail.	Any nonzero value sustained for over 60 s with no broker or controller uptime under 600 s.
ActiveControllerCount	Exactly one node must be the active controller: a broker in ZooKeeper mode, a controller-quorum node in KRaft mode.	Cluster-wide sum not equal to 1 for over 2 min with visible data-plane impact (offline partitions or election storm).
Consumer lag (critical groups)	Unprocessed messages breach SLAs or may be lost to retention.	Growing monotonically, or lag-as-time exceeds the application SLA.
Disk space on log.dirs	Full disk is a cliff-edge failure. Kafka marks the log directory offline.	90% utilized, or runway below 4 hours at current growth.
JVM heap / Full GC	Pressure causes long pauses, ISR shrinks, and session timeouts.	Heap above 80% after GC, or any Full GC pause above 5 s.
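
The disk threshold in the table combines a static limit with a projected runway. A minimal sketch of that arithmetic follows, assuming growth is measured as a simple average over a recent window. The function and parameter names are hypothetical; the 90% and 4-hour defaults come from the table.

```python
def disk_runway_hours(capacity_bytes, used_bytes, growth_bytes_per_hour):
    """Hours until the volume is full if growth continues at the current rate."""
    if growth_bytes_per_hour <= 0:
        # Flat or shrinking usage: no projected fill time.
        return float("inf")
    free_bytes = max(capacity_bytes - used_bytes, 0)
    return free_bytes / growth_bytes_per_hour


def should_page_disk(capacity_bytes, used_bytes, growth_bytes_per_hour,
                     max_utilization=0.90, min_runway_hours=4.0):
    """Page when utilization reaches the limit OR the runway is too short."""
    utilization = used_bytes / capacity_bytes
    runway = disk_runway_hours(capacity_bytes, used_bytes, growth_bytes_per_hour)
    return utilization >= max_utilization or runway < min_runway_hours


if __name__ == "__main__":
    gib = 1024 ** 3
    # 70% full but growing 100 GiB/h on a 1000 GiB volume: 3 hours of runway.
    print(disk_runway_hours(1000 * gib, 700 * gib, 100 * gib))
    print(should_page_disk(1000 * gib, 700 * gib, 100 * gib))
```

The runway check is what catches a volume that is only 70% full but filling quickly, a case the static 90% limit alone would miss until much later.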

Level 2 - Operational

Missing any of these means surprise incidents. This level adds latency, saturation, error rates, and signals that distinguish a sick broker from maintenance noise.

Signal	Why it matters	Page threshold
UnderMinIsrPartitionCount	Producers with acks=all are actively rejected.	Nonzero for over 2 min with no broker uptime under 600 s and no reassignment in progress.
IsrShrinksPerSec / IsrExpandsPerSec	Velocity of durability degradation or recovery.	Sustained shrinks for over 5 min outside maintenance without matching expands.
RequestHandlerAvgIdlePercent	Best single indicator of broker processing saturation.	Sustained below 0.1 for over 5 min.
NetworkProcessorAvgIdlePercent	Network thread saturation affects all clients, including metadata requests.	Sustained below 0.1 for over 2 min.
Produce request latency (p99 and breakdown)	High total time is not actionable without knowing which stage is slow.	p99 approaching request.timeout.ms (default 30 s) or sustained 3x baseline.
FetchConsumer request latency	Consumer-visible read path health.	FetchConsumer LocalTimeMs spiking for consumers that read near the log end and should be served from the page cache.
BytesInPerSec / BytesOutPerSec	Bandwidth utilization and traffic imbalance.	Sustained above 70% of NIC capacity.
MessagesInPerSec	Record-level throughput for detecting message size changes.	Baseline deviation above 50%.
Disk I/O latency (await)	Kafka performance is bounded by disk I/O.	Above 100 ms sustained for over 5 min with confirmed broker impact (queue growth or idle percent drop).
Open file descriptors	Two per log segment plus one per connection. Hitting the limit is immediate failure.	Above 95% of limit with rising trend and corroborated errors.
Connection count	Storms exhaust file descriptors or saturate network threads.	Above 2x normal baseline.
Consumer group state	Rebalance storms pause consumption and grow lag.	PreparingRebalance or CompletingRebalance for over 5 min, or Empty unexpectedly.
UncleanLeaderElectionsPerSec	Confirmed data loss. A leader was elected from outside the ISR.	Any increase in the cumulative count, or OneMinuteRate above 0.
FailedProduceRequestsPerSec	Direct measurement of producer-visible broker errors.	Sustained nonzero outside known maintenance.
FailedFetchRequestsPerSec	Consumer-visible and replication-visible failures.	Sustained nonzero outside metadata transitions.
ZooKeeper request latency (ZK mode only)	ZK latency directly slows controller operations.	p99 above 1 s with session expirations or controller instability.
KRaft quorum health (KRaft mode only)	Raft quorum loss freezes metadata operations.	current-leader is -1 for over 120 s with metadata-plane impact.
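
Several thresholds above share one shape: a condition must hold for a minimum duration, and the alert is suppressed while a broker has recently restarted or a reassignment is running. The sketch below shows one way to evaluate that shape for UnderMinIsrPartitionCount. The class and parameter names are hypothetical, and resetting the timer whenever the alert is suppressed is a design assumption, not something the checklist specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedAlert:
    hold_seconds: float = 120.0        # "nonzero for over 2 min"
    min_uptime_seconds: float = 600.0  # "no broker uptime under 600 s"
    breach_started_at: Optional[float] = None

    def observe(self, now, value, min_broker_uptime, reassignment_in_progress):
        """Feed one sample; return True when the alert should page.

        now: sample timestamp in seconds.
        value: current metric value (for example UnderMinIsrPartitionCount).
        min_broker_uptime: lowest uptime across all brokers, in seconds.
        """
        suppressed = (min_broker_uptime < self.min_uptime_seconds
                      or reassignment_in_progress)
        if value <= 0 or suppressed:
            # Healthy or in maintenance: clear the breach timer.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now
        return now - self.breach_started_at > self.hold_seconds


if __name__ == "__main__":
    alert = SustainedAlert()
    for t in (0, 60, 121):
        print(t, alert.observe(t, value=3, min_broker_uptime=5000,
                               reassignment_in_progress=False))
```

Because suppression clears the timer, a rolling restart delays the page by a full hold period after the last broker passes the uptime guard.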

Level 3 - Mature

Leading indicators, balance metrics, and internal signals that expose problems before they become incidents.

Signal	Why it matters	Page threshold
RequestQueueSize	Pressure gauge between network and I/O threads.	Approaching queued.max.requests (default 500).
ResponseQueueSize	Network threads cannot drain responses fast enough.	Consistently elevated above baseline.
Produce / Fetch purgatory size	acks=all requests or long-poll fetches are stuck waiting.	Produce purgatory above 2x baseline sustained for over 5 min.
ControllerEventQueueSize	The controller processes leadership and ISR changes sequentially, so a backlog delays every one of them.	Growing continuously above 1000 events.
LeaderElectionRateAndTimeMs	Rate and duration of leadership changes.	Continuous elections without stabilization, or p99 above 1 s.
LogFlushRateAndTimeMs	Tracks fsync latency, which still matters when flushing is left to the OS.	p99 above 2 s.
Partition count and leader count per broker	Imbalance creates hot spots.	Partition count deviation above 20% from mean, or LeaderCount deviation above 30%.
Per-topic BytesInPerSec / BytesOutPerSec	Detects rogue producers and traffic shifts.	Baseline deviation above 50% for a single topic.
GC pause duration distribution	Young GC above 200 ms or any Full GC indicates unhealthy heap pressure.	Full GC above 5 s, or more than 2 Full GCs in 10 min.
Page cache pressure (pgmajfault)	Cache misses force reads to disk and explode tail latency.	Rate above 2x baseline correlated with consumer latency degradation.
Consumer lag as time estimate	Absolute offset lag is meaningless without produce rate context.	Lag-as-time approaching retention boundary or SLA.
Consumer rebalance rate	Frequent rebalances indicate client instability.	Above 3 rebalances per hour outside deployments.
Authentication failure rate	Brute force or credential rotation issues.	Sustained burst from an unknown source.
Log cleaner dirty ratio	A dead cleaner thread causes unbounded disk growth on compacted topics.	Above 50% sustained, or DeadThreadCount above 0 combined with disk growth.
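
The lag-as-time row is easiest to see with numbers. The sketch below converts offset lag into seconds by dividing by the produce rate, then compares it with the tighter of the SLA and the retention window. All names are hypothetical, and `approach_fraction=0.8` is an illustrative reading of "approaching", not a value from the checklist.

```python
def lag_as_seconds(lag_offsets, produce_rate_per_sec):
    """Approximate consumer lag in seconds of produced traffic."""
    if lag_offsets <= 0:
        return 0.0
    if produce_rate_per_sec <= 0:
        # Without a positive produce rate the conversion is undefined.
        raise ValueError("produce rate must be positive to convert lag to time")
    return lag_offsets / produce_rate_per_sec


def lag_needs_attention(lag_seconds, sla_seconds, retention_seconds,
                        approach_fraction=0.8):
    """True when lag-as-time nears the tighter of the SLA and retention."""
    limit = min(sla_seconds, retention_seconds)
    return lag_seconds >= approach_fraction * limit


if __name__ == "__main__":
    one_hour, seven_days = 3600, 7 * 24 * 3600
    # The same 1,000,000-offset lag on a busy topic and on a quiet one.
    for rate in (50_000, 100):
        seconds = lag_as_seconds(1_000_000, rate)
        print(rate, seconds, lag_needs_attention(seconds, one_hour, seven_days))
```

The same offset lag is 20 seconds behind on a topic receiving 50,000 messages per second and 10,000 seconds behind on one receiving 100, which is why the table treats absolute lag as uninformative on its own.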

Level 4 - Expert

Deep signals that experienced operators add after painful incidents. These catch silent failures, network-layer issues, and security events that standard broker metrics miss.

Signal	Why it matters	Page threshold
OfflineLogDirectoryCount	A log directory