Kafka controller event queue backing up: overwhelmed controller and stalled metadata
You see NOT_LEADER_FOR_PARTITION errors spike in client logs. Leader elections stop completing. The active controller’s event queue grows and does not drain. Partitions that need new leaders stay offline. The single thread processing partition state changes cannot keep up, and metadata operations stall.
In ZooKeeper mode, each event writes to ZooKeeper. In KRaft mode, each event appends to the Raft metadata log. The active controller processes them sequentially. When the queue backs up, the control plane slows. The data plane may continue serving existing leaders, but any failure requiring a state change gets stuck.
This appears during large-scale failures: multiple brokers dropping simultaneously, a rack outage, or an operator restarting several brokers too quickly. The controller receives a burst of leader elections, ISR changes, and broker lifecycle events. If the metadata store or controller node is slow, queue depth increases without bound.
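A rough back-of-the-envelope sketch shows why metadata latency dominates recovery time. It assumes, purely for illustration, one event per affected partition and one metadata write per event, processed strictly one at a time; real controllers batch some of this work, so treat the result as an order-of-magnitude figure. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Rough estimate of controller drain time after a broker failure.
# Assumption (illustrative only): one event per affected partition,
# each costing one metadata write, processed sequentially.
estimate_drain_seconds() {
  local partitions="$1"   # partitions hosted on the failed broker(s)
  local latency_ms="$2"   # per-event metadata write latency, in ms
  echo $(( partitions * latency_ms / 1000 ))
}

# Same failure, healthy vs. slow metadata store
estimate_drain_seconds 4000 5
estimate_drain_seconds 4000 100
```

With these made-up inputs, the same 4,000 events take about 20 seconds at 5 ms per write and about 400 seconds at 100 ms per write.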
What this means
The active controller maintains an internal event queue for partition and broker state changes. Events include leader elections, ISR expansions and shrinks, topic creations and deletions, and broker registrations. Only the active controller exposes a meaningful queue depth metric.
When the queue backs up, the delay between a broker failure and the resulting leader election grows from milliseconds to seconds or minutes. During this window:
- Partitions whose leader was on a failed broker stay offline.
- Producers and consumers receive NOT_LEADER_FOR_PARTITION because metadata still points to the dead broker.
- Under-replicated partitions accumulate because the controller has not yet processed ISR changes.
- If the queue continues to grow, the controller becomes a bottleneck for all cluster metadata operations.
flowchart TD
A[Multiple broker failures] --> B[Controller event queue fills]
B --> C[Leader elections delayed]
C --> D[Partitions stay offline]
C --> E[NOT_LEADER_FOR_PARTITION to clients]
B --> F[ISR updates stall]
F --> G[UnderReplicatedPartitions accumulates]
G --> H[Operators restart more brokers]
H --> I[More events flood queue]
I --> B
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Too many partitions per broker | Queue spikes during any broker failure; recovery takes minutes | PartitionCount and LeaderCount per broker; cluster-wide total |
| Cascading broker failures | Queue grows as multiple brokers drop; events arrive faster than they drain | Broker liveness; whether automation triggered mass restarts |
| ZooKeeper latency (ZK mode) | Elevated ZooKeeperRequestLatencyMs on the controller; each event persists more slowly | ZooKeeperRequestLatencyMs p99 on the controller |
| KRaft quorum replication lag (KRaft mode) | Standby controllers or brokers lag; commit latency grows | kafka.server:type=raft-metrics on controllers; metadata log lag |
| Controller resource exhaustion | High CPU or GC pauses on the active controller; JVM heap pressure | CPU utilization and GC pause duration on the active controller |
Quick checks
Run from the controller node or any node with JMX access to it.
# Check active controller identity
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check controller event queue depth
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Check offline partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
# Check under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check leader election timing
echo "get -b kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Check controller JVM heap and GC
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
# Check controller node disk I/O latency
iostat -xz 1
How to diagnose it
Confirm which broker is the active controller. Query ActiveControllerCount across all brokers. Exactly one must return 1. If none report 1, the cluster has no controller; see Kafka OfflinePartitionsCount > 0. If multiple report 1, treat it as split-brain.
Read the controller event queue depth. On the active controller, check ControllerEventQueueSize (exposed over JMX as the EventQueueSize metric of the ControllerEventManager MBean). Sustained values above 100 indicate pressure. Values growing without bound indicate the controller cannot keep up.
Correlate queue depth with leader election time. Check LeaderElectionRateAndTimeMs. If the election rate is elevated and the 99th percentile is increasing, elections are queuing behind other events.
Check for large-scale broker failure. Cross-reference OfflinePartitionsCount and UnderReplicatedPartitions. If many brokers are down simultaneously, queue depth grows roughly in proportion to the number of partitions that lost a leader or replica.
Identify the metadata bottleneck.
- In ZooKeeper mode: check ZooKeeperRequestLatencyMs p99 on the controller. Values above 100 ms slow every event.
- In KRaft mode: check kafka.server:type=raft-metrics for commit-latency-avg and current-leader. High commit latency or leader instability means quorum health is the constraint.
Inspect controller node health. High CPU, full GC pauses, or disk I/O latency reduce the drain rate of the single event thread. Check OS-level CPU and GC metrics on the active controller.
Review recent cluster changes. Look for broker restarts, partition reassignments, or topic creations that injected a burst of events.
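The queue-depth step above depends on the trend across several samples, not a single reading. A minimal sketch of that judgment, assuming you have already collected a few queue-depth samples (oldest first) by whatever means you use to read JMX; the function name and the three labels are hypothetical, and the threshold of 100 is the one used in this guide:

```shell
#!/usr/bin/env bash
# Classify queue-depth samples passed as arguments, oldest first.
# Labels and function name are illustrative, not Kafka terminology.
classify_queue_trend() {
  [ "$#" -ge 1 ] || { echo "usage: classify_queue_trend SAMPLE..." >&2; return 2; }
  local first="$1"
  local last="$1"
  local s
  for s in "$@"; do last="$s"; done   # keep the most recent sample
  if [ "$last" -le 100 ]; then
    echo "ok"              # at or below the pressure threshold
  elif [ "$last" -lt "$first" ]; then
    echo "draining"        # still elevated, but shrinking
  else
    echo "not-draining"    # flat or growing while above threshold
  fi
}

classify_queue_trend 120 340 900   # not-draining
classify_queue_trend 900 600 250   # draining
classify_queue_trend 10 40 20      # ok
```

Comparing only the first and last sample is a deliberate simplification; a longer window or a slope calculation would be less sensitive to a single noisy reading.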
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ControllerEventQueueSize | Direct measure of controller backlog. Only meaningful on active controller. | Sustained > 100; growing without draining |
| LeaderElectionRateAndTimeMs | Shows whether elections are happening and how long they wait. | 99th percentile > 1 s and rising |
| OfflinePartitionsCount | Confirms user-visible impact. | Nonzero and growing while queue is backed up |
| UnderReplicatedPartitions | Indicates replication degradation that may require controller action. | Cluster-wide spike correlated with queue growth |
| ActiveControllerCount | Verifies controller existence. | Sum across cluster != 1 |
| ZooKeeperRequestLatencyMs (ZK mode) | Metadata store latency directly bounds event processing speed. | p99 > 100 ms sustained |
| kafka.server:type=raft-metrics (KRaft mode) | Quorum health and commit latency in KRaft. | commit-latency-avg growing or current-leader = -1 |
| RequestHandlerAvgIdlePercent (controller node) | Broker-level saturation on the controller. | Below 0.3 on the active controller |
| JVM GC pause time (controller node) | Long pauses stall the event thread. | Full GC > 5 s or frequent old-gen collections |
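The ActiveControllerCount warning sign is a cluster-wide sum, so it has to be evaluated across all nodes rather than on one. A small sketch of that check, assuming you pass in the value reported by each node; the function name and status labels are hypothetical:

```shell
#!/usr/bin/env bash
# Sum per-node ActiveControllerCount values and report cluster status.
# Arguments: one value (0 or 1) per node. Labels are illustrative.
controller_count_status() {
  local sum=0
  local v
  for v in "$@"; do
    sum=$(( sum + v ))
  done
  case "$sum" in
    1) echo "ok" ;;
    0) echo "no-controller" ;;
    *) echo "split-brain" ;;
  esac
}

controller_count_status 0 1 0   # ok
controller_count_status 0 0 0   # no-controller
controller_count_status 1 1 0   # split-brain
```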
Fixes
Do not restart additional brokers. Restarting brokers generates new controller events and deepens the queue. This is the most common operator mistake during this pattern.
If the queue is draining, even slowly: Wait. Monitor ControllerEventQueueSize and OfflinePartitionsCount. A large cluster may take minutes to process a mass failure. Restarting the controller or triggering more metadata changes prolongs recovery.
If the queue is not draining, or is still growing:
- ZooKeeper mode: Investigate ZooKeeper cluster health. Check transaction log disk latency and whether ZK nodes are saturated. ZK latency is often the binding constraint. Do not restart the controller unless ZK health is confirmed and the controller itself is hung.
- KRaft mode: Check quorum voter lag and network connectivity between controllers. If the active controller is healthy but Raft commit latency is high, the constraint is replication between voters or disk I/O on voter nodes.
- Controller node resource exhaustion: If the controller is in full GC loops or CPU-throttled, consider a controlled controller migration, such as restarting the controller node after confirming another eligible node can take over. This is disruptive; only do it if the controller is objectively hung and not processing events.
Reduce incoming event rate:
- Halt automation that creates topics, changes configs, or reassigns partitions.
- Stop non-essential broker restarts.
- If producers are retrying aggressively, throttle them temporarily with quotas to reduce load on the controller path.
Prevention
Limit partitions per broker. The controller must process an event for every partition on a failed broker. Keep partitions per broker under a threshold tested during game days. A conservative guideline is 4,000 partitions per broker, though hardware and version affect this.
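One way to keep an eye on partition density is to compare each broker's partition count against your tested limit. The sketch below assumes you have already gathered the counts, for example from each broker's PartitionCount metric, into lines of broker ID and count; the input format and function name are hypothetical, and 4,000 is only the guideline figure mentioned above.

```shell
#!/usr/bin/env bash
# Print the IDs of brokers whose partition count exceeds a limit.
# Input on stdin: one "<broker-id> <partition-count>" pair per line.
# The input format is an assumption made for this example.
over_limit_brokers() {
  local limit="${1:-4000}"   # defaults to the guideline figure
  awk -v limit="$limit" '$2 > limit { print $1 }'
}

# Example with made-up counts: brokers 2 and 3 exceed the limit
printf '1 3200\n2 4100\n3 5000\n' | over_limit_brokers 4000
```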
Test failure recovery time. During a game day, gracefully shut down one broker and measure controller queue drain time. If it exceeds your tolerance, reduce partition density or upgrade controller hardware.
Monitor queue depth continuously. Alert on ControllerEventQueueSize > 100 on the active controller. Sustained elevation outside maintenance indicates the cluster is approaching controller capacity limits.
Avoid mass broker restarts. Rolling restarts should proceed one broker at a time, waiting for ISR recovery and queue drain. Never restart multiple brokers simultaneously unless you have tested that scenario.
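A rolling-restart script can encode the "wait for ISR recovery and queue drain" rule as a gate before each restart. This is a sketch under assumptions: the function name is hypothetical, the three inputs are values you have already collected, and requiring all of them to be exactly zero is a conservative choice rather than a Kafka requirement.

```shell
#!/usr/bin/env bash
# Gate for a rolling restart: succeed only when the cluster has settled.
# Requiring zero for every input is a conservative, illustrative choice.
safe_to_restart_next() {
  local urp="$1"       # under-replicated partitions, cluster-wide
  local offline="$2"   # offline partitions
  local queue="$3"     # controller event queue size
  [ "$urp" -eq 0 ] && [ "$offline" -eq 0 ] && [ "$queue" -eq 0 ]
}

if safe_to_restart_next 0 0 0; then
  echo "proceed"
else
  echo "hold"
fi
```

In practice the gate would be polled in a loop with a sleep between checks, and the restart would proceed only after it has passed.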
Maintain ZooKeeper or KRaft quorum health. Keep ZK on dedicated nodes with fast disks for the transaction log. In KRaft, ensure controller nodes have low network latency between each other and adequate disk I/O for the metadata log.
How Netdata helps
- Surfaces ControllerEventQueueSize per broker so you can identify the active controller and spot queue growth without manual JMX queries.
- Correlates controller queue depth with LeaderElectionRateAndTimeMs, OfflinePartitionsCount, and UnderReplicatedPartitions to distinguish a controller backlog from a simple broker failure.
- Tracks JVM heap and GC metrics on the controller node to determine whether the backlog is caused by controller-side resource exhaustion or metadata store latency.
- Alerts on composite conditions, such as queue size rising while ActiveControllerCount is stable, reducing false positives from brief spikes during rolling restarts.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation
- Kafka min.insync.replicas and acks: configuring durability you actually have
- Kafka monitoring checklist: the signals every production cluster needs
- Kafka monitoring maturity model: from survival to expert
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas
- Kafka NOT_LEADER_FOR_PARTITION: stale metadata, controller lag, and client retries
- Kafka OfflinePartitionsCount > 0: partitions with no leader and how to recover
- Kafka replica MaxLag growing: slow followers and replica fetcher health
- Kafka UncleanLeaderElectionsPerSec > 0: confirmed silent data loss
- Kafka UnderMinIsrPartitionCount: confirming the write path is blocked







