Kafka controller event queue backing up: overwhelmed controller and stalled metadata

You see NOT_LEADER_FOR_PARTITION errors (reported as NOT_LEADER_OR_FOLLOWER in Kafka 2.6 and later) spike in client logs. Leader elections stop completing. The active controller’s event queue grows and does not drain. Partitions that need new leaders stay offline. The single thread that processes partition state changes cannot keep up, and metadata operations stall.

In ZooKeeper mode, state-changing events are persisted to ZooKeeper. In KRaft mode, they are appended to the Raft metadata log. Either way, the active controller processes events sequentially, so when the queue backs up the whole control plane slows. The data plane may continue serving partitions whose leaders are healthy, but any failure that requires a state change gets stuck.

This appears during large-scale failures: multiple brokers dropping simultaneously, a rack outage, or an operator restarting several brokers too quickly. The controller receives a burst of leader elections, ISR changes, and broker lifecycle events. If the metadata store or controller node is slow, queue depth increases without bound.
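
As a back-of-envelope illustration of why a burst turns into seconds or minutes of delay: if events are handled strictly one at a time, drain time is roughly the number of queued events multiplied by the per-event latency. The sketch below assumes no batching, so treat the result as a crude upper bound rather than a prediction. The function name and the sample numbers are hypothetical.

```shell
# Crude estimate of how long a backlog takes to drain when events are
# processed one at a time. Assumes no batching, so it is an upper bound.
# Usage: estimate_drain_seconds <queued_events> <per_event_latency_ms>
estimate_drain_seconds() {
  # Integer arithmetic: events * ms per event, rounded up to whole seconds.
  echo $(( ($1 * $2 + 999) / 1000 ))
}

# Hypothetical example: 4000 queued events at 15 ms each -> prints 60
estimate_drain_seconds 4000 15
```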

What this means

The active controller maintains an internal event queue for partition and broker state changes. Events include leader elections, ISR expansions and shrinks, topic creations and deletions, and broker registrations. Only the active controller exposes a meaningful queue depth metric.

When the queue backs up, the delay between a broker failure and the resulting leader election grows from milliseconds to seconds or minutes. During this window:

  • Partitions whose leader was on a failed broker stay offline.
  • Producers and consumers receive NOT_LEADER_FOR_PARTITION because metadata still points to the dead broker.
  • Under-replicated partitions accumulate because the controller has not yet processed ISR changes.
  • If the queue continues to grow, the controller becomes a bottleneck for all cluster metadata operations.

flowchart TD
    A[Multiple broker failures] --> B[Controller event queue fills]
    B --> C[Leader elections delayed]
    C --> D[Partitions stay offline]
    C --> E[NOT_LEADER_FOR_PARTITION to clients]
    B --> F[ISR updates stall]
    F --> G[UnderReplicatedPartitions accumulates]
    G --> H[Operators restart more brokers]
    H --> I[More events flood queue]
    I --> B

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Too many partitions per broker | Queue spikes during any broker failure; recovery takes minutes | PartitionCount and LeaderCount per broker; cluster-wide total |
| Cascading broker failures | Queue grows as multiple brokers drop; events arrive faster than they drain | Broker liveness; whether automation triggered mass restarts |
| ZooKeeper latency (ZK mode) | Elevated ZooKeeperRequestLatencyMs on the controller; each event persists more slowly | ZooKeeperRequestLatencyMs p99 on the controller |
| KRaft quorum replication lag (KRaft mode) | Standby controllers or brokers lag; commit latency grows | kafka.server:type=raft-metrics on controllers; metadata log lag |
| Controller resource exhaustion | High CPU or GC pauses on the active controller; JVM heap pressure | CPU utilization and GC pause duration on the active controller |

Quick checks

Run from the controller node or any node with JMX access to it.

# Check whether this node is the active controller (1 = active, 0 = not)
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999

# Check controller event queue depth
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Check offline partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

# Check under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check leader election timing
echo "get -b kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Check controller JVM heap and GC
# Takes the first matching PID in case pgrep returns more than one
jstat -gcutil "$(pgrep -f 'kafka\.Kafka' | head -n 1)" 1000

# Check controller node disk I/O latency
iostat -xz 1
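
To use the JMX readings in scripts, the numeric value has to be pulled out of the jmxterm output. The helper below is a minimal sketch that assumes jmxterm prints the attribute as a line of the form `Value = 42;`; verify that against your jmxterm version before relying on it. The function name is hypothetical.

```shell
# Extract the numeric value from jmxterm "get" output read on stdin.
# Assumption: the attribute is printed as a line like "Value = 42;".
extract_jmx_value() {
  awk -F' = ' 'NF == 2 {
    v = $2
    sub(/;[[:space:]]*$/, "", v)                 # drop the trailing semicolon
    if (v ~ /^-?[0-9]+(\.[0-9]+)?$/) print v     # print numeric values only
  }'
}

# Example wiring (commented out; needs a reachable JMX port):
# echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" \
#   | java -jar jmxterm.jar -l localhost:9999 | extract_jmx_value
```

Non-numeric lines, such as the `#mbean = ...` header, are ignored, so the function prints nothing if the attribute could not be read.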

How to diagnose it

  1. Confirm which broker is the active controller. Query ActiveControllerCount across all brokers. Exactly one must return 1. If none report 1, the cluster has no controller; see Kafka OfflinePartitionsCount > 0. If multiple report 1, treat it as split-brain.

  2. Read the controller event queue depth. On the active controller, check ControllerEventQueueSize (exposed over JMX as kafka.controller:type=ControllerEventManager,name=EventQueueSize). Sustained values above 100 indicate pressure. Values growing without bound indicate the controller cannot keep up.

  3. Correlate queue depth with leader election time. Check LeaderElectionRateAndTimeMs. If the election rate is elevated and the 99th percentile time keeps rising, elections are queuing behind other events.

  4. Check for large-scale broker failure. Cross-reference OfflinePartitionsCount and UnderReplicatedPartitions. If many brokers are down simultaneously, expect queue depth to scale roughly with the number of partitions those brokers hosted.

  5. Identify the metadata bottleneck.

    • In ZooKeeper mode: check ZooKeeperRequestLatencyMs p99 on the controller. Values above 100 ms slow every event.
    • In KRaft mode: check kafka.server:type=raft-metrics for commit-latency-avg and current-leader. High commit latency or leader instability means quorum health is the constraint.

  6. Inspect controller node health. High CPU, full GC pauses, or disk I/O latency reduce the drain rate of the single event thread. Check OS-level CPU and GC metrics on the active controller.

  7. Review recent cluster changes. Look for broker restarts, partition reassignments, or topic creations that injected a burst of events.
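
The decision in step 1 can be sketched as a small helper that takes one ActiveControllerCount reading per broker and classifies the cluster. How the readings are collected (JMX, an exporter, or a monitoring API) is left out; the function name and the status labels are hypothetical.

```shell
# Classify the cluster from per-broker ActiveControllerCount values.
# Usage: classify_controller_count 0 1 0   (one argument per broker)
classify_controller_count() (
  total=0
  for v in "$@"; do
    total=$(( total + v ))
  done
  case "$total" in
    0) echo "no-controller" ;;   # nobody reports 1
    1) echo "ok" ;;              # exactly one active controller
    *) echo "split-brain" ;;     # more than one broker reports 1
  esac
)

# Example with three brokers, the second one active -> prints ok
classify_controller_count 0 1 0
```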

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| ControllerEventQueueSize | Direct measure of controller backlog. Only meaningful on active controller. | Sustained > 100; growing without draining |
| LeaderElectionRateAndTimeMs | Shows whether elections are happening and how long they wait. | 99th percentile > 1 s and rising |
| OfflinePartitionsCount | Confirms user-visible impact. | Nonzero and growing while queue is backed up |
| UnderReplicatedPartitions | Indicates replication degradation that may require controller action. | Cluster-wide spike correlated with queue growth |
| ActiveControllerCount | Verifies controller existence. | Sum across cluster != 1 |
| ZooKeeperRequestLatencyMs (ZK mode) | Metadata store latency directly bounds event processing speed. | p99 > 100 ms sustained |
| kafka.server:type=raft-metrics (KRaft mode) | Quorum health and commit latency in KRaft. | commit-latency-avg growing or current-leader = -1 |
| RequestHandlerAvgIdlePercent (controller node) | Broker-level saturation on the controller. | Below 0.3 on the active controller |
| JVM GC pause time (controller node) | Long pauses stall the event thread. | Full GC > 5 s or frequent old-gen collections |
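
A few of these thresholds can be checked mechanically once the values are collected. The sketch below evaluates a single sample against the numbers listed above; it cannot tell whether a value is sustained or rising, so a real alert would need several samples over time. The function name and message wording are hypothetical.

```shell
# Print a warning for each signal that crosses its threshold in one sample.
# Usage: check_signals <queue_size> <election_p99_ms> <offline_partitions> <handler_idle_ratio>
check_signals() {
  awk -v q="$1" -v p99="$2" -v off="$3" -v idle="$4" 'BEGIN {
    if (q + 0 > 100)    print "WARN queue depth above 100: " q
    if (p99 + 0 > 1000) print "WARN leader election p99 above 1 s: " p99 " ms"
    if (off + 0 > 0)    print "WARN offline partitions: " off
    if (idle + 0 < 0.3) print "WARN request handler idle below 0.3: " idle
  }'
}

# Hypothetical sample: deep queue, everything else healthy.
check_signals 250 200 0 0.8
```

A healthy sample prints nothing, which makes the function easy to use in a cron job or a wrapper that only notifies on output.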

Fixes

Do not restart additional brokers. Restarting brokers generates new controller events and deepens the queue. This is the most common operator mistake during this pattern.

If the queue is draining, even slowly: Wait. Monitor ControllerEventQueueSize and OfflinePartitionsCount. A large cluster may take minutes to process a mass failure. Restarting the controller or triggering more metadata changes prolongs recovery.
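
Whether the queue is draining can be judged from a handful of samples taken some seconds apart. This is a deliberately crude sketch that compares only the first and last sample; the function name is hypothetical, and a noisy queue may need a longer window or a smoothed series.

```shell
# Classify queue depth samples (oldest first) as draining, growing, or flat.
# Usage: queue_trend 900 750 600
queue_trend() (
  [ "$#" -ge 1 ] || exit 2
  first=$1
  for last in "$@"; do :; done   # after the loop, $last holds the final sample
  if [ "$last" -lt "$first" ]; then
    echo "draining"
  elif [ "$last" -gt "$first" ]; then
    echo "growing"
  else
    echo "flat"
  fi
)

# Hypothetical samples taken 30 s apart -> prints draining
queue_trend 900 750 600
```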

If the queue is not draining, or is still growing:

  • ZooKeeper mode: Investigate ZooKeeper cluster health. Check transaction log disk latency and whether ZK nodes are saturated. ZK latency is often the binding constraint. Do not restart the controller unless ZK health is confirmed and the controller itself is hung.
  • KRaft mode: Check quorum voter lag and network connectivity between controllers. If the active controller is healthy but Raft commit latency is high, the constraint is replication between voters or disk I/O on voter nodes.
  • Controller node resource exhaustion: If the controller is in full GC loops or CPU-throttled, consider a controlled controller migration, such as restarting the controller node after confirming another eligible node can take over. This is disruptive; only do it if the controller is objectively hung and not processing events.

Reduce incoming event rate:

  • Halt automation that creates topics, changes configs, or reassigns partitions.
  • Stop non-essential broker restarts.
  • If producers are retrying aggressively, throttle them temporarily with quotas to reduce load on the controller path.
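
For the quota option, a temporary produce-rate cap can be applied with kafka-configs.sh. The sketch below only builds and prints the command so it can be reviewed first. The client id `batch-loader` and the 1 MiB/s rate are hypothetical, and it assumes a Kafka version where client quotas can be altered through `--bootstrap-server`.

```shell
# Build (but do not run) a kafka-configs.sh command that caps a client's produce rate.
# Usage: producer_quota_cmd <bootstrap> <client_id> <bytes_per_sec>
producer_quota_cmd() {
  printf '%s\n' "kafka-configs.sh --bootstrap-server $1 --alter --add-config producer_byte_rate=$3 --entity-type clients --entity-name $2"
}

# Print the command for review; run it only after checking the values.
producer_quota_cmd localhost:9092 batch-loader 1048576
```

Setting a quota is itself a config change, so apply it once rather than in a loop, and remove it after recovery with the same command using `--delete-config producer_byte_rate` in place of `--add-config`.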

Prevention

  • Limit partitions per broker. The controller must process an event for every partition on a failed broker. Keep partitions per broker under a threshold tested during game days. A conservative guideline is 4,000 partitions per broker, though hardware and version affect this.

  • Test failure recovery time. During a game day, gracefully shut down one broker and measure controller queue drain time. If it exceeds your tolerance, reduce partition density or upgrade controller hardware.

  • Monitor queue depth continuously. Alert on ControllerEventQueueSize > 100 on the active controller. Sustained elevation outside maintenance indicates the cluster is approaching controller capacity limits.

  • Avoid mass broker restarts. Rolling restarts should proceed one broker at a time, waiting for ISR recovery and queue drain. Never restart multiple brokers simultaneously unless you have tested that scenario.

  • Maintain ZooKeeper or KRaft quorum health. Keep ZK on dedicated nodes with fast disks for the transaction log. In KRaft, ensure controller nodes have low network latency between each other and adequate disk I/O for the metadata log.
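
To compare actual partition density against the per-broker guideline, replicas can be counted per broker from `kafka-topics.sh --describe` output. This is a sketch that assumes each partition line carries a `Replicas: 1,2,3` field; the function name is hypothetical.

```shell
# Count replicas hosted per broker from `kafka-topics.sh --describe` output on stdin.
# Assumption: partition lines contain a field such as "Replicas: 1,2,3".
replicas_per_broker() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "Replicas:") {
        n = split($(i + 1), ids, ",")
        for (j = 1; j <= n; j++) count[ids[j]]++
      }
    }
  }
  END { for (b in count) print b, count[b] }' | sort -n
}

# Example wiring (commented out; needs a reachable cluster):
# kafka-topics.sh --bootstrap-server localhost:9092 --describe | replicas_per_broker
```

Each output line is a broker id followed by the number of replicas it hosts, sorted by broker id.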

How Netdata helps

  • Surfaces ControllerEventQueueSize per broker so you can identify the active controller and spot queue growth without manual JMX queries.
  • Correlates controller queue depth with LeaderElectionRateAndTimeMs, OfflinePartitionsCount, and UnderReplicatedPartitions to distinguish a controller backlog from a simple broker failure.
  • Tracks JVM heap and GC metrics on the controller node to determine whether the backlog is caused by controller-side resource exhaustion or metadata store latency.
  • Alerts on composite conditions, such as queue size rising while ActiveControllerCount is stable, reducing false positives from brief spikes during rolling restarts.