Kafka ActiveControllerCount not equal to 1: no controller or split brain
When the cluster-wide sum of kafka.controller:type=KafkaController,name=ActiveControllerCount is not 1, the cluster has either no active controller or multiple active controllers. A sum of 0 means no broker is steering metadata. A sum greater than 1 means split brain. Either way, this is a control-plane failure. Existing partition leaders usually keep serving produce and fetch requests, so the data plane may look healthy at first. However, any broker failure, partition reassignment, or topic operation will stall or misbehave, because either no controller exists to process it or multiple controllers issue conflicting operations.
Determine quickly whether the sum is 0 or greater than 1, because the causes and fixes differ. The data plane keeps running in both cases, so severity depends on whether the missing or conflicting controller is already causing visible impact, such as offline or under-replicated partitions.
What this means
kafka.controller:type=KafkaController,name=ActiveControllerCount is a per-broker JMX gauge with value 0 or 1. In steady state, exactly one broker reports 1 and all others report 0. The cluster-wide sum must always be 1.
- Sum = 0: No broker believes it is the active controller. Leader elections stop, ISR changes stall, topic creation and deletion fail, and broker failures are not handled. Existing leaders continue to serve data, but the cluster cannot self-heal.
- Sum > 1: Multiple brokers claim to be the active controller. In ZooKeeper mode this is rare but possible during network partitions or ZK session edge cases. In KRaft mode it should be impossible because Raft majority voting prevents concurrent leaders; if observed, treat it as a critical quorum bug and a data-loss risk.
In KRaft mode, only controller-role nodes expose this metric. Broker-only nodes may not emit it at all, so your monitoring aggregation must account for the smaller denominator.
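As a rough sketch of that aggregation, the shell function below sums per-node gauge readings and classifies the result. The function name classify_controller_sum and the NA placeholder for nodes that do not emit the metric are illustrative assumptions, not part of Kafka or any monitoring tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the cluster-wide ActiveControllerCount sum.
# Arguments: one reading per node (0, 1, or NA when the node emits nothing).
classify_controller_sum() {
  local sum=0 seen=0 v
  for v in "$@"; do
    case "$v" in
      0|1) sum=$((sum + v)); seen=$((seen + 1)) ;;
      *)   ;;  # NA or unparsable: node is excluded from the denominator
    esac
  done
  if [ "$seen" -eq 0 ]; then
    echo "unknown"        # nobody reported: outage or scrape failure, cannot tell
  elif [ "$sum" -eq 0 ]; then
    echo "no-controller"
  elif [ "$sum" -eq 1 ]; then
    echo "healthy"
  else
    echo "split-brain"
  fi
}

# Example: three controller nodes report, two broker-only nodes do not.
classify_controller_sum 0 1 0 NA NA
```

Treating an all-NA result as "unknown" rather than "no-controller" is a design choice in this sketch: it keeps a failed metrics scrape from being reported as a controller outage.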
During a rolling restart, if the controller broker is restarted, expect a brief transition window in which the sum is 0 while a new controller is elected. This typically resolves in under 10 seconds.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| ZooKeeper session expiry (ZK mode) | Controller broker logs session expired messages. ISR changes stop propagating. Other brokers show ZooKeeperExpiresPerSec spikes. | ZK session state and GC pause duration on the former controller. |
| KRaft quorum loss (KRaft mode) | kafka-metadata-quorum.sh describe --status shows no leader, or voter lag keeps growing. Controller nodes are alive but disconnected from each other. | Network connectivity between controller quorum voters. |
| Network partition isolating the controller | Controller process is alive but unreachable from ZK or quorum peers. Brokers cannot reach the controller for metadata. | OS-level network metrics and recent firewall or routing changes. |
| Controller JVM crash or OOM | Controller broker process is gone or restarted recently. dmesg shows OOM kill. | Broker uptime, process presence, and JVM heap utilization. |
Quick checks
Run these safe, read-only checks from a host with JMX and Kafka CLI access.
# ActiveControllerCount on the local broker
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999
# Unavailable partitions (data-plane impact)
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
# KRaft quorum status
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
# Broker process and uptime (etime = elapsed time since process start)
ps -eo pid,etime,args | grep '[k]afka\.Kafka'
# OOM kills in kernel log
dmesg | grep -i "kill.*kafka\|oom"
# ZK session expiration rate (ZK mode)
echo "get -b kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# GC behavior
PID=$(pgrep -f 'kafka.Kafka' | head -n 1)
jstat -gcutil "$PID" 1000 5
# Controller event queue depth (meaningful only on the active controller)
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
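To collect the gauge from several brokers in one pass, the first quick check can be wrapped in a small helper. This is a minimal sketch: the function names, the broker list, and the JMX port are placeholders, and it assumes jmxterm prints the attribute as a line shaped like "Value = 1;", which should be verified against your jmxterm version.

```shell
#!/usr/bin/env bash
# Extract the numeric value from jmxterm "get" output read on stdin.
# Assumes a line shaped like "Value = 1;" (an assumption about jmxterm output).
parse_jmx_value() {
  sed -n 's/^[[:space:]]*Value[[:space:]]*=[[:space:]]*\([0-9][0-9]*\);.*$/\1/p' | head -n 1
}

# Query one broker; prints "NA" if the broker is unreachable or emits nothing.
active_controller_count() {
  local host_port="$1" out
  out=$(echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" \
        | java -jar jmxterm.jar -l "$host_port" 2>/dev/null | parse_jmx_value)
  echo "${out:-NA}"
}

# Hypothetical broker list; replace with your own JMX endpoints.
# for b in broker1:9999 broker2:9999 broker3:9999; do
#   printf '%s %s\n' "$b" "$(active_controller_count "$b")"
# done
```

Printing NA instead of 0 for an unreachable broker keeps "did not answer" distinct from "answered and is not the controller" when the values are summed later.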
How to diagnose it
- Confirm the deviation. Aggregate ActiveControllerCount across all brokers. A sum of 0 means no controller. A sum greater than 1 means split brain. Note the timestamp when the change started.
- Identify the last known controller. Check Kafka server logs on all controller-eligible brokers for the most recent controller leadership log line. The broker that was controller before the incident is your starting point.
- Determine the deployment mode. Check server.properties or the process command line for zookeeper.connect versus process.roles and controller.quorum.voters. ZK and KRaft have different failure modes and different recovery tools.
- If the sum is 0, check controller liveness. On the broker that was last controller, verify the process is running and its uptime. If the process restarted, look for OutOfMemoryError or fatal exceptions in the broker log. Check dmesg for OOM kills.
- If the sum is 0, check the session or quorum. In ZK mode, look for ZK session expiry messages in the broker log. Correlate with GC logs: a Full GC longer than zookeeper.session.timeout.ms directly causes session expiry. In KRaft mode, run kafka-metadata-quorum.sh describe --status and verify that a LeaderId exists and voters can reach each other.
- If the sum is greater than 1, check for network partitions. A split brain usually means two or more controller-eligible brokers can each reach ZK or enough quorum voters, but cannot reach each other. Check inter-broker network connectivity and firewall rules. In KRaft, treat this as a critical bug; capture the output of kafka-metadata-quorum.sh describe --status from all controller nodes before attempting recovery.
- Confirm data-plane impact. Check OfflinePartitionsCount and UnderReplicatedPartitions. If they are rising while the controller is absent, the cluster is actively degrading.
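The deployment-mode step can be sketched as a small check against the properties file. The function name detect_kafka_mode is hypothetical, the file path is whatever your installation uses, and checking process.roles before zookeeper.connect is a choice made for this sketch. It only recognizes key=value lines, so treat an "unknown" result as a prompt to inspect the file by hand.

```shell
#!/usr/bin/env bash
# Print "kraft", "zookeeper", or "unknown" for a given server.properties file.
# Commented-out settings are ignored because the key must start the line.
detect_kafka_mode() {
  local props="$1"
  if grep -Eq '^[[:space:]]*process\.roles[[:space:]]*=' "$props"; then
    echo "kraft"
  elif grep -Eq '^[[:space:]]*zookeeper\.connect[[:space:]]*=' "$props"; then
    echo "zookeeper"
  else
    echo "unknown"
  fi
}
```

Call it with the path to the broker's configuration file, for example detect_kafka_mode "$KAFKA_CONFIG", where KAFKA_CONFIG is a placeholder variable.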
flowchart TD
A[Cluster sum != 1] --> B{Sum = 0?}
B -->|Yes| C[Check controller liveness]
B -->|No| D[Check network partition]
C --> E[Check ZK session or KRaft quorum]
D --> F[Identify minority partition]
E --> G[Restart broker after root cause fixed]
F --> H[Restart minority-side brokers]
G --> I[Verify sum = 1]
H --> I
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ActiveControllerCount sum | Binary health of the control plane. | Sum not equal to 1 for more than 2 minutes outside rolling restarts. |
| OfflinePartitionsCount | Confirms data-plane impact when the controller is absent. | Nonzero and rising while ActiveControllerCount is 0. |
| UnderReplicatedPartitions | Shows the cluster cannot recover from broker failures without a controller. | Rising across multiple brokers during controller loss. |
| ControllerEventQueueSize | Backlogged events indicate an overwhelmed controller. | Sustained above 100 events, or growing without bound. |
| ZooKeeperExpiresPerSec (ZK mode) | Direct indicator of ZK session loss causing controller resignation. | Any nonzero rate on the controller-eligible broker. |
| KRaft quorum current-leader (KRaft mode) | Confirms whether the Raft quorum has an active leader. | current-leader = -1 or voter lag growing continuously. |
| JVM GC pause duration | Long pauses cause ZK session expiry and controller unresponsiveness. | Full GC exceeding 5 seconds, or any Old Gen collection longer than the session timeout. |
Fixes
No controller (sum = 0)
Controller process crashed or OOM killed. If the broker process is gone, start it. If it was OOM killed, increase the heap only if it is below the 4-8 GB production guideline. An oversized heap causes longer GC pauses and more session timeouts. After restart, expect a brief metadata request storm while other brokers refresh cluster state.
ZooKeeper session expiry. Do not restart the broker if the root cause is ZK latency or GC pauses. Fix the ZK cluster or JVM tuning first, then restart. If you restart into the same degraded ZK environment, the session will expire again.
KRaft quorum loss. Verify network connectivity between all controller quorum voters. Do not restart all controller nodes simultaneously. Restart one controller at a time, allowing the quorum to stabilize between restarts. Concurrent restarts risk total metadata unavailability.
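One way to enforce the pause between controller restarts is to wait until the quorum reports a leader again. The sketch below parses describe --status output; the function name quorum_leader_id is hypothetical, and it assumes the output contains a "LeaderId:" line, which you should confirm for your Kafka version. A missing, negative, or non-numeric id is treated as "no leader".

```shell
#!/usr/bin/env bash
# Read "kafka-metadata-quorum.sh ... describe --status" output on stdin.
# Prints the leader id and returns 0 when a leader is reported; returns 1 otherwise.
quorum_leader_id() {
  local id
  id=$(awk '$1 == "LeaderId:" { print $2; exit }')
  case "$id" in
    ''|-*|*[!0-9]*) return 1 ;;   # missing, negative, or non-numeric
    *) echo "$id" ;;
  esac
}

# Sketch of a gate between controller restarts (endpoint is a placeholder):
# until kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status \
#       | quorum_leader_id >/dev/null; do
#   sleep 5
# done
```

A leader being present is a necessary signal, not a sufficient one; also check that follower lag has settled before moving on to the next controller.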
Network partition isolating the controller. Restore connectivity first. Restarting the controller before the partition heals may elect a new controller on the wrong side of the partition, making recovery harder.
Split brain (sum > 1)
ZooKeeper mode. Identify which broker was the legitimate controller just before the incident by checking log timestamps. Restart brokers on the minority side of the network partition first. If there is no clear partition, restart all controller-eligible brokers except the one you believe is the true controller. This is disruptive and clients will see metadata transitions.
KRaft mode. Split brain should be impossible. If observed, capture diagnostics from all controller nodes, then restart the controller node reporting 1 that is not the quorum leader shown in kafka-metadata-quorum.sh describe --status. This is disruptive; clients will retry during metadata transitions.
Prevention
Alert on the control plane. Alert on ActiveControllerCount sum != 1 with at least a 2-minute gate to avoid paging during brief rolling-restart transitions. In KRaft, also monitor kafka.server:type=raft-metrics for voter lag and commit latency.
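The 2-minute gate can be illustrated with a small filter over timestamped samples. This is a sketch of the gating logic only, not a Netdata or Kafka alert definition: the function name controller_alert_gate and the "epoch_seconds sum" input format are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Read "epoch_seconds sum" samples on stdin, oldest first.
# Prints an ALERT line for each sample at which the sum has been != 1
# continuously for at least <gate> seconds (default 120).
controller_alert_gate() {
  local gate="${1:-120}"
  awk -v gate="$gate" '
    $2 == 1 { bad_since = ""; next }        # healthy sample resets the window
    bad_since == "" { bad_since = $1 }      # first unhealthy sample opens it
    ($1 - bad_since) >= gate { print "ALERT", $1, "sum=" $2 }
  '
}
```

With this logic, a 0 reading that lasts 30 seconds during a rolling restart produces no output, while a 0 or 2 reading that persists for two minutes starts emitting ALERT lines.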
Size JVM heap conservatively. Keep broker heaps in the 4-8 GB range. Long Full GC pauses are a leading cause of ZK session expiry in ZooKeeper mode.
Isolate controller nodes. In KRaft, use dedicated controller nodes and avoid combined mode in production. In ZK mode, ensure ZK runs on separate hosts with fast, dedicated disks for the transaction log.
Test failure recovery. Gracefully stop a controller-eligible broker during a maintenance window and measure how long the cluster takes to elect a new controller and return to a stable state.
How Netdata helps
Netdata collects ActiveControllerCount per broker and can aggregate the cluster-wide sum. It charts OfflinePartitionsCount, UnderReplicatedPartitions, and JVM GC pause duration on the same timeline, so you can correlate controller loss with data-plane impact and ZK session expiry root causes. OS-level network latency, packet loss, and TCP retransmit rates help identify partitions that isolate the controller. ControllerEventQueueSize and request latency breakdowns show an overwhelmed controller before it fails.
Related guides
- How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation: /guides/kafka/kafka-leader-not-available/
- Kafka min.insync.replicas and acks: configuring durability you actually have: /guides/kafka/kafka-min-insync-replicas-misconfigured/
- Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
- Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas: /guides/kafka/kafka-not-enough-replicas-exception/
- Kafka NOT_LEADER_FOR_PARTITION: stale metadata, controller lag, and client retries: /guides/kafka/kafka-not-leader-for-partition/
- Kafka OfflinePartitionsCount > 0: partitions with no leader and how to recover: /guides/kafka/kafka-offline-partitions-count/
- Kafka replica MaxLag growing: slow followers and replica fetcher health: /guides/kafka/kafka-replica-fetcher-max-lag/
- Kafka UnderMinIsrPartitionCount: confirming the write path is blocked: /guides/kafka/kafka-under-min-isr-partition-count/
- Kafka UnderReplicatedPartitions > 0: the most important metric and how to clear it: /guides/kafka/kafka-under-replicated-partitions/







