Kafka ActiveControllerCount not equal to 1: no controller or split brain

When the cluster-wide sum of kafka.controller:type=KafkaController,name=ActiveControllerCount is not 1, the cluster has either no active controller or multiple active controllers. A sum of 0 means no broker is steering metadata. A sum greater than 1 means split brain. Either way, this is a control-plane failure. Existing partition leaders usually keep serving produce and fetch requests, so the data plane may look healthy at first. But any broker failure, partition reassignment, or topic operation will either stall because there is no controller to process it, or be handled inconsistently because multiple controllers issue conflicting decisions.

Determine quickly whether the sum is 0 or greater than 1. The causes and fixes differ. The data plane keeps running, so severity depends on whether metadata absence is already causing visible impact.

What this means

kafka.controller:type=KafkaController,name=ActiveControllerCount is a per-broker JMX gauge with value 0 or 1. In steady state, exactly one broker reports 1 and all others report 0. The cluster-wide sum must always be 1.

  • Sum = 0: No broker believes it is the active controller. Leader elections stop, ISR changes stall, topic creation and deletion fail, and broker failures are not handled. Existing leaders continue to serve data, but the cluster cannot self-heal.
  • Sum > 1: Multiple brokers claim to be the active controller. In ZooKeeper mode this is rare but possible during network partitions or ZK session edge cases. In KRaft mode it should be impossible because Raft majority voting prevents concurrent leaders; if observed, treat it as a critical quorum bug and a data-loss risk.
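
The two cases above can be sketched as a small classifier. This is a minimal illustration, assuming you have already collected one ActiveControllerCount value per reporting broker; the function name and the state labels are hypothetical, not Kafka terminology.

```shell
# Classify the control-plane state from per-broker ActiveControllerCount values.
# Input: one value (0 or 1) per line on stdin, one line per reporting broker.
# Labels such as "no-controller" are illustrative names chosen for this sketch.
classify_controller_sum() {
  awk '
    NF == 1 && $1 ~ /^[0-9]+$/ { sum += $1; n++ }   # ignore blank or malformed lines
    END {
      if (n == 0)        print "no-data"
      else if (sum == 0) print "no-controller"
      else if (sum == 1) print "healthy"
      else               print "split-brain"
    }'
}
```

For example, `printf '0\n1\n0\n' | classify_controller_sum` prints `healthy`. Brokers that report nothing are simply absent from the input, so the "no-data" case is kept separate from a true sum of 0.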

In KRaft mode, the metric is meaningful only on controller-role nodes; broker-only nodes may not emit it at all. Your monitoring aggregation must therefore sum over the nodes that actually report the metric rather than assume every node in the cluster does.

During a rolling restart that includes the active controller, expect a brief transition window in which the sum is not 1 while a new controller is elected. This typically resolves in under 10 seconds.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| ZooKeeper session expiry (ZK mode) | Controller broker logs session expired messages. ISR changes stop propagating. Other brokers show ZooKeeperExpiresPerSec spikes. | ZK session state and GC pause duration on the former controller. |
| KRaft quorum loss (KRaft mode) | kafka-metadata-quorum.sh describe --status shows no leader or voter lag is growing. Controller-eligible brokers are alive but disconnected from each other. | Network connectivity between dedicated controller nodes. |
| Network partition isolating the controller | Controller process is alive but unreachable from ZK or quorum peers. Brokers cannot reach the controller for metadata. | OS-level network metrics and recent firewall or routing changes. |
| Controller JVM crash or OOM | Controller broker process is gone or restarted recently. dmesg shows OOM kill. | Broker uptime, process presence, and JVM heap utilization. |

Quick checks

Run these safe, read-only checks from a host with JMX and Kafka CLI access.

# ActiveControllerCount on the local broker
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999

# Unavailable partitions (data-plane impact)
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

# KRaft quorum status
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

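# Broker process and uptime (etime = elapsed time since process start)
pgrep -a -f kafka.Kafka
ps -o pid,etime,args -p "$(pgrep -f kafka.Kafka | head -n 1)"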

# OOM kills in kernel log
dmesg | grep -Ei 'kill.*kafka|oom|out of memory'

# ZK session expiration rate (ZK mode)
echo "get -b kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# GC behavior
PID=$(pgrep -f 'kafka.Kafka' | head -n 1)
jstat -gcutil "$PID" 1000 5

# Controller event queue depth (meaningful only on the active controller)
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
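
To feed the JMX checks above into a script, the numeric value has to be pulled out of the jmxterm output. The sketch below assumes jmxterm prints the attribute as a line of the form `Value = 1;`, which is how its `get` output commonly looks; verify this against your jmxterm version. The function name is hypothetical.

```shell
# Extract the integer from jmxterm "get" output.
# Assumption: the attribute is printed as a line like "Value = 1;".
# Header lines such as "#mbean = ..." are ignored.
extract_jmx_value() {
  sed -n 's/^[ ]*Value[ ]*=[ ]*\([0-9][0-9]*\);*[ ]*$/\1/p'
}
```

Piping the ActiveControllerCount command through `extract_jmx_value` on each broker yields a bare 0 or 1 per broker, which can then be summed across the cluster.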

How to diagnose it

  1. Confirm the deviation. Aggregate ActiveControllerCount across all brokers. A sum of 0 means no controller. A sum greater than 1 means split brain. Note the timestamp when the change started.
  2. Identify the last known controller. Check Kafka server logs on all controller-eligible brokers for the most recent controller leadership log line. The broker that was controller before the incident is your starting point.
  3. Determine the deployment mode. Check server.properties or the process command line for zookeeper.connect versus process.roles and controller.quorum.voters. ZK and KRaft have different failure modes and different recovery tools.
  4. If the sum is 0, check controller liveness. On the broker that was last controller, verify the process is running and its uptime. If the process restarted, look for OutOfMemoryError or fatal exceptions in the broker log. Check dmesg for OOM kills.
  5. If the sum is 0, check the session or quorum. In ZK mode, look for ZK session expiry messages in the broker log. Correlate with GC logs: a Full GC longer than zookeeper.session.timeout.ms directly causes session expiry. In KRaft mode, run kafka-metadata-quorum.sh describe --status and verify that a LeaderId exists and voters can reach each other.
  6. If the sum is greater than 1, check for network partitions. A split brain usually means two or more controller-eligible brokers can each reach ZK or enough quorum voters, but cannot reach each other. Check inter-broker network connectivity and firewall rules. In KRaft, treat this as a critical bug; capture the output of kafka-metadata-quorum.sh describe --status from all controller nodes before attempting recovery.
  7. Confirm data-plane impact. Check OfflinePartitionsCount and UnderReplicatedPartitions. If they are rising while the controller is absent, the cluster is actively degrading.

flowchart TD
    A[Cluster sum != 1] --> B{Sum = 0?}
    B -->|Yes| C[Check controller liveness]
    B -->|No| D[Check network partition]
    C --> E[Check ZK session or KRaft quorum]
    D --> F[Identify minority partition]
    E --> G[Restart broker after root cause fixed]
    F --> H[Restart minority-side brokers]
    G --> I[Verify sum = 1]
    H --> I
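
Step 3 of the diagnosis (determining the deployment mode) can be sketched as a check on server.properties. This is a minimal illustration that only looks for uncommented `process.roles` and `zookeeper.connect` keys in `key=value` form; the function name and output labels are hypothetical.

```shell
# Print "kraft", "zookeeper", or "unknown" for a given server.properties path.
# Assumption: keys are written as key=value; commented-out lines do not count.
# If both keys are present, this sketch reports "kraft", which may not fit every setup.
detect_kafka_mode() {
  if grep -Eq '^[ ]*process\.roles[ ]*=' "$1"; then
    echo "kraft"
  elif grep -Eq '^[ ]*zookeeper\.connect[ ]*=' "$1"; then
    echo "zookeeper"
  else
    echo "unknown"
  fi
}
```

Call it with the path to the broker configuration, for example `detect_kafka_mode /path/to/server.properties`, where the path is whatever your installation uses.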

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| ActiveControllerCount sum | Binary health of the control plane. | Sum not equal to 1 for more than 2 minutes outside rolling restarts. |
| OfflinePartitionsCount | Confirms data-plane impact when the controller is absent. | Nonzero and rising while ActiveControllerCount is 0. |
| UnderReplicatedPartitions | Shows the cluster cannot recover from broker failures without a controller. | Rising across multiple brokers during controller loss. |
| ControllerEventQueueSize | Backlogged events indicate an overwhelmed controller. | Sustained above 100 events, or growing without bound. |
| ZooKeeperExpiresPerSec (ZK mode) | Direct indicator of ZK session loss causing controller resignation. | Any nonzero rate on the controller-eligible broker. |
| KRaft quorum current-leader (KRaft mode) | Confirms whether the Raft quorum has an active leader. | current-leader = -1 or voter lag growing continuously. |
| JVM GC pause duration | Long pauses cause ZK session expiry and controller unresponsiveness. | Full GC exceeding 5 seconds, or any Old Gen collection longer than the session timeout. |
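
For the GC signal, one rough way to spot Full GC activity is to compare the first and last samples of `jstat -gcutil` output. The sketch below assumes the standard header containing `FGC` (Full GC count) and `FGCT` (cumulative Full GC seconds) columns and locates them by name, since the column layout differs between JDK versions. The function name is hypothetical.

```shell
# Read "jstat -gcutil <pid> <interval> <count>" output on stdin and print
# "<new full GCs> <seconds spent in them>" between the first and last sample.
# FGCT is cumulative, so this yields total time, not the longest single pause.
full_gc_delta() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i          # map header names to column numbers
      if (!("FGC" in col) || !("FGCT" in col)) exit 1
      next
    }
    NF > 0 {
      if (!seen) { c0 = $(col["FGC"]); t0 = $(col["FGCT"]); seen = 1 }
      c1 = $(col["FGC"]); t1 = $(col["FGCT"])
    }
    END { if (seen) printf "%d %.3f\n", c1 - c0, t1 - t0 }'
}
```

Usage would look like `jstat -gcutil "$PID" 1000 5 | full_gc_delta`. Dividing the seconds by the count gives an average per Full GC, which only approximates pause length; GC logs remain the authoritative source for individual pauses.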

Fixes

No controller (sum = 0)

Controller process crashed or OOM killed. If the broker process is gone, start it. If it was OOM killed, increase the heap only if it is below the 4-8 GB production guideline. An oversized heap causes longer GC pauses and more session timeouts. After restart, expect a brief metadata request storm while other brokers refresh cluster state.

ZooKeeper session expiry. Do not restart the broker if the root cause is ZK latency or GC pauses. Fix the ZK cluster or JVM tuning first, then restart. If you restart into the same degraded ZK environment, the session will expire again.

KRaft quorum loss. Verify network connectivity between all controller quorum voters. Do not restart all controller nodes simultaneously. Restart one controller at a time, allowing the quorum to stabilize between restarts. Concurrent restarts risk total metadata unavailability.
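
Before and between controller restarts, it helps to confirm that the quorum actually has a leader. The sketch below reads the output of `kafka-metadata-quorum.sh describe --status` and extracts the `LeaderId` field. Treating a missing or negative id as "no leader" is an assumption of this sketch, and the function name is hypothetical.

```shell
# Print the quorum LeaderId from "describe --status" output on stdin.
# Prints "none" and returns non-zero if the field is missing or negative
# (assumption: either of those means the quorum currently has no leader).
quorum_leader_id() {
  awk -F: '
    $1 == "LeaderId" { gsub(/[ \t]/, "", $2); id = $2 }
    END {
      if (id == "" || id + 0 < 0) { print "none"; exit 1 }
      print id
    }'
}
```

A restart loop could run `kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status | quorum_leader_id` after each controller comes back and only proceed to the next node once a leader id is printed.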

Network partition isolating the controller. Restore connectivity first. Restarting the controller before the partition heals may elect a new controller on the wrong side of the partition, making recovery harder.

Split brain (sum > 1)

ZooKeeper mode. Identify which broker was the legitimate controller just before the incident by checking log timestamps. Restart brokers on the minority side of the network partition first. If there is no clear partition, restart all controller-eligible brokers except the one you believe is the true controller. This is disruptive and clients will see metadata transitions.

KRaft mode. Split brain should be impossible. If observed, capture diagnostics from all controller nodes, then restart any controller node that reports 1 but is not the quorum leader shown in kafka-metadata-quorum.sh describe --status. This is disruptive; clients will retry during metadata transitions.
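
Picking out the nodes to restart can be sketched as a filter. It assumes you have the quorum leader id from kafka-metadata-quorum.sh and a list of `<node-id> <ActiveControllerCount>` pairs gathered from the controller nodes; the input format and the function name are hypothetical.

```shell
# Print the ids of nodes that report ActiveControllerCount = 1 but are not
# the quorum leader. Argument 1: leader id. Stdin: "<node-id> <value>" lines.
rogue_controller_claims() {
  awk -v leader="$1" 'NF >= 2 && $2 == 1 && $1 != leader { print $1 }'
}
```

For example, `printf '3000 0\n3001 1\n3002 1\n' | rogue_controller_claims 3002` prints `3001`. Empty output means only the leader claims the controller role.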

Prevention

Alert on the control plane. Alert on ActiveControllerCount sum != 1 with at least a 2-minute gate to avoid paging during brief rolling-restart transitions. In KRaft, also monitor kafka.server:type=raft-metrics for voter lag and commit latency.
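
The 2-minute gate can be sketched as follows. This is an illustration of the gating logic only, assuming samples arrive as `<epoch-seconds> <cluster-sum>` lines in chronological order; the input format, function name, and the "fire"/"ok" labels are hypothetical and not tied to any particular alerting system.

```shell
# Print "fire" if the sum has been continuously != 1 for at least the gate
# (seconds, default 120) as of the last sample; otherwise print "ok".
# Stdin: "<epoch-seconds> <cluster-sum>" lines, oldest first.
controller_alert_state() {
  awk -v gate="${1:-120}" '
    NF >= 2 {
      if ($2 == 1) bad_since = ""                 # a healthy sample resets the gate
      else if (bad_since == "") bad_since = $1    # first sample of a new deviation
      last = $1
    }
    END {
      if (bad_since != "" && last - bad_since >= gate) print "fire"
      else print "ok"
    }'
}
```

A deviation that clears within the gate, such as a controller failover during a rolling restart, never reaches "fire" because the healthy sample resets the timer.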

Size JVM heap conservatively. Keep broker heaps in the 4-8 GB range. Long Full GC pauses are a leading cause of ZK session expiry in ZooKeeper mode.
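
A quick way to compare a broker's configured heap against the 4-8 GB range is to parse the `-Xmx` flag from its JVM options. The sketch below assumes the options are available as a single string (for example the value of `KAFKA_HEAP_OPTS`) and that `-Xmx` uses a plain number or a k/m/g suffix; the function name and output labels are hypothetical.

```shell
# Print "below", "within", or "above" relative to a 4-8 GiB heap range,
# or "unknown" (non-zero exit) if no -Xmx flag is found in the given string.
heap_guideline() {
  printf '%s\n' "$1" | awk '
    { for (i = 1; i <= NF; i++) if ($i ~ /^-Xmx[0-9]+[kKmMgG]?$/) opt = $i }
    END {
      if (opt == "") { print "unknown"; exit 1 }
      num  = substr(opt, 5) + 0                     # numeric part; awk stops at the unit letter
      unit = tolower(substr(opt, length(opt)))
      if (unit == "g")      mib = num * 1024
      else if (unit == "m") mib = num
      else if (unit == "k") mib = num / 1024
      else                  mib = num / 1048576     # no suffix means bytes
      if (mib < 4096)       print "below"
      else if (mib > 8192)  print "above"
      else                  print "within"
    }'
}
```

For example, `heap_guideline "$KAFKA_HEAP_OPTS"` on a broker started with `-Xms6g -Xmx6g` prints `within`.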

Isolate controller nodes. In KRaft, use dedicated controller nodes and avoid combined mode in production. In ZK mode, ensure ZK runs on separate hosts with fast, dedicated disks for the transaction log.

Test failure recovery. Gracefully stop a controller-eligible broker during a maintenance window and measure how long the cluster takes to elect a new controller and return to a stable state.
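
To put a number on the failover test, you can sample the cluster-wide sum during the exercise and measure how long it stayed away from 1. The sketch below assumes `<epoch-seconds> <cluster-sum>` lines in chronological order; the input format and function name are hypothetical, and the result is only as precise as your sampling interval.

```shell
# Print the longest completed gap, in seconds, during which the sum was != 1.
# A gap runs from the first deviating sample to the next healthy sample.
# A deviation still open at the end of the input is not counted.
longest_controller_gap() {
  awk '
    NF >= 2 {
      if ($2 != 1) { if (start == "") start = $1 }
      else if (start != "") {
        gap = $1 - start
        if (gap > max) max = gap
        start = ""
      }
    }
    END { print max + 0 }'
}
```

With one-second samples, an input of `100 1`, `101 0`, `102 0`, `108 1` yields `7`, which you can compare against the transition time you expect for your cluster.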

How Netdata helps

Netdata collects ActiveControllerCount per broker and can aggregate the cluster-wide sum. It charts OfflinePartitionsCount, UnderReplicatedPartitions, and JVM GC pause duration on the same timeline, so you can correlate controller loss with data-plane impact and ZK session expiry root causes. OS-level network latency, packet loss, and TCP retransmit rates help identify partitions that isolate the controller. ControllerEventQueueSize and request latency breakdowns show an overwhelmed controller before it fails.

  • How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
  • Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
  • Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation: /guides/kafka/kafka-leader-not-available/
  • Kafka min.insync.replicas and acks: configuring durability you actually have: /guides/kafka/kafka-min-insync-replicas-misconfigured/
  • Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
  • Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/
  • Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas: /guides/kafka/kafka-not-enough-replicas-exception/
  • Kafka NOT_LEADER_FOR_PARTITION: stale metadata, controller lag, and client retries: /guides/kafka/kafka-not-leader-for-partition/
  • Kafka OfflinePartitionsCount > 0: partitions with no leader and how to recover: /guides/kafka/kafka-offline-partitions-count/
  • Kafka replica MaxLag growing: slow followers and replica fetcher health: /guides/kafka/kafka-replica-fetcher-max-lag/
  • Kafka UnderMinIsrPartitionCount: confirming the write path is blocked: /guides/kafka/kafka-under-min-isr-partition-count/
  • Kafka UnderReplicatedPartitions > 0: the most important metric and how to clear it: /guides/kafka/kafka-under-replicated-partitions/