Kafka leadership imbalance: LeaderCount skew and preferred replica election

One broker in your cluster is handling 40 percent of all produce requests. Its RequestHandlerAvgIdlePercent is dropping toward 0.3 while the rest of the cluster is nearly idle. Cluster-level throughput dashboards look healthy because aggregate ingress is within capacity, but tail latency on the hot broker is climbing and consumers connected to it are falling behind. The root cause is leadership skew: a disproportionate number of partition leaders have landed on one broker, and every leader carries the full request load for its partitions. This happens silently after rolling restarts, controlled shutdowns, or broker decommissions. Even a perfectly even PartitionCount across brokers can hide severe leadership imbalance. You will not see it without monitoring LeaderCount per broker.

What this means

kafka.server:type=ReplicaManager,name=LeaderCount counts how many partitions each broker leads. Leaders handle all produce requests and, by default, all consumer fetches for their partitions; followers only replicate. A broker leading 40 percent of partitions in a five-node cluster is doing 40 percent of the request processing, not 20. Because leaders serve follower replication fetches as well as consumer fetches, the hot broker sees both ingress and egress amplification. With replication factor 3, each byte produced to a leader is read by two followers and then once more by every consumer group, so that broker can push egress to several times its fair share of network capacity. Cluster-wide averages hide this.

Kafka treats the first replica in a partition's assignment as the preferred leader. When that broker fails or restarts, leadership moves to another in-sync replica. When the broker returns, the assignment still prefers it, but leadership does not move back on its own; a preferred replica election has to be triggered. The controller’s auto-rebalancer is supposed to do this, controlled by auto.leader.rebalance.enable (default true). The controller checks every leader.imbalance.check.interval.seconds (default 300 seconds). For each broker it computes the share of partitions that prefer that broker but are currently led by another one. If that share exceeds leader.imbalance.per.broker.percentage (default 10 percent), the controller triggers preferred replica election for those partitions.
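One way to read the threshold is as a per-broker ratio: of the partitions that list a broker as preferred leader, how many are currently led elsewhere. The sketch below reproduces that arithmetic offline. It is a simplified illustration based on our understanding of the ZooKeeper-mode controller, not the controller's code; the function name `imbalance_ratio`, the four-column input format, and the sample topics are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: per-broker imbalance ratio, compared against a percentage threshold.
# Input on stdin, one partition per line (hypothetical format):
#   topic partition preferred_leader current_leader
imbalance_ratio() {
  local threshold_pct="${1:-10}"   # mirrors leader.imbalance.per.broker.percentage
  awk -v threshold="$threshold_pct" '
    NF == 4 {
      preferred[$3]++                   # partitions that prefer this broker
      if ($3 != $4) displaced[$3]++     # ...but are led by another broker
    }
    END {
      for (b in preferred) {
        pct = displaced[b] / preferred[b] * 100
        verdict = (pct > threshold) ? "elect" : "balanced"
        printf "%s %.1f%% %s\n", b, pct, verdict
      }
    }' | sort -n
}

# Made-up example: broker 1 restarted and lost leadership of orders-0 to broker 2.
printf '%s\n' \
  "orders 0 1 2" \
  "orders 1 2 2" \
  "payments 0 1 1" | imbalance_ratio 10
```

In this example broker 1 is preferred for two partitions and leads only one of them, so its ratio is 50 percent and it crosses the 10 percent threshold. Broker 2 shows 0 percent even though it currently leads the extra partition, which shows that the ratio measures displacement from the preferred assignment rather than raw leader counts.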

Two issues break this mechanism in production. KAFKA-4084 documents that the auto-rebalancer moves all imbalanced partitions simultaneously, flooding the cluster with replication traffic and ISR churn. Many operators disable auto.leader.rebalance.enable to avoid this. KAFKA-8359 argues that the 10 percent default is too high: small topics catch up first after a restart, leaving large high-volume topics unbalanced. As of Kafka 3.7 the default remains 10 percent, and applying changes to these settings requires a broker configuration update that may need a rolling restart.

flowchart TD
    A[LeaderCount deviation >30% from mean] -->|check| B{auto.leader.rebalance.enable}
    B -->|enabled| C[Review leader.imbalance.per.broker.percentage]
    B -->|disabled| D[Plan manual preferred replica election]
    C -->|threshold > 10%| E[Lower threshold or run immediate PLE]
    C -->|threshold <= 10%| F[Verify replica lag before election]
    D -->|next| F
    E -->|next| F
    F -->|lagging| G[Wait for ISR recovery]
    F -->|caught up| H[Run kafka-leader-election.sh --election-type preferred]
    G -->|retry| F
    H -->|verify| I[Confirm balanced LeaderCount and request latency]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Auto-rebalance disabled to avoid KAFKA-4084 | Skew persists indefinitely after maintenance | auto.leader.rebalance.enable in broker config |
| Default 10% threshold too loose | Gradual drift after broker returns; large topics remain on overloaded brokers | leader.imbalance.per.broker.percentage |
| Post-restart recovery window | Broker returned but auto-rebalance has not yet run | Uptime and time since rejoin |
| Controlled shutdown without follow-up | Leaders were moved gracefully and never moved back | Event timeline aligned with maintenance |
| KRaft mode differences | Leadership mechanics differ from ZooKeeper mode | Cluster mode and version-specific behavior |

Quick checks

# Inspect leader distribution in topic metadata
kafka-topics.sh --bootstrap-server localhost:9092 --describe
# Read LeaderCount on the local broker via JMX (-n runs jmxterm non-interactively)
echo "get -b kafka.server:type=ReplicaManager,name=LeaderCount Value" | java -jar jmxterm.jar -l localhost:9999 -n
# Check auto-rebalance configuration in server.properties
grep -E 'auto.leader.rebalance.enable|leader.imbalance.per.broker.percentage' /etc/kafka/server.properties
# Check if the hot broker is saturated
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999 -n
# Read inbound traffic on the local broker
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n
# Verify no under-replication before considering rebalancing
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check follower lag if this broker recently rejoined
echo "get -b kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica Value" | java -jar jmxterm.jar -l localhost:9999 -n

How to diagnose it

  1. Collect LeaderCount from every broker via JMX, compute the cluster mean, and flag any broker deviating by more than 30 percent.
  2. Collect PartitionCount per broker. If partition counts are even but LeaderCount is skewed, you have a leadership imbalance, not a replica assignment problem.
  3. Check auto.leader.rebalance.enable and leader.imbalance.per.broker.percentage. If enabled and set to 10 percent, KAFKA-8359 explains why large topics may remain unbalanced indefinitely.
  4. Correlate with traffic signals. The hot broker should show higher BytesInPerSec and BytesOutPerSec, and lower RequestHandlerAvgIdlePercent, than its peers.
  5. Review the event timeline. If skew appeared after a rolling restart, controlled shutdown, or broker decommission, the cluster likely relied on auto-rebalance and either it did not trigger or it was disabled.
  6. Verify replica health before any fix. Check UnderReplicatedPartitions and Replica MaxLag. Triggering preferred replica election to a broker that is still catching up will fail or extend the recovery window.
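Step 1 can be sketched as a small filter over per-broker LeaderCount values. This is a minimal illustration, assuming you have already collected one "broker_id leader_count" pair per line (from JMX, an exporter, or any other source). The function name `flag_leader_skew` and the sample counts are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: flag brokers whose LeaderCount deviates from the cluster mean
# by more than a threshold percentage (default 30).
# Input on stdin: "broker_id leader_count", one broker per line.
flag_leader_skew() {
  local threshold_pct="${1:-30}"
  awk -v threshold="$threshold_pct" '
    NF == 2 { id[NR] = $1; count[NR] = $2; total += $2; n++ }
    END {
      if (n == 0 || total == 0) exit 1      # nothing to compare
      mean = total / n
      for (i = 1; i <= NR; i++) {
        if (!(i in id)) continue            # skip malformed lines
        dev = (count[i] - mean) / mean * 100
        status = (dev > threshold || dev < -threshold) ? "SKEWED" : "ok"
        printf "%s %d %+.1f%% %s\n", id[i], count[i], dev, status
      }
    }'
}

# Made-up counts: five brokers, 1000 leaders, broker 1 holding 40 percent.
printf '%s\n' "1 400" "2 150" "3 150" "4 150" "5 150" | flag_leader_skew 30
```

With these sample numbers the mean is 200, so broker 1 sits at +100 percent and is flagged, while the other four sit at -25 percent and stay under the 30 percent threshold. A single hot broker can therefore be the only one that trips the alert.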

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| LeaderCount per broker | Direct measure of how many partitions this broker leads | Deviation >30% from cluster mean |
| PartitionCount per broker | Distinguishes leadership skew from uneven replica assignment | Even count but severely uneven leaders |
| RequestHandlerAvgIdlePercent | Confirms the leader-heavy broker is actually saturated | Sustained below 0.3 on the hot broker |
| BytesInPerSec / BytesOutPerSec | Traffic follows leadership; skew is visible in bandwidth | One broker carrying a disproportionate share |
| UnderReplicatedPartitions | Lagging replicas make rebalancing unsafe | Nonzero before planned election |
| ControllerEventQueueSize | Mass elections burden the single-threaded controller | Growth during or after rebalance events |

Fixes

Lower the auto-rebalance threshold

If auto.leader.rebalance.enable=true but skew persists, the 10 percent default may be too tolerant. KAFKA-8359 proposes lowering leader.imbalance.per.broker.percentage to 0 percent so that any deviation triggers election. Changing this requires a broker configuration update and may require a rolling restart, depending on your Kafka version. Tradeoff: more frequent elections increase controller load and may cause brief NOT_LEADER_OR_FOLLOWER errors (named NOT_LEADER_FOR_PARTITION before Kafka 2.6) as clients refresh metadata.

Trigger manual preferred replica election

If auto-rebalance is disabled to avoid the KAFKA-4084 thundering herd, run manual preferred replica election after verifying followers are caught up. This returns leadership to the preferred replica for specified partitions. For large clusters, run in batches to avoid overwhelming the controller.

# Trigger preferred replica election using a partition list file
kafka-leader-election.sh --bootstrap-server localhost:9092 --election-type preferred --path-to-json-file ple.json

Tradeoff: You control the timing and scope, but each election causes a brief leadership transition. Avoid running this during peak traffic or while UnderReplicatedPartitions is nonzero. When running the election, monitor LeaderElectionRateAndTimeMs and ControllerEventQueueSize. If election time spikes above one second or the queue grows continuously, pause the operation and allow the controller to drain before continuing.
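The ple.json file passed to --path-to-json-file is a JSON object with a "partitions" array of topic and partition pairs. The sketch below builds that document from a plain list, which makes it easier to cut a large cluster into batches. The function name `make_election_json` and the sample topic are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the JSON read by kafka-leader-election.sh --path-to-json-file.
# Input on stdin: "topic partition", one pair per line.
make_election_json() {
  awk '
    NF == 2 {
      sep = (n++ ? ",\n" : "")    # comma before every entry except the first
      body = body sep sprintf("    {\"topic\": \"%s\", \"partition\": %d}", $1, $2)
    }
    END { printf "{\n  \"partitions\": [\n%s\n  ]\n}\n", body }'
}

# Made-up example: elect preferred leaders for two partitions of one topic.
printf '%s\n' "orders 0" "orders 2" | make_election_json
```

To run in batches, one option is to split the full partition list into fixed-size chunks (for example with split -l), generate one JSON file per chunk, and run one election per file, checking the controller metrics between batches.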

Wait for ISR recovery after restarts

Do not trigger preferred replica election until Replica MaxLag is near zero and IsrExpandsPerSec has settled. Premature election creates unnecessary leader transitions and can delay recovery.
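This rule can be expressed as a simple gate in a maintenance script. The sketch below assumes you have already collected two integers: the number of under-replicated partitions and the current replica MaxLag. The function name `ready_for_election` and the lag tolerance are hypothetical; what counts as "near zero" depends on your traffic.

```shell
#!/usr/bin/env bash
# Sketch: decide whether it is safe to run preferred replica election.
#   $1  number of under-replicated partitions (e.g. line count of
#       kafka-topics.sh --describe --under-replicated-partitions)
#   $2  replica MaxLag in messages (e.g. read from JMX)
#   $3  lag tolerance, an assumption to tune per cluster (default 0)
ready_for_election() {
  local urp_count="$1" max_lag="$2" lag_tolerance="${3:-0}"
  [ "$urp_count" -eq 0 ] && [ "$max_lag" -le "$lag_tolerance" ]
}

# Made-up readings: a healthy cluster, then one still catching up.
if ready_for_election 0 0; then
  echo "ok to run preferred replica election"
fi
if ! ready_for_election 4 1200; then
  echo "wait for ISR recovery"
fi
```

The gate deliberately fails closed: any under-replicated partition or lag above the tolerance blocks the election, and the script should poll again rather than proceed.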

Use Cruise Control for large clusters

For clusters above 20 brokers, the built-in auto-rebalancer is usually too coarse. LinkedIn’s Cruise Control performs gradual, load-aware rebalancing that avoids the all-at-once behavior of KAFKA-4084. Tradeoff: another operational system to manage.

Prevention

  • Alert on LeaderCount deviation >30 percent from the cluster mean. Do not rely on aggregate throughput or PartitionCount alone.
  • Maintain a post-maintenance runbook: after any broker restart or controlled shutdown, verify LeaderCount recovers within one or two leader.imbalance.check.interval.seconds cycles, or trigger manual preferred replica election.
  • If you disable auto.leader.rebalance.enable to avoid replication storms, schedule periodic manual rebalancing during low-traffic windows and track it in your change calendar.
  • Dashboard LeaderCount and PartitionCount side by side per broker so operators can distinguish leadership skew from replica skew at a glance.

How Netdata helps

  • Surfaces LeaderCount deviation from the cluster mean per broker without manual JMX queries.
  • Correlates leadership skew with RequestHandlerAvgIdlePercent, BytesInPerSec, and produce latency on unified charts.
  • Alerts on sustained leadership imbalance using a >30 percent deviation threshold.
  • Exposes ControllerEventQueueSize and election rate to warn when a rebalance is likely to overload the controller.