Kafka leadership imbalance: LeaderCount skew and preferred replica election
One broker in your cluster is handling 40 percent of all produce requests. Its RequestHandlerAvgIdlePercent is dropping toward 0.3 while the rest of the cluster is nearly idle. Cluster-level throughput dashboards look healthy because aggregate ingress is within capacity, but tail latency on the hot broker is climbing and consumers connected to it are falling behind. The root cause is leadership skew: a disproportionate number of partition leaders have landed on one broker, and every leader carries the full request load for its partitions. This happens silently after rolling restarts, controlled shutdowns, or broker decommissions. Even a perfectly even PartitionCount across brokers can hide severe leadership imbalance. You will not see it without monitoring LeaderCount per broker.
What this means
kafka.server:type=ReplicaManager,name=LeaderCount counts how many partitions each broker leads. By default, leaders handle all produce and consumer fetch requests for their partitions; followers only replicate. If partitions carry similar traffic, a broker leading 40 percent of partitions in a five-node cluster is doing roughly 40 percent of the request processing, not 20. The hot broker sees both ingress and egress amplification: every byte produced to a leader is fetched again by each follower and by each consumer group. With replication factor 3 and multiple consumer groups, that broker can push egress to several times its fair share of network capacity. Cluster-wide averages hide this.
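The amplification can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic under simplifying assumptions that are not from the original text: every partition carries the same traffic, every consumer group reads every partition, and all fetches go to the leader. The function name `hot_broker_load` is hypothetical.

```python
def hot_broker_load(leader_fraction, broker_count, replication_factor, consumer_groups):
    """Rough load estimate for a broker that leads `leader_fraction` of all partitions.

    Assumes uniform traffic per partition and that every consumer group
    reads every partition from the leader.
    """
    if broker_count < 1 or replication_factor < 1 or consumer_groups < 0:
        raise ValueError("invalid cluster parameters")
    # Request load relative to a perfectly even split (1.0 means fair share).
    load_vs_fair_share = leader_fraction * broker_count
    # Each produced byte leaves the leader once per follower and once per consumer group.
    egress_per_ingress_byte = (replication_factor - 1) + consumer_groups
    return load_vs_fair_share, egress_per_ingress_byte


load, egress = hot_broker_load(0.40, broker_count=5, replication_factor=3, consumer_groups=4)
print(f"{load:.1f}x fair share of requests, {egress} bytes out per byte in")
```

With these inputs the hot broker carries about twice its fair share of requests, and each byte it accepts leaves the broker six times.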
Kafka assigns the first replica in the partition assignment as the preferred leader when a topic is created. When that broker fails or restarts, leadership moves to another replica. When the broker returns, the assignment still prefers it, but leadership does not move back immediately. Moving it back is the job of the controller’s auto-rebalancer, controlled by auto.leader.rebalance.enable (default true). The controller checks every leader.imbalance.check.interval.seconds (default 300 seconds). For each broker, it looks at the partitions that prefer that broker and computes the share currently led by a different replica. If that share exceeds leader.imbalance.per.broker.percentage (default 10 percent), the controller triggers preferred replica election for those partitions.
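To make the threshold check concrete: as far as the ZooKeeper-mode controller code goes, the per-broker figure compared against the threshold is the share of partitions that prefer a broker but are currently led elsewhere. The following is a simplified sketch of that calculation, not the controller's actual code. The input shapes (`assignments`, `leaders`) and the function names are hypothetical.

```python
from collections import defaultdict


def imbalance_by_broker(assignments, leaders):
    """Share of each broker's preferred partitions that are led by another replica.

    assignments: {(topic, partition): [replica ids]}, first id is the preferred leader
    leaders:     {(topic, partition): current leader id}
    """
    preferred_total = defaultdict(int)
    led_elsewhere = defaultdict(int)
    for tp, replicas in assignments.items():
        preferred = replicas[0]
        preferred_total[preferred] += 1
        # Partitions without known leadership are not counted as imbalanced.
        if tp in leaders and leaders[tp] != preferred:
            led_elsewhere[preferred] += 1
    return {b: led_elsewhere[b] / total for b, total in preferred_total.items()}


def brokers_over_threshold(ratios, threshold_percent=10):
    """Brokers whose imbalance ratio exceeds the configured percentage."""
    return sorted(b for b, r in ratios.items() if r > threshold_percent / 100)


assignments = {
    ("orders", 0): [1, 2, 3],
    ("orders", 1): [1, 3, 2],
    ("orders", 2): [2, 3, 1],
    ("orders", 3): [3, 1, 2],
}
leaders = {("orders", 0): 2, ("orders", 1): 1, ("orders", 2): 2, ("orders", 3): 3}
print(brokers_over_threshold(imbalance_by_broker(assignments, leaders)))
```

Here broker 1 is preferred for two partitions and leads only one of them, so its ratio is 0.5 and it crosses the 10 percent default.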
Two issues break this mechanism in production. KAFKA-4084 documents that the auto-rebalancer moves all imbalanced partitions simultaneously, flooding the cluster with replication traffic and ISR churn. Many operators disable auto.leader.rebalance.enable to avoid this. KAFKA-8359 argues that the 10 percent default is too high: small topics catch up first after a restart, leaving large high-volume topics unbalanced. As of Kafka 3.7 the default remains 10 percent, and applying changes to these settings requires a broker configuration update that may need a rolling restart.
flowchart TD
A[LeaderCount deviation >30% from mean] -->|check| B{auto.leader.rebalance.enable}
B -->|enabled| C[Review leader.imbalance.per.broker.percentage]
B -->|disabled| D[Plan manual preferred replica election]
C -->|threshold > 10%| E[Lower threshold or run immediate PLE]
C -->|threshold <= 10%| F[Verify replica lag before election]
D -->|next| F
E -->|next| F
F -->|lagging| G[Wait for ISR recovery]
F -->|caught up| H[Run kafka-leader-election.sh --election-type preferred]
G -->|retry| F
H -->|verify| I[Confirm balanced LeaderCount and request latency]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Auto-rebalance disabled to avoid KAFKA-4084 | Skew persists indefinitely after maintenance | auto.leader.rebalance.enable in broker config |
| Default 10% threshold too loose | Gradual drift after broker returns; large topics remain on overloaded brokers | leader.imbalance.per.broker.percentage |
| Post-restart recovery window | Broker returned but auto-rebalance has not yet run | Uptime and time since rejoin |
| Controlled shutdown without follow-up | Leaders were moved gracefully and never moved back | Event timeline aligned with maintenance |
| KRaft mode differences | Leadership mechanics differ from ZooKeeper mode | Cluster mode and version-specific behavior |
Quick checks
# Inspect leader distribution in topic metadata
kafka-topics.sh --bootstrap-server localhost:9092 --describe
# Read LeaderCount on the local broker via JMX
echo "get -b kafka.server:type=ReplicaManager,name=LeaderCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check auto-rebalance configuration in server.properties
grep -E 'auto.leader.rebalance.enable|leader.imbalance.per.broker.percentage' /etc/kafka/server.properties
# Check if the hot broker is saturated
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999
# Read inbound traffic on the local broker
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Verify no under-replication before considering rebalancing
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check follower lag if this broker recently rejoined
echo "get -b kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica Value" | java -jar jmxterm.jar -l localhost:9999
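The describe output from the first command can be summarized per broker instead of being read by eye. This is a minimal sketch that assumes the usual per-partition line layout (Topic, Partition, Leader, Replicas); the exact output varies between Kafka versions, and the function name `summarize_describe` is hypothetical.

```python
import re
from collections import Counter

# Matches per-partition lines only; topic summary lines do not fit this pattern.
_PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+Replicas:\s*([\d,]+)"
)


def summarize_describe(output):
    """Return (leaders per broker, partitions not led by their preferred replica)."""
    leaders = Counter()
    not_preferred = []
    for line in output.splitlines():
        match = _PARTITION_LINE.search(line)
        if not match:
            continue
        topic, partition, leader, replicas = match.groups()
        leaders[leader] += 1
        # The first replica in the assignment is the preferred leader.
        if leader != replicas.split(",")[0]:
            not_preferred.append((topic, int(partition)))
    return leaders, not_preferred
```

Feed it the captured text of `kafka-topics.sh --describe`. The second return value doubles as a candidate list for a later preferred replica election.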
How to diagnose it
- Collect LeaderCount from every broker via JMX, compute the cluster mean, and flag any broker deviating by more than 30 percent.
- Collect PartitionCount per broker. If partition counts are even but LeaderCount is skewed, you have a leadership imbalance, not a replica assignment problem.
- Check auto.leader.rebalance.enable and leader.imbalance.per.broker.percentage. If enabled and set to 10 percent, KAFKA-8359 explains why large topics may remain unbalanced indefinitely.
- Correlate with traffic signals. The hot broker should show higher BytesInPerSec and BytesOutPerSec, and lower RequestHandlerAvgIdlePercent, than its peers.
- Review the event timeline. If skew appeared after a rolling restart, controlled shutdown, or broker decommission, the cluster likely relied on auto-rebalance and either it did not trigger or it was disabled.
- Verify replica health before any fix. Check UnderReplicatedPartitions and Replica MaxLag. Triggering preferred replica election to a broker that is still catching up will fail or extend the recovery window.
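The first step, flagging brokers by deviation from the cluster mean, is easy to script once the per-broker LeaderCount values have been collected. The following is a minimal sketch. The function name `flag_leader_skew` and the input dictionary are hypothetical, and the 30 percent default mirrors the threshold used in this guide.

```python
from statistics import mean


def flag_leader_skew(leader_counts, max_deviation=0.30):
    """Return {broker_id: relative deviation} for brokers outside the allowed band.

    leader_counts: {broker_id: LeaderCount}
    """
    if not leader_counts:
        return {}
    average = mean(leader_counts.values())
    if average == 0:
        return {}
    flagged = {}
    for broker, count in leader_counts.items():
        deviation = (count - average) / average
        if abs(deviation) > max_deviation:
            flagged[broker] = round(deviation, 2)
    return flagged


# One broker leading 40 percent of 1000 partitions in a five-node cluster.
print(flag_leader_skew({1: 400, 2: 150, 3: 150, 4: 150, 5: 150}))
```

In this example only broker 1 is flagged. The other four sit 25 percent below the mean, inside the 30 percent band, which is why the hot broker rather than its peers is the one to act on.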
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| LeaderCount per broker | Direct measure of how many partitions this broker leads | Deviation >30% from cluster mean |
| PartitionCount per broker | Distinguishes leadership skew from uneven replica assignment | Even count but severely uneven leaders |
| RequestHandlerAvgIdlePercent | Confirms the leader-heavy broker is actually saturated | Sustained below 0.3 on the hot broker |
| BytesInPerSec / BytesOutPerSec | Traffic follows leadership; skew is visible in bandwidth | One broker carrying a disproportionate share |
| UnderReplicatedPartitions | Lagging replicas make rebalancing unsafe | Nonzero before planned election |
| ControllerEventQueueSize | Mass elections burden the single-threaded controller | Growth during or after rebalance events |
Fixes
Lower the auto-rebalance threshold
If auto.leader.rebalance.enable=true but skew persists, the 10 percent default may be too tolerant. KAFKA-8359 proposes lowering leader.imbalance.per.broker.percentage to 0 percent so that any deviation triggers election. Changing this requires a broker configuration update and may require a rolling restart, depending on your Kafka version. Tradeoff: more frequent elections increase controller load and may cause brief NOT_LEADER_OR_FOLLOWER errors (NOT_LEADER_FOR_PARTITION in older versions) as clients refresh metadata.
Trigger manual preferred replica election
If auto-rebalance is disabled to avoid the KAFKA-4084 thundering herd, run manual preferred replica election after verifying followers are caught up. This returns leadership to the preferred replica for specified partitions. For large clusters, run in batches to avoid overwhelming the controller.
# Trigger preferred replica election using a partition list file
kafka-leader-election.sh --bootstrap-server localhost:9092 --election-type preferred --path-to-json-file ple.json
Tradeoff: You control the timing and scope, but each election causes a brief leadership transition. Avoid running this during peak traffic or while UnderReplicatedPartitions is nonzero. When running the election, monitor LeaderElectionRateAndTimeMs and ControllerEventQueueSize. If election time spikes above one second or the queue grows continuously, pause the operation and allow the controller to drain before continuing.
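The file passed to --path-to-json-file holds a top-level "partitions" array of topic and partition pairs. The following is a minimal sketch for splitting a partition list into batches of that shape. The function name `election_batches` and the batch size of 50 are hypothetical choices, not recommendations from the tool itself.

```python
import json


def election_batches(partitions, batch_size=50):
    """Yield one JSON document per batch, in the shape kafka-leader-election.sh reads.

    partitions: list of (topic, partition) tuples to return to their preferred leader
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(partitions), batch_size):
        chunk = partitions[start:start + batch_size]
        payload = {"partitions": [{"topic": t, "partition": p} for t, p in chunk]}
        yield json.dumps(payload, indent=2)


targets = [("orders", 0), ("orders", 1), ("payments", 0)]
for document in election_batches(targets, batch_size=2):
    print(document)
```

Write each document to ple.json in turn, run the election command, and check the controller metrics before moving on to the next batch.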
Wait for ISR recovery after restarts
Do not trigger preferred replica election until Replica MaxLag is near zero and IsrExpandsPerSec has settled. Premature election creates unnecessary leader transitions and can delay recovery.
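This gate can be expressed as a small polling loop. The sketch below assumes a caller-supplied `read_signals` function (hypothetical) that returns the current UnderReplicatedPartitions, Replica MaxLag, and IsrExpandsPerSec values, for example from JMX. The thresholds and timings are illustrative assumptions, not values from Kafka.

```python
import time


def safe_to_elect(under_replicated, max_lag, isr_expands_per_sec,
                  lag_threshold=100, expand_rate_threshold=0.01):
    """True when replicas look caught up and ISR membership has stopped changing."""
    return (under_replicated == 0
            and max_lag <= lag_threshold
            and isr_expands_per_sec <= expand_rate_threshold)


def wait_for_recovery(read_signals, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll until safe_to_elect() holds; return False if the timeout passes first."""
    waited = 0
    while True:
        if safe_to_elect(*read_signals()):
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the loop testable without real delays. A False result means the broker has not recovered in time and the election should be postponed.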
Use Cruise Control for large clusters
For clusters above 20 brokers, the built-in auto-rebalancer is usually too coarse. LinkedIn’s Cruise Control performs gradual, load-aware rebalancing that avoids the all-at-once behavior of KAFKA-4084. Tradeoff: another operational system to manage.
Prevention
- Alert on LeaderCount deviation >30 percent from the cluster mean. Do not rely on aggregate throughput or PartitionCount alone.
- Maintain a post-maintenance runbook: after any broker restart or controlled shutdown, verify LeaderCount recovers within one or two leader.imbalance.check.interval.seconds cycles, or trigger manual preferred replica election.
- If you disable auto.leader.rebalance.enable to avoid replication storms, schedule periodic manual rebalancing during low-traffic windows and track it in your change calendar.
- Dashboard LeaderCount and PartitionCount side by side per broker so operators can distinguish leadership skew from replica skew at a glance.
How Netdata helps
- Surfaces LeaderCount deviation from the cluster mean per broker without manual JMX queries.
- Correlates leadership skew with RequestHandlerAvgIdlePercent, BytesInPerSec, and produce latency on unified charts.
- Alerts on sustained leadership imbalance using a >30 percent deviation threshold.
- Exposes ControllerEventQueueSize and election rate to warn when a rebalance is likely to overload the controller.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation
- Kafka min.insync.replicas and acks: configuring durability you actually have
- Kafka monitoring checklist: the signals every production cluster needs
- Kafka monitoring maturity model: from survival to expert
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas
- Kafka OfflinePartitionsCount > 0: partitions with no leader and how to recover
- Kafka replica MaxLag growing: slow followers and replica fetcher health
- Kafka UnderMinIsrPartitionCount: confirming the write path is blocked
- Kafka UnderReplicatedPartitions > 0: the most important metric and how to clear it







