Kafka too many partitions per broker: controller load, recovery time, and the 4000 guideline
Every replica on a Kafka broker carries fixed overhead: file descriptors, memory-mapped index files, replication fetcher load, and controller metadata. The guideline of roughly 4000 partitions per broker is not a hard architectural limit. It is an operational warning based on a serial bottleneck: the controller processes partition state changes one at a time. When a broker with 5000 partitions fails, the active controller queues thousands of events. Until the queue drains, leader elections stall, metadata propagation slows, and clients see NOT_LEADER_FOR_PARTITION. Restarting the broker forces every replica to recover its logs and catch up from leaders before rejoining the ISR. Recovery time is a step function; a routine restart can become a multi-hour incident. Partition count is easy to ignore in steady state and catastrophic to discover during an outage.
What it is and why it matters
PartitionCount per broker is the total number of leader and follower replicas assigned to that broker. Read it from the JMX MBean kafka.server:type=ReplicaManager,name=PartitionCount. It changes only when topics are created, deleted, or reassigned.
Each partition adds overhead in four dimensions:
- File descriptors: every log segment requires a
.logfile and at least one index file (.index,.timeindex). Hundreds of partitions with rolling segments exhaust the defaultulimit -nof 1024 quickly. Production deployments should raise this to 100,000 or higher. - Memory: brokers memory-map index files and maintain metadata structures per partition. This overhead competes with the OS page cache even when message volume is low.
- Replication fetcher load: followers catch up from leaders via replica fetcher threads. More follower partitions increase disk I/O, network utilization, and CPU consumption across the fetcher thread pool.
- Controller event queue: every ISR change, leader election, and reassignment becomes an event that the active controller processes serially.
The 4000 partition guideline is a conservative threshold. The actual ceiling depends on hardware, disk latency, network bandwidth, and whether the cluster runs ZooKeeper or KRaft mode. Clusters on slower spinning disks or with constrained network throughput will hit controller and recovery limits well before 4000. KRaft removes ZooKeeper session bottlenecks and improves failover latency, but the active controller still processes partition events sequentially. A broker with 5000 partitions still generates roughly 5000 events on failure, and a restarted broker still must catch up all 5000 replicas regardless of the metadata quorum implementation.
How it works
The critical path is the controller event queue. Kafka maintains exactly one active controller. It holds a queue of pending events: ISR expansions and shrinks, leader elections, topic changes, and broker lifecycle updates. It processes them sequentially. There is no horizontal scaling for this queue.
When a broker fails:
- For partitions where the dead broker was the leader, the controller elects a new leader from the remaining ISR.
- For partitions where the dead broker was a follower, surviving leaders eventually shrink the ISR, which the controller also processes.
In total, a broker failure generates on the order of one controller event per partition it hosted. A broker with 5000 partitions therefore injects roughly 5000 events into a single-threaded queue. While those events drain, new failures, rolling restarts, or reassignment operations add more work. Queue depth grows, and LeaderElectionRateAndTimeMs spikes. Partitions that need new leaders remain offline until the controller reaches their events. Clients see NOT_LEADER_FOR_PARTITION and retry, amplifying load on the remaining brokers.
Broker restart compounds the problem. After restart, the broker loads every partition directory, replays unflushed messages from the recovery point, and rebuilds in-memory indexes. Only then does it open replica fetcher connections to leaders. Until the follower catches up to the high-water mark, the leader does not add it back to the ISR. With thousands of partitions, log recovery alone can take tens of minutes. The catch-up phase then keeps UnderReplicatedPartitions elevated for minutes or hours, depending on data volume, network bandwidth, and disk speed. During this window the cluster runs at reduced durability, and any additional failure risks unavailability.
flowchart TD
A[Broker hosts 5000 partitions] --> B[Steady-state FD, memory, and fetcher overhead]
A --> C[Broker fails]
C --> D[Controller queues ~5000 events]
D --> E[Sequential drain]
E --> F[Leader elections delayed]
F --> G[UnderReplicatedPartitions rises]
F --> H[Clients see NOT_LEADER_FOR_PARTITION]
C --> I[Surviving brokers absorb leadership traffic]
A --> J[Broker restarts]
J --> K[Catch-up from leaders takes minutes to hours]Where it shows up in production
Rolling restarts leave the cluster under-replicated for hours. If one broker takes thirty minutes to catch up after restart, a ten-broker rolling restart creates a five-hour window of continuous under-replication. If the cluster uses a replication factor of two, that window is also a period of reduced redundancy: a single additional disk failure on the broker that hosts the other replica risks unavailability. A second full broker failure during that window doubles the controller queue depth and can leave some partitions without a viable leader.
Single broker death becomes a metadata stall. Operators watching only OfflinePartitionsCount may notice a lag between the broker failure and the count rising. That lag is the controller queue backing up. ControllerEventQueueSize reveals the real problem: thousands of pending events.
Untested recovery times. Teams that do not run failure tests often discover during incidents that a broker restart takes ninety minutes. They know their broker count and replication factor, but have never measured recovery duration. Without a baseline, on-call engineers cannot tell whether a restart is progressing normally or has stalled on a slow disk or network partition.
Leadership skew creates hot spots. Even balanced PartitionCount can hide leadership imbalance. A broker leading 80% of its assigned partitions handles far more produce and fetch traffic than a peer leading 20%. RequestHandlerAvgIdlePercent drops on the hot broker. If it fails, the traffic shift plus the controller event wave creates a compound outage. The remaining brokers must absorb the redirected produce traffic while also handling the controller event wave. If any survivor was already near its limit, it too can become a bottleneck.
Tradeoffs and common misuses
Partition count is often treated as a throughput knob. More partitions allow higher consumer parallelism and smaller per-partition batch latencies. Consumer groups scale parallelism to the partition count of the subscribed topics. Adding partitions to increase throughput is valid, but the operational tax is permanent and paid on every broker that hosts a replica.
- Over-provisioning partitions “for future scale” creates permanent overhead. A topic with 64 partitions and replication factor three adds 192 replicas cluster-wide. If those replicas land on brokers that already hold 3500 partitions, the broker crosses the 4000 threshold for a topic that may not yet have any traffic.
- Ignoring follower load is common because
PartitionCountincludes both leaders and followers. A broker with few leaders but many followers does not serve heavy client traffic, yet it still consumes disk I/O, network, and controller attention during failures. During a rolling restart, this broker still has to catch up thousands of follower partitions, extending the under-replicated window even though it was never a hot spot under normal load. - Looking only at cluster averages hides imbalance. A fifty-broker cluster with 200,000 total partitions averages 4000 per broker. If two brokers hold 8000 partitions due to a decommissioning or reassignment error, those two brokers dominate recovery time and controller load.
- Assuming KRaft removes all limits. KRaft improves controller failover and raises total cluster partition limits, but the active controller still processes partition state changes serially. High partition churn from frequent broker failures or rapid reassignments can still back up the event queue.
Signals to watch in production
| Signal | Why it matters | Warning sign |
|---|---|---|
kafka.server:type=ReplicaManager,name=PartitionCount | Total partition overhead per broker | >4000 partitions, or >20% deviation from cluster mean |
kafka.server:type=ReplicaManager,name=LeaderCount | Concentration of request-processing load | Any broker >30% above the cluster mean |
kafka.controller:type=ControllerEventManager,name=EventQueueSize | Pending metadata operations on the active controller | >100 sustained, or growing continuously above 1000 |
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs | Speed of failure recovery | Election time consistently >1s, or burst outside maintenance |
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | Broker processing headroom | <0.3 sustained |
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Durability degradation during recovery | Stays flat or grows after a broker restart, rather than trending down |
| Recovery duration after controlled shutdown | Real-world operational readiness | Measured during game day; should recover within minutes, not hours |
How Netdata helps
Netdata collects per-broker JMX metrics directly. Use it to spot partition-related imbalances without maintaining a separate metrics store.
- Per-broker
PartitionCountandLeaderCountcharts show skew across all brokers on one dashboard. - Controller event queue depth on the active controller correlates with broker restarts and failures.
- Correlate
RequestHandlerAvgIdlePercentwithLeaderCountto distinguish saturation caused by leadership density from raw throughput pressure. - Anomaly detection on
UnderReplicatedPartitionsandIsrShrinksPerSecflags recoveries that exceed historical baselines. - JMX ingestion without an external store means you can monitor a new broker as soon as its JMX endpoint is accessible and immediately see partition load and recovery behavior.
When a rolling restart begins, watch ControllerEventQueueSize on the active controller. If it climbs above 100 while the first broker is restarting, pause the rollout until the queue drains. Use per-broker PartitionCount to verify that the restarted broker rejoined with the expected replica set. If UnderReplicatedPartitions does not trend down within your baseline recovery window, investigate disk I/O or network throttling before proceeding to the next broker.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker
- Kafka disk space planning: retention, replication, and runway estimation
- Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses







