Kafka NOT_LEADER_FOR_PARTITION: stale metadata, controller lag, and client retries

Producers and consumers log NOT_LEADER_FOR_PARTITION. Broker response metrics show spikes in failed produce or fetch requests. The cluster usually self-heals within seconds as clients refresh metadata. When the error persists for minutes, or flaps across many partitions, the root cause is typically a controller that cannot keep up with leadership changes. Distinguishing a routine leader election from a controller queue backup that blocks metadata propagation is the first step.

What this means

Kafka clients cache partition leadership metadata. When a leader moves (rolling restart, broker failure, preferred replica election), a client with a stale view sends requests to the previous leader. That broker returns NOT_LEADER_FOR_PARTITION (reported as NOT_LEADER_OR_FOLLOWER by Kafka 2.6 and later). The Java client treats this as a retriable error and requests a metadata refresh before retrying. A short spike during a restart is normal and usually clears within seconds.
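
As a rough sketch of the client-side settings that shape this retry-and-refresh behavior, the helper below writes a properties file for a Java producer. The function name write_producer_props and the output path are hypothetical. The property names are standard Java producer configs, but the values are only illustrative starting points and should be checked against the defaults of your client version.

```shell
#!/bin/sh
# Hypothetical helper: writes retry-related Java producer settings to a file.
# The values are illustrative, not recommendations.
write_producer_props() {
  out="$1"
  cat > "$out" <<'EOF'
# Keep retrying retriable errors such as NOT_LEADER_FOR_PARTITION
# until delivery.timeout.ms expires.
retries=2147483647
# Wait between retry attempts instead of retrying in a tight loop.
retry.backoff.ms=100
# Upper bound on the total time a send may take, retries included.
delivery.timeout.ms=120000
# Refresh metadata periodically even when no error forces a refresh.
metadata.max.age.ms=300000
# Avoid duplicates when a retried batch was already written.
enable.idempotence=true
EOF
}
```

Calling write_producer_props ./producer.properties produces a file that can be loaded into the producer's configuration. Tightening delivery.timeout.ms makes a persistent leadership problem surface sooner as a send failure instead of hiding behind long retries.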

If the error persists, the controller’s event queue is likely backed up. The active controller processes leadership changes, ISR updates, and topic operations sequentially from a single-threaded queue. When events arrive faster than they drain, metadata changes propagate slowly. Brokers serve conflicting leadership metadata, and clients receive unstable answers even after refresh. The result is sustained NOT_LEADER_FOR_PARTITION responses, often accompanied by growing under-replication and delayed leader elections.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Transient leader election | Errors spike for 10-60 seconds then flatline | LeaderElectionRateAndTimeMs burst; ActiveControllerCount == 1 |
| Controller event queue backup | Errors sustained for minutes; queue growing | ControllerEventQueueSize on the active controller |
| No active controller | Errors spread cluster-wide; no elections completing | ActiveControllerCount summed across brokers != 1 |
| Broker overload | High request latency alongside leadership errors | RequestHandlerAvgIdlePercent below 0.3 |

Quick checks

# Verify exactly one active controller exists
for host in broker1 broker2 broker3; do
  printf "%s: " "$host"
  echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l $host:9999 -n
done

# Check controller event queue depth (run on the active controller)
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999 -n

# Check recent leader election volume and timing
echo "get -b kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs Count" | java -jar jmxterm.jar -l localhost:9999 -n
echo "get -b kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999 -n

# List partitions lacking a full ISR
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# List partitions with no leader
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

# Check broker-side failed request rates
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n

# Rule out broker processing saturation
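# RequestHandlerAvgIdlePercent is exposed as a meter, so read a rate attribute rather than Value
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n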
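
The checks above print raw jmxterm output. A minimal sketch of turning that output into numbers that can be summed or compared follows. It assumes jmxterm prints attribute lines in the form "Value = 1;", which may differ between jmxterm versions and verbosity settings. The function names jmx_value and sum_values are hypothetical.

```shell
#!/bin/sh
# Extract the value from jmxterm "get" output read on stdin.
# Assumes attribute lines look like "Value = 1;"; other lines are ignored.
jmx_value() {
  sed -n 's/^[A-Za-z0-9]* = \(.*\);$/\1/p' | head -n 1
}

# Sum one number per input line (for example, one line per broker).
sum_values() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Example usage (commented out because it needs live brokers):
# for host in broker1 broker2 broker3; do
#   echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" \
#     | java -jar jmxterm.jar -l "$host:9999" -n | jmx_value
# done | sum_values
```

A healthy cluster should print 1 for the summed ActiveControllerCount.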

How to diagnose it

Use the following flow to isolate whether you are seeing a transient transition or a controller backlog.

flowchart TD
    A[NOT_LEADER_FOR_PARTITION errors] --> B{Persistent > 2 min?}
    B -->|No| C[Transient leader election]
    B -->|Yes| D[ActiveControllerCount == 1?]
    D -->|No| E[Controller outage]
    D -->|Yes| F[ControllerEventQueueSize]
    F -->|Growing > 1000| G[Controller queue backup]
    F -->|Spike then drain| H[Large failure recovery]
    F -->|Near zero| I[Check client or network]
    G --> J[Check ZK or KRaft health]

  1. Confirm the cluster has one active controller. Query ActiveControllerCount on every broker. The cluster-wide sum must be exactly 1. If it is 0, the cluster cannot elect leaders or update metadata. If it is greater than 1 in ZooKeeper mode, you may have split-brain.
  2. Check the controller event queue size. On the active controller, read kafka.controller:type=ControllerEventManager,name=EventQueueSize. Near zero is healthy. Sustained values above 100 indicate pressure. Continuous growth above 1000 means the controller is overwhelmed and metadata changes are queuing.
  3. Evaluate leader election timing. Read LeaderElectionRateAndTimeMs. A brief burst with completion times under 100ms suggests a normal transition. Elections consistently taking over 1 second, or a sustained high election rate outside maintenance, indicate the controller or metadata store is degraded.
  4. Correlate with replication state. Check UnderReplicatedPartitions. If it is rising across many brokers while NOT_LEADER_FOR_PARTITION persists, followers are also unable to sync, pointing to a broader broker or network issue rather than pure metadata staleness.
  5. Check for offline partitions. If OfflinePartitionsCount is increasing while the controller queue is backed up, partitions are waiting in line for leader election. This confirms the controller cannot keep up with failure recovery.
  6. Rule out broker saturation. If RequestHandlerAvgIdlePercent is sustained below 0.3 and RequestQueueSize is elevated, the broker is too slow to process requests. This can produce leadership timeouts that look like metadata issues but are actually resource exhaustion.
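
The first branches of this flow can be sketched as a small classifier. This is only an illustration using the thresholds quoted above (2 minutes, 100, 1000). The function name triage and its argument order are hypothetical, and it compares just two queue samples, so it is a coarse approximation of "growing" versus "draining".

```shell
#!/bin/sh
# Classify an incident from four integer readings:
#   $1 minutes the errors have persisted
#   $2 cluster-wide sum of ActiveControllerCount
#   $3 current controller event queue size
#   $4 queue size at the previous sample
triage() {
  persisted_min="$1"; controllers="$2"; queue_now="$3"; queue_prev="$4"
  if [ "$persisted_min" -le 2 ]; then
    echo "transient leader election"
  elif [ "$controllers" -ne 1 ]; then
    echo "controller outage"
  elif [ "$queue_now" -gt 1000 ] && [ "$queue_now" -gt "$queue_prev" ]; then
    echo "controller queue backup"
  elif [ "$queue_prev" -gt 100 ] && [ "$queue_now" -lt "$queue_prev" ]; then
    echo "large failure recovery"
  elif [ "$queue_now" -gt 100 ]; then
    echo "controller under pressure"
  else
    echo "check client or network"
  fi
}
```

For example, triage 5 1 1500 1200 prints "controller queue backup", while triage 5 1 300 900 prints "large failure recovery". Replication state, offline partitions, and broker saturation (steps 4 to 6) still need to be checked separately.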

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ControllerEventQueueSize | Measures controller backlog directly. Only meaningful on the active controller. | Sustained > 100; continuous growth > 1000 |
| LeaderElectionRateAndTimeMs | Reveals election velocity and whether the controller is stalling. | p99 time > 1s; sustained high rate outside maintenance |
| ActiveControllerCount | Confirms a single controller exists to process events. | Cluster-wide sum != 1 |
| UnderReplicatedPartitions | Shows if replication is degraded beyond leadership metadata. | Nonzero and growing across multiple brokers |
| OfflinePartitionsCount | Confirms partitions are truly unavailable, not just misrouted. | Increasing while controller queue is backed up |
| FailedProduceRequestsPerSec / FailedFetchRequestsPerSec | Broker-side view of client-visible errors including this one. | Sustained nonzero rate outside maintenance |
| RequestHandlerAvgIdlePercent | Distinguishes broker overload from metadata propagation delays. | Sustained below 0.3 |
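
The queue warning signs depend on a trend, not a single reading. As a hedged sketch, the function below evaluates a series of queue size samples (one per line, oldest first) against the two thresholds in the table. The name queue_alert and the labels it prints are hypothetical, and "continuous growth" is interpreted here as every sample being larger than the one before it.

```shell
#!/bin/sh
# Read queue size samples from stdin, oldest first, one per line.
# Prints "critical" when the series rises at every step and ends above 1000,
# "warning" when every sample is above 100, otherwise "ok".
queue_alert() {
  awk '
    { v = $1 + 0 }
    NR == 1 { min = v; growing = 1 }
    NR > 1 && v <= prev { growing = 0 }
    v < min { min = v }
    { prev = v; last = v }
    END {
      if (NR == 0) print "no data"
      else if (NR > 1 && growing && last > 1000) print "critical"
      else if (min > 100) print "warning"
      else print "ok"
    }'
}
```

Feeding it the samples 150, 400, 900, 1600 prints "critical". The samples 150, 140, 180 print "warning", and a brief spike such as 3, 250, 4 prints "ok".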

Fixes

If the controller queue is backed up

Do not restart additional brokers. Restarting generates more controller events and worsens the backlog. Check the metadata store:

  • In ZooKeeper mode, check ZooKeeperRequestLatencyMs and ZooKeeperExpiresPerSec. High ZK latency slows every controller event. If ZK is shared with other systems, isolate it.
  • In KRaft mode, check quorum health with kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --status, and use describe --replication for per-replica lag. Look for growing voter lag and commit latency. If the quorum has lost its leader, metadata is frozen.
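
For the KRaft check, a minimal sketch of reading the describe --status output in a script is shown below. It assumes the tool prints "Key: value" lines such as "LeaderId: 3002" and "MaxFollowerLag: 0", which should be verified against your Kafka version. The function name quorum_check and the default lag threshold of 1000 are hypothetical.

```shell
#!/bin/sh
# Summarize `kafka-metadata-quorum.sh ... describe --status` output from stdin.
# $1 is an optional maximum acceptable follower lag (hypothetical default).
quorum_check() {
  max_lag="${1:-1000}"
  awk -v max_lag="$max_lag" '
    $1 == "LeaderId:"       { seen_leader = 1 }
    $1 == "MaxFollowerLag:" { lag = $2 + 0 }
    END {
      if (!seen_leader)       print "no quorum leader reported"
      else if (lag > max_lag) print "follower lag " lag
      else                    print "healthy"
    }'
}

# Example usage (commented out because it needs a live cluster):
# kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --status | quorum_check 500
```

The threshold is deliberately a parameter: what counts as acceptable lag depends on the metadata write rate of the cluster.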

If a specific broker is unhealthy (disk latency spikes, GC pauses) and generating repeated ISR changes, perform a controlled shutdown to remove it cleanly rather than letting it flap and enqueue more events.

If the queue is draining slowly after a large-scale failure, wait. Monitor LeaderElectionRateAndTimeMs for completion. If the queue is growing without bound, the controller itself may need attention; check JVM GC logs and CPU.

If there is no active controller

In ZooKeeper mode, check for ZK session expirations and ensure network connectivity between brokers and ZK nodes. In KRaft mode, verify quorum voter connectivity and that controller nodes are healthy. Without a controller, the cluster cannot self-heal and manual intervention is required to restore the metadata plane.
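
In ZooKeeper mode, one way to confirm whether any broker currently holds the controller role is to read the /controller znode. The sketch below extracts the broker ID from that znode's JSON payload. It assumes the payload contains a "brokerid" field with no whitespace after the colon, and the ZooKeeper address zk1:2181 in the usage comment is a placeholder. The function name controller_id is hypothetical.

```shell
#!/bin/sh
# Print the broker ID found in /controller znode output read from stdin.
# Prints nothing when no "brokerid" field is present.
controller_id() {
  sed -n 's/.*"brokerid":\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example usage (commented out because it needs a live ZooKeeper ensemble):
# zookeeper-shell.sh zk1:2181 get /controller | controller_id
```

Empty output suggests the znode is missing or unreadable, which is consistent with ActiveControllerCount summing to 0.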

If the issue is transient

If errors spike during a rolling restart or single broker recovery and clear within 30-60 seconds, no fix is needed. Verify that UnderReplicatedPartitions returns to zero and that clients have resumed normal throughput.
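
The verification step can be scripted. The sketch below counts partition lines in kafka-topics.sh --describe output and polls until the count reaches zero. It assumes each partition line contains the text "Partition: ", and the attempt count and delay in the usage comment are illustrative. The function names count_partitions and wait_until_zero are hypothetical.

```shell
#!/bin/sh
# Count partition lines in `kafka-topics.sh --describe` output read on stdin.
count_partitions() {
  grep -c 'Partition: ' || true
}

# Run a command repeatedly until it prints 0.
# $1 = maximum attempts, $2 = seconds between attempts, rest = command.
# Returns 0 on success and 1 if the count never reached zero.
wait_until_zero() {
  attempts="$1"; delay="$2"; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    [ "$("$@")" -eq 0 ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (commented out because it needs a live cluster):
# urp() {
#   kafka-topics.sh --bootstrap-server localhost:9092 --describe \
#     --under-replicated-partitions | count_partitions
# }
# wait_until_zero 12 5 urp || echo "still under-replicated after 60 seconds"
```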

Prevention

  • Monitor ControllerEventQueueSize as a primary controller health signal. Alert on sustained values above 100.
  • Keep partition counts per broker within tested limits. The controller must process an event per partition during failures. Test recovery time by gracefully shutting down one broker. If recovery exceeds 1-2 minutes, reduce partition density.
  • Maintain ZooKeeper or KRaft quorum health independently. Do not share ZK clusters with other applications.
  • Avoid coordinated restarts of multiple brokers. Stagger maintenance to prevent controller overload.
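  • Verify leadership rebalancing after restarts. If auto.leader.rebalance.enable does not rebalance sufficiently, run kafka-leader-election.sh with --election-type PREFERRED to move leadership back to the preferred replicas, so that a few hot brokers do not end up concentrating leadership and the metadata churn that comes with it.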
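
As a sketch of the leadership rebalancing step, the wrapper below triggers a preferred leader election for all partitions, with a dry-run mode that only prints the command. The function name preferred_election and the DRY_RUN variable are hypothetical, and the bootstrap address is a placeholder.

```shell
#!/bin/sh
# Run a preferred leader election for all topic partitions.
# With DRY_RUN=1 the command is printed instead of executed.
# $1 is the bootstrap address (placeholder default).
preferred_election() {
  bootstrap="${1:-localhost:9092}"
  set -- kafka-leader-election.sh --bootstrap-server "$bootstrap" \
    --election-type PREFERRED --all-topic-partitions
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Running DRY_RUN=1 preferred_election broker1:9092 shows the exact command first. Because an election itself generates controller events, it is safer to run it only after the controller queue has drained.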

How Netdata helps

  • Correlate ControllerEventQueueSize with FailedProduceRequestsPerSec in the same time window to confirm that controller lag is causing client errors.
  • Alert on LeaderElectionRateAndTimeMs spikes and ActiveControllerCount anomalies.
  • Track UnderReplicatedPartitions and OfflinePartitionsCount alongside broker resource metrics to distinguish controller issues from disk or network degradation.
  • Visualize request latency breakdowns (RequestQueueTimeMs, LocalTimeMs) to rule out I/O thread saturation that can mimic metadata propagation delays.