Kafka OfflinePartitionsCount > 0: partitions with no leader and how to recover

When kafka.controller:type=KafkaController,name=OfflinePartitionsCount is nonzero, at least one partition has no active leader. Those partitions are completely unavailable: producers receive errors, consumers stall, and no data is written or read until a leader is elected. This is a data-plane outage.

This metric is only meaningful on the active controller. Non-controller brokers always report zero. If the controller itself is down, the metric may be stale or unreachable at the exact moment you need it. Brief spikes can occur during controller re-election or ungraceful broker shutdown, but any sustained nonzero value past 60 seconds is an active incident that requires immediate intervention.

Partitions go offline when the controller cannot elect a leader from the current In-Sync Replica set (ISR). The most common triggers are a broker failure that takes down every replica of a partition, or an ISR that had already shrunk to the leader alone, which then fails. The path from healthy replication to offline partitions usually runs through UnderReplicatedPartitions, making under-replication the leading indicator and offline partitions the impact.

flowchart TD
    A[Broker failure or degradation] --> B[Follower falls behind]
    B --> C[ISR shrinks]
    C --> D[UnderReplicatedPartitions rises]
    D --> E[No ISR member available]
    E --> F[Controller cannot elect leader]
    F --> G[OfflinePartitionsCount > 0]

What this means

OfflinePartitionsCount measures partitions that currently have no leader. Without a leader, no broker is authorized to accept writes or serve reads for that partition. The controller elects a new leader from the ISR. If the ISR is empty or the controller cannot process the election, the partition stays offline.

The severity is PAGE if the metric is nonzero for more than 60 seconds and no broker or controller has an uptime below 600 seconds. That uptime gate filters out noise from cold starts and rolling restarts. During steady state, this metric must be zero.
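
The paging rule above can be written as a small shell predicate. This is a minimal sketch: the function name `should_page` and its three arguments are illustrative, and it assumes your monitoring pipeline already supplies the metric value, how long it has been nonzero, and the lowest uptime across brokers and controllers.

```shell
# should_page OFFLINE_COUNT NONZERO_SECONDS MIN_UPTIME_SECONDS
# Prints "page" when the rule in the text is met, "hold" otherwise.
should_page() {
  offline=$1        # current OfflinePartitionsCount on the active controller
  nonzero_for=$2    # seconds the metric has been continuously nonzero
  min_uptime=$3     # lowest uptime in seconds across brokers and controllers
  if [ "$offline" -gt 0 ] && [ "$nonzero_for" -gt 60 ] && [ "$min_uptime" -ge 600 ]; then
    echo page
  else
    echo hold
  fi
}

should_page 3 90 7200   # sustained outage, no recent restarts
should_page 3 90 120    # a node restarted two minutes ago, so the uptime gate holds the page
```

Keeping the uptime gate as a separate argument makes it easy to see which condition suppressed a page during a rolling restart.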

Topics with replication.factor=1 are especially vulnerable. If the single broker hosting that partition goes down, there are no other replicas to promote. The partition remains offline until that specific broker returns.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| All replicas for a partition are on failed brokers | Multiple brokers unreachable; UnderReplicatedPartitions spiked just before the outage | Broker process liveness and network reachability on the replica nodes |
| No ISR member available with unclean elections disabled | unclean.leader.election.enable=false (default) and the only remaining broker in the ISR is down | kafka-topics.sh --describe --unavailable-partitions to see the ISR state |
| Controller unable to elect a leader | ControllerEventQueueSize is growing; elections are delayed or stalled | Active controller health and metadata store latency |
| Broker decommissioned without partition reassignment, or RF=1 broker down | Partitions still assigned to a broker ID that no longer exists, or a single-replica partition on a failed node | Topic replica assignment list against current live broker IDs |

Quick checks

Run these safe, read-only commands to orient yourself.

# List unavailable partitions and their ISR state
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
# Confirm which broker is the active controller
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999
# Read OfflinePartitionsCount on the controller broker
echo "get -b kafka.controller:type=KafkaController,name=OfflinePartitionsCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Inspect controller event queue depth
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Check for log directories taken offline due to disk errors
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999
# Verify broker processes are listening on the expected port (-l restricts output to listening sockets)
ss -tlnp | grep :9092

How to diagnose it

  1. Confirm controller authority. Verify that the broker where you are reading OfflinePartitionsCount reports ActiveControllerCount = 1. If the controller is down, the metric may be stale or unreachable, and the cluster cannot self-heal.
  2. Identify the victims. Run kafka-topics.sh --describe --unavailable-partitions. Record the topic, partition, replication factor, and the current ISR for each offline partition.
  3. Map replicas to brokers. Cross-reference the offline partition replicas with broker liveness. If all replica brokers are down, the cause is broker failure. If some replicas are up but not in the ISR, the ISR already shrank before the leader was lost.
  4. Check the controller pipeline. Query ControllerEventQueueSize (MBean kafka.controller:type=ControllerEventManager,name=EventQueueSize). If it is consistently above 100 or growing, the controller is backlogged and cannot process leader elections fast enough. Check LeaderElectionRateAndTimeMs to see whether elections are slowing down.
  5. Correlate with under-replication. A cluster-wide rise in UnderReplicatedPartitions that began before the offline event indicates a cascading replication failure. Identify the common follower that dropped out first; its disk or network metrics usually reveal the root cause.
  6. Inspect broker log directories. An OfflineLogDirectoryCount greater than zero means a broker shut down a disk due to I/O errors. Partitions on that path are unavailable even if the broker process is running.
  7. Determine if unclean election is enabled. If unclean.leader.election.enable=false (the default since Kafka 0.11.0.0) and no ISR member is alive, the partition will stay offline until an ISR replica recovers. If UncleanLeaderElectionsPerSec is nonzero, unclean elections are occurring somewhere in the cluster; each newly elected leader may be missing records that the previous leader had acknowledged.
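
Steps 2 and 3 can be combined into one pass over the `--describe` output. The sketch below is illustrative only: `classify_offline` is a hypothetical helper, the list of live broker IDs is assumed to come from your own liveness check, and the parser assumes the usual `Topic: ... Partition: ... Replicas: ... Isr: ...` line layout, which can vary between Kafka versions.

```shell
# classify_offline LIVE_IDS
# Reads partition lines from `kafka-topics.sh --describe` on stdin.
# LIVE_IDS is a comma-separated list of broker IDs known to be up.
classify_offline() {
  awk -v live="$1" '
    BEGIN { n = split(live, ids, ","); for (i = 1; i <= n; i++) alive[ids[i]] = 1 }
    /Partition:/ {
      topic = part = replicas = isr = ""
      for (i = 1; i < NF; i++) {
        val = $(i + 1)
        if (val ~ /:$/) val = ""          # next token is a label, so this value is empty
        if ($i == "Topic:")     topic = val
        if ($i == "Partition:") part = val
        if ($i == "Replicas:")  replicas = val
        if ($i == "Isr:")       isr = val
      }
      up = 0; up_isr = 0
      m = split(replicas, r, ",")
      for (j = 1; j <= m; j++) if (r[j] in alive) up++
      k = split(isr, s, ",")
      for (j = 1; j <= k; j++) if (s[j] in alive) up_isr++
      if (up == 0)          verdict = "all-replicas-down"
      else if (up_isr == 0) verdict = "live-replica-outside-isr"
      else                  verdict = "isr-member-alive-check-controller"
      printf "%s-%s rf=%d live=%d %s\n", topic, part, m, up, verdict
    }'
}

# Sample line standing in for real output; against a cluster you would pipe
# kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
printf '\tTopic: orders\tPartition: 0\tLeader: none\tReplicas: 1,2,3\tIsr: 3\n' | classify_offline "1,2"
```

The third verdict covers the case where a live ISR member exists but no leader was elected, which points toward the controller checks in step 4 rather than broker recovery.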

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| OfflinePartitionsCount | Direct count of partitions with no leader | Nonzero for more than 60 seconds outside of restarts |
| UnderReplicatedPartitions | Leading indicator that replication is degrading before leaders are lost | Nonzero and growing across the cluster |
| ActiveControllerCount | Without a controller, no leader elections can occur | Cluster-wide sum not equal to 1 |
| ControllerEventQueueSize | Backed-up queue delays elections and ISR updates | Consistently above 100 or growing without draining |
| UncleanLeaderElectionsPerSec | With unclean election enabled, a nonzero rate confirms an out-of-sync replica was promoted and data may have been silently lost; with it disabled, the same failure shows up as offline partitions instead | Any nonzero delta |
| LogManager OfflineLogDirectoryCount | Disk or filesystem failure removes partitions from service | Any nonzero value |
| IsrShrinksPerSec | Velocity of replicas leaving the ISR | Sustained nonzero outside maintenance windows |
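
The ActiveControllerCount row depends on a cluster-wide sum, so a per-broker reading is not enough on its own. As a rough sketch, assuming you have already collected one numeric value per broker (for example from the jmxterm query shown earlier), a hypothetical helper such as `controller_status` could interpret the sum:

```shell
# controller_status V1 V2 ...
# Each argument is one broker's ActiveControllerCount (0 or 1).
controller_status() {
  sum=0
  for v in "$@"; do
    sum=$((sum + v))
  done
  case $sum in
    1) echo ok ;;
    0) echo no-active-controller ;;
    *) echo multiple-controllers ;;
  esac
}

controller_status 0 1 0   # healthy three-broker cluster
controller_status 0 0 0   # no broker claims the controller role
```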

Fixes

Restore an ISR replica

If unclean.leader.election.enable=false (the safe default), the controller will only elect a leader from the ISR. The only way to recover without data loss is to bring a missing ISR member back online. Restart the failed broker and monitor IsrExpandsPerSec and UnderReplicatedPartitions as it catches up.

For partitions with replication.factor=1, the single broker is the only possible leader. Bring that broker back. If its storage is permanently lost, the partition data is unrecoverable. You must delete and recreate the topic or accept permanent data loss for that partition.

Resolve controller metadata backlog

If ControllerEventQueueSize is growing and offline partitions are piling up, do not restart additional brokers. Each restart generates more controller events and worsens the backlog. Check ZooKeeper request latency (in ZooKeeper mode) or KRaft quorum health (in KRaft mode). If the metadata store is degraded, the controller is bottlenecked by external latency. Once the metadata store recovers, monitor the queue drain rate. If the queue drains, allow the controller to work through the backlog before taking further action.
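
Whether the queue is draining is easier to judge from a short series of samples than from a single reading. The following sketch assumes you sample the queue size at a fixed interval and pass the values oldest first; `queue_state` and its labels are illustrative, and the threshold of 100 is the one used elsewhere in this guide.

```shell
# queue_state S1 S2 ...   (controller event queue samples, oldest first)
queue_state() {
  if [ "$#" -eq 0 ]; then
    echo unknown
    return
  fi
  prev=""; all_high=1; growing=1; count=0
  for v in "$@"; do
    count=$((count + 1))
    [ "$v" -gt 100 ] || all_high=0                      # "consistently above 100"
    if [ -n "$prev" ] && [ "$v" -le "$prev" ]; then     # any flat or falling step breaks the trend
      growing=0
    fi
    prev=$v
  done
  [ "$count" -ge 2 ] || growing=0                       # a single sample shows no trend
  if [ "$all_high" -eq 1 ] || [ "$growing" -eq 1 ]; then
    echo backlogged
  else
    echo draining-or-idle
  fi
}

queue_state 10 40 90      # rising steadily
queue_state 150 140 130   # falling but still above the threshold
queue_state 80 20 0       # draining
```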

Reassign partitions from a decommissioned broker

If a broker was removed without reassigning its partitions, those partitions may have no live replicas. Generate a partition reassignment JSON and execute it with kafka-reassign-partitions.sh. Verify progress with the --verify flag. Do not decommission brokers without first moving their partitions to remaining cluster members.
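
For a single partition, the reassignment file is small enough to generate by hand. The sketch below prints the JSON layout that kafka-reassign-partitions.sh accepts; `make_reassignment` is a hypothetical helper, and the topic name and broker IDs are placeholders.

```shell
# make_reassignment TOPIC PARTITION REPLICAS
# REPLICAS is a comma-separated list of target broker IDs, preferred leader first.
make_reassignment() {
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%s,"replicas":[%s]}]}\n' "$1" "$2" "$3"
}

make_reassignment orders 0 2,3,4

# Save the output to a file (for example reassign.json), then run:
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file reassign.json --execute
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file reassign.json --verify
```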

Force an unclean leader election (last resort)

If no ISR member can be restored and availability is more important than data consistency, set unclean.leader.election.enable=true. Applying it as a topic-level override limits the change to the affected topic and needs no broker restart; changing the value in server.properties requires a rolling restart. A non-ISR replica is then elected leader, and the other replicas truncate their logs to match it when they rejoin. Any data acknowledged by the previous leader but not replicated to the new leader is silently lost. Revert the setting to false immediately after recovery to prevent future data loss.
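
Because this step destroys data, it helps to print the exact commands before running them. The sketch below wraps the topic-level override in a dry-run helper; `run`, `enable_unclean`, `trigger_unclean`, `revert_unclean`, and the `BOOTSTRAP` and `DRY_RUN` variables are all illustrative names. The explicit trigger uses kafka-leader-election.sh, which is available in Kafka 2.4 and later; treat it as an optional step if the election does not happen on its own after the override is applied.

```shell
# Prints each command by default; set DRY_RUN=0 to execute for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

BOOTSTRAP=${BOOTSTRAP:-localhost:9092}

enable_unclean() {    # $1 = topic
  run kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --alter --entity-type topics \
    --entity-name "$1" --add-config unclean.leader.election.enable=true
}

trigger_unclean() {   # $1 = topic, $2 = partition
  run kafka-leader-election.sh --bootstrap-server "$BOOTSTRAP" --election-type unclean \
    --topic "$1" --partition "$2"
}

revert_unclean() {    # $1 = topic
  run kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --alter --entity-type topics \
    --entity-name "$1" --add-config unclean.leader.election.enable=false
}

enable_unclean orders
trigger_unclean orders 0
revert_unclean orders
```

Defaulting to dry-run means a copy-pasted invocation cannot cause data loss until someone deliberately sets `DRY_RUN=0`.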

Prevention

  • Avoid replication.factor=1 for critical topics. Losing the one broker that hosts a partition takes that partition offline, with no replica to promote.
  • Keep unclean.leader.election.enable=false unless your application explicitly tolerates data loss.
  • Decommission brokers only after reassigning partitions. Verify with kafka-reassign-partitions.sh --verify before removing the node.
  • Monitor UnderReplicatedPartitions and IsrShrinksPerSec. These provide early warning of the cascade that ends in offline partitions.
  • Watch controller queue depth and metadata store latency. A healthy controller should process events near zero queue depth in steady state.
  • Set min.insync.replicas=2 when using acks=all and replication.factor=3. This does not stop the ISR from shrinking, but it rejects acks=all writes once fewer than two replicas are in sync, so no acknowledged record ever lives on a single broker.
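
The first prevention item can be audited from the topic summary lines of `kafka-topics.sh --describe`. This is a minimal sketch that assumes each summary line carries `Topic:` and `ReplicationFactor:` labels, as current releases print them; `single_replica_topics` is a hypothetical helper name.

```shell
# single_replica_topics
# Reads `kafka-topics.sh --describe` output on stdin and prints every topic
# whose summary line reports ReplicationFactor: 1.
single_replica_topics() {
  awk '
    /ReplicationFactor:/ {
      topic = rf = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")             topic = $(i + 1)
        if ($i == "ReplicationFactor:") rf = $(i + 1)
      }
      if (rf == "1") print topic
    }'
}

# Sample input standing in for real output; against a cluster you would pipe
# kafka-topics.sh --bootstrap-server localhost:9092 --describe
printf 'Topic: audit\tPartitionCount: 3\tReplicationFactor: 1\tConfigs: \nTopic: orders\tPartitionCount: 6\tReplicationFactor: 3\tConfigs: \n' | single_replica_topics
```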

How Netdata helps

  • Correlates OfflinePartitionsCount with UnderReplicatedPartitions, ActiveControllerCount, and broker process liveness on one timeline to distinguish controller issues from broker failures.
  • Surfaces controller-only metrics automatically from the active controller node; non-controller nodes are omitted so you do not read stale zeros.
  • Alerts on ControllerEventQueueSize and JVM GC pauses to catch controller bottlenecks before they delay leader elections.
  • Tracks per-broker disk I/O latency and OfflineLogDirectoryCount to pinpoint hardware degradation that removes partitions from service.
  • Visualizes IsrShrinksPerSec and IsrExpandsPerSec to flag the replication collapse that precedes offline partitions.