Kafka OfflinePartitionsCount > 0: partitions with no leader and how to recover
When kafka.controller:type=KafkaController,name=OfflinePartitionsCount is nonzero, at least one partition has no active leader. Those partitions are completely unavailable: producers receive errors, consumers stall, and no data is written or read until a leader is elected. This is a data-plane outage.
This metric is only meaningful on the active controller. Non-controller brokers always report zero. If the controller itself is down, the metric may be stale or unreachable at the exact moment you need it. Brief spikes can occur during controller re-election or ungraceful broker shutdown, but any sustained nonzero value past 60 seconds is an active incident that requires immediate intervention.
Partitions go offline when the controller cannot elect a leader from the current In-Sync Replica set (ISR). The most common trigger is a broker failure that takes all replicas for a partition down, or an ISR that already shrank to a single leader which then fails. The path from healthy replication to offline partitions usually runs through UnderReplicatedPartitions, making under-replication the leading indicator and offline partitions the impact.
flowchart TD
A[Broker failure or degradation] --> B[Follower falls behind]
B --> C[ISR shrinks]
C --> D[UnderReplicatedPartitions rises]
D --> E[No ISR member available]
E --> F[Controller cannot elect leader]
F --> G[OfflinePartitionsCount > 0]

What this means
OfflinePartitionsCount measures partitions that currently have no leader. Without a leader, no broker is authorized to accept writes or serve reads for that partition. The controller elects a new leader from the ISR. If the ISR is empty or the controller cannot process the election, the partition stays offline.
The severity is PAGE if the metric is nonzero for more than 60 seconds and no broker or controller has an uptime below 600 seconds. That uptime gate filters out noise from cold starts and rolling restarts. During steady state, this metric must be zero.
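The paging rule above can be written down as a small decision function. This is a minimal sketch, not an alerting implementation: the function name `should_page` and its three inputs are hypothetical, and it assumes your metrics pipeline already supplies the current count, how long it has been nonzero, and the lowest uptime across brokers and controllers.

```shell
# should_page OFFLINE_COUNT NONZERO_SECONDS MIN_UPTIME_SECONDS
# Prints "page" when the rule from the text is met, otherwise "ok".
# The function name and argument order are illustrative assumptions.
should_page() (
  offline_count=$1       # OfflinePartitionsCount read on the active controller
  nonzero_seconds=$2     # how long the metric has been continuously nonzero
  min_uptime_seconds=$3  # lowest uptime across all brokers and controllers

  if [ "$offline_count" -gt 0 ] &&
     [ "$nonzero_seconds" -gt 60 ] &&
     [ "$min_uptime_seconds" -ge 600 ]; then
    echo page
  else
    echo ok
  fi
)
```

For example, `should_page 3 90 120` prints `ok` because a node restarted within the last 600 seconds, while `should_page 3 90 4000` prints `page`.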
Topics with replication.factor=1 are especially vulnerable. If the single broker hosting that partition goes down, there are no other replicas to promote. The partition remains offline until that specific broker returns.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| All replicas for a partition are on failed brokers | Multiple brokers unreachable; UnderReplicatedPartitions spiked just before the outage | Broker process liveness and network reachability on the replica nodes |
| No ISR member available with unclean elections disabled | unclean.leader.election.enable=false (default) and the only remaining broker in the ISR is down | kafka-topics.sh --describe --unavailable-partitions to see the ISR state |
| Controller unable to elect a leader | ControllerEventQueueSize is growing; elections are delayed or stalled | Active controller health and metadata store latency |
| Broker decommissioned without partition reassignment, or RF=1 broker down | Partitions still assigned to a broker ID that no longer exists, or a single-replica partition on a failed node | Topic replica assignment list against current live broker IDs |
Quick checks
Run these safe, read-only commands to orient yourself.
# List unavailable partitions and their ISR state
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
# Check whether this broker is the active controller (1 on the controller, 0 elsewhere)
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999
# Read OfflinePartitionsCount on the controller broker
echo "get -b kafka.controller:type=KafkaController,name=OfflinePartitionsCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Inspect controller event queue depth
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Check for log directories taken offline due to disk errors
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999
# Verify broker processes are listening on the expected port (-l limits output to listening sockets)
ss -tlnp | grep :9092
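The output of the first command can be condensed into one line per partition for an incident log. The sketch below assumes the tab-separated `Key: value` layout that `kafka-topics.sh --describe` prints for partition rows; field names or extra columns may differ between Kafka versions, and `summarize_offline` is a hypothetical helper name.

```shell
# summarize_offline: read `kafka-topics.sh --describe` partition rows on stdin
# and print "topic-partition leader=... replicas=[...] isr=[...]" for each.
# Assumes tab-separated "Key: value" fields; unknown fields are ignored.
summarize_offline() {
  awk -F'\t' '
    /Partition:/ {
      topic = ""; part = ""; leader = ""; replicas = ""; isr = ""
      for (i = 1; i <= NF; i++) {
        n = index($i, ":")
        if (n == 0) continue                 # skip empty or malformed fields
        key = substr($i, 1, n - 1)
        val = substr($i, n + 1)
        gsub(/^ +| +$/, "", key)             # trim surrounding spaces
        gsub(/^ +| +$/, "", val)
        if (key == "Topic")     topic = val
        if (key == "Partition") part = val
        if (key == "Leader")    leader = val
        if (key == "Replicas")  replicas = val
        if (key == "Isr")       isr = val
      }
      printf "%s-%s leader=%s replicas=[%s] isr=[%s]\n", topic, part, leader, replicas, isr
    }'
}
```

A typical use is to pipe the describe command into it: `kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions | summarize_offline`.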
How to diagnose it
- Confirm controller authority. Verify that the broker where you are reading OfflinePartitionsCount reports ActiveControllerCount = 1. If the controller is down, the metric may be stale or unreachable, and the cluster cannot self-heal.
- Identify the victims. Run kafka-topics.sh --describe --unavailable-partitions. Record the topic, partition, replication factor, and the current ISR for each offline partition.
- Map replicas to brokers. Cross-reference the offline partition replicas with broker liveness. If all replica brokers are down, the cause is broker failure. If some replicas are up but not in the ISR, the ISR already shrank before the leader was lost.
- Check the controller pipeline. Query ControllerEventQueueSize. If it is consistently above 100 or growing, the controller is backlogged and cannot process leader elections fast enough. Check LeaderElectionRateAndTimeMs for slowing election velocity.
- Correlate with under-replication. A cluster-wide rise in UnderReplicatedPartitions that began before the offline event indicates a cascading replication failure. Identify the common follower that dropped out first; its disk or network metrics usually reveal the root cause.
- Inspect broker log directories. An OfflineLogDirectoryCount greater than zero means a broker shut down a disk due to I/O errors. Partitions on that path are unavailable even if the broker process is running.
- Determine if unclean election is enabled. If unclean.leader.election.enable=false (the default since Kafka 0.11.0.0) and no ISR member is alive, the partition will stay offline until an ISR replica recovers. If UncleanLeaderElectionsPerSec is nonzero, unclean elections are occurring somewhere in the cluster; the newly elected leader may have truncated data.
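The "map replicas to brokers" step can be sketched as a classifier over three comma-separated lists: the partition's replicas, its ISR, and the broker IDs that are currently alive. The function names and output labels are hypothetical, and the list of live brokers is assumed to come from whatever liveness source you already trust.

```shell
# in_list ITEM CSV: succeeds when ITEM is one of the comma-separated values.
in_list() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# classify_partition REPLICAS ISR LIVE_BROKERS (each a comma-separated ID list)
# Output labels are illustrative, not Kafka terminology.
classify_partition() (
  replicas=$1; isr=$2; live=$3
  live_replicas=0; live_isr=0
  IFS=,                                   # split the replica list on commas
  for id in $replicas; do
    if in_list "$id" "$live"; then
      live_replicas=$((live_replicas + 1))
      if in_list "$id" "$isr"; then
        live_isr=$((live_isr + 1))
      fi
    fi
  done
  if [ "$live_replicas" -eq 0 ]; then
    echo all-replicas-down                # broker failure took every copy offline
  elif [ "$live_isr" -eq 0 ]; then
    echo live-replica-not-in-isr          # ISR shrank before the leader was lost
  else
    echo isr-member-alive-check-controller
  fi
)
```

The third label covers the case the list handles next: an ISR member is alive but no leader was elected, which points at the controller pipeline rather than at the brokers.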
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| OfflinePartitionsCount | Direct count of partitions with no leader | Nonzero for more than 60 seconds outside of restarts |
| UnderReplicatedPartitions | Leading indicator that replication is degrading before leaders are lost | Nonzero and growing across the cluster |
| ActiveControllerCount | Without a controller, no leader elections can occur | Cluster-wide sum not equal to 1 |
| ControllerEventQueueSize | Backed-up queue delays elections and ISR updates | Consistently above 100 or growing without draining |
| UncleanLeaderElectionsPerSec | Shows that an out-of-sync replica was elected leader, which implies possible data loss; with unclean election disabled, the same condition appears as offline partitions instead | Any nonzero delta |
| LogManager OfflineLogDirectoryCount | Disk or filesystem failure removes partitions from service | Any nonzero value |
| IsrShrinksPerSec | Velocity of replicas leaving the ISR | Sustained nonzero outside maintenance windows |
Fixes
Restore an ISR replica
If unclean.leader.election.enable=false (the safe default), the controller will only elect a leader from the ISR. The only way to recover without data loss is to bring a missing ISR member back online. Restart the failed broker and monitor IsrExpandsPerSec and UnderReplicatedPartitions as it catches up.
For partitions with replication.factor=1, the single broker is the only possible leader. Bring that broker back. If its storage is permanently lost, the partition data is unrecoverable. You must delete and recreate the topic or accept permanent data loss for that partition.
Resolve controller metadata backlog
If ControllerEventQueueSize is growing and offline partitions are piling up, do not restart additional brokers. Each restart generates more controller events and worsens the backlog. Check ZooKeeper request latency (in ZooKeeper mode) or KRaft quorum health (in KRaft mode). If the metadata store is degraded, the controller is bottlenecked by external latency. Once the metadata store recovers, monitor the queue drain rate. If the queue drains, allow the controller to work through the backlog before taking further action.
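To judge whether the queue is draining, compare a handful of samples taken a few seconds apart. The sketch below is deliberately crude: it only compares the first and last sample, and `queue_trend` is a hypothetical helper. The samples are assumed to be integers you collected yourself, for instance with the jmxterm query for EventQueueSize shown under Quick checks.

```shell
# queue_trend SAMPLE...: classify a series of queue-depth readings.
# Compares only the first and last sample; requires at least one argument.
queue_trend() (
  first=$1
  for last in "$@"; do :; done   # after the loop, $last holds the final sample
  if [ "$last" -lt "$first" ]; then
    echo draining
  elif [ "$last" -gt "$first" ]; then
    echo growing
  else
    echo flat
  fi
)
```

If the result is `draining`, the guidance above applies: leave the controller alone and let it work through the backlog.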
Reassign partitions from a decommissioned broker
If a broker was removed without reassigning its partitions, those partitions may have no live replicas. Generate a partition reassignment JSON and execute it with kafka-reassign-partitions.sh. Verify progress with the --verify flag. Do not decommission brokers without first moving their partitions to remaining cluster members.
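A reassignment file for a single partition can be generated as follows. This is a sketch under assumptions: `reassignment_json` is a hypothetical helper, the topic name and broker IDs in the usage comments are placeholders, and the target replica list is something you choose from the brokers that are still alive. Note that a reassignment only changes where replicas live; it does not bring back data that existed solely on the removed broker.

```shell
# reassignment_json TOPIC PARTITION REPLICA_IDS_CSV
# Prints a version-1 reassignment document for kafka-reassign-partitions.sh.
reassignment_json() {
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%s,"replicas":[%s]}]}\n' \
    "$1" "$2" "$3"
}

# Example workflow (placeholders: topic "orders", brokers 4,5,6):
#   reassignment_json orders 0 4,5,6 > reassign.json
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file reassign.json --execute
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file reassign.json --verify
```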
Force an unclean leader election (last resort)
If no ISR member can be restored and availability is more important than data consistency, set unclean.leader.election.enable=true. Applying it as a topic-level override avoids a broker restart; changing the cluster-wide default in server.properties requires a rolling restart. A non-ISR replica is then elected leader and its log becomes the authoritative copy: the other replicas, including the previous leader if it returns, truncate their logs to match it. Any data acknowledged by the previous leader but not replicated to the new leader is silently lost. Revert the setting to false immediately after recovery to prevent future data loss.
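Because this step discards data, it helps to print the exact commands and review them before running anything. The sketch below only echoes a plan; `unclean_election_plan` is a hypothetical helper and the topic, partition, and bootstrap address are placeholders. It assumes a topic-level override via `kafka-configs.sh`, an explicit trigger via `kafka-leader-election.sh --election-type unclean` (available in Kafka 2.4 and later), and removal of the override afterwards, which returns the topic to the cluster default.

```shell
# unclean_election_plan BOOTSTRAP TOPIC PARTITION
# Prints (does not run) the three commands for a topic-scoped unclean election.
unclean_election_plan() (
  bootstrap=$1; topic=$2; partition=$3
  cat <<EOF
kafka-configs.sh --bootstrap-server $bootstrap --alter --entity-type topics --entity-name $topic --add-config unclean.leader.election.enable=true
kafka-leader-election.sh --bootstrap-server $bootstrap --election-type unclean --topic $topic --partition $partition
kafka-configs.sh --bootstrap-server $bootstrap --alter --entity-type topics --entity-name $topic --delete-config unclean.leader.election.enable
EOF
)
```

Running `unclean_election_plan localhost:9092 orders 0` prints the three lines for review. Execute them one at a time, and confirm the partition has a leader before removing the override.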
Prevention
- Avoid replication.factor=1 for critical topics. The failure of the one broker that hosts a partition guarantees an outage for that partition.
- Keep unclean.leader.election.enable=false unless your application explicitly tolerates data loss.
- Decommission brokers only after reassigning partitions. Verify with kafka-reassign-partitions.sh --verify before removing the node.
- Monitor UnderReplicatedPartitions and IsrShrinksPerSec. These provide early warning of the cascade that ends in offline partitions.
- Watch controller queue depth and metadata store latency. A healthy controller should process events near zero queue depth in steady state.
- Set min.insync.replicas=2 when using acks=all and replication.factor=3. This does not stop the ISR from shrinking, but it rejects acks=all writes once the ISR falls below two members, so every acknowledged write exists on at least two replicas and a single broker failure cannot take the only up-to-date copy offline.
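The first item in this list can be audited from the topic summary rows of `kafka-topics.sh --describe`. The sketch below assumes those rows are tab-separated and carry a `ReplicationFactor` field, which may vary by Kafka version; `single_replica_topics` is a hypothetical helper name.

```shell
# single_replica_topics: read `kafka-topics.sh --describe` output on stdin
# and print the name of every topic whose ReplicationFactor is 1.
single_replica_topics() {
  awk -F'\t' '
    /ReplicationFactor:/ {
      topic = ""; rf = ""
      for (i = 1; i <= NF; i++) {
        n = index($i, ":")
        if (n == 0) continue               # skip empty or malformed fields
        key = substr($i, 1, n - 1)
        val = substr($i, n + 1)
        gsub(/^ +| +$/, "", key)           # trim surrounding spaces
        gsub(/^ +| +$/, "", val)
        if (key == "Topic")             topic = val
        if (key == "ReplicationFactor") rf = val
      }
      if (rf == "1") print topic
    }'
}
```

Piping `kafka-topics.sh --bootstrap-server localhost:9092 --describe` into this function gives a list of topics to review against your criticality requirements.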
How Netdata helps
- Correlates OfflinePartitionsCount with UnderReplicatedPartitions, ActiveControllerCount, and broker process liveness on one timeline to distinguish controller issues from broker failures.
- Surfaces controller-only metrics automatically from the active controller node; non-controller nodes are omitted so you do not read stale zeros.
- Alerts on ControllerEventQueueSize and JVM GC pauses to catch controller bottlenecks before they delay leader elections.
- Tracks per-broker disk I/O latency and OfflineLogDirectoryCount to pinpoint hardware degradation that removes partitions from service.
- Visualizes IsrShrinksPerSec and IsrExpandsPerSec to flag the replication collapse that precedes offline partitions.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka monitoring checklist: the signals every production cluster needs
- Kafka monitoring maturity model: from survival to expert
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas
- Kafka UnderMinIsrPartitionCount: confirming the write path is blocked
- Kafka UnderReplicatedPartitions > 0: the most important metric and how to clear it







