Kafka UnderReplicatedPartitions > 0: the most important metric and how to clear it
UnderReplicatedPartitions climbing above zero means a follower has fallen behind, the ISR has shrunk, and the cluster’s durability guarantee is degraded. One more failure on the wrong broker could make partitions unavailable or cause acks=all writes to be rejected. Determine whether this is a transient blip from maintenance or the start of a cascading replication failure.
What this means
The JMX MBean kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions reports the number of partitions where this broker is the leader and the ISR count is below the configured replication factor. It is a per-broker gauge. A broker without leadership always reports zero, even when the cluster is degraded, so aggregate across all brokers to see the full picture.
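As a minimal sketch of that aggregation, the function below sums per-broker readings into a cluster total and lists which leaders are reporting. The input format (one `<broker_id> <value>` pair per line) and the function name `cluster_urp` are assumptions for illustration; use whatever your collector emits after reading the gauge from each broker.

```shell
# Sum per-broker UnderReplicatedPartitions readings into one cluster view.
# stdin: one "<broker_id> <value>" pair per line (hypothetical format).
cluster_urp() {
  awk '
    NF >= 2 {
      total += $2
      # Remember every broker that reports a nonzero value as a leader.
      if ($2 > 0) leaders = leaders (leaders != "" ? "," : "") $1
    }
    END {
      printf "total=%d reporting_leaders=%s\n", total, (leaders != "" ? leaders : "none")
    }
  '
}
```

A broker missing from `reporting_leaders` is not necessarily healthy: it may be the lagging follower behind the values the other brokers report.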
When this metric is nonzero, a follower replica is not keeping up. The leader still accepts writes, but if the leader dies before the follower catches up, unreplicated data is at risk. If the ISR drops below min.insync.replicas, producers using acks=all receive NotEnoughReplicasException.
UnderReplicatedPartitions is a gauge, so alert on its Value. For meter-type metrics such as IsrShrinksPerSec, alert on OneMinuteRate or on the delta of Count, because the raw Count attribute is cumulative over the broker lifetime. During rolling restarts or partition reassignment, transient under-replication is expected and should resolve within roughly one to two times replica.lag.time.max.ms after the broker returns.
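The delta of a cumulative Count between two scrapes could be computed roughly as follows. The function name `count_delta` is hypothetical. The sketch assumes integer counts and assumes that a counter going backwards means the broker restarted and the meter began again from zero.

```shell
# Delta of a cumulative meter Count between two scrapes.
# Args: <previous_count> <current_count>
count_delta() {
  local prev=$1 curr=$2
  if [ "$curr" -lt "$prev" ]; then
    # Counter went backwards: assume a broker restart reset it to zero,
    # so everything counted so far happened since the restart.
    echo "$curr"
  else
    echo $((curr - prev))
  fi
}
```

Alerting on this value instead of the raw Count keeps a long-lived broker with historical ISR shrinks from looking permanently unhealthy.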
flowchart TD
A[Producer writes to leader] --> B[Leader appends to log]
B --> C[Follower fetches via ReplicaFetcher]
C --> D{Follower keeps up?}
D -->|Yes| E[ISR remains full]
D -->|No| F[ISR shrinks]
F --> G[Leader reports URP > 0]
G --> H[Durability window open]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Follower disk I/O saturation | URP is reported by leaders whose followers sit on one specific broker; disk await is high on that follower | iostat -xz 1 on the follower broker |
| Network partition or degradation | URP appears suddenly for replicas on one follower; leader-side FetchFollower latency jumps | Network connectivity and FetchFollower p99 on the leader |
| Follower broker overload or GC | URP with low RequestHandlerAvgIdlePercent or long GC pauses on the follower | jstat -gcutil and per-broker CPU saturation |
| Unbalanced partition or leadership assignment | One broker carries disproportionate PartitionCount or LeaderCount, concentrating replication load | PartitionCount and LeaderCount per broker |
| Rolling restart or reassignment | Transient URP proportional to partitions on the affected broker; resolves quickly | Broker uptime and whether kafka-reassign-partitions.sh --verify --reassignment-json-file <reassignment-json> shows active movement |
Quick checks
# List under-replicated partitions interactively
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Read URP via JMX on the local broker
echo "get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value" | java -jar jmxterm.jar -l localhost:9999
# Check disk latency on the suspect broker
iostat -xz 1
# Check JVM GC pressure on the suspect broker
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
# Check open file descriptor usage
ls /proc/$(pgrep -f kafka.Kafka)/fd | wc -l
cat /proc/$(pgrep -f kafka.Kafka)/limits | grep "Max open files"
# Check ISR shrink velocity on the leader
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Confirm the cluster has an active controller
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999
# Count TCP connections owned by the broker process (ss needs sufficient privileges to show process info)
ss -tnp | grep -c "pid=$(pgrep -f kafka.Kafka),"
How to diagnose it
- Aggregate UnderReplicatedPartitions across all brokers. Because the metric is leader-side only, a broker with zero URP may still be the lagging follower causing URP on other nodes.
- Check for transient causes. If a restarted broker has uptime under 600 seconds, or if a partition reassignment is active (kafka-reassign-partitions.sh --verify --reassignment-json-file <reassignment-json>), expect transient URP. Monitor for resolution within roughly one to two times replica.lag.time.max.ms.
- Identify the lagging follower. kafka-topics.sh --describe --under-replicated-partitions shows the leader and ISR for each partition. The broker listed in Replicas but missing from ISR is the target.
- Inspect follower health. On the target broker, check disk I/O await, JVM GC pause duration, and RequestHandlerAvgIdlePercent. A saturated follower cannot replicate fast enough.
- Inspect leader health. On the leader, check FetchFollower request latency. High follower-fetch latency on the leader can mean the leader itself is too slow to serve replication traffic, even if the follower is healthy.
- Check for ISR flapping. Compare IsrShrinksPerSec with IsrExpandsPerSec. Sustained shrinks without matching expands indicate stable degradation. Sustained activity in both metrics indicates an intermittent problem such as periodic GC pauses or bursty traffic.
- Validate controller health. If ActiveControllerCount summed across the cluster is not exactly one, metadata changes may stall and URP can persist even after the follower recovers.
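The "identify the lagging follower" step can be sketched as a small filter that tallies, per broker, how many partitions list it in Replicas but not in Isr. This assumes each partition line of the describe output carries `Replicas: a,b,c` and `Isr: a,b` as whitespace-separated fields; check this against the output of your Kafka version. The function name `lagging_followers` is hypothetical.

```shell
# Tally brokers that appear in Replicas but are missing from Isr.
# stdin: partition lines from `kafka-topics.sh --describe` (assumed format).
# stdout: "<broker_id> <missing_partition_count>", worst offender first.
# Example use (command taken from the quick checks above):
#   kafka-topics.sh --bootstrap-server localhost:9092 --describe \
#     --under-replicated-partitions | lagging_followers
lagging_followers() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      if (replicas == "") next            # not a partition line
      n = split(replicas, r, ",")
      for (j = 1; j <= n; j++)
        # Comma-wrap both sides so broker 1 does not match broker 12.
        if (index("," isr ",", "," r[j] ",") == 0) missing[r[j]]++
    }
    END { for (b in missing) printf "%s %d\n", b, missing[b] }
  ' | sort -k2,2nr -k1,1n
}
```

If one broker dominates the tally, start the follower-health checks there. An even spread across brokers points more toward a leader-side or network-wide problem.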
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| UnderReplicatedPartitions | Leading indicator that a follower has fallen behind | Nonzero outside maintenance windows |
| UnderMinIsrPartitionCount | Confirms producers with acks=all are being rejected | > 0 for more than 2 minutes with stable broker uptime |
| IsrShrinksPerSec | Velocity of durability loss; active degradation | Sustained > 0 for more than 5 minutes |
| IsrExpandsPerSec | Recovery velocity; paired with shrinks indicates flapping | Sustained alongside IsrShrinksPerSec |
| FetchFollower latency | Leader-side view of how slowly it serves replication | p99 exceeding 50% of replica.lag.time.max.ms |
| RequestHandlerAvgIdlePercent | Broker thread saturation; low idle means requests pile up | Sustained below 0.3 |
| Disk I/O await | Root-cause signal for follower slowness | SSD above 20 ms or HDD above 50 ms sustained |
| ActiveControllerCount | Metadata plane must be healthy to process ISR changes | Cluster-wide sum not equal to 1 for more than 30 seconds |
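Two of the warning signs above can be turned into simple checks, shown here as a sketch rather than a finished alerting rule. The function names, the labels, and the `0.001` noise floor for rates are assumptions. The "sustained" part of each condition (evaluating over several minutes) is left to the caller.

```shell
# Classify ISR churn from shrink and expand rates (e.g. OneMinuteRate values).
# Args: <shrinks_per_sec> <expands_per_sec> [noise_floor, default 0.001]
classify_isr_churn() {
  awk -v s="$1" -v e="$2" -v eps="${3:-0.001}" 'BEGIN {
    if (s + 0 <= eps)      print "no-shrinks"
    else if (e + 0 > eps)  print "flapping"            # shrinks and expands together
    else                   print "stable-degradation"  # shrinks without expands
  }'
}

# Succeeds (exit 0) when FetchFollower p99 exceeds 50% of replica.lag.time.max.ms.
# Args: <p99_ms> <replica_lag_time_max_ms>
fetch_follower_slow() {
  awk -v p="$1" -v l="$2" 'BEGIN { exit !(p + 0 > 0.5 * l) }'
}
```

For example, `fetch_follower_slow 16000 30000 && echo warn` prints `warn`, because 16000 ms is above half of a 30000 ms lag limit.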
Fixes
Transient states
If URP appeared during a rolling restart or reassignment, wait. Do not restart additional brokers while URP is recovering. Verify reassignment status using kafka-reassign-partitions.sh --verify --reassignment-json-file <reassignment-json>. If broker uptime is under 600 seconds, give the broker time to rebuild its page cache and catch up on replication.
Follower disk or network saturation
When one broker is the common follower across most URP entries, treat that broker as the problem. Check disk await, NIC errors, and packet loss on that node. A controlled shutdown of a failing leader triggers clean leader elections and is safer than an uncontrolled crash. Shutting down a follower does not reduce URP; fix the underlying disk or network saturation first. If the broker is terminally degraded, reassign its partitions before removing it. Tradeoff: removing a broker generates controller events and transient leadership movement.
Hot-spot broker or partition imbalance
If PartitionCount or LeaderCount is severely skewed, the overloaded broker may lack the threads or disk I/O to serve replication traffic. Rebalance leadership or move partitions using kafka-reassign-partitions.sh --execute --reassignment-json-file <reassignment-json>. Tradeoff: rebalancing consumes network and disk I/O and will itself cause transient under-replication until it completes.
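Before rebalancing, it helps to quantify the skew. The sketch below flags brokers whose PartitionCount or LeaderCount deviates from the cluster mean by more than a given percentage, defaulting to the 20% figure this guide treats as a leading indicator. The input format (`<broker_id> <count>` per line) and the function name `skewed_brokers` are assumptions.

```shell
# Flag brokers whose count deviates from the cluster mean by more than N percent.
# stdin: one "<broker_id> <count>" pair per line (hypothetical format).
# Arg:   tolerated deviation in percent (default 20).
# stdout: "<broker_id> <count> <signed deviation>%" for each outlier, in input order.
skewed_brokers() {
  awk -v pct="${1:-20}" '
    NF >= 2 { n++; id[n] = $1; cnt[n] = $2; sum += $2 }
    END {
      if (n == 0 || sum == 0) exit
      mean = sum / n
      for (k = 1; k <= n; k++) {
        dev = (cnt[k] - mean) / mean * 100
        if (dev > pct || dev < -pct)
          printf "%s %d %+.1f%%\n", id[k], cnt[k], dev
      }
    }
  '
}
```

Run it once with LeaderCount and once with PartitionCount. A broker that is an outlier only on LeaderCount can often be fixed by rebalancing leadership, which is cheaper than moving partitions.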
JVM pressure on the follower
Long GC pauses on a follower can push it out of ISR repeatedly. If heap after GC is consistently above 80% or Full GC pauses exceed several seconds, reduce heap pressure. Kafka brokers typically run with modest heaps; oversized heaps cause longer pauses that directly impact replication. If down-conversion is causing heap pressure, upgrading client message formats or isolating legacy consumers will help.
Prevention
- Alert on UnderReplicatedPartitions as a TICKET for any nonzero value outside maintenance. Escalate to PAGE only when UnderMinIsrPartitionCount is also rising, no broker has uptime under 600 seconds, and the condition persists for more than 5 minutes.
- Enforce min.insync.replicas at the topic or broker level so that ISR shrink becomes visible to producers as NotEnoughReplicasException.
- Monitor partition and leadership balance per broker. A deviation greater than 20% from the cluster mean is a leading indicator of hot-spot overload.
- Leave disk headroom and use dedicated volumes for Kafka. Disk saturation is the most common root cause of follower lag.
- Keep JVM heap conservative and monitor GC pause duration. A follower that pauses for longer than replica.lag.time.max.ms will drop out of ISR.
- Run game-day failures. Measure how long URP lasts after a controlled broker shutdown so your paging thresholds account for normal recovery time.
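The TICKET and PAGE rule from the first bullet could be encoded along these lines. This is one possible reading of that rule, not a definitive policy: the function name, the argument order, and the choice to let a maintenance flag suppress only the ticket are all assumptions.

```shell
# Decide an alert level from the escalation rule above.
# Args: <cluster_urp> <under_min_isr_delta> <min_broker_uptime_s>
#       <condition_age_s> [in_maintenance: 0|1, default 0]
urp_alert_level() {
  local urp=$1 umi_delta=$2 min_uptime=$3 age=$4 maint=${5:-0}
  if [ "$urp" -le 0 ]; then
    echo NONE
  elif [ "$umi_delta" -gt 0 ] && [ "$min_uptime" -ge 600 ] && [ "$age" -gt 300 ]; then
    # UnderMinIsr rising, no recently restarted broker, lasting over 5 minutes.
    echo PAGE
  elif [ "$maint" -eq 1 ]; then
    echo NONE
  else
    echo TICKET
  fi
}
```

For example, `urp_alert_level 3 2 120 301` prints `TICKET` rather than `PAGE`, because a broker with 120 seconds of uptime suggests the cluster is still recovering from a restart.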
How Netdata helps
- Correlates per-broker UnderReplicatedPartitions with disk I/O latency, CPU, and memory pressure on the same node to identify whether the leader or the follower is the bottleneck.
- Surfaces IsrShrinksPerSec and IsrExpandsPerSec alongside URP to distinguish stable degradation from ISR flapping caused by intermittent GC or network blips.
- Supports alerting on delta of cumulative JMX counters instead of raw Count, reducing false positives during restarts.
- Overlays JVM GC metrics with Kafka request-latency breakdowns to expose heap-pressure lag versus disk lag.
- Aggregates URP across all brokers into a cluster-level view, removing blind spots from leader-only reporting.







