Kafka UnderReplicatedPartitions > 0: the most important metric and how to clear it

UnderReplicatedPartitions climbing above zero means a follower has fallen behind, the ISR has shrunk, and the cluster’s durability guarantee is degraded. One more failure on the wrong broker could make partitions unavailable or cause acks=all writes to be rejected. Determine whether this is a transient blip from maintenance or the start of a cascading replication failure.

What this means

The JMX MBean kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions reports the number of partitions where this broker is the leader and the ISR count is below the configured replication factor. It is a per-broker gauge. A broker without leadership always reports zero, even when the cluster is degraded, so aggregate across all brokers to see the full picture.
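Because each broker only reports partitions it leads, the cluster-level number is the sum of the per-broker gauges. A minimal sketch, assuming the per-broker values have already been collected into a mapping of broker ID to gauge value (the `readings` data and the function name are hypothetical):

```python
from typing import Mapping


def cluster_urp(per_broker_urp: Mapping[int, int]) -> int:
    """Sum the leader-side UnderReplicatedPartitions gauge across all brokers."""
    return sum(per_broker_urp.values())


# Hypothetical readings keyed by broker ID. Broker 3 reports zero,
# yet it may still be the lagging follower behind the other brokers' URP.
readings = {1: 4, 2: 2, 3: 0}

total = cluster_urp(readings)
reporting_leaders = sorted(b for b, v in readings.items() if v > 0)

print(f"cluster URP = {total}, reported by brokers {reporting_leaders}")
```

A zero on one broker says nothing about that broker's health as a follower, which is why the sum and the list of reporting leaders are both worth keeping.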

When this metric is nonzero, a follower replica is not keeping up. The leader still accepts writes, but if the leader dies before the follower catches up, unreplicated data is at risk. If the ISR drops below min.insync.replicas, producers using acks=all receive NotEnoughReplicasException.
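The relationship between ISR size, replication factor, and min.insync.replicas can be sketched as a small classifier. This is an illustration of the rules described above, not broker code; the function name and state labels are made up for the example:

```python
def partition_state(isr_size: int, replication_factor: int, min_insync_replicas: int) -> str:
    """Classify a partition from its current ISR size."""
    if isr_size < min_insync_replicas:
        # Producers using acks=all receive NotEnoughReplicasException here.
        return "under-min-isr"
    if isr_size < replication_factor:
        # Leader still accepts writes, but durability is degraded.
        return "under-replicated"
    return "healthy"


# Example: replication factor 3, min.insync.replicas 2
for isr in (3, 2, 1):
    print(isr, partition_state(isr, replication_factor=3, min_insync_replicas=2))
```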

UnderReplicatedPartitions is a gauge, so alert on its Value attribute. For meter-type companions such as IsrShrinksPerSec, alert on OneMinuteRate or on the delta of Count rather than on raw Count, because Count is cumulative over the broker lifetime. During rolling restarts or partition reassignment, transient under-replication is expected and should resolve within roughly one to two times replica.lag.time.max.ms after the broker returns.
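Alerting on the delta of a cumulative Count, for example the Count attribute of IsrShrinksPerSec, might look like the sketch below. It assumes the counter restarts from zero when the broker process restarts, so a reading lower than the previous sample is treated as a reset instead of a negative delta. The function name is illustrative:

```python
def counter_delta(previous: int, current: int) -> int:
    """Events observed since the last sample of a cumulative counter."""
    if current < previous:
        # Assumption: the broker restarted and the counter began again at zero,
        # so everything counted so far happened since the restart.
        return current
    return current - previous


# Hypothetical samples taken one scrape interval apart
print(counter_delta(100, 104))  # normal growth between samples
print(counter_delta(100, 3))    # broker restarted between samples
```

Comparing deltas rather than raw Count keeps a long-lived broker with a large historical total from looking worse than a freshly restarted one.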

flowchart TD
    A[Producer writes to leader] --> B[Leader appends to log]
    B --> C[Follower fetches via ReplicaFetcher]
    C --> D{Follower keeps up?}
    D -->|Yes| E[ISR remains full]
    D -->|No| F[ISR shrinks]
    F --> G[Leader reports URP > 0]
    G --> H[Durability window open]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Follower disk I/O saturation | URP is reported by leaders whose followers sit on one specific broker; disk await is high on that follower | iostat -xz 1 on the follower broker |
| Network partition or degradation | URP appears suddenly for replicas on one follower; leader-side FetchFollower latency jumps | Network connectivity and FetchFollower p99 on the leader |
| Follower broker overload or GC | URP with low RequestHandlerAvgIdlePercent or long GC pauses on the follower | jstat -gcutil and per-broker CPU saturation |
| Unbalanced partition or leadership assignment | One broker carries disproportionate PartitionCount or LeaderCount, concentrating replication load | PartitionCount and LeaderCount per broker |
| Rolling restart or reassignment | Transient URP proportional to partitions on the affected broker; resolves quickly | Broker uptime and whether kafka-reassign-partitions.sh --verify --reassignment-json-file <reassignment-json> shows active movement |

Quick checks

# List under-replicated partitions with their leader, replicas, and ISR
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Read URP via JMX on the local broker
echo "get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value" | java -jar jmxterm.jar -l localhost:9999 -n

# Check disk latency on the suspect broker
iostat -xz 1

# Check JVM GC pressure on the suspect broker
jstat -gcutil $(pgrep -f kafka.Kafka) 1000

# Check open file descriptor usage
ls /proc/$(pgrep -f kafka.Kafka)/fd | wc -l
grep "Max open files" /proc/$(pgrep -f kafka.Kafka)/limits

# Check ISR shrink velocity on the leader
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n

# Confirm the cluster has an active controller
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999 -n

# Check active TCP connections to the broker
ss -tnp | grep "pid=$(pgrep -f kafka.Kafka)," | wc -l

How to diagnose it

  1. Aggregate UnderReplicatedPartitions across all brokers. Because the metric is leader-side only, a broker with zero URP may still be the lagging follower causing URP on other nodes.
  2. Check for transient causes. If a restarted broker has uptime under 600 seconds, or if a partition reassignment is active (kafka-reassign-partitions.sh --verify --reassignment-json-file <reassignment-json>), expect transient URP. Monitor for resolution within roughly one to two times replica.lag.time.max.ms.
  3. Identify the lagging follower. kafka-topics.sh --describe --under-replicated-partitions shows the leader and ISR for each partition. The broker listed in Replicas but missing from ISR is the target.
  4. Inspect follower health. On the target broker, check disk I/O await, JVM GC pause duration, and RequestHandlerAvgIdlePercent. A saturated follower cannot replicate fast enough.
  5. Inspect leader health. On the leader, check FetchFollower request latency. High follower-fetch latency on the leader can mean the leader itself is too slow to serve replication traffic, even if the follower is healthy.
  6. Check for ISR flapping. Compare IsrShrinksPerSec with IsrExpandsPerSec. Sustained shrinks without matching expands indicate stable degradation. Sustained activity in both metrics indicates an intermittent problem such as periodic GC pauses or bursty traffic.
  7. Validate controller health. If ActiveControllerCount summed across the cluster is not exactly one, metadata changes may stall and URP can persist even after the follower recovers.
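Step 3 can be sketched as a small parser over the text printed by kafka-topics.sh --describe. The sketch assumes each partition line contains `Replicas: <ids>` followed by `Isr: <ids>` as comma-separated broker IDs; the exact layout can vary between Kafka versions, and the sample text and function name are hypothetical:

```python
import re
from collections import Counter

# Assumed line shape: "... Replicas: 1,2,3  Isr: 1,2 ..."
PARTITION_LINE = re.compile(r"Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)")


def lagging_followers(describe_output: str) -> Counter:
    """Count, per broker ID, the partitions where it is a replica but not in the ISR."""
    missing = Counter()
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue  # skip topic summary lines and anything unrecognized
        replicas = {int(b) for b in match.group(1).split(",") if b}
        isr = {int(b) for b in match.group(2).split(",") if b}
        missing.update(replicas - isr)
    return missing


# Hypothetical output for illustration only
sample = (
    "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2\n"
    "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,1\n"
    "\tTopic: audit\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1,2\n"
)

result = lagging_followers(sample)
print(result.most_common())
```

If one broker ID dominates the counts, that broker is the follower to inspect in step 4. If the missing replicas are spread across many brokers, look at the leaders and the network instead.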

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| UnderReplicatedPartitions | Leading indicator that a follower has fallen behind | Nonzero outside maintenance windows |
| UnderMinIsrPartitionCount | Confirms producers with acks=all are being rejected | > 0 for more than 2 minutes with stable broker uptime |
| IsrShrinksPerSec | Velocity of durability loss; active degradation | Sustained > 0 for more than 5 minutes |
| IsrExpandsPerSec | Recovery velocity; paired with shrinks indicates flapping | Sustained alongside IsrShrinksPerSec |
| FetchFollower latency | Leader-side view of how slowly it serves replication | p99 exceeding 50% of replica.lag.time.max.ms |
| RequestHandlerAvgIdlePercent | Broker thread saturation; low idle means requests pile up | Sustained below 0.3 |
| Disk I/O await | Root-cause signal for follower slowness | SSD above 20 ms or HDD above 50 ms sustained |
| ActiveControllerCount | Metadata plane must be healthy to process ISR changes | Cluster-wide sum not equal to 1 for more than 30 seconds |
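Three of the numeric warning signs above can be checked mechanically. The sketch below evaluates a single sample only; it does not model the "sustained" qualifier, which would require a time window on top. The function and parameter names are illustrative, not part of any Kafka or Netdata API:

```python
from typing import List


def warning_signs(
    fetch_follower_p99_ms: float,
    replica_lag_time_max_ms: float,
    handler_idle: float,
    disk_await_ms: float,
    disk_is_ssd: bool,
) -> List[str]:
    """Return the names of the signals whose warning threshold is crossed."""
    signs = []
    if fetch_follower_p99_ms > 0.5 * replica_lag_time_max_ms:
        signs.append("FetchFollower latency")
    if handler_idle < 0.3:
        signs.append("RequestHandlerAvgIdlePercent")
    await_limit_ms = 20 if disk_is_ssd else 50
    if disk_await_ms > await_limit_ms:
        signs.append("Disk I/O await")
    return signs


# Hypothetical sample with replica.lag.time.max.ms set to 30000
print(warning_signs(16000, 30000, handler_idle=0.25, disk_await_ms=12, disk_is_ssd=True))
```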

Fixes

Transient states

If URP appeared during a rolling restart or reassignment, wait. Do not restart additional brokers while URP is recovering. Verify reassignment status using kafka-reassign-partitions.sh --verify --reassignment-json-file <reassignment-json>. If broker uptime is under 600 seconds, give the broker time to rebuild its page cache and catch up on replication.

Follower disk or network saturation

When one broker is the common follower across most URP entries, treat that broker as the problem. Check disk await, NIC errors, and packet loss on that node, and fix the underlying disk or network saturation first: shutting down a lagging follower does not reduce URP, because its replicas remain out of the ISR. If the degraded broker also leads partitions, a controlled shutdown triggers clean leader elections and is safer than an uncontrolled crash. If the broker is terminally degraded, reassign its partitions before removing it. Tradeoff: removing a broker generates controller events and transient leadership movement.

Hot-spot broker or partition imbalance

If PartitionCount or LeaderCount is severely skewed, the overloaded broker may lack the threads or disk I/O to serve replication traffic. Rebalance leadership or move partitions using kafka-reassign-partitions.sh --execute --reassignment-json-file <reassignment-json>. Tradeoff: rebalancing consumes network and disk I/O and will itself cause transient under-replication until it completes.
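One way to quantify "severely skewed" is to compare each broker's count with the cluster mean, using the 20% deviation mentioned under Prevention as the cutoff. A minimal sketch, with hypothetical counts and an illustrative function name:

```python
from statistics import mean
from typing import Dict, Mapping


def skewed_brokers(counts: Mapping[int, int], tolerance: float = 0.20) -> Dict[int, float]:
    """Return brokers whose count deviates from the cluster mean by more than `tolerance`.

    Values are signed relative deviations: positive means overloaded.
    """
    average = mean(counts.values())
    if average == 0:
        return {}
    return {
        broker: (count - average) / average
        for broker, count in counts.items()
        if abs(count - average) / average > tolerance
    }


# Hypothetical LeaderCount per broker; the mean is 120
leader_counts = {1: 100, 2: 100, 3: 160}
skew = skewed_brokers(leader_counts)
print(skew)
```

The same function works for PartitionCount. Running it on both shows whether the imbalance is in replica placement or only in leadership, which decides between a full reassignment and a cheaper leadership rebalance.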

JVM pressure on the follower

Long GC pauses on a follower can push it out of ISR repeatedly. If heap after GC is consistently above 80% or Full GC pauses exceed several seconds, reduce heap pressure. Kafka brokers typically run with modest heaps; oversized heaps cause longer pauses that directly impact replication. If down-conversion is causing heap pressure, upgrading client message formats or isolating legacy consumers will help.

Prevention

  • Alert on UnderReplicatedPartitions as a TICKET for any nonzero value outside maintenance. Escalate to PAGE only when UnderMinIsrPartitionCount is also rising, no broker has uptime under 600 seconds, and the condition persists for more than 5 minutes.
  • Enforce min.insync.replicas at the topic or broker level so that producers using acks=all see an ISR that shrinks below that threshold as NotEnoughReplicasException instead of silently writing with reduced durability.
  • Monitor partition and leadership balance per broker. A deviation greater than 20% from the cluster mean is a leading indicator of hot-spot overload.
  • Leave disk headroom and use dedicated volumes for Kafka. Disk saturation is the most common root cause of follower lag.
  • Keep JVM heap conservative and monitor GC pause duration. A follower that pauses for longer than replica.lag.time.max.ms will drop out of ISR.
  • Run game-day failures. Measure how long URP lasts after a controlled broker shutdown so your paging thresholds account for normal recovery time.
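The ticket-versus-page rule from the first bullet can be written down as a small decision function. This is a sketch of that rule only, with hypothetical parameter names; how the inputs are collected depends on your monitoring stack:

```python
def alert_level(
    urp: int,
    under_min_isr_rising: bool,
    min_broker_uptime_s: float,
    persisted_s: float,
    in_maintenance: bool = False,
) -> str:
    """Map cluster state to NONE, TICKET, or PAGE."""
    if urp == 0 or in_maintenance:
        return "NONE"
    no_recent_restart = min_broker_uptime_s >= 600   # no broker under 600 seconds of uptime
    long_lived = persisted_s > 300                   # condition held for more than 5 minutes
    if under_min_isr_rising and no_recent_restart and long_lived:
        return "PAGE"
    return "TICKET"


# Hypothetical situations
print(alert_level(3, under_min_isr_rising=False, min_broker_uptime_s=7200, persisted_s=900))
print(alert_level(3, under_min_isr_rising=True, min_broker_uptime_s=7200, persisted_s=900))
print(alert_level(3, under_min_isr_rising=True, min_broker_uptime_s=120, persisted_s=900))
```

The third call stays at TICKET because a broker restarted less than 600 seconds ago, which is the case the uptime guard exists for.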

How Netdata helps

  • Correlates per-broker UnderReplicatedPartitions with disk I/O latency, CPU, and memory pressure on the same node to identify whether the leader or the follower is the bottleneck.
  • Surfaces IsrShrinksPerSec and IsrExpandsPerSec alongside URP to distinguish stable degradation from ISR flapping caused by intermittent GC or network blips.
  • Supports alerting on delta of cumulative JMX counters instead of raw Count, reducing false positives during restarts.
  • Overlays JVM GC metrics with Kafka request-latency breakdowns to expose heap-pressure lag versus disk lag.
  • Aggregates URP across all brokers into a cluster-level view, removing blind spots from leader-only reporting.