Kafka replica MaxLag growing: slow followers and replica fetcher health

When kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica climbs on a broker, at least one follower replica hosted on that broker is failing to replicate fast enough. The metric reports the largest offset gap between any follower replica on the broker and its leader. In a healthy cluster it stays near zero. If a follower stays behind for longer than replica.lag.time.max.ms, the leader removes it from the ISR. Once enough replicas drop, partitions can fall below min.insync.replicas, and producers using acks=all hit NotEnoughReplicasException.

This guide shows how to tell follower-side, network, and leader-side bottlenecks apart, and how to stop the lag before it forces a full rebuild.

What this means

MaxLag is an offset count, not a time value. A lag of 100,000 offsets means different things at 1,000 versus 100,000 messages per second. Convert it to time: lag_messages / produce_rate_per_sec. If that interval approaches replica.lag.time.max.ms, ISR eviction is imminent.
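The conversion above can be sketched as a small shell helper. The function name lag_to_seconds is hypothetical, and the sketch assumes you supply the produce rate of the lagging partition yourself (for example from MessagesInPerSec), since MaxLag alone does not carry it.

```shell
# lag_to_seconds LAG_MESSAGES PRODUCE_RATE_PER_SEC
# Hypothetical helper: prints how far behind the follower is, in seconds.
# Prints "inf" when the rate is zero or negative, because the lag can no
# longer be expressed as a time.
lag_to_seconds() {
  awk -v lag="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) { print "inf"; exit }
    printf "%.1f\n", lag / rate
  }'
}
```

For example, lag_to_seconds 100000 1000 reports 100.0 seconds, while the same offset lag at 100,000 messages per second is only 1.0 second.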

Followers replicate with fetcher threads that send FetchFollower requests to the leader. The leader responds with log segments, ideally via zero-copy sendfile. These requests compete with consumer fetches for the same network threads, I/O threads, and disk access. A saturated leader delays replication even when the follower is healthy.

Since Kafka 0.9.0.0, replica.lag.max.messages has been removed. ISR membership is time-based only. The default replica.lag.time.max.ms was 10,000 ms through Kafka 2.4.x and changed to 30,000 ms in 2.5.0. If a follower does not send a fetch request or does not consume up to the leader’s log end offset within that window, the leader shrinks the ISR.

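If the lag, expressed as time, exceeds the topic retention period, the leader may have deleted segments the follower still needs. The follower then has to discard its out-of-range log and refetch the partition from the leader’s earliest available offset, which is effectively a rebuild from scratch. This generates a large burst of inter-broker traffic and keeps partitions under-replicated for an extended period.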

flowchart TD
    A[MaxLag growing] --> B{Follower disk slow?}
    B -->|Yes| C[High disk await]
    B -->|No| D{Network saturated?}
    D -->|Yes| E[High retransmits or NIC usage]
    D -->|No| F{Leader overloaded?}
    F -->|Yes| G[Low idle percent high queue]
    F -->|No| H[Follower GC or CPU bound]
    C --> I[ISR shrink or rebuild]
    E --> I
    G --> I
    H --> I

Common causes

Cause	What it looks like	First thing to check
Slow follower disk	`iostat` shows `await` >20 ms for SSD or >50 ms for HDD; `RequestHandlerAvgIdlePercent` low on follower	Disk latency on the lagging broker
Network bottleneck between leader and follower	`FetchFollower` `ResponseSendTimeMs` is high; `BytesOutPerSec` near NIC capacity; `/proc/net/snmp` shows rising `RetransSegs`	OS network stats on both sides
Leader overloaded serving fetches	`FetchFollower` p99 is high on the leader, but follower disk and network are healthy; `RequestHandlerAvgIdlePercent` <0.3	Leader `RequestQueueSize` and idle percent
Follower JVM pauses or CPU saturation	GC logs show pauses >200 ms; CPU usage sustained above 80%; lag spikes correlate with GC events	`jstat -gcutil` or JMX GC metrics on the follower
Page cache cold start or thrashing	`pgmajfault` rate spikes after restart or backfill consumer start; `FetchConsumer` `LocalTimeMs` jumps	`/proc/vmstat` major faults on follower

Quick checks

# Check maximum replica lag on this broker via JMX
# Requires jmxterm.jar in the working directory
echo "get -b kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica Value" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check follower fetch latency on the leader side
echo "get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Verify per-partition replica lag across the cluster
# The topic filter flag differs between Kafka versions; confirm it with
# kafka-replica-verification.sh --help before running
kafka-replica-verification.sh --broker-list localhost:9092 --topic-white-list ".*"

# Check ISR shrink rate on leader brokers
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Inspect disk latency on the suspect follower
iostat -xz 1

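# Review TCP retransmissions (cumulative counter; sample twice to compute a rate)
# /proc/net/snmp keeps names and values on separate "Tcp:" lines, so look the column up by name
awk '$1=="Tcp:" { if (!c) { for (i=2;i<=NF;i++) if ($i=="RetransSegs") c=i } else { print "RetransSegs", $c; exit } }' /proc/net/snmp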

# Check available disk space on all log directories
# Strip the "log.dirs=" key before splitting the comma-separated paths
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2- | tr ',' '\n' | while read -r d; do df -h "$d"; done

# Check JVM GC behavior on the follower
# Ensure the PID belongs to the broker, not another JVM on the same host
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
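Because RetransSegs is a cumulative counter, a single reading says little. The following is a minimal sketch of the "sample twice" approach. The function names retrans_segs and retrans_rate are hypothetical, and the 10-second default interval is an assumption, not a recommendation from Kafka or the kernel documentation.

```shell
# retrans_segs [FILE]: print the cumulative TCP RetransSegs counter.
# /proc/net/snmp stores a header line of names followed by a line of values,
# so the column is located by name in the header and read from the next line.
retrans_segs() {
  awk '$1 == "Tcp:" {
    if (!col) { for (i = 2; i <= NF; i++) if ($i == "RetransSegs") col = i }
    else      { print $col; exit }
  }' "${1:-/proc/net/snmp}"
}

# retrans_rate [INTERVAL_SEC]: sample the counter twice and print
# retransmitted segments per second over the interval (default 10 s).
retrans_rate() {
  local interval="${1:-10}" a b
  a=$(retrans_segs); sleep "$interval"; b=$(retrans_segs)
  awk -v a="$a" -v b="$b" -v t="$interval" 'BEGIN { printf "%.2f\n", (b - a) / t }'
}
```

Run retrans_rate 10 on both the leader and the follower; a rate that rises together with replica lag points at the network path between them.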

How to diagnose it

1. Rule out transient maintenance. Check broker uptime. If any broker involved has uptime under 600 seconds, the lag is likely post-restart catch-up. Wait for the broker to warm its page cache before deeper investigation.
2. Find the lagging follower. Cross-reference UnderReplicatedPartitions across brokers to identify the common follower. Run kafka-replica-verification.sh to see per-partition lag.
3. Check the leader side. On the leader, read FetchFollower p99. If it is high while the follower disk and network are healthy, the leader is too busy to serve replication. Check RequestHandlerAvgIdlePercent and RequestQueueSize on the leader.
4. Check the follower side. On the lagging follower, run iostat -xz 1. Sustained high await means the disk cannot keep up with replication writes. Check RequestHandlerAvgIdlePercent on the follower for local thread saturation.
5. Convert lag to time. Divide MaxLag by the topic’s produce rate. If the result exceeds 50% of replica.lag.time.max.ms, expect ISR shrinks soon. If it exceeds the retention period, the follower will rebuild from scratch rather than catch up.
6. Correlate with GC and page cache. On the follower, check jstat -gcutil or JMX GC metrics. Pauses over 200 ms stall replication. Check /proc/vmstat for a rising pgmajfault rate, which pushes reads to disk and steals I/O bandwidth.
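The two thresholds from the "convert lag to time" step can be sketched as a single classification. The function name classify_lag and its output labels are hypothetical. The caller supplies the lag, the produce rate, replica.lag.time.max.ms, and the topic retention in milliseconds.

```shell
# classify_lag LAG_MESSAGES RATE_PER_SEC LAG_TIME_MAX_MS RETENTION_MS
# Hypothetical helper applying the thresholds described in the steps above:
#   lag time > retention                     -> rebuild-risk
#   lag time > 50% of replica.lag.time.max.ms -> isr-shrink-risk
#   otherwise                                -> ok
classify_lag() {
  awk -v lag="$1" -v rate="$2" -v max_ms="$3" -v ret_ms="$4" 'BEGIN {
    if (rate <= 0) { print "unknown"; exit }
    lag_ms = lag / rate * 1000
    if (lag_ms > ret_ms)            print "rebuild-risk"
    else if (lag_ms > 0.5 * max_ms) print "isr-shrink-risk"
    else                            print "ok"
  }'
}
```

With a 30,000 ms lag window and seven-day retention, classify_lag 100000 1000 30000 604800000 reports isr-shrink-risk, because 100 seconds of lag is well past half the window.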

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`Replica MaxLag` (ReplicaFetcherManager)	Direct measure of the worst follower offset delta	Nonzero and stable or growing for more than 10 minutes
`UnderReplicatedPartitions`	Confirms that at least one follower is out of sync	Nonzero outside of maintenance windows
`FetchFollower TotalTimeMs` p99	Leader-side replication path latency; predicts ISR shrink	p99 exceeds 50% of `replica.lag.time.max.ms`
`ISR Shrinks Per Second`	Velocity of durability degradation	Sustained nonzero outside maintenance
`Disk I/O await`	Disk saturation is the dominant follower bottleneck	SSD >20 ms or HDD >50 ms sustained
`RequestHandlerAvgIdlePercent`	Broker processing capacity; low values mean threads are blocked on I/O or CPU	Sustained below 0.3
`Produce purgatory size`	`acks=all` requests stalled waiting for slow followers	>2x baseline for more than 5 minutes
`Page cache major fault rate`	Cache misses force disk reads, competing with replication I/O	`pgmajfault` rate 2x above baseline
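The pgmajfault warning sign in the table is a rate relative to a baseline, while /proc/vmstat only exposes a cumulative counter. A minimal sketch of that comparison follows. The function names are hypothetical, and the baseline is assumed to be a value you measured earlier on the same host.

```shell
# pgmajfault_count [FILE]: print the cumulative major page fault counter.
pgmajfault_count() {
  awk '$1 == "pgmajfault" { print $2; exit }' "${1:-/proc/vmstat}"
}

# majfault_rate [INTERVAL_SEC]: sample twice and print major faults per second.
majfault_rate() {
  local interval="${1:-10}" a b
  a=$(pgmajfault_count); sleep "$interval"; b=$(pgmajfault_count)
  awk -v a="$a" -v b="$b" -v t="$interval" 'BEGIN { printf "%.1f\n", (b - a) / t }'
}

# exceeds_baseline RATE BASELINE: succeed when RATE is more than 2x BASELINE.
exceeds_baseline() {
  awk -v r="$1" -v base="$2" 'BEGIN { exit !(r > 2 * base) }'
}
```

A check such as exceeds_baseline "$(majfault_rate 10)" 5 && echo "page cache thrashing suspected" treats 5 faults per second as the baseline; that number is only a placeholder.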

Fixes

Slow follower disk

If the lagging broker uses JBOD and only one disk is degraded, the partitions on that disk are the bottleneck. Reassign those partitions to a healthier broker. Do not restart the broker as a first response; a restart loses page cache and extends recovery time. If all disks are slow, reduce partition count on the broker or scale to faster storage. Reassignment is disruptive and keeps partitions under-replicated until it completes. Run it during low traffic.
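As a sketch of what such a reassignment might look like, the helper below generates the JSON plan that kafka-reassign-partitions.sh reads. The function name reassignment_json, the topic name, the broker IDs, and the throttle value are all hypothetical. The --bootstrap-server form of the tool assumes a reasonably recent Kafka release; older releases used a ZooKeeper connection string instead.

```shell
# reassignment_json TOPIC PARTITION REPLICA_IDS
# Hypothetical helper: prints a single-partition reassignment plan.
# REPLICA_IDS is a comma-separated list of broker IDs, preferred leader first.
reassignment_json() {
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%d,"replicas":[%s]}]}\n' \
    "$1" "$2" "$3"
}

# Example usage (illustrative topic, broker IDs, and throttle in bytes/sec):
#   reassignment_json orders 3 1,4,5 > move.json
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file move.json --execute --throttle 50000000
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file move.json --verify
```

The throttle limits the extra replication traffic so the move does not worsen the lag it is meant to fix. Running --verify after completion also clears the throttle.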

Network bottleneck between brokers

If BytesOutPerSec on the leader is near NIC capacity or RetransSegs is climbing, inter-broker replication competes with client traffic for egress. Isolate replication to a dedicated listener if possible. Otherwise, throttle heavy consumers or add network capacity. A dedicated listener requires broker restart and client reconfiguration.

Leader overloaded serving fetches

When FetchFollower latency is high on the leader but the follower is healthy, the leader has too many partitions or is saturated with consumer reads. If leadership has drifted onto this broker, trigger a preferred replica election to restore the intended leader distribution; if the preferred assignment itself is skewed, reassign partitions instead. Both are disruptive; leadership moves cause brief client reconnect storms. Increasing num.io.threads can help if the bottleneck is concurrency and not disk I/O, but more threads increase memory pressure and context switching.

Follower JVM pauses or CPU saturation

If GC pauses correlate with lag spikes, tune the JVM heap. Kafka brokers typically run best with 4-8 GB heaps; larger heaps cause longer pauses. Ensure the broker is not colocated with other CPU-intensive workloads. If the follower is also a leader for many partitions, its CPU may be consumed by consumer fetch processing.

Imminent rebuild from scratch

If MaxLag in seconds exceeds the topic’s retention, the follower needs a full rebuild. Consider proactively reassigning the replica to a healthier broker rather than waiting for a catch-up that will never complete. Reassignment is disruptive and generates a large temporary traffic spike.

Prevention

Monitor MaxLag as time, not offsets. Alert when lag_seconds exceeds a fraction of replica.lag.time.max.ms.
Maintain disk headroom and IOPS margin on every node. A single slow JBOD disk can degrade a broker while the rest of the array is fine.
Keep inter-broker network utilization below 70% of NIC capacity to absorb bursts.
Avoid colocating other heavy workloads on Kafka brokers to prevent page cache eviction and CPU contention.
Measure baseline recovery time during planned rolling restarts. Know how long your largest partitions take to catch up after a broker restart.

How Netdata helps

Correlate replica MaxLag with per-node disk await and pgmajfault rate to identify follower-side disk bottlenecks.
Visualize FetchFollower latency alongside RequestHandlerAvgIdlePercent to separate leader overload from network issues.
Track ISR shrinks and expands on the same chart to spot flapping replicas before they trigger UnderReplicatedPartitions alerts.
Set custom JMX alerts on MaxLag expressed in seconds relative to replica.lag.time.max.ms.

Related guides

How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/
Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas: /guides/kafka/kafka-not-enough-replicas-exception/