Kafka replica MaxLag growing: slow followers and replica fetcher health

When kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica climbs on a broker, the worst follower is failing to replicate fast enough. This metric is the maximum offset distance between a leader and its most lagging follower. In a healthy cluster it stays near zero. If the gap persists longer than replica.lag.time.max.ms, the leader removes the follower from the ISR. Once enough replicas drop, partitions can fall below min.insync.replicas, and producers using acks=all hit NotEnoughReplicasException.

This guide shows how to separate follower-side, network, and leader-side bottlenecks, and how to stop the lag before it forces a full rebuild.

What this means

MaxLag is an offset count, not a time value. A lag of 100,000 offsets means different things at 1,000 versus 100,000 messages per second. Convert it to time: lag_messages / produce_rate_per_sec. If that interval approaches replica.lag.time.max.ms, ISR eviction is imminent.
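
The conversion above is a single division, sketched here as a shell helper. The function name `lag_seconds` is illustrative, and the produce rate is assumed to come from your own metrics (for example, the topic's messages-in rate).

```shell
# lag_seconds LAG_OFFSETS PRODUCE_RATE_PER_SEC
# Prints the lag as seconds of produce traffic, to one decimal place.
lag_seconds() {
    awk -v lag="$1" -v rate="$2" 'BEGIN {
        if (rate <= 0) { print "n/a"; exit }   # no produce traffic: the ratio is undefined
        printf "%.1f\n", lag / rate
    }'
}

# The same 100,000-offset lag at two produce rates
lag_seconds 100000 1000      # prints 100.0
lag_seconds 100000 100000    # prints 1.0
```

At 1,000 messages per second the follower is 100 seconds behind, well past a 30,000 ms window. At 100,000 messages per second the same offset count is one second of traffic.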

Followers replicate with fetcher threads that send Fetch requests to the leader; the request metrics track these under request=FetchFollower. The leader responds with log segments, ideally via zero-copy sendfile. These requests compete with consumer fetches for the same network threads, I/O threads, and disk access. A saturated leader delays replication even when the follower is healthy.

Since Kafka 0.9.0.0, replica.lag.max.messages has been removed. ISR membership is time-based only. The default replica.lag.time.max.ms was 10,000 ms through Kafka 2.4.x and changed to 30,000 ms in 2.5.0. If a follower does not send a fetch request or does not consume up to the leader’s log end offset within that window, the leader shrinks the ISR.
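
As a simplified model of the time-based rule, the sketch below treats a follower as out of sync when it is behind the leader's log end offset and has not been fully caught up within the window. The function and argument names are hypothetical and the timestamps are small illustrative numbers; the broker's real bookkeeping is more involved.

```shell
# follower_out_of_sync FOLLOWER_LEO LEADER_LEO LAST_CAUGHT_UP_MS NOW_MS MAX_LAG_MS
# Returns 0 (true) when the follower is behind AND was last caught up
# more than MAX_LAG_MS ago. Offset distance alone never triggers it.
follower_out_of_sync() {
    follower_leo=$1 leader_leo=$2 last_caught_up_ms=$3 now_ms=$4 max_lag_ms=$5
    [ "$follower_leo" -ne "$leader_leo" ] &&
        [ "$((now_ms - last_caught_up_ms))" -gt "$max_lag_ms" ]
}

# Last caught up 31,000 ms ago with a 30,000 ms window
if follower_out_of_sync 900 1000 0 31000 30000; then echo "shrink ISR"; else echo "keep in ISR"; fi      # prints "shrink ISR"
# Last caught up 26,000 ms ago: still inside the window
if follower_out_of_sync 900 1000 5000 31000 30000; then echo "shrink ISR"; else echo "keep in ISR"; fi   # prints "keep in ISR"
```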

If MaxLag in seconds exceeds the topic retention period, the leader may have deleted segments the follower still needs. The follower then rebuilds from scratch, generating a large burst of inter-broker traffic and keeping partitions under-replicated for an extended period.

flowchart TD
    A[MaxLag growing] --> B{Follower disk slow?}
    B -->|Yes| C[High disk await]
    B -->|No| D{Network saturated?}
    D -->|Yes| E[High retransmits or NIC usage]
    D -->|No| F{Leader overloaded?}
    F -->|Yes| G[Low idle percent, high queue]
    F -->|No| H[Follower GC or CPU bound]
    C --> I[ISR shrink or rebuild]
    E --> I
    G --> I
    H --> I

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Slow follower disk | iostat shows await >20 ms for SSD or >50 ms for HDD; RequestHandlerAvgIdlePercent low on follower | Disk latency on the lagging broker |
| Network bottleneck between leader and follower | FetchFollower ResponseSendTimeMs is high; BytesOutPerSec near NIC capacity; /proc/net/snmp shows rising RetransSegs | OS network stats on both sides |
| Leader overloaded serving fetches | FetchFollower p99 is high on the leader, but follower disk and network are healthy; RequestHandlerAvgIdlePercent <0.3 | Leader RequestQueueSize and idle percent |
| Follower JVM pauses or CPU saturation | GC logs show pauses >200 ms; CPU usage sustained above 80%; lag spikes correlate with GC events | jstat -gcutil or JMX GC metrics on the follower |
| Page cache cold start or thrashing | pgmajfault rate spikes after restart or backfill consumer start; FetchConsumer LocalTimeMs jumps | /proc/vmstat major faults on follower |
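
The decision order in the flowchart, combined with the disk and idle-percent thresholds from the table, can be sketched as a small triage helper. The function name, argument order, and output labels are assumptions for illustration; the network check is passed in as a yes/no answer because the guide gives no single numeric threshold for it.

```shell
# classify_bottleneck DISK_TYPE AWAIT_MS NET_SATURATED LEADER_IDLE
#   DISK_TYPE      ssd or hdd (await limit 20 ms or 50 ms)
#   AWAIT_MS       sustained iostat await on the follower
#   NET_SATURATED  yes or no (rising retransmits or NIC near capacity)
#   LEADER_IDLE    RequestHandlerAvgIdlePercent on the leader, 0.0 to 1.0
classify_bottleneck() {
    awk -v t="$1" -v a="$2" -v n="$3" -v i="$4" 'BEGIN {
        limit = (t == "ssd") ? 20 : 50
        if (a > limit)       print "follower-disk"
        else if (n == "yes") print "network"
        else if (i < 0.3)    print "leader-overloaded"
        else                 print "follower-gc-or-cpu"   # last branch of the flowchart
    }'
}

classify_bottleneck ssd 35 no 0.6    # prints follower-disk
classify_bottleneck hdd 35 yes 0.6   # prints network
classify_bottleneck ssd 5 no 0.2     # prints leader-overloaded
```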

Quick checks

# Check maximum replica lag on this broker via JMX
# Requires jmxterm.jar in the working directory
echo "get -b kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica Value" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check follower fetch latency on the leader side
echo "get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Verify per-partition replica lag across the cluster
# The topic filter flag depends on the Kafka version: older releases use --topic-white-list,
# while 3.x releases use --topics-include. Run the tool with --help to confirm.
kafka-replica-verification.sh --broker-list localhost:9092 --topic-white-list ".*"

# Check ISR shrink rate on leader brokers
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Inspect disk latency on the suspect follower
iostat -xz 1

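# Review TCP retransmissions (cumulative counter; sample twice to compute a rate)
# /proc/net/snmp holds a header line and a value line per protocol, so look up the column by name
awk '/^Tcp:/ { if (!c) { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") c = i } else print "RetransSegs", $c }' /proc/net/snmp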

# Check available disk space on all log directories
# cut strips the "log.dirs=" key so the first path is not mangled
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2- | tr ',' '\n' | while read -r d; do df -h "$d"; done

# Check JVM GC behavior on the follower
# Ensure the PID belongs to the broker, not another JVM on the same host
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
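
Because RetransSegs is a cumulative counter, a single reading says little. The sketch below takes two readings and turns them into a per-second rate. The helper names are illustrative, and the file argument exists only so the parser can be exercised against a saved copy.

```shell
# retrans_segs [FILE]: print the cumulative RetransSegs counter.
# The first "Tcp:" line is the header, the second carries the values.
retrans_segs() {
    awk '/^Tcp:/ {
        if (!col) { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") col = i }
        else      { print $col; exit }
    }' "${1:-/proc/net/snmp}"
}

# retrans_rate FIRST SECOND INTERVAL_SEC: retransmitted segments per second
retrans_rate() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Usage on a broker host (10-second sample):
#   first=$(retrans_segs); sleep 10; second=$(retrans_segs)
#   retrans_rate "$first" "$second" 10
```

Run it on both the leader and the follower; a rate that rises together with MaxLag points at the network path.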

How to diagnose it

  1. Rule out transient maintenance. Check broker uptime. If any broker involved has uptime under 600 seconds, the lag is likely post-restart catch-up. Wait for the broker to warm its page cache before deeper investigation.

  2. Find the lagging follower. Cross-reference UnderReplicatedPartitions across brokers to identify the common follower. Run kafka-replica-verification.sh to see per-partition lag.

  3. Check the leader side. On the leader, read FetchFollower p99. If it is high while the follower disk and network are healthy, the leader is too busy to serve replication. Check RequestHandlerAvgIdlePercent and RequestQueueSize on the leader.

  4. Check the follower side. On the lagging follower, run iostat -xz 1. Sustained high await means the disk cannot keep up with replication writes. Check RequestHandlerAvgIdlePercent on the follower for local thread saturation.

  5. Convert lag to time. Divide MaxLag by the topic’s produce rate in messages per second to get the lag in seconds. If that exceeds 50% of replica.lag.time.max.ms (15 seconds at the 30,000 ms default), expect ISR shrinks soon. If it exceeds the retention period, the follower will rebuild from scratch rather than catch up.

  6. Correlate with GC and page cache. On the follower, check jstat -gcutil or JMX GC metrics. Pauses over 200 ms stall replication. Check /proc/vmstat for a rising pgmajfault rate, which pushes reads to disk and steals I/O bandwidth.
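
Step 5 can be sketched as a three-way classification. The function name and output labels are hypothetical; the 50% threshold and the retention comparison come from the step itself, and all time inputs are in milliseconds.

```shell
# lag_risk LAG_OFFSETS PRODUCE_RATE_PER_SEC LAG_TIME_MAX_MS RETENTION_MS
lag_risk() {
    awk -v lag="$1" -v rate="$2" -v win="$3" -v ret="$4" 'BEGIN {
        if (rate <= 0) { print "unknown"; exit }   # cannot convert without a produce rate
        lag_ms = lag / rate * 1000
        if (lag_ms > ret)            print "rebuild-likely"
        else if (lag_ms > win * 0.5) print "isr-shrink-soon"
        else                         print "ok"
    }'
}

# 30,000 ms ISR window, 7-day retention (604,800,000 ms), 1,000 msg/s
lag_risk 5000 1000 30000 604800000         # 5 s behind: prints ok
lag_risk 20000 1000 30000 604800000        # 20 s behind: prints isr-shrink-soon
lag_risk 1000000000 1000 30000 604800000   # about 11.6 days behind: prints rebuild-likely
```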

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Replica MaxLag (ReplicaFetcherManager) | Direct measure of the worst follower offset delta | Nonzero and stable or growing for more than 10 minutes |
| UnderReplicatedPartitions | Confirms that at least one follower is out of sync | Nonzero outside of maintenance windows |
| FetchFollower TotalTimeMs p99 | Leader-side replication path latency; predicts ISR shrink | p99 exceeds 50% of replica.lag.time.max.ms |
| ISR Shrinks Per Second | Velocity of durability degradation | Sustained nonzero outside maintenance |
| Disk I/O await | Disk saturation is the dominant follower bottleneck | SSD >20 ms or HDD >50 ms sustained |
| RequestHandlerAvgIdlePercent | Broker processing capacity; low values mean threads are blocked on I/O or CPU | Sustained below 0.3 |
| Produce purgatory size | acks=all requests stalled waiting for slow followers | >2x baseline for more than 5 minutes |
| Page cache major fault rate | Cache misses force disk reads, competing with replication I/O | pgmajfault rate 2x above baseline |
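
The MaxLag warning sign, "nonzero and stable or growing for more than 10 minutes", is a condition over a series of samples, not a single reading. A minimal sketch, assuming samples arrive as "epoch_seconds maxlag" lines, oldest first; the function name and input format are illustrative.

```shell
# maxlag_sustained [MIN_SECONDS]: reads "epoch_seconds maxlag" lines on stdin.
# Prints ALERT when lag has been nonzero and non-decreasing for MIN_SECONDS (default 600).
maxlag_sustained() {
    awk -v min="${1:-600}" '
        {
            if ($2 <= 0)                   { active = 0 }              # caught up: streak ends
            else if (!active || $2 < prev) { active = 1; start = $1 }  # new streak, or lag shrank
            prev = $2; last = $1
        }
        END {
            if (active && last - start >= min) print "ALERT"; else print "ok"
        }'
}

printf '0 100\n300 150\n700 150\n' | maxlag_sustained   # prints ALERT
printf '0 100\n300 50\n700 60\n'   | maxlag_sustained   # prints ok (lag shrank at t=300)
```

A drop in lag restarts the streak rather than clearing it, so a follower that briefly improves and then stalls still alerts once the stall lasts long enough.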

Fixes

Slow follower disk

If the lagging broker uses JBOD and only one disk is degraded, the partitions on that disk are the bottleneck. Reassign those partitions to a healthier broker. Do not restart the broker as a first response; a restart loses page cache and extends recovery time. If all disks are slow, reduce partition count on the broker or scale to faster storage. Reassignment is disruptive and keeps partitions under-replicated until it completes. Run it during low traffic.
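
Reassignment is driven by a JSON plan. The helper below is a sketch that builds that plan from "topic partition replicas" lines; the function name, the topic `orders`, and the broker IDs are hypothetical, and choosing the new replica list is left to the operator.

```shell
# reassignment_json: reads "topic partition replica,replica,..." lines on stdin
# and prints a plan in the format kafka-reassign-partitions.sh accepts.
reassignment_json() {
    awk 'BEGIN   { printf "{\"version\":1,\"partitions\":[" }
         NF == 3 { printf "%s{\"topic\":\"%s\",\"partition\":%d,\"replicas\":[%s]}", sep, $1, $2, $3; sep = "," }
         END     { print "]}" }'
}

# Move partition 3 of "orders" so that broker 4 replaces the slow follower
printf 'orders 3 1,2,4\n' | reassignment_json
# prints {"version":1,"partitions":[{"topic":"orders","partition":3,"replicas":[1,2,4]}]}
```

Save the output to a file and pass it to kafka-reassign-partitions.sh with --reassignment-json-file and --execute. Adding --throttle limits the replication bandwidth the move can consume.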

Network bottleneck between brokers

If BytesOutPerSec on the leader is near NIC capacity or RetransSegs is climbing, inter-broker replication competes with client traffic for egress. Isolate replication to a dedicated listener if possible. Otherwise, throttle heavy consumers or add network capacity. A dedicated listener requires broker restart and client reconfiguration.

Leader overloaded serving fetches

When FetchFollower latency is high on the leader but the follower is healthy, the leader has too many partitions or is saturated with consumer reads. A preferred replica election moves leadership back to the first replica in each partition's assignment, so it relieves the leader only if leadership has drifted away from the preferred replicas. If the preferred assignment itself is unbalanced, reassign partitions instead. Either action is disruptive; leadership moves cause brief client reconnect storms. Increasing num.io.threads can help if the bottleneck is concurrency and not disk I/O, but more threads increase memory pressure and context switching.
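
Before triggering an election, it helps to see which partitions are led by a broker other than the first replica in their assignment, since only those would move. The sketch below parses the output of kafka-topics.sh --describe. The function name is illustrative, and the "Topic: / Partition: / Leader: / Replicas:" line layout is assumed; check it against your version's output.

```shell
# non_preferred_leaders: reads `kafka-topics.sh --describe` output on stdin and prints
# "topic partition current_leader preferred_leader" where the two leaders differ.
non_preferred_leaders() {
    awk '/Leader:/ {
        topic = part = leader = pref = ""
        for (i = 1; i < NF; i++) {
            if ($i == "Topic:")     topic  = $(i + 1)
            if ($i == "Partition:") part   = $(i + 1)
            if ($i == "Leader:")    leader = $(i + 1)
            if ($i == "Replicas:")  { split($(i + 1), r, ","); pref = r[1] }
        }
        if (leader != pref) print topic, part, leader, pref
    }'
}

# Usage against a live cluster:
#   kafka-topics.sh --bootstrap-server localhost:9092 --describe | non_preferred_leaders
```

If the list is empty, an election changes nothing and reassignment is the remaining option. Otherwise kafka-leader-election.sh with --election-type preferred can move the listed partitions back.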

Follower JVM pauses or CPU saturation

If GC pauses correlate with lag spikes, tune the JVM heap. Kafka brokers typically run best with 4-8 GB heaps; larger heaps cause longer pauses. Ensure the broker is not colocated with other CPU-intensive workloads. If the follower is also a leader for many partitions, its CPU may be consumed by consumer fetch processing.
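
To line GC activity up with lag spikes, the cumulative GCT column of jstat -gcutil can be turned into per-sample deltas. This is an approximation: a delta is the total GC time within one sampling interval, which may cover several shorter pauses. The function name and the 0.2-second default are assumptions that mirror the 200 ms figure used in this guide.

```shell
# gc_heavy_samples [THRESHOLD_SEC]: reads `jstat -gcutil <pid> 1000` output on stdin
# and reports samples where cumulative GC time (GCT, seconds) grew by more than the threshold.
gc_heavy_samples() {
    awk -v thr="${1:-0.2}" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "GCT") col = i; next }   # locate GCT in the header
        col {
            if (seen && $col - prev > thr)
                printf "sample %d: %.3f s in GC\n", NR - 1, $col - prev
            prev = $col; seen = 1
        }'
}

# Usage on the follower (stop with Ctrl-C):
#   jstat -gcutil <broker-pid> 1000 | gc_heavy_samples
```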

Imminent rebuild from scratch

If MaxLag in seconds exceeds the topic’s retention, the follower needs a full rebuild. Consider proactively reassigning the replica to a healthier broker rather than waiting for a catch-up that will never complete. Reassignment is disruptive and generates a large temporary traffic spike.

Prevention

  • Monitor MaxLag as time, not offsets. Alert when lag_seconds exceeds a fraction of replica.lag.time.max.ms.
  • Maintain disk headroom and IOPS margin on every node. A single slow JBOD disk can degrade a broker while the rest of the array is fine.
  • Keep inter-broker network utilization below 70% of NIC capacity to absorb bursts.
  • Avoid colocating other heavy workloads on Kafka brokers to prevent page cache eviction and CPU contention.
  • Measure baseline recovery time during planned rolling restarts. Know how long your largest partitions take to catch up after a broker restart.

How Netdata helps

  • Correlate replica MaxLag with per-node disk await and pgmajfault rate to identify follower-side disk bottlenecks.
  • Visualize FetchFollower latency alongside RequestHandlerAvgIdlePercent to separate leader overload from network issues.
  • Track ISR shrinks and expands on the same chart to spot flapping replicas before they trigger UnderReplicatedPartitions alerts.
  • Set custom JMX alerts on MaxLag expressed in seconds relative to replica.lag.time.max.ms.

Related guides

  • How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
  • Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
  • Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/
  • Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas: /guides/kafka/kafka-not-enough-replicas-exception/