Kafka replica MaxLag growing: slow followers and replica fetcher health
When kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica climbs on a broker, the worst follower is failing to replicate fast enough. This metric is the maximum offset distance between a leader and its most lagging follower. In a healthy cluster it stays near zero. If the gap persists longer than replica.lag.time.max.ms, the leader removes the follower from the ISR. Once enough replicas drop, partitions can fall below min.insync.replicas, and producers using acks=all hit NotEnoughReplicasException.
This guide shows how to tell follower-side, network, and leader-side bottlenecks apart, and how to stop the lag before it forces a full rebuild.
What this means
MaxLag is an offset count, not a time value. A lag of 100,000 offsets means different things at 1,000 versus 100,000 messages per second. Convert it to time: lag_messages / produce_rate_per_sec. If that interval approaches replica.lag.time.max.ms, ISR eviction is imminent.
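The conversion can be sketched as a small shell helper. This is a minimal illustration, not part of Kafka: the function name `lag_to_seconds` is hypothetical, and the produce rate is assumed to be measured separately (for example from the topic's messages-in rate).

```shell
# Convert an offset lag into an approximate time lag in seconds.
# $1 = lag in offsets (the MaxLag value)
# $2 = produce rate in messages per second (assumed to be measured elsewhere)
lag_to_seconds() {
  awk -v lag="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) { print "inf"; exit }   # no traffic: time lag is undefined
    printf "%.1f\n", lag / rate
  }'
}

# The same 100,000-offset lag at two produce rates:
lag_to_seconds 100000 1000     # prints 100.0
lag_to_seconds 100000 100000   # prints 1.0
```

At 1,000 messages per second the follower is 100 seconds behind, well past a 30,000 ms window; at 100,000 messages per second it is only 1 second behind.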
Followers replicate with fetcher threads that send Fetch requests to the leader; broker request metrics report them as request=FetchFollower. The leader responds with log data, ideally via zero-copy sendfile. These requests compete with consumer fetches for the same network threads, I/O threads, and disk access, so a saturated leader delays replication even when the follower is healthy.
Since Kafka 0.9.0.0, replica.lag.max.messages has been removed. ISR membership is time-based only. The default replica.lag.time.max.ms was 10,000 ms through Kafka 2.4.x and changed to 30,000 ms in 2.5.0. If a follower does not send a fetch request or does not consume up to the leader’s log end offset within that window, the leader shrinks the ISR.
If MaxLag in seconds exceeds the topic retention period, the leader may have deleted segments the follower still needs. The follower then rebuilds from scratch, generating a large burst of inter-broker traffic and keeping partitions under-replicated for an extended period.
flowchart TD
A[MaxLag growing] --> B{Follower disk slow?}
B -->|Yes| C[High disk await]
B -->|No| D{Network saturated?}
D -->|Yes| E[High retransmits or NIC usage]
D -->|No| F{Leader overloaded?}
F -->|Yes| G[Low idle percent high queue]
F -->|No| H[Follower GC or CPU bound]
C --> I[ISR shrink or rebuild]
E --> I
G --> I
H --> I

Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow follower disk | iostat shows await >20 ms for SSD or >50 ms for HDD; RequestHandlerAvgIdlePercent low on follower | Disk latency on the lagging broker |
| Network bottleneck between leader and follower | FetchFollower ResponseSendTimeMs is high; BytesOutPerSec near NIC capacity; /proc/net/snmp shows rising RetransSegs | OS network stats on both sides |
| Leader overloaded serving fetches | FetchFollower p99 is high on the leader, but follower disk and network are healthy; RequestHandlerAvgIdlePercent <0.3 | Leader RequestQueueSize and idle percent |
| Follower JVM pauses or CPU saturation | GC logs show pauses >200 ms; CPU usage sustained above 80%; lag spikes correlate with GC events | jstat -gcutil or JMX GC metrics on the follower |
| Page cache cold start or thrashing | pgmajfault rate spikes after restart or backfill consumer start; FetchConsumer LocalTimeMs jumps | /proc/vmstat major faults on follower |
Quick checks
# Check maximum replica lag on this broker via JMX
# Requires jmxterm.jar in the working directory
echo "get -b kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica Value" | java -jar jmxterm.jar -l localhost:9999
# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check follower fetch latency on the leader side
echo "get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Verify per-partition replica lag across the cluster
# TODO: verify --topic-white-list vs --topic-whitelist for your Kafka version
kafka-replica-verification.sh --broker-list localhost:9092 --topic-white-list ".*"
# Check ISR shrink rate on leader brokers
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Inspect disk latency on the suspect follower
iostat -xz 1
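# Review TCP retransmissions (cumulative counter; sample twice to compute a rate)
# /proc/net/snmp keeps names and values on separate Tcp: lines, so select the RetransSegs column by name
awk '/^Tcp:/ { if (!c) { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") c = i } else print "RetransSegs", $c }' /proc/net/snmp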
# Check available disk space on all log directories
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2- | tr ',' '\n' | while read -r d; do df -h "$d"; done
# Check JVM GC behavior on the follower
# Ensure the PID belongs to the broker, not another JVM on the same host
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
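Because RetransSegs is a cumulative counter, a single reading says little. The sketch below samples it twice and prints a per-second rate. The function names `retrans_segs` and `retrans_rate` are hypothetical helpers, and the optional file argument exists only so the parser can be pointed at a saved copy of /proc/net/snmp.

```shell
# Print the cumulative RetransSegs counter from a /proc/net/snmp-style file.
# The file holds a "Tcp:" header line of names followed by a "Tcp:" line of values.
retrans_segs() {
  awk '/^Tcp:/ {
    if (!col) { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") col = i }
    else { print $col; exit }
  }' "${1:-/proc/net/snmp}"
}

# Sample the counter twice and print retransmitted segments per second.
# $1 = interval in seconds (default 10, must be positive)
# $2 = file to read (default /proc/net/snmp)
retrans_rate() {
  local interval="${1:-10}" snmp_file="${2:-/proc/net/snmp}"
  local before after
  before=$(retrans_segs "$snmp_file")
  sleep "$interval"
  after=$(retrans_segs "$snmp_file")
  awk -v a="$before" -v b="$after" -v s="$interval" 'BEGIN {
    if (s <= 0) exit 2
    printf "%.2f\n", (b - a) / s
  }'
}
```

Running `retrans_rate 10` on both the leader and the follower gives comparable rates; a rate that rises together with MaxLag points at the network rather than the disk.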
How to diagnose it
Rule out transient maintenance. Check broker uptime. If any broker involved has uptime under 600 seconds, the lag is likely post-restart catch-up. Wait for the broker to warm its page cache before deeper investigation.
Find the lagging follower. Cross-reference UnderReplicatedPartitions across brokers to identify the common follower. Run kafka-replica-verification.sh to see per-partition lag.
Check the leader side. On the leader, read FetchFollower p99. If it is high while the follower disk and network are healthy, the leader is too busy to serve replication. Check RequestHandlerAvgIdlePercent and RequestQueueSize on the leader.
Check the follower side. On the lagging follower, run iostat -xz 1. Sustained high await means the disk cannot keep up with replication writes. Check RequestHandlerAvgIdlePercent on the follower for local thread saturation.
Convert lag to time. Divide MaxLag by the topic’s produce rate. If the result exceeds 50% of replica.lag.time.max.ms, expect ISR shrinks soon. If it exceeds the retention period, the follower will rebuild from scratch rather than catch up.
Correlate with GC and page cache. On the follower, check jstat -gcutil or JMX GC metrics. Pauses over 200 ms stall replication. Check /proc/vmstat for a rising pgmajfault rate, which pushes reads to disk and steals I/O bandwidth.
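The "convert lag to time" step can be turned into a rough triage helper. This is a sketch under stated assumptions: `classify_lag` and its three labels are hypothetical, the 50% threshold is the one used in this guide, and the produce rate and retention values are supplied by the operator rather than read from the cluster.

```shell
# Classify a replica lag against the ISR window and the topic retention.
# $1 = lag in offsets (MaxLag)
# $2 = produce rate in messages per second
# $3 = replica.lag.time.max.ms
# $4 = topic retention in ms (a negative value is treated as unlimited)
classify_lag() {
  awk -v lag="$1" -v rate="$2" -v window_ms="$3" -v retention_ms="$4" 'BEGIN {
    if (rate <= 0) { print "unknown"; exit }
    lag_ms = lag / rate * 1000
    if (retention_ms >= 0 && lag_ms > retention_ms) print "rebuild-risk"
    else if (lag_ms > 0.5 * window_ms)              print "isr-shrink-soon"
    else                                            print "ok"
  }'
}

classify_lag 100000 100000 30000 604800000      # 1 s behind: prints ok
classify_lag 100000 1000 30000 604800000        # 100 s behind: prints isr-shrink-soon
classify_lag 900000000 1000 30000 604800000     # about 10.4 days behind a 7-day retention: prints rebuild-risk
```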
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Replica MaxLag (ReplicaFetcherManager) | Direct measure of the worst follower offset delta | Nonzero and stable or growing for more than 10 minutes |
| UnderReplicatedPartitions | Confirms that at least one follower is out of sync | Nonzero outside of maintenance windows |
| FetchFollower TotalTimeMs p99 | Leader-side replication path latency; predicts ISR shrink | p99 exceeds 50% of replica.lag.time.max.ms |
| ISR Shrinks Per Second | Velocity of durability degradation | Sustained nonzero outside maintenance |
| Disk I/O await | Disk saturation is the dominant follower bottleneck | SSD >20 ms or HDD >50 ms sustained |
| RequestHandlerAvgIdlePercent | Broker processing capacity; low values mean threads are blocked on I/O or CPU | Sustained below 0.3 |
| Produce purgatory size | acks=all requests stalled waiting for slow followers | >2x baseline for more than 5 minutes |
| Page cache major fault rate | Cache misses force disk reads, competing with replication I/O | pgmajfault rate 2x above baseline |
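The page cache warning sign in the table compares a rate against a baseline, while /proc/vmstat only exposes a cumulative pgmajfault counter. A minimal sketch of that comparison follows; the helper names `pgmajfault_count` and `majfault_alert` are hypothetical, and the baseline is assumed to come from your own measurements during normal operation.

```shell
# Print the cumulative pgmajfault counter from a /proc/vmstat-style file.
pgmajfault_count() {
  awk '$1 == "pgmajfault" { print $2 }' "${1:-/proc/vmstat}"
}

# Compare an observed major-fault rate with a baseline.
# $1 = baseline rate in faults per second
# $2 = first counter sample, $3 = second counter sample
# $4 = seconds between the samples (must be positive)
# Prints the observed rate; exit status 0 means it is above 2x the baseline.
majfault_alert() {
  awk -v base="$1" -v a="$2" -v b="$3" -v s="$4" 'BEGIN {
    if (s <= 0) { print "interval must be positive"; exit 2 }
    rate = (b - a) / s
    printf "%.1f faults/s\n", rate
    exit (rate > 2 * base) ? 0 : 1
  }'
}
```

A typical use is to call `pgmajfault_count` twice, ten seconds apart, on the lagging follower and pass both readings to `majfault_alert` together with the baseline.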
Fixes
Slow follower disk
If the lagging broker uses JBOD and only one disk is degraded, the partitions on that disk are the bottleneck. Reassign those partitions to a healthier broker. Do not restart the broker as a first response; a restart loses page cache and extends recovery time. If all disks are slow, reduce partition count on the broker or scale to faster storage. Reassignment is disruptive and keeps partitions under-replicated until it completes. Run it during low traffic.
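A reassignment is driven by a JSON plan passed to kafka-reassign-partitions.sh. The sketch below only builds that plan for a single partition. The helper name `reassignment_json`, the topic `orders`, and the broker IDs are hypothetical, and the commented command assumes a Kafka version whose tool accepts --bootstrap-server; check the tool's help output on your version before running it.

```shell
# Print a reassignment plan for one partition.
# $1 = topic, $2 = partition number, $3 = comma-separated target broker IDs
reassignment_json() {
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%d,"replicas":[%s]}]}\n' \
    "$1" "$2" "$3"
}

# Example plan: move partition 3 of the (hypothetical) topic "orders" to brokers 2, 4 and 5.
reassignment_json orders 3 "2,4,5"

# To apply it, save the output to a file and run, during low traffic:
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file reassign.json --execute --throttle 50000000
```

The --throttle value is in bytes per second and limits replication traffic for the move, which helps keep the reassignment from worsening the lag it is meant to fix.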
Network bottleneck between brokers
If BytesOutPerSec on the leader is near NIC capacity or RetransSegs is climbing, inter-broker replication competes with client traffic for egress. Isolate replication to a dedicated listener if possible. Otherwise, throttle heavy consumers or add network capacity. A dedicated listener requires broker restart and client reconfiguration.
Leader overloaded serving fetches
When FetchFollower latency is high on the leader but the follower is healthy, the leader has too many partitions or is saturated with consumer reads. If leadership has drifted away from the preferred replicas, trigger a preferred replica election to spread it back across brokers; if the preferred assignment itself is skewed, an election will not help and the partitions need reassignment. Either change is disruptive, because leadership moves cause brief client reconnect storms. Increasing num.io.threads can help if the bottleneck is concurrency and not disk I/O, but more threads increase memory pressure and context switching.
Follower JVM pauses or CPU saturation
If GC pauses correlate with lag spikes, tune the JVM heap. Kafka brokers typically run best with 4-8 GB heaps; larger heaps cause longer pauses. Ensure the broker is not colocated with other CPU-intensive workloads. If the follower is also a leader for many partitions, its CPU may be consumed by consumer fetch processing.
Imminent rebuild from scratch
If MaxLag in seconds exceeds the topic’s retention, the follower needs a full rebuild. Consider proactively reassigning the replica to a healthier broker rather than waiting for a catch-up that will never complete. Reassignment is disruptive and generates a large temporary traffic spike.
Prevention
- Monitor MaxLag as time, not offsets. Alert when lag_seconds exceeds a fraction of replica.lag.time.max.ms.
- Maintain disk headroom and IOPS margin on every node. A single slow JBOD disk can degrade a broker while the rest of the array is fine.
- Keep inter-broker network utilization below 70% of NIC capacity to absorb bursts.
- Avoid colocating other heavy workloads on Kafka brokers to prevent page cache eviction and CPU contention.
- Measure baseline recovery time during planned rolling restarts. Know how long your largest partitions take to catch up after a broker restart.
How Netdata helps
- Correlate replica MaxLag with per-node disk await and pgmajfault rate to identify follower-side disk bottlenecks.
- Visualize FetchFollower latency alongside RequestHandlerAvgIdlePercent to separate leader overload from network issues.
- Track ISR shrinks and expands on the same chart to spot flapping replicas before they trigger UnderReplicatedPartitions alerts.
- Set custom JMX alerts on MaxLag expressed in seconds relative to replica.lag.time.max.ms.
Related guides
- How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
- Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
- Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas: /guides/kafka/kafka-not-enough-replicas-exception/







