Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses

Your tail consumers are lagging, or your under-replicated partition count is climbing. On the broker, kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer or FetchFollower is elevated. Raw fetch latency is a poor signal: consumer long-polling means TotalTimeMs routinely includes the full fetch.max.wait.ms wait even on an idle topic, and follower fetches are paced by the leader’s ability to serve segments. The actionable metric is LocalTimeMs, the time the leader spends reading the log. When FetchConsumer LocalTimeMs spikes, the data was not in the OS page cache. When FetchFollower LocalTimeMs spikes, the leader is slow to serve replication reads and ISR shrinks will follow. The sections below show how to tell the two apart, confirm page cache misses, and fix the root cause.

What this means

TotalTimeMs for FetchConsumer and FetchFollower is a composite of RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ThrottleTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs. For consumer fetches, RemoteTimeMs often approaches fetch.max.wait.ms (default 500 ms) because the request sits in purgatory waiting for fetch.min.bytes to accumulate. On low-volume topics, raw TotalTimeMs therefore says almost nothing about how fast the broker is serving tail consumers.
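To make the breakdown concrete, here is a minimal sketch that picks the largest component from a set of readings. The function name dominant_component and the name=value argument format are illustrative assumptions, not part of any Kafka tooling. Percentiles of the individual components do not add up to the same percentile of TotalTimeMs, so treat the result as a pointer rather than an exact decomposition.

```shell
# dominant_component NAME=MS [NAME=MS ...]
# Prints the component with the largest value, followed by that value in ms.
dominant_component() {
  printf '%s\n' "$@" |
    awk -F= 'NR == 1 || $2 + 0 > max { max = $2 + 0; name = $1 }
             END { printf "%s %.1f\n", name, max }'
}
```

For an idle topic you would expect RemoteTimeMs to win, for example with `dominant_component RequestQueueTimeMs=0.2 LocalTimeMs=1.1 RemoteTimeMs=498.7 ResponseQueueTimeMs=0.1 ResponseSendTimeMs=0.4`.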

LocalTimeMs is the time the broker spends reading the log segment. For FetchConsumer, a LocalTimeMs spike means the segment was not in the OS page cache, so the broker read from disk. For FetchFollower, a LocalTimeMs spike means the leader is slow to build the replication response. If follower fetch latency stays high, followers fall behind the leader and are removed from the ISR once they have failed to catch up to the leader's log end for longer than replica.lag.time.max.ms.

flowchart TD
    A[Fetch latency spike] --> B{Which request type?}
    B -->|FetchConsumer| C[LocalTimeMs high?]
    B -->|FetchFollower| D[LocalTimeMs high?]
    C -->|Yes| E[Page cache miss or disk IO]
    C -->|No| F[RemoteTimeMs equals fetch.max.wait.ms]
    D -->|Yes| G[Leader disk slow serving replication]
    D -->|No| H[Network or queue wait]
    E --> I[Check pgmajfault and consumer lag]
    G --> J[Check UnderReplicatedPartitions and ISR shrinks]
    F --> K[Normal long-poll behavior]
    H --> L[Check RequestQueueTimeMs and ResponseSendTimeMs]
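
The decision tree above can be sketched as a small shell function. The name triage_fetch and the 10 ms default cutoff for "LocalTimeMs high" are assumptions for illustration; pick a cutoff from your own baseline.

```shell
# triage_fetch REQUEST_TYPE LOCAL_MS [HIGH_MS]
# HIGH_MS is a hypothetical cutoff for "LocalTimeMs high" (default 10 ms, an assumption).
triage_fetch() {
  req=$1 local_ms=$2 high_ms=${3:-10}
  # awk does the floating-point comparison; exit status 0 means "high"
  if awk -v v="$local_ms" -v t="$high_ms" 'BEGIN { exit !(v + 0 > t + 0) }'; then
    high=yes
  else
    high=no
  fi
  case "$req:$high" in
    FetchConsumer:yes) echo "page cache miss or disk I/O: check pgmajfault and consumer lag" ;;
    FetchConsumer:no)  echo "normal long-poll: RemoteTimeMs near fetch.max.wait.ms" ;;
    FetchFollower:yes) echo "leader disk slow: check UnderReplicatedPartitions and ISR shrinks" ;;
    FetchFollower:no)  echo "network or queue wait: check RequestQueueTimeMs and ResponseSendTimeMs" ;;
    *) echo "unknown request type: $req" >&2; return 1 ;;
  esac
}
```

Feed it the p99 LocalTimeMs for each request type, for example `triage_fetch FetchFollower 42.5`.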

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Consumer backfill evicting page cache | FetchConsumer LocalTimeMs jumps from near-zero to disk-read latency; disk read throughput spikes; tail consumers slow even though they read recent offsets | Consumer groups with large, actively shrinking lag |
| Cold page cache after broker restart | Elevated FetchConsumer LocalTimeMs across many topics; broker uptime under 60 minutes; no new consumer groups | /proc/vmstat pgmajfault rate and broker uptime |
| Leader disk I/O saturation serving replicas | FetchFollower LocalTimeMs high; IsrShrinksPerSec rising; UnderReplicatedPartitions growing on this broker’s leader partitions | Disk await on the leader’s log.dirs volumes |
| Request handler pool saturation | Both request types show high RequestQueueTimeMs; RequestHandlerAvgIdlePercent drops below 0.3; RequestQueueSize grows | RequestQueueSize and per-broker RequestHandlerAvgIdlePercent |
| Network thread saturation | ResponseSendTimeMs elevated; NetworkProcessorAvgIdlePercent low; connection count may be high | ResponseQueueSize and network processor idle percent |

Quick checks

Run these read-only commands on the broker showing elevated fetch latency.

# Check FetchConsumer latency breakdown
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=FetchConsumer 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Check FetchFollower latency breakdown
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchFollower 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Check page cache pressure (cumulative counter; sample twice to compute rate)
grep pgmajfault /proc/vmstat

# Check disk read latency
iostat -xz 1 5

# Check for under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check request handler saturation
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check consumer lag for a suspect group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id}
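
Because pgmajfault is a cumulative counter, a single reading says little. The following is a minimal sketch of sampling it twice and turning the difference into a rate. The function names and the 10-second default interval are illustrative assumptions.

```shell
# pgmajfault_count [FILE]: print the cumulative major-fault counter.
# FILE defaults to /proc/vmstat; the parameter lets you try it on a saved copy.
pgmajfault_count() {
  awk '$1 == "pgmajfault" { print $2 }' "${1:-/proc/vmstat}"
}

# fault_rate BEFORE AFTER SECONDS: major faults per second between two samples.
fault_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# sample_pgmajfault_rate [SECONDS]: take two live samples SECONDS apart (default 10).
sample_pgmajfault_rate() {
  secs=${1:-10}
  before=$(pgmajfault_count)
  sleep "$secs"
  after=$(pgmajfault_count)
  fault_rate "$before" "$after" "$secs"
}
```

Run `sample_pgmajfault_rate 10` during a quiet period to record a baseline, then again while latency is elevated, and compare the two numbers.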

How to diagnose it

  1. Confirm LocalTimeMs is the dominant component. Raw TotalTimeMs is not enough. Use the JMX breakdown to see whether LocalTimeMs, RequestQueueTimeMs, or ResponseSendTimeMs is driving the spike.
  2. Determine whether FetchConsumer or FetchFollower is affected. Consumer fetch problems point to page cache or consumer read patterns. Follower fetch problems point to leader disk or thread saturation.
  3. If FetchConsumer LocalTimeMs is high, check for page cache pressure. Read /proc/vmstat and compute the pgmajfault rate. If the rate is elevated above baseline, the OS is reading from disk. Correlate with consumer groups: look for a group with very large lag that is actively consuming.
  4. If the broker restarted recently, expect cold cache. Page cache is empty after restart. LocalTimeMs will be elevated for 10-60 minutes until the working set is back in memory. This is normal physics, not a configuration error.
  5. If FetchFollower LocalTimeMs is high, check leader disk health. Run iostat -xz 1 5 on the leader broker and look at await for the devices backing log.dirs. High await means the disk cannot serve reads fast enough to keep followers in sync.
  6. Check for ISR shrink velocity. If FetchFollower latency is high, IsrShrinksPerSec will likely rise. Confirm with kafka-topics.sh --describe --under-replicated-partitions.
  7. Rule out thread pool saturation. If RequestQueueTimeMs is high and RequestHandlerAvgIdlePercent is below 0.3, the broker cannot process requests fast enough regardless of disk speed. Check RequestQueueSize.
  8. Check network thread saturation if ResponseSendTimeMs is high. If NetworkProcessorAvgIdlePercent is below 0.3 or ResponseQueueSize is growing, network threads are the bottleneck.
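
For step 5, the read latency that iostat reports is derived from counters in /proc/diskstats. If iostat is not installed on the broker, a rough equivalent can be computed from two snapshots of that file. This is a sketch under assumptions: read_await is a made-up helper name, and the snapshot paths and device name in the usage note are placeholders.

```shell
# read_await DEVICE SNAP1 SNAP2
# SNAP1 and SNAP2 are copies of /proc/diskstats taken some seconds apart.
# Field 4 is reads completed; field 7 is milliseconds spent reading.
# Prints the average milliseconds per completed read between the snapshots.
read_await() {
  awk -v dev="$1" '
    $3 == dev && FNR == NR { r0 = $4; t0 = $7; next }
    $3 == dev             { r1 = $4; t1 = $7 }
    END {
      if (r1 - r0 > 0) printf "%.2f\n", (t1 - t0) / (r1 - r0)
      else print "0.00"
    }' "$2" "$3"
}
```

One way to use it: `cp /proc/diskstats /tmp/d1; sleep 5; cp /proc/diskstats /tmp/d2; read_await sda /tmp/d1 /tmp/d2`, substituting the device that backs log.dirs.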

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| FetchConsumer LocalTimeMs | Distinguishes page cache hits from physical disk reads | Sustained increase above near-zero baseline for tail consumers |
| FetchFollower LocalTimeMs | Measures how fast the leader serves replication reads | p99 exceeds 50% of replica.lag.time.max.ms |
| IsrShrinksPerSec | Velocity of replicas falling out of sync | Sustained above zero outside maintenance windows |
| UnderReplicatedPartitions | Current durability degradation | Nonzero and growing on a single broker |
| pgmajfault rate | OS-level page cache pressure | Rate doubles above baseline for more than 5 minutes |
| Disk await | Physical disk latency | SSD sustained above 20 ms; HDD sustained above 50 ms |
| RequestHandlerAvgIdlePercent | I/O thread capacity | Sustained below 0.3 |
| RequestQueueSize | Pressure between network and I/O threads | Sustained above 50% of queued.max.requests |
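
Two of the warning signs above are ratios against broker settings rather than fixed numbers. A minimal sketch of that check follows; over_fraction is a hypothetical helper, and the limits in the usage note assume the defaults in recent Kafka versions (queued.max.requests=500, replica.lag.time.max.ms=30000), so confirm the values in your own broker config.

```shell
# over_fraction VALUE LIMIT FRACTION
# Prints WARN when VALUE exceeds FRACTION of LIMIT, otherwise OK.
over_fraction() {
  awk -v v="$1" -v l="$2" -v f="$3" \
    'BEGIN { print ((v + 0 > l * f) ? "WARN" : "OK") }'
}
```

For example, `over_fraction 300 500 0.5` evaluates a RequestQueueSize of 300 against queued.max.requests, and `over_fraction 12000 30000 0.5` evaluates a FetchFollower p99 of 12 seconds against replica.lag.time.max.ms.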

Fixes

Isolate or throttle backfill consumers

Identify the consumer group with kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id}. If a single group is reading from the beginning of the log and evicting hot pages, apply a consumer_byte_rate quota via kafka-configs.sh to throttle its fetch rate. Quotas are keyed by client.id or user principal rather than by consumer group, so the backfill consumers need an identifiable client.id or user. Tradeoff: the backfill takes longer, but tail consumer latency recovers once recent segments are back in page cache.
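
As a sketch of what the quota command could look like, the helper below prints a kafka-configs.sh invocation instead of running it, so you can review it first. Everything specific here is an assumption: the function name quota_cmd, the client.id backfill-client, and the 10 MiB/s limit. The quota applies per broker, and managing client quotas through --bootstrap-server needs a reasonably recent Kafka (2.6 or later, to my knowledge); older clusters use the --zookeeper form.

```shell
# quota_cmd CLIENT_ID BYTES_PER_SEC
# Prints (does not run) a command that caps fetch throughput for one client.id.
quota_cmd() {
  printf '%s\n' "kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type clients --entity-name $1 --add-config consumer_byte_rate=$2"
}
```

`quota_cmd backfill-client 10485760` prints the command; pipe the output to `sh` once it looks right, and remove the quota with --delete-config consumer_byte_rate when the backfill finishes.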

Wait out cold cache after restarts

After a broker restart, expect elevated FetchConsumer LocalTimeMs for 10-60 minutes while the working set reloads into page cache. Do not restart additional brokers during this window, and do not declare the broker unhealthy. If catch-up traffic is too aggressive, throttle consumers as above.

Reduce leader disk pressure

If FetchFollower latency points to a specific broker, check per-log-dir disk metrics. On JBOD configurations, one slow disk degrades partitions on that directory without affecting others. Consider reassigning leadership away from the degraded broker. Removing the broker entirely triggers partition reassignment and can degrade availability; verify cluster capacity first.

Scale thread pools if saturated

If RequestQueueTimeMs is the dominant component and RequestHandlerAvgIdlePercent is below 0.3, increase num.io.threads. If ResponseQueueTimeMs is high or NetworkProcessorAvgIdlePercent is low, increase num.network.threads. On Kafka 1.1 and later, both settings are dynamic broker configs and can be changed with kafka-configs.sh without a restart; older versions require a rolling restart. Tradeoff: more threads increase context switching and memory use, and they do not fix a disk bottleneck.

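Offload consumer fetches to followers

For Kafka 2.4+, set replica.selector.class on the brokers (for example, org.apache.kafka.common.replica.RackAwareReplicaSelector) together with broker.rack, and set client.rack on consumers so they can fetch from an in-sync follower in the same rack instead of the leader. This moves consumer reads off overloaded leaders and leaves leader disk and thread capacity for replication reads. It requires a broker restart and up-front rack-aware planning; it is not a runtime toggle. Tradeoff: adds operational complexity, and consumers reading from a follower can see new records slightly later than they would from the leader.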

Prevention

  • Alert on FetchConsumer and FetchFollower LocalTimeMs, not TotalTimeMs.
  • Monitor OS pgmajfault rate on every broker as a leading indicator of page cache thrashing.
  • Establish consumer byte-rate quotas before backfill jobs start.
  • Measure broker restart warmup time in your environment so you know how long to expect elevated latency.
  • Keep num.io.threads and num.network.threads headroom; do not let RequestHandlerAvgIdlePercent fall below 0.5 during peak.

How Netdata helps

  • Netdata collects kafka.network JMX metrics including FetchConsumer and FetchFollower LocalTimeMs, removing the need for manual JMXterm queries.
  • Correlate FetchConsumer LocalTimeMs with OS pgmajfault rate on the same node to confirm page cache misses in one dashboard.
  • Visualize IsrShrinksPerSec alongside FetchFollower latency to catch replication slowdown before ISR shrinks cascade into under-replicated partitions.
  • Alert on RequestHandlerAvgIdlePercent and disk await together to distinguish thread saturation from disk bottlenecks.