Kafka page cache thrashing: the backfill consumer that 100x’s tail latency

Tail latency jumps from milliseconds to seconds while producer throughput and replication stay flat. CPU is normal. Every fetch request from tail consumers starts hitting disk. The culprit is usually a single backfill consumer reading historical data, evicting the hot working set from the OS page cache and turning a memory-speed system into a disk-bound one.

The write path stays green. Under-replicated partitions do not increase. No broker has crashed. The only visible signs are elevated disk read latency and slow consumers. If the cluster slowed suddenly with no broker fault, suspect page cache thrashing from a backfill consumer.

What this means

Kafka brokers serve consumer fetch requests from the OS page cache using zero-copy sendfile. When the working set fits in RAM, tail consumers read recent data at memory speed. A backfill consumer, whether from a new consumer group with auto.offset.reset=earliest, an intentional offset reset, or a reprocessing job, begins reading old log segments from disk. These reads pull cold data into the page cache and evict the hot data that tail consumers need. Every consumer, including those reading the newest offsets, then triggers major page faults and disk I/O. Latency degrades abruptly, often by 10x to 100x.

This is a read-path problem. The broker is not overloaded by writes. The request handler idle percent may remain healthy. Replication is unaffected. The root-cause signal lives in OS metrics, not Kafka JMX alone.

flowchart LR
    A[Backfill consumer reads old offsets] --> B[Disk reads rise]
    B --> C[Hot page cache evicted]
    C --> D[Tail consumers miss cache]
    D --> E[FetchConsumer latency spikes]

Common causes

Cause | What it looks like | First thing to check
New consumer group with earliest offset reset | A new group appears and consumes from the beginning of one or more topics | Consumer group lag for the new group is large and shrinking
Historical reprocessing job | A batch analytics or stream processing job starts reading days or weeks of old data | Job consumer group ID and offset position relative to log start
MirrorMaker or Connect bootstrap | Cross-cluster replication or connector task begins backfilling a new topic | Client IDs and consumer groups associated with Connect or MirrorMaker
Developer or operator offset reset | A manual offset reset to earliest on an existing group | Recent admin operations or consumer group state changes

Quick checks

Run these safe, read-only checks to confirm page cache thrashing and identify the source.

# Cumulative major page faults (sample twice to compute rate)
awk '/pgmajfault/ {print $2}' /proc/vmstat

# Check consumer groups for large active lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id}

# Check broker fetch latency breakdown (LocalTimeMs = time the leader spends processing the fetch locally, including log reads)
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Check disk read latency and utilization
iostat -xz 1

# Check broker egress (may rise due to backfill volume)
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

How to diagnose it

  1. Confirm the symptom pattern. Tail consumer latency spikes while producer traffic (BytesInPerSec) and replication (UnderReplicatedPartitions) remain steady. This rules out write-path saturation and replication lag.
  2. Check page cache pressure. Read pgmajfault from /proc/vmstat and compute the rate. A sharp increase above baseline confirms cache misses. This is the strongest leading indicator.
  3. Inspect FetchConsumer LocalTimeMs. If the 99th percentile jumps from near-zero to disk latency values (tens of milliseconds or higher), the broker is serving consumer fetches from disk instead of page cache.
  4. Identify the backfill consumer. Use kafka-consumer-groups.sh --describe to find groups with very large lag that are actively consuming (current offset moving toward log end offset). The onset time of the consumer should correlate with the latency spike.
  5. Correlate disk metrics. iostat should show elevated read await and read throughput without a corresponding write latency increase. If write await is also high, suspect disk degradation rather than cache thrashing.
  6. Check for recent consumer group changes. New groups, offset resets, or reprocessing job deployments are common triggers. Look for group state transitions or new member subscriptions.
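Step 4 can be partly scripted. The sketch below is an illustration, not part of any Kafka tooling: it sums the LAG column of kafka-consumer-groups.sh --describe output so that groups can be compared by total lag. The function name sum_lag is hypothetical, and the header-based column lookup is an assumption made because the column order has differed between Kafka versions.

```shell
# Sum the LAG column of `kafka-consumer-groups.sh --describe` output read from stdin.
# sum_lag is a hypothetical helper name, not a Kafka command.
sum_lag() {
  awk '
    # Until the header row is seen, look for the position of the LAG column.
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    # Count only numeric lag values; partitions without a committed offset show "-".
    $col ~ /^[0-9]+$/ { total += $col }
    END { print total + 0 }
  '
}

# Usage (replace {group-id} with a real group):
# kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id} | sum_lag
```

Running it twice a minute or so apart shows whether a large lag is shrinking, which is what distinguishes an active backfill from a stalled group.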

Metrics and signals to monitor

Signal | Why it matters | Warning sign
pgmajfault rate from /proc/vmstat | Major page faults mean data was not in cache and had to be read from disk | Sustained rate 2x or more above baseline
FetchConsumer LocalTimeMs p99 | Time the broker spends reading from local log for consumer fetches | Jumps from sub-millisecond to double-digit milliseconds
Disk read await from iostat | Read latency at the block layer | Sustained above 20 ms for SSDs or 50 ms for HDDs
Consumer group lag | Identifies which consumers are reading far behind the log end | Large lag on an active group that started recently
BytesOutPerSec | Egress may rise simply because more data is being read | Increases without a matching increase in BytesInPerSec
UnderReplicatedPartitions | Helps exclude write-path and replication problems | Remains zero or stable while latency degrades
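The pgmajfault counter in /proc/vmstat is cumulative, so the warning sign in the table has to be derived from two samples. A minimal sketch of that arithmetic follows. The function names, the 10-second interval, and the BASELINE variable are all assumptions for illustration; the 2x factor is the one from the table.

```shell
# Print the cumulative pgmajfault counter from a vmstat-format file
# (defaults to /proc/vmstat when no path is given).
read_pgmajfault() {
  awk '$1 == "pgmajfault" { print $2 }' "${1:-/proc/vmstat}"
}

# Major faults per second between two samples: fault_rate FIRST SECOND INTERVAL_SECONDS
fault_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Succeed when RATE is at least FACTOR times BASELINE (FACTOR defaults to 2):
# above_baseline RATE BASELINE [FACTOR]
above_baseline() {
  [ "$1" -ge $(( $2 * ${3:-2} )) ]
}

# Usage, assuming BASELINE holds a rate measured during normal tail consumption:
# first=$(read_pgmajfault); sleep 10; second=$(read_pgmajfault)
# rate=$(fault_rate "$first" "$second" 10)
# above_baseline "$rate" "$BASELINE" && echo "major fault rate ${rate}/s is at least 2x baseline"
```

A single elevated sample is not enough to act on; the table's warning sign is a sustained rate, so repeat the measurement over several intervals.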

Fixes

Throttle the backfill consumer

The fastest fix is to apply a consumer_byte_rate quota to the offending client ID or user. This limits how fast the consumer can read historical data, so cold segments enter the page cache slowly enough for the hot working set to stay resident. Kafka enforces quotas per broker, not cluster-wide, so a consumer fetching from several brokers can read up to the quota from each of them; size the quota with that in mind. The throttled consumer still makes progress, just more slowly.

Tradeoff: Quotas prolong the backfill duration. If the consumer has an SLA for catching up, you may need to balance throttle rate against completion time.
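As a rough sketch of how such a quota might be sized and applied: because the quota applies per broker, a target aggregate read rate is divided by the number of brokers the consumer fetches from. The helper names, the client ID backfill-job, and the 30 MiB/s target are hypothetical. The printed command uses kafka-configs.sh with --bootstrap-server, which assumes a reasonably recent Kafka release; older releases managed quotas through ZooKeeper instead.

```shell
# Per-broker quota for a target aggregate rate: per_broker_quota TOTAL_BYTES_PER_SEC BROKER_COUNT
per_broker_quota() {
  echo $(( $1 / $2 ))
}

# Print (do not run) the kafka-configs.sh command that sets the quota for a client ID:
# quota_command CLIENT_ID BYTES_PER_SEC
quota_command() {
  printf 'kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type clients --entity-name %s --add-config consumer_byte_rate=%s\n' "$1" "$2"
}

# Usage: cap a hypothetical client "backfill-job" at about 30 MiB/s across 3 brokers.
# quota_command backfill-job "$(per_broker_quota 31457280 3)"
```

Printing the command rather than executing it leaves room for review before the quota change reaches the cluster.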

Redirect reads to follower replicas

Configure follower fetching via KIP-392 so that backfill consumers read from follower brokers rather than partition leaders. By directing large historical reads to dedicated follower replicas, you isolate the page cache impact away from the brokers serving tail consumers.

Tradeoff: This requires broker configuration (replica.selector.class) and consumer-side rack awareness. It also does not help if the follower is already serving tail consumers, since the same thrashing occurs on that broker.
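For orientation, the settings involved in KIP-392 follower fetching look roughly like the following. The sketch only prints property lines for review; the function names and the rack ID rack-backfill are hypothetical, and the rack layout for your cluster is an assumption you will need to replace.

```shell
# Print the broker-side properties for a broker in the given rack:
# broker_follower_fetch_config RACK_ID
broker_follower_fetch_config() {
  printf 'broker.rack=%s\n' "$1"
  printf 'replica.selector.class=%s\n' 'org.apache.kafka.common.replica.RackAwareReplicaSelector'
}

# Print the consumer-side property that asks for a replica in the same rack:
# consumer_follower_fetch_config RACK_ID
consumer_follower_fetch_config() {
  printf 'client.rack=%s\n' "$1"
}

# Usage with a hypothetical rack ID:
# broker_follower_fetch_config rack-backfill      # lines for server.properties
# consumer_follower_fetch_config rack-backfill    # line for the backfill consumer's config
```

The consumer is only steered to a follower when a replica of the partition exists in the rack named by client.rack; otherwise it keeps reading from the leader.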

Stop non-essential backfill

If the consumer is a non-production reprocessing job, developer test, or accidental offset reset, stop the consumers in that group. The page cache recovers on its own as tail consumers re-read the hot working set.

Tradeoff: The backfill does not complete. This is only viable when the historical read is discretionary.

Let it finish under monitoring

If the backfill is critical and cannot be throttled or redirected, you may choose to let it run to completion. Monitor pgmajfault and tail consumer lag. Warn downstream teams that tail latency will remain elevated until the consumer reaches the log end and the working set warms back into cache.

Tradeoff: All tail consumers suffer until the backfill completes. This is a business decision, not a technical fix.

Prevention

  • Monitor pgmajfault proactively. Many Kafka operators track JMX metrics but overlook OS page cache signals. Track major page fault rates on every broker and baseline them during normal tail consumption.
  • Preemptively quota known backfill jobs. If you run periodic batch reprocessing or analytics consumers, assign them a low consumer_byte_rate quota before they start.
  • Require review for offset resets. Treat offset reset operations as infrastructure changes. A single reset to earliest can degrade an entire cluster.
  • Alert on new consumer groups. A new consumer group with large initial lag reading from the earliest offset is a leading indicator of impending page cache thrashing.
  • Use follower fetch for large historical consumers. Isolate backfill traffic to follower replicas so leaders retain their hot cache for tail consumers.
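The new-consumer-group alert can be approximated by diffing the current group list against a list of known groups. This is a minimal sketch under stated assumptions: the helper name new_groups and the file names are hypothetical, and the known-groups file is assumed to contain one group name per line with at least one entry.

```shell
# Print group names that appear in CURRENT_FILE but not in KNOWN_FILE:
# new_groups KNOWN_FILE CURRENT_FILE
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
new_groups() {
  grep -F -x -v -f "$1" "$2"
}

# Usage, assuming known-groups.txt is maintained by your team:
# kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list > current-groups.txt
# new_groups known-groups.txt current-groups.txt
```

Any name it prints is a candidate for a --describe check, to see whether the group started from the earliest offset with large lag.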

How Netdata helps

  • Correlates OS and Kafka signals. Netdata collects pgmajfault alongside Kafka JMX metrics such as FetchConsumer latency, showing page cache pressure and broker read latency on the same timeline.
  • Surfaces consumer lag per group. Continuous per-group lag monitoring helps you spot a backfill consumer before it evicts the entire working set.
  • Splits read and write latency. Disk read latency charts are separate from write latency, confirming a read-path issue without parsing iostat.
  • Alerts on major page fault spikes. OS-level alerting on pgmajfault rate catches page cache thrashing in its early phase, before tail latency degrades fully.