Kafka consumer group lag growing: detection, lag-as-time, and root causes

When a consumer group falls behind, lag climbs monotonically. If the committed offset crosses the retention boundary, consumers hit OffsetOutOfRangeException and must reset to earliest or latest, reprocessing or skipping data. Restarting the consumer is a common first reaction, but broker-side fetch latency, page cache eviction, and network saturation are equally common culprits. Detect lag accurately, convert it to time, and trace the root cause without guessing.

What this means

Consumer lag is log-end-offset minus the last committed offset for each partition in a group. Monotonic growth means the consumer is not keeping up with the production rate. The immediate risk is committed offset falling behind log retention, forcing a reset to earliest or latest.

Rate of change matters more than absolute offsets. One hundred thousand offsets might be one second on a high-throughput topic and two days on a low-throughput one. Convert lag to time and alert on growth rate.

flowchart TD
    A[Growing consumer lag] --> B{Group state Stable?}
    B -->|No - Rebalancing| C[Rebalance storm or member failure]
    B -->|Yes| D{FetchConsumer LocalTimeMs high?}
    D -->|Yes| E[Broker page cache or disk bottleneck]
    D -->|No| F{BytesOutPerSec flat?}
    F -->|Yes| G[Slow consumer processing]
    F -->|No| H[Producer throughput spike or network limit]
    D -->|ThrottleTimeMs high| I[Broker quota throttling]

Common causes

CauseWhat it looks likeFirst thing to check
Slow consumer processingLag grows steadily; group state Stable; consumer logs show slow downstream callsConsumer max.poll.interval.ms and processing latency per batch
Consumer rebalance stormGroup state flips between Stable and PreparingRebalance; lag spikes during each cyclekafka-consumer-groups.sh --describe output and consumer logs for eviction
Consumer crash or stopGroup state Empty; lag grows at exactly the production rateConsumer process presence and exit codes
Broker page cache evictionFetchConsumer LocalTimeMs spikes; pgmajfault rate jumps; tail consumers slow even reading recent dataOS page cache pressure and backfill consumers
Broker disk I/O saturationRequestHandlerAvgIdlePercent drops; disk await elevated; fetch latency highiostat -xz and per-broker disk metrics
Producer throughput spikeBytesInPerSec jumps; lag grows across all consumer groups simultaneouslyPer-topic produce rate and traffic correlation
Network saturation on brokerBytesOutPerSec near NIC capacity; NetworkProcessorAvgIdlePercent lowNIC utilization and connection count
Quota throttlingThrottleTimeMs elevated for fetch or produce; lag grows despite healthy consumersClient quota configuration and violation metrics

Quick checks

# Check current lag and committed offsets per partition
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>

# Check group coordination state (Stable, PreparingRebalance, Empty)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state

# Check active members and partition assignment
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --members --verbose
# Check broker-side consumer fetch latency (page cache vs disk)
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Check broker egress to consumers
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check I/O thread saturation (broker processing capacity)
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check page cache pressure (major faults indicate disk reads)
grep pgmajfault /proc/vmstat

# Check disk I/O latency
iostat -xz 1

How to diagnose it

  1. Confirm the trend, not the snapshot. Run kafka-consumer-groups.sh --describe twice, spaced by a known interval. Calculate delta(lag) / delta(time). If lag is flat or shrinking, the consumer caught up. If it is growing, proceed.
  2. Check group state. If the state is PreparingRebalance or CompletingRebalance for more than five minutes, the group is in a rebalance storm. Check consumer logs for max.poll.interval.ms exceeded or CommitFailedException.
  3. Check for Empty or Dead state. An Empty group with growing lag means no consumer is running. Check whether the consumer process crashed or was stopped intentionally.
  4. Correlate with broker fetch latency. High FetchConsumer LocalTimeMs means the broker is reading from disk instead of page cache. Check pgmajfault rate. If it spiked, a backfill consumer or offset reset is thrashing the cache.
  5. Check broker saturation. If RequestHandlerAvgIdlePercent is below 0.3 and RequestQueueSize is growing, the broker cannot process requests fast enough. Check disk await with iostat.
  6. Check for throttling. If kafka.server:type=Request,name=ThrottleTimeMs is elevated, the broker is applying backpressure via quotas. Verify quota policies for the consumer or producer.
  7. Convert lag to time. Estimate lag_seconds = lag_offsets / produce_rate_per_sec. If this approaches half your retention period, treat it as an imminent data loss risk.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Consumer group lag (rate of change)Direct measure of falling behindGrowing monotonically for > 5 minutes
Consumer group stateRebalances pause all consumptionStuck in PreparingRebalance > 5 min
FetchConsumer LocalTimeMs (p99)Broker read-path health; page cache vs diskSustained spike above 2x baseline
BytesOutPerSecBroker egress to consumersFlat or dropping while lag grows
RequestHandlerAvgIdlePercentBroker processing capacitySustained below 0.3
pgmajfault ratePage cache effectiveness2x baseline or sustained growth
Consumer rebalance rateGroup stability> 3 rebalances per hour outside deploys
ThrottleTimeMsBroker backpressurep99 > 0 for consumer or producer

Fixes

Slow consumer processing

Reduce max.poll.records to give the consumer smaller batches, or increase max.poll.interval.ms if processing is legitimately slow. Avoid blocking I/O inside the poll loop. If the bottleneck is downstream, scale the consumer instances up to the partition count limit. Tradeoff: more instances increase rebalance frequency.

Rebalance storm

Identify the cycling consumer from logs and isolate it. Switch to the cooperative sticky assignor from the eager assignor. For long-running tasks, increase max.poll.interval.ms or reduce batch size. Static group membership (group.instance.id) prevents unnecessary rebalances on rolling restarts.

Broker page cache eviction

Identify the backfill consumer group and throttle it with a consumer_byte_rate quota. If using Kafka 2.4+, consider follower fetching (replica.selector.class) to isolate historical reads from tail consumers. Tradeoff: follower fetching adds replication latency.

Broker disk or network saturation

If disk await is elevated, move partitions off the hot broker or add brokers to spread load. If network is saturated, verify NIC capacity and reduce replication fan-out where possible. Do not restart the broker as a first fix; that worsens the cold-start penalty.

Quota throttling

Raise the consumer or producer byte-rate quota if the workload has legitimately grown. If a single rogue client is causing throttling, impose a stricter quota or isolate it to a dedicated listener.

Consumer crash or offset loss

Restart the consumer. If it hits OffsetOutOfRangeException, decide whether to seek to earliest (reprocess) or latest (skip). For critical pipelines, force explicit handling rather than silent skipping.

Prevention

Monitor lag-as-time rather than absolute offsets. Alert when the rate of change is positive for more than five minutes. Monitor FetchConsumer latency and page cache pressure proactively so broker-side slowdowns surface before lag grows. Keep consumer configurations consistent across instances to avoid heartbeat timeouts. Test consumer recovery time during game days so you know how long a partition can lag before it becomes unrecoverable.

How Netdata helps

  • Correlates lag growth with broker FetchConsumer latency and OS page cache metrics on the same timeline.
  • Surfaces RequestHandlerAvgIdlePercent, NetworkProcessorAvgIdlePercent, and disk I/O latency from the broker host.
  • Tracks pgmajfault rate alongside Kafka JMX metrics to identify page cache thrashing from backfill consumers.
  • Estimates lag-as-time by combining offset lag with per-topic produce rates, avoiding false positives from absolute offset thresholds.
  • Alerts on group state changes and rebalance rates without requiring manual CLI polling.
  • How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
  • Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
  • Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
  • Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata: /guides/kafka/kafka-kraft-quorum-no-leader/
  • Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections: /guides/kafka/kafka-leader-election-rate-high/
  • Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation: /guides/kafka/kafka-leader-not-available/
  • Kafka leadership imbalance: LeaderCount skew and preferred replica election: /guides/kafka/kafka-leadership-imbalance/
  • Kafka min.insync.replicas and acks: configuring durability you actually have: /guides/kafka/kafka-min-insync-replicas-misconfigured/
  • Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
  • Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/
  • Kafka ActiveControllerCount not equal to 1: no controller or split brain: /guides/kafka/kafka-no-active-controller/
  • Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas: /guides/kafka/kafka-not-enough-replicas-exception/