Kafka consumer group lag growing: detection, lag-as-time, and root causes
When a consumer group falls behind, lag climbs monotonically. If the committed offset crosses the retention boundary, consumers hit OffsetOutOfRangeException and must reset to earliest or latest, reprocessing or skipping data. Restarting the consumer is a common first reaction, but broker-side fetch latency, page cache eviction, and network saturation are equally common culprits. Detect lag accurately, convert it to time, and trace the root cause without guessing.
What this means
Consumer lag is log-end-offset minus the last committed offset for each partition in a group. Monotonic growth means the consumer is not keeping up with the production rate. The immediate risk is committed offset falling behind log retention, forcing a reset to earliest or latest.
Rate of change matters more than absolute offsets. One hundred thousand offsets might be one second on a high-throughput topic and two days on a low-throughput one. Convert lag to time and alert on growth rate.
flowchart TD
A[Growing consumer lag] --> B{Group state Stable?}
B -->|No - Rebalancing| C[Rebalance storm or member failure]
B -->|Yes| D{FetchConsumer LocalTimeMs high?}
D -->|Yes| E[Broker page cache or disk bottleneck]
D -->|No| F{BytesOutPerSec flat?}
F -->|Yes| G[Slow consumer processing]
F -->|No| H[Producer throughput spike or network limit]
D -->|ThrottleTimeMs high| I[Broker quota throttling]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow consumer processing | Lag grows steadily; group state Stable; consumer logs show slow downstream calls | Consumer max.poll.interval.ms and processing latency per batch |
| Consumer rebalance storm | Group state flips between Stable and PreparingRebalance; lag spikes during each cycle | kafka-consumer-groups.sh --describe output and consumer logs for eviction |
| Consumer crash or stop | Group state Empty; lag grows at exactly the production rate | Consumer process presence and exit codes |
| Broker page cache eviction | FetchConsumer LocalTimeMs spikes; pgmajfault rate jumps; tail consumers slow even reading recent data | OS page cache pressure and backfill consumers |
| Broker disk I/O saturation | RequestHandlerAvgIdlePercent drops; disk await elevated; fetch latency high | iostat -xz and per-broker disk metrics |
| Producer throughput spike | BytesInPerSec jumps; lag grows across all consumer groups simultaneously | Per-topic produce rate and traffic correlation |
| Network saturation on broker | BytesOutPerSec near NIC capacity; NetworkProcessorAvgIdlePercent low | NIC utilization and connection count |
| Quota throttling | ThrottleTimeMs elevated for fetch or produce; lag grows despite healthy consumers | Client quota configuration and violation metrics |
Quick checks
# Check current lag and committed offsets per partition
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# Check group coordination state (Stable, PreparingRebalance, Empty)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state
# Check active members and partition assignment
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --members --verbose
# Check broker-side consumer fetch latency (page cache vs disk)
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Check broker egress to consumers
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check I/O thread saturation (broker processing capacity)
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check page cache pressure (major faults indicate disk reads)
grep pgmajfault /proc/vmstat
# Check disk I/O latency
iostat -xz 1
How to diagnose it
- Confirm the trend, not the snapshot. Run
kafka-consumer-groups.sh --describetwice, spaced by a known interval. Calculatedelta(lag) / delta(time). If lag is flat or shrinking, the consumer caught up. If it is growing, proceed. - Check group state. If the state is
PreparingRebalanceorCompletingRebalancefor more than five minutes, the group is in a rebalance storm. Check consumer logs formax.poll.interval.ms exceededorCommitFailedException. - Check for Empty or Dead state. An
Emptygroup with growing lag means no consumer is running. Check whether the consumer process crashed or was stopped intentionally. - Correlate with broker fetch latency. High
FetchConsumerLocalTimeMsmeans the broker is reading from disk instead of page cache. Checkpgmajfaultrate. If it spiked, a backfill consumer or offset reset is thrashing the cache. - Check broker saturation. If
RequestHandlerAvgIdlePercentis below 0.3 andRequestQueueSizeis growing, the broker cannot process requests fast enough. Check diskawaitwithiostat. - Check for throttling. If
kafka.server:type=Request,name=ThrottleTimeMsis elevated, the broker is applying backpressure via quotas. Verify quota policies for the consumer or producer. - Convert lag to time. Estimate
lag_seconds = lag_offsets / produce_rate_per_sec. If this approaches half your retention period, treat it as an imminent data loss risk.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group lag (rate of change) | Direct measure of falling behind | Growing monotonically for > 5 minutes |
| Consumer group state | Rebalances pause all consumption | Stuck in PreparingRebalance > 5 min |
| FetchConsumer LocalTimeMs (p99) | Broker read-path health; page cache vs disk | Sustained spike above 2x baseline |
| BytesOutPerSec | Broker egress to consumers | Flat or dropping while lag grows |
| RequestHandlerAvgIdlePercent | Broker processing capacity | Sustained below 0.3 |
| pgmajfault rate | Page cache effectiveness | 2x baseline or sustained growth |
| Consumer rebalance rate | Group stability | > 3 rebalances per hour outside deploys |
| ThrottleTimeMs | Broker backpressure | p99 > 0 for consumer or producer |
Fixes
Slow consumer processing
Reduce max.poll.records to give the consumer smaller batches, or increase max.poll.interval.ms if processing is legitimately slow. Avoid blocking I/O inside the poll loop. If the bottleneck is downstream, scale the consumer instances up to the partition count limit. Tradeoff: more instances increase rebalance frequency.
Rebalance storm
Identify the cycling consumer from logs and isolate it. Switch to the cooperative sticky assignor from the eager assignor. For long-running tasks, increase max.poll.interval.ms or reduce batch size. Static group membership (group.instance.id) prevents unnecessary rebalances on rolling restarts.
Broker page cache eviction
Identify the backfill consumer group and throttle it with a consumer_byte_rate quota. If using Kafka 2.4+, consider follower fetching (replica.selector.class) to isolate historical reads from tail consumers. Tradeoff: follower fetching adds replication latency.
Broker disk or network saturation
If disk await is elevated, move partitions off the hot broker or add brokers to spread load. If network is saturated, verify NIC capacity and reduce replication fan-out where possible. Do not restart the broker as a first fix; that worsens the cold-start penalty.
Quota throttling
Raise the consumer or producer byte-rate quota if the workload has legitimately grown. If a single rogue client is causing throttling, impose a stricter quota or isolate it to a dedicated listener.
Consumer crash or offset loss
Restart the consumer. If it hits OffsetOutOfRangeException, decide whether to seek to earliest (reprocess) or latest (skip). For critical pipelines, force explicit handling rather than silent skipping.
Prevention
Monitor lag-as-time rather than absolute offsets. Alert when the rate of change is positive for more than five minutes. Monitor FetchConsumer latency and page cache pressure proactively so broker-side slowdowns surface before lag grows. Keep consumer configurations consistent across instances to avoid heartbeat timeouts. Test consumer recovery time during game days so you know how long a partition can lag before it becomes unrecoverable.
How Netdata helps
- Correlates lag growth with broker
FetchConsumerlatency and OS page cache metrics on the same timeline. - Surfaces
RequestHandlerAvgIdlePercent,NetworkProcessorAvgIdlePercent, and disk I/O latency from the broker host. - Tracks
pgmajfaultrate alongside Kafka JMX metrics to identify page cache thrashing from backfill consumers. - Estimates lag-as-time by combining offset lag with per-topic produce rates, avoiding false positives from absolute offset thresholds.
- Alerts on group state changes and rebalance rates without requiring manual CLI polling.
Related guides
- How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata: /guides/kafka/kafka-kraft-quorum-no-leader/
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections: /guides/kafka/kafka-leader-election-rate-high/
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation: /guides/kafka/kafka-leader-not-available/
- Kafka leadership imbalance: LeaderCount skew and preferred replica election: /guides/kafka/kafka-leadership-imbalance/
- Kafka min.insync.replicas and acks: configuring durability you actually have: /guides/kafka/kafka-min-insync-replicas-misconfigured/
- Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
- Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/
- Kafka ActiveControllerCount not equal to 1: no controller or split brain: /guides/kafka/kafka-no-active-controller/
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas: /guides/kafka/kafka-not-enough-replicas-exception/







