Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts

A CommitFailedException stating that the group has already rebalanced and assigned the partitions to another member means the time between poll() calls exceeded max.poll.interval.ms. The consumer was removed from the group, and the coordinator rejected its in-flight offset commit.

When one consumer is evicted, the group rebalances. If other consumers are also slow, or if the rebalance itself takes long enough that healthy consumers miss the same deadline, the group enters a rebalance storm: it oscillates between JoinGroup and SyncGroup without stabilizing, and lag grows without bound.

This is almost always a client-side processing problem, not a broker outage. The fix is on the consumer side. This guide shows how to confirm the diagnosis, stop the storm, and prevent recurrence.

What this means

A consumer must call poll() at least once every max.poll.interval.ms. A background heartbeat thread keeps the session alive, but the poll deadline is tracked separately: in the Java client, the heartbeat thread notices that the deadline has expired and leaves the group, after which the coordinator reassigns the partitions. session.timeout.ms is enforced via heartbeats and only proves the consumer process is reachable. The poll deadline proves the consumer thread is actively making progress. A consumer blocked in a database or HTTP call keeps heartbeating normally and is still removed from the group once it misses the poll deadline.
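The two liveness checks described above are independent, which a small model can make concrete. This is a simplified sketch, not Kafka's implementation: the class LivenessModel, its Verdict enum, and the method evaluate are hypothetical names introduced only for illustration.

```java
/** Hypothetical, simplified model of the two independent consumer liveness checks. */
final class LivenessModel {
    enum Verdict { IN_GROUP, SESSION_EXPIRED, POLL_DEADLINE_MISSED }

    /**
     * @param msSinceLastHeartbeat time since the last successful heartbeat
     * @param msSinceLastPoll      time since the application last called poll()
     * @param sessionTimeoutMs     value of session.timeout.ms
     * @param maxPollIntervalMs    value of max.poll.interval.ms
     */
    static Verdict evaluate(long msSinceLastHeartbeat, long msSinceLastPoll,
                            long sessionTimeoutMs, long maxPollIntervalMs) {
        // Heartbeats prove only that the process is reachable.
        if (msSinceLastHeartbeat > sessionTimeoutMs) {
            return Verdict.SESSION_EXPIRED;
        }
        // The poll deadline proves the processing thread is making progress.
        if (msSinceLastPoll > maxPollIntervalMs) {
            return Verdict.POLL_DEADLINE_MISSED;
        }
        return Verdict.IN_GROUP;
    }
}
```

With illustrative values of 45 seconds for the session timeout and 5 minutes for the poll interval, a consumer that heartbeated 3 seconds ago but last polled 310 seconds ago evaluates to POLL_DEADLINE_MISSED, which is the case this guide is about.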

If batch processing exceeds the interval, the coordinator removes the consumer. Subsequent offset commits fail because the member ID is no longer recognized, and the consumer must rejoin, triggering a rebalance.

With the default eager assignor, a rebalance revokes all partitions and stops consumption across the group. If other consumers are near their processing limit, the rebalance time pushes them past max.poll.interval.ms. They are evicted too, causing another rebalance. The group oscillates between Stable and PreparingRebalance, lag grows without bound, and the only resolution is to fix processing time or remove slow members.

flowchart TD
    A[Slow batch processing on poll thread] --> B[Time between poll calls exceeds max.poll.interval.ms]
    B --> C[Group coordinator evicts consumer]
    C --> D[Offset commit rejected]
    D --> E[CommitFailedException raised]
    C --> F[Group rebalances partitions]
    F --> G[Other consumers pause during rebalance]
    G --> H[They also miss max.poll.interval.ms]
    H --> I[Rebalance storm begins]
    I --> J[Consumer lag grows unchecked]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Batch processing time exceeds max.poll.interval.ms | CommitFailedException after consistent processing duration; logs show long elapsed time between polls | Application logs for batch latency relative to max.poll.interval.ms |
| Blocking I/O on the poll thread | Threads blocked on database or HTTP calls; GC is clean but the poll interval is still exceeded | Thread dump for blocked consumer threads |
| Poison pill or crashing consumer | Lag concentrated on one partition; a specific instance repeatedly joins and leaves | Consumer logs for parse errors or crashes at a specific offset |
| Deployment causing synchronized consumer restarts | All consumers restart together; group state cycles rapidly for minutes after startup | Group state history correlated with deployment timestamps |

Quick checks

# Check group state, membership, lag, and partition assignment.
# Look for STATE, CURRENT-OFFSET, LOG-END-OFFSET, LAG, and CONSUMER-ID.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# Inspect consumer threads. Look for the thread running poll() stuck in application code.
jstack <consumer_pid>
# Check consumer GC behavior for pauses that delay poll.
# Watch for rapid growth in FGC (full GC count) or FGCT (cumulative full GC time).
jstat -gcutil <consumer_pid> 1000
# Verify the coordinator broker request queue is not saturated.
# This requires JMX enabled and jmxterm installed.
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

How to diagnose it

  1. Read the exception text. Confirm the log line says that the group has already rebalanced and that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms. The exact wording varies by client version. A message that does not cite the poll interval points to a different rebalance trigger. Distinguish this from session.timeout.ms eviction, which produces heartbeat errors rather than a CommitFailedException citing the poll interval.

  2. Check consumer group state. Run kafka-consumer-groups.sh --describe. If the group cycles between Stable, PreparingRebalance, and CompletingRebalance, you are in a rebalance storm. Note which member IDs repeatedly join and leave.

  3. Correlate eviction with processing time. Compare batch processing time from application metrics or logs against max.poll.interval.ms. If processing approaches or exceeds the interval, you have the direct cause.

  4. Check for broker-side coordination pressure. Verify that the group coordinator broker shows a normal RequestHandlerAvgIdlePercent and a low request queue size. A healthy broker means the problem is strictly client-side. If the coordinator broker is saturated, rebalances slow down, which can push healthy consumers past their poll deadline while the group waits.

  5. Identify poison pills or hot partitions. If lag is concentrated on a subset of partitions, inspect the assigned consumer for parse errors or unexpectedly large messages.

  6. Review deployment timing. If the issue began after a deployment, check whether consumers restarted simultaneously and whether group.initial.rebalance.delay.ms (a broker-side setting) is long enough for them to join in a single wave.
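Step 3 is easier when the application records the gap between consecutive poll() calls itself. The following is a minimal sketch of such instrumentation; PollGapTracker is a hypothetical helper, not part of the Kafka client, and it assumes you call recordPoll with System.nanoTime() immediately before each poll().

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: tracks the gap between consecutive poll() calls. */
final class PollGapTracker {
    private final long maxPollIntervalMs;
    private boolean started = false;
    private long lastPollNanos;
    private long worstGapMs = 0;

    PollGapTracker(long maxPollIntervalMs) {
        if (maxPollIntervalMs <= 0) {
            throw new IllegalArgumentException("maxPollIntervalMs must be positive");
        }
        this.maxPollIntervalMs = maxPollIntervalMs;
    }

    /** Records a poll at the given nanoTime reading; returns the gap in ms (0 on the first call). */
    long recordPoll(long nowNanos) {
        long gapMs = started ? TimeUnit.NANOSECONDS.toMillis(nowNanos - lastPollNanos) : 0;
        started = true;
        lastPollNanos = nowNanos;
        worstGapMs = Math.max(worstGapMs, gapMs);
        return gapMs;
    }

    /** Longest gap between two polls seen so far, in milliseconds. */
    long worstGapMs() {
        return worstGapMs;
    }

    /** Fraction of max.poll.interval.ms consumed by the worst gap (1.0 means at the deadline). */
    double worstBudgetUsed() {
        return (double) worstGapMs / maxPollIntervalMs;
    }
}
```

Logging worstBudgetUsed() periodically gives a direct number to compare against the eviction timestamps: values approaching 1.0 shortly before a CommitFailedException point at processing time as the cause.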

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Consumer group state | Tells you if the group is actively rebalancing or stable | Stuck in PreparingRebalance or CompletingRebalance for more than a few minutes |
| Consumer group lag | Unprocessed messages accumulate while consumers cannot fetch | Lag growing monotonically, especially during rebalances |
| Consumer rebalance rate | Frequency of group membership changes | More than 2 to 3 rebalances per hour outside of deployments |
| Group coordinator broker CPU | A saturated coordinator can slow rebalance completion | Coordinator CPU elevated while other brokers appear normal |
| Consumer batch processing time | Directly determines whether max.poll.interval.ms will be violated | p99 processing time exceeds 50 percent of max.poll.interval.ms |
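The batch processing time warning sign can be evaluated from a window of recorded batch durations. This is a sketch under assumptions: BatchLatencyCheck is a hypothetical class, the percentile uses the simple nearest-rank method, and a real deployment would more likely take p99 from its metrics library.

```java
import java.util.Arrays;

/** Hypothetical check: does p99 batch time exceed half of max.poll.interval.ms? */
final class BatchLatencyCheck {

    /** Nearest-rank percentile for p in (0, 100] over a non-empty sample array. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();   // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when p99 batch duration is above 50 percent of the poll interval. */
    static boolean shouldWarn(long[] batchDurationsMs, long maxPollIntervalMs) {
        return percentile(batchDurationsMs, 99.0) > maxPollIntervalMs / 2;
    }
}
```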

Fixes

Shrink the batch or extend the deadline

Reduce max.poll.records so each poll returns fewer records and processing finishes within the interval. If processing is legitimately long-running, increase max.poll.interval.ms. The tradeoff is slower detection of genuinely crashed consumers. Set the interval to at least twice the observed peak processing time. Very small batches increase network overhead and can hurt throughput without fixing a fundamental bottleneck, so tune incrementally.
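The sizing rule reduces to simple arithmetic. The sketch below assumes you have measured worst-case per-record latency and peak batch time yourself; PollBudget and its methods are hypothetical names, and the headroom argument encodes the "at least twice" guideline as a factor of 2.

```java
/** Hypothetical helper for sizing max.poll.records and max.poll.interval.ms. */
final class PollBudget {

    /**
     * Largest max.poll.records for which a worst-case batch takes at most
     * maxPollIntervalMs / headroom. Never returns less than 1.
     */
    static int maxPollRecords(long maxPollIntervalMs, long worstCasePerRecordMs, int headroom) {
        if (maxPollIntervalMs <= 0 || worstCasePerRecordMs <= 0 || headroom <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        long records = maxPollIntervalMs / Math.multiplyExact(worstCasePerRecordMs, (long) headroom);
        return (int) Math.max(1, Math.min(records, Integer.MAX_VALUE));
    }

    /** Smallest max.poll.interval.ms that is headroom times the observed peak batch time. */
    static long minPollIntervalMs(long peakBatchMs, int headroom) {
        if (peakBatchMs <= 0 || headroom <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        return Math.multiplyExact(peakBatchMs, (long) headroom);
    }
}
```

For example, with a 300,000 ms interval, a 200 ms worst-case record, and headroom 2, maxPollRecords yields 750. If a single record can take longer than the budget allows, the method returns 1, which is a signal that the interval itself needs to grow or processing needs to move off the poll thread.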

Move processing off the poll thread

Hand records to a worker thread pool and call poll() on schedule. Use pause() and resume() to stop fetching when the internal queue is full, and commit offsets only after the workers finish.

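Warning: Committing offsets before the workers confirm completion risks skipping records if the consumer crashes or the group rebalances. Work still in flight when partitions are revoked can also be processed again by the new owner. The tradeoff of this design is added complexity and OOM risk if backpressure fails.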
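The pause and resume decision can be isolated from the Kafka client so it is easy to test. The following sketch shows only that decision logic, using high and low watermarks on the hand-off queue depth; BackpressureGate and its Action enum are hypothetical names, and the watermark values are for you to choose.

```java
/** Hypothetical gate that decides when the poll loop should pause or resume fetching. */
final class BackpressureGate {
    enum Action { PAUSE, RESUME, NONE }

    private final int highWatermark;
    private final int lowWatermark;
    private boolean paused = false;

    BackpressureGate(int highWatermark, int lowWatermark) {
        if (lowWatermark < 0 || lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("require 0 <= lowWatermark < highWatermark");
        }
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /** Call once per poll-loop iteration with the number of queued, unprocessed records. */
    Action onQueueDepth(int queuedRecords) {
        if (!paused && queuedRecords >= highWatermark) {
            paused = true;
            return Action.PAUSE;     // stop fetching, keep polling
        }
        if (paused && queuedRecords <= lowWatermark) {
            paused = false;
            return Action.RESUME;    // workers have caught up
        }
        return Action.NONE;
    }

    boolean isPaused() {
        return paused;
    }
}
```

In a real poll loop, PAUSE would map to consumer.pause(consumer.assignment()) and RESUME to consumer.resume(consumer.paused()). The loop keeps calling poll() while paused, which returns no records for paused partitions but keeps the poll deadline satisfied. The gap between the two watermarks prevents the loop from toggling on every iteration.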

Remove poison pills and crashing members

Stop a continuously failing consumer instance to let the group stabilize. Fix the poison message with a dead-letter topic or skip logic so the consumer does not crash-loop. One unstable member can keep a group in perpetual rebalancing.

Warning: Stopping a consumer instance is disruptive. Do this only when the instance is crash-looping and actively preventing the group from stabilizing.
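Skip logic for poison messages can be as small as a bounded retry around the record handler. This is a sketch under assumptions: PoisonPillGuard is a hypothetical helper, the deadLetter callback stands in for whatever publishes to your dead-letter topic, and only unchecked exceptions are treated as processing failures.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: retry each record a bounded number of times, then dead-letter it. */
final class PoisonPillGuard {

    /**
     * Processes records in order. A record whose handler still throws after
     * maxAttempts tries is handed to deadLetter and skipped, so one bad
     * message cannot crash-loop the consumer.
     *
     * @return the number of records sent to deadLetter
     */
    static <R> int processAll(List<R> records, Consumer<R> handler,
                              Consumer<R> deadLetter, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        int deadLettered = 0;
        for (R record : records) {
            boolean done = false;
            for (int attempt = 1; attempt <= maxAttempts && !done; attempt++) {
                try {
                    handler.accept(record);
                    done = true;
                } catch (RuntimeException e) {
                    // Real code would log the failure with the record's partition and offset.
                }
            }
            if (!done) {
                deadLetter.accept(record);
                deadLettered++;
            }
        }
        return deadLettered;
    }
}
```

Retries run on the calling thread, so when this is used inside the poll loop, keep maxAttempts small: every retry spends part of the max.poll.interval.ms budget.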

Reduce rebalance churn

Enable CooperativeStickyAssignor (available since Kafka 2.4) for incremental rebalances. Partitions are revoked only when necessary, so consumption does not fully stop during every rebalance. Use static membership (group.instance.id) so planned restarts do not trigger immediate rebalances, provided the instance rejoins before session.timeout.ms expires. Neither protects against a poll-loop timeout, but both reduce deployment-related churn.
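Both settings are plain consumer properties. The sketch below builds them with java.util.Properties and string keys so it has no dependency on the Kafka client; ChurnReducingConfig is a hypothetical class name, and the instance ID is an assumption you would replace with something stable per instance, such as a pod or host name.

```java
import java.util.Properties;

/** Hypothetical factory for consumer properties that reduce rebalance churn. */
final class ChurnReducingConfig {

    static Properties consumerProps(String instanceId) {
        if (instanceId == null || instanceId.isEmpty()) {
            throw new IllegalArgumentException("instanceId must be non-empty");
        }
        Properties props = new Properties();
        // Incremental (cooperative) rebalancing instead of revoking everything.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Static membership: must be unique per instance and stable across restarts.
        props.setProperty("group.instance.id", instanceId);
        return props;
    }
}
```

These properties would be merged with the usual bootstrap.servers, group.id, and deserializer settings before constructing the consumer. Switching an existing group from an eager assignor to the cooperative one needs a staged rollout, so check the upgrade notes for your client version before changing it in place.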

Prevention

  • Size max.poll.records based on worst-case per-record latency, not the average.
  • Alert when batch processing time crosses 50 percent of max.poll.interval.ms.
  • Monitor consumer lag as a rate of change; growing lag is the leading indicator of a bottleneck.
  • Monitor rebalance rate per group and investigate sustained elevation.
  • Test recovery by intentionally stopping one instance and measuring group stabilization time.

Static membership and cooperative rebalancing reduce deployment noise, but they do not remove the need to keep poll-loop processing short.
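Treating lag as a rate of change means comparing successive lag samples rather than alerting on a single reading. A minimal sketch, assuming lag samples are collected at known times by your own monitoring; LagTrend and its methods are hypothetical names.

```java
/** Hypothetical helper for turning raw lag samples into a trend signal. */
final class LagTrend {

    /** Lag change in records per second between two samples; positive means falling behind. */
    static double lagRatePerSecond(long earlierLag, long laterLag, long elapsedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("elapsedMs must be positive");
        }
        return (laterLag - earlierLag) * 1000.0 / elapsedMs;
    }

    /** True when every sample is strictly greater than the one before it. */
    static boolean growingMonotonically(long[] lagSamples) {
        if (lagSamples.length < 2) {
            return false;   // not enough data to call it a trend
        }
        for (int i = 1; i < lagSamples.length; i++) {
            if (lagSamples[i] <= lagSamples[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

A sustained positive rate across several samples is a stronger signal than a large absolute lag, because a consumer that is merely behind but catching up shows a negative rate.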

How Netdata helps

Netdata surfaces the signals that distinguish a slow consumer from a broker problem:

  • Correlate consumer lag growth with per-broker FetchConsumer latency breakdown to rule out broker read-path issues.
  • Alert on consumer group state transitions when a group remains in PreparingRebalance longer than expected.
  • Track JVM GC pause duration on consumer nodes to catch GC-induced poll delays before eviction.
  • Alert on consumer lag as a time estimate rather than raw offsets so thresholds are comparable across topics.
  • Monitor group coordinator CPU and RequestHandlerAvgIdlePercent to rule out broker-side coordination bottlenecks.