Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
A CommitFailedException whose message says the group has already rebalanced and assigned the partitions to another member means the time between poll() calls exceeded max.poll.interval.ms. The consumer was removed from the group, and the coordinator rejected the in-flight offset commit.
When one consumer is evicted, the group rebalances. If other consumers are also slow, or if the rebalance itself takes long enough that healthy consumers miss the same deadline, the group enters a rebalance storm: it oscillates between JoinGroup and SyncGroup without stabilizing, and lag grows without bound.
This is almost always a client-side processing problem, not a broker outage. The fix is on the consumer side. This guide shows how to confirm the diagnosis, stop the storm, and prevent recurrence.
What this means
A consumer must call poll() within max.poll.interval.ms. The background heartbeat thread keeps the session alive, but the poll deadline is tracked separately: once it passes, the client stops heartbeating and leaves the group. session.timeout.ms is enforced via heartbeats and only proves the consumer process is reachable. The poll deadline proves the consumer thread is actively processing. A consumer can keep heartbeating while blocked in a database or HTTP call and still be removed from the group for missing the poll deadline.
If batch processing exceeds the interval, the coordinator removes the consumer. Subsequent offset commits fail because the member ID is no longer recognized, and the consumer must rejoin, triggering a rebalance.
With the default eager assignor, a rebalance revokes all partitions and stops consumption across the group. If other consumers are near their processing limit, the rebalance time pushes them past max.poll.interval.ms. They are evicted too, causing another rebalance. The group oscillates between Stable and PreparingRebalance, lag grows without bound, and the only resolution is to fix processing time or remove slow members.
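The two deadlines described above map to separate consumer settings. The following is a minimal sketch of how they might sit side by side in a Java consumer's configuration. The class and method names are hypothetical, and the values are believed to match the defaults of recent client versions but should be checked against the client in use.

```java
import java.util.Properties;

// Hypothetical holder for the timing-related consumer settings.
final class ConsumerTimingConfig {

    // Values are illustrative; verify them against your client version
    // and tune them against measured processing times.
    static Properties timingProperties() {
        Properties props = new Properties();
        // Process liveness: the coordinator expects a heartbeat within this window.
        props.setProperty("session.timeout.ms", "45000");
        // How often the background thread sends heartbeats.
        props.setProperty("heartbeat.interval.ms", "3000");
        // Poll-loop progress: maximum allowed gap between two poll() calls.
        props.setProperty("max.poll.interval.ms", "300000");
        // Upper bound on the number of records returned by one poll().
        props.setProperty("max.poll.records", "500");
        return props;
    }

    public static void main(String[] args) {
        timingProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With values like these, a batch of up to 500 records has to finish within 300 seconds, regardless of how healthy the heartbeats look.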
flowchart TD
A[Slow batch processing on poll thread] --> B[Time between poll calls exceeds max.poll.interval.ms]
B --> C[Group coordinator evicts consumer]
C --> D[Offset commit rejected]
D --> E[CommitFailedException raised]
C --> F[Group rebalances partitions]
F --> G[Other consumers pause during rebalance]
G --> H[They also miss max.poll.interval.ms]
H --> I[Rebalance storm begins]
I --> J[Consumer lag grows unchecked]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Batch processing time exceeds max.poll.interval.ms | CommitFailedException after consistent processing duration; logs show long elapsed time between polls | Application logs for batch latency relative to max.poll.interval.ms |
| Blocking I/O on the poll thread | Threads blocked on database or HTTP calls; GC is clean but the poll interval is still exceeded | Thread dump for blocked consumer threads |
| Poison pill or crashing consumer | Lag concentrated on one partition; a specific instance repeatedly joins and leaves | Consumer logs for parse errors or crashes at a specific offset |
| Deployment causing synchronized consumer restarts | All consumers restart together; group state cycles rapidly for minutes after startup | Group state history correlated with deployment timestamps |
Quick checks
# Check group state, membership, lag, and partition assignment.
# Look for STATE, CURRENT-OFFSET, LOG-END-OFFSET, LAG, and CONSUMER-ID.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# Inspect consumer threads. Look for the thread running poll() stuck in application code.
jstack <consumer_pid>
# Check consumer GC behavior for pauses that delay poll.
# Watch for rapid growth in FGC (full GC count) or FGCT (cumulative full GC time, in seconds).
jstat -gcutil <consumer_pid> 1000
# Verify the coordinator broker request queue is not saturated.
# This requires JMX enabled and jmxterm installed.
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
How to diagnose it
1. Read the exception text. Confirm the log line contains the exact message: the group has already rebalanced because the time between poll() calls exceeded max.poll.interval.ms. A different message means a different rebalance trigger. Distinguish this from session.timeout.ms eviction, which produces heartbeat errors rather than a CommitFailedException citing the poll interval.
2. Check consumer group state. Run kafka-consumer-groups.sh --describe. If the group cycles between Stable, PreparingRebalance, and CompletingRebalance, you are in a rebalance storm. Note which member IDs repeatedly join and leave.
3. Correlate eviction with processing time. Compare batch processing time from application metrics or logs against max.poll.interval.ms. If processing approaches or exceeds the interval, you have the direct cause.
4. Check for broker-side coordination pressure. Check the group coordinator broker for normal RequestHandlerAvgIdlePercent and low request queue size. A healthy broker means the problem is strictly client-side. If the coordinator broker is saturated, rebalances slow down, which can push healthy consumers past their poll deadline while the group waits.
5. Identify poison pills or hot partitions. If lag is concentrated on a subset of partitions, inspect the assigned consumer for parse errors or unexpectedly large messages.
6. Review deployment timing. If the issue began after a deployment, check whether consumers restarted simultaneously without enough group.initial.rebalance.delay.ms to join in a single wave.
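If the application does not already log the time between polls, the correlation step is hard to do from logs alone. One possible approach is a small helper that records the gap between successive poll() calls and flags gaps that use up a chosen fraction of max.poll.interval.ms. The class below is a hypothetical sketch, not part of the Kafka client; the 0.5 fraction in the usage example is an assumption.

```java
import java.time.Duration;

// Hypothetical helper: tracks the gap between successive poll() calls.
final class PollGapMonitor {
    private final long maxPollIntervalNanos;
    private final double warnFraction;
    private boolean started = false;
    private long lastPollNanos;

    PollGapMonitor(Duration maxPollInterval, double warnFraction) {
        this.maxPollIntervalNanos = maxPollInterval.toNanos();
        this.warnFraction = warnFraction;
    }

    // Call right before each poll(). Returns the time since the previous
    // call, or zero on the first call.
    Duration recordPoll(long nowNanos) {
        long gap = started ? nowNanos - lastPollNanos : 0L;
        started = true;
        lastPollNanos = nowNanos;
        return Duration.ofNanos(gap);
    }

    // True when the gap has used at least warnFraction of the allowed interval.
    boolean isAtRisk(Duration gap) {
        return gap.toNanos() >= maxPollIntervalNanos * warnFraction;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short interval so the demo finishes quickly.
        PollGapMonitor monitor = new PollGapMonitor(Duration.ofMillis(200), 0.5);
        monitor.recordPoll(System.nanoTime());
        Thread.sleep(150); // stands in for batch processing
        Duration gap = monitor.recordPoll(System.nanoTime());
        System.out.println("gap=" + gap.toMillis() + "ms atRisk=" + monitor.isAtRisk(gap));
    }
}
```

In a real poll loop, recordPoll(System.nanoTime()) would be called just before consumer.poll(), and an at-risk gap would be logged together with the batch size so it can be compared against eviction timestamps.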
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group state | Tells you if the group is actively rebalancing or stable | Stuck in PreparingRebalance or CompletingRebalance for more than a few minutes |
| Consumer group lag | Unprocessed messages accumulate while consumers cannot fetch | Lag growing monotonically, especially during rebalances |
| Consumer rebalance rate | Frequency of group membership changes | More than 2 to 3 rebalances per hour outside of deployments |
| Group coordinator broker CPU | A saturated coordinator can slow rebalance completion | Coordinator CPU elevated while other brokers appear normal |
| Consumer batch processing time | Directly determines whether max.poll.interval.ms will be violated | p99 processing time exceeds 50 percent of max.poll.interval.ms |
Fixes
Shrink the batch or extend the deadline
Reduce max.poll.records so each poll returns fewer records and processing finishes within the interval. If processing is legitimately long-running, increase max.poll.interval.ms. The tradeoff is slower detection of genuinely crashed consumers. Set the interval to at least twice the observed peak processing time. Very small batches increase network overhead and can hurt throughput without fixing a fundamental bottleneck, so tune incrementally.
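The sizing rule above is simple arithmetic, sketched here under the assumption that records are processed sequentially and that the "twice the peak" headroom is applied as a fixed factor of 2. The class and method names are hypothetical.

```java
// Hypothetical sizing helper based on the headroom rule described above.
final class PollBudget {
    // Keep the interval at least twice the observed peak processing time.
    private static final long HEADROOM = 2;

    // Largest batch that still fits in half the poll interval, given the
    // worst-case latency of a single record. Never returns less than 1.
    static long maxPollRecords(long maxPollIntervalMs, long worstCaseRecordMs) {
        if (maxPollIntervalMs <= 0 || worstCaseRecordMs <= 0) {
            throw new IllegalArgumentException("durations must be positive");
        }
        long budgetMs = maxPollIntervalMs / HEADROOM;
        return Math.max(1, budgetMs / worstCaseRecordMs);
    }

    // Smallest interval that leaves the same headroom over a peak batch time.
    static long minPollIntervalMs(long peakBatchMs) {
        return Math.multiplyExact(peakBatchMs, HEADROOM);
    }

    public static void main(String[] args) {
        System.out.println("max.poll.records <= " + maxPollRecords(300_000, 250));
        System.out.println("max.poll.interval.ms >= " + minPollIntervalMs(90_000));
    }
}
```

For example, with a 300-second interval and a worst-case latency of 250 ms per record, the budget is 150 seconds, which allows at most 600 records per poll. If the result is 1 and processing still overruns, batch size is not the bottleneck.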
Move processing off the poll thread
Hand records to a worker thread pool and call poll() on schedule. Use pause() and resume() to stop fetching when the internal queue is full, and commit offsets only after the workers finish.
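Warning: Committing offsets before workers confirm completion can skip records if the consumer crashes or loses its partitions mid-batch, and work still in flight when partitions are revoked may be processed again by the new owner. The tradeoff is added complexity and OOM risk if backpressure fails.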
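When workers finish out of order, the offset that is safe to commit is the lowest offset still in flight, not the highest one completed. The sketch below shows one way to track that for a single partition. It is a hypothetical helper, independent of the Kafka client, and assumes one tracker instance per assigned partition.

```java
import java.util.OptionalLong;
import java.util.TreeSet;

// Hypothetical per-partition tracker for records handed to worker threads.
final class InFlightOffsetTracker {
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private long highestDispatched = -1;

    // Call when a record is handed to the worker pool.
    synchronized void dispatched(long offset) {
        inFlight.add(offset);
        highestDispatched = Math.max(highestDispatched, offset);
    }

    // Call from the worker when processing of the record has finished.
    synchronized void completed(long offset) {
        inFlight.remove(offset);
    }

    // Offset to commit, meaning the next offset to read after a restart.
    // Everything below the lowest in-flight offset is known to be done.
    synchronized OptionalLong committableOffset() {
        if (highestDispatched < 0) {
            return OptionalLong.empty();
        }
        long next = inFlight.isEmpty() ? highestDispatched + 1 : inFlight.first();
        return OptionalLong.of(next);
    }

    // Number of unfinished records; usable as a backpressure signal.
    synchronized int inFlightCount() {
        return inFlight.size();
    }

    public static void main(String[] args) {
        InFlightOffsetTracker tracker = new InFlightOffsetTracker();
        tracker.dispatched(10);
        tracker.dispatched(11);
        tracker.completed(11); // finished out of order
        System.out.println("committable=" + tracker.committableOffset().getAsLong());
    }
}
```

In a poll loop, committableOffset() would feed the offset passed to the commit call, and inFlightCount() crossing a chosen limit would be the trigger for pause(), with resume() once it drops again. The limit itself is an application decision and is not prescribed here.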
Remove poison pills and crashing members
Stop a continuously failing consumer instance to let the group stabilize. Fix the poison message with a dead-letter topic or skip logic so the consumer does not crash-loop. One unstable member can keep a group in perpetual rebalancing.
Warning: Stopping a consumer instance is disruptive. Do this only when the instance is crash-looping and actively preventing the group from stabilizing.
Reduce rebalance churn
Enable CooperativeStickyAssignor (available since Kafka 2.4) for incremental rebalances. Partitions are revoked only when necessary, so consumption does not fully stop during every rebalance. Use static membership (group.instance.id) so planned restarts do not trigger immediate rebalances. Neither protects against a poll-loop timeout, but both reduce deployment-related churn.
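As a rough illustration, both settings are plain consumer properties. The sketch below uses a hypothetical helper and a hypothetical instance ID; the ID must be unique per consumer instance and stable across restarts, for example derived from a pod ordinal or hostname.

```java
import java.util.Properties;

// Hypothetical holder for the settings that reduce rebalance churn.
final class RebalanceChurnConfig {

    static Properties churnProperties(String instanceId) {
        Properties props = new Properties();
        // Incremental rebalancing: only partitions that must move are revoked.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Static membership: a restart with the same ID rejoins without an
        // immediate rebalance, as long as it returns within session.timeout.ms.
        props.setProperty("group.instance.id", instanceId);
        return props;
    }

    public static void main(String[] args) {
        // "orders-consumer-0" is a made-up example ID.
        churnProperties("orders-consumer-0")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Switching an existing group from an eager assignor to the cooperative one generally needs a staged rolling restart rather than a single config flip, so check the upgrade notes for the client version in use before changing it in production.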
Prevention
Size max.poll.records based on worst-case per-record latency, not the average. Alert when batch processing time crosses 50 percent of max.poll.interval.ms. Monitor consumer lag as a rate of change; growing lag is the leading indicator of a bottleneck. Monitor rebalance rate per group and investigate sustained elevation. Test recovery by intentionally stopping one instance and measuring group stabilization time. Static membership and cooperative rebalancing reduce deployment noise, but they do not remove the need to keep poll-loop processing short.
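The 50 percent alert can be evaluated from a window of recent batch durations. The following is a minimal sketch that assumes a nearest-rank p99 over samples collected by the application; the class name, the choice of p99, and the sample window are assumptions, not something the consumer provides.

```java
import java.util.Arrays;

// Hypothetical alert check over recent batch processing times.
final class BatchLatencyAlert {

    // Nearest-rank percentile; p is a fraction in (0, 1].
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // True when p99 batch time exceeds half of max.poll.interval.ms.
    static boolean shouldAlert(long[] samplesMs, long maxPollIntervalMs) {
        return percentile(samplesMs, 0.99) * 2 > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long[] recentBatchesMs = {100_000, 120_000, 160_000};
        System.out.println("alert=" + shouldAlert(recentBatchesMs, 300_000));
    }
}
```

In practice the same comparison is usually expressed as an alert rule in the monitoring system; the code only shows the arithmetic behind the threshold.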
How Netdata helps
Netdata surfaces the signals that distinguish a slow consumer from a broker problem:
- Correlate consumer lag growth with per-broker FetchConsumer latency breakdown to rule out broker read-path issues.
- Alert on consumer group state transitions when a group remains in PreparingRebalance longer than expected.
- Track JVM GC pause duration on consumer nodes to catch GC-induced poll delays before eviction.
- Alert on consumer lag as a time estimate rather than raw offsets so thresholds are comparable across topics.
- Monitor group coordinator CPU and RequestHandlerAvgIdlePercent to rule out broker-side coordination bottlenecks.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation
- Kafka leadership imbalance: LeaderCount skew and preferred replica election
- Kafka min.insync.replicas and acks: configuring durability you actually have
- Kafka monitoring checklist: the signals every production cluster needs
- Kafka monitoring maturity model: from survival to expert







