Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
Consumer group lag climbs while brokers report zero under-replicated partitions, normal produce and fetch latency, and normal request handler idle percent. The consumer group oscillates between Stable, PreparingRebalance, and CompletingRebalance without settling long enough to make progress. Every rebalance cycle pauses consumption; lag grows because time spent rebalancing dwarfs time spent fetching. This is a consumer rebalance storm. It is almost always a client-side timeout or processing issue. Look for CommitFailedException or max.poll.interval.ms exceeded in consumer logs.
What this means
A rebalance storm starts when one consumer instance is slow to process a batch and fails to call poll() within max.poll.interval.ms. The client's background thread notices the missed deadline and leaves the group, the group coordinator removes the member, and a rebalance begins. If the group uses an eager assignor, all consumers stop fetching until the rebalance completes. The remaining consumers receive more partitions, take longer to process, miss their own poll deadlines, and are removed in turn. The group enters a loop: rebalance, brief consumption, another rebalance.
The poll deadline is distinct from session.timeout.ms, which the coordinator enforces against heartbeats sent by the client's background heartbeat thread. A consumer can heartbeat successfully yet still be removed from the group for not polling.
With the cooperative sticky assignor, rebalances are incremental and consumption does not fully stop. Static membership suppresses rebalances on transient disconnects. Broker metrics such as UnderReplicatedPartitions, RequestQueueSize, and RequestHandlerAvgIdlePercent are typically normal because this is client-driven coordinator traffic, not a broker data-plane failure.
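The feedback loop described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a reproduction of the group protocol: every member is assumed to spend a fixed time per owned partition between two poll() calls, partitions are spread as evenly as possible, and removed members never rejoin. The function name and its parameters are hypothetical.

```python
def simulate_cascade(partitions, consumers, ms_per_partition, max_poll_interval_ms):
    """Return the member count after each step of the eviction cascade.

    Toy model: each member needs ms_per_partition for every partition it
    owns between two poll() calls, and removed members do not come back.
    """
    history = [consumers]
    while consumers > 0:
        # The most loaded member owns ceil(partitions / consumers) partitions.
        heaviest = -(-partitions // consumers)
        if heaviest * ms_per_partition <= max_poll_interval_ms:
            break  # every member polls in time, so the group stays Stable
        consumers -= 1  # the most loaded member misses its poll deadline
        history.append(consumers)
    return history


print(simulate_cascade(12, 4, 110_000, 300_000))  # [4, 3, 2, 1, 0]
print(simulate_cascade(12, 4, 90_000, 300_000))   # [4]
```

In the first call the group is already over the deadline with four members, and each removal only adds load to the survivors. In the second call the same group has enough headroom and never leaves Stable.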
flowchart TD
Stable -->|max.poll.interval.ms exceeded| PreparingRebalance
PreparingRebalance -->|JoinGroup / SyncGroup| CompletingRebalance
CompletingRebalance -->|Assignment complete| Stable
Stable -->|Processing still slower than timeout| PreparingRebalance
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow per-record processing or blocking I/O in the poll loop | Lag grows steadily; the same consumer instance is evicted repeatedly | Thread dumps and logs for blocking calls such as database writes, HTTP requests, or large object serialization |
| max.poll.interval.ms shorter than batch processing time | All consumers cycle out together; the group never stays Stable for more than a few seconds | Measured poll-to-poll latency versus configured max.poll.interval.ms |
| Poison pill message causing consumer crash-loop | One member ID constantly leaves and rejoins while other members briefly stabilize | Logs for repeated deserialization or processing exceptions on the same topic-partition |
| Large group with eager rebalance protocol | Rebalances take a long time; all members pause; more members miss deadlines during the sync | partition.assignment.strategy and total member count |
Quick checks
# Group state, lag, and member list (read-only)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# Explicit group state only
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state
# Consumer logs for poll-deadline evictions and commit failures
# (the client message mentions max.poll.interval.ms but not the literal word "exceeded" next to it)
grep -iE "max\.poll\.interval\.ms|CommitFailedException" /var/log/<consumer-app>/*.log
# If running in Kubernetes
kubectl logs -l app=<consumer-label> --tail=500 | grep -iE "max\.poll\.interval\.ms|CommitFailedException"
# Broker request queue depth via JMX (adjust port and path to your environment; -n runs jmxterm non-interactively)
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999 -n
# Coordinator broker CPU (requires sysstat)
mpstat 1 5
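If you capture the output of the group state check repeatedly, a small helper can turn the samples into a rebalance count. The following is a minimal sketch: it assumes the state name appears as a standalone token in the command output, and the function names are hypothetical. It only recognizes the classic group states listed in the tuple.

```python
# Classic group states as they appear in the STATE column.
GROUP_STATES = ("PreparingRebalance", "CompletingRebalance", "Stable", "Empty", "Dead")


def extract_state(cli_output):
    """Return the first known group state found in the output, or None."""
    for token in cli_output.split():
        if token in GROUP_STATES:
            return token
    return None


def count_rebalances(states):
    """Count transitions from Stable into a rebalance phase across samples."""
    rebalancing = {"PreparingRebalance", "CompletingRebalance"}
    return sum(
        1
        for prev, cur in zip(states, states[1:])
        if prev == "Stable" and cur in rebalancing
    )
```

Because a rebalance can start and finish between two samples, the count is a lower bound. More than a handful of transitions within a few minutes of steady-state operation is consistent with a storm.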
How to diagnose it
- Observe the oscillation. Run kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state at 10-15 second intervals. If the state cycles between Stable and PreparingRebalance within minutes, you have a rebalance storm.
- Check consumer logs for max.poll.interval.ms exceeded or CommitFailedException. These confirm that consumers are not calling poll() in time.
- Identify whether one member or all members are being evicted. If a single consumer ID repeatedly disappears and rejoins, isolate that instance. Look for a poison pill record or a local resource issue on that host.
- Measure actual processing latency. Instrument the time between poll() calls in your consumer code. Log the duration of record processing and compare it to max.poll.interval.ms. If the p99 interval between poll() calls regularly exceeds the configured timeout, either reduce processing time or raise the timeout.
- Verify the assignment strategy is consistent across all members. If the group uses RangeAssignor or RoundRobinAssignor, every rebalance halts all fetching until sync completes. Eager assignors amplify storms because all members pause during partition migration.
- Rule out broker-side coordinator pressure. Confirm that the coordinator broker’s RequestHandlerAvgIdlePercent has not dropped sharply from baseline and that RequestQueueSize is not sustained near queued.max.requests. Elevated JoinGroup or SyncGroup rates increase coordinator CPU, but the data plane usually remains unaffected unless the broker is CPU-saturated.
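The latency measurement in the steps above can be sketched as follows. The helper assumes you already log a millisecond timestamp at each poll() call; the function names are hypothetical, and the percentile uses the nearest-rank method. The default of 300000 ms matches the Kafka default for max.poll.interval.ms, so pass your configured value if it differs.

```python
import math


def poll_gaps_ms(poll_timestamps_ms):
    """Time between consecutive poll() calls, in milliseconds."""
    return [b - a for a, b in zip(poll_timestamps_ms, poll_timestamps_ms[1:])]


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def poll_deadline_report(poll_timestamps_ms, max_poll_interval_ms=300_000):
    """Summarize poll-to-poll gaps against the configured deadline."""
    gaps = poll_gaps_ms(poll_timestamps_ms)
    return {
        "p99_ms": percentile(gaps, 99),
        "max_ms": max(gaps),
        # Every gap above the deadline corresponds to a missed poll deadline.
        "over_deadline": sum(1 for g in gaps if g > max_poll_interval_ms),
    }
```

A single gap above the deadline is enough to remove the member from the group, so a non-zero over_deadline count matters even when the p99 looks healthy.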
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group state | Shows the lifecycle phase directly | Rebalance phases persisting for minutes instead of seconds |
| Consumer group lag | Measures the backlog created during stalls | Lag growing monotonically across rebalance cycles |
| Consumer rebalance rate | Frequency of group reorganization | Rebalances during steady-state operation (outside of deployments or membership changes) |
| Group coordinator CPU | Elevated CPU can slow sync responses | Coordinator broker CPU above baseline during rebalance spikes |
| Request queue size | Rules out broker-side request saturation | Sustained values above half of queued.max.requests |
Fixes
Reduce per-batch processing time
Lower max.poll.records so the consumer finishes each batch faster and calls poll() sooner. Move blocking operations out of the poll thread to a separate worker pool. If you use a worker queue, call pause() on the consumer when the queue fills and resume() when it drains to prevent fetching while the pipeline is backed up. This is the preferred fix when processing is inherently slow.
Tradeoff: Fetching smaller batches increases request overhead and may reduce raw throughput.
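The pause() and resume() pattern described above can be sketched as a small hysteresis controller. This is an illustration only: StubConsumer is a hypothetical stand-in rather than a real Kafka client, and real clients expect the set of partitions to pause or resume as an argument. The watermark names and values are assumptions to tune for your pipeline.

```python
class StubConsumer:
    """Hypothetical stand-in for a Kafka client; tracks pause state only."""

    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False


class Backpressure:
    """Pause fetching at high_water queued records, resume at low_water."""

    def __init__(self, consumer, high_water, low_water):
        if low_water >= high_water:
            raise ValueError("low_water must be below high_water")
        self.consumer = consumer
        self.high_water = high_water
        self.low_water = low_water

    def update(self, queue_depth):
        """Call once per poll-loop iteration with the current queue depth."""
        if not self.consumer.paused and queue_depth >= self.high_water:
            self.consumer.pause()
        elif self.consumer.paused and queue_depth <= self.low_water:
            self.consumer.resume()
        return self.consumer.paused
```

The gap between the two watermarks prevents rapid toggling when the queue hovers around a single threshold. The poll loop must keep calling poll() while paused; a paused consumer returns no records for the paused partitions but still satisfies the poll deadline.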
Raise max.poll.interval.ms
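If processing legitimately takes longer than the current timeout, increase max.poll.interval.ms to a value above your measured p99 processing time. Do not raise this value indefinitely to mask slow processing; the timeout should reflect worst-case latency under normal load, not an unbounded delay. Be aware that a consumer whose processing thread hangs while its heartbeat thread keeps running holds its partitions until this timeout expires. The Java client also uses this value as the rebalance timeout, so the coordinator waits up to this long for members to rejoin during a rebalance.
Tradeoff: Longer detection time for hung or livelocked consumers. Crashed processes stop heartbeating and are still detected through session.timeout.ms.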
Switch to CooperativeStickyAssignor
If your Kafka client supports it, configure CooperativeStickyAssignor. Rebalances become incremental, so consumers do not fully stop fetching during partition migration. This breaks the feedback loop where rebalance time causes further poll timeouts.
Tradeoff: Incremental rebalancing only works when every member uses a cooperative assignor, so moving an existing group off an eager assignor requires a staged rollout rather than a single configuration flip.
Enable static group membership
Assign a unique group.instance.id to each consumer. Map each host to a persistent ID, for example using the pod name of a StatefulSet in Kubernetes or the hostname on bare metal. The ID must stay the same across restarts; pod names that change on every restart defeat the purpose. When a consumer restarts or disconnects transiently, the coordinator does not immediately revoke its partitions. This suppresses rebalances during rolling deployments and brief network blips.
Tradeoff: Members must be restarted with persistent IDs. If a member permanently fails, its partitions remain assigned until the session timeout elapses. Static membership does not protect against max.poll.interval.ms timeouts caused by slow processing.
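One way to derive a persistent ID is to combine the group name with the host identity. The sketch below assumes the HOSTNAME environment variable carries the pod or machine name, which is typical but not guaranteed in every container runtime. The function name and the ID format are hypothetical.

```python
import os
import socket


def static_instance_id(group_id, env=None):
    """Build a group.instance.id that stays the same across restarts of a host."""
    env = os.environ if env is None else env
    # Prefer HOSTNAME (usually the pod name); fall back to the OS hostname.
    host = env.get("HOSTNAME") or socket.gethostname()
    return f"{group_id}-{host}"
```

The result is only stable if the underlying name is stable, and two consumers in the same group must never share an ID.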
Prevention
Size max.poll.records based on measured end-to-end processing latency per batch. Keep blocking I/O out of the poll loop. Prefer CooperativeStickyAssignor in groups large enough for eager rebalances to be disruptive. Use static membership when consumers have persistent host identities. Alert on consumer group state transitions and rebalance rates so you catch the first cycle before it becomes a storm.
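The sizing rule above reduces to simple arithmetic. The sketch below treats the worst-case batch time as the record count multiplied by the p99 per-record latency, which is a deliberately conservative assumption, and keeps a batch within a fraction of the poll deadline. The function name and the 0.5 budget fraction are assumptions, not Kafka recommendations.

```python
def max_poll_records_for(p99_ms_per_record, max_poll_interval_ms=300_000,
                         budget_fraction=0.5):
    """Largest batch whose worst-case processing fits the latency budget."""
    if p99_ms_per_record <= 0:
        raise ValueError("p99_ms_per_record must be positive")
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must be in (0, 1]")
    budget_ms = max_poll_interval_ms * budget_fraction
    # Never return less than one record; a result of 1 is a warning sign.
    return max(1, int(budget_ms // p99_ms_per_record))


print(max_poll_records_for(400))  # 375
```

A result of 1 means a single record already consumes most of the budget. In that case lowering max.poll.records is not enough, and processing needs to move off the poll thread or the timeout needs to be raised.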
How Netdata helps
- Correlate consumer group lag with broker CPU and request queue depth on the same timeline to confirm the issue is client-side, not broker saturation.
- Monitor the group coordinator broker’s OS-level CPU and memory to detect coordinator pressure that could amplify rebalance latency.
- Track RequestHandlerAvgIdlePercent and RequestQueueSize to rule out data-plane overload while you focus on consumer configuration.
- Use per-second granularity to spot brief PreparingRebalance windows that slower sampling intervals miss.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation
- Kafka leadership imbalance: LeaderCount skew and preferred replica election
- Kafka min.insync.replicas and acks: configuring durability you actually have
- Kafka monitoring checklist: the signals every production cluster needs
- Kafka monitoring maturity model: from survival to expert
- Kafka ActiveControllerCount not equal to 1: no controller or split brain
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas







