Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms

Consumer group lag climbs while brokers report zero under-replicated partitions, normal produce and fetch latency, and normal request handler idle percent. The consumer group oscillates between Stable, PreparingRebalance, and CompletingRebalance without settling long enough to make progress. Every rebalance cycle pauses consumption; lag grows because time spent rebalancing dwarfs time spent fetching. This is a consumer rebalance storm. It is almost always a client-side timeout or processing issue. Look for CommitFailedException or max.poll.interval.ms exceeded in consumer logs.

What this means

A rebalance storm starts when one consumer instance is slow to process a batch or fails to call poll() within max.poll.interval.ms. The group coordinator evicts the member and initiates a rebalance. If the group uses an eager assignor, all consumers stop fetching until the rebalance completes. The remaining consumers receive more partitions, take longer to process, miss their own poll deadlines, and get evicted. The group enters a loop: rebalance, brief consumption, another rebalance.
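The feedback loop can be illustrated with a little arithmetic. The sketch below is a toy model, not a description of how Kafka assigns work: it assumes partitions are spread evenly and that the time a member needs per poll cycle grows linearly with the partitions it owns. The function names and the numbers are hypothetical.

```python
import math


def batch_seconds(partitions, members, seconds_per_partition):
    """Seconds one member needs per poll cycle under an even spread (assumed model)."""
    return math.ceil(partitions / members) * seconds_per_partition


def meets_deadline(partitions, members, seconds_per_partition, max_poll_interval_s):
    """True if a member finishes its batch before the poll deadline."""
    return batch_seconds(partitions, members, seconds_per_partition) <= max_poll_interval_s


# 12 partitions, 90 s of work per partition per cycle, 300 s poll deadline.
healthy = meets_deadline(12, 4, 90, 300)    # 3 partitions each: 270 s
after_loss = meets_deadline(12, 3, 90, 300)  # 4 partitions each: 360 s
```

With four members each cycle fits inside the deadline; once one member is removed, the survivors each own four partitions and no longer fit, which is the point at which the loop described above starts.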

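max.poll.interval.ms is distinct from session.timeout.ms. The session timeout covers liveness: a background thread sends heartbeats, and the coordinator removes a member whose heartbeats stop. The poll interval covers progress: a consumer can heartbeat successfully yet still be removed from the group because its application thread did not call poll() in time.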

With the cooperative sticky assignor, rebalances are incremental and consumption does not fully stop. Static membership suppresses rebalances on transient disconnects. Broker metrics such as UnderReplicatedPartitions, RequestQueueSize, and RequestHandlerAvgIdlePercent are typically normal because this is client-driven coordinator traffic, not a broker data-plane failure.

flowchart TD
    Stable -->|max.poll.interval.ms exceeded| PreparingRebalance
    PreparingRebalance -->|JoinGroup / SyncGroup| CompletingRebalance
    CompletingRebalance -->|Assignment complete| Stable
    Stable -->|Processing still slower than timeout| PreparingRebalance

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow per-record processing or blocking I/O in the poll loop | Lag grows steadily; the same consumer instance is evicted repeatedly | Thread dumps and logs for blocking calls such as database writes, HTTP requests, or large object serialization |
| max.poll.interval.ms shorter than batch processing time | All consumers cycle out together; the group never stays Stable for more than a few seconds | Measured poll-to-poll latency versus configured max.poll.interval.ms |
| Poison pill message causing consumer crash-loop | One member ID constantly leaves and rejoins while other members briefly stabilize | Logs for repeated deserialization or processing exceptions on the same topic-partition |
| Large group with eager rebalance protocol | Rebalances take a long time; all members pause; more members miss deadlines during the sync | partition.assignment.strategy and total member count |

Quick checks

# Group state, lag, and member list (read-only)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>

# Group state, coordinator broker, assignment strategy, and member count
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state

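# Consumer logs for eviction and commit failures
# (the Java client reports a missed poll deadline as "consumer poll timeout has expired")
grep -iE "poll timeout has expired|max\.poll\.interval\.ms exceeded|CommitFailedException" /var/log/<consumer-app>/*.log

# If running in Kubernetes
kubectl logs -l app=<consumer-label> --tail=500 | grep -iE "poll timeout has expired|max\.poll\.interval\.ms exceeded|CommitFailedException"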

# Broker request queue depth via JMX (adjust port and path to your environment)
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# CPU on the group coordinator broker; run on that host (requires sysstat)
mpstat 1 5

How to diagnose it

  1. Observe the oscillation. Run kafka-consumer-groups.sh --describe --group <group-id> --state at 10-15 second intervals. If the state cycles between Stable and PreparingRebalance within minutes, you have a rebalance storm.
  2. Check consumer logs for max.poll.interval.ms exceeded or CommitFailedException. These confirm that consumers are not calling poll() in time.
  3. Identify whether one member or all members are being evicted. If a single consumer ID repeatedly disappears and rejoins, isolate that instance. Look for a poison pill record or a local resource issue on that host.
  4. Measure actual processing latency. Instrument the time between poll() calls in your consumer code. Log the duration of record processing and compare it to max.poll.interval.ms. If the p99 interval between poll() calls regularly exceeds the configured timeout, either reduce processing time or raise the timeout.
  5. Verify the assignment strategy is consistent across all members. If the group uses RangeAssignor or RoundRobinAssignor, every rebalance halts all fetching until sync completes. Eager assignors amplify storms because all members pause during partition migration.
  6. Rule out broker-side coordinator pressure. Confirm that the coordinator broker’s RequestHandlerAvgIdlePercent has not dropped sharply from baseline and that RequestQueueSize is not sustained near queued.max.requests. Elevated JoinGroup or SyncGroup rates increase coordinator CPU, but the data plane usually remains unaffected unless the broker is CPU-saturated.
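Step 4 can be instrumented with a small helper like the one below. This is a minimal sketch, not part of any Kafka client: PollIntervalTracker and its methods are hypothetical names, and the percentile uses the nearest-rank method. It assumes you call mark_poll() immediately before every poll() in your loop.

```python
import math
import time


class PollIntervalTracker:
    """Records the gap between successive poll() calls (hypothetical helper)."""

    def __init__(self, max_poll_interval_ms, clock=time.monotonic):
        self.max_poll_interval_ms = max_poll_interval_ms
        self._clock = clock          # injectable so the tracker can be tested
        self._last_poll = None
        self.intervals_ms = []

    def mark_poll(self):
        """Call immediately before each consumer poll()."""
        now = self._clock()
        if self._last_poll is not None:
            self.intervals_ms.append((now - self._last_poll) * 1000.0)
        self._last_poll = now

    def percentile(self, pct):
        """Nearest-rank percentile of recorded gaps, or None if nothing is recorded."""
        if not self.intervals_ms:
            return None
        ordered = sorted(self.intervals_ms)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    def exceeds_timeout(self, pct=99):
        """True if the chosen percentile of poll gaps is above the configured timeout."""
        value = self.percentile(pct)
        return value is not None and value > self.max_poll_interval_ms
```

Logging percentile(99) alongside the configured max.poll.interval.ms on a fixed schedule gives the comparison that step 4 asks for without needing a thread dump.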

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group state | Shows the lifecycle phase directly | Rebalance phases persisting for minutes instead of seconds |
| Consumer group lag | Measures the backlog created during stalls | Lag growing monotonically across rebalance cycles |
| Consumer rebalance rate | Frequency of group reorganization | Rebalances during steady-state operation (outside of deployments or membership changes) |
| Group coordinator CPU | Elevated CPU can slow sync responses | Coordinator broker CPU above baseline during rebalance spikes |
| Request queue size | Rules out broker-side request saturation | Sustained values above half of queued.max.requests |
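If you only have sampled group states rather than a client-side rebalance metric, a rough rebalance rate can be derived from the samples. The sketch below assumes you have already collected (timestamp_seconds, state) pairs, for example by polling the group state on a schedule; the function names, the 300-second window, and the threshold of 3 are illustrative assumptions, not recommended values.

```python
REBALANCING = {"PreparingRebalance", "CompletingRebalance"}


def count_rebalances(samples):
    """Count observed transitions from a non-rebalancing state into a rebalance.

    samples: list of (timestamp_seconds, state) ordered by time.
    A first sample that is already rebalancing is not counted, because the
    transition itself was not observed.
    """
    count = 0
    previous = None
    for _, state in samples:
        entering = state in REBALANCING
        if entering and previous is not None and previous not in REBALANCING:
            count += 1
        previous = state
    return count


def is_storm(samples, window_s=300, threshold=3):
    """True if at least `threshold` rebalances started within the last window."""
    if not samples:
        return False
    cutoff = samples[-1][0] - window_s
    recent = [s for s in samples if s[0] >= cutoff]
    return count_rebalances(recent) >= threshold
```

Because sampling can miss a rebalance that starts and finishes between two samples, the count is a lower bound; a shorter sampling interval narrows the gap.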

Fixes

Reduce per-batch processing time

Lower max.poll.records so the consumer finishes each batch faster and calls poll() sooner. Move blocking operations out of the poll thread to a separate worker pool. If you use a worker queue, call pause() on the consumer when the queue fills and resume() when it drains to prevent fetching while the pipeline is backed up. This is the preferred fix when processing is inherently slow.

Tradeoff: Fetching smaller batches increases request overhead and may reduce raw throughput.
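The pause() and resume() pattern can be sketched as a small gate around the worker queue. This is an illustration under assumptions: BackpressureGate is a hypothetical helper, and the consumer is any object exposing assignment(), pause(partitions), and resume(partitions), which is the shape of the confluent-kafka Python consumer. The high and low water marks are values you would tune yourself.

```python
import queue


class BackpressureGate:
    """Pauses fetching when the worker queue is full, resumes when it drains."""

    def __init__(self, consumer, work_queue, high_water, low_water):
        self.consumer = consumer
        self.work_queue = work_queue
        self.high_water = high_water   # pause at or above this depth
        self.low_water = low_water     # resume at or below this depth
        self.paused = False

    def update(self):
        """Call once per poll-loop iteration; returns True while paused."""
        depth = self.work_queue.qsize()
        if not self.paused and depth >= self.high_water:
            self.consumer.pause(self.consumer.assignment())
            self.paused = True
        elif self.paused and depth <= self.low_water:
            self.consumer.resume(self.consumer.assignment())
            self.paused = False
        return self.paused
```

The poll loop keeps calling poll() while the gate is paused: paused partitions return no records, but the call still counts toward the max.poll.interval.ms deadline. A rebalance that happens while paused may hand out partitions that are not paused, so a fuller implementation would likely re-apply the pause in its assignment callback.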

Raise max.poll.interval.ms

If processing legitimately takes longer than the current timeout, increase max.poll.interval.ms to a value above your measured p99 processing time. Do not raise this value indefinitely to mask slow processing; the timeout should reflect worst-case latency under normal load, not an unbounded delay. Be aware that a consumer whose processing thread hangs while its heartbeat thread stays alive is not removed until this timeout expires, and that in the Java client the same value bounds how long a rebalance waits for members to rejoin.

Tradeoff: Slower detection of hung or livelocked consumers, and potentially longer rebalances. Consumers whose process has crashed are still detected through session.timeout.ms.
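One way to keep the timeout tied to measurement is to derive it from the measured p99 and refuse to go past a ceiling. The sketch below is only arithmetic: the function name, the 1.5x headroom, and the 15-minute ceiling are assumptions chosen for illustration, not recommended values.

```python
import math


def suggest_max_poll_interval_ms(p99_poll_gap_ms, headroom=1.5, ceiling_ms=900_000):
    """Suggest a timeout above the measured p99, rounded up to a whole second.

    Raises ValueError when the suggestion would exceed ceiling_ms, which is a
    hint to shorten processing instead of raising the timeout further.
    """
    if p99_poll_gap_ms <= 0:
        raise ValueError("p99_poll_gap_ms must be positive")
    suggested = math.ceil(p99_poll_gap_ms * headroom / 1000) * 1000
    if suggested > ceiling_ms:
        raise ValueError(
            f"suggested {suggested} ms exceeds ceiling {ceiling_ms} ms; "
            "reduce per-batch processing time instead"
        )
    return suggested
```

For example, a measured p99 of 7,000 ms yields a suggestion of 11,000 ms with the default headroom, while a p99 of 700,000 ms is rejected because the result would exceed the ceiling.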

Switch to CooperativeStickyAssignor

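If your Kafka client supports it (the Java client has since version 2.4), configure CooperativeStickyAssignor. Rebalances become incremental: only the partitions that change owner are revoked, so consumers keep fetching from the rest during partition migration. This breaks the feedback loop where rebalance time causes further poll timeouts.

Tradeoff: All members must use a cooperative assignor for incremental rebalancing to work. In the Java client, the documented migration path for a running group is two rolling restarts: first list the cooperative assignor alongside the existing one, then remove the eager assignor.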
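As a configuration sketch, the setting looks like this for the confluent-kafka Python client, where the strategy is named "cooperative-sticky"; the Java client instead takes the class name org.apache.kafka.clients.consumer.CooperativeStickyAssignor. The helper name and the broker address and group ID used with it are placeholders.

```python
def cooperative_consumer_config(bootstrap_servers, group_id):
    """Build a consumer config dict that selects cooperative rebalancing.

    Keys follow the librdkafka naming used by confluent-kafka (assumed client).
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Incremental rebalancing: only partitions that move are revoked.
        "partition.assignment.strategy": "cooperative-sticky",
    }


config = cooperative_consumer_config("localhost:9092", "example-group")
```

The resulting dict would be passed to the client's consumer constructor; every member of the group needs the same strategy for the rebalance to be incremental.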

Enable static group membership

Assign a unique group.instance.id to each consumer. Map each host to a persistent ID, for example using the pod name in Kubernetes or the hostname on bare metal. When a consumer restarts or disconnects transiently, the coordinator does not immediately revoke its partitions. This suppresses rebalances during rolling deployments and brief network blips.

Tradeoff: Members must be restarted with persistent IDs. If a member permanently fails, its partitions remain assigned until the session timeout elapses. Static membership does not protect against max.poll.interval.ms timeouts caused by slow processing.
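A persistent ID can be derived from the host identity at startup. The sketch below is one possible approach: static_member_config is a hypothetical helper, POD_NAME is an assumed environment variable that you would have to inject yourself (for example through the Kubernetes Downward API), and HOSTNAME and the system hostname are used as fallbacks.

```python
import os
import socket


def static_member_config(group_id, env=None):
    """Return config entries that give this consumer a stable group.instance.id."""
    env = os.environ if env is None else env
    # Prefer an explicitly injected pod name, then HOSTNAME, then the OS hostname.
    host_identity = env.get("POD_NAME") or env.get("HOSTNAME") or socket.gethostname()
    return {
        "group.id": group_id,
        "group.instance.id": f"{group_id}-{host_identity}",
    }
```

This only yields persistent IDs if the underlying name is stable across restarts, as it is for StatefulSet pods or fixed hostnames; pods created by a Deployment get a new name each time, which defeats the purpose.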

Prevention

Size max.poll.records based on measured end-to-end processing latency per batch. Keep blocking I/O out of the poll loop. Prefer CooperativeStickyAssignor in groups large enough for eager rebalances to be disruptive. Use static membership when consumers have persistent host identities. Alert on consumer group state transitions and rebalance rates so you catch the first cycle before it becomes a storm.

How Netdata helps

  • Correlate consumer group lag with broker CPU and request queue depth on the same timeline to confirm the issue is client-side, not broker saturation.
  • Monitor the group coordinator broker’s OS-level CPU and memory to detect coordinator pressure that could amplify rebalance latency.
  • Track RequestHandlerAvgIdlePercent and RequestQueueSize to rule out data-plane overload while you focus on consumer configuration.
  • Use per-second granularity to spot brief PreparingRebalance windows that slower sampling intervals miss.