Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
Consumer group lag is growing, application logs are full of JoinGroup and SyncGroup messages, and kafka-consumer-groups.sh shows the group flipping between Stable and PreparingRebalance. Healthy groups rebalance only during membership changes and planned deployments. More than two or three rebalances per hour for a stable group signals instability. The usual cause is a mismatch between processing latency and one of three timeouts: session.timeout.ms, heartbeat.interval.ms, or max.poll.interval.ms. The assignor strategy and static membership configuration determine how painful each rebalance is.
What this means
Kafka’s group coordinator tracks membership. A rebalance begins when the coordinator detects a membership change: a join, a leave, or an eviction caused by missed heartbeats or a missed poll() deadline. During a rebalance, partition assignments are revoked and redistributed.
With an eager assignor, all consumers stop processing until the rebalance completes. With CooperativeStickyAssignor, rebalances are incremental, reducing but not eliminating pause times. KIP-62 moved heartbeat sending to a background thread, so session.timeout.ms and max.poll.interval.ms operate independently. A consumer can fail to call poll() because application code is stuck, triggering eviction via max.poll.interval.ms even though background heartbeats are healthy. The result: one slow consumer triggers a rebalance, the pause causes other consumers to miss deadlines, and the group never stabilizes.
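To make the independence of the two timeouts concrete, here is a minimal sketch of the decision. It is a simplified model, not the coordinator's actual code: `ConsumerTimeouts` and `eviction_reason` are hypothetical names, and the timeout values are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsumerTimeouts:
    """Illustrative holder for the two independent deadlines (hypothetical type)."""
    session_timeout_ms: int
    max_poll_interval_ms: int


def eviction_reason(ms_since_last_heartbeat: int,
                    ms_since_last_poll: int,
                    timeouts: ConsumerTimeouts) -> Optional[str]:
    """Return the setting whose deadline was missed, or None if the member is healthy.

    Simplified model: heartbeats and poll() calls are tracked separately,
    so either deadline can be missed while the other is still being met.
    """
    if ms_since_last_heartbeat > timeouts.session_timeout_ms:
        return "session.timeout.ms"
    if ms_since_last_poll > timeouts.max_poll_interval_ms:
        return "max.poll.interval.ms"
    return None


timeouts = ConsumerTimeouts(session_timeout_ms=45_000, max_poll_interval_ms=300_000)

# Heartbeats are healthy, but application code has not called poll() for 400 s.
print(eviction_reason(3_000, 400_000, timeouts))
```

The second case is the one that surprises people: the process is alive and heartbeating, yet the member still leaves the group because the processing loop stalled.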
flowchart TD
A[Consumer processing blocks in poll loop] --> B[Misses max.poll.interval.ms deadline]
B --> C[Coordinator evicts member]
C --> D[Group enters rebalance]
D --> E[All consumers pause with eager assignor]
E --> F[Remaining consumers miss deadlines]
F --> D
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| max.poll.interval.ms too low for processing time | Lag grows during bursts; logs show timeout exceeded | Consumer application logs for processing latency per batch |
| session.timeout.ms too low relative to heartbeat interval and GC pauses | Consumers evicted during normal operation with no application error | Consumer GC behavior and heartbeat timing |
| Eager assignor with bursty membership changes | Full stop-the-world pause on every rebalance; cooperative-sticky is not in use | Consumer config partition.assignment.strategy |
| Static membership not configured | Rebalance on every rolling restart or transient network blip | Consumer config group.instance.id |
| Poison pill or crashing consumer | One consumer instance cycles; group repeatedly rebalances | Consumer crash logs and exception traces |
| Group coordinator broker overloaded | Elevated join and sync latency across multiple groups | Coordinator broker CPU and request queue |
Quick checks
# Check consumer group state and rebalance activity
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state
# Check member assignments and current lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# Check group coordinator broker I/O thread saturation
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check group coordinator request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Check broker logs for group coordinator errors
grep -i "group.*coordinator\|rebalance" /var/log/kafka/server.log
How to diagnose it
1. Quantify rebalance rate. Use kafka-consumer-groups.sh to observe state transitions. If the group moves between Stable, PreparingRebalance, and CompletingRebalance more than two or three times per hour, or stays in rebalance longer than five minutes, you have a rebalance storm.
2. Inspect member churn. Look at consumer IDs in the group description. Changing member IDs mean consumers are being evicted and rejoining as new members, which points to session timeouts or crashes. Stable member IDs with rebalances only during deployments mean you lack static membership.
3. Check consumer application logs for timeout violations. Search for messages indicating max.poll.interval.ms was exceeded, or for CommitFailedException. These indicate the message processing loop is slower than the configured timeout. If found, measure the actual p99 processing time per batch.
4. Verify session.timeout.ms and heartbeat.interval.ms. The heartbeat interval should be no higher than one-third of the session timeout. If the interval is too close to the timeout, network jitter or GC pauses can cause the coordinator to miss heartbeats and evict the consumer.
5. Review the assignor strategy. If partition.assignment.strategy is RangeAssignor or the classic StickyAssignor, the group uses the eager rebalance protocol. Switching to CooperativeStickyAssignor enables incremental rebalances.
6. Check for group.instance.id. Without it, every consumer restart triggers a rebalance. A stable, unique group.instance.id per instance lets the consumer rejoin without triggering a rebalance, provided it returns within session.timeout.ms.
7. Validate the group coordinator broker. A saturated coordinator slows rebalance completion. Check RequestHandlerAvgIdlePercent and RequestQueueSize on the broker that leads the __consumer_offsets partition for this group. If the broker is below 30% idle or the request queue is consistently elevated, coordinator capacity is part of the problem.
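To find the broker that leads the __consumer_offsets partition for a group, you first need the partition number. The sketch below assumes the commonly documented mapping: the Java hash code of the group ID, made non-negative, modulo offsets.topic.num.partitions (default 50). The function names are hypothetical, and you should confirm the partition count on your cluster before relying on the result.

```python
def java_string_hashcode(s: str) -> int:
    """Reproduce Java's String.hashCode(): 31*h + char over UTF-16 code units, as signed 32-bit."""
    h = 0
    data = s.encode("utf-16-be")  # two bytes per UTF-16 code unit, no BOM
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def coordinator_partition(group_id: str, offsets_topic_partitions: int = 50) -> int:
    """Assumed mapping: (hashCode & 0x7fffffff) % partition count of __consumer_offsets."""
    return (java_string_hashcode(group_id) & 0x7FFFFFFF) % offsets_topic_partitions


# "my-consumer-group" is a placeholder group ID.
print(coordinator_partition("my-consumer-group"))
```

With the partition number in hand, kafka-topics.sh --describe --topic __consumer_offsets shows which broker leads that partition, and that broker is the one whose RequestHandlerAvgIdlePercent and RequestQueueSize matter for this group.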
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group state | Rebalance activity | In PreparingRebalance or CompletingRebalance for more than 5 minutes, or flipping repeatedly |
| Consumer rebalance rate | Frequency of partition reassignment | More than 2-3 rebalances per hour outside of deployments |
| Consumer group lag | Unprocessed messages accumulate while the group is not stable | Lag growing monotonically during rebalance windows |
| RequestHandlerAvgIdlePercent | Coordinator broker capacity to process join and sync requests | Sustained below 0.3 on the coordinator broker |
| RequestQueueSize | Request backlog at the coordinator | Consistently above 50% of queued.max.requests on the coordinator |
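The thresholds in the table can be turned into a simple check. This is a sketch under stated assumptions: the group state is sampled periodically (for example from the --state output), `count_rebalances` and `warning_signs` are hypothetical helpers, and queued.max.requests is assumed to be at its default of 500. Sampling can miss rebalances shorter than the sampling interval.

```python
from typing import Iterable, List

REBALANCING_STATES = {"PreparingRebalance", "CompletingRebalance"}


def count_rebalances(states: Iterable[str]) -> int:
    """Count entries into a rebalancing state across a sequence of sampled group states."""
    count = 0
    was_rebalancing = False
    for state in states:
        is_rebalancing = state in REBALANCING_STATES
        if is_rebalancing and not was_rebalancing:
            count += 1
        was_rebalancing = is_rebalancing
    return count


def warning_signs(states_last_hour: Iterable[str],
                  handler_idle: float,
                  request_queue_size: int,
                  queued_max_requests: int = 500) -> List[str]:
    """Apply the warning thresholds from the table to one hour of samples."""
    warnings = []
    if count_rebalances(states_last_hour) > 3:
        warnings.append("rebalance rate above 3 per hour")
    if handler_idle < 0.3:
        warnings.append("RequestHandlerAvgIdlePercent below 0.3")
    if request_queue_size > 0.5 * queued_max_requests:
        warnings.append("RequestQueueSize above 50% of queued.max.requests")
    return warnings


samples = ["Stable", "PreparingRebalance", "CompletingRebalance", "Stable"]
print(warning_signs(samples, handler_idle=0.2, request_queue_size=300))
```

A single hour of samples is a snapshot; the table's "sustained" and "consistently" qualifiers mean the broker checks should hold across several consecutive windows before you act on them.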
Fixes
Increase max.poll.interval.ms or reduce processing time
If consumer logs show max.poll.interval.ms exceeded, the application is taking too long to process batches. You can increase max.poll.interval.ms to match processing latency, but the tradeoff is a longer wait before the coordinator detects a dead consumer. A better fix is often to decrease max.poll.records so each batch finishes faster, or move blocking I/O out of the poll() processing loop.
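The tradeoff between the two settings is plain arithmetic. The sketch below assumes processing time scales roughly linearly with record count and uses a 2x headroom factor; both function names and the latency figures are hypothetical.

```python
import math


def max_safe_poll_records(max_poll_interval_ms: int,
                          p99_record_ms: float,
                          headroom: float = 2.0) -> int:
    """Largest batch size whose p99 processing time stays within the interval divided by headroom."""
    if p99_record_ms <= 0:
        raise ValueError("p99_record_ms must be positive")
    budget_ms = max_poll_interval_ms / headroom
    return max(1, int(budget_ms // p99_record_ms))


def required_max_poll_interval_ms(max_poll_records: int,
                                  p99_record_ms: float,
                                  headroom: float = 2.0) -> int:
    """Interval needed to process a full batch at p99 latency, with headroom."""
    return math.ceil(max_poll_records * p99_record_ms * headroom)


# With a 300 s interval and 1 s per record at p99, keep batches at or below:
print(max_safe_poll_records(300_000, 1_000.0))       # 150
# Keeping 500 records per batch at the same latency would instead need:
print(required_max_poll_interval_ms(500, 1_000.0))   # 1000000
```

In this example, shrinking the batch to 150 records keeps dead-consumer detection at five minutes, while keeping 500 records would stretch it to more than sixteen minutes.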
Tune session.timeout.ms and heartbeat.interval.ms
If consumers are evicted despite finishing batches quickly, the background heartbeat thread is likely missing its deadline. Increase session.timeout.ms to tolerate GC pauses and network jitter. Keep heartbeat.interval.ms at no more than one-third of session.timeout.ms. If the consumer JVM has long GC pauses, tune the heap or garbage collector before widening the session timeout.
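A quick sanity check for the two settings might look like the sketch below. The one-third rule comes from the guidance above; the GC check is a heuristic assumption (a pause plus one heartbeat interval should fit inside the session timeout), and `check_heartbeat_settings` is a hypothetical helper.

```python
from typing import List


def check_heartbeat_settings(session_timeout_ms: int,
                             heartbeat_interval_ms: int,
                             worst_gc_pause_ms: int = 0) -> List[str]:
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    # Rule from the text: heartbeat interval at most one-third of the session timeout.
    if heartbeat_interval_ms * 3 > session_timeout_ms:
        problems.append("heartbeat.interval.ms exceeds one-third of session.timeout.ms")
    # Heuristic (assumption): a stop-the-world pause also pauses the heartbeat thread.
    if worst_gc_pause_ms + heartbeat_interval_ms >= session_timeout_ms:
        problems.append("worst GC pause plus one heartbeat interval reaches session.timeout.ms")
    return problems


print(check_heartbeat_settings(10_000, 3_000, worst_gc_pause_ms=8_000))
```

Feed it the worst pause you see in the consumer's GC logs rather than an average, since a single long pause is enough to get a member evicted.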
Switch to the cooperative-sticky assignor
If the group uses RangeAssignor or the classic StickyAssignor, every rebalance is eager and stop-the-world. Change partition.assignment.strategy to CooperativeStickyAssignor. The tradeoff: consumers must run Kafka 2.4 or newer, and a live group needs rolling bounces to adopt the new protocol cleanly (typically two: one to add CooperativeStickyAssignor alongside the existing assignor, and one to remove the eager assignor). This does not reduce rebalance frequency, but it limits the pause to partitions that must move.
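When auditing many consumers, it helps to classify each partition.assignment.strategy value automatically. The sketch below assumes that a consumer only uses the cooperative protocol when every listed assignor supports it, and it treats any assignor it does not recognize as eager. `rebalance_protocol` is a hypothetical helper, not a Kafka API.

```python
COOPERATIVE_ASSIGNORS = {"CooperativeStickyAssignor"}


def rebalance_protocol(partition_assignment_strategy: str) -> str:
    """Classify a comma-separated assignor list as 'cooperative' or 'eager'.

    Assumption: a single eager-only assignor in the list keeps the whole
    consumer on the eager protocol. Unknown (custom) assignors count as eager.
    """
    names = [entry.strip().rsplit(".", 1)[-1]
             for entry in partition_assignment_strategy.split(",")
             if entry.strip()]
    if not names:
        raise ValueError("partition.assignment.strategy is empty")
    if all(name in COOPERATIVE_ASSIGNORS for name in names):
        return "cooperative"
    return "eager"


mixed = ("org.apache.kafka.clients.consumer.CooperativeStickyAssignor,"
         "org.apache.kafka.clients.consumer.RangeAssignor")
print(rebalance_protocol(mixed))  # eager
```

A mixed list is the expected intermediate state during a migration, so an "eager" result there is not a misconfiguration; it only means the switch is not finished.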
Enable static membership
If rolling deployments cause rebalances even when processing is healthy, assign a stable group.instance.id to each consumer instance. The tradeoff: if a consumer is permanently removed, its partitions are not reassigned until the session timeout expires. Do not reuse IDs across instances, or you will trigger FENCED_INSTANCE_ID errors.
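One way to get stable, unique IDs is to derive them from something the platform already keeps stable, such as a StatefulSet pod name. The sketch below assumes that naming scheme; `static_member_config` and `duplicate_instance_ids` are hypothetical helpers, and only group.id and group.instance.id are real consumer settings.

```python
from collections import Counter
from typing import Dict, Iterable, List


def static_member_config(group_id: str, instance_name: str) -> Dict[str, str]:
    """Build consumer settings with a group.instance.id derived from a stable instance name."""
    return {
        "group.id": group_id,
        # Assumed scheme: prefix with the group ID so IDs stay readable in tooling.
        "group.instance.id": f"{group_id}-{instance_name}",
    }


def duplicate_instance_ids(configs: Iterable[Dict[str, str]]) -> List[str]:
    """Return any group.instance.id used by more than one instance."""
    counts = Counter(config["group.instance.id"] for config in configs)
    return sorted(instance_id for instance_id, n in counts.items() if n > 1)


fleet = [static_member_config("orders", name) for name in ("pod-0", "pod-1", "pod-0")]
print(duplicate_instance_ids(fleet))  # ['orders-pod-0']
```

Running a duplicate check like this in deployment tooling catches the ID reuse that would otherwise surface at runtime as FENCED_INSTANCE_ID errors.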
Isolate a crashing consumer
If one consumer instance is stuck in a crash loop and dragging the group through repeated rebalances, stop that instance manually to let the remaining members stabilize. Then debug the poison pill or code path causing the crash outside the critical path.
Relieve coordinator broker pressure
If the coordinator broker is saturated, rebalances take longer to complete, increasing the chance that other members miss deadlines. Check for leadership imbalance or partition skew on the coordinator. If the broker is overloaded, reduce load or redistribute __consumer_offsets partition leadership.
Prevention
- Monitor rebalance rate and alert when a stable group exceeds two or three rebalances per hour.
- Use CooperativeStickyAssignor for all new consumer deployments to reduce stop-the-world impact.
- Adopt static membership with group.instance.id for long-lived or stateful consumers.
- Size max.poll.interval.ms to exceed the consumer’s p99 processing latency by at least 2x.
- Keep heartbeat.interval.ms at no more than one-third of session.timeout.ms.
- Monitor coordinator broker RequestHandlerAvgIdlePercent and RequestQueueSize independently from data brokers.
How Netdata helps
- Netdata surfaces consumer group state and lag via the Kafka collector, making rebalance episodes visible without manual CLI checks.
- Correlate consumer lag spikes with broker RequestHandlerAvgIdlePercent drops on the coordinator node to distinguish client-side timeouts from coordinator saturation.
- Alert on sustained consumer group state changes, such as a group remaining in PreparingRebalance beyond a threshold.
- Track consumer group state transitions and lag to confirm that rebalances cause measurable processing delay.
- Track broker request queue size and I/O thread idle percent on the coordinator to catch the rare case where broker overload prolongs rebalances.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation
- Kafka leadership imbalance: LeaderCount skew and preferred replica election
- Kafka min.insync.replicas and acks: configuring durability you actually have
- Kafka monitoring checklist: the signals every production cluster needs
- Kafka monitoring maturity model: from survival to expert







