Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors

Consumer group lag is growing, application logs are full of JoinGroup and SyncGroup messages, and kafka-consumer-groups.sh shows the group flipping between Stable and PreparingRebalance. Healthy groups rebalance only during membership changes and planned deployments. More than two or three rebalances per hour for a stable group signals instability. The usual cause is a mismatch between processing latency and one of three timeouts: session.timeout.ms, heartbeat.interval.ms, or max.poll.interval.ms. The assignor strategy and static membership configuration determine how painful each rebalance is.

What this means

Kafka’s group coordinator tracks membership. A rebalance begins when the coordinator detects a membership change: a join, a leave, or an eviction after missed heartbeats or a missed poll() deadline. During a rebalance, partition assignments are revoked and redistributed.

With an eager assignor, all consumers stop processing until the rebalance completes. With CooperativeStickyAssignor, rebalances are incremental, reducing but not eliminating pause times. KIP-62 moved heartbeat sending to a background thread, so session.timeout.ms and max.poll.interval.ms operate independently. A consumer can fail to call poll() because application code is stuck, triggering eviction via max.poll.interval.ms even though background heartbeats are healthy. The result: one slow consumer triggers a rebalance, the pause causes other consumers to miss deadlines, and the group never stabilizes.

flowchart TD
    A[Consumer processing blocks in poll loop] --> B[Misses max.poll.interval.ms deadline]
    B --> C[Coordinator evicts member]
    C --> D[Group enters rebalance]
    D --> E[All consumers pause with eager assignor]
    E --> F[Remaining consumers miss deadlines]
    F --> D

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| max.poll.interval.ms too low for processing time | Lag grows during bursts; logs show timeout exceeded | Consumer application logs for processing latency per batch |
| session.timeout.ms too low relative to heartbeat interval and GC pauses | Consumers evicted during normal operation with no application error | Consumer GC behavior and heartbeat timing |
| Eager assignor with bursty membership changes | Full stop-the-world pause on every rebalance; cooperative-sticky is not in use | Consumer config partition.assignment.strategy |
| Static membership not configured | Rebalance on every rolling restart or transient network blip | Consumer config group.instance.id |
| Poison pill or crashing consumer | One consumer instance cycles; group repeatedly rebalances | Consumer crash logs and exception traces |
| Group coordinator broker overloaded | Elevated join and sync latency across multiple groups | Coordinator broker CPU and request queue |

Quick checks

# Check consumer group state and rebalance activity
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state

# Check member assignments and current lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>

# Check group coordinator broker I/O thread saturation
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check group coordinator request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Check broker logs for group coordinator errors
grep -i "group.*coordinator\|rebalance" /var/log/kafka/server.log

How to diagnose it

  1. Quantify rebalance rate. Use kafka-consumer-groups.sh to observe state transitions. If the group moves between Stable, PreparingRebalance, and CompletingRebalance more than two or three times per hour, or stays in rebalance longer than five minutes, you have a rebalance storm.

  2. Inspect member churn. Look at consumer IDs in the group description. Member IDs that keep changing mean consumers are being evicted and rejoining as new members, which points to a session timeout, a missed poll() deadline, or crashes. Member IDs that stay the same between deployments, with rebalances clustered around deployments, mean you lack static membership.

  3. Check consumer application logs for timeout violations. Search for messages indicating max.poll.interval.ms was exceeded or CommitFailedException. These indicate the message processing loop is slower than the configured timeout. If found, measure the actual p99 processing time per batch.

  4. Verify session.timeout.ms and heartbeat.interval.ms. The heartbeat interval should be no higher than one-third of the session timeout. If the interval is too close to the timeout, network jitter or GC pauses can cause the coordinator to miss heartbeats and evict the consumer.

  5. Review the assignor strategy. If partition.assignment.strategy is RangeAssignor, RoundRobinAssignor, or the classic StickyAssignor, the group uses the eager rebalance protocol. Switching to CooperativeStickyAssignor enables incremental rebalances.

  6. Check for group.instance.id. Without it, every consumer restart triggers a rebalance. A stable, unique group.instance.id per instance lets the consumer rejoin without triggering a rebalance, provided it returns within session.timeout.ms.

  7. Validate the group coordinator broker. A saturated coordinator slows rebalance completion. Check RequestHandlerAvgIdlePercent and RequestQueueSize on the broker that leads the __consumer_offsets partition for this group. If the broker is below 30% idle or the request queue is consistently elevated, coordinator capacity is part of the problem.
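
Step 1 can be approximated by sampling the group state at a fixed interval and counting how often the group leaves Stable. The sketch below assumes a hypothetical sample log with one line per sample whose last field is the state name (for example `1718000000 Stable`). How you extract that field from kafka-consumer-groups.sh output depends on your Kafka version, so that part is left out. The function name count_rebalances is illustrative, and sampling undercounts rebalances that start and finish between two samples.

```shell
# Count transitions out of Stable in a stream of sampled group states.
# Input: one sample per line on stdin; the last whitespace-separated
# field of each line is taken as the state name (assumed format).
count_rebalances() {
  awk '
    { state = $NF }
    # A rebalance episode starts when the previous sample was Stable
    # and the current one is anything else.
    prev == "Stable" && state != "Stable" { n++ }
    { prev = state }
    END { print n + 0 }
  '
}

# Example (states.log is a hypothetical file of samples):
#   count_rebalances < states.log
```

Fed one hour of samples, a result above two or three matches the threshold described in step 1.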

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group state | Rebalance activity | In PreparingRebalance or CompletingRebalance for more than 5 minutes, or flipping repeatedly |
| Consumer rebalance rate | Frequency of partition reassignment | More than 2-3 rebalances per hour outside of deployments |
| Consumer group lag | Unprocessed messages accumulate while the group is not stable | Lag growing monotonically during rebalance windows |
| RequestHandlerAvgIdlePercent | Coordinator broker capacity to process join and sync requests | Sustained below 0.3 on the coordinator broker |
| RequestQueueSize | Request backlog at the coordinator | Consistently above 50% of queued.max.requests on the coordinator |

Fixes

Increase max.poll.interval.ms or reduce processing time

If consumer logs show max.poll.interval.ms exceeded, the application is taking too long to process batches. You can raise max.poll.interval.ms above the observed processing latency, but the tradeoff is a longer wait before a stuck consumer is detected, and because the consumer also uses this value as its rebalance timeout, rebalances can take longer to complete. A better fix is often to decrease max.poll.records so each batch finishes faster, or to move blocking I/O out of the poll() processing loop.
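
As a rough sizing aid, the sketch below estimates the largest max.poll.records that keeps a full batch within half of max.poll.interval.ms, which is the 2x headroom this guide recommends. It assumes records are processed sequentially, so batch time is roughly the record count times the per-record latency. The function name and its inputs are illustrative.

```shell
# Largest max.poll.records such that a full batch takes at most half of
# max.poll.interval.ms, assuming sequential per-record processing.
#   $1 = max.poll.interval.ms
#   $2 = measured p99 processing time per record, in milliseconds
max_poll_records_for() {
  interval_ms=$1
  per_record_p99_ms=$2
  if [ "$per_record_p99_ms" -le 0 ]; then
    echo "per-record latency must be positive" >&2
    return 1
  fi
  # Integer division rounds down, which errs on the safe side.
  echo $((interval_ms / (2 * per_record_p99_ms)))
}

# Example: max_poll_records_for 300000 400
```

With the default max.poll.interval.ms of 300000 and a p99 of 400 ms per record, this prints 375, which is below the default max.poll.records of 500.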

Tune session.timeout.ms and heartbeat.interval.ms

If consumers are evicted despite finishing batches quickly, the background heartbeat thread is likely missing its deadline. Increase session.timeout.ms to tolerate GC pauses and network jitter. Keep heartbeat.interval.ms at no more than one-third of session.timeout.ms. If the consumer JVM has long GC pauses, tune the heap or garbage collector before widening the session timeout.
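
The one-third rule is easy to check mechanically. A minimal sketch, with an illustrative function name, that takes both values in milliseconds:

```shell
# Check that heartbeat.interval.ms is at most one-third of session.timeout.ms.
#   $1 = session.timeout.ms
#   $2 = heartbeat.interval.ms
check_heartbeat_ratio() {
  session_ms=$1
  heartbeat_ms=$2
  if [ $((heartbeat_ms * 3)) -le "$session_ms" ]; then
    echo "ok"
  else
    # Suggest the largest interval that satisfies the rule (rounded down).
    echo "too-high: heartbeat.interval.ms should be at most $((session_ms / 3))"
  fi
}

# Example: check_heartbeat_ratio 45000 3000
```

A 45000 ms session timeout with a 3000 ms heartbeat interval passes. A 10000 ms session timeout with a 5000 ms interval is flagged, with 3333 ms suggested as the upper bound.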

Switch to the cooperative-sticky assignor

If the group uses RangeAssignor, RoundRobinAssignor, or the classic StickyAssignor, every rebalance is eager and stop-the-world. Change partition.assignment.strategy to CooperativeStickyAssignor. The tradeoff is that consumers must run Kafka 2.4 or newer, and a live group that currently lists only an eager assignor needs two rolling bounces to migrate safely: the first adds CooperativeStickyAssignor alongside the existing assignor, and the second removes the old one. This does not reduce rebalance frequency, but it limits the pause to the partitions that must move.
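
To see which protocol a given partition.assignment.strategy value implies, the sketch below classifies the comma-separated list. It rests on one assumption about the built-in assignors: a consumer uses the cooperative protocol only when every assignor it lists supports it, so a list that mixes CooperativeStickyAssignor with an eager assignor still behaves eagerly. The function name is illustrative. An empty value is reported as default because the default list differs between client versions, and any custom assignor is treated as eager.

```shell
# Classify a partition.assignment.strategy value.
#   $1 = comma-separated list of assignor class names (may be empty)
# Prints "cooperative", "eager", or "default".
rebalance_protocol() {
  strategy=$1
  if [ -z "$strategy" ]; then
    echo "default"
    return
  fi
  old_ifs=$IFS
  IFS=,
  for assignor in $strategy; do
    case "$assignor" in
      *CooperativeStickyAssignor*) ;;  # supports the cooperative protocol
      *)
        # Any other assignor is assumed to be eager-only.
        IFS=$old_ifs
        echo "eager"
        return
        ;;
    esac
  done
  IFS=$old_ifs
  echo "cooperative"
}

# Example:
#   rebalance_protocol "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"
```

During a migration, expect "eager" while both assignors are listed and "cooperative" once the old assignor has been removed everywhere.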

Enable static membership

If rolling deployments cause rebalances even when processing is healthy, assign a stable group.instance.id to each consumer instance. The tradeoff: if a consumer is permanently removed, its partitions are not reassigned until the session timeout expires. Each ID must be unique within the group. If two running instances share an ID, one of them is fenced with a FENCED_INSTANCE_ID error.
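
One common way to get an ID that is stable across restarts and unique per instance is to derive it from a stable host name, such as a Kubernetes StatefulSet pod name. The sketch below prints a properties line. The function name, the naming scheme, and the idea that your application loads its consumer settings from a properties file are all assumptions.

```shell
# Print a group.instance.id line built from a group name and a stable host name.
#   $1 = consumer group name
#   $2 = stable host name (defaults to $HOSTNAME, then the hostname command)
static_instance_id_line() {
  group=$1
  host=${2:-${HOSTNAME:-$(hostname)}}
  echo "group.instance.id=${group}-${host}"
}

# Example: static_instance_id_line payments pod-0
```

The example prints group.instance.id=payments-pod-0. Host names that change on every restart, such as those of Deployment-managed pods, defeat the purpose because each restart then joins as a new static member.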

Isolate a crashing consumer

If one consumer instance is stuck in a crash loop and dragging the group through repeated rebalances, stop that instance manually to let the remaining members stabilize. Then debug the poison pill or code path causing the crash outside the critical path.

Relieve coordinator broker pressure

If the coordinator broker is saturated, rebalances take longer to complete, increasing the chance that other members miss deadlines. Check for leadership imbalance or partition skew on the coordinator. If the broker is overloaded, reduce load or redistribute __consumer_offsets partition leadership.
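
To work out which __consumer_offsets partition a group maps to, you can reproduce the broker's hashing. The sketch below assumes the partition is Java's String.hashCode() of the group ID, made non-negative, modulo offsets.topic.num.partitions (default 50). It handles ASCII group IDs only, and the function name is illustrative. Treat the result as a cross-check: the COORDINATOR column printed by kafka-consumer-groups.sh --describe --state remains the authoritative answer for which broker to inspect.

```shell
# Estimate the __consumer_offsets partition for a group ID (ASCII only).
#   $1 = group ID
#   $2 = offsets.topic.num.partitions (defaults to 50)
offsets_partition_for_group() {
  group=$1
  partitions=${2:-50}
  h=0
  # Emulate Java's 32-bit String.hashCode(): h = 31 * h + char.
  for byte in $(printf '%s' "$group" | od -An -v -tu1); do
    h=$(( (h * 31 + byte) & 4294967295 ))
  done
  # Reinterpret the unsigned 32-bit value as a signed Java int.
  if [ "$h" -ge 2147483648 ]; then
    h=$((h - 4294967296))
  fi
  # Absolute value, with Integer.MIN_VALUE mapped to 0 (assumed broker behavior).
  if [ "$h" -eq -2147483648 ]; then
    h=0
  elif [ "$h" -lt 0 ]; then
    h=$((-h))
  fi
  echo $((h % partitions))
}

# Example: offsets_partition_for_group "my-group"
```

Once you have the partition number, kafka-topics.sh --describe --topic __consumer_offsets shows its current leader, which is the broker whose load and leadership share you would rebalance.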

Prevention

  • Monitor rebalance rate and alert when a stable group exceeds two or three rebalances per hour.
  • Use CooperativeStickyAssignor for all new consumer deployments to reduce stop-the-world impact.
  • Adopt static membership with group.instance.id for long-lived or stateful consumers.
  • Size max.poll.interval.ms to exceed the consumer’s p99 processing latency by at least 2x.
  • Keep heartbeat.interval.ms at no more than one-third of session.timeout.ms.
  • Monitor coordinator broker RequestHandlerAvgIdlePercent and RequestQueueSize independently from data brokers.

How Netdata helps

  • Netdata surfaces consumer group state and lag via the Kafka collector, making rebalance episodes visible without manual CLI checks.
  • Correlate consumer lag spikes with broker RequestHandlerAvgIdlePercent drops on the coordinator node to distinguish client-side timeouts from coordinator saturation.
  • Alert on sustained consumer group state changes, such as a group remaining in PreparingRebalance beyond a threshold.
  • Track consumer group state transitions and lag to confirm that rebalances cause measurable processing delay.
  • Track broker request queue size and I/O thread idle percent on the coordinator to catch the rare case where broker overload prolongs rebalances.