Kafka consumer group stuck Empty or Dead: no members consuming

A consumer group that should be active shows Empty or Dead state with no members. For a real-time pipeline, every new message piles up unread. For a batch job, Empty between runs is expected. The difference between an intentional idle state and a silent outage is simple: is lag growing, and was the group supposed to be active?

An Empty group still exists in the coordinator. Committed offsets are retained until they expire, so consumers can resume on restart. A Dead group has been removed, usually because offsets expired while no members were active. If the group is Dead and data is still being produced, new consumers start with no committed offsets and fall back to auto.offset.reset, which can skip data (latest) or reprocess it (earliest).

What this means

Kafka consumer groups move through lifecycle states managed by the Group Coordinator broker. The two states that mean nobody is consuming are:

  • Empty: The group exists in the coordinator but has no active members. Offsets are retained. This is normal for batch jobs that run and exit, or catastrophic if a long-running service is supposed to be active.
  • Dead: The group has been removed by the coordinator. This happens when the last committed offset expires while the group has no active members. Describing a Dead group returns an unknown group error. Any new consumer joining creates a new group instance.

The critical distinction is intent. Empty with stable lag is fine. Empty with monotonically growing lag means nobody is reading while both offset retention and topic retention tick down. Once offsets expire and the group becomes Dead, restarted consumers fall back to auto.offset.reset unless you manually reset offsets.
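The intent check can be sketched as a small shell function. This is a minimal illustration, not a Kafka tool: the name classify_group and its output labels are hypothetical, and it assumes you already have the group state and two total-lag samples taken some interval apart.

```shell
# Hypothetical helper: classify a group from its state and two lag samples.
# Usage: classify_group <state> <lag_before> <lag_after>
classify_group() {
  state="$1"; lag_before="$2"; lag_after="$3"
  case "$state" in
    Dead)
      # Committed offsets are gone; restarts fall back to auto.offset.reset.
      echo "dead-offsets-lost" ;;
    Empty)
      if [ "$lag_after" -gt "$lag_before" ]; then
        echo "outage-lag-growing"   # nobody is reading while data arrives
      else
        echo "idle-lag-stable"      # fine if the group is meant to be idle
      fi ;;
    *)
      echo "members-present" ;;     # Stable or mid-rebalance
  esac
}

classify_group Empty 1200 4800   # prints: outage-lag-growing
```

Whether "idle-lag-stable" is acceptable still depends on whether the group was supposed to be active, which only the service owner can answer.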

stateDiagram-v2
    [*] --> Stable : consumers join
    Stable --> PreparingRebalance : member change
    PreparingRebalance --> CompletingRebalance : join completes
    CompletingRebalance --> Stable : sync completes
    Stable --> Empty : last member leaves
    Empty --> Dead : offsets expire
    Empty --> Stable : consumers rejoin
    Dead --> [*] : group removed

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| All consumer instances crashed or stopped | Group is Empty; lag grows at the produce rate; no members in --describe | Consumer process health, container status, or recent deployments |
| Consumer blocked in processing and missed heartbeat/poll deadline | Group drops to Empty after being Stable; logs show CommitFailedException or max.poll.interval.ms exceeded | Consumer logs and thread dumps for blocking I/O or long computation |
| Offset retention expired while group was Empty | Group is Dead; group ID returns unknown group error or disappears from listings | Broker offsets.retention.minutes and how long the group was idle |
| Rebalance storm evicted all members | Group cycled rapidly through PreparingRebalance and CompletingRebalance before landing on Empty | Rebalance rate and coordinator logs for members timing out during rejoin |
| Intentional batch job completion | Group is Empty; lag is stable or zero; job runs on a schedule | Application run schedule and whether the group should be empty between batches |

Quick checks

Run these read-only commands to characterize the group state and surrounding context.

# Check group state, coordinator, assignment strategy, and member count
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state

# Check partition-level lag and which members were assigned
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>

# List all groups to confirm the group ID still exists
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Check topic retention and cleanup policy
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic <topic>

# Check broker egress (BytesOutPerSec) as a proxy for consumer fetch activity
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

If the group is Dead, check broker logs for group metadata cleanup events. If the group is Empty, re-run the describe command a few times and watch the LAG column: lag growing at the produce rate means total consumer failure; stable lag means nothing new is being produced and the topic may be idle.
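To turn repeated describe runs into a single number, the LAG column can be summed and sampled twice. The sketch below is an illustration under stated assumptions: total_lag and lag_delta are hypothetical helper names, KCG and BOOTSTRAP are illustrative variables, and the parser assumes the describe output has a header row containing a LAG column.

```shell
# Assumed defaults; override via environment.
KCG="${KCG:-kafka-consumer-groups.sh}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Sum the LAG column of --describe output read from stdin.
total_lag() {
  awk '
    # Find the LAG column in the header row so column order does not matter.
    lagcol == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lagcol = i; next }
    # Add numeric values only; "-" (no committed offset) is skipped.
    $lagcol ~ /^[0-9]+$/ { sum += $lagcol }
    END { print sum + 0 }
  '
}

# Print how much total lag changed over an interval (default 60 seconds).
lag_delta() {
  group="$1"; interval="${2:-60}"
  before=$("$KCG" --bootstrap-server "$BOOTSTRAP" --describe --group "$group" | total_lag)
  sleep "$interval"
  after=$("$KCG" --bootstrap-server "$BOOTSTRAP" --describe --group "$group" | total_lag)
  echo $((after - before))
}
```

A positive result from lag_delta on an Empty group means data is arriving with nobody reading it; zero means the topic was idle over that interval.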

How to diagnose it

  1. Confirm state. Run kafka-consumer-groups.sh --describe --group <group-id> --state. An Empty group reports state Empty with zero members. An unknown group error means the group is Dead.

  2. Check lag velocity. Run kafka-consumer-groups.sh --describe --group <group-id> several times and compare CURRENT-OFFSET to LOG-END-OFFSET. LAG that increases between runs means data is accumulating unread. Stable LAG means nothing new is being produced, which is consistent with an intentionally idle pipeline.

  3. Check consumer health. Look at process uptime, container restart counts, and application logs. Search for exceptions that crash the poll loop, OOM kills, or deployment events that scaled the consumer to zero.

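  4. Inspect heartbeat and poll violations. Logs showing max.poll.interval.ms exceeded mean the application took too long processing a batch, which is common with synchronous I/O or heavy computation inside the poll loop. A session timeout means the coordinator stopped receiving heartbeats; in clients that heartbeat from a background thread, such as the Java client, that points to a frozen process (for example a long GC pause) or a network problem rather than slow processing.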

  5. Evaluate the coordinator broker. During rebalance storms, the coordinator shows elevated CPU and increased JoinGroup and SyncGroup request rates. Low RequestHandlerAvgIdlePercent on the coordinator can stall group management.

  6. Check offset retention. If the group transitioned from Empty to Dead, compare how long it sat Empty against the broker's offsets.retention.minutes (default 10080, or 7 days, in Kafka 2.0 and later).

  7. Check __consumer_offsets health. It is compacted by default. A failed log cleaner causes disk growth rather than Empty groups directly, but compaction stalls can delay metadata cleanup.
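For the poll-violation step, a quick count of suspicious log lines can tell you whether slow processing is worth pursuing. This is a rough sketch: count_poll_violations is a hypothetical helper, and the patterns are assumptions based on messages commonly logged by the Java client, so adjust them to what your client version actually emits.

```shell
# Hypothetical helper: count lines in a consumer log that suggest the poll
# loop was too slow. Patterns are assumptions; tune them for your client.
# Usage: count_poll_violations <logfile>
count_poll_violations() {
  logfile="$1"
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
  # keeps the function safe under "set -e".
  grep -cE 'CommitFailedException|poll timeout has expired|max\.poll\.interval\.ms' "$logfile" || true
}
```

A non-zero count shortly before the group went Empty supports the blocked-consumer hypothesis; zero suggests looking at crashes, deployments, or the coordinator instead.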

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Consumer Group State | Directly indicates whether members are active | Empty for an active service group, or unexpected Dead |
| Consumer Group Lag | Measures unprocessed messages; growing lag confirms nobody is reading | Lag increasing monotonically for more than 2 minutes |
| Consumer Rebalance Rate | High rate indicates instability that can evict members | More than 3 rebalances per hour outside deployments |
| Bytes Out Per Second | Confirms whether any fetch traffic is leaving the broker | Drop to zero while lag grows signals total consumer failure |
| Fetch Consumer LocalTimeMs | High values mean reads are hitting disk instead of page cache | Spike correlated with lag growth indicates broker-side slowdown |
| Request Handler Avg Idle Percent | Saturation here affects the Group Coordinator's ability to manage groups | Sustained below 0.3 on the coordinator broker |

Fixes

Restart or restore crashed consumers

If all consumer instances are down, restart them. An Empty group resumes from last committed offsets.

Warning: If the group is Dead because offsets expired, restarting triggers auto.offset.reset behavior (typically latest or earliest). This can skip data or cause massive reprocessing. Manually reset offsets with kafka-consumer-groups.sh --reset-offsets if you need a specific position.

Fix blocked consumer threads

When consumers are alive but evicted for missing the poll deadline, the root cause is usually blocking work inside the poll loop. Reduce max.poll.records to shrink batch processing time, increase max.poll.interval.ms if the workload is legitimately slow, or move blocking I/O out of the poll thread.

Tradeoff: Increasing max.poll.interval.ms delays detection of genuinely crashed consumers. Reducing max.poll.records lowers throughput. The durable fix is non-blocking consumer design.
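The relationship between the two settings is simple arithmetic: a full batch of max.poll.records must finish before max.poll.interval.ms elapses. The sketch below checks that budget; poll_budget_ok is a hypothetical helper, and the per-record time is a worst-case figure you would need to measure yourself.

```shell
# Hypothetical helper: does a full batch fit inside max.poll.interval.ms?
# Usage: poll_budget_ok <max.poll.records> <worst-case ms per record> <max.poll.interval.ms>
poll_budget_ok() {
  records="$1"; per_record_ms="$2"; interval_ms="$3"
  batch_ms=$((records * per_record_ms))
  if [ "$batch_ms" -lt "$interval_ms" ]; then
    echo "ok: worst-case batch ${batch_ms} ms < ${interval_ms} ms"
  else
    echo "at risk: worst-case batch ${batch_ms} ms >= ${interval_ms} ms"
    return 1
  fi
}

# Java client defaults are 500 records and 300000 ms (5 minutes).
poll_budget_ok 500 800 300000 || true   # 400000 ms per batch: at risk
poll_budget_ok 100 800 300000           # 80000 ms per batch: ok
```

In this example, cutting max.poll.records from 500 to 100 brings the worst case well under the deadline without touching the timeout, at the cost of more poll round trips.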

Break a rebalance storm

If the group cycled through rebalances until no members remained, identify the slow or crashing consumer that triggered the first eviction. Stop that instance to let the remaining members stabilize. Then fix the root cause (poison pill message, downstream latency, or misconfiguration).

Tradeoff: Stopping one consumer reduces parallelism. This is a tactical recovery to prevent total group evacuation.

Handle offset retention expiration

If the group is Dead, starting consumers recreates the group. Before doing so, decide on offset handling. If data loss is unacceptable and topic retention has not yet deleted the underlying data, reset offsets to the last known position, or to earliest if reprocessing is viable.

Tradeoff: Resetting to earliest creates duplicate processing. Starting at latest skips the gap.
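A cautious way to apply a reset is to preview it first. The wrapper below is a sketch: reset_offsets is a hypothetical function name, KCG and BOOTSTRAP are illustrative variables, and the flags it passes through (--reset-offsets, --to-earliest, --to-latest, --to-offset, --to-datetime, --dry-run, --execute) are those of kafka-consumer-groups.sh. The tool only resets offsets while the group has no active members.

```shell
# Assumed defaults; override via environment.
KCG="${KCG:-kafka-consumer-groups.sh}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Usage: reset_offsets <group> <topic> <position flags...> <--dry-run|--execute>
# Position flags include --to-earliest, --to-latest, --to-offset N, --to-datetime T.
reset_offsets() {
  group="$1"; topic="$2"; shift 2
  "$KCG" --bootstrap-server "$BOOTSTRAP" --reset-offsets \
    --group "$group" --topic "$topic" "$@"
}

# Preview first; --dry-run prints the planned offsets and changes nothing.
#   reset_offsets <group-id> <topic> --to-earliest --dry-run
# Apply only after reviewing the preview, with all consumers still stopped.
#   reset_offsets <group-id> <topic> --to-earliest --execute
```

Start the consumers only after the --execute run reports the new offsets, so they pick up the committed position instead of falling back to auto.offset.reset.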

Prevention

  • Monitor state transitions. Alerting only on lag misses cases where consumers drop out but no new data has arrived yet.
  • Monitor lag as a rate of change. Absolute lag values are meaningless without throughput context.
  • Size timeouts to your workload. max.poll.interval.ms must accommodate the worst-case processing time for one batch of max.poll.records.
  • Use static group membership (group.instance.id). It prevents rebalances on transient consumer bounces and reduces the chance of total group evacuation.
  • Track coordinator broker health. Each group is managed by a single coordinator broker, so saturation there stalls the whole group.
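State-transition monitoring can start as a polling script before it becomes a proper alert. The sketch below is illustrative only: extract_state and report_transition are hypothetical helpers, and the parser assumes the group ID does not itself contain a state name as a whole word.

```shell
# Hypothetical helper: pull the group state out of --describe --state output on stdin.
# Assumes the group ID does not contain one of the state names as a whole word.
extract_state() {
  grep -owE 'Stable|Empty|Dead|PreparingRebalance|CompletingRebalance' | head -n 1
}

# Print a line only when the state differs from the previous sample.
# Usage: report_transition <previous-state> <current-state>
report_transition() {
  prev="$1"; cur="$2"
  if [ "$prev" != "$cur" ]; then
    echo "group state changed: ${prev} -> ${cur}"
  fi
}
```

If extract_state prints nothing, the describe call probably failed, which may mean the group no longer exists; treat that as a signal worth alerting on rather than ignoring it.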

How Netdata helps

  • Correlate consumer lag growth with broker-side FetchConsumer LocalTimeMs to distinguish slow consumers from page cache thrashing or disk I/O bottlenecks.
  • Track consumer group state transitions and rebalance rates alongside broker RequestHandlerAvgIdlePercent to spot coordinator saturation before groups evacuate.
  • Alert on lag velocity (rate of change) rather than static thresholds to catch an Empty group outage in the first few minutes.
  • Monitor OS-level page cache pressure (pgmajfault) to identify when backfill consumers or cold restarts are evicting the working set and driving up fetch latency.
  • Surface BytesOutPerSec per broker alongside consumer metrics: if lag grows but egress drops, consumers are likely dead rather than slow.