Kafka consumer group stuck Empty or Dead: no members consuming
A consumer group that should be active shows Empty or Dead state with no members. For a real-time pipeline, every new message piles up unread. For a batch job, Empty between runs is expected. The difference between an intentional idle state and a silent outage is simple: is lag growing, and was the group supposed to be active?
An Empty group still exists in the coordinator. Committed offsets are retained until they expire, so consumers can resume on restart. A Dead group has been removed, usually because offsets expired while no members were active. If the group is Dead and data is still being produced, new consumers start a fresh group with no committed offsets and fall back to the auto.offset.reset policy, which can skip or reprocess data.
What this means
Kafka consumer groups move through lifecycle states managed by the Group Coordinator broker. The two states that mean nobody is consuming are:
- Empty: The group exists in the coordinator but has no active members. Offsets are retained. This is normal for batch jobs that run and exit, or catastrophic if a long-running service is supposed to be active.
- Dead: The group has been removed by the coordinator. This happens when the last committed offset expires while the group has no active members. Describing a Dead group returns an unknown group error. Any new consumer joining creates a new group instance.
The critical distinction is intent. Empty with stable lag is fine. Empty with monotonically growing lag means nobody is reading and retention is ticking down. Once offsets expire and the group becomes Dead, restarted consumers fall back to auto.offset.reset unless you manually reset offsets first.
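The intent check described above can be sketched as a small shell function. This is a minimal illustration, not part of any Kafka tooling: `classify_group` and its argument order are hypothetical, the state string is assumed to come from the `--state` output (with an unknown group error mapped to `Dead` by the caller), and the two lag values are assumed to be group totals sampled some interval apart.

```shell
# classify_group STATE LAG_BEFORE LAG_AFTER EXPECTED_ACTIVE
#   STATE            group state, e.g. Stable, Empty, Dead
#   LAG_BEFORE/AFTER total group lag from two samples taken some time apart
#   EXPECTED_ACTIVE  "yes" for a long-running service, "no" for a batch job
classify_group() {
  state=$1 before=$2 after=$3 expected=$4
  case "$state" in
    Dead)
      echo "removed: choose an offset position before restarting consumers" ;;
    Empty)
      if [ "$after" -gt "$before" ]; then
        echo "outage: lag is growing and nobody is reading"
      elif [ "$expected" = "yes" ]; then
        echo "investigate: service group has no members, lag not growing yet"
      else
        echo "idle: expected between batch runs"
      fi ;;
    *)
      echo "members present: state is $state" ;;
  esac
}
```

For example, `classify_group Empty 100 400 yes` reports an outage, while `classify_group Empty 0 0 no` reports an expected idle state.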
stateDiagram-v2
[*] --> Stable : consumers join
Stable --> PreparingRebalance : member change
PreparingRebalance --> CompletingRebalance : join completes
CompletingRebalance --> Stable : sync completes
Stable --> Empty : last member leaves
Empty --> Dead : offsets expire
Empty --> Stable : consumers rejoin
Dead --> [*] : group removed
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| All consumer instances crashed or stopped | Group is Empty; lag grows at the produce rate; no members in --describe | Consumer process health, container status, or recent deployments |
| Consumer blocked in processing and missed heartbeat/poll deadline | Group drops to Empty after being Stable; logs show CommitFailedException or max.poll.interval.ms exceeded | Consumer logs and thread dumps for blocking I/O or long computation |
| Offset retention expired while group was Empty | Group is Dead; group ID returns unknown group error or disappears from listings | Broker offsets.retention.minutes and how long the group was idle |
| Rebalance storm evicted all members | Group cycled rapidly through PreparingRebalance and CompletingRebalance before landing on Empty | Rebalance rate and coordinator logs for members timing out during rejoin |
| Intentional batch job completion | Group is Empty; lag is stable or zero; job runs on a schedule | Application run schedule and whether the group should be empty between batches |
Quick checks
Run these read-only commands to characterize the group state and surrounding context.
# Check group state, protocol, and member count
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id> --state
# Check partition-level lag and which members were assigned
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
# List all groups to confirm the group ID still exists
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# Check topic retention and cleanup policy
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic <topic>
# Check broker egress (BytesOutPerSec) as a rough proxy for consumer fetch activity
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
If the group is Dead, check broker logs for group metadata cleanup events. If the group is Empty, compare current lag to the log end offset: lag growing at the produce rate means total consumer failure; stable lag means the topic may be idle.
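To turn the lag comparison into a number, one option is to sum the LAG column from two describe runs and divide the difference by the interval. The sketch below assumes the tabular output of `kafka-consumer-groups.sh --describe` with a header row containing a `LAG` column; `total_lag` and `lag_rate` are hypothetical helper names, not part of the Kafka distribution.

```shell
# total_lag: read `--describe --group` output on stdin, print the summed LAG.
total_lag() {
  awk '
    # Find the LAG column from the header row; column order varies by Kafka version.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Count only numeric values; "-" means the partition has no committed offset.
    $lag_col ~ /^[0-9]+$/ { sum += $lag_col }
    END { print sum + 0 }
  '
}

# lag_rate BEFORE AFTER SECONDS: change in lag per second between two samples.
# SECONDS must be greater than zero.
lag_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Usage against a live cluster (not executed here):
#   before=$(kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#              --describe --group <group-id> | total_lag)
#   sleep 60
#   after=$(kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#              --describe --group <group-id> | total_lag)
#   lag_rate "$before" "$after" 60
```

A rate close to the topic's produce rate suggests no consumer is reading at all; a rate near zero suggests the topic is idle or consumers are keeping up.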
How to diagnose it
1. Confirm state. Run kafka-consumer-groups.sh --describe --group <group-id> --state. Empty shows the protocol and assignment strategy. An unknown group error means the group is Dead.
2. Check lag velocity. Run kafka-consumer-groups.sh --describe --group <group-id> and compare CURRENT-OFFSET to LOG-END-OFFSET. Increasing LAG on repeat runs means data is accumulating. Stable lag means the pipeline is intentionally idle.
3. Check consumer health. Look at process uptime, container restart counts, and application logs. Search for exceptions that crash the poll loop, OOM kills, or deployment events that scaled the consumer to zero.
4. Inspect heartbeat and poll violations. Logs showing max.poll.interval.ms exceeded or session timeout mean the application took too long processing a batch. This is common with synchronous I/O or heavy computation inside the poll loop.
5. Evaluate the coordinator broker. During rebalance storms, the coordinator shows elevated CPU and increased JoinGroup and SyncGroup request rates. Low RequestHandlerAvgIdlePercent on the coordinator can stall group management.
6. Check offset retention. If the group transitioned from Empty to Dead, review offsets.retention.minutes.
7. Check __consumer_offsets health. It is compacted by default. A failed log cleaner causes disk growth rather than Empty groups directly, but compaction stalls can delay metadata cleanup.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer Group State | Directly indicates whether members are active | Empty for an active service group, or unexpected Dead |
| Consumer Group Lag | Measures unprocessed messages; growing lag confirms nobody is reading | Lag increasing monotonically for more than 2 minutes |
| Consumer Rebalance Rate | High rate indicates instability that can evict members | More than 3 rebalances per hour outside deployments |
| Bytes Out Per Second | Confirms whether any fetch traffic is leaving the broker | Drop to zero while lag grows signals total consumer failure |
| Fetch Consumer LocalTimeMs | High values mean reads are hitting disk instead of page cache | Spike correlated with lag growth indicates broker-side slowdown |
| Request Handler Avg Idle Percent | Saturation here affects the Group Coordinator’s ability to manage groups | Sustained below 0.3 on the coordinator broker |
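The "lag increasing monotonically" warning sign from the table can be approximated with a check over a short series of lag samples. This is only a sketch: `lag_is_growing` is a hypothetical helper, and it assumes the samples were taken at a fixed interval (for example, five samples 30 seconds apart to cover 2 minutes).

```shell
# lag_is_growing SAMPLE...: succeed only if each sample exceeds the previous one.
# Fewer than two samples is treated as "not growing".
lag_is_growing() {
  prev=""
  for sample in "$@"; do
    if [ -n "$prev" ] && [ "$sample" -le "$prev" ]; then
      return 1
    fi
    prev=$sample
  done
  [ "$#" -ge 2 ]
}

# Example: lag_is_growing 100 250 400 620 900 && echo "alert: lag growing"
```

A strict monotonic check is deliberately conservative: a single flat or falling sample resets the signal, which avoids alerting on a consumer that is slow but still making progress.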
Fixes
Restart or restore crashed consumers
If all consumer instances are down, restart them. An Empty group resumes from last committed offsets.
Warning: If the group is Dead because offsets expired, restarting triggers auto.offset.reset behavior (typically latest or earliest). This can skip data or cause massive reprocessing. Manually reset offsets with kafka-consumer-groups.sh --reset-offsets if you need a specific position.
Fix blocked consumer threads
When consumers are alive but evicted for missing heartbeats, the root cause is usually blocking work inside the poll loop. Reduce max.poll.records to shrink batch processing time, increase max.poll.interval.ms if the workload is legitimately slow, or move blocking I/O out of the poll thread.
Tradeoff: Increasing max.poll.interval.ms delays detection of genuinely crashed consumers. Reducing max.poll.records lowers throughput. The durable fix is non-blocking consumer design.
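As an illustration of the two tuning knobs, the snippet below writes a small consumer properties file. The file name and the chosen values are assumptions for the example, not recommendations; the right numbers depend on measured per-batch processing time.

```shell
# Write illustrative poll-loop settings to a properties file (values are examples only).
cat > consumer-tuning.properties <<'EOF'
# Fewer records per poll() shortens the time spent processing each batch (default 500).
max.poll.records=100
# Maximum gap between poll() calls before the consumer is evicted (default 300000 = 5 min).
max.poll.interval.ms=600000
EOF
```

Lowering max.poll.records is usually the safer first step, because it shortens each batch without delaying detection of a stuck consumer.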
Break a rebalance storm
If the group cycled through rebalances until no members remained, identify the slow or crashing consumer that triggered the first eviction. Stop that instance to let the remaining members stabilize. Then fix the root cause (poison pill message, downstream latency, or misconfiguration).
Tradeoff: Stopping one consumer reduces parallelism. This is a tactical recovery to prevent total group evacuation.
Handle offset retention expiration
If the group is Dead, you must recreate group membership by starting consumers. Before doing so, decide on offset handling. If data loss is unacceptable and the retention period has not yet deleted the actual topic data, reset offsets to the last known position or to earliest if reprocessing is viable.
Tradeoff: Resetting to earliest creates duplicate processing. Starting at latest skips the gap.
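A cautious way to apply the reset is to preview it first and execute only after reviewing the planned offsets. The wrapper below is a sketch: `reset_group_offsets`, `KAFKA_GROUPS_CMD`, and `BOOTSTRAP` are hypothetical names introduced here, while `--reset-offsets`, `--to-earliest`, `--to-latest`, `--dry-run`, and `--execute` are options of kafka-consumer-groups.sh. The tool only resets offsets while the group has no active members, so run it before starting consumers.

```shell
# reset_group_offsets GROUP TOPIC TARGET [--execute]
#   TARGET is a single position flag such as --to-earliest or --to-latest.
#   Without a fourth argument the call defaults to --dry-run, which only
#   prints the offsets that would be written.
reset_group_offsets() {
  "${KAFKA_GROUPS_CMD:-kafka-consumer-groups.sh}" \
    --bootstrap-server "${BOOTSTRAP:-localhost:9092}" \
    --group "$1" --topic "$2" \
    --reset-offsets "$3" "${4:---dry-run}"
}

# Preview, then apply (not executed here):
#   reset_group_offsets <group-id> <topic> --to-earliest
#   reset_group_offsets <group-id> <topic> --to-earliest --execute
```

Defaulting to a dry run keeps an accidental invocation from rewriting offsets.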
Prevention
- Monitor state transitions. Alerting only on lag misses cases where consumers drop out but no new data has arrived yet.
- Monitor lag as a rate of change. Absolute lag values are meaningless without throughput context.
- Size timeouts to your workload. Timeouts must accommodate actual processing time per batch.
- Use static group membership. It prevents rebalances on transient consumer bounces and reduces the chance of total group evacuation.
- Track coordinator broker health. Each group is managed by a single coordinator broker, so saturation there stalls joins, heartbeats, and offset commits for that group.
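For the static group membership item above, a minimal consumer configuration might look like the following. The file name, the instance ID, and the timeout value are illustrative assumptions; each consumer instance needs its own unique group.instance.id, typically derived from a stable host or pod name.

```shell
# Write illustrative static-membership settings (values are examples only).
cat > consumer-static-member.properties <<'EOF'
# A stable, unique ID per consumer instance enables static membership.
group.instance.id=orders-consumer-0
# A restart that completes within this window does not trigger a rebalance.
session.timeout.ms=60000
EOF
```

A longer session timeout makes restarts cheaper but also delays reassignment when an instance is genuinely gone, so it is worth sizing against typical restart time.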
How Netdata helps
- Correlate consumer lag growth with broker-side FetchConsumerLocalTimeMs to distinguish slow consumers from page cache thrashing or disk I/O bottlenecks.
- Track consumer group state transitions and rebalance rates alongside broker RequestHandlerAvgIdlePercent to spot coordinator saturation before groups evacuate.
- Alert on lag velocity (rate of change) rather than static thresholds to catch an Empty group outage in the first few minutes.
- Monitor OS-level page cache pressure (pgmajfault) to identify when backfill consumers or cold restarts are evicting the working set and driving up fetch latency.
- Surface BytesOutPerSec per broker alongside consumer metrics: if lag grows but egress drops, consumers are likely dead rather than slow.
Related guides
- How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
- Kafka consumer group lag growing: detection, lag-as-time, and root causes: /guides/kafka/kafka-consumer-group-lag-growing/
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind: /guides/kafka/kafka-kraft-metadata-log-lag/
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata: /guides/kafka/kafka-kraft-quorum-no-leader/
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections: /guides/kafka/kafka-leader-election-rate-high/
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation: /guides/kafka/kafka-leader-not-available/
- Kafka leadership imbalance: LeaderCount skew and preferred replica election: /guides/kafka/kafka-leadership-imbalance/
- Kafka min.insync.replicas and acks: configuring durability you actually have: /guides/kafka/kafka-min-insync-replicas-misconfigured/
- Kafka monitoring checklist: the signals every production cluster needs: /guides/kafka/kafka-monitoring-checklist/
- Kafka monitoring maturity model: from survival to expert: /guides/kafka/kafka-monitoring-maturity-model/







