Redis Stream consumer group lag: pending entries and dead consumers

Stream processing falls behind. An alert fires on used_memory, or your application dashboard shows lag. You run XINFO GROUPS and see lag or pending climbing. Rising lag means consumers cannot keep up. Rising pending means consumers are receiving messages but not acknowledging them. Both increase memory pressure, but they have different fixes. This guide shows how to tell them apart, find dead consumers, and clean up the Pending Entry List before it becomes a memory incident.

What this means

Redis Streams consumer groups track two distinct backlogs.

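lag, returned by XINFO GROUPS in Redis 7.0+, counts entries in the stream that the group has not yet delivered to any consumer. Growing lag means producers outpace consumers. This is a throughput problem. Redis can report lag as null when it cannot compute the value, for example after entries have been deleted with XDEL, so treat a null as unknown rather than zero.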

pending counts entries delivered but not yet acknowledged with XACK. Growing pending means consumers received messages and never confirmed completion. The consumer may have crashed, hung, or omitted XACK. This is a reliability problem.

The Pending Entry List (PEL) is a per-group structure. Every pending entry carries metadata overhead. A large PEL consumes heap memory and slows group operations that inspect pending state.

XLEN measures total stream entries. Without MAXLEN or MINID trimming, XLEN grows without bound and adds its own memory cost. Lag and pending are consumer-group signals. XLEN is a stream-level retention signal. Watch all three.

flowchart TD
    A[XINFO GROUPS] --> B{lag growing?}
    B -->|Yes| C[Scale consumers]
    B -->|No| D{pending growing?}
    D -->|Yes| E[XINFO CONSUMERS idle]
    D -->|No| F[Check XLEN trimming]
    E --> G{Idle high?}
    G -->|Yes| H[XAUTOCLAIM dead consumer]
    G -->|No| I[Fix missing XACK]
    F --> J[Add MAXLEN or MINID]
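
The decision flow above can be sketched as a small function. This is only an illustration, not part of Redis or any client library: the names GroupSample and triage, and the 60-second idle threshold, are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupSample:
    """One reading of a consumer group, e.g. taken from XINFO GROUPS."""
    lag: Optional[int]  # None when Redis cannot compute lag
    pending: int


def triage(prev: GroupSample, curr: GroupSample,
           max_consumer_idle_ms: int,
           idle_threshold_ms: int = 60_000) -> str:
    """Map two consecutive samples to the next step in the flowchart."""
    lag_growing = (prev.lag is not None and curr.lag is not None
                   and curr.lag > prev.lag)
    if lag_growing:
        return "scale consumers"
    if curr.pending > prev.pending:
        # Pending is growing: a very idle consumer suggests it is dead,
        # otherwise consumers are alive but not calling XACK.
        if max_consumer_idle_ms >= idle_threshold_ms:
            return "XAUTOCLAIM from dead consumer"
        return "fix missing XACK"
    return "check XLEN trimming"
```

Two samples taken a minute or so apart are enough to drive it; in practice you would compare a longer trend before acting.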

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Producer rate exceeds consumer capacity | lag growing steadily; pending stable or slowly rising; high producer ops | XINFO GROUPS lag trend and instantaneous_ops_per_sec |
| Consumer crash or hang without XACK | pending climbing; lag flat or low; one consumer shows high idle time or is missing | XINFO CONSUMERS idle time and application error logs |
| Application logic omits XACK | pending grows linearly with throughput; consumers appear healthy and idle time stays low | XPENDING entry details and application code review |
| Unbounded stream growth | XLEN increases indefinitely; memory climbs even when lag and pending are stable | XLEN and whether MAXLEN or MINID is configured |
| Dead consumer entries never reclaimed | pending accumulates after a consumer disappears; no janitor runs XAUTOCLAIM | XINFO CONSUMERS list compared to expected instances |

Quick checks

Run these read-only commands to triage.

# Check group-level lag and pending counts
redis-cli XINFO GROUPS mystream

# Check per-consumer health and idle time
redis-cli XINFO CONSUMERS mystream mygroup

# Check total stream length
redis-cli XLEN mystream

# Inspect pending entry IDs, idle time, and delivery count
redis-cli XPENDING mystream mygroup - + 10

# Check memory pressure that PEL growth can cause
redis-cli INFO memory | grep -E "used_memory:|used_memory_rss:"

# Check for consumers blocked on stream reads
redis-cli INFO clients | grep blocked_clients
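
If you prefer to collect the same numbers from code, the read-only checks can be gathered in one pass. The sketch below assumes a redis-py style client created with decode_responses=True and passed in as client. The function name snapshot and the returned dictionary layout are hypothetical.

```python
from typing import Any, Dict, Optional


def snapshot(client: Any, stream: str, group: str) -> Dict[str, Optional[int]]:
    """Collect the triage numbers using read-only commands only."""
    groups = client.xinfo_groups(stream)
    info = next((g for g in groups if g["name"] == group), None)
    if info is None:
        raise KeyError(f"group {group!r} not found on stream {stream!r}")
    memory = client.info("memory")
    return {
        "lag": info.get("lag"),  # absent before Redis 7.0, may be None
        "pending": info["pending"],
        "xlen": client.xlen(stream),
        "used_memory": memory["used_memory"],
        "used_memory_rss": memory["used_memory_rss"],
    }
```

Calling it on a schedule and storing the results gives you the trend data the diagnosis steps rely on.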

How to diagnose it

  1. Run XINFO GROUPS <stream> and compare lag and pending. If lag is rising fastest, focus on throughput. If pending is rising fastest, focus on consumer reliability.
  2. Run XINFO CONSUMERS <stream> <group> and compare idle times against your expected read interval. A consumer whose idle time is many multiples of the interval is likely dead.
  3. Run XPENDING <stream> <group> - + 10 to inspect individual entries. High idle times on specific entries point to stuck messages. High delivery counts point to messages being redelivered repeatedly.
  4. Run XLEN <stream> to verify the stream is not growing without bound. If XLEN is orders of magnitude larger than your retention target, trimming is missing.
  5. Check INFO memory for used_memory and used_memory_rss growth that correlates with backlog growth. The PEL is heap-allocated; large pending lists show up as memory growth.
  6. Check application logs for consumer crashes, restarts, or exceptions that would prevent an XACK from executing.
  7. If consumers appear healthy but lag persists, check INFO clients | grep blocked_clients. A drop in blocked clients alongside rising lag can indicate consumers are crashing between reads rather than staying connected and blocked.
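
Steps 2 and 3 can be automated once you have the command output as data. The sketch below works on plain lists of dictionaries and assumes the key names that redis-py uses for XINFO CONSUMERS (name, idle) and for the extended XPENDING form (message_id, time_since_delivered, times_delivered). The function names and thresholds are hypothetical.

```python
from typing import Dict, List


def suspect_consumers(consumers: List[Dict], read_interval_ms: int,
                      multiple: int = 10) -> List[str]:
    """Names of consumers idle for at least `multiple` read intervals."""
    limit = read_interval_ms * multiple
    return [c["name"] for c in consumers if c["idle"] >= limit]


def classify_pending(entries: List[Dict], idle_limit_ms: int,
                     delivery_limit: int) -> Dict[str, List[str]]:
    """Split pending entries into stuck and repeatedly redelivered IDs."""
    result: Dict[str, List[str]] = {"stuck": [], "redelivered": []}
    for entry in entries:
        if entry["time_since_delivered"] >= idle_limit_ms:
            result["stuck"].append(entry["message_id"])
        if entry["times_delivered"] >= delivery_limit:
            result["redelivered"].append(entry["message_id"])
    return result
```

An entry can land in both lists; that combination usually deserves attention first, since it has been retried several times and is still not acknowledged.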

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| XINFO GROUPS lag | Entries produced but not yet delivered to the group | Sustained growth over multiple minutes |
| XINFO GROUPS pending | Delivered but unacknowledged entries in the PEL | Sustained growth; correlates with memory pressure |
| XLEN | Total entries retained in the stream | Growing without bound when trimming should cap it |
| Consumer idle time from XINFO CONSUMERS | Time since the consumer last interacted with the group | Exceeds 10x the expected read interval |
| used_memory and used_memory_rss | Heap memory including PEL and stream overhead | Growth correlated with pending or stream length |
| blocked_clients | Consumers waiting on blocking commands | Sustained count above baseline for stream workloads |
| instantaneous_ops_per_sec | Overall command throughput | Sudden drop may indicate blocked or crashed consumers |
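
"Sustained growth" needs a concrete definition before it can drive an alert. One simple interpretation, sketched below, is that the value rose in each of the last few sampling intervals. The function name and the default of three consecutive rises are assumptions for illustration, not a recommended threshold.

```python
from typing import Sequence


def sustained_growth(samples: Sequence[int], min_rises: int = 3) -> bool:
    """True if each of the last `min_rises` intervals was strictly increasing."""
    if len(samples) < min_rises + 1:
        return False  # not enough history to judge
    tail = samples[-(min_rises + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))
```

With one sample per minute, min_rises=3 corresponds to three minutes of uninterrupted growth in lag, pending, or XLEN.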

Fixes

Consumers cannot keep up (growing lag)

Add consumer instances if the workload partitions. Review the batch size in XREADGROUP. Too many entries per fetch increases processing latency; too few increases round-trip overhead. Check consumer CPU and network saturation. If consumers use BLOCK, ensure the timeout is not so long that recovery from a stalled consumer is delayed.

If scaling consumers is not immediate, increase the COUNT argument in XREADGROUP to pull larger batches, or apply backpressure at the producer if the pipeline supports it. Verify consumers are not bottlenecked on downstream dependencies, such as databases or external APIs, which can manifest as low CPU but high lag.
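
A rough capacity estimate helps decide how many consumers to add and how long an existing backlog will take to clear. The arithmetic below is a simplified sketch that assumes the workload partitions evenly and per-consumer throughput is constant. The function names and the 20% headroom default are hypothetical.

```python
import math
from typing import Optional


def consumers_needed(peak_producer_rate: float, per_consumer_rate: float,
                     headroom: float = 0.2) -> int:
    """Consumers required to absorb peak input plus a safety margin."""
    return math.ceil(peak_producer_rate * (1 + headroom) / per_consumer_rate)


def drain_seconds(lag: int, producer_rate: float,
                  total_consumer_rate: float) -> Optional[float]:
    """Seconds until the backlog clears, or None if it never will."""
    surplus = total_consumer_rate - producer_rate
    if surplus <= 0:
        return None  # consumers are not faster than producers
    return lag / surplus
```

For example, a peak of 5,000 entries per second with consumers that each handle 800 per second works out to 8 consumers at 20% headroom.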

Pending entries from dead or crashed consumers

Identify the dead consumer via XINFO CONSUMERS idle time. Run XAUTOCLAIM (Redis 6.2+) to reassign its pending entries to a healthy consumer, or use XCLAIM for fine-grained control. Start with a conservative idle-time threshold in XAUTOCLAIM, at least several times your maximum expected processing time, to avoid stealing active work. Make sure the receiving consumer processes entries idempotently: a claimed entry may already have been partly or fully processed by its original owner, and XAUTOCLAIM does nothing to deduplicate that work.

The claiming consumer must still call XACK after processing, or the entries remain in the PEL. If you do not have a janitor process running XAUTOCLAIM, implement one. Without it, pending entries from crashed consumers accumulate forever.
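
A janitor can be a short loop that pages through XAUTOCLAIM, processes each claimed entry, and acknowledges it. The sketch below assumes a redis-py style client whose xautoclaim call returns the next cursor followed by a list of (id, fields) pairs. The function name reclaim_stale, the handle callback, and the batch size are hypothetical.

```python
from typing import Any, Callable, Dict


def reclaim_stale(client: Any, stream: str, group: str, janitor: str,
                  min_idle_ms: int, handle: Callable[[str, Dict], None],
                  batch: int = 100) -> int:
    """Claim entries idle for at least min_idle_ms, process, then XACK."""
    cursor = "0-0"
    done = 0
    while True:
        reply = client.xautoclaim(stream, group, janitor, min_idle_ms,
                                  start_id=cursor, count=batch)
        cursor, entries = reply[0], reply[1]
        for entry_id, fields in entries:
            if entry_id is None:
                continue  # placeholder some versions return for deleted entries
            handle(entry_id, fields)  # must be idempotent
            client.xack(stream, group, entry_id)  # removes it from the PEL
            done += 1
        if cursor in ("0-0", b"0-0"):
            break  # a 0-0 cursor means the scan is complete
    return done
```

If handle raises, the entry stays pending under the janitor's name and becomes eligible for another claim once it has been idle long enough again.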

Application missing XACK

Review application code to ensure XACK runs after successful processing and before any logic that can throw or exit. If your client library offers automatic acknowledgement, verify it is not silently failing on uncaught exceptions. Some libraries run XACK in a finally block that is skipped on process termination.

A missing XACK is a code bug, not a Redis issue. The PEL grows linearly with throughput until the fix is deployed.
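
The pattern to aim for is "process, then acknowledge, and only on success". The sketch below assumes a redis-py style client using the default reply shape, where xreadgroup returns a list of [stream, entries] pairs. The names consume_once and handle are hypothetical.

```python
from typing import Any, Callable, Dict


def consume_once(client: Any, stream: str, group: str, consumer: str,
                 handle: Callable[[str, Dict], None],
                 count: int = 10, block_ms: int = 5000) -> int:
    """Read one batch of new entries and XACK only those that succeed."""
    reply = client.xreadgroup(group, consumer, {stream: ">"},
                              count=count, block=block_ms)
    acked = 0
    for _stream_name, entries in reply or []:
        for entry_id, fields in entries:
            try:
                handle(entry_id, fields)
            except Exception:
                # Leave the entry in the PEL so it can be retried or reclaimed.
                continue
            client.xack(stream, group, entry_id)
            acked += 1
    return acked
```

Acknowledging right after the handler returns keeps the window between "done" and "acked" small, so a crash leaves at most one processed but unacknowledged entry per consumer.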

Unbounded stream growth

Configure MAXLEN or MINID trimming on XADD, or run XTRIM periodically. Use approximate trimming with ~ on large streams. Exact trimming is expensive because it operates on individual entries rather than whole macro-nodes. When using MAXLEN ~ <count>, Redis trims to a node boundary near the limit. If you need strict retention compliance, use exact trimming or MINID with an ID derived from your retention window.

Without trimming, XLEN and memory grow indefinitely.
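
Both trimming styles can be sketched briefly. The example assumes a redis-py style client and auto-generated entry IDs, whose first part is a millisecond timestamp, so a retention window maps directly to a MINID threshold (MINID requires Redis 6.2+). The helper names are hypothetical.

```python
import time
from typing import Any, Dict, Optional


def minid_for_retention(retention_ms: int, now_ms: Optional[int] = None) -> str:
    """Oldest stream ID to keep for a retention window ending now."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return f"{max(now_ms - retention_ms, 0)}-0"


def add_capped(client: Any, stream: str, fields: Dict, max_entries: int):
    """XADD with approximate MAXLEN trimming (MAXLEN ~ max_entries)."""
    return client.xadd(stream, fields, maxlen=max_entries, approximate=True)


def trim_by_age(client: Any, stream: str, retention_ms: int):
    """Exact XTRIM MINID for strict retention; costlier than approximate."""
    return client.xtrim(stream, minid=minid_for_retention(retention_ms),
                        approximate=False)
```

Trimming on every XADD keeps the cap enforced continuously, while a periodic trim_by_age suits retention rules expressed as time rather than entry count.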

Memory pressure from a large PEL

Reclaim and acknowledge entries to shrink the PEL. No INFO field isolates PEL memory, but growth shows up in used_memory and used_memory_rss. If used_memory approaches maxmemory, verify your eviction policy. The allkeys-* policies, such as allkeys-lru, can evict the stream key itself, dropping unconsumed entries. The volatile-* policies only evict keys that have a TTL, so to protect the stream under those policies make sure the stream key does not carry one. Alternatively, use noeviction and handle write errors explicitly instead of losing data.

Ensure maxmemory is set so that runaway stream growth triggers eviction or alerts before an OOM kill.
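
These checks can be expressed as a small audit over values you already have: the INFO memory fields, the configured maxmemory-policy, and whether the stream key has a TTL. The sketch below is illustrative only. The function name eviction_risks and the 80% warning ratio are assumptions. It relies on the Redis behavior that volatile-* policies only consider keys with a TTL.

```python
from typing import Dict, List


def eviction_risks(memory_info: Dict, policy: str, stream_has_ttl: bool,
                   warn_ratio: float = 0.8) -> List[str]:
    """List reasons the stream could be lost or the server OOM-killed."""
    risks: List[str] = []
    maxmemory = int(memory_info.get("maxmemory", 0))
    used = int(memory_info["used_memory"])
    if maxmemory == 0:
        risks.append("maxmemory is unset: growth is unbounded")
    elif used / maxmemory >= warn_ratio:
        risks.append("used_memory is close to maxmemory")
    if policy.startswith("allkeys-"):
        risks.append("policy can evict the stream key")
    elif policy.startswith("volatile-") and stream_has_ttl:
        risks.append("stream key has a TTL and is an eviction candidate")
    return risks
```

An empty list does not prove the stream is safe; it only means none of these specific conditions were detected.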

Prevention

  • Configure MAXLEN or MINID trimming to cap XLEN.
  • Run a periodic XAUTOCLAIM janitor to recover entries from dead consumers.
  • Ensure consumers call XACK immediately after processing.
  • Monitor lag and pending as first-class metrics alongside memory.
  • Size consumer capacity for peak producer throughput.
  • Set maxmemory and an appropriate eviction policy so that runaway growth triggers eviction before OOM.

How Netdata helps

Netdata surfaces Redis INFO metrics alongside system-level signals, so you can correlate stream backlog with resource pressure:

  • used_memory and used_memory_rss trends show when PEL or stream growth translates into memory pressure.
  • blocked_clients tracking detects changes in blocking consumer behavior.
  • instantaneous_ops_per_sec drops reveal consumer slowdowns that precede lag spikes.
  • Memory saturation alerts (used_memory versus maxmemory) fire before OOM kills, giving you runway to trim streams or reclaim pending entries.