Redis blocked_clients growing: dead consumers vs healthy queues

blocked_clients in INFO clients is climbing. In queue-based architectures this is often normal: workers call BLPOP, BRPOP, or XREAD BLOCK and wait for producers to push work. When blocked_clients grows while queue depth also grows, consumers are no longer consuming. They may have crashed, been OOM-killed, or stalled on replication lag via WAIT.

blocked_clients counts only clients waiting on explicit blocking commands. It does not capture clients stalled by slow commands like KEYS * or large SMEMBERS. A high value is either a healthy signal of an active queue pattern or a pathological signal of dead connections holding slots open forever, especially with timeout 0.

This guide shows how to tell the difference, what commands to run, and how to fix the underlying cause without restarting Redis.

What this means

blocked_clients increments when a client executes a blocking command and the required condition is not met. Commands that increment it include BLPOP, BRPOP, BLMOVE, BZPOPMIN, BZPOPMAX, XREAD BLOCK, XREADGROUP BLOCK, and WAIT. A client blocked on BLPOP with timeout 0 waits indefinitely until data arrives or the connection closes. If the consumer crashes while blocked, the TCP connection may hang open (for example, when the host dies without sending a FIN or RST) and Redis keeps counting that client in blocked_clients until the connection is closed, consuming one connection slot and never processing messages.

WAIT blocks the calling client until prior writes are acknowledged by numreplicas replicas within the timeout. If replicas are down or lagging and the timeout is large, WAIT holds a blocked slot until replication catches up or the timeout fires.

A healthy queue worker pool shows a stable blocked_clients count at or below the number of worker processes, and equal to it when every worker is idle and waiting. The problem to act on is sustained growth above that baseline, or a count that nears connected_clients while queue depth increases.

flowchart TD
    A[blocked_clients growing] --> B{Queue depth growing?}
    B -->|Yes| C[Dead consumers or WAIT]
    B -->|No| D[Producer failure or healthy idle]
    C --> E{Replication lag?}
    E -->|Yes| F[WAIT blocking on lag]
    E -->|No| G[Crashed consumers with infinite timeout]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Crashed consumers with timeout=0 | blocked_clients grows; queue length grows; no processing visible in logs | CLIENT LIST for idle blocked connections; LLEN or XLEN |
| WAIT blocking on replication lag | blocked_clients grows after write bursts; replicas lagging or link down | INFO replication offset delta and master_link_status |
| Producer failure | blocked_clients stable at worker count; queues stay empty; no new jobs | Application producer logs; LLEN / XLEN near zero |
| Stream consumer group with no live consumers | Stream lag grows; blocked_clients may be flat because XREADGROUP sessions died | XINFO GROUPS lag and pending counts |
| Connection leak in blocking clients | blocked_clients and connected_clients both grow; high idle times | CLIENT LIST sorted by idle |

Quick checks

Run these read-only commands to characterize the state.

# Confirm blocked client count
redis-cli INFO clients | grep blocked_clients

# Check total connection load
redis-cli INFO clients | grep connected_clients

# Identify blocked clients, their command, and idle time
redis-cli CLIENT LIST

# Check list or stream depth
redis-cli LLEN myqueue
redis-cli XLEN mystream

# Check stream consumer group health (Redis 7.0+ lag field)
redis-cli XINFO GROUPS mystream

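# Check for replication lag that keeps WAIT blocked (run on the primary and on each replica;
# the primary reports per-replica offsets on its slaveN lines)
redis-cli INFO replication | grep -E "master_repl_offset|slave_repl_offset|master_link_status|^slave[0-9]+:"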

# Rule out slow commands (these do NOT increment blocked_clients)
redis-cli SLOWLOG LEN
redis-cli SLOWLOG GET 10

# Check command mix to confirm blocking command usage
redis-cli INFO commandstats | grep -E "cmdstat_blpop|cmdstat_brpop|cmdstat_blmove|cmdstat_wait|cmdstat_xread"

Note: In CLIENT LIST output, the flags field contains b while a client is waiting in a blocking operation, cmd shows the last command the client issued, and idle shows seconds since its last interaction. High idle on a blocked client suggests a stalled connection; confirm against queue depth and consumer process health before treating it as a zombie.
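The CLIENT LIST check can be made repeatable with a small filter. The following is a minimal sketch, not an official tool: the function name stale_blocked_clients and the 300-second default threshold are assumptions chosen for illustration. It relies on the b flag that CLIENT LIST sets for clients waiting in a blocking operation, and it reads CLIENT LIST text on stdin so it can also be run against saved output.

```shell
# Print "id idle cmd" for blocked clients idle at least N seconds,
# longest-idle first. Reads CLIENT LIST output on stdin.
# $1: minimum idle seconds (hypothetical default: 300)
stale_blocked_clients() {
  awk -v min_idle="${1:-300}" '
    {
      id = ""; idle = 0; flags = ""; cmd = ""
      # Each CLIENT LIST line is a series of space-separated key=value fields.
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq == 0) continue
        key = substr($i, 1, eq - 1)
        val = substr($i, eq + 1)
        if (key == "id") id = val
        else if (key == "idle") idle = val + 0
        else if (key == "flags") flags = val
        else if (key == "cmd") cmd = val
      }
      # Lowercase b marks a client waiting in a blocking operation.
      if (index(flags, "b") > 0 && idle >= min_idle)
        print id, idle, cmd
    }' | sort -k2,2nr
}
```

A typical invocation would be `redis-cli CLIENT LIST | stale_blocked_clients 600`. Healthy workers waiting on an empty queue will also appear, so treat the output as a list of candidates to cross-check against queue depth, not as a list of dead connections.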

How to diagnose it

  1. Establish whether the count is truly anomalous. If your application runs 50 queue workers, a stable blocked_clients of 50 is normal. Alert on deviation from baseline, not on absolute value.
  2. Determine which blocking commands are in use. Check INFO commandstats for cmdstat_blpop, cmdstat_brpop, cmdstat_wait, or cmdstat_xread. If none are present but blocked_clients is high, look for module-issued blocking operations or older commands like BRPOPLPUSH (deprecated, replaced by BLMOVE).
  3. Correlate with queue depth. Use LLEN for lists or XLEN for streams. If queue depth grows while blocked_clients also grows, consumers are not draining the queue. They are likely dead or stalled. If queue depth is near zero and blocked_clients is stable, the workers are simply idle.
  4. Check for WAIT-specific lag. If cmdstat_wait is present, compare master_repl_offset on the primary with slave_repl_offset on the replica. A large and growing delta means replicas are behind. master_link_status:down on replicas used by WAIT will cause it to block until timeout.
  5. Inspect individual blocked connections. In CLIENT LIST, look for entries with high idle seconds and cmd equal to a blocking command. If idle exceeds your expected processing interval, the consumer process is likely gone but the TCP connection has not yet timed out.
  6. Check stream consumer groups separately. If you use streams, XINFO GROUPS shows lag (undelivered entries) and pending (delivered but unacknowledged). A growing lag while blocked_clients stays flat means XREADGROUP consumers have died and are not reconnecting. This does not always reflect in blocked_clients because the blocked sessions may have closed.
  7. Confirm it is not slow-command blocking. Run SLOWLOG GET 50. Slow commands block the event loop but do not increment blocked_clients. If SLOWLOG is full of KEYS or large SMEMBERS, the real problem is command latency, not queue consumers.
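For the replication check in step 4, the offset delta can be computed per replica from the primary's own INFO replication output. This is a sketch under stated assumptions: it is run against the primary, it expects the slaveN:ip=...,offset=... lines that the primary reports for each connected replica, and the function name replica_offset_lag is hypothetical.

```shell
# Print "slaveN <bytes behind>" for each replica listed by the primary.
# Reads the primary's INFO replication output on stdin.
replica_offset_lag() {
  awk '
    { sub(/\r$/, "") }   # INFO lines end with CRLF; drop the CR
    /^master_repl_offset:/ { split($0, kv, ":"); master = kv[2] + 0 }
    /^slave[0-9]+:/ {
      name = substr($0, 1, index($0, ":") - 1)
      n = split(substr($0, index($0, ":") + 1), fields, ",")
      for (i = 1; i <= n; i++) {
        split(fields[i], pair, "=")
        if (pair[1] == "offset") offsets[name] = pair[2] + 0
      }
    }
    END {
      # master_repl_offset is printed after the slaveN lines, so subtract here.
      for (name in offsets) printf "%s %.0f\n", name, master - offsets[name]
    }'
}
```

Usage would look like `redis-cli INFO replication | replica_offset_lag`. A single reading says little on its own; sample it several times and watch whether the delta keeps growing.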

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| blocked_clients | Count of clients in blocking state | Sustained growth above workload baseline |
| connected_clients | Total connection pool pressure | Growing in tandem with blocked_clients |
| List/stream length (LLEN/XLEN) | Distinguishes consumer death from producer outage | Queue growing while blocked count is high |
| Replication offset lag | For WAIT-induced blocking | master_repl_offset minus slave_repl_offset growing |
| master_link_status | Replica availability | down on replicas that WAIT depends on |
| Stream group lag / pending | Invisible buildup when stream consumers die | lag or pending growing continuously |
| instantaneous_ops_per_sec | Queue processing throughput | Drop correlated with rising blocked_clients |

Fixes

Crashed consumers with infinite timeout

If consumers use BLPOP, BRPOP, or XREAD BLOCK with timeout 0, a crashed consumer leaves a blocked connection open indefinitely.

  • Use CLIENT UNBLOCK <client-id> TIMEOUT to force the connection to return nil as if the timeout fired. Use CLIENT UNBLOCK <client-id> ERROR to return -UNBLOCKED client unblocked via CLIENT UNBLOCK. Both end the blocked state immediately and lower blocked_clients, but the connection itself stays open.
  • Kill the stale connection with CLIENT KILL ID <client-id>. CLIENT LIST provides the id field. This is safe but may cause the application to reconnect immediately if it is still alive.
  • Restart the consumer application to restore throughput.
  • Tradeoff: Unblocking or killing the connection drops any in-flight blocking context. The consumer must handle a nil response or reconnect gracefully.
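When many stale connections need cleaning up, the CLIENT KILL commands can be generated for review before anything is executed. The sketch below is a dry run by design: it only prints commands. The function name stale_blocked_kill_commands and the 600-second default are assumptions for illustration, and the selection rule (blocked flag b plus high idle) will also match healthy workers waiting on an empty queue, so review the output first.

```shell
# Print one "CLIENT KILL ID <id>" line per blocked client idle >= N seconds.
# Reads CLIENT LIST output on stdin; executes nothing.
# $1: minimum idle seconds (hypothetical default: 600)
stale_blocked_kill_commands() {
  awk -v min_idle="${1:-600}" '
    {
      id = ""; idle = -1; blocked = 0
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^id=/) id = substr($i, 4)
        else if ($i ~ /^idle=/) idle = substr($i, 6) + 0
        else if ($i ~ /^flags=/) blocked = (substr($i, 7) ~ /b/)
      }
      if (id != "" && blocked && idle >= min_idle)
        print "CLIENT KILL ID " id
    }'
}
```

For example, `redis-cli CLIENT LIST | stale_blocked_kill_commands 900 > kill.txt` writes the candidate commands to a file. After review, they can be run one at a time or fed to redis-cli on stdin.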

WAIT blocking on replication lag

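If WAIT is the source of blocked clients:

  • Restore replica health first. Check master_link_status and the offset delta for each replica; WAIT cannot return early while the replicas it depends on are down or lagging.
  • Give every WAIT call a finite timeout in milliseconds. WAIT returns the number of replicas that acknowledged the write, so the caller can detect a shortfall instead of blocking.
  • Tradeoff: A shorter timeout frees blocked clients sooner, but the application must handle writes that were not acknowledged by the requested number of replicas.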

Producer failure with healthy consumers

If blocked_clients is stable at the worker count and queues are empty, but the expected job volume is absent:

  • Check application producer logs. The issue is upstream of Redis.
  • This is not a Redis incident. Do not restart Redis.

Stream consumer group death

If XINFO GROUPS shows growing lag or pending but blocked_clients does not reflect active consumers:

  • Use XAUTOCLAIM or XCLAIM to redistribute pending messages from dead consumers to live ones.
  • Ensure consumers call XACK after processing. Missing XACK causes pending to grow even when consumers are alive.
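As an illustration of the XAUTOCLAIM step, the wrapper below reassigns entries that have been pending longer than a threshold to a live consumer. It is a sketch, assuming Redis 6.2 or later (where XAUTOCLAIM is available). The group name mygroup, the consumer name worker-1, the 60000 ms default, the batch size of 100, and the REDIS_CLI override variable are all hypothetical choices for this example.

```shell
# Claim up to 100 entries pending longer than min_idle_ms for a live consumer.
# REDIS_CLI is a hypothetical override, e.g. REDIS_CLI=echo for a dry run.
reclaim_pending() {
  stream="$1"; group="$2"; consumer="$3"; min_idle_ms="${4:-60000}"
  # Syntax: XAUTOCLAIM key group consumer min-idle-time start [COUNT count]
  # Starting at 0 scans the pending entries list from its beginning.
  ${REDIS_CLI:-redis-cli} XAUTOCLAIM "$stream" "$group" "$consumer" "$min_idle_ms" 0 COUNT 100
}
```

A call such as `reclaim_pending mystream mygroup worker-1 60000` handles one batch. The reply begins with a cursor to use as the next start value; repeating the call with that cursor until it returns 0-0 covers the whole pending list.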

Prevention

  • Finite timeouts. A timeout of 0 leaves no recovery path if the consumer crashes; use a finite value such as BLPOP key 30 so Redis unblocks the client after 30 seconds even if the consumer never comes back.
  • Baseline-relative alerts. Queue architectures have a normal blocked population equal to worker count; alert on deviation from baseline, not absolute value.
  • Queue depth correlation. blocked_clients alone cannot distinguish a healthy idle worker pool from dead consumers; correlate with LLEN, XLEN, or stream lag.
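  • Replication backlog sizing. If you use WAIT, a repl-backlog-size that is too small forces a full resync whenever a replica reconnects after falling behind, which worsens lag; size it to cover the writes of your longest expected disconnect (for example, 100MB or more on write-heavy primaries).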
  • Consumer liveness checks. Monitor stream consumer idle time via XINFO CONSUMERS and application process health independently of Redis.
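The baseline-relative alert can be expressed as a simple check on INFO clients output. This is a minimal sketch, not a monitoring integration: the function name blocked_above_baseline, the baseline of 50 workers, and the tolerance of 5 in the usage line are assumptions for illustration.

```shell
# Succeed (exit 0) when blocked_clients exceeds baseline + tolerance.
# Reads INFO clients output on stdin.
# $1: expected worker count (baseline); $2: allowed excess (default 0)
blocked_above_baseline() {
  # INFO lines end with CRLF, so strip the CR before reading the value.
  blocked=$(tr -d '\r' | awk -F: '$1 == "blocked_clients" { print $2 }')
  [ "${blocked:-0}" -gt $(( $1 + ${2:-0} )) ]
}
```

For example, `redis-cli INFO clients | blocked_above_baseline 50 5 && echo "blocked_clients above baseline"` prints a message only when more than 55 clients are blocked. Pair it with a queue depth check before paging anyone, since the count alone cannot separate idle workers from dead ones.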

How Netdata helps

  • Charts blocked_clients with connected_clients, instantaneous_ops_per_sec, and replication offset lag to distinguish consumer death from WAIT lag.
  • Monitors stream consumer group lag and pending counts to catch buildup that blocked_clients misses.
  • Displays replication lag and master_link_status alongside application metrics for WAIT diagnosis.