Redis blocked_clients growing: dead consumers vs healthy queues
blocked_clients in INFO clients is climbing. In queue-based architectures this is often normal: workers call BLPOP, BRPOP, or XREAD BLOCK and wait for producers to push work. When blocked_clients grows while queue depth also grows, consumers are no longer draining the queue. They may have crashed, been OOM-killed, or hung. If the application uses WAIT, writers can also accumulate in blocked_clients while replicas lag.
blocked_clients counts only clients waiting on explicit blocking commands. It does not include clients stalled behind slow commands such as KEYS * or SMEMBERS on a large set. A high value can be a healthy sign of an active queue pattern. It can also be a pathological sign of dead connections holding connection slots open indefinitely, especially when the blocking timeout is 0.
This guide shows how to tell the difference, what commands to run, and how to fix the underlying cause without restarting Redis.
What this means
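blocked_clients increments when a client runs a blocking command whose condition is not yet met. Commands that can increment it include BLPOP, BRPOP, BLMOVE, BLMPOP, BZPOPMIN, BZPOPMAX, BZMPOP, XREAD BLOCK, XREADGROUP BLOCK, WAIT, and the deprecated BRPOPLPUSH. A client blocked on BLPOP with timeout 0 waits until data arrives or the connection closes.

When a consumer process exits or is killed, its operating system normally closes the socket and Redis drops the client. The problem case is a connection that stays open with nothing behind it, either because the consumer process hangs or because its host or network path dies without closing the connection. Redis keeps counting that client in blocked_clients. The client holds one connection slot and processes no messages.

- A dead host is eventually detected by TCP keepalive (the tcp-keepalive setting).
- A hung but still-running process is not detected this way. It stays counted until data arrives for it or someone kills the connection.
- The server's idle timeout (the timeout setting) does not help, because Redis exempts blocked clients from it.

WAIT numreplicas timeout blocks the calling client until all of that client's previous writes are acknowledged by at least numreplicas replicas, or until timeout milliseconds elapse. A timeout of 0 blocks forever. If replicas are down or lagging and the timeout is large, WAIT holds a blocked slot until replication catches up or the timeout fires. It then returns the number of replicas that acknowledged.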
A healthy queue worker pool shows a stable blocked_clients count equal to the number of worker processes. The operator problem is sustained growth above baseline, or a count that nears connected_clients while queue depth increases.
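One way to make the baseline comparison concrete is a small filter over INFO clients output. The sketch assumes a POSIX shell with awk. check_blocked_baseline is a hypothetical helper name. The baseline and headroom values are whatever your deployment considers normal.

```bash
# check_blocked_baseline is a hypothetical helper, not part of redis-cli.
# It reads `INFO clients` output on stdin and compares blocked_clients with
# the baseline you expect (for example, your number of queue workers).
#   $1 = expected baseline, $2 = allowed headroom above it (default 0)
check_blocked_baseline() {
  baseline=$1
  tolerance=${2:-0}
  # INFO replies use CRLF line endings; strip the CR before parsing.
  tr -d '\r' | awk -F: -v base="$baseline" -v tol="$tolerance" '
    $1 == "blocked_clients"   { blocked = $2 + 0 }
    $1 == "connected_clients" { connected = $2 + 0 }
    END {
      if (blocked > base + tol)
        printf "ABOVE BASELINE: blocked_clients=%d baseline=%d connected_clients=%d\n", blocked, base, connected
      else
        printf "ok: blocked_clients=%d baseline=%d\n", blocked, base
    }'
}
```

For example, `redis-cli INFO clients | check_blocked_baseline 50 5` flags any count above 55 for a 50-worker pool.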
```mermaid
flowchart TD
A[blocked_clients growing] --> B{Queue depth growing?}
B -->|Yes| C[Dead consumers or WAIT]
B -->|No| D[Producer failure or healthy idle]
C --> E{Replication lag?}
E -->|Yes| F[WAIT blocking on lag]
E -->|No| G[Crashed consumers with infinite timeout]
```

Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Crashed consumers with timeout 0 | blocked_clients grows; queue length grows; no processing visible in logs | CLIENT LIST for idle blocked connections; LLEN or XLEN |
| WAIT blocking on replication lag | blocked_clients grows after write bursts; replicas lagging or link down | INFO replication offset delta and master_link_status |
| Producer failure | blocked_clients stable at worker count; queues stay empty; no new jobs | Application producer logs; LLEN / XLEN near zero |
| Stream consumer group with no live consumers | Stream lag grows; blocked_clients may be flat because XREADGROUP sessions died | XINFO GROUPS lag and pending counts |
| Connection leak in blocking clients | blocked_clients and connected_clients both grow; high idle times | CLIENT LIST sorted by idle |
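You can keep the decision path from the flowchart in a runbook as a small triage function. This sketch assumes you have already measured the three inputs over the same time window. classify_blocked_growth is a hypothetical name. Its verdicts are starting points for investigation, not diagnoses.

```bash
# classify_blocked_growth is a hypothetical triage helper that follows the
# flowchart. All inputs are measurements you take yourself over the same
# window:
#   $1 = change in blocked_clients
#   $2 = change in queue depth (LLEN, XLEN, or consumer group lag)
#   $3 = 1 if replica offsets are falling behind (WAIT in use), else 0
classify_blocked_growth() {
  blocked_delta=$1
  queue_delta=$2
  replicas_lagging=$3
  if [ "$blocked_delta" -le 0 ]; then
    echo "not growing: compare against the worker-count baseline"
  elif [ "$queue_delta" -gt 0 ]; then
    if [ "$replicas_lagging" -eq 1 ]; then
      echo "suspect WAIT blocking on replication lag"
    else
      echo "suspect crashed consumers with infinite timeout"
    fi
  else
    echo "suspect producer failure or healthy idle workers"
  fi
}
```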
Quick checks
Run these read-only commands to characterize the state.
```bash
# Confirm blocked client count
redis-cli INFO clients | grep blocked_clients
# Check total connection load
redis-cli INFO clients | grep connected_clients
# Identify blocked clients, their command, and idle time
redis-cli CLIENT LIST
# Narrow the list to clients currently blocked (flags field contains "b")
redis-cli CLIENT LIST | grep -E 'flags=[^ ]*b'
# Check list or stream depth
redis-cli LLEN myqueue
redis-cli XLEN mystream
# Check stream consumer group health (Redis 7.0+ lag field)
redis-cli XINFO GROUPS mystream
# Check for WAIT-induced replication lag. On the primary, compare
# master_repl_offset with each replica's offset in the slaveN lines
redis-cli INFO replication | grep -E "connected_slaves|master_repl_offset|^slave[0-9]+:"
# On each replica, check its link state and replication offset
redis-cli -h <replica-host> INFO replication | grep -E "master_link_status|slave_repl_offset"
# Rule out slow commands (these do NOT increment blocked_clients)
redis-cli SLOWLOG LEN
redis-cli SLOWLOG GET 10
# Check command mix to confirm blocking command usage
redis-cli INFO commandstats | grep -E "cmdstat_(blpop|brpop|blmove|blmpop|bzpop|wait|xread)"
```
Note: in CLIENT LIST output, cmd shows the command the client is running (or last ran), idle shows seconds since its last interaction, and flags contains b while the client is blocked. High idle while cmd is a blocking command suggests a stalled connection. Confirm against queue depth and consumer process health before treating it as a zombie.
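To apply that check across many connections, you can filter the CLIENT LIST output on the flags field and on idle. This is a minimal sketch that assumes a POSIX shell with awk. stale_blocked_clients and its default 300-second threshold are illustrative choices, not Redis features.

```bash
# stale_blocked_clients is a hypothetical helper. It reads `CLIENT LIST`
# output on stdin and prints clients that are in a blocking call (flags
# contains "b") and have been idle longer than the threshold in seconds.
stale_blocked_clients() {
  threshold=${1:-300}
  tr -d '\r' | awk -v t="$threshold" '
    {
      id = ""; addr = ""; cmd = ""; flags = ""; idle = 0
      # Each field is key=value; split on the first "=" only.
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq == 0) continue
        key = substr($i, 1, eq - 1)
        val = substr($i, eq + 1)
        if (key == "id") id = val
        else if (key == "addr") addr = val
        else if (key == "cmd") cmd = val
        else if (key == "flags") flags = val
        else if (key == "idle") idle = val + 0
      }
      if (index(flags, "b") > 0 && idle > t + 0)
        printf "id=%s addr=%s cmd=%s idle=%d\n", id, addr, cmd, idle
    }'
}
```

Run it as `redis-cli CLIENT LIST | stale_blocked_clients 600`. With timeout 0 and an empty queue, healthy workers also accumulate idle time. Pick a threshold longer than your quietest legitimate period, and confirm against queue depth before acting.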
How to diagnose it
- Establish whether the count is truly anomalous. If your application runs 50 queue workers, a stable blocked_clients of 50 is normal. Alert on deviation from baseline, not on absolute value.
- Determine which blocking commands are in use. Check INFO commandstats for cmdstat_blpop, cmdstat_brpop, cmdstat_wait, or cmdstat_xread. If none are present but blocked_clients is high, look for module-issued blocking operations. Also look for older commands such as BRPOPLPUSH, which was deprecated in Redis 6.2 in favor of BLMOVE.
- Correlate with queue depth. Use LLEN for lists or XLEN for streams. If queue depth grows while blocked_clients also grows, consumers are not draining the queue and are likely dead or stalled. If queue depth is near zero and blocked_clients is stable, the workers are simply idle.
- Check for WAIT-specific lag. If cmdstat_wait is present, compare master_repl_offset on the primary with slave_repl_offset on each replica. The primary also reports each replica's acknowledged offset in its slaveN lines. A large and growing delta means replicas are behind. If fewer than numreplicas replicas are connected and acknowledging, for example because they report master_link_status:down, WAIT blocks until its timeout fires, or indefinitely with timeout 0.
- Inspect individual blocked connections. In CLIENT LIST, look for entries with high idle seconds and cmd set to a blocking command. If idle exceeds your expected processing interval, the consumer process may be gone or hung while its TCP connection is still open.
- Check stream consumer groups separately. If you use streams, XINFO GROUPS shows lag (entries not yet delivered to the group) and pending (delivered but unacknowledged). A growing lag while blocked_clients stays flat means XREADGROUP consumers have likely died and are not reconnecting. Their blocked sessions may already have closed, so the problem does not show up in blocked_clients.
- Confirm it is not slow-command blocking. Run SLOWLOG GET 50. Slow commands block the event loop but do not increment blocked_clients. If SLOWLOG is full of KEYS or large SMEMBERS calls, the real problem is command latency, not queue consumers.
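For the WAIT check, the primary's INFO replication output already lists each replica's acknowledged offset in its slaveN lines. That means you can compute the byte delta from one command. This sketch assumes a POSIX shell with awk. replica_offset_delta is a hypothetical helper name.

```bash
# replica_offset_delta is a hypothetical helper. Feed it `INFO replication`
# from the PRIMARY; for every slaveN line it prints how many bytes of the
# replication stream that replica has not yet acknowledged.
replica_offset_delta() {
  tr -d '\r' | awk '
    /^master_repl_offset:/ { master = substr($0, index($0, ":") + 1) + 0 }
    /^slave[0-9]+:/ {
      n++
      name[n] = substr($0, 1, index($0, ":") - 1)
      # Keep everything after the first colon: IPv6 addresses contain colons.
      fields[n] = substr($0, index($0, ":") + 1)
    }
    END {
      for (i = 1; i <= n; i++) {
        m = split(fields[i], parts, ",")
        ip = ""; port = ""; state = ""; off = 0; lag = ""
        for (j = 1; j <= m; j++) {
          eq = index(parts[j], "=")
          key = substr(parts[j], 1, eq - 1)
          val = substr(parts[j], eq + 1)
          if (key == "ip") ip = val
          else if (key == "port") port = val
          else if (key == "state") state = val
          else if (key == "offset") off = val + 0
          else if (key == "lag") lag = val
        }
        # %.0f instead of %d: offsets can exceed 32-bit range in some awks.
        printf "%s %s:%s state=%s behind=%.0f bytes lag=%ss\n", name[i], ip, port, state, master - off, lag
      }
    }'
}
```

Run it on the primary as `redis-cli INFO replication | replica_offset_delta` and repeat it a few times. A delta that keeps growing matters more than any single value.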
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| blocked_clients | Count of clients in blocking state | Sustained growth above workload baseline |
| connected_clients | Total connection pool pressure | Growing in tandem with blocked_clients |
| List/stream length (LLEN/XLEN) | Distinguishes consumer death from producer outage | Queue growing while blocked count is high |
| Replication offset lag | Explains WAIT-induced blocking | master_repl_offset minus slave_repl_offset growing |
| master_link_status | Replica availability | down on enough replicas that fewer than numreplicas remain connected |
| Stream group lag / pending | Invisible buildup when stream consumers die | lag or pending growing continuously |
| instantaneous_ops_per_sec | Overall command throughput, a proxy for queue processing | Drop correlated with rising blocked_clients |
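The key warning signs are relative: blocked_clients rising while queue depth rises. It therefore helps to sample them together over time. This sketch assumes a POSIX shell and a list-based queue. It also uses a hypothetical REDIS_CLI variable for overriding the redis-cli invocation. sample_blocked_vs_queue is an illustrative name.

```bash
# sample_blocked_vs_queue is a hypothetical sampler. It prints
# blocked_clients, connected_clients, and the length of one list COUNT times,
# INTERVAL seconds apart. REDIS_CLI is an assumed override, for example
# REDIS_CLI="redis-cli -h 10.0.0.1".
sample_blocked_vs_queue() {
  queue=$1
  count=${2:-10}
  interval=${3:-5}
  cli=${REDIS_CLI:-redis-cli}
  i=1
  while [ "$i" -le "$count" ]; do
    # INFO replies use CRLF line endings; strip the CR before parsing.
    clients=$($cli INFO clients | tr -d '\r')
    blocked=$(printf '%s\n' "$clients" | awk -F: '$1 == "blocked_clients" { print $2 }')
    connected=$(printf '%s\n' "$clients" | awk -F: '$1 == "connected_clients" { print $2 }')
    depth=$($cli LLEN "$queue")
    printf 'sample=%d blocked_clients=%s connected_clients=%s llen=%s\n' "$i" "$blocked" "$connected" "$depth"
    i=$((i + 1))
    # Sleep only between samples, not after the last one.
    if [ "$i" -le "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, `sample_blocked_vs_queue myqueue 12 5` prints one minute of samples at five-second intervals.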
Fixes
Crashed consumers with infinite timeout
If consumers use BLPOP, BRPOP, or XREAD BLOCK with timeout 0, a crashed consumer leaves a blocked connection open indefinitely.
- Use CLIENT UNBLOCK <client-id> TIMEOUT to make the blocked command return as if its timeout fired (a nil reply for BLPOP). Use CLIENT UNBLOCK <client-id> ERROR to return the error -UNBLOCKED client unblocked via CLIENT UNBLOCK instead. Either form clears the blocked state immediately. The connection stays open, though, so neither releases the connection slot.
- Kill the stale connection with CLIENT KILL ID <client-id>, taking the id field from CLIENT LIST. This releases the connection slot. It is safe, but if the application is still alive it may reconnect immediately.
- Restart the consumer application to restore throughput.
- Tradeoff: unblocking or killing the connection drops any in-flight blocking context. The consumer must handle a nil reply, an UNBLOCKED error, or a reconnect gracefully.
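When many stale connections need to go, it is safer to generate the CLIENT KILL commands and review them before running anything. This is a minimal sketch that assumes a POSIX shell with awk. stale_blocked_kill_cmds is a hypothetical helper, and its 600-second default is arbitrary.

```bash
# stale_blocked_kill_cmds is a hypothetical helper. It turns `CLIENT LIST`
# output (stdin) into CLIENT KILL commands for clients that are blocked
# (flags contains "b") and idle longer than the threshold in seconds.
# It only PRINTS commands; nothing is killed until you run them yourself.
stale_blocked_kill_cmds() {
  threshold=${1:-600}
  tr -d '\r' | awk -v t="$threshold" '
    {
      id = ""; flags = ""; idle = 0
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq == 0) continue
        key = substr($i, 1, eq - 1)
        val = substr($i, eq + 1)
        if (key == "id") id = val
        else if (key == "flags") flags = val
        else if (key == "idle") idle = val + 0
      }
      if (id != "" && index(flags, "b") > 0 && idle > t + 0)
        print "CLIENT KILL ID " id
    }'
}
```

`redis-cli CLIENT LIST | stale_blocked_kill_cmds 1800 > kill.txt` writes the candidate commands to kill.txt. After reviewing the file, run `redis-cli < kill.txt` to execute them. redis-cli reads commands from standard input when stdin is not a terminal.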
WAIT blocking on replication lag
If WAIT is the source of blocked clients:
- Fix the replica lag by investigating replica CPU, disk I/O, or network saturation. See Redis latency spikes: diagnosis with the LATENCY subsystem and Redis latest_fork_usec too high: THP, NUMA, and fork latency.
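- If replicas are down, restore them or add healthy replicas so that at least numreplicas replicas are connected and acknowledging. If that cannot happen quickly, lower numreplicas in the application so the remaining replicas can satisfy WAIT.
- Review whether WAIT with a large timeout is necessary. With a shorter timeout, WAIT returns sooner with the number of replicas that did acknowledge. The application can then treat a count below numreplicas as a failed durability check instead of holding the blocked slot.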
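Before choosing a numreplicas value or a timeout, it can help to count how many replicas are actually able to acknowledge. This sketch assumes a POSIX shell with awk. healthy_replica_count is hypothetical. The sketch counts a replica as healthy when it shows state=online and a small lag (seconds since its last acknowledgment). That is a heuristic, not a Redis definition.

```bash
# healthy_replica_count is a hypothetical helper. It reads `INFO replication`
# from the primary on stdin and counts replicas with state=online whose lag
# (seconds since last ACK) is at most max_lag.
healthy_replica_count() {
  max_lag=${1:-1}
  tr -d '\r' | awk -v max="$max_lag" '
    /^slave[0-9]+:/ {
      # Everything after the first colon is a comma-separated key=value list.
      rest = substr($0, index($0, ":") + 1)
      m = split(rest, parts, ",")
      state = ""; lag = -1
      for (j = 1; j <= m; j++) {
        eq = index(parts[j], "=")
        key = substr(parts[j], 1, eq - 1)
        val = substr(parts[j], eq + 1)
        if (key == "state") state = val
        else if (key == "lag") lag = val + 0
      }
      if (state == "online" && lag >= 0 && lag <= max + 0) healthy++
    }
    END { print healthy + 0 }'
}
```

Suppose `redis-cli INFO replication | healthy_replica_count 1` prints a number below the numreplicas your application passes to WAIT. In that case, expect WAIT calls to run until their timeout.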
Producer failure with healthy consumers
If blocked_clients is stable at the worker count and queues are empty, but the expected job volume is absent:
- Check application producer logs. The issue is upstream of Redis.
- This is not a Redis incident. Do not restart Redis.
Stream consumer group death
If XINFO GROUPS shows growing lag or pending but blocked_clients does not reflect active consumers:
- Use XAUTOCLAIM or XCLAIM to redistribute pending messages from dead consumers to live ones.
- Ensure consumers call XACK after processing. Missing XACK calls make pending grow even when consumers are alive.
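Before reassigning messages, a compact view of every group's pending and lag values shows which groups are affected. The sketch assumes redis-cli's raw output format, with one field per line. redis-cli uses this format when stdout is not a terminal or when --raw is given. summarize_xinfo_groups is a hypothetical helper. Servers older than Redis 7.0 report no lag field, which the sketch prints as n/a. A nil lag prints as nil.

```bash
# summarize_xinfo_groups is a hypothetical helper. It reads raw
# `redis-cli XINFO GROUPS <stream>` output (alternating field-name and value
# lines) and prints one summary line per consumer group.
summarize_xinfo_groups() {
  tr -d '\r' | awk '
    function flush() {
      if (group != "")
        printf "group=%s consumers=%s pending=%s lag=%s\n", group, consumers, pending, lag
    }
    # Odd lines are field names; even lines are their values.
    NR % 2 == 1 { key = $0; next }
    {
      if (key == "name") { flush(); group = $0; consumers = ""; pending = ""; lag = "n/a" }
      else if (key == "consumers") consumers = $0
      else if (key == "pending") pending = $0
      # A nil reply is printed by redis-cli as an empty line.
      else if (key == "lag") lag = ($0 == "" ? "nil" : $0)
    }
    END { flush() }'
}
```

For example: `redis-cli --raw XINFO GROUPS mystream | summarize_xinfo_groups`.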
Prevention
- Finite timeouts. A timeout of 0 leaves no server-side recovery path if the consumer hangs. Use a finite timeout such as BLPOP key 30 so the blocked state expires on its own and the worker loop reissues the command.
- Baseline-relative alerts. Queue architectures have a normal blocked population equal to worker count; alert on deviation from baseline, not absolute value.
- Queue depth correlation. blocked_clients alone cannot distinguish a healthy idle worker pool from dead consumers; correlate with LLEN, XLEN, or stream lag.
- Replication backlog sizing. If you use WAIT, a repl-backlog-size that is too small turns short replica disconnects into full resyncs, which worsen lag. Size it from your write rate and the longest disconnect you expect to survive. The default is 1mb, and write-heavy primaries often need 100MB or more.
- Consumer liveness checks. Monitor stream consumer idle time via XINFO CONSUMERS, and monitor application process health independently of Redis.
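The finite-timeout advice translates into a worker step that treats a nil reply as normal and simply loops. This sketch assumes a POSIX shell and a single list queue. All names are hypothetical: pop_one_job, the handle_job callback you supply, and a REDIS_CLI override variable. A production worker would normally use a client library rather than spawning redis-cli for each job.

```bash
# pop_one_job is a hypothetical worker step. handle_job is a placeholder for
# your own processing function. REDIS_CLI is an assumed override
# (for example REDIS_CLI="redis-cli -h 10.0.0.1").
# Return codes: 1 = timed out with no job, 2 = redis-cli failed,
# otherwise the status of handle_job.
pop_one_job() {
  queue=$1
  timeout=${2:-30}
  cli=${REDIS_CLI:-redis-cli}
  # --raw prints the BLPOP reply as the key name, then the value.
  # On timeout Redis replies nil, which redis-cli prints as an empty line.
  reply=$($cli --raw BLPOP "$queue" "$timeout") || return 2
  [ -n "$reply" ] || return 1
  # Drop the first line (the key name); keep the rest as the job payload.
  job=$(printf '%s\n' "$reply" | sed '1d')
  handle_job "$job"
}
# Typical worker loop (shown as a comment, not executed here):
#   while :; do pop_one_job myqueue 30; done
```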
How Netdata helps
- Charts blocked_clients with connected_clients, instantaneous_ops_per_sec, and replication offset lag to distinguish consumer death from WAIT lag.
- Monitors stream consumer group lag and pending counts to catch buildup that blocked_clients misses.
- Displays replication lag and master_link_status alongside application metrics for WAIT diagnosis.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis aof_last_write_status:err: AOF write failures and recovery
- Redis appendfsync always latency: durability vs throughput trade-offs
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix
- Redis event loop blocked: when one slow command freezes everything
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box
- Redis KEYS command blocking production: why to replace it with SCAN
- Redis latency spikes: diagnosis with the LATENCY subsystem
- Redis latest_fork_usec too high: THP, NUMA, and fork latency
- Redis max number of clients reached: maxclients and rejected_connections