Redis event loop blocked: when one slow command freezes everything

Redis processes every command on a single main thread. Even with Redis 6.0 and later offloading network I/O to threads, command execution itself remains strictly sequential. When one command takes too long, everything behind it waits. You will see clients still connected, but PING stalls and throughput collapses to zero. This is the Slow Command Snowball pattern: a single expensive operation blocks the event loop, the client queue backs up, and latency compounds across every connected application.

What this means

The Redis event loop runs one command to completion before starting the next. During that window no other client is served. A KEYS * against a large keyspace, an SMEMBERS on a million-element set, or an unbounded Lua script can freeze the loop for hundreds of milliseconds or for seconds. New requests keep arriving and accumulate in socket and client buffers, but none of them executes until the main thread is free. When the slow command finishes, Redis drains the queue; if another slow command arrives before the queue clears, the snowball continues.

flowchart TD
    A[Slow command enters event loop] --> B[Main thread blocks]
    B --> C[Incoming clients queue]
    C --> D[Ops/sec drops to zero]
    D --> E[All commands wait]
    E --> F[Queue depth grows]
    F --> G[Latency snowball]

Every GET, HSET, and PUBLISH behind the offender stalls. The process stays alive and replicas may remain connected, but the instance is unresponsive to clients.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| KEYS * or large HGETALL / LRANGE | Slowlog fills with O(N) commands; throughput flatlines during execution | SLOWLOG GET 10 |
| SORT on large datasets | Latency spikes correlate with SORT entries in slowlog; CPU pinned on one core | SLOWLOG GET and INFO commandstats |
| Long-running Lua script | Clients receive BUSY errors; script execution may not appear in slowlog if under threshold | SCRIPT KILL (read-only only); LATENCY HISTORY command |
| DEBUG SLEEP or admin footguns | Exact pause duration matches command; visible in CLIENT LIST | CLIENT LIST; check cmdstat_debug |
| Large DEL on a dense key | Single command blocks while freeing millions of elements | SLOWLOG GET; compare with UNLINK behavior |

Quick checks

Run these safe, read-only checks to confirm event loop blocking and identify the source.

# Confirm responsiveness. Exit code 124 means the command timed out.
time timeout 2 redis-cli PING

# Check if throughput collapsed while clients remain connected
redis-cli INFO stats | grep instantaneous_ops_per_sec
redis-cli INFO clients | grep connected_clients

# Inspect the slowlog for offenders
redis-cli SLOWLOG GET 10
redis-cli SLOWLOG LEN

# Check per-command statistics for outliers
redis-cli INFO commandstats | grep -E "cmdstat_keys|cmdstat_smembers|cmdstat_sort|cmdstat_eval|cmdstat_debug"

# Identify which client is issuing expensive commands
redis-cli CLIENT LIST

# Check if latency monitoring is enabled and showing command spikes
redis-cli LATENCY LATEST

# Distinguish event-loop blocking from legitimate blocking commands
redis-cli INFO clients | grep blocked_clients

# Verify slowlog threshold
redis-cli CONFIG GET slowlog-log-slower-than

How to diagnose it

  1. Confirm it is event loop blocking, not a crash. If timeout 2 redis-cli PING hangs or returns after seconds, but the process is still running and connected_clients is high, the loop is wedged rather than down. If PING returns instantly but data commands are slow, the queue may be draining.

  2. Read the slowlog. SLOWLOG GET 50 returns up to 50 of the most recent commands that exceeded slowlog-log-slower-than. Look for patterns: repeated KEYS, SMEMBERS, SORT, or EVAL. The execution time is reported in microseconds and covers only the time the main thread spent inside the command, not the time the request waited in the queue. If the slowlog is empty, the threshold may be too high (the default is 10000 microseconds) or the offender is a Lua script that finished quickly but ran many times.

  3. Correlate with commandstats. INFO commandstats shows calls and usec_per_call per command type. A high usec_per_call for cmdstat_keys or cmdstat_eval confirms the workload mix is the problem, not a transient spike.

  4. Check CLIENT LIST. Find the addr and name of the connection issuing slow commands. Look for flags like x (MULTI/EXEC context). The b flag means blocking commands like BLPOP, which are different from event-loop blocking.

  5. Enable and check latency monitoring. If LATENCY LATEST shows command events, Redis is measuring internal command latency above your threshold. If LATENCY HISTORY command shows a spike at the exact time of the incident, you have confirmation. Remember that latency-monitor-threshold defaults to 0 (disabled) and must be set explicitly.

  6. Rule out other composite patterns. Check latest_fork_usec and rdb_bgsave_in_progress. If a fork just happened, you may be seeing a Fork-Induced Latency Cascade instead. Check used_memory vs maxmemory for a Memory Pressure Spiral. Event loop blocking from a slow command will show normal memory and no active fork.
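
The triage order in the steps above can be sketched as a small filter over INFO output. This is a minimal sketch, not a definitive detector: the function name classify_stall, the labels it prints, and the thresholds (fewer than 10 ops/sec, more than 10 clients, 90% of maxmemory) are assumptions chosen for illustration. Because INFO cannot answer while the loop is blocked, feed it a snapshot taken just after the stall or one saved by your monitoring.

```shell
#!/bin/sh
# Hypothetical helper: classify a stall from INFO output read on stdin.
# $1 is the exit code of `timeout 2 redis-cli PING` (124 means it timed out).
classify_stall() {
  ping_rc=$1
  # INFO lines look like "field:value\r\n"; strip the CR before parsing.
  tr -d '\r' | awk -F: -v rc="$ping_rc" '
    $1 == "instantaneous_ops_per_sec" { ops = $2 }
    $1 == "connected_clients"         { clients = $2 }
    $1 == "rdb_bgsave_in_progress"    { bgsave = $2 }
    $1 == "used_memory"               { used = $2 }
    $1 == "maxmemory"                 { maxmem = $2 }
    END {
      if (bgsave == 1) {
        print "check-fork"             # a fork is active: rule that out first
      } else if (maxmem > 0 && used >= 0.9 * maxmem) {
        print "check-memory"           # assumed 90% threshold
      } else if (rc == 124 || (ops + 0 < 10 && clients + 0 > 10)) {
        print "likely-slow-command"    # PING stalled or throughput collapsed
      } else {
        print "inconclusive"
      }
    }'
}

# Usage (needs a reachable server):
#   timeout 2 redis-cli PING; rc=$?
#   redis-cli INFO | classify_stall "$rc"
```

A "likely-slow-command" result is only a hint to go read the slowlog and commandstats; it does not identify the offender.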

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| instantaneous_ops_per_sec | Throughput collapse while clients remain connected is the hallmark of blocking | Drop to near-zero while connected_clients is high |
| Slowlog entry rate | Identifies which commands are consuming the main thread | Sustained growth or individual entries > 100 ms |
| LATENCY LATEST (command event) | Internal latency breakdown directly measures event loop stalls | command events appearing or exceeding threshold |
| INFO commandstats (usec_per_call) | Reveals average cost per command type; outliers point to large-key abuse | cmdstat_keys, cmdstat_eval, or cmdstat_sort with high usec_per_call |
| connected_clients | High count while throughput drops means clients are queued and waiting | connected_clients high with instantaneous_ops_per_sec low |
| blocked_clients | Distinguishes event-loop blocking from legitimate blocking commands | If low or zero, blocking is not from BLPOP/BRPOP |
| used_cpu_user_main_thread (Redis 6.2+) | Main thread saturation from command execution | Rate approaching 100% of one core during the incident |
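
To make the usec_per_call signal easier to scan, the commandstats section can be ranked by average cost per call. The sketch below assumes the usual line shape cmdstat_<name>:calls=...,usec=...,usec_per_call=...; the function name top_slow_commands is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read INFO commandstats on stdin and print the
# commands with the highest usec_per_call as "<usec_per_call> <command>".
# $1 is how many rows to print (default 5).
top_slow_commands() {
  limit=${1:-5}
  tr -d '\r' | awk -F'[:,=]' '
    /^cmdstat_/ {
      name = substr($1, 9)            # drop the "cmdstat_" prefix
      # After the name, fields alternate key, value, key, value, ...
      for (i = 2; i < NF; i += 2)
        if ($i == "usec_per_call") printf "%s %s\n", $(i + 1), name
    }' | sort -rn | head -n "$limit"
}

# Usage (needs a reachable server):
#   redis-cli INFO commandstats | top_slow_commands 10
```

usec_per_call is an average since the last CONFIG RESETSTAT or restart, so a single catastrophic call can hide behind many cheap ones; read it together with the slowlog.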

Fixes

The offender is KEYS or an O(N) lookup

Replace every KEYS in application code with SCAN. Audit with INFO commandstats: any non-zero cmdstat_keys in production is a red flag. For large hashes, lists, or sets, paginate with HSCAN, SSCAN, or bounded LRANGE instead of fetching entire structures.
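
As a minimal sketch of the KEYS-to-SCAN swap for shell jobs, the helpers below lean on redis-cli's --scan and --pattern options, which iterate with the SCAN cursor on your behalf. The function names and the REDIS_CLI override variable are assumptions added for illustration.

```shell
#!/bin/sh
# REDIS_CLI is a hypothetical override so the client binary can be swapped.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Print keys matching a glob pattern using cursor-based SCAN instead of KEYS.
scan_keys() {
  $REDIS_CLI --scan --pattern "$1"
}

# Count matching keys as many short SCAN calls rather than one long KEYS call.
count_keys() {
  scan_keys "$1" | wc -l | tr -d ' '
}

# Usage (needs a reachable server):
#   count_keys 'session:*'
```

The full iteration still walks the whole keyspace, so the total work is not smaller; it is split into short commands that let other clients run in between.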

The offender is a large aggregate command

Use UNLINK instead of DEL for large keys. UNLINK (available since Redis 4.0) schedules deletion asynchronously and frees the main thread immediately. For SORT, ensure you are not sorting massive datasets in memory without limits, and avoid SORT with BY and GET patterns that cross cluster slots.
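
For cleanup jobs that run against mixed fleets, one way to apply the UNLINK advice is to pick the command from the server version. This is a sketch under the assumption stated above that UNLINK exists from Redis 4.0 onward; delete_command_for is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: choose the deletion command from a redis_version
# string such as "6.2.14". Falls back to DEL when the version is unparsable.
delete_command_for() {
  major=${1%%.*}                      # text before the first dot
  if [ "$major" -ge 4 ] 2>/dev/null; then
    echo UNLINK
  else
    echo DEL
  fi
}

# Usage (needs a reachable server; big:key is a placeholder key name):
#   version=$(redis-cli INFO server | tr -d '\r' | awk -F: '$1 == "redis_version" { print $2 }')
#   redis-cli "$(delete_command_for "$version")" big:key
```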

The offender is a Lua script

Set lua-time-limit to a value appropriate for your workload (the default is 5000 ms). The limit does not terminate the script; once it is exceeded, Redis starts answering other clients with BUSY errors and accepts only a small set of commands such as SCRIPT KILL and SHUTDOWN NOSAVE. For read-only scripts, SCRIPT KILL interrupts execution. For scripts that have already performed writes, SCRIPT KILL is refused, and if you cannot wait for the script to finish, SHUTDOWN NOSAVE is the last resort. Consider migrating long-running logic to Redis Functions (Redis 7.0+), which offer better replication semantics and script flags.
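
The SCRIPT KILL decision can be sketched as a wrapper that interprets the reply instead of escalating automatically. The reply prefixes matched here (NOTBUSY when nothing is running, UNKILLABLE when the script has already written) are what Redis returns to my knowledge; the function name, the labels it prints, and the REDIS_CLI override are assumptions.

```shell
#!/bin/sh
# REDIS_CLI is a hypothetical override so the client binary can be swapped.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Attempt SCRIPT KILL once and report what the server said.
try_script_kill() {
  reply=$($REDIS_CLI SCRIPT KILL 2>&1)
  case $reply in
    OK*)          echo "killed" ;;
    *NOTBUSY*)    echo "no-script-running" ;;
    # The script already wrote data: waiting or SHUTDOWN NOSAVE are the
    # only options, and that call is left to a human.
    *UNKILLABLE*) echo "script-has-written" ;;
    *)            echo "unexpected: $reply" ;;
  esac
}

# Usage (needs a reachable server):
#   try_script_kill
```

The wrapper deliberately never issues SHUTDOWN NOSAVE itself, since that discards every change made after the last snapshot.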

The offender is an admin or debug session

Use CLIENT KILL ID <client-id> (or CLIENT KILL <ip>:<port>) to disconnect the session after identifying it in CLIENT LIST. Redis processes the kill only once the main thread is free, so it prevents the next slow command from that session rather than aborting the one already running. Restrict DEBUG, CONFIG, and SHUTDOWN via ACLs (Redis 6.0+) or rename-command in older versions. Never leave MONITOR running in production; it copies every command to the monitor client’s output buffer and can itself OOM the instance.
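
Finding the session to disconnect can be sketched as a filter over CLIENT LIST, keyed on the cmd field (the last command each connection ran). The function name find_suspect_clients and the default watch list are assumptions; adjust the list to your workload.

```shell
#!/bin/sh
# Hypothetical helper: read CLIENT LIST output on stdin and print
# "<id> <addr> <name> <cmd>" for clients whose last command is in a watch list.
# $1 is an optional |-separated list of lowercase command names.
find_suspect_clients() {
  watch=${1:-keys|smembers|sort|eval|evalsha|debug}
  tr -d '\r' | awk -v watch="$watch" '
    {
      id = addr = name = cmd = ""
      # Each line is a run of space-separated key=value pairs.
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq == 0) continue
        k = substr($i, 1, eq - 1)
        v = substr($i, eq + 1)
        if (k == "id") id = v
        else if (k == "addr") addr = v
        else if (k == "name") name = v
        else if (k == "cmd") cmd = v
      }
      if (cmd ~ ("^(" watch ")$"))
        print id, addr, (name == "" ? "-" : name), cmd
    }'
}

# Usage (needs a reachable server):
#   redis-cli CLIENT LIST | find_suspect_clients
#   redis-cli CLIENT KILL ID <client-id>
```

Unnamed connections print "-" in the name column, which is a reasonable argument for having applications call CLIENT SETNAME.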

The queue is still not draining

If the slow command is legitimate and must complete, your only option is to wait. Do not restart Redis as the first fix; restarting loses the queued commands and triggers a cold cache, thundering herd, and potential persistence recovery time. If the instance is truly wedged with no recovery path, prepare for failover instead of a hard restart.

Prevention

  • Tune the slowlog. Set slowlog-log-slower-than to 1000 microseconds (1 ms) or lower in production so you catch blockers early. Set slowlog-max-len to 1024 or higher so entries do not rotate out before you inspect them.
  • Prohibit dangerous commands. Use ACLs or rename-command to disable KEYS, DEBUG SLEEP, and FLUSHALL for application users.
  • Monitor commandstats trends. Track usec_per_call for GET, SET, and aggregate commands so key growth or access pattern drift is visible before it blocks the loop.
  • Use asynchronous deletion. Standardize on UNLINK for large key removal and review cron jobs that still use DEL.
  • Add TTL jitter. Mass expiry events add CPU pressure to the main thread and can compound queue latency; jitter prevents synchronized expiry waves.
  • Audit Lua and Functions. Review scripts for unbounded loops and large dataset access, set explicit lua-time-limit, and monitor evicted_scripts if scripts are cached under memory pressure.
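
The slowlog and ACL items above can be applied at runtime along these lines. This is a sketch under assumptions: the function name, the REDIS_CLI override, the 100 ms latency-monitor threshold, and the ACL user name app are all illustrative, and the ACL line presumes that user already exists.

```shell
#!/bin/sh
# REDIS_CLI is a hypothetical override so the client binary can be swapped.
REDIS_CLI=${REDIS_CLI:-redis-cli}

apply_slow_command_guardrails() {
  # Log anything slower than 1 ms (1000 microseconds) and keep 1024 entries.
  $REDIS_CLI CONFIG SET slowlog-log-slower-than 1000
  $REDIS_CLI CONFIG SET slowlog-max-len 1024
  # The latency monitor is disabled by default; 100 ms is an assumed threshold.
  $REDIS_CLI CONFIG SET latency-monitor-threshold 100
  # Remove dangerous commands from the (hypothetical) application user "app".
  $REDIS_CLI ACL SETUSER app -keys -debug -flushall
}

# Usage (needs a reachable server and admin privileges):
#   apply_slow_command_guardrails
```

CONFIG SET and ACL SETUSER change only the running instance; mirror the values in redis.conf and your ACL file, or persist them with CONFIG REWRITE and ACL SAVE where those are configured, so they survive a restart.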

How Netdata helps

  • Correlates redis.ops_per_sec dropping with redis.clients_connected remaining high to flag the Slow Command Snowball pattern.
  • Exposes slowlog-derived metrics and latency events so you identify the blocking command type without logging into the instance.
  • Tracks redis.commandstats per command family, highlighting outliers like KEYS or EVAL before they block the loop.
  • Monitors main-thread CPU usage alongside throughput to distinguish event loop blocking from capacity saturation.
  • Alerts on redis.latency spikes and sudden drops in instantaneous operations per second before client timeouts cascade.