Redis KEYS command blocking production: why to replace it with SCAN
Redis P99 latency jumps from sub-millisecond to seconds. Clients time out. instantaneous_ops_per_sec drops to near zero while connected_clients stays high. The likely culprit is a single slow command monopolizing the event loop, and KEYS is the classic offender.
Redis runs all client commands on one main thread. KEYS scans the entire keyspace synchronously to match a pattern. Time complexity is O(N) where N is the total number of keys. During the scan, nothing else executes. Every client, including replication streams, health checks, and monitoring probes, waits.
The impact is independent of the result set size. A KEYS call that returns zero matches still traverses the entire keyspace. The blocking duration depends on total key count and server load. Under load, even a few hundred milliseconds of blocking queues enough commands to create a compounding latency backlog. When the event loop resumes, clients may have already started retries, amplifying load. If KEYS is called repeatedly, the instance may never drain the backlog between calls.
flowchart TD
A[Latency spike] --> B{SLOWLOG GET}
B --> C{KEYS dominates?}
C -->|Yes| D[cmdstat_keys check]
C -->|No| E[Other slow command]
D --> F{Calls > 0?}
F -->|Yes| G[CLIENT LIST]
F -->|No| H[Check log rotation]
G --> I[Kill client or deploy SCAN]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Application code using KEYS for cache invalidation or key discovery | cmdstat_keys shows non-zero call count; slowlog entries for KEYS | INFO commandstats filtered for cmdstat_keys |
| Administrative scripts running KEYS during migration or cleanup | Ops tooling or cron jobs connected to Redis; spikes correlate with scheduled tasks | CLIENT LIST matching source IPs to internal tools |
| Monitoring or inventory tools enumerating keys | Regular intervals of latency spikes; same pattern in slowlog every N minutes | Slowlog timestamps and periodicity |
| Misguided cache warming or dependency checking | Application startup logic runs KEYS to verify state | Application deployment logs correlated with slowlog entries |
Quick checks
# Check if KEYS has been called since startup
redis-cli INFO commandstats | grep cmdstat_keys
# Inspect the slowlog for KEYS entries
redis-cli SLOWLOG GET 10
# Check if the event loop is currently blocked
redis-cli INFO stats | grep instantaneous_ops_per_sec
# List connected clients to find the offender
redis-cli CLIENT LIST
# Check overall command latency events
redis-cli LATENCY LATEST
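The first check can be scripted so it yields a single number. This is a minimal sketch, assuming a POSIX shell with awk and the usual INFO commandstats line layout (cmdstat_keys:calls=N,usec=...); the helper name keys_calls is hypothetical.

```shell
# Hypothetical helper: reads `INFO commandstats` text on stdin and prints
# the number of KEYS calls recorded so far (0 if KEYS was never called).
keys_calls() {
  awk -F'[:=,]' '
    /^cmdstat_keys:/ { print $3; found = 1 }
    END { if (!found) print 0 }
  '
}

# Usage against a live instance:
#   redis-cli INFO commandstats | keys_calls
```

Any output other than 0 means KEYS has run since the counters were last reset.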
How to diagnose it
- Confirm the symptom. Run redis-cli INFO stats and redis-cli INFO clients. Look for instantaneous_ops_per_sec near zero while connected_clients remains high. This indicates the event loop is blocked despite active connections. On replicas, master_last_io_seconds_ago (reported by INFO replication) may grow because the master cannot serve replication traffic while blocked.
- Find the slow command. Run redis-cli SLOWLOG GET 50. If KEYS dominates, you have found the wedge. Note the duration in microseconds; anything over 1000000 (1 second) is severe. Also note the client address. If KEYS is absent, look for other O(N) commands such as SORT, LRANGE on large lists, or HGETALL on massive hashes.
- Correlate with command statistics. Run redis-cli INFO commandstats | grep cmdstat_keys. Any non-zero call count in production is a red flag. Compare usec_per_call to baseline simple commands like GET or SET.
- Identify the source. Use redis-cli CLIENT LIST and match the client address from the slowlog entry to the addr field of a connection; note that connection's id for any follow-up action. The cmd field shows the last command each client ran, which helps confirm the client's workload.
- Check for recurrence. Run redis-cli SLOWLOG LEN. If the log is growing rapidly, entries may be evicted before you inspect them. Increase slowlog-max-len temporarily to retain evidence.
- Validate impact duration. If latency monitoring is enabled, run redis-cli LATENCY HISTORY command to see how long command processing was stalled. The slowlog duration field is also reliable.
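The symptom check in the first step (near-zero throughput with many connected clients) can be sketched as a small filter over INFO output. The helper name loop_signature and both thresholds are hypothetical; pick values that fit your own baseline.

```shell
# Hypothetical helper: reads combined `INFO stats` and `INFO clients` text
# on stdin and prints "suspect" when throughput is at or below max_ops while
# the client count is at or above min_clients, "ok" otherwise, and "unknown"
# if either field is missing from the input.
loop_signature() {
  local max_ops="${1:-10}" min_clients="${2:-50}"
  awk -F: -v max_ops="$max_ops" -v min_clients="$min_clients" '
    { sub(/\r$/, "") }   # redis-cli INFO lines end in CRLF
    $1 == "instantaneous_ops_per_sec" { ops = $2 }
    $1 == "connected_clients"         { clients = $2 }
    END {
      if (ops == "" || clients == "") { print "unknown"; exit }
      if (ops + 0 <= max_ops && clients + 0 >= min_clients) print "suspect"
      else print "ok"
    }
  '
}

# Usage against a live instance:
#   { redis-cli INFO stats; redis-cli INFO clients; } | loop_signature 10 50
```

A "suspect" result is only a hint to go read the slowlog; an idle instance with many pooled connections produces the same signature.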
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| cmdstat_keys calls | Any production use of KEYS indicates a blocking scan risk | Non-zero call count |
| Slowlog KEYS entries | Direct evidence of event loop blocking | Any KEYS entry in slowlog |
| instantaneous_ops_per_sec | Proxy for event loop health | Drop to near-zero with high connected_clients |
| LATENCY LATEST command events | Internal Redis view of command latency | Sustained command latency spikes |
| connected_clients | Shows backlog of waiting clients | High or growing while throughput drops |
| blocked_clients | Distinguishes intentional blocking (BLPOP) from slow-command blocking | This does NOT capture slow-command blocking; use slowlog instead |
Fixes
Immediate: stop the active bleed
If a client is calling KEYS in a loop, identify it via CLIENT LIST and terminate the connection to break the cycle. You cannot preempt a KEYS command already in flight; CLIENT KILL applies after the current command finishes. If the bleed is a single massive call, you must wait it out.
# WARNING: this kills the client connection. The application will error.
# Prefer killing by ID if multiple clients share the same source address.
redis-cli CLIENT KILL <ip:port>
# or
redis-cli CLIENT KILL ID <client-id>
This aborts any pending pipelined commands from that client and prevents further invocations. Other clients will resume, though some may have already timed out and need to reconnect.
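Finding the connection to kill can be sketched as a filter over CLIENT LIST output. The helper name clients_running_keys is hypothetical. It relies on the cmd field, which only reflects the last command a client ran, so a client that issued KEYS and then something else will not show up.

```shell
# Hypothetical helper: reads `CLIENT LIST` text on stdin and prints
# "<id> <addr>" for every connection whose last command was KEYS.
clients_running_keys() {
  awk '{
    id = ""; addr = ""; cmd = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^id=/)   id   = substr($i, 4)
      if ($i ~ /^addr=/) addr = substr($i, 6)
      if ($i ~ /^cmd=/)  cmd  = substr($i, 5)
    }
    sub(/\r$/, "", cmd)
    if (tolower(cmd) == "keys") print id, addr
  }'
}

# Usage against a live instance:
#   redis-cli CLIENT LIST | clients_running_keys
# then, for each reported id:
#   redis-cli CLIENT KILL ID <client-id>
```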
Short-term: replace KEYS with SCAN
SCAN is the production-safe replacement. It returns a cursor and a small batch of keys per call, yielding the event loop between iterations. Unlike KEYS, SCAN does not freeze the instance.
The basic iteration pattern uses a cursor that starts at 0 and ends when the server returns 0:
# First iteration
redis-cli SCAN 0 MATCH "user:*" COUNT 100
# Subsequent iterations use the returned cursor
redis-cli SCAN <cursor-from-previous-reply> MATCH "user:*" COUNT 100
Tradeoffs and gotchas:
- Cursors are opaque. Never assume they increment, decrement, or follow a pattern. Always use the exact value returned by the previous call.
- SCAN may return the same key multiple times across iterations. Your application must deduplicate if uniqueness matters.
- COUNT is a hint, not a guarantee. The server decides the actual batch size per call.
- MATCH filters after retrieval from the collection. If your pattern is sparse, SCAN may return empty batches while still consuming iteration cycles. Under load, this can elevate CPU without yielding much value.
- A full SCAN iteration is still O(N) aggregate over the entire keyspace, but the per-call boundary yields the event loop, preventing the catastrophic blocking window that makes KEYS dangerous.
- Application code should wrap the cursor loop in a timeout guard and handle empty batches without exiting early.
- If the goal is deletion, collect keys with SCAN and remove them in batches. Never pipe KEYS into DEL.
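The cursor loop and batched deletion can be sketched in shell as follows. This is an illustration under stated assumptions, not a hardened tool: the function names scan_keys and delete_in_batches and the REDIS_CLI variable are hypothetical, keys are assumed not to contain newlines, and redis-cli is assumed to print the cursor on the first line followed by one key per line when its output is not a terminal. Multi-key DEL also assumes a single instance rather than a cluster.

```shell
REDIS_CLI="${REDIS_CLI:-redis-cli}"   # hypothetical override point

# Prints every key matching a pattern, one per line. May print duplicates.
# Args: pattern, COUNT hint (default 100), time budget in seconds (default 30).
scan_keys() {
  local pattern="$1" count="${2:-100}" max_seconds="${3:-30}"
  local cursor=0 start reply
  start=$(date +%s)
  while :; do
    reply=$("$REDIS_CLI" SCAN "$cursor" MATCH "$pattern" COUNT "$count") || return 1
    cursor=$(printf '%s\n' "$reply" | sed -n '1p')
    case "$cursor" in
      ''|*[!0-9]*) echo "scan_keys: unexpected reply: $reply" >&2; return 1 ;;
    esac
    printf '%s\n' "$reply" | sed '1d'     # an empty batch prints nothing
    [ "$cursor" = "0" ] && break          # only a 0 cursor ends the iteration
    if [ $(( $(date +%s) - start )) -ge "$max_seconds" ]; then
      echo "scan_keys: time budget exceeded at cursor $cursor" >&2
      return 2
    fi
  done
  return 0
}

# Reads keys on stdin and deletes them in groups of at most N (default 100).
delete_in_batches() {
  local batch="${1:-100}" key n=0
  set --
  while IFS= read -r key; do
    [ -n "$key" ] || continue
    set -- "$@" "$key"
    n=$((n + 1))
    if [ "$n" -ge "$batch" ]; then
      "$REDIS_CLI" DEL "$@" >/dev/null </dev/null || return 1
      set --
      n=0
    fi
  done
  if [ "$n" -gt 0 ]; then
    "$REDIS_CLI" DEL "$@" >/dev/null </dev/null || return 1
  fi
}

# Usage: deduplicate with sort -u, then delete 100 keys per call.
#   scan_keys 'tmp:*' | sort -u | delete_in_batches 100
```

Deduplication happens in the pipeline (sort -u) rather than inside the loop, which keeps the loop simple at the cost of buffering all matching keys before deletion starts.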
Structural: prevent KEYS at the source
If application code or scripts cannot be changed immediately, restrict the command at the server level:
- Rename the command in redis.conf: rename-command KEYS "" to disable it entirely. This requires a restart to take effect, so plan a maintenance window.
- Use ACLs (Redis 6.0+) to deny KEYS for application users while preserving it for admin roles if absolutely necessary.
Warning: Disabling KEYS via rename-command breaks any tool or script that relies on it. Verify no critical tooling depends on it before deploying. The better fix is always to remove the call site.
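As an illustration of the ACL approach, a redis.conf fragment might look like the sketch below. The user names app-user and ops-admin and both passwords are placeholders, not values from any real deployment. Clients that authenticate as the default user are not affected by these rules, so the default user needs its own restriction for the block to be effective.

```conf
# Hypothetical application user: every command except KEYS, on all keys.
user app-user on >replace-with-app-password ~* +@all -keys

# Hypothetical admin user that keeps KEYS for emergency use.
user ops-admin on >replace-with-admin-password ~* +@all
```

With a rule like this in place, a KEYS call from the application user is rejected with a permissions error instead of scanning the keyspace.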
Prevention
- Code review policy. Ban KEYS from all application code. Treat it like DEBUG SEGFAULT. Add linting or static analysis to reject pull requests containing Redis KEYS calls.
- CI gates. Audit application repositories for KEYS usage against actual Redis call paths.
- Runtime detection. Alert on cmdstat_keys call rate greater than zero. Any production invocation warrants a ticket.
- Tooling audit. Inventory administrative scripts, backup tools, and monitoring checks for KEYS usage and replace them with SCAN.
- Latency monitoring. Enable latency monitoring with a low threshold to catch blocking commands before they cascade. Set via CONFIG SET latency-monitor-threshold 1 (value is in milliseconds), or in redis.conf.
- Client audits. Review Redis client library configurations; some frameworks historically used KEYS for cache tagging or invalidation.
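A CI gate for the code review policy can start as a plain text search. The sketch below is a heuristic, not a parser: the function name scan_repo_for_keys is hypothetical, and it only recognizes two shapes, a lowercase client-library call with a string literal argument (such as .keys("user:*")) and an uppercase KEYS followed by a glob pattern. Expect both false positives and misses, and tune the expression to the client libraries in use.

```shell
# Hypothetical CI helper: prints file:line:text for likely Redis KEYS calls
# under a directory. Exit status is 0 when something matched, 1 when clean.
scan_repo_for_keys() {
  local dir="${1:-.}"
  grep -rnE "\.keys\([\"'][^\"']*[\"']|(^|[^[:alnum:]_])KEYS[[:space:]]+[\"']?[^[:space:]\"']*\*" "$dir"
}

# Usage in a pipeline step:
#   if scan_repo_for_keys src; then
#     echo "Redis KEYS usage found; use SCAN instead" >&2
#     exit 1
#   fi
```

Requiring a string literal argument keeps ordinary dictionary or map .keys() calls, which take no arguments, out of the results.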
How Netdata helps
- Netdata tracks redis.commands and redis.commands_by_type, exposing KEYS in the command mix.
- The collector exposes redis.ops_per_sec, letting you correlate throughput drops with slowlog growth.
- redis.latency metrics visualize event loop stalls that align with slowlog entries.
- Slowlog monitoring integration captures KEYS entries as events after deployments.
- Dashboards overlay redis.connected_clients with redis.ops_per_sec to spot the blocked-event-loop signature: flat throughput with sustained connections.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box
- Redis latest_fork_usec too high: THP, NUMA, and fork latency
- Redis maxmemory not set: why every production instance needs a memory limit
- MISCONF Redis is configured to save RDB snapshots - what it means and how to fix it
- Redis monitoring checklist: the signals every production instance needs
- Redis monitoring maturity model: from survival to expert
- Redis OOM command not allowed when used memory > ‘maxmemory’ - causes and fixes
- Redis OOM-killed by the kernel: RSS, overcommit, and recovery
- Redis rdb_last_bgsave_status:err: diagnosing failed background saves







