Redis KEYS command blocking production: why to replace it with SCAN

Redis P99 latency jumps from sub-millisecond to seconds. Clients time out. instantaneous_ops_per_sec drops to near zero while connected_clients stays high. The likely culprit is a single slow command monopolizing the event loop, and KEYS is the classic offender.

Redis runs all client commands on one main thread. KEYS scans the entire keyspace synchronously to match a pattern. Time complexity is O(N) where N is the total number of keys. During the scan, nothing else executes. Every client, including replication streams, health checks, and monitoring probes, waits.

The impact is independent of the result set size. A KEYS call that returns zero matches still traverses the entire keyspace. The blocking duration depends on total key count and server load. Under load, even a few hundred milliseconds of blocking queues enough commands to create a compounding latency backlog. When the event loop resumes, clients may have already started retries, amplifying load. If KEYS is called repeatedly, the instance has no chance to drain that backlog between calls and stays degraded until the calls stop.

flowchart TD
    A[Latency spike] --> B{SLOWLOG GET}
    B --> C{KEYS dominates?}
    C -->|Yes| D[cmdstat_keys check]
    C -->|No| E[Other slow command]
    D --> F{Calls > 0?}
    F -->|Yes| G[CLIENT LIST]
    F -->|No| H[Check log rotation]
    G --> I[Kill client or deploy SCAN]

Common causes

Cause	What it looks like	First thing to check
Application code using `KEYS` for cache invalidation or key discovery	`cmdstat_keys` shows non-zero call count; slowlog entries for `KEYS`	`INFO commandstats` filtered for `cmdstat_keys`
Administrative scripts running `KEYS` during migration or cleanup	Ops tooling or cron jobs connected to Redis; spikes correlate with scheduled tasks	`CLIENT LIST` matching source IPs to internal tools
Monitoring or inventory tools enumerating keys	Regular intervals of latency spikes; same pattern in slowlog every N minutes	Slowlog timestamps and periodicity
Misguided cache warming or dependency checking	Application startup logic runs `KEYS` to verify state	Application deployment logs correlated with slowlog entries

Quick checks

# Check if KEYS has been called since startup
redis-cli INFO commandstats | grep cmdstat_keys

# Inspect the slowlog for KEYS entries
redis-cli SLOWLOG GET 10

# Check throughput and connection count (near-zero ops with many clients suggests a stall)
redis-cli INFO stats | grep instantaneous_ops_per_sec
redis-cli INFO clients | grep connected_clients

# List connected clients to find the offender
redis-cli CLIENT LIST

# Check overall command latency events (empty unless latency-monitor-threshold is above 0)
redis-cli LATENCY LATEST

How to diagnose it

Confirm the symptom. Run redis-cli INFO stats and redis-cli INFO clients. Look for instantaneous_ops_per_sec near zero while connected_clients (reported in the clients section) remains high. This indicates the event loop is blocked despite active connections. On replicas, master_last_io_seconds_ago (in INFO replication) may grow because the master cannot serve replication traffic while blocked.
Find the slow command. Run redis-cli SLOWLOG GET 50. If KEYS dominates, you have found the wedge. Note the duration in microseconds; anything over 1000000 (1 second) is severe. Also note the client address. If KEYS is absent, look for other O(N) commands such as SORT, LRANGE on large lists, or HGETALL on massive hashes.
Correlate with command statistics. Run redis-cli INFO commandstats | grep cmdstat_keys. Any non-zero call count in production is a red flag. Compare usec_per_call to baseline simple commands like GET or SET.
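Identify the source. Slowlog entries record the client's address and name, not its connection ID. Run redis-cli CLIENT LIST and match that address against the addr field to find the connection and its id. Check the cmd field, which shows the last command the client ran, to confirm the workload.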
Check for recurrence. Run redis-cli SLOWLOG LEN. If the log is growing rapidly, entries may be evicted before you inspect them. Increase slowlog-max-len temporarily to retain evidence.
Validate impact duration. If latency monitoring is enabled, run redis-cli LATENCY HISTORY command (here "command" is the name of the latency event that records slow commands) to see how long command processing was stalled. The slowlog duration field is also reliable.
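The commandstats step can be scripted. The following is a minimal sketch, assuming the usual `cmdstat_<name>:calls=...,usec=...,usec_per_call=...` line format of INFO commandstats. The helper name keys_stats is hypothetical. It reads the INFO output on stdin and prints the KEYS call count and per-call cost, or calls=0 when no cmdstat_keys line exists.

```shell
# keys_stats: hypothetical helper, not part of redis-cli.
# Reads `INFO commandstats` output on stdin and summarizes the cmdstat_keys line.
keys_stats() {
    # INFO replies use CRLF line endings, so strip carriage returns first.
    tr -d '\r' | awk -F'[:,]' '
        $1 == "cmdstat_keys" {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "calls")         calls = kv[2]
                if (kv[1] == "usec_per_call") per_call = kv[2]
            }
            found = 1
        }
        END {
            # No cmdstat_keys line means KEYS has not run since the counters were last reset.
            if (found) printf "calls=%s usec_per_call=%s\n", calls, per_call
            else print "calls=0"
        }'
}
```

Typical use would be `redis-cli INFO commandstats | keys_stats`. The same helper works on a saved INFO dump, which is handy when the incident is already over.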

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`cmdstat_keys` calls	Any production use of `KEYS` indicates a blocking scan risk	Non-zero call count
Slowlog `KEYS` entries	Direct evidence of event loop blocking	Any `KEYS` entry in slowlog
`instantaneous_ops_per_sec`	Proxy for event loop health	Drop to near-zero with high `connected_clients`
`LATENCY LATEST` command events	Internal Redis view of command latency	Sustained `command` latency spikes
`connected_clients`	Shows backlog of waiting clients	High or growing while throughput drops
`blocked_clients`	Distinguishes intentional blocking (BLPOP) from slow-command blocking	This does NOT capture slow-command blocking; use slowlog instead
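The throughput and connection signals in the table can be combined into one check. This is only a sketch: the function name stall_signature and both default thresholds (10 ops/sec, 50 clients) are hypothetical and need tuning per workload. Because INFO is served by the same thread, a reading arrives only after any blocking command has finished, so treat the result as a hint, not proof.

```shell
# stall_signature: hypothetical helper.
# Reads `INFO stats` plus `INFO clients` output on stdin.
# Args: min_ops (default 10) and min_clients (default 50), both illustrative values.
stall_signature() {
    min_ops=${1:-10}
    min_clients=${2:-50}
    tr -d '\r' | awk -F: -v min_ops="$min_ops" -v min_clients="$min_clients" '
        $1 == "instantaneous_ops_per_sec" { ops = $2 }
        $1 == "connected_clients"         { clients = $2 }
        END {
            # Either field missing: the input did not contain both INFO sections.
            if (ops == "" || clients == "") { print "unknown"; exit }
            if (ops + 0 <= min_ops + 0 && clients + 0 >= min_clients + 0) print "stalled"
            else print "ok"
        }'
}
```

One way to feed it: `{ redis-cli INFO stats; redis-cli INFO clients; } | stall_signature 10 50`.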

Fixes

Immediate: stop the active bleed

If a client is calling KEYS in a loop, identify it via CLIENT LIST and terminate the connection to break the cycle. You cannot preempt a KEYS command already in flight; CLIENT KILL applies after the current command finishes. If the bleed is a single massive call, you must wait it out.

# WARNING: this kills the client connection. The application will error.
# Prefer killing by ID if multiple clients share the same source address.
redis-cli CLIENT KILL <ip:port>
# or
redis-cli CLIENT KILL ID <client-id>

This aborts any pending pipelined commands from that client and prevents further invocations. Other clients will resume, though some may have already timed out and need to reconnect.
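To find candidates for CLIENT KILL, one option is to filter CLIENT LIST for connections whose last command was KEYS. The sketch below assumes the space-separated key=value layout of CLIENT LIST lines. The helper name clients_running_keys is hypothetical.

```shell
# clients_running_keys: hypothetical helper.
# Reads `CLIENT LIST` output on stdin and prints "<id> <addr>" for every
# connection whose last executed command (the cmd field) was KEYS.
clients_running_keys() {
    tr -d '\r' | awk '
        {
            id = ""; addr = ""; cmd = ""
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "id")   id = kv[2]
                if (kv[1] == "addr") addr = kv[2]
                if (kv[1] == "cmd")  cmd = kv[2]
            }
            if (tolower(cmd) == "keys") print id, addr
        }'
}
```

Run it as `redis-cli CLIENT LIST | clients_running_keys`, then pass the printed ID to CLIENT KILL ID. The cmd field only shows the most recent command, so a client that has since issued something else will not appear.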

Short-term: replace KEYS with SCAN

SCAN is the production-safe replacement. It returns a cursor and a small batch of keys per call, yielding the event loop between iterations. Unlike KEYS, SCAN does not freeze the instance.

The basic iteration pattern uses a cursor that starts at 0 and ends when the server returns 0:

# First iteration
redis-cli SCAN 0 MATCH "user:*" COUNT 100

# Subsequent iterations use the returned cursor (42 is a placeholder for whatever the server returned)
redis-cli SCAN 42 MATCH "user:*" COUNT 100

Tradeoffs and gotchas:

Cursors are opaque. Never assume they increment, decrement, or follow a pattern. Always use the exact value returned by the previous call.
SCAN may return the same key multiple times across iterations. Your application must deduplicate if uniqueness matters.
COUNT is a hint, not a guarantee. The server decides the actual batch size per call.
MATCH filters after retrieval from the collection. If your pattern is sparse, SCAN may return empty batches while still consuming iteration cycles. Under load, this can elevate CPU without yielding much value.
A full SCAN iteration is still O(N) aggregate over the entire keyspace, but the per-call boundary yields the event loop, preventing the catastrophic blocking window that makes KEYS dangerous.
Application code should wrap the cursor loop in a timeout guard and handle empty batches without exiting early.
If the goal is deletion, collect keys with SCAN and remove them in batches. Never pipe KEYS into DEL.
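The gotchas above translate into a small cursor loop. This is a sketch under stated assumptions, not a reference implementation: it assumes redis-cli prints the cursor on the first line and one key per line when its output is not a terminal, and that key names contain no newlines. The function name scan_keys, the REDIS_CLI override variable, and the max_calls guard are all hypothetical names introduced here.

```shell
# scan_keys: hypothetical helper that walks the keyspace with SCAN.
# Args: pattern, max_calls (guard against a runaway loop; default 100000).
# REDIS_CLI is an assumed override hook, e.g. to add -h/-p or to substitute a stub.
scan_keys() {
    pattern=$1
    max_calls=${2:-100000}
    cursor=0
    calls=0
    while :; do
        reply=$("${REDIS_CLI:-redis-cli}" SCAN "$cursor" MATCH "$pattern" COUNT 100) || return 1
        # First line is the next cursor; pass it back verbatim on the next call.
        cursor=$(printf '%s\n' "$reply" | sed -n '1p')
        case $cursor in
            ''|*[!0-9]*) return 1 ;;   # not a cursor (error text), stop
        esac
        # Remaining lines are keys. An empty batch prints nothing and the loop continues.
        printf '%s\n' "$reply" | sed '1d'
        calls=$((calls + 1))
        [ "$cursor" = "0" ] && break
        [ "$calls" -ge "$max_calls" ] && return 2
    done
}
```

Since SCAN may repeat keys, pipe the result through `sort -u` when uniqueness matters, for example `scan_keys 'user:*' | sort -u`. For ad hoc use, redis-cli also ships a built-in iterator: `redis-cli --scan --pattern 'user:*'`.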

Structural: prevent KEYS at the source

If application code or scripts cannot be changed immediately, restrict the command at the server level:

Rename the command in redis.conf: rename-command KEYS "" to disable it entirely. This requires a restart to take effect, so plan a maintenance window.
Use ACLs (Redis 6.0+) to deny KEYS for application users while preserving it for admin roles if absolutely necessary.

Warning: Disabling KEYS via rename-command breaks any tool or script that relies on it. Verify no critical tooling depends on it before deploying. The better fix is always to remove the call site.
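As an illustration of the ACL route, the commands below deny KEYS for a single user while leaving other commands allowed. The user name app_user and its password are hypothetical placeholders, and the broad `~*` and `+@all` grants are only there to keep the example short. Rules set this way live in memory, so persisting them (via an ACL file or a config rewrite) is a separate step to plan for.

```shell
# Hypothetical user "app_user": enabled, all key patterns, all commands except KEYS.
redis-cli ACL SETUSER app_user on '>change-me' '~*' '+@all' '-keys'

# Inspect the resulting rules for that user.
redis-cli ACL GETUSER app_user
```

A connection authenticated as that user should then have KEYS rejected with a permission error, while SCAN continues to work.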

Prevention

Code review policy. Ban KEYS from all application code. Treat it like DEBUG SEGFAULT. Add linting or static analysis to reject pull requests containing Redis KEYS calls.
CI gates. Audit application repositories for KEYS usage against actual Redis call paths.
Runtime detection. Alert on cmdstat_keys call rate greater than zero. Any production invocation warrants a ticket.
Tooling audit. Inventory administrative scripts, backup tools, and monitoring checks for KEYS usage and replace them with SCAN.
Latency monitoring. Enable latency monitoring with a low threshold to catch blocking commands before they cascade. Set via CONFIG SET latency-monitor-threshold 1 (value is in milliseconds), or in redis.conf.
Client audits. Review Redis client library configurations; some frameworks historically used KEYS for cache tagging or invalidation.
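A CI gate for the policy above can start as a plain text search. The sketch below is deliberately crude and will need tuning per codebase: the function name find_keys_calls is hypothetical, and the second pattern assumes client variables are named with a redis or cache prefix, which is only a guess at a naming convention.

```shell
# find_keys_calls: hypothetical CI check.
# Prints matching lines and returns 1 when a likely KEYS call is found under the directory.
find_keys_calls() {
    dir=${1:-.}
    if grep -rnEi \
        -e 'redis-cli[^|;]*[[:space:]]keys([[:space:]]|$)' \
        -e '(redis|cache)[a-z0-9_]*\.keys\(' \
        "$dir"; then
        echo "KEYS usage found; replace it with SCAN" >&2
        return 1
    fi
    return 0
}
```

Text search produces false positives and misses indirect calls, so it complements the runtime alert on cmdstat_keys instead of replacing it.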

How Netdata helps

Netdata tracks redis.commands and redis.commands_by_type, exposing KEYS in the command mix.
The collector exposes redis.ops_per_sec, letting you correlate throughput drops with slowlog growth.
redis.latency metrics visualize event loop stalls that align with slowlog entries.
Slowlog monitoring integration captures KEYS entries as events after deployments.
Dashboards overlay redis.connected_clients with redis.ops_per_sec to spot the blocked-event-loop signature: flat throughput with sustained connections.

The Netdata solution

Redis monitoring with Netdata

Netdata monitors Redis with per-second metrics and ML anomaly detection. Track memory usage and fragmentation, fork/COW latency, replication backlog, evictions, and connection pressure to spot the failure modes in these runbooks early.
