Redis KEYS command blocking production: why to replace it with SCAN

Redis P99 latency jumps from sub-millisecond to seconds. Clients time out. instantaneous_ops_per_sec drops to near zero while connected_clients stays high. The likely culprit is a single slow command monopolizing the event loop, and KEYS is the classic offender.

Redis runs all client commands on one main thread. KEYS scans the entire keyspace synchronously to match a pattern. Time complexity is O(N) where N is the total number of keys. During the scan, nothing else executes. Every client, including replication streams, health checks, and monitoring probes, waits.

The impact is independent of the result set size. A KEYS call that returns zero matches still traverses the entire keyspace. The blocking duration depends on total key count and server load. Under load, even a few hundred milliseconds of blocking queues enough commands to create a compounding latency backlog. By the time the event loop resumes, clients may have already started retries, amplifying load. If KEYS is called repeatedly, the instance may never get enough idle time to drain the backlog.
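The backlog arithmetic can be sketched in a few lines. This is a rough model, not a measurement: the function names and the traffic figures in the usage note are illustrative assumptions, and real drain time also depends on client timeouts and retries.

```python
def queued_commands(arrival_rate_per_sec: float, block_seconds: float) -> int:
    """Commands that arrive, and have to wait, while the main thread is blocked."""
    return round(arrival_rate_per_sec * block_seconds)


def drain_seconds(backlog: int, arrival_rate_per_sec: float,
                  service_rate_per_sec: float) -> float:
    """Time to clear the backlog after the block, while new commands keep arriving."""
    spare_capacity = service_rate_per_sec - arrival_rate_per_sec
    if spare_capacity <= 0:
        # No headroom (for example, retries pushed arrivals up to capacity):
        # the backlog never drains.
        return float("inf")
    return backlog / spare_capacity
```

With a hypothetical 20,000 commands per second arriving and a 300 ms KEYS call, `queued_commands(20000, 0.3)` gives 6,000 waiting commands. If retries raise the arrival rate to the instance's service rate, `drain_seconds` returns infinity, which is the compounding case described above.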

flowchart TD
    A[Latency spike] --> B{SLOWLOG GET}
    B --> C{KEYS dominates?}
    C -->|Yes| D[cmdstat_keys check]
    C -->|No| E[Other slow command]
    D --> F{Calls > 0?}
    F -->|Yes| G[CLIENT LIST]
    F -->|No| H[Check log rotation]
    G --> I[Kill client or deploy SCAN]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Application code using KEYS for cache invalidation or key discovery | cmdstat_keys shows non-zero call count; slowlog entries for KEYS | INFO commandstats filtered for cmdstat_keys |
| Administrative scripts running KEYS during migration or cleanup | Ops tooling or cron jobs connected to Redis; spikes correlate with scheduled tasks | CLIENT LIST matching source IPs to internal tools |
| Monitoring or inventory tools enumerating keys | Regular intervals of latency spikes; same pattern in slowlog every N minutes | Slowlog timestamps and periodicity |
| Misguided cache warming or dependency checking | Application startup logic runs KEYS to verify state | Application deployment logs correlated with slowlog entries |

Quick checks

# Check if KEYS has been called since startup
redis-cli INFO commandstats | grep cmdstat_keys

# Inspect the slowlog for KEYS entries
redis-cli SLOWLOG GET 10

# Check current throughput (INFO itself is queued while a slow command runs)
redis-cli INFO stats | grep instantaneous_ops_per_sec

# List connected clients to find the offender
redis-cli CLIENT LIST

# Check overall command latency events
redis-cli LATENCY LATEST
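For automated checks, the cmdstat_keys line can be parsed instead of read by eye. The sketch below assumes the usual `cmdstat_<name>:field=value,...` layout of INFO commandstats; the helper names are hypothetical, and the text passed in is whatever your tooling captured from redis-cli.

```python
def parse_commandstat(line):
    """Split 'cmdstat_keys:calls=3,usec=45,...' into ('keys', {'calls': 3.0, ...})."""
    name, _, fields = line.strip().partition(":")
    if name.startswith("cmdstat_"):
        name = name[len("cmdstat_"):]
    stats = {}
    for pair in fields.split(","):
        key, _, value = pair.partition("=")
        if key and value:
            stats[key] = float(value)
    return name, stats


def keys_stats(info_text):
    """Return the stats dict for KEYS, or None if KEYS was never called."""
    for line in info_text.splitlines():
        if line.startswith("cmdstat_keys:"):
            return parse_commandstat(line)[1]
    return None
```

A `None` result means no cmdstat_keys line was present, which is the state you want in production. Any dictionary with `calls` above zero is worth following up with the slowlog.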

How to diagnose it

  1. Confirm the symptom. Run redis-cli INFO stats and redis-cli INFO clients. Look for instantaneous_ops_per_sec (stats section) near zero while connected_clients (clients section) remains high. This indicates the event loop is blocked despite active connections. On replicas, master_last_io_seconds_ago (INFO replication) may grow because the master cannot serve replication traffic while blocked.

  2. Find the slow command. Run redis-cli SLOWLOG GET 50. If KEYS dominates, you have found the wedge. Note the duration in microseconds; anything over 1000000 (1 second) is severe. Also note the client address. If KEYS is absent, look for other O(N) commands such as SORT, LRANGE on large lists, or HGETALL on massive hashes.

  3. Correlate with command statistics. Run redis-cli INFO commandstats | grep cmdstat_keys. Any non-zero call count in production is a red flag. Compare usec_per_call to baseline simple commands like GET or SET.

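  4. Identify the source. Use redis-cli CLIENT LIST and match the client address (or client name) recorded in the slowlog entry against the addr or name field of each connection. The id in a slowlog entry identifies the log entry, not the client. Check the cmd field to confirm the client’s workload.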

  5. Check for recurrence. Run redis-cli SLOWLOG LEN. If the log is growing rapidly, entries may be evicted before you inspect them. Increase slowlog-max-len temporarily to retain evidence.

  6. Validate impact duration. If latency monitoring is enabled, run redis-cli LATENCY HISTORY command (here, command is the name of the latency event recorded for slow commands) to see how long command processing was stalled. The slowlog duration field is also reliable.
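Steps 2 and 4 can be joined in a small script. The sketch below is one possible shape, under stated assumptions: slowlog entries are dictionaries with `command`, `client_address`, and `duration` keys (an assumed shape, loosely modeled on what client libraries return), and the CLIENT LIST text is the raw space-separated `key=value` output. All function names are hypothetical.

```python
def worst_keys_calls(slowlog_entries):
    """Map client address -> longest KEYS duration (microseconds) seen in the slowlog."""
    worst = {}
    for entry in slowlog_entries:
        words = entry["command"].split()
        if words and words[0].upper() == "KEYS":
            addr = entry["client_address"]
            worst[addr] = max(worst.get(addr, 0), entry["duration"])
    return worst


def parse_client_list(text):
    """Turn CLIENT LIST output into one dict of fields per connection."""
    clients = []
    for line in text.splitlines():
        fields = dict(part.partition("=")[::2] for part in line.split() if "=" in part)
        if fields:
            clients.append(fields)
    return clients


def find_offenders(slowlog_entries, client_list_text):
    """Return (client id, addr, worst duration) for connections that issued KEYS."""
    worst = worst_keys_calls(slowlog_entries)
    return [
        (client["id"], client["addr"], worst[client["addr"]])
        for client in parse_client_list(client_list_text)
        if client.get("addr") in worst
    ]
```

An empty result does not clear the application: the address includes an ephemeral port, so a client that has since reconnected will no longer match. In that case, fall back to the IP portion of the address or to the client name.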

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| cmdstat_keys calls | Any production use of KEYS indicates a blocking scan risk | Non-zero call count |
| Slowlog KEYS entries | Direct evidence of event loop blocking | Any KEYS entry in slowlog |
| instantaneous_ops_per_sec | Proxy for event loop health | Drop to near-zero with high connected_clients |
| LATENCY LATEST command events | Internal Redis view of command latency | Sustained command latency spikes |
| connected_clients | Shows backlog of waiting clients | High or growing while throughput drops |
| blocked_clients | Distinguishes intentional blocking (BLPOP) from slow-command blocking | This does NOT capture slow-command blocking; use slowlog instead |

Fixes

Immediate: stop the active bleed

If a client is calling KEYS in a loop, identify it via CLIENT LIST and terminate the connection to break the cycle. You cannot preempt a KEYS command already in flight; CLIENT KILL applies after the current command finishes. If the bleed is a single massive call, you must wait it out.

# WARNING: this kills the client connection. The application will error.
# Prefer killing by ID if multiple clients share the same source address.
redis-cli CLIENT KILL <ip:port>
# or
redis-cli CLIENT KILL ID <client-id>

This aborts any pending pipelined commands from that client and prevents further invocations. Other clients will resume, though some may have already timed out and need to reconnect.

Short-term: replace KEYS with SCAN

SCAN is the production-safe replacement. It returns a cursor and a small batch of keys per call, yielding the event loop between iterations. Unlike KEYS, SCAN does not freeze the instance.

The basic iteration pattern uses a cursor that starts at 0 and ends when the server returns 0:

# First iteration
redis-cli SCAN 0 MATCH "user:*" COUNT 100

# Subsequent iterations use the returned cursor (42 stands in for whatever the previous call returned)
redis-cli SCAN 42 MATCH "user:*" COUNT 100

Tradeoffs and gotchas:

  • Cursors are opaque. Never assume they increment, decrement, or follow a pattern. Always use the exact value returned by the previous call.
  • SCAN may return the same key multiple times across iterations. Your application must deduplicate if uniqueness matters.
  • COUNT is a hint, not a guarantee. The server decides the actual batch size per call.
  • MATCH filters after retrieval from the collection. If your pattern is sparse, SCAN may return empty batches while still consuming iteration cycles. Under load, this can elevate CPU without yielding much value.
  • A full SCAN iteration is still O(N) aggregate over the entire keyspace, but the per-call boundary yields the event loop, preventing the catastrophic blocking window that makes KEYS dangerous.
  • Application code should wrap the cursor loop in a timeout guard and handle empty batches without exiting early.
  • If the goal is deletion, collect keys with SCAN and remove them in batches. Never pipe KEYS into DEL.
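The points above can be combined into one application-side loop. This is a minimal sketch, not a reference implementation: it assumes a client object whose `scan(cursor=..., match=..., count=...)` returns a `(next_cursor, keys)` pair and whose `delete(*names)` returns the number of keys removed, which is how redis-py behaves. The function names, batch sizes, and the 30-second guard are illustrative choices.

```python
import time


def scan_keys(client, pattern, count=100, max_seconds=30.0):
    """Yield each matching key once; stop at cursor 0 or raise when the guard expires."""
    deadline = time.monotonic() + max_seconds
    seen = set()  # SCAN may repeat keys, so deduplicate (costs memory per match)
    cursor = 0
    while True:
        # Always pass back the exact cursor the server returned.
        cursor, batch = client.scan(cursor=cursor, match=pattern, count=count)
        for key in batch:  # an empty batch is normal; it does not mean "done"
            if key not in seen:
                seen.add(key)
                yield key
        if int(cursor) == 0:  # only a returned cursor of 0 ends the iteration
            return
        if time.monotonic() > deadline:
            raise TimeoutError("SCAN iteration exceeded max_seconds")


def delete_matching(client, pattern, batch_size=500, **scan_options):
    """Delete matching keys in small batches and return how many were removed."""
    deleted = 0
    batch = []
    for key in scan_keys(client, pattern, **scan_options):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

With redis-py this might be called as `delete_matching(redis.Redis(), "user:*")`. Keeping `batch_size` small bounds the cost of each delete call, and the deduplication set can be dropped when repeated keys are harmless, as they are for deletion.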

Structural: prevent KEYS at the source

If application code or scripts cannot be changed immediately, restrict the command at the server level:

  • Rename the command in redis.conf: the directive rename-command KEYS "" disables it entirely. This requires a restart to take effect, so plan a maintenance window.
  • Use ACLs (Redis 6.0+) to deny KEYS for application users while preserving it for admin roles if absolutely necessary.

Warning: Disabling KEYS via rename-command breaks any tool or script that relies on it. Verify no critical tooling depends on it before deploying. The better fix is always to remove the call site.
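For the ACL route, the rule itself is short: `ACL SETUSER <user> -keys` removes KEYS from that user's allowed commands. The sketch below only builds the command arguments. The user name `app` is a placeholder, the helper name is hypothetical, and sending the rule through `execute_command` assumes a redis-py style client.

```python
from typing import List


def deny_keys_rule(username: str) -> List[str]:
    """Build the ACL SETUSER arguments that revoke KEYS for one user."""
    return ["ACL", "SETUSER", username, "-keys"]


# Hypothetical usage with a redis-py client connected as an admin user:
#   client.execute_command(*deny_keys_rule("app"))
```

Test the rule against a staging instance first. ACL rules are applied in order, so a later rule that grants the command again would undo the denial.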

Prevention

  • Code review policy. Ban KEYS from all application code. Treat it like DEBUG SEGFAULT. Add linting or static analysis to reject pull requests containing Redis KEYS calls.
  • CI gates. Fail the build when a repository search finds KEYS on an actual Redis call path, as opposed to the word appearing in comments or in unrelated APIs.
  • Runtime detection. Alert on cmdstat_keys call rate greater than zero. Any production invocation warrants a ticket.
  • Tooling audit. Inventory administrative scripts, backup tools, and monitoring checks for KEYS usage and replace them with SCAN.
  • Latency monitoring. Enable latency monitoring with a low threshold to catch blocking commands before they cascade. Set via CONFIG SET latency-monitor-threshold 1 (value is in milliseconds), or in redis.conf.
  • Client audits. Review Redis client library configurations; some frameworks historically used KEYS for cache tagging or invalidation.
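The runtime detection rule reduces to comparing two samples of the cmdstat_keys call counter. The sketch below assumes your monitoring stores the previous sample. The function names are hypothetical, and a counter that goes backwards is treated as a reset (a restart or CONFIG RESETSTAT).

```python
def keys_calls_delta(previous_calls: int, current_calls: int) -> int:
    """KEYS invocations between two samples of the cmdstat_keys calls counter."""
    if current_calls < previous_calls:
        # Counter went backwards: stats were reset, so every call counted
        # since the reset is new.
        return current_calls
    return current_calls - previous_calls


def should_alert(previous_calls: int, current_calls: int) -> bool:
    """Alert on any new KEYS call, matching the 'greater than zero' rule."""
    return keys_calls_delta(previous_calls, current_calls) > 0
```

Alerting on the delta rather than the raw counter avoids paging forever on a single historical call, since the counter is only cleared by a restart or a stats reset.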

How Netdata helps

  • Netdata tracks redis.commands and redis.commands_by_type, exposing KEYS in the command mix.
  • The collector exposes redis.ops_per_sec, letting you correlate throughput drops with slowlog growth.
  • redis.latency metrics visualize event loop stalls that align with slowlog entries.
  • Slowlog monitoring integration captures KEYS entries as events after deployments.
  • Dashboards overlay redis.connected_clients with redis.ops_per_sec to spot the blocked-event-loop signature: flat throughput with sustained connections.