Redis big keys: finding the giant key that blocks the event loop

Application latency spikes while redis-cli PING still returns PONG. Simple GET commands take hundreds of milliseconds. Aggregate used_memory looks stable, instantaneous_ops_per_sec drops, and the slowlog grows. The culprit is often a single oversized key: a sorted set with millions of elements, a hash with millions of fields, or a list fetched with an unbounded range. Redis executes commands sequentially on one main thread; an O(N) command on a giant key blocks every other client until it completes. This guide shows how to find that key and fix it without restarting Redis.

What this means

Redis is single-threaded for command execution. I/O threads can read and write sockets in parallel since Redis 6.0, but traversing a hash table, sorting a set, or freeing a large object happens on the main thread. When a command touches a large structure, every other command queues behind it. A ZRANGEBYSCORE on a 50M-element sorted set, an HGETALL on a giant hash, a DEL on a massive key, or an unbounded LRANGE consumes CPU and wall-clock time proportionally to the key’s size. Aggregate memory metrics hide this: used_memory can look healthy while one key causes periodic freezes.

flowchart TD
    A[Latency spike / ops drop] --> B{"SLOWLOG shows O(N) command?"}
    B -->|Yes| C[Note key name and command]
    B -->|No| D[Check LATENCY LATEST for fork/fsync]
    D --> E[Not a big key issue]
    C --> F[Run redis-cli --bigkeys]
    F --> G{Key in top per-type results?}
    G -->|Yes| H[MEMORY USAGE on suspect]
    G -->|No| I[Sample random keys with MEMORY USAGE]
    H --> J[Confirm oversized key]
    I --> J
    J --> K[Use UNLINK or paginate access]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| O(N) command on a large collection | Slowlog shows HGETALL, SMEMBERS, LRANGE 0 -1, SORT, or ZRANGEBYSCORE with high execution times | SLOWLOG GET 10 and INFO commandstats |
| A single key growing without bound | Latency spikes correlate with writes to one key; key count is stable but one structure is bloated | redis-cli --bigkeys |
| Synchronous deletion of a large key | A single DEL causes a multi-second freeze; LATENCY LATEST shows a command spike | LATENCY LATEST and LATENCY HISTORY command |
| Lua script iterating a large key | Slowlog shows EVAL or EVALSHA entries with very high execution times; INFO commandstats shows high usec_per_call for cmdstat_eval or cmdstat_evalsha | SLOWLOG GET, looking at the EVAL and EVALSHA entries |
| Application fetching entire structures instead of paginating | Repeated large output buffer spikes in CLIENT LIST; high outbound network traffic | CLIENT LIST omem values |

Quick checks

# Check for recent slow commands and their arguments
redis-cli SLOWLOG GET 10

# Find the biggest key per data type via incremental SCAN
redis-cli --bigkeys

# Estimate RAM for a specific suspected key
redis-cli MEMORY USAGE my:suspect:key SAMPLES 5

# Identify commands with high per-call latency
redis-cli INFO commandstats | grep -E 'cmdstat_hgetall|cmdstat_smembers|cmdstat_lrange|cmdstat_sort|cmdstat_zrangebyscore|cmdstat_eval'

# List clients and sort by output buffer size to spot fetch-heavy connections
redis-cli CLIENT LIST | awk -F'[= ]' '{for(i=1;i<=NF;i++) if($i=="omem") print $(i+1)}' | sort -rn | head -10

# Check internal latency events for command spikes
redis-cli LATENCY LATEST

How to diagnose it

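  1. Confirm the event loop is blocked by commands, not by fork or fsync. Run SLOWLOG GET 10 and LATENCY LATEST. LATENCY LATEST returns data only when the latency monitor is enabled; latency-monitor-threshold defaults to 0 (disabled), so set it first, for example with CONFIG SET latency-monitor-threshold 100. If slowlog entries show execution times over 100 ms (the slowlog reports microseconds, so values above 100000) and LATENCY LATEST reports spikes for the command event, the event loop is being blocked by expensive operations.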
  2. Identify the command pattern. Use INFO commandstats and look for outliers in usec_per_call. Common offenders: HGETALL, SMEMBERS, LRANGE, SORT, ZRANGEBYSCORE, and EVAL.
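  3. Find the largest keys. Run redis-cli --bigkeys. It walks the keyspace incrementally with SCAN instead of KEYS, so it does not block the server and is generally safe in production. It reports the biggest key per data type, ranked by byte length for strings and by element count for collections, so treat the result as a shortlist rather than a memory measurement.
  4. Estimate memory for suspects. Run MEMORY USAGE <key> [SAMPLES count] on the candidates from step 3 and on the keys accessed by the slow commands. For collections the result is extrapolated from a sample of nested values (5 by default); SAMPLES 0 inspects every value for a more accurate figure, but that is itself an O(N) operation on a big key. High byte counts confirm which structures are overweight.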
  5. Correlate keys to clients. Run CLIENT LIST and look for connections with large omem values or cmd fields matching the slow command. This identifies which application instance is generating the load.
  6. Determine if the key is necessary. If it is temporary or cache data, removal is the fastest fix. If it is required data, change the access pattern instead of deleting the structure.
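Step 2 can be partly automated. The sketch below is one possible way to rank commands by usec_per_call. The function name rank_commandstats is made up for this example, and the parsing assumes the usual cmdstat_<name>:calls=...,usec=...,usec_per_call=... line layout of INFO commandstats.

```shell
# Rank commands by average cost per call.
# Reads raw `INFO commandstats` text on stdin and prints
# "usec_per_call command calls", most expensive first.
rank_commandstats() {
  tr -d '\r' |
  awk -F'[:,=]' '/^cmdstat_/ {
      name = substr($1, 9)            # drop the "cmdstat_" prefix
      calls = ""; per_call = ""
      for (i = 2; i < NF; i += 2) {   # remaining fields alternate key, value
        if ($i == "calls") calls = $(i + 1)
        if ($i == "usec_per_call") per_call = $(i + 1)
      }
      print per_call, name, calls
  }' |
  LC_ALL=C sort -rn
}
```

Against a live server it would be fed with something like redis-cli INFO commandstats | rank_commandstats | head -5. A command with a high usec_per_call but only a handful of calls may be a one-off, so read the calls column as well.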

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Slowlog entry count rate | Direct evidence of commands blocking the event loop | Sustained growth > 10 entries per minute |
| Main-thread CPU utilization | Big key operations saturate the single execution core. In Redis 7+, derive a rate from used_cpu_user_main_thread and used_cpu_sys_main_thread deltas | Rate approaching 1.0 second per second (100% of one core) |
| instantaneous_ops_per_sec | Drops when the event loop is blocked by a slow command | Sustained drop > 50% from baseline with stable client count |
| cmdstat_* usec_per_call | Reveals which specific command types are expensive. Run CONFIG RESETSTAT after deployments to isolate recent behaviour | O(N) command averaging > 10 ms per call since last reset |
| Client output buffer memory (omem) | Large buffers indicate clients retrieving oversized values | Any single client omem > 256 MB |
| Aggregate used_memory vs per-key MEMORY USAGE | Aggregate hides outliers; a single key can dominate | Top key consumes > 20% of total dataset memory |
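The main-thread CPU rate is the change in the two counters divided by the time between samples. A minimal sketch of that arithmetic follows; the helper names info_field and main_thread_cpu_rate are invented for illustration, and the 5-second interval in the commented example is an arbitrary choice.

```shell
# Pull one "name:value" field out of INFO output read on stdin.
info_field() {
  tr -d '\r' | awk -F: -v k="$1" '$1 == k { print $2 }'
}

# main_thread_cpu_rate USER1 SYS1 USER2 SYS2 INTERVAL_SECONDS
# Prints CPU-seconds consumed per wall-clock second between two samples.
main_thread_cpu_rate() {
  awk -v u1="$1" -v s1="$2" -v u2="$3" -v s2="$4" -v dt="$5" \
    'BEGIN { printf "%.2f\n", ((u2 + s2) - (u1 + s1)) / dt }'
}

# Example against a live server (left commented out on purpose):
#   a="$(redis-cli INFO cpu)"; sleep 5; b="$(redis-cli INFO cpu)"
#   main_thread_cpu_rate \
#     "$(printf '%s\n' "$a" | info_field used_cpu_user_main_thread)" \
#     "$(printf '%s\n' "$a" | info_field used_cpu_sys_main_thread)" \
#     "$(printf '%s\n' "$b" | info_field used_cpu_user_main_thread)" \
#     "$(printf '%s\n' "$b" | info_field used_cpu_sys_main_thread)" 5
```

For instance, counters moving from 100.0 user and 50.0 sys to 104.5 user and 50.3 sys over 5 seconds give (154.8 - 150.0) / 5 = 0.96, which is close to saturating one core.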

Fixes

Immediate: remove the key safely

If the key is expendable, do not use DEL. DEL frees memory synchronously and will block the event loop for the entire duration of the deletion, potentially for seconds on a multi-gigabyte structure. Use UNLINK instead. UNLINK removes the key from the keyspace immediately and defers memory reclamation to a background thread. You can also set CONFIG SET lazyfree-lazy-user-del yes to make DEL behave like UNLINK.

Warning: UNLINK destroys data. Confirm the key name and its purpose before running it.
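One cautious way to run the removal is to check the key's type first and refuse to proceed on a mismatch. The wrapper below is only a sketch: unlink_if_confirmed and the REDIS_CLI override variable are hypothetical names, not part of redis-cli, and it assumes redis-cli prints bare values when its output is captured rather than shown on a terminal.

```shell
# Command used to reach Redis; override to add host, port, or auth options,
# for example REDIS_CLI="redis-cli -h 10.0.0.5 -p 6379".
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# unlink_if_confirmed KEY EXPECTED_TYPE
# Removes KEY with UNLINK only if TYPE reports the type the operator expects.
unlink_if_confirmed() {
  local key="$1" expected="$2" actual
  actual="$($REDIS_CLI TYPE "$key")"
  if [ "$actual" != "$expected" ]; then
    echo "refusing: $key is type '$actual', expected '$expected'" >&2
    return 1
  fi
  # UNLINK prints the number of keys it removed.
  $REDIS_CLI UNLINK "$key"
}
```

A call such as unlink_if_confirmed my:suspect:key zset prints 1 when the key was removed. A key that does not exist reports type none, so a mistyped name is rejected rather than silently ignored.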

Change application access patterns

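Replace full-structure commands with scoped alternatives. Instead of HGETALL, use HSCAN or fetch specific fields with HMGET. Instead of SMEMBERS, use SSCAN or test membership with SISMEMBER. Instead of LRANGE 0 -1, use bounded ranges. Instead of ZRANGEBYSCORE with no limit, use ZSCAN or paginate with LIMIT offset count. Reading the whole structure is still O(N) in total, but each call now does a small, bounded amount of work (roughly constant per element returned for the SCAN family, O(log N + M) for a ZRANGEBYSCORE page of M elements), so other clients are served between chunks. Keep LIMIT offsets small, for example by resuming from the last score seen, because Redis still has to walk past the skipped elements.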
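As an illustration of the paginated pattern, the sketch below reads a hash in pages with HSCAN instead of one HGETALL. The function name scan_hash_fields, the REDIS_CLI override, and the default page size of 500 are assumptions made for this example. It also assumes field names and values contain no newlines, since it relies on redis-cli printing one element per line when its output is captured.

```shell
# Command used to reach Redis; override to add connection options.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# scan_hash_fields KEY [COUNT]
# Prints alternating field and value lines, fetched one HSCAN page at a time.
scan_hash_fields() {
  local key="$1" count="${2:-500}" cursor=0 reply
  while :; do
    reply="$($REDIS_CLI HSCAN "$key" "$cursor" COUNT "$count" </dev/null)"
    # The first line of the reply is the cursor for the next call.
    cursor="$(printf '%s\n' "$reply" | head -n 1)"
    case "$cursor" in
      ''|*[!0-9]*) echo "unexpected HSCAN reply for $key" >&2; return 1 ;;
    esac
    # The remaining lines are field, value, field, value, ...
    printf '%s\n' "$reply" | tail -n +2
    # A returned cursor of 0 means the iteration is complete.
    if [ "$cursor" = "0" ]; then break; fi
  done
}
```

COUNT is a hint, not a guarantee, so page sizes vary, and a field that changes during the iteration may or may not appear. Application code would normally use the client library's HSCAN support in the same cursor-driven way.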

Optimize or remove Lua scripts

If a Lua script iterates a large structure, break it into smaller batches executed from the client side, or refactor to avoid full traversals. Set lua-time-limit (default 5000 ms) to define when Redis flags a script as slow and allows SCRIPT KILL. Note that SCRIPT KILL succeeds only against scripts that have not yet performed writes.

Shard large structures

If the data must remain and be accessed in bulk, shard it across multiple smaller keys. For example, split a giant hash into user:1000:profile, user:1001:profile, and so on, or partition a sorted set by score range. This keeps any single key small enough that O(N) traversals complete quickly.
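When there is no natural partition such as a user ID or score range, one common alternative is to bucket members into a fixed number of shard keys by hashing the member name. The sketch below shows that idea only; shard_key is a hypothetical helper, the checksum from cksum is an arbitrary choice of hash, and the shard count must stay fixed once data has been written.

```shell
# shard_key PREFIX MEMBER SHARDS
# Maps a member name to one of SHARDS keys named PREFIX:0 .. PREFIX:(SHARDS-1).
shard_key() {
  local prefix="$1" member="$2" shards="$3" sum
  # cksum prints "<checksum> <byte count>"; keep only the checksum.
  sum="$(printf '%s' "$member" | cksum | cut -d' ' -f1)"
  echo "${prefix}:$((sum % shards))"
}
```

A writer would then target "$(shard_key profiles user:1000 64)" instead of a single profiles key, and readers compute the same shard from the member name. Bulk reads have to visit all 64 keys, but each traversal is bounded by the size of one shard.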

Enable lazy freeing by default

Set lazyfree-lazy-user-del yes in redis.conf or via CONFIG SET. This ensures future accidental or intentional DEL operations on large keys do not block the event loop. For FLUSHDB and FLUSHALL, pass the ASYNC flag to avoid synchronous deletion.

Prevention

  • Schedule periodic redis-cli --bigkeys or MEMORY USAGE sampling runs via cron or configuration management. Trend per-key memory to catch growth before it blocks the event loop.
  • Ban unbounded O(N) commands in application code reviews. Enforce pagination for all collection access.
  • Set client-output-buffer-limit normal <hard> <soft> <seconds>; for example, client-output-buffer-limit normal 256mb 128mb 60. This disconnects runaway fetches before they destabilize the server.
  • Monitor INFO commandstats for usec_per_call regressions after each deployment; reset stats with CONFIG RESETSTAT to establish a clean baseline.
  • Keep lazyfree-lazy-user-del yes enabled on all production instances.
  • Maintain per-key memory dashboards if your monitoring system supports scraping MEMORY USAGE samples.
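The periodic sampling in the first bullet could be scripted along these lines. This is a rough sketch, not a tuned tool: report_large_keys and the REDIS_CLI override are invented names, it issues one MEMORY USAGE call per key (slow on a large keyspace, so it suits off-peak runs or a replica), and it assumes key names contain no newlines.

```shell
# Command used to reach Redis; override to add connection options.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# report_large_keys THRESHOLD_BYTES [SCAN_COUNT]
# Walks the keyspace with SCAN and prints "bytes key" for every key whose
# MEMORY USAGE estimate is at or above the threshold.
report_large_keys() {
  local threshold="$1" count="${2:-1000}" cursor=0 reply
  while :; do
    reply="$($REDIS_CLI SCAN "$cursor" COUNT "$count" </dev/null)"
    cursor="$(printf '%s\n' "$reply" | head -n 1)"
    case "$cursor" in
      ''|*[!0-9]*) echo "unexpected SCAN reply" >&2; return 1 ;;
    esac
    printf '%s\n' "$reply" | tail -n +2 | while IFS= read -r key; do
      [ -n "$key" ] || continue
      bytes="$($REDIS_CLI MEMORY USAGE "$key" SAMPLES 5 </dev/null)"
      # Skip keys that expired between SCAN and MEMORY USAGE, or errors.
      case "$bytes" in ''|*[!0-9]*) continue ;; esac
      if [ "$bytes" -ge "$threshold" ]; then
        printf '%s %s\n' "$bytes" "$key"
      fi
    done
    if [ "$cursor" = "0" ]; then break; fi
  done
}
```

Run from cron with a threshold such as 104857600 (100 MiB) and append the output to a dated log, and the growth of individual keys becomes visible over time.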

How Netdata helps

  • Correlates drops in instantaneous operations per second with keyspace hits and system CPU spikes to confirm event loop blocking.
  • Surfaces main-thread CPU saturation when a big key monopolizes the single execution core.
  • Tracks memory usage alongside application latency to expose when stable aggregate memory masks per-key outliers.
  • Alerts on rejected connections and connected client anomalies that follow latency spikes caused by queued commands.
  • Provides slowlog integration to visualize command latency outliers without manual SLOWLOG GET queries.