Redis CPU saturation: hitting the single-core throughput ceiling
Redis latency climbs. PING returns PONG, but simple GETs take milliseconds instead of microseconds. Host CPU looks moderate - perhaps 25% across eight cores - yet commands queue. The likely cause is main-thread CPU saturation. Redis executes all commands on a single event-loop thread. Once that thread saturates one core, latency rises linearly with queue depth. There is no performance cliff - only a steady ramp that eventually drives client timeouts. On multi-core hosts, aggregate process CPU hides this bottleneck because background children, I/O threads, and system accounting spread usage across cores.
What this means
Redis executes commands on a single thread. Since Redis 6.0, optional I/O threads (disabled by default) can handle network socket reads and writes in parallel, but command execution, key expiry, active defragmentation, and incremental hash-table rehashing all run on the main thread. When main-thread CPU approaches one full core, every additional command waits in the event-loop queue. That wait time adds directly to latency. Because there is no preemption inside the event loop, a single slow operation blocks all subsequent commands until it completes.
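The aggregate CPU counters used_cpu_user and used_cpu_sys cover every thread in the server process: the main thread, bio threads, and I/O workers. Fork children are reported separately in used_cpu_user_children and used_cpu_sys_children. On a multi-core server, host-level utilization can look low while the main thread is saturated, and the aggregates cannot show how much of the total belongs to the event loop. Redis 6.2 introduced used_cpu_user_main_thread and used_cpu_sys_main_thread, which isolate event-loop CPU. To detect saturation, sample these counters twice and divide the change by the elapsed seconds to get a per-second rate.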
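As a rough sketch of that two-sample calculation, the helpers below sum the two main-thread counters from an INFO cpu snapshot and turn two such sums into a rate. The function names and the 10-second window are illustrative assumptions, not part of Redis or redis-cli.

```shell
# Sum main-thread user + sys CPU seconds from one `INFO cpu` snapshot on stdin.
# redis-cli prints INFO lines with CRLF endings, so the trailing CR is stripped.
main_thread_cpu_seconds() {
  awk -F: '
    /^used_cpu_(user|sys)_main_thread:/ { sub(/\r$/, "", $2); total += $2 }
    END { printf "%.6f\n", total }
  '
}

# Rate = (CPU seconds after - CPU seconds before) / elapsed wall-clock seconds.
# A result near 1.0 means the main thread is using about one full core.
main_thread_cpu_rate() {
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (b - a) / t }'
}

# Usage against a live Redis 6.2+ instance (hypothetical 10-second window):
#   before=$(redis-cli INFO cpu | main_thread_cpu_seconds)
#   sleep 10
#   after=$(redis-cli INFO cpu | main_thread_cpu_seconds)
#   main_thread_cpu_rate "$before" "$after" 10
```

On versions before 6.2 the main-thread fields are absent, so the first helper prints 0.000000 and the rate is not meaningful.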
Several internal processes compete for the same main-thread CPU: O(N) commands scanning large keys, the active expiry cycle sampling TTLs, active defragmentation relocating allocations, and incremental rehashing resizing the keyspace hash table. Any of these can push a stable host into saturation.
```mermaid
flowchart TD
    A[Main-thread CPU rate approaches 1.0] --> B{"SLOWLOG shows O(N) commands?"}
    B -->|Yes| C["Replace KEYS with SCAN,<br/>chunk large ops,<br/>optimize Lua"]
    B -->|No| D{"expired_keys or<br/>evicted_keys spiking?"}
    D -->|Yes| E["Add TTL jitter,<br/>increase maxmemory,<br/>or shard"]
    D -->|No| F{"active_defrag_running > 0?"}
    F -->|Yes| G["Tune defrag<br/>cycles or disable"]
    F -->|No| H["Shard or reduce<br/>command volume"]
```

Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Throughput exceeds single-core capacity | instantaneous_ops_per_sec plateaus while main-thread CPU rate approaches 1.0 | INFO cpu main-thread rate versus instantaneous_ops_per_sec |
| O(N) commands blocking the event loop | Slowlog dominated by KEYS, SMEMBERS, HGETALL, SORT, or Lua scripts; latency spikes across all command types | SLOWLOG GET 50, INFO commandstats |
| Active expiry cycle under mass TTL pressure | CPU spikes correlate with jumps in expired_keys; expired_time_cap_reached_count climbing | INFO stats for expired_keys rate and time-cap counter |
| Active defragmentation | Elevated CPU during otherwise idle periods; active_defrag_running sustained above zero | INFO memory and INFO stats defrag metrics |
| Hash table rehashing after growth | Steady CPU overhead following bulk loads or rapid keyspace growth | INFO keyspace key count trend |
| Eviction under memory pressure | evicted_keys rate climbing alongside CPU; used_memory at maxmemory | INFO stats evicted keys rate, used_memory versus maxmemory |
Quick checks
```shell
# Main-thread CPU counters (Redis 6.2+)
redis-cli INFO cpu | grep -E 'used_cpu_user_main_thread|used_cpu_sys_main_thread'

# Current throughput
redis-cli INFO stats | grep instantaneous_ops_per_sec

# Commands blocking the loop
redis-cli SLOWLOG GET 10

# Expensive command types
redis-cli INFO commandstats | grep -E 'cmdstat_keys|cmdstat_smembers|cmdstat_hgetall|cmdstat_sort'

# Expiry and eviction pressure
redis-cli INFO stats | grep -E 'expired_keys|evicted_keys'

# Active defrag status
redis-cli INFO memory | grep active_defrag
redis-cli INFO stats | grep active_defrag

# Internal latency events (requires latency-monitor-threshold > 0)
redis-cli LATENCY LATEST

# Blocked clients (distinguish queue wait from CPU wait)
redis-cli INFO clients | grep blocked_clients
```
How to diagnose it
1. Compute the main-thread CPU rate. Sample INFO cpu twice, 10 seconds apart. On Redis 6.2+, sum used_cpu_user_main_thread and used_cpu_sys_main_thread, then divide the change in that sum by the elapsed seconds. A rate above 0.9 means the main thread is saturated. On versions older than 6.2, sum used_cpu_user and used_cpu_sys instead, but recognize that this aggregate covers every thread in the server process and cannot isolate the main thread: a rate well below 1.0 rules saturation out, while a rate near or above 1.0 does not confirm it.
2. Correlate CPU with throughput. Check INFO stats for instantaneous_ops_per_sec. If throughput is flat or falling while client demand rises, the instance has hit its single-core execution ceiling.
3. Identify event-loop blockers. Inspect SLOWLOG GET 50. If the same pattern appears repeatedly - especially KEYS, SMEMBERS, HGETALL, unbounded LRANGE, or SORT - that command is serially delaying everything behind it. Check INFO commandstats for high usec_per_call outliers.
4. Check background CPU consumers. Compute the rate of expired_keys and evicted_keys from INFO stats. If either is spiking, the main thread is spending cycles on memory management instead of client commands. Check INFO memory for active_defrag_running above zero.
5. Verify internal latency sources. If latency-monitor-threshold is set above 0, run LATENCY LATEST. Look for command, expire-cycle, or eviction-cycle events. These categories confirm which subsystem is consuming time.
6. Distinguish from I/O bottlenecks. Check INFO stats for io_threaded_reads_processed and io_threaded_writes_processed (Redis 6.0+). Rising counts show that network I/O is being offloaded to I/O threads, which makes socket syscalls a less likely primary bottleneck and points toward execution saturation.
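The expired_keys and evicted_keys rates in the steps above come from the same two-sample approach. A minimal sketch, assuming hypothetical helper names `info_field` and `counter_rate` and a sampling interval you choose yourself:

```shell
# Print the value of one INFO field from text on stdin.
# The trailing CR that redis-cli emits on INFO lines is stripped.
info_field() {
  awk -F: -v key="$1" '$1 == key { sub(/\r$/, "", $2); print $2 }'
}

# Per-second rate of a monotonically increasing counter between two samples.
counter_rate() {
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.1f\n", (b - a) / t }'
}

# Usage against a live instance (hypothetical 10-second window):
#   e1=$(redis-cli INFO stats | info_field expired_keys)
#   sleep 10
#   e2=$(redis-cli INFO stats | info_field expired_keys)
#   counter_rate "$e1" "$e2" 10
```

The same pair of helpers works for evicted_keys; compare the result with your own baseline rather than a fixed number.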
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Main-thread CPU rate (Redis 6.2+) | Isolates single-thread saturation from aggregate process CPU | Rate > 0.7 sustained; > 0.9 is saturated |
| instantaneous_ops_per_sec | Current command volume against the single-thread ceiling | Plateauing despite increasing client load |
| Slowlog growth rate | Reveals commands that block the event loop | > 10 new entries per minute, or repeated patterns |
| expired_keys rate | Active expiry consumes main-thread CPU | Sudden spike > 10x baseline |
| evicted_keys rate | Eviction consumes CPU and signals memory pressure | Sustained non-zero rate on persistent workloads |
| active_defrag_running | Defrag runs on the main thread | Sustained > 0 with high miss ratio |
| LATENCY LATEST events | Internal latency breakdown by subsystem | command, expire-cycle, or eviction-cycle events appearing |
| Keyspace size trend | Rehashing adds incremental overhead | Rapid growth after bulk loads |
Fixes
If throughput exceeds single-core capacity
Scale CPU-bound command execution by sharding across multiple Redis instances or cluster nodes. Redis Cluster splits the keyspace across independent event loops, each capable of roughly one core of command execution. I/O threads offload network reads and writes, but they do not parallelize command execution. Do not expect I/O threads to relieve main-thread saturation.
If O(N) commands block the event loop
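Replace KEYS with SCAN in all application code. Break large reads such as SMEMBERS, HGETALL, and unbounded LRANGE into smaller batches, for example with SSCAN, HSCAN, or bounded LRANGE ranges. Review Lua scripts for loops over large keyspaces. Note that lua-time-limit does not terminate a script: once a script exceeds it, Redis starts replying BUSY to other clients and accepts SCRIPT KILL, so treat it as a way to regain control of a runaway script, not as a cap on execution time.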
After remediation, SLOWLOG RESET clears old noise so you can confirm the pattern disappears.
Warning: SLOWLOG RESET immediately and irreversibly clears the slow log history.
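For ad hoc key listing from a shell, redis-cli offers a SCAN-based iterator, which avoids the single blocking pass that KEYS makes. The wrapper below is a minimal sketch. The function name and the REDIS_CLI override variable are assumptions made for illustration, so that the command can be pointed at a different binary or stubbed.

```shell
# List keys matching a glob pattern by iterating with SCAN instead of KEYS.
# REDIS_CLI is a hypothetical override; it defaults to the real redis-cli.
scan_keys() {
  "${REDIS_CLI:-redis-cli}" --scan --pattern "$1"
}

# Usage: process matching keys one at a time.
#   scan_keys 'session:*' | while read -r key; do
#     echo "found $key"
#   done
```

SCAN still visits the whole keyspace, but it does so in many short calls that let other commands run in between.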
If active expiry or eviction consumes CPU
Add jitter to TTLs so keys do not expire in synchronized waves. If eviction is constant because the dataset exceeds maxmemory, increase the memory limit or shard the data. Tuning the eviction policy does not remove the CPU cost of sampling and deleting keys under pressure.
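One way to add that jitter is to extend each TTL by a small random offset when the key is written. The sketch below assumes a hypothetical helper named `jittered_ttl`, a base TTL in seconds, and a spread below 65536 seconds, because it reads only two random bytes.

```shell
# Print base TTL plus a random offset in [0, spread] seconds.
# Runs in a subshell so its variables do not leak into the caller.
jittered_ttl() (
  base=$1
  spread=$2
  # Two random bytes as an unsigned integer in 0..65535.
  r=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
  echo $(( base + r % (spread + 1) ))
)

# Usage: a one-hour TTL spread over an extra five minutes (hypothetical key).
#   redis-cli SET cache:item:42 "value" EX "$(jittered_ttl 3600 300)"
```

The modulo introduces a slight bias toward smaller offsets, which does not matter for spreading expirations.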
If active defrag consumes CPU
Review defrag effectiveness by comparing active_defrag_hits to active_defrag_misses. If the hit ratio is low but CPU overhead is high, lower active-defrag-cycle-max or disable activedefrag temporarily. Only enable defrag when mem_fragmentation_ratio stays consistently above 1.5.
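The hit ratio here is hits divided by hits plus misses. A small sketch that reads INFO stats output on stdin follows; the function name is a hypothetical choice, and what counts as a "low" ratio is left to your own judgment.

```shell
# Compute active_defrag_hits / (hits + misses) from `INFO stats` text on stdin.
# Prints "n/a" when defrag has not scanned anything yet.
defrag_hit_ratio() {
  awk -F: '
    $1 == "active_defrag_hits"   { hits = $2 + 0 }
    $1 == "active_defrag_misses" { misses = $2 + 0 }
    END {
      total = hits + misses
      if (total == 0) print "n/a"
      else printf "%.2f\n", hits / total
    }
  '
}

# Usage against a live instance:
#   redis-cli INFO stats | defrag_hit_ratio
```

Both counters are cumulative since startup, so compare two readings taken some time apart if you want the ratio for a recent window.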
If rehashing adds overhead
Rehashing is incremental and usually transient. If it persists after bulk loads, the keyspace may be growing beyond planned capacity. Shard before the next bulk operation.
Prevention
- Maintain main-thread CPU headroom. Keep main-thread CPU below 70% of one core during peak traffic. This leaves room for expiry cycles, defrag, and sudden command mix shifts.
- Monitor main-thread rate, not aggregate CPU. Aggregate used_cpu_* metrics are misleading on multi-core hosts. Use the Redis 6.2+ main-thread counters.
- Prohibit KEYS in production. Use ACLs or rename-command to prevent applications from issuing KEYS.
- Run periodic big-key analysis. Schedule redis-cli --bigkeys or MEMORY USAGE sampling on representative keys to catch keys that will eventually block the loop.
- Add TTL jitter. Prevent mass expiry events by distributing TTLs across a time window.
- Size maxmemory to avoid chronic eviction. Persistent workloads should not rely on eviction as a steady-state mechanism.
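As one illustration of the ACL route for blocking KEYS, the sketch below removes the command from an existing ACL user (Redis 6.0+). The function name, the REDIS_CLI override variable, and the example user name are assumptions made for illustration.

```shell
# Remove the KEYS command from an existing ACL user's permissions.
# ACL SETUSER applies the rule on top of the user's current rules.
# REDIS_CLI is a hypothetical override; it defaults to the real redis-cli.
deny_keys_for_user() {
  "${REDIS_CLI:-redis-cli}" ACL SETUSER "$1" -keys
}

# Usage (hypothetical application user):
#   deny_keys_for_user app-user
#   redis-cli ACL GETUSER app-user   # inspect the resulting rules
```

The change lives in memory only until the ACL configuration is persisted, so plan for that as part of your normal configuration management.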
How Netdata helps
Netdata derives the per-second rate from used_cpu_user_main_thread and used_cpu_sys_main_thread, charting main-thread CPU in isolation from aggregate process noise.
It correlates main-thread CPU with instantaneous_ops_per_sec, slowlog growth, and LATENCY LATEST events, which helps distinguish execution saturation from network or disk bottlenecks.
Alerts fire when main-thread CPU rate crosses 70% and 90% thresholds, and anomaly detection flags instantaneous_ops_per_sec plateaus.
Netdata tracks evicted_keys, expired_keys, and active_defrag_running alongside CPU to identify which background consumer is competing for the event loop.
It also monitors individual cluster nodes to reveal shard-level hot spots that aggregate cluster metrics miss.