Redis CPU saturation: hitting the single-core throughput ceiling

Redis latency climbs. PING returns PONG, but simple GETs take milliseconds instead of microseconds. Host CPU looks moderate - perhaps 25% across eight cores - yet commands queue. The likely cause is main-thread CPU saturation. Redis executes all commands on a single event-loop thread. Once that thread saturates one core, latency rises linearly with queue depth. There is no performance cliff - only a steady ramp that eventually drives client timeouts. On multi-core hosts, host-wide and per-process CPU figures hide this bottleneck: one fully busy core out of eight is only 12.5% of the host, and the process totals also fold in I/O threads and background threads.

What this means

Redis is a single-threaded command processor. Since Redis 6.0, optional I/O threads (disabled by default) can handle network socket reads and writes in parallel, but command execution, key expiry, active defragmentation, and incremental hash-table rehashing all run on the main thread. When main-thread CPU approaches one full core, every additional command waits in the event-loop queue. That wait time adds directly to latency. Because there is no preemption inside the event loop, a single slow operation blocks all subsequent commands until it completes.

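The aggregate CPU counters used_cpu_user and used_cpu_sys cover every thread in the server process, including bio threads and I/O workers. Forked children, such as RDB snapshot and AOF rewrite processes, are reported separately in used_cpu_user_children and used_cpu_sys_children. On a multi-core server, these aggregates, read as a share of all cores, may look low while the main thread is saturated. Redis 6.2 introduced used_cpu_user_main_thread and used_cpu_sys_main_thread, which isolate event-loop CPU. To detect saturation, sample these counters twice and compute the per-second rate: the change in their sum divided by the seconds between samples.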
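As a rough sketch of that two-sample calculation, the helpers below sum the two main-thread counters from INFO cpu output and turn a pair of samples into a rate. The function names main_thread_cpu_seconds and cpu_rate are illustrative and not part of Redis or redis-cli, and the 10-second interval in the usage comment is an arbitrary choice.

```shell
# Sum the two main-thread counters (CPU seconds) from INFO cpu text on stdin.
# Prints nothing and returns non-zero if the fields are missing (Redis < 6.2).
main_thread_cpu_seconds() {
  tr -d '\r' | awk -F: '
    $1 == "used_cpu_user_main_thread" || $1 == "used_cpu_sys_main_thread" {
      sum += $2
      n++
    }
    END {
      if (n < 2) exit 1
      printf "%.6f\n", sum
    }'
}

# CPU seconds consumed per wall-clock second; 1.00 is one fully busy core.
# Arguments: first sample, second sample, seconds elapsed between them.
cpu_rate() {
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (b - a) / t }'
}

# Usage against a live server:
#   s1=$(redis-cli INFO cpu | main_thread_cpu_seconds)
#   sleep 10
#   s2=$(redis-cli INFO cpu | main_thread_cpu_seconds)
#   cpu_rate "$s1" "$s2" 10
```

A result close to 1.00 suggests the event loop is consuming nearly a whole core, regardless of what host-wide CPU shows.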

Several internal processes compete for the same main-thread CPU: O(N) commands scanning large keys, the active expiry cycle sampling TTLs, active defragmentation relocating allocations, and incremental rehashing resizing the keyspace hash table. Any of these can push a stable host into saturation.

flowchart TD
  A[Main-thread CPU rate approaches 1.0] --> B{"SLOWLOG shows O(N) commands?"}
  B -->|Yes| C["Replace KEYS with SCAN,<br/>chunk large ops,<br/>optimize Lua"]
  B -->|No| D{"expired_keys or<br/>evicted_keys spiking?"}
  D -->|Yes| E["Add TTL jitter,<br/>increase maxmemory,<br/>or shard"]
  D -->|No| F{"active_defrag_running > 0?"}
  F -->|Yes| G["Tune defrag<br/>cycles or disable"]
  F -->|No| H["Shard or reduce<br/>command volume"]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Throughput exceeds single-core capacity | instantaneous_ops_per_sec plateaus while main-thread CPU rate approaches 1.0 | INFO cpu main-thread rate versus instantaneous_ops_per_sec |
| O(N) commands blocking the event loop | Slowlog dominated by KEYS, SMEMBERS, HGETALL, SORT, or Lua scripts; latency spikes across all command types | SLOWLOG GET 50, INFO commandstats |
| Active expiry cycle under mass TTL pressure | CPU spikes correlate with jumps in expired_keys; expired_time_cap_reached_count climbing | INFO stats for expired_keys rate and time-cap counter |
| Active defragmentation | Elevated CPU during otherwise idle periods; active_defrag_running sustained above zero | INFO memory and INFO stats defrag metrics |
| Hash table rehashing after growth | Steady CPU overhead following bulk loads or rapid keyspace growth | INFO keyspace key count trend |
| Eviction under memory pressure | evicted_keys rate climbing alongside CPU; used_memory at maxmemory | INFO stats evicted keys rate, used_memory versus maxmemory |

Quick checks

# Main-thread CPU counters (Redis 6.2+)
redis-cli INFO cpu | grep -E 'used_cpu_user_main_thread|used_cpu_sys_main_thread'

# Current throughput
redis-cli INFO stats | grep instantaneous_ops_per_sec

# Commands blocking the loop
redis-cli SLOWLOG GET 10

# Expensive command types
redis-cli INFO commandstats | grep -E 'cmdstat_keys|cmdstat_smembers|cmdstat_hgetall|cmdstat_sort'

# Expiry and eviction pressure
redis-cli INFO stats | grep -E 'expired_keys|evicted_keys'

# Active defrag status
redis-cli INFO memory | grep active_defrag
redis-cli INFO stats | grep active_defrag

# Internal latency events (requires latency-monitor-threshold > 0)
redis-cli LATENCY LATEST

# Clients parked in blocking calls such as BLPOP (waiting by design, not on CPU)
redis-cli INFO clients | grep blocked_clients

How to diagnose it

  1. Compute the main-thread CPU rate. Sample INFO cpu twice, 10 seconds apart. On Redis 6.2+, sum used_cpu_user_main_thread and used_cpu_sys_main_thread. Divide the delta by the elapsed interval. A rate above 0.9 means the main thread is saturated. On Redis versions older than 6.2, sum used_cpu_user and used_cpu_sys, but recognize that this aggregate includes background and I/O threads, so it can overstate main-thread usage: a rate near 1.0 is not proof of saturation, although a rate well below 1.0 rules it out.

  2. Correlate CPU with throughput. Check INFO stats for instantaneous_ops_per_sec. If throughput is flat or falling while client demand rises, the instance has hit its single-core execution ceiling.

  3. Identify event loop blockers. Inspect SLOWLOG GET 50. If the same pattern appears repeatedly - especially KEYS, SMEMBERS, HGETALL, unbounded LRANGE, or SORT - that command is serially delaying everything behind it. Check INFO commandstats for high usec_per_call outliers.

  4. Check background CPU consumers. Compute the rate of expired_keys and evicted_keys from INFO stats. If either is spiking, the main thread is spending cycles on memory management instead of client commands. Check INFO memory for active_defrag_running above zero.

  5. Verify internal latency sources. If latency-monitor-threshold is set, run LATENCY LATEST. Look for command, expire-cycle, or eviction-cycle events. These categories confirm which subsystem is consuming time.

  6. Distinguish from I/O bottlenecks. Check INFO stats for io_threaded_reads_processed and io_threaded_writes_processed (Redis 6.0+). These are cumulative counters. If they keep increasing between samples, network I/O is being offloaded to I/O threads, which makes socket syscalls an unlikely primary bottleneck and points to execution saturation.
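For step 3, a small filter can rank INFO commandstats entries by usec_per_call so outliers stand out. This is a sketch under the assumption that each line has the usual cmdstat_<name>:calls=...,usec=...,usec_per_call=... shape; slowest_commands is a made-up helper name, and the default of five rows is arbitrary.

```shell
# Read INFO commandstats text on stdin and print the N commands with the
# highest usec_per_call as "usec_per_call calls command" (default N = 5).
slowest_commands() {
  tr -d '\r' | awk -F'[:,=]' '
    /^cmdstat_/ {
      name = substr($1, 9)            # strip the "cmdstat_" prefix
      calls = per_call = ""
      # After the name, fields alternate between a key and its value.
      for (i = 2; i < NF; i += 2) {
        if ($i == "calls") calls = $(i + 1)
        if ($i == "usec_per_call") per_call = $(i + 1)
      }
      if (per_call != "") printf "%s %s %s\n", per_call, calls, name
    }' | LC_ALL=C sort -rn | head -n "${1:-5}"
}

# Usage against a live server:
#   redis-cli INFO commandstats | slowest_commands 10
```

A command with a high usec_per_call but few calls may still matter, because each call stalls everything queued behind it.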

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Main-thread CPU rate (Redis 6.2+) | Isolates single-thread saturation from aggregate process CPU | Rate > 0.7 sustained; > 0.9 is saturated |
| instantaneous_ops_per_sec | Current command volume against the single-thread ceiling | Plateauing despite increasing client load |
| Slowlog growth rate | Reveals commands that block the event loop | > 10 new entries per minute, or repeated patterns |
| expired_keys rate | Active expiry consumes main-thread CPU | Sudden spike > 10x baseline |
| evicted_keys rate | Eviction consumes CPU and signals memory pressure | Sustained non-zero rate on persistent workloads |
| active_defrag_running | Defrag runs on the main thread | Sustained > 0 with high miss ratio |
| LATENCY LATEST events | Internal latency breakdown by subsystem | command, expire-cycle, or eviction-cycle events appearing |
| Keyspace size trend | Rehashing adds incremental overhead | Rapid growth after bulk loads |

Fixes

If throughput exceeds single-core capacity

Scale CPU-bound command execution by sharding across multiple Redis instances or cluster nodes. Redis Cluster splits the keyspace across independent event loops, each capable of roughly one core of command execution. I/O threads offload network reads and writes, but they do not parallelize command execution. Do not expect I/O threads to relieve main-thread saturation.

If O(N) commands block the event loop

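Replace KEYS with SCAN in all application code. Break large reads such as SMEMBERS, HGETALL, and unbounded LRANGE into smaller batches using SSCAN, HSCAN, and bounded LRANGE ranges. Review Lua scripts for loops over large keyspaces. Note that lua-time-limit (renamed busy-reply-threshold in Redis 7.0) does not stop a runaway script: once the limit passes, Redis only starts answering other clients with BUSY errors and accepts SCRIPT KILL, so the script itself must bound its work.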
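One way to batch a large hash read is to walk it with HSCAN until the cursor returns to 0. The sketch below assumes redis-cli's non-interactive output, which prints the cursor on the first line followed by one element per line, and it assumes fields and values contain no newlines. hscan_all and the REDIS_CLI override variable are illustrative names, not redis-cli features.

```shell
# Walk a hash in batches with HSCAN instead of a single HGETALL.
# Arguments: key, optional COUNT hint (default 100).
# REDIS_CLI may be set to override the client command, e.g. "redis-cli -p 6380".
hscan_all() {
  key=$1
  batch=${2:-100}
  cursor=0
  while :; do
    reply=$(${REDIS_CLI:-redis-cli} HSCAN "$key" "$cursor" COUNT "$batch") || return 1
    cursor=$(printf '%s\n' "$reply" | head -n 1)
    # Stop on anything that is not a numeric cursor (for example an error reply).
    case $cursor in
      '' | *[!0-9]*) return 1 ;;
    esac
    # Remaining lines alternate between field and value.
    printf '%s\n' "$reply" | tail -n +2
    if [ "$cursor" = "0" ]; then
      break
    fi
  done
}

# Usage against a live server (the key name is hypothetical):
#   hscan_all user:sessions 500
```

Each HSCAN call is a short command, so other clients get served between batches. COUNT is only a hint, and an element may be returned more than once if the hash changes during the scan.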

After remediation, SLOWLOG RESET clears old noise so you can confirm the pattern disappears.

Warning: SLOWLOG RESET immediately and irreversibly clears the slow log history.

If active expiry or eviction consumes CPU

Add jitter to TTLs so keys do not expire in synchronized waves. If eviction is constant because the dataset exceeds maxmemory, increase the memory limit or shard the data. Tuning the eviction policy does not remove the CPU cost of sampling and deleting keys under pressure.
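A minimal way to add jitter is to pick each TTL from a window rather than using one fixed value. The helper below is a sketch: jittered_ttl is a hypothetical name, it reads /dev/urandom for randomness, and the 3600-second base with a 300-second spread in the usage comment is only an example.

```shell
# Print a TTL in seconds drawn from [base, base + spread).
# Arguments: base TTL in seconds, spread in seconds (must be > 0).
jittered_ttl() {
  base=$1
  spread=$2
  # Read 4 random bytes as one unsigned integer and keep only the digits.
  r=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')
  echo $(( base + r % spread ))
}

# Usage against a live server (key and value are hypothetical):
#   redis-cli SET session:42 payload EX "$(jittered_ttl 3600 300)"
```

The modulo introduces a slight bias toward smaller offsets, which is negligible for spreading expirations.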

If active defrag consumes CPU

Review defrag effectiveness by comparing active_defrag_hits to active_defrag_misses. If the hit ratio is low but CPU overhead is high, lower active-defrag-cycle-max or disable activedefrag temporarily. Enable defrag only when mem_fragmentation_ratio stays persistently above 1.5.
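The hit ratio can be read straight from INFO stats. The following sketch computes hits / (hits + misses) from the two cumulative counters; defrag_hit_ratio is an illustrative name, and the value 10 in the CONFIG SET comment is an arbitrary example rather than a recommendation.

```shell
# Read INFO stats text on stdin and print the defrag hit ratio (0.00 to 1.00),
# or "n/a" when defrag has not processed any allocations yet.
defrag_hit_ratio() {
  tr -d '\r' | awk -F: '
    $1 == "active_defrag_hits"   { hits = $2 }
    $1 == "active_defrag_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.2f\n", hits / total
    }'
}

# Usage against a live server:
#   redis-cli INFO stats | defrag_hit_ratio
# To cap or pause defrag at runtime:
#   redis-cli CONFIG SET active-defrag-cycle-max 10
#   redis-cli CONFIG SET activedefrag no
```

Because the counters are cumulative since startup, comparing the ratio across two samples gives a better picture of current behavior than a single reading.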

If rehashing adds overhead

Rehashing is incremental and usually transient. If it persists after bulk loads, the keyspace may be growing beyond planned capacity. Shard before the next bulk operation.

Prevention

  • Maintain main-thread CPU headroom. Keep the main-thread rate below 70% of one core during peak traffic. This leaves room for expiry cycles, defrag, and sudden command mix shifts.
  • Monitor main-thread rate, not aggregate CPU. Aggregate used_cpu_* metrics are misleading on multi-core hosts. Use the Redis 6.2+ main-thread counters.
  • Prohibit KEYS in production. Use ACLs or rename-command to prevent applications from issuing KEYS.
  • Run periodic big-key analysis. Schedule redis-cli --bigkeys or MEMORY USAGE sampling on representative keys to catch keys that will eventually block the loop.
  • Add TTL jitter. Prevent mass expiry events by distributing TTLs across a time window.
  • Size maxmemory to avoid chronic eviction. Persistent workloads should not rely on eviction as a steady-state mechanism.

How Netdata helps

Netdata derives the per-second rate from used_cpu_user_main_thread and used_cpu_sys_main_thread, charting main-thread CPU in isolation from aggregate process noise.

It correlates main-thread CPU with instantaneous_ops_per_sec, slowlog growth, and LATENCY LATEST events, which helps distinguish execution saturation from network or disk bottlenecks.

Alerts fire when main-thread CPU rate crosses 70% and 90% thresholds, and anomaly detection flags instantaneous_ops_per_sec plateaus.

Netdata tracks evicted_keys, expired_keys, and active_defrag_running alongside CPU to identify which background consumer is competing for the event loop.

It also monitors individual cluster nodes to reveal shard-level hot spots that aggregate cluster metrics miss.