Redis memory pressure spiral: eviction thrashing and how to break it
Redis latency climbs, CPU saturates, and cache hit rate falls. evicted_keys rises while application writes increase. The backend database gets hammered. This is not a simple capacity shortage; it is a memory pressure spiral. Redis has reached maxmemory and started evicting keys. The application responds to cache misses by re-fetching from the origin and writing back to Redis. Those writes trigger more evictions, which cause more misses, which cause more writes. Redis does maximum work for minimum value.
Each eviction runs synchronously on the main thread in the write path. As the eviction rate climbs, command latency degrades for all clients, not just writers. The instance responds to PING and appears healthy in basic liveness checks, yet it is effectively unusable.
What this means
When used_memory reaches maxmemory, Redis enforces the active maxmemory-policy. In cache workloads using allkeys-lru, allkeys-lfu, or similar, the server samples keys and deletes victims inline before completing the write command that triggered the check. This runs on the main thread.
The spiral starts when the working set exceeds allocated memory. The application immediately re-requests evicted keys, driving keyspace_misses upward. The application layer treats misses as signals to repopulate the cache, issuing writes that evict other keys. The result is a sustained burst of eviction, miss, and write operations that saturates the event loop, spikes CPU, and degrades latency across the board. The cache is online but useless.
flowchart TD
A[used_memory reaches maxmemory] --> B[eviction begins]
B --> C[evicted_keys increases]
C --> D[application re-requests keys]
D --> E[keyspace_misses increases]
E --> F[re-population writes]
F --> G[memory pressure continues]
G --> B
G --> H[CPU spikes and latency degrades]

Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Working set exceeds memory | used_memory at maxmemory, evicted_keys and keyspace_misses climbing together | used_memory vs maxmemory |
| Memory lost to fragmentation | mem_fragmentation_ratio > 1.5, used_memory below limit but RSS near physical RAM | mem_fragmentation_ratio |
| Client output buffer bloat | Memory pressure without proportional dataset growth, large omem values in CLIENT LIST | CLIENT LIST sorted by omem |
| volatile-* policy with no TTLs | evicted_keys stays at zero, write errors climb with OOM replies | CONFIG GET maxmemory-policy and INFO keyspace |
Quick checks
Run these safe commands to assess state.
# Check memory pressure against the limit
redis-cli INFO memory | grep -E "used_memory:|maxmemory:"
# Check eviction activity (cumulative; sample twice to compute rate)
redis-cli INFO stats | grep evicted_keys
# Check cache hit/miss counters
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses"
# Confirm live latency degradation
redis-cli --latency-history -i 1
# Check fragmentation overhead
redis-cli INFO memory | grep mem_fragmentation_ratio
# Find clients with large output buffers
redis-cli CLIENT LIST | grep -o 'omem=[0-9]*' | cut -d= -f2 | sort -rn | head -5
# Check current throughput and whether demand is rising
redis-cli INFO stats | grep instantaneous_ops_per_sec
# Verify eviction policy and whether keys have TTLs
redis-cli CONFIG GET maxmemory-policy
redis-cli INFO keyspace
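The first check above compares two raw byte counts by eye. A minimal sketch of turning them into a single percentage follows; the function name mem_pressure_pct is hypothetical, and it assumes INFO memory output is piped in on stdin. INFO replies use CRLF line endings, so the carriage returns are stripped first.

```shell
# Hypothetical helper: read `INFO memory` text on stdin and print
# used_memory as a percentage of maxmemory.
# Prints "unlimited" when maxmemory is 0 (or absent from the input).
mem_pressure_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }   # exact match skips used_memory_human etc.
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%.1f\n", 100 * used / max
    }'
}

# Usage (assumes a reachable instance):
#   redis-cli INFO memory | mem_pressure_pct
```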
How to diagnose it
- Confirm the pressure point. Compare used_memory to maxmemory. If maxmemory is 0, Redis has no limit and will grow until the OS OOM killer intervenes. That is a configuration defect, not a capacity trend.
- Measure eviction rate. evicted_keys is cumulative. Sample it twice over a known interval and compute the rate. Any sustained non-zero rate on a persistent workload is critical. For cache workloads, a sudden 10x spike over baseline indicates the spiral.
- Correlate misses with evictions. Sample keyspace_hits and keyspace_misses. If the miss rate climbs while evictions climb, the spiral is active. Calculate the hit rate as hits / (hits + misses). Distinguish this from a cold start by checking uptime_in_seconds; a post-restart miss spike is normal and should recover.
- Check for write rejections. If maxmemory-policy is noeviction, or is a volatile-* variant and no keys carry TTLs, evictions stop and writes fail. Look for errorstat_OOM (Redis 6.2+) or a rising total_error_replies rate.
- Identify memory consumers. If used_memory is well below maxmemory but the OS is under pressure, check mem_fragmentation_ratio. Values above 1.5 indicate significant allocator fragmentation. Values below 1.0 suggest swap, which is catastrophic for latency.
- Inspect client buffers. Large omem values in CLIENT LIST mean a slow consumer, a replica falling behind, or a forgotten MONITOR session is hoarding memory that should be available for data.
- Review workload shape. Use INFO commandstats to spot expensive commands. Counters are cumulative; look at high usec_per_call or sample twice to detect spikes in calls. Watch for KEYS, large SMEMBERS, or unexpected write volume.
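The eviction-rate and hit-rate steps above both reduce to small arithmetic on cumulative counters. The sketch below shows one way to do it; counter_rate and hit_rate_pct are hypothetical helper names, and the values are assumed to come from two INFO stats samples taken a known number of seconds apart.

```shell
# counter_rate FIRST SECOND SECONDS
# Per-second rate of a cumulative counter such as evicted_keys.
# Prints "n/a" for a non-positive interval or when the counter went
# backwards (for example after a restart or CONFIG RESETSTAT).
counter_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    if (s + 0 <= 0 || b + 0 < a + 0) { print "n/a"; exit }
    printf "%.2f\n", (b - a) / s
  }'
}

# hit_rate_pct HITS MISSES
# hits / (hits + misses), expressed as a percentage.
hit_rate_pct() {
  awk -v h="$1" -v m="$2" 'BEGIN {
    if (h + m == 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * h / (h + m)
  }'
}

# Usage (assumes a reachable instance):
#   get() { redis-cli INFO stats | tr -d '\r' | awk -F: -v k="$1" '$1 == k { print $2 }'; }
#   e1=$(get evicted_keys); sleep 10; e2=$(get evicted_keys)
#   counter_rate "$e1" "$e2" 10
#   hit_rate_pct "$(get keyspace_hits)" "$(get keyspace_misses)"
```

Note that a hit rate computed from the raw counters covers the whole uptime. To see the current trend, apply the same formula to the deltas between two samples.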
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| used_memory / maxmemory | Proximity to eviction or OOM errors | Ratio > 90% |
| evicted_keys rate | Active eviction load and data loss | Sustained increase, or any non-zero rate on persistent deployments |
| keyspace_misses rate | Cache effectiveness dropping | Rising while evicted_keys also rises |
| instantaneous_ops_per_sec | Demand on the event loop | Increasing while hit rate falls |
| mem_fragmentation_ratio | Allocator efficiency and RSS pressure | > 1.5 sustained, or < 1.0 with used_memory > 100 MB |
| CLIENT LIST omem | Client buffer memory competing with data | Single client > 256 MB, or total buffers > 10% of maxmemory |
| total_error_replies / errorstat_OOM | Write failures when eviction cannot run | Rate > 0 |
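The "total buffers > 10% of maxmemory" warning sign requires summing omem across all clients. A rough sketch, assuming CLIENT LIST output on stdin and the maxmemory value in bytes passed as an argument (client_buffer_pct is a hypothetical name):

```shell
# client_buffer_pct MAXMEMORY_BYTES   (CLIENT LIST text on stdin)
# Sums the omem field of every client and prints the total as a
# percentage of maxmemory. Prints "n/a" when maxmemory is 0 or missing.
client_buffer_pct() {
  tr -d '\r' | awk -v max="$1" '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^omem=/) total += substr($i, 6)   # value after "omem="
    }
    END {
      if (max + 0 <= 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * total / max
    }'
}

# Usage (assumes a reachable instance):
#   max=$(redis-cli CONFIG GET maxmemory | tail -n 1)
#   redis-cli CLIENT LIST | client_buffer_pct "$max"
```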
Fixes
Immediate relief
If the instance is actively thrashing, reduce pressure without restarting.
- Kill buffer-heavy clients. Identify slow consumers or forgotten MONITOR sessions via CLIENT LIST, then terminate them with CLIENT KILL <ip:port>. This is disruptive to the targeted client but can reclaim large amounts of memory instantly.
- Purge allocator fragmentation. Run MEMORY PURGE to ask jemalloc to release dirty pages. This is safe and online, though the effect depends on allocator state.
- Increase maxmemory. If the host has physical headroom, raise the limit with CONFIG SET maxmemory <bytes>. Persistent instances should leave at least 50% of physical RAM free for copy-on-write during fork. Cache-only instances should keep used_memory under 75% of maxmemory to leave headroom for bursts.
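Before killing a client you need its address, not just its buffer size. The sketch below is one possible way to rank clients by omem while keeping their id, address, and last command; top_omem_clients is a hypothetical helper that reads CLIENT LIST output on stdin.

```shell
# top_omem_clients [N]   (CLIENT LIST text on stdin)
# Prints "omem id addr cmd" for the N clients with the largest output
# buffers (default 5), largest first.
top_omem_clients() {
  tr -d '\r' | awk '
    NF > 0 {
      id = addr = cmd = "?"; omem = 0
      for (i = 1; i <= NF; i++) {
        if      ($i ~ /^id=/)   id   = substr($i, 4)
        else if ($i ~ /^addr=/) addr = substr($i, 6)   # anchored, so laddr= is ignored
        else if ($i ~ /^omem=/) omem = substr($i, 6)
        else if ($i ~ /^cmd=/)  cmd  = substr($i, 5)
      }
      print omem, id, addr, cmd
    }' | sort -rn | head -n "${1:-5}"
}

# Usage (assumes a reachable instance):
#   redis-cli CLIENT LIST | top_omem_clients 3
#   redis-cli CLIENT KILL <ip:port>    # address taken from the third column
```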
Dataset larger than allocation
If the working set legitimately exceeds memory, tuning will not help.
- Shard the dataset. Split the keyspace across multiple Redis instances or enable Redis Cluster. This is the only fix that preserves the full dataset while ending eviction.
- Reduce data volume. Shorten TTLs, drop unnecessary keys, compress values, or replace small discrete keys with hash fields to reduce per-key overhead.
- Add memory. Vertical scaling works only if the host has free RAM and you account for RSS overhead and COW.
Fragmentation consuming headroom
When used_memory is below maxmemory but the OS is under pressure:
- Run MEMORY PURGE.
- Enable active defragmentation with CONFIG SET activedefrag yes. This adds CPU overhead but can reclaim significant RSS over time. Only available when Redis is built with jemalloc.
Policy mismatch
If maxmemory-policy is volatile-lru or volatile-random but most keys lack TTLs, Redis has no eligible keys to evict.
- Switch to an allkeys-* policy if the dataset is a cache.
- Alternatively, ensure the application sets TTLs on all keys. Without TTLs, volatile-* policies degrade to noeviction.
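To judge whether a volatile-* policy has anything to evict, compare the keys and expires counts that INFO keyspace reports per database. A minimal sketch, assuming that output is piped in on stdin (ttl_coverage is a hypothetical name):

```shell
# ttl_coverage   (INFO keyspace text on stdin)
# For each dbN line, prints the key count, the count of keys with a TTL,
# and the percentage of keys that carry a TTL.
ttl_coverage() {
  tr -d '\r' | awk -F'[:=,]' '
    /^db[0-9]+:/ {
      # Line shape: db0:keys=N,expires=M,avg_ttl=T
      keys = 0; expires = 0
      for (i = 2; i < NF; i += 2) {
        if ($i == "keys")    keys    = $(i + 1)
        if ($i == "expires") expires = $(i + 1)
      }
      pct = (keys + 0 > 0) ? 100 * expires / keys : 0
      printf "%s keys=%d expires=%d ttl_pct=%.1f\n", $1, keys, expires, pct
    }'
}

# Usage (assumes a reachable instance):
#   redis-cli INFO keyspace | ttl_coverage
```

A low ttl_pct combined with a volatile-* policy means only that small share of keys is eligible for eviction.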
Prevention
- Set maxmemory explicitly on every production instance. A value of 0 means no limit and eventual OOM kill.
- Maintain headroom. Persistent instances should keep used_memory below 50% of physical RAM to survive COW during fork. Cache-only instances should keep used_memory below 75% of maxmemory.
- Monitor rates, not absolutes. Alert on the derivative of evicted_keys, not the cumulative counter.
- Set non-zero client-output-buffer-limit normal values. The default (unlimited) is dangerous.
- Run periodic big key analysis. Use redis-cli --bigkeys or MEMORY USAGE sampling to catch single keys that monopolize memory.
- Add TTL jitter to prevent mass expiry events that suddenly free and then rapidly refill memory.
- Check CLIENT LIST during routine maintenance to catch MONITOR sessions or slow subscribers before they bloat buffers.
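TTL jitter from the list above can be as simple as adding a random offset to a base TTL at write time. The sketch below is illustrative only: jittered_ttl is a hypothetical helper, it assumes /dev/urandom and od are available, and the modulo step introduces a slight bias that is usually irrelevant for cache expiry.

```shell
# jittered_ttl BASE_SECONDS SPREAD_SECONDS
# Prints an integer TTL drawn from [BASE - SPREAD, BASE + SPREAD].
# Runs in a subshell so the temporary variable does not leak.
jittered_ttl() (
  # Two random bytes as an unsigned integer in 0..65535.
  r=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  echo $(( $1 + $r % (2 * $2 + 1) - $2 ))
)

# Usage (assumes a reachable instance; key and value are placeholders):
#   redis-cli SET some:key some-value EX "$(jittered_ttl 3600 300)"
```

In practice the application's Redis client would apply the same idea in its own language; the shell form only demonstrates the arithmetic.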
How Netdata helps
Netdata collects evicted_keys, keyspace_misses, and instantaneous_ops_per_sec from the same instance, making the correlation visible in one view. Alerting triggers on used_memory approaching maxmemory or on mem_fragmentation_ratio thresholds. Per-client connection metrics expose output buffer growth before it exhausts the heap. Slowlog and latency metrics help distinguish eviction overhead from expensive commands.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis NOAUTH / WRONGPASS authentication failures: ACL LOG and credential drift
- Redis aof_last_write_status:err: AOF write failures and recovery
- Redis appendfsync always latency: durability vs throughput trade-offs
- Redis big keys: finding the giant key that blocks the event loop
- Redis blocked_clients growing: dead consumers vs healthy queues
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit
- Redis cluster_slots_pfail > 0: impending node failure in a cluster
- Redis CLUSTERDOWN / cluster_state:fail: slot coverage and recovery
- Redis connected_clients climbing: connection leak detection







