Redis memory pressure spiral: eviction thrashing and how to break it

Redis latency climbs, CPU saturates, and cache hit rate falls. evicted_keys rises while application writes increase. The backend database gets hammered. This is not a simple capacity shortage; it is a memory pressure spiral. Redis has reached maxmemory and started evicting keys. The application responds to cache misses by re-fetching from the origin and writing back to Redis. Those writes trigger more evictions, which cause more misses, which cause more writes. Redis does maximum work for minimum value.

Each eviction runs synchronously on the main thread in the write path. As the eviction rate climbs, command latency degrades for all clients, not just writers. The instance responds to PING and appears healthy in basic liveness checks, yet it is effectively unusable.

What this means

When used_memory reaches maxmemory, Redis enforces the active maxmemory-policy. In cache workloads using allkeys-lru, allkeys-lfu, or similar, the server samples keys and deletes victims inline before completing the write command that triggered the check. This runs on the main thread.

The spiral starts when the working set exceeds allocated memory. The application immediately re-requests evicted keys, driving keyspace_misses upward. The application layer treats misses as signals to repopulate the cache, issuing writes that evict other keys. The result is a sustained burst of eviction, miss, and write operations that saturates the event loop, spikes CPU, and degrades latency across the board. The cache is online but useless.

flowchart TD
    A[used_memory reaches maxmemory] --> B[eviction begins]
    B --> C[evicted_keys increases]
    C --> D[application re-requests keys]
    D --> E[keyspace_misses increases]
    E --> F[re-population writes]
    F --> G[memory pressure continues]
    G --> B
    G --> H[CPU spikes and latency degrades]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Working set exceeds memory | used_memory at maxmemory, evicted_keys and keyspace_misses climbing together | used_memory vs maxmemory |
| Memory lost to fragmentation | mem_fragmentation_ratio > 1.5, used_memory below limit but RSS near physical RAM | mem_fragmentation_ratio |
| Client output buffer bloat | Memory pressure without proportional dataset growth, large omem values in CLIENT LIST | CLIENT LIST sorted by omem |
| volatile-* policy with no TTLs | evicted_keys stays at zero, write errors climb with OOM replies | CONFIG GET maxmemory-policy and INFO keyspace |

Quick checks

Run these safe commands to assess state.

# Check memory pressure against the limit
redis-cli INFO memory | grep -E "used_memory:|maxmemory:"

# Check eviction activity (cumulative; sample twice to compute rate)
redis-cli INFO stats | grep evicted_keys

# Check cache hit/miss counters
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses"

# Confirm live latency degradation
redis-cli --latency-history -i 1

# Check fragmentation overhead
redis-cli INFO memory | grep mem_fragmentation_ratio

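# Find clients with the largest output buffers (prints omem in bytes, then the client address)
redis-cli CLIENT LIST | awk '{o=a=""; for(i=1;i<=NF;i++){if($i~/^omem=/)o=substr($i,6); if($i~/^addr=/)a=substr($i,6)} print o, a}' | sort -rn | head -5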

# Check current throughput and whether demand is rising
redis-cli INFO stats | grep instantaneous_ops_per_sec

# Verify eviction policy and whether keys have TTLs
redis-cli CONFIG GET maxmemory-policy
redis-cli INFO keyspace

How to diagnose it

  1. Confirm the pressure point. Compare used_memory to maxmemory. If maxmemory is 0, Redis has no limit and will grow until the OS OOM killer intervenes. That is a configuration defect, not a capacity trend.
  2. Measure eviction rate. evicted_keys is cumulative. Sample it twice over a known interval and compute the rate. On an instance used as a persistent store, any sustained non-zero rate is critical, because every evicted key is lost data. For cache workloads, a sudden 10x spike over baseline indicates the spiral.
  3. Correlate misses with evictions. Sample keyspace_hits and keyspace_misses. If the miss rate climbs while evictions climb, the spiral is active. Calculate the hit rate as hits / (hits + misses). Distinguish this from a cold start by checking uptime_in_seconds; a post-restart miss spike is normal and should recover.
  4. Check for write rejections. If maxmemory-policy is noeviction, or a volatile-* variant while no keys carry TTLs, Redis has nothing it may evict and rejects writes with OOM errors instead. Look for errorstat_OOM in INFO errorstats (Redis 6.2+) or a rising total_error_replies rate.
  5. Identify memory consumers. If used_memory is well below maxmemory but the OS is under pressure, check mem_fragmentation_ratio. Values above 1.5 indicate significant allocator fragmentation. Values below 1.0 suggest swap, which is catastrophic for latency.
  6. Inspect client buffers. Large omem values in CLIENT LIST mean a slow consumer, a replica falling behind, or a forgotten MONITOR session is hoarding memory that should be available for data.
  7. Review workload shape. Use INFO commandstats to spot expensive commands. Counters are cumulative; look at high usec_per_call or sample twice to detect spikes in calls. Watch for KEYS, large SMEMBERS, or unexpected write volume.
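
Steps 2 and 3 come down to simple arithmetic on two samples of cumulative counters. The following is a minimal sketch of that arithmetic; counter_rate and hit_rate are hypothetical helper names chosen for illustration, not redis-cli features, and the commented wiring assumes redis-cli can reach the instance.

```shell
# Hypothetical helpers for steps 2 and 3; the names are illustrative only.

# counter_rate FIRST SECOND INTERVAL_SECONDS
# Per-second rate of a cumulative counter such as evicted_keys.
counter_rate() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN {
        # Refuse a non-positive interval or a counter that went backwards
        # (for example after a restart or CONFIG RESETSTAT).
        if (t <= 0 || b < a) { print "n/a"; exit }
        printf "%.2f\n", (b - a) / t
    }'
}

# hit_rate HITS MISSES
# hits / (hits + misses), printed with four decimals.
hit_rate() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        if (h + m == 0) { print "n/a"; exit }
        printf "%.4f\n", h / (h + m)
    }'
}

# Example wiring (assumption: redis-cli reaches the instance with default options):
#   stat() { redis-cli INFO stats | tr -d '\r' | awk -F: -v k="$1" '$1 == k {print $2}'; }
#   a=$(stat evicted_keys); sleep 10; b=$(stat evicted_keys)
#   counter_rate "$a" "$b" 10
```

Passing the differences between two samples of keyspace_hits and keyspace_misses to hit_rate gives the hit rate for that window only. That is usually more telling during an incident than the lifetime ratio, which a long uptime can keep looking healthy.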

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| used_memory / maxmemory | Proximity to eviction or OOM errors | Ratio > 90% |
| evicted_keys rate | Active eviction load and data loss | Sustained increase, or any non-zero rate on persistent deployments |
| keyspace_misses rate | Cache effectiveness dropping | Rising while evicted_keys also rises |
| instantaneous_ops_per_sec | Demand on the event loop | Increasing while hit rate falls |
| mem_fragmentation_ratio | Allocator efficiency and RSS pressure | > 1.5 sustained, or < 1.0 with used_memory > 100 MB |
| CLIENT LIST omem | Client buffer memory competing with data | Single client > 256 MB, or total buffers > 10% of maxmemory |
| total_error_replies / errorstat_OOM | Write failures when eviction cannot run | Rate > 0 |
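
As a rough illustration, the memory-related thresholds above can be checked against a single INFO memory snapshot. This is a sketch only: check_memory_signals is a hypothetical function name, it covers just the used_memory / maxmemory and mem_fragmentation_ratio rows, and a one-off snapshot cannot tell whether a condition is sustained.

```shell
# check_memory_signals: hypothetical helper. Reads `redis-cli INFO memory`
# text on stdin and prints one WARN line per threshold that is crossed.
check_memory_signals() {
    tr -d '\r' | awk -F: '
        $1 == "used_memory"             { used = $2 }
        $1 == "maxmemory"               { max  = $2 }
        $1 == "mem_fragmentation_ratio" { frag = $2 }
        END {
            if (max + 0 == 0) {
                print "WARN maxmemory is 0 (no limit)"
            } else if (used / max > 0.90) {
                printf "WARN used_memory at %.0f%% of maxmemory\n", 100 * used / max
            }
            if (frag + 0 > 1.5) {
                print "WARN mem_fragmentation_ratio " frag " > 1.5"
            }
            # Below 1.0 only matters once the dataset is non-trivial (100 MB here).
            if (frag != "" && frag + 0 < 1.0 && used + 0 > 100 * 1024 * 1024) {
                print "WARN mem_fragmentation_ratio " frag " < 1.0 (possible swap)"
            }
        }'
}
```

A typical invocation would be redis-cli INFO memory | check_memory_signals, run on a schedule. Empty output means none of the checked thresholds is crossed at that moment.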

Fixes

Immediate relief

If the instance is actively thrashing, reduce pressure without restarting.

  • Kill buffer-heavy clients. Identify slow consumers or forgotten MONITOR sessions via CLIENT LIST, then terminate them with CLIENT KILL <ip:port>. This is disruptive to the targeted client but can reclaim large amounts of memory instantly.
  • Purge allocator fragmentation. Run MEMORY PURGE to ask jemalloc to release dirty pages. This is safe and online, though the effect depends on allocator state.
  • Increase maxmemory. If the host has physical headroom, raise the limit with CONFIG SET maxmemory <bytes>. Persistent instances should leave at least 50% of physical RAM free for copy-on-write during fork. Cache-only instances should keep used_memory under 75% of maxmemory to leave headroom for bursts.
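
Before killing anything, it helps to see which clients actually hold large output buffers. The sketch below filters CLIENT LIST output by an omem threshold; big_output_buffers is a hypothetical helper name, and the 256 MB figure in the usage note is simply the warning level used elsewhere in this guide.

```shell
# big_output_buffers THRESHOLD_BYTES: hypothetical helper. Reads
# `redis-cli CLIENT LIST` text on stdin and prints "omem id addr" for every
# client whose output buffer is at or above the threshold, largest first.
big_output_buffers() {
    tr -d '\r' | awk -v limit="$1" '{
        id = addr = omem = ""
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")          # each field looks like key=value
            if (kv[1] == "id")   id   = kv[2]
            if (kv[1] == "addr") addr = kv[2]
            if (kv[1] == "omem") omem = kv[2]
        }
        if (omem != "" && omem + 0 >= limit + 0) print omem, id, addr
    }' | sort -rn
}
```

For example, redis-cli CLIENT LIST | big_output_buffers 268435456 lists clients holding 256 MB or more. Each reported client can then be terminated by address as described above, or by id with CLIENT KILL ID <id>.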

Dataset larger than allocation

If the working set legitimately exceeds memory, tuning will not help.

  • Shard the dataset. Split the keyspace across multiple Redis instances or enable Redis Cluster. This is the only fix that preserves the full dataset while ending eviction.
  • Reduce data volume. Shorten TTLs, drop unnecessary keys, compress values, or replace small discrete keys with hash fields to reduce per-key overhead.
  • Add memory. Vertical scaling works only if the host has free RAM and you account for RSS overhead and COW.

Fragmentation consuming headroom

When used_memory is below maxmemory but the OS is under pressure:

  • Run MEMORY PURGE.
  • Enable active defragmentation with CONFIG SET activedefrag yes. This adds CPU overhead but can reclaim significant RSS over time. Only available when Redis is built with jemalloc.

Policy mismatch

If maxmemory-policy is volatile-lru or volatile-random but most keys lack TTLs, Redis has no eligible keys to evict.

  • Switch to an allkeys-* policy if the dataset is a cache.
  • Alternatively, ensure the application sets TTLs on all keys. Without TTLs, volatile-* policies degrade to noeviction.
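
To judge whether a volatile-* policy has anything to work with, compare the expires count to the keys count in INFO keyspace. The following sketch does that per database; ttl_coverage is a hypothetical helper name, and what counts as "too low" is left to the reader.

```shell
# ttl_coverage: hypothetical helper. Reads `redis-cli INFO keyspace` text on
# stdin and prints, per database, the share of keys that carry a TTL.
ttl_coverage() {
    # Lines look like: db0:keys=1000,expires=10,avg_ttl=0
    # Splitting on ':', '=' and ',' puts keys in $3 and expires in $5.
    tr -d '\r' | awk -F'[:=,]' '/^db[0-9]+:/ {
        keys = $3; expires = $5
        pct = (keys > 0) ? 100 * expires / keys : 0
        printf "%s keys=%d expires=%d ttl_coverage=%.1f%%\n", $1, keys, expires, pct
    }'
}
```

Run it as redis-cli INFO keyspace | ttl_coverage next to CONFIG GET maxmemory-policy. A volatile-* policy combined with coverage near zero matches the mismatch described in this section.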

Prevention

  • Set maxmemory explicitly on every production instance. A value of 0 means no limit and eventual OOM kill.
  • Maintain headroom. Persistent instances should keep used_memory below 50% of physical RAM to survive COW during fork. Cache-only instances should keep used_memory below 75% of maxmemory.
  • Monitor rates, not absolutes. Alert on the derivative of evicted_keys, not the cumulative counter.
  • Set non-zero client-output-buffer-limit normal values. The default unlimited is dangerous.
  • Run periodic big key analysis. Use redis-cli --bigkeys or MEMORY USAGE sampling to catch single keys that monopolize memory.
  • Add TTL jitter to prevent mass expiry events that suddenly free and then rapidly refill memory.
  • Check CLIENT LIST during routine maintenance to catch MONITOR sessions or slow subscribers before they bloat buffers.
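
TTL jitter from the list above can be as simple as adding a small random offset to the base expiry. This is a minimal sketch in bash, assuming a symmetric spread expressed as a percentage; jittered_ttl is a hypothetical helper name, and in practice the same logic usually lives in the application code that writes to the cache.

```shell
# jittered_ttl BASE_SECONDS PERCENT: hypothetical helper (bash).
# Prints BASE_SECONDS shifted by a random offset within +/- PERCENT.
jittered_ttl() {
    local base=$1 pct=$2
    local spread=$(( base * pct / 100 ))
    # RANDOM is a bash built-in in the range 0..32767; two draws are combined
    # so that spreads larger than that range are still covered. This is fine
    # for spreading expiries and not suitable for anything security-related.
    local draw=$(( (RANDOM << 15) | RANDOM ))
    local offset=$(( draw % (2 * spread + 1) - spread ))
    echo $(( base + offset ))
}
```

With a base of 3600 and a percent of 10, the result falls between 3240 and 3960 seconds. A write could then look like redis-cli SET some:key some-value EX "$(jittered_ttl 3600 10)", where the key and value are placeholders.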

How Netdata helps

Netdata collects evicted_keys, keyspace_misses, and instantaneous_ops_per_sec from the same instance, making the correlation visible in one view. Alerting triggers on used_memory approaching maxmemory or on mem_fragmentation_ratio thresholds. Per-client connection metrics expose output buffer growth before it exhausts the heap. Slowlog and latency metrics help distinguish eviction overhead from expensive commands.