Redis low keyspace hit rate: cache effectiveness and cold-start recovery
A low keyspace hit rate turns Redis into a pass-through to the backend. The metric is keyspace_hits / (keyspace_hits + keyspace_misses), but interpreting it takes care. A freshly restarted instance reports a rate near 0% for the first minutes. Workloads heavy on EXISTS checks for absent keys register misses by design. A cache that evicts keys faster than they are reused degrades silently until backend load spikes. This guide shows how to distinguish real cache degradation from false alarms, and how to recover.
What this means
The counters in INFO stats are cumulative since the last restart or CONFIG RESETSTAT, so you must compute deltas to get a meaningful rate. For pure cache workloads, sustained rates below 90% are poor; below 80%, a large share of requests pay for a Redis round trip and then a backend call anyway, so the cache adds latency to those requests while relieving less backend load than intended. For primary stores, queues, or session stores, the metric is meaningless because misses are expected. After a restart the cache is cold and the hit rate starts near 0%, then normalizes as the working set repopulates. Do not page on low hit rate during the first 30 minutes after startup.
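The delta calculation can be sketched as a small shell function. The name hit_rate_delta and its argument order are illustrative, not part of Redis; the four counters are assumed to come from two INFO stats samples taken a minute or more apart.

```shell
# hit_rate_delta H1 M1 H2 M2
# Prints the hit rate (percent, one decimal) between two samples of
# keyspace_hits (H) and keyspace_misses (M). Hypothetical helper.
hit_rate_delta() {
  awk -v h1="$1" -v m1="$2" -v h2="$3" -v m2="$4" 'BEGIN {
    dh = h2 - h1; dm = m2 - m1
    # Counters went backwards: restart or CONFIG RESETSTAT between samples.
    if (dh < 0 || dm < 0) { print "reset"; exit }
    # No keyspace lookups in the window: the rate is undefined, not 0%.
    if (dh + dm == 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * dh / (dh + dm)
  }'
}

# Example: 900 new hits and 100 new misses in the window.
hit_rate_delta 1000 500 1900 600   # prints 90.0
```

Treating a counter reset and an idle window as separate outcomes keeps both from being reported as a 0% hit rate.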
flowchart TD
A[Hit rate drops] --> B{Uptime < 30 min?}
B -->|Yes| C[Cold cache: suppress alert]
B -->|No| D{evicted_keys increasing?}
D -->|Yes| E[Memory pressure: check used_memory vs maxmemory]
D -->|No| F{expired_keys spiking?}
F -->|Yes| G[TTL expiry wave: check avg_ttl and jitter]
F -->|No| H[Access-pattern shift: audit commandstats and key prefixes]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Cold cache after restart | uptime_in_seconds < 1800, miss rate tapering from 100% | uptime_in_seconds and loading state |
| Eviction wave | evicted_keys rate climbing alongside misses | used_memory vs maxmemory |
| TTL expiry wave | Sudden spike in expired_keys; avg_ttl very short | INFO keyspace for avg_ttl and expires ratio |
| Access-pattern shift | Hit rate drops while eviction and expiry stay flat | INFO commandstats for a changed command mix (for example EXISTS spikes); recent application changes for new key prefixes |
| Memory pressure spiral | evicted_keys, keyspace_misses, and command rate all rise together | CLIENT LIST for large omem; INFO memory for mem_fragmentation_ratio |
Quick checks
Run these read-only commands to triage without changing server state.
# Cumulative counters
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses"
# Rule out cold start
redis-cli INFO server | grep uptime_in_seconds
# Check memory pressure
redis-cli INFO memory | grep -E "used_memory:|maxmemory:"
# Check eviction rate
redis-cli INFO stats | grep evicted_keys
# Check keyspace TTL health
redis-cli INFO keyspace
# Check for slow commands blocking the event loop
redis-cli SLOWLOG GET 10
# Check client buffers consuming memory
redis-cli CLIENT LIST
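For scripting these checks, a minimal sketch of two parsing helpers follows. The names info_field and memory_ratio are hypothetical; both read INFO output on stdin. It assumes redis-cli passes through the CRLF line endings of the INFO reply when piped, so the carriage returns are stripped first.

```shell
# info_field NAME: read INFO output on stdin, print the value of one field.
info_field() {
  tr -d '\r' | awk -F: -v name="$1" '$1 == name { print $2; exit }'
}

# memory_ratio: read INFO memory on stdin, print used_memory / maxmemory.
memory_ratio() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"   # maxmemory 0 means no limit
      else printf "%.2f\n", used / max
    }'
}

# Usage against a live server (not run here):
#   redis-cli INFO server | info_field uptime_in_seconds
#   redis-cli INFO memory | memory_ratio
```

Matching the field name exactly avoids picking up neighbours such as used_memory_human or maxmemory_policy.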
How to diagnose it
- Confirm the workload is actually a cache. If Redis is used as a primary store, queue, or session store, keyspace_misses are normal and alerting on hit rate is noise. Stop and exclude the instance from this alert.
- Check uptime_in_seconds. Below 1800, treat it as a cold start. Correlate with INFO persistence: if loading:1, the instance is still restoring data and rejects most commands with a LOADING error. Wait for it to finish before assessing hit rate.
- Compute the rate from deltas. Raw cumulative counters are meaningless. Calculate delta(hits) / (delta(hits) + delta(misses)) over a 1-5 minute window. A single point-in-time sample of the cumulative ratio is not actionable.
- Look for eviction. If the hit rate is low and evicted_keys is increasing, you are likely in a memory pressure spiral. Check used_memory against maxmemory. If the ratio is above 90%, Redis is at or near the point where it deletes data to stay under the limit. Eviction of recently written keys that are immediately re-requested amplifies the miss rate.
- Inspect TTL health. In INFO keyspace, compare expires to keys. If the expires ratio is high and avg_ttl is very short, keys are vanishing before the workload can reuse them. Also check expired_keys for a sudden spike that indicates a mass expiry event.
- Audit the command mix. If eviction and expiry are flat, check INFO commandstats. A spike in cmdstat_exists inflates misses because EXISTS on an absent key counts as a miss. New key prefixes or a shifted read pattern can also drop the rate without any server-side fault.
- Check for buffer bloat. Run CLIENT LIST and scan for any client with large omem (output buffer memory). Slow consumers, forgotten MONITOR sessions, or large result sets can consume memory that should hold cached data, indirectly triggering eviction and lowering the hit rate.
- Check fragmentation. A mem_fragmentation_ratio sustained above 1.5 means the allocator (jemalloc by default) is holding memory pages it cannot release back to the OS. This overhead does not show up in used_memory, so it can push used_memory_rss toward the host or container memory limit and the OOM killer even when used_memory looks healthy against maxmemory.
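The first branches of this procedure can be sketched as a decision function. This is an illustration rather than a complete diagnostic: triage and its arguments are hypothetical, the deltas are per-window increases computed from two INFO stats samples, and the 10x expiry threshold is the warning level this guide uses for expired_keys.

```shell
# triage UPTIME_S EVICTED_DELTA EXPIRED_DELTA EXPIRED_BASELINE
# EXPIRED_BASELINE is your own normal per-window expired_keys increase.
triage() {
  uptime_s=$1; evicted=$2; expired=$3; baseline=$4
  if [ "$uptime_s" -lt 1800 ]; then
    echo "cold-cache"              # suppress the alert, let the cache warm up
  elif [ "$evicted" -gt 0 ]; then
    echo "memory-pressure"         # compare used_memory with maxmemory
  elif [ "$expired" -gt 0 ] && [ "$expired" -ge $((baseline * 10)) ]; then
    echo "ttl-expiry-wave"         # check avg_ttl and TTL jitter
  else
    echo "access-pattern-shift"    # audit commandstats and key prefixes
  fi
}

triage 7200 0 500 10   # prints ttl-expiry-wave
```

The order of the checks matters: a cold cache also shows a high miss rate, so uptime is ruled out before eviction or expiry is blamed.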
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| keyspace_hits / (hits + misses) rate | Direct cache effectiveness measure | Sustained < 90% for cache workloads |
| evicted_keys rate | Keys removed due to memory pressure | Any sustained increase above baseline |
| used_memory / maxmemory | Proximity to forced eviction | Ratio > 0.9 |
| avg_ttl (INFO keyspace) | Average survival time of expiring keys | Very short with high expires/keys ratio |
| expired_keys rate | Mass expiry can cause miss spikes | Sudden 10x spike over baseline |
| uptime_in_seconds | Gates cold-start false positives | Unexpected reset to near zero |
| mem_fragmentation_ratio | Wasted memory that could hold data | Sustained > 1.5 |
| instantaneous_ops_per_sec | Throughput proxy for retry storms | Rising while hit rate falls |
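The expires/keys ratio and avg_ttl signals can be pulled from INFO keyspace with a short parser. This is a sketch; keyspace_ttl_health is a hypothetical name. It assumes the usual dbN:keys=...,expires=...,avg_ttl=... line format, with avg_ttl reported in milliseconds.

```shell
# keyspace_ttl_health: read INFO keyspace on stdin; for each dbN line print
# "db expires_ratio avg_ttl_seconds".
keyspace_ttl_health() {
  tr -d '\r' | awk -F'[:,]' '/^db[0-9]+:/ {
    keys = 0; expires = 0; ttl = 0
    # Fields after the db name are key=value pairs; match them by name so
    # extra fields in newer versions are ignored.
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == "keys")    keys = kv[2]
      if (kv[1] == "expires") expires = kv[2]
      if (kv[1] == "avg_ttl") ttl = kv[2]
    }
    ratio = (keys + 0 > 0) ? expires / keys : 0
    printf "%s %.2f %.1f\n", $1, ratio, ttl / 1000
  }'
}

# Usage against a live server (not run here):
#   redis-cli INFO keyspace | keyspace_ttl_health
```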
Fixes
Cold cache after restart
Suppress hit rate alerts for at least 30 minutes after uptime_in_seconds resets. If the backend cannot handle the thundering herd of repopulation requests, warm the cache by pre-loading critical keys before the instance accepts traffic. Verify repl-backlog-size is at least 100MB to limit full resyncs that spike primary I/O and memory pressure.
Eviction wave and memory pressure spiral
If used_memory is at or near maxmemory, Redis is evicting keys that are immediately re-requested, which causes repopulation writes that evict more keys. Break the loop:
- Immediate: In CLIENT LIST, identify clients with large omem and kill stale connections with CLIENT KILL if safe. Run MEMORY PURGE to ask jemalloc to release dirty pages.
- Short-term: Increase maxmemory if the host has headroom. Persistent instances need roughly 50% RSS headroom to survive copy-on-write during fork(). If you cannot add memory, evaluate maxmemory-policy. For caches, allkeys-lru or allkeys-lfu are appropriate. volatile-lru is risky if the working set includes keys without a TTL.
- Long-term: Shard the dataset across instances, reduce payload sizes, or review client output buffer limits (client-output-buffer-limit). The default normal client limit is 0 0 0 (unlimited), which is dangerous.
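To find the clients worth inspecting in the immediate step, a filter over CLIENT LIST output might look like the following sketch. The function name clients_over_omem and the 1 MiB threshold in the usage line are assumptions; it relies only on the space-separated key=value layout of CLIENT LIST lines.

```shell
# clients_over_omem BYTES: read CLIENT LIST on stdin, print "id addr omem"
# for every client whose output buffer memory exceeds BYTES.
clients_over_omem() {
  tr -d '\r' | awk -v limit="$1" '{
    id = ""; addr = ""; omem = "0"
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")
      key = substr($i, 1, eq - 1); val = substr($i, eq + 1)
      if (key == "id") id = val
      else if (key == "addr") addr = val
      else if (key == "omem") omem = val
    }
    if (omem + 0 > limit + 0) print id, addr, omem
  }'
}

# Usage against a live server (not run here):
#   redis-cli CLIENT LIST | clients_over_omem 1048576
```

The printed id can then be passed to CLIENT KILL ID once you have confirmed the connection is safe to drop.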
TTL expiry wave
If expired_keys spiked and avg_ttl is short, add jitter to TTLs so they do not expire simultaneously. At the application layer, spread expiry with EXPIRE key (3600 + random(0, 300)) instead of a fixed window. Review whether the application is setting TTLs shorter than the re-request interval; a key that expires before it is read again provides no cache value.
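As an illustration of the jitter idea, the sketch below computes a TTL of base plus a random offset. jittered_ttl is a hypothetical helper, and the key name and payload in the usage line are placeholders. It assumes /dev/urandom and od are available. In practice the same arithmetic usually lives in the application's cache-write path.

```shell
# jittered_ttl BASE SPREAD: print BASE plus a random integer in [0, SPREAD].
jittered_ttl() {
  # Read 4 random bytes as one unsigned integer.
  rnd=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
  echo $(( $1 + rnd % ($2 + 1) ))
}

# Usage against a live server (not run here): a TTL between 3600 and 3900 s.
#   redis-cli SET cache:user:42 "$payload" EX "$(jittered_ttl 3600 300)"
```

A spread of 5-10% of the base TTL is a common starting point, but the right value depends on how tightly keys were written together in the first place.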
Access-pattern shift
When eviction and expiry are flat but hits dropped, compare current INFO commandstats against a known baseline. Look for new key prefixes, a sudden increase in EXISTS calls, or an application change that reads from a different namespace. If you use client-side caching (tracking_clients), remember that near-cache hits in the application process will not increment server-side keyspace_hits, which can make the server rate look lower than actual end-to-end cache effectiveness.
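Comparing against a baseline is easier with per-command call counts extracted from INFO commandstats. The helper below is a sketch; cmd_calls is a hypothetical name. It assumes the cmdstat_NAME:calls=N,usec=... line format. Because calls is cumulative, take two samples and subtract.

```shell
# cmd_calls NAME: read INFO commandstats on stdin, print the cumulative
# calls= value for one command (0 if the command has never been called).
cmd_calls() {
  tr -d '\r' | awk -F'[:=,]' -v want="cmdstat_$1" '
    $1 == want { print $3; found = 1; exit }
    END { if (!found) print 0 }'
}

# Usage against a live server (not run here):
#   before=$(redis-cli INFO commandstats | cmd_calls exists)
#   sleep 60
#   after=$(redis-cli INFO commandstats | cmd_calls exists)
#   echo "EXISTS calls in the last minute: $((after - before))"
```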
Prevention
- Set maxmemory on every production instance and alert on used_memory / maxmemory trending upward.
- Add TTL jitter to all cache entries to prevent mass expiry events.
- Monitor evicted_keys rate as a leading indicator. Sustained eviction predicts a hit rate drop before the backend feels it.
- Size repl-backlog-size to at least 100MB to avoid full resync cascades that spike primary load and memory pressure.
- Exclude non-cache workloads from hit rate alerting entirely.
- Enable and monitor the slowlog (slowlog-log-slower-than) so that slow repopulation queries do not worsen a spiral.
How Netdata helps
Netdata computes the hit rate from INFO stats deltas automatically and suppresses alerts during the post-restart cold-start window. It charts keyspace_misses against evicted_keys, used_memory, and instantaneous_ops_per_sec to expose memory pressure spirals in one view. It tracks uptime_in_seconds to gate false positives after restarts or failovers, and includes avg_ttl from INFO keyspace alongside hit rate to distinguish TTL waves from eviction.







