Redis low keyspace hit rate: cache effectiveness and cold-start recovery

A low keyspace hit rate turns Redis into a pass-through to the backend. The metric is keyspace_hits / (keyspace_hits + keyspace_misses), but the raw number is easy to misread. A restarted instance shows a hit rate near 0% for minutes. Workloads heavy on EXISTS checks for absent keys register misses by design. A cache that evicts keys faster than they are re-read degrades silently until backend load spikes. This guide shows how to distinguish real cache degradation from false alarms, and how to recover.

What this means

The keyspace_hits and keyspace_misses counters in INFO stats are cumulative since the last restart or CONFIG RESETSTAT, so you must compute deltas between samples to get a meaningful rate. For pure cache workloads, sustained rates below 90% are poor; below 80% the cache adds a round trip to a large share of requests and may relieve little backend load. For primary stores, queues, or session stores, the metric is meaningless because misses are expected. After a restart the cache is cold and the hit rate starts near 0%, then normalizes as the working set repopulates. Do not page on a low hit rate during the first 30 minutes after startup.
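
A minimal sketch of the delta calculation, assuming you already sample keyspace_hits and keyspace_misses at two points in time. The function name and the choice to return None for unusable windows are illustrative, not part of Redis:

```python
from typing import Optional


def windowed_hit_rate(prev_hits: int, prev_misses: int,
                      curr_hits: int, curr_misses: int) -> Optional[float]:
    """Hit rate over the interval between two INFO stats samples.

    Returns None when the window is unusable: the counters went backwards
    (restart or CONFIG RESETSTAT) or no lookups happened in the window.
    """
    d_hits = curr_hits - prev_hits
    d_misses = curr_misses - prev_misses
    if d_hits < 0 or d_misses < 0:
        return None  # counters were reset between the two samples
    lookups = d_hits + d_misses
    if lookups == 0:
        return None  # idle window: the rate is undefined
    return d_hits / lookups


# The cumulative ratio here is still about 90%, but the last window is poor.
print(windowed_hit_rate(9_000_000, 1_000_000, 9_000_600, 1_000_400))  # 0.6
```

The example shows why the cumulative ratio is misleading: millions of old hits mask a window in which only 600 of 1,000 lookups were served from the cache.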

flowchart TD
    A[Hit rate drops] --> B{Uptime < 30 min?}
    B -->|Yes| C[Cold cache: suppress alert]
    B -->|No| D{evicted_keys increasing?}
    D -->|Yes| E[Memory pressure: check used_memory vs maxmemory]
    D -->|No| F{expired_keys spiking?}
    F -->|Yes| G[TTL expiry wave: check avg_ttl and jitter]
    F -->|No| H[Access-pattern shift: audit commandstats and key prefixes]
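
The decision flow above could be encoded roughly as follows. This is a sketch under assumptions: the inputs are per-window deltas you compute yourself, expired_baseline is a hypothetical name for your normal per-window expiry count, and the 10x spike factor mirrors the warning sign used later in this guide rather than any Redis default:

```python
def triage(uptime_seconds: int, evicted_delta: int, expired_delta: int,
           expired_baseline: int, cold_start_window: int = 1800,
           spike_factor: int = 10) -> str:
    """Classify a hit rate drop using the same order as the flowchart."""
    if uptime_seconds < cold_start_window:
        return "cold cache: suppress alert"
    if evicted_delta > 0:
        return "memory pressure: check used_memory vs maxmemory"
    # Treat a baseline of 0 as 1 so a quiet baseline does not flag every expiry.
    if expired_delta > spike_factor * max(expired_baseline, 1):
        return "TTL expiry wave: check avg_ttl and jitter"
    return "access-pattern shift: audit commandstats and key prefixes"


print(triage(uptime_seconds=7200, evicted_delta=0,
             expired_delta=500, expired_baseline=10))
```

The order of the checks matters: eviction is tested before expiry because a memory pressure spiral can also raise expired_keys as a side effect of churn.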

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Cold cache after restart | uptime_in_seconds < 1800, miss rate tapering from 100% | uptime_in_seconds and loading state |
| Eviction wave | evicted_keys rate climbing alongside misses | used_memory vs maxmemory |
| TTL expiry wave | Sudden spike in expired_keys; avg_ttl very short | INFO keyspace for avg_ttl and expires ratio |
| Access-pattern shift | Hit rate drops while eviction and expiry stay flat | INFO commandstats for new prefixes or EXISTS spikes |
| Memory pressure spiral | evicted_keys, keyspace_misses, and command rate all rise together | CLIENT LIST for large omem and mem_fragmentation_ratio |

Quick checks

Run these read-only commands to triage without changing server state.

# Cumulative counters
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses"

# Rule out cold start
redis-cli INFO server | grep uptime_in_seconds

# Check memory pressure
redis-cli INFO memory | grep -E "used_memory:|maxmemory:"

# Check eviction rate
redis-cli INFO stats | grep evicted_keys

# Check keyspace TTL health
redis-cli INFO keyspace

# Check for slow commands blocking the event loop
redis-cli SLOWLOG GET 10

# Check client buffers consuming memory
redis-cli CLIENT LIST
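
If you script these checks, the INFO output is plain key:value text that is easy to parse. The following is a minimal sketch, assuming you capture the text from redis-cli yourself; parse_info and memory_ratio are illustrative names, and the sample values are made up:

```python
def parse_info(text: str) -> dict:
    """Parse `redis-cli INFO` output (key:value lines) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def memory_ratio(fields: dict):
    """used_memory / maxmemory, or None when maxmemory is 0 (no limit set)."""
    maxmemory = int(fields["maxmemory"])
    if maxmemory == 0:
        return None
    return int(fields["used_memory"]) / maxmemory


# Sample text in the shape INFO memory returns (values are illustrative).
sample = "# Memory\r\nused_memory:943718400\r\nmaxmemory:1073741824\r\n"
ratio = memory_ratio(parse_info(sample))
print(f"{ratio:.2f}")  # 0.88
```

A maxmemory of 0 means no limit is configured, so the ratio is undefined and the sketch returns None instead of dividing by zero.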

How to diagnose it

  1. Confirm the workload is actually a cache. If Redis is used as a primary store, queue, or session store, keyspace_misses are normal. Alerting on hit rate here is noise. Stop and exclude the instance from this alert.
  2. Check uptime_in_seconds. Below 1800, treat as cold start. Correlate with INFO persistence: if loading:1, the instance is still restoring data and commands are rejected with a LOADING error. Wait for it to finish before assessing hit rate.
  3. Compute the rate from deltas. Raw cumulative counters are meaningless. Calculate delta(hits) / (delta(hits) + delta(misses)) over a 1-5 minute window. A single point-in-time sample of the cumulative ratio is not actionable.
  4. Look for eviction. If the hit rate is low and evicted_keys is increasing, you are likely in a memory pressure spiral. Check used_memory against maxmemory. Eviction starts once used_memory reaches maxmemory, so a ratio above 90% means the instance has little headroom left before it deletes data to survive. Eviction of recently written keys that are immediately re-requested amplifies the miss rate.
  5. Inspect TTL health. In INFO keyspace, compare expires to keys. If the expires ratio is high and avg_ttl (reported in milliseconds) is very short, keys are vanishing before the workload can reuse them. Also check expired_keys for a sudden spike that indicates a mass expiry event.
  6. Audit the command mix. If eviction and expiry are flat, check INFO commandstats. A spike in cmdstat_exists inflates misses because EXISTS on an absent key counts as a miss. New key prefixes or a shifted read pattern can also drop the rate without any server-side fault.
  7. Check for buffer bloat. Run CLIENT LIST and scan for any client with large omem (output buffer memory). Slow consumers, forgotten MONITOR sessions, or large result sets can consume memory that should hold cached data, indirectly triggering eviction and lowering the hit rate.
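  8. Check fragmentation. A mem_fragmentation_ratio sustained above 1.5 means used_memory_rss is more than 50% larger than used_memory, typically because jemalloc is holding partially used pages it cannot release back to the OS. This overhead does not count toward maxmemory, which is enforced against used_memory, but it can push used_memory_rss toward the host or container limit and the OOM killer even when used_memory looks healthy. In practice it forces a lower maxmemory, leaving less room for the working set.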
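
For the TTL health check in step 5, a small parser for INFO keyspace lines can make the comparison explicit. This is a sketch with illustrative function names and made-up sample values; it assumes the usual db0:keys=...,expires=...,avg_ttl=... line shape and that avg_ttl is in milliseconds:

```python
def parse_keyspace_line(line: str):
    """Parse a line such as 'db0:keys=1000,expires=950,avg_ttl=4200'."""
    db, _, rest = line.strip().partition(":")
    stats = {}
    for pair in rest.split(","):
        name, _, value = pair.partition("=")
        if name and value:
            stats[name] = int(value)
    return db, stats


def ttl_health(stats: dict):
    """Return (share of keys with a TTL, average TTL in seconds)."""
    keys = stats.get("keys", 0)
    expires_ratio = stats.get("expires", 0) / keys if keys else 0.0
    avg_ttl_seconds = stats.get("avg_ttl", 0) / 1000  # avg_ttl is in ms
    return expires_ratio, avg_ttl_seconds


db, stats = parse_keyspace_line("db0:keys=1000,expires=950,avg_ttl=4200")
print(db, ttl_health(stats))  # db0 (0.95, 4.2)
```

In this made-up sample, 95% of keys carry a TTL and the average remaining lifetime is about four seconds, which is the pattern step 5 warns about if the workload re-reads keys less often than that.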

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| keyspace_hits / (hits + misses) rate | Direct cache effectiveness measure | Sustained < 90% for cache workloads |
| evicted_keys rate | Keys removed due to memory pressure | Any sustained increase above baseline |
| used_memory / maxmemory | Proximity to forced eviction | Ratio > 0.9 |
| avg_ttl (INFO keyspace) | Average survival time of expiring keys | Very short with high expires/keys ratio |
| expired_keys rate | Mass expiry can cause miss spikes | Sudden 10x spike over baseline |
| uptime_in_seconds | Gates cold-start false positives | Unexpected reset to near zero |
| mem_fragmentation_ratio | Wasted memory that could hold data | Sustained > 1.5 |
| instantaneous_ops_per_sec | Throughput proxy for retry storms | Rising while hit rate falls |

Fixes

Cold cache after restart

Suppress hit rate alerts for at least 30 minutes after uptime_in_seconds resets. If the backend cannot handle the thundering herd of repopulation requests, warm the cache by pre-loading critical keys before the instance accepts traffic. Verify repl-backlog-size is at least 100MB (the default is 1MB) to limit full resyncs that spike primary I/O and memory pressure.

Eviction wave and memory pressure spiral

If used_memory is at or near maxmemory, Redis is evicting keys that are immediately re-requested, which causes repopulation writes that evict more keys. Break the loop:

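  • Immediate: In CLIENT LIST, identify clients with large omem and kill stale connections with CLIENT KILL if safe. Run MEMORY PURGE to ask jemalloc to release dirty pages; this lowers RSS rather than used_memory, so it relieves host memory pressure but does not stop eviction by itself.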
  • Short-term: Increase maxmemory if the host has headroom. Persistent instances need roughly 50% RSS headroom to survive copy-on-write during fork(). If you cannot add memory, evaluate maxmemory-policy. For caches, allkeys-lru or allkeys-lfu are appropriate. volatile-lru is risky if the working set includes keys without a TTL.
  • Long-term: Shard the dataset across instances, reduce payload sizes, or review client output buffer limits (client-output-buffer-limit). The default normal client limit is 0 0 0 (unlimited), which is dangerous.
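
To find the clients worth inspecting, the CLIENT LIST output can be filtered by omem. The sketch below assumes the standard space-separated key=value line format; the function name, the 64 MiB threshold, and the abbreviated sample lines are illustrative assumptions:

```python
def clients_over_omem(client_list: str, threshold_bytes: int) -> list:
    """Return (id, addr, omem) for clients above the threshold, largest first."""
    flagged = []
    for line in client_list.splitlines():
        # Each line is a run of key=value tokens separated by spaces.
        fields = dict(
            token.split("=", 1) for token in line.split() if "=" in token
        )
        if "omem" not in fields:
            continue
        omem = int(fields["omem"])
        if omem > threshold_bytes:
            flagged.append((fields.get("id", ""), fields.get("addr", ""), omem))
    return sorted(flagged, key=lambda client: client[2], reverse=True)


# Abbreviated, made-up CLIENT LIST lines; real output has more fields.
sample = (
    "id=7 addr=10.0.0.5:51000 name= age=12 idle=0 flags=N omem=0 cmd=get\n"
    "id=9 addr=10.0.0.8:51234 name=debug age=86400 idle=3 flags=O "
    "omem=268435456 cmd=monitor\n"
)
print(clients_over_omem(sample, 64 * 1024 * 1024))
```

The function only reports candidates. Whether a connection is safe to kill is still a judgment call that depends on what the client is doing.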

TTL expiry wave

If expired_keys spiked and avg_ttl is short, add jitter to TTLs so they do not expire simultaneously. At the application layer, spread expiry with EXPIRE key (3600 + random(0, 300)) instead of a fixed window. Review whether the application is setting TTLs shorter than the re-request interval; a key that expires before it is read again provides no cache value.
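
A minimal sketch of the jitter calculation, using the 3600-second base and 300-second spread from the example above. The function name and the optional rng parameter are illustrative:

```python
import random


def jittered_ttl(base_seconds: int, max_jitter_seconds: int, rng=random) -> int:
    """Base TTL plus a uniform random offset in [0, max_jitter_seconds]."""
    return base_seconds + rng.randint(0, max_jitter_seconds)


# Keys written in the same burst now expire across a five-minute spread.
ttl = jittered_ttl(3600, 300)
print(ttl)  # some value between 3600 and 3900
```

The result is the value you would pass as the TTL when writing the key, for example as the seconds argument to EXPIRE. Passing a seeded random.Random instance as rng makes the spread reproducible in tests.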

Access-pattern shift

When eviction and expiry are flat but hits dropped, compare current INFO commandstats against a known baseline. Look for new key prefixes, a sudden increase in EXISTS calls, or an application change that reads from a different namespace. If you use client-side caching (tracking_clients), remember that near-cache hits in the application process will not increment server-side keyspace_hits, which can make the server rate look lower than actual end-to-end cache effectiveness.
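
One way to compare the command mix against a baseline is to turn two INFO commandstats samples into per-command call shares for the window between them. This is a sketch with illustrative function names and made-up sample values; it relies only on the cmdstat_<name>:calls=... line shape:

```python
def parse_calls(commandstats: str) -> dict:
    """Map command name -> cumulative calls from INFO commandstats text."""
    calls = {}
    for line in commandstats.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue
        name, _, rest = line[len("cmdstat_"):].partition(":")
        for pair in rest.split(","):
            key, _, value = pair.partition("=")
            if key == "calls":
                calls[name] = int(value)
    return calls


def call_share(prev: dict, curr: dict) -> dict:
    """Share of calls per command in the window between two samples."""
    deltas = {cmd: n - prev.get(cmd, 0) for cmd, n in curr.items()}
    deltas = {cmd: d for cmd, d in deltas.items() if d > 0}
    total = sum(deltas.values())
    return {cmd: d / total for cmd, d in deltas.items()} if total else {}


prev = parse_calls("cmdstat_get:calls=1000,usec=2000\ncmdstat_exists:calls=100,usec=150")
curr = parse_calls("cmdstat_get:calls=1600,usec=3100\ncmdstat_exists:calls=500,usec=800")
print(call_share(prev, curr))  # {'get': 0.6, 'exists': 0.4}
```

In this sample EXISTS accounts for 40% of calls in the window, up from roughly 9% of the cumulative total, which is the kind of shift that lowers the hit rate without any server-side fault.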

Prevention

  • Set maxmemory on every production instance and alert on used_memory / maxmemory trending upward.
  • Add TTL jitter to all cache entries to prevent mass expiry events.
  • Monitor evicted_keys rate as a leading indicator. Sustained eviction predicts a hit rate drop before the backend feels it.
  • Size repl-backlog-size to at least 100MB to avoid full resync cascades that spike primary load and memory pressure.
  • Exclude non-cache workloads from hit rate alerting entirely.
  • Keep the slowlog enabled (slowlog-log-slower-than) and monitor it so that slow commands during repopulation are caught before they worsen a spiral.

How Netdata helps

Netdata computes the hit rate from INFO stats deltas automatically and suppresses alerts during the post-restart cold-start window. It charts keyspace_misses against evicted_keys, used_memory, and instantaneous_ops_per_sec to expose memory pressure spirals in one view. It tracks uptime_in_seconds to gate false positives after restarts or failovers, and includes avg_ttl from INFO keyspace alongside hit rate to distinguish TTL waves from eviction.