Redis mass key expiration spike: TTL jitter and the active expiry cycle
Your application latency graph just spiked. Redis instantaneous_ops_per_sec dropped. keyspace_misses jumped. INFO stats shows expired_keys climbing by thousands per second. The cause is likely not a traffic surge, but a wave of keys hitting TTL at the same moment. When millions of keys share an identical expiration time, Redis’s active expiry cycle cannot sample and delete them fast enough. The main thread spends increasing time on expiration cleanup, blocking client commands and triggering a cache stampede as expired keys return nil.
This article covers how the active expiry cycle works, why aligned TTLs overwhelm it, and how to stop the spike and prevent it from recurring.
What this means
Redis removes expired keys two ways. Lazy expiry deletes a key when a client accesses it after the TTL. Active expiry runs a sampling cycle in the main event loop to find and delete keys that have expired but have not been touched. By default, this cycle runs ten times per second at hz=10 and samples twenty keys per database per iteration.
The cycle is probabilistic. If a sample finds expired keys, it deletes them and immediately samples again, up to a CPU budget. When many keys expire simultaneously, the sampler finds expired keys on almost every draw. expired_stale_perc rises above 25%, telling Redis that expiration pressure is high. The cycle becomes more aggressive and can consume up to 25% of main-thread CPU by design. If the backlog is severe, Redis 6.0 and later increments expired_time_cap_reached_count, meaning the cycle stopped early because it hit its time cap. At that point, expired keys accumulate in memory even though they are logically dead, and client latency increases because the event loop is busy with cleanup instead of commands.
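To make the feedback loop concrete, here is a toy Python model of a single active-expiry pass. It is a simplified sketch, not Redis's actual implementation. The 20-key sample and the 25% repeat threshold come from the description above; the function name, the dict-based keyspace, and the `max_rounds` cap (a stand-in for the CPU time budget) are illustrative assumptions.

```python
import random

SAMPLE_SIZE = 20         # keys sampled per round (from the text above)
REPEAT_THRESHOLD = 0.25  # resample if more than 25% of the sample was expired


def active_expire_pass(expires, now, max_rounds=16, rng=random):
    """Toy model of one active-expiry pass over a {key: expire_at} dict.

    max_rounds is a hypothetical stand-in for the CPU time budget.
    Returns (deleted_count, hit_cap).
    """
    deleted = 0
    for _ in range(max_rounds):
        if not expires:
            return deleted, False
        sample = rng.sample(list(expires), min(SAMPLE_SIZE, len(expires)))
        expired = [k for k in sample if expires[k] <= now]
        for k in expired:
            del expires[k]
        deleted += len(expired)
        if len(expired) / len(sample) <= REPEAT_THRESHOLD:
            return deleted, False  # pressure is low, stop early
    return deleted, True  # budget exhausted while expired keys likely remain
```

In this model, 10,000 keys sharing one expiry timestamp produce a fully expired sample on every round, so the pass runs to its cap after deleting only 320 keys (16 rounds of 20) and leaves the rest of the backlog resident. When no sampled key has expired, the pass exits after the first round.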
flowchart TD
A[Bulk write with identical TTL] --> B[Keys expire simultaneously]
B --> C[Active expiry samples 20 keys per cycle]
C --> D{expired_stale_perc > 25%?}
D -->|Yes| E[Cycle repeats aggressively]
E --> F[Main thread CPU consumed by expiry]
F --> G[Client latency spikes]
B --> H[Cache stampede]
H --> I[Backend load surge]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Bulk cache load without jitter | expired_keys spikes exactly at hourly boundaries | Application SET EX or PEXPIRE using fixed TTL values |
| Scheduled job writing batched keys | Periodic latency spikes aligned with cron schedules | Correlation between job timestamps and expired_keys rate |
| Application framework defaults | All cache entries use the same default TTL | Client library configuration or wrapper code |
| Cold start after deployment | Cache warmup script populates keys with identical expiration | Deployment logs and warmup procedure |
Quick checks
Run these read-only commands to assess the current expiry pressure.
# Check expiry rate and pressure indicators
redis-cli INFO stats | grep -E "expired_keys|expired_stale_perc|expired_time_cap_reached_count"
# Check keyspace size and average TTL
redis-cli INFO keyspace
# Check memory footprint; if total memory stays high while key count drops, expired keys may be resident
redis-cli INFO memory | grep -E "used_memory:|used_memory_rss:"
# Check for expiry-related latency events (empty unless latency-monitor-threshold is set above 0)
redis-cli LATENCY LATEST
# Check if eviction is compounding the pressure
redis-cli INFO stats | grep evicted_keys
# Check cumulative CPU consumption (derive rate from consecutive samples)
redis-cli INFO cpu | grep -E "used_cpu_user|used_cpu_sys"
# Verify active expiry frequency
redis-cli CONFIG GET hz
How to diagnose it
- Confirm the expiration wave. Collect INFO stats twice, thirty seconds apart. Compute the expired_keys rate. A rate more than ten times your baseline confirms a mass expiration event.
- Check cycle pressure. Read expired_stale_perc. Values above 25% mean the active sampler is finding expired keys on most draws and running aggressively.
- Check for throttling. If expired_time_cap_reached_count is increasing, the cycle is hitting its time cap and cannot keep up.
- Correlate with latency. Run LATENCY LATEST and look for expire-cycle events above 25 ms or command latency spikes during the same window. Check latest_fork_usec to rule out a persistence fork as the real cause.
- Find the aligned TTL source. Review application logs for bulk SET EX, MSET, or EXPIRE commands issued around the time the keys were created. Look for cache warmup scripts, batch jobs, or framework code that sets a fixed TTL without jitter.
- Estimate cleanup backlog. Compare DBSIZE against used_memory. If the key count drops slowly but memory does not, expired keys are still resident and the cycle is behind.
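The first two steps can be sketched in Python as a small calculation over two saved snapshots of INFO stats output. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the baseline rate is something you supply from your own history, and the 10x and 25% thresholds are the ones given in the steps above.

```python
def parse_info(text):
    """Parse `redis-cli INFO` output into a dict of string values."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines and section headers such as "# Stats"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def expired_rate(before, after, interval_s):
    """Expired keys per second between two parsed INFO stats snapshots."""
    delta = int(after["expired_keys"]) - int(before["expired_keys"])
    return delta / interval_s


def is_mass_expiration(rate, baseline_rate, factor=10.0):
    """True when the rate exceeds `factor` times the baseline (10x per the text)."""
    return rate > baseline_rate * factor


def cycle_under_pressure(info, threshold=25.0):
    """True when expired_stale_perc is above the 25% level named in the text."""
    return float(info["expired_stale_perc"]) > threshold
```

For example, if expired_keys grows from 1000 to 151000 over thirty seconds, the rate is 5000 keys per second, which counts as a mass expiration event against a baseline of 100 per second but not against a baseline of 1000.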
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| expired_keys rate | Volume of keys being removed by active and lazy expiry | Sudden spike >10x baseline |
| expired_stale_perc | Percentage of sampled keys already expired | Sustained >25% |
| expired_time_cap_reached_count | Times the active cycle stopped early due to time limits | Any increase (Redis 6.0+) |
| used_memory | Expired keys not yet cleaned still consume RAM | Flat or rising while key count drops |
| CPU time growth (user+sys) | Expiry runs in the single event loop | Rate approaching one core saturation |
| keyspace_misses rate | Cache stampede as expired keys return nil | Spike correlating with expiry window |
| instantaneous_ops_per_sec | Overall throughput | Drop during the expiration spike |
Fixes
Add TTL jitter client-side
The most effective fix is to prevent alignment. Instead of setting every key with EX 3600, add random jitter so expirations spread across a window. A typical approach is TTL + RANDOM(0, 300) seconds, spreading a one-hour batch across five minutes. Apply this in application code, cache wrappers, or proxies. Even a small percentage of jitter prevents the thundering herd from hitting the exact same second.
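A minimal Python sketch of the TTL + RANDOM(0, 300) approach might look like the following. The function name and parameters are hypothetical, and the 300-second default simply mirrors the example window above.

```python
import random


def jittered_ttl(base_ttl_s, max_jitter_s=300, rng=random):
    """Return base_ttl_s plus a uniform random offset in [0, max_jitter_s] seconds."""
    if base_ttl_s <= 0 or max_jitter_s < 0:
        raise ValueError("base_ttl_s must be positive and max_jitter_s non-negative")
    return base_ttl_s + rng.randint(0, max_jitter_s)
```

With a client library such as redis-py, the result would be passed as the expiry argument, for example `client.set(key, value, ex=jittered_ttl(3600))`, where `client` is assumed to be an already connected client. Adding only a positive offset keeps every key alive for at least the base TTL, at the cost of some entries living up to five minutes longer.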
Break up bulk writes
If a scheduled job or warmup script writes thousands of keys at once, split it into smaller batches and stagger their TTLs. This keeps the expiration rate within the active cycle’s capacity.
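One way to sketch this in Python is a generator that slices the key list into fixed-size batches and gives each batch a slightly longer TTL than the previous one. The function name, the batch size, and the per-batch step are illustrative assumptions, not recommended values.

```python
def staggered_batches(keys, batch_size, base_ttl_s, step_s):
    """Yield (batch, ttl) pairs; each successive batch gets a TTL step_s longer."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for n, start in enumerate(range(0, len(keys), batch_size)):
        yield keys[start:start + batch_size], base_ttl_s + n * step_s
```

For 2500 keys with a batch size of 1000, a base TTL of 3600 seconds, and a 30-second step, this yields batches of 1000, 1000, and 500 keys with TTLs of 3600, 3630, and 3660 seconds. Keys inside one batch still share a TTL, so this bounds the size of each expiration burst rather than removing it; combining it with per-key jitter spreads the load further.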
Reduce unnecessary TTL alignment
Review framework defaults and library configurations that set uniform TTLs. A common mistake is using the same TTL for all session keys, rate-limit buckets, or cached query results. Vary TTLs based on data lifecycle or add a random offset.
Prevention
Enforce TTL jitter in your application standards. During load testing, monitor expired_stale_perc to verify that bulk cache population does not create an expiration cliff. After deployments with cache schema changes, watch expired_keys for the first few TTL cycles to catch misaligned expirations before they peak in production.
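As a complement to watching expired_stale_perc, a load test could also inspect the distribution of expiry times directly. The Python sketch below is one hypothetical way to do that: it assumes you have already collected absolute expiry timestamps (in seconds) for a sample of keys, and it reports what fraction of them fall into the single busiest second.

```python
from collections import Counter


def peak_expiry_share(expire_at_s):
    """Fraction of keys that expire within the single busiest second."""
    if not expire_at_s:
        return 0.0
    buckets = Counter(int(t) for t in expire_at_s)
    return max(buckets.values()) / len(expire_at_s)
```

A result near 1.0 means nearly every sampled key expires in the same second, which is the cliff described in this article. A result near 1/N for a sample spread over N seconds indicates the jitter is doing its job. What threshold to alert on is a judgment call that depends on keyspace size.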
How Netdata helps
- Netdata collects expired_keys, expired_stale_perc, and expired_time_cap_reached_count as real-time rates, making expiration waves visible without manual INFO polling.
- It correlates Redis expiry metrics with host CPU and application latency on the same dashboard, distinguishing an expiry spike from a fork event or slow command.
- Alerts on expired_stale_perc crossing 25% and abnormal expired_keys rate deviations surface the problem before latency becomes severe.
- Netdata tracks keyspace_misses alongside expiry signals, helping identify when mass expiration triggers a backend stampede.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis aof_last_write_status:err: AOF write failures and recovery
- Redis appendfsync always latency: durability vs throughput trade-offs
- Redis blocked_clients growing: dead consumers vs healthy queues
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit
- Redis cluster_slots_pfail > 0: impending node failure in a cluster
- Redis CLUSTERDOWN / cluster_state:fail: slot coverage and recovery
- Redis connected_clients climbing: connection leak detection
- Redis connected_slaves dropped: detecting replica disconnects on the primary
- Redis connection exhaustion: leaks, pools, and the retry storm







