Redis mass key expiration spike: TTL jitter and the active expiry cycle

Your application latency graph just spiked. Redis instantaneous_ops_per_sec dropped. keyspace_misses jumped. INFO stats shows expired_keys climbing by thousands per second. The cause is likely not a traffic surge, but a wave of keys hitting TTL at the same moment. When millions of keys share an identical expiration time, Redis’s active expiry cycle cannot sample and delete them fast enough. The main thread spends increasing time on expiration cleanup, blocking client commands and triggering a cache stampede as expired keys return nil.

This article covers how the active expiry cycle works, why aligned TTLs overwhelm it, and how to stop the spike and prevent it from recurring.

What this means

Redis removes expired keys in two ways. Lazy expiry deletes a key when a client accesses it after its TTL has passed. Active expiry runs a sampling cycle in the main event loop to find and delete keys that have expired but have not been touched since. By default, this cycle runs ten times per second (hz=10) and samples twenty keys with a TTL per database per iteration.

The cycle is probabilistic. If a sample finds expired keys, it deletes them and immediately samples again, up to a CPU budget. When many keys expire simultaneously, the sampler finds expired keys on almost every draw. expired_stale_perc rises above 25%, telling Redis that expiration pressure is high. The cycle becomes more aggressive and can consume up to 25% of main-thread CPU by design. If the backlog is severe, Redis 6.0 and later increments expired_time_cap_reached_count, meaning the cycle stopped early because it hit its time cap. At that point, expired keys accumulate in memory even though they are logically dead, and client latency increases because the event loop is busy with cleanup instead of commands.
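The sample-and-repeat behaviour described above can be sketched as a toy simulation. This is a simplified model for building intuition, not Redis's actual implementation: the function name, the `repeat_threshold` parameter, and the use of `max_rounds` as a stand-in for the CPU time budget are all assumptions made for illustration.

```python
import random


def simulate_expiry_cycle(total_keys, expired_fraction, sample_size=20,
                          repeat_threshold=0.25, max_rounds=16, seed=0):
    """Toy model of one active-expiry invocation on a single database.

    Returns (keys_deleted, rounds_used, hit_cap). max_rounds is a
    hypothetical stand-in for the real cycle's time budget.
    """
    rng = random.Random(seed)
    expired = int(total_keys * expired_fraction)
    live = total_keys - expired
    deleted = 0
    for round_no in range(1, max_rounds + 1):
        remaining = expired + live
        if remaining == 0:
            return deleted, round_no - 1, False
        n = min(sample_size, remaining)
        # Draw n distinct keys; indexes below `expired` represent expired keys.
        found = sum(1 for i in rng.sample(range(remaining), n) if i < expired)
        expired -= found
        deleted += found
        # Stop once the sample looks mostly live; otherwise sample again.
        if found / n <= repeat_threshold:
            return deleted, round_no, False
    # Budget exhausted while the sample was still mostly expired.
    return deleted, max_rounds, True
```

With no expired keys the sketch exits after a single round. When every key is expired, each round finds a full sample of expired keys and the loop stops only when it exhausts its budget, which is the situation the time-cap counter reflects.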

flowchart TD
  A[Bulk write with identical TTL] --> B[Keys expire simultaneously]
  B --> C[Active expiry samples 20 keys per cycle]
  C --> D{expired_stale_perc > 25%?}
  D -->|Yes| E[Cycle repeats aggressively]
  E --> F[Main thread CPU consumed by expiry]
  F --> G[Client latency spikes]
  B --> H[Cache stampede]
  H --> I[Backend load surge]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Bulk cache load without jitter | expired_keys spikes exactly at hourly boundaries | Application SET EX or PEXPIRE using fixed TTL values |
| Scheduled job writing batched keys | Periodic latency spikes aligned with cron schedules | Correlation between job timestamps and expired_keys rate |
| Application framework defaults | All cache entries use the same default TTL | Client library configuration or wrapper code |
| Cold start after deployment | Cache warmup script populates keys with identical expiration | Deployment logs and warmup procedure |

Quick checks

Run these read-only commands to assess the current expiry pressure.

# Check expiry rate and pressure indicators
redis-cli INFO stats | grep -E "expired_keys|expired_stale_perc|expired_time_cap_reached_count"

# Check keyspace size and average TTL
redis-cli INFO keyspace

# Check memory footprint; if total memory stays high while key count drops, expired keys may be resident
redis-cli INFO memory | grep -E "used_memory:|used_memory_rss:"

# Check for expiry-related latency events
redis-cli LATENCY LATEST

# Check if eviction is compounding the pressure
redis-cli INFO stats | grep evicted_keys

# Check cumulative CPU consumption (derive rate from consecutive samples)
redis-cli INFO cpu | grep -E "used_cpu_user|used_cpu_sys"

# Verify active expiry frequency
redis-cli CONFIG GET hz
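
Because used_cpu_user and used_cpu_sys are cumulative counters, a single reading says little on its own. A minimal sketch of deriving a rate from two consecutive samples follows; the function name and the tuple layout are assumptions for illustration, and the numbers in the usage comment are made up.

```python
def cpu_core_fraction(first, second):
    """Return the average fraction of one core used between two samples.

    Each sample is (unix_time_seconds, used_cpu_user, used_cpu_sys),
    with the CPU values taken from INFO cpu at that moment.
    """
    t0, user0, sys0 = first
    t1, user1, sys1 = second
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    return ((user1 - user0) + (sys1 - sys0)) / elapsed


# Hypothetical readings taken thirty seconds apart.
fraction = cpu_core_fraction((0, 100.0, 50.0), (30, 112.0, 56.0))
```

A result close to 1.0 means the process is using roughly one full core over the interval.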

How to diagnose it

  1. Confirm the expiration wave. Collect INFO stats twice, thirty seconds apart. Compute the expired_keys rate. A rate more than ten times your baseline confirms a mass expiration event.
  2. Check cycle pressure. Read expired_stale_perc. Values above 25% mean the active sampler is finding expired keys on most draws and running aggressively.
  3. Check for throttling. If expired_time_cap_reached_count is increasing, the cycle is hitting its time cap and cannot keep up.
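  4. Correlate with latency. Run LATENCY LATEST and look for expire-cycle events above 25 ms or command latency spikes during the same window. LATENCY LATEST reports events only when latency-monitor-threshold is set to a nonzero value, so an empty reply may simply mean monitoring is off. Check latest_fork_usec in INFO stats to rule out a persistence fork as the real cause.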
  5. Find the aligned TTL source. Review application logs for bulk SET EX, MSET followed by EXPIRE (MSET itself cannot set a TTL), or standalone EXPIRE commands issued around the time the keys were created. Look for cache warmup scripts, batch jobs, or framework code that sets a fixed TTL without jitter.
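  6. Estimate cleanup backlog. Compare DBSIZE against used_memory over time. DBSIZE still counts expired keys that have not been removed yet, so if the key count and memory both fall only slowly after the TTL deadline has passed, expired keys are still resident and the cycle is behind.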
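
Steps 1 to 3 above can be sketched as a small helper that compares two INFO stats captures. This is an illustrative sketch, not a monitoring tool: the function names, the returned keys, and the sample values in the usage lines are assumptions, and the thresholds are the ones quoted in the steps.

```python
def parse_info(text):
    """Parse 'name:value' lines from INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def assess_expiry_wave(info_before, info_after, seconds_apart, baseline_rate):
    """Apply the rate, pressure, and time-cap checks to two INFO stats captures."""
    before = parse_info(info_before)
    after = parse_info(info_after)
    rate = (int(after["expired_keys"]) - int(before["expired_keys"])) / seconds_apart
    cap_before = int(before["expired_time_cap_reached_count"])
    cap_after = int(after["expired_time_cap_reached_count"])
    return {
        "expired_keys_per_sec": rate,
        "mass_expiration": rate > 10 * baseline_rate,
        "cycle_under_pressure": float(after["expired_stale_perc"]) > 25.0,
        "cycle_hitting_time_cap": cap_after > cap_before,
    }


# Hypothetical captures taken thirty seconds apart.
before = "# Stats\nexpired_keys:1000\nexpired_stale_perc:3.10\nexpired_time_cap_reached_count:4\n"
after = "# Stats\nexpired_keys:151000\nexpired_stale_perc:41.70\nexpired_time_cap_reached_count:9\n"
report = assess_expiry_wave(before, after, seconds_apart=30, baseline_rate=50)
```

The baseline rate has to come from your own history for the instance; the helper only encodes the comparison.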

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| expired_keys rate | Volume of keys being removed by active and lazy expiry | Sudden spike >10x baseline |
| expired_stale_perc | Percentage of sampled keys already expired | Sustained >25% |
| expired_time_cap_reached_count | Times the active cycle stopped early due to time limits | Any increase (Redis 6.0+) |
| used_memory | Expired keys not yet cleaned still consume RAM | Flat or rising while key count drops |
| CPU time growth (user+sys) | Expiry runs in the single event loop | Rate approaching one core saturation |
| keyspace_misses rate | Cache stampede as expired keys return nil | Spike correlating with expiry window |
| instantaneous_ops_per_sec | Overall throughput | Drop during the expiration spike |

Fixes

Add TTL jitter client-side

The most effective fix is to prevent alignment. Instead of setting every key with EX 3600, add random jitter so expirations spread across a window. A typical approach is TTL + RANDOM(0, 300) seconds, spreading a one-hour batch across five minutes. Apply this in application code, cache wrappers, or proxies. Even a small percentage of jitter prevents the thundering herd from hitting the exact same second.
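
A minimal sketch of the TTL + RANDOM(0, 300) approach follows. The helper name and its parameters are illustrative assumptions; the 300-second window is the example figure from the paragraph above and should be tuned to your workload.

```python
import random


def jittered_ttl(base_ttl_s, max_jitter_s=300, rng=random):
    """Return base_ttl_s plus a uniform random offset in [0, max_jitter_s] seconds."""
    if base_ttl_s <= 0 or max_jitter_s < 0:
        raise ValueError("base_ttl_s must be positive and max_jitter_s non-negative")
    return base_ttl_s + rng.randint(0, max_jitter_s)


# Usage, assuming a redis-py client named r (not created here):
#   r.set("cache:user:42", payload, ex=jittered_ttl(3600))
```

Passing a seeded `random.Random` instance as `rng` makes the offsets reproducible in tests, while the default uses the module-level generator.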

Break up bulk writes

If a scheduled job or warmup script writes thousands of keys at once, split it into smaller batches and stagger their TTLs. This keeps the expiration rate within the active cycle’s capacity.
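
One way to sketch this is a generator that hands out fixed-size batches, each with a TTL shifted by a fixed step. The function name, the `step_s` parameter, and the values in the usage comment are assumptions for illustration; pacing the writes themselves is left to the caller.

```python
from itertools import islice


def staggered_batches(items, batch_size, base_ttl_s, step_s):
    """Yield (batch, ttl_seconds) pairs; each batch's TTL is step_s later than the last."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    index = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch, base_ttl_s + index * step_s
        index += 1


# Usage sketch: write each batch with its own TTL.
#   for batch, ttl in staggered_batches(keys, 500, 3600, 30):
#       ...  # issue the writes for this batch with ex=ttl
```

A per-batch step still leaves every key inside one batch aligned, so it works best combined with per-key random jitter.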

Reduce unnecessary TTL alignment

Review framework defaults and library configurations that set uniform TTLs. A common mistake is using the same TTL for all session keys, rate-limit buckets, or cached query results. Vary TTLs based on data lifecycle or add a random offset.

Prevention

Enforce TTL jitter in your application standards. During load testing, monitor expired_stale_perc to verify that bulk cache population does not create an expiration cliff. After deployments with cache schema changes, watch expired_keys for the first few TTL cycles to catch misaligned expirations before they peak in production.
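
As one possible load-test check for an expiration cliff, the sketch below reports which second holds the largest share of expirations. It assumes you have already collected absolute expiry times (in seconds) for a sample of keys by whatever means your test harness provides; the function name and the idea of a share threshold are illustrative assumptions, not a Redis feature.

```python
from collections import Counter


def worst_expiry_second(expire_at_s):
    """Return (second, share) for the second in which the most sampled keys expire."""
    times = list(expire_at_s)
    if not times:
        raise ValueError("need at least one expiry time")
    counts = Counter(int(t) for t in times)
    second, hits = counts.most_common(1)[0]
    return second, hits / len(times)


# Hypothetical sample: three of four keys expire in the same second.
second, share = worst_expiry_second([100, 100, 100, 101])
```

A share close to 1.0 means nearly all sampled keys expire in the same second, which is the pattern jitter is meant to remove.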

How Netdata helps

  • Netdata collects expired_keys, expired_stale_perc, and expired_time_cap_reached_count as real-time rates, making expiration waves visible without manual INFO polling.
  • It correlates Redis expiry metrics with host CPU and application latency on the same dashboard, distinguishing an expiry spike from a fork event or slow command.
  • Alerts on expired_stale_perc crossing 25% and abnormal expired_keys rate deviations surface the problem before latency becomes severe.
  • Netdata tracks keyspace_misses alongside expiry signals, helping identify when mass expiration triggers a backend stampede.