Redis latency spikes: diagnosis with the LATENCY subsystem

Your application P99 latency just jumped and clients are timing out. The infrastructure dashboard shows the Redis host is up, memory is not exhausted, and ops per second look normal, yet something is blocking the main thread. It could be a single KEYS * freezing the event loop, a fork() duplicating page tables for an RDB save, an AOF fsync stalling on a saturated disk, or the active expire cycle burning CPU. Standard INFO counters will not tell you which one.

Redis includes a LATENCY subsystem that samples internal operations by event type and records every spike above a configurable threshold in a per-event, in-memory time series. LATENCY LATEST and LATENCY HISTORY show which subsystem is responsible when spikes occur. Latency monitoring is disabled by default: latency-monitor-threshold is 0, so unless you have enabled it, Redis is not recording anything.

This guide shows how to enable the subsystem, interpret its event categories, and correlate its output with slowlog, fork metrics, and disk I/O signals.

What this means

Redis executes commands on a single main thread, so any delay delays every client. The LATENCY subsystem captures delays that exceed a configurable threshold and buckets them into categories:

  • command: Time taken to execute a command inside the event loop. A sustained command spike almost always means one or more slow operations are blocking the thread.
  • fork: Time spent in fork() for RDB snapshots, AOF rewrites, or replication full resyncs. Fork freezes the main thread while the OS duplicates page tables.
  • aof-fsync-always: Latency of fsync() when appendfsync always is configured. This is disk-bound and blocks until the write reaches physical storage.
  • expire-cycle: Time spent in the active expiration sampling loop. Mass expiry events or insufficient CPU headroom show up here.
  • eviction-cycle and eviction-del: Time spent evicting keys when maxmemory is reached.

Each category maps to a specific failure mode, so the LATENCY output turns a generic “Redis is slow” incident into a targeted diagnosis.

flowchart TD
    A[Latency spike detected] --> B[Run LATENCY LATEST]
    B --> C{Which event is spiking?}
    C -->|command| D[SLOWLOG GET + CLIENT LIST]
    C -->|fork| E[Check latest_fork_usec + THP]
    C -->|aof-fsync-always| F[Check disk I/O + aof_delayed_fsync]
    C -->|expire-cycle| G[Check expired_keys rate + TTL jitter]
    C -->|eviction-cycle| H[Check evicted_keys + used_memory vs maxmemory]
    D --> I[Fix application code]
    E --> J[Disable THP or shard]
    F --> K[Tune fsync or upgrade disk]
    G --> L[Spread TTLs or reduce key churn]
    H --> M[Add memory or shard]
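
The decision tree above can also be sketched as a small shell helper for a runbook. This is only an illustration: the function name next_check is hypothetical, and the suggested commands are the ones this guide already lists for each event.

```shell
# Map a LATENCY event name to the first follow-up check, mirroring the flowchart.
# next_check is a hypothetical helper name, not part of Redis or redis-cli.
next_check() {
  case "$1" in
    command)
      echo "redis-cli SLOWLOG GET 50; redis-cli CLIENT LIST" ;;
    fork)
      echo "redis-cli INFO stats | grep latest_fork_usec; cat /sys/kernel/mm/transparent_hugepage/enabled" ;;
    aof-fsync-always)
      echo "redis-cli INFO persistence | grep aof_delayed_fsync" ;;
    expire-cycle)
      echo "redis-cli INFO stats | grep expired_keys" ;;
    eviction-cycle|eviction-del)
      echo "redis-cli INFO stats | grep evicted_keys; redis-cli INFO memory" ;;
    *)
      # Unknown or unlisted event: fall back to the built-in summary.
      echo "redis-cli LATENCY DOCTOR" ;;
  esac
}
```

Calling next_check fork prints the commands to run next, so the helper can be pasted into an incident checklist without touching the instance itself.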

Common causes

| Cause | What it looks like in LATENCY | First thing to check |
|---|---|---|
| Slow command blocking the event loop | command events sustained or spiking | SLOWLOG GET 50 and INFO commandstats |
| Fork latency from persistence or replication | fork events, especially during BGSAVE or replica reconnect | INFO stats latest_fork_usec and THP status |
| AOF fsync disk pressure | aof-fsync-always events | INFO persistence aof_delayed_fsync and OS disk metrics |
| Expire cycle overload | expire-cycle events correlated with mass TTL expiry | INFO stats expired_keys rate and expired_time_cap_reached_count |
| Eviction under memory pressure | eviction-cycle or eviction-del events | INFO stats evicted_keys rate and used_memory vs maxmemory |
| Active defrag overhead | active-defrag-cycle events | INFO memory mem_fragmentation_ratio |

Quick checks

Run these commands against the affected instance. All are read-only except CONFIG SET, which is safe and takes effect immediately without a restart.

# Verify whether latency monitoring is enabled
redis-cli CONFIG GET latency-monitor-threshold

# Enable it if the value is 0; the threshold is in milliseconds, so pick one that suits your SLA
redis-cli CONFIG SET latency-monitor-threshold 100

# List the most recent spike per event category
redis-cli LATENCY LATEST

# Pull the time series for a specific event
redis-cli LATENCY HISTORY command
redis-cli LATENCY HISTORY fork
redis-cli LATENCY HISTORY aof-fsync-always
redis-cli LATENCY HISTORY expire-cycle

# Get a human-readable diagnosis summary
redis-cli LATENCY DOCTOR

# Check the slowlog for specific commands
redis-cli SLOWLOG GET 10

# Check the most recent fork duration
redis-cli INFO stats | grep latest_fork_usec

# Check for delayed AOF fsync operations
redis-cli INFO persistence | grep aof_delayed_fsync

Do not run LATENCY RESET during an active investigation because you will erase the evidence. Use it only to clear stale data after an incident.
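
When redis-cli output is piped, nested replies are printed one element per line, which makes LATENCY LATEST hard to read. The filter below is a minimal sketch that regroups those lines into one row per event. It assumes four fields per event (name, Unix timestamp of the latest spike, latest latency in ms, all-time maximum in ms); format_latency_latest is a hypothetical name, and the grouping should be adjusted if your Redis version returns a different number of fields.

```shell
# Group redis-cli's flattened LATENCY LATEST reply into one line per event.
# Assumption: four fields per event, each on its own line.
format_latency_latest() {
  tr -d '\r' | awk '
    NR % 4 == 1 { name = $0 }
    NR % 4 == 2 { ts = $0 }
    NR % 4 == 3 { latest = $0 }
    NR % 4 == 0 {
      printf "%s last_spike=%s latest_ms=%s max_ms=%s\n", name, ts, latest, $0
    }
  '
}

# Usage against a live instance:
#   redis-cli LATENCY LATEST | format_latency_latest
```

Saving this output to a file before any LATENCY RESET keeps the evidence available for the post-incident review.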

How to diagnose it

  1. Enable monitoring if it is off. Run CONFIG GET latency-monitor-threshold. If it returns 0, run CONFIG SET latency-monitor-threshold 100 or lower. The change is instantaneous and does not require a restart. If your client timeout is 200ms, a threshold of 100ms gives you warning before clients fail.

  2. Run LATENCY LATEST. This returns one row per event category that has breached the threshold since startup or since the last LATENCY RESET. Each row shows the event name, the Unix timestamp of the latest spike, the latency of that spike in milliseconds, and the all-time maximum latency for the event. Multiple event types can spike simultaneously; for example, a replication full resync produces both a fork event and potentially a command event if the command queue backs up.

  3. Drill into the time series with LATENCY HISTORY <event>. This returns timestamped samples for that event. Look for patterns: periodic spikes on the hour, spikes aligned with persistence schedules, or sustained elevation. Compare the timestamps to your application latency graph to confirm causality.

  4. If command is spiking, correlate with slowlog. Run SLOWLOG GET 50 and look for commands with high execution times. Cross-reference with INFO commandstats to see if a specific command type has abnormal usec_per_call. Common culprits are KEYS, SMEMBERS, large SORT, and unoptimized Lua scripts. Replace KEYS with SCAN, paginate large range commands, and review Lua script complexity. Use CLIENT LIST or the client information in SLOWLOG GET to identify the source client.

  5. If fork is spiking, investigate memory and kernel configuration. Run INFO stats | grep latest_fork_usec. The value is reported in microseconds; if it works out to more than roughly 20ms per GB of dataset, check whether Transparent Huge Pages (THP) is enabled: cat /sys/kernel/mm/transparent_hugepage/enabled. The file lists every mode and brackets the active one, so the output should show [never] (for example, always madvise [never]). THP is the single most common cause of excessive fork latency. Also verify vm.overcommit_memory is set to 1.

  6. If aof-fsync-always is spiking, treat it as disk I/O pressure. Check host-level disk I/O metrics such as await and utilization. Also run INFO persistence | grep aof_delayed_fsync. That counter advances under appendfsync everysec, where it counts writes that were delayed because a background fsync was still in progress, so an increasing count means fsync operations are not completing within their expected window. Under appendfsync always, rely on the latency event and the disk metrics, and consider whether you can tolerate everysec instead. Do not make this change during an incident without understanding the durability tradeoff.

  7. If expire-cycle is spiking, look for mass expiry. Check the rate of expired_keys in INFO stats. A sudden jump indicates many keys with the same TTL expiring simultaneously. Add jitter to TTLs (EXPIRE key (base + random(0, spread))) to spread the load. Check expired_time_cap_reached_count (Redis 6.0+); if it is increasing, the expire cycle is hitting its CPU budget and falling behind.

  8. If eviction-cycle or eviction-del appears, you are at or near maxmemory. Check used_memory against maxmemory and review evicted_keys rate. If eviction is constant, your working set exceeds available memory. Increase maxmemory, shard the dataset, or reduce data volume. Eviction consumes CPU and adds latency to the write path.

  9. Review LATENCY DOCTOR. This command prints a plain-text interpretation of the current latency data. It flags obvious issues such as high fork latency or slow commands. Use it as a sanity check after reviewing the raw history, not as a substitute for it.

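  10. Persist the configuration. If you enabled latency monitoring with CONFIG SET, run CONFIG REWRITE so the change survives a restart. CONFIG REWRITE only works when the server was started with a configuration file; if it was not, add latency-monitor-threshold to your startup configuration by hand. Otherwise the instance will boot with monitoring disabled and you will have no historical data for the next incident.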
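
The history review in step 3 can be condensed with a small filter when the raw list is long. The sketch below assumes the piped redis-cli output alternates between a Unix timestamp line and a latency line (in ms); summarize_latency_history is a hypothetical name.

```shell
# Summarize LATENCY HISTORY <event>: sample count, worst spike, and when it occurred.
# Assumption: input alternates timestamp line, then latency-in-ms line.
summarize_latency_history() {
  tr -d '\r' | awk '
    NR % 2 == 1 { ts = $0; next }          # odd lines hold the timestamp
    {
      n++
      if ($0 + 0 > max) { max = $0 + 0; max_ts = ts }
    }
    END { printf "samples=%d max_ms=%d at=%s\n", n, max, max_ts }
  '
}

# Usage against a live instance:
#   redis-cli LATENCY HISTORY fork | summarize_latency_history
```

The timestamp of the worst sample is the value to line up against the application latency graph.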

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| LATENCY LATEST command | Event loop blocking by slow operations | Sustained or recurring entries |
| LATENCY LATEST fork | Persistence or replication freezing the main thread | Entries > 500ms, or > 20ms per GB of dataset |
| LATENCY LATEST aof-fsync-always | Disk I/O blocking writes | Any recurring entry when using appendfsync always |
| LATENCY LATEST expire-cycle | TTL cleanup consuming excessive CPU | Entries > 25ms or aligned with mass expiry |
| LATENCY LATEST eviction-cycle | Key eviction under memory pressure | Any entry when used_memory is near maxmemory |
| latest_fork_usec | Actual duration of the last fork | > 500ms sustained across multiple forks |
| aof_delayed_fsync | Fsync operations that missed their deadline | Rate increasing over time |
| Slowlog entry rate | Specific commands that executed slowly | More than a handful of new entries per minute |

Fixes

Slow commands identified via command latency

Find the exact commands in SLOWLOG GET and the issuing client in CLIENT LIST. Replace KEYS with SCAN, paginate large LRANGE or HGETALL, optimize Lua scripts, and avoid running SMEMBERS on large sets. Move legitimate but expensive commands to a replica or cache the result. CLIENT KILL disconnects a rogue client, but it is a temporary bandage; fix the application.
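
To see which command types are expensive on average, the INFO commandstats output can be ranked by usec_per_call. This is a rough sketch rather than a complete tool; top_commands_by_cost is a hypothetical name, and it relies on the cmdstat_<name>:calls=...,usec=...,usec_per_call=... line format.

```shell
# Rank command types by average cost using INFO commandstats on stdin.
# Prints "<usec_per_call> <command>", highest first; $1 limits the row count.
top_commands_by_cost() {
  limit="${1:-5}"
  tr -d '\r' | awk -F'[:=,]' '
    /^cmdstat_/ {
      cmd = substr($1, 9)                  # drop the "cmdstat_" prefix
      for (i = 2; i < NF; i += 2)
        if ($i == "usec_per_call") print $(i + 1), cmd
    }
  ' | sort -rn | head -n "$limit"
}

# Usage against a live instance:
#   redis-cli INFO commandstats | top_commands_by_cost 10
```

A high average alone is not proof of a problem; cross-check the top entries against SLOWLOG GET to find the individual calls that blocked the event loop.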

Fork latency identified via fork events

Disable THP immediately if it is not already disabled. If the dataset is large enough that fork duration consistently exceeds your client timeout, shard the data across smaller instances. Ensure vm.overcommit_memory is 1. Increase repl-backlog-size to 100MB or more so that brief replica disconnections do not trigger full resyncs, which cause additional forks.
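
The 20ms-per-GB rule of thumb needs a unit conversion, because latest_fork_usec is in microseconds and memory is reported in bytes. The sketch below does that arithmetic and also extracts the active THP mode. Both function names are hypothetical, and using used_memory as a stand-in for dataset size is an assumption; used_memory_rss may be a better proxy on fragmented instances.

```shell
# Convert latest_fork_usec into milliseconds per GB of used_memory.
# Reads combined INFO output on stdin. Assumption: used_memory approximates dataset size.
fork_ms_per_gb() {
  tr -d '\r' | awk -F: '
    $1 == "latest_fork_usec" { fork_usec = $2 }
    $1 == "used_memory"      { bytes = $2 }
    END {
      if (bytes <= 0) { print "used_memory not found"; exit 1 }
      gb = bytes / (1024 * 1024 * 1024)
      printf "fork_ms=%.1f ms_per_gb=%.1f\n", fork_usec / 1000, (fork_usec / 1000) / gb
    }
  '
}

# Print the active THP mode: the word shown in brackets in the sysfs file.
thp_mode() {
  sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Usage against a live host:
#   redis-cli INFO | fork_ms_per_gb
#   thp_mode < /sys/kernel/mm/transparent_hugepage/enabled
```

For example, a 180000 microsecond fork on a 4 GB instance is 45 ms per GB, which is above the rough guideline and worth a THP check.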

AOF fsync latency identified via aof-fsync-always events

Improve disk performance or reduce write volume. Switching from appendfsync always to everysec removes the per-write fsync penalty but increases the durability window to roughly one second. Do not switch to no unless you understand that the OS, not Redis, controls flush timing. If you are already on everysec and still seeing fsync delays, investigate disk saturation, competing I/O from other services, or container I/O throttling.
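
Because aof_delayed_fsync is a cumulative counter, what matters is how fast it grows. A minimal sketch of that comparison follows, assuming two INFO persistence snapshots taken some interval apart; info_field and delayed_fsync_growth are hypothetical helper names.

```shell
# Extract one field from INFO-style "key:value" output on stdin.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Print how much aof_delayed_fsync grew between two INFO persistence snapshots.
# $1 is the earlier snapshot text, $2 the later one; missing fields count as 0.
delayed_fsync_growth() {
  before=$(printf '%s\n' "$1" | info_field aof_delayed_fsync)
  after=$(printf '%s\n' "$2" | info_field aof_delayed_fsync)
  echo $(( ${after:-0} - ${before:-0} ))
}

# Usage against a live instance:
#   a=$(redis-cli INFO persistence); sleep 60; b=$(redis-cli INFO persistence)
#   delayed_fsync_growth "$a" "$b"
```

A result that stays at 0 across several intervals suggests the disk is keeping up; steady growth points back to the disk-level checks described above.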

Expire cycle latency identified via expire-cycle events

Add jitter to TTLs so they do not align to the same second. Reduce the volume of expiring keys if possible. If expired_time_cap_reached_count is climbing, the active expiry loop is CPU-bound. You may need to increase memory headroom or reduce key churn.
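
TTL jitter normally belongs in application code, but the idea can be sketched in shell for scripts that set expirations. The helper below is illustrative only: jittered_ttl is a hypothetical name, and it reads /dev/urandom so that calls made within the same second still get different values.

```shell
# Print base + a random integer in [0, spread], both in seconds.
# Uses /dev/urandom rather than a time-based seed.
jittered_ttl() {
  base="$1"
  spread="$2"
  rand=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')
  echo $(( base + rand % (spread + 1) ))
}

# Usage (the key name is a placeholder):
#   redis-cli EXPIRE session:42 "$(jittered_ttl 3600 300)"
```

With a base of 3600 and a spread of 300, keys written together expire across a five-minute window instead of in the same second.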

Eviction latency identified via eviction-cycle events

Raise maxmemory if possible, or add shards to distribute the dataset. Check for memory leaks in client output buffers (CLIENT LIST omem) or fragmentation (mem_fragmentation_ratio). If eviction is expected, ensure the eviction policy matches your workload. For caches, allkeys-lru is common; for database use cases, eviction indicates a capacity problem rather than a tuning problem.
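
The used_memory versus maxmemory comparison can be reduced to a single percentage. The following sketch reads INFO memory from stdin; memory_used_pct is a hypothetical name, and a maxmemory of 0 is reported separately because it means no limit is configured.

```shell
# Print used_memory as a percentage of maxmemory from INFO memory on stdin.
memory_used_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "maxmemory=0 (no limit set)"; exit }
      printf "used_pct=%.1f\n", 100 * used / max
    }
  '
}

# Usage against a live instance:
#   redis-cli INFO memory | memory_used_pct
```

A value that sits close to 100 alongside a rising evicted_keys rate matches the constant-eviction case described above.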

Prevention

  • Enable latency monitoring in production. Set latency-monitor-threshold to match your SLA. If client timeout is 200ms, a threshold of 100ms gives warning before clients fail.
  • Monitor LATENCY LATEST continuously. Do not wait for an incident. A gradual increase in fork or expire-cycle baseline latency signals a capacity limit.
  • Keep the slowlog short and reviewed. Set slowlog-log-slower-than aggressively and review entries regularly. One recurring KEYS command in the slowlog is worth fixing before it causes an outage.
  • Disable THP on all Redis hosts. This is a one-time kernel configuration change that prevents the most common source of fork latency.
  • Size replication backlog defensively. Set repl-backlog-size to at least 100MB so that network blips do not cascade into full resyncs and fork storms.
  • Add TTL jitter. Prevent mass expiry events by adding random spread to TTLs.

How Netdata helps

  • Netdata collects Redis latency events and correlates them with system-level metrics such as disk I/O await, CPU utilization, and memory RSS. This helps distinguish disk pressure from command blocking.
  • It tracks latest_fork_usec and persistence states alongside latency spikes, so you can see whether a fork occurred at the moment of a P99 jump.
  • It monitors slowlog growth rate and surfaces new slowlog entries without requiring manual SLOWLOG GET checks.
  • It tracks aof_delayed_fsync and eviction counters, correlating AOF or memory pressure with latency events on the same timeline.