Redis monitoring maturity model: from survival to expert

Redis can return PONG while fragmentation doubles RSS, a replica falls behind, or a slow command wedges the event loop. This guide maps four cumulative operator levels derived from production runbooks. Level 1 is liveness and memory limits. Level 2 is workload anomalies and capacity pressure. Level 3 is leading indicators for composite failures. Level 4 is forensic depth. Do not skip Level 1 because you are collecting allocator statistics.

Use this as a coverage checklist for existing deployments or a roadmap for new ones.

flowchart TD
    L1[Level 1: survival - liveness and memory limits]
    L2[Level 2: operational - throughput and eviction rates]
    L3[Level 3: mature - latency events and replication quality]
    L4[Level 4: expert - allocator and key-space forensics]
    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1: survival

Binary signals. If any fail, you have an immediate incident.

| Signal | Why it matters | Source |
| --- | --- | --- |
| PING | Binary liveness. If it fails, the event loop is blocked or the process is down. | redis-cli PING |
| uptime_in_seconds | Unexpected resets indicate OOM kills, crashes, or operator restarts. | INFO server |
| loading | During loading:1, data commands are rejected. Track duration to detect stuck restarts. | INFO persistence |
| used_memory vs maxmemory | Approaching the limit means imminent eviction or write rejection. If maxmemory is 0, there is no limit and Redis grows until the kernel OOM killer terminates it. | INFO memory, CONFIG GET maxmemory |
| connected_clients vs maxclients | Connection exhaustion causes immediate client errors. | INFO clients, CONFIG GET maxclients |
| rdb_last_bgsave_status | Failed RDB saves mean your last snapshot is stale. With stop-writes-on-bgsave-error yes, writes are rejected. | INFO persistence |
| aof_last_write_status | Failed AOF writes break durability. | INFO persistence |
| master_link_status | On replicas, down means stale data and potential data loss on failover. | INFO replication |

If PING times out, confirm at the OS level whether the process is still running before assuming it is gone. Redis executes commands on a single thread, so a blocked event loop cannot answer INFO or any other command until the blocking operation finishes. If maxmemory is 0, Redis has no memory limit and grows until the kernel OOM killer terminates it. If loading stays 1 longer than your baseline, investigate disk I/O or dataset corruption. Correlate rdb_last_bgsave_status with rdb_last_save_time; a status of ok with a stale timestamp means saves are not triggering.
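The Level 1 checks can be expressed as a small pure function over parsed INFO output. The following is a minimal sketch, not a definitive implementation: the function names, the 0.9 warning fraction, and the sample values are assumptions made for illustration, and the maxmemory and maxclients arguments are expected to come from CONFIG GET.

```python
def parse_info(text):
    """Parse raw INFO output ("key:value" lines) into a flat dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and "# Section" headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def level1_problems(info, maxmemory, maxclients, warn_fraction=0.9):
    """Return Level 1 problems as strings; an empty list means all clear.

    warn_fraction is an illustrative threshold, not a Redis default.
    """
    problems = []
    if info.get("loading") == "1":
        problems.append("loading: dataset still loading")
    if maxmemory == 0:
        problems.append("maxmemory: 0 (no limit configured)")
    elif int(info.get("used_memory", 0)) >= warn_fraction * maxmemory:
        problems.append("used_memory: near maxmemory")
    if int(info.get("connected_clients", 0)) >= warn_fraction * maxclients:
        problems.append("connected_clients: near maxclients")
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        problems.append("rdb_last_bgsave_status: last RDB save failed")
    if info.get("aof_last_write_status", "ok") != "ok":
        problems.append("aof_last_write_status: last AOF write failed")
    # master_link_status is only reported on replicas (role:slave).
    if info.get("role") == "slave" and info.get("master_link_status") != "up":
        problems.append("master_link_status: replica link is down")
    return problems


# Hypothetical sample, trimmed to the fields the checks read.
SAMPLE_INFO = """\
# Memory
used_memory:950
# Clients
connected_clients:10
# Persistence
loading:0
rdb_last_bgsave_status:err
aof_last_write_status:ok
# Replication
role:master
"""

for problem in level1_problems(parse_info(SAMPLE_INFO), maxmemory=1000, maxclients=10000):
    print(problem)
```

PING and uptime_in_seconds are left out on purpose: liveness needs a network round trip with a timeout, and restart detection needs the previous sample, so both belong in the collector rather than in a stateless check.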

Level 2: operational

Rate-based signals that distinguish “Redis is up” from “Redis is keeping up.”

| Signal | Why it matters | Source |
| --- | --- | --- |
| instantaneous_ops_per_sec | Throughput baseline. A sustained drop without traffic changes signals event loop blocking. | INFO stats |
| keyspace_hits / keyspace_misses rate | Cache effectiveness. A sustained drop after warmup means evictions or pattern shifts. | INFO stats |
| evicted_keys rate | Non-zero rate means the dataset exceeds memory. For persistent workloads, any eviction is data loss. | INFO stats |
| total_error_replies rate | Write rejections, persistence failures, and ACL denials. Break down by error type with INFO errorstats where available. | INFO stats |
| rejected_connections rate | Any increase means clients are actively failing to connect. | INFO stats |
| mem_fragmentation_ratio | Values greater than 1.5 indicate waste; values below 1.0 on substantial instances signal swap. | INFO memory |
| latest_fork_usec | Fork freezes the main thread. The value is reported in microseconds; above 500,000 (500 ms), expect client timeouts. | INFO stats |
| Slowlog entry rate | Recurring slow commands block the single-threaded event loop for all clients. | SLOWLOG LEN, SLOWLOG GET |
| expired_keys rate | Sudden spikes indicate mass expiration events that can spike CPU. | INFO stats |
| aof_delayed_fsync rate | Disk I/O cannot keep up with appendfsync everysec. This is a precursor to write blocking. | INFO persistence |
| connected_slaves | A drop below the expected count means a replica is disconnected. | INFO replication |
| Replication offset lag | Byte lag between primary and replica. Once lag exceeds repl-backlog-size, a replica that reconnects after a disconnect needs a full resync. | INFO replication |
| Network I/O rate | Asymmetric spikes or NIC saturation. Replication and Pub/Sub fan-out consume bandwidth. | INFO stats |

Calculate cache hit rate as keyspace_hits / (keyspace_hits + keyspace_misses). A rate below your baseline after warmup indicates evictions or a pattern shift. Watch evicted_keys as a rate, not an absolute. Cache workloads tolerate some eviction; persistent workloads treat any eviction as data loss. When maxmemory-policy is noeviction, watch total_error_replies instead: writes fail with OOM errors while evicted_keys stays flat. A simultaneous climb in evicted_keys, keyspace_misses, and instantaneous_ops_per_sec signals a memory pressure spiral.
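Because keyspace_hits, keyspace_misses, and evicted_keys are cumulative counters, the formula above yields a lifetime average unless it is applied to the difference between two samples. The sketch below shows one way to derive interval rates; the function names and sample numbers are hypothetical, and treating a negative delta as a counter reset is an assumption about how you want to handle restarts or CONFIG RESETSTAT.

```python
def hit_rate(hits, misses):
    """keyspace_hits / (keyspace_hits + keyspace_misses); None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def per_second(prev, curr, elapsed_s):
    """Per-second rate of a cumulative counter between two samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = curr - prev
    if delta < 0:
        # Counter went backwards: assume a restart or CONFIG RESETSTAT
        # and count only what accumulated since the reset.
        delta = curr
    return delta / elapsed_s


# Two hypothetical INFO stats samples taken 60 seconds apart.
before = {"keyspace_hits": 10_000, "keyspace_misses": 1_000, "evicted_keys": 100}
after = {"keyspace_hits": 10_450, "keyspace_misses": 1_050, "evicted_keys": 160}

interval_hit_rate = hit_rate(
    after["keyspace_hits"] - before["keyspace_hits"],
    after["keyspace_misses"] - before["keyspace_misses"],
)
eviction_rate = per_second(before["evicted_keys"], after["evicted_keys"], 60)
print(interval_hit_rate, eviction_rate)
```

Comparing the interval hit rate against the lifetime ratio makes a recent degradation visible even when months of healthy traffic dominate the cumulative counters.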

Level 3: mature

Internal latency decomposition, replication quality, and security signals. These catch composite patterns such as fork-induced cascades, backlog overflow loops, and memory pressure spirals.

| Signal | Why it matters | Source |
| --- | --- | --- |
| LATENCY LATEST | Internal latency by event type such as command, fork, and aof-fsync-always. | LATENCY LATEST, LATENCY HISTORY |
| Client output buffer memory | A single slow subscriber or forgotten MONITOR can consume gigabytes and OOM the instance. | CLIENT LIST (omem) |
| blocked_clients | Growing count without producers means consumers are stalling or crashing. | INFO clients |
| Keyspace key count trend | Unbounded growth signals missing TTLs, leaks, or abandoned features. | INFO keyspace |
| used_memory_dataset vs used_memory_overhead | Overhead growth from buffers or metadata can crowd out data before maxmemory is reached. | INFO memory |
| sync_full / sync_partial_err | Full resyncs are expensive. sync_partial_err means repl-backlog-size is too small. | INFO stats |
| COW size during fork | Copy-on-write memory consumed by the last fork (rdb_last_cow_size, aof_last_cow_size). Values above 50% of used_memory risk OOM on the next fork. | INFO persistence |
| AOF size ratio | aof_current_size / aof_base_size above 2 means rewrite is overdue or failing. | INFO persistence |
| Stream consumer group lag | Growing lag or pending means consumers cannot keep up with producers. | XINFO GROUPS |
| Cluster state and slot health | cluster_state:fail or non-zero cluster_slots_fail means active outage. | CLUSTER INFO |
| ACL LOG | Failed auth and authorization attempts. | ACL LOG |
| THP status | Transparent Huge Pages multiply fork latency. Should read [never]. | /sys/kernel/mm/transparent_hugepage/enabled |
| MEMORY DOCTOR | Built-in heuristic diagnostics for memory issues. | MEMORY DOCTOR |

LATENCY LATEST requires latency-monitor-threshold > 0; the default is 0, which disables the subsystem. On replicas, master_link_status:up does not mean the replica is caught up: check offset lag. In Cluster mode, cluster_state:ok with cluster-require-full-coverage no masks partial failures, so monitor cluster_slots_fail and cluster_slots_pfail directly. If blocked_clients climbs, identify blocking consumers with CLIENT LIST and check producer health.
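Offset lag can be computed on the primary from INFO replication, which reports master_repl_offset and one slaveN line per connected replica. The following is a rough sketch under stated assumptions: the function names, the 0.8 warning fraction, and the sample addresses and offsets are made up for illustration.

```python
import re


def replica_lag_bytes(info_replication):
    """Map "ip:port" of each replica to its byte lag behind master_repl_offset."""
    master_offset = None
    replica_offsets = {}
    for line in info_replication.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        if key == "master_repl_offset":
            master_offset = int(value)
        elif re.fullmatch(r"slave\d+", key):
            # Value looks like: ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            replica_offsets[f"{attrs['ip']}:{attrs['port']}"] = int(attrs["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found; run this against the primary")
    return {addr: master_offset - offset for addr, offset in replica_offsets.items()}


def replicas_near_backlog(lags, backlog_size, warn_fraction=0.8):
    """Replicas whose lag is at or above warn_fraction of repl-backlog-size.

    warn_fraction is an illustrative threshold, not a Redis setting.
    """
    return sorted(addr for addr, lag in lags.items() if lag >= warn_fraction * backlog_size)


# Hypothetical INFO replication output from a primary.
SAMPLE_REPLICATION = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1500,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=600,lag=4
master_repl_offset:1500
repl_backlog_size:1000
"""

lags = replica_lag_bytes(SAMPLE_REPLICATION)
print(lags)
print(replicas_near_backlog(lags, backlog_size=1000))
```

Note that the lag field on each slaveN line is seconds since the last acknowledgement, not bytes, which is why the sketch subtracts offsets instead.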

WARNING: On instances with thousands of connections, CLIENT LIST can stall the event loop.
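Given that cost, it makes sense to call CLIENT LIST sparingly and extract everything you need from a single response. The sketch below ranks clients by the omem field from already-captured output; the function name and the sample lines are hypothetical, and real output contains more fields than shown here.

```python
def top_output_buffers(client_list, n=5):
    """Return up to n (omem_bytes, addr, cmd) tuples, largest output buffer first."""
    clients = []
    for line in client_list.splitlines():
        # Each line is a run of space-separated key=value tokens.
        fields = dict(token.partition("=")[::2] for token in line.split())
        if "omem" in fields:
            clients.append((int(fields["omem"]), fields.get("addr", "?"), fields.get("cmd", "?")))
    return sorted(clients, reverse=True)[:n]


# Hypothetical CLIENT LIST output, trimmed to a few fields per client.
SAMPLE_CLIENTS = """\
id=3 addr=10.0.0.5:52555 name= age=855 idle=0 flags=N omem=0 cmd=get
id=7 addr=10.0.0.9:40000 name=feed age=9000 idle=0 flags=P omem=1048576 cmd=subscribe
id=9 addr=10.0.0.7:41000 name= age=12 idle=3 flags=O omem=20480 cmd=monitor
"""

for omem, addr, cmd in top_output_buffers(SAMPLE_CLIENTS, n=2):
    print(omem, addr, cmd)
```

Summing omem across all clients from the same response gives the total client output buffer memory without a second call.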

Level 4: expert

Forensic signals that separate allocator noise from real leaks, identify resource-heavy keys, and catch encoding downgrades that silently inflate memory.

| Signal | Why it matters | Source |
| --- | --- | --- |
| allocator_frag_ratio, allocator_rss_ratio | Precise jemalloc fragmentation, isolating allocator waste from process overhead. | INFO memory |
| active_defrag_running, active_defrag_hits/misses | Defrag effectiveness. High misses with sustained CPU means the workload is fragmentation-prone. | INFO memory, INFO stats |
| expired_time_cap_reached_count | The expire cycle hit its CPU cap. Expiration is falling behind creation. | INFO stats |
| used_cpu_user_main_thread, used_cpu_sys_main_thread | Precise main-thread saturation. Available on Redis 6.2+. | INFO cpu |
| io_threaded_reads_processed, io_threaded_writes_processed | I/O thread activity. Confirms whether I/O threading is engaged. | INFO stats |
| tracking_total_keys, tracking_total_items | Client-side caching tracking table memory. | INFO stats |
| MEMORY MALLOC-STATS | Raw jemalloc arena and bin statistics. | MEMORY MALLOC-STATS |
| Per-client connection analysis | Match addr, idle, and cmd patterns to find connection leaks and stale monitoring clients. | CLIENT LIST |
| Big key analysis | A single large key can dominate latency and memory. | redis-cli --bigkeys, MEMORY USAGE |
| Key encoding analysis | Encoding downgrades, such as ziplist (listpack in Redis 7.0 and later) to hashtable, dramatically inflate memory. | OBJECT ENCODING |
| Kernel memory settings | vm.overcommit_memory must be 1 for fork reliability. vm.swappiness should be near 0. | sysctl vm.overcommit_memory, sysctl vm.swappiness |
| DEBUG CHANGE-REPL-ID | Unexpected invocation indicates manual replication topology changes. | INFO commandstats or logs |

WARNING: redis-cli --bigkeys scans the full keyspace and can impact production latency. Run it against a replica or during low-traffic windows. MEMORY USAGE on large aggregate keys can block; test on a replica first.

Detect encoding downgrades by sampling with OBJECT ENCODING. A small hash that converts from ziplist (listpack in Redis 7.0 and later) to hashtable after crossing hash-max-ziplist-entries or hash-max-ziplist-value is a common memory surprise. Set vm.overcommit_memory to 1 so fork() does not fail under memory pressure, and keep vm.swappiness near 0. Disable Transparent Huge Pages; they are a common cause of fork latency spikes.
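The kernel settings above are easy to verify mechanically. This is a minimal, Linux-only sketch: the function names are hypothetical, and the max_swappiness default of 10 is an assumed reading of "near 0", not a recommendation from Redis. The checks are kept as pure functions so they can be exercised without touching the host.

```python
import re
from pathlib import Path


def active_thp_mode(contents):
    """Return the bracketed option from a line such as 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", contents)
    if match is None:
        raise ValueError(f"unrecognized THP setting: {contents!r}")
    return match.group(1)


def kernel_findings(thp_contents, overcommit_memory, swappiness, max_swappiness=10):
    """Return a list of kernel settings that deviate from the guidance above."""
    findings = []
    mode = active_thp_mode(thp_contents)
    if mode != "never":
        findings.append(f"transparent_hugepage is {mode}, expected never")
    if overcommit_memory != 1:
        findings.append(f"vm.overcommit_memory is {overcommit_memory}, expected 1")
    if swappiness > max_swappiness:  # max_swappiness is an assumed cutoff
        findings.append(f"vm.swappiness is {swappiness}, expected near 0")
    return findings


def read_kernel_settings():
    """Read the live values from sysfs and procfs (Linux only)."""
    thp = Path("/sys/kernel/mm/transparent_hugepage/enabled").read_text()
    overcommit = int(Path("/proc/sys/vm/overcommit_memory").read_text())
    swappiness = int(Path("/proc/sys/vm/swappiness").read_text())
    return thp, overcommit, swappiness


# Hypothetical values from a misconfigured host.
for finding in kernel_findings("[always] madvise never", overcommit_memory=0, swappiness=60):
    print(finding)
```

On a Redis host, `kernel_findings(*read_kernel_settings())` runs the same checks against the live values.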

How Netdata helps

Netdata derives rates from cumulative counters such as evicted_keys, sync_full, and total_commands_processed, preventing the common mistake of graphing raw absolutes. It charts used_memory, used_memory_rss, and mem_fragmentation_ratio together, making swap and fragmentation visible before the OOM killer fires.

It tracks latest_fork_usec alongside rdb_bgsave_in_progress and used_memory_rss to surface fork-induced latency cascades as composite events rather than isolated spikes. It tracks replication offset lag per replica and warns when lag approaches the configured repl-backlog-size, giving runway before a full resync triggers.

It surfaces slowlog entries and LATENCY LATEST events next to throughput drops so you can distinguish event loop blocking from traffic spikes. It ingests CLIENT LIST to visualize per-client output buffer memory, catching slow Pub/Sub subscribers and forgotten MONITOR sessions before they OOM the instance.