Redis monitoring maturity model: from survival to expert
Redis can return PONG while fragmentation doubles RSS, a replica falls behind, or a slow command wedges the event loop. This guide maps four cumulative operator levels derived from production runbooks. Level 1 is liveness and memory limits. Level 2 is workload anomalies and capacity pressure. Level 3 is leading indicators for composite failures. Level 4 is forensic depth. Do not skip Level 1 because you are collecting allocator statistics.
Use this as a coverage checklist for existing deployments or a roadmap for new ones.
flowchart TD
L1[Level 1: survival - liveness and memory limits]
L2[Level 2: operational - throughput and eviction rates]
L3[Level 3: mature - latency events and replication quality]
L4[Level 4: expert - allocator and key-space forensics]
L1 --> L2
L2 --> L3
L3 --> L4
Level 1: survival
Binary signals. If any fail, you have an immediate incident.
| Signal | Why it matters | Source |
|---|---|---|
| PING | Binary liveness. If it fails, the event loop is blocked or the process is down. | redis-cli PING |
| uptime_in_seconds | Unexpected resets indicate OOM kills, crashes, or operator restarts. | INFO server |
| loading | During loading:1, data commands are rejected. Track duration to detect stuck restarts. | INFO persistence |
| used_memory vs maxmemory | Approaching the limit means imminent eviction or write rejection. If maxmemory is 0, there is no limit and the kernel OOM killer becomes the only bound. | INFO memory, CONFIG GET maxmemory |
| connected_clients vs maxclients | Connection exhaustion causes immediate client errors. | INFO clients, CONFIG GET maxclients |
| rdb_last_bgsave_status | Failed RDB saves mean your last snapshot is stale. With stop-writes-on-bgsave-error yes, writes are rejected. | INFO persistence |
| aof_last_write_status | Failed AOF writes break durability. | INFO persistence |
| master_link_status | On replicas, down means stale data and potential data loss on failover. | INFO replication |
If PING times out, confirm at the OS level whether the process still exists before assuming it is gone; a blocked event loop cannot answer INFO either, so a live process that does not respond points to a blocking command rather than a crash. If maxmemory is 0, Redis grows until the kernel OOM killer terminates it. If loading stays 1 longer than your baseline, investigate disk I/O or dataset corruption. Correlate rdb_last_bgsave_status with rdb_last_save_time; a status of ok with a stale timestamp means saves are not triggering.
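The Level 1 checks above can be sketched as a small evaluator over raw INFO text. This is a minimal illustration, not a complete health check: the function names, the 0.9 warning ratio, and the sample values are all hypothetical, and it assumes the INFO output includes a maxmemory field alongside used_memory.

```python
from typing import Dict, List


def parse_info(text: str) -> Dict[str, str]:
    """Parse key:value lines of INFO output, skipping blanks and '# Section' headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def survival_alerts(info: Dict[str, str], memory_warn_ratio: float = 0.9) -> List[str]:
    """Return human-readable alerts for the binary Level 1 signals.

    memory_warn_ratio is a hypothetical threshold; tune it to your deployment.
    """
    alerts: List[str] = []
    maxmemory = int(info.get("maxmemory", "0"))
    used = int(info.get("used_memory", "0"))
    if maxmemory == 0:
        alerts.append("maxmemory is 0: growth is bounded only by the kernel OOM killer")
    elif used / maxmemory >= memory_warn_ratio:
        alerts.append(f"used_memory is at {used / maxmemory:.0%} of maxmemory")
    if info.get("loading") == "1":
        alerts.append("loading:1, data commands are rejected")
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        alerts.append("last RDB save failed")
    if info.get("aof_last_write_status", "ok") != "ok":
        alerts.append("last AOF write failed")
    if info.get("role") == "slave" and info.get("master_link_status") != "up":
        alerts.append("replica link to primary is down")
    return alerts


# Made-up sample resembling a replica under memory pressure.
SAMPLE_INFO = """# Memory
used_memory:950
maxmemory:1000
# Persistence
loading:0
rdb_last_bgsave_status:err
aof_last_write_status:ok
# Replication
role:slave
master_link_status:down
"""
```

Calling survival_alerts(parse_info(SAMPLE_INFO)) on the sample yields three alerts: memory, RDB, and replica link. Feeding it text rather than a live connection keeps the logic testable without a running server.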
Level 2: operational
Rate-based signals that distinguish “Redis is up” from “Redis is keeping up.”
| Signal | Why it matters | Source |
|---|---|---|
| instantaneous_ops_per_sec | Throughput baseline. A sustained drop without traffic changes signals event loop blocking. | INFO stats |
| keyspace_hits / keyspace_misses rate | Cache effectiveness. A sustained drop after warmup means evictions or pattern shifts. | INFO stats |
| evicted_keys rate | Non-zero rate means the dataset exceeds memory. For persistent workloads, any eviction is data loss. | INFO stats |
| total_error_replies rate | Write rejections, persistence failures, and ACL denials. Break down by error type with INFO errorstats where available. | INFO stats |
| rejected_connections rate | Any increase means clients are actively failing to connect. | INFO stats |
| mem_fragmentation_ratio | Values greater than 1.5 indicate waste; values below 1.0 on substantial instances signal swap. | INFO memory |
| latest_fork_usec | Fork freezes the main thread. The value is in microseconds; anything above 500 ms (500,000) causes client timeouts. | INFO stats |
| Slowlog entry rate | Recurring slow commands block the single-threaded event loop for all clients. | SLOWLOG LEN, SLOWLOG GET |
| expired_keys rate | Sudden spikes indicate mass expiration events that can spike CPU. | INFO stats |
| aof_delayed_fsync rate | Disk I/O cannot keep up with appendfsync everysec. This is a precursor to write blocking. | INFO persistence |
| connected_slaves | A drop below the expected count means a replica is disconnected. | INFO replication |
| Replication offset lag | Byte lag between primary and replica. Lag approaching repl-backlog-size triggers full resyncs. | INFO replication |
| Network I/O rate | Asymmetric spikes or NIC saturation. Replication and Pub/Sub fan-out consume bandwidth. | INFO stats |
Calculate cache hit rate as keyspace_hits / (keyspace_hits + keyspace_misses). A rate below your baseline after warmup indicates evictions or a pattern shift. Watch evicted_keys as a rate, not an absolute. Cache workloads tolerate some eviction; persistent workloads treat any eviction as data loss. When maxmemory-policy is noeviction, watch total_error_replies instead: writes fail with OOM errors while evicted_keys stays flat. A simultaneous climb in evicted_keys, keyspace_misses, and instantaneous_ops_per_sec signals a memory pressure spiral.
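As a rough sketch of the arithmetic above, the following helpers compute the hit rate and turn cumulative counters into per-second rates between two samples. The function names and the sample dictionaries are illustrative assumptions; the counter names come from INFO stats.

```python
from typing import Dict, Optional


def hit_rate(hits: int, misses: int) -> Optional[float]:
    """keyspace_hits / (keyspace_hits + keyspace_misses); None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def counter_rate(prev: int, curr: int, seconds: float) -> Optional[float]:
    """Per-second rate of a cumulative counter between two samples.

    Returns None when the interval is not positive or the counter went
    backwards, which usually means the server restarted between samples.
    """
    if seconds <= 0 or curr < prev:
        return None
    return (curr - prev) / seconds


def interval_hit_rate(prev: Dict[str, int], curr: Dict[str, int]) -> Optional[float]:
    """Hit rate over the sampling interval only, not since server start."""
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if hits < 0 or misses < 0:
        return None  # counters were reset between samples
    return hit_rate(hits, misses)
```

Computing the hit rate over the interval rather than from lifetime totals matters on long-running instances, where months of accumulated hits can hide a recent collapse. The same counter_rate helper applies to evicted_keys, total_error_replies, and rejected_connections.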
Level 3: mature
Internal latency decomposition, replication quality, and security signals. These catch composite patterns such as fork-induced cascades, backlog overflow loops, and memory pressure spirals.
| Signal | Why it matters | Source |
|---|---|---|
| LATENCY LATEST | Internal latency by event type such as command, fork, and aof-fsync-always. | LATENCY LATEST, LATENCY HISTORY |
| Client output buffer memory | A single slow subscriber or forgotten MONITOR can consume gigabytes and OOM the instance. | CLIENT LIST (omem) |
| blocked_clients | Growing count without producers means consumers are stalling or crashing. | INFO clients |
| Keyspace key count trend | Unbounded growth signals missing TTLs, leaks, or abandoned features. | INFO keyspace |
| used_memory_dataset vs used_memory_overhead | Overhead growth from buffers or metadata can crowd out data before maxmemory is reached. | INFO memory |
| sync_full / sync_partial_err | Full resyncs are expensive. sync_partial_err means repl-backlog-size is too small. | INFO stats |
| COW size during fork | Copy-on-write memory consumed by the last fork (rdb_last_cow_size, aof_last_cow_size). Values above 50% of used_memory risk OOM on the next fork. | INFO persistence |
| AOF size ratio | aof_current_size / aof_base_size above 2 means rewrite is overdue or failing. | INFO persistence |
| Stream consumer group lag | Growing lag or pending means consumers cannot keep up with producers. | XINFO GROUPS |
| Cluster state and slot health | cluster_state:fail or non-zero cluster_slots_fail means active outage. | CLUSTER INFO |
| ACL LOG | Failed auth and authorization attempts. | ACL LOG |
| THP status | Transparent Huge Pages multiply fork latency. Should read [never]. | /sys/kernel/mm/transparent_hugepage/enabled |
| MEMORY DOCTOR | Built-in heuristic diagnostics for memory issues. | MEMORY DOCTOR |
LATENCY LATEST requires latency-monitor-threshold > 0; the default is 0, which disables the subsystem. On replicas, master_link_status:up does not mean the replica is caught up: check offset lag. In Cluster mode, cluster_state:ok with cluster-require-full-coverage no masks partial failures, so monitor cluster_slots_fail and cluster_slots_pfail directly. If blocked_clients climbs, identify blocking consumers with CLIENT LIST and check producer health.
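The offset-lag check mentioned above can be approximated from INFO replication on the primary. The sketch below is an assumption-laden illustration: it relies on the slaveN:ip=...,port=...,offset=... line format and the master_repl_offset field, and the function names, sample addresses, and 0.8 warning ratio are hypothetical.

```python
import re
from typing import Dict, List


def replica_byte_lag(info_text: str) -> Dict[str, int]:
    """Map 'ip:port' of each replica to its byte lag behind master_repl_offset."""
    master_offset = None
    replica_offsets: Dict[str, int] = {}
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # section headers and blank lines
        if key == "master_repl_offset":
            master_offset = int(value)
        elif re.fullmatch(r"slave\d+", key):
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            replica_offsets[f'{attrs["ip"]}:{attrs["port"]}'] = int(attrs["offset"])
    if master_offset is None:
        return {}
    return {addr: master_offset - off for addr, off in replica_offsets.items()}


def backlog_at_risk(lags: Dict[str, int], backlog_size: int,
                    warn_ratio: float = 0.8) -> List[str]:
    """Replicas whose lag is within warn_ratio of repl-backlog-size (hypothetical threshold)."""
    if backlog_size <= 0:
        return []
    return [addr for addr, lag in lags.items() if lag / backlog_size >= warn_ratio]


# Made-up primary-side sample with one healthy and one lagging replica.
SAMPLE_REPLICATION = """# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=9500,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=1000,lag=1
master_repl_offset:10000
repl_backlog_size:10000
"""
```

With the sample, replica_byte_lag reports 500 and 9000 bytes, and backlog_at_risk flags only 10.0.0.3:6379. A replica in that state still shows master_link_status:up, which is why the byte lag is worth tracking separately.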
WARNING: On instances with thousands of connections, CLIENT LIST can stall the event loop.
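When CLIENT LIST output is available, for example captured from a replica or during a quiet window, it can be reduced to the two questions raised in this level: which clients hold large output buffers, and which are blocked. The following is a minimal sketch; the function names, the 64 MiB threshold, and the abbreviated sample lines are hypothetical, while omem and the b flag are fields of the CLIENT LIST format.

```python
from typing import Dict, List


def parse_client_list(text: str) -> List[Dict[str, str]]:
    """Parse CLIENT LIST output: one client per line, space-separated key=value tokens."""
    clients: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields: Dict[str, str] = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
        clients.append(fields)
    return clients


def heavy_output_buffers(clients: List[Dict[str, str]],
                         min_omem_bytes: int = 64 * 1024 * 1024) -> List[Dict[str, str]]:
    """Clients whose output buffer memory (omem) meets the threshold, largest first."""
    heavy = [c for c in clients if int(c.get("omem", "0")) >= min_omem_bytes]
    return sorted(heavy, key=lambda c: int(c.get("omem", "0")), reverse=True)


def blocked(clients: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Clients waiting in a blocking operation, marked by 'b' in the flags field."""
    return [c for c in clients if "b" in c.get("flags", "")]


# Abbreviated, made-up lines; real output carries many more fields per client.
SAMPLE_CLIENTS = """id=3 addr=10.0.0.9:50001 name= flags=N omem=0 cmd=get
id=4 addr=10.0.0.9:50002 name=feed flags=P omem=134217728 cmd=subscribe
id=5 addr=10.0.0.7:50003 name=worker flags=b omem=0 cmd=blpop
"""
```

On the sample, heavy_output_buffers returns the subscriber with 128 MiB buffered and blocked returns the BLPOP consumer. Parsing captured text keeps the analysis off the hot path, so the command itself only needs to run once per sampling interval.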
Level 4: expert
Forensic signals that separate allocator noise from real leaks, identify resource-heavy keys, and catch encoding downgrades that silently inflate memory.
| Signal | Why it matters | Source |
|---|---|---|
| allocator_frag_ratio, allocator_rss_ratio | Precise jemalloc fragmentation, isolating allocator waste from process overhead. | INFO memory |
| active_defrag_running, active_defrag_hits/misses | Defrag effectiveness. High misses with sustained CPU means the workload is fragmentation-prone. | INFO memory, INFO stats |
| expired_time_cap_reached_count | The expire cycle hit its CPU cap. Expiration is falling behind creation. | INFO stats |
| used_cpu_user_main_thread, used_cpu_sys_main_thread | Precise main-thread saturation. Available on Redis 6.2+. | INFO cpu |
| io_threaded_reads_processed, io_threaded_writes_processed | I/O thread activity. Confirms whether I/O threading is engaged. | INFO stats |
| tracking_total_keys, tracking_total_items | Client-side caching tracking table memory. | INFO stats |
| MEMORY MALLOC-STATS | Raw jemalloc arena and bin statistics. | MEMORY MALLOC-STATS |
| Per-client connection analysis | Match addr, idle, and cmd patterns to find connection leaks and stale monitoring clients. | CLIENT LIST |
| Big key analysis | A single large key can dominate latency and memory. | redis-cli --bigkeys, MEMORY USAGE |
| Key encoding analysis | Encoding downgrades, such as ziplist (listpack on Redis 7.0+) to hashtable, dramatically inflate memory. | OBJECT ENCODING |
| Kernel memory settings | vm.overcommit_memory must be 1 for fork reliability. vm.swappiness should be near 0. | sysctl vm.overcommit_memory, sysctl vm.swappiness |
| DEBUG CHANGE-REPL-ID | Unexpected invocation indicates manual replication topology changes. | INFO commandstats or logs |
WARNING: redis-cli --bigkeys scans the full keyspace and can impact production latency. Run it against a replica or during low-traffic windows. MEMORY USAGE on large aggregate keys can block; test on a replica first.
Detect encoding downgrades by sampling with OBJECT ENCODING. A small hash that converts from ziplist (listpack on Redis 7.0+) to hashtable is a common memory surprise. Set vm.overcommit_memory to 1 so fork() does not fail, and keep vm.swappiness near 0. Disable Transparent Huge Pages; they are a common cause of fork-related latency spikes.
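The kernel-side checks above lend themselves to a small validator. This sketch takes the raw values as arguments so it can be tested anywhere; on Linux they would typically be read from /sys/kernel/mm/transparent_hugepage/enabled, /proc/sys/vm/overcommit_memory, and /proc/sys/vm/swappiness. The function names and the max_swappiness default of 10 are hypothetical choices, since "near 0" is left to the operator.

```python
import re
from typing import List, Optional


def active_thp_mode(contents: str) -> Optional[str]:
    """Return the bracketed (active) mode from the THP 'enabled' file, e.g. 'never'."""
    match = re.search(r"\[(\w+)\]", contents)
    return match.group(1) if match else None


def kernel_setting_warnings(thp_contents: str, overcommit_memory: int,
                            swappiness: int, max_swappiness: int = 10) -> List[str]:
    """Compare kernel settings against the recommendations in this guide.

    max_swappiness is a hypothetical cutoff for 'near 0'.
    """
    warnings: List[str] = []
    mode = active_thp_mode(thp_contents)
    if mode != "never":
        warnings.append(f"Transparent Huge Pages mode is {mode!r}, expected 'never'")
    if overcommit_memory != 1:
        warnings.append(f"vm.overcommit_memory is {overcommit_memory}, expected 1")
    if swappiness > max_swappiness:
        warnings.append(f"vm.swappiness is {swappiness}, expected near 0")
    return warnings
```

For example, kernel_setting_warnings("[always] madvise never", 0, 60) returns three warnings, while kernel_setting_warnings("always madvise [never]", 1, 1) returns an empty list.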
How Netdata helps
Netdata derives rates from cumulative counters such as evicted_keys, sync_full, and total_commands_processed, preventing the common mistake of graphing raw absolutes. It charts used_memory, used_memory_rss, and mem_fragmentation_ratio together, making swap and fragmentation visible before the OOM killer fires.
It tracks latest_fork_usec alongside rdb_bgsave_in_progress and used_memory_rss to surface fork-induced latency cascades as composite events rather than isolated spikes. It tracks replication offset lag per replica and warns when lag approaches the configured repl-backlog-size, giving runway before a full resync triggers.
It surfaces slowlog entries and LATENCY LATEST events next to throughput drops so you can distinguish event loop blocking from traffic spikes. It ingests CLIENT LIST to visualize per-client output buffer memory, catching slow Pub/Sub subscribers and forgotten MONITOR sessions before they OOM the instance.







