Redis monitoring maturity model: from survival to expert

Redis can return PONG while fragmentation doubles RSS, a replica falls behind, or slow commands intermittently block the event loop. This guide maps four cumulative operator levels derived from production runbooks. Level 1 is liveness and memory limits. Level 2 is workload anomalies and capacity pressure. Level 3 is leading indicators for composite failures. Level 4 is forensic depth. The levels build on each other: do not skip Level 1 because you are already collecting allocator statistics.

Use this as a coverage checklist for existing deployments or a roadmap for new ones.

flowchart TD
    L1[Level 1: survival - liveness and memory limits]
    L2[Level 2: operational - throughput and eviction rates]
    L3[Level 3: mature - latency events and replication quality]
    L4[Level 4: expert - allocator and key-space forensics]
    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1: survival

Binary signals. If any fail, you have an immediate incident.

Signal	Why it matters	Source
`PING`	Binary liveness. If it fails, the event loop is blocked or the process is down.	`redis-cli PING`
`uptime_in_seconds`	Unexpected resets indicate OOM kills, crashes, or operator restarts.	`INFO server`
`loading`	During `loading:1`, data commands are rejected. Track duration to detect stuck restarts.	`INFO persistence`
`used_memory` vs `maxmemory`	Approaching the limit means imminent eviction or write rejection. If `maxmemory` is 0, there is no limit and Redis can grow until the kernel OOM killer terminates it.	`INFO memory`, `CONFIG GET maxmemory`
`connected_clients` vs `maxclients`	Connection exhaustion causes immediate client errors.	`INFO clients`, `CONFIG GET maxclients`
`rdb_last_bgsave_status`	Failed RDB saves mean your last snapshot is stale. With `stop-writes-on-bgsave-error yes`, writes are rejected.	`INFO persistence`
`aof_last_write_status`	Failed AOF writes break durability.	`INFO persistence`
`master_link_status`	On replicas, `down` means stale data and potential data loss on failover.	`INFO replication`

If PING times out, check at the OS level whether the process is still running before assuming it is gone. Commands are served by the same main thread, so a blocked event loop will not answer INFO either, but the process will still be alive and usually consuming CPU. If maxmemory is 0, Redis grows until the kernel OOM killer terminates it. If loading stays 1 longer than your baseline, investigate disk I/O or dataset corruption. Correlate rdb_last_bgsave_status with rdb_last_save_time: a status of err means the last save failed, while a status of ok with a stale timestamp means saves are not triggering.
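The Level 1 checks above can be sketched as a small evaluator over raw INFO output. This is a minimal illustration, not a definitive implementation: the function names, the alert wording, and the 0.9 warning ratios are assumptions, and the `maxmemory` and `maxclients` values are expected to come from `CONFIG GET`.

```python
def parse_info(text: str) -> dict:
    """Parse raw INFO output ("key:value" lines, "# Section" headers) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def level1_alerts(info: dict, maxmemory: int, maxclients: int,
                  memory_warn_ratio: float = 0.9,
                  clients_warn_ratio: float = 0.9) -> list:
    """Return human-readable alerts for the Level 1 signals.

    The warn ratios are illustrative defaults, not recommended values.
    """
    alerts = []
    if info.get("loading") == "1":
        alerts.append("loading: dataset still loading, data commands are rejected")
    if maxmemory == 0:
        alerts.append("maxmemory: no limit configured")
    elif int(info.get("used_memory", 0)) >= memory_warn_ratio * maxmemory:
        alerts.append("used_memory: close to maxmemory")
    if int(info.get("connected_clients", 0)) >= clients_warn_ratio * maxclients:
        alerts.append("connected_clients: close to maxclients")
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        alerts.append("rdb_last_bgsave_status: last RDB save failed")
    if info.get("aof_last_write_status", "ok") != "ok":
        alerts.append("aof_last_write_status: last AOF write failed")
    # master_link_status is only reported on replicas (role:slave).
    if info.get("role") == "slave" and info.get("master_link_status") != "up":
        alerts.append("master_link_status: replica link to primary is down")
    return alerts
```

Feed it the text returned by `redis-cli INFO`; an empty list means none of these particular checks fired, not that the instance is healthy.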

Level 2: operational

Rate-based signals that distinguish “Redis is up” from “Redis is keeping up.”

Signal	Why it matters	Source
`instantaneous_ops_per_sec`	Throughput baseline. A sustained drop without traffic changes signals event loop blocking.	`INFO stats`
`keyspace_hits` / `keyspace_misses` rate	Cache effectiveness. A sustained drop after warmup means evictions or pattern shifts.	`INFO stats`
`evicted_keys` rate	A non-zero rate means the dataset no longer fits within `maxmemory`. For persistent workloads, any eviction is data loss.	`INFO stats`
`total_error_replies` rate	Write rejections, persistence failures, and ACL denials. Break down by error type with `INFO errorstats` where available.	`INFO stats`
`rejected_connections` rate	Any increase means clients are actively failing to connect.	`INFO stats`
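`mem_fragmentation_ratio`	Ratio of RSS to `used_memory`. Values above 1.5 indicate fragmentation waste; values below 1.0 on instances holding substantial data suggest memory has been swapped out.	`INFO memory`
`latest_fork_usec`	Duration of the last fork in microseconds. Fork blocks the main thread; values above 500,000 (500 ms) can cause client timeouts.	`INFO stats`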
Slowlog entry rate	Recurring slow commands block the single-threaded event loop for all clients.	`SLOWLOG LEN`, `SLOWLOG GET`
`expired_keys` rate	Sudden spikes indicate mass expiration events that can spike CPU.	`INFO stats`
`aof_delayed_fsync` rate	Disk I/O cannot keep up with `appendfsync everysec`. This is a precursor to write blocking.	`INFO persistence`
`connected_slaves`	A drop below the expected count means a replica is disconnected.	`INFO replication`
Replication offset lag	Byte lag between primary and replica. Lag approaching `repl-backlog-size` triggers full resyncs.	`INFO replication`
Network I/O rate	Asymmetric spikes or NIC saturation. Replication and Pub/Sub fan-out consume bandwidth.	`INFO stats`

Calculate cache hit rate as keyspace_hits / (keyspace_hits + keyspace_misses). A rate below your baseline after warmup indicates evictions or a pattern shift. Watch evicted_keys as a rate, not an absolute. Cache workloads tolerate some eviction; persistent workloads treat any eviction as data loss. When maxmemory-policy is noeviction, watch total_error_replies instead: writes fail with OOM errors while evicted_keys stays flat. A simultaneous climb in evicted_keys, keyspace_misses, and instantaneous_ops_per_sec signals a memory pressure spiral.
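As a rough sketch of the arithmetic above, the helpers below compute the hit rate and turn cumulative counters into per-second rates from two INFO samples. The function names and the dict-based sample shape are assumptions made for illustration.

```python
from typing import Optional


def hit_rate(hits: int, misses: int) -> Optional[float]:
    """keyspace_hits / (keyspace_hits + keyspace_misses); None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def counter_rate(prev: int, curr: int, interval_s: float) -> Optional[float]:
    """Per-second rate of a cumulative counter such as evicted_keys.

    Returns None when the counter went backwards (for example after a
    restart or CONFIG RESETSTAT), so a reset is not reported as a spike.
    """
    if curr < prev or interval_s <= 0:
        return None
    return (curr - prev) / interval_s


def window_hit_rate(prev: dict, curr: dict) -> Optional[float]:
    """Hit rate over the sampling window rather than since startup."""
    d_hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    d_misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if d_hits < 0 or d_misses < 0:
        return None
    return hit_rate(d_hits, d_misses)
```

Computing the hit rate over a window matters because the lifetime ratio moves very slowly on a long-running instance and can hide a recent drop.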

Level 3: mature

Internal latency decomposition, replication quality, and security signals. These catch composite patterns such as fork-induced cascades, backlog overflow loops, and memory pressure spirals.

Signal	Why it matters	Source
`LATENCY LATEST`	Internal latency by event type such as `command`, `fork`, and `aof-fsync-always`.	`LATENCY LATEST`, `LATENCY HISTORY`
Client output buffer memory	A single slow subscriber or forgotten `MONITOR` can consume gigabytes and OOM the instance.	`CLIENT LIST` (`omem`)
`blocked_clients`	Growing count without producers means consumers are stalling or crashing.	`INFO clients`
Keyspace key count trend	Unbounded growth signals missing TTLs, leaks, or abandoned features.	`INFO keyspace`
`used_memory_dataset` vs `used_memory_overhead`	Overhead growth from buffers or metadata can crowd out data before `maxmemory` is reached.	`INFO memory`
`sync_full` / `sync_partial_err`	Full resyncs are expensive. A rising `sync_partial_err` means partial resyncs are being refused, often because `repl-backlog-size` is too small.	`INFO stats`
COW size during fork	Copy-on-write memory consumed by the last fork (`rdb_last_cow_size`, `aof_last_cow_size`). Values above 50% of `used_memory` risk OOM on the next fork.	`INFO persistence`
AOF size ratio	`aof_current_size / aof_base_size` above 2 means rewrite is overdue or failing.	`INFO persistence`
Stream consumer group lag	Growing `lag` or `pending` means consumers cannot keep up with producers.	`XINFO GROUPS`
Cluster state and slot health	`cluster_state:fail` or non-zero `cluster_slots_fail` means active outage.	`CLUSTER INFO`
`ACL LOG`	Failed auth and authorization attempts.	`ACL LOG`
THP status	Transparent Huge Pages multiply fork latency. Should read `[never]`.	`/sys/kernel/mm/transparent_hugepage/enabled`
`MEMORY DOCTOR`	Built-in heuristic diagnostics for memory issues.	`MEMORY DOCTOR`

LATENCY LATEST requires latency-monitor-threshold > 0; the default is 0, which disables the subsystem. On replicas, master_link_status:up does not mean the replica is caught up: check offset lag. In Cluster mode, cluster_state:ok with cluster-require-full-coverage no masks partial failures, so monitor cluster_slots_fail and cluster_slots_pfail directly. If blocked_clients climbs, identify blocking consumers with CLIENT LIST and check producer health.
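Because master_link_status:up says nothing about how far behind a replica is, the offset lag has to be computed. The sketch below does this from the primary's `INFO replication` output, assuming the usual `slaveN:ip=...,port=...,state=...,offset=...,lag=...` line format. The function names and the 0.8 warning ratio are illustrative assumptions.

```python
def replica_lag_bytes(info_replication: str) -> dict:
    """Map "ip:port" of each replica to its byte lag behind master_repl_offset."""
    master_offset = None
    replica_offsets = {}
    for line in info_replication.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and "offset=" in value:
            # e.g. slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
            fields = dict(item.split("=", 1) for item in value.split(","))
            addr = f"{fields['ip']}:{fields['port']}"
            replica_offsets[addr] = int(fields["offset"])
    if master_offset is None:
        return {}
    return {addr: max(master_offset - off, 0) for addr, off in replica_offsets.items()}


def near_backlog_limit(lags: dict, repl_backlog_size: int,
                       warn_ratio: float = 0.8) -> list:
    """Replicas whose lag is within warn_ratio of the backlog size (assumed threshold)."""
    return [addr for addr, lag in lags.items() if lag >= warn_ratio * repl_backlog_size]
```

A replica flagged by `near_backlog_limit` is a candidate for a full resync if it disconnects, since the backlog may no longer cover the gap.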

WARNING: On instances with thousands of connections, CLIENT LIST can stall the event loop.
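To find the clients holding large output buffers, the text returned by CLIENT LIST can be parsed offline. The following is a minimal sketch, assuming the space-separated `key=value` line format; the function names and the default limit are hypothetical.

```python
def parse_client_list(text: str) -> list:
    """Parse CLIENT LIST output into one dict of string fields per client."""
    clients = []
    for line in text.splitlines():
        fields = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
        if fields:
            clients.append(fields)
    return clients


def top_output_buffers(clients: list, min_omem: int = 1, limit: int = 5) -> list:
    """Clients with the largest output buffer memory (omem, in bytes), biggest first."""
    def omem(client: dict) -> int:
        return int(client.get("omem", 0))

    heavy = [c for c in clients if omem(c) >= min_omem]
    return sorted(heavy, key=omem, reverse=True)[:limit]
```

Given the warning above, capture the CLIENT LIST output sparingly and do the sorting and filtering on the monitoring side rather than polling in a tight loop.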

Level 4: expert

Forensic signals that separate allocator noise from real leaks, identify resource-heavy keys, and catch encoding downgrades that silently inflate memory.

Signal	Why it matters	Source
`allocator_frag_ratio`, `allocator_rss_ratio`	Precise jemalloc fragmentation, isolating allocator waste from process overhead.	`INFO memory`
`active_defrag_running`, `active_defrag_hits/misses`	Defrag effectiveness. High misses with sustained CPU means the workload is fragmentation-prone.	`INFO memory`, `INFO stats`
`expired_time_cap_reached_count`	The expire cycle hit its CPU cap. Expiration is falling behind creation.	`INFO stats`
`used_cpu_user_main_thread`, `used_cpu_sys_main_thread`	Precise main-thread saturation. Available on Redis 6.2+.	`INFO cpu`
`io_threaded_reads_processed`, `io_threaded_writes_processed`	I/O thread activity. Confirms whether I/O threading is engaged.	`INFO stats`
`tracking_total_keys`, `tracking_total_items`	Client-side caching tracking table memory.	`INFO stats`
`MEMORY MALLOC-STATS`	Raw jemalloc arena and bin statistics.	`MEMORY MALLOC-STATS`
Per-client connection analysis	Match `addr`, `idle`, and `cmd` patterns to find connection leaks and stale monitoring clients.	`CLIENT LIST`
Big key analysis	A single large key can dominate latency and memory.	`redis-cli --bigkeys`, `MEMORY USAGE`
Key encoding analysis	Encoding conversions, such as `ziplist` (`listpack` in Redis 7.0 and later) to `hashtable`, can dramatically inflate memory.	`OBJECT ENCODING`
Kernel memory settings	`vm.overcommit_memory` must be 1 for fork reliability. `vm.swappiness` should be near 0.	`sysctl vm.overcommit_memory`, `sysctl vm.swappiness`
`DEBUG CHANGE-REPL-ID`	Unexpected invocation indicates manual replication topology changes.	`INFO commandstats` or logs

WARNING: redis-cli --bigkeys scans the full keyspace and can impact production latency. Run it against a replica or during low-traffic windows. MEMORY USAGE on large aggregate keys can block; test on a replica first.

Detect encoding downgrades by sampling keys with OBJECT ENCODING. A small hash that converts from ziplist (listpack in Redis 7.0 and later) to hashtable is a common memory surprise. Set vm.overcommit_memory to 1 so fork() does not fail under memory pressure, and keep vm.swappiness near 0. Disable Transparent Huge Pages; they are a common cause of fork-related latency spikes.
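The kernel settings above can be checked with a short script on a Linux host. This is a sketch under stated assumptions: the function names and the `max_swappiness` default of 10 (one reading of "near 0") are hypothetical, and the `/proc/sys/vm` paths are the standard procfs locations of the two sysctl values.

```python
import re
from pathlib import Path
from typing import Optional


def active_thp_mode(contents: str) -> Optional[str]:
    """Return the bracketed (active) mode from text like 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", contents)
    return match.group(1) if match else None


def kernel_findings(thp_contents: str, overcommit_memory: int, swappiness: int,
                    max_swappiness: int = 10) -> list:
    """List the settings that deviate from the guidance in this section."""
    findings = []
    if active_thp_mode(thp_contents) != "never":
        findings.append("transparent_hugepage is not [never]")
    if overcommit_memory != 1:
        findings.append("vm.overcommit_memory is not 1")
    if swappiness > max_swappiness:  # assumed threshold for "near 0"
        findings.append("vm.swappiness is above the chosen threshold")
    return findings


def read_kernel_settings() -> list:
    """Read the live values on a Linux host and evaluate them."""
    thp = Path("/sys/kernel/mm/transparent_hugepage/enabled").read_text()
    overcommit = int(Path("/proc/sys/vm/overcommit_memory").read_text())
    swappiness = int(Path("/proc/sys/vm/swappiness").read_text())
    return kernel_findings(thp, overcommit, swappiness)
```

Keeping the evaluation separate from the file reads makes it possible to test the logic against captured values from hosts you cannot run code on.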

How Netdata helps

Netdata derives rates from cumulative counters such as evicted_keys, sync_full, and total_commands_processed, preventing the common mistake of graphing raw absolutes. It charts used_memory, used_memory_rss, and mem_fragmentation_ratio together, making swap and fragmentation visible before the OOM killer fires.

It tracks latest_fork_usec alongside rdb_bgsave_in_progress and used_memory_rss to surface fork-induced latency cascades as composite events rather than isolated spikes. It tracks replication offset lag per replica and warns when lag approaches the configured repl-backlog-size, giving runway before a full resync triggers.

It surfaces slowlog entries and LATENCY LATEST events next to throughput drops so you can distinguish event loop blocking from traffic spikes. It ingests CLIENT LIST to visualize per-client output buffer memory, catching slow Pub/Sub subscribers and forgotten MONITOR sessions before they OOM the instance.
