Redis monitoring checklist: the signals every production instance needs

Redis can return PONG while replication is hours behind, while the OOM killer terminates a background save child, or in the gaps between KEYS commands that wedge the event loop. This checklist structures monitoring into four maturity levels. Level 1 is the survival floor. Level 2 adds workload and resource awareness. Level 3 introduces leading indicators that catch degradation before it becomes an incident. Level 4 exposes allocator and encoding internals for granular diagnostics.

Work through the levels in order. Most production incidents are preventable with Level 2 signals that teams never configure. All metrics below are available via standard Redis commands.

flowchart TD
    L1["Level 1 — survival"]
    L2["Level 2 — operational"]
    L3["Level 3 — mature"]
    L4["Level 4 — expert"]
    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1 — survival

These are binary signals. If any fail, you have an active incident or are one heartbeat away from one. Cover every production instance here before moving on.

PING response. A failed PING means the process is down, unreachable, or the event loop is frozen by a slow command or Lua script. Monitor from the same network path as your application to catch partition issues the node cannot detect itself.

Uptime in seconds. uptime_in_seconds resetting to a low value indicates a crash, OOM kill, or unplanned restart. Use it to gate other alerts during cold-start windows.
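
A minimal sketch of the restart check and cold-start gate described above, assuming uptime_in_seconds is polled at a regular interval. The function names and the 300-second warm-up window are illustrative assumptions, not Redis settings.

```python
def detect_restart(prev_uptime: int, curr_uptime: int) -> bool:
    """Return True when uptime_in_seconds went backwards between two polls."""
    return curr_uptime < prev_uptime


def in_cold_start(curr_uptime: int, warmup_seconds: int = 300) -> bool:
    """Return True while the instance is younger than the warm-up window.

    warmup_seconds is an assumed value; tune it to how long your cache
    normally takes to warm up.
    """
    return curr_uptime < warmup_seconds
```

A restart that happens between two widely spaced polls can go unnoticed if the new uptime already exceeds the previous sample, so keep the polling interval short relative to the time your instance takes to restart.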

Loading state. INFO persistence returns loading:1 while Redis restores an RDB or AOF file. During this phase the instance accepts connections but answers most commands, PING included, with a -LOADING error instead of serving data, so a probe that only checks for any reply can report it as healthy. Loading that exceeds your baseline duration indicates disk pressure or an unexpectedly large dataset.

Memory usage versus maxmemory. used_memory approaching maxmemory means eviction or write rejection is imminent. If maxmemory is unset (0), Redis grows until the OS OOM killer terminates it. Set maxmemory explicitly and choose a maxmemory-policy appropriate to your workload; persistent stores should use noeviction, while caches should use allkeys-lru or allkeys-lfu.
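
The comparison above can be sketched as follows, assuming you already have the raw text of INFO memory. The helper names parse_info and memory_utilization are hypothetical, and the sample text is invented for illustration.

```python
from typing import Optional


def parse_info(text: str) -> dict:
    """Parse INFO output ("key:value" lines, "#" section headers) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_utilization(info: dict) -> Optional[float]:
    """Return used_memory / maxmemory, or None when maxmemory is unset (0)."""
    maxmemory = int(info["maxmemory"])
    if maxmemory == 0:
        return None  # no limit configured, so the ratio is undefined
    return int(info["used_memory"]) / maxmemory


# Invented sample: 768 MiB used against a 1 GiB limit.
SAMPLE = """# Memory
used_memory:805306368
maxmemory:1073741824
maxmemory_policy:allkeys-lru
"""
```

Treating a None result as its own alert condition keeps an unset maxmemory from silently passing a percentage-based threshold.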

Connected clients versus maxclients. connected_clients approaching the effective maxclients limit means new connections will be rejected. The default maxclients is 10000. Replicas and cluster bus connections count toward the limit, and the OS file descriptor ceiling can cap you below the configured value.

Last RDB save status. rdb_last_bgsave_status must be ok. A status of err means your last backup failed. If stop-writes-on-bgsave-error is enabled, which is the default, writes are already being rejected.

Last AOF write status. aof_last_write_status must be ok when AOF is enabled. A failed AOF write means data is not being persisted and, depending on configuration, writes may soon be blocked.

Master link status (replicas only). master_link_status must be up. A down link means the replica is serving stale data and failover would lose everything written since the disconnect.

Level 2 — operational

These signals move from binary liveness to workload health. They answer whether Redis is doing useful work efficiently and whether resource limits are approaching. Add these only after Level 1 is fully covered.

Operations per second. instantaneous_ops_per_sec establishes your throughput baseline. A sustained drop without a traffic decrease often means the event loop is blocked by a slow command, Lua script, or fork. A spike can precede saturation or signal a traffic pattern shift that invalidates your baseline.

Keyspace hit rate. Compute from keyspace_hits / (keyspace_hits + keyspace_misses). A sustained drop below your baseline after the cold-start window indicates eviction pressure, mass expiry, or a cache stampede. Correlate with evicted_keys and expired_keys rates to distinguish capacity problems from TTL issues.
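
Because keyspace_hits and keyspace_misses are cumulative since startup, a ratio over the raw counters reacts slowly to recent changes. A minimal sketch that computes the hit rate over the interval between two samples instead; the function name and the dict shape (integer counters keyed by INFO field name) are assumptions.

```python
from typing import Optional


def hit_rate(prev: dict, curr: dict) -> Optional[float]:
    """Hit rate over the interval between two samples of INFO stats counters."""
    d_hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    d_misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if d_hits < 0 or d_misses < 0:
        return None  # counters went backwards: restart or CONFIG RESETSTAT
    total = d_hits + d_misses
    if total == 0:
        return None  # no lookups in the interval, so the rate is undefined
    return d_hits / total
```

For example, going from 1000 hits and 100 misses to 1900 hits and 200 misses gives 900 hits out of 1000 lookups in the interval.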

Evicted keys rate. evicted_keys is a cumulative counter. Any sustained rate of change means the working set exceeds memory. For persistent workloads, any eviction is data loss. For cache workloads, a sudden spike indicates capacity trouble.

Rejected connections rate. rejected_connections incrementing means clients are actively failing to connect. Any increase is an active incident.

Memory fragmentation ratio. mem_fragmentation_ratio above 1.5 indicates wasted memory that brings OOM closer. A ratio below 0.8 on an instance with more than 100 MB of data indicates Redis is swapping, which destroys latency. Confirm with host-level swap metrics.
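
The thresholds above can be encoded as a small classifier. This is a sketch under the article's own heuristics (1.5, 0.8, and 100 MB); the function name and return labels are hypothetical.

```python
def classify_fragmentation(ratio: float, used_memory: int) -> str:
    """Classify mem_fragmentation_ratio using the heuristics from the text."""
    min_bytes_for_swap_check = 100 * 1024 * 1024  # 100 MB, per the text
    if ratio > 1.5:
        return "fragmented"
    if ratio < 0.8 and used_memory > min_bytes_for_swap_check:
        return "possible-swap"  # confirm with host-level swap metrics
    return "ok"
```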

Latest fork duration. latest_fork_usec is reported in microseconds. A value above 500000 (500 ms) means clients experienced a noticeable freeze during the last RDB save or AOF rewrite. Above 1000000 (1 second), client timeouts are likely. High fork latency on Linux often means Transparent Huge Pages is enabled.

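Slowlog growth. New slowlog entries mean expensive commands are blocking the event loop. SLOWLOG LEN stops growing once it reaches slowlog-max-len (128 by default), so track the entry IDs returned by SLOWLOG GET, or call SLOWLOG RESET after each collection, rather than relying on length alone. Review SLOWLOG GET to identify specific offenders such as KEYS, SMEMBERS, or unoptimized Lua scripts. Set slowlog-log-slower-than, which is expressed in microseconds, to a threshold that captures queries above your application’s tolerance, typically 10000 (10 ms) for user-facing workloads.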
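
One way to detect new slowlog entries between polls is to compare entry IDs, which only ever increase. The sketch below assumes each entry is a sequence shaped like a SLOWLOG GET reply element: ID first, then timestamp, duration in microseconds, and the argument list. The function name is hypothetical.

```python
def new_slowlog_entries(entries, last_seen_id: int) -> list:
    """Return entries with an ID above last_seen_id, oldest first."""
    fresh = [entry for entry in entries if entry[0] > last_seen_id]
    return sorted(fresh, key=lambda entry: entry[0])
```

Store the highest ID you have processed and pass it as last_seen_id on the next poll.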

Expired keys rate. expired_keys rate spikes indicate mass expiry events. Combine this with expired_time_cap_reached_count if available to detect when the active expiry cycle cannot keep up.

Connected replica count. connected_slaves on the primary must match your expected topology. A drop means a replica disconnected, which may trigger a full resync and another fork.

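Replication offset lag. On the primary, subtract each replica’s offset (the offset field of its slaveN line in INFO replication) from master_repl_offset. On a replica, compare its slave_repl_offset with the primary’s master_repl_offset. The result is in bytes. Lag approaching repl-backlog-size means the next reconnection will force a full resync, with associated fork latency and bandwidth.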
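
A minimal sketch of the lag calculation as seen from the primary, assuming you have the raw text of INFO replication. The function name replica_lags is hypothetical; the slaveN line layout follows the INFO output format.

```python
def replica_lags(info_replication: str) -> dict:
    """Map each replica's "ip:port" to its lag in bytes behind master_repl_offset."""
    master_offset = None
    replica_offsets = {}
    for line in info_replication.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():
            # value looks like: ip=10.0.0.2,port=6379,state=online,offset=123,lag=0
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            replica_offsets[f'{attrs["ip"]}:{attrs["port"]}'] = int(attrs["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO replication text")
    return {addr: master_offset - off for addr, off in replica_offsets.items()}
```

Comparing each value with repl_backlog_size from the same INFO section shows how close a replica is to needing a full resync after its next disconnect.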

Network throughput. instantaneous_input_kbps and instantaneous_output_kbps track bandwidth utilization. Asymmetric output spikes can indicate Pub/Sub fan-out, replication pressure, or a forgotten MONITOR session.

Level 3 — mature

These are leading indicators. They detect problems before they trigger Level 1 alarms. They require more instrumentation but separate stable workloads from incidents waiting to happen.

Internal latency events. LATENCY LATEST breaks down delay by event type: command, fork, AOF fsync, and eviction cycle. Enable this by setting latency-monitor-threshold to a low millisecond value in redis.conf or via CONFIG SET. The default of 0 disables it entirely.

Client output buffer memory. CLIENT LIST exposes omem per client. A single client consuming hundreds of megabytes indicates a slow subscriber, a forgotten MONITOR, or an application that cannot read responses fast enough. If omem grows while throughput is flat, the client is likely dead.
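
The per-client check above can be sketched by parsing CLIENT LIST text, which prints one client per line as space-separated key=value fields. The function names and the 100 MB threshold are assumptions for illustration.

```python
def parse_client_list(text: str) -> list:
    """Parse CLIENT LIST output into a list of dicts of string fields."""
    clients = []
    for line in text.splitlines():
        fields = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
        if fields:
            clients.append(fields)
    return clients


def heavy_output_buffers(clients: list, threshold_bytes: int = 100 * 1024 * 1024) -> list:
    """Return clients whose omem (output buffer memory) meets the threshold."""
    return [c for c in clients if int(c.get("omem", 0)) >= threshold_bytes]
```

The addr, name, and cmd fields of each returned client usually identify the application and the command responsible.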

Blocked clients count. blocked_clients is expected for queue patterns using BLPOP or XREAD BLOCK. A sustained increase outside your baseline suggests dead consumers or a failed producer.

Keyspace growth trend. INFO keyspace key counts should follow a predictable pattern. Linear or exponential growth without TTL coverage indicates a memory leak or unbounded key creation.
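
TTL coverage can be derived from the keys and expires fields that INFO keyspace reports per database. A minimal sketch, assuming the raw INFO keyspace text as input; the function name ttl_coverage is hypothetical.

```python
def ttl_coverage(info_keyspace: str) -> dict:
    """Map each database name to the fraction of its keys that carry a TTL."""
    coverage = {}
    for line in info_keyspace.splitlines():
        name, _, value = line.strip().partition(":")
        if not name.startswith("db") or not value:
            continue
        # value looks like: keys=1000,expires=250,avg_ttl=3600000
        attrs = dict(pair.split("=", 1) for pair in value.split(","))
        keys = int(attrs["keys"])
        coverage[name] = int(attrs["expires"]) / keys if keys else 0.0
    return coverage
```

A database whose key count grows while its coverage falls is the pattern to investigate first.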

Replication sync quality. sync_full, sync_partial_ok, and sync_partial_err in INFO stats reveal whether replicas are reconnecting cleanly. Every new replica performs one full sync when it first attaches, so an increase in sync_full beyond that, especially alongside a rising sync_partial_err, usually means the replication backlog is too small. Increase repl-backlog-size if your replicas reconnect frequently enough to exhaust the current buffer.

Copy-on-write memory. rdb_last_cow_size and aof_last_cow_size measure the memory cost of persistence forks. COW exceeding 50 percent of used_memory predicts an OOM kill during the next write spike. If COW grows while write volume is flat, check for large key deletions that dirty pages.

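AOF size ratio. Automatic rewrites are governed by auto-aof-rewrite-percentage, which expresses growth over aof_base_size; at the default of 100, a rewrite should trigger once aof_current_size reaches twice aof_base_size. A ratio of aof_current_size / aof_base_size that keeps growing past 1 + auto-aof-rewrite-percentage / 100 means AOF rewrites are failing or not triggering. Check aof_rewrite_in_progress and aof_last_rewrite_status to confirm whether a rewrite is stuck, allowing the file to grow unboundedly.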
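
A sketch of the growth check, assuming sizes in bytes from INFO persistence and the redis.conf defaults of 100 for auto-aof-rewrite-percentage and 64 MB for auto-aof-rewrite-min-size. The function name is hypothetical.

```python
def aof_past_rewrite_threshold(
    current_size: int,
    base_size: int,
    rewrite_percentage: int = 100,
    min_size: int = 64 * 1024 * 1024,
) -> bool:
    """Return True when AOF growth exceeds the automatic rewrite threshold."""
    if rewrite_percentage == 0:
        return False  # automatic rewrite is disabled
    if base_size == 0 or current_size < min_size:
        return False  # nothing to compare against, or file still below the minimum
    growth_pct = (current_size - base_size) * 100 / base_size
    return growth_pct > rewrite_percentage
```

A True result on several consecutive polls with aof_rewrite_in_progress at 0 is a stronger signal than a single sample, since a rewrite may simply be about to start.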

Error statistics. INFO errorstats (Redis 6.2+) breaks down total_error_replies by type. A rising errorstat_OOM count under a noeviction policy confirms writes are being rejected due to memory pressure.

Stream consumer group lag. XINFO GROUPS exposes lag (Redis 7.0+) and pending counts. Growing lag means consumers cannot keep up. Growing pending means entries are delivered but never acknowledged.

Cluster state and slot health. CLUSTER INFO must show cluster_state:ok with all 16384 slots assigned and healthy. Any non-zero cluster_slots_fail is an active outage. Non-zero cluster_slots_pfail is an impending one.
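
The checks above map directly onto fields of CLUSTER INFO. A minimal sketch that takes the raw reply text and returns a list of findings; the function name and message strings are hypothetical.

```python
def cluster_problems(cluster_info: str) -> list:
    """Return human-readable problems found in CLUSTER INFO output."""
    fields = {}
    for line in cluster_info.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value

    problems = []
    if fields.get("cluster_state") != "ok":
        problems.append("cluster_state is not ok")
    if int(fields.get("cluster_slots_assigned", 0)) != 16384:
        problems.append("not all 16384 slots are assigned")
    if int(fields.get("cluster_slots_fail", 0)) > 0:
        problems.append("slots in FAIL state: active outage")
    if int(fields.get("cluster_slots_pfail", 0)) > 0:
        problems.append("slots in PFAIL state: impending outage")
    return problems
```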

ACL log. ACL LOG (Redis 6.0+) records authentication and authorization failures. Unexplained entries indicate credential rotation gaps or unauthorized access attempts.

Configuration drift. Audit CONFIG GET * against your expected configuration. Changes made via CONFIG SET without CONFIG REWRITE are lost on restart and can mask the root cause of an incident.
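
The audit can be as simple as a dictionary comparison. This sketch assumes the CONFIG GET * reply has already been turned into a dict of parameter name to string value, and that expected holds only the parameters you care about; the function name is hypothetical.

```python
def config_drift(expected: dict, actual: dict) -> dict:
    """Map each drifted parameter to a (expected, actual) pair.

    A parameter missing from actual is reported with None as its actual value.
    """
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }
```

Comparing only a curated set of parameters avoids noise from values that legitimately differ between hosts.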

Level 4 — expert

These signals expose internal allocator, encoding, and per-client behavior that aggregate metrics hide. Use them when standard dashboards look healthy but sporadic latency or memory growth persists.

Allocator fragmentation ratios. allocator_frag_ratio and allocator_rss_ratio (Redis 4.0+) separate true jemalloc fragmentation from process overhead. Use them when mem_fragmentation_ratio is ambiguous.

Active defrag effectiveness. active_defrag_running, active_defrag_hits, and active_defrag_misses show whether defragmentation is reducing waste or just burning CPU.

Expiry cycle throttling. expired_time_cap_reached_count (Redis 6.0+) increments when the active expiry cycle hits its CPU budget. If this grows, expired keys are accumulating faster than Redis can clean them.

Main thread CPU. used_cpu_user_main_thread and used_cpu_sys_main_thread (Redis 6.2+) isolate command execution CPU from child process and I/O thread usage. Track the rate of change to detect single-core saturation precisely.
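
Both fields are cumulative CPU seconds, so utilization comes from the change between two samples divided by the wall-clock time between them. A minimal sketch, assuming the values have been parsed to floats; the function name is hypothetical.

```python
def main_thread_cpu_utilization(prev: dict, curr: dict, elapsed_seconds: float) -> float:
    """Fraction of one core used by the main thread between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    used = (
        (curr["used_cpu_user_main_thread"] - prev["used_cpu_user_main_thread"])
        + (curr["used_cpu_sys_main_thread"] - prev["used_cpu_sys_main_thread"])
    )
    return used / elapsed_seconds
```

A result approaching 1.0 means the main thread is saturating a single core, regardless of how idle the host looks overall.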

I/O thread activity. io_threaded_reads_processed and io_threaded_writes_processed (Redis 6.0+) confirm whether I/O threading is actually engaging under load. Zero values during high throughput indicate a configuration or workload mismatch.

Client-side caching tracking. tracking_total_keys and tracking_total_items (Redis 6.0+) measure the memory overhead of client-side caching invalidation tables. Large values add hidden memory pressure.

jemalloc statistics. MEMORY MALLOC-STATS exposes arena-level fragmentation, dirty pages, and retained memory. Use this when standard ratios do not explain RSS growth. Look for high dirty and retained pages across arenas, which indicate allocator retention rather than true process fragmentation.

Per-client leak analysis. Track CLIENT LIST over time to identify connections with monotonically increasing idle time or address patterns that never close. Match these to application deploys to find leaking pools.

Big key analysis. Run redis-cli --bigkeys for a first pass; it ranks keys by element count or string length rather than by bytes. It uses SCAN and adds read load, so run it during low traffic or against a replica. A single large sorted set or hash can dominate latency and memory while aggregate metrics look healthy. For precise sizing in bytes, call MEMORY USAGE on the specific keys it surfaces or on a sample of your keyspace.

Key encoding analysis. OBJECT ENCODING samples reveal when Redis transitions compact encodings to generic ones, such as ziplist (listpack in Redis 7.0 and later) to hashtable. These transitions cause step-function memory growth that aggregate counters smooth over.

How Netdata helps

Netdata auto-discovers Redis instances and collects INFO, SLOWLOG, and LATENCY metrics. Use it to:

  • Correlate used_memory with used_memory_rss to spot fragmentation or swap before the OOM killer fires.
  • Visualize replication offset lag per replica alongside master_link_status to distinguish transient blips from growing divergence.
  • Alert on latest_fork_usec spikes tied to rdb_bgsave_in_progress or aof_rewrite_in_progress to separate persistence latency from runaway commands.
  • Track instantaneous_ops_per_sec against main-thread CPU to identify single-core saturation before client timeouts.
  • Surface stream consumer group lag and cluster slot health without custom scripting.