Redis OOM-killed by the kernel: RSS, overcommit, and recovery

Redis reports used_memory at 60% of maxmemory, then disappears. The container status is OOMKilled, or dmesg shows the kernel OOM killer selected redis-server. The kernel enforces resident memory (RSS), while used_memory and maxmemory track logical allocator state. Fragmentation, copy-on-write pages during persistence, and client buffers inflate RSS above the logical figure most operators monitor. When RSS hits the host or cgroup memory ceiling, the kernel terminates the process even though Redis believes it is within limits.

During a background save or AOF rewrite, Redis forks a child process. Under Linux copy-on-write semantics, the child shares pages with the parent until either modifies them. On a write-heavy instance, every page the parent modifies is physically duplicated, so the combined resident memory of parent and child can temporarily approach twice the normal working set. If the system or container is sized only for logical used_memory, an OOM kill becomes likely under write load. Tuning eviction policies does not address this; size the host for the physical memory Redis actually occupies.

What this means

The kernel OOM killer targets processes by RSS, not Redis used_memory. RSS includes allocator fragmentation, shared libraries, client output buffers, and copy-on-write dirty pages. A Redis instance reporting used_memory of 2.75 GB can have used_memory_rss of 4.12 GB and be killed because the kernel sees 4.12 GB resident.

maxmemory does not protect against kernel OOM kills. It caps Redis’s logical allocator, but the OOM killer acts on physical RSS. If fragmentation or COW doubles RSS, the kernel may kill the process before maxmemory is reached. The gap between used_memory and used_memory_rss is the danger zone.
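
As a rough illustration of that gap, the sketch below computes used_memory_rss divided by used_memory from INFO memory text. The helper name rss_gap is hypothetical (it is not a Redis command), and the function only parses text on stdin, so it makes no assumption about how you reach the instance.

```shell
#!/bin/sh
# rss_gap: hypothetical helper, not part of Redis.
# Reads INFO memory text on stdin and prints used_memory_rss / used_memory.
rss_gap() {
    awk -F: '
        { sub(/\r$/, "") }                      # INFO lines end in CRLF
        $1 == "used_memory"     { used = $2 + 0 }
        $1 == "used_memory_rss" { rss  = $2 + 0 }
        END {
            if (used > 0) {
                printf "%.2f\n", rss / used
            } else {
                print "used_memory missing or zero" > "/dev/stderr"
                exit 1
            }
        }'
}

# Usage against a live instance (not executed here):
#   redis-cli INFO memory | rss_gap
```

With the figures used above (2.75 GB logical, 4.12 GB resident) the ratio comes out at about 1.5, which is the point where the kernel's view and Redis's view have clearly diverged.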

flowchart TD
    A[Fork for BGSAVE or AOF rewrite] --> B[COW duplicates dirty pages]
    B --> C[used_memory_rss spikes]
    C --> D{Memory limit reached?}
    D -->|Yes| E[Kernel OOM killer]
    E --> F[Redis terminated]
    F --> G[Restart with cold cache]
    G --> H[Client thundering herd]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Allocator fragmentation | mem_fragmentation_ratio sustained above 1.5, stable used_memory, climbing used_memory_rss | redis-cli INFO memory for mem_fragmentation_ratio and allocator_frag_ratio |
| COW bloat during persistence | RSS doubles while rdb_bgsave_in_progress or aof_rewrite_in_progress is 1 | redis-cli INFO persistence for rdb_last_cow_size or aof_last_cow_size |
| vm.overcommit_memory = 0 | Fork fails or succeeds without headroom for COW pages; Redis logs “Cannot allocate memory” or the kernel OOM killer fires mid-save | cat /proc/sys/vm/overcommit_memory |
| Client output buffer accumulation | A single client or replica consumes hundreds of megabytes; used_memory climbs slowly but RSS jumps | redis-cli CLIENT LIST and inspect omem values |
| Transparent Huge Pages enabled | Fork latency spikes and COW copies 2 MB pages instead of 4 KB, amplifying RSS | cat /sys/kernel/mm/transparent_hugepage/enabled |

Quick checks

# Logical vs physical memory
redis-cli INFO memory | grep -E "used_memory:|used_memory_rss:"

# Fragmentation ratio
redis-cli INFO memory | grep mem_fragmentation_ratio

# Kernel OOM evidence
sudo dmesg | grep -i "killed process"

# Overcommit policy
cat /proc/sys/vm/overcommit_memory

# THP status
cat /sys/kernel/mm/transparent_hugepage/enabled

# Persistence activity and recent COW size
redis-cli INFO persistence | grep -E "cow_size|bgsave_in_progress|rewrite_in_progress"

# Recent fork latency
redis-cli INFO stats | grep latest_fork_usec

# Client buffer bloat
redis-cli CLIENT LIST | awk -F'[= ]' '{for(i=1;i<=NF;i++) if($i=="omem") print $(i+1)}' | sort -rn | head -20

# Configured memory limit
redis-cli CONFIG GET maxmemory

How to diagnose it

  1. Confirm a kernel OOM event. Check dmesg or /var/log/kern.log for Killed process near the time of death. Compare with uptime_in_seconds from INFO server; a sudden reset confirms a restart.
  2. Measure the RSS-to-logical gap. Run redis-cli INFO memory and compare used_memory_rss to used_memory. A ratio above 1.5 on an instance with more than 100 MB of data signals meaningful overhead.
  3. Identify COW as the trigger. Check rdb_last_cow_size and aof_last_cow_size in INFO persistence. If either exceeds 50% of used_memory, the last fork duplicated enough pages to push RSS toward the limit. On Redis 7.0+, monitor current_cow_peak during active forks.
  4. Check vm.overcommit_memory. If it is 0 (the default), the kernel's heuristic can refuse a fork unless it judges there is enough free memory to duplicate the parent's pages. The result is either a failed fork or a fork that proceeds with little margin for COW growth, making mid-operation OOM kills more likely.
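  5. Audit client buffers. Run CLIENT LIST and look for large omem values. The default client-output-buffer-limit normal 0 0 0 means no limit for normal clients, so a client that is slow to read large replies or a forgotten MONITOR session can consume gigabytes of RSS. Pub/sub subscribers fall under the separate pubsub class, which is limited by default (32mb 8mb 60).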
  6. Inspect THP status. If /sys/kernel/mm/transparent_hugepage/enabled is not [never], a single-byte write during a fork can duplicate an entire 2 MB huge page instead of a 4 KB standard page, multiplying COW overhead.
  7. Distinguish from Redis-level OOM. If maxmemory was reached with a noeviction policy, Redis returns -OOM errors tracked in errorstat_OOM (Redis 6.2+) rather than being killed by the kernel. Kernel OOM and Redis OOM require different fixes.
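
Step 3 can be sketched as a small filter. The helper below, cow_share (a hypothetical name), takes used_memory in bytes as its argument, reads INFO persistence text on stdin, and reports the larger of rdb_last_cow_size and aof_last_cow_size as a percentage of used_memory. The 50% cut-off is the one from step 3; the labels "high" and "ok" are only illustrative.

```shell
#!/bin/sh
# cow_share USED_MEMORY_BYTES   (hypothetical helper, reads INFO persistence on stdin)
# Prints "<percent>% high" when the last COW size exceeded 50% of used_memory,
# otherwise "<percent>% ok".
cow_share() {
    awk -F: -v used="$1" '
        { sub(/\r$/, "") }                      # strip CR from CRLF line endings
        $1 == "rdb_last_cow_size" || $1 == "aof_last_cow_size" {
            if ($2 + 0 > cow) cow = $2 + 0      # keep the larger of the two
        }
        END {
            if (used + 0 <= 0) {
                print "need used_memory > 0" > "/dev/stderr"
                exit 2
            }
            pct = 100 * cow / used
            printf "%.0f%% %s\n", pct, (pct > 50 ? "high" : "ok")
        }'
}

# Usage against a live instance (not executed here):
#   used=$(redis-cli INFO memory | awk -F: '$1 == "used_memory" { print $2 + 0 }')
#   redis-cli INFO persistence | cow_share "$used"
```

Both COW fields describe the most recent completed fork, so a low value after a quiet period does not rule out a large spike during peak write traffic.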

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| used_memory_rss / total_system_memory | Kernel OOM killer uses RSS, not logical memory | > 0.75 on persistent instances that fork |
| mem_fragmentation_ratio | Allocator fragmentation inflates RSS silently | Sustained > 1.5 on instances > 100 MB |
| rdb_last_cow_size / aof_last_cow_size | Pages duplicated during the last fork | > 50% of used_memory |
| latest_fork_usec | Long forks block the event loop and correlate with heavy COW | > 500,000 µs (500 ms) |
| current_cow_peak (Redis 7.0+) | Real-time COW memory during an active fork | Approaching container or host limit |
| allocator_frag_ratio (Redis 4.0+) | Isolates allocator waste from process overhead | Sustained > 1.5 |
| Client omem | Output buffers are allocated from the heap and count toward RSS | Any single client > 256 MB |
| uptime_in_seconds | Detects unexpected restarts after OOM kills | Sudden drop or reset |
| errorstat_OOM (Redis 6.2+) | Distinguishes Redis-level OOM errors from kernel kills | Non-zero rate |
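
Two of these thresholds are easy to evaluate from a single INFO snapshot. The sketch below uses a hypothetical helper, check_signals, that reads INFO text on stdin and prints a line for each of mem_fragmentation_ratio and latest_fork_usec that crosses the warning value from the table. A single sample cannot show that a ratio is sustained, so treat the output as a prompt to look at the trend, not as an alert on its own.

```shell
#!/bin/sh
# check_signals: hypothetical helper, reads INFO text (any sections) on stdin.
# Prints one WARN line per signal over the thresholds listed in the table.
check_signals() {
    awk -F: '
        { sub(/\r$/, "") }                      # strip CR from CRLF line endings
        $1 == "mem_fragmentation_ratio" && $2 + 0 > 1.5 {
            print "WARN mem_fragmentation_ratio=" $2
        }
        # latest_fork_usec is in microseconds; 500 ms is 500000 usec
        $1 == "latest_fork_usec" && $2 + 0 > 500000 {
            printf "WARN latest_fork_usec=%d (%.0f ms)\n", $2, $2 / 1000
        }'
}

# Usage against a live instance (not executed here):
#   redis-cli INFO | check_signals
```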

Fixes

Reduce RSS from fragmentation

Run MEMORY PURGE to return jemalloc dirty pages to the OS. This reduces RSS briefly but cannot defragment live objects. For persistent fragmentation, enable activedefrag yes (Redis 4.0+) to compact live allocations in the background. Both depend on the jemalloc allocator. Active defrag consumes main-thread CPU; cap active-defrag-cycle-max to limit its CPU share and avoid latency spikes.

Right-size for COW during persistence

For instances with RDB or AOF enabled, keep used_memory below 50% of physical RAM so a worst-case COW spike does not hit the ceiling. If you cannot add memory, disable automatic save directives and schedule BGSAVE during low-traffic windows. Tradeoff: wider RPO and manual operational burden. Alternatively, set appendonly no if AOF rewrite COW is the primary trigger, though this sacrifices AOF durability.
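
The sizing rule can be checked with simple arithmetic. The sketch below assumes a deliberately simplified model: peak memory during a fork is the current RSS plus the share of pages rewritten while the child runs. The helper name cow_headroom and the dirty-percentage input are assumptions for illustration; the real figure depends on write patterns and page size.

```shell
#!/bin/sh
# cow_headroom RSS_BYTES DIRTY_PERCENT LIMIT_BYTES   (hypothetical helper)
# Simplified model: peak = rss * (1 + dirty/100), where DIRTY_PERCENT is the
# assumed share of pages modified during the fork (100 is the worst case).
cow_headroom() {
    awk -v rss="$1" -v dirty="$2" -v limit="$3" 'BEGIN {
        peak = rss * (1 + dirty / 100)
        printf "%.0f %s\n", peak, (peak <= limit ? "fits" : "exceeds")
    }'
}

# Example: 4 GB resident, worst-case COW, against 8 GB and 6 GB limits
#   cow_headroom 4000000000 100 8000000000
#   cow_headroom 4000000000 100 6000000000
```

In this model, keeping the data set at or below half of the limit is what makes the worst case fit, which matches the 50% guideline above.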

Fix overcommit and THP

Set vm.overcommit_memory = 1 so fork() succeeds without requiring free RAM equal to the parent’s RSS. You must rely on your own capacity planning rather than the kernel’s heuristic.

Disable Transparent Huge Pages:

# Warning: run as root. Applies immediately but resets on reboot.
# Persist via init scripts or systemd tmpfiles to survive reboot.
echo never > /sys/kernel/mm/transparent_hugepage/enabled

Tradeoff: slightly higher TLB pressure for some workloads, but fork latency improves and COW page granularity drops from 2 MB to 4 KB.
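
Both settings can be verified in one pass, for example at provisioning time. The sketch below is a hypothetical helper, check_host_tuning, that reads the two kernel files named above and reports whether each has the recommended value. The optional path arguments exist only so the function can be pointed at sample files; by default it reads the real /proc and /sys locations.

```shell
#!/bin/sh
# check_host_tuning [OVERCOMMIT_FILE] [THP_FILE]   (hypothetical helper)
# Reports whether vm.overcommit_memory is 1 and THP is set to "never".
check_host_tuning() {
    oc_file=${1:-/proc/sys/vm/overcommit_memory}
    thp_file=${2:-/sys/kernel/mm/transparent_hugepage/enabled}

    oc=$(cat "$oc_file")
    # The active THP mode is the bracketed word, e.g. "always madvise [never]"
    thp=$(sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$thp_file")

    if [ "$oc" = "1" ]; then
        echo "overcommit_memory: ok"
    else
        echo "overcommit_memory: $oc (want 1)"
    fi
    if [ "$thp" = "never" ]; then
        echo "thp: ok"
    else
        echo "thp: $thp (want never)"
    fi
}

# Usage on a Redis host (not executed here):
#   check_host_tuning
```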

Contain client buffers

Set explicit output buffer limits for normal clients instead of the default unlimited:

# The class and its three limits must be passed as one quoted value
redis-cli CONFIG SET client-output-buffer-limit "normal 64mb 32mb 60"

Add the same directive to redis.conf to survive restart.

Tradeoff: slow clients are forcibly disconnected. Audit CLIENT LIST for MONITOR sessions, which copy every command to an output buffer and can OOM an instance within minutes under load.
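
To find the clients worth auditing, the sketch below filters CLIENT LIST text by output buffer size. big_clients is a hypothetical helper that takes a byte threshold and prints the id, address, and omem of each client above it. It relies only on the key=value layout of CLIENT LIST lines.

```shell
#!/bin/sh
# big_clients THRESHOLD_BYTES   (hypothetical helper, reads CLIENT LIST on stdin)
# Prints "id addr omem" for every client whose omem exceeds the threshold.
big_clients() {
    awk -v min="$1" '
        {
            id = ""; addr = ""; omem = 0
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")              # each field is key=value
                if (kv[1] == "id")   id = kv[2]
                if (kv[1] == "addr") addr = kv[2]
                if (kv[1] == "omem") omem = kv[2] + 0
            }
            if (omem > min + 0) print id, addr, omem
        }'
}

# Usage against a live instance, 256 MB threshold (not executed here):
#   redis-cli CLIENT LIST | big_clients 268435456
```

Check the cmd field of any client this reports; a value of monitor points to a MONITOR session that should be closed.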

Prevention

  • Set maxmemory to leave headroom for RSS overhead, not just logical data. On persistent instances, treat 50% of available RAM as the practical ceiling for used_memory.
  • Monitor used_memory_rss and alert on it approaching the host or cgroup limit. Do not rely solely on used_memory or maxmemory ratio alerts.
  • Size repl-backlog-size to at least 100 MB to avoid replica disconnections that trigger full resyncs and additional forks.
  • Keep vm.overcommit_memory=1 and THP disabled on all Redis hosts. Verify both at provisioning time and after kernel upgrades.
  • Run redis-cli --bigkeys or MEMORY USAGE sampling periodically to catch single keys that disproportionately expand the dataset and COW cost.

How Netdata helps

  • Collects used_memory_rss, used_memory, and mem_fragmentation_ratio from INFO memory, correlating them with system RAM and container cgroup metrics to expose the gap that leads to kernel OOM kills.
  • Tracks rdb_last_cow_size, aof_last_cow_size, and latest_fork_usec to correlate persistence events with RSS spikes.
  • The redis.instance_available alarm triggers on uptime_in_seconds resets, surfacing OOM-killed restarts immediately.
  • Surfaces RSS-based memory usage alongside logical allocator memory, making fragmentation and COW bloat visible before the kernel intervenes.
  • On Kubernetes, monitors container memory.working_set and memory.limit, distinguishing cgroup OOM from global kernel OOM.