Redis OOM-killed by the kernel: RSS, overcommit, and recovery
Redis reports used_memory at 60% of maxmemory, then disappears. The container status is OOMKilled, or dmesg shows the kernel OOM killer selected redis-server. The kernel enforces resident memory (RSS), while used_memory and maxmemory track logical allocator state. Fragmentation, copy-on-write pages during persistence, and client buffers inflate RSS above the logical figure most operators monitor. When RSS hits the host or cgroup memory ceiling, the kernel terminates the process even though Redis believes it is within limits.
During a background save or AOF rewrite, Redis forks a child process. Under Linux copy-on-write semantics, the child shares pages with the parent until one of them writes to a page. On a write-heavy instance, each modified page is physically duplicated, which in the worst case pushes resident memory to roughly twice the normal working set. If the system or container is sized only for logical used_memory, an OOM kill during persistence is likely. Do not tweak eviction policies; size the host for the physical memory Redis actually occupies.
What this means
The kernel OOM killer targets processes by RSS, not Redis used_memory. RSS includes allocator fragmentation, shared libraries, client output buffers, and copy-on-write dirty pages. A Redis instance reporting used_memory of 2.75 GB can have used_memory_rss of 4.12 GB and be killed because the kernel sees 4.12 GB resident.
maxmemory does not protect against kernel OOM kills. It caps Redis’s logical allocator, but the OOM killer acts on physical RSS. If fragmentation or COW doubles RSS, the kernel may kill the process before maxmemory is reached. The gap between used_memory and used_memory_rss is the danger zone.
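To put a number on that gap, here is a minimal sketch of a shell helper that reads INFO memory output on standard input and prints the ratio of used_memory_rss to used_memory. The function name rss_gap is hypothetical, and the sketch assumes a POSIX shell with awk and that redis-cli output lines end in CRLF.

```shell
# rss_gap: hypothetical helper, not part of Redis.
# Usage: redis-cli INFO memory | rss_gap
rss_gap() {
  awk -F: '
    { sub(/\r$/, "") }                       # strip the CR that INFO lines carry
    $1 == "used_memory"     { used = $2 }    # logical allocator bytes
    $1 == "used_memory_rss" { rss  = $2 }    # resident bytes seen by the kernel
    END {
      if (used + 0 <= 0) { print "used_memory not found" > "/dev/stderr"; exit 1 }
      printf "used=%.0f rss=%.0f ratio=%.2f\n", used, rss, rss / used
    }'
}
```

A ratio that stays well above 1 while used_memory is flat suggests that the overhead is physical rather than logical.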
flowchart TD
A[Fork for BGSAVE or AOF rewrite] --> B[COW duplicates dirty pages]
B --> C[used_memory_rss spikes]
C --> D{Memory limit reached?}
D -->|Yes| E[Kernel OOM killer]
E --> F[Redis terminated]
F --> G[Restart with cold cache]
G --> H[Client thundering herd]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Allocator fragmentation | mem_fragmentation_ratio sustained above 1.5, stable used_memory, climbing used_memory_rss | redis-cli INFO memory for mem_fragmentation_ratio and allocator_frag_ratio |
| COW bloat during persistence | RSS doubles while rdb_bgsave_in_progress or aof_rewrite_in_progress is 1 | redis-cli INFO persistence for rdb_last_cow_size or aof_last_cow_size |
| vm.overcommit_memory = 0 | Fork fails or succeeds without headroom for COW pages; Redis logs “Cannot allocate memory” or the kernel OOM killer fires mid-save | cat /proc/sys/vm/overcommit_memory |
| Client output buffer accumulation | A single client or replica consumes hundreds of megabytes; used_memory climbs slowly but RSS jumps | redis-cli CLIENT LIST and inspect omem values |
| Transparent Huge Pages enabled | Fork latency spikes and COW copies 2 MB pages instead of 4 KB, amplifying RSS | cat /sys/kernel/mm/transparent_hugepage/enabled |
Quick checks
# Logical vs physical memory
redis-cli INFO memory | grep -E "used_memory:|used_memory_rss:"
# Fragmentation ratio
redis-cli INFO memory | grep mem_fragmentation_ratio
# Kernel OOM evidence
sudo dmesg | grep -i "killed process"
# Overcommit policy
cat /proc/sys/vm/overcommit_memory
# THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Persistence activity and recent COW size
redis-cli INFO persistence | grep -E "cow_size|bgsave_in_progress|rewrite_in_progress"
# Recent fork latency
redis-cli INFO stats | grep latest_fork_usec
# Client buffer bloat
redis-cli CLIENT LIST | awk -F'[= ]' '{for(i=1;i<=NF;i++) if($i=="omem") print $(i+1)}' | sort -rn | head -20
# Configured memory limit
redis-cli CONFIG GET maxmemory
How to diagnose it
- Confirm a kernel OOM event. Check dmesg or /var/log/kern.log for “Killed process” near the time of death. Compare with uptime_in_seconds from INFO server; a sudden reset confirms a restart.
- Measure the RSS-to-logical gap. Run redis-cli INFO memory and compare used_memory_rss to used_memory. A ratio above 1.5 on an instance with more than 100 MB of data signals meaningful overhead.
- Identify COW as the trigger. Check rdb_last_cow_size and aof_last_cow_size in INFO persistence. If either exceeds 50% of used_memory, the last fork duplicated enough pages to push RSS toward the limit. On Redis 7.0+, monitor current_cow_peak during active forks.
- Check vm.overcommit_memory. If it is 0 (the default), the kernel requires enough free RAM to cover the parent’s RSS before allowing a fork. This either causes fork failures or leaves zero margin for COW growth, making mid-operation OOM kills likely.
- Audit client buffers. Run CLIENT LIST and look for large omem values. The default client-output-buffer-limit normal 0 0 0 means no limit for normal clients, so a slow subscriber or forgotten MONITOR session can consume gigabytes of RSS.
- Inspect THP status. If /sys/kernel/mm/transparent_hugepage/enabled is not [never], a single-byte write during a fork can duplicate an entire 2 MB huge page instead of a 4 KB standard page, multiplying COW overhead.
- Distinguish from Redis-level OOM. If maxmemory was reached with a noeviction policy, Redis returns -OOM errors tracked in errorstat_OOM (Redis 6.2+) rather than being killed by the kernel. Kernel OOM and Redis OOM require different fixes.
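
The COW step above can be sketched as a small filter. This is an illustrative helper only: the name cow_share is hypothetical, it takes used_memory in bytes as its argument, reads INFO persistence text on standard input, and applies the 50% rule of thumb from the list.

```shell
# cow_share: hypothetical helper, not part of Redis.
# Usage: redis-cli INFO persistence | cow_share <used_memory_bytes>
cow_share() {
  awk -F: -v used="$1" '
    { sub(/\r$/, "") }                       # strip the CR that INFO lines carry
    $1 == "rdb_last_cow_size" || $1 == "aof_last_cow_size" {
      if ($2 + 0 > cow) cow = $2 + 0         # keep the larger of the two values
    }
    END {
      if (used + 0 <= 0) { print "usage: cow_share <used_memory_bytes>" > "/dev/stderr"; exit 2 }
      pct = 100 * cow / used
      printf "cow=%.0f pct_of_used=%.1f %s\n", cow, pct, (pct > 50 ? "WARN" : "ok")
    }'
}
```

The used_memory argument can be taken from INFO memory. The helper only reports the last completed fork, so it says nothing about a fork that is still running.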
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| used_memory_rss / total_system_memory | Kernel OOM killer uses RSS, not logical memory | > 0.75 on persistent instances that fork |
| mem_fragmentation_ratio | Allocator fragmentation inflates RSS silently | Sustained > 1.5 on instances > 100 MB |
| rdb_last_cow_size / aof_last_cow_size | Pages duplicated during the last fork | > 50% of used_memory |
| latest_fork_usec | Long forks block the event loop and correlate with heavy COW | > 500 ms |
| current_cow_peak (Redis 7.0+) | Real-time COW memory during an active fork | Approaching container or host limit |
| allocator_frag_ratio (Redis 4.0+) | Isolates allocator waste from process overhead | Sustained > 1.5 |
| Client omem | Output buffers are allocated from the heap and count toward RSS | Any single client > 256 MB |
| uptime_in_seconds | Detects unexpected restarts after OOM kills | Sudden drop or reset |
| errorstat_OOM (Redis 6.2+) | Distinguishes Redis-level OOM errors from kernel kills | Non-zero rate |
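
As an illustration of the first row, the sketch below compares resident memory with a memory ceiling and flags the 0.75 threshold. The name rss_pressure is hypothetical. Both values are passed in as bytes, and where the limit comes from (total system RAM or a cgroup limit) is left to the caller.

```shell
# rss_pressure: hypothetical helper, not part of Redis.
# Usage: rss_pressure <used_memory_rss_bytes> <limit_bytes>
rss_pressure() {
  awk -v rss="$1" -v limit="$2" 'BEGIN {
    if (limit + 0 <= 0) { print "limit must be a positive byte count" > "/dev/stderr"; exit 2 }
    frac = rss / limit                       # share of the ceiling already resident
    printf "rss_fraction=%.2f %s\n", frac, (frac > 0.75 ? "WARN" : "ok")
  }'
}
```

In a container, the cgroup limit is usually the relevant ceiling rather than host RAM, because the kernel enforces whichever limit is reached first.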
Fixes
Reduce RSS from fragmentation
Run MEMORY PURGE to return jemalloc dirty pages to the OS. This reduces RSS briefly but cannot defragment live objects. For persistent fragmentation, enable activedefrag yes (Redis 4.0+) to compact live allocations in the background. Active defrag consumes main-thread CPU; cap active-defrag-cycle-max to avoid latency spikes.
Right-size for COW during persistence
For instances with RDB or AOF enabled, keep used_memory below 50% of physical RAM so a worst-case COW spike does not hit the ceiling. If you cannot add memory, disable automatic save directives and schedule BGSAVE during low-traffic windows. Tradeoff: wider RPO and manual operational burden. Alternatively, set appendonly no if AOF rewrite COW is the primary trigger, though this sacrifices AOF durability.
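The 50% rule can be turned into a byte figure with a short filter. This is a sketch under assumptions: the name cow_safe_ceiling is hypothetical, and it reads /proc/meminfo-style text, where MemTotal is reported in kB.

```shell
# cow_safe_ceiling: hypothetical helper, not part of Redis.
# Usage: cow_safe_ceiling < /proc/meminfo
# Prints half of MemTotal in bytes, as a candidate ceiling for used_memory.
cow_safe_ceiling() {
  awk '$1 == "MemTotal:" { printf "%.0f\n", $2 * 1024 / 2 }'
}
```

Inside a container, MemTotal generally reflects the host rather than the cgroup limit, so the container limit is the better input there.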
Fix overcommit and THP
Set vm.overcommit_memory = 1 so fork() succeeds without requiring free RAM equal to the parent’s RSS. You must rely on your own capacity planning rather than the kernel’s heuristic.
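A minimal sketch for checking the current policy follows. The function name overcommit_status is hypothetical. It reads the procfs file by default and accepts another path as an argument, mainly to make it easy to test. The commented commands at the end show one way to apply the setting, and the file name under /etc/sysctl.d is an assumption.

```shell
# overcommit_status: hypothetical helper, not part of Redis.
# Usage: overcommit_status [path]   (defaults to the procfs file)
overcommit_status() {
  file="${1:-/proc/sys/vm/overcommit_memory}"
  mode=$(cat "$file") || return 2
  case "$mode" in
    0) echo "0 heuristic: fork may fail under memory pressure" ;;
    1) echo "1 always: fork is not refused for lack of memory" ;;
    2) echo "2 strict: commits capped by the overcommit ratio" ;;
    *) echo "unknown value: $mode"; return 2 ;;
  esac
}

# To apply and persist the setting (run as root; file name is an assumption):
#   sysctl -w vm.overcommit_memory=1
#   echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/99-redis.conf
```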
Disable Transparent Huge Pages:
# Warning: run as root. Applies immediately but resets on reboot.
# Persist via init scripts or systemd tmpfiles to survive reboot.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Tradeoff: slightly higher TLB pressure for some workloads, but fork latency improves and COW page granularity drops from 2 MB to 4 KB.
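To verify the setting at provisioning time, a check along these lines may help. The function name thp_mode is hypothetical. It assumes the usual sysfs format, in which the active mode is the bracketed word, for example "always madvise [never]".

```shell
# thp_mode: hypothetical helper, not part of Redis.
# Usage: thp_mode [path]   (defaults to the sysfs file)
# Prints the active THP mode, i.e. the word inside the square brackets.
thp_mode() {
  file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
}

# Example gate for a provisioning script:
#   [ "$(thp_mode)" = "never" ] || echo "THP is still enabled" >&2
```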
Contain client buffers
Set explicit output buffer limits for normal clients instead of the default unlimited:
# CONFIG SET takes the class and its three limits as a single quoted value
redis-cli CONFIG SET client-output-buffer-limit "normal 64mb 32mb 60"
Add the same directive to redis.conf to survive restart.
Tradeoff: slow clients are forcibly disconnected. Audit CLIENT LIST for MONITOR sessions, which copy every command to an output buffer and can OOM an instance within minutes under load.
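For the audit step, the following sketch filters CLIENT LIST output down to clients whose output buffer exceeds a threshold. The name big_omem_clients is hypothetical. It assumes the space-separated key=value format that CLIENT LIST prints, and it defaults to the 256 MB warning level used in the monitoring table.

```shell
# big_omem_clients: hypothetical helper, not part of Redis.
# Usage: redis-cli CLIENT LIST | big_omem_clients [min_bytes]
# Prints "omem id addr" for each client above the threshold, largest first.
big_omem_clients() {
  awk -v min="${1:-268435456}" '
    {
      sub(/\r$/, "")
      id = addr = ""; omem = 0
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")                   # each field is key=value
        if (kv[1] == "id")   id = kv[2]
        if (kv[1] == "addr") addr = kv[2]
        if (kv[1] == "omem") omem = kv[2] + 0
      }
      if (omem > min + 0) print omem, id, addr
    }' | sort -rn
}
```

The cmd field of the same CLIENT LIST line shows the client's last command, which helps confirm whether a large buffer belongs to a MONITOR session or a subscriber.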
Prevention
- Set maxmemory to leave headroom for RSS overhead, not just logical data. On persistent instances, treat 50% of available RAM as the practical ceiling for used_memory.
- Monitor used_memory_rss and alert on it approaching the host or cgroup limit. Do not rely solely on used_memory or maxmemory ratio alerts.
- Size repl-backlog-size to at least 100 MB to avoid replica disconnections that trigger full resyncs and additional forks.
- Keep vm.overcommit_memory=1 and THP disabled on all Redis hosts. Verify both at provisioning time and after kernel upgrades.
- Run redis-cli --bigkeys or MEMORY USAGE sampling periodically to catch single keys that disproportionately expand the dataset and COW cost.
How Netdata helps
- Collects used_memory_rss, used_memory, and mem_fragmentation_ratio from INFO memory, correlating them with system RAM and container cgroup metrics to expose the gap that leads to kernel OOM kills.
- Tracks rdb_last_cow_size, aof_last_cow_size, and latest_fork_usec to correlate persistence events with RSS spikes.
- The redis.instance_available alarm triggers on uptime_in_seconds resets, surfacing OOM-killed restarts immediately.
- Surfaces RSS-based memory usage alongside logical allocator memory, making fragmentation and COW bloat visible before the kernel intervenes.
- On Kubernetes, monitors container memory.working_set and memory.limit, distinguishing cgroup OOM from global kernel OOM.







