Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box
Redis disappeared from your container with only an OOMKilled status and a metrics gap that aligns with an RDB snapshot or AOF rewrite. The dataset was under its memory limit moments ago, but during persistence the reported RSS doubled and the kernel killed the process.
This is the Redis fork/copy-on-write memory storm. Redis calls fork() to spawn a child process for background RDB snapshots, AOF rewrites, and full replication syncs. After the fork, parent and child share pages through copy-on-write. Pages stay read-only until one process writes. If the parent continues serving writes, every modified page is copied. On a write-heavy instance, this can duplicate the entire dataset, pushing RSS to roughly twice the logical data size. Containers with tight memory limits do not see used_memory; they see RSS. When RSS hits the cgroup ceiling, the OOM killer fires, both processes die, and the instance restarts cold.
What this means
During a background save, used_memory stays flat, but used_memory_rss can spike dramatically. The kernel charges the parent for every COW page. Under heavy writes, duplication approaches 100% of the dataset. If the container was sized for used_memory plus a small buffer, there is no room for the fork.
This is not a memory leak; it is a mechanical consequence of fork plus writes. Risk is highest on write-heavy primaries with automatic save directives, large datasets, and anywhere Transparent Huge Pages (THP) is enabled. THP amplifies COW because a single-byte write to a 2MB huge page copies the entire 2MB instead of one 4KB page, so far fewer writes are needed to duplicate most of the dataset and reach the 2x worst case.
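The page-size effect can be estimated with simple arithmetic. The sketch below assumes every write lands on a distinct page (a pessimistic case for scattered keys) and caps the estimate at the dataset size, since a page only needs to be copied once per fork. The function name cow_bytes and the example numbers are illustrative assumptions, not measurements.

```bash
#!/usr/bin/env bash
# Rough COW estimate: each write is assumed to dirty a distinct page,
# and the total is capped at the dataset size.
# Usage: cow_bytes <writes_during_fork> <page_size_bytes> <dataset_bytes>
cow_bytes() {
  local writes=$1 page=$2 dataset=$3
  local copied=$(( writes * page ))
  if (( copied > dataset )); then
    copied=$dataset
  fi
  echo "$copied"
}

dataset=$(( 8 * 1024 * 1024 * 1024 ))   # hypothetical 8 GiB dataset
writes=10000                            # hypothetical writes while the child runs

echo "4 KiB pages: $(cow_bytes "$writes" 4096 "$dataset") bytes copied"
echo "2 MiB pages: $(cow_bytes "$writes" $(( 2 * 1024 * 1024 )) "$dataset") bytes copied"
```

Under these assumptions, 10,000 scattered writes copy about 39 MiB with 4 KiB pages but the full 8 GiB dataset with 2 MiB huge pages.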
flowchart TD
A[Redis fork for RDB/AOF] --> B[Parent and child share pages]
B --> C[Write-heavy workload continues]
C --> D[Kernel copies dirty pages via COW]
D --> E[RSS temporarily doubles]
E --> F{Exceeds cgroup or host limit?}
F -->|Yes| G[OOM killer terminates Redis]
F -->|No| H[Save completes, child exits]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Heavy writes during RDB/AOF fork | RSS spikes to roughly 2x used_memory during save, then drops | INFO persistence for rdb_bgsave_in_progress or aof_rewrite_in_progress |
| Transparent Huge Pages enabled | COW size far exceeds expected page-by-page cost; fork latency also spikes | /sys/kernel/mm/transparent_hugepage/enabled |
| Automatic save directives on a write-heavy master | Predictable OOM kills at save intervals; correlates with save 900 1 style config | CONFIG GET save |
| Insufficient memory headroom | OOM kills even under moderate write load because RSS has no room to expand | used_memory_rss vs available RAM or cgroup limit |
| Fragmentation amplifying RSS | RSS is permanently high before fork; COW pushes it over the limit | mem_fragmentation_ratio sustained above 1.5 |
Quick checks
# Check if a background save or rewrite is currently running
redis-cli INFO persistence | grep -E "bgsave_in_progress|rewrite_in_progress"
# Check COW cost from the last RDB save and AOF rewrite
redis-cli INFO persistence | grep -E "rdb_last_cow_size|aof_last_cow_size"
# Check current RSS vs logical memory
redis-cli INFO memory | grep -E "used_memory_rss|used_memory:"
# Check fork latency of the last operation
redis-cli INFO stats | grep latest_fork_usec
# Check THP status (should be [never])
cat /sys/kernel/mm/transparent_hugepage/enabled
# Check automatic save configuration
redis-cli CONFIG GET save
# Check vm.overcommit_memory (should be 1)
sysctl vm.overcommit_memory
# Check for clients with large output buffers that eat headroom
redis-cli CLIENT LIST | tr ' ' '\n' | grep '^omem=' | cut -d= -f2 | sort -rn | head -5
How to diagnose it
- Confirm the kill was OOM. Check dmesg, /var/log/kern.log, or container events for "Out of memory: Kill process <pid> (redis-server)" or OOMKilled. Note the exact timestamp.
- Correlate timing with persistence. Run redis-cli INFO persistence and check rdb_last_save_time (a Unix timestamp) and, for AOF rewrites, aof_last_bgrewrite_status and aof_last_rewrite_time_sec. If the OOM aligns with a save window, COW is the likely culprit.
- Measure the COW cost. Check rdb_last_cow_size or aof_last_cow_size. If the value exceeds 50% of used_memory, the workload is copying enough pages to threaten the host.
- Check for active fork metrics (Redis 7.0+). If the instance is still running and another fork may occur, check current_cow_size and current_cow_peak in INFO persistence to see live COW accumulation.
- Verify THP status. Run cat /sys/kernel/mm/transparent_hugepage/enabled. If the value is not [never], THP is amplifying COW.
- Review automatic save directives. Run CONFIG GET save. Non-empty save directives on a write-heavy master are a common trigger.
- Assess fragmentation. Check mem_fragmentation_ratio. If it is sustained above 1.5, fragmentation has consumed headroom that would otherwise absorb the COW spike.
- Check container limits. If running in Docker or Kubernetes, verify that the memory limit accounts for COW. A limit set to used_memory plus 20% is usually insufficient for a persistent instance.
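The COW measurement step can be scripted. The following is a minimal sketch, assuming INFO output is piped in on stdin; the helper name cow_pct is hypothetical. It prints the chosen COW field as a whole-number percentage of used_memory, which can then be compared against the 50% threshold.

```bash
#!/usr/bin/env bash
# Print <field> as a whole-number percentage of used_memory, reading
# "redis-cli INFO" text on stdin. INFO lines end in CRLF, so \r is stripped.
# Usage: redis-cli INFO | cow_pct rdb_last_cow_size
cow_pct() {
  local field=$1
  tr -d '\r' | awk -F: -v f="$field" '
    $1 == "used_memory" { used = $2 }
    $1 == f             { cow = $2 }
    END {
      if (used + 0 > 0) printf "%d\n", (cow * 100) / used
      else              print "n/a"
    }'
}
```

Run it once with rdb_last_cow_size and once with aof_last_cow_size. Both fields describe the last completed fork, so they are only meaningful after at least one save or rewrite has finished since the server started.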
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| used_memory_rss | This is what the OS and OOM killer see, not used_memory | Spike approaching the container or host memory limit during saves |
| rdb_last_cow_size / aof_last_cow_size | Post-mortem COW cost of the last save | Greater than 50% of used_memory |
| current_cow_size / current_cow_peak (Redis 7.0+) | Live COW bytes during an active fork | Growing toward available headroom while a save is in progress |
| latest_fork_usec | Duration the main thread was frozen | Above 500ms; clients will notice, replicas may disconnect |
| mem_fragmentation_ratio | Fragmentation wastes RAM before COW even starts | Sustained above 1.5 on instances with substantial datasets |
| rdb_bgsave_in_progress / aof_rewrite_in_progress | Tells you a fork is active and COW is accumulating | Correlates with RSS spikes in your memory graphs |
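Two of the thresholds in the table can be checked from a single INFO sample. This is a rough sketch, assuming INFO text on stdin; the function name check_signals is hypothetical, and a single sample cannot tell whether fragmentation is sustained, so treat a hit as a prompt to look at the trend.

```bash
#!/usr/bin/env bash
# Read "redis-cli INFO" text on stdin and print one WARN line per signal
# above its warning threshold (fork latency > 500 ms, fragmentation > 1.5).
# Exit status is 1 if any warning was printed, 0 otherwise.
# Usage: redis-cli INFO | check_signals
check_signals() {
  tr -d '\r' | awk -F: '
    $1 == "latest_fork_usec"        { fork_us = $2 }
    $1 == "mem_fragmentation_ratio" { frag = $2 }
    END {
      if (fork_us + 0 > 500000) { print "WARN latest_fork_usec=" fork_us; bad = 1 }
      if (frag + 0 > 1.5)       { print "WARN mem_fragmentation_ratio=" frag; bad = 1 }
      exit bad + 0
    }'
}
```

The non-zero exit status makes the function usable as a cron or health-check probe without parsing its output.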
Fixes
Disable automatic RDB snapshots on write-heavy masters
On a primary receiving heavy writes, automatic save directives are dangerous. Disable them with CONFIG SET save "" and update redis.conf to persist the change. Delegate RDB snapshots to replicas. The tradeoff is that the primary no longer creates local point-in-time backups, but replicas can persist without endangering the write path.
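As a sketch of that sequence, the function below applies the runtime change, asks Redis to write it back to its configuration file, and prints the resulting value. The wrapper name disable_auto_save and the REDIS_CLI override variable are assumptions made for this example; connection options such as host, port, or auth are left out.

```bash
#!/usr/bin/env bash
# Disable automatic RDB snapshots at runtime, then persist the running
# configuration to redis.conf so the change survives a restart.
# REDIS_CLI is a hypothetical override hook for this sketch (default: redis-cli).
disable_auto_save() {
  local cli=${REDIS_CLI:-redis-cli}
  "$cli" CONFIG SET save "" || return 1
  # CONFIG REWRITE fails if the server was started without a config file.
  "$cli" CONFIG REWRITE || return 1
  # Print the effective value so the operator can confirm it is empty.
  "$cli" CONFIG GET save
}
```

If CONFIG REWRITE is not available in your deployment, for example because the configuration is managed externally, edit the managed redis.conf instead so the two do not drift.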
Disable Transparent Huge Pages
THP is the single most common amplifier of COW memory storms. Disable it immediately:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Warning: These commands require root and modify kernel behavior on the host. Inside a container you typically need privileged access to write these sysfs files. Make the change persistent across reboots via rc.local, a systemd unit, or node initialization. Do not rely on runtime changes surviving a restart.
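To verify the setting from a provisioning or startup script, the active mode can be read back from the same sysfs file, where it appears as the bracketed word. A minimal sketch follows; the function name thp_mode is hypothetical, and the optional path argument exists only so the function can be pointed at a test file.

```bash
#!/usr/bin/env bash
# Print the active THP mode, i.e. the bracketed word in a line such as
# "always madvise [never]". Defaults to the real sysfs location.
# Usage: thp_mode [path]
thp_mode() {
  local path=${1:-/sys/kernel/mm/transparent_hugepage/enabled}
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$path"
}
```

A startup check could then be as simple as `[ "$(thp_mode)" = never ] || echo "THP is still enabled" >&2`.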
Add memory headroom or increase limits
For persistent instances (RDB or AOF enabled), maintain at least 50% headroom: used_memory_rss should stay below 50% of physical RAM or the cgroup limit. For cache-only instances, 20% to 25% is acceptable. If you cannot add headroom, shard the dataset across smaller instances so that each fork’s COW spike stays within its limit.
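The headroom rules translate into a minimum memory limit for a given RSS. The sketch below uses the 50% figure for persistent instances and the 20% lower bound for cache-only ones; the function name min_limit_bytes and the mode labels are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Minimum memory limit (bytes) for a given RSS:
#   persistent: RSS must stay at or below 50% of the limit -> limit = 2 * rss
#   cache:      at least 20% of the limit stays free       -> limit = rss / 0.8
# Usage: min_limit_bytes <used_memory_rss_bytes> <persistent|cache>
min_limit_bytes() {
  local rss=$1 mode=$2
  case "$mode" in
    persistent) echo $(( rss * 2 )) ;;
    cache)      echo $(( (rss * 5 + 3) / 4 )) ;;   # rss / 0.8, rounded up
    *)          echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

For example, an instance with 4 GiB of RSS needs a limit of at least 8 GiB when persistence is enabled and 5 GiB when it runs as a pure cache.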
Move persistence to replicas
If the primary must stay lean, set save "" on the primary and configure replicas with RDB or AOF. The replica pays the fork cost during its own saves and full resyncs, but the primary stays stable. Be aware that a replica's initial full sync still triggers a fork on the primary.
Reduce fragmentation after a spike
If a COW event left RSS permanently high because the allocator retained freed pages, run MEMORY PURGE to ask jemalloc to return memory to the OS. For ongoing fragmentation, enable activedefrag yes (Redis 4.0+). Active defrag adds CPU overhead, so evaluate the tradeoff on latency-sensitive primaries.
Prevention
- Size for COW. Persistent instances need headroom equal to at least the dataset size to survive a worst-case COW spike.
- Disable THP before production. Check it in your base image and node provisioning.
- Set vm.overcommit_memory=1. Without this, the kernel may reject the fork() even when physical memory is available because it cannot guarantee pages for the hypothetical worst case.
- Avoid save directives on write-heavy masters. Use replication and run persistence on replicas.
- Monitor fork latency and COW size. Set thresholds on latest_fork_usec and rdb_last_cow_size so you know when a save is becoming dangerous before the OOM killer acts.
- Account for COW in container limits. A Kubernetes memory limit sized to used_memory is an OOM trap. Size limits to at least 2x the expected dataset RSS, or run cache-only workloads if you cannot.
How Netdata helps
- Correlate used_memory_rss spikes with rdb_bgsave_in_progress or aof_rewrite_in_progress to confirm a COW storm.
- Alert on rdb_last_cow_size and aof_last_cow_size crossing thresholds relative to used_memory.
- Track latest_fork_usec anomalies that precede replica disconnects and resync cascades.
- Monitor mem_fragmentation_ratio alongside RSS to distinguish fragmentation pressure from dataset growth.
- Surface container-aware memory metrics so you can see when RSS is approaching cgroup limits while used_memory still looks safe.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction
- Redis maxmemory not set: why every production instance needs a memory limit
- MISCONF Redis is configured to save RDB snapshots — what it means and how to fix it
- Redis monitoring checklist: the signals every production instance needs
- Redis monitoring maturity model: from survival to expert
- Redis OOM command not allowed when used memory > ‘maxmemory’ - causes and fixes
- Redis OOM-killed by the kernel: RSS, overcommit, and recovery







