Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix

Redis logs Can't save in background: fork: Cannot allocate memory. free -h shows plenty of free RAM, yet BGSAVE or BGREWRITEAOF fails. If stop-writes-on-bgsave-error is yes (default), writes fail too. The gap between free RAM and fork failure is the key.

This is not a simple OOM. It is a kernel commit accounting failure. Linux fork() must account for the worst case where every copy-on-write page is modified. With vm.overcommit_memory=0 (the default), the kernel applies a heuristic check and refuses requests it judges to be obvious overcommits. When Redis RSS is large, that check can reject fork() even with free physical memory. The fix is usually one sysctl, but THP, container limits, and actual RAM headroom determine whether it holds.

What this means

Redis calls fork() for BGSAVE and BGREWRITEAOF. The child inherits the parent’s page tables and reads the dataset while the parent continues serving writes. Copy-on-write keeps physical pages shared until one process modifies them, so RAM does not double immediately. But the kernel commit charge at fork() time must account for the worst case: every shared page being copied.

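When vm.overcommit_memory is 0, the kernel uses heuristic overcommit handling: it refuses requests that look like obvious overcommits relative to the memory it could supply. fork() of a large Redis process is charged as though the child could copy every writable page, so once Redis RSS passes roughly half of physical RAM, fork() tends to fail with ENOMEM. The fixed limit of swap plus 50% of RAM is the default for strict accounting (vm.overcommit_memory=2, tuned with vm.overcommit_ratio), but 50% of RAM is a workable rule of thumb for mode 0 as well. A server that still shows free memory can refuse to fork because the kernel cannot guarantee the child will not eventually copy every page.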
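
To make the rule of thumb concrete, here is a minimal sketch that compares Redis RSS against half of physical RAM. The function name fork_risk is hypothetical, swap is ignored for simplicity, and the 50% threshold is the approximation used in this article, not an exact kernel formula. Inputs are assumed to be used_memory_rss in bytes and MemTotal from /proc/meminfo in kB.

```shell
# fork_risk RSS_BYTES MEMTOTAL_KB OVERCOMMIT_MODE
# Prints "ok" or "at-risk" using the article's ~50% rule of thumb.
# Hypothetical helper; it does not model the kernel heuristic exactly.
fork_risk() {
    rss_bytes=$1
    memtotal_kb=$2
    mode=$3
    # With overcommit_memory=1 the kernel does not refuse fork on commit grounds.
    if [ "$mode" -eq 1 ]; then
        echo "ok"
        return 0
    fi
    half_ram_bytes=$(( memtotal_kb * 1024 / 2 ))
    if [ "$rss_bytes" -gt "$half_ram_bytes" ]; then
        echo "at-risk"
    else
        echo "ok"
    fi
}
```

With 40 GB of RSS on a host reporting MemTotal of 65536000 kB, half of RAM is 33554432000 bytes, so the call fork_risk 40000000000 65536000 0 reports at-risk, while the same numbers with mode 1 report ok.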

If stop-writes-on-bgsave-error is yes (default), Redis rejects writes after the failed save, turning a persistence failure into a write availability incident.

flowchart TD
    A[BGSAVE or AOF rewrite triggers fork] --> B{vm.overcommit_memory = 1?}
    B -->|Yes| C[Kernel allows virtual commit]
    B -->|No| D{Redis RSS below roughly 50% of RAM?}
    D -->|Yes| C
    D -->|No| E[fork returns ENOMEM]
    E --> F[Redis logs 'Cannot allocate memory']
    F --> G{stop-writes-on-bgsave-error}
    G -->|yes| H[Redis rejects writes]
    G -->|no| I[Writes continue with no snapshot]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| vm.overcommit_memory=0 | Exact error in logs; RSS above ~50% of RAM; host not actually OOM | sysctl vm.overcommit_memory |
| THP enabled | Fork succeeds intermittently but COW spikes are massive; latest_fork_usec is high | cat /sys/kernel/mm/transparent_hugepage/enabled |
| Container memory limit | Same error inside containers even when host memory is free; child process disappears | Host vm.overcommit_memory and container memory limit |
| Dataset exhausting physical RAM | used_memory_rss near total memory; swap or OOM kills follow | redis-cli INFO memory and system free |

Quick checks

Run these read-only checks to confirm the failure mode before making changes.

# Check kernel overcommit mode
sysctl vm.overcommit_memory

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled

# Check last bgsave status and timestamp
redis-cli INFO persistence | grep -E "rdb_last_bgsave_status|rdb_last_save_time"

# Check last fork duration
redis-cli INFO stats | grep latest_fork_usec

# Check Redis RSS vs logical memory
redis-cli INFO memory | grep -E "used_memory_rss|used_memory:"

# Check COW cost from last save
redis-cli INFO persistence | grep rdb_last_cow_size

# Check for recent OOM kills in kernel log
dmesg | grep -iE "oom-killer|out of memory|killed process"
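
redis-cli INFO output uses CRLF line endings, so values extracted with grep or cut can carry a trailing carriage return that breaks numeric comparisons in scripts. The sketch below extracts one field cleanly. The name info_field is hypothetical, and it assumes only the field:value layout that INFO prints.

```shell
# info_field NAME: read INFO output on stdin and print the value of one field.
info_field() {
    # tr strips carriage returns, then awk matches the exact field name,
    # so "used_memory" does not also match "used_memory_rss".
    tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2; exit }'
}

# Usage (requires a reachable Redis instance):
#   redis-cli INFO memory | info_field used_memory_rss
```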

How to diagnose it

  1. Confirm the error. Look for Can't save in background: fork: Cannot allocate memory in the Redis log, or check redis-cli INFO persistence for rdb_last_bgsave_status:err.
  2. Check vm.overcommit_memory. If it returns 0, the kernel heuristic is blocking fork. This is the most common root cause.
  3. Compare RSS to total memory. Run redis-cli INFO memory | grep used_memory_rss and compare it to physical RAM. If RSS is above ~50% of RAM and overcommit is 0, the failure is expected.
  4. Check latest_fork_usec. The value is reported in microseconds, so 500 ms appears as 500000. Values trending upward or above that level suggest THP or dataset size pressure.
  5. Check THP status. If cat /sys/kernel/mm/transparent_hugepage/enabled does not show [never], COW granularity is inflated and memory pressure is amplified.
  6. Verify container context. If Redis runs inside a container, confirm the host vm.overcommit_memory value. Containers inherit it by default; changing it inside a container usually fails.
  7. Check rdb_last_cow_size. If it exceeds 50% of used_memory, your write rate during saves is high and headroom is insufficient even with overcommit enabled.
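
The steps above can be sketched as a small classifier. This is illustrative only: classify_fork_failure is a hypothetical function, its inputs are assumed to have been gathered with the quick checks, and the thresholds are the approximate ones from this article.

```shell
# classify_fork_failure OVERCOMMIT RSS_BYTES RAM_BYTES THP_LINE COW_BYTES USED_BYTES
# Prints one line per finding; prints nothing when no check trips.
classify_fork_failure() {
    overcommit=$1; rss=$2; ram=$3; thp=$4; cow=$5; used=$6
    # Steps 2 and 3: heuristic overcommit combined with RSS above ~50% of RAM.
    if [ "$overcommit" -eq 0 ] && [ "$rss" -gt $(( ram / 2 )) ]; then
        echo "overcommit: RSS above ~50% of RAM with vm.overcommit_memory=0"
    fi
    # Step 5: the active THP mode is the bracketed word in the sysfs line.
    case "$thp" in
        *"[never]"*) ;;   # THP disabled, nothing to report
        *) echo "thp: transparent huge pages are not set to never" ;;
    esac
    # Step 7: COW cost of the last save relative to used_memory.
    if [ "$cow" -gt $(( used / 2 )) ]; then
        echo "cow: last save copied more than 50% of used_memory"
    fi
}
```

More than one finding can apply at once, which is why the function prints every match instead of stopping at the first.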

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| used_memory_rss vs system memory | RSS is the practical metric for commit accounting on Redis hosts | RSS approaching 50% of RAM when vm.overcommit_memory=0 |
| rdb_last_bgsave_status | Binary indicator of save health | Any err value |
| latest_fork_usec | Fork freezes the main thread; duration blocks all commands | Sustained values above 500ms, or above 200ms per GB of dataset |
| rdb_last_cow_size | Measures actual COW memory cost during last save | Exceeding 50% of used_memory |
| THP kernel setting | THP copies 2MB pages on write, amplifying COW | Any value other than [never] |
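
The fork-duration thresholds can be checked with integer arithmetic. A minimal sketch, assuming latest_fork_usec is passed in microseconds and that "GB" is read as GiB; fork_latency_status is a hypothetical name.

```shell
# fork_latency_status FORK_USEC DATASET_BYTES
# Prints "warn" when the fork took longer than 500 ms, or longer than
# 200 ms per GiB of dataset; otherwise prints "ok".
fork_latency_status() {
    fork_usec=$1
    dataset_bytes=$2
    gib=$(( 1024 * 1024 * 1024 ))
    # Allowed microseconds at 200 ms (200000 usec) per GiB of dataset.
    per_gb_limit_usec=$(( dataset_bytes * 200000 / gib ))
    if [ "$fork_usec" -gt 500000 ] || [ "$fork_usec" -gt "$per_gb_limit_usec" ]; then
        echo "warn"
    else
        echo "ok"
    fi
}
```

A 300 ms fork on a 1 GiB dataset warns because it exceeds the per-GiB budget, while the same 300 ms on a 4 GiB dataset stays within both thresholds.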

Fixes

Kernel overcommit policy

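The canonical fix is vm.overcommit_memory=1. In this mode the kernel does not refuse allocations on commit grounds, so fork() succeeds and physical memory is consumed only as pages are actually touched. For Redis this is generally safe because the background child only reads the shared dataset; pages are copied only when the parent modifies them while the save runs.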

Apply live:

sudo sysctl vm.overcommit_memory=1

Persist by adding vm.overcommit_memory = 1 to /etc/sysctl.conf or a file under /etc/sysctl.d/.
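
As a sketch of the persistence step, the helper below writes a drop-in file. The file name 99-redis-overcommit.conf and the function name are hypothetical choices. The target directory is a parameter so the function can be tried against a scratch directory before touching /etc/sysctl.d.

```shell
# write_overcommit_dropin DIR: create a sysctl drop-in that sets overcommit mode 1.
write_overcommit_dropin() {
    dir=$1
    printf 'vm.overcommit_memory = 1\n' > "$dir/99-redis-overcommit.conf"
}

# Usage on a real host (as root), followed by a reload of all sysctl files:
#   write_overcommit_dropin /etc/sysctl.d
#   sysctl --system
```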

Tradeoff: On multi-tenant hosts, overcommit=1 allows other processes to allocate freely, increasing the risk of OOM kills under genuine memory pressure. Isolate Redis on dedicated hosts or reserved slices if possible.

Transparent huge pages

Disable THP host-wide:

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

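This takes effect immediately but does not survive a reboot. Persist it with your distribution’s standard method, for example the transparent_hugepage=never kernel boot parameter or a boot-time unit that runs the command above.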

The Redis documentation recommends disabling THP. A single write to a 2MB huge page copies the entire page during COW, which can inflate RSS and command latency while a save runs, potentially by an order of magnitude.

Tradeoff: 4KB pages increase TLB miss rates for some workloads, but the improvement in fork predictability outweighs the cost.
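
The THP sysfs file lists every mode and marks the active one with brackets, for example "always madvise [never]". A small sketch for pulling out the active mode in scripts; thp_mode is a hypothetical helper name.

```shell
# thp_mode LINE: print the bracketed (active) mode from the sysfs line.
thp_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage on a Linux host:
#   thp_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
```

Anything other than never in the output means THP can still back Redis memory with huge pages.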

Memory headroom and sizing

If used_memory_rss is near physical RAM, overcommit=1 alone will not prevent the OOM killer from terminating the child or parent. Reduce maxmemory, enable stricter eviction, shard the dataset, or add RAM.

Persistent instances should keep used_memory below roughly 50% of physical RAM to leave room for COW. Cache-only instances can run closer to the limit, but fork() still requires free pages for page table duplication.
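
The 50% guideline for persistent instances translates into a maxmemory value as follows. This is a rough sizing sketch, not a tuning recommendation: maxmemory_for_persistence is a hypothetical name, and the input is assumed to be MemTotal in kB as reported by /proc/meminfo.

```shell
# maxmemory_for_persistence MEMTOTAL_KB: print 50% of physical RAM in bytes.
maxmemory_for_persistence() {
    memtotal_kb=$1
    echo $(( memtotal_kb * 1024 / 2 ))
}

# Usage on a Linux host with a reachable Redis instance:
#   total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   redis-cli CONFIG SET maxmemory "$(maxmemory_for_persistence "$total_kb")"
```

Note that maxmemory bounds used_memory, not RSS, so fragmentation can still push RSS above the computed figure.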

Container-specific behavior

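Containers inherit the host’s vm.overcommit_memory. Changing it inside a container usually fails because it is a host-wide kernel setting. Apply it on the host; per-container --sysctl flags help only where the runtime supports that key, which is typically not the case for vm.* settings. Ensure container memory limits account for COW spikes; a limit tight to current RSS can get the child OOM-killed while the save runs.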
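
To check whether a container limit leaves room for COW, compare the limit with RSS plus the last observed COW size. A minimal sketch: cow_fits_limit is a hypothetical helper, and using rdb_last_cow_size as the estimate for the next save is an assumption, since write rates vary between saves.

```shell
# cow_fits_limit LIMIT RSS_BYTES COW_BYTES
# LIMIT is a byte count, or the literal "max" that cgroup v2 reports when
# no memory limit is set.
cow_fits_limit() {
    limit=$1; rss=$2; cow=$3
    if [ "$limit" = "max" ]; then
        echo "fits"
        return 0
    fi
    if [ $(( rss + cow )) -le "$limit" ]; then
        echo "fits"
    else
        echo "too-tight"
    fi
}

# Usage inside a cgroup v2 container (cgroup v1 exposes the limit in
# /sys/fs/cgroup/memory/memory.limit_in_bytes instead):
#   cow_fits_limit "$(cat /sys/fs/cgroup/memory.max)" "$rss" "$cow"
```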

Do not mask the failure

Setting stop-writes-on-bgsave-error no turns a loud persistence failure into silent data-loss risk. Fix the fork or disable automatic saves if you do not need them; do not suppress the error.

Prevention

  • Set vm.overcommit_memory=1 on every Redis host before production.
  • Disable THP before starting Redis.
  • Size instances with COW headroom. Persistent instances should keep used_memory below roughly 50% of physical RAM.
  • Monitor latest_fork_usec and rdb_last_cow_size after every save to trend memory pressure.
  • Schedule BGSAVE during low-write windows, or rely on AOF with periodic rewrites instead of frequent RDB snapshots when fork pressure is high.

How Netdata helps

  • Correlate redis.latest_fork_usec with system memory metrics to spot growing fork duration before it fails.
  • Alert on redis.rdb_last_bgsave_status transitioning to err immediately after the first failed save.
  • Track redis.used_memory_rss against system available memory to visualize commit headroom.
  • Surface redis.rdb_last_cow_size trends to predict whether the next fork will fit within container or host limits.
  • Cross-reference Redis persistence health with kernel THP state and vm.overcommit_memory context.