Redis mem_fragmentation_ratio below 1.0: detecting swap death

A mem_fragmentation_ratio of 0.72 on a substantial dataset means the operating system has swapped out Redis memory pages. Redis stays alive and responds to PING, but every command touching a swapped key blocks the single event loop on disk I/O. Because Redis logs nothing about swap, the resulting latency catastrophe looks like a mystery.

mem_fragmentation_ratio equals used_memory_rss / used_memory. When the ratio drops below 1.0, the resident set size is smaller than the memory Redis requested from its allocator. The missing bytes are in swap. Operators typically watch for ratios above 1.5, so a low ratio is misread as “good fragmentation” when it is actually the worst memory-related failure mode short of an OOM kill.
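
The ratio can be recomputed from the two raw fields, which is useful when checking what a dashboard reports. The following is a minimal sketch, not part of Redis: frag_ratio is a hypothetical helper name, and it assumes the INFO memory text arrives on standard input.

```shell
# Recompute mem_fragmentation_ratio from raw INFO memory fields.
# frag_ratio is a hypothetical helper name used only for this illustration.
# Reads INFO text on stdin and prints used_memory_rss / used_memory.
frag_ratio() {
  # redis-cli output uses CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory"     { used = $2 }
    $1 == "used_memory_rss" { rss  = $2 }
    END {
      if (used > 0) printf "%.2f\n", rss / used
      else { print "n/a"; exit 1 }
    }'
}

# Usage (requires a reachable Redis instance):
#   redis-cli INFO memory | frag_ratio
```

With used_memory at 1,000,000,000 bytes and used_memory_rss at 720,000,000 bytes, the helper prints 0.72, meaning roughly 28% of the allocated memory is not resident.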

The operational threshold is roughly 0.8 on instances with used_memory above 100 MB. Smaller instances can show ratio noise from process overhead, but on a production dataset a sustained value below 0.8 means swap death.

What this means

used_memory tracks what Redis requested from its allocator. used_memory_rss tracks what the OS kept in RAM. When RSS falls below allocated memory, the kernel paged some of Redis’s anonymous memory out to disk.

Redis has no awareness that its pages are on swap. The event loop continues accepting commands, but when it touches a swapped key, value, or internal structure, the thread blocks waiting for the storage layer to page data back in. Because command execution is single-threaded, one swapped access stalls every other client. The latency is a wall, not a slope.

Swapped pages persist even after host memory pressure subsides. The kernel does not bring them back automatically when free RAM becomes available; they stay on disk until Redis touches them again. The latency hit can outlast the original pressure by hours.
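
One way to see this lingering effect is to sample the Redis process's VmSwap value over time. The sketch below assumes a Linux /proc filesystem; vmswap_kb is a hypothetical helper name, not an existing tool.

```shell
# Print the VmSwap value (in kB) from a /proc/<pid>/status-style file.
# vmswap_kb is a hypothetical helper name used only for this illustration.
# Prints 0 when the VmSwap field is absent.
vmswap_kb() {
  awk '$1 == "VmSwap:" { kb = $2 } END { print kb + 0 }' "$1"
}

# Usage: sample once a minute and compare against free RAM.
#   vmswap_kb "/proc/$(pidof -s redis-server)/status"
```

If the printed value stays flat while free -h shows that available memory has recovered, the pages are still on disk and will only return as Redis touches them.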

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Host memory overcommit or competing processes | Ratio drops as other processes allocate RAM; system swap usage grows; latency climbs silently | free -h and /proc/swaps |
| Fork COW pressure from persistence | Ratio drops during or immediately after BGSAVE or AOF rewrite; latest_fork_usec spiked | rdb_bgsave_in_progress, aof_rewrite_in_progress, and host RAM headroom |
| vm.swappiness too high | Gradual ratio decline under normal load; no sudden memory spike in Redis | cat /proc/sys/vm/swappiness |
| Container memory limits without host headroom | Ratio drops inside a container while the host shows free RAM; OOM kills may follow | Container cgroup memory limits versus used_memory_rss |
| NUMA misconfiguration | Intermittent ratio drops on large bare-metal hosts under load | numactl --hardware or /sys/devices/system/node/ |

Quick checks

# Confirm ratio and instance size
redis-cli INFO memory | grep -E "mem_fragmentation_ratio|used_memory:"
# Check system swap usage and available RAM
free -h && swapon --show
# Check kernel tendency to swap anonymous pages
cat /proc/sys/vm/swappiness
# Check per-process swap consumption for the Redis process
cat /proc/$(pidof redis-server | awk '{print $1}')/status | grep VmSwap
# Check allocator metrics to rule out non-swap artifacts (Redis 4.0+)
redis-cli INFO memory | grep -E "allocator_frag_ratio|allocator_rss_ratio"
# Check for active persistence forks that may have triggered COW bloat
redis-cli INFO persistence | grep -E "rdb_bgsave_in_progress|aof_rewrite_in_progress"

All of these are read-only. None change server state.

How to diagnose it

  1. Validate the signal. On instances with used_memory above 100 MB, a mem_fragmentation_ratio below 0.8 indicates swap. On smaller instances the ratio may be ambiguous due to process overhead; corroborate with OS metrics before treating it as swap death.

  2. Confirm OS-level swap. Run free -h and check that swap used is non-zero or trending upward. Check VmSwap in /proc/<pid>/status for the Redis process. Non-zero VmSwap is ground truth that the kernel has moved Redis pages out of RAM.

  3. Rule out allocator artifacts. On Redis 4.0+, check allocator_frag_ratio and allocator_rss_ratio. If these sit in normal ranges while mem_fragmentation_ratio is depressed, the low ratio is not an allocator artifact. It is swap.

  4. Correlate with persistence events. Check whether the ratio drop coincides with rdb_bgsave_in_progress=1 or aof_rewrite_in_progress=1. After fork(), copy-on-write duplicates every page the parent modifies while the child is running, so a write-heavy instance can approach double its memory footprint in the worst case; if the host had no headroom, the kernel may have swapped Redis parent pages to make room for COW overhead. This is likely when latest_fork_usec spiked immediately beforehand.

  5. Check system memory pressure history. Inspect dmesg for OOM killer activity or memory-reclaim messages around the time the ratio dropped. Run vmstat 1 and look for sustained si and so columns indicating active swap-in and swap-out.

  6. Determine whether pressure is ongoing or residual. If host free RAM has recovered but mem_fragmentation_ratio remains low, the pages are still swapped. Redis never touched them to trigger a page fault and bring them back. The event loop is running on borrowed time until the next access to a cold page stalls every client.
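
For step 5, the si and so columns can be counted mechanically rather than read by eye. The sketch below assumes the default vmstat column layout, where si and so appear in the header row; swap_active_samples is a hypothetical helper name.

```shell
# Count vmstat samples that show swap-in (si) or swap-out (so) activity.
# swap_active_samples is a hypothetical helper name used only for this illustration.
# Reads vmstat output on stdin and prints "<active>/<total>".
swap_active_samples() {
  awk '
    # Until the header row is seen, look for the si and so column positions.
    !si_col {
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si_col = i
        if ($i == "so") so_col = i
      }
      next
    }
    # Data rows start with a number; repeated header rows are skipped.
    $1 ~ /^[0-9]+$/ {
      total++
      if ($si_col > 0 || $so_col > 0) active++
    }
    END { printf "%d/%d\n", active, total }'
}

# Usage: sample once per second for 30 seconds.
#   vmstat 1 30 | swap_active_samples
```

The first row vmstat prints reports averages since boot, so a result dominated by that single row says little about current activity. A high active count across a longer run points to ongoing pressure rather than residual swap.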

flowchart TD
    A[mem_fragmentation_ratio < 1.0] --> B{used_memory > 100MB?}
    B -->|No| C[Ambiguous: process overhead dominates]
    B -->|Yes| D[Check OS swap usage]
    D --> E[Swap confirmed]
    E --> F[Identify memory pressure source]
    F --> G[Restart Redis to reload pages in RAM]
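
The decision path in the flowchart can be sketched as a small classifier. This is an illustration under stated assumptions, not a definitive implementation: classify_ratio and its labels are hypothetical, it uses the 0.8 operational threshold from earlier in this article, and it treats 100 MB as 100 × 1024 × 1024 bytes.

```shell
# Classify a Redis instance from three numbers:
#   $1 = used_memory (bytes), $2 = used_memory_rss (bytes), $3 = VmSwap (kB)
# classify_ratio and its output labels are hypothetical names for this illustration.
classify_ratio() {
  awk -v used="$1" -v rss="$2" -v swap_kb="$3" 'BEGIN {
    min_used  = 100 * 1024 * 1024   # assumption: 100 MB read as 100 MiB
    threshold = 0.8                 # operational threshold from this article
    if (used <= 0) { print "no-data"; exit }
    ratio = rss / used
    if (ratio >= threshold)     print "ok"
    else if (used <= min_used)  print "ambiguous"
    else if (swap_kb > 0)       print "swap-confirmed"
    else                        print "low-ratio-unexplained"
  }'
}

# Usage with values read from INFO memory and /proc/<pid>/status:
#   classify_ratio 1000000000 720000000 280000
```

A low-ratio-unexplained result means the ratio is depressed but the process shows no swapped pages, which is the case where the allocator metrics from step 3 deserve a closer look.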

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| mem_fragmentation_ratio | Primary indicator comparing RSS to allocated memory | < 0.8 while used_memory > 100 MB |
| used_memory_rss | What the OS sees; the OOM killer and swap subsystem use this | Sustained divergence below used_memory |
| used_memory | Allocator-reported consumption; provides context for the ratio | Validates whether ratio thresholds are meaningful |
| allocator_frag_ratio (Redis 4.0+) | True allocator fragmentation, separate from OS swap behavior | Normal value while mem_fragmentation_ratio is low confirms swap |
| latest_fork_usec | Duration of the most recent fork, in microseconds; spikes precede memory pressure events | > 500,000 (500 ms) before a ratio drop suggests COW-induced swap |
| OS swap used | Ground truth for whether swapping is active | Non-zero and correlated with ratio below 1.0 |
| rdb_bgsave_in_progress / aof_rewrite_in_progress | Forks trigger COW that can push a tight host into swap | Ratio dropping during or after persistence operations |

Fixes

Immediate: stop the memory pressure

Identify and terminate or migrate the non-Redis memory consumers that pushed the host over the edge. Stopping the pressure prevents additional pages from being swapped. It does not automatically bring already-swapped Redis pages back into RAM.

Recover the pages

Restart Redis. On startup, the dataset loads from RDB or AOF back into fresh physical memory, and the old process's swapped pages are freed when it exits. An instance with persistence disabled restarts empty.

If a restart is not immediately feasible and you have root access, swapoff -a && swapon -a forces the kernel to move swapped pages back to RAM if capacity exists. Warning: if insufficient free memory exists, this command can hang for minutes or hours and may trigger the OOM killer. Do not run it on a memory-constrained host without an escape plan.
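
Before running swapoff, it is worth checking whether the swapped data could fit back into RAM at all. The sketch below compares MemAvailable against used swap from /proc/meminfo; swapoff_fits is a hypothetical helper name, and MemAvailable is a kernel estimate (present on Linux 3.14 and later), so treat a passing result as a hint rather than a guarantee.

```shell
# Check whether currently used swap would fit into available RAM.
# swapoff_fits is a hypothetical helper name used only for this illustration.
# Reads a meminfo-style file (default /proc/meminfo), prints both figures,
# and returns 0 only when MemAvailable exceeds used swap.
swapoff_fits() {
  awk '
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { total = $2 }
    $1 == "SwapFree:"     { free  = $2 }
    END {
      used = total - free
      printf "swap_used_kb=%d mem_available_kb=%d\n", used, avail
      exit !(avail > used)
    }' "${1:-/proc/meminfo}"
}

# Usage:
#   swapoff_fits && echo "swap contents may fit in RAM" || echo "do not run swapoff"
```

The check leaves no safety margin, so a host that passes only narrowly is still a poor candidate for swapoff.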

Do not rely on MEMORY PURGE. It instructs jemalloc to release dirty pages to the OS, but swapped pages are already on disk, not in allocator arenas. It will not reload them.

Right-size the host

Persistent instances need roughly 50% headroom above used_memory_rss to survive fork copy-on-write without pressuring the OS into swap. Cache-only instances still need headroom for client buffers, replication backlogs, and fragmentation. If the host cannot provide this, shard the dataset or move to larger hardware before the next persistence event triggers the same failure.
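
The headroom rule is simple arithmetic. The sketch below turns it into a helper; required_ram_bytes is a hypothetical name, and the 50% default is the rough figure from the paragraph above rather than a hard limit.

```shell
# Estimate the RAM a host should provide for a given used_memory_rss.
#   $1 = used_memory_rss in bytes
#   $2 = headroom percentage (optional, defaults to 50)
# required_ram_bytes is a hypothetical helper name used only for this illustration.
required_ram_bytes() {
  rr_rss=$1
  rr_pct=${2:-50}
  echo $(( rr_rss * (100 + rr_pct) / 100 ))
}

# Usage: an instance with 8 GiB resident
#   required_ram_bytes 8589934592
```

For an instance with 8 GiB resident (8,589,934,592 bytes), this prints 12884901888, or 12 GiB, before accounting for anything else running on the host.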

Prevention

  • Set vm.swappiness to 0 or 1. This reduces the kernel’s tendency to swap anonymous pages, though it does not eliminate swap under severe memory pressure.
  • Account for COW in capacity planning. For instances with RDB or AOF enabled, maintain used_memory below 50% of physical RAM. For cache-only instances, stay below 75% of available memory.
  • Disable Transparent Huge Pages. With THP enabled, a small write after fork can force the kernel to copy an entire huge page (typically 2 MB) instead of a 4 KB page. This inflates COW memory and latency and widens the window during which memory pressure can force swap.
  • Set maxmemory and enforce host-level headroom. A Redis instance with no limit grows until the OOM killer intervenes. Even with maxmemory configured, ensure the host has enough RSS headroom that the OS never needs to reclaim Redis pages.
  • Monitor RSS, not just used_memory. The swap subsystem and the OOM killer operate on RSS. A healthy used_memory means nothing if used_memory_rss is being reclaimed.
  • In containerized environments, ensure the container memory limit includes COW overhead. A limit set equal to maxmemory leaves no room for fork overhead and invites fork failures or swap death during the next background save.
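
The first and third items above can be verified with two small checks. This is a sketch assuming the standard Linux paths /proc/sys/vm/swappiness and /sys/kernel/mm/transparent_hugepage/enabled; thp_mode and swappiness_ok are hypothetical helper names.

```shell
# Extract the active mode from a THP sysfs line such as "always madvise [never]".
# thp_mode is a hypothetical helper name used only for this illustration.
thp_mode() {
  sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Succeed when a swappiness value is within the 0-1 range suggested above.
# swappiness_ok is a hypothetical helper name used only for this illustration.
swappiness_ok() {
  [ "$1" -le 1 ]
}

# Usage:
#   thp_mode                                        # prints the active THP mode
#   swappiness_ok "$(cat /proc/sys/vm/swappiness)" || echo "swappiness too high"
```

Changing either setting requires root (sysctl -w vm.swappiness=1, and writing never to the THP enabled file), and neither change survives a reboot unless it is also added to the host's boot-time configuration.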

How Netdata helps

  • Correlates mem_fragmentation_ratio, used_memory, and used_memory_rss on the same charts to expose the swap gap.
  • Alerts on mem_fragmentation_ratio < 0.8 when used_memory > 100 MB, suppressing noise from small instances where process overhead dominates.
  • Surfaces system-level swap metrics alongside Redis metrics to confirm OS-level swapping.
  • Tracks latest_fork_usec spikes that precede COW-induced memory pressure.
  • Displays allocator_frag_ratio and allocator_rss_ratio on Redis 4.0+ instances to help distinguish true swap from allocator artifacts.