Redis mem_fragmentation_ratio below 1.0: detecting swap death
A mem_fragmentation_ratio of 0.72 on a substantial dataset means the operating system has swapped out Redis memory pages. Redis stays alive and responds to PING, but every command touching a swapped key blocks the single event loop on disk I/O. Because Redis logs nothing about swap, the resulting latency catastrophe looks like a mystery.
mem_fragmentation_ratio equals used_memory_rss / used_memory. When the ratio drops below 1.0, the resident set size is smaller than the memory Redis requested from its allocator. The missing bytes are in swap. Operators typically watch for ratios above 1.5, so a low ratio is misread as “good fragmentation” when it is actually the worst memory-related failure mode short of an OOM kill.
The operational threshold is roughly 0.8 on instances with used_memory above 100 MB. Smaller instances can show ratio noise from process overhead, but on a production dataset a sustained value below 0.8 means swap death.
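The ratio and threshold above can be sketched as a small shell function. This is a minimal illustration, not a Redis tool: the function name check_swap_ratio is hypothetical, it reads INFO memory output on stdin, and it treats 100 MB as 100 * 1024 * 1024 bytes, which is an assumption.

```shell
# Hypothetical helper: classify INFO memory output read from stdin.
# The 0.8 ratio and 100 MB thresholds are the ones quoted in this guide;
# interpreting 100 MB as 100 * 1024 * 1024 bytes is an assumption.
check_swap_ratio() {
  # INFO output uses CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory"     { used = $2 }
    $1 == "used_memory_rss" { rss = $2 }
    END {
      if (used == "" || rss == "" || used + 0 == 0) { print "unknown"; exit }
      ratio = rss / used
      if (used + 0 < 100 * 1024 * 1024) verdict = "ambiguous"
      else if (ratio < 0.8)             verdict = "swap-suspected"
      else                              verdict = "ok"
      printf "%.2f %s\n", ratio, verdict
    }'
}
```

A typical call would be redis-cli INFO memory | check_swap_ratio. A "swap-suspected" verdict is only a prompt to confirm with OS-level swap metrics, not proof on its own.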
What this means
used_memory tracks what Redis requested from its allocator. used_memory_rss tracks what the OS kept in RAM. When RSS falls below allocated memory, the kernel paged some of Redis’s anonymous memory out to disk.
Redis has no awareness that its pages are on swap. The event loop continues accepting commands, but when it touches a swapped key, value, or internal structure, the thread blocks waiting for the storage layer to page data back in. Because command execution is single-threaded, one swapped access stalls every other client. The latency is a wall, not a slope.
Swapped pages persist even after host memory pressure subsides. The kernel does not bring them back automatically when free RAM becomes available; they stay on disk until Redis touches them again. The latency hit can outlast the original pressure by hours.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Host memory overcommit or competing processes | Ratio drops as other processes allocate RAM; system swap usage grows; latency climbs silently | free -h and /proc/swaps |
| Fork COW pressure from persistence | Ratio drops during or immediately after BGSAVE or AOF rewrite; latest_fork_usec spiked | rdb_bgsave_in_progress, aof_rewrite_in_progress, and host RAM headroom |
| vm.swappiness too high | Gradual ratio decline under normal load; no sudden memory spike in Redis | cat /proc/sys/vm/swappiness |
| Container memory limits without host headroom | Ratio drops inside a container while the host shows free RAM; OOM kills may follow | Container cgroup memory limits versus used_memory_rss |
| NUMA misconfiguration | Intermittent ratio drops on large bare-metal hosts under load | numactl --hardware or /sys/devices/system/node/ |
Quick checks
# Confirm ratio and instance size
redis-cli INFO memory | grep -E "mem_fragmentation_ratio|used_memory:"
# Check system swap usage and available RAM
free -h && swapon --show
# Check kernel tendency to swap anonymous pages
cat /proc/sys/vm/swappiness
# Check per-process swap consumption for the Redis process
cat /proc/$(pidof redis-server | awk '{print $1}')/status | grep VmSwap
# Check allocator metrics to rule out non-swap artifacts (Redis 4.0+)
redis-cli INFO memory | grep -E "allocator_frag_ratio|allocator_rss_ratio"
# Check for active persistence forks that may have triggered COW bloat
redis-cli INFO persistence | grep -E "rdb_bgsave_in_progress|aof_rewrite_in_progress"
All of these are read-only. None change server state.
How to diagnose it
1. Validate the signal. On instances with used_memory above 100 MB, a mem_fragmentation_ratio below 0.8 indicates swap. On smaller instances the ratio may be ambiguous due to process overhead; corroborate with OS metrics before treating it as swap death.
2. Confirm OS-level swap. Run free -h and check that swap used is non-zero or trending upward. Check VmSwap in /proc/<pid>/status for the Redis process. Non-zero VmSwap is ground truth that the kernel has moved Redis pages out of RAM.
3. Rule out allocator artifacts. On Redis 4.0+, check allocator_frag_ratio and allocator_rss_ratio. If these sit in normal ranges while mem_fragmentation_ratio is depressed, the low ratio is not an allocator artifact. It is swap.
4. Correlate with persistence events. Check whether the ratio drop coincides with rdb_bgsave_in_progress=1 or aof_rewrite_in_progress=1. A fork() can, in the worst case, nearly double memory use via copy-on-write as the parent modifies pages during the save; if the host had no headroom, the kernel may have swapped Redis parent pages to make room for COW overhead. This is likely when latest_fork_usec spiked immediately beforehand.
5. Check system memory pressure history. Inspect dmesg for OOM killer activity or memory-reclaim messages around the time the ratio dropped. Run vmstat 1 and look for sustained non-zero values in the si and so columns, which indicate active swap-in and swap-out.
6. Determine whether pressure is ongoing or residual. If host free RAM has recovered but mem_fragmentation_ratio remains low, the pages are still swapped. Redis has not touched them, so no page fault has brought them back. The event loop is running on borrowed time until the next access to a cold page stalls every client.
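The swap confirmation and the ongoing-versus-residual check can be sketched with two small helpers. Both names (vmswap_kb, swap_activity) are hypothetical, and both read text on stdin so they can be run against live output or a saved capture.

```shell
# Hypothetical helpers; the names are not part of Redis or procps.

# Print the VmSwap value in kB from /proc/<pid>/status text read on stdin.
# Prints 0 if the field is absent.
vmswap_kb() {
  awk '$1 == "VmSwap:" { print $2; found = 1 } END { if (!found) print 0 }'
}

# Sum the si and so columns of `vmstat 1 N` output read on stdin.
# Columns are located by header name. The first sample is skipped because
# vmstat reports averages since boot on its first data line.
swap_activity() {
  awk '
    $1 == "r" {
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si = i
        if ($i == "so") so = i
      }
      next
    }
    si && $1 ~ /^[0-9]+$/ {
      if (++n == 1) next
      swapin += $si
      swapout += $so
    }
    END { printf "si=%d so=%d\n", swapin, swapout }'
}
```

For example, vmswap_kb < /proc/$(pidof redis-server | awk '{print $1}')/status followed by vmstat 1 10 | swap_activity. A non-zero VmSwap with si=0 so=0 over the sample suggests residual swap from past pressure; non-zero si or so suggests the pressure is still active.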
flowchart TD
A[mem_fragmentation_ratio < 1.0] --> B{used_memory > 100MB?}
B -->|No| C[Ambiguous: process overhead dominates]
B -->|Yes| D[Check OS swap usage]
D --> E[Swap confirmed]
E --> F[Identify memory pressure source]
F --> G[Restart Redis to reload pages in RAM]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| mem_fragmentation_ratio | Primary indicator comparing RSS to allocated memory | < 0.8 while used_memory > 100 MB |
| used_memory_rss | What the OS sees; the OOM killer and swap subsystem use this | Sustained divergence below used_memory |
| used_memory | Allocator-reported consumption; provides context for the ratio | Validates whether ratio thresholds are meaningful |
| allocator_frag_ratio (Redis 4.0+) | True allocator fragmentation, separate from OS swap behavior | Normal value while mem_fragmentation_ratio is low confirms swap |
| latest_fork_usec | COW fork latency; spikes precede memory pressure events | > 500 ms before a ratio drop suggests COW-induced swap |
| OS swap used | Ground truth for whether swapping is active | Non-zero and correlated with ratio below 1.0 |
| rdb_bgsave_in_progress / aof_rewrite_in_progress | Forks trigger COW that can push a tight host into swap | Ratio dropping during or after persistence operations |
Fixes
Immediate: stop the memory pressure
Identify and terminate or migrate the non-Redis memory consumers that pushed the host over the edge. Stopping the pressure prevents additional pages from being swapped. It does not automatically bring already-swapped Redis pages back into RAM.
Recover the pages
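Restart Redis. On startup, the dataset loads from RDB or AOF into fresh physical memory, and the old process's swapped pages are freed when it exits. A restart recovers only what was persisted, so confirm that a recent RDB or AOF file exists first.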
If a restart is not immediately feasible and you have root access, swapoff -a && swapon -a forces the kernel to move swapped pages back to RAM if capacity exists. Warning: if insufficient free memory exists, this command can hang for minutes or hours and may trigger the OOM killer. Do not run it on a memory-constrained host without an escape plan.
Do not rely on MEMORY PURGE. It instructs jemalloc to release dirty pages to the OS, but swapped pages are already on disk, not in allocator arenas. It will not reload them.
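Before running the swapoff -a && swapon -a sequence described above, it helps to check whether the swapped data would fit back into RAM. The sketch below reads /proc/meminfo-style text on stdin and compares swap in use against MemAvailable. The function name swapoff_precheck and the 10% safety margin are assumptions, not kernel guidance.

```shell
# Hypothetical helper: decide whether swapped data would fit in available RAM.
# Reads /proc/meminfo-style text on stdin (values in kB).
# The 10% safety margin is an assumption chosen for illustration.
swapoff_precheck() {
  awk '
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { total = $2 }
    $1 == "SwapFree:"     { free = $2 }
    END {
      if (avail == "" || total == "" || free == "") { print "unknown"; exit }
      used = total - free
      verdict = (used * 1.1 <= avail) ? "fits" : "too-tight"
      printf "%s swap_used_kb=%.0f mem_available_kb=%.0f\n", verdict, used, avail
    }'
}
```

Run it as swapoff_precheck < /proc/meminfo. A "too-tight" result means the swapoff is likely to stall or invite the OOM killer, so a restart or freeing memory first is the safer path.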
Right-size the host
Persistent instances need roughly 50% headroom above used_memory_rss to survive fork copy-on-write without pressuring the OS into swap. Cache-only instances still need headroom for client buffers, replication backlogs, and fragmentation. If the host cannot provide this, shard the dataset or move to larger hardware before the next persistence event triggers the same failure.
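The 50% headroom rule of thumb reduces to simple arithmetic. A minimal sketch, assuming both values are given in bytes and using a hypothetical function name:

```shell
# Hypothetical helper: check the "roughly 50% headroom" rule of thumb.
# Arguments: used_memory_rss in bytes, host (or container) RAM in bytes.
fork_headroom_ok() {
  rss=$1
  ram=$2
  need=$(( rss + rss / 2 ))   # RSS plus 50% headroom for copy-on-write
  if [ "$need" -le "$ram" ]; then
    echo "ok need=$need"
  else
    echo "short need=$need"
  fi
}
```

For example, an instance with 8 GB of RSS needs about 12 GB available to it, so fork_headroom_ok 8000000000 16000000000 reports ok, while 12 GB of RSS on the same host reports short.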
Prevention
- Set vm.swappiness to 0 or 1. This reduces the kernel’s tendency to swap anonymous pages, though it does not eliminate swap under severe memory pressure.
- Account for COW in capacity planning. For instances with RDB or AOF enabled, maintain used_memory below 50% of physical RAM. For cache-only instances, stay below 75% of available memory.
- Disable Transparent Huge Pages. THP is the most common cause of excessive fork latency and COW bloat. A fork that takes ten times longer than necessary increases the window during which memory pressure can force swap.
- Set maxmemory and enforce host-level headroom. A Redis instance with no limit grows until the OOM killer intervenes. Even with maxmemory configured, ensure the host has enough RSS headroom that the OS never needs to reclaim Redis pages.
- Monitor RSS, not just used_memory. The swap subsystem and the OOM killer operate on RSS. A healthy used_memory means nothing if used_memory_rss is being reclaimed.
- In containerized environments, ensure the container memory limit includes COW overhead. A limit set equal to maxmemory guarantees fork failures or swap death during the next background save.
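The capacity-planning fractions in the list above can be turned into a starting value for maxmemory. This is a sketch only: the function name memory_budget and its mode names are hypothetical, and the result is a ceiling for used_memory, not a guarantee against swap.

```shell
# Hypothetical helper: suggest a used_memory ceiling from RAM in bytes.
# The 50% (persistence enabled) and 75% (cache-only) fractions come from
# the prevention list in this guide; the mode names are made up here.
memory_budget() {
  ram=$1
  mode=$2
  case "$mode" in
    persistent) echo $(( ram / 2 )) ;;
    cache)      echo $(( ram * 3 / 4 )) ;;
    *)          echo "usage: memory_budget <ram_bytes> persistent|cache" >&2
                return 1 ;;
  esac
}
```

On a host, the RAM figure could come from awk '/^MemTotal:/ {printf "%.0f", $2 * 1024}' /proc/meminfo; inside a container, the cgroup memory limit is the relevant figure instead.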
How Netdata helps
- Correlates mem_fragmentation_ratio, used_memory, and used_memory_rss on the same charts to expose the swap gap.
- Alerts on mem_fragmentation_ratio < 0.8 when used_memory > 100 MB, suppressing noise from small instances where process overhead dominates.
- Surfaces system-level swap metrics alongside Redis metrics to confirm OS-level swapping.
- Tracks latest_fork_usec spikes that precede COW-induced memory pressure.
- Displays allocator_frag_ratio and allocator_rss_ratio on Redis 4.0+ instances to help distinguish true swap from allocator artifacts.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis aof_last_write_status:err: AOF write failures and recovery
- Redis appendfsync always latency: durability vs throughput trade-offs
- Redis big keys: finding the giant key that blocks the event loop
- Redis blocked_clients growing: dead consumers vs healthy queues
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit
- Redis cluster_slots_pfail > 0: impending node failure in a cluster
- Redis CLUSTERDOWN / cluster_state:fail: slot coverage and recovery
- Redis connected_clients climbing: connection leak detection
- Redis connected_slaves dropped: detecting replica disconnects on the primary







