Redis rdb_last_bgsave_status:err: diagnosing failed background saves
INFO persistence returning rdb_last_bgsave_status:err means the last background save failed. The flag is sticky: it remains err until a subsequent BGSAVE succeeds, so the failure may be hours old. If stop-writes-on-bgsave-error is enabled (the default), Redis rejects writes and your application sees MISCONF errors. If the setting is disabled, writes continue but durability is broken; the exposure window grows with every update.
The failure modes are a narrow set: fork failure, disk full, filesystem write rejection, or child death before completion. Follow this sequence to separate them without restarting Redis.
What this means
rdb_last_bgsave_status reports the result of the most recent background RDB save attempt. Valid values are ok and err. The field is sticky: it stays err until a subsequent BGSAVE succeeds. Redis initializes this status to ok on startup, so an err after restart means a save already failed. Do not assume the failure is transient just because the status looks old.
When the status is err, check three companion fields immediately:
- rdb_last_save_time: the Unix epoch timestamp of the last successful save. It does not advance when saves fail.
- rdb_changes_since_last_save: the number of changes (write operations) accumulated since that last successful save.
- rdb_bgsave_in_progress: 1 when a child is currently serializing the dataset.
If rdb_last_bgsave_status is ok but rdb_last_save_time is hours old, the configured save thresholds have not been met and no save was attempted. That is normal for idle instances. This guide covers explicit save failures.
While a BGSAVE runs, rdb_bgsave_in_progress is 1. Redis does not run RDB and AOF background operations simultaneously, so a long-running AOF rewrite can delay an automatic RDB save. A delayed save does not produce err; it waits. err only appears when the child exits with a failure.
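The companion fields are easier to read side by side. The sketch below is one way to condense them into a single line; the function name rdb_summary is hypothetical, and the current time is passed in as an argument so the snapshot age can be computed without calling date inside the function. It assumes INFO text arrives on stdin with the CRLF line endings that redis-cli emits.

```shell
# Summarize RDB persistence state from `INFO persistence` text on stdin.
# Hypothetical helper; $1 is the current Unix time in seconds.
# Example (assumes redis-cli can reach the instance):
#   redis-cli INFO persistence | rdb_summary "$(date +%s)"
rdb_summary() {
    # Strip carriage returns, then pick the four fields by name.
    tr -d '\r' | awk -F: -v now="$1" '
        $1 == "rdb_last_bgsave_status"      { status  = $2 }
        $1 == "rdb_last_save_time"          { saved   = $2 }
        $1 == "rdb_changes_since_last_save" { changes = $2 }
        $1 == "rdb_bgsave_in_progress"      { running = $2 }
        END {
            printf "status=%s age_s=%d changes=%d in_progress=%d\n",
                   status, now - saved, changes, running
        }'
}
```

A large age_s together with status=err and a growing changes count is the combination this guide is about; a large age_s with status=ok usually just means the save thresholds were not reached.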
flowchart TD
A[BGSAVE triggered by threshold or command] --> B{fork succeeds?}
B -->|No| C[Status err<br/>fork failure]
B -->|Yes| D{Disk space<br/>sufficient?}
D -->|No| E[Status err<br/>disk full]
D -->|Yes| F{Permissions<br/>valid?}
F -->|No| G[Status err<br/>permission denied]
F -->|Yes| H{Child survives<br/>COW pressure?}
H -->|No| I[Status err<br/>child OOM killed]
H -->|Yes| J[Atomic rename<br/>Status ok]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk full on persistence volume | err status, Redis read-only if stop-writes-on-bgsave-error yes, no new RDB file | Free space on the directory configured by the dir directive |
| Fork failure (memory/overcommit) | Log contains fork: Cannot allocate memory, latest_fork_usec unchanged from the previous fork or very high | vm.overcommit_memory and available RAM versus used_memory_rss |
| Permission denied | Status flips to err after deployment or migration, log shows write denied | Ownership and permissions on the dir directory and dbfilename path |
| Child OOM killed during COW | Kernel OOM killer log, rdb_bgsave_in_progress drops from 1 to 0 with err | dmesg or /var/log/kern.log for OOM events |
| Slow fork (THP, NUMA) | Save hangs or times out, latest_fork_usec spikes before failure | /sys/kernel/mm/transparent_hugepage/enabled |
Quick checks
Run these read-only commands to orient yourself before making changes.
# Check persistence status and companion fields
redis-cli INFO persistence | grep -E "rdb_last_bgsave_status|rdb_last_save_time|rdb_changes_since_last_save|rdb_bgsave_in_progress"
# Check how saves are configured
redis-cli CONFIG GET save
# Check where Redis is trying to write
redis-cli CONFIG GET dir
redis-cli CONFIG GET dbfilename
# Check whether failed saves are blocking writes
redis-cli CONFIG GET stop-writes-on-bgsave-error
# Check fork latency from the last attempt
redis-cli INFO stats | grep latest_fork_usec
# Check memory footprint
redis-cli INFO memory | grep used_memory_rss
# Check THP status, a common cause of fork slowness
cat /sys/kernel/mm/transparent_hugepage/enabled
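The THP file lists every mode and marks the active one with square brackets, which is easy to misread at a glance. As a small sketch, the hypothetical helper thp_active_mode below extracts only the bracketed word so it can be compared against never in a script.

```shell
# Print the active THP mode, i.e. the word inside [brackets].
# Hypothetical helper; $1 is the file content, e.g. "always madvise [never]".
# Example:
#   thp_active_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}
```

Anything other than never in the output means THP can still affect fork latency on this host.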
How to diagnose it
Follow these steps in order. Each one targets a specific failure mode.
1. Confirm a save was attempted.
   Check rdb_bgsave_in_progress. If it is 1, a child process is currently serializing the dataset. Do not treat the err status as current until that child finishes; the status reflects the previous attempt. Redis prevents simultaneous background RDB and AOF operations, so an in-progress AOF rewrite can delay a scheduled RDB save without causing an error.
2. Check the configured save directory and file name.
   Run CONFIG GET dir and CONFIG GET dbfilename. Verify the path is the one you expect. If the instance was recently migrated or started with a different configuration, Redis may be trying to write to a path that does not exist or is on the wrong volume.
3. Check disk space on the persistence volume.
   The RDB write path is: child process writes to a temporary file, then performs an atomic rename to the target file. The temporary file consumes roughly the same space as the final RDB snapshot. Use df -h against the directory from step 2. If the volume is full, this is your root cause.
4. Check for fork failure in the Redis log.
   If the log shows Can't save in background: fork: Cannot allocate memory, the child process could not be created. This is almost always caused by vm.overcommit_memory set to 0 on Linux. Redis relies on overcommit because fork() duplicates page tables, not data pages. Check with sysctl vm.overcommit_memory. If it returns 0, run sudo sysctl vm.overcommit_memory=1 and update /etc/sysctl.conf to persist the change.
5. Check kernel logs for child OOM kills.
   Even when fork succeeds, the child process can be killed by the OOM killer during copy-on-write if write traffic is heavy and the host is memory-constrained. Check dmesg or /var/log/kern.log for OOM killer events targeting the Redis child. If you find them, your system lacks sufficient headroom for COW during persistence. For persistent instances, used_memory_rss should stay below roughly 50 percent of physical RAM to leave room for COW.
6. Check directory and file permissions.
   If the disk has space and fork succeeded, verify that the user running the Redis process has write permission on the dir directory and can create the dbfilename file. ls -ld on the directory is usually enough to spot a permission mismatch after a deployment or restore.
7. Check for THP-induced fork latency.
   If latest_fork_usec is extremely high (hundreds of milliseconds or more) relative to your dataset size, Transparent Huge Pages may be causing page-table bloat. Check /sys/kernel/mm/transparent_hugepage/enabled. If it is not [never], disable THP. While THP usually causes slowness rather than outright failure, a sufficiently delayed fork can trigger client timeouts or watchdog behavior that aborts the save.
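Several of the steps above come down to spotting a telltale string in the Redis log. The sketch below shows one way to automate that first pass. The function name classify_bgsave_failure is hypothetical, and the patterns are assumptions: the fork message is the one quoted in this guide, while the disk, permission, and signal patterns rely on the standard system error strings and on Redis reporting a killed child as "terminated by signal 9". Verify them against your own log before relying on the result.

```shell
# Read Redis log text on stdin and print a failure class for the first
# line that matches one of the known patterns. Hypothetical helper.
# Example (log path is an assumption; use your configured logfile):
#   tail -n 500 /var/log/redis/redis-server.log | classify_bgsave_failure
classify_bgsave_failure() {
    awk '
        /fork: Cannot allocate memory/ { print "fork_failure";      found = 1; exit }
        /No space left on device/      { print "disk_full";         found = 1; exit }
        /Permission denied/            { print "permission_denied"; found = 1; exit }
        /terminated by signal 9/       { print "child_killed";      found = 1; exit }
        END { if (!found) print "unknown" }
    '
}
```

An unknown result does not mean the save succeeded; it only means none of these four strings appeared in the text that was scanned.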
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| rdb_last_bgsave_status | Binary result of the last save attempt | Any err value |
| rdb_last_save_time | Age of the last successful snapshot | Older than twice the expected save interval |
| rdb_changes_since_last_save | Exposure window in unsaved changes | Growing while status is err |
| latest_fork_usec | How long the main thread was frozen during fork | > 500 ms, or trending upward |
| used_memory_rss | Physical memory footprint; COW spikes this during saves | Approaching host or container memory limit |
| Disk free space on dir volume | RDB needs free space for the temp file alongside the existing snapshot | Less than 2x current RDB size |
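Two of the warning signs in the table are simple numeric comparisons. As an illustration only, the hypothetical function check_rdb_signals below applies the snapshot-age rule (older than twice the expected interval) and the 500 ms fork threshold to values you have already read from INFO.

```shell
# check_rdb_signals NOW LAST_SAVE EXPECTED_INTERVAL_S FORK_USEC
# Hypothetical helper: prints one WARN line per threshold crossed.
#   NOW, LAST_SAVE       Unix timestamps in seconds
#   EXPECTED_INTERVAL_S  how often you expect a snapshot, in seconds
#   FORK_USEC            latest_fork_usec from INFO stats
check_rdb_signals() {
    age=$(( $1 - $2 ))
    # Snapshot older than twice the expected save interval.
    [ "$age" -gt $(( 2 * $3 )) ] && echo "WARN snapshot_age=${age}s"
    # Fork took longer than 500 ms (500000 microseconds).
    [ "$4" -gt 500000 ] && echo "WARN latest_fork_usec=$4"
    return 0
}
```

The thresholds are the ones from the table above; adjust them to match your own save schedule and latency budget.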
Fixes
Disk full or insufficient space
Free space on the volume holding the dir directory. The temporary RDB file requires roughly the final snapshot size before the atomic rename. Maintain at least three times the dataset size as free space on the persistence volume to cover RDB snapshots, AOF, and temporary rewrite files.
If you are blocked by stop-writes-on-bgsave-error yes and need to restore writes immediately while you arrange more disk space, you can run CONFIG SET stop-writes-on-bgsave-error no. Do this only as an emergency workaround. Your persistence guarantee remains broken until the root cause is fixed.
If the volume is shared with logs or AOF, consider moving RDB to a dedicated mount. The tradeoff is operational complexity: another volume to monitor and back up. Isolating RDB I/O prevents log rotation or AOF growth from stepping on snapshot space.
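The free-space rule can be checked before a save fails. The following is a minimal sketch with two hypothetical helpers: free_kb reads available space from POSIX-format df output, and has_headroom compares it against a multiple of the current RDB size. It assumes the filesystem name in the df output contains no spaces.

```shell
# free_kb DIR: available kilobytes on the filesystem holding DIR.
# Uses `df -Pk` so the output is one line per filesystem in 1024-byte blocks.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_headroom FREE_KB RDB_KB FACTOR: succeeds when FREE_KB >= FACTOR * RDB_KB.
has_headroom() {
    [ "$1" -ge $(( $2 * $3 )) ]
}

# Example (the path and file name are placeholders for your dir and dbfilename):
#   rdb_kb=$(du -k /var/lib/redis/dump.rdb | cut -f1)
#   has_headroom "$(free_kb /var/lib/redis)" "$rdb_kb" 3 || echo "low disk headroom"
```

A factor of 3 follows the guidance above; a volume that holds only RDB files can get by with less.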
Fork failure or child OOM kill
Set vm.overcommit_memory to 1. Redis requires this because fork() must allocate page tables for the entire virtual address space even though COW avoids immediate physical duplication. Run sudo sysctl vm.overcommit_memory=1 and update /etc/sysctl.conf to persist the change.
If the child is being OOM killed, reduce memory pressure. Lower maxmemory, add physical RAM, or shed write load during the save window. Sharding the dataset across multiple Redis instances reduces the memory footprint per node, which lowers both fork latency and COW risk. The tradeoff is application complexity: you must handle multi-key operations that cross slot boundaries.
If you do not need RDB snapshots, disable them, but ensure AOF is enabled and healthy if you require durability.
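To judge whether a host has room for copy-on-write, it helps to express used_memory_rss as a share of physical RAM. The sketch below does that with integer arithmetic; the function names rss_pct and cow_headroom_ok are hypothetical, and the 50 percent limit is the rule of thumb from this guide, not a hard Redis requirement.

```shell
# rss_pct RSS_BYTES MEMTOTAL_KB: integer percentage of physical RAM in use by Redis.
#   RSS_BYTES    used_memory_rss from INFO memory (bytes)
#   MEMTOTAL_KB  MemTotal from /proc/meminfo (kilobytes)
rss_pct() {
    echo $(( $1 / 1024 * 100 / $2 ))
}

# cow_headroom_ok RSS_BYTES MEMTOTAL_KB: succeeds when RSS is below 50 percent of RAM.
cow_headroom_ok() {
    [ "$(rss_pct "$1" "$2")" -lt 50 ]
}

# Example on Linux (assumes redis-cli can reach the instance):
#   rss=$(redis-cli INFO memory | tr -d '\r' | awk -F: '$1 == "used_memory_rss" { print $2 }')
#   total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   cow_headroom_ok "$rss" "$total_kb" || echo "RSS is at or above 50% of RAM"
```

Inside a container, MemTotal reports the host's memory, so compare against the container memory limit instead.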
Permission denied
Fix ownership and permissions on the dir directory so the Redis process user can write. You can change dir at runtime with CONFIG SET dir <path>, but ensure the configuration file is updated so the change survives restart.
THP or NUMA-induced slowness
Disable Transparent Huge Pages at runtime with echo never > /sys/kernel/mm/transparent_hugepage/enabled (requires root). THP is a sysfs setting, not a sysctl, so to make the change survive reboot, apply it from a boot-time script or an init-system unit, or pass transparent_hugepage=never on the kernel command line. Also check NUMA binding if running on bare metal; misconfigured NUMA can inflate fork latency.
Prevention
- Set stop-writes-on-bgsave-error yes so that a failed save becomes an immediate availability signal rather than a silent durability gap.
- Monitor disk free space on the persistence volume; alert when it drops below three times the current RDB size.
- Ensure vm.overcommit_memory is 1 on all Linux hosts running Redis.
- Disable THP before putting Redis into production.
- Keep used_memory_rss below 50 percent of physical memory on persistent instances to leave headroom for COW.
- Place dir on a volume with dedicated IOPS, not shared with logs or noisy neighbors.
- Review save thresholds against your actual change rate. Aggressive settings such as save 60 10000 increase fork frequency and COW pressure on busy instances. Relaxing them widens your RPO window.
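When reviewing save thresholds, note that CONFIG GET save returns all rules as one space-separated string of seconds and change counts. As a small sketch, the hypothetical helper describe_save_rules splits that string into readable pairs.

```shell
# describe_save_rules "SECONDS CHANGES [SECONDS CHANGES ...]"
# Hypothetical helper: prints one line per save rule.
# Example value as returned by CONFIG GET save: "3600 1 300 100 60 10000"
describe_save_rules() {
    # Intentionally unquoted so the value splits into separate words.
    set -- $1
    while [ "$#" -ge 2 ]; do
        echo "every ${1}s when changes >= $2"
        shift 2
    done
}
```

An empty value prints nothing, which corresponds to automatic RDB snapshots being disabled.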
How Netdata helps
- Plots rdb_last_bgsave_status and rdb_last_save_time alongside disk space and memory utilization to correlate err flips with disk-full or RSS-spike events.
- Alerts when rdb_last_bgsave_status:err persists longer than one expected save cycle, with filtering to suppress noise on idle instances where saves do not trigger.
- Tracks rdb_changes_since_last_save to quantify the dirty-key exposure window during a persistence outage.
- Plots latest_fork_usec spikes against save failures to flag THP or memory-pressure root causes.