Redis aof_last_write_status:err: AOF write failures and recovery

INFO persistence showing aof_last_write_status:err means Redis failed to flush its Append-Only File buffer to disk on the last attempt. If you depend on AOF for durability, the instance is no longer persisting writes. With the default appendfsync everysec, Redis logs the failure and retries, but while aof_last_write_status is err the server rejects write commands with a MISCONF error. If the same storage fault has also failed an RDB snapshot and stop-writes-on-bgsave-error is enabled, the error returned to clients references RDB snapshots even though AOF is failing too, which often misleads first-line diagnosis.

What this means

Redis appends writes to an in-memory AOF buffer, writes that buffer to the AOF file, and fsyncs it according to appendfsync (with everysec, the fsync runs on a background thread). The aof_last_write_status field in INFO persistence tracks whether the last write or fsync succeeded. A value of err is sticky: it remains until the next successful flush. It is typically set by a write() or fsync() failure on the AOF file. An fsync that is slow but has not failed shows up as a rising aof_delayed_fsync counter instead.

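While aof_last_write_status is err, Redis rejects write commands with a MISCONF error. When the same storage problem has also failed the last RDB snapshot and stop-writes-on-bgsave-error is yes (the default), the message clients see is the RDB one:

(error) MISCONF Redis is configured to save RDB snapshots, but it's currently unable to persist to disk.

When only the AOF write is failing, the message instead begins with MISCONF Errors writing to the AOF file. Either way, the rejection protects against silent data loss, but turns a disk issue into a write availability incident.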

flowchart TD
    A[Disk full / I/O stall / permissions] --> B[fsync or write fails]
    B --> C[aof_last_write_status:err]
    C --> D{stop-writes-on-bgsave-error?}
    D -->|yes| E[Redis rejects writes with MISCONF]
    D -->|no| F[Writes accepted but not persisted]
    A --> G[aof_delayed_fsync rises]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Disk full on persistence volume | aof_last_write_status:err, OS ENOSPC alerts, write rejection | df -h on the directory holding the AOF file |
| Disk I/O saturation | aof_delayed_fsync increasing, high iowait, fsync latency spikes | iostat -x 1 or cloud volume burst metrics |
| Filesystem permissions error | Redis logs permission denied, often after migrations or package updates | ls -ld on the Redis dir and AOF path |
| Filesystem-level I/O error | Kernel logs show ext4/xfs errors or EIO | dmesg -T and filesystem health checks |
| AOF rewrite failure | aof_last_bgrewrite_status:err, aof_current_size growing without compaction | INFO persistence size and status fields |

Quick checks

Run these safe, read-only commands to triage.

# Check AOF state, rewrite health, and delayed fsync count
redis-cli INFO persistence | grep -E "aof_last_write_status|aof_enabled|aof_last_bgrewrite_status|aof_delayed_fsync|aof_current_size|aof_base_size"

# Check if writes are already being rejected
redis-cli INFO stats | grep total_error_replies
redis-cli INFO errorstats

# Verify fsync policy and write-stop behavior
redis-cli CONFIG GET appendfsync
redis-cli CONFIG GET stop-writes-on-bgsave-error

# Check disk space on the persistence volume (run on the Redis host)
df -h "$(redis-cli CONFIG GET dir | tail -n1)"

# Inspect recent kernel storage errors
dmesg -T | grep -iE "error|ext4|xfs|scsi|block"

# Verify AOF directory permissions (run on the Redis host)
ls -ld "$(redis-cli CONFIG GET dir | tail -n1)"
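
The INFO persistence fields checked above can be reduced to a single verdict for scripts or runbooks. The following is a minimal sketch, assuming the field names listed in the quick checks; aof_triage is a hypothetical helper name, not something Redis ships.

```shell
# Sketch: classify AOF health from `INFO persistence` text read on stdin.
# aof_triage is a hypothetical helper, not part of Redis.
aof_triage() {
  # INFO output uses CRLF line endings, so strip \r before comparing values.
  tr -d '\r' | awk -F: '
    $1 == "aof_enabled"               { enabled = $2 }
    $1 == "aof_last_write_status"     { wstatus = $2 }
    $1 == "aof_last_bgrewrite_status" { rstatus = $2 }
    END {
      if (enabled != "1")        print "aof-disabled"
      else if (wstatus == "err") print "aof-write-failing"
      else if (rstatus == "err") print "aof-rewrite-failing"
      else                       print "ok"
    }'
}

# Usage (on a host that can reach Redis):
#   redis-cli INFO persistence | aof_triage
```

The write status is checked before the rewrite status because a failing write is the more urgent condition of the two.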

How to diagnose it

  1. Confirm the error and scope. Run the INFO persistence checks. Verify aof_enabled is 1. If AOF is disabled, the status is irrelevant and you are running without AOF persistence. Note aof_last_bgrewrite_status as well; a rewrite failure compounds the problem by allowing unbounded AOF growth.
  2. Determine if clients are impacted. Check total_error_replies rate. If stop-writes-on-bgsave-error is yes, attempt a test SET from a non-production client. The MISCONF error confirms that write rejection is active.
  3. Inspect disk space and inodes. AOF appends continuously and rewrites temporarily need up to the dataset size in additional space. Use df -h and df -i on the persistence volume. If block or inode usage is near 100%, this is the most likely cause.
  4. Check logs for specific errors. Review the Redis log and dmesg for “No space left on device”, “Permission denied”, “Read-only file system”, or block-layer I/O errors.
  5. Correlate with aof_delayed_fsync. A rising counter means the background fsync thread is stalling. Cross-reference with disk latency metrics. During RDB snapshots or AOF rewrite, temporary spikes are expected; sustained growth is not.
  6. Verify permissions. Ensure the Redis process owner can write to the configured dir. Ownership changes after OS upgrades or volume remounts are common culprits.
  7. Differentiate transient from persistent. Because aof_last_write_status is sticky, it may reflect a past error that has already cleared. Free space or restore I/O capacity, then re-check the field after a write operation triggers a new fsync.
  8. If Redis will not start. If the instance crashes on startup due to AOF issues, check the startup log. You may need to run redis-check-aof --fix against your AOF file before the server can load.
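
Step 4 above can be sketched as a small filter that maps log text to the cause it most likely points to. This is an illustrative sketch only: classify_storage_error is a hypothetical helper, the matched strings are the standard OS error messages quoted in step 4, and the log path in the usage comment is a placeholder.

```shell
# Sketch: map Redis or kernel log text (stdin) to a likely storage cause.
# classify_storage_error is a hypothetical helper name.
classify_storage_error() {
  # Lowercase the input so matching is case-insensitive.
  text="$(tr '[:upper:]' '[:lower:]')"
  case "$text" in
    *"no space left on device"*)            echo "disk-full" ;;
    *"read-only file system"*)              echo "read-only-fs" ;;
    *"permission denied"*)                  echo "permissions" ;;
    *"i/o error"*|*"input/output error"*)   echo "io-error" ;;
    *)                                      echo "unknown" ;;
  esac
}

# Usage (the log path is a placeholder for your own):
#   tail -n 200 /path/to/redis.log | classify_storage_error
#   dmesg -T | classify_storage_error
```

The first matching pattern wins, so a log containing several error types reports only the one listed earliest in the case statement.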

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| aof_last_write_status | Binary AOF health | err |
| aof_delayed_fsync | Leading indicator of disk I/O saturation | Rate increasing per minute |
| aof_last_bgrewrite_status | Rewrite failure prevents AOF compaction | err |
| aof_current_size / aof_base_size | AOF bloat when rewrite is stuck | Ratio > 2 for extended periods |
| total_error_replies | Application-visible command failures | Rate > 0 |
| errorstat_MISCONF, errorstat_OOM | Specific write rejections from persistence blocks (MISCONF) or noeviction memory limits (OOM) | Rate > 0 |
| latest_fork_usec | Background persistence latency and memory pressure | Spikes correlating with AOF issues |
| Disk free space | AOF needs space to append and rewrite | < 3x dataset size |
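
The aof_current_size / aof_base_size ratio in the table is not reported directly, so it has to be computed from two INFO fields. A minimal sketch, assuming both fields are present in the output; aof_growth_ratio is a hypothetical helper name.

```shell
# Sketch: compute aof_current_size / aof_base_size from `INFO persistence`.
# aof_growth_ratio is a hypothetical helper name.
aof_growth_ratio() {
  # Strip the CRLF line endings that INFO output carries.
  tr -d '\r' | awk -F: '
    $1 == "aof_current_size" { cur = $2 }
    $1 == "aof_base_size"    { base = $2 }
    END {
      # Avoid dividing by zero when no base size has been recorded.
      if (base > 0) printf "%.2f\n", cur / base
      else          print "n/a"
    }'
}

# Usage:
#   redis-cli INFO persistence | aof_growth_ratio
```

A value that stays above 2 for an extended period matches the warning sign listed in the table.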

Fixes

Disk full

Free space on the persistence volume by removing logs, rotating files, or expanding storage. Do not delete or truncate the active AOF file while Redis is running. Once space is available, Redis should succeed on the next fsync. Verify recovery:

redis-cli INFO persistence | grep aof_last_write_status

If the status does not return to ok after a successful write cycle, check for lingering filesystem errors. A controlled restart is a last resort if you suspect a stale file descriptor.
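
Rather than re-running the check by hand, recovery can be polled. The sketch below assumes redis-cli is on the PATH and can reach the instance without extra flags; wait_for_aof_ok and the REDIS_CLI override variable are hypothetical names introduced for illustration.

```shell
# Sketch: poll aof_last_write_status until it returns to ok or attempts run out.
# wait_for_aof_ok and REDIS_CLI are hypothetical names, not Redis features.
wait_for_aof_ok() {
  attempts="${1:-30}"   # number of polls before giving up
  delay="${2:-1}"       # seconds to sleep between polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    st="$(${REDIS_CLI:-redis-cli} INFO persistence | tr -d '\r' |
          awk -F: '$1 == "aof_last_write_status" { print $2 }')"
    if [ "$st" = "ok" ]; then
      echo "recovered"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still-failing"
  return 1
}

# Usage: poll once per second for up to a minute.
#   wait_for_aof_ok 60 1
```

If the function reports still-failing after space has been freed, continue with the filesystem checks described above before considering a restart.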

Disk I/O saturation

Identify competing I/O consumers. RDB snapshots, AOF rewrites, log shippers, and backup agents all contend for the same volume. If possible, move AOF to a dedicated fast disk. Reduce write volume temporarily at the application layer. Ensure that AOF rewrite scheduling does not compound normal fsync load.

Permissions or filesystem errors

Fix directory ownership so the Redis user can write to the configured dir and AOF path. If the filesystem has remounted read-only due to corruption, resolve the filesystem health before allowing Redis to continue writing.

AOF corruption preventing startup

If Redis detects corruption at startup and refuses to load, use redis-check-aof --fix against your configured AOF file. This truncates the file where the first incomplete command begins, so be aware that --fix discards any data from that point to the tail of the file. After repair, start Redis and confirm aof_last_write_status:ok.
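
Because the repair discards data, it is prudent to keep a copy of the AOF before running it. The following is a minimal sketch, assuming Redis is stopped and there is enough free space for the copy; backup_aof is a hypothetical helper and the path in the usage comment is a placeholder.

```shell
# Sketch: copy the AOF (a file, or a directory on multi-part layouts) before repair.
# backup_aof is a hypothetical helper name.
backup_aof() {
  src="$1"
  # Timestamped destination next to the original so repeated runs do not collide.
  dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
  # -a preserves ownership and mode and recurses into directories.
  cp -a -- "$src" "$dest" && echo "$dest"
}

# Usage (path is a placeholder for your configured AOF location):
#   backup_aof /path/to/appendonly.aof
#   redis-check-aof --fix /path/to/appendonly.aof
```

The function prints the backup path on success, which makes it easy to record in an incident log.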

Emergency: allow writes while storage is unavailable

If you cannot restore storage quickly and need to prevent a total outage, you can temporarily disable write rejection:

# DANGER: allows writes without durability. Use only as a temporary bridge.
redis-cli CONFIG SET stop-writes-on-bgsave-error no

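All writes accepted during this window are at risk of loss. Re-enable the setting immediately after the storage issue is resolved. Note that this setting governs rejections caused by failed RDB snapshots. If clients report MISCONF Errors writing to the AOF file, changing it does not lift that rejection.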

Prevention

  • Monitor disk free space independently. Maintain at least 3x the dataset size as free space on the persistence volume to accommodate AOF growth, temporary rewrite files, and RDB snapshots.
  • Alert on aof_delayed_fsync rate increases. It is the earliest signal of disk I/O pressure before aof_last_write_status flips to err.
  • Audit permissions on the Redis dir after deployments, volume mounts, or OS upgrades.
  • Track the ratio aof_current_size / aof_base_size. If it climbs steadily, AOF rewrite is failing or disabled.
  • For automated backup scripts, ensure you capture a consistent AOF state. Disable automatic rewrites during the backup window if your tooling supports it.
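
The 3x headroom rule from the first bullet can be expressed as a simple comparison for use in a cron job or alert script. This is a sketch under stated assumptions: has_aof_headroom is a hypothetical helper, both sizes must be in the same unit, and using used_memory as a stand-in for dataset size is an approximation.

```shell
# Sketch: succeed when free space is at least FACTOR times the dataset size.
# has_aof_headroom is a hypothetical helper; both sizes must share one unit.
has_aof_headroom() {
  free_size="$1"       # free space on the persistence volume
  dataset_size="$2"    # approximate dataset size
  factor="${3:-3}"     # multiplier, defaulting to the 3x guideline
  [ "$free_size" -ge $((dataset_size * factor)) ]
}

# Usage, with both values in KiB (used_memory is only a rough proxy):
#   free_kb="$(df -Pk "$(redis-cli CONFIG GET dir | tail -n1)" | awk 'NR==2 {print $4}')"
#   used_kb="$(redis-cli INFO memory | tr -d '\r' | awk -F: '$1 == "used_memory" {print int($2 / 1024)}')"
#   has_aof_headroom "$free_kb" "$used_kb" || echo "low disk headroom for AOF"
```

Keeping the comparison separate from the data collection makes the threshold easy to adjust per environment.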

How Netdata helps

  • Netdata collects aof_last_write_status, aof_delayed_fsync, and aof_last_bgrewrite_status from INFO persistence without extra configuration.
  • Correlate rising aof_delayed_fsync with node-level disk.await and utilization to distinguish disk saturation from a configuration issue.
  • Alert on aof_last_write_status:err with a duration threshold to avoid paging on transient stalls.
  • Cross-reference redis.total_error_replies and redis.errorstat_OOM to detect whether persistence failure has escalated to write rejection.
  • Disk space charts on the persistence volume show remaining capacity before Redis hits ENOSPC.