Redis replication lag: detection, diagnosis, and fixes
Reads from Redis replicas returning stale data indicate replication lag. Your monitoring shows a growing gap between the primary’s replication offset and what the replica has acknowledged. During failover, every byte of that gap is potential data loss.
Replication lag in Redis is measured in bytes: the difference between master_repl_offset on the primary and slave_repl_offset on the replica. Small, stable lag is normal in asynchronous replication, but lag that grows continuously or exceeds repl-backlog-size signals a bottleneck that can cascade into full resync storms.
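As a minimal sketch of that subtraction, the helper below reads the primary's INFO replication output on stdin and prints each replica's lag in bytes. The function name replica_lag_bytes is hypothetical, not a Redis command. It assumes the slaveN:ip=...,port=...,state=...,offset=...,lag=... line format that redis-cli prints, and it uses the offset each replica last acknowledged to the primary rather than slave_repl_offset read on the replica itself.

```shell
# Hypothetical helper: per-replica lag in bytes, as seen from the primary.
# Usage: redis-cli INFO replication | replica_lag_bytes
replica_lag_bytes() {
  # INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk '
    /^master_repl_offset:/ { master = substr($0, index($0, ":") + 1) }
    /^slave[0-9]+:/ {
      name = substr($0, 1, index($0, ":") - 1)
      n = split(substr($0, index($0, ":") + 1), fields, ",")
      for (i = 1; i <= n; i++) {
        split(fields[i], kv, "=")
        if (kv[1] == "offset") { names[++count] = name; offsets[count] = kv[2] }
      }
    }
    END {
      # master_repl_offset is printed after the slaveN lines, so subtract here.
      for (i = 1; i <= count; i++)
        printf "%s %.0f\n", names[i], master - offsets[i]
    }'
}
```

Each output line is a replica name followed by its lag in bytes, which is convenient to feed into a loop that samples the value every few seconds and watches the trend.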
When a replica falls behind far enough that its offset rolls off the primary’s circular backlog, a reconnection forces a full resync. The primary forks to generate an RDB snapshot, and the replica loads it. This blocks the primary’s event loop during fork and leaves the replica unresponsive during load. Applications reading from the replica serve stale data, and failover to that replica drops all writes in the lag window.
flowchart TD
A[Replica falls behind primary] --> B[Lag exceeds repl-backlog-size]
B --> C[Replica reconnects after blip]
C --> D[Partial resync fails]
D --> E[Full resync starts]
E --> F[Primary forks to create RDB]
F --> G[Latency spike blocks event loop]
G --> H[Other replicas timeout]
H --> A
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Replica CPU, disk, or network bottleneck | Lag grows steadily under write load; replica CPU saturated or disk I/O waits high | INFO cpu on the replica; OS iostat, mpstat, and interface throughput |
| Slow commands blocking the primary event loop | Lag spikes correlate with slowlog entries; all replicas lag simultaneously | SLOWLOG GET 10 on the primary |
| Replication backlog too small | sync_partial_err increments after brief network blips; replicas fall into full resync loops | CONFIG GET repl-backlog-size compared to write throughput |
| Replica loading RDB after full resync | Lag drops to zero after load, but was massive during transfer; replica unresponsive | INFO persistence on the replica for loading flag |
| Network bandwidth saturation | High output kbps on primary approaching NIC limits; lag grows during traffic peaks | INFO stats network metrics and OS interface counters |
| WAIT command or min-replicas blocking | blocked_clients grows; applications report write timeouts | INFO clients for blocked_clients; CONFIG GET min-replicas-to-write |
Quick checks
Run these read-only checks before making changes.
# On the primary: check byte-level offsets of connected replicas
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"
# On the replica: check link status and applied offset
redis-cli INFO replication | grep -E "master_link_status|slave_repl_offset|master_last_io_seconds_ago"
# On the primary: check full vs partial resync history
redis-cli INFO stats | grep -E "sync_full|sync_partial"
# On the primary: check for slow commands
redis-cli SLOWLOG GET 10
# On the primary: check current backlog size
redis-cli CONFIG GET repl-backlog-size
# On the replica: check if it is loading an RDB snapshot
redis-cli INFO persistence | grep -E "loading|rdb_bgsave_in_progress"
# On the primary: check clients blocked by WAIT or slow replication
redis-cli INFO clients | grep blocked_clients
# On the primary: check replica client output buffer limits
redis-cli CONFIG GET client-output-buffer-limit
How to diagnose it
1. Quantify the lag in bytes. On the primary, read master_repl_offset. Subtract the replica’s acknowledged offset from the matching slaveN: line. On the replica, read slave_repl_offset directly. A gap of a few kilobytes is healthy; megabytes and growing is not.
2. Determine if the lag is growing, stable, or spiky. Steady growth under load points to a replica bottleneck. Sudden spikes correlate with slow commands or fork events on the primary.
3. Check the primary slowlog. If SLOWLOG GET shows KEYS, large SMEMBERS, or long Lua scripts, the primary event loop is blocked. Replication stalls for all replicas during these pauses.
4. Inspect replica host resources. Check CPU saturation, disk I/O wait, and network throughput with OS tools. A replica with slower disk or less CPU than the primary will lag during write bursts.
5. Check for failed partial resyncs. If sync_partial_err is incrementing, the replica reconnected but its offset was no longer in the primary’s backlog. This forces full resyncs.
6. Compare write throughput to backlog size. Calculate your peak write rate in bytes per second from master_repl_offset deltas. Multiply by your longest expected disconnect duration. If the result exceeds repl-backlog-size, the backlog is undersized.
7. Verify replica loading state. If loading:1 appears in INFO persistence, the replica is processing an RDB dump and cannot serve reads. Compare loading_loaded_bytes to loading_total_bytes for progress.
8. Check blocked clients. If blocked_clients is high while lag is elevated, clients may be using WAIT for synchronous replication and stalling until the replica catches up.
9. Review network metrics. Check instantaneous_output_kbps on the primary and OS-level interface counters. Replication traffic competes with client traffic for bandwidth.
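The loading-state check can be sketched as a small filter over the replica's INFO persistence output. The function name rdb_load_progress is hypothetical. The sketch assumes the loading, loading_loaded_bytes, and loading_total_bytes fields are present while a load is running, and it reports the size as unknown if the total is missing or zero instead of dividing by it.

```shell
# Hypothetical helper: report RDB load progress on a replica.
# Usage: redis-cli INFO persistence | rdb_load_progress
rdb_load_progress() {
  tr -d '\r' | awk -F: '
    $1 == "loading"              { loading = $2 }
    $1 == "loading_loaded_bytes" { loaded = $2 }
    $1 == "loading_total_bytes"  { total = $2 }
    END {
      if (loading != 1)            print "not loading"
      else if (total + 0 == 0)     print "loading, total size unknown"
      else printf "loading %.1f%%\n", 100 * loaded / total
    }'
}
```

Running it in a loop while a full resync is in flight gives a rough idea of how long the replica will stay unavailable for reads.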
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| master_repl_offset minus replica offset | Direct measure of data at risk during failover | Growing consistently or exceeding repl-backlog-size |
| master_link_status | Binary indicator of replication connectivity | down for more than 30 seconds |
| sync_partial_err | Failed partial resyncs force expensive full syncs | Any sustained increment |
| latest_fork_usec | Fork latency blocks the primary event loop | > 500 ms (500,000 µs, the unit this field reports); spikes precede replica disconnects |
| loading on replica | Replica cannot serve reads while loading RDB | loading:1 for longer than baseline |
| blocked_clients | Clients blocked by WAIT or slow replication | Growing while replication lag is high |
| instantaneous_ops_per_sec | Drops during fork indicate event loop blocking | Sustained drop coinciding with rdb_bgsave_in_progress |
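The master_link_status warning sign can be turned into a scriptable check. This is a sketch under stated assumptions: the function name link_down_too_long and the way the threshold is passed are hypothetical, and the check relies on the master_link_down_since_seconds field that a replica reports in INFO replication while its link is down.

```shell
# Hypothetical helper: succeed (exit 0) when the replica's link to the
# primary has been down longer than the threshold in seconds (default 30).
# Usage: redis-cli INFO replication | link_down_too_long 30 && echo "ALERT"
link_down_too_long() {
  tr -d '\r' | awk -F: -v limit="${1:-30}" '
    $1 == "master_link_status"             { status = $2 }
    $1 == "master_link_down_since_seconds" { down_for = $2 }
    END { exit !(status == "down" && down_for + 0 > limit + 0) }'
}
```

Because the result is an exit status, the check drops straight into a cron job or a monitoring wrapper that pages only after the outage has lasted longer than a brief blip.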
Fixes
Scale replica resources
If the replica CPU, disk, or network is saturated, move the replica to a larger instance or isolate replication traffic onto dedicated network paths. Do not restart the primary.
Increase the replication backlog
If sync_partial_err is incrementing, raise repl-backlog-size to cover your write rate multiplied by expected disconnect duration. Apply it live with CONFIG SET repl-backlog-size 104857600 (100MB) and persist it in redis.conf. The default 1MB is insufficient for almost all production workloads.
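The arithmetic behind that sizing is simple enough to script. The sketch below is illustrative only: backlog_bytes_needed is a hypothetical name, and its inputs are two master_repl_offset samples taken a known number of seconds apart during peak load, the disconnect duration you want to survive, and an optional headroom multiplier.

```shell
# Hypothetical helper: backlog bytes needed to ride out a disconnect.
# Args: OFFSET_START OFFSET_END SAMPLE_SECONDS DISCONNECT_SECONDS [SAFETY_FACTOR]
backlog_bytes_needed() {
  written=$(( $2 - $1 ))    # bytes appended to the replication stream
  factor=${5:-1}            # optional headroom multiplier, defaults to 1
  # Multiply before dividing so integer division does not truncate the rate.
  echo $(( written * $4 * factor / $3 ))
}

# Example: offsets sampled 60 s apart, sized for a 120 s disconnect.
backlog_bytes_needed 1000000 61000000 60 120      # prints 120000000
backlog_bytes_needed 1000000 61000000 60 120 2    # prints 240000000
```

If the result is larger than the value returned by CONFIG GET repl-backlog-size, the backlog cannot cover a disconnect of that length at the sampled write rate.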
Eliminate slow commands on the primary
Audit SLOWLOG GET and remove KEYS, large SMEMBERS, or unbounded Lua scripts. Replace KEYS with SCAN. Set lua-time-limit to prevent runaway scripts. One slow command stalls replication to all replicas simultaneously.
Adjust output buffer limits
If replicas disconnect because their client output buffer exceeds client-output-buffer-limit, the limit may be too low for your write rate, or the replica may be genuinely unable to consume the stream. Do not simply raise the limit without fixing the underlying consumption bottleneck, or you risk OOM on the primary.
Enable diskless replication
If full resyncs are frequent and fork latency is acceptable, consider repl-diskless-sync yes. This avoids writing the RDB to disk on the primary before streaming it to the replica, reducing disk I/O pressure during resync. Evaluate this against your network stability.
Configure write safety guards
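Set min-replicas-to-write 1 and min-replicas-max-lag 10 so that the primary rejects writes unless at least one replica has acknowledged it within the last 10 seconds. This lag is measured in seconds since the replica’s last acknowledgment, not in bytes of offset. The tradeoff is reduced availability: if no replica meets the lag threshold, writes are rejected with (error) NOREPLICAS.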
Handle WAIT timeouts in applications
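If clients use WAIT for synchronous replication, ensure they handle partial acknowledgments gracefully. Do not let WAIT block indefinitely: its timeout argument is in milliseconds and 0 means block forever, so always pass a non-zero timeout and keep a client-side timeout as well. WAIT returns the number of replicas that acknowledged the write; treat a count below the number requested as a signal to investigate the replica rather than a hard failure.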
Prevention
- Size repl-backlog-size to at least twice your peak write bytes per second multiplied by your longest expected maintenance window or network partition duration. Most production deployments need 100MB to 512MB.
- Monitor the slowlog on every primary continuously. Any KEYS command or multi-second Lua script is an incident waiting to happen.
- Monitor sync_full and sync_partial_err. A rising full sync rate is an early warning that your backlog is undersized or your network is unstable.
- Set min-replicas-to-write and min-replicas-max-lag even if you do not require synchronous replication. The default of 0 provides no protection against failover to a stale replica.
- Verify replica host resources during peak load. A replica with slower disk or less CPU than the primary will always lag during bursts.
How Netdata helps
- Chart master_repl_offset and slave_repl_offset together to show byte-level lag across the topology.
- Alert on sync_partial_err increments and latest_fork_usec spikes to catch backlog overflow loops before they cascade.
- Surface slowlog entries and command latency alongside replication metrics to distinguish primary-side blocking from replica-side bottlenecks.
- Track replica CPU, memory, and network saturation on the same charts as replication lag to identify resource-constrained followers.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis aof_last_write_status:err: AOF write failures and recovery
- Redis appendfsync always latency: durability vs throughput trade-offs
- Redis blocked_clients growing: dead consumers vs healthy queues
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit
- Redis connected_clients climbing: connection leak detection
- Redis connection exhaustion: leaks, pools, and the retry storm
- Redis event loop blocked: when one slow command freezes everything
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box







