Redis replication lag: detection, diagnosis, and fixes

Reads from Redis replicas returning stale data indicate replication lag. Your monitoring shows a growing gap between the primary’s replication offset and what the replica has acknowledged. During failover, every byte of that gap is potential data loss.

Replication lag in Redis is measured in bytes: the difference between master_repl_offset on the primary and slave_repl_offset on the replica. Small, stable lag is normal in asynchronous replication, but lag that grows continuously or exceeds repl-backlog-size signals a bottleneck that can cascade into full resync storms.
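
The subtraction can be scripted. The sketch below assumes the primary's INFO replication output uses the usual slaveN:ip=...,offset=...,lag=... layout; replica_lag_bytes is a hypothetical helper name, not part of Redis.

```shell
# Sketch: read the primary's INFO replication output on stdin and print
# "<replica> <lag_bytes>" for each connected replica.
# replica_lag_bytes is a hypothetical helper name.
replica_lag_bytes() {
  tr -d '\r' | awk '
    # The primary offset is listed after the slaveN lines, so subtract in END.
    /^master_repl_offset:/ { sub(/^master_repl_offset:/, ""); master = $0 }
    /^slave[0-9]+:/ {
      name = $0; sub(/:.*/, "", name)
      # Pull the digits out of "offset=<digits>" ("offset=" is 7 characters).
      if (match($0, /offset=[0-9]+/))
        off[name] = substr($0, RSTART + 7, RLENGTH - 7)
    }
    END { for (r in off) printf "%s %.0f\n", r, master - off[r] }
  ' | sort
}

# Usage against a live primary (assumes redis-cli can reach it):
#   redis-cli INFO replication | replica_lag_bytes
```

Sampling this every few seconds shows whether the gap is stable or growing, which matters more than any single reading.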

When a replica falls behind far enough that its offset rolls off the primary’s circular backlog, a reconnection forces a full resync. The primary forks to generate an RDB snapshot, and the replica loads it. This blocks the primary’s event loop during fork and leaves the replica unresponsive during load. Applications reading from the replica serve stale data, and failover to that replica drops all writes in the lag window.

flowchart TD
    A[Replica falls behind primary] --> B[Lag exceeds repl-backlog-size]
    B --> C[Replica reconnects after blip]
    C --> D[Partial resync fails]
    D --> E[Full resync starts]
    E --> F[Primary forks to create RDB]
    F --> G[Latency spike blocks event loop]
    G --> H[Other replicas timeout]
    H --> A

Common causes

Cause	What it looks like	First thing to check
Replica CPU, disk, or network bottleneck	Lag grows steadily under write load; replica CPU saturated or disk I/O waits high	`INFO cpu` on the replica; OS `iostat`, `mpstat`, and interface throughput
Slow commands blocking the primary event loop	Lag spikes correlate with slowlog entries; all replicas lag simultaneously	`SLOWLOG GET 10` on the primary
Replication backlog too small	`sync_partial_err` increments after brief network blips; replicas fall into full resync loops	`CONFIG GET repl-backlog-size` compared to write throughput
Replica loading RDB after full resync	Lag drops to zero after load, but was massive during transfer; replica unresponsive	`INFO persistence` on the replica for `loading` flag
Network bandwidth saturation	High output kbps on primary approaching NIC limits; lag grows during traffic peaks	`INFO stats` network metrics and OS interface counters
WAIT command or min-replicas blocking	`blocked_clients` grows while clients sit in WAIT; applications report write timeouts or NOREPLICAS errors	`INFO clients` for `blocked_clients`; `CONFIG GET min-replicas-to-write`

Quick checks

Run these read-only checks before making changes.

# On the primary: check byte-level offsets of connected replicas
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"

# On the replica: check link status and applied offset
redis-cli INFO replication | grep -E "master_link_status|slave_repl_offset|master_last_io_seconds_ago"

# On the primary: check full vs partial resync history
redis-cli INFO stats | grep -E "sync_full|sync_partial"

# On the primary: check for slow commands
redis-cli SLOWLOG GET 10

# On the primary: check current backlog size
redis-cli CONFIG GET repl-backlog-size

# On the replica: check if it is loading an RDB snapshot
redis-cli INFO persistence | grep -E "loading|rdb_bgsave_in_progress"

# On the primary: check clients blocked by WAIT or slow replication
redis-cli INFO clients | grep blocked_clients

# On the primary: check replica client output buffer limits
redis-cli CONFIG GET client-output-buffer-limit

How to diagnose it

Quantify the lag in bytes. On the primary, read master_repl_offset and subtract the offset= value on the matching slaveN: line, which is the offset that replica last acknowledged. On the replica, read slave_repl_offset directly. A gap of a few kilobytes is healthy; a gap of megabytes that keeps growing is not.
Determine if the lag is growing, stable, or spiky. Steady growth under load points to a replica bottleneck. Sudden spikes correlate with slow commands or fork events on the primary.
Check the primary slowlog. If SLOWLOG GET shows KEYS, large SMEMBERS, or long Lua scripts, the primary event loop is blocked. Replication stalls for all replicas during these pauses.
Inspect replica host resources. Check CPU saturation, disk I/O wait, and network throughput with OS tools. A replica with slower disk or less CPU than the primary will lag during write bursts.
Check for failed partial resyncs. If sync_partial_err is incrementing, the replica reconnected but its offset was no longer in the primary’s backlog. This forces full resyncs.
Compare write throughput to backlog size. Calculate your peak write rate in bytes per second from master_repl_offset deltas. Multiply by your longest expected disconnect duration. If the result exceeds repl-backlog-size, the backlog is undersized.
Verify replica loading state. If loading:1 appears in INFO persistence, the replica is processing an RDB dump and cannot serve reads. Compare loading_loaded_bytes to loading_total_bytes for progress.
Check blocked clients. If blocked_clients is high while lag is elevated, clients may be using WAIT for synchronous replication and stalling until the replica catches up.
Review network metrics. Check instantaneous_output_kbps on the primary and OS-level interface counters. Replication traffic competes with client traffic for bandwidth.
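
The backlog comparison in the steps above is plain arithmetic. A minimal sketch, assuming you have sampled master_repl_offset twice a known number of seconds apart; write_rate_bps and backlog_covers are hypothetical helper names, and the numbers in the usage comments are illustrative only.

```shell
# write_rate_bps <offset_start> <offset_end> <seconds>
# Bytes per second added to the replication stream between two
# master_repl_offset samples (integer division).
write_rate_bps() {
  echo $(( ($2 - $1) / $3 ))
}

# backlog_covers <rate_bps> <disconnect_seconds> <backlog_bytes>
# Succeeds (exit 0) when the backlog can hold everything written during
# a disconnect of the given length, fails (exit 1) otherwise.
backlog_covers() {
  [ $(( $1 * $2 )) -le "$3" ]
}

# Usage with illustrative numbers:
#   rate=$(write_rate_bps 1000000 7000000 60)
#   backlog_covers "$rate" 30 1048576 || echo "backlog undersized"
```

Take the two samples during peak load, since an average over a quiet period understates the backlog you need.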

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`master_repl_offset` minus replica offset	Direct measure of data at risk during failover	Growing consistently or exceeding `repl-backlog-size`
`master_link_status`	Binary indicator of replication connectivity	`down` for more than 30 seconds
`sync_partial_err`	Failed partial resyncs force expensive full syncs	Any sustained increment
`latest_fork_usec`	Fork latency blocks the primary event loop	> 500,000 µs (500 ms); spikes precede replica disconnects
`loading` on replica	Replica cannot serve reads while loading RDB	`loading:1` for longer than baseline
`blocked_clients`	Clients blocked by WAIT or slow replication	Growing while replication lag is high
`instantaneous_ops_per_sec`	Drops during fork indicate event loop blocking	Sustained drop coinciding with `rdb_bgsave_in_progress`
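
The fork threshold in the table can be checked from the command line. This is a rough sketch, assuming latest_fork_usec appears in INFO stats as a key:value line in microseconds; fork_warning is a hypothetical helper name, and the 500000 default is simply the 500 ms warning level from the table.

```shell
# fork_warning [limit_usec]
# Reads INFO stats output on stdin and prints a warning line when
# latest_fork_usec is above the limit (default 500000 µs, i.e. 500 ms).
# Prints nothing when the value is within the limit.
fork_warning() {
  tr -d '\r' | awk -F: -v limit="${1:-500000}" '
    $1 == "latest_fork_usec" && $2 + 0 > limit + 0 {
      printf "WARN latest_fork_usec=%d exceeds %d\n", $2, limit
    }
  '
}

# Usage against a live primary (assumes redis-cli can reach it):
#   redis-cli INFO stats | fork_warning
```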

Fixes

Scale replica resources

If the replica CPU, disk, or network is saturated, move the replica to a larger instance or isolate replication traffic onto dedicated network paths. Do not restart the primary: a restart discards the in-memory backlog and can push every replica into a full resync.

Increase the replication backlog

If sync_partial_err is incrementing, raise repl-backlog-size to cover your write rate multiplied by expected disconnect duration. Apply it live with CONFIG SET repl-backlog-size 104857600 (100MB) and persist it in redis.conf. The default 1MB is insufficient for almost all production workloads.

Eliminate slow commands on the primary

Audit SLOWLOG GET and remove KEYS, large SMEMBERS, or unbounded Lua scripts. Replace KEYS with SCAN. Set lua-time-limit to prevent runaway scripts. One slow command stalls replication to all replicas simultaneously.

Adjust output buffer limits

If replicas disconnect because their client output buffer exceeds client-output-buffer-limit, the limit may be too low for your write rate, or the replica may be genuinely unable to consume the stream. Do not simply raise the limit without fixing the underlying consumption bottleneck, or you risk OOM on the primary.
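
Before changing anything, it helps to read the limit that currently applies to replicas. The sketch below assumes CONFIG GET returns the value as space-separated groups of class, hard limit, soft limit, and soft seconds, with the class named replica (or slave on older versions); replica_buffer_limit is a hypothetical helper name.

```shell
# replica_buffer_limit
# Reads `CONFIG GET client-output-buffer-limit` output on stdin and prints
# "<hard_limit> <soft_limit> <soft_seconds>" for the replica class.
replica_buffer_limit() {
  awk '{
    # Walk the line in groups of four fields: class hard soft seconds.
    # The first output line (the parameter name) has one field and is skipped.
    for (i = 1; i + 3 <= NF; i += 4)
      if ($i == "replica" || $i == "slave")
        print $(i + 1), $(i + 2), $(i + 3)
  }'
}

# Usage against a live primary (assumes redis-cli can reach it):
#   redis-cli CONFIG GET client-output-buffer-limit | replica_buffer_limit
```

Compare the hard limit with the byte lag you observe: a replica whose pending output approaches that figure is close to being disconnected.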

Enable diskless replication

If full resyncs are frequent and fork latency is acceptable, consider repl-diskless-sync yes (already the default on Redis 7.0 and later). This avoids writing the RDB to disk on the primary before streaming it to the replica, reducing disk I/O pressure during resync. Evaluate this against your network stability.

Configure write safety guards

Set min-replicas-to-write 1 and min-replicas-max-lag 10 so the primary accepts writes only while at least one replica has acknowledged within the last 10 seconds (min-replicas-max-lag is measured in seconds, not bytes). The tradeoff is reduced availability: if no replica meets the lag threshold, writes are rejected with (error) NOREPLICAS.
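
To see how close you are to that rejection, you can count the replicas that would currently pass the check. This is an approximation, assuming the lag= field on each slaveN: line reports seconds since the replica's last acknowledgment and that only replicas in state=online count; good_replicas is a hypothetical helper name.

```shell
# good_replicas <max_lag_seconds>
# Reads the primary's INFO replication output on stdin and prints how many
# replicas are online with lag <= max_lag_seconds.
good_replicas() {
  tr -d '\r' | awk -v max="$1" '
    /^slave[0-9]+:/ && /state=online/ {
      # Extract the digits from "lag=<digits>" ("lag=" is 4 characters).
      if (match($0, /lag=[0-9]+/) &&
          substr($0, RSTART + 4, RLENGTH - 4) + 0 <= max + 0)
        n++
    }
    END { print n + 0 }
  '
}

# Usage against a live primary (assumes redis-cli can reach it):
#   redis-cli INFO replication | good_replicas 10
```

If the count regularly drops below the min-replicas-to-write value you plan to set, expect NOREPLICAS errors once the guard is enabled.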

Handle WAIT timeouts in applications

If clients use WAIT for synchronous replication, ensure they handle partial acknowledgments gracefully. WAIT returns the number of replicas that acknowledged, even when its timeout expires, and a timeout of 0 blocks indefinitely. Always pass a non-zero timeout in milliseconds, and treat a partial ack count as a signal to investigate the replica rather than a hard failure.
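
A sketch of that handling, assuming the integer reply from WAIT has already been captured; wait_result is a hypothetical helper name, and the replica count and timeout in the usage comment are illustrative.

```shell
# wait_result <acked> <required>
# Classifies the integer reply from WAIT:
#   ok      - at least <required> replicas acknowledged
#   partial - some, but fewer than <required>, acknowledged
#   none    - no replica acknowledged before the timeout
wait_result() {
  if [ "$1" -ge "$2" ]; then
    echo "ok"
  elif [ "$1" -gt 0 ]; then
    echo "partial"
  else
    echo "none"
  fi
}

# Usage (illustrative): WAIT covers writes sent on the same connection,
# so in an application issue it right after the write on that connection.
# 500 is the WAIT timeout in milliseconds.
#   acked=$(redis-cli WAIT 2 500)
#   wait_result "$acked" 2
```

Logging the partial and none cases separately makes it easier to tell one slow replica from a wider replication stall.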

Prevention

Size repl-backlog-size to at least twice your peak write bytes per second multiplied by your longest expected maintenance window or network partition duration. Most production deployments need 100MB to 512MB.
Monitor the slowlog on every primary continuously. Any KEYS command or multi-second Lua script is an incident waiting to happen.
Monitor sync_full and sync_partial_err. A rising full sync rate is an early warning that your backlog is undersized or your network is unstable.
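Set min-replicas-to-write and min-replicas-max-lag even if you do not require synchronous replication. With the default min-replicas-to-write of 0 the check is disabled, so the primary keeps accepting writes that no replica has received, and a failover to a stale replica loses them.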
Verify replica host resources during peak load. A replica with slower disk or less CPU than the primary will always lag during bursts.

How Netdata helps

Chart master_repl_offset and slave_repl_offset together to show byte-level lag across the topology.
Alert on sync_partial_err increments and latest_fork_usec spikes to catch backlog overflow loops before they cascade.
Surface slowlog entries and command latency alongside replication metrics to distinguish primary-side blocking from replica-side bottlenecks.
Track replica CPU, memory, and network saturation on the same charts as replication lag to identify resource-constrained followers.
