Redis replication lag: detection, diagnosis, and fixes

Reads from Redis replicas returning stale data indicate replication lag. Your monitoring shows a growing gap between the primary’s replication offset and what the replica has acknowledged. During failover, every byte of that gap is potential data loss.

Replication lag in Redis is measured in bytes: the difference between master_repl_offset on the primary and slave_repl_offset on the replica. Small, stable lag is normal in asynchronous replication, but lag that grows continuously or exceeds repl-backlog-size signals a bottleneck that can cascade into full resync storms.
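
The byte-level definition above can be turned into a small parser. The following is a minimal sketch, assuming the primary's INFO replication output uses the usual `slaveN:ip=...,port=...,state=...,offset=...,lag=...` line shape; the function name `replica_lag_bytes` and the sample addresses are hypothetical.

```python
import re


def replica_lag_bytes(info_replication):
    """Return {replica_name: lag_in_bytes} from a primary's INFO replication text."""
    master_offset = None
    replica_offsets = {}
    for raw in info_replication.splitlines():
        line = raw.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
            continue
        # Replica lines: slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
        match = re.match(r"^(slave\d+):(.*)$", line)
        if match:
            fields = dict(kv.split("=", 1) for kv in match.group(2).split(","))
            replica_offsets[match.group(1)] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO output")
    # Lag is the primary's offset minus what each replica has acknowledged.
    return {name: master_offset - off for name, off in replica_offsets.items()}


sample = """role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1048000,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=900000,lag=3
master_repl_offset:1048576
"""
print(replica_lag_bytes(sample))  # {'slave0': 576, 'slave1': 148576}
```

Sampling this on an interval and comparing successive results shows whether the lag is stable or growing.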

When a replica falls behind far enough that its offset rolls off the primary’s circular backlog, a reconnection forces a full resync. The primary forks to generate an RDB snapshot, and the replica loads it. This blocks the primary’s event loop during fork and leaves the replica unresponsive during load. Applications reading from the replica serve stale data, and failover to that replica drops all writes in the lag window.
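
Whether a reconnecting replica can still get a partial resync depends on whether the next byte it needs is inside the backlog window. The sketch below is a simplified model of that check, assuming the window is described by the `repl_backlog_first_byte_offset` and `repl_backlog_histlen` fields of INFO replication on the primary. It ignores the replication ID match that Redis also requires, and the function name is hypothetical.

```python
def can_partial_resync(replica_offset, backlog_first_byte_offset, backlog_histlen):
    """Simplified check: is the next byte the replica needs still in the backlog?

    Assumption: the replication IDs already match; only the offset window is modeled.
    """
    next_needed = replica_offset + 1
    backlog_end = backlog_first_byte_offset + backlog_histlen
    # Outside this window the primary has to fall back to a full resync.
    return backlog_first_byte_offset <= next_needed <= backlog_end


# Backlog currently holds 1 MiB of history starting at offset 1000.
print(can_partial_resync(999, 1000, 1048576))   # True: next needed byte is 1000
print(can_partial_resync(500, 1000, 1048576))   # False: that history rolled off
```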

flowchart TD
    A[Replica falls behind primary] --> B[Lag exceeds repl-backlog-size]
    B --> C[Replica reconnects after blip]
    C --> D[Partial resync fails]
    D --> E[Full resync starts]
    E --> F[Primary forks to create RDB]
    F --> G[Latency spike blocks event loop]
    G --> H[Other replicas time out]
    H --> A

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Replica CPU, disk, or network bottleneck | Lag grows steadily under write load; replica CPU saturated or disk I/O waits high | INFO cpu on the replica; OS iostat, mpstat, and interface throughput |
| Slow commands blocking the primary event loop | Lag spikes correlate with slowlog entries; all replicas lag simultaneously | SLOWLOG GET 10 on the primary |
| Replication backlog too small | sync_partial_err increments after brief network blips; replicas fall into full resync loops | CONFIG GET repl-backlog-size compared to write throughput |
| Replica loading RDB after full resync | Lag drops to zero after load, but was massive during transfer; replica unresponsive | INFO persistence on the replica for loading flag |
| Network bandwidth saturation | High output kbps on primary approaching NIC limits; lag grows during traffic peaks | INFO stats network metrics and OS interface counters |
| WAIT command or min-replicas blocking | blocked_clients grows; applications report write timeouts | INFO clients for blocked_clients; CONFIG GET min-replicas-to-write |

Quick checks

Run these read-only checks before making changes.

# On the primary: check byte-level offsets of connected replicas
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"

# On the replica: check link status and applied offset
redis-cli INFO replication | grep -E "master_link_status|slave_repl_offset|master_last_io_seconds_ago"

# On the primary: check full vs partial resync history
redis-cli INFO stats | grep -E "sync_full|sync_partial"

# On the primary: check for slow commands
redis-cli SLOWLOG GET 10

# On the primary: check current backlog size
redis-cli CONFIG GET repl-backlog-size

# On the replica: check if it is loading an RDB snapshot
redis-cli INFO persistence | grep -E "loading|rdb_bgsave_in_progress"

# On the primary: check clients blocked by WAIT or slow replication
redis-cli INFO clients | grep blocked_clients

# On the primary: check replica client output buffer limits
redis-cli CONFIG GET client-output-buffer-limit

How to diagnose it

  1. Quantify the lag in bytes. On the primary, read master_repl_offset and subtract the offset= value on the matching slaveN: line, which is the offset that replica last acknowledged. On the replica, read slave_repl_offset directly. A gap of a few kilobytes is healthy; megabytes and growing is not.

  2. Determine if the lag is growing, stable, or spiky. Steady growth under load points to a replica bottleneck. Sudden spikes correlate with slow commands or fork events on the primary.

  3. Check the primary slowlog. If SLOWLOG GET shows KEYS, large SMEMBERS, or long Lua scripts, the primary event loop is blocked. Replication stalls for all replicas during these pauses.

  4. Inspect replica host resources. Check CPU saturation, disk I/O wait, and network throughput with OS tools. A replica with slower disk or less CPU than the primary will lag during write bursts.

  5. Check for failed partial resyncs. If sync_partial_err is incrementing, the replica reconnected but its offset was no longer in the primary’s backlog. This forces full resyncs.

  6. Compare write throughput to backlog size. Calculate your peak write rate in bytes per second from master_repl_offset deltas. Multiply by your longest expected disconnect duration. If the result exceeds repl-backlog-size, the backlog is undersized.

  7. Verify replica loading state. If loading:1 appears in INFO persistence, the replica is processing an RDB dump and cannot serve reads. Compare loading_loaded_bytes to loading_total_bytes for progress.

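  8. Check blocked clients. If blocked_clients is high while lag is elevated, clients may be using WAIT for synchronous replication and stalling until the replica catches up. The counter also includes clients parked in other blocking commands such as BLPOP, so confirm that your application actually uses WAIT before drawing conclusions.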

  9. Review network metrics. Check instantaneous_output_kbps on the primary and OS-level interface counters. Replication traffic competes with client traffic for bandwidth.
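
The arithmetic in step 6 can be sketched as follows. This is an illustrative calculation only: it assumes you have two master_repl_offset samples taken a known number of seconds apart, and the function names are hypothetical.

```python
def write_rate_bytes_per_sec(offset_start, offset_end, elapsed_seconds):
    """Replication stream growth rate from two master_repl_offset samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (offset_end - offset_start) / elapsed_seconds


def backlog_coverage_seconds(backlog_size_bytes, rate_bytes_per_sec):
    """How long a replica can stay disconnected before its offset rolls off."""
    if rate_bytes_per_sec <= 0:
        return float("inf")  # no writes: the backlog never overflows
    return backlog_size_bytes / rate_bytes_per_sec


# Two samples taken 60 seconds apart during peak load.
rate = write_rate_bytes_per_sec(1_000_000, 7_000_000, 60)
coverage = backlog_coverage_seconds(1024 * 1024, rate)
print(rate)      # 100000.0 bytes per second
print(coverage)  # about 10.5 seconds with a 1 MiB backlog
```

If the coverage is shorter than the longest disconnect you expect, the backlog is undersized for that write rate.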

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| master_repl_offset minus replica offset | Direct measure of data at risk during failover | Growing consistently or exceeding repl-backlog-size |
| master_link_status | Binary indicator of replication connectivity | down for more than 30 seconds |
| sync_partial_err | Failed partial resyncs force expensive full syncs | Any sustained increment |
| latest_fork_usec | Fork latency blocks the primary event loop | > 500 ms (the metric is reported in microseconds, so > 500000); spikes precede replica disconnects |
| loading on replica | Replica cannot serve reads while loading RDB | loading:1 for longer than baseline |
| blocked_clients | Clients blocked by WAIT or slow replication | Growing while replication lag is high |
| instantaneous_ops_per_sec | Drops during fork indicate event loop blocking | Sustained drop coinciding with rdb_bgsave_in_progress |

Fixes

Scale replica resources

If the replica CPU, disk, or network is saturated, move the replica to a larger instance or isolate replication traffic onto dedicated network paths. Do not restart the primary.

Increase the replication backlog

If sync_partial_err is incrementing, raise repl-backlog-size to cover your write rate multiplied by expected disconnect duration. Apply it live with CONFIG SET repl-backlog-size 104857600 (100 MiB) and persist it in redis.conf. The default of 1 MB is too small for most write-heavy production workloads.

Eliminate slow commands on the primary

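Audit SLOWLOG GET and remove KEYS, large SMEMBERS, or unbounded Lua scripts. Replace KEYS with SCAN. Note that lua-time-limit does not terminate a long script: once the limit passes, Redis only starts answering other clients with BUSY and accepts SCRIPT KILL, so keep scripts short and bounded. One slow command stalls replication to all replicas simultaneously.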
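
A slowlog audit can be partly automated. The sketch below assumes slowlog entries have already been fetched as dictionaries with a `command` string (or bytes) and a `duration` in microseconds, which is believed to be the shape redis-py's `slowlog_get()` returns; treat that shape as an assumption. The suspect list and the function name are hypothetical.

```python
# Commands the text singles out as event-loop blockers (illustrative list).
SUSPECT_COMMANDS = {"KEYS", "SMEMBERS", "EVAL", "EVALSHA"}


def flag_slow_entries(entries, min_duration_us=10_000):
    """Return (command_name, duration_us) for suspect slowlog entries."""
    flagged = []
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        # The command name is the first whitespace-separated token.
        parts = command.split()
        name = parts[0].upper() if parts else ""
        if name in SUSPECT_COMMANDS and entry["duration"] >= min_duration_us:
            flagged.append((name, entry["duration"]))
    return flagged


entries = [
    {"command": b"KEYS user:*", "duration": 250_000},
    {"command": "GET foo", "duration": 50_000},
    {"command": "smembers big:set", "duration": 5_000},
]
print(flag_slow_entries(entries))  # [('KEYS', 250000)]
```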

Adjust output buffer limits

If replicas disconnect because their client output buffer exceeds client-output-buffer-limit, the limit may be too low for your write rate, or the replica may be genuinely unable to consume the stream. Do not simply raise the limit without fixing the underlying consumption bottleneck, or you risk OOM on the primary.
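
To reason about when the primary drops a replica, it helps to parse the replica class out of the client-output-buffer-limit value. This is a rough model, assuming the value is the space-separated string returned by CONFIG GET (class, hard limit, soft limit, soft seconds, repeated per class) and that the replica class may be spelled `slave` or `replica` depending on version. Function names are hypothetical.

```python
def parse_replica_buffer_limit(config_value):
    """Return (hard_bytes, soft_bytes, soft_seconds) for the replica class."""
    tokens = config_value.split()
    # Each class occupies four tokens: name, hard limit, soft limit, soft seconds.
    for i in range(0, len(tokens) - 3, 4):
        if tokens[i] in ("replica", "slave"):
            return tuple(int(t) for t in tokens[i + 1:i + 4])
    raise ValueError("no replica class found in client-output-buffer-limit")


def would_disconnect(buffer_bytes, seconds_over_soft, limits):
    """Model of the disconnect rule; a limit of 0 means that limit is disabled."""
    hard, soft, soft_seconds = limits
    if hard and buffer_bytes >= hard:
        return True
    if soft and buffer_bytes >= soft and seconds_over_soft > soft_seconds:
        return True
    return False


value = "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
limits = parse_replica_buffer_limit(value)
print(limits)  # (268435456, 67108864, 60)
print(would_disconnect(100 * 1024 * 1024, 10, limits))  # False: over soft limit, but not long enough
```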

Enable diskless replication

If full resyncs are frequent and fork latency is acceptable, consider repl-diskless-sync yes. This avoids writing the RDB to disk on the primary before streaming it to the replica, reducing disk I/O pressure during resync. Diskless sync still forks a child, so it does not reduce fork latency, and a transfer interrupted by a network drop has to start over. Evaluate it against your network stability.

Configure write safety guards

Set min-replicas-to-write 1 and min-replicas-max-lag 10 so the primary stops accepting writes unless at least one replica has acknowledged within the last 10 seconds. This lag is measured in seconds since the replica's last acknowledgment, not in bytes. The tradeoff is reduced availability: if no replica meets the lag threshold, writes are rejected with (error) NOREPLICAS.
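
The guard's decision can be modeled in a few lines. The sketch below is an approximation of the rule, assuming each replica's lag is the `lag=` value in seconds from its slaveN line on the primary; the function name is hypothetical.

```python
def primary_accepts_writes(replica_lags_seconds, min_replicas_to_write, min_replicas_max_lag):
    """Model of the min-replicas guard: True if writes would be accepted."""
    # Setting either option to 0 disables the feature entirely.
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True
    good = sum(1 for lag in replica_lags_seconds if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


print(primary_accepts_writes([0, 3], 1, 10))    # True: both replicas are fresh
print(primary_accepts_writes([12, 15], 1, 10))  # False: writes get NOREPLICAS
```

Running the model against recent lag samples before enabling the settings shows how often writes would have been rejected.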

Handle WAIT timeouts in applications

If clients use WAIT for synchronous replication, ensure they handle partial acknowledgments gracefully. Do not let WAIT block indefinitely. Use a client-side timeout and treat partial acks as a signal to investigate the replica rather than a hard failure.
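
One way to handle partial acknowledgments is sketched below. It assumes `client` is any object exposing `set(key, value)` and `wait(num_replicas, timeout_ms)`, which is believed to match redis-py's client methods; the helper name `write_with_ack` and its return shape are hypothetical.

```python
def write_with_ack(client, key, value, want_replicas=1, timeout_ms=100):
    """Write, then WAIT with a bounded timeout and report how many replicas acked."""
    if timeout_ms <= 0:
        # WAIT with timeout 0 blocks until enough replicas ack, which this guide warns against.
        raise ValueError("timeout_ms must be positive")
    client.set(key, value)
    acked = client.wait(want_replicas, timeout_ms)
    # A short count is a signal to investigate the replica, not a failed write:
    # the write is already applied on the primary.
    return {"acked": acked, "fully_acked": acked >= want_replicas}


class FakeClient:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self, acks):
        self.acks = acks
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def wait(self, num_replicas, timeout_ms):
        return self.acks


print(write_with_ack(FakeClient(0), "order:1", "paid"))
# {'acked': 0, 'fully_acked': False}
```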

Prevention

  • Size repl-backlog-size to at least twice your peak write bytes per second multiplied by your longest expected maintenance window or network partition duration. Most production deployments need 100MB to 512MB.
  • Monitor the slowlog on every primary continuously. Any KEYS command or multi-second Lua script is an incident waiting to happen.
  • Monitor sync_full and sync_partial_err. A rising full sync rate is an early warning that your backlog is undersized or your network is unstable.
  • Set min-replicas-to-write and min-replicas-max-lag even if you do not require synchronous replication. The default min-replicas-to-write of 0 disables the check and provides no protection against failover to a stale replica.
  • Verify replica host resources during peak load. A replica with slower disk or less CPU than the primary will always lag during bursts.
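
The sizing rule in the first bullet can be written out as a quick calculation. This is a sketch under stated assumptions: the safety factor of 2 comes from the rule above, while rounding up to a whole MiB and the function name are illustrative choices.

```python
import math

MIB = 1024 * 1024


def recommended_backlog_bytes(peak_write_bytes_per_sec, longest_outage_seconds, safety_factor=2.0):
    """Backlog size covering the longest expected outage, rounded up to a whole MiB."""
    raw = peak_write_bytes_per_sec * longest_outage_seconds * safety_factor
    # Never recommend less than 1 MiB.
    return max(MIB, math.ceil(raw / MIB) * MIB)


# 500 KB/s peak writes and a 2-minute network partition.
size = recommended_backlog_bytes(500_000, 120)
print(size)         # 120586240
print(size // MIB)  # 115 (MiB)
```

The result can be passed to CONFIG SET repl-backlog-size and then persisted in redis.conf.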

How Netdata helps

  • Chart master_repl_offset and slave_repl_offset together to show byte-level lag across the topology.
  • Alert on sync_partial_err increments and latest_fork_usec spikes to catch backlog overflow loops before they cascade.
  • Surface slowlog entries and command latency alongside replication metrics to distinguish primary-side blocking from replica-side bottlenecks.
  • Track replica CPU, memory, and network saturation on the same charts as replication lag to identify resource-constrained followers.