Redis replication backlog overflow: full-resync storms and the 1MB default

Replicas drop and reconnect, but each reconnection triggers a full resync instead of a partial sync. The primary forks for an RDB dump, latency spikes, and other replicas fall behind. Before the primary recovers, another replica falls outside the backlog window and the cycle repeats. The default repl-backlog-size of 1 MB is too small for most production write rates, which makes this cascade likely.

The backlog is a fixed-size circular buffer of recent writes that lets a disconnected replica catch up without a full resync. When writes during a blip exceed the 1 MB default, the replica’s offset falls outside the window. Recovery requires a full resync, which forks the primary and turns a brief disconnect into a site-wide latency event.

What this means

A partial resync (PSYNC) sends the replica only the writes it missed. The primary keeps these in the replication backlog, a circular buffer whose capacity is set by repl-backlog-size. When a replica reconnects, it sends the replication ID it was following and its last replication offset. If the ID matches and the offset is still inside the backlog, the primary streams the delta. If the buffer has wrapped past the offset, the primary must do a full resync: fork a child, serialize the dataset to an RDB file or socket, transfer it, and load it on the replica.
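The offset check described above can be sketched as a small shell function. This is an illustration, not the server's code: it assumes the replication ID already matches, reads the repl_backlog_active, repl_backlog_first_byte_offset, and repl_backlog_histlen fields from INFO replication output on stdin, and the function name can_partial_resync is hypothetical.

```shell
# Sketch: would a replica at the given offset qualify for a partial resync?
# Reads the primary's `INFO replication` text on stdin.
# Assumption: the replication ID matches; only the offset window is checked.
can_partial_resync() {
  replica_offset=$1
  # redis-cli prints CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: -v want="$replica_offset" '
    $1 == "repl_backlog_active"            { active = $2 + 0 }
    $1 == "repl_backlog_first_byte_offset" { first = $2 + 0 }
    $1 == "repl_backlog_histlen"           { histlen = $2 + 0 }
    END {
      wanted = want + 1   # first byte the replica still needs
      if (active == 1 && wanted >= first && wanted <= first + histlen) {
        print "partial"
      } else {
        print "full"
      }
    }'
}

# Example against a live primary (not run here):
#   redis-cli INFO replication | can_partial_resync 123456789
```

A result of "full" for an offset that a replica held only seconds ago is a sign that the backlog window is shorter than your typical disconnect.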

Full resyncs are expensive. The fork blocks the main thread for the duration reported in latest_fork_usec, which can reach hundreds of milliseconds on large datasets. During that freeze, commands queue. Replicas waiting for replication data may hit their timeout and disconnect. When they reconnect, they too may find themselves outside the backlog, triggering more full resyncs. A single overflow can cascade into a storm that keeps the primary in a perpetual fork-resync loop.

flowchart TD
    A[Replica lag grows] --> B{Lag > repl-backlog-size?}
    B -->|Yes| C[Partial resync fails]
    C --> D[Full resync triggered]
    D --> E[Primary forks for RDB]
    E --> F[Latency spike on primary]
    F --> G[Other replicas time out and lag]
    G --> A
    B -->|No| H[Normal partial resync]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| repl-backlog-size too small | sync_partial_err and sync_full counters rising; replicas resync after brief disconnects | CONFIG GET repl-backlog-size |
| Network blip or instability | master_link_status flaps to down; one or more replicas show growing master_link_down_since_seconds | INFO replication on replicas |
| Primary latency spike (fork or slow command) | latest_fork_usec > 500 ms or slowlog entries appear; then connected_slaves drops and sync_full rises | INFO persistence and SLOWLOG GET |
| Replica resource bottleneck | One replica consistently lags while others stay caught up; its offset trails the primary | Per-replica offset in INFO replication on the primary |

Quick checks

# Check full vs partial resync counters
redis-cli INFO stats | grep -E "sync_full|sync_partial"

# Check primary offset and per-replica offsets
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"

# Check replica link status and downtime duration
redis-cli INFO replication | grep -E "master_link_status|master_link_down_since"

# Check if a fork is currently happening and how long the last one took
redis-cli INFO persistence | grep -E "rdb_bgsave_in_progress|latest_fork_usec"

# Check current backlog size
redis-cli CONFIG GET repl-backlog-size

# Check recent slow commands that could have blocked the event loop
redis-cli SLOWLOG GET 5

How to diagnose it

  1. On the primary, check INFO stats. Rising sync_partial_err means partial resyncs are failing because the offset left the backlog or the replication ID changed.
  2. Check INFO replication on the primary. Subtract each replica’s offset from master_repl_offset to get per-replica lag. If lag exceeds repl-backlog-size, that replica will full-resync on reconnect.
  3. On a lagging replica, check INFO replication. If master_link_status is down, note master_link_down_since_seconds. Multiply your write rate by the downtime to see if the backlog could cover the gap.
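  4. On the primary, check INFO persistence. rdb_bgsave_in_progress:1 means an RDB snapshot is being generated right now, either for a full resync or for a scheduled save; if it coincides with a rising sync_full, a full resync is the likely cause. latest_fork_usec above 500 ms (500000 in the field's microsecond units) explains why clients and replicas timed out.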
  5. Check SLOWLOG GET 10 on the primary. Look for KEYS *, large SMEMBERS, or long Lua scripts that blocked the event loop and caused replicas to time out.
  6. Correlate connected_slaves over time. A drop followed by a full resync followed by another drop confirms the cascade.
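The subtraction in step 2 is easy to get wrong by hand, so here is a minimal sketch that does it from INFO replication output. It assumes the usual slaveN:ip=...,offset=... line layout; the function name replica_lag and the FULL_RESYNC_RISK label are hypothetical, and the threshold is simply the backlog size in bytes passed as the first argument.

```shell
# Sketch: print per-replica byte lag from the primary's `INFO replication`
# text (stdin) and flag replicas whose lag exceeds the backlog size (arg 1).
replica_lag() {
  backlog_size=$1
  tr -d '\r' | awk -v backlog="$backlog_size" '
    /^master_repl_offset:/ { split($0, kv, ":"); master = kv[2] + 0 }
    /^slave[0-9]+:/ {
      name = substr($0, 1, index($0, ":") - 1)
      n = split(substr($0, index($0, ":") + 1), parts, ",")
      for (i = 1; i <= n; i++) {
        # "offset=" is 7 characters, so the value starts at position 8.
        if (parts[i] ~ /^offset=/) { offsets[name] = substr(parts[i], 8) + 0 }
      }
      order[++count] = name
    }
    END {
      # master_repl_offset is printed after the slaveN lines, so compute here.
      for (i = 1; i <= count; i++) {
        lag = master - offsets[order[i]]
        verdict = (lag > backlog) ? "FULL_RESYNC_RISK" : "ok"
        printf "%s lag_bytes=%.0f %s\n", order[i], lag, verdict
      }
    }'
}

# Example against a live primary with the 1 MB default (not run here):
#   redis-cli INFO replication | replica_lag 1048576
```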

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| sync_partial_err | Each increment means a replica could not use partial resync | Sustained increase; any rise during stable topology |
| sync_full | Every full resync forks the primary and transfers the entire dataset | Increase outside of new replica provisioning |
| Replication offset lag | Byte distance between primary and replica | Lag approaching or exceeding repl-backlog-size |
| latest_fork_usec | Duration the primary is frozen during fork, in microseconds | > 500 ms (500000); sudden spikes without dataset growth |
| connected_slaves | Number of replicas currently streaming | Unexpected drops below the expected count |
| master_link_status | Whether a replica has an active replication stream | down for more than 30 seconds |
| rdb_bgsave_in_progress | An RDB child is running, for a full resync or a scheduled save | Value of 1 coinciding with replica reconnections |
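Because the sync counters are cumulative since server start, what matters is how much they move between two samples. The following is a rough sketch of that comparison, assuming two INFO stats snapshots captured as strings; the helper names info_field and resync_deltas are hypothetical.

```shell
# Extract one field from INFO output read on stdin.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Sketch: report how far the resync counters moved between two snapshots.
# Missing fields are treated as 0.
resync_deltas() {
  before=$1
  after=$2
  for key in sync_full sync_partial_ok sync_partial_err; do
    b=$(printf '%s\n' "$before" | info_field "$key")
    a=$(printf '%s\n' "$after" | info_field "$key")
    echo "$key +$(( ${a:-0} - ${b:-0} ))"
  done
}

# Example against a live primary (not run here):
#   before=$(redis-cli INFO stats)
#   sleep 60
#   after=$(redis-cli INFO stats)
#   resync_deltas "$before" "$after"
```

A non-zero sync_partial_err or sync_full delta while no replicas are being provisioned matches the warning signs in the table.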

Fixes

Increase the replication backlog size

Raise the buffer so replicas survive longer disconnections without a full resync. Apply live:

redis-cli CONFIG SET repl-backlog-size 104857600

Persist the change:

redis-cli CONFIG REWRITE

100 MB is a pragmatic minimum for production; write-heavy primaries often need 256 MB or 512 MB. The backlog consumes primary memory, so size it against your write rate and tolerance for disconnect duration. Target at least 2 times write_bytes_per_second multiplied by max_expected_disconnect_seconds.
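As a worked example of that sizing rule, the sketch below applies the 2x factor and the 100 MB floor from this section. The function name backlog_bytes and the sample inputs (2 MiB/s of writes, a 60 second disconnect) are illustrative assumptions, not measurements.

```shell
MIN_BACKLOG_BYTES=104857600   # the 100 MB floor suggested in this section

# backlog_bytes WRITE_BYTES_PER_SEC MAX_DISCONNECT_SECS
# Prints 2 * rate * seconds, but never less than the floor.
backlog_bytes() {
  rate=$1
  secs=$2
  size=$(( 2 * rate * secs ))
  if [ "$size" -lt "$MIN_BACKLOG_BYTES" ]; then
    size=$MIN_BACKLOG_BYTES
  fi
  echo "$size"
}

# 2 MiB/s of writes and a 60 s tolerated disconnect:
# 2 * 2097152 * 60 = 251658240 bytes (240 MiB)
backlog_bytes 2097152 60
```

The result can be passed straight to CONFIG SET repl-backlog-size, provided the primary has that much memory to spare.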

Enable diskless replication

If full resyncs are unavoidable, enable repl-diskless-sync yes on the primary. The forked child streams the RDB directly to the replica socket instead of writing to disk first. This removes disk I/O pressure on the primary and reduces time spent in the forked state.

Address the triggering replica

If one replica lags while others stay caught up, the problem is not the backlog size. Check that replica for CPU saturation, disk I/O contention from its own persistence, or network bandwidth limits. Restarting it can trigger a full resync, so plan for the fork cost before you do.

Break the storm with temporary topology changes

If multiple replicas are looping through full resyncs and the primary cannot keep up, temporarily disconnect the most lagged replicas at the network or application layer to stop the fork cascade. This causes outages for those replicas. Reconnect them only after the backlog size increase has taken effect and the primary is stable.

Prevention

Size repl-backlog-size before the incident. Calculate peak write rate in bytes per second from master_repl_offset deltas, multiply by the longest maintenance window or network blip you tolerate (typically 30 to 120 seconds), and double it. Treat 1 MB as a placeholder. Monitor sync_partial_err as a leading indicator: any sustained increase means headroom is shrinking. Ensure replicas have enough CPU and network to apply the replication stream in real time. If replicas run their own persistence, their fork load can delay replication processing and widen lag. Keep repl-diskless-sync yes to reduce the cost of unavoidable full resyncs.
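The write rate used in that calculation can be estimated by sampling master_repl_offset twice. This is a minimal sketch with hypothetical helper names (repl_offset, write_rate); it yields an average over the sampling window, so take the samples during peak traffic if you want a peak figure.

```shell
# Extract master_repl_offset from `INFO replication` text on stdin.
repl_offset() {
  tr -d '\r' | awk -F: '$1 == "master_repl_offset" { print $2 }'
}

# write_rate START_OFFSET END_OFFSET INTERVAL_SECS
# Prints the average replication stream rate in bytes per second.
write_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Example against a live primary (not run here):
#   start=$(redis-cli INFO replication | repl_offset)
#   sleep 60
#   end=$(redis-cli INFO replication | repl_offset)
#   write_rate "$start" "$end" 60
```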

How Netdata helps

  • Correlate sync_full spikes with latest_fork_usec, used_memory_rss, and instantaneous_ops_per_sec drops to confirm the cascade pattern.
  • Monitor per-replica offset lag and connected_slaves to identify the replica that triggered the storm.
  • Track sync_partial_err as an early warning before full resyncs begin.
  • Overlay primary CPU, memory, and disk metrics to distinguish backlog overflow from primary resource exhaustion.

Related guides

  • How Redis actually works in production: a mental model for operators: /guides/redis/how-redis-works-in-production/
  • Redis aof_last_write_status:err: AOF write failures and recovery: /guides/redis/redis-aof-last-write-status-err/
  • Redis appendfsync always latency: durability vs throughput trade-offs: /guides/redis/redis-appendfsync-always-latency/
  • Redis blocked_clients growing: dead consumers vs healthy queues: /guides/redis/redis-blocked-clients-growing/
  • Redis BUSY Redis is busy running a script: blocking Lua and how to recover: /guides/redis/redis-busy-running-script/
  • Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix: /guides/redis/redis-cant-save-in-background-fork/
  • Redis client output buffer overflow: slow consumers and client-output-buffer-limit: /guides/redis/redis-client-output-buffer-limit/
  • Redis connected_clients climbing: connection leak detection: /guides/redis/redis-connected-clients-climbing/
  • Redis connection exhaustion: leaks, pools, and the retry storm: /guides/redis/redis-connection-exhaustion/
  • Redis event loop blocked: when one slow command freezes everything: /guides/redis/redis-event-loop-blocked/
  • Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction: /guides/redis/redis-eviction-policy-tuning/
  • Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box: /guides/redis/redis-fork-cow-storm/