Redis replication backlog overflow: full-resync storms and the 1MB default
Replicas drop and reconnect, but each reconnection triggers a full resync instead of a partial sync. The primary forks for an RDB dump, latency spikes, and other replicas fall behind. Before the primary recovers, another replica falls outside the backlog window and the cycle repeats. The default repl-backlog-size of 1 MB is a common trigger for this cascade in production workloads.
The backlog is a fixed-size circular buffer of recent writes that lets a disconnected replica catch up without a full resync. When writes during a blip exceed the 1 MB default, the replica’s offset falls outside the window. Recovery requires a full resync, which forks the primary and turns a brief disconnect into a site-wide latency event.
What this means
A partial resync (PSYNC) sends the replica only the writes it missed. The primary keeps these in the replication backlog, a circular buffer whose size is set by repl-backlog-size. When a replica reconnects, it sends its last replication offset. If that offset is still inside the backlog, the primary streams the delta. If the buffer wrapped past the offset, the primary must do a full resync: fork a child, serialize the dataset to an RDB file or socket, transfer it, and load it on the replica.
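The offset check described above can be approximated from two fields that INFO replication reports on the primary, repl_backlog_first_byte_offset and repl_backlog_histlen. The sketch below is a simplified illustration, not Redis's actual implementation: the function name psync_outcome is hypothetical, and it ignores the replication ID comparison that Redis also performs before accepting a partial resync.

```shell
# Hypothetical helper: is a replica's offset still covered by the backlog?
#   $1 = repl_backlog_first_byte_offset (from INFO replication on the primary)
#   $2 = repl_backlog_histlen           (from INFO replication on the primary)
#   $3 = the replica's last replication offset
psync_outcome() {
  first=$1
  histlen=$2
  # The replica asks for the byte right after the last one it holds.
  wanted=$(( $3 + 1 ))
  # The window covers first .. first+histlen; the upper bound is the
  # "already caught up" case where nothing needs to be sent.
  if [ "$wanted" -ge "$first" ] && [ "$wanted" -le $(( $first + $histlen )) ]; then
    echo partial
  else
    echo full
  fi
}
```

For example, psync_outcome 2000000 1048576 500000 prints full, because the backlog's first byte is already well past what the replica needs.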
Full resyncs are expensive. The fork blocks the main thread for the duration reported in latest_fork_usec (in microseconds), which can reach hundreds of milliseconds. During that freeze, commands queue. Replicas waiting for replication data may time out and disconnect. When they reconnect, they too may find themselves outside the backlog, triggering more full resyncs. A single overflow can cascade into a storm that keeps the primary in a perpetual fork-resync loop.
flowchart TD
A[Replica lag grows] --> B{Lag > repl-backlog-size?}
B -->|Yes| C[Partial resync fails]
C --> D[Full resync triggered]
D --> E[Primary forks for RDB]
E --> F[Latency spike on primary]
F --> G[Other replicas time out and lag]
G --> A
B -->|No| H[Normal partial resync]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| repl-backlog-size too small | sync_partial_err and sync_full counters rising; replicas resync after brief disconnects | CONFIG GET repl-backlog-size |
| Network blip or instability | master_link_status flaps to down; one or more replicas show growing master_link_down_since_seconds | INFO replication on replicas |
| Primary latency spike (fork or slow command) | latest_fork_usec > 500 ms or slowlog entries appear; then connected_slaves drops and sync_full rises | INFO persistence and SLOWLOG GET |
| Replica resource bottleneck | One replica consistently lags while others stay caught up; its offset trails the primary | Per-replica offset in INFO replication on the primary |
Quick checks
# Check full vs partial resync counters
redis-cli INFO stats | grep -E "sync_full|sync_partial"
# Check primary offset and per-replica offsets
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"
# Check replica link status and downtime duration
redis-cli INFO replication | grep -E "master_link_status|master_link_down_since"
# Check if a fork is currently happening and how long the last one took
redis-cli INFO persistence | grep -E "rdb_bgsave_in_progress|latest_fork_usec"
# Check current backlog size
redis-cli CONFIG GET repl-backlog-size
# Check recent slow commands that could have blocked the event loop
redis-cli SLOWLOG GET 5
How to diagnose it
- On the primary, check INFO stats. Rising sync_partial_err means partial resyncs are failing because the offset left the backlog or the replication ID changed.
- Check INFO replication on the primary. Subtract each replica’s offset from master_repl_offset to get per-replica lag. If lag exceeds repl-backlog-size, that replica will full-resync on reconnect.
- On a lagging replica, check INFO replication. If master_link_status is down, note master_link_down_since_seconds. Multiply your write rate by the downtime to see if the backlog could cover the gap.
- On the primary, check INFO persistence. rdb_bgsave_in_progress:1 means an RDB child process is running; during a storm this is usually a full resync generating a snapshot, though a scheduled save reports the same value. latest_fork_usec above 500 ms (500000 microseconds) can explain why clients and replicas timed out.
- Check SLOWLOG GET 10 on the primary. Look for KEYS *, large SMEMBERS, or long Lua scripts that blocked the event loop and caused replicas to time out.
- Correlate connected_slaves over time. A drop followed by a full resync followed by another drop confirms the cascade.
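The per-replica lag subtraction from the steps above can be scripted. This is a minimal sketch that assumes the usual INFO replication layout on the primary (slaveN lines containing offset=..., plus a master_repl_offset line); the function name replica_lag_bytes is hypothetical.

```shell
# Hypothetical helper: read INFO replication output from the primary on
# stdin and print "<replica> <lag_bytes>" for every slaveN line.
replica_lag_bytes() {
  # redis-cli output lines end in \r\n, so strip carriage returns first.
  tr -d '\r' | awk '
    /^master_repl_offset:/ { master = substr($0, index($0, ":") + 1) }
    /^slave[0-9]+:/ && match($0, /offset=[0-9]+/) {
      names[++count] = substr($0, 1, index($0, ":") - 1)
      # Skip the 7 characters of "offset=" to keep only the digits.
      offsets[count] = substr($0, RSTART + 7, RLENGTH - 7)
    }
    END {
      # %.0f avoids exponent notation for large offsets in some awks.
      for (i = 1; i <= count; i++)
        printf "%s %.0f\n", names[i], master - offsets[i]
    }'
}
```

A typical invocation would be redis-cli INFO replication | replica_lag_bytes on the primary; compare each printed value against the configured repl-backlog-size.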
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| sync_partial_err | Each increment means a replica could not use partial resync | Sustained increase; any rise during stable topology |
| sync_full | Every full resync forks the primary and transfers the entire dataset | Increase outside of new replica provisioning |
| Replication offset lag | Byte distance between primary and replica | Lag approaching or exceeding repl-backlog-size |
| latest_fork_usec | Duration the primary is frozen during fork | > 500 ms; sudden spikes without dataset growth |
| connected_slaves | Number of replicas currently streaming | Unexpected drops below the expected count |
| master_link_status | Whether a replica has an active replication stream | down for more than 30 seconds |
| rdb_bgsave_in_progress | An RDB child process is running, either for a full resync or a scheduled save | Value of 1 coinciding with replica reconnections |
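Because sync_full and sync_partial_err are cumulative counters, the warning signs in the table are about change between samples rather than absolute values. The following sketch compares two saved INFO stats snapshots; the helper names info_field and counter_delta are hypothetical.

```shell
# Hypothetical helper: print one field's value from INFO output on stdin.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical helper: change in a counter between two INFO snapshots.
#   $1 = field name, $2 = earlier snapshot text, $3 = later snapshot text
counter_delta() {
  before=$(printf '%s\n' "$2" | info_field "$1")
  after=$(printf '%s\n' "$3" | info_field "$1")
  # A field missing from a snapshot is treated as 0.
  echo $(( ${after:-0} - ${before:-0} ))
}
```

For example, capture redis-cli INFO stats into two shell variables a minute apart and pass both to counter_delta sync_partial_err; a nonzero result while the topology is stable matches the warning sign above.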
Fixes
Increase the replication backlog size
Raise the buffer so replicas survive longer disconnections without a full resync. Apply live:
redis-cli CONFIG SET repl-backlog-size 104857600
Persist the change:
redis-cli CONFIG REWRITE
100 MB is a pragmatic minimum for production; write-heavy primaries often need 256 MB or 512 MB. The backlog consumes primary memory, so size it against your write rate and tolerance for disconnect duration. Target at least 2 times write_bytes_per_second multiplied by max_expected_disconnect_seconds.
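The sizing rule above is simple arithmetic, sketched here as a shell function. The name backlog_size_bytes is hypothetical, and both inputs are values you have to measure or choose yourself.

```shell
# Hypothetical helper for the rule of thumb above:
#   backlog bytes = 2 * write rate (bytes/s) * longest tolerated disconnect (s)
#   $1 = write rate in bytes per second
#   $2 = longest expected disconnect in seconds
backlog_size_bytes() {
  echo $(( 2 * $1 * $2 ))
}
```

With a write rate of 1 MB/s (1048576 bytes) and a 120-second tolerance, backlog_size_bytes 1048576 120 prints 251658240, about 240 MB, which is consistent with the 256 MB range mentioned for write-heavy primaries.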
Enable diskless replication
If full resyncs are unavoidable, enable repl-diskless-sync yes on the primary. The forked child streams the RDB directly to the replica socket instead of writing to disk first. This removes disk I/O pressure on the primary and reduces time spent in the forked state.
Address the triggering replica
If one replica lags while others stay caught up, the problem is not the backlog size. Check that replica for CPU saturation, disk I/O contention from its own persistence, or network bandwidth limits. Restarting it can trigger a full resync, so plan for the fork cost.
Break the storm with temporary topology changes
If multiple replicas are looping through full resyncs and the primary cannot keep up, temporarily disconnect the most lagged replicas at the network or application layer to stop the fork cascade. This causes outages for those replicas. Reconnect them only after the backlog size increase has taken effect and the primary is stable.
Prevention
Size repl-backlog-size before the incident. Calculate peak write rate in bytes per second from master_repl_offset deltas, multiply by the longest maintenance window or network blip you tolerate (typically 30 to 120 seconds), and double it. Treat 1 MB as a placeholder. Monitor sync_partial_err as a leading indicator: any sustained increase means headroom is shrinking. Ensure replicas have enough CPU and network to apply the replication stream in real time. If replicas run their own persistence, their fork load can delay replication processing and widen lag. Keep repl-diskless-sync yes to reduce the cost of unavoidable full resyncs.
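Deriving the write rate from master_repl_offset deltas takes two samples and a division. A minimal sketch, with the hypothetical function name write_rate_bps; it yields an average over the sampling interval, so sampling during peak traffic is assumed.

```shell
# Hypothetical helper: average write rate in bytes per second between
# two master_repl_offset samples.
#   $1 = earlier offset, $2 = later offset, $3 = seconds between samples
write_rate_bps() {
  echo $(( ($2 - $1) / $3 ))
}
```

For example, if master_repl_offset reads 1000000 and then 31000000 sixty seconds later, write_rate_bps 1000000 31000000 60 prints 500000. Short intervals taken at peak give a more conservative figure than a long average.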
How Netdata helps
- Correlate sync_full spikes with latest_fork_usec, used_memory_rss, and instantaneous_ops_per_sec drops to confirm the cascade pattern.
- Monitor per-replica offset lag and connected_slaves to identify the replica that triggered the storm.
- Track sync_partial_err as an early warning before full resyncs begin.
- Overlay primary CPU, memory, and disk metrics to distinguish backlog overflow from primary resource exhaustion.
Related guides
- How Redis actually works in production: a mental model for operators: /guides/redis/how-redis-works-in-production/
- Redis aof_last_write_status:err: AOF write failures and recovery: /guides/redis/redis-aof-last-write-status-err/
- Redis appendfsync always latency: durability vs throughput trade-offs: /guides/redis/redis-appendfsync-always-latency/
- Redis blocked_clients growing: dead consumers vs healthy queues: /guides/redis/redis-blocked-clients-growing/
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover: /guides/redis/redis-busy-running-script/
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix: /guides/redis/redis-cant-save-in-background-fork/
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit: /guides/redis/redis-client-output-buffer-limit/
- Redis connected_clients climbing: connection leak detection: /guides/redis/redis-connected-clients-climbing/
- Redis connection exhaustion: leaks, pools, and the retry storm: /guides/redis/redis-connection-exhaustion/
- Redis event loop blocked: when one slow command freezes everything: /guides/redis/redis-event-loop-blocked/
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction: /guides/redis/redis-eviction-policy-tuning/
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box: /guides/redis/redis-fork-cow-storm/







