Redis sync_full incrementing: diagnosing full resync events
Your Redis primary’s sync_full counter is climbing. That means replicas are performing full resyncs instead of partial ones. Each full resync forces the primary to fork, write an RDB snapshot, and push it to the replica, which then wipes its own dataset and reloads from scratch. One full resync is a heavy operation. Several in succession, or multiple at once, can freeze the primary’s event loop, spike memory via copy-on-write, and trigger a cascade where more replicas fall behind and also need full resyncs.
What this means
Redis replicates incrementally. When a replica reconnects after a brief interruption, it attempts a partial resync using the primary’s replication backlog, a fixed-size circular buffer. If the replica’s last known offset still exists in the backlog, the primary streams only the missing bytes. If the offset has been overwritten, or the replica presents an unknown replication ID, Redis falls back to a full resync.
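As a rough illustration of the decision described above, the following shell sketch models only the offset check. It assumes the backlog window is described by the repl_backlog_first_byte_offset and repl_backlog_histlen fields of INFO replication. The function name resync_kind is hypothetical, and the sketch ignores the replication ID comparison that Redis also performs.

```shell
# Simplified model: a replica asks for the byte after its last applied offset.
# If that byte is still inside the backlog window, a partial resync is possible.
# Assumption: replication IDs already match (not checked here).
resync_kind() {
  replica_offset=$1   # last offset the replica applied
  first_byte=$2       # repl_backlog_first_byte_offset on the primary
  histlen=$3          # repl_backlog_histlen on the primary
  next=$((replica_offset + 1))
  if [ "$histlen" -gt 0 ] && [ "$next" -ge "$first_byte" ] \
     && [ "$next" -le $((first_byte + histlen)) ]; then
    echo partial
  else
    echo full
  fi
}

resync_kind 1500 1000 1000   # offset still inside the window
resync_kind 500 1000 1000    # offset already overwritten
```

The second call shows the failure mode this guide is about: the replica's position has rolled out of the circular buffer, so only a full resync can recover it.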
A full resync forces the primary to fork() so a child process can serialize the dataset to an RDB file. The replica receives this file, flushes its own data, and loads the new dump. The fork() call itself blocks the primary’s main thread briefly. With a large dataset, or if Transparent Huge Pages are enabled, that pause can last hundreds of milliseconds or longer. Copy-on-write overhead can sharply increase the primary’s resident memory. When multiple replicas request full resyncs simultaneously, the primary can enter a loop of fork latency, replica timeouts, and even more full resyncs.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Replication backlog too small | sync_partial_err climbs as partial resyncs fall back to full | CONFIG GET repl-backlog-size against write throughput |
| Network instability | master_link_status flaps on replicas, connected_slaves fluctuates | master_link_down_since_seconds and network path health |
| Replica restart or crash | connected_slaves drops and recovers, replica uptime_in_seconds resets | Replica logs and INFO server on the replica |
| Write rate exhausting backlog during short blips | master_repl_offset grows rapidly; even sub-second disconnects exceed the backlog | Rate of offset change versus repl-backlog-size |
| Replica too slow to keep up | Replication offset lag grows steadily on one replica before disconnect | Per-replica offset in INFO replication |
Quick checks
Run these read-only commands on the primary and replicas to confirm scope.
# Primary: full and partial resync counters
redis-cli INFO stats | grep -E "sync_full|sync_partial"
# Primary: replication backlog size
redis-cli CONFIG GET repl-backlog-size
# Primary: offset and per-replica offsets
redis-cli INFO replication | grep -E "master_repl_offset|connected_slaves|slave[0-9]"
# Primary: background save in progress
redis-cli INFO persistence | grep rdb_bgsave_in_progress
# Primary: latest fork duration (full resyncs require fork)
redis-cli INFO | grep latest_fork_usec
# Replica: link status and downtime
redis-cli INFO replication | grep -E "master_link_status|master_link_down_since"
# Replica: applied offset
redis-cli INFO replication | grep slave_repl_offset
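To compare the backlog against write throughput you need a number for the throughput. One way to estimate it, sketched below, is to sample master_repl_offset twice and divide the growth by the interval. The helper names info_field and repl_rate_bytes_per_sec are hypothetical, and the 10-second interval in the usage comment is an arbitrary assumption.

```shell
# Print the value of one "key:value" field from INFO output read on stdin.
info_field() {
  awk -F: -v key="$1" '$1 == key { sub(/\r$/, "", $2); print $2; exit }'
}

# Average growth of the replication stream, in bytes per second,
# between two master_repl_offset samples taken interval_seconds apart.
repl_rate_bytes_per_sec() {
  first_offset=$1
  second_offset=$2
  interval_seconds=$3
  echo $(( (second_offset - first_offset) / interval_seconds ))
}

# Usage against a live primary (not executed here):
#   a=$(redis-cli INFO replication | info_field master_repl_offset)
#   sleep 10
#   b=$(redis-cli INFO replication | info_field master_repl_offset)
#   repl_rate_bytes_per_sec "$a" "$b" 10
```

Sample during peak traffic if you can; an average taken during a quiet period will understate how quickly the backlog is overwritten.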
How to diagnose it
- Confirm full resyncs are active. On the primary, run INFO stats | grep sync_full. If the counter is increasing, full resyncs are happening now. Check sync_partial_err; if it is also increasing, partial resyncs are failing and falling back.
- Evaluate the backlog. Run CONFIG GET repl-backlog-size. The default is 1 MB. For most production write rates, this is insufficient. Compare the backlog size to your write throughput. If you write 50 MB/s and the backlog is 1 MB, a replica disconnected for more than 20 milliseconds will require a full resync.
- Inspect replica topology. On the primary, run INFO replication. Check connected_slaves. If the count is fluctuating, replicas are disconnecting and reconnecting. Review the slaveN: lines for individual replica offsets. A replica whose offset lags far behind master_repl_offset is a candidate for triggering a full resync on its next disconnect.
- Check replica-side link health. On each replica, run INFO replication | grep master_link_status. If it is down, check master_link_down_since_seconds to see how long the replica has been partitioned. Intermittent blips combined with a small backlog are the most common trigger.
- Correlate with fork pressure. Run INFO | grep latest_fork_usec. If fork duration is spiking above 500 ms, the primary is already under pressure from persistence or replication forks. High fork latency can cause replicas to time out, which then reconnect and trigger more full resyncs.
- Look for the cascade. A sustained rdb_bgsave_in_progress combined with dropping connected_slaves suggests fork latency is timing out other replicas.
- Check for external triggers. Review replica logs for OOM kills, unexpected restarts, or disk I/O stalls that caused the replica to stop processing the replication stream. Also verify that no operator maintenance window caused a batch of replica restarts.
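The arithmetic behind the backlog step above can be sketched as a small shell function. It assumes a steady write rate and treats the backlog as holding exactly backlog_bytes of replication stream; the name backlog_window_ms is hypothetical.

```shell
# How long a replica can stay disconnected before its offset is overwritten,
# in milliseconds, for a given backlog size and write rate (both in bytes).
backlog_window_ms() {
  backlog_bytes=$1
  write_bytes_per_sec=$2
  echo $(( backlog_bytes * 1000 / write_bytes_per_sec ))
}

backlog_window_ms 1048576 52428800     # 1 MB backlog at 50 MB/s: prints 20
backlog_window_ms 104857600 52428800   # 100 MB backlog at 50 MB/s: prints 2000
```

Even the 100 MB case tolerates only about two seconds at that write rate, so the window should be checked against the disconnects you actually observe.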
flowchart TD
A[Replica disconnects or lags] --> B{Offset still in backlog?}
B -->|Yes| C[Partial resync]
B -->|No| D[Full resync]
D --> E[Primary forks for RDB]
E --> F[Latency spike and COW]
F --> G[Other replicas time out]
G --> D
C --> H[Replica catches up]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| sync_full rate | Each increment is an expensive fork, RDB dump, and bulk transfer | Sustained increase |
| sync_partial_err rate | Failed partial resyncs force full resyncs; indicates backlog exhaustion | Any sustained rate above zero |
| repl-backlog-size vs write throughput | Determines the maximum disconnect duration a replica can tolerate before full resync | Backlog smaller than write rate multiplied by expected disconnect time |
| Replication offset lag | Byte distance between primary and replica; exceeding the backlog triggers full resync | Lag approaching or exceeding repl-backlog-size |
| latest_fork_usec | Full resyncs require fork; long forks block the event loop and can cascade | Spikes above 500 ms |
| connected_slaves | Drops indicate replica disconnects that may resync on reconnect | Count below expected or fluctuating |
| rdb_bgsave_in_progress | Confirms a full resync is actively consuming primary resources | Active during replica reconnect storms |
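The replication offset lag signal is not reported directly as a byte count, but it can be derived from the primary's INFO replication output. The sketch below subtracts each slaveN offset from master_repl_offset. The function name replica_lag_bytes is hypothetical, and the parsing assumes the slaveN:ip=...,port=...,state=...,offset=...,lag=... line layout.

```shell
# Read INFO replication output (from the primary) on stdin and print
# "<slaveN> <bytes behind master_repl_offset>" for each listed replica.
replica_lag_bytes() {
  awk '
    {
      sub(/\r$/, "")                 # INFO lines end in CRLF
      sep = index($0, ":")
      if (sep == 0) next             # skip section headers and blank lines
      key = substr($0, 1, sep - 1)
      val = substr($0, sep + 1)
    }
    key == "master_repl_offset" { master = val }
    key ~ /^slave[0-9]+$/ {
      n = split(val, parts, ",")
      for (i = 1; i <= n; i++) {
        if (split(parts[i], kv, "=") == 2 && kv[1] == "offset") offset[key] = kv[2]
      }
    }
    END {
      for (name in offset) printf "%s %.0f\n", name, master - offset[name]
    }
  ' | sort
}

# Usage against a live primary (not executed here):
#   redis-cli INFO replication | replica_lag_bytes
```

Comparing each printed value with repl-backlog-size shows which replicas would need a full resync if they disconnected right now.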
Fixes
Increase the replication backlog
If sync_partial_err is incrementing and the backlog is at or near the default 1 MB, increase it. A live change takes effect immediately:
CONFIG SET repl-backlog-size 104857600
Persist the change in redis.conf so it survives restart. The backlog consumes additional memory on the primary. Size it to at least twice the expected write volume during your longest planned disconnect or maintenance window. A common starting point for production workloads is 100 MB.
Reduce network instability
If master_link_status is flapping on replicas, investigate the network path between the primary and replicas. Look for packet loss, latency spikes, or firewall timeouts. Stable links prevent replicas from losing their place in the replication stream.
Address replica instability
If one specific replica is repeatedly triggering full resyncs, inspect that replica for resource exhaustion. Check uptime_in_seconds on the replica to see if it is restarting unexpectedly. Review kernel logs for OOM kills and check disk latency if the replica stalls while persisting its own data.
Lower fork latency
High latest_fork_usec turns a single full resync into a system-wide latency event. The most common cause is Transparent Huge Pages. Check THP status:
cat /sys/kernel/mm/transparent_hugepage/enabled
If the value is not [never], disable THP. Changing the setting requires root privileges, and Redis must be restarted afterward for the change to apply to its memory. Persist the setting in your boot configuration so it survives a reboot. Fork latency should scale roughly with dataset size; significantly higher values indicate a kernel or hypervisor issue.
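The kernel marks the active THP mode by wrapping it in square brackets, for example always madvise [never]. For use in a provisioning check, a small helper can extract just that word. This is a sketch; the function name thp_active_mode is hypothetical, and the optional path argument exists only so the function can be run against a sample file.

```shell
# Print the active THP mode (the bracketed word) from the sysfs file.
# An alternative path may be passed as the first argument.
thp_active_mode() {
  file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
  sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$file"
}

# Usage on a Redis host (not executed here):
#   [ "$(thp_active_mode)" = "never" ] || echo "THP is enabled on this host"
```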
Use diskless replication if appropriate
If the primary is bottlenecked on disk I/O while writing the RDB for transfer, consider setting repl-diskless-sync yes (this is already the default from Redis 7.0 onward). This streams the RDB directly to the replica without writing to disk on the primary first. Tradeoffs include higher network utilization on the primary and potential timeouts on high-latency links.
Stagger replica reconnections
If multiple replicas were restarted simultaneously, for example after a deployment, they may all request full resyncs at the same time. Restart or reconnect replicas in batches to avoid overwhelming the primary with concurrent forks and transfers.
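One way to plan the batches is to split the replica list into fixed-size groups and process one group at a time. The sketch below only does the grouping. The function name replica_batches and the host names are hypothetical, and how you restart a replica depends on your environment.

```shell
# Print the given replica names in groups of at most batch_size per line.
replica_batches() {
  batch_size=$1
  shift
  printf '%s\n' "$@" | xargs -n "$batch_size"
}

replica_batches 2 replica-a replica-b replica-c replica-d replica-e

# Usage idea (not executed here; restart_replica is a hypothetical command):
#   replica_batches 2 replica-a replica-b replica-c | while read -r batch; do
#     for host in $batch; do restart_replica "$host"; done
#     sleep 60   # or poll master_link_status on each replica until it is up
#   done
```

Waiting for each batch to report master_link_status:up before starting the next keeps the number of concurrent full resyncs bounded by the batch size.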
Prevention
Size the replication backlog using your measured write throughput and your tolerance for disconnect duration:
repl-backlog-size >= 2 * (write_bytes_per_second * max_expected_disconnect_seconds)
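The formula above translates directly into shell arithmetic. This is a sketch with a hypothetical function name, recommended_backlog_bytes, and inputs you would replace with your own measured write rate and disconnect tolerance.

```shell
# Apply: repl-backlog-size >= 2 * (write_bytes_per_second * max_expected_disconnect_seconds)
recommended_backlog_bytes() {
  write_bytes_per_sec=$1
  max_disconnect_seconds=$2
  echo $(( 2 * write_bytes_per_sec * max_disconnect_seconds ))
}

# 1 MB/s of writes and a 50-second tolerated disconnect: prints 104857600 (100 MB)
recommended_backlog_bytes 1048576 50
```

The printed value can be passed as the argument to CONFIG SET repl-backlog-size.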
Monitor sync_partial_err as an early warning. Any increment means the backlog was insufficient for a reconnect that just happened. Do not wait for sync_full to climb.
Disable THP on all Redis hosts during provisioning. A fork that takes seconds instead of milliseconds is often the difference between a brief blip and a resync storm.
Avoid restarting all replicas at once. Rolling restarts spread the load on the primary and keep the majority of replicas online.
How Netdata helps
- Netdata collects sync_full, sync_partial_ok, and sync_partial_err from INFO stats and charts the rate of change.
- Correlate climbing sync_full with latest_fork_usec and rdb_bgsave_in_progress to confirm the primary is under fork pressure.
- Alert on sync_partial_err increments to catch backlog exhaustion before it forces a full resync.
- Chart replication offset lag per replica to identify lagging replicas before they exceed the backlog window.
- Tie Redis replication metrics to system-level charts for RSS, CPU, and disk I/O to see when a full resync is contributing to host-level saturation.
Related guides
- How Redis actually works in production: a mental model for operators: /guides/redis/how-redis-works-in-production/
- Redis aof_last_write_status:err: AOF write failures and recovery: /guides/redis/redis-aof-last-write-status-err/
- Redis appendfsync always latency: durability vs throughput trade-offs: /guides/redis/redis-appendfsync-always-latency/
- Redis blocked_clients growing: dead consumers vs healthy queues: /guides/redis/redis-blocked-clients-growing/
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover: /guides/redis/redis-busy-running-script/
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix: /guides/redis/redis-cant-save-in-background-fork/
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit: /guides/redis/redis-client-output-buffer-limit/
- Redis connected_clients climbing: connection leak detection: /guides/redis/redis-connected-clients-climbing/
- Redis connection exhaustion: leaks, pools, and the retry storm: /guides/redis/redis-connection-exhaustion/
- Redis event loop blocked: when one slow command freezes everything: /guides/redis/redis-event-loop-blocked/
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction: /guides/redis/redis-eviction-policy-tuning/
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box: /guides/redis/redis-fork-cow-storm/







