Redis sync_full incrementing: diagnosing full resync events

Your Redis primary’s sync_full counter is climbing. That means replicas are performing full resyncs instead of partial ones. Each full resync forces the primary to fork, write an RDB snapshot, and push it to the replica, which then wipes its own dataset and reloads from scratch. One full resync is a heavy operation. Several in succession, or multiple at once, can freeze the primary’s event loop, spike memory via copy-on-write, and trigger a cascade where more replicas fall behind and also need full resyncs.

What this means

Redis replicates incrementally. When a replica reconnects after a brief interruption, it attempts a partial resync using the primary’s replication backlog, a fixed-size circular buffer. If the replica’s last known offset still exists in the backlog, the primary streams only the missing bytes. If the offset has been overwritten, or the replica presents an unknown replication ID, Redis falls back to a full resync.
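The partial-versus-full decision can be sketched as a small shell function. This is a simplified model, not Redis source code: it assumes the replication ID already matches and that the backlog has filled, so it holds the most recent bytes of the stream. The function name resync_kind and its arguments are hypothetical.

```shell
# Simplified model of the partial-vs-full decision (not Redis source code).
# Assumes the replication ID matches and the backlog has filled, so it holds
# the most recent BACKLOG_BYTES of the replication stream.
# Usage: resync_kind MASTER_OFFSET BACKLOG_BYTES REPLICA_OFFSET
resync_kind() {
  master_offset=$1
  backlog_bytes=$2
  replica_offset=$3
  # Oldest offset the circular buffer still holds under the assumptions above.
  oldest_kept=$(( master_offset - backlog_bytes ))
  if [ "$replica_offset" -ge "$oldest_kept" ] && [ "$replica_offset" -le "$master_offset" ]; then
    echo partial
  else
    echo full
  fi
}
```

With a 1 MB backlog and a primary at offset 5000000, a replica at 4500000 is still inside the window, while a replica at 3000000 has fallen out of it and needs a full resync.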

A full resync forces the primary to fork() so a child process can serialize the dataset to an RDB file. The replica receives this file, flushes its own data, and loads the new dump. The fork() call itself blocks the primary’s main thread briefly. With a large dataset, or if Transparent Huge Pages are enabled, that pause can last hundreds of milliseconds or longer. Copy-on-write overhead can sharply increase the primary’s resident memory. When multiple replicas request full resyncs simultaneously, the primary can enter a loop of fork latency, replica timeouts, and even more full resyncs.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Replication backlog too small | sync_partial_err climbs as partial resyncs fall back to full | CONFIG GET repl-backlog-size against write throughput |
| Network instability | master_link_status flaps on replicas, connected_slaves fluctuates | master_link_down_since_seconds and network path health |
| Replica restart or crash | connected_slaves drops and recovers, replica uptime_in_seconds resets | Replica logs and INFO server on the replica |
| Write rate exhausting backlog during short blips | master_repl_offset grows rapidly; even sub-second disconnects exceed the backlog | Rate of offset change versus repl-backlog-size |
| Replica too slow to keep up | Replication offset lag grows steadily on one replica before disconnect | Per-replica offset in INFO replication |

Quick checks

Run these read-only commands on the primary and replicas to confirm scope.

# Primary: full and partial resync counters
redis-cli INFO stats | grep -E "sync_full|sync_partial"
# Primary: replication backlog size
redis-cli CONFIG GET repl-backlog-size
# Primary: offset and per-replica offsets
redis-cli INFO replication | grep -E "master_repl_offset|connected_slaves|slave[0-9]"
# Primary: background save in progress
redis-cli INFO persistence | grep rdb_bgsave_in_progress
# Primary: latest fork duration (full resyncs require fork)
redis-cli INFO | grep latest_fork_usec
# Replica: link status and downtime
redis-cli INFO replication | grep -E "master_link_status|master_link_down_since"
# Replica: applied offset
redis-cli INFO replication | grep slave_repl_offset
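
A counter only tells you something when you compare two samples. The helpers below are a minimal sketch for doing that; info_field and counter_delta are hypothetical names, and the 60 second interval in the usage comment is an arbitrary choice.

```shell
# Print the value of one INFO field read from stdin.
# INFO output uses CRLF line endings, so strip the carriage return first.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2; exit }'
}

# Report how much a counter grew between two samples.
# Usage: counter_delta BEFORE AFTER
counter_delta() {
  echo $(( $2 - $1 ))
}

# Example against a live primary (left commented out so the helpers can be
# sourced on a machine without redis-cli):
#   before=$(redis-cli INFO stats | info_field sync_full)
#   sleep 60
#   after=$(redis-cli INFO stats | info_field sync_full)
#   counter_delta "$before" "$after"
```

A result above zero means at least one full resync started during the interval.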

How to diagnose it

  1. Confirm full resyncs are active. On the primary, run redis-cli INFO stats | grep sync_full twice, a minute or so apart. If the counter increased between samples, full resyncs are happening now. Check sync_partial_err as well; if it is also increasing, partial resyncs are being refused and falling back to full.
  2. Evaluate the backlog. Run CONFIG GET repl-backlog-size. The default is 1 MB. For most production write rates, this is insufficient. Compare the backlog size to your write throughput. If you write 50 MB/s and the backlog is 1 MB, a replica disconnected for more than 20 milliseconds will require a full resync.
  3. Inspect replica topology. On the primary, run INFO replication. Check connected_slaves. If the count is fluctuating, replicas are disconnecting and reconnecting. Review the slaveN: lines for individual replica offsets. A replica whose offset lags far behind master_repl_offset is a candidate for triggering a full resync on its next disconnect.
  4. Check replica-side link health. On each replica, run redis-cli INFO replication | grep master_link_status. If it is down, check master_link_down_since_seconds to see how long the replica has been disconnected. Intermittent blips combined with a small backlog are the most common trigger.
  5. Correlate with fork pressure. Run redis-cli INFO stats | grep latest_fork_usec. The value is in microseconds, so 500 ms reads as 500000. If fork duration is spiking above that, the primary is already under pressure from persistence or replication forks. High fork latency can cause replicas to time out, which then reconnect and trigger more full resyncs.
  6. Look for the cascade. A sustained rdb_bgsave_in_progress combined with dropping connected_slaves suggests fork latency is timing out other replicas.
  7. Check for external triggers. Review replica logs for OOM kills, unexpected restarts, or disk I/O stalls that caused the replica to stop processing the replication stream. Also verify that no operator maintenance window caused a batch of replica restarts.
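
The arithmetic in step 2 can be sketched as two helpers: one estimates write throughput from two master_repl_offset samples, the other converts a backlog size into the disconnect window it tolerates. Both assume a steady write rate and a backlog that has already filled; the function names are hypothetical.

```shell
# Estimate write throughput from two master_repl_offset samples.
# Usage: write_rate_bps OFFSET_BEFORE OFFSET_AFTER INTERVAL_SECONDS
write_rate_bps() {
  echo $(( ($2 - $1) / $3 ))
}

# How long can a replica stay disconnected before its offset is overwritten?
# Prints whole milliseconds.
# Usage: backlog_window_ms BACKLOG_BYTES WRITE_BYTES_PER_SEC
backlog_window_ms() {
  if [ "$2" -le 0 ]; then
    echo "no writes measured" >&2
    return 1
  fi
  echo $(( $1 * 1000 / $2 ))
}
```

For the example in step 2, backlog_window_ms 1048576 52428800 prints 20: a 1 MB backlog absorbs 20 milliseconds of writes at 50 MB/s.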

flowchart TD
    A[Replica disconnects or lags] --> B{Offset still in backlog?}
    B -->|Yes| C[Partial resync]
    B -->|No| D[Full resync]
    D --> E[Primary forks for RDB]
    E --> F[Latency spike and COW]
    F --> G[Other replicas time out]
    G --> D
    C --> H[Replica catches up]
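
To see which replicas are closest to falling out of the backlog, the per-replica offsets from the primary's INFO replication output can be turned into byte lag. A minimal sketch, assuming the usual slaveN:ip=...,port=...,state=...,offset=...,lag=... line format; replica_lag_bytes is a hypothetical name.

```shell
# Print "slaveN <bytes behind>" for each replica, reading the primary's
# INFO replication text on stdin. Strips the CR from CRLF line endings.
# Usage: redis-cli INFO replication | replica_lag_bytes
replica_lag_bytes() {
  tr -d '\r' | awk '
    {
      # Split on the first colon only, so values containing colons survive.
      sep = index($0, ":")
      if (sep == 0) next
      key = substr($0, 1, sep - 1)
      value = substr($0, sep + 1)
    }
    key == "master_repl_offset" { master = value }
    key ~ /^slave[0-9]+$/ {
      n = split(value, fields, ",")
      for (i = 1; i <= n; i++) {
        if (fields[i] ~ /^offset=/) {
          sub(/^offset=/, "", fields[i])
          offsets[key] = fields[i]
          order[++count] = key
        }
      }
    }
    END {
      # master_repl_offset is printed after the slaveN lines, so report last.
      for (i = 1; i <= count; i++)
        printf "%s %.0f\n", order[i], master - offsets[order[i]]
    }'
}
```

Compare each value with repl-backlog-size: a replica whose lag is close to the backlog size will need a full resync after its next disconnect.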

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| sync_full rate | Each increment is an expensive fork, RDB dump, and bulk transfer | Sustained increase |
| sync_partial_err rate | Failed partial resyncs force full resyncs; indicates backlog exhaustion | Any sustained rate above zero |
| repl-backlog-size vs write throughput | Determines the maximum disconnect duration a replica can tolerate before full resync | Backlog smaller than write rate multiplied by expected disconnect time |
| Replication offset lag | Byte distance between primary and replica; exceeding the backlog triggers full resync | Lag approaching or exceeding repl-backlog-size |
| latest_fork_usec | Full resyncs require fork; long forks block the event loop and can cascade | Spikes above 500 ms (500000 microseconds) |
| connected_slaves | Drops indicate replica disconnects that may resync on reconnect | Count below expected or fluctuating |
| rdb_bgsave_in_progress | Shows an RDB child is running and consuming primary resources, as during a full resync | Active during replica reconnect storms |

Fixes

Increase the replication backlog

If sync_partial_err is incrementing and the backlog is at or near the default 1 MB, increase it. A live change takes effect immediately:

CONFIG SET repl-backlog-size 104857600

Persist the change in redis.conf (for example, repl-backlog-size 100mb) so it survives a restart; if the server was started with a config file, CONFIG REWRITE writes the running value back to it. The backlog consumes additional memory on the primary. Size it to at least twice the expected write volume during your longest planned disconnect or maintenance window. A common starting point for production workloads is 100 MB.

Reduce network instability

If master_link_status is flapping on replicas, investigate the network path between the primary and replicas. Look for packet loss, latency spikes, or firewall timeouts. Stable links prevent replicas from losing their place in the replication stream.

Address replica instability

If one specific replica is repeatedly triggering full resyncs, inspect that replica for resource exhaustion. Check uptime_in_seconds on the replica to see if it is restarting unexpectedly. Review kernel logs for OOM kills and check disk latency if the replica stalls while persisting its own data.

Lower fork latency

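High latest_fork_usec turns a single full resync into a system-wide latency event. A common contributor to fork-related latency is Transparent Huge Pages, which make the copy-on-write phase after a fork far more expensive in both time and memory. Check THP status: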

cat /sys/kernel/mm/transparent_hugepage/enabled

If the value is not [never], disable THP. Writing never to that file as root changes the setting at runtime, but Redis must be restarted afterward, and the setting must also be applied at boot to survive a reboot. Fork latency should scale roughly with dataset size; significantly higher values indicate a kernel or hypervisor issue.
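
The sysfs file lists every mode and marks the active one with brackets, for example always [madvise] never. A small helper, sketched here under the hypothetical name thp_mode, extracts only the active mode so a provisioning check can compare it with never.

```shell
# Print the active THP mode, i.e. the bracketed word in the sysfs file.
# Usage: thp_mode < /sys/kernel/mm/transparent_hugepage/enabled
thp_mode() {
  sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}
```

Any result other than never means THP is still active in some form on that host.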

Use diskless replication if appropriate

If the primary is bottlenecked on disk I/O while writing the RDB for transfer, consider setting repl-diskless-sync yes. This streams the RDB directly to the replica without writing to disk on the primary first. The primary still forks a child for the transfer, so this relieves disk pressure, not fork latency. Tradeoffs include higher network utilization on the primary and potential timeouts on high-latency links.

Stagger replica reconnections

If multiple replicas were restarted simultaneously, for example after a deployment, they may all request full resyncs at the same time. Restart or reconnect replicas in batches to avoid overwhelming the primary with concurrent forks and transfers.

Prevention

Size the replication backlog using your measured write throughput and your tolerance for disconnect duration:

repl-backlog-size >= 2 * (write_bytes_per_second * max_expected_disconnect_seconds)
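
As a worked instance of the rule, the helper below multiplies it out in bytes. The name recommended_backlog_bytes is hypothetical, and the inputs are whatever write rate and disconnect tolerance you have measured for your own deployment.

```shell
# Apply the sizing rule: 2 * write rate * longest expected disconnect.
# Usage: recommended_backlog_bytes WRITE_BYTES_PER_SEC MAX_DISCONNECT_SECONDS
recommended_backlog_bytes() {
  echo $(( 2 * $1 * $2 ))
}
```

At 5242880 bytes per second of writes (5 MB/s) and a 10 second tolerated disconnect, recommended_backlog_bytes 5242880 10 prints 104857600, the same 100 MB value used in the CONFIG SET example.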

Monitor sync_partial_err as an early warning. Any increment means a partial resync was refused on a reconnect that just happened, most often because the backlog no longer held the replica’s offset. Do not wait for sync_full to climb.

Disable THP on all Redis hosts during provisioning. A fork that takes seconds instead of milliseconds is often the difference between a brief blip and a resync storm.

Avoid restarting all replicas at once. Rolling restarts spread the load on the primary and keep the majority of replicas online.

How Netdata helps

  • Netdata collects sync_full, sync_partial_ok, and sync_partial_err from INFO stats and charts the rate of change.
  • Correlate climbing sync_full with latest_fork_usec and rdb_bgsave_in_progress to confirm the primary is under fork pressure.
  • Alert on sync_partial_err increments to catch backlog exhaustion before it forces a full resync.
  • Chart replication offset lag per replica to identify lagging replicas before they exceed the backlog window.
  • Tie Redis replication metrics to system-level charts for RSS, CPU, and disk I/O to see when a full resync is contributing to host-level saturation.