Redis MASTERDOWN / master_link_status:down: replication link broken
You see MASTERDOWN errors from client libraries, or monitoring shows master_link_status:down on a replica. The replica is still accepting connections and serving reads, but every response is increasingly stale. The primary continues to take writes, so the gap widens. Determine whether this is a transient resync or a real partition, and fix it without forcing an expensive full resync that freezes the primary with a fork.
What this means
On the replica, INFO replication reports master_link_status:down whenever the replication connection to the primary is not established. The replica retries the connection about once per second. While the link is down, the replica keeps serving reads from its last known dataset by default (replica-serve-stale-data yes), so applications may read stale data silently. With replica-serve-stale-data no, the replica instead rejects most commands with a MASTERDOWN error, which is what client libraries surface.
master_link_down_since_seconds accumulates from the moment the link drops and resets only after successful re-establishment. master_last_io_seconds_ago grows toward the configured timeout while the link is silent.
Distinguish a broken link from a healthy full resync. During a full resync, master_sync_in_progress:1 and loading:1 appear. In that state, master_link_status may be down while the replica transfers and loads the RDB. That is self-healing and should not be treated as a permanent fault. If master_sync_in_progress is 0 and master_link_status remains down, the link is genuinely broken.
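The distinction above can be sketched as a small shell filter. This is a minimal sketch, not an official tool: the function name classify_link and its output labels are hypothetical, and it looks only at master_link_status and master_sync_in_progress (loading lives in INFO persistence and is not checked here).

```shell
# classify_link (hypothetical helper): read `INFO replication` text on stdin
# and print one of:
#   up       link is healthy
#   syncing  link is down but a full resync is in progress (self-healing)
#   broken   link is down and no sync is running
#   unknown  master_link_status not found (for example, the node is a primary)
classify_link() {
  awk -F: '
    { sub(/\r$/, "") }                       # INFO lines end in CRLF
    $1 == "master_link_status"      { link = $2 }
    $1 == "master_sync_in_progress" { sync = $2 }
    END {
      if (link == "up")      print "up"
      else if (link == "")   print "unknown"
      else if (sync == "1")  print "syncing"
      else                   print "broken"
    }'
}

# Usage (on the replica):
#   redis-cli INFO replication | classify_link
```

Parsing saved INFO text rather than calling redis-cli inside the function keeps the check usable against captured output from an incident.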
The default repl-timeout is 60 seconds. This timer controls three thresholds: bulk-transfer I/O during SYNC from the replica side, master timeout from the replica side (data or pings), and replica timeout from the master side (REPLCONF ACKs). The replica sends REPLCONF ACK every second, and the primary sends a PING to its replicas every repl-ping-replica-period seconds (default 10). If the primary receives no REPLCONF ACK within repl-timeout, it closes the connection. If the replica sees no data or PING from the primary within repl-timeout, it logs a timeout and retries.
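To see how close a replica is to that threshold, the remaining headroom can be computed from INFO output. The helper below is a sketch under assumptions: timeout_headroom is a hypothetical name, and it treats a missing or negative master_last_io_seconds_ago as "no active link", which is how recent Redis versions are understood to report it.

```shell
# timeout_headroom REPL_TIMEOUT (hypothetical helper): read `INFO replication`
# on stdin and print how many seconds remain before master_last_io_seconds_ago
# reaches the given timeout. Prints nothing and returns 1 when the field is
# missing or negative (assumed to mean there is no active link).
timeout_headroom() {
  awk -F: -v t="$1" '
    { sub(/\r$/, "") }                       # INFO lines end in CRLF
    $1 == "master_last_io_seconds_ago" { io = $2 + 0; seen = 1 }
    END {
      if (!seen || io < 0) exit 1
      print t - io
    }'
}

# Usage (on the replica):
#   t=$(redis-cli CONFIG GET repl-timeout | tail -n 1)
#   redis-cli INFO replication | timeout_headroom "$t"
```

A headroom that keeps shrinking toward zero across samples suggests the primary is silent, not merely slow.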
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Network partition or firewall change | master_link_status:down on replica; primary reachable from other hosts | redis-cli -h <primary> PING from the replica host |
| Primary overloaded or event loop blocked | Primary slow or unresponsive; latest_fork_usec spiked | On primary: INFO stats (instantaneous_ops_per_sec, latest_fork_usec) and SLOWLOG GET |
| repl-timeout exceeded due to slow network or proxy stall | master_last_io_seconds_ago climbs toward 60; Kubernetes ClusterIP causes TCP hang | CONFIG GET repl-timeout and master_last_io_seconds_ago trend |
| Replication backlog overflow causing full resync loop | sync_full and sync_partial_err increase; replica reconnects repeatedly | INFO stats on primary: sync_full growing |
| Master client output buffer overflow on slow replica | Primary logs “Client … scheduled to be closed ASAP for overcoming of output buffer limits” | CLIENT LIST on primary for replica omem; CONFIG GET client-output-buffer-limit |
| Authentication or credential mismatch after failover | Replica logs auth errors immediately after role change | Replica masterauth and primary ACLs / requirepass consistency |
| TCP stack timeout locking the replica process | Replica logs timeout at 60s but the process hangs before retrying | OS TCP retransmission behavior vs repl-timeout |
Quick checks
Read-only commands to triage state:
# Replica link status and whether a full resync is in progress
redis-cli INFO replication | grep -E "master_link_status|master_link_down_since|master_last_io_seconds_ago|master_sync_in_progress|role"
# Whether the replica is currently loading an RDB dump
redis-cli INFO persistence | grep loading
# Sync counters on the primary for backlog overflow
redis-cli INFO stats | grep -E "sync_full|sync_partial"
# Replication offsets on both sides
# On primary:
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"
# On replica:
redis-cli INFO replication | grep -E "slave_repl_offset|master_repl_offset"
# Client output buffer limits and replica buffer usage on primary
redis-cli CONFIG GET client-output-buffer-limit
redis-cli CLIENT LIST | grep "flags=S"
# Timeout configuration
redis-cli CONFIG GET repl-timeout
redis-cli CONFIG GET repl-ping-replica-period
# Reachability to the primary from the replica host
redis-cli -h <primary_ip> -p <primary_port> PING
# Primary fork latency, which can cause replicas to time out
redis-cli INFO stats | grep latest_fork_usec
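The offset comparison above can be turned into a per-replica byte lag. The following is a minimal sketch that assumes the usual slaveN:ip=...,port=...,state=...,offset=...,lag=... line format in the primary's INFO replication output; replica_lag_bytes is a hypothetical helper name.

```shell
# replica_lag_bytes (hypothetical helper): read the primary's
# `INFO replication` on stdin and print "<ip>:<port> <bytes behind>"
# for each slaveN line.
replica_lag_bytes() {
  awk '
    { sub(/\r$/, "") }                       # INFO lines end in CRLF
    /^master_repl_offset:/ { split($0, kv, ":"); master = kv[2] }
    /^slave[0-9]+:/ {
      n++
      line[n] = substr($0, index($0, ":") + 1)   # drop the "slaveN:" prefix
    }
    END {
      # slaveN lines come before master_repl_offset, so compute at the end
      for (i = 1; i <= n; i++) {
        ip = port = off = ""
        m = split(line[i], parts, ",")
        for (j = 1; j <= m; j++) {
          split(parts[j], f, "=")
          if (f[1] == "ip")     ip = f[2]
          if (f[1] == "port")   port = f[2]
          if (f[1] == "offset") off = f[2]
        }
        print ip ":" port, master - off
      }
    }'
}

# Usage (on the primary):
#   redis-cli INFO replication | replica_lag_bytes
```

A replica that has dropped off entirely will not appear in the output at all, so compare the number of lines against the expected replica count.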
How to diagnose it
1. Confirm it is not a transient full resync. If master_sync_in_progress:1 or loading:1, wait for loading to return to 0. If neither flag is set and master_link_status is still down, the link is genuinely broken.
2. Check master_link_down_since_seconds. Values under 30 seconds are often rolling maintenance or a Sentinel failover. Values over 60 seconds indicate a persistent failure that requires intervention.
3. Verify primary liveness. From the replica host, run redis-cli -h <primary> PING. If PONG returns but the link is still down, suspect authentication, buffer limits, or a replication-specific timeout rather than total network failure.
4. Compare master_last_io_seconds_ago to repl-timeout. If the last I/O approaches or exceeds the timeout, the network path is too slow or silent. In Kubernetes with ClusterIP services, zero healthy endpoints can cause the replica to hang for the full repl-timeout before retrying. Switch to a headless service or direct pod IP, or lower repl-timeout to 15-20 seconds (above repl-ping-replica-period).
5. Inspect primary logs for output buffer overruns. If the primary logs that a client was closed for overcoming output buffer limits, the replica is not consuming the replication stream fast enough. Check the replica for disk I/O contention (if AOF or RDB is enabled) or CPU saturation.
6. Check sync_full and sync_partial_err on the primary. If sync_partial_err is incrementing, the replica is attempting partial resync but the backlog is too small, forcing expensive full resyncs. This often manifests as intermittent master_link_status:down followed by rapid reconnections.
7. Check for ACL or password mismatches. After a failover, the replica may be pointing to a new primary with different requirepass or ACL credentials. Verify masterauth on the replica matches the primary’s authentication.
8. Evaluate OS-level TCP behavior. On Linux, the OS-level TCP retransmission timeout can reach approximately 130 seconds before declaring a connection dead. If repl-timeout is set to 60 seconds, the replica may log a timeout at 60 seconds while the OS still holds the socket for a longer period, delaying the reconnection attempt.
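Whether sync_full and sync_partial_err are "incrementing" is easiest to judge from two snapshots taken some time apart. A minimal sketch, assuming two saved INFO stats outputs from the primary; sync_deltas and the /tmp file names are hypothetical.

```shell
# sync_deltas BEFORE AFTER (hypothetical helper): compare two saved
# `INFO stats` snapshots from the primary and print how much sync_full
# and sync_partial_err grew between them.
sync_deltas() {
  awk -F: '
    { sub(/\r$/, "") }                       # INFO lines end in CRLF
    $1 == "sync_full" || $1 == "sync_partial_err" {
      if (FNR == NR) before[$1] = $2         # still reading the first file
      else           after[$1]  = $2         # reading the second file
    }
    END {
      print "sync_full", after["sync_full"] - before["sync_full"]
      print "sync_partial_err", after["sync_partial_err"] - before["sync_partial_err"]
    }' "$1" "$2"
}

# Usage (on the primary):
#   redis-cli INFO stats > /tmp/stats.before
#   sleep 60
#   redis-cli INFO stats > /tmp/stats.after
#   sync_deltas /tmp/stats.before /tmp/stats.after
```

A nonzero sync_partial_err delta alongside a nonzero sync_full delta is the pattern that points at an undersized backlog.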
Fixes
Network partition or firewall change
Restore connectivity between the replica and the primary on the Redis port. Verify with redis-cli PING. Avoid restarting the replica unless necessary; a restart forces a full sync if the primary’s backlog cannot cover the gap.
Mismatched repl-timeout
If the network path is legitimately slow or lossy, raise repl-timeout cautiously. In Kubernetes with ClusterIP services, a lower repl-timeout (15-20 seconds) helps the replica detect endpoint changes faster. Apply with CONFIG SET repl-timeout <seconds>, then update redis.conf to persist the change.
Master output buffer overflow
If the primary closes the replica connection due to buffer limits, the replica is not reading fast enough. Check the replica for disk I/O contention (if AOF or RDB is enabled) or CPU saturation. You can raise the limit live. CONFIG SET client-output-buffer-limit takes groups of class, hard limit, soft limit, and soft seconds, so if you pass the full string, preserve the existing normal and pubsub limits from CONFIG GET.
redis-cli CONFIG GET client-output-buffer-limit
# Edit the returned string so only the replica triplet changes, then:
redis-cli CONFIG SET client-output-buffer-limit "normal 0 0 0 replica 512mb 128mb 60 pubsub 32mb 8mb 60"
The real fix is to speed up the replica or reduce primary write volume.
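Editing the limit string by hand is easy to get wrong, so the replica group can be rewritten mechanically. This is a sketch, not a vetted tool: with_replica_limit is a hypothetical helper, and it assumes CONFIG GET returns space-separated groups of four fields, with the replica class named either replica or slave depending on the Redis version.

```shell
# with_replica_limit HARD SOFT SECONDS (hypothetical helper): read the current
# client-output-buffer-limit string on stdin and print it with only the
# replica (or legacy "slave") group changed. Other classes pass through as-is.
with_replica_limit() {
  awk -v hard="$1" -v soft="$2" -v secs="$3" '
    {
      # Walk the string in groups of four: class hard soft seconds
      for (i = 1; i + 3 <= NF; i += 4) {
        if ($i == "replica" || $i == "slave") {
          $(i + 1) = hard; $(i + 2) = soft; $(i + 3) = secs
        }
      }
      print
    }'
}

# Usage (on the primary):
#   current=$(redis-cli CONFIG GET client-output-buffer-limit | tail -n 1)
#   new=$(printf '%s\n' "$current" | with_replica_limit 512mb 128mb 60)
#   redis-cli CONFIG SET client-output-buffer-limit "$new"
```

Printing the new string before applying it gives a chance to confirm that the normal and pubsub groups are unchanged.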
Replication backlog overflow
If sync_partial_err is incrementing, increase repl-backlog-size immediately on the primary:
redis-cli CONFIG SET repl-backlog-size 104857600
The default 1MB is insufficient for almost all production workloads. A larger backlog prevents full resyncs on brief disconnections. Also consider repl-diskless-sync yes to avoid writing RDB to disk on the primary during sync.
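One way to sanity-check a backlog size is to multiply the observed replication write rate by the longest disconnection you want to survive without a full resync. This is a rule-of-thumb sketch, not a formula from the Redis documentation: backlog_bytes is a hypothetical helper, the two offsets are master_repl_offset samples taken INTERVAL_SECONDS apart, and the outage window is your own assumption.

```shell
# backlog_bytes OFFSET_START OFFSET_END INTERVAL_SECONDS OUTAGE_SECONDS
# (hypothetical helper): estimate the backlog needed to cover an outage of
# OUTAGE_SECONDS at the write rate observed between two master_repl_offset
# samples. Output is in bytes, truncated to an integer.
backlog_bytes() {
  awk -v s="$1" -v e="$2" -v dt="$3" -v outage="$4" 'BEGIN {
    rate = (e - s) / dt                      # bytes of replication stream per second
    printf "%d\n", rate * outage
  }'
}

# Usage: two samples of master_repl_offset taken 60 seconds apart,
# sized for a 120-second disconnection:
#   backlog_bytes 1000000 61000000 60 120    # prints 120000000
```

Sample during peak write traffic, since a backlog sized from a quiet period will be too small when it matters.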
Primary event loop blocked
If latest_fork_usec is spiking or the slowlog shows blocking commands, the primary is too busy to service replication pings. Disable Transparent Huge Pages if enabled. Remove slow commands such as KEYS * and optimize Lua scripts. If the dataset is too large for single-core operation, shard it.
Authentication failure
After failover, ensure the replica uses the correct masterauth or ACL credentials for the new primary. Check REPLICAOF configuration and ACL logs on the primary.
Prevention
- Set repl-backlog-size to at least 100MB on every primary. For write-heavy workloads, use 512MB.
- Monitor sync_full and sync_partial_err as leading indicators of replication health.
- In container orchestration platforms, use headless services or direct pod IPs for replication traffic instead of ClusterIP load balancers.
- Configure min-replicas-to-write 1 and min-replicas-max-lag 10 on the primary to prevent accepting writes when replicas are disconnected or severely lagging.
- Disable Transparent Huge Pages on all Redis nodes to prevent fork latency from causing replica timeouts.
- Size client-output-buffer-limit replica appropriately for your peak replication throughput. The default 256MB hard limit may be too tight for high-write workloads.
How Netdata helps
- Correlate master_link_status:down with master_link_down_since_seconds to distinguish brief blips from sustained outages.
- Track sync_full and sync_partial_err rates on the primary to detect backlog overflow before it cascades into repeated full resyncs.
- Surface replication offset lag alongside link status to reveal when up still means stale data.
- Alert on connected_slaves drops from the primary view, giving a cross-node perspective on the same event.
- Correlate replica link failures with system-level metrics such as network drops, TCP retransmits, and disk latency to isolate whether the root cause is network, disk, or CPU.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit
- Redis event loop blocked: when one slow command freezes everything
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box
- Redis blocked_clients growing: dead consumers vs healthy queues
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover
- Redis connection exhaustion: leaks, pools, and the retry storm
- Redis aof_last_write_status:err: AOF write failures and recovery
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix







