Redis MASTERDOWN / master_link_status:down: replication link broken

You see MASTERDOWN errors from client libraries, or monitoring shows master_link_status:down on a replica. With the default replica-serve-stale-data yes, the replica keeps accepting connections and serving reads, but every response is increasingly stale. With replica-serve-stale-data no, it rejects most commands with a MASTERDOWN error instead. Either way, the primary continues to take writes, so the gap widens. Determine whether this is a transient resync or a real partition, and fix it without forcing an expensive full resync, which makes the primary fork and can stall it.

What this means

On the replica, INFO replication reports master_link_status:down whenever the TCP connection to the primary cannot be maintained. The replica retries the connection every second. What happens to queries while the link is down depends on the replica's replica-serve-stale-data setting, not on the client. With the default yes, the replica keeps answering reads from its last known dataset, so applications may serve stale data silently. With no, it answers most commands with a MASTERDOWN error until the link is restored.

master_link_down_since_seconds accumulates from the moment the link drops and resets only after successful re-establishment. master_last_io_seconds_ago grows toward the configured timeout while the link is silent.

Distinguish a broken link from a healthy full resync. During a full resync, master_sync_in_progress:1 and loading:1 appear. In that state, master_link_status may be down while the replica transfers and loads the RDB. That is self-healing and should not be treated as a permanent fault. If master_sync_in_progress is 0 and master_link_status remains down, the link is genuinely broken.
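
The distinction above can be sketched as a small shell helper. This is a minimal illustration, not part of Redis: classify_link is a hypothetical function name, and it only inspects the two fields named in the paragraph above.

```shell
# Hypothetical helper: classify replica link state from `INFO replication`
# text supplied on stdin. Only the fields discussed above are inspected.
classify_link() {
  # INFO output uses CRLF line endings, so strip \r before matching.
  tr -d '\r' | awk -F: '
    $1 == "master_link_status"      { status = $2 }
    $1 == "master_sync_in_progress" { insync = $2 }
    END {
      if (status == "")        print "not-a-replica"  # field absent on a primary
      else if (status == "up") print "healthy"
      else if (insync == "1")  print "resyncing"      # down, but a sync is running
      else                     print "broken"         # down and no sync in progress
    }'
}

# Usage (assumes redis-cli can reach the replica):
#   redis-cli INFO replication | classify_link
```

A result of resyncing means wait and re-check; broken means continue with the diagnosis below.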

The default repl-timeout is 60 seconds. This timer controls three thresholds: bulk-transfer I/O during SYNC from the replica side, master timeout from the replica side (data or pings), and replica timeout from the master side (REPLCONF ACKs). The replica sends REPLCONF ACK every second. If the primary receives no REPLCONF ACK from a replica within repl-timeout, it closes that connection. If the replica sees no data or PING from the primary within repl-timeout, it logs a timeout and retries.
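
To see how close a silent link is to the replica-side threshold, the following sketch subtracts master_last_io_seconds_ago from repl-timeout. The function name timeout_headroom is hypothetical, and treating a negative value as "no active link" is an assumption about how the field is reported when the connection is already gone.

```shell
# Hypothetical helper: seconds left before repl-timeout would fire.
# $1 = repl-timeout in seconds; `INFO replication` text on stdin.
timeout_headroom() {
  tr -d '\r' | awk -F: -v t="$1" '
    $1 == "master_last_io_seconds_ago" {
      found = 1
      if ($2 < 0) print "no-link"   # assumption: negative means no active connection
      else        print t - $2
    }
    END { if (!found) print "n/a" }  # field missing, e.g. when run against a primary
  '
}

# Usage (assumes redis-cli can reach the replica):
#   timeout=$(redis-cli CONFIG GET repl-timeout | tail -n 1)
#   redis-cli INFO replication | timeout_headroom "$timeout"
```

A small or shrinking value across repeated samples suggests the link is silent and about to time out.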

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Network partition or firewall change | master_link_status:down on replica; primary reachable from other hosts | redis-cli -h <primary> PING from the replica host |
| Primary overloaded or event loop blocked | Primary slow or unresponsive; latest_fork_usec spiked | INFO stats on primary: instantaneous_ops_per_sec, slowlog |
| repl-timeout exceeded due to slow network or proxy stall | master_last_io_seconds_ago climbs toward 60; Kubernetes ClusterIP causes TCP hang | CONFIG GET repl-timeout and master_last_io_seconds_ago trend |
| Replication backlog overflow causing full resync loop | sync_full and sync_partial_err increase; replica reconnects repeatedly | INFO stats on primary: sync_full growing |
| Master client output buffer overflow on slow replica | Primary logs “Client … scheduled to be closed ASAP for overcoming of output buffer limits” | CLIENT LIST on primary for replica omem; CONFIG GET client-output-buffer-limit |
| Authentication or credential mismatch after failover | Replica logs auth errors immediately after role change | Replica masterauth and primary ACLs / requirepass consistency |
| TCP stack timeout locking the replica process | Replica logs timeout at 60s but the process hangs before retrying | OS TCP retransmission behavior vs repl-timeout |

Quick checks

Read-only commands to triage state:

# Replica link status and whether a full resync is in progress
redis-cli INFO replication | grep -E "master_link_status|master_link_down_since|master_last_io_seconds_ago|master_sync_in_progress|role"
# Whether the replica is currently loading an RDB dump
redis-cli INFO persistence | grep loading
# Sync counters on the primary for backlog overflow
redis-cli INFO stats | grep -E "sync_full|sync_partial"
# Replication offsets on both sides
# On primary:
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"
# On replica:
redis-cli INFO replication | grep -E "slave_repl_offset|master_repl_offset"
# Client output buffer limits and replica buffer usage on primary
redis-cli CONFIG GET client-output-buffer-limit
redis-cli CLIENT LIST | grep "flags=S"
# Timeout configuration
redis-cli CONFIG GET repl-timeout
redis-cli CONFIG GET repl-ping-replica-period
# Reachability to the primary from the replica host
redis-cli -h <primary_ip> -p <primary_port> PING
# Primary fork latency, which can cause replicas to timeout
redis-cli INFO stats | grep latest_fork_usec

How to diagnose it

  1. Confirm it is not a transient full resync. If master_sync_in_progress:1 or loading:1, wait for loading to return to 0. If neither flag is set and master_link_status is still down, the link is genuinely broken.

  2. Check master_link_down_since_seconds. Values under 30 seconds are often rolling maintenance or a Sentinel failover. Values over 60 seconds indicate a persistent failure that requires intervention.

  3. Verify primary liveness. From the replica host, run redis-cli -h <primary> PING. If PONG returns but the link is still down, suspect authentication, buffer limits, or a replication-specific timeout rather than total network failure.

  4. Compare master_last_io_seconds_ago to repl-timeout. If the last I/O approaches or exceeds the timeout, the network path is too slow or silent. In Kubernetes with ClusterIP services, zero healthy endpoints can cause the replica to hang for the full repl-timeout before retrying. Switch to a headless service or direct pod IP, or lower repl-timeout to 15-20 seconds (above repl-ping-replica-period).

  5. Inspect primary logs for output buffer overruns. If the primary logs that a client was closed for overcoming output buffer limits, the replica is not consuming the replication stream fast enough. Check the replica for disk I/O contention (if AOF or RDB is enabled) or CPU saturation.

  6. Check sync_full and sync_partial_err on the primary. If sync_partial_err is incrementing, the replica is attempting partial resync but the backlog is too small, forcing expensive full resyncs. This often manifests as intermittent master_link_status:down followed by rapid reconnections.

  7. Check for ACL or password mismatches. After a failover, the replica may be pointing to a new primary with different requirepass or ACL credentials. Verify masterauth on the replica matches the primary’s authentication.

  8. Evaluate OS-level TCP behavior. On Linux, the OS-level TCP retransmission timeout can reach approximately 130 seconds before declaring a connection dead. If repl-timeout is set to 60 seconds, the replica may log a timeout at 60 seconds while the OS still holds the socket for a longer period, delaying the reconnection attempt.
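
Step 6 depends on whether sync_partial_err is growing, not on its absolute value. A minimal sketch for comparing two INFO stats snapshots follows; info_field and partial_err_delta are hypothetical helper names, and the 60-second sampling interval in the usage note is only an example.

```shell
# Hypothetical helper: print the value of one INFO field from text on stdin.
info_field() {
  tr -d '\r' | awk -F: -v k="$1" '$1 == k { print $2 }'
}

# Hypothetical helper: how much sync_partial_err grew between two snapshots.
# $1 = earlier `INFO stats` text, $2 = later `INFO stats` text.
partial_err_delta() {
  local before after
  before=$(printf '%s\n' "$1" | info_field sync_partial_err)
  after=$(printf '%s\n' "$2" | info_field sync_partial_err)
  # Missing fields count as 0 so the arithmetic never sees an empty operand.
  echo $(( ${after:-0} - ${before:-0} ))
}

# Usage (run against the primary; the interval is illustrative):
#   a=$(redis-cli INFO stats); sleep 60; b=$(redis-cli INFO stats)
#   partial_err_delta "$a" "$b"
```

A non-zero delta during a period of link flapping points at the backlog rather than the network.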

Fixes

Network partition or firewall change

Restore connectivity between the replica and the primary on the Redis port. Verify with redis-cli PING. Avoid restarting the replica unless necessary; a restart forces a full sync if the primary’s backlog cannot cover the gap.

Mismatched repl-timeout

If the network path is legitimately slow or lossy, raise repl-timeout cautiously. In Kubernetes with ClusterIP services, a lower repl-timeout (15-20 seconds) helps the replica detect endpoint changes faster. Apply with CONFIG SET repl-timeout <seconds>, then update redis.conf to persist the change.

Master output buffer overflow

If the primary closes the replica connection due to buffer limits, the replica is not reading fast enough. Check the replica for disk I/O contention (if AOF or RDB is enabled) or CPU saturation. Raise the limit live, but CONFIG SET client-output-buffer-limit requires the full class string, so preserve existing normal and pubsub limits from CONFIG GET.

redis-cli CONFIG GET client-output-buffer-limit
# Edit the returned string so only the replica triplet changes, then:
redis-cli CONFIG SET client-output-buffer-limit "normal 0 0 0 replica 512mb 128mb 60 pubsub 32mb 8mb 60"

The real fix is to speed up the replica or reduce primary write volume.
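
Editing the limit string by hand is error-prone, so one option is to rewrite only the replica triplet programmatically. The sketch below is illustrative: set_replica_limit is a hypothetical function, and it matches both the replica and slave class names on the assumption that CONFIG GET may report the class under either name.

```shell
# Hypothetical helper: replace only the replica limits in a
# client-output-buffer-limit string, leaving other classes untouched.
# $1 = current value, $2 = hard limit, $3 = soft limit, $4 = soft seconds
set_replica_limit() {
  printf '%s\n' "$1" | awk -v hard="$2" -v soft="$3" -v secs="$4" '{
    # The value is a sequence of 4-field groups: class hard soft seconds.
    for (i = 1; i + 3 <= NF; i += 4)
      if ($i == "replica" || $i == "slave") {
        $(i + 1) = hard; $(i + 2) = soft; $(i + 3) = secs
      }
    print
  }'
}

# Usage (run against the primary):
#   current=$(redis-cli CONFIG GET client-output-buffer-limit | tail -n 1)
#   redis-cli CONFIG SET client-output-buffer-limit "$(set_replica_limit "$current" 512mb 128mb 60)"
```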

Replication backlog overflow

If sync_partial_err is incrementing, increase repl-backlog-size immediately on the primary:

CONFIG SET repl-backlog-size 104857600

The default 1MB is insufficient for almost all production workloads. A larger backlog prevents full resyncs on brief disconnections. Also consider repl-diskless-sync yes to avoid writing RDB to disk on the primary during sync.
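
If you would rather derive a backlog size than pick a round number, a common rule of thumb (an assumption here, not something Redis prescribes) is to hold the bytes written during the longest disconnect you want to survive. The sketch estimates the write rate from two master_repl_offset samples; backlog_bytes is a hypothetical helper and applies no safety margin.

```shell
# Hypothetical helper: estimate repl-backlog-size in bytes.
# $1 = master_repl_offset at first sample
# $2 = master_repl_offset at second sample
# $3 = seconds between the two samples
# $4 = longest disconnect (seconds) a partial resync should cover
backlog_bytes() {
  # Integer arithmetic: bytes per second, then scaled to the target window.
  echo $(( ($2 - $1) / $3 * $4 ))
}

# Example: 6,000,000 bytes replicated in 60s, and a 120s disconnect to cover.
backlog_bytes 1000000 7000000 60 120
```

Sample during peak write load, and treat the result as a floor to round up from.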

Primary event loop blocked

If latest_fork_usec is spiking or the slowlog shows blocking commands, the primary is too busy to service replication pings. Disable Transparent Huge Pages if enabled. Remove slow commands such as KEYS * and optimize Lua scripts. If the dataset is too large for single-core operation, shard it.

Authentication failure

After failover, ensure the replica uses the correct masterauth or ACL credentials for the new primary. Check REPLICAOF configuration and ACL logs on the primary.

Prevention

  • Set repl-backlog-size to at least 100MB on every primary. For write-heavy workloads, use 512MB.
  • Monitor sync_full and sync_partial_err as leading indicators of replication health.
  • In container orchestration platforms, use headless services or direct pod IPs for replication traffic instead of ClusterIP load balancers.
  • Configure min-replicas-to-write 1 and min-replicas-max-lag 10 on the primary to prevent accepting writes when replicas are disconnected or severely lagging.
  • Disable Transparent Huge Pages on all Redis nodes to prevent fork latency from causing replica timeouts.
  • Size client-output-buffer-limit replica appropriately for your peak replication throughput. The default 256MB hard limit may be too tight for high-write workloads.
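
The primary-side settings from the list above can be applied live and then persisted. This is a sketch using the values quoted in the list; apply_primary_settings and the REDIS_CLI override are hypothetical, added only so the commands can be previewed with echo before touching a server.

```shell
# Hypothetical helper: apply the primary-side prevention settings.
# Set REDIS_CLI=echo to print the commands instead of running them.
apply_primary_settings() {
  local cli="${REDIS_CLI:-redis-cli}"
  $cli CONFIG SET repl-backlog-size 104857600   # 100MB, per the list above
  $cli CONFIG SET min-replicas-to-write 1
  $cli CONFIG SET min-replicas-max-lag 10
  # Persist to redis.conf; this fails if the server was started without a config file.
  $cli CONFIG REWRITE
}

# Dry run:
#   REDIS_CLI=echo apply_primary_settings
```

Review the dry-run output first, since min-replicas-to-write makes the primary reject writes when no replica is connected and within the lag limit.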

How Netdata helps

  • Correlate master_link_status:down with master_link_down_since_seconds to distinguish brief blips from sustained outages.
  • Track sync_full and sync_partial_err rates on the primary to detect backlog overflow before it cascades into repeated full resyncs.
  • Surface replication offset lag alongside link status to reveal when a link reported as up is still serving stale data.
  • Alert on connected_slaves drops from the primary view, giving a cross-node perspective on the same event.
  • Correlate replica link failures with system-level metrics such as network drops, TCP retransmits, and disk latency to isolate whether the root cause is network, disk, or CPU.