Redis MASTERDOWN / master_link_status:down: replication link broken

Client libraries report MASTERDOWN errors, or monitoring shows master_link_status:down on a replica. The replica is still accepting connections and, unless replica-serve-stale-data is set to no, still serving reads that grow staler as the primary keeps taking writes. Determine whether this is a transient resync or a real partition, and fix it without forcing an expensive full resync that makes the primary fork.

What this means

On the replica, INFO replication reports master_link_status:down whenever the replication connection to the primary is not established. The replica retries the connection every second. While the link is down, the replica keeps serving reads from its last known dataset as long as replica-serve-stale-data is yes (the default), so applications may serve stale data silently. With replica-serve-stale-data no, the replica instead rejects most commands with a MASTERDOWN error until the link is restored.

master_link_down_since_seconds accumulates from the moment the link drops and resets only after successful re-establishment. master_last_io_seconds_ago grows toward the configured timeout while the link is silent.

Distinguish a broken link from a healthy full resync. During a full resync, INFO replication shows master_sync_in_progress:1 while the RDB is transferred, and INFO persistence shows loading:1 while the replica loads it. In either state, master_link_status may be down. That is self-healing and should not be treated as a permanent fault. If master_sync_in_progress and loading are both 0 and master_link_status remains down, the link is genuinely broken.
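The distinction above can be sketched as a small shell helper. This is a minimal illustration, not part of Redis: classify_link is a hypothetical name, and it assumes the replica's INFO output (replication and persistence sections) arrives on stdin.

```shell
# classify_link (hypothetical helper): reads INFO output on stdin and prints
# one of: up, resync, broken, not-a-replica.
classify_link() {
  # INFO lines end in CRLF, so strip the carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "role"                    { role = $2 }
    $1 == "master_link_status"      { link = $2 }
    $1 == "master_sync_in_progress" { sync = $2 }
    $1 == "loading"                 { loading = $2 }
    END {
      if (role != "slave")                print "not-a-replica"
      else if (link == "up")              print "up"
      else if (sync == 1 || loading == 1) print "resync"
      else                                print "broken"
    }'
}
```

On a replica host it could be fed with `redis-cli INFO | classify_link`. Only the "broken" result calls for the diagnosis steps in this guide; "resync" means wait.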

The default repl-timeout is 60 seconds. This one timer governs three cases: bulk-transfer I/O during SYNC (seen from the replica), silence from the primary, meaning no data and no PINGs (seen from the replica), and silence from the replica, meaning no REPLCONF ACKs (seen from the primary). The primary pings its replicas every repl-ping-replica-period seconds (10 by default), and each replica sends REPLCONF ACK every second. If the primary receives no REPLCONF ACK within repl-timeout, it closes the connection. If the replica receives no data or PING from the primary within repl-timeout, it logs a timeout and reconnects.
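To make the timer relationship concrete, here is a rough sketch that compares the replica's last I/O age with the configured timeout. timeout_headroom and its output labels are hypothetical; the three arguments are assumed to be copied by hand from master_last_io_seconds_ago, repl-timeout, and repl-ping-replica-period.

```shell
# timeout_headroom LAST_IO REPL_TIMEOUT PING_PERIOD (hypothetical helper)
# Prints one of:
#   misconfigured  repl-timeout is not larger than the ping period
#   no-link        last I/O is negative (Redis reports -1 with no active link)
#   expired        silence already meets or exceeds repl-timeout
#   ok <seconds>   seconds of silence left before the timeout fires
timeout_headroom() {
  local last_io="$1" timeout="$2" ping_period="$3"
  if [ "$timeout" -le "$ping_period" ]; then
    echo "misconfigured"
  elif [ "$last_io" -lt 0 ]; then
    echo "no-link"
  elif [ "$last_io" -ge "$timeout" ]; then
    echo "expired"
  else
    echo "ok $((timeout - last_io))"
  fi
}
```

The "misconfigured" branch reflects the rule that repl-timeout must stay above repl-ping-replica-period, otherwise a quiet link times out between pings.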

Common causes

Cause	What it looks like	First thing to check
Network partition or firewall change	`master_link_status:down` on replica; primary reachable from other hosts	`redis-cli -h <primary> PING` from the replica host
Primary overloaded or event loop blocked	Primary slow or unresponsive; `latest_fork_usec` spiked	`INFO stats` on primary: `instantaneous_ops_per_sec`, slowlog
`repl-timeout` exceeded due to slow network or proxy stall	`master_last_io_seconds_ago` climbs toward 60; Kubernetes ClusterIP causes TCP hang	`CONFIG GET repl-timeout` and `master_last_io_seconds_ago` trend
Replication backlog overflow causing full resync loop	`sync_full` and `sync_partial_err` increase; replica reconnects repeatedly	`INFO stats` on primary: `sync_full` growing
Master client output buffer overflow on slow replica	Primary logs “Client … scheduled to be closed ASAP for overcoming of output buffer limits”	`CLIENT LIST` on primary for replica `omem`; `CONFIG GET client-output-buffer-limit`
Authentication or credential mismatch after failover	Replica logs auth errors immediately after role change	Replica `masterauth` and primary ACLs / `requirepass` consistency
TCP stack timeout locking the replica process	Replica logs timeout at 60s but the process hangs before retrying	OS TCP retransmission behavior vs `repl-timeout`

Quick checks

Read-only commands to triage state:

# Replica link status and whether a full resync is in progress
redis-cli INFO replication | grep -E "master_link_status|master_link_down_since|master_last_io_seconds_ago|master_sync_in_progress|role"

# Whether the replica is currently loading an RDB dump
redis-cli INFO persistence | grep loading

# Sync counters on the primary for backlog overflow
redis-cli INFO stats | grep -E "sync_full|sync_partial"

# Replication offsets on both sides
# On primary:
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"
# On replica:
redis-cli INFO replication | grep -E "slave_repl_offset|master_repl_offset"

# Client output buffer limits and replica buffer usage on primary
redis-cli CONFIG GET client-output-buffer-limit
redis-cli CLIENT LIST | grep "flags=S"

# Timeout configuration
redis-cli CONFIG GET repl-timeout
redis-cli CONFIG GET repl-ping-replica-period

# Reachability to the primary from the replica host
redis-cli -h <primary_ip> -p <primary_port> PING

# Primary fork latency, which can cause replicas to timeout
redis-cli INFO stats | grep latest_fork_usec
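The offset comparison in the checks above can be reduced to a per-replica byte lag. The following is a sketch under assumptions: replica_lag_bytes is a hypothetical helper, and it expects the primary's INFO replication output on stdin, with slaveN lines in the usual ip=...,port=...,offset=... form.

```shell
# replica_lag_bytes (hypothetical helper): reads the primary's INFO replication
# output on stdin and prints "<ip>:<port> <bytes behind>" for each replica.
replica_lag_bytes() {
  tr -d '\r' | awk '
    /^master_repl_offset:/ { master = substr($0, index($0, ":") + 1) }
    /^slave[0-9]+:/ {
      # Take everything after the first colon so addresses containing ":" survive.
      rest = substr($0, index($0, ":") + 1)
      n = split(rest, kv, ",")
      ip = port = roff = ""
      for (i = 1; i <= n; i++) {
        eq  = index(kv[i], "=")
        key = substr(kv[i], 1, eq - 1)
        val = substr(kv[i], eq + 1)
        if (key == "ip") ip = val
        else if (key == "port") port = val
        else if (key == "offset") roff = val
      }
      count++
      name[count] = ip ":" port
      offs[count] = roff
    }
    # master_repl_offset is printed after the slaveN lines, so subtract at the end.
    END {
      for (i = 1; i <= count; i++)
        printf "%s %.0f\n", name[i], master - offs[i]
    }'
}
```

Run it on the primary as `redis-cli INFO replication | replica_lag_bytes`. A replica that is missing from the output is not connected at all, which is a different problem from a large lag.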

How to diagnose it

Confirm it is not a transient full resync. If master_sync_in_progress:1 or loading:1, wait for loading to return to 0. If neither flag is set and master_link_status is still down, the link is genuinely broken.
Check master_link_down_since_seconds. Values under 30 seconds are often rolling maintenance or a Sentinel failover. Values over 60 seconds indicate a persistent failure that requires intervention. For values in between, check again before acting.
Verify primary liveness. From the replica host, run redis-cli -h <primary> PING. If PONG returns but the link is still down, suspect authentication, buffer limits, or a replication-specific timeout rather than total network failure.
Compare master_last_io_seconds_ago to repl-timeout. If the last I/O approaches or exceeds the timeout, the network path is too slow or silent. In Kubernetes with ClusterIP services, zero healthy endpoints can cause the replica to hang for the full repl-timeout before retrying. Switch to a headless service or direct pod IP, or lower repl-timeout to 15-20 seconds (above repl-ping-replica-period).
Inspect primary logs for output buffer overruns. If the primary logs that a client was closed for overcoming output buffer limits, the replica is not consuming the replication stream fast enough. Check the replica for disk I/O contention (if AOF or RDB is enabled) or CPU saturation.
Check sync_full and sync_partial_err on the primary. If sync_partial_err is incrementing, the replica is attempting partial resync but the backlog is too small, forcing expensive full resyncs. This often manifests as intermittent master_link_status:down followed by rapid reconnections.
Check for ACL or password mismatches. After a failover, the replica may be pointing to a new primary with different requirepass or ACL credentials. Verify masterauth on the replica matches the primary’s authentication.
Evaluate OS-level TCP behavior. On Linux, kernel TCP timers are independent of repl-timeout: with default settings, an unanswered connection attempt takes roughly 130 seconds to fail (the tcp_syn_retries limit), and retransmissions on an established connection can continue for much longer. If repl-timeout is set to 60 seconds, the replica may log a timeout at 60 seconds while the OS still holds the socket for a longer period, delaying the reconnection attempt.
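The downtime thresholds from the second step can be encoded as a quick triage helper. A minimal sketch: downtime_severity and its labels are hypothetical, and the middle "watch" band for 30 to 60 seconds is an assumption added here, since the guide only names the two ends.

```shell
# downtime_severity (hypothetical helper): reads the replica's INFO replication
# output on stdin and grades master_link_down_since_seconds.
downtime_severity() {
  local secs
  secs=$(tr -d '\r' | awk -F: '$1 == "master_link_down_since_seconds" { print $2 }')
  if [ -z "$secs" ] || [ "$secs" -lt 0 ]; then
    echo "link-not-down"       # field absent or negative: nothing to grade
  elif [ "$secs" -lt 30 ]; then
    echo "likely-transient"    # often rolling maintenance or a failover
  elif [ "$secs" -le 60 ]; then
    echo "watch"               # assumed middle band: re-check before acting
  else
    echo "persistent"          # needs intervention
  fi
}
```

For example, `redis-cli INFO replication | downtime_severity` on the replica. The grade is only meaningful once a full resync has been ruled out in the first step.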

Fixes

Network partition or firewall change

Restore connectivity between the replica and the primary on the Redis port. Verify with redis-cli PING. Avoid restarting the replica unless necessary; a restart forces a full sync if the primary’s backlog cannot cover the gap.

Mismatched repl-timeout

If the network path is legitimately slow or lossy, raise repl-timeout cautiously. In Kubernetes with ClusterIP services, a lower repl-timeout (15-20 seconds) helps the replica detect endpoint changes faster. Apply with CONFIG SET repl-timeout <seconds>, then update redis.conf to persist the change.

Master output buffer overflow

If the primary closes the replica connection due to buffer limits, the replica is not reading fast enough. Check the replica for disk I/O contention (if AOF or RDB is enabled) or CPU saturation. Raise the limit live, but CONFIG SET client-output-buffer-limit requires the full class string, so preserve existing normal and pubsub limits from CONFIG GET.

redis-cli CONFIG GET client-output-buffer-limit
# Edit the returned string so only the replica triplet changes, then:
redis-cli CONFIG SET client-output-buffer-limit "normal 0 0 0 replica 512mb 128mb 60 pubsub 32mb 8mb 60"

The real fix is to speed up the replica or reduce primary write volume.
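Editing the limit string by hand is error-prone, so the replica-only change can be sketched as a filter. set_replica_triplet is a hypothetical helper; it assumes the value line returned by CONFIG GET arrives on stdin, and it matches both the replica and the older slave class name because the name Redis prints back varies by version.

```shell
# set_replica_triplet HARD SOFT SECONDS (hypothetical helper)
# Reads a client-output-buffer-limit value on stdin and rewrites only the
# replica (or slave) triplet, leaving the normal and pubsub classes untouched.
set_replica_triplet() {
  local hard="$1" soft="$2" secs="$3"
  tr -d '\r' | awk -v hard="$hard" -v soft="$soft" -v secs="$secs" '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "replica" || $i == "slave") {
          $(i + 1) = hard
          $(i + 2) = soft
          $(i + 3) = secs
        }
      }
      print
    }'
}
```

One way to use it is `redis-cli CONFIG GET client-output-buffer-limit | tail -n 1 | set_replica_triplet 512mb 128mb 60`, then review the printed string before passing it to CONFIG SET.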

Replication backlog overflow

If sync_partial_err is incrementing, increase repl-backlog-size immediately on the primary:

redis-cli CONFIG SET repl-backlog-size 104857600

The default 1MB is insufficient for almost all production workloads. A larger backlog prevents full resyncs on brief disconnections. Also consider repl-diskless-sync yes to avoid writing RDB to disk on the primary during sync.
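Because sync_full and sync_partial_err are cumulative counters, what matters is whether they are still moving. The sketch below diffs two snapshots of the primary's INFO stats; stat_value and sync_delta are hypothetical helper names, and both snapshots are passed in as strings.

```shell
# stat_value INFO_TEXT KEY (hypothetical helper): print the value of one INFO field.
stat_value() {
  printf '%s\n' "$1" | tr -d '\r' | awk -F: -v k="$2" '$1 == k { print $2 }'
}

# sync_delta BEFORE AFTER (hypothetical helper): print how much each sync
# counter grew between two INFO stats snapshots.
sync_delta() {
  local before="$1" after="$2" key b a
  for key in sync_full sync_partial_ok sync_partial_err; do
    b=$(stat_value "$before" "$key")
    a=$(stat_value "$after" "$key")
    # A missing counter is treated as 0 so the subtraction never fails.
    echo "$key $(( ${a:-0} - ${b:-0} ))"
  done
}
```

A typical use would be `before=$(redis-cli INFO stats); sleep 60; after=$(redis-cli INFO stats); sync_delta "$before" "$after"`. A non-zero sync_partial_err delta after the backlog was enlarged suggests the new size is still too small for the disconnect windows being seen.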

Primary event loop blocked

If latest_fork_usec is spiking or the slowlog shows blocking commands, the primary is too busy to service replication pings. Disable Transparent Huge Pages if enabled. Remove slow commands such as KEYS * and optimize Lua scripts. If the dataset is too large for single-core operation, shard it.

Authentication failure

After failover, ensure the replica uses the correct masterauth or ACL credentials for the new primary. Check REPLICAOF configuration and ACL logs on the primary.

Prevention

Set repl-backlog-size to at least 100MB on every primary. For write-heavy workloads, use 512MB.
Monitor sync_full and sync_partial_err as leading indicators of replication health.
In container orchestration platforms, use headless services or direct pod IPs for replication traffic instead of ClusterIP load balancers.
Configure min-replicas-to-write 1 and min-replicas-max-lag 10 on the primary to prevent accepting writes when replicas are disconnected or severely lagging.
Disable Transparent Huge Pages on all Redis nodes to prevent fork latency from causing replica timeouts.
Size client-output-buffer-limit replica appropriately for your peak replication throughput. The default 256MB hard limit may be too tight for high-write workloads.
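The backlog sizes above are fixed recommendations. As a cross-check, a common rule of thumb (an assumption here, not something this guide states) is write rate multiplied by the longest disconnect you want to survive without a full resync. backlog_bytes_needed is a hypothetical helper that derives the write rate from two master_repl_offset readings.

```shell
# backlog_bytes_needed OFFSET_START OFFSET_END SAMPLE_SECS OUTAGE_SECS
# (hypothetical helper) Estimates the backlog needed to cover an outage:
#   write rate = (OFFSET_END - OFFSET_START) / SAMPLE_SECS   bytes per second
#   backlog    = write rate * OUTAGE_SECS                    bytes
backlog_bytes_needed() {
  local start="$1" end="$2" sample="$3" outage="$4"
  if [ "$sample" -le 0 ]; then
    echo "sample interval must be positive" >&2
    return 1
  fi
  # Multiply before dividing to limit integer rounding loss.
  echo $(( (end - start) * outage / sample ))
}
```

Read master_repl_offset on the primary twice, SAMPLE_SECS apart and during peak write load, and pass both readings in. If the estimate exceeds the configured repl-backlog-size, the larger value is the safer choice.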

How Netdata helps

Correlate master_link_status:down with master_link_down_since_seconds to distinguish brief blips from sustained outages.
Track sync_full and sync_partial_err rates on the primary to detect backlog overflow before it cascades into repeated full resyncs.
Surface replication offset lag alongside link status to reveal when master_link_status:up still means stale data.
Alert on connected_slaves drops from the primary view, giving a cross-node perspective on the same event.
Correlate replica link failures with system-level metrics such as network drops, TCP retransmits, and disk latency to isolate whether the root cause is network, disk, or CPU.

