Redis connected_slaves dropped: detecting replica disconnects on the primary
If INFO replication on a Redis primary shows connected_slaves lower than expected, the missing replica shrinks read capacity, makes a full resync likely, and widens the data-loss window during failover. The primary does not keep a tombstone: it decrements the counter and drops the corresponding slaveN line from the next INFO sample. You need to determine whether the replica crashed, the network partitioned, the primary disconnected a slow replica, or Sentinel promoted the replica while the old primary still considers itself the master.
This guide covers the primary-side view: how connected_slaves behaves, what disappears from INFO replication, and how to correlate the drop with replica-side signals, sync counters, and fork latency to find the root cause.
What this means
A Redis primary emits connected_slaves:<N> in INFO replication, followed by one slave<N>:ip=...,port=...,state=...,offset=...,lag=... line per connected replica. Redis uses slave in the wire format for backward compatibility. When a replica disconnects, times out, or is kicked because its output buffer exceeded client-output-buffer-limit, the primary closes the TCP connection and immediately updates the counter and the per-replica detail lines.
There is no last_disconnected_slave field, no explicit disconnected state, and no timestamp on the primary. The only signals are the lower integer and the missing line. A planned maintenance window, a Sentinel failover, a container restart, and a replication-buffer overflow all look identical at first glance. Diff the INFO output against the expected topology and corroborate with other signals.
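The diff against the expected topology can be sketched as a small shell function. This is a minimal sketch, not a definitive implementation: the function name missing_replicas and the example addresses are hypothetical, and it assumes the slaveN lines follow the ip=...,port=... layout shown above.

```shell
# List expected replicas that are absent from an INFO replication sample.
# Reads INFO replication output on stdin; arguments are expected "ip:port" pairs.
# Hypothetical usage:
#   redis-cli INFO replication | missing_replicas 10.0.0.2:6379 10.0.0.3:6379
missing_replicas() {
  present=$(tr -d '\r' |
    sed -n 's/^slave[0-9][0-9]*:ip=\([^,]*\),port=\([0-9]*\),.*/\1:\2/p')
  for expected in "$@"; do
    # -x matches the whole line, -F treats the address as a literal string.
    if ! printf '%s\n' "$present" | grep -qxF -- "$expected"; then
      printf '%s\n' "$expected"
    fi
  done
}
```

The function prints one line per expected replica that has no slaveN entry, and prints nothing when the topology matches. The tr call strips the carriage returns that INFO output carries.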
flowchart TD
A[Replica TCP close or repl-timeout] --> B[connected_slaves decrements]
B --> C{Investigate cause}
C --> D[Replica crash or OOM]
C --> E[Network partition or timeout]
C --> F[Sentinel promotion race]
C --> G[Output buffer overflow]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Replica crash or OOM kill | connected_slaves drops sharply. The replica’s uptime_in_seconds resets if reachable; system logs may show OOM killer activity. | INFO server on the replica for uptime_in_seconds, kernel logs, and INFO stats for sync_full. |
| Network partition or repl-timeout | Replica stays up but master_link_status:down. master_link_down_since_seconds grows. Drop timing matches repl-timeout (default 60 seconds). | Network latency and reachability; master_last_io_seconds_ago on the replica. |
| Sentinel promotion unknown to primary | connected_slaves drops on the old primary after Sentinel elects a new master. The promoted replica reports role:master and accepts writes. | Sentinel state (SENTINEL MASTER, SENTINEL REPLICAS) and the old primary’s role to confirm demotion. |
| Replica output buffer overflow | A slow replica stops consuming the replication stream. Its output buffer (omem) grows until it hits client-output-buffer-limit replica; the primary disconnects it. | CLIENT LIST on the primary sorted by omem; client-output-buffer-limit replica; replica disk or CPU saturation. |
| Manual topology change | An operator or automation runs REPLICAOF NO ONE or SLAVEOF NO ONE on the replica. | Command stats (INFO commandstats), ACL logs, and deployment automation audit trails. |
Quick checks
Run these read-only commands before making changes.
# Confirm current replica count and which replicas are present
redis-cli INFO replication | grep -E "connected_slaves|role|master_repl_offset|slave[0-9]"
# Check the replica-side view of the link
redis-cli -h <replica_host> INFO replication | grep -E "role|master_link_status|master_link_down_since_seconds|master_last_io_seconds_ago"
# See if disconnects are triggering expensive full resyncs
redis-cli INFO stats | grep -E "sync_full|sync_partial_ok|sync_partial_err"
# Evaluate replication backlog headroom
redis-cli INFO replication | grep -E "repl_backlog_size|repl_backlog_histlen|master_repl_offset"
# Detect recent restarts on the replica
redis-cli -h <replica_host> INFO server | grep uptime_in_seconds
# Look for capacity pressure that can cause timeouts or buffer drops
redis-cli INFO clients | grep -E "connected_clients|cluster_connections"
redis-cli CONFIG GET maxclients
redis-cli INFO stats | grep rejected_connections
# Check for a fork or slow command that may have stalled the primary
redis-cli INFO stats | grep latest_fork_usec
redis-cli INFO persistence | grep rdb_bgsave_in_progress
redis-cli SLOWLOG LEN
# Find clients with large output buffers on the primary
redis-cli CLIENT LIST | awk -F'[= ]' '{for(i=1;i<=NF;i++) if($i=="omem") print $(i+1)}' | sort -rn | head -10
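The last check prints only the buffer sizes, which does not say which client owns them. A possible variant, sketched below, keeps the address and flags next to each omem value so replica links (flag S in CLIENT LIST) stand out. The function name clients_by_omem is hypothetical, and the sketch assumes the space-separated key=value layout that CLIENT LIST emits.

```shell
# Print "omem addr flags" for each client, largest output buffer first.
# Reads CLIENT LIST output on stdin.
# Hypothetical usage: redis-cli CLIENT LIST | clients_by_omem | head -10
clients_by_omem() {
  tr -d '\r' | awk '{
    addr = ""; flags = ""; omem = 0
    for (i = 1; i <= NF; i++) {
      # Each field is key=value; split on the first "=" only.
      eq = index($i, "=")
      if (eq == 0) continue
      key = substr($i, 1, eq - 1)
      val = substr($i, eq + 1)
      if (key == "addr") addr = val
      else if (key == "flags") flags = val
      else if (key == "omem") omem = val
    }
    if (addr != "") print omem, addr, flags
  }' | sort -rn
}
```

A replica approaching the output buffer limit shows up near the top with S in the flags column.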
How to diagnose it
1. Establish the expected topology. Know how many replicas should be connected, their hostnames or IP addresses, and whether any maintenance window or Sentinel failover is in progress. A drop during a planned failover is expected; an unplanned drop is the incident.
2. Compare consecutive INFO replication samples on the primary. Note the connected_slaves value and the set of slaveN lines. The missing line tells you which replica disconnected. Record its last offset and lag before it disappeared.
3. Query the replica directly. Run INFO replication on the missing replica. If it reports role:slave and master_link_status:down, the replica thinks it is still a replica but cannot reach the primary. Check master_link_down_since_seconds and master_last_io_seconds_ago to judge whether the disconnect is fresh or persistent.
4. Check for a role change. If the replica reports role:master, Sentinel or an operator promoted it. Verify with Sentinel (SENTINEL MASTER <name> and SENTINEL REPLICAS <name>). If the old primary still reports role:master and is accepting writes, you have a split-brain window that risks data loss.
5. Look for restart evidence. An unexpected drop in uptime_in_seconds on the replica points to a crash or OOM kill. Correlate with used_memory and maxmemory on the replica, and with OS-level dmesg or container events.
6. Investigate primary-side triggers. A replica may time out because the primary stalled. Forks and slow commands block the event loop; if the stall exceeds repl-timeout, replicas disconnect. Check latest_fork_usec, rdb_bgsave_in_progress, and SLOWLOG LEN on the primary. Also check CLIENT LIST for a replica with a large omem value that hit the buffer limit.
7. Verify the network path. Even if both processes are healthy, firewall rules, routing changes, or congestion can break the replication TCP stream. Compare master_last_io_seconds_ago on the replica against the time the drop was first observed.
8. Correlate sync counters. If sync_full or sync_partial_err incremented around the same time, the disconnect forced (or will force) a full resync. That is a load event on the primary and a sign that repl-backlog-size may be too small.
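The replica-side checks above (role, link status, uptime) can be folded into a first-pass triage helper. The sketch below is an illustration only: the function name classify_replica, the output labels, and the 300-second "recent restart" threshold are assumptions, not Redis features, and the result is a hint to investigate rather than a verdict.

```shell
# Classify the replica-side view from INFO output on stdin.
# Optional argument: uptime in seconds below which a restart is assumed
# (hypothetical default of 300).
# Hypothetical usage:
#   { redis-cli -h <replica_host> INFO replication
#     redis-cli -h <replica_host> INFO server; } | classify_replica
classify_replica() {
  tr -d '\r' | awk -F: -v min_uptime="${1:-300}" '
    $1 == "role"               { role = $2 }
    $1 == "master_link_status" { link = $2 }
    $1 == "uptime_in_seconds"  { uptime = $2; have_uptime = 1 }
    END {
      if (role == "master")                              print "promoted"
      else if (have_uptime && uptime + 0 < min_uptime + 0) print "restarted"
      else if (link == "down")                           print "link-down"
      else if (link == "up")                             print "connected"
      else                                               print "unknown"
    }'
}
```

The order of the checks mirrors the steps: a role change is reported first, then restart evidence, then the link state. A replica that restarted and has already reconnected is still reported as restarted, because the restart is what explains the drop.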
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| connected_slaves on primary | Tracks how many replicas the primary believes are connected | Below expected count for more than one monitoring interval |
| slaveN offset lines | Identifies which replica disappeared and how far behind it was | Missing IP/port or offset lag growing before the line vanished |
| master_link_status on replica | Confirms the replica’s view of the link | down for more than 30-60 seconds |
| master_link_down_since_seconds | Measures disconnect duration | Exceeds repl-timeout (default 60 seconds) |
| sync_full / sync_partial_err | Indicates expensive full resyncs caused by disconnects | Rate increasing after a replica drop |
| latest_fork_usec | Long forks can freeze the primary and trigger replica timeouts | Greater than 500 ms (a value of 500000, since the field is in microseconds) or spikes that correlate with link drops |
| Replica omem in CLIENT LIST | Large output buffers can cause the primary to disconnect slow replicas | Approaching client-output-buffer-limit replica (default 256MB hard) |
| rejected_connections | Connection exhaustion can prevent replicas from reconnecting | Any rate increase |
| Replication offset lag | Defines the data-loss window if failover happens while a replica is behind | Lag approaching or exceeding repl-backlog-size |
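Replication offset lag is not reported directly in bytes on the primary, but it can be derived by subtracting each slaveN offset from master_repl_offset in the same INFO sample. A minimal sketch, assuming the slaveN layout shown earlier (the function name replica_offset_lag is hypothetical):

```shell
# Print "ip:port lag_bytes" for every replica in an INFO replication sample.
# Reads INFO replication output from the primary on stdin.
# Hypothetical usage: redis-cli INFO replication | replica_offset_lag
replica_offset_lag() {
  tr -d '\r' | awk '
    {
      # Split each line on the first ":" only.
      pos = index($0, ":")
      if (pos == 0) next
      key = substr($0, 1, pos - 1)
      val = substr($0, pos + 1)
    }
    key == "master_repl_offset" { master = val }
    key ~ /^slave[0-9]+$/       { lines[++n] = val }
    END {
      for (i = 1; i <= n; i++) {
        ip = ""; port = ""; off = 0
        m = split(lines[i], kv, ",")
        for (j = 1; j <= m; j++) {
          split(kv[j], p, "=")
          if (p[1] == "ip") ip = p[2]
          else if (p[1] == "port") port = p[2]
          else if (p[1] == "offset") off = p[2]
        }
        printf "%s:%s %.0f\n", ip, port, master - off
      }
    }'
}
```

Comparing the printed byte lag with repl_backlog_size gives a rough view of how close a replica is to needing a full resync if it disconnects.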
Fixes
Replica crash or OOM
Bring the replica back online. An empty or stale dataset triggers a full resync. Ensure the replica has maxmemory configured and enough RSS headroom to avoid a second OOM. For write-heavy workloads, you can run CONFIG SET repl-diskless-sync yes on the primary to stream the RDB directly to replicas without disk I/O. This increases network load and keeps the fork alive for the transfer duration; it is not persisted across restarts. After recovery, increase repl-backlog-size if full resyncs recur.
Network partition or repl-timeout
Fix the network path before tuning repl-timeout. Raising the timeout masks real problems and increases the window during which a dead replica is still counted as connected. If transient blips are unavoidable, increase repl-backlog-size to at least 100MB so reconnections use partial resync.
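The 100MB figure is a floor, not a sizing method. One common rule of thumb, offered here as an assumption rather than something Redis prescribes, is to size the backlog to hold the bytes written to the replication stream during the longest disconnect you want to survive, multiplied by a safety factor. The write rate can be estimated from the change in master_repl_offset between two INFO samples. The function name backlog_bytes and the default factor of 2 are hypothetical.

```shell
# Estimate a repl-backlog-size in bytes.
# Arguments: write rate in bytes per second, longest tolerated disconnect
# in seconds, and an optional safety factor (hypothetical default of 2).
backlog_bytes() {
  awk -v rate="$1" -v secs="$2" -v factor="${3:-2}" \
    'BEGIN { printf "%.0f\n", rate * secs * factor }'
}

# Example: 1 MiB/s of writes, 60-second outage, default factor.
backlog_bytes 1048576 60
```

The result can be applied at runtime with CONFIG SET repl-backlog-size and the computed number of bytes. As with other CONFIG SET changes, it does not survive a restart unless it is also written to the configuration file.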
Sentinel promotion race
If Sentinel promoted the replica and the old primary still reports role:master, demote it immediately to avoid split brain. Run REPLICAOF <new_master_ip> <new_master_port> on the old primary. This is disruptive: the old primary discards any writes it accepted that the new master never received, then resyncs its dataset from the new master. Verify clients follow Sentinel’s new master. Once the old primary is a replica of the new master, connected_slaves on the new master should reflect it. Review min-replicas-to-write and min-replicas-max-lag to reduce the chance of accepting writes while replicas lag.
Output buffer overflow
Identify the slow replica from CLIENT LIST before it vanishes. Fix disk saturation during AOF fsync or heavy local reads on the replica. If you must accommodate bursts, raise client-output-buffer-limit replica, trading memory safety for replication stability. A forgotten MONITOR session or a Pub/Sub subscriber on the primary can also consume output buffers and OOM the instance. Audit CLIENT LIST for unexpected clients.
Connection exhaustion
If total connections approach maxclients, the replica may be unable to reconnect. Kill stale or idle clients with CLIENT KILL (this drops application connections) or raise maxclients if the OS file-descriptor limit allows. Fix the underlying connection leak. Reserve headroom for replicas and monitoring connections.
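To judge how close the instance is to the limit, connected_clients can be expressed as a percentage of maxclients. The sketch below assumes the maxclients value is passed in by the caller (for example, read from CONFIG GET maxclients); the function name client_usage_pct is hypothetical.

```shell
# Print connected_clients as a percentage of maxclients.
# Reads INFO clients output on stdin; the first argument is maxclients.
# Hypothetical usage: redis-cli INFO clients | client_usage_pct 10000
client_usage_pct() {
  tr -d '\r' | awk -F: -v max="$1" '
    $1 == "connected_clients" { used = $2 }
    END { if (max + 0 > 0) printf "%.1f\n", used * 100 / max }'
}
```

A value that stays high leaves little room for a disconnected replica to reconnect, which is the situation this section describes.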
Prevention
- Maintain an expected-replica alert on connected_slaves and update it after every topology change.
- Set repl-backlog-size to at least 100MB in production, and monitor sync_partial_err to confirm partial resyncs are succeeding.
- Monitor each replica’s master_link_status and uptime_in_seconds independently so you detect replica-side problems before the primary count drops.
- Disable Transparent Huge Pages and keep at least 50% memory headroom on persistent instances to avoid fork latency that can trigger replica timeouts.
- Configure min-replicas-to-write and min-replicas-max-lag if your consistency requirements tolerate the write-availability tradeoff.
- Document maintenance windows and Sentinel failover procedures so on-call engineers can distinguish expected drops from incidents.
How Netdata helps
Netdata collects connected_slaves, per-replica offset lines, sync_full, and sync_partial_err from the primary to flag when a disconnect forces a full resync. It tracks latest_fork_usec to highlight fork-induced timeouts. On the replica side, it collects master_link_status, master_link_down_since_seconds, and replication offset lag. It also alerts on rejected connections, memory pressure, and client output buffer growth that can lead to buffer-limit disconnects.
Related guides
- How Redis actually works in production: a mental model for operators: /guides/redis/how-redis-works-in-production/
- Redis aof_last_write_status:err: AOF write failures and recovery: /guides/redis/redis-aof-last-write-status-err/
- Redis appendfsync always latency: durability vs throughput trade-offs: /guides/redis/redis-appendfsync-always-latency/
- Redis blocked_clients growing: dead consumers vs healthy queues: /guides/redis/redis-blocked-clients-growing/
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover: /guides/redis/redis-busy-running-script/
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix: /guides/redis/redis-cant-save-in-background-fork/
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit: /guides/redis/redis-client-output-buffer-limit/
- Redis connected_clients climbing: connection leak detection: /guides/redis/redis-connected-clients-climbing/
- Redis connection exhaustion: leaks, pools, and the retry storm: /guides/redis/redis-connection-exhaustion/
- Redis event loop blocked: when one slow command freezes everything: /guides/redis/redis-event-loop-blocked/
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction: /guides/redis/redis-eviction-policy-tuning/
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box: /guides/redis/redis-fork-cow-storm/







