Redis connected_slaves dropped: detecting replica disconnects on the primary

If INFO replication on a Redis primary shows connected_slaves lower than expected, the missing replica shrinks read capacity, makes a full resync likely, and widens the data-loss window during failover. The primary does not keep a tombstone: it decrements the counter and drops the corresponding slaveN line from the next INFO sample. You need to determine whether the replica crashed, the network partitioned, or Sentinel promoted the replica and the old primary has not caught up.

This guide covers the primary-side view: how connected_slaves behaves, what disappears from INFO replication, and how to correlate the drop with replica-side signals, sync counters, and fork latency to find the root cause.

What this means

A Redis primary emits connected_slaves:<N> in INFO replication, followed by one slave<N>:ip=...,port=...,state=...,offset=...,lag=... line per connected replica. Redis uses slave in the wire format for backward compatibility. When a replica disconnects, times out, or is kicked because its output buffer exceeded client-output-buffer-limit, the primary closes the TCP connection and immediately updates the counter and the per-replica detail lines.

There is no last_disconnected_slave field, no explicit disconnected state, and no timestamp on the primary. The only signals are the lower integer and the missing line. A planned maintenance window, a Sentinel failover, a container restart, and a replication-buffer overflow all look identical at first glance. Diff the INFO output against the expected topology and corroborate with other signals.
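Because the primary only shows which replicas are present, the diff against the expected topology has to come from your own inventory. The sketch below is one way to do that: it reads INFO replication from stdin and prints every expected replica that has no slaveN line. The function name `missing_replicas` and the space-separated `ip:port` inventory format are assumptions made for illustration, not part of Redis.

```shell
# missing_replicas EXPECTED
#   EXPECTED: space-separated "ip:port" pairs that should be attached
#             (hypothetical inventory format).
#   stdin:    raw INFO replication output from the primary.
# Prints each expected replica that has no slaveN line in the sample.
missing_replicas() {
  # INFO output uses CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -v expected="$1" '
    /^slave[0-9]+:/ {
      ip = ""; port = ""
      sub(/^slave[0-9]+:/, "")            # drop the "slaveN:" prefix
      n = split($0, kv, ",")
      for (i = 1; i <= n; i++) {
        split(kv[i], pair, "=")
        if (pair[1] == "ip")   ip = pair[2]
        if (pair[1] == "port") port = pair[2]
      }
      seen[ip ":" port] = 1
    }
    END {
      n = split(expected, want, " ")
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) print want[i]
    }'
}

# Example use (hosts are placeholders):
#   redis-cli INFO replication | missing_replicas "10.0.0.1:6379 10.0.0.2:6379"
```

The comparison uses the address the primary reports, so replicas behind NAT or with replica-announce-ip set need to be listed under that address.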

flowchart TD
    A[Replica TCP close or repl-timeout] --> B[connected_slaves decrements]
    B --> C{Investigate cause}
    C --> D[Replica crash or OOM]
    C --> E[Network partition or timeout]
    C --> F[Sentinel promotion race]
    C --> G[Output buffer overflow]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Replica crash or OOM kill | connected_slaves drops sharply. The replica’s uptime_in_seconds resets if reachable; system logs may show OOM killer activity. | INFO server on the replica for uptime_in_seconds, kernel logs, and INFO stats for sync_full. |
| Network partition or repl-timeout | Replica stays up but master_link_status:down. master_link_down_since_seconds grows. Drop timing matches repl-timeout (default 60 seconds). | Network latency and reachability; master_last_io_seconds_ago on the replica. |
| Sentinel promotion unknown to primary | connected_slaves drops on the old primary after Sentinel elects a new master. The promoted replica reports role:master and accepts writes. | Sentinel state (SENTINEL MASTER, SENTINEL REPLICAS) and the old primary’s role to confirm demotion. |
| Replica output buffer overflow | A slow replica stops consuming the replication stream. Its output buffer (omem) grows until it hits client-output-buffer-limit replica; the primary disconnects it. | CLIENT LIST on the primary sorted by omem; client-output-buffer-limit replica; replica disk or CPU saturation. |
| Manual topology change | An operator or automation runs REPLICAOF NO ONE or SLAVEOF NO ONE on the replica. | Command stats (INFO commandstats), ACL logs, and deployment automation audit trails. |

Quick checks

Run these read-only commands before making changes.

# Confirm current replica count and which replicas are present
redis-cli INFO replication | grep -E "connected_slaves|role|master_repl_offset|slave[0-9]"
# Check the replica-side view of the link
redis-cli -h <replica_host> INFO replication | grep -E "role|master_link_status|master_link_down_since_seconds|master_last_io_seconds_ago"
# See if disconnects are triggering expensive full resyncs
redis-cli INFO stats | grep -E "sync_full|sync_partial_ok|sync_partial_err"
# Evaluate replication backlog headroom
redis-cli INFO replication | grep -E "repl_backlog_size|repl_backlog_histlen|master_repl_offset"
# Detect recent restarts on the replica
redis-cli -h <replica_host> INFO server | grep uptime_in_seconds
# Look for capacity pressure that can cause timeouts or buffer drops
redis-cli INFO clients | grep -E "connected_clients|cluster_connections"
redis-cli CONFIG GET maxclients
redis-cli INFO stats | grep rejected_connections
# Check for a fork or slow command that may have stalled the primary
redis-cli INFO stats | grep latest_fork_usec
redis-cli INFO persistence | grep rdb_bgsave_in_progress
redis-cli SLOWLOG LEN
# Find clients with large output buffers on the primary
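# Prints "omem addr" per client so the largest buffers can be tied to a connection
redis-cli CLIENT LIST | awk '{a=""; o=0; for(i=1;i<=NF;i++){ if($i ~ /^addr=/) a=substr($i,6); if($i ~ /^omem=/) o=substr($i,6) } print o, a}' | sort -rn | head -10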

How to diagnose it

  1. Establish the expected topology. Know how many replicas should be connected, their hostnames or IP addresses, and whether any maintenance window or Sentinel failover is in progress. A drop during a planned failover is expected; an unplanned drop is the incident.

  2. Compare consecutive INFO replication samples on the primary. Note the connected_slaves value and the set of slaveN lines. The missing line tells you which replica disconnected. Record its last offset and lag before it disappeared.

  3. Query the replica directly. Run INFO replication on the missing replica. If it reports role:slave and master_link_status:down, the replica thinks it is still a replica but cannot reach the primary. Check master_link_down_since_seconds and master_last_io_seconds_ago to judge whether the disconnect is fresh or persistent.

  4. Check for a role change. If the replica reports role:master, Sentinel or an operator promoted it. Verify with Sentinel (SENTINEL MASTER <name> and SENTINEL REPLICAS <name>). If the old primary still reports role:master and is accepting writes, you have a split-brain window that risks data loss.

  5. Look for restart evidence. An unexpected drop in uptime_in_seconds on the replica points to a crash or OOM kill. Correlate with used_memory and maxmemory on the replica, and with OS-level dmesg or container events.

  6. Investigate primary-side triggers. A replica may time out because the primary stalled. Forks and slow commands block the event loop; if the stall exceeds repl-timeout, replicas disconnect. Check latest_fork_usec, rdb_bgsave_in_progress, and SLOWLOG LEN on the primary. Also check CLIENT LIST for a replica with a large omem value that hit the buffer limit.

  7. Verify the network path. Even if both processes are healthy, firewall rules, routing changes, or congestion can break the replication TCP stream. Compare master_last_io_seconds_ago on the replica against the time the drop was first observed.

  8. Correlate sync counters. If sync_full or sync_partial_err incremented around the same time, the disconnect forced (or will force) a full resync. That is a load event on the primary and a sign that repl-backlog-size may be too small.
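Steps 3 to 5 amount to a small decision tree over the replica's own INFO output. A minimal sketch of that tree follows. The function name `classify_replica`, the 300-second restart threshold, and the precedence (role change first, then restart, then link state) are assumptions chosen for illustration; adjust them to your environment.

```shell
# classify_replica [MIN_UPTIME]
#   stdin:      INFO replication followed by INFO server from the suspect replica.
#   MIN_UPTIME: uptime in seconds below which the replica counts as recently
#               restarted (assumed default: 300).
# Prints one of: promoted | restarted <uptime> | link-down <secs> | link-up | unknown
classify_replica() {
  tr -d '\r' | awk -F: -v min_uptime="${1:-300}" '
    $1 == "role"                           { role = $2 }
    $1 == "master_link_status"             { link = $2 }
    $1 == "master_link_down_since_seconds" { down = $2 }
    $1 == "uptime_in_seconds"              { uptime = $2 }
    END {
      if (role == "master")
        print "promoted"                   # step 4: verify with Sentinel
      else if (uptime != "" && uptime + 0 < min_uptime + 0)
        print "restarted " uptime          # step 5: look for crash or OOM kill
      else if (link == "down")
        print "link-down " down            # step 3: network path or primary stall
      else if (link == "up")
        print "link-up"
      else
        print "unknown"
    }'
}

# Example use (<replica_host> is a placeholder):
#   { redis-cli -h <replica_host> INFO replication
#     redis-cli -h <replica_host> INFO server; } | classify_replica 300
```

The output is only a starting hypothesis. A link-down result still needs steps 6 and 7 to separate a stalled primary from a broken network path.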

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| connected_slaves on primary | Tracks how many replicas the primary believes are connected | Below expected count for more than one monitoring interval |
| slaveN offset lines | Identifies which replica disappeared and how far behind it was | Missing IP/port or offset lag growing before the line vanished |
| master_link_status on replica | Confirms the replica’s view of the link | down for more than 30-60 seconds |
| master_link_down_since_seconds | Measures disconnect duration | Exceeds repl-timeout (default 60 seconds) |
| sync_full / sync_partial_err | Indicates expensive full resyncs caused by disconnects | Rate increasing after a replica drop |
| latest_fork_usec | Long forks can freeze the primary and trigger replica timeouts | Greater than 500ms or spikes that correlate with link drops |
| Replica omem in CLIENT LIST | Large output buffers can cause the primary to disconnect slow replicas | Approaching client-output-buffer-limit replica (default 256MB hard) |
| rejected_connections | Connection exhaustion can prevent replicas from reconnecting | Any rate increase |
| Replication offset lag | Defines the data-loss window if failover happens while a replica is behind | Lag approaching or exceeding repl-backlog-size |
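Replication offset lag is not reported directly; it is the primary's master_repl_offset minus each replica's offset. The sketch below computes it per replica and expresses it as a percentage of repl_backlog_size, which is the comparison the last warning sign describes. The function name `replica_lag_report` and the output layout are assumptions for illustration.

```shell
# replica_lag_report
#   stdin: INFO replication output from the primary.
# Prints "ip:port lag_bytes pct_of_backlog" per attached replica
# (pct is -1 when repl_backlog_size is not present in the sample).
replica_lag_report() {
  tr -d '\r' | awk '
    /^master_repl_offset:/ { split($0, a, ":"); master = a[2] }
    /^repl_backlog_size:/  { split($0, a, ":"); backlog = a[2] }
    /^slave[0-9]+:/ {
      line = $0
      sub(/^slave[0-9]+:/, "", line)
      n = split(line, kv, ",")
      ip = port = off = ""
      for (i = 1; i <= n; i++) {
        split(kv[i], p, "=")
        if (p[1] == "ip")     ip = p[2]
        if (p[1] == "port")   port = p[2]
        if (p[1] == "offset") off = p[2]
      }
      id[++count] = ip ":" port
      offset[count] = off
    }
    END {
      # slaveN lines appear before master_repl_offset, so compute at the end.
      for (i = 1; i <= count; i++) {
        lag = master - offset[i]
        pct = (backlog + 0 > 0) ? int(100 * lag / backlog) : -1
        printf "%s %.0f %.0f\n", id[i], lag, pct
      }
    }'
}

# Example use:
#   redis-cli INFO replication | replica_lag_report
```

A replica whose percentage approaches 100 is close to the point where a reconnect can no longer be served from the backlog.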

Fixes

Replica crash or OOM

Bring the replica back online. An empty or stale dataset triggers a full resync. Ensure the replica has maxmemory configured and enough RSS headroom to avoid a second OOM. For write-heavy workloads, you can run CONFIG SET repl-diskless-sync yes on the primary to stream the RDB directly to replicas without disk I/O. This increases network load and keeps the fork alive for the transfer duration. A value set with CONFIG SET is lost on restart unless you also write it to the configuration file, for example with CONFIG REWRITE. After recovery, increase repl-backlog-size if full resyncs recur.

Network partition or repl-timeout

Fix the network path before tuning repl-timeout. Raising the timeout masks real problems and increases the window during which a dead replica is still counted as connected. If transient blips are unavoidable, increase repl-backlog-size to at least 100MB so reconnections use partial resync.
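To judge whether a given repl-backlog-size covers the blips you expect, a common rule of thumb (an assumption here, not something the primary reports) is to multiply the replication write rate by the longest outage you want to survive. The sketch below derives the rate from two master_repl_offset samples; the function name `backlog_estimate` and its arguments are hypothetical.

```shell
# backlog_estimate OFFSET_T0 OFFSET_T1 INTERVAL_SECS OUTAGE_SECS
#   OFFSET_T0, OFFSET_T1: master_repl_offset sampled INTERVAL_SECS apart.
#   OUTAGE_SECS:          longest disconnect a partial resync should cover.
# Prints the bytes of replication stream produced during such an outage.
backlog_estimate() {
  awk -v o0="$1" -v o1="$2" -v dt="$3" -v outage="$4" 'BEGIN {
    if (dt + 0 <= 0) {
      print "interval must be positive" > "/dev/stderr"
      exit 1
    }
    rate = (o1 - o0) / dt              # bytes of replication traffic per second
    printf "%.0f\n", rate * outage
  }'
}

# Example: the offset grew from 1000000 to 7000000 in 60 s,
# and the backlog should cover a 120 s disconnect:
#   backlog_estimate 1000000 7000000 60 120
```

Sample the offset during peak write traffic, and treat the result as a lower bound alongside the 100MB floor rather than a replacement for it.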

Sentinel promotion race

If Sentinel promoted the replica and the old primary still reports role:master, demote it immediately to avoid split brain. Run REPLICAOF <new_master_ip> <new_master_port> on the old primary. This is disruptive: the old primary discards its own dataset, including any writes that never reached the new master, and resyncs from the new master. Verify clients follow Sentinel’s new master. Once the old primary is a replica of the new master, connected_slaves on the new master should reflect it. Review min-replicas-to-write and min-replicas-max-lag to reduce the chance of accepting writes while replicas lag.

Output buffer overflow

Identify the slow replica from CLIENT LIST before it vanishes. Fix disk saturation during AOF fsync or heavy local reads on the replica. If you must accommodate bursts, raise client-output-buffer-limit replica, trading memory safety for replication stability. A forgotten MONITOR session or a Pub/Sub subscriber on the primary can also consume output buffers and OOM the instance. Audit CLIENT LIST for unexpected clients.
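To watch replica buffers specifically, CLIENT LIST can be filtered to connections whose flags field contains S, which marks a replica connection. The sketch below prints each replica's omem and how much of the hard limit it has used. The function name `replica_omem` is hypothetical, and the default limit of 268435456 bytes assumes the stock 256MB hard limit; pass your configured value if it differs.

```shell
# replica_omem [HARD_LIMIT_BYTES]
#   stdin: CLIENT LIST output from the primary.
#   HARD_LIMIT_BYTES: hard limit of client-output-buffer-limit replica
#                     (assumed default: 268435456, i.e. 256MB).
# Prints "addr omem_bytes pct_of_limit" for every replica connection.
replica_omem() {
  tr -d '\r' | awk -v limit="${1:-268435456}" '
    {
      addr = flags = omem = ""
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq == 0) continue
        k = substr($i, 1, eq - 1)
        v = substr($i, eq + 1)
        if (k == "addr")       addr = v
        else if (k == "flags") flags = v
        else if (k == "omem")  omem = v
      }
      # flags containing "S" mark a replica connection to this instance.
      if (index(flags, "S") > 0 && limit + 0 > 0)
        printf "%s %.0f %d\n", addr, omem, int(100 * omem / limit)
    }'
}

# Example use:
#   redis-cli CLIENT LIST | replica_omem
```

Running this on a schedule gives you the replica's address and buffer growth before the primary disconnects it and the entry disappears.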

Connection exhaustion

If total connections approach maxclients, the replica may be unable to reconnect. Kill stale or idle clients with CLIENT KILL (this drops application connections) or raise maxclients if the OS file-descriptor limit allows. Fix the underlying connection leak. Reserve headroom for replicas and monitoring connections.
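A quick way to see how much room is left is to compare connected_clients with maxclients. The sketch below does that arithmetic; the function name `conn_headroom` and the output layout are assumptions for illustration.

```shell
# conn_headroom MAXCLIENTS
#   MAXCLIENTS: configured maxclients value.
#   stdin:      INFO clients output.
# Prints "used free pct_used", or "unknown" if the input is incomplete.
conn_headroom() {
  tr -d '\r' | awk -F: -v max="$1" '
    $1 == "connected_clients" { used = $2 }
    END {
      if (used == "" || max + 0 <= 0) { print "unknown"; exit }
      printf "%d %d %d\n", used, max - used, int(100 * used / max)
    }'
}

# Example use:
#   redis-cli INFO clients | conn_headroom "$(redis-cli CONFIG GET maxclients | tail -n 1)"
```

When the free count is smaller than the number of replicas plus monitoring connections, a replica that drops may not be able to reconnect.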

Prevention

  • Maintain an expected-replica alert on connected_slaves and update it after every topology change.
  • Set repl-backlog-size to at least 100MB in production, and monitor sync_partial_err to confirm partial resyncs are succeeding.
  • Monitor each replica’s master_link_status and uptime_in_seconds independently so you detect replica-side problems before the primary count drops.
  • Disable Transparent Huge Pages and keep at least 50% memory headroom on persistent instances to avoid fork latency that can trigger replica timeouts.
  • Configure min-replicas-to-write and min-replicas-max-lag if your consistency requirements tolerate the write-availability tradeoff.
  • Document maintenance windows and Sentinel failover procedures so on-call engineers can distinguish expected drops from incidents.

How Netdata helps

Netdata collects connected_slaves, per-replica offset lines, sync_full, and sync_partial_err from the primary to flag when a disconnect forces a full resync. It tracks latest_fork_usec to highlight fork-induced timeouts. On the replica side, it collects master_link_status, master_link_down_since_seconds, and replication offset lag. It also alerts on rejected connections, memory pressure, and client output buffer growth that can lead to buffer-limit disconnects.

  • How Redis actually works in production: a mental model for operators: /guides/redis/how-redis-works-in-production/
  • Redis aof_last_write_status:err: AOF write failures and recovery: /guides/redis/redis-aof-last-write-status-err/
  • Redis appendfsync always latency: durability vs throughput trade-offs: /guides/redis/redis-appendfsync-always-latency/
  • Redis blocked_clients growing: dead consumers vs healthy queues: /guides/redis/redis-blocked-clients-growing/
  • Redis BUSY Redis is busy running a script: blocking Lua and how to recover: /guides/redis/redis-busy-running-script/
  • Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix: /guides/redis/redis-cant-save-in-background-fork/
  • Redis client output buffer overflow: slow consumers and client-output-buffer-limit: /guides/redis/redis-client-output-buffer-limit/
  • Redis connected_clients climbing: connection leak detection: /guides/redis/redis-connected-clients-climbing/
  • Redis connection exhaustion: leaks, pools, and the retry storm: /guides/redis/redis-connection-exhaustion/
  • Redis event loop blocked: when one slow command freezes everything: /guides/redis/redis-event-loop-blocked/
  • Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction: /guides/redis/redis-eviction-policy-tuning/
  • Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box: /guides/redis/redis-fork-cow-storm/