Redis Sentinel triggering unnecessary failovers: quorum and split-brain

Your logs show a failover, but the old master never restarted. Or failovers flap: one node is promoted, then another, and clients see MASTERDOWN and READONLY while Sentinels disagree. The root cause is usually not the Redis data node. It is the Sentinel control plane: quorum too low, down-after-milliseconds too aggressive, or a network partition that leaves Sentinels on the wrong side of the split.

Sentinel acts on its own network view. A single Sentinel that cannot reach the master declares it Subjectively Down (SDOWN). If at least quorum Sentinels agree, the master becomes Objectively Down (ODOWN). ODOWN opens the gate for failover, but the failover itself still requires authorization from a majority of all Sentinels known for that master, not just the ones currently reachable. When thresholds are misaligned, or Sentinels cannot see each other, the system can promote a replica while the old master is still healthy. The result is split-brain, data divergence, and often more damage than the original network hiccup.

flowchart TD
    A[Master unreachable from one Sentinel] --> B{Quorum met for ODOWN?}
    B -->|No| C[No failover]
    B -->|Yes| D[ODOWN declared]
    D --> E{Majority of Sentinels available?}
    E -->|No| F[Failover blocked]
    E -->|Yes| G[Failover authorized]
    G --> H[Replica promoted]
    H --> I{Can old master still accept writes?}
    I -->|Yes| J[Split-brain: two primaries]
    I -->|No| K[Clean failover]

What this means

An unnecessary failover promotes a replica even though the old master is healthy or only briefly unreachable. Sentinel is a separate process with its own network perspective; a congested link, firewall rule, GC pause on the Sentinel host, or aggressive down-after-milliseconds can all trigger SDOWN. If quorum is too low, a minority declares ODOWN. If the majority of Sentinels are on the isolated side of a partition, they authorize failover while the minority still sees the old master as primary. While the partition lasts, both nodes can accept writes. When it heals, Sentinel reconfigures the old master as a replica of the new one, and the writes it accepted in isolation are discarded, often through a full resync.

Quorum controls ODOWN declaration; majority controls failover authorization. On a three-Sentinel ring with quorum 1, any single blip creates ODOWN, but failover still needs votes from two Sentinels. If one Sentinel is isolated, it can declare ODOWN on its own but cannot collect a second vote, so failover is blocked. If two are isolated together, they can authorize failover while the third Sentinel and old master remain alive on the other side.
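
The two thresholds can be sketched as a small shell model. This is an illustration of the rule described above, not Sentinel's actual code; the function names `majority` and `sentinel_outcome` are hypothetical, and "reachable" means the Sentinels on one side of a partition that agree the master is down.

```shell
# majority N: smallest Sentinel count that is more than half of N.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# sentinel_outcome TOTAL REACHABLE QUORUM
#   TOTAL     = all Sentinels known for this master
#   REACHABLE = Sentinels on this side of the partition that see the master down
#   QUORUM    = configured quorum
sentinel_outcome() {
    total=$1; reachable=$2; quorum=$3
    if [ "$reachable" -lt "$quorum" ]; then
        echo "no-odown"
    elif [ "$reachable" -lt "$(majority "$total")" ]; then
        echo "odown-but-failover-blocked"
    else
        echo "failover-authorized"
    fi
}

sentinel_outcome 3 1 1   # one isolated Sentinel, quorum 1: odown-but-failover-blocked
sentinel_outcome 3 2 1   # two isolated together, quorum 1: failover-authorized
sentinel_outcome 3 1 2   # one isolated Sentinel, quorum 2: no-odown
```

The third call shows why quorum 2 on three Sentinels is quieter: a single Sentinel with a bad network view cannot even declare ODOWN.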

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Quorum set too low | ODOWN declared after a single blip; failovers correlate with minor network events | SENTINEL CKQUORUM and the quorum value in SENTINEL MASTER |
| Quorum set too high | ODOWN never reached when the master is actually dead; failover does not happen | Compare quorum to num-other-sentinels in SENTINEL MASTER |
| Aggressive down-after-milliseconds | Flapping SDOWN during traffic bursts or GC pauses | down-after-milliseconds value versus baseline PING latency |
| Network partition between Sentinels | Sentinels disagree on master state; multiple nodes claim primary after partition heals | SENTINEL SENTINELS output on each Sentinel host |
| Even or insufficient Sentinel count | Two-Sentinel deployments cannot survive any single-node loss; even counts risk split decisions | Number of Sentinels reported in SENTINEL SENTINELS |
| Short failover-timeout | A previous failover is still in progress when a new ODOWN is declared; clients see conflicting primaries | Sentinel logs for +failover-state-* transitions |

Quick checks

Run these from a host that can reach both the Redis data nodes and the Sentinel processes.

# Verify Sentinel quorum can be reached
redis-cli -p 26379 SENTINEL CKQUORUM mymaster

# Inspect the master state from Sentinel's perspective
redis-cli -p 26379 SENTINEL MASTER mymaster

# List other Sentinels and their last-seen timestamps
redis-cli -p 26379 SENTINEL SENTINELS mymaster

# Check whether the declared-down master is actually responding
redis-cli -h <master-ip> -p 6379 PING
redis-cli -h <master-ip> -p 6379 INFO server | grep uptime_in_seconds

# Check replica link status to distinguish a real master failure from a Sentinel false positive
redis-cli -h <replica-ip> -p 6379 INFO replication | grep master_link_status

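# Check replication offsets on both sides to assess data-loss risk if a failover already ran
redis-cli -h <master-ip> -p 6379 INFO replication | grep master_repl_offset
redis-cli -h <replica-ip> -p 6379 INFO replication | grep -E 'master_repl_offset|slave_repl_offset'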
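
To turn the two offset readings into a single number, a minimal sketch follows. The helper names `repl_offset` and `offset_gap` are hypothetical, and the comparison is only meaningful while both nodes still share the same replication history.

```shell
# repl_offset: read `INFO replication` output on stdin and print master_repl_offset.
# INFO output uses CRLF line endings, so strip the carriage returns first.
repl_offset() {
    tr -d '\r' | awk -F: '$1 == "master_repl_offset" { print $2 }'
}

# offset_gap OLD NEW: bytes of replication stream the old master holds
# that the new master does not. Never negative.
offset_gap() {
    gap=$(( $1 - $2 ))
    if [ "$gap" -lt 0 ]; then
        gap=0
    fi
    echo "$gap"
}

# Usage against live nodes (placeholders as in the commands above):
#   old=$(redis-cli -h <master-ip>  -p 6379 INFO replication | repl_offset)
#   new=$(redis-cli -h <replica-ip> -p 6379 INFO replication | repl_offset)
#   offset_gap "$old" "$new"
```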

How to diagnose it

  1. Confirm the master is actually healthy. Run PING and INFO server directly against the old master. If it returns PONG and uptime_in_seconds is larger than the time elapsed since the failover, the process never restarted and the failover was likely unnecessary.
  2. Run SENTINEL CKQUORUM on every Sentinel. If a Sentinel returns an error, it cannot see enough peers to reach quorum or to form the majority needed to authorize a failover. Note which Sentinels are missing from each other’s view.
  3. Compare SENTINEL MASTER across all Sentinels. Look for mismatched num-slaves, num-other-sentinels, or flags. Disagreement means the Sentinels are partitioned or stale entries exist.
  4. Check Sentinel logs for +sdown and +odown timestamps. If +sdown appears on one Sentinel but not the others at the same time, the issue is a local network or CPU stall on that Sentinel, not a master failure.
  5. Verify down-after-milliseconds and failover-timeout. If down-after-milliseconds is below the 99th percentile of PING latency between Sentinel and master, expect flapping SDOWN. If failover-timeout is shorter than the time your largest replica needs to reconfigure and acknowledge the new master, overlapping failovers can occur.
  6. Inspect replica replication offsets. If a failover already happened, compare the new master’s master_repl_offset to the old master’s last known offset. The gap is writes the old master acknowledged but the promoted replica never received; those writes are lost once the old master resyncs as a replica.
  7. Check for stale Sentinel entries. If a Sentinel restarts and its dynamic configuration was not persisted, SENTINEL SENTINELS may show duplicate or ghost entries. Those entries inflate the known Sentinel count, which skews both quorum and majority math.
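
For step 4, a rough way to compare Sentinels is to count +sdown and +odown events per log file. The sketch below assumes the event text has the form `+sdown master <name> <ip> <port>`; the function name `count_down_events` and the log path in the usage note are hypothetical, and the line prefix in your logs may differ.

```shell
# count_down_events MASTER_NAME: read a Sentinel log on stdin and print how many
# +sdown and +odown events were recorded for that master.
# Events about other Sentinels or replicas (e.g. "+sdown sentinel ...") are ignored.
count_down_events() {
    awk -v name="$1" '
        {
            # Scan for the event token followed by "master <name>".
            for (i = 1; i < NF - 1; i++) {
                if (($i == "+sdown" || $i == "+odown") &&
                    $(i + 1) == "master" && $(i + 2) == name) {
                    count[$i]++
                }
            }
        }
        END { printf "sdown=%d odown=%d\n", count["+sdown"], count["+odown"] }
    '
}

# Usage on each Sentinel host (path is an assumption):
#   count_down_events mymaster < /var/log/redis/sentinel.log
```

A Sentinel that reports many more +sdown events than its peers, with few or no +odown events, points at that host's network or CPU rather than at the master.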

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| SENTINEL CKQUORUM result | Validates that quorum is reachable and a majority exists | Any response other than OK |
| SENTINEL MASTER flags and num-other-sentinels | Reveals whether Sentinels agree on topology | num-other-sentinels lower than expected, or flags indicating disagreement |
| master_link_status on replicas | If replicas are actually disconnected, a failover is warranted | down for more than one repl-timeout cycle |
| uptime_in_seconds on data nodes | Detects unexpected restarts or failovers | Sudden reset without a planned restart |
| connected_slaves on primary | Confirms expected replica topology | Drop below the configured replica count |
| Replication offset lag | Quantifies data-loss exposure during failover | Lag growing or exceeding repl-backlog-size |
| rejected_connections | Post-failover reconnections can exhaust pools | Any sustained increase |
| loading state | Newly promoted master may still be loading | loading:1 after failover indicates the replica was not fully caught up |

Fixes

Right-size quorum and Sentinel count

Quorum should require more than a transient minority, but not be so high that a single unreachable Sentinel blocks all detection. With three Sentinels, quorum=2 is the practical choice: two Sentinels must agree, and those same two form the majority needed to authorize failover. A two-Sentinel deployment cannot survive any single-node loss, so it should not be used for production HA.

Tune down-after-milliseconds

The default is 30000 ms. Lowering this to 1000 ms on a noisy or burst-loaded network produces flapping SDOWN. Set it to at least twice the 99th percentile round-trip time between Sentinel and the master, plus headroom for periodic latency spikes from RDB saves or AOF rewrite forks.
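
The sizing rule above is simple arithmetic, sketched here so it can be dropped into a config check. The function names are hypothetical, and both inputs (the p99 round-trip time and the headroom) are values you have to measure or choose yourself.

```shell
# min_down_after_ms P99_RTT_MS HEADROOM_MS
# Lower bound for down-after-milliseconds: twice the p99 round-trip time plus headroom.
min_down_after_ms() {
    echo $(( 2 * $1 + $2 ))
}

# check_down_after CONFIGURED_MS P99_RTT_MS HEADROOM_MS
# Prints "ok" if the configured value meets the lower bound, "too-low" otherwise.
check_down_after() {
    if [ "$1" -ge "$(min_down_after_ms "$2" "$3")" ]; then
        echo "ok"
    else
        echo "too-low"
    fi
}

min_down_after_ms 40 2000        # p99 of 40 ms, 2 s headroom for fork spikes: 2080
check_down_after 1000 40 2000    # too-low
check_down_after 30000 40 2000   # ok
```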

Tune failover-timeout

Set it higher than the time your largest replica needs to reconfigure and acknowledge the new master. If it is too short, Sentinel may start an overlapping failover before replicas finish syncing. Validate the value against your measured replica sync latency during peak traffic.
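
As a sketch of how the quorum and the two timeouts end up in sentinel.conf, the helper below prints the three directives for one monitored master. The function name `render_sentinel_conf` is hypothetical, and the address and timeout values in the example are placeholders to replace with your own measurements.

```shell
# render_sentinel_conf NAME IP PORT QUORUM DOWN_AFTER_MS FAILOVER_TIMEOUT_MS
# Prints the sentinel.conf directives for one monitored master.
render_sentinel_conf() {
    name=$1; ip=$2; port=$3; quorum=$4; down_after=$5; failover_timeout=$6
    printf 'sentinel monitor %s %s %s %s\n' "$name" "$ip" "$port" "$quorum"
    printf 'sentinel down-after-milliseconds %s %s\n' "$name" "$down_after"
    printf 'sentinel failover-timeout %s %s\n' "$name" "$failover_timeout"
}

# Three-Sentinel ring, quorum 2 (address and timeouts are placeholders).
render_sentinel_conf mymaster 10.0.0.1 6379 2 30000 180000
```

Generating the same lines for every Sentinel host is one way to keep the quorum and timeouts identical across the ring.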

Fix the network partition

Sentinel failovers are often symptoms of layer-3 instability. Check for:

  • Asymmetric routing between Sentinel hosts and Redis nodes
  • Firewall rules blocking the Sentinel port (default 26379) or the Redis replication port
  • Kubernetes network policies or service mesh timeouts that drop long-lived Sentinel connections
  • DNS resolution flapping if Sentinels are configured by hostname

Limit divergence during split-brain

Configure min-replicas-to-write 1 and min-replicas-max-lag 10 on the master. The master then rejects writes when no replica has acknowledged within the last ten seconds. During a split-brain, this limits divergence only if the partition isolates the master from enough replicas. The tradeoff is reduced write availability during replica lag or partial partitions.
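
To see how these two settings interact with the replica lag that the master reports, the sketch below models the rule against `INFO replication` output. It is an approximation for reasoning about a given snapshot, not Redis's own implementation; the function name `writes_allowed` is hypothetical, and it assumes replica lines of the form `slaveN:ip=...,port=...,state=online,offset=...,lag=N`.

```shell
# writes_allowed MIN_REPLICAS MAX_LAG
# Reads a master's `INFO replication` output on stdin and prints "yes" if at least
# MIN_REPLICAS replicas are online with lag <= MAX_LAG seconds, "no" otherwise.
writes_allowed() {
    tr -d '\r' | awk -v need="$1" -v maxlag="$2" '
        /^slave[0-9]+:/ {
            n = split($0, kv, ",")
            state = ""; lag = -1
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^state=/) state = substr(kv[i], 7)
                if (kv[i] ~ /^lag=/)   lag = substr(kv[i], 5) + 0
            }
            if (state == "online" && lag >= 0 && lag <= maxlag + 0) good++
        }
        END {
            result = (good + 0 >= need + 0) ? "yes" : "no"
            print result
        }
    '
}

# Usage against a live master (placeholder as in the commands above):
#   redis-cli -h <master-ip> -p 6379 INFO replication | writes_allowed 1 10
```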

Increase repl-backlog-size

After an unnecessary failover, clients reconnect and replicas reconfigure. If the replication backlog is too small, partial resync fails and the new master forks for a full resync. Set repl-backlog-size to at least 100 MB in production, or calculate it as two times your peak write bytes per second multiplied by your maximum expected partition duration.
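
The sizing formula can be worked through as follows. The function name `backlog_bytes` is hypothetical, the 100 MB floor is taken as 100 × 1024 × 1024 bytes, and the write rates in the examples are made-up inputs.

```shell
# backlog_bytes PEAK_WRITE_BYTES_PER_SEC MAX_PARTITION_SECONDS
# Returns 2 x peak write rate x partition duration, with a 100 MB floor.
backlog_bytes() {
    calc=$(( 2 * $1 * $2 ))
    floor=$(( 100 * 1024 * 1024 ))
    if [ "$calc" -gt "$floor" ]; then
        echo "$calc"
    else
        echo "$floor"
    fi
}

backlog_bytes 5000000 60   # 5 MB/s of writes, 60 s partition: 600000000
backlog_bytes 100000 30    # low write rate, so the floor applies: 104857600
```

The result is a byte count that can be passed to repl-backlog-size directly.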

Prevention

  • Independent Sentinel monitoring. Poll SENTINEL CKQUORUM, SENTINEL MASTER, and SENTINEL SENTINELS from your monitoring system. Sentinel health is not implied by Redis health.
  • Odd Sentinel count, minimum three. Use three or more Sentinels in an odd count to avoid tied votes and ensure a clear majority during authorization.
  • Persisted Sentinel configuration. Sentinel rewrites sentinel.conf itself when you run SENTINEL SET and when the topology changes, so keep that file writable and on persistent storage. Otherwise restarts create ghost entries or quorum drift.
  • Baselined timeout tuning. Run redis-cli --latency between Sentinel hosts and Redis nodes during peak traffic to set down-after-milliseconds with real data.
  • Partition testing. Induce a network partition between one Sentinel and the master, then between two Sentinels and the master, to verify that failovers occur only when appropriate.
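
The independent monitoring item above can start as a small poller like the one below. It is a sketch: the function names, the `REDIS_CLI` override variable, and the Sentinel host names in the usage line are assumptions, and it treats any CKQUORUM reply that does not begin with OK (including connection errors) as an alert.

```shell
# Allow the redis-cli binary to be overridden, for example in tests.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# ckquorum_status REPLY: classify a SENTINEL CKQUORUM reply.
ckquorum_status() {
    case "$1" in
        OK*) echo "ok" ;;
        *)   echo "alert" ;;
    esac
}

# check_sentinels MASTER_NAME HOST...
# Runs CKQUORUM on each Sentinel host and prints "<host> ok" or "<host> alert".
check_sentinels() {
    master=$1; shift
    for host in "$@"; do
        # Capture stderr too, so connection failures are classified as alerts.
        reply=$("$REDIS_CLI" -h "$host" -p 26379 SENTINEL CKQUORUM "$master" 2>&1) || true
        echo "$host $(ckquorum_status "$reply")"
    done
}

# Usage (host names are placeholders):
#   check_sentinels mymaster sentinel-1 sentinel-2 sentinel-3
```

Running it from a host outside the Sentinel ring keeps the check independent of any single Sentinel's network view.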

How Netdata helps

  • Correlates failover events with Redis uptime_in_seconds, connected_slaves, and master_link_status to distinguish control-plane issues from data-node failures.
  • Tracks replication offset lag and sync_full events to surface post-failover full resyncs caused by insufficient backlog.
  • Alerts on rejected_connections spikes after failovers to detect client reconnect storms.
  • Monitors loading state on newly promoted primaries to catch replicas promoted before they finished catching up.

Related guides

  • How Redis actually works in production: a mental model for operators: /guides/redis/how-redis-works-in-production/
  • Redis aof_last_write_status:err: AOF write failures and recovery: /guides/redis/redis-aof-last-write-status-err/
  • Redis appendfsync always latency: durability vs throughput trade-offs: /guides/redis/redis-appendfsync-always-latency/
  • Redis blocked_clients growing: dead consumers vs healthy queues: /guides/redis/redis-blocked-clients-growing/
  • Redis BUSY Redis is busy running a script: blocking Lua and how to recover: /guides/redis/redis-busy-running-script/
  • Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix: /guides/redis/redis-cant-save-in-background-fork/
  • Redis client output buffer overflow: slow consumers and client-output-buffer-limit: /guides/redis/redis-client-output-buffer-limit/
  • Redis cluster_slots_pfail > 0: impending node failure in a cluster: /guides/redis/redis-cluster-slots-pfail/
  • Redis CLUSTERDOWN / cluster_state:fail: slot coverage and recovery: /guides/redis/redis-cluster-state-fail/
  • Redis connected_clients climbing: connection leak detection: /guides/redis/redis-connected-clients-climbing/
  • Redis connected_slaves dropped: detecting replica disconnects on the primary: /guides/redis/redis-connected-slaves-dropped/
  • Redis connection exhaustion: leaks, pools, and the retry storm: /guides/redis/redis-connection-exhaustion/