Redis Sentinel triggering unnecessary failovers: quorum and split-brain
Your logs show a failover, but the old master never restarted. Or failovers flap: one node is promoted, then another, and clients see MASTERDOWN and READONLY while Sentinels disagree. The root cause is usually not the Redis data node. It is the Sentinel control plane: quorum too low, down-after-milliseconds too aggressive, or a network partition that leaves Sentinels on the wrong side of the split.
Sentinel acts on its own network view. A single Sentinel that cannot reach the master declares it Subjectively Down (SDOWN). If enough Sentinels agree, the master becomes Objectively Down (ODOWN). ODOWN opens the gate for failover, but the failover itself still requires authorization from a majority of all Sentinels known for that master, whether or not they are currently reachable. When thresholds are misaligned, or Sentinels cannot see each other, the system promotes a replica while the old master is still healthy. The result is split-brain, data divergence, and often more damage than the original network hiccup.
flowchart TD
A[Master unreachable from one Sentinel] --> B{Quorum met for ODOWN?}
B -->|No| C[No failover]
B -->|Yes| D[ODOWN declared]
D --> E{Majority of Sentinels available?}
E -->|No| F[Failover blocked]
E -->|Yes| G[Failover authorized]
G --> H[Replica promoted]
H --> I{Can old master still accept writes?}
I -->|Yes| J[Split-brain: two primaries]
I -->|No| K[Clean failover]
What this means
An unnecessary failover promotes a replica even though the old master is healthy or only briefly unreachable. Sentinel is a separate process with its own network perspective; a congested link, firewall rule, GC pause on the Sentinel host, or aggressive down-after-milliseconds can all trigger SDOWN. If quorum is too low, a minority declares ODOWN. If a majority of Sentinels sit on the side of a partition that cannot reach the master, they authorize failover while the minority still sees the old master as primary. Until the partition heals, both nodes can accept writes. When it heals, Sentinel demotes the old master to a replica, and the writes it accepted in the meantime are discarded during the resync.
Quorum controls ODOWN declaration; majority controls failover authorization. On a three-Sentinel ring with quorum 1, any single blip creates ODOWN, but failover still needs two Sentinels. If one Sentinel is isolated, failover is blocked. If two are isolated together, they can authorize failover while the third Sentinel and old master remain alive on the other side.
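The two thresholds can be sketched as a few shell helpers. This is an illustrative model, not Sentinel's implementation: the function names are made up for this guide, and the authorization rule assumes the leader needs votes from a majority of all known Sentinels, or from quorum Sentinels when quorum is set higher than the majority.

```shell
# Illustrative helpers (hypothetical names) modelling Sentinel's two thresholds.

# Smallest number of Sentinels that forms a majority of <total>.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# ODOWN: succeeds when at least <quorum> Sentinels report the master as down.
# usage: can_declare_odown <quorum> <sentinels_reporting_down>
can_declare_odown() {
    [ "$2" -ge "$1" ]
}

# Failover authorization: the would-be leader needs votes from a majority of
# all known Sentinels, and from at least <quorum> of them if quorum is higher.
# usage: can_authorize_failover <total> <quorum> <reachable_voters>
can_authorize_failover() {
    needed=$(majority "$1")
    if [ "$2" -gt "$needed" ]; then
        needed=$2
    fi
    [ "$3" -ge "$needed" ]
}
```

For the three-Sentinel ring with quorum 1 described above, can_declare_odown 1 1 succeeds, can_authorize_failover 3 1 1 fails (one isolated Sentinel cannot promote anything), and can_authorize_failover 3 1 2 succeeds (two Sentinels cut off together can).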
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Quorum set too low | ODOWN declared after a single blip; failovers correlate with minor network events | SENTINEL CKQUORUM and the quorum value in SENTINEL MASTER |
| Quorum set too high | ODOWN never reached when the master is actually dead; failover does not happen | Compare quorum to num-other-sentinels in SENTINEL MASTER |
| Aggressive down-after-milliseconds | Flapping SDOWN during traffic bursts or GC pauses | down-after-milliseconds value versus baseline PING latency |
| Network partition between Sentinels | Sentinels disagree on master state; multiple nodes claim primary after partition heals | SENTINEL SENTINELS output on each Sentinel host |
| Even or insufficient Sentinel count | Two-Sentinel deployments cannot survive any single-node loss; even counts risk split decisions | Number of Sentinels reported in SENTINEL SENTINELS |
| Short failover-timeout | A previous failover is still in progress when a new ODOWN is declared; clients see conflicting primaries | Sentinel logs for +failover-state-* transitions |
Quick checks
Run these from a host that can reach both the Redis data nodes and the Sentinel processes.
# Verify Sentinel quorum can be reached
redis-cli -p 26379 SENTINEL CKQUORUM mymaster
# Inspect the master state from Sentinel's perspective
redis-cli -p 26379 SENTINEL MASTER mymaster
# List other Sentinels and their last-seen timestamps
redis-cli -p 26379 SENTINEL SENTINELS mymaster
# Check whether the declared-down master is actually responding
redis-cli -h <master-ip> -p 6379 PING
redis-cli -h <master-ip> -p 6379 INFO server | grep uptime_in_seconds
# Check replica link status to distinguish a real master failure from a Sentinel false positive
redis-cli -h <replica-ip> -p 6379 INFO replication | grep master_link_status
# Check replication offset lag to assess data-loss risk if a failover already ran
redis-cli -h <master-ip> -p 6379 INFO replication | grep master_repl_offset
How to diagnose it
- Confirm the master is actually healthy. Run PING and INFO server directly against the old master. If it returns PONG and uptime_in_seconds is high, the failover was unnecessary.
- Run SENTINEL CKQUORUM on every Sentinel. If any Sentinel returns an error, the ring cannot form a majority and should not be authorizing failovers. Note which Sentinels are missing from each other’s view.
- Compare SENTINEL MASTER across all Sentinels. Look for mismatched num-slaves, num-other-sentinels, or flags. Disagreement means the Sentinels are partitioned or stale entries exist.
- Check Sentinel logs for +sdown and +odown timestamps. If +sdown appears on one Sentinel but not the others at the same time, the issue is a local network or CPU stall on that Sentinel, not a master failure.
- Verify down-after-milliseconds and failover-timeout. If down-after-milliseconds is below the 99th percentile of PING latency between Sentinel and master, expect flapping SDOWN. If failover-timeout is shorter than the time your largest replica needs to reconfigure and acknowledge the new master, overlapping failovers can occur.
- Inspect replica replication offsets. If a failover already happened, compare the new master’s master_repl_offset to the old master’s last known offset. The gap is committed data that was lost or duplicated.
- Check for stale Sentinel entries. If Sentinel restarts and dynamic configuration is not persisted, SENTINEL SENTINELS may show duplicate or ghost entries that break quorum math.
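Comparing SENTINEL MASTER across Sentinels is easier when individual fields can be pulled out of the reply. The helper below is a minimal sketch that assumes redis-cli's non-interactive output, where field names and values arrive on alternating lines. The function name and the hostnames in the commented usage are hypothetical.

```shell
# Print the value of one field from `redis-cli ... SENTINEL MASTER <name>` output.
# Assumes field names and values arrive on alternating lines (piped redis-cli output).
# usage: redis-cli -p 26379 SENTINEL MASTER mymaster | sentinel_field quorum
sentinel_field() {
    awk -v want="$1" '
        NR % 2 == 1   { field = $0; next }   # odd lines carry field names
        field == want { print; exit }        # even lines carry the value
    '
}

# Example loop over three Sentinels (hypothetical hostnames), left commented out
# because it needs a live deployment:
# for host in sentinel-1 sentinel-2 sentinel-3; do
#     reply=$(redis-cli -h "$host" -p 26379 SENTINEL MASTER mymaster)
#     printf '%s flags=%s others=%s\n' "$host" \
#         "$(printf '%s\n' "$reply" | sentinel_field flags)" \
#         "$(printf '%s\n' "$reply" | sentinel_field num-other-sentinels)"
# done
```

If one host prints different flags or a lower num-other-sentinels than its peers, that Sentinel is the one to investigate first.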
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| SENTINEL CKQUORUM result | Validates that quorum is reachable and a majority exists | Any response other than OK |
| SENTINEL MASTER flags and num-other-sentinels | Reveals whether Sentinels agree on topology | num-other-sentinels lower than expected, or flags indicating disagreement |
| master_link_status on replicas | If replicas are actually disconnected, a failover is warranted | down for more than one repl-timeout cycle |
| uptime_in_seconds on data nodes | Detects unexpected restarts or failovers | Sudden reset without a planned restart |
| connected_slaves on primary | Confirms expected replica topology | Drop below the configured replica count |
| Replication offset lag | Quantifies data-loss exposure during failover | Lag growing or exceeding repl-backlog-size |
| rejected_connections | Post-failover reconnections can exhaust pools | Any sustained increase |
| loading state | Newly promoted master may still be loading | loading:1 after failover indicates the replica was not fully caught up |
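The first signal in the table reduces to a prefix check, which makes it easy to wire into a cron job or an external monitor. The sketch below treats any reply that does not start with OK as unhealthy, including an empty reply from an unreachable Sentinel. The function name is hypothetical.

```shell
# Succeeds only when a SENTINEL CKQUORUM reply starts with "OK".
# An empty reply (for example, Sentinel unreachable) counts as a failure.
# usage: ckquorum_ok "<reply text>"
ckquorum_ok() {
    case "$1" in
        OK*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example against a live Sentinel, left commented out:
# ckquorum_ok "$(redis-cli -p 26379 SENTINEL CKQUORUM mymaster)" \
#     || echo "WARNING: quorum or majority not reachable"
```

Run the check against every Sentinel rather than just one, since each Sentinel answers from its own view of the ring.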
Fixes
Right-size quorum and Sentinel count
Quorum should require more than a transient minority, but not be so high that a single unreachable Sentinel blocks all detection. With three Sentinels, quorum=2 is the practical choice: two Sentinels must agree, and those same two form the majority needed to authorize failover. A two-Sentinel deployment cannot survive any single-node loss, so it should not be used for production HA.
Tune down-after-milliseconds
The default is 30000 ms. Lowering this to 1000 ms on a noisy or burst-loaded network produces flapping SDOWN. Set it to at least twice the 99th percentile round-trip time between Sentinel and the master, plus headroom for periodic latency spikes from RDB saves or AOF rewrite forks.
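As a rough illustration of that rule, the helper below computes a floor for down-after-milliseconds from a measured p99 round-trip time and a headroom figure. Both inputs are assumptions you would replace with your own measurements, and the function name is hypothetical.

```shell
# Lower bound for down-after-milliseconds: twice the p99 round-trip time
# plus headroom for fork-related latency spikes. All values in milliseconds.
# usage: min_down_after_ms <p99_rtt_ms> <headroom_ms>
min_down_after_ms() {
    echo $(( 2 * $1 + $2 ))
}

# Example: a 40 ms p99 and 500 ms of headroom (both made-up figures).
# min_down_after_ms 40 500
# Apply the chosen value on every Sentinel, since SENTINEL SET is per instance:
# redis-cli -p 26379 SENTINEL SET mymaster down-after-milliseconds <value>
```

Treat the result as a minimum, not a target. A value well above it trades slower detection for fewer false SDOWN events.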
Tune failover-timeout
Set it higher than the time your largest replica needs to reconfigure and acknowledge the new master. If it is too short, Sentinel may start an overlapping failover before replicas finish syncing. Validate the value against your measured replica sync latency during peak traffic.
Fix the network partition
Sentinel failovers are often symptoms of layer-3 instability. Check for:
- Asymmetric routing between Sentinel hosts and Redis nodes
- Firewall rules blocking the Sentinel port (default 26379) or the Redis port (default 6379), which also carries replication traffic
- Kubernetes network policies or service mesh timeouts that drop long-lived Sentinel connections
- DNS resolution flapping if Sentinels are configured by hostname
Limit divergence during split-brain
Configure min-replicas-to-write 1 and min-replicas-max-lag 10 on the master. The master then rejects writes when fewer than one replica is acknowledging within ten seconds. During a split-brain, this limits divergence only if the partition isolates the master from enough replicas. The tradeoff is reduced write availability during replica lag or partial partitions.
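The write gate can be modelled in a few lines to see which partitions it actually protects against. This is a simplified sketch with a hypothetical function name: it counts replicas whose last acknowledgement is no older than min-replicas-max-lag seconds and accepts writes only if that count reaches min-replicas-to-write.

```shell
# Model of the master's write gate under min-replicas-to-write / min-replicas-max-lag.
# usage: master_accepts_writes <min_replicas_to_write> <max_lag_s> [replica_lag_s ...]
master_accepts_writes() {
    min_replicas=$1
    max_lag=$2
    shift 2
    good=0
    for lag in "$@"; do
        # A replica counts as "good" if its last ack is recent enough.
        if [ "$lag" -le "$max_lag" ]; then
            good=$(( good + 1 ))
        fi
    done
    [ "$good" -ge "$min_replicas" ]
}
```

With the settings above, master_accepts_writes 1 10 25 40 fails, because an old master cut off from both replicas stops taking writes once their acknowledgements are older than ten seconds. master_accepts_writes 1 10 3 40 still succeeds, which is the case where one replica stays on the old master's side of the partition and divergence is not prevented.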
Increase repl-backlog-size
After an unnecessary failover, clients reconnect and replicas reconfigure. If the replication backlog is too small, partial resync fails and the new master forks for a full resync. Set repl-backlog-size to at least 100 MB in production, or calculate it as two times your peak write bytes per second multiplied by your maximum expected partition duration.
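The sizing formula is simple enough to compute directly. The sketch below uses a hypothetical function name, and the figures in the commented example are placeholders rather than recommendations.

```shell
# Backlog size in bytes: 2 x peak write rate x longest partition to survive.
# usage: backlog_bytes <peak_write_bytes_per_sec> <max_partition_seconds>
backlog_bytes() {
    echo $(( 2 * $1 * $2 ))
}

# Example: 1 MiB/s of peak writes and a 60-second partition (made-up figures).
# backlog_bytes 1048576 60
# Apply on a data node, left commented out because it needs a live server:
# redis-cli -h <master-ip> -p 6379 CONFIG SET repl-backlog-size <bytes>
```

For the example figures the result is 125829120 bytes, or 120 MiB, which sits just above the 100 MB floor suggested above.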
Prevention
- Independent Sentinel monitoring. Poll SENTINEL CKQUORUM, SENTINEL MASTER, and SENTINEL SENTINELS from your monitoring system. Sentinel health is not implied by Redis health.
- Odd Sentinel count, minimum three. Use three or more Sentinels in an odd count to avoid tied votes and ensure a clear majority during authorization.
- Persisted Sentinel configuration. Keep sentinel.conf writable and on persistent storage so that changes made via SENTINEL SET, and the topology Sentinel has discovered, survive restarts instead of creating ghost entries or quorum drift.
- Baselined timeout tuning. Run redis-cli --latency between Sentinel hosts and Redis nodes during peak traffic to set down-after-milliseconds with real data.
- Partition testing. Induce a network partition between one Sentinel and the master, then between two Sentinels and the master, to verify that failovers occur only when appropriate.
How Netdata helps
- Correlates failover events with Redis uptime_in_seconds, connected_slaves, and master_link_status to distinguish control-plane issues from data-node failures.
- Tracks replication offset lag and sync_full events to surface post-failover full resyncs caused by insufficient backlog.
- Alerts on rejected_connections spikes after failovers to detect client reconnect storms.
- Monitors loading state on newly promoted primaries to catch replicas promoted before they finished catching up.
Related guides
- How Redis actually works in production: a mental model for operators: /guides/redis/how-redis-works-in-production/
- Redis aof_last_write_status:err: AOF write failures and recovery: /guides/redis/redis-aof-last-write-status-err/
- Redis appendfsync always latency: durability vs throughput trade-offs: /guides/redis/redis-appendfsync-always-latency/
- Redis blocked_clients growing: dead consumers vs healthy queues: /guides/redis/redis-blocked-clients-growing/
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover: /guides/redis/redis-busy-running-script/
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix: /guides/redis/redis-cant-save-in-background-fork/
- Redis client output buffer overflow: slow consumers and client-output-buffer-limit: /guides/redis/redis-client-output-buffer-limit/
- Redis cluster_slots_pfail > 0: impending node failure in a cluster: /guides/redis/redis-cluster-slots-pfail/
- Redis CLUSTERDOWN / cluster_state:fail: slot coverage and recovery: /guides/redis/redis-cluster-state-fail/
- Redis connected_clients climbing: connection leak detection: /guides/redis/redis-connected-clients-climbing/
- Redis connected_slaves dropped: detecting replica disconnects on the primary: /guides/redis/redis-connected-slaves-dropped/
- Redis connection exhaustion: leaks, pools, and the retry storm: /guides/redis/redis-connection-exhaustion/







