Redis Sentinel triggering unnecessary failovers: quorum and split-brain

Your logs show a failover, but the old master never restarted. Or failovers flap: one node is promoted, then another, and clients see MASTERDOWN and READONLY while Sentinels disagree. The root cause is usually not the Redis data node. It is the Sentinel control plane: quorum too low, down-after-milliseconds too aggressive, or a network partition that leaves Sentinels on the wrong side of the split.

Sentinel acts on its own network view. A single Sentinel that cannot reach the master declares it Subjectively Down (SDOWN). If enough Sentinels agree, the master becomes Objectively Down (ODOWN). ODOWN opens the gate for failover, but the failover itself still requires authorization from a majority of all known Sentinels, not just those currently reachable. When thresholds are misaligned, or Sentinels cannot see each other, the system promotes a replica while the old master is still healthy. The result is split-brain, data divergence, and often more damage than the original network hiccup.

flowchart TD
    A[Master unreachable from one Sentinel] --> B{Quorum met for ODOWN?}
    B -->|No| C[No failover]
    B -->|Yes| D[ODOWN declared]
    D --> E{Majority of Sentinels available?}
    E -->|No| F[Failover blocked]
    E -->|Yes| G[Failover authorized]
    G --> H[Replica promoted]
    H --> I{Can old master still accept writes?}
    I -->|Yes| J[Split-brain: two primaries]
    I -->|No| K[Clean failover]

What this means

An unnecessary failover promotes a replica even though the old master is healthy or only briefly unreachable. Sentinel is a separate process with its own network perspective; a congested link, firewall rule, GC pause on the Sentinel host, or aggressive down-after-milliseconds can all trigger SDOWN. If quorum is too low, a minority declares ODOWN. If the majority of Sentinels are on the isolated side of a partition, they authorize failover while the minority still sees the old master as primary. While the partition lasts, both nodes may accept writes. When it heals, Sentinel reconfigures the old master as a replica of the new one, so the datasets conflict and the old master's divergent writes are discarded in a resync, often a full one.

Quorum controls ODOWN declaration; majority controls failover authorization. On a three-Sentinel ring with quorum 1, any single blip creates ODOWN, but failover still needs two Sentinels. If one Sentinel is isolated, failover is blocked. If two are isolated together, they can authorize failover while the third Sentinel and old master remain alive on the other side.
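
The quorum-versus-majority arithmetic above can be sketched as two small shell functions. This is an illustrative model only, not Sentinel's own code; the function names `sentinel_majority` and `partition_outcome` are hypothetical, and the model assumes every Sentinel on one side of a partition agrees the master is down.

```shell
# Majority needed to authorize a failover among N known Sentinels: floor(N/2) + 1.
sentinel_majority() {
    echo $(( $1 / 2 + 1 ))
}

# Model what one side of a partition can do.
# Args: total known Sentinels, configured quorum, Sentinels on this side.
partition_outcome() {
    total=$1; quorum=$2; reachable=$3
    majority=$(sentinel_majority "$total")
    if [ "$reachable" -lt "$quorum" ]; then
        echo "no-odown"                      # not enough agreement to declare ODOWN
    elif [ "$reachable" -lt "$majority" ]; then
        echo "odown-but-failover-blocked"    # ODOWN declared, authorization impossible
    else
        echo "failover-authorized"           # quorum and majority both satisfied
    fi
}
```

For the three-Sentinel ring with quorum 1, `partition_outcome 3 1 1` models the single isolated Sentinel and `partition_outcome 3 1 2` models two Sentinels isolated together.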

Common causes

Cause	What it looks like	First thing to check
Quorum set too low	ODOWN declared after a single blip; failovers correlate with minor network events	`SENTINEL CKQUORUM` and the quorum value in `SENTINEL MASTER`
Quorum set too high	ODOWN never reached when the master is actually dead; failover does not happen	Compare quorum to `num-other-sentinels` in `SENTINEL MASTER`
Aggressive `down-after-milliseconds`	Flapping SDOWN during traffic bursts or GC pauses	`down-after-milliseconds` value versus baseline PING latency
Network partition between Sentinels	Sentinels disagree on master state; multiple nodes claim primary after partition heals	`SENTINEL SENTINELS` output on each Sentinel host
Even or insufficient Sentinel count	Two-Sentinel deployments cannot survive any single-node loss; even counts risk split decisions	Number of Sentinels reported in `SENTINEL SENTINELS`
Short `failover-timeout`	A previous failover is still in progress when a new ODOWN is declared; clients see conflicting primaries	Sentinel logs for `+failover-state-*` transitions

Quick checks

Run these from a host that can reach both the Redis data nodes and the Sentinel processes.

# Verify Sentinel quorum can be reached
redis-cli -p 26379 SENTINEL CKQUORUM mymaster

# Inspect the master state from Sentinel's perspective
redis-cli -p 26379 SENTINEL MASTER mymaster

# List other Sentinels and their last-seen timestamps
redis-cli -p 26379 SENTINEL SENTINELS mymaster

# Check whether the declared-down master is actually responding
redis-cli -h <master-ip> -p 6379 PING
redis-cli -h <master-ip> -p 6379 INFO server | grep uptime_in_seconds

# Check replica link status to distinguish a real master failure from a Sentinel false positive
redis-cli -h <replica-ip> -p 6379 INFO replication | grep master_link_status

# Check replication offset lag to assess data-loss risk if a failover already ran
redis-cli -h <master-ip> -p 6379 INFO replication | grep master_repl_offset
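
To compare these outputs across hosts, it can help to pull out single fields. The sketch below uses two hypothetical helper filters, `sentinel_field` and `info_field`. It assumes redis-cli is writing to a pipe, so `SENTINEL MASTER` prints alternating field-name and value lines and `INFO` prints `key:value` lines; check this against your redis-cli version before relying on it.

```shell
# Print one field's value from `SENTINEL MASTER <name>` output read on stdin.
# Assumes alternating field-name / value lines.
sentinel_field() {
    tr -d '\r' | awk -v want="$1" '
        NR % 2 == 1 { key = $0; next }                  # odd lines are field names
        key == want && !done { print; done = 1 }        # value line after the wanted name
    '
}

# Print one field's value from `INFO` output (key:value lines) read on stdin.
info_field() {
    tr -d '\r' | awk -F: -v want="$1" '
        $1 == want && !done { print $2; done = 1 }
    '
}
```

Typical use would be `redis-cli -p 26379 SENTINEL MASTER mymaster | sentinel_field quorum` on each Sentinel, or `redis-cli -h <master-ip> -p 6379 INFO replication | info_field master_repl_offset` on each data node.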

How to diagnose it

Confirm the master is actually healthy. Run PING and INFO server directly against the old master. If it returns PONG and uptime_in_seconds is high, the failover was unnecessary.
Run SENTINEL CKQUORUM on every Sentinel. If any Sentinel returns an error, that Sentinel cannot currently reach the quorum or the majority it needs, and a ring in that state should not be authorizing failovers. Note which Sentinels are missing from each other’s view.
Compare SENTINEL MASTER across all Sentinels. Look for mismatched num-slaves, num-other-sentinels, or flags. Disagreement means the Sentinels are partitioned or stale entries exist.
Check Sentinel logs for +sdown and +odown timestamps. If +sdown appears on one Sentinel but not the others at the same time, the issue is a local network or CPU stall on that Sentinel, not a master failure.
Verify down-after-milliseconds and failover-timeout. If down-after-milliseconds is below the 99th percentile of PING latency between Sentinel and master, expect flapping SDOWN. If failover-timeout is shorter than the time your largest replica needs to reconfigure and acknowledge the new master, overlapping failovers can occur.
Inspect replica replication offsets. If a failover already happened, compare the new master’s master_repl_offset to the old master’s last known offset. The gap approximates the writes the old master acknowledged to clients but the new master never received.
Check for stale Sentinel entries. If Sentinel restarts and dynamic configuration is not persisted, SENTINEL SENTINELS may show duplicate or ghost entries that break quorum math.
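
For the log comparison step, a rough way to see whether one Sentinel is flapping on its own is to count down events per log. The filter below is a minimal sketch: `count_down_events` is a hypothetical name, and it assumes log lines contain the event text in the form `+sdown master <name> <ip> <port>` and `+odown master <name> <ip> <port>`.

```shell
# Count +sdown and +odown events for one monitored master.
# Reads Sentinel log lines on stdin; the master name is the first argument.
count_down_events() {
    awk -v name="$1" '
        index($0, "+sdown master " name " ") { sdown++ }
        index($0, "+odown master " name " ") { odown++ }
        END { printf "sdown=%d odown=%d\n", sdown, odown }
    '
}
```

Run it against each Sentinel's log, for example `count_down_events mymaster < /var/log/redis/sentinel.log` (the path is an assumption). A Sentinel with many more `sdown` events than its peers points at that host or its network path rather than the master.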

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`SENTINEL CKQUORUM` result	Validates that quorum is reachable and a majority exists	Any response other than `OK`
`SENTINEL MASTER` flags and `num-other-sentinels`	Reveals whether Sentinels agree on topology	`num-other-sentinels` lower than expected, or flags indicating disagreement
`master_link_status` on replicas	If replicas are actually disconnected, a failover is warranted	`down` for more than one `repl-timeout` cycle
`uptime_in_seconds` on data nodes	Detects unexpected restarts or failovers	Sudden reset without a planned restart
`connected_slaves` on primary	Confirms expected replica topology	Drop below the configured replica count
Replication offset lag	Quantifies data-loss exposure during failover	Lag growing or exceeding `repl-backlog-size`
`rejected_connections`	Post-failover reconnections can exhaust pools	Any sustained increase
`loading` state	Newly promoted master may still be loading	`loading:1` after failover indicates the replica was not fully caught up

Fixes

Right-size quorum and Sentinel count

Quorum should require more than a transient minority, but not be so high that a single unreachable Sentinel blocks all detection. With three Sentinels, quorum=2 is the practical choice: two Sentinels must agree, and those same two form the majority needed to authorize failover. A two-Sentinel deployment cannot survive any single-node loss, so it should not be used for production HA.

Tune `down-after-milliseconds`

The default is 30000 ms. Lowering this to 1000 ms on a noisy or burst-loaded network produces flapping SDOWN. Set it to at least twice the 99th percentile round-trip time between Sentinel and the master, plus headroom for periodic latency spikes from RDB saves or AOF rewrite forks.
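
The sizing rule above can be turned into a quick calculation. This is a sketch under assumptions: `p99` and `suggest_down_after_ms` are hypothetical helpers, the input is a file of round-trip samples in whole milliseconds (one per line) that you collected yourself, and the headroom value is yours to choose.

```shell
# Nearest-rank 99th percentile of numeric samples (one per line) read on stdin.
p99() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            idx = int((NR * 99 + 99) / 100)   # ceil(0.99 * NR)
            print v[idx]
        }
    '
}

# Lower bound for down-after-milliseconds: 2 * p99 + headroom (integer ms).
suggest_down_after_ms() {
    echo $(( 2 * $1 + $2 ))
}
```

With a hypothetical samples file, `suggest_down_after_ms "$(p99 < rtt_samples.txt)" 500` gives a floor to compare against the configured value; treat the result as a minimum, not a target.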

Tune `failover-timeout`

Set it higher than the time your largest replica needs to reconfigure and acknowledge the new master. If it is too short, Sentinel may start an overlapping failover before replicas finish syncing. Validate the value against your measured replica sync latency during peak traffic.

Fix the network partition

Sentinel failovers are often symptoms of layer-3 instability. Check for:

Asymmetric routing between Sentinel hosts and Redis nodes
Firewall rules blocking the Sentinel port (default 26379) or the Redis replication port
Kubernetes network policies or service mesh timeouts that drop long-lived Sentinel connections
DNS resolution flapping if Sentinels are configured by hostname

Limit divergence during split-brain

Configure min-replicas-to-write 1 and min-replicas-max-lag 10 on the master. The master then rejects writes unless at least one replica has acknowledged replication within the last ten seconds. During a split-brain, this limits divergence only if the partition isolates the master from enough replicas. The tradeoff is reduced write availability during replica lag or partial partitions.
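
To see how a master would behave under these settings, you can evaluate its current `INFO replication` output. The sketch below uses hypothetical helpers `good_replicas` and `would_accept_writes` and assumes replica lines of the form `slaveN:ip=...,port=...,state=online,offset=...,lag=N`. It approximates the rule and does not reproduce Redis's internal check.

```shell
# Count online replicas whose reported lag is within max_lag seconds.
# Reads the master's `INFO replication` output on stdin.
good_replicas() {
    tr -d '\r' | awk -v max_lag="$1" '
        /^slave[0-9]+:.*state=online/ {
            n = split($0, parts, ",")
            for (i = 1; i <= n; i++)
                if (parts[i] ~ /^lag=/) {
                    lag = substr(parts[i], 5) + 0
                    if (lag <= max_lag + 0) good++
                }
        }
        END { print good + 0 }
    '
}

# Exit 0 if enough good replicas exist. Args: min-replicas-to-write, min-replicas-max-lag.
would_accept_writes() {
    [ "$(good_replicas "$2")" -ge "$1" ]
}
```

For example, `redis-cli -h <master-ip> -p 6379 INFO replication | would_accept_writes 1 10` exits non-zero when no replica is within ten seconds of lag.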

Increase `repl-backlog-size`

After an unnecessary failover, clients reconnect and replicas reconfigure. If the replication backlog is too small, partial resync fails and the new master forks for a full resync. Set repl-backlog-size to at least 100 MB in production, or calculate it as two times your peak write bytes per second multiplied by your maximum expected partition duration.
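
The backlog formula above is plain arithmetic. A minimal sketch follows, with `backlog_bytes` as a hypothetical helper name and both inputs being values you measure or estimate yourself.

```shell
# repl-backlog-size estimate in bytes:
# 2 * peak write bytes per second * maximum expected partition duration in seconds.
backlog_bytes() {
    echo $(( 2 * $1 * $2 ))
}
```

For instance, `backlog_bytes 1048576 60` models 1 MiB/s of peak writes and a 60-second partition. The result could then be applied with `redis-cli CONFIG SET repl-backlog-size <bytes>` and persisted in redis.conf.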

Prevention

Independent Sentinel monitoring. Poll SENTINEL CKQUORUM, SENTINEL MASTER, and SENTINEL SENTINELS from your monitoring system. Sentinel health is not implied by Redis health.
Odd Sentinel count, minimum three. Use three or more Sentinels in an odd count to avoid tied votes and ensure a clear majority during authorization.
Persisted Sentinel configuration. Make sure sentinel.conf is writable by the Sentinel process so that runtime changes, including those made via SENTINEL SET, are saved to disk and restarts do not create ghost entries or quorum drift.
Baselined timeout tuning. Run redis-cli --latency between Sentinel hosts and Redis nodes during peak traffic to set down-after-milliseconds with real data.
Partition testing. Induce a network partition between one Sentinel and the master, then between two Sentinels and the master, to verify that failovers occur only when appropriate.
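
The independent monitoring item can start as a simple check on the CKQUORUM reply. This is a sketch, assuming a healthy reply begins with `OK` and anything else (for example a `NOQUORUM` error) should alert; `ckquorum_ok` is a hypothetical helper name.

```shell
# Exit 0 only if the CKQUORUM reply read on stdin starts with "OK".
ckquorum_ok() {
    reply=""
    read -r reply || :            # tolerate a reply with no trailing newline
    case "$reply" in
        OK*) return 0 ;;
        *)   echo "CKQUORUM failed: $reply" >&2; return 1 ;;
    esac
}
```

A monitoring job might run `redis-cli -p 26379 SENTINEL CKQUORUM mymaster | ckquorum_ok` against every Sentinel and alert on a non-zero exit status.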

How Netdata helps

Correlates failover events with Redis uptime_in_seconds, connected_slaves, and master_link_status to distinguish control-plane issues from data-node failures.
Tracks replication offset lag and sync_full events to surface post-failover full resyncs caused by insufficient backlog.
Alerts on rejected_connections spikes after failovers to detect client reconnect storms.
Monitors loading state on newly promoted primaries to catch replicas promoted before they finished catching up.

