Redis stale replica promotion: silent data loss at failover

After a Redis failover, the new primary accepts writes immediately and clients reconnect without errors. Writes that were acknowledged by the old primary but not yet replicated are gone. This is stale replica promotion.

Redis replication is asynchronous by default. The primary applies a write to its in-memory dataset, replies OK to the client, then streams the change to replicas. If the primary fails before a replica receives the outstanding writes, that replica never sees them. Sentinel or Redis Cluster then promotes the best available replica. If that replica is lagging, the writes in the gap are permanently lost. The client receives no error, and nothing in the logs flags the missing writes.

What this means

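In a standard primary/replica deployment, Sentinel selects a new primary by skipping replicas that have been disconnected from the old primary for too long, then ranking the rest by replica priority, replication offset, and run ID, in that order. The best candidate is promoted even if it is not fully caught up.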
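The ranking order can be illustrated with a small sketch. This is a simplified model, not Sentinel's implementation: it assumes a hypothetical comma-separated input (name, priority, offset, run ID), treats priority 0 as never promotable, prefers lower priority numbers, then higher offsets, then the lexicographically smaller run ID, and ignores any lag or disconnect-time checks.

```shell
# rank_replicas: read candidate replicas from stdin, best candidate first.
# Input format is an assumption made for this sketch: name,priority,offset,runid
rank_replicas() {
  # Drop replicas with priority 0 (never promoted), then sort by
  # priority ascending, offset descending, run ID ascending.
  awk -F, '$2 != 0' | LC_ALL=C sort -t, -k2,2n -k3,3nr -k4,4
}

# Example (hypothetical values):
#   printf 'a,100,5000,bbb\nd,10,100,ddd\n' | rank_replicas
```

In this model a replica with a lower priority number wins even when its offset is far behind, which is one way a stale replica ends up promoted.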

Because the primary acknowledges writes before they reach replicas, there is always a window of unsynced data. Under normal conditions this window is small. Under network congestion, replica saturation, or an undersized replication backlog, lag can grow to megabytes of data or several seconds of writes. If the primary fails at that point, the promoted replica starts from an older state and the acknowledged but un-replicated writes are discarded. There is no rollback mechanism. The application may only notice through data inconsistencies or missing records.

```mermaid
flowchart TD
    A[Primary receives write] --> B[Returns OK to client]
    B --> C[Replicates to replica async]
    C --> D[Replica lags behind]
    D --> E[Primary fails]
    E --> F[Sentinel promotes lagging replica]
    F --> G[Unsynced writes are permanently lost]
```

Common causes

Cause	What it looks like	First thing to check
Async replication without a safety fence	Primary accepts writes while replica is disconnected or lagging	`CONFIG GET min-replicas-to-write`
Replication backlog too small	A brief network blip forces a full resync, extending the lag window	`CONFIG GET repl-backlog-size`
Replica resource saturation	Replica CPU or network cannot keep up with the primary write rate	`INFO replication` offset delta
All replicas lag simultaneously	The least-lagging replica is still behind the old primary at failover	Compare replica offsets to the old primary’s last offset

Quick checks

These read-only commands assess current exposure. Run them against the primary and each replica.

```shell
# Check whether the primary rejects writes when replicas are unavailable.
# A value of 0 means there is no safety fence.
redis-cli CONFIG GET min-replicas-to-write
redis-cli CONFIG GET min-replicas-max-lag

# On the primary: compare master_repl_offset to each replica's offset.
# The difference is the current loss window in bytes.
redis-cli INFO replication | grep -E "master_repl_offset|slave[0-9]"

# On the replica: confirm the link is up and view its offset.
# Note: on a replica, master_repl_offset is the replica's own offset.
redis-cli INFO replication | grep -E "master_repl_offset|master_link_status"

# Check backlog size. If it is too small for your write rate,
# brief disconnects force full resyncs.
redis-cli CONFIG GET repl-backlog-size

# Check for replication instability. Rising sync_full or sync_partial_err
# indicates the replica is struggling to stay in sync.
redis-cli INFO stats | grep -E "sync_full|sync_partial_err"
```
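To turn the offset comparison into a number, the subtraction can be scripted. The following is a minimal sketch, assuming the usual `INFO replication` layout on a primary (one `slaveN:` line per replica containing an `offset=` field, plus a `master_repl_offset:` line). The function name `replica_lag` is hypothetical.

```shell
# replica_lag: read `INFO replication` output (from the primary) on stdin
# and print the byte lag of each replica.
replica_lag() {
  tr -d '\r' | awk '
    {
      i = index($0, ":")
      if (i == 0) next                 # skip section headers and blank lines
      key = substr($0, 1, i - 1)
      val = substr($0, i + 1)
    }
    key == "master_repl_offset" { master = val }
    key ~ /^slave[0-9]+$/ {
      n = split(val, kv, ",")
      for (j = 1; j <= n; j++)
        if (kv[j] ~ /^offset=/) offset[key] = substr(kv[j], 8)
    }
    END {
      # master_repl_offset appears after the slaveN lines, so subtract here.
      for (s in offset) printf "%s lag_bytes=%.0f\n", s, master - offset[s]
    }' | LC_ALL=C sort
}

# Usage on a live primary:
#   redis-cli INFO replication | replica_lag
```

The result is a point-in-time snapshot. A single small value says little; the trend over repeated samples is what indicates exposure.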

How to diagnose it

Because the data loss is silent, diagnosis usually happens either as a post-mortem after a failover or as a preventive assessment before one.

1. Identify the failover timestamp from Sentinel logs or monitoring events.
2. Retrieve the old primary’s last known master_repl_offset from metrics, an RDB header, or INFO replication captured before the failure. If the node is unreachable, check whether your monitoring system recorded the final offset.
3. Retrieve the promoted replica’s offset at promotion time from metrics or logs. The byte difference is the data loss window.
4. Check whether min-replicas-to-write was configured on the old primary. A value of 0 means no write fence was active, so the primary continued accepting writes while replicas were disconnected or lagging.
5. Review replication lag history for the 5 to 15 minutes before failover. Sustained or growing lag confirms the replica was falling behind.
6. Examine sync_partial_err and sync_full counters before the incident. Rising values indicate the replica was repeatedly failing partial syncs, which extends lag during recovery.
7. Check application logs for data inconsistencies that correlate with the failover window, such as duplicate keys, missing records, or foreign-key violations in downstream systems.
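The offset comparison in the steps above is simple arithmetic, sketched here as a helper. The function name `loss_window` and the optional write-rate argument are assumptions for illustration. The result is only meaningful if both offsets belong to the same replication history (the same master_replid).

```shell
# loss_window OLD_PRIMARY_OFFSET PROMOTED_REPLICA_OFFSET [WRITE_RATE_BYTES_PER_SEC]
# Prints the byte gap and, if a write rate is given, a rough duration.
loss_window() {
  awk -v old="$1" -v new="$2" -v rate="${3:-0}" 'BEGIN {
    gap = old - new
    if (gap < 0) gap = 0    # replica at or past the recorded offset: no measurable gap
    printf "lost_bytes=%.0f", gap
    if (rate > 0) printf " approx_seconds=%.1f", gap / rate
    printf "\n"
  }'
}

# Example with hypothetical offsets and a 1 MiB/s write rate:
#   loss_window 1048576000 1046478848 1048576
```

The byte gap is an upper bound on lost replication stream data, not a count of lost keys. Mapping it back to specific records still requires application logs.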

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Replication offset lag	Byte gap between primary and replica is the maximum data loss window	Sustained lag larger than your replication backlog, or monotonically growing lag
`min-replicas-to-write` / `min-replicas-max-lag`	Safety fence that rejects writes when replicas do not acknowledge	Value is 0 or missing in configuration
`master_link_status`	A down link means the replica is not receiving writes	`down` for more than 60 seconds, or flapping
`sync_partial_err`	Failed partial resyncs force expensive full resyncs that increase lag	Non-zero or increasing rate
`connected_slaves`	Fewer replicas than expected reduces redundancy and safety	Below expected count for your topology
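The two warning signs for offset lag (lag above the backlog size, and lag that keeps growing) can be checked mechanically. This is a rough sketch, assuming lag samples in bytes arrive one per line, oldest first. The function name `lag_status` and the status strings are hypothetical, and "sustained" is simplified to "the latest sample".

```shell
# lag_status BACKLOG_BYTES   (lag samples in bytes on stdin, oldest first)
lag_status() {
  awk -v backlog="$1" '
    NR == 1 { growing = 1 }
    NR > 1 && $1 + 0 <= prev { growing = 0 }   # any flat or falling step breaks the trend
    { prev = $1 + 0 }
    END {
      if (NR == 0) { print "no-data"; exit }
      if (prev > backlog + 0) print "critical: lag exceeds backlog"
      else if (NR > 1 && growing) print "warning: lag growing"
      else print "ok"
    }'
}

# Example with hypothetical samples and a 1 MiB backlog:
#   printf '100\n200\n400\n' | lag_status 1048576
```

A production alert would normally smooth over a longer window to avoid flapping on brief bursts.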

Prevention

Configure a replication safety fence. Set min-replicas-to-write to at least 1 and min-replicas-max-lag to a value that bounds your acceptable loss window (for example, 10 seconds). The primary then rejects writes with an error whenever fewer than min-replicas-to-write replicas are connected with a lag at or below min-replicas-max-lag. This trades availability for consistency: if all replicas disconnect, the primary stops accepting writes until one returns. Tune min-replicas-max-lag to your network RTT and replica capacity. The default for min-replicas-to-write is 0, which disables the fence. Apply the change with CONFIG SET and persist it in redis.conf so it survives restarts.
Size the replication backlog. Increase repl-backlog-size from the 1 MB default to at least 100 MB, or enough to cover your typical disconnect duration multiplied by peak write throughput. For example, a 10 MB/s write rate and a 10 second blip require 100 MB. An undersized backlog overflows quickly, forcing full resyncs that leave the replica exposed for extended periods.
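Use WAIT for critical writes. The WAIT numreplicas timeout command blocks the calling connection until at least numreplicas replicas acknowledge the writes previously sent on that connection, or until the timeout (in milliseconds) expires. For example, WAIT 1 100 waits up to 100 ms for one replica. WAIT returns the number of replicas that acknowledged. If the timeout expires first, that number can be lower than requested, and the write still remains on the primary rather than being rolled back. WAIT narrows the loss window but does not make Redis a CP system.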
Monitor offset lag, not just link status. A replication link can report up while the replica is megabytes behind. Alert on the byte delta between master_repl_offset and the replica’s reported offset. Treat monotonically growing lag, or lag that exceeds your replication backlog, as an incident requiring investigation.
Right-size replica resources. A replica that saturates its CPU, memory bandwidth, or network interface cannot apply writes as fast as the primary. This creates structural lag. Safety fences and backpressure cannot compensate for an under-provisioned replica. Profile the replica’s used_cpu_sys and network throughput during peak primary load.
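The backlog sizing rule and the fence settings above can be combined into a short dry-run helper. This is a sketch under stated assumptions: 1 MB is taken as 1024 × 1024 bytes, the function names `backlog_bytes` and `fence_commands` are hypothetical, and the commands are only printed, not executed.

```shell
# backlog_bytes PEAK_WRITE_MB_PER_SEC DISCONNECT_SECONDS
# Backlog needed to ride out a disconnect without a full resync.
backlog_bytes() {
  awk -v rate="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", rate * secs * 1024 * 1024 }'
}

# fence_commands MIN_REPLICAS MAX_LAG_SECONDS BACKLOG_BYTES
# Print (do not run) the CONFIG SET commands for review.
fence_commands() {
  printf 'CONFIG SET min-replicas-to-write %s\n' "$1"
  printf 'CONFIG SET min-replicas-max-lag %s\n' "$2"
  printf 'CONFIG SET repl-backlog-size %s\n' "$3"
}

# Example: 10 MB/s peak writes, 10 second disconnect, one replica required.
#   fence_commands 1 10 "$(backlog_bytes 10 10)"
```

One way to apply the output is to pipe it into redis-cli on the primary after reviewing it, then run CONFIG REWRITE or edit redis.conf by hand so the settings survive a restart.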

How Netdata helps

Tracks replication offset lag continuously, so you can verify whether the promoted replica was behind at failover time.
Alerts on master_link_status:down and drops in connected_slaves.
Correlates sync_full and sync_partial_err spikes with system events to surface replication instability.
Surfaces CPU, memory, and network saturation on replica nodes to explain why lag is growing.
Retains historical metrics around failover events, enabling post-mortem offset comparison without relying solely on logs.
