PostgreSQL replica out of sync: timeline mismatches and recovery

A PostgreSQL streaming replica that was healthy yesterday now refuses to start with a timeline mismatch error, or a former primary that you brought back online cannot rejoin the cluster as a standby. The log shows requested timeline N is not a child of this server's history and the replica loops in crash recovery while the current primary continues to diverge. This guide covers identifying the divergence point, choosing between pg_rewind and a full re-clone, and recovering the replica without introducing split-brain or data loss.

What this means

PostgreSQL creates a new timeline each time a standby is promoted to primary, incrementing the timeline ID. Parent-child relationships are recorded in .history files inside pg_wal. A physical streaming replica follows a specific timeline. If the timeline it is asked to follow did not branch from the replica’s own history, or branched at a point the replica has already moved past, startup exits with a fatal error.
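
To make the lineage concrete, here is a minimal sketch that prints the ancestry recorded in a history file. It assumes the usual layout of one ancestor per line (parent timeline ID, switch LSN, reason) with `#` comment lines; the function name `print_timeline_lineage` and the example path are illustrative, not part of PostgreSQL.

```shell
# Print one line per ancestor timeline recorded in a .history file.
# Assumed layout per line: <parent timeline ID> <switch LSN> <reason>
print_timeline_lineage() {
    awk '
        /^[[:space:]]*#/ || NF == 0 { next }   # skip comments and blank lines
        { printf "timeline %s up to %s\n", $1, $2 }
    ' "$1"
}

# Example (illustrative path):
# print_timeline_lineage "$PGDATA/pg_wal/00000003.history"
```

If the timeline a replica is on does not appear in the output for the upstream's current history file, the replica cannot follow that upstream without a rewind or re-clone.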

Three scenarios cause this:

A standby is configured to follow a fixed timeline ID and a promotion switched the primary to a new one.
A former primary was restarted as a standby after diverging.
A cascading replica downstream of a promoted standby does not switch to the latest timeline.

```mermaid
flowchart TD
    A[Replica fails to start] --> B{Log shows timeline mismatch?}
    B -->|Yes| C[Check timeline IDs on both nodes]
    C --> D{Is target timeline a descendant?}
    D -->|No| E{Former primary? wal_log_hints on?}
    E -->|Yes| F[pg_rewind from current primary]
    E -->|No| G[Re-clone with pg_basebackup]
    D -->|Yes| H[Set recovery_target_timeline = latest]
    B -->|No| I[Check WAL gap and slot health]
```

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| `recovery_target_timeline` fixed to a stale ID | Replica fails immediately after promotion with a timeline mismatch | `SHOW recovery_target_timeline;` or grep `postgresql.conf` and `postgresql.auto.conf` |
| Former primary rejoining without rewinding | Old primary restarted as standby; `pg_controldata` shows an older timeline than the current primary, with a latest checkpoint past the point where the new timeline forked | Timeline ID and latest checkpoint location on both nodes with `pg_controldata` |
| `wal_log_hints` missing at divergence time | `pg_rewind` fails complaining about `wal_log_hints` | `SHOW wal_log_hints;` and `SHOW data_checksums;` |
| WAL missing back to divergence point | `pg_rewind` fails with `could not find previous WAL record` | WAL file existence in `pg_wal` and in the WAL archive |
| Downstream replica after cascading promotion | A standby was promoted and its own replicas cannot attach to the new primary | `recovery_target_timeline` on downstream nodes |

Quick checks

Run these safe, read-only checks before making changes.

```shell
# Timeline and latest checkpoint on the failed node
pg_controldata "$PGDATA" | grep -iE 'timeline|checkpoint'
cat "$PGDATA"/pg_wal/*.history

# Timeline on the intended new primary (run on that host)
pg_controldata "$PGDATA" | grep -i 'timeline'
```

```sql
-- Recovery target configuration and rewind prerequisites on the replica
SHOW recovery_target_timeline;
SHOW wal_log_hints;
SHOW data_checksums;
SHOW full_page_writes;

-- Current timeline from SQL
SELECT timeline_id FROM pg_control_checkpoint();
```

```shell
# WAL presence around the divergence point
ls -la "$PGDATA"/pg_wal/ | head -20
```

```sql
-- Replication slot health (run on the primary: pg_current_wal_lsn()
-- raises an error on a server that is in recovery)
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
```
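
When comparing nodes it can help to reduce the pg_controldata output to a single number. The following is a minimal sketch, assuming the English-language label `Latest checkpoint's TimeLineID` (hence `LC_ALL=C` in the usage line); the helper names `controldata_timeline` and `compare_timelines` are hypothetical.

```shell
# Read pg_controldata output on stdin and print the latest checkpoint's timeline ID.
# The "." in the pattern stands in for the apostrophe in "checkpoint's".
controldata_timeline() {
    awk -F': *' '/^Latest checkpoint.s TimeLineID/ { print $2; exit }'
}

# Compare two timeline IDs: first argument is the replica's, second the primary's.
compare_timelines() {
    if [ "$1" -eq "$2" ]; then
        echo "same"
    elif [ "$1" -lt "$2" ]; then
        echo "replica-behind"
    else
        echo "replica-ahead"
    fi
}

# Example usage on each host:
# LC_ALL=C pg_controldata "$PGDATA" | controldata_timeline
```

The comparison alone does not prove divergence; it only tells you which history file to read next.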

How to diagnose it

Read the exact error. Look for FATAL: requested timeline N is not a child of this server's history in the replica log. Note the requested timeline ID and the server’s current timeline.
Compare timeline IDs. Run pg_controldata on both the target (failed replica) and the source (current primary). A former primary that diverged normally sits on an older timeline than the source, with a latest checkpoint location beyond the LSN at which the source’s timeline forked.
Inspect the history file. On the source, read the .history file in pg_wal that corresponds to its current timeline (and the target’s own file, if it has one). Each line lists an ancestor timeline and the LSN where the switch away from it happened. Confirm whether the target’s timeline appears in that lineage and whether the target has written WAL past the switch LSN.
Determine if the target is a former primary. Check Database cluster state in pg_controldata. If it shows in production or shut down, the instance last ran as a primary and may have accepted writes after the fork. A standby shows in archive recovery or shut down in recovery.
Verify WAL reachability. pg_rewind needs WAL on the target reaching back to the last checkpoint before the divergence point. List pg_wal files around the switch LSN. If they were recycled, check whether the WAL archive still holds them; pg_rewind -c (--restore-target-wal) fetches them through the restore_command configured on the target rather than streaming them from the source.
Verify pg_rewind prerequisites. The target must be cleanly shut down. wal_log_hints must have been enabled before the instances diverged; enabling it after the fact does not help. Data checksums enabled at initdb time also satisfy the requirement.
Choose the recovery path. If the timeline is valid but the replica is looking at a fixed ID, a config change and restart suffice. If the target diverged as a former primary and prerequisites are met, use pg_rewind. If prerequisites fail or WAL is missing, re-clone.
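
The final step above can be written down as a small decision function. This is a sketch of the logic as described in this guide, not an official tool; the function name and the yes/no arguments are assumptions made for illustration.

```shell
# choose_recovery_path LINEAGE_OK FORMER_PRIMARY REWIND_PREREQS WAL_AVAILABLE
# Each argument is "yes" or "no":
#   LINEAGE_OK      the upstream timeline is a valid descendant of the replica's history
#   FORMER_PRIMARY  the target ran as a primary and diverged
#   REWIND_PREREQS  clean shutdown, plus wal_log_hints or data checksums before divergence
#   WAL_AVAILABLE   WAL back to the divergence point is on the target or in the archive
choose_recovery_path() {
    if [ "$1" = "yes" ]; then
        echo "set-latest"      # config change and restart suffice
    elif [ "$2" = "yes" ] && [ "$3" = "yes" ] && [ "$4" = "yes" ]; then
        echo "pg_rewind"
    else
        echo "re-clone"
    fi
}

# Example: a diverged former primary with prerequisites met
# choose_recovery_path no yes yes yes
```

Writing the decision down this way makes it easy to paste into a runbook and to see that re-cloning is the fallback whenever any rewind condition is in doubt.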

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `pg_stat_replication.replay_lag` | A replica that is behind is more likely to require rebuilding after a timeline switch | `replay_lag` > 30 s async and growing |
| `pg_replication_slots.active` | Inactive slots retain WAL that may be needed for rewind or catch-up | `active = false` with a stale `restart_lsn` |
| `pg_wal` directory size | Rapid growth indicates a replica is not consuming WAL, or a slot is retaining it | Size persistently above `max_wal_size` |
| `pg_stat_database.conflicts` | Query cancels on the replica indicate replay is blocked by long-running reads | Sustained nonzero values |
| Timeline ID on primary | Unexpected timeline increments signal unplanned promotions or failovers | Timeline changes outside scheduled maintenance |
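
As a rough illustration of the slot signal, the sketch below filters slot rows and prints the names of slots that are inactive or retaining more WAL than a chosen limit. It assumes input in `psql -At` format (`slot_name|active|lag_bytes`, booleans as `t`/`f`); the function name and the byte threshold are hypothetical and should be tuned to your disk budget.

```shell
# flag_risky_slots LIMIT_BYTES
# Reads "slot_name|active|lag_bytes" rows on stdin and prints slots that are
# inactive or whose retained WAL exceeds LIMIT_BYTES.
flag_risky_slots() {
    awk -F'|' -v limit="$1" '
        $2 == "f" || ($3 != "" && $3 + 0 > limit + 0) { print $1 }
    '
}

# Example usage on the primary (1 GiB limit is an arbitrary illustration):
# psql -At -c "SELECT slot_name, active,
#                     pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#              FROM pg_replication_slots" | flag_risky_slots 1073741824
```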

Fixes

Set `recovery_target_timeline = 'latest'`

If the replica is not a former primary and the only issue is a fixed timeline ID, edit postgresql.conf or postgresql.auto.conf:

```
recovery_target_timeline = 'latest'
```

Ensure standby.signal exists in the data directory. Restart the replica. This is the safest and fastest fix when the replica’s history is otherwise consistent.
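
Before restarting, it may be worth confirming that no configuration file still pins the timeline. The following sketch scans the files you pass it and reports any `recovery_target_timeline` value other than `latest`; the function name `find_pinned_timeline` is hypothetical, and the check ignores `include` directives.

```shell
# Print "<file>: <value>" for every recovery_target_timeline setting
# whose value is not 'latest'. Commented-out settings are ignored.
find_pinned_timeline() (
    for f in "$@"; do
        [ -r "$f" ] || continue
        awk -v file="$f" -v q="'" '
            /^[[:space:]]*recovery_target_timeline[[:space:]]*=/ {
                value = $0
                sub(/^[^=]*=[[:space:]]*/, "", value)   # drop the name and "="
                sub(/[[:space:]]*#.*$/, "", value)      # drop a trailing comment
                gsub(q, "", value)                      # drop single quotes
                if (value != "latest") print file ": " value
            }
        ' "$f"
    done
)

# Example usage:
# find_pinned_timeline "$PGDATA/postgresql.conf" "$PGDATA/postgresql.auto.conf"
```

Empty output means no pinned value was found in the files checked.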

Rewind a former primary with `pg_rewind`

When the target was previously a primary and diverged, pg_rewind can resync it to the current source without a full base backup.

Prerequisites:

Target must be cleanly shut down.
wal_log_hints = on or data checksums enabled at initdb.
full_page_writes = on.
Target retains WAL back to the divergence point, or a restore_command on the target can fetch the missing segments from the WAL archive (used by pg_rewind -c).

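```shell
# Execute rewind from the live primary.
# -P reports progress; -c fetches missing WAL through the target's restore_command.
pg_rewind --target-pgdata="$PGDATA" \
  --source-server="host=new_primary port=5432 user=replicator dbname=postgres" \
  -P -c
```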

After pg_rewind:

The command destructively modifies the target data directory.
Review postgresql.auto.conf and pg_hba.conf; the target may retain stale settings from when it was a primary.
Ensure standby.signal is present.
Start the target. It enters standby mode and replays WAL from the new primary.

Tradeoffs: Faster than a full clone for large clusters, but it destructively modifies the target data directory. If wal_log_hints was not enabled before divergence, this path is closed.
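
A quick pre-flight check can catch the most common blockers before pg_rewind touches anything. This sketch reads pg_controldata output and assumes the English labels `Database cluster state`, `wal_log_hints setting` and `Data page checksum version`; the function name is hypothetical. It reflects only what the control file says now, so it cannot prove that wal_log_hints was already on before the nodes diverged.

```shell
# Read pg_controldata output on stdin and report whether the basic
# pg_rewind prerequisites appear to hold. Returns non-zero on failure.
rewind_preflight() {
    awk -F': *' '
        /^Database cluster state/     { state = $2 }
        /^wal_log_hints setting/      { hints = $2 }
        /^Data page checksum version/ { checksums = $2 }
        END {
            ok = 1
            if (state !~ /^shut down/) {
                print "target is not cleanly shut down (state: " state ")"
                ok = 0
            }
            if (hints != "on" && checksums + 0 == 0) {
                print "neither wal_log_hints nor data checksums are enabled"
                ok = 0
            }
            if (ok) print "prerequisites look satisfied"
            exit !ok
        }
    '
}

# Example usage on the stopped target:
# LC_ALL=C pg_controldata "$PGDATA" | rewind_preflight
```

If the check passes, running pg_rewind with --dry-run first shows what would happen without modifying the data directory.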

Re-clone with `pg_basebackup`

If pg_rewind prerequisites are not met, or WAL back to the divergence is missing, rebuild the replica from scratch.

Warning: pg_basebackup refuses to write into a non-empty directory, so the old contents of $PGDATA must be moved aside or removed first, and removing them is irreversible. Stop PostgreSQL on the target before doing either.

```shell
pg_basebackup -D "$PGDATA" \
  -h new_primary -U replicator \
  -R -Xs -c fast -P -v
```

The -R flag creates standby.signal and seeds primary_conninfo in postgresql.auto.conf. Review the generated connection string before starting the replica.

Tradeoffs: Network- and time-intensive for multi-terabyte clusters, but it guarantees a consistent starting point and removes uncertainty about pre-divergence configuration.
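
If disk space allows, renaming the old data directory instead of deleting it keeps the diverged data available until the new replica is confirmed healthy. The sketch below is one way to do that; the function name, the `.old.<timestamp>` suffix, and the use of postmaster.pid as a "still running" hint are assumptions (a stale postmaster.pid can remain after a crash).

```shell
# set_aside_datadir DIR
# Rename DIR to DIR.old.<timestamp> and print the new name.
# Refuses to act if postmaster.pid is present.
set_aside_datadir() (
    datadir=${1%/}
    if [ ! -d "$datadir" ]; then
        echo "no such directory: $datadir" >&2
        exit 1
    fi
    if [ -f "$datadir/postmaster.pid" ]; then
        echo "postmaster.pid present; stop PostgreSQL first" >&2
        exit 1
    fi
    backup="$datadir.old.$(date +%Y%m%d%H%M%S)"
    mv "$datadir" "$backup" && echo "$backup"
)

# Example usage before running pg_basebackup:
# set_aside_datadir "$PGDATA"
```

Remove the renamed directory only after the re-cloned replica has caught up.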

Fix downstream cascading replicas

When a standby is promoted, its own replicas must also follow the new timeline. If they fail with a timeline mismatch, set recovery_target_timeline = 'latest' on each downstream node and restart.

Prevention

Always set recovery_target_timeline = 'latest' for any standby in an HA topology.
Enable wal_log_hints = on and data checksums at initdb so pg_rewind remains available.
Use replication slots and set max_slot_wal_keep_size to prevent unbounded WAL retention while still preserving enough history for catch-up.
Fence the old primary after failover. Stop PostgreSQL or isolate the host so it cannot restart independently and accept writes.
Test failover, rewind, and re-clone procedures monthly. The first time you try to rewind a former primary should not be during an incident.

How Netdata helps

Correlate replication lag in seconds and bytes with WAL generation rate on the primary to identify replicas at risk of falling behind before a timeline switch.
Alert on inactive replication slots and WAL directory growth before disk exhaustion blocks recovery.
Track checkpoint frequency and backend process states to distinguish a slow replica from a stuck recovery process.
Visualize replica conflict counts to detect hot-standby queries that block WAL replay.

