PostgreSQL replica out of sync: timeline mismatches and recovery

A PostgreSQL streaming replica that was healthy yesterday now refuses to start with a timeline mismatch error, or a former primary that you brought back online cannot rejoin the cluster as a standby. The log shows requested timeline N is not a child of this server's history and the replica loops in crash recovery while the current primary continues to diverge. This guide covers identifying the divergence point, choosing between pg_rewind and a full re-clone, and recovering the replica without introducing split-brain or data loss.

What this means

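PostgreSQL creates a new timeline each time a standby is promoted or a point-in-time recovery completes, incrementing the timeline ID. Parent-child relationships are recorded in .history files inside pg_wal, one file per timeline, each listing the ancestor timelines and the LSN at which each switch happened. A physical streaming replica follows a specific timeline. If the timeline it is asked to follow did not branch from the replica’s own history at or after the replica’s current position, startup exits with a fatal error.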
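
To make the lineage concrete, the following is a minimal sketch of reading a history file from the shell. It assumes the usual layout of one ancestor per line (parent timeline ID, switch LSN, reason, separated by tabs); the function names `history_lineage` and `history_has_parent` are hypothetical helpers, not PostgreSQL tools.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading a timeline history file such as
# $PGDATA/pg_wal/00000003.history.

# history_lineage FILE
# Print "parent_tli switch_lsn" for each ancestor recorded in FILE,
# skipping blank lines and lines that start with "#".
history_lineage() {
  awk '!/^[[:space:]]*(#|$)/ { print $1, $2 }' "$1"
}

# history_has_parent FILE TLI
# Succeed if timeline TLI is listed as an ancestor in FILE.
history_has_parent() {
  history_lineage "$1" |
    awk -v tli="$2" '$1 == tli { found = 1 } END { exit !found }'
}
```

For example, `history_has_parent "$PGDATA/pg_wal/00000003.history" 2` succeeds only if timeline 3 descends from timeline 2.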

Three scenarios cause this:

  • A standby is configured to follow a fixed timeline ID and a promotion switched the primary to a new one.
  • A former primary was restarted as a standby after diverging.
  • A cascading replica downstream of a promoted standby does not switch to the latest timeline.

flowchart TD
    A[Replica fails to start] --> B{Log shows timeline mismatch?}
    B -->|Yes| C[Check timeline IDs on both nodes]
    C --> D{Is target timeline a descendant?}
    D -->|No| E{Former primary? wal_log_hints on?}
    E -->|Yes| F[pg_rewind from current primary]
    E -->|No| G[Re-clone with pg_basebackup]
    D -->|Yes| H[Set recovery_target_timeline = latest]
    B -->|No| I[Check WAL gap and slot health]
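
The decision flow above can be sketched as a small shell function. This is only an illustration: the function name, the yes/no arguments, and the returned labels are hypothetical, and the answers themselves still come from the manual checks described in this guide.

```shell
#!/usr/bin/env bash
# choose_recovery_path MISMATCH DESCENDANT FORMER_PRIMARY HINTS_OK
# Every argument is "yes" or "no". Prints a label for the suggested path.
choose_recovery_path() {
  local mismatch=$1 descendant=$2 former_primary=$3 hints_ok=$4
  if [ "$mismatch" != yes ]; then
    # No timeline error in the log: look at WAL gaps and slots instead.
    echo "check-wal-gap-and-slots"
  elif [ "$descendant" = yes ]; then
    # Target timeline is a descendant: a config change is enough.
    echo "set-recovery-target-timeline-latest"
  elif [ "$former_primary" = yes ] && [ "$hints_ok" = yes ]; then
    echo "pg_rewind"
  else
    echo "pg_basebackup"
  fi
}
```

Calling `choose_recovery_path yes no yes no` prints `pg_basebackup`, because a former primary without wal_log_hints or checksums cannot be rewound.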

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| recovery_target_timeline fixed to a stale ID | Replica fails immediately after promotion with a timeline mismatch | SHOW recovery_target_timeline; or grep postgresql.conf and postgresql.auto.conf |
| Former primary rejoining without rewinding | Old primary restarted as standby; pg_controldata shows a timeline that is not in the current primary’s history, or a checkpoint past the LSN where the primary’s timeline forked | Timeline ID on both nodes with pg_controldata |
| wal_log_hints missing at divergence time | pg_rewind fails complaining about wal_log_hints | SHOW wal_log_hints; and SHOW data_checksums; |
| WAL missing back to divergence point | pg_rewind fails with could not find previous WAL record | WAL file existence in pg_wal and on the source |
| Downstream replica after cascading promotion | A standby was promoted and its own replicas cannot attach to the new primary | recovery_target_timeline on downstream nodes |

Quick checks

Run these safe, read-only checks before making changes.

# On the failed node: timeline, cluster state, and latest checkpoint
pg_controldata "$PGDATA" | grep -iE 'timeline|checkpoint|cluster state'
cat "$PGDATA"/pg_wal/*.history

# On the intended new primary: current timeline
pg_controldata "$PGDATA" | grep -i 'timeline'

-- Recovery target configuration and rewind prerequisites on the replica
-- (if the replica does not accept connections, grep these settings from its config files)
SHOW recovery_target_timeline;
SHOW wal_log_hints;
SHOW data_checksums;
SHOW full_page_writes;

-- Current timeline from SQL
SELECT timeline_id FROM pg_control_checkpoint();

# WAL presence around the divergence point
ls -la "$PGDATA"/pg_wal/ | head -20

-- Replication slot health (run on the primary; pg_current_wal_lsn() raises an error during recovery)
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
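
When the same comparison has to be repeated across nodes, the timeline can be pulled out of the pg_controldata output with a small helper. The sketch below assumes the English-locale label `Latest checkpoint's TimeLineID`; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# controldata_timeline: read pg_controldata output on stdin and print the
# value of the "Latest checkpoint's TimeLineID" field.
controldata_timeline() {
  sed -n "s/^Latest checkpoint's TimeLineID:[[:space:]]*//p"
}

# compare_timelines TARGET_TLI SOURCE_TLI
# Print how the target's timeline ID relates to the source's.
compare_timelines() {
  if [ "$1" -eq "$2" ]; then
    echo "same"
  elif [ "$1" -lt "$2" ]; then
    echo "target-behind"
  else
    echo "target-ahead"
  fi
}

# Usage sketch (not executed here):
#   target_tli=$(pg_controldata "$PGDATA" | controldata_timeline)
```

The numeric comparison alone does not prove divergence; it only tells you which history file to read next.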

How to diagnose it

  1. Read the exact error. Look for FATAL: requested timeline N is not a child of this server's history in the replica log. Note the requested timeline ID and the server’s current timeline.

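  2. Compare timeline IDs. Run pg_controldata on both the target (failed replica) and the source (current primary). If the target’s timeline is higher than the source’s, the target was promoted on its own at some point. If it is lower but the target’s latest checkpoint lies past the LSN where the source’s timeline forked, the target is likely a former primary that kept writing after the failover.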

  3. Inspect the history file. On the target, read the .history file in pg_wal that corresponds to its current timeline. It lists the parent timeline and the LSN where the switch happened. Confirm whether the source’s timeline appears in that lineage.

  4. Determine if the target is a former primary. Check Database cluster state in pg_controldata. If it shows in production, or shut down rather than shut down in recovery, the instance last ran as a primary and may have accepted writes after the divergence point.

  5. Verify WAL reachability. pg_rewind needs WAL on the target reaching back to the divergence point. List pg_wal files around the switch LSN. If they were recycled, check whether the WAL archive still has them: pg_rewind -c (--restore-target-wal) fetches missing segments through the target’s restore_command; it does not stream them from the source.

  6. Verify pg_rewind prerequisites. The target must be cleanly shut down. wal_log_hints must have been enabled before the instances diverged; enabling it after the fact does not help. Data checksums enabled at initdb time also satisfy the requirement.

  7. Choose the recovery path. If the timeline is valid but the replica is looking at a fixed ID, a config change and restart suffice. If the target diverged as a former primary and prerequisites are met, use pg_rewind. If prerequisites fail or WAL is missing, re-clone.
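
For step 5, it helps to know which segment file holds the switch LSN. The sketch below derives the file name from a timeline ID and an LSN, assuming the default 16 MB wal_segment_size; the function name `wal_file_name` is hypothetical, and on a running server the built-in pg_walfile_name() does the same job.

```shell
#!/usr/bin/env bash
# wal_file_name TLI LSN
# LSN is in the usual "hi/lo" hex form, for example 0/3000158.
# Assumes 16 MB WAL segments (the default); other sizes need a different shift.
wal_file_name() {
  local tli=$1 hi=${2%%/*} lo=${2##*/}
  # With 16 MB segments the low 24 bits of "lo" are the offset inside the
  # segment, so shifting them away leaves the segment number.
  printf '%08X%08X%08X\n' "$tli" "$((0x$hi))" "$(( 0x$lo >> 24 ))"
}
```

With a switch LSN of 0/3000158 on timeline 1, `wal_file_name 1 0/3000158` gives the segment to look for in the target’s pg_wal or in the archive.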

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| pg_stat_replication.replay_lag | A replica that is behind is more likely to require rebuilding after a timeline switch | replay_lag above 30 s on an async replica and growing |
| pg_replication_slots.active | Inactive slots retain WAL that may be needed for rewind or catch-up | active = false with a stale restart_lsn |
| pg_wal directory size | Rapid growth indicates a replica is not consuming WAL, or a slot is retaining it | Size persistently above max_wal_size |
| pg_stat_database.conflicts | Query cancels on the replica indicate replay is blocked by long-running reads | Sustained nonzero values |
| Timeline ID on primary | Unexpected timeline increments signal unplanned promotions or failovers | Timeline changes outside scheduled maintenance |

Fixes

Set recovery_target_timeline = 'latest'

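If the replica is not a former primary and the only issue is a fixed timeline ID, edit postgresql.conf or postgresql.auto.conf. From PostgreSQL 12 onward the default is already latest, so a fixed ID usually comes from an explicit override left in one of these files: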

recovery_target_timeline = 'latest'

Ensure standby.signal exists in the data directory. Restart the replica. This is the safest and fastest fix when the replica’s history is otherwise consistent.
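
Because the replica is down, the setting has to be found on disk. The following sketch lists every active recovery_target_timeline line in the two files named above. It assumes both files live in the data directory and does not follow include directives; the function name `timeline_overrides` is hypothetical.

```shell
#!/usr/bin/env bash
# timeline_overrides PGDATA_DIR
# Print each uncommented recovery_target_timeline line, prefixed with the
# file it came from. Prints nothing when no override is set.
timeline_overrides() {
  local f
  for f in "$1/postgresql.conf" "$1/postgresql.auto.conf"; do
    [ -f "$f" ] &&
      grep -HE '^[[:space:]]*recovery_target_timeline[[:space:]]*=' "$f"
  done
  return 0
}

# Usage sketch (not executed here):
#   timeline_overrides "$PGDATA"
#   touch "$PGDATA/standby.signal"   # only if the file is missing
```

If the same setting appears in both files, the value in postgresql.auto.conf wins because it is read last.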

Rewind a former primary with pg_rewind

When the target was previously a primary and diverged, pg_rewind can resync it to the current source without a full base backup.

Prerequisites:

  • Target must be cleanly shut down.
  • wal_log_hints = on or data checksums enabled at initdb.
  • full_page_writes = on.
  • Target retains WAL back to the divergence point, or the missing segments can be restored from the WAL archive through the target’s restore_command (the -c flag).
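
The first prerequisite can be checked from the pg_controldata output before touching anything. This is a minimal sketch: the function name `cleanly_shut_down` is hypothetical, and it assumes the English-locale label `Database cluster state`.

```shell
#!/usr/bin/env bash
# cleanly_shut_down: read pg_controldata output on stdin and succeed only
# if the cluster state is "shut down" or "shut down in recovery".
cleanly_shut_down() {
  local state
  state=$(sed -n 's/^Database cluster state:[[:space:]]*//p')
  case $state in
    "shut down"|"shut down in recovery") return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch (not executed here): rehearse with --dry-run first.
#   pg_controldata "$PGDATA" | cleanly_shut_down &&
#     pg_rewind --target-pgdata="$PGDATA" \
#       --source-server="host=new_primary port=5432 user=replicator dbname=postgres" \
#       --dry-run
```

A dry run reports what pg_rewind would do without modifying the target, which makes it a low-risk way to confirm that the required WAL is reachable.
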
# Execute rewind from the live primary.
# -P shows progress; -c fetches missing WAL through the target's restore_command.
pg_rewind --target-pgdata="$PGDATA" \
  --source-server="host=new_primary port=5432 user=replicator dbname=postgres" \
  -P -c

After pg_rewind:

  • The target data directory has been modified in place and cannot be rolled back to its pre-rewind state.
  • Review postgresql.auto.conf and pg_hba.conf; the target may retain stale settings from when it was a primary.
  • Ensure standby.signal is present.
  • Start the target. It enters archive recovery and replays WAL from the new primary.

Tradeoffs: Faster than a full clone for large clusters, but it destructively modifies the target data directory. If wal_log_hints was not enabled before divergence, this path is closed.

Re-clone with pg_basebackup

If pg_rewind prerequisites are not met, or WAL back to the divergence is missing, rebuild the replica from scratch.

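Warning: pg_basebackup writes only into an empty or nonexistent directory and fails otherwise. Stop PostgreSQL on the target, then move the old data directory aside or remove it before running this; removing it is irreversible.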

pg_basebackup -D $PGDATA \
  -h new_primary -U replicator \
  -R -Xs -c fast -P -v

The -R flag creates standby.signal and seeds primary_conninfo in postgresql.auto.conf. Review the generated connection string before starting the replica.
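
A quick check of what -R produced can be scripted before the first start. The sketch below only confirms that standby.signal exists and prints the primary_conninfo line for review; the function name `standby_config_ok` is hypothetical. Note that the printed line may contain a password.

```shell
#!/usr/bin/env bash
# standby_config_ok PGDATA_DIR
# Fail with a message if standby.signal or primary_conninfo is missing;
# otherwise print the primary_conninfo line for manual review.
standby_config_ok() {
  local conf="$1/postgresql.auto.conf"
  local pattern='^[[:space:]]*primary_conninfo[[:space:]]*='
  [ -f "$1/standby.signal" ] || { echo "missing standby.signal"; return 1; }
  grep -q "$pattern" "$conf" 2>/dev/null ||
    { echo "primary_conninfo not set in postgresql.auto.conf"; return 1; }
  grep "$pattern" "$conf"
}

# Usage sketch (not executed here):
#   standby_config_ok "$PGDATA" && pg_ctl -D "$PGDATA" start
```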

Tradeoffs: Network- and time-intensive for multi-terabyte clusters, but it guarantees a consistent starting point and removes uncertainty about pre-divergence configuration.

Fix downstream cascading replicas

When a standby is promoted, its own replicas must also follow the new timeline. If they fail with a timeline mismatch, set recovery_target_timeline = 'latest' on each downstream node and restart.

Prevention

  • Always set recovery_target_timeline = 'latest' for any standby in an HA topology.
  • Enable wal_log_hints = on, or enable data checksums at initdb, so pg_rewind remains available. Either one satisfies the requirement, but it must be in place before a divergence happens.
  • Use replication slots and set max_slot_wal_keep_size to prevent unbounded WAL retention while still preserving enough history for catch-up.
  • Fence the old primary after failover. Stop PostgreSQL or isolate the host so it cannot restart independently and accept writes.
  • Test failover, rewind, and re-clone procedures monthly. The first time you try to rewind a former primary should not be during an incident.

How Netdata helps

  • Correlate replication lag in seconds and bytes with WAL generation rate on the primary to identify replicas at risk of falling behind before a timeline switch.
  • Alert on inactive replication slots and WAL directory growth before disk exhaustion blocks recovery.
  • Track checkpoint frequency and backend process states to distinguish a slow replica from a stuck recovery process.
  • Visualize replica conflict counts to detect hot-standby queries that block WAL replay.