PostgreSQL replica out of sync: timeline mismatches and recovery
A PostgreSQL streaming replica that was healthy yesterday now refuses to start with a timeline mismatch error, or a former primary that you brought back online cannot rejoin the cluster as a standby. The log shows requested timeline N is not a child of this server's history and the replica loops in crash recovery while the current primary continues to diverge. This guide covers identifying the divergence point, choosing between pg_rewind and a full re-clone, and recovering the replica without introducing split-brain or data loss.
What this means
PostgreSQL creates a new timeline each time a standby is promoted, incrementing the timeline ID. Parent-child relationships are recorded in .history files inside pg_wal. A physical streaming replica follows a specific timeline. If the timeline the replica is asked to follow did not branch from the replica’s own history, at or after the point the replica has already reached, startup exits with a fatal error.
Three scenarios cause this:
- A standby is configured to follow a fixed timeline ID and a promotion switched the primary to a new one.
- A former primary was restarted as a standby after diverging.
- A cascading replica downstream of a promoted standby does not switch to the latest timeline.
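To make the parent-child relationship concrete, here is a minimal shell sketch that reads a timeline history file and checks whether a given timeline ID is listed as an ancestor. It assumes the usual layout of one ancestor per line (parent timeline ID, switch LSN, reason). The function names list_ancestors and is_ancestor are illustrative, not PostgreSQL tools.

```shell
# Print "<parent timeline ID> <switch LSN>" for every ancestor recorded in
# the history file $1, skipping blank lines and comment lines.
list_ancestors() {
  awk '!/^[[:space:]]*(#|$)/ { print $1, $2 }' "$1"
}

# Succeed (exit 0) if timeline ID $2 is listed as an ancestor in file $1.
is_ancestor() {
  list_ancestors "$1" |
    awk -v tli="$2" '$1 == tli { found = 1 } END { exit !found }'
}
```

For example, running is_ancestor "$PGDATA/pg_wal/00000003.history" 2 would succeed only if timeline 3 descends from timeline 2. Timeline 1 has no history file, so an empty pg_wal/*.history glob is expected on a cluster that has never been promoted.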
flowchart TD
A[Replica fails to start] --> B{Log shows timeline mismatch?}
B -->|Yes| C[Check timeline IDs on both nodes]
C --> D{Is target timeline a descendant?}
D -->|No| E{Former primary? wal_log_hints on?}
E -->|Yes| F[pg_rewind from current primary]
E -->|No| G[Re-clone with pg_basebackup]
D -->|Yes| H[Set recovery_target_timeline = latest]
B -->|No| I[Check WAL gap and slot health]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| recovery_target_timeline fixed to a stale ID | Replica fails immediately after promotion with a timeline mismatch | SHOW recovery_target_timeline; or grep postgresql.conf and postgresql.auto.conf |
| Former primary rejoining without rewinding | Old primary restarted as standby; pg_controldata shows it still on the older timeline, with WAL written past the point where the new primary forked | Timeline ID and latest checkpoint location on both nodes with pg_controldata |
| wal_log_hints missing at divergence time | pg_rewind fails complaining about wal_log_hints | SHOW wal_log_hints; and SHOW data_checksums; |
| WAL missing back to divergence point | pg_rewind fails with could not find previous WAL record | WAL file existence in pg_wal and on the source |
| Downstream replica after cascading promotion | A standby was promoted and its own replicas cannot attach to the new primary | recovery_target_timeline on downstream nodes |
Quick checks
Run these safe, read-only checks before making changes.
# Timeline and latest checkpoint on the failed node
pg_controldata $PGDATA | grep -iE 'timeline|checkpoint'
cat $PGDATA/pg_wal/*.history
# Timeline on the intended new primary
pg_controldata $PGDATA | grep -i 'timeline'
-- Recovery target configuration and rewind prerequisites on the replica
SHOW recovery_target_timeline;
SHOW wal_log_hints;
SHOW data_checksums;
SHOW full_page_writes;
-- Current timeline from SQL
SELECT timeline_id FROM pg_control_checkpoint();
# WAL presence around the divergence point
ls -la $PGDATA/pg_wal/ | head -20
-- Replication slot health
SELECT slot_name, active, restart_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
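When comparing two nodes it can help to pull single fields out of pg_controldata rather than eyeballing the full output. The helper below is a small sketch that reads pg_controldata text on standard input and prints the value of one named field. The name controldata_field is hypothetical, and the sketch assumes the English field labels (run pg_controldata with LC_ALL=C if the server locale differs).

```shell
# Print the value of the pg_controldata field named $1, reading the
# pg_controldata output from stdin. Only the first match is printed.
controldata_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {
      value = substr($0, length(key) + 2)   # text after "key:"
      sub(/^[ \t]+/, "", value)             # drop the alignment padding
      print value
      exit
    }'
}

# Example (run on each node, then compare the two values):
#   LC_ALL=C pg_controldata "$PGDATA" | controldata_field "Latest checkpoint's TimeLineID"
#   LC_ALL=C pg_controldata "$PGDATA" | controldata_field "Database cluster state"
```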
How to diagnose it
1. Read the exact error. Look for FATAL: requested timeline N is not a child of this server's history in the replica log. Note the requested timeline ID and the server’s current timeline.
2. Compare timeline IDs. Run pg_controldata on both the target (failed replica) and the source (current primary). If the target is still on an older timeline but its latest checkpoint lies beyond the LSN where the source’s timeline forked, the target was likely a former primary that diverged.
3. Inspect the history file. On the target, read the .history file in pg_wal that corresponds to its current timeline. It lists each parent timeline and the LSN where the switch happened. Confirm whether the source’s timeline appears in that lineage. On the source, the newest .history file shows the LSN at which its current timeline forked.
4. Determine if the target is a former primary. Check Database cluster state in pg_controldata. If it shows in production or shut down (rather than in archive recovery or shut down in recovery), the instance last ran as a primary and may have accepted writes after the fork.
5. Verify WAL reachability. pg_rewind needs WAL on the target reaching back to the divergence point. List pg_wal files around the switch LSN. If they were recycled, check whether a WAL archive still holds them so that pg_rewind -c can fetch them through the target’s restore_command.
6. Verify pg_rewind prerequisites. The target must be cleanly shut down. wal_log_hints must have been enabled before the instances diverged; enabling it after the fact does not help. Data checksums enabled at initdb time also satisfy the requirement.
7. Choose the recovery path. If the timeline is valid but the replica is looking at a fixed ID, a config change and restart suffice. If the target diverged as a former primary and prerequisites are met, use pg_rewind. If prerequisites fail or WAL is missing, re-clone.
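The final decision can be written down as a tiny function, which is handy for runbooks or automation that needs to log why a path was chosen. This is only a sketch of the decision described above: the function name, its yes/no arguments, and the output labels are all illustrative, and the inputs still have to come from the manual checks.

```shell
# Arguments (each "yes" or "no"):
#   $1  target history is consistent with the source timeline
#   $2  wal_log_hints or data checksums were enabled before divergence
#   $3  WAL back to the divergence point is available
# Prints one of: set-latest, pg_rewind, pg_basebackup
choose_recovery_path() {
  local consistent=$1 hints=$2 wal=$3
  if [ "$consistent" = yes ]; then
    echo "set-latest"        # config change and restart suffice
  elif [ "$hints" = yes ] && [ "$wal" = yes ]; then
    echo "pg_rewind"         # diverged, but rewind prerequisites hold
  else
    echo "pg_basebackup"     # prerequisites fail or WAL is missing
  fi
}
```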
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| pg_stat_replication.replay_lag | A replica that is behind is more likely to require rebuilding after a timeline switch | replay_lag > 30 s async and growing |
| pg_replication_slots.active | Inactive slots retain WAL that may be needed for rewind or catch-up | active = false with a stale restart_lsn |
| pg_wal directory size | Rapid growth indicates a replica is not consuming WAL, or a slot is retaining it | Size persistently above max_wal_size |
| pg_stat_database.conflicts | Query cancels on the replica indicate replay is blocked by long-running reads | Sustained nonzero values |
| Timeline ID on primary | Unexpected timeline increments signal unplanned promotions or failovers | Timeline changes outside scheduled maintenance |
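As a rough illustration of turning the slot signal into a check, the sketch below filters rows of slot_name, active, and lag_bytes and prints only the slots worth a look. It assumes unaligned, tuples-only psql output (psql -At), where booleans print as t or f and fields are separated by |. The function name flag_slots and the byte threshold are assumptions to adapt.

```shell
# Read "slot_name|active|lag_bytes" rows on stdin; $1 is the lag threshold
# in bytes. Prints inactive slots and active slots lagging past $1.
flag_slots() {
  awk -F'|' -v max="$1" '
    $2 == "f"             { print $1 ": inactive, retaining " $3 " bytes" }
    $2 == "t" && $3 > max { print $1 ": active but " $3 " bytes behind" }'
}

# Example, run against the primary (1 GiB threshold chosen arbitrarily):
#   psql -At -c "SELECT slot_name, active,
#                  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#                FROM pg_replication_slots" | flag_slots 1073741824
```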
Fixes
Set recovery_target_timeline = 'latest'
If the replica is not a former primary and the only issue is a fixed timeline ID, edit postgresql.conf or postgresql.auto.conf:
recovery_target_timeline = 'latest'
Ensure standby.signal exists in the data directory. Restart the replica. This is the safest and fastest fix when the replica’s history is otherwise consistent.
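Before editing, it can be useful to see where a fixed timeline is actually set. The sketch below greps both files for an uncommented recovery_target_timeline assignment and prints any that is not latest. It assumes both files live in the data directory (on some packagings postgresql.conf is elsewhere) and does not follow include directives. The name find_fixed_timeline is hypothetical.

```shell
# Print "file:line:setting" for each uncommented recovery_target_timeline
# assignment in data directory $1 whose line does not mention latest.
# Prints nothing when no fixed timeline is configured.
find_fixed_timeline() {
  local dir=$1 f
  for f in "$dir/postgresql.conf" "$dir/postgresql.auto.conf"; do
    [ -f "$f" ] || continue
    grep -HnE '^[[:space:]]*recovery_target_timeline[[:space:]]*=' "$f" |
      grep -v 'latest' || true
  done
}

# Example: find_fixed_timeline "$PGDATA"
```

Because postgresql.auto.conf is read after postgresql.conf, a stale value there overrides a correct one in the main file, which is why both are checked.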
Rewind a former primary with pg_rewind
When the target was previously a primary and diverged, pg_rewind can resync it to the current source without a full base backup.
Prerequisites:
- Target must be cleanly shut down.
- wal_log_hints = on or data checksums enabled at initdb.
- full_page_writes = on.
- Target retains WAL back to the divergence point, or a WAL archive holds the missing segments so that pg_rewind -c can fetch them through restore_command.
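# Run on the target host with PostgreSQL stopped; pulls changes from the live primary
# -P shows progress; -c fetches missing WAL through the target's restore_command
pg_rewind --target-pgdata=$PGDATA \
  --source-server="host=new_primary port=5432 user=replicator dbname=postgres" \
  -P -c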
After pg_rewind:
- The command destructively modifies the target data directory.
- Review postgresql.auto.conf and pg_hba.conf; pg_rewind copies configuration files from the source, so settings such as primary_conninfo may not suit the target.
- Ensure standby.signal is present.
- Start the target. It enters archive recovery and replays WAL from the new primary.
Tradeoffs: Faster than a full clone for large clusters, but it destructively modifies the target data directory. If wal_log_hints was not enabled before divergence, this path is closed.
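A quick pre-flight on the target can catch a closed path before pg_rewind is attempted. The sketch below reads the target's pg_controldata output and reports which prerequisites appear unmet. It assumes English field labels (LC_ALL=C), and it can only show the settings as recorded now, not whether wal_log_hints was already on before the divergence. The function names are illustrative.

```shell
# Print the value of field $2 from pg_controldata text $1.
cd_field() {
  printf '%s\n' "$1" | sed -n "s/^$2:[[:space:]]*//p" | head -n 1
}

# Read the TARGET's pg_controldata output on stdin. Prints one line per
# unmet prerequisite and returns nonzero if any check fails.
rewind_preflight() {
  local text hints sums rc=0
  text=$(cat)
  case "$(cd_field "$text" "Database cluster state")" in
    "shut down"|"shut down in recovery") ;;
    *) echo "target is not cleanly shut down"; rc=1 ;;
  esac
  hints=$(cd_field "$text" "wal_log_hints setting")
  sums=$(cd_field "$text" "Data page checksum version")
  if [ "$hints" != on ] && { [ -z "$sums" ] || [ "$sums" = 0 ]; }; then
    echo "neither wal_log_hints nor data checksums are enabled"; rc=1
  fi
  if [ "$(cd_field "$text" "Latest checkpoint's full_page_writes")" != on ]; then
    echo "full_page_writes is off"; rc=1
  fi
  return "$rc"
}

# Example: LC_ALL=C pg_controldata "$PGDATA" | rewind_preflight
```

If the pre-flight passes, pg_rewind's own --dry-run option is a reasonable next step, since it goes through the rewind without modifying the target.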
Re-clone with pg_basebackup
If pg_rewind prerequisites are not met, or WAL back to the divergence is missing, rebuild the replica from scratch.
Warning: pg_basebackup refuses to write into a non-empty directory, so the old contents of $PGDATA must be moved aside or removed first, and removal is irreversible. Stop PostgreSQL on the target before doing either.
pg_basebackup -D $PGDATA \
-h new_primary -U replicator \
-R -Xs -c fast -P -v
The -R flag creates standby.signal and seeds primary_conninfo in postgresql.auto.conf. Review the generated connection string before starting the replica.
Tradeoffs: Network- and time-intensive for multi-terabyte clusters, but it guarantees a consistent starting point and removes uncertainty about pre-divergence configuration.
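One cautious way to prepare the target is to move the old data directory aside instead of deleting it, so it stays available until the new replica is confirmed healthy. The sketch below assumes there is enough disk space to keep both copies and that the data directory is not itself a mount point. The function name and the .old.<timestamp> suffix are hypothetical.

```shell
# Rename data directory $1 to "$1.old.<timestamp>", recreate $1 empty with
# mode 0700, and print the backup path. Refuses to run if postmaster.pid
# is present, since that suggests PostgreSQL is still running.
move_aside_datadir() {
  local dir=${1%/} backup
  if [ ! -d "$dir" ]; then
    echo "no such directory: $dir" >&2
    return 1
  fi
  if [ -e "$dir/postmaster.pid" ]; then
    echo "postmaster.pid present; stop PostgreSQL first" >&2
    return 1
  fi
  backup="$dir.old.$(date +%Y%m%d%H%M%S)"
  mv "$dir" "$backup" && mkdir -m 0700 "$dir" && echo "$backup"
}

# Example: move_aside_datadir "$PGDATA", then run pg_basebackup as above.
```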
Fix downstream cascading replicas
When a standby is promoted, its own replicas must also follow the new timeline. If they fail with a timeline mismatch, set recovery_target_timeline = 'latest' on each downstream node and restart.
Prevention
- Always set recovery_target_timeline = 'latest' for any standby in an HA topology.
- Enable wal_log_hints = on and data checksums at initdb so pg_rewind remains available.
- Use replication slots and set max_slot_wal_keep_size to prevent unbounded WAL retention while still preserving enough history for catch-up.
- Fence the old primary after failover. Stop PostgreSQL or isolate the host so it cannot restart independently and accept writes.
- Test failover, rewind, and re-clone procedures monthly. The first time you try to rewind a former primary should not be during an incident.
How Netdata helps
- Correlate replication lag in seconds and bytes with WAL generation rate on the primary to identify replicas at risk of falling behind before a timeline switch.
- Alert on inactive replication slots and WAL directory growth before disk exhaustion blocks recovery.
- Track checkpoint frequency and backend process states to distinguish a slow replica from a stuck recovery process.
- Visualize replica conflict counts to detect hot-standby queries that block WAL replay.