PostgreSQL streaming replication broken: how to rebuild without full basebackup

When a replica logs "requested WAL segment has already been removed", or when an old primary must rejoin as a standby after failover, a full pg_basebackup of a multi-terabyte database can take hours and saturate network and disk. If the data files have diverged but are mostly identical, pg_rewind can resynchronize the target by copying only the blocks that changed. It is faster than a base backup but has strict prerequisites. If pg_rewind fails partway through, the target data directory is likely unrecoverable and only a fresh base backup is safe.

What this means

Physical streaming replication ships WAL from primary to replica. If the replica falls too far behind, the primary removes or recycles old WAL segments once checkpoints no longer need them, unless a replication slot or wal_keep_size (wal_keep_segments before PostgreSQL 13) holds them back. If the needed segment is gone from pg_wal and not in the archive, streaming stops.

After failover, promoting a standby increments the timeline. The old primary’s data files now reflect a different history. You cannot simply point the old primary at the new one: if it wrote any WAL after the point where the new timeline branched off, it cannot follow that timeline and recovery fails. pg_rewind scans the target’s WAL from the last checkpoint before the divergence point to find every block that changed on the target, then copies those blocks from the source, along with new relation files, WAL segments, and configuration files. It also writes a backup_label so that, on startup, the target replays the source’s WAL from that checkpoint and reaches a consistent state on the source’s timeline.

This works when the target cluster is intact, prerequisites are met, and the divergence is small enough that block-level sync is faster than a full rebuild.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Timeline divergence after failover | Old primary logs a timeline mismatch and refuses to stream from the new primary | pg_controldata on the target for “Latest checkpoint’s TimeLineID” |
| Primary recycled WAL before replica caught up | Replica logs "requested WAL segment has already been removed" | pg_stat_replication on the primary for sent versus replay LSN, and pg_wal file count |
| Missing or inactive replication slot | Primary no longer retains WAL for this replica; slot was dropped or consumer is gone | pg_replication_slots on the primary for active status and restart_lsn |
| pg_rewind prerequisites missing | pg_rewind exits immediately with an error about hints or checksums | SHOW wal_log_hints; or verify data checksums were enabled at initdb |
| Unclean target shutdown | pg_rewind reports target is not cleanly shut down and may need crash recovery | pg_controldata output for “Database cluster state” |

Quick checks

Run these read-only checks before deciding between pg_rewind, WAL archive rescue, or pg_basebackup.

# Check whether the target cluster was shut down cleanly
pg_controldata "$PGDATA" | grep "Database cluster state"

# Verify wal_log_hints or data checksums on the target. This works while the
# target is stopped; a non-zero "Data page checksum version" means checksums are on.
pg_controldata "$PGDATA" | grep -E "wal_log_hints setting|Data page checksum version"

-- Inspect physical replication slot health on the primary
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'physical';

-- Check replication lag from the primary's perspective
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

# On a running server, the same pg_rewind prerequisites can be read with psql
psql -c "SHOW wal_log_hints;" -c "SHOW data_checksums;"

# Estimate WAL retention pressure on the primary
du -sh "$PGDATA/pg_wal"
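
The target-side checks can be folded into one pass over pg_controldata output. The sketch below is only an illustration: rewind_precheck is a hypothetical helper name, and it assumes the English field labels that pg_controldata prints (a localized build would need different patterns).

```shell
#!/usr/bin/env bash
# rewind_precheck (hypothetical helper): reads `pg_controldata` output on stdin
# and prints the three facts that decide whether pg_rewind is worth attempting.
rewind_precheck() {
  awk -F': *' '
    /^Database cluster state:/         { state = $2 }
    /^wal_log_hints setting:/          { hints = $2 }
    /^Data page checksum version:/     { sums  = $2 }
    /^Latest checkpoint.s TimeLineID:/ { tli   = $2 }
    END {
      # A clean shutdown is reported as one of these two states.
      clean  = (state == "shut down" || state == "shut down in recovery") ? "yes" : "no"
      # Either wal_log_hints or data checksums satisfies the prerequisite.
      prereq = (hints == "on" || sums + 0 > 0) ? "yes" : "no"
      printf "timeline=%s clean_shutdown=%s hints_or_checksums=%s\n", tli, clean, prereq
    }'
}
```

A typical call would be pg_controldata "$PGDATA" | rewind_precheck, run on the stopped target. Comparing the reported timeline with the same field on the new primary shows whether a promotion has happened.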

How to diagnose it

  1. Read the exact error on the target. If the replica logs "requested WAL segment has already been removed", the primary has recycled that segment. If the error references a timeline mismatch, the clusters have diverged because of a promotion or an unrecoverable split-brain event.

  2. Verify pg_rewind prerequisites. The target must have been running with wal_log_hints = on, or have data checksums enabled (typically at initdb), before the clusters diverged. full_page_writes = on is also required, but it is the default. If neither hints nor checksums were in effect, pg_rewind cannot be used; plan for pg_basebackup.

  3. Check the WAL archive. If timelines have not diverged and only a few segments are missing from the primary’s local pg_wal, verify whether your archive_command stored them durably and whether the target’s restore_command can retrieve them. If the archive is complete, the replica can fetch the gap via restore_command and resume streaming.

  4. Assess target integrity. If pg_rewind was previously attempted on this target and aborted, the PostgreSQL documentation warns that the data directory is likely in an unrecoverable state. Do not retry pg_rewind. Reinitialize with pg_basebackup.

  5. Inspect replication slots on the primary. Query pg_replication_slots. An inactive physical slot with a stale restart_lsn indicates the primary is retaining WAL for a consumer that may never return. If the slot for this replica was dropped, nothing stops the primary from recycling WAL past the replica’s position; once that has happened, streaming alone cannot resume.

  6. Decide the fix path. If prerequisites are met, the target is clean, and timelines have diverged, use pg_rewind. If timelines are identical and only WAL is missing, try archive rescue. If the target is corrupt, prerequisites are missing, or divergence is massive, use pg_basebackup.
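
The decision in the last step can be written down as a small function, which is handy for embedding in a runbook. This is a simplified sketch: choose_fix_path and its four yes/no arguments are hypothetical, and it does not model the "divergence is massive" case, which still needs human judgment.

```shell
#!/usr/bin/env bash
# choose_fix_path TIMELINES_DIVERGED PREREQS_MET TARGET_INTACT ARCHIVE_COMPLETE
# Each argument is "yes" or "no". Prints pg_rewind, archive_rescue, or pg_basebackup.
choose_fix_path() {
  local diverged=$1 prereqs=$2 intact=$3 archive=$4

  # A corrupt target (for example after an aborted pg_rewind) always means a rebuild.
  if [ "$intact" != yes ]; then
    echo pg_basebackup
    return
  fi

  # Diverged timelines: pg_rewind if its prerequisites hold, otherwise rebuild.
  if [ "$diverged" = yes ]; then
    if [ "$prereqs" = yes ]; then echo pg_rewind; else echo pg_basebackup; fi
    return
  fi

  # Same timeline, only WAL missing: the archive can fill the gap if it is complete.
  if [ "$archive" = yes ]; then echo archive_rescue; else echo pg_basebackup; fi
}
```

For example, choose_fix_path yes yes yes no prints pg_rewind, while choose_fix_path no yes yes yes prints archive_rescue.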

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| pg_stat_replication.replay_lag | Measures how far behind the replica is; directly impacts RPO and failover time | Sustained lag above your RPO threshold, or sudden spikes after batch jobs |
| pg_replication_slots.restart_lsn | The oldest WAL the primary must retain for this slot; inactive slots cause unbounded growth | Slot has active = false and LSN lag is increasing |
| pg_wal size or segment count | Disk space consumed by WAL; slots and archive failures cause bloat | Growth rate exceeding the baseline set by max_wal_size |
| wal_log_hints and data_checksums | Required for pg_rewind; without them, failback requires full rebuild | wal_log_hints = off and data_checksums = off on a primary that may need failback |
| pg_stat_database.conflicts | Replica queries canceled by WAL replay; indicates replay is blocked | Non-zero counts that correlate with replication lag spikes |
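
As one way to turn the slot signal into an alert, the sketch below filters unaligned psql output for inactive slots that retain more WAL than a threshold. slot_alerts is a hypothetical helper, and the input format (slot_name|active|lag_bytes, as produced by psql -At) is an assumption of this example.

```shell
#!/usr/bin/env bash
# slot_alerts THRESHOLD_BYTES (hypothetical helper)
# Reads "slot_name|active|lag_bytes" lines on stdin; psql -At prints booleans
# as t/f. Emits one warning per inactive slot whose lag exceeds the threshold.
slot_alerts() {
  local threshold_bytes=$1
  awk -F'|' -v max="$threshold_bytes" '
    $2 == "f" && $3 + 0 > max + 0 {
      printf "WARN slot=%s inactive lag_bytes=%s\n", $1, $3
    }'
}
```

On the primary, the input could come from psql -AtX -c "SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) FROM pg_replication_slots WHERE slot_type = 'physical'" piped into slot_alerts 1073741824, which flags inactive slots holding more than 1 GiB.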

Fixes

Resynchronize with pg_rewind

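This is the standard path for failing back an old primary or repairing a diverged replica. The target must be shut down. The source can be a running primary accessed via libpq; the connecting role needs superuser rights or EXECUTE on the file-access functions pg_rewind calls (pg_ls_dir, pg_stat_file, pg_read_binary_file), and the connection string should name a database.

pg_rewind --target-pgdata="$PGDATA" \
          --source-server="host=new_primary port=5432 user=replicator dbname=postgres" \
          --write-recovery-conf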

On PostgreSQL 13 and later, --write-recovery-conf creates standby.signal and appends primary_conninfo to postgresql.auto.conf. pg_rewind in PostgreSQL 12 does not have this option, so create standby.signal and set primary_conninfo yourself. On PostgreSQL 11 and earlier, you must manage recovery.conf manually.

If the target was not cleanly shut down, pg_rewind (PostgreSQL 13 and later) attempts to run crash recovery in single-user mode before rewinding. This is automatic but adds risk; a clean shutdown is safer.

Critical post-rewind step: pg_rewind writes the source connection string into the target’s recovery configuration, and it also copies configuration files from the source wholesale. Before starting the target as a standby, review primary_conninfo, primary_slot_name, and any identity-specific settings such as application_name. Skipping this review can leave the old primary trying to connect to itself or claiming the wrong slot name.
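
Part of that review can be automated. The following sketch looks at the effective primary_conninfo in a configuration file and refuses it if it names the node's own host. check_standby_conf is a hypothetical helper; it assumes the host appears as a plain host=NAME keyword and relies on the fact that the last occurrence of a setting wins.

```shell
#!/usr/bin/env bash
# check_standby_conf CONF_FILE OWN_HOST (hypothetical helper)
# Prints the effective primary_conninfo line. Returns 1 if none is set,
# 2 if it points at OWN_HOST, 0 otherwise.
check_standby_conf() {
  local conf=$1 own_host=$2 line
  # When a setting is repeated, the last occurrence takes effect.
  line=$(grep -E '^[[:space:]]*primary_conninfo[[:space:]]*=' "$conf" | tail -n 1) || true
  if [ -z "$line" ]; then
    echo "no primary_conninfo set in $conf"
    return 1
  fi
  echo "$line"
  case "$line" in
    # host=OWN_HOST followed by a non-hostname character, or at end of line
    *"host=$own_host"[!A-Za-z0-9_.-]*|*"host=$own_host")
      echo "ERROR: primary_conninfo points at this node ($own_host)"
      return 2
      ;;
  esac
  return 0
}
```

It could be run as check_standby_conf "$PGDATA/postgresql.auto.conf" "$(hostname)" before starting the rewound node. It does not validate primary_slot_name or application_name, which still need a manual look.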

Tradeoffs: pg_rewind is fast for large databases because it copies only changed blocks. It requires the source to be reachable and the target to be down. It also demands manual configuration cleanup after the sync.

WAL archive rescue

If timelines have not diverged and only a gap in streamed WAL prevents replication, the archive may save you. Ensure the target’s postgresql.conf (or postgresql.auto.conf) defines a working restore_command, then start or restart the target. PostgreSQL invokes restore_command for each missing segment. Once the gap is filled, streaming replication resumes automatically.
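
Before restarting the target, it can help to confirm that restore_command really retrieves the missing segment. The sketch below mimics PostgreSQL's %f and %p substitution and runs the command into a scratch directory. try_restore_command is a hypothetical helper; the cp-based command in the usage note follows the pattern shown in the PostgreSQL documentation, and the archive path is a placeholder.

```shell
#!/usr/bin/env bash
# try_restore_command RESTORE_COMMAND SEGMENT_NAME (hypothetical helper)
# Substitutes %f (file name) and %p (destination path), runs the command,
# and reports whether a non-empty file arrived. Nothing is written to PGDATA.
try_restore_command() {
  local cmd=$1 segment=$2 scratch dest
  scratch=$(mktemp -d) || return 1
  dest="$scratch/$segment"
  cmd=${cmd//\%f/$segment}   # %f -> name of the WAL file to fetch
  cmd=${cmd//\%p/$dest}      # %p -> where the server wants it copied
  if sh -c "$cmd" && [ -s "$dest" ]; then
    echo "restored $segment"
    rm -rf "$scratch"
    return 0
  fi
  echo "could not restore $segment"
  rm -rf "$scratch"
  return 1
}
```

For example, try_restore_command 'cp /mnt/server/archivedir/%f "%p"' 000000010000000000000003 checks a single segment; use the segment name from the replica's error message.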

Tradeoffs: This is the lightest fix because no block copying occurs. It only works if your archive is complete and no timeline switch happened.

When to use pg_basebackup instead

Do not attempt pg_rewind if any of the following are true:

  • The target was running without wal_log_hints and without data checksums.
  • A previous pg_rewind attempt crashed on this target data directory. The documentation explicitly warns the directory is likely corrupt.
  • The divergence is so extensive that pg_rewind would copy nearly all blocks anyway.

In these cases, reinitialize the replica with pg_basebackup. It is slower and consumes more I/O, but it is the safest path when pg_rewind prerequisites are not satisfied.
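
A rebuild command might look like the sketch below, which only prints the pg_basebackup invocation so it can be reviewed before anyone runs it. basebackup_cmd is a hypothetical helper, and the host, role, directory, and slot names are placeholders. pg_basebackup expects the target directory to be empty or absent, so the old data directory has to be moved aside first.

```shell
#!/usr/bin/env bash
# basebackup_cmd PRIMARY_HOST REPL_USER TARGET_DIR [SLOT] (hypothetical helper)
# Prints, but does not run, a pg_basebackup command for rebuilding a standby.
basebackup_cmd() {
  local host=$1 user=$2 dir=$3 slot=${4:-}
  local -a cmd=(
    pg_basebackup
    --host="$host" --username="$user" --pgdata="$dir"
    --wal-method=stream      # stream WAL alongside the copy
    --checkpoint=fast        # do not wait for the next scheduled checkpoint
    --write-recovery-conf    # set up standby.signal and primary_conninfo
    --progress
  )
  if [ -n "$slot" ]; then
    cmd+=(--slot="$slot")    # use an existing physical replication slot
  fi
  # Joined with spaces for display; paths containing spaces would need quoting.
  printf '%s\n' "${cmd[*]}"
}
```

Calling basebackup_cmd new_primary replicator /var/lib/postgresql/data replica1 prints the full command line, including --slot=replica1.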

Prevention

  • Set wal_log_hints = on in postgresql.conf, or enable data checksums at initdb. Either satisfies pg_rewind prerequisites. Without one, a failed primary can only be rebuilt with a full base backup.
  • Use physical replication slots. Slots guarantee the primary retains WAL for known replicas. Monitor pg_replication_slots.active and restart_lsn.
  • Maintain a working WAL archive. A reliable archive_command and restore_command give you a rescue path that avoids both pg_rewind and pg_basebackup when only a small WAL gap exists.
  • Set max_slot_wal_keep_size on PostgreSQL 13 and later. This caps how much WAL an inactive slot can retain, preventing a forgotten slot from filling your disk.
  • Test failback before an incident. Verify that your HA tooling or runbooks can execute pg_rewind cleanly during a drill. A procedure that has never been tested is not a procedure.
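
Several of these items can be audited from pg_settings. The sketch below reads name|setting pairs and prints a warning for each gap; audit_failback_settings is a hypothetical helper, and the choice of which settings to flag is an assumption of this example rather than an exhaustive policy.

```shell
#!/usr/bin/env bash
# audit_failback_settings (hypothetical helper)
# Reads "name|setting" lines (psql -At output from pg_settings) on stdin and
# warns about settings that would block failback or let WAL grow unbounded.
audit_failback_settings() {
  awk -F'|' '
    { val[$1] = $2 }
    END {
      if (val["wal_log_hints"] != "on" && val["data_checksums"] != "on")
        print "WARN pg_rewind prerequisites missing: wal_log_hints and data_checksums are both off"
      # -1 means no cap; the setting is absent before PostgreSQL 13.
      if (val["max_slot_wal_keep_size"] == "-1")
        print "WARN max_slot_wal_keep_size is unlimited"
      if (val["archive_mode"] == "off")
        print "WARN archive_mode is off"
    }'
}
```

The input could come from psql -AtX -c "SELECT name, setting FROM pg_settings WHERE name IN ('wal_log_hints', 'data_checksums', 'max_slot_wal_keep_size', 'archive_mode')" piped into audit_failback_settings. No output means none of the three warnings applies.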

How Netdata helps

  • Correlate replication lag metrics with WAL directory size to distinguish between normal catch-up and unbounded WAL retention from a stuck slot.
  • Alert on inactive replication slots before they accumulate enough WAL to threaten disk space.
  • Surface hot standby conflicts alongside replication lag to determine whether replica queries are blocking WAL apply.