PostgreSQL streaming replication broken: how to rebuild without full basebackup
When a replica logs “requested WAL segment has already been removed”, or when the old primary must rejoin as a standby after failover, a full pg_basebackup of a multi-terabyte database can take hours and saturate network and disk. If the data files have diverged but are mostly identical, pg_rewind can resync by copying only the changed blocks. It is faster than a base backup, but has strict prerequisites. If pg_rewind crashes mid-operation, the target data directory is likely corrupt and only a fresh base backup is safe.
What this means
Physical streaming replication ships WAL bytes from primary to replica. If the replica falls too far behind, the primary removes or recycles old WAL segments after checkpoints, subject to max_wal_size and any retention settings such as wal_keep_size or replication slots. If the needed segment is gone from pg_wal and not in the archive, streaming stops.
After failover, promoting a standby increments the timeline. The old primary’s data files now reflect a different history. You cannot simply point the old primary at the new one; PostgreSQL refuses to stream because the timelines and checkpoint histories no longer match. pg_rewind scans the target’s WAL, starting at the last checkpoint before the point of divergence, to find every block that changed on the target after the split. It then copies those blocks from the source, along with new relation files, WAL segments, and configuration files. After rewinding, it writes a backup_label so the target begins WAL replay at the checkpoint created at failover and follows the source’s timeline.
This works when the target cluster is intact, prerequisites are met, and the divergence is small enough that block-level sync is faster than a full rebuild.
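As a minimal sketch of how a timeline switch can be spotted, the helpers below pull the timeline ID out of pg_controldata output and compare it with a second value. The function names controldata_timeline and timeline_status are hypothetical, and differing IDs only suggest that a promotion happened; they do not prove the data files have diverged.

```shell
# Hypothetical helper: print the timeline ID found in pg_controldata
# output read from stdin. The "." in the pattern stands in for the
# apostrophe in "checkpoint's" to keep shell quoting simple.
controldata_timeline() {
    awk -F': *' '/^Latest checkpoint.s TimeLineID:/ { print $2; exit }'
}

# Hypothetical helper: compare two timeline IDs (target first, source second).
timeline_status() {
    if [ "$1" = "$2" ]; then
        echo "same-timeline"
    else
        echo "timeline-switch"
    fi
}

# Example usage (assumes the target is stopped and the source is reachable
# as "new_primary", a placeholder host name):
#   target_tli=$(pg_controldata "$PGDATA" | controldata_timeline)
#   source_tli=$(psql -h new_primary -Atc "SELECT timeline_id FROM pg_control_checkpoint()")
#   timeline_status "$target_tli" "$source_tli"
```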
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Timeline divergence after failover | Old primary logs a timeline mismatch and refuses to stream from the new primary | pg_controldata on the target for “Latest checkpoint’s TimeLineID” |
| Primary recycled WAL before replica caught up | Replica logs requested WAL segment has already been removed | pg_stat_replication on the primary for sent versus replay LSN, and pg_wal file count |
| Missing or inactive replication slot | Primary no longer retains WAL for this replica; slot was dropped or consumer is gone | pg_replication_slots on the primary for active status and restart_lsn |
| pg_rewind prerequisites missing | pg_rewind exits immediately with an error about hints or checksums | SHOW wal_log_hints; or verify data checksums were enabled at initdb |
| Unclean target shutdown | pg_rewind reports target is not cleanly shut down and may need crash recovery | pg_controldata output for “Database cluster state” |
Quick checks
Run these read-only checks before deciding between pg_rewind, WAL archive rescue, or pg_basebackup.
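# Check if the target cluster is shut down cleanly
pg_controldata "$PGDATA" | grep "Database cluster state"
# Verify wal_log_hints is enabled on the target (needs a running server;
# on a stopped cluster, read "wal_log_hints setting" from pg_controldata)
psql -c "SHOW wal_log_hints;"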
-- Inspect physical replication slot health on the primary
SELECT slot_name, active, restart_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'physical';
-- Check replication lag from the primary's perspective
SELECT client_addr, state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
# Verify data checksums are enabled (alternative prerequisite for pg_rewind);
# on a stopped cluster, a nonzero "Data page checksum version" in pg_controldata means enabled
psql -c "SHOW data_checksums;"
# Estimate WAL retention pressure on the primary
du -sh "$PGDATA/pg_wal"
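Because the target is usually stopped when you are deciding whether to rewind, the psql checks above may not be available. The sketch below reads the same facts from pg_controldata output instead. The function name rewind_precheck is hypothetical, and it covers only the cluster state and the hints/checksums prerequisite, not full_page_writes or source reachability.

```shell
# Hypothetical helper: read pg_controldata output on stdin and report
# whether the target looks ready for pg_rewind. Exits nonzero on failure.
rewind_precheck() {
    awk -F': *' '
        /^Database cluster state:/     { state = $2 }
        /^wal_log_hints setting:/      { hints = $2 }
        /^Data page checksum version:/ { checksums = $2 }
        END {
            ok = 1
            # A clean shutdown is reported as one of these two states.
            if (state != "shut down" && state != "shut down in recovery") {
                print "FAIL: cluster state is \"" state "\""
                ok = 0
            }
            # Either wal_log_hints or data checksums must be enabled.
            if (hints != "on" && checksums + 0 == 0) {
                print "FAIL: neither wal_log_hints nor data checksums enabled"
                ok = 0
            }
            if (ok) print "OK: prerequisites look satisfied"
            exit !ok
        }'
}

# Example usage:
#   pg_controldata "$PGDATA" | rewind_precheck
```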
How to diagnose it
1. Read the exact error on the target. If the replica logs “requested WAL segment has already been removed”, the primary has recycled that segment. If the error references a timeline mismatch, the clusters have diverged because of a promotion or an unrecoverable split-brain event.
2. Verify pg_rewind prerequisites. The target must have been running with wal_log_hints = on, or have data checksums enabled. full_page_writes = on is also required, but it defaults to on. If neither hints nor checksums are available, pg_rewind is impossible. Plan for pg_basebackup.
3. Check the WAL archive. If timelines have not diverged and only a few segments are missing from the primary’s local pg_wal, verify whether your archive_command stored them durably and whether the target’s restore_command can retrieve them. If the archive is complete, the replica can fetch the gap via restore_command and resume streaming.
4. Assess target integrity. If pg_rewind was previously attempted on this target and aborted, the PostgreSQL documentation warns that the data directory is likely in an unrecoverable state. Do not retry pg_rewind. Reinitialize with pg_basebackup.
5. Inspect replication slots on the primary. Query pg_replication_slots. An inactive physical slot with a stale restart_lsn indicates the primary is retaining WAL for a consumer that may never return. If the slot for this replica was dropped, the primary may already have recycled WAL past the replica’s position, in which case streaming alone cannot resume.
6. Decide the fix path. If prerequisites are met, the target is clean, and timelines have diverged, use pg_rewind. If timelines are identical and only WAL is missing, try archive rescue. If the target is corrupt, prerequisites are missing, or divergence is massive, use pg_basebackup.
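The decision in the last step can be sketched as a small function. The name choose_fix_path and its yes/no arguments are hypothetical; the caller gathers the four answers from the earlier steps. Falling back to pg_basebackup when WAL is missing and the archive is incomplete is an assumption, and the "divergence is massive" case is left to human judgment.

```shell
# Hypothetical helper. Arguments, each "yes" or "no":
#   $1 pg_rewind prerequisites met   $2 a previous pg_rewind attempt failed
#   $3 timelines have diverged       $4 WAL archive covers the gap
# Runs in a subshell so the variables do not leak into the caller.
choose_fix_path() (
    prereqs_ok=$1
    rewind_failed=$2
    diverged=$3
    archive_ok=$4

    if [ "$rewind_failed" = yes ]; then
        # An aborted rewind leaves the target in an unknown state.
        echo pg_basebackup
    elif [ "$diverged" = yes ]; then
        if [ "$prereqs_ok" = yes ]; then
            echo pg_rewind
        else
            echo pg_basebackup
        fi
    elif [ "$archive_ok" = yes ]; then
        echo archive-rescue
    else
        # Assumption: no divergence, but the missing WAL is gone for good.
        echo pg_basebackup
    fi
)

# Example usage:
#   choose_fix_path yes no yes no
```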
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| pg_stat_replication.replay_lag | Measures how far behind the replica is; directly impacts RPO and failover time | Sustained lag above your RPO threshold, or sudden spikes after batch jobs |
| pg_replication_slots.restart_lsn | The oldest WAL the primary must retain for this slot; inactive slots cause unbounded growth | Slot is active = false and LSN lag is increasing |
| pg_wal size or segment count | Disk space consumed by WAL; slots and archive failures cause bloat | Growth rate exceeding the baseline set by max_wal_size |
| wal_log_hints and data_checksums | Required for pg_rewind; without them, failback requires full rebuild | wal_log_hints = off and data_checksums = off on a primary that may need failback |
| pg_stat_database.conflicts | Replica queries canceled by WAL replay; indicates replay is blocked | Non-zero counts that correlate with replication lag spikes |
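As an illustration of the inactive-slot warning sign, the sketch below filters unaligned psql output for slots that are inactive and retain more WAL than a chosen limit. The function name flag_stale_slots and the 1 GiB limit in the usage line are assumptions, not recommended values.

```shell
# Hypothetical helper: read "slot_name|active|lag_bytes" rows on stdin
# (psql -At prints booleans as t/f) and warn about inactive slots whose
# retained WAL exceeds the byte limit given as the first argument.
flag_stale_slots() {
    awk -F'|' -v limit="$1" '
        $2 == "f" && $3 + 0 > limit + 0 {
            print "WARN: inactive slot " $1 " retains " $3 " bytes of WAL"
        }'
}

# Example usage on the primary (1073741824 bytes = 1 GiB, an arbitrary limit):
#   psql -At -c "SELECT slot_name, active,
#                       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#                FROM pg_replication_slots
#                WHERE slot_type = 'physical'" | flag_stale_slots 1073741824
```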
Fixes
Resynchronize with pg_rewind
This is the standard path for failing back an old primary or repairing a diverged replica. The target must be shut down. The source can be a running primary accessed via libpq.
pg_rewind --target-pgdata="$PGDATA" \
    --source-server="host=new_primary port=5432 user=replicator dbname=postgres" \
    --write-recovery-conf
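On PostgreSQL 13 and later, --write-recovery-conf creates standby.signal and appends primary_conninfo to postgresql.auto.conf. On PostgreSQL 12, pg_rewind does not have this option, so create standby.signal and set primary_conninfo yourself. On PostgreSQL 11 and earlier, you must manage recovery.conf manually.
If the target was not cleanly shut down, pg_rewind on PostgreSQL 13 and later attempts to run crash recovery in single-user mode before rewinding; older versions stop with an error instead. The automatic recovery adds risk, so a clean shutdown is safer.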
Critical post-rewind step: pg_rewind writes the source connection string into the target’s recovery configuration, and it also copies configuration files that live in the source’s data directory onto the target. Before starting the target as a standby, review primary_conninfo, primary_slot_name, and any identity-specific settings such as application_name. Failure to do this can cause the old primary to connect to itself or claim the wrong slot name.
Tradeoffs: pg_rewind is fast for large databases because it copies only changed blocks. It requires the source to be reachable and the target to be down. It also demands manual configuration cleanup after the sync.
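One way to make the post-rewind review harder to skip is to print the relevant settings before the first start. This is a minimal sketch; the function name show_standby_settings is hypothetical, and it inspects only the file you pass it, so settings pulled in through include directives are not covered.

```shell
# Hypothetical helper: print the standby-related settings found in a
# configuration file so they can be reviewed before starting the target.
# application_name, if set, appears inside the primary_conninfo value.
show_standby_settings() {
    grep -E '^[[:space:]]*(primary_conninfo|primary_slot_name|restore_command)[[:space:]]*=' "$1"
}

# Example usage on the rewound target:
#   show_standby_settings "$PGDATA/postgresql.auto.conf"
#   show_standby_settings "$PGDATA/postgresql.conf"
#   ls -l "$PGDATA/standby.signal"
```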
WAL archive rescue
If timelines have not diverged and only a gap in streamed WAL prevents replication, the archive may save you. Ensure the target’s postgresql.conf (or postgresql.auto.conf) defines a working restore_command, then start or restart the target. PostgreSQL invokes restore_command for each missing segment. Once the gap is filled, streaming replication resumes automatically.
Tradeoffs: This is the lightest fix because no block copying occurs. It only works if your archive is complete and no timeline switch happened.
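Before relying on the archive, it helps to confirm that the segment named in the error is actually there. The sketch below assumes the archive is a plain directory of uncompressed files named after their WAL segments. That layout and the function names missing_segment and archive_has_segment are hypothetical, so adapt the check to whatever your archive_command really writes.

```shell
# Hypothetical helper: extract the 24-character segment name from log
# lines on stdin that contain the "already been removed" error.
missing_segment() {
    sed -n 's/.*requested WAL segment \([0-9A-F]\{24\}\) has already been removed.*/\1/p'
}

# Hypothetical helper: succeed if the segment file ($2) exists in the
# archive directory ($1). Assumes uncompressed files named by segment.
archive_has_segment() {
    [ -f "$1/$2" ]
}

# Example usage (log file and archive paths are placeholders):
#   seg=$(missing_segment < /path/to/postgresql.log | tail -n 1)
#   if archive_has_segment /path/to/archive "$seg"; then
#       echo "segment $seg is archived; restore_command should be able to fetch it"
#   fi
```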
When to use pg_basebackup instead
Do not attempt pg_rewind if any of the following are true:
- The target has neither wal_log_hints nor data checksums enabled.
- A previous pg_rewind attempt crashed on this target data directory. The documentation explicitly warns the directory is likely corrupt.
- The divergence is so extensive that pg_rewind would copy nearly all blocks anyway.
In these cases, reinitialize the replica with pg_basebackup. It is slower and consumes more I/O, but it is the safest path when pg_rewind prerequisites are not satisfied.
Prevention
- Enable wal_log_hints = on in postgresql.conf, or enable data checksums at initdb. Either satisfies pg_rewind prerequisites. Without one, a failed primary can only be rebuilt with a full base backup.
- Use physical replication slots. Slots guarantee the primary retains WAL for known replicas. Monitor pg_replication_slots.active and restart_lsn.
- Maintain a working WAL archive. A reliable archive_command and restore_command give you a rescue path that avoids both pg_rewind and pg_basebackup when only a small WAL gap exists.
- Set max_slot_wal_keep_size on PostgreSQL 13 and later. This caps how much WAL an inactive slot can retain, preventing a forgotten slot from filling your disk.
- Test failback before an incident. Verify that your HA tooling or runbooks can execute pg_rewind cleanly during a drill. A procedure that has never been tested is not a procedure.
How Netdata helps
- Correlate replication lag metrics with WAL directory size to distinguish between normal catch-up and unbounded WAL retention from a stuck slot.
- Alert on inactive replication slots before they accumulate enough WAL to threaten disk space.
- Surface hot standby conflicts alongside replication lag to determine whether replica queries are blocking WAL apply.






