PostgreSQL streaming replication broken: how to rebuild without full basebackup

When a replica logs "requested WAL segment has already been removed", or when the old primary must rejoin as a standby after failover, a full pg_basebackup on a multi-terabyte database can take hours and saturate network and disk. If the data files have diverged but are mostly identical, pg_rewind can resync by copying only the changed blocks. It is faster than a base backup, but it has strict prerequisites. If pg_rewind crashes mid-operation, the target data directory is likely corrupt and only a fresh base backup is safe.

What this means

Physical streaming replication ships WAL bytes from primary to replica. If the replica falls too far behind, the primary removes or recycles old WAL segments after checkpoints, subject to max_wal_size, wal_keep_size (wal_keep_segments before PostgreSQL 13), and any replication slots. If the needed segment is gone from pg_wal and not in the archive, streaming stops.

After failover, promoting a standby increments the timeline. The old primary’s data files now reflect a different history. You cannot simply point the old primary at the new one: if it wrote WAL past the point where the new primary’s timeline forked off, recovery refuses to follow the new timeline. pg_rewind scans the target’s WAL, starting from the last checkpoint before the divergence point, to find every block that changed on the target after that point. It then copies just those blocks from the source, along with other files such as new relation files, WAL segments, and configuration files. When the target next starts, it replays the source’s WAL from that checkpoint forward and follows the new timeline.

This works when the target cluster is intact, prerequisites are met, and the divergence is small enough that block-level sync is faster than a full rebuild.

Common causes

Cause	What it looks like	First thing to check
Timeline divergence after failover	Old primary logs a timeline mismatch and refuses to stream from the new primary	`pg_controldata` on the target for “Latest checkpoint’s TimeLineID”
Primary recycled WAL before replica caught up	Replica logs `requested WAL segment has already been removed`	`pg_stat_replication` on the primary for sent versus replay LSN, and `pg_wal` file count
Missing or inactive replication slot	Primary no longer retains WAL for this replica; slot was dropped or consumer is gone	`pg_replication_slots` on the primary for `active` status and `restart_lsn`
`pg_rewind` prerequisites missing	`pg_rewind` exits immediately with an error about hints or checksums	`SHOW wal_log_hints;` or verify data checksums were enabled at `initdb`
Unclean target shutdown	`pg_rewind` reports target is not cleanly shut down and may need crash recovery	`pg_controldata` output for “Database cluster state”

Quick checks

Run these read-only checks before deciding between pg_rewind, WAL archive rescue, or pg_basebackup.

# Check if the target cluster is shut down cleanly
pg_controldata $PGDATA | grep "Database cluster state"

# Verify wal_log_hints on the target (works whether or not the server is running)
pg_controldata "$PGDATA" | grep "wal_log_hints setting"

-- Inspect physical replication slot health on the primary
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'physical';

-- Check replication lag from the primary's perspective
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

# Verify data checksums are enabled (alternative prerequisite for pg_rewind); 0 means disabled
pg_controldata "$PGDATA" | grep "Data page checksum version"

# Estimate WAL retention pressure on the primary
du -sh $PGDATA/pg_wal
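
The pg_controldata checks above can be folded into one pass. The following is a minimal sketch, not an official tool: rewind_prereqs is a hypothetical helper name, and it assumes pg_controldata prints its English-language labels (run it with LC_ALL=C if your locale differs).

```shell
# rewind_prereqs: read pg_controldata output on stdin and report whether the
# cluster looks ready for pg_rewind (hypothetical helper, English labels assumed).
rewind_prereqs() {
    awk -F': *' '
        /^Database cluster state:/     { state = $2 }
        /^wal_log_hints setting:/      { hints = $2 }
        /^Data page checksum version:/ { checksums = $2 }
        END {
            # pg_rewind needs a cleanly stopped target.
            if (state != "shut down" && state != "shut down in recovery") {
                print "not ready: cluster state is \"" state "\""
            # Either wal_log_hints or data checksums must be enabled.
            } else if (hints != "on" && checksums + 0 == 0) {
                print "not ready: wal_log_hints is off and data checksums are disabled"
            } else {
                print "ok"
            }
        }'
}
```

Typical use would be `pg_controldata "$PGDATA" | rewind_prereqs` on the target host.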

How to diagnose it

Read the exact error on the target. If the replica logs requested WAL segment has already been removed, the primary has recycled that segment. If the error references a timeline mismatch, the clusters have diverged because of a promotion or an unrecoverable split-brain event.
Verify pg_rewind prerequisites. The target must have been running with wal_log_hints = on (a postgresql.conf setting that needs a restart to change) or with data checksums enabled. full_page_writes = on is also required, but it defaults to on. If neither hints nor checksums were enabled before the divergence, pg_rewind cannot be used. Plan for pg_basebackup.
Check the WAL archive. If timelines have not diverged and only a few segments are missing from the primary’s local pg_wal, verify whether your archive_command stored them durably and whether the target’s restore_command can retrieve them. If the archive is complete, the replica can fetch the gap via restore_command and resume streaming.
Assess target integrity. If pg_rewind was previously attempted on this target and aborted, the PostgreSQL documentation warns that the data directory is likely in an unrecoverable state. Do not retry pg_rewind. Reinitialize with pg_basebackup.
Inspect replication slots on the primary. Query pg_replication_slots. An inactive physical slot with a stale restart_lsn indicates the primary is retaining WAL for a consumer that may never return. If the slot for this replica was dropped, the primary may already have recycled WAL past the replica’s position, in which case streaming alone cannot resume.
Decide the fix path. If prerequisites are met, the target is clean, and timelines have diverged, use pg_rewind. If timelines are identical and only WAL is missing, try archive rescue. If the target is corrupt, prerequisites are missing, or divergence is massive, use pg_basebackup.
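
The decision in the last step can be written down as a small function, which is handy in a runbook. This is only a sketch of the rules stated above: choose_fix_path and its yes/no arguments are hypothetical, it does not model the "divergence is massive" judgment call, and it assumes that a WAL gap with no divergence and no complete archive leaves pg_basebackup as the only option.

```shell
# choose_fix_path TARGET_INTACT PREREQS_MET TIMELINES_DIVERGED ARCHIVE_COMPLETE
# Each argument is "yes" or "no". Prints the suggested recovery path.
choose_fix_path() {
    target_intact=$1
    prereqs_met=$2
    diverged=$3
    archive_complete=$4

    if [ "$target_intact" != yes ]; then
        # Corrupt target, or an earlier pg_rewind aborted on it.
        echo pg_basebackup
    elif [ "$diverged" = yes ] && [ "$prereqs_met" = yes ]; then
        echo pg_rewind
    elif [ "$diverged" != yes ] && [ "$archive_complete" = yes ]; then
        echo archive_rescue
    else
        # Missing prerequisites, or a WAL gap the archive cannot fill.
        echo pg_basebackup
    fi
}
```

For example, `choose_fix_path yes yes yes no` suggests pg_rewind for an intact old primary after a failover.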

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`pg_stat_replication.replay_lag`	Measures how far behind the replica is; directly impacts RPO and failover time	Sustained lag above your RPO threshold, or sudden spikes after batch jobs
`pg_replication_slots.restart_lsn`	The oldest WAL the primary must retain for this slot; inactive slots cause unbounded growth	Slot is `active = false` and LSN lag is increasing
`pg_wal` size or segment count	Disk space consumed by WAL; slots and archive failures cause bloat	Sustained size well above `max_wal_size`, which points to a slot or a failing archive holding WAL back
`wal_log_hints` and `data_checksums`	Required for `pg_rewind`; without them, failback requires full rebuild	`wal_log_hints = off` and `data_checksums = off` on a primary that may need failback
`pg_stat_database.conflicts`	Replica queries canceled by WAL replay; indicates replay is blocked	Non-zero counts that correlate with replication lag spikes
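
The lag figures behind these signals are byte distances between LSNs. An LSN prints as two hexadecimal numbers, hi/lo, and its byte position is hi × 2^32 + lo. When you only have LSNs copied from logs or a ticket and no server to run pg_wal_lsn_diff on, the arithmetic can be reproduced locally. A minimal sketch; lsn_to_bytes and lsn_diff are hypothetical helper names.

```shell
# lsn_to_bytes LSN: convert an LSN such as 0/3000060 to an absolute byte position.
lsn_to_bytes() {
    hi=${1%%/*}   # hex digits before the slash
    lo=${1##*/}   # hex digits after the slash
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# lsn_diff NEWER OLDER: byte distance between two LSNs.
lsn_diff() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

For example, `lsn_diff 0/3000060 0/3000000` gives the number of bytes a replica at 0/3000000 trails a primary at 0/3000060.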

Fixes

Resynchronize with pg_rewind

This is the standard path for failing back an old primary or repairing a diverged replica. The target must be shut down. The source can be a running primary accessed via libpq.

pg_rewind --target-pgdata=$PGDATA \
          --source-server="host=new_primary port=5432 user=replicator" \
          --write-recovery-conf

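On PostgreSQL 13 and later, --write-recovery-conf creates standby.signal and appends primary_conninfo to postgresql.auto.conf. pg_rewind in PostgreSQL 12 does not have this option, so create standby.signal and set primary_conninfo yourself. On PostgreSQL 11 and earlier, you must manage recovery.conf manually.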

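If the target was not cleanly shut down, pg_rewind on PostgreSQL 13 and later runs crash recovery in single-user mode before rewinding, unless you pass --no-ensure-shutdown. Earlier versions refuse to run until the target has been started and stopped cleanly. The automatic recovery adds risk; a clean shutdown is safer.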

Critical post-rewind step: with --write-recovery-conf, pg_rewind writes the source connection string into the target’s recovery configuration. It also copies configuration files that live in the source’s data directory over the target’s. Before starting the target as a standby, review primary_conninfo, primary_slot_name, and any identity-specific settings such as application_name. Failure to do this can cause the old primary to connect to itself or claim the wrong slot name.
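
Part of that review can be automated. The sketch below checks only one failure mode, a primary_conninfo that points back at the node itself. review_standby_conf is a hypothetical helper, and it assumes a simple unquoted host= value inside the connection string; it does not validate primary_slot_name or application_name.

```shell
# review_standby_conf CONF_FILE LOCAL_HOST
# Reports which host primary_conninfo points at and flags a self-connection.
review_standby_conf() {
    conf=$1
    local_host=$2

    # The last assignment in the file wins, so keep only the final match.
    conninfo=$(grep -E "^[[:space:]]*primary_conninfo[[:space:]]*=" "$conf" | tail -n 1)
    if [ -z "$conninfo" ]; then
        echo "missing: primary_conninfo is not set"
        return 1
    fi

    # Extract the value of host= up to the next space or quote.
    host=$(printf '%s\n' "$conninfo" | sed -n "s/.*host=\([^ ']*\).*/\1/p")
    if [ "$host" = "$local_host" ]; then
        echo "self: primary_conninfo points at $host"
        return 1
    fi
    echo "ok: primary_conninfo points at $host"
}
```

A run such as `review_standby_conf "$PGDATA/postgresql.auto.conf" "$(hostname)"` before the first start is cheap insurance, though it does not replace reading the file.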

Tradeoffs: pg_rewind is fast for large databases because it copies only changed blocks. It requires the source to be reachable and the target to be down. It also demands manual configuration cleanup after the sync.

WAL archive rescue

If timelines have not diverged and only a gap in streamed WAL prevents replication, the archive may save you. Ensure the target’s postgresql.conf (or postgresql.auto.conf) defines a working restore_command, then start or restart the target. PostgreSQL invokes restore_command for each missing segment. Once the gap is filled, streaming replication resumes automatically.
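
Before relying on the archive, it helps to confirm that the segment the replica needs is actually there. The replica's log names the missing segment directly; when you only have an LSN, the file name can be derived from the timeline and the LSN. A minimal sketch under stated assumptions: wal_segment_name and segment_in_archive are hypothetical helpers, the segment size is the default 16 MB, and the archive is a plain directory holding uncompressed files under their original names.

```shell
# wal_segment_name TIMELINE LSN
# Prints the 24-character WAL file name that contains LSN (16 MB segments assumed).
wal_segment_name() {
    tli=$1
    hi=${2%%/*}   # hex digits before the slash
    lo=${2##*/}   # hex digits after the slash
    seg_size=$((16 * 1024 * 1024))
    printf '%08X%08X%08X\n' "$tli" "$((0x$hi))" "$(( 0x$lo / seg_size ))"
}

# segment_in_archive ARCHIVE_DIR TIMELINE LSN
# Succeeds if the segment containing LSN exists in the archive directory.
segment_in_archive() {
    [ -f "$1/$(wal_segment_name "$2" "$3")" ]
}
```

On a running server, the built-in SQL function pg_walfile_name() returns the same name for the current timeline.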

Tradeoffs: This is the lightest fix because no block copying occurs. It only works if your archive is complete and no timeline switch happened.

When to use pg_basebackup instead

Do not attempt pg_rewind if any of the following are true:

The target ran with neither wal_log_hints = on nor data checksums enabled.
A previous pg_rewind attempt crashed on this target data directory. The documentation explicitly warns the directory is likely corrupt.
The divergence is so extensive that pg_rewind would copy nearly all blocks anyway.

In these cases, reinitialize the replica with pg_basebackup. It is slower and consumes more I/O, but it is the safest path when pg_rewind prerequisites are not satisfied.
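
For reference, a full rebuild might look like the sketch below. rebuild_standby is a hypothetical wrapper that prints the command unless RUN=1 is set, so it can be reviewed first. It assumes the replicator role used earlier in this guide, an empty target data directory, an existing physical slot on the source, and PostgreSQL 10 or later for --wal-method.

```shell
# rebuild_standby SOURCE_HOST DATA_DIR SLOT_NAME
# Prints the pg_basebackup command by default; set RUN=1 to execute it.
rebuild_standby() {
    # Rebuild the argument list as the full command line.
    set -- pg_basebackup --host="$1" --username=replicator --pgdata="$2" \
        --wal-method=stream --slot="$3" --checkpoint=fast \
        --write-recovery-conf --progress
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}
```

Streaming WAL through a slot during the copy keeps the primary from recycling segments the new standby needs when it first starts.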

Prevention

Enable wal_log_hints = on in postgresql.conf, or enable data checksums at initdb (or offline with pg_checksums on PostgreSQL 12 and later). Either satisfies pg_rewind prerequisites. Without one, a failed primary can only be rebuilt with a full base backup.
Use physical replication slots. Slots guarantee the primary retains WAL for known replicas. Monitor pg_replication_slots.active and restart_lsn.
Maintain a working WAL archive. A reliable archive_command and restore_command give you a rescue path that avoids both pg_rewind and pg_basebackup when only a small WAL gap exists.
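Set max_slot_wal_keep_size on PostgreSQL 13 and later. This caps how much WAL a slot can retain, preventing a forgotten slot from filling your disk. A slot that falls behind the cap is invalidated, so its replica then needs the archive or a resync to catch up.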
Test failback before an incident. Verify that your HA tooling or runbooks can execute pg_rewind cleanly during a drill. A procedure that has never been tested is not a procedure.

How Netdata helps

Correlate replication lag metrics with WAL directory size to distinguish between normal catch-up and unbounded WAL retention from a stuck slot.
Alert on inactive replication slots before they accumulate enough WAL to threaten disk space.
Surface hot standby conflicts alongside replication lag to determine whether replica queries are blocking WAL apply.
