PostgreSQL replica disconnected: detecting and recovering streaming replication
When replay_lag grows and the primary’s pg_stat_replication no longer lists the standby, reads from the replica return stale data and failover is unsafe. When streaming breaks, the standby retries automatically, cycling through the WAL archive, local pg_wal, and the streaming connection. That loop can mask the root cause while the replica drifts toward an unrecoverable gap. The first task is to determine whether the replica is temporarily stalled, beyond recovery, or missing its replication slot on the primary.
What this means
Physical streaming replication ships WAL from the primary’s walsender to the standby’s walreceiver. The standby reports write, flush, and apply positions at intervals controlled by wal_receiver_status_interval. If no data arrives within wal_receiver_timeout (default 60 seconds), the standby drops the connection and retries. On the primary, wal_sender_timeout (default 60 seconds) kills the walsender if status replies stop.
After disconnect, the postmaster respawns walreceiver. The standby resumes ingestion in order: restore_command (archive), local pg_wal, then streaming via primary_conninfo.
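To see which values are in effect on a given node, the timeout settings named above can be read from pg_settings. This is a minimal sketch, assuming psql can connect through the usual PG* environment variables; the function name show_replication_timeouts is a hypothetical helper, not part of PostgreSQL.

```shell
# Print the settings that govern disconnect detection and status reporting.
# wal_receiver_* matter on the standby, wal_sender_timeout on the primary.
show_replication_timeouts() {
  psql -At -F ' = ' -c "
    SELECT name, current_setting(name)
    FROM pg_settings
    WHERE name IN ('wal_receiver_status_interval',
                   'wal_receiver_timeout',
                   'wal_sender_timeout')
    ORDER BY name;"
}
```

Calling show_replication_timeouts on each node prints one name = value line per setting, which makes it easy to spot a timeout that was changed from its default on only one side.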
While disconnected, the replica does not apply new WAL. If the primary recycles segments before the replica reconnects, the replica cannot self-heal. Without a replication slot guaranteeing retention, it eventually hits:
requested WAL segment has already been removed
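At that point streaming cannot resume from the primary alone: the missing segment must come from the archive, or the replica must be rebuilt. Until then, the replica may appear to retry normally while drifting into an unrecoverable gap.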
flowchart TD
A[Replica lag growing] --> B{pg_stat_wal_receiver status}
B -->|stopped or missing| C[Check replica logs for WAL error]
B -->|streaming| D[Check replica pg_stat_activity for replay blocks]
C --> E{WAL error type}
E -->|segment removed| F[Check primary replication slot]
E -->|connection closed| G[Check primary walsender and network]
F --> H{Slot inactive or lagging}
H -->|yes| I[Drop and recreate slot or reclone replica]
H -->|no| J[Check archive availability]
G --> K[Restore network or wait for auto-retry loop]
D --> L[Terminate blocking query or tune max_standby_streaming_delay]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Primary unreachable or network partition | pg_stat_wal_receiver row missing or status stopped; replica logs show repeated connection timeouts | Network path from replica to primary: ping, telnet to primary port, firewall rules |
| WAL sender terminated on primary | Log: “FATAL: could not receive data from WAL stream: server closed the connection unexpectedly”; replica enters retry loop | Primary logs for walsender death (OOM kill, restart, crash); pg_stat_replication empty for this replica |
| WAL segment removed before replica fetched it | Log: “requested WAL segment has already been removed”; replica recovery process stops | pg_replication_slots on primary: slot missing, inactive, or restart_lsn far behind pg_current_wal_lsn() |
| Replication slot conflict (stale PID) | Connection fails with error that slot is active for another PID; new walreceiver cannot start | pg_replication_slots on primary for active and active_pid; match active_pid against pg_stat_activity |
| Replica process killed by OOM | WAL receiver PID changes between checks or vanishes from ps; replica retries from scratch | dmesg or journalctl for OOM killer on replica; memory usage vs. shared_buffers + work_mem |
| Long-running query blocking replay | pg_stat_replication.replay_lag grows while write_lag stays flat; replica log shows “canceling statement due to conflict with recovery” | pg_stat_activity on replica for active queries; pg_stat_database.conflicts for cancel counts |
Quick checks
Run these read-only checks first.
# Replica walreceiver status
psql -c "SELECT pid, status, received_lsn, last_msg_send_time, last_msg_receipt_time, slot_name FROM pg_stat_wal_receiver;"
# Primary active senders
psql -c "SELECT pid, usename, application_name, client_addr, state, sent_lsn, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
# Primary slot health
psql -c "SELECT slot_name, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;"
# Replica OS walreceiver process
ps aux | grep '[w]alreceiver'
# Last WAL stream error in replica logs
grep -i "wal stream\|requested wal segment\|could not receive data" /var/log/postgresql/postgresql-*.log | tail -n 20
# Network path from replica to primary
ping -c 3 <primary_ip>
telnet <primary_ip> 5432
# OOM kills on replica
dmesg | grep -i "killed process\|oom"
How to diagnose it
Confirm disconnect on the replica. Query pg_stat_wal_receiver. If the row is missing or status is stopped, the walreceiver is not connected. Note last_msg_receipt_time: if it is more than a few minutes old, the connection has dropped even if the process ID still exists.
Confirm the primary’s view. On the primary, query pg_stat_replication. If the replica’s application_name or client_addr is absent, the primary has no live walsender for it. If the row exists but state is not streaming, the sender is still in startup or catchup and has not reached steady state.
Check for missing WAL segments. On the primary, compare pg_replication_slots.restart_lsn to pg_current_wal_lsn(). A slot retains WAL from restart_lsn onward, so growing slot lag first threatens the primary’s disk. The replica is at risk of falling into a gap when the lag approaches max_slot_wal_keep_size (PostgreSQL 13 and later, when set), because the slot can then be invalidated, or when archive retention is shorter than the lag. If the slot does not exist and wal_keep_size is small, the primary may have recycled segments already.
Identify failure mode from logs. On the replica, look for:
- “requested WAL segment has already been removed” -> unrecoverable gap without archive or slot.
- “server closed the connection unexpectedly” -> primary-side sender death.
- “replication slot is active for PID” -> stale slot lock on primary.
Check for OS-level process loss. On the replica, verify the walreceiver PID in pg_stat_wal_receiver matches a running walreceiver process. If the PID changed or the process is gone, the postmaster respawned it. Check dmesg for OOM kills. Avoid aggressive cache dropping (echo 3 > /proc/sys/vm/drop_caches) on memory-constrained replicas; it can trigger OOM and kill walreceiver.
Distinguish network lag from apply lag. If pg_stat_wal_receiver shows streaming but lag is growing, the connection is healthy but replay is blocked. Check pg_stat_activity on the replica for long-running queries holding locks that conflict with WAL replay, and check pg_stat_database.conflicts.
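The log signatures in this section can be mapped to failure modes mechanically, which helps when scanning many log lines. This is a minimal sketch in shell; the function name classify_wal_error and the labels it prints (wal_gap, sender_died, slot_busy) are hypothetical and chosen only for illustration.

```shell
# Map one replica log line to a failure mode, based on the message text.
classify_wal_error() {
  case "$1" in
    *"requested WAL segment"*"has already been removed"*)
      echo "wal_gap" ;;      # segment recycled on the primary
    *"server closed the connection unexpectedly"*)
      echo "sender_died" ;;  # walsender or primary went away
    *"replication slot"*"is active for PID"*)
      echo "slot_busy" ;;    # another process still holds the slot
    *)
      echo "unknown" ;;
  esac
}

# Usage (log path as in the quick checks; adjust for your installation):
# tail -n 200 /var/log/postgresql/postgresql-*.log |
#   while IFS= read -r line; do classify_wal_error "$line"; done | sort | uniq -c
```

The order of the patterns matters: a gap error is usually wrapped in a "could not receive data from WAL stream" line, so the gap pattern is tested first.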
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| pg_stat_replication.replay_lag | Failover readiness and data loss window | > 30 seconds on async replicas |
| pg_stat_wal_receiver.status | Whether the walreceiver has an active streaming connection | stopped or row missing entirely |
| pg_replication_slots.restart_lsn lag | How much WAL the primary is retaining for this consumer | > 100 MB behind pg_current_wal_lsn() |
| pg_stat_database.conflicts | Query cancels on replica due to replay conflicts | Counter increasing between samples |
| Replica log: “requested WAL segment has already been removed” | The replica has fallen into an unrecoverable WAL gap | Any occurrence requires immediate intervention |
| Primary log: “terminating walsender” | The sender process died, forcing a disconnect | Correlates with replica reconnect attempts |
| pg_stat_wal_receiver.last_msg_receipt_time | Time since last heartbeat from primary | > 2 minutes (varies by timeout settings) |
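The slot lag threshold is a byte distance between two LSNs. On the server, pg_wal_lsn_diff() computes it; when only the LSN strings are available (for example in an alerting script), the same arithmetic can be sketched in shell. An LSN such as 16/B374D848 is two hexadecimal halves, with the left half counting units of 4 GiB. The function names below are hypothetical helpers, and the 100 MB figure is taken here as 104857600 bytes.

```shell
# Convert an LSN (two hex halves separated by "/") to an absolute byte position.
lsn_to_bytes() (
  hi=${1%%/*}
  lo=${1##*/}
  printf '%s\n' "$(( (0x$hi << 32) + 0x$lo ))"
)

# Byte distance from the second LSN up to the first.
lsn_lag_bytes() {
  printf '%s\n' "$(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))"
}

# Succeeds when the lag is strictly greater than the threshold in bytes.
lag_over_threshold() {
  [ "$(lsn_lag_bytes "$1" "$2")" -gt "$3" ]
}

# Usage on the primary (slot name is a placeholder):
# current=$(psql -Atc "SELECT pg_current_wal_lsn()")
# restart=$(psql -Atc "SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = '<slot_name>'")
# lag_over_threshold "$current" "$restart" 104857600 && echo "slot lag above 100 MB"
```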
Fixes
Network partition or primary restart
If the primary was briefly unreachable or restarted, the retry loop usually reconnects automatically once TCP returns. Verify:
# On replica
psql -c "SELECT status FROM pg_stat_wal_receiver WHERE status = 'streaming';"
If the replica does not reconnect within a few minutes, check that primary_conninfo on the replica (set in postgresql.conf or postgresql.auto.conf on PostgreSQL 12 and later, in recovery.conf on older versions) points to the correct host and port. A forced replica restart is a last resort; prefer to let the retry loop run.
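One way to check the connection target is to pull the host and port out of the conninfo string and probe them with pg_isready. This is a minimal sketch, assuming a simple space-separated key=value conninfo without quoted values; conninfo_field is a hypothetical helper, and reading primary_conninfo with SHOW assumes PostgreSQL 12 or later and a role allowed to read that setting.

```shell
# Print the value of one key from a "key=value key=value" conninfo string.
# Returns non-zero when the key is absent. Quoted values with spaces are not handled.
conninfo_field() {
  for _pair in $1; do
    case "$_pair" in
      "$2"=*) printf '%s\n' "${_pair#*=}"; return 0 ;;
    esac
  done
  return 1
}

# Usage on the replica:
# conninfo=$(psql -Atc "SHOW primary_conninfo")
# pg_isready -h "$(conninfo_field "$conninfo" host)" -p "$(conninfo_field "$conninfo" port)"
```

pg_isready only reports whether the server accepts connections at that address; it does not validate the replication user or pg_hba.conf rules.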
WAL segment removed / slot missing
If the replica logs “requested WAL segment has already been removed,” it has fallen behind the primary’s WAL retention. This is fatal for streaming recovery.
- If you have continuous archiving (restore_command) and the segment exists in the archive, the replica switches to archive recovery and may catch up. Monitor pg_stat_wal_receiver for status returning to streaming.
- If the segment is gone from both the primary and the archive, reinitialize the replica. Destructive: this replaces the data directory.
# On primary: create slot if missing
psql -c "SELECT pg_create_physical_replication_slot('<slot_name>', true);"
# On replica: stop, move the old data directory aside (pg_basebackup needs an empty or missing target), reclone, start
pg_ctl stop -D "$PGDATA"
mv "$PGDATA" "$PGDATA.old"
pg_basebackup -h <primary_host> -D "$PGDATA" -P -X stream -S <slot_name> -R
# -R may already have written primary_slot_name; append it only if it is missing
grep -q primary_slot_name "$PGDATA/postgresql.auto.conf" || echo "primary_slot_name = '<slot_name>'" >> "$PGDATA/postgresql.auto.conf"
pg_ctl start -D "$PGDATA"
Use named slots for every replica. Without them, you depend on wal_keep_size, which bulk loads or network hiccups easily exceed.
Stale replication slot (active for PID)
On the primary, if the slot is active but the associated walsender is stuck:
# On primary
SELECT pg_terminate_backend(<stale_pid>);
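If the slot remains active with no running walsender, a primary restart may be needed to clear the slot state. Dropping and recreating the slot is not a shortcut: the new slot reserves WAL only from the current position, so the replica can resume only if the WAL it still needs remains in pg_wal or the archive. Otherwise, reclone the replica.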
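Before terminating anything, it helps to confirm which backend, if any, holds the slot. This is a sketch that joins pg_replication_slots to pg_stat_activity on the primary, assuming psql connects through the usual PG* environment variables; slot_holder is a hypothetical helper name.

```shell
# Show the slot's active flag, the PID holding it, and that backend's
# type and state (empty columns mean no matching backend was found).
slot_holder() {
  psql -At -v slot="$1" <<'SQL'
SELECT s.slot_name, s.active, s.active_pid, a.backend_type, a.state
FROM pg_replication_slots s
LEFT JOIN pg_stat_activity a ON a.pid = s.active_pid
WHERE s.slot_name = :'slot';
SQL
}

# Usage on the primary:
# slot_holder <slot_name>
```

The query is fed on standard input rather than with -c because psql interpolates variables such as :'slot' only in scripts, which also keeps the slot name safely quoted.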
Replica OOM or process death
If the walreceiver was killed by OOM, the postmaster respawns it. If respawns fail repeatedly, reduce shared_buffers or work_mem, or add RAM. Restart the replica only if the postmaster itself has crashed or processes are stuck.
Replay blocked by long queries
If the connection is alive but replay_lag grows because queries block WAL replay:
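- Cancel blocking queries. The statement below cancels every client query that has been running for more than 10 minutes, not only those conflicting with replay, so narrow the filter if needed:
SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE backend_type = 'client backend' AND state = 'active' AND query_start < NOW() - INTERVAL '10 minutes';
- Decrease max_standby_streaming_delay to cancel conflicting queries sooner, or route long-running reports away from the standby.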
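Changing max_standby_streaming_delay does not require a restart; a configuration reload is enough. This is a minimal sketch, assuming a superuser connection to the standby through the usual PG* environment variables; set_standby_streaming_delay is a hypothetical helper, and any value passed to it (such as 15s) is an example, not a recommendation.

```shell
# Set max_standby_streaming_delay on the standby, reload, and show the result.
# $1 is a PostgreSQL duration such as 15s or 2min, or -1 to wait forever.
set_standby_streaming_delay() {
  case "$1" in
    ''|*[!0-9a-z-]*) echo "unexpected value: $1" >&2; return 1 ;;
  esac
  psql -c "ALTER SYSTEM SET max_standby_streaming_delay = '$1';" &&
  psql -c "SELECT pg_reload_conf();" &&
  psql -Atc "SHOW max_standby_streaming_delay;"
}

# Usage on the replica:
# set_standby_streaming_delay 15s
```

A shorter delay trades query cancellations for lower replay lag, so it suits replicas kept mainly for failover rather than for long reports.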
Prevention
- Use named replication slots. Set primary_slot_name on every streaming replica. Slots force the primary to retain WAL until the replica confirms receipt. Without slots, you depend on wal_keep_size, which is easily exceeded during bulk loads or network hiccups.
- Monitor slot lag, not just process existence. Alert on pg_replication_slots lag in bytes, not just whether the walreceiver process is running. A process can be running while falling into an unrecoverable gap.
- Configure archive recovery as a fallback. Ensure restore_command is functional. If streaming breaks, the replica can consume archived WAL while you fix the network.
- Keep replicas on the same major version as the primary. Physical streaming replication requires identical major versions. A version mismatch prevents any connection.
- Test failover and replica rebuild procedures monthly. An untested replica rebuild during an incident is a high-risk operation. Verify that pg_basebackup completes within your RTO and that your backup network can sustain the throughput.
How Netdata helps
- Collects pg_stat_replication lag (write_lag, flush_lag, replay_lag) per replica to distinguish network lag from apply lag.
- Tracks pg_stat_wal_receiver connection state and timestamps to surface disconnects before applications see stale data.
- Correlates PostgreSQL process charts with system OOM and memory metrics to confirm walreceiver death from memory pressure.
- Monitors pg_replication_slots lag and active status to alert on slots falling behind before the primary disk fills.






