PostgreSQL replica disconnected: detecting and recovering streaming replication

When replay_lag grows and the primary’s pg_stat_replication no longer lists the standby, queries on the replica return stale data and failover is unsafe. A standby retries automatically when streaming breaks, cycling through the WAL archive, local pg_wal, and the streaming connection. That loop can mask the root cause while the replica drifts toward an unrecoverable gap. The first job is to determine whether the replica is temporarily stalled, beyond recovery, or missing its replication slot on the primary.

What this means

Physical streaming replication ships WAL from the primary’s walsender to the standby’s walreceiver. The standby reports write, flush, and apply positions at intervals controlled by wal_receiver_status_interval. If no data arrives within wal_receiver_timeout (default 60 seconds), the standby drops the connection and retries. On the primary, wal_sender_timeout (default 60 seconds) kills the walsender if status replies stop.

After disconnect, the postmaster respawns walreceiver. The standby resumes ingestion in order: restore_command (archive), local pg_wal, then streaming via primary_conninfo.

While disconnected, the replica does not apply new WAL. If the primary recycles segments before the replica reconnects, the replica cannot self-heal. Without a replication slot guaranteeing retention, it eventually hits:

requested WAL segment has already been removed

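At that point streaming cannot resume: the standby keeps retrying but never advances unless the missing segment is still available from the archive. Until then, the replica may appear to retry normally while drifting into an unrecoverable gap.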
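
To relate the segment name in that error to the LSNs shown in the statistics views, the arithmetic can be sketched in plain shell. This is a minimal illustration, not a PostgreSQL tool: the function names are hypothetical, the segment calculation assumes the default 16 MB wal_segment_size, and the timeline ID is supplied by the caller.

```shell
# Convert an LSN such as 0/3000000 into an absolute byte position.
# An LSN is two hexadecimal numbers: high 32 bits, a slash, low 32 bits.
lsn_to_bytes() (
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
)

# Bytes of WAL between two LSNs (newer LSN first).
lsn_diff() (
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
)

# Name of the WAL segment file that contains an LSN.
# Assumption: 16 MB segments (16777216 bytes), timeline given in decimal.
lsn_to_segment() (
    tli=$1
    hi=${2%%/*}
    lo=${2##*/}
    printf '%08X%08X%08X\n' "$tli" "$(( 0x$hi ))" "$(( 0x$lo / 16777216 ))"
)
```

For example, `lsn_to_segment 1 0/3000000` prints `000000010000000000000003`, and `lsn_diff 1/0 0/FF000000` prints `16777216` (one 16 MB segment). On a live primary, pg_wal_lsn_diff() and pg_walfile_name() remain the authoritative way to get these values.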

flowchart TD
    A[Replica lag growing] --> B{pg_stat_wal_receiver status}
    B -->|stopped or missing| C[Check replica logs for WAL error]
    B -->|streaming| D[Check replica pg_stat_activity for replay blocks]
    C --> E{WAL error type}
    E -->|segment removed| F[Check primary replication slot]
    E -->|connection closed| G[Check primary walsender and network]
    F --> H{Slot inactive or lagging}
    H -->|yes| I[Drop and recreate slot or reclone replica]
    H -->|no| J[Check archive availability]
    G --> K[Restore network or wait for auto-retry loop]
    D --> L[Terminate blocking query or tune max_standby_streaming_delay]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Primary unreachable or network partition | pg_stat_wal_receiver row missing or status stopped; replica logs show repeated connection timeouts | Network path from replica to primary: ping, telnet to primary port, firewall rules |
| WAL sender terminated on primary | Log: “FATAL: could not receive data from WAL stream: server closed the connection unexpectedly”; replica enters retry loop | Primary logs for walsender death (OOM kill, restart, crash); pg_stat_replication empty for this replica |
| WAL segment removed before replica fetched it | Log: “requested WAL segment has already been removed”; replica recovery process stops | pg_replication_slots on primary: slot missing, inactive, or restart_lsn far behind pg_current_wal_lsn() |
| Replication slot conflict (stale PID) | Connection fails with error that slot is active for another PID; new walreceiver cannot start | pg_replication_slots on primary for active and active_pid; look up that PID in pg_stat_activity |
| Replica process killed by OOM | WAL receiver PID changes between checks or vanishes from ps; replica retries from scratch | dmesg or journalctl for OOM killer on replica; memory usage vs. shared_buffers + work_mem |
| Long-running query blocking replay | pg_stat_replication.replay_lag grows while write_lag stays flat; replica log shows “canceling statement due to conflict with recovery” | pg_stat_activity on replica for active queries; pg_stat_database.conflicts for cancel counts |

Quick checks

Run these read-only checks first.

# Replica walreceiver status
psql -c "SELECT pid, status, received_lsn, last_msg_send_time, last_msg_receipt_time, slot_name FROM pg_stat_wal_receiver;"
# Primary active senders
psql -c "SELECT pid, usename, application_name, client_addr, state, sent_lsn, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
# Primary slot health
psql -c "SELECT slot_name, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;"
# Replica OS walreceiver process
ps aux | grep '[w]alreceiver'   # bracket trick keeps grep from matching itself
# Last WAL stream error in replica logs
grep -i "wal stream\|requested wal segment\|could not receive data" /var/log/postgresql/postgresql-*.log | tail -n 20
# Network path from replica to primary
ping -c 3 <primary_ip>
telnet <primary_ip> 5432
# OOM kills on replica
dmesg | grep -i "killed process\|oom"

How to diagnose it

  1. Confirm disconnect on the replica. Query pg_stat_wal_receiver. If the row is missing or status is stopped, the walreceiver is not connected. Note last_msg_receipt_time: if it is more than a few minutes old, the connection dropped even if the process ID still exists.

  2. Confirm the primary’s view. On the primary, query pg_stat_replication. If the replica’s application_name or client_addr is absent, the primary has no live walsender for it. If the row exists but state is not streaming, the sender may be stuck in startup.

  3. Check for missing WAL segments. On the primary, compare pg_replication_slots.restart_lsn to pg_current_wal_lsn(). A slot retains WAL without limit unless max_slot_wal_keep_size is set (PostgreSQL 13 and later). If the slot lag approaches that limit, the free space on the WAL volume, or your archive retention, the replica is at risk of falling into a gap. If the slot does not exist and wal_keep_size is small, the primary may have recycled segments already.

  4. Identify failure mode from logs. On the replica, look for:

    • “requested WAL segment has already been removed” -> unrecoverable gap without archive or slot.
    • “server closed the connection unexpectedly” -> primary-side sender death.
    • “replication slot is active for PID” -> stale slot lock on primary.
  5. Check for OS-level process loss. On the replica, verify the walreceiver PID in pg_stat_wal_receiver matches a running walreceiver process. If the PID changed or the process is gone, the postmaster respawned it. Check dmesg for OOM kills. Avoid aggressive cache dropping (echo 3 > /proc/sys/vm/drop_caches) on memory-constrained replicas; it can trigger OOM and kill walreceiver.

  6. Distinguish network lag from apply lag. If pg_stat_wal_receiver shows streaming but lag is growing, the connection is healthy but replay is blocked. Check pg_stat_activity on the replica for long-running queries holding locks that conflict with WAL replay, and check pg_stat_database.conflicts.
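
The log patterns from step 4 can be turned into a small classifier for triage scripts. The sketch below is illustrative only: the function name and the returned labels are hypothetical, and it matches just the three messages listed above.

```shell
# Map one replica log line to a failure mode from step 4.
# Prints one of: wal_gap, sender_died, slot_in_use, unknown.
classify_wal_error() (
    case $1 in
        *"requested WAL segment"*"has already been removed"*)
            echo "wal_gap" ;;        # needs archive, slot, or reclone
        *"server closed the connection unexpectedly"*)
            echo "sender_died" ;;    # check the primary's walsender
        *"replication slot"*"is active for PID"*)
            echo "slot_in_use" ;;    # stale slot lock on the primary
        *)
            echo "unknown" ;;
    esac
)
```

One way to use it is to feed it the last matching line from the replica log, for example the output of the grep in the quick checks piped through `tail -n 1`.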

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| pg_stat_replication.replay_lag | Failover readiness and data loss window | > 30 seconds on async replicas |
| pg_stat_wal_receiver.status | Whether the walreceiver has an active streaming connection | stopped or row missing entirely |
| pg_replication_slots.restart_lsn lag | How much WAL the primary is retaining for this consumer | > 100 MB behind pg_current_wal_lsn() |
| pg_stat_database.conflicts | Query cancels on replica due to replay conflicts | Counter increasing between samples (it is cumulative) |
| Replica log: “requested WAL segment has already been removed” | The replica has fallen into an unrecoverable WAL gap | Any occurrence requires immediate intervention |
| Primary log: “terminating walsender” | The sender process died, forcing a disconnect | Correlates with replica reconnect attempts |
| pg_stat_wal_receiver.last_msg_receipt_time | Time since last heartbeat from primary | > 2 minutes (varies by timeout settings) |
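
As a rough sketch of how these thresholds combine, the function below evaluates three already-collected values. The function name and output format are hypothetical, and the limits (2 minutes, 100 MB) are simply the warning signs listed above; adjust them to your timeout and retention settings. Note that the receiver status comes from the replica while the slot lag comes from the primary.

```shell
# check_replica <receiver_status> <secs_since_last_msg> <slot_lag_bytes>
# Inputs are assumed to be collected beforehand with psql.
check_replica() (
    status=$1
    idle=$2
    lag=$3
    max_idle=120                        # 2 minutes without a message
    max_lag=$(( 100 * 1024 * 1024 ))    # 100 MB of retained WAL
    if [ "$status" != "streaming" ]; then
        echo "CRITICAL: walreceiver not streaming (status=${status:-none})"
    elif [ "$idle" -gt "$max_idle" ]; then
        echo "WARNING: no message from primary for ${idle}s"
    elif [ "$lag" -gt "$max_lag" ]; then
        echo "WARNING: slot retains ${lag} bytes of WAL"
    else
        echo "OK"
    fi
)
```

For instance, `check_replica streaming 5 1024` prints `OK`, while an empty status (no row in pg_stat_wal_receiver) yields the CRITICAL line.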

Fixes

Network partition or primary restart

If the primary was briefly unreachable or restarted, the retry loop usually reconnects automatically once TCP returns. Verify:

# On replica
psql -c "SELECT status FROM pg_stat_wal_receiver WHERE status = 'streaming';"

If the replica does not reconnect within a few minutes, check that primary_conninfo (in postgresql.conf or postgresql.auto.conf on PostgreSQL 12 and later) points to the correct host and port. A forced replica restart is a last resort; prefer to let the retry loop run.
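
Rather than re-running the status query by hand, the wait can be scripted. The following is a minimal sketch with hypothetical function names; the probe command is passed in as arguments so that it can be replaced, and `probe_receiver` assumes psql can connect to the replica with default connection settings.

```shell
# wait_for_streaming <max_attempts> <sleep_seconds> <probe command...>
# Succeeds as soon as the probe prints "streaming".
wait_for_streaming() (
    attempts=$1
    pause=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        status=$("$@" 2>/dev/null)
        if [ "$status" = "streaming" ]; then
            echo "streaming after $i attempt(s)"
            exit 0
        fi
        sleep "$pause"
        i=$(( i + 1 ))
    done
    echo "still not streaming after $attempts attempts" >&2
    exit 1
)

# Default probe, run on the replica: prints the walreceiver status,
# or nothing when pg_stat_wal_receiver has no row.
probe_receiver() {
    psql -XAtc "SELECT status FROM pg_stat_wal_receiver"
}
```

A call such as `wait_for_streaming 30 10 probe_receiver` polls for up to five minutes and returns non-zero if the replica never reports streaming, which is the point to start checking primary_conninfo.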

WAL segment removed / slot missing

If the replica logs “requested WAL segment has already been removed,” it has fallen behind the primary’s WAL retention. This is fatal for streaming recovery.

  • If you have continuous archiving (restore_command) and the segment exists in archive, the replica switches to archive recovery and may catch up. Monitor pg_stat_wal_receiver for status returning to streaming.

  • If the segment is gone from both streaming and archive, reinitialize the replica. Destructive: this replaces the replica’s data directory.

    # On primary: create the slot if missing (true = reserve WAL immediately)
    psql -c "SELECT pg_create_physical_replication_slot('<slot_name>', true);"
    
    # On replica: stop, move the old data directory aside, reclone, start
    pg_ctl stop -D "$PGDATA"
    mv "$PGDATA" "$PGDATA.old"   # pg_basebackup needs an empty or missing target directory
    pg_basebackup -h <primary_host> -D "$PGDATA" -P -X stream -S <slot_name> -R
    # -R with -S normally records primary_slot_name; confirm, and add it by hand only if missing
    grep primary_slot_name "$PGDATA/postgresql.auto.conf"
    pg_ctl start -D "$PGDATA"
    
  • Use named slots for every replica. Without them, you depend on wal_keep_size, which bulk loads or network hiccups easily exceed.

Stale replication slot (active for PID)

On the primary, if the slot is active but the associated walsender is stuck:

# On primary
SELECT pg_terminate_backend(<stale_pid>);

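If the slot remains active with no running walsender, a primary restart may be needed to clear the slot state. An active slot cannot be dropped. Once the slot is inactive, dropping and recreating it discards the retained restart_lsn, so the replica can reconnect only if the WAL it needs is still in the primary’s pg_wal or in the archive; otherwise reclone the replica.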
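
Before terminating anything, it helps to confirm which backend actually holds the slot. The sketch below only builds the lookup query; the function name is hypothetical, and it relies on the active_pid column of pg_replication_slots and the backend_type column of pg_stat_activity (PostgreSQL 10 and later).

```shell
# Print a query that shows which backend (if any) holds a slot.
# Slot names may contain only lower-case letters, digits, and underscores,
# so anything else is rejected instead of being placed in the SQL text.
slot_owner_sql() (
    slot=$1
    case $slot in
        ''|*[!a-z0-9_]*)
            echo "invalid slot name: $slot" >&2
            exit 1 ;;
    esac
    cat <<EOF
SELECT s.slot_name, s.active, s.active_pid,
       a.backend_type, a.state, a.backend_start
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_activity AS a ON a.pid = s.active_pid
WHERE s.slot_name = '$slot';
EOF
)
```

On the primary, `slot_owner_sql <slot_name> | psql -X` runs it. A row with active = t and a walsender backend_type identifies the PID to pass to pg_terminate_backend; an active slot with no matching pg_stat_activity row is the case described above.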

Replica OOM or process death

If the walreceiver was killed by OOM, the postmaster respawns it. If respawns fail repeatedly, reduce shared_buffers or work_mem, or add RAM. Restart the replica only if the postmaster itself has crashed or processes are stuck.

Replay blocked by long queries

If the connection is alive but replay_lag grows because queries block WAL replay:

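  • Cancel long-running queries on the replica. Note that this cancels every query that has been active for more than 10 minutes, not only those blocking replay: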
    SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '10 minutes';
    
  • Decrease max_standby_streaming_delay to cancel conflicting queries sooner, or route long-running reports away from the standby.

Prevention

  • Use named replication slots. Set primary_slot_name on every streaming replica. Slots force the primary to retain WAL until the replica confirms receipt. Without slots, you depend on wal_keep_size, which is easily exceeded during bulk loads or network hiccups.
  • Monitor slot lag, not just process existence. Alert on pg_replication_slots lag in bytes, not just whether the walreceiver process is running. A process can be running while falling into an unrecoverable gap.
  • Configure archive recovery as a fallback. Ensure restore_command is functional. If streaming breaks, the replica can consume archived WAL while you fix the network.
  • Keep replicas on the same major version as the primary. Physical streaming replication requires identical major versions. A version mismatch prevents any connection.
  • Test failover and replica rebuild procedures monthly. An untested replica rebuild during an incident is a high-risk operation. Verify that pg_basebackup completes within your RTO and that your backup network can sustain the throughput.

How Netdata helps

  • Collects pg_stat_replication lag (write_lag, flush_lag, replay_lag) per replica to distinguish network lag from apply lag.
  • Tracks pg_stat_wal_receiver connection state and timestamps to surface disconnects before applications see stale data.
  • Correlates PostgreSQL process charts with system OOM and memory metrics to confirm walreceiver death from memory pressure.
  • Monitors pg_replication_slots lag and active status to alert on slots falling behind before the primary disk fills.