PostgreSQL replication slot bloat: when a stale slot fills the disk

Disk is filling on the primary. Table sizes are stable and active replicas show healthy replication lag, but WAL in pg_wal keeps growing and is not being recycled. The most likely cause is a stale replication slot. When a logical subscriber, CDC connector, or physical replica disconnects without cleaning up its slot, PostgreSQL retains every WAL segment from the slot’s restart_lsn onward. max_wal_size does not bound this retention; the only cap is max_slot_wal_keep_size (PostgreSQL 13 and later), and its default of -1 means unlimited. With the default in place, the primary retains WAL until the disk fills and writes halt. This guide covers confirmation, recovery, and prevention.

What this means

A replication slot is a durable promise that PostgreSQL will not remove WAL until the consumer processes it. Each slot tracks a restart_lsn. The checkpointer can recycle segments only when they are older than every slot’s restart_lsn. If a consumer goes offline, its restart_lsn stops advancing, and WAL accumulates in pg_wal. max_wal_size is only a soft limit on checkpoint-driven WAL growth and has no effect on slot-bound retention. Unless max_slot_wal_keep_size is configured, there is no ceiling.

Dropping a stale slot removes the retention barrier, but the next checkpoint is required to mark old segments as removable. If the disk fills completely, PostgreSQL cannot complete checkpoints, cannot write new WAL, and rejects writes. All applications are affected, not just the missing consumer.

flowchart TD
    A[Consumer disconnects or stalls] --> B[Slot freezes restart_lsn]
    B --> C[WAL accumulates in pg_wal]
    C --> D[pg_wal grows past max_wal_size and fills the disk]
    D --> E[Checkpoint cannot recycle segments]
    E --> F[Primary rejects writes]
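
To make the retention arithmetic concrete, the following is a minimal Python sketch of the byte difference that pg_wal_lsn_diff reports between the current WAL position and a slot's restart_lsn. It assumes the textual LSN format (two hexadecimal numbers separated by a slash, holding the high and low 32 bits) and the default 16 MB WAL segment size. The function names and the sample LSNs are hypothetical and chosen only for illustration.

```python
import math
from typing import Tuple

# Assumption: default segment size; clusters initialized with a different
# wal_segment_size need a different value here.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal(current_lsn: str, restart_lsn: str) -> Tuple[int, int]:
    """Return (bytes, approximate number of segments) held back by a slot."""
    lag_bytes = parse_lsn(current_lsn) - parse_lsn(restart_lsn)
    return lag_bytes, math.ceil(lag_bytes / WAL_SEGMENT_BYTES)


# Hypothetical values: the slot froze at 2/0 while the primary advanced to 5/0.
lag, segments = retained_wal("5/0", "2/0")
print(lag, segments)  # 12884901888 768
```

In this made-up case a single frozen slot pins 12 GiB of WAL, or 768 segments, regardless of the max_wal_size setting.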

Common causes

Cause	What it looks like	First thing to check
Orphaned logical slot from a decommissioned CDC tool	Slot active = false, inactive_since hours old on PG17+, WAL growing steadily	pg_replication_slots for logical slots with no matching consumer
Physical replica rebuilt without dropping its old slot	Old physical slot remains with stale restart_lsn; new replica uses a different slot name	Slot names against your known replica inventory
Stalled consumer that is connected but not advancing	Slot active = true, lag increasing, consumer process idle or crashed	Consumer logs and pg_stat_replication for sent vs. flushed LSN
Unlimited max_slot_wal_keep_size with no monitoring	No ceiling on WAL retention; disk fills silently until manual intervention	SHOW max_slot_wal_keep_size; returns -1

Quick checks

Run these read-only checks to confirm slot-related WAL growth.

-- Slot status and lag in bytes
SELECT slot_name, slot_type, active, restart_lsn,
       pg_current_wal_lsn(),
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

-- Total WAL size (PG10+)
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- WAL status and safety margin (PG13+)
SELECT slot_name, wal_status, safe_wal_size
FROM pg_replication_slots;

# From the PostgreSQL data directory
du -sh pg_wal/

-- Retention limits
SHOW max_wal_size;
SHOW max_slot_wal_keep_size;

How to diagnose it

Confirm the growth is WAL, not data. Compare total database size to disk usage. If pg_database_size() totals are stable but disk usage is climbing, suspect WAL, logs, or temporary files. Use pg_ls_waldir() or du on the pg_wal directory to confirm WAL growth.
Identify stale slots. Query pg_replication_slots. Flag rows where active is false or pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) is large and increasing. A slot mapping to a decommissioned host or unknown connector name is likely orphaned. For example, a logical slot named after a Kafka connector that was deleted last week is a clear candidate for removal.
Distinguish inactive from lagging. An inactive slot (active = false) has no streaming walsender. An active slot with growing lag still has a consumer, but that consumer is not acknowledging progress. The fix differs: inactive slots are dropped, while lagging consumers require intervention on the remote side. Cross-reference pg_replication_slots.active_pid with pg_stat_replication.pid to map active connections to slot names and confirm which consumer is attached.
Check WAL status. On PostgreSQL 13+, wal_status indicates whether the slot is within max_wal_size (reserved), beyond it but still held (extended), or past max_slot_wal_keep_size and heading for removal (unreserved). If the status is lost, required WAL has already been removed and the slot is permanently unusable.
Correlate with infrastructure changes. Slot names often match hostnames or connector names. Check your deployment logs. If a replica was rebuilt under a new name or a CDC task was deleted recently, the old slot was likely left behind.
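
The decision logic in these steps can be sketched as a small Python function that maps one row of pg_replication_slots to a suggested next step. This is an illustrative sketch, not a definitive procedure: the function name, the known_consumer flag (standing in for a lookup against your own replica or connector inventory), and the 1 GB default threshold are assumptions, and wal_status is expected to be None on versions before PostgreSQL 13.

```python
from typing import Optional

ONE_GB = 1024 ** 3


def triage_slot(active: bool, lag_bytes: int, wal_status: Optional[str],
                known_consumer: bool, lag_threshold: int = ONE_GB) -> str:
    """Suggest a next step for one replication slot row."""
    if wal_status == "lost":
        # Required WAL is already gone; the slot cannot be used again.
        return "invalidated: recreate the slot and reinitialize the consumer"
    if not active and not known_consumer:
        # Nothing is attached and nothing in the inventory claims the slot.
        return "orphaned: candidate for pg_drop_replication_slot"
    if not active:
        return "inactive: find out why the known consumer is offline"
    if lag_bytes > lag_threshold:
        return "lagging: investigate the consumer, do not drop the slot"
    return "healthy"


# Hypothetical rows: (active, lag_bytes, wal_status, known_consumer)
print(triage_slot(False, 40 * ONE_GB, "extended", False))
print(triage_slot(True, 3 * ONE_GB, "extended", True))
```

A single lag reading cannot show whether lag is increasing, so a real check would compare at least two samples taken some minutes apart.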

Metrics and signals to monitor

Signal	Why it matters	Warning sign
pg_replication_slots.active	Inactive slots retain WAL indefinitely	active = false for longer than 15 minutes
Lag bytes per slot	Measures exact WAL retained	pg_wal_lsn_diff > 1 GB and growing
pg_replication_slots.wal_status	Tracks whether slot is within limits	extended, unreserved, or lost
safe_wal_size	Bytes remaining before invalidation (NULL when max_slot_wal_keep_size is -1)	Negative or approaching zero
pg_wal directory size	Direct measure of accumulation	Sustained growth above baseline
pg_stat_replication lag	Distinguishes active streaming from slot-bound WAL	Active replicas healthy while WAL still grows
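
As one way to turn the warning signs in this table into an automated check, the Python sketch below compares two consecutive samples of the same slot. The SlotSample type, the inactive_seconds field (assumed to be tracked by whatever collects the samples), and the 1 GB safe_wal_size margin are assumptions made for illustration; the 15 minute and 1 GB thresholds come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional

ONE_GB = 1024 ** 3


@dataclass
class SlotSample:
    active: bool
    inactive_seconds: float       # time since the slot went inactive; 0 if active
    lag_bytes: int                # pg_wal_lsn_diff(current LSN, restart_lsn)
    wal_status: Optional[str]     # None before PostgreSQL 13
    safe_wal_size: Optional[int]  # None when no slot limit is configured


def warning_signs(prev: SlotSample, curr: SlotSample,
                  safe_margin: int = ONE_GB) -> List[str]:
    """Return the warning signs that the current sample trips."""
    signs = []
    if not curr.active and curr.inactive_seconds > 15 * 60:
        signs.append("inactive for more than 15 minutes")
    if curr.lag_bytes > ONE_GB and curr.lag_bytes > prev.lag_bytes:
        signs.append("lag above 1 GB and growing")
    if curr.wal_status in ("extended", "unreserved", "lost"):
        signs.append("wal_status is " + curr.wal_status)
    if curr.safe_wal_size is not None and curr.safe_wal_size < safe_margin:
        signs.append("safe_wal_size below margin")
    return signs


# Hypothetical samples taken five minutes apart.
before = SlotSample(False, 900, 2 * ONE_GB, "extended", None)
after = SlotSample(False, 1200, 3 * ONE_GB, "extended", None)
print(warning_signs(before, after))
```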

Fixes

Orphaned inactive slot

If the slot is inactive and the consumer is decommissioned, drop it. This is destructive and irreversible. If you ever need that consumer again, it must be reinitialized from a full snapshot.

SELECT pg_drop_replication_slot('slot_name');

After the drop, the slot barrier is gone, but segments are not removed instantly. The next checkpoint evaluates them for removal. If disk is critically full and the regular checkpoint interval is too long, run CHECKPOINT manually. This can cause a brief I/O spike. If the disk is completely full, CHECKPOINT itself may fail; in that case, expand storage or remove non-essential files (such as old log files outside PGDATA) to free space before forcing a checkpoint.

Attempting to drop an active slot returns an error. Stop the consumer application or replica first.
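
If the drop is scripted, a guard that re-checks the slot state first helps avoid dropping the wrong slot. The Python sketch below assumes a DB-API cursor that uses %s placeholders (as psycopg does) and leaves connection and transaction handling to the caller; the function name and the force_checkpoint flag are hypothetical.

```python
def drop_inactive_slot(cur, slot_name: str, force_checkpoint: bool = False) -> None:
    """Drop slot_name only if it exists and has no attached consumer.

    cur: a DB-API cursor with %s-style placeholders (assumption).
    force_checkpoint: also issue CHECKPOINT so the segments are evaluated
    for removal right away, at the cost of a brief I/O spike.
    """
    cur.execute(
        "SELECT active FROM pg_replication_slots WHERE slot_name = %s",
        (slot_name,),
    )
    row = cur.fetchone()
    if row is None:
        raise LookupError("no replication slot named " + repr(slot_name))
    if row[0]:
        # PostgreSQL would reject the drop anyway; fail early with a clear message.
        raise RuntimeError("slot is active; stop the consumer first")

    cur.execute("SELECT pg_drop_replication_slot(%s)", (slot_name,))
    if force_checkpoint:
        cur.execute("CHECKPOINT")
```

The existence and activity checks are a convenience, not a lock: a consumer can still reconnect between the check and the drop, in which case the drop itself returns an error.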

Active but lagging consumer

Do not drop an active slot. Identify the consumer and resolve the root cause before the lag exhausts disk. For logical replication, check subscriber logs for apply errors, network stalls, or large transactions that block the apply worker. For physical replication, check I/O latency, replication slot state, and pg_stat_wal_receiver on the replica. If the consumer cannot catch up and the lag threatens disk exhaustion, plan a controlled rebuild: stop the consumer, drop the slot, take a fresh base backup, and recreate the slot.
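
To judge whether lag "threatens disk exhaustion", a rough time-to-full estimate from two lag samples is often enough. The Python sketch below is a linear extrapolation under the assumption that slot lag is the only thing consuming the WAL volume; the function name and the sample numbers are hypothetical.

```python
from typing import Optional


def seconds_until_full(free_bytes: int, lag_then: int, lag_now: int,
                       interval_s: float) -> Optional[float]:
    """Estimate seconds until the WAL volume fills, or None if lag is not growing.

    lag_then and lag_now are two lag_bytes readings for the same slot,
    taken interval_s seconds apart.
    """
    growth = lag_now - lag_then
    if growth <= 0:
        return None  # the consumer is keeping up or catching up
    return free_bytes * interval_s / growth


GIB = 1024 ** 3
# Hypothetical: 40 GiB free, lag grew from 6 GiB to 8 GiB over 10 minutes.
eta = seconds_until_full(40 * GIB, 6 * GIB, 8 * GIB, 600.0)
print(eta)  # 12000.0 seconds, about 200 minutes
```

If the estimate is shorter than the time needed to fix the consumer, that is the signal to start the controlled rebuild rather than wait.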

Slot approaching invalidation

If wal_status is unreserved, the slot has exceeded max_slot_wal_keep_size. At the next checkpoint, required WAL can be removed and the slot will become unusable. Do not attempt to restart the consumer against this slot. Recreate the slot and reinitialize the consumer from a fresh base backup. If this was your only logical subscription, plan for a full initial sync.

Prevention

Set max_slot_wal_keep_size to a finite value on PostgreSQL 13 and later. A limit of 10 GB to 100 GB, depending on disk headroom, caps per-slot WAL retention. When exceeded, the slot is invalidated rather than filling the disk. Balance this against recovery needs: a limit that is too low may invalidate a slot during a temporary outage and force a full resync.
Monitor pg_replication_slots for inactive slots and alert when active = false persists, for example for longer than 15 minutes.
Document every slot. Record the slot name, owner, and consumer. Remove slots as part of consumer decommissioning.
Automate cleanup. Include slot removal in teardown pipelines for ephemeral replicas and CDC tasks.
Separate WAL from data. Store WAL on a dedicated volume so slot bloat cannot take down the data directory or root filesystem.
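
Choosing a value for max_slot_wal_keep_size comes down to simple arithmetic: WAL generation rate multiplied by the longest consumer outage you want to survive, checked against the space available on the WAL volume. The Python sketch below illustrates that trade-off. The function name, the 20% reserve, and the sample workload numbers are assumptions for illustration, not recommendations.

```python
from typing import Dict


def size_slot_keep_limit(wal_rate_mb_per_s: float, tolerated_outage_hours: float,
                         wal_volume_free_gb: float,
                         reserve_fraction: float = 0.2) -> Dict[str, object]:
    """Compare the WAL needed to ride out an outage with what the disk allows."""
    # WAL produced while the consumer is away, converted from MB to GB.
    needed_gb = wal_rate_mb_per_s * tolerated_outage_hours * 3600 / 1024
    # Keep part of the volume free for normal WAL and checkpoints (assumption).
    ceiling_gb = wal_volume_free_gb * (1 - reserve_fraction)
    return {
        "needed_gb": needed_gb,
        "ceiling_gb": ceiling_gb,
        "fits": needed_gb <= ceiling_gb,
    }


# Hypothetical workload: 5 MB/s of WAL, 2 hour outage tolerance, 200 GB free.
print(size_slot_keep_limit(5, 2, 200))
```

When "fits" is false, either the outage tolerance has to shrink or the WAL volume has to grow; setting the limit above the ceiling only moves the failure from slot invalidation back to a full disk.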

How Netdata helps

Correlate disk usage on the WAL volume with per-slot replication lag.
Alert on inactive slots and lag in bytes without manual polling.
Track WAL growth rate independently of table size to separate slot bloat from table bloat.
Expose wal_status transitions before the slot reaches lost.
Map slot creation timestamps to deployment events to identify orphaned slots quickly.

