PostgreSQL replication slot bloat: when a stale slot fills the disk

Disk is filling on the primary. Table sizes are stable and active replicas show healthy replication lag, but WAL in pg_wal keeps growing and is not being recycled. The most likely cause is a stale replication slot. When a logical subscriber, CDC connector, or physical replica disconnects without cleaning up its slot, PostgreSQL retains every WAL segment from the slot’s restart_lsn onward. This retention ignores max_wal_size; only max_slot_wal_keep_size (PostgreSQL 13 and later) can cap it, and its default of -1 means unlimited. With the default, the primary retains WAL until the disk fills and writes halt. This guide covers confirmation, recovery, and prevention.

What this means

A replication slot is a durable promise that PostgreSQL will not remove WAL until the consumer has processed it. Each slot tracks a restart_lsn, the oldest WAL position its consumer may still need. A checkpoint can remove or recycle a segment only when it is older than every slot’s restart_lsn. If a consumer goes offline, its restart_lsn stops advancing, and WAL accumulates in pg_wal. max_wal_size does not help here: it is a soft target that paces checkpoints, not a cap on WAL held back by slots. Unless max_slot_wal_keep_size is configured, there is no ceiling.
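To make the retention arithmetic concrete, the sketch below computes how much WAL a frozen slot would pin. Both LSN literals are made-up values chosen for illustration, and the segment count assumes the default 16 MB WAL segment size.

```sql
-- Hypothetical positions: the slot froze at 27/80000000, the primary has reached 2A/0.
SELECT pg_wal_lsn_diff('2A/0'::pg_lsn, '27/80000000'::pg_lsn) AS retained_bytes,
       pg_wal_lsn_diff('2A/0'::pg_lsn, '27/80000000'::pg_lsn)
         / (16 * 1024 * 1024) AS segments_16mb;
-- retained_bytes is 10737418240 (10 GiB), which is about 640 segments of 16 MB
```

Every one of those segments stays in pg_wal until the slot advances or is dropped.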

Dropping a stale slot removes the retention barrier, but the next checkpoint is required to mark old segments as removable. If the disk fills completely, PostgreSQL cannot complete checkpoints, cannot write new WAL, and rejects writes. All applications are affected, not just the missing consumer.

flowchart TD
    A[Consumer disconnects or stalls] --> B[Slot freezes restart_lsn]
    B --> C[WAL accumulates in pg_wal]
    C --> D[pg_wal grows past max_wal_size]
    D --> E[Disk fills and new WAL cannot be written]
    E --> F[Primary rejects writes]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Orphaned logical slot from a decommissioned CDC tool | Slot active = false, inactive_since hours old on PG17+, WAL growing steadily | pg_replication_slots for logical slots with no matching consumer |
| Physical replica rebuilt without dropping its old slot | Old physical slot remains with stale restart_lsn; new replica uses a different slot name | Slot names against your known replica inventory |
| Stalled consumer that is connected but not advancing | Slot active = true, lag increasing, consumer process idle or crashed | Consumer logs and pg_stat_replication for sent vs. flushed LSN |
| Unlimited max_slot_wal_keep_size with no monitoring | No ceiling on WAL retention; disk fills silently until manual intervention | SHOW max_slot_wal_keep_size; returns -1 |

Quick checks

Run these read-only checks to confirm slot-related WAL growth.

-- Slot status and lag in bytes
SELECT slot_name, slot_type, active, restart_lsn,
       pg_current_wal_lsn() AS current_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

-- Total WAL size (PG10+)
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- WAL status and safety margin (PG13+)
SELECT slot_name, wal_status, safe_wal_size
FROM pg_replication_slots;

-- Retention limits
SHOW max_wal_size;
SHOW max_slot_wal_keep_size;

# Shell, run from the PostgreSQL data directory
du -sh pg_wal/

How to diagnose it

  1. Confirm the growth is WAL, not data. Compare total database size to disk usage. If pg_database_size() totals are stable but disk usage is climbing, suspect WAL, logs, or temporary files. Use pg_ls_waldir() or du on the pg_wal directory to confirm WAL growth.
  2. Identify stale slots. Query pg_replication_slots. Flag rows where active is false or pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) is large and increasing. A slot mapping to a decommissioned host or unknown connector name is likely orphaned. For example, a logical slot named after a Kafka connector that was deleted last week is a clear candidate for removal.
  3. Distinguish inactive from lagging. An inactive slot (active = false) has no streaming walsender. An active slot with growing lag still has a consumer, but that consumer is not acknowledging progress. The fix differs: inactive slots are dropped, while lagging consumers require intervention on the remote side. Cross-reference pg_replication_slots.active_pid with pg_stat_replication.pid to map active connections to slot names and confirm which consumer is attached.
  4. Check WAL status. On PostgreSQL 13+, wal_status indicates whether the slot is within max_wal_size (reserved), beyond it but still held (extended), or past max_slot_wal_keep_size and heading for removal (unreserved). If the status is lost, required WAL has already been removed and the slot is permanently unusable.
  5. Correlate with infrastructure changes. Slot names often match hostnames or connector names. Check your deployment logs. If a replica was rebuilt under a new name or a CDC task was deleted recently, the old slot was likely left behind.
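The steps above can be sketched as two read-only queries. This is one possible triage layout, not a canonical one: the wording in the triage column is illustrative, wal_status requires PostgreSQL 13 or later, and both queries are meant for the primary (pg_current_wal_lsn() raises an error during recovery).

```sql
-- Step 1: combined size of all databases, to compare against filesystem usage.
-- Needs superuser or pg_read_all_stats to size databases you cannot connect to.
SELECT pg_size_pretty(sum(pg_database_size(oid))) AS all_databases
FROM pg_database;

-- Steps 2-4: one row per slot, largest retention first, with the attached
-- walsender (if any) matched through active_pid.
SELECT s.slot_name,
       s.slot_type,
       s.active,
       s.wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)) AS retained_wal,
       r.application_name,
       r.client_addr,
       r.state,
       CASE
         WHEN s.wal_status = 'lost' THEN 'lost: slot unusable, recreate and resync'
         WHEN NOT s.active THEN 'inactive: confirm owner, then consider dropping'
         ELSE 'active: investigate consumer if retained_wal keeps growing'
       END AS triage
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_replication AS r ON r.pid = s.active_pid
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) DESC NULLS LAST;
```

Run the second query twice a few minutes apart. A slot whose retained_wal grows between runs while its restart_lsn stays fixed is the one holding WAL back.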

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| pg_replication_slots.active | Inactive slots retain WAL indefinitely | active = false for longer than 15 minutes |
| Lag bytes per slot | Measures exact WAL retained | pg_wal_lsn_diff > 1 GB and growing |
| pg_replication_slots.wal_status | Tracks whether slot is within limits | extended, unreserved, or lost |
| safe_wal_size | Bytes remaining before invalidation | Negative or approaching zero (NULL when max_slot_wal_keep_size is -1) |
| pg_wal directory size | Direct measure of accumulation | Sustained growth above baseline |
| pg_stat_replication lag | Distinguishes active streaming from slot-bound WAL | Active replicas healthy while WAL still grows |
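As a rough sketch, several of these warning signs can be folded into one query that a scheduler or exporter polls. The 1 GB threshold comes from the table above. A SQL snapshot cannot express "for longer than 15 minutes", so that duration is assumed to be enforced by whatever alerting system consumes the rows.

```sql
-- Returns only slots that breach at least one warning condition (PG13+ for wal_status).
SELECT slot_name,
       active,
       wal_status,
       safe_wal_size,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE NOT active
   OR wal_status IN ('extended', 'unreserved', 'lost')
   OR pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > pg_size_bytes('1 GB');
```

An empty result means no slot currently trips any of the checks. It says nothing about WAL retained for other reasons.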

Fixes

Orphaned inactive slot

If the slot is inactive and the consumer is decommissioned, drop it. This is destructive and irreversible. If you ever need that consumer again, it must be reinitialized from a full snapshot.

SELECT pg_drop_replication_slot('slot_name');

After the drop, the slot barrier is gone, but segments are not removed instantly. The next checkpoint evaluates them for removal. If disk is critically full and the regular checkpoint interval is too long, run CHECKPOINT manually. This can cause a brief I/O spike. If the disk is completely full, CHECKPOINT itself may fail; in that case, expand storage or remove non-essential files (such as old log files outside PGDATA) to free space before forcing a checkpoint.

Attempting to drop an active slot returns an error. Stop the consumer application or replica first.
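A guarded version of the drop, followed by a checkpoint and a re-measurement, might look like the sketch below. It assumes a role allowed to drop slots, run CHECKPOINT, and call pg_ls_waldir() (superuser covers all three).

```sql
-- Drops the slot only if it is still inactive; if the consumer reconnected,
-- the WHERE clause matches no row and nothing is dropped.
-- 'slot_name' is a placeholder for the real slot name.
SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'slot_name' AND NOT active;

-- Request an immediate checkpoint so unneeded segments can be removed or recycled.
CHECKPOINT;

-- Re-measure pg_wal and compare with the figure taken before the drop.
SELECT count(*) AS segments, pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```

If wal_size does not fall after the checkpoint, another slot is probably still holding WAL back.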

Active but lagging consumer

Do not drop an active slot. Identify the consumer and resolve the root cause before the lag exhausts disk. For logical replication, check subscriber logs for apply errors, network stalls, or large transactions that block the apply worker. For physical replication, check replica I/O latency, replication slot state on the replica, and pg_stat_wal_receiver on the replica. If the consumer cannot catch up and the lag threatens disk exhaustion, plan a controlled rebuild: stop the consumer, drop the slot, take a fresh base backup, and recreate the slot.
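To see where a connected consumer is stuck, one option is to split its lag into stages. The first query runs on the primary. The second is for a physical replica and uses the pg_stat_wal_receiver column names from PostgreSQL 13 and later.

```sql
-- On the primary: how much WAL is waiting at each stage, per walsender.
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS not_yet_sent_bytes,
       pg_wal_lsn_diff(sent_lsn, flush_lsn)            AS sent_not_flushed_bytes,
       pg_wal_lsn_diff(flush_lsn, replay_lsn)          AS flushed_not_replayed_bytes
FROM pg_stat_replication;

-- On a physical replica: is the WAL receiver running, and when did it last hear from the primary?
SELECT status, flushed_lsn, latest_end_lsn, last_msg_receipt_time, slot_name
FROM pg_stat_wal_receiver;
```

A large not_yet_sent_bytes value suggests the sender or network is the bottleneck. Large values in the later columns suggest the consumer is slow to write or apply.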

Slot approaching invalidation

If wal_status is unreserved, the slot has exceeded max_slot_wal_keep_size. At the next checkpoint, required WAL can be removed and the slot will become unusable. Do not attempt to restart the consumer against this slot. Recreate the slot and reinitialize the consumer from a fresh base backup. If this was your only logical subscription, plan for a full initial sync.

Prevention

  • Set max_slot_wal_keep_size to a finite value on PostgreSQL 13 and later. A limit of 10 GB to 100 GB, depending on disk headroom, caps per-slot WAL retention. When exceeded, the slot is invalidated rather than filling the disk. Balance this against recovery needs: a limit that is too low may invalidate a slot during a temporary outage and force a full resync.
  • Monitor pg_replication_slots for inactive slots and alert on active = false.
  • Document every slot. Record the slot name, owner, and consumer. Remove slots as part of consumer decommissioning.
  • Automate cleanup. Include slot removal in teardown pipelines for ephemeral replicas and CDC tasks.
  • Separate WAL from data. Store WAL on a dedicated volume so slot bloat cannot take down the data directory or root filesystem.
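The first bullet can be applied without a restart, since max_slot_wal_keep_size takes effect on a configuration reload. The 50GB figure below is purely illustrative and should be sized against the headroom on your WAL volume.

```sql
-- PostgreSQL 13+. ALTER SYSTEM cannot run inside a transaction block.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';  -- illustrative value, not a recommendation
SELECT pg_reload_conf();
SHOW max_slot_wal_keep_size;
```

Once a finite limit is in place, safe_wal_size in pg_replication_slots becomes non-NULL and can be tracked as a countdown toward invalidation.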

How Netdata helps

  • Correlate disk usage on the WAL volume with per-slot replication lag.
  • Alert on inactive slots and lag in bytes without manual polling.
  • Track WAL growth rate independently of table size to separate slot bloat from table bloat.
  • Expose wal_status transitions before the slot reaches lost.
  • Map slot creation timestamps to deployment events to identify orphaned slots quickly.