PostgreSQL pg_wal directory full: causes and emergency recovery

Your paging system fires because the PostgreSQL primary has stopped accepting writes. The error log reports a disk-full condition, and the WAL volume is at 100%. Do not delete files from pg_wal to free space: removing a segment that is still needed can corrupt the database and break replication.

PostgreSQL recycles WAL segments only after a checkpoint, and only when they are no longer needed for crash recovery, archiving, or replication slots. Archiving failures and stalled replication slots cause unbounded accumulation; bulk loads that exceed max_wal_size cause temporary spikes that can still exhaust an undersized volume. This guide covers identification, safe recovery, and prevention.

What this means

The pg_wal directory contains the Write-Ahead Log. Each segment is 16 MB by default. During normal operation, PostgreSQL reuses segments after a checkpoint once no consumer needs them. If a consumer is missing or failing, segments are retained indefinitely.

Because max_wal_size is a soft target, large transactions or bulk loads can push WAL generation well above the configured value. If the volume lacks headroom, a temporary spike becomes an emergency. Once the disk is full, PostgreSQL cannot complete checkpoints and will eventually shut down.

flowchart TD
    A[Write workload] --> B{WAL retention block}
    B -->|archive_command fails| C[WAL accumulates]
    B -->|inactive replication slot| C
    B -->|bulk load exceeds max_wal_size| C
    C --> D[pg_wal directory grows]
    D --> E[Disk volume fills]
    E --> F[Checkpoint cannot complete]
    F --> G[PostgreSQL stops accepting writes]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Failing `archive_command` | `pg_wal` grows steadily; logs contain “archiving write-ahead log file … failed too many times” | `pg_stat_archiver.failed_count` and PostgreSQL logs |
| Inactive replication slot | Primary disk grows while replicas appear healthy; a slot shows `active = false` | `pg_replication_slots` for `active` status and `restart_lsn` lag |
| `max_wal_size` pressure with insufficient headroom | Brief latency spikes during bulk loads; volume usage climbs fast but archive and slots are healthy | `pg_stat_bgwriter.checkpoints_req` versus `checkpoints_timed`, and current WAL LSN growth |

Quick checks

Run these read-only checks to confirm the cause.

# WAL volume usage (quoted so a PGDATA path with spaces still works)
df -h "$PGDATA/pg_wal"

# pg_wal directory size
du -sh "$PGDATA/pg_wal"

-- Archiver failures
SELECT failed_count, last_failed_time, last_archived_time
FROM pg_stat_archiver;

-- Replication slot status and lag
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

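-- Checkpoint pressure (PostgreSQL 16 and earlier; on 17 and later, read
-- num_timed and num_requested from pg_stat_checkpointer instead)
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;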
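
To quantify current WAL LSN growth, one option is to sample pg_current_wal_lsn() twice and take the difference. The sketch below is a minimal example, not a definitive tool: the function name wal_growth_bytes is hypothetical, and it assumes psql can reach the primary through the usual PG* environment variables.

```shell
# Hypothetical helper: prints the bytes of WAL generated during a sampling window.
# Assumes psql connects via PGHOST/PGUSER/PGDATABASE and runs against the primary.
wal_growth_bytes() {
  wgb_interval=${1:-10}   # sampling window in seconds
  wgb_start=$(psql -X -At -c "SELECT pg_current_wal_lsn()") || return 1
  sleep "$wgb_interval"
  # pg_wal_lsn_diff returns the byte distance between two LSNs
  psql -X -At -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '$wgb_start')"
}

# Example: WAL bytes written over 30 seconds
# wal_growth_bytes 30
```

Dividing the result by the window length gives a bytes-per-second rate that can be compared against the free space left on the WAL volume.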

How to diagnose it

1. Confirm the symptom. Use df and du on the WAL volume. Usage above 85% and climbing is an active incident.
2. Check pg_stat_archiver. If failed_count is increasing, archiving is broken. Inspect the PostgreSQL logs for permission errors, a full archive destination, or an invalid command path.
3. Check pg_replication_slots. If a slot shows active = false or its restart_lsn lags far behind pg_current_wal_lsn(), that slot is retaining WAL. Divide lag_bytes by the segment size (16 MB by default) to estimate the segment backlog.
4. Check pg_stat_bgwriter (pg_stat_checkpointer on PostgreSQL 17 and later). If checkpoints_req exceeds roughly 10% of checkpoints_timed, the write rate is forcing checkpoints before checkpoint_timeout expires, which suggests max_wal_size is too small for the workload.
5. Correlate with recent workload. Bulk inserts, large DDL, or VACUUM FREEZE generate WAL spikes that temporarily exceed max_wal_size. Look at pg_stat_database for sudden increases in xact_commit or in tup_inserted, tup_updated, and tup_deleted.
6. Determine whether growth is bounded or unbounded. Archive failures and inactive slots cause unbounded growth. Bulk load spikes plateau once the load finishes.
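
The segment-backlog estimate from the slot check can be sketched as a small shell function. The name lag_to_segments is hypothetical, and the 16 MB default is an assumption that only holds if the cluster was initialized with the default WAL segment size.

```shell
# Hypothetical helper: converts retained WAL bytes into a segment count.
# Assumes the default 16 MB segment size unless a second argument is given.
lag_to_segments() {
  lts_bytes=$1
  lts_seg=${2:-16777216}   # 16 MB in bytes
  # Round up so a partially filled segment still counts as retained
  echo $(( (lts_bytes + lts_seg - 1) / lts_seg ))
}

lag_to_segments 1073741824   # 1 GB of slot lag, prints 64
```

Feed it the lag_bytes value returned by the replication slot query in the quick checks.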

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `pg_stat_archiver.failed_count` | A rising count means WAL is not leaving the primary | Any sustained increase over a 5-minute window |
| `pg_replication_slots.active` | An inactive slot retains WAL indefinitely | `active = false` for more than 5 minutes |
| `pg_wal` disk utilization | Direct measure of WAL volume pressure | Sustained growth above 80% |
| `pg_wal_lsn_diff` for slots | Quantifies how much WAL a slot is holding back | Lag exceeding 1 GB (roughly 64 segments) |
| `checkpoints_req / checkpoints_timed` | Forced checkpoints indicate `max_wal_size` stress | Ratio greater than 0.1 sustained |
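
As a rough illustration, the warning signs in this table can be evaluated against one sample of values. The function name and argument order below are hypothetical, and the sketch checks a single point in time only; the "sustained" conditions would need repeated evaluation by a scheduler or alerting system.

```shell
# Hypothetical helper: prints one line per warning sign that fires.
# Arguments (all integers):
#   $1 increase in failed_count over the last 5 minutes
#   $2 seconds the slot has been inactive
#   $3 WAL volume utilization in percent
#   $4 slot lag in bytes
#   $5 checkpoints_req   $6 checkpoints_timed
wal_warning_signs() {
  [ "$1" -gt 0 ] && echo "archiver: failed_count is rising"
  [ "$2" -gt 300 ] && echo "slot: inactive for more than 5 minutes"
  [ "$3" -gt 80 ] && echo "disk: WAL volume above 80%"
  [ "$4" -gt 1073741824 ] && echo "slot: lag exceeds 1 GB"
  # req / timed > 0.1 is the same as req * 10 > timed in integer math
  [ $(( $5 * 10 )) -gt "$6" ] && echo "checkpoints: req/timed ratio above 0.1"
  return 0
}

# Example: only the disk threshold is crossed
# wal_warning_signs 0 0 85 0 5 100
```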

Fixes

Archive command failure

If pg_stat_archiver shows failures, the primary retains every WAL segment because it believes they are not archived.

Find the root cause in the PostgreSQL logs. Common issues include a full NFS mount, misconfigured S3 credentials, or a changed path in archive_command. As a temporary emergency measure, you can replace archive_command with a no-op that always reports success, so PostgreSQL marks pending segments as archived and can recycle them.

WARNING: Setting archive_command = '/bin/true' breaks your backup chain, because every segment "archived" this way is discarded. Revert it immediately after the incident and take a new base backup.

Set the parameter and reload the configuration; archive_command does not need a restart. If the disk is already 100% full, ALTER SYSTEM may fail because it writes postgresql.auto.conf in the data directory; in that case edit postgresql.conf directly, then reload or start the server.
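
The parameter change and reload might look like the following sketch. The helper names are hypothetical, and it assumes psql connects as a superuser through the usual PG* environment variables and that the original archive_command lives in postgresql.conf rather than postgresql.auto.conf.

```shell
# Hypothetical helpers; psql is assumed to connect as a superuser via PG* variables.
# Each statement gets its own psql call because ALTER SYSTEM cannot run
# inside a transaction block.
archive_noop_on() {
  # Print the current value first so it can be restored after the incident
  psql -X -At -c "SHOW archive_command" &&
  psql -X -c "ALTER SYSTEM SET archive_command = '/bin/true'" &&
  psql -X -c "SELECT pg_reload_conf()"
}

archive_noop_off() {
  # RESET drops the override from postgresql.auto.conf, so the value in
  # postgresql.conf applies again (assumes the original was set there)
  psql -X -c "ALTER SYSTEM RESET archive_command" &&
  psql -X -c "SELECT pg_reload_conf()"
}
```

Run archive_noop_on during the incident and archive_noop_off as soon as the archive destination works again. If the original command was itself set with ALTER SYSTEM, restore the recorded value explicitly instead of using RESET.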

Once the real archive path is restored, PostgreSQL resumes archiving. Monitor pg_stat_archiver until the backlog clears.

Replication slot retention

If an inactive or lagging slot is the cause, decide whether the consumer is coming back.

If the consumer is decommissioned or cannot catch up before the disk fills, drop the slot:

SELECT pg_drop_replication_slot('slot_name');

The slot must be inactive; if it is active, stop the associated replica or subscriber first.

Dropping the slot immediately allows WAL recycling. If the consumer was a logical replication subscriber or CDC connector, you must reinitialize it from a fresh snapshot. If the disk is already 100% full, dropping the slot may still require a checkpoint before space is freed, so add emergency disk space first.

On PostgreSQL 13 and later, set max_slot_wal_keep_size to cap how much WAL a slot can retain. The tradeoff is that a replica or subscriber falling behind beyond this limit must be rebuilt.
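
A minimal sketch of applying such a cap follows. The helper name and the 10GB example value are hypothetical; pick a limit that fits your volume. It assumes PostgreSQL 13 or later and a superuser psql connection.

```shell
# Hypothetical helper; the cap value is passed in by the caller.
# Assumes PostgreSQL 13 or later and a superuser psql connection.
set_slot_wal_cap() {
  sswc_cap=$1
  psql -X -c "ALTER SYSTEM SET max_slot_wal_keep_size = '$sswc_cap'" &&
  psql -X -c "SELECT pg_reload_conf()" &&
  # wal_status and safe_wal_size show how close each slot is to the cap
  psql -X -c "SELECT slot_name, active, wal_status, safe_wal_size FROM pg_replication_slots"
}

# Example: set_slot_wal_cap 10GB
```

A slot whose wal_status reports lost has exceeded the cap, and its consumer must be rebuilt.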

max_wal_size pressure and bulk loads

If the cause is a write spike and the volume is simply too small, expand the WAL volume if possible. Then increase max_wal_size to match your sustained WAL generation rate, and ensure the volume has at least 30% headroom above that value.
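
The headroom rule can be turned into quick arithmetic. The function name below is hypothetical, and the result is only a lower bound: it does not account for WAL held back by failing archiving or replication slots.

```shell
# Hypothetical helper: smallest WAL volume size (in MB) that leaves the given
# percentage of headroom above max_wal_size. Defaults to 30%.
wal_volume_min_mb() {
  wvm_max_wal_mb=$1
  wvm_headroom_pct=${2:-30}
  # Integer math, rounded up to the next whole MB
  echo $(( (wvm_max_wal_mb * (100 + wvm_headroom_pct) + 99) / 100 ))
}

wal_volume_min_mb 8192   # max_wal_size = 8GB, prints 10650
```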

After resolving the root cause, force a checkpoint to recycle unneeded segments:

CHECKPOINT;

This requires superuser privileges (or, on PostgreSQL 15 and later, membership in the pg_checkpoint role) and can cause a brief I/O spike.

Emergency recovery when PostgreSQL has stopped

If the database has already shut down because pg_wal is full, do not delete WAL files manually.

1. Expand the WAL volume by even a few gigabytes, if your infrastructure allows it, to give PostgreSQL room to start.
2. Start PostgreSQL and immediately identify the root cause using the steps above.
3. If you cannot expand the volume and you have a valid base backup with a known cutoff LSN, you can use pg_archivecleanup to remove segments that are already archived and older than that cutoff.

WARNING: pg_archivecleanup is designed for standby and archive cleanup. Running it against the primary pg_wal risks removing segments still required for crash recovery or replication slots.

Always run a dry run first:

# Preview what would be deleted
pg_archivecleanup -n -d $PGDATA/pg_wal 0000000100000001000000AB

Only remove segments you are certain are older than your recovery requirements and not needed by any replication slot.
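
One input to that decision is the WAL file holding the latest checkpoint's REDO location, which pg_controldata reports. The sketch below extracts it; the function name is hypothetical, and the result covers crash recovery only. It says nothing about segments still needed by archiving or replication slots.

```shell
# Hypothetical helper: prints the WAL file that holds the latest checkpoint's
# REDO location. Segments older than this file are not needed for crash
# recovery, but archiving or replication slots may still need them.
redo_wal_file() {
  # LC_ALL=C keeps the pg_controldata labels in English for the match below
  LC_ALL=C pg_controldata "$1" | awk -F': +' '/REDO WAL file/ { print $2 }'
}

# Example: redo_wal_file "$PGDATA"
```

Treat the printed file name as the newest possible cutoff, and prefer an older one whenever a slot or the archiver is behind.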

As an absolute last resort, if the server refuses to start and you have no other path, you may run pg_resetwal -f on the data directory after taking a filesystem backup. This discards the WAL needed for crash recovery, so committed changes can be lost and the data may be left inconsistent. Use it only when all other options are exhausted, then dump the data and restore it into a freshly initialized cluster.

Prevention

Monitor pg_stat_archiver and alert on any increase in failed_count. A persistently failing archive_command, left unattended, eventually fills the disk.

Monitor pg_replication_slots for inactive slots. On PostgreSQL 13 and later, set max_slot_wal_keep_size to prevent unbounded retention. Document every slot owner and require automated cleanup when decommissioning subscribers or replicas.

Size the WAL volume with at least 30% headroom above your configured max_wal_size, and review checkpoints_req weekly: a rising share of requested checkpoints means max_wal_size no longer matches the write rate.

How Netdata helps

Correlate pg_wal disk utilization with pg_stat_archiver.failed_count and replication slot lag on the same timeline to distinguish archive failures from slot retention.
Alert on WAL volume utilization before it crosses 85%, giving time to drop a stale slot or fix archiving.
Track checkpoints_req against checkpoints_timed to detect sizing stress that precedes volume exhaustion.
Visualize replication slot lag in bytes, surfacing retention growth before it triggers an outage.

Related guides

How PostgreSQL actually works in production: a mental model for operators: /guides/postgres/how-postgres-works-in-production/
PostgreSQL ALTER TABLE blocked: zero-downtime DDL patterns: /guides/postgres/postgres-alter-table-blocked/
PostgreSQL autovacuum blocked by long-running transaction: detection and fix: /guides/postgres/postgres-autovacuum-blocked-by-long-transaction/
PostgreSQL autovacuum not running: detection, causes, and fixes: /guides/postgres/postgres-autovacuum-not-running/
PostgreSQL autovacuum tuning: per-table thresholds for high-churn workloads: /guides/postgres/postgres-autovacuum-tuning/
PostgreSQL blocking queries: finding the root blocker in a lock cascade: /guides/postgres/postgres-blocking-queries/
PostgreSQL connection exhaustion: detection, diagnosis, and prevention: /guides/postgres/postgres-connection-exhaustion/
PostgreSQL connection refused: pg_hba, listen_addresses, and TCP diagnosis: /guides/postgres/postgres-connection-refused/
PostgreSQL dead tuples piling up: why autovacuum can’t keep up: /guides/postgres/postgres-dead-tuples-piling-up/
PostgreSQL deadlock detected: how to diagnose and prevent deadlocks: /guides/postgres/postgres-deadlock-detected/
PostgreSQL frozen XID monitoring: catching wraparound 6 months early: /guides/postgres/postgres-frozen-xid-monitoring/
PostgreSQL idle in transaction: detecting and killing zombie sessions: /guides/postgres/postgres-idle-in-transaction/

