PostgreSQL pg_wal directory full: causes and emergency recovery
Your paging system fires because the PostgreSQL primary has stopped accepting writes. The error log reports a disk-full condition, and the WAL volume is at 100%. You cannot simply delete files from pg_wal to free space: doing so corrupts the database and breaks replication.
PostgreSQL recycles WAL segments only after a checkpoint, and only when they are no longer needed for crash recovery, archiving, or replication slots. Archiving failures and stalled replication slots cause unbounded accumulation, while bulk loads that exceed max_wal_size cause temporary spikes that can still fill a volume with too little headroom. This guide covers identification, safe recovery, and prevention.
What this means
The pg_wal directory contains the Write-Ahead Log. Each segment is 16 MB by default. During normal operation, PostgreSQL reuses segments after a checkpoint once no consumer needs them. If a consumer is missing or failing, segments are retained indefinitely.
Because max_wal_size is a soft target, large transactions or bulk loads can push WAL generation well above the configured value. If the volume lacks headroom, a temporary spike becomes an emergency. Once the disk is full, PostgreSQL cannot complete checkpoints and will eventually shut down.
flowchart TD
A[Write workload] --> B{WAL retention block}
B -->|archive_command fails| C[WAL accumulates]
B -->|inactive replication slot| C
B -->|bulk load exceeds max_wal_size| C
C --> D[pg_wal directory grows]
D --> E[Disk volume fills]
E --> F[Checkpoint cannot complete]
F --> G[PostgreSQL stops accepting writes]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Failing archive_command | pg_wal grows steadily; logs contain “archiving write-ahead log file … failed too many times” | pg_stat_archiver.failed_count and PostgreSQL logs |
| Inactive replication slot | Primary disk grows while replicas appear healthy; a slot shows active = false | pg_replication_slots for active status and restart_lsn lag |
| max_wal_size pressure with insufficient headroom | Brief latency spikes during bulk loads; volume usage climbs fast but archive and slots are healthy | pg_stat_bgwriter.checkpoints_req versus checkpoints_timed, and current WAL LSN growth |
Quick checks
Run these read-only checks to confirm the cause.
# WAL volume usage (quote the path in case PGDATA contains spaces)
df -h "$PGDATA/pg_wal"
# pg_wal directory size
du -sh "$PGDATA/pg_wal"
-- Archiver failures
SELECT failed_count, last_failed_time, last_archived_time
FROM pg_stat_archiver;
-- Replication slot status and lag
SELECT slot_name, active, restart_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
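-- Checkpoint pressure (PostgreSQL 16 and earlier; on 17 and later these
-- counters are num_timed and num_requested in pg_stat_checkpointer)
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;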
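The lag_bytes column above comes from pg_wal_lsn_diff, which subtracts one LSN from another. As a rough illustration of that arithmetic outside the database, the sketch below parses the textual X/Y form (two hexadecimal halves) into an absolute byte position and subtracts. The helper names lsn_to_int and lsn_diff_bytes are hypothetical, not part of PostgreSQL or any driver, and the sample LSNs are made up.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '1/AB000000' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between two LSNs (the same idea as pg_wal_lsn_diff)."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


if __name__ == "__main__":
    # Hypothetical values: current write position vs. a slot's restart_lsn.
    lag = lsn_diff_bytes("2/10000000", "1/D0000000")
    print(f"slot is holding back {lag} bytes of WAL")
```

This is handy when you only have LSNs copied from logs or a monitoring export and cannot run the query against the primary.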
How to diagnose it
- Confirm the symptom. Use df and du on the WAL volume. Usage above 85% and climbing is an active incident.
- Check pg_stat_archiver. If failed_count is increasing, archiving is broken. Inspect PostgreSQL logs for permission errors, a full archive destination, or an invalid command path.
- Check pg_replication_slots. If a slot is active = false or its restart_lsn lags far behind pg_current_wal_lsn(), that slot retains WAL. Divide lag_bytes by 16 MB to estimate the segment backlog.
- Check pg_stat_bgwriter. If checkpoints_req exceeds roughly 10% of checkpoints_timed, your write rate forces frequent checkpoints because max_wal_size is too small for the workload.
- Correlate with recent workload. Bulk inserts, large DDL, or VACUUM FREEZE generate WAL spikes that temporarily exceed max_wal_size. Look at pg_stat_database for sudden increases in xact_commit, tup_inserted, tup_updated, or tup_deleted.
- Determine whether growth is bounded or unbounded. Archive failures and inactive slots cause unbounded growth. Bulk load spikes plateau once the load finishes.
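The checks above can be folded into a small decision helper. The following is a minimal sketch, assuming you have already collected two samples of failed_count, the largest lag_bytes among inactive slots, and the two checkpoint counters. The WalSnapshot class and its field names are hypothetical, and the thresholds are simply the ones quoted in this guide, not PostgreSQL defaults.

```python
from dataclasses import dataclass
from typing import List

WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # default segment size used in this guide


@dataclass
class WalSnapshot:
    """Values read from the quick-check queries (field names are hypothetical)."""
    failed_count_before: int      # pg_stat_archiver.failed_count, earlier sample
    failed_count_now: int         # pg_stat_archiver.failed_count, later sample
    inactive_slot_lag_bytes: int  # largest lag_bytes among inactive slots, 0 if none
    checkpoints_timed: int
    checkpoints_req: int


def segment_backlog(lag_bytes: int) -> int:
    """Estimate how many whole segments a slot is holding back."""
    return lag_bytes // WAL_SEGMENT_BYTES


def likely_causes(s: WalSnapshot) -> List[str]:
    """Return the causes whose signals are present, in the order the guide checks them."""
    causes = []
    if s.failed_count_now > s.failed_count_before:
        causes.append("archive_command failing (unbounded growth)")
    if s.inactive_slot_lag_bytes > 0:
        n = segment_backlog(s.inactive_slot_lag_bytes)
        causes.append(f"inactive slot retaining ~{n} segments (unbounded growth)")
    # Fourth check in the list above: requested checkpoints above roughly 10% of timed ones.
    if s.checkpoints_timed > 0 and s.checkpoints_req > 0.10 * s.checkpoints_timed:
        causes.append("max_wal_size pressure (bounded spike)")
    return causes
```

More than one cause can be present at once, which is why the helper returns a list rather than a single verdict.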
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| pg_stat_archiver.failed_count | A rising count means WAL is not leaving the primary | Any sustained increase over a 5-minute window |
| pg_replication_slots.active | An inactive slot retains WAL indefinitely | active = false for more than 5 minutes |
| pg_wal disk utilization | Direct measure of WAL volume pressure | Sustained growth above 80% |
| pg_wal_lsn_diff for slots | Quantifies how much WAL a slot is holding back | Lag exceeding 1 GB (roughly 64 segments) |
| checkpoints_req / checkpoints_timed | Forced checkpoints indicate max_wal_size stress | Ratio greater than 0.1 sustained |
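Several of the warning signs above depend on duration rather than a single reading. As a rough sketch of how the "active = false for more than 5 minutes" condition might be evaluated from periodic samples, the function below measures how long the most recent run of inactive readings has lasted. The function name and the (timestamp, active) sample format are assumptions made for illustration; a real monitoring system will have its own way of expressing sustained conditions.

```python
from typing import List, Tuple


def inactive_longer_than(samples: List[Tuple[float, bool]],
                         window_seconds: float = 300.0) -> bool:
    """samples: (unix_timestamp, active) pairs for one slot, oldest first.

    Returns True when the trailing run of inactive samples spans more
    than window_seconds.
    """
    if not samples or samples[-1][1]:
        return False  # no data, or the slot is active right now
    run_start = samples[-1][0]
    # Walk backwards until the slot was last seen active.
    for ts, active in reversed(samples):
        if active:
            break
        run_start = ts
    return samples[-1][0] - run_start > window_seconds
```

Requiring a sustained run avoids paging on a replica that briefly reconnects, at the cost of reacting a few minutes later.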
Fixes
Archive command failure
If pg_stat_archiver shows failures, the primary retains every WAL segment because it believes they are not archived.
Find the root cause in PostgreSQL logs. Common issues include a full NFS mount, misconfigured S3 credentials, or a changed path in archive_command. As a temporary emergency measure, you can replace archive_command with a no-op such as '/bin/true', which reports every segment as archived and so allows recycling.
WARNING: Setting archive_command = '/bin/true' breaks your backup chain and must be reverted immediately after the incident.
Set the parameter and reload the configuration. If the disk is already 100% full, ALTER SYSTEM may fail because it writes to the data directory; edit postgresql.conf directly before starting the server.
Once the real archive path is restored, PostgreSQL resumes archiving. Monitor pg_stat_archiver until the backlog clears.
Replication slot retention
If an inactive or lagging slot is the cause, decide whether the consumer is coming back.
If the consumer is decommissioned or cannot catch up before the disk fills, drop the slot:
SELECT pg_drop_replication_slot('slot_name');
The slot must be inactive; if it is active, stop the associated replica or subscriber first.
Dropping the slot immediately allows WAL recycling. If the consumer was a logical replication subscriber or CDC connector, you must reinitialize it from a fresh snapshot. If the disk is already 100% full, dropping the slot may still require a checkpoint before space is freed, so add emergency disk space first.
On PostgreSQL 13 and later, set max_slot_wal_keep_size to cap how much WAL a slot can retain. The tradeoff is that a replica or subscriber falling behind beyond this limit must be rebuilt.
max_wal_size pressure and bulk loads
If the cause is a write spike and the volume is simply too small, expand the WAL volume if possible. Then increase max_wal_size to match your sustained WAL generation rate, and ensure the volume has at least 30% headroom above that value.
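The 30% headroom rule is simple arithmetic, sketched below with integer math so the result does not depend on floating-point rounding. The function names are hypothetical, and the rule only covers checkpoint-driven WAL: retention caused by a failing archiver or a stalled slot can exceed any fixed allowance.

```python
def min_wal_volume_bytes(max_wal_size_bytes: int, headroom_pct: int = 30) -> int:
    """Smallest volume size that leaves headroom_pct above max_wal_size (rounded up)."""
    return (max_wal_size_bytes * (100 + headroom_pct) + 99) // 100


def has_headroom(volume_bytes: int, max_wal_size_bytes: int,
                 headroom_pct: int = 30) -> bool:
    """True when the volume is at least max_wal_size plus the headroom allowance."""
    return volume_bytes >= min_wal_volume_bytes(max_wal_size_bytes, headroom_pct)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical sizing: max_wal_size of 10 GiB on a 12 GiB volume.
    print(min_wal_volume_bytes(10 * GIB))
    print(has_headroom(12 * GIB, 10 * GIB))
```

With these example numbers the 12 GiB volume falls short of the roughly 13 GiB the rule calls for.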
After resolving the root cause, force a checkpoint to recycle unneeded segments:
CHECKPOINT;
This requires superuser privileges (or, on PostgreSQL 15 and later, membership in the pg_checkpoint role) and can cause a brief I/O spike.
Emergency recovery when PostgreSQL has stopped
If the database has already shut down because pg_wal is full, do not delete WAL files manually.
- Expand the WAL volume by even a few gigabytes, if your infrastructure allows it, to give PostgreSQL room to start.
- Start PostgreSQL and immediately identify the root cause using the steps above.
- If you cannot expand the volume and you have a valid base backup with a known cutoff LSN, you can use pg_archivecleanup to remove old archived segments.
WARNING: pg_archivecleanup is designed for standby and archive cleanup. Running it against the primary pg_wal risks removing segments still required for crash recovery or replication slots.
Always run a dry run first:
# Preview what would be deleted (-n is a dry run; nothing is removed)
pg_archivecleanup -n -d "$PGDATA/pg_wal" 0000000100000001000000AB
Only remove segments you are certain are older than your recovery requirements and not needed by any replication slot.
- As an absolute last resort, if the server refuses to start and you have no other path, you may use pg_resetwal -f on the data directory after taking a filesystem backup. This discards the WAL that crash recovery would replay, so it risks data loss and can leave the database inconsistent. Use it only when all other options are exhausted.
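To sanity-check the pg_archivecleanup dry run from the steps above, it can help to decode segment file names yourself. A segment name is 24 hexadecimal digits: 8 for the timeline, 8 for the log number, and 8 for the segment within that log. The sketch below, assuming the default 16 MB segment size, converts a name to a running segment number and lists which names sort before a cutoff. The helper names are hypothetical, ignoring the timeline is an assumption of this sketch, and the code only reads names; it is not a substitute for pg_archivecleanup.

```python
import re

WAL_SEGMENT_BYTES = 16 * 1024 * 1024
SEGMENTS_PER_LOG = 0x100000000 // WAL_SEGMENT_BYTES  # 256 with 16 MB segments

WAL_NAME = re.compile(r"[0-9A-F]{24}")


def parse_wal_name(name: str):
    """Split a 24-hex-digit segment file name into (timeline, segment_number)."""
    if not WAL_NAME.fullmatch(name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def older_than(cutoff: str, names):
    """Return the segment names whose segment number is below the cutoff's.

    Non-segment entries (history files, archive_status, and so on) are skipped.
    The timeline is ignored, which is an assumption of this sketch.
    """
    _, cutoff_seg = parse_wal_name(cutoff)
    return sorted(
        n for n in names
        if WAL_NAME.fullmatch(n) and parse_wal_name(n)[1] < cutoff_seg
    )
```

For the cutoff used in the dry-run example, 0000000100000001000000AB decodes to timeline 1 and segment number 427, so only segments numbered below 427 would be listed.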
Prevention
Monitor pg_stat_archiver and alert on any increase in failed_count. A single failed archive command, left unattended, eventually fills the disk.
Monitor pg_replication_slots for inactive slots. On PostgreSQL 13 and later, set max_slot_wal_keep_size to prevent unbounded retention. Document every slot owner and require automated cleanup when decommissioning subscribers or replicas.
Size the WAL volume with at least 30% headroom above your configured max_wal_size, and review checkpoints_req weekly to confirm that max_wal_size still fits the workload and is not forcing frequent requested checkpoints.
How Netdata helps
- Correlate pg_wal disk utilization with pg_stat_archiver.failed_count and replication slot lag on the same timeline to distinguish archive failures from slot retention.
- Alert on WAL volume utilization before it crosses 85%, giving time to drop a stale slot or fix archiving.
- Track checkpoints_req against checkpoints_timed to detect sizing stress that precedes volume exhaustion.
- Visualize replication slot lag in bytes, surfacing retention growth before it triggers an outage.
Related guides
- How PostgreSQL actually works in production: a mental model for operators: /guides/postgres/how-postgres-works-in-production/
- PostgreSQL ALTER TABLE blocked: zero-downtime DDL patterns: /guides/postgres/postgres-alter-table-blocked/
- PostgreSQL autovacuum blocked by long-running transaction: detection and fix: /guides/postgres/postgres-autovacuum-blocked-by-long-transaction/
- PostgreSQL autovacuum not running: detection, causes, and fixes: /guides/postgres/postgres-autovacuum-not-running/
- PostgreSQL autovacuum tuning: per-table thresholds for high-churn workloads: /guides/postgres/postgres-autovacuum-tuning/
- PostgreSQL blocking queries: finding the root blocker in a lock cascade: /guides/postgres/postgres-blocking-queries/
- PostgreSQL connection exhaustion: detection, diagnosis, and prevention: /guides/postgres/postgres-connection-exhaustion/
- PostgreSQL connection refused: pg_hba, listen_addresses, and TCP diagnosis: /guides/postgres/postgres-connection-refused/
- PostgreSQL dead tuples piling up: why autovacuum can’t keep up: /guides/postgres/postgres-dead-tuples-piling-up/
- PostgreSQL deadlock detected: how to diagnose and prevent deadlocks: /guides/postgres/postgres-deadlock-detected/
- PostgreSQL frozen XID monitoring: catching wraparound 6 months early: /guides/postgres/postgres-frozen-xid-monitoring/
- PostgreSQL idle in transaction: detecting and killing zombie sessions: /guides/postgres/postgres-idle-in-transaction/






