PostgreSQL disk full: emergency recovery and root cause analysis

When df -h shows 100% utilization on the PostgreSQL data volume, queries fail with “could not write to file”. If pg_wal fills, the server enters PANIC and refuses to restart until space is freed. The fastest way to make the incident worse is to delete WAL files from pg_wal manually. PostgreSQL needs those files for crash recovery; removing them causes data inconsistency that forces a restore from backup. Identify which subsystem is consuming space, reclaim it safely, and fix the root cause before the cycle repeats.

Four subsystems usually dominate disk consumption: WAL retention from replication slots or archiving failures, transient temp files from runaway queries, table and index bloat from blocked vacuum, and log files that outgrew their rotation policy.

What this means

When PostgreSQL exhausts disk space, it cannot append to WAL or extend data files. A full pg_wal causes PANIC and a clean shutdown; the server will not restart until space is restored. If another filesystem fills while the database stays online, queries fail, autovacuum stalls, and transaction ID wraparound risk grows. The root cause is usually an upstream operational failure: a forgotten replication slot, a broken archive_command, or a long-running transaction that blocks vacuum and allows bloat to accumulate. Freeing space without fixing the source guarantees a repeat incident.

Common causes

CauseWhat it looks likeFirst thing to check
Replication slot retaining WALpg_wal grows steadily while replicas appear healthy; pg_replication_slots shows active = false with an old restart_lsnSELECT slot_name, active, restart_lsn FROM pg_replication_slots;
Failing archive_commandWAL count exceeds max_wal_size; pg_stat_archiver shows failed attemptsSELECT archived_count, failed_count FROM pg_stat_archiver;
Runaway temp filesLarge files appear in pgsql_tmp while a long query runs; sorts or hashes spill to diskpg_stat_database temp counters and ls on pgsql_tmp
Table or index bloatTable size is far larger than the logical row count implies; n_dead_tup is high and growingSELECT schemaname, relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;
Log file explosionThe configured log directory grows by gigabytes without rotationdu -sh on the log path

Quick checks

# Check filesystem utilization
df -h

# Check WAL directory size and segment count
du -sh $PGDATA/pg_wal
ls $PGDATA/pg_wal | wc -l

# Check replication slot status and WAL lag
psql -c "SELECT slot_name, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;"

# Check archiver health
psql -c "SELECT archived_count, failed_count, last_archived_time, last_failed_time FROM pg_stat_archiver;"

# Check largest relations
psql -c "SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) AS size FROM pg_stat_user_tables ORDER BY pg_total_relation_size(schemaname||'.'||relname) DESC LIMIT 10;"

# Check temp file accumulation per database
psql -c "SELECT datname, temp_files, pg_size_pretty(temp_bytes) FROM pg_stat_database WHERE temp_files > 0 ORDER BY temp_bytes DESC;"

# Check for long-running active queries
psql -c "SELECT pid, usename, state, query_start, query FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '5 minutes';"

# Check log directory. The path may be relative to $PGDATA.
LOGDIR=$(psql -t -P format=unaligned -c "SHOW log_directory")
du -sh "${PGDATA}/${LOGDIR}" 2>/dev/null || du -sh "${LOGDIR}" 2>/dev/null

How to diagnose it

flowchart TD
    A[Disk full alert] --> B{Which volume?}
    B -->|pg_wal| C[Check slots and archiver]
    B -->|Data| D[Check bloat and temp files]
    B -->|Log| E[Check rotation policy]
    C --> F[Inactive slot? Drop or resume]
    C --> G[Archive failure? Fix pipeline]
    D --> H[Long query? Terminate backend]
    D --> I[High dead tuples? Vacuum or repack]
    E --> J[Compress or purge old logs]
  1. Confirm which filesystem is full with df -h. PostgreSQL often stores WAL, data, and logs on separate volumes. If only the data volume is full, pg_wal on a separate volume may still have space, allowing the server to stay up.
  2. Quantify WAL growth. Measure pg_wal size and file count. A healthy cluster recycles WAL after archiving and checkpointing. If the file count is in the thousands and growing, WAL is not being removed.
  3. Check replication slots. Query pg_replication_slots. An active = false slot with a restart_lsn far behind the current LSN means WAL is being retained for a consumer that is offline. Compare the byte lag with available disk space.
  4. Check the archiver. Query pg_stat_archiver. If failed_count is increasing, archive_command is failing and WAL cannot be recycled. Check the archive destination and permissions.
  5. Inspect temp files. Look in $PGDATA/base/*/pgsql_tmp for large files. Correlate them with backends in pg_stat_activity that have been active for a long time.
  6. Measure bloat. Query pg_stat_user_tables for high n_dead_tup. If the table is still accessible, install pgstattuple and run SELECT * FROM pgstattuple('schema.table'); to confirm dead tuple ratio. If pgstattuple is not available, compare pg_total_relation_size against an estimated logical size.
  7. Review log volume. Check the PostgreSQL log directory size. If logs are not rotating, a single day of verbose logging can consume tens of gigabytes.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
pg_wal sizeWAL accumulation is the fastest way to fill a diskSize growing > 1 GB/hour sustained
Replication slot LSN lagInactive slots retain WAL indefinitelyactive = false and lag > 1 GB
pg_stat_archiver.failed_countFailed archiving blocks WAL recyclingAny sustained increase over baseline
Per-table dead tuple ratioBloat wastes disk and degrades performancen_dead_tup / (n_live_tup + n_dead_tup) > 20%
pg_stat_database.temp_bytes rateRunaway queries spill sorts and hashes to diskSudden spike without workload change
Database age (datfrozenxid)Vacuum starvation risks wraparound shutdownage(datfrozenxid) > 500 million

Fixes

WAL retained by a replication slot

If the consumer is permanently offline, drop the slot:

SELECT pg_drop_replication_slot('slot_name');

WARNING: This immediately allows WAL to be recycled. The consumer must be rebuilt from a new base backup if it was a replica, or resynchronized if it was a logical subscriber. Do not drop the slot if the consumer is temporarily down and you have disk headroom; instead, add space and let it catch up.

On PostgreSQL 13+, set max_slot_wal_keep_size to limit future WAL retention per slot. Setting it too low risks breaking lagging standbys that must be rebuilt from a new base backup.

Failing archive_command

Fix the archive pipeline first. Common causes include a full backup repository, expired credentials, or a network path failure. Once archive_command succeeds, PostgreSQL archives the backlog and recycles old segments. This can take minutes to hours depending on the backlog. Do not disable archive_mode to free space; you lose point-in-time recovery capability.

Temp file bloat

Identify the backend that created the temp files in pgsql_tmp. If the query is not critical, terminate the backend:

SELECT pg_terminate_backend(<pid>);

WARNING: The query fails and must be rerun after addressing the root cause, typically by increasing work_mem or rewriting the query. The temp files are removed automatically when the backend exits.

Table and index bloat

If autovacuum is blocked by a long-running transaction, terminate the blocker:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND state_change < NOW() - INTERVAL '5 minutes';

After the blocker is gone, run a manual VACUUM on the bloated table:

VACUUM (VERBOSE, ANALYZE) schema.table;

VACUUM marks dead space as reusable but does not shrink the table on disk. If you must return space to the operating system immediately and can tolerate downtime, VACUUM FULL reclaims space but holds an ACCESS EXCLUSIVE lock for the duration, blocking all reads and writes. For online reclamation, use pg_repack, which requires a primary key or unique NOT NULL index and approximately 2x the table size in free disk space.

Log explosion

Move, compress, or remove old log files outside the current active log file. If using logging_collector, PostgreSQL holds the file descriptor open; a reload does not release space to the OS. You may need a server restart to reclaim space from truncated or deleted files. Ensure you retain enough logs for compliance and forensics.

Emergency disk expansion

If the root volume is full and you cannot reclaim space fast enough, expand the underlying block device. On cloud platforms, increase the volume size and resize the filesystem without downtime. If pg_wal is on the same volume as data and cannot expand, you can relocate pg_wal to a new volume with more space.

WARNING: This requires stopping the server, moving the directory, creating a symbolic link from $PGDATA/pg_wal to the new location, and restarting. Plan for downtime and verify the symlink resolves correctly before starting PostgreSQL.

Prevention

  • Monitor replication slots. An inactive slot can retain gigabytes of WAL per hour. Alert on active = false and LSN lag growth.
  • Monitor the archiver. A failing archive_command blocks WAL recycling. Alert on pg_stat_archiver.failed_count increases.
  • Guard slot retention. On PostgreSQL 13+, set max_slot_wal_keep_size to bound the disk cost of a lost consumer.
  • Set connection timeouts. idle_in_transaction_session_timeout and statement_timeout prevent backends from blocking vacuum or spilling temp files indefinitely.
  • Tune autovacuum per table. High-churn tables need aggressive thresholds so dead tuples are reclaimed before they become bloat that consumes disk.
  • Separate volumes. Keep WAL, data, and logs on independent filesystems so one subsystem cannot starve the others.
  • Enforce log rotation. Ensure log_rotation_age or an external rotation policy is active and tested.

How Netdata helps

  • Correlate filesystem utilization with pg_wal growth and replication slot lag to distinguish WAL retention from data bloat.
  • Alert on inactive replication slots and LSN lag growth.
  • Expose pg_stat_archiver failures that block WAL recycling.
  • Track per-table dead tuple ratios and autovacuum lag to identify bloat early.
  • Monitor temp file activity per database to detect runaway queries that spill to disk.
  • How PostgreSQL actually works in production: a mental model for operators: /guides/postgres/how-postgres-works-in-production/
  • PostgreSQL ALTER TABLE blocked: zero-downtime DDL patterns: /guides/postgres/postgres-alter-table-blocked/
  • PostgreSQL autovacuum blocked by long-running transaction: detection and fix: /guides/postgres/postgres-autovacuum-blocked-by-long-transaction/
  • PostgreSQL autovacuum not running: detection, causes, and fixes: /guides/postgres/postgres-autovacuum-not-running/
  • PostgreSQL autovacuum tuning: per-table thresholds for high-churn workloads: /guides/postgres/postgres-autovacuum-tuning/
  • PostgreSQL blocking queries: finding the root blocker in a lock cascade: /guides/postgres/postgres-blocking-queries/
  • PostgreSQL checkpoint storms: detection, causes, and tuning: /guides/postgres/postgres-checkpoint-storms/
  • PostgreSQL: checkpoints are occurring too frequently – what to tune: /guides/postgres/postgres-checkpoints-occurring-too-frequently/
  • PostgreSQL connection exhaustion: detection, diagnosis, and prevention: /guides/postgres/postgres-connection-exhaustion/
  • PostgreSQL connection refused: pg_hba, listen_addresses, and TCP diagnosis: /guides/postgres/postgres-connection-refused/
  • PostgreSQL: database is not accepting commands to avoid wraparound data loss: /guides/postgres/postgres-database-not-accepting-commands/
  • PostgreSQL dead tuples piling up: why autovacuum can’t keep up: /guides/postgres/postgres-dead-tuples-piling-up/