PostgreSQL WAL archive failures: archive_command exit codes and recovery
WAL archiving is the durability bridge between your PostgreSQL primary and your ability to recover to any point in time. When archive_command fails, WAL segments stay pinned in pg_wal/. The directory grows until the filesystem fills, at which point PostgreSQL performs an emergency PANIC shutdown. Even before that happens, every unarchived segment widens your exposure: if the primary is lost, point-in-time recovery from a base backup cannot proceed past the last segment that reached the archive.
The failure modes are not all obvious. A nonzero exit code from archive_command increments pg_stat_archiver.failed_count, but if the command is terminated by a signal (other than SIGTERM during a server shutdown) or the shell returns an exit status above 125, such as 127 for command not found, the archiver process aborts, the postmaster restarts it, and the failure is not recorded in the statistics view. You can have a flat failed_count and a full disk. This guide covers how to read the signals, test the contract, and recover without making the gap worse.
What this means
PostgreSQL spawns an archiver background process that executes archive_command for every completed WAL segment. The contract is strict:
- Exit code 0 means success. PostgreSQL considers the segment archived and will eventually remove or recycle it.
- Exit code nonzero (1 to 125) means failure. PostgreSQL retries indefinitely and keeps the file in pg_wal/.
- SIGTERM during a server shutdown is a graceful stop and is not treated as a failure.
- If the command dies from any other signal, or the shell returns an exit code greater than 125, the archiver aborts. The postmaster restarts it, but the failure is not reflected in pg_stat_archiver.
The idempotency rule matters for recovery. If the server crashes before recording durable archive success, it may attempt to re-archive the same segment on restart. Your command must return 0 if an identical file already exists, and nonzero if a file with the same name but different contents exists. Returning 0 for a mismatched file corrupts the archive chain.
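The exit-status contract above can be sketched as a small classifier. This is an illustration only: the function name and the three labels are hypothetical, and the thresholds simply restate the rules in the list. When a child process is killed by a signal, the shell reports 128 plus the signal number (137 for SIGKILL), which falls above 125.

```bash
# Hypothetical helper: map a shell exit status to how PostgreSQL reacts.
classify_archive_status() (
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "archived"            # segment may be removed or recycled
    elif [ "$rc" -le 125 ]; then
        echo "failed-counted"      # retried; failed_count increments
    else
        echo "archiver-restart"    # archiver aborts; failed_count unchanged
    fi
)

# Example: a missing binary makes the shell return 127.
rc=0
sh -c 'no_such_archiver_binary_xyz' 2>/dev/null || rc=$?
classify_archive_status "$rc"      # prints: archiver-restart
```

A missing binary is therefore one of the failures that never shows up in failed_count.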
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Storage backend rejection | failed_count rising; last_failed_wal stuck on one segment | Manually run archive_command against that file and check the exit code |
| Unsafe duplicate handling (GNU cp -i) | Command exits 0 but the file is not actually archived; gaps appear in the backup chain | Re-run the command against an already-archived file. If cp -i is used, it returns 0 on duplicate names without checking content |
| Archiver death by signal or OOM | pg_wal/ grows but failed_count is flat; logs show archiver process started or restarted | OS logs or dmesg for OOM kills; check for external process killers |
| pg_wal disk full | PANIC in PostgreSQL logs; database stops accepting writes | Disk utilization on the WAL volume |
| archive_command syntax or missing binary | Immediate failure on every WAL file; logs show the command was not found | SHOW archive_command; and test the binary path in a shell |
| pgBackRest async handoff failure | pgBackRest error [082]; WAL piles up despite archive_command returning 0 quickly | pgBackRest async queue state and timeout configuration |
Quick checks
```bash
# Check archiver statistics for failures and last success
psql -c "SELECT archived_count, failed_count, last_archived_wal,
                last_failed_wal, last_archived_time, last_failed_time
         FROM pg_stat_archiver;"

# Measure WAL directory size and segment count
du -sh "$PGDATA/pg_wal"
ls "$PGDATA/pg_wal" | wc -l

# Verify current archive configuration
psql -c "SHOW archive_command;"
psql -c "SHOW archive_mode;"

# Look for archiver process restarts or errors in logs
# (adjust the path if log_directory is not the default)
grep -i archiver "$PGDATA"/log/postgresql-*.log | tail -n 20

# Force a WAL switch to test the current pipeline
psql -c "SELECT pg_switch_wal();"
```
How to diagnose it
- Check pg_stat_archiver for a smoking gun. If failed_count is increasing, note last_failed_wal and the timestamp. A stagnant last_archived_time with a rising current WAL LSN means the pipeline is blocked.
- Test the command manually. Pick the file from last_failed_wal or a completed segment in pg_wal/. Substitute %p with the absolute path to the file in pg_wal/ and %f with the filename, then run the command as the postgres OS user. Check $?.
- Look for silent archiver deaths. If pg_wal/ is growing but failed_count is not, search PostgreSQL logs for archiver startup messages without preceding archive successes. Match these against OS logs for OOM kills or signal terminations.
- Verify idempotency. Copy a WAL segment to a temporary path, run your archive_command against it twice. The second run must return 0 only if the archived file is byte-identical. If your script uses cp -i, replace it with a content-aware tool such as rsync -c, pgBackRest, or WAL-G.
- Check disk pressure on both sides. A full destination filesystem causes persistent nonzero exits. A full pg_wal filesystem causes PANIC. Check both before they overlap.
- If using pgBackRest or WAL-G, inspect their logs. Async archiving can return 0 to PostgreSQL while the background push fails or times out. Look for timeout error [082] in pgBackRest or retry exhaustion in WAL-G.
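The manual test in the second step can be sketched as follows. The helper name, the template, and the paths are placeholders, not values from a real system. The sketch assumes the paths contain no `|` or `&` characters and that the template has no literal `%%` sequences.

```bash
# Hypothetical helper: expand %p and %f so the command can be run by hand.
expand_archive_command() (
    template=$1   # value of archive_command
    wal_path=$2   # absolute path to the segment (replaces %p)
    wal_name=$3   # segment file name only (replaces %f)
    printf '%s\n' "$template" |
        sed -e "s|%p|$wal_path|g" -e "s|%f|$wal_name|g"
)

# Example with a placeholder template and segment name:
expand_archive_command 'cp %p /mnt/archive/%f' \
    '/var/lib/postgresql/data/pg_wal/000000010000000000000042' \
    '000000010000000000000042'

# To execute the result as the postgres OS user and read its status
# (not run here):
#   sh -c "$(expand_archive_command "$tmpl" "$path" "$name")"; echo "exit: $?"
```

Printing the expanded command before running it makes quoting mistakes in the template visible, which is a common reason a command works in a shell but fails under the archiver.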
```mermaid
flowchart TD
    A[WAL accumulation detected] --> B{pg_stat_archiver.failed_count increasing?}
    B -->|Yes| C[Test archive_command on last_failed_wal]
    B -->|No| D[Check logs for archiver deaths]
    C --> E{Exit code nonzero?}
    E -->|Yes| F[Fix storage backend or permissions]
    E -->|No| G[Idempotency bug or cp -i pitfall]
    D --> H{Signal or exit >125?}
    H -->|Yes| I[Silent failure: check OOM or external kills]
    H -->|No| J[Check disk full or command syntax]
    I --> K[Stabilize host and verify with pg_switch_wal]
    J --> L[Emergency: set archive_command to empty string if disk critical]
```
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| pg_stat_archiver.failed_count | Direct count of failed attempts | Increasing over any 5-minute window |
| pg_stat_archiver.last_archived_time | Staleness indicates pipeline stall | Lag greater than 2 times archive_timeout |
| pg_wal directory size | Unarchived segments consume local disk | Growth rate exceeding your baseline |
| pg_stat_archiver.last_failed_wal | Identifies the exact stuck segment | Same filename across consecutive samples |
| Disk utilization on WAL volume | Disk full triggers PANIC | Greater than 85 percent |
| Archiver restart rate in logs | Silent deaths bypass failed_count | Log entries for archiver startup without archive success |
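The staleness threshold in the table can be expressed as a simple check. This is a sketch: the function name is hypothetical, all inputs are whole seconds, and it assumes archive_timeout is nonzero (with archive_timeout = 0 the threshold collapses to zero and you would need a fixed limit instead).

```bash
# Hypothetical helper: exit 0 when archive lag exceeds 2 x archive_timeout.
archive_lag_exceeded() (
    now_epoch=$1            # current time, e.g. from: date +%s
    last_archived_epoch=$2  # last_archived_time as epoch seconds
    archive_timeout=$3      # archive_timeout in seconds
    lag=$((now_epoch - last_archived_epoch))
    [ "$lag" -gt $((2 * archive_timeout)) ]
)

# One way to fetch last_archived_time as epoch seconds (needs a live server):
#   psql -Atc "SELECT extract(epoch FROM last_archived_time)::bigint
#              FROM pg_stat_archiver;"

# Example with made-up numbers: lag is 150 s against a 120 s threshold.
if archive_lag_exceeded 1000 850 60; then
    echo "WARNING: archive lag above threshold"
fi
```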
Fixes
Fix storage backend or permissions
If manual testing returns a permission denied, authentication, or network error, repair the backend. This could mean rotating keys, fixing bucket policies, or restoring connectivity. If you only changed archive_command or archive_library, reload PostgreSQL with SELECT pg_reload_conf();. If you changed archive_mode, a server restart is required.
Replace unsafe archive_command patterns
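Two common hand-rolled patterns break the contract in opposite ways. The classic pattern test ! -f /path/%f && cp %p /path/%f refuses to overwrite, but it also returns nonzero when an identical copy already exists, so a re-archive attempt after a crash stays stuck on that segment; it also does not flush the file to disk or write it atomically. The cp -i variant fails the other way: GNU cp -i returns exit 0 when the destination file already exists, without comparing contents, which violates the idempotency contract and creates silent archive gaps. Use a tool that compares content on collision, such as rsync -c, pgBackRest, WAL-G, or the basic_archive module available in PostgreSQL 15+.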
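For illustration, a content-aware collision check can be sketched as below. This is a minimal sketch under stated assumptions, not a replacement for the tools named above: the function name and arguments are hypothetical, the archive is assumed to be a locally mounted directory, and the sketch does not fsync the archived file or its directory.

```bash
# Hypothetical wrapper: archive one segment without overwriting. Returns 0
# only when the copy succeeds or an identical file is already present.
archive_wal_segment() (
    src=$1          # %p: path to the WAL segment
    dest_dir=$2     # archive directory (assumed to be a mounted path)
    name=$3         # %f: segment file name
    dest="$dest_dir/$name"

    if [ -e "$dest" ]; then
        # Same name already archived: succeed only if byte-identical.
        cmp -s "$src" "$dest"
        exit $?
    fi

    # Copy to a temporary name, then rename, so a reader of the archive
    # never sees a partially written segment under its final name.
    tmp="$dest_dir/.$name.tmp.$$"
    cp "$src" "$tmp" || { rm -f "$tmp"; exit 1; }
    mv "$tmp" "$dest"
)
```

Saved as a script (for example a hypothetical /usr/local/bin/archive_wal.sh that calls the function with its three arguments), it could be wired up as archive_command = '/usr/local/bin/archive_wal.sh %p /mnt/archive %f'. Running it twice on the same segment should return 0 both times; changing the source and running it again should return nonzero and leave the archived copy untouched.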
Recover from silent archiver death
If the archiver is dying from OOM or signals, address the root cause before restarting PostgreSQL. Increase memory limits, remove memory pressure, or stop the external process that is sending signals. After stabilizing the host, verify the pipeline with SELECT pg_switch_wal(); and confirm the new segment appears in the archive.
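After the archiver is stable again, it can help to watch the backlog drain rather than rely on a single pg_switch_wal() test. PostgreSQL marks each segment that is waiting for the archiver with a .ready file in pg_wal/archive_status/. The helper below is a hypothetical sketch that counts those markers; it assumes the caller can read the data directory.

```bash
# Hypothetical helper: count segments still waiting for the archiver.
count_ready_segments() (
    status_dir="$1/pg_wal/archive_status"   # $1 is the data directory
    n=0
    for f in "$status_dir"/*.ready; do
        # The glob stays literal when nothing matches, so test existence.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
)

# Example (assumes PGDATA is set; not run here):
#   count_ready_segments "$PGDATA"
```

A count that keeps falling between samples suggests the pipeline is catching up; a count that holds steady or rises while failed_count stays flat points back at a silent failure.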
Triage disk-full emergencies
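If pg_wal/ is filling and you cannot fix the archive backend immediately, free or add space on the WAL volume first. Setting archive_command = '' (empty string) pauses archive attempts without disabling archive_mode, but it does not prevent PANIC on its own: WAL files keep accumulating in pg_wal/ until you restore a working command. As a last resort, a command that does nothing and returns 0, such as /bin/true, lets PostgreSQL recycle segments and relieves disk pressure, at the cost of breaking the archive chain; take a new base backup as soon as real archiving is restored. Either option is a bridge, not a cure.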
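To judge how urgent the triage is, a quick reading of the WAL volume's fill level is useful. The sketch below uses POSIX-format df output. The function name is hypothetical, and the 85 percent figure in the example is the warning level from the metrics table, not a PostgreSQL limit.

```bash
# Hypothetical helper: print the used percentage of the filesystem that
# holds the given directory. Assumes the filesystem name has no spaces.
volume_used_pct() (
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
)

# Example (assumes PGDATA is set; not run here):
#   pct=$(volume_used_pct "$PGDATA/pg_wal")
#   if [ "$pct" -ge 85 ]; then
#       echo "WAL volume at ${pct}%: act before the filesystem fills"
#   fi
```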
Resolve pgBackRest async timeouts
If pgBackRest async archiving is failing with error [082], the background push is not keeping up. Raise PostgreSQL's archive_timeout temporarily so that fewer partially filled segments are forced out (lowering it has the opposite effect and increases the segment rate), or switch to synchronous archiving until the backlog clears. Review the pgBackRest async configuration and ensure the handoff process is healthy.
Prevention
- Do not rely solely on pg_stat_archiver.failed_count. Monitor pg_wal growth rate and archiver process liveness via logs, because signal deaths do not increment the counter.
- Use archive_library on PostgreSQL 15+ or a proven archiving tool instead of hand-rolled shell scripts. The basic_archive module provides atomic renames and built-in idempotency checks.
- Test idempotency quarterly by re-running your archive command against an already-archived segment.
- Keep pg_wal on a dedicated volume with enough headroom to survive several hours of backlog.
- Monitor the age of last_archived_time against your current WAL position. If the lag exceeds your RPO, page immediately.
How Netdata helps
- Correlate pg_stat_archiver.failed_count with disk utilization on the WAL volume to catch failures that statistics alone miss.
- Alert on pg_wal directory growth rate even when PostgreSQL reports no failed archive attempts, surfacing silent archiver deaths.
- Track WAL generation and checkpoint rates alongside archive lag to identify when write volume outpaces the backend.
- Surface PostgreSQL process-level metrics to flag archiver OOM kills or unexpected process restarts.
- Combine PostgreSQL logs with metrics to pinpoint the exact WAL filename and timestamp where archiving first stalled.