PostgreSQL WAL archive failures: archive_command exit codes and recovery

WAL archiving is the durability bridge between your PostgreSQL primary and your ability to recover to any point in time. When archive_command fails, WAL segments stay pinned in pg_wal/. The directory grows until the filesystem fills, at which point PostgreSQL shuts down with a PANIC. Even before that happens, every segment missing from the archive is a gap in your backup chain: a base backup can be replayed only up to the first missing file, so PITR stops there.

The failure modes are not all obvious. A nonzero exit code from archive_command does increment pg_stat_archiver.failed_count, but if the command is killed by a signal or the shell exits with a code above 125 (for example, command not found), the archiver process aborts, the postmaster restarts it silently, and the failure is not recorded in the statistics view. You can have a failed_count of zero and a full disk. This guide covers how to read the signals, test the contract, and recover without making the gap worse.

What this means

PostgreSQL spawns an archiver background process that executes archive_command for every completed WAL segment. The contract is strict:

Exit code 0 means success. PostgreSQL considers the segment archived and will eventually remove or recycle it.
Exit code nonzero means failure. PostgreSQL retries indefinitely and keeps the file in pg_wal/.
Termination of the command by SIGTERM as part of a server shutdown is not treated as a failure.
If the command dies from any other signal, or the shell returns an exit code greater than 125 (such as 127 for command not found), the archiver process aborts. The postmaster restarts it, but the failure is not reflected in pg_stat_archiver.
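
The contract can be summarized as a small classifier. This is a sketch for reasoning about exit codes, not PostgreSQL's own logic: the function name `classify_archive_status` is hypothetical, it assumes the status is what a shell reports in `$?` (where a command killed by signal N shows up as 128+N), and it does not model the SIGTERM-during-shutdown exception.

```shell
# Hypothetical helper: map a shell exit status to the outcome described above.
classify_archive_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "success: segment is considered archived"
    elif [ "$status" -le 125 ]; then
        echo "failure: counted in failed_count, segment is retried"
    else
        # Includes 126/127 (cannot execute / not found) and 128+N (killed by signal N).
        echo "abort: archiver restarted, not counted in pg_stat_archiver"
    fi
}

# Example: classify the status of a command that exits with 3.
sh -c 'exit 3' || classify_archive_status $?
```

Statuses 126 and above are the ones worth alerting on separately, because they never show up in failed_count.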

The idempotency rule matters for recovery. If the server crashes before recording durable archive success, it may attempt to re-archive the same segment on restart. Your command must return 0 if an identical file already exists, and nonzero if a file with the same name but different contents exists. Returning 0 for a mismatched file corrupts the archive chain.

Common causes

Cause	What it looks like	First thing to check
Storage backend rejection	`failed_count` rising; `last_failed_wal` stuck on one segment	Manually run `archive_command` against that file and check the exit code
Unsafe duplicate handling (GNU cp -i)	Command exits 0 but the file is not actually archived; gaps appear in the backup chain	Re-run the command against an already-archived file. If `cp -i` is used, it returns 0 on duplicate names without checking content
Archiver death by signal or OOM	`pg_wal/` grows but `failed_count` is flat; logs show archiver process started or restarted	OS logs or `dmesg` for OOM kills; check for external process killers
`pg_wal` disk full	`PANIC` in PostgreSQL logs; database stops accepting writes	Disk utilization on the WAL volume
`archive_command` syntax or missing binary	Immediate failure on every WAL file; logs show the command was not found. `failed_count` may stay flat, because command-not-found exits with 127	`SHOW archive_command;` and test the binary path in a shell
pgBackRest async handoff failure	pgBackRest error `[082]`; WAL piles up despite `archive_command` returning 0 quickly	pgBackRest async queue state and timeout configuration

Quick checks

# Check archiver statistics for failures and last success
psql -c "SELECT archived_count, failed_count, last_archived_wal,
last_failed_wal, last_archived_time, last_failed_time
FROM pg_stat_archiver;"

# Measure WAL directory size and entry count (the count includes the archive_status directory)
du -sh "$PGDATA/pg_wal"
ls "$PGDATA/pg_wal" | wc -l

# Verify current archive configuration
psql -c "SHOW archive_command;"
psql -c "SHOW archive_mode;"

# Look for archiver process restarts or errors in logs (path assumes log_directory = 'log')
grep -i archiver "$PGDATA"/log/postgresql-*.log | tail -n 20

# Force a WAL switch to test the current pipeline
psql -c "SELECT pg_switch_wal();"
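
A more direct backlog measure than the raw directory listing is the number of `.ready` marker files in pg_wal/archive_status, one per segment still waiting for the archiver. The following is a minimal sketch; the function name `count_ready_segments` is hypothetical, and it assumes a `find` that supports `-maxdepth` (GNU and BSD both do).

```shell
# Count segments still waiting for the archiver (one .ready file each).
# Usage: count_ready_segments /path/to/data_directory
count_ready_segments() {
    status_dir="$1/pg_wal/archive_status"
    # find copes with very large or empty directories, unlike a shell glob.
    find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}

# Example (assumes PGDATA is set and readable by the current user):
# count_ready_segments "$PGDATA"
```

A count that keeps rising between samples means segments are being completed faster than they are archived, whatever failed_count says.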

How to diagnose it

Check pg_stat_archiver for a smoking gun. If failed_count is increasing, note last_failed_wal and the timestamp. A stagnant last_archived_time with a rising current WAL LSN means the pipeline is blocked.
Test the command manually. Pick the file from last_failed_wal or a completed segment in pg_wal/. Substitute %p with the absolute path to the file in pg_wal/ and %f with the filename, then run the command as the postgres OS user. Check $?.
Look for silent archiver deaths. If pg_wal/ is growing but failed_count is not, search PostgreSQL logs for archiver startup messages without preceding archive successes. Match these against OS logs for OOM kills or signal terminations.
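Verify idempotency. Copy a WAL segment to a temporary path and run your archive_command against it twice. The second run must return 0 only if the archived file is byte-identical. If your script uses cp -i, replace it with a tool that compares content on collision, such as pgBackRest or WAL-G. Note that rsync -c detects differing content but then overwrites the destination instead of failing, so it needs an explicit comparison step in front of it.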
Check disk pressure on both sides. A full destination filesystem causes persistent nonzero exits. A full pg_wal filesystem causes PANIC. Check both before they overlap.
If using pgBackRest or WAL-G, inspect their logs. Async archiving can return 0 to PostgreSQL while the background push fails or times out. Look for timeout error [082] in pgBackRest or retry exhaustion in WAL-G.
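
The manual test in the steps above can be scripted. This is a rough sketch under stated assumptions: `run_archive_template` is a hypothetical helper, it substitutes only `%p` and `%f` (not `%%`), and it will mis-substitute paths that contain `|` or `&`. Run it as the postgres OS user so permissions match what the archiver sees.

```shell
# Hypothetical helper: expand %p and %f in a command template, run the result
# through sh, and print the exit status.
# Usage: run_archive_template 'TEMPLATE' /absolute/path/to/segment
run_archive_template() {
    template=$1
    wal_path=$2
    wal_name=$(basename "$wal_path")
    # %p becomes the full path, %f the bare file name.
    cmd=$(printf '%s' "$template" | sed -e "s|%p|$wal_path|g" -e "s|%f|$wal_name|g")
    status=0
    sh -c "$cmd" || status=$?
    echo "exit status: $status"
    return "$status"
}

# Example (the destination directory is an assumption):
# run_archive_template 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f' \
#     "$PGDATA/pg_wal/000000010000000000000001"
```

Running the same call twice in a row is also a quick way to exercise the idempotency check.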

flowchart TD
    A[WAL accumulation detected] --> B{pg_stat_archiver.failed_count increasing?}
    B -->|Yes| C[Test archive_command on last_failed_wal]
    B -->|No| D[Check logs for archiver deaths]
    C --> E{Exit code nonzero?}
    E -->|Yes| F[Fix storage backend or permissions]
    E -->|No| G[Idempotency bug or cp -i pitfall]
    D --> H{Signal or exit >125?}
    H -->|Yes| I[Silent failure: check OOM or external kills]
    H -->|No| J[Check disk full or command syntax]
    I --> K[Stabilize host and verify with pg_switch_wal]
    J --> L[Emergency: add or free WAL disk space, an empty archive_command only pauses attempts]

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`pg_stat_archiver.failed_count`	Direct count of failed attempts	Increasing over any 5-minute window
`pg_stat_archiver.last_archived_time`	Staleness indicates pipeline stall	Lag greater than 2 times `archive_timeout`
`pg_wal` directory size	Unarchived segments consume local disk	Growth rate exceeding your baseline
`pg_stat_archiver.last_failed_wal`	Identifies the exact stuck segment	Same filename across consecutive samples
Disk utilization on WAL volume	Disk full triggers PANIC	Greater than 85 percent
Archiver restart rate in logs	Silent deaths bypass `failed_count`	Log entries for archiver startup without archive success
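
The staleness threshold in the table can be expressed as a simple check. This is a sketch only: `archive_lag_exceeded` is a hypothetical function, all three arguments are plain integers in seconds, and the 300-second timeout in the example is an assumption rather than a recommended value.

```shell
# Hypothetical check: is the archive lag more than twice archive_timeout?
# Usage: archive_lag_exceeded LAST_ARCHIVED_EPOCH NOW_EPOCH ARCHIVE_TIMEOUT_SECONDS
archive_lag_exceeded() {
    last_archived_epoch=$1
    now_epoch=$2
    archive_timeout=$3
    lag=$((now_epoch - last_archived_epoch))
    # Returns 0 (true) only when the lag is strictly greater than 2 x archive_timeout.
    [ "$lag" -gt $((2 * archive_timeout)) ]
}

# Example wiring (assumes psql can connect and at least one segment has been archived):
# last=$(psql -Atc "SELECT extract(epoch FROM last_archived_time)::bigint FROM pg_stat_archiver")
# archive_lag_exceeded "$last" "$(date +%s)" 300 && echo "archive lag warning"
```

If archive_timeout is 0 (disabled), this threshold is meaningless and a fixed lag limit tied to your RPO is a better fit.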

Fixes

Fix storage backend or permissions

If manual testing returns a permission denied, authentication, or network error, repair the backend. This could mean rotating keys, fixing bucket policies, or restoring connectivity. If you only changed archive_command or archive_library, reload the configuration with SELECT pg_reload_conf(); no restart is needed. If you changed archive_mode, a server restart is required.

Replace unsafe archive_command patterns

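The classic pattern test ! -f /path/%f && cp %p /path/%f refuses to overwrite an existing file, but it returns nonzero even when the existing file is identical, so a re-archive attempt after a crash stalls the archiver on that segment. It also neither copies atomically nor flushes the file to disk. Shortening it to cp -i is worse: GNU cp -i returns exit 0 when the destination file already exists, whatever its contents, which violates the idempotency contract and creates silent archive gaps. Use a tool that compares content on collision, such as pgBackRest, WAL-G, or the basic_archive module available in PostgreSQL 15+. rsync -c detects differing content but overwrites the destination instead of failing, so it does not satisfy the contract on its own.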
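
For illustration, the collision rule can be sketched in a few lines of shell. This is not a substitute for the tools named above: `archive_wal` and the paths in the usage comment are hypothetical, the sketch does not flush the file to durable storage, and it assumes a single archiver writing to a local or mounted directory.

```shell
# Hypothetical helper showing the collision rule only.
# Possible wiring (script name and paths are assumptions):
#   archive_command = '/usr/local/bin/archive_wal.sh %p /mnt/archive/%f'
archive_wal() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        # Same name already archived: succeed only if byte-identical.
        if cmp -s "$src" "$dest"; then
            return 0
        fi
        return 1
    fi
    tmp="$dest.tmp.$$"
    # Copy under a temporary name, then rename, so a partial copy is never
    # visible under the final segment name.
    if cp "$src" "$tmp" && mv "$tmp" "$dest"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

The rename keeps readers from seeing half-written segments, but a production command also needs to fsync the file and its directory, which is one reason a maintained tool is the safer choice.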

Recover from silent archiver death

If the archiver is dying from OOM or signals, address the root cause before restarting PostgreSQL. Increase memory limits, remove memory pressure, or stop the external process that is sending signals. After stabilizing the host, verify the pipeline with SELECT pg_switch_wal(); and confirm the new segment appears in the archive.

Triage disk-full emergencies

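If pg_wal/ is filling and you cannot fix the archive backend immediately, the first priority is space on the WAL volume: extend the filesystem or free space outside pg_wal/. Setting archive_command = '' (empty string) stops new archive attempts, but it does not prevent PANIC by itself, because PostgreSQL keeps accumulating WAL files in pg_wal/ until you restore a working command. Treat it as a way to pause a misbehaving command while you repair the destination, a bridge and not a cure. A command that always succeeds, such as /bin/true, does let PostgreSQL recycle segments, but it breaks the archive chain and requires a fresh base backup afterward.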

Resolve pgBackRest async timeouts

If pgBackRest async archiving is failing with error [082], the background push is not keeping up. A low archive_timeout forces more frequent segment switches and adds to the backlog, so raise it temporarily if it is set aggressively, or switch to synchronous archiving until the backlog clears. Review the pgBackRest async configuration and ensure the handoff process is healthy.

Prevention

Do not rely solely on pg_stat_archiver.failed_count. Monitor pg_wal growth rate and archiver process liveness via logs, because signal deaths do not increment the counter.
Use archive_library on PostgreSQL 15+ or a proven archiving tool instead of hand-rolled shell scripts. The basic_archive module provides atomic renames and built-in idempotency checks.
Test idempotency quarterly by re-running your archive command against an already-archived segment.
Keep pg_wal on a dedicated volume with enough headroom to survive several hours of backlog.
Monitor the age of last_archived_time against your current WAL position. If the lag exceeds your RPO, page immediately.
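
To size the headroom mentioned above, a back-of-the-envelope estimate is often enough. The sketch below uses a hypothetical function, `hours_until_full`, takes whole megabytes as input, and assumes a steady growth rate, which real workloads rarely have.

```shell
# Hypothetical estimate: whole hours until the WAL volume fills.
# Usage: hours_until_full FREE_MB GROWTH_MB_PER_HOUR
hours_until_full() {
    free_mb=$1
    growth_mb_per_hour=$2
    if [ "$growth_mb_per_hour" -le 0 ]; then
        echo "no growth"
        return 0
    fi
    # Integer division rounds down, which errs on the cautious side.
    echo $((free_mb / growth_mb_per_hour))
}

# Example: 10240 MB free, backlog growing by 512 MB per hour.
hours_until_full 10240 512   # prints 20
```

Compare the result with the time your team realistically needs to notice and repair a broken archive destination.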

How Netdata helps

Correlate pg_stat_archiver.failed_count with disk utilization on the WAL volume to catch failures that statistics alone miss.
Alert on pg_wal directory growth rate even when PostgreSQL reports no failed archive attempts, surfacing silent archiver deaths.
Track WAL generation and checkpoint rates alongside archive lag to identify when write volume outpaces the backend.
Surface PostgreSQL process-level metrics to flag archiver OOM kills or unexpected process restarts.
Combine PostgreSQL logs with metrics to pinpoint the exact WAL filename and timestamp where archiving first stalled.

