PostgreSQL transaction ID wraparound: detection and emergency recovery

PostgreSQL can suddenly stop accepting writes and emit warnings that the database must be vacuumed within a shrinking number of transactions. This is transaction ID wraparound. It is not gradual performance degradation; it is a hard stop that can take a database offline for hours if old tuples are not frozen in time.

Every write transaction consumes a 32-bit XID. After roughly two billion transactions, the counter nears the point where older tuples could appear to belong to the future, corrupting visibility. PostgreSQL refuses to hand out new XIDs once fewer than roughly three million remain. Warnings appear at roughly forty million remaining. The defense is VACUUM freeze, which marks old rows with FrozenTransactionId so they no longer depend on the live counter.

If you are responding to an incident, skip to Fixes. If you are establishing monitoring, start with Metrics and signals to monitor.

What this means

PostgreSQL assigns each transaction a monotonic 32-bit unsigned XID. The counter cannot be reset. VACUUM prevents wraparound by replacing old XIDs with FrozenTransactionId, which is always treated as older than any normal XID.

As age(datfrozenxid) grows, you move through three operational risk thresholds:

  • Early warning: age(datfrozenxid) > 500 million. Autovacuum is behind.
  • Urgent: age(datfrozenxid) > 1 billion. Freeze work must complete soon.
  • Refusal: fewer than ~3 million XIDs remain. PostgreSQL stops accepting commands that assign new XIDs. Read-only queries still work; writes fail.

At ~40 million XIDs remaining, PostgreSQL emits WARNING lines in the logs. There is no latency spike before the cliff. The database operates normally until it does not.

Common causes

CauseWhat it looks likeFirst thing to check
Long-running or idle-in-transaction sessionsAn open transaction holds back the XID horizon, blocking VACUUM from freezing newer tuples. pg_stat_activity shows idle in transaction with an old state_change.SELECT pid, state, state_change FROM pg_stat_activity WHERE state = 'idle in transaction';
Inactive replication slotsA slot retains WAL and can hold catalog_xmin, preventing autovacuum from advancing. pg_wal grows while the slot is inactive.SELECT slot_name, active, restart_lsn FROM pg_replication_slots WHERE NOT active;
Insert-only or archive tables with default scale factorDefault autovacuum_vacuum_scale_factor = 0.2 means a 500-million-row table never triggers vacuum. last_autovacuum is NULL or very old.SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;
Autovacuum disabled or throttled to uselessnessautovacuum = off or autovacuum_max_workers = 0. Dead tuples accumulate and XID age grows silently.SHOW autovacuum; and SHOW autovacuum_max_workers;
High-churn tables with default tuningA table receiving millions of updates per day outruns the default thresholds. Autovacuum runs but never finishes before the next wave.SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;

Quick checks

Run these read-only queries to assess exposure and blockers.

-- Database-level XID age
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
-- Table-level XID age
SELECT n.nspname || '.' || c.relname AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
-- Idle-in-transaction blockers
SELECT pid, usename, datname,
       EXTRACT(EPOCH FROM (now() - state_change)) AS idle_seconds
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change;
-- Replication slot status
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots;
-- Tables missing autovacuum attention
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

How to diagnose it

  1. Confirm the XID age. Run the database-level query. If any database is above 1 billion, treat this as urgent. Above 500 million, investigate immediately.

  2. Find the oldest tables. Run the table-level query. The oldest tables drive database age; target them first.

  3. Check for idle-in-transaction sessions. These block VACUUM entirely. Sessions with state_change older than a few minutes are likely the root cause.

  4. Check replication slots. Inactive slots pin catalog_xmin and prevent freeze progress. Compare restart_lsn against the current WAL position.

  5. Verify autovacuum is reaching the tables. If last_autovacuum is stale while n_dead_tup grows, autovacuum is blocked or misconfigured.

  6. Check for multixact wraparound. Run SELECT datname, mxid_age(datminmxid) FROM pg_database;. Multixact wraparound uses separate thresholds and blocks row-level locking when exhausted.

flowchart TD
    A[XID age > 500M] --> B{Blockers present?}
    B -->|Idle in transaction| C[Terminate sessions]
    B -->|Inactive slot| D[Drop slot if safe]
    B -->|None| E[Identify oldest tables]
    E --> F[Run VACUUM FREEZE]
    F --> G[Watch pg_stat_progress_vacuum]

Metrics and signals to monitor

SignalWhy it mattersWarning sign
age(datfrozenxid)Whole-database wraparound proximity.> 500 million (early); > 1 billion (urgent)
age(relfrozenxid) per tableTables driving database age.> 500 million on any large table
mxid_age(datminmxid)Multixact IDs wrap separately and block row locking.Growing steadily without bound
pg_stat_activity idle-in-transaction timeSessions preventing VACUUM advance.Any session idle in transaction > 5 minutes
pg_replication_slots.activeSlots retaining WAL and pinning xmin.active = false with old restart_lsn
pg_stat_user_tables.last_autovacuumConfirmation that freeze scans reach the table.Stale while n_dead_tup grows

Fixes

Clear blockers before vacuuming

Remove blockers before running emergency VACUUM.

  • Terminate idle-in-transaction sessions. These hold back the freeze horizon. Terminating a session rolls back uncommitted work.

    -- Terminate idle-in-transaction blockers older than 5 minutes
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    AND state_change < NOW() - INTERVAL '5 minutes';
    
  • Drop inactive replication slots. Only drop a slot if the consumer is decommissioned. Dropping forces a full re-sync if the consumer reconnects.

    -- Drop an inactive replication slot
    SELECT pg_drop_replication_slot('slot_name');
    

Emergency VACUUM freeze

If the database is still accepting writes but age is high, or if it has already stopped accepting new XIDs, run VACUUM FREEZE on the offending tables. Do not use VACUUM FULL; it acquires an ACCESS EXCLUSIVE lock and rewrites the entire table, which is far slower and more disruptive than VACUUM FREEZE.

-- Emergency freeze on a single table
VACUUM FREEZE VERBOSE my_schema.my_table;

Monitor progress in another session:

SELECT * FROM pg_stat_progress_vacuum;

If the table is large, the vacuum can take hours. It competes for I/O, so expect latency spikes on the same disk. If the database has already refused writes and you cannot connect in multi-user mode, single-user mode is a last resort. It extends downtime; use it only when multi-user mode is unreachable.

Multixact wraparound

If mxid_age(datminmxid) is high, run VACUUM FREEZE on tables with the oldest relminmxid.

-- Tables with oldest multixact IDs
SELECT n.nspname || '.' || c.relname AS table_name,
       mxid_age(c.relminmxid) AS mxid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
ORDER BY mxid_age(c.relminmxid) DESC
LIMIT 10;

Multixact wraparound blocks row-locking operations, not all writes, but it is fatal to applications using SELECT ... FOR SHARE.

Prevention

  • Alert early. Alert on age(datfrozenxid) at 500 million and 1 billion. There is no visible symptom before the cliff; proactive metrics are mandatory.
  • Set idle_in_transaction_session_timeout. Two minutes prevents abandoned transactions from blocking VACUUM.
  • Monitor replication slots. Alert on active = false or lagging restart_lsn.
  • Tune per-table autovacuum. Lower autovacuum_vacuum_scale_factor to 0.01 or 0.05 on large or high-churn tables.
  • Never disable autovacuum. Throttle with autovacuum_vacuum_cost_delay if I/O is too high.

How Netdata helps

  • Correlate age(datfrozenxid) with autovacuum worker activity to distinguish a missing vacuum from a blocked one.
  • Alert on XID age before PostgreSQL emits its own warnings.
  • Surface idle in transaction sessions and table bloat to identify root blockers.
  • Track replication slot lag to catch inactive slots before WAL accumulation fills the disk.