PostgreSQL frozen XID monitoring: catching wraparound 6 months early

Transaction ID wraparound is a slow burn, not a sudden crisis. The 32-bit XID counter advances with every transaction. If VACUUM freeze does not keep pace, PostgreSQL will eventually refuse new transactions to prevent data corruption. By the time built-in log warnings fire, you may have only hours of runway left. Monitor age(datfrozenxid) and age(relfrozenxid) with tiered thresholds so you act with months of lead time.

To use those metrics effectively, you need to know what blocks freeze progress and which response matches each threshold tier. This guide covers the freeze mechanism, exact queries, common blocker patterns, and operational thresholds.

What XID wraparound is and why it matters

Wraparound is a hard architectural limit. PostgreSQL assigns every transaction a 32-bit unsigned transaction ID, cycling at roughly 2.147 billion transactions. To avoid cycling back over live data, VACUUM marks old row versions as frozen. The oldest unfrozen XID in each database is recorded in pg_database.datfrozenxid. age(datfrozenxid) returns the transaction span between that frozen point and the current XID counter.

When age(datfrozenxid) grows, the database is consuming XIDs faster than freeze advances. PostgreSQL emits warnings when approximately 40 million XIDs remain (age roughly 2.107 billion). When roughly 3 million XIDs remain, the database refuses new transactions. There is no parameter that expands the 32-bit space. Your only operational defense is to monitor age() per database and per table, remove blockers, and ensure anti-wraparound vacuum completes.

How freezing works

Freeze is a side effect of VACUUM. Every table stores a relfrozenxid in pg_class that tracks the oldest unfrozen XID in that table. When VACUUM runs, it freezes tuples older than vacuum_freeze_min_age and performs an aggressive freeze scan when the table’s age exceeds vacuum_freeze_table_age (default 150 million). If a table’s age crosses autovacuum_freeze_max_age (default 200 million), PostgreSQL launches an anti-wraparound autovacuum even if autovacuum is globally disabled. You can identify these emergency workers in pg_stat_activity by the query string pattern autovacuum: ... to prevent wraparound.

relfrozenxid updates only when VACUUM finishes. If a freeze scan runs for hours, the reported table age does not drop until completion. This confuses operators who expect immediate progress.

The database-level datfrozenxid is the oldest relfrozenxid across all tables in that database, including system catalogs. Even a user database with no application writes advances its XID counter because background processes and system catalog churn consume transaction IDs. Monitor all databases, not just application-facing ones.

flowchart LR
    XID[XID counter increments] --> VAC{VACUUM freeze}
    VAC -->|Unblocked| FROZEN[Tuple frozen
relfrozenxid advances] FROZEN --> DBAGE[datfrozenxid advances] DBAGE --> SAFE[Wraparound risk resets] VAC -->|Blocked by long tx
or replication slot| STALL[Freeze stalls] STALL --> AGE[age(datfrozenxid) grows] AGE --> WRAP[Approach 2.147B ceiling]

Where freeze lag shows up in production

Freeze stalls when something pins the visibility horizon VACUUM uses to judge whether a tuple can be frozen. The two most common blockers are long-running transactions and inactive replication slots.

An open transaction holds a snapshot that prevents VACUUM from freezing tuples created after it started. This includes idle in transaction sessions left by connection leaks or ORM auto-begin behavior. While the transaction sits idle, age(datfrozenxid) climbs.

Replication slots, especially logical slots for CDC or forgotten physical slots, retain WAL and hold back the freeze horizon (logical slots via catalog_xmin, physical slots via xmin). An inactive slot forces the primary to retain WAL and prevents freeze from advancing. Combined with WAL growth, this can fill the disk and compound the outage.

The monitoring surface: database, table, and multixact metrics

Run these checks daily or stream them into your time-series store.

Database-level age. The primary scalar for wraparound risk:

SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY 2 DESC;

Table-level age. Identifies which tables are lagging. Check the main table and its TOAST table; TOAST has its own relfrozenxid and can independently drive database age:

SELECT n.nspname, c.relname, age(c.relfrozenxid), age(t.relfrozenxid)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind = 'r';

Multixact age. Multixact IDs wrap independently and are consumed by row-level share locks. A database heavy on FOR SHARE or FOR KEY SHARE can hit multixact wraparound even when XID age looks safe. The equivalent forced-vacuum threshold is autovacuum_multixact_freeze_max_age:

SELECT datname, mxid_age(datminmxid)
FROM pg_database
ORDER BY 2 DESC;

Vacuum progress. When you are actively remediating, confirm that freeze vacuum is scanning rather than blocked. Focus on phase; values like scanning heap or vacuuming indexes confirm forward motion. If heap_blks_scanned is static for minutes, the worker is likely blocked on a lock or I/O:

SELECT pid, phase, heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;

Alert thresholds and tiered response

Use raw XID age against the 2.147 billion ceiling. Do not wait for PostgreSQL’s built-in warnings.

Tierage(datfrozenxid)MeaningResponse
Early warning500 millionRoughly 25% of safe range consumedReview autovacuum schedule and blocker queries
Urgent1 billionRoughly 50% consumedSchedule manual VACUUM FREEZE windows; terminate blockers
Emergency1.5 billionRoughly 75% consumedEmergency VACUUM FREEZE on lagging tables; page on-call
Critical2.107 billion~40 million XIDs remainPostgreSQL warnings are firing; writes will stop imminently

Anti-wraparound autovacuum triggers automatically when a table crosses autovacuum_freeze_max_age (default 200 million), but the database-level datfrozenxid can still be driven upward by a single lagging table or a blocked vacuum. Tiered alerts give you time to act before the automatic emergency.

Warning: VACUUM FREEZE acquires a SHARE UPDATE EXCLUSIVE lock and generates heavy sequential I/O. Run it during a maintenance window on large tables, and verify the freeze is not blocked before starting.

Blockers that stall freeze progress

Before running manual freeze, check for the blockers that make vacuum ineffective.

Long-running transactions. Check for sessions with old xact_start timestamps, especially those in idle in transaction:

SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < NOW() - INTERVAL '5 minutes'
ORDER BY xact_start;

Set idle_in_transaction_session_timeout to a reasonable value (for example, 2 minutes) to prevent connection leaks from holding back freeze indefinitely.

Inactive replication slots. Check for slots that are not currently active but are retaining WAL:

SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;

Drop slots for decommissioned consumers. On PostgreSQL 13 and later, set max_slot_wal_keep_size to prevent an inactive slot from filling the disk.

Signals to watch in production

SignalWhy it mattersWarning sign
age(datfrozenxid)Primary scalar for wraparound riskExceeds 500 million
age(relfrozenxid) per tableIdentifies specific lagging tablesAny table approaching 1 billion
mxid_age(datminmxid)Multixact wraparound is a separate failure modeGrowing steadily across databases
pg_stat_activity.queryConfirms anti-wraparound vacuum is runningPattern matches to prevent wraparound
pg_replication_slots.activeInactive slots block freeze and fill diskfalse with a stale restart_lsn
pg_stat_activity.stateOpen transactions pin the freeze horizonidle in transaction with an old xact_start

How Netdata helps

  • Netdata collects age(datfrozenxid) and age(relfrozenxid) as time-series, exposing upward trends before a threshold breach.
  • Alerts on XID and multixact age can be set to the tiers above, paging you before PostgreSQL’s built-in warnings fire.
  • pg_stat_progress_vacuum metrics show whether a remediation vacuum is scanning heap blocks or stalled.
  • Correlating XID age with pg_stat_activity state counts and replication slot status distinguishes “VACUUM cannot keep up” from “VACUUM is blocked by a long transaction or slot.”