PostgreSQL frozen XID monitoring: catching wraparound 6 months early
Transaction ID wraparound is a slow burn, not a sudden crisis. The 32-bit XID counter advances with every transaction. If VACUUM freeze does not keep pace, PostgreSQL will eventually refuse new transactions to prevent data corruption. By the time built-in log warnings fire, you may have only hours of runway left. Monitor age(datfrozenxid) and age(relfrozenxid) with tiered thresholds so you act with months of lead time.
To use those metrics effectively, you need to know what blocks freeze progress and which response matches each threshold tier. This guide covers the freeze mechanism, exact queries, common blocker patterns, and operational thresholds.
What XID wraparound is and why it matters
Wraparound is a hard architectural limit. PostgreSQL assigns every transaction a 32-bit unsigned transaction ID, cycling at roughly 2.147 billion transactions. To avoid cycling back over live data, VACUUM marks old row versions as frozen. The oldest unfrozen XID in each database is recorded in pg_database.datfrozenxid. age(datfrozenxid) returns the transaction span between that frozen point and the current XID counter.
When age(datfrozenxid) grows, the database is consuming XIDs faster than freeze advances. PostgreSQL emits warnings when approximately 40 million XIDs remain (age roughly 2.107 billion). When roughly 3 million XIDs remain, the database refuses new transactions. There is no parameter that expands the 32-bit space. Your only operational defense is to monitor age() per database and per table, remove blockers, and ensure anti-wraparound vacuum completes.
How freezing works
Freeze is a side effect of VACUUM. Every table stores a relfrozenxid in pg_class that tracks the oldest unfrozen XID in that table. When VACUUM runs, it freezes tuples older than vacuum_freeze_min_age and performs an aggressive freeze scan when the table’s age exceeds vacuum_freeze_table_age (default 150 million). If a table’s age crosses autovacuum_freeze_max_age (default 200 million), PostgreSQL launches an anti-wraparound autovacuum even if autovacuum is globally disabled. You can identify these emergency workers in pg_stat_activity by the query string pattern autovacuum: ... to prevent wraparound.
relfrozenxid updates only when VACUUM finishes. If a freeze scan runs for hours, the reported table age does not drop until completion. This confuses operators who expect immediate progress.
The database-level datfrozenxid is the oldest relfrozenxid across all tables in that database, including system catalogs. Even a user database with no application writes advances its XID counter because background processes and system catalog churn consume transaction IDs. Monitor all databases, not just application-facing ones.
flowchart LR
XID[XID counter increments] --> VAC{VACUUM freeze}
VAC -->|Unblocked| FROZEN[Tuple frozen
relfrozenxid advances]
FROZEN --> DBAGE[datfrozenxid advances]
DBAGE --> SAFE[Wraparound risk resets]
VAC -->|Blocked by long tx
or replication slot| STALL[Freeze stalls]
STALL --> AGE[age(datfrozenxid) grows]
AGE --> WRAP[Approach 2.147B ceiling]Where freeze lag shows up in production
Freeze stalls when something pins the visibility horizon VACUUM uses to judge whether a tuple can be frozen. The two most common blockers are long-running transactions and inactive replication slots.
An open transaction holds a snapshot that prevents VACUUM from freezing tuples created after it started. This includes idle in transaction sessions left by connection leaks or ORM auto-begin behavior. While the transaction sits idle, age(datfrozenxid) climbs.
Replication slots, especially logical slots for CDC or forgotten physical slots, retain WAL and hold back the freeze horizon (logical slots via catalog_xmin, physical slots via xmin). An inactive slot forces the primary to retain WAL and prevents freeze from advancing. Combined with WAL growth, this can fill the disk and compound the outage.
The monitoring surface: database, table, and multixact metrics
Run these checks daily or stream them into your time-series store.
Database-level age. The primary scalar for wraparound risk:
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY 2 DESC;
Table-level age. Identifies which tables are lagging. Check the main table and its TOAST table; TOAST has its own relfrozenxid and can independently drive database age:
SELECT n.nspname, c.relname, age(c.relfrozenxid), age(t.relfrozenxid)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind = 'r';
Multixact age. Multixact IDs wrap independently and are consumed by row-level share locks. A database heavy on FOR SHARE or FOR KEY SHARE can hit multixact wraparound even when XID age looks safe. The equivalent forced-vacuum threshold is autovacuum_multixact_freeze_max_age:
SELECT datname, mxid_age(datminmxid)
FROM pg_database
ORDER BY 2 DESC;
Vacuum progress. When you are actively remediating, confirm that freeze vacuum is scanning rather than blocked. Focus on phase; values like scanning heap or vacuuming indexes confirm forward motion. If heap_blks_scanned is static for minutes, the worker is likely blocked on a lock or I/O:
SELECT pid, phase, heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
Alert thresholds and tiered response
Use raw XID age against the 2.147 billion ceiling. Do not wait for PostgreSQL’s built-in warnings.
| Tier | age(datfrozenxid) | Meaning | Response |
|---|---|---|---|
| Early warning | 500 million | Roughly 25% of safe range consumed | Review autovacuum schedule and blocker queries |
| Urgent | 1 billion | Roughly 50% consumed | Schedule manual VACUUM FREEZE windows; terminate blockers |
| Emergency | 1.5 billion | Roughly 75% consumed | Emergency VACUUM FREEZE on lagging tables; page on-call |
| Critical | 2.107 billion | ~40 million XIDs remain | PostgreSQL warnings are firing; writes will stop imminently |
Anti-wraparound autovacuum triggers automatically when a table crosses autovacuum_freeze_max_age (default 200 million), but the database-level datfrozenxid can still be driven upward by a single lagging table or a blocked vacuum. Tiered alerts give you time to act before the automatic emergency.
Warning: VACUUM FREEZE acquires a SHARE UPDATE EXCLUSIVE lock and generates heavy sequential I/O. Run it during a maintenance window on large tables, and verify the freeze is not blocked before starting.
Blockers that stall freeze progress
Before running manual freeze, check for the blockers that make vacuum ineffective.
Long-running transactions. Check for sessions with old xact_start timestamps, especially those in idle in transaction:
SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < NOW() - INTERVAL '5 minutes'
ORDER BY xact_start;
Set idle_in_transaction_session_timeout to a reasonable value (for example, 2 minutes) to prevent connection leaks from holding back freeze indefinitely.
Inactive replication slots. Check for slots that are not currently active but are retaining WAL:
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;
Drop slots for decommissioned consumers. On PostgreSQL 13 and later, set max_slot_wal_keep_size to prevent an inactive slot from filling the disk.
Signals to watch in production
| Signal | Why it matters | Warning sign |
|---|---|---|
age(datfrozenxid) | Primary scalar for wraparound risk | Exceeds 500 million |
age(relfrozenxid) per table | Identifies specific lagging tables | Any table approaching 1 billion |
mxid_age(datminmxid) | Multixact wraparound is a separate failure mode | Growing steadily across databases |
pg_stat_activity.query | Confirms anti-wraparound vacuum is running | Pattern matches to prevent wraparound |
pg_replication_slots.active | Inactive slots block freeze and fill disk | false with a stale restart_lsn |
pg_stat_activity.state | Open transactions pin the freeze horizon | idle in transaction with an old xact_start |
How Netdata helps
- Netdata collects
age(datfrozenxid)andage(relfrozenxid)as time-series, exposing upward trends before a threshold breach. - Alerts on XID and multixact age can be set to the tiers above, paging you before PostgreSQL’s built-in warnings fire.
pg_stat_progress_vacuummetrics show whether a remediation vacuum is scanning heap blocks or stalled.- Correlating XID age with
pg_stat_activitystate counts and replication slot status distinguishes “VACUUM cannot keep up” from “VACUUM is blocked by a long transaction or slot.”
Related guides
- How PostgreSQL actually works in production: a mental model for operators
- PostgreSQL ALTER TABLE blocked: zero-downtime DDL patterns
- PostgreSQL autovacuum not running: detection, causes, and fixes
- PostgreSQL autovacuum tuning: per-table thresholds for high-churn workloads
- PostgreSQL blocking queries: finding the root blocker in a lock cascade
- PostgreSQL connection exhaustion: detection, diagnosis, and prevention
- PostgreSQL connection refused: pg_hba, listen_addresses, and TCP diagnosis
- PostgreSQL dead tuples piling up: why autovacuum can’t keep up
- PostgreSQL deadlock detected: how to diagnose and prevent deadlocks
- PostgreSQL idle in transaction: detecting and killing zombie sessions
- PostgreSQL index bloat: detection and REINDEX CONCURRENTLY recovery
- PostgreSQL ERROR: could not obtain lock — diagnosis and recovery






