
POSTGRESQL · OPERATIONS PLAYBOOK

Keeping PostgreSQL fast: vacuum debt, lock queues, and the wraparound clock

MVCC, WAL, autovacuum, replication — how the server really works, where it tends to break, the signals worth watching, and a runbook for each incident.


PostgreSQL is famously easy to run for the first year, and famously hard to run for the fifth.

The defaults work. Until autovacuum cannot keep up with a high-churn table and dead tuples pile up. Until a forgotten replication slot retains WAL forever and fills the disk. Until age(datfrozenxid) approaches 2 billion and the database refuses writes to avoid wraparound corruption. Until a long-running transaction silently holds back the xmin horizon and vacuum can no longer remove dead tuples. Until one migration takes an AccessExclusiveLock and every other query on that table queues behind it. Until a checkpoint storm turns a steady write workload into a stop-the-world I/O spike.

These guides are written for engineers who already run PostgreSQL, not for people learning what an index is. The goal is to give you the mental model of how the server actually behaves under load, the failure patterns that keep recurring, the monitoring story that catches problems before they page anyone, and the runbooks you wish someone had handed you before your last incident.

How PostgreSQL actually runs in production

PostgreSQL is not a single process. It is a postmaster supervising one backend process per connection, several background processes, a chunk of shared memory, and a strict contract with the storage layer. Most production failures live between these layers, not inside any one of them.

applications / ORMs

Whatever opens connections: application servers, batch jobs, CI scripts, BI tools, replication consumers. Each connection eventually becomes one Postgres backend process.

USER

connection pooler

PgBouncer, Pgpool-II, Odyssey. Multiplexes thousands of client connections onto a small server pool. Architecturally mandatory at scale.

POOL

postmaster + backends

One backend process per server connection. Each backend uses ~5–10 MB of memory even when idle. <code>work_mem</code> is per-operation, not per-backend, so a single complex query can multiply allocations.

BACKEND

shared memory

<code>shared_buffers</code>, WAL buffers, the lock table, and the procarray. The piece of PostgreSQL that survives across queries.

SHARED

background workers

Autovacuum launcher + workers, walwriter, bgwriter, checkpointer, walsender, walreceiver, logical replication apply workers. They run the server's hygiene and replication contracts.

BACKGROUND

storage layout

Heap files, indexes, TOAST tables, pg_wal, temp files, replication slots. The on-disk shape of the database.

STORAGE

OS page cache

The kernel caches PostgreSQL data files. PostgreSQL double-caches deliberately. Above ~40% of RAM in <code>shared_buffers</code> you starve this cache and lose more than you gain.

KERNEL

block storage

Local NVMe, EBS, ZFS, or whatever sits under the data directory. WAL fsync latency on this layer sets the ceiling on commit throughput.

DISK

Why this matters: a query can be slow because of a missing index, a stale plan, a lock wait, a temp-file spill, a checkpoint flush, an autovacuum I/O storm, an OS page-cache miss, or a slow disk fsync. The symptom is the same — slow query — but each layer has a different signal and a different fix.

The failures you'll actually see

Most PostgreSQL incidents fall into a small set of recurring patterns. Recognise the shape, and triage gets dramatically faster.

CRITICAL

The connection exhaustion cliff

FATAL: sorry, too many clients already. Applications fail to acquire connections; new sessions are refused. Underneath it is usually max_connections set too low for the workload, an application leak, idle-in-transaction sessions piling up, or no PgBouncer in front of the database.

too many connections errors at the driver
pg_stat_activity hits max_connections
idle in transaction sessions piling up
PgBouncer waiting clients (cl_waiting in SHOW POOLS) climb
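
The signals above can be combined into one rough check. This is a minimal sketch, assuming the `state` values of pg_stat_activity and the value of max_connections have already been fetched with whatever driver you use; the function name and the 80% warning ratio are illustrative assumptions, not part of any tool.

```python
from collections import Counter

# Hypothetical helper: summarise pg_stat_activity.state values and compare
# the session count with max_connections.
def connection_pressure(states, max_connections, warn_ratio=0.8):
    # Background processes report a NULL state (None here); skip them.
    counts = Counter(s for s in states if s is not None)
    total = sum(counts.values())
    return {
        "total": total,
        "idle_in_transaction": counts.get("idle in transaction", 0),
        "utilisation": total / max_connections,
        "warn": total >= warn_ratio * max_connections,
    }

# Source query, shown for context only:
#   SELECT state FROM pg_stat_activity WHERE backend_type = 'client backend';
sample = ["active"] * 30 + ["idle"] * 40 + ["idle in transaction"] * 15
report = connection_pressure(sample, max_connections=100)
print(report)
```

A high idle-in-transaction count next to a high total usually points at an application leak rather than a max_connections value that is too low.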


IMMINENT

The lock cascade

One slow transaction takes a lock; everything else queues behind it. A migration takes AccessExclusiveLock on a hot table, and the entire app stalls. Row-level locks contend, and every blocked session runs the deadlock detector once it has waited deadlock_timeout. The database keeps running while the workload grinds to a halt.

active sessions climbing without throughput
pg_blocking_pids shows a deep wait chain
lock-wait and deadlock log entries spike (log_lock_waits reports waits longer than deadlock_timeout)
AccessExclusiveLock held by a DDL session
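
During a cascade the useful question is which session sits at the head of the queue. The sketch below assumes you have already collected a mapping of waiting pid to the pids blocking it (the shape pg_blocking_pids returns per session); the function name and sample pids are hypothetical.

```python
# Hypothetical input: {waiting_pid: [blocking_pids]}, for example from
#   SELECT pid, pg_blocking_pids(pid) FROM pg_stat_activity
#   WHERE cardinality(pg_blocking_pids(pid)) > 0;
def root_blockers(blocked_by):
    waiting = {pid for pid, blockers in blocked_by.items() if blockers}
    roots = {}
    for pid in waiting:
        seen, stack = set(), [pid]
        while stack:                      # walk up the chain of blockers
            current = stack.pop()
            for blocker in blocked_by.get(current, []):
                if blocker in seen:
                    continue              # guards against deadlock cycles
                seen.add(blocker)
                if blocker in waiting:
                    stack.append(blocker)
                else:
                    # Not waiting on anyone: this session heads the queue.
                    roots.setdefault(blocker, set()).add(pid)
    # root pid -> number of sessions queued directly or indirectly behind it
    return {root: len(victims) for root, victims in roots.items()}

chain = {201: [100], 202: [201], 203: [202], 204: [100]}
print(root_blockers(chain))
```

Sessions that appear only as blockers and never as waiters are the ones worth inspecting, and possibly cancelling, first.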


ACTIVE

The autovacuum starvation spiral

A long-running transaction prevents dead tuple cleanup. Bloat accumulates on hot tables. Sequential scans get slower. Indexes balloon. Autovacuum eventually catches up — at the worst possible time, competing with peak load. The fix is rarely "tune autovacuum harder"; it is "find the long transaction."

n_dead_tup growing without n_live_tup matching
pg_stat_activity has a transaction older than 30 minutes
table size grows faster than row count
VACUUM runs that don't reclaim dead tuples


CRITICAL

The transaction ID wraparound emergency

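PostgreSQL stops accepting writes when transaction IDs come within ~3 million of wraparound. WARNING: database must be vacuumed within X transactions escalates to ERROR: database is not accepting commands. The classic recovery is single-user mode and VACUUM FREEZE; on recent releases a plain VACUUM run in normal multi-user mode is usually enough and avoids taking the server down, provided you first clear whatever is holding the oldest XID back. Prevention is monitoring age(datfrozenxid) long before it matters.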

log warnings about transaction ID wraparound
age(datfrozenxid) above 1 billion
autovacuum_freeze_max_age frequently triggered
anti-wraparound vacuums running against multiple tables


IMMINENT

The replication slot disk-fill

A logical or physical replication slot stops being consumed. The primary cannot recycle WAL because the slot retains it. pg_wal grows without bound until the disk fills. The primary then refuses writes. The fix in the moment is to drop the slot; the prevention is alerting on slot lag and max_slot_wal_keep_size.

pg_wal directory growing steadily
pg_replication_slots shows active=false on a retained slot
WAL retained per slot (current WAL position minus restart_lsn) above a few GB
checkpoints occurring but WAL not recycling
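
Slot lag in bytes is the distance between the current WAL position and the slot's restart_lsn. On the server this is pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn); the sketch below does the same arithmetic client-side on the textual LSN format, assuming both values were read as strings. The function names and sample LSNs are illustrative.

```python
# An LSN prints as two hexadecimal numbers, "high/low": the upper and
# lower 32 bits of a 64-bit byte position in the WAL stream.
def lsn_to_int(lsn):
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def retained_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the primary must keep for a slot at restart_lsn."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)

# Sample values, not taken from a real server.
lag = retained_bytes("2/40000000", "1/C0000000")
print(lag, "bytes retained")
```

Alerting on this number per slot, together with active=false, catches a stalled consumer well before pg_wal fills the volume.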


WATCHFUL

The checkpoint storm

A burst of writes generates WAL faster than max_wal_size allows and forces a requested checkpoint (counted in checkpoints_req) ahead of schedule. Buffered writes drain to disk in a spike; fsync latency climbs; query latency follows. Logs show checkpoints are occurring too frequently. The fix is almost always max_wal_size, not checkpoint_timeout.

checkpoints_req >> checkpoints_timed
log warning: checkpoints are occurring too frequently
I/O spikes aligned with checkpoint completion
p99 commit latency climbs during checkpoints
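
The first signal is easiest to read as a ratio over a sampling window rather than as lifetime counters. A minimal sketch, assuming two samples of the checkpoints_timed and checkpoints_req counters (pg_stat_bgwriter on older releases; PostgreSQL 17 moves them to pg_stat_checkpointer under different column names); the dictionary layout is an assumption made for this example.

```python
def requested_checkpoint_ratio(prev, curr):
    """Share of checkpoints in the window that were forced, not scheduled."""
    timed = curr["checkpoints_timed"] - prev["checkpoints_timed"]
    requested = curr["checkpoints_req"] - prev["checkpoints_req"]
    total = timed + requested
    # No checkpoint completed in the window: nothing to report.
    return requested / total if total else None

# Sample counters, one hour apart (illustrative numbers).
before = {"checkpoints_timed": 100, "checkpoints_req": 10}
after = {"checkpoints_timed": 104, "checkpoints_req": 22}
ratio = requested_checkpoint_ratio(before, after)
print(ratio)
```

A ratio that stays well above one half during normal load suggests max_wal_size is too small for the write rate.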


The Netdata solution

PostgreSQL monitoring with Netdata

Netdata monitors PostgreSQL with per-second metrics, pre-built dashboards, and ML-powered anomaly detection. Correlate connection saturation, lock waits, autovacuum progress, replication lag, and checkpoint I/O against the rest of your stack so you catch the incidents in these runbooks before they page anyone.


PostgreSQL monitoring maturity levels

PostgreSQL observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much your database matters. Most production databases should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is the database still functioning? You will not learn what broke, but you will learn that something broke before users do. Survival is enough for dev environments and hobby clusters.

Database reachability: Can a probe connect and run SELECT 1?
Server uptime / unexpected restarts: Did the postmaster restart without your permission?
Disk free on the data directory: Is the volume hosting pg_wal and base/ near full?
Connection count vs max_connections: Are you within the connection ceiling?
Replication, replicas connected: Are the expected replicas attached to the primary?
Backup last-success age: When did pg_basebackup or pgBackRest last succeed?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production databases should target. Survival tells you something is wrong; operational tells you what. With this coverage your team can usually diagnose an incident on its own: bloat, replication lag, slow queries, checkpoint pressure, lock waits.

Transactions per second (commits + rollbacks): Is the workload doing what it should?
Cache hit ratio per database: Are reads served from shared_buffers?
Replication lag (write/flush/replay): How far behind is each replica, in bytes and seconds?
Dead tuples and table bloat: Is autovacuum keeping up with churn?
Active vs idle vs waiting sessions: What is pg_stat_activity actually doing?
Lock waits and blocking sessions: Is anything in a multi-second wait?
Long-running transactions (>5 min): Anything holding xmin back from cleanup?
Checkpoints, timed vs requested: Is max_wal_size sized correctly?
WAL generation rate: Is the write workload growing?
pg_stat_statements top by total_exec_time (total_time before PostgreSQL 13): Which queries actually cost the most?
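
The first two signals in this list are derived from cumulative counters in pg_stat_database, so they only make sense as deltas between two samples. A minimal sketch, assuming each sample is a dictionary of the xact_commit, xact_rollback, blks_hit, and blks_read columns for one database; the function name and sample numbers are illustrative.

```python
def database_rates(prev, curr, seconds):
    """Turn two pg_stat_database samples into TPS and cache hit ratio."""
    delta = {key: curr[key] - prev[key] for key in curr}
    reads = delta["blks_hit"] + delta["blks_read"]
    return {
        "tps": (delta["xact_commit"] + delta["xact_rollback"]) / seconds,
        # blks_read counts blocks requested from the kernel, which may still
        # have been served from the OS page cache rather than from disk.
        "cache_hit_ratio": delta["blks_hit"] / reads if reads else None,
    }

t0 = {"xact_commit": 0, "xact_rollback": 0, "blks_hit": 0, "blks_read": 0}
t1 = {"xact_commit": 5900, "xact_rollback": 100,
      "blks_hit": 9900, "blks_read": 100}
rates = database_rates(t0, t1, seconds=60)
print(rates)
```

Computing on deltas also keeps the numbers meaningful across a statistics reset, as long as a sample where the counters went backwards is discarded.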

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up: age(datfrozenxid) climbing, replication slot lag drifting, statistics going stale, prepared statements regressing to a poor generic plan, temp file rate creeping. None of these will page you on day one. They become paging incidents on day thirty.

age(datfrozenxid) per database: Months of headroom against wraparound?
Replication slot lag (bytes retained): Is a stale slot accumulating WAL?
Autovacuum worker utilisation: Are workers saturated? Is anything blocked?
Temp file generation rate and size: Is work_mem too small for real queries?
Buffer eviction rate (bgwriter + backend writes): Is shared_buffers thrashing?
Heap fetches per index-only scan: Is the visibility map stale?
WAL fsync p99 latency: How fast does the underlying disk really fsync?
Connection age distribution: Are PgBouncer transaction-pool connections rotating?
Plan cache hit ratio (prepared statements): Is the planner choosing generic vs custom plans correctly?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after a specific incident proved you needed them: wait-event sampling, per-table autovacuum I/O accounting, B-tree split rates, ProcArray contention, recovery conflicts on hot standbys. Most teams never need every signal here. Add the ones your incident history says you do.

wait_event sampling from pg_stat_activity: Where is the server spending its waiting time?
Per-table autovacuum I/O and duration: Which tables consume vacuum budget?
B-tree split and fillfactor effectiveness: Are HOT updates winning, or are indexes bloating?
Hot standby recovery conflicts: Is replay being interrupted by replica queries?
Logical replication apply latency by table: Which subscriber tables fall behind?
shared_buffers dirty rate vs flush rate: Are checkpoints flushing what bgwriter should?
Page cache pressure on the data volume: Is the OS evicting Postgres pages?
auto_explain captures of slow queries: Plan + actual rows for every slow path.

Operating mistakes worth avoiding

The traps PostgreSQL teams keep falling into. Each has a clear, well-known fix. Most teams only learn it after an incident.

max_connections set to 500+ instead of using a pooler

PostgreSQL is process-per-connection. Each backend costs ~5–10 MB even idle. Five hundred backends is 2.5–5 GB of memory before a single query runs, plus serious context-switch overhead. PgBouncer in transaction mode lets you serve thousands of clients with 50 server connections.
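
The memory arithmetic is worth doing explicitly for your own numbers. A back-of-the-envelope sketch using the ~5–10 MB per idle backend figure from the text; the function name is hypothetical, and real per-backend cost varies with version, extensions, and catalog cache size.

```python
def idle_backend_memory_gb(backends, mb_low=5, mb_high=10):
    """Rough (low, high) memory cost in GB of idle backends alone."""
    return backends * mb_low / 1000, backends * mb_high / 1000

direct = idle_backend_memory_gb(500)   # every client holds a backend
pooled = idle_backend_memory_gb(50)    # 50 server connections behind a pooler
print("direct:", direct, "pooled:", pooled)
```

This counts only idle overhead; work_mem allocations from running queries come on top of it.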

Not monitoring age(datfrozenxid)

Wraparound is the silent killer. Default <code>autovacuum_freeze_max_age</code> is 200M. The hardcoded shutdown threshold is around 2.147B. Alert at 500M and 1B; ignore both and you will eventually meet a database that refuses writes.
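
The two alert levels can be expressed as a small classifier over the value returned by age(datfrozenxid). A minimal sketch using the thresholds quoted above; the function name is hypothetical, and the headroom it reports ignores the few million XIDs PostgreSQL reserves before the hard limit.

```python
XID_WRAP = 2**31             # ~2.147 billion, the limit quoted above
WARN_AGE = 500_000_000       # first alert threshold from the text
CRIT_AGE = 1_000_000_000     # second alert threshold from the text

def wraparound_status(xid_age):
    """Return (alert level, XIDs left before the wraparound limit)."""
    if xid_age >= CRIT_AGE:
        level = "critical"
    elif xid_age >= WARN_AGE:
        level = "warning"
    else:
        level = "ok"
    return level, XID_WRAP - xid_age

# Source query, shown for context only:
#   SELECT datname, age(datfrozenxid) FROM pg_database;
status = wraparound_status(600_000_000)
print(status)
```

Dividing the remaining headroom by your daily XID consumption turns the number into days, which is usually the easier figure to alert on.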

Replication slots without monitoring

A slot retains WAL until consumed. A forgotten or stalled slot is the #1 root cause of pg_wal filling the disk. Alert on slot lag bytes and active=false on any persistent slot.

fsync = off "for performance"

fsync is what makes PostgreSQL durable. Disabling it can corrupt the cluster on any unclean shutdown. If you genuinely need extra write performance, tune synchronous_commit, not fsync.

pg_basebackup or pgBackRest backups never restore-tested

An untested backup is not a backup. Schedule a quarterly restore drill on a separate host. The first time you discover that backups don't restore must not be during an incident.

Treating autovacuum as something to disable

Disabling autovacuum on "hot" tables to "avoid I/O" is how teams meet wraparound emergencies. Tune <code>autovacuum_vacuum_scale_factor</code> and <code>autovacuum_vacuum_cost_delay</code> per table; never set <code>autovacuum_enabled = off</code> in production.
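
Per-table tuning matters because the trigger point scales with table size: autovacuum vacuums a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count. The sketch below evaluates that formula with the stock defaults (50 and 0.2); the function names and the 100-million-row table are illustrative assumptions.

```python
def autovacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    """Dead tuples needed before autovacuum considers the table."""
    return threshold + scale_factor * reltuples

def needs_vacuum(n_dead_tup, reltuples, **settings):
    return n_dead_tup > autovacuum_trigger(reltuples, **settings)

rows = 100_000_000
default_trigger = autovacuum_trigger(rows)
# Per-table override, applied on the server with something like:
#   ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);
# (big_table is a placeholder name.)
tuned_trigger = autovacuum_trigger(rows, scale_factor=0.01)
print(default_trigger, tuned_trigger)
```

With defaults, a table of that size accumulates roughly twenty million dead tuples before its first automatic vacuum, which is why large hot tables usually want a much smaller scale factor rather than autovacuum switched off.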

Ignoring idle in transaction sessions

An idle-in-transaction session that holds a snapshot pins the xmin horizon, so vacuum cannot remove any row version that became dead after that snapshot was taken. Set <code>idle_in_transaction_session_timeout</code> on every production cluster (60s–5min depending on workload).

Tuning shared_buffers to 80% of RAM

The OS page cache also caches Postgres pages. Above ~40% of RAM in shared_buffers, the kernel cache starves and you pay double for the same data. 25–40% is the well-known sweet spot.

PostgreSQL runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

Start here
Connections, pooling, and PgBouncer
Locks, deadlocks, and blocking
Autovacuum, bloat, and dead tuples
Transaction ID wraparound
WAL, checkpoints, and durability
Replication, slots, and failover
Slow queries, plans, statistics
Disk, WAL directory, TOAST
Memory and OOM
Upgrades and backups

WHERE TO GO NEXT

Setting up PostgreSQL monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.
