How PostgreSQL actually works in production: a mental model for operators

PostgreSQL is a reliable relational database, but that reliability is not magic. It is the emergent behavior of five concrete subsystems that interact in specific, observable ways: the Write-Ahead Log, Multi-Version Concurrency Control, checkpoints, autovacuum, and the process-per-connection model. If you do not understand how these interact, you will misdiagnose table bloat as a missing index, interpret checkpoint I/O spikes as disk failures, and respond to connection exhaustion by raising max_connections until the OOM killer arrives.

This article traces the path of a committed write from the client through the backend, WAL, and data files. It explains why a DELETE does not free space, why more connections can mean less throughput, and why a forgotten replication slot can fill a primary’s disk even when the replica looks healthy. The goal is to give you a concrete mental model for reasoning through incidents at 3 a.m. without guessing.

Why these five abstractions matter

Every production incident in PostgreSQL eventually touches WAL, MVCC, checkpoints, autovacuum, or the connection model. WAL determines whether you can recover from a crash. MVCC explains why tables grow after deletes and why long transactions cause bloat. Checkpoints explain why latency spikes on a predictable cadence. Autovacuum determines whether the database survives long enough to need a major version upgrade. The process-per-connection model dictates your memory ceiling and why connection pooling is architecturally mandatory rather than optional.

These details are the control levers you have, whether you run on EC2, in Kubernetes, or on a managed service that hides the filesystem.

How it works

flowchart TD
    Client[Client connection] -->|SQL command| Backend[Backend process]
    Backend -->|read/modify pages| SharedBuffers[Shared buffers]
    Backend -->|append record| WAL[WAL buffer]
    WAL -->|flush| WALDisk[WAL files]
    SharedBuffers -->|checkpoint flush| DataFiles[Data files]
    Autovacuum[Autovacuum worker] -->|reclaim dead tuples| SharedBuffers
    Autovacuum -->|mark all-visible| VM[Visibility map]

WAL is the source of truth

The Write-Ahead Log is an append-only, sequential journal. Every change is written to WAL before it is applied to heap pages. In practice, WAL is the source of truth and the data files are a materialized cache. If the server crashes, recovery replays WAL from the last checkpoint forward to bring the data files to a consistent state.

The default wal_level is replica, which supports archiving and streaming replication. full_page_writes is on by default, writing full page images after every checkpoint to prevent torn pages. Turning fsync off risks unrecoverable corruption. If you need lower commit latency, use synchronous_commit = off or local instead, which still flushes WAL but may return before the disk acknowledge.

Checkpoints bound recovery time

A checkpoint is a point where all dirty buffers are flushed to disk. Checkpoints begin every checkpoint_timeout seconds (default 5 minutes) or when max_wal_size is about to be exceeded (default 1 GB), whichever comes first. The background writer helps smooth I/O between checkpoints, but it does not eliminate the spike. checkpoint_completion_target defaults to 0.9, spreading the flush over most of the checkpoint interval.

If max_wal_size is too small for your write volume, checkpoints become forced and frequent. You will see I/O latency spikes on a fixed schedule. The tuning goal is to keep checkpoints_req rare, well under 10 percent of checkpoints_timed.

MVCC creates dead tuples

Multi-Version Concurrency Control allows readers and writers to proceed without blocking each other by keeping multiple versions of rows in the table. Every UPDATE creates a new tuple version; every DELETE leaves a dead tuple behind. A transaction sees the snapshot that existed when it started, implemented through xmin and xmax transaction IDs in each row header.

Because old versions stay in the table until reclaimed, the free space from a DELETE is not immediately reusable for new rows of different sizes. A table can grow to many times its logical size while still reporting the same row count. Only VACUUM, VACUUM FULL, or pg_repack reclaims that space. The Visibility Map tracks which pages contain only all-visible tuples, enabling index-only scans and allowing autovacuum to skip clean pages. All access is page-oriented, and the page is the 8 KB I/O unit.

Autovacuum is a first-class write workload

Autovacuum is not optional cleanup. It is a background write workload that reclaims dead tuple space, updates the free space map, freezes transaction IDs, and maintains the visibility map. Workers launch when a table exceeds a threshold of dead tuples: autovacuum_vacuum_threshold (default 50 tuples) plus autovacuum_vacuum_scale_factor (default 0.2, or 20 percent of table rows).

Anti-wraparound vacuum fires when a table’s oldest XID exceeds autovacuum_freeze_max_age (default 200 million). Because the XID counter is 32-bit, wraparound is inevitable without freezing. PostgreSQL emits warnings when roughly 40 million XIDs remain, and refuses writes when approximately 3 million remain.

Long-running transactions block vacuum progress because a worker cannot remove dead tuples newer than the oldest active snapshot. Disabling autovacuum because it competes for I/O causes bloat that slows the system down far more than the vacuum itself. VACUUM also writes WAL block by block, so it is recoverable after a crash.

Connection-per-process sets your memory ceiling

Each client connection spawns a dedicated operating-system process consuming roughly 5-10 MB of resident memory even when idle. The default max_connections is 100. Context-switching dominates after roughly CPU count multiplied by 2 active connections, so raising max_connections to absorb load usually makes throughput worse.

Connection poolers such as PgBouncer multiplex many client connections over a small pool of backends. This avoids the per-process memory tax and is mandatory at scale. Memory pressure is amplified by work_mem, which is allocated per operation, not per session. A query with four hash joins can use four times work_mem simultaneously. Treat connection pooling and conservative work_mem sizing as part of the architecture, not as optimizations.

Where these abstractions show up in production

Table bloat after deletes. A developer runs a large DELETE and expects the table to shrink. Instead, query latency degrades because sequential scans traverse pages full of dead tuples. The fix is not to run VACUUM FULL during business hours, which acquires an exclusive lock, but to ensure autovacuum is tuned and not blocked by idle transactions.

Checkpoint I/O spikes. Every five minutes, latency jumps for ten to twenty seconds. pg_stat_bgwriter shows checkpoints_req approaching checkpoints_timed. The root cause is a max_wal_size that is too small for the write volume. Spreading checkpoints by raising max_wal_size and checkpoint_timeout smooths the spike.

Connection storms during deploys. A rolling deployment opens new connections before old ones close. Without a pooler, the backend count hits max_connections and the database rejects new clients. The fix is PgBouncer in transaction mode, not a larger max_connections that consumes memory and increases context switches.

Replication lag and WAL retention. Streaming replication ships WAL from primary to replica. A forgotten replication slot retains WAL indefinitely, filling the primary’s disk even though the replica looks healthy. Slots must be monitored explicitly; they do not clean themselves up when a consumer disappears.

Transaction ID wraparound emergencies. A database that has been running fine suddenly goes read-only. age(datfrozenxid) has been approaching the limit for months while autovacuum appeared active but was blocked by idle transactions or misconfigured thresholds. Recovery requires emergency VACUUM FREEZE and may take hours.

Lock contention versus internal bottlenecks. High CPU with low throughput and wait_event_type = 'Lock' in pg_stat_activity indicates application-level contention. If the wait event is LWLock, the bottleneck is internal PostgreSQL coordination, often from high WAL volume or aggressive parallel query launches.

Tradeoffs and common misuses

Durability versus latency. fsync = off is faster and absolutely unacceptable for production. synchronous_commit = off provides lower latency at the cost of a small data-loss window. synchronous_commit = remote_apply guarantees durability on a replica but adds network round-trip latency to every commit.

Autovacuum aggressiveness versus query performance. Aggressive settings prevent bloat but consume I/O and CPU continuously. Conservative settings preserve foreground query performance but allow dead tuples to accumulate. The correct balance is workload-dependent: high-churn OLTP needs frequent vacuum, while read-heavy systems can tolerate relaxed settings.

Index coverage versus write amplification. Each index speeds specific reads but slows every INSERT, UPDATE, and DELETE because the engine must maintain the index. Partial and covering indexes reduce write overhead, but only for queries that match the predicate.

Shared buffers sizing. The default shared_buffers of 128 MB is too small for production, but setting it above roughly 40 percent of RAM can starve the operating system page cache and lead to out-of-memory kills. A reasonable starting point is 25 to 40 percent of RAM for OLTP workloads.

Memory allocation: cache versus sort/hash. shared_buffers caches data pages, while work_mem allocates per-operation memory for sorts and hashes. Over-allocating work_mem causes OOM; under-allocating it spills to disk. Size work_mem conservatively and account for concurrent operations and parallel workers.

Signals to watch in production

SignalWhy it mattersWarning sign
Buffer cache hit ratioIndicates whether the working set fits in memoryBelow 95 percent for OLTP
Dead tuple ratioMeasures bloat and vacuum healthAbove 20 percent sustained
Transaction ID ageTracks wraparound riskage(datfrozenxid) above 500 million
Required versus timed checkpointsReveals checkpoint pressure from WAL volumecheckpoints_req above 10 percent of checkpoints_timed
Connection utilizationCapacity headroom before connection refusalAbove 90 percent of max_connections
Replication lagFailover readiness and WAL retention riskAbove 30 seconds on async replicas

How Netdata helps

  • Correlate PostgreSQL query latency with system disk latency to confirm whether latency spikes are checkpoint I/O rather than storage failures.
  • Track per-table dead tuple ratios to spot vacuum lag before bloat degrades sequential scan and index performance.
  • Monitor replication lag, WAL directory size, and connection counts together to distinguish inactive replication slots from network saturation.
  • Alert on transaction ID age per database to provide weeks of warning before wraparound emergencies.
  • Visualize buffer cache hit ratio alongside operating system memory metrics to validate shared_buffers sizing without guessing.