PostgreSQL monitoring checklist: the signals every production database needs

PostgreSQL exposes hundreds of counters across the pg_stat_* views, yet most production outages trace back to a small set of undetected conditions. Bloat accumulates silently until vacuum cannot catch up. Replication lag grows until failover becomes a data-loss event. Transaction ID age crosses a threshold and the database stops accepting writes. The problem is rarely a lack of metrics. It is knowing which signals to instrument at each stage of operational maturity.

This checklist maps the essential PostgreSQL signals to an eight-stage maturity model. It is not an encyclopedia of every view and column. It is the minimum set of signals that must be visible, alerted, and staffed with a runbook before you can claim a given operational level. If you are watching connection count but not transaction ID age, you are one wraparound emergency away from a reactive firefight.

Use this article to audit your current observability. Identify the highest stage where every signal is monitored with a defined threshold. Close the gaps stage by stage rather than trying to build a perfect dashboard overnight.

Stage 1: Escape reactive mode

Stage 1 teams respond to user complaints and have no telemetry. Before you can monitor signals, you must expose them. The transition enablers are architectural, not metric tweaks.

Enable pg_stat_statements in shared_preload_libraries. Without it, query-level performance is invisible. Configure streaming replication so pg_stat_replication exists. Deploy a connection pooler such as PgBouncer in transaction mode. Without pooling, connection count is a meaningless signal because max_connections becomes a false ceiling. Establish continuous WAL archiving and test a restore. Backups that have never been restored are not backups.

Until these prerequisites are in place, the stages below are theoretical.

Stage 2: Health and capacity signals

Stage 2 teams have dashboards and alerts for survival-level metrics. These signals detect capacity exhaustion and availability risk before applications fail.

SignalSourceWhy it mattersAlert threshold
Connection utilizationpg_stat_activity count / max_connectionsProcess-per-connection architecture exhausts memory and CPU.> 70% of max_connections; critical at > 90%.
Replication lagpg_stat_replication.replay_lag (time) or LSN byte diffFailover readiness and RPO protection.> 30 seconds async; > 1 second sync.
Cache hit ratiopg_stat_database.blks_hit / (blks_hit + blks_read)Working set fit. Low ratio means disk latency dominates.< 95% is concerning; < 90% is critical for OLTP.
Checkpoint pressurepg_stat_bgwriter.checkpoints_req / checkpoints_timedForced checkpoints cause I/O latency spikes.> 10% of timed checkpoints sustained over 24 hours.
Disk usagepg_wal directory size, data volumeWAL retention from slots or archiving failures fills disks faster than data growth.> 80% on any PostgreSQL volume.

Decompose replication lag when it spikes. pg_stat_replication exposes write_lag, flush_lag, and replay_lag. If write_lag grows but replay_lag stays low, the network or WAL sender is the bottleneck. If replay_lag grows while the others stay flat, the replica is slow to apply WAL, often due to disk I/O or a long-running query blocking replay.

Stage 3: Workload quality signals

Stage 3 teams baseline query performance and tune per-table autovacuum. Add these signals to detect plan regressions, bloat, and index inefficiency before they degrade user experience.

SignalSourceWhy it mattersAlert threshold
Query latencypg_stat_statements.mean_time / max_time; application percentilesUser-facing performance. Averages hide tail latency.P99 > 1 second for OLTP; > 5 seconds for analytical queries.
Dead tuple ratiopg_stat_user_tables.n_dead_tup / (n_live_tup + n_dead_tup)Bloat and vacuum health. High ratio means space is not reusable and scans touch more dead pages.> 10% is unhealthy; > 20% requires immediate intervention.
Plan stabilitypg_stat_statements.stddev_time / mean_timePlan flapping or data skew causes bimodal latency.> 1 suggests plan instability; > 2 is severe.
Index efficiencypg_stat_user_tables.seq_scan / idx_scan per tableSequential scans on large tables indicate missing indexes or stale statistics.High ratio on tables with > 10k rows.
Autovacuum lagpg_stat_user_tables.last_autovacuum vs. n_dead_tup growthVacuum must keep up with write rate or bloat becomes irreversible.n_dead_tup increasing while last_autovacuum is stale.

Sampling caveat: pg_stat_statements normalizes query text, stripping literal values. Two executions with the same structure but different parameters are aggregated. A stable mean_time with high stddev_time often means the query is fast for most parameters and pathological for one value. Use auto_explain with log_min_duration_statement to capture the actual plan for outlier executions.

Stage 4: Contention signals

Stage 4 teams understand locking and hot standby behavior. Monitor these signals to detect serialization bottlenecks and replica friction.

SignalSourceWhy it mattersAlert threshold
Lock waitspg_locks where granted = falseUngranted locks cascade into latency spikes and deadlocks.Any waitstart growing beyond deadlock_timeout.
Deadlockspg_stat_database.deadlocksApplication-level lock ordering bugs.Any sustained increment; target is zero.
Hot standby conflictspg_stat_database.conflictsReplica queries canceled by WAL replay.> 0 indicates max_standby_streaming_delay is being hit.
LWLock contentionpg_stat_activity.wait_event_type = 'LWLock'Internal bottlenecks from WAL volume or buffer pressure.Sustained high counts with low throughput.

Discriminate wait event types carefully. Lock events on user tables indicate application contention, often from UPDATE or DELETE on the same rows. LWLock events indicate internal serialization pressure, buffer mapping contention, or WAL write saturation. High CPU with low query throughput and many LWLock waits means the database is bottlenecked on internal coordination, not query complexity.

Stage 5: Plan stability and deadlock-free operation

Stage 5 teams enforce query plan stability and eliminate deadlocks through application design. The signals here validate that previous tuning has stuck and that the workload is predictable.

SignalSourceWhy it mattersAlert threshold
Query plan variancepg_stat_statements.stddev_time / mean_time on critical queriesStable plans mean predictable latency and throughput.> 0.5 for critical query fingerprints.
Deadlock ratepg_stat_database.deadlocksZero deadlocks is achievable with correct lock ordering and short transactions.Any non-zero deadlock in a 24-hour window.

At this stage, latency SLOs should be met consistently, not just on average. If stddev_time rises after a deployment, investigate statistics changes or parameter sniffing before the mean time degrades.

Stage 6: Durability and recovery signals

Stage 6 teams treat backups and WAL as first-class concerns. These signals prevent the silent failures that turn recoverable incidents into data loss.

SignalSourceWhy it mattersAlert threshold
Transaction ID ageage(datfrozenxid) from pg_database; age(relfrozenxid) per tableXID wraparound stops writes and can force emergency shutdown.> 500 million (early); > 1 billion (urgent).
Multixact agemxid_age(datminmxid) from pg_databaseSeparate wraparound mechanism with identical failure mode.Same thresholds as XID age.
WAL archive healthpg_stat_archiver.failed_countBroken archiving invalidates PITR and creates gaps in the WAL chain.Any increment.
Checksum failurespg_stat_database.checksum_failuresSilent data corruption from storage or memory.> 0 on primary or replica.
Replication slot lagpg_replication_slots.restart_lsn vs. current LSNInactive slots retain WAL and fill the disk.Inactive slot with growing LSN diff.
Backup restorabilityVerified restore testUntested backups are not backups.Last successful restore > 30 days ago.

Note on XID age: PostgreSQL emits warnings at approximately 2.107 billion (40 million remaining) and refuses writes at approximately 3 million remaining. Operational alerts must fire at 500 million so you have weeks to correct autovacuum behavior before an emergency freeze is required.

Stage 7: Optimizer health signals

Stage 7 teams tune the planner with custom statistics and verify partition pruning. These signals catch the statistical failures that cause plan regressions on large or skewed data sets.

SignalSourceWhy it mattersAlert threshold
Statistics freshnesspg_stat_user_tables.last_autoanalyze vs. table churnStale statistics produce cardinality misestimates and wrong join types.last_autoanalyze older than expected for high-churn tables.
Partition pruningEXPLAIN plans on partitioned tablesPartitioning overhead is only worthwhile if the planner excludes irrelevant partitions.Plans scanning all partitions for single-partition queries.
Correlated column statspg_stats histograms; CREATE STATISTICS coverageCorrelated columns cause the planner to assume independence and underestimate result sizes.Persistent underestimate on queries with correlated predicates.

Monitor pg_class.relpages growth rate against pg_class.reltuples. If pages grow faster than tuples, bloat or fillfactor issues are confusing the planner’s cost model.

Stage 8: Predictive and autotuned signals

Stage 8 teams automate remediation and predict capacity exhaustion. These signals drive automation rather than human runbooks.

SignalSourceWhy it mattersAlert threshold
Bloat growth ratepgstattuple or pg_stat_user_tables trendsAccelerating bloat triggers automated repack before manual intervention is needed.Growth rate exceeding autovacuum reclaim rate.
Capacity trendpg_database_size(), pg_wal growth ratePredict disk exhaustion weeks in advance.Projected to hit limit within provisioning lead time.
Query regressionpg_stat_statements plan history or auto_explain logsAutomated rollback or paging when plans flip after a deploy.Plan change correlated with latency spike.

At this stage, manual intervention for performance should be rare. When these signals fire, the response is typically an automated scaling action, a scheduled repack, or a deployment hold rather than a 3 a.m. page.

How Netdata helps

Netdata collects PostgreSQL metrics directly from the pg_stat_* views and correlates them with OS-level CPU, memory, disk, and network telemetry on the same dashboard.

  • It exposes pg_stat_statements metrics including calls, total time, and mean time, making it easier to spot workload regressions without running manual queries during an incident.
  • Replication lag, connection utilization, and cache hit ratio are collected with per-second granularity, so you can correlate a query latency spike with a checkpoint I/O burst or autovacuum worker launch.
  • Netdata alerts on composite conditions such as high dead tuple ratio combined with a stale last_autovacuum, which indicates vacuum is falling behind.
  • The PostgreSQL collector surfaces pg_stat_replication_slots, flagging inactive slots that retain WAL before they fill the disk.
  • System-level metrics from the same agent reveal whether a database issue is rooted in the OS or storage layer, reducing misattribution.

No related guides are available in this section yet.