PostgreSQL Monitoring Checklist: The Signals Every Production Database Needs

PostgreSQL exposes hundreds of counters across the pg_stat_* views, yet most production outages trace back to a small set of undetected conditions. Bloat accumulates silently until vacuum cannot catch up. Replication lag grows until failover becomes a data-loss event. Transaction ID age crosses a threshold and the database stops accepting writes. The problem is rarely a lack of metrics. It is knowing which signals to instrument at each stage of operational maturity.

This checklist maps the essential PostgreSQL signals to an eight-stage maturity model. It is not an encyclopedia of every view and column. It is the minimum set of signals that must be visible, alerted, and staffed with a runbook before you can claim a given operational level. If you are watching connection count but not transaction ID age, you are one wraparound emergency away from a reactive firefight.

Use this article to audit your current observability. Identify the highest stage where every signal is monitored with a defined threshold. Close the gaps stage by stage rather than trying to build a perfect dashboard overnight.

Stage 1: Escape Reactive Mode

Stage 1 teams respond to user complaints and have no telemetry. Before you can monitor signals, you must expose them. The transition enablers are architectural, not metric tweaks.

Enable pg_stat_statements in shared_preload_libraries. Without it, query-level performance is invisible. Configure streaming replication so pg_stat_replication exists. Deploy a connection pooler such as PgBouncer in transaction mode. Without pooling, connection count is a meaningless signal because max_connections becomes a false ceiling. Establish continuous WAL archiving and test a restore. Backups that have never been restored are not backups.

Until these prerequisites are in place, the stages below are theoretical.

Stage 2: Health & Capacity Signals

Stage 2 teams have dashboards and alerts for survival-level metrics. These signals detect capacity exhaustion and availability risk before applications fail.

Signal	Source	Why it matters	Alert threshold
Connection utilization	`pg_stat_activity` count / `max_connections`	Process-per-connection architecture exhausts memory and CPU.	> 70% of `max_connections`; critical at > 90%.
Replication lag	`pg_stat_replication.replay_lag` (time) or LSN byte diff	Failover readiness and RPO protection.	> 30 seconds async; > 1 second sync.
Cache hit ratio	`pg_stat_database.blks_hit / (blks_hit + blks_read)`	Working set fit. Low ratio means disk latency dominates.	< 95% is concerning; < 90% is critical for OLTP.
Checkpoint pressure	`pg_stat_bgwriter.checkpoints_req / checkpoints_timed`	Forced checkpoints cause I/O latency spikes.	> 10% of timed checkpoints sustained over 24 hours.
Disk usage	`pg_wal` directory size, data volume	WAL retention from slots or archiving failures fills disks faster than data growth.	> 80% on any PostgreSQL volume.

Decompose replication lag when it spikes. pg_stat_replication exposes write_lag, flush_lag, and replay_lag. If write_lag grows but replay_lag stays low, the network or WAL sender is the bottleneck. If replay_lag grows while the others stay flat, the replica is slow to apply WAL, often due to disk I/O or a long-running query blocking replay.

Stage 3: Workload Quality Signals

Stage 3 teams baseline query performance and tune per-table autovacuum. Add these signals to detect plan regressions, bloat, and index inefficiency before they degrade user experience.

Signal	Source	Why it matters	Alert threshold
Query latency	`pg_stat_statements.mean_time` / `max_time`; application percentiles	User-facing performance. Averages hide tail latency.	P99 > 1 second for OLTP; > 5 seconds for analytical queries.
Dead tuple ratio	`pg_stat_user_tables.n_dead_tup / (n_live_tup + n_dead_tup)`	Bloat and vacuum health. High ratio means space is not reusable and scans touch more dead pages.	> 10% is unhealthy; > 20% requires immediate intervention.
Plan stability	`pg_stat_statements.stddev_time / mean_time`	Plan flapping or data skew causes bimodal latency.	> 1 suggests plan instability; > 2 is severe.
Index efficiency	`pg_stat_user_tables.seq_scan / idx_scan` per table	Sequential scans on large tables indicate missing indexes or stale statistics.	High ratio on tables with > 10k rows.
Autovacuum lag	`pg_stat_user_tables.last_autovacuum` vs. `n_dead_tup` growth	Vacuum must keep up with write rate or bloat becomes irreversible.	`n_dead_tup` increasing while `last_autovacuum` is stale.

Sampling caveat: pg_stat_statements normalizes query text, stripping literal values. Two executions with the same structure but different parameters are aggregated. A stable mean_time with high stddev_time often means the query is fast for most parameters and pathological for one value. Use auto_explain with log_min_duration_statement to capture the actual plan for outlier executions.

Stage 4: Contention Signals

Stage 4 teams understand locking and hot standby behavior. Monitor these signals to detect serialization bottlenecks and replica friction.

Signal	Source	Why it matters	Alert threshold
Lock waits	`pg_locks` where `granted = false`	Ungranted locks cascade into latency spikes and deadlocks.	Any `waitstart` growing beyond `deadlock_timeout`.
Deadlocks	`pg_stat_database.deadlocks`	Application-level lock ordering bugs.	Any sustained increment; target is zero.
Hot standby conflicts	`pg_stat_database.conflicts`	Replica queries canceled by WAL replay.	> 0 indicates `max_standby_streaming_delay` is being hit.
LWLock contention	`pg_stat_activity.wait_event_type = 'LWLock'`	Internal bottlenecks from WAL volume or buffer pressure.	Sustained high counts with low throughput.

Discriminate wait event types carefully. Lock events on user tables indicate application contention, often from UPDATE or DELETE on the same rows. LWLock events indicate internal serialization pressure, buffer mapping contention, or WAL write saturation. High CPU with low query throughput and many LWLock waits means the database is bottlenecked on internal coordination, not query complexity.

Stage 5: Plan Stability & Deadlock-Free Operation

Stage 5 teams enforce query plan stability and eliminate deadlocks through application design. The signals here validate that previous tuning has stuck and that the workload is predictable.

Signal	Source	Why it matters	Alert threshold
Query plan variance	`pg_stat_statements.stddev_time / mean_time` on critical queries	Stable plans mean predictable latency and throughput.	> 0.5 for critical query fingerprints.
Deadlock rate	`pg_stat_database.deadlocks`	Zero deadlocks is achievable with correct lock ordering and short transactions.	Any non-zero deadlock in a 24-hour window.

At this stage, latency SLOs should be met consistently, not just on average. If stddev_time rises after a deployment, investigate statistics changes or parameter sniffing before the mean time degrades.

Stage 6: Durability & Recovery Signals

Stage 6 teams treat backups and WAL as first-class concerns. These signals prevent the silent failures that turn recoverable incidents into data loss.

Signal	Source	Why it matters	Alert threshold
Transaction ID age	`age(datfrozenxid)` from `pg_database`; `age(relfrozenxid)` per table	XID wraparound stops writes and can force emergency shutdown.	> 500 million (early); > 1 billion (urgent).
Multixact age	`mxid_age(datminmxid)` from `pg_database`	Separate wraparound mechanism with identical failure mode.	Same thresholds as XID age.
WAL archive health	`pg_stat_archiver.failed_count`	Broken archiving invalidates PITR and creates gaps in the WAL chain.	Any increment.
Checksum failures	`pg_stat_database.checksum_failures`	Silent data corruption from storage or memory.	> 0 on primary or replica.
Replication slot lag	`pg_replication_slots.restart_lsn` vs. current LSN	Inactive slots retain WAL and fill the disk.	Inactive slot with growing LSN diff.
Backup restorability	Verified restore test	Untested backups are not backups.	Last successful restore > 30 days ago.

Note on XID age: PostgreSQL emits warnings at approximately 2.107 billion (40 million remaining) and refuses writes at approximately 3 million remaining. Operational alerts must fire at 500 million so you have weeks to correct autovacuum behavior before an emergency freeze is required.

Stage 7: Optimizer Health Signals

Stage 7 teams tune the planner with custom statistics and verify partition pruning. These signals catch the statistical failures that cause plan regressions on large or skewed data sets.

Signal	Source	Why it matters	Alert threshold
Statistics freshness	`pg_stat_user_tables.last_autoanalyze` vs. table churn	Stale statistics produce cardinality misestimates and wrong join types.	`last_autoanalyze` older than expected for high-churn tables.
Partition pruning	`EXPLAIN` plans on partitioned tables	Partitioning overhead is only worthwhile if the planner excludes irrelevant partitions.	Plans scanning all partitions for single-partition queries.
Correlated column stats	`pg_stats` histograms; `CREATE STATISTICS` coverage	Correlated columns cause the planner to assume independence and underestimate result sizes.	Persistent underestimate on queries with correlated predicates.

Monitor pg_class.relpages growth rate against pg_class.reltuples. If pages grow faster than tuples, bloat or fillfactor issues are confusing the planner’s cost model.

Stage 8: Predictive & Autotuned Signals

Stage 8 teams automate remediation and predict capacity exhaustion. These signals drive automation rather than human runbooks.

Signal	Source	Why it matters	Alert threshold
Bloat growth rate	`pgstattuple` or `pg_stat_user_tables` trends	Accelerating bloat triggers automated repack before manual intervention is needed.	Growth rate exceeding autovacuum reclaim rate.
Capacity trend	`pg_database_size()`, `pg_wal` growth rate	Predict disk exhaustion weeks in advance.	Projected to hit limit within provisioning lead time.
Query regression	`pg_stat_statements` plan history or `auto_explain` logs	Automated rollback or paging when plans flip after a deploy.	Plan change correlated with latency spike.

At this stage, manual intervention for performance should be rare. When these signals fire, the response is typically an automated scaling action, a scheduled repack, or a deployment hold rather than a 3 a.m. page.

How Netdata Helps

Netdata collects PostgreSQL metrics directly from the pg_stat_* views and correlates them with OS-level CPU, memory, disk, and network telemetry on the same dashboard.

It exposes pg_stat_statements metrics including calls, total time, and mean time, making it easier to spot workload regressions without running manual queries during an incident.
Replication lag, connection utilization, and cache hit ratio are collected with per-second granularity, so you can correlate a query latency spike with a checkpoint I/O burst or autovacuum worker launch.
Netdata alerts on composite conditions such as high dead tuple ratio combined with a stale last_autovacuum, which indicates vacuum is falling behind.
The PostgreSQL collector surfaces pg_stat_replication_slots, flagging inactive slots that retain WAL before they fill the disk.
System-level metrics from the same agent reveal whether a database issue is rooted in the OS or storage layer, reducing misattribution.

The Netdata solution

PostgreSQL monitoring with Netdata

Netdata monitors PostgreSQL with per-second metrics, pre-built dashboards, and ML-powered anomaly detection. Correlate connection saturation, lock waits, autovacuum progress, replication lag, and checkpoint I/O against the rest of your stack so you catch the incidents in these runbooks before they page anyone.

See PostgreSQL monitoring → Start monitoring free