PostgreSQL monitoring checklist: the signals every production database needs
PostgreSQL exposes hundreds of counters across the pg_stat_* views, yet most production outages trace back to a small set of undetected conditions. Bloat accumulates silently until vacuum cannot catch up. Replication lag grows until failover becomes a data-loss event. Transaction ID age crosses a threshold and the database stops accepting writes. The problem is rarely a lack of metrics. It is knowing which signals to instrument at each stage of operational maturity.
This checklist maps the essential PostgreSQL signals to an eight-stage maturity model. It is not an encyclopedia of every view and column. It is the minimum set of signals that must be visible, alerted, and staffed with a runbook before you can claim a given operational level. If you are watching connection count but not transaction ID age, you are one wraparound emergency away from a reactive firefight.
Use this article to audit your current observability. Identify the highest stage where every signal is monitored with a defined threshold. Close the gaps stage by stage rather than trying to build a perfect dashboard overnight.
Stage 1: Escape reactive mode
Stage 1 teams respond to user complaints and have no telemetry. Before you can monitor signals, you must expose them. The transition enablers are architectural, not metric tweaks.
Enable pg_stat_statements in shared_preload_libraries. Without it, query-level performance is invisible. Configure streaming replication so pg_stat_replication exists. Deploy a connection pooler such as PgBouncer in transaction mode. Without pooling, connection count is a meaningless signal because max_connections becomes a false ceiling. Establish continuous WAL archiving and test a restore. Backups that have never been restored are not backups.
Until these prerequisites are in place, the stages below are theoretical.
Stage 2: Health and capacity signals
Stage 2 teams have dashboards and alerts for survival-level metrics. These signals detect capacity exhaustion and availability risk before applications fail.
| Signal | Source | Why it matters | Alert threshold |
|---|---|---|---|
| Connection utilization | pg_stat_activity count / max_connections | Process-per-connection architecture exhausts memory and CPU. | > 70% of max_connections; critical at > 90%. |
| Replication lag | pg_stat_replication.replay_lag (time) or LSN byte diff | Failover readiness and RPO protection. | > 30 seconds async; > 1 second sync. |
| Cache hit ratio | pg_stat_database.blks_hit / (blks_hit + blks_read) | Working set fit. Low ratio means disk latency dominates. | < 95% is concerning; < 90% is critical for OLTP. |
| Checkpoint pressure | pg_stat_bgwriter.checkpoints_req / checkpoints_timed | Forced checkpoints cause I/O latency spikes. | > 10% of timed checkpoints sustained over 24 hours. |
| Disk usage | pg_wal directory size, data volume | WAL retention from slots or archiving failures fills disks faster than data growth. | > 80% on any PostgreSQL volume. |
Decompose replication lag when it spikes. pg_stat_replication exposes write_lag, flush_lag, and replay_lag. If write_lag grows but replay_lag stays low, the network or WAL sender is the bottleneck. If replay_lag grows while the others stay flat, the replica is slow to apply WAL, often due to disk I/O or a long-running query blocking replay.
Stage 3: Workload quality signals
Stage 3 teams baseline query performance and tune per-table autovacuum. Add these signals to detect plan regressions, bloat, and index inefficiency before they degrade user experience.
| Signal | Source | Why it matters | Alert threshold |
|---|---|---|---|
| Query latency | pg_stat_statements.mean_time / max_time; application percentiles | User-facing performance. Averages hide tail latency. | P99 > 1 second for OLTP; > 5 seconds for analytical queries. |
| Dead tuple ratio | pg_stat_user_tables.n_dead_tup / (n_live_tup + n_dead_tup) | Bloat and vacuum health. High ratio means space is not reusable and scans touch more dead pages. | > 10% is unhealthy; > 20% requires immediate intervention. |
| Plan stability | pg_stat_statements.stddev_time / mean_time | Plan flapping or data skew causes bimodal latency. | > 1 suggests plan instability; > 2 is severe. |
| Index efficiency | pg_stat_user_tables.seq_scan / idx_scan per table | Sequential scans on large tables indicate missing indexes or stale statistics. | High ratio on tables with > 10k rows. |
| Autovacuum lag | pg_stat_user_tables.last_autovacuum vs. n_dead_tup growth | Vacuum must keep up with write rate or bloat becomes irreversible. | n_dead_tup increasing while last_autovacuum is stale. |
Sampling caveat: pg_stat_statements normalizes query text, stripping literal values. Two executions with the same structure but different parameters are aggregated. A stable mean_time with high stddev_time often means the query is fast for most parameters and pathological for one value. Use auto_explain with log_min_duration_statement to capture the actual plan for outlier executions.
Stage 4: Contention signals
Stage 4 teams understand locking and hot standby behavior. Monitor these signals to detect serialization bottlenecks and replica friction.
| Signal | Source | Why it matters | Alert threshold |
|---|---|---|---|
| Lock waits | pg_locks where granted = false | Ungranted locks cascade into latency spikes and deadlocks. | Any waitstart growing beyond deadlock_timeout. |
| Deadlocks | pg_stat_database.deadlocks | Application-level lock ordering bugs. | Any sustained increment; target is zero. |
| Hot standby conflicts | pg_stat_database.conflicts | Replica queries canceled by WAL replay. | > 0 indicates max_standby_streaming_delay is being hit. |
| LWLock contention | pg_stat_activity.wait_event_type = 'LWLock' | Internal bottlenecks from WAL volume or buffer pressure. | Sustained high counts with low throughput. |
Discriminate wait event types carefully. Lock events on user tables indicate application contention, often from UPDATE or DELETE on the same rows. LWLock events indicate internal serialization pressure, buffer mapping contention, or WAL write saturation. High CPU with low query throughput and many LWLock waits means the database is bottlenecked on internal coordination, not query complexity.
Stage 5: Plan stability and deadlock-free operation
Stage 5 teams enforce query plan stability and eliminate deadlocks through application design. The signals here validate that previous tuning has stuck and that the workload is predictable.
| Signal | Source | Why it matters | Alert threshold |
|---|---|---|---|
| Query plan variance | pg_stat_statements.stddev_time / mean_time on critical queries | Stable plans mean predictable latency and throughput. | > 0.5 for critical query fingerprints. |
| Deadlock rate | pg_stat_database.deadlocks | Zero deadlocks is achievable with correct lock ordering and short transactions. | Any non-zero deadlock in a 24-hour window. |
At this stage, latency SLOs should be met consistently, not just on average. If stddev_time rises after a deployment, investigate statistics changes or parameter sniffing before the mean time degrades.
Stage 6: Durability and recovery signals
Stage 6 teams treat backups and WAL as first-class concerns. These signals prevent the silent failures that turn recoverable incidents into data loss.
| Signal | Source | Why it matters | Alert threshold |
|---|---|---|---|
| Transaction ID age | age(datfrozenxid) from pg_database; age(relfrozenxid) per table | XID wraparound stops writes and can force emergency shutdown. | > 500 million (early); > 1 billion (urgent). |
| Multixact age | mxid_age(datminmxid) from pg_database | Separate wraparound mechanism with identical failure mode. | Same thresholds as XID age. |
| WAL archive health | pg_stat_archiver.failed_count | Broken archiving invalidates PITR and creates gaps in the WAL chain. | Any increment. |
| Checksum failures | pg_stat_database.checksum_failures | Silent data corruption from storage or memory. | > 0 on primary or replica. |
| Replication slot lag | pg_replication_slots.restart_lsn vs. current LSN | Inactive slots retain WAL and fill the disk. | Inactive slot with growing LSN diff. |
| Backup restorability | Verified restore test | Untested backups are not backups. | Last successful restore > 30 days ago. |
Note on XID age: PostgreSQL emits warnings at approximately 2.107 billion (40 million remaining) and refuses writes at approximately 3 million remaining. Operational alerts must fire at 500 million so you have weeks to correct autovacuum behavior before an emergency freeze is required.
Stage 7: Optimizer health signals
Stage 7 teams tune the planner with custom statistics and verify partition pruning. These signals catch the statistical failures that cause plan regressions on large or skewed data sets.
| Signal | Source | Why it matters | Alert threshold |
|---|---|---|---|
| Statistics freshness | pg_stat_user_tables.last_autoanalyze vs. table churn | Stale statistics produce cardinality misestimates and wrong join types. | last_autoanalyze older than expected for high-churn tables. |
| Partition pruning | EXPLAIN plans on partitioned tables | Partitioning overhead is only worthwhile if the planner excludes irrelevant partitions. | Plans scanning all partitions for single-partition queries. |
| Correlated column stats | pg_stats histograms; CREATE STATISTICS coverage | Correlated columns cause the planner to assume independence and underestimate result sizes. | Persistent underestimate on queries with correlated predicates. |
Monitor pg_class.relpages growth rate against pg_class.reltuples. If pages grow faster than tuples, bloat or fillfactor issues are confusing the planner’s cost model.
Stage 8: Predictive and autotuned signals
Stage 8 teams automate remediation and predict capacity exhaustion. These signals drive automation rather than human runbooks.
| Signal | Source | Why it matters | Alert threshold |
|---|---|---|---|
| Bloat growth rate | pgstattuple or pg_stat_user_tables trends | Accelerating bloat triggers automated repack before manual intervention is needed. | Growth rate exceeding autovacuum reclaim rate. |
| Capacity trend | pg_database_size(), pg_wal growth rate | Predict disk exhaustion weeks in advance. | Projected to hit limit within provisioning lead time. |
| Query regression | pg_stat_statements plan history or auto_explain logs | Automated rollback or paging when plans flip after a deploy. | Plan change correlated with latency spike. |
At this stage, manual intervention for performance should be rare. When these signals fire, the response is typically an automated scaling action, a scheduled repack, or a deployment hold rather than a 3 a.m. page.
How Netdata helps
Netdata collects PostgreSQL metrics directly from the pg_stat_* views and correlates them with OS-level CPU, memory, disk, and network telemetry on the same dashboard.
- It exposes
pg_stat_statementsmetrics including calls, total time, and mean time, making it easier to spot workload regressions without running manual queries during an incident. - Replication lag, connection utilization, and cache hit ratio are collected with per-second granularity, so you can correlate a query latency spike with a checkpoint I/O burst or autovacuum worker launch.
- Netdata alerts on composite conditions such as high dead tuple ratio combined with a stale
last_autovacuum, which indicates vacuum is falling behind. - The PostgreSQL collector surfaces
pg_stat_replication_slots, flagging inactive slots that retain WAL before they fill the disk. - System-level metrics from the same agent reveal whether a database issue is rooted in the OS or storage layer, reducing misattribution.
Related guides
No related guides are available in this section yet.






