PostgreSQL slow queries: diagnosis from log to plan to fix
A query that returned in 10 ms yesterday is now taking 8 seconds. No deploys, no schema changes. Before adding an index or restarting the database, determine whether the slowness is in the plan, the data, or the environment. This guide covers a three-layer workflow: log-based discovery with log_min_duration_statement, aggregate profiling with pg_stat_statements, and per-query execution plan capture with auto_explain and manual EXPLAIN (ANALYZE, BUFFERS).
flowchart TD
A[Slow query reported] --> B[Check pg_stat_statements for mean_time and stddev_time]
B --> C{High stddev relative to mean?}
C -->|Yes| D[Plan flapping or parameter skew]
C -->|No| E[Stable plan or system bottleneck]
D --> F[Capture plan with auto_explain or EXPLAIN]
E --> F
F --> G{Estimated rows far from actual?}
G -->|Yes| H[Stale statistics or skewed data]
G -->|No| I[Bloat, locks, or cache pressure]What this means
A slow query is a symptom. PostgreSQL’s planner chooses a path based on statistics from ANALYZE, bound parameter values, and configuration such as work_mem. When these inputs change, the same query text can switch from a hash join to a nested loop and slow down by orders of magnitude.
log_min_duration_statement logs statements that exceed a millisecond threshold after execution completes. It will not log queries that hang indefinitely. pg_stat_statements normalizes queries by replacing literals with placeholders and groups them by queryid. It exposes mean_exec_time, stddev_exec_time, and calls, which show whether slowness is consistent or intermittent. It does not capture execution plans. auto_explain closes that gap by logging the actual plan, including timing and buffer usage, for queries that exceed its threshold.
queryid is computed from the post-parse-analysis representation. Two semantically different queries can collide into one entry due to a hash collision, though this is rare. Identical query text with different search_path contexts produces different queryid values. The hash is stable across minor versions on the same architecture with matching catalog metadata, but is not guaranteed stable across major versions or different architectures.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Stale statistics or skewed data | EXPLAIN shows row estimates far from actual rows; sequential scan on a large table that should use an index | last_autoanalyze in pg_stat_user_tables for tables in the query |
| Generic plan regression (prepared statements) | Query is fast for the first 5 executions, then suddenly slows; partition pruning stops working | Whether the query uses prepared statements and the value of plan_cache_mode |
| Table bloat (dead tuples) | Sequential scan or index scan touches mostly dead pages; n_dead_tup is growing while last_autovacuum is stale | n_dead_tup / (n_live_tup + n_dead_tup) ratio in pg_stat_user_tables |
| Lock contention | Query in pg_stat_activity shows wait_event_type = 'Lock' | pg_locks for ungranted locks and the blocker PID |
| Buffer cache pressure | EXPLAIN (ANALYZE, BUFFERS) shows high shared read=; cache hit ratio drops | pg_stat_database blks_hit / (blks_hit + blks_read) |
Quick checks
Run these read-only checks before making any configuration changes.
-- Top queries by total execution time
SELECT queryid, query, calls, total_exec_time, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
-- Sessions waiting on locks
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
-- Dead tuple accumulation for tables in the slow query
-- Replace (...) with the target table names
SELECT schemaname, relname, n_live_tup, n_dead_tup,
n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
WHERE relname IN (...);
-- Cache hit ratio for the database
SELECT datname,
blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
-- Ungranted locks and blockers
SELECT l.locktype, l.relation::regclass, l.pid, l.mode, l.granted, a.query
FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted;
How to diagnose it
Confirm the database is the bottleneck. Use
pg_stat_statementsto find the query entry. Comparemean_exec_timeandstddev_exec_time. Ifstddev_exec_timeis high relative tomean_exec_time, the same query text is producing bimodal performance. This is the signature of plan flapping or parameter-sensitive skew.Determine whether the query is waiting or working. Query
pg_stat_activity. Ifwait_event_typeis'Lock', the query is blocked. Find the blocker inpg_locks. Only terminate the blocker withpg_terminate_backendif you accept the risk of aborting that session’s work.Capture the execution plan. If the query is reproducible, run
EXPLAIN (ANALYZE, BUFFERS).Warning:
EXPLAIN (ANALYZE, BUFFERS)executes the query. For DML statements, wrap it in a transaction and roll back to avoid modifying data, or rely onauto_explaininstead. Compare the planner’s estimated row counts to the actual row counts. If estimates are off by more than an order of magnitude, the planner is working with stale or insufficient statistics.Check statistics freshness. In
pg_stat_user_tables, look atlast_autoanalyzefor every table referenced in the query. If the timestamp predates the last known bulk change, runANALYZEon those tables and re-check the plan. For correlated columns, considerCREATE STATISTICSto improve cardinality estimates.Check for plan caching regression. PostgreSQL uses custom plans for the first 5 executions of a prepared statement, then evaluates whether a generic plan is competitive. For queries with skewed data distributions or partitioned tables, the generic plan can be catastrophically worse because it loses per-execution partition pruning. Set
plan_cache_mode = force_custom_planat the session level and re-run the query. If performance recovers, the generic plan is the culprit.Check for bloat and cache efficiency. In
EXPLAIN (ANALYZE, BUFFERS)output, look atshared read=versusshared hit=. A plan dominated byshared read=points to cold cache or a working set larger thanshared_buffers. HighHeap Fetchesin anIndex Only Scanindicate dead tuples or a visibility map held back by long-running transactions. At the database level, a sustained drop in cache hit ratio inpg_stat_databaseconfirms cache pressure.Correlate across time. If
pg_stat_statementsshows thequeryidbut you need the exact plan that was slow, useauto_explainlogs if enabled. Thequeryidinpg_stat_statementscan be correlated with log entries to link aggregate statistics to a specific plan shape.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
pg_stat_statements.mean_exec_time | Baseline latency per query fingerprint | P99 exceeds SLO or spikes relative to previous day |
pg_stat_statements.stddev_exec_time / mean_exec_time | Plan stability and data skew | Ratio greater than 1 suggests plan flapping |
pg_stat_user_tables.n_dead_tup ratio | Bloat causing scans to read dead pages | Ratio greater than 20% on active tables |
pg_stat_database.blks_hit / (blks_hit + blks_read) | Cache efficiency | Below 95% for OLTP workloads |
pg_stat_activity.wait_event_type = 'Lock' | Time spent waiting, not executing | Sustained ungranted locks for more than 30 seconds |
auto_explain log frequency | New regressions appearing | Sudden increase in plans logged for previously fast queries |
Fixes
Stale statistics or skewed data
Run ANALYZE on the affected tables. If the planner still underestimates cardinality, increase the per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 500. For correlated columns, create extended statistics with CREATE STATISTICS. Tradeoff: higher targets increase ANALYZE duration and planner memory usage.
Generic plan regression
Set plan_cache_mode = force_custom_plan at the session or database level for queries that suffer from skewed parameters or partitioned tables. This forces parameter-specific planning on every execution. Tradeoff: planning overhead increases CPU consumption, especially for high-throughput queries.
Table bloat
If autovacuum is not keeping up, tune per-table settings. Lower autovacuum_vacuum_scale_factor to 0.01 or 0.05 for large, high-churn tables. For severe existing bloat, use pg_repack to rebuild the table online.
Warning:
pg_repackrequires roughly 2x the disk space of the target table and a primary key or unique NOT NULL index.
Lock contention
Terminate the blocker with pg_terminate_backend only if it is safe to do so. Set lock_timeout on DDL operations and idle_in_transaction_session_timeout globally to prevent future cascades. Fix application code that holds transactions open while calling external services.
Missing or inefficient index
Add a partial or covering index only if pg_stat_statements shows the query pattern is frequent and the table is large enough to matter. Tradeoff: every additional index slows writes and increases vacuum I/O.
Prevention
- Enable
pg_stat_statements. It requiresshared_preload_librariesand a server restart. This is the minimum viable observability for query performance. - Set
log_min_duration_statement. Configure it to a threshold that captures your tail latency so you have a log trail for ad-hoc investigations. - Enable
auto_explain. Add it toshared_preload_librariesand setauto_explain.log_min_durationto match your slow-query threshold. Be aware thatauto_explain.log_analyze = onadds execution overhead. - Monitor
stddev_exec_time. A rising standard deviation is often the first signal of a plan regression. - Tune autovacuum per table. Keep statistics and dead tuple maps current so the planner and vacuum keep up with the workload.
- Set timeouts. Use
statement_timeoutto contain runaway queries andlock_timeoutto prevent indefinite lock waits.
How Netdata helps
- Correlates PostgreSQL query latency with system-level CPU, disk I/O, and memory pressure to distinguish database-level slowness from resource starvation.
- Tracks
pg_stat_statementsmetrics over time to surface regressions without manual polling of cumulative counters. - Monitors cache hit ratio, dead tuple accumulation, and lock wait events alongside query throughput.
- Alerts on checkpoint spikes and replication lag that can manifest as query slowdowns.
- Provides historical context for
auto_explainand log analysis by overlaying slow query periods with system resource charts.
Related guides
- How PostgreSQL actually works in production: a mental model for operators
- PostgreSQL ALTER TABLE blocked: zero-downtime DDL patterns
- PostgreSQL autovacuum blocked by long-running transaction: detection and fix
- PostgreSQL autovacuum not running: detection, causes, and fixes
- PostgreSQL autovacuum tuning: per-table thresholds for high-churn workloads
- PostgreSQL blocking queries: finding the root blocker in a lock cascade
- PostgreSQL checkpoint storms: detection, causes, and tuning
- PostgreSQL: checkpoints are occurring too frequently – what to tune
- PostgreSQL connection exhaustion: detection, diagnosis, and prevention
- PostgreSQL connection refused: pg_hba, listen_addresses, and TCP diagnosis
- PostgreSQL: database is not accepting commands to avoid wraparound data loss
- PostgreSQL dead tuples piling up: why autovacuum can’t keep up






