$ guides / postgres / postgres-slow-queries-diagnosis ▌

Operations Guides

PostgreSQL Slow Queries: Diagnosis From Log To Plan To Fix

A query that returned in 10 ms yesterday is now taking 8 seconds. No deploys, no schema changes. Before adding an index or restarting the database, determine whether the slowness is in the plan, the data, or the environment. This guide covers a three-layer workflow: log-based discovery with log_min_duration_statement, aggregate profiling with pg_stat_statements, and per-query execution plan capture with auto_explain and manual EXPLAIN (ANALYZE, BUFFERS).

flowchart TD
    A[Slow query reported] --> B[Check pg_stat_statements for mean_time and stddev_time]
    B --> C{High stddev relative to mean?}
    C -->|Yes| D[Plan flapping or parameter skew]
    C -->|No| E[Stable plan or system bottleneck]
    D --> F[Capture plan with auto_explain or EXPLAIN]
    E --> F
    F --> G{Estimated rows far from actual?}
    G -->|Yes| H[Stale statistics or skewed data]
    G -->|No| I[Bloat, locks, or cache pressure]

What This Means

A slow query is a symptom. PostgreSQL’s planner chooses a path based on statistics from ANALYZE, bound parameter values, and configuration such as work_mem. When these inputs change, the same query text can switch from a hash join to a nested loop and slow down by orders of magnitude.

log_min_duration_statement logs statements that exceed a millisecond threshold after execution completes. It will not log queries that hang indefinitely. pg_stat_statements normalizes queries by replacing literals with placeholders and groups them by queryid. It exposes mean_exec_time, stddev_exec_time, and calls, which show whether slowness is consistent or intermittent. It does not capture execution plans. auto_explain closes that gap by logging the actual plan, including timing and buffer usage, for queries that exceed its threshold.

queryid is computed from the post-parse-analysis representation. Two semantically different queries can collide into one entry due to a hash collision, though this is rare. Identical query text with different search_path contexts produces different queryid values. The hash is stable across minor versions on the same architecture with matching catalog metadata, but is not guaranteed stable across major versions or different architectures.

Common Causes

Cause	What it looks like	First thing to check
Stale statistics or skewed data	`EXPLAIN` shows row estimates far from actual rows; sequential scan on a large table that should use an index	`last_autoanalyze` in `pg_stat_user_tables` for tables in the query
Generic plan regression (prepared statements)	Query is fast for the first 5 executions, then suddenly slows; partition pruning stops working	Whether the query uses prepared statements and the value of `plan_cache_mode`
Table bloat (dead tuples)	Sequential scan or index scan touches mostly dead pages; `n_dead_tup` is growing while `last_autovacuum` is stale	`n_dead_tup / (n_live_tup + n_dead_tup)` ratio in `pg_stat_user_tables`
Lock contention	Query in `pg_stat_activity` shows `wait_event_type = 'Lock'`	`pg_locks` for ungranted locks and the blocker PID
Buffer cache pressure	`EXPLAIN (ANALYZE, BUFFERS)` shows high `shared read=`; cache hit ratio drops	`pg_stat_database` `blks_hit / (blks_hit + blks_read)`

Quick Checks

Run these read-only checks before making any configuration changes.

-- Top queries by total execution time
SELECT queryid, query, calls, total_exec_time, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;

-- Sessions waiting on locks
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- Dead tuple accumulation for tables in the slow query
-- Replace (...) with the target table names
SELECT schemaname, relname, n_live_tup, n_dead_tup,
       n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
WHERE relname IN (...);

-- Cache hit ratio for the database
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Ungranted locks and blockers
SELECT l.locktype, l.relation::regclass, l.pid, l.mode, l.granted, a.query
FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted;

How To Diagnose It

Confirm the database is the bottleneck. Use pg_stat_statements to find the query entry. Compare mean_exec_time and stddev_exec_time. If stddev_exec_time is high relative to mean_exec_time, the same query text is producing bimodal performance. This is the signature of plan flapping or parameter-sensitive skew.
Determine whether the query is waiting or working. Query pg_stat_activity. If wait_event_type is 'Lock', the query is blocked. Find the blocker in pg_locks. Only terminate the blocker with pg_terminate_backend if you accept the risk of aborting that session’s work.
Capture the execution plan. If the query is reproducible, run EXPLAIN (ANALYZE, BUFFERS).
Warning: EXPLAIN (ANALYZE, BUFFERS) executes the query. For DML statements, wrap it in a transaction and roll back to avoid modifying data, or rely on auto_explain instead. Compare the planner’s estimated row counts to the actual row counts. If estimates are off by more than an order of magnitude, the planner is working with stale or insufficient statistics.
Check statistics freshness. In pg_stat_user_tables, look at last_autoanalyze for every table referenced in the query. If the timestamp predates the last known bulk change, run ANALYZE on those tables and re-check the plan. For correlated columns, consider CREATE STATISTICS to improve cardinality estimates.
Check for plan caching regression. PostgreSQL uses custom plans for the first 5 executions of a prepared statement, then evaluates whether a generic plan is competitive. For queries with skewed data distributions or partitioned tables, the generic plan can be catastrophically worse because it loses per-execution partition pruning. Set plan_cache_mode = force_custom_plan at the session level and re-run the query. If performance recovers, the generic plan is the culprit.
Check for bloat and cache efficiency. In EXPLAIN (ANALYZE, BUFFERS) output, look at shared read= versus shared hit=. A plan dominated by shared read= points to cold cache or a working set larger than shared_buffers. High Heap Fetches in an Index Only Scan indicate dead tuples or a visibility map held back by long-running transactions. At the database level, a sustained drop in cache hit ratio in pg_stat_database confirms cache pressure.
Correlate across time. If pg_stat_statements shows the queryid but you need the exact plan that was slow, use auto_explain logs if enabled. The queryid in pg_stat_statements can be correlated with log entries to link aggregate statistics to a specific plan shape.

Metrics & Signals To Monitor

Signal	Why it matters	Warning sign
`pg_stat_statements.mean_exec_time`	Baseline latency per query fingerprint	P99 exceeds SLO or spikes relative to previous day
`pg_stat_statements.stddev_exec_time / mean_exec_time`	Plan stability and data skew	Ratio greater than 1 suggests plan flapping
`pg_stat_user_tables.n_dead_tup` ratio	Bloat causing scans to read dead pages	Ratio greater than 20% on active tables
`pg_stat_database.blks_hit / (blks_hit + blks_read)`	Cache efficiency	Below 95% for OLTP workloads
`pg_stat_activity.wait_event_type = 'Lock'`	Time spent waiting, not executing	Sustained ungranted locks for more than 30 seconds
`auto_explain` log frequency	New regressions appearing	Sudden increase in plans logged for previously fast queries

Fixes

Stale Statistics Or Skewed Data

Run ANALYZE on the affected tables. If the planner still underestimates cardinality, increase the per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 500. For correlated columns, create extended statistics with CREATE STATISTICS. Tradeoff: higher targets increase ANALYZE duration and planner memory usage.

Generic Plan Regression

Set plan_cache_mode = force_custom_plan at the session or database level for queries that suffer from skewed parameters or partitioned tables. This forces parameter-specific planning on every execution. Tradeoff: planning overhead increases CPU consumption, especially for high-throughput queries.

Table Bloat

If autovacuum is not keeping up, tune per-table settings. Lower autovacuum_vacuum_scale_factor to 0.01 or 0.05 for large, high-churn tables. For severe existing bloat, use pg_repack to rebuild the table online.

Warning: pg_repack requires roughly 2x the disk space of the target table and a primary key or unique NOT NULL index.

Lock Contention

Terminate the blocker with pg_terminate_backend only if it is safe to do so. Set lock_timeout on DDL operations and idle_in_transaction_session_timeout globally to prevent future cascades. Fix application code that holds transactions open while calling external services.

Missing Or Inefficient Index

Add a partial or covering index only if pg_stat_statements shows the query pattern is frequent and the table is large enough to matter. Tradeoff: every additional index slows writes and increases vacuum I/O.

Prevention

Enable pg_stat_statements. It requires shared_preload_libraries and a server restart. This is the minimum viable observability for query performance.
Set log_min_duration_statement. Configure it to a threshold that captures your tail latency so you have a log trail for ad-hoc investigations.
Enable auto_explain. Add it to shared_preload_libraries and set auto_explain.log_min_duration to match your slow-query threshold. Be aware that auto_explain.log_analyze = on adds execution overhead.
Monitor stddev_exec_time. A rising standard deviation is often the first signal of a plan regression.
Tune autovacuum per table. Keep statistics and dead tuple maps current so the planner and vacuum keep up with the workload.
Set timeouts. Use statement_timeout to contain runaway queries and lock_timeout to prevent indefinite lock waits.

How Netdata Helps

Correlates PostgreSQL query latency with system-level CPU, disk I/O, and memory pressure to distinguish database-level slowness from resource starvation.
Tracks pg_stat_statements metrics over time to surface regressions without manual polling of cumulative counters.
Monitors cache hit ratio, dead tuple accumulation, and lock wait events alongside query throughput.
Alerts on checkpoint spikes and replication lag that can manifest as query slowdowns.
Provides historical context for auto_explain and log analysis by overlaying slow query periods with system resource charts.

The Netdata solution

PostgreSQL monitoring with Netdata

Netdata monitors PostgreSQL with per-second metrics, pre-built dashboards, and ML-powered anomaly detection. Correlate connection saturation, lock waits, autovacuum progress, replication lag, and checkpoint I/O against the rest of your stack so you catch the incidents in these runbooks before they page anyone.

See PostgreSQL monitoring → Start monitoring free