PostgreSQL slow queries: diagnosis from log to plan to fix

A query that returned in 10 ms yesterday is now taking 8 seconds. No deploys, no schema changes. Before adding an index or restarting the database, determine whether the slowness is in the plan, the data, or the environment. This guide covers a three-layer workflow: log-based discovery with log_min_duration_statement, aggregate profiling with pg_stat_statements, and per-query execution plan capture with auto_explain and manual EXPLAIN (ANALYZE, BUFFERS).

flowchart TD
    A[Slow query reported] --> B[Check pg_stat_statements for mean_time and stddev_time]
    B --> C{High stddev relative to mean?}
    C -->|Yes| D[Plan flapping or parameter skew]
    C -->|No| E[Stable plan or system bottleneck]
    D --> F[Capture plan with auto_explain or EXPLAIN]
    E --> F
    F --> G{Estimated rows far from actual?}
    G -->|Yes| H[Stale statistics or skewed data]
    G -->|No| I[Bloat, locks, or cache pressure]

What this means

A slow query is a symptom. PostgreSQL’s planner chooses a path based on statistics from ANALYZE, bound parameter values, and configuration such as work_mem. When these inputs change, the same query text can switch from a hash join to a nested loop and slow down by orders of magnitude.

log_min_duration_statement logs statements that exceed a millisecond threshold after execution completes. It will not log queries that hang indefinitely. pg_stat_statements normalizes queries by replacing literals with placeholders and groups them by queryid. It exposes mean_exec_time, stddev_exec_time, and calls, which show whether slowness is consistent or intermittent. It does not capture execution plans. auto_explain closes that gap by logging the actual plan, including timing and buffer usage, for queries that exceed its threshold.

queryid is computed from the post-parse-analysis representation. Two semantically different queries can collide into one entry due to a hash collision, though this is rare. Identical query text with different search_path contexts produces different queryid values. The hash is stable across minor versions on the same architecture with matching catalog metadata, but is not guaranteed stable across major versions or different architectures.

Common causes

CauseWhat it looks likeFirst thing to check
Stale statistics or skewed dataEXPLAIN shows row estimates far from actual rows; sequential scan on a large table that should use an indexlast_autoanalyze in pg_stat_user_tables for tables in the query
Generic plan regression (prepared statements)Query is fast for the first 5 executions, then suddenly slows; partition pruning stops workingWhether the query uses prepared statements and the value of plan_cache_mode
Table bloat (dead tuples)Sequential scan or index scan touches mostly dead pages; n_dead_tup is growing while last_autovacuum is stalen_dead_tup / (n_live_tup + n_dead_tup) ratio in pg_stat_user_tables
Lock contentionQuery in pg_stat_activity shows wait_event_type = 'Lock'pg_locks for ungranted locks and the blocker PID
Buffer cache pressureEXPLAIN (ANALYZE, BUFFERS) shows high shared read=; cache hit ratio dropspg_stat_database blks_hit / (blks_hit + blks_read)

Quick checks

Run these read-only checks before making any configuration changes.

-- Top queries by total execution time
SELECT queryid, query, calls, total_exec_time, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
-- Sessions waiting on locks
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
-- Dead tuple accumulation for tables in the slow query
-- Replace (...) with the target table names
SELECT schemaname, relname, n_live_tup, n_dead_tup,
       n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
WHERE relname IN (...);
-- Cache hit ratio for the database
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
-- Ungranted locks and blockers
SELECT l.locktype, l.relation::regclass, l.pid, l.mode, l.granted, a.query
FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted;

How to diagnose it

  1. Confirm the database is the bottleneck. Use pg_stat_statements to find the query entry. Compare mean_exec_time and stddev_exec_time. If stddev_exec_time is high relative to mean_exec_time, the same query text is producing bimodal performance. This is the signature of plan flapping or parameter-sensitive skew.

  2. Determine whether the query is waiting or working. Query pg_stat_activity. If wait_event_type is 'Lock', the query is blocked. Find the blocker in pg_locks. Only terminate the blocker with pg_terminate_backend if you accept the risk of aborting that session’s work.

  3. Capture the execution plan. If the query is reproducible, run EXPLAIN (ANALYZE, BUFFERS).

    Warning: EXPLAIN (ANALYZE, BUFFERS) executes the query. For DML statements, wrap it in a transaction and roll back to avoid modifying data, or rely on auto_explain instead. Compare the planner’s estimated row counts to the actual row counts. If estimates are off by more than an order of magnitude, the planner is working with stale or insufficient statistics.

  4. Check statistics freshness. In pg_stat_user_tables, look at last_autoanalyze for every table referenced in the query. If the timestamp predates the last known bulk change, run ANALYZE on those tables and re-check the plan. For correlated columns, consider CREATE STATISTICS to improve cardinality estimates.

  5. Check for plan caching regression. PostgreSQL uses custom plans for the first 5 executions of a prepared statement, then evaluates whether a generic plan is competitive. For queries with skewed data distributions or partitioned tables, the generic plan can be catastrophically worse because it loses per-execution partition pruning. Set plan_cache_mode = force_custom_plan at the session level and re-run the query. If performance recovers, the generic plan is the culprit.

  6. Check for bloat and cache efficiency. In EXPLAIN (ANALYZE, BUFFERS) output, look at shared read= versus shared hit=. A plan dominated by shared read= points to cold cache or a working set larger than shared_buffers. High Heap Fetches in an Index Only Scan indicate dead tuples or a visibility map held back by long-running transactions. At the database level, a sustained drop in cache hit ratio in pg_stat_database confirms cache pressure.

  7. Correlate across time. If pg_stat_statements shows the queryid but you need the exact plan that was slow, use auto_explain logs if enabled. The queryid in pg_stat_statements can be correlated with log entries to link aggregate statistics to a specific plan shape.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
pg_stat_statements.mean_exec_timeBaseline latency per query fingerprintP99 exceeds SLO or spikes relative to previous day
pg_stat_statements.stddev_exec_time / mean_exec_timePlan stability and data skewRatio greater than 1 suggests plan flapping
pg_stat_user_tables.n_dead_tup ratioBloat causing scans to read dead pagesRatio greater than 20% on active tables
pg_stat_database.blks_hit / (blks_hit + blks_read)Cache efficiencyBelow 95% for OLTP workloads
pg_stat_activity.wait_event_type = 'Lock'Time spent waiting, not executingSustained ungranted locks for more than 30 seconds
auto_explain log frequencyNew regressions appearingSudden increase in plans logged for previously fast queries

Fixes

Stale statistics or skewed data

Run ANALYZE on the affected tables. If the planner still underestimates cardinality, increase the per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 500. For correlated columns, create extended statistics with CREATE STATISTICS. Tradeoff: higher targets increase ANALYZE duration and planner memory usage.

Generic plan regression

Set plan_cache_mode = force_custom_plan at the session or database level for queries that suffer from skewed parameters or partitioned tables. This forces parameter-specific planning on every execution. Tradeoff: planning overhead increases CPU consumption, especially for high-throughput queries.

Table bloat

If autovacuum is not keeping up, tune per-table settings. Lower autovacuum_vacuum_scale_factor to 0.01 or 0.05 for large, high-churn tables. For severe existing bloat, use pg_repack to rebuild the table online.

Warning: pg_repack requires roughly 2x the disk space of the target table and a primary key or unique NOT NULL index.

Lock contention

Terminate the blocker with pg_terminate_backend only if it is safe to do so. Set lock_timeout on DDL operations and idle_in_transaction_session_timeout globally to prevent future cascades. Fix application code that holds transactions open while calling external services.

Missing or inefficient index

Add a partial or covering index only if pg_stat_statements shows the query pattern is frequent and the table is large enough to matter. Tradeoff: every additional index slows writes and increases vacuum I/O.

Prevention

  • Enable pg_stat_statements. It requires shared_preload_libraries and a server restart. This is the minimum viable observability for query performance.
  • Set log_min_duration_statement. Configure it to a threshold that captures your tail latency so you have a log trail for ad-hoc investigations.
  • Enable auto_explain. Add it to shared_preload_libraries and set auto_explain.log_min_duration to match your slow-query threshold. Be aware that auto_explain.log_analyze = on adds execution overhead.
  • Monitor stddev_exec_time. A rising standard deviation is often the first signal of a plan regression.
  • Tune autovacuum per table. Keep statistics and dead tuple maps current so the planner and vacuum keep up with the workload.
  • Set timeouts. Use statement_timeout to contain runaway queries and lock_timeout to prevent indefinite lock waits.

How Netdata helps

  • Correlates PostgreSQL query latency with system-level CPU, disk I/O, and memory pressure to distinguish database-level slowness from resource starvation.
  • Tracks pg_stat_statements metrics over time to surface regressions without manual polling of cumulative counters.
  • Monitors cache hit ratio, dead tuple accumulation, and lock wait events alongside query throughput.
  • Alerts on checkpoint spikes and replication lag that can manifest as query slowdowns.
  • Provides historical context for auto_explain and log analysis by overlaying slow query periods with system resource charts.