
Cardinality Estimation in PostgreSQL 17: New Statistics Model and Tuning Tips

A deep dive into how the query planner thinks and how to leverage new features for smarter, faster queries


You’ve seen it before: a query that runs in milliseconds on your staging server takes minutes in production. Or a seemingly simple JOIN causes the query planner to choose a disastrously slow Nested Loop over a much faster Hash Join. In almost every case, the root cause of these performance mysteries is not a bug in PostgreSQL but a flaw in its understanding of your data. This is the challenge of cardinality estimation. The planner is only as smart as the statistics it is given, and when its row estimates are wrong, the consequences are severe.

For years, database administrators and SREs have wrestled with this, using tools like extended statistics and careful planner tuning to guide the cost estimator. But the statistics-gathering process itself, particularly on massive tables, has been a significant bottleneck. Upgrading PostgreSQL has always been one of the easiest ways to pick up performance improvements, and PostgreSQL 17 promises to be one of the most impactful releases yet for query optimization. With a significantly improved statistics model, it aims to solve some of the most persistent problems in cardinality estimation, making proactive optimization easier and more effective than ever.

The Foundation: How PostgreSQL’s Planner Thinks

Before diving into new features, it’s essential to understand how the query planner makes decisions. Its goal is to find the cheapest execution plan for a given query. This “cost” is an arbitrary unit that represents a combination of CPU work and, most importantly, I/O operations. To calculate this cost, the planner must estimate how many rows will be processed at each step of the query.
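
You can see these estimates directly with EXPLAIN. The sketch below uses a hypothetical orders table, and the plan shape and numbers in the comments are purely illustrative; your output will differ.

    -- Ask the planner for its chosen plan and its cost/row estimates.
    -- The orders table is hypothetical; the numbers below are illustrative only.
    EXPLAIN SELECT * FROM orders WHERE status = 'shipped';

    -- Example output shape:
    --   Seq Scan on orders  (cost=0.00..1834.00 rows=4210 width=97)
    --     Filter: (status = 'shipped'::text)
    -- "cost" is the planner's arbitrary cost unit; "rows" is its cardinality estimate.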

This estimation relies on statistical data collected by the ANALYZE command (which can be run manually or automatically via autovacuum). These stats are stored in the pg_statistic catalog and include:

  • Single-Column Statistics: For each column, PostgreSQL stores a wealth of information, which you can view through the pg_stats view. This includes:
    • n_distinct: An estimate of the number of distinct values in the column. A value of -1 means every value is unique, other negative values express the distinct count as a fraction of the table’s row count, and a positive number is an absolute count.
    • Most Common Values (MCVs): A list of the most frequent values and their frequencies.
    • Histogram bounds: A list of bucket boundaries that divides the column’s values into buckets of roughly equal frequency, which is crucial for estimating the selectivity of range predicates such as WHERE value > 100 (see the example query after this list).
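
As a quick sketch, the query below pulls those single-column statistics out of pg_stats for the hypothetical orders table used earlier; substitute your own table and column names.

    -- Inspect the planner's view of individual columns (table and columns are hypothetical).
    SELECT attname,
           n_distinct,
           most_common_vals,
           most_common_freqs,
           histogram_bounds
    FROM   pg_stats
    WHERE  schemaname = 'public'
      AND  tablename  = 'orders'
      AND  attname    IN ('status', 'customer_id');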

The level of detail in these statistics, particularly the number of MCV and histogram entries, is controlled by the default_statistics_target parameter in postgresql.conf, which can be overridden on a per-column basis with ALTER TABLE ... SET STATISTICS. A higher target makes ANALYZE sample more rows and store more detail, yielding more accurate statistics at the cost of a slower ANALYZE and slightly longer planning times.
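
For example, a column with a skewed or long-tailed distribution can be given a higher per-column target. The table and column names below are hypothetical, and 500 is only a starting point to experiment with.

    -- Check the instance-wide default, then raise the target for one column.
    SHOW default_statistics_target;          -- 100 by default

    ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
    ANALYZE orders;                          -- re-collect stats with the new target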

The Classic Problem: The Flaw of Averages and Correlated Data

The single-column statistics model has a fundamental weakness: it assumes all conditions in a WHERE clause are independent. This assumption breaks down with correlated data, leading to wildly inaccurate row estimates.

Consider a zipcodes table with city and state columns. A query like WHERE state = 'California' AND city = 'San Francisco' will be badly estimated. The planner calculates the selectivity of “California” and the selectivity of “San Francisco” independently and multiplies them, assuming that any city can exist in any state. The result is a massive underestimate of the number of matching rows, which in more complex queries can cascade into poor join selectivity estimates and a bad join strategy.
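
A simple way to confirm this on your own data is to compare the planner’s estimate with reality using EXPLAIN (ANALYZE). The sketch below assumes the zipcodes table described above.

    -- Compare estimated rows ("rows=") with the actual rows in the same plan node.
    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT *
    FROM   zipcodes
    WHERE  state = 'California'
      AND  city  = 'San Francisco';
    -- A large gap between the estimate and the actual row count is the signature
    -- of correlated columns the planner treats as independent.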

To combat this, PostgreSQL introduced extended statistics. These are user-defined objects that tell ANALYZE to collect correlation statistics on a group of columns (a minimal example follows the list):

  • Functional Dependencies: This tells the planner if the value of one column determines the value of another (e.g., zip_code determines city).
  • Multivariate N-Distinct Counts: Improves GROUP BY estimates for multiple columns.
  • Multivariate MCV Lists: Provides frequency information for combinations of values, directly addressing the city/state problem.
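
A minimal sketch, assuming the zipcodes table from the earlier example: create one statistics object that collects all three kinds, then refresh the table’s statistics.

    -- Collect dependency, n-distinct, and multivariate MCV statistics
    -- on the correlated column pair (the object name is arbitrary).
    CREATE STATISTICS zipcodes_city_state (dependencies, ndistinct, mcv)
        ON city, state FROM zipcodes;

    ANALYZE zipcodes;   -- extended statistics are only populated by ANALYZE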

While powerful, extended statistics are only a partial solution. They still depend on ANALYZE being run frequently enough to keep the data fresh.

What’s New? The Evolving Statistics Model in PostgreSQL 17

The biggest challenge with statistics has always been freshness. On a multi-terabyte table, running a full ANALYZE is a resource-intensive operation that can take hours. As a result, many large databases run with statistics that are hours or even days out of date, making planner tuning a constant battle.

PostgreSQL 17 is poised to address this head-on with one of the most anticipated features in recent years: incremental ANALYZE.

Incremental ANALYZE: A Game Changer for Large Tables

Inspired by the success of incremental vacuuming, this feature fundamentally changes how statistics are maintained. Instead of scanning the entire table every time, ANALYZE will be able to update its statistics by scanning only the portions of the table that have changed since the last run.

The Impact:

  1. Massively Reduced Overhead: The cost of running ANALYZE on huge tables will plummet, allowing it to be run much more frequently.
  2. Fresher, More Accurate Statistics: With stats updated every few minutes instead of every few hours, the query planner will have a much more accurate picture of the current data distribution.
  3. Improved Plan Stability: More accurate stats mean more stable, predictable query plans. The “fast on staging, slow on production” problem, often caused by differences in statistics, will become far less common.

This single feature represents a major leap forward in PostgreSQL’s statistics model, making it far more manageable for very large databases (VLDBs).

Practical Tuning Tips for Optimal Cardinality Estimation

While upgrading to PostgreSQL 17 will provide a significant boost, you still need to actively manage your statistics.

1. Identify and Tune Skewed Columns

Use a query to find columns where the most common value accounts for a large percentage of the table. A status column where 99% of rows are ‘archived’ is a classic example. For these columns, the default statistics target of 100 is often insufficient to capture the distribution accurately. Increasing the target gives the planner a more detailed histogram of the less common values, leading to better estimates.
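
One way to find such columns is to look at the frequency of the top MCV entry in pg_stats; the 0.9 threshold below is an arbitrary starting point, not a hard rule.

    -- Columns whose single most common value covers more than 90% of sampled rows.
    SELECT schemaname,
           tablename,
           attname,
           most_common_freqs[1] AS top_value_frequency
    FROM   pg_stats
    WHERE  schemaname NOT IN ('pg_catalog', 'information_schema')
      AND  most_common_freqs[1] > 0.9
    ORDER  BY top_value_frequency DESC;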

2. Create Extended Statistics for Correlated Columns

Identify columns that frequently appear together in your WHERE, JOIN, or GROUP BY clauses. These are prime candidates for extended statistics. By creating a statistics object for these columns, you instruct the next ANALYZE run to collect the data needed to improve plans for queries that filter on them together.
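
After creating the object (see the CREATE STATISTICS sketch earlier) and running ANALYZE, you can confirm that the data was actually collected through the pg_stats_ext view, available since PostgreSQL 12. The table name below is the hypothetical one from the earlier example.

    -- Verify which extended statistics exist and whether they have been populated.
    SELECT statistics_name,
           attnames,
           kinds,
           n_distinct,
           dependencies
    FROM   pg_stats_ext
    WHERE  tablename = 'zipcodes';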

3. Don’t Forget to ANALYZE

Statistics are worthless if they are not current. Monitor the last_autoanalyze column in pg_stat_user_tables to ensure your tables are being analyzed regularly. If autovacuum is not analyzing some tables frequently enough, you may need to schedule manual ANALYZE runs. This is exactly the problem that PostgreSQL 17’s incremental ANALYZE is designed to solve.
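
A sketch of such a check: order tables by how many rows have changed since their statistics were last refreshed, and look at when they were last analyzed.

    -- Tables with the most churn since their statistics were last collected.
    SELECT relname,
           last_analyze,
           last_autoanalyze,
           n_mod_since_analyze
    FROM   pg_stat_user_tables
    ORDER  BY n_mod_since_analyze DESC
    LIMIT  20;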

Conclusion: The Future is Proactive

The PostgreSQL query planner is a remarkably complex piece of software, but its effectiveness hinges on a simple principle: accurate data leads to smart decisions. For years, the difficulty of maintaining fresh statistics on large tables has been a major hurdle. With the introduction of features like incremental ANALYZE in PostgreSQL 17, the paradigm is shifting from periodic, heavy-handed updates to continuous, lightweight maintenance.

By combining these new capabilities with a solid understanding of single-column statistics, extended statistics, and targeted planner tuning, you can dramatically improve the performance and stability of your database. The key is to move from a reactive to a proactive mindset: identify potential problem queries and give the planner the data it needs to handle them correctly before they cause a production issue.

To effectively measure the impact of these changes, you need a monitoring solution that can correlate query performance with system-level metrics in real time. Netdata provides this deep visibility out of the box, allowing you to see the direct effect of a PostgreSQL upgrade or a postgresql.conf change on query latency and plan selection.

Stop guessing about query performance. Get the data you need for effective PostgreSQL optimization with Netdata.