
Hacking the Postgres Statistics Tables for Faster Queries

Louise Grandjonc Leinweber

13 min read

Postgres does a great job of making queries really efficient. By gathering data in internal statistics tables, Postgres estimates a lot of things before a query is run - like whether an index scan will be better than a sequential scan, or how best to pull data for the WHERE clause.

What Postgres doesn’t know… is how your columns are related to each other. Postgres isn’t a machine learning algorithm. It is not going to learn over time, as you query things, which columns are related and which aren't. It uses the same statistical probabilities regardless of the content of your columns. But don’t get too discouraged. You can help the planner along: you can actually add table statistics and tell Postgres about your data to help with query performance.

In today’s blog, let’s walk through how table statistics work and how you can add statistics for relatedness. This blog is based on a talk given at Postgres Conference Europe, Deep Dive into Postgres Table Statistics; only part of the talk is covered here.

The sample EXPLAIN plans and data in here come from a database of clinical trials, available as a Postgres dump file at https://aact.ctti-clinicaltrials.org.

What statistics are gathered

Postgres gathers statistics on your tables when you run ANALYZE or when autovacuum runs automatically.
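
Statistics can also be refreshed by hand, and pg_stat_user_tables shows when a table was last analyzed, either manually or by autovacuum. A quick sketch using the studies table from the sample database:

-- Refresh statistics for one table (or run plain ANALYZE for the whole database)
ANALYZE ctgov.studies;

-- When were statistics last gathered, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'studies';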

For each column, depending on its type, it will gather the following statistics:

  • Distinct values: An estimate of the number of unique values in a column.
  • Average data width: The typical size of values in a column.
  • Null fraction: The proportion of NULL values in a column.
  • Correlation: Varies from -1 to 1. It describes the correlation between the physical order of your tuples on disk and the value order of this column. For example, if you use a serial bigint for your id, the correlation will be closer to 1 than if you use a uuid.
  • Most common values (MCV) and their frequencies.
  • Histograms: Describe the data distribution outside of the most common values

Postgres provides the pg_stats view, which shows a user-friendly version of the pg_statistic table. The pg_statistic table is optimized for disk space and isn’t easy for users to look at directly.

Let's look at a couple examples of statistics that you can find in the view.

Most common column values (MCV) and their frequency

SELECT * FROM pg_stats WHERE tablename = 'studies' AND attname = 'study_type';
-[ RECORD 1 ]----------+-----------------------------------------------
schemaname             | ctgov
tablename              | studies
attname                | study_type
inherited              | f
null_frac              | 0.0021
avg_width              | 14
n_distinct             | 3
most_common_vals       | {INTERVENTIONAL,OBSERVATIONAL,EXPANDED_ACCESS}
most_common_freqs      | {0.7689667,0.2265,0.0024333333}
histogram_bounds       | (null)
correlation            | 0.64599395
most_common_elems      | (null)
most_common_elem_freqs | (null)
elem_count_histogram   | (null)

Here, the histogram is null because all values are covered in the MCV (it's an enum). The frequency of the value INTERVENTIONAL is 76.9%.
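
If you want to line up each common value with its frequency, you can unnest both arrays side by side. This is just a convenience sketch: most_common_vals is of type anyarray, so it is cast through text first.

SELECT v.value, v.freq
FROM pg_stats,
     unnest(most_common_vals::text::text[], most_common_freqs) AS v(value, freq)
WHERE tablename = 'studies' AND attname = 'study_type'
ORDER BY v.freq DESC;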

Histograms of column value frequency

SELECT * FROM pg_stats WHERE tablename = 'baseline_counts' AND attname = 'count';
schemaname             | ctgov
tablename              | baseline_counts
attname                | count
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 1729
most_common_vals       | {6,3,10,..,94,104}
most_common_freqs      | {0.0328,0.0254,0.023,..,0.0019,0.0019}
histogram_bounds       | {83,84,88,89,95,97,99,107,108,111,113,115,117,118,121,123,125,126,129,132,134,136,139,142,145,147,149,152,155,158,161,163,166,169,173,176,179,182,186,191,195,199,201,204,207,212,216,220,224,229,235,240,245,250,255,261,267,274,280,287,296,301,308,315,324,333,344,352,362,375,388,400,410,424,442,456,478,495,511,530,554,582,607,642,680,722,774,821,884,965,1057,1172,1305,1518,1751,2159,3031,4357,6817,15871,2622164}
correlation            | -0.0021751618
most_common_elems      | (null)
most_common_elem_freqs | (null)
elem_count_histogram   | (null)

A histogram is made of buckets. Each bucket should contain roughly the same percentage of rows. In this specific example, there should be around the same number of rows in baseline_counts with a count between 83 and 84 as with a count between 245 and 250.
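
You can approximate the fraction of rows a single bucket represents from the same pg_stats row: whatever is left after NULLs and MCVs, divided by the number of buckets. A rough sketch:

SELECT (1 - null_frac - (SELECT sum(f) FROM unnest(most_common_freqs) AS f))
       / (array_length(histogram_bounds::text::int[], 1) - 1) AS bucket_fraction
FROM pg_stats
WHERE tablename = 'baseline_counts' AND attname = 'count';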

How Postgres uses statistics in query planning

Postgres uses statistics to estimate query costs and select the most efficient execution plan. It considers:

  • How many rows will be returned
  • The total data size
  • The number of disk pages that need to be scanned

Selectivity is the fraction of rows a query will return. The query planner relies on selectivity estimates to determine, for example, whether to use an index scan or a sequential scan. If a query filters out most rows, an index scan is preferred. If a query returns most rows, a sequential scan is better.

The selectivity of a single clause

For WHERE column = value: if the value is in the MCV list, Postgres uses the stored frequency. If not, it assumes an even distribution of the non-MCV values and estimates selectivity accordingly.
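
To see this in action, you can compare the planner's row estimate for an MCV value with the stored frequency multiplied by the table's row count. A quick sketch (the exact numbers depend on your copy of the data):

-- The estimated rows should be roughly 0.7689667 * reltuples
EXPLAIN SELECT * FROM studies WHERE study_type = 'INTERVENTIONAL';

SELECT reltuples FROM pg_class WHERE relname = 'studies';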

For scalar queries (WHERE column < value or WHERE column > value), the optimizer uses both MCVs and histograms to estimate selectivity.

Postgres will gather the following elements:

  • The sum of all MCV frequencies (sumcommon)
  • The fraction of null values (nullfrac)
  • The MCV selectivity (mcv_select): the sum of the frequencies of the MCVs matching the clause
  • The histogram selectivity (hist_selec): the fraction of histogram buckets matching the clause. To get that, Postgres loops through the histogram and counts the buckets matching the clause; the histogram selectivity is matches / number of buckets

Postgres will then use all of this to calculate the selectivity of our clause.

  1. Initialize selectivity:

select = 1.0 - nullfrac - sumcommon

  2. Merge the histogram selectivity:

select *= hist_selec

  3. Merge the MCV selectivity:

select += mcv_select
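
Here is a rough, hand-rolled version of that math for a clause like count < 100 on the baseline_counts table, built only from pg_stats. This is a sketch for illustration: the real code (scalarineqsel() in selfuncs.c) also interpolates inside the bucket that contains the boundary value, so the numbers will not match exactly.

WITH s AS (
  SELECT null_frac,
         most_common_vals::text::int[] AS mcv,
         most_common_freqs             AS mcf,
         histogram_bounds::text::int[] AS hist
  FROM pg_stats
  WHERE tablename = 'baseline_counts' AND attname = 'count'
), mcv_part AS (
  -- mcv_select: frequency of MCVs matching the clause; sumcommon: all MCV frequencies
  SELECT coalesce(sum(f) FILTER (WHERE v < 100), 0) AS mcv_select,
         sum(f) AS sumcommon
  FROM s, unnest(s.mcv, s.mcf) AS u(v, f)
), hist_part AS (
  -- hist_selec: fraction of histogram buckets below the constant (no interpolation)
  SELECT (count(*) FILTER (WHERE b < 100))::float / (count(*) - 1) AS hist_selec
  FROM s, unnest(s.hist) AS b
)
SELECT (1.0 - s.null_frac - m.sumcommon) * h.hist_selec + m.mcv_select AS selectivity
FROM s, mcv_part m, hist_part h;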

Combining clauses

Most queries have more than one WHERE clause. The planner estimates selectivity for each column separately and multiplies them.

Let's look at this query:

SELECT * FROM studies WHERE phase = 'PHASE1' AND brief_title ILIKE '%diabetes%';

Merging by multiplying assumes that, out of the 8.5% of studies in phase 1, 2% have diabetes in their title, so the combined selectivity is 0.085 × 0.02 = 0.0017, or about 0.17% of rows.

By default, Postgres assumes columns are independent when estimating query results. This can lead to inaccurate estimates.
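
One way to observe this assumption is to compare the estimate for each clause on its own with the estimate for both combined; under independence, the combined estimate is roughly the product of the two selectivities. A sketch against the studies table:

EXPLAIN SELECT * FROM studies WHERE phase = 'PHASE1';
EXPLAIN SELECT * FROM studies WHERE brief_title ILIKE '%diabetes%';
EXPLAIN SELECT * FROM studies WHERE phase = 'PHASE1' AND brief_title ILIKE '%diabetes%';
-- combined rows ~= (rows for clause 1 / total) * (rows for clause 2 / total) * total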

For example, in this query:

SELECT nct_id, name FROM facilities WHERE city = 'Lyon' AND country = 'France';

Postgres multiplies the selectivity of city = 'Lyon' by the selectivity of country = 'France' as if they were independent. However, 100% of cities named Lyon are in France (actually, there is a Lyon, Texas, but it isn't relevant to this specific database), so this approach underestimates the row count, leading to a bad query plan.

Extended statistics

You can manually force Postgres to link two columns. Postgres supports extended statistics, including:

  • Functional dependencies
  • Multivariate distinct counts
  • Multivariate most common values

Creating statistics works in two parts:

  • Adding the statistic on the columns and table through CREATE STATISTICS
  • Running a table ANALYZE

Dependencies

A functional dependency describes a dependency between two columns.

It can exist because there is a relationship between them (city and country, for example) or because the values of the two columns vary together (column a = column b + 1).

Extended Statistics Examples - Dependencies

EXPLAIN ANALYZE SELECT nct_id, name FROM facilities WHERE city = 'Lyon' AND country = 'France';
                                                                  QUERY PLAN
---------------
 Index Scan using index_facilities_on_city on facilities  (cost=0.43..375.61 rows=19 width=47) (actual time=1.708..1839.398 rows=5816 loops=1)
   Index Cond: ((city)::text = 'Lyon'::text)
   Filter: ((country)::text = 'France'::text)
 Planning Time: 0.097 ms
 Execution Time: 1840.283 ms
(5 rows)

In the explain plan, you might notice that the estimated rows was 19, and the actual number was 5816. This caused Postgres to pick a less efficient plan.

Here is how it calculated the selectivity:

  • rows=19
  • selectivity France: 0.060533334 (6%). France is in the MCV, so Postgres uses its stored frequency.
  • selectivity Lyon: 0.000103 (0.01%). Lyon is not in the MCV, so Postgres assumes the same selectivity as for any other city outside the MCV list.
  • reltuples: 3132540

3132540 × 0.060533334 × 0.000103 ≈ 19

Postgres makes two mistakes:

  • Assuming that the cities outside of the MCV are evenly distributed
  • Assuming that only 6% of cities named Lyon are in France

Let's manually add extended statistics:

CREATE STATISTICS (dependencies) ON country, city FROM facilities;
ANALYZE facilities;
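
Once ANALYZE has run, the pg_stats_ext view shows what the new statistics object learned; the dependencies column lists the degree of functional dependency between the columns:

SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename = 'facilities';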

Re-run the EXPLAIN plan

EXPLAIN ANALYZE SELECT nct_id, name FROM facilities WHERE city = 'Lyon' AND country =
'France';
                                                                       QUERY PLAN
------------------------
 Bitmap Heap Scan on facilities  (cost=1586.13..1985.73 rows=5845 width=47) (actual time=13.080..18.711 rows=5816 loops=1)
   Recheck Cond: (((city)::text = 'Lyon'::text) AND ((country)::text = 'France'::text))
   Heap Blocks: exact=4772
   ->  BitmapAnd  (cost=1586.13..1586.13 rows=362 width=0) (actual time=12.436..12.437 rows=0 loops=1)
         ->  Bitmap Index Scan on index_facilities_on_city  (cost=0.00..56.23 rows=6414 width=0) (actual time=0.431..0.431 rows=5816 loops=1)
               Index Cond: ((city)::text = 'Lyon'::text)
         ->  Bitmap Index Scan on index_facilities_on_country  (cost=0.00..1526.73 rows=180773 width=0) (actual time=11.802..11.802 rows=185561 loops=1)
               Index Cond: ((country)::text = 'France'::text)
 Planning Time: 0.316 ms
 Execution Time: 18.974 ms
(10 rows)

So from 1840.283 ms to 18.974 ms, that’s 97x faster 🔥!

Now that the estimated rows and the actual rows are close, Postgres was able to pick the proper scan and handle this query more efficiently, as can be seen in the execution time.

Ndistinct

By default, Postgres assumes that the distinct count of grouped columns is independent. This can lead to incorrect cardinality estimates when the columns are actually correlated.

Correlated columns are common in parent-child data sets, like here where category and title are correlated and each category only has a limited set of title values.

When you create statistics using ndistinct, Postgres collects and stores information about how many distinct values exist in a combination of columns. This helps improve the planner's estimates for queries that involve GROUP BY, DISTINCT, or filtering conditions across multiple columns.

To improve that you can do:

CREATE STATISTICS (ndistinct) on category, title FROM baseline_measurements;
ANALYZE baseline_measurements;

Before the statistics

EXPLAIN ANALYZE SELECT category, title, SUM(number_analyzed)
FROM baseline_measurements
WHERE category IS NOT NULL
GROUP BY category, title
ORDER BY 3 DESC
LIMIT 10;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=210450.73..210450.76 rows=10 width=44) (actual time=907.290..907.311 rows=10 loops=1)
   ->  Sort  (cost=210450.73..211027.87 rows=230854 width=44) (actual time=907.288..907.310 rows=10 loops=1)
         Sort Key: (sum(number_analyzed)) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Finalize GroupAggregate  (cost=138119.41..205462.06 rows=230854 width=44) (actual time=609.776..903.558 rows=38379 loops=1)
               Group Key: category, title
               ->  Gather Merge  (cost=138119.41..199690.71 rows=461708 width=44) (actual time=609.770..895.979 rows=42241 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Partial GroupAggregate  (cost=137119.39..145398.13 rows=230854 width=44) (actual time=589.234..763.160 rows=14080 loops=3)
                           Group Key: category, title
                           ->  Sort  (cost=137119.39..138611.94 rows=597020 width=40) (actual time=589.220..724.301 rows=482252 loops=3)
                                 Sort Key: category, title
                                 Sort Method: external merge  Disk: 24272kB
                                 Worker 0:  Sort Method: external merge  Disk: 22456kB
                                 Worker 1:  Sort Method: external merge  Disk: 24128kB
                                 ->  Parallel Seq Scan on baseline_measurements  (cost=0.00..63515.52 rows=597020 width=40) (actual time=0.051..98.694 rows=482252 loops=3)
                                       Filter: (category IS NOT NULL)
                                       Rows Removed by Filter: 287745
 Planning Time: 0.202 ms
 Execution Time: 908.366 ms
(21 rows)

After

EXPLAIN ANALYZE SELECT category, title, SUM(number_analyzed)
FROM baseline_measurements
WHERE category IS NOT NULL
GROUP BY category, title
ORDER BY 3 DESC
LIMIT 10;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=71080.18..71080.21 rows=10 width=43) (actual time=341.044..341.069 rows=10 loops=1)
   ->  Sort  (cost=71080.18..71094.80 rows=5849 width=43) (actual time=341.043..341.067 rows=10 loops=1)
         Sort Key: (sum(number_analyzed)) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Finalize GroupAggregate  (cost=69442.70..70953.79 rows=5849 width=43) (actual time=262.366..337.346 rows=38379 loops=1)
               Group Key: category, title
               ->  Gather Merge  (cost=69442.70..70807.56 rows=11698 width=43) (actual time=262.239..330.513 rows=42726 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Sort  (cost=68442.68..68457.30 rows=5849 width=43) (actual time=258.648..259.082 rows=14242 loops=3)
                           Sort Key: category, title
                           Sort Method: quicksort  Memory: 1475kB
                           Worker 0:  Sort Method: quicksort  Memory: 1466kB
                           Worker 1:  Sort Method: quicksort  Memory: 1467kB
                           ->  Partial HashAggregate  (cost=68018.21..68076.71 rows=5849 width=43) (actual time=170.306..171.412 rows=14242 loops=3)
                                 Group Key: category, title
                                 Batches: 1  Memory Usage: 2577kB
                                 Worker 0:  Batches: 1  Memory Usage: 2577kB
                                 Worker 1:  Batches: 1  Memory Usage: 2577kB
                                 ->  Parallel Seq Scan on baseline_measurements  (cost=0.00..63511.45 rows=600902 width=39) (actual time=0.042..88.588 rows=482252 loops=3)
                                       Filter: (category IS NOT NULL)
                                       Rows Removed by Filter: 287745
 Planning Time: 0.205 ms
 Execution Time: 341.418 ms
(24 rows)

Here, by creating statistics, Postgres was able to pick an in-memory HashAggregate, which made the query 3 times faster.
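
If you are curious what the planner now knows, pg_stats_ext also exposes the combined distinct count for the column group:

SELECT statistics_name, attnames, n_distinct
FROM pg_stats_ext
WHERE tablename = 'baseline_measurements';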

Multivariate MCV

MCV (Most Common Values) statistics help Postgres optimize query planning by tracking the most frequently occurring values in one or more columns. These statistics improve the accuracy of selectivity estimates, particularly for queries with filters like WHERE column = value.

By default, Postgres automatically collects MCV statistics for individual columns, but you can manually add multivariate MCV statistics for correlated columns. A multivariate MCV list captures the most frequently occurring combinations of values (not just individual values). This helps the planner make better row count estimates when filtering on these columns.

EXPLAIN ANALYZE
SELECT nct_id, organ_system, adverse_event_term,
frequency_threshold
FROM reported_events
WHERE organ_system = 'Respiratory, thoracic and mediastinal disorders'
AND adverse_event_term = 'Hypoxia'
ORDER BY frequency_threshold DESC
LIMIT 10;
                                                                            QUERY PLAN
-----------------------------------------------------
 Limit  (cost=69.99..70.01 rows=10 width=63) (actual time=16.643..16.645 rows=10 loops=1)
   ->  Sort  (cost=69.99..70.14 rows=60 width=63) (actual time=16.642..16.643 rows=10 loops=1)
         Sort Key: frequency_threshold DESC
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using reported_events_organ_system_adverse_event_term_idx on reported_events  (cost=0.56..68.69 rows=60 width=63) (actual time=0.026..12.896 rows=18361 loops=1)
               Index Cond: (((organ_system)::text = 'Respiratory, thoracic and mediastinal disorders'::text) AND ((adverse_event_term)::text = 'Hypoxia'::text))
 Planning Time: 0.111 ms
 Execution Time: 16.666 ms
(8 rows)

Create statistics

CREATE STATISTICS (mcv) on organ_system, adverse_event_term FROM reported_events;
ANALYZE reported_events;
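
If you want to peek at the multivariate MCV list itself, the pg_mcv_list_items() function shown in the Postgres documentation expands it. Since the statistics object above was created without a name, this sketch filters by table instead; note that reading pg_statistic_ext_data usually requires elevated privileges:

SELECT m.values, m.frequency, m.base_frequency
FROM pg_statistic_ext s
JOIN pg_statistic_ext_data d ON d.stxoid = s.oid,
     pg_mcv_list_items(d.stxdmcv) AS m
WHERE s.stxrelid = 'reported_events'::regclass
ORDER BY m.frequency DESC
LIMIT 5;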

After

EXPLAIN ANALYZE
SELECT nct_id, organ_system, adverse_event_term,
frequency_threshold
FROM reported_events
WHERE organ_system = 'Respiratory, thoracic and mediastinal disorders'
AND adverse_event_term = 'Hypoxia'
ORDER BY frequency_threshold DESC
LIMIT 10;
                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19198.72..19198.75 rows=10 width=63) (actual time=16.500..16.502 rows=10 loops=1)
   ->  Sort  (cost=19198.72..19241.43 rows=17082 width=63) (actual time=16.499..16.500 rows=10 loops=1)
         Sort Key: frequency_threshold DESC
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using reported_events_organ_system_adverse_event_term_idx on reported_events  (cost=0.56..18829.59 rows=17082 width=63) (actual time=0.026..12.844 rows=18361 loops=1)
               Index Cond: (((organ_system)::text = 'Respiratory, thoracic and mediastinal disorders'::text) AND ((adverse_event_term)::text = 'Hypoxia'::text))
 Planning Time: 0.152 ms
 Execution Time: 16.524 ms
(8 rows)

In this example, adding statistics didn’t change the query plan or improve the query time. It did bring the expected number of rows in the index scan from 60 to 17082, much closer to the actual 18361 rows.

It’s a good example of a case where inaccurate statistics didn’t hurt your performance. Choosing which statistics are important might just be part of your adventure. You might need to test different approaches, and look at your query plan.

Choosing between extended statistics

The type of extended statistics that you will create depends on the operation that you use:

  • If you only use =, use dependencies
  • If you have GROUP BYs, you will need ndistinct
  • If you use scalar operators, you need an MCV list.

Limitations of extended statistics

Because histograms aren’t supported in extended statistics, they are only accurate:

  • For MCVs
  • If the rest of your dataset is evenly distributed

If Postgres cannot use extended statistics, it will fall back to the default per-column statistics.

You can improve this by increasing the statistics target, either for a single column or for your entire database (a short sketch follows the list below).

  • default_statistics_target: default is 100, can go from 1 to 10000.
  • ALTER TABLE facilities ALTER COLUMN city SET STATISTICS 200;
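
Keep in mind that a new target only takes effect at the next ANALYZE, and a global change needs a configuration reload. A quick sketch (500 and 200 here are just example values):

ALTER TABLE facilities ALTER COLUMN city SET STATISTICS 500;
ANALYZE facilities;

-- raise the global default instead (requires appropriate privileges)
ALTER SYSTEM SET default_statistics_target = 200;
SELECT pg_reload_conf();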

Altering statistics targets can improve things, but for very rare values you'll still see differences between the estimated and actual number of rows. It is not recommended to just crank the target up to the maximum: that would make ANALYZE and autovacuum slow and expensive. It might be okay to live with this; often we query common values more than rare ones, so you can decide that slightly slower queries on uncommon values are acceptable.

To go further with table statistics

A lot more is covered in the slides for "A Deep Dive into Statistics", presented at last year’s PGConfEU 2024, including quite a bit more about the specific algorithms Postgres uses to compute selectivity, statistics, and more.

If you are interested in learning more about this through reading the source code, here is where you can start:

  • Algorithms to compute the selectivity of each clause: src/backend/utils/adt/selfuncs.c
  • How the optimizer combines selectivities for AND/OR/JOIN: src/backend/optimizer/path/clausesel.c
  • Calculating the cost: src/backend/optimizer/path/costsize.c

Conclusion

Postgres keeps detailed table statistics to track the most common values and other details about column data. These statistics are used by the query planner, so they can directly impact query performance.

Postgres users can add extended statistics and explicitly tell Postgres about functional dependencies and correlations between columns. In some cases, these can dramatically improve query planning and performance.

With just a few lines of SQL, you can help the planner make smarter choices, reduce execution time, and get the most out of your database. As always, testing and evaluating with EXPLAIN is the best way to experiment and confirm additional table statistics.