Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

Data Skews in Postgres

Avatar for Elizabeth Christensen

Elizabeth Christensen

4 min read

We recently gave a talk at SCaLE (Southern California Linux Expo) about common problems and solutions for managing large Postgres databases. One of the topics we covered was data skewing and partial indexing. This piqued some conference discussion afterwards so we wanted to do a deeper dive.

Skewed data is when your data is kind of bunched up - essentially it is not evenly distributed. You might have one really large customer with a customer id that takes up more than half the rows in your events table. Or a default value that gets created and many of the values in a certain column represent defaults. If you graphed table data, skewed data just means that data would not appear in a symmetrical distribution, it would be unevenly distributed.

Under the hood, Postgres knows what kind of data you have in your database and uses that information to create query plans and when to use indexes. In some cases, skewed data will result in a situation where Postgres is not using an index - thus making some queries less efficient.

As a general rule, Postgres generally doesn't use an index if a single value is greater than 30% of the total data. So skewed data can nullify an index in cases where you’re using a single or multi-column index and one of your columns has skewed data.

Finding skewed data in Postgres

The pg_statistics table has all kinds of information about the data inside your table. This is used by Postgres when choosing a query plan, which you can see when using EXPLAIN on a given query.

Looking in pg_statistics, we want to find cases where certain values or attributes are a larger percentage of value. Here’s a query for getting table name, column name, and the attributes with their percentage of common value.

SELECT starelid::regclass AS table_name,attname AS column_name,
(SELECT string_agg('',format(E'\'%s\': %s%%\n', v,ROUND(n::numeric*100, 2)))
FROM unnest(stanumbers1,stavalues1::text::text[])nvs(n,v)) pcts
FROM pg_statistic
JOIN pg_attribute ON attrelid=starelid
AND attnum = staattnum
JOIN pg_class ON attrelid = pg_class.oid
JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
WHERE nspname NOT LIKE 'pg_%' AND nspname <> 'information_schema'
\x\g\x

This query is going to review all of your table data across the entire dataset and internal tables, so probably wise to do a \o and get in in a file to review.

Let’s say we run this against a sample database of weather which tracking events by year and month and we get results like this:

table_name  | weather
column_name | begin_yearmonth
pcts        | '202205': 30.42%
            | '202204': 10.92%
            | '202207': 8.33%
            | '202208': 6.75%
            | '202201': 4.58%

It looks like May of 2022 happens to have a huge amount of data in it, and more than 30%. So we know a regular index on the begin_yearmonth column won’t be effective. Data is skewed in this table toward that month.

If you want a query to just look at tables where data columns are greater than 30%, here’s that same query above with a filter:

SELECT starelid::regclass AS table_name,attname AS column_name,
(SELECT string_agg('',format(E'\'%s\': %s%%\n', v,ROUND(n::numeric*100, 2)))
FROM unnest(stanumbers1,stavalues1::text::text[])nvs(n,v)) pcts
FROM pg_statistic
JOIN pg_attribute ON attrelid=starelid
AND attnum = staattnum
JOIN pg_class ON attrelid = pg_class.oid
WHERE stanumbers1[1] >= .3 and relname not like 'pg_%'
\x\g\x

Partial indexing for skewed data

Luckily there’s a really easy fix for situations like this: partial indexing. If have already indexed your column, you’ll can just leave the original index in place and add a second more granular partial index for queries that involved your skewed data. That way Postgres can index the skewed data as a separate set.

So your main index for the weather table would look like this:

CREATE INDEX idx_weather_yearmonth
ON weather(begin_yearmonth);

And the partial index would look like this:

CREATE INDEX idx_weather_yearmonth_202205
ON weather(begin_yearmonth)
WHERE begin_yearmonth = '202205';

Partial indexing can be used with multi-columns as well.

Summary

If you’ve got a growing data set and are periodically looking at query performance, checking for skewed data is a good idea.

  • use these queries to see how your data is distributed by column
  • add partial indexing for things that are already indexed or will be and have more than ~30% of the representation

co-authored with David Christensen