Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more
Tutorial Instructions
Today I’m going to walk through working with tags in Postgres with a sample database of 🐈 cats and their attributes
For these tests, we will use a very simple table of 🐈 cats
, our entity of interest, and tags
a short table of ten tags for the cats. In between the two tables, the relationship between tags and cats is stored in the cat_tags
table.
Table Creation SQL
CREATE TABLE cats (
cat_id serial primary key,
cat_name text not null
);
CREATE TABLE cat_tags (
cat_id integer not null,
tag_id integer not null,
unique(cat_id, tag_id)
);
CREATE TABLE tags (
tag_id serial primary key,
tag_name text not null,
unique(tag_name)
);
I filled the tables with over 1.7M entries for the cats
, ten entries for the tags
, and 4.7M entries for the cat/tag relationship.
Data Generation SQL
-- Generate cats random names
INSERT INTO cats (cat_name)
WITH
hon AS (
SELECT *
FROM unnest(ARRAY['mr', 'ms', 'miss', 'doctor', 'frau', 'fraulein', 'missus', 'governer']) WITH ORDINALITY AS hon(n, i)
),
fn AS (
SELECT *
FROM unnest(ARRAY['flopsy', 'mopsey', 'whisper', 'fluffer', 'tigger', 'softly']) WITH ORDINALITY AS fn(n, i)
),
mn AS (
SELECT *
FROM unnest(ARRAY['biggles', 'wiggly', 'mossturn', 'leaflittle', 'flower', 'nonsuch']) WITH ORDINALITY AS mn(n, i)
),
ln AS (
SELECT *
FROM unnest(ARRAY['smithe-higgens', 'maclarter', 'ipswich', 'essex-howe', 'glumfort', 'pigeod']) WITH ORDINALITY AS ln(n, i)
)
SELECT initcap(concat_ws(' ', hon.n, fn.n, mn.n, ln.n)) AS name
FROM hon, fn, mn, ln, generate_series(1,5)
ORDER BY random();
-- Fill in the tag names
INSERT INTO tags (tag_name) VALUES
('soft'), ('cuddly'), ('brown'), ('red'), ('scratches'), ('hisses'), ('friendly'), ('aloof'), ('hungry'), ('birder'), ('mouser');
-- Generate random tagging. Every cat has 25% chance of getting each tag.
INSERT INTO cat_tags
WITH tag_ids AS (
SELECT DISTINCT tag_id FROM tags
),
tag_count AS (
SELECT Count(*) AS c FROM tags
)
SELECT cat_id, tag_id
FROM cats, tag_ids, tag_count
WHERE random() < 0.25;
— Create an index
CREATE INDEX cat_tags_x ON cat_tags (tag_id);
In total, the relational model needs 2424KB for the cats
, tags
and the tag relationships.
SELECT pg_size_pretty(
pg_total_relation_size('cats') +
pg_total_relation_size('cat_tags') +
pg_total_relation_size('tags'));
We’re going to be looking at performance, so set up timing to see how long things are taking. After each statement is run, the system will also provide a time output. Since this tutorial runs in a browser, there’s some variability in results.
\timing
There are two standard directions of tag queries:
The query is simple, and the performance is very good (under 1 ms).
SELECT tag_name
FROM tags
JOIN cat_tags USING (tag_id)
WHERE cat_id = 444;
The query is still simple, and the performance is not unexpected for the number of records that meet the filter criterion.
SELECT Count(*)
FROM cats
JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN tags ta ON (a.tag_id = ta.tag_id)
WHERE ta.tag_name = 'brown';
This is where things start to come off the rails for the relational model, because finding just the records that have two particular tags involves quite complicated SQL.
SELECT Count(*)
FROM cats
JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN cat_tags b ON (a.cat_id = b.cat_id)
JOIN tags ta ON (a.tag_id = ta.tag_id)
JOIN tags tb ON (b.tag_id = tb.tag_id)
WHERE ta.tag_name = 'brown' AND tb.tag_name = 'aloof';
This query takes around >1s to find the cats that are both "brown" and "aloof".
Just so you can see the pattern, here's the three tag version.
SELECT Count(*)
FROM cats
JOIN cat_tags a ON (cats.cat_id = a.cat_id)
JOIN cat_tags b ON (a.cat_id = b.cat_id)
JOIN cat_tags c ON (b.cat_id = c.cat_id)
JOIN tags ta ON (a.tag_id = ta.tag_id)
JOIN tags tb ON (b.tag_id = tb.tag_id)
JOIN tags tc ON (c.tag_id = tc.tag_id)
WHERE ta.tag_name = 'brown'AND tb.tag_name = 'aloof'AND tc.tag_name = 'red';
At this point the decreasing number of records in the result set is balancing out the growing complexity of the multi-join and query time has only grown to >2s.
But imagine doing this for five, six or seven tags?
What if we changed our model, and instead of modeling the cat/tag relationship with a correlation table, we model it with an integer array?
CREATE TABLE cats_array (
cat_id serial primary key,
cat_name text not null,
cat_tags integer[]
);
Now our model has just two tables, cats_array
and tags
.
We can populate the array-based table from the relational data, so we can compare answers between models.
Data Generation SQL
INSERT INTO cats_array
SELECT cat_id, cat_name, array_agg(tag_id) AS cat_tags
FROM cats
JOIN cat_tags USING (cat_id)
GROUP BY cat_id, cat_name;
With this new data model, the size of the required tables has gone down, and we are using only 1040KB.
SELECT pg_size_pretty(
pg_total_relation_size('cats_array') +
pg_total_relation_size('tags'));
Once the data are loaded, we need the most important part -- an index on the cat_tags
integer array.
CREATE INDEX cats_array_x ON cats_array USING GIN (cat_tags);
This GIN index is a perfect fit for indexing collections (like our array) where there is a fixed and finite number of values in the collection (like our ten tags). While Postgres ships with an intarray extension, the core support for arrays and array indexes have caught up with and rendered much of the extension redundant.
As before, we will test common tag-based use cases.
The query is much less pretty! First we have to lookup the tag_id
values in cat_tags
and use unnest()
to expand them out into a relation. Then we're ready to join that relation to the tags
table to find the tag_name
that corresponds to the tag_id
.
SELECT c.cat_id, c.cat_name, t.tag_name
FROM cats_array c
CROSS JOIN unnest(cat_tags) AS tag_id
JOIN tags t USING (tag_id)
WHERE cat_id = 779;
The query hits the cats
primary key index and returns in the 100ms range. Great performance!
This is the query that flummoxed our relational model. Let's jump straight to the hardest case and try to find all the cats that are "red", "brown" and "aloof".
WITH tags AS MATERIALIZED (
SELECT array_agg(tag_id) AS tag_ids
FROM tags
WHERE tag_name IN ('red', 'brown', 'aloof')
)
SELECT Count(*)
FROM cats_array
CROSS JOIN tags
WHERE cat_tags @> tags.tag_ids;
First we have to go into the tags
table to make an array of tag_id
entries that correspond to our tags. Then we can use the @>
Postgres array operator to test to see which cats have cat_tags
arrays that contain the query array.
The query hits the GIN index on cat_tags
and returns the count of cats in around 200ms. About seven times faster than the same query on the relational model!
So if partially de-normalizing our data from a cats -> cat_tags -> tags
model to a cats -> tags
model makes things faster... what if we went all the way to the simplest model of all -- just cats
?
CREATE TABLE cats_array_text (
cat_id serial primary key,
cat_name text not null,
cat_tags text[] not null
);
Again we can populate this new model directly from the relational model.
Data Generation SQL
INSERT INTO cats_array_text
SELECT cat_id, cat_name, array_agg(tag_name) AS cat_tags
FROM cats
JOIN cat_tags USING (cat_id)
JOIN tags USING (tag_id)
GROUP BY cat_id, cat_name;
The result is 1224KB, a little larger than the integer array version.
SELECT pg_size_pretty(
pg_total_relation_size('cats_array_text') +
pg_total_relation_size('tags'));
Now every cat has the tag names right in the record.
Since there's only one table with all the data, finding the tags of a cat is ridiculously easy.
SELECT cat_name, cat_tags
FROM cats_array_text
WHERE cat_id = 888;
Once again, in order to get good performance we need a GIN index on the array we will be searching.
CREATE INDEX cats_array_text_x ON cats_array_text USING GIN (cat_tags);
The query to find the cats are "red", "brown" and "aloof" is also wonderfully simple.
SELECT Count(*)
FROM cats_array_text
WHERE cat_tags @> ARRAY['red', 'brown', 'aloof'];
This query takes about the same amount of time as the integer array based query, 120ms.
So are array-based models the final answer for tag-oriented query patterns in Postgres? On the plus side, the array models are:
Those are really compelling positives!
However, there are some caveats to keep in mind when working with these models:
cats
table.cat_tags
integer array exist in the tags
table. You can work around this with a TRIGGER
, but it's not as clean as a relational foreign key constraint.Loading terminal...
Loading terminal...