WITH Queries: Present & Future
Common table expressions, aka CTEs, aka WITH queries, are not only the gateway to writing recursive SQL queries, but also help developers write maintainable SQL. WITH query clauses can help developers who are more comfortable writing in imperative languages to feel more comfortable writing SQL, as well as help reduce writing redundant code by reusing a particular common table expressions multiple times in a query.
A new patch, scheduled to be a part of PostgreSQL 12 major release later in the year, introduces the ability, under certain conditions, to inline common table expressions within a query. This is a huge feature: many developers could suddenly see their existing queries speed up significantly, and the ability to explicitly specify when to inline (i.e. the planner "substitutes" a reference to the CTE in the main query and can then optimize further) or, conversely, materialize (i.e. place the CTE into memory but lose out on certain planning & execution optimizations).
But why is this a big deal? Before we look into the future, first let's understand how WITH queries currently work in PostgreSQL.
The Present: WITH Queries Make Readable Queries
One of the reasons people like to use WITH queries is that they are easy to read: you can build up to a large query with a lot of smaller pieces, which is a very comfortable thing to do when you are less familiar with SQL. Let's look an at example.
For this example I am using PostgreSQL 11 but the syntax would work with 10 and above. Please use any performance measurements that I present in here as directional guidance to help you understand what is happening. I ran all tests on my laptop with 8 parallel workers available, and used similar configurations for PostgreSQL 11 and PostgreSQL "12devel" (it's not really "12" yet, it's running off of the "HEAD" branch in the git repository).
Let's say I have a table of employees, and each employee can belong to a company and a department. Let's say the tables for companies and departments are normalized. This would lead to a having a schema that looks like this:
CREATE TABLE company (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE department (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE employee (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
company_id int NOT NULL,
department_id int NOT NULL,
name text,
FOREIGN KEY (company_id) REFERENCES company (id),
FOREIGN KEY (department_id) REFERENCES department (id)
);
CREATE INDEX employee_company_id_idx ON employee (company_id);
CREATE INDEX employee_department_id_idx ON employee (department_id);
Now all we need to do is populate it with some data. I generated 20,000 company records, 45 different departments, and 1,000,000 employees:
INSERT INTO company (name)
SELECT 'Company ' || x
FROM generate_series(1,20000) x;
INSERT INTO department (name)
SELECT 'Department ' || x
FROM generate_series(1,45) x;
INSERT INTO employee (company_id, department_id, name)
SELECT
ceil((random() * 20000))::int,
ceil((random() * 45))::int,
'Employee #' || (x)
FROM generate_series(1,1000000) x;
VACUUM FREEZE ANALYZE;
(At the end, I use VACUUM FREEZE ANALYZE to "aggressively VACUUM" and update the statistics on the tables).
Let's say I'm asking to pull a report: Find the total number of employees and the size of the largest department for all companies that have more than 65 employees in them, ordered from the largest to smallest. This is not dissimilar from a lot of seemingly-easy-but-subtly-challenging-reports that one must run against their database!
My developer brain likes to break down the query into parts: I like to try to minimize which sets of data are being joined at all times to try to ensure my query will run as quickly as possible. From my understanding of this data set and the problem, I would first want to find all the companies that have more than 65 employees:
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65;
This is a fairly straightforward aggregate, and what's great is that this much smaller than joining the entire "employee" or "company" table to other tables further along the query (in my case, this query returned 380 rows, vs. 20,000 or 1,000,000).
Now, we need to get a list of all the departments that each of these companies have, and determine how many employees are in each department. The good news is we already have a smaller set of companies to search over, so we can build off of those. Intuitively, I'm going to use a WITH query to do so, as it's easy to read:
WITH company_sizes AS (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
)
SELECT
employee.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY employee.company_id, company_sizes.employee_count, employee.department_id;
We're almost done! Now, we just need to find which department is the largest in each of our companies, and then order the results from largest company to smallest. We can chain WITH clauses together, so why not write a query like this:
WITH company_sizes AS (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
), department_sizes AS (
SELECT
employee.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY employee.company_id, company_sizes.employee_count, employee.department_id
)
SELECT
department_sizes.company_id,
department_sizes.employee_count,
max(department_sizes.department_count) AS department_max
FROM department_sizes
GROUP BY department_sizes.company_id, department_sizes.employee_count
ORDER BY department_sizes.employee_count DESC;
And we're done - we're getting our results, and the query is very readable: each part of the answer seems to build on itself as you read it from top-to-bottom.
The query seems to perform okay too (more on that in a second), but can we do better? The short answer is yes, we can, but why?
A Little SQL Knowledge Provides An Easy Optimization
Let's take a look at the query plan for the above query and see what is happening:
EXPLAIN ANALYZE WITH company_sizes AS (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
), department_sizes AS (
SELECT
employee.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY employee.company_id, company_sizes.employee_count, employee.department_id
)
SELECT
department_sizes.company_id,
department_sizes.employee_count,
max(department_sizes.department_count) AS department_max
FROM department_sizes
GROUP BY department_sizes.company_id, department_sizes.employee_count
ORDER BY department_sizes.employee_count DESC;
which outputs:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=73494.81..73578.14 rows=33332 width=20) (actual time=531.590..531.611 rows=380 loops=1)
Sort Key: department_sizes.employee_count DESC
Sort Method: quicksort Memory: 54kB
CTE company_sizes
-> Finalize HashAggregate (cost=19046.75..19293.63 rows=6583 width=12) (actual time=171.007..173.622 rows=380 loops=1)
Group Key: employee.company_id
Filter: (count(*) > 65)
Rows Removed by Filter: 19620
-> Gather (cost=14603.00..18750.50 rows=39500 width=12) (actual time=140.876..152.792 rows=60000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=13603.00..13800.50 rows=19750 width=12) (actual time=136.738..141.436 rows=20000 loops=3)
Group Key: employee.company_id
-> Parallel Seq Scan on employee (cost=0.00..11519.67 rows=416667 width=4) (actual time=0.010..31.414 rows=333333 loops=3)
CTE department_sizes
-> HashAggregate (cost=38864.51..42197.67 rows=333316 width=24) (actual time=515.134..520.531 rows=13464 loops=1)
Group Key: employee_1.company_id, company_sizes.employee_count, employee_1.department_id
-> Hash Join (cost=29853.00..35531.35 rows=333316 width=16) (actual time=500.032..508.671 rows=25992 loops=1)
Hash Cond: (company_sizes.company_id = employee_1.company_id)
-> CTE Scan on company_sizes (cost=0.00..131.66 rows=6583 width=12) (actual time=171.010..173.721 rows=380 loops=1)
-> Hash (cost=17353.00..17353.00 rows=1000000 width=8) (actual time=328.637..328.637 rows=1000000 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 47255kB
-> Seq Scan on employee employee_1 (cost=0.00..17353.00 rows=1000000 width=8) (actual time=0.022..158.920 rows=1000000 loops=1)
-> HashAggregate (cost=9166.19..9499.51 rows=33332 width=20) (actual time=531.266..531.489 rows=380 loops=1)
Group Key: department_sizes.employee_count, department_sizes.company_id
-> CTE Scan on department_sizes (cost=0.00..6666.32 rows=333316 width=20) (actual time=515.136..525.484 rows=13464 loops=1)
Planning Time: 0.316 ms
Execution Time: 540.048 ms
Allow me to call your attention to the lines that say "CTE." First, "CTE company_sizes" seems to perform ok: the query planner does recognize we need to perform a sequential scan as the aggregate requires us to use the entire "employee" table, but we're also able to perform the scan in parallel.
However, let's look at some of the values in "CTE department_sizes" and in particular, this part:
-> Hash Join (cost=29853.00..35531.35 rows=333316 width=16) (actual time=500.032..508.671 rows=25992 loops=1)
Hash Cond: (company_sizes.company_id = employee_1.company_id)
-> CTE Scan on company_sizes (cost=0.00..131.66 rows=6583 width=12) (actual time=171.010..173.721 rows=380 loops=1)
-> Hash (cost=17353.00..17353.00 rows=1000000 width=8) (actual time=328.637..328.637 rows=1000000 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 47255kB
This seems to be very costly: why is the hash join so large when the company_sizes table is so small? We have run into the "optimization fence" that CTEs have in PostgreSQL 11 and earlier! PostgreSQL will materialize every CTE, which can have benefits for some very costly WITH queries, but also prevents PostgreSQL from making some better choices.
So what can we do? Typically, when I see "chained" WITH statements, it typically means that you need to build your query from the "inside-out" vs. top-down. What do I mean by that?
First, let's start with our initial query that will only return companies that have more than 65 employees:
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65;
Now let's try something a little bit different, let's "wrap" the query that finds the total number of employees in each department for a reduced number of companies:
SELECT
company_sizes.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
) company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY company_sizes.company_id, company_sizes.employee_count, employee.department_id;
Finally, let's find the size of the largest department for each company, and return the list from largest company to smallest:
SELECT
department_sizes.company_id,
department_sizes.employee_count,
max(department_sizes.department_count) AS department_max
FROM (
SELECT
company_sizes.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
) company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY company_sizes.company_id, company_sizes.employee_count, employee.department_id
) department_sizes
GROUP BY department_sizes.company_id, department_sizes.employee_count
ORDER BY department_sizes.employee_count DESC;
This time, PostgreSQL 11 and prior should be able to take advantage of query optimizations, and we should have an overall better execution plan. In fact, when we run:
EXPLAIN ANALYZE SELECT
department_sizes.company_id,
department_sizes.employee_count,
max(department_sizes.department_count) AS department_max
FROM (
SELECT
company_sizes.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
) company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY company_sizes.company_id, company_sizes.employee_count, employee.department_id
) department_sizes
GROUP BY department_sizes.company_id, department_sizes.employee_count
ORDER BY department_sizes.employee_count DESC;
the output looks like:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=43768.09..43775.49 rows=2961 width=20) (actual time=361.405..361.427 rows=380 loops=1)
Sort Key: (count(*)) DESC
Sort Method: quicksort Memory: 54kB
-> HashAggregate (cost=43567.75..43597.36 rows=2961 width=20) (actual time=361.280..361.342 rows=380 loops=1)
Group Key: (count(*)), employee_1.company_id
-> HashAggregate (cost=42753.47..43049.57 rows=29610 width=24) (actual time=355.711..358.866 rows=13464 loops=1)
Group Key: employee_1.company_id, (count(*)), employee.department_id
-> Hash Join (cost=19441.74..39420.31 rows=333316 width=16) (actual time=167.384..346.379 rows=25992 loops=1)
Hash Cond: (employee.company_id = employee_1.company_id)
-> Seq Scan on employee (cost=0.00..17353.00 rows=1000000 width=8) (actual time=0.011..75.625 rows=1000000 loops=1)
-> Hash (cost=19359.46..19359.46 rows=6583 width=12) (actual time=167.362..167.362 rows=380 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 82kB
-> Finalize HashAggregate (cost=19046.75..19293.63 rows=6583 width=12) (actual time=165.658..167.284 rows=380 loops=1)
Group Key: employee_1.company_id
Filter: (count(*) > 65)
Rows Removed by Filter: 19620
-> Gather (cost=14603.00..18750.50 rows=39500 width=12) (actual time=139.180..151.142 rows=60000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=13603.00..13800.50 rows=19750 width=12) (actual time=135.846..140.165 rows=20000 loops=3)
Group Key: employee_1.company_id
-> Parallel Seq Scan on employee employee_1 (cost=0.00..11519.67 rows=416667 width=4) (actual time=0.009..32.091 rows=333333 loops=3)
Planning Time: 0.176 ms
Execution Time: 362.076 ms
which is directionally faster (and in the case on my laptop over 30% faster!). If we inspect the new plan a bit, take a look at our suspect hash join:
-> Hash Join (cost=19441.74..39420.31 rows=333316 width=16) (actual time=167.384..346.379 rows=25992 loops=1)
Hash Cond: (employee.company_id = employee_1.company_id)
-> Seq Scan on employee (cost=0.00..17353.00 rows=1000000 width=8) (actual time=0.011..75.625 rows=1000000 loops=1)
-> Hash (cost=19359.46..19359.46 rows=6583 width=12) (actual time=167.362..167.362 rows=380 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 82kB
That looks a lot better.
The Future: Inlined WITH Clauses
Inlined WITH clauses offer us a new way to write queries that are appropriately optimized, as long as we heed this caution from the commit message:
By default, we will inline [CTEs] into the outer query (removing the
optimization fence) if they are called just once. If they are called more
than once, we will keep the old behavior by default, but the user can
override this and force inlining by specifying NOT MATERIALIZED.
In the case of our original query, we only refer to each CTE once, so we do not need to add anything extra. In fact, when we run EXPLAIN ANALYZE on the query against the latest build of PostgreSQL:
EXPLAIN ANALYZE WITH company_sizes AS (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
), department_sizes AS (
SELECT
employee.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY employee.company_id, company_sizes.employee_count, employee.department_id
)
SELECT
department_sizes.company_id,
department_sizes.employee_count,
max(department_sizes.department_count) AS department_max
FROM department_sizes
GROUP BY department_sizes.company_id, department_sizes.employee_count
ORDER BY department_sizes.employee_count DESC;
we get the following output:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=54732.73..54816.06 rows=33332 width=20) (actual time=389.213..389.235 rows=343 loops=1)
Sort Key: company_sizes.employee_count DESC
Sort Method: quicksort Memory: 54kB
-> HashAggregate (cost=51895.41..52228.73 rows=33332 width=20) (actual time=388.947..389.148 rows=343 loops=1)
Group Key: company_sizes.employee_count, employee.company_id
-> HashAggregate (cost=42729.22..46062.38 rows=333316 width=24) (actual time=382.513..386.816 rows=12073 loops=1)
Group Key: employee.company_id, company_sizes.employee_count, employee.department_id
-> Hash Join (cost=19417.49..39396.06 rows=333316 width=16) (actual time=180.546..373.281 rows=23441 loops=1)
Hash Cond: (employee.company_id = company_sizes.company_id)
-> Seq Scan on employee (cost=0.00..17353.00 rows=1000000 width=8) (actual time=0.020..78.965 rows=1000000 loops=1)
-> Hash (cost=19335.61..19335.61 rows=6550 width=12) (actual time=180.490..180.490 rows=343 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 79kB
-> Subquery Scan on company_sizes (cost=19024.48..19335.61 rows=6550 width=12) (actual time=178.670..180.410 rows=343 loops=1)
-> Finalize HashAggregate (cost=19024.48..19270.11 rows=6550 width=12) (actual time=178.669..180.370 rows=343 loops=1)
Group Key: employee_1.company_id
Filter: (count(*) > 65)
Rows Removed by Filter: 19657
-> Gather (cost=14603.00..18729.71 rows=39302 width=12) (actual time=149.943..162.368 rows=60000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=13603.00..13799.51 rows=19651 width=12) (actual time=143.053..147.456 rows=20000 loops=3)
Group Key: employee_1.company_id
-> Parallel Seq Scan on employee employee_1 (cost=0.00..11519.67 rows=416667 width=4) (actual time=0.011..33.314 rows=333333 loops=3)
Planning Time: 0.379 ms
Execution Time: 396.157 ms
which directionally is much better than the performance of our original query! The evidence that the CTEs are inlined are that you do not see any reference "CTE" in the query plan and we can see that the hash join optimization is in place!
If you want to force a query to be inlined, you can use the "NOT MATERIALIZED" syntax on a specific WITH clause, e.g.:
WITH company_sizes AS NOT MATERIALIZED (
SELECT
employee.company_id,
count(*) AS employee_count
FROM employee
GROUP BY employee.company_id
HAVING count(*) > 65
)
SELECT
employee.company_id,
company_sizes.employee_count,
employee.department_id,
count(*) AS department_count
FROM company_sizes
JOIN employee ON employee.company_id = company_sizes.company_id
GROUP BY employee.company_id, company_sizes.employee_count, employee.department_id;
If you want to force the query to materialize, you can use the "MATERIALIZE" keyword.
Looking Ahead
This feature is very exciting for a lot of reasons: arguably, it lowers the bar to writing performant code for PostgreSQL, but it also gives people finer-grained control over how CTEs should behave in their queries. Additionally, it should help people to write more readable SQL, though I would still personally caution against chaining many too many CTEs together unless you absolutely have to (I have often did this for "writeable" CTEs, i.e. when I am writing SQL to migrate data from one schema to another. Perhaps a future blog post?).
Anyway, I am clearly excited for this feature. I should apply the usual disclaimer that up until the PostgreSQL 12, any feature could be reverted, I thought it was at least worthwhile to go through the exercise on how to optimize present day CTE queries.
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read