Postgres Subquery Powertools: CTEs, Materialized Views, Window Functions, and LATERAL Join
Beyond a basic query with a join or two, many queries require extracting subsets of data for comparison, conditionals, or aggregation. Postgres’ use of the SQL language is standards compliant and SQL has a world of tools for subqueries. This post will look at many of the different subquery tools. We’ll talk about the advantages and use cases of each, and provide further reading and tutorials to dig in more.
I’ll take a broad definition of “subquery”. Why am I calling all of these subqueries? These are all queries that work on subsets of data. Having read the article title, you might have come here to say that a subquery is a specific thing vs all these other SQL tools I’m talking about. And you’d be right! If you have a better name for this group of tools, let me know.
What is a subselect?
A subquery extracts values from a group of something else. It’s a subset of a set. A basic subquery is a nested select statement inside of another query. The most basic subselects are typically found in WHERE statements.
In this example we want to summarize a quantity of SKUs sold that have a certain sale price. The subquery returns the SKU of all products that have a sale price less than 75% of the price. Then the top-level query sums the quantity of each product_order.
SELECT
sum(qty) as total_qty,
sku
FROM
product_orders
WHERE
sku in
(
SELECT
sku
FROM
products
WHERE
sale_price <= price * .75
)
GROUP BY
sku;
As with most things SQL, you could build this query a few ways. Most queries that you execute with a join you could also execute with a subquery. Why would you use a subquery instead of a join? Mostly, it depends on your syntax preference and what you want to do with it. So the above could also be written as a plain select statement joining products and product_orders by SKU. See our blog post on choosing joins vs subqueries for more.
When to use a basic subselect
- Your subquery is simple and can go in the WHERE clause
What is a Postgres view?
A view is a stored query that you access as you would a table. View functionality is quite common across relational database systems. Since a view is a query, it can be data from one table or consolidated data from multiple tables. When called, a view will execute a query or it can be called as a subquery. Commonly, a view is used to save a query that you might be running often from inside your database. Views can be used as a join or a subquery from inside another query.
Views are a little more advanced than just a query, they can have separate user settings. You could specify views for certain individuals or applications if you want to show parts of certain tables. Some developers have their applications query a view instead of the base table, so if changes are made to the underlying tables, fewer changes will impact the application code.
Using the example we started with above, let’s say we often need to call the SKUs of sale items in other queries, so we want to create a view for that. Here’s sample syntax for a Postgres view. We name this view skus_on_sale
, which selects SKUs from the product table that have a sale price less than 75% of their original price.
CREATE VIEW skus_on_sale AS
SELECT
sku
FROM
products
WHERE
sale_price <= price * .75;
Previously, we nested a full subquery, this time, we join this view in another query. Logically, this will return the same values as the prior query:
SELECT
sum(po.qty) as total_qty,
sk.sku
FROM
product_orders po
JOIN
skus_on_sale sk
ON sk.sku = po.sku
GROUP BY
sk.sku;
When to use a view?
- When you want to save a specific query for use later or in other queries
- You have a security issue or need to to show a user or application only the view and not the entire table or tables involved
What is a Materialized View?
Materialized views are saved queries that you store in the database like you would store a table. Unlike the regular view, a materialized view is stored on disk and information does not need to be re-computed to be used each time. Materialized views can be queried like any other table. Typically materialized views are used for situations where you want to save yourself, or the database, from intensive queries or for data that is frequently used.
The big upside to materialized views is performance. Since the data has been precomputed, materialized views often have better response times than other subquery methods. No matter how complex the query, how many tables involved, Postgres stores these results as a simple table. This simple table becomes a simple join to the materialized view, and the materialized view hides complexity of the subqueries heavy lifting.
Here’s an example of a materialized view that will get my SKUs and the shipped quantity by SKU. This shows the most frequently sold SKUs at the top since I’m ordering by qty in descending order.
CREATE MATERIALIZED VIEW recent_product_sales AS
SELECT
p.sku,
SUM(po.qty) AS total_quantity
FROM
products p
JOIN
product_orders po
ON p.sku = po.sku
JOIN
orders o
ON po.order_id = o.order_id
WHERE
o.status = 'Shipped'
GROUP BY
p.sku
ORDER BY
2 DESC;
To improve query performance on materialized views, we can also create indexes on their fields, here;s an example that indexes on the quantity column.
CREATE INDEX sku_qty ON recent_product_sales(total_quantity);
Just like the view, we can call the materialized view in a query. So for example, we can quickly review the top 10 products sold without having to write a subquery to sum or rank.
SELECT
sku
FROM
recent_product_sales LIMIT 10;
To update the data held on disk, run a refresh command, REFRESH MATERIALIZED VIEW CONCURRENTLY recent_product_sales;
. Use CONCURRENTLY to allow queries to execute to the existing output while the new output is refreshed.
See our tutorial on materialized views if you want to see it in action.
When to use a materialized view?
- Your subquery is intensive so storing the generated results rather than computed each time will help overall performance
- Your data doesn’t need to be updated in real time
What is a common table expression (CTE)?
A CTE, a common table expression, allows you to split a complex query into different named parts and reference those parts later in the query.
- CTEs always start with a WITH statement that creates a subquery first
- The WITH statement is followed by a select statement that references the CTE, the CTE cannot exist alone
Similar to the view statement above, here is a sample CTE that creates a subselect called huge_savings
, then uses this in a select statement.
WITH huge_savings AS
(
SELECT
sku
FROM
products
WHERE
sale_price <= price * .75
)
SELECT
sum(qty) as total_qty,
sku
FROM
product_orders
JOIN
huge_savings USING (sku)
GROUP BY
sku;
Often as queries become more and more complex, CTEs are a great way to make understanding queries easier by combining data manipulation into sensible parts.
What is a recursive CTE?
A recursive CTE is a CTE that selects against itself. You’ll define an initial condition and then append rows as part of the query. This goes on and on until a terminating condition. We have some awesome examples of recurring CTEs in our Advent of Code series. Recursive CTEs will start with WITH recursive AS
.
When to use a CTE?
- To separate and define a complicated subquery
- You have multiple subqueries to include in a larger query
- Your subquery needs to select against itself so you’ll need a recursive CTE
What is a window function?
A window function is an aggregate function that looks at a certain set, ie - the window, of data. The function is typically first and the operator OVER is used to define the group / partition of data you’re looking at. Window functions are used in subqueries often to do averages, summations, max/min, ranks, averages, lead (next row), or lag (previous row).
For example, you could write a simple window function to sum product orders by sku. The SUM is the aggregations and the OVER PARTITION looks at the sku set.
SELECT
sku,
SUM(qty) OVER (PARTITION BY sku)
FROM
product_orders LIMIT 10;
We have a nice tutorial on window functions with a CTEs for the birth data set. Here’s one sql example using the window function lag. We use a CTE to create a count of births per week. Then, we use the lag function to return this weeks’ birth count, and the birth count for the prior week.
WITH weekly_births AS
(
SELECT
date_trunc('week', day) week,
sum(births) births
FROM
births
GROUP BY
1
)
SELECT
week,
births,
lag(births, 1) OVER (
ORDER BY
week DESC ) prev_births
FROM
weekly_births;
It is worth calling out here, that similar to a window function, the FILTER functionality on GROUP BY aggregations is also a powerful sql tool. I won’t include more here because it’s not a subquery as much as is a filter. For more information, Crunchy Data has a walkthrough on using FILTER with GROUP BY.
When to use a window function?
- If you have a subquery that’s an aggregation, like a sum, rank, or average
- The subquery applies to a limited set of the overall data to be returned
What is a LATERAL join?
LATERAL lets you use values from the top-level query in the subquery. So, if you are querying on accounts in the top-level query, you can then reference that in the subquery. When run, LATERAL is kind of like running a subquery for each individual row. LATERAL is commonly used for querying against an array or JSON data, as well as a replacement for the DISTINCT ON syntax. Check out our LATERAL tutorial to see if you get any ideas about where to add it to your query tools. I would also double check performance when using LATERAL, in our internal testing, its generally not as good as other join options.
Below, we use LATERAL
to find the last purchase for every account:
SELECT
accounts.id,
last_purchase.*
FROM
accounts
INNER JOIN
LATERAL (
SELECT
*
FROM
purchases
WHERE
account_id = accounts.id
ORDER BY
created_at DESC LIMIT 1 ) AS last_purchase
ON true;
When to use a LATERAL join?
- You want to lookup data for each row
- You’re using an array or JSON data in a join
Summary
Here’s my reference guide for the tools I discussed above:
what | details | example |
---|---|---|
subselect | select inside a select | SELECT sum(qty) as total_qty,sku FROM product_orders WHERE sku in (SELECT sku FROM products WHERE sale_price <= price * .75) GROUP BY sku; |
CTE | subqueries with named parts | WITH huge_savings AS ( SELECT sku FROM products WHERE sale_price <= price * .75) SELECT sum(qty) as total_qty, sku FROM product_orders JOIN huge_savings USING (sku) GROUP BY sku; |
materialized view | saved query to a table | CREATE MATERIALIZED VIEW recent_product_sales AS SELECT p.sku, SUM(po.qty) AS total_quantity FROM products p JOIN product_orders po ON p.sku = po.sku JOIN orders o ON po.order_id = o.order_id WHERE o.status = 'Shipped' GROUP BY p.sku ORDER BY 2 DESC; |
window functions | aggregations on subsets of data | SELECT sku, SUM(qty) OVER (PARTITION BY sku) FROM product_orders LIMIT 10; |
lateral join | correlated subquery | SELECT sku, SUM(qty) OVER (PARTITION BY sku) FROM product_orders LIMIT 10; |
If you’re wondering which to use when, you can just get in there and test. Don’t forget to use your query planning best friend EXPLAIN ANALYZE to test query efficiency and plans.
Links to our web based Postgres tutorials for more on these topics:
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read