Postgres Subquery Powertools: CTEs,...

Beyond a basic query with a join or two, many queries require extracting subsets of data for comparison, conditionals, or aggregation. Postgres’ use of the SQL language is standards compliant and SQL has a world of tools for subqueries. This post will look at many of the different subquery tools. We’ll talk about the advantages and use cases of each, and provide further reading and tutorials to dig in more.

I’ll take a broad definition of “subquery”. Why am I calling all of these subqueries? These are all queries that work on subsets of data. Having read the article title, you might have come here to say that a subquery is a specific thing vs all these other SQL tools I’m talking about. And you’d be right! If you have a better name for this group of tools, let me know.

What is a subselect?

A subquery extracts values from a group of something else. It’s a subset of a set. A basic subquery is a nested select statement inside of another query. The most basic subselects are typically found in WHERE statements.

In this example we want to summarize a quantity of SKUs sold that have a certain sale price. The subquery returns the SKU of all products that have a sale price less than 75% of the price. Then the top-level query sums the quantity of each product_order.

SELECT
   sum(qty) as total_qty,
   sku
FROM
   product_orders
WHERE
   sku in
   (
      SELECT
         sku
      FROM
         products
      WHERE
         sale_price <= price * .75
   )
GROUP BY
   sku;

As with most things SQL, you could build this query a few ways. Most queries that you execute with a join you could also execute with a subquery. Why would you use a subquery instead of a join? Mostly, it depends on your syntax preference and what you want to do with it. So the above could also be written as a plain select statement joining products and product_orders by SKU. See our blog post on choosing joins vs subqueries for more.

When to use a basic subselect

Your subquery is simple and can go in the WHERE clause

What is a Postgres view?

A view is a stored query that you access as you would a table. View functionality is quite common across relational database systems. Since a view is a query, it can be data from one table or consolidated data from multiple tables. When called, a view will execute a query or it can be called as a subquery. Commonly, a view is used to save a query that you might be running often from inside your database. Views can be used as a join or a subquery from inside another query.

Views are a little more advanced than just a query, they can have separate user settings. You could specify views for certain individuals or applications if you want to show parts of certain tables. Some developers have their applications query a view instead of the base table, so if changes are made to the underlying tables, fewer changes will impact the application code.

Using the example we started with above, let’s say we often need to call the SKUs of sale items in other queries, so we want to create a view for that. Here’s sample syntax for a Postgres view. We name this view skus_on_sale, which selects SKUs from the product table that have a sale price less than 75% of their original price.

CREATE VIEW skus_on_sale AS
SELECT
   sku
FROM
   products
WHERE
   sale_price <= price * .75;

Previously, we nested a full subquery, this time, we join this view in another query. Logically, this will return the same values as the prior query:

SELECT
   sum(po.qty) as total_qty,
   sk.sku
FROM
   product_orders po
   JOIN
      skus_on_sale sk
      ON sk.sku = po.sku
GROUP BY
   sk.sku;

When to use a view?

When you want to save a specific query for use later or in other queries
You have a security issue or need to to show a user or application only the view and not the entire table or tables involved

What is a Materialized View?

Materialized views are saved queries that you store in the database like you would store a table. Unlike the regular view, a materialized view is stored on disk and information does not need to be re-computed to be used each time. Materialized views can be queried like any other table. Typically materialized views are used for situations where you want to save yourself, or the database, from intensive queries or for data that is frequently used.

The big upside to materialized views is performance. Since the data has been precomputed, materialized views often have better response times than other subquery methods. No matter how complex the query, how many tables involved, Postgres stores these results as a simple table. This simple table becomes a simple join to the materialized view, and the materialized view hides complexity of the subqueries heavy lifting.

Here’s an example of a materialized view that will get my SKUs and the shipped quantity by SKU. This shows the most frequently sold SKUs at the top since I’m ordering by qty in descending order.

CREATE MATERIALIZED VIEW recent_product_sales AS
SELECT
   p.sku,
   SUM(po.qty) AS total_quantity
FROM
   products p
   JOIN
      product_orders po
      ON p.sku = po.sku
   JOIN
      orders o
      ON po.order_id = o.order_id
WHERE
   o.status = 'Shipped'
GROUP BY
   p.sku
ORDER BY
   2 DESC;

To improve query performance on materialized views, we can also create indexes on their fields, here;s an example that indexes on the quantity column.

CREATE INDEX sku_qty ON recent_product_sales(total_quantity);

Just like the view, we can call the materialized view in a query. So for example, we can quickly review the top 10 products sold without having to write a subquery to sum or rank.

SELECT
   sku
FROM
   recent_product_sales LIMIT 10;

To update the data held on disk, run a refresh command, REFRESH MATERIALIZED VIEW CONCURRENTLY recent_product_sales;. Use CONCURRENTLY to allow queries to execute to the existing output while the new output is refreshed.

See our tutorial on materialized views if you want to see it in action.

When to use a materialized view?

Your subquery is intensive so storing the generated results rather than computed each time will help overall performance
Your data doesn’t need to be updated in real time

What is a common table expression (CTE)?

A CTE, a common table expression, allows you to split a complex query into different named parts and reference those parts later in the query.

CTEs always start with a WITH statement that creates a subquery first
The WITH statement is followed by a select statement that references the CTE, the CTE cannot exist alone

Similar to the view statement above, here is a sample CTE that creates a subselect called huge_savings, then uses this in a select statement.

WITH huge_savings AS
(
   SELECT
      sku
   FROM
      products
   WHERE
      sale_price <= price * .75
)
SELECT
   sum(qty) as total_qty,
   sku
FROM
   product_orders
   JOIN
      huge_savings USING (sku)
GROUP BY
   sku;

Often as queries become more and more complex, CTEs are a great way to make understanding queries easier by combining data manipulation into sensible parts.

What is a recursive CTE?

A recursive CTE is a CTE that selects against itself. You’ll define an initial condition and then append rows as part of the query. This goes on and on until a terminating condition. We have some awesome examples of recurring CTEs in our Advent of Code series. Recursive CTEs will start with WITH recursive AS.

When to use a CTE?

To separate and define a complicated subquery
You have multiple subqueries to include in a larger query
Your subquery needs to select against itself so you’ll need a recursive CTE

What is a window function?

A window function is an aggregate function that looks at a certain set, ie - the window, of data. The function is typically first and the operator OVER is used to define the group / partition of data you’re looking at. Window functions are used in subqueries often to do averages, summations, max/min, ranks, averages, lead (next row), or lag (previous row).

For example, you could write a simple window function to sum product orders by sku. The SUM is the aggregations and the OVER PARTITION looks at the sku set.

SELECT
   sku,
   SUM(qty) OVER (PARTITION BY sku)
FROM
   product_orders LIMIT 10;

We have a nice tutorial on window functions with a CTEs for the birth data set. Here’s one sql example using the window function lag. We use a CTE to create a count of births per week. Then, we use the lag function to return this weeks’ birth count, and the birth count for the prior week.

WITH weekly_births AS
(
   SELECT
      date_trunc('week', day) week,
      sum(births) births
   FROM
      births
   GROUP BY
      1
)
SELECT
   week,
   births,
   lag(births, 1) OVER (
ORDER BY
   week DESC ) prev_births
FROM
   weekly_births;

It is worth calling out here, that similar to a window function, the FILTER functionality on GROUP BY aggregations is also a powerful sql tool. I won’t include more here because it’s not a subquery as much as is a filter. For more information, Crunchy Data has a walkthrough on using FILTER with GROUP BY.

When to use a window function?

If you have a subquery that’s an aggregation, like a sum, rank, or average
The subquery applies to a limited set of the overall data to be returned

What is a LATERAL join?

LATERAL lets you use values from the top-level query in the subquery. So, if you are querying on accounts in the top-level query, you can then reference that in the subquery. When run, LATERAL is kind of like running a subquery for each individual row. LATERAL is commonly used for querying against an array or JSON data, as well as a replacement for the DISTINCT ON syntax. Check out our LATERAL tutorial to see if you get any ideas about where to add it to your query tools. I would also double check performance when using LATERAL, in our internal testing, its generally not as good as other join options.

Below, we use LATERAL to find the last purchase for every account:

SELECT
   accounts.id,
   last_purchase.*
FROM
   accounts
   INNER JOIN
      LATERAL (
      SELECT
         *
      FROM
         purchases
      WHERE
         account_id = accounts.id
      ORDER BY
         created_at DESC LIMIT 1 ) AS last_purchase
         ON true;

When to use a LATERAL join?

You want to lookup data for each row
You’re using an array or JSON data in a join

Summary

Here’s my reference guide for the tools I discussed above:

what	details	example
subselect	select inside a select	SELECT sum(qty) as total_qty,sku FROM product_orders WHERE sku in (SELECT sku FROM products WHERE sale_price <= price * .75) GROUP BY sku;
CTE	subqueries with named parts	WITH huge_savings AS ( SELECT sku FROM products WHERE sale_price <= price * .75) SELECT sum(qty) as total_qty, sku FROM product_orders JOIN huge_savings USING (sku) GROUP BY sku;
materialized view	saved query to a table	CREATE MATERIALIZED VIEW recent_product_sales AS SELECT p.sku, SUM(po.qty) AS total_quantity FROM products p JOIN product_orders po ON p.sku = po.sku JOIN orders o ON po.order_id = o.order_id WHERE o.status = 'Shipped' GROUP BY p.sku ORDER BY 2 DESC;
window functions	aggregations on subsets of data	SELECT sku, SUM(qty) OVER (PARTITION BY sku) FROM product_orders LIMIT 10;
lateral join	correlated subquery	SELECT sku, SUM(qty) OVER (PARTITION BY sku) FROM product_orders LIMIT 10;

If you’re wondering which to use when, you can just get in there and test. Don’t forget to use your query planning best friend EXPLAIN ANALYZE to test query efficiency and plans.

Links to our web based Postgres tutorials for more on these topics:

CTEs and Window functions

Materialized views

Lateral joins

Latest Articles

Postgres Subquery Powertools: CTEs, Materialized Views, Window Functions, and LATERAL Join