SQL makes sense when it's working on a single row, or even when it's aggregating across multiple rows. But what happens when you want to compare rows against something you've already calculated? Or make groups of data and query those? Enter window functions.
Window functions tend to confuse people, but they're a pretty awesome tool in SQL for data analytics. The best part is that you don't need charts, fancy BI tools, or AI to get actionable and useful data for your stakeholders: window functions let you rank, compare, and summarize rows right in your queries.
In this post, I will show various types of window functions and how they apply to certain situations. I'm using a super simple ecommerce schema so you can follow along with the kinds of queries I'm going to run with window functions.
The OVER function
The OVER part of the window function is what creates the window. Annoyingly, the word window appears nowhere in any of the function names. 😂 Typically the OVER clause is preceded by another function, either an aggregate or a mathematical function. There's also often a frame, which specifies which rows you're looking at, like ROWS BETWEEN 6 PRECEDING AND CURRENT ROW.
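The general shape of a window function call looks like this. This is a sketch rather than a runnable query: some_function and the column names are placeholders.

```sql
some_function(some_column) OVER (
    PARTITION BY group_column                  -- optional: split rows into groups
    ORDER BY sort_column                       -- optional: order rows within each group
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- optional frame: which rows to include
)
```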
Window functions vs where clauses
Window functions kind of feel like a WHERE clause at first, since they're looking at a set of data. But they're really different. Window functions are more for times when you need to look across sets of data or across groups of data. There are cases where you could use either. In general:
- Use a window function when you need to calculate across a set of related rows.
- Use a WHERE clause when you need to filter rows based on a condition.
Here's a simple place to get started. Let's ask for orders, customer data, order totals, and then a running total of orders. This will show us our total orders across a date range.
SELECT
SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
order_date,
order_id,
customer_id,
total_amount
FROM
orders
ORDER BY
order_date;
running_total | order_date | order_id | customer_id | total_amount
---------------+---------------------+----------+-------------+--------------
349.98 | 2024-08-21 10:00:00 | 21 | 1 | 349.98
1249.96 | 2024-08-22 11:30:00 | 22 | 2 | 899.98
1284.94 | 2024-08-23 09:15:00 | 23 | 3 | 34.98
1374.93 | 2024-08-24 14:45:00 | 24 | 4 | 89.99
1524.92 | 2024-08-25 08:25:00 | 25 | 5 | 149.99
1589.90 | 2024-08-26 12:05:00 | 26 | 6 | 64.98
1839.88 | 2024-08-27 16:35:00 | 27 | 7 | 249.98
What's happening here is that each frame of data is the current row plus the rows before it. The calculation runs one slice at a time, over what you might see the docs call a virtual table: each frame is a set of rows, aggregated by the function attached to the SUM OVER.
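To make the frames concrete, here is each row's frame for the running total output above:

```
row 1: frame = row 1       SUM = 349.98
row 2: frame = rows 1-2    SUM = 1249.96
row 3: frame = rows 1-3    SUM = 1284.94
...
row 7: frame = rows 1-7    SUM = 1839.88
```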
Window functions can also look at groups of data, say a specific customer ID, and give you something like that customer's first and last order alongside your most recent orders. Note that LAST_VALUE needs an explicit frame of ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING; the default frame stops at the current row, which would just return the current order's date.
SELECT
FIRST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS first_order_date,
LAST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_order_date,
o.order_id,
o.customer_id,
o.order_date,
o.total_amount
FROM
orders o
ORDER BY
o.order_date DESC;
first_order_date | last_order_date | order_id | customer_id | order_date | total_amount
---------------------+---------------------+----------+-------------+---------------------+--------------
2024-08-30 17:50:00 | 2024-09-19 18:50:00 | 50 | 10 | 2024-09-19 18:50:00 | 149.98
2024-08-29 13:10:00 | 2024-09-18 14:10:00 | 49 | 9 | 2024-09-18 14:10:00 | 199.98
2024-08-28 10:20:00 | 2024-09-17 11:20:00 | 48 | 8 | 2024-09-17 11:20:00 | 139.99
2024-08-27 16:35:00 | 2024-09-16 17:35:00 | 47 | 7 | 2024-09-16 17:35:00 | 249.98
2024-08-26 12:05:00 | 2024-09-15 13:05:00 | 46 | 6 | 2024-09-15 13:05:00 | 89.98
date_trunc is an incredibly handy Postgres function that truncates timestamps to a unit of time: hour, day, week, month, and so on. When combined with a GROUP BY in a CTE, you can create really easy summary statistics by day, week, month, or year.
When you combine the date_trunc GROUP BY partitions with window functions, some pretty magical stuff happens, and you can get ready-made summary statistics straight out of your database. In my opinion this is one of the most powerful features of Postgres window functions and really gets you to the next level.
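For a quick feel of what date_trunc does on its own:

```sql
SELECT date_trunc('day', TIMESTAMP '2024-08-21 10:00:00');
-- 2024-08-21 00:00:00

SELECT date_trunc('month', TIMESTAMP '2024-08-21 10:00:00');
-- 2024-08-01 00:00:00
```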
Here's an example query that starts with a CTE, calling date_trunc to sum orders into daily totals. The second part of the query, the window function, ranks the daily sales in descending order so the best day of sales comes first.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
RANK() OVER (
ORDER BY daily_total_sales DESC
) AS sales_rank
FROM
DailySales
ORDER BY
sales_rank;
sales_date | daily_total_sales | sales_rank
---------------------+-------------------+------------
2024-09-02 00:00:00 | 2419.97 | 1
2024-09-01 00:00:00 | 1679.94 | 2
2024-08-22 00:00:00 | 899.98 | 3
2024-09-07 00:00:00 | 699.95 | 4
2024-09-10 00:00:00 | 659.96 | 5
2024-09-09 00:00:00 | 499.94 | 6
2024-09-06 00:00:00 | 409.94 | 7
2024-08-30 00:00:00 | 349.99 | 8
I should note that this uses RANK, one of the really helpful ranking functions, inside a window function.
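A note on ties: RANK gives tied rows the same rank and then skips ahead (1, 1, 3), DENSE_RANK doesn't skip (1, 1, 2), and ROW_NUMBER numbers every row uniquely regardless of ties. A sketch comparing them, assuming the same DailySales CTE as above:

```sql
SELECT
  sales_date,
  daily_total_sales,
  RANK()       OVER (ORDER BY daily_total_sales DESC) AS sales_rank,  -- ties share a rank; next rank skips ahead
  DENSE_RANK() OVER (ORDER BY daily_total_sales DESC) AS dense_rank,  -- ties share a rank; no gaps
  ROW_NUMBER() OVER (ORDER BY daily_total_sales DESC) AS row_num      -- unique per row; ties broken arbitrarily
FROM DailySales;
```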
LAG Analysis
Let's do more with our date_trunc CTEs. Now that we have our data by day, we can use a window function to calculate changes between these groups. For example, we can look at the difference in sales from the prior day. In this example, LAG looks at the sales date and creates a comparison with the previous day.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
LAG(daily_total_sales) OVER (
ORDER BY sales_date
) AS previous_day_sales,
daily_total_sales - LAG(daily_total_sales) OVER (
ORDER BY sales_date
) AS sales_difference
FROM
DailySales
ORDER BY
sales_date;
sales_date | daily_total_sales | previous_day_sales | sales_difference
---------------------+-------------------+--------------------+------------------
2024-08-21 00:00:00 | 349.98 | |
2024-08-22 00:00:00 | 899.98 | 349.98 | 550.00
2024-08-23 00:00:00 | 34.98 | 899.98 | -865.00
2024-08-24 00:00:00 | 89.99 | 34.98 | 55.01
2024-08-25 00:00:00 | 149.99 | 89.99 | 60.00
2024-08-26 00:00:00 | 64.98 | 149.99 | -85.01
LEAD works the same way, looking forward in the data set.
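For instance, swapping LAG for LEAD in the previous query (same DailySales CTE assumed) compares each day to the following day instead:

```sql
SELECT
  sales_date,
  daily_total_sales,
  LEAD(daily_total_sales) OVER (ORDER BY sales_date) AS next_day_sales,
  LEAD(daily_total_sales) OVER (ORDER BY sales_date) - daily_total_sales AS next_day_difference
FROM DailySales
ORDER BY sales_date;
```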
Using our same day groups, we can also make a rolling average. Here the AVG function takes a frame, ROWS BETWEEN 6 PRECEDING AND CURRENT ROW, to produce a rolling 7-day sales average.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
AVG(daily_total_sales) OVER (
ORDER BY sales_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_average_7_days
FROM
DailySales
ORDER BY
sales_date
LIMIT 10;
sales_date | daily_total_sales | rolling_average_7_days
---------------------+-------------------+------------------------
2024-08-21 00:00:00 | 349.98 | 349.9800000000000000
2024-08-22 00:00:00 | 899.98 | 624.9800000000000000
2024-08-23 00:00:00 | 34.98 | 428.3133333333333333
2024-08-24 00:00:00 | 89.99 | 343.7325000000000000
2024-08-25 00:00:00 | 149.99 | 304.9840000000000000
2024-08-26 00:00:00 | 64.98 | 264.9833333333333333
2024-08-27 00:00:00 | 249.98 | 262.8400000000000000
2024-08-28 00:00:00 | 129.99 | 231.4128571428571429
2024-08-29 00:00:00 | 179.98 | 128.5557142857142857
2024-08-30 00:00:00 | 349.99 | 173.5571428571428571
(10 rows)
The NTILE function is a window function in SQL that divides a result set into a specified number of roughly equal parts, known as tiles or buckets. By assigning a tile number to each row, NTILE helps categorize and analyze data distribution within a dataset. This function is particularly useful in statistical and financial analysis for understanding how data is distributed across different segments, identifying trends, and making comparisons between groups with different characteristics.
For example, using NTILE(4) divides the data into four quartiles, ranking each row into one of four groups. The descending ORDER BY makes sure that quartile 1 is the top quarter of sales days, and so on.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
NTILE(4) OVER (
ORDER BY daily_total_sales DESC
) AS sales_quartile
FROM
DailySales
ORDER BY
sales_date;
sales_date | daily_total_sales | sales_quartile
---------------------+-------------------+----------------
2024-09-06 00:00:00 | 409.94 | 1
2024-08-30 00:00:00 | 349.99 | 1
2024-08-21 00:00:00 | 349.98 | 2
2024-09-08 00:00:00 | 349.96 | 2
Are there a million other things you can do with Window functions? Yes! Do I think you get the gist here? Also Yes.
Ideal Postgres window functions with samples I've included here:
- SUM OVER for running totals
- FIRST_VALUE and LAST_VALUE for the first and last events per group
- RANK for ordering results
- LAG and LEAD for comparing against prior and following rows
- AVG with a row frame for rolling averages
- NTILE for bucketing rows into quartiles
Using date summaries (i.e. date_trunc with GROUP BY) in CTEs alongside window functions is really powerful and will get you quick and easy summary data out of Postgres.