Window Functions for Data Analysis with Postgres
SQL makes sense when it's working on a single row, or even when it's aggregating across multiple rows. But what happens when you want to compare between rows of something you've already calculated? Or make groups of data and query those? Enter window functions.
Window functions tend to confuse people - but they’re a pretty awesome tool in SQL for data analytics. The best part is that you don’t need charts, fancy BI tools or AI to get some actionable and useful data for your stakeholders. Window functions let you quickly:
- Calculate running totals
- Provide summary statistics for groups/partitions of data
- Create rankings
- Perform lag/lead analysis, ie comparing two separate sets of data with each other
- Compute moving/rolling averages
In this post, I will show various types of window functions and how they can apply to certain situations. I’m using a super simple e-commerce schema for this to follow along with the kinds of queries I’m going to run with window functions.
This post is available as an interactive tutorial as well in our Postgres playground.
The OVER
function
The OVER
part of the Window function is what creates the window. Annoyingly the word window appears nowhere in any of the functions. 😂 Typically the OVER part is preambled by another function, either an aggregate or mathematical function. There’s also often a frame, to specify which rows you’re looking at like ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
.
Window functions vs where clauses
Window functions kind of feel like a where clause at first, since they’re looking at a set of data. But they’re really different. Window functions are more for times when you need to look across sets of data or across groups of data. There are cases where you could use either. In general:
- Use
WHERE
clause when you need to filter rows based on a condition. - Use window functions when you need to perform calculations across rows that remain after filtering, without removing any rows from the result set.
Running totals
Here’s a simple place to get started. Let’s ask for orders, customer data, order totals, and then a running total of orders. This will show us our total orders across a date range.
SELECT
SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
order_date,
order_id,
customer_id,
total_amount
FROM
orders
ORDER BY
order_date;
running_total | order_date | order_id | customer_id | total_amount
---------------+---------------------+----------+-------------+--------------
349.98 | 2024-08-21 10:00:00 | 21 | 1 | 349.98
1249.96 | 2024-08-22 11:30:00 | 22 | 2 | 899.98
1284.94 | 2024-08-23 09:15:00 | 23 | 3 | 34.98
1374.93 | 2024-08-24 14:45:00 | 24 | 4 | 89.99
1524.92 | 2024-08-25 08:25:00 | 25 | 5 | 149.99
1589.90 | 2024-08-26 12:05:00 | 26 | 6 | 64.98
What's happening here is that each frame of data is the existing row plus the rows before. This sort of does calculations one one slice at a time, which you might see in the docs called a virtual table. Here's a diagram to get the general idea of how each frame of data is a set of rows, aggregated by the function with the SUM OVER
.
First and last values
Window functions can look at groups of data, so say a specific customer ID and give you something like their first and last order, total, for your most recent 10 orders.
SELECT
FIRST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS first_order_date,
LAST_VALUE(o.order_date) OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS last_order_date,
o.order_id,
o.customer_id,
o.order_date,
o.total_amount
FROM
orders o
ORDER BY
o.order_date DESC;
first_order_date | last_order_date | order_id | customer_id | order_date | total_amount
---------------------+---------------------+----------+-------------+---------------------+--------------
2024-08-30 17:50:00 | 2024-09-19 18:50:00 | 50 | 10 | 2024-09-19 18:50:00 | 149.98
2024-08-29 13:10:00 | 2024-09-18 14:10:00 | 49 | 9 | 2024-09-18 14:10:00 | 199.98
2024-08-28 10:20:00 | 2024-09-17 11:20:00 | 48 | 8 | 2024-09-17 11:20:00 | 139.99
2024-08-27 16:35:00 | 2024-09-16 17:35:00 | 47 | 7 | 2024-09-16 17:35:00 | 249.98
2024-08-26 12:05:00 | 2024-09-15 13:05:00 | 46 | 6 | 2024-09-15 13:05:00 | 89.98
Using date_trunc GROUP BY CTEs with Window Functions
date_trunc
is an incredibly handy Postgres function that summarizes units of time, hours, days, weeks, month. When combined with a GROUP BY
in a CTE, you can create really easy summary statistics by day, month, week, year, etc.
When you combine the date_trunc GROUP BY partitions with window functions, some pretty magical stuff happens, and you can get ready-made summary statistics straight out of your database. In my opinion this is one of the most powerful features of Postgres window functions that really gets you to the next level.
Here’s an example query that starts with a CTE, calling a date_trunc to sum orders by daily totals. The second part of the query, the window function, ranks the sales in descending order with the best day of sales.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
RANK() OVER (
ORDER BY daily_total_sales DESC
) AS sales_rank
FROM
DailySales
ORDER BY
sales_rank;
sales_date | daily_total_sales | sales_rank
---------------------+-------------------+------------
2024-09-02 00:00:00 | 2419.97 | 1
2024-09-01 00:00:00 | 1679.94 | 2
2024-08-22 00:00:00 | 899.98 | 3
2024-09-07 00:00:00 | 699.95 | 4
2024-09-10 00:00:00 | 659.96 | 5
2024-09-09 00:00:00 | 499.94 | 6
2024-09-06 00:00:00 | 409.94 | 7
2024-08-30 00:00:00 | 349.99 | 8
I should note that this uses one of the really helpful math functions, RANK
in a Window function.
LAG
Analysis
Let’s do more with our date_trunc CTEs. Now that we know we have our data by day, we can use a window function to calculate changes between these groups. For example we could look at the difference in sales from the prior day. In this example, LAG
looks at the sales date and creates a comparison with the previous day.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
LAG(daily_total_sales) OVER (
ORDER BY sales_date
) AS previous_day_sales,
daily_total_sales - LAG(daily_total_sales) OVER (
ORDER BY sales_date
) AS sales_difference
FROM
DailySales
ORDER BY
sales_date;
sales_date | daily_total_sales | previous_day_sales | sales_difference
---------------------+-------------------+--------------------+------------------
2024-08-21 00:00:00 | 349.98 | |
2024-08-22 00:00:00 | 899.98 | 349.98 | 550.00
2024-08-23 00:00:00 | 34.98 | 899.98 | -865.00
2024-08-24 00:00:00 | 89.99 | 34.98 | 55.01
2024-08-25 00:00:00 | 149.99 | 89.99 | 60.00
2024-08-26 00:00:00 | 64.98 | 149.99 | -85.01
LEAD
works the same way, looking forward in the data set.
Rolling averages
Using our same day groups we can also make a rolling average. The AVG
function takes an input ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
for a rolling 7 day sales average.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
AVG(daily_total_sales) OVER (
ORDER BY sales_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_average_7_days
FROM
DailySales
ORDER BY
sales_date
LIMIT 10;
sales_date | daily_total_sales | rolling_average_7_days
---------------------+-------------------+------------------------
2024-08-21 00:00:00 | 349.98 | 349.9800000000000000
2024-08-22 00:00:00 | 899.98 | 624.9800000000000000
2024-08-23 00:00:00 | 34.98 | 428.3133333333333333
2024-08-24 00:00:00 | 89.99 | 343.7325000000000000
2024-08-25 00:00:00 | 149.99 | 304.9840000000000000
2024-08-26 00:00:00 | 64.98 | 264.9833333333333333
2024-08-27 00:00:00 | 249.98 | 262.8400000000000000
2024-08-28 00:00:00 | 129.99 | 231.4128571428571429
2024-08-29 00:00:00 | 179.98 | 128.5557142857142857
2024-08-30 00:00:00 | 349.99 | 173.5571428571428571
(10 rows)
N-tiles with Window Functions
The NTILE
function is a window function in SQL that is used to divide a result set into a specified number of roughly equal parts, known as tiles or buckets. By assigning a unique tile number to each row, n-tiles helps categorize and analyze data distribution within a dataset. This function is particularly useful in statistical and financial analysis for understanding how data is distributed across different segments, identifying trends, and making comparisons between groups with different characteristics.
For example, using NTILE(4)
divides the data into four quartiles, ranking each row into one of four groups. The descending part here makes sure that quartile 1 is the top quarter and so on.
WITH DailySales AS (
SELECT
date_trunc('day', o.order_date) AS sales_date,
SUM(o.total_amount) AS daily_total_sales
FROM
orders o
GROUP BY
date_trunc('day', o.order_date)
)
SELECT
sales_date,
daily_total_sales,
NTILE(4) OVER (
ORDER BY daily_total_sales DESC
) AS sales_quartile
FROM
DailySales
ORDER BY
sales_date;
sales_date | daily_total_sales | sales_quartile
---------------------+-------------------+----------------
2024-09-06 00:00:00 | 409.94 | 1
2024-08-30 00:00:00 | 349.99 | 1
2024-08-21 00:00:00 | 349.98 | 2
2024-09-08 00:00:00 | 349.96 | 2
Window functions resources
We’re big fans of Postgres functions at Crunchy Data, so check out our resources on Window functions in our Postgres Tutorials:
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read