Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more
Tutorial Instructions
We're going to dig into SQL with the public dataset of US births from 2000-2014. But first we're going to work on creating the dataset in a more Postgres centric way with a new table and the date format:
CREATE TABLE births (
id serial,
day date,
births int
);
INSERT INTO births (day, births)
SELECT make_date(year, month, date_of_month),
births
FROM US_births_20002014_SSA;
Now, lets try out first query:
SELECT *
FROM births
LIMIT 5;
Now let’s start to roll up births by week.
SELECT date_trunc('week', day) week,
sum(births) births
FROM births
GROUP BY 1
ORDER BY week DESC;
The output of your query should look something like this:
week | births
------------------------+--------
2003-08-04 00:00:00+00 | 83402
2002-03-04 00:00:00+00 | 77719
2006-03-27 00:00:00+00 | 79467
2011-09-05 00:00:00+00 | 78593
2007-10-22 00:00:00+00 | 83645
We propose a solution using both CTEs and window functions. A CTE is a common table expression that allows you to split a complex query into different named parts and reference those parts later in the query. The WITH
statement is how you define CTEs, in our case weekly_births
. A Window Function lets you divide your result set into frames (i.e. distinct groups) to and run functions to compare between frames. In this case, we are using the lag(births, 1) OVER (ORDER BY WEEK DESC) prev_births
as a window function to return the previous week’s births
values. We can see this in action via this result set, which aggregates all births in a week, then collects the previous week as well:
WITH weekly_births AS
(
SELECT date_trunc('week', day) week,
sum(births) births
FROM births
GROUP BY 1
)
SELECT week,
births,
lag(births, 1) OVER (
ORDER BY week DESC
) prev_births
FROM weekly_births;
We could add another level to the CTE to find the two weeks in this set with the greatest difference:
WITH weekly_births AS
(
SELECT date_trunc('week', day) week,
sum(births) births
FROM births
GROUP BY 1
),
weekly_with_prev AS (
SELECT week,
births,
lag(births, 1) OVER (
ORDER BY week DESC
) prev_births
FROM weekly_births
)
SELECT * FROM weekly_with_prev
WHERE prev_births IS NOT NULL AND
week >= '2000-01-01'
ORDER BY abs(births - prev_births) DESC
LIMIT 1;
Loading terminal...
Loading terminal...