Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more
Tutorial Instructions
This article will contain spoilers both on how I solved 2022 Day 16's challenge "Probscidea Volcanium" using SQL, as well as general ideas on how to approach the problem. I recommend trying to solve it yourself first, using your favorite language. This article is delayed from the actual puzzle's release. My solutions may not be the "best" solution, as the goal is to provide a quick solution.
I am not going to lie, this challenge was a tough one. That factor, plus other life obligations, means this post is long overdue. Day 16 was near and dear to me, as it involves elephants, the Postgres mascot! In this scenario, we find ourselves trapped with a herd of elephants in a volcano, amongst a bunch of interconnected tubes, containing pressure release valves. In short, the world's most interesting Travelling Salesman Problem (TSP)!
Amongst the Postgres / SQL items used to solve this are:
→
The file_fdw extension, to read text files in from the system.
→
CTEs (Common Table Expressions) to keep our queries readable and avoid ugly sub-selects.
→
Plpgsql functions and some upserts (i.e. ON CONFLICT handlers).
→
Various string formatting functions such as substr
, string_to_array
, substring
, left
, right
, and position
→
Array items including the unnest
function and the array "overlaps" operator &&
.
→
The pg_stat_user_functions view.
Following the link to Day 16 will give a lot more details about the challenge, you may want to read that first. We are provided with an input file that looks like this:
Valve AA has flow rate=0; tunnels lead to valves DD, II, BB
Valve BB has flow rate=13; tunnels lead to valves CC, AA
Valve CC has flow rate=2; tunnels lead to valves DD, BB
Valve DD has flow rate=20; tunnels lead to valves CC, AA, EE
Valve EE has flow rate=3; tunnels lead to valves FF, DD
Valve FF has flow rate=0; tunnels lead to valves EE, GG
Valve GG has flow rate=0; tunnels lead to valves FF, HH
Valve HH has flow rate=22; tunnel leads to valve GG
Valve II has flow rate=0; tunnels lead to valves AA, JJ
Valve JJ has flow rate=21; tunnel leads to valve II
The goal in Part One is to reduce as much pressure as possible in 30 minutes. Times are cumulative, so opening valve BB releases 13 units of pressure for every minute it is open after that. Travelling from one valve (aka node) to another costs one minute, and opening a valve also costs one minute. The goal is to find the optimal path to release the most pressure in 30 minutes, when beginning at node AA.
We will start with our usual steps to transform the text file above into something more SQL friendly:
CREATE EXTENSION if not exists file_fdw;
CREATE SERVER if not exists aoc2022 FOREIGN DATA WRAPPER file_fdw;
DROP SCHEMA if exists elephant CASCADE;
CREATE SCHEMA elephant;
SET search_path=elephant,public;
Let's start by parsing those lines using regular expressions into separate values. In the block below, we are converting our text file into separate rows in the tubes table. First we use some regexes to pull information out of each line into a CTE named y
which contains an array of valves connected to the source one:
WITH x AS (SELECT * FROM aoc_day16)
,y AS (SELECT line
,substr(line,7,2) AS valve
,substring(line from 'rate=(\\d+)')::int AS rate
,string_to_array(substring(line from 'valves? (.+)'),', ') AS targets
FROM x)
SELECT valve, rate, targets FROM y LIMIT 10;
valve | rate | targets
-------+------+------------
AA | 0 | {DD,II,BB}
BB | 13 | {CC,AA}
CC | 2 | {DD,BB}
DD | 20 | {CC,AA,EE}
EE | 3 | {FF,DD}
FF | 0 | {EE,GG}
GG | 0 | {FF,HH}
HH | 22 | {GG}
II | 0 | {AA,JJ}
JJ | 21 | {II}
Before we go any further, we need to simplify things. If you notice that some of the valves have a zero rate - they can be expunged from the final list. For example, here are valves EE and FF:
EE | 3 | {FF,DD}
FF | 0 | {EE,GG}
So EE leads to FF and DD, and opening valve EE will release 3 pressure. Valve FF leads to EE and GG, but releases no pressure. So there is no reason to ever stop at FF. In which case, the only reason for its existence is as a way to connect one valve to another. In this case, we can shorten EE -> FF -> GG to the path EE -> GG with an additional travel cost. So what we want to do is break things into simple paths from one valve to another, along with its travel cost. We discard any that end in a valve that has a zero value, as it does not need opening. We need a function that does this by walking through every single valve and finding the shortest path from one valve to another. Before running the function, we want to have a place to store the results, so let's create a simple table, along with a couple of indexes:
CREATE UNLOGGED TABLE tubes (start text, cost smallint, finish text, rate int);
CREATE UNIQUE INDEX path_unique ON tubes (start, finish) INCLUDE (cost, rate);
CREATE INDEX tube_index ON tubes (cost,start,finish) include (rate);
We use the INCLUDE
clause to ensure that while we have a unique index on the start and finish columns, we also store the other two columns as well, so we can use an index-only scan on that index as we need to grab all the columns later on.
The function itself:
CREATE or REPLACE function lesspaths()
returns text language plpgsql as $$
DECLARE
myrec record; myrec2 record; round int = 0; pathmade bool;
BEGIN
/* Populate the first round */
WITH x AS (SELECT * FROM aoc_day16)
,y AS (SELECT line
,substr(line,7,2) AS valve
,substring(line from 'rate=(\d+)')::int AS rate
,string_to_array(substring(line from 'valves? (.+)'),', ') AS targets
FROM x)
,z AS (SELECT valve, rate, unnest(targets) AS ending FROM y)
INSERT INTO tubes
SELECT valve, 1, ending,
(select y.rate from y where y.valve = z.ending limit 1) rate
FROM z;
LOOP
round = round + 1;
pathmade = false;
FOR myrec IN SELECT * FROM tubes WHERE cost = round LOOP
FOR myrec2 IN SELECT * FROM tubes WHERE cost=1 LOOP
/* AABB joins with BBCC to become AACC */
IF myrec.finish = myrec2.start AND myrec.start <> myrec2.finish
THEN
INSERT INTO tubes
SELECT myrec.start, myrec.cost+1, myrec2.finish, myrec2.rate
ON CONFLICT DO NOTHING;
IF FOUND THEN pathmade = true; END IF;
END IF;
END LOOP;
END LOOP;
IF NOT pathmade THEN RAISE NOTICE 'Leaving after % rounds', round; EXIT; END IF;
END LOOP;
DELETE FROM tubes WHERE rate=0;
RETURN 'Done with lesspaths';
END;
$$;
Let's break it down:
CREATE or REPLACE function lesspaths()
returns text language plpgsql as $$
DECLARE
myrec record; myrec2 record; round int = 0; pathmade bool;
BEGIN
This is all basic setup - declare the language, what the function returns, and what variables we are going to use.
-- Populate the first round
WITH x AS (SELECT * FROM aoc_day16)
,y AS (SELECT line
,substr(line,7,2) AS valve
,substring(line from 'rate=(\d+)')::int AS rate
,string_to_array(substring(line from 'valves? (.+)'),', ') AS targets
FROM x)
,z AS (SELECT valve, rate, unnest(targets) AS ending FROM y)
INSERT INTO tubes
SELECT valve, 1, ending,
(select y.rate from y where y.valve = z.ending limit 1) rate
FROM z;
In this block, we are converting our text file into separate rows in the tubes table. First we use some regexes to pull information out of each line into a CTE named y
. Then we turn around and use the unnest
function to break the array we built apart, such that we have a mapping from one valve to one other valve. We then insert that into the tubes
table, using a simple sub-select (select y.rate from y where y.valve = z.ending limit 1)
to record the rate of the "ending" valve for each pair. At this point, the tubes table looks like this:
DD | 1 | CC | 2
DD | 1 | AA | 0
DD | 1 | EE | 3
EE | 1 | FF | 0
EE | 1 | DD | 20
FF | 1 | EE | 3
FF | 1 | GG | 0
Now the tricker bit. At a high level, we are going to walk through every new row, see if we can extend the list of valves to add another on the end, and insert it as a new row:
LOOP
round = round + 1;
pathmade = false;
FOR myrec IN SELECT * FROM tubes WHERE cost = round LOOP
FOR myrec2 IN SELECT * FROM tubes WHERE cost=1 LOOP
We start a generic loop, that keeps going until we break out of it with an EXIT
command. The round
variable is a proxy for the number of steps it takes to get from one valve to another. As seen above, everything in the tubes table is currently one step from each other. Therefore, the first time through we are looping through all rows in the tubes table, and putting each row into the myrec
variable. Then for each of those, we walk through all the initial items, in other words, valves that are one step from each other with a cost of 1.
-- AABB joins with BBCC to become AACC
IF myrec.finish = myrec2.start AND myrec.start <> myrec2.finish
THEN
INSERT INTO tubes
SELECT myrec.start, myrec.cost+1, myrec2.finish, myrec2.rate
ON CONFLICT DO NOTHING;
END IF;
Next we check if the current path is joinable to another node. In other words, if we have valve A going to valve D, and we know that valve D goes to valve Q, we know that travel is possible from valve A to valve Q. We also add in the myrec.start <> myrec2.finish
qualifier, so we don't try to make a A -> D -> A path. For each of the new paths we find, we attempt to insert it into our tubes table. Recall that the tubes table has a unique constraint on start, finish
. That allows us to use ON CONFLICT DO NOTHING
to ensure only a single row exists for each pair of valves. Because we are walking them in order of travel cost, we are finding the shortest distance between any two valves!
Wrapping up:
IF FOUND THEN pathmade = true; END IF;
END IF;
END LOOP; -- myrec2
END LOOP; -- myrec
IF NOT pathmade THEN
RAISE NOTICE 'Leaving after % rounds', round; EXIT;
END IF;
END LOOP;
We use IF FOUND
to tell if the INSERT statement above it actually was able to add a row or not. This allows us to keep looping at higher and higher travel costs until there are no new paths. So once a round has no new INSERTS at all, pathmade
stays false, and we can then leave the entire main loop with the EXIT
command, as we have found all combinations of valves.
DELETE FROM tubes WHERE rate=0;
RETURN 'Done with lesspaths';
END;
$$;
Our final command at the end of the function is to remove any "dead" paths, in which the ending valve has a rate of 0. As you recall, there is no reason to ever end up on these nodes, so we can remove them. At this point, the tubes table looks like this:
start | cost | finish | rate
------+------+--------+------
AA | 1 | BB | 13
AA | 2 | CC | 2
AA | 1 | DD | 20
AA | 2 | EE | 3
AA | 5 | HH | 22
AA | 2 | JJ | 21
BB | 1 | CC | 2
BB | 2 | DD | 20
BB | 3 | EE | 3
BB | 6 | HH | 22
BB | 3 | JJ | 21
All of this is setup for the real problem. The lesspaths
function only takes about half a second to run:
\timing on
SELECT lesspaths();
NOTICE: Leaving after 17 rounds
lesspaths
--------------------
Done with lesspaths
(1 row)
Time: 530.416 ms
The actual input file had 64 lines, which led to 4032 rows in the tubes table before the final deletion, and 945 rows after the deletion. Still a lot of paths, but manageable as we will see in the next function:
CREATE OR REPLACE FUNCTION walkcost(startvalve TEXT, timeleft SMALLINT, openvalves TEXT)
returns smallint IMMUTABLE language plpgsql AS $$
DECLARE
myrec RECORD; thiscost SMALLINT; newtime SMALLINT; bestcost SMALLINT = 0;
BEGIN
FOR myrec IN
SELECT finish,cost,rate FROM tubes
WHERE start = startvalve AND timeleft > cost+1
AND position(' '||finish||' ' IN openvalves) = 0
LOOP
newtime = timeleft - myrec.cost - 1;
thiscost = (newtime * myrec.rate) +
walkcost(myrec.finish, newtime, openvalves || ' ' || myrec.finish || ' ');
IF thiscost > bestcost THEN bestcost = thiscost; END IF;
END LOOP;
RETURN bestcost;
END;
$$;
A small function, but it does a lot. It starts at a single valve (in our case AA), with a known amount of time left, and a list of which valves are already open. We then walk through all possible valves reachable from the current one, and make a recursive call back to the same function to find the "downstream" cost. For each possible ending valve, we compute which one is the best (has reduced the most pressure), and return that back to the caller. Let's break it down a bit:
CREATE OR REPLACE function walkcost
(startvalve text, timeleft smallint, openvalves text)
returns smallint immutable language plpgsql as $$
DECLARE
myrec record; thiscost smallint; newtime smallint; bestcost smallint = 0;
BEGIN
Standard setup of setting the name of the function, its arguments, and declaring the additional variables to use. The first argument startvalve
is the current valve we are starting from. The second argument timeleft
is the amount of time left in minutes, which decreases as we travel from valve to valve, and also by one minute each time we open a valve. These two costs are combined, as we never travel without also opening a valve. The third argument openvalves
is a simple text string we will use to keep track of which valves are currently open. We also declare four more variables: myrec
stores a row from the tubes table, thiscost
is the total cost for opening a valve and then going to a specific path, newtime
helps adjust the timeleft
, and bestcost
stores the winning value of thiscost
in each loop.
FOR myrec IN SELECT finish,cost,rate FROM tubes
WHERE start = startvalve AND timeleft > cost+1
AND position(' '||finish||' ' IN openvalves) = 0
LOOP
Here we start the main loop. We want to find all rows that start with the current valve, and then we use timeleft > cost+1
to rule out visiting any valves that will cause us to run out of time before opening them. We also want to exclude any previously-opened valves, by searching our passed in string openvalves
. We pad the value with a space on both sides to make sure we have an exact match. For example, the string AA BB FF
will exclude any rows ending in AA, BB, or FF, as the position function will return a non-zero value. So at this point, we have all possible paths and we can start looping through them one by one.
newtime = timeleft - myrec.cost - 1;
thiscost = (newtime * myrec.rate) + walkcost
(myrec.finish, newtime, openvalves || ' ' || myrec.finish || ' ');
IF thiscost > bestcost THEN bestcost = thiscost; END IF;
In this section, we first decrease the time based on the travel cost to the new valve, plus an additional minute to account for the time to open the valve. Next, we compute the total cost of opening this valve with newtime * myrec.rate
. We add to that the cost of opening other valves after this one, by calling the function recursively and passing in our current ending valve as the new starting valve, the newly-decreased time, and the list of open valves with our current ending one appended to the list. The recursion happens until there are no more paths to consider, at which point the "best" values will bubble up and get returned to each level.
Once that has returned, the line IF thiscost > bestcost THEN bestcost = thiscost; END IF;
stores the winning value among all the finish
valves in the current loop. Then we return that cost.
This is a crude but very effective search of all possibilities of travel through the tubes, and at the end it will generate the final answer. The test input returns very quick and gives an answer of 1651. The real input still returns quite fast, given all the work it is doing. Before we run it, let's vacuum the table to increase the odds of getting some index-only scans, and turn on track_functions
to get some stats about the calls:
VACUUM FREEZE tubes;
SET track_functions = 'pl';
/* 30 takes too long in a browser, so do 12 */
SELECT walkcost('AA', 12::smallint, ' ');
walkcost
---------
354
(1 row)
Time: 640.884 ms
Not too shabby at all! Let's peek at the pg_stat_user_functions
table to see how many times we recursively called this function (confusingly enough this is not a SQL 'recursive function', just a function that has been called recursively!)
SELECT * FROM pg_stat_user_functions WHERE funcname = 'walkcost';
funcid | schemaname | funcname | calls | total_time | self_time
--------+------------+----------+--------+------------+-----------
1054235 | elephant | walkcost | 233941 | 2783.138 | 2783.138
(1 row)
The function ran over 230 thousand times in less than three seconds! Or to put it another way, the function ran 84,000 times per second. Not bad. Part of the AOC challenge is optimizing your solution, both to make sure things run in a reasonable time frame, and because optimizing is fun. I did play with some ideas for optimizing the function above, but the only one that mattered was changing from int to smallint. Things that did not work:
Part Two, as always, introduces a significant wrinkle to the original challenge. This time, we learn that we can train an elephant to also open valves. So the new goal is to find the maximum pressure released in 26 minutes if you and the elephant are *simultaneously* going through and opening valves! At first, this caused a little bit of a panic, at the thought of tracking two actors at once, which arrive at valves at different times, and keeping a common state of which valves are open.
However, after some thinking about this, I found a much better approach (this is a big spoiler, so see if you can think of a better solution yourself before going on).
So rather than a running total of the best time, we are going to need to store all the possible solutions and then pick out the top two such that no valve is opened twice. Our previous solution will need some modification. First, we need a place to store all the paths, so let's create a table for that:
CREATE UNLOGGED TABLE pathscore (
vpath text,
score smallint,
apath text[]
);
As usual, this will be an unlogged table as it is for holding temporary information. In this case, we will store the path in vpath
, the total score
, and later on, an sorted array of the vpath information. Next we need a function to walk through and populate the table:
CREATE OR REPLACE FUNCTION walk_with_an_elephant(
startvalve text, timeleft smallint, openvalves text, currentcost int
)
RETURNS void VOLATILE LANGUAGE plpgsql
AS $$
DECLARE
myrec RECORD; newtime smallint;
BEGIN
/* IF timeleft < 7 AND currentcost >= 100
AND length(openvalves) BETWEEN 20 AND 25 THEN */
INSERT INTO pathscore SELECT openvalves, currentcost;
/* END IF; */
FOR myrec IN SELECT finish, cost, rate FROM tubes
WHERE start = startvalve AND timeleft > cost+1
AND position(' '||finish||' ' IN openvalves) = 0
LOOP
newtime = timeleft - myrec.cost - 1;
PERFORM walk_with_an_elephant(
myrec.finish,
newtime,
openvalves || ' ' || myrec.finish || ' ',
currentcost + (newtime * myrec.rate)
);
END LOOP;
RETURN;
END;
$$;
This function requires some explanations. At the top, we have:
IF timeleft < 7 AND currentcost >= 1000
AND length(openvalves) BETWEEN 20 AND 25 THEN
INSERT INTO pathscore select openvalves, currentcost;
END IF;
This is where we are storing the current path, represented by the same openvalves
text string we used before, as well as the current score, which is an additional argument to our function. Because we want to limit the size of our pathscore
table as much as possible, and because each INSERT slows things down quite a bit, there are some optimizations I made to speed things up. Note that these are commented out for our test data:
timeleft < 7
ensures that we only store a path when we are nearing the end of our time. A manual look at the input file shows that all the costs for opening the valves are on the same order of magnitude, so there is no situation where ourselves or the elephant would ever open a few valves and then stop. All these optimizations required some deduction and some trial-and-error; but in a strict critical production situation we would want to loosen them up and be 100% sure we don't miss one of the best paths.currentcost >= 1000
ensures that we don't store any matching paths that are less than 1000 in score. Running SELECT walkcost('AA', 26::smallint, ' ')
gives an answer of 1199, so it seems pretty safe to assume that there are at least two non-overlapping paths that both have a score of over 1000.BETWEEN 20 and 25
is a shortcut to exclude any paths that are too short or too long. In this case, as each valve is two letters, and surrounded by a space on each side, this is instructing the function to only insert paths that are between 6 and 8 valves long. This makes sense as we know that each valve costs at least one minute to move to, one to open, and there are 26 minutes in total.The rest of the function is straightforward:
newtime = timeleft - myrec.cost - 1;
PERFORM walk_with_an_elephant(
myrec.finish,
newtime,
openvalves || ' ' || myrec.finish || ' ',
currentcost + (newtime * myrec.rate)
);
END LOOP;
Once again, we compute the new amount of time left to pass recursively into the function. This time, we don't need to return anything back up the stack, so we use PERFORM
to call the function and ignore any result. This is a quirk of plpgsql that requires replacing SELECT with PERFORM if the return value is ignored. As before, we call the function with the new starting valve, the new time left, and a list of open valves. We also add in the current cost, as we need to know that as well.
Running it populates our table, and only takes 850ms:
SELECT walk_with_an_elephant('AA', 26::smallint, '', 0);
Even with all our optimizations, the pathscore table contains 28,719 rows! It looks like this:
SELECT * FROM pathscore ORDER BY random() LIMIT 5;
vpath | score | apath
------------------------+-------+-------
GG DT KV HR NS RI | 1001 | ☃
KV NS HR YA RI | 1099 | ☃
KV DT SU GG NS HR | 1015 | ☃
RI KV YA NS SU GG | 1031 | ☃
RI NS YA HR KV RP | 1036 | ☃
(5 rows)
We need to be able to compare those vpaths and find ones that do not overlap each other. Sounds like a job for arrays! Let's populate the apath
column. We also want to sort the values, as without sorting the comparison is much slower.
UPDATE pathscore SET apath = array(
SELECT unnest(string_to_array(right(left(vpath,-1),-1), ' ')) ORDER BY 1
);
Now our pathscore table looks a little different:
SELECT * FROM pathscore ORDER BY random() LIMIT 5;
vpath | score | apath
------------------------+-------+---------------------
GG SU DT NS KV RI | 1075 | {DT,GG,KV,NS,RI,SU}
RI KV NS HR SU GG | 1016 | {GG,HR,KV,NS,RI,SU}
SU DT KV NS YA HR | 1044 | {DT,HR,KV,NS,SU,YA}
KV NS YA GG RP | 1095 | {GG,KV,NS,RP,YA}
GG DT KV NS RI VT | 1014 | {DT,GG,KV,NS,RI,VT}
At this point, we are ready to solve the challenge. All we need is to find the two top highest paths which do not intersect each other. Here is some SQL to do that:
WITH x AS (select apath::text[], max(score) AS score FROM pathscore GROUP BY 1)
SELECT max(t1.score+t2.score) AS aoc_2022_day_16_part_two
FROM x t1, x t2
WHERE NOT t1.apath && t2.apath;
First we build the CTE named x
by looking at each possible combination of valves, and getting the top scoring one from each of those sets. In other words, the visiting order does not matter, only which valves opened. Then we query our CTE and join it to itself, such that no two paths have any of the same items - which is what the &&
operator does. Once the WHERE clause gives us a combined list of all possible non-intersecting paths, we add up their costs and find the highest possible score:
aoc_2022_day_16_part_two
-------------------------
1707
(1 row)
Time: 354.400 ms
Voila! The answer to how much the pressure gets reduced when a human and an elephant work together to solve the TSP! :)
Loading terminal...
Loading terminal...