How to Solve Advent of Code 2022 Using Postgres - Day 7
Spoiler Alert
This article will contain spoilers both on how I solved 2022 Day 7's challenge "No Space Left On Device" using SQL, as well as general ideas on how to approach the problem. I recommend trying to solve it yourself first, using your favorite language. This article is delayed from the actual puzzle's release. Also note that my solutions are seldom going to be the "best" solutions - they were solved as quickly as possible, and these articles will show my first solutions, with some minor reformatting and cleaning up.
Hands on Tutorial
We've also loaded a tutorial for Day 7's challenge if you want to try it with a pre-loaded data set.
AOC Day 7
This challenge was an interesting one, as it required one to simulate walking through a file system. The things used to solve it include:
- file_fdw, a built-in foreign data wrapper for Postgres
- Recursive queries!
- Text arrays
- Functions
array_length
,string_to_array
,array_to_string
, andunnest
- Functions
regexp_count
,regexp_replace
, andregexp_substr
(Postgres 15+ only) - Function
regexp_match
(Postgres 14 and older if needed)
The challenge file looks like this:
$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k
The goal is to find all directories with a total byte size of 100,000 or less, then sum up all those sizes. Directories can contain other directories, of course. First, some optimizations. Because we only care about the files in our current directory, the lines starting with "dir" can be completely ignored. The same thing for any lines with "$ ls". All we need to do is track our current directory, and add the size of any files contained within them to both the current directory, and any parent directories as well. It is also true that each file only appears once.
Before we dive in, the usual setup, plus a sequence:
CREATE EXTENSION if not exists file_fdw;
CREATE SERVER if not exists aoc2022 foreign data wrapper file_fdw;
DROP FOREIGN TABLE if exists aoc_day7;
CREATE FOREIGN TABLE aoc_day7 (line text)
SERVER aoc2022 options(filename '/tmp/aoc2022.day7.input');
DROP SEQUENCE if exists aoc;
CREATE SEQUENCE if not exists aoc;
Here is the final solution: a single SQL statement. Fear not, we will break it down:
-- Version for Postgres 15 and higher:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || regexp_substr(s.line, 'cd (.+)', 1,1,'',1))
ELSE currdir
END AS currdir,
CASE WHEN regexp_count(s.line, '\d')>0
THEN regexp_substr(s.line, '(\d+)')::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
SELECT sum(totalsize) FROM w WHERE totalsize <= 100000;
-- Version for Postgres 14 and older:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
SELECT sum(totalsize) FROM w WHERE totalsize <= 100000;
The easiest part of that statement is to grab the file sizes. We do not care about the name of the file, we simply want to find any line that has a number in it, then grab the number and store it away. That's the job of this bit of code:
CASE WHEN regexp_count(s.line, '\d')>0 -- or: WHEN s.line ~ '\d'
THEN regexp_substr(s.line, '(\d+)')::int -- or: THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
The regexp_count
simply spots line from our source that have a number anywhere in them, then we pull that number out with the handy regexp_substr
function.
A trickier bit is keeping track of not just what directory we are in, but what ones we recently visited, such that the cd ..
commands work properly. A text array is a nice way to store the list of directories. We initialized it at the top of the recursive CTE with null::text[] AS currdir
. Now we have to be able to handle these three variants of cd:
cd /
- change to the root of the file systemcd x
- change to the directory "x"cd ..
- change to the parent directory
All of these require changing our currdir variable. The simplest is cd /
in which case we empty out our array of directories we care about: CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
For the case where we are changing to a new directory, we want to add this directory to our current list, by sticking it after a slash, just like a regular Unix file system. If the array is empty, we simply append to an empty string:
WHEN substr(s.line,1,4)='$ cd' THEN
currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/'
|| regexp_substr(s.line, 'cd (.+)', 1,1,'',1)) -- or: || '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
For the final case, cd ..
, we need to remove the current directory from our array. We need a second check to see if the array only has a single item, in which case we simply set it empty. If it is not empty, we remove the final element from our path by casting the array to a string, removing the final "slash-something", and putting it back into the array again.
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
All of this means that as we are walking line by line, we are maintaining a list of interesting paths in currdir
like so:
Line | currdir |
---|---|
cd / | {}::text[] |
cd abc | /abc |
cd def | /abc, /abc/def |
cd .. | /abc |
cd ghi | /abc, /abc/ghi |
cd jkl | /abc, /abc/ghi, /abc/ghi/jkl |
Why all this trouble? Well, we cannot come back to a line easily, so anytime we find a file size, we need to add it to the list for every directory in the whole tree leading up to our current directory.
We need to walk through our input line by line (order matters!) and keep track of a variable that changes over time. Without a language like PL/pgSQL, the only way to do that via SQL is with a recursive CTE. The name is a little misleading, as they are iterative, not recursive, but they get the job done. A recursive CTE has two parts, joined by a UNION
or UNION ALL
. The first part (or first query) sets everything up by reading data from somewhere, and crafting a careful SELECT statement. The second part (or second query) runs over and over until it returns nothing else. Because the second query is of the form SELECT * FROM my_recursive_query
is it able to maintain state by changing the data in the SELECT statement each time it runs. In our case, we change the currdir each time, and set a unique currnum as required.
Confusingly, a recursive query must appear as the first item in a CTE - in other words, the top item of the WITH list. But it can reference other parts of the WITH clause below it (notice that both the setup and iterative sections reference table s
). It also needs to be referenced by other parts of the WITH clause - in our case by the table t
. Let's view four of the tables in action. The final one that produces the output is t
, which pulls from r
. Table r
pulls from s
and r
. Table s
pulls from setup
and aoc_day7
:
-- Version for Postgres 14 and older:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
SELECT * FROM t LIMIT 10;
currdir | currnum
-----------+----------
{} | 14848514
{} | 8504156
{/a} | 29116
{/a} | 2557
{/a} | 62596
{/a,/a/e} | 584
{/d} | 4060174
{/d} | 8033020
{/d} | 5626152
{/d} | 7214296
(10 rows)
So by table t
, we have a entry for every file's size, and a corresponding list (or text array) of every directory that the size should be added to. Next, in table v
, we use the unnest
function to expand the array, so that each item gets assigned. Notice how /a/e
is now on its own row:
-- Version for Postgres 14 and older:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v AS (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
SELECT * FROM v LIMIT 10;
dir | size
------+---------
/a | 29116
/a | 2557
/a | 62596
/a | 584
/a/e | 584
/d | 4060174
/d | 8033020
/d | 5626152
/d | 7214296
(9 rows)
Next, in table w
, we sum up the sizes and use a group by to get a grand total of the total size of all files within each directory:
-- Version for Postgres 14 and older:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
SELECT * FROM w LIMIT 10;
dir | totalsize
------+-----------
/a | 94853
/d | 24933642
/a/e | 584
(3 rows)
Then, we do our final query, and add up all the ones that are <= 100,000 bytes:
-- Version for Postgres 14 and older:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
SELECT sum(totalsize) FROM w WHERE totalsize <= 100000;
sum
-------
95437
Whew! That was a lot. Now for Part 2! In this section, we need to find a certain directory. The total disk space available is 70,000,000 bytes, and we need to have at least 30,000,000. So we need to figure out how much we are using, and the smallest file we can remove to make sure we have at least 30M bytes.
We can re-use almost all of the previous CTE, and add in four new lines:
,fs AS (SELECT 70000000::int AS maxsize)
,unused AS (SELECT fs.maxsize - totalsize AS space FROM w, fs WHERE dir = '/')
,needed AS (SELECT 30000000 - space AS neededbytes FROM unused)
SELECT min(totalsize) FROM w, needed WHERE totalsize >= neededbytes;
We have total space of seventy million bytes, so this is a constant we add as table fs
. Then we can grab the total disk space as the sum of all entries in the filesystem /
, and subtract that from our new constant to get the total used space
in table unused
. After that, we figure out how much space we need to have, and finally, we find the smallest entry that meets that criteria! Let's end with our final SQL query for part two:
-- Version for Postgres 14 and older:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
,fs AS (SELECT 70000000::int AS maxsize)
,unused AS (SELECT fs.maxsize - totalsize AS space FROM w, fs WHERE dir = '/')
,needed AS (SELECT 30000000 - space AS neededbytes FROM unused)
SELECT min(totalsize) FROM w, needed WHERE totalsize >= neededbytes;
min
----------
24933642
Note: the sample data set will give a null for this query: only a larger data set will give the query below. Try it on the playground!
Related Articles
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read
- Smarter Postgres LLM with Retrieval Augmented Generation
6 min read
- Postgres Partitioning with a Default Partition
16 min read
- Iceberg ahead! Analyzing Shipping Data in Postgres
8 min read