Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

Tutorial Instructions

Advent of Code - Day 7

AOC Day 7

This challenge was an interesting one, as it required one to simulate walking through a file system. The things used to solve it include:

The challenge file looks like this:

$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k

Data is in a single table and single column line'.

select * from aoc_day7 limit 10;

The goal is to find all directories with a total byte size of 100,000 or less, then sum up all those sizes. Directories can contain other directories, of course. First, some optimizations. Because we only care about the files in our current directory, the lines starting with "dir" can be completely ignored. The same thing for any lines with "$ ls". All we need to do is track our current directory, and add the size of any files contained within them to both the current directory, and any parent directories as well. It is also true that each file only appears once.

First, we’ll make a sequence.

CREATE SEQUENCE if not exists aoc;

Here is the final solution: a single SQL statement. Fear not, we will break it down:

WITH recursive r AS ( SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum FROM s where id = 1 UNION SELECT s.id, s.line, r.nextid+1, CASE WHEN s.line = '$ cd /' THEN '{/}'::text[] WHEN substr(s.line,1,7)='$ cd ..' THEN CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[] ELSE string_to_array( regexp_replace( array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END WHEN substr(s.line,1,4)='$ cd' THEN currdir || ( coalesce(currdir[ array_length(currdir,1) ], '') || '/' || (regexp_match(s.line, 'cd (.+)'))[1] ) ELSE currdir END AS currdir, CASE WHEN s.line ~ '\d' THEN ((regexp_match(s.line, '(\d+)'))[1])::int ELSE 0 END AS currnum FROM r, s WHERE s.id = nextid ) ,setup AS (SELECT setval('aoc',1,false)) ,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7) ,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0) ,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t) ,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1) SELECT sum(totalsize) FROM w WHERE totalsize <= 100000;

The easiest part of that statement is to grab the file sizes. We do not care about the name of the file, we simply want to find any line that has a number in it, then grab the number and store it away. That's the job of this bit of code:

    CASE WHEN regexp_count(s.line, '\d')>0      -- or: WHEN s.line ~ '\\d'
      THEN regexp_substr(s.line, '(\d+)')::int  -- or: THEN ((regexp_match(s.line, '(\\d+)'))[1])::int
      ELSE 0
    END AS currnum

The regexp_count simply spots line from our source that have a number anywhere in them, then we pull that number out with the handy regexp_substr function.

A trickier bit is keeping track of not just what directory we are in, but what ones we recently visited, such that the cd .. commands work properly. A text array is a nice way to store the list of directories. We initialized it at the top of the recursive function with null::text[] AS currdir. Now we have to be able to handle these three variants of cd:

  1. cd / - change to the root of the file system
  2. cd x - change to the directory "x"
  3. cd .. - change to the parent directory

All of these require changing our currdir variable. The simplest is cd / in which case we empty out our array of directories we care about: CASE WHEN s.line = '$ cd /' THEN '{/}'::text[]

For the case where we are changing to a new directory, we want to add this directory to our current list, by sticking it after a slash, just like a regular Unix file system. If the array is empty, we simply append to an empty string:

WHEN substr(s.line,1,4)='$ cd' THEN
  currdir
  || ( coalesce(currdir[ array_length(currdir,1) ], '')
  || '/'
  || regexp_substr(s.line, 'cd (.+)', 1,1,'',1))      -- or: || '/' || (regexp_match(s.line, 'cd (.+)'))[1] )

For the final case, cd .., we need to remove the current directory from our array. We need a second check to see if the array only has a single item, in which case we simply set it empty. If it is not empty, we remove the final element from our path by casting the array to a string, removing the final "slash-something", and putting it back into the array again.

WHEN substr(s.line,1,7)='$ cd ..' THEN
  CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
  ELSE string_to_array( regexp_replace(
       array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END

All of this means that as we are walking line by line, we are maintaining a list of interesting paths in currdir like so:

Linecurrdir
cd /{}::text[]
cd abc/abc
cd def/abc, /abc/def
cd ../abc
cd ghi/abc, /abc/ghi
cd jkl/abc, /abc/ghi, /abc/ghi/jkl

Why all this trouble? Well, we cannot come back to a line easily, so anytime we find a file size, we need to add it to the list for every directory in the whole tree leading up to our current directory.

We need to walk through our input line by line (order matters!) and keep track of a variable that changes over time. Without a language like plpgsql, the only way to do that via SQL is with a recursive function. The name is a little misleading, as they are iterative, not recursive, but they get the job done. A recursive function has two parts, joined by a UNION or UNION ALL. The first part (or first query) sets everything up by reading data from somewhere, and crafting a careful SELECT statement. The second part (or second query) runs over and over until it returns nothing else. Because the second query is of the form SELECT * FROM my_recursive_query is it able to maintain state by changing the data in the SELECT statement each time it runs. In our case, we change the currdir each time, and set a unique currnum as required.

Confusingly, a recursive query must appear as the first item in a CTE - in other words, the top item of the WITH list. But it can reference other parts of the WITH clause below it (notice that both the setup and iterative sections reference table s). It also needs to be referenced by other parts of the WITH clause - in our case by the table t. Let's view four of the tables in action. The final one that produces the output is t, which pulls from r. Table r pulls from s and r. Table s pulls from setup and aoc_day7:

WITH recursive r AS ( SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum FROM s where id = 1 UNION SELECT s.id, s.line, r.nextid+1, CASE WHEN s.line = '$ cd /' THEN '{}'::text[] WHEN substr(s.line,1,7)='$ cd ..' THEN CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[] ELSE string_to_array( regexp_replace( array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END WHEN substr(s.line,1,4)='$ cd' THEN currdir || ( coalesce(currdir[ array_length(currdir,1) ], '') || '/' || (regexp_match(s.line, 'cd (.+)'))[1] ) ELSE currdir END AS currdir, CASE WHEN s.line ~ '\d' THEN ((regexp_match(s.line, '(\d+)'))[1])::int ELSE 0 END AS currnum FROM r, s WHERE s.id = nextid ) ,setup AS (SELECT setval('aoc',1,false)) ,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7) ,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0) SELECT * FROM t LIMIT 10;
currdir | currnum -----------+---------- {} | 14848514 {} | 8504156 {/a} | 29116 {/a} | 2557 {/a} | 62596 {/a,/a/e} | 584 {/d} | 4060174 {/d} | 8033020 {/d} | 5626152 {/d} | 7214296 (10 rows)

So by table t, we have a entry for every file's size, and a corresponding list (or text array) of every directory that the size should be added to. Next, in table v, we use the unnest function to expand the array, so that each item gets assigned. Notice how /a/e is now on its own row:

WITH recursive r AS ( SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum FROM s where id = 1 UNION SELECT s.id, s.line, r.nextid+1, CASE WHEN s.line = '$ cd /' THEN '{/}'::text[] WHEN substr(s.line,1,7)='$ cd ..' THEN CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[] ELSE string_to_array( regexp_replace( array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END WHEN substr(s.line,1,4)='$ cd' THEN currdir || ( coalesce(currdir[ array_length(currdir,1) ], '') || '/' || (regexp_match(s.line, 'cd (.+)'))[1] ) ELSE currdir END AS currdir, CASE WHEN s.line ~ '\d' THEN ((regexp_match(s.line, '(\d+)'))[1])::int ELSE 0 END AS currnum FROM r, s WHERE s.id = nextid ) ,setup AS (SELECT setval('aoc',1,false)) ,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7) ,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0) ,v AS (SELECT unnest(currdir) AS dir, currnum AS size FROM t) SELECT * FROM v LIMIT 10;
dir | size ------+--------- /a | 29116 /a | 2557 /a | 62596 /a | 584 /a/e | 584 /d | 4060174 /d | 8033020 /d | 5626152 /d | 7214296 (9 rows)

Next, in table w, we sum up the sizes and use a group by to get a grand total of the total size of all files within each directory:

WITH recursive r AS ( SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum FROM s where id = 1 UNION SELECT s.id, s.line, r.nextid+1, CASE WHEN s.line = '$ cd /' THEN '{/}'::text[] WHEN substr(s.line,1,7)='$ cd ..' THEN CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[] ELSE string_to_array( regexp_replace( array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END WHEN substr(s.line,1,4)='$ cd' THEN currdir || ( coalesce(currdir[ array_length(currdir,1) ], '') || '/' || (regexp_match(s.line, 'cd (.+)'))[1] ) ELSE currdir END AS currdir, CASE WHEN s.line ~ '\d' THEN ((regexp_match(s.line, '(\d+)'))[1])::int ELSE 0 END AS currnum FROM r, s WHERE s.id = nextid ) ,setup AS (SELECT setval('aoc',1,false)) ,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7) ,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0) ,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t) ,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1) SELECT * FROM w LIMIT 10;
dir | totalsize ------+----------- /a | 94853 /d | 24933642 /a/e | 584 (3 rows)

Then, we do our final query, and add up all the ones that are <= 100,000 bytes:

WITH recursive r AS ( SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum FROM s where id = 1 UNION SELECT s.id, s.line, r.nextid+1, CASE WHEN s.line = '$ cd /' THEN '{/}'::text[] WHEN substr(s.line,1,7)='$ cd ..' THEN CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[] ELSE string_to_array( regexp_replace( array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END WHEN substr(s.line,1,4)='$ cd' THEN currdir || ( coalesce(currdir[ array_length(currdir,1) ], '') || '/' || (regexp_match(s.line, 'cd (.+)'))[1] ) ELSE currdir END AS currdir, CASE WHEN s.line ~ '\d' THEN ((regexp_match(s.line, '(\d+)'))[1])::int ELSE 0 END AS currnum FROM r, s WHERE s.id = nextid ) ,setup AS (SELECT setval('aoc',1,false)) ,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7) ,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0) ,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t) ,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1) SELECT sum(totalsize) FROM w WHERE totalsize <= 100000;
  sum
-------
 95437

Whew! That was a lot. Now for Part 2! In this section, we need to find a certain directory. The total disk space available is 70,000,000 bytes, and we need to have at least 30,000,000. So we need to figure out how much we are using, and the smallest file we can remove to make sure we have at least 30M bytes.

We can re-use almost all of the previous CTE, and add in four new lines:

,fs AS (SELECT 70000000::int AS maxsize)
,unused AS (SELECT fs.maxsize - totalsize AS space FROM w, fs WHERE dir = '/')
,needed AS (SELECT 30000000 - space AS neededbytes FROM unused)
SELECT min(totalsize) FROM w, needed WHERE totalsize >= neededbytes;

We have total space of seventy million bytes, so this is a constant we add as table fs. Then we can grab the total disk space as the sum of all entries in the filesystem /, and subtract that from our new constant to get the total used space in table unused. After that, we figure out how much space we need to have, and finally, we find the smallest entry that meets that criteria! Let's end with our final SQL query for part two:

WITH recursive r AS ( SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum FROM s where id = 1 UNION SELECT s.id, s.line, r.nextid+1, CASE WHEN s.line = '$ cd /' THEN '{/}'::text[] WHEN substr(s.line,1,7)='$ cd ..' THEN CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[] ELSE string_to_array( regexp_replace( array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END WHEN substr(s.line,1,4)='$ cd' THEN currdir || ( coalesce(currdir[ array_length(currdir,1) ], '') || '/' || (regexp_match(s.line, 'cd (.+)'))[1] ) ELSE currdir END AS currdir, CASE WHEN s.line ~ '\d' THEN ((regexp_match(s.line, '(\d+)'))[1])::int ELSE 0 END AS currnum FROM r, s WHERE s.id = nextid ) ,setup AS (SELECT setval('aoc',1,false)) ,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7) ,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0) ,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t) ,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1) ,fs AS (SELECT 70000000::int AS maxsize) ,unused AS (SELECT fs.maxsize - totalsize AS space FROM w, fs WHERE dir = '/') ,needed AS (SELECT 30000000 - space AS neededbytes FROM unused) SELECT min(totalsize) FROM w, needed WHERE totalsize >= neededbytes;
min ---------- 24933642

Loading terminal...

Loading terminal...