Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more
Tutorial Instructions
This challenge was an interesting one, as it required one to simulate walking through a file system. The things used to solve it include:
array_length
, string_to_array
, array_to_string
, and unnest
regexp_match
The challenge file looks like this:
$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k
Data is in a single table and single column line
'.
select * from aoc_day7 limit 10;
The goal is to find all directories with a total byte size of 100,000 or less, then sum up all those sizes. Directories can contain other directories, of course. First, some optimizations. Because we only care about the files in our current directory, the lines starting with "dir" can be completely ignored. The same thing for any lines with "$ ls". All we need to do is track our current directory, and add the size of any files contained within them to both the current directory, and any parent directories as well. It is also true that each file only appears once.
First, we’ll make a sequence.
CREATE SEQUENCE if not exists aoc;
Here is the final solution: a single SQL statement. Fear not, we will break it down:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{/}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
SELECT sum(totalsize) FROM w WHERE totalsize <= 100000;
The easiest part of that statement is to grab the file sizes. We do not care about the name of the file, we simply want to find any line that has a number in it, then grab the number and store it away. That's the job of this bit of code:
CASE WHEN regexp_count(s.line, '\d')>0 -- or: WHEN s.line ~ '\\d'
THEN regexp_substr(s.line, '(\d+)')::int -- or: THEN ((regexp_match(s.line, '(\\d+)'))[1])::int
ELSE 0
END AS currnum
The regexp_count
simply spots line from our source that have a number anywhere
in them, then we pull that number out with the handy regexp_substr
function.
A trickier bit is keeping track of not just what directory we are in, but what
ones we recently visited, such that the cd ..
commands work properly. A text
array is a nice way to store the list of directories. We initialized it at the
top of the recursive function with null::text[] AS currdir
. Now we have to be
able to handle these three variants of cd:
cd /
- change to the root of the file systemcd x
- change to the directory "x"cd ..
- change to the parent directoryAll of these require changing our currdir variable. The simplest is cd /
in
which case we empty out our array of directories we care about:
CASE WHEN s.line = '$ cd /' THEN '{/}'::text[]
For the case where we are changing to a new directory, we want to add this directory to our current list, by sticking it after a slash, just like a regular Unix file system. If the array is empty, we simply append to an empty string:
WHEN substr(s.line,1,4)='$ cd' THEN
currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/'
|| regexp_substr(s.line, 'cd (.+)', 1,1,'',1)) -- or: || '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
For the final case, cd ..
, we need to remove the current directory from our
array. We need a second check to see if the array only has a single item, in
which case we simply set it empty. If it is not empty, we remove the final
element from our path by casting the array to a string, removing the final
"slash-something", and putting it back into the array again.
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
All of this means that as we are walking line by line, we are maintaining a list of interesting paths in currdir like so:
Line | currdir |
cd / | {}::text[] |
cd abc | /abc |
cd def | /abc, /abc/def |
cd .. | /abc |
cd ghi | /abc, /abc/ghi |
cd jkl | /abc, /abc/ghi, /abc/ghi/jkl |
Why all this trouble? Well, we cannot come back to a line easily, so anytime we find a file size, we need to add it to the list for every directory in the whole tree leading up to our current directory.
We need to walk through our input line by line (order matters!) and keep track
of a variable that changes over time. Without a language like plpgsql, the only
way to do that via SQL is with a recursive function. The name is a little
misleading, as they are iterative, not recursive, but they get the job done. A
recursive function has two parts, joined by a UNION
or UNION ALL
. The first
part (or first query) sets everything up by reading data from somewhere, and
crafting a careful SELECT statement. The second part (or second query) runs over
and over until it returns nothing else. Because the second query is of the form
SELECT * FROM my_recursive_query
is it able to maintain state by changing the
data in the SELECT statement each time it runs. In our case, we change the
currdir each time, and set a unique currnum as required.
Confusingly, a recursive query must appear as the first item in a CTE - in
other words, the top item of the WITH list. But it can reference other parts of
the WITH clause below it (notice that both the setup and iterative sections
reference table s
). It also needs to be referenced by other parts of the
WITH clause - in our case by the table t
. Let's view four of the tables in
action. The final one that produces the output is t
, which pulls from r
.
Table r
pulls from s
and r
. Table s
pulls from setup
and aoc_day7
:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
SELECT * FROM t LIMIT 10;
currdir | currnum
-----------+----------
{} | 14848514
{} | 8504156
{/a} | 29116
{/a} | 2557
{/a} | 62596
{/a,/a/e} | 584
{/d} | 4060174
{/d} | 8033020
{/d} | 5626152
{/d} | 7214296
(10 rows)
So by table t
, we have a entry for every file's size, and a corresponding list
(or text array) of every directory that the size should be added to. Next, in
table v
, we use the unnest
function to expand the array, so that each item
gets assigned. Notice how /a/e
is now on its own row:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{/}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v AS (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
SELECT * FROM v LIMIT 10;
dir | size
------+---------
/a | 29116
/a | 2557
/a | 62596
/a | 584
/a/e | 584
/d | 4060174
/d | 8033020
/d | 5626152
/d | 7214296
(9 rows)
Next, in table w
, we sum up the sizes and use a group by to get a grand total
of the total size of all files within each directory:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{/}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
SELECT * FROM w LIMIT 10;
dir | totalsize
------+-----------
/a | 94853
/d | 24933642
/a/e | 584
(3 rows)
Then, we do our final query, and add up all the ones that are <= 100,000 bytes:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{/}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
SELECT sum(totalsize) FROM w WHERE totalsize <= 100000;
sum
-------
95437
Whew! That was a lot. Now for Part 2! In this section, we need to find a certain directory. The total disk space available is 70,000,000 bytes, and we need to have at least 30,000,000. So we need to figure out how much we are using, and the smallest file we can remove to make sure we have at least 30M bytes.
We can re-use almost all of the previous CTE, and add in four new lines:
,fs AS (SELECT 70000000::int AS maxsize)
,unused AS (SELECT fs.maxsize - totalsize AS space FROM w, fs WHERE dir = '/')
,needed AS (SELECT 30000000 - space AS neededbytes FROM unused)
SELECT min(totalsize) FROM w, needed WHERE totalsize >= neededbytes;
We have total space of seventy million bytes, so this is a constant we add as
table fs
. Then we can grab the total disk space as the sum of all entries in
the filesystem /
, and subtract that from our new constant to get the total
used space
in table unused
. After that, we figure out how much space we need
to have, and finally, we find the smallest entry that meets that criteria! Let's
end with our final SQL query for part two:
WITH recursive r AS (
SELECT id, line, 1 as nextid, null::text[] AS currdir, 0 AS currnum
FROM s where id = 1
UNION
SELECT s.id, s.line, r.nextid+1,
CASE WHEN s.line = '$ cd /' THEN '{/}'::text[]
WHEN substr(s.line,1,7)='$ cd ..' THEN
CASE WHEN 1=array_length(currdir,1) THEN '{}'::text[]
ELSE string_to_array( regexp_replace(
array_to_string(currdir, ','), ',[a-z/]+$', ''), ',') END
WHEN substr(s.line,1,4)='$ cd' THEN currdir
|| ( coalesce(currdir[ array_length(currdir,1) ], '')
|| '/' || (regexp_match(s.line, 'cd (.+)'))[1] )
ELSE currdir
END AS currdir,
CASE WHEN s.line ~ '\d'
THEN ((regexp_match(s.line, '(\d+)'))[1])::int
ELSE 0
END AS currnum
FROM r, s WHERE s.id = nextid
)
,setup AS (SELECT setval('aoc',1,false))
,s AS (SELECT nextval('aoc') AS id, line FROM setup, aoc_day7)
,t AS (SELECT currdir, currnum FROM r WHERE currnum > 0)
,v as (SELECT unnest(currdir) AS dir, currnum AS size FROM t)
,w as (SELECT dir, sum(size) AS totalsize FROM v GROUP BY 1)
,fs AS (SELECT 70000000::int AS maxsize)
,unused AS (SELECT fs.maxsize - totalsize AS space FROM w, fs WHERE dir = '/')
,needed AS (SELECT 30000000 - space AS neededbytes FROM unused)
SELECT min(totalsize) FROM w, needed WHERE totalsize >= neededbytes;
min
----------
24933642
Loading terminal...
Loading terminal...