High-compression Metrics Storage with Postgres Hyperloglog
We have been talking a lot here about using Postgres for metrics, dashboards, and analytics. One of my favorite Postgres tools that makes a lot of this work easy and efficient is Hyperloglog (HLL). Hyperloglog is like Regex: once you understand it, you feel like you have a superpower. Also, like Regex -- it can't solve everything. In this post I'll take you through how to get started with HLL, build some sample queries, and do some simple tuning.
So what is Hyperloglog?
Hyperloglog is a compression and extraction algorithm for counting distinct values. It gets its name from its space requirements, which grow on the order of log(log(n)) -- hence the "loglog". Because it is a compression algorithm, results are an approximation. At storage time, it hashes each value and folds the hashes into a compact set of registers. At read time, it estimates the distinct count from those registers based on the longest runs of zeroes observed in the hashed values.
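To make the hashing step concrete, here is a tiny sketch using two hash functions from the hll extension we will install below; the literal values are just examples:

-- Every input is first reduced to a 64-bit hash. HLL only ever stores
-- information derived from these hashes, never the raw values themselves.
SELECT hll_hash_integer(42)   AS hashed_int,
       hll_hash_text('hello') AS hashed_text;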
When to use Hyperloglog?
Because it’s an approximation, the use case should accept approximate calculations. Billing systems? No! Visitor tracking? Yes!
To use Hyperloglog, you must have data that counts distinct values across a time series or set of values. Typically, you can use it where you are currently using COUNT(DISTINCT <arg>). The canonical example for Hyperloglog is product behavior metrics, where you are looking to scale something like the following query:
SELECT
date_trunc('week', received_at) AS query_week,
COUNT(DISTINCT customer_id) AS active_customer_count
FROM activities
WHERE received_at > '2023-01-01'
GROUP BY 1
ORDER BY 1;
Running that query finds the matching set of activities, holds the result in memory, groups it by iterating over the set, and then sorts the groups. The most common solution to this problem is output caching. The problem with output caching is that it can only answer one question: the cached value can only answer the exact question the query answered. Thus, if you wanted to find the answers for the following, you'd have to set up additional caches:
- Which pages received the most unique views during that time?
- Which hour has the highest unique visitors?
- Which referrer generated the most page views last week?
When you want to slice by additional dimensions, the simple cache breaks down quickly. Enter Hyperloglog.
The benefits of using Hyperloglog
The two benefits of Hyperloglog are reduced storage space and increased performance. Sounds too good to be true, right? But it is not. Let me show you how it works.
Storage reduction is a function of the roll-up strategy. For instance, rolling up to the hour will not compress as much as rolling up to the day, and additional dimensions will not compact as well as a flat roll-up. Performance is also a function of the roll-up strategy. In a simple example, an hourly roll-up uses about 1/100th the storage and queries run in about 1/10th the time, and it gets even better as the data scales.
The main drawback of HLL is that calculations are approximations, so any use case requiring precision is not suitable.
Data generator for HLL
If you would like to follow this tutorial, but need data, I’ve created a data generator for you, and it works with Crunchy Bridge. The data generator is at: https://gist.github.com/Winslett/8c594d836c7a802bff1f0749e67c8ba4.
To run the data generator, run:
git clone git@gist.github.com:8c594d836c7a802bff1f0749e67c8ba4.git
bundle install
DATABASE_URL="postgres://user:pass@host:port/dbname" ruby generator.rb
The generator creates two tables: customers and activities. The activities table is a pseudo-random set of actions taken by customers. The following examples will use this activities table.
The customers table looks like this:
 Column | Type
--------+---------
 id     | integer
 name   | text
 email  | text
The activities table looks like this:
   Column    |            Type
-------------+-----------------------------
 received_at | timestamp without time zone
 customer_id | integer
 action      | text
Hyperloglog set up
- Load the Hyperloglog extension into your database by running CREATE EXTENSION hll;
- Create an aggregation table with an hll column
- Aggregate metrics and store them in a table with the hll column
- Query the aggregation table using the hll functions
Create an aggregation table
Connect to your database, and run the following:
CREATE EXTENSION hll;
An aggregation table will consist of dimensions and an HLL column. You can have as many dimension columns as you like. Continuing the page visitors example above, dimensions could be path, referrer, utm values, etc. The data generator example we are using above has hour and action as dimensions, with customer_id as the counted value.
Run the following to create the roll-up table:
CREATE TABLE activities_hourly_rollup (
hour timestamp,
action varchar(255),
customers hll,
unique (hour, action)
);
The hour field will contain the beginning of each hour (e.g. 2023-01-01 12:00:00). The action field will contain one of a set of twelve actions as defined in the data generator linked above. The customers field contains hashed values aggregated using the HLL algorithm, which allows us to approximate a count of the distinct values (e.g. \x128b7fde24688c782a1cda10a2e797b24e2c1772f24e286b62c7e4). We use UNIQUE on (hour, action) to enforce valid data -- without it, hour and action combinations could become duplicated and messy.
Aggregate and store
For a first run of the aggregation, use a nested SELECT statement as the feeder to the INSERT statement. This will work pretty well, but because of the UNIQUE constraint, it will fail if run more than once.
INSERT INTO activities_hourly_rollup
SELECT
date_trunc('hour', received_at) AS hour,
action,
hll_add_agg(hll_hash_integer(customer_id))
FROM activities
GROUP BY 1, 2;
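If you want a statement you can safely re-run without waiting for the MERGE approach covered below, one option is an upsert that unions the new hll value into the existing one. This is a sketch of my own, assuming the scalar hll_union function from the extension:

INSERT INTO activities_hourly_rollup AS target (hour, action, customers)
SELECT
  date_trunc('hour', received_at) AS hour,
  action,
  hll_add_agg(hll_hash_integer(customer_id))
FROM activities
GROUP BY 1, 2
ON CONFLICT (hour, action) DO UPDATE
  -- fold the freshly aggregated hashes into what is already stored
  SET customers = hll_union(target.customers, EXCLUDED.customers);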
Querying the new aggregate table
When testing performance, run the following in your psql terminal to get performance output:
\timing
Typically, an aggregation query would look something like the following. It's a simple SELECT with GROUP BY that summarizes the distinct values in an hour.
SELECT
date_trunc('hour', received_at) AS hour,
COUNT(DISTINCT customer_id) AS customer_count
FROM activities
GROUP BY 1
ORDER BY 1
LIMIT 15;
What would this same query look like if we were to use the hll functionality?
SELECT
hour,
hll_cardinality(hll_union_agg(customers)) AS hourly_uniques
FROM activities_hourly_rollup
GROUP BY 1
ORDER BY 1
LIMIT 15;
Run the two queries above, and compare the outputs and run time. As the generator runs, it will continue to build data, which will slow the aggregation query.
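Because action is kept as a dimension in the rollup, the same table can answer the kinds of per-dimension questions listed earlier without any extra caches. For example, a sketch of unique customers per action:

SELECT
  action,
  hll_cardinality(hll_union_agg(customers)) AS unique_customers
FROM activities_hourly_rollup
GROUP BY 1
ORDER BY 2 DESC;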
What made HLL work?
When storing the value into activities_hourly_rollup, the customer_id integer field was converted to a hash using hll_hash_integer. Then the unique hashed values were added together using hll_add_agg and stored into the activities_hourly_rollup record by hour and action.

When querying, we used hll_union_agg as a post-compute DISTINCT aggregator on the values. Then, hll_cardinality estimates the distinct count from the longest stretches of zeroes in the aggregated value. Any distinct count that falls within the constraints of HLL can be computed by combining hll_union_agg and hll_cardinality -- it is quite powerful.
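One handy consequence of hll_union_agg is that coarser roll-ups can be derived from finer ones without re-reading the raw activities table. As a sketch (the activities_daily_rollup table name is my own, not part of the data generator), a daily roll-up built from the hourly one:

CREATE TABLE activities_daily_rollup AS
SELECT
  date_trunc('day', hour) AS day,
  action,
  hll_union_agg(customers) AS customers  -- union of the hourly hll values
FROM activities_hourly_rollup
GROUP BY 1, 2;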
Let's look at a query that goes through the entire process. We build up the hll data, then extract counts from it. For this example, below is a self-contained query that takes 3 integers, hashes them, and aggregates the hashes into an hll value. Then it uses hll_union_agg to combine the DISTINCT values and hll_cardinality to approximate the distinct count:
WITH three_ints AS (
SELECT
hll_add_agg(hll_hash_integer(int_value)) AS hashed_values
FROM
(VALUES (1), (2), (3)) AS v(int_value)
)
SELECT
hll_cardinality(hll_union_agg(six_ints.hashed_values))
FROM ((SELECT * FROM three_ints) UNION ALL (SELECT * FROM three_ints)) AS six_ints;
I encourage you to extract the different parts of the query above to understand what each part is doing. This will help you understand the entire hll process.
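As a starting point, here is one way to pull the pieces apart (a sketch of mine; the expected results are noted in comments, not captured output):

-- Step 1: the hash step -- one 64-bit hash per input value
SELECT int_value, hll_hash_integer(int_value) AS hashed
FROM (VALUES (1), (2), (3)) AS v(int_value);

-- Step 2: aggregate the hashes and estimate the distinct count (should be ~3)
SELECT hll_cardinality(hll_add_agg(hll_hash_integer(int_value)))
FROM (VALUES (1), (2), (3)) AS v(int_value);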
For more experimentation, check out this HLL Playground, which visualizes hash values, register values, and cardinality estimations with error percentages.
Keep the HLL table up-to-date
With Postgres 15, the MERGE command was introduced, which can power upserts. With it, we build the query to insert in the USING clause, compare values in the ON clause, then run different actions based on WHEN MATCHED and WHEN NOT MATCHED.
MERGE INTO activities_hourly_rollup AS target
USING (
SELECT
date_trunc('hour', received_at) AS hour,
action,
hll_add_agg(hll_hash_integer(customer_id)) AS customers
FROM activities
GROUP BY 1, 2
) AS source
ON target.hour = source.hour AND target.action = source.action
WHEN MATCHED THEN
UPDATE SET customers = source.customers
WHEN NOT MATCHED THEN
INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);
Since we are rolling up by the hour, hours that were already complete before the latest run will not receive new values. For performance reasons, you'll want to cache the last run time, then add a condition to the SELECT statement along the lines of: WHERE activities.received_at >= date_trunc('hour', last_run_at).
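Putting that together, a sketch of an incremental refresh; the timestamp literal is a hypothetical placeholder for the last run time your job caches between runs:

MERGE INTO activities_hourly_rollup AS target
USING (
  SELECT
    date_trunc('hour', received_at) AS hour,
    action,
    hll_add_agg(hll_hash_integer(customer_id)) AS customers
  FROM activities
  -- only re-aggregate hours touched since the cached last run time
  WHERE received_at >= date_trunc('hour', TIMESTAMP '2023-01-23 00:00:00')  -- substitute your cached last_run_at
  GROUP BY 1, 2
) AS source
ON target.hour = source.hour AND target.action = source.action
WHEN MATCHED THEN
  UPDATE SET customers = source.customers
WHEN NOT MATCHED THEN
  INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);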
How much disk did we save?
How much smaller is the rollup table than the raw activities table? In this scenario, it's about 5% of the original size. But the storage usage should become more efficient as sizes increase. To see these values in psql, run the following:
\d+
Which outputs something like the following (while building these examples, the data generator is running, so I will have different sizes than you will):
                                         List of relations
 Schema |           Name           |   Type   |  Owner   | Persistence | Access method |    Size    | Description
--------+--------------------------+----------+----------+-------------+---------------+------------+-------------
 public | activities               | table    | postgres | permanent   | heap          | 77 MB      |
 public | activities_hourly_rollup | table    | postgres | permanent   | heap          | 4576 kB    |
 public | customers                | table    | postgres | permanent   | heap          | 1352 kB    |
 public | customers_id_seq         | sequence | postgres | permanent   |               | 8192 bytes |
In my example, the rollup table is 4576 kB and the raw table is 77 MB. With the HLL hashing algorithm, it is possible to delete old raw records yet keep the aggregated data. If you delete raw data, the aggregates remain, but you lose the ability to track history for an individual customer.
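For instance, a sketch of pruning raw rows while keeping the roll-up (the 30-day retention window is an arbitrary example of mine):

-- Remove raw activity rows older than 30 days; the hourly hll roll-ups
-- built from them remain queryable for historical unique counts.
DELETE FROM activities
WHERE received_at < now() - interval '30 days';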
How much performance did we gain?
Because we are using approximate calculations on aggregated values, the database does not have to iterate over the raw records. In this example, the hll query runs in about 7% of the time of the query that iterates over the records:
Prior query
SELECT
date_trunc('week', received_at) AS query_week,
COUNT(DISTINCT customer_id) AS active_customer_count
FROM activities
WHERE received_at > '2023-01-01'
GROUP BY 1
ORDER BY 1;
⇒ ~1100ms
HLL query
SELECT
date_trunc('week', hour) AS week,
hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup
GROUP BY 1
ORDER BY 1;
⇒ ~80ms
What was the trade-off?
We mentioned approximation, right? When working with hll calculations, you have to decide what is close enough.
Prior response:
query_week | active_customer_count
---------------------+-----------------------
2022-12-26 00:00:00 | 43
2023-01-02 00:00:00 | 434
2023-01-09 00:00:00 | 2849
2023-01-16 00:00:00 | 10313
2023-01-23 00:00:00 | 14410
(5 rows)
HLL response:
week | weekly_uniques
---------------------+--------------------
2022-12-26 00:00:00 | 43
2023-01-02 00:00:00 | 443.7906710523894
2023-01-09 00:00:00 | 2835.134752744712
2023-01-16 00:00:00 | 10045.884001042512
2023-01-23 00:00:00 | 13824.710142108923
(5 rows)
For metrics and dashboard scenarios, these weekly_uniques values are close enough.
Session defaults and tuning HLL config
You can set session-level defaults that configure all hll fields. For this blog post, I chose to be explicit for each field because we have multiple tables with different settings.
SELECT hll_set_defaults(17, 5, -1, 1); -- arguments are in order:
-- log2m (default = 11)
-- regwidth (default = 5)
-- expthresh (default = -1)
-- sparseon (default = 1)
In most scenarios, I would also choose to be explicit, as implicit values may be frustrating some time in the future when storage won't work due to mismatched values. If you run the wrong settings on an upsert, you can expand a table's size significantly.
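If you need to check what an existing column holds, the extension ships a debugging function, hll_print, that describes a stored hll value; a quick sketch (the exact output format is the extension's, so check its docs):

-- Inspect one stored hll value; hll_print returns a human-readable debug description
SELECT hour, action, hll_print(customers)
FROM activities_hourly_rollup
LIMIT 1;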
Tuning HLL consists of adjusting a few different parameters to match your data needs while balancing the storage and performance gains.
Improving accuracy with log2m
By changing this value, log2m, to 17, we increase the number of registers used by the algorithm (each increment of log2m doubles the register count), which increases accuracy at the cost of storage and compute time.
Let's create a new table:
CREATE TABLE activities_hourly_rollup_max_log2m (
hour timestamp,
action varchar(255),
customers hll(17, 5, -1, 1), -- first argument is the log2m, remaining are default, but we'll change them later
unique (hour, action)
);
Now, let's throw data at this new table. Notice the addition of the configurations when building the hll field:
MERGE INTO activities_hourly_rollup_max_log2m AS target
USING (
SELECT
date_trunc('hour', received_at) AS hour,
action,
hll_add_agg(hll_hash_integer(customer_id), 17, 5, -1, 1) AS customers -- we set the same values as the table
FROM activities
GROUP BY 1, 2
) AS source
ON target.hour = source.hour AND target.action = source.action
WHEN MATCHED THEN
UPDATE SET customers = source.customers
WHEN NOT MATCHED THEN
INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);
If we look at the table sizes by running \d+, in my scenario the table size has increased from 4752 kB for the activities_hourly_rollup table to 15 MB for the activities_hourly_rollup_max_log2m table. But if we run the aggregate queries, we'll see the gain for the trade-off is accuracy:
SELECT
date_trunc('week', hour) AS week,
hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup_max_log2m
GROUP BY 1
ORDER BY 1;
Where there was a ~5% difference in results previously, the difference is now less than 1%. Additionally, the run time of the query moved from ~150ms to ~925ms. When you think of log2m, think increased accuracy at the cost of storage and performance.
Tuning maximum distincts with regwidth
Now, let's set regwidth to the maximum to see the effect. This changes the number of bits used per register. So, log2m changes the number of registers, and regwidth changes the bits used per register. As you can imagine, this also has an impact on storage. Increase regwidth if you expect more unique values (i.e. higher cardinality). The Postgres HLL documentation has a chart with the cardinality limits for each log2m and regwidth combination.
CREATE TABLE activities_hourly_rollup_max_regwidth (
hour timestamp,
action varchar(255),
customers hll(11, 7, -1, 1), -- second argument is regwidth set to maximum
unique (hour, action)
);
Now, let’s load it with data:
MERGE INTO activities_hourly_rollup_max_regwidth AS target
USING (
SELECT
date_trunc('hour', received_at) AS hour,
action,
hll_add_agg(hll_hash_integer(customer_id), 11, 7, -1, 1) AS customers -- we set the same values as the table
FROM activities
GROUP BY 1, 2
) AS source
ON target.hour = source.hour AND target.action = source.action
WHEN MATCHED THEN
UPDATE SET customers = source.customers
WHEN NOT MATCHED THEN
INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);
When running \d+, we'll see that the table has increased from 4752 kB to 5928 kB, so it does not have the same impact on storage as log2m. But when we run the following query, we get NaN results once the number of uniques increases past the ability of the regwidth to contain the cardinality:
SELECT
date_trunc('week', hour) AS week,
hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup_max_regwidth
GROUP BY 1
ORDER BY 1;
Results in:
week | weekly_uniques
---------------------+-------------------
2022-12-26 00:00:00 | 43
2023-01-02 00:00:00 | 443.7906710523894
2023-01-09 00:00:00 | 2835.134752744712
2023-01-16 00:00:00 | NaN
2023-01-23 00:00:00 | NaN
For this scenario, we would need to increase the log2m to allow more registers, as the size of the registers is already at the maximum.
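A hedged sketch of what that fix could look like (the table name and the log2m value are my examples; consult the cardinality-limit chart to pick values for your data):

CREATE TABLE activities_hourly_rollup_wide (
  hour timestamp,
  action varchar(255),
  customers hll(14, 7, -1, 1), -- more registers via log2m = 14, regwidth still at the maximum
  unique (hour, action)
);

As before, the hll_add_agg calls that populate this table would need the same (14, 7, -1, 1) parameters.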
Using expthresh for accuracy, up to a point
The third setting, expthresh, gives exact counts up to a certain value. The trade-off is additional memory for calculations. The default for this setting so far has been -1, which effectively tells the database to make the best decision. Let's leave the other values alone and set a new value for expthresh:
CREATE TABLE activities_hourly_rollup_max_expthresh (
hour timestamp,
action varchar(255),
customers hll(12, 5, 8192, 1), -- 3rd argument is expthresh, set to 2^13
unique (hour, action)
);
Now, let's use the upsert to insert data into this new table:
MERGE INTO activities_hourly_rollup_max_expthresh AS target
USING (
SELECT
date_trunc('hour', received_at) AS hour,
action,
hll_add_agg(hll_hash_integer(customer_id), 12, 5, 8192, 1) AS customers -- we set the same values as the table
FROM activities
GROUP BY 1, 2
) AS source
ON target.hour = source.hour AND target.action = source.action
WHEN MATCHED THEN
UPDATE SET customers = source.customers
WHEN NOT MATCHED THEN
INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);
Run a similar query against this table:
SELECT
date_trunc('week', hour) AS week,
hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup_max_expthresh
GROUP BY 1
ORDER BY 1;
And you'll see precise values up to the given expthresh:
week | weekly_uniques
---------------------+--------------------
2022-12-26 00:00:00 | 43
2023-01-02 00:00:00 | 434
2023-01-09 00:00:00 | 2849
2023-01-16 00:00:00 | 10478.318375775954
2023-01-23 00:00:00 | 15889.381066151105
Summary
- HLL is a compression algorithm, so results are an approximation. Great for collecting summary statistics and metrics, but not so great for exact calculation needs.
- Using the HLL extension, create an aggregation table with an HLL column and query the aggregation table using the HLL functions.
- An HLL table can be ~95% smaller than raw data tables.
- Queries can run in ~10% of the time raw queries of the same data would take.
- For tuning HLL, review log2m, regwidth, and expthresh, which can increase accuracy but will impact storage and performance.