
High-compression Metrics Storage with Postgres Hyperloglog

Christopher Winslett

13 min read

We have been talking a lot here about using Postgres for metrics, dashboards, and analytics. One of my favorite Postgres tools that makes a lot of this work easy and efficient is Hyperloglog (HLL). Hyperloglog is like Regex: once you understand it, you feel like you have a superpower. And, like Regex, it can't solve everything. In this post I'll take you through getting started with HLL, building some sample queries, and doing some simple tuning.

So what is Hyperloglog?

Hyperloglog is a compression and estimation algorithm for counting distinct values. It gets its name from its storage characteristic of log(log(n)) -- hence the "loglog". Because it compresses the data, results are an approximation. At write time, it hashes each value and uses the hash to update a small set of registers, each of which remembers the longest run of leading zeroes it has seen. At read time, it estimates the number of distinct values from those register values.
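
Here is a minimal sketch of that write/read path, assuming the hll extension (installed later in this post) is available: hash a value, fold the hash into an empty hll, and read back an approximate count.

SELECT hll_cardinality(
  hll_add(hll_empty(), hll_hash_integer(42)) -- hash the value 42 and add it to an empty hll
);
-- returns 1 (tiny sets are stored explicitly, so the count is exact here)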

When to use Hyperloglog?

Because it’s an approximation, the use case should accept approximate calculations. Billing systems? No! Visitor tracking? Yes!

To use Hyperloglog, you must have data that counts distinct values across a time series or set of values. Typically, you can use it where you are currently using COUNT(DISTINCT <arg>). The canonical example for Hyperloglog is product behavior metrics, where you are looking to scale something like the following query:

SELECT
	date_trunc('week', received_at) AS query_week,
	COUNT(DISTINCT customer_id) AS active_customer_count
FROM activities
WHERE received_at > '2023-01-01'
GROUP BY 1
ORDER BY 1;

Running that query finds the matching set of activities, pulls it into memory, iterates over the set to count distinct customers per group, and then sorts the result. The most common solution to this performance problem is output caching. The problem with output caching is that it can only answer one question: the cached value can only answer the exact question the query answered. Thus, if you wanted answers to the following, you'd have to set up additional caches:

  • Which pages received the most unique views during that time?
  • Which hour has the highest unique visitors?
  • Which referrer generated the most page views last week?

When you desire to use additional dimensions, the simple cache breaks down quickly. Enter Hyperloglog.

The benefits of using Hyperloglog

The two benefits of Hyperloglog are reduced storage space and increased performance. Sounds too good to be true, right? But it is not. Let me show you how it works.

Storage reduction is a function of the roll-up strategy. For instance, rolling up to the hour will not compress as much as rolling up to the day, and additional dimensions will not compact as well as a flat roll-up. Performance is also a function of the roll-up strategy. In the simple example below, the hourly roll-up uses roughly 1/20th of the storage and queries run in roughly 1/10th of the time, and both ratios improve as the data grows.

The main drawback of HLL is that its calculations are approximations, so any use case requiring precise counts is not a good fit.

Data generator for HLL

If you would like to follow this tutorial, but need data, I’ve created a data generator for you, and it works with Crunchy Bridge. The data generator is at: https://gist.github.com/Winslett/8c594d836c7a802bff1f0749e67c8ba4.

To run the data generator, run:

  1. git clone git@gist.github.com:8c594d836c7a802bff1f0749e67c8ba4.git
  2. bundle install
  3. DATABASE_URL="postgres://user:pass@host:port/dbname" ruby generator.rb

The data generator creates two tables: customers and activities. The activities table is a pseudo-random stream of actions performed by customers, and the following examples use it as the example data set.

The customers table looks like this:

 Column |  Type   |
--------+---------+
 id     | integer |
 name   | text    |
 email  | text    |

The activities table looks like this:

   Column    |            Type             |
-------------+-----------------------------+
 received_at | timestamp without time zone |
 customer_id | integer                     |
 action      | text                        |

Hyperloglog set up

  1. Load the Hyperloglog extension into your database by running CREATE EXTENSION hll;
  2. Create an aggregation table with an hll column
  3. Aggregate metrics and store them in a table with the hll column
  4. Query the aggregation table using the hll functions

Create an aggregation table

Connect to your database, and run the following:

CREATE EXTENSION hll;

An aggregation table consists of dimension columns and an HLL column. You can have as many dimension columns as you like. Continuing the page visitors example above, dimensions could be path, referrer, UTM values, etc. For the data generator example we are using, the dimensions are the hour and the action, and the distinct customers are what we count.

Run the following to create the roll-up table:

CREATE TABLE activities_hourly_rollup (
	hour timestamp,
	action varchar(255),
	customers hll,
	unique (hour, action)
);

The hour field will contain the beginning of each hour (e.g. 2023-01-01 12:00:00). The action field will contain one of the twelve actions defined in the data generator linked above. The customers field contains hashed values aggregated using the HLL algorithm, which allows us to approximate a count of distinct values (e.g. \x128b7fde24688c782a1cda10a2e797b24e2c1772f24e286b62c7e4). The UNIQUE constraint on (hour, action) keeps the data valid; without it, rows for the same hour and action could be duplicated and the roll-up would get messy.

Aggregate and store

For a first run of the aggregation, use a nested SELECT statement as the feeder for the INSERT statement. This works well once, but because of the UNIQUE constraint, it will fail if run more than once.

INSERT INTO activities_hourly_rollup
  SELECT
    date_trunc('hour', received_at) AS hour,
    action,
    hll_add_agg(hll_hash_integer(customer_id))
  FROM activities
  GROUP BY 1, 2;
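
If you need to re-run that population step before we get to MERGE below, one alternative (a sketch, relying on the UNIQUE (hour, action) constraint above) is an upsert that folds the freshly computed hashes into the existing rows with hll_union:

INSERT INTO activities_hourly_rollup (hour, action, customers)
  SELECT
    date_trunc('hour', received_at) AS hour,
    action,
    hll_add_agg(hll_hash_integer(customer_id))
  FROM activities
  GROUP BY 1, 2
ON CONFLICT (hour, action) DO UPDATE
  SET customers = hll_union(activities_hourly_rollup.customers, EXCLUDED.customers);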

Querying the new aggregate table

When testing performance, run the following in your psql terminal to get performance output:

\timing

Typically, an aggregation query would look something like the following. It’s a simple SELECT with GROUP BY that summarizes the distinct values in an hour.

SELECT
  date_trunc('hour', received_at) AS hour,
  COUNT(DISTINCT customer_id) AS customer_count
FROM activities
GROUP BY 1
ORDER BY 1
LIMIT 15;

What would this same query look like if we were to use the hll functionality?

SELECT
  hour,
  hll_cardinality(hll_union_agg(customers)) AS hourly_uniques
FROM activities_hourly_rollup
GROUP BY 1
ORDER BY 1
LIMIT 15;

Run the two queries above, and compare the outputs and run time. As the generator runs, it will continue to build data, which will slow the aggregation query.
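
Because action is also a dimension in the roll-up table, the same table answers different questions with no extra cache. As a quick sketch, here are unique customers per action over the last seven days:

SELECT
  action,
  hll_cardinality(hll_union_agg(customers)) AS uniques_last_7_days
FROM activities_hourly_rollup
WHERE hour > now() - interval '7 days'
GROUP BY 1
ORDER BY 2 DESC;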

What made HLL work?

When storing values into activities_hourly_rollup, each customer_id integer was converted to a hash using hll_hash_integer. The hashed values were then aggregated with hll_add_agg and stored in the activities_hourly_rollup record for that hour and action.

When querying, we used hll_union_agg as a post-compute DISTINCT aggregator over the stored values, then hll_cardinality to estimate the distinct count from the longest runs of zeroes in the combined registers. Any distinct count that fits within the limits of HLL can be computed by combining hll_union_agg and hll_cardinality -- it is quite powerful.


Let's look at a query that goes through the entire process: build up the hll data, then extract counts from it. Below is a self-contained query that takes 3 integers, hashes them, and aggregates them into an hll value. It then combines two copies of that value with hll_union_agg and approximates the distinct count with hll_cardinality:

WITH three_ints AS (
  SELECT
    hll_add_agg(hll_hash_integer(int_value)) AS hashed_values
  FROM
    (VALUES (1), (2), (3)) AS v(int_value)
)
SELECT
  hll_cardinality(hll_union_agg(six_ints.hashed_values))
FROM ((SELECT * FROM three_ints) UNION ALL (SELECT * FROM three_ints)) AS six_ints;

I encourage you to extract and run the different parts of the query above to understand what each part is doing; this will help you understand the entire hll process. Note that the result is still 3 even though the set was unioned with itself -- HLL only counts distinct values.
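
As one more way to peek under the hood, the extension ships an hll_print function that shows a debug representation of the stored structure. A quick sketch using the same three integers:

SELECT hll_print(hll_add_agg(hll_hash_integer(int_value)))
FROM (VALUES (1), (2), (3)) AS v(int_value);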

For more experimentation, check out the HLL Playground, which visualizes hash values, register values, and cardinality estimations with error percentages.

Keep the HLL table up-to-date

Postgres 15 introduced the MERGE command, which can power upserts. With it, we build the query to insert in the USING clause, compare values in the ON clause, then run different actions depending on whether rows are MATCHED or NOT MATCHED.

MERGE INTO activities_hourly_rollup AS target
  USING (
    SELECT
      date_trunc('hour', received_at) AS hour,
      action,
      hll_add_agg(hll_hash_integer(customer_id)) AS customers
    FROM activities
    GROUP BY 1, 2
  ) AS source
  ON target.hour = source.hour AND target.action = source.action
  WHEN MATCHED THEN
    UPDATE SET customers = source.customers
  WHEN NOT MATCHED THEN
    INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);

Since we are rolling up by the hour, hours that completed before the previous run will not receive new values. For performance reasons, you'll want to cache the last run time, then include a condition in the SELECT statement along the lines of WHERE activities.received_at >= date_trunc('hour', last_run_at).
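
Here is a sketch of that incremental refresh. The last_run_at value is hypothetical -- your scheduler would track it -- and is passed as a psql variable here:

-- Assumes the caller supplies last_run_at, e.g. \set last_run_at '2023-01-23 10:15:00'
MERGE INTO activities_hourly_rollup AS target
  USING (
    SELECT
      date_trunc('hour', received_at) AS hour,
      action,
      hll_add_agg(hll_hash_integer(customer_id)) AS customers
    FROM activities
    WHERE received_at >= date_trunc('hour', :'last_run_at'::timestamp)
    GROUP BY 1, 2
  ) AS source
  ON target.hour = source.hour AND target.action = source.action
  WHEN MATCHED THEN
    UPDATE SET customers = source.customers
  WHEN NOT MATCHED THEN
    INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);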

How much disk did we save?

How much smaller is the rollup table than the raw activities table? In this scenario, it's about 5% of the original size, and the ratio should improve as the data grows. To see these sizes in psql, run the following:

\d+

Which outputs something like the following (while building these examples, the data generator is running, so I will have different sizes than you will):

                                                 List of relations
 Schema |           Name           |   Type   |  Owner   | Persistence | Access method |    Size    | Description
--------+--------------------------+----------+----------+-------------+---------------+------------+-------------
 public | activities               | table    | postgres | permanent   | heap          | 77 MB      |
 public | activities_hourly_rollup | table    | postgres | permanent   | heap          | 4576 kB    |
 public | customers                | table    | postgres | permanent   | heap          | 1352 kB    |
 public | customers_id_seq         | sequence | postgres | permanent   |               | 8192 bytes |

In my example, the rollup table is 4576 kB and the raw table is 77 MB. Because HLL stores only hashed summaries, it is possible to delete old raw records yet keep the aggregated data: the aggregates remain usable, but history tracking for an individual is lost.
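
As a sketch of that pruning, assuming a 90-day retention window for raw data:

-- Raw activity older than 90 days is dropped; activities_hourly_rollup still
-- answers unique-count questions for the deleted range.
DELETE FROM activities
WHERE received_at < now() - interval '90 days';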

How much performance did we gain?

Because we run approximation calculations over pre-aggregated values, the database does not have to iterate over the raw records. In this example, the hll query runs in about 7% of the time of the query that iterates over the records:

Prior query

SELECT
	date_trunc('week', received_at) AS query_week,
	COUNT(DISTINCT customer_id) AS active_customer_count
FROM activities
WHERE received_at > '2023-01-01'
GROUP BY 1
ORDER BY 1;

⇒ ~ 1100ms

HLL query

SELECT
  date_trunc('week', hour) AS week,
  hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup
GROUP BY 1
ORDER BY 1;

⇒ ~80ms
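
To see where the time goes yourself, prepend EXPLAIN ANALYZE to either query, for example:

EXPLAIN (ANALYZE, BUFFERS)
SELECT
  date_trunc('week', hour) AS week,
  hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup
GROUP BY 1
ORDER BY 1;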

What was the trade-off?

We mentioned approximation, right? When working with hll calculations, you have to decide what is close enough.

Prior response:

query_week           | active_customer_count
---------------------+-----------------------
 2022-12-26 00:00:00 |                    43
 2023-01-02 00:00:00 |                   434
 2023-01-09 00:00:00 |                  2849
 2023-01-16 00:00:00 |                 10313
 2023-01-23 00:00:00 |                 14410
(5 rows)

HLL response:

week                 |   weekly_uniques
---------------------+--------------------
 2022-12-26 00:00:00 |                 43
 2023-01-02 00:00:00 |  443.7906710523894
 2023-01-09 00:00:00 |  2835.134752744712
 2023-01-16 00:00:00 | 10045.884001042512
 2023-01-23 00:00:00 | 13824.710142108923
(5 rows)

For close-enough scenarios, these weekly_uniques values are close enough.
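
If you want to quantify the gap, here is a sketch that joins the exact and approximate results and computes the error percentage (the subqueries are the two queries above):

SELECT
  e.query_week,
  e.active_customer_count AS exact_count,
  round(a.weekly_uniques) AS approx_count,
  round((100 * abs(a.weekly_uniques - e.active_customer_count)
         / e.active_customer_count)::numeric, 2) AS pct_error
FROM (
  SELECT date_trunc('week', received_at) AS query_week,
         COUNT(DISTINCT customer_id) AS active_customer_count
  FROM activities
  WHERE received_at > '2023-01-01'
  GROUP BY 1
) AS e
JOIN (
  SELECT date_trunc('week', hour) AS week,
         hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
  FROM activities_hourly_rollup
  GROUP BY 1
) AS a ON a.week = e.query_week
ORDER BY 1;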

Session defaults and tuning HLL config

You can use session defaults to configure the parameters for all hll columns. For this blog post, I chose to be explicit for each column because we have multiple tables with different settings.

SELECT hll_set_defaults(17, 5, -1, 1); -- arguments are in order:
-- log2m (default = 11)
-- regwidth (default = 5)
-- expthresh (default = -1)
-- sparseon (default = 1)

In most scenarios, I would also choose to be explicit, as implicit values may be frustrating some time in the future when storage fails due to mismatched values. And if you run an upsert with the wrong settings, you can expand a table's size significantly.

Tuning HLL consists of tuning a few different configurations to match your data needs while balancing the storage and performance gains.

Improving accuracy with log2m

By raising log2m from the default of 11 to 17, we increase the number of registers (2^log2m) used by the algorithm, which increases accuracy but also increases storage and slows queries down due to the extra compute time.

Then, let’s create a new table:

CREATE TABLE activities_hourly_rollup_max_log2m (
	hour timestamp,
	action varchar(255),
	customers hll(17, 5, -1, 1), -- first argument is the log2m, remaining are default, but we'll change them later
	unique (hour, action)
);

Now, let’s throw data at this new table. Notice the addition of the configurations when building the hll field:

MERGE INTO activities_hourly_rollup_max_log2m AS target
  USING (
    SELECT
      date_trunc('hour', received_at) AS hour,
      action,
      hll_add_agg(hll_hash_integer(customer_id), 17, 5, -1, 1) AS customers -- we set the same values as the table
    FROM activities
    GROUP BY 1, 2
  ) AS source
  ON target.hour = source.hour AND target.action = source.action
  WHEN MATCHED THEN
    UPDATE SET customers = source.customers
  WHEN NOT MATCHED THEN
    INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);

Looking at the table sizes with \d+, in my scenario the size has increased from 4752 kB for the activities_hourly_rollup table to 15 MB for the activities_hourly_rollup_max_log2m table. But if we run the aggregate query, we'll see that the gain for the trade-off is accuracy:

SELECT
  date_trunc('week', hour) AS week,
  hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup_max_log2m
GROUP BY 1
ORDER BY 1;

Where there was a ~5% difference in results previously, the difference is now less than 1%. Additionally, the run time of the query moved from ~150ms to ~925ms. When you think of log2m, think increased accuracy at the cost of storage and performance.

Tuning maximum distincts with regwidth

Now, let's set regwidth to the maximum to see the effect. regwidth changes the number of bits used per register; so log2m controls the number of registers, and regwidth controls the bits used per register. As you can imagine, this also has an impact on storage. Increase regwidth if you expect more unique values (i.e. higher cardinality). The Postgres HLL documentation has a chart with the cardinality limit for each log2m and regwidth combination.

CREATE TABLE activities_hourly_rollup_max_regwidth (
	hour timestamp,
	action varchar(255),
	customers hll(11, 7, -1, 1), -- second argument is regwidth set to maximum
	unique (hour, action)
);

Now, let’s load it with data:

MERGE INTO activities_hourly_rollup_max_regwidth AS target
  USING (
    SELECT
      date_trunc('hour', received_at) AS hour,
      action,
      hll_add_agg(hll_hash_integer(customer_id), 11, 7, -1, 1) AS customers -- we set the same values as the table
    FROM activities
    GROUP BY 1, 2
  ) AS source
  ON target.hour = source.hour AND target.action = source.action
  WHEN MATCHED THEN
    UPDATE SET customers = source.customers
  WHEN NOT MATCHED THEN
    INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);

When running \d+, we'll see that the table has increased from 4752 kB to 5928 kB, so it does not have the same impact on storage as log2m. But when we run the following query, we get NaN results once the number of uniques grows past the cardinality this register configuration can contain:

SELECT
  date_trunc('week', hour) AS week,
  hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup_max_regwidth
GROUP BY 1
ORDER BY 1;

Results in:

week                 |  weekly_uniques
---------------------+-------------------
 2022-12-26 00:00:00 |                43
 2023-01-02 00:00:00 | 443.7906710523894
 2023-01-09 00:00:00 | 2835.134752744712
 2023-01-16 00:00:00 |               NaN
 2023-01-23 00:00:00 |               NaN

For this scenario, we would need to increase log2m to allow more registers, since the registers themselves are already at their maximum size.

Using expthresh for accuracy, up to a point

The third setting, expthresh, buys accuracy up to a certain value; the trade-off is additional memory for calculations. The default so far has been -1, which effectively tells the database to make the best decision on its own. Let's leave the other values alone and set an explicit expthresh:

CREATE TABLE activities_hourly_rollup_max_expthresh (
	hour timestamp,
	action varchar(255),
	customers hll(12, 5, 8192, 1), -- 3rd argument is expthresh, set to 8192 (2^13)
	unique (hour, action)
);

Now, let's use the upsert to insert data into this new table:

MERGE INTO activities_hourly_rollup_max_expthresh AS target
  USING (
    SELECT
      date_trunc('hour', received_at) AS hour,
      action,
      hll_add_agg(hll_hash_integer(customer_id), 12, 5, 8192, 1) AS customers -- we set the same values as the table
    FROM activities
    GROUP BY 1, 2
  ) AS source
  ON target.hour = source.hour AND target.action = source.action
  WHEN MATCHED THEN
    UPDATE SET customers = source.customers
  WHEN NOT MATCHED THEN
    INSERT (hour, action, customers) VALUES (source.hour, source.action, source.customers);

Run a similar query against this table:

SELECT
  date_trunc('week', hour) AS week,
  hll_cardinality(hll_union_agg(customers)) AS weekly_uniques
FROM activities_hourly_rollup_max_expthresh
GROUP BY 1
ORDER BY 1;

And, you’ll see precise values up to the given expthresh:

week                 |   weekly_uniques
---------------------+--------------------
 2022-12-26 00:00:00 |                 43
 2023-01-02 00:00:00 |                434
 2023-01-09 00:00:00 |               2849
 2023-01-16 00:00:00 | 10478.318375775954
 2023-01-23 00:00:00 | 15889.381066151105

Summary

  • HLL is a compression algorithm, so results are approximations. It is great for collecting summary statistics and metrics, but not for exact calculation needs.
  • Using the HLL extension, create an aggregation table with an HLL column and query the aggregation table using the HLL functions.
  • An HLL table can be ~95% smaller than raw data tables.
  • Queries can run in ~10% of the time raw queries of the same data would take.
  • For tuning HLL, review log2m, regwidth, and expthresh, which can increase accuracy but will impact storage size and performance.