Creating Histograms with Postgres
Histograms were first used in an 1892 lecture by Karl Pearson, the godfather of mathematical statistics. With the number of data presentation tools we have today, it’s hard to imagine that representing data as a graphic once counted as “innovation”, but it did. A histogram is a graphic presentation of the distribution and frequency of data. If you haven’t seen one recently, or don’t know the word histogram off the top of your head: it is a bar chart in which each bar represents the count of data within a defined range of values. When Pearson built the first histogram, he calculated it by hand. Today we can use SQL (or even Excel) to extract this data continuously across large data sets.
While true statistical histograms involve more complexity in choosing bin ranges, for many business intelligence purposes, Postgres width_bucket is good enough for counting data inside bins with minimal effort.
Postgres width_bucket for histograms
Given the number of buckets and the min/max values, width_bucket returns the index of the bucket that a value falls into. For instance, given a minimum value of 0, a maximum value of 100, and 10 buckets, a value of 43 falls in bucket #5:

select width_bucket(43, 0, 100, 10) AS bucket;

But is bucket 5 really where 43 belongs? You can see how the values fall using generate_series:
SELECT value, width_bucket(value, 0, 100, 10) AS bucket FROM generate_series(0, 100) AS value;
When running the query, the values 0 through 9 go into bucket 1. As the output shows, width_bucket behaves as a step function whose indexing starts at 1. In this scenario, when passed a value of 100, width_bucket returns 11, because the maximum is an exclusive bound (the logic is minimum <= value < maximum): anything greater than or equal to the maximum lands in bucket count + 1, and anything below the minimum lands in bucket 0.
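The out-of-range behavior is easy to verify directly:

```sql
SELECT width_bucket(-5, 0, 100, 10)  AS below_range, -- returns 0
       width_bucket(100, 0, 100, 10) AS at_maximum,  -- returns 11
       width_bucket(250, 0, 100, 10) AS above_range; -- returns 11
```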
We can use the bucket value to generate more readable labels.
Auto-formatting histogram with SQL
Let’s build out a larger query that creates ranges, range labels, and formats the histogram. We will start with a synthetic table inside a CTE called formatted_data. We are doing it this way so that we can swap in new data in the future.
Here’s the beginning of the query (this is copy-pastable into Postgres):
WITH formatted_data AS (
SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
)
SELECT
WIDTH_BUCKET(value, 0, 100, 10) AS bucket,
COUNT(value)
FROM formatted_data
GROUP BY 1
ORDER BY 1;
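Running that against the nine sample values groups them like this (notice that empty buckets simply don’t appear):

```
 bucket | count
--------+-------
      1 |     1
      2 |     2
      5 |     3
      6 |     1
      7 |     1
     10 |     1
(6 rows)
```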
Let’s use another CTE to define some settings for our width_bucket:
WITH formatted_data AS (
SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
SELECT
10 as bucket_count,
0::integer AS min_value, -- can be null::integer or an integer
100::integer AS max_value -- can be null::integer or an integer
)
SELECT
WIDTH_BUCKET(value,
(SELECT min_value FROM bucket_settings),
(SELECT max_value FROM bucket_settings),
(SELECT bucket_count FROM bucket_settings)
) AS bucket,
COUNT(value)
FROM formatted_data
GROUP BY 1
ORDER BY 1;
In the bucket_settings CTE, we use ::integer to cast any value there to an integer. We do this because we will later want to compare NULL against other integers; if we leave the NULLs untyped, the SQL will fail with a type error.
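A minimal sketch of the failure mode (GREATEST here is just a stand-in for any integer comparison against the setting):

```sql
-- A bare NULL column in a subquery defaults to type text, so comparing it
-- with an integer fails with a type-mismatch error:
-- SELECT GREATEST(min_value, 0) FROM (SELECT null AS min_value) s;

-- Casting gives Postgres a concrete type; GREATEST skips the NULL
-- and this returns 0:
SELECT GREATEST(min_value, 0) FROM (SELECT null::integer AS min_value) s;
```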
Now, we will use a CTE called calculated_bucket_settings to set a dynamic range when a static range is not defined. This lets the data supply any values that bucket_settings leaves as NULL:
WITH formatted_data AS (
SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
SELECT
5 AS bucket_count,
null::integer AS min_value, -- can be null or an integer
null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
SELECT
(SELECT bucket_count FROM bucket_settings) AS bucket_count,
COALESCE(
(SELECT min_value FROM bucket_settings),
(SELECT min(value) FROM formatted_data)
) AS min_value,
COALESCE(
(SELECT max_value FROM bucket_settings),
(SELECT max(value) + 1 FROM formatted_data)
) AS max_value
), histogram AS (
SELECT
WIDTH_BUCKET(value, min_value, max_value, (SELECT bucket_count FROM bucket_settings)) AS bucket,
COUNT(value) AS frequency
FROM formatted_data, calculated_bucket_settings
GROUP BY 1
ORDER BY 1
)
SELECT
bucket,
frequency,
CONCAT(
(min_value + (bucket - 1) * (max_value - min_value) / bucket_count)::INT,
' - ',
(((min_value + bucket * (max_value - min_value) / bucket_count)) - 1)::INT) AS range
FROM histogram, calculated_bucket_settings;
In the calculated_bucket_settings CTE, we use max(value) + 1 when deriving the maximum from the data, because width_bucket treats the maximum as an exclusive bound. Also, because we are working with integers, when creating the pretty label for the range, we subtract 1 from the maximum of each range to avoid what would otherwise look like overlapping ranges. This decision fits the “good enough for business intelligence” caveat listed above. We could have written the label logic as 75 <= value < 94 in lieu of the subtraction, but most folks would rather see a dash than math notation in a histogram.
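To sanity-check the label math, plug bucket 5 with min 0, max 100, and 10 buckets into the same expressions:

```sql
SELECT (0 + (5 - 1) * (100 - 0) / 10)::int AS range_start, -- 40
       ((0 + 5 * (100 - 0) / 10) - 1)::int AS range_end;   -- 49
```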
The query above will give results like the following:
bucket | frequency | range
---------+-----------+---------
1 | 3 | 1 - 18
3 | 4 | 38 - 55
4 | 1 | 56 - 74
5 | 1 | 75 - 93
(4 rows)
Now we see that not all buckets and frequencies are represented. If a bucket is empty, we need to fill in its frequency with a zero. This is where SQL requires thinking in sets. We can use generate_series to generate all the bucket indexes, then join the histogram to them. Flipping the order of the query around makes this simpler than joining against an incomplete set. In the following query, we’ve built out the buckets in the all_buckets CTE, then joined that to the histogram values:
WITH formatted_data AS (
SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
SELECT
5 AS bucket_count,
0::integer AS min_value, -- can be null or an integer
100::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
SELECT
(SELECT bucket_count FROM bucket_settings) AS bucket_count,
COALESCE(
(SELECT min_value FROM bucket_settings),
(SELECT min(value) FROM formatted_data)
) AS min_value,
COALESCE(
(SELECT max_value FROM bucket_settings),
(SELECT max(value) + 1 FROM formatted_data)
) AS max_value
), histogram AS (
SELECT
WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value, (SELECT bucket_count FROM bucket_settings)) AS bucket,
COUNT(value) AS frequency
FROM formatted_data, calculated_bucket_settings
GROUP BY 1
ORDER BY 1
), all_buckets AS (
SELECT
fill_buckets.bucket AS bucket,
FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
FROM calculated_bucket_settings,
generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))
SELECT
all_buckets.bucket AS bucket,
CASE
WHEN all_buckets IS NULL THEN
'out of bounds'
ELSE
CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
END AS range,
SUM(COALESCE(histogram.frequency, 0)) AS frequency
FROM all_buckets
FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
GROUP BY 1, 2
ORDER BY bucket;
Try modifying the values in the bucket_settings CTE to see how the histogram responds. Increase the bucket_count, min_value, or max_value, and the buckets adjust accordingly. If you modify the range so that it excludes some values, the FULL OUTER JOIN ensures those non-classified items show up bucketed as “out of bounds”.
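For instance (a hypothetical tweak to the bucket_settings above), narrow the range so some of the sample values fall outside it:

```sql
SELECT
5 AS bucket_count,
0::integer AS min_value, -- can be null or an integer
50::integer AS max_value -- 51, 62, and 93 now land out of bounds
```

With these settings, three of the nine sample values exceed the maximum, so the final row reports an “out of bounds” frequency of 3.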
Using a presentation tool, you can display the histogram as a bar chart.
Real Life Data with Histograms
Now that we have a nice auto-adjusting query, we can easily build a histogram from other data sets. I have a little experimental database loaded from a database of clinical trials. What if we wanted to build a histogram of participant counts across clinical trial studies? To start, build the query that finds the number of participants for each study:
SELECT
outcomes.nct_id,
max(outcome_counts.count) AS value
FROM outcomes
INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
WHERE param_type = 'COUNT_OF_PARTICIPANTS'
GROUP BY 1
We can take the above query and place it in the formatted_data CTE:
WITH formatted_data AS (
SELECT
outcomes.nct_id,
MAX(outcome_counts.count) AS value
FROM outcomes
INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
WHERE param_type = 'COUNT_OF_PARTICIPANTS'
GROUP BY 1
), bucket_settings AS (
SELECT
20 AS bucket_count,
null::integer AS min_value, -- can be null or an integer
null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
SELECT
(SELECT bucket_count FROM bucket_settings) AS bucket_count,
COALESCE(
(SELECT min_value FROM bucket_settings),
(SELECT min(value) FROM formatted_data)
) AS min_value,
COALESCE(
(SELECT max_value FROM bucket_settings),
(SELECT max(value) + 1 FROM formatted_data)
) AS max_value
), histogram AS (
SELECT
WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value, (SELECT bucket_count FROM bucket_settings)) AS bucket,
COUNT(value) AS frequency
FROM formatted_data, calculated_bucket_settings
GROUP BY 1
ORDER BY 1
), all_buckets AS (
SELECT
fill_buckets.bucket AS bucket,
FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
FROM calculated_bucket_settings,
generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))
SELECT
all_buckets.bucket AS bucket,
CASE
WHEN all_buckets IS NULL THEN
'out of bounds'
ELSE
CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
END AS range,
SUM(COALESCE(histogram.frequency, 0)) AS frequency
FROM all_buckets
FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
GROUP BY 1, 2
ORDER BY bucket;
The query will output the following. This is a bit undesirable because the distribution is concentrated in the first bucket:
bucket | range | frequency
--------+-------------------+-----------
1 | 1 - 359943 | 23261
2 | 359944 - 719886 | 3
3 | 719887 - 1079829 | 1
4 | 1079830 - 1439773 | 0
5 | 1439774 - 1799716 | 1
6 | 1799717 - 2159659 | 0
7 | 2159660 - 2519602 | 0
8 | 2519603 - 2879546 | 0
9 | 2879547 - 3239489 | 0
10 | 3239490 - 3599432 | 0
11 | 3599433 - 3959375 | 0
12 | 3959376 - 4319319 | 0
13 | 4319320 - 4679262 | 0
14 | 4679263 - 5039205 | 0
15 | 5039206 - 5399148 | 0
16 | 5399149 - 5759092 | 0
17 | 5759093 - 6119035 | 0
18 | 6119036 - 6478978 | 0
19 | 6478979 - 6838921 | 0
20 | 6838922 - 7198865 | 1
(20 rows)
If you’ve loaded the data, you can improve the presentation by adjusting the bucket_settings CTE to change how the buckets are defined. For instance, with this dataset, if we change the bucket settings to:
SELECT
20 AS bucket_count,
0::integer AS min_value, -- can be null or an integer
1000::integer AS max_value -- can be null or an integer
It outputs a much nicer distribution of data:
bucket | range | frequency
--------+---------------+-----------
1 | 0 - 49 | 13584
2 | 50 - 99 | 3612
3 | 100 - 149 | 1720
4 | 150 - 199 | 942
5 | 200 - 249 | 645
6 | 250 - 299 | 477
7 | 300 - 349 | 338
8 | 350 - 399 | 237
9 | 400 - 449 | 176
10 | 450 - 499 | 137
11 | 500 - 549 | 150
12 | 550 - 599 | 101
13 | 600 - 649 | 77
14 | 650 - 699 | 58
15 | 700 - 749 | 61
16 | 750 - 799 | 41
17 | 800 - 849 | 41
18 | 850 - 899 | 33
19 | 900 - 949 | 36
20 | 950 - 999 | 43
| out of bounds | 758
In brief
- Postgres width_bucket builds buckets for gathering frequency values into a histogram. The function assigns values to predefined buckets based on a min/max range and a bucket count.
- By casting, you can work with data that contains some NULL values.
- Values that fall outside the defined range can still be accounted for in an “out of bounds” bucket.
- By using Common Table Expressions (CTEs), you can define bucket settings dynamically, with auto-adjusting bins based on the dataset.
- Histograms can aid with the visualization of data and data distribution in your set. They show how frequently data points appear within specific ranges (bins), making it easier to understand patterns, trends, and outliers. Bin size affects interpretation, so choosing the right number of bins is crucial: too few can oversimplify the data, while too many can create noise and obscure trends.
Build an interesting histogram? Show us @crunchydata!