
Creating Histograms with Postgres

Elizabeth Christensen

Christopher Winslett

Histograms were first used in an 1892 lecture by Karl Pearson, the godfather of mathematical statistics. With the wealth of data presentation tools we have today, it’s hard to imagine that representing data as a graphic once counted as “innovation”, but it did. A histogram is a graphic presentation of the distribution and frequency of data. If you haven’t seen one recently, or don’t know the word histogram off the top of your head: it is a bar chart in which each bar represents the count of data points within a defined range of values. When Pearson built the first histogram, he calculated it by hand. Today we can use SQL (or even Excel) to extract this data continuously across large data sets.

While true statistical histograms involve more sophisticated methods for choosing bin ranges, for many business intelligence purposes, Postgres width_bucket is good enough for counting data inside bins with minimal effort.

Postgres width_bucket for histograms

Given the number of buckets and the min/max values, width_bucket returns the index of the bucket that a value falls into. For instance, given a minimum value of 0, a maximum value of 100, and 10 buckets, a value of 43 falls into bucket #5:

SELECT width_bucket(43, 0, 100, 10) AS bucket;

But is 5 really the correct bucket for 43?

You can see how the values would fall using generate_series:

SELECT value, width_bucket(value, 0, 100, 10) AS bucket
FROM generate_series(0, 100) AS value;

(Image: chart of width_bucket assignments for values 0 through 100)

When running the query, the values 0 through 9 go into bucket 1. As you can see in the image above, width_bucket behaves as a step function whose indexing starts at 1. When passed a value of 100, width_bucket returns 11, because the maximum given to width_bucket is an exclusive bound (i.e. the logic is minimum <= value < maximum).
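You can verify the boundary behavior directly. Values below the minimum land in bucket 0, and values at or above the maximum land in bucket_count + 1:

SELECT width_bucket(100, 0, 100, 10) AS bucket; -- returns 11 (overflow bucket)
SELECT width_bucket(-5, 0, 100, 10) AS bucket;  -- returns 0 (underflow bucket)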

We can use the bucket value to generate more readable labels.

Auto-formatting histogram with SQL

Let’s build out a larger query that creates ranges, range labels, and formats the histogram. We will start by using a synthetic table within a CTE called formatted_data. We are doing it this way so that we can replace that query with new data in the future.

Here’s the beginning of the query (this is copy-pastable into Postgres):

WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
)
SELECT
  WIDTH_BUCKET(value, 0, 100, 10) AS bucket,
  COUNT(value)
FROM formatted_data
GROUP BY 1
ORDER BY 1;
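With the nine sample values, the result looks like this (note the missing buckets, which we'll address shortly):

 bucket | count
--------+-------
      1 |     1
      2 |     2
      5 |     3
      6 |     1
      7 |     1
     10 |     1
(6 rows)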

Let’s use another CTE to define some settings for our width_bucket:

WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
  SELECT
    10 AS bucket_count,
    0::integer AS min_value, -- can be null::integer or an integer
    100::integer AS max_value -- can be null::integer or an integer
)

SELECT
  WIDTH_BUCKET(value,
    (SELECT min_value FROM bucket_settings),
    (SELECT max_value FROM bucket_settings),
    (SELECT bucket_count FROM bucket_settings)
  ) AS bucket,
  COUNT(value)
FROM formatted_data
GROUP BY 1
ORDER BY 1;

In the bucket_settings CTE, we use ::integer to cast each value to an integer. We do this because we will want to compare NULLs against other integers later; if we don’t cast the NULLs, the SQL will fail.
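As a quick illustration of the failure mode: without the cast, Postgres resolves the NULL column to type text, and comparing it against an integer raises a type-mismatch error:

-- Fails: the uncast NULL column resolves to type text
WITH bucket_settings AS (SELECT null AS min_value)
SELECT COALESCE(min_value, 0) FROM bucket_settings;
-- ERROR:  COALESCE types text and integer cannot be matched

-- Works: the cast makes the column an integer
WITH bucket_settings AS (SELECT null::integer AS min_value)
SELECT COALESCE(min_value, 0) FROM bucket_settings;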

Now we will use a CTE called calculated_bucket_settings to compute a dynamic range when the static range is not defined. This lets the data supply the min and max values whenever bucket_settings leaves them NULL:

WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
  SELECT
    5 AS bucket_count,
    null::integer AS min_value, -- can be null or an integer
    null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
  SELECT
    (SELECT bucket_count FROM bucket_settings) AS bucket_count,
    COALESCE(
      (SELECT min_value FROM bucket_settings),
      (SELECT min(value) FROM formatted_data)
    ) AS min_value,
    COALESCE(
      (SELECT max_value FROM bucket_settings),
      (SELECT max(value) + 1 FROM formatted_data)
    ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, min_value, max_value, (SELECT bucket_count FROM bucket_settings)) AS bucket,
    COUNT(value) AS frequency
  FROM formatted_data, calculated_bucket_settings
  GROUP BY 1
  ORDER BY 1
)

SELECT
  bucket,
  frequency,
  CONCAT(
    (min_value + (bucket - 1) * (max_value - min_value) / bucket_count)::INT,
    ' - ',
    ((min_value + bucket * (max_value - min_value) / bucket_count) - 1)::INT
  ) AS range
FROM histogram, calculated_bucket_settings;

In the calculated_bucket_settings CTE, we use max(value) + 1 because width_bucket treats the maximum as an exclusive bound. Also, because we are working with integers, when creating the pretty label for each range we subtract 1 from the range’s maximum to avoid what would otherwise look like overlapping ranges. This decision fits into the “good enough for business intelligence” caveats listed above. We could have written the label logic as 75 <= value < 94 in lieu of the subtraction, but most folks prefer to see a dash rather than math notation on a histogram.
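To make the label math concrete, here is bucket 5 worked by hand. With this sample data, min_value is 1, max_value is 93 + 1 = 94, and bucket_count is 5, so with integer division:

SELECT 1 + (5 - 1) * (94 - 1) / 5 AS lower_bound, -- 1 + (4 * 93 / 5) = 1 + 74 = 75
       (1 + 5 * (94 - 1) / 5) - 1 AS upper_bound; -- (1 + 93) - 1 = 93

That matches the 75 - 93 row in the output below.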

The query above will give results like the following:

bucket   | frequency |  range
---------+-----------+---------
       1 |         3 | 1 - 18
       3 |         4 | 38 - 55
       4 |         1 | 56 - 74
       5 |         1 | 75 - 93
(4 rows)

Now we see that not all buckets and frequencies are represented: buckets that contain no values are missing from the output. If a bucket is empty, we need to fill in its frequency with a zero. This is where SQL requires thinking in sets. We can use generate_series to generate the complete set of buckets, then join the histogram onto it. Flipping the order of the query around, starting from the complete set, is simpler than trying to patch an incomplete one.
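Here’s the pattern in miniature, with made-up sparse counts standing in for the histogram:

-- Generate the complete set of buckets, then join the sparse counts
-- onto it; buckets with no match get a frequency of zero.
SELECT buckets.bucket,
       COALESCE(sparse.frequency, 0) AS frequency
FROM generate_series(1, 5) AS buckets (bucket)
LEFT JOIN (VALUES (1, 3), (3, 4)) AS sparse (bucket, frequency)
  ON buckets.bucket = sparse.bucket
ORDER BY buckets.bucket;

In the full query below, the all_buckets CTE builds the complete set (along with each bucket’s label bounds), and a FULL OUTER JOIN to the histogram CTE fills in the frequencies: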

WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
  SELECT
    5 AS bucket_count,
    0::integer AS min_value, -- can be null or an integer
    100::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
  SELECT
    (SELECT bucket_count FROM bucket_settings) AS bucket_count,
    COALESCE(
      (SELECT min_value FROM bucket_settings),
      (SELECT min(value) FROM formatted_data)
    ) AS min_value,
    COALESCE(
      (SELECT max_value FROM bucket_settings),
      (SELECT max(value) + 1 FROM formatted_data)
    ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
    COUNT(value) AS frequency
  FROM formatted_data, calculated_bucket_settings
  GROUP BY 1
  ORDER BY 1
), all_buckets AS (
  SELECT
    fill_buckets.bucket AS bucket,
    FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
    FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
  FROM calculated_bucket_settings,
    generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket)
)

SELECT
  all_buckets.bucket AS bucket,
  CASE
    WHEN all_buckets IS NULL THEN 'out of bounds'
    ELSE CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
  END AS range,
  SUM(COALESCE(histogram.frequency, 0)) AS frequency
FROM all_buckets
FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
GROUP BY 1, 2
ORDER BY bucket;

Try modifying the values in the bucket_settings CTE to see how the histogram responds. Increase bucket_count, or change min_value and max_value, and you’ll see the buckets adjust accordingly. If you narrow the range so that it excludes some values, the FULL OUTER JOIN ensures all unclassified items land in the “out of bounds” row.

Using a presentation tool, display the histogram as a bar chart:

(Image: histogram displayed as a bar chart)

Real Life Data with Histograms

Now that we have a nice auto-adjusting query, we can easily build histograms from other data sets. I have a small experimental database loaded with clinical trials data.

What if we wanted to build a histogram for the count of participants in various clinical trial studies? To start, build the query that finds the number of participants for each study:

SELECT
  outcomes.nct_id,
  MAX(outcome_counts.count) AS value
FROM outcomes
INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
WHERE param_type = 'COUNT_OF_PARTICIPANTS'
GROUP BY 1;

We can take the above query and place it in the formatted_data CTE:

WITH formatted_data AS (
  SELECT
    outcomes.nct_id,
    MAX(outcome_counts.count) AS value
  FROM outcomes
  INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
  WHERE param_type = 'COUNT_OF_PARTICIPANTS'
  GROUP BY 1
), bucket_settings AS (
  SELECT
    20 AS bucket_count,
    null::integer AS min_value, -- can be null or an integer
    null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
  SELECT
    (SELECT bucket_count FROM bucket_settings) AS bucket_count,
    COALESCE(
      (SELECT min_value FROM bucket_settings),
      (SELECT min(value) FROM formatted_data)
    ) AS min_value,
    COALESCE(
      (SELECT max_value FROM bucket_settings),
      (SELECT max(value) + 1 FROM formatted_data)
    ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
    COUNT(value) AS frequency
  FROM formatted_data, calculated_bucket_settings
  GROUP BY 1
  ORDER BY 1
), all_buckets AS (
  SELECT
    fill_buckets.bucket AS bucket,
    FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
    FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
  FROM calculated_bucket_settings,
    generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket)
)

SELECT
  all_buckets.bucket AS bucket,
  CASE
    WHEN all_buckets IS NULL THEN 'out of bounds'
    ELSE CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
  END AS range,
  SUM(COALESCE(histogram.frequency, 0)) AS frequency
FROM all_buckets
FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
GROUP BY 1, 2
ORDER BY bucket;

The query will output the following. This is a bit undesirable because the distribution is concentrated in the first bucket:

 bucket |       range       | frequency
--------+-------------------+-----------
      1 | 1 - 359943        |     23261
      2 | 359944 - 719886   |         3
      3 | 719887 - 1079829  |         1
      4 | 1079830 - 1439773 |         0
      5 | 1439774 - 1799716 |         1
      6 | 1799717 - 2159659 |         0
      7 | 2159660 - 2519602 |         0
      8 | 2519603 - 2879546 |         0
      9 | 2879547 - 3239489 |         0
     10 | 3239490 - 3599432 |         0
     11 | 3599433 - 3959375 |         0
     12 | 3959376 - 4319319 |         0
     13 | 4319320 - 4679262 |         0
     14 | 4679263 - 5039205 |         0
     15 | 5039206 - 5399148 |         0
     16 | 5399149 - 5759092 |         0
     17 | 5759093 - 6119035 |         0
     18 | 6119036 - 6478978 |         0
     19 | 6478979 - 6838921 |         0
     20 | 6838922 - 7198865 |         1
(20 rows)

If you’ve loaded the data, we can improve the presentation by adjusting the bucket_settings CTE to change how the buckets are defined. For instance, with this dataset, we can change the bucket settings to:

  SELECT
    20 AS bucket_count,
    0::integer AS min_value, -- can be null or an integer
    1000::integer AS max_value -- can be null or an integer

It outputs a much nicer distribution of data:

 bucket |     range     | frequency
--------+---------------+-----------
      1 | 0 - 49        |     13584
      2 | 50 - 99       |      3612
      3 | 100 - 149     |      1720
      4 | 150 - 199     |       942
      5 | 200 - 249     |       645
      6 | 250 - 299     |       477
      7 | 300 - 349     |       338
      8 | 350 - 399     |       237
      9 | 400 - 449     |       176
     10 | 450 - 499     |       137
     11 | 500 - 549     |       150
     12 | 550 - 599     |       101
     13 | 600 - 649     |        77
     14 | 650 - 699     |        58
     15 | 700 - 749     |        61
     16 | 750 - 799     |        41
     17 | 800 - 849     |        41
     18 | 850 - 899     |        33
     19 | 900 - 949     |        36
     20 | 950 - 999     |        43
        | out of bounds |       758

In brief

  • Postgres width_bucket builds buckets that gather frequency values for creating histograms.
    • The function assigns values to predefined buckets based on a min/max range and a bucket count.
    • By casting NULLs, you can work with settings that leave parts of the range undefined.
    • Values that fall outside the defined range can still be counted as “out of bounds”.
  • Using Common Table Expressions (CTEs), you can define bucket settings dynamically, with auto-adjusting bins based on the dataset.
  • Histograms aid the visualization of data and of the distribution within your set. They show how frequently data points appear within specific ranges (bins), making it easier to understand patterns, trends, and outliers. Bin size affects interpretation, so choosing the right number of bins is crucial: too few can oversimplify the data, while too many can create noise and obscure trends.

Build an interesting histogram? Show us @crunchydata!