Running TPC-H Queries on Iceberg Tables from PostgreSQL
We recently introduced support for querying Iceberg tables from PostgreSQL in Crunchy Bridge for Analytics. Iceberg defines a way to store tables in data lakes (usually as Parquet files in S3) with support for snapshots and other important database features, and it is designed with high-performance analytics in mind.
If you’re new to Crunchy Bridge, it offers a fully managed PostgreSQL experience. Crunchy Bridge for Analytics extends these capabilities, enabling you to query and interact with object storage using Postgres, now including Iceberg tables on S3.
In this post, we will explore running TPC-H queries against Iceberg tables with PostgreSQL. You'll see why this could be a game-changer for handling demanding analytical workloads.
What is an Iceberg table?
Before diving into Iceberg tables, let's briefly revisit what a table is. In a database system, a table is an abstraction over how data is stored, such as PostgreSQL's file and folder layout. When you query a table, PostgreSQL locates the relevant files for you.
Iceberg is an open table format specification that defines metadata on how a table should be structured. It is designed for managing vast amounts of data and supporting analytical workloads. Iceberg offers several advantages, including schema evolution, hidden partitioning, and version rollback.
Iceberg tables are typically stored in object stores like Amazon S3, with data files in Parquet format. Parquet is a columnar file format with built-in compression, optimized for analytical queries. Storing Parquet files on S3 in an open table format provides a crucial benefit: interoperability. This allows multiple tools and query engines to seamlessly access the same data. For example, you can have your Spark jobs process large-scale data transformations, while you enjoy fast analytical queries on Postgres.
The Iceberg specification is backed by the Apache Software Foundation, and you can find all the details on GitHub.
How can you query Iceberg tables using Postgres?
Crunchy Bridge for Analytics uses familiar PostgreSQL syntax, making it easy to integrate. Iceberg tables are created as foreign tables backed by the `crunchy_lake_analytics` server, with the format set to `iceberg`. Finally, point to the Iceberg table on Amazon S3 via the `path` parameter:
-- Creates an Iceberg table in Postgres; the system fetches the
-- column definitions from the Iceberg metadata
CREATE FOREIGN TABLE nation ()
SERVER crunchy_lake_analytics
OPTIONS (
    format 'iceberg',
    path 's3://crunchy-analytics-datasets-use1/tpch_iceberg/scale_10/public/nation/metadata/00000-eb4cc10e-af7b-4164-91bc-a0c2c306ff49.metadata.json'
);
When you query an Iceberg table (or another analytics table), under the covers we delegate part of the computation to DuckDB, a state-of-the-art analytical query engine. Moreover, Iceberg is designed to be very cache-friendly, since all of its files are immutable. Bridge for Analytics caches all the data files and Iceberg metadata files on locally-attached storage for maximum performance.
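With the foreign table in place, you can query it like any other Postgres table. Here is a minimal sanity check against the `nation` table created above (the exact plan output varies by version, so it is not shown here):

-- Count rows in the Iceberg-backed foreign table
SELECT count(*) FROM nation;

-- Inspect the plan to see how the query is executed
EXPLAIN SELECT count(*) FROM nation;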
Running TPC-H queries on Postgres with Iceberg
The TPC-H benchmark is a widely recognized industry-standard test for database performance. We tested TPC-H queries on both Iceberg tables and regular PostgreSQL tables, and here are the steps to run them:
- Spin up a Crunchy Bridge for Analytics cluster, start here. Once you have your cluster up and running, connect to the database.
- We have pre-generated TPC-H data and uploaded it to a public Amazon S3 bucket for anyone to use. We have sample commands to connect to this public S3 bucket. Note that it may take a few minutes for the system to automatically download the files to the local NVMe drive. You can also ask the system to download the files synchronously.
- If you want to compare against a Postgres-only instance, create the regular Postgres tables and indexes using the same data set. For regular PostgreSQL tables, you will likely want to add as many indexes as needed to optimize performance. Here are the indexes we created; a couple of illustrative examples are sketched below.
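For illustration, this is the flavor of such indexes. These are hypothetical samples only, not the exact set linked above:

-- Hypothetical examples of indexes that help TPC-H queries on heap
-- tables; see the linked list for the indexes we actually used
CREATE INDEX ON lineitem (l_shipdate);
CREATE INDEX ON orders (o_orderdate);
CREATE INDEX ON customer (c_mktsegment);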
- We used scale factor 10 for the TPC-H data. Set the `search_path` to `tpch_10_iceberg` or `tpch_10_heap` to point to the tables that you want to use:

psql postgres://user:password@host:port/dbname -c "SET search_path TO tpch_10_iceberg;" -f q1.sql
- Start executing the 22 queries, beginning with Q1 shown below. The full set of queries is in the repo as well.
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM lineitem
WHERE l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
- Get the results. Using either `\timing` or `EXPLAIN ANALYZE`, you can capture the timing of each of the 22 queries run against the Iceberg tables and/or the Postgres tables. This can be done manually, or you can script the runs, as sketched below.
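One minimal way to script the runs from within psql, assuming the query files q1.sql through q22.sql from the repo are available locally:

-- Run inside psql; \timing reports the elapsed time of each statement
SET search_path TO tpch_10_iceberg;  -- or tpch_10_heap
\timing on
\i q1.sql
\i q2.sql
-- ...and so on through q22.sql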
TPC-H benchmark results on Iceberg vs indexed Postgres
In the above chart, we compare the execution times of the 22 queries defined by the TPC-H benchmark. The y-axis represents the time taken to execute each query in seconds. To keep the chart readable, we capped the maximum time at 10 seconds; this prevents some bars from becoming excessively large and others too small, making the comparison clearer. The capped queries are Q9, Q13, and Q18, which are marked with a red line. The light blue bars represent Iceberg tables, while the dark blue bars represent regular PostgreSQL tables with indexes. For the detailed execution times, refer to the appendix section.
Our tests show that the total execution time of the 22 TPC-H queries on Iceberg tables is about 14 times lower than on indexed PostgreSQL tables (roughly 7.8 seconds versus 112 seconds in total; see the appendix). This significant performance boost highlights the efficiency and speed that Crunchy Bridge for Analytics brings to your PostgreSQL database. We discussed the underlying technology that provides this boost in a previous article.
Notably, there is only one query (Q19) where the execution time is lower for regular PostgreSQL tables. Our tests show that heavy use of indexes helps answer this query quickly via an index-only scan. If we ran the same query without any indexes, it would have been roughly 10 times slower.
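For intuition, here is a generic sketch of how a covering index enables an index-only scan. This is illustrative only, not the exact query or indexes from our run:

-- If an index covers every column the query touches, Postgres can
-- answer it from the index without reading the table heap
CREATE INDEX ON lineitem (l_shipdate, l_extendedprice);
EXPLAIN
SELECT sum(l_extendedprice)
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02';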
Summary
- We have created a public S3 bucket so that anyone can run TPC-H queries against Iceberg tables using Crunchy Bridge for Analytics.
- Running TPC-H queries on Iceberg tables from PostgreSQL demonstrates significant performance gains over standard Postgres.
We hope this exploration inspires you to experiment with Iceberg tables on Crunchy Bridge for Analytics and shows you what’s possible now with OLAP workloads in Postgres. Let us know if you have any comments or questions!
Appendix: Sample benchmark results
Conducting comprehensive database benchmarks is a complex process. Here, I've executed a focused test run to give a snapshot of the performance, acknowledging that various other parameters can affect the overall results. This focused test aims to show a general trend in overall performance.
In this test, we used the Analytics-128 instance type to run the TPC-H queries. The data was generated using DuckDB's tpch extension, and Spark was used to create the Iceberg tables. The tests were conducted with a hot cache, and the results are the average of three runs.
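For reference, a minimal sketch of the data generation step using DuckDB's tpch extension (run in a DuckDB shell, not in Postgres):

-- DuckDB: generate the TPC-H tables at scale factor 10
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 10);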
| Query | Iceberg Table (ms) | Indexed Postgres Table (ms) |
|---|---|---|
| 1 | 329.597 | 6099.366 |
| 2 | 198.148 | 2241.877 |
| 3 | 318.105 | 2456.459 |
| 4 | 221.961 | 881.417 |
| 5 | 319.218 | 3576.895 |
| 6 | 141.033 | 1735.455 |
| 7 | 917.689 | 1667.338 |
| 8 | 399.895 | 4533.325 |
| 9 | 535.732 | 10167.185 |
| 10 | 327.829 | 2196.22 |
| 11 | 83.979 | 676.779 |
| 12 | 253.095 | 2091.635 |
| 13 | 572.946 | 15494.473 |
| 14 | 186.335 | 889.872 |
| 15 | 322.212 | 5149.462 |
| 16 | 186.878 | 1159.822 |
| 17 | 253.43 | 869.421 |
| 18 | 652.72 | 43694.238 |
| 19 | 464.98 | 141.36 |
| 20 | 272.433 | 3147.819 |
| 21 | 676.448 | 2586.442 |
| 22 | 167.936 | 264.262 |