Running TPC-H Queries on Iceberg Tables from PostgreSQL
We recently introduced support for querying Iceberg tables from PostgreSQL in Crunchy Bridge for Analytics. Iceberg defines a way to store tables in data lakes (usually as Parquet files in S3) with support for snapshots and other important database features, and it is designed with high performance analytics in mind.
If you’re new to Crunchy Bridge, it offers a fully managed PostgreSQL experience. Crunchy Bridge for Analytics extends these capabilities, enabling you to query and interact with object storage using Postgres, now including Iceberg tables on S3.
In this post, we will explore running TPC-H queries against Iceberg tables with PostgreSQL. You'll see why this could be a game-changer for handling demanding analytical workloads.
What is an Iceberg table?
Before diving into Iceberg tables, let's briefly revisit what a table is. In a database system, a table is an abstraction that organizes data stored, such as PostgreSQL’s file and folder layout. When you query a table, PostgreSQL locates the relevant files for you.
Iceberg is an open table format specification that defines metadata on how a table should be structured. It is designed for managing vast amounts of data and supporting analytical workloads. Iceberg offers several advantages, including schema evolution, hidden partitioning, and version rollback.
Iceberg tables are typically stored in object stores like Amazon S3, with data files in Parquet format. Parquet is a columnar file format with built-in compression, optimized for analytical queries. Storing Parquet files on S3 in an open table format provides a crucial benefit: interoperability. This allows multiple tools and query engines to seamlessly access the same data. For example, you can have your Spark jobs process large-scale data transformations, while you enjoy fast analytical queries on Postgres.
The Iceberg specification is backed by the Apache Foundation, and you can find all the details on GitHub.
How can you query Iceberg tables using Postgres?
Crunchy Bridge for Analytics uses familiar PostgreSQL syntax, making it easy to integrate. Iceberg tables are created as foreign tables backed by the crunchy_lake_analytics
server, with the format set to iceberg
. Finally, point to the iceberg table on Amazon S3 via the path
parameter:
--creates an Iceberg table on Postgres, the system fetches the
--column definitions from the metadata
CREATE FOREIGN TABLE nation()
SERVER crunchy_lake_analytics OPTIONS(
format 'iceberg',
path 's3://crunchy-analytics-datasets-use1/tpch_iceberg/scale_10/public/nation/metadata/00000-eb4cc10e-af7b-4164-91bc-a0c2c306ff49.metadata.json'
);
When you query an Iceberg table (or another analytics table), under the covers we delegate part of the computation to DuckDB, which is a state-of-the-art analytical query engine. Moreover, Iceberg is designed to be very cache-friendly since all files are immutable. Bridge for Analytics caches all the data files and Iceberg metadata files on locally-attached storage for maximum performance.
Running TPC-H queries on Postgres with Iceberg
The TPC-H benchmark is a widely recognized industry-standard test for database performance. We tested TPC-H queries on Iceberg tables and regular PostgreSQL tables and here are the steps to run them:
- Spin-up a Crunchy Bridge for Analytics cluster, start here. Once you have your cluster up and running, connect to the database.
- We have pre-generated and uploaded TPC-H data to a public Amazon S3 for anyone to use for this. We have a sample commands to connect to this public S3 bucket. Note that it may take a few minutes for the system to automatically download files to the local NVMe drive. You can ask the system to synchronously download the files as well.
- If you want to compare a Postgres only instance, create the regular Postgres tables and indexes using the same data set. For regular PostgreSQL tables, you will likely want to add as many indexes as needed to optimize performance. Here, are the indexes we created.
- We used scale factor 10 for the TPC-H data. Set the
search_path
totpch_10_iceberg
ortpch_10_heap
to point to the tables that you want to use:psql postgres://user:password@host:port/dbname -c "SET search_path TO tpch_10_iceberg;" -f q1.sql
- Start executing the 22 queries, starting with as Q1. The full set of queries is in the repo file as well.
SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM lineitem WHERE l_shipdate <= CAST('1998-09-02' AS date) GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;
- Get the results. Either by using
\timing
orEXPLAIN ANALYZE
, you can return the timing of each of the 22 queries run against the Iceberg tables and/or the Postgres table. This can be manual or you can script the results.
TPC-H benchmark results on Iceberg vs indexed Postgres
In the above chart, we compare the execution times of 22 queries defined by the TPC-H benchmark. The y-axis represents the time taken to execute each query in seconds. To ensure the chart remains readable, we capped the maximum time at 10 seconds. This prevents some bars from becoming excessively large and others too small, making the comparison clearer. The impacted queries are Q9, Q13 and Q18, which also have a red line to denote. The light blue bars represent Iceberg tables, while the dark blue bars represent regular PostgreSQL tables with indexes. For the detailed execution times, refer to the appendix section.
Our tests show that the total execution times of TPC-H queries on Iceberg tables are on average 14 times faster than on indexed PostgreSQL tables. This significant performance boost highlights the efficiency and speed that Crunchy Bridge for Analytics brings to your PostgreSQL database. We discussed the underlying technology that provides this boost in a previous article.
Notably, there is only one query where the execution time is lower for regular PostgreSQL tables. Our tests show that heavy use of indexes helps to answer the query fast via index-only-scan. If we run the same query without any indexes, it’d have been roughly 10 times slower.
Summary
- We have created a public S3 bucket for anyone to run TPC-H queries against Iceberg, using Crunchy Bridge for Analytics
- Running TPC-H queries on Iceberg tables from PostgreSQL demonstrates significant performance over standard Postgres.
We hope this exploration inspires you to experiment with Iceberg tables on Crunchy Bridge for Analytics and shows you what’s possible now with OLAP workloads in Postgres. Let us know if you have any comments or questions!
Appendix: Sample benchmark results
Conducting comprehensive database benchmarks is a complex process. Here, I've executed a focused test run to give a snapshot of the performance, acknowledging that various other parameters can affect the overall results. This focused test aims to show a general trend of the overall performance impact.
In this test, we used the Analytics-128 instance type to run the TPC-H queries. The data was generated using DuckDB's TPCH extension. Spark was used to create the Iceberg tables. The tests were conducted on a hot-cache, and the results are the average of three runs.
Query | Iceberg Table (ms) | Indexed Postgres Table (ms) |
---|---|---|
1 | 329.597 | 6099.366 |
2 | 198.148 | 2241.877 |
3 | 318.105 | 2456.459 |
4 | 221.961 | 881.417 |
5 | 319.218 | 3576.895 |
6 | 141.033 | 1735.455 |
7 | 917.689 | 1667.338 |
8 | 399.895 | 4533.325 |
9 | 535.732 | 10167.185 |
10 | 327.829 | 2196.22 |
11 | 83.979 | 676.779 |
12 | 253.095 | 2091.635 |
13 | 572.946 | 15494.473 |
14 | 186.335 | 889.872 |
15 | 322.212 | 5149.462 |
16 | 186.878 | 1159.822 |
17 | 253.43 | 869.421 |
18 | 652.72 | 43694.238 |
19 | 464.98 | 141.36 |
20 | 272.433 | 3147.819 |
21 | 676.448 | 2586.442 |
22 | 167.936 | 264.262 |
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read