Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

Running TPC-H Queries on Iceberg Tables from PostgreSQL

Avatar for Önder Kalacı

Önder Kalacı

6 min read

We recently introduced support for querying Iceberg tables from PostgreSQL in Crunchy Bridge for Analytics. Iceberg defines a way to store tables in data lakes (usually as Parquet files in S3) with support for snapshots and other important database features, and it is designed with high performance analytics in mind.

If you’re new to Crunchy Bridge, it offers a fully managed PostgreSQL experience. Crunchy Bridge for Analytics extends these capabilities, enabling you to query and interact with object storage using Postgres, now including Iceberg tables on S3.

In this post, we will explore running TPC-H queries against Iceberg tables with PostgreSQL. You'll see why this could be a game-changer for handling demanding analytical workloads.

What is an Iceberg table?

Before diving into Iceberg tables, let's briefly revisit what a table is. In a database system, a table is an abstraction that organizes data stored, such as PostgreSQL’s file and folder layout. When you query a table, PostgreSQL locates the relevant files for you.

Iceberg is an open table format specification that defines metadata on how a table should be structured. It is designed for managing vast amounts of data and supporting analytical workloads. Iceberg offers several advantages, including schema evolution, hidden partitioning, and version rollback.

Iceberg tables are typically stored in object stores like Amazon S3, with data files in Parquet format. Parquet is a columnar file format with built-in compression, optimized for analytical queries. Storing Parquet files on S3 in an open table format provides a crucial benefit: interoperability. This allows multiple tools and query engines to seamlessly access the same data. For example, you can have your Spark jobs process large-scale data transformations, while you enjoy fast analytical queries on Postgres.

iceberg postgres diagram.png

The Iceberg specification is backed by the Apache Foundation, and you can find all the details on GitHub.

How can you query Iceberg tables using Postgres?

Crunchy Bridge for Analytics uses familiar PostgreSQL syntax, making it easy to integrate. Iceberg tables are created as foreign tables backed by the crunchy_lake_analytics server, with the format set to iceberg. Finally, point to the iceberg table on Amazon S3 via the path parameter:

--creates an Iceberg table on Postgres, the system fetches the
--column definitions from the metadata

CREATE FOREIGN TABLE nation()
SERVER crunchy_lake_analytics OPTIONS(
format 'iceberg',
path 's3://crunchy-analytics-datasets-use1/tpch_iceberg/scale_10/public/nation/metadata/00000-eb4cc10e-af7b-4164-91bc-a0c2c306ff49.metadata.json'
);

When you query an Iceberg table (or another analytics table), under the covers we delegate part of the computation to DuckDB, which is a state-of-the-art analytical query engine. Moreover, Iceberg is designed to be very cache-friendly since all files are immutable. Bridge for Analytics caches all the data files and Iceberg metadata files on locally-attached storage for maximum performance.

Running TPC-H queries on Postgres with Iceberg

The TPC-H benchmark is a widely recognized industry-standard test for database performance.  We tested TPC-H queries on Iceberg tables and regular PostgreSQL tables and here are the steps to run them:

  • Spin-up a Crunchy Bridge for Analytics cluster, start here. Once you have your cluster up and running, connect to the database.
  • We have pre-generated and uploaded TPC-H data to a public Amazon S3 for anyone to use for this. We have a sample commands to connect to this public S3 bucket. Note that it may take a few minutes for the system to automatically download files to the local NVMe drive. You can ask the system to synchronously download the files as well.
  • If you want to compare a Postgres only instance, create the regular Postgres tables and indexes using the same data set. For regular PostgreSQL tables, you will likely want to add as many indexes as needed to optimize performance. Here, are the indexes we created.
  • We used scale factor 10 for the TPC-H data. Set the search_path to tpch_10_iceberg or tpch_10_heap to point to the tables that you want to use:
    psql postgres://user:password@host:port/dbname -c "SET search_path TO tpch_10_iceberg;" -f q1.sql
    
  • Start executing the 22 queries, starting with as Q1. The full set of queries is in the repo file as well.
    SELECT
        l_returnflag,
        l_linestatus,
        sum(l_quantity) AS sum_qty,
        sum(l_extendedprice) AS sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
        avg(l_quantity) AS avg_qty,
        avg(l_extendedprice) AS avg_price,
        avg(l_discount) AS avg_disc,
        count(*) AS count_order
    FROM
        lineitem
    WHERE
        l_shipdate <= CAST('1998-09-02' AS date)
    GROUP BY
        l_returnflag,
        l_linestatus
    ORDER BY
        l_returnflag,
        l_linestatus;
    
  • Get the results. Either by using \timing or EXPLAIN ANALYZE, you can return the timing of each of the 22 queries run against the Iceberg tables and/or the Postgres table. This can be manual or you can script the results.

TPC-H benchmark results on Iceberg vs indexed Postgres

tcp-h benchmark iceberg postgres

In the above chart, we compare the execution times of 22 queries defined by the TPC-H benchmark. The y-axis represents the time taken to execute each query in seconds. To ensure the chart remains readable, we capped the maximum time at 10 seconds. This prevents some bars from becoming excessively large and others too small, making the comparison clearer. The impacted queries are Q9, Q13 and Q18, which also have a red line to denote. The light blue bars represent Iceberg tables, while the dark blue bars represent regular PostgreSQL tables with indexes. For the detailed execution times, refer to the appendix section.

Our tests show that the total execution times of TPC-H queries on Iceberg tables are on average 14 times faster than on indexed PostgreSQL tables. This significant performance boost highlights the efficiency and speed that Crunchy Bridge for Analytics brings to your PostgreSQL database. We discussed the underlying technology that provides this boost in a previous article.

Notably, there is only one query where the execution time is lower for regular PostgreSQL tables. Our tests show that heavy use of indexes helps to answer the query fast via index-only-scan. If we run the same query without any indexes, it’d have been roughly 10 times slower.

Summary

  • We have created a public S3 bucket for anyone to run TPC-H queries against Iceberg, using Crunchy Bridge for Analytics
  • Running TPC-H queries on Iceberg tables from PostgreSQL demonstrates significant performance over standard Postgres.

We hope this exploration inspires you to experiment with Iceberg tables on Crunchy Bridge for Analytics and shows you what’s possible now with OLAP workloads in Postgres. Let us know if you have any comments or questions!

Appendix: Sample benchmark results

Conducting comprehensive database benchmarks is a complex process. Here, I've executed a focused test run to give a snapshot of the performance, acknowledging that various other parameters can affect the overall results. This focused test aims to show a general trend of the overall performance impact.

In this test, we used the Analytics-128 instance type to run the TPC-H queries. The data was generated using DuckDB's TPCH extension. Spark was used to create the Iceberg tables. The tests were conducted on a hot-cache, and the results are the average of three runs.

QueryIceberg Table (ms)Indexed Postgres Table (ms)
1329.5976099.366
2198.1482241.877
3318.1052456.459
4221.961881.417
5319.2183576.895
6141.0331735.455
7917.6891667.338
8399.8954533.325
9535.73210167.185
10327.8292196.22
1183.979676.779
12253.0952091.635
13572.94615494.473
14186.335889.872
15322.2125149.462
16186.8781159.822
17253.43869.421
18652.7243694.238
19464.98141.36
20272.4333147.819
21676.4482586.442
22167.936264.262