Crunchy Data Warehouse: Postgres with Iceberg Available for Kubernetes and On-premises
Today I'm excited to announce the release of Crunchy Data Warehouse for on-premises deployments, providing one of the easiest yet richest ways to work with your data lake in the environment of your choosing. Built on top of Crunchy Postgres for Kubernetes, Crunchy Data Warehouse extends Postgres with a modern data warehouse solution, giving you:
- The ability to easily query data where it resides in S3 or S3-compatible storage (like MinIO). With a variety of data formats supported, including CSV, JSON, Parquet, GeoParquet, and Iceberg, you can leave behind complicated ETL processes and work with your data directly. With standard SQL and COPY support in Postgres, you can choose to query data in place or move it to and from S3 alongside the rest of your data.
- The simplest way of creating and managing data in the Iceberg format. If you're unfamiliar with Iceberg, it takes Parquet (an open format for columnar files) and transforms single immutable files into a full database: a collection of files and metadata that represent your tables.
- Fast analytical queries. Iceberg gives you columnar compression of your data. We also include an adaptive query engine that can seamlessly leverage vectorized query execution to provide analytics on your data at speeds of up to 100x over standard Postgres.
- Automatic management and maintenance of your Iceberg data. A common chore when working with Iceberg is running processes to recompact your data so it stays efficiently distributed. Crunchy Data Warehouse automatically handles this for you behind the scenes, so you have one less thing to think about.

All of the above comes in a production-ready package built on the experience of Crunchy Postgres for Kubernetes.
Let’s dig into setting up a warehouse cluster
Assuming you have Crunchy Postgres for Kubernetes installed, as well as access to S3-compatible storage, you can create your first Crunchy Data Warehouse cluster with just a few lines of YAML:
apiVersion: v1
kind: Secret
metadata:
  name: cdw-secret
type: Opaque
stringData:
  s3-key: <s3-key>
  s3-secret: <s3-secret>
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: CrunchyDataWarehouse
metadata:
  name: cdw-sample
spec:
  externalStorage:
    - name: s3bucket
      scope: s3://<s3-bucket>
      region: <s3-region>
      endpoint: <s3-endpoint>
      accessKeyIDRef:
        key: s3-key
        name: cdw-secret
      secretAccessKeyRef:
        key: s3-secret
        name: cdw-secret
  image: registry.crunchydata.com/crunchydata/crunchy-data-warehouse:ubi9-17.4-2.1.2-2513
  postgresVersion: 17
  instances:
    - replicas: 1
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 4Gi
This creates a Postgres 17 instance with the requested storage. After running kubectl apply, you'll have a running Crunchy Data Warehouse cluster. This includes a pod for the Postgres database, ready to work with the S3 storage, plus all the other services that run with Crunchy Postgres for Kubernetes like backups, disaster recovery, high availability, and connection pooling. Once you connect to the Postgres database, you're ready to start working with data lake files in Postgres.
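Once connected, a quick sanity check can confirm the pieces the later examples rely on. This is just a hedged sketch using standard Postgres catalogs; I'm assuming the crunchy_lake_analytics foreign server referenced in the examples below is provisioned for you by the operator:

-- Confirm the Postgres version we provisioned
SELECT version();

-- List foreign servers; the lake analytics server used in the examples below should appear here
SELECT srvname FROM pg_foreign_server;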
Getting started with data (in CSV format)
Now that we’ve got a provisioned Crunchy Data Warehouse instance, let’s start working with some data. We’ll start with a CSV file to see how simple it is to work with existing data in S3, and eventually load it into Iceberg, fully managed by our warehouse. Along the way we’ll see the performance speed-up that columnar Parquet files managed in Iceberg give us over standard Postgres heap tables.
To begin, we’re going to load a set of historical daily stock prices, just over 32 million records in total, along with the stock listing data. These files already exist in S3, and we can set them up as lakehouse tables:
CREATE FOREIGN TABLE stock_csv (
    trade_dt timestamptz,
    ticker varchar(20),
    open_price numeric(38, 9),
    high_price numeric(38, 9),
    low_price numeric(38, 9),
    close_price numeric(38, 9),
    trade_volume numeric(38, 9),
    dividends numeric(38, 9),
    stock_splits numeric(38, 9))
SERVER crunchy_lake_analytics
OPTIONS (
    header 'true',
    path 's3://crunchydatawarehouse/stock/stock_history.csv',
    format 'csv');

CREATE FOREIGN TABLE stock_list_csv()
SERVER crunchy_lake_analytics
OPTIONS (
    header 'true',
    path 's3://crunchydatawarehouse/stock/stock_list.csv',
    format 'csv');
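With the foreign tables defined, we can query the CSV files right where they sit in S3. (The empty column list on stock_list_csv suggests its schema is inferred from the file itself; that's my reading, not something spelled out above.) A small, hedged example against the columns we defined for stock_csv:

-- Count rows and find the latest trading day, reading directly from the CSV in S3
SELECT count(*), max(trade_dt)
FROM stock_csv;

-- Most actively traded tickers, still straight from S3
SELECT ticker, sum(trade_volume) AS total_volume
FROM stock_csv
GROUP BY ticker
ORDER BY total_volume DESC
LIMIT 10;

-- The listing table works the same way, even without declared columns
SELECT count(*) FROM stock_list_csv;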
Working with Parquet data
Similar to how we referenced our CSV file above, we could do the exact same with Parquet. We can also easily move data into and out of the Parquet format. Parquet is an open standard file format that is self-describing (it carries its data types with it) and brings columnar compression. For time series data, compression offers great benefits in storage, but also in the performance of scanning and reading larger amounts of data.
Above, we worked with the data directly as it sits in CSV format within S3, but we can also import it directly into a Postgres table. Let’s go ahead and set up a Postgres table and load the data:
CREATE TABLE stock () WITH (load_from = 's3://crunchydatawarehouse/stock/stock_history.csv');
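The empty column list means the table's columns are presumably derived from the file being loaded. A quick, hedged check of what ended up in the heap table, using only standard Postgres catalogs:

-- How many rows did we load?
SELECT count(*) FROM stock;

-- Which columns were created for us?
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'stock'
ORDER BY ordinal_position;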
From here, if we wanted to import or export data, whether as CSV or, in this case, Parquet, we could use the Postgres COPY command. As an example, we could export the historical stock data we loaded into Postgres out to a Parquet file:
COPY stock TO 's3://crunchydatawarehouse/stock/stock_history.parquet';
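COPY works in the other direction too. Here is a minimal sketch of pulling that Parquet file from S3 back into a heap table; the stock_import table name is hypothetical, and I'm assuming the format is inferred from the file extension, as it appears to be for load_from:

-- Hypothetical round trip: a fresh table with the same shape as stock
CREATE TABLE stock_import (LIKE stock);

-- Load the Parquet file we just exported back out of S3
COPY stock_import FROM 's3://crunchydatawarehouse/stock/stock_history.parquet';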
The easiest way to work with Iceberg
Now we’re going to create tables in the Iceberg format. Iceberg is another open standard, one that extends Parquet (an immutable file from a point in time) and maintains metadata about new files, changesets, and more to give you essentially a full database on top of Parquet files. We’re going to create our Iceberg tables and point them at the Parquet files generated above:
CREATE TABLE stock_list_iceberg()
USING iceberg
WITH (load_from = 's3://crunchydatawarehouse/stock/stock_list.parquet');
CREATE TABLE stock_history_iceberg()
USING iceberg
WITH (load_from = 's3://crunchydatawarehouse/stock/stock_history.parquet');
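Because Iceberg tables support inserts, updates, and deletes (unlike a bare Parquet file), ordinary DML works against them. A minimal hedged sketch, assuming stock_history_iceberg picked up the same columns we declared for stock_csv earlier, and using a made-up ticker:

-- Insert a new (made-up) daily price row
INSERT INTO stock_history_iceberg (trade_dt, ticker, open_price, high_price, low_price, close_price, trade_volume)
VALUES ('2024-01-02', 'FAKE', 10.0, 10.5, 9.8, 10.2, 1000);

-- Correct a value
UPDATE stock_history_iceberg
SET close_price = 10.25
WHERE ticker = 'FAKE' AND trade_dt = '2024-01-02';

-- And remove it again
DELETE FROM stock_history_iceberg
WHERE ticker = 'FAKE';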
By using Iceberg as our table type, we’re not only getting the columnar compression of Parquet; as sketched above, we can also add new records, update existing records, or delete data. We can now work with this as if it were a standard Postgres table - now with great data compression and amazingly fast analytical querying. To see just how fast, let’s compare a query that gives us the average closing price of stocks over time, first against standard Postgres, then against our Iceberg data:
SELECT extract('year' from trade_dt) trade_year,
       count(distinct ticker) cnt,
       trunc(avg(close_price)) close_price
FROM stock
WHERE trade_dt >= '01/01/2020'
GROUP BY trade_year
ORDER BY 1;
SELECT extract('year' from trade_dt) trade_year,
       count(distinct ticker) cnt,
       trunc(avg(close_price)) close_price
FROM stock_history_iceberg
WHERE trade_dt >= '01/01/2020'
GROUP BY trade_year
ORDER BY 1;
Above we see that our columnar Iceberg table is over 20x faster than the standard Postgres row-based table. For real-time analytics workloads on Crunchy Data Warehouse, we see query speed-ups of anywhere from 10x to 100x over standard Postgres.
In a matter of minutes we went from raw data sitting in a data lake within S3 to loading it into our Postgres database, exporting it to Parquet, and creating Iceberg tables, each with a single command, for a fully managed Iceberg experience.
In short you get:
- Native Postgres experience for data warehousing
- Simple Iceberg data management
- Fast analytical performance
Next Steps
We’re excited to show you what we’ve built!
There are a few ways to get started:
- Contact our team today to talk about building your data warehouse on Kubernetes.
- If you don't need to self manage, give our managed warehouse experience a try on Crunchy Bridge.
- Join me and Andrew L'Ecuyer for a live demo and webinar on April 10.