pg_parquet: An Extension to Connect Postgres and Parquet
Today, we’re excited to release pg_parquet - an open source Postgres extension for working with Parquet files. The extension reads and writes parquet files to local disk or to S3 natively from Postgres. With pg_parquet you're able to:
- Export tables or queries from Postgres to Parquet files
- Ingest data from Parquet files to Postgres
- Inspect the schema and metadata of existing Parquet files
Code is available at: https://github.com/CrunchyData/pg_parquet/.
Read on for more background on why we built pg_parquet or jump below to get a walkthrough of working with it.
Why pg_parquet?
Parquet is a great columnar file format that provides efficient compression of data. Working with data in parquet makes sense when you're sharing data between systems. You might be archiving older data, or a format suitable for analytics as opposed to transactional workloads. While there are plenty of tools to work with Parquet, Postgres users have been left to figure things out on their own. Now, thanks to pg_parquet, Postgres and Parquet easily and natively work together. Better yet, you can work with Parquet without needing yet another data pipeline to maintain.
Wait, what is Parquet? Apache Parquet is an open-source, standard, column-oriented file format that grew out of the Hadoop era of big-data. Using a file, Parquet houses data in a way that is optimized for SQL queries. In the world of data lakes, Parquet is ubiquitous.
Using pg_parquet
Extending the Postgres copy
command we're able to efficiently copy data to and from Parquet, on your local server or in s3.
-- Copy a query result into a Parquet file on the postgres server
COPY (SELECT * FROM table) TO '/tmp/data.parquet' WITH (format 'parquet');
-- Copy a query result into Parquet in S3
COPY (SELECT * FROM table) TO 's3://mybucket/data.parquet' WITH (format 'parquet');
-- Load data from Parquet in S3
COPY table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
Let's take an example products table, but not just a basic version, one that has composite Postgres types and arrays:
-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);
-- create a table with complex types
CREATE TABLE product_example (
id int,
product product,
products product[],
created_at TIMESTAMP,
updated_at TIMESTAMPTZ
);
-- insert some rows into the table
INSERT INTO product_example values (
1,
ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
ARRAY[ROW(1, NULL, NULL)::product, NULL],
now(),
'2022-05-01 12:00:00-04'
);
-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');
-- copy the parquet file to the table
COPY product_example FROM '/tmp/product_example.parquet';
-- show table
SELECT * FROM product_example;
Inspecting Parquet files
In addition to copying data in and out of parquet, you can explore existing Parquet files to start to understand their structure.
-- Describe a parquet schema
SELECT name, type_name, logical_type, field_id
FROM parquet.schema('s3://mybucket/data.parquet');
┌──────────────┬────────────┬──────────────┬──────────┐
│ name │ type_name │ logical_type │ field_id │
├──────────────┼────────────┼──────────────┼──────────┤
│ arrow_schema │ │ │ │
│ name │ BYTE_ARRAY │ STRING │ 0 │
│ s │ INT32 │ │ 1 │
└──────────────┴────────────┴──────────────┴──────────┘
(3 rows)
-- Retrieve parquet detailed metadata including column statistics
SELECT row_group_id, column_id, row_group_num_rows, row_group_bytes
FROM parquet.metadata('s3://mybucket/data.parquet');
┌──────────────┬───────────┬────────────────────┬─────────────────┐
│ row_group_id │ column_id │ row_group_num_rows │ row_group_bytes │
├──────────────┼───────────┼────────────────────┼─────────────────┤
│ 0 │ 0 │ 100 │ 622 │
│ 0 │ 1 │ 100 │ 622 │
└──────────────┴───────────┴────────────────────┴─────────────────┘
(2 rows)
-- Retrieve parquet file metadata such as the total number of rows
SELECT created_by, num_rows, format_version
FROM parquet.file_metadata('s3://mybucket/data.parquet');
┌────────────┬──────────┬────────────────┐
│ created_by │ num_rows │ format_version │
├────────────┼──────────┼────────────────┤
│ pg_parquet │ 100 │ 1 │
└────────────┴──────────┴────────────────┘
(1 row)
Parquet and the cloud
If you’re working with object storage managing your Parquet files – likely S3 of something S3 compatible. If you configure your ~/.aws/credentials
and ~/.aws/config
files, pg_parquet will automatically use those credentials allowing you to copy to and from your cloud object storage.
$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
$ cat ~/.aws/config
[default]
region = eu-central-1
Being able to directly access object storage via the COPY
command is very useful for archival, analytics, importing data written by other applications, and moving data between servers.
In conclusion
Postgres has long been trusted for transactional workloads, but we believe in the very near future, it will be equally as capable for analytics. We’re excited to release pg_parquet as one more step towards making Postgres the only database you need.
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read