Iceberg ahead! Analyzing Shipping Data in Postgres
PostgreSQL is one of the most versatile data storage and processing tools available. We enhanced it even further by adding Iceberg tables to PostgreSQL in Crunchy Data Warehouse with a fast analytical query engine.
What is Iceberg? Iceberg tables are stored in a compressed columnar format for fast analytics in object storage (S3). This means storage is cheap and there are no storage limits. Yet the tables are still transactional and work with nearly all PostgreSQL features. Crunchy Data Warehouse can also query or load raw data from object storage into Iceberg tables via PostgreSQL commands.
A pattern we repeatedly see in data analytics scenarios is:
- Use temporary or external tables to collect raw data
- Use Iceberg as a central repository to organize data
- Use PostgreSQL tables or materialized views for querying insights
This type of 3-stage data processing is sometimes referred to as the “medallion architecture”: with bronze, raw data; silver, organized data repository; and gold, queryable insights layers. In Crunchy Data Warehouse, you can achieve in one system by composing PostgreSQL features.
In this blog we will see an end-to-end example of Iceberg and Postgres together. We will be using Crunchy Data Warehouse to collect, organize and query shipping data. We'll take full advantage of Iceberg and many features that are unique to PostgreSQL.
Creating our Iceberg data repository
We start by creating an Iceberg table and loading raw data for AIS data. AIS is a radio protocol which ships often use to share their identity, location, size, heading, and activity with each other. The data can be obtained freely from the ether, though only from nearby ships. Some institutions gather data from many receivers. The NOAA Office for Coastal Management publishes a large amount of AIS data for US coastal waters.
The AIS data is published in CSV format in a .zip archive per day. For January 1st, the CSV is named AIS_2024_01_01.csv and contained in the archive located at: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2024/AIS_2024_01_01.zip
We can explore the file structure by creating a temporary table from the file and viewing the columns.
-- create a temporary table from the AIS file 2024-01-01
create temp table ais_import()
with (
load_from = 'https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2024/AIS_2024_01_01.zip',
zip_path = 'AIS_2024_01_01.csv');
-- Look at the columns
\d ais_tmp
Table "pg_temp_229.ais_import"
┌──────────────────┬──────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├──────────────────┼──────┼───────────┼──────────┼─────────┤
│ mmsi │ text │ │ │ │
│ basedatetime │ text │ │ │ │
│ lat │ text │ │ │ │
│ lon │ text │ │ │ │
│ sog │ text │ │ │ │
│ cog │ text │ │ │ │
│ heading │ text │ │ │ │
│ vesselname │ text │ │ │ │
│ imo │ text │ │ │ │
│ callsign │ text │ │ │ │
│ vesseltype │ text │ │ │ │
│ status │ text │ │ │ │
│ length │ text │ │ │ │
│ width │ text │ │ │ │
│ draft │ text │ │ │ │
│ cargo │ text │ │ │ │
│ transceiverclass │ text │ │ │ │
└──────────────────┴──────┴───────────┴──────────┴─────────┘
Let’s define our Iceberg table directly with precise types, and also add a PostGIS geometry column.
-- Create an Iceberg table from the AIS data with a geometry column
create table ais (
mmsi bigint not null,
basedatetime timestamptz not null,
lat double precision not null,
lon double precision not null,
sog double precision not null,
cog double precision not null,
heading double precision not null,
vesselname text not null,
imo text not null,
callsign text not null,
vesseltype int,
status int,
length double precision,
width double precision,
draft double precision,
cargo int,
transceiverclass text,
-- for convenience, auto-generate a point geometry
pos geometry generated always as (st_makepoint(lon, lat)) stored
)
using iceberg;
The pos column will be automatically populated when we insert data, and under the covers it is stored in the Parquet files that make up the Iceberg table as WKB blobs.
Safe, transactional data loading into Iceberg
We now have a temporary table containing the first day of data and an Iceberg table, so we can insert the raw data into Iceberg and apply the necessary casts and transformations.
-- insert raw data into the main AIS table
insert into
ais (mmsi, basedatetime, lat, lon, sog, cog, heading, vesselname, imo, callsign, vesseltype, status, length, width, draft, cargo, transceiverclass)
select
mmsi::bigint,
basedatetime::timestamptz,
lat::double precision,
lon::double precision,
sog::double precision,
cog::double precision,
heading::double precision,
vesselname,
imo,
callsign,
nullif(vesseltype, '')::int,
nullif(status, '')::int,
nullif(length, '')::double precision,
nullif(width, '')::double precision,
nullif(draft, '')::double precision,
nullif(cargo, '')::int,
nullif(transceiverclass, '')
from
ais_import;
Doing this once is kind of fun, but doing it for all days of the year is tedious and error-prone. For any other database we would probably write a Python script and then find a place to deploy it and somehow handle failures, but in PostgreSQL we can simply write a stored procedure that does everything reliably.
I created a gist with a simple load_ais_file
function that loads the file for a given day into a temp table and then inserts it into the Iceberg table. The function also remembers which files were already loaded, such that they can be skipped on the next run (the function is idempotent).
-- Load data for 2nd of January
select load_ais_file('2024-01-02');
One of the unique advantages of Iceberg tables in Crunchy Data Warehouse being fully transactional is that we can do this bookkeeping in the same transaction that inserts into the table. If anything fails, the insertion and bookkeeping will cleanly roll back and we’ll try again later. Moreover, we can divide the work into multiple transactions to make sure we do not lose progress. The load_ais function loads a range of files, committing after each file.
-- Load the rest of January
call load_ais('2024-01-03', '2024-01-31');
Due to the large data size, and doing the download, unzip, CSV parsing, data loading, and insertion into Iceberg all in a single function, the data loading process will take a while, but you can start querying the table as soon as the first data is loaded. You can also schedule the data load as a periodic background job since it is idempotent and concurrency safe, so eventually all the data will be there.
Analytical queries on Iceberg
With data loaded into Iceberg, we can start running fast analytical queries.
-- See how many rows we loaded so far
select count(*) from ais;
┌───────────┐
│ count │
├───────────┤
│ 214052260 │
└───────────┘
(1 row)
Time: 296.533 ms
-- How many ships are in the data?
select count(distinct mmsi) from ais;
┌───────┐
│ count │
├───────┤
│ 33481 │
└───────┘
(1 row)
Time: 468.786 ms
-- Find the ships with the most number of data points
select mmsi, vesselname, vesseltype, count(*)
from ais group by 1, 2, 3 order by 4 desc limit 10;
┌───────────┬───────────────────┬────────────┬───────┐
│ mmsi │ vesselname │ vesseltype │ count │
├───────────┼───────────────────┼────────────┼───────┤
│ 367458840 │ OSPREY │ 60 │ 39744 │
│ 368091590 │ EAGLE II │ 60 │ 39507 │
│ 368179250 │ SEAHAWK │ 60 │ 39451 │
│ 367669550 │ ALASKA CHALLENGER │ 30 │ 39392 │
│ 367331730 │ COLUMBIA │ 50 │ 38795 │
│ 367653630 │ OSCAR B │ 60 │ 38761 │
│ 367327250 │ ENDEAVOR │ 30 │ 38739 │
│ 366264360 │ WESTERN DAWN │ 30 │ 38654 │
│ 367155110 │ CAPTAIN RALEIGH │ 30 │ 38608 │
│ 368257130 │ TRIUMPH VII │ 60 │ 38584 │
└───────────┴───────────────────┴────────────┴───────┘
(10 rows)
Time: 1350.690 ms (00:01.351)
Queries on Iceberg tables are accelerated through caching and parallel, vectorized execution, and will typically run 10-100x faster than on a regular PostgreSQL table.
Creating materialized views for application-facing queries
While queries on Iceberg are extremely fast compared to queries on regular PostgreSQL tables, the queries might still use a lot of CPU and memory, and the application might make the same aggregations over and over again. It can therefore be a lot more efficient to materialize aggregations in a table or materialized view.
For instance, we can create a view that contains the traces of all fishing boats.
-- Build a view for traces of fishing boats (vesseltype 30)
create materialized view ais_fishing as
select
mmsi,
vesselname,
st_makeline(array_agg(pos order by basedatetime)) trace
from
ais
where
vesseltype = 30
group by
mmsi, vesselname
having
count(*) > 1;
SELECT 2370
Time: 7664.514 ms (00:07.665)
create index on ais_fishing using gist (trace);
Time: 28.762 ms
We can then easily add the view as a layer in QGIS to make an interactive visualization. QGIS will send many queries to the database, but they will hit an indexed view.
If we want other insights or visualizations, we can simply create more views or run queries on Iceberg directly. We can also schedule periodic refresh of the views in the database to respond to new data.
Shipping faster with Crunchy Data Warehouse
PostgreSQL with the added abilities of Crunchy Data Warehouse can be a powerful hub in your overall data architecture. The transactional capabilities make data processing scenarios a lot simpler and more reliable, and PostgreSQL features like stored procedures and extensions like PostGIS and pg_cron compose seamlessly with the analytics capabilities.
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read