Automatic Iceberg Maintenance Within Postgres
Önder Kalacı
5 min readMore by this author
Today we're excited to announce built-in maintenance for Iceberg in Crunchy Data Warehouse. This enhancement to Crunchy Data Warehouse brings PostgreSQL-style maintenance directly to Iceberg. The warehouse autovacuum workers continuously optimize Iceberg tables by compacting data and cleaning up expired files. In this post, we'll explore how we handle cleanup, and in the follow-up posts, we'll take a deeper dive into compaction.
If you use Postgres, you are probably familiar with tables and rows in a relational database. Instead of storing data in Postgres’ pages, Iceberg organizes the data into Parquet files and typically stores them in object storage like S3 with an organizational layer on top. Parquet is a compressed columnar file format that stores data efficiently. And Iceberg is designed to handle analytical queries across large datasets.
On Crunchy Data Warehouse, Postgres tables backed by Iceberg behave almost exactly like regular Postgres tables. You can run full SQL queries, perform ACID transactions, and use standard DDL commands like CREATE TABLE or ALTER TABLE. We’re excited to add vacuum processes to Iceberg to create an even better and hassle free user experience.
Orphan Files in Iceberg
In Postgres, when you update or delete rows, the changes happen inside the same table storage. The database keeps track of visibility using MVCC, and old versions of rows are eventually freed up by vacuum.
Iceberg works differently because its data files are immutable. When you update or delete data, Iceberg doesn’t modify existing files—it creates new ones with the updated data. The table’s metadata is then updated to point to the new files, while the old ones become unreferenced.
Over time, as more updates and deletes happen, these orphaned files—ones that are no longer referenced by any active table snapshot—start to accumulate.
Cleaning up orphan files
Just like autovacuum keeps your PostgreSQL tables lean, Crunchy Data Warehouse has background workers that automatically clean up orphan files in Iceberg. This helps ensure efficient storage without manual intervention. Crunchy Data Warehouse also does compaction on Iceberg, combining smaller files into larger files for efficiency and performance.
With Crunchy Data Warehouse, the autovacuum takes care of this cleanup automatically, just like in Postgres. It scans for unreferenced files and removes them, ensuring that storage does not grow unnecessarily over time. If a data file is no longer referenced by any snapshot and the retention period has passed (by default, 10 days), it is time to delete it. The design of autovacuum for Iceberg tables ensures that only files generated by PostgreSQL transactions are deleted, eliminating any risk of removing unintended files.
We support autovacuum for both expired snapshots and transaction rollbacks. There are two primary ways orphan files are created.
Iceberg Snapshot Expiration
When data is deleted or updated, the corresponding files become unreachable by queries. Snapshot expiration ensures that these files are safely removed during maintenance. In Crunchy Data Warehouse, the familiar VACUUM
command handles snapshot expiration. Remember that Crunchy Data Warehouse supports autovacuum on Iceberg tables, so you don’t have to manually keep track of this. By default, we retain files for up to 10 days to provide backups.
Here’s a step-by-step example demonstrating snapshot expiration in action:
-- Increase log verbosity to see detailed file operations
SET client_min_messages TO DEBUG4;
-- Create a table with 100 rows, generating the "data_0.parquet" file on S3
CREATE TABLE expire_data USING iceberg AS SELECT id FROM generate_series(0,100) id;
....
DEBUG: adding s3://testbucketcdw/../data_0.parquet with 101 rows
....
-- TRUNCATE the table to remove all rows. This marks underlying files as orphaned.
TRUNCATE expire_data;
-- For demo purposes, set retention periods to 0 to expire files immediately
-- In production, files are retained for 10 days as a backup.
SET crunchy_query_engine.orphaned_file_retention_period TO '0s';
SET crunchy_iceberg.max_snapshot_age TO '0';
-- Trigger snapshot expiration to clean up orphaned files and verify their removal
VACUUM (verbose) expire_data;
...
INFO: deleting expired file s3://testbucketcdw/../data_0.parquet
...
Rollback Transactions
Postgres users are already familiar with transactional rollbacks. Iceberg tables in the warehouse bring this same feature to data files, ensuring that storage is automatically cleaned up for rolled-back operations.
Here’s an example demonstrating how unused files from a rolled-back transaction are handled:
-- Create a new Iceberg table
CREATE TABLE rollback_data(id INT) USING iceberg;
BEGIN;
-- Increase log verbosity to see detailed file operations
SET LOCAL client_min_messages TO DEBUG4;
-- Insert data, generating the "data_0.parquet" file in S3
INSERT INTO rollback_data SELECT id FROM generate_series(0,100) id;
....
DEBUG: adding s3://testbucketcdw/../data_0.parquet with 101 rows
....
-- Roll back the transaction to discard the changes
ROLLBACK;
-- Trigger the cleanup process to remove the unused file
VACUUM (verbose) rollback_data;
...
INFO: deleting unused file s3://testbucketcdw/../data_0.parquet
...
This example shows how Iceberg tables handle transaction rollbacks cleanly and efficiently. Crunchy Data Warehouse removes any files created during rolled-back transactions during autovacuum, ensuring storage is not wasted on unused files.
Closing Thoughts
Orphan files are a natural consequence of how Iceberg manages immutable data files. Without proper cleanup, they can lead to wasted storage and unnecessary costs. In Postgres, autovacuum handles similar maintenance tasks seamlessly, and in Crunchy Data Warehouse, we bring that same convenience to Iceberg tables.
By automatically identifying and removing unreferenced files, Crunchy Data Warehouse ensures your Iceberg tables stay efficient, just like a well-maintained Postgres database. No manual cleanup, no wasted storage—just a streamlined experience for large-scale analytics.
Whether you're coming from a Postgres background or an Iceberg-first world - Crunchy Data Warehouse combines the best of both with the power of Iceberg’s scalable architecture with the ease of Postgres-style maintenance.