Auto-archiving and Data Retention Management in Postgres with pg_partman
You could be saving money every month on databases costs with a smarter data retention policy. One of the primary reasons, and a huge benefit of partitioning is using it to automatically archive your data. For example, you might have a huge log table. For business purposes, you need to keep this data for 30 days. This table grows continually over time and keeping all the data makes database maintenance challenging. With time-based partitioning, you can simply archive off data older than 30 days.
The nature of most relational databases means that deleting large volumes of data can be very inefficient and that space is not immediately, if ever, returned to the file system. PostgreSQL does not return the space it reserves to the file system when normal deletion operations are run except under very specific conditions:
- the page(s) at the end of the relation are completely emptied
- a VACUUM FULL/CLUSTER is run against the relation (exclusively locking it until complete)
If you find yourself needing that space back more immediately, or without intrusive locking, then partitioning can provide a much simpler means of removing old data: drop the table. The removal is nearly instantaneous (barring any transactions locking the table) and immediately returns the space to the file system. pg_partman, the Postgres extension for partitioning, provides a very easy way to manage this for time and integer based partitioning.
pg_partman daily partition example
Recently pg_partman 5.1 was released that includes new features such as list partitioning for single value integers, controlled maintenance run ordering, and experimental support for numeric partitioning. This new version also includes several bug fixes, so please update to the latest release when possible! All examples were done using this latest version.
https://github.com/pgpartman/pg_partman
First lets get a simple, time-based daily partition set going
CREATE TABLE public.time_stuff
(col1 int
, col2 text default 'stuff'
, col3 timestamptz NOT NULL DEFAULT now() )
PARTITION BY RANGE (col3);
SELECT partman.create_parent('public.time_stuff', 'col3', '1 day');
\d+ time_stuff
Partitioned table "public.time_stuff"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+--------------------------+-----------+----------+---------------+----------+-------------+--------------+-------------
col1 | integer | | | | plain | | |
col2 | text | | | 'stuff'::text | extended | | |
col3 | timestamp with time zone | | not null | now() | plain | | |
Partition key: RANGE (col3)
Partitions: time_stuff_p20240408 FOR VALUES FROM ('2024-04-08 00:00:00-04') TO ('2024-04-09 00:00:00-04'),
time_stuff_p20240409 FOR VALUES FROM ('2024-04-09 00:00:00-04') TO ('2024-04-10 00:00:00-04'),
time_stuff_p20240410 FOR VALUES FROM ('2024-04-10 00:00:00-04') TO ('2024-04-11 00:00:00-04'),
time_stuff_p20240411 FOR VALUES FROM ('2024-04-11 00:00:00-04') TO ('2024-04-12 00:00:00-04'),
time_stuff_p20240412 FOR VALUES FROM ('2024-04-12 00:00:00-04') TO ('2024-04-13 00:00:00-04'),
time_stuff_p20240413 FOR VALUES FROM ('2024-04-13 00:00:00-04') TO ('2024-04-14 00:00:00-04'),
time_stuff_p20240414 FOR VALUES FROM ('2024-04-14 00:00:00-04') TO ('2024-04-15 00:00:00-04'),
time_stuff_p20240415 FOR VALUES FROM ('2024-04-15 00:00:00-04') TO ('2024-04-16 00:00:00-04'),
time_stuff_p20240416 FOR VALUES FROM ('2024-04-16 00:00:00-04') TO ('2024-04-17 00:00:00-04'),
time_stuff_default DEFAULT
Setting data retention policies
This partition set was created on April 12, 2024, so a default setup will create 4 partitions before and 4 partitions after. The first setting to configure for retention, and the only one that is required, is the retention
column in the part_config
table. For this example, we’ll set a retention of 2 days. We’re also going to increase the premake value just to see that normal maintenance is working as well.
UPDATE partman.part_config SET retention = '2 days', premake = 6 WHERE parent_table = 'public.time_stuff';
By default, pg_partman also does not create new child tables if there is no data in the partition set, so lets add some data in as well.
INSERT INTO public.time_stuff (col1, col3)
VALUES (generate_series(1,10), CURRENT_TIMESTAMP);
SELECT * FROM partman.part_config
WHERE parent_table = 'public.time_stuff';
-[ RECORD 1 ]--------------+-----------------------------------
parent_table | public.time_stuff
control | col3
partition_interval | 1 day
partition_type | range
premake | 6
automatic_maintenance | on
template_table | partman.template_public_time_stuff
retention | 2 days
retention_schema |
retention_keep_index | t
retention_keep_table | t
epoch | none
constraint_cols |
optimize_constraint | 30
infinite_time_partitions | f
datetime_string | YYYYMMDD
jobmon | t
sub_partition_set_full | f
undo_in_progress | f
inherit_privileges | f
constraint_valid | t
ignore_default_data | t
default_table | t
date_trunc_interval |
maintenance_order |
retention_keep_publication | f
maintenance_last_run |
In pg_partman, retention management is handled at the same time as new partition creation. So a simple call to run_maintenance_proc()
will handle both.
CALL partman.run_maintenance_proc();
\d+ time_stuff
Partitioned table "public.time_stuff"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+--------------------------+-----------+----------+---------------+----------+-------------+--------------+-------------
col1 | integer | | | | plain | | |
col2 | text | | | 'stuff'::text | extended | | |
col3 | timestamp with time zone | | not null | now() | plain | | |
Partition key: RANGE (col3)
Partitions: time_stuff_p20240410 FOR VALUES FROM ('2024-04-10 00:00:00-04') TO ('2024-04-11 00:00:00-04'),
time_stuff_p20240411 FOR VALUES FROM ('2024-04-11 00:00:00-04') TO ('2024-04-12 00:00:00-04'),
time_stuff_p20240412 FOR VALUES FROM ('2024-04-12 00:00:00-04') TO ('2024-04-13 00:00:00-04'),
time_stuff_p20240413 FOR VALUES FROM ('2024-04-13 00:00:00-04') TO ('2024-04-14 00:00:00-04'),
time_stuff_p20240414 FOR VALUES FROM ('2024-04-14 00:00:00-04') TO ('2024-04-15 00:00:00-04'),
time_stuff_p20240415 FOR VALUES FROM ('2024-04-15 00:00:00-04') TO ('2024-04-16 00:00:00-04'),
time_stuff_p20240416 FOR VALUES FROM ('2024-04-16 00:00:00-04') TO ('2024-04-17 00:00:00-04'),
time_stuff_p20240417 FOR VALUES FROM ('2024-04-17 00:00:00-04') TO ('2024-04-18 00:00:00-04'),
time_stuff_p20240418 FOR VALUES FROM ('2024-04-18 00:00:00-04') TO ('2024-04-19 00:00:00-04'),
time_stuff_default DEFAULT
Now you can see the two partitions older than 2 days ago have been removed and two new partitions have been created to include 6 days ahead. There are some other more advanced options for retention available in pg_partman as well. You’ll see above that the retention_keep_table
option is set to true by default. This means that while the child tables are no longer part of the retention set, those tables do still exist in the database. pg_partman tries to keep all default options set in a manner to reduce accidental data loss.
\dt public.time_stuff*
List of relations
Schema | Name | Type | Owner
--------+----------------------+-------------------+-------
public | time_stuff | partitioned table | keith
public | time_stuff_default | table | keith
public | time_stuff_p20240408 | table | keith
public | time_stuff_p20240409 | table | keith
public | time_stuff_p20240410 | table | keith
public | time_stuff_p20240411 | table | keith
public | time_stuff_p20240412 | table | keith
public | time_stuff_p20240413 | table | keith
public | time_stuff_p20240414 | table | keith
public | time_stuff_p20240415 | table | keith
public | time_stuff_p20240416 | table | keith
public | time_stuff_p20240417 | table | keith
public | time_stuff_p20240418 | table | keith
Dropping tables and indexes
If you’d like these tables to actually be dropped, you can set the retention_keep_table
to false. Or if you’d like to keep the tables live in the database, but don’t need the indexes taking up space anymore, you can leave retention_keep_table
set to true, but set retention_keep_index
false instead. In the example below, I have reset the partition set back to its original state after running create_parent()
and then running this update.
UPDATE partman.part_config
SET retention = '2 days', premake = 6, retention_keep_table = false
WHERE parent_table = 'public.time_stuff';
CALL partman.run_maintenance_proc();
Now if we look at the tables that actually exist, we can see the oldest two tables are gone.
\dt public.time*
List of relations
Schema | Name | Type | Owner
--------+----------------------+-------------------+-------
public | time_stuff | partitioned table | keith
public | time_stuff_default | table | keith
public | time_stuff_p20240410 | table | keith
public | time_stuff_p20240411 | table | keith
public | time_stuff_p20240412 | table | keith
public | time_stuff_p20240413 | table | keith
public | time_stuff_p20240414 | table | keith
public | time_stuff_p20240415 | table | keith
public | time_stuff_p20240416 | table | keith
public | time_stuff_p20240417 | table | keith
public | time_stuff_p20240418 | table | keith
Retention outside the database
Another scenario is if you don’t need the data live in the database, but you still want to keep a backup of it outside of the database. In this case, we’re going to use the retention_schema
option which detaches the child tables from the partition set and then moves them to the schema named in this option. Again, the partition set has been reset to the initial state after create_parent()
and then we run this:
CREATE SCHEMA old_tables;
UPDATE partman.part_config
SET retention = '2 days', retention_schema = 'old_tables'
WHERE parent_table = 'public.time_stuff';
CALL partman.run_maintenance_proc();
Now we can see that the old tables are no longer in the partition set, but are now in the old_tables
schema.
\d+ time_stuff
Partitioned table "public.time_stuff"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+--------------------------+-----------+----------+---------------+----------+-------------+--------------+-------------
col1 | integer | | | | plain | | |
col2 | text | | | 'stuff'::text | extended | | |
col3 | timestamp with time zone | | not null | now() | plain | | |
Partition key: RANGE (col3)
Partitions: time_stuff_p20240410 FOR VALUES FROM ('2024-04-10 00:00:00-04') TO ('2024-04-11 00:00:00-04'),
time_stuff_p20240411 FOR VALUES FROM ('2024-04-11 00:00:00-04') TO ('2024-04-12 00:00:00-04'),
time_stuff_p20240412 FOR VALUES FROM ('2024-04-12 00:00:00-04') TO ('2024-04-13 00:00:00-04'),
time_stuff_p20240413 FOR VALUES FROM ('2024-04-13 00:00:00-04') TO ('2024-04-14 00:00:00-04'),
time_stuff_p20240414 FOR VALUES FROM ('2024-04-14 00:00:00-04') TO ('2024-04-15 00:00:00-04'),
time_stuff_p20240415 FOR VALUES FROM ('2024-04-15 00:00:00-04') TO ('2024-04-16 00:00:00-04'),
time_stuff_p20240416 FOR VALUES FROM ('2024-04-16 00:00:00-04') TO ('2024-04-17 00:00:00-04'),
time_stuff_default DEFAULT
\dt old_tables.*
List of relations
Schema | Name | Type | Owner
------------+----------------------+-------+-------
old_tables | time_stuff_p20240408 | table | keith
old_tables | time_stuff_p20240409 | table | keith
To store these tables “offline” outside of the database, we can use a python script provided by pg_partman to dump all tables in a given schema. It’s not tied in any way to the partition configuration or the partition set, so this script can be used to dump any tables in any schema.
$ python3 dump_partition.py -c"host=localhost" --schema=old_tables
DROP TABLE IF EXISTS"old_tables"."time_stuff_p20240409"
DROP TABLE IF EXISTS"old_tables"."time_stuff_p20240408
$ ls -l old*
-rw-rw-r-- 1 keith keith 168 Apr 12 18:17 old_tables.time_stuff_p20240408.hash
-rw-rw-r-- 1 keith keith 1410 Apr 12 18:17 old_tables.time_stuff_p20240408.pgdump
-rw-rw-r-- 1 keith keith 168 Apr 12 18:17 old_tables.time_stuff_p20240409.hash
-rw-rw-r-- 1 keith keith 1410 Apr 12 18:17 old_tables.time_stuff_p20240409.pgdump
By default it creates dump files in the custom dump format as well as providing a SHA-512 hash of the dump file to provide long-term data integrity checks. This backup option can either be run as part of a regularly scheduled script or as a one off backup.
Summary
Keeping data that doesn’t need to actually exist inside the database is a key part of keeping it running efficiently. Hopefully this has provided a guide to using both basic and advanced retention management options available in pg_partman.
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read