Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

pgBackRest File Bundling and Block Incremental Backup

Avatar for David Steele

David Steele

6 min read

Crunchy Data is proud to support the pgBackRest project, an essential production grade backup tool used in our fully managed and self managed Postgres products. pgBackRest is also available as an open source project.

pgBackRest provides:

  • Full, differential, and incremental backups
  • Checksum validation of backup integrity
  • Point-in-Time recovery

pgBackRest recently released v2.46 with support for block incremental backup, which saves space in the repository by storing only changed parts of files. File bundling, released in v2.39, combines smaller files together for speed and cost savings, especially on object stores.

Efficiently storing backups is a major priority for the pgBackRest project but we also strive to balance this goal with backup and restore performance. The file bundling and block incremental backup features improve backup and, in many cases, restore performance while also saving space in the repository.

In this blog we will provide working examples to help you get started with these exciting features.

File bundling

  • combines smaller files together
  • improves speed on object stores like S3, Azure, GCS

Block incremental backup

  • saves space by storing only changed file parts
  • improves efficiency of delta restore

Sample repository set up

To demonstrate these features we will create two repositories. The first repository will use defaults. The second will have file bundling and block incremental backup enabled.

Configure both repositories:

[global]
log-level-console=info
start-fast=y

repo1-path=/var/lib/pgbackrest/1
repo1-retention-full=2

repo2-path=/var/lib/pgbackrest/2
repo2-retention-full=2
repo2-bundle=y
repo2-block=y

[demo]
pg1-path=/var/lib/postgresql/12/demo

Create the stanza on both repositories:

pgbackrest --stanza=demo stanza-create

The block incremental backup feature is best demonstrated with a larger dataset. In particular, we would prefer to have at least one table that is near the maximum segment size of 1GB. This can be accomplished by creating data with pgbench:

/usr/lib/postgresql/12/bin/pgbench -i -s 65

PostgreSQL splits tables into segment files of 1GB each, so the main table that pgbench created above will be contained in a single file. The format PostgreSQL uses to store tables on disk will be important in the examples below.

File bundling

File bundling stores data in the repository more efficiently by combining smaller files together. This results in fewer files overall in the backup which improves the speed of all repository operations, especially on object stores like S3, Azure, and GCS. There may also be cost savings on repositories that have a cost per operation since there will be fewer lists, deletes, etc.

To demonstrate this we'll make a backup on repo1, which does not have bundling enabled:

pgbackrest --stanza=demo --type=full --repo=1 backup

Now we check the number of files in repo1 for the latest backup:

$ find /var/lib/pgbackrest/1/backup/demo/latest/ -type f | wc -l

991

This is pretty normal for a small database without bundling enabled since each file is stored separately. There are also a few metadata files that pgBackRest uses to track the backup.

Now we'll perform the same actions on repo2, which has file bundling enabled:

$ pgbackrest --stanza=demo --type=full --repo=2 backup
$ find /var/lib/pgbackrest/2/backup/demo/latest/ -type f | wc -l

7

This time there are far fewer files. The small files have been bundled together and zero-length files are stored only in the manifest.

The repo-bundle-size option can be used to control the maximum size of bundles before compression and other operations are applied. The repo-bundle-limit option limits the files that will be added to bundles. It is not a good idea to set these options too large because any failure in the bundle on backup or restore will require the entire bundle to be retried. The goal of file bundling is to combine small files -- there is very seldom any benefit in combining larger files.

Block incremental backup

Block incremental backup saves space in the repository by storing only the parts of the file that have changed since the last backup. The block size depends on the file size and when the file was last modified, i.e. larger, older files will get larger block sizes. Blocks are compressed and encrypted into super blocks that can be retrieved independently to make restore more efficient.

To demonstrate the block incremental feature, we need to make some changes to the database. With pgbench we can update 100 random rows in the main table, which is about 1GB in size.

/usr/lib/postgresql/12/bin/pgbench -n -b simple-update -t 100

On repo1 the time to make an incremental backup is very similar to making a full backup. As previously discussed, PostgreSQL breaks tables up into 1GB segments so in our case the main table consists of a single file that contains most of the data in our database.

$ pgbackrest --stanza=demo --type=incr --repo=1 backup

<...>
INFO: backup command end: completed successfully (12525ms)

Here we can see that the incremental backup is nearly as large as the full backup, 52.8MB vs 55.5MB. This is expected since the bulk of the database is contained in a single file and by default incremental backups copy the entire file if any part of the file has changed.

$ pgbackrest --stanza=demo --repo=1 info

full backup: 20230520-082323F
    database size: 995.7MB, database backup size: 995.7MB
    repo1: backup size: 55.5MB

incr backup: 20230520-082323F_20230520-082934I
    database size: 995.7MB, database backup size: 972.8MB
    repo1: backup size: 52.8MB

However, on repo2 with block incremental enabled, the backup is significantly faster.

$ pgbackrest --stanza=demo --type=incr --repo=2 backup

<...>
INFO: backup command end: completed successfully (3589ms)

And also much smaller, 943KB vs 52.8MB on the repo without block incremental enabled. This is more than 50x improvement in backup size! Note that the block incremental backup feature also works with differential backups.

$ pgbackrest --stanza=demo --repo=2 info

full backup: 20230520-082438F
    database size: 995.7MB, database backup size: 995.7MB
    repo2: backup size: 56MB

incr backup: 20230520-082438F_20230520-083027I
    database size: 995.7MB, database backup size: 972.8MB
    repo2: backup size: 943.3KB

The block incremental feature also improves the efficiency of the delta restore command. Here we stop the cluster and perform a delta restore back to the full backup in repo 1:

$ pg_ctlcluster 12 demo stop
$ pgbackrest --stanza=demo --delta --repo=1 --set=20230526-053458F restore

<...>
INFO: restore command end: completed successfully (3697ms)

As we saw above the main table is contained in a single file, so the restore must copy and decompress the entire file from repo 1 (compressed size 30.4MB) because it was changed since the full backup.

To test a delta restore of the full backup in repo 2 we need to first restore the cluster to the most recent backup in repo 2:

pgbackrest --stanza=demo --delta --repo=2 restore

And then perform a delta restore back to the full backup in repo 2:

$ pgbackrest --stanza=demo --delta --repo=2 --set=20230526-053406F restore

<...>
INFO: restore command end: completed successfully (1536ms)

This is noticeably faster even on our fairly small demo database. When storage latency is high (e.g. S3) the performance improvement will be more pronounced. With block incremental enabled, delta restore only had to copy 3.5MB of the main table file from repo 2, as compared to 30.4MB from repo 1.

It is best to avoid long chains of block incremental backups since they can have a negative impact on restore performance. In this case pgBackRest may be forced to pull from many backups to restore a file.

Conclusion

Block incremental and file bundling both help make backup and restore more efficient and they are a powerful combination when used together. In general you should consider enabling both on all your repositories, with the caveat that these features are not backward compatible with older versions of pgBackRest.