Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

Introducing pgBackRest Multiple Repository Support

Avatar for David Steele

David Steele

5 min read

The pgBackRest team is pleased to announce the introduction of multiple repository support in v2.33. Backups already provide redundancy by creating an offline copy of your PostgreSQL cluster that can be used in disaster recovery. Multiple repositories allow you to have copies of your backups and WAL archives in separate locations to increase your redundancy and provide even more protection for your data. This feature is the culmination of many months of hard work, so let's delve into why we think multiple repositories are so important and how they can help preserve your data.

If you are unfamiliar with pgBackRest, general repository configuration, or configuring PostgreSQL to work with pgBackRest, please read the pgBackRest Quick Start before proceeding.

Configuration

Up to four repositories may be configured and each one can be any repo type, e.g. S3 or Posix. The configuration below defines two repositories. repo1 is Posix and stored on locally-mounted NFS volume; the repository should always be located off the PostgreSQL server in case disaster strikes. repo2 is stored on Azure.

$ cat /etc/pgbackrest/pgbackrest.conf

[demo]
pg1-path=/var/lib/postgresql/13/demo

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2

repo2-type=azure
repo2-azure-account=pgbackrest
repo2-azure-container=demo-container
repo2-azure-key=YXpLZXk=
repo2-path=/demo-repo
repo2-retention-full=8

Note that the retention has been configured differently on each repository. repo1 has a shorter retention to save space but still provide several backups and plenty of WAL on local storage that is both fast and cheap (in terms of bandwidth cost) to access. repo2 has a longer retention and is stored in Azure where storage is cheap but retrieving data is both slower and more expensive than repo1. repo2 is both a fail-safe in case something goes wrong with repo1 and a resource to restore backups from remote sites.

pgBackRest will treat repo1 with higher priority than repo2 for certain commands like restore and archive-get. In general, the lower-numbered repositories should be faster and/or cheaper than the higher-numbered repositories.

Usage

Now that the repositories are configured, run stanza-create. This will initialize the demo stanza in each repository.

$ pgbackrest --stanza=demo stanza-create

<...>
INFO: stanza-create for stanza 'demo' on repo1
INFO: stanza-create for stanza 'demo' on repo2
<...>

Once the repositories have been initialized it is a good idea to run check to ensure that everything is working. Note that WAL segments are being pushed to both repositories.

$ pgbackrest --stanza=demo check

<...>
INFO: check repo1 configuration (primary)
INFO: check repo2 configuration (primary)
INFO: check repo1 archive for WAL (primary)
INFO: WAL segment 000000010000000000000003 successfully archived to '/var/lib/pgbackrest/archive/demo/13-1/0000000100000000/000000010000000000000003-c981b4ddc8c1437c539eda05427a6aa454a0923e.gz' on repo1
INFO: check repo2 archive for WAL (primary)
INFO: WAL segment 000000010000000000000003 successfully archived to '/demo-repo/archive/demo/13-1/0000000100000000/00000001000000000000000

Now a backup should be run for each repository. Most commands operate automatically on all repos but backup requires the repo to be specified. Each repo is likely to have different retention and backup schedules so backups should be run independently.

pgbackrest --stanza=demo --repo=1 backup
pgbackrest --stanza=demo --repo=2 backup

Info for the repositories is shown in a unified fashion with the organizing unit being the stanza. This way all the backups for a stanza can be seen together and the most recent backup easily identified since the list is sorted oldest to newest.

$ pgbackrest info

stanza: demo
    status: ok
    cipher: none

    db (current)
        wal archive min/max (13): 000000010000000000000003/000000010000000000000006

        full backup: 20210331-155726F
            timestamp start/stop: 2021-03-31 15:57:26 / 2021-03-31 15:57:30
            wal start/stop: 000000010000000000000005 / 000000010000000000000005
            database size: 23.1MB, database backup size: 23.1MB
            repo1: backup set size: 2.8MB, backup size: 2.8MB

        full backup: 20210331-155736F
            timestamp start/stop: 2021-03-31 15:57:36 / 2021-03-31 15:57:41
            wal start/stop: 000000010000000000000006 / 000000010000000000000006
            database size: 23.1MB, database backup size: 23.1MB
            repo2: backup set size: 2.8MB, backup size: 2.8MB

It is also possible to get info for a single repository by specifying the --repo option.

$ pgbackrest --repo=2 info

stanza: demo
    status: ok
    cipher: none

    db (current)
        wal archive min/max (13): 000000010000000000000003/000000010000000000000006

        full backup: 20210331-155736F
            timestamp start/stop: 2021-03-31 15:57:36 / 2021-03-31 15:57:41
            wal start/stop: 000000010000000000000006 / 000000010000000000000006
            database size: 23.1MB, database backup size: 23.1MB
            repo2: backup set size: 2.8MB, backup size: 2.8MB

When restoring, pgBackRest will automatically select the best backup from the repositories based on your criteria. Here is an example of time-based recovery:

pgbackrest --stanza=demo --delta --type=time --target="2021-03-31 15:57:31-04" restore

<...>
INFO: repo1: restore backup set 20210331-155726F
<...>

In this case only the backup from repo1 was valid because of the time restriction. Even if a later time was specified that would seem to favor the later backup in repo2, the same backup from repo1 will be selected. This is because pgBackRest tends to prefer the repo with higher priority, i.e. repo1 over repo2, under the assumption that it will be faster and/or cheaper than a repository with lower priority.

If you want to restore from a specific repository then simply specify the preferred repository using --repo. The archive-get command generated for recovery will always search all repositories in priority order for WAL segments.

Conclusion

Multiple repositories allow for more redundancy as well as cost savings and performance improvements by allowing a repository to be located close to the PostgreSQL clusters while also having a repository located safely out in the cloud or in another data center.

For more information about multiple repository support, see the User Guide.