PostgreSQL Snapshots and Backups with pgBackRest in Kubernetes
Backups are dead. Now that I have your attention, let me clarify. Traditional backups have earned a solid reputation for reliability over time. However, they are dead in the sense that a backup is useless until it is restored, or "resurrected." In this post, we'll explore best practices for managing PostgreSQL snapshots and backups using pgBackRest. We will then provide some guidance on how to apply these techniques in Kubernetes using the Postgres Operator (PGO) from Crunchy Data. Whether you're overseeing a production environment, handling replicas, or refreshing lower environments, understanding how to manage snapshots effectively is key.
Creating snapshots
There are two effective methods for creating snapshots, but before we dive into those, let's address a common but ill-advised solution.
You shouldn't snapshot the primary PostgreSQL instance
When working with PostgreSQL, it's crucial to avoid taking snapshots of the primary instance or of running replicas, for several reasons:
- Volume Overhead: Snapshotting the primary instance can impose unnecessary overhead on the underlying volume, potentially affecting performance.
- Risk of Corruption: If the database contains a corrupt block, it can propagate to the snapshots, compromising the integrity of your backups and hindering data recovery.
- Backup Label Management: To snapshot a running instance, you need to execute pg_backup_start and pg_backup_stop. The output of the stop command must be stored, and the appropriate content injected into the backup_label file if the clone is used.
To avoid these issues, I recommend two alternative approaches.
Option 1: Delta restores with pgBackRest
The first and preferred approach is to use pgBackRest for delta restores. When you snapshot a PostgreSQL instance, there's a risk of corrupt blocks being included, endangering your snapshots. pgBackRest adds a layer of protection by checking for corrupt blocks during the backup. If the previous backup contains a questionable block or any other error, the snapshot is skipped. Additionally, pgBackRest verifies the block during restoration, providing two layers of protection for your snapshots.
To start, create a persistent volume claim that will be used for the delta restore. This PVC will be mounted to the restore job each time and will also be the PVC against which the snapshot is taken after each restore. The restore job should follow these high-level steps:
- Mount the delta restore PVC
- Check the last pgBackRest backup for errors (abort if errors are found)
- Perform a checksum on $PGDATA/backup_label
- Execute the delta restore with pgBackRest
- Checksum $PGDATA/backup_label again after the restore and compare it with the earlier value (if it is unchanged, no new backup was restored, so end the job)
- Snapshot the delta restore PVC
- Repeat after each pgBackRest backup or as per the desired schedule
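The steps above could be sketched as a small shell script for the restore job. Treat it as a sketch under assumptions that are not part of this post: the stanza is named db, PGDATA points into the mounted delta restore PVC, jq and kubectl are available in the job image, the job's service account is allowed to create VolumeSnapshot objects, and the PVC and snapshot class names (acmeprod-delta-restore, csi-snapclass) are hypothetical. It also assumes the pgBackRest info JSON exposes a per-backup error flag, as recent releases do.

```shell
#!/bin/sh
# Sketch of the delta restore job. All names below are illustrative assumptions.
STANZA="${STANZA:-db}"
PVC_NAME="${PVC_NAME:-acmeprod-delta-restore}"     # hypothetical PVC name
NAMESPACE="${NAMESPACE:-crunchy-snap}"
SNAPSHOT_CLASS="${SNAPSHOT_CLASS:-csi-snapclass}"  # hypothetical snapshot class

# Succeeds only if the most recent backup reports no error.
# Assumes the info JSON carries an "error" flag per backup.
last_backup_ok() {
  pgbackrest --stanza="$STANZA" info --output=json |
    jq -e '.[0].backup[-1].error == false' >/dev/null
}

# Prints the SHA-256 of backup_label in the given directory,
# or "none" if the file does not exist yet (first run).
label_sum() {
  if [ -f "$1/backup_label" ]; then
    sha256sum "$1/backup_label" | cut -d' ' -f1
  else
    echo none
  fi
}

# Prints a VolumeSnapshot manifest for the delta restore PVC.
# Usage: snapshot_manifest SNAPSHOT_NAME
snapshot_manifest() {
  cat <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: $1
  namespace: $NAMESPACE
spec:
  volumeSnapshotClassName: $SNAPSHOT_CLASS
  source:
    persistentVolumeClaimName: $PVC_NAME
EOF
}

# Runs the whole job: check, checksum, restore, compare, snapshot.
run_job() {
  last_backup_ok || { echo "last backup has errors, aborting" >&2; return 1; }
  before=$(label_sum "$PGDATA")
  pgbackrest --stanza="$STANZA" --delta restore || return 1
  after=$(label_sum "$PGDATA")
  if [ "$before" = "$after" ]; then
    echo "backup_label unchanged, nothing new to snapshot"
    return 0
  fi
  snapshot_manifest "$PVC_NAME-$(date +%Y%m%d%H%M)" | kubectl apply -f -
}
```

Calling run_job from a Kubernetes Job or CronJob that mounts the PVC would cover the last step of the list. Keeping the manifest in a function makes it easy to inspect what would be applied before the job is given real permissions.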
Option 2: Using a standby replica
The second approach involves using a standby replica. The advantage here is that snapshots can be taken without waiting for a backup, allowing for increased snapshot frequency. A job can be submitted to perform the snapshot, following these high-level steps:
- Ensure the replica is up-to-date by comparing the source LSN to the last applied LSN in the replica
- Shut down the PostgreSQL standby replica (setting spec.shutdown to true in the Postgres Cluster manifest if using the Postgres Operator)
- Snapshot the replica's PVC
- Restart the PostgreSQL replica (setting spec.shutdown to false in the Postgres Cluster manifest if using the Postgres Operator)
- Verify that replication has resumed correctly
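A job for these steps might look roughly like the following. This is a sketch, not a finished implementation, and it makes several assumptions of its own: the standby is its own PostgresCluster (hypothetically named acmeprod-standby), both pods are reachable with kubectl from the job, the PostgreSQL container is named database, and the VolumeSnapshot manifest for the replica's PVC is passed in as a file. The LSN comparison is done in the shell so that the allowed lag is explicit.

```shell
#!/bin/sh
# Sketch of the standby snapshot job. Names are illustrative assumptions.
NAMESPACE="${NAMESPACE:-crunchy-snap}"
STANDBY_CLUSTER="${STANDBY_CLUSTER:-acmeprod-standby}"  # hypothetical cluster name

# Converts an LSN such as 16/B374D848 into a byte position.
lsn_to_int() {
  echo $(( (0x${1%%/*} << 32) + 0x${1##*/} ))
}

# Succeeds when the replayed LSN is within MAX_LAG_BYTES of the source LSN.
# Usage: replica_caught_up SOURCE_LSN REPLAY_LSN MAX_LAG_BYTES
replica_caught_up() {
  lag=$(( $(lsn_to_int "$1") - $(lsn_to_int "$2") ))
  [ "$lag" -le "$3" ]
}

# Runs a single-value query in the database container of the given pod.
query() {
  kubectl exec -n "$NAMESPACE" "$1" -c database -- psql -AtXc "$2"
}

# Sets spec.shutdown on the standby PostgresCluster to true or false.
set_shutdown() {
  kubectl patch postgrescluster "$STANDBY_CLUSTER" -n "$NAMESPACE" \
    --type merge -p "{\"spec\":{\"shutdown\":$1}}"
}

# Usage: snapshot_standby PRIMARY_POD STANDBY_POD SNAPSHOT_MANIFEST_FILE [MAX_LAG_BYTES]
snapshot_standby() {
  source_lsn=$(query "$1" "SELECT pg_current_wal_lsn()") || return 1
  replay_lsn=$(query "$2" "SELECT pg_last_wal_replay_lsn()") || return 1
  replica_caught_up "$source_lsn" "$replay_lsn" "${4:-0}" ||
    { echo "replica is lagging, not snapshotting" >&2; return 1; }
  set_shutdown true || return 1
  # Wait until the standby pod is gone so the data directory is cleanly shut down.
  kubectl wait --for=delete "pod/$2" -n "$NAMESPACE" --timeout=300s
  kubectl apply -f "$3"
  set_shutdown false
  # Replication still needs to be verified once the standby is back up.
}
```

On a busy primary an exact LSN match is unlikely, so a small non-zero MAX_LAG_BYTES is probably the more practical setting.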
Consuming snapshots
Now let's use the Postgres Operator (PGO) from Crunchy Data to automate the process of using the snapshots. A common scenario involves refreshing a User Acceptance Test (UAT) database from production. Here's how to do it:
Identifying existing snapshots
The first step is to identify the snapshot we want to use. We can list available snapshots using kubectl:
kubectl get volumesnapshot -n crunchy-snap -o=custom-columns=NAME:.metadata.name,STATUS:.status.readyToUse
NAME                                 STATUS
acmeprod-replica-snapshot-20240830   true
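When the refresh is automated, picking the snapshot by hand is not an option. One possible approach, sketched here with the same kubectl columns as above, is to sort the snapshots by creation time and take the newest one that reports ready. The function names are made up for this example.

```shell
#!/bin/sh
# Reads "NAME STATUS" lines ordered oldest to newest and prints
# the name of the newest snapshot whose status is true.
latest_ready() {
  awk '$2 == "true" { name = $1 } END { if (name != "") print name }'
}

# Lists snapshots by creation time and picks the newest ready one.
# Usage: newest_snapshot [NAMESPACE]
newest_snapshot() {
  kubectl get volumesnapshot -n "${1:-crunchy-snap}" \
    --sort-by=.metadata.creationTimestamp --no-headers \
    -o=custom-columns=NAME:.metadata.name,STATUS:.status.readyToUse |
    latest_ready
}
```

Snapshots that are still being cut, or that have no status yet, are skipped because their STATUS column is not true.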
Creating a PostgreSQL clone from a snapshot
Once we've identified the snapshot, the next step is to create a new Persistent Volume Claim (PVC) from it. This is where a Kubernetes Operator like the PGO can really add value. By simply specifying the desired end state to the Postgres Operator, the operator handles the details.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: acmeuat
spec:
  # Postgres Clone Operation Instructions
  dataSource:
    volumes:
      pgDataVolume:
        pvcName: acmeauat-replica-snapshot-restore
  image: registry.crunchydata.com/crunchydata/crunchy-postgres:ubi8-16.3-2
  port: 5432
  postgresVersion: 16
  instances:
    - name: 'uat'
      replicas: 1
      dataVolumeClaimSpec:
        accessModes:
          - 'ReadWriteOnce'
        # Identify the snapshot to be used
        dataSource:
          apiGroup: snapshot.storage.k8s.io
          kind: VolumeSnapshot
          name: acmeprod-replica-snapshot-20240830
        dataSourceRef:
          apiGroup: snapshot.storage.k8s.io
          kind: VolumeSnapshot
          name: acmeprod-replica-snapshot-20240830
        resources:
          requests:
            storage: 100Gi
        storageClassName: ssd-csi
        # Name the volume the same as the pvcName specified under spec.dataSource.
        volumeName: acmeauat-replica-snapshot-restore
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          shared_buffers: 512MB
          work_mem: 10MB
  backups:
    pgbackrest:
      image: registry.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.51-2
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - 'ReadWriteOnce'
              resources:
                requests:
                  storage: 100Gi
Submitting this configuration instructs the Postgres Operator to create a new cloned environment using the storage snapshot.
Instructions for clone operations
The spec.dataSource section tells the Operator that this is a clone operation and specifies which PVC will contain the staged PostgreSQL data. These clones include everything necessary to bring the PostgreSQL instance online and recover to a consistent state. That holds in either of two scenarios: the source was a pgBackRest delta restore of a backup taken with the archive-copy option, or the snapshot was taken from a cleanly shut down PostgreSQL instance.
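For the first scenario, archive-copy has to be enabled on the source cluster before the backup is taken. One way to do that with PGO is to set it in the pgBackRest global options. The sketch below patches an existing cluster. The function names are invented for this example, and the cluster name in the usage line is an assumption.

```shell
#!/bin/sh
# Prints a JSON merge patch that sets pgBackRest's archive-copy option
# in the global section of the PostgresCluster spec.
archive_copy_patch() {
  printf '%s\n' '{"spec":{"backups":{"pgbackrest":{"global":{"archive-copy":"y"}}}}}'
}

# Applies the patch to a PostgresCluster.
# Usage: enable_archive_copy CLUSTER NAMESPACE
enable_archive_copy() {
  kubectl patch postgrescluster "$1" -n "$2" --type merge -p "$(archive_copy_patch)"
}
```

For example, enable_archive_copy acmeprod crunchy-snap would apply it to a production cluster named acmeprod. Only backups taken after the change include the WAL segments needed for consistency.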
Instructions for snapshot operations
In spec.instances[0].dataVolumeClaimSpec, two sections guide the Postgres Operator to create a persistent volume claim based on a specific VolumeSnapshot: spec.instances[0].dataVolumeClaimSpec.dataSource and spec.instances[0].dataVolumeClaimSpec.dataSourceRef. In our example, both reference the VolumeSnapshot we identified earlier (acmeprod-replica-snapshot-20240830). Finally, the Operator is instructed to name the newly created persistent volume claim the same as the pvcName specified under spec.dataSource, in this case acmeauat-replica-snapshot-restore.
Additional considerations for using snapshots
If we wanted to roll the cloned copy forward to a specific point in time, we could include a pgBackRest section under spec.dataSource. This would require pgBackRest to use an object storage solution as one of its repositories. Here's an example:
dataSource:
  volumes:
    pgDataVolume:
      pvcName: acmeauat-replica-snapshot-restore
  pgbackrest:
    options:
      - --type=time
      - --target="2024-08-30 12:30:00"
    configuration:
      - secret:
          name: s3-confuat
    stanza: db
    repo:
      name: repo2
      s3:
        bucket: 'acmeprod-pgbackrest-repo'
        endpoint: 's3.openshift-storage.svc:443'
        region: 'us'
Ideally, the storage provider creates the new volume as a snapshot of the existing snapshot and mounts it without moving any data. Depending on the provider, the data might still be copied internally, but that is faster than moving it across different infrastructures. In either case, snapshots offer significant advantages.
Conclusion
Effectively managing PostgreSQL snapshots and backups requires a strategic approach. By using delta restores with pgBackRest or leveraging a standby replica, you can reduce risks and enhance your backup strategy. Whether you're managing production databases or refreshing environments, these methods offer a reliable and efficient solution. Using a Kubernetes Operator like PGO from Crunchy Data simplifies the process of consuming snapshots across various use cases. Both snapshot options discussed provide "virtual full copies" of the database, which are efficient in terms of disk usage: you can keep multiple "full copies" while consuming disk space only for the changes between snapshots.
If you're interested in trying these methods or need assistance with setting up snapshot jobs, feel free to reach out. You can get started with these examples with Crunchy Postgres for Kubernetes using the quickstart. Stay tuned for more, as the Crunchy Data engineering team has exciting plans for further automating snapshots in the near future.