Cloud Data Sources for the Postgres Operator for Kubernetes
One of the many powerful features in PGO, the open source Postgres Operator from Crunchy Data, is the ability to use an existing data source to initialize a new Postgres cluster. If you are cloning an existing cluster or migrating volumes, PGO data sources make cluster initialization easy. New in PGO v5.0.5, we've expanded support for data sources even further with the introduction of cloud data sources! We’re excited to offer this new feature to customers onboarding to Crunchy Postgres for Kubernetes.
With cloud data sources, you can now use pgBackRest backups stored in AWS S3, GCS or Azure storage as the data source for a new Postgres cluster. What is great about this feature is that the pgBackRest repository containing the backup is only used to initialize the cluster. After that, PGO creates new repositories for the cluster as defined in the backups section of the PostgresCluster spec. This means you can use the same AWS S3, GCS or Azure backup to initialize as many Postgres clusters as needed.
If you want to store Postgres backups using AWS S3, GCS or Azure - and then use them to initialize your Postgres workloads - this feature is perfect for you!
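To make that split concrete, here is a trimmed sketch of the two spec sections involved (the bucket name is a placeholder; complete, working manifests follow in the walkthrough below):

spec:
  # Cloud data source: read once from an existing pgBackRest repository in S3
  # to initialize the new cluster. PGO does not write to this repository.
  dataSource:
    pgbackrest:
      stanza: db
      repo:
        name: repo1
        s3:
          bucket: example-bucket   # placeholder
          endpoint: s3.amazonaws.com
          region: us-east-1
  # Ongoing backups: once the cluster is running, WAL archives and backups go
  # to the repositories defined here, not to the data source above.
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec: {}    # placeholder for a real volume claim spec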
Here is an example of what cloud data sources look like in action.
Populating a Cloud Data Source
The first step is to create a cluster and populate a cloud data source. For this example we will keep the PostgresCluster spec to a single Postgres instance and a single pgBackRest repository. We will configure the repository to use S3 storage, since it represents the cloud data source that we are populating. Finally, we will populate the backups.manual section of the spec so that we can take a manual backup of the database.
First, create a secret containing the S3 key and key secret.
$ kubectl create -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: s3-key
stringData:
  s3.conf: |
    [global]
    repo1-s3-key=<redacted>
    repo1-s3-key-secret=<redacted>
EOF
secret/s3-key created
Next, create the PostgresCluster called hippo-source, which will reference the s3-key secret we created in the previous step.
$ kubectl create -f - <<EOF
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo-source
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
  postgresVersion: 14
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.36-0
      manual:
        repoName: repo1
      configuration:
        - secret:
            name: s3-key
      global:
        repo1-path: /pgbackrest/hippo-source/repo1
      repos:
        - name: repo1
          s3:
            bucket: containers-dev-pgbackrest
            endpoint: s3.amazonaws.com
            region: us-east-1
EOF
postgrescluster.postgres-operator.crunchydata.com/hippo-source created
After the hippo-source cluster is ready, insert a bit of data into the database. To find the name of the primary Postgres instance, you can use:
$ kubectl get pod -o name -l \
  postgres-operator.crunchydata.com/role=master,postgres-operator.crunchydata.com/cluster=hippo-source
pod/hippo-source-instance1-5clq-0
Then, insert some data using psql.
$ kubectl exec -it -c database \
pod/hippo-source-instance1-5clq-0 -- psql
psql (14.0)
Type "help" for help.
CREATE TABLE initdata(id int);
CREATE TABLE
INSERT INTO initdata(id) SELECT id FROM generate_series(1, 10000) AS id;
INSERT 0 10000
Next, take a backup of the data that was inserted. Since the backups.manual section was included in the PostgresCluster spec above, we can simply annotate hippo-source to start the backup.
$ kubectl annotate -n postgres-operator postgrescluster hippo-source \
postgres-operator.crunchydata.com/pgbackrest-backup="$(date)"
postgrescluster.postgres-operator.crunchydata.com/hippo-source annotated
Once the backup completes, we can consider the data source populated! To find the name of the manual backup job, use:
$ kubectl get job -o name -l \
postgres-operator.crunchydata.com/pgbackrest-backup=manual,postgres-operator.crunchydata.com/cluster=hippo-source
job.batch/hippo-source-backup-56k9
Then, wait for the job to complete successfully.
$ kubectl wait --for=condition=complete \
job.batch/hippo-source-backup-56k9
job.batch/hippo-source-backup-56k9 condition met
Finally, delete the hippo-source cluster, but leave the s3-key secret in place. This keeps the S3 repository intact and ready to initialize a new cluster in the next section.
$ kubectl delete postgrescluster hippo-source
postgrescluster.postgres-operator.crunchydata.com "hippo-source" deleted
Using a Cloud Data Source
Now to use our cloud data source! Let's say it's now a week later, and we want to create a new cluster containing the hippo-source data that we backed up above.
This is where the new cloud data source feature comes in. Let's see what it looks like by creating a new cluster called hippo.
$ kubectl create -f - <<EOF
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  dataSource:
    pgbackrest:
      stanza: db
      configuration:
        - secret:
            name: s3-key
      global:
        repo1-path: /pgbackrest/hippo-source/repo1
      repo:
        name: repo1
        s3:
          bucket: containers-dev-pgbackrest
          endpoint: s3.amazonaws.com
          region: us-east-1
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
  postgresVersion: 14
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.36-0
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 1Gi
EOF
postgrescluster.postgres-operator.crunchydata.com/hippo created
A couple of things to note about the hippo PostgresCluster spec:
- It contains an S3 data source that points to the repository used to back up hippo-source above
- The s3-key secret will now be used by the S3 data source
- The pgBackRest repository defined uses a local PVC (and not S3) for storage
Once the cluster is ready, take a look in the database to see if the data we backed up using hippo-source exists. First, find the name of the primary Postgres instance.
$ kubectl get pod -o name -l \
  postgres-operator.crunchydata.com/role=master,postgres-operator.crunchydata.com/cluster=hippo
pod/hippo-instance1-79ff-0
Then, look for the initdata table using psql.
$ kubectl exec -it -c database \
pod/hippo-instance1-79ff-0 -- psql
psql (14.0)
Type "help" for help.
SELECT * FROM initdata;
id
-------
1
2
3
4
5
…
As you can see, the expected data is present, showing that the S3 data source worked!
By looking at the status of hippo, you can also view the pgBackRest repositories currently in use in the hippo cluster.
$ kubectl get postgrescluster hippo \
-o jsonpath='{.status.pgbackrest.repos}'
[{"bound":true,"name":"repo1","stanzaCreated":false,"volume":"crunchy-kubeadm-pv193"}]
As the status shows, there is no reference to the S3 repository used as the data source. This means hippo is writing any new archives and backups to its own PVC-based repository (specifically PV crunchy-kubeadm-pv193 for this example).
Since hippo never writes to the data source, you can repeat these steps to initialize more Postgres clusters with the exact same data!
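For example, a second cluster (using the hypothetical name hippo2 here) could be initialized from the very same backup by reusing the same dataSource section; everything else simply mirrors the hippo spec above:

$ kubectl create -f - <<EOF
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo2
spec:
  # Same cloud data source as hippo: it is only read during initialization.
  dataSource:
    pgbackrest:
      stanza: db
      configuration:
        - secret:
            name: s3-key
      global:
        repo1-path: /pgbackrest/hippo-source/repo1
      repo:
        name: repo1
        s3:
          bucket: containers-dev-pgbackrest
          endpoint: s3.amazonaws.com
          region: us-east-1
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
  postgresVersion: 14
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  # New backups for hippo2 go to its own PVC-based repository.
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.36-0
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 1Gi
EOF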
Cluster Initialization Made Easy
One of our goals when designing PGO is to ensure Postgres cluster initialization is as easy as possible. Thanks to cloud data sources, it is now easier than ever to use backups stored using AWS S3, GCS or Azure storage to initialize your clusters.
Please see the PGO documentation for more information on data sources, and all of the other great features available for managing your Postgres clusters.