
Cloud Data Sources for the Postgres Operator for Kubernetes


Andrew L'Ecuyer

5 min read

One of the many powerful features in PGO, the open source Postgres Operator from Crunchy Data, is the ability to use an existing data source to initialize a new Postgres cluster. If you are cloning an existing cluster or migrating volumes, PGO data sources make cluster initialization easy. New in PGO v5.0.5, we've expanded support for data sources even further with the introduction of cloud data sources! We’re excited to offer this new feature to customers onboarding to Crunchy Postgres for Kubernetes.

With cloud data sources, you can now use pgBackRest backups stored in AWS S3, GCS or Azure storage as the data source for a new Postgres cluster. What is great about this feature is that the pgBackRest repository containing the backup is used only to initialize the cluster. After that, PGO creates new repositories for the cluster as defined in the backups section of the PostgresCluster spec. This means you can use the same AWS S3, GCS or Azure backup to initialize as many Postgres clusters as needed.

If you want to store Postgres backups in AWS S3, GCS or Azure storage and then use them to initialize your Postgres workloads, this feature is perfect for you!

Here is an example of what cloud data sources look like in action.

Populating a Cloud Data Source

The first step is to create a cluster and populate a cloud data source. For this example we will keep the PostgresCluster spec to a single Postgres instance and a single pgBackRest repository. We will configure the repository to use S3 storage, since it represents the cloud data source that we are populating. Finally, we will populate the backups.manual section of the spec so that we can take a manual backup of the database.

First, create a secret containing the S3 key and key secret.

$ kubectl create -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: s3-key
stringData:
  s3.conf: |
    [global]
    repo1-s3-key=<redacted>
    repo1-s3-key-secret=<redacted>
EOF
secret/s3-key created

Next, create the PostgresCluster called hippo-source, which references the s3-key secret created in the previous step.

$ kubectl create -f - <<EOF
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo-source
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
  postgresVersion: 14
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.36-0
      manual:
        repoName: repo1
      configuration:
      - secret:
          name: s3-key
      global:
        repo1-path: /pgbackrest/hippo-source/repo1
      repos:
      - name: repo1
        s3:
          bucket: containers-dev-pgbackrest
          endpoint: s3.amazonaws.com
          region: us-east-1
EOF
postgrescluster.postgres-operator.crunchydata.com/hippo-source created
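
PGO will now provision the cluster. One way to check that it is up (not the only way, but it reuses the cluster label PGO applies to its pods, which we rely on throughout this post) is to wait for the Postgres pods to become Ready once they have been created:

$ kubectl wait --for=condition=Ready pod \
  -l postgres-operator.crunchydata.com/cluster=hippo-source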

After the hippo-source cluster is ready, insert a bit of data into the database. To find the name of the primary Postgres instance you can use:

$ kubectl get pod -o name -l \
  postgres-operator.crunchydata.com/role=master,postgres-operator.crunchydata.com/cluster=hippo-source
pod/hippo-source-instance1-5clq-0

Then, insert some data using psql.

$ kubectl exec -it -c database \
    pod/hippo-source-instance1-5clq-0 -- psql
psql (14.0)
Type "help" for help.
CREATE TABLE initdata(id int);
CREATE TABLE

INSERT INTO initdata(id) SELECT id FROM generate_series(1, 10000) AS id;
INSERT 0 10000

Next, take a backup of the data that was inserted. Since the backups.manual section was included in the PostgresCluster spec above, we can simply annotate hippo-source to start the backup.

$ kubectl annotate postgrescluster hippo-source \
  postgres-operator.crunchydata.com/pgbackrest-backup="$(date)"
postgrescluster.postgres-operator.crunchydata.com/hippo-source annotated

Once the backup completes, we can consider the data source populated! To find the name of the manual backup job, use:

$ kubectl get job -o name -l \
postgres-operator.crunchydata.com/pgbackrest-backup=manual,postgres-operator.crunchydata.com/cluster=hippo-source
job.batch/hippo-source-backup-56k9

Then, wait for the job to complete successfully.

$ kubectl wait --for=condition=complete \
  job.batch/hippo-source-backup-56k9
job.batch/hippo-source-backup-56k9 condition met
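
If you would like to verify the backup before moving on, one option is to run pgBackRest's info command from the primary instance pod (the pgbackrest client is included in the Crunchy Postgres image). The exact output will vary by environment, but it should list a full backup for the db stanza:

$ kubectl exec -it -c database \
    pod/hippo-source-instance1-5clq-0 -- pgbackrest info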

Finally, delete the hippo-source cluster, but leave the s3-key secret in place. The S3 repository remains intact, ready to initialize a new cluster in the next section.

$ kubectl delete postgrescluster hippo-source
postgrescluster.postgres-operator.crunchydata.com "hippo-source" deleted
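
Deleting the PostgresCluster removes the Kubernetes resources, but the backup itself lives on in the bucket. If you happen to have the AWS CLI configured with access to the same bucket (an assumption for this example, not something PGO requires), you can confirm the repository is still there at the repo1-path we configured:

$ aws s3 ls --recursive s3://containers-dev-pgbackrest/pgbackrest/hippo-source/repo1/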

Using a Cloud Data Source

Now to use our cloud data source! Let's say it's now a week later, and we want to create a new cluster containing the hippo-source data that we backed up above.

This is where the new cloud data source feature comes in. Let's see what it looks like by creating a new cluster called hippo.

$ kubectl create -f - <<EOF
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  dataSource:
    pgbackrest:
      stanza: db
      configuration:
      - secret:
          name: s3-key
      global:
        repo1-path: /pgbackrest/hippo-source/repo1
      repo:
        name: repo1
        s3:
          bucket: containers-dev-pgbackrest
          endpoint: s3.amazonaws.com
          region: us-east-1
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
  postgresVersion: 14
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.36-0
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi
EOF
postgrescluster.postgres-operator.crunchydata.com/hippo created
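
Behind the scenes, PGO restores the backup from S3 into the new cluster's data volume before Postgres starts. One way to follow progress is to watch the cluster's pods come up:

$ kubectl get pods -l postgres-operator.crunchydata.com/cluster=hippo --watch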

A couple of things to note about the hippo PostgresCluster spec:

  • It contains an S3 data source that points to the repository used to back up hippo-source above
  • The s3-key secret will now be used by the S3 data source
  • The pgBackRest repository defined uses a local PVC (and not S3) for storage

Once the cluster is ready, take a look at the database to confirm that the data we backed up from hippo-source is present. First, find the name of the primary Postgres instance.

$ kubectl get pod -o name -l \
  postgres-operator.crunchydata.com/role=master,postgres-operator.crunchydata.com/cluster=hippo
pod/hippo-instance1-79ff-0

Then, look for the initdata table using psql.

$ kubectl exec -it -c database \
    pod/hippo-instance1-79ff-0 -- psql
psql (14.0)
Type "help" for help.
SELECT * FROM initdata;
 id
-------
    1
    2
    3
    4
    5
…

As you can see, the expected data is present, showing that the S3 data source worked!
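
For a fuller check than scanning the first few rows, a quick count from the same psql session should confirm that all 10,000 rows made it across:

SELECT count(*) FROM initdata;
 count
-------
 10000
(1 row)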

By looking at the status of hippo, you can also view the pgBackRest repositories currently in use in the hippo cluster.

$ kubectl get postgrescluster hippo \
  -o jsonpath='{.status.pgbackrest.repos}'
[{"bound":true,"name":"repo1","stanzaCreated":false,"volume":"crunchy-kubeadm-pv193"}]

As the status shows, there is no reference to the S3 repository used as the data source. This means hippo is writing any new archives and backups to its own PVC-based repository (specifically PV crunchy-kubeadm-pv193 in this example).

Since hippo never writes to the data source, you can repeat these steps to initialize more Postgres clusters with the exact same data!
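
As a sketch of what that repetition looks like, a second cluster needs little more than a new name (hippo2 is just an illustrative name here); the dataSource block stays identical to the one used for hippo:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo2
spec:
  dataSource:
    pgbackrest:
      stanza: db
      # ...same configuration, global and repo settings
      #    as the hippo dataSource above
  # ...remaining spec (image, postgresVersion, instances, backups)
  #    unchanged from hippo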

Cluster Initialization Made Easy

One of our goals when designing PGO was to make Postgres cluster initialization as easy as possible. Thanks to cloud data sources, it is now easier than ever to use backups stored in AWS S3, GCS or Azure storage to initialize your clusters.

Please see the PGO documentation for more information on data sources, and all of the other great features available for managing your Postgres clusters.