Troubleshooting Postgres in Kubernetes
In my role as a Solutions Architect at Crunchy Data, I help customers get up and running with Crunchy Postgres for Kubernetes (CPK). Installing and managing a Postgres cluster in Kubernetes has never been easier. However, sometimes things don't go as planned, and I’ve noticed a few major areas where Kubernetes installations go awry. Today I want to walk through the most common issues I see when people try to get up and running with Postgres in Kubernetes and offer a list of basic troubleshooting ideas. Your issue might not be in here, but if you’re just trying to diagnose a bad install or a failing cluster, here’s my go-to list of where to get started.
The Order of Things: CRD, Operator, Cluster, Pod
Let’s get started with a basic understanding of how things get installed and by what. You can use that knowledge to determine where to look first when something that you are expecting does not appear during your installation.
Custom Resource Definition (CRD): The CPK Operator requires a Custom Resource Definition (CRD). It is possible to have multiple CRDs per Operator. Our most recent Operator, 5.5, has 3 CRD examples, postgres-operator.crunchydata.com_postgresclusters.yaml being one of them. The user applies all CRD files to the Kubernetes cluster. The CRDs must be installed before the operator.
Operator: The CPK Operator gets installed by the user applying a manager.yaml file that describes a Kubernetes object of kind:Deployment. This creates the Deployment and the Deployment creates the Operator pod. The Operator itself is a container running in a pod.
Postgres Cluster: A CPK Postgres cluster is typically created by the user applying a postgres.yaml file that contains the PostgresCluster spec and describes a Kubernetes object of kind: PostgresCluster.
Pods: The StatefulSets and Deployments create the individual pods they describe. The Operator creates a StatefulSet for each Postgres pod and for the pgBackRest repo host pod (if applicable). Deployments are also created for pgBouncer pods (if applicable). If you are missing a pod, describe the Deployment or StatefulSet that owns it. If you are missing a Deployment or StatefulSet, the CPK Operator logs will usually indicate why.
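The install order above also gives a natural order for checking what is missing. The following is a minimal sketch, not part of CPK: the function name cpk_first_missing_layer is made up for illustration, and it assumes the postgres-operator namespace and the pgo Deployment name that appear in the examples in this post.

```shell
# Hypothetical helper: report the first missing layer in the install order
# CRD -> Operator -> PostgresCluster -> StatefulSet.
# Assumes the Operator Deployment is named "pgo" (as in this post's examples).
cpk_first_missing_layer() {
  ns="${1:-postgres-operator}"

  # 1. The CRD must exist before the Operator can do anything.
  if ! kubectl get crd postgresclusters.postgres-operator.crunchydata.com >/dev/null 2>&1; then
    echo "crd"; return 1
  fi
  # 2. The Operator runs as a Deployment.
  if ! kubectl -n "$ns" get deployment pgo >/dev/null 2>&1; then
    echo "operator"; return 1
  fi
  # 3. At least one PostgresCluster object must have been applied.
  if [ -z "$(kubectl -n "$ns" get postgresclusters -o name 2>/dev/null)" ]; then
    echo "postgrescluster"; return 1
  fi
  # 4. The Operator creates StatefulSets for the Postgres pods.
  if [ -z "$(kubectl -n "$ns" get statefulsets -o name 2>/dev/null)" ]; then
    echo "statefulset"; return 1
  fi
  echo "none"
}
```

Calling cpk_first_missing_layer prints crd, operator, postgrescluster, statefulset, or none. If it prints statefulset, the Operator logs (kubectl -n postgres-operator logs deployment/pgo) are probably the next place to look.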
Image Pulls
Next, let’s look at image pull issues. There are two primary reasons why you would receive an image pull error: either you do not have permission to connect to the registry or pull the requested image, or the requested image is not in the registry.
Permissions Example
I am attempting to deploy the CPK Operator.
kubectl apply -n postgres-operator -k install/default --server-side
I see that I have an ImagePullBackOff error.
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
pgo-5694b9545c-ggz7g 0/1 ImagePullBackOff 0 27s
When a pod is not coming up in Kubernetes, the first thing to do is run a describe on the pod and look at the events at the bottom of the output.
kubectl -n postgres-operator describe pod pgo-5694b9545c-ggz7g
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Normal Pulling 6m9s (x4 over 7m39s) kubelet Pulling image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0"
Warning Failed 6m9s (x4 over 7m39s) kubelet Failed to pull image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0": rpc error: code = Unknown desc = failed to pull and unpack image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0": failed to resolve reference "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to <https://access.crunchydata.com/api/v1/auth/jwt/container/token/?scope=repository%3Acrunchydata%2Fpostgres-operator%3Apull&service=crunchy-container-registry:> 403 Forbidden
Looking in the events, we see that we attempted to pull the crunchydata/postgres-operator:ubi8-5.5.0-0 image from the Crunchy Data registry. The next event entry shows 403 Forbidden. This means that we do not have permission to pull this image from this registry.
Adding a pull secret
To resolve the issue, we will create a pull secret and add it to the deployment. You can find more information on creating pull secrets for private registries in the CPK documentation.
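As a rough sketch of the first half of that step, a registry pull secret can be created with kubectl create secret docker-registry. The secret name crunchy-regcred and the function name below are placeholders chosen for illustration, not names from the CPK documentation. How the secret is then referenced by the Operator Deployment should follow the CPK docs.

```shell
# Hypothetical helper: create an image pull secret for the Crunchy Data registry.
# "crunchy-regcred" is a placeholder name; use whatever your install expects.
create_pull_secret() {
  if [ "$#" -lt 2 ]; then
    echo "usage: create_pull_secret <username> <password> [namespace]" >&2
    return 2
  fi
  kubectl -n "${3:-postgres-operator}" create secret docker-registry crunchy-regcred \
    --docker-server=registry.crunchydata.com \
    --docker-username="$1" \
    --docker-password="$2"
}
```

Passing the password as an argument exposes it to the shell history and process list, so for anything beyond a quick test it may be better to read it from a file or an environment variable.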
We created the image pull secret and added it to the deployment per the documentation. We apply the change and delete the failed pod. Now we see that the pod is recreated and the image is pulled successfully.
kubectl apply -n postgres-operator -k install/default --server-side
kubectl -n postgres-operator delete pod pgo-5694b9545c-ggz7g
pod "pgo-5694b9545c-ggz7g" deleted
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
pgo-5694b9545c-xnpjg 1/1 Running 0 23s
Image Not In Registry Example
We again attempt to deploy the Operator and see that we have an ImagePullBackOff error.
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
pgo-6bfc9554b7-6h4jd 0/1 ImagePullBackOff 0 22s
Just like before, we will describe the pod and look at the events to determine why this is happening:
kubectl -n postgres-operator describe pod pgo-6bfc9554b7-6h4jd
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Normal Pulling 4m30s (x4 over 6m5s) kubelet Pulling image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0"
Warning Failed 4m30s (x4 over 6m4s) kubelet Failed to pull image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0": rpc error: code = NotFound desc = failed to pull and unpack image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0": failed to resolve reference "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0": registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0: not found
This time we see that we tried to pull the crunchydata/postgres-operator:ubi8-5.50.0-0 image from the Crunchy Data registry. However, the image is not found. Upon closer inspection of the image listed in the CPK Operator kustomization.yaml file we see that we have a typo. We had a tag of ubi8-5.50.0-0 when it should have been ubi8-5.5.0-0.
images:
  - name: postgres-operator
    newName: registry.crunchydata.com/crunchydata/postgres-operator
    newTag: ubi8-5.50.0-0
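A quick way to double-check the tag before applying is to print the image reference that kustomize will substitute. This sketch assumes the kustomization.yaml layout shown above, with unquoted values and newName appearing before newTag. The function name is made up for illustration.

```shell
# Hypothetical helper: print "<newName>:<newTag>" for each image override
# in a kustomization.yaml laid out like the snippet above.
kustomize_image_refs() {
  awk '$1 == "newName:" { name = $2 }
       $1 == "newTag:"  { print name ":" $2 }' "$1"
}
```

Running kustomize_image_refs against the Operator's kustomization.yaml prints the full reference on one line, which makes a tag such as ubi8-5.50.0-0 easier to spot than when it is buried in the manifest.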
Changing tag names
We make the correction to the file and apply the change. The pod is automatically recreated with the correct image tag.
kubectl apply -n postgres-operator -k install/default --server-side
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
pgo-6bfc9554b7-6h4jd 1/1 Running 0 96s
By using the kubectl describe pod command we were able to see why we were getting image pull errors and easily correct them.
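The two failure messages shown in this section are distinct enough to tell apart mechanically. The sketch below is only an illustration: the function name is hypothetical, and it matches the exact wording from the events in this post, which other container runtimes or registries may phrase differently.

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and name
# the likely cause of an image pull failure, based on the two messages
# shown in this post. Other container runtimes may word these differently.
classify_pull_error() {
  events="$(cat)"
  case "$events" in
    *"403 Forbidden"*|*"failed to authorize"*) echo "permissions" ;;
    *"code = NotFound"*|*": not found"*)       echo "image-not-in-registry" ;;
    *)                                         echo "unknown" ;;
  esac
}
```

It can be fed directly from a describe, for example: kubectl -n postgres-operator describe pod pgo-5694b9545c-ggz7g | classify_pull_error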
Resource Allocation
Another important place to look when troubleshooting a failed Kubernetes installation is resource allocation: make sure pods have the necessary CPU and memory. The most common issues I see at installation time are:
- Requesting more resources than are available on the Kubernetes nodes.
- Resource requests too small for the containers running in the pod to operate properly.
Resource Request Exceeds Availability
Here in this postgres.yaml we set some resource requests and limits for our Postgres pods. We are requesting 5 CPUs and setting a limit of 10 CPUs per Postgres pod.
instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 10000m
        memory: 256Mi
      requests:
        cpu: 5000m
        memory: 100Mi
When we create the Postgres cluster and look at the pods we find them in a pending state.
kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
hippo-ha-pgbouncer-7c467748d-tl4pn 2/2 Running 0 103s
hippo-ha-pgbouncer-7c467748d-v6s4d 2/2 Running 0 103s
hippo-ha-pgha1-bzrb-0 0/5 Pending 0 103s
hippo-ha-pgha1-z7nl-0 0/5 Pending 0 103s
hippo-ha-repo-host-0 2/2 Running 0 103s
pgo-6ccdb8b5b-m2zsc 1/1 Running 0 48m
Let's describe one of the pending pods and look at the events:
kubectl -n postgres-operator describe pod hippo-ha-pgha1-bzrb-0
Name: hippo-ha-pgha1-bzrb-0
Namespace: postgres-operator
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m41s (x2 over 3m43s) default-scheduler 0/2 nodes are available: 2 Insufficient cpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
We see that there is insufficient available CPU to meet our request. We reduce our resource request and limits and try again.
instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 1000m
        memory: 256Mi
      requests:
        cpu: 500m
        memory: 100Mi
kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
hippo-ha-backup-jb8t-tgdtx 1/1 Running 0 13s
hippo-ha-pgbouncer-7c467748d-s8wq6 2/2 Running 0 34s
hippo-ha-pgbouncer-7c467748d-zhcmf 2/2 Running 0 34s
hippo-ha-pgha1-hmrq-0 5/5 Running 0 35s
hippo-ha-pgha1-xxtf-0 5/5 Running 0 35s
hippo-ha-repo-host-0 2/2 Running 0 35s
pgo-6ccdb8b5b-m2zsc 1/1 Running 0 124m
Now we see that all of our pods are running as expected.
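To catch an Insufficient cpu failure before applying a manifest, the CPU request can be compared against what each node reports as allocatable (kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu lists it). The helpers below are a small sketch with made-up function names. They only compare two quantities and do not account for CPU already requested by other pods on the node.

```shell
# Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Succeed if a CPU request fits within a node's allocatable CPU.
# Note: this ignores what other pods on the node have already requested,
# so it is an upper bound, not a scheduling guarantee.
cpu_request_fits() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}
```

For example, with a node that reports 1930m allocatable (an illustrative value, not one taken from the cluster in this post), cpu_request_fits 5000m 1930m fails while cpu_request_fits 500m 1930m succeeds.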
Insufficient Resource Request
What happens if we don't allocate enough resources? Here we set very low CPU requests and limits. We are requesting 5m of CPU (five thousandths of a core) and setting a limit of 10m.
instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 10m
        memory: 256Mi
      requests:
        cpu: 5m
        memory: 100Mi
We apply the manifest and take a look at the pods.
kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
hippo-ha-pgbouncer-7c467748d-hnf5k 2/2 Running 0 93s
hippo-ha-pgbouncer-7c467748d-q28t9 2/2 Running 0 93s
hippo-ha-pgha1-r2qs-0 4/5 Running 2 (11s ago) 93s
hippo-ha-pgha1-x2ft-0 4/5 Running 2 (8s ago) 93s
hippo-ha-repo-host-0 2/2 Running 0 93s
pgo-6ccdb8b5b-m2zsc 1/1 Running 0 136m
We see that our Postgres pods are only showing 4/5 containers running and 90 seconds after creation they have already restarted twice. This is a clear indication that something is wrong. Let's look at the logs for the Postgres container to see what is going on.
kubectl -n postgres-operator logs hippo-ha-pgha1-r2qs-0 -c database
We didn't get any logs back. This indicates that the Postgres container is not starting. Now we will adjust the CPU request and limit to more reasonable values and try again. I normally don’t go below 500m.
instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 1000m
        memory: 256Mi
      requests:
        cpu: 500m
        memory: 100Mi
kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created
Now we see that our cluster is up and running with all expected containers.
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
hippo-ha-backup-pv9n-tr7mh 1/1 Running 0 6s
hippo-ha-pgbouncer-7c467748d-45jj9 2/2 Running 0 33s
hippo-ha-pgbouncer-7c467748d-lqfz2 2/2 Running 0 33s
hippo-ha-pgha1-8kh2-0 5/5 Running 0 34s
hippo-ha-pgha1-v4t5-0 5/5 Running 0 34s
hippo-ha-repo-host-0 2/2 Running 0 33s
pgo-6ccdb8b5b-m2zsc 1/1 Running 0 147m
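When a pod shows 4/5 containers ready, as in the failing example above, it helps to know which container is the one that is not ready before pulling logs. The following sketch uses kubectl's jsonpath output. The function names are invented for illustration.

```shell
# Hypothetical helper: read "name<TAB>ready<TAB>restartCount" lines on stdin
# and print the containers that are not ready.
not_ready_containers() {
  awk -F '\t' '$2 != "true" { print $1 " (restarts: " $3 ")" }'
}

# Print one line per container in a pod, in the format expected above.
container_status() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'
}
```

Usage would look like: container_status postgres-operator hippo-ha-pgha1-r2qs-0 | not_ready_containers. For a container that keeps restarting, adding --previous to kubectl logs shows the output of the prior, terminated instance, which can be useful when the current one has not logged anything yet.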
Storage Allocation
Lastly, we will look at some common issues when allocating storage to our pods. The most common issues that someone will run into regarding storage allocation at installation time are:
- Improper Resource Request
- Unsupported Storage Class
Improper Resource Request Example
Here is an example of the storage we want to allocate to our Postgres cluster pods in the postgres.yaml:
dataVolumeClaimSpec:
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1GB
When we attempt to apply the manifest we see this output on the command line:
kubectl apply -n postgres-operator -k high-availability
The PostgresCluster "hippo-ha" is invalid: spec.instances[0].dataVolumeClaimSpec.resources.requests.storage: Invalid value: "1GB": spec.instances[0].dataVolumeClaimSpec.resources.requests.storage in body should match '^(\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))))?$'
The value of "1GB" is invalid. The error message tells you where in the manifest the error is: the spec.instances[0].dataVolumeClaimSpec.resources.requests.storage section. The message even provides the regex that is used for validation.
When we enter the valid value of 1Gi we are able to deploy our Postgres cluster. Remember that Kubernetes quantities use suffixes such as Gi (gibibytes) and Mi (mebibytes), or the decimal G and M; GB and MB are not accepted. More syntax specifics are in the Kubernetes docs.
dataVolumeClaimSpec:
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
hippo-ha-backup-ngg5-56z7z 1/1 Running 0 10s
hippo-ha-pgbouncer-7c467748d-4q887 2/2 Running 0 35s
hippo-ha-pgbouncer-7c467748d-lc2sr 2/2 Running 0 35s
hippo-ha-pgha1-w9vc-0 5/5 Running 0 35s
hippo-ha-pgha1-zhx8-0 5/5 Running 0 35s
hippo-ha-repo-host-0 2/2 Running 0 35s
pgo-6ccdb8b5b-vzzkp 1/1 Running 0 12m
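The validation regex quoted in the error message can also be run locally to check a value before applying the manifest. This is a small sketch using grep -E. The function name is made up, and the pattern is the one from the error above with the doubled backslashes reduced to single ones.

```shell
# Succeed if the argument matches the quantity regex quoted in the
# validation error above (backslashes un-doubled for grep -E).
is_valid_quantity() {
  printf '%s\n' "$1" | grep -Eq '^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$'
}
```

With this, is_valid_quantity 1Gi and is_valid_quantity 1G succeed, while is_valid_quantity 1GB fails, which matches the behavior seen when applying the manifest.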
Improper Storage Class Name Example
We want to specify a specific storage class to be used with our Postgres cluster pods:
dataVolumeClaimSpec:
  storageClassName: foo
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi
When we apply the manifest we see that our Postgres pods get stuck in a "pending" state.
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
hippo-ha-pgbouncer-7c467748d-jxxpf 2/2 Running 0 3m42s
hippo-ha-pgbouncer-7c467748d-wdtvq 2/2 Running 0 3m42s
hippo-ha-pgha1-79gr-0 0/5 Pending 0 3m42s
hippo-ha-pgha1-xv2t-0 0/5 Pending 0 3m42s
hippo-ha-repo-host-0 2/2 Running 0 3m42s
pgo-6ccdb8b5b-vzzkp 1/1 Running 0 24m
At this point it is not clear to us why the pods are pending. Let's describe one of them and look at the events to see if we can get more information.
kubectl -n postgres-operator describe pod hippo-ha-pgha1-79gr-0
Name: hippo-ha-pgha1-79gr-0
Namespace: postgres-operator
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 31s (x32 over 5m34s) cluster-autoscaler pod didn't trigger scale-up:
Warning FailedScheduling 13s (x6 over 5m36s) default-scheduler 0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
In the describe events we see that the pod has unbound immediate PersistentVolumeClaims. What does that mean? It means that Kubernetes was not able to meet our storage claim request so it remains unbound. If we examine our dataVolumeClaimSpec we see that we set three specific values:
dataVolumeClaimSpec:
  storageClassName: foo
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi
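The unbound claims themselves can be inspected as well, since each one records the storage class it asked for. The sketch below lists PersistentVolumeClaims that are still Pending. The function names are hypothetical and the output format is just one possible choice.

```shell
# Hypothetical helper: read "name<TAB>phase<TAB>storageClassName" lines on
# stdin and print the claims that are still Pending, with their class.
pending_claims() {
  awk -F '\t' '$2 == "Pending" { print $1 " wants storage class: " $3 }'
}

# Print one line per PVC in a namespace, in the format expected above.
list_claims() {
  kubectl -n "$1" get pvc \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.storageClassName}{"\n"}{end}'
}
```

Running list_claims postgres-operator | pending_claims shows which claims are waiting and on what class. A kubectl describe on one of those claims usually carries events that explain why it has not been bound.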
We review the available storage classes in our Kubernetes provider. In this case we are deploying on GKE, where we have 3 storage classes available to us, and foo is not one of them.
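One way to review the classes from the command line is kubectl get storageclass, and a specific name can be checked the same way. The helper below is a minimal sketch with a made-up name. It only reports whether the class exists in the cluster that kubectl currently points at.

```shell
# Hypothetical helper: report whether a storage class name exists in the
# cluster that kubectl currently points at.
check_storage_class() {
  if kubectl get storageclass "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}
```

Checking the storageClassName from the manifest this way before applying (for example, check_storage_class foo) would have flagged the problem without waiting for the pods to go Pending.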
We delete the failed cluster deployment:
kubectl delete -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com "hippo-ha" deleted
We update the storageClassName in our manifest to a supported storage class and apply it.
dataVolumeClaimSpec:
  storageClassName: standard-rwo
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi
kubectl apply -n postgres-operator -k high-availability
configmap/db-init-sql created
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created
Now we see that all of our pods are up and running.
kubectl -n postgres-operator get pods
NAME READY STATUS RESTARTS AGE
hippo-ha-backup-jstq-c8n67 1/1 Running 0 6s
hippo-ha-pgbouncer-7c467748d-5smt9 2/2 Running 0 31s
hippo-ha-pgbouncer-7c467748d-6vb7t 2/2 Running 0 31s
hippo-ha-pgha1-9s2g-0 5/5 Running 0 32s
hippo-ha-pgha1-drmv-0 5/5 Running 0 32s
hippo-ha-repo-host-0 2/2 Running 0 32s
pgo-6ccdb8b5b-vzzkp 1/1 Running 0 44m
We Did It!
In this blog we identified, diagnosed, and corrected common issues that occur when installing Postgres in Kubernetes. We learned how to use the kubectl describe command to obtain the information we needed to diagnose each issue. The lessons learned here don't just apply to Postgres. These types of issues can happen with any application running in Kubernetes if the manifest is not correct or proper resources have not been allocated. Congratulations! You now have the knowledge you need to solve common installation issues.