Description
Overview
Updating the operator from 5.0.1 to 5.0.3 causes multiple issues:
- Postgres instances fail to schedule because a secret and a configmap are missing (the cluster is called db-unicorn):
  MountVolume.SetUp failed for volume "ssh" : [configmap "db-unicorn-ssh-config" not found, secret "db-unicorn-ssh" not found]
  and
  MountVolume.SetUp failed for volume "ssh" : [secret "db-unicorn-ssh" not found, secret "db-unicorn-ssh" not found]
- Another cluster, one with replicas, had issues like
  MountVolume.SetUp failed for volume "pgbackrest-config" : configmap references non-existent config key: pgbackrest_instance.conf
  and therefore also could not start.
- Rolling back to 5.0.1 is very painful because it leaves pgBouncer broken: the postgres cluster is already set to require SCRAM, so connections fail with
  userdb/[email protected]:5432 cannot do SCRAM authentication: wrong password type
  To fix this, one has to remove the pgBouncer configuration from the CR and then add it back again (see the sketch after this list). That works about 90% of the time; the rest can fail with pgaudit must be loaded via shared_preload_libraries while the operator tries to drop the permissions for the pgbouncers.
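A minimal sketch of that pgBouncer workaround, using values from the db-unicorn definition further down. The editing workflow (kubectl edit, then re-applying the same block) is our assumption, not an official procedure:

# 1. Remove the whole spec.proxy block from the PostgresCluster and let the
#    operator reconcile, e.g. kubectl edit postgrescluster db-unicorn -n mip-development
# 2. Afterwards, re-add the same block, e.g.:
spec:
  proxy:
    pgBouncer:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-1.15-0
      port: 5432
      replicas: 1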
Environment
- Platform: Kubernetes (GKE)
- Platform Version: 1.20.10-gke.2100
- PGO Image Tag: ubi8-5.0.3-0
- Postgres Version: 13
- Storage: S3 (minio)
Steps to Reproduce
This seems to be the biggest issue: I cannot reproduce it in our test cluster, so it might be related to the number of postgres instances? In the cluster where we see the issues, we have 17 postgres clusters.
We are currently running stable on https://github.com/CrunchyData/postgres-operator-examples/tree/v5.0.0-alpha.4-0/kustomize/install and are trying to update to the latest https://github.com/CrunchyData/postgres-operator-examples/tree/f80100c4055fa09fc257de8db60751c2ba879e1b/kustomize/install.
Prior to updating, we moved repoHost.dedicated to repoHost as instructed in https://access.crunchydata.com/documentation/postgres-operator/v5/releases/5.0.3/. Since Helm didn't update the CRD, we also patched it manually.
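Roughly, and assuming the fields under dedicated simply move up one level (our reading of the release notes), that change looked like this for db-unicorn:

# Before the upgrade (operator 5.0.2 and earlier): repo host settings nested
# under spec.backups.pgbackrest.repoHost.dedicated
spec:
  backups:
    pgbackrest:
      repoHost:
        dedicated:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: persistent
                    operator: In
                    values:
                    - "true"
---
# After the upgrade (operator 5.0.3): the same affinity sits directly under
# repoHost and the dedicated key is dropped, as in the full definition below.
spec:
  backups:
    pgbackrest:
      repoHost:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: persistent
                  operator: In
                  values:
                  - "true"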
An example cluster definition looks like this:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  labels:
    velero.io/backup-name: pgo
    velero.io/restore-name: pgo
  name: db-unicorn
  namespace: mip-development
spec:
  backups:
    pgbackrest:
      configuration:
      - secret:
          name: postgres-s3-creds
      global:
        log-level-console: info
        log-level-file: info
        repo1-path: /db-unicorn/repo1
        repo1-retention-diff: "15"
        repo1-retention-full: "15"
        repo1-retention-full-type: time
        repo1-s3-bucket: BUCKET
        repo1-s3-host: minio-endpoint.com
        repo1-s3-uri-style: path
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-0
      repoHost:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: persistent
                  operator: In
                  values:
                  - "true"
      repos:
      - name: repo1
        s3:
          bucket: BUCKET
          endpoint: minio-endpoint.com
          region: eu-west-4
        schedules:
          incremental: 5 2 * * *
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-0
  instances:
  - affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: persistent
              operator: In
              values:
              - "true"
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                postgres-operator.crunchydata.com/cluster: db-unicorn
                postgres-operator.crunchydata.com/instance-set: instance1
            topologyKey: kubernetes.io/hostname
          weight: 1
    dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 96Gi
      storageClassName: fast
    metadata:
      labels:
        monitoring: crunchy-postgres
        prometheus: kube-prometheus
    name: instance1
    replicas: 1
    resources:
      limits:
        cpu: 4
        memory: 8Gi
  metadata:
    labels:
      pg-cluster: db-unicorn
      pg-database: db-unicorn
      vendor: crunchydata
  monitoring:
    pgmonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.0.1-0
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          checkpoint_completion_target: 0.9
          default_statistics_target: 100
          effective_cache_size: 6GB
          effective_io_concurrency: 300
          maintenance_work_mem: 512MB
          max_connections: 300
          max_parallel_workers: 4
          max_parallel_workers_per_gather: 2
          max_wal_size: 6GB
          max_worker_processes: 4
          min_wal_size: 1GB
          random_page_cost: 1.1
          shared_buffers: 2GB
          synchronous_commit: false
          wal_buffers: 16MB
          work_mem: 32MB
    leaderLeaseDurationSeconds: 30
    port: 8008
    syncPeriodSeconds: 10
  port: 5432
  postgresVersion: 13
  proxy:
    pgBouncer:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: persistent
                operator: In
                values:
                - "true"
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: db-unicorn
                  postgres-operator.crunchydata.com/role: pgbouncer
              topologyKey: kubernetes.io/hostname
            weight: 1
      config:
        global:
          default_pool_size: "150"
          max_client_conn: "4096"
          max_db_connections: "150"
          pool_mode: session
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-1.15-0
      metadata:
        labels:
          pg-cluster: db-unicorn
          vendor: crunchydata
      port: 5432
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 128Mi
  users:
  - databases:
    - userdb
    name: user
Logs
After updating the Helm chart / PGO version to 5.0.3, the instances restart and fail to boot as described above. Additionally, we see
skipping SSH reconciliation, no repo hosts configured
in the operator logs. The SSH keys and the configmap, which were previously there, have been removed.