Updating 5.0.1 to 5.0.3 issues #2805

Open
@ThommyH

Overview

Updating the operator from 5.0.1 to 5.0.3 causes multiple issues:

  1. Postgres instances fail to start because a Secret and a ConfigMap are missing (the cluster is called db-unicorn):
    MountVolume.SetUp failed for volume "ssh" : [configmap "db-unicorn-ssh-config" not found, secret "db-unicorn-ssh" not found]
    and
    MountVolume.SetUp failed for volume "ssh" : [secret "db-unicorn-ssh" not found, secret "db-unicorn-ssh" not found]

  2. Another cluster, one with replicas, hit errors like
    MountVolume.SetUp failed for volume "pgbackrest-config" : configmap references non-existent config key: pgbackrest_instance.conf
    and therefore also couldn't start.

  3. Rolling back to 5.0.1 is very painful because it leaves pgbouncer broken: the Postgres cluster is already set to require SCRAM, so pgbouncer fails with
    userdb/[email protected]:5432 cannot do SCRAM authentication: wrong password type
    To fix it, one needs to remove the bouncer configuration from the CR and then add it again. This works 90% of the time. The remaining cases can fail with
    pgaudit must be loaded via shared_preload_libraries
    while it tries to drop the permissions for the pgbouncers.

Environment

Please provide the following details:

  • Platform: (Kubernetes, GKE)
  • Platform Version: (1.20.10-gke.2100)
  • PGO Image Tag: (e.g. ubi8-5.0.3-0)
  • Postgres Version: (e.g. 13)
  • Storage: (S3 (minio))

Steps to Reproduce

This seems to be the biggest issue: I cannot reproduce it in our test cluster. It might be related to the number of Postgres instances, since the cluster where we see the issues has 17 Postgres clusters.
We are currently running stable on (https://github.com/CrunchyData/postgres-operator-examples/tree/v5.0.0-alpha.4-0/kustomize/install) and are trying to update to the latest (https://github.com/CrunchyData/postgres-operator-examples/tree/f80100c4055fa09fc257de8db60751c2ba879e1b/kustomize/install).

Prior to updating, we moved the settings under repoHost.dedicated up to repoHost, as instructed in the 5.0.3 release notes (https://access.crunchydata.com/documentation/postgres-operator/v5/releases/5.0.3/). Since Helm didn't update the CRD, we also patched the CRD manually.
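For readers unfamiliar with that change, here is a minimal sketch of the field move, using this cluster's node affinity. The "before" layout is an assumption reconstructed from the description above, not copied from the old manifest. The "after" layout matches the repoHost section of the cluster definition in this issue.

```yaml
# Before (5.0.1 layout, assumed): repo host settings nested under "dedicated"
spec:
  backups:
    pgbackrest:
      repoHost:
        dedicated:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: persistent
                    operator: In
                    values:
                    - "true"
---
# After (5.0.3 layout): the same settings moved directly under "repoHost"
spec:
  backups:
    pgbackrest:
      repoHost:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: persistent
                  operator: In
                  values:
                  - "true"
```

Both fragments are partial manifests and only show the fields that move. The rest of the PostgresCluster spec is unchanged.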

An example cluster definition looks like this:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  labels:
    velero.io/backup-name: pgo
    velero.io/restore-name: pgo
  name: db-unicorn
  namespace: mip-development
spec:
  backups:
    pgbackrest:
      configuration:
      - secret:
          name: postgres-s3-creds
      global:
        log-level-console: info
        log-level-file: info
        repo1-path: /db-unicorn/repo1
        repo1-retention-diff: "15"
        repo1-retention-full: "15"
        repo1-retention-full-type: time
        repo1-s3-bucket: BUCKET
        repo1-s3-host: minio-endpoint.com
        repo1-s3-uri-style: path
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-0
      repoHost:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: persistent
                  operator: In
                  values:
                  - "true"
      repos:
      - name: repo1
        s3:
          bucket: BUCKET
          endpoint: minio-endpoint.com
          region: eu-west-4
        schedules:
          incremental: 5 2 * * *
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-0
  instances:
  - affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: persistent
              operator: In
              values:
              - "true"
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                postgres-operator.crunchydata.com/cluster: db-unicorn
                postgres-operator.crunchydata.com/instance-set: instance1
            topologyKey: kubernetes.io/hostname
          weight: 1
    dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 96Gi
      storageClassName: fast
    metadata:
      labels:
        monitoring: crunchy-postgres
        prometheus: kube-prometheus
    name: instance1
    replicas: 1
    resources:
      limits:
        cpu: 4
        memory: 8Gi
  metadata:
    labels:
      pg-cluster: db-unicorn
      pg-database: db-unicorn
      vendor: crunchydata
  monitoring:
    pgmonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.0.1-0
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          checkpoint_completion_target: 0.9
          default_statistics_target: 100
          effective_cache_size: 6GB
          effective_io_concurrency: 300
          maintenance_work_mem: 512MB
          max_connections: 300
          max_parallel_workers: 4
          max_parallel_workers_per_gather: 2
          max_wal_size: 6GB
          max_worker_processes: 4
          min_wal_size: 1GB
          random_page_cost: 1.1
          shared_buffers: 2GB
          synchronous_commit: false
          wal_buffers: 16MB
          work_mem: 32MB
    leaderLeaseDurationSeconds: 30
    port: 8008
    syncPeriodSeconds: 10
  port: 5432
  postgresVersion: 13
  proxy:
    pgBouncer:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: persistent
                operator: In
                values:
                - "true"
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: db-unicorn
                  postgres-operator.crunchydata.com/role: pgbouncer
              topologyKey: kubernetes.io/hostname
            weight: 1
      config:
        global:
          default_pool_size: "150"
          max_client_conn: "4096"
          max_db_connections: "150"
          pool_mode: session
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-1.15-0
      metadata:
        labels:
          pg-cluster: db-unicorn
          vendor: crunchydata
      port: 5432
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 128Mi
  users:
  - databases:
    - userdb
    name: user

Logs

After updating the Helm chart / PGO version to 5.0.3, the instances restart and fail to boot as mentioned above. Additionally, we see
skipping SSH reconciliation, no repo hosts configured
in the operator logs. The SSH Secret and ConfigMap, which were previously there, have been removed.

Additional Information
