
garbd fails to connect to PXC cluster during backup job #2105

@lev-stas

Description


Report

The backup job fails with the following error:

2025-06-27 13:22:33.566 ERROR: failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout)
        at ../../../../percona-xtradb-cluster-galera/gcomm/src/pc.cpp:connect():176
2025-06-27 13:22:33.566 ERROR: ../../../../percona-xtradb-cluster-galera/gcs/src/gcs_core.cpp:gcs_core_open():256: Failed to open backend connection: -110 (Connection timed out)
2025-06-27 13:22:34.566  INFO: gcomm: terminating thread
2025-06-27 13:22:34.566  INFO: gcomm: joining thread
2025-06-27 13:22:34.566 ERROR: ../../../../percona-xtradb-cluster-galera/gcs/src/gcs.cpp:gcs_open():1952: Failed to open channel 'mysql-cluster-pxc' at 'gcomm://mysql-cluster-pxc-4.mysql-cluster-pxc?gmcast.listen_addr=tcp://0.0.0.0:4567': -110 (Connection timed out)
2025-06-27 13:22:34.566  INFO: Shifting CLOSED -> DESTROYED (TO: 0)
2025-06-27 13:22:34.567 FATAL: Garbd exiting with error: Failed to open connection to group
        at ../../../percona-xtradb-cluster-galera/garb/garb_gcs.cpp:Gcs():35
+ grep 'Will never receive state. Need to abort' /tmp/garbd.log
+ grep 'Donor is no longer in the cluster, interrupting script' /tmp/garbd.log
+ grep 'failed: Invalid argument' /tmp/garbd.log
+ '[' -f /tmp/backup-is-completed ']'
+ log ERROR 'Backup was finished unsuccessful'
+ exit 1
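
The backup job starts garbd, which has to join the cluster's group communication layer on port 4567 and here times out waiting for the primary component (pc.wait_prim_timeout). For reference, the same join attempt can be reproduced manually from a debug pod in the namespace. This is only a rough sketch: the address and group name are copied from the garbd log above, while the explicit pc.wait_prim_timeout override and the log path are illustrative.

# Reproduce the join attempt the backup job makes; address and group name
# are taken from the garbd log above, the longer timeout is illustrative.
garbd \
  --address "gcomm://mysql-cluster-pxc-4.mysql-cluster-pxc?gmcast.listen_addr=tcp://0.0.0.0:4567" \
  --group mysql-cluster-pxc \
  --options "pc.wait_prim_timeout=PT60S" \
  --log /tmp/garbd-manual.log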

while the cluster is in a healthy and ready state:

  kubectl get pxc -n mysql-main    
NAME            ENDPOINT         STATUS   PXC   PROXYSQL   HAPROXY   AGE
mysql-cluster   192.168.24.206   ready    5                3         3d19h

More about the problem

I have checked the cluster state and it looks healthy:

MySQL [(none)]> SELECT 
    ->   VARIABLE_NAME, VARIABLE_VALUE 
    -> FROM 
    ->   performance_schema.global_status 
    -> WHERE 
    ->   VARIABLE_NAME IN (
    ->     'wsrep_cluster_status', 
    ->     'wsrep_local_state_comment', 
    ->     'wsrep_ready', 
    ->     'wsrep_connected', 
    ->     'wsrep_cluster_size'
    ->   );
+---------------------------+----------------+
| VARIABLE_NAME             | VARIABLE_VALUE |
+---------------------------+----------------+
| wsrep_cluster_size        | 5              |
| wsrep_cluster_status      | Primary        |
| wsrep_connected           | ON             |
| wsrep_local_state_comment | Synced         |
| wsrep_ready               | ON             |
+---------------------------+----------------+
5 rows in set (0.003 sec)

MySQL [(none)]>
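
The same status can be checked on every member, not only the node I connected to. A sketch, assuming the operator's default pxc container name and a ROOT_PASSWORD variable holding the root credentials:

# Check wsrep_cluster_status on all five members (container name and
# credential handling are assumptions based on the operator defaults).
for i in 0 1 2 3 4; do
  kubectl exec -n mysql-main mysql-cluster-pxc-$i -c pxc -- \
    mysql -uroot -p"$ROOT_PASSWORD" -Nse \
      "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"
done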

To rule out network issues, I ran a debug pod in the same namespace and performed the following checks.

The pod name can be resolved to an IP:

net-debug:~# nslookup mysql-cluster-pxc-4.mysql-cluster-pxc
;; Got recursion not available from 10.43.96.3
Server:         10.43.96.3
Address:        10.43.96.3#53

Name:   mysql-cluster-pxc-4.mysql-cluster-pxc.mysql-main.svc.cluster.local
Address: 10.42.23.197
;; Got recursion not available from 10.43.96.3

net-debug:~# 

The pod exposes the target port:

net-debug:~# nc -zv mysql-cluster-pxc-4.mysql-cluster-pxc 4567
Connection to mysql-cluster-pxc-4.mysql-cluster-pxc (10.42.23.197) 4567 port [tcp/*] succeeded!
net-debug:~# 
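
Galera uses 4567 for group communication, 4568 for IST, and 4444 for SST. garbd itself only needs 4567, but the other ports can be checked the same way from the debug pod:

# Check all Galera-related ports on the target pod.
for port in 4567 4568 4444; do
  nc -zv mysql-cluster-pxc-4.mysql-cluster-pxc $port
done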

There are no issues on the PXC node side:

2025-06-27T13:39:48.312124Z 31341 [Note] [MY-000000] [Galera] after_statement: success(31341,exec,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312160Z 31341 [Note] [MY-000000] [Galera] after_statement: enter(31341,exec,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312188Z 31341 [Note] [MY-000000] [Galera] after_statement_enter
    server: 56ff5a7b-5331-11f0-b9f9-4ae7d6ef3935, client: 31341, state: exec, mode: local
    trx_id: -1, seqno: -1, flags: 0
    state: aborted, bfa_state: executing, error: success, status: 0
    is_sr: 0, frags: 0, frags size: 0, unit: 0, size: 0, counter: 0, log_pos: 0, sr_rb: 0
    own: 1 thread_id: 7f7f010ee640
2025-06-27T13:39:48.312213Z 31341 [Note] [MY-000000] [Galera] cleanup_enter
    server: 56ff5a7b-5331-11f0-b9f9-4ae7d6ef3935, client: 31341, state: exec, mode: local
    trx_id: -1, seqno: -1, flags: 0
    state: aborted, bfa_state: executing, error: success, status: 0
    is_sr: 0, frags: 0, frags size: 0, unit: 0, size: 0, counter: 0, log_pos: 0, sr_rb: 0
    own: 1 thread_id: 7f7f010ee640
2025-06-27T13:39:48.312239Z 31341 [Note] [MY-000000] [Galera] cleanup_leave
    server: 56ff5a7b-5331-11f0-b9f9-4ae7d6ef3935, client: 31341, state: exec, mode: local
    trx_id: -1, seqno: -1, flags: 0
    state: aborted, bfa_state: executing, error: success, status: 0
    is_sr: 0, frags: 0, frags size: 0, unit: 0, size: 0, counter: 0, log_pos: 0, sr_rb: 0
    own: 1 thread_id: 7f7f010ee640
2025-06-27T13:39:48.312265Z 31341 [Note] [MY-000000] [Galera] after_statement_leave
    server: 56ff5a7b-5331-11f0-b9f9-4ae7d6ef3935, client: 31341, state: exec, mode: local
    trx_id: -1, seqno: -1, flags: 0
    state: aborted, bfa_state: executing, error: success, status: 0
    is_sr: 0, frags: 0, frags size: 0, unit: 0, size: 0, counter: 0, log_pos: 0, sr_rb: 0
    own: 1 thread_id: 7f7f010ee640
2025-06-27T13:39:48.312284Z 31341 [Note] [MY-000000] [Galera] after_statement: success(31341,exec,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312306Z 31341 [Note] [MY-000000] [Galera] after_command_before_result: enter(31341,exec,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312324Z 31341 [Note] [MY-000000] [Galera] after_command_before_result: leave(31341,result,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312433Z 31341 [Note] [MY-000000] [Galera] after_command_after_result_enter(31341,result,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312461Z 31341 [Note] [MY-000000] [Galera] after_command_after_result: leave(31341,idle,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312920Z 31341 [Note] [MY-000000] [Galera] before_command: enter(31341,idle,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312945Z 31341 [Note] [MY-000000] [Galera] before_command: success(31341,exec,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312968Z 31341 [Note] [MY-000000] [Galera] after_command_before_result: enter(31341,exec,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.312987Z 31341 [Note] [MY-000000] [Galera] after_command_before_result: leave(31341,result,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.313011Z 31341 [Note] [MY-000000] [Galera] after_command_after_result_enter(31341,result,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.313033Z 31341 [Note] [MY-000000] [Galera] after_command_after_result: leave(31341,idle,local,success,0,toi: -1,nbo: -1)
2025-06-27T13:39:48.313052Z 31341 [Note] [MY-000000] [Galera] close: enter(31341,idle,local,success,0,toi: -1,nbo: -1)
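
To see whether the garbd join attempt leaves any trace on the node it tries to contact, that node's log can be filtered around the failure time. A sketch; the pxc container name is the operator default and the grep patterns are only a starting point:

# Look for the garbd join attempt (connections, handshakes, view changes)
# on the node garbd tried to reach.
kubectl logs -n mysql-main mysql-cluster-pxc-4 -c pxc --since=1h \
  | grep -Ei 'garb|connection established|handshake|declaring|suspecting'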

Steps to reproduce

  1. Deploy the custom resource from the release cr.yaml manifest:
...
backup:
#    allowParallel: true
    image: percona/percona-xtradb-cluster-operator:1.17.0-pxc8.0-backup-pxb8.0.35
    backoffLimit: 3
#    activeDeadlineSeconds: 3600
#    startingDeadlineSeconds: 300
#    suspendedDeadlineSeconds: 1200
    serviceAccountName: percona-xtradb-cluster-operator
#    imagePullSecrets:
#      - name: private-registry-credentials
   
    storages:
      minio:
        type: s3
        verifyTLS: true
        s3:
          bucket: percona-operator
          region: us-east-1
          endpointUrl: https://minio.mydomain.net
          credentialsSecret: mysql-cluster-s3-credentials
        resources:
          requests:
            memory: 1G
            cpu: 600m
...
  2. Deploy the backup.yaml manifest (commands for watching the backup follow after the manifest):
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  namespace: mysql-main
  finalizers:
    - percona.com/delete-backup
  name: test-backup
spec:
  pxcCluster: mysql-cluster
  storageName: minio
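
After applying the manifest, the backup object and the job it creates can be watched. A sketch, assuming the pxc-backup short name is registered and that the operator names the job xb-<backup-name>:

kubectl apply -f backup.yaml
kubectl get pxc-backup -n mysql-main -w
# Job name is an assumption based on the usual xb-<backup-name> convention:
kubectl logs -n mysql-main job/xb-test-backup -f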

Versions

  1. Kubernetes - v1.24.17
  2. Operator - 1.17.0
  3. Database - 8.0.41-32.1

Anything else?

No response
