PVC Resize Logic Includes Orphaned PVCs From Scale-Down Operations Causing Infinite Loops #2181

Description

@arturkasperek

Report

When scaling down a PXC cluster (e.g., from 3 replicas to 1), the PVCs that belonged to the removed nodes remain in the cluster. The operator's PVC resize logic incorrectly includes these orphaned PVCs when calculating storage sizes, causing an infinite resize loop that stalls operator reconciliation.

Impact:

  • Operator gets stuck in infinite "Pod is not updated" loops
  • Storage resize operations fail and hang indefinitely
  • Manual intervention required to delete orphaned PVCs
  • Affects any cluster that has been scaled down

Environment:

  • Operator v1.17.0
  • Reproduced on production EKS clusters
  • Affects storage operations after scale-down events

More about the problem

Root Cause Analysis

The bug is located in pkg/controller/pxc/volumes.go in the reconcilePersistentVolumes function. There's a logic inconsistency between PVC filtering and size calculation:

Lines 71-83 CORRECTLY filter PVCs to only the active ones:

pvcsToUpdate := make([]string, 0, len(pvcList.Items))
for _, pvc := range pvcList.Items {
    podName := strings.SplitN(pvc.Name, "-", 2)[1]
    if !slices.Contains(podNames, podName) {
        continue  // ✅ Skips orphaned PVCs
    }
    pvcsToUpdate = append(pvcsToUpdate, pvc.Name)
}
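
For concreteness, here is a minimal, self-contained illustration (not operator code) of how that check classifies the PVCs left behind by the scale-down described in this report; the cluster and PVC names are the hypothetical ones used throughout this issue:

package main

import (
    "fmt"
    "slices"
    "strings"
)

func main() {
    // After scaling from 3 replicas to 1, only pod -0 still exists.
    podNames := []string{"test-cluster-pxc-0"}
    pvcNames := []string{
        "datadir-test-cluster-pxc-0",
        "datadir-test-cluster-pxc-1", // orphaned
        "datadir-test-cluster-pxc-2", // orphaned
    }

    for _, pvcName := range pvcNames {
        // Same derivation as the filter above: strip the "datadir" prefix.
        podName := strings.SplitN(pvcName, "-", 2)[1]
        fmt.Printf("%s -> pod %s, active: %v\n", pvcName, podName, slices.Contains(podNames, podName))
    }
}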

Lines 89-104 INCORRECTLY include orphaned PVCs in the size calculation:

var actual resource.Quantity
for _, pvc := range pvcList.Items {  // ❌ BUG: Uses ALL PVCs
    // ... finds smallest size among ALL PVCs including orphaned ones
    if actual.IsZero() || pvc.Status.Capacity.Storage().Cmp(actual) < 0 {
        actual = *pvc.Status.Capacity.Storage()
    }
}
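
A minimal sketch of one possible fix (untested, assuming the surrounding variables from the snippets above, with corev1 = k8s.io/api/core/v1 and resource = k8s.io/apimachinery/pkg/api/resource): collect the filtered PVC objects in the first loop and compute the minimum capacity only over that set, so orphaned PVCs never influence the "actual" size:

// Sketch: keep the filtered PVC objects alongside their names so the size
// calculation operates on exactly the set that will actually be resized.
activePVCs := make([]corev1.PersistentVolumeClaim, 0, len(pvcList.Items))
pvcsToUpdate := make([]string, 0, len(pvcList.Items))
for _, pvc := range pvcList.Items {
    podName := strings.SplitN(pvc.Name, "-", 2)[1]
    if !slices.Contains(podNames, podName) {
        continue // orphaned PVC from a scaled-down pod: ignore it entirely
    }
    pvcsToUpdate = append(pvcsToUpdate, pvc.Name)
    activePVCs = append(activePVCs, pvc)
}

var actual resource.Quantity
for _, pvc := range activePVCs { // only active PVCs feed the "actual" size
    if actual.IsZero() || pvc.Status.Capacity.Storage().Cmp(actual) < 0 {
        actual = *pvc.Status.Capacity.Storage()
    }
}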

What Happens

  1. Scale-down leaves orphaned PVCs (StatefulSet doesn't auto-delete them)
  2. Resize operation processes only active PVCs in pvcsToUpdate array
  3. Size calculation incorrectly includes orphaned PVCs when finding "actual" size
  4. Infinite loop occurs because orphaned PVCs have different sizes than active ones
  5. Operator hangs with repeated "Resizing PVCs" messages

Real-World Scenario

  • 3-node cluster with 80Gi PVCs → Scale to 1 node → Resize to 160Gi
  • Orphaned PVCs remain: datadir-cluster-pxc-1, datadir-cluster-pxc-2
  • Next storage operation sees conflicting sizes and loops infinitely (illustrated in the sketch below)
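
A stand-alone sketch of why the loop never terminates, using the hypothetical 80Gi/160Gi sizes from the scenario above: the orphaned PVCs keep the computed "actual" size pinned below the requested size, so every reconcile concludes that a resize is still pending (quantity comparison via k8s.io/apimachinery/pkg/api/resource):

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    requested := resource.MustParse("160Gi")

    // The active PVC has already been resized; the orphaned ones never will be.
    activeSize := resource.MustParse("160Gi")
    orphanSize := resource.MustParse("80Gi")

    // Mimic the buggy "smallest capacity across ALL PVCs" calculation.
    actual := activeSize
    if orphanSize.Cmp(actual) < 0 {
        actual = orphanSize
    }

    // actual (80Gi) can never reach requested (160Gi), so the operator
    // schedules another resize on every reconcile.
    fmt.Println("resize still pending:", actual.Cmp(requested) < 0)
}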

Steps to reproduce

  1. Create a 3-node PXC cluster:

    apiVersion: pxc.percona.com/v1
    kind: PerconaXtraDBCluster
    metadata:
      name: test-cluster
    spec:
      pxc:
        size: 3
        volumeSpec:
          persistentVolumeClaim:
            resources:
              requests:
                storage: 80Gi
  2. Wait for cluster to be ready, then scale down to 1 node:

    spec:
      pxc:
        size: 1  # Scale down from 3 to 1
  3. Verify orphaned PVCs remain after scale-down:

    kubectl get pvc -l app.kubernetes.io/component=pxc
    # Should show: datadir-test-cluster-pxc-0, datadir-test-cluster-pxc-1, datadir-test-cluster-pxc-2
  4. Attempt to resize storage to trigger the bug:

    spec:
      pxc:
        volumeSpec:
          persistentVolumeClaim:
            resources:
              requests:
                storage: 160Gi  # Any size change triggers the bug
  5. Observe infinite resize loops in operator logs:

    kubectl logs -l app.kubernetes.io/name=percona-xtradb-cluster-operator -f
    # Expected: Endless "PVCResize Resizing PVCs" messages

Versions

  • Operator Version: v1.17.0
  • Kubernetes Version: 1.30 (reproduced on EKS)
  • Go Version: 1.21
  • Platform: Linux/amd64

Anything else?

No response
