@hors hors commented Sep 23, 2025

K8SPG-859

CHANGE DESCRIPTION

Problem:
This PR implements a hibernation feature for Percona Server MySQL clusters that automatically pauses and unpauses them on cron schedules. It targets development environments, test clusters, and any other case where clusters should stop during off-hours to save resources.

🎯 Key Features

Core Hibernation Functionality

  • Automatic Pause/Unpause: Schedule-based hibernation using cron expressions
  • Manual Override: Manual pause/unpause via spec.pause field
  • State Synchronization: Hibernation state automatically syncs with cluster state
  • Health Checks: Only allows hibernation when cluster is in Ready state
  • Backup/Restore Awareness: Prevents hibernation during active backups or restores
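The health and backup/restore checks above amount to a gate in front of every scheduled pause. A minimal sketch of such a gate follows; the type `ClusterState`, its constant values, and the function `canHibernate` are hypothetical names chosen for illustration and are not taken from the PR's code.

```go
package main

import "fmt"

// ClusterState is a hypothetical stand-in for the operator's cluster state type.
type ClusterState string

const (
	StateInitializing ClusterState = "initializing"
	StateReady        ClusterState = "ready"
	StatePaused       ClusterState = "paused"
	StateStopping     ClusterState = "stopping"
	StateError        ClusterState = "error"
)

// canHibernate reports whether a scheduled pause may proceed.
// When it returns false, the second value explains why.
func canHibernate(state ClusterState, activeBackups, activeRestores int) (bool, string) {
	switch {
	case state != StateReady:
		return false, fmt.Sprintf("cluster state is %q, not ready", state)
	case activeBackups > 0:
		return false, "backup in progress"
	case activeRestores > 0:
		return false, "restore in progress"
	}
	return true, ""
}

func main() {
	ok, reason := canHibernate(StateReady, 1, 0)
	fmt.Println(ok, reason)
}
```

In a real reconciler the backup and restore counts would presumably come from listing the corresponding custom resources, which is why the controller needs RBAC access to them.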

Smart Scheduling Logic

  • Next Window Scheduling: If the cluster is unhealthy at the scheduled time, the pause is automatically rescheduled for the next window
  • Schedule Change Detection: Automatically updates next pause/unpause times when schedules change
  • First-time Evaluation: Handles initial hibernation setup correctly
  • Proactive Scheduling: Prevents immediate pausing when cluster becomes ready after being unready

Robust Error Handling

  • Invalid Schedule Handling: Gracefully handles invalid cron expressions
  • Cluster State Management: Proper handling of Initializing, Error, Stopping, Paused, and Ready states
  • Race Condition Prevention: Prevents state flipping during cluster startup/recovery
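Graceful handling of invalid cron expressions implies validating the schedule before acting on it. The sketch below is a deliberately small, hand-written check for five-field numeric expressions; `validateCron` and `bounds` are hypothetical names, and the PR most likely delegates validation to a cron library rather than anything like this. Month and weekday names, as well as macros such as `@daily`, are not supported here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// bounds holds the inclusive numeric range of each field:
// minute, hour, day of month, month, day of week.
// Some cron implementations also accept 7 for Sunday; this sketch does not.
var bounds = [5][2]int{{0, 59}, {0, 23}, {1, 31}, {1, 12}, {0, 6}}

// validateCron performs a basic syntax and range check on a five-field
// cron expression made of "*", numbers, ranges, lists, and "/step" suffixes.
func validateCron(expr string) error {
	fields := strings.Fields(expr)
	if len(fields) != 5 {
		return fmt.Errorf("expected 5 fields, got %d", len(fields))
	}
	for i, f := range fields {
		for _, part := range strings.Split(f, ",") {
			base, step, hasStep := strings.Cut(part, "/")
			if hasStep {
				if n, err := strconv.Atoi(step); err != nil || n < 1 {
					return fmt.Errorf("field %d: bad step %q", i+1, step)
				}
			}
			if base == "*" {
				continue
			}
			lo, hi, isRange := strings.Cut(base, "-")
			if !isRange {
				hi = lo
			}
			a, errA := strconv.Atoi(lo)
			b, errB := strconv.Atoi(hi)
			if errA != nil || errB != nil || a < bounds[i][0] || b > bounds[i][1] || a > b {
				return fmt.Errorf("field %d: bad value %q", i+1, base)
			}
		}
	}
	return nil
}

func main() {
	for _, expr := range []string{"0 18 * * 1-5", "0 25 * * *", "0 18 * *"} {
		fmt.Printf("%q: %v\n", expr, validateCron(expr))
	}
}
```

On a validation error, a controller could record the message in the status reason field and leave the cluster untouched rather than failing the reconcile loop.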

🏗️ Architecture

New Controller: PerconaServerMySQLHibernationReconciler

  • Dedicated controller for hibernation logic
  • Registered in cmd/manager/main.go
  • RBAC permissions for PS objects and backup/restore resources

Enhanced CRD Fields

spec:
  hibernation:
    enabled: true
    schedule:
      pause: "0 18 * * 1-5"    # 6 PM Mon-Fri
      unpause: "0 8 * * 1-5"   # 8 AM Mon-Fri
  pause: false  # Manual override
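For readers who think in Go types, the spec fragment above could map onto structs roughly like the following. The type names (`ClusterSpec`, `HibernationSpec`, `HibernationSchedule`) and the helper `parseSpec` are assumptions for illustration; only the JSON field names come from the YAML shown in this description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HibernationSchedule holds the two cron expressions from spec.hibernation.schedule.
type HibernationSchedule struct {
	Pause   string `json:"pause,omitempty"`
	Unpause string `json:"unpause,omitempty"`
}

// HibernationSpec mirrors spec.hibernation.
type HibernationSpec struct {
	Enabled  bool                `json:"enabled"`
	Schedule HibernationSchedule `json:"schedule"`
}

// ClusterSpec is a hypothetical, trimmed-down view of the cluster spec that
// contains only the fields relevant to hibernation.
type ClusterSpec struct {
	Hibernation *HibernationSpec `json:"hibernation,omitempty"`
	Pause       bool             `json:"pause,omitempty"` // manual override
}

// parseSpec decodes the JSON form of the spec fragment.
func parseSpec(data []byte) (ClusterSpec, error) {
	var spec ClusterSpec
	err := json.Unmarshal(data, &spec)
	return spec, err
}

func main() {
	in := `{"hibernation":{"enabled":true,"schedule":{"pause":"0 18 * * 1-5","unpause":"0 8 * * 1-5"}},"pause":false}`
	spec, err := parseSpec([]byte(in))
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.Hibernation.Enabled, spec.Hibernation.Schedule.Pause, spec.Pause)
}
```

Making `Hibernation` a pointer lets the controller distinguish a cluster with no hibernation block at all from one where it is present but disabled.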

Status Fields

status:
  hibernation:
    state: "Active"  # Active, Paused, Scheduled, Blocked, Disabled
    nextPauseTime: "2025-09-24T18:00:00Z"
    nextUnpauseTime: "2025-09-25T08:00:00Z"
    lastPauseTime: "2025-09-23T18:00:00Z"
    lastUnpauseTime: "2025-09-24T08:00:00Z"
    reason: "Cluster not ready during scheduled time"

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label Sep 23, 2025
@hors hors changed the title from "POC: Percona Server MySQL Hibernation Feature" to "K8SPG-859 POC: Percona Server MySQL Hibernation Feature" Sep 23, 2025
@hors hors changed the title from "K8SPG-859 POC: Percona Server MySQL Hibernation Feature" to "K8SPG-859 [POC] Percona Server MySQL Hibernation Feature" Sep 23, 2025
@JNKPercona

Test Name Result Duration
version-service passed 00:00:00
async-ignore-annotations passed 00:00:00
async-global-metadata passed 00:00:00
auto-config passed 00:00:00
config passed 00:00:00
config-router passed 00:00:00
demand-backup-minio passed 00:00:00
demand-backup-cloud passed 00:00:00
async-data-at-rest-encryption passed 00:00:00
gr-global-metadata failure 00:15:24
gr-data-at-rest-encryption failure 00:17:33
gr-demand-backup-minio failure 00:13:37
gr-demand-backup-cloud failure 00:13:06
gr-demand-backup-haproxy passed 00:00:00
gr-finalizer passed 00:00:00
gr-haproxy passed 00:00:00
gr-ignore-annotations passed 00:00:00
gr-init-deploy passed 00:00:00
gr-one-pod failure 00:10:08
gr-recreate failure 00:06:22
gr-scaling passed 00:00:00
gr-scheduled-backup passed 00:00:00
gr-security-context passed 00:00:00
gr-self-healing passed 00:00:00
gr-tls-cert-manager passed 00:00:00
gr-users passed 00:00:00
haproxy passed 00:00:00
init-deploy passed 00:00:00
limits passed 00:00:00
monitoring passed 00:00:00
one-pod passed 00:00:00
operator-self-healing passed 00:00:00
recreate passed 00:00:00
scaling passed 00:00:00
scheduled-backup passed 00:00:00
service-per-pod passed 00:00:00
sidecars passed 00:00:00
smart-update passed 00:00:00
storage passed 00:00:00
telemetry passed 00:00:00
tls-cert-manager passed 00:00:00
users passed 00:00:00
pvc-resize passed 00:00:00
We run 43 out of 43 01:16:12

commit: 78dcbca
image: perconalab/percona-server-mysql-operator:PR-1092-78dcbca6
