@hors hors commented Sep 5, 2025

K8SPS-517

This commit fixes multiple issues that were preventing MySQL clone operations from completing successfully:

Root Cause: The MySQL driver's read timeout was only 10 seconds, so clone operations were interrupted after about 10 seconds with error 1317 (query execution was interrupted), followed by error 1160 (error writing communication packets).

Changes Made:

  1. Increased Default Timeouts:

    • Read timeout: 10s → 3600s (1 hour) for long-running operations
    • Write timeout: 10s → 3600s (1 hour) to match read timeout
    • Clone timeout: Added 3600s (1 hour) default for clone operations
  2. Enhanced Error Handling:

    • Added specific handling for MySQL error 1317 (query interrupted)
    • Added specific handling for MySQL error 1160 (communication error)
    • Improved error messages with detailed context
  3. Fixed SQL Syntax:

    • Corrected CLONE INSTANCE statement to use proper ?@?:? placeholder format
    • Ensured proper parameter separation for MySQL driver compatibility
  4. Added Clone Validation:

    • Added getCloneStatus() method to verify clone completion
    • Added getCloneStatusDetails() for detailed error reporting
    • Enhanced clone status validation with comprehensive error messages
  5. Updated Configuration:

    • Modified DSN generation to use configurable timeouts
    • Updated DBParams to include CloneTimeoutSeconds parameter
    • Added environment variable support via BOOTSTRAP_CLONE_TIMEOUT
  6. Comprehensive Testing:

    • Added unit tests for all new timeout functionality
    • Added tests for clone status validation and error handling
    • Added tests for SQL statement format and parameter handling
    • Updated existing tests to reflect new timeout defaults

Impact:

  • Clone operations now complete successfully without timeout interruptions
  • Better error reporting for debugging clone issues
  • Configurable timeouts via environment variables
  • Comprehensive test coverage for clone functionality

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XL 500-999 lines label Sep 5, 2025
@hors hors added this to the v1.0.0 milestone Sep 16, 2025
@hors hors marked this pull request as ready for review September 24, 2025 15:02
@JNKPercona

| Test Name | Result | Duration |
|---|---|---|
| version-service | passed | 00:12:54 |
| async-ignore-annotations | passed | 00:06:39 |
| async-global-metadata | passed | 00:13:13 |
| auto-config | passed | 00:23:12 |
| config | passed | 00:17:00 |
| config-router | passed | 00:07:29 |
| demand-backup-minio | passed | 00:19:49 |
| demand-backup-cloud | passed | 00:21:56 |
| async-data-at-rest-encryption | passed | 00:14:47 |
| gr-global-metadata | passed | 00:18:29 |
| gr-data-at-rest-encryption | passed | 00:16:34 |
| gr-demand-backup-minio | passed | 00:13:15 |
| gr-demand-backup-cloud | passed | 00:20:49 |
| gr-demand-backup-haproxy | passed | 00:10:42 |
| gr-finalizer | passed | 00:05:39 |
| gr-haproxy | passed | 00:04:08 |
| gr-ignore-annotations | passed | 00:04:57 |
| gr-init-deploy | passed | 00:09:08 |
| gr-one-pod | passed | 00:05:44 |
| gr-recreate | passed | 00:17:44 |
| gr-scaling | passed | 00:07:43 |
| gr-scheduled-backup | passed | 00:15:43 |
| gr-security-context | passed | 00:09:46 |
| gr-self-healing | passed | 00:24:00 |
| gr-tls-cert-manager | passed | 00:08:55 |
| gr-users | passed | 00:05:23 |
| haproxy | passed | 00:08:46 |
| init-deploy | passed | 00:05:48 |
| limits | passed | 00:08:40 |
| monitoring | passed | 00:14:06 |
| one-pod | passed | 00:06:00 |
| operator-self-healing | passed | 00:11:42 |
| recreate | passed | 00:12:36 |
| scaling | passed | 00:10:26 |
| scheduled-backup | passed | 00:18:38 |
| service-per-pod | passed | 00:06:36 |
| sidecars | passed | 00:04:35 |
| smart-update | passed | 00:09:28 |
| storage | passed | 00:03:56 |
| telemetry | passed | 00:06:25 |
| tls-cert-manager | passed | 00:09:59 |
| users | passed | 00:08:17 |
| pvc-resize | passed | 00:07:13 |
| We run 43 out of 43 | | 08:09:08 |

commit: bbdff07
image: perconalab/percona-server-mysql-operator:PR-1055-bbdff076
