Skip to content

fix(backup): Clear lastErr on repl healthy#20111

Merged
mattlord merged 1 commit into
vitessio:mainfrom
EtienneBerube:dev/etienne.berube/clear-last-err-on-healthy
May 20, 2026
Merged

fix(backup): Clear lastErr on repl healthy#20111
mattlord merged 1 commit into
vitessio:mainfrom
EtienneBerube:dev/etienne.berube/clear-last-err-on-healthy

Conversation

@EtienneBerube
Copy link
Copy Markdown
Contributor

@EtienneBerube EtienneBerube commented May 14, 2026

Description

vtbackup can abort during catch-up replication even after its internal mysqld recovers from a transient replication failure.

This can happen during a failover or reparent while a backup is running. vtbackup sees replication become unhealthy, restarts replication against the updated source, and MySQL later reports that replication is healthy again. Despite that recovery, vtbackup can still abort about 60 seconds after the original error.

Root cause

vtbackup tracks replication catch-up errors with a LastError.

When replication is unhealthy, vtbackup records:

replication has stopped before backup could be taken. trying to restart replication.

The old code never cleared that recorded error once replication returned to a healthy state.

Because LastError.ShouldRetry() only looks at how long the recorded error has been present, a transient replication failure could still be treated as continuously failing. After timeoutWaitingForReplicationStatus, vtbackup would abort with:

the same error was encountered continuously since

even though replication had already recovered.

Fix

When catch-up replication status is healthy, clear the recorded LastError with lastErr.Record(nil).

This preserves the existing timeout behavior for continuous failures, but prevents a recovered transient replication issue from poisoning the rest of the catch-up loop.

How to reproduce

Run vtbackup so it restores from an existing backup and starts catch-up replication from the current primary.

While vtbackup is catching up, trigger a failover or reparent that temporarily breaks replication. For example:

  • vtbackup is replicating from the current primary
  • The primary becomes unavailable during a failover
  • MySQL reports an error like:
Replica I/O for channel '': Error reconnecting to source 'vt_repl@<ip>:3306'. This was attempt 1/300, with a delay of 10 seconds between attempts. Message: Can't connect to MySQL server on '<ip>:3306' (111), Error_code: MY-002003
  • vtbackup restarts replication
  • MySQL reports that the replica receiver thread connected to the new source

Without the fix, vtbackup can still abort roughly 60 seconds after the first replication error because the old LastError was never cleared.

Test

Added TestCatchUpReplicationForBackupClearsLastErrWhenReplicationBecomesHealthy.

The test:

  • Simulates an unhealthy replication status with the MySQL MY-002003 reconnect error
  • Returns healthy but not-yet-caught-up replication statuses long enough to cross the real 60 second LastError threshold
  • Uses testing/synctest so the test exercises the real timeout behavior without waiting in wall-clock time
  • Returns a caught-up replication status
  • Confirms vtbackup completes catch-up successfully

Confirmed the test fails without the fix by hitting:

the same error was encountered continuously since

from LastError.ShouldRetry(), then returning:

timeout waiting for replication status after 60 seconds

Confirmed the test passes with the fix.

Also tested in prod-like environment by killing the primary while catching up. The backup succeeded as expected.

Related Issue(s)

Fixes: #20110

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

Test and was written by Codex.

Signed-off-by: EtienneBerube <etienne@planetscale.com>
@EtienneBerube EtienneBerube requested a review from frouioui as a code owner May 14, 2026 21:18
Copilot AI review requested due to automatic review settings May 14, 2026 21:18
@github-actions github-actions Bot added this to the v25.0.0 milestone May 14, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 14, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a vtbackup failure mode where a transient replication health issue during catch-up can “poison” the retry timer and cause vtbackup to abort even after MySQL replication has recovered. The change clears the tracked LastError once replication is healthy again, and adds a regression test that exercises the real 60s timeout behavior under testing/synctest.

Changes:

  • Clear the catch-up loop’s recorded LastError when replication returns to a healthy state.
  • Add a synctest-based unit test to ensure transient replication failures don’t trigger the continuous-failure timeout after recovery.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
go/cmd/vtbackup/cli/vtbackup.go Clears lastErr when replication status is healthy to prevent false “continuous error” timeouts after recovery.
go/cmd/vtbackup/cli/vtbackup_test.go Adds a regression test simulating a transient unhealthy replication state followed by prolonged healthy-but-not-caught-up status, validating that catch-up completes successfully.

@mattlord mattlord requested review from mattlord and mhamza15 May 15, 2026 01:34
@mattlord mattlord added Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels May 15, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.10%. Comparing base (70c7a72) to head (d9c1e62).
⚠️ Report is 259 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20111       +/-   ##
===========================================
- Coverage   69.67%   66.10%    -3.57%     
===========================================
  Files        1614       89     -1525     
  Lines      216793    14080   -202713     
===========================================
- Hits       151044     9308   -141736     
+ Misses      65749     4772    -60977     
Flag Coverage Δ
partial 66.10% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work on this, @EtienneBerube ! ❤️

Copy link
Copy Markdown
Collaborator

@mhamza15 mhamza15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Do you think we can add an e2e test that validates backups work through failovers? I think that would be beneficial.

@EtienneBerube
Copy link
Copy Markdown
Contributor Author

Backporting since this is a bug that affects all versions.

@frouioui frouioui removed the NeedsBackportReason If backport labels have been applied to a PR, a justification is required label May 19, 2026
@frouioui frouioui enabled auto-merge (squash) May 19, 2026 15:17
@frouioui frouioui disabled auto-merge May 19, 2026 15:17
@mattlord mattlord merged commit 28c3212 into vitessio:main May 20, 2026
187 of 201 checks passed
frouioui pushed a commit that referenced this pull request May 20, 2026
…20139)

Signed-off-by: EtienneBerube <etienne@planetscale.com>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 Component: Backup and Restore Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: VtBackup always aborts when encountering replication issue

5 participants