fix(backup): Clear lastErr on repl healthy #20111
Signed-off-by: EtienneBerube <etienne@planetscale.com>
Pull request overview
This PR fixes a vtbackup failure mode where a transient replication health issue during catch-up can “poison” the retry timer and cause vtbackup to abort even after MySQL replication has recovered. The change clears the tracked LastError once replication is healthy again, and adds a regression test that exercises the real 60s timeout behavior under testing/synctest.
Changes:
- Clear the catch-up loop’s recorded `LastError` when replication returns to a healthy state.
- Add a synctest-based unit test to ensure transient replication failures don’t trigger the continuous-failure timeout after recovery.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| go/cmd/vtbackup/cli/vtbackup.go | Clears lastErr when replication status is healthy to prevent false “continuous error” timeouts after recovery. |
| go/cmd/vtbackup/cli/vtbackup_test.go | Adds a regression test simulating a transient unhealthy replication state followed by prolonged healthy-but-not-caught-up status, validating that catch-up completes successfully. |
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #20111 +/- ##
===========================================
- Coverage 69.67% 66.10% -3.57%
===========================================
Files 1614 89 -1525
Lines 216793 14080 -202713
===========================================
- Hits 151044 9308 -141736
+ Misses 65749 4772 -60977
mattlord left a comment
LGTM. Nice work on this, @EtienneBerube ! ❤️
mhamza15 left a comment
Nice! Do you think we can add an e2e test that validates backups work through failovers? I think that would be beneficial.
Backporting since this is a bug that affects all versions.
Description
`vtbackup` can abort during catch-up replication even after its internal mysqld recovers from a transient replication failure.
This can happen during a failover or reparent while a backup is running. `vtbackup` sees replication become unhealthy, restarts replication against the updated source, and MySQL later reports that replication is healthy again. Despite that recovery, `vtbackup` can still abort about 60 seconds after the original error.
Root cause
`vtbackup` tracks replication catch-up errors with a `LastError`.
When replication is unhealthy, `vtbackup` records an error in it. The old code never cleared that recorded error once replication returned to a healthy state.
Because `LastError.ShouldRetry()` only looks at how long the recorded error has been present, a transient replication failure could still be treated as continuously failing. After `timeoutWaitingForReplicationStatus`, `vtbackup` would abort even though replication had already recovered.
Fix
When catch-up replication status is healthy, clear the recorded `LastError` with `lastErr.Record(nil)`.
This preserves the existing timeout behavior for continuous failures, but prevents a recovered transient replication issue from poisoning the rest of the catch-up loop.
How to reproduce
Run `vtbackup` so it restores from an existing backup and starts catch-up replication from the current primary.
While `vtbackup` is catching up, trigger a failover or reparent that temporarily breaks replication. For example:
- `vtbackup` is replicating from the current primary
- `vtbackup` restarts replication
Without the fix, `vtbackup` can still abort roughly 60 seconds after the first replication error because the old `LastError` was never cleared.
Test
Added `TestCatchUpReplicationForBackupClearsLastErrWhenReplicationBecomesHealthy`.
The test:
- simulates a transient `MY-002003` reconnect error
- keeps replication healthy but not caught up for longer than the `LastError` threshold
- uses `testing/synctest` so the test exercises the real timeout behavior without waiting in wall-clock time
- verifies that `vtbackup` completes catch-up successfully
Confirmed that the test fails without the fix: it hits the timeout reported by `LastError.ShouldRetry()` and returns an error. Confirmed that the test passes with the fix.
Also tested in a prod-like environment by killing the primary while `vtbackup` was catching up. The backup succeeded as expected.
Related Issue(s)
Fixes: #20110
AI Disclosure
The test was written by Codex.