Skip to content

Make PRS & ERS new primary selection MySQL version aware#20211

Open
ejortegau wants to merge 13 commits into
vitessio:mainfrom
ejortegau:ejortegau/mysql-version-aware-reparent
Open

Make PRS & ERS new primary selection MySQL version aware#20211
ejortegau wants to merge 13 commits into
vitessio:mainfrom
ejortegau:ejortegau/mysql-version-aware-reparent

Conversation

@ejortegau

@ejortegau ejortegau commented May 29, 2026

Copy link
Copy Markdown
Contributor

Description

This PR makes PlannedReparentShard (PRS) and EmergencyReparentShard (ERS) aware of MySQL server versions when selecting a new primary.

MySQL replication requires that replicas run at the same or higher version than the primary. During rolling MySQL upgrades, a reparent that promotes a newer-version tablet can break replication for tablets still on the older version. This change adds MySQL version as a factor in candidate election so that a lower-version tablet is preferred as the new primary.

How it works:

  • A new server_version field is added to the replicationdata.Status proto, populated by ReplicationStatus and StopReplicationAndGetStatus RPCs.
  • PRS candidate sort order: promotion rules > MySQL release (major.minor) > GTID position > InnoDB buffer pool size > tablet alias.
  • ERS candidate sort order: GTID position > promotion rules > MySQL release (major.minor) > InnoDB buffer pool size > tablet alias.
  • Only major.minor differences matter — patch version differences within the same release are ignored since they don't affect replication compatibility.
  • In ERS, if the most-advanced tablet (intermediate source) is on a newer release, a lower-version candidate is still preferred as the final primary after catch-up.

Backport justification: This is a correctness fix for MySQL replication safety. Without version-aware election, a reparent during a rolling MySQL upgrade can promote a newer-version primary, causing older-version replicas to break replication. MySQL only guarantees forward replication compatibility (old primary → new replica), not backward (new primary → old replica). This affects any operator performing rolling MySQL upgrades on Vitess clusters.

Related Issue(s)

Fixes #20210

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No migrations or flag changes required. The version-aware election is automatic once tablets report server_version in their replication status. Tablets on older Vitess versions that don't populate the field are treated as "unknown version" and sorted last, preserving existing behavior.

Behavior change during rolling MySQL upgrades: PRS now prefers lower-version candidates over higher-version ones, even when the lower-version tablet is slightly behind in replication position. This means PRS may elect a tablet that requires catch-up to the old primary's demotion position, which can increase reparent latency. Operators performing rolling MySQL upgrades should ensure --wait-replicas-timeout is generous enough to accommodate this catch-up time.

It is also worth mentioning that because this is done at the candidate-sorting level, a PRS that does not specify --allow-cross-cell-promotion could still promote a higher-version tablet in the same cell as the current primary instead of a lower-version tablet in a different cell. The cell boundary remains a hard filter — operators who want cross-cell promotion during upgrades can opt in with --allow-cross-cell-promotion.

AI Disclosure

This PR was co-authored with Claude Code, which helped with implementation, testing, and the version-comparison refinement.

ejortegau and others added 3 commits May 28, 2026 11:56
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Patch version differences within the same MySQL release do not affect
replication compatibility, so the version-aware candidate election now
ignores them. Only major.minor boundaries trigger the preference for a
lower-version primary.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 09:20
@github-actions github-actions Bot added this to the v25.0.0 milestone May 29, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 29, 2026
@vitess-bot

vitess-bot Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR teaches the reparent (PRS/ERS) selection logic to prefer lower MySQL versions when choosing a new primary, in order to preserve replication compatibility (replicas must be at the same or higher version than the primary). The tablet's MySQL version is now propagated through the replicationdata.Status proto and incorporated as a tie-breaker after replication position and promotion rules.

Changes:

  • Adds server_version to replicationdata.Status and populates it from MysqlDaemon.GetVersionString in ReplicationStatus/StopReplicationAndGetStatus.
  • Threads a per-tablet MySQL ServerVersion map through sortTabletsForReparent, findMostAdvanced, findCandidate, and related helpers, applying lowest-release-first ordering after promotion rules.
  • Exports AtLeast/IsSameRelease and adds ReleaseAtLeast on ServerVersion, updating existing callers.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
proto/replicationdata.proto Adds server_version field (tag 30) to Status.
go/vt/vttablet/tabletmanager/rpc_replication.go Populates ServerVersion on status responses.
go/vt/mysqlctl/version.go Exports AtLeast/IsSameRelease; adds ReleaseAtLeast.
go/vt/mysqlctl/capabilityset.go Updates callers to exported method names.
go/vt/mysqlctl/backupengine.go Updates callers to exported method names.
go/vt/vtctl/reparentutil/reparent_sorter.go Adds MySQL-version tie-breaker; new length validation; defines unknownVersion.
go/vt/vtctl/reparentutil/util.go Propagates server version through electing logic and findCandidate.
go/vt/vtctl/reparentutil/replication.go Collects MySQL versions into replicationSnapshot.
go/vt/vtctl/reparentutil/emergency_reparenter.go Plumbs versionMap into ERS candidate selection.
go/vt/vtctl/reparentutil/reparent_sorter_test.go New tests for version-aware sorting and length validation.
go/vt/vtctl/reparentutil/util_test.go New tests for version-aware election and findCandidate.
go/vt/vtctl/reparentutil/emergency_reparenter_test.go New tests for findMostAdvanced version preference.
Files not reviewed (2)
  • go/vt/proto/replicationdata/replicationdata.pb.go: Language not supported
  • go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Language not supported

Comment thread go/vt/vttablet/tabletmanager/rpc_replication.go Outdated
Comment on lines +356 to +364
if len(versionMap) == 0 {
// No version data — fall back to preferring the intermediate source to avoid catch-up.
for _, candidate := range possibleCandidates {
if topoproto.TabletAliasEqual(intermediateSource.Alias, candidate.Alias) {
return candidate
}
}
return possibleCandidates[0]
}
Comment thread go/vt/vtctl/reparentutil/util.go Outdated
Comment on lines 397 to 404
// If best is on the same release as the intermediate source, prefer intermediate source to avoid catch-up.
if sourceVersion.IsSameRelease(bestVersion) {
for _, candidate := range possibleCandidates {
if topoproto.TabletAliasEqual(intermediateSource.Alias, candidate.Alias) {
return candidate
}
}
}
Comment on lines +1132 to +1134
after := replication.ReplicationStatusToProto(rsAfter)
after.BackupRunning = tm.IsBackupRunning()
after.ServerVersion = before.ServerVersion

// reparentSorter sorts tablets by GTID positions and Promotion rules aimed at finding the best
// candidate for intermediate promotion in emergency reparent shard, and the new primary in planned reparent shard
var unknownVersion = mysqlctl.ServerVersion{Major: math.MaxInt, Minor: math.MaxInt, Patch: math.MaxInt}
@@ -181,25 +193,25 @@ func findTabletPositionLagBackupStatus(ctx context.Context, tablet *topodatapb.T
sqlErr, isSQLErr := sqlerror.NewSQLErrorFromError(err).(*sqlerror.SQLError)
if isSQLErr && sqlErr != nil && sqlErr.Number() == sqlerror.ERNotReplica {
logger.Warningf("no replication statue from %v, using empty gtid set", topoproto.TabletAliasString(tablet.Alias))
@promptless

promptless Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Promptless prepared documentation updates related to this change.

Triggered by PR #20211

This PR introduces MySQL version-aware primary selection in PRS and ERS. The documentation updates explain the new candidate election criteria (GTID position → promotion rules → MySQL version → InnoDB buffer pool size → tablet alias), the replication compatibility rationale, and guidance for rolling MySQL upgrades.

Documentation suggestions:

  1. vitessio/website - Updated reparenting and upgrade user guides
    Review: MySQL version-aware reparent docs

  2. vitessio/vitess - Added changelog entry for v25.0.0
    Review: PRS/ERS MySQL version-aware changelog

@timvaillancourt timvaillancourt added Component: VTOrc Vitess Orchestrator integration Component: vtctl Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 29, 2026
@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.84507% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.90%. Comparing base (70c7a72) to head (5f417e2).
⚠️ Report is 288 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/vttablet/tabletmanager/rpc_replication.go 66.66% 4 Missing ⚠️
go/vt/vtctl/reparentutil/reparent_sorter.go 94.73% 3 Missing ⚠️
go/vt/vtctl/reparentutil/util.go 93.61% 3 Missing ⚠️
go/vt/vtctl/reparentutil/replication.go 81.81% 2 Missing ⚠️
go/vt/mysqlctl/version.go 80.00% 1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (5f417e2). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (70c7a72) HEAD (5f417e2)
1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20211       +/-   ##
===========================================
- Coverage   69.67%   62.90%    -6.78%     
===========================================
  Files        1614      260     -1354     
  Lines      216793    51083   -165710     
===========================================
- Hits       151044    32132   -118912     
+ Misses      65749    18951    -46798     
Flag Coverage Δ
partial 62.90% <90.84%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In go/vt/vtctl/reparentutil/reparent_sorter.go:83-121 the version only wins on exact position ties, so I think that PRS can still promote a newer MySQL release during rolling upgrades because ElectNewPrimary still sorts replication position before MySQL release. For PRS, the current primary is healthy and performGracefulPromotion later waits the chosen primary-elect to catch up to both a primary snapshot and the demotion position, so a slightly-behind lower-release replica is usually safe and intended to catch up. With the current ordering, an 8.4 replica that is one transaction ahead of an 8.0 replica will still be auto-selected, which is exactly the rolling-upgrade failure mode this PR is trying to prevent for older replicas. Please add a PRS regression test where an 8.4 candidate is slightly ahead of an 8.0 candidate and make the lower release win when it is otherwise eligible.

In go/vt/vtctl/reparentutil/replication.go:321-339 ERS primary-status candidates are always treated as unknown version. When StopReplicationAndGetStatus returns ERNotReplica, the code demotes the tablet and records it in primaryStatusMap, but never records its MySQL version in mysqlVersions. Later ERS maps missing versions to unknownVersion, so reachable primary-like candidates are sorted after known-version replicas even when they are the lower compatible release. That leaves a gap in ERS version-aware election for contested-primary/split-brain cases. Please carry server version through this path too, likely by adding it to PrimaryStatus or otherwise populating res.mysqlVersions for demoted primary-status candidates.

ejortegau and others added 2 commits June 1, 2026 15:46
PRS always catches the elected tablet up to the old primary's exact
demotion position, so replication position head-start is irrelevant for
data safety. This change introduces SortMode (SortForPRS/SortForERS) to
give PRS a distinct sort order: promotion rules > MySQL version >
position > buffer pool > alias.

ERS retains position-first ordering to minimize data loss.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
@ejortegau ejortegau requested a review from beingnoble03 as a code owner June 1, 2026 13:48
@ejortegau

Copy link
Copy Markdown
Contributor Author

Hi, @mattlord ,

Thanks for the feedback, I I have addressed it.

@ejortegau ejortegau requested a review from mattlord June 1, 2026 13:49
@timvaillancourt

Copy link
Copy Markdown
Contributor

👋 @ejortegau — nice work on this. The version-aware preference feels like the right trade — a louder, recoverable failure during a mixed-version window beats a silent post-PRS replication break that you only notice when older-version replicas start erroring on relay-log apply.

A few small follow-ups worth folding in:

  1. Deployment note — a sentence calling out that PRS during a rolling MySQL upgrade will now prefer lower-version candidates and may take longer while it catches up, so operators tuning --wait-replicas-timeout know what to expect.

  2. Drop the unused assignment at rpc_replication.go:1134after.ServerVersion = before.ServerVersion isn't read anywhere downstream (stopReplicationAndBuildStatusMaps only reads Before.ServerVersion).

  3. Log the parse error in replication.go:359 and util.go:141 when the version string fails to parse — silently dropping it makes "why is this tablet's version unknown?" hard to debug in the field.

  4. Quick comment on findCandidate explaining why the post-loop "if source is same-release as best, prefer source" step is needed — the control flow is subtle and reads at first glance like it could be folded into the loop.

Otherwise LGTM 👍

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERS primary-status candidates still seem to lose MySQL version data in go/vt/vtctl/reparentutil/replication.go:321-338 as it handles ERNotReplica by demoting the tablet and storing only primaryStatusMap. The version map is populated only on the normal StopReplicationAndGetStatus path at replication.go:359-363. Later, findMostAdvanced maps missing versions to unknownVersion, so reachable primary-like/split-brain candidates are sorted as unknown even if they are actually the lower compatible MySQL release. This leaves a gap in the ERS protection this PR is adding. Please carry server version through the DemotePrimary/PrimaryStatus path too.

The optional version lookup seems to happen before stopping replication in go/vt/vttablet/tabletmanager/rpc_replication.go:1078-1082 as it calls GetVersionString(ctx) before stopIOThreadLocked / stopReplicationLocked. This is optional metadata, but it shares the stop RPC context and runs while holding the tabletmanager lock. If that version query or fallback path is slow, it can delay or consume the timeout before the actual stop, making ERS more brittle. Please collect this after replication is stopped, or otherwise ensure version collection cannot delay/fail the stop path.

Lastly, the Copilot note about log.Warn(..., slog.Any(...)) still looks valid.

And the PR description still says the sorter order is GTID position > promotion rules > MySQL release, while PRS now intentionally uses promotion rules > MySQL release > position.

- Remove unused `after.ServerVersion` assignment in StopReplicationAndGetStatus
- Log warning when MySQL version string fails to parse
- Clarify findCandidate post-loop comment explaining two-phase logic
- Add v25 changelog entry with deployment note and cross-cell limitation

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 2, 2026 08:36

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 16 changed files in this pull request and generated 4 comments.

Files not reviewed (2)
  • go/vt/proto/replicationdata/replicationdata.pb.go: Language not supported
  • go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Language not supported

Comment on lines 156 to 157
} else {
fmt.Fprintf(&reasonsToInvalidate, "\n%v has %v replication lag which is more than the tolerable amount", topoproto.TabletAliasString(tablet.Alias), replLag)
Comment on lines +145 to +154
v := unknownVersion
if serverVersion != "" {
_, parsed, parseErr := mysqlctl.ParseVersionString(serverVersion)
if parseErr == nil {
v = parsed
} else {
logger.Warningf("failed to parse MySQL version %q for tablet %v: %v", serverVersion, topoproto.TabletAliasString(tb.Alias), parseErr)
}
}
mysqlVersions = append(mysqlVersions, v)
Comment on lines +56 to +60
if version, vErr := tm.MysqlDaemon.GetVersionString(ctx); vErr == nil {
protoStatus.ServerVersion = version
} else {
log.Warn("failed to get MySQL version string", slog.Any("error", vErr))
}
Comment thread go/vt/vtctl/reparentutil/util.go Outdated
Comment on lines +392 to +394
if v.IsSameRelease(bestVersion) || v.ReleaseAtLeast(bestVersion) {
continue
}
ejortegau and others added 2 commits June 2, 2026 14:00
- Add `server_version` field to `PrimaryStatus` proto so that
  DemotePrimary and PrimaryStatus RPCs report the MySQL version
- Populate version in PrimaryStatus, DemotePrimary, and
  StopReplicationAndGetStatus RPCs
- Move GetVersionString call after replication stop in
  StopReplicationAndGetStatus to avoid delaying the critical path
- Read PrimaryStatus.ServerVersion in ERS ERNotReplica path so
  demoted primary-status candidates no longer get unknownVersion
- Extract getMySQLVersion helper to deduplicate the fetch-and-warn
  pattern across 4 call sites
- Add test coverage for version propagation through both paths

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
ReleaseAtLeast already covers the same-release case (Minor >=),
so the redundant IsSameRelease check can be removed.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
@ejortegau ejortegau requested a review from mattlord June 2, 2026 12:15

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current Check Make VTAdmin Web Proto job is failing because regenerating vtadmin web protos changes both committed files. Please rerun the vtadmin web proto generation and commit the updated artifacts.

In go/vt/vttablet/tabletmanager/rpc_replication.go:1091-1116 StopReplicationAndGetStatus only populates before.ServerVersion at the end of the full success path. The early-success paths for already-stopped IO replication or already-unhealthy replication return Before/After before the version is set. ERS builds its version map from StopReplicationStatus.Before.ServerVersion, so a reachable older-version tablet in this state becomes unknownVersion and can lose the version-aware election to a newer tablet. Please populate ServerVersion before all success returns, and add a regression test for an already-stopped replica.

ejortegau and others added 3 commits June 3, 2026 10:39
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
StopReplicationAndGetStatus had several early-return paths (IO thread
already stopped, replication not healthy, stop failures, after-status
failure) that returned Before without populating ServerVersion. ERS
builds its version map from Before.ServerVersion, so tablets hitting
these paths became unknownVersion and could lose version-aware election
to newer tablets.

Fix: call getMySQLVersion(ctx) before every return that includes Before.

Add a table-driven test covering all return paths (success and error)
for both IOTHREADONLY and IOANDSQLTHREAD modes.

Ref: vitessio#20211 (review)

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
@ejortegau ejortegau requested review from Copilot and mattlord June 3, 2026 09:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 18 changed files in this pull request and generated 4 comments.

Files not reviewed (2)
  • go/vt/proto/replicationdata/replicationdata.pb.go: Language not supported
  • go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Language not supported
Comments suppressed due to low confidence (2)

go/vt/vttablet/tabletmanager/rpc_replication.go:1

  • tm.getMySQLVersion(ctx) can be called multiple times in a single StopReplicationAndGetStatus execution, potentially causing repeated MysqlDaemon.GetVersionString calls (extra round trips) along error/early-return paths. Consider retrieving the version once near the start of the method and reusing it for all branches.
/*

go/vt/vttablet/tabletmanager/rpc_replication.go:1

  • tm.getMySQLVersion(ctx) can be called multiple times in a single StopReplicationAndGetStatus execution, potentially causing repeated MysqlDaemon.GetVersionString calls (extra round trips) along error/early-return paths. Consider retrieving the version once near the start of the method and reusing it for all branches.
/*

Comment on lines +43 to +50
func (tm *TabletManager) getMySQLVersion(ctx context.Context) string {
version, err := tm.MysqlDaemon.GetVersionString(ctx)
if err != nil {
log.Warn(fmt.Sprintf("failed to get MySQL version string: %v", err))
return ""
}
return version
}
rs.FilePosition = rsAfter.FilePosition
rs.RelayLogSourceBinlogEquivalentPosition = rsAfter.RelayLogSourceBinlogEquivalentPosition

before.ServerVersion = tm.getMySQLVersion(ctx)
Comment on lines +1148 to 1152
before.ServerVersion = tm.getMySQLVersion(ctx)

return StopReplicationAndGetStatusResponse{
Status: &replicationdatapb.StopReplicationStatus{
Before: before,
Comment on lines +24 to +25
- **[VTCtld](#minor-changes-vtctld)**
- [MySQL version-aware reparent candidate election](#vtctld-version-aware-reparent)
ejortegau added a commit to slackhq/vitess that referenced this pull request Jun 4, 2026
StopReplicationAndGetStatus had several early-return paths (IO thread
already stopped, replication not healthy, stop failures, after-status
failure) that returned Before without populating ServerVersion. ERS
builds its version map from Before.ServerVersion, so tablets hitting
these paths became unknownVersion and could lose version-aware election
to newer tablets.

Fix: call getMySQLVersion(ctx) before every return that includes Before.

Add a table-driven test covering all return paths (success and error)
for both IOTHREADONLY and IOANDSQLTHREAD modes.

Ref: vitessio#20211 (review)

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. Just one relatively minor things... In go/vt/vttablet/tabletmanager/rpc_replication.go:1142-1158 StopReplicationAndGetStatus now populates Before.ServerVersion, but the normal successful stop path returns an After status without ServerVersion. The early no-op paths use After: before, so they include it, making the common success path inconsistent. Please set after.ServerVersion = before.ServerVersion and extend TestStopReplicationAndGetStatus_ServerVersion to assert After.ServerVersion when After is present.

Thank you @ejortegau !

The successful stop path set Before.ServerVersion but left After.ServerVersion
empty, while the no-op paths return After: before and thus include it. Set
after.ServerVersion = before.ServerVersion so the common success path is
consistent, and extend the test to assert After.ServerVersion when present.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Eduardo Ortega <5791035+ejortegau@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 14:02
@ejortegau

Copy link
Copy Markdown
Contributor Author

Made the requested changes, @mattlord . Please let me know if there is anything else that needs adjusting.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 18 changed files in this pull request and generated 8 comments.

Files not reviewed (2)
  • go/vt/proto/replicationdata/replicationdata.pb.go: Generated file
  • go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Generated file

Comment on lines 210 to 212
sqlErr, isSQLErr := sqlerror.NewSQLErrorFromError(err).(*sqlerror.SQLError)
if isSQLErr && sqlErr != nil && sqlErr.Number() == sqlerror.ERNotReplica {
logger.Warningf("no replication statue from %v, using empty gtid set", topoproto.TabletAliasString(tablet.Alias))
Comment on lines +43 to +50
func (tm *TabletManager) getMySQLVersion(ctx context.Context) string {
version, err := tm.MysqlDaemon.GetVersionString(ctx)
if err != nil {
log.Warn(fmt.Sprintf("failed to get MySQL version string: %v", err))
return ""
}
return version
}
Comment on lines 1092 to +1093
if !rs.IOHealthy() {
before.ServerVersion = tm.getMySQLVersion(ctx)
Comment on lines 1101 to +1102
if err := tm.stopIOThreadLocked(ctx); err != nil {
before.ServerVersion = tm.getMySQLVersion(ctx)
Comment on lines 1110 to +1112
if !rs.Healthy() {
// no replication is running, just return what we got
before.ServerVersion = tm.getMySQLVersion(ctx)
Comment on lines 1120 to +1121
if err := tm.stopReplicationLocked(ctx); err != nil {
before.ServerVersion = tm.getMySQLVersion(ctx)
Comment on lines 1132 to +1133
if err != nil {
before.ServerVersion = tm.getMySQLVersion(ctx)
Comment on lines +1148 to +1149
before.ServerVersion = tm.getMySQLVersion(ctx)
after.ServerVersion = before.ServerVersion
@ejortegau ejortegau requested a review from mattlord June 16, 2026 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Reparenting should be MySQL version aware

4 participants