Skip to content

vtctld: preserve response keyspace/shard when reparent fails before ShardInfo is populated#20185

Open
lmorduch wants to merge 3 commits into
vitessio:mainfrom
HubSpot:lmorduch/planned-reparent-nil-guard
Open

vtctld: preserve response keyspace/shard when reparent fails before ShardInfo is populated#20185
lmorduch wants to merge 3 commits into
vitessio:mainfrom
HubSpot:lmorduch/planned-reparent-nil-guard

Conversation

@lmorduch
Copy link
Copy Markdown
Contributor

@lmorduch lmorduch commented May 26, 2026

Description

Both PlannedReparentShard and EmergencyReparentShard initialise the response with
req.Keyspace/req.Shard, then conditionally overwrite from ev.ShardInfo:

resp = &vtctldatapb.PlannedReparentShardResponse{
    Keyspace: req.Keyspace,
    Shard:    req.Shard,
}
if ev != nil {
    resp.Keyspace = ev.ShardInfo.Keyspace() // returns "" when ShardInfo is zero-valued
    resp.Shard    = ev.ShardInfo.ShardName()
    ...
}

ev is always non-nil (allocated before the reparent call), but ev.ShardInfo is
zero-valued whenever reparentShardLocked returns before the
ev.ShardInfo = *shardInfo assignment. For PRS this can happen when
GetShard, GetKeyspaceDurability, GetDurabilityPolicy, or the
ExpectedPrimary pre-flight check returns an error; for ERS when GetShard
fails. In those cases ev.ShardInfo.Keyspace() returns "", silently
overwriting resp.Keyspace and resp.Shard with empty strings.

The fix extracts a fillReparentResponseFromEvent helper shared by both
endpoints. The helper guards Keyspace()/ShardName() before overwriting,
preserving the req defaults when the event has no shard metadata.

Related Issue(s)

Fixes #20184

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

None.

AI Disclosure

This PR was written primarily by Claude Code.

Copilot AI review requested due to automatic review settings May 26, 2026 15:58
@github-actions github-actions Bot added this to the v25.0.0 milestone May 26, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 26, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 26, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds a regression test and adjusts reparent RPC response-filling to avoid returning empty keyspace/shard when an error occurs before shard metadata is populated.

Changes:

  • Add a PlannedReparentShard regression test asserting response keyspace/shard are preserved on early failure.
  • Change EmergencyReparentShard and PlannedReparentShard to only overwrite response keyspace/shard when shard metadata appears populated.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
go/vt/vtctl/grpcvtctldserver/server_test.go Adds a regression test ensuring error responses still include request keyspace/shard.
go/vt/vtctl/grpcvtctldserver/server.go Adjusts response population logic to avoid overwriting keyspace/shard with empty values on early failures.

Comment thread go/vt/vtctl/grpcvtctldserver/server.go Outdated
Comment on lines +1339 to +1341
if k := ev.ShardInfo.Keyspace(); k != "" {
resp.Keyspace = k
resp.Shard = ev.ShardInfo.ShardName()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ev.ShardInfo is a value type (topo.ShardInfo struct), not a pointer, so it cannot be nil, only zero-valued. Keyspace() returns si.keyspace, a plain string field, which is "" on a zero-valued struct without panicking. The if k != "" guard should be sufficient here

Comment thread go/vt/vtctl/grpcvtctldserver/server.go Outdated
Comment on lines +3341 to +3348
// ev is always non-nil (initialized before the reparent call), but ev.ShardInfo may be
// zero-valued if reparentShardLocked returned before populating it (e.g. error paths where
// no primary could be found). Guard against a nil pointer dereference.
if ev != nil {
resp.Keyspace = ev.ShardInfo.Keyspace()
resp.Shard = ev.ShardInfo.ShardName()
if k := ev.ShardInfo.Keyspace(); k != "" {
resp.Keyspace = k
resp.Shard = ev.ShardInfo.ShardName()
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented a DRY method for this

Comment thread go/vt/vtctl/grpcvtctldserver/server.go
@mattlord mattlord added Type: Bug Component: Cluster management and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 26, 2026
@lmorduch lmorduch force-pushed the lmorduch/planned-reparent-nil-guard branch from c75015a to 1a4bdb4 Compare May 26, 2026 16:33
Copilot AI review requested due to automatic review settings May 26, 2026 16:41
@lmorduch lmorduch force-pushed the lmorduch/planned-reparent-nil-guard branch from 1a4bdb4 to e0c2b1b Compare May 26, 2026 16:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread go/vt/vtctl/grpcvtctldserver/server.go Outdated
Comment thread go/vt/vtctl/grpcvtctldserver/server.go Outdated
Comment on lines +1293 to +1294
*shard = ev.ShardInfo.ShardName()
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest commit, now checks both: if k, s := ev.ShardInfo.Keyspace(), ev.ShardInfo.ShardName(); k != "" && s != ""

Comment thread go/vt/vtctl/grpcvtctldserver/server_test.go
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lmorduch please do not overwrite the PR template: https://github.com/vitessio/vitess/blob/main/.github/pull_request_template.md
Can you please incorporate your description into that?

The regression test does not cover the failure mode described in #20184. The linked issue describes AvoidPrimary / no valid same-cell candidate, but PRS assigns ev.ShardInfo = *shardInfo before preflightChecks / ElectNewPrimary, so that path should already have populated shard metadata. The new test instead exercises ExpectedPrimary failing before ev.ShardInfo is assigned, which is a real early-error response bug, but a different scenario. Please add a regression test for the exact #20184 repro and verify it fails on main, or narrow the PR/issue description to the early-return paths this actually fixes.

Since EmergencyReparentShard was changed symmetrically, it should get the same response-preservation assertion on an early GetShard/pre-ShardInfo failure path. The code change is small, but without the ERS test this can regress independently.

I don’t think Copilot’s ev.ShardInfo != nil comment is valid as written: ShardInfo is a value field, not a pointer. The deeper concern is that the claimed panic/repro and the tested fixed path don’t currently line up.

… in reparent RPCs

PlannedReparentShard and EmergencyReparentShard both initialize the
response with req.Keyspace/req.Shard, then overwrite from ev.ShardInfo
inside an `if ev != nil` block. ev itself is always non-nil (allocated
before the reparent call), but ev.ShardInfo is zero-valued whenever
reparentShardLocked returns before populating it — for example when no
valid reparent target exists in the same cell as the current primary.

Calling .Keyspace() / .ShardName() on a zero-valued topo.ShardInfo
dereferences an internal nil pointer, crashing vtctld.

Fix: check whether ev.ShardInfo.Keyspace() is non-empty before
overwriting the response fields. When it is empty the response already
holds the correct values from the req initialization, so the fallback
is safe.

Signed-off-by: Lucas Morduchowicz <lmorduch@gmail.com>
@lmorduch lmorduch force-pushed the lmorduch/planned-reparent-nil-guard branch from e0c2b1b to b9318a9 Compare May 26, 2026 16:50
@lmorduch
Copy link
Copy Markdown
Contributor Author

@lmorduch please do not overwrite the PR template: https://github.com/vitessio/vitess/blob/main/.github/pull_request_template.md Can you please incorporate your description into that?

The regression test does not cover the failure mode described in #20184. The linked issue describes AvoidPrimary / no valid same-cell candidate, but PRS assigns ev.ShardInfo = *shardInfo before preflightChecks / ElectNewPrimary, so that path should already have populated shard metadata. The new test instead exercises ExpectedPrimary failing before ev.ShardInfo is assigned, which is a real early-error response bug, but a different scenario. Please add a regression test for the exact #20184 repro and verify it fails on main, or narrow the PR/issue description to the early-return paths this actually fixes.

Since EmergencyReparentShard was changed symmetrically, it should get the same response-preservation assertion on an early GetShard/pre-ShardInfo failure path. The code change is small, but without the ERS test this can regress independently.

I don’t think Copilot’s ev.ShardInfo != nil comment is valid as written: ShardInfo is a value field, not a pointer. The deeper concern is that the claimed panic/repro and the tested fixed path don’t currently line up.

@mattlord Apologies for clobbering the PR description! Updated it back.

You're right that the claimed panic and the tested path don't line up, so let me explain what we actually found.

The production crash was in our hubspot-v14 branch (Vitess 14-based). In that version, server.go:2017 is:

if !topoproto.TabletAliasIsZero(ev.NewPrimary.Alias) {

No ev.NewPrimary != nil guard. When AvoidPrimary finds no valid candidate, ElectNewPrimary fails, ev.NewPrimary is never set, and that line dereferences nil. addr=0x28 in the SIGSEGV is the offset of Alias in topodatapb.Tablet. That version also lacks panicHandler in PlannedReparentShard, so the panic escaped gRPC and killed the process. The same unguarded line exists in vitessio release-14.0 at server.go:2006.

Current main already has ev.NewPrimary != nil guarding that dereference, so that specific panic can't happen here. What remains is a quieter correctness bug: when reparentShardLocked exits before ev.ShardInfo = *shardInfo is reached (GetShard failure, GetKeyspaceDurability failure, GetDurabilityPolicy failure, or ExpectedPrimary mismatch for PRS; GetShard failure for ERS), the response gets keyspace: "" and shard: "" instead of the request values. That's what this PR fixes, and what both tests cover.

Happy to add a note to the PR description clarifying the v14 origin of the issue vs. what this patch addresses in main. Also happy to drop this PR if we feel that extra fix ain't worth the code we're adding here, since the panic is fixed upstream.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.47%. Comparing base (70c7a72) to head (9cfb823).
⚠️ Report is 277 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20185       +/-   ##
===========================================
- Coverage   69.67%   69.47%    -0.20%     
===========================================
  Files        1614        9     -1605     
  Lines      216793     4777   -212016     
===========================================
- Hits       151044     3319   -147725     
+ Misses      65749     1458    -64291     
Flag Coverage Δ
partial 69.47% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Add nil checks for each output parameter and unit tests covering all
nil combinations, so the helper is safe to call with any subset of
non-nil pointers.

Signed-off-by: Lucas Morduchowicz <lmorduch@gmail.com>
Copilot AI review requested due to automatic review settings May 27, 2026 00:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines 54 to 55
"vitess.io/vitess/go/vt/topotools/events"
"vitess.io/vitess/go/vt/topo/topoproto"
Signed-off-by: Lucas Morduchowicz <lmorduch@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: PlannedReparentShard/EmergencyReparentShard silently return empty keyspace/shard on early reparent failure

3 participants