vtctld: preserve response keyspace/shard when reparent fails before ShardInfo is populated #20185
lmorduch wants to merge 3 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds a regression test and adjusts reparent RPC response-filling to avoid returning empty keyspace/shard when an error occurs before shard metadata is populated.
Changes:
- Add a PlannedReparentShard regression test asserting response keyspace/shard are preserved on early failure.
- Change EmergencyReparentShard and PlannedReparentShard to only overwrite response keyspace/shard when shard metadata appears populated.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| go/vt/vtctl/grpcvtctldserver/server_test.go | Adds a regression test ensuring error responses still include request keyspace/shard. |
| go/vt/vtctl/grpcvtctldserver/server.go | Adjusts response population logic to avoid overwriting keyspace/shard with empty values on early failures. |
| if k := ev.ShardInfo.Keyspace(); k != "" { | ||
| resp.Keyspace = k | ||
| resp.Shard = ev.ShardInfo.ShardName() |
ev.ShardInfo is a value type (topo.ShardInfo struct), not a pointer, so it cannot be nil, only zero-valued. Keyspace() returns si.keyspace, a plain string field, which is "" on a zero-valued struct without panicking. The if k != "" guard should be sufficient here.
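A minimal sketch of the value-type point, assuming stand-in types rather than the real Vitess ones: `shardInfo` and `reparentEvent` below are hypothetical and model only the two string fields under discussion. It shows that accessor methods on a zero-valued struct field return empty strings rather than panicking.

```go
package main

import "fmt"

// shardInfo is a hypothetical stand-in for topo.ShardInfo. Only the two
// string fields relevant to this discussion are modeled.
type shardInfo struct {
	keyspace  string
	shardName string
}

// Keyspace returns the plain string field; on a zero value this is "".
func (si *shardInfo) Keyspace() string { return si.keyspace }

// ShardName returns the plain string field; on a zero value this is "".
func (si *shardInfo) ShardName() string { return si.shardName }

// reparentEvent is a hypothetical stand-in for the reparent event.
// ShardInfo is held by value, so it can be zero-valued but never nil.
type reparentEvent struct {
	ShardInfo shardInfo
}

func main() {
	// Simulates an event whose ShardInfo was never assigned.
	ev := &reparentEvent{}
	fmt.Printf("keyspace=%q shard=%q\n", ev.ShardInfo.Keyspace(), ev.ShardInfo.ShardName())
}
```

Under these assumptions the accessors are safe to call, so the remaining risk is overwriting a response with empty strings, not a nil dereference.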
| // ev is always non-nil (initialized before the reparent call), but ev.ShardInfo may be | ||
| // zero-valued if reparentShardLocked returned before populating it (e.g. error paths where | ||
| // no primary could be found). Guard against a nil pointer dereference. | ||
| if ev != nil { | ||
| resp.Keyspace = ev.ShardInfo.Keyspace() | ||
| resp.Shard = ev.ShardInfo.ShardName() | ||
| if k := ev.ShardInfo.Keyspace(); k != "" { | ||
| resp.Keyspace = k | ||
| resp.Shard = ev.ShardInfo.ShardName() | ||
| } |
Implemented a DRY method for this
c75015a to 1a4bdb4
1a4bdb4 to e0c2b1b
| *shard = ev.ShardInfo.ShardName() | ||
| } |
Fixed in the latest commit, now checks both: if k, s := ev.ShardInfo.Keyspace(), ev.ShardInfo.ShardName(); k != "" && s != ""
mattlord left a comment
@lmorduch please do not overwrite the PR template: https://github.com/vitessio/vitess/blob/main/.github/pull_request_template.md
Can you please incorporate your description into that?
The regression test does not cover the failure mode described in #20184. The linked issue describes AvoidPrimary / no valid same-cell candidate, but PRS assigns ev.ShardInfo = *shardInfo before preflightChecks / ElectNewPrimary, so that path should already have populated shard metadata. The new test instead exercises ExpectedPrimary failing before ev.ShardInfo is assigned, which is a real early-error response bug, but a different scenario. Please add a regression test for the exact #20184 repro and verify it fails on main, or narrow the PR/issue description to the early-return paths this actually fixes.
Since EmergencyReparentShard was changed symmetrically, it should get the same response-preservation assertion on an early GetShard/pre-ShardInfo failure path. The code change is small, but without the ERS test this can regress independently.
I don’t think Copilot’s ev.ShardInfo != nil comment is valid as written: ShardInfo is a value field, not a pointer. The deeper concern is that the claimed panic/repro and the tested fixed path don’t currently line up.
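As a rough illustration of the early-failure assertion requested for both RPCs, the sketch below simulates a handler that seeds the response from the request, runs a reparent step that may fail before shard metadata is stored on the event, and then applies the guarded overwrite. All names here (`shardMeta`, `event`, `response`, `reparent`, `handle`) are hypothetical stand-ins, and the flow is an assumption based on the description in this conversation, not the actual server code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the shard metadata, event, and RPC response.
type shardMeta struct{ keyspace, shard string }

type event struct{ ShardInfo shardMeta }

type response struct{ Keyspace, Shard string }

// reparent simulates the locked reparent step. When failEarly is true it
// returns before the event's shard metadata is assigned.
func reparent(ev *event, keyspace, shard string, failEarly bool) error {
	if failEarly {
		return errors.New("simulated failure before shard metadata is loaded")
	}
	ev.ShardInfo = shardMeta{keyspace: keyspace, shard: shard}
	return errors.New("simulated failure after shard metadata is loaded")
}

// handle mirrors the assumed RPC flow: seed the response from the request,
// then overwrite from the event only when both fields are populated.
func handle(keyspace, shard string, failEarly bool) (response, error) {
	resp := response{Keyspace: keyspace, Shard: shard}
	ev := &event{}
	err := reparent(ev, keyspace, shard, failEarly)
	if k, s := ev.ShardInfo.keyspace, ev.ShardInfo.shard; k != "" && s != "" {
		resp.Keyspace, resp.Shard = k, s
	}
	return resp, err
}

func main() {
	for _, failEarly := range []bool{true, false} {
		resp, err := handle("testkeyspace", "-", failEarly)
		fmt.Printf("failEarly=%v keyspace=%q shard=%q err=%v\n",
			failEarly, resp.Keyspace, resp.Shard, err)
	}
}
```

A real regression test would drive the actual PlannedReparentShard and EmergencyReparentShard handlers against a test topology; this sketch only shows the shape of the assertion, namely that the response still carries the request's keyspace and shard on the early path.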
… in reparent RPCs
PlannedReparentShard and EmergencyReparentShard both initialize the response with req.Keyspace/req.Shard, then overwrite from ev.ShardInfo inside an `if ev != nil` block. ev itself is always non-nil (allocated before the reparent call), but ev.ShardInfo is zero-valued whenever reparentShardLocked returns before populating it, for example when no valid reparent target exists in the same cell as the current primary. Calling .Keyspace() / .ShardName() on a zero-valued topo.ShardInfo dereferences an internal nil pointer, crashing vtctld.
Fix: check whether ev.ShardInfo.Keyspace() is non-empty before overwriting the response fields. When it is empty the response already holds the correct values from the req initialization, so the fallback is safe.
Signed-off-by: Lucas Morduchowicz <lmorduch@gmail.com>
e0c2b1b to b9318a9
@mattlord Apologies for clobbering the PR description! Updated it back.
You're right that the claimed panic and the tested path don't line up, so let me explain what we actually found.
The production crash was in our if !topoproto.TabletAliasIsZero(ev.NewPrimary.Alias) {No Current main already has
Happy to add a note to the PR description clarifying the v14 origin of the issue vs. what this patch addresses in main. Also happy to drop this PR if we feel that extra fix ain't worth the code we're adding here, since the panic is fixed upstream.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #20185 +/- ##
===========================================
- Coverage 69.67% 69.47% -0.20%
===========================================
Files 1614 9 -1605
Lines 216793 4777 -212016
===========================================
- Hits 151044 3319 -147725
+ Misses 65749 1458 -64291
Add nil checks for each output parameter and unit tests covering all nil combinations, so the helper is safe to call with any subset of non-nil pointers. Signed-off-by: Lucas Morduchowicz <lmorduch@gmail.com>
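One way such a nil-tolerant helper could look, as a sketch only: the signature of the real fillReparentResponseFromEvent is not shown in this conversation, so `fillKeyspaceShard`, `shardInfo`, and `reparentEvent` below are hypothetical names, and only the keyspace and shard outputs are modeled.

```go
package main

import "fmt"

// shardInfo and reparentEvent are hypothetical stand-ins for the Vitess types.
type shardInfo struct{ keyspace, shardName string }

func (si *shardInfo) Keyspace() string  { return si.keyspace }
func (si *shardInfo) ShardName() string { return si.shardName }

type reparentEvent struct{ ShardInfo shardInfo }

// fillKeyspaceShard (hypothetical name and signature) copies keyspace and
// shard from the event into the output pointers, but only when the event
// carries both values. Nil events and nil output pointers are skipped, so
// callers may pass any subset of non-nil pointers.
func fillKeyspaceShard(ev *reparentEvent, keyspace, shard *string) {
	if ev == nil {
		return
	}
	k, s := ev.ShardInfo.Keyspace(), ev.ShardInfo.ShardName()
	if k == "" || s == "" {
		return // leave the request-derived defaults untouched
	}
	if keyspace != nil {
		*keyspace = k
	}
	if shard != nil {
		*shard = s
	}
}

func main() {
	ks, sh := "req-keyspace", "req-shard"

	// Zero-valued shard metadata: the defaults survive.
	fillKeyspaceShard(&reparentEvent{}, &ks, &sh)
	fmt.Printf("after empty event: keyspace=%q shard=%q\n", ks, sh)

	// Populated metadata with a nil shard pointer: only keyspace is updated.
	ev := &reparentEvent{ShardInfo: shardInfo{keyspace: "commerce", shardName: "0"}}
	fillKeyspaceShard(ev, &ks, nil)
	fmt.Printf("after populated event: keyspace=%q shard=%q\n", ks, sh)
}
```

Checking each pointer separately keeps the helper usable by callers that only track one of the two fields, at the cost of a few extra branches to cover in unit tests.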
| "vitess.io/vitess/go/vt/topotools/events" | ||
| "vitess.io/vitess/go/vt/topo/topoproto" |
Signed-off-by: Lucas Morduchowicz <lmorduch@gmail.com>
Description
Both PlannedReparentShard and EmergencyReparentShard initialise the response with req.Keyspace/req.Shard, then conditionally overwrite from ev.ShardInfo.
ev is always non-nil (allocated before the reparent call), but ev.ShardInfo is zero-valued whenever reparentShardLocked returns before the ev.ShardInfo = *shardInfo assignment. For PRS this can happen when GetShard, GetKeyspaceDurability, GetDurabilityPolicy, or the ExpectedPrimary pre-flight check returns an error; for ERS when GetShard fails. In those cases ev.ShardInfo.Keyspace() returns "", silently overwriting resp.Keyspace and resp.Shard with empty strings.
The fix extracts a fillReparentResponseFromEvent helper shared by both endpoints. The helper guards Keyspace()/ShardName() before overwriting, preserving the req defaults when the event has no shard metadata.
Related Issue(s)
Fixes #20184
Checklist
Deployment Notes
None.
AI Disclosure
This PR was written primarily by Claude Code.