reparentutil: fix goroutine count and improve logging#19888
Conversation
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
There was a problem hiding this comment.
Pull request overview
This PR applies a set of safety and observability fixes in reparentutil, primarily around replication-stop coordination, replica reparent concurrency, and troubleshooting logs during planned reparents.
Changes:
- Adds a precondition guard in
stopReplicationAndBuildStatusMapsto prevent invalid “no tablets to act on” scenarios. - Improves mutex safety in ERS replica handling by using
deferto ensure unlock on early exits/panics. - Enhances PRS logging and adds rationale around best-effort
RefreshStatebehavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| go/vt/vtctl/reparentutil/replication.go | Adds guard logic around the computed goroutine count when stopping replication across tablets. |
| go/vt/vtctl/reparentutil/planned_reparenter.go | Adds rationale for best-effort RefreshState and includes tablet alias in warning logs. |
| go/vt/vtctl/reparentutil/emergency_reparenter.go | Uses defer for mutex unlock and clarifies background goroutine / cancellation behavior. |
reparentutil: fix mutex safety, add ignored-tablets guard, improve logging
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
reparentutil: fix mutex safety, add ignored-tablets guard, improve loggingreparentutil: fix goroutine count and improve logging
reparentutil: fix goroutine count and improve logging
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #19888 +/- ##
===========================================
+ Coverage 69.67% 91.38% +21.70%
===========================================
Files 1614 9 -1605
Lines 216793 1311 -215482
===========================================
- Hits 151044 1198 -149846
+ Misses 65749 113 -65636
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
mattlord
left a comment
There was a problem hiding this comment.
The changes seem OK (not sure why this warrants a PR by itself as there is an org/project cost for PRs), but I don't like having PRs w/o an issue and w/o any tests which demonstrate the problem and the fix.
The PR changes the goroutine counting logic and adds a new early FAILED_PRECONDITION branch in replication.go (line 371), but there’s no corresponding coverage added in replication_test.go (line 272). In particular, I’d want two explicit test cases:
- IgnoreReplicas contains a stale alias that is not present in tabletMap
- All real tablets are filtered out by IgnoreReplicas, so the new precondition path fires
That’s the exact behavior this PR seems to be fixing, and right now it’s still unprotected by regression tests.
|
The changes LGTM, but as @mattlord pointed out, tests that cover this would be nice. |
nickvanw
left a comment
There was a problem hiding this comment.
LGTM. The counting-in-loop fix is the right shape — the previous len-based subtraction could silently mismatch reality when ignoredTablets had stale aliases, leaking goroutines on the unbuffered errChan. Tests cover the two scenarios that matter.
…19888) (#19922) Signed-off-by: Matt Lord <mattalord@gmail.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Derek Perkins <derek@nozzle.io> Signed-off-by: Arthur Schreiber <arthur@planetscale.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: MukundaKatta <mukundakatta@users.noreply.github.com> Signed-off-by: Nick Van Wiggeren <nick@planetscale.com> Signed-off-by: Mohamed Hamza <mhamza@fastmail.com> Signed-off-by: Brett Wines <bwines@slack-corp.com> Signed-off-by: Harshit Gangal <harshit@planetscale.com> Co-authored-by: Matt Lord <mattalord@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Derek Perkins <derek@nozzle.io> Co-authored-by: Arthur Schreiber <arthurschreiber@github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Mohamed Hamza <mhamza@fastmail.com> Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com> Co-authored-by: vitess-go-upgrade-bot <139342327+vitess-bot@users.noreply.github.com> Co-authored-by: frouioui <35779988+frouioui@users.noreply.github.com> Co-authored-by: Mukunda Rao Katta <mukunda.vjcs6@gmail.com> Co-authored-by: Nick Van Wiggeren <nickvanw@users.noreply.github.com> Co-authored-by: Arthur Schreiber <arthur@planetscale.com> Co-authored-by: Brett Wines <bwines@salesforce.com> Co-authored-by: Claude <svc-devxp-claude@slack-corp.com> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com> Co-authored-by: Harshit Gangal <harshit@planetscale.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
2 x small fixes found during an audit of the
reparentutilpackage:Incorrect goroutine count in
stopReplicationAndBuildStatusMaps()—numGoRoutineswas computed aslen(tabletMap) - ignoredTablets.Len(), butignoredTabletscan contain aliases not present intabletMap(e.g., stale entries), making the subtraction wrong. Now counts goroutines directly in the launch loop. Also added an earlyFAILED_PRECONDITIONreturn when tablets exist but all are excluded byIgnoreReplicas, which previously producedrequiredSuccesses = -1(invalidErrorGroupparameters)RefreshStatewarning missing tablet alias — TheRefreshStatebest-effort warning inreparentShardLocked()didn't include the tablet alias, making it hard to debug. Added the alias and a rationale comment explaining why this intentionally does not return an error (VTOrc and other callers rely on nil returns from successful reparents)Related Issue(s)
Closes: #19896
Checklist
Deployment Notes
N/A
AI Disclosure
Development assisted by Claude. Claude prepared this PR summary