
reparentutil: fix goroutine count and improve logging#19888

Merged
timvaillancourt merged 4 commits into
mainfrom
reparentutil-bugs-v2
Apr 20, 2026
Conversation

@timvaillancourt

@timvaillancourt timvaillancourt commented Apr 15, 2026

Contributor

Description

Two small fixes found during an audit of the reparentutil package:

  1. Incorrect goroutine count in stopReplicationAndBuildStatusMaps(): numGoRoutines was computed as len(tabletMap) - ignoredTablets.Len(), but ignoredTablets can contain aliases not present in tabletMap (e.g., stale entries), so the subtraction can undercount the goroutines actually launched. The count is now incremented directly in the launch loop. Also added an early FAILED_PRECONDITION return when tablets exist but all are excluded by IgnoreReplicas, which previously produced requiredSuccesses = -1 (invalid ErrorGroup parameters).

  2. RefreshState warning missing tablet alias: the RefreshState best-effort warning in reparentShardLocked() didn't include the tablet alias, making failures hard to debug. Added the alias and a rationale comment explaining why this intentionally does not return an error (VTOrc and other callers rely on nil returns from successful reparents).
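To make the first fix concrete, here is a minimal sketch of the two counting strategies. It is not the Vitess code: plain string aliases and the helper names oldCount and launchCount are assumptions made for illustration, and the aliases are invented.

```go
package main

import "fmt"

// oldCount mirrors the subtraction described above. It is only correct when
// every ignored alias is also a key in tabletMap. (Hypothetical helper.)
func oldCount(tabletMap, ignored map[string]struct{}) int {
	return len(tabletMap) - len(ignored)
}

// launchCount counts inside the same loop that skips ignored aliases, so the
// result always matches the number of goroutines that would be started.
// (Hypothetical helper.)
func launchCount(tabletMap, ignored map[string]struct{}) int {
	n := 0
	for alias := range tabletMap {
		if _, skip := ignored[alias]; skip {
			continue
		}
		n++ // the real code would start a goroutine here
	}
	return n
}

func main() {
	tabletMap := map[string]struct{}{"zone1-100": {}, "zone1-101": {}, "zone1-102": {}}
	// "zone1-999" is a stale alias: ignored, but not present in tabletMap.
	ignored := map[string]struct{}{"zone1-102": {}, "zone1-999": {}}

	fmt.Println("subtraction:", oldCount(tabletMap, ignored))   // 3 - 2 = 1
	fmt.Println("loop count:", launchCount(tabletMap, ignored)) // 2 tablets are contacted
}
```

With one stale alias the subtraction reports one goroutine while two would be launched, which is the mismatch this change removes.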

Related Issue(s)

Closes: #19896

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

N/A

AI Disclosure

Development assisted by Claude. Claude prepared this PR summary.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings April 15, 2026 19:18
@vitess-bot vitess-bot Bot added the NeedsWebsiteDocsUpdate, NeedsDescriptionUpdate, NeedsIssue, and NeedsBackportReason labels Apr 15, 2026
@vitess-bot

vitess-bot Bot commented Apr 15, 2026

Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@github-actions github-actions Bot added this to the v25.0.0 milestone Apr 15, 2026

Copilot AI left a comment

Contributor


Pull request overview

This PR applies a set of safety and observability fixes in reparentutil, primarily around replication-stop coordination, replica reparent concurrency, and troubleshooting logs during planned reparents.

Changes:

  • Adds a precondition guard in stopReplicationAndBuildStatusMaps to prevent invalid “no tablets to act on” scenarios.
  • Improves mutex safety in ERS replica handling by using defer to ensure unlock on early exits/panics.
  • Enhances PRS logging and adds rationale around best-effort RefreshState behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
go/vt/vtctl/reparentutil/replication.go | Adds guard logic around the computed goroutine count when stopping replication across tablets.
go/vt/vtctl/reparentutil/planned_reparenter.go | Adds rationale for best-effort RefreshState and includes tablet alias in warning logs.
go/vt/vtctl/reparentutil/emergency_reparenter.go | Uses defer for mutex unlock and clarifies background goroutine / cancellation behavior.

Comment thread go/vt/vtctl/reparentutil/replication.go Outdated
@timvaillancourt timvaillancourt changed the title reparentutil: fix mutex safety, add ignored-tablets guard, improve logging reparentutil: fix mutex safety, add ignored-tablets guard, improve logging Apr 15, 2026
@timvaillancourt timvaillancourt added the Type: Enhancement, Component: vtctl, and Type: Internal Cleanup labels and removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, NeedsIssue, and NeedsBackportReason labels Apr 15, 2026
@timvaillancourt timvaillancourt self-assigned this Apr 15, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings April 15, 2026 19:28
@timvaillancourt timvaillancourt changed the title from "reparentutil: fix mutex safety, add ignored-tablets guard, improve logging" to "reparentutil: fix goroutine count and improve logging" Apr 15, 2026
@timvaillancourt timvaillancourt changed the title reparentutil: fix goroutine count and improve logging reparentutil: fix goroutine count and improve logging Apr 15, 2026

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@codecov

codecov Bot commented Apr 15, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.38%. Comparing base (70c7a72) to head (1ac5b6b).
⚠️ Report is 188 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19888       +/-   ##
===========================================
+ Coverage   69.67%   91.38%   +21.70%     
===========================================
  Files        1614        9     -1605     
  Lines      216793     1311   -215482     
===========================================
- Hits       151044     1198   -149846     
+ Misses      65749      113    -65636     
Flag Coverage Δ
partial 91.38% <100.00%> (?)


@timvaillancourt timvaillancourt added the Backport to: release-24.0 label Apr 15, 2026
@timvaillancourt timvaillancourt marked this pull request as ready for review April 15, 2026 19:37

@mattlord mattlord left a comment

Member


The changes seem OK (not sure why this warrants a PR by itself as there is an org/project cost for PRs), but I don't like having PRs w/o an issue and w/o any tests which demonstrate the problem and the fix.

The PR changes the goroutine counting logic and adds a new early FAILED_PRECONDITION branch in replication.go (line 371), but there’s no corresponding coverage added in replication_test.go (line 272). In particular, I’d want two explicit test cases:

  1. IgnoreReplicas contains a stale alias that is not present in tabletMap
  2. All real tablets are filtered out by IgnoreReplicas, so the new precondition path fires

That’s the exact behavior this PR seems to be fixing, and right now it’s still unprotected by regression tests.

@arthurschreiber

Member

The changes LGTM, but as @mattlord pointed out, tests that cover this would be nice.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

@mattlord mattlord left a comment

Member


Thanks, @timvaillancourt !

@nickvanw nickvanw left a comment

Collaborator


LGTM. The counting-in-loop fix is the right shape — the previous len-based subtraction could silently mismatch reality when ignoredTablets had stale aliases, leaking goroutines on the unbuffered errChan. Tests cover the two scenarios that matter.

@timvaillancourt timvaillancourt merged commit 12ede52 into main Apr 20, 2026
184 of 192 checks passed
@timvaillancourt timvaillancourt deleted the reparentutil-bugs-v2 branch April 20, 2026 22:57
timvaillancourt added a commit that referenced this pull request Apr 21, 2026
…19888) (#19922)

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Derek Perkins <derek@nozzle.io>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: MukundaKatta <mukundakatta@users.noreply.github.com>
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Brett Wines <bwines@slack-corp.com>
Signed-off-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Derek Perkins <derek@nozzle.io>
Co-authored-by: Arthur Schreiber <arthurschreiber@github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>
Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: vitess-go-upgrade-bot <139342327+vitess-bot@users.noreply.github.com>
Co-authored-by: frouioui <35779988+frouioui@users.noreply.github.com>
Co-authored-by: Mukunda Rao Katta <mukunda.vjcs6@gmail.com>
Co-authored-by: Nick Van Wiggeren <nickvanw@users.noreply.github.com>
Co-authored-by: Arthur Schreiber <arthur@planetscale.com>
Co-authored-by: Brett Wines <bwines@salesforce.com>
Co-authored-by: Claude <svc-devxp-claude@slack-corp.com>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Harshit Gangal <harshit@planetscale.com>
timvaillancourt added a commit to timvaillancourt/vitess that referenced this pull request May 12, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Labels

Backport to: release-24.0, Component: vtctl, Type: Enhancement, Type: Internal Cleanup

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: reparentutil: goroutine miscount in stopReplicationAndBuildStatusMaps

5 participants