Skip to content

vtctld: fix GetTablets stale primary report under --tablet-alias filter#20125

Open
Taeknology wants to merge 4 commits into
vitessio:mainfrom
Taeknology:fix/19898-gettablets-stale-primary-alias
Open

vtctld: fix GetTablets stale primary report under --tablet-alias filter#20125
Taeknology wants to merge 4 commits into
vitessio:mainfrom
Taeknology:fix/19898-gettablets-stale-primary-alias

Conversation

@Taeknology
Copy link
Copy Markdown

Description

GetTablets --tablet-alias was reporting stale former primaries as primary when the real current primary wasn't in the request. The fix consults the shard record via GetShard instead of trusting the result-set max, hybridised with the existing seed so transient cases (no recorded primary yet) stay best-effort.

The shard-filter path (--keyspace ... --shard ...) was correct already because the result set always includes the real primary. The alias-filter path doesn't — hence the bug.

Related Issue(s)

Fixes #19898

Reproducer tests by @mhamza15 (first commit, cherry-picked verbatim).

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Backport candidates: release-24.0 and release-23.0 — wrong primary reports from --tablet-alias can confuse operator automation. Tag if you agree.

Deployment Notes

Stale primaries queried by --tablet-alias now report as unknown instead of primary. Review any monitoring that relied on the old behavior. Adds one parallel GetShard per unique shard in the result.

AI Disclosure

Most of this was written by Claude Code - I just provided direction.

mhamza15 and others added 2 commits May 19, 2026 00:46
The alias-filtered `GetTablets` path has no regression coverage for stale primaries, so it can report a stale tablet as `PRIMARY` when queried alone or when grouped with stale tablets from other shards. This adds focused red tests for the single-alias, cross-shard, and mixed-alias cases.

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
(cherry picked from commit c4c308a)
Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
Fixes vitessio#19898

Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 18, 2026 16:03
@github-actions github-actions Bot added this to the v25.0.0 milestone May 18, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 18, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 18, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a bug in vtctld GetTablets where filtering by --tablet-alias could misreport stale former primaries as primary, because the previous code computed the "true primary" only as the max PrimaryTermStartTime within the result set (which excludes the real current primary when only stale aliases are queried). The fix consults GetShard per unique shard to obtain the authoritative term start time, falling back to the existing seed when the shard record is zero, and parallelizes those lookups with strict/non-strict failure modes.

Changes:

  • Replace the single truePrimaryTimestamp with a per-shard map seeded from result-set primaries, then overwrite each entry with Shard.PrimaryTermStartTime fetched via parallel GetShard calls.
  • Skip stale-primary adjustment for shards whose GetShard failed; aggregate errors with errors.Join, returning them in strict mode and logging a warning otherwise.
  • Add table-driven tests covering single-alias, multi-shard alias, all-current-primaries, and strict/non-strict GetShard failure scenarios, plumbing addTabletOptions and a factorySetup hook through the test harness.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
go/vt/vtctl/grpcvtctldserver/server.go Per-shard primary term resolution via concurrent GetShard with strict/non-strict error handling.
go/vt/vtctl/grpcvtctldserver/server_test.go New test cases for stale-primary alias filtering and topo-failure paths; harness now supports AddTabletOptions and topo factory hooks.

Comment thread go/vt/vtctl/grpcvtctldserver/server_test.go Outdated
The prior name 'tablet alias filtering does not stale primaries across shards' used 'stale' as a verb and was grammatically incorrect. Rename to mirror the sibling case 'stale primaries across shards stay unknown' for a consistent 'X stay Y' pattern.

Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
…alias

Adds a new VTCtld minor-changes section with an entry describing the user-visible behavior change introduced by the fix: stale former primaries returned by 'vtctldclient GetTablets --tablet-alias' now report as 'unknown' instead of 'primary'. Includes operator-impact framing, mechanism summary, and notes on the GetShard fallback and --strict failure modes.

Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
@Taeknology
Copy link
Copy Markdown
Author

Status of Needs* labels

NeedsIssue — satisfied. PR body links Fixes #19898.

NeedsWebsiteDocsUpdate — satisfied by acabf849f8
(adds VTCtld minor-changes entry to
changelog/25.0/25.0.0/summary.md).

@Taeknology
Copy link
Copy Markdown
Author

Backport feasibility (verified locally)

I cherry-picked the fix onto both candidate branches and ran the
tests:

  • release-24.0: cherry-pick is clean (auto-merge handles a small
    import-block diff). All 5 new test cases and the full
    TestGetTablets pass with no regression.
  • release-23.0: cherry-pick has two trivial 3-way merges (import
    block, and the test setup harness — t.Context() +
    NewServerAndFactory, both available on Go 1.25). One adaptation is
    required: the fix uses the slog-style
    log.Warn(..., slog.Any/Int(...)) API introduced in Add structured logging with slog #19327, which
    was backported to release-24.0 but not to release-23.0. Replacing
    the call with
    log.Warningf("...affected_shards=%d, error=%v", ...) and dropping
    the log/slog import builds cleanly. All 5 new tests and the full
    TestGetTablets pass with no regression.

If you agree, please add:

  • Backport to: release-24.0
  • Backport to: release-23.0

I'll handle the log.Warn adaptation on the draft backport PR the bot
opens for release-23.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: GetTablets misreports stale primaries with --tablet-alias

3 participants