Skip to content

tests: give vtctl/workflow tests their own port range#20150

Merged
arthurschreiber merged 3 commits into
mainfrom
arthur/fix-etcd2topo-port-collision
May 20, 2026
Merged

tests: give vtctl/workflow tests their own port range#20150
arthurschreiber merged 3 commits into
mainfrom
arthur/fix-etcd2topo-port-collision

Conversation

@arthurschreiber
Copy link
Copy Markdown
Member

@arthurschreiber arthurschreiber commented May 20, 2026

Description

The etcd-backed tests in go/vt/vtctl/workflow were reusing testfiles.GoVtTopoEtcd2topoPort, which is also claimed by the go/vt/topo/etcd2topo tests.

go/testfiles/ports.go is the centralized "no shared ports across concurrently-running packages" registry, and its comment explicitly says "Unit tests may run at the same time, so they should not use the same ports." Two packages consuming GoVtTopoEtcd2topoPort (= 6700) violates that contract. The workflow test was added later and reused the etcd2topo port instead of registering its own.

CI runs go test -p 4, so the two test binaries can run concurrently. Whichever etcd subprocess loses the bind race silently keeps running with no listener (cmd.Start() only reports exec failure, not bind failure). When the winner's cleanup eventually kills its etcd, the other test loses its connection and hangs on the etcd v3 client's watchGRPCStream chan send — eventually timing out the whole package after the per-job timeout.

This change tightens the central registry so that each reserved port has its own named constant (no more "this base takes N ports" comments with port + N arithmetic at the call sites), reshuffles the existing allocations into named pairs, and adds dedicated entries for the workflow tests:

  • etcd2topo plaintext: GoVtTopoEtcd2topoPort / GoVtTopoEtcd2topoPeerPort
  • etcd2topo TLS: GoVtTopoEtcd2topoTLSPort / GoVtTopoEtcd2topoTLSPeerPort
  • etcd2topo cell etcds for TestEtcd2TopoGetTabletsPartialResults: GoVtTopoEtcd2topoCell{1,2}{Port,PeerPort} (this test previously computed cell ports as GoVtTopoEtcd2topoPort + (i + 100*i), which collided with the global etcd for i=0 — same silent bind-loser bug, intra-package)
  • zk2topo: GoVtTopoZk2topoPort
  • consultopo: GoVtTopoConsultopo{DNS,HTTP,SerfLAN,SerfWAN}Port
  • workflow: GoVtVtctlWorkflowPort / GoVtVtctlWorkflowPeerPort

startEtcd in go/vt/topo/etcd2topo now takes the client and peer port explicitly (mirroring startEtcdWithTLS) so callers stop relying on the helper to derive peer = client + 1.

This is a bug fix for CI flakiness — should be backported to active release branches.

Related Issue(s)

None.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

No new test is needed — the existing TestEtcd2Topo and TestConcurrentKeyspaceRoutingRulesUpdates together exercise the cross-package failure mode, and TestEtcd2TopoGetTabletsPartialResults exercises the intra-package one. Verified locally with:

go test -p 2 -timeout 180s \
  -run "^(TestEtcd2Topo|TestConcurrentKeyspaceRoutingRulesUpdates)$" \
  ./go/vt/topo/etcd2topo/ ./go/vt/vtctl/workflow/

Before this change the workflow package hangs until the timeout; after, both packages pass in ~30s.

Deployment Notes

None — this only affects unit tests.

AI Disclosure

This PR was written primarily by Claude Code — I just provided direction.

The etcd-backed tests in go/vt/vtctl/workflow were reusing
testfiles.GoVtTopoEtcd2topoPort, which is also claimed by the
go/vt/topo/etcd2topo tests. Under `go test -p N` (CI uses `-p 4`),
the two test binaries can run concurrently, and their etcd
subprocesses race for the same port. The loser of the bind race
silently keeps running with no listener (cmd.Start only reports
exec failure, not bind failure), and when the winner's cleanup
later kills its etcd, the other test loses its connection and
hangs on the etcd v3 client's watchGRPCStream chan send.

Add a dedicated GoVtVtctlWorkflowPort entry to the central port
registry and switch the workflow test to use it.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 20, 2026 15:09
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 20, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@github-actions github-actions Bot added this to the v25.0.0 milestone May 20, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes CI flakiness in etcd-backed workflow tests by ensuring go/vt/vtctl/workflow no longer shares the same fixed port range as go/vt/topo/etcd2topo, aligning with the central test port registry’s “no shared ports across concurrently-running packages” contract.

Changes:

  • Added a dedicated GoVtVtctlWorkflowPort allocation in go/testfiles/ports.go (reserving two ports).
  • Switched go/vt/vtctl/workflow’s startEtcd helper to use the new workflow-specific port base.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
go/vt/vtctl/workflow/utils_test.go Uses the workflow-specific base port when spawning the etcd subprocess in tests.
go/testfiles/ports.go Adds a new port-base constant for the workflow package to avoid cross-package port collisions.

Comment thread go/testfiles/ports.go Outdated
Comment on lines +46 to +50
GoVtTopoConsultopoPort = GoVtTopoZk2topoPort + 3

// GoVtVtctlWorkflowPort is used by the go/vt/vtctl/workflow package for
// etcd-backed keyspace routing rules tests. Takes two ports.
GoVtVtctlWorkflowPort = GoVtTopoConsultopoPort + 4
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in d735e85. I split the registry into one named constant per reserved port (etcd2topo now owns vtPortStart..+3 instead of pretending it only takes two; zk2topo moved to +4..+6; consultopo to +7..+10; workflow to +11..+12) and updated the call sites in etcd2topo, consultopo, and workflow to use the named constants directly instead of recomputing port + N locally. The remaining port + (i + 100*i) arithmetic in TestEtcd2TopoGetTabletsPartialResults is a separate pre-existing concern and is left alone in this PR.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.36%. Comparing base (70c7a72) to head (ba885b8).
⚠️ Report is 267 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20150       +/-   ##
===========================================
+ Coverage   69.67%   73.36%    +3.69%     
===========================================
  Files        1614       39     -1575     
  Lines      216793     8605   -208188     
===========================================
- Hits       151044     6313   -144731     
+ Misses      65749     2292    -63457     
Flag Coverage Δ
partial 73.36% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@arthurschreiber arthurschreiber changed the title testfiles: give vtctl/workflow tests their own port range tests: give vtctl/workflow tests their own port range May 20, 2026
@arthurschreiber arthurschreiber added Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 20, 2026
@arthurschreiber arthurschreiber marked this pull request as ready for review May 20, 2026 15:33
Copilot pointed out a second cross-package overlap in the registry:
go/vt/topo/etcd2topo's TLS test binds vtPortStart+2 and vtPortStart+3,
exactly the range claimed by GoVtTopoZk2topoPort (which started at +2
and reserved three ports). The "Takes N ports" comments hid this — the
etcd2topo entry claimed two ports but the package actually consumes
four.

Replace the chained base+offset entries with one named constant per
reserved port, spread out so no two packages overlap:

  6700 / 6701  etcd2topo client / peer (plaintext)
  6702 / 6703  etcd2topo client / peer (TLS)
  6704 .. 6706 zk2topo  (zkctl.StartLocalZk consumes three consecutive
                         ports starting at the base)
  6707 .. 6710 consultopo dns / http / serf_lan / serf_wan
  6711 / 6712  vtctl/workflow etcd client / peer

Update the call sites in etcd2topo, consultopo, and vtctl/workflow to
reference the named constants directly instead of recomputing
`port + N` locally. The remaining `port + (i + 100*i)` arithmetic in
TestEtcd2TopoGetTabletsPartialResults is a separate pre-existing
concern and is left alone.

GoVtTopoConsultopoPort is removed; it was an internal test-helper
constant used only by go/vt/topo/consultopo's own test file.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
@arthurschreiber arthurschreiber enabled auto-merge (squash) May 20, 2026 15:44
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @arthurschreiber ! ❤️

We shouldn’t leave TestEtcd2TopoGetTabletsPartialResults on the shared etcd2topo base port in go/vt/topo/etcd2topo/server_test.go:254-259 should we?

This PR adds the central invariant that each reserved port has its own constant, but that test still starts the global etcd with startEtcd(t, 0), which resolves to GoVtTopoEtcd2topoPort/peer, then starts the first cell etcd with GoVtTopoEtcd2topoPort+(0+100*0), i.e. the same client/peer ports. That reproduces the silent bind-loser behavior described in the PR and means the first cell is actually using the global etcd endpoint, so the test is not exercising three independent topo servers. Since this PR is cleaning up port allocation, ideally we would reserve explicit named client/peer ports for the two cell etcds, or otherwise ensure the helper cannot choose the default pair for a second concurrently-running etcd.

The PR description also looks stale after d735e85; it still describes the first-commit GoVtTopoConsultopoPort/6709 layout.

TestEtcd2TopoGetTabletsPartialResults started its global etcd at
GoVtTopoEtcd2topoPort and then started per-cell etcds at
GoVtTopoEtcd2topoPort+(i+100*i), so cell1 (offset 0) silently lost the
bind race against the global etcd and its "topo server" was actually
talking to the global instance, contradicting the test's own comment
about three independent topo servers.

Reserve explicit GoVtTopoEtcd2topoCell{1,2}{Port,PeerPort} constants,
shift the rest of the registry up by four, and have startEtcd accept
client and peer ports explicitly so callers stop relying on the helper
to compute peer = client + 1.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 20, 2026 19:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@arthurschreiber arthurschreiber merged commit 50e632b into main May 20, 2026
113 checks passed
@arthurschreiber arthurschreiber deleted the arthur/fix-etcd2topo-port-collision branch May 20, 2026 19:46
arthurschreiber pushed a commit that referenced this pull request May 21, 2026
…20150) (#20156)

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
arthurschreiber added a commit that referenced this pull request May 22, 2026
The conflict was a stylistic divergence: release-23.0 still uses
`for i := 0; i < len(cells); i++` while main was modernized to
`for i := range cells` by an unrelated PR. Keep the release-23.0
loop style and adopt the new startEtcd(client, peer) call from PR

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
#20150.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 Component: VReplication Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants