tests: give vtctl/workflow tests their own port range #20150
The etcd-backed tests in go/vt/vtctl/workflow were reusing testfiles.GoVtTopoEtcd2topoPort, which is also claimed by the go/vt/topo/etcd2topo tests. Under `go test -p N` (CI uses `-p 4`), the two test binaries can run concurrently, and their etcd subprocesses race for the same port. The loser of the bind race silently keeps running with no listener (cmd.Start only reports exec failure, not bind failure), and when the winner's cleanup later kills its etcd, the other test loses its connection and hangs on the etcd v3 client's watchGRPCStream chan send.

Add a dedicated GoVtVtctlWorkflowPort entry to the central port registry and switch the workflow test to use it.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
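The bind race described above can be reproduced in miniature without etcd. The following is a minimal sketch, not code from this PR: `secondBindError` is a hypothetical helper that binds an OS-assigned loopback port and then tries to bind the same address a second time, which is the situation the second etcd subprocess ends up in.

```go
package main

import (
	"fmt"
	"net"
)

// secondBindError (hypothetical helper) listens on an OS-assigned loopback
// port, then attempts a second listen on the exact same address while the
// first listener is still open. It returns the error from that second attempt.
func secondBindError() error {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return fmt.Errorf("first listen failed: %w", err)
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		// Not expected: two listeners on one address.
		second.Close()
		return nil
	}
	return err
}

func main() {
	// The second bind is expected to fail because the address is taken.
	fmt.Println("second bind error:", secondBindError())
}
```

In-process, the failure is visible as an error value. In the tests, the same failure happens inside a child process, so the parent only learns about it if it inspects the child's output or exit status.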
Pull request overview
Fixes CI flakiness in etcd-backed workflow tests by ensuring go/vt/vtctl/workflow no longer shares the same fixed port range as go/vt/topo/etcd2topo, aligning with the central test port registry’s “no shared ports across concurrently-running packages” contract.
Changes:
- Added a dedicated `GoVtVtctlWorkflowPort` allocation in `go/testfiles/ports.go` (reserving two ports).
- Switched `go/vt/vtctl/workflow`'s `startEtcd` helper to use the new workflow-specific port base.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| go/vt/vtctl/workflow/utils_test.go | Uses the workflow-specific base port when spawning the etcd subprocess in tests. |
| go/testfiles/ports.go | Adds a new port-base constant for the workflow package to avoid cross-package port collisions. |
    GoVtTopoConsultopoPort = GoVtTopoZk2topoPort + 3

    // GoVtVtctlWorkflowPort is used by the go/vt/vtctl/workflow package for
    // etcd-backed keyspace routing rules tests. Takes two ports.
    GoVtVtctlWorkflowPort = GoVtTopoConsultopoPort + 4
Good catch — addressed in d735e85. I split the registry into one named constant per reserved port (etcd2topo now owns vtPortStart..+3 instead of pretending it only takes two; zk2topo moved to +4..+6; consultopo to +7..+10; workflow to +11..+12) and updated the call sites in etcd2topo, consultopo, and workflow to use the named constants directly instead of recomputing port + N locally. The remaining port + (i + 100*i) arithmetic in TestEtcd2TopoGetTabletsPartialResults is a separate pre-existing concern and is left alone in this PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #20150 +/- ##
===========================================
+ Coverage 69.67% 73.36% +3.69%
===========================================
Files 1614 39 -1575
Lines 216793 8605 -208188
===========================================
- Hits 151044 6313 -144731
+ Misses 65749 2292 -63457
Copilot pointed out a second cross-package overlap in the registry:
go/vt/topo/etcd2topo's TLS test binds vtPortStart+2 and vtPortStart+3,
exactly the range claimed by GoVtTopoZk2topoPort (which started at +2
and reserved three ports). The "Takes N ports" comments hid this — the
etcd2topo entry claimed two ports but the package actually consumes
four.
Replace the chained base+offset entries with one named constant per
reserved port, spread out so no two packages overlap:
6700 / 6701 etcd2topo client / peer (plaintext)
6702 / 6703 etcd2topo client / peer (TLS)
6704 .. 6706 zk2topo (zkctl.StartLocalZk consumes three consecutive
ports starting at the base)
6707 .. 6710 consultopo dns / http / serf_lan / serf_wan
6711 / 6712 vtctl/workflow etcd client / peer
Update the call sites in etcd2topo, consultopo, and vtctl/workflow to
reference the named constants directly instead of recomputing
`port + N` locally. The remaining `port + (i + 100*i)` arithmetic in
TestEtcd2TopoGetTabletsPartialResults is a separate pre-existing
concern and is left alone.
GoVtTopoConsultopoPort is removed; it was an internal test-helper
constant used only by go/vt/topo/consultopo's own test file.
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
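The layout in this commit message can be sketched as a Go constant block plus a small overlap check. This is an illustration under assumptions, not the contents of `go/testfiles/ports.go`: the constant names are the ones listed in the PR description, the pairing of names to numbers follows the table above, and `reservation`, `overlaps`, `oldLayout`, and `newLayout` are hypothetical helpers.

```go
package main

import "fmt"

// vtPortStart is the base of the range (6700 per the table above).
const vtPortStart = 6700

// One named constant per reserved port, following the table in the commit
// message. The real file may order or document these differently.
const (
	GoVtTopoEtcd2topoPort        = vtPortStart      // client (plaintext)
	GoVtTopoEtcd2topoPeerPort    = vtPortStart + 1  // peer (plaintext)
	GoVtTopoEtcd2topoTLSPort     = vtPortStart + 2  // client (TLS)
	GoVtTopoEtcd2topoTLSPeerPort = vtPortStart + 3  // peer (TLS)
	GoVtTopoZk2topoPort          = vtPortStart + 4  // zk uses 3 consecutive ports
	GoVtTopoConsultopoDNSPort    = vtPortStart + 7
	GoVtTopoConsultopoHTTPPort   = vtPortStart + 8
	GoVtTopoConsultopoSerfLANPort = vtPortStart + 9
	GoVtTopoConsultopoSerfWANPort = vtPortStart + 10
	GoVtVtctlWorkflowPort        = vtPortStart + 11 // etcd client
	GoVtVtctlWorkflowPeerPort    = vtPortStart + 12 // etcd peer
)

// reservation (hypothetical) describes a run of consecutive ports used by one owner.
type reservation struct {
	owner string
	first int
	count int
}

// overlaps returns one message per port that is claimed by more than one owner.
func overlaps(rs []reservation) []string {
	owners := map[int]string{}
	var conflicts []string
	for _, r := range rs {
		for p := r.first; p < r.first+r.count; p++ {
			if prev, taken := owners[p]; taken {
				conflicts = append(conflicts, fmt.Sprintf("%d: %s vs %s", p, prev, r.owner))
				continue
			}
			owners[p] = r.owner
		}
	}
	return conflicts
}

// oldLayout models what the packages actually consumed before this commit:
// etcd2topo used four ports from the base, zk2topo three ports from base+2.
func oldLayout() []reservation {
	return []reservation{
		{"etcd2topo", vtPortStart, 4},
		{"zk2topo", vtPortStart + 2, 3},
	}
}

// newLayout models the table in the commit message.
func newLayout() []reservation {
	return []reservation{
		{"etcd2topo plaintext", GoVtTopoEtcd2topoPort, 2},
		{"etcd2topo TLS", GoVtTopoEtcd2topoTLSPort, 2},
		{"zk2topo", GoVtTopoZk2topoPort, 3},
		{"consultopo", GoVtTopoConsultopoDNSPort, 4},
		{"vtctl/workflow", GoVtVtctlWorkflowPort, 2},
	}
}

func main() {
	fmt.Println("old layout conflicts:", overlaps(oldLayout()))
	fmt.Println("new layout conflicts:", overlaps(newLayout()))
}
```

Modelling each owner by the ports it really consumes, rather than by the "Takes N ports" comment, is what exposes the etcd2topo/zk2topo overlap on the two TLS ports.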
Thanks, @arthurschreiber ! ❤️
We shouldn't leave TestEtcd2TopoGetTabletsPartialResults on the shared etcd2topo base port (go/vt/topo/etcd2topo/server_test.go:254-259), should we?
This PR adds the central invariant that each reserved port has its own constant, but that test still starts the global etcd with startEtcd(t, 0), which resolves to GoVtTopoEtcd2topoPort/peer, then starts the first cell etcd with GoVtTopoEtcd2topoPort+(0+100*0), i.e. the same client/peer ports. That reproduces the silent bind-loser behavior described in the PR and means the first cell is actually using the global etcd endpoint, so the test is not exercising three independent topo servers. Since this PR is cleaning up port allocation, ideally we would reserve explicit named client/peer ports for the two cell etcds, or otherwise ensure the helper cannot choose the default pair for a second concurrently-running etcd.
The PR description also looks stale after d735e85; it still describes the first-commit GoVtTopoConsultopoPort/6709 layout.
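The collision described in this comment follows directly from the arithmetic. A minimal sketch, assuming the base port is 6700 (as stated in the PR description), that peer = client + 1, and that there are two cells (matching the `Cell{1,2}` constants); `oldCellPort` and `collidesWithGlobal` are hypothetical names used only for illustration:

```go
package main

import "fmt"

// basePort mirrors GoVtTopoEtcd2topoPort (6700 per the PR description).
const basePort = 6700

// oldCellPort reproduces the pre-fix expression quoted in the review:
// GoVtTopoEtcd2topoPort + (i + 100*i).
func oldCellPort(i int) int {
	return basePort + (i + 100*i)
}

// collidesWithGlobal reports whether cell i's client or peer port
// (peer = client + 1) equals either port of the global etcd, which
// runs on basePort (client) and basePort+1 (peer).
func collidesWithGlobal(i int) bool {
	client := oldCellPort(i)
	peer := client + 1
	globalClient, globalPeer := basePort, basePort+1
	return client == globalClient || client == globalPeer ||
		peer == globalClient || peer == globalPeer
}

func main() {
	for i := 0; i < 2; i++ {
		fmt.Printf("cell index %d: client=%d peer=%d collides=%v\n",
			i, oldCellPort(i), oldCellPort(i)+1, collidesWithGlobal(i))
	}
}
```

For `i = 0` the offset is zero, so the first cell requests exactly the global etcd's client and peer ports; only the later cells land on distinct ports.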
TestEtcd2TopoGetTabletsPartialResults started its global etcd at
GoVtTopoEtcd2topoPort and then started per-cell etcds at
GoVtTopoEtcd2topoPort+(i+100*i), so cell1 (offset 0) silently lost the
bind race against the global etcd and its "topo server" was actually
talking to the global instance, contradicting the test's own comment
about three independent topo servers.
Reserve explicit GoVtTopoEtcd2topoCell{1,2}{Port,PeerPort} constants,
shift the rest of the registry up by four, and have startEtcd accept
client and peer ports explicitly so callers stop relying on the helper
to compute peer = client + 1.
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
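To illustrate what "accept client and peer ports explicitly" might look like, here is a rough sketch of an argument builder for an etcd subprocess. It is not the `startEtcd` helper from this PR: `etcdArgs` is a hypothetical function, the port values passed in `main` are placeholders rather than values from `ports.go`, and only standard etcd command-line flags are used.

```go
package main

import "fmt"

// etcdArgs (hypothetical) builds the command-line arguments for a single-node
// etcd. Both ports are passed in by the caller, so nothing in here derives
// the peer port from the client port.
func etcdArgs(name, dataDir string, clientPort, peerPort int) []string {
	clientURL := fmt.Sprintf("http://localhost:%d", clientPort)
	peerURL := fmt.Sprintf("http://localhost:%d", peerPort)
	return []string{
		"--name", name,
		"--data-dir", dataDir,
		"--listen-client-urls", clientURL,
		"--advertise-client-urls", clientURL,
		"--listen-peer-urls", peerURL,
		"--initial-advertise-peer-urls", peerURL,
		"--initial-cluster", name + "=" + peerURL,
	}
}

func main() {
	// Placeholder ports for illustration only; a real test would pass the
	// named registry constants for the cell it is starting.
	args := etcdArgs("cell1", "/tmp/etcd-cell1", 6704, 6705)
	fmt.Println(args)
}
```

With this shape, each call site names both ports, which makes it harder for two etcd instances in the same test to end up on the same pair by accident.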
The conflict was a stylistic divergence: release-23.0 still uses `for i := 0; i < len(cells); i++` while main was modernized to `for i := range cells` by an unrelated PR. Keep the release-23.0 loop style and adopt the new startEtcd(client, peer) call from PR #20150.
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Description
The etcd-backed tests in `go/vt/vtctl/workflow` were reusing `testfiles.GoVtTopoEtcd2topoPort`, which is also claimed by the `go/vt/topo/etcd2topo` tests. `go/testfiles/ports.go` is the centralized "no shared ports across concurrently-running packages" registry, and its comment explicitly says "Unit tests may run at the same time, so they should not use the same ports." Two packages consuming `GoVtTopoEtcd2topoPort` (= 6700) violates that contract. The workflow test was added later and reused the etcd2topo port instead of registering its own.

CI runs `go test -p 4`, so the two test binaries can run concurrently. Whichever etcd subprocess loses the bind race silently keeps running with no listener (`cmd.Start()` only reports exec failure, not bind failure). When the winner's cleanup eventually kills its etcd, the other test loses its connection and hangs on the etcd v3 client's `watchGRPCStream` chan send, eventually timing out the whole package after the per-job timeout.

This change tightens the central registry so that each reserved port has its own named constant (no more "this base takes N ports" comments with `port + N` arithmetic at the call sites), reshuffles the existing allocations into named pairs, and adds dedicated entries for the workflow tests:

- `GoVtTopoEtcd2topoPort` / `GoVtTopoEtcd2topoPeerPort`
- `GoVtTopoEtcd2topoTLSPort` / `GoVtTopoEtcd2topoTLSPeerPort`
- `TestEtcd2TopoGetTabletsPartialResults`: `GoVtTopoEtcd2topoCell{1,2}{Port,PeerPort}` (this test previously computed cell ports as `GoVtTopoEtcd2topoPort + (i + 100*i)`, which collided with the global etcd for `i=0`: the same silent bind-loser bug, intra-package)
- `GoVtTopoZk2topoPort`
- `GoVtTopoConsultopo{DNS,HTTP,SerfLAN,SerfWAN}Port`
- `GoVtVtctlWorkflowPort` / `GoVtVtctlWorkflowPeerPort`

`startEtcd` in `go/vt/topo/etcd2topo` now takes the client and peer port explicitly (mirroring `startEtcdWithTLS`) so callers stop relying on the helper to derive `peer = client + 1`.

This is a bug fix for CI flakiness and should be backported to active release branches.
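The point that `cmd.Start()` only reports exec failure can be seen with a small standalone program. This is a sketch under assumptions, not code from the PR: it assumes a Unix-like system with `sh` on the PATH, and `startAndWait` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// startAndWait (hypothetical helper) starts the command and returns the error
// from Start and the error from Wait separately, so the two can be compared.
func startAndWait(name string, args ...string) (startErr, waitErr error) {
	cmd := exec.Command(name, args...)
	if startErr = cmd.Start(); startErr != nil {
		return startErr, nil
	}
	return nil, cmd.Wait()
}

func main() {
	// A binary that cannot be found: Start itself returns an error.
	startErr, _ := startAndWait("definitely-not-a-real-binary-xyz")
	fmt.Println("missing binary, Start error:", startErr)

	// A process that launches fine but then fails on its own: Start returns
	// nil, and the failure only shows up through Wait.
	startErr, waitErr := startAndWait("sh", "-c", "exit 3")
	var exitErr *exec.ExitError
	fmt.Println("failing process, Start error:", startErr)
	fmt.Println("failing process, Wait saw an exit error:", errors.As(waitErr, &exitErr))
}
```

In the scenario this PR describes, the losing etcd process kept running, so even `Wait` would not have surfaced the problem promptly. Giving each package its own ports removes the race instead of trying to detect it after the fact.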
Related Issue(s)
None.
Checklist
No new test is needed: the existing `TestEtcd2Topo` and `TestConcurrentKeyspaceRoutingRulesUpdates` together exercise the cross-package failure mode, and `TestEtcd2TopoGetTabletsPartialResults` exercises the intra-package one.

Verified locally: before this change the workflow package hangs until the timeout; after, both packages pass in ~30s.
Deployment Notes
None — this only affects unit tests.
AI Disclosure
This PR was written primarily by Claude Code — I just provided direction.