[release-22.0] CI: deflakes, fork runner fallback, codecov / milestone gating#20197
Open
arthurschreiber wants to merge 7 commits into
Open
[release-22.0] CI: deflakes, fork runner fallback, codecov / milestone gating#20197arthurschreiber wants to merge 7 commits into
arthurschreiber wants to merge 7 commits into
Conversation
When the server's `VerifyPeerCertificate` returns "Certificate revoked", Go's TLS sends a `bad_certificate` alert and then closes. Whether the client reads the alert or the TCP RST first depends on kernel TCP flush timing — so the test would sometimes see `remote error: tls: bad certificate` and sometimes `connection reset by peer` / `broken pipe`. Both outcomes mean the revoked certificate was rejected, which is what the test cares about. Accept any of the three error strings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com> (cherry picked from commit 8136179)
TestTrackerNoLock pushes 500,000 messages onto a channel and asserts each send completes within 10ms. Under CI load that's tight enough to flake regularly, surfacing as: tracker_test.go:199: failed to send health check to tracker Match the fix from PR vitessio#18317: bump the per-send timeout to 50ms. Backport of the go/vt/vtgate/schema/tracker_test.go slice of vitessio#18317 (the materializer_test.go slice is for an unrelated test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com> (cherry picked from commit 8d3e545)
Forks don't have access to the `gh-hosted-runners-16cores-1-24.04` runner pool, so workflows that hardcoded it would never schedule. Gate the runner selection on `github.repository` so forks fall back to `ubuntu-24.04` automatically. `cluster_endtoend.yml` and `unit_test.yml` already chose between the larger runner and `ubuntu-24.04` via a matrix value (`matrix.needs`/`matrix.race`); AND the `github.repository` check into that expression so forks never request the larger runner. The remaining `upgrade_downgrade_*`, `local_example`, and `region_example` workflows hardcoded the larger runner, so they get the same ternary inline. `docker_build_images.yml` is left as-is because both of its jobs already gate on `if: github.repository == 'vitessio/vitess'`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The Upload step in codecov.yml has `fail_ci_if_error: true`, so when the workflow runs on a fork (without `secrets.CODECOV_TOKEN`) the upload returns "Token required - not valid tokenless upload" and the whole job goes red even though the test suite passed. Running ~16 minutes of unit tests just to throw away the coverage report on a fork is also wasted CI time. AND `github.repository == 'vitessio/vitess'` into the existing job-level `if` so forks skip the job entirely. The existing `Backport`-label exclusion is preserved for upstream runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The previous deflake commit (0ea371f, the [release-22.0] cherry-pick of upstream's PR vitessio#19388) ported the EventuallyWithT callback verbatim from upstream, which uses `require.X(c, ...)`. That works on upstream's testify v1.11+, but this branch is pinned to testify v1.10.0, where `CollectT.FailNow` is implemented as `panic("Assertion failed")` and `EventuallyWithT` doesn't recover from it — so the first failed poll crashes the goroutine. The release-21.0 job log showed exactly that: panic: Assertion failed testify/assert.(*CollectT).FailNow ... EventuallyWithT.func1 ... FAIL vitess.io/vitess/go/mysql Replace `require.X(c, ...)` with `assert.X(c, ...)` (which just flags the CollectT instead of panicking) and guard the `entries[0]` indexing on `assert.NotEmpty`, otherwise a `nil[0]` slice access escapes the same way. Hoisted the polling loop into a `waitForReload` helper since both hupTest and hupTestWithRotation now use the same body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit 248182c) (cherry picked from commit a814579) Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
`vtop_example.yml` hardcoded `oracle-vm-8cpu-32gb-x86-64`, which forks can't schedule on, so the workflow would never start. Apply the same `github.repository`-gated expression used for the `gh-hosted-runners` fallback so forks fall back to `ubuntu-24.04`. The test spins up minikube + vtop, which probably won't fit in ubuntu-24.04's 4-core/16GB envelope — but failing at runtime with a log is more useful than never scheduling. `codecov.yml` also uses the oracle runner but is already job-gated on `github.repository`, so its `runs-on` line never executes on forks and doesn't need this treatment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
`assign_milestone.yml` triggers on `pull_request_target: [opened]` and greps `go/vt/servenv/version.go` for a `versionName` to assign a matching milestone via `gh pr edit`. On a fork, there is no matching milestone, so the job fails on every PR opened against the fork. Add a job-level `if: github.repository == 'vitessio/vitess'` so the job only runs on the upstream repo. Other forks can opt in by setting up their own milestones and removing the gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Contributor
There was a problem hiding this comment.
Pull request overview
CI-only stabilization backport for the EOL release-22.0 branch so that downstream forks can run CI green when backporting fixes. Production code is untouched; changes are limited to workflow gating and a few flaky-test relaxations.
Changes:
- Gate custom GitHub-hosted runners (and
vtop_example's Oracle VM runner) ongithub.repository == 'vitessio/vitess'across many workflows, falling back toubuntu-24.04for forks; also gate the Code Coverage and Assign Milestone jobs to upstream only. - Relax
TestTLSRequired's revoked-cert assertion to acceptconnection reset by peer/broken pipein addition tobad certificate. - Extract
waitForReloadforTestStaticConfigHUPusingassert.X(notrequire.X) insideEventuallyWithTto avoid testify v1.10CollectT.FailNowpanics, and bumpTestTrackerNoLockchannel-send timeout from 10ms → 50ms.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/assign_milestone.yml | Skip Assign Milestone job on forks. |
| .github/workflows/codecov.yml | Skip Code Coverage job on forks (preserves Backport-label gating). |
| .github/workflows/cluster_endtoend.yml | Gate larger runner selection on upstream. |
| .github/workflows/unit_test.yml | Gate race-job runner on upstream. |
| .github/workflows/vtop_example.yml | Fall back to ubuntu-24.04 for forks. |
| .github/workflows/local_example.yml | Same runner fallback. |
| .github/workflows/region_example.yml | Same runner fallback. |
| .github/workflows/upgrade_downgrade_test_*.yml (12 files) | Same runner fallback across upgrade/downgrade workflows. |
| go/mysql/server_test.go | Accept alternate close errors in TestTLSRequired revoked-cert path. |
| go/mysql/auth_server_static_test.go | Extract waitForReload; use assert inside EventuallyWithT to avoid v1.10 panic. |
| go/vt/vtgate/schema/tracker_test.go | Bump TestTrackerNoLock send timeout 10ms → 50ms. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
5 tasks
timvaillancourt
approved these changes
May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
release-22.0is long EOL and not supported. However, some users still run it internally, or have to run it for a short window as part of an upgrade path to a supported Vitess version. Those users sometimes need to backport fixes from newer branches into their own forks, and right now that's painful: CI on therelease-22.0branch is broken in several ways for any repo that isn'tvitessio/vitess(custom runners that forks can't schedule on, a Code Coverage job that goes red without an upload token, an Assign Milestone job that fails because the fork has no matching milestones, plus a couple of flaky tests).This PR bundles the CI-only fixes needed to get
release-22.0green again so those backports can land. No production code is changed — everything is.github/, test helpers, or test code.To be clear: merging this is not a change in support status.
release-22.0remains EOL. This is a courtesy to make life easier for users still on the branch by accident or by upgrade-path necessity.The same set of fixes was opened against
release-20.0in #20195 andrelease-21.0in #20196.Fork-runner / infra fallbacks
ci: fall back to ubuntu-24.04 outside vitessio/vitess— forks can't schedule ongh-hosted-runners-16cores-1-24.04; gate runner selection ongithub.repository(folded into the existingmatrix.needs/matrix.raceternaries where they already chose between the larger runner andubuntu-24.04).ci: fall back to ubuntu-24.04 for vtop_example outside vitessio/vitess—vtop_example.ymlhardcodedoracle-vm-8cpu-32gb-x86-64; apply the samegithub.repositoryternary. Won't necessarily pass onubuntu-24.04(minikube + vtop is tight on 4 cores), but failing at runtime with a log is more useful than never scheduling.Code Coverage gating
ci: gate Code Coverage on github.repository—fail_ci_if_error: true+ noCODECOV_TOKENon forks meant the job always went red. ANDgithub.repository == 'vitessio/vitess'into the existing job-levelif; theBackport-label exclusion is preserved for upstream runs.Milestone job gating
ci: gate Assign Milestone on github.repository—assign_milestone.ymlruns onpull_request_target: [opened]and tries to assign a milestone that doesn't exist on the fork, so every fork PR fails this job. Job-levelif: github.repository == 'vitessio/vitess'skips it cleanly.Test deflakes (backports / cherry-picks)
go/mysql: relax TestTLSRequired revoked-cert assertion— acceptconnection reset by peer/broken pipealongsidebad certificate; all three mean the revoked cert was rejected.go/mysql: stop TestStaticConfigHUP panicking inside EventuallyWithT— follow-up to upstream's cherry-pick of CI: Deflake Code Coverage workflow #19388: this branch is pinned to testify v1.10.0, whereCollectT.FailNowpanics andEventuallyWithTdoesn't recover. Useassert.X(c, ...)instead ofrequire.X(c, ...).go/vt/vtgate/schema: bump TestTrackerNoLock channel-send timeout— backport of thetracker_test.goslice of flaky test fix TestTrackerNoLock and TestCreateLookupVindexMultipleCreate #18317 (10ms → 50ms).Related Issue(s)
None — these are CI-only stabilization fixes.
Checklist
Deployment Notes
None — CI-only changes.