Skip to content

[release-22.0] CI: deflakes, fork runner fallback, codecov / milestone gating#20197

Open
arthurschreiber wants to merge 7 commits into
vitessio:release-22.0from
arthurschreiber:release-22.0-ci-fixes
Open

[release-22.0] CI: deflakes, fork runner fallback, codecov / milestone gating#20197
arthurschreiber wants to merge 7 commits into
vitessio:release-22.0from
arthurschreiber:release-22.0-ci-fixes

Conversation

@arthurschreiber
Copy link
Copy Markdown
Member

Description

release-22.0 is long EOL and not supported. However, some users still run it internally, or have to run it for a short window as part of an upgrade path to a supported Vitess version. Those users sometimes need to backport fixes from newer branches into their own forks, and right now that's painful: CI on the release-22.0 branch is broken in several ways for any repo that isn't vitessio/vitess (custom runners that forks can't schedule on, a Code Coverage job that goes red without an upload token, an Assign Milestone job that fails because the fork has no matching milestones, plus a couple of flaky tests).

This PR bundles the CI-only fixes needed to get release-22.0 green again so those backports can land. No production code is changed — everything is .github/, test helpers, or test code.

To be clear: merging this is not a change in support status. release-22.0 remains EOL. This is a courtesy to make life easier for users still on the branch by accident or by upgrade-path necessity.

The same set of fixes was opened against release-20.0 in #20195 and release-21.0 in #20196.

Fork-runner / infra fallbacks

  • f434887 ci: fall back to ubuntu-24.04 outside vitessio/vitess — forks can't schedule on gh-hosted-runners-16cores-1-24.04; gate runner selection on github.repository (folded into the existing matrix.needs/matrix.race ternaries where they already chose between the larger runner and ubuntu-24.04).
  • e1a5441 ci: fall back to ubuntu-24.04 for vtop_example outside vitessio/vitessvtop_example.yml hardcoded oracle-vm-8cpu-32gb-x86-64; apply the same github.repository ternary. Won't necessarily pass on ubuntu-24.04 (minikube + vtop is tight on 4 cores), but failing at runtime with a log is more useful than never scheduling.

Code Coverage gating

  • 493869e ci: gate Code Coverage on github.repositoryfail_ci_if_error: true + no CODECOV_TOKEN on forks meant the job always went red. AND github.repository == 'vitessio/vitess' into the existing job-level if; the Backport-label exclusion is preserved for upstream runs.

Milestone job gating

  • cc3a0cb ci: gate Assign Milestone on github.repositoryassign_milestone.yml runs on pull_request_target: [opened] and tries to assign a milestone that doesn't exist on the fork, so every fork PR fails this job. Job-level if: github.repository == 'vitessio/vitess' skips it cleanly.

Test deflakes (backports / cherry-picks)

  • 0cffe86 go/mysql: relax TestTLSRequired revoked-cert assertion — accept connection reset by peer / broken pipe alongside bad certificate; all three mean the revoked cert was rejected.
  • 487d728 go/mysql: stop TestStaticConfigHUP panicking inside EventuallyWithT — follow-up to upstream's cherry-pick of CI: Deflake Code Coverage workflow #19388: this branch is pinned to testify v1.10.0, where CollectT.FailNow panics and EventuallyWithT doesn't recover. Use assert.X(c, ...) instead of require.X(c, ...).
  • e91953f go/vt/vtgate/schema: bump TestTrackerNoLock channel-send timeout — backport of the tracker_test.go slice of flaky test fix TestTrackerNoLock and TestCreateLookupVindexMultipleCreate #18317 (10ms → 50ms).

Related Issue(s)

None — these are CI-only stabilization fixes.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

None — CI-only changes.

arthurschreiber and others added 7 commits May 27, 2026 06:46
When the server's `VerifyPeerCertificate` returns "Certificate revoked",
Go's TLS sends a `bad_certificate` alert and then closes. Whether the
client reads the alert or the TCP RST first depends on kernel TCP
flush timing — so the test would sometimes see
`remote error: tls: bad certificate` and sometimes
`connection reset by peer` / `broken pipe`.

Both outcomes mean the revoked certificate was rejected, which is what
the test cares about. Accept any of the three error strings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
(cherry picked from commit 8136179)
TestTrackerNoLock pushes 500,000 messages onto a channel and asserts
each send completes within 10ms. Under CI load that's tight enough to
flake regularly, surfacing as:

  tracker_test.go:199: failed to send health check to tracker

Match the fix from PR vitessio#18317: bump the per-send timeout to 50ms.

Backport of the go/vt/vtgate/schema/tracker_test.go slice of vitessio#18317
(the materializer_test.go slice is for an unrelated test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
(cherry picked from commit 8d3e545)
Forks don't have access to the `gh-hosted-runners-16cores-1-24.04`
runner pool, so workflows that hardcoded it would never schedule.
Gate the runner selection on `github.repository` so forks fall back
to `ubuntu-24.04` automatically.

`cluster_endtoend.yml` and `unit_test.yml` already chose between the
larger runner and `ubuntu-24.04` via a matrix value
(`matrix.needs`/`matrix.race`); AND the `github.repository` check
into that expression so forks never request the larger runner.
The remaining `upgrade_downgrade_*`, `local_example`, and
`region_example` workflows hardcoded the larger runner, so they get
the same ternary inline.

`docker_build_images.yml` is left as-is because both of its jobs
already gate on `if: github.repository == 'vitessio/vitess'`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The Upload step in codecov.yml has `fail_ci_if_error: true`, so when
the workflow runs on a fork (without `secrets.CODECOV_TOKEN`) the
upload returns "Token required - not valid tokenless upload" and the
whole job goes red even though the test suite passed. Running ~16
minutes of unit tests just to throw away the coverage report on a
fork is also wasted CI time.

AND `github.repository == 'vitessio/vitess'` into the existing
job-level `if` so forks skip the job entirely. The existing
`Backport`-label exclusion is preserved for upstream runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The previous deflake commit (0ea371f, the [release-22.0] cherry-pick
of upstream's PR vitessio#19388) ported the EventuallyWithT callback verbatim
from upstream, which uses `require.X(c, ...)`. That works on upstream's
testify v1.11+, but this branch is pinned to testify v1.10.0, where
`CollectT.FailNow` is implemented as `panic("Assertion failed")` and
`EventuallyWithT` doesn't recover from it — so the first failed poll
crashes the goroutine. The release-21.0 job log showed exactly that:

  panic: Assertion failed
  testify/assert.(*CollectT).FailNow ...
  EventuallyWithT.func1 ...
  FAIL  vitess.io/vitess/go/mysql

Replace `require.X(c, ...)` with `assert.X(c, ...)` (which just flags
the CollectT instead of panicking) and guard the `entries[0]` indexing
on `assert.NotEmpty`, otherwise a `nil[0]` slice access escapes the
same way.

Hoisted the polling loop into a `waitForReload` helper since both
hupTest and hupTestWithRotation now use the same body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 248182c)
(cherry picked from commit a814579)
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
`vtop_example.yml` hardcoded `oracle-vm-8cpu-32gb-x86-64`, which forks
can't schedule on, so the workflow would never start. Apply the same
`github.repository`-gated expression used for the `gh-hosted-runners`
fallback so forks fall back to `ubuntu-24.04`.

The test spins up minikube + vtop, which probably won't fit in
ubuntu-24.04's 4-core/16GB envelope — but failing at runtime with a
log is more useful than never scheduling.

`codecov.yml` also uses the oracle runner but is already job-gated on
`github.repository`, so its `runs-on` line never executes on forks
and doesn't need this treatment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
`assign_milestone.yml` triggers on `pull_request_target: [opened]` and
greps `go/vt/servenv/version.go` for a `versionName` to assign a
matching milestone via `gh pr edit`. On a fork, there is no matching
milestone, so the job fails on every PR opened against the fork.

Add a job-level `if: github.repository == 'vitessio/vitess'` so the
job only runs on the upstream repo. Other forks can opt in by setting
up their own milestones and removing the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 27, 2026 09:08
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 27, 2026
@arthurschreiber arthurschreiber removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 27, 2026
@github-actions github-actions Bot added this to the v22.0.5 milestone May 27, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

CI-only stabilization backport for the EOL release-22.0 branch so that downstream forks can run CI green when backporting fixes. Production code is untouched; changes are limited to workflow gating and a few flaky-test relaxations.

Changes:

  • Gate custom GitHub-hosted runners (and vtop_example's Oracle VM runner) on github.repository == 'vitessio/vitess' across many workflows, falling back to ubuntu-24.04 for forks; also gate the Code Coverage and Assign Milestone jobs to upstream only.
  • Relax TestTLSRequired's revoked-cert assertion to accept connection reset by peer / broken pipe in addition to bad certificate.
  • Extract waitForReload for TestStaticConfigHUP using assert.X (not require.X) inside EventuallyWithT to avoid testify v1.10 CollectT.FailNow panics, and bump TestTrackerNoLock channel-send timeout from 10ms → 50ms.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.

Show a summary per file
File Description
.github/workflows/assign_milestone.yml Skip Assign Milestone job on forks.
.github/workflows/codecov.yml Skip Code Coverage job on forks (preserves Backport-label gating).
.github/workflows/cluster_endtoend.yml Gate larger runner selection on upstream.
.github/workflows/unit_test.yml Gate race-job runner on upstream.
.github/workflows/vtop_example.yml Fall back to ubuntu-24.04 for forks.
.github/workflows/local_example.yml Same runner fallback.
.github/workflows/region_example.yml Same runner fallback.
.github/workflows/upgrade_downgrade_test_*.yml (12 files) Same runner fallback across upgrade/downgrade workflows.
go/mysql/server_test.go Accept alternate close errors in TestTLSRequired revoked-cert path.
go/mysql/auth_server_static_test.go Extract waitForReload; use assert inside EventuallyWithT to avoid v1.10 panic.
go/vt/vtgate/schema/tracker_test.go Bump TestTrackerNoLock send timeout 10ms → 50ms.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@arthurschreiber arthurschreiber enabled auto-merge (squash) May 27, 2026 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants