[release-20.0] CI: deflakes, fork runner fallback, MySQL apt key, codecov gating #20195

Open
arthurschreiber wants to merge 14 commits into vitessio:release-20.0 from arthurschreiber:release-20.0-ci-fixes

@arthurschreiber arthurschreiber commented May 27, 2026

Description

release-20.0 is long EOL and not supported. However, some users still run it internally, or have to run it for a short window as part of an upgrade path to a supported Vitess version. Those users sometimes need to backport fixes from newer branches into their own forks, and right now that's painful: CI on the release-20.0 branch is broken in several ways for any repo that isn't vitessio/vitess (custom runners that forks can't schedule on, an expired MySQL apt key, a Percona repo that no longer ships the package we install, a Code Coverage job that goes red without an upload token, a handful of flaky tests, plus several uses: references that don't satisfy the org's SHA-pin policy).

This PR bundles the CI-only fixes needed to get release-20.0 green again so those backports can land. No production code is changed — everything is .github/, test helpers, or test code.

To be clear: merging this is not a change in support status. release-20.0 remains EOL. This is a courtesy to make life easier for users still on the branch by accident or by upgrade-path necessity.

The same set of fixes was opened against release-21.0 in #20196 and release-22.0 in #20197.

Fork-runner / infra fallbacks

  • 0628bbb ci: fall back to ubuntu-24.04 outside vitessio/vitess — forks can't schedule on gh-hosted-runners-16cores-1-24.04; gate on github.repository.
  • a08451a ci: fix MySQL install on ubuntu-24.04 runners — the GPG key shipped in mysql-apt-config_0.8.33-1 expired; bump to 0.8.35-1 (matching release-23.0/24.0) and uninstall the runner image's pre-installed MySQL before installing ours.
  • 888d9f1 ci: enable pxb-80 repo for percona-xtrabackup-80 install: percona-release setup ps80 no longer ships percona-xtrabackup-80; set up the pdps8.0 + pxb-80 repos like release-23.0/24.0 do.
  • d3f94e6 ci: don't run workflows twice for the same commit — backport of Simplify workflow files. #18649: restrict the push trigger to main, release branches, and tags so PR pushes don't double-fire.

Code Coverage gating

  • 51bb5c8 ci: skip Code Coverage job when CODECOV_TOKEN isn't available (cherry-pick of dea0555).
  • bb3b340 ci: gate Code Coverage on github.repository instead of token presence — follow-up so tokenless/OIDC uploads on vitessio/vitess aren't accidentally disabled.

Test deflakes (backports / cherry-picks)

  • d958e40 Flakes: Address TestServerStats flakiness (#16991) — cherry-pick.
  • a4768f0 go/mysql: relax TestTLSRequired revoked-cert assertion — accept connection reset by peer / broken pipe alongside bad certificate; all three mean the revoked cert was rejected.
  • e3cd779 go/mysql: deflake TestStaticConfigHUP — backport of the auth_server_static_test.go slice of CI: Deflake Code Coverage workflow #19388 (poll with EventuallyWithT instead of a fixed sleep).
  • 248182c go/mysql: stop TestStaticConfigHUP panicking inside EventuallyWithT — follow-up: this branch is pinned to testify v1.9, where CollectT.FailNow panics and EventuallyWithT doesn't recover. Use assert.X(c, ...) instead of require.X(c, ...).
  • 8a1c2fb go/vt/vtgate/schema: bump TestTrackerNoLock channel-send timeout — backport of the tracker_test.go slice of flaky test fix TestTrackerNoLock and TestCreateLookupVindexMultipleCreate #18317 (10ms → 50ms).
  • e4bee99 CI: wait-for rather than 'assume' in Online DDL flow (#16210) — cherry-pick.
  • 7f7eb34 CI: Look for expected log message rather than code in Backup tests (#19199) — cherry-pick.

Action SHA pinning

  • 7badca3 ci: pin all GitHub Actions to full-length commit SHAs — the vitessio/vitess action policy requires every uses: to reference a full 40-char commit SHA. Pin the nine remaining @v* / @master references (setup-go, setup-node, setup-python, stale, fossa-action, slack-workflow-status, codeql-action init/analyze, peter-evans/create-pull-request) at the same major versions they were on, using the SHAs upstream uses on newer branches.

Related Issue(s)

None — these are CI-only stabilization fixes.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

None — CI-only changes.

arthurschreiber and others added 13 commits May 26, 2026 06:54
Forks don't have access to the `gh-hosted-runners-16cores-1-24.04`
runner pool, so workflows that hardcoded it would never schedule.
Gate the runner selection on `github.repository` so forks fall back
to `ubuntu-24.04` automatically.

The three e2e templates are updated and regenerated; the
hand-maintained workflows (codecov, unit_race*, upgrade_downgrade_*,
local/region examples) get the same expression inline.
`docker_build_images.yml` is left as-is because both of its jobs
already gate on `if: github.repository == 'vitessio/vitess'`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The MySQL apt repo GPG key shipped in mysql-apt-config_0.8.33-1
(and 0.8.29-1) has expired, causing every CI job that installs
MySQL to fail with `EXPKEYSIG B7B3B788A8D3785C` during apt-get
update. The 0.8.35-1 package, used by release-23.0/release-24.0,
ships an updated key.

Backport the relevant pieces of the newer branches'
.github/actions/setup-mysql composite action into the existing
release-20.0 templates and hand-maintained workflows:

  - Bump mysql-apt-config to 0.8.35-1 across all templates and
    workflows that install MySQL.
  - Uninstall the MySQL pre-installed on the ubuntu-24.04 runner
    image before installing our own, so the package install
    doesn't conflict.
  - Recreate an empty apparmor profile before disabling /
    reloading it (the profile is removed along with the
    pre-installed MySQL packages).
  - Pull libaio1 / libtinfo5 from archive.ubuntu.com instead of
    mirrors.kernel.org, matching the newer composite action.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The `percona-release setup ps80` shortcut no longer enables a repo
that ships `percona-xtrabackup-80`, so the xb_backup / xb_recovery /
backup_pitr_xtrabackup jobs were failing with:

  E: Unable to locate package percona-xtrabackup-80

Match what release-23.0 and release-24.0 do: set up the
pdps8.0 distribution repo and pxb-80 (XtraBackup 8.0) repo, and
re-enable the ps-80 release repo for percona-server packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
(cherry picked from commit 9290e31)
When the server's `VerifyPeerCertificate` returns "Certificate revoked",
Go's TLS sends a `bad_certificate` alert and then closes. Whether the
client reads the alert or the TCP RST first depends on kernel TCP
flush timing — so the test would sometimes see
`remote error: tls: bad certificate` and sometimes
`connection reset by peer` / `broken pipe`.

Both outcomes mean the revoked certificate was rejected, which is what
the test cares about. Accept any of the three error strings.
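
As a rough illustration of the relaxed assertion, the check can be written as a substring match against the three accepted messages. This is a minimal sketch, not the code in `server_test.go`: the helper names `isRejectionMessage` and `revokedCertRejected` are hypothetical, and only the three substrings come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isRejectionMessage reports whether msg contains one of the three error
// strings that the commit message treats as "the revoked cert was rejected".
// Hypothetical helper name; the substrings are taken from the text above.
func isRejectionMessage(msg string) bool {
	for _, want := range []string{
		"bad certificate",          // client read the TLS alert first
		"connection reset by peer", // client saw the TCP RST first
		"broken pipe",              // client was still writing when the peer closed
	} {
		if strings.Contains(msg, want) {
			return true
		}
	}
	return false
}

// revokedCertRejected applies the same check to an error value.
// A nil error means the connection succeeded, so the cert was NOT rejected.
func revokedCertRejected(err error) bool {
	return err != nil && isRejectionMessage(err.Error())
}

func main() {
	fmt.Println(revokedCertRejected(errors.New("remote error: tls: bad certificate")))
	fmt.Println(revokedCertRejected(errors.New("read: connection reset by peer")))
	fmt.Println(revokedCertRejected(errors.New("dial tcp: i/o timeout")))
}
```

Matching on substrings keeps the check independent of whatever prefix the network stack adds to the message, at the cost of being tied to the exact wording of those errors.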

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
When a workflow declares `on: [push, pull_request]` (or the multi-line
equivalent with bare `push:`/`pull_request:`), every commit pushed to a
branch with an open PR triggers two runs of the workflow: once for the
push, once for the pull_request event.

Match what was done on main / release-21.0 / release-22.0 (PR vitessio#18649):
restrict the push trigger to `main`, release branches, and tags, and
keep `pull_request` for all branches. Push-only paths/filters on the
vtadmin_web workflows are preserved.

The longer "skip-workflow" step in the templates is left in place;
that PR's other simplification (removing the redundant skip check) is
out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…itessio#19199)

Signed-off-by: Matt Lord <mattalord@gmail.com>
(cherry picked from commit 3839bd4)
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
(cherry picked from commit 135a6a8)
After SIGHUPing the static auth server to force a config reload, the
test slept a fixed 100ms (or 20ms) and then asserted the new entries
were live. On a slow CI runner the signal handler may not have finished
processing by then, and the test fails with:

  Expected nil, but got: []*mysql.AuthServerStaticEntry{...}

Match the fix from PR vitessio#19388: replace the fixed sleep with
require.EventuallyWithT polling, with a generous 30s deadline so
slower runners still pass.
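
The difference between a fixed sleep and polling can be sketched with the standard library alone. This is an illustrative stand-in, not testify's `EventuallyWithT` and not the test's actual code: `pollUntil` and the simulated reload are hypothetical names, and the durations are chosen only for the demo.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// pollUntil calls cond repeatedly, sleeping tick between attempts, until
// cond returns true or timeout has elapsed. It returns whether cond ever
// succeeded. Hypothetical helper; a stdlib analogue of polling assertions.
func pollUntil(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

func main() {
	var reloaded atomic.Bool

	// Simulate a config reload that finishes some time after the signal.
	go func() {
		time.Sleep(30 * time.Millisecond)
		reloaded.Store(true)
	}()

	// A generous deadline costs nothing when the reload is fast, because
	// polling returns as soon as the condition holds.
	ok := pollUntil(reloaded.Load, 2*time.Second, 5*time.Millisecond)
	fmt.Println("reload observed:", ok)
}
```

With polling, the deadline only bounds the failure case, which is why a long timeout such as 30s does not slow down a passing run.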

Backport of the go/mysql/auth_server_static_test.go slice of vitessio#19388
(the rest of that PR is unrelated zkctl/zk2topo/tabletserver work
that doesn't apply to this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
TestTrackerNoLock pushes 500,000 messages onto a channel and asserts
each send completes within 10ms. Under CI load that's tight enough to
flake regularly, surfacing as:

  tracker_test.go:199: failed to send health check to tracker

Match the fix from PR vitessio#18317: bump the per-send timeout to 50ms.
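
A per-send timeout of this kind is usually a `select` between the channel send and a timer. The following is a minimal sketch under that assumption, not the code in `tracker_test.go`: `sendWithTimeout`, the `int` payload, and the loop count are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithTimeout tries to send v on ch and gives up after timeout.
// It returns true if the send completed. Hypothetical helper name.
func sendWithTimeout(ch chan<- int, v int, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case ch <- v:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	ch := make(chan int)
	done := make(chan struct{})

	// Consumer that drains the channel, standing in for the tracker.
	go func() {
		defer close(done)
		for range ch {
		}
	}()

	for i := 0; i < 1000; i++ {
		// 50ms per send, matching the value the commit bumps to.
		if !sendWithTimeout(ch, i, 50*time.Millisecond) {
			fmt.Println("failed to send message", i)
			return
		}
	}
	close(ch)
	<-done
	fmt.Println("all sends completed")
}
```

Because the timeout applies to every send, a single scheduling stall on a loaded runner is enough to fail the whole loop, so a wider per-send budget reduces flakes without hiding a consumer that is genuinely blocked.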

Backport of the go/vt/vtgate/schema/tracker_test.go slice of vitessio#18317
(the materializer_test.go slice is for an unrelated test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The previous deflake commit (e3cd779) ported the EventuallyWithT
callback verbatim from upstream's PR vitessio#19388, which uses `require.X(c,
...)`. That works on upstream's testify v1.11+, but this branch is
pinned to testify v1.9, where `CollectT.FailNow` is implemented as
`panic("Assertion failed")` and `EventuallyWithT` doesn't recover from
it — so the first failed poll crashes the goroutine. The job log
showed exactly that:

  panic: Assertion failed
  testify/assert.(*CollectT).FailNow ...
  EventuallyWithT.func1 ...
  FAIL  vitess.io/vitess/go/mysql

Replace `require.X(c, ...)` with `assert.X(c, ...)` (which just flags
the CollectT instead of panicking) and guard the `entries[0]` indexing
on `assert.NotEmpty`, otherwise a `nil[0]` slice access escapes the
same way.
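
The "flag instead of panic" behaviour and the guard before indexing can be illustrated without testify. This is a sketch of the pattern only: `collector`, `entry`, and `checkReloaded` are hypothetical stand-ins, not testify's `CollectT` or the Vitess `AuthServerStaticEntry` type.

```go
package main

import "fmt"

// collector is a hypothetical stand-in for a collecting test handle:
// a failed check is recorded so the poll can retry, instead of panicking.
type collector struct {
	failures []string
}

func (c *collector) errorf(format string, args ...any) {
	c.failures = append(c.failures, fmt.Sprintf(format, args...))
}

// entry is a hypothetical, simplified config entry.
type entry struct {
	Password string
}

// checkReloaded records failures on c. The length guard comes first so
// that a nil or empty slice is reported as a failure rather than causing
// an index-out-of-range panic on entries[0].
func checkReloaded(c *collector, entries []entry, want string) {
	if len(entries) == 0 {
		c.errorf("no entries loaded yet")
		return
	}
	if entries[0].Password != want {
		c.errorf("got password %q, want %q", entries[0].Password, want)
	}
}

func main() {
	// Poll 1: the reload has not happened yet, entries is nil.
	early := &collector{}
	checkReloaded(early, nil, "new")
	fmt.Println("failures before reload:", len(early.failures))

	// Poll 2: the reload finished and the new entry is visible.
	late := &collector{}
	checkReloaded(late, []entry{{Password: "new"}}, "new")
	fmt.Println("failures after reload:", len(late.failures))
}
```

A poll whose collector ends up with no recorded failures counts as success; any other poll is simply retried until the deadline.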

Hoisted the polling loop into a `waitForReload` helper since both
hupTest and hupTestWithRotation now use the same body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The Upload step in codecov.yml has `fail_ci_if_error: true`, so when
the workflow runs on a fork (or anywhere else without
`secrets.CODECOV_TOKEN`) the upload returns:

  Token required - not valid tokenless upload
  ==> Failed to create-commit

…and the whole job goes red even though the test suite passed.

Gate the entire job on `secrets.CODECOV_TOKEN != ''` so forks skip
both the test run and the upload — running unit tests just to throw
away the coverage report is wasted CI time. Anyone who actually wants
the coverage can opt in by configuring the secret on their fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
(cherry picked from commit dea0555)
The previous commit (51bb5c8) gated the Code Coverage job on
`secrets.CODECOV_TOKEN != ''`. That breaks if upstream relies on
tokenless / OIDC upload — they wouldn't have the secret set, and the
job would skip on `vitessio/vitess` too.

Switch to the same pattern we already use for runner selection:
`if: github.repository == 'vitessio/vitess'`. Coverage runs on upstream
unconditionally, and forks skip without burning ~16 minutes of unit
tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 27, 2026 08:28
@vitess-bot (bot) added the NeedsWebsiteDocsUpdate, NeedsDescriptionUpdate, NeedsIssue, and NeedsBackportReason labels May 27, 2026
@github-actions (bot) added this to the v20.0.9 milestone May 27, 2026
@github-actions (bot) added the Component: Online DDL and Component: VTGate labels May 27, 2026
@arthurschreiber removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, NeedsIssue, NeedsBackportReason, Component: Online DDL, and Component: VTGate labels May 27, 2026

Copilot AI left a comment

Pull request overview

This PR stabilizes CI for the EOL release-20.0 branch so fork/backport workflows can run more reliably without changing production code.

Changes:

  • Restricts workflow push triggers and adds fork runner fallbacks for custom 16-core runners.
  • Updates MySQL/Percona installation steps for current apt keys/repos and Ubuntu 24.04 runner images.
  • Deflakes selected Go/end-to-end tests by replacing fixed sleeps or overly strict assertions with polling/relaxed checks.

Reviewed changes

Copilot reviewed 101 out of 101 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/templates/unit_test.tpl | Updates generated unit-test workflow triggers and MySQL install setup. |
| test/templates/cluster_vitess_tester.tpl | Updates generated Vitess tester dependency setup. |
| test/templates/cluster_endtoend_test_docker.tpl | Adds trigger restrictions and fork runner fallback. |
| go/vt/vtgate/schema/tracker_test.go | Increases channel-send timeout to reduce flakiness. |
| go/test/endtoend/onlineddl/flow/onlineddl_flow_test.go | Replaces fixed wait with DML-progress polling. |
| go/test/endtoend/backup/vtbackup/backup_only_test.go | Matches redo-log messages instead of version-sensitive error codes. |
| go/mysql/server_test.go | Deflakes server stats and revoked TLS certificate assertions. |
| go/mysql/auth_server_static_test.go | Polls for auth config reload instead of fixed sleeps. |
| .github/workflows/codecov.yml | Adds trigger restrictions, runner fallback, MySQL apt update, and repository gate. |
| .github/workflows/codeql_analysis.yml | Updates MySQL apt config package. |
| .github/workflows/vtadmin_web_unit_tests.yml | Restricts push triggers. |
| .github/workflows/vtadmin_web_lint.yml | Restricts push triggers. |
| .github/workflows/vtadmin_web_build.yml | Restricts push triggers. |
| .github/workflows/check_make_vtadmin_web_proto.yml | Restricts push triggers. |
| .github/workflows/check_make_vtadmin_authz_testgen.yml | Restricts push triggers. |
| .github/workflows/unit_test_mysql80.yml | Regenerated unit workflow with trigger/MySQL updates. |
| .github/workflows/unit_test_mysql57.yml | Regenerated unit workflow with trigger/MySQL updates. |
| .github/workflows/unit_test_evalengine_mysql80.yml | Regenerated evalengine unit workflow with trigger/MySQL updates. |
| .github/workflows/unit_test_evalengine_mysql57.yml | Regenerated evalengine unit workflow with trigger/MySQL updates. |
| .github/workflows/unit_race.yml | Adds trigger restrictions and runner fallback. |
| .github/workflows/unit_race_evalengine.yml | Adds trigger restrictions and runner fallback. |
| .github/workflows/endtoend.yml | Restricts push triggers. |
| .github/workflows/e2e_race.yml | Restricts triggers and updates MySQL apt config. |
| .github/workflows/local_example.yml | Adds trigger restrictions and runner fallback. |
| .github/workflows/region_example.yml | Adds trigger restrictions and runner fallback. |
| .github/workflows/docker_test_cluster_10.yml | Restricts push triggers. |
| .github/workflows/docker_test_cluster_25.yml | Restricts push triggers. |
| .github/workflows/vitess_tester_vtgate.yml | Regenerated tester workflow dependency setup. |
| .github/workflows/upgrade_downgrade_test_*.yml | Adds trigger restrictions, runner fallback, and MySQL apt config updates across upgrade/downgrade jobs. |
| .github/workflows/cluster_endtoend_*.yml | Regenerated cluster workflows with trigger restrictions, MySQL/Percona install fixes, and apparmor handling. |

# configured on forks, and Codecov no longer allows tokenless uploads
# ("Token required - not valid tokenless upload"). Without this gate
# we'd burn ~16 minutes on the test suite just to red-fail the upload.
if: ${{ github.repository == 'vitessio/vitess' }}
Member Author

This is fine. We only want to skip codecov in PRs that are opened in fork repositories, not for PRs in vitessio/vitess that originate from a fork.

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.77%. Comparing base (8f9101e) to head (7badca3).

Additional details and impacted files
@@               Coverage Diff                @@
##           release-20.0   #20195      +/-   ##
================================================
+ Coverage         66.45%   68.77%   +2.32%     
================================================
  Files              1543     1543              
  Lines            244950   198737   -46213     
================================================
- Hits             162774   136677   -26097     
+ Misses            82176    62060   -20116     

The vitessio/vitess action policy requires every `uses:` to reference a
full 40-char commit SHA, but a handful of workflows on release-20.0
still pin by major tag (or by `@master`). Any PR that touches CI on
this branch fails at the `Prepare all required actions` step with:

  The action <name>@<ref> is not allowed in vitessio/vitess because
  all actions must be pinned to a full-length commit SHA.

Pin the remaining references to the same SHAs upstream uses, keeping
the major version unchanged:

  actions/setup-go@v5                  → 0a12ed9d # v5.0.2
  actions/setup-node@v4                → 1e60f620 # v4.0.3
  actions/setup-python@v5              → 39cd1495 # v5.1.1
  actions/stale@v5                     → f7176fd3 # v5.2.1
  fossa-contrib/fossa-action@v3        → 3d2ef181 # v3.0.1
  Gamesight/slack-workflow-status@master → 68bf00d0 # v1.3.0
  github/codeql-action/init@v3         → 4bdb89f4 # v3.28.18
  github/codeql-action/analyze@v3      → 4bdb89f4 # v3.28.18
  peter-evans/create-pull-request@v4   → 38e0b6e6 # v4.2.4
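
The pinning rule lends itself to a mechanical check. The sketch below assumes the policy amounts to "the ref after `@` is exactly 40 lowercase hex characters"; that reading, the `isPinned` name, and the placeholder SHA are assumptions for illustration, not GitHub's actual policy implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fullSHA matches a full-length (40 hex character) commit SHA.
// Assumption: lowercase hex only, as git prints it.
var fullSHA = regexp.MustCompile(`^[0-9a-f]{40}$`)

// isPinned reports whether a `uses:` value of the form "owner/repo@ref"
// (optionally with a sub-path, e.g. "owner/repo/path@ref") references a
// full-length commit SHA. References without "@", such as local "./"
// paths, are outside this sketch and are reported as not pinned.
func isPinned(uses string) bool {
	i := strings.LastIndex(uses, "@")
	if i < 0 {
		return false
	}
	return fullSHA.MatchString(uses[i+1:])
}

func main() {
	// The 40-character value below is a placeholder, not a real commit.
	refs := []string{
		"actions/setup-go@v5",
		"Gamesight/slack-workflow-status@master",
		"actions/setup-go@0123456789abcdef0123456789abcdef01234567",
	}
	for _, r := range refs {
		fmt.Printf("%-60s pinned=%v\n", r, isPinned(r))
	}
}
```

The abbreviated SHAs in the list above would not pass such a check as written; they are shortened for the commit message, and the workflow files need the full 40 characters.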

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 105 out of 105 changed files in this pull request and generated no new comments.

@arthurschreiber arthurschreiber enabled auto-merge (squash) May 27, 2026 13:09
5 participants