Skip to content

update go toolchain to 1.22 #20

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Jul 14, 2025
Merged

update go toolchain to 1.22 #20

merged 31 commits into from
Jul 14, 2025

Conversation

iliana
Copy link
Collaborator

@iliana iliana commented Apr 18, 2025

Related to #10, but not quite fully fixing it.

This moves the Go toolchain from 1.17 to 1.22, a modernization of about 2.5 years, and a year behind current Go. Most of the changes here are related to fixing new lints from go vet and other tools, although I've only made fixes where it's not intrusive to do so. Some descriptions of fixes are called out here for context to reviewers, and to future maintainers (hello, me!) of this repo.

Why update?

Particularly when it comes to security protocols (such as TLS), Go has a very batteries-included standard library. Keeping on a currently-supported version of the Go runtime is very important, while also being somewhat painful for a codebase like Cockroach for reasons we'll get into.

While Oxide doesn't currently have any sensitivity to the Go TLS stack in the control plane, it is our intent to get internal services talking to each other over authenticated connections on the underlay network, and failing to do this maintenance work while it's relatively small creates a pretty big risk that we can't use built-in TLS support we already have.

(This is far from the only reason to keep the Go toolchain up to date but it's the one that feels most pertinent to us.)

Philosophy on new lints

We run the Cockroach linting test suite in CI, and changing the toolchain means updating Go's built-in lint tools (and some external tools that need updates to work on newer versions of Go).

I am generally of the opinion that linters and formatters are good. However, blindly applying 1.19's gofmt doc comment changes would have resulted in an unreviewable diff, in the literal "GitHub will not load the page" sense of unreviewable. (Some of that can be seen in the autogenerated code that was updated in this PR.) Similarly, some lints that have been introduced or improved to work in more cases since Go 1.17 that would only serve to introduce unnecessary code churn; I've opted to add rules to ignore them. Where new lints are particularly useful is when they tell us the Go runtime has meaningfully improved, which I'll go into detail on in the next section.

This philosophy is summarized near where we add the rules to ignore certain lints:

// [Oxide specific] Ignore a wide variety of deprecations and style suggestions.
// Generally we should investigate why something is deprecated: if it's because
// there is a genuine improvement (such as the new default math/rand RNG in
// Go 1.22 that cannot be used if rand.Seed is used), we should pay attention
// and follow those recommendations; if it's because of standard library
// reorganization (such as all of "io/ioutil" moving elsewhere), maybe it's not
// worth the code churn on a maintenance branch.

If I don't go into a specific change below, it probably hit a new/improved lint. Most of the ones that should flag reviewer attention are because the result of something was never actually used; these are the sorts of embarrassing mistakes we're used to rustc or Clippy catching for us, so it's nice to see a lot of these linter improvements.

math/rand deprecations

In Go 1.20:

The math/rand package now automatically seeds the global random number generator (used by top-level functions like Float64 and Int) with a random value, and the top-level Seed function has been deprecated. Programs that need a reproducible sequence of random numbers should prefer to allocate their own random source, using rand.New(rand.NewSource(seed)).

I did not know Go had a global random number generator and I'm glad I learned this, I guess?

Go has made the PRNG that backs the global RNG better too, but for the sake of backwards compatibility if you ever seed it it reverts to the old PRNG. But if you weren't seeding the RNG and used it for things you wanted to actually be random, you were going to have a bad day, so you always seeded it with either crypto/rand or the current UNIX timestamp. In several places (most notably in pkg/cli, the entrypoint to the compiled program) we remove that seeding so that we default to the new PRNG implementation.

A lot of the deprecated uses in test suites are kept in place; these have explicit //lint:ignore directives. Cockroach has test infrastructure for picking a random seed and printing it out on test failures so you can attempt to reproduce the issue locally. This capability might be removed in a future version of the Go runtime; in that situation it's probably best to disable that infrastructure rather than try to plumb a new "global" RNG into all the test interfaces.

pkg/util/uuid is one place where I made more invasive changes as part of correcting this deprecation lint. Previously there were two ways of generating a UUID with this package, either by passing in an io.Reader that hopefully read random bytes, or by using fastGen which used math/rand by way of a struct that implemented the io.Reader interface. I ported it to use math/rand/v2 instead (which uses the newer PRNG) while maintaining compatibility for using a given io.Reader (such as crypto/rand.Reader).

pkg/util/goschedstats

Cockroach uses the number of goroutines in the "runnable" state, i.e. ready and waiting to run, as an analog for system load; more ready-and-waiting goroutines means you're possibly overloaded. This metric is used for admission control of new work to a cluster. There's just one problem: Go doesn't actually tell you how many goroutines are in the "runnable" state. (A Stack Overflow comment suggests this is because "runnable" is an implementation detail of the Go runtime.)

I have had the misfortune of learning that Cockroach simply links to various parts of the Go runtime's scheduler and reads the data it's looking for. To do this, it keeps a copy of the relevant internal types in-tree along with a comment explaining that you must update them for any changes in the Go runtime.

If you're curious why this PR doesn't move us to 1.23 or 1.24, it's because Go 1.23 now disallows //go:linkname from linking to arbitrary parts of the Go runtime with a few exceptions that would have otherwise broken large parts of the Go ecosystem. This particular part of the scheduler is not one of those exceptions. I am not exactly sure yet how I want to approach this problem, but I didn't want to delay moving to a relatively-recent runtime for it.

Sorting

Go 1.19 switched the sorting algorithm used in the standard library to pdqsort. Cockroach has an in-tree copy of Go 1.17's sorting algorithm with an additional context parameter that permits early cancellation of a sort operation (in case a query is cancelled). A test case in pkg/sql/rowcontainer ensured (among other things) that sorting algorithm is consistent with the Go runtime and caught this change for me, because both sorting algorithms are unstable (equivalent rows do not necessarily maintain the same position relative to each other), and otherwise-equivalent rows were returning in different orders.

Instead of copying in the new sorting algorithm, I have made the less risky decision to make the test slightly more dubious by testing the abstract row sorting machinery against the in-tree cancelable sort functions.

gcassert

gcassert is a linting tool developed by Cockroach developers to be able to check certain things about compiled code. Go has poor language support for controlling inlining and bounds checking, so the best you can do is make a tool that parses compiler diagnostic output to ensure your assumptions hold. Of course, compiler behavior changes between releases and there's often nothing you can do about it, so whether this is valuable is dubious to me.

The version of gcassert previously used only processed directives without a space, such as //gcassert:bce, not // gcassert:bce, similar to Go build directives. The gofmt change in 1.19, of course, broke this behavior by limiting this to only Go build directives, and newer versions of gcassert support handling the variant with a space. Some // gcassert: directives (with the space) were removed because their assertions were never correct and never properly checked. In other cases where //gcassert was properly spelled, I made an attempt to correct any issues flagged; //gcassert:bce is used in a few places to ensure bounds checking is elided in tight loops by running the bounds check once before the loop.

(For context, the code where gcassert is used is in a place where "execgen" is also used, which performs code generation, inlining, etc. Those are the .eg.go files that GitHub is hiding by default; they generally come from similarly named _tmpl.go files.)

Node runtime update

Node.js is used to build the console frontend UI that is baked into the binary. The UI failed to build on versions of Node later than 16 due to Webpack's use of the md4 algorithm (used for cache-busting filenames) through Node's native OpenSSL bindings, which were removed in Node 18 and later. Fortunately Webpack 4 and terser-webpack-plugin made final maintenance releases to use a different md4 implementation that is compatible with later releases of Node, and that was all that was necessary to get off of Node 16. (It was also fortunately simple enough to stop vendoring npm packages via a Git submodule.)

We now pin to Node 20.11.1, the latest version available in Helios right now. Next time we bump those packages I'll move to the new latest release.

But does it work?

I've run the Omicron test suite locally with this version of Cockroach and it passes all the tests. I still need to verify that nothing about the on-disk format has changed, although that would be very unexpected. I intend to test this by deploying Omicron release 14 to a system and then upgrading the version of Cockroach, as well as deploying a version with a new Cockroach and then downgrading it back to the version we currently have; this way if we find a problem we at least are confident we can simply roll back.

Since Go 1.22, the default unseeded global RNG is more secure:
https://go.dev/blog/chacha8rand

However when the RNG is seeded it falls back to the old implementation.

One call to rand.Seed remains in randutil.SeedForTests. This is part
of the COCKROACH_RANDOM_SEED infrastructure for recording what seed was
used for a particular test failure so it could be reproduced. (Had the
global RNG not been seedable in the first place, perhaps there would be
more sensible infrastructure in place here.) We will leave this in place
until Go 1.24 when rand.Seed becomes a no-op by default.
@iliana iliana linked an issue Apr 18, 2025 that may be closed by this pull request
@sudomateo
Copy link

Thanks for working on this iliana. I've only read the write up so far and I don't have any reservations about moving to the new Go runtime. You've covered all the major points (e.g., v2 packages, linking) that I was going to mention if they weren't already mentioned.

I'll try to carve out some time to read the code next week. Meanwhile I'll subscribe to this so I stay current with what's happening.

Copy link

@sudomateo sudomateo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this! I gave each file a brief look and I didn't see anything out of the ordinary. I left some non-blocking comments and questions.

I'll note that I did not run a build locally to verify the changes but instead kept my review to just that, a review.

@iliana
Copy link
Collaborator Author

iliana commented May 9, 2025

I am going to leave this open while I'm out for the next few weeks, since I didn't get a chance to do the full integration testing (upgrade/downgrade testing) I wanted to do with Omicron, and we're getting pretty far into the R15 cycle anyway. When I'm back we should be at the end of the R15 cycle and I can put in the work to land this as soon as R15 branches so it can get as much soak time for R16 as possible.

Anyone else can feel free to do the upgrade/downgrade testing while I'm gone though! The two tests I want to do are:

  • Rack init on the go1.16 version, thrash the database, mupdate to the go1.22 version, thrash the database, mupdate back to the go1.16 version, thrash the database
  • Rack init on the go1.22 version, thrash the database, mupdate to the go1.16 version, thrash the database

to ensure that whether or not rack init occurs on the go1.16 or go1.22 versions, that we can move into either version safely without any unexpected incompatibility in what's stored on-disk. Probably the best way to thrash the database is to run sled add/remove in a loop while also launching and shutting off instances.

@iliana
Copy link
Collaborator Author

iliana commented Jul 14, 2025

I've finally had the chance to do the first test above (go1.16 -> go1.22 -> go1.16) on a test rack and things are working as expected. Due to Reasons it will be easier to test rack init on the go1.22 version after merging this and getting a repo built in Omicron CI but I have tested initializing a cluster on the go1.22 version and then replacing with the go1.16 version, so I have extreme doubt that would discover an issue.

I'm going to go ahead and merge today, and then I'll work on burning down the Dependabot alerts and possibly getting beyond go1.22.

@iliana iliana merged commit 712357d into release-22.1-oxide Jul 14, 2025
13 checks passed
@iliana iliana deleted the iliana/go-1.22 branch July 14, 2025 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

errcheck lint disabled due to OOM
2 participants