-
Notifications
You must be signed in to change notification settings - Fork 1.5k
x64: Add non-SSE4.1 lowerings of ceil/trunc/floor/nearest #6224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x64: Add non-SSE4.1 lowerings of ceil/trunc/floor/nearest #6224
Conversation
This commit adds lowerings that work with SSE2 for CLIF `ceil`, `trunc`, `floor`, and `nearest` instructions over vectors. To get these working `insertlane` for float vectors was also implemented for non-SSE4.1 instructions as well. Note that the goal of these lowerings is not speed but rather "it works", so the decompose-to-call-libcalls logic for vector is probably horrendously slow but should at least be correct.
I'll also note that this doesn't enable the runtest for insertlane yet because there are other lowerings that use SSE4.1 instructions but a future commit will enable the insertlane test for testing without SSE4.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
(x64_roundss a (RoundImm.RoundUp))) | ||
|
||
(rule 1 (lower (ceil a @ (value_type $F64))) | ||
(x64_roundss a imm)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we're going the other direction here but I started thinking, "what about the AVX versions of these?" Looking at inst.isle
, they are emitted by x64_round*
, so all is well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah so far I feel like the current pattern of having a low-level instruction constructor which has an avx and a non-avx rule works out pretty well. It is relatively verbose and may be ripe for some future refactoring (e.g. auto-encoding as AVX if the context's feature flags are set) but for now it's nice to not have to worry about AVX at all really as part of lower.isle
Co-authored-by: Andrew Brown <[email protected]>
…liance#6224) * x64: Add non-SSE4.1 lowerings of ceil/trunc/floor/nearest This commit adds lowerings that work with SSE2 for CLIF `ceil`, `trunc`, `floor`, and `nearest` instructions over vectors. To get these working `insertlane` for float vectors was also implemented for non-SSE4.1 instructions as well. Note that the goal of these lowerings is not speed but rather "it works", so the decompose-to-call-libcalls logic for vector is probably horrendously slow but should at least be correct. * Skip new tests on riscv64 * Update cranelift/codegen/src/isa/x64/inst.isle Co-authored-by: Andrew Brown <[email protected]> --------- Co-authored-by: Andrew Brown <[email protected]>
Curious - is this just for testing or are there important users that only have SSE2? Even low-power x86(32/64) CPUs have had SSE4 for a very long time, and the standard standard was written with SSE4 in mind, web implementations don't support anything lower. In case someone would try to accelerate their code using instructions in this PR on SSE2 machine they are likely to get surprised. |
Just for testing and completeness. We occasionally get reports about users getting an error message that their machine didn't support SSE4.2 which was the basline required by Wasmtime originally. Part of this is on Wasmtime since we require SSE4.2 even if the module doesn't actually use simd instructions and that error could probably be improved by first detecting SIMD and then requiring SSE4.2 or something like that. Otherwise though I was basically just curious if it was even possible to take the requirement from SSE4.2 to SSE2. I wanted to learn x64 a bit more and the various instructions and it ended up not being too bad and it fits well in the instruction selection framework that Cranelift has, so I figured I'd land it. My guess though is that this'll only be here for compatibility in niche cases. I definitely have no plans myself to use work like this to shape future standards or anything like that, I think that the SSE4.2 baseline is quite reasonable and well thought out. |
@penzn FWIW, the current discussion on this and Alex's work were spawned by a downstream user trying to run Wasmtime on a Penryn-era (Core 2) machine a month or so ago, so these users do still exist, even if rarely (I'm not finding the issue at the moment, but all the related ones are linked from #3810, and we've had pretty extensive discussions on that cluster of issues over the past 1.5 years). It's also just nice to be able to say that we support our full feature set on any x86_64 machine (i.e. only SSE2), and dynamically detect the presence of better options (almost always switch to our SSE4.2 lowerings when available), rather than the other way around where we have a dynamically-detected error path or a "hole" in our support matrix. Wasm is all about portability, after all! |
Some instructions were added only because 4.1/4.2 support is assumed. Here is one example, also see notes from the meeting that approved it. On the other hand adding fallbacks makes sense. The tricky part of SIMD is that it needs to be fast as well as correct, you probably don't want to get into a situation when someone would accelerate their code using SIMD operations only to find out it got slower (probably even worse if compiler would quietly do it for them). Performance is pretty hard to specify precisely, criteria for adding instructions was something along the lines of "no slowdown, speedup for at least one platform" (i.e. swizzle). |
Right, and that came up in earlier discussions here too. From my perspective at least, it makes sense to tune the feature for newer hardware (and we do), but it kind of also sucks to just abandon users with old hardware completely. Given the choice between "user with 2007 laptop cannot run Wasmtime at all" and "user with 2007 laptop can run Wasmtime, but it's slow just like everything else on this machine" I think we'll take the latter :-) |
This commit adds lowerings that work with SSE2 for CLIF
ceil
,trunc
,floor
, andnearest
instructions over vectors. To get these workinginsertlane
for float vectors was also implemented for non-SSE4.1 instructions as well.Note that the goal of these lowerings is not speed but rather "it works", so the decompose-to-call-libcalls logic for vector is probably horrendously slow but should at least be correct.