Skip to content

Conversation

alexcrichton
Copy link
Member

This commit adds lowerings that work with SSE2 for CLIF ceil, trunc, floor, and nearest instructions over vectors. To get these working insertlane for float vectors was also implemented for non-SSE4.1 instructions as well.

Note that the goal of these lowerings is not speed but rather "it works", so the decompose-to-call-libcalls logic for vector is probably horrendously slow but should at least be correct.

This commit adds lowerings that work with SSE2 for CLIF `ceil`, `trunc`,
`floor`, and `nearest` instructions over vectors. To get these working
`insertlane` for float vectors was also implemented for non-SSE4.1
instructions as well.

Note that the goal of these lowerings is not speed but rather "it
works", so the decompose-to-call-libcalls logic for vector is probably
horrendously slow but should at least be correct.
@alexcrichton alexcrichton requested a review from abrown April 17, 2023 18:26
@alexcrichton alexcrichton marked this pull request as ready for review April 17, 2023 18:26
@alexcrichton alexcrichton requested a review from a team as a code owner April 17, 2023 18:26
@alexcrichton alexcrichton removed the request for review from a team April 17, 2023 18:26
@alexcrichton
Copy link
Member Author

I'll also note that this doesn't enable the runtest for insertlane yet because there are other lowerings that use SSE4.1 instructions but a future commit will enable the insertlane test for testing without SSE4.1

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen labels Apr 17, 2023
@alexcrichton alexcrichton reopened this Apr 18, 2023
Copy link
Member

@abrown abrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

(x64_roundss a (RoundImm.RoundUp)))

(rule 1 (lower (ceil a @ (value_type $F64)))
(x64_roundss a imm))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we're going the other direction here but I started thinking, "what about the AVX versions of these?" Looking at inst.isle, they are emitted by x64_round*, so all is well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so far I feel like the current pattern of having a low-level instruction constructor which has an avx and a non-avx rule works out pretty well. It is relatively verbose and may be ripe for some future refactoring (e.g. auto-encoding as AVX if the context's feature flags are set) but for now it's nice to not have to worry about AVX at all really as part of lower.isle

@alexcrichton alexcrichton enabled auto-merge April 18, 2023 16:51
@alexcrichton alexcrichton added this pull request to the merge queue Apr 18, 2023
Merged via the queue into bytecodealliance:main with commit 62f8928 Apr 18, 2023
@alexcrichton alexcrichton deleted the x64-round-sse41 branch April 18, 2023 18:03
eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Apr 28, 2023
…liance#6224)

* x64: Add non-SSE4.1 lowerings of ceil/trunc/floor/nearest

This commit adds lowerings that work with SSE2 for CLIF `ceil`, `trunc`,
`floor`, and `nearest` instructions over vectors. To get these working
`insertlane` for float vectors was also implemented for non-SSE4.1
instructions as well.

Note that the goal of these lowerings is not speed but rather "it
works", so the decompose-to-call-libcalls logic for vector is probably
horrendously slow but should at least be correct.

* Skip new tests on riscv64

* Update cranelift/codegen/src/isa/x64/inst.isle

Co-authored-by: Andrew Brown <[email protected]>

---------

Co-authored-by: Andrew Brown <[email protected]>
@penzn
Copy link

penzn commented May 18, 2023

Curious - is this just for testing or are there important users that only have SSE2? Even low-power x86(32/64) CPUs have had SSE4 for a very long time, and the standard standard was written with SSE4 in mind, web implementations don't support anything lower. In case someone would try to accelerate their code using instructions in this PR on SSE2 machine they are likely to get surprised.

@alexcrichton
Copy link
Member Author

Just for testing and completeness. We occasionally get reports about users getting an error message that their machine didn't support SSE4.2 which was the basline required by Wasmtime originally. Part of this is on Wasmtime since we require SSE4.2 even if the module doesn't actually use simd instructions and that error could probably be improved by first detecting SIMD and then requiring SSE4.2 or something like that.

Otherwise though I was basically just curious if it was even possible to take the requirement from SSE4.2 to SSE2. I wanted to learn x64 a bit more and the various instructions and it ended up not being too bad and it fits well in the instruction selection framework that Cranelift has, so I figured I'd land it.

My guess though is that this'll only be here for compatibility in niche cases. I definitely have no plans myself to use work like this to shape future standards or anything like that, I think that the SSE4.2 baseline is quite reasonable and well thought out.

@cfallin
Copy link
Member

cfallin commented May 18, 2023

@penzn FWIW, the current discussion on this and Alex's work were spawned by a downstream user trying to run Wasmtime on a Penryn-era (Core 2) machine a month or so ago, so these users do still exist, even if rarely (I'm not finding the issue at the moment, but all the related ones are linked from #3810, and we've had pretty extensive discussions on that cluster of issues over the past 1.5 years). It's also just nice to be able to say that we support our full feature set on any x86_64 machine (i.e. only SSE2), and dynamically detect the presence of better options (almost always switch to our SSE4.2 lowerings when available), rather than the other way around where we have a dynamically-detected error path or a "hole" in our support matrix. Wasm is all about portability, after all!

@penzn
Copy link

penzn commented May 18, 2023

Some instructions were added only because 4.1/4.2 support is assumed. Here is one example, also see notes from the meeting that approved it. On the other hand adding fallbacks makes sense.

The tricky part of SIMD is that it needs to be fast as well as correct, you probably don't want to get into a situation when someone would accelerate their code using SIMD operations only to find out it got slower (probably even worse if compiler would quietly do it for them). Performance is pretty hard to specify precisely, criteria for adding instructions was something along the lines of "no slowdown, speedup for at least one platform" (i.e. swizzle).

@cfallin
Copy link
Member

cfallin commented May 18, 2023

Right, and that came up in earlier discussions here too. From my perspective at least, it makes sense to tune the feature for newer hardware (and we do), but it kind of also sucks to just abandon users with old hardware completely. Given the choice between "user with 2007 laptop cannot run Wasmtime at all" and "user with 2007 laptop can run Wasmtime, but it's slow just like everything else on this machine" I think we'll take the latter :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants