Fix binary emitting of br_if with a refined value by emitting a cast #6510


Merged May 16, 2024 (105 commits)

Conversation

kripken (Member) commented Apr 17, 2024

Alternative to #6390: That PR makes br_if with a value have the same typing rules
as the wasm spec, i.e., the potentially less refined type of the break target rather than
the value itself. This PR, instead, works around that difference: we keep using the more
refined type in Binaryen IR, and when we emit a binary we make sure to fix it up so it
validates.

The rationale for using a somewhat hackish approach like that is that we hope the wasm
spec will improve here, in which case this will only be temporary. Even if not, this is
useful because by using the most refined type in the IR we optimize in the best way
possible, and we only suffer when we emit fixups in the binary; in practice those cases
are very rare, since in real-world code a br_if is almost always dropped rather than
used (fuzz cases and exploits aside).

In more detail, the fixup we perform is to add a cast when we see that the br_if's value
is used (not dropped) and is not refined enough. The cast forces the emitted value back
to the refined type of the br_if's value in the IR.
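
As a concrete illustration, here is a minimal sketch of the fixup (the type and helper names are hypothetical, not the actual wasm-stack.cpp code): a br_if carrying an i31ref value to a block typed anyref has the spec-visible fallthrough type anyref, while Binaryen IR keeps i31ref, so a cast closes the gap.

// Hypothetical stand-ins for Binaryen's types and emit helpers.
enum class Type { anyref, i31ref };
void emitRefCast(Type target); // assumed: writes a ref.cast to `target`

// Emitted instruction stream when the br_if's value is used, not dropped:
//
//   ;; stack: value : i31ref, condition : i32
//   br_if $target     ;; spec-typed fallthrough result: anyref
//   ref.cast i31ref   ;; fixup: restore the refined IR type
//
void fixupBrIfValue(Type refinedType, Type unrefinedType) {
  if (unrefinedType == refinedType) {
    return; // the types already agree; no fixup needed
  }
  // The value on the stack has the unrefined target type at this point;
  // cast it back to the refined type the surrounding IR expects.
  emitRefCast(refinedType);
}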

kripken (Member, Author) commented May 14, 2024

The preparation work in StackIR is done, but meanwhile @tlively landed a nice refactoring of scratch locals, and I realized I can use that here 😄 Basically, this PR was a bit complex because it added a specific new mechanism for scratch locals, while #6583 adds a general mechanism for multiple scratch locals of a particular type. As evidence of the nice design there, it was easy to use here.

So my inclination now is that this PR is the way to go. Separately, we can consider whether we want to refactor more binary writing logic into StackIR, but that is no longer needed or urgent.

Top comment has been updated.

Comment on lines 105 to 106
auto localIndex = scratchLocals[t] + scratchTypeUses[t];
scratchTypeUses[t]++;
Member

This avoids having to do the lookup twice.

Suggested change
- auto localIndex = scratchLocals[t] + scratchTypeUses[t];
- scratchTypeUses[t]++;
+ auto localIndex = scratchLocals[t] + scratchTypeUses[t]++;

kripken (Member, Author) May 15, 2024

I worry it's a little less readable that way... and I'd hope the optimizer avoids a double lookup? What do you think?

Member

A quick test on godbolt shows that even in the simplest possible case, the optimizer is not able to figure out how to avoid two lookups, unfortunately. I'd go with the single lookup solution, but I agree it's probably not that important.

kripken (Member, Author)

Interesting it can't be optimized... I wonder why. I'll modify this then.

Seems like there is no other aliasing operation that can interfere, but I guess I was overly optimistic...
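
For reference, a minimal reconstruction of the comparison (not the exact godbolt test mentioned above):

#include <unordered_map>

// Two operator[] calls: compilers generally keep both hash lookups here,
// since operator[] may insert, and proving the second call returns the
// same slot would require reasoning through the container internals.
unsigned twoLookups(std::unordered_map<int, unsigned>& uses, int t) {
  unsigned index = uses[t];
  uses[t]++;
  return index;
}

// Single lookup: post-increment returns the old value, so this is
// equivalent and touches the hash table only once.
unsigned oneLookup(std::unordered_map<int, unsigned>& uses, int t) {
  return uses[t]++;
}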

Comment on lines 111 to 112
scratchTypeUses[t]--;
auto localIndex = scratchLocals[t] + scratchTypeUses[t];
Member

Suggested change
- scratchTypeUses[t]--;
- auto localIndex = scratchLocals[t] + scratchTypeUses[t];
+ auto localIndex = scratchLocals[t] + --scratchTypeUses[t];

tlively (Member) left a comment

Did we definitely decide to go with the cast approach rather than the tee/drop/get approach?

Comment on lines +2848 to +2851
if (unrefinedType == curr->type) {
// It has the proper type anyhow.
return;
}
Member

After #6579 there will be no need to know the source type when emitting a cast, so this check will be the only remaining use for unrefinedType. This is also the only check that cannot be done as-is in ScratchLocalFinder. WDYT about making ScratchLocalFinder an ExpressionStackWalker to avoid needing this separate walker at all?

kripken (Member, Author)

The overhead is annoying, I'm afraid, and I'd really like not to regress the by-far most common case of not needing the extra work.

Some data:

$ perf stat -r 5 bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o a.wasm
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work

 Performance counter stats for 'bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o a.wasm' (5 runs):

         10,947.27 msec task-clock                       #    1.767 CPUs utilized               ( +-  0.38% )
               855      context-switches                 #   78.102 /sec                        ( +-  9.24% )
                40      cpu-migrations                   #    3.654 /sec                        ( +- 11.87% )
           239,470      page-faults                      #   21.875 K/sec                       ( +-  0.07% )
    34,826,556,461      cycles                           #    3.181 GHz                         ( +-  0.27% )
    48,135,304,544      instructions                     #    1.38  insn per cycle              ( +-  0.01% )
    10,101,559,953      branches                         #  922.747 M/sec                       ( +-  0.02% )
       150,681,797      branch-misses                    #    1.49% of all branches             ( +-  1.17% )

            6.1962 +- 0.0404 seconds time elapsed  ( +-  0.65% )

And, with this patch:

$ git diff
diff --git a/src/wasm/wasm-stack.cpp b/src/wasm/wasm-stack.cpp
index 53a25d52b..1a83f56cb 100644
--- a/src/wasm/wasm-stack.cpp
+++ b/src/wasm/wasm-stack.cpp
@@ -2681,7 +2681,7 @@ void BinaryInstWriter::noteLocalType(Type type, Index count) {
 }
 
 InsertOrderedMap<Type, Index> BinaryInstWriter::countScratchLocals() {
-  struct ScratchLocalFinder : PostWalker<ScratchLocalFinder> {
+  struct ScratchLocalFinder : ExpressionStackWalker<ScratchLocalFinder> {
     BinaryInstWriter& parent;
     InsertOrderedMap<Type, Index> scratches;
$ perf stat -r 5 bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o b.wasm

 Performance counter stats for 'bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o b.wasm' (5 runs):

         11,311.23 msec task-clock                       #    1.727 CPUs utilized               ( +-  0.26% )
               720      context-switches                 #   63.654 /sec                        ( +-  5.01% )
                29      cpu-migrations                   #    2.564 /sec                        ( +-  8.30% )
           238,817      page-faults                      #   21.113 K/sec                       ( +-  0.04% )
    36,257,517,592      cycles                           #    3.205 GHz                         ( +-  0.14% )
    50,906,632,590      instructions                     #    1.40  insn per cycle              ( +-  0.66% )
    10,525,197,435      branches                         #  930.509 M/sec                       ( +-  0.01% )
       155,907,135      branch-misses                    #    1.48% of all branches             ( +-  0.70% )

            6.5514 +- 0.0335 seconds time elapsed  ( +-  0.51% )

That's about 5% more instructions and wall time. Note that this measures load+save; since the load side is unchanged, the impact on save alone is around double that.

Member

Oof, that's way too much slowdown :(

kripken (Member, Author) commented May 15, 2024

> Did we definitely decide to go with the cast approach rather than the tee/drop/get approach?

tee/drop/get requires StackIR, I think - hard to do it otherwise. I do have a prototype of that working, but it requires a lot of other infrastructure changes to support it. So I think the cost of a cast here is not that bad - this is a pretty rare situation.

tlively (Member) commented May 15, 2024

I don't think it should require StackIR since it's just another drop-in replacement instruction sequence that only requires scratch locals. We should have all the necessary infrastructure in place. Besides the cost of the cast, I'm concerned that sooner or later we're going to end up with types that cannot be cast because rossberg keeps insisting that this would be reasonable and desirable. Would it be helpful if I put together a PR using that approach that we could compare against?

kripken (Member, Author) commented May 15, 2024

Hmm, thanks, but let me try myself, should take just a few minutes.

I remember that one concern I had with doing it that way was that it could not integrate with the tuple.extract optimizations. But doing it in StackIR, which is earlier, allowed that. But maybe that's not important enough. Let me try it out here and see.

kripken (Member, Author) commented May 15, 2024

Ok, I started to write the code and then remembered more: Avoiding casts means we need an extra scratch local for the condition. We can avoid constant increases in roundtrips by detecting the case of a local.get there, but we'd need to be careful of a local.get of a scratch local (edit: and StackIR stuff as in the other PR). Overall it is more complicated.
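
For reference, a sketch of what that cast-free lowering would look like (illustrative only; the emit helpers are made up, and this is not code from either PR). The operands before a br_if sit on the stack as [value, condition], so stashing the value requires first spilling the condition to the extra scratch local mentioned above; the value's scratch local is assumed to be declared with the refined type:

// Hypothetical emit helpers, for illustration only.
using Index = unsigned;
void emitLocalSet(Index i);
void emitLocalTee(Index i);
void emitLocalGet(Index i);
void emitBrIf();
void emitDrop();

// Emits the tee/drop/get sequence:
//
//   ;; stack: value : (refined type), condition : i32
//   local.set $cond   ;; spill the condition (the extra scratch local)
//   local.tee $val    ;; stash the refined value, keep it on the stack
//   local.get $cond
//   br_if $target     ;; fallthrough result has the unrefined type
//   drop              ;; discard the unrefined copy
//   local.get $val    ;; reload the value with its refined type
//
void emitTeeDropGet(Index condScratch, Index valueScratch) {
  emitLocalSet(condScratch);
  emitLocalTee(valueScratch);
  emitLocalGet(condScratch);
  emitBrIf();
  emitDrop();
  emitLocalGet(valueScratch);
}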

I think if we end up with types that cannot be cast in the future then we'd have no choice but to make that change, but for now it seems best not to, to me.

tlively (Member) left a comment

LGTM rebased on the landed stringview PRs.


assert(curr->type.hasRef());

auto emitCast = [&](Type to, Type from) {
Member

Since the stringview PR has now landed, there's no need to supply the from type anymore, so this can be simplified a bit. I guess we still need the unrefinedType in the brIfsNeedingHandling map for the scratch locals, though.

kripken (Member, Author)

Yeah, from can go away but we do still need to know the unrefined type because the scratch locals are of that type.
