-
Notifications
You must be signed in to change notification settings - Fork 793
Fix binary emitting of br_if with a refined value by emitting a cast #6510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The preparation work in StackIR is done but meanwhile @tlively landed a nice refactoring of scratch locals, and I realized I can use that here 😄 Basically this PR was a bit complex as it added a specific new mechanism for scratch locals, while #6583 adds a general mechanism for multiple scratch locals of a particular type. As evidence of the nice design there, it was easy to use here. So my inclination now is that this PR is the way to do. Separately, we can consider whether we want to refactor more binary writing logic into StackIR, but that is no longer needed or urgent. Top comment has been updated. |
src/wasm/wasm-stack.cpp
Outdated
auto localIndex = scratchLocals[t] + scratchTypeUses[t]; | ||
scratchTypeUses[t]++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids having to do the lookup twice.
auto localIndex = scratchLocals[t] + scratchTypeUses[t]; | |
scratchTypeUses[t]++; | |
auto localIndex = scratchLocals[t] + scratchTypeUses[t]++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry it's a little less readable that way... and I'd hope the optimizer avoids a double lookup? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick test on godbolt shows that even in the simplest possible case, the optimizer is not able to figure out how to avoid two lookups, unfortunately. I'd go with the single lookup solution, but I agree it's probably not that important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting it can't be optimized... I wonder why. I'll modify this then.
Seems like there is no other aliasing operation that can interfere, but I guess I was overly optimistic...
src/wasm/wasm-stack.cpp
Outdated
scratchTypeUses[t]--; | ||
auto localIndex = scratchLocals[t] + scratchTypeUses[t]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scratchTypeUses[t]--; | |
auto localIndex = scratchLocals[t] + scratchTypeUses[t]; | |
auto localIndex = scratchLocals[t] + --scratchTypeUses[t]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we definitely decide to go with the cast approach rather than the tee/drop/get approach?
if (unrefinedType == curr->type) { | ||
// It has the proper type anyhow. | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After #6579 there will be no need to know the source type when emitting a cast, so this check will be the only remaining use for unrefinedType
. This is also the only check that cannot be done as-is in ScratchLocalFinder
. WDYT about making ScratchLocalFinder
an ExpressionStackWalker
to avoid needing this separate walker at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overhead is annoying, I'm afraid, and I'd really like to not regress the by-far common case of not needing the extra work.
Some data:
$ perf stat -r 5 bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o a.wasm
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work
warning: no passes specified, not doing any work
Performance counter stats for 'bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o a.wasm' (5 runs):
10,947.27 msec task-clock # 1.767 CPUs utilized ( +- 0.38% )
855 context-switches # 78.102 /sec ( +- 9.24% )
40 cpu-migrations # 3.654 /sec ( +- 11.87% )
239,470 page-faults # 21.875 K/sec ( +- 0.07% )
34,826,556,461 cycles # 3.181 GHz ( +- 0.27% )
48,135,304,544 instructions # 1.38 insn per cycle ( +- 0.01% )
10,101,559,953 branches # 922.747 M/sec ( +- 0.02% )
150,681,797 branch-misses # 1.49% of all branches ( +- 1.17% )
6.1962 +- 0.0404 seconds time elapsed ( +- 0.65% )
And, with this patch:
$ git diff
diff --git a/src/wasm/wasm-stack.cpp b/src/wasm/wasm-stack.cpp
index 53a25d52b..1a83f56cb 100644
--- a/src/wasm/wasm-stack.cpp
+++ b/src/wasm/wasm-stack.cpp
@@ -2681,7 +2681,7 @@ void BinaryInstWriter::noteLocalType(Type type, Index count) {
}
InsertOrderedMap<Type, Index> BinaryInstWriter::countScratchLocals() {
- struct ScratchLocalFinder : PostWalker<ScratchLocalFinder> {
+ struct ScratchLocalFinder : ExpressionStackWalker<ScratchLocalFinder> {
BinaryInstWriter& parent;
InsertOrderedMap<Type, Index> scratches;
$ perf stat -r 5 bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o b.wasm
Performance counter stats for 'bin/wasm-opt UE4Game-HTML5-Shipping.wasm -all -o b.wasm' (5 runs):
11,311.23 msec task-clock # 1.727 CPUs utilized ( +- 0.26% )
720 context-switches # 63.654 /sec ( +- 5.01% )
29 cpu-migrations # 2.564 /sec ( +- 8.30% )
238,817 page-faults # 21.113 K/sec ( +- 0.04% )
36,257,517,592 cycles # 3.205 GHz ( +- 0.14% )
50,906,632,590 instructions # 1.40 insn per cycle ( +- 0.66% )
10,525,197,435 branches # 930.509 M/sec ( +- 0.01% )
155,907,135 branch-misses # 1.48% of all branches ( +- 0.70% )
6.5514 +- 0.0335 seconds time elapsed ( +- 0.51% )
That's about 5% more instructions and walltime. Note that that measures load+save, so the impact on save is around double that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, that's way too much slowdown :(
|
I don't think it should require StackIR since it's just another drop-in replacement instruction sequence that only requires scratch locals. We should have all the necessary infrastructure in place. Besides the cost of the cast, I'm concerned that sooner or later we're going to end up with types that cannot be cast because rossberg keeps insisting that this would be reasonable and desirable. Would it be helpful if I put together a PR using that approach that we could compare against? |
Hmm, thanks, but let me try myself, should take just a few minutes. I remember that one concern I had with doing it that way was that it could not integrate with the |
Ok, I started to write the code and then remembered more: Avoiding casts means we need an extra scratch local for the condition. We can avoid constant increases in roundtrips by detecting the case of a I think if we end up with types that cannot be cast in the future then we'd have no choice but to make that change, but for now it seems best not to, to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM rebased on the landed stringview PRs.
src/wasm/wasm-stack.cpp
Outdated
|
||
assert(curr->type.hasRef()); | ||
|
||
auto emitCast = [&](Type to, Type from) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the stringview PR has now landed, there's no need to supply the from
type anymore, so this can be simplified a bit. I guess we still need the unrefinedType
in the brIfsNeedingHandling
map for the scratch locals, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, from
can go away but we do still need to know the unrefined type because the scratch locals are of that type.
Alternative to #6390: That PR makes
br_if
with a value have the same typing rulesas the wasm spec, i.e., the potentially less refined type of the break target rather than
the value itself. This PR, instead, works around that difference: we keep using the more
refined type in Binaryen IR, and when we emit a binary we make sure to fix it up so it
validates.
The rationale for using a somewhat hackish approach like that is that we hope the wasm
spec will improve here, and so this is will only be temporary in that case. Even if not,
this is useful because by using the most refined type in the IR we optimize in the best
way possible, and only suffer when we emit fixups in the binary, but in practice those
cases are very rare:
br_if
is almost always dropped rather than used, in real-worldcode (except for fuzz cases and exploits).
In more detail, the fixup we perform is to add a cast when we see that the
br_if
's valueis used (not dropped) and it is not refined enough. The cast forces it to be the same
type as the
br_if
's value, basically.