Common Subexpression Elimination #403
Description
Hi everyone,
@abrown asked me to look into optimizations for the extended multiplication in #376 that would reduce the number of shuffles before executing pmuldq
(The proposed optimization is here). In the midst of doing research and analyzing the code, it became clear that for every wasmsimd instruction that required a sequence of instructions to implement, there would be redundant operations if there was more than one operation using the same parameters. In the xnnpack case, the multiplier is reused in each of the subsequent calculations. As a side effect, pshufd
can be called repeatedly on the multiplier even if the required intermediate was already generated as part of a previous operation.
What if we implemented something like an LRU on register outputs for every instruction that comes through? Something like this has to exist on some level since the compiler has to keep track of register allocations before it gets to instruction generation. But what if we extended this mechanism such that for every instruction that is assembled, it first checks to see if a previous temporary register with the output of that instruction is set and is valid?
The benefit of such optimization could mean tremendously better code generation in cases like shuffles, swizzles, widen_high/low, multiplications and any other scenario that we thought might be appropriate for multi-return.