memory.copy|fill semantics limit optimizations for short constant lengths #111
Description
As discussed in #1, the expectation is that producers will use `memory.copy|fill` for short lengths in addition to long lengths. We've already seen this to be the case, and have been investigating a performance regression resulting from LLVM 9 using `memory.copy` for short constant lengths [1].

Part of that regression is in a sub-optimal out-of-line (OOL) call to the system `memmove`, but to really get performance to par, we'd like to inline these short `memory.copy`s to loads and stores.
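For illustration, the kind of source that ends up as a short, constant-length copy is a plain struct assignment or a small fixed-size `memcpy`, as in the C sketch below. The struct and sizes are made up for the example; whether a given toolchain lowers these to `memory.copy` depends on the LLVM version and enabled target features (the behavior change described here is the one introduced with LLVM 9).

```c
#include <string.h>

/* A 16-byte struct chosen only for this example. */
struct header {
    int       id;
    int       flags;
    long long timestamp;
};

void assign(struct header *dst, const struct header *src) {
    *dst = *src;            /* aggregate copy; LLVM models this as a memcpy */
}

void copy_prefix(char *dst, const char *src) {
    memcpy(dst, src, 16);   /* short, constant-length copy */
}
```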
This has turned out to be challenging to implement in a way that is better than or equal to the Wasm loads and stores that were emitted previously.
There are several problems resulting from the following (a minimal sketch of the expansion these constraints force is given after the list):

1. We must assume no alignment for `src` and `dest`.
2. When a range is partially out of bounds (OOB), we must write all bytes up until the point where the region goes OOB.
3. We must copy correctly when `src` and `dest` overlap.
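For a concrete reference point, the sketch below (plain C, not engine code, with hypothetical `mem`, `mem_size`, and `trap` standing in for the engine's state) shows roughly what these constraints force on a naive inline expansion: byte-wide accesses because alignment is unknown, per-byte bounds handling so a partially OOB range still writes everything up to the boundary, and a direction check because the ranges may overlap.

```c
#include <stdint.h>

/* Hypothetical linear-memory state; index overflow checks are elided. */
extern uint8_t *mem;
extern uint32_t mem_size;
extern void trap(void);                 /* reports the OOB trap; does not return */

static void inline_memory_copy(uint32_t dest, uint32_t src, uint32_t len) {
    if (dest <= src) {
        /* dest at or below src: copy low -> high. */
        for (uint32_t i = 0; i < len; i++) {
            if (src + i >= mem_size || dest + i >= mem_size)
                trap();                 /* bytes before index i are already written */
            mem[dest + i] = mem[src + i];
        }
    } else {
        /* dest above src: copy high -> low so overlapping bytes survive. */
        for (uint32_t i = len; i > 0; i--) {
            if (src + i - 1 >= mem_size || dest + i - 1 >= mem_size)
                trap();
            mem[dest + i - 1] = mem[src + i - 1];
        }
    }
}
```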
(1) and (2) are related. Because we don't know the alignment of `src` or `dest`, we cannot use transfers wider than a single byte at a time (e.g. 32-bit, 64-bit, or 128-bit), or else we'd be at risk of the final store being partially OOB and not writing all bytes up to the boundary due to misalignment of `src` or `dest`. For example, if `dest` starts two bytes before the end of memory and the length is four, the two in-bounds bytes must be written before the trap, which a single misaligned 32-bit store cannot guarantee.
The system `memmove` can work around this by aligning the `src`/`dest` pointers, using wide transfer widths, and fixing up the slop afterwards. But this isn't feasible for inlined code.
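For context, here is a rough sketch of that technique, assuming a forward, non-overlapping copy (real `memmove` implementations are far more elaborate): the misaligned head and tail are handled with byte accesses while the bulk moves in wide chunks. That machinery is cheap when amortized over a long copy, but not something you'd want to inline for a handful of bytes.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative align-then-wide-then-fixup copy, forward direction only. */
static void copy_aligned_wide(uint8_t *dst, const uint8_t *src, size_t n) {
    /* Head: byte copies until dst reaches 8-byte alignment. */
    while (n > 0 && ((uintptr_t)dst & 7) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Body: 8-byte chunks (a constant-size memcpy compiles down to a wide
     * load/store pair without unaligned-access UB on src). */
    while (n >= 8) {
        uint64_t chunk;
        memcpy(&chunk, src, 8);
        memcpy(dst, &chunk, 8);
        src += 8; dst += 8; n -= 8;
    }
    /* Tail: fix up the remaining "slop" one byte at a time. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```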
The problem with (3) is that we need to generate two sequences of loads and stores: one for when `src < dest`, where we need to copy from high -> low, and another for copying from low -> high. This adds to code size and requires a comparison that we didn't need to do before. It could potentially be solved in a branchless way by using vector registers as a temporary buffer, but that approach still has difficulty with (1) and (2).
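For what it's worth, here is a sketch of the temporary-buffer idea for a constant length of 16 bytes (the buffer stands in for vector registers, and the length is chosen arbitrarily). Reading everything before writing makes the copy direction-independent, which removes the `src < dest` comparison, but (1) still forces byte-wide accesses without alignment information, and matching the exact partial-OOB write order required by (2) is glossed over here.

```c
#include <stdint.h>

extern uint8_t *mem;
extern uint32_t mem_size;
extern void trap(void);

/* Overlap-agnostic copy of 16 bytes through a temporary buffer. */
static void copy16_via_buffer(uint32_t dest, uint32_t src) {
    uint8_t tmp[16];
    for (uint32_t i = 0; i < 16; i++) {
        if (src + i >= mem_size)
            trap();
        tmp[i] = mem[src + i];
    }
    for (uint32_t i = 0; i < 16; i++) {
        if (dest + i >= mem_size)
            trap();                     /* earlier bytes are already written */
        mem[dest + i] = tmp[i];
    }
}
```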
There seem to be several options that could improve this situation.
- We could ask LLVM to not emit `memory.copy` in these situations. `memory.copy` is not equivalent to `memcpy`; it's `memmove`, and it has stricter semantics than LLVM requires. For example, with struct copies LLVM should know the alignment and that there is no overlap; having to recover this information at runtime is unfortunate. The downsides are a potential binary size increase and being limited to the load and store widths that are defined in Wasm (e.g. no SIMD yet).
- We could modify the behavior for partially OOB ranges to not write any bytes at all. This would allow us to load all `src` bytes into vector registers, then store them to `dest` from high -> low. If there is a trap, it will happen immediately and nothing will be written. This fixes (1) and (3) by changing (2). The uncertainty here is whether this is compatible with a future `memory.protect` instruction.
- We could add an alignment hint similar to the one used for plain loads and stores. We could then emit loads and stores at the width of the alignment, along with a guard checking the alignment (a rough sketch of such a guard follows below). If the guard fails, we'd need to fall back to a slow path; if the guard succeeds, we'd have a guarantee for (2). This approach still has the problem of (3), and adding an overlap hint doesn't seem feasible due to the complexity of the guard required.
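To make the alignment-hint option concrete, here is a hedged sketch of what a guarded expansion could look like for a constant length of 16 and a hinted 4-byte alignment. All names (`mem`, `mem_size`, `memory_copy_slow_path`) are hypothetical; overlapping ranges are simply routed to the slow path, which still pays the comparison cost described for (3).

```c
#include <stdint.h>
#include <string.h>

extern uint8_t *mem;
extern uint32_t mem_size;
/* Hypothetical out-of-line helper implementing the full, precise semantics. */
extern void memory_copy_slow_path(uint32_t dest, uint32_t src, uint32_t len);

static void copy16_align4_hint(uint32_t dest, uint32_t src) {
    int misaligned = ((dest | src) & 3) != 0;
    int oob        = (uint64_t)dest + 16 > mem_size || (uint64_t)src + 16 > mem_size;
    int overlap    = dest < src + 16 && src < dest + 16;
    if (misaligned || oob || overlap) {
        memory_copy_slow_path(dest, src, 16);   /* guard failed: take the slow path */
        return;
    }
    /* Guard succeeded: every 4-byte access is aligned and fully in bounds,
     * so the partial-OOB concern (2) cannot arise on this fast path. */
    for (uint32_t i = 0; i < 16; i += 4) {
        uint32_t w;
        memcpy(&w, &mem[src + i], 4);
        memcpy(&mem[dest + i], &w, 4);
    }
}
```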