This repository was archived by the owner on Nov 3, 2021. It is now read-only.

memory.copy|fill semantics limit optimizations for short constant lengths #111

@eqrion

As discussed in #1, the expectation is that producers will use memory.copy|fill for short lengths in addition to long lengths. We’ve already seen this to be the case, and have been investigating a performance regression resulting from LLVM 9 using memory.copy for short constant lengths [1].

Part of that regression comes from a sub-optimal out-of-line (OOL) call to the system memmove, but to really bring performance back to par, we’d like to inline these short memory.copy operations as plain loads and stores.

This has turned out to be challenging to implement in a way that performs as well as or better than the Wasm loads and stores that producers emitted previously.

There are several problems resulting from the following:

  1. We must assume no alignment for src or dest
  2. When a range is partially out of bounds (OOB), we must write all bytes up to the point where the range goes out of bounds
  3. We must copy correctly when src and dest overlap

(1) and (2) are related. Because we don’t know the alignment of src or dest, we cannot use transfers wider than a single byte (e.g., 32-bit, 64-bit, or 128-bit): a misaligned final store could straddle the memory boundary, and a partially OOB store would fail to write the bytes below the boundary that (2) requires.
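A minimal sketch in C of why byte-wide transfers are the safe default (mem, mem_len, and the trap return value are illustrative stand-ins for the engine’s linear memory and trap mechanism, not SpiderMonkey internals; overlap is ignored here):

```c
#include <stddef.h>
#include <stdint.h>

/* A byte-at-a-time copy satisfies (2) for free: by the time we trap,
 * every in-bounds byte has already been written. A wider (say 64-bit)
 * final store straddling the boundary could not honor that without
 * extra fix-up code. */
static int copy_bytewise(uint8_t *mem, size_t mem_len,
                         size_t dest, size_t src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (src + i >= mem_len || dest + i >= mem_len)
            return -1;               /* trap: bytes 0..i-1 already written */
        mem[dest + i] = mem[src + i];
    }
    return 0;
}
```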

The system memmove works around this by aligning the src/dest pointers, using wide transfer widths, and fixing up the slop afterwards, but that isn’t feasible for inlined code.
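For contrast, a rough sketch of that libc-style head/body/tail structure (forward direction only; the helper name is hypothetical); the extra branches and fix-up loops are exactly what makes this too much code to inline at every short memory.copy site:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy unaligned head bytes, then an aligned 64-bit-wide body, then
 * the tail slop. */
static void copy_align_fixup(uint8_t *dst, const uint8_t *src, size_t n) {
    while (n && ((uintptr_t)dst & 7) != 0) { *dst++ = *src++; n--; } /* head */
    for (; n >= 8; dst += 8, src += 8, n -= 8)
        memcpy(dst, src, 8);             /* body: one 64-bit load/store */
    while (n--) *dst++ = *src++;                                     /* tail */
}
```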

The problem with (3) is that we need to generate two sequences of loads and stores: one for when src < dest, where we need to copy from high -> low, and another for copying from low -> high. This adds code size and a comparison that we didn’t need before. It could potentially be solved in a branchless way by using vector registers as a temporary buffer, but that still has difficulty with (1) and (2).
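A sketch of the direction branch that (3) forces on inlined code (hypothetical helper, not engine code); an inlined expansion must materialize both load/store sequences plus the comparison:

```c
#include <stddef.h>
#include <stdint.h>

static void copy_with_direction(uint8_t *mem, size_t dest, size_t src,
                                size_t n) {
    if (dest <= src) {
        for (size_t i = 0; i < n; i++)       /* copy low -> high */
            mem[dest + i] = mem[src + i];
    } else {
        for (size_t i = n; i-- > 0; )        /* copy high -> low */
            mem[dest + i] = mem[src + i];
    }
}
```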

There seem to be several options that could improve this situation.

  1. We could ask LLVM not to emit memory.copy in these situations. memory.copy is not equivalent to memcpy; it is a memmove, and it has stricter semantics than LLVM requires. For example, with struct copies LLVM should know the alignment and that there is no overlap (the first sketch after this list shows what it could emit instead); recovering this information at runtime is unfortunate. The downsides are a potential binary size increase and being limited to the load and store widths defined in Wasm (e.g., no SIMD yet).

  2. We could modify the behavior for partially OOB ranges to not write any bytes at all. This would allow us to load all src bytes into vector registers, then store them to dest from high -> low (see the second sketch after this list). If there is a trap, it happens immediately and nothing is written. This fixes (1) and (3) by changing (2). The uncertainty is whether this stays compatible with a future ‘memory.protect’ instruction.

  3. We could add an alignment hint similar to the one used for plain loads and stores. We could then emit loads and stores at the width of the alignment, along with a guard checking the alignment (the third sketch after this list). If the guard fails, we’d fall back to a slow path; if it succeeds, we’d have the guarantee needed for (2). This approach still has problem (3), and adding an overlap hint doesn’t seem feasible due to the complexity of the guard it would require.
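For option (1), a minimal sketch of the copy LLVM could emit when it statically knows alignment and non-overlap; the struct and helper are hypothetical, and each 8-byte memcpy compiles to a single 64-bit load or store, mirroring i64.load/i64.store:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t lo, hi; } Pair;   /* 16 bytes, 8-byte aligned */

static void copy_pair(Pair *dest, const Pair *src) {
    uint64_t a, b;
    memcpy(&a, &src->lo, 8);    /* i64.load  offset=0 */
    memcpy(&b, &src->hi, 8);    /* i64.load  offset=8 */
    memcpy(&dest->lo, &a, 8);   /* i64.store offset=0 */
    memcpy(&dest->hi, &b, 8);   /* i64.store offset=8 */
}
```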
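For option (2), a sketch of the load-everything-first strategy under the proposed trap-before-any-write semantics; tmp stands in for vector registers, and the fixed buffer size is an assumption tied to short constant lengths:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumes n is a short compile-time constant with n <= sizeof tmp. */
static void copy_via_buffer(uint8_t *mem, size_t dest, size_t src, size_t n) {
    uint8_t tmp[32];
    memcpy(tmp, mem + src, n);   /* all loads (and any trap) happen first */
    memcpy(mem + dest, tmp, n);  /* stores run only if every load succeeded;
                                    the buffer also makes overlap harmless */
}
```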
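For option (3), a sketch of the alignment guard; the 8-byte alignment, the guard shape, and the memmove fallback are all illustrative assumptions:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void copy_with_align_guard(uint8_t *mem, size_t dest, size_t src,
                                  size_t n) {
    if (((dest | src | n) & 7) == 0) {        /* guard: hint actually holds */
        for (size_t i = 0; i < n; i += 8)     /* 64-bit-wide transfers; no
                                                 partial final store possible */
            memcpy(mem + dest + i, mem + src + i, 8);
    } else {
        memmove(mem + dest, mem + src, n);    /* slow path */
    }
    /* Problem (3) remains: the fast path above is forward-only. */
}
```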

cc @lars-t-hansen @julian-seward1 @lukewagner

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1570112
