Optional SIMD str(c)spn #597
Conversation
for (; *c; c++) {
  // Terminator IS NOT on the bitmap.
  __wasm_v128_setbit(&bitmap, *c);
Just a note for future reference: I was initially a bit concerned that we would incur startup costs too heavy for the "check a small string" use case. But of course it's better to loop over c once up front rather than at each character in s, like the scalar version does.
The scalar version does the same: it iterates over c once, building a bitmap:
https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/string/strspn.c
They use some "inscrutable" (but well known) macros to build a more straightforward bitmap in stack memory. I used this function to build our weird 256-bit bitmap "directly" into a pair of v128 vectors.
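For reference, those macros look roughly like this (quoted from memory from musl's strspn.c, so treat as approximate):

#define BITOP(a,b,op) \
 ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))

size_t byteset[32/sizeof(size_t)] = { 0 };
for (; *c && BITOP(byteset, *(unsigned char *)c, |=); c++);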
bitmap->lo[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 0));
bitmap->hi[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 8));
I'm interested in understanding the codegen of this: so the ...[lo_nibble] |= is generating some i8x16.replace_lane but somehow also OR-ing the high nibble bits? What is emitted by LLVM here?
LLVM cheats and uses the stack; wasm-opt (and wasm-ctor-eval) then removes any traces of $__stack_pointer in this build:
https://github.com/ncruces/go-sqlite3/blob/b72fd5db/sqlite3/libc/libc.wat#L1646-L1726
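Putting the excerpt together with the discussion above, here is a self-contained sketch of how the helper might be written; the struct layout and the byte-pointer casts are my assumptions, not necessarily the PR's actual code:

#include <stdint.h>
#include <wasm_simd128.h>

typedef struct {
  v128_t lo;  // 16 rows (indexed by low nibble) for bytes 0x00-0x7F
  v128_t hi;  // 16 rows (indexed by low nibble) for bytes 0x80-0xFF
} __wasm_v128_bitmap256_t;

__attribute__((always_inline))
static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, unsigned char c) {
  uint8_t lo_nibble = c & 0xf;
  uint8_t hi_nibble = c >> 4;
  uint8_t *lo = (uint8_t *)&bitmap->lo;
  uint8_t *hi = (uint8_t *)&bitmap->hi;
  // Branchless: exactly one of the two shifted values survives the cast to
  // uint8_t. For hi_nibble >= 8, 1 << hi_nibble is at least 256, so its low
  // 8 bits are zero; for hi_nibble < 8, the negative (hi_nibble - 8) count
  // is masked modulo 32 by Wasm's shift semantics, landing the bit above
  // bit 7. (Not portable C, but well-defined in the generated Wasm.)
  lo[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 0));
  hi[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 8));
}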
Just to document the decision to use Geoff Langdale's variant of the algorithm: it trades (in the original by Wojciech Muła):
…
for (in the variant with Geoff Langdale's input):
…
Just for documentation purposes (I can add this to the code too): this algorithm is basically the same as Hyperscan's Truffle. The bitset is represented in the exact same way (though creation looks different).
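To make "represented in the exact same way" concrete, here is a scalar equivalent of one lane of the SIMD lookup (a sketch reusing the layout assumed in the setbit sketch above; bitmap256_test is a hypothetical helper, not from the PR):

#include <stdbool.h>

static bool bitmap256_test(const __wasm_v128_bitmap256_t *bitmap, unsigned char c) {
  uint8_t lo_nibble = c & 0xf;
  uint8_t hi_nibble = c >> 4;
  // The low nibble selects the row; the high nibble selects the bit within it.
  const uint8_t *half = (const uint8_t *)(hi_nibble < 8 ? &bitmap->lo : &bitmap->hi);
  return (half[lo_nibble] >> (hi_nibble & 7)) & 1;
}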
This is actually exactly the same as Hyperscan's Truffle. Copying here, this is:

static really_inline
u32 block(m128 shuf_mask_lo_highclear, m128 shuf_mask_lo_highset, m128 v) {
    m128 highconst = _mm_set1_epi8(0x80);
    m128 shuf_mask_hi = _mm_set1_epi64x(0x8040201008040201);
    // and now do the real work
    m128 shuf1 = pshufb_m128(shuf_mask_lo_highclear, v);
    m128 t1 = xor128(v, highconst);
    m128 shuf2 = pshufb_m128(shuf_mask_lo_highset, t1);
    m128 t2 = andnot128(highconst, rshift64_m128(v, 4));
    m128 shuf3 = pshufb_m128(shuf_mask_hi, t2);
    m128 tmp = and128(or128(shuf1, shuf2), shuf3);
    m128 tmp2 = eq128(tmp, zeroes128());
    u32 z = movemask128(tmp2);
    return z;
}

This is mine:

__attribute__((always_inline))
static v128_t __wasm_v128_chkbits(__wasm_v128_bitmap256_t bitmap, v128_t v) {
  v128_t hi_nibbles = wasm_u8x16_shr(v, 4);
  v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128, //
                                           1, 2, 4, 8, 16, 32, 64, 128);
  v128_t bitmask = wasm_i8x16_relaxed_swizzle(bitmask_lookup, hi_nibbles);
  v128_t indices_0_7 = v & wasm_u8x16_const_splat(0x8f);
  v128_t indices_8_15 = indices_0_7 ^ wasm_u8x16_const_splat(0x80);
  v128_t row_0_7 = wasm_i8x16_swizzle(bitmap.lo, indices_0_7);
  v128_t row_8_15 = wasm_i8x16_swizzle(bitmap.hi, indices_8_15);
  v128_t bitsets = row_0_7 | row_8_15;
  return wasm_i8x16_eq(bitsets & bitmask, bitmask);
}

Now to compare and explain the differences:

// This does a 4-bit unsigned right shift for byte lanes, which doesn't exist on Intel.
// Hyperscan shifts 64-bit lanes (which means bits from each lane end up in the following lane),
// then clears the high bit of each 8-bit lane (we'll get to why).
//
// m128 highconst = _mm_set1_epi8(0x80);
// m128 t2 = andnot128(highconst, rshift64_m128(v, 4));
v128_t hi_nibbles = wasm_u8x16_shr(v, 4);
// This is just another way to express the same constant,
// maybe nicer, so I'll update.
//
// m128 shuf_mask_hi = _mm_set1_epi64x(0x8040201008040201);
v128_t bitmask_lookup = wasm_u64x2_const_splat(0x8040201008040201);
// This does the shuffle/swizzle.
// Bit 7 zeros the lane (both Wasm and Intel).
// Intel ignores bits 4-6, so it doesn't need to clear them;
// Wasm treats them like bit 7, so it's good that u8x16_shr cleared them.
// Having done so, we can do a relaxed swizzle (if available)
// which ignores the difference and is faster on Intel.
//
// m128 shuf3 = pshufb_m128(shuf_mask_hi, t2);
v128_t bitmask = wasm_i8x16_relaxed_swizzle(bitmask_lookup, hi_nibbles);
// Again, we need to clear bits 4-6 because Wasm swizzle doesn't like them,
// but we need the bit 7 behavior, so we can't use relaxed.
// Intel can skip this.
v128_t indices_0_7 = v & wasm_u8x16_const_splat(0x8f);
// We use indices_0_7 because it already has bits 4-6 cleared.
// Intel can use v to exploit ILP.
// The xor is the same, still can't use relaxed.
//
// m128 t1 = xor128(v, highconst);
v128_t indices_8_15 = indices_0_7 ^ wasm_u8x16_const_splat(0x80);
// m128 shuf1 = pshufb_m128(shuf_mask_lo_highclear, v);
// m128 shuf2 = pshufb_m128(shuf_mask_lo_highset, t1);
v128_t row_0_7 = wasm_i8x16_swizzle(bitmap.lo, indices_0_7);
v128_t row_8_15 = wasm_i8x16_swizzle(bitmap.hi, indices_8_15);
// m128 tmp = and128(or128(shuf1, shuf2), shuf3);
v128_t bitsets = (row_0_7 | row_8_15) & bitmask;
// Hyperscan is calculating the opposite of our predicate.
// (eq vs ne, and eq is much cheaper than ne)
// When non-zero, the result is bitmask, so instead of doing
// ne(x, zero) we can do eq(x, bitmask).
//
// m128 tmp2 = eq128(tmp, zeroes128());
return wasm_i8x16_ne(bitsets, (v128_t){});

I hope this makes the differences clear. I still think that, given Wasm SIMD instructions, this is close to the best approach. Not sure if using the complement would be faster; that's the only change I think could be made. Everything else follows from the slightly changed semantics.
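For completeness, here is roughly how either predicate variant's result would be consumed. This is a sketch of a hypothetical strcspn-style inner loop (identifiers like i are illustrative, not from the PR), assuming strcspn puts the terminator on the bitmap (the hunk above notes the opposite case explicitly):

// One 16-byte step of the scan. With the terminator itself on the bitmap,
// the NUL at the end of s is guaranteed to match, so the loop needs no
// separate end-of-string check.
const v128_t v = wasm_v128_load(s + i);
const v128_t found = __wasm_v128_chkbits(bitmap, v);  // eq-predicate variant
const uint32_t mask = wasm_i8x16_bitmask(found);      // one bit per byte lane
if (mask)
  return i + (size_t)__builtin_ctz(mask);  // index of first byte in the set
i += sizeof(v128_t);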
Having tested, it seems using the opposite predicate (complement) is best for …
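Concretely, the complement variant is presumably just the cheaper comparison on the same value (an assumption on my part; eq versus ne is the only difference):

// Complement: lanes become all-ones where the byte is NOT in the set,
// which is exactly what a strspn-style loop wants to find first.
return wasm_i8x16_eq(bitsets, (v128_t){});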
Continuing #580, implements strspn and strcspn.

This one follows the same general structure as #586, #592 and #594, but uses a somewhat more complicated algorithm, described here.

I used the Geoff Langdale alternative implementation (the tweet has since disappeared), which is correctly described there but has a subtle bug in the implementation: WojciechMula/simd-byte-lookup#2

Since the complexity needed for __wasm_v128_bitmap256_t is shared by both strspn and strcspn, I moved the implementation to a common file when SIMD is used.

The tests follow a similar structure to the previous ones, and cover the bug, which I found through fuzzing.
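For illustration, the differential check such fuzzing reduces to might look like this (a sketch; naive_strspn is a hypothetical reference oracle, not the PR's actual test code):

#include <assert.h>
#include <string.h>

// Obvious quadratic oracle: count the prefix of s made only of bytes in c.
static size_t naive_strspn(const char *s, const char *c) {
  size_t i = 0;
  while (s[i] && strchr(c, s[i]))
    i++;
  return i;
}

// Compare the SIMD implementation against the oracle on fuzzer-provided inputs.
static void check(const char *s, const char *c) {
  assert(strspn(s, c) == naive_strspn(s, c));
  // strcspn would get the analogous check against its own oracle.
}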