Optional SIMD str(c)spn #597

Open · ncruces wants to merge 2 commits into main
Conversation

ncruces (Contributor) commented Jul 17, 2025

Continuing #580, implements strspn and strcspn.

This one follows the same general structure as #586, #592 and #594, but uses a somewhat more complicated algorithm, described here.

I used Geoff Langdale's alternative implementation (the tweet has since disappeared), which is correctly described there but has a subtle bug in the implementation: WojciechMula/simd-byte-lookup#2

Since the complexity needed for __wasm_v128_bitmap256_t is shared by strspn and strcspn, I moved the implementation to a common file used when SIMD is enabled.

The tests follow a similar structure to the previous ones, and cover the bug, which I found through fuzzing.
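
For reference, the shared bitmap is just a pair of 16-lane byte vectors, roughly along these lines (a sketch: the helper typedef and exact field types are illustrative, not necessarily the code as committed):

#include <stdint.h>
#include <wasm_simd128.h>

// 256-entry bitmap: lane = low nibble of the byte, bit = high nibble.
// `lo` covers bytes 0x00-0x7f, `hi` covers bytes 0x80-0xff. A lane-indexable
// vector type keeps single bytes addressable while still converting to
// v128_t for the swizzles.
typedef uint8_t __wasm_u8x16 __attribute__((vector_size(16)));

typedef struct {
  __wasm_u8x16 lo;  // bits for high nibbles 0-7
  __wasm_u8x16 hi;  // bits for high nibbles 8-15
} __wasm_v128_bitmap256_t;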


for (; *c; c++) {
  // Terminator IS NOT on the bitmap.
  __wasm_v128_setbit(&bitmap, *c);
Collaborator

Just a note for future reference: I was initially a bit concerned here that we would incur startup costs too heavy for the "check a small string" use case (?). But of course it's better to loop over c once up front rather than at each character in s like the scalar version does.

Contributor Author

The scalar version does the same: it iterates over c once, building a bitmap:
https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/string/strspn.c

They use some "inscrutable" (but well-known) macros to build a more straightforward bitmap in stack memory.
I used this function to build our weird 256-bit bitmap "directly" into a pair of v128 vectors.
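
Roughly, the setter looks like this (the signature and nibble computation are a paraphrase; the two |= lines are the ones quoted in the review thread just below):

// Record byte c in the 256-entry set: lane from c's low nibble, bit from its
// high nibble. For any given c only one of the two writes actually sets a
// bit; the other one's shift goes out of the 0-7 range and contributes
// nothing after the uint8_t truncation (with Wasm's masked shift count).
static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, int c) {
  uint8_t hi_nibble = (uint8_t)c >> 4;
  uint8_t lo_nibble = (uint8_t)c & 0xf;
  bitmap->lo[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 0));
  bitmap->hi[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 8));
}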

Comment on lines 26 to 27
bitmap->lo[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 0));
bitmap->hi[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 8));
Collaborator
I'm interested in understanding the codegen of this: so the ...[lo_nibble] |= is generating some i8x16.replace_lane but somehow also OR-ing the high nibble bits? What is emitted by LLVM here?

Contributor Author

LLVM cheats and uses the stack; wasm-opt (and wasm-ctor-eval) then removes any trace of $__stack_pointer in this build:
https://github.com/ncruces/go-sqlite3/blob/b72fd5db/sqlite3/libc/libc.wat#L1646-L1726

ncruces (Contributor Author) commented Jul 22, 2025

Just to document the decision to use Geoff Langdale's variant of the algorithm:

It trades (in Wojciech Muła's original):

  • 1 wasm_v128_bitselect or wasm_i8x16_relaxed_laneselect
  • 2 wasm_i8x16_swizzle or wasm_i8x16_relaxed_swizzle
  • 1 wasm_i8x16_lt
  • 1 wasm_v128_and

for (in the variant with Geoff Langdale's input):

  • 2 wasm_i8x16_swizzle
  • 1 wasm_v128_xor
  • 1 wasm_v128_or

ncruces (Contributor Author) commented Aug 7, 2025

Just for documentation purposes (I can add this to the code too), this algorithm is basically the same as truffle in Hyperscan:
https://github.com/intel/hyperscan/blob/v5.4.2/src/nfa/truffle.c#L65
https://github.com/intel/hyperscan/blob/v5.4.2/src/nfa/trufflecompile.cpp

The bitset is represented in the exact same way (though creation looks different).

truffle may save a bitwise op on actual hardware, but it's hard to tell. First, wasm_u8x16_shr probably needs a bitwise and in there on Intel, which truffle may be able to fuse with a subsequent and. Also wasm_i8x16_swizzle requires bits 456 to be zeroed, unlike the Intel instruction.
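
To make the extra and concrete: x86 has no 8-bit SIMD shift, so a typical lowering of the nibble extraction looks something like the following (a sketch of the common pattern, not necessarily what any particular engine emits):

#include <emmintrin.h>

// Typical SSE2 lowering of wasm_u8x16_shr(v, 4): shift 16-bit lanes, then
// mask off the bits that bled in from the neighbouring byte. The extra
// _mm_and_si128 is the "bitwise and" mentioned above; truffle gets away with
// only clearing bit 7 (via andnot) because pshufb ignores bits 456.
static inline __m128i u8x16_shr4_sse2(__m128i v) {
  return _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0f));
}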

ncruces (Contributor Author) commented Aug 8, 2025

This is actually exactly the same as truffle.

Copying here, this is truffle:

static really_inline
u32 block(m128 shuf_mask_lo_highclear, m128 shuf_mask_lo_highset, m128 v) {

  m128 highconst = _mm_set1_epi8(0x80);
  m128 shuf_mask_hi = _mm_set1_epi64x(0x8040201008040201);
  
  // and now do the real work
  m128 shuf1 = pshufb_m128(shuf_mask_lo_highclear, v);
  m128 t1 = xor128(v, highconst);
  m128 shuf2 = pshufb_m128(shuf_mask_lo_highset, t1);
  m128 t2 = andnot128(highconst, rshift64_m128(v, 4));
  m128 shuf3 = pshufb_m128(shuf_mask_hi, t2);
  m128 tmp = and128(or128(shuf1, shuf2), shuf3);
  m128 tmp2 = eq128(tmp, zeroes128());
  u32 z = movemask128(tmp2);
  
  return z;
}

This is mine:

__attribute__((always_inline))
static v128_t __wasm_v128_chkbits(__wasm_v128_bitmap256_t bitmap, v128_t v) {
  v128_t hi_nibbles = wasm_u8x16_shr(v, 4);
  v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128,  //
                                           1, 2, 4, 8, 16, 32, 64, 128);
  v128_t bitmask = wasm_i8x16_relaxed_swizzle(bitmask_lookup, hi_nibbles);

  v128_t indices_0_7 = v & wasm_u8x16_const_splat(0x8f);
  v128_t indices_8_15 = indices_0_7 ^ wasm_u8x16_const_splat(0x80);

  v128_t row_0_7 = wasm_i8x16_swizzle(bitmap.lo, indices_0_7);
  v128_t row_8_15 = wasm_i8x16_swizzle(bitmap.hi, indices_8_15);

  v128_t bitsets = row_0_7 | row_8_15;
  return wasm_i8x16_eq(bitsets & bitmask, bitmask);
}

Now to compare and explain the differences:

// This does a 4-bit unsigned right shift for byte lanes, which doesn't exist on Intel.
// Hyperscan shifts 64-bit lanes (so bits from each byte leak into the neighbouring byte),
// then clears the high bit of each 8-bit lane (we'll get to why).
//
// m128 highconst = _mm_set1_epi8(0x80);
// m128 t2 = andnot128(highconst, rshift64_m128(v, 4));
v128_t hi_nibbles = wasm_u8x16_shr(v, 4);

// This is just another way to express the same constant;
// it's arguably nicer, so I'll update mine to match.
//
// m128 shuf_mask_hi = _mm_set1_epi64x(0x8040201008040201);
v128_t bitmask_lookup = wasm_u64x2_const_splat(0x8040201008040201);

// This does the shuffle/swizzle.
// Bit 7 zeros the lane (both Wasm and Intel).
// Intel ignores bits 456, so doesn't need to clear them;
// Wasm treats them like bit 7, so it's good that u8x16_shr cleared them.
// Having done so, we can do a relaxed swizzle (if available)
// which ignores the difference and is faster on Intel.
//
// m128 shuf3 = pshufb_m128(shuf_mask_hi, t2);
v128_t bitmask = wasm_i8x16_relaxed_swizzle(bitmask_lookup, hi_nibbles);

// Again, we need to clear bits 456 because Wasm swizzle doesn't like them,
// but we need the bit 7 behavior, so can't use relaxed.
// Intel can skip this.
v128_t indices_0_7 = v & wasm_u8x16_const_splat(0x8f);

// We use indices_0_7 because it has cleared bits 456.
// Intel can use v to exploit ILP.
// The xor is the same, still can't use relaxed.
//
// m128 t1 = xor128(v, highconst);
v128_t indices_8_15 = indices_0_7 ^ wasm_u8x16_const_splat(0x80);

// m128 shuf1 = pshufb_m128(shuf_mask_lo_highclear, v);
// m128 shuf2 = pshufb_m128(shuf_mask_lo_highset, t1);
v128_t row_0_7 = wasm_i8x16_swizzle(bitmap.lo, indices_0_7);
v128_t row_8_15 = wasm_i8x16_swizzle(bitmap.hi, indices_8_15);

// m128 tmp = and128(or128(shuf1, shuf2), shuf3);
v128_t bitsets = (row_0_7 | row_8_15) & bitmask;

// Hyperscan is calculating the opposite of our predicate
// (eq against zero, where we want ne; and eq is cheaper than ne).
// When a byte is in the set, the masked result equals bitmask, so instead
// of ne(x, zero) we can use eq(x, bitmask), as the function above does.
//
// m128 tmp2 = eq128(tmp, zeroes128());
return wasm_i8x16_ne(bitsets, (v128_t){});

I hope this makes the differences clear. I still think that, given the Wasm SIMD instruction set, this is close to the best approach. I'm not sure whether using the complement would be faster; that's the only change I think could be made. Everything else follows from the slightly different semantics.
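
For reference, this is roughly how __wasm_v128_chkbits gets consumed (an illustrative sketch assuming the definitions above are in scope; the real loop also has to avoid reading past the end of s):

// Return the index of the first byte of s whose bit is set in bitmap.
// In strcspn the terminator would itself be on the bitmap, so the scan
// naturally stops at the end of the string.
static size_t __wasm_scan_sketch(const char *s, __wasm_v128_bitmap256_t bitmap) {
  for (size_t i = 0;; i += sizeof(v128_t)) {
    v128_t v = wasm_v128_load(s + i);
    v128_t found = __wasm_v128_chkbits(bitmap, v);  // 0xff where the byte is in the set
    int mask = wasm_i8x16_bitmask(found);           // one bit per lane
    if (mask) return i + (size_t)__builtin_ctz(mask);
  }
}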

ncruces (Contributor Author) commented Aug 8, 2025

Having tested, it seems that using the opposite predicate (the complement) is best for strspn but not for strcspn, at least on wazero and one Intel chip. I'm not sure it's worth the additional complexity/duplication.
