Commit 88e4427

Use inline assembly in strlen for vector loads (#593)
This commit is a refinement of #586 that uses inline assembly to perform vector loads instead of a C-defined load. This avoids undefined behavior in LLVM, where C code may not read either before or after an allocation. When `strlen` is not inlined, as it currently isn't, there is no realistic path by which a compiler could prove that a load is out-of-bounds, so the issue is unlikely to matter in practice, but it is nevertheless still UB. The eventual goal is to move these SIMD routines into header files to avoid needing multiple builds of libc itself; in that situation inlining becomes possible, and a compiler could much more easily see the UB, which could cause problems.

Inline assembly unfortunately doesn't work with vector output parameters on Clang 19 and Clang 20 due to an ICE. This was fixed in llvm/llvm-project#146574 for Clang 21, which means the SIMD routines are now excluded on Clang 19 and Clang 20 to avoid compilation errors there.
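As a sketch of the distinction (not the committed code; `load_c` and `load_asm` are hypothetical names), the two load styles can be contrasted like this, assuming a wasm SIMD128 target:

#include <stdint.h>
#include <wasm_simd128.h>

// C-level load: the dereference is an object access in C, so a 16-byte
// read that extends past the end of the allocation containing s is UB,
// even though the aligned wasm v128.load itself is harmless.
static v128_t load_c(const char *s)
{
	uintptr_t align = (uintptr_t)s % sizeof(v128_t);
	const v128_t *v = (const v128_t *)((uintptr_t)s - align);
	return *v;
}

// Inline-assembly load: emits the same v128.load instruction, but its
// semantics are opaque to the compiler, so there is no C-level access
// for the optimizer to reason about.
static v128_t load_asm(const char *s)
{
	uintptr_t align = (uintptr_t)s % sizeof(v128_t);
	uintptr_t v = (uintptr_t)s - align;
	v128_t chunk;
	__asm__(
		"local.get %1\n"
		"v128.load 0\n"
		"local.set %0\n"
		: "=r"(chunk)
		: "r"(v)
		: "memory");
	return chunk;
}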
1 parent 553305f commit 88e4427

File tree

1 file changed: +21 -9

libc-top-half/musl/src/string/strlen.c

Lines changed: 21 additions & 9 deletions
@@ -14,17 +14,28 @@
 size_t strlen(const char *s)
 {
 #if defined(__wasm_simd128__) && defined(__wasilibc_simd_string)
-	// strlen must stop as soon as it finds the terminator.
-	// Aligning ensures loads beyond the terminator are safe.
-	// Casting through uintptr_t makes this implementation-defined,
-	// rather than undefined behavior.
+	// Skip Clang 19 and Clang 20 which have a bug (llvm/llvm-project#146574) which
+	// results in an ICE when inline assembly is used with a vector result.
+#if __clang_major__ != 19 && __clang_major__ != 20
+	// Note that reading before/after the allocation of a pointer is UB in
+	// C, so inline assembly is used to generate the exact machine
+	// instruction we want with opaque semantics to the compiler to avoid
+	// the UB.
 	uintptr_t align = (uintptr_t)s % sizeof(v128_t);
-	const v128_t *v = (v128_t *)((uintptr_t)s - align);
+	uintptr_t v = (uintptr_t)s - align;
 
 	for (;;) {
+		v128_t chunk;
+		__asm__ (
+			"local.get %1\n"
+			"v128.load 0\n"
+			"local.set %0\n"
+			: "=r"(chunk)
+			: "r"(v)
+			: "memory");
 		// Bitmask is slow on AArch64, all_true is much faster.
-		if (!wasm_i8x16_all_true(*v)) {
-			const v128_t cmp = wasm_i8x16_eq(*v, (v128_t){});
+		if (!wasm_i8x16_all_true(chunk)) {
+			const v128_t cmp = wasm_i8x16_eq(chunk, (v128_t){});
 			// Clear the bits corresponding to align (little-endian)
 			// so we can count trailing zeros.
 			int mask = wasm_i8x16_bitmask(cmp) >> align << align;
@@ -35,12 +46,13 @@ size_t strlen(const char *s)
 			// it's as if we didn't find anything.
 			if (mask) {
 				// Find the offset of the first one bit (little-endian).
-				return (char *)v - s + __builtin_ctz(mask);
+				return v - (uintptr_t)s + __builtin_ctz(mask);
 			}
 		}
 		align = 0;
-		v++;
+		v += sizeof(v128_t);
 	}
+#endif
 #endif
 
 	const char *a = s;
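A worked example of the bitmask arithmetic in the diff above may help. Suppose `s` begins at byte 3 of the aligned 16-byte chunk and the NUL terminator sits at byte 7 of that chunk, so `strlen(s)` is 4; the `>> align << align` pair clears any matches from bytes that precede `s`. A minimal sketch of just that integer arithmetic (plain bits standing in for the SIMD bitmask, with made-up values):

#include <assert.h>

int main(void)
{
	// s starts at byte 3 of the chunk; the real NUL is at byte 7.
	unsigned align = 3;
	// Pretend the comparison also matched a stray zero byte at
	// index 1, which lies before the start of s.
	int raw = (1 << 1) | (1 << 7);
	// Shifting right then left clears the low `align` bits,
	// discarding the false match from before the string begins.
	int mask = raw >> align << align;
	assert(mask == 1 << 7);
	// ctz gives the chunk-relative index of the terminator;
	// subtracting s's offset within the chunk yields the length.
	assert(__builtin_ctz(mask) - align == 4);
	return 0;
}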
