Improve: Reduce startup overhead for sz_find_byteset_haswell #293

Merged
ashvardanian merged 2 commits into ashvardanian:main-dev from Caturra000:byteset-haswell
Dec 25, 2025

Conversation

@Caturra000
Contributor

The current unzip implementation uses a union-based conversion:

SZ_PUBLIC sz_cptr_t sz_find_byteset_haswell(sz_cptr_t text, sz_size_t length, sz_byteset_t const *filter) {
    // Let's unzip even and odd elements and replicate them into both lanes of the YMM register.
    // That way when we invoke `_mm256_shuffle_epi8` we can use the same mask for both lanes.
    sz_u256_vec_t filter_even_vec, filter_odd_vec;
    for (sz_size_t i = 0; i != 16; ++i)
        filter_even_vec.u8s[i] = filter->_u8s[i * 2], filter_odd_vec.u8s[i] = filter->_u8s[i * 2 + 1];
    filter_even_vec.xmms[1] = filter_even_vec.xmms[0];
    filter_odd_vec.xmms[1] = filter_odd_vec.xmms[0];

Clang (20, 19, 18, ...) generates a long dependency chain of vpinsrb instructions for this pattern:

// -O3 -march=znver3
auto test_find_byteset(const char *str, size_t len, const sz_byteset_t *bs) {
    return sz_find_byteset(str, len, bs);
}

0000000000000000 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)>:
       0:      	movq	%rdi, %rax
       3:      	cmpq	$0x20, %rsi
       7:      	jb	 <L0>
       d:      	movzbl	(%rdx), %ecx
      10:      	movzbl	0x1(%rdx), %edi
      14:      	vbroadcastss	, %ymm2 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x1d>
      1d:      	vpbroadcastq	, %ymm3 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x26>
      26:      	vpbroadcastb	, %ymm4 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x2f>
      2f:      	vpxor	%xmm5, %xmm5, %xmm5
      33:      	vmovd	%ecx, %xmm0
      37:      	vpinsrb	$0x1, 0x2(%rdx), %xmm0, %xmm0 👈
      3e:      	vmovd	%edi, %xmm1
      42:      	vpinsrb	$0x1, 0x3(%rdx), %xmm1, %xmm1
      49:      	vpinsrb	$0x2, 0x4(%rdx), %xmm0, %xmm0 👈
      50:      	vpinsrb	$0x2, 0x5(%rdx), %xmm1, %xmm1
      57:      	vpinsrb	$0x3, 0x6(%rdx), %xmm0, %xmm0 👈
      5e:      	vpinsrb	$0x3, 0x7(%rdx), %xmm1, %xmm1
      65:      	vpinsrb	$0x4, 0x8(%rdx), %xmm0, %xmm0 👈
      6c:      	vpinsrb	$0x4, 0x9(%rdx), %xmm1, %xmm1
      73:      	vpinsrb	$0x5, 0xa(%rdx), %xmm0, %xmm0
      7a:      	vpinsrb	$0x5, 0xb(%rdx), %xmm1, %xmm1
      81:      	vpinsrb	$0x6, 0xc(%rdx), %xmm0, %xmm0
      88:      	vpinsrb	$0x6, 0xd(%rdx), %xmm1, %xmm1
      8f:      	vpinsrb	$0x7, 0xe(%rdx), %xmm0, %xmm0
      96:      	vpinsrb	$0x7, 0xf(%rdx), %xmm1, %xmm1
      9d:      	vpinsrb	$0x8, 0x10(%rdx), %xmm0, %xmm0
      a4:      	vpinsrb	$0x8, 0x11(%rdx), %xmm1, %xmm1
      ab:      	vpinsrb	$0x9, 0x12(%rdx), %xmm0, %xmm0
      b2:      	vpinsrb	$0x9, 0x13(%rdx), %xmm1, %xmm1
      b9:      	vpinsrb	$0xa, 0x14(%rdx), %xmm0, %xmm0
      c0:      	vpinsrb	$0xa, 0x15(%rdx), %xmm1, %xmm1
      c7:      	vpinsrb	$0xb, 0x16(%rdx), %xmm0, %xmm0
      ce:      	vpinsrb	$0xb, 0x17(%rdx), %xmm1, %xmm1
      d5:      	vpinsrb	$0xc, 0x18(%rdx), %xmm0, %xmm0
      dc:      	vpinsrb	$0xc, 0x19(%rdx), %xmm1, %xmm1
      e3:      	vpinsrb	$0xd, 0x1a(%rdx), %xmm0, %xmm0
      ea:      	vpinsrb	$0xd, 0x1b(%rdx), %xmm1, %xmm1
      f1:      	vpinsrb	$0xe, 0x1c(%rdx), %xmm0, %xmm0
      f8:      	vpinsrb	$0xe, 0x1d(%rdx), %xmm1, %xmm1
      ff:      	vpinsrb	$0xf, 0x1e(%rdx), %xmm0, %xmm0
     106:      	vpinsrb	$0xf, 0x1f(%rdx), %xmm1, %xmm1
     10d:      	vpermq	$0x44, %ymm0, %ymm0     # ymm0 = ymm0[0,1,0,1]
     113:      	vpermq	$0x44, %ymm1, %ymm1     # ymm1 = ymm1[0,1,0,1]
     119:      	nopl	(%rax)
<L2>:
     120:      	vlddqu	(%rax), %ymm6
     ...

This patch rewrites the unzip logic, essentially transcribing what GCC-14 already emits for it. See https://godbolt.org/z/eb7xe5Pr9

Benchmarks were run on a Zen 3 device. The core algorithm is unchanged, yet throughput improves in almost every case:

Clang-20
| Dataset / Mode      | main-haswell      | patch-haswell   | Improvement |
|---------------------|-------------------|-----------------|-------------|
| Words (5B avg)      | 285.45 MiB/s      | 347.74 MiB/s    | +21.8%      |
| Lines (128B avg)    | 2.70 GiB/s        | 3.32 GiB/s      | +23.0%      |
| File (4KB)          | 9.58 GiB/s        | 10.19 GiB/s     | +6.4%       |
| File (16KB)         | 12.48 GiB/s       | 12.97 GiB/s     | +3.9%       |
| File (64KB)         | 11.98 GiB/s       | 12.34 GiB/s     | +3.0%       |
| File (64MB)         | 6.26 GiB/s        | 7.50 GiB/s      | +19.8%      |

GCC-14
| Dataset / Mode      | main-haswell      | patch-haswell   | Improvement |
|---------------------|-------------------|-----------------|-------------|
| Words (5B avg)      | 374.76 MiB/s      | 362.27 MiB/s    | -3.3%       |
| Lines (128B avg)    | 3.07 GiB/s        | 3.44 GiB/s      | +12.0%      |
| File (4KB)          | 7.89 GiB/s        | 11.16 GiB/s     | +41.4%      |
| File (16KB)         | 10.25 GiB/s       | 14.27 GiB/s     | +39.2%      |
| File (64KB)         | 9.75 GiB/s        | 13.73 GiB/s     | +40.8%      |
| File (64MB)         | 6.02 GiB/s        | 7.73 GiB/s      | +28.4%      |

Benchmark scripts:

for s in 4k 16k 64k; do head -c $s leipzig1M.txt > leipzig1M_${s}.txt; done

bench() {
    rm -rf build_release
    cmake -DSTRINGZILLA_BUILD_BENCHMARK=1 -B build_release
    cmake --build build_release --config Release --target stringzilla_bench_find_cpp20
    export STRINGWARS_FILTER="(sz_find_byteset_serial|sz_find_byteset_haswell)"
    CMD="setarch $(uname -m) -R ./build_release/stringzilla_bench_find_cpp20"

    STRINGWARS_TOKENS=words $CMD
    STRINGWARS_TOKENS=lines $CMD
    for s in 4k 16k 64k; do STRINGWARS_TOKENS=file STRINGWARS_DATASET=leipzig1M_${s}.txt $CMD; done
    STRINGWARS_TOKENS=file $CMD
}

# `byteset-haswell` refers to this patch.
for branch in main byteset-haswell; do
    git switch $branch
    echo -e "\n=== Testing Clang ($branch) ==="
    export CC=clang-20 CXX=clang++-20; bench
    echo -e "\n=== Testing GCC ($branch) ==="
    export CC=gcc-14 CXX=g++-14; bench
done

rm leipzig1M_*

(I also tried another full-MM256 implementation using permute4x64, but it was slightly slower (~2%) than this patch.)

@ashvardanian
Owner

Great observation, @Caturra000! Will merge soon 🤗
Did you notice similar regressions on the AVX-512 paths?

@Caturra000
Contributor Author

Hi, @ashvardanian
Did you mean the _ice version? I haven't checked it on my Zen 4 laptop. (The previous benchmarks were run on Zen 3.)
But a quick look at AVX-512 find_byteset with clang++-20 -march=znver4 shows that the codegen looks fine.

0000000000000000 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)>:
       0:      	testq	%rsi, %rsi
       3:      	je	 <L0>
       9:      	vlddqu	(%rdx), %ymm0
       d:      	movl	$0xaaaaaaaa, %ecx       # imm = 0xAAAAAAAA
      12:      	vpbroadcastb	, %zmm2 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x1c>
      1c:      	vpbroadcastq	, %zmm3 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x26>
      26:      	vpbroadcastb	, %zmm4 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x30>
      30:      	movq	%rdi, %rax
      33:      	kmovd	%ecx, %k1
      37:      	movl	$0x55555555, %ecx       # imm = 0x55555555
      3c:      	vpcompressb	%ymm0, %ymm1 {%k1} {z}
      42:      	kmovd	%ecx, %k1
      46:      	movq	$-0x1, %rcx
      4d:      	vpcompressb	%ymm0, %ymm0 {%k1} {z}
      53:      	vshufi64x2	$0x0, %zmm1, %zmm1, %zmm1 # zmm1 = zmm1[0,1,0,1,0,1,0,1]
      5a:      	vshufi64x2	$0x0, %zmm0, %zmm0, %zmm0 # zmm0 = zmm0[0,1,0,1,0,1,0,1]
      61:      	nopw	%cs:(%rax,%rax)
<L2>:
      70:      	cmpq	$0x40, %rsi
      74:      	movl	$0x40, %edx
      79:      	cmovbq	%rsi, %rdx
      7d:      	bzhiq	%rdx, %rcx, %rdi
      82:      	kmovq	%rdi, %k1
      87:      	vmovdqu8	(%rax), %zmm5 {%k1} {z}
      8d:      	vpandq	%zmm2, %zmm5, %zmm6
      93:      	vpsrlw	$0x4, %zmm5, %zmm5
      9a:      	vpandq	%zmm2, %zmm5, %zmm5
      a0:      	vpcmpltb	%zmm4, %zmm6, %k2
      a7:      	vpshufb	%zmm6, %zmm3, %zmm7
      ad:      	vpshufb	%zmm5, %zmm1, %zmm8
      b3:      	vpshufb	%zmm5, %zmm0, %zmm8 {%k2}
      b9:      	vptestmb	%zmm7, %zmm8, %k0
      bf:      	ktestq	%k1, %k0
      c4:      	jne	 <L1>
      c6:      	addq	%rdx, %rax
      c9:      	subq	%rdx, %rsi
      cc:      	jne	 <L2>
<L0>:
      ce:      	xorl	%eax, %eax
      d0:      	vzeroupper
      d3:      	retq
<L1>:
      d4:      	kandq	%k1, %k0, %k0
      d9:      	kmovq	%k0, %rcx
      de:      	tzcntq	%rcx, %rcx
      e3:      	addq	%rcx, %rax
      e6:      	vzeroupper
      e9:      	retq

At least it doesn't have the crazy vpinsrb sequence.
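For readers unfamiliar with vpcompressb: the two vpcompressb instructions in the listing above perform the even/odd unzip in one shot each. A scalar model of the maskz form (my own illustration, not library code):

```c
#include <stdint.h>

// Scalar model of maskz vpcompressb over 32 bytes: bytes whose mask bit is
// set are packed contiguously from the front; remaining lanes are zeroed.
static void compress_bytes_model(uint8_t const src[32], uint32_t mask, uint8_t dst[32]) {
    int j = 0;
    for (int i = 0; i < 32; ++i)
        if (mask & (1u << i)) dst[j++] = src[i];
    while (j < 32) dst[j++] = 0;
}
```

With mask 0x55555555 this keeps the even-indexed bytes and with 0xAAAAAAAA the odd-indexed ones, matching the kmovd constants in the listing.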

Renamed variables to follow codebase convention where *_vec_t types
use *_vec suffix:
- filter_mask → byte_mask_vec
- filter_lo/hi → filter_lo_vec/filter_hi_vec
- filter_lo_even/hi_even → lo_evens_vec/hi_evens_vec
- filter_lo_odd/hi_odd → lo_odds_vec/hi_odds_vec
- filter_even/odd → evens_xmm_vec/odds_xmm_vec

Also added clarifying comment about the unzip algorithm flow.
@ashvardanian ashvardanian changed the base branch from main to main-dev December 25, 2025 00:25
@ashvardanian ashvardanian merged commit 7f2899a into ashvardanian:main-dev Dec 25, 2025
ashvardanian pushed a commit that referenced this pull request Dec 26, 2025
### Minor

- Add: Georgian fast path (e02cb00)

### Patch

- Fix: `inline static` warnings with C++23 modules (#287) (374adbf)
- Improve: Reduce startup overhead for `sz_find_byteset_haswell` (#293) (7f2899a)
- Fix: Missing Georgian dispatch (5f91e7e)
- Improve: Drop stack-protection in hashing on GCC (23801e4)
- Improve: Reduce repeated reviews (2e5784b)
- Improve: Faster `sz_size_bit_ceil` (4003057)
- Improve: Avoid ZMM-to-stack spill in Skylake comparisons (d94a010)
- Make: Use relative install path for C sources (690d775)