Improve: Reduce startup overhead for sz_find_byteset_haswell #293

Merged
ashvardanian merged 2 commits into ashvardanian:main-dev from Caturra000:byteset-haswell
Dec 25, 2025

Conversation

@Caturra000
Contributor

The current unzip implementation uses a union-based conversion:

SZ_PUBLIC sz_cptr_t sz_find_byteset_haswell(sz_cptr_t text, sz_size_t length, sz_byteset_t const *filter) {
    // Let's unzip even and odd elements and replicate them into both lanes of the YMM register.
    // That way when we invoke `_mm256_shuffle_epi8` we can use the same mask for both lanes.
    sz_u256_vec_t filter_even_vec, filter_odd_vec;
    for (sz_size_t i = 0; i != 16; ++i)
        filter_even_vec.u8s[i] = filter->_u8s[i * 2], filter_odd_vec.u8s[i] = filter->_u8s[i * 2 + 1];
    filter_even_vec.xmms[1] = filter_even_vec.xmms[0];
    filter_odd_vec.xmms[1] = filter_odd_vec.xmms[0];

Clang (20, 19, 18, ...) generates a long dependency chain of vpinsrb instructions for this pattern:

// -O3 -march=znver3
auto test_find_byteset(const char *str, size_t len, const sz_byteset_t *bs) {
    return sz_find_byteset(str, len, bs);
}

0000000000000000 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)>:
       0:      	movq	%rdi, %rax
       3:      	cmpq	$0x20, %rsi
       7:      	jb	 <L0>
       d:      	movzbl	(%rdx), %ecx
      10:      	movzbl	0x1(%rdx), %edi
      14:      	vbroadcastss	, %ymm2 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x1d>
      1d:      	vpbroadcastq	, %ymm3 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x26>
      26:      	vpbroadcastb	, %ymm4 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x2f>
      2f:      	vpxor	%xmm5, %xmm5, %xmm5
      33:      	vmovd	%ecx, %xmm0
      37:      	vpinsrb	$0x1, 0x2(%rdx), %xmm0, %xmm0 👈
      3e:      	vmovd	%edi, %xmm1
      42:      	vpinsrb	$0x1, 0x3(%rdx), %xmm1, %xmm1
      49:      	vpinsrb	$0x2, 0x4(%rdx), %xmm0, %xmm0 👈
      50:      	vpinsrb	$0x2, 0x5(%rdx), %xmm1, %xmm1
      57:      	vpinsrb	$0x3, 0x6(%rdx), %xmm0, %xmm0 👈
      5e:      	vpinsrb	$0x3, 0x7(%rdx), %xmm1, %xmm1
      65:      	vpinsrb	$0x4, 0x8(%rdx), %xmm0, %xmm0 👈
      6c:      	vpinsrb	$0x4, 0x9(%rdx), %xmm1, %xmm1
      73:      	vpinsrb	$0x5, 0xa(%rdx), %xmm0, %xmm0
      7a:      	vpinsrb	$0x5, 0xb(%rdx), %xmm1, %xmm1
      81:      	vpinsrb	$0x6, 0xc(%rdx), %xmm0, %xmm0
      88:      	vpinsrb	$0x6, 0xd(%rdx), %xmm1, %xmm1
      8f:      	vpinsrb	$0x7, 0xe(%rdx), %xmm0, %xmm0
      96:      	vpinsrb	$0x7, 0xf(%rdx), %xmm1, %xmm1
      9d:      	vpinsrb	$0x8, 0x10(%rdx), %xmm0, %xmm0
      a4:      	vpinsrb	$0x8, 0x11(%rdx), %xmm1, %xmm1
      ab:      	vpinsrb	$0x9, 0x12(%rdx), %xmm0, %xmm0
      b2:      	vpinsrb	$0x9, 0x13(%rdx), %xmm1, %xmm1
      b9:      	vpinsrb	$0xa, 0x14(%rdx), %xmm0, %xmm0
      c0:      	vpinsrb	$0xa, 0x15(%rdx), %xmm1, %xmm1
      c7:      	vpinsrb	$0xb, 0x16(%rdx), %xmm0, %xmm0
      ce:      	vpinsrb	$0xb, 0x17(%rdx), %xmm1, %xmm1
      d5:      	vpinsrb	$0xc, 0x18(%rdx), %xmm0, %xmm0
      dc:      	vpinsrb	$0xc, 0x19(%rdx), %xmm1, %xmm1
      e3:      	vpinsrb	$0xd, 0x1a(%rdx), %xmm0, %xmm0
      ea:      	vpinsrb	$0xd, 0x1b(%rdx), %xmm1, %xmm1
      f1:      	vpinsrb	$0xe, 0x1c(%rdx), %xmm0, %xmm0
      f8:      	vpinsrb	$0xe, 0x1d(%rdx), %xmm1, %xmm1
      ff:      	vpinsrb	$0xf, 0x1e(%rdx), %xmm0, %xmm0
     106:      	vpinsrb	$0xf, 0x1f(%rdx), %xmm1, %xmm1
     10d:      	vpermq	$0x44, %ymm0, %ymm0     # ymm0 = ymm0[0,1,0,1]
     113:      	vpermq	$0x44, %ymm1, %ymm1     # ymm1 = ymm1[0,1,0,1]
     119:      	nopl	(%rax)
<L2>:
     120:      	vlddqu	(%rax), %ymm6
     ...

This patch rewrites the unzip logic, essentially transcribing what GCC-14 already emits for it. See https://godbolt.org/z/eb7xe5Pr9

Benchmarks were run on a Zen 3 device. The core algorithm is unchanged, yet throughput improves in almost every case:

Clang-20
| Dataset / Mode      | main-haswell      | patch-haswell   | Improvement |
|---------------------|-------------------|-----------------|-------------|
| Words (5B avg)      | 285.45 MiB/s      | 347.74 MiB/s    | +21.8%      |
| Lines (128B avg)    | 2.70 GiB/s        | 3.32 GiB/s      | +23.0%      |
| File (4KB)          | 9.58 GiB/s        | 10.19 GiB/s     | +6.4%       |
| File (16KB)         | 12.48 GiB/s       | 12.97 GiB/s     | +3.9%       |
| File (64KB)         | 11.98 GiB/s       | 12.34 GiB/s     | +3.0%       |
| File (64MB)         | 6.26 GiB/s        | 7.50 GiB/s      | +19.8%      |

GCC-14
| Dataset / Mode      | main-haswell      | patch-haswell   | Improvement |
|---------------------|-------------------|-----------------|-------------|
| Words (5B avg)      | 374.76 MiB/s      | 362.27 MiB/s    | -3.3%       |
| Lines (128B avg)    | 3.07 GiB/s        | 3.44 GiB/s      | +12.0%      |
| File (4KB)          | 7.89 GiB/s        | 11.16 GiB/s     | +41.4%      |
| File (16KB)         | 10.25 GiB/s       | 14.27 GiB/s     | +39.2%      |
| File (64KB)         | 9.75 GiB/s        | 13.73 GiB/s     | +40.8%      |
| File (64MB)         | 6.02 GiB/s        | 7.73 GiB/s      | +28.4%      |

Benchmark scripts:

for s in 4k 16k 64k; do head -c $s leipzig1M.txt > leipzig1M_${s}.txt; done

bench() {
    rm -rf build_release
    cmake -DSTRINGZILLA_BUILD_BENCHMARK=1 -B build_release
    cmake --build build_release --config Release --target stringzilla_bench_find_cpp20
    export STRINGWARS_FILTER="(sz_find_byteset_serial|sz_find_byteset_haswell)"
    CMD="setarch $(uname -m) -R ./build_release/stringzilla_bench_find_cpp20"

    STRINGWARS_TOKENS=words $CMD
    STRINGWARS_TOKENS=lines $CMD
    for s in 4k 16k 64k; do STRINGWARS_TOKENS=file STRINGWARS_DATASET=leipzig1M_${s}.txt $CMD; done
    STRINGWARS_TOKENS=file $CMD
}

# `byteset-haswell` refers to this patch.
for branch in main byteset-haswell; do
    git switch $branch
    echo -e "\n=== Testing Clang ($branch) ==="
    export CC=clang-20 CXX=clang++-20; bench
    echo -e "\n=== Testing GCC ($branch) ==="
    export CC=gcc-14 CXX=g++-14; bench
done

rm leipzig1M_*

(I also tried another full-MM256 implementation using permute4x64, but it was slightly slower (~2%) than this patch.)

@ashvardanian
Owner

Great observation, @Caturra000! Will merge soon 🤗
Did you notice similar regressions on the AVX-512 paths?

@Caturra000
Contributor Author

Hi, @ashvardanian
Did you mean the _ice version? I haven't checked it on my Zen 4 laptop. (The previous benchmarks were run on Zen 3.)
But a quick look at AVX-512 find_byteset with clang++-20 -march=znver4 shows that the codegen looks fine.

0000000000000000 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)>:
       0:      	testq	%rsi, %rsi
       3:      	je	 <L0>
       9:      	vlddqu	(%rdx), %ymm0
       d:      	movl	$0xaaaaaaaa, %ecx       # imm = 0xAAAAAAAA
      12:      	vpbroadcastb	, %zmm2 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x1c>
      1c:      	vpbroadcastq	, %zmm3 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x26>
      26:      	vpbroadcastb	, %zmm4 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x30>
      30:      	movq	%rdi, %rax
      33:      	kmovd	%ecx, %k1
      37:      	movl	$0x55555555, %ecx       # imm = 0x55555555
      3c:      	vpcompressb	%ymm0, %ymm1 {%k1} {z}
      42:      	kmovd	%ecx, %k1
      46:      	movq	$-0x1, %rcx
      4d:      	vpcompressb	%ymm0, %ymm0 {%k1} {z}
      53:      	vshufi64x2	$0x0, %zmm1, %zmm1, %zmm1 # zmm1 = zmm1[0,1,0,1,0,1,0,1]
      5a:      	vshufi64x2	$0x0, %zmm0, %zmm0, %zmm0 # zmm0 = zmm0[0,1,0,1,0,1,0,1]
      61:      	nopw	%cs:(%rax,%rax)
<L2>:
      70:      	cmpq	$0x40, %rsi
      74:      	movl	$0x40, %edx
      79:      	cmovbq	%rsi, %rdx
      7d:      	bzhiq	%rdx, %rcx, %rdi
      82:      	kmovq	%rdi, %k1
      87:      	vmovdqu8	(%rax), %zmm5 {%k1} {z}
      8d:      	vpandq	%zmm2, %zmm5, %zmm6
      93:      	vpsrlw	$0x4, %zmm5, %zmm5
      9a:      	vpandq	%zmm2, %zmm5, %zmm5
      a0:      	vpcmpltb	%zmm4, %zmm6, %k2
      a7:      	vpshufb	%zmm6, %zmm3, %zmm7
      ad:      	vpshufb	%zmm5, %zmm1, %zmm8
      b3:      	vpshufb	%zmm5, %zmm0, %zmm8 {%k2}
      b9:      	vptestmb	%zmm7, %zmm8, %k0
      bf:      	ktestq	%k1, %k0
      c4:      	jne	 <L1>
      c6:      	addq	%rdx, %rax
      c9:      	subq	%rdx, %rsi
      cc:      	jne	 <L2>
<L0>:
      ce:      	xorl	%eax, %eax
      d0:      	vzeroupper
      d3:      	retq
<L1>:
      d4:      	kandq	%k1, %k0, %k0
      d9:      	kmovq	%k0, %rcx
      de:      	tzcntq	%rcx, %rcx
      e3:      	addq	%rcx, %rax
      e6:      	vzeroupper
      e9:      	retq

At least it doesn't have the crazy vpinsrb sequence.
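For readers unfamiliar with vpcompressb: the two vpcompressb instructions in the listing above perform the even/odd unzip in one shot each. A scalar model of the maskz form (my own illustration, not library code):

```c
#include <stdint.h>

// Scalar model of maskz vpcompressb over 32 bytes: bytes whose mask bit is
// set are packed contiguously from the front; remaining lanes are zeroed.
static void compress_bytes_model(uint8_t const src[32], uint32_t mask, uint8_t dst[32]) {
    int j = 0;
    for (int i = 0; i < 32; ++i)
        if (mask & (1u << i)) dst[j++] = src[i];
    while (j < 32) dst[j++] = 0;
}
```

With mask 0x55555555 this keeps the even-indexed bytes and with 0xAAAAAAAA the odd-indexed ones, matching the kmovd constants in the listing.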

Renamed variables to follow codebase convention where *_vec_t types
use *_vec suffix:
- filter_mask → byte_mask_vec
- filter_lo/hi → filter_lo_vec/filter_hi_vec
- filter_lo_even/hi_even → lo_evens_vec/hi_evens_vec
- filter_lo_odd/hi_odd → lo_odds_vec/hi_odds_vec
- filter_even/odd → evens_xmm_vec/odds_xmm_vec

Also added clarifying comment about the unzip algorithm flow.
@ashvardanian ashvardanian changed the base branch from main to main-dev December 25, 2025 00:25
@ashvardanian ashvardanian merged commit 7f2899a into ashvardanian:main-dev Dec 25, 2025
ashvardanian pushed a commit that referenced this pull request Dec 26, 2025
### Minor

- Add: Georgian fast path (e02cb00)

### Patch

- Fix: `inline static` warnings with C++23 modules (#287) (374adbf)
- Improve: Reduce startup overhead for `sz_find_byteset_haswell` (#293) (7f2899a)
- Fix: Missing Georgian dispatch (5f91e7e)
- Improve: Drop stack-protection in hashing on GCC (23801e4)
- Improve: Reduce repeated reviews (2e5784b)
- Improve: Faster `sz_size_bit_ceil` (4003057)
- Improve: Avoid ZMM-to-stack spill in Skylake comparisons (d94a010)
- Make: Use relative install path for C sources (690d775)