Improve: Reduce startup overhead for sz_find_byteset_haswell#293
Merged
ashvardanian merged 2 commits into ashvardanian:main-dev on Dec 25, 2025
Conversation
Owner

Great observation, @Caturra000! Will merge soon 🤗
Contributor
Author
Hi, @ashvardanian

```asm
0000000000000000 <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)>:
       0: testq   %rsi, %rsi
       3: je      <L0>
       9: vlddqu  (%rdx), %ymm0
       d: movl    $0xaaaaaaaa, %ecx              # imm = 0xAAAAAAAA
      12: vpbroadcastb (%rip), %zmm2             # <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x1c>
      1c: vpbroadcastq (%rip), %zmm3             # <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x26>
      26: vpbroadcastb (%rip), %zmm4             # <test_find_byteset(char const*, unsigned long, sz_byteset_t const*)+0x30>
      30: movq    %rdi, %rax
      33: kmovd   %ecx, %k1
      37: movl    $0x55555555, %ecx              # imm = 0x55555555
      3c: vpcompressb %ymm0, %ymm1 {%k1} {z}
      42: kmovd   %ecx, %k1
      46: movq    $-0x1, %rcx
      4d: vpcompressb %ymm0, %ymm0 {%k1} {z}
      53: vshufi64x2 $0x0, %zmm1, %zmm1, %zmm1   # zmm1 = zmm1[0,1,0,1,0,1,0,1]
      5a: vshufi64x2 $0x0, %zmm0, %zmm0, %zmm0   # zmm0 = zmm0[0,1,0,1,0,1,0,1]
      61: nopw    %cs:(%rax,%rax)
<L2>:
      70: cmpq    $0x40, %rsi
      74: movl    $0x40, %edx
      79: cmovbq  %rsi, %rdx
      7d: bzhiq   %rdx, %rcx, %rdi
      82: kmovq   %rdi, %k1
      87: vmovdqu8 (%rax), %zmm5 {%k1} {z}
      8d: vpandq  %zmm2, %zmm5, %zmm6
      93: vpsrlw  $0x4, %zmm5, %zmm5
      9a: vpandq  %zmm2, %zmm5, %zmm5
      a0: vpcmpltb %zmm4, %zmm6, %k2
      a7: vpshufb %zmm6, %zmm3, %zmm7
      ad: vpshufb %zmm5, %zmm1, %zmm8
      b3: vpshufb %zmm5, %zmm0, %zmm8 {%k2}
      b9: vptestmb %zmm7, %zmm8, %k0
      bf: ktestq  %k1, %k0
      c4: jne     <L1>
      c6: addq    %rdx, %rax
      c9: subq    %rdx, %rsi
      cc: jne     <L2>
<L0>:
      ce: xorl    %eax, %eax
      d0: vzeroupper
      d3: retq
<L1>:
      d4: kandq   %k1, %k0, %k0
      d9: kmovq   %k0, %rcx
      de: tzcntq  %rcx, %rcx
      e3: addq    %rcx, %rax
      e6: vzeroupper
      e9: retq
```
At least it doesn't have the crazy `vpinsrb` dependency chain anymore.
Renamed variables to follow the codebase convention where `*_vec_t` types use the `*_vec` suffix:

- `filter_mask` → `byte_mask_vec`
- `filter_lo/hi` → `filter_lo_vec/filter_hi_vec`
- `filter_lo_even/hi_even` → `lo_evens_vec/hi_evens_vec`
- `filter_lo_odd/hi_odd` → `lo_odds_vec/hi_odds_vec`
- `filter_even/odd` → `evens_xmm_vec/odds_xmm_vec`

Also added a clarifying comment about the unzip algorithm flow.
ashvardanian pushed a commit that referenced this pull request on Dec 26, 2025
### Minor

- Add: Georgian fast path (e02cb00)

### Patch

- Fix: `inline static` warnings with C++23 modules (#287) (374adbf)
- Improve: Reduce startup overhead for `sz_find_byteset_haswell` (#293) (7f2899a)
- Fix: Missing Georgian dispatch (5f91e7e)
- Improve: Drop stack-protection in hashing on GCC (23801e4)
- Improve: Reduce repeated reviews (2e5784b)
- Improve: Faster `sz_size_bit_ceil` (4003057)
- Improve: Avoid ZMM-to-stack spill in Skylake comparisons (d94a010)
- Make: Use relative install path for C sources (690d775)
The current unzip implementation uses a union-based conversion:
StringZilla/include/stringzilla/find.h
Lines 1131 to 1139 in 682d2ba
Clang (20/19/18, ...) generates a massive dependency chain of `vpinsrb` instructions for this pattern. This patch rewrites the unzip logic, which is basically "copied" from the GCC 14 assembly codegen. See https://godbolt.org/z/eb7xe5Pr9
Benchmarks were run on a Zen 3 machine. I didn't change the core algorithm, but the numbers look much better in almost every case:
Benchmark scripts:
(I also tried an alternative full-MM256 implementation using `permute4x64`, but it was slightly slower (~2%) than this patch.)