
Harley Seal AVX-512 implementations #138

Draft · wants to merge 1 commit into main-dev

Conversation

@ashvardanian (Owner) commented Jun 11, 2024

Binary representations are becoming increasingly popular in Machine Learning, and I'd love to explore the opportunity for faster Hamming and Jaccard distance calculations. I've looked into several benchmarks, most importantly the WojciechMula/sse-popcount library, which compares several optimizations for population counts, the most expensive part of the Hamming/Jaccard kernels.


Extensive benchmarks and the design itself suggest that the AVX-512 Harley-Seal variant should be the fastest on long inputs beyond 1 KB. Here is a sample of the most recent results, obtained on an Intel Cannon Lake i3 CPU:

procedure                   32 B     64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B
lookup-8                 1.19464  1.09949  1.21245  1.11428  1.69827  1.65605  1.63299  1.62148
lookup-64                1.16739  1.09284  1.19636  1.10018  1.69524  1.65319  1.63670  1.62359
harley-seal              1.00883  0.82805  0.51017  0.39659  0.54067  0.49312  0.46917  0.45787
avx2-lookup              0.45543  0.28456  0.20674  0.14150  0.18920  0.16951  0.15977  0.15527
avx2-lookup-original     1.53184  0.90269  0.61849  0.41858  0.34503  0.32416  0.23073  0.25976
avx2-harley-seal         1.03679  0.59198  0.37492  0.26418  0.20457  0.15556  0.13097  0.11904
avx512-harley-seal       3.36585  0.71542  0.40990  0.26028  0.29072  0.10719  0.07310  0.05560
avx512bw-shuf            2.56808  1.99008  1.04359  0.55736  0.48551  0.25119  0.20256  0.15851
avx512vbmi-shuf          2.51702  1.99085  1.09241  0.54717  0.49385  0.25181  0.20032  0.15249
builtin-popcnt           0.22182  0.28289  0.26755  0.31640  0.39424  0.38940  0.36062  0.33525
builtin-popcnt32         0.46220  0.46701  0.51513  0.59160  0.89925  0.85613  0.84084  0.84065
builtin-popcnt-unrolled  0.25161  0.17290  0.14147  0.12966  0.20433  0.22086  0.20939  0.20628
builtin-popcnt-movdq     0.21983  0.18868  0.17849  0.18037  0.34305  0.31526  0.29713  0.29047

I've tried copying the best solution into the SimSIMD benchmarking suite, but sadly didn't achieve similar improvements on more recent CPUs. On Intel Sapphire Rapids:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       50.3 ns         50.3 ns    277340752 abs_delta=0 bytes=162.807G/s pairs=19.8739M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           34.8 ns         34.8 ns    402233197 abs_delta=0 bytes=235.632G/s pairs=28.7636M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         42.4 ns         42.4 ns    330077077 abs_delta=0 bytes=193.07G/s pairs=23.5681M/s relative_error=0

On AMD Genoa:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       40.5 ns         40.5 ns    346163289 abs_delta=0 bytes=202.502G/s pairs=24.7195M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           40.6 ns         40.6 ns    344646420 abs_delta=0 bytes=201.733G/s pairs=24.6257M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         59.8 ns         59.8 ns    234058579 abs_delta=0 bytes=136.96G/s pairs=16.7188M/s relative_error=0

  • The kernel designed for Haswell simply uses `_mm_popcnt_u64`.
  • The kernel designed for Ice Lake uses `_mm512_popcnt_epi64`.
  • The `icehs` kernel is an adaptation of the Harley-Seal transform that zips the two input streams with XOR.
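For context, the building block of the Harley-Seal transform is a carry-save adder (CSA): three input words are compressed into a "sum" word and a "carry" word using only cheap bitwise ops, so the expensive per-word popcounts are deferred. Below is a minimal scalar sketch of the idea (plain C; the function names are mine, not SimSIMD's, and the real kernels run the same recurrence over 512-bit registers with deeper counter trees):

```c
#include <stddef.h>
#include <stdint.h>

// Carry-save adder: compresses three words into (carry, sum) so that
// popcount(a) + popcount(b) + popcount(c) == 2 * popcount(carry) + popcount(sum).
static void csa_u64(uint64_t a, uint64_t b, uint64_t c, uint64_t *carry, uint64_t *sum) {
    uint64_t u = a ^ b;
    *sum = u ^ c;
    *carry = (a & b) | (u & c);
}

// Portable popcount for one 64-bit word, used to fold the CSA outputs.
static uint32_t popcount_u64(uint64_t x) {
    uint32_t n = 0;
    for (; x; x &= x - 1) ++n; // clear the lowest set bit each iteration
    return n;
}

// Harley-Seal-style popcount over an array: most words pass through the
// cheap CSA, and only the carry stream needs an actual popcount.
static uint64_t harley_seal_popcount(uint64_t const *data, size_t n) {
    uint64_t ones = 0, total = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t twos;
        csa_u64(ones, data[i], data[i + 1], &twos, &ones);
        total += 2 * (uint64_t)popcount_u64(twos); // each carry bit represents two set bits
    }
    for (; i < n; ++i) total += popcount_u64(data[i]); // scalar tail
    return total + popcount_u64(ones);
}
```

Production versions accumulate several CSA levels (twos, fours, eights, sixteens) before popcounting, amortizing the fold even further.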

To reproduce the results:

cmake -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_TESTS=1 -DSIMSIMD_BUILD_BENCHMARKS=1 -DSIMSIMD_BUILD_BENCHMARKS_WITH_CBLAS=1 -B build_release
cmake --build build_release --config Release && build_release/simsimd_bench --benchmark_filter="hamming(.*)4096b"

Please let me know if there is a better way to accelerate this kernel 🤗

This commit adds the optimized Harley Seal kernel from
the `WojciechMula/sse-popcount` library to the benchmarking
suite to investigate optimization opportunities on Intel Sapphire
Rapids and AMD Genoa chips.
@ashvardanian added the "help wanted" (Extra attention is needed) label on Jun 11, 2024
@@ -223,6 +221,182 @@ void vdot_f64c_blas(simsimd_f64_t const* a, simsimd_f64_t const* b, simsimd_size

#endif

namespace AVX512_harley_seal {

uint8_t lookup8bit[256] = {
@alexbowe commented Jun 11, 2024

Does it help to add const/constexpr? I wonder if it would encourage the table to be cached. It might also help to run a loop over it to pre-load it into the cache, although I figure prefetching would most likely pull in the whole table on the first access.

In my own experiments in the past, however, I found the built-in instructions to be faster than LUTs.

@ashvardanian (Owner, Author)

Given the size of the inputs, the tail will never be evaluated separately. I've just copied that part of the code for completeness.

Comment on lines +292 to +298
uint64_t lower_qword(const __m128i v) { return _mm_cvtsi128_si64(v); }

uint64_t higher_qword(const __m128i v) { return lower_qword(_mm_srli_si128(v, 8)); }

uint64_t simd_sum_epu64(const __m128i v) { return lower_qword(v) + higher_qword(v); }

uint64_t simd_sum_epu64(const __m256i v) {
    return simd_sum_epu64(_mm256_extracti128_si256(v, 0)) + simd_sum_epu64(_mm256_extracti128_si256(v, 1));
}


I think modern compilers might do this without asking in some cases, but using inline might encourage it (and could help with these small functions).


The changes I've suggested so far are just low-hanging fruit, though. Have you used profiling tools to find which lines of code each approach spends the most time in?

@ashvardanian (Owner, Author)

Most time is spent in the main loop computing CSAs. Sadly, I can't access hardware performance counters on those machines.

@Wyctus (Contributor) commented Sep 8, 2024

I'm interested in experimenting with this, but I don't have a CPU supporting AVX512. Do you test all these different instruction sets on cloud machines or do you have many CPUs? 😄

Maybe I could do some comparative experiments emulating with QEMU, but this most likely won't give enough info for fine-tuning.

@ashvardanian (Owner, Author)

@Wyctus, QEMU is a nightmare, I recommend avoiding it. I used to have some CPUs, but cloud is the way to go for R&D of such kernels. I recommend r7iz instances for x86 and r8g for Arm on AWS. 2-4 vCPUs should be enough 😉

@ashvardanian (Owner, Author)

Also, from a priority perspective: if you can improve Harley-Seal, it's a huuuge win, but it proved to be quite hard and time-consuming. If at any point it stops feeling rewarding, #159 and #160 are also important, more digestible, and untouched for now, @Wyctus 🤗

@Wyctus (Contributor) commented Sep 8, 2024

Thank you, I'll try AWS! 🙂 You are right: I messed around with QEMU for a few hours, and it already made me sick....

The reason I picked this issue is that I messed with popcount stuff in the past, so I'm planning to dig up what I did and see whether it's still competitive; I don't remember.

But if I have time, I'll try to look into the other mentioned issues as well!

@ashvardanian (Owner, Author)

Hi @Wyctus! Any luck with this?

@ashvardanian (Owner, Author)

More context for this:

  1. VPOPCNTQ (ZMM, ZMM):
     • On Ice Lake: 3 cycles latency, executes only on port 5.
     • On Zen 4: 2 cycles, executes on ports 0 and 1.
  2. VPSHUFB (ZMM, ZMM, ZMM):
     • On Skylake-X: 1 cycle latency, executes only on port 5.
     • On Ice Lake: 1 cycle latency, executes only on port 5.
     • On Zen 4: 2 cycles, executes on ports 1 and 2.

Optimizing for Genoa and Turin, we may want to combine the first and second approaches.

@ashvardanian force-pushed the main-dev branch 2 times, most recently from 48ac9e4 to 5d9a219, on November 26, 2024 at 13:44
@ashvardanian (Owner, Author)

More context: we can use the lookup table with SAD intrinsics:

  1. VPSADBW (ZMM, ZMM, ZMM):
     • On Ice Lake: 3 cycles latency, executes only on port 5.
     • On Zen 4: 3 cycles, executes on ports 0 and 1.
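For the curious, the shuffle-then-SAD combination works in two steps: VPSHUFB applies a 16-byte table mapping each nibble to its popcount, and VPSADBW then sums groups of eight byte-counts into 64-bit lanes. A scalar sketch of those same two steps (plain C; names are mine, not SimSIMD code):

```c
#include <stdint.h>

// 16-entry table: popcount of every 4-bit nibble. This is the table that
// VPSHUFB applies to 64 nibbles at once in the avx512*-shuf kernels.
static uint8_t const nibble_popcount[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                            1, 2, 2, 3, 2, 3, 3, 4};

// Step 1 (VPSHUFB analogue): per-byte popcounts via two nibble lookups.
// Step 2 (VPSADBW analogue): sum the eight byte-counts of a 64-bit word.
static uint32_t popcount_u64_lut(uint64_t x) {
    uint32_t total = 0;
    for (int byte = 0; byte < 8; ++byte) {
        uint8_t b = (uint8_t)(x >> (byte * 8));
        total += nibble_popcount[b & 0x0F] + nibble_popcount[b >> 4];
    }
    return total;
}
```

On Zen 4, where VPSADBW dual-issues on ports 0 and 1, this path can run alongside VPOPCNTQ-based code instead of contending for the same port, which is presumably the appeal for Genoa/Turin.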

ashvardanian added a commit that referenced this pull request Nov 27, 2024
ashvardanian added a commit to ashvardanian/jaccard-index that referenced this pull request May 15, 2025
@jianshu93

Instead of bit-level Hamming, how about integer Hamming? For example, given two vectors of u64 values, I want to calculate the distance between those two vectors. Any idea how I can adjust this? --Jianshu

@ashvardanian (Owner, Author)

@jianshu93, I don't plan to add such kernels to SimSIMD for now - due to low demand and low optimization opportunity.

@jianshu93

Hi @ashvardanian, actually it is very useful. Google SimHash and MinHash: they all rely on such a Hamming distance, because for MinHash-like algorithms we need to compute an integer-level Hamming distance to estimate the Jaccard similarity in the original space.

@jianshu93

All we can do is run a SimSIMD Hamming kernel for every 64 bits (one integer), right? There is no other way for this purpose.

@ashvardanian (Owner, Author)

In those tasks you probably need the sparse kernels of SimSIMD, rather than the dense vector hamming distances? Those are already defined.

@jianshu93

Can you please elaborate on that? In practice, the u64 Hamming distance calculation is now the limiting step, and most of the time it is a dense vector. MinHash has wide application in deduplication, plagiarism detection, web search, etc., especially at large scale. --Jianshu

@ashvardanian (Owner, Author)

@jianshu93, I meant the following kernels of the `simsimd_intersect_*` and `simsimd_spdot_*` variety:

SIMSIMD_PUBLIC void simsimd_intersect_u16_serial( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_intersect_u32_serial( //
simsimd_u32_t const *a, simsimd_u32_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_counts_u16_serial( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_i16_t const *a_weights, simsimd_i16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_weights_u16_serial( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_bf16_t const *a_weights, simsimd_bf16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
/* Implements the most naive set intersection algorithm, similar to `std::set_intersection` in the C++ STL,
 * naively enumerating the elements of two arrays.
 */
SIMSIMD_PUBLIC void simsimd_intersect_u16_accurate( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_intersect_u32_accurate( //
simsimd_u32_t const *a, simsimd_u32_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_counts_u16_accurate( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_i16_t const *a_weights, simsimd_i16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_weights_u16_accurate( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_bf16_t const *a_weights, simsimd_bf16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
/* SIMD-powered backends for Arm SVE, mostly using 32-bit arithmetic over variable-length platform-defined word sizes.
* Designed for Arm Graviton 3, Microsoft Cobalt, as well as Nvidia Grace and newer Ampere Altra CPUs.
*/
SIMSIMD_PUBLIC void simsimd_intersect_u16_sve2( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_intersect_u32_sve2( //
simsimd_u32_t const *a, simsimd_u32_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_counts_u16_sve2( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_i16_t const *a_weights, simsimd_i16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_weights_u16_sve2( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_bf16_t const *a_weights, simsimd_bf16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
/* SIMD-powered backends for various generations of AVX512 CPUs.
* Skylake is handy, as it supports masked loads and other operations, avoiding the need for the tail loop.
* Ice Lake, however, is needed even for the most basic kernels to perform integer matching.
*/
SIMSIMD_PUBLIC void simsimd_intersect_u16_ice( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_intersect_u32_ice( //
simsimd_u32_t const *a, simsimd_u32_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
/* SIMD-powered backends for AMD Turin CPUs with cheap VP2INTERSECT instructions.
 * On the Intel side, only mobile Tiger Lake chips support them, and with prohibitively high latency.
 */
SIMSIMD_PUBLIC void simsimd_intersect_u16_turin( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_intersect_u32_turin( //
simsimd_u32_t const *a, simsimd_u32_t const *b, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_counts_u16_turin( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_i16_t const *a_weights, simsimd_i16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);
SIMSIMD_PUBLIC void simsimd_spdot_weights_u16_turin( //
simsimd_u16_t const *a, simsimd_u16_t const *b, //
simsimd_bf16_t const *a_weights, simsimd_bf16_t const *b_weights, //
simsimd_size_t a_length, simsimd_size_t b_length, //
simsimd_distance_t *results);

As for MinHash, it may be an interesting use case I haven't explored yet. I assume u32 and u16 would be just as relevant as u64. If so, they are a higher priority for optimization, as more such scalars fit into registers and compilers have a harder time with smaller types. What do you think?

Implementing those should be relatively straightforward, and the importance of the kernel itself is minor compared to the benefits of having dynamic dispatch in SimSIMD and numerous language bindings. Any chance you have better ideas than `d += _mm_popcnt_u32(_mm512_cmpeq_epu32_mask(a, b))`? Would you have time to draft a PR for `include/simsimd/binary.h`, adding `simsimd_hamming_u32` variants?
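For reference, a serial version of the proposed kernel could look like the sketch below (plain C; `hamming_u32_serial` is a hypothetical name for illustration, not an existing SimSIMD symbol):

```c
#include <stddef.h>
#include <stdint.h>

// Integer-level Hamming distance: counts the positions where two u32
// vectors differ. The AVX-512 variant discussed above would compare 16
// lanes at once with `_mm512_cmpeq_epu32_mask` and popcount the inverted
// 16-bit mask instead of branching per element.
static size_t hamming_u32_serial(uint32_t const *a, uint32_t const *b, size_t n) {
    size_t mismatches = 0;
    for (size_t i = 0; i != n; ++i)
        mismatches += (a[i] != b[i]); // branchless: comparison yields 0 or 1
    return mismatches;
}
```

The same shape extends naturally to u16 and u64 by swapping the comparison width and mask size.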

@jianshu93
Copy link

I agree: u32 and u64 will be easy to optimize with SIMD, at least with AVX2 and AVX-512, as you showed above. By the way, u64 values generated by 64-bit hash functions are more widely used, because they have a much smaller hash-collision probability. I will try to see what I can do, but I am a Rust person, not a C person. The SIMD function you mentioned is also the one I was thinking of. --Jianshu

@ashvardanian (Owner, Author)

Hi @jianshu93! How have you been? Any luck starting this work direction? If you have any minimal draft, I can take it from there and merge in the next release 🤗

@ashvardanian (Owner, Author)

Hi @jianshu93! I’ve made solid progress on MinHash-related issues and had a couple of quick asks, if you have a minute.

I’m curious whether you’re only using the default non-weighted MinHash variant, or if you’re also relying on any of the weighted or densified schemes. Knowing that would help me prioritize both the similarity kernels and the MinHash fingerprinting/sketching back-ends for the upcoming StringZilla 4 release.

If you get a chance to open a separate issue for this (and possibly a draft PR, if you have any code snippets), that would be super helpful 🤗

@jianshu93

Hi @ashvardanian,

Thanks! This is so fast. I only have experience testing/running the Rust SIMD Hamming implementation; see the code in this repo: https://github.com/jean-pierreBoth/anndists. Whether it is weighted MinHash or the binary one, the difference is in the hashing step, not in the distance function (the sketch-comparison step). Here we only use the Hamming distance for the distance step, so it applies equally to all MinHash and weighted MinHash variants, such as BagMinHash and DartMinHash. Most use 64-bit hashing for MinHash and only a few use 32-bit, so I would suggest only u64 Hamming. For SimHash we use bit-level Hamming, which is already in SimSIMD; this also ports to Rust, right? I would love to have an AVX-512 SIMD Hamming, because the Rust implementation above never added a 512-bit version. Thanks so much; let me know when the release is out and I would love to test it, especially AVX-512 on very large vectors, e.g., 20,000 64-bit hashes. --Jianshu
