
perf(simd): paired-byte SIMD search for memmem #55

Merged
kolkov merged 2 commits into main from feature/paired-byte-simd
Jan 4, 2026
Conversation


@kolkov kolkov commented Jan 4, 2026

Summary

Implement frequency-based rare byte selection and paired-byte AVX2 search for dramatically improved substring matching (Issue #49).

Algorithm

  • Empirical byte frequency table (256 entries) ranking bytes by commonality
  • SelectRareBytes(): identify the two rarest bytes in the needle
  • MemchrPair: AVX2 SIMD search for both bytes at their correct offsets simultaneously
  • Dramatically reduces false positives vs. single-byte search
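The idea can be sketched in scalar Go. This is an illustrative model, not the PR's actual code: the frequency values, function signatures, and the `indexPaired` helper are assumptions; the real implementation does the two byte checks for a whole vector of positions per AVX2 instruction.

```go
package main

import (
	"bytes"
	"fmt"
)

// byteRank is an illustrative stand-in for the empirical table in
// simd/byte_frequencies.go: lower rank means rarer. Here only a handful
// of common English bytes are marked frequent; real values come from
// corpus statistics.
var byteRank [256]int

func init() {
	for _, b := range []byte("etaoinshrdlu ETAOINSHRDLU") {
		byteRank[b] = 200
	}
}

// selectRareBytes returns the offsets of the two rarest bytes in needle
// (a sketch of SelectRareBytes; the exact signature is an assumption).
func selectRareBytes(needle []byte) (r1, r2 int) {
	for i := 1; i < len(needle); i++ {
		if byteRank[needle[i]] < byteRank[needle[r1]] {
			r1 = i
		}
	}
	if r1 == 0 && len(needle) > 1 {
		r2 = 1
	}
	for i := range needle {
		if i != r1 && byteRank[needle[i]] < byteRank[needle[r2]] {
			r2 = i
		}
	}
	return r1, r2
}

// indexPaired is a scalar model of the paired-byte search: a candidate
// position must match the rare byte at offset r1 AND the rare byte at
// offset r2 before the full needle comparison runs, which is what cuts
// down false positives versus a single-byte memchr.
func indexPaired(haystack, needle []byte) int {
	if len(needle) == 0 || len(needle) > len(haystack) {
		return -1
	}
	r1, r2 := selectRareBytes(needle)
	b1, b2 := needle[r1], needle[r2]
	for i := 0; i+len(needle) <= len(haystack); i++ {
		if haystack[i+r1] == b1 && haystack[i+r2] == b2 &&
			bytes.Equal(haystack[i:i+len(needle)], needle) {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexPaired([]byte("hello world, quixotic zebra"), []byte("quixotic"))) // → 13
}
```

On a haystack of English text, positions where both rare bytes line up are far scarcer than positions matching any single byte, so the expensive full comparison runs rarely.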

Benchmarks (vs stdlib bytes.Index)

Haystack   Needle   Speedup
4KB        64B      19x
16KB       64B      52x
64KB       64B      45x
1MB        64B      39x
any        7B       10.5x

Files Changed

  • simd/byte_frequencies.go (new): frequency table + SelectRareBytes
  • simd/byte_frequencies_test.go (new): comprehensive tests
  • simd/memchr_amd64.s: AVX2 memchrPairAVX2 assembly
  • simd/memchr_amd64.go: MemchrPair wrapper
  • simd/memchr_fallback.go: non-AMD64 fallback
  • simd/memchr_generic_impl.go: SWAR generic implementation
  • simd/memmem.go: refactored to use paired-byte search
  • simd/memchr_test.go: MemchrPair tests + fuzz

Test Plan

  • go test ./simd/... -race passes
  • golangci-lint run - 0 issues
  • Coverage: 86.7%
  • Pre-release check passed

Closes #49

Implement frequency-based rare byte selection and paired-byte
AVX2 search for dramatically improved substring matching.

Algorithm:
- Empirical byte frequency table (256 bytes)
- Select two rarest bytes in needle
- MemchrPair: search both bytes at correct offset simultaneously
- Reduces false positives vs single-byte search

Benchmarks (vs stdlib bytes.Index):
- 4KB haystack, 64B needle: 19x faster
- 16KB haystack, 64B needle: 52x faster
- 64KB haystack, 64B needle: 45x faster
- 1MB haystack, 64B needle: 39x faster
- Short needle (7B): 10.5x faster

New files:
- simd/byte_frequencies.go: frequency table + SelectRareBytes
- simd/byte_frequencies_test.go: comprehensive tests

Closes #49

github-actions bot commented Jan 4, 2026

Benchmark Comparison

Comparing main → PR #55

Summary: geomean 244.7n → 245.0n (+0.12%)

⚠️ Potential regressions detected:

Accelerate/memchr1-4   109.5n ± ∞ ¹   109.8n ± ∞ ¹   +0.27% (p=0.032 n=5)
Find/hello-4           710.3n ± ∞ ¹   723.3n ± ∞ ¹   +1.83% (p=0.008 n=5)
Find/foo|bar|baz-4     72.29n ± ∞ ¹   75.88n ± ∞ ¹   +4.97% (p=0.008 n=5)
IsMatch/literal-4      50.47n ± ∞ ¹   63.30n ± ∞ ¹  +25.42% (p=0.008 n=5)
geomean                244.7n         245.0n         +0.12%

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

The SWAR zero-detection formula can produce false positives when a byte
equals 0x01 adjacent to a 0x00 byte, due to borrow propagation during
subtraction. This caused test failures on the 386 architecture.

Solution: verify each candidate position after SWAR detection before
returning, while preserving the SWAR optimization for the common case.
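The borrow-propagation failure mode is easy to reproduce in a few lines of Go. This is a sketch of the bug and the verify-before-return fix, not the project's actual 386 code path; the value used is illustrative:

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	lo uint64 = 0x0101010101010101
	hi uint64 = 0x8080808080808080
)

// zeroMask is the classic SWAR zero-byte test. As a yes/no answer it is
// exact, but when the high bits are used to *locate* zero bytes, a 0x01
// byte sitting directly above a 0x00 byte is also flagged: the subtraction
// borrows out of the zero byte and turns the 0x01 byte into 0xFF.
func zeroMask(v uint64) uint64 {
	return (v - lo) &^ v & hi
}

func main() {
	// Byte 0 is 0x00 (a real zero); byte 1 is 0x01 (not zero).
	v := uint64(0x4242424242420100)
	m := zeroMask(v)
	fmt.Printf("mask = %#x\n", m) // both byte 0 and byte 1 are flagged

	// The fix described above: verify each candidate before reporting it.
	for m != 0 {
		i := bits.TrailingZeros64(m) / 8
		if byte(v>>(8*i)) == 0 {
			fmt.Printf("byte %d: confirmed zero\n", i)
		} else {
			fmt.Printf("byte %d: false positive (0x%02x)\n", i, byte(v>>(8*i)))
		}
		m &= m - 1 // clear lowest set candidate bit
	}
}
```

The verification loop only runs for flagged positions, so the common case (no candidates in a word) keeps the full SWAR speedup.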
@kolkov kolkov merged commit 54f5d8a into main Jan 4, 2026
15 checks passed
@kolkov kolkov deleted the feature/paired-byte-simd branch January 4, 2026 17:33

Development

Successfully merging this pull request may close these issues.

perf: Optimize inner_literal patterns (Rust 2.7x faster)

1 participant