Commit bcd5d16

Improve: Safer & faster case-folding on Ice Lake
Testing across different Leipzig dataset languages yields the following results:

Tier 1: Excellent (5-7x speedup)

Language          Script           Speedup   Notes
--------------------------------------------------------------------
Hindi (hi)        Devanagari       7.10x     Pure 3-byte E0, no case folding
English (en)      Latin            6.94x     Pure ASCII fast path
Bengali (bn)      Bengali          6.49x     Pure 3-byte E0, no case folding
Tamil (ta)        Tamil            6.41x     Pure 3-byte E0, no case folding
Korean (ko)       Hangul           6.13x     Pure 3-byte EA, no case folding
Dutch (nl)        Latin            5.31x     Mostly ASCII with some Latin-1

Tier 2: Good (3-5x speedup)

Language          Script           Speedup   Notes
------------------------------------------------------------------------
German (de)       Latin            4.38x     Latin-1 umlauts (ä, ö, ü)
Portuguese (pt)   Latin            4.17x     Latin-1 accents
Japanese (ja)     CJK + Hiragana   4.03x     Mixed but mostly safe 3-byte
Ukrainian (uk)    Cyrillic         3.90x     Cyrillic fast path
Spanish (es)      Latin            3.89x     Latin-1 accents
Russian (ru)      Cyrillic         3.62x     Cyrillic fast path
Greek (el)        Greek            3.44x     2-byte with +0x20 folding
Hebrew (he)       Hebrew           3.35x     RTL script, no case folding
Arabic (ar)       Arabic           3.35x     RTL script, no case folding

Tier 3: Moderate (2-3x speedup)

Language          Script           Speedup   Notes
-------------------------------------------------------------------
French (fr)       Latin            3.21x     Latin-1 + Latin Extended
Armenian (hy)     Armenian         2.92x     2-byte with complex folding
Persian (fa)      Arabic           2.91x     No case folding, mostly safe
Czech (cs)        Latin            2.05x     Latin Extended-A (háčky, čárky)
Polish (pl)       Latin            2.04x     Latin Extended-A (ł, ś, ź)
Turkish (tr)      Latin            2.00x     Latin Extended (İ→i, dotless ı)

Tier 4: Limited (~1-2x speedup)

Language          Script           Speedup   Notes
--------------------------------------------------------------------
Chinese (zh)      CJK              1.89x     Has fullwidth A-Z (EF)
Vietnamese (vi)   Latin Ext Add    1.04x     Mixed ASCII + 2-byte + E1

Tier 5: Regression (<1x)

Language          Script           Speedup   Notes
--------------------------------------------------------------
Georgian (ka)     Georgian         0.36x     Cross-block folding: E1→E2
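
The Notes column describes each corpus by the UTF-8 lead bytes that dominate it (ASCII, 2-byte sequences, or 3-byte sequences starting with E0, E1, EA, EF). As a rough illustration of how such a profile could be gathered for a text sample, here is a minimal sketch. `lead_byte_profile` and `profile_lead_bytes` are hypothetical names introduced for this example; they are not part of StringZilla and are not the code this commit adds.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not a StringZilla API): a histogram of UTF-8 lead bytes.
struct lead_byte_profile {
    std::size_t ascii = 0;                       // bytes 0x00..0x7F
    std::size_t two_byte = 0;                    // lead bytes 0xC0..0xDF
    std::array<std::size_t, 16> three_byte {};   // lead bytes 0xE0..0xEF, indexed by (lead - 0xE0)
    std::size_t four_byte = 0;                   // lead bytes 0xF0 and above
};

// Counts lead bytes only; continuation bytes (0x80..0xBF) are skipped.
// Assumes the input is valid UTF-8 and performs no validation.
inline lead_byte_profile profile_lead_bytes(std::string_view text) {
    lead_byte_profile profile;
    for (unsigned char byte : text) {
        if (byte < 0x80) ++profile.ascii;
        else if (byte < 0xC0) continue; // continuation byte
        else if (byte < 0xE0) ++profile.two_byte;
        else if (byte < 0xF0) ++profile.three_byte[byte - 0xE0];
        else ++profile.four_byte;
    }
    return profile;
}
```

Under this sketch, a sample whose counts fall almost entirely in `three_byte[0x0]` would match the "Pure 3-byte E0" rows, while a sample spread across `ascii`, `two_byte`, and `three_byte[0x1]` would match the "Mixed ASCII + 2-byte + E1" row.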
1 parent bb23b60 commit bcd5d16

3 files changed: +747 -33 lines changed


CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -435,6 +435,7 @@ if (STRINGZILLA_BUILD_BENCHMARK)
 define_launcher(stringzilla_bench_find_cpp20 scripts/bench_find.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_sequence_cpp20 scripts/bench_sequence.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_token_cpp20 scripts/bench_token.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
+define_launcher(stringzilla_bench_unicode_cpp20 scripts/bench_unicode.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_container_cpp20 scripts/bench_container.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_memory_cpp20 scripts/bench_memory.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
