Commit bcd5d16
Improve: Safer & faster case-folding on Ice Lake
Testing across different languages of the Leipzig
dataset yields the following results:
Tier 1: Excellent (5-7x speedup)
Language         Script          Speedup  Notes
--------------------------------------------------------------------
Hindi (hi)       Devanagari      7.10x    Pure 3-byte E0, no case folding
English (en)     Latin           6.94x    Pure ASCII fast path
Bengali (bn)     Bengali         6.49x    Pure 3-byte E0, no case folding
Tamil (ta)       Tamil           6.41x    Pure 3-byte E0, no case folding
Korean (ko)      Hangul          6.13x    Pure 3-byte EA, no case folding
Dutch (nl)       Latin           5.31x    Mostly ASCII with some Latin-1
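
The "pure ASCII fast path" noted for English can be pictured with a scalar sketch. This is only an illustration of the idea, not the vectorized Ice Lake kernel from this commit; `fold_ascii_prefix` is a hypothetical name and is not part of the StringZilla API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a StringZilla function): folds ASCII 'A'..'Z'
 * to 'a'..'z' and stops at the first non-ASCII byte. Returns the number
 * of bytes written to dst, so a caller could hand the remainder to a
 * slower, fully Unicode-aware path. */
static size_t fold_ascii_prefix(uint8_t const *src, size_t length, uint8_t *dst) {
    size_t i = 0;
    for (; i < length && src[i] < 0x80; ++i) {
        uint8_t c = src[i];
        /* Uppercase and lowercase ASCII letters differ only by 0x20. */
        dst[i] = (uint8_t)(c + ((c >= 'A' && c <= 'Z') ? 0x20 : 0));
    }
    return i;
}
```

A caller would compare the return value with `length`: if they are equal, the whole input was ASCII and no further work is needed.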
Tier 2: Good (3-5x speedup)
Language         Script          Speedup  Notes
------------------------------------------------------------------------
German (de)      Latin           4.38x    Latin-1 umlauts (ä, ö, ü)
Portuguese (pt)  Latin           4.17x    Latin-1 accents
Japanese (ja)    CJK + Hiragana  4.03x    Mixed but mostly safe 3-byte
Ukrainian (uk)   Cyrillic        3.90x    Cyrillic fast path
Spanish (es)     Latin           3.89x    Latin-1 accents
Russian (ru)     Cyrillic        3.62x    Cyrillic fast path
Greek (el)       Greek           3.44x    2-byte with +0x20 folding
Hebrew (he)      Hebrew          3.35x    RTL script, no case folding
Arabic (ar)      Arabic          3.35x    RTL script, no case folding
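
The "+0x20 folding" note for Greek can be read as a code point offset: the basic Greek capitals U+0391..U+03A1 and U+03A3..U+03AB sit exactly 0x20 below their lowercase forms. The sketch below assumes that reading and covers only this basic range (no final sigma, no accented or extended Greek); `fold_greek_2byte` is a hypothetical name, not a function from this commit.

```c
#include <stdint.h>

/* Hypothetical helper: folds one 2-byte UTF-8 sequence in place if it
 * encodes a basic Greek capital. Returns 1 if the bytes were changed.
 * Assumption: only U+0391..U+03A1 and U+03A3..U+03AB are handled. */
static int fold_greek_2byte(uint8_t pair[2]) {
    /* Require a 2-byte lead (110xxxxx) and a continuation (10xxxxxx). */
    if ((pair[0] & 0xE0) != 0xC0 || (pair[1] & 0xC0) != 0x80) return 0;
    uint32_t cp = ((uint32_t)(pair[0] & 0x1F) << 6) | (uint32_t)(pair[1] & 0x3F);
    /* U+03A2 is unassigned, so it is skipped. */
    if (cp < 0x0391 || cp > 0x03AB || cp == 0x03A2) return 0;
    cp += 0x20;
    pair[0] = (uint8_t)(0xC0 | (cp >> 6));
    pair[1] = (uint8_t)(0x80 | (cp & 0x3F));
    return 1;
}
```

Working on code points rather than raw bytes matters here: Α (CE 91) folds to α (CE B1) by adding 0x20 to the second byte, but Π (CE A0) folds to π (CF 80), which changes the lead byte as well.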
Tier 3: Moderate (2-3x speedup)
Language         Script          Speedup  Notes
-------------------------------------------------------------------
French (fr)      Latin           3.21x    Latin-1 + Latin Extended
Armenian (hy)    Armenian        2.92x    2-byte with complex folding
Persian (fa)     Arabic          2.91x    No case folding, mostly safe
Czech (cs)       Latin           2.05x    Latin Extended-A (háčky, čárky)
Polish (pl)      Latin           2.04x    Latin Extended-A (ł, ś, ź)
Turkish (tr)     Latin           2.00x    Latin Extended (İ→i, dotless ı)
Tier 4: Limited (~1-2x speedup)
Language         Script          Speedup  Notes
--------------------------------------------------------------------
Chinese (zh)     CJK             1.89x    Has fullwidth A-Z (EF)
Vietnamese (vi)  Latin Ext Add   1.04x    Mixed ASCII + 2-byte + E1
Tier 5: Regression (<1x)
Language         Script          Speedup  Notes
--------------------------------------------------------------
Georgian (ka)    Georgian        0.36x    Cross-block folding: E1→E2
1 parent bb23b60 commit bcd5d16
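
One way to read the "cross-block folding" note for Georgian: Unicode case folding maps the Asomtavruli capitals U+10A0..U+10C5 (UTF-8 lead byte E1) to Nuskhuri U+2D00..U+2D25 (lead byte E2), so a fold can change the lead byte of a 3-byte sequence. The commit does not spell out which characters it means, so the sketch below is an assumption about the note, and `fold_georgian_3byte` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: folds one 3-byte UTF-8 sequence in place if it
 * encodes a Georgian Asomtavruli capital (U+10A0..U+10C5), mapping it to
 * the Nuskhuri letter at U+2D00..U+2D25. Returns 1 if the bytes changed.
 * Assumption: the isolated capitals U+10C7 and U+10CD are not handled. */
static int fold_georgian_3byte(uint8_t seq[3]) {
    /* Require a 3-byte lead (1110xxxx) and two continuation bytes. */
    if ((seq[0] & 0xF0) != 0xE0) return 0;
    if ((seq[1] & 0xC0) != 0x80 || (seq[2] & 0xC0) != 0x80) return 0;
    uint32_t cp = ((uint32_t)(seq[0] & 0x0F) << 12) |
                  ((uint32_t)(seq[1] & 0x3F) << 6) |
                  (uint32_t)(seq[2] & 0x3F);
    if (cp < 0x10A0 || cp > 0x10C5) return 0;
    cp = cp - 0x10A0 + 0x2D00; /* jump from the E1 block to the E2 block */
    seq[0] = (uint8_t)(0xE0 | (cp >> 12));
    seq[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    seq[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 1;
}
```

For example, U+10A0 (E1 82 A0) becomes U+2D00 (E2 B4 80), while a modern Mkhedruli letter such as U+10D0 (E1 83 90) is left untouched even though it shares the E1 lead byte.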
File tree
3 files changed, +747 -33 lines changed
- include/stringzilla
- scripts