Commit 333a778

Docs: Avoid locale-specific Unicode rules
1 parent 4b18f05 commit 333a778

include/stringzilla/utf8_unpack.h

Lines changed: 12 additions & 12 deletions
@@ -100,7 +100,9 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
  * This function applies full Unicode Case Folding as defined in the Unicode Standard (UAX #21 and
  * CaseFolding.txt), covering all bicameral scripts, all offset-based one-to-one folds, all table-based
  * one-to-one folds, and all normative one-to-many expansions. It doesn't however perform any normalization,
- * like NFKC or NFC, so combining marks are treated as-is.
+ * like NFKC or NFC, so combining marks are treated as-is. StringZilla is intentionally locale-independent:
+ * case folding produces identical results regardless of runtime locale settings, ensuring deterministic
+ * behavior across platforms and simplifying use in multi-threaded and distributed systems.
  *
  * The following character mappings are supported:
  *
@@ -110,20 +112,18 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
  * - Armenian uppercase Ա–Ֆ (U+0531–U+0556) are folded to ա–ֆ (U+0561–U+0586) using a +48 offset.
  * - Georgian Mtavruli letters Ა-Ჿ (U+1C90–U+1CBF, excluding 2) are folded to their Mkhedruli equivalents
  * (U+10D0–U+10FF) using a fixed linear translation defined by the Unicode Standard.
- * - Greek uppercase Α–Ω (U+0391–U+03A9) are folded to α–ω (U+03B1–U+03C9) via a +32 offset, with a normative
- * context-sensitive rule for sigma: Σ (U+03A3) folds to σ (U+03C3) or ς (U+03C2) depending on word-final
- * position, as required by SpecialCasing.txt.
+ * - Greek uppercase Α–Ω (U+0391–U+03A9) are folded to α–ω (U+03B1–U+03C9) via a +32 offset.
+ * Both Σ (U+03A3) and ς (U+03C2, final sigma) fold to σ (U+03C3) for consistent matching.
  * - Latin Extended characters include numerous one-to-one folds and several one-to-many expansions, including:
- * ß (U+00DF) → "ss" (U+0073 U+0073)
+ * ß (U+00DF) → "ss" (U+0073 U+0073)
  * ẞ (U+1E9E) → "ss"
  * as well as mixed-case digraphs and trigraphs normalized to lowercase sequences.
- * - Turkish and Azerbaijani dotted/dotless-I rules follow SpecialCasing.txt, including:
- * İ (U+0130) → "i̇" (U+0069 U+0307)
- * I (U+0049) → i (U+0069)
- * ı (U+0131) → ı (already lowercase)
- * with full locale-correct behavior.
- * - Lithuanian accented I/J mappings that require combining-dot additions or removals are processed
- * as multi-codepoint expansions exactly as specified in SpecialCasing.txt.
+ * - Turkic dotted/dotless-I characters are handled per Unicode Case Folding (not locale-specific):
+ * İ (U+0130) → "i̇" (U+0069 U+0307) — Full case folding with combining dot
+ * I (U+0049) → i (U+0069) — Standard folding (not Turkic I→ı)
+ * ı (U+0131) → ı (already lowercase, unchanged)
+ * - Lithuanian accented I/J characters with combining dots are processed as multi-codepoint expansions
+ * per CaseFolding.txt.
  * - Additional bicameral scripts—Cherokee, Deseret, Osage, Warang Citi, Adlam—use their normative
  * one-to-one uppercase-to-lowercase mappings defined in CaseFolding.txt.
  *
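
To make the revised comment concrete, here is a minimal sketch of locale-independent folding for a handful of the code points listed above. It is not StringZilla's implementation and does not call its API: `sketch_fold_codepoint` is a hypothetical helper that works on already-decoded code points, and it covers only the ASCII, Greek, Armenian, Georgian, ß/ẞ, and İ cases. It assumes the two excluded Georgian code points are the unassigned U+1CBB and U+1CBC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of StringZilla): folds a single code point
 * without consulting any locale. Writes 1 or 2 code points into `out` and
 * returns how many were written. */
size_t sketch_fold_codepoint(uint32_t cp, uint32_t out[2]) {
    assert(out != NULL);

    /* One-to-many expansions: sharp s (both cases) and dotted capital I. */
    if (cp == 0x00DF || cp == 0x1E9E) { /* ß, ẞ -> "ss" */
        out[0] = 0x0073;
        out[1] = 0x0073;
        return 2;
    }
    if (cp == 0x0130) { /* İ -> i + combining dot above */
        out[0] = 0x0069;
        out[1] = 0x0307;
        return 2;
    }

    /* One-to-one folds; anything unmatched (including ı, U+0131) is unchanged. */
    if (cp >= 0x0041 && cp <= 0x005A) {
        cp += 32; /* ASCII A-Z; I always folds to i, never to dotless ı */
    } else if (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2) {
        cp += 32; /* Greek uppercase; U+03A2 is unassigned */
    } else if (cp == 0x03C2) {
        cp = 0x03C3; /* final sigma folds to regular sigma */
    } else if (cp >= 0x0531 && cp <= 0x0556) {
        cp += 48; /* Armenian uppercase */
    } else if ((cp >= 0x1C90 && cp <= 0x1CBA) || (cp >= 0x1CBD && cp <= 0x1CBF)) {
        cp = cp - 0x1C90 + 0x10D0; /* Georgian Mtavruli -> Mkhedruli */
    }

    out[0] = cp;
    return 1;
}
```

The function never reads the runtime locale, so `I` folds to `i` and both sigma forms fold to `σ` whatever the process settings are. That is the behaviour the updated comment describes.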
