@@ -100,7 +100,9 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
100100 * This function applies full Unicode Case Folding as defined in the Unicode Standard (UAX #21 and
101101 * CaseFolding.txt), covering all bicameral scripts, all offset-based one-to-one folds, all table-based
102102 * one-to-one folds, and all normative one-to-many expansions. It does not, however, perform any normalization,
103- * like NFKC or NFC, so combining marks are treated as-is.
103+ * like NFKC or NFC, so combining marks are treated as-is. StringZilla is intentionally locale-independent:
104+ * case folding produces identical results regardless of runtime locale settings, ensuring deterministic
105+ * behavior across platforms and simplifying use in multi-threaded and distributed systems.
104106 *
105107 * The following character mappings are supported:
106108 *
@@ -110,20 +112,18 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
110112 * - Armenian uppercase Ա–Ֆ (U+0531–U+0556) are folded to ա–ֆ (U+0561–U+0586) using a +48 offset.
111113 * - Georgian Mtavruli letters Ა–Ჿ (U+1C90–U+1CBF, excluding the 2 unassigned code points U+1CBB and U+1CBC)
112114 * are folded to their Mkhedruli equivalents (U+10D0–U+10FF) using a fixed linear translation defined by the Unicode Standard.
113- * - Greek uppercase Α–Ω (U+0391–U+03A9) are folded to α–ω (U+03B1–U+03C9) via a +32 offset, with a normative
114- * context-sensitive rule for sigma: Σ (U+03A3) folds to σ (U+03C3) or ς (U+03C2) depending on word-final
115- * position, as required by SpecialCasing.txt.
115+ * - Greek uppercase Α–Ω (U+0391–U+03A9) are folded to α–ω (U+03B1–U+03C9) via a +32 offset.
116+ * Both Σ (U+03A3) and ς (U+03C2, final sigma) fold to σ (U+03C3) for consistent matching.
116117 * - Latin Extended characters include numerous one-to-one folds and several one-to-many expansions, including:
117- * ß (U+00DF) → "ss" (U+0073 U+0073)
118+ * ß (U+00DF) → "ss" (U+0073 U+0073)
118119 * ẞ (U+1E9E) → "ss"
119120 * as well as mixed-case digraphs and trigraphs normalized to lowercase sequences.
120- * - Turkish and Azerbaijani dotted/dotless-I rules follow SpecialCasing.txt, including:
121- * İ (U+0130) → "i̇" (U+0069 U+0307)
122- * I (U+0049) → i (U+0069)
123- * ı (U+0131) → ı (already lowercase)
124- * with full locale-correct behavior.
125- * - Lithuanian accented I/J mappings that require combining-dot additions or removals are processed
126- * as multi-codepoint expansions exactly as specified in SpecialCasing.txt.
121+ * - Turkic dotted/dotless-I characters are handled per Unicode Case Folding (not locale-specific):
122+ * İ (U+0130) → "i̇" (U+0069 U+0307) — Full case folding with combining dot
123+ * I (U+0049) → i (U+0069) — Standard folding (not Turkic I→ı)
124+ * ı (U+0131) → ı (already lowercase, unchanged)
125+ * - Lithuanian accented I/J characters with combining dots are processed as multi-codepoint expansions
126+ * per CaseFolding.txt.
127127 * - Additional bicameral scripts—Cherokee, Deseret, Osage, Warang Citi, Adlam—use their normative
128128 * one-to-one uppercase-to-lowercase mappings defined in CaseFolding.txt.
129129 *
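The mappings listed in the updated comment can be sketched per code point, as below. This is only an illustration under stated assumptions: `fold_codepoint_sketch` is a hypothetical helper, not StringZilla's implementation and not part of its API. It works on already-decoded code points rather than UTF-8 bytes, and it covers only a handful of the folds named in the comment (ASCII, ß/ẞ, İ, Greek including final sigma, Armenian, Georgian Mtavruli).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; NOT part of StringZilla.
 * Folds one code point into `out` (room for 2 code points) and returns
 * how many code points were written. No locale is consulted. */
static size_t fold_codepoint_sketch(uint32_t cp, uint32_t out[2]) {
    assert(out != NULL);
    /* ASCII A-Z: +32 offset; I (U+0049) folds to i, not to dotless ı */
    if (cp >= 0x41 && cp <= 0x5A) { out[0] = cp + 32; return 1; }
    /* ß (U+00DF) and ẞ (U+1E9E) expand to "ss" */
    if (cp == 0x00DF || cp == 0x1E9E) { out[0] = 0x73; out[1] = 0x73; return 2; }
    /* İ (U+0130) expands to i + combining dot above (U+0307) */
    if (cp == 0x0130) { out[0] = 0x0069; out[1] = 0x0307; return 2; }
    /* Final sigma ς (U+03C2) folds to σ (U+03C3) regardless of position */
    if (cp == 0x03C2) { out[0] = 0x03C3; return 1; }
    /* Greek Α–Ω: +32 offset; U+03A2 is unassigned and left unchanged */
    if (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2) { out[0] = cp + 32; return 1; }
    /* Armenian Ա–Ֆ: +48 offset */
    if (cp >= 0x0531 && cp <= 0x0556) { out[0] = cp + 48; return 1; }
    /* Georgian Mtavruli -> Mkhedruli: fixed translation of -0x0BC0,
     * skipping the unassigned U+1CBB and U+1CBC */
    if ((cp >= 0x1C90 && cp <= 0x1CBA) || (cp >= 0x1CBD && cp <= 0x1CBF)) {
        out[0] = cp - 0x0BC0;
        return 1;
    }
    /* Everything else, including ı (U+0131), passes through unchanged */
    out[0] = cp;
    return 1;
}
```

Because Σ and ς both land on σ and no locale is read, the result depends only on the input code point, which matches the locale-independent behavior the comment describes.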