Commit dbca35f

Optimize using precomputed lookup table for BMP characters (#97)
* Performance optimization using precomputed lookup table for BMP characters
* skip ascii path
* cleanup
* avoid unnecessary array destructuring
* update bundle stats
* add changeset
* update perf claims
* update changeset
1 parent b351f2a commit dbca35f

4 files changed: +96 -76 lines changed

.changeset/bitter-suits-arrive.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+---
+"unicode-segmenter": patch
+---
+
+Improve runtime performance of Unicode text processing.
+
+By using a precomputed lookup table for the grapheme categories of BMP characters, it improves perf by more than 10% for common cases, and by ~30% in some extreme cases.
+
+The lookup table consumes an additional 64 KB of memory, which is acceptable for most JavaScript runtime environments.
+
+This optimization was introduced by OpenCode w/ OpenAI's GPT-OSS-120B. It is the second successful attempt at meaningful optimization in this library.
+(The first one was Claude Code w/ Claude Opus 4.0)
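
The changeset above describes trading 64 KB of memory for constant-time category lookups in the BMP. A minimal sketch of that idea follows, assuming sorted, non-overlapping `[start, end, category]` ranges. The names `sampleRanges` and `buildBmpTable` and the category numbers are hypothetical, chosen for illustration, and are not taken from the library.

```javascript
// Hypothetical sample data: sorted, non-overlapping [start, end, category] ranges.
// The category numbers are illustrative only.
const sampleRanges = [
  [0x000A, 0x000A, 6],
  [0x000D, 0x000D, 1],
  [0x0300, 0x036F, 3],
  [0xFFF0, 0xFFFB, 2],
  [0x1F3FB, 0x1F3FF, 3], // outside the BMP, so it is not stored in the table
];

const BMP_MAX = 0xFFFF;

// Build a table with one byte per BMP code point (0x10000 bytes = 64 KB).
// Code points not covered by any range keep the zero-filled default.
function buildBmpTable(ranges) {
  const table = new Uint8Array(BMP_MAX + 1);
  let next = 0; // index of the first range not fully stored in the table
  while (next < ranges.length && ranges[next][0] <= BMP_MAX) {
    const [start, end, category] = ranges[next];
    table.fill(category, start, Math.min(end, BMP_MAX) + 1);
    if (end > BMP_MAX) break; // range straddles the BMP boundary: keep it searchable
    next++;
  }
  return { table, next };
}

const { table, next } = buildBmpTable(sampleRanges);
console.log(table.byteLength); // 65536
console.log(table[0x0301], table[0x41], next);
```

The loop in this sketch stops before consuming a range that starts above U+FFFF, so `next` still points at the first range a fallback search would need. That guard is a choice of this sketch.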

README.md

Lines changed: 10 additions & 10 deletions
@@ -220,7 +220,7 @@ Since [Hermes doesn't support the `Intl.Segmenter` API](https://github.com/faceb
 
 | Name                         | Unicode® | ESM? |      Size | Size (min) | Size (min+gzip) | Size (min+br) | Size (min+zstd) |
 |------------------------------|----------|------|----------:|-----------:|----------------:|--------------:|----------------:|
-| `unicode-segmenter/grapheme` |   16.0.0 |   ✔️ |    15,588 |     12,168 |           5,038 |         3,715 |           4,727 |
+| `unicode-segmenter/grapheme` |   16.0.0 |   ✔️ |    15,730 |     12,199 |           5,113 |         3,787 |           4,807 |
 | `graphemer`                  |   15.0.0 |   ✖️ |   410,435 |     95,104 |          15,752 |        10,660 |          15,911 |
 | `grapheme-splitter`          |   10.0.0 |   ✖️ |   122,252 |     23,680 |           7,852 |         4,841 |           6,750 |
 | `@formatjs/intl-segmenter`*  |   15.0.0 |   ✖️ |   603,285 |    369,560 |          72,218 |        49,416 |          67,975 |
@@ -236,7 +236,7 @@ Since [Hermes doesn't support the `Intl.Segmenter` API](https://github.com/faceb
 
 | Name                         | Bytecode size | Bytecode size (gzip)* |
 |------------------------------|--------------:|----------------------:|
-| `unicode-segmenter/grapheme` |        21,001 |                11,065 |
+| `unicode-segmenter/grapheme` |        21,435 |                11,351 |
 | `graphemer`                  |       133,978 |                31,713 |
 | `grapheme-splitter`          |        63,835 |                19,137 |
 
@@ -246,16 +246,16 @@ Since [Hermes doesn't support the `Intl.Segmenter` API](https://github.com/faceb
 
 Here is a brief explanation, and you can see [archived benchmark results](benchmark/grapheme/_records).
 
-**Performance in Node.js**: `unicode-segmenter/grapheme` is significantly faster than alternatives.
-- 6\~15x faster than other JavaScript libraries
-- 1.5\~3x faster than WASM binding of the Rust's [unicode-segmentation]
-- 1.5\~3x faster than built-in [`Intl.Segmenter`]
+**Performance in Node.js/Bun/Deno**: `unicode-segmenter/grapheme` has best-in-class performance.
+- 8\~35x faster than other JavaScript libraries.
+- 3\~5x faster than WASM binding of the Rust's [unicode-segmentation].
+- 2\~3x faster than built-in [`Intl.Segmenter`].
 
-**Performance in Bun**: `unicode-segmenter/grapheme` has almost the same performance as the built-in [`Intl.Segmenter`], with no performance degradation compared to other JavaScript libraries.
+**Performance in Browsers**: The performance in browser environments varies greatly due to differences in browser engines, which makes benchmarking inconsistent, but:
+- Still significantly faster than other JavaScript libraries.
+- Generally outperforms the built-in [`Intl.Segmenter`] in most browser environments, except Firefox.
 
-**Performance in Browsers**: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, `unicode-segmenter/grapheme` generally outperforms other JavaScript libraries in most environments.
-
-**Performance in React Native**: `unicode-segmenter/grapheme` is significantly faster than alternatives when compiled to Hermes bytecode. It's 3\~8x faster than `graphemer` and 20\~26x faster than `grapheme-splitter`, with the performance gap increasing with input size.
+**Performance in React Native**: `unicode-segmenter/grapheme` is still faster than alternatives when compiled to Hermes bytecode. It's 3\~8x faster than `graphemer` and 20\~26x faster than `grapheme-splitter`, with the performance gap increasing with input size.
 
 **Performance in QuickJS**: `unicode-segmenter/grapheme` is the only usable library in terms of performance.
 
src/core.js

Lines changed: 1 addition & 3 deletions
@@ -63,9 +63,7 @@ export function decodeUnicodeData(data, cats = '') {
  * @param {CategorizedUnicodeRange<T>[]} ranges
  * @return {number} index of matched unicode range, or -1 if no match
  */
-export function findUnicodeRangeIndex(cp, ranges) {
-  let lo = 0
-    , hi = ranges.length - 1;
+export function findUnicodeRangeIndex(cp, ranges, lo = 0, hi = ranges.length - 1) {
   while (lo <= hi) {
     let mid = lo + hi >>> 1
       , range = ranges[mid];
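
The hunk above turns the search bounds into optional parameters so a caller can restrict the search to a suffix of the range list. Only the first lines of the function are visible in this diff, so the following is a sketch of a bounded binary search over sorted `[start, end, category]` ranges, not the library's exact body. The name `findRangeIndex` and the sample data are hypothetical.

```javascript
// Binary search over sorted, non-overlapping [start, end, category] ranges.
// `lo` and `hi` default to the whole array but can narrow the search window.
function findRangeIndex(cp, ranges, lo = 0, hi = ranges.length - 1) {
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const range = ranges[mid];
    if (cp < range[0]) {
      hi = mid - 1;
    } else if (cp > range[1]) {
      lo = mid + 1;
    } else {
      return mid; // cp falls inside ranges[mid]
    }
  }
  return -1; // no range contains cp
}

// Hypothetical sample data; category numbers are illustrative only.
const searchRanges = [
  [0x000A, 0x000A, 6],
  [0x000D, 0x000D, 1],
  [0x0300, 0x036F, 3],
  [0x1F3FB, 0x1F3FF, 3],
  [0xE0020, 0xE007F, 3],
];

console.log(findRangeIndex(0x1F3FC, searchRanges));    // full search
console.log(findRangeIndex(0x1F3FC, searchRanges, 3)); // search only indices 3..4
console.log(findRangeIndex(0x0301, searchRanges, 3));  // BMP range is outside the window
```

Starting at a later index skips ranges the caller already knows cannot match, which is how the commit's `cat` function uses the new `lo` argument.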

src/grapheme.js

Lines changed: 73 additions & 63 deletions
@@ -33,6 +33,8 @@ import { consonant_ranges } from './_incb_data.js';
 
 export { GraphemeCategory };
 
+const BMP_MAX = 0xFFFF;
+
 /**
  * Unicode segmentation by extended grapheme rules.
  *
@@ -49,7 +51,7 @@ export function* graphemeSegments(input) {
   if (cp == null) return;
 
   /** Current cursor position. */
-  let cursor = cp <= 0xFFFF ? 1 : 2;
+  let cursor = cp <= BMP_MAX ? 1 : 2;
 
   /** Total length of the input string. */
   let len = input.length;
@@ -137,7 +139,7 @@ export function* graphemeSegments(input) {
       _hd = cp;
     }
 
-    cursor += cp <= 0xFFFF ? 1 : 2;
+    cursor += cp <= BMP_MAX ? 1 : 2;
     catBefore = catAfter;
   }
 
@@ -194,6 +196,26 @@ export function* splitGraphemes(text) {
   for (let s of graphemeSegments(text)) yield s.segment;
 }
 
+/**
+ * Precompute a fast lookup table for BMP code points (0..0xFFFF)
+ * This table maps each code point to its Grapheme_Cluster_Break category.
+ * It is generated once at module load time using the grapheme_ranges data.
+ * The table is a Uint8Array of length 0x10000 (64KB), which is acceptable in memory.
+ * For code points >= 0x10000 we fall back to binary search.
+ */
+let bmpLookup = new Uint8Array(BMP_MAX + 1);
+let bmpCursor = (() => {
+  let cursor = 0;
+  let cp = 0;
+  while (cp <= BMP_MAX) {
+    let range = grapheme_ranges[cursor++];
+    for (cp = range[0]; cp <= range[1]; cp++) {
+      bmpLookup[cp] = range[2];
+    }
+  }
+  return cursor;
+})();
+
 /**
  * `Grapheme_Cluster_Break` property value of a given codepoint
  *
@@ -204,35 +226,26 @@ export function* splitGraphemes(text) {
  * @return {GraphemeCategoryNum}
  */
 function cat(cp, cache) {
-  if (cp < 127) {
-    // Special-case optimization for ascii, except U+007F. This
-    // improves performance even for many primarily non-ascii texts,
-    // due to use of punctuation and white space characters from the
-    // ascii range.
-    if (cp >= 32) {
-      return 0 /* GC_Any */;
-    } else if (cp === 10) {
-      return 6 /* GC_LF */;
-    } else if (cp === 13) {
-      return 1 /* GC_CR */;
-    } else {
-      return 2 /* GC_Control */;
-    }
-  } else {
-    // If this char isn't within the cached range, update the cache to the
-    // range that includes it.
-    if (cp < cache[0] || cp > cache[1]) {
-      let index = findUnicodeRangeIndex(cp, grapheme_ranges);
-      if (index < 0) {
-        return 0;
-      }
-      let range = grapheme_ranges[index];
-      cache[0] = range[0];
-      cache[1] = range[1];
-      cache[2] = range[2];
-    }
+  // Fast lookup for BMP (0x0000..0xFFFF) using precomputed table
+  if (cp <= BMP_MAX) {
+    return /** @type {GraphemeCategoryNum} */ (bmpLookup[cp]);
+  }
+
+  // Use cached result
+  if (cp >= cache[0] && cp <= cache[1]) {
     return cache[2];
   }
+
+  // Binary search, starting from bmpCursor
+  let index = findUnicodeRangeIndex(cp, grapheme_ranges, bmpCursor);
+  if (index < 0) {
+    return 0;
+  }
+
+  const range = grapheme_ranges[index];
+  cache[0] = range[0];
+  cache[1] = range[1];
+  return (cache[2] = range[2]);
 };
 
 /**
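
The revised `cat` function resolves a category in three tiers: the BMP table, then the cached range, then a binary search limited to ranges above the BMP. The sketch below reproduces that lookup order end to end on toy data, together with the 1-or-2 code unit cursor step used by the segmenter loop. All names (`toyRanges`, `toyTable`, `firstAstral`, `category`, `categoriesOf`) and the category numbers are hypothetical and are not the library's data or API.

```javascript
const BMP_MAX = 0xFFFF;

// Hypothetical sorted ranges; category 3 stands in for "Extend" here.
const toyRanges = [
  [0x0300, 0x036F, 3],
  [0x1F3FB, 0x1F3FF, 3],
  [0xE0020, 0xE007F, 3],
];

// Tier 1 data: one byte per BMP code point, filled from the BMP ranges.
const toyTable = new Uint8Array(BMP_MAX + 1);
let firstAstral = 0; // index of the first range above the BMP
for (const [start, end, value] of toyRanges) {
  if (start > BMP_MAX) break;
  toyTable.fill(value, start, Math.min(end, BMP_MAX) + 1);
  if (end <= BMP_MAX) firstAstral++;
}

function category(cp, cache) {
  // Tier 1: constant-time table lookup for BMP code points.
  if (cp <= BMP_MAX) return toyTable[cp];
  // Tier 2: reuse the most recently matched non-BMP range.
  if (cp >= cache[0] && cp <= cache[1]) return cache[2];
  // Tier 3: binary search over the ranges above the BMP only.
  let lo = firstAstral;
  let hi = toyRanges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const [start, end, value] = toyRanges[mid];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else {
      cache[0] = start;
      cache[1] = end;
      cache[2] = value;
      return value;
    }
  }
  return 0; // default category when no range matches
}

// Walk a string by code point, advancing 1 or 2 UTF-16 code units per step.
function categoriesOf(input) {
  const out = [];
  const cache = [0, 0, 0];
  for (let i = 0; i < input.length; ) {
    const cp = input.codePointAt(i);
    out.push(category(cp, cache));
    i += cp <= BMP_MAX ? 1 : 2;
  }
  return out;
}

console.log(categoriesOf("e\u0301\u{1F3FB}")); // [ 0, 3, 3 ]
```

The cache only ever holds a non-BMP range in this arrangement, since BMP code points return before the cache is consulted.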
@@ -291,46 +304,43 @@ function isBoundary(catBefore, catAfter, risCount, emoji, incb) {
 
   // GB6 - L x (L | V | LV | LVT)
   if (catBefore === 5) {
-    if (catAfter === 5 || catAfter === 7 || catAfter === 8 || catAfter === 13) {
-      return false;
-    }
+    return !(catAfter === 5 || catAfter === 7 || catAfter === 8 || catAfter === 13);
+  }
 
-  } else {
-    // GB7 - (LV | V) x (V | T)
-    if (
-      (catBefore === 7 || catBefore === 13) &&
-      (catAfter === 13 || catAfter === 12)
-    ) {
-      return false;
-    }
+  // GB7 - (LV | V) x (V | T)
+  if (
+    (catBefore === 7 || catBefore === 13) &&
+    (catAfter === 13 || catAfter === 12)
+  ) {
+    return false;
+  }
 
-    // GB8 - (LVT | T) x T
-    if (
-      (catBefore === 8 || catBefore === 12) &&
-      catAfter === 12
-    ) {
-      return false;
-    }
+  // GB8 - (LVT | T) x T
+  if (
+    (catBefore === 8 || catBefore === 12) &&
+    catAfter === 12
+  ) {
+    return false;
+  }
 
-    // GB9b
-    if (catBefore === 9) {
-      return false;
-    }
+  // GB9b
+  if (catBefore === 9) {
+    return false;
+  }
 
-    // GB9c
-    if (catAfter === 0 && incb) {
-      return false;
-    }
+  // GB9c
+  if (catAfter === 0 && incb) {
+    return false;
+  }
 
-    // GB11
-    if (catBefore === 14 && catAfter === 4) {
-      return !emoji;
-    }
+  // GB11
+  if (catBefore === 14 && catAfter === 4) {
+    return !emoji;
+  }
 
-    // GB12, GB13
-    if (catBefore === 10 && catAfter === 10) {
-      return risCount % 2 === 0;
-    }
+  // GB12, GB13
+  if (catBefore === 10 && catAfter === 10) {
+    return risCount % 2 === 0;
   }
 
   // GB999
