
Commit 84887b4

fix: workaround for webpack not exposing the default export in UMD correctly

Fixes #12

1 parent: 774cf36
6 files changed: 18 additions, 15 deletions

README.md (9 additions, 6 deletions)

@@ -4,21 +4,22 @@

 `gpt-tokenizer` is a highly optimized Token Byte Pair Encoder/Decoder for all OpenAI's models (including those used by GPT-2, GPT-3, GPT-3.5 and GPT-4). It's written in TypeScript, and is fully compatible with all modern JavaScript environments.

+This package is a port of OpenAI's [tiktoken](https://github.com/openai/tiktoken), with some additional features sprinkled on top.
+
 OpenAI's GPT models utilize byte pair encoding to transform text into a sequence of integers before feeding them into the model.

 As of 2023, it is the most feature-complete, open-source GPT tokenizer on NPM. It implements some unique features, such as:

+- Support for easily tokenizing chats thanks to the `encodeChat` function
 - Support for all current OpenAI models (available encodings: `r50k_base`, `p50k_base`, `p50k_edit` and `cl100k_base`)
-- Generator function versions of both the decoder and encoder
+- Generator function versions of both the decoder and encoder functions
 - Provides the ability to decode an asynchronous stream of data (using `decodeAsyncGenerator` and `decodeGenerator` with any iterable input)
 - No global cache (no accidental memory leaks, as with the original GPT-3-Encoder implementation)
-- Includes a highly performant `isWithinTokenLimit` function to assess token limit without encoding the entire text
+- Includes a highly performant `isWithinTokenLimit` function to assess token limit without encoding the entire text/chat
 - Improves overall performance by eliminating transitive arrays
 - Type-safe (written in TypeScript)
 - Works in the browser out-of-the-box

-This package is a port of OpenAI's [tiktoken](https://github.com/openai/tiktoken), with some additional features sprinkled on top.
-
 Thanks to @dmitry-brazhenko's [SharpToken](https://github.com/dmitry-brazhenko/SharpToken), whose code was served as a reference for the port.

 Historical note: This package started off as a fork of [latitudegames/GPT-3-Encoder](https://github.com/latitudegames/GPT-3-Encoder), but version 2.0 was rewritten from scratch.

@@ -38,17 +39,19 @@ npm install gpt-tokenizer

 <script>
 // the package is now available as a global:
-const { encode, decode } = GPTTokenizer
+const { encode, decode } = GPTTokenizer_cl100k_base
 </script>
 ```

-If you wish to use a custom encoding, fetch the relevant script:
+If you wish to use a custom encoding, fetch the relevant script.

 - https://unpkg.com/gpt-tokenizer/dist/cl100k_base.js
 - https://unpkg.com/gpt-tokenizer/dist/p50k_base.js
 - https://unpkg.com/gpt-tokenizer/dist/p50k_edit.js
 - https://unpkg.com/gpt-tokenizer/dist/r50k_base.js

+The global name is a concatenation: `GPTTokenizer_${encoding}`.
+
 Refer to [supported models and their encodings](#Supported-models-and-their-encodings) section for more information.

 ## Playground
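
For orientation, a minimal sketch of how the renamed browser global is meant to be consumed after this change; the `encode`/`decode` signatures in the `declare` block are assumptions for illustration and are not part of this diff:

```ts
// Assumes https://unpkg.com/gpt-tokenizer/dist/cl100k_base.js has been loaded
// via a <script> tag; per this commit, the bundle now defines a global named
// GPTTokenizer_cl100k_base (previously GPTTokenizer).
// The method signatures below are assumptions for illustration only.
declare const GPTTokenizer_cl100k_base: {
  encode(text: string): number[]
  decode(tokens: Iterable<number>): string
}

const { encode, decode } = GPTTokenizer_cl100k_base
const tokens = encode('Hello, world!')
console.log(tokens.length) // number of tokens in the string
console.log(decode(tokens)) // round-trips back to 'Hello, world!'
```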

package.json (5 additions, 5 deletions)

@@ -1,7 +1,7 @@
 {
 "name": "gpt-tokenizer",
 "version": "0.0.0",
-"description": "BPE Encoder Decoder for GPT-2 / GPT-3",
+"description": "A pure JavaScript implementation of a BPE tokenizer (Encoder/Decoder) for GPT-2 / GPT-3 / GPT-4 and other OpenAI models",
 "keywords": [
 "BPE",
 "encoder",

@@ -76,10 +76,10 @@
 "build:cjs": "yarn rrun tsc --outDir cjs --module commonjs --target es2022 --project tsconfig-cjs.json",
 "build:esm": "yarn rrun tsc --outDir esm --module esnext --target es2022 && echo '{\"name\": \"gpt-tokenizer\", \"type\": \"module\"}' > ./esm/package.json",
 "build:umd": "yarn build:umd:cl100k_base && yarn build:umd:p50k_base && yarn build:umd:p50k_edit && yarn build:umd:r50k_base",
-"build:umd:cl100k_base": "beemo webpack --entry='./src/main.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer' --env 'export=default' --env 'filename=cl100k_base.js'",
-"build:umd:p50k_base": "beemo webpack --entry='./src/encoding/p50k_base.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer' --env 'export=default' --env 'filename=p50k_base.js'",
-"build:umd:p50k_edit": "beemo webpack --entry='./src/encoding/p50k_edit.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer' --env 'export=default' --env 'filename=p50k_edit.js'",
-"build:umd:r50k_base": "beemo webpack --entry='./src/encoding/r50k_base.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer' --env 'export=default' --env 'filename=r50k_base.js'",
+"build:umd:cl100k_base": "beemo webpack --entry='./src/main.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_cl100k_base' --env 'export=api' --env 'filename=cl100k_base.js'",
+"build:umd:p50k_base": "beemo webpack --entry='./src/encoding/p50k_base.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_p50k_base' --env 'export=api' --env 'filename=p50k_base.js'",
+"build:umd:p50k_edit": "beemo webpack --entry='./src/encoding/p50k_edit.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_p50k_edit' --env 'export=api' --env 'filename=p50k_edit.js'",
+"build:umd:r50k_base": "beemo webpack --entry='./src/encoding/r50k_base.ts' --env 'outDir=dist' --env 'moduleTarget=umd' --env 'engineTarget=web' --env 'codeTarget=es2022' --env 'name=GPTTokenizer_r50k_base' --env 'export=api' --env 'filename=r50k_base.js'",
 "clean": "git clean -dfX --exclude=node_modules src && beemo typescript:sync-project-refs",
 "format": "yarn rrun prettier --write \"./{src,tests,.config}/**/!(*.d).{.js,jsx,ts,tsx,json,md}\"",
 "postinstallDev": "yarn prepare",

src/encoding/cl100k_base.ts (1 addition, 1 deletion)

@@ -5,7 +5,7 @@ import { GptEncoding } from '../GptEncoding.js'

 export * from '../specialTokens.js'

-const api = GptEncoding.getEncodingApi('cl100k_base', () =>
+export const api = GptEncoding.getEncodingApi('cl100k_base', () =>
   convertTokenBytePairEncodingFromTuples(encoder),
 )
 const {

src/encoding/p50k_base.ts (1 addition, 1 deletion)

@@ -5,7 +5,7 @@ import { GptEncoding } from '../GptEncoding.js'

 export * from '../specialTokens.js'

-const api = GptEncoding.getEncodingApi('p50k_base', () =>
+export const api = GptEncoding.getEncodingApi('p50k_base', () =>
   convertTokenBytePairEncodingFromTuples(encoder),
 )
 const {

src/encoding/p50k_edit.ts (1 addition, 1 deletion)

@@ -5,7 +5,7 @@ import { GptEncoding } from '../GptEncoding.js'

 export * from '../specialTokens.js'

-const api = GptEncoding.getEncodingApi('p50k_edit', () =>
+export const api = GptEncoding.getEncodingApi('p50k_edit', () =>
   convertTokenBytePairEncodingFromTuples(encoder),
 )
 const {

src/encoding/r50k_base.ts (1 addition, 1 deletion)

@@ -5,7 +5,7 @@ import { GptEncoding } from '../GptEncoding.js'

 export * from '../specialTokens.js'

-const api = GptEncoding.getEncodingApi('r50k_base', () =>
+export const api = GptEncoding.getEncodingApi('r50k_base', () =>
   convertTokenBytePairEncodingFromTuples(encoder),
 )
 const {
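
All four encoding modules receive the same one-line change: the `api` instance returned by `GptEncoding.getEncodingApi` is now exported by name, which is what the `export=api` webpack setting above points the UMD global at. A small sketch of consuming that named export directly; the import specifier is hypothetical, and treating `encode`/`decode` as methods on `api` is an assumption based on the destructuring that follows in each module:

```ts
// Hypothetical import path for illustration; the package's real subpath
// exports are not part of this diff.
import { api } from 'gpt-tokenizer/esm/encoding/cl100k_base.js'

// Assumption: `api` is the GptEncoding instance, and the functions the module
// re-exports (encode, decode, ...) are also available as methods on it.
const tokens = api.encode('Hello, world!')
console.log(api.decode(tokens))
```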
