Commit 1d1d76d

feat: add new models and ability to estimate cost (#72)
BREAKING CHANGE: changes the default encoding to `o200k_base`, as that is what most modern models use now. Fixes #71, fixes #70.
1 parent e2506c2 commit 1d1d76d

118 files changed: 13,241 additions & 5,538 deletions

File tree


.config/beemo/eslint.ts

Lines changed: 6 additions & 1 deletion
```diff
@@ -4,7 +4,12 @@ const config: ESLintConfig = {
   rules: {
     'import/no-unresolved': 'off',
   },
-  ignorePatterns: ['**/models/*.js'],
+  ignorePatterns: [
+    '**/models/*.js',
+    'src/model/*.ts',
+    'benchmark/**/*.ts',
+    'src/codegen/*.js',
+  ],
 }
 
 export default config
```

.prettierignore

Lines changed: 3 additions & 1 deletion
```diff
@@ -7,4 +7,6 @@ dts/
 esm/
 lib/
 mjs/
-umd/
+umd/
+**/*.gen.ts
+src/models.ts
```

.yarn/releases/yarn-4.5.0.cjs

Lines changed: 0 additions & 925 deletions
This file was deleted.

.yarn/releases/yarn-4.9.2.cjs

Lines changed: 942 additions & 0 deletions
Large diffs are not rendered by default.

.yarnrc.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,4 +8,4 @@ plugins:
   - path: .yarn/plugins/@yarnpkg/plugin-postinstall-dev.cjs
     spec: "https://raw.githubusercontent.com/sachinraja/yarn-plugin-postinstall-dev/main/bundles/%40yarnpkg/plugin-postinstall-dev.js"
 
-yarnPath: .yarn/releases/yarn-4.5.0.cjs
+yarnPath: .yarn/releases/yarn-4.9.2.cjs
```

README.md

Lines changed: 37 additions & 8 deletions
````diff
@@ -2,7 +2,7 @@
 
 [![Play with gpt-tokenizer](https://codesandbox.io/static/img/play-codesandbox.svg)](https://codesandbox.io/s/gpt-tokenizer-tjcjoz?fontsize=14&hidenavigation=1&theme=dark)
 
-`gpt-tokenizer` is a Token Byte Pair Encoder/Decoder supporting all OpenAI's models (including GPT-3.5, GPT-4, GPT-4o, and o1).
+`gpt-tokenizer` is a Token Byte Pair Encoder/Decoder supporting all OpenAI's models (including GPT-4o, o1, o3, o4, GPT-4.1 and older models like GPT-3.5, GPT-4).
 It's the [_fastest, smallest and lowest footprint_](#benchmarks) GPT tokenizer available for all JavaScript environments. It's written in TypeScript.
 
 This library has been trusted by:
@@ -17,7 +17,7 @@ Please consider [🩷 sponsoring](https://github.com/sponsors/niieani) the proje
 
 #### Features
 
-As of 2023, it is the most feature-complete, open-source GPT tokenizer on NPM. This package is a port of OpenAI's [tiktoken](https://github.com/openai/tiktoken), with some additional, unique features sprinkled on top:
+It is the most feature-complete, open-source GPT tokenizer on NPM. This package is a port of OpenAI's [tiktoken](https://github.com/openai/tiktoken), with some additional, unique features sprinkled on top:
 
 - Support for easily tokenizing chats thanks to the `encodeChat` function
 - Support for all current OpenAI models (available encodings: `r50k_base`, `p50k_base`, `p50k_edit`, `cl100k_base` and `o200k_base`)
@@ -26,6 +26,8 @@ As of 2023, it is the most feature-complete, open-source GPT tokenizer on NPM. T
 - Provides the ability to decode an asynchronous stream of data (using `decodeAsyncGenerator` and `decodeGenerator` with any iterable input)
 - No global cache (no accidental memory leaks, as with the original GPT-3-Encoder implementation)
 - Includes a highly performant `isWithinTokenLimit` function to assess token limit without encoding the entire text/chat
+- Built-in cost estimation with the `estimateCost` function for calculating API usage costs
+- Full library of OpenAI models with comprehensive pricing information (see [`src/models.ts`](./src/models.ts) and [`src/models.gen.ts`](./src/models.gen.ts))
 - Improves overall performance by eliminating transitive arrays
 - Type-safe (written in TypeScript)
 - Works in the browser out-of-the-box
@@ -51,8 +53,8 @@ npm install gpt-tokenizer
 
 If you wish to use a custom encoding, fetch the relevant script.
 
-- https://unpkg.com/gpt-tokenizer/dist/o200k_base.js (for `gpt-4o` and `o1`)
-- https://unpkg.com/gpt-tokenizer/dist/cl100k_base.js (for `gpt-4-*` and `gpt-3.5-turbo`)
+- https://unpkg.com/gpt-tokenizer/dist/o200k_base.js (for all modern models, such as `gpt-4o`, `gpt-4.1`, `o1` and others)
+- https://unpkg.com/gpt-tokenizer/dist/cl100k_base.js (for `gpt-4` and `gpt-3.5`)
 - https://unpkg.com/gpt-tokenizer/dist/p50k_base.js
 - https://unpkg.com/gpt-tokenizer/dist/p50k_edit.js
 - https://unpkg.com/gpt-tokenizer/dist/r50k_base.js
@@ -130,7 +132,7 @@ for await (const textChunk of decodeAsyncGenerator(asyncTokens)) {
 }
 ```
 
-By default, importing from `gpt-tokenizer` uses `cl100k_base` encoding, used by `gpt-3.5-turbo` and `gpt-4`.
+By default, importing from `gpt-tokenizer` uses `o200k_base` encoding, used by all modern OpenAI models, including `gpt-4o`, `gpt-4.1`, `o1`, etc.
 
 To get a tokenizer for a different model, import it directly, for example:
 
@@ -182,16 +184,18 @@ import {
 
 ### Supported models and their encodings
 
-- `o1-*` (`o200k_base`)
+We support all OpenAI models, including the latest ones, with the following encodings:
+
+- `o`-series models, like `o1-*`, `o3-*` and `o4-*` (`o200k_base`)
 - `gpt-4o` (`o200k_base`)
 - `gpt-4-*` (`cl100k_base`)
-- `gpt-3.5-turbo` (`cl100k_base`)
+- `gpt-3.5-*` (`cl100k_base`)
 - `text-davinci-003` (`p50k_base`)
 - `text-davinci-002` (`p50k_base`)
 - `text-davinci-001` (`r50k_base`)
 - ...and many other models, see [models.ts](./src/models.ts) for an up-to-date list of supported models and their encodings.
 
-Note: if you're using `gpt-3.5-*` or `gpt-4-*` and don't see the model you're looking for, use the `cl100k_base` encoding directly.
+If you don't see the model you're looking for, the default encoding is probably the one you want.
 
 ## API
 
@@ -326,6 +330,31 @@ async function processTokens(asyncTokensIterator) {
 }
 ```
 
+### `estimateCost(tokenCount: number, modelSpec?: ModelSpec): PriceData`
+
+Estimates the cost of processing a given number of tokens using the model's pricing data. This function calculates costs for different API usage types (main API, batch API) and cached tokens when available.
+
+The function returns a `PriceData` object with the following structure:
+- `main`: Main API pricing with `input`, `output`, `cached_input`, and `cached_output` costs
+- `batch`: Batch API pricing with the same cost categories
+
+All costs are calculated in USD based on the token count provided.
+
+Example:
+
+```typescript
+import { estimateCost } from 'gpt-tokenizer/model/gpt-4o'
+
+const tokenCount = 1000
+const costEstimate = estimateCost(tokenCount)
+
+console.log('Main API input cost:', costEstimate.main?.input)
+console.log('Main API output cost:', costEstimate.main?.output)
+console.log('Batch API input cost:', costEstimate.batch?.input)
+```
+
+Note: The model spec must be available either through the model-specific import or by passing it as the second parameter. Cost information may not be available for all models.
+
 ## Special tokens
 
 There are a few special tokens that are used by the GPT models.
````

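The `estimateCost` documentation added to the README above comes down to simple per-million-token arithmetic. As a standalone illustration of that math (the names and prices below are hypothetical stand-ins, not the library's actual implementation — real pricing lives in the generated model data):

```typescript
// Sketch of per-million-token cost arithmetic behind a PriceData-style
// result. All names and prices here are hypothetical.
interface ApiPricing {
  input?: number // USD per million input tokens
  output?: number // USD per million output tokens
  cached_input?: number
  cached_output?: number
}

interface PriceData {
  main?: ApiPricing
  batch?: ApiPricing
}

function estimateCostSketch(tokenCount: number, prices: PriceData): PriceData {
  // Scale a per-million-token price down to the given token count.
  const scale = (perMillion?: number): number | undefined =>
    perMillion === undefined ? undefined : (tokenCount / 1_000_000) * perMillion
  // Scale every cost category of one API tier, keeping absent ones undefined.
  const scaleApi = (api?: ApiPricing): ApiPricing | undefined =>
    api && {
      input: scale(api.input),
      output: scale(api.output),
      cached_input: scale(api.cached_input),
      cached_output: scale(api.cached_output),
    }
  return { main: scaleApi(prices.main), batch: scaleApi(prices.batch) }
}

// Hypothetical pricing: $2.50/M input, $10/M output on the main API.
const cost = estimateCostSketch(1_000, {
  main: { input: 2.5, output: 10 },
  batch: { input: 1.25, output: 5 },
})
console.log(cost.main?.input) // ≈ 0.0025
```

A per-model entry point can then bind its own pricing table as the default second argument, which matches the shape the README documents.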
benchmark/src/benchmarkWorker.ts

Lines changed: 0 additions & 1 deletion
```diff
@@ -1,4 +1,3 @@
-// benchmarkWorker.ts
 import type {
   BenchmarkResult,
   WorkerInput,
```

package.json

Lines changed: 16 additions & 12 deletions
```diff
@@ -73,8 +73,10 @@
     "dist"
   ],
   "scripts": {
-    "codegen:models": "rm -rf src/model && yarn tsx src/codegen/generateByModel.ts",
-    "codegen:bpe": "rm -rf src/bpeRanks && yarn tsx src/codegen/generateJsBpe.ts",
+    "codegen": "yarn codegen:bpe && yarn codegen:chat-enabled && yarn codegen:models",
+    "codegen:models": "rm -rf src/model && node --experimental-transform-types --import node-resolve-ts/register src/codegen/generateByModel.ts",
+    "codegen:bpe": "rm -rf src/bpeRanks && node --experimental-transform-types --import node-resolve-ts/register src/codegen/generateJsBpe.ts",
+    "codegen:chat-enabled": "rm -rf src/chat && node --experimental-transform-types --import node-resolve-ts/register src/codegen/generateChatEnabled.ts",
     "build": "yarn build:cjs && yarn build:esm && yarn build:umd",
     "build:cjs": "yarn rrun tsc --outDir cjs --module commonjs --target es2022 --project tsconfig-cjs.json",
     "build:esm": "mkdir -p esm && echo '{\"name\": \"gpt-tokenizer\", \"type\": \"module\"}' > ./esm/package.json && yarn rrun tsc --outDir esm --target es2022",
@@ -87,11 +89,11 @@
     "clean": "git clean -dfX --exclude=node_modules src && beemo typescript:sync-project-refs",
     "format": "yarn rrun prettier --write \"./{src,tests,.config}/**/!(*.d).{.js,jsx,ts,tsx,json,md}\"",
     "postinstallDev": "yarn prepare",
-    "prepare": "rrun husky install .config/husky && beemo create-config",
+    "prepare": "rrun husky install .config/husky && beemo create-config && echo '\n**/*.gen.ts\nsrc/models.ts' >> .prettierignore",
     "release": "beemo run-script release",
     "test": "yarn test:format && yarn test:types && yarn test:lint && yarn test:code",
     "test:code": "vitest",
-    "test:format": "yarn rrun prettier --check \"./{src,tests,.config}/**/!(*.d).{.js,jsx,ts,tsx,json,md}\"",
+    "test:format": "yarn rrun prettier --check \"./{src,tests,.config}/**/!(*.d).{.js,jsx,ts,tsx,json,md}\" --ignore-path .prettierignore",
     "test:lint": "rrun eslint 'src/*.{js,jsx,ts,tsx}'",
     "test:types": "yarn rrun tsc --noEmit"
   },
@@ -117,17 +119,19 @@
   },
   "devDependencies": {
     "@edge-runtime/vm": "^5.0.0",
-    "@niieani/scaffold": "^1.7.39",
-    "@swc/cli": "^0.5.2",
-    "@swc/core": "^1.10.4",
-    "tsx": "^4.19.2",
-    "typescript": "^5.7.2",
-    "vitest": "^2.1.8"
+    "@niieani/scaffold": "^1.7.49",
+    "@swc/cli": "^0.7.7",
+    "@swc/core": "^1.11.31",
+    "devalue": "^5.1.1",
+    "node-resolve-ts": "^1.0.2",
+    "typescript": "^5.8.3",
+    "vitest": "^3.2.2"
   },
   "resolutions": {
-    "typescript": "5.7.2"
+    "typescript": "5.8.3",
+    "prettier": "^3"
   },
-  "packageManager": "yarn@4.5.0",
+  "packageManager": "yarn@4.9.2",
   "publishConfig": {
     "access": "public"
   }
```

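The reworked codegen scripts above drop `yarn tsx` in favor of running the TypeScript sources directly with Node. For reference, the invocation pattern is (a sketch, not a new script: `--experimental-transform-types` requires a recent Node release, and `node-resolve-ts` is the loader added in this commit's devDependencies):

```shell
# Run a TypeScript codegen script directly with Node.
# --experimental-transform-types lets Node execute .ts files natively;
# the node-resolve-ts loader resolves ".js" import specifiers back to
# their ".ts" sources within the repo.
node --experimental-transform-types \
  --import node-resolve-ts/register \
  src/codegen/generateByModel.ts
```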
src/GptEncoding.test.ts

Lines changed: 80 additions & 44 deletions
```diff
@@ -7,10 +7,14 @@ import { type ChatMessage, GptEncoding } from './GptEncoding.js'
 import {
   type ChatModelName,
   type EncodingName,
+  type ModelName,
   chatModelParams,
+  DEFAULT_ENCODING,
   encodingNames,
+  modelToEncodingMap,
 } from './mapping.js'
-import { models } from './models.js'
+import * as models from './models.js'
+import * as modelsMap from './modelsMap.js'
 import { resolveEncoding } from './resolveEncoding.js'
 import { EndOfText } from './specialTokens.js'
 
@@ -250,22 +254,23 @@ const exampleMessages: ChatMessage[] = [
   },
 ] as const
 
-describe.each(chatModelNames)('%s', (modelName) => {
-  const encoding = GptEncoding.getEncodingApiForModel(
-    modelName,
-    resolveEncoding,
+describe.each(chatModelNames)('%s', async (modelName) => {
+  const encoding: GptEncoding = await import(`./model/${modelName}.ts`).then(
+    (mod) => mod.default,
   )
-  const expectedEncodedLength = modelName.startsWith('gpt-3.5-turbo')
+  const expectedEncodedLength = modelName.startsWith('gpt-3.5')
     ? 127
-    : modelName.startsWith('gpt-4o')
-      ? 120
-      : 121
+    : modelName.startsWith('gpt-4') &&
+        !modelName.startsWith('gpt-4o') &&
+        !modelName.startsWith('gpt-4.')
+      ? 121
+      : 120
 
   describe('chat functionality', () => {
     test('encodes a chat correctly', () => {
       const encoded = encoding.encodeChat(exampleMessages)
-      expect(encoded).toMatchSnapshot()
       expect(encoded).toHaveLength(expectedEncodedLength)
+      expect(encoded).toMatchSnapshot()
 
       const decoded = encoding.decode(encoded)
       expect(decoded).toMatchSnapshot()
@@ -288,45 +293,77 @@ describe.each(chatModelNames)('%s', (modelName) => {
   })
 })
 
-describe('estimateCost functionality', () => {
-  const gpt4oEncoding = GptEncoding.getEncodingApiForModel(
-    'gpt-4o',
-    resolveEncoding,
+describe('estimateCost functionality', async () => {
+  const gpt4oEncoding = await import(`./model/gpt-4o.js`).then(
+    (mod) => mod.default,
   )
-  const gpt35Encoding = GptEncoding.getEncodingApiForModel(
-    'gpt-3.5-turbo',
-    resolveEncoding,
+  const gpt35Encoding = await import(`./model/gpt-3.5-turbo.js`).then(
+    (mod) => mod.default,
   )
 
   test('estimates cost correctly for gpt-4o model', () => {
     const tokenCount = 1_000
     const cost = gpt4oEncoding.estimateCost(tokenCount)
 
-    // gpt-4o has $2.5 per million tokens for input and $10 per million tokens for output
-    expect(cost.input).toBeCloseTo(0.002_5, 6) // 1000/1M * $2.5
-    expect(cost.output).toBeCloseTo(0.01, 6) // 1000/1M * $10
-    expect(cost.batchInput).toBeCloseTo(0.001_25, 6) // 1000/1M * $1.25
-    expect(cost.batchOutput).toBeCloseTo(0.005, 6) // 1000/1M * $5
+    expect(cost).toMatchInlineSnapshot(`
+      {
+        "batch": {
+          "cached_input": undefined,
+          "cached_output": undefined,
+          "input": 0.005,
+          "output": 0.015,
+        },
+        "main": {
+          "cached_input": undefined,
+          "cached_output": undefined,
+          "input": 0.01,
+          "output": 0.03,
+        },
+      }
+    `)
   })
 
   test('estimates cost correctly for gpt-3.5-turbo model', () => {
     const tokenCount = 1_000
     const cost = gpt35Encoding.estimateCost(tokenCount)
-
-    // gpt-3.5-turbo has $0.5 per million tokens for input and $1.5 per million tokens for output
-    expect(cost.input).toBeCloseTo(0.000_5, 6) // 1000/1M * $0.5
-    expect(cost.output).toBeCloseTo(0.001_5, 6) // 1000/1M * $1.5
-    expect(cost.batchInput).toBeCloseTo(0.000_25, 6) // 1000/1M * $0.25
-    expect(cost.batchOutput).toBeCloseTo(0.000_75, 6) // 1000/1M * $0.75
+    expect(cost).toMatchInlineSnapshot(`
+      {
+        "batch": {
+          "cached_input": undefined,
+          "cached_output": undefined,
+          "input": 0.00025,
+          "output": 0.00075,
+        },
+        "main": {
+          "cached_input": undefined,
+          "cached_output": undefined,
+          "input": 0.0005,
+          "output": 0.0015,
+        },
+      }
+    `)
   })
 
   test('allows overriding model name', () => {
     const tokenCount = 1_000
     // Use gpt-4o encoding but override with gpt-3.5-turbo model name
-    const cost = gpt4oEncoding.estimateCost(tokenCount, 'gpt-3.5-turbo')
-
-    expect(cost.input).toBeCloseTo(0.000_5, 6) // 1000/1M * $0.5
-    expect(cost.output).toBeCloseTo(0.001_5, 6) // 1000/1M * $1.5
+    const cost = gpt4oEncoding.estimateCost(tokenCount, models['gpt-3.5-turbo'])
+    expect(cost).toMatchInlineSnapshot(`
+      {
+        "batch": {
+          "cached_input": undefined,
+          "cached_output": undefined,
+          "input": 0.00025,
+          "output": 0.00075,
+        },
+        "main": {
+          "cached_input": undefined,
+          "cached_output": undefined,
+          "input": 0.0005,
+          "output": 0.0015,
+        },
+      }
+    `)
   })
 
   test('throws error when model name is not provided', () => {
@@ -335,30 +372,29 @@ describe('estimateCost functionality', () => {
 
     // No model name was provided during initialization or function call
     expect(() => encoding.estimateCost(tokenCount)).toThrow(
-      'Model name must be provided either during initialization or passed in to the method.',
+      'Model spec must be provided either during initialization or passed in to the method.',
     )
   })
 
-  test('throws error for unknown model', () => {
-    const tokenCount = 1_000
-    expect(() =>
-      gpt4oEncoding.estimateCost(tokenCount, 'non-existent-model' as any),
-    ).toThrow('Unknown model: non-existent-model')
-  })
-
   test('only includes properties that exist for the model', () => {
     // Find a model that only has input cost but no output cost
    const modelWithInputOnly = Object.entries(models).find(
      ([_, model]) =>
-        model.cost?.input !== undefined && model.cost?.output === undefined,
+        'price_data' in model &&
+        model.price_data?.main?.input !== undefined &&
+        (!('output' in model.price_data.main) ||
+          model.price_data?.main?.output === undefined),
    )
 
    if (modelWithInputOnly) {
      const [modelName] = modelWithInputOnly
-      const cost = gpt4oEncoding.estimateCost(1_000, modelName as any)
+      const cost = gpt4oEncoding.estimateCost(
+        1_000,
+        models[modelName as ModelName],
+      )
 
-      expect(cost.input).toBeDefined()
-      expect(cost.output).toBeUndefined()
+      expect(cost.main?.input).toBeDefined()
+      expect(cost.main?.output).toBeUndefined()
    } else {
      // Skip test if we can't find an appropriate model
      console.log('Skipping test: no model with input-only cost found')
```

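The "input price but no output price" lookup in the rewritten test above can also be demonstrated in isolation. A sketch against a tiny hypothetical models map (the real map is generated and far larger; all names here are made up):

```typescript
// Sketch of the "input price but no output price" lookup from the test
// above, run against a hypothetical models map.
interface ModelSpecSketch {
  price_data?: { main?: { input?: number; output?: number } }
}

const sketchModels: Record<string, ModelSpecSketch> = {
  'full-priced': { price_data: { main: { input: 2.5, output: 10 } } },
  'input-only': { price_data: { main: { input: 1 } } },
  'no-pricing': {},
}

// Object.entries preserves insertion order for string keys, so find()
// returns the first entry whose main pricing has input but no output.
const inputOnlyEntry = Object.entries(sketchModels).find(
  ([, model]) =>
    model.price_data?.main?.input !== undefined &&
    model.price_data?.main?.output === undefined,
)

console.log(inputOnlyEntry?.[0]) // prints: input-only
```

Optional chaining keeps the predicate safe for entries like `no-pricing` that carry no `price_data` at all, which is why the test can scan the whole map without guards per level.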