
performance of countTokens #68

@pczekaj

Description

I'm comparing the performance of gpt-tokenizer 2.7.0 and tiktoken 1.0.17. On an Intel-based Mac with Node 22.11.0, I consistently get worse times for gpt-tokenizer than for tiktoken. Am I doing something wrong, or is this expected?

(screenshot of benchmark timings)

```ts
import { countTokens } from 'gpt-tokenizer';
import { encoding_for_model } from 'tiktoken';

const SAMPLE_TEXT = 'Occaecat est tempor incididunt voluptate exercitation irure quis aliqua sunt dolor. Anim nostrud incididunt eu aliquip quis culpa do incididunt eu. Magna qui dolor deserunt sit velit. Dolor anim laborum ut ad in et occaecat enim elit culpa commodo. Sit ut sit mollit adipisicing. Labore culpa do cillum proident incididunt et. Reprehenderit nisi excepteur culpa consectetur mollit consectetur laborum';

const LONG_MSG_REPEATS = 50000;
const EXPECTED_TOKENS = 86;

const gpt35Encoding = encoding_for_model('gpt-3.5-turbo');

describe('TokenizerService', () => {
  it('gpt-tokenizer short text', () => {
    const tokens = countTokens(SAMPLE_TEXT);
    expect(tokens).toBe(EXPECTED_TOKENS);
  });

  it('tiktoken short text', () => {
    const tokens = gpt35Encoding.encode(SAMPLE_TEXT).length;
    expect(tokens).toBe(EXPECTED_TOKENS);
  });

  it('gpt-tokenizer long text', () => {
    const tokens = countTokens(SAMPLE_TEXT.repeat(LONG_MSG_REPEATS));
    expect(tokens).toBe(EXPECTED_TOKENS * LONG_MSG_REPEATS);
  });

  it('tiktoken long text', () => {
    const tokens = gpt35Encoding.encode(SAMPLE_TEXT.repeat(LONG_MSG_REPEATS)).length;
    expect(tokens).toBe(EXPECTED_TOKENS * LONG_MSG_REPEATS);
  });
});
```
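The tests above only assert token counts; the timings come from the test runner's per-test durations. To measure the two libraries outside Jest, a small harness along these lines could be used (a minimal sketch; the `timeIt` helper and its demo call are hypothetical, not part of either library):

```javascript
// Minimal timing sketch (Node >= 16, where `performance` is a global).
// `timeIt` is a hypothetical helper, not part of gpt-tokenizer or tiktoken.
function timeIt(label, fn, runs = 5) {
  let best = Infinity;
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    const elapsed = performance.now() - start;
    if (elapsed < best) best = elapsed;
  }
  console.log(`${label}: ${best.toFixed(1)} ms (best of ${runs})`);
  return best;
}

// Usage sketch: drop in the real tokenizer calls from the test above, e.g.
//   timeIt('gpt-tokenizer', () => countTokens(longText));
//   timeIt('tiktoken', () => gpt35Encoding.encode(longText).length);
timeIt('demo', () => 'x'.repeat(1_000_000).length);
```

Reporting the best of several runs rather than a single run reduces noise from JIT warm-up and GC pauses, which can dominate one-shot measurements in Node.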
