Skip to content

Use cached added-token dicts in per-token decode loops#46535

Open
ishan-1010 wants to merge 1 commit into
huggingface:mainfrom
ishan-1010:fix/per-token-added-encoder-46396
Open

Use cached added-token dicts in per-token decode loops#46535
ishan-1010 wants to merge 1 commit into
huggingface:mainfrom
ishan-1010:fix/per-token-added-encoder-46396

Conversation

@ishan-1010

Copy link
Copy Markdown

What this fixes

Fixes #46396 (follow-up to #46323). Four slow tokenizers read the added_tokens_encoder / added_tokens_decoder properties inside the per-token loop of convert_tokens_to_string. Both properties rebuild and re-sort the full added-token map on every access, so decoding T tokens with N added tokens costs O(T * N * logN).

The fix

Point those loops at the maintained _added_tokens_encoder / _added_tokens_decoder dicts instead, same as #46323 did for convert_tokens_to_ids. Membership and lookup don't depend on the property's sort order, so behavior is unchanged. Files: byt5, myt5, dia, perceiver tokenizers; 10 lines changed, attribute swaps only.

This is option (B) from #46396. The broader option (memoizing the property itself) is left for that issue since it has the cpmant-mutation and sort-order questions.

Numbers

Decoding 2,840 mixed tokens with 4,000 added tokens (M1, CPU):

tokenizer before after
ByT5 2,207 ms 0.6 ms
Perceiver 1,270 ms 0.6 ms

Output verified byte-identical (md5) before/after for both.

Tests

python -m pytest tests/models/byt5/test_tokenization_byt5.py tests/models/myt5/test_tokenization_myt5.py tests/models/perceiver/test_tokenization_perceiver.py
# 78 passed, 107 skipped, 6 failed (same 6 fail on unmodified main; network/torch integration tests)
python -m pytest tests/models/dia/test_tokenization_dia.py
# 37 passed, 24 skipped

Notes

Not a duplicate: no open PR touches these call sites (checked per the contribution policy). AI-assisted; I reviewed every line and ran the tests and benchmark above.

@github-actions

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: byt5, dia, myt5, perceiver

@Rocketknight1

Copy link
Copy Markdown
Member

cc @itazap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

added_tokens_encoder rebuilds the entire added-token map on every access

2 participants