Use cached added-token dicts in per-token decode loops by ishan-1010 · Pull Request #46535 · huggingface/transformers

ishan-1010 · 2026-06-10T10:28:49Z

What this fixes

Fixes #46396 (follow-up to #46323). Four slow tokenizers read the added_tokens_encoder / added_tokens_decoder properties inside the per-token loop of convert_tokens_to_string. Both properties rebuild and re-sort the full added-token map on every access, so decoding T tokens with N added tokens costs O(T * N * logN).

The fix

Point those loops at the maintained _added_tokens_encoder / _added_tokens_decoder dicts instead, same as #46323 did for convert_tokens_to_ids. Membership and lookup don't depend on the property's sort order, so behavior is unchanged. Files: byt5, myt5, dia, perceiver tokenizers; 10 lines changed, attribute swaps only.

This is option (B) from #46396. The broader option (memoizing the property itself) is left for that issue since it has the cpmant-mutation and sort-order questions.

Numbers

Decoding 2,840 mixed tokens with 4,000 added tokens (M1, CPU):

tokenizer	before	after
ByT5	2,207 ms	0.6 ms
Perceiver	1,270 ms	0.6 ms

Output verified byte-identical (md5) before/after for both.

Tests

python -m pytest tests/models/byt5/test_tokenization_byt5.py tests/models/myt5/test_tokenization_myt5.py tests/models/perceiver/test_tokenization_perceiver.py
# 78 passed, 107 skipped, 6 failed (same 6 fail on unmodified main; network/torch integration tests)
python -m pytest tests/models/dia/test_tokenization_dia.py
# 37 passed, 24 skipped

Notes

Not a duplicate: no open PR touches these call sites (checked per the contribution policy). AI-assisted; I reviewed every line and ran the tests and benchmark above.

…6396)

github-actions · 2026-06-10T10:29:57Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: byt5, dia, myt5, perceiver

Rocketknight1 · 2026-06-11T10:39:49Z

cc @itazap

Use cached added-token dicts in per-token decode loops (huggingface#4…

fea41c4

…6396)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use cached added-token dicts in per-token decode loops#46535

Use cached added-token dicts in per-token decode loops#46535
ishan-1010 wants to merge 1 commit into
huggingface:mainfrom
ishan-1010:fix/per-token-added-encoder-46396

ishan-1010 commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

Rocketknight1 commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ishan-1010 commented Jun 10, 2026

What this fixes

The fix

Numbers

Tests

Notes

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

Rocketknight1 commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants