Who can help?
@itazap
Summary
Follow-up to #46323, which fixed the worst case: _convert_token_to_id_with_added_voc was reading the added_tokens_encoder property, which rebuilds and re-sorts the whole added-token map on every access. The property itself still rebuilds on every read, and a few call sites still go through it.
The property's docstring says the mapping is "cached ... in self._added_tokens_encoder", but it rebuilds and re-sorts on every call:
return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
That _added_tokens_encoder cache is maintained at every _added_tokens_decoder mutation site and is already what most of the file reads.
Where it still bites
- ~47 call sites read the property, mostly
get_vocab() (vocab.update(self.added_tokens_encoder)), called occasionally.
- Per-token:
byt5, myt5, dia, and perceiver do if token in self.added_tokens_encoder inside _convert_token_to_id, so a rebuild happens per token.
Two subtleties
- The property returns a sorted map; the
_added_tokens_encoder cache is insertion-ordered, so they aren't swappable 1:1 if sort order is part of the contract.
cpmant mutates the returned dict (self.added_tokens_encoder.pop(...), tokenization_cpmant.py:151), which a shared cached dict would not tolerate.
Possible directions
Happy to open a PR for whichever direction you'd prefer.
Who can help?
@itazap
Summary
Follow-up to #46323, which fixed the worst case:
_convert_token_to_id_with_added_vocwas reading theadded_tokens_encoderproperty, which rebuilds and re-sorts the whole added-token map on every access. The property itself still rebuilds on every read, and a few call sites still go through it.The property's docstring says the mapping is "cached ... in
self._added_tokens_encoder", but it rebuilds and re-sorts on every call:That
_added_tokens_encodercache is maintained at every_added_tokens_decodermutation site and is already what most of the file reads.Where it still bites
get_vocab()(vocab.update(self.added_tokens_encoder)), called occasionally.byt5,myt5,dia, andperceiverdoif token in self.added_tokens_encoderinside_convert_token_to_id, so a rebuild happens per token.Two subtleties
_added_tokens_encodercache is insertion-ordered, so they aren't swappable 1:1 if sort order is part of the contract.cpmantmutates the returned dict (self.added_tokens_encoder.pop(...),tokenization_cpmant.py:151), which a shared cached dict would not tolerate.Possible directions
_added_tokens_encodercache, mirroring Fix convert_tokens_to_ids performance regression for slow tokenizers (#46315) #46323.Happy to open a PR for whichever direction you'd prefer.