added_tokens_encoder rebuilds the entire added-token map on every access #46396

@ishan-1010

Who can help?

@itazap

Summary

Follow-up to #46323, which fixed the worst case: _convert_token_to_id_with_added_voc was reading the added_tokens_encoder property, which rebuilds and re-sorts the whole added-token map on every access. The property itself still rebuilds on every read, and a few call sites still go through it.

The property's docstring says the mapping is "cached ... in self._added_tokens_encoder", but it rebuilds and re-sorts on every call:

return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}

The _added_tokens_encoder cache itself is kept up to date at every site that mutates _added_tokens_decoder, and most of the file already reads it directly.
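To make the mismatch concrete, here is a minimal sketch of the pattern. ToyTokenizer, Tok, and add() are hypothetical stand-ins rather than the real classes; only the property body is copied from the line quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for an added-token object with a `content` field."""
    content: str


class ToyTokenizer:
    """Toy model of the pattern described in this issue; not the real class."""

    def __init__(self):
        self._added_tokens_decoder = {}  # id -> Tok
        self._added_tokens_encoder = {}  # content -> id, kept in sync on mutation

    def add(self, token_id, content):
        # Every mutation of the decoder also updates the cache.
        self._added_tokens_decoder[token_id] = Tok(content)
        self._added_tokens_encoder[content] = token_id

    @property
    def added_tokens_encoder(self):
        # Same expression as quoted in the issue: sorts and rebuilds per access.
        return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}


tok = ToyTokenizer()
tok.add(32001, "<extra_1>")
tok.add(32000, "<extra_0>")

first = tok.added_tokens_encoder
second = tok.added_tokens_encoder
rebuilt_each_time = first is not second              # a fresh dict on every read
same_contents = first == tok._added_tokens_encoder   # dict equality ignores order
```

In this toy, the property and the cache always hold the same key/value pairs; the property just pays for a sort and a new dict each time it is read.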

Where it still bites

  • About 47 call sites read the property. Most are in get_vocab() (vocab.update(self.added_tokens_encoder)), which is called only occasionally.
  • Per token: byt5, myt5, dia, and perceiver test membership in the property (if token in self.added_tokens_encoder) inside _convert_token_to_id, so the whole map is rebuilt for every token converted.
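A rough illustration of the per-token case, counting rebuilds instead of timing them. CountingTokenizer and both convert_* methods are hypothetical names for this sketch, and plain strings stand in for added-token objects; the real _convert_token_to_id implementations differ.

```python
class CountingTokenizer:
    """Hypothetical toy class; it only counts how often the map is rebuilt."""

    def __init__(self, added):
        # id -> content; plain strings stand in for added-token objects.
        self._added_tokens_decoder = dict(added)
        self._added_tokens_encoder = {c: i for i, c in self._added_tokens_decoder.items()}
        self.rebuilds = 0

    @property
    def added_tokens_encoder(self):
        # Sort-and-rebuild on every access, as the property does today.
        self.rebuilds += 1
        return {c: i for i, c in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}

    def convert_via_property(self, token):
        # Mirrors the `if token in self.added_tokens_encoder` shape from the issue.
        if token in self.added_tokens_encoder:
            return self._added_tokens_encoder[token]
        return -1  # placeholder for the model-specific fallback

    def convert_via_cache(self, token):
        # Same answer, read straight from the maintained cache.
        return self._added_tokens_encoder.get(token, -1)


tok = CountingTokenizer({32000: "<a>", 32001: "<b>"})
tokens = ["h", "<a>", "i", "<b>"]

ids_property = [tok.convert_via_property(t) for t in tokens]
rebuilds_after_property = tok.rebuilds  # one rebuild per token converted

ids_cache = [tok.convert_via_cache(t) for t in tokens]
rebuilds_after_cache = tok.rebuilds     # unchanged by the cached path
```

Both paths return the same ids here; the property path costs one full rebuild per token, so the work grows with sequence length times the number of added tokens.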

Two subtleties

  1. The property returns a sorted map; the _added_tokens_encoder cache is insertion-ordered, so they aren't swappable 1:1 if sort order is part of the contract.
  2. cpmant mutates the returned dict (self.added_tokens_encoder.pop(...), tokenization_cpmant.py:151), which a shared cached dict would not tolerate.
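Both subtleties can be reproduced with plain dicts. This is a sketch under the assumption that tokens may be added out of id order; the token strings and ids are made up for illustration.

```python
# Build a decoder (id -> content) and an insertion-ordered cache (content -> id),
# adding the higher id first so the two orderings diverge.
decoder = {}
cache = {}
for token_id, content in [(32001, "<b>"), (32000, "<a>")]:
    decoder[token_id] = content
    cache[content] = token_id

# What the property returns today: a new dict sorted by id.
sorted_view = {c: i for i, c in sorted(decoder.items(), key=lambda item: item[0])}

# Subtlety 1: same pairs, different iteration order.
order_differs = list(sorted_view) != list(cache)
equal_as_dicts = sorted_view == cache

# Subtlety 2: popping from a fresh copy is harmless...
fresh = dict(sorted_view)
fresh.pop("<a>")
cache_intact = cache == sorted_view

# ...but popping from the shared cache leaves it out of sync with the decoder.
cache.pop("<a>")
cache_out_of_sync = set(cache.values()) != set(decoder)
```

Any caller that iterates the result and depends on id order, or that mutates what it gets back, would observe a behavior change if the property simply returned the cache.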

Possible directions

Happy to open a PR for whichever direction you'd prefer.
