`added_tokens_encoder` rebuilds the entire added-token map on every access #46396


### Who can help?

@itazap

### Summary

Follow-up to #46323, which fixed the worst case: `_convert_token_to_id_with_added_voc` was reading the `added_tokens_encoder` property, which rebuilds and re-sorts the whole added-token map on every access. The property itself still rebuilds on every read, and a few call sites still go through it.

The property's docstring says the mapping is "cached ... in `self._added_tokens_encoder`", but it rebuilds and re-sorts on every call:

```python
return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
```

The `_added_tokens_encoder` cache named in the docstring does exist: it is updated at every site that mutates `_added_tokens_decoder`, and most of the file already reads it directly.
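To make the behaviour concrete, here is a minimal toy reproduction. It is not the actual `transformers` code: `ToyTokenizer`, the stand-in `AddedToken`, the `add` helper, and the `rebuilds` counter are all made up for illustration. Only the body of the property is copied from the line quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddedToken:
    """Stand-in for the real AddedToken; only `.content` matters here."""
    content: str


class ToyTokenizer:
    def __init__(self):
        self._added_tokens_decoder = {}  # index -> AddedToken
        self._added_tokens_encoder = {}  # content -> index (the maintained cache)
        self.rebuilds = 0                # hypothetical counter, for the demo only

    def add(self, content, index):
        # Both maps are updated together at the mutation site.
        self._added_tokens_decoder[index] = AddedToken(content)
        self._added_tokens_encoder[content] = index

    @property
    def added_tokens_encoder(self):
        # Same expression as the property quoted above: sort, then build a new dict.
        self.rebuilds += 1
        return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}


tok = ToyTokenizer()
tok.add("<a>", 100)
tok.add("<b>", 101)

first = tok.added_tokens_encoder
second = tok.added_tokens_encoder

# Equal contents, but two distinct dict objects built by two separate sorts.
print(first == second, first is second, tok.rebuilds)
```

In this toy, two reads of the property produce two equal but distinct dicts, while `_added_tokens_encoder` already holds the same key/value pairs without any rebuild.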

### Where it still bites

- About 47 call sites read the property. Most are `get_vocab()` implementations (`vocab.update(self.added_tokens_encoder)`), which are called only occasionally.
- Per-token: `byt5`, `myt5`, `dia`, and `perceiver` do `if token in self.added_tokens_encoder` inside `_convert_token_to_id`, so the whole map is rebuilt and re-sorted for every token converted.
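A rough sketch of the per-token pattern, under simplifying assumptions: `ToyByteTokenizer`, `convert_via_property`, `convert_via_cache`, and the `rebuilds` counter are hypothetical names, and the decoder maps indices to plain strings rather than `AddedToken` objects. It only illustrates how the rebuild count scales with the number of tokens.

```python
class ToyByteTokenizer:
    def __init__(self, n_added):
        # index -> token string (simplified; the real decoder stores AddedToken objects)
        self._added_tokens_decoder = {i: f"<extra_{i}>" for i in range(n_added)}
        # content -> index, the maintained cache
        self._added_tokens_encoder = {t: i for i, t in self._added_tokens_decoder.items()}
        self.rebuilds = 0  # hypothetical counter, for the demo only

    @property
    def added_tokens_encoder(self):
        # Rebuilds and re-sorts on every read, like the property in question.
        self.rebuilds += 1
        return {t: i for i, t in sorted(self._added_tokens_decoder.items())}

    def convert_via_property(self, token):
        # Shape of the per-token callers: one property read per token.
        encoder = self.added_tokens_encoder
        if token in encoder:
            return encoder[token]
        return None

    def convert_via_cache(self, token):
        # Same lookup against the maintained cache: no rebuild.
        return self._added_tokens_encoder.get(token)


tok = ToyByteTokenizer(n_added=100)
tokens = ["<extra_0>", "a", "b", "c"]

ids_property = [tok.convert_via_property(t) for t in tokens]
rebuilds_after_property = tok.rebuilds

ids_cache = [tok.convert_via_cache(t) for t in tokens]
rebuilds_after_cache = tok.rebuilds

print(ids_property == ids_cache, rebuilds_after_property, rebuilds_after_cache)
```

Both paths return the same ids here; the property path does one full sort-and-rebuild per token, the cache path does none.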

### Two subtleties

1. The property returns a sorted map; the `_added_tokens_encoder` cache is insertion-ordered, so they aren't swappable 1:1 if sort order is part of the contract.
2. `cpmant` mutates the returned dict (`self.added_tokens_encoder.pop(...)`, `tokenization_cpmant.py:151`), which a shared cached dict would not tolerate.
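Both subtleties can be shown with plain dicts. This is a toy sketch, not library code: `decoder`, `cache`, and `rebuilt_encoder` are illustrative names, and it assumes tokens can be registered out of index order, which I haven't verified happens in practice.

```python
decoder = {}  # index -> content
cache = {}    # content -> index, kept in insertion order

# Register the higher index first, so insertion order differs from index order.
for content, index in [("<late>", 200), ("<early>", 100)]:
    decoder[index] = content
    cache[content] = index


def rebuilt_encoder():
    # What the property does today: sort by index, build a fresh dict.
    return {content: index for index, content in sorted(decoder.items())}


# Subtlety 1: same items, different iteration order.
sorted_order = list(rebuilt_encoder())  # ordered by index
insertion_order = list(cache)           # ordered by registration

# Subtlety 2: popping from a freshly built dict is harmless...
fresh = rebuilt_encoder()
fresh.pop("<early>")
survives_fresh_pop = "<early>" in rebuilt_encoder()

# ...but popping from a shared cached dict removes the entry for every reader.
shared = cache
shared.pop("<early>")
lost_from_cache = "<early>" not in cache

print(sorted_order, insertion_order, survives_fresh_pop, lost_from_cache)
```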

### Possible directions

- **(A)** Memoize the sorted map, invalidated at the decoder-mutation sites (and handle the cpmant case). Fixes all call sites and matches the docstring.
- **(B)** Narrow: point the four per-token callers at the maintained `_added_tokens_encoder` cache, mirroring #46323.
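For (A), one possible shape is sketched below. Everything here is hypothetical: `ToyTokenizer`, `_set_added_token`, `_sorted_encoder_cache`, and the `rebuilds` counter don't exist in the library, and returning a read-only `types.MappingProxyType` view is just one way to deal with callers that mutate the result (returning a copy would be another, at the cost of an O(n) copy per read).

```python
from types import MappingProxyType


class ToyTokenizer:
    def __init__(self):
        self._added_tokens_decoder = {}    # index -> content (simplified)
        self._sorted_encoder_cache = None  # hypothetical memo slot
        self.rebuilds = 0                  # hypothetical counter, for the demo only

    def _set_added_token(self, index, content):
        # Stand-in for a decoder-mutation site: mutate, then invalidate the memo.
        self._added_tokens_decoder[index] = content
        self._sorted_encoder_cache = None

    @property
    def added_tokens_encoder(self):
        if self._sorted_encoder_cache is None:
            # Sort and rebuild only after a mutation, not on every read.
            self.rebuilds += 1
            self._sorted_encoder_cache = {
                content: index
                for index, content in sorted(self._added_tokens_decoder.items())
            }
        # Read-only view, so a caller cannot mutate the shared memo by accident.
        return MappingProxyType(self._sorted_encoder_cache)


tok = ToyTokenizer()
tok._set_added_token(200, "<late>")
tok._set_added_token(100, "<early>")

a = tok.added_tokens_encoder
b = tok.added_tokens_encoder
rebuilds_before_mutation = tok.rebuilds  # two reads, one rebuild

tok._set_added_token(50, "<new>")
c = tok.added_tokens_encoder
rebuilds_after_mutation = tok.rebuilds   # mutation invalidated the memo

print(list(a), list(c), rebuilds_before_mutation, rebuilds_after_mutation)
```

With a read-only view, a call like cpmant's `.pop(...)` on the result would fail loudly, so that call site would need to change in the same PR; returning a copy would keep it working as-is.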

Happy to open a PR for whichever direction you'd prefer.

