DeepSeek-Coder v1 tokenizer produces incorrect output on transformers v5+ (gap in PR #44801's fix) #46489

### System Info

- `transformers` version: 5.10.2
- `tokenizers` version: 0.22.2
- Python version: 3.11.2
- Platform: Linux (Penn State HPC, RHEL 8.10)
- PyTorch version: not relevant (tokenizer-only repro)

### Who can help?

@ArthurZucker @itazap

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

```python
from transformers import AutoTokenizer
 
tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct")
print(f"tokenizer class: {type(tok).__name__}")
 
for s in ["How are you doing?",
          "def fib(n):\n    if n < 2:\n        return n",
          "Hello, world! 1234",
          "   leading spaces"]:
    ids = tok.encode(s, add_special_tokens=False)
    decoded = tok.decode(ids)
    print(f"  match={decoded == s!s:<5}  input={s!r}  decoded={decoded!r}")
```
 
Output on transformers 5.10.2 + tokenizers 0.22.2:
 
```
tokenizer class: LlamaTokenizer
    match=False  input='How are you doing?'         decoded='Howareyoudoing?'
    match=False  input='def fib(n):\n    if n < ... decoded='deffib(n):ifn<2:returnn'
    match=False  input='Hello, world! 1234'         decoded='Hello,world!1234'
    match=False  input='   leading spaces'          decoded='leadingspaces'
```

All whitespace (spaces, newlines, tabs) is stripped on the `encode → decode` round-trip.
 
### Root cause
 
PR #44801 (closes #44779) addressed this for DeepSeek-R1 / V2 / V3 / V4 / VL / OCR by adding them to `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS` in `src/transformers/models/auto/tokenization_auto.py`. That set is keyed on `model_type`.
 
DeepSeek-Coder v1 (1.3B / 6.7B / 33B Instruct) declares `model_type: llama` in its `config.json` and `tokenizer_class: LlamaTokenizerFast` in `tokenizer_config.json`, so it falls through to `LlamaTokenizer`. But its `tokenizer.json` has the same ByteLevel-BPE pipeline as the other DeepSeek variants:
 
| Model | normalizer | pre_tokenizer subtypes | decoder | model |
|---|---|---|---|---|
| `deepseek-coder-1.3b-instruct` | `Sequence` (empty) | `Split×4, Digits, ByteLevel` | `ByteLevel` | `BPE` |
| `DeepSeek-Coder-V2-Lite-Instruct` | `Sequence` (empty) | `Split×5, Digits, ByteLevel` | `ByteLevel` | `BPE` |
| `DeepSeek-R1` | `Sequence` (empty) | `Split×3, ByteLevel` | `ByteLevel` | `BPE` |
 
`LlamaTokenizer` is SentencePiece-based and cannot consume a ByteLevel-BPE `tokenizer.json` correctly, which is why the whitespace is stripped.
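For anyone who wants to check other checkpoints, here is a minimal sketch of how a table like the one above can be produced from a parsed `tokenizer.json`. The helper names (`summarize_pipeline`, `summarize_pipeline_file`) are made up for this issue and are not part of `transformers` or `tokenizers`. It only assumes the usual `tokenizer.json` layout, where a `Sequence` pre-tokenizer lists its children under `pretokenizers`.

```python
import json
from pathlib import Path


def _type_of(component):
    """Return the `type` field of a tokenizer.json component, or None if absent."""
    return component.get("type") if isinstance(component, dict) else None


def summarize_pipeline(tokenizer_json):
    """Summarize the normalizer / pre_tokenizer / decoder / model types.

    `tokenizer_json` is the parsed content of a `tokenizer.json` file.
    """
    pre = tokenizer_json.get("pre_tokenizer") or {}
    if pre.get("type") == "Sequence":
        # A Sequence pre-tokenizer stores its children under `pretokenizers`.
        subtypes = [_type_of(child) for child in pre.get("pretokenizers", [])]
    elif pre:
        subtypes = [_type_of(pre)]
    else:
        subtypes = []
    return {
        "normalizer": _type_of(tokenizer_json.get("normalizer")),
        "pre_tokenizer": subtypes,
        "decoder": _type_of(tokenizer_json.get("decoder")),
        "model": _type_of(tokenizer_json.get("model")),
    }


def summarize_pipeline_file(path):
    """Same as `summarize_pipeline`, but reads a local tokenizer.json file."""
    return summarize_pipeline(json.loads(Path(path).read_text(encoding="utf-8")))
```

The local file can be fetched with `huggingface_hub.hf_hub_download(repo_id, "tokenizer.json")` and passed to `summarize_pipeline_file`.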
 
### Why the existing fix doesn't trivially extend
 
Adding `llama` to `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS` would force *all* llama-typed models onto `TokenizersBackend`, including genuine Llama models that should use `LlamaTokenizer`. So the fix needs a different shape.
 
Two possible directions:
 
1. **Per-checkpoint override**: an explicit list of known-broken `(model_id, expected_backend)` tuples. This is narrow and surgical, but the list needs maintenance as new checkpoints land.
2. **Pipeline-shape detection**: when loading, if the resolved tokenizer class is SentencePiece-based but `tokenizer.json` has `model.type=BPE` and `decoder.type=ByteLevel`, force `TokenizersBackend`. This is more robust, at the cost of slightly more code.

Happy to implement either once maintainers confirm which is preferred.
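To make the two directions concrete, here is a rough sketch of the decision logic. Everything in it is hypothetical: `CHECKPOINT_BACKEND_OVERRIDES`, `SENTENCEPIECE_STYLE_CLASSES`, `looks_like_bytelevel_bpe`, and `choose_backend` are names invented for illustration and do not exist in `transformers`. Backends are represented as plain strings rather than real classes, and the sketch only recognises a top-level `ByteLevel` decoder.

```python
# Hypothetical sketch; none of these names exist in transformers.

# Direction 1: explicit per-checkpoint overrides (would need maintenance).
CHECKPOINT_BACKEND_OVERRIDES = {
    "deepseek-ai/deepseek-coder-1.3b-instruct": "TokenizersBackend",
}

# Classes treated as SentencePiece-based for the purpose of this sketch.
SENTENCEPIECE_STYLE_CLASSES = {"LlamaTokenizer"}


def looks_like_bytelevel_bpe(tokenizer_json):
    """Direction 2: detect a ByteLevel-BPE pipeline from a parsed tokenizer.json."""
    model = tokenizer_json.get("model") or {}
    decoder = tokenizer_json.get("decoder") or {}
    return model.get("type") == "BPE" and decoder.get("type") == "ByteLevel"


def choose_backend(model_id, resolved_class, tokenizer_json=None):
    """Return the backend name to use for this checkpoint.

    `resolved_class` is the class name picked by the normal lookup;
    `tokenizer_json` is the parsed tokenizer.json, or None if unavailable.
    """
    override = CHECKPOINT_BACKEND_OVERRIDES.get(model_id)
    if override is not None:
        return override
    if (
        resolved_class in SENTENCEPIECE_STYLE_CLASSES
        and tokenizer_json is not None
        and looks_like_bytelevel_bpe(tokenizer_json)
    ):
        return "TokenizersBackend"
    return resolved_class
```

In a real patch only one of the two mechanisms would probably be needed. They are combined here only to show where each check would sit relative to the normal class resolution.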

### Expected behavior

`tok.decode(tok.encode(s, add_special_tokens=False))` should return `s` with its whitespace intact, as it does on transformers v4 and on the other DeepSeek variants in v5+ (DeepSeek-R1 and DeepSeek-Coder-V2-Lite-Instruct both load as `TokenizersBackend` and round-trip cleanly).

This is a tokenization-correctness bug, not a UX one: downstream tasks that depend on an exact round-trip (code-generation evaluation, edit distance against ground truth, RL reward signals on tokenized output) silently produce wrong values.
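A regression check for this could look roughly like the sketch below. `find_roundtrip_failures` is a hypothetical helper, not an existing test utility. It takes the encode and decode callables as arguments so the same check can run against any tokenizer. The stand-in functions only mimic the whitespace-stripping behavior reported above and are not real tokenizers.

```python
def find_roundtrip_failures(encode, decode, samples):
    """Return (input, decoded) pairs for every sample that does not round-trip."""
    failures = []
    for text in samples:
        decoded = decode(encode(text))
        if decoded != text:
            failures.append((text, decoded))
    return failures


# Stand-ins for illustration only: code points as "token ids".
def fake_encode(text):
    return [ord(ch) for ch in text]


def lossless_decode(ids):
    return "".join(chr(i) for i in ids)


def whitespace_dropping_decode(ids):
    # Mimics the reported bug: all whitespace disappears on decode.
    return "".join(chr(i) for i in ids if not chr(i).isspace())


SAMPLES = ["How are you doing?", "   leading spaces"]

print(find_roundtrip_failures(fake_encode, lossless_decode, SAMPLES))
print(find_roundtrip_failures(fake_encode, whitespace_dropping_decode, SAMPLES))
```

With the real tokenizer, the call would be `find_roundtrip_failures(lambda s: tok.encode(s, add_special_tokens=False), tok.decode, samples)`, and the expected result is an empty list.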
