System Info
transformers version: 5.10.2
tokenizers version: 0.22.2
- Python version: 3.11.2
- Platform: Linux (Penn State HPC, RHEL 8.10)
- PyTorch version: not relevant (tokenizer-only repro)
Who can help?
@ArthurZucker @itazap
Information
Tasks
Reproduction
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct")
print(f"tokenizer class: {type(tok).__name__}")
for s in ["How are you doing?",
"def fib(n):\n if n < 2:\n return n",
"Hello, world! 1234",
" leading spaces"]:
ids = tok.encode(s, add_special_tokens=False)
decoded = tok.decode(ids)
print(f" match={decoded == s!s:<5} input={s!r} decoded={decoded!r}")
Output on transformers 5.10.2 + tokenizers 0.22.2:
tokenizer class: LlamaTokenizer
match=False input='How are you doing?' decoded='Howareyoudoing?'
match=False input='def fib(n):\n if n < ... decoded='deffib(n):ifn<2:returnn'
match=False input='Hello, world! 1234' decoded='Hello,world!1234'
match=False input=' leading spaces' decoded='leadingspaces'
All whitespace (spaces, newlines, tabs) is stripped on the encode → decode round-trip.
Root cause
PR #44801 (closes #44779) addressed this for DeepSeek-R1 / V2 / V3 / V4 / VL / OCR by adding them to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS in src/transformers/models/auto/tokenization_auto.py. That set is keyed on model_type.
DeepSeek-Coder v1 (1.3B / 6.7B / 33B Instruct) declares model_type: llama in its config.json and tokenizer_class: LlamaTokenizerFast in tokenizer_config.json, so it falls through to LlamaTokenizer. But its tokenizer.json has the same ByteLevel-BPE pipeline as the other DeepSeek variants:
| Model |
normalizer |
pre_tokenizer subtypes |
decoder |
model |
deepseek-coder-1.3b-instruct |
Sequence (empty) |
Split×4, Digits, ByteLevel |
ByteLevel |
BPE |
DeepSeek-Coder-V2-Lite-Instruct |
Sequence (empty) |
Split×5, Digits, ByteLevel |
ByteLevel |
BPE |
DeepSeek-R1 |
Sequence (empty) |
Split×3, ByteLevel |
ByteLevel |
BPE |
LlamaTokenizer is SentencePiece-based and cannot consume a ByteLevel-BPE tokenizer.json correctly, and hence the stripped whitespace.
Why the existing fix doesn't trivially extend
Adding llama to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS would force all llama-typed models onto TokenizersBackend, including genuine Llama models that should use LlamaTokenizer. So the fix needs a different shape.
Two possible directions:
- Per-checkpoint override: explicit list of known-broken
(model_id, expected_backend) tuples. This is narrow and surgical, but the list needs maintenance as new checkpoints land.
- Pipeline-shape detection: when loading, if the resolved tokenizer class is SentencePiece-based but
tokenizer.json has model.type=BPE + decoder.type=ByteLevel, force TokenizersBackend. More robust, slightly more code.
Happy to implement either once maintainers confirm which is preferred.
Expected behavior
tok.decode(tok.encode(s)) should round-trip whitespace, as it does on transformers v4 and on the other DeepSeek variants in v5+ (DeepSeek-R1, DeepSeek-Coder-V2-Lite-Instruct — both load as TokenizersBackend and round-trip cleanly).
This is a tokenization-correctness bug, not a UX one — downstream tasks that depend on exact-token round-trip (code-generation evaluation, edit-distance against ground-truth, RL reward signals on tokenized output) silently produce wrong values.
System Info
transformersversion: 5.10.2tokenizersversion: 0.22.2Who can help?
@ArthurZucker @itazap
Information
Tasks
examplesfolder (such as GLUE/SQuAD, ...)Reproduction
Output on transformers 5.10.2 + tokenizers 0.22.2:
All whitespace (spaces, newlines, tabs) is stripped on the
encode → decoderound-trip.Root cause
PR #44801 (closes #44779) addressed this for DeepSeek-R1 / V2 / V3 / V4 / VL / OCR by adding them to
MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASSinsrc/transformers/models/auto/tokenization_auto.py. That set is keyed onmodel_type.DeepSeek-Coder v1 (1.3B / 6.7B / 33B Instruct) declares
model_type: llamain itsconfig.jsonandtokenizer_class: LlamaTokenizerFastintokenizer_config.json, so it falls through toLlamaTokenizer. But itstokenizer.jsonhas the same ByteLevel-BPE pipeline as the other DeepSeek variants:deepseek-coder-1.3b-instructSequence(empty)Split×4, Digits, ByteLevelByteLevelBPEDeepSeek-Coder-V2-Lite-InstructSequence(empty)Split×5, Digits, ByteLevelByteLevelBPEDeepSeek-R1Sequence(empty)Split×3, ByteLevelByteLevelBPELlamaTokenizeris SentencePiece-based and cannot consume a ByteLevel-BPEtokenizer.jsoncorrectly, and hence the stripped whitespace.Why the existing fix doesn't trivially extend
Adding
llamatoMODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASSwould force all llama-typed models ontoTokenizersBackend, including genuine Llama models that should useLlamaTokenizer. So the fix needs a different shape.Two possible directions:
(model_id, expected_backend)tuples. This is narrow and surgical, but the list needs maintenance as new checkpoints land.tokenizer.jsonhasmodel.type=BPE+decoder.type=ByteLevel, forceTokenizersBackend. More robust, slightly more code.Happy to implement either once maintainers confirm which is preferred.
Expected behavior
tok.decode(tok.encode(s))should round-trip whitespace, as it does on transformers v4 and on the other DeepSeek variants in v5+ (DeepSeek-R1, DeepSeek-Coder-V2-Lite-Instruct — both load asTokenizersBackendand round-trip cleanly).This is a tokenization-correctness bug, not a UX one — downstream tasks that depend on exact-token round-trip (code-generation evaluation, edit-distance against ground-truth, RL reward signals on tokenized output) silently produce wrong values.