Fix DeepSeek-Coder v1 / DeepSeek-LLM v1 loading as wrong tokenizer class (#46489) by kpal002 · Pull Request #46580 · huggingface/transformers

kpal002 · 2026-06-12T05:19:23Z

What this fixes

deepseek-ai/deepseek-coder-{1.3b,6.7b,33b}-{base,instruct} and deepseek-ai/deepseek-llm-7b-{base,chat} declare model_type: llama and tokenizer_class: LlamaTokenizerFast. The existing MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS check never fires because it gates on a mismatch between the registered class and the hub class — after stripping "Fast", LlamaTokenizer == LlamaTokenizer, so the override is skipped. They fall through to LlamaTokenizer (SentencePiece-based), which silently strips all whitespace on encode → decode.

Also fixes deepseek-ai/deepseek-vl-7b-{base,chat} (model_type: multi_modality), which hits the same end-state via a different routing path and was not covered by the existing deepseek_vl / deepseek_vl_v2 entries.

Why the existing fix doesn't cover the coder/llm cases

Adding "llama" to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS would break meta-llama/Llama-2-7b-hf and every other genuine Llama checkpoint. Instead, a new CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASS frozenset keys on Hub checkpoint name prefixes.

Changes

tokenization_auto.py:

Add CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASS frozenset with deepseek-ai/deepseek-coder- and deepseek-ai/deepseek-llm- prefixes, and an early check in from_pretrained before model_type routing.
Add "multi_modality" to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS (handles deepseek-vl-7b-*).

Affected checkpoints

Checkpoint	model_type	Was	Now
`deepseek-ai/deepseek-coder-1.3b-{base,instruct}`	llama	LlamaTokenizer ❌	TokenizersBackend ✅
`deepseek-ai/deepseek-coder-6.7b-{base,instruct}`	llama	LlamaTokenizer ❌	TokenizersBackend ✅
`deepseek-ai/deepseek-coder-33b-{base,instruct}`	llama	LlamaTokenizer ❌	TokenizersBackend ✅
`deepseek-ai/deepseek-llm-7b-{base,chat}`	llama	LlamaTokenizer ❌	TokenizersBackend ✅
`deepseek-ai/deepseek-vl-7b-{base,chat}`	multi_modality	LlamaTokenizer ❌	TokenizersBackend ✅
`meta-llama/Llama-2-7b-hf`	llama	LlamaTokenizer ✅	LlamaTokenizer ✅ (unchanged)

Tests

No new test added — this is a routing-table change covered by the existing AutoTokenizerTest suite (specifically test_specialized_hub_tokenizer_class_overrides_mismatched_auto_mapping and test_mismatched_model_type_uses_config_tokenizer_class_with_sentencepiece, which exercise the same override mechanism this PR extends).

Ran locally on this PR's branch (pytest tests/models/auto/test_tokenization_auto.py -k AutoTokenizerTest, Python 3.12.13, transformers 5.10.0.dev0 editable install):

34 passed, 7 skipped, 14 warnings in 84.80s (0:01:24)

All warnings are pre-existing tokenizer deprecation notices unrelated to this change. Also green on CI: https://circleci.com/gh/huggingface/transformers/2351471

Disclosure

This fix was drafted with AI-agent assistance (diagnosis and patch). I reviewed every changed line, traced why the existing MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS mismatch check doesn't fire for this model_type collision, confirmed the full list of affected checkpoints against the Hub configs, and ran the test suite locally (output above) before opening this PR.

…enizer (issue huggingface#46489) These checkpoints declare model_type=llama and tokenizer_class=LlamaTokenizerFast, so the MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS model_type check never fires (both sides match after stripping 'Fast'). They fall through to LlamaTokenizer which is SentencePiece-based and cannot handle their ByteLevel-BPE tokenizer.json, silently stripping all whitespace on encode→decode. Adding 'llama' to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS would break all genuine Llama models. Instead, add CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASS — a frozenset of Hub checkpoint name prefixes — and check it early in from_pretrained before the model_type routing. Genuine Llama models (meta-llama/*) do not match; only the affected deepseek-ai/deepseek-coder-* and deepseek-ai/deepseek-llm-* prefixes are listed.

github-actions · 2026-06-12T05:20:29Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

…HUB_TOKENIZER_CLASS

github-actions · 2026-06-12T05:23:38Z

CI Dashboard: View test results in Grafana

Also fix deepseek-vl-7b: add multi_modality to MODELS_WITH_INCORRECT_…

2adde15

…HUB_TOKENIZER_CLASS

kpal002 mentioned this pull request Jun 12, 2026

DeepSeek-Coder v1 tokenizer produces incorrect output on transformers v5+ (gap in PR #44801's fix) #46489

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix DeepSeek-Coder v1 / DeepSeek-LLM v1 loading as wrong tokenizer class (#46489)#46580

Fix DeepSeek-Coder v1 / DeepSeek-LLM v1 loading as wrong tokenizer class (#46489)#46580
kpal002 wants to merge 2 commits into
huggingface:mainfrom
kpal002:fix/deepseek-coder-v1-tokenizer-bypass

kpal002 commented Jun 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kpal002 commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this fixes

Why the existing fix doesn't cover the coder/llm cases

Changes

Affected checkpoints

Tests

Disclosure

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kpal002 commented Jun 12, 2026 •

edited

Loading