Skip to content

Fix DeepSeek-Coder v1 / DeepSeek-LLM v1 loading as wrong tokenizer class (#46489)#46580

Open
kpal002 wants to merge 2 commits into
huggingface:mainfrom
kpal002:fix/deepseek-coder-v1-tokenizer-bypass
Open

Fix DeepSeek-Coder v1 / DeepSeek-LLM v1 loading as wrong tokenizer class (#46489)#46580
kpal002 wants to merge 2 commits into
huggingface:mainfrom
kpal002:fix/deepseek-coder-v1-tokenizer-bypass

Conversation

@kpal002

@kpal002 kpal002 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What this fixes

Closes #46489.

deepseek-ai/deepseek-coder-{1.3b,6.7b,33b}-{base,instruct} and deepseek-ai/deepseek-llm-7b-{base,chat} declare model_type: llama and tokenizer_class: LlamaTokenizerFast. The existing MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS check never fires because it gates on a mismatch between the registered class and the hub class — after stripping "Fast", LlamaTokenizer == LlamaTokenizer, so the override is skipped. They fall through to LlamaTokenizer (SentencePiece-based), which silently strips all whitespace on encode → decode.

Also fixes deepseek-ai/deepseek-vl-7b-{base,chat} (model_type: multi_modality), which hits the same end-state via a different routing path and was not covered by the existing deepseek_vl / deepseek_vl_v2 entries.

Why the existing fix doesn't cover the coder/llm cases

Adding "llama" to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS would break meta-llama/Llama-2-7b-hf and every other genuine Llama checkpoint. Instead, a new CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASS frozenset keys on Hub checkpoint name prefixes.

Changes

tokenization_auto.py:

  1. Add CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASS frozenset with deepseek-ai/deepseek-coder- and deepseek-ai/deepseek-llm- prefixes, and an early check in from_pretrained before model_type routing.
  2. Add "multi_modality" to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS (handles deepseek-vl-7b-*).

Affected checkpoints

Checkpoint model_type Was Now
deepseek-ai/deepseek-coder-1.3b-{base,instruct} llama LlamaTokenizer ❌ TokenizersBackend ✅
deepseek-ai/deepseek-coder-6.7b-{base,instruct} llama LlamaTokenizer ❌ TokenizersBackend ✅
deepseek-ai/deepseek-coder-33b-{base,instruct} llama LlamaTokenizer ❌ TokenizersBackend ✅
deepseek-ai/deepseek-llm-7b-{base,chat} llama LlamaTokenizer ❌ TokenizersBackend ✅
deepseek-ai/deepseek-vl-7b-{base,chat} multi_modality LlamaTokenizer ❌ TokenizersBackend ✅
meta-llama/Llama-2-7b-hf llama LlamaTokenizer ✅ LlamaTokenizer ✅ (unchanged)

Tests

No new test added — this is a routing-table change covered by the existing AutoTokenizerTest suite (specifically test_specialized_hub_tokenizer_class_overrides_mismatched_auto_mapping and test_mismatched_model_type_uses_config_tokenizer_class_with_sentencepiece, which exercise the same override mechanism this PR extends).

Ran locally on this PR's branch (pytest tests/models/auto/test_tokenization_auto.py -k AutoTokenizerTest, Python 3.12.13, transformers 5.10.0.dev0 editable install):

34 passed, 7 skipped, 14 warnings in 84.80s (0:01:24)

All warnings are pre-existing tokenizer deprecation notices unrelated to this change. Also green on CI: https://circleci.com/gh/huggingface/transformers/2351471

Disclosure

This fix was drafted with AI-agent assistance (diagnosis and patch). I reviewed every changed line, traced why the existing MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS mismatch check doesn't fire for this model_type collision, confirmed the full list of affected checkpoints against the Hub configs, and ran the test suite locally (output above) before opening this PR.

…enizer (issue huggingface#46489)

These checkpoints declare model_type=llama and tokenizer_class=LlamaTokenizerFast,
so the MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS model_type check never fires (both
sides match after stripping 'Fast'). They fall through to LlamaTokenizer which is
SentencePiece-based and cannot handle their ByteLevel-BPE tokenizer.json, silently
stripping all whitespace on encode→decode.

Adding 'llama' to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS would break all genuine
Llama models. Instead, add CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASS —
a frozenset of Hub checkpoint name prefixes — and check it early in from_pretrained
before the model_type routing. Genuine Llama models (meta-llama/*) do not match;
only the affected deepseek-ai/deepseek-coder-* and deepseek-ai/deepseek-llm-* prefixes
are listed.
@github-actions

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@github-actions

Copy link
Copy Markdown
Contributor

CI Dashboard: View test results in Grafana

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DeepSeek-Coder v1 tokenizer produces incorrect output on transformers v5+ (gap in PR #44801's fix)

2 participants