Fix DeepSeek-Coder v1 / DeepSeek-LLM v1 loading as wrong tokenizer class (#46489)#46580
Open
kpal002 wants to merge 2 commits into
Open
Fix DeepSeek-Coder v1 / DeepSeek-LLM v1 loading as wrong tokenizer class (#46489)#46580kpal002 wants to merge 2 commits into
kpal002 wants to merge 2 commits into
Conversation
…enizer (issue huggingface#46489) These checkpoints declare model_type=llama and tokenizer_class=LlamaTokenizerFast, so the MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS model_type check never fires (both sides match after stripping 'Fast'). They fall through to LlamaTokenizer which is SentencePiece-based and cannot handle their ByteLevel-BPE tokenizer.json, silently stripping all whitespace on encode→decode. Adding 'llama' to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS would break all genuine Llama models. Instead, add CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASS — a frozenset of Hub checkpoint name prefixes — and check it early in from_pretrained before the model_type routing. Genuine Llama models (meta-llama/*) do not match; only the affected deepseek-ai/deepseek-coder-* and deepseek-ai/deepseek-llm-* prefixes are listed.
Contributor
|
[For maintainers] Suggested jobs to run (before merge) run-slow: auto |
…HUB_TOKENIZER_CLASS
Contributor
|
CI Dashboard: View test results in Grafana |
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this fixes
Closes #46489.
deepseek-ai/deepseek-coder-{1.3b,6.7b,33b}-{base,instruct}anddeepseek-ai/deepseek-llm-7b-{base,chat}declaremodel_type: llamaandtokenizer_class: LlamaTokenizerFast. The existingMODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASScheck never fires because it gates on a mismatch between the registered class and the hub class — after stripping"Fast",LlamaTokenizer == LlamaTokenizer, so the override is skipped. They fall through toLlamaTokenizer(SentencePiece-based), which silently strips all whitespace on encode → decode.Also fixes
deepseek-ai/deepseek-vl-7b-{base,chat}(model_type: multi_modality), which hits the same end-state via a different routing path and was not covered by the existingdeepseek_vl/deepseek_vl_v2entries.Why the existing fix doesn't cover the coder/llm cases
Adding
"llama"toMODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASSwould breakmeta-llama/Llama-2-7b-hfand every other genuine Llama checkpoint. Instead, a newCHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASSfrozenset keys on Hub checkpoint name prefixes.Changes
tokenization_auto.py:CHECKPOINT_PREFIXES_WITH_INCORRECT_HUB_TOKENIZER_CLASSfrozenset withdeepseek-ai/deepseek-coder-anddeepseek-ai/deepseek-llm-prefixes, and an early check infrom_pretrainedbefore model_type routing."multi_modality"toMODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS(handlesdeepseek-vl-7b-*).Affected checkpoints
deepseek-ai/deepseek-coder-1.3b-{base,instruct}deepseek-ai/deepseek-coder-6.7b-{base,instruct}deepseek-ai/deepseek-coder-33b-{base,instruct}deepseek-ai/deepseek-llm-7b-{base,chat}deepseek-ai/deepseek-vl-7b-{base,chat}meta-llama/Llama-2-7b-hfTests
No new test added — this is a routing-table change covered by the existing
AutoTokenizerTestsuite (specificallytest_specialized_hub_tokenizer_class_overrides_mismatched_auto_mappingandtest_mismatched_model_type_uses_config_tokenizer_class_with_sentencepiece, which exercise the same override mechanism this PR extends).Ran locally on this PR's branch (
pytest tests/models/auto/test_tokenization_auto.py -k AutoTokenizerTest, Python 3.12.13, transformers 5.10.0.dev0 editable install):All warnings are pre-existing tokenizer deprecation notices unrelated to this change. Also green on CI: https://circleci.com/gh/huggingface/transformers/2351471
Disclosure
This fix was drafted with AI-agent assistance (diagnosis and patch). I reviewed every changed line, traced why the existing
MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASSmismatch check doesn't fire for this model_type collision, confirmed the full list of affected checkpoints against the Hub configs, and ran the test suite locally (output above) before opening this PR.