Fix byte-level BPE tokenizers reusing SentencePiece classes (Llama 3 … #46686
Closed
kaixuanliu wants to merge 1 commit into
…/ DeepSeek) Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46686&sha=1364eb
Closing this PR, since this is best solved on the model hub, as discussed in https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/discussions/36
This PR fixes a bug where byte-level BPE models that declare tokenizer_class="LlamaTokenizerFast" (e.g. Llama 3, DeepSeek-R1-Distill-Llama) produce garbled output.
Root cause: for classes with a custom __init__, convert_to_native_format forwarded post_processor/padding/truncation from tokenizer.json but not the pre_tokenizer/decoder, so LlamaTokenizer's hardcoded SentencePiece pipeline overrode the model's byte-level config.
Fix: also forward pre_tokenizer and decoder from tokenizer.json and apply them in __init__, so the backend matches the serialized tokenizer.
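The root cause and fix described above can be sketched in a few lines of plain Python. This is only a toy model of the behavior, not the transformers implementation: build_backend_config, HARDCODED_DEFAULTS, and the FORWARDED_* tuples are hypothetical names, and the "Metaspace" defaults are a stand-in for whatever SentencePiece-style pipeline the class hardcodes. Only the top-level tokenizer.json keys (pre_tokenizer, decoder, post_processor, padding, truncation) are taken from the description.

```python
import json

# Components the description says were already forwarded from tokenizer.json,
# and the two additional ones the fix forwards.
FORWARDED_BEFORE = ("post_processor", "padding", "truncation")
FORWARDED_AFTER = FORWARDED_BEFORE + ("pre_tokenizer", "decoder")

# Hypothetical stand-in for SentencePiece-style defaults hardcoded in __init__.
HARDCODED_DEFAULTS = {
    "pre_tokenizer": {"type": "Metaspace"},
    "decoder": {"type": "Metaspace"},
}


def build_backend_config(tokenizer_json: str, forwarded: tuple) -> dict:
    """Overlay the forwarded serialized components on the hardcoded defaults."""
    serialized = json.loads(tokenizer_json)
    config = dict(HARDCODED_DEFAULTS)
    for key in forwarded:
        # Only components that are actually serialized override a default.
        if serialized.get(key) is not None:
            config[key] = serialized[key]
    return config


# Toy tokenizer.json for a byte-level BPE model (assumed contents).
toy_json = json.dumps({
    "pre_tokenizer": {"type": "ByteLevel"},
    "decoder": {"type": "ByteLevel"},
    "post_processor": {"type": "TemplateProcessing"},
    "padding": None,
    "truncation": None,
})

before = build_backend_config(toy_json, FORWARDED_BEFORE)
after = build_backend_config(toy_json, FORWARDED_AFTER)

print(before["pre_tokenizer"]["type"])  # hardcoded default wins
print(after["pre_tokenizer"]["type"])   # serialized byte-level config wins
```

In this sketch, the "before" config keeps the hardcoded pre_tokenizer and decoder even though tokenizer.json serializes byte-level ones, which mirrors the mismatch described in the root cause. The "after" config matches the serialized tokenizer.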
pls help review, thx! @ArthurZucker and @itazap