Fix byte-level BPE tokenizers reusing SentencePiece classes (Llama 3 …#46686

Closed
kaixuanliu wants to merge 1 commit into huggingface:main from kaixuanliu:tokenizer-fix

@kaixuanliu

Contributor

This PR fixes a bug where byte-level BPE models that declare tokenizer_class="LlamaTokenizerFast"
(e.g. Llama 3, DeepSeek-R1-Distill-Llama) produce garbled output. To reproduce:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to(device)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Root cause: for tokenizer classes with a custom __init__, convert_to_native_format forwarded post_processor, padding, and truncation from tokenizer.json but not pre_tokenizer or decoder, so LlamaTokenizer's hardcoded SentencePiece pipeline overrode the model's byte-level configuration.
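
To illustrate why this mismatch garbles text, here is a minimal standard-library sketch. It assumes the GPT-2 style byte-to-character table that byte-level BPE vocabularies commonly use, and it reduces the SentencePiece decoder to a single "▁" to space replacement. The helper names are hypothetical and this is not the transformers or tokenizers implementation.

```python
def bytes_to_unicode() -> dict:
    """GPT-2 style table: map each of the 256 byte values to a printable character."""
    printable = (
        list(range(0x21, 0x7F))    # '!' .. '~'
        + list(range(0xA1, 0xAD))  # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100)) # U+00AE .. U+00FF
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes (including space) are shifted past U+00FF.
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level_symbols(text: str) -> str:
    """Spell a string the way a byte-level BPE vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def byte_level_decode(tokens: list) -> str:
    """Matching decoder: map each symbol back to its byte, then decode UTF-8."""
    raw = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


def sentencepiece_style_decode(tokens: list) -> str:
    """Simplified SentencePiece-style decoder: only turns the U+2581 marker into a space."""
    return "".join(tokens).replace("\u2581", " ")


# Hand-picked token split for illustration, not real model output.
tokens = [to_byte_level_symbols(piece) for piece in ["Who", " are", " you", "?"]]

print(tokens)
print(byte_level_decode(tokens))
print(sentencepiece_style_decode(tokens))
```

With the matching decoder the text round-trips, while the SentencePiece-style decoder leaves the byte-level space marker (U+0120) in the output, which is the kind of garbling described above.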

Fix: also forward pre_tokenizer and decoder from tokenizer.json and apply them in __init__, so the backend tokenizer matches the serialized one.
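
As a rough sketch of the idea, the snippet below picks the extra sections out of a parsed tokenizer.json and reports which component types they contain. The SAMPLE document is hand-written and heavily trimmed (not the real file for any model), and pick_components and component_types are hypothetical helpers, not the code in convert_to_native_format.

```python
import json

# Sections the PR description says were already forwarded from tokenizer.json.
ALREADY_FORWARDED = ("post_processor", "padding", "truncation")
# Sections this PR proposes to forward as well.
PROPOSED_EXTRA = ("pre_tokenizer", "decoder")


def pick_components(tokenizer_json: dict, keys) -> dict:
    """Return the non-null top-level sections of a tokenizer.json dict named in keys."""
    return {k: tokenizer_json[k] for k in keys if tokenizer_json.get(k) is not None}


def component_types(component) -> list:
    """Flatten a (possibly nested Sequence) component into a list of its type names."""
    if not isinstance(component, dict):
        return []
    children = component.get("pretokenizers") or component.get("decoders")
    if component.get("type") == "Sequence" and children:
        return [t for child in children for t in component_types(child)]
    return [component.get("type")]


# Hand-written, trimmed stand-in for a byte-level tokenizer.json (assumption).
SAMPLE = json.loads("""
{
  "truncation": null,
  "padding": null,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "Split"},
      {"type": "ByteLevel", "add_prefix_space": false}
    ]
  },
  "post_processor": {"type": "TemplateProcessing"},
  "decoder": {"type": "ByteLevel"}
}
""")

before = pick_components(SAMPLE, ALREADY_FORWARDED)
after = pick_components(SAMPLE, ALREADY_FORWARDED + PROPOSED_EXTRA)

print(sorted(before))
print(sorted(after))
print(component_types(after["pre_tokenizer"]), component_types(after["decoder"]))
```

In this sample, only the proposed selection carries the ByteLevel pre-tokenizer and decoder through to the tokenizer class.
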
Please help review, thanks! @ArthurZucker and @itazap

…/ DeepSeek)

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@github-actions

Contributor

CI Dashboard: View test results in Grafana

@github-actions

Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46686&sha=1364eb

@kaixuanliu marked this pull request as draft June 16, 2026 06:12
@kaixuanliu

Contributor Author

Closing this PR: this is better fixed on the model hub side, as discussed in https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/discussions/36

@kaixuanliu closed this Jun 16, 2026