Fix byte-level BPE tokenizers reusing SentencePiece classes (Llama 3 … by kaixuanliu · Pull Request #46686 · huggingface/transformers

kaixuanliu · 2026-06-16T05:33:37Z

This PR tries to fix a bug where byte-level BPE models that declare tokenizer_class="LlamaTokenizerFast"
(e.g. Llama 3, DeepSeek-R1-Distill-Llama) produce garbled output:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to(device)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
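A cheaper way to check for the symptom, without loading the model, is an encode/decode round trip on the tokenizer alone. The sketch below is an assumption-laden illustration, not part of this PR: `roundtrip_ok` is a hypothetical helper name, and it only relies on the `encode` and `decode` methods that transformers tokenizers expose.

```python
def roundtrip_ok(tokenizer, text):
    """Encode `text`, decode it back, and report whether it survived unchanged.

    `tokenizer` is assumed to be any object with transformers-style
    `encode` / `decode` methods (e.g. the result of AutoTokenizer.from_pretrained).
    Returns a (matches, decoded_text) tuple.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return decoded == text, decoded
```

With the tokenizer from the script above, `roundtrip_ok(tokenizer, "Who are you?")` would be expected to return a mismatch when the wrong pre_tokenizer/decoder pair is in effect. Some tokenizers normalize whitespace on decode, so a mismatch is a hint to inspect the decoded string rather than proof of this specific bug.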

Root Cause: for classes with a custom __init__, convert_to_native_format forwarded post_processor/padding/truncation from tokenizer.json but not pre_tokenizer/decoder, so LlamaTokenizer's hardcoded SentencePiece pipeline overrode the model's byte-level config.

Fix: also forward pre_tokenizer and decoder from tokenizer.json and apply them in __init__, so the backend matches the serialized tokenizer.
Please help review, thanks! @ArthurZucker and @itazap
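To see which pipeline a given tokenizer.json actually serializes, one can look at the `type` fields under its pre_tokenizer and decoder entries. The following is a minimal sketch under stated assumptions: `component_types` and `is_byte_level` are hypothetical helper names, and the two dictionaries are hand-written, simplified fragments for illustration, not copies of any model's real tokenizer.json.

```python
def component_types(node):
    """Collect every "type" string found in a (possibly nested) component."""
    found = []
    if isinstance(node, dict):
        if isinstance(node.get("type"), str):
            found.append(node["type"])
        for value in node.values():
            found.extend(component_types(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(component_types(item))
    return found


def is_byte_level(tokenizer_json):
    """True if both the pre_tokenizer and the decoder contain a ByteLevel step."""
    pre = component_types(tokenizer_json.get("pre_tokenizer"))
    dec = component_types(tokenizer_json.get("decoder"))
    return "ByteLevel" in pre and "ByteLevel" in dec


# Simplified, hand-written fragments (assumed shapes, for illustration only).
byte_level_cfg = {
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [{"type": "Split"}, {"type": "ByteLevel"}],
    },
    "decoder": {"type": "ByteLevel"},
}
sentencepiece_style_cfg = {
    "pre_tokenizer": None,
    "decoder": {
        "type": "Sequence",
        "decoders": [{"type": "Replace"}, {"type": "ByteFallback"}, {"type": "Fuse"}],
    },
}

print(is_byte_level(byte_level_cfg))           # True
print(is_byte_level(sentencepiece_style_cfg))  # False
```

In practice the dictionary would come from `json.load` on the model's tokenizer.json. If that file reports ByteLevel components while the loaded tokenizer behaves like a SentencePiece one, that matches the mismatch described in the root cause above.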

…/ DeepSeek) Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

github-actions · 2026-06-16T05:44:13Z

CI Dashboard: View test results in Grafana

github-actions · 2026-06-16T05:51:43Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46686&sha=1364eb

kaixuanliu · 2026-06-16T06:56:01Z

Closing this PR, since this is better fixed on the model hub, as discussed in https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/discussions/36

Fix byte-level BPE tokenizers reusing SentencePiece classes (Llama 3 …

1364ebd

…/ DeepSeek) Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

kaixuanliu marked this pull request as draft June 16, 2026 06:12

kaixuanliu closed this Jun 16, 2026

kaixuanliu wants to merge 1 commit into huggingface:main from kaixuanliu:tokenizer-fix
