DeepSeek-Coder v1 tokenizer produces incorrect output on transformers v5+ (gap in PR #44801's fix) #46489

@SuryanshSS1011

System Info

  • transformers version: 5.10.2
  • tokenizers version: 0.22.2
  • Python version: 3.11.2
  • Platform: Linux (Penn State HPC, RHEL 8.10)
  • PyTorch version: not relevant (tokenizer-only repro)

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
 
tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct")
print(f"tokenizer class: {type(tok).__name__}")
 
for s in ["How are you doing?",
          "def fib(n):\n    if n < 2:\n        return n",
          "Hello, world! 1234",
          "   leading spaces"]:
    ids = tok.encode(s, add_special_tokens=False)
    decoded = tok.decode(ids)
    print(f"  match={decoded == s!s:<5}  input={s!r}  decoded={decoded!r}")

Output on transformers 5.10.2 + tokenizers 0.22.2:

tokenizer class: LlamaTokenizer
    match=False  input='How are you doing?'         decoded='Howareyoudoing?'
    match=False  input='def fib(n):\n    if n < ... decoded='deffib(n):ifn<2:returnn'
    match=False  input='Hello, world! 1234'         decoded='Hello,world!1234'
    match=False  input='   leading spaces'          decoded='leadingspaces'

All whitespace (spaces, newlines, tabs) is stripped on the encode → decode round-trip.

Root cause

PR #44801 (closes #44779) addressed this for DeepSeek-R1 / V2 / V3 / V4 / VL / OCR by adding them to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS in src/transformers/models/auto/tokenization_auto.py. That set is keyed on model_type.

DeepSeek-Coder v1 (1.3B / 6.7B / 33B Instruct) declares model_type: llama in its config.json and tokenizer_class: LlamaTokenizerFast in tokenizer_config.json, so it falls through to LlamaTokenizer. But its tokenizer.json has the same ByteLevel-BPE pipeline as the other DeepSeek variants:

| Model | normalizer | pre_tokenizer subtypes | decoder | model |
| --- | --- | --- | --- | --- |
| deepseek-coder-1.3b-instruct | Sequence (empty) | Split×4, Digits, ByteLevel | ByteLevel | BPE |
| DeepSeek-Coder-V2-Lite-Instruct | Sequence (empty) | Split×5, Digits, ByteLevel | ByteLevel | BPE |
| DeepSeek-R1 | Sequence (empty) | Split×3, ByteLevel | ByteLevel | BPE |

LlamaTokenizer is SentencePiece-based and cannot consume a ByteLevel-BPE tokenizer.json correctly, which is why the whitespace is stripped.
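For anyone who wants to reproduce the comparison above on another checkpoint, here is a minimal sketch of how the component types could be read out of a parsed tokenizer.json. The helper name `summarize_pipeline` and the `example` dict are illustrative assumptions (the dict is a hand-written stand-in, not the actual file from the Hub); the key names follow the tokenizers serialization format as I understand it.

```python
import json


def summarize_pipeline(spec: dict) -> dict:
    """Summarize the component types declared in a parsed tokenizer.json.

    `spec` is the dict obtained from json.load() on a tokenizer.json file.
    Hypothetical helper for illustration; not part of transformers.
    """

    def type_of(section):
        # Components are serialized as {"type": "...", ...} or null.
        return section.get("type") if isinstance(section, dict) else None

    def members(section, key):
        # Sequence components list their members under a type-specific key.
        if isinstance(section, dict) and section.get("type") == "Sequence":
            return [m.get("type") for m in section.get(key, [])]
        return []

    normalizer = spec.get("normalizer")
    pre_tokenizer = spec.get("pre_tokenizer")
    return {
        "normalizer": type_of(normalizer),
        "normalizer_members": members(normalizer, "normalizers"),
        "pre_tokenizer": type_of(pre_tokenizer),
        "pre_tokenizer_members": members(pre_tokenizer, "pretokenizers"),
        "decoder": type_of(spec.get("decoder")),
        "model": type_of(spec.get("model")),
    }


# Hand-written stand-in shaped like the DeepSeek-R1 row of the table.
example = json.loads("""
{
  "normalizer": {"type": "Sequence", "normalizers": []},
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "Split"}, {"type": "Split"}, {"type": "Split"},
      {"type": "ByteLevel"}
    ]
  },
  "decoder": {"type": "ByteLevel"},
  "model": {"type": "BPE"}
}
""")

summary = summarize_pipeline(example)
print(summary)
```

To run it against a real checkpoint, load the downloaded tokenizer.json with `json.load` and pass the resulting dict in place of `example`.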

Why the existing fix doesn't trivially extend

Adding llama to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS would force all llama-typed models onto TokenizersBackend, including genuine Llama models that should use LlamaTokenizer. So the fix needs a different shape.

Two possible directions:

  1. Per-checkpoint override: explicit list of known-broken (model_id, expected_backend) tuples. This is narrow and surgical, but the list needs maintenance as new checkpoints land.
  2. Pipeline-shape detection: when loading, if the resolved tokenizer class is SentencePiece-based but tokenizer.json has model.type=BPE + decoder.type=ByteLevel, force TokenizersBackend. More robust, slightly more code.

Happy to implement either once maintainers confirm which is preferred.
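To make the two directions concrete, here is a rough sketch of what each check might look like in isolation. Everything named here is hypothetical and for discussion only: `CHECKPOINT_OVERRIDES`, `SENTENCEPIECE_STYLE_CLASSES`, `backend_from_override`, and `backend_from_shape` do not exist in transformers, and the real change would need to hook into the class resolution in tokenization_auto.py.

```python
from typing import Optional

# Direction 1 (hypothetical): explicit per-checkpoint override table.
CHECKPOINT_OVERRIDES = {
    "deepseek-ai/deepseek-coder-1.3b-instruct": "TokenizersBackend",
}

# Direction 2 (hypothetical): classes treated as SentencePiece-style.
SENTENCEPIECE_STYLE_CLASSES = {"LlamaTokenizer", "LlamaTokenizerFast"}


def backend_from_override(model_id: str, resolved_class: str) -> str:
    """Return the override for a known-broken checkpoint, else the resolved class."""
    return CHECKPOINT_OVERRIDES.get(model_id, resolved_class)


def looks_like_bytelevel_bpe(spec: dict) -> bool:
    """True if a parsed tokenizer.json declares model.type=BPE and decoder.type=ByteLevel."""
    model = spec.get("model") or {}
    decoder = spec.get("decoder") or {}
    return model.get("type") == "BPE" and decoder.get("type") == "ByteLevel"


def backend_from_shape(resolved_class: str, spec: Optional[dict]) -> str:
    """Force TokenizersBackend when a SentencePiece-style class meets a ByteLevel-BPE file."""
    if (
        spec is not None
        and resolved_class in SENTENCEPIECE_STYLE_CLASSES
        and looks_like_bytelevel_bpe(spec)
    ):
        return "TokenizersBackend"
    return resolved_class


# Hand-written stand-ins, not real files from the Hub.
bytelevel_spec = {"model": {"type": "BPE"}, "decoder": {"type": "ByteLevel"}}
sentencepiece_style_spec = {"model": {"type": "BPE"}, "decoder": {"type": "Sequence"}}

print(backend_from_shape("LlamaTokenizer", bytelevel_spec))
print(backend_from_shape("LlamaTokenizer", sentencepiece_style_spec))
```

The shape check leaves the resolved class alone whenever tokenizer.json is missing or its decoder is not ByteLevel, so llama-typed checkpoints with a SentencePiece-style pipeline would keep LlamaTokenizer under this sketch.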

Expected behavior

tok.decode(tok.encode(s)) should round-trip whitespace, as it does on transformers v4 and on the other DeepSeek variants in v5+ (DeepSeek-R1, DeepSeek-Coder-V2-Lite-Instruct — both load as TokenizersBackend and round-trip cleanly).

This is a tokenization-correctness bug, not a UX one — downstream tasks that depend on exact-token round-trip (code-generation evaluation, edit-distance against ground-truth, RL reward signals on tokenized output) silently produce wrong values.
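A regression check for this could be as small as the sketch below. The helper `roundtrip_failures` and the two stand-in codecs are assumptions made for illustration: the stand-ins only mimic a lossless byte-level codec and a whitespace-dropping decoder so the example runs without downloading anything, and they are not how either tokenizer class is implemented.

```python
from typing import Callable, Iterable, List, Tuple


def roundtrip_failures(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    samples: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (input, decoded) pairs for which decode(encode(s)) != s."""
    failures = []
    for s in samples:
        decoded = decode(encode(s))
        if decoded != s:
            failures.append((s, decoded))
    return failures


SAMPLES = [
    "How are you doing?",
    "def fib(n):\n    if n < 2:\n        return n",
    "Hello, world! 1234",
    "   leading spaces",
]


# Stand-in codecs (illustrative only): UTF-8 bytes as "token ids".
def byte_encode(s: str) -> List[int]:
    return list(s.encode("utf-8"))


def lossless_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


def whitespace_dropping_decode(ids: List[int]) -> str:
    # Mimics the symptom reported above: all whitespace disappears.
    return "".join(bytes(ids).decode("utf-8").split())


good = roundtrip_failures(byte_encode, lossless_decode, SAMPLES)
bad = roundtrip_failures(byte_encode, whitespace_dropping_decode, SAMPLES)
print(len(good), len(bad))
```

With a loaded tokenizer `tok`, the same helper could be called as `roundtrip_failures(lambda s: tok.encode(s, add_special_tokens=False), tok.decode, SAMPLES)`; an empty list would mean the round-trip is clean for these inputs.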
