StopStringCriteria misses CJK stop strings on byte-level tokenizers when a character splits into byte-fragment tokens #46519

@Incheonkirin

Description

On a byte-level BPE tokenizer (e.g. Qwen2.5), StopStringCriteria doesn't halt when the stop string's bytes are split across byte-fragment tokens, even though the stop string is the suffix of the decoded text. This hits CJK stop strings: a CJK character can fragment into tokens that each decode to U+FFFD in isolation, and StopStringCriteria builds its match table from per-token isolated decodes, so those fragments never match the stop string.
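The effect of isolated decoding can be seen without any tokenizer. The sketch below is plain Python and only mimics a byte-level vocabulary: it treats each UTF-8 byte of 끝 as if it were its own token (an assumption for illustration; the real Qwen2.5 merges differ) and compares decoding each piece alone with decoding them together.

```python
# Illustration only: treat each UTF-8 byte as a separate "token".
# Real byte-level BPE merges differ; this just shows the decode behaviour.
REPLACEMENT = "\ufffd"

def decode_isolated(fragments):
    """Decode each byte fragment on its own, as a per-token table would."""
    return [frag.decode("utf-8", errors="replace") for frag in fragments]

def decode_joined(fragments):
    """Decode the concatenated bytes, as decoding the whole sequence would."""
    return b"".join(fragments).decode("utf-8", errors="replace")

stop = "끝"
raw = stop.encode("utf-8")
fragments = [raw[i:i + 1] for i in range(len(raw))]

isolated = decode_isolated(fragments)
joined = decode_joined(fragments)

print(len(fragments))                           # number of byte fragments
print(all(s == REPLACEMENT for s in isolated))  # each fragment alone is U+FFFD
print(joined == stop)                           # the joined bytes decode cleanly
```

No single fragment carries enough information to be recognised as part of 끝, which is why a table built from per-token strings has nothing to match against.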

The fragmentation is context-dependent (the same character is one token in one context, several fragments in another), so stop_token_ids can't express the stop either; StopStringCriteria is the only mechanism that can.

It is reachable from the public API: model.generate(stop_strings=[...], tokenizer=...) appends StopStringCriteria (generation/utils.py:1320).

Environment

  • transformers from source, main @ e314439
  • torch >= 2.0, CPU only (the repro uses the tokenizer alone, no weights or GPU)
  • Qwen/Qwen2.5-0.5B-Instruct (ByteLevel BPE, byte_fallback=False); only the tokenizer is loaded, and other byte-level BPE tokenizers are expected to behave the same way

Reproduction

The criterion is called directly on the token ids of fixed strings, so the result is deterministic (no weights, no sampling).

import torch
from transformers import AutoTokenizer
from transformers.generation.stopping_criteria import StopStringCriteria

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

cases = [
    ("대화 끝", "끝"),     # '끝' fragments across the last 3 tokens  -> should halt
    ("작업 완료", "완료"),  # '완' fragments across 2 tokens ('료' clean) -> should halt
    ("응답 종료", "종료"),  # not fragmented (control)                 -> should halt
    ("结束", "结束"),      # single token (control)                   -> should halt
    ("끝 대화", "끝"),     # stop not at the end                      -> should not halt
    ("완료 후속", "완료"),  # stop not at the end                      -> should not halt
]

for text, stop in cases:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    crit = StopStringCriteria(tokenizer=tok, stop_strings=[stop])
    halts = bool(crit(torch.tensor([ids]), None)[0])  # scores is unused
    print(f"text={text!r:12} stop={stop!r:6} ids={ids}  halts={halts}")

Output on main:

text='대화 끝'      stop='끝'    ids=[66845, 56290, 5140, 223, 251]  halts=False
text='작업 완료'     stop='완료'   ids=[67511, 124517, 74884, 226, 63256]  halts=False
text='응답 종료'     stop='종료'   ids=[131518, 132760, 98358, 63256]  halts=True
text='结束'         stop='结束'   ids=[80565]  halts=True
text='끝 대화'      stop='끝'    ids=[134539, 60960, 56290]  halts=False
text='완료 후속'     stop='완료'   ids=[130973, 63256, 94315, 126299]  halts=False

끝 is one token ([134539]) in '끝 대화' but three fragments ([..., 5140, 223, 251]) in '대화 끝'.

Expected vs actual

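| text | stop | expected | actual (main) |
| --- | --- | --- | --- |
| 대화 끝 | 끝 | halt | no halt (bug) |
| 작업 완료 | 완료 | halt | no halt (bug) |
| 응답 종료 | 종료 | halt | halt (control) |
| 结束 | 结束 | halt | halt (control) |
| 끝 대화 | 끝 | no halt | no halt (stop not at end) |
| 완료 후속 | 완료 | no halt | no halt (stop not at end) |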

Root cause

Both in src/transformers/generation/stopping_criteria.py (main @ e314439):

  1. The match table decodes each vocab token in isolation. clean_tokenizer_vocab (L276) builds a per-token clean string via convert_tokens_to_string (L290); a byte-fragment token's clean string is U+FFFD, so the char-level overlap in _stop_string_get_matching_positions (L297) / _stop_string_create_embedding_vec (L338) never matches it. This fires even when the fragmented character isn't last: in 작업 완료 / 완료, 완 splits into two tokens that both decode to U+FFFD (료 is clean), so 완료 is never assembled.

  2. The window is sized in characters, not tokens. __call__ (L389) keeps the last maximum_token_len tokens (L394), where maximum_token_len is the longest stop string in characters (L250). A fragmented C-character CJK stop can span more than C tokens, putting the needed tokens outside the window. For 끝, maximum_token_len == 1, so the window is the single last token [251] = U+FFFD.
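The mismatch in point 2 comes down to characters versus bytes. As a rough sketch in plain Python (not the library's code): if every token in a byte-level vocabulary covers at least one byte, a stop string of B UTF-8 bytes can be spread over as many as B tokens, while its character count can be much smaller. The helper name worst_case_token_span is hypothetical.

```python
def worst_case_token_span(stop: str) -> int:
    """Upper bound on tokens overlapping `stop`, assuming each token covers >= 1 byte."""
    return len(stop.encode("utf-8"))

for stop in ["끝", "완료", "结束", "stop"]:
    chars = len(stop)                    # what a character-sized window allows
    bound = worst_case_token_span(stop)  # what full fragmentation could need
    print(f"{stop!r}: {chars} chars, up to {bound} tokens")
```

Under this assumption the two counts coincide for ASCII stop strings, so a character-sized window is enough there, whereas each Hangul or CJK character in the cases above can need up to three tokens.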

Possible fix

I have a patch for the eager path. It precomputes the token ids whose isolated decode is U+FFFD. After the tensor-based check, for samples that did not match but have such a token in the window, it decodes a wider window and substring-matches, keeping the "match overlaps the final token" rule so the no-halt cases stay False. It passes the six cases above and the existing ASCII tests.

It is not complete, though: the tokenizer.decode call and the data-dependent control flow it adds to __call__ break torch.compile / XLA, which is why this class is tensor-only (stopping_criteria.py:139-142). Options: gate the fallback to eager mode, build a fragment-aware tensor-only precompute, or document the limitation. I can open a PR once the direction is agreed.
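The substring fallback can be sketched without the library. This is not the patch itself: it is a simplified stand-in that works on raw token bytes instead of calling tokenizer.decode, and the function name, the hand-written token splits, and the byte-level matching are all assumptions for illustration. It keeps the rule that a match only counts if it overlaps the final token.

```python
from typing import List

def stop_overlaps_final_token(token_bytes: List[bytes], stop: str) -> bool:
    """Return True if `stop` occurs in the joined bytes and reaches into the last token.

    Hypothetical helper. Matching on UTF-8 bytes finds the same occurrences as
    matching on characters, because UTF-8 is self-synchronizing.
    """
    if not token_bytes or not stop:
        return False
    stop_b = stop.encode("utf-8")
    full = b"".join(token_bytes)
    last_start = len(full) - len(token_bytes[-1])  # byte offset where the final token begins
    start = full.find(stop_b)
    while start != -1:
        if start + len(stop_b) > last_start:       # match extends into the final token
            return True
        start = full.find(stop_b, start + 1)
    return False

# Hand-written splits imitating the reported fragmentation (assumed, not real merges).
end_raw = "끝".encode("utf-8")
fragmented = ["대화".encode("utf-8"), b" " + end_raw[:1], end_raw[1:2], end_raw[2:]]
not_at_end = [end_raw, " 대".encode("utf-8"), "화".encode("utf-8")]

print(stop_overlaps_final_token(fragmented, "끝"))  # stop ends in the final fragment
print(stop_overlaps_final_token(not_at_end, "끝"))  # stop ends before the final token
```

Like the patch, this relies on Python-level control flow over the sequence contents, so it shares the torch.compile / XLA limitation described above.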

Related

#40520 adds StopStringTextMatchCriteria (decodes text, intended as the new default, not compile-safe), whose docstring says the two classes have "equivalent functionality". That is not true for byte-fragmented CJK, and #40520 would not fix the compile-compatible StopStringCriteria this issue is about.
