On a byte-level BPE tokenizer (e.g. Qwen2.5), StopStringCriteria doesn't halt when the stop string's bytes are split across byte-fragment tokens, even though the stop string is the suffix of the decoded text. This hits CJK stop strings: a CJK character can fragment into tokens that each decode to U+FFFD in isolation, and StopStringCriteria builds its match table from per-token isolated decodes, so those fragments never match the stop string.
The fragmentation is context-dependent (the same character is one token in one context, several fragments in another), so stop_token_ids can't express the stop either; StopStringCriteria is the only mechanism that can.
It is reachable from the public API: model.generate(stop_strings=[...], tokenizer=...) appends StopStringCriteria (generation/utils.py:1320).
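To make the "decodes to U+FFFD in isolation" claim concrete without loading a tokenizer, here is a minimal sketch in plain Python. It assumes the worst case, where every UTF-8 byte of the character lands in its own token; a real byte-level BPE vocabulary may merge some of these bytes, and the variable names are illustrative only.

```python
# Illustrative only: model byte-fragment tokens as single raw UTF-8 bytes.
stop = "끝"
raw = stop.encode("utf-8")  # a Hangul syllable takes 3 bytes in UTF-8

# Assumed worst case: one token per byte.
fragments = [raw[i:i + 1] for i in range(len(raw))]

# Decoding each fragment on its own yields the replacement character U+FFFD ...
isolated = [frag.decode("utf-8", errors="replace") for frag in fragments]
# ... while decoding the concatenated bytes recovers the original character.
joined = b"".join(fragments).decode("utf-8", errors="replace")

print(isolated)                                   # per-fragment decodes
print(joined == stop)                             # joined bytes round-trip
print(any(stop in piece for piece in isolated))   # no fragment contains the stop
```

A match table built from the per-fragment strings has nothing to compare against the stop string, which is the situation the issue describes.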
Environment
- transformers from source, main @ e314439
- torch >= 2.0, CPU only (the repro uses the tokenizer alone, no weights or GPU)
- Qwen/Qwen2.5-0.5B-Instruct (ByteLevel BPE, byte_fallback=False); no weights are loaded, so any byte-level tokenizer behaves the same
Reproduction
The criterion is called directly on the token ids of fixed strings, so the result is deterministic (no weights, no sampling).
import torch
from transformers import AutoTokenizer
from transformers.generation.stopping_criteria import StopStringCriteria

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# (text, stop string) pairs; the comment gives the expected behaviour
cases = [
    ("대화 끝", "끝"),      # '끝' fragments across the last 3 tokens -> should halt
    ("작업 완료", "완료"),  # '완' fragments across 2 tokens ('료' clean) -> should halt
    ("응답 종료", "종료"),  # not fragmented (control) -> should halt
    ("结束", "结束"),       # single token (control) -> should halt
    ("끝 대화", "끝"),      # stop not at the end -> should not halt
    ("완료 후속", "완료"),  # stop not at the end -> should not halt
]

for text, stop in cases:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    crit = StopStringCriteria(tokenizer=tok, stop_strings=[stop])
    halts = bool(crit(torch.tensor([ids]), None)[0])  # scores is unused
    print(f"text={text!r:12} stop={stop!r:6} ids={ids} halts={halts}")
Output on main:
text='대화 끝' stop='끝' ids=[66845, 56290, 5140, 223, 251] halts=False
text='작업 완료' stop='완료' ids=[67511, 124517, 74884, 226, 63256] halts=False
text='응답 종료' stop='종료' ids=[131518, 132760, 98358, 63256] halts=True
text='结束' stop='结束' ids=[80565] halts=True
text='끝 대화' stop='끝' ids=[134539, 60960, 56290] halts=False
text='완료 후속' stop='완료' ids=[130973, 63256, 94315, 126299] halts=False
끝 is one token ([134539]) in '끝 대화' but three fragments ([..., 5140, 223, 251]) in '대화 끝'.
Expected vs actual
| text | stop | expected | actual (main) |
|---|---|---|---|
| 대화 끝 | 끝 | halt | no halt (bug) |
| 작업 완료 | 완료 | halt | no halt (bug) |
| 응답 종료 | 종료 | halt | halt (control) |
| 结束 | 结束 | halt | halt (control) |
| 끝 대화 | 끝 | no halt | no halt (stop not at end) |
| 완료 후속 | 완료 | no halt | no halt |
Root cause
Both in src/transformers/generation/stopping_criteria.py (main @ e314439):
- The match table decodes each vocab token in isolation. clean_tokenizer_vocab (L276) builds a per-token clean string via convert_tokens_to_string (L290); a byte-fragment token's clean string is U+FFFD, so the char-level overlap in _stop_string_get_matching_positions (L297) / _stop_string_create_embedding_vec (L338) never matches it. This fires even when the fragmented character isn't last: in 작업 완료 / 완료, 완 splits into two tokens that both decode to � (료 is clean), so 완료 is never assembled.
- The window is sized in characters, not tokens. __call__ (L389) keeps the last maximum_token_len tokens (L394), where maximum_token_len is the longest stop string in characters (L250). A fragmented C-character CJK stop can span more than C tokens, putting the needed tokens outside the window. For 끝, maximum_token_len == 1, so the window is the single last token [251] = U+FFFD.
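The gap between a character-sized and a token-sized window can be shown with a few lines of plain Python. This is a sketch, not library code: max_overlapping_tokens is a hypothetical helper, and it relies on the assumption that every token of a byte-level vocabulary carries at least one byte, so no more tokens than bytes can overlap the stop string.

```python
def max_overlapping_tokens(stop: str) -> int:
    """Hypothetical bound: tokens that can overlap `stop`, assuming >= 1 byte per token."""
    return len(stop.encode("utf-8"))


for stop in ["끝", "완료", "stop"]:
    chars = len(stop)                      # character count, the window size described above
    bound = max_overlapping_tokens(stop)   # worst-case number of tokens to inspect
    print(f"{stop!r}: chars={chars} worst_case_tokens={bound}")
```

For ASCII stop strings the two numbers coincide, which would explain why ASCII-only tests do not exercise the problem; for Hangul the byte-based bound is three times the character count.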
Possible fix
I have a patch for the eager path: precompute the token ids whose isolated decode is U+FFFD, and after the tensor result, for samples that missed but have such a token in the window, decode a wider window and substring-match (keeping the "match overlaps the final token" rule, so the no-halt cases stay False). It passes the six cases above and the existing ASCII tests.
It is not complete, though: the tokenizer.decode call and the data-dependent control flow it adds to __call__ break torch.compile / XLA, and compile compatibility is the reason this class is tensor-only (stopping_criteria.py:139-142). Options: gate the fallback to eager mode, add a fragment-aware tensor-only precompute, or document the limitation. I can open a PR once the direction is agreed.
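The shape of the eager fallback can be sketched as follows. This is not the patch itself: halts_on_window is a hypothetical name, tokens are modelled as raw byte strings rather than vocabulary ids, and the match is done on bytes instead of decoded text to keep the example self-contained. It keeps the "match overlaps the final token" rule and, like the patch, is eager-only.

```python
from typing import List


def halts_on_window(token_bytes: List[bytes], stop: str) -> bool:
    """Hypothetical eager fallback: byte-level substring match over a token window.

    Returns True only if `stop` occurs in the joined bytes AND the occurrence
    extends into the final token (the "match overlaps the final token" rule).
    """
    if not token_bytes:
        return False
    stop_b = stop.encode("utf-8")
    full = b"".join(token_bytes)
    last_start = len(full) - len(token_bytes[-1])  # byte offset of the final token
    pos = full.find(stop_b)
    while pos != -1:
        if pos + len(stop_b) > last_start:
            return True
        pos = full.find(stop_b, pos + 1)
    return False


# Toy windows; the byte splits are assumed, not taken from a real vocabulary.
fragmented = ["대화".encode("utf-8"), b" \xeb", b"\x81", b"\x9d"]                 # "대화 끝"
not_at_end = ["끝".encode("utf-8"), " 대".encode("utf-8"), "화".encode("utf-8")]  # "끝 대화"

print(halts_on_window(fragmented, "끝"))
print(halts_on_window(not_at_end, "끝"))
```

Matching on bytes sidesteps the U+FFFD problem because UTF-8 is self-synchronizing: a valid stop string's bytes can only occur at character boundaries, so a byte-level hit corresponds to a hit in the decoded text.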
Related
#40520 adds StopStringTextMatchCriteria (decodes text, intended as the new default, not compile-safe), whose docstring says the two classes have "equivalent functionality". That is not true for byte-fragmented CJK, and #40520 would not fix the compile-compatible StopStringCriteria this issue is about.