StopStringCriteria misses CJK stop strings on byte-level tokenizers when a character splits into byte-fragment tokens #46519

On a byte-level BPE tokenizer (e.g. Qwen2.5), `StopStringCriteria` doesn't halt when the stop string's bytes are split across byte-fragment tokens, even though the stop string is a suffix of the decoded text. This hits CJK stop strings: a CJK character can fragment into tokens that each decode to U+FFFD in isolation, and `StopStringCriteria` builds its match table from per-token isolated decodes, so those fragments never match the stop string.
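
The U+FFFD behaviour can be seen without loading any tokenizer. The sketch below assumes each fragment token carries exactly one UTF-8 byte of `끝` (an illustrative assumption; real fragment tokens may also carry other bytes, such as a leading space) and decodes the fragments separately and together:

```python
# Assumption: each fragment token carries one UTF-8 byte of the character.
char = "끝"
raw = char.encode("utf-8")  # three bytes: eb 81 9d
fragments = [raw[i:i + 1] for i in range(len(raw))]

# Decoding a fragment on its own cannot produce a valid character.
isolated = [f.decode("utf-8", errors="replace") for f in fragments]

# Decoding the fragments together recovers the original character.
joined = b"".join(fragments).decode("utf-8", errors="replace")

print([hex(b) for b in raw])
print(isolated)
print(joined)
```

Every isolated decode is the replacement character, so a match table built from per-token strings has nothing that resembles `끝`.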

The fragmentation is context-dependent (the same character is one token in one context, several fragments in another), so `stop_token_ids` can't express the stop either; `StopStringCriteria` is the only mechanism that can.

It is reachable from the public API: `model.generate(stop_strings=[...], tokenizer=...)` appends `StopStringCriteria` (`generation/utils.py:1320`).

### Environment

- transformers from source, main @ `e314439`
- torch `>= 2.0`, CPU only (the repro uses the tokenizer alone, no weights or GPU)
- `Qwen/Qwen2.5-0.5B-Instruct` tokenizer (ByteLevel BPE, `byte_fallback=False`); the bug lives in the tokenizer-facing code path, so other byte-level BPE tokenizers are expected to behave the same way

### Reproduction

The criterion is called directly on the token ids of fixed strings, so the result is deterministic (no weights, no sampling).

```python
import torch
from transformers import AutoTokenizer
from transformers.generation.stopping_criteria import StopStringCriteria

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

cases = [
    ("대화 끝", "끝"),     # '끝' fragments across the last 3 tokens  -> should halt
    ("작업 완료", "완료"),  # '완' fragments across 2 tokens ('료' clean) -> should halt
    ("응답 종료", "종료"),  # not fragmented (control)                 -> should halt
    ("结束", "结束"),      # single token (control)                   -> should halt
    ("끝 대화", "끝"),     # stop not at the end                      -> should not halt
    ("완료 후속", "완료"),  # stop not at the end                      -> should not halt
]

for text, stop in cases:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    crit = StopStringCriteria(tokenizer=tok, stop_strings=[stop])
    halts = bool(crit(torch.tensor([ids]), None)[0])  # scores is unused
    print(f"text={text!r:12} stop={stop!r:6} ids={ids}  halts={halts}")
```

Output on main:

```
text='대화 끝'      stop='끝'    ids=[66845, 56290, 5140, 223, 251]  halts=False
text='작업 완료'     stop='완료'   ids=[67511, 124517, 74884, 226, 63256]  halts=False
text='응답 종료'     stop='종료'   ids=[131518, 132760, 98358, 63256]  halts=True
text='结束'         stop='结束'   ids=[80565]  halts=True
text='끝 대화'      stop='끝'    ids=[134539, 60960, 56290]  halts=False
text='완료 후속'     stop='완료'   ids=[130973, 63256, 94315, 126299]  halts=False
```

`끝` is one token (`[134539]`) in `'끝 대화'` but three fragments (`[..., 5140, 223, 251]`) in `'대화 끝'`.

### Expected vs actual

| text | stop | expected | actual (main) |
|------|------|----------|---------------|
| `대화 끝`  | `끝`  | halt     | **no halt** (bug) |
| `작업 완료` | `완료` | halt     | **no halt** (bug) |
| `응답 종료` | `종료` | halt     | halt (control) |
| `结束`     | `结束` | halt     | halt (control) |
| `끝 대화`  | `끝`  | no halt  | no halt (stop not at end) |
| `완료 후속` | `완료` | no halt  | no halt (stop not at end) |

### Root cause

There are two causes, both in `src/transformers/generation/stopping_criteria.py` (main @ `e314439`):

1. The match table decodes each vocab token in isolation. `clean_tokenizer_vocab` (L276) builds a per-token clean string via `convert_tokens_to_string` (L290); a byte-fragment token's clean string is U+FFFD, so the char-level overlap in `_stop_string_get_matching_positions` (L297) / `_stop_string_create_embedding_vec` (L338) never matches it. This fires even when the fragmented character isn't last: in `작업 완료` / `완료`, `완` splits into two tokens that both decode to `�` (`료` is clean), so `완료` is never assembled.

2. The window is sized in characters, not tokens. `__call__` (L389) keeps the last `maximum_token_len` tokens (L394), where `maximum_token_len` is the longest stop string in characters (L250). A fragmented C-character CJK stop can span more than C tokens, putting the needed tokens outside the window. For `끝`, `maximum_token_len == 1`, so the window is the single last token `[251]` = U+FFFD.
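
The window problem in item 2 can be modelled in a few lines. This is a hypothetical model of the slicing described above, not the library code; tokens are represented by the raw bytes they are assumed to carry, with the split chosen to mirror the five ids printed for `'대화 끝'`:

```python
# Hypothetical model of the windowing described above; not the library code.
# Assumption: each token is represented by the raw bytes it carries.
tokens = ["대".encode(), "화".encode(), b" \xeb", b"\x81", b"\x9d"]  # '대화 끝'
stop_strings = ["끝"]

# The window length is the longest stop string measured in characters.
maximum_token_len = max(len(s) for s in stop_strings)
window = tokens[-maximum_token_len:]

window_text = b"".join(window).decode("utf-8", errors="replace")
full_text = b"".join(tokens).decode("utf-8", errors="replace")

print(maximum_token_len)         # 1
print(window_text == "\ufffd")   # True: the window holds one stray byte
print(full_text.endswith("끝"))  # True: the full text does end with the stop
```

The stop string needs the last three tokens, but a one-character stop string yields a one-token window.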

### Possible fix

I have a patch for the eager path: precompute the token ids whose isolated decode is U+FFFD, and after the tensor result, for samples that missed but have such a token in the window, decode a wider window and substring-match (keeping the "match overlaps the final token" rule, so the no-halt cases stay False). It passes the six cases above and the existing ASCII tests.

It is not complete, though: the `tokenizer.decode` and data-dependent control flow it adds to `__call__` break `torch.compile` / XLA, which is why this class is tensor-only (`stopping_criteria.py:139-142`). Options: gate the fallback to eager mode, a fragment-aware tensor-only precompute, or document the limitation. I can open a PR once the direction is agreed.
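
To make the fallback idea concrete, here is a minimal tokenizer-free sketch. It is not the patch itself: `halts_with_fallback` is a hypothetical name, tokens are modelled as raw bytes, and it matches on bytes rather than decoded text, which is a simplification for illustration. The "match overlaps the final token" rule is expressed as "the match must end inside the last token's bytes":

```python
def halts_with_fallback(tokens: list[bytes], stop: str, window: int) -> bool:
    """Hypothetical eager fallback: look for `stop` in a widened byte window."""
    if not tokens or window < 1:
        return False
    tail = tokens[-window:]
    joined = b"".join(tail)
    needle = stop.encode("utf-8")
    # Offset at which the final token's bytes begin.
    last_token_start = len(joined) - len(tail[-1])
    pos = joined.find(needle)
    while pos != -1:
        # Accept only matches that end inside the final token.
        if pos + len(needle) > last_token_start:
            return True
        pos = joined.find(needle, pos + 1)
    return False


# Assumed byte splits for two of the cases above.
fragmented = ["대".encode(), "화".encode(), b" \xeb", b"\x81", b"\x9d"]  # '대화 끝'
not_at_end = ["끝".encode(), " 대".encode(), "화".encode()]              # '끝 대화'

print(halts_with_fallback(fragmented, "끝", window=4))  # True
print(halts_with_fallback(not_at_end, "끝", window=4))  # False
```

Byte-level matching sidesteps partial-character decodes, but the loop and the data-dependent branch are exactly the kind of control flow that does not survive `torch.compile`, so this only illustrates the eager option.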

### Related

#40520 adds `StopStringTextMatchCriteria` (decodes text, intended as the new default, not compile-safe), whose docstring says the two classes have "equivalent functionality". That is not true for byte-fragmented CJK, and #40520 would not fix the compile-compatible `StopStringCriteria` this issue is about.

