[Bugfix] DiffusionGemma: only pop a request's logprobs when it commits (#45689)#45754
Open
waynehacking8 wants to merge 1 commit into
Open
[Bugfix] DiffusionGemma: only pop a request's logprobs when it commits (#45689)#45754waynehacking8 wants to merge 1 commit into
waynehacking8 wants to merge 1 commit into
Conversation
In a mixed decode batch, `DiffusionSampler.__call__` gated logprobs reassembly on the batch-wide `is_committing.any()`, then popped `_pending_logprobs[slot]` for every decode request that had a stashed entry — without checking whether that request was itself committing this step. So when request A commits while a co-batched request B merely converged this step (its logprobs stashed via `just_converged`), B's stash was consumed one step early. B's eventual committed response then returned fewer logprob rows than tokens, crashing the OpenAI chat formatter with `IndexError` (issue vllm-project#45689). Pop a request's stash only when that request commits: build the set of committing slots from the pre-step `is_committing` snapshot and require `slot in committing_slots`. A converged-but-not-committing request keeps its stash until its own commit. The commit-loop assembly is extracted into `_assemble_committed_logprobs` so the interleaving can be unit tested without a GPU. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Wayne Chiu <waynehacking8@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Fixes #45689.
In a mixed decode batch,
DiffusionSampler.__call__gated logprobsreassembly on the batch-wide
is_committing.any(), then poppedself._pending_logprobs[slot]for every decode request that had a stashedentry — without checking whether that request was itself committing this
step:
So when request A commits while a co-batched request B merely converged this
step (its logprobs stashed via
just_converged, butis_committing[B]isFalse), B's stash is consumed one step early. B's eventual committed response
then returns fewer logprob rows than tokens, crashing the OpenAI chat formatter
with
IndexError: list index out of range(in_create_chat_logprobs).Credit to @masterFoad for the root-cause analysis and the suggested fix in the
issue.
Fix
Pop a request's stash only when that request commits: build the set of
committing slots from the pre-step
is_committingsnapshot and requireslot in committing_slots. A converged-but-not-committing request keeps itsstash until its own commit. The commit-loop assembly is extracted into a
_assemble_committed_logprobsstatic helper so the interleaving is unit-testablewithout a GPU. No behavior change for the single-request / aligned cases.
Not a duplicate
gh issue view 45689 --comments→ unassigned, no prior PR.gh pr list --state open --search "45689"→ none.Test plan / results
New
tests/models/test_diffusion_gemma_logprobs.py(pure CPU, no GPU):Covers: only-committing-pops, both-committing-emit-in-order (incl.
cu_num_generated_tokens), no-committing-returns-None, non-decode-skipped.Verified the test catches the bug: removing the
slot in committing_slotsguard makes
test_only_committing_request_pops_its_logprobsandtest_no_committing_request_returns_nonefail (the early/erroneous pop).pre-commit run --files vllm/model_executor/models/diffusion_gemma.py tests/models/test_diffusion_gemma_logprobs.py→ all hooks pass (ruff check + format, mypy-3.10, SPDX, …).Notes
AI-assisted: this change was developed with AI assistance (Claude) and reviewed
end-to-end by the submitter.