[Feature] Split #1303 Part 2: Qwen PD integration #1912
Bring the split-2 branch back in line with vllm-project#1303 by pairing the Qwen model and stage-processor changes with the PD runtime wiring they depend on. Includes the orchestrator routing changes in omni.py/async_omni.py, stage worker PD flags and KV-transfer restoration in omni_stage.py, the connector flush in omni_llm.py, and the unit-test package markers from the original branch.

Co-authored-by: spencerr221 <liubingyu62@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
25468fc to 5b6b234
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25468fc951
    if 0 <= index < len(prefill_stage.engine_outputs):
        return prefill_stage.engine_outputs[index]
    if prefill_stage.engine_outputs:
        return prefill_stage.engine_outputs[-1]
Return None when no prefill request ID matches
In PD mode, _match_prefill_output falls back to index/last even after failing to find a matching request_id, so thinker2talker() can merge prefill embeddings from a different request into the current decode request. This corrupts talker context whenever prefill/decode output lists are not perfectly aligned for a step (for example, different ready-request sets across stages), and the safe behavior here is to skip merging (None) rather than positional fallback when no ID match exists.
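The suggested behavior can be sketched roughly as follows. This is only an illustration, not the PR's code: `FakeEngineOutput` and `match_prefill_output_strict` are hypothetical names, and the sketch assumes each prefill output exposes a `request_id` attribute.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class FakeEngineOutput:
    """Hypothetical stand-in for one prefill engine output."""

    request_id: str
    payload: Any = None


def match_prefill_output_strict(
    engine_outputs: Sequence[FakeEngineOutput],
    request_id: str,
) -> Optional[FakeEngineOutput]:
    """Return the prefill output whose request_id matches, else None."""
    for output in engine_outputs:
        if output.request_id == request_id:
            return output
    # No ID match: return None so the caller skips the merge,
    # rather than falling back to an index or to the last element.
    return None
```

With this shape, a caller such as `thinker2talker()` would have to treat `None` as "do not merge prefill embeddings for this request".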
        val = prefill_output.outputs[0].multimodal_output.get(key)
    except Exception:
        pass
    return val.detach().to(device=device, dtype=torch.float) if val is not None else None
Ensure _tts always provides tensors for talker prefill
_tts() can now return None when both decode and prefill outputs lack a TTS embedding, and that None is stored in additional_information; the talker prefill path later does info_dict.get("tts_*_embed").to(...) unconditionally, which raises at runtime (NoneType has no to). This is the exact pd_no_tts_anywhere path introduced by the new logic, so this should either keep failing fast here or synthesize tensor defaults before forwarding to talker.
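A minimal sketch of the fail-fast option, assuming the embeddings are read from a plain dict. `require_embed` is a hypothetical helper and `tts_example_embed` is a placeholder key, not one of the real `tts_*_embed` names; the sketch deliberately avoids torch so it stays self-contained.

```python
from typing import Any, Mapping


def require_embed(info: Mapping[str, Any], key: str) -> Any:
    """Return info[key], raising a descriptive error if it is missing or None."""
    value = info.get(key)
    if value is None:
        # Fail here, with the key name, instead of later on `.to(...)`.
        raise ValueError(
            f"missing TTS embedding {key!r}; cannot build talker prefill input"
        )
    return value
```

The alternative named in the comment, synthesizing default tensors, would need the expected shape and dtype, which this sketch does not assume.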
import warnings
from collections import defaultdict
from typing import Any
from unittest.mock import MagicMock
Purpose
This PR is part 2 of the #1303 split series.
Part 1 (#1863) merged the PD disaggregation scaffolding. This PR carries only the Qwen3-Omni integration layer that consumes that scaffolding.
Scope
This PR includes only:
- vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py
- vllm_omni/model_executor/stage_input_processors/qwen3_omni.py
- tests/model_executor/stage_input_processors/test_qwen3_omni_stage_processors.py
Notes
main, so it does not re-introduce the part 1 scaffolding diff.