Motivation
The Qwen-Omni family (e.g., Qwen3-Omni and similar multi-stage AR + speech pipelines) already runs end-to-end in vLLM-Omni (Thinker → Talker → Code2Wav, etc.). Q1 advanced entrypoints, quantization, CUDA Graph, cross-stage async chunking, and multimodal streaming input aligned with upstream vLLM (see Q1 Roadmap #677). For Qwen-Omni as a product line, Q2 should deliver lower time-to-first-token / time-to-first-audio, scalable long multi-turn and streaming sessions, and production parity with upstream scheduling features (prefix caching, chunked prefill).
This roadmap scopes Qwen-Omni only for Q2. It aligns with the broader Q2 themes in project Q2 collection #2136 (“Prefix Cache and Memory Coordination”, “Streaming input/output”, EPDG / disaggregated serving, etc.) and spells them out for Qwen-Omni.
Below are the models and features we support. If you have other models or features you are interested in, please feel free to contact us.
| Feature | Ming-Flash-Omni-2.0 | Qwen3-Omni | Qwen2.5-Omni |
|---|---|---|---|
| Stage | | ✅ | ✅ |
| Batch | | ✅ | ✅ |
| CUDA Graph | | ✅ | |
| Async Chunk | | ✅ | |
| Streaming input | | ⏳ | |
| Streaming output | | ✅ | |
| Prefix cache | | ⏳ | |
| Chunked Prefill | | ⏳ | |
| Quantization | | ✅ | |
| Prefill-Decode disaggregation | | ⏳ | |
| Reinforcement Learning | | 🙋 | |
Performance data
model-id: Qwen/Qwen3-Omni-30B-A3B-Instruct
random_input_len:100
random_output_len:100
| test_name | dataset_name | max_concurrency | request_rate | mean_e2el_ms | mean_ttft_ms | mean_audio_ttfp_ms | mean_audio_rtf |
|---|---|---|---|---|---|---|---|
| qwen3_omni | random | 1 | - | 5916.145047 | 49.9711778 | 5798.78831 | 0.177744214 |
| qwen3_omni | random | 4 | - | 7505.656183 | 66.06060175 | 7384.353625 | 0.223856885 |
| qwen3_omni | random | 10 | - | 11301.83517 | 186.9555722 | 11173.13861 | 0.32931707 |
| qwen3_omni | random-mm | - | 0.1 | 7343.288911 | 1188.264374 | 7219.766443 | 0.240805141 |
| qwen3_omni | random-mm | - | 0.3 | 7067.158632 | 168.8519941 | 6942.87535 | 0.207324097 |
| qwen3_omni | random-mm | - | 0.5 | 8773.390233 | 167.3971285 | 8647.037808 | 0.257598932 |
| qwen3_omni_chunk | random | 1 | - | 5149.965969 | 47.3504127 | 421.7417487 | 0.158677357 |
| qwen3_omni_chunk | random | 4 | - | 7969.763369 | 341.9354187 | 1079.446806 | 0.237064974 |
| qwen3_omni_chunk | random | 10 | - | 17474.16753 | 1481.090657 | 2857.069322 | 0.522625353 |
| qwen3_omni_chunk | random-mm | - | 0.1 | 5663.873116 | 258.6243343 | 661.881583 | 0.167943394 |
| qwen3_omni_chunk | random-mm | - | 0.3 | 6780.008788 | 222.2346747 | 700.7485206 | 0.203259027 |
| qwen3_omni_chunk | random-mm | - | 0.5 | 9883.475912 | 1724.942598 | 2333.510879 | 0.294857716 |
model-id: Qwen/Qwen3-Omni-30B-A3B-Instruct
random_input_len:2500
random_output_len:900
| test_name | dataset_name | concurrency | request_rate | mean_e2el_ms | mean_ttft_ms | mean_audio_ttfp_ms | mean_audio_rtf |
|---|---|---|---|---|---|---|---|
| qwen3_omni | random | 1 | - | 31195.3737 | 215.1779 | 30987.77434 | 0.2801 |
| qwen3_omni | random | 4 | - | 57946.4745 | 325.7549 | 57718.85183 | 0.2778 |
| qwen3_omni_chunk | random | 1 | - | 37975.1413 | 216.7963 | 796.8836834 | 0.1648 |
| qwen3_omni_chunk | random | 4 | - | 54992.9832 | 627.7617 | 1595.880035 | 0.2787 |
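One way to read the tables above is to compare mean_audio_ttfp_ms between the `qwen3_omni` and `qwen3_omni_chunk` runs at the same load. The sketch below does only that arithmetic on values copied from the `random` dataset rows; the dictionary layout and the helper name `ttfp_speedup` are illustrative and not part of any benchmark tool.

```python
# Mean audio time-to-first-packet (ms), copied from the "random" dataset rows above.
# Keys: (random_input_len, random_output_len, concurrency).
BASELINE_TTFP_MS = {
    (100, 100, 1): 5798.78831,
    (100, 100, 4): 7384.353625,
    (100, 100, 10): 11173.13861,
    (2500, 900, 1): 30987.77434,
    (2500, 900, 4): 57718.85183,
}
CHUNK_TTFP_MS = {
    (100, 100, 1): 421.7417487,
    (100, 100, 4): 1079.446806,
    (100, 100, 10): 2857.069322,
    (2500, 900, 1): 796.8836834,
    (2500, 900, 4): 1595.880035,
}


def ttfp_speedup(key):
    """Ratio of baseline to async-chunk mean audio TTFP for one configuration."""
    return BASELINE_TTFP_MS[key] / CHUNK_TTFP_MS[key]


for key in sorted(BASELINE_TTFP_MS):
    print(key, f"{ttfp_speedup(key):.1f}x lower audio TTFP with async chunk")
```

The same rows show that the gain is specific to audio TTFP: at concurrency 10 with 100/100 token lengths, mean_e2el_ms rises from 11301.8 to 17474.2 and mean_ttft_ms from 187.0 to 1481.1 in the chunked run.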
Goals (Q2 2026)
| Theme | Outcome |
|---|---|
| Prefix cache | Reuse KV for repeatable prefixes (system prompts, multi-turn history, repeated vision/audio segments) on the Thinker and other AR stages mapped to vLLM, cutting TTFT and redundant prefill work. |
| Streaming input | On top of existing audio and upstream-aligned paths, ship streaming multimodal input for real-time use (including video frame streams and audio–video sessions) consistent with Qwen-Omni pipeline semantics. |
| Chunked prefill | Chunk long multimodal prefills so scheduling matches upstream vLLM: prefill is sliced and interleaved with decode, reducing head-of-line blocking and improving fairness under mixed load. |
| P/D disaggregation (Thinker) | Split Qwen-Omni Thinker prefill vs decode with reliable KV transfer; validate with multimodal prompts, prefix cache, and chunked prefill; align configs and ops with broader EPDG / multi-node serving in #2136. |
P0 — Must ship for Qwen-Omni in Q2
1) Prefix caching (Qwen-Omni AR / Thinker)
- Scope
- Enable prefix KV reuse on the Thinker (and any stage backed by the vLLM AR engine), consistent with upstream behavior, including integration with `MultiConnector` / `LMCacheConnectorV1` as appropriate (see per-stage design in Multi-Stage KV Cache Management #1867).
- Define multimodal prefix boundaries: which token blocks participate in hashing / block alignment; cache correctness and invalidation for combinations of `image_url`, `audio_url`, and interleaved inputs.
- Success criteria
- Measurable TTFT and prefill savings when many requests share a prefix; no cross-request KV reuse bugs.
- Documentation: flags, limits (e.g., max prefix length, modality combinations), interaction with disaggregated serving.
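To make the "multimodal prefix boundaries" question concrete, here is a minimal sketch of chained block hashing in which any block overlapping a multimodal placeholder range also hashes that item's content hash. It is a toy model under stated assumptions, not the vLLM or vLLM-Omni implementation: `BLOCK_SIZE`, `block_hashes`, `shared_prefix_blocks`, and the `mm_spans` layout are all hypothetical names chosen for the example.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # tokens per KV block; deliberately tiny for the example


def block_hashes(
    token_ids: List[int],
    mm_spans: Optional[Dict[Tuple[int, int], str]] = None,
) -> List[str]:
    """Chained hashes for every full block of a prompt.

    mm_spans maps a half-open token range [start, end) of multimodal
    placeholder tokens to a content hash of the underlying image or audio.
    """
    mm_spans = mm_spans or {}
    hashes: List[str] = []
    parent = ""
    for i in range(len(token_ids) // BLOCK_SIZE):  # a trailing partial block is skipped
        start, end = i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE
        # Content hashes of every multimodal item overlapping this block, in position order.
        extra = [h for (s, e), h in sorted(mm_spans.items()) if s < end and e > start]
        payload = repr((parent, token_ids[start:end], extra)).encode()
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a: List[str], b: List[str]) -> int:
    """Number of leading blocks two requests could reuse from each other."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = [1, 2, 3, 4, 5, 6, 7, 8]   # shared system prompt (two blocks)
placeholder = [9, 9, 9, 9]          # identical placeholder tokens for one image
tail = [10, 11, 12, 13]
req_a = block_hashes(system + placeholder + tail, {(8, 12): "img-A"})
req_b = block_hashes(system + placeholder + tail, {(8, 12): "img-B"})
print(shared_prefix_blocks(req_a, req_b))  # 2: only the system-prompt blocks match
```

Both requests have identical token IDs, so hashing tokens alone would wrongly share all four blocks. Mixing in the content hash stops reuse at the first block that touches a different image, and the chaining invalidates every later block as well.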
2) Streaming input (Qwen-Omni)
- Scope
- Build on Q1’s “multimodal streaming input aligned with vLLM upstream” and treat Qwen-Omni as a first-class target in Q2:
- Harden incremental audio / multi-turn paths and keep API/protocol behavior consistent.
- Streaming video input and long-session buffering, sampling, and request assembly for the Thinker (see Streaming Video Input RFC #2201 and linked PRs), aligned with Qwen3-Omni audio-in-video and temporal alignment semantics.
- Stay aligned with upstream `StreamingInput` / realtime WebSocket behavior to avoid unnecessary Omni-only forks.
- Success criteria
3) Chunked prefill (Qwen-Omni)
- Scope
- Enable chunked prefill for heavy Thinker prefills (long text + many images/frames/long audio), matching vLLM scheduler semantics: prefill chunks interleave with decode instead of monopolizing the GPU.
- Validate interaction with prefix cache and async chunk (Thinker → Talker): chunk boundaries must not break hidden-state handoff or KV metadata consistency.
- Success criteria
- Measurable improvement in P99 prefill latency and decode starvation under high concurrency; regression tests for chunked + multimodal + multi-stage combinations.
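The interleaving described above can be illustrated with a toy per-step token budget: running decodes are served first, and whatever budget remains is spent on a slice of pending prefill. This is a simplified sketch and not vLLM's scheduler; `Request`, `schedule_step`, and `token_budget` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    name: str
    prefill_remaining: int  # prompt tokens not yet prefilled
    decode_remaining: int   # output tokens still to generate


def schedule_step(requests: List[Request], token_budget: int) -> List[Tuple[str, str, int]]:
    """Plan one engine step: decodes first, then prefill chunks with leftover budget."""
    plan: List[Tuple[str, str, int]] = []
    budget = token_budget
    # 1) Each request already in decode gets one token, so a long prefill cannot starve it.
    for r in requests:
        if r.prefill_remaining == 0 and r.decode_remaining > 0 and budget > 0:
            r.decode_remaining -= 1
            budget -= 1
            plan.append((r.name, "decode", 1))
    # 2) Leftover budget is spent on prefill, sliced into chunks that fit.
    for r in requests:
        if r.prefill_remaining > 0 and budget > 0:
            chunk = min(r.prefill_remaining, budget)
            r.prefill_remaining -= chunk
            budget -= chunk
            plan.append((r.name, "prefill", chunk))
    return plan


reqs = [
    Request("long_video", prefill_remaining=20, decode_remaining=2),
    Request("chat", prefill_remaining=0, decode_remaining=3),
]
for step in range(4):
    print(step, schedule_step(reqs, token_budget=8))
```

With a budget of 8, the 20-token prefill is split into chunks of 7, 7, and 6 while `chat` keeps emitting one token per step. Without chunking, `chat` would wait for the whole prefill to finish.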
4) Prefill–Decode disaggregation (Qwen-Omni Thinker)
- Scope
- Enable P/D split on the Thinker AR engine where supported, using vLLM-Omni’s KV transfer and connector stack (see #1867 F2, #1303).
- Ensure Qwen3-Omni multimodal prefill → decode handoff is correct (KV layout, request IDs, embedding merge paths); avoid regressions for CFG / multi-cache scenarios called out in community trackers.
- Define how P/D coexists with prefix caching and chunked prefill (scheduler + connector ordering); document single-node vs multi-node deployment.
- Success criteria
- E2E serving: prefill workers and decode workers stable under load for Qwen3-Omni Thinker; measured KV transfer latency and TTFT documented; CI or nightly coverage for at least one reference YAML.
Progress / current status (as of early Q2 2026)
Legend: ⏳ In progress · ✅ Done (use in tables above as items land).
P1 — Strongly aligned with the Q2 program
5) Reinforcement Learning Support (Qwen3-Omni)
- Scope
- Enable comprehensive RL support (GRPO/PPO for Audio) for multi-stage Qwen3-Omni models. Currently vLLM-Omni is optimized for inference; RL requires intermediate trajectory data (RVQ codec tokens, text hidden states, log-probabilities) from both Thinker and Talker stages.
- Implement Trajectory Return: Record intermediate text embeddings, RVQ codes, and log-probs in `Qwen3OmniMoeTalker` for RL rollout consumption.
- Support Custom Sampling/Exploration: Enable custom multinomial or stochastic samplers via pipeline worker extensions for exploration in the audio generation stage.
- Unified RL Output Interface: Standardize `OmniRequestOutput._custom_output` for propagating RL tensors from Talker stages back to the training collector (e.g., VeRL framework).
- Cross-Stage LoRA Support:
- Stage 0 (Thinker): Full vLLM-native LoRA adapters for multimodal textual reasoning alignment.
- Stage 1 (Talker): Update LoRA managers to support `qwen3_omni` component naming and enable LoRA for Talker transformer blocks.
- Success criteria
- End-to-end GRPO/PPO training loop functional with Qwen3-Omni, returning required trajectory data without inference regression.
- LoRA fine-tuning improves both text reasoning and audio quality metrics; no naming convention conflicts between Thinker/Talker stages.
- Documentation: RL data format specs, LoRA adapter loading guide, example training integration with VeRL.
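As a rough picture of what a training collector might do with returned trajectories, the sketch below pairs a per-rollout record with a GRPO-style group-relative advantage, that is, each reward standardized within its rollout group. The `TalkerTrajectory` class and its field names are hypothetical; they are not the layout of `OmniRequestOutput._custom_output` or any VeRL type. The choice of population standard deviation is likewise an assumption of this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class TalkerTrajectory:
    """Hypothetical per-rollout record; field names are illustrative only."""
    request_id: str
    rvq_codes: List[List[int]]  # codebook indices, one inner list per audio frame
    logprobs: List[float]       # log-probability of each sampled frame
    reward: float = 0.0         # assigned by the training side after scoring


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same prompt, two rewarded and two not.
group = [
    TalkerTrajectory(f"req-{i}", rvq_codes=[[0, 1]], logprobs=[-0.5], reward=r)
    for i, r in enumerate([1.0, 0.0, 0.0, 1.0])
]
advantages = group_relative_advantages([t.reward for t in group])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

The policy update would then weight each trajectory's `logprobs` by its advantage, which is why the rollout side needs to return log-probabilities alongside the RVQ codes.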
Other P1 Items
- Cross-stage memory / HBM coordination: Combine with "Prefix Cache and Memory Coordination" in #2136 and F1 in #1867 (static budgets, admission) so Thinker prefix caching does not starve Talker or blow VRAM budgets.
- Disaggregated Qwen-Omni: For P/D or EPDG deployments, define cross-node correctness for prefix cache + chunked prefill (builds on Q1 EPDG work); see § Prefill–Decode disaggregation and the P/D detailed tracking table above.
- Observability: Per-request metrics for prefill chunks, prefix hit/miss, streaming session length, etc., for joint acceptance with Q2 features.
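For the observability item, a per-request record along the following lines would cover the counters named above. This is only a sketch of a possible shape: `RequestMetrics` and its fields are hypothetical and do not correspond to an existing vLLM-Omni class or metric name.

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    """Hypothetical per-request counters for joint acceptance of Q2 features."""
    prompt_tokens: int
    prefix_hit_tokens: int = 0        # prompt tokens served from the prefix cache
    prefill_chunks: int = 0           # number of prefill slices the scheduler issued
    streaming_session_s: float = 0.0  # wall-clock length of the streaming session

    @property
    def prefix_hit_rate(self) -> float:
        """Fraction of prompt tokens that did not need to be prefilled again."""
        if self.prompt_tokens == 0:
            return 0.0
        return self.prefix_hit_tokens / self.prompt_tokens


m = RequestMetrics(prompt_tokens=2500, prefix_hit_tokens=2000, prefill_chunks=1)
print(f"prefix hit rate: {m.prefix_hit_rate:.0%}")
```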
P2 — Stretch / ecosystem
- Extend unified streaming video + audio protocol to Qwen2.5-Omni and others (Phase 6 in #2201).
| Symbol | Meaning |
|---|---|
| ✅ | Already supported, PR attached |
| 🙋 | Not supported yet, help wanted! |
| ⏳ | Not supported yet, PR raised |
| ❓ | May be unnecessary to support; the benefits are minimal |
Dependencies & references
Call for contributions
RFCs may use the project design doc template (same as #677). Please claim sub-tasks and link issues/PRs under #2136 or a dedicated tracking issue.
CC List
@hsliuustc0106 @Gaohan123 @tzhouam @R2-Y @Shirley125 @princepride @lishunyang12 @alex-jw-brooks @LJH-LBJ @ZeldaHuang @wtomin @ZJY0516 @knlnguyen1802 @natureofnature @SamitHuang