[RFC]: Omni-Modality Q2 Roadmap #2207

@amy-why-3459

Motivation

The Qwen-Omni family (e.g., Qwen3-Omni and similar multi-stage AR + speech pipelines) already runs end-to-end in vLLM-Omni (Thinker → Talker → Code2Wav, etc.). Q1 advanced entrypoints, quantization, CUDA Graph, cross-stage async chunking, and multimodal streaming input aligned with upstream vLLM (see Q1 Roadmap #677). For Qwen-Omni as a product line, Q2 should deliver lower time-to-first-token / time-to-first-audio, scalable long multi-turn and streaming sessions, and production parity with upstream scheduling features (prefix caching, chunked prefill).

This roadmap scopes Qwen-Omni only for Q2. It aligns with the broader Q2 themes in project Q2 collection #2136 (“Prefix Cache and Memory Coordination”, “Streaming input/output”, EPDG / disaggregated serving, etc.) and spells them out for Qwen-Omni.

Below are the models and features we support. If you have other models or features you are interested in, please feel free to contact us.


Feature Ming-Flash-Omni-2.0 Qwen3-Omni Qwen2.5-Omni
Stage
Batch
CUDA Graph
Async Chunk
Streaming input
Streaming output
Prefix cache
Chunked Prefill
Quantization
Prefill-Decode disaggregation
Reinforcement Learning 🙋

Performance data

model-id: Qwen/Qwen3-Omni-30B-A3B-Instruct
random_input_len: 100
random_output_len: 100

| test_name | dataset_name | max_concurrency | request_rate | mean_e2el_ms | mean_ttft_ms | mean_audio_ttfp_ms | mean_audio_rtf |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3_omni | random | 1 | - | 5916.145047 | 49.9711778 | 5798.78831 | 0.177744214 |
| qwen3_omni | random | 4 | - | 7505.656183 | 66.06060175 | 7384.353625 | 0.223856885 |
| qwen3_omni | random | 10 | - | 11301.83517 | 186.9555722 | 11173.13861 | 0.32931707 |
| qwen3_omni | random-mm | - | 0.1 | 7343.288911 | 1188.264374 | 7219.766443 | 0.240805141 |
| qwen3_omni | random-mm | - | 0.3 | 7067.158632 | 168.8519941 | 6942.87535 | 0.207324097 |
| qwen3_omni | random-mm | - | 0.5 | 8773.390233 | 167.3971285 | 8647.037808 | 0.257598932 |
| qwen3_omni_chunk | random | 1 | - | 5149.965969 | 47.3504127 | 421.7417487 | 0.158677357 |
| qwen3_omni_chunk | random | 4 | - | 7969.763369 | 341.9354187 | 1079.446806 | 0.237064974 |
| qwen3_omni_chunk | random | 10 | - | 17474.16753 | 1481.090657 | 2857.069322 | 0.522625353 |
| qwen3_omni_chunk | random-mm | - | 0.1 | 5663.873116 | 258.6243343 | 661.881583 | 0.167943394 |
| qwen3_omni_chunk | random-mm | - | 0.3 | 6780.008788 | 222.2346747 | 700.7485206 | 0.203259027 |
| qwen3_omni_chunk | random-mm | - | 0.5 | 9883.475912 | 1724.942598 | 2333.510879 | 0.294857716 |

model-id: Qwen/Qwen3-Omni-30B-A3B-Instruct
random_input_len: 2500
random_output_len: 900

| test_name | dataset_name | concurrency | request_rate | mean_e2el_ms | mean_ttft_ms | mean_audio_ttfp_ms | mean_audio_rtf |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3_omni | random | 1 | - | 31195.3737 | 215.1779 | 30987.77434 | 0.2801 |
| qwen3_omni | random | 4 | - | 57946.4745 | 325.7549 | 57718.85183 | 0.2778 |
| qwen3_omni_chunk | random | 1 | - | 37975.1413 | 216.7963 | 796.8836834 | 0.1648 |
| qwen3_omni_chunk | random | 4 | - | 54992.9832 | 627.7617 | 1595.880035 | 0.2787 |
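
To make the comparison in the tables above easier to read, the following sketch computes baseline-to-chunk ratios for a few rows. It assumes that `qwen3_omni_chunk` denotes the async-chunk configuration and `qwen3_omni` the non-chunked baseline, which the test names suggest but the tables do not state. The `ROWS` layout and the `ratio` helper are illustrative names, not part of any benchmark tool; the numbers are copied from the tables.

```python
# (baseline, chunk) means in milliseconds, copied from the tables above.
# Assumption: "qwen3_omni" is the baseline and "qwen3_omni_chunk" the async-chunk run.
ROWS = {
    "in100/out100, concurrency 1": {
        "ttft": (49.9711778, 47.3504127),
        "audio_ttfp": (5798.78831, 421.7417487),
        "e2el": (5916.145047, 5149.965969),
    },
    "in100/out100, concurrency 10": {
        "ttft": (186.9555722, 1481.090657),
        "audio_ttfp": (11173.13861, 2857.069322),
        "e2el": (11301.83517, 17474.16753),
    },
    "in2500/out900, concurrency 1": {
        "ttft": (215.1779, 216.7963),
        "audio_ttfp": (30987.77434, 796.8836834),
        "e2el": (31195.3737, 37975.1413),
    },
}


def ratio(baseline: float, chunked: float) -> float:
    """Return baseline / chunked; a value above 1 means the chunk run was faster."""
    return baseline / chunked


for name, metrics in ROWS.items():
    summary = ", ".join(f"{m} x{ratio(*pair):.2f}" for m, pair in metrics.items())
    print(f"{name}: {summary}")
```

On these rows the chunk runs reach first audio far sooner at every load level, while text TTFT and end-to-end latency are higher at concurrency 10 and end-to-end latency is higher for the long-prompt case, so the numbers read as a trade-off rather than a uniform win.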

Goals (Q2 2026)

| Theme | Outcome |
| --- | --- |
| Prefix cache | Reuse KV for repeatable prefixes (system prompts, multi-turn history, repeated vision/audio segments) on the Thinker and other AR stages mapped to vLLM, cutting TTFT and redundant prefill work. |
| Streaming input | On top of existing audio and upstream-aligned paths, ship streaming multimodal input for real-time use (including video frame streams and audio–video sessions) consistent with Qwen-Omni pipeline semantics. |
| Chunked prefill | Chunk long multimodal prefills so scheduling matches upstream vLLM: prefill is sliced and interleaved with decode, reducing head-of-line blocking and improving fairness under mixed load. |
| P/D disaggregation (Thinker) | Split Qwen-Omni Thinker prefill vs decode with reliable KV transfer; validate with multimodal prompts, prefix cache, and chunked prefill; align configs and ops with broader EPDG / multi-node serving in #2136. |

P0 — Must ship for Qwen-Omni in Q2

1) Prefix caching (Qwen-Omni AR / Thinker)

  • Scope
    • Enable prefix KV reuse on the Thinker (and any stage backed by the vLLM AR engine), consistent with upstream behavior, including integration with MultiConnector / LMCacheConnectorV1 as appropriate (see per-stage design in Multi-Stage KV Cache Management #1867).
    • Define multimodal prefix boundaries: which token blocks participate in hashing / block alignment; cache correctness and invalidation for combinations of image_url, audio_url, and interleaved inputs.
  • Success criteria
    • Measurable TTFT and prefill savings when many requests share a prefix; no cross-request KV reuse bugs.
    • Documentation: flags, limits (e.g., max prefix length, modality combinations), interaction with disaggregated serving.
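
One way to picture the multimodal prefix-boundary question is the minimal sketch below. It chains a hash over full token blocks and mixes in a content digest for any image or audio span that overlaps the block, so two prompts with identical placeholder tokens but different media stop sharing blocks at the first media-bearing block. This is an illustration in the spirit of block-hash prefix caching, not the vLLM or vLLM-Omni implementation; `BLOCK_SIZE`, `block_hashes`, `shared_prefix_blocks`, and the `(start, length, digest)` span format are all assumptions.

```python
import hashlib
from typing import List, Sequence, Tuple

BLOCK_SIZE = 16  # assumed KV block size in tokens

# Hypothetical description of a multimodal placeholder span:
# (first token index, number of placeholder tokens, digest of the media content).
MMSpan = Tuple[int, int, str]


def block_hashes(
    token_ids: Sequence[int],
    mm_spans: Sequence[MMSpan] = (),
    block_size: int = BLOCK_SIZE,
) -> List[str]:
    """Hash every full block, chaining each hash to its parent block's hash."""
    hashes: List[str] = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        h = hashlib.sha256(parent)
        h.update(repr(tuple(token_ids[start:end])).encode())
        # Placeholder token IDs are identical across different images or clips,
        # so the media digest must take part in the hash for correctness.
        for span_start, length, digest in mm_spans:
            if span_start < end and span_start + length > start:
                h.update(digest.encode())
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


def shared_prefix_blocks(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading blocks whose hashes match, i.e. the reusable prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


tokens = list(range(40))  # two full blocks plus a partial tail that is not hashed
same = shared_prefix_blocks(
    block_hashes(tokens, [(16, 8, "imgA")]), block_hashes(tokens, [(16, 8, "imgA")])
)
diff = shared_prefix_blocks(
    block_hashes(tokens, [(16, 8, "imgA")]), block_hashes(tokens, [(16, 8, "imgB")])
)
print(same, diff)
```

Because each hash depends on its parent, a mismatch in one block invalidates every later block, which is the behavior the invalidation rules for interleaved inputs would need to preserve.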

2) Streaming input (Qwen-Omni)

  • Scope
    • Build on Q1’s “multimodal streaming input aligned with vLLM upstream” and treat Qwen-Omni as a first-class target in Q2:
      • Harden incremental audio / multi-turn paths and keep API/protocol behavior consistent.
      • Streaming video input and long-session buffering, sampling, and request assembly for the Thinker (see Streaming Video Input RFC #2201 and linked PRs), aligned with Qwen3-Omni audio-in-video and temporal alignment semantics.
    • Stay aligned with upstream StreamingInput / realtime WebSocket behavior to avoid unnecessary Omni-only forks.
  • Success criteria

3) Chunked prefill (Qwen-Omni)

  • Scope
    • Enable chunked prefill for heavy Thinker prefills (long text + many images/frames/long audio), matching vLLM scheduler semantics: prefill chunks interleave with decode instead of monopolizing the GPU.
    • Validate interaction with prefix cache and async chunk (Thinker → Talker): chunk boundaries must not break hidden-state handoff or KV metadata consistency.
  • Success criteria
    • Measurable improvement in P99 prefill latency and decode starvation under high concurrency; regression tests for chunked + multimodal + multi-stage combinations.
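
The interleaving described above can be sketched as a per-step token budget: running decodes each take one token first, and the remaining budget is filled with a slice of a pending prefill. This is a simplified illustration, not the upstream scheduler; `Req`, `schedule_step`, and `token_budget` are hypothetical names, with `token_budget` playing a role loosely comparable to a per-step batched-token limit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Req:
    rid: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed


def schedule_step(
    decoding: List[Req], prefilling: List[Req], token_budget: int
) -> Dict[str, int]:
    """Plan one step: map request ID to the number of tokens it runs this step."""
    plan: Dict[str, int] = {}
    budget = token_budget
    # Decodes go first so a long prefill cannot starve them.
    for req in decoding:
        if budget == 0:
            break
        plan[req.rid] = 1
        budget -= 1
    # Whatever is left is spent on prefill chunks, in arrival order.
    for req in prefilling:
        if budget == 0:
            break
        chunk = min(req.prompt_len - req.prefilled, budget)
        if chunk > 0:
            plan[req.rid] = chunk
            req.prefilled += chunk
            budget -= chunk
    return plan


decodes = [Req("d1", 4, 4), Req("d2", 4, 4)]
long_prompt = Req("p1", prompt_len=20)
steps = []
while long_prompt.prefilled < long_prompt.prompt_len:
    steps.append(schedule_step(decodes, [long_prompt], token_budget=8))
print(steps)
```

In this toy run the 20-token prefill is spread over four steps while both decodes advance on every step. For the multi-stage case, the open question named above is that a chunk boundary may fall inside a multimodal span or before the Thinker has produced the hidden states the Talker expects.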

4) Prefill–Decode disaggregation (Qwen-Omni Thinker)

  • Scope
    • Enable P/D split on the Thinker AR engine where supported, using vLLM-Omni’s KV transfer and connector stack (see #1867 F2, #1303).
    • Ensure Qwen3-Omni multimodal prefill → decode handoff is correct (KV layout, request IDs, embedding merge paths); avoid regressions for CFG / multi-cache scenarios called out in community trackers.
    • Define how P/D coexists with prefix caching and chunked prefill (scheduler + connector ordering); document single-node vs multi-node deployment.
  • Success criteria
    • E2E serving: prefill workers and decode workers stable under load for Qwen3-Omni Thinker; measured KV transfer latency and TTFT documented; CI or nightly coverage for at least one reference YAML.

Progress / current status (as of early Q2 2026)

Feature Title Author RFC PR Status
prefix-cache Enable Prefix Caching with Hidden-State I/O (Multi-round / Service Scenarios) @alex-jw-brooks @LJH-LBJ #1184 #2164 ⏳ In progress
streaming input & Realtime API Qwen3-Omni supports streaming input @lishunyang12 @Shirley125 @Sy0307 #1951 #2201 #2202 #2208 #2342 ⏳ In progress
chunked prefill Support chunked prefill @R2-Y #948 #949 ⏳ In progress
Prefill–Decode disaggregation Support Prefill–Decode disaggregation via vLLM KV transfer (Qwen-Omni / Thinker track) @spencerr221 #1188 #2220 ⏳ In progress
Reinforcement Learning RL support (GRPO/PPO) for Qwen3-Omni multi-stage models #2357 🙋

Legend: ⏳ In progress · ✅ Done (use in tables above as items land).

P1 — Strongly aligned with the Q2 program

5) Reinforcement Learning Support (Qwen3-Omni)

  • Scope
    • Enable comprehensive RL support (GRPO/PPO for Audio) for multi-stage Qwen3-Omni models. Currently vLLM-Omni is optimized for inference; RL requires intermediate trajectory data (RVQ codec tokens, text hidden states, log-probabilities) from both Thinker and Talker stages.
    • Implement Trajectory Return: Record intermediate text embeddings, RVQ codes, and log-probs in Qwen3OmniMoeTalker for RL rollout consumption.
    • Support Custom Sampling/Exploration: Enable custom multinomial or stochastic samplers via pipeline worker extensions for exploration in the audio generation stage.
    • Unified RL Output Interface: Standardize OmniRequestOutput._custom_output for propagating RL tensors from Talker stages back to the training collector (e.g., VeRL framework).
    • Cross-Stage LoRA Support:
      • Stage 0 (Thinker): Full vLLM-native LoRA adapters for multimodal textual reasoning alignment.
      • Stage 1 (Talker): Update LoRA managers to support qwen3_omni component naming and enable LoRA for Talker transformer blocks.
  • Success criteria
    • End-to-end GRPO/PPO training loop functional with Qwen3-Omni, returning required trajectory data without inference regression.
    • LoRA fine-tuning improves both text reasoning and audio quality metrics; no naming convention conflicts between Thinker/Talker stages.
    • Documentation: RL data format specs, LoRA adapter loading guide, example training integration with VeRL.
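
As a rough picture of what a trajectory record for RL rollouts might carry, the sketch below bundles the per-request data named in the scope (RVQ codes and log-probabilities) and shows a group-relative advantage computation of the kind GRPO uses. `TalkerTrajectory` and `group_advantages` are hypothetical names and do not describe the actual `OmniRequestOutput._custom_output` layout, which this roadmap leaves to be specified; the use of population standard deviation and the `eps` term are also assumptions, since implementations differ.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class TalkerTrajectory:
    """Hypothetical per-request rollout record returned to an RL collector."""

    request_id: str
    rvq_codes: List[List[int]]  # one list of codebook indices per audio frame
    logprobs: List[float]       # log-probability of each sampled step
    reward: float = 0.0

    def sequence_logprob(self) -> float:
        """Sum of per-step log-probs, i.e. the log-probability of the rollout."""
        return sum(self.logprobs)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


rollouts = [
    TalkerTrajectory("req-0", [[3, 17], [5, 2]], [-0.5, -0.25], reward=1.0),
    TalkerTrajectory("req-1", [[3, 9], [8, 2]], [-1.0, -0.5], reward=2.0),
    TalkerTrajectory("req-2", [[4, 17], [5, 6]], [-0.25, -0.25], reward=3.0),
]
advantages = group_advantages([t.reward for t in rollouts])
print([round(a, 3) for a in advantages])
```

Keeping log-probabilities alongside the sampled codes lets the trainer form policy ratios without re-running the Talker, which is why the scope asks for both to be returned from the rollout.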

Other P1 Items

  • Cross-stage memory / HBM coordination: Combine with "Prefix Cache and Memory Coordination" in #2136 and F1 in #1867 (static budgets, admission) so Thinker prefix caching does not starve Talker or blow VRAM budgets.
  • Disaggregated Qwen-Omni: For P/D or EPDG deployments, define cross-node correctness for prefix cache + chunked prefill (builds on Q1 EPDG work); see § Prefill–Decode disaggregation and the P/D detailed tracking table above.
  • Observability: Per-request metrics for prefill chunks, prefix hit/miss, streaming session length, etc., for joint acceptance with Q2 features.

P2 — Stretch / ecosystem

  • Extend unified streaming video + audio protocol to Qwen2.5-Omni and others (Phase 6 in #2201).

Symbol Meaning
already supported, PR attached
🙋 not supported yet, help wanted!
not supported yet, with PR raised
maybe unnecessary to support it. The benefits are minimal.

Dependencies & references


Call for contributions

RFCs may use the project design doc template (same as #677). Please claim sub-tasks and link issues/PRs under #2136 or a dedicated tracking issue.

CC List

@hsliuustc0106 @Gaohan123 @tzhouam @R2-Y @Shirley125 @princepride @lishunyang12 @alex-jw-brooks @LJH-LBJ @ZeldaHuang @wtomin @ZJY0516 @knlnguyen1802 @natureofnature @SamitHuang
