[RFC]: Omni-Modality Q2 Roadmap

### Motivation

The Qwen-Omni family (e.g., Qwen3-Omni and similar multi-stage AR + speech pipelines) already runs end-to-end in vLLM-Omni (Thinker → Talker → Code2Wav, etc.). Q1 advanced entrypoints, quantization, CUDA Graph, cross-stage async chunking, and multimodal streaming input aligned with upstream vLLM (see [Q1 Roadmap #677](https://github.com/vllm-project/vllm-omni/issues/677)). For Qwen-Omni as a product line, Q2 should deliver lower time-to-first-token / time-to-first-audio, scalable long multi-turn and streaming sessions, and production parity with upstream scheduling features (prefix caching, chunked prefill).

This roadmap scopes Qwen-Omni only for Q2. It aligns with the broader Q2 themes in [project Q2 collection #2136](https://github.com/vllm-project/vllm-omni/issues/2136) (“Prefix Cache and Memory Coordination”, “Streaming input/output”, **EPDG / disaggregated serving**, etc.) and spells them out for Qwen-Omni.

The table below tracks feature support per model. If you are interested in other models or features, please contact us.

------

| Feature                       | Ming-Flash-Omni-2.0 | Qwen3-Omni | Qwen2.5-Omni |
| :---------------------------- | :-----------------: | :--------: | :----------: |
| Stage                         |                     |     ✅     |      ✅      |
| Batch                         |                     |     ✅     |      ✅      |
| CUDA Graph                    |                     |     ✅     |              |
| Async Chunk                   |                     |     ✅     |              |
| Streaming input               |                     |     ⏳     |              |
| Streaming output              |                     |     ✅     |              |
| Prefix cache                  |                     |     ⏳     |              |
| Chunked Prefill               |                     |     ⏳     |              |
| Quantization                  |                     |     ✅     |              |
| Prefill-Decode disaggregation |                     |     ⏳     |              |
| Reinforcement Learning        |                     |     🙋     |              |

### Performance data

model-id: Qwen/Qwen3-Omni-30B-A3B-Instruct

random_input_len: 100

random_output_len: 100

|    test_name     | dataset_name | max_concurrency | request_rate | mean_e2el_ms | mean_ttft_ms | mean_audio_ttfp_ms | mean_audio_rtf |
| :--------------: | :----------: | :-------------: | :----------: | :----------: | :----------: | :----------------: | :------------: |
|    qwen3_omni    |    random    |        1        |      -       | 5916.145047  |  49.9711778  |     5798.78831     |  0.177744214   |
|    qwen3_omni    |    random    |        4        |      -       | 7505.656183  | 66.06060175  |    7384.353625     |  0.223856885   |
|    qwen3_omni    |    random    |       10        |      -       | 11301.83517  | 186.9555722  |    11173.13861     |   0.32931707   |
|    qwen3_omni    |  random-mm   |        -        |     0.1      | 7343.288911  | 1188.264374  |    7219.766443     |  0.240805141   |
|    qwen3_omni    |  random-mm   |        -        |     0.3      | 7067.158632  | 168.8519941  |     6942.87535     |  0.207324097   |
|    qwen3_omni    |  random-mm   |        -        |     0.5      | 8773.390233  | 167.3971285  |    8647.037808     |  0.257598932   |
| qwen3_omni_chunk |    random    |        1        |      -       | 5149.965969  |  47.3504127  |    421.7417487     |  0.158677357   |
| qwen3_omni_chunk |    random    |        4        |      -       | 7969.763369  | 341.9354187  |    1079.446806     |  0.237064974   |
| qwen3_omni_chunk |    random    |       10        |      -       | 17474.16753  | 1481.090657  |    2857.069322     |  0.522625353   |
| qwen3_omni_chunk |  random-mm   |        -        |     0.1      | 5663.873116  | 258.6243343  |     661.881583     |  0.167943394   |
| qwen3_omni_chunk |  random-mm   |        -        |     0.3      | 6780.008788  | 222.2346747  |    700.7485206     |  0.203259027   |
| qwen3_omni_chunk |  random-mm   |        -        |     0.5      | 9883.475912  | 1724.942598  |    2333.510879     |  0.294857716   |

model-id: Qwen/Qwen3-Omni-30B-A3B-Instruct

random_input_len: 2500

random_output_len: 900

|    test_name     | dataset_name | concurrency | request_rate | mean_e2el_ms | mean_ttft_ms | mean_audio_ttfp_ms | mean_audio_rtf |
| :--------------: | :----------: | :---------: | :----------: | :----------: | :----------: | :----------------: | :------------: |
|    qwen3_omni    |    random    |      1      |      -       |  31195.3737  |   215.1779   |    30987.77434     |     0.2801     |
|    qwen3_omni    |    random    |      4      |      -       |  57946.4745  |   325.7549   |    57718.85183     |     0.2778     |
| qwen3_omni_chunk |    random    |      1      |      -       |  37975.1413  |   216.7963   |    796.8836834     |     0.1648     |
| qwen3_omni_chunk |    random    |      4      |      -       |  54992.9832  |   627.7617   |    1595.880035     |     0.2787     |
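
One way to read the two tables is to compare `mean_audio_ttfp_ms` with and without async chunk at the same concurrency. The sketch below does only that arithmetic on the `random` rows; the values are copied from the tables above, and the dictionary layout and helper name are illustrative, not part of any benchmark tooling.

```python
# Keys: (input_len/output_len, concurrency).
# Values: (qwen3_omni, qwen3_omni_chunk) mean_audio_ttfp_ms, copied from the tables.
AUDIO_TTFP_MS = {
    ("100/100", 1): (5798.78831, 421.7417487),
    ("100/100", 4): (7384.353625, 1079.446806),
    ("100/100", 10): (11173.13861, 2857.069322),
    ("2500/900", 1): (30987.77434, 796.8836834),
    ("2500/900", 4): (57718.85183, 1595.880035),
}


def ttfp_ratio(baseline_ms: float, chunk_ms: float) -> float:
    """How many times earlier the first audio packet arrives with async chunk."""
    return baseline_ms / chunk_ms


for (lengths, concurrency), (baseline, chunk) in AUDIO_TTFP_MS.items():
    ratio = ttfp_ratio(baseline, chunk)
    print(f"len={lengths} concurrency={concurrency}: {ratio:.1f}x lower audio TTFP")
```

The same tables show that `mean_ttft_ms` and `mean_e2el_ms` do not improve uniformly with async chunk at higher concurrency, so the ratio above describes first-audio latency only.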

### Goals (Q2 2026)

| Theme                         | Outcome                                                                                                                                                                                                                                                                 |
| :---------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Prefix cache                  | Reuse KV for repeatable prefixes (system prompts, multi-turn history, repeated vision/audio segments) on the Thinker and other AR stages mapped to vLLM, cutting TTFT and redundant prefill work.                                                                          |
| Streaming input               | On top of existing audio and upstream-aligned paths, ship streaming multimodal input for real-time use (including video frame streams and audio–video sessions) consistent with Qwen-Omni pipeline semantics.                                                             |
| Chunked prefill               | Chunk long multimodal prefills so scheduling matches upstream vLLM: prefill is sliced and interleaved with decode, reducing head-of-line blocking and improving fairness under mixed load.                                                                              |
| P/D disaggregation (Thinker)  | Split Qwen-Omni **Thinker** prefill vs decode with reliable **KV transfer**; validate with multimodal prompts, prefix cache, and chunked prefill; align configs and ops with broader **EPDG** / multi-node serving in [#2136](https://github.com/vllm-project/vllm-omni/issues/2136). |

------

### P0 — Must ship for Qwen-Omni in Q2

#### 1) Prefix caching (Qwen-Omni AR / Thinker)

- Scope
  - Enable prefix KV reuse on the Thinker (and any stage backed by the vLLM AR engine), consistent with upstream behavior, including integration with `MultiConnector` / `LMCacheConnectorV1` as appropriate (see per-stage design in [Multi-Stage KV Cache Management #1867](https://github.com/vllm-project/vllm-omni/issues/1867)).
  - Define multimodal prefix boundaries: which token blocks participate in hashing / block alignment; cache correctness and invalidation for combinations of `image_url`, `audio_url`, and interleaved inputs.
- Success criteria
  - Measurable TTFT and prefill savings when many requests share a prefix; no cross-request KV reuse bugs.
  - Documentation: flags, limits (e.g., max prefix length, modality combinations), interaction with disaggregated serving.
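
To make the "multimodal prefix boundaries" question concrete, here is a minimal, self-contained sketch of chained block hashing in which each block's hash also covers a digest of any media overlapping that block. It is an illustration under stated assumptions, not the vLLM or vLLM-Omni implementation: `BLOCK_SIZE`, `block_hashes`, `shared_blocks`, and the `mm_digests` mapping are all hypothetical names.

```python
import hashlib
from typing import Dict, List, Optional, Sequence

BLOCK_SIZE = 4  # hypothetical; real KV blocks are larger


def block_hashes(
    token_ids: Sequence[int],
    mm_digests: Optional[Dict[int, str]] = None,
    block_size: int = BLOCK_SIZE,
) -> List[str]:
    """Hash each full block together with its parent hash and media digest.

    mm_digests maps a block index to a content digest of the image/audio
    whose placeholder tokens fall in that block (an assumption of this sketch).
    Trailing partial blocks are not hashed and therefore never reused.
    """
    mm_digests = mm_digests or {}
    hashes: List[str] = []
    parent = b""
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        h = hashlib.sha256()
        h.update(parent)
        h.update(b"|" + ",".join(str(t) for t in block).encode())
        h.update(b"|" + mm_digests.get(i, "").encode())
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


def shared_blocks(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading blocks two requests could share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = [1, 2, 3, 4]
placeholder = [9, 9, 9, 9]  # identical placeholder tokens for any image
question = [5, 6, 7, 8]

req_a = block_hashes(system + placeholder + question, {1: "digest-image-A"})
req_b = block_hashes(system + placeholder + question, {1: "digest-image-B"})
req_c = block_hashes(system + placeholder + [7, 7, 7, 7], {1: "digest-image-A"})

print(shared_blocks(req_a, req_b))  # different image: reuse stops before block 1
print(shared_blocks(req_a, req_c))  # same image, different question
```

Because placeholder token IDs are identical across different images, hashing token IDs alone would wrongly reuse KV between `req_a` and `req_b`; including the media digest is what prevents that class of cross-request reuse bug.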

#### 2) Streaming input (Qwen-Omni)

- Scope
  - Build on Q1’s “multimodal streaming input aligned with vLLM upstream” and treat Qwen-Omni as a first-class target in Q2:
    - Harden incremental audio / multi-turn paths and keep API/protocol behavior consistent.
    - Streaming video input and long-session buffering, sampling, and request assembly for the Thinker (see [Streaming Video Input RFC #2201](https://github.com/vllm-project/vllm-omni/issues/2201) and linked PRs), aligned with Qwen3-Omni audio-in-video and temporal alignment semantics.
  - Stay aligned with upstream `StreamingInput` / realtime WebSocket behavior to avoid unnecessary Omni-only forks.
- Success criteria
  - End-to-end: stream in → Thinker → (optional) Talker/Code2Wav with documented baselines (e.g., TTFT, TTFA); bounded or configurable memory growth in long sessions (ties to KV window/eviction follow-ups in #2201).
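
The "bounded or configurable memory growth" criterion can be illustrated with a fixed-capacity session buffer that evicts the oldest chunks. This is a sketch only: `Chunk`, `SessionBuffer`, and `max_chunks` are hypothetical names, and it does not reflect the buffering design in #2201.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Chunk:
    timestamp_s: float
    modality: str  # e.g. "audio" or "video"
    payload: bytes


class SessionBuffer:
    """Keeps at most max_chunks recent chunks for one streaming session."""

    def __init__(self, max_chunks: int) -> None:
        self._chunks: Deque[Chunk] = deque(maxlen=max_chunks)
        self.dropped = 0  # count of evicted chunks, useful as a metric

    def push(self, chunk: Chunk) -> None:
        if len(self._chunks) == self._chunks.maxlen:
            self.dropped += 1  # deque will evict the oldest entry on append
        self._chunks.append(chunk)

    def assemble(self) -> List[Chunk]:
        """Return retained chunks ordered by timestamp for request assembly."""
        return sorted(self._chunks, key=lambda c: c.timestamp_s)


buf = SessionBuffer(max_chunks=3)
for t in range(5):
    buf.push(Chunk(timestamp_s=float(t), modality="audio", payload=b""))

window = buf.assemble()
print(len(window), buf.dropped, window[0].timestamp_s)
```

A real session would likely bound by tokens or bytes rather than chunk count, and eviction has to stay consistent with whatever KV the Thinker already holds for the dropped span.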

#### 3) Chunked prefill (Qwen-Omni)

- Scope
  - Enable chunked prefill for heavy Thinker prefills (long text + many images/frames/long audio), matching vLLM scheduler semantics: prefill chunks interleave with decode instead of monopolizing the GPU.
  - Validate interaction with prefix cache and async chunk (Thinker → Talker): chunk boundaries must not break hidden-state handoff or KV metadata consistency.
- Success criteria
  - Measurable improvement in P99 prefill latency and decode starvation under high concurrency; regression tests for chunked + multimodal + multi-stage combinations.
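
The interleaving described above can be sketched as a per-step token budget: running decode requests each take one token, and whatever budget remains goes to the next slice of the pending prefill. This is a toy model under stated assumptions; `plan_steps` and its parameters are hypothetical and the real scheduler handles many requests, preemption, and multimodal encoder budgets.

```python
from typing import List, Tuple


def plan_steps(
    prefill_tokens: int, num_decoding: int, token_budget: int
) -> List[Tuple[int, int]]:
    """Return (decode_tokens, prefill_chunk_tokens) for each scheduler step.

    Decode requests are served first (one token each); the remaining budget
    is given to the prefill, so a long prompt never monopolizes a step.
    """
    if token_budget <= num_decoding:
        raise ValueError("token budget leaves no room for prefill")
    steps: List[Tuple[int, int]] = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((num_decoding, chunk))
        remaining -= chunk
    return steps


steps = plan_steps(prefill_tokens=1000, num_decoding=6, token_budget=256)
for decode_tokens, prefill_chunk in steps:
    print(f"decode={decode_tokens} prefill_chunk={prefill_chunk}")
```

In this model every step still advances all six decoding requests, which is the property the success criteria measure as reduced decode starvation.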

#### 4) Prefill–Decode disaggregation (Qwen-Omni Thinker)

- Scope
  - Enable **P/D split** on the **Thinker** AR engine where supported, using vLLM-Omni’s **KV transfer** and connector stack (see [#1867](https://github.com/vllm-project/vllm-omni/issues/1867) F2, [#1303](https://github.com/vllm-project/vllm-omni/pull/1303)).
  - Ensure **Qwen3-Omni** multimodal prefill → decode handoff is **correct** (KV layout, request IDs, embedding merge paths); avoid regressions for **CFG / multi-cache** scenarios called out in community trackers.
  - Define how P/D coexists with **prefix caching** and **chunked prefill** (scheduler + connector ordering); document **single-node vs multi-node** deployment.
- Success criteria
  - E2E serving: prefill workers and decode workers **stable under load** for Qwen3-Omni Thinker; measured **KV transfer latency** and **TTFT** documented; CI or nightly coverage for at least one reference YAML.
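
As a starting point for a reference configuration, the sketch below builds the JSON payloads that upstream vLLM accepts for its KV transfer configuration, using the `kv_connector` and `kv_role` fields. It assumes the upstream field names and role values carry over unchanged; how vLLM-Omni exposes them per stage (for example in stage YAML) is not specified in this RFC, and the helper `kv_transfer_config` is hypothetical.

```python
import json

VALID_ROLES = {"kv_producer", "kv_consumer", "kv_both"}


def kv_transfer_config(role: str, connector: str = "LMCacheConnectorV1") -> str:
    """Serialize a KV transfer config for a prefill or decode worker.

    The connector default is only an example taken from the connectors
    named in this roadmap; it is not a recommendation.
    """
    if role not in VALID_ROLES:
        raise ValueError(f"unknown kv_role: {role}")
    return json.dumps({"kv_connector": connector, "kv_role": role})


prefill_cfg = kv_transfer_config("kv_producer")  # Thinker prefill workers
decode_cfg = kv_transfer_config("kv_consumer")   # Thinker decode workers

print(prefill_cfg)
print(decode_cfg)
```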

------

### Progress / current status (as of early Q2 2026)

|            Feature             |                            Title                             |          Author           |                             RFC                              |                         PR                          |   Status    |
| :----------------------------: | :----------------------------------------------------------: | :-----------------------: | :----------------------------------------------------------: | :-------------------------------------------------: | :---------: |
|          prefix-cache          | Enable Prefix Caching with Hidden-State I/O (Multi-round / Service Scenarios) | @alex-jw-brooks  @LJH-LBJ |    https://github.com/vllm-project/vllm-omni/issues/1184     | https://github.com/vllm-project/vllm-omni/pull/2164 | ⏳ In progress |
| streaming input & Realtime API |             Qwen3-Omni supports streaming input              | @lishunyang12 @Shirley125 @Sy0307 | https://github.com/vllm-project/vllm-omni/issues/1951 https://github.com/vllm-project/vllm-omni/issues/2201 | https://github.com/vllm-project/vllm-omni/pull/2202 https://github.com/vllm-project/vllm-omni/pull/2208 https://github.com/vllm-project/vllm-omni/pull/2342 | ⏳ In progress |
|        chunked prefill         |                   Support chunked prefill                    |           @R2-Y           |     https://github.com/vllm-project/vllm-omni/issues/948     | https://github.com/vllm-project/vllm-omni/pull/949  | ⏳ In progress |
| Prefill–Decode disaggregation | Support Prefill–Decode disaggregation via vLLM KV transfer (Qwen-Omni / Thinker track) | @spencerr221 | https://github.com/vllm-project/vllm-omni/issues/1188 | https://github.com/vllm-project/vllm-omni/pull/2220 | ⏳ In progress |
| Reinforcement Learning | RL support (GRPO/PPO) for Qwen3-Omni multi-stage models |  | https://github.com/vllm-project/vllm-omni/issues/2357 |  | 🙋 |

*Legend: ⏳ In progress · ✅ Done · 🙋 Help wanted (update the tables above as items land).*

### P1 — Strongly aligned with the Q2 program

#### 5) Reinforcement Learning Support (Qwen3-Omni)

- **Scope**
  - Enable comprehensive RL support (GRPO/PPO for Audio) for multi-stage Qwen3-Omni models. Currently vLLM-Omni is optimized for inference; RL requires intermediate trajectory data (RVQ codec tokens, text hidden states, log-probabilities) from both Thinker and Talker stages.
  - Implement **Trajectory Return**: Record intermediate text embeddings, RVQ codes, and log-probs in `Qwen3OmniMoeTalker` for RL rollout consumption.
  - Support **Custom Sampling/Exploration**: Enable custom multinomial or stochastic samplers via pipeline worker extensions for exploration in the audio generation stage.
  - **Unified RL Output Interface**: Standardize `OmniRequestOutput._custom_output` for propagating RL tensors from the Talker stage back to the training collector (e.g., the VeRL framework).
  - **Cross-Stage LoRA Support**:
    - Stage 0 (Thinker): Full vLLM-native LoRA adapters for multimodal textual reasoning alignment.
    - Stage 1 (Talker): Update LoRA managers to support `qwen3_omni` component naming and enable LoRA for Talker transformer blocks.
- **Success criteria**
  - End-to-end GRPO/PPO training loop functional with Qwen3-Omni, returning required trajectory data without inference regression.
  - LoRA fine-tuning improves both text reasoning and audio quality metrics; no naming convention conflicts between Thinker/Talker stages.
  - Documentation: RL data format specs, LoRA adapter loading guide, example training integration with VeRL.
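
As input to the RL data format discussion, here is one possible shape for a per-request Talker trajectory record carrying the items listed in the scope (RVQ codes, log-probabilities, text hidden states). Every name below is hypothetical; it is not the format proposed in #2357, and plain lists stand in for tensors to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TalkerTrajectory:
    """Hypothetical rollout record returned alongside the audio output."""

    request_id: str
    rvq_codes: List[List[int]]  # [decode step][codebook]
    logprobs: List[float]       # log-prob of the sampled code, per step
    text_hidden_states: List[List[float]] = field(default_factory=list)

    def validate(self) -> None:
        """Check that every decode step has exactly one log-prob."""
        if len(self.rvq_codes) != len(self.logprobs):
            raise ValueError("rvq_codes and logprobs must have equal length")

    def sequence_logprob(self) -> float:
        """Sum of per-step log-probs, as used in policy-gradient objectives."""
        return sum(self.logprobs)


traj = TalkerTrajectory(
    request_id="req-0",
    rvq_codes=[[1, 2], [3, 4]],
    logprobs=[-0.5, -0.25],
)
traj.validate()
print(traj.sequence_logprob())
```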

#### Other P1 Items

- Cross-stage memory / HBM coordination: Combine with "Prefix Cache and Memory Coordination" in [#2136](https://github.com/vllm-project/vllm-omni/issues/2136) and F1 in [#1867](https://github.com/vllm-project/vllm-omni/issues/1867) (static budgets, admission) so Thinker prefix caching does not starve Talker or blow VRAM budgets.
- Disaggregated Qwen-Omni: For P/D or EPDG deployments, define cross-node correctness for prefix cache + chunked prefill (builds on Q1 EPDG work); see **§ Prefill–Decode disaggregation** and the Prefill–Decode disaggregation row in the **Progress / current status** table above.
- Observability: Per-request metrics for prefill chunks, prefix hit/miss, streaming session length, etc., for joint acceptance with Q2 features.

### P2 — Stretch / ecosystem

- Extend unified streaming video + audio protocol to Qwen2.5-Omni and others ([Phase 6 in #2201](https://github.com/vllm-project/vllm-omni/issues/2201)).

------

| Symbol | Meaning                                                    |
| ------ | ---------------------------------------------------------- |
| ✅      | already supported, PR attached                             |
| 🙋      | not supported yet, help wanted!                            |
| ⏳      | not supported yet, with PR raised                          |
| ❓      | may be unnecessary to support; the benefits are minimal.   |

### Dependencies & references

- [vLLM-Omni 2026 Q1 Roadmap #677](https://github.com/vllm-project/vllm-omni/issues/677) — Qwen3-Omni and streaming I/O baseline.
- [vLLM-Omni 2026 Q2 Roadmap (collecting ideas) #2136](https://github.com/vllm-project/vllm-omni/issues/2136) — project-wide Q2 themes.
- [Streaming Video Input RFC #2201](https://github.com/vllm-project/vllm-omni/issues/2201) — streaming video with Qwen3-Omni.
- [Multi-Stage KV Cache Management #1867](https://github.com/vllm-project/vllm-omni/issues/1867) — prefix / offload / cross-stage KV index; **F2** tracks P/D disaggregation.
- [Omni Connector Full Disaggregation #1192](https://github.com/vllm-project/vllm-omni/issues/1192) — P/D and multi-node connector roadmap.
- [Support Prefill–Decode disaggregation #1303](https://github.com/vllm-project/vllm-omni/pull/1303) — core P/D feature PR (see discussion / split plan in #1867).
- [Qwen-Omni P/D + KV transfer #1188](https://github.com/vllm-project/vllm-omni/issues/1188) · [PR #2220](https://github.com/vllm-project/vllm-omni/pull/2220) — Thinker-focused track.
- [Reinforcement learning support for Qwen3-Omni #2357](https://github.com/vllm-project/vllm-omni/issues/2357) — RL for multi-stage models (GRPO/PPO, trajectory return, LoRA alignment).

------

### Call for contributions

RFCs may use the project [design doc template](https://docs.google.com/document/d/12YxSsVeD1jvL-InClkeAEnZyWFDndz_65JmXvsamuV4/edit?usp=sharing) (same as [#677](https://github.com/vllm-project/vllm-omni/issues/677)). Please claim sub-tasks and link issues/PRs under [#2136](https://github.com/vllm-project/vllm-omni/issues/2136) or a dedicated tracking issue.

### CC List

[@hsliuustc0106](https://github.com/hsliuustc0106) [@Gaohan123](https://github.com/Gaohan123) [@tzhouam](https://github.com/tzhouam) [@R2-Y](https://github.com/R2-Y) [@Shirley125](https://github.com/Shirley125) [@princepride](https://github.com/princepride) [@lishunyang12](https://github.com/lishunyang12) [@alex-jw-brooks](https://github.com/alex-jw-brooks) [@LJH-LBJ](https://github.com/LJH-LBJ) [@ZeldaHuang](https://github.com/ZeldaHuang) [@wtomin](https://github.com/wtomin) [@ZJY0516](https://github.com/ZJY0516) [@knlnguyen1802](https://github.com/knlnguyen1802) [@natureofnature](https://github.com/natureofnature) [@SamitHuang](https://github.com/SamitHuang) 
