[RFC]: vLLM-Omni NPU 2026 Q2 Roadmap #2223

## 2026 Q2 Roadmap

2026 Q1 Roadmap: https://github.com/vllm-project/vllm-omni/issues/886

### Version match

| vLLM | vLLM-Ascend | vLLM-Omni | MindIE-SD (optional) | Status |
|------|-------------|-----------|----------------------|--------|
| v0.19.0 | f1f0870 | 9d1392d | main | released |
| v0.20.0 | 07f6fec | v0.20.0 | main | released |
| v0.21.0 | https://github.com/gcanlin/vllm-ascend/tree/cann85-with-v0.21.0-adapt (personal branch, used as a workaround) | d4c139508fc490f856d8c22367cb85902caffb90 | main | released |
| v0.21.0 | pending a main-branch commit or release | d4c139508fc490f856d8c22367cb85902caffb90 | main | pending |
| v0.22.0 | bb4d0776eee8fc45c3484a45c971a7049f1a2bbf | v0.22.0 | main | pending |
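
The matrix above could also be kept in machine-readable form so that CI can refuse to run against a combination that has not been released. The following is only a sketch: `Pin`, `VERSION_MATRIX`, and `released_pin` are hypothetical names that do not exist in vLLM-Omni, and only a subset of the rows is copied in.

```python
from typing import NamedTuple, Optional


class Pin(NamedTuple):
    """One row of the version-match table (hypothetical structure)."""
    vllm: str
    vllm_ascend: str
    vllm_omni: str
    status: str


# A subset of the rows from the table above, copied verbatim.
VERSION_MATRIX = [
    Pin("v0.19.0", "f1f0870", "9d1392d", "released"),
    Pin("v0.20.0", "07f6fec", "v0.20.0", "released"),
    Pin("v0.22.0", "bb4d0776eee8fc45c3484a45c971a7049f1a2bbf", "v0.22.0", "pending"),
]


def released_pin(vllm_version: str) -> Optional[Pin]:
    """Return the released pin for a vLLM version, or None if there is none."""
    for pin in VERSION_MATRIX:
        if pin.vllm == vllm_version and pin.status == "released":
            return pin
    return None
```

A CI job could call `released_pin("v0.20.0")` to pick the matching vLLM-Ascend commit, and treat a `None` result (as for the pending v0.22.0 row) as a reason to skip or fail early.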

### Background

We completed the initial Ascend NPU enablement in Q1 and established the basic version alignment flow across vLLM, vLLM-Ascend, and vLLM-Omni. In Q2 (April 1, 2026 to June 30, 2026), we will continue closing Q1 carry-over items that directly block production use, while shifting the main focus from initial bring-up to production readiness, performance guardrails, and architecture scalability.

Our goal for Q2 is to make NPU a continuously validated, performant, and maintainable first-class backend for vLLM-Omni, rather than a special path that depends on manual sync and repeated downstream patching.

### Q2 Focus

#### 1. CI / CD

We will build a stable, fast, and healthy NPU CI system with developer experience comparable to GPU CI.

Key goals:
- Make NPU CI efficiency comparable to GPU CI by separating presubmit, extended, and nightly lanes, and by reducing duplicated setup and redundant stages.
- Expand NPU CI to cover at least 95% of the test cases that already run on GPU and are expected to be NPU-compatible.
- Add automated performance guardrails for key models, especially Wan2.2, HunYuan-Image, and Qwen-Image-Edit, so regressions are detected before release.
- Establish CI health monitoring, flaky test tracking, and rerun policy, so NPU CI remains actionable instead of noisy.
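
One way to read the performance guardrail goal is as a threshold check against a stored baseline. The sketch below assumes a latency metric and a 5% tolerance; the `Guardrail` class, the `is_regression` helper, and the baseline value are illustrative placeholders, not measured numbers or existing vLLM-Omni code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """Baseline for one model (hypothetical structure)."""
    model: str
    baseline_latency_s: float  # placeholder value, not a real measurement
    tolerance: float = 0.05    # assumed: allow up to 5% slowdown


def is_regression(guardrail: Guardrail, measured_latency_s: float) -> bool:
    """True if the measured latency exceeds the baseline plus tolerance."""
    limit = guardrail.baseline_latency_s * (1.0 + guardrail.tolerance)
    return measured_latency_s > limit


# Usage with a made-up baseline of 10 seconds.
wan = Guardrail(model="Wan2.2", baseline_latency_s=10.0)
within_budget = not is_regression(wan, 10.4)
too_slow = is_regression(wan, 10.6)
```

A relative tolerance is used here rather than an absolute one so that the same rule can apply to models with very different runtimes; the right value would depend on the run-to-run noise observed on NPU CI.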

#### 2. Diffusion Models

We will improve both performance and scalability of diffusion serving on NPU.

Key goals:
- **Compile backend**: Integrate a compile backend for diffusion models, with MindIE-SD as the primary candidate, to improve end-to-end throughput and latency. https://github.com/vllm-project/vllm-omni/pull/2466
- **Attention backend**: Introduce sparse attention support for diffusion workloads, especially for long-context and high-resolution scenarios where dense attention becomes a major bottleneck. https://github.com/vllm-project/vllm-omni/issues/2632
- **Quantization**: https://github.com/vllm-project/vllm-omni/issues/2438
- **Parallel feature**: Validate the full matrix of supported parallel strategies and their combinations through CI, instead of relying on ad hoc manual verification.
- **Test cases**: Continuously monitor performance for representative diffusion workloads to prevent regressions across releases.
- **New models**: Improve the Day-0 onboarding path for new diffusion models with a clearer bring-up checklist, compatibility matrix, and benchmark workflow.
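
The parallel feature goal above amounts to enumerating every combination of strategy degrees that fits a given device count and running each one in CI. A minimal sketch follows, assuming that the degrees of combined strategies multiply to the number of devices; the strategy names `strategy_a`, `strategy_b`, and `strategy_c` and their degrees are hypothetical stand-ins for whatever the backend actually supports.

```python
from itertools import product
from math import prod

# Hypothetical strategy names and candidate degrees, for illustration only.
STRATEGY_DEGREES = {
    "strategy_a": [1, 2],
    "strategy_b": [1, 2, 4],
    "strategy_c": [1, 2],
}


def parallel_configs(degrees: dict, num_devices: int) -> list:
    """List every degree combination whose product equals num_devices."""
    names = list(degrees)
    configs = []
    for combo in product(*(degrees[name] for name in names)):
        # Assumption: combined strategies must use exactly num_devices devices.
        if prod(combo) == num_devices:
            configs.append(dict(zip(names, combo)))
    return configs


configs_for_4 = parallel_configs(STRATEGY_DEGREES, num_devices=4)
```

Each resulting dictionary could be turned into one parametrized CI test case, which replaces ad hoc manual selection with an explicit, reviewable list.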

Key models:
- Wan2.2: https://github.com/vllm-project/vllm-omni/issues/1355
- HunYuan-Video 1.5: https://github.com/vllm-project/vllm-omni/issues/2468
- Qwen-Image series

#### 3. Omni / TTS Models

We will make Omni/TTS serving production-ready on NPU.

Key goals:
- **ACLGraph**: Refactor the talker graph into a hardware-scalable execution path that can support ACLGraph cleanly.
- **Performance optimization**: Optimize talker and code2wav operator performance to improve real-time factor (RTF), time to first packet (TTFP), and end-to-end latency.
- **Test cases**: Add accuracy tests, long-running stability tests, and production-readiness validation for Omni/TTS serving.
- **New models**: Improve the Day-0 onboarding path for new Omni/TTS models so new model support can land faster with less one-off backend work.
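
For reference, the two TTS metrics named above can be computed as follows. This sketch assumes the common definitions: RTF is synthesis wall-clock time divided by the duration of the generated audio (lower is better, and below 1.0 means faster than real time), and TTFP is the delay between sending the request and receiving the first audio packet. The function names are illustrative and not part of vLLM-Omni.

```python
def real_time_factor(synthesis_time_s: float, audio_duration_s: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    return synthesis_time_s / audio_duration_s


def time_to_first_packet(request_sent_s: float, packet_times_s: list) -> float:
    """TTFP = arrival time of the earliest packet minus the request time."""
    if not packet_times_s:
        raise ValueError("no packets were received")
    return min(packet_times_s) - request_sent_s


# Usage with made-up timings: 2 s of compute for 8 s of audio.
rtf = real_time_factor(synthesis_time_s=2.0, audio_duration_s=8.0)
ttfp = time_to_first_packet(request_sent_s=10.0, packet_times_s=[10.5, 11.0])
```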

Key models:
- Qwen3-Omni
- Qwen3-TTS: https://github.com/vllm-project/vllm-omni/issues/2328

#### 4. AR + DiT Models

Key models:
- GLM-Image: https://github.com/vllm-project/vllm-omni/issues/2834
- HunYuan-Image 3.0: https://github.com/vllm-project/vllm-omni/issues/2015

#### 5. Overall Architecture

We will reduce long-term maintenance cost and upstream alignment friction.

Key goals:
- **Version upgrade**: Make NPU integration more resilient to upstream upgrades, so new vLLM or vLLM-Ascend versions require targeted interface alignment instead of repeated code duplication.
- **Model Runner V1**: Reduce the current coupling to vLLM-Ascend by replacing broad ModelRunner monkey patches with better extension points, instrumentation, or plugin-style hooks.
- **Model Runner V2**: Adapt the NPU path to Model Runner V2 and use this migration to remove redundant legacy logic where possible.
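
To illustrate the difference between monkey patching and plugin-style hooks, the sketch below shows a small hook registry: the core code calls named extension points, and a backend registers callbacks instead of replacing whole methods. Everything here is hypothetical, including the `HookRegistry` class and the `pre_forward` hook name; it does not describe an existing vLLM or vLLM-Ascend interface.

```python
from collections import defaultdict
from typing import Any, Callable


class HookRegistry:
    """Maps named extension points to ordered lists of callbacks."""

    def __init__(self) -> None:
        self._hooks = defaultdict(list)

    def register(self, point: str, fn: Callable[[Any], Any]) -> None:
        """Attach a callback to an extension point."""
        self._hooks[point].append(fn)

    def run(self, point: str, value: Any) -> Any:
        """Pass value through every callback registered for the point, in order."""
        for fn in self._hooks[point]:
            value = fn(value)
        return value


# Usage: a backend adjusts inputs without patching the runner itself.
registry = HookRegistry()
registry.register("pre_forward", lambda batch: {**batch, "device": "npu"})
prepared = registry.run("pre_forward", {"tokens": [1, 2, 3]})
untouched = registry.run("unregistered_point", {"tokens": [1]})
```

With this shape, an upstream upgrade only breaks the backend when the signature of a hook point changes, which is a narrower surface than the full body of a patched method.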

### Exit Criteria

By the end of Q2:
- NPU CI should be fast enough to serve as a normal PR signal rather than only a post-merge or manual validation path.
- NPU CI should cover at least 95% of GPU test cases that are expected to be NPU-compatible and include automated performance regression detection for key diffusion models.
- Diffusion compile backend support should land and be validated across supported parallel configurations.
- Talker ACLGraph support and Omni/TTS performance improvements should be backed by accuracy and long-running stability validation.
- The NPU backend should rely less on repeated ModelRunner forks, and the first phase of Model Runner V2 adaptation should be completed.

### Milestones

- April 2026: finalize CI/CD plan, land the first round of CI speedups, start diffusion compile backend integration, and finish the talker graph refactor design.
- May 2026: bring up performance guardrails, land the diffusion compile backend MVP, validate parallel strategy combinations in CI, and land the first round of talker/code2wav optimizations.
- June 2026: complete accuracy and long-running stability coverage, harden Day-0 model onboarding, finish the first phase of architecture decoupling, and land initial Model Runner V2 adaptation.

### CC List

cc @hsliuustc0106 @Yikun @wangxiyuan @Fishermanykx @jiangmengyu18 @blian6 @wtomin @SamitHuang @ZJY0516 @linyueqian @david6666666 @Gaohan123 @tzhouam @princepride

