[RFC]: vLLM-Omni NPU 2026 Q2 Roadmap #2223

@gcanlin

2026 Q2 Roadmap

2026 Q1 Roadmap: #886

Version match

| vLLM | vLLM-Ascend | vLLM-Omni | MindIE-SD (Optional) | Status |
|---|---|---|---|---|
| v0.19.0 | f1f0870 | 9d1392d | main | released |
| v0.20.0 | 07f6fec | v0.20.0 | main | released |
| v0.21.0 | https://github.com/gcanlin/vllm-ascend/tree/cann85-with-v0.21.0-adapt (personal branch as a workaround) | d4c1395 | main | released |
| v0.21.0 | pending for main branch or release | d4c1395 | main | pending |
| v0.22.0 | bb4d0776eee8fc45c3484a45c971a7049f1a2bbf | v0.22.0 | main | pending |

Background

We completed the initial Ascend NPU enablement in Q1 and established the basic version alignment flow across vLLM, vLLM-Ascend, and vLLM-Omni. In Q2 (April 1, 2026 to June 30, 2026), we will continue closing Q1 carry-over items that directly block production use, while shifting the main focus from initial bring-up to production readiness, performance guardrails, and architecture scalability.

Our goal for Q2 is to make NPU a continuously validated, performant, and maintainable first-class backend for vLLM-Omni, rather than a special path that depends on manual sync and repeated downstream patching.

Q2 Focus

1. CI / CD

We will build a stable, fast, and healthy NPU CI system with developer experience comparable to GPU CI.

Key goals:

  • Make NPU CI efficiency comparable to GPU CI by separating presubmit, extended, and nightly lanes, and by reducing duplicated setup and redundant stages.
  • Expand NPU CI to cover at least 95% of the test cases that already run on GPU and are expected to be NPU-compatible.
  • Add automated performance guardrails for key models, especially Wan2.2, Hunyuan Image, and Qwen-Image-Edit, so regressions are detected before release.
  • Establish CI health monitoring, flaky test tracking, and rerun policy, so NPU CI remains actionable instead of noisy.
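
The performance guardrail goal above could be sketched as a simple latency gate. This is a minimal illustration, not the project's actual CI code: the `Baseline` type, the function names, the 5% tolerance, and the latency numbers in the usage note are all hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Baseline:
    """Reference end-to-end latency for one model, in seconds (hypothetical)."""
    model: str
    latency_s: float


def check_regression(baseline: Baseline, measured_s: float, tolerance: float = 0.05) -> bool:
    """Return True when the measured latency stays within the allowed slowdown."""
    if baseline.latency_s <= 0:
        raise ValueError("baseline latency must be positive")
    slowdown = (measured_s - baseline.latency_s) / baseline.latency_s
    return slowdown <= tolerance


def failing_models(
    baselines: List[Baseline],
    measurements: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """List models whose latency regressed beyond the tolerance."""
    failures = []
    for baseline in baselines:
        measured = measurements.get(baseline.model)
        # A missing measurement counts as a failure so gaps are not silently ignored.
        if measured is None or not check_regression(baseline, measured, tolerance):
            failures.append(baseline.model)
    return failures
```

A nightly lane could call `failing_models(...)` with stored baselines and fresh measurements, and fail the job when the returned list is non-empty. Treating a missing measurement as a failure is a design choice in this sketch so that a skipped benchmark does not pass unnoticed.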

2. Diffusion Models

We will improve both performance and scalability of diffusion serving on NPU.

Key goals:

  • Compile backend: Integrate a compile backend for diffusion models, with MindIE-SD as the primary candidate, to improve end-to-end throughput and latency. See "[NPU] Support mindie-sd compile backend" (#2466).
  • Attention backend: Introduce sparse attention support for diffusion workloads, especially for long-context and high-resolution scenarios where dense attention becomes a major bottleneck. See "[RFC]: Per-Role Attention Backend Configuration for Diffusion Models in vllm-omni" (#2632).
  • Quantization: See "[RFC]: Continuous Quantization Support for NPU" (#2438).
  • Parallel feature: Validate the full matrix of supported parallel strategies and their combinations through CI, instead of relying on ad hoc manual verification.
  • Test cases: Continuously monitor performance for representative diffusion workloads to prevent regressions across releases.
  • New models: Improve the Day-0 onboarding path for new diffusion models with a clearer bring-up checklist, compatibility matrix, and benchmark workflow.
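
To validate parallel strategy combinations through CI, the matrix of configurations has to be enumerated somewhere. The sketch below shows one way to generate it. The axis names (`tensor_parallel`, `sequence_parallel`) and the degrees are hypothetical; the roadmap does not list the actual supported strategies, and the rule that the product of degrees must fit the device count is an assumption of this example.

```python
from itertools import product
from math import prod
from typing import Dict, List


def parallel_matrix(axes: Dict[str, List[int]], num_devices: int) -> List[Dict[str, int]]:
    """Enumerate every combination of degrees whose product fits on the devices."""
    names = list(axes)
    configs = []
    for degrees in product(*(axes[name] for name in names)):
        # Assumption: a combination is runnable only if it needs no more
        # devices than are available on the CI machine.
        if prod(degrees) <= num_devices:
            configs.append(dict(zip(names, degrees)))
    return configs


# Hypothetical axes for illustration only.
example_axes = {"tensor_parallel": [1, 2], "sequence_parallel": [1, 2, 4]}
example_configs = parallel_matrix(example_axes, num_devices=4)
```

Each dictionary in `example_configs` could become one parametrized CI case, so adding a new degree or strategy extends the tested matrix without hand-written test lists.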

Key models:

3. Omni / TTS Models

We will make Omni/TTS serving production-ready on NPU.

Key goals:

  • ACLGraph: Refactor the talker graph into a hardware-scalable execution path that can support ACLGraph cleanly.
  • Performance Optimization: Optimize talker and code2wav operator performance to improve RTF, TTFP, and end-to-end latency.
  • Test cases: Add accuracy tests, long-running stability tests, and production-readiness validation for Omni/TTS serving.
  • New models: Improve the Day-0 onboarding path for new Omni/TTS models so new model support can land faster with less one-off backend work.
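
For the RTF target mentioned above, a minimal sketch of the metric follows. It assumes the conventional definition of real-time factor (processing time divided by the duration of the generated audio); the roadmap itself does not define the metric, and the function and parameter names are hypothetical.

```python
def real_time_factor(processing_s: float, num_samples: int, sample_rate_hz: int) -> float:
    """Compute RTF as processing time divided by generated audio duration."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    audio_s = num_samples / sample_rate_hz
    if audio_s <= 0:
        raise ValueError("generated audio must have positive duration")
    return processing_s / audio_s
```

Under this definition a value below 1.0 means audio is produced faster than it plays back. For example, `real_time_factor(0.5, 48000, 24000)` covers 2 seconds of audio generated in 0.5 seconds.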

Key models:

4. AR + DiT Models

5. Overall Architecture

We will reduce long-term maintenance cost and upstream alignment friction.

Key goals:

  • Version upgrade: Reduce the current coupling to vLLM-Ascend by replacing broad ModelRunner monkey patches with better extension points, instrumentation, or plugin-style hooks.
  • Model Runner V1: Make NPU integration more resilient to upstream upgrades, so new vLLM or vLLM-Ascend versions require targeted interface alignment instead of repeated code duplication.
  • Model Runner V2: Adapt the NPU path to Model Runner V2 and use this migration to remove redundant legacy logic where possible.
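
The contrast between monkey patching and plugin-style hooks can be illustrated with a toy registry. Everything here is hypothetical: `HookRegistry`, `ToyRunner`, and the `after_prepare_inputs` event are invented for the example and are not vLLM or vLLM-Ascend APIs. The point is only that a backend registers a callback at a named extension point instead of replacing a whole method.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class HookRegistry:
    """Stores callbacks per named extension point (hypothetical design)."""

    def __init__(self) -> None:
        self._hooks: DefaultDict[str, List[Callable[[Any], Any]]] = defaultdict(list)

    def register(self, event: str) -> Callable[[Callable[[Any], Any]], Callable[[Any], Any]]:
        """Decorator that attaches a callback to the given event."""
        def decorator(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
            self._hooks[event].append(fn)
            return fn
        return decorator

    def run(self, event: str, value: Any) -> Any:
        """Pass the value through every callback registered for the event."""
        for fn in self._hooks[event]:
            value = fn(value)
        return value


class ToyRunner:
    """Stand-in for a model runner that exposes one extension point."""

    def __init__(self, hooks: HookRegistry) -> None:
        self.hooks = hooks

    def prepare_inputs(self, batch: dict) -> dict:
        prepared = dict(batch)  # copy so the caller's dict is not mutated
        return self.hooks.run("after_prepare_inputs", prepared)


hooks = HookRegistry()


@hooks.register("after_prepare_inputs")
def add_device_tag(batch: dict) -> dict:
    # Backend-specific adjustment lives in the hook, not in a patched method.
    batch["device"] = "npu"
    return batch


result = ToyRunner(hooks).prepare_inputs({"tokens": [1, 2, 3]})
```

With this shape, an upstream change to the body of `prepare_inputs` does not invalidate the backend code as long as the extension point keeps its contract, which is the kind of targeted interface alignment the goals above describe.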

Exit Criteria

By the end of Q2:

  • NPU CI should be fast enough to serve as a normal PR signal rather than only a post-merge or manual validation path.
  • NPU CI should cover most GPU-compatible test cases and include automated performance regression detection for key diffusion models.
  • Diffusion compile backend support should land and be validated across supported parallel configurations.
  • Talker ACLGraph support and Omni/TTS performance improvements should be backed by accuracy and long-running stability validation.
  • The NPU backend should rely less on repeated ModelRunner forks, and the first phase of Model Runner V2 adaptation should be completed.

Milestones

  • April 2026: finalize CI/CD plan, land the first round of CI speedups, start diffusion compile backend integration, and finish the talker graph refactor design.
  • May 2026: bring up performance guardrails, land the diffusion compile backend MVP, validate parallel strategy combinations in CI, and land the first round of talker/code2wav optimizations.
  • June 2026: complete accuracy and long-running stability coverage, harden Day-0 model onboarding, finish the first phase of architecture decoupling, and land initial Model Runner V2 adaptation.

CC List

cc @hsliuustc0106 @Yikun @wangxiyuan @Fishermanykx @jiangmengyu18 @blian6 @wtomin @SamitHuang @ZJY0516 @linyueqian @david6666666 @Gaohan123 @tzhouam @princepride

Labels

  • Hardware Plugin: support different hardware beyond cuda
  • NPU: PR related to Ascend NPU
  • good first issue: Good for newcomers
  • help wanted: Extra attention is needed
  • high priority: high priority issue, needs to be done asap