We completed the initial Ascend NPU enablement in Q1 and established the basic version alignment flow across vLLM, vLLM-Ascend, and vLLM-Omni. In Q2 (April 1, 2026 to June 30, 2026), we will continue closing Q1 carry-over items that directly block production use, while shifting the main focus from initial bring-up to production readiness, performance guardrails, and architecture scalability.
Our goal for Q2 is to make NPU a continuously validated, performant, and maintainable first-class backend for vLLM-Omni, rather than a special path that depends on manual sync and repeated downstream patching.
Q2 Focus
1. CI / CD
We will build a stable, fast, and healthy NPU CI system with developer experience comparable to GPU CI.
Key goals:
Make NPU CI efficiency comparable to GPU CI by separating presubmit, extended, and nightly lanes, and by reducing duplicated setup and redundant stages.
Expand NPU CI to cover at least 95% of the test cases that already run on GPU and are expected to be NPU-compatible.
Add automated performance guardrails for key models, especially Wan2.2, Hunyuan Image, and Qwen-Image-Edit, so regressions are detected before release.
Establish CI health monitoring, flaky test tracking, and rerun policy, so NPU CI remains actionable instead of noisy.
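As an illustration of the performance guardrail goal above, the following minimal sketch compares measured latencies against stored baselines and reports any model that regresses beyond a tolerance. The `Baseline` class, the function names, the 5% default tolerance, and all latency numbers are hypothetical and chosen only for this example; the actual guardrail design is still to be defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Reference latency for one model (hypothetical structure)."""
    model: str
    latency_s: float          # reference end-to-end latency in seconds
    tolerance: float = 0.05   # allowed relative slowdown (assumed 5%)


def within_budget(baseline: Baseline, measured_latency_s: float) -> bool:
    """Return True when the measurement stays within the tolerated slowdown."""
    limit = baseline.latency_s * (1.0 + baseline.tolerance)
    return measured_latency_s <= limit


def find_regressions(baselines: list[Baseline],
                     measurements: dict[str, float]) -> list[str]:
    """Collect one human-readable message per failing or missing model."""
    failures = []
    for baseline in baselines:
        measured = measurements.get(baseline.model)
        if measured is None:
            # A missing measurement is treated as a failure so that
            # silently skipped benchmarks do not pass the guardrail.
            failures.append(f"{baseline.model}: no measurement recorded")
        elif not within_budget(baseline, measured):
            slowdown = measured / baseline.latency_s - 1.0
            failures.append(
                f"{baseline.model}: {slowdown:.1%} slower than baseline")
    return failures


if __name__ == "__main__":
    # Illustrative numbers only, not real benchmark results.
    baselines = [Baseline("Wan2.2", 10.0), Baseline("Qwen-Image-Edit", 4.0)]
    measurements = {"Wan2.2": 10.4, "Qwen-Image-Edit": 4.5}
    for message in find_regressions(baselines, measurements):
        print(message)
```

A CI job could fail the release lane whenever `find_regressions` returns a non-empty list, which keeps the signal binary and actionable.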
2. Diffusion Models
We will improve both performance and scalability of diffusion serving on NPU.
Key goals:
Compile backend: Integrate a compile backend for diffusion models, with MindIE-SD as the primary candidate, to improve end-to-end throughput and latency. Tracking issue: [NPU] Support mindie-sd compile backend (#2466).
Parallel feature: Validate the full matrix of supported parallel strategies and their combinations through CI, instead of relying on ad hoc manual verification.
Test cases: Continuously monitor performance for representative diffusion workloads to prevent regressions across releases.
New models: Improve the Day-0 onboarding path for new diffusion models with a clearer bring-up checklist, compatibility matrix, and benchmark workflow.
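To make the "full matrix of parallel strategies" goal above concrete, here is a minimal sketch of how a CI job might enumerate the combinations to validate. The strategy names (`sp`, `cfg`, `tp`) and their candidate degrees are placeholders, not the list of strategies vLLM-Omni actually supports, and the rule that the degrees must multiply to the device count is an assumption made for this example.

```python
from itertools import product
from math import prod


def parallel_matrix(degrees: dict[str, list[int]],
                    world_size: int) -> list[dict[str, int]]:
    """Enumerate every combination of per-strategy degrees whose product
    equals the number of available devices."""
    names = list(degrees)
    configs = []
    for combo in product(*(degrees[name] for name in names)):
        if prod(combo) == world_size:
            configs.append(dict(zip(names, combo)))
    return configs


if __name__ == "__main__":
    # Hypothetical strategies and degrees for a 4-device CI runner.
    candidates = {"sp": [1, 2, 4], "cfg": [1, 2], "tp": [1, 2]}
    for config in parallel_matrix(candidates, world_size=4):
        print(config)
```

Each resulting dictionary could feed one parametrized test case, so that adding a new strategy or degree extends the matrix without hand-written test lists.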
5. Overall Architecture
We will reduce long-term maintenance cost and upstream alignment friction.
Key goals:
Version upgrade: Reduce the current coupling to vLLM-Ascend by replacing broad ModelRunner monkey patches with better extension points, instrumentation, or plugin-style hooks.
Model Runner V1: Make NPU integration more resilient to upstream upgrades, so new vLLM or vLLM-Ascend versions require targeted interface alignment instead of repeated code duplication.
Model Runner V2: Adapt the NPU path to Model Runner V2 and use this migration to remove redundant legacy logic where possible.
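The difference between a broad monkey patch and a plugin-style hook can be sketched as follows. Everything here is hypothetical: `HookRegistry`, `BaseRunner`, the `after_prepare_inputs` hook point, and the `layout` field are stand-ins invented for illustration and do not correspond to real vLLM or vLLM-Ascend classes or APIs.

```python
from collections import defaultdict
from typing import Callable


class HookRegistry:
    """Maps named extension points to an ordered list of callbacks."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[[dict], dict]]] = defaultdict(list)

    def register(self, point: str) -> Callable:
        """Decorator that attaches a callback to an extension point."""
        def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
            self._hooks[point].append(fn)
            return fn
        return decorator

    def run(self, point: str, value: dict) -> dict:
        """Pass the value through every callback registered for the point."""
        for fn in self._hooks.get(point, []):
            value = fn(value)
        return value


class BaseRunner:
    """Stand-in for an upstream model runner (not the real vLLM class)."""

    def __init__(self, hooks: HookRegistry) -> None:
        self.hooks = hooks

    def prepare_inputs(self, batch: dict) -> dict:
        prepared = dict(batch)  # upstream logic would live here
        # The backend customizes behavior at a narrow, named point
        # instead of replacing the whole method.
        return self.hooks.run("after_prepare_inputs", prepared)


hooks = HookRegistry()


@hooks.register("after_prepare_inputs")
def add_npu_layout(batch: dict) -> dict:
    """Hypothetical NPU-specific adjustment applied via the hook."""
    return {**batch, "layout": "npu"}


if __name__ == "__main__":
    runner = BaseRunner(hooks)
    print(runner.prepare_inputs({"tokens": [1, 2]}))
```

With this shape, an upstream change to `prepare_inputs` only breaks the backend if the hook point itself changes, which is the kind of targeted interface alignment the goals above aim for.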
Exit Criteria
By the end of Q2:
NPU CI should be fast enough to serve as a normal PR signal rather than only a post-merge or manual validation path.
NPU CI should cover at least 95% of the GPU test cases that are expected to be NPU-compatible, and include automated performance regression detection for key diffusion models.
Diffusion compile backend support should land and be validated across supported parallel configurations.
Talker ACLGraph support and Omni/TTS performance improvements should be backed by accuracy and long-stability validation.
The NPU backend should rely less on repeated ModelRunner forks, and the first phase of Model Runner V2 adaptation should be completed.
Milestones
April 2026: finalize CI/CD plan, land the first round of CI speedups, start diffusion compile backend integration, and finish the talker graph refactor design.
May 2026: bring up performance guardrails, land the diffusion compile backend MVP, validate parallel strategy combinations in CI, and land the first round of talker/code2wav optimizations.
June 2026: complete accuracy and long-stability coverage, harden Day-0 model onboarding, finish the first phase of architecture decoupling, and land initial Model Runner V2 adaptation.
2026 Q2 Roadmap
2026 Q1 Roadmap: #886
Version match
Background
Q2 Focus
1. CI / CD
We will build a stable, fast, and healthy NPU CI system with developer experience comparable to GPU CI.
Key goals:
2. Diffusion Models
We will improve both performance and scalability of diffusion serving on NPU.
Key goals:
Key models:
3. Omni / TTS Models
We will make Omni/TTS serving production-ready on NPU.
Key goals:
Key models:
4. AR + DiT Models
5. Overall Architecture
We will reduce long-term maintenance cost and upstream alignment friction.
Key goals:
Exit Criteria
By the end of Q2:
Milestones
CC List
cc @hsliuustc0106 @Yikun @wangxiyuan @Fishermanykx @jiangmengyu18 @blian6 @wtomin @SamitHuang @ZJY0516 @linyueqian @david6666666 @Gaohan123 @tzhouam @princepride