[RFC]: vLLM-Omni NPU 2026 Q1 Roadmap

### Background

We have completed the initial Ascend NPU enablement in vllm-omni v0.11.0rc1 and v0.12.0rc1, with support for most mainstream models such as Qwen3-Omni and the Qwen-Image series.

Building on this foundation, the next phase focuses on expanding model coverage and optimizing performance, with a roadmap to improve scalability, stability, and overall serving efficiency on Ascend NPU.

#### Version match

Currently, vLLM-Omni’s NPU support depends on vLLM-Ascend, the Ascend support plugin of vLLM. The AR (auto-regressive) path is jointly supported by vLLM and vLLM-Ascend.

Meanwhile, MindIE-SD is a standalone Ascend-optimized diffusion operator library. It is currently integrated through the `FlashAttentionBackend` and a set of `CustomOp` implementations, providing Ascend-native operators that improve the performance of diffusion models.

We are also building a separate platform plugin layer in vLLM-Omni so that additional hardware backends can be supported more easily in the future.

| vLLM | vLLM-Ascend | vLLM-Omni | MindIE-SD (optional) | Status |
|------|-------------|-----------|----------------------|--------|
| v0.11.0 | v0.11.0rc2 | v0.11.0rc1 | NA | released |
| v0.12.0 | v0.12.0rc1 | v0.12.0rc1 | main | released |
| v0.14.0 | v0.14.0rc1 | v0.14.0 | main | released |
| v0.15.0 | v0.15.0rc1 | v0.15.0rc1 | main | skipped |
| v0.16.0 | `e2175d9` | v0.16.0 | main | released |
| v0.17.0 | v0.17.0rc1 | 9718a9 ~8e120 | main | released |
| v0.18.0 | v0.18.0rc1 | v0.18.0 | main | released |
| v0.19.0 | `f1f0870` | main | main | developing |
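
The matrix above can also be expressed as a small lookup, which may be handy as a pre-flight check in an install script. This is a minimal sketch only: `Pins`, `COMPAT`, and `matching_versions` are hypothetical names and not part of vLLM-Omni, and only the rows marked as released are encoded.

```python
from typing import Dict, NamedTuple, Optional


class Pins(NamedTuple):
    """Versions that go together with one vLLM release (hypothetical helper type)."""

    vllm_ascend: str
    vllm_omni: str
    mindie_sd: Optional[str]  # None where the table says "NA"


# Copied verbatim from the rows marked "released" in the table above.
COMPAT: Dict[str, Pins] = {
    "0.11.0": Pins("v0.11.0rc2", "v0.11.0rc1", None),
    "0.12.0": Pins("v0.12.0rc1", "v0.12.0rc1", "main"),
    "0.14.0": Pins("v0.14.0rc1", "v0.14.0", "main"),
    "0.16.0": Pins("e2175d9", "v0.16.0", "main"),
    "0.17.0": Pins("v0.17.0rc1", "9718a9 ~8e120", "main"),
    "0.18.0": Pins("v0.18.0rc1", "v0.18.0", "main"),
}


def matching_versions(vllm_version: str) -> Pins:
    """Return the pinned companion versions for a vLLM release such as 'v0.16.0'."""
    key = vllm_version.lstrip("v")
    if key not in COMPAT:
        raise ValueError(f"no released NPU combination recorded for vLLM {vllm_version}")
    return COMPAT[key]
```

For example, `matching_versions("v0.16.0").vllm_ascend` gives the vLLM-Ascend commit to check out, while a skipped or still-developing release raises `ValueError`.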

#### How to install MindIE-SD

Official Link: [MindIE-SD](https://gitcode.com/Ascend/MindIE-SD)

We are actively working to simplify the installation of MindIE-SD. Eventually, it will be installable with `pip install mindie-sd`. For now, it has to be built from source:

```
git clone https://gitcode.com/Ascend/MindIE-SD.git && cd MindIE-SD
# Comment out the line `source ${current_script_dir}/build_tik_ops.sh` in build/build_ops.sh
sed -i 's|^\(\s*\)source ${current_script_dir}/build_tik_ops.sh|\1# source ${current_script_dir}/build_tik_ops.sh|' build/build_ops.sh
python setup.py bdist_wheel
cd dist
pip install mindiesd-*.whl
```
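
After installing the wheel, it can help to confirm that the package is visible to the current Python environment. The following is a minimal sketch using only the standard library. The distribution name `mindiesd` is an assumption inferred from the wheel filename pattern above, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "mindiesd") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("mindiesd is not installed in this environment")
    else:
        print(f"mindiesd {version} is installed")
```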

### Feature Support

#### Omni(AR+Generator) Pipeline

- [x] Async chunk: follow this RFC https://github.com/vllm-project/vllm-omni/issues/268
- [ ] TTS Performance Optimization: https://github.com/vllm-project/vllm-omni/issues/1600
- [ ] Support code2wav multi-batch
- [ ] Streaming input and output
- [ ] Talker ACL graph Support
- [ ] More Ascend-friendly Ops
- [x] Expert Parallelism (EP)

#### Diffusion Pipeline

- [ ] Support sparse attention backend by integrating the sparse attention interface from MindIE-SD
- [x] Support LA from MindIE-SD
- [ ] Make sure the features tracked in #814 also work on NPU
- [x] `Qwen-Image-Edit-2511` Optimization
- [x] Wan2.2 Optimization: #1355
- [ ] Remove hardcoded NPU handling: https://github.com/vllm-project/vllm-omni/pull/1250#discussion_r2777640720
- [ ] Refactor ring attention for hardware dispatch: https://github.com/vllm-project/vllm-omni/pull/755

#### Others(UX & Hardware Scalable)

- [x] Platform: #774
- [x] Disable torch compile by default: https://github.com/vllm-project/vllm-omni/pull/1108
- [x] Dependencies router: https://github.com/vllm-project/vllm-omni/pull/1046

### Docs

### Known Issues

- [x] Memory usage: currently, Qwen2.5-Omni and Qwen3-Omni must place the talker on a different device from the thinker. We expect to co-locate them so that Qwen2.5-Omni and Qwen3-Omni need only 2 and 4 cards, respectively.
- [ ] Qwen2.5-Omni: enabling ACL graph causes accuracy problems. #912
- [ ] Qwen3-Omni: talker ACL graph breaks. https://github.com/vllm-project/vllm-omni/pull/1114#issuecomment-3824546234
- [x] #1322: Images generated on NPU may differ from those generated on GPU. This is normal and expected as long as there is no obvious degradation (e.g., blurred faces or failure to follow the prompt). You can bring the outputs closer by fixing the seed and using a CPU-based generator (the online server can use #1183), but minor differences may still remain. The root cause is that PyTorch operator implementations, such as `conv3d` and `nn.Linear`, cannot be perfectly aligned across hardware backends.
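
The seed and CPU-generator suggestion in the last item can be sketched in plain PyTorch. This only illustrates the idea and is not the vLLM-Omni API: `make_initial_noise` is a hypothetical helper, and how a generator is passed to a pipeline depends on the entry point in use. Drawing the initial noise on CPU gives every backend the same starting latents, so any remaining differences come from the operators themselves.

```python
import torch


def make_initial_noise(shape, seed, device="cpu"):
    """Draw Gaussian noise on CPU with a fixed seed, then move it to `device`.

    Sampling on CPU keeps the random stream independent of the accelerator,
    so the same seed yields the same starting tensor on GPU and NPU hosts.
    """
    generator = torch.Generator(device="cpu").manual_seed(seed)
    noise = torch.randn(shape, generator=generator, dtype=torch.float32)
    return noise.to(device)


if __name__ == "__main__":
    first = make_initial_noise((1, 4, 8, 8), seed=42)
    second = make_initial_noise((1, 4, 8, 8), seed=42)
    # Same seed, same CPU generator: the tensors match exactly.
    print(torch.equal(first, second))
```

On an Ascend host, the same call would take `device="npu"`, assuming `torch_npu` is installed and imported.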

### Model Support List

| Architecture | Models | Example HF Models | NPU support |
|--------------|--------|-------------------|-------------|
| `Qwen3OmniMoeForConditionalGeneration` | Qwen3-Omni | `Qwen/Qwen3-Omni-30B-A3B-Instruct` | ✅ |
| `Qwen2_5OmniForConditionalGeneration` | Qwen2.5-Omni | `Qwen/Qwen2.5-Omni-7B`, `Qwen/Qwen2.5-Omni-3B` | ✅ |
| `BagelForConditionalGeneration` | BAGEL (DiT-only) | `ByteDance-Seed/BAGEL-7B-MoT` | |
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` | ✅ |
| `QwenImagePipeline` | Qwen-Image-2512 | `Qwen/Qwen-Image-2512` | ✅ |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` | ✅ |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` | ✅ |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` | ✅ |
| `ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` | ✅ |
| `WanPipeline` | Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | ✅ |
| `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | ✅ |
| `OvisImagePipeline` | Ovis-Image | `OvisAI/Ovis-Image` | |
| `LongcatImagePipeline` | LongCat-Image | `meituan-longcat/LongCat-Image` | ✅ |
| `LongCatImageEditPipeline` | LongCat-Image-Edit | `meituan-longcat/LongCat-Image-Edit` | |
| `StableDiffusion3Pipeline` | Stable-Diffusion-3 | `stabilityai/stable-diffusion-3.5-medium` | |
| `Flux2KleinPipeline` | FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B`, `black-forest-labs/FLUX.2-klein-9B` | |
| `StableAudioPipeline` | Stable-Audio-Open | `stabilityai/stable-audio-open-1.0` | |
| `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-CustomVoice | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | ✅ |
| `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-VoiceDesign | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | ✅ |
| `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | ✅ |


### Feedback Period.

_No response_

### CC List.

_No response_

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.
