Merged
70 commits
fc4d8d5
[Hardware] Support platforms and plugin system
gcanlin Jan 13, 2026
c6d0c32
Make diffusion worker hardware-agnostic
gcanlin Jan 13, 2026
44350a3
fix pre-commit lint
gcanlin Jan 14, 2026
8664384
Merge branch 'main' into platforms
gcanlin Jan 14, 2026
a83a7d3
fix typo and remove some comments
gcanlin Jan 14, 2026
aa21514
remove supports_torch_compile and fix some typos
gcanlin Jan 14, 2026
af48d31
revert some unnecessary changes and fix lint
gcanlin Jan 15, 2026
22e67de
Update the selector for diffusion attention
gcanlin Jan 15, 2026
f30353a
Update the selector for diffusion attention
gcanlin Jan 15, 2026
41ee928
lint
gcanlin Jan 15, 2026
e13a96f
add forward_xpu in CustomOp
gcanlin Jan 15, 2026
648e1b3
clean
gcanlin Jan 15, 2026
26f9429
unify the attn log
gcanlin Jan 15, 2026
dcd6420
fix lint & fix device_type
gcanlin Jan 15, 2026
304c59b
Remove the unused env var
gcanlin Jan 15, 2026
15cf31a
fix lint
gcanlin Jan 15, 2026
09afcb2
Move Ascend attention to original position
gcanlin Jan 15, 2026
9bef25f
remove import
gcanlin Jan 15, 2026
3be1fcb
Merge branch 'main' into platforms
gcanlin Jan 15, 2026
45df182
remove cuda hardcode
gcanlin Jan 15, 2026
f139e30
fix the legacy is_xxx
gcanlin Jan 16, 2026
1aaf259
fix sleep mode test
gcanlin Jan 16, 2026
5880aa2
Merge branch 'main' into platforms
gcanlin Jan 16, 2026
0c1dc43
Add supports_attention_mask interface
gcanlin Jan 16, 2026
535d971
fix legacy
gcanlin Jan 16, 2026
d7bb244
Use upstream utils
gcanlin Jan 17, 2026
e8728f7
lint
gcanlin Jan 17, 2026
dfe3d63
Merge branch 'main' into platforms
gcanlin Jan 18, 2026
d8d9a15
Fix conflict
gcanlin Jan 18, 2026
5bb5d7b
make vae optimization as a optional parameter in example
gcanlin Jan 18, 2026
ab4c946
make vae optimization as a optional parameter in example and update docs
gcanlin Jan 18, 2026
b8c147e
Merge branch 'main' into platforms
gcanlin Jan 19, 2026
c37007f
fix lint
gcanlin Jan 19, 2026
881d0a8
enable qwen2.5-omni graph by default
gcanlin Jan 19, 2026
e40da71
Merge branch 'main' into pr-774
gcanlin Jan 21, 2026
c753894
add get_device_name for cuda platform
gcanlin Jan 21, 2026
81f17b2
fix lint
gcanlin Jan 21, 2026
273ac7a
Merge branch 'main' into pr-774
gcanlin Jan 22, 2026
4068538
fix
gcanlin Jan 22, 2026
9d8e218
simple test depends on image-build
gcanlin Jan 22, 2026
56438fb
fix ci
gcanlin Jan 22, 2026
84df573
test
gcanlin Jan 22, 2026
656932c
make simple test in docker
gcanlin Jan 22, 2026
2bb6dda
fix ci
gcanlin Jan 22, 2026
dfd724c
fix ci
gcanlin Jan 22, 2026
f8249c0
update ci config
gcanlin Jan 22, 2026
4360d6f
fix oom
gcanlin Jan 22, 2026
81a1762
Merge branch 'main' into pr-774
gcanlin Jan 22, 2026
98d6e18
fix lint
gcanlin Jan 22, 2026
8334c0b
fix selector
gcanlin Jan 22, 2026
46c041a
fix ci qwen3-omni config
gcanlin Jan 22, 2026
9abea44
Merge branch 'main' into pr-774
gcanlin Jan 22, 2026
cc15fd6
align with vllm plugin
gcanlin Jan 25, 2026
8a701fe
Merge branch 'main' into pr-774
gcanlin Jan 25, 2026
3092dac
fix lint
gcanlin Jan 25, 2026
a492a1a
fix lint
gcanlin Jan 25, 2026
d3674b6
Merge branch 'main' into pr-774
gcanlin Jan 26, 2026
aafbe0a
Merge branch 'main' into pr-774
gcanlin Jan 27, 2026
e52c29e
fix conflicts and move attention backend into platforms
gcanlin Jan 27, 2026
338f9c1
fix config
gcanlin Jan 27, 2026
6ded4da
fix utils
gcanlin Jan 27, 2026
0bf040c
fix
gcanlin Jan 28, 2026
8ff2697
align with vllm
gcanlin Jan 28, 2026
8dcab54
fix lint
gcanlin Jan 28, 2026
8dc7756
Refactor attention backend & remove ascend_attn
gcanlin Jan 28, 2026
535a7fe
Merge branch 'main' into pr-774
gcanlin Jan 28, 2026
633a23d
fix lint
gcanlin Jan 28, 2026
dd00f6f
fix lint
gcanlin Jan 28, 2026
2806b53
Merge branch 'main' into pr-774
gcanlin Jan 28, 2026
aec46c0
Merge branch 'main' into pr-774
gcanlin Jan 28, 2026
20 changes: 16 additions & 4 deletions .buildkite/pipeline.yml
@@ -18,11 +18,23 @@ steps:
  # queue: "ascend"

  - label: "Simple Unit Test"
- depends_on: ~
+ depends_on: image-build
  commands:
- - ".buildkite/scripts/simple_test.sh"
+ - pytest -v -s tests/entrypoints/
+ - pytest -v -s tests/diffusion/cache/
+ - pytest -v -s tests/model_executor/models/qwen2_5_omni/test_audio_length.py
+ - pytest -v -s tests/worker/
  agents:
- queue: "cpu_queue_premerge"
+ queue: "gpu_1_queue"
+ plugins:
+ - docker#v5.2.0:
+ image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+ always-pull: true
+ propagate-environment: true
+ environment:
+ - "HF_HOME=/fsx/hf_cache"
+ volumes:
+ - "/fsx/hf_cache:/fsx/hf_cache"

Collaborator Author

Here I moved the unit tests into the GPU queue. They don't consume GPU resources, so I think this is okay. The reason is that, once the platform layer is introduced, initializing it imports the compiled `vllm._C` ops. If the tests stay in the CPU queue, that import raises the error below:

[2026-01-21T11:33:06Z] .venv-simple-test/lib/python3.12/site-packages/vllm/platforms/cuda.py:16: in <module>
[2026-01-21T11:33:06Z]     import vllm._C  # noqa
[2026-01-21T11:33:06Z]     ^^^^^^^^^^^^^^
[2026-01-21T11:33:06Z] E   ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

See more details: https://buildkite.com/vllm/vllm-omni/builds/1913/steps/canvas?sid=019be051-4a0c-43ef-8b20-47464b363092
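
The failure mode in the log above can be checked in isolation. The following is a minimal sketch, not part of this PR: `cuda_driver_available` is a hypothetical helper that only tests whether the CUDA driver library can be loaded, which is the step that fails on a CPU-only agent.

```python
import ctypes


def cuda_driver_available(lib_name: str = "libcuda.so.1") -> bool:
    """Return True if the named shared library can be loaded.

    Hypothetical helper for illustration; it is not part of vLLM or
    vLLM-Omni. The default name matches the library in the CI log.
    """
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Same root cause as the ImportError in the log: the shared
        # object cannot be opened on hosts without the CUDA driver.
        return False
    return True


if __name__ == "__main__":
    print("CUDA driver loadable:", cuda_driver_available())
```

A check like this could be used to skip or reroute tests on CPU-only agents; the PR instead runs the tests inside the CI image on a GPU queue.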


  - label: "Diffusion Model Test"
  timeout_in_minutes: 20
@@ -149,7 +161,7 @@ steps:
  timeout_in_minutes: 20
  depends_on: image-build
  commands:
- - pytest -s -v tests/diffusion/test_gpu_diffusion_worker.py
+ - pytest -s -v tests/diffusion/test_diffusion_worker.py
  agents:
  queue: "gpu_4_queue" # g6.12xlarge instance on AWS, has 4 L4 GPU
  plugins:
2 changes: 1 addition & 1 deletion .buildkite/test-amd.yaml
@@ -54,7 +54,7 @@ steps:
  commands:
  - export MIOPEN_DEBUG_CONV_DIRECT=0
  - export MIOPEN_DEBUG_CONV_GEMM=0
- - pytest -s -v tests/diffusion/test_gpu_diffusion_worker.py
+ - pytest -s -v tests/diffusion/test_diffusion_worker.py

  - label: "Omni Model Test Qwen2-5-Omni"
  timeout_in_minutes: 15
6 changes: 3 additions & 3 deletions docs/configuration/stage_configs.md
@@ -40,7 +40,7 @@ stage_args:
  engine_args: # Engine arguments for a certain engine
  model_stage: thinker
  model_arch: Qwen2_5OmniForConditionalGeneration # The model implementation registered in model_executor/models/registry.py
- worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker # The specific worker used
+ worker_type: ar # The specific worker used
  scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler # The specific scehduler used
  gpu_memory_utilization: 0.8 # The gpu memory allocation for the stage within a single chip
  enforce_eager: true # Now we only support eager mode
@@ -66,7 +66,7 @@ stage_args:
  engine_args:
  model_stage: talker
  model_arch: Qwen2_5OmniForConditionalGeneration
- worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
+ worker_type: ar
  scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
  gpu_memory_utilization: 0.8
  enforce_eager: true
@@ -92,7 +92,7 @@ stage_args:
  engine_args:
  model_stage: code2wav
  model_arch: Qwen2_5OmniForConditionalGeneration
- worker_cls: vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker
+ worker_type: generation
  scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
  gpu_memory_utilization: 0.15
  enforce_eager: true
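
The config change above replaces a fully qualified `worker_cls` path with a short `worker_type` value. The diff does not show how that value is resolved, so the following is only a sketch of one way such a lookup could work. `WORKER_REGISTRY` and `resolve_worker_cls` are hypothetical names; the two class paths are the ones removed from the configs in this diff.

```python
# Hypothetical lookup table keyed by (device type, worker type).
# Only the "cuda" entries are grounded in the worker_cls values
# that this diff removes; other platforms are deliberately absent.
WORKER_REGISTRY: dict[tuple[str, str], str] = {
    ("cuda", "ar"): "vllm_omni.worker.gpu_ar_worker.GPUARWorker",
    ("cuda", "generation"): "vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker",
}


def resolve_worker_cls(device_type: str, worker_type: str) -> str:
    """Map a short worker_type to a fully qualified class path."""
    try:
        return WORKER_REGISTRY[(device_type, worker_type)]
    except KeyError:
        raise ValueError(
            f"No worker registered for device_type={device_type!r}, "
            f"worker_type={worker_type!r}"
        ) from None


if __name__ == "__main__":
    print(resolve_worker_cls("cuda", "ar"))
```

Keeping the device-specific path out of the YAML means the same stage config can, in principle, be reused across hardware backends.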
6 changes: 3 additions & 3 deletions docs/configuration/stage_configs/qwen2_5_omni.yaml
@@ -8,7 +8,7 @@ stage_args:
  engine_args:
  model_stage: thinker
  model_arch: Qwen2_5OmniForConditionalGeneration
- worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
+ worker_type: ar
  scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
  gpu_memory_utilization: 0.8
  enforce_eager: true # Now we only support eager mode
@@ -34,7 +34,7 @@ stage_args:
  engine_args:
  model_stage: talker
  model_arch: Qwen2_5OmniForConditionalGeneration
- worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
+ worker_type: ar
  scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
  gpu_memory_utilization: 0.8
  enforce_eager: true
@@ -60,7 +60,7 @@ stage_args:
  engine_args:
  model_stage: code2wav
  model_arch: Qwen2_5OmniForConditionalGeneration
- worker_cls: vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker
+ worker_type: generation
  scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
  gpu_memory_utilization: 0.15
  enforce_eager: true
22 changes: 14 additions & 8 deletions examples/offline_inference/image_to_image/image_edit.py
@@ -81,7 +81,7 @@
  from vllm_omni.entrypoints.omni import Omni
  from vllm_omni.inputs.data import OmniDiffusionSamplingParams
  from vllm_omni.outputs import OmniRequestOutput
- from vllm_omni.utils.platform_utils import detect_device_type, is_npu
+ from vllm_omni.platforms import current_omni_platform


  def parse_args() -> argparse.Namespace:
@@ -280,6 +280,16 @@ def parse_args() -> argparse.Namespace:
  action="store_true",
  help="Disable torch.compile and force eager execution.",
  )
+ parser.add_argument(
+ "--vae_use_slicing",
+ action="store_true",
+ help="Enable VAE slicing for memory optimization.",
+ )
+ parser.add_argument(
+ "--vae_use_tiling",
+ action="store_true",
+ help="Enable VAE tiling for memory optimization.",
+ )
  parser.add_argument(
  "--enable-cpu-offload",
  action="store_true",
@@ -306,12 +316,8 @@ def main():
  else:
  input_image = input_images

- device = detect_device_type()
- generator = torch.Generator(device=device).manual_seed(args.seed)
+ generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)

Collaborator

I understand the usage. A suggestion for discussion: the API reads as somewhat torch-style, so could we change it to `vllm_omni.device_type` for simplicity? And so on for the rest.

Collaborator Author

Sorry, I'm not sure I fully understood your point. Do `current_omni_platform.device_type` and `vllm_omni.device_type` have the same effect? Could you please give more details?
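
One possible reading of the exchange above is that the two spellings differ only in where the accessor lives. The sketch below illustrates that reading with stand-in names: `FakePlatform`, `current_platform`, and the module-level `device_type()` function are all hypothetical and are not the vLLM-Omni API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakePlatform:
    """Stand-in for a platform object; every name here is hypothetical."""

    device_type: str

    def is_npu(self) -> bool:
        return self.device_type == "npu"


# Style used in this PR: a single platform object answers every query.
current_platform = FakePlatform(device_type="cuda")


# Style floated in the review: a flat, module-level accessor that
# simply forwards to the platform object.
def device_type() -> str:
    return current_platform.device_type


if __name__ == "__main__":
    # Both spellings report the same device string.
    print(current_platform.device_type, device_type())
```

Under this reading the flat accessor is an alias, so the choice is mostly about API surface: a platform object groups related queries such as `is_npu()`, while module-level functions are shorter to type.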


- # Enable VAE memory optimizations on NPU
- vae_use_slicing = is_npu()
- vae_use_tiling = is_npu()
  parallel_config = DiffusionParallelConfig(
  ulysses_degree=args.ulysses_degree,
  ring_degree=args.ring_degree,
@@ -344,8 +350,8 @@ def main():
  # Initialize Omni with appropriate pipeline
  omni = Omni(
  model=args.model,
- vae_use_slicing=vae_use_slicing,
- vae_use_tiling=vae_use_tiling,
+ vae_use_slicing=args.vae_use_slicing,
+ vae_use_tiling=args.vae_use_tiling,
  cache_backend=args.cache_backend,
  cache_config=cache_config,
  parallel_config=parallel_config,
4 changes: 4 additions & 0 deletions examples/offline_inference/image_to_image/image_to_image.md
@@ -47,4 +47,8 @@ Key arguments:
  - `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
  - `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
  - `--output`: path to save the generated PNG.
+ - `--vae_use_slicing`: enable VAE slicing for memory optimization.
+ - `--vae_use_tiling`: enable VAE tiling for memory optimization.
  - `--enable-cpu-offload`: enable CPU offloading for diffusion models.
+
+ > ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
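
The two flags documented above are plain opt-in switches. The sketch below shows, using only `argparse`, how `store_true` flags default to `False` and become `True` when passed. `build_parser` and `omni_kwargs` are hypothetical helper names, and the returned dict merely mirrors the keyword arguments the example scripts in this diff pass to `Omni`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two VAE flags added in this PR."""
    parser = argparse.ArgumentParser(description="VAE flag sketch")
    parser.add_argument(
        "--vae_use_slicing",
        action="store_true",
        help="Enable VAE slicing for memory optimization.",
    )
    parser.add_argument(
        "--vae_use_tiling",
        action="store_true",
        help="Enable VAE tiling for memory optimization.",
    )
    return parser


def omni_kwargs(argv: list[str]) -> dict[str, bool]:
    """Hypothetical helper: turn CLI flags into keyword arguments."""
    args = build_parser().parse_args(argv)
    return {
        "vae_use_slicing": args.vae_use_slicing,
        "vae_use_tiling": args.vae_use_tiling,
    }


if __name__ == "__main__":
    print(omni_kwargs(["--vae_use_slicing", "--vae_use_tiling"]))
```

Because both flags default to off, the VAE optimizations are no longer enabled implicitly on NPU in these example scripts; users who hit OOM errors opt in explicitly.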
4 changes: 4 additions & 0 deletions examples/offline_inference/image_to_video/README.md
@@ -54,4 +54,8 @@ Key arguments:
  - `--num_inference_steps`: Number of denoising steps (default 50).
  - `--fps`: Frames per second for the saved MP4 (requires `diffusers` export_to_video).
  - `--output`: Path to save the generated video.
+ - `--vae_use_slicing`: Enable VAE slicing for memory optimization.
+ - `--vae_use_tiling`: Enable VAE tiling for memory optimization.
  - `--enable-cpu-offload`: enable CPU offloading for diffusion models.
+
+ > ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
23 changes: 14 additions & 9 deletions examples/offline_inference/image_to_video/image_to_video.py
@@ -29,7 +29,7 @@
  from vllm_omni.entrypoints.omni import Omni
  from vllm_omni.inputs.data import OmniDiffusionSamplingParams
  from vllm_omni.outputs import OmniRequestOutput
- from vllm_omni.utils.platform_utils import detect_device_type, is_npu
+ from vllm_omni.platforms import current_omni_platform


  def parse_args() -> argparse.Namespace:
@@ -59,6 +59,16 @@ def parse_args() -> argparse.Namespace:
  )
  parser.add_argument("--output", type=str, default="i2v_output.mp4", help="Path to save the video (mp4).")
  parser.add_argument("--fps", type=int, default=16, help="Frames per second for the output video.")
+ parser.add_argument(
+ "--vae_use_slicing",
+ action="store_true",
+ help="Enable VAE slicing for memory optimization.",
+ )
+ parser.add_argument(
+ "--vae_use_tiling",
+ action="store_true",
+ help="Enable VAE tiling for memory optimization.",
+ )
  parser.add_argument(
  "--enable-cpu-offload",
  action="store_true",
@@ -80,8 +90,7 @@ def calculate_dimensions(image: PIL.Image.Image, max_area: int = 480 * 832) -> t

  def main():
  args = parse_args()
- device = detect_device_type()
- generator = torch.Generator(device=device).manual_seed(args.seed)
+ generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)

  # Load input image
  image = PIL.Image.open(args.image).convert("RGB")
@@ -98,17 +107,13 @@ def main():
  # Resize image to target dimensions
  image = image.resize((width, height), PIL.Image.Resampling.LANCZOS)

- # Enable VAE memory optimizations on NPU
- vae_use_slicing = is_npu()
- vae_use_tiling = is_npu()
-
  # Check if profiling is requested via environment variable
  profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

  omni = Omni(
  model=args.model,
- vae_use_slicing=vae_use_slicing,
- vae_use_tiling=vae_use_tiling,
+ vae_use_slicing=args.vae_use_slicing,
+ vae_use_tiling=args.vae_use_tiling,
  boundary_ratio=args.boundary_ratio,
  flow_shift=args.flow_shift,
  enable_cpu_offload=args.enable_cpu_offload,
5 changes: 2 additions & 3 deletions examples/offline_inference/text_to_audio/text_to_audio.py
@@ -22,7 +22,7 @@

  from vllm_omni.entrypoints.omni import Omni
  from vllm_omni.inputs.data import OmniDiffusionSamplingParams
- from vllm_omni.utils.platform_utils import detect_device_type
+ from vllm_omni.platforms import current_omni_platform


  def parse_args() -> argparse.Namespace:
@@ -118,8 +118,7 @@ def save_audio(audio_data: np.ndarray, output_path: str, sample_rate: int = 4410

  def main():
  args = parse_args()
- device = detect_device_type()
- generator = torch.Generator(device=device).manual_seed(args.seed)
+ generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)

  print(f"\n{'=' * 60}")
  print("Stable Audio Open - Text-to-Audio Generation")
4 changes: 4 additions & 0 deletions examples/offline_inference/text_to_image/README.md
@@ -96,8 +96,12 @@ Key arguments:
  - `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
  - `--height/--width`: output resolution (defaults 1024x1024).
  - `--output`: path to save the generated PNG.
+ - `--vae_use_slicing`: enable VAE slicing for memory optimization.
+ - `--vae_use_tiling`: enable VAE tiling for memory optimization.
  - `--enable-cpu-offload`: enable CPU offloading for diffusion models.

+ > ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
+
  > ℹ️ Qwen-Image currently publishes best-effort presets at `1328x1328`, `1664x928`, `928x1664`, `1472x1140`, `1140x1472`, `1584x1056`, and `1056x1584`. Adjust `--height/--width` accordingly for the most reliable outcomes.

  ## Web UI Demo
9 changes: 4 additions & 5 deletions examples/offline_inference/text_to_image/gradio_demo.py
@@ -7,7 +7,7 @@
  from vllm_omni.entrypoints.omni import Omni
  from vllm_omni.inputs.data import OmniDiffusionSamplingParams
  from vllm_omni.outputs import OmniRequestOutput
- from vllm_omni.utils.platform_utils import detect_device_type, is_npu
+ from vllm_omni.platforms import current_omni_platform

  ASPECT_RATIOS: dict[str, tuple[int, int]] = {
  "1:1": (1328, 1328),
@@ -62,8 +62,8 @@ def parse_args() -> argparse.Namespace:
  @lru_cache(maxsize=1)
  def get_omni(model_name: str) -> Omni:
  # Enable VAE memory optimizations on NPU
- vae_use_slicing = is_npu()
- vae_use_tiling = is_npu()
+ vae_use_slicing = current_omni_platform.is_npu()
+ vae_use_tiling = current_omni_platform.is_npu()
  return Omni(
  model=model_name,
  vae_use_slicing=vae_use_slicing,
@@ -72,7 +72,6 @@ def get_omni(model_name: str) -> Omni:


  def build_demo(args: argparse.Namespace) -> gr.Blocks:
- device = detect_device_type()
  omni = get_omni(args.model)

  def run_inference(
@@ -99,7 +98,7 @@ def run_inference(
  raise gr.Error("Inference steps must be a positive integer.")
  if num_images not in {1, 2, 3, 4}:
  raise gr.Error("Number of images must be 1, 2, 3, or 4.")
- generator = torch.Generator(device=device).manual_seed(seed)
+ generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(seed)
  outputs = omni.generate(
  prompt.strip(),
  OmniDiffusionSamplingParams(
23 changes: 14 additions & 9 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -12,7 +12,7 @@
  from vllm_omni.entrypoints.omni import Omni
  from vllm_omni.inputs.data import OmniDiffusionSamplingParams
  from vllm_omni.outputs import OmniRequestOutput
- from vllm_omni.utils.platform_utils import detect_device_type, is_npu
+ from vllm_omni.platforms import current_omni_platform


  def parse_args() -> argparse.Namespace:
@@ -113,17 +113,22 @@ def parse_args() -> argparse.Namespace:
  default=1,
  help="Number of GPUs used for tensor parallelism (TP) inside the DiT.",
  )
+ parser.add_argument(
+ "--vae_use_slicing",
+ action="store_true",
+ help="Enable VAE slicing for memory optimization.",
+ )
+ parser.add_argument(
+ "--vae_use_tiling",
+ action="store_true",
+ help="Enable VAE tiling for memory optimization.",
+ )
  return parser.parse_args()


  def main():
  args = parse_args()
- device = detect_device_type()
- generator = torch.Generator(device=device).manual_seed(args.seed)
-
- # Enable VAE memory optimizations on NPU
- vae_use_slicing = is_npu()
- vae_use_tiling = is_npu()
+ generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)

  # Configure cache based on backend type
  cache_config = None
@@ -167,8 +172,8 @@ def main():

  omni = Omni(
  model=args.model,
- vae_use_slicing=vae_use_slicing,
- vae_use_tiling=vae_use_tiling,
+ vae_use_slicing=args.vae_use_slicing,
+ vae_use_tiling=args.vae_use_tiling,
  cache_backend=args.cache_backend,
  cache_config=cache_config,
  enable_cache_dit_summary=args.enable_cache_dit_summary,
4 changes: 4 additions & 0 deletions examples/offline_inference/text_to_video/text_to_video.md
@@ -29,4 +29,8 @@ Key arguments:
  - `--boundary_ratio`: Boundary split ratio for low/high DiT.
  - `--fps`: frames per second for the saved MP4 (requires `diffusers` export_to_video).
  - `--output`: path to save the generated video.
+ - `--vae_use_slicing`: enable VAE slicing for memory optimization.
+ - `--vae_use_tiling`: enable VAE tiling for memory optimization.
  - `--enable-cpu-offload`: enable CPU offloading for diffusion models.
+
+ > ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.