Motivation
Currently, vllm-omni supports multiple hardware backends (CUDA, NPU, XPU) through hardcoded paths and hardware-specific if/else checks scattered throughout the codebase (e.g., stage_configs and worker directories). This creates several issues:
- Coupling: Core modeling logic (like qwen2_5_omni_token2wav.py) is polluted with device-specific code (e.g., is_npu() checks for torch.kaiser_window kernels).
- Maintenance: Adding a new backend requires invasive changes to the core repository.
- Redundancy: We are reimplementing platform detection logic that already exists in upstream vLLM.
This proposal transitions vllm-omni into a modular architecture where the core engine remains hardware-blind, delegating device-specific logic to an OmniPlatform layer.
Proposed Architecture
A. OmniPlatform Hardware Abstraction Layer
Instead of ad-hoc detection, we unify all hardware-aware implementations into a specialized platform layer that inherits from upstream vLLM.
- Upstream Alignment: OmniPlatform inherits directly from vllm.platforms.Platform. Concrete implementations (e.g., XPUOmniPlatform, NPUOmniPlatform) extend vLLM Platform behavior with Omni-specific APIs.
- Generic Device Handling: Remove is_xpu()/is_npu() checks and device-specific strings from core modeling files. Instead, core code will use current_omni_platform APIs to dynamically retrieve device metadata, worker classes, attention backends, etc.
- Custom Op Dispatch: Operators will dispatch via platform-owned selectors (e.g., current_omni_platform.is_npu()) to keep device knowledge centralized.
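The layering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the local Platform class is only a stand-in for vllm.platforms.Platform so the snippet runs without vLLM installed, the method name get_ar_worker_cls() is hypothetical, and the NPU worker path is made up for the example (only the XPU path appears in this proposal).

```python
class Platform:
    """Stand-in for vllm.platforms.Platform (assumption: keeps the sketch self-contained)."""

    device_type: str = "cpu"


class OmniPlatform(Platform):
    """Omni-specific extension point. The method name below is hypothetical."""

    def get_ar_worker_cls(self) -> str:
        """Return the fully qualified name of the autoregressive worker class."""
        raise NotImplementedError


class XPUOmniPlatform(OmniPlatform):
    device_type = "xpu"

    def get_ar_worker_cls(self) -> str:
        # Path taken from the stage-config example in this proposal.
        return "vllm_omni.worker.xpu.xpu_ar_worker.XPUARWorker"


class NPUOmniPlatform(OmniPlatform):
    device_type = "npu"

    def get_ar_worker_cls(self) -> str:
        # Hypothetical path, invented for illustration only.
        return "vllm_omni.worker.npu.npu_ar_worker.NPUARWorker"


def resolve_worker(platform: OmniPlatform) -> str:
    """Core code asks the platform instead of branching on is_xpu()/is_npu()."""
    return platform.get_ar_worker_cls()


if __name__ == "__main__":
    current_omni_platform = XPUOmniPlatform()
    print(current_omni_platform.device_type, resolve_worker(current_omni_platform))
```

With this shape, a stage config would no longer need to name a worker class; the core engine would call the platform object and stay unaware of which backend is active.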
B. The Plugin System
We follow the vLLM plugin structure but introduce Omni-specific groups to handle both in-tree and out-of-tree backends.
- Modular Repositories: Platform-specific code (Workers, ModelRunners, and specialized kernels) will be encapsulated in dedicated platform directories.
- Platform-Agnostic Configuration: YAML stage configs will be simplified. Instead of explicitly specifying vllm_omni.worker.xpu.xpu_ar_worker.XPUARWorker, current_omni_platform dynamically resolves the correct hardware-specific implementation at runtime.
- Registration via Entry Points: We will utilize two distinct Python entry point groups to mirror vLLM's loading behavior: vllm_omni.general_plugins and vllm_omni.platform_plugins.
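As a rough sketch of how entry-point registration could work, the snippet below discovers plugins with the standard library's importlib.metadata and picks the active platform. It assumes, mirroring what we understand of vLLM's platform plugins, that each platform entry point is a callable returning the fully qualified platform class name, or None when the hardware is unavailable. The helper names discover_entry_points and resolve_platform are hypothetical.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict, Optional

PLATFORM_GROUP = "vllm_omni.platform_plugins"

# A probe returns a qualified class name, or None if the device is absent (assumed contract).
Probe = Callable[[], Optional[str]]


def discover_entry_points(group: str) -> Dict[str, Probe]:
    """Load every installed entry point registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ selection API.
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


def resolve_platform(plugins: Dict[str, Probe]) -> Optional[str]:
    """Run each probe and return the single activated platform class name, if any."""
    activated = {}
    for name, probe in plugins.items():
        qualname = probe()
        if qualname is not None:
            activated[name] = qualname
    if len(activated) > 1:
        raise RuntimeError(f"Multiple platform plugins activated: {sorted(activated)}")
    return next(iter(activated.values()), None)


if __name__ == "__main__":
    print(resolve_platform(discover_entry_points(PLATFORM_GROUP)))
```

An out-of-tree backend would then only need to declare an entry point in the vllm_omni.platform_plugins group in its own package metadata, with no changes to the core repository.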
Implementation Details
See @gcanlin's post below.