[RFC]: Hardware Abstraction Layer & Plugin System for Unified Backend Support

### Motivation

Currently, `vllm-omni` supports multiple hardware backends (CUDA, NPU, XPU) through hardcoded paths and hardware-specific `if/else` checks scattered throughout the codebase (e.g., in the `stage_configs` and `worker` directories). This creates several issues:
- **Coupling**: Core modeling logic (like `qwen2_5_omni_token2wav.py`) is polluted with device-specific code (e.g., `is_npu()` checks around `torch.kaiser_window` kernels).
- **Maintenance**: Adding a new backend requires invasive changes to the core repository.
- **Redundancy**: We are reimplementing platform detection logic that already exists in upstream vLLM.

This proposal transitions `vllm-omni` into a modular architecture where the core engine remains hardware-blind, delegating device-specific logic to an `OmniPlatform` layer.

### Proposed Architecture

**A. OmniPlatform Hardware Abstraction Layer**
Instead of ad-hoc detection, we unify all hardware-aware implementations into a specialized platform layer that inherits from upstream vLLM.
- **Upstream Alignment**: `OmniPlatform` inherits directly from `vllm.platforms.Platform`. Concrete implementations (e.g., `XPUOmniPlatform`, `NPUOmniPlatform`) extend vLLM Platform behavior with Omni-specific APIs.
- **Generic Device Handling**: Remove `is_xpu()`/`is_npu()` checks and device-specific strings from core modeling files. Core code will instead use `current_omni_platform` APIs to dynamically retrieve device metadata, worker classes, attention backends, etc.
- **Custom Op Dispatch**: Operators will dispatch via platform-owned selectors (e.g., `current_omni_platform.is_npu()`) to keep device knowledge centralized.
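
The layering described above could look roughly like the sketch below. It is illustrative only: `Platform` here is a local stand-in for `vllm.platforms.Platform` so the snippet runs without vLLM installed, and the method name `get_omni_ar_worker_cls`, the `device_type`-based `is_npu()` check, and the NPU worker path are assumptions, not the final API.

```python
from abc import ABC, abstractmethod


class Platform:
    """Local stand-in for vllm.platforms.Platform (assumption, keeps the sketch runnable)."""

    device_type: str = "cpu"


class OmniPlatform(Platform, ABC):
    """Omni-specific extensions on top of the upstream platform class."""

    @classmethod
    @abstractmethod
    def get_omni_ar_worker_cls(cls) -> str:
        """Return the qualified name of the autoregressive worker (hypothetical API)."""

    @classmethod
    def is_npu(cls) -> bool:
        # Platform-owned selector: core code asks the platform, not the device string.
        return cls.device_type == "npu"


class XPUOmniPlatform(OmniPlatform):
    device_type = "xpu"

    @classmethod
    def get_omni_ar_worker_cls(cls) -> str:
        # Path taken from the stage config example in this RFC.
        return "vllm_omni.worker.xpu.xpu_ar_worker.XPUARWorker"


class NPUOmniPlatform(OmniPlatform):
    device_type = "npu"

    @classmethod
    def get_omni_ar_worker_cls(cls) -> str:
        # Hypothetical path, shown only to mirror the XPU layout.
        return "vllm_omni.worker.npu.npu_ar_worker.NPUARWorker"


# In the real system this would be resolved once at startup; here it is set by hand.
current_omni_platform = XPUOmniPlatform
```

With this shape, core modeling code would call `current_omni_platform.get_omni_ar_worker_cls()` or `current_omni_platform.is_npu()` and never import a backend module directly.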


**B. The Plugin System**
We follow the vLLM plugin structure but introduce Omni-specific groups to handle both in-tree and out-of-tree backends.
- **Modular Repositories**: Platform-specific code (Workers, ModelRunners, and specialized kernels) will be encapsulated in dedicated platform directories.
- **Platform-Agnostic Configuration**: YAML stage configs will be simplified. Instead of explicitly specifying `vllm_omni.worker.xpu.xpu_ar_worker.XPUARWorker`, `current_omni_platform` dynamically resolves the correct hardware-specific implementation at runtime.
- **Registration via Entry Points**: We will utilize two distinct Python entry point groups to mirror vLLM's loading behavior: `vllm_omni.general_plugins` and `vllm_omni.platform_plugins`.

<img width="5753" height="3725" alt="Image" src="https://github.com/user-attachments/assets/e7ba1ef6-e430-4b32-9892-e4c08da41426" />

### Implementation Details
See @gcanlin's post below.


### Feedback Period.

_No response_

### CC List.

_No response_

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.

[RFC]: Hardware Abstraction Layer & Plugin System for Unified Backend Support #702
