[RFC]: Hardware Abstraction Layer & Plugin System for Unified Backend Support #702

@faaany

Description

Motivation

Currently, vllm-omni supports multiple hardware backends (CUDA, NPU, XPU) through hardcoded paths and hardware-specific if/else checks scattered throughout the codebase (e.g., in stage_configs and the worker directories). This creates several issues:

  • Coupling: Core modeling logic (such as qwen2_5_omni_token2wav.py) is polluted with device-specific code (e.g., is_npu() checks around torch.kaiser_window kernels).
  • Maintenance: Adding a new backend requires invasive changes to the core repository.
  • Redundancy: We are reimplementing platform detection logic that already exists in upstream vLLM.

This proposal transitions vllm-omni into a modular architecture where the core engine remains hardware-blind, delegating device-specific logic to an OmniPlatform layer.

Proposed Architecture

A. OmniPlatform Hardware Abstraction Layer
Instead of ad-hoc detection, we unify all hardware-aware implementations into a specialized platform layer that inherits from upstream vLLM.

  • Upstream Alignment: OmniPlatform inherits directly from vllm.platforms.Platform. Concrete implementations (e.g., XPUOmniPlatform, NPUOmniPlatform) extend vLLM Platform behavior with Omni-specific APIs.
  • Generic Device Handling: We will remove is_xpu()/is_npu() checks and device-specific strings from core modeling files. Core code will instead use current_omni_platform APIs to dynamically retrieve device metadata, worker classes, attention backends, etc.
  • Custom Op Dispatch: Operators will dispatch via platform-owned selectors (e.g., current_omni_platform.is_npu()) to keep device knowledge centralized.
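
To make the layering above concrete, here is a minimal sketch, not the proposed implementation. The Platform class is a stand-in for vllm.platforms.Platform so the snippet runs without vLLM installed. The method names get_ar_worker_cls and select_op_impl, the NPU worker path, and the "fallback" label for kaiser_window are all hypothetical and only illustrate how core code could query the platform instead of branching on is_npu().

```python
from abc import ABC, abstractmethod


class Platform:
    """Stand-in for vllm.platforms.Platform so the sketch runs without vLLM."""

    device_type: str = "cpu"


class OmniPlatform(Platform, ABC):
    """Omni-specific extension point; the method names here are hypothetical."""

    @abstractmethod
    def get_ar_worker_cls(self) -> str:
        """Return the qualified name of the autoregressive worker class."""

    def select_op_impl(self, op_name: str) -> str:
        """Pick an operator implementation; platforms override to special-case ops."""
        return "native"


class XPUOmniPlatform(OmniPlatform):
    device_type = "xpu"

    def get_ar_worker_cls(self) -> str:
        # Worker path taken from the stage-config example in this RFC.
        return "vllm_omni.worker.xpu.xpu_ar_worker.XPUARWorker"


class NPUOmniPlatform(OmniPlatform):
    device_type = "npu"

    def get_ar_worker_cls(self) -> str:
        # Hypothetical path, chosen by analogy with the XPU one.
        return "vllm_omni.worker.npu.npu_ar_worker.NPUARWorker"

    def select_op_impl(self, op_name: str) -> str:
        # Assumption: kaiser_window needs a non-native path on NPU.
        if op_name == "kaiser_window":
            return "fallback"
        return super().select_op_impl(op_name)


def build_window_plan(platform: OmniPlatform) -> dict:
    """Core code asks the platform instead of branching on is_npu()."""
    return {
        "device": platform.device_type,
        "worker_cls": platform.get_ar_worker_cls(),
        "kaiser_window_impl": platform.select_op_impl("kaiser_window"),
    }


# In the real design this would be resolved once, at import time.
current_omni_platform: OmniPlatform = XPUOmniPlatform()
print(build_window_plan(current_omni_platform))
```

With this shape, build_window_plan contains no device names: supporting a new backend means adding one OmniPlatform subclass rather than touching the modeling file.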

B. The Plugin System
We follow the vLLM plugin structure but introduce Omni-specific groups to handle both in-tree and out-of-tree backends.

  • Modular Repositories: Platform-specific code (Workers, ModelRunners, and specialized kernels) will be encapsulated in dedicated platform directories.
  • Platform-Agnostic Configuration: YAML stage configs will be simplified. Instead of a config explicitly specifying vllm_omni.worker.xpu.xpu_ar_worker.XPUARWorker, current_omni_platform will dynamically resolve the correct hardware-specific implementation at runtime.
  • Registration via Entry Points: We will utilize two distinct Python entry point groups to mirror vLLM's loading behavior: vllm_omni.general_plugins and vllm_omni.platform_plugins.
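
The entry-point registration could look roughly like the sketch below. It assumes the Omni platform group follows the same convention as vLLM platform plugins, where each entry point is a function returning the platform class's qualified name, or None when that platform is unavailable. The helper names discover_plugins and resolve_platform and the plugin package name are hypothetical; only the group name comes from this RFC.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict, Optional

PLATFORM_GROUP = "vllm_omni.platform_plugins"  # group name from this RFC

# Assumed plugin contract: return a qualified class name, or None if unavailable.
PluginFn = Callable[[], Optional[str]]


def discover_plugins(group: str = PLATFORM_GROUP) -> Dict[str, PluginFn]:
    """Load every entry point in the group (the group= keyword needs Python 3.10+)."""
    return {ep.name: ep.load() for ep in entry_points(group=group)}


def resolve_platform(plugins: Dict[str, PluginFn]) -> Optional[str]:
    """Call each plugin and return the single platform that reports itself active."""
    active = {}
    for name, fn in plugins.items():
        qualname = fn()
        if qualname is not None:
            active[name] = qualname
    if len(active) > 1:
        raise RuntimeError(f"multiple platform plugins activated: {sorted(active)}")
    return next(iter(active.values()), None)


# Fake plugins stand in for installed packages; the package name is hypothetical.
fake_plugins = {
    "xpu": lambda: "my_xpu_plugin.platform.XPUOmniPlatform",
    "npu": lambda: None,  # e.g. NPU runtime not present on this machine
}
print(resolve_platform(fake_plugins))
```

An out-of-tree backend would then only need to declare an entry point in the vllm_omni.platform_plugins group in its own packaging metadata; nothing in the core repository has to change.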

Implementation Details

See @gcanlin's post below.

