[Feature] vLLM-Omni RDMA connector by natureofnature · Pull Request #1019 · vllm-project/vllm-omni

natureofnature · 2026-01-28T09:18:02Z

Purpose

Following #955, this PR provides an RDMA-based transfer implementation built on the Mooncake Transfer Engine for CPU<->CPU and GPU<->GPU transfers.

Progress

Added D2D connector
Enabled RDMA test
Added cross-node testing and benchmark
Supported Bagel (AR->DIT) disaggregation (the relevant modifications will be released in the next PR)

TODO

Integrate with Bagel and Qwen2.5/Qwen3 Omni model inference

Test Plan

Intra-node test, 3 modes (serialization/deserialization, CPU pinned memory, GPU pinned memory), for CI
Cross-node test, 3 modes, for performance testing
Intra-node model test (Bagel, Qwen2.5 Omni, Qwen3 Omni, etc.)
Cross-node model test (Bagel, Qwen2.5 Omni, Qwen3 Omni, etc.)

Test Result

Intra-node functionality

test_buffer_management.py passed
test_mooncake_rdma.py passed

Cross-node performance

Case 1: Simulated test

Each mode transfers 1 GB of data, repeated 20 times: zero copy (using a managed buffer), gpu (GPU direct transfer), and copy (data -> buffer -> RDMA). Tested on H800 clusters.

Mode	Throughput	Efficiency (vs 45 GB/s, the maximum bandwidth on tested servers)
zero copy	33.7 GB/s	75%
gpu	25.6 GB/s	57%
copy	13.2 GB/s	29%
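For reference, the efficiency column can be reproduced from the throughput column. A minimal sketch, assuming efficiency is simply measured throughput divided by the 45 GB/s peak and rounded to the nearest percent; the helper name `efficiency_pct` is made up for illustration:

```python
# Peak bandwidth and measured throughputs are taken from the table above (GB/s).
PEAK_GBPS = 45.0
measured = {"zero copy": 33.7, "gpu": 25.6, "copy": 13.2}


def efficiency_pct(throughput_gbps: float, peak_gbps: float = PEAK_GBPS) -> int:
    """Return throughput as a rounded percentage of the peak bandwidth."""
    return round(100 * throughput_gbps / peak_gbps)


efficiency = {mode: efficiency_pct(gbps) for mode, gbps in measured.items()}
for mode, pct in efficiency.items():
    print(f"{mode}: {pct}%")
```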

Case 2: Bagel AR/DIT disaggregation test

A text-to-image request with a prompt of around 3,400 tokens generates around 190 MB of KV cache to transfer between the AR and DIT stages. The performance results are below.

Stage	Mooncake Store	RDMA with serialize/deserialize/memory copy	RDMA CPU zero-copy	RDMA GPU direct
AR (Stage-0)	102 ms	99 ms	100 ms	106 ms
AR→DIT Data transmission	810 ms	300 ms	14 ms	14 ms
DIT (Stage-1)	10,083 ms	10,240 ms	10,077 ms	10,039 ms
Other time cost	65 ms	104 ms	103 ms	93 ms
E2E time	11,060 ms	10,743 ms	10,294 ms	10,252 ms
Overall performance gain	baseline	~3%	~7%	>7%
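The E2E row and the gain row follow from the per-stage numbers. A short sketch of that arithmetic, assuming E2E time is the plain sum of the four rows above it and gain is measured relative to the Mooncake Store column; the dictionary keys are shortened labels, not names from the code:

```python
# Per-stage timings in milliseconds, copied from the table above, in the order
# AR, AR->DIT transmission, DIT, other.
timings_ms = {
    "Mooncake Store": [102, 810, 10083, 65],
    "RDMA serialize/copy": [99, 300, 10240, 104],
    "RDMA CPU zero-copy": [100, 14, 10077, 103],
    "RDMA GPU direct": [106, 14, 10039, 93],
}

e2e_ms = {name: sum(parts) for name, parts in timings_ms.items()}
baseline_ms = e2e_ms["Mooncake Store"]

# End-to-end gain relative to the baseline, in percent.
gain_pct = {
    name: round(100 * (baseline_ms - total) / baseline_ms, 1)
    for name, total in e2e_ms.items()
}

# Speedup of the transmission step alone (baseline vs. zero-copy RDMA).
transfer_speedup = timings_ms["Mooncake Store"][1] / timings_ms["RDMA CPU zero-copy"][1]

print(e2e_ms)
print(gain_pct)
print(f"transfer step speedup: {transfer_speedup:.0f}x")
```

The end-to-end gain stays in single digits because the DIT stage dominates the total, while the transmission step alone shrinks from 810 ms to 14 ms.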

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2cc320601

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

natureofnature · 2026-01-28T10:08:08Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 743d268753

hsliuustc0106 · 2026-02-08T14:06:40Z

Can we rename it? I don't recall RDMAConnector being used in upstream vLLM.

natureofnature · 2026-02-10T03:45:38Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 365f163118

natureofnature · 2026-02-11T04:54:28Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efb3b316cf

natureofnature · 2026-02-11T05:33:40Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36717281d6

natureofnature · 2026-02-11T07:53:37Z

@codex review

natureofnature · 2026-02-11T07:57:02Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0b13984aed

natureofnature · 2026-02-11T09:47:23Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2158eeee99

princepride · 2026-02-12T05:14:36Z

Docs build failed, PTAL

WARNING -  griffe: vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:744: Confusing indentation for continuation line 11 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING -  griffe: vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:745: Confusing indentation for continuation line 12 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING -  griffe: vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:746: Confusing indentation for continuation line 13 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING -  griffe: vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:747: Confusing indentation for continuation line 14 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING -  griffe: vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:748: Confusing indentation for continuation line 15 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING -  griffe: vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:750: Confusing indentation for continuation line 17 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING -  griffe: vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py:751: Confusing indentation for continuation line 18 in docstring, should be 4 * 2 = 8 spaces, not 6

princepride · 2026-02-12T05:23:11Z

Please also add yaml file

princepride · 2026-02-12T08:42:39Z

@Gaohan123 @ZJY0516 @hsliuustc0106 PTAL

Gaohan123 · 2026-02-13T08:38:12Z

+python -c "from mooncake.engine import TransferEngine; print('OK')"
+
+# Reinstall if needed
+pip install mooncake


Can we use uv to align with vLLM?

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 21 comments.

Comments suppressed due to low confidence (1)

vllm_omni/distributed/omni_connectors/connectors/base.py:51

Overridden method signature does not match call, where it is passed too many arguments. Overriding method method MooncakeTransferEngineConnector.cleanup matches the call.
Overridden method signature does not match call, where it is passed an argument named 'from_stage'. Overriding method method MooncakeTransferEngineConnector.cleanup matches the call.
Overridden method signature does not match call, where it is passed an argument named 'to_stage'. Overriding method method MooncakeTransferEngineConnector.cleanup matches the call.

    def cleanup(self, request_id: str) -> None:

2. enable RDMA test
3. add cross-node testing
4. add cross-node benchmark
5. update for threading issues
6. verified Bagel support using Mooncake RDMA (Bagel-related support will be submitted in the next PR)
7. update connector name/doc
Signed-off-by: natureofnature <wzliu@connect.hku.hk>

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update connector interfaces; update benchmark folder Signed-off-by: natureofnature <wzliu@connect.hku.hk>

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

hsliuustc0106 · 2026-02-24T01:42:41Z

PR #1019 Review: [Feature] vLLM-Omni RDMA Connector

Overview

This PR adds RDMA-based data transfer support using the Mooncake Transfer Engine for high-performance multi-node communication between vLLM-Omni stages. The implementation includes a new MooncakeTransferEngineConnector that supports both CPU and GPU memory pools with RDMA and TCP protocols.

Summary of Changes

20 files changed, +3,645 additions, -63 deletions
New connector: MooncakeTransferEngineConnector with managed memory pool and zero-copy deserialization
Documentation: Comprehensive design doc and test configuration guide
Tests: Integration tests for buffer management and RDMA operations
Configuration updates: Added connector to factory and initialization utilities

✅ Strengths

Well-documented architecture: The design doc clearly explains topology limitations, port offset scheme, and troubleshooting. The benchmark results showing 58x speedup over TCP are impressive.
Robust memory management:
- BufferAllocator with thread-safe free list and double-free detection
- ManagedBuffer with RAII pattern (__del__, context manager support)
- TTL-based cleanup (5 minutes) for stale buffers
Thoughtful thread safety:
- Thread-local ZMQ REQ socket caching to avoid ordering violations
- Inproc notification system for listener thread wake-up
- Proper use of locks around shared state (_local_buffers_lock)
Comprehensive error handling:
- Early validation of ZMQ bind errors propagated to __init__
- Socket invalidation on timeout/error
- Graceful shutdown with idempotent close()
Good test coverage:
- Buffer management tests
- RDMA transfer tests
- Benchmarking infrastructure included
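To make the memory-management points above concrete, here is a minimal sketch of a free-list allocator with double-free detection plus a buffer handle that releases itself exactly once. It reuses the names BufferAllocator and ManagedBuffer from the review, but the bodies, the offset-based interface, and the fixed block size are assumptions for illustration and are not taken from the PR:

```python
import threading
from typing import List, Set


class BufferAllocator:
    """Fixed-size block allocator over a pool (illustrative only)."""

    def __init__(self, pool_size: int, block_size: int) -> None:
        self._lock = threading.Lock()
        # Offsets of whole blocks that are currently free.
        self._free: List[int] = list(range(0, pool_size - block_size + 1, block_size))
        # Offsets currently handed out, used to detect double frees.
        self._in_use: Set[int] = set()

    def alloc(self) -> int:
        with self._lock:
            if not self._free:
                raise MemoryError("pool exhausted")
            offset = self._free.pop()
            self._in_use.add(offset)
            return offset

    def free(self, offset: int) -> None:
        with self._lock:
            if offset not in self._in_use:
                raise ValueError(f"double free or unknown offset: {offset}")
            self._in_use.remove(offset)
            self._free.append(offset)


class ManagedBuffer:
    """Releases its block once, via release(), a `with` block, or __del__."""

    def __init__(self, allocator: BufferAllocator) -> None:
        self._allocator = allocator
        self._released = True  # stays True if alloc() below raises
        self.offset = allocator.alloc()
        self._released = False

    def release(self) -> None:
        if not self._released:
            self._released = True
            self._allocator.free(self.offset)

    def __enter__(self) -> "ManagedBuffer":
        return self

    def __exit__(self, *exc_info) -> None:
        self.release()

    def __del__(self) -> None:
        self.release()
```

Tracking the in-use set separately from the free list is what lets free() reject a second release of the same offset instead of silently adding it to the free list twice.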

⚠️ Issues & Concerns

P1 - Critical Issues

1. CUDA synchronization before RDMA send (line 582-595)
The code synchronizes on the CUDA stream before RDMA sends, which is critical for correctness. However, the synchronization logic has duplicated code paths and could be simplified:

# Current: Duplicate sync blocks in both branches
if src_view.device != dst_tensor.device:
    # ... sync ...
else:
    # ... sync ...

Suggestion: Factor out the synchronization into a helper method to reduce duplication and ensure consistency.

2. ZMQ socket used across threads (listener thread + worker threads)
The _zmq_listener_loop creates a ROUTER socket that the listener thread uses for recv_multipart(). Worker threads hand their results back through response_queue.put(), after which socket.send_multipart() is called. The response_queue provides a thread-safe handoff, and the actual send_multipart() call happens in the listener thread, which is correct. The notify_recv PULL socket is bound in the listener thread, and worker threads reach it only through the PUSH socket that _notify_listener() connects, which is also proper. This appears correct, but the comment at line 1095 about avoiding ZMQ socket use across threads suggests it was a concern at some point, so please verify that no socket is actually shared between threads.

3. Resource leak on transfer failure (lines 1132-1145)
When RDMA write fails, the code logs a warning and returns TRANS_ERROR but does not release the receive buffer on the sender side:

if ret == 0:
    self.cleanup(meta.request_id)
    response_queue.put((identity, TRANS_DONE))
else:
    # Buffer retained for retry - but what about receiver's buffer?
    response_queue.put((identity, TRANS_ERROR))

The receiver's recv_buffer will be leaked if the transfer fails and times out on the receiver side.

4. put() accepts calls after connector falls back to receiver mode
When ZMQ bind fails, can_put is set to False (line 1034), but put() checks self.can_put before rejecting (line 501). The order of operations in __init__ means _bind_error is checked AFTER can_put is modified. However, there's a race: if put() is called between bind failure and error propagation, it would execute.

P2 - Important Issues

5. Missing optional dependency guard in imports
The connector imports mooncake.engine.TransferEngine at top level with a try/except, but the module is still imported in factory.py without the same guard. This could cause import failures even when not using RDMA:

# factory.py line ~125
from .connectors.mooncake_transfer_engine_connector import MooncakeTransferEngineConnector

This triggers the try: from mooncake.engine... block even if the import fails, causing TransferEngine = None but the module is still loaded.

6. Cleanup signature mismatch
The cleanup() method in MooncakeTransferEngineConnector accepts from_stage and to_stage parameters (line 888), but the base class OmniConnectorBase likely has a different signature. Copilot noted this issue - verify the base class signature and ensure compatibility.

7. Inconsistent error handling in get() for unresolved sender_host
Lines 745-750 raise RuntimeError when sender_host is unresolved, but the metadata query path at line 681 already checks for valid sender_host before constructing the ZMQ address. The validation at line 763 is also checking str(src_host).lower() == "auto". This creates three different error paths for the same condition.

8. Port offset scheme complexity
The port calculation logic is complex and could be error-prone. While documented, the formula involves multiple offsets that are combined at runtime. Consider extracting this to a dedicated helper function with unit tests.

📋 Code Quality Notes

Type hints: Good use of type hints throughout, though some Any types could be more specific.
Logging: Extensive logging with appropriate levels (debug/info/warning/error).
Naming: Clear, descriptive variable names and function names.
Documentation: Inline docstrings are comprehensive, especially the class docstring explaining topology limitations.

🔍 Specific Code Observations

Line 1032-1037: When ZMQ bind fails, can_put is set to False and _bind_error is set, then the listener returns. The __init__ at line 378 checks _bind_error and propagates it. However, there's a window where other threads could call put() before the exception is raised.

Line 1132: Variable src_lengths is used but not defined in the function scope (it should be src_lengths from line 1128).

Line 942: if getattr(self, "_closed", True): has inverted logic - it should be if getattr(self, "_closed", False): since the default should be "not closed" to allow first-time execution. Currently it would always return early unless _closed is explicitly False.
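The default passed to getattr is consulted only when the attribute is missing, so the right value depends on whether __init__ always sets _closed before close() can run. A small sketch of the two cases; HalfBuilt and close_returns_early are hypothetical names used for illustration and do not appear in the PR:

```python
class HalfBuilt:
    """Hypothetical object whose __init__ never reached `self._closed = False`."""


def close_returns_early(obj: object, default: bool) -> bool:
    # Mirrors the guard `if getattr(self, "_closed", default): return`.
    return bool(getattr(obj, "_closed", default))


half_built = HalfBuilt()
# Attribute missing: the default alone decides whether close() bails out.
early_with_true = close_returns_early(half_built, default=True)
early_with_false = close_returns_early(half_built, default=False)

initialized = HalfBuilt()
initialized._closed = False
# Attribute present: the default is ignored and close() proceeds.
early_after_init = close_returns_early(initialized, default=True)

print(early_with_true, early_with_false, early_after_init)
```

With a default of True, a partially constructed object skips teardown entirely; with False, close() runs even if construction stopped early. Which behaviour is wanted is a design decision for the connector.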

✅ Test Coverage

The tests look comprehensive:

Buffer allocation/deallocation with edge cases
Double-free detection
RDMA transfer with different modes (serialize, pin_memory, gpu)
Cross-node benchmarking infrastructure

📝 Recommendations

Fix the inverted _closed check in close() (line 942)
Add receiver buffer cleanup on transfer failure in _handle_pull_request
Guard the factory import of RDMA connector behind optional dependency check
Consider a helper for CUDA synchronization to reduce duplication
Add unit tests for port offset calculation logic
Document the race condition in put() after bind failure (or add a lock around the transition)
Verify base class signature compatibility for cleanup() method

Overall Assessment

Status: Request Changes - This is a significant feature with good design, but has several critical issues that should be addressed before merge.

The RDMA connector implementation is well-architected and thoroughly documented. The performance improvements (58x speedup) are substantial. However, there are 3-4 issues that could lead to memory leaks or race conditions in production, and a potential resource cleanup bug that should be fixed.

natureofnature · 2026-02-24T02:39:29Z

Response to P1 & P2 Review Comments

P1-1: CUDA synchronization before RDMA send (line 582-595)

The synchronization logic is correct — the three branches handle genuinely different device combinations:

Cross-device copy (line 581-590): Uses non_blocking=True and syncs on the source device for D2H or destination device for H2D.
Same-device copy (line 591-595): Blocking copy, then sync on CUDA device if applicable.
bytes→tensor copy (line 596-602): Always CPU source; sync only if destination is CUDA.

Each branch correctly identifies which CUDA device to synchronize on. Factoring into a helper is a valid cleanup suggestion and can be done as a follow-up — it does not affect correctness.
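If the helper is extracted later, the device selection described above could look roughly like the sketch below. It is an assumption-laden illustration: devices are plain strings such as "cpu" or "cuda:0" rather than torch.device objects, device_to_sync is a hypothetical name, and the actual synchronization call (for example torch.cuda.synchronize on the returned device) is left to the caller.

```python
from typing import Optional


def device_to_sync(src: str, dst: str) -> Optional[str]:
    """Pick the CUDA device to synchronize after a copy, or None for CPU-only.

    Devices are plain strings such as "cpu" or "cuda:0" (a simplification).
    """
    src_is_cuda = src.startswith("cuda")
    dst_is_cuda = dst.startswith("cuda")
    if src_is_cuda and not dst_is_cuda:
        return src  # D2H: wait on the source device
    if dst_is_cuda and not src_is_cuda:
        return dst  # H2D, including bytes -> CUDA tensor: wait on the destination
    if src_is_cuda and dst_is_cuda:
        # A same-device copy syncs that device. For two different GPUs this
        # thread does not say which side to wait on; using the source here
        # is an assumption.
        return src
    return None  # CPU -> CPU needs no CUDA synchronization


for pair in [("cuda:0", "cpu"), ("cpu", "cuda:1"), ("cuda:0", "cuda:0"), ("cpu", "cpu")]:
    print(pair, "->", device_to_sync(*pair))
```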

P1-2: ZMQ socket used across threads

This is a false positive. The design specifically avoids cross-thread socket access via a single-writer pattern:

The ROUTER socket is created, recv_multipart()'d, and send_multipart()'d exclusively in the listener thread (_zmq_listener_loop).
Worker threads (from _sender_executor thread pool) perform the time-consuming RDMA transfer via engine.batch_transfer_sync_write(), then place the result (identity, response_bytes) into a Python queue.Queue — which is thread-safe by design.
Workers wake up the listener through a per-thread PUSH socket (stored in threading.local()) connected to an inproc:// PULL socket owned by the listener thread.
The listener, upon being woken by the PULL notification, drains the queue and calls socket.send_multipart() in its own thread.

At no point does any worker thread touch the ROUTER socket. The response_queue carries plain bytes objects, not socket references.
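The handoff can be illustrated without ZMQ. In the sketch below a plain list stands in for the ROUTER socket and a blocking queue.Queue.get() stands in for the inproc PUSH/PULL wake-up, so it shows only the single-writer idea and not the actual connector code; all names are placeholders.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

response_queue: "queue.Queue" = queue.Queue()
sent = []               # stands in for the ROUTER socket; only the listener touches it
sender_threads = set()  # records which thread performed each "send"
STOP = object()         # sentinel telling the listener to exit


def worker(identity: bytes) -> None:
    # Placeholder for the slow RDMA transfer; only the result is handed over.
    response_queue.put((identity, b"TRANS_DONE"))


def listener() -> None:
    while True:
        item = response_queue.get()  # blocking get replaces the inproc wake-up
        if item is STOP:
            break
        sent.append(item)  # the real code would call send_multipart() here
        sender_threads.add(threading.current_thread().name)


listener_thread = threading.Thread(target=listener, name="listener")
listener_thread.start()
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(8):
        pool.submit(worker, f"peer-{i}".encode())
# Leaving the `with` block waits for all workers, so STOP is queued last.
response_queue.put(STOP)
listener_thread.join()

print(len(sent), sender_threads)
```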

P1-3: Resource leak on transfer failure (lines 1132-1145)

This is handled on both sides:

Sender side: The buffer is intentionally retained in _local_buffers to allow receiver retries. It is eventually reclaimed by the TTL mechanism (_cleanup_stale_buffers, runs every ~10s, default TTL = 60s), or by close().
Receiver side: When get() receives TRANS_ERROR or encounters any exception, the recv_buffer is explicitly released in the except block at line 885 (recv_buffer.release()), so there is no leak on the consumer side.
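A rough sketch of the sender-side TTL reclamation, for readers who have not opened the file. RetainedBuffers and its methods are hypothetical stand-ins for the _local_buffers bookkeeping; only the 60 s default comes from this thread, and timestamps are injectable so the behaviour can be checked without sleeping.

```python
import threading
import time
from typing import Dict, List, Optional, Tuple

DEFAULT_TTL_S = 60.0  # default TTL quoted in the reply above


class RetainedBuffers:
    """Hypothetical stand-in for the sender's retained-buffer bookkeeping."""

    def __init__(self, ttl_s: float = DEFAULT_TTL_S) -> None:
        self._ttl_s = ttl_s
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def retain(self, request_id: str, payload: bytes, now: Optional[float] = None) -> None:
        stamp = time.monotonic() if now is None else now
        with self._lock:
            self._entries[request_id] = (stamp, payload)

    def cleanup_stale(self, now: Optional[float] = None) -> List[str]:
        """Drop entries older than the TTL and return their request ids."""
        current = time.monotonic() if now is None else now
        with self._lock:
            stale = [
                rid for rid, (stamp, _) in self._entries.items()
                if current - stamp > self._ttl_s
            ]
            for rid in stale:
                del self._entries[rid]
        return stale

    def __contains__(self, request_id: str) -> bool:
        with self._lock:
            return request_id in self._entries


buffers = RetainedBuffers()
buffers.retain("r1", b"kv", now=0.0)
buffers.retain("r2", b"kv", now=50.0)
expired = buffers.cleanup_stale(now=61.0)
print(expired)
```

A buffer kept for retries after a failed transfer therefore lives at most one TTL plus one cleanup interval before it is reclaimed.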

P1-4: put() race between bind failure and error propagation

There is no race condition. The sequence in __init__ is:

self.can_put = (role == "sender") → set to True
self._listener_thread.start() → background thread starts
self._listener_ready.wait(timeout=1.0) → blocks until listener binds or fails
If bind fails → listener sets self._bind_error and signals _listener_ready
__init__ resumes → checks self._bind_error is not None → raises RuntimeError

Since __init__ raises before returning, no external code can ever obtain a reference to the connector object. It is impossible to call put() on a connector whose bind failed.
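The sequence above can be reproduced in a few lines. This is a simplified sketch: SketchConnector is a made-up class, the fail_bind flag is a hypothetical switch that simulates a bind failure, and no real socket is involved.

```python
import threading
from typing import Optional


class SketchConnector:
    def __init__(self, role: str, fail_bind: bool = False) -> None:
        self.can_put = role == "sender"
        self._bind_error: Optional[Exception] = None
        self._listener_ready = threading.Event()
        self._fail_bind = fail_bind  # hypothetical switch simulating a bind failure
        self._listener_thread = threading.Thread(target=self._listener_loop, daemon=True)
        self._listener_thread.start()
        # Block until the listener reports success or failure.
        self._listener_ready.wait(timeout=1.0)
        if self._bind_error is not None:
            raise RuntimeError(f"listener bind failed: {self._bind_error}")

    def _listener_loop(self) -> None:
        if self._fail_bind:
            self.can_put = False
            self._bind_error = OSError("address already in use")
        # Signal __init__ only after the error (if any) has been recorded.
        self._listener_ready.set()


healthy = SketchConnector("sender")
broken = None
try:
    broken = SketchConnector("sender", fail_bind=True)
except RuntimeError as exc:
    print("constructor raised:", exc)
print(healthy.can_put, broken)
```

Because the constructor raises, the name on the left of the assignment is never bound to the half-initialized object, which is why no caller can reach put() on it.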

P2-5: Missing optional dependency guard in factory.py

Already handled. The import in factory.py (line 94-102) is a lazy import inside the _create_mooncake_transfer_engine_connector() function body — it only executes when the factory is asked to create this specific connector type, not at module load time. Additionally, __init__.py guards the top-level import with try/except ImportError.
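The lazy-import pattern in general form, as a sketch rather than the contents of factory.py: create_connector and the "shared_memory" branch are invented for illustration, while mooncake.engine and TransferEngine are the module and class named earlier in this thread.

```python
import importlib


def create_connector(kind: str):
    """Hypothetical factory: the optional dependency is imported on demand."""
    if kind == "shared_memory":
        # Placeholder for a connector with no optional dependencies.
        return "shared-memory connector"
    if kind == "mooncake_transfer_engine":
        try:
            # Runs only when this connector type is actually requested.
            engine = importlib.import_module("mooncake.engine")
        except ImportError as exc:
            raise RuntimeError("mooncake is required for this connector") from exc
        return engine.TransferEngine
    raise ValueError(f"unknown connector kind: {kind}")


print(create_connector("shared_memory"))
```

Importing the module that defines create_connector never touches mooncake, so users of other connector types are unaffected when the package is missing.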

P2-6: Cleanup signature mismatch

This is an intentional backward-compatible extension. The base class defines cleanup(self, request_id: str). The override adds from_stage: str | None = None, to_stage: str | None = None — both with default values. A call like connector.cleanup("r1") still works and matches the base class contract. Copilot itself flagged this as "suppressed due to low confidence." We can update the base class signature in a follow-up if desired.
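A stripped-down illustration of why the override stays compatible. The base signature matches the one quoted by Copilot; the method bodies and the ExtendedConnector class are placeholders, not the real implementation.

```python
from typing import List, Optional, Tuple


class OmniConnectorBase:
    """Stand-in for the real base class; only the signature is from the thread."""

    def cleanup(self, request_id: str) -> None:
        raise NotImplementedError


class ExtendedConnector(OmniConnectorBase):
    def __init__(self) -> None:
        self.calls: List[Tuple[str, Optional[str], Optional[str]]] = []

    def cleanup(
        self,
        request_id: str,
        from_stage: Optional[str] = None,
        to_stage: Optional[str] = None,
    ) -> None:
        # Extra parameters have defaults, so base-style calls still work.
        self.calls.append((request_id, from_stage, to_stage))


connector = ExtendedConnector()
connector.cleanup("r1")                                # base-class style call
connector.cleanup("r2", from_stage="0", to_stage="1")  # extended call
print(connector.calls)
```

The reverse direction is where the warning bites: code that passes from_stage or to_stage to an arbitrary OmniConnectorBase instance would fail on subclasses that keep the narrower signature.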

P2-7: Inconsistent error handling for unresolved sender_host

The three checks serve different entry points and are defense-in-depth, not duplication:

Line 745-750 (get() entry guard): Rejects get(metadata=None) early if sender_host is unresolved, forcing callers to call update_sender_info() first.
Line 681 (_query_metadata_from_sender internal): Validates before constructing the ZMQ address — this method could theoretically be called from other contexts in the future.
Line 758 (get() post-metadata): Validates source_host from externally-provided metadata (e.g., from a queue), which is a different trust boundary.

These are layered checks at different trust levels, not redundant code paths.

P2-8: Port offset scheme complexity

The port offset calculation (zmq_port + purpose_offset + stage_offset + dp_index * tp_size + tp_rank) is implemented in the orchestration layer (PR2 / KV transfer manager), not in this connector PR. The connector simply receives a pre-computed zmq_port via its config. Documentation for the scheme is provided in mooncake_transfer_engine_connector.md. Extracting it into a helper with unit tests is a good suggestion and will be addressed in PR2.
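For when that helper is written, a possible shape is sketched below. The function name, the argument validation, and the example values are assumptions; only the formula itself is taken from the paragraph above.

```python
def compute_zmq_port(
    zmq_port: int,
    purpose_offset: int,
    stage_offset: int,
    dp_index: int,
    tp_size: int,
    tp_rank: int,
) -> int:
    """Apply zmq_port + purpose_offset + stage_offset + dp_index * tp_size + tp_rank."""
    if tp_size <= 0 or not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must satisfy 0 <= tp_rank < tp_size")
    if dp_index < 0:
        raise ValueError("dp_index must be non-negative")
    port = zmq_port + purpose_offset + stage_offset + dp_index * tp_size + tp_rank
    if not 0 < port <= 65535:
        raise ValueError(f"computed port {port} is outside the valid TCP range")
    return port


# Example values are arbitrary: base 50000, purpose 100, stage 10, dp 1, tp 4, rank 2.
print(compute_zmq_port(50000, 100, 10, dp_index=1, tp_size=4, tp_rank=2))
```

Because dp_index * tp_size + tp_rank is unique for each (dp_index, tp_rank) pair when tp_rank < tp_size, two ranks collide only if the purpose or stage offsets overlap, which is the part worth covering in unit tests.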

Summary & Fix Plan

#	Issue	Severity	Verdict	Action	Priority
P1-1	CUDA sync code duplication	P1	Correct but verbose	Extract helper method to reduce duplication in PR2	Follow-up (non-blocking)
P1-2	ZMQ socket cross-thread	P1	False positive — single-writer pattern is correct	No change needed	N/A
P1-3	Resource leak on RDMA failure	P1	Handled — receiver releases buffer, sender has TTL	No change needed	N/A
P1-4	put() race on bind failure	P1	No race — `__init__` blocks and raises	No change needed	N/A
P2-5	Import guard in factory.py	P2	Already handled — lazy import in function body	No change needed	N/A
P2-6	cleanup() signature mismatch	P2	Intentional backward-compatible extension	Update base class signature in PR2	Follow-up (non-blocking)
P2-7	Redundant sender_host checks	P2	Defense-in-depth at different trust boundaries	No change needed	N/A
P2-8	Port offset complexity	P2	Not in this PR scope (PR2 code)	Add helper + unit tests in PR2	PR2

Conclusion: No blocking issues found. P1-2, P1-3, and P1-4 are either false positives or already handled. Two low-priority cleanups (P1-1, P2-6) and the port offset helper (P2-8) can be addressed as follow-ups in PR2 without blocking merge.

(#1363) Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> * [CI][skip ci]Update H100 image link based on #1518 (#1538) Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com> * Fix no embed text spk tokens (#1540) Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com> * [Debug] Merge vllm pull 35368 (#1534) Signed-off-by: tzhouam <tzhouam@connect.ust.hk> * [Docs] update async chunk docs diagram [skip ci] (#1530) Signed-off-by: Rein Yang <ruiruyang2@gmail.com> * fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio (#1554) Signed-off-by: linyueqian <linyueqian@outlook.com> * [NPU][Bugfix] Align GPU side and recover qwen3-tts (#1564) Signed-off-by: gcanlin <canlinguosdu@gmail.com> * [BugFix] Fix unexpected crash when init OmniDiffusion (#1562) Signed-off-by: Semmer2 <semmer@live.cn> * [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. 
(#1543) Signed-off-by: yenuo26 <410167048@qq.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> * [BugFix]: fix a lot of bug (#1565) Signed-off-by: princepride <wangzhipeng628@gmail.com> * feat: add HyperCLOVAX-SEED-Omni-8B support Model files: - vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer - vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline (HyperCLOVAXAudioPipeline) using Unit-BigVGAN codec - vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py: thinker2vision_decoder and thinker2audio_decoder — extract discrete tokens from LLM output; truncate/pad vision codes to 729 (27x27) for decoder Registry: - vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and HyperCLOVAXAudioPipeline with post-process functions Stage config: - vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config Stage 0: LLM thinker (TP=4, GPUs 0-3), Stage 1: vision decoder (GPU 4), Stage 2: audio decoder (GPU 5) Bug fixes for HyperCLOVAX compatibility: - diffusion/request.py: add extra dict field to OmniDiffusionRequest so vision_tokens/audio_tokens from stage input processors reach the pipeline - entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information into OmniDiffusionRequest.extra before creating request - entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where thinker2vision_decoder/thinker2audio_decoder return []) - entrypoints/async_omni.py: handle skipped sentinel in _process_single_result so text-only requests complete without crashing on Stage 1/2 * fix: correct decoder params and HCX porting fixes - hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50 (matches OmniServe production defaults; 3.5 caused over-amplified autoguidance → shrunken/degraded output images) - omni_stage.py: skip empty engine inputs for text-only requests - 
async_omni_diffusion.py: extract OmniTokensPrompt.additional_information into OmniDiffusionRequest.extra (audio_tokens/vision_tokens) - registry.py: HCX Omni diffusion model registration fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: HyperCLOVAX-SEED-Omni-8B stage pipeline and entrypoint fixes * fix: change guidance_scale from 9.0 to 0.75 (autoguidance scale, OmniServe default) * feat: add audio decoder Stage 2 to hcx_omni pipeline - Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml - GPU 5 assigned for audio decoder (Unit-BigVGAN / NCCosybigvganDecoder) - Add runtime edge 0->2 (thinker -> audio decoder) - Implement post-generation PCM chunk streaming for audio output (4800 samples / 200ms per SSE event @ 24kHz, int16 base64-encoded) Refs: github.com/vllm-project/vllm-omni/pull/869 (already incorporated) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: vllm version compatibility for HyperCLOVAX audio decoder startup - config/model.py: try/except fallback for AttentionBackendEnum import (vllm.v1.attention.backends.registry absent in older vllm builds) - pipeline_hyperclovax_audio.py: return actual named_parameters() from load_weights() when using MAR checkpoint so diffusers_loader strict check passes (weights loaded eagerly in __init__ via MAR extraction) - qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs for check_interleaved_audio_video and merge_interleaved_embeddings which are absent in older vllm qwen2_5_omni_thinker; these symbols are only exercised by Qwen models, not HyperCLOVAX Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: add edge 1→2 and correct model key in hcx_omni.yaml Stage 2 - Add runtime edge from:1 to:2 (required for Stage-2 connector init; without it AsyncOrchestrator cannot route to audio decoder at runtime) - Change model_subdir to model for Stage-2 engine_args to match total-poc working reference config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> 
* fix: audio S2S output - handle diffusion outputs in _create_audio_choice HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output like LLM pipelines. Fix three locations: 1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output when final_res.outputs is empty (diffusion path). 2. Streaming audio path: same fix for _final_res.outputs[0]. 3. Both loops (for output in final_res.outputs): fall back to single synthetic choice at index 0 when outputs list is empty. 4. Handle bytes audio output from HyperCLOVAXAudioPipeline post-process (returns WAV bytes, not tensors like Qwen3-Omni). Also fixes audio input (A2T) regression: skip diffusion prompt extraction when mm_data has audio content (added in previous session). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: parse WAV bytes with soundfile for uniform PCM chunk streaming HyperCLOVAXAudioPipeline returns WAV bytes including 44-byte header. The previous byte-offset splitting included the header in the first chunk, corrupting it. Fix: parse with soundfile to get float32 PCM, then convert to int16 chunks uniformly regardless of source type (bytes or tensor). Verified: 136 audio chunks x 200ms = 27.04s audio streamed correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: zero-shot TTS with speaker embedding from input audio - serving_chat.py: extract last input_audio base64 from request messages and inject as ref_audio_b64 into engine_prompt dict - thinker2audio_decoder: read ref_audio_b64 from prompt and pass as ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline) - hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot) which uses ECAPA-TDNN speaker encoder instead of finetuned ID lookup Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis matching the voice characteristics of the original speaker. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: wire audio decoder Stage 2 to hcx_omni pipeline and fix S2S flow - Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml with GPU 5, gpu_memory_utilization 0.4, edge 0->2 from thinker - Fix thinker2audio_decoder: correct audio token range (128606-135167), remap to [0, 6561) for BigVGAN input, handle empty token case gracefully - Fix pipeline_hyperclovax_audio.py post_process_func signature and incorporate PR#869 BUG FIX patches for stable audio generation * fix: use finetuned audio decoder and fix transformers_modules deserialization - hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot, ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id). Zero-shot decoder required ref_audio (mel spectrogram) which is unavailable for text-only requests and incompatible with finetuned decoder path. - pipeline_hyperclovax_audio.py: guard ref_audio processing with 'not self.bigvgan.finetune' — finetuned decoder has no ECAPA-TDNN encoder, so passing ref_audio bytes would crash with 'expected 100 channels'. - omni_stage.py: add HuggingFace modules cache (~/.cache/huggingface/modules) to sys.path before queue.get_nowait() in try_collect(). Stage-0 pickles outputs containing custom classes from transformers_modules (trust_remote_code), but the API server process doesn't have this path, causing deserialization failures that silently drop Stage-0 outputs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: restore zero-shot speaker cloning with fallback for text-only requests - hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN) for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed integer speaker_id and lost the input speaker's voice. - pipeline_hyperclovax_audio.py: add zero-mel fallback branch for finetune=False + ref_audio=None case. 
When a text-only request arrives (no input audio → no ref_audio), ECAPA-TDNN receives a zero mel tensor [1, num_mels, 64] instead of crashing with 'expected 100 channels'. S2S requests always have ref_audio so the zero-shot cloning path is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: add stage config yaml for HCX audio decoder Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com> * feat: add HyperCLOVAX-SEED-Omni 8B model as vllm-omni executor Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com> * feat: add HCX audio decoder pipeline Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com> * fix: modify exception for HCX audio decoder (GAN) Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com> * fix: default temperature set to 0, and pipeline model evaluation mode Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com> --------- Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com> Signed-off-by: dengyunyang <584797741@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: samithuang <285365963@qq.com> Signed-off-by: Samit <285365963@qq.com> Signed-off-by: lishunyang <lishunyang12@163.com> Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com> Signed-off-by: princepride <wangzhipeng628@gmail.com> Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com> Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: Kyle Huang <yellowsea@gmail.com> Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: natureofnature <wzliu@connect.hku.hk> Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: mxuax <mxuax@connect.ust.hk> Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com> Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com> Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com> Signed-off-by: Divyansh Singhvi 
<divyanshsinghvi@gmail.com> Signed-off-by: Lin, Fanli <fanli.lin@intel.com> Signed-off-by: Fanli Lin <fanli.lin@intel.com> Signed-off-by: Fanli Lin <fanli0116@gmail.com> Signed-off-by: dongbo910220 <1275604947@qq.com> Signed-off-by: Ding Zuhao <e1583181@u.nus.edu> Signed-off-by: jzz <e1583181@u.nus.edu> Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Signed-off-by: Pierre Le Guen <26087574+PierreLeGuen@users.noreply.github.com> Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com> Signed-off-by: ram16g <anlianfengjie@163.com> Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Signed-off-by: pablo <juanz9312@gmail.com> Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: anna <lee.anna@navercorp.com> Signed-off-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com> Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com> Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk> Signed-off-by: tzhouam <tzhouam@connect.ust.hk> Signed-off-by: hsliu <liuhongsheng4@huawei.com> Signed-off-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com> Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com> Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com> Signed-off-by: xiedeyantu <czjourney@163.com> Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com> Signed-off-by: Junhong Liu <ljh_lbj@163.com> Signed-off-by: David Chen <530634352@qq.com> Signed-off-by: weichen <calvin_zhu0210@outlook.com> Signed-off-by: Yan Ma <yan.ma@intel.com> Signed-off-by: ApsarasX <apsarax@outlook.com> Signed-off-by: Chenguang ZHENG <645327136@qq.com> Signed-off-by: yenuo26 <410167048@qq.com> Signed-off-by: Semmer2 <semmer@live.cn> Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com> Signed-off-by: Gao Han 
<hgaoaf@connect.ust.hk> Signed-off-by: Rein Yang <ruiruyang2@gmail.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com> Signed-off-by: SYLAR <125541396+lishunyang12@users.noreply.github.com> Signed-off-by: Ziming Huang <1520787127@qq.com> Signed-off-by: SamitHuang <285365963@qq.com> Signed-off-by: GG-li <3226868735@qq.com> Signed-off-by: CHEN <116010019@link.cuhk.edu.cn> Signed-off-by: John Liu BUAA <liukecheng97@gmail.com> Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com> Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Daniel Huang <daniel1.huang@intel.com> Signed-off-by: Alex Brooks <albrooks@redhat.com> Signed-off-by: Sy03 <1370724210@qq.com> Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: UsamaKenway <usamakenway@gmail.com> Signed-off-by: Hunter Liu <hunter@liu.sh> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: wuhang <wuhang6@huawei.com> Signed-off-by: wuhang <whlbx@hotmail.com> Signed-off-by: pablo <pablo@agigo.ai> Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com> Signed-off-by: Dong Wang <dongw2019@gmail.com> Signed-off-by: sniper35 <dongw2019@gmail.com> Signed-off-by: Yanick Schraner <yanick.schraner@bs.ch> Signed-off-by: Sangchun Ha <seomk9896@gmail.com> Signed-off-by: jader <yjader@foxmail.com> Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com> Signed-off-by: AndyZhou952 <jzhoubc@connect.ust.hk> Signed-off-by: Yupu <feng.yu.pu0330@gmail.com> Signed-off-by: Kevin H. 
Luu <khluu000@gmail.com> Signed-off-by: zhumingjue <zhumingjue@huawei.com> Signed-off-by: zhumingjue138 <zhumingjue@huawei.com> Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: Jared Wen <w13431838023@gmail.com> Signed-off-by: xulusjb <fdukeshik@gmail.com> Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com> Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com> Signed-off-by: Baoyuan Qi <qibaoyuan@126.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com> Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com> Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Signed-off-by: baoyuan qi <qibaoyuan@126.com> Signed-off-by: Prajwal A <prajwalanagani@gmail.com> Signed-off-by: 丁宁 <nndding@gmail.com> Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com> Signed-off-by: dingning<dingning7@xiaomi.com> Signed-off-by: dingning <dingning7@xiaomi.com> Signed-off-by: dingning <dingning@xiaomi.com> Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com> Signed-off-by: Canlin Guo <961750412@qq.com> Signed-off-by: shijin zhang <75300765+Dovis01@users.noreply.github.com> Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com> Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr> Co-authored-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com> Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com> Co-authored-by: dengyunyang <584797741@qq.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Canlin Guo <canlinguosdu@gmail.com> Co-authored-by: Samit <285365963@qq.com> Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com> Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com> Co-authored-by: kYLe 
<yellowsea@gmail.com> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: NATURE <wzliu@connect.hku.hk> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk> Co-authored-by: root <root@hk01dgx028.cm.cluster> Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com> Co-authored-by: amy-why-3459 <wuhaiyan17@huawei.com> Co-authored-by: Rein Yang <ruiruyang2@gmail.com> Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com> Co-authored-by: dsinghvi <divyanshsinghvi@gmail.com> Co-authored-by: Fanli Lin <fanli.lin@intel.com> Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com> Co-authored-by: Ding Zuhao <e1583181@u.nus.edu> Co-authored-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com> Co-authored-by: Pierre LE GUEN <26087574+PierreLeGuen@users.noreply.github.com> Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: ram16g <anlianfengjie@163.com> Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Co-authored-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com> Co-authored-by: Juan Pablo Zuluaga <46724788+JuanPZuluaga@users.noreply.github.com> Co-authored-by: muziyuhui666 <111362884+muziyuhui666@users.noreply.github.com> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: ceanna93 <fairyanna@naver.com> Co-authored-by: anna <lee.anna@navercorp.com> Co-authored-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com> Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com> Co-authored-by: liuzhenwei <zhenweiliu@habana.ai> Co-authored-by: erfgss <97771661+erfgss@users.noreply.github.com> Co-authored-by: Jensen <czjourney@163.com> Co-authored-by: 
Junhong Liu <ljh_lbj@163.com> Co-authored-by: weichen <calvin_zhu0210@outlook.com> Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: ApsarasX <apsarax@outlook.com> Co-authored-by: Chenguang Zheng <645327136@qq.com> Co-authored-by: Jiaping Wu <53215702+ElleElleWu@users.noreply.github.com> Co-authored-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com> Co-authored-by: Gao Han <gaohan19@huawei.com> Co-authored-by: rein yang <73573651+R2-Y@users.noreply.github.com> Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com> Co-authored-by: ChenWenjing <54166744+Shirley125@users.noreply.github.com> Co-authored-by: Bhanu068 <voutharoja.bhanu06@gmail.com> Co-authored-by: John Liu BUAA <liukecheng97@gmail.com> Co-authored-by: yenuo26 <410167048@qq.com> Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com> Co-authored-by: liuzhenwei <zhenwei.liu@intel.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: ZJY0516 <zhu.jiangyun@foxmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Daniel Huang <daniel1.huang@intel.com> Co-authored-by: Alex Brooks <albrooks@redhat.com> Co-authored-by: Sy03 <1370724210@qq.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: UsamaKenway <56207634+UsamaKenway@users.noreply.github.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: wuhang <wuhang6@huawei.com> Co-authored-by: pablo <pablo@agigo.ai> Co-authored-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com> Co-authored-by: Dong W <89223086+sniper35@users.noreply.github.com> Co-authored-by: Yanick Schraner <yanick.schraner@gmail.com> Co-authored-by: Sangchun Ha <seomk9896@naver.com> Co-authored-by: 亦瑾 <76905040+yJader@users.noreply.github.com> Co-authored-by: 
junuxyz <216036880+junuxyz@users.noreply.github.com> Co-authored-by: Yupu <feng.yu.pu0330@gmail.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: zhumingjue138 <zhumingjue@huawei.com> Co-authored-by: Canlin Guo <961750412@qq.com> Co-authored-by: Jared Wen <w13431838023@gmail.com> Co-authored-by: Xu Lu <572605156@qq.com> Co-authored-by: xulusjb <fdukeshik@gmail.com> Co-authored-by: Baoyuan Qi <qibaoyuan@xiaomi.com> Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: shijin zhang <zsj1364226740@gmail.com> Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com> Co-authored-by: dingning <dingning7@xiaomi.com> Co-authored-by: ning ding <nndding@gmail.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Ting FU <futing10@huawei.com> Co-authored-by: developer-account <irteam@vllm-omni-dev-0.vllm-omni-dev.p-nb13557.svc.cluster.local> Co-authored-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

natureofnature requested a review from hsliuustc0106 as a code owner January 28, 2026 09:18

chatgpt-codex-connector Bot reviewed Jan 28, 2026

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py

natureofnature force-pushed the d2d_connector branch 2 times, most recently from ae91253 to 743d268 Compare January 28, 2026 10:07

chatgpt-codex-connector Bot reviewed Jan 28, 2026

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py

princepride mentioned this pull request Jan 29, 2026

[RFC]: Bagel deployment #936

Open

14 tasks

natureofnature force-pushed the d2d_connector branch from 743d268 to 88c1ed5 Compare February 1, 2026 13:02

This was referenced Feb 4, 2026

[RFC]: Omni Connector for Full Disaggregation Architecture 2026 Q1 Roadmap #1192

Open

[RFC]: vLLM-Omni RDMA connector Feature Design JiusiServe/vllm-omni#91

Closed

natureofnature force-pushed the d2d_connector branch 2 times, most recently from ae85693 to 791369e Compare February 10, 2026 02:03

Gaohan123 added this to the v0.16.0 milestone Feb 10, 2026

natureofnature force-pushed the d2d_connector branch from 4618a39 to 365f163 Compare February 10, 2026 03:34

chatgpt-codex-connector Bot reviewed Feb 10, 2026

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py Outdated

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py

natureofnature force-pushed the d2d_connector branch from 365f163 to ef6c4cc Compare February 10, 2026 10:23

chatgpt-codex-connector Bot reviewed Feb 11, 2026

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py Outdated

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py Outdated

chatgpt-codex-connector Bot reviewed Feb 11, 2026

Comment thread vllm_omni/distributed/omni_connectors/__init__.py Outdated

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py Outdated

natureofnature force-pushed the d2d_connector branch from 3671728 to dd976c6 Compare February 11, 2026 07:50

natureofnature force-pushed the d2d_connector branch from dd976c6 to 0b13984 Compare February 11, 2026 07:56

chatgpt-codex-connector Bot reviewed Feb 11, 2026

Comment thread vllm_omni/distributed/omni_connectors/utils/initialization.py Outdated

Comment thread vllm_omni/distributed/__init__.py

natureofnature force-pushed the d2d_connector branch from 0b13984 to 2158eee Compare February 11, 2026 09:46

chatgpt-codex-connector Bot reviewed Feb 11, 2026

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py Outdated

Comment thread vllm_omni/distributed/omni_connectors/connectors/mooncake_transfer_engine_connector.py

Copilot started reviewing on behalf of hsliuustc0106 February 11, 2026 17:50

natureofnature changed the title ~~vLLM-Omni RDMA connector~~ [Feature] vLLM-Omni RDMA connector Feb 12, 2026

princepride reviewed Feb 12, 2026

Gaohan123 reviewed Feb 13, 2026

hsliuustc0106 requested a review from Copilot February 13, 2026 12:53

Copilot started reviewing on behalf of hsliuustc0106 February 13, 2026 12:54

Copilot AI reviewed Feb 13, 2026

natureofnature added 6 commits February 16, 2026 03:10

update connector

7990274

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

added a todo, update documents

8172378

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update test labels

19cd049

update connector interfaces; update benchmark folder

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

add rdma connector yaml (for reference only, supported in next PR)

328b6f9

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

update for copilot review

0b34b9d

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

natureofnature force-pushed the d2d_connector branch from ebbda57 to 0b34b9d Compare February 16, 2026 03:10

princepride mentioned this pull request Feb 22, 2026

[Performance]: Bottleneck on the Hotspot of Inter-Stage Transfer #788

Open

1 task

hsliuustc0106 added the ready label to trigger buildkite CI label Feb 24, 2026

natureofnature closed this Feb 24, 2026

natureofnature reopened this Feb 24, 2026

hsliuustc0106 merged commit 1589931 into vllm-project:main Feb 24, 2026
9 of 10 checks passed

hsliuustc0106 mentioned this pull request Mar 11, 2026

[RFC]: vLLM-Omni 2026 Q1 Roadmap #677

Open

38 tasks

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Feature] vLLM-Omni RDMA connector (vllm-project#1019)

281fb91

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

natureofnature mentioned this pull request May 15, 2026

[RFC]: Qwen3-Omni Stage Transfer via Mooncake Transfer Engine #3635

Open

1 task

daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request May 28, 2026

[Feature] vLLM-Omni RDMA connector (vllm-project#1019)

fd5b325

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

quyifei23 pushed a commit to quyifei23/vllm-omni that referenced this pull request Jun 6, 2026

[Feature] vLLM-Omni RDMA connector (vllm-project#1019)

9ba153a

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

Conversation

natureofnature commented Jan 28, 2026 • edited

chatgpt-codex-connector Bot left a comment

💡 Codex Review

natureofnature commented Jan 28, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

hsliuustc0106 commented Feb 8, 2026

natureofnature commented Feb 10, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

natureofnature commented Feb 11, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

natureofnature commented Feb 11, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

natureofnature commented Feb 11, 2026

natureofnature commented Feb 11, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

natureofnature commented Feb 11, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

princepride commented Feb 12, 2026

princepride commented Feb 12, 2026

natureofnature commented Feb 24, 2026 • edited