🚀 The feature, motivation and pitch
Summary
Qwen3-Omni AR stages are configured to use vLLM V1 async scheduling, and the engine does enter the upstream step_with_batch_queue() path. However, traces show that the next decode cudaGraphLaunch is still serialized behind the previous step's sampling, output, and payload work.
In other words, async scheduling is active at the engine/scheduler level, but it does not currently provide the expected host/GPU overlap for Qwen3-Omni AR decode.
This issue intentionally focuses on the failure mode and the supporting evidence. The optimization directions listed below are investigation targets rather than a finalized implementation plan.
Expected behavior
When async scheduling is enabled for an AR stage:
- The engine should be able to schedule the next batch while previous GPU work is still in flight.
- The next decode launch should not be delayed by user-visible output packaging or downstream-stage payload packaging from the previous step.
- For decode workloads, some autoregressive dependency is unavoidable, but CPU work that is not required to produce the next token should not remain on the critical path of the next launch.
Actual behavior
The current behavior still looks like a serialized decode loop from the cudaGraphLaunch point of view. The engine uses the async scheduling path, but the next long decode graph launch is issued only after the previous GPU execute_context has already ended.
The issue is more pronounced in text+audio traces. There, the full sample_tokens() path takes much longer than the _sample range on its own:
    stage0 sample_tokens:  avg 5.461 ms, p50 4.837 ms
    stage1 sample_tokens:  avg 6.181 ms, p50 6.381 ms
    stage1 _sample:        avg 0.991 ms, p50 0.952 ms
For the Talker stage, _sample is about 1 ms p50 while sample_tokens() is about 6.38 ms p50. This means most of the time spent before the async output can be returned is not token sampling itself. It is consistent with Qwen3-Omni output and payload work, such as hidden-state or multimodal payload processing, connector attachment, and output object construction, remaining on the next-launch critical path.
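As a rough check on the stage1 numbers, the snippet below subtracts the _sample p50 from the sample_tokens() p50. Percentiles are not strictly additive, so this is only an estimate of how much of the host path lies outside the sampler; the function name is illustrative and not part of the codebase.

```python
# p50 timings (ms) copied from the stage1 (Talker) text+audio trace above.
SAMPLE_TOKENS_P50_MS = 6.381
SAMPLE_P50_MS = 0.952


def non_sampling_share(total_ms: float, sampler_ms: float) -> tuple[float, float]:
    """Return (host time outside the sampler in ms, its share of the total)."""
    other_ms = total_ms - sampler_ms
    return other_ms, other_ms / total_ms


other_ms, share = non_sampling_share(SAMPLE_TOKENS_P50_MS, SAMPLE_P50_MS)
print(f"outside _sample: {other_ms:.3f} ms ({share:.1%} of sample_tokens)")
```

Under that approximation, roughly 5.4 ms of the 6.38 ms p50 is spent outside the sampler.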
Why async scheduling does not currently produce the expected overlap
1. non_block=True does not make the worker method body asynchronous in the UniProc path
In the common single-GPU-per-stage path, UniProcExecutor.collective_rpc() still calls the worker method inline even when non_block=True is passed:
    result = run_method(self.driver_worker, method, args, kwargs)
    if isinstance(result, AsyncModelRunnerOutput):
        return AsyncOutputFuture(result, single_value)
The future is created only after the worker method returns. This means any CPU work performed inside execute_model() or sample_tokens() before returning an AsyncModelRunnerOutput still blocks the engine thread.
For Qwen3-Omni, this matters because sample_tokens() does not just sample the next token. It also performs output and downstream payload work before returning the async output object.
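The ordering problem can be reproduced with a toy model. The sketch below is not vLLM code: inline_rpc, offloaded_rpc, and make_worker are hypothetical stand-ins. It only contrasts a future that is wrapped around an already-finished call with a future that is handed back before the body runs. The thread pool is there for contrast and is not a proposed fix.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def make_worker(log, gate=None):
    """Build a stand-in for a worker method whose body does all host work."""
    def worker_method():
        if gate is not None:
            gate.wait()  # hold the body until the caller already has its future
        log.append("worker body finished")
        return "async-output"
    return worker_method


def inline_rpc(method) -> Future:
    """Inline dispatch: run the body first, wrap the result afterwards."""
    result = method()        # the caller is blocked for the whole body
    fut: Future = Future()
    fut.set_result(result)   # the future is already resolved when returned
    return fut


def offloaded_rpc(pool: ThreadPoolExecutor, method) -> Future:
    """Contrast case: the future is returned before the body has run."""
    return pool.submit(method)


inline_log = []
inline_future = inline_rpc(make_worker(inline_log))
inline_log.append("caller got future")

offload_log = []
gate = threading.Event()
with ThreadPoolExecutor(max_workers=1) as pool:
    offload_future = offloaded_rpc(pool, make_worker(offload_log, gate))
    offload_log.append("caller got future")
    gate.set()
    offload_future.result()

print(inline_log)
print(offload_log)
```

In the inline case the log shows the worker body finishing before the caller ever sees a future, which is the pattern described above.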
2. GPUARModelRunner.sample_tokens() combines three different responsibilities
The current Qwen3-Omni AR sampling path mixes three responsibilities:
- produce the next sampled token
- produce user-visible model output
- produce next-stage Omni payload
Only the first item is strictly required before the next decode step can be prepared. The other two can be large in text+audio workloads.
Inside GPUARModelRunner.sample_tokens(), the path includes work such as:
- sampling and logprob processing
- output-token bookkeeping
- hidden-state slicing
- hidden-state CPU copies
- multimodal output conversion
- pooler/multimodal payload construction
- connector output attachment
- routed-experts output extraction when initialized
- OmniModelRunnerOutput construction
- AsyncGPUModelRunnerOutput construction
Because the async output object is created only after the full OmniModelRunnerOutput is built, async scheduling cannot hide the earlier output and payload work. The output future only exists after that work has already run.
This matches the traces: in text+audio stage1, _sample is about 1 ms p50, while the full sample_tokens() path is about 6.38 ms p50.
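One way to picture the separation of responsibilities is the sketch below. All names (StepResult, sample, build_payload, sample_tokens_split) are hypothetical and the data is plain Python lists. It only illustrates returning the sampled token first and resolving the output/payload part later, not how GPUARModelRunner would actually be restructured.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class StepResult:
    token_id: int    # the only part needed to prepare the next decode step
    payload: Future  # user-visible output and downstream payload, resolved later


def sample(logits: list[float]) -> int:
    """Greedy argmax as a stand-in for the real sampler."""
    return max(range(len(logits)), key=logits.__getitem__)


def build_payload(token_id: int, hidden: list[float]) -> dict:
    """Stand-in for hidden-state copies, multimodal conversion, and similar work."""
    return {"token_id": token_id, "hidden": list(hidden)}


def sample_tokens_split(logits, hidden, pool: ThreadPoolExecutor) -> StepResult:
    token_id = sample(logits)                               # stays on the critical path
    payload = pool.submit(build_payload, token_id, hidden)  # deferred off the critical path
    return StepResult(token_id, payload)


with ThreadPoolExecutor(max_workers=1) as pool:
    step = sample_tokens_split([0.1, 2.0, -1.0], [0.5, 0.25], pool)
    next_input = step.token_id       # the next step could be prepared here
    payload = step.payload.result()  # materialized only where it is consumed

print(next_input, payload)
```

Whether the payload work can really be deferred this way depends on tensor lifetimes and stream ordering in the real runner, which this sketch does not model.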
3. The runner keeps only one in-flight execute state
GPUARModelRunner.execute_model() stores its state in a single slot and refuses to overwrite it:
    if self.execute_model_state is not None:
        raise RuntimeError(
            "State error: sample_tokens() must be called after execute_model() returns None."
        )
    self.execute_model_state = ExecuteModelState(...)
sample_tokens() consumes and clears the same singleton:
    if self.execute_model_state is None:
        ...
    (...) = self.execute_model_state
    self.execute_model_state = None
This means the runner itself is not structured to hold multiple in-flight execute/sample states. The upstream engine can queue futures, but the runner's internal state model still forces a strict execute-then-sample handoff.
This is not necessarily wrong for correctness, but it limits how much overlap can be achieved for Qwen3-Omni AR workloads.
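To make the structural difference concrete, the sketch below contrasts a single-slot runner with one that keeps a bounded FIFO of in-flight states. ExecState, SingleSlotRunner, QueuedRunner, and max_in_flight are hypothetical names for illustration. A queue like this could only help where the autoregressive dependency allows it, for example across independent batches.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ExecState:  # hypothetical stand-in for ExecuteModelState
    step: int


class SingleSlotRunner:
    """Mirrors the strict execute-then-sample handoff described above."""

    def __init__(self) -> None:
        self.state = None

    def execute_model(self, step: int) -> None:
        if self.state is not None:
            raise RuntimeError("sample_tokens() must be called first")
        self.state = ExecState(step)

    def sample_tokens(self) -> int:
        state, self.state = self.state, None
        return state.step


class QueuedRunner:
    """Alternative shape: a bounded FIFO of in-flight execute states."""

    def __init__(self, max_in_flight: int = 2) -> None:
        self.states = deque()
        self.max_in_flight = max_in_flight

    def execute_model(self, step: int) -> None:
        if len(self.states) >= self.max_in_flight:
            raise RuntimeError("too many in-flight steps")
        self.states.append(ExecState(step))

    def sample_tokens(self) -> int:
        return self.states.popleft().step  # oldest in-flight step first


queued = QueuedRunner(max_in_flight=2)
queued.execute_model(0)
queued.execute_model(1)  # accepted: two states are in flight
order = [queued.sample_tokens(), queued.sample_tokens()]
print(order)
```

The single-slot runner would raise on the second execute_model() call, while the queued variant accepts it and drains the states in submission order.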
4. The unavoidable AR dependency is being coupled with avoidable payload work
For a single decode request, step (N+1) depends on the token sampled at step (N). That dependency is real.
The problem observed here is narrower: the next step is not only waiting for the sampled token; it is also indirectly waiting for work that packages user-visible output and downstream Omni payloads.
The current effective timeline is:
    step N graph work
    step N sample_tokens host path
        sample next token
        package output
        package downstream Omni payload
    step N+1 decode launch
The lack of overlap in the trace suggests that the next launch cannot progress until this combined sampling, output, and payload path has completed.
Impact
This affects Qwen3-Omni AR stages, especially text+audio workloads:
- The engine enters the async scheduling path, but measured decode-launch overlap remains effectively zero in the observed text-only trace.
- sample_tokens() is much more expensive than _sample in text+audio workloads, especially in the Talker stage.
- The extra work sits on the host-side path that must complete before the next decode launch can be issued.
- The issue is likely to hurt TTFT and TPOT and reduce the benefit of enabling async scheduling for Qwen3-Omni.
Candidate optimization directions to investigate
The trace evidence points to multiple D2H, CPU staging, and synchronous payload-transfer paths that can keep async scheduling from producing useful overlap. A reasonable prioritization is:
- Make sampled-token handoff available as early as possible.
AsyncGPUModelRunnerOutput is only useful for async scheduling if the next token can be handed back before unrelated payload work completes. The first thing to check is whether sample_tokens() can publish the sampled token and start the async sampled-token D2H copy before building Omni payloads.
- Separate payload, hidden-state, and multimodal D2H from the next-token path.
Several paths can move tensors from GPU to CPU before the async output is returned:
  - generic hidden payload D2H inside sample_tokens()
  - build_mm_cpu() recursive multimodal D2H
  - hidden-state slicing/copying for downstream payloads
These copies should be measured separately. If they are confirmed to be on the next-launch path, they are candidates for a separate copy stream, pinned CPU staging buffers, and event-based materialization at the point where the payload is actually consumed.
- Audit prefix-cache CPU staging.
Prefix-cache updates may force hidden-state D2H and slot_mapping.cpu() before the next decode launch. This path should be profiled separately from generic sampling. If it is on the critical path, the key question is whether hidden states and slot mappings can be staged asynchronously instead of being synchronously materialized on CPU.
- Avoid connector-side CPU serialization when raw GPU payloads are possible.
OmniKVTransferManager may fall back to KV D2H and byte serialization when the connector does not support raw data. Connector serialization and SHM writes may also remain synchronous. The connector boundary should be checked to see whether raw GPU tensors can be passed through for local or downstream consumers, avoiding unnecessary .cpu() and byte serialization on the AR decode path.
- Limit model_intermediate_buffer and postprocess CPU materialization.
Some model postprocess() or update_dict paths may move GPU tensors to CPU while updating intermediate buffers. Keys that can remain GPU-resident should stay GPU-resident until a cross-process or downstream consumer actually requires CPU materialization.
- Treat CPU output-token history waits as conditional.
When a logits processor or output mode needs CPU sampled-token history, async_copy_ready_event.synchronize() may be required. That dependency should be kept conditional and measured separately from the common path. It should not force all Qwen3-Omni decode steps to wait on CPU token history if the next decode input can be prepared without it.
- Measure Talker preprocess, MTP, and postprocess separately from D2H.
The Talker stage has real computation in preprocess, MTP/code predictor, and postprocess. These are not simple CPU copy problems. They should be profiled as their own buckets so that D2H optimization is not confused with real Talker-side model work.
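Several of the directions above ask for separate measurement buckets. A minimal host-side sketch of such bucketing follows. The Buckets class and the bucket names are hypothetical, and the timed bodies are placeholders. Wall-clock timers like this only capture host time, so asynchronous GPU work would still need profiler ranges or device events.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean, median


class Buckets:
    """Collect per-bucket host-side durations in milliseconds."""

    def __init__(self) -> None:
        self.samples_ms = defaultdict(list)

    @contextmanager
    def measure(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples_ms[name].append((time.perf_counter() - start) * 1e3)

    def summary(self) -> dict:
        return {
            name: {"avg": mean(vals), "p50": median(vals), "n": len(vals)}
            for name, vals in self.samples_ms.items()
        }


buckets = Buckets()
for _ in range(3):
    with buckets.measure("sample"):
        sum(range(1000))  # placeholder for the sampler range
    with buckets.measure("payload_d2h"):
        bytes(4096)       # placeholder for payload staging work

report = buckets.summary()
for name, stats in report.items():
    print(f"{name}: avg {stats['avg']:.3f} ms, p50 {stats['p50']:.3f} ms, n={stats['n']}")
```

Reporting avg and p50 per bucket keeps the output comparable with the trace numbers quoted earlier in this issue.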
@hsliuustc0106 @amy-why-3459 @tzhouam @natureofnature @Bounty-hunter
Alternatives
No response
Additional context
No response