[Feature]: AR async scheduling is enabled but decode launches still do not overlap #4442

@fake0fan

🚀 The feature, motivation and pitch

Summary

Qwen3-Omni AR stages are configured to use vLLM V1 async scheduling, and the engine does enter the upstream step_with_batch_queue() path. However, traces show that the next decode cudaGraphLaunch is still serialized behind the previous step's sampling, output, and payload work.

In other words, async scheduling is active at the engine/scheduler level, but it does not currently provide the expected host/GPU overlap for Qwen3-Omni AR decode.

This issue intentionally focuses on the failure mode and the supporting evidence. The optimization directions listed below are investigation targets rather than a finalized implementation plan.

Expected behavior

When async scheduling is enabled for an AR stage:

  • The engine should be able to schedule the next batch while previous GPU work is still in flight.
  • The next decode launch should not be delayed by user-visible output packaging or downstream-stage payload packaging from the previous step.
  • For decode workloads, some autoregressive dependency is unavoidable, but CPU work that is not required to produce the next token should not remain on the critical path of the next launch.

Actual behavior

The current behavior still looks like a serialized decode loop from the cudaGraphLaunch point of view. The engine uses the async scheduling path, but the next long decode graph launch is issued only after the previous GPU execute_context has already ended.

The issue is more pronounced in text+audio traces, where the full sample_tokens() path takes much longer than the actual sampler range:

stage0 sample_tokens:
  avg 5.461 ms, p50 4.837 ms

stage1 sample_tokens:
  avg 6.181 ms, p50 6.381 ms

stage1 _sample:
  avg 0.991 ms, p50 0.952 ms

For the Talker stage (stage1), _sample is about 1 ms p50 while sample_tokens() is about 6.38 ms p50, so roughly 5.4 ms (about 85%) of the time spent before the async output can be returned is not token sampling itself. This is consistent with Qwen3-Omni output and payload work, such as hidden-state or multimodal payload processing, connector attachment, and output object construction, remaining on the next-launch critical path.

Why async scheduling does not currently produce the expected overlap

1. non_block=True does not make the worker method body asynchronous in the UniProc path

In the common single-GPU-per-stage path, UniProcExecutor.collective_rpc() still calls the worker method inline even when non_block=True is passed:

result = run_method(self.driver_worker, method, args, kwargs)
if isinstance(result, AsyncModelRunnerOutput):
    return AsyncOutputFuture(result, single_value)

The future is created only after the worker method returns. This means any CPU work performed inside execute_model() or sample_tokens() before returning an AsyncModelRunnerOutput still blocks the engine thread.

For Qwen3-Omni, this matters because sample_tokens() does not just sample the next token. It also performs output and downstream payload work before returning the async output object.
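
The inline behavior described in this section can be reproduced without vLLM. The following is a toy model, not vLLM code: inline_rpc, threaded_rpc, make_worker_method, and demo are hypothetical names, and the gate is only a stand-in for slow CPU work inside the worker method. It records the order of events to show that a future created after an inline call is already resolved by the time the caller receives it.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def make_worker_method(gate=None):
    """Build a fake worker method; `gate` stands in for slow CPU work."""
    def worker_method(events):
        if gate is not None:
            gate.wait()
        events.append("worker body done")
        return "token"
    return worker_method


def inline_rpc(method, events):
    """Toy inline 'non-blocking' call: the body runs on the caller's thread
    and the future is created only afterwards, already resolved."""
    result = method(events)
    fut = Future()
    fut.set_result(result)
    events.append("rpc returned")
    return fut


def threaded_rpc(pool, method, events):
    """Contrast case: the body runs on another thread, so the caller gets
    the future back before the body has finished."""
    fut = pool.submit(method, events)
    events.append("rpc returned")
    return fut


def demo():
    inline_events = []
    inline_rpc(make_worker_method(), inline_events).result()

    threaded_events = []
    gate = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = threaded_rpc(pool, make_worker_method(gate), threaded_events)
        gate.set()  # the "slow" body is allowed to finish only now
        fut.result()
    return inline_events, threaded_events
```

In the inline case the event order is "worker body done" and then "rpc returned", which is the pattern the traces suggest for the UniProc path. Moving the body to another thread is shown only for contrast; whether that is appropriate for the real worker is a separate question.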

2. GPUARModelRunner.sample_tokens() combines three different responsibilities

The current Qwen3-Omni AR sampling path mixes three responsibilities:

  1. Produce the next sampled token.
  2. Produce user-visible model output.
  3. Produce next-stage Omni payload.

Only the first item is strictly required before the next decode step can be prepared. The other two can be large in text+audio workloads.

Inside GPUARModelRunner.sample_tokens(), the path includes work such as:

  • sampling and logprob processing
  • output-token bookkeeping
  • hidden-state slicing
  • hidden-state CPU copies
  • multimodal output conversion
  • pooler/multimodal payload construction
  • connector output attachment
  • routed-experts output extraction when initialized
  • OmniModelRunnerOutput construction
  • AsyncGPUModelRunnerOutput construction

Because the async output object is created only after the full OmniModelRunnerOutput is built, async scheduling cannot hide the earlier output and payload work. The output future only exists after that work has already run.

This matches the traces: in text+audio stage1, _sample is about 1 ms p50, while the full sample_tokens() path is about 6.38 ms p50.
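
One way to picture the separation of responsibilities is a sampling call that returns the token immediately and hands packaging to a background executor. This is a minimal sketch under stated assumptions, not the GPUARModelRunner API: StepResult, sample_tokens_split, build_payload, and demo are hypothetical names, greedy argmax stands in for the real sampler, and the real payload work involves GPU tensors whose lifetime would need separate handling.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StepResult:
    token_id: int    # all that the next decode step needs
    payload: Future  # user-visible output and downstream payload, built later


def sample_tokens_split(
    logits: Sequence[float],
    build_payload: Callable[[int], dict],
    pool: ThreadPoolExecutor,
) -> StepResult:
    # Greedy argmax as a stand-in for the real sampler.
    token_id = max(range(len(logits)), key=lambda i: logits[i])
    # Packaging is queued, not awaited, so it stays off the caller's path.
    payload = pool.submit(build_payload, token_id)
    return StepResult(token_id, payload)


def demo():
    gate = threading.Event()

    def slow_payload(token_id: int) -> dict:
        gate.wait()  # stand-in for hidden-state copies and output construction
        return {"token_id": token_id}

    with ThreadPoolExecutor(max_workers=1) as pool:
        step = sample_tokens_split([0.1, 2.5, 0.3], slow_payload, pool)
        ready_early = not step.payload.done()  # token usable, payload pending
        gate.set()
        payload = step.payload.result()
    return step.token_id, ready_early, payload
```

The point of the sketch is ordering only: the token is available to the caller while the payload is still pending, and the payload is resolved at the point where a consumer asks for it.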

3. The runner keeps only one in-flight execute state

GPUARModelRunner.execute_model() stores its state in a single slot and refuses to run while that slot is still occupied:

if self.execute_model_state is not None:
    raise RuntimeError(
        "State error: sample_tokens() must be called after execute_model() returns None."
    )

self.execute_model_state = ExecuteModelState(...)

sample_tokens() consumes and clears the same singleton:

if self.execute_model_state is None:
    ...

(...) = self.execute_model_state
self.execute_model_state = None

This means the runner itself is not structured to hold multiple in-flight execute/sample states. The upstream engine can queue futures, but the runner's internal state model still forces a strict execute-then-sample handoff.

This is not necessarily wrong for correctness, but it limits how much overlap can be achieved for Qwen3-Omni AR workloads.
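
The difference between a single slot and a bounded queue of in-flight states can be sketched in a few lines. Both classes below are hypothetical illustrations rather than vLLM code: SingleSlotRunner mirrors the execute-then-sample handoff quoted above, and QueuedRunner shows what holding more than one state would look like. Note that for a single request the token dependency between steps still applies, so a queue alone does not create overlap.

```python
from collections import deque


class SingleSlotRunner:
    """Mirrors the strict handoff: one execute, then one sample."""

    def __init__(self):
        self._state = None

    def execute_model(self, step):
        if self._state is not None:
            raise RuntimeError("previous step has not been sampled yet")
        self._state = step

    def sample_tokens(self):
        if self._state is None:
            raise RuntimeError("no executed step to sample")
        step, self._state = self._state, None
        return step


class QueuedRunner:
    """Hypothetical alternative: a bounded FIFO of in-flight states."""

    def __init__(self, max_in_flight=2):
        self._states = deque()
        self._max_in_flight = max_in_flight

    def execute_model(self, step):
        if len(self._states) >= self._max_in_flight:
            raise RuntimeError("too many steps in flight")
        self._states.append(step)

    def sample_tokens(self):
        if not self._states:
            raise RuntimeError("no executed step to sample")
        return self._states.popleft()  # oldest step is sampled first
```

With SingleSlotRunner, a second execute_model() call before sample_tokens() raises; with QueuedRunner(max_in_flight=2), two steps can be executed back to back and are then sampled in submission order.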

4. The unavoidable AR dependency is being coupled with avoidable payload work

For a single decode request, step (N+1) depends on the token sampled at step (N). That dependency is real.

The problem observed here is narrower: the next step is not only waiting for the sampled token; it is also indirectly waiting for work that packages user-visible output and downstream Omni payloads.

The current effective timeline is:

step N graph work
step N sample_tokens host path
  sample next token
  package output
  package downstream Omni payload
step N+1 decode launch

The lack of overlap in the trace suggests that the next launch cannot progress until this combined sampling, output, and payload path has completed.
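
A rough per-step cost model makes the timeline concrete. This is a back-of-the-envelope sketch, not a measurement: the two p50 values come from the stage1 numbers reported above, packaging time is approximated as their difference, and GRAPH_MS is a hypothetical placeholder because the graph duration is not stated in this issue.

```python
SAMPLE_TOKENS_P50_MS = 6.381  # stage1 sample_tokens p50, from the trace
SAMPLE_P50_MS = 0.952         # stage1 _sample p50, from the trace
# Approximation: everything in sample_tokens that is not _sample.
PACKAGE_MS = SAMPLE_TOKENS_P50_MS - SAMPLE_P50_MS
GRAPH_MS = 10.0               # hypothetical decode graph time


def serialized_step_ms(graph_ms, sample_ms, package_ms):
    """Current timeline: graph, then sampling, then packaging, then launch."""
    return graph_ms + sample_ms + package_ms


def decoupled_step_ms(graph_ms, sample_ms, package_ms):
    """Packaging of step N overlaps step N+1, so the steady-state step time
    is bounded by the slower of the two pipelines."""
    return max(graph_ms + sample_ms, package_ms)


serialized = serialized_step_ms(GRAPH_MS, SAMPLE_P50_MS, PACKAGE_MS)
decoupled = decoupled_step_ms(GRAPH_MS, SAMPLE_P50_MS, PACKAGE_MS)
saved = serialized - decoupled
```

Under these assumptions the saving per step equals the packaging time, as long as packaging is shorter than graph time plus sampling. The actual gain depends on the real graph duration and on how much of the packaging work can be moved off the launch path.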

Impact

This affects Qwen3-Omni AR stages, especially text+audio workloads:

  • The engine enters the async scheduling path, but measured decode-launch overlap remains effectively zero in the observed text-only trace.
  • sample_tokens() is much more expensive than _sample in text+audio workloads, especially in the Talker stage.
  • The extra work sits on the host-side path that must complete before the next decode launch can be issued.
  • The issue is likely to hurt TTFT and TPOT and reduce the benefit of enabling async scheduling for Qwen3-Omni.

Candidate optimization directions to investigate

The trace evidence points to multiple D2H, CPU staging, and synchronous payload-transfer paths that can keep async scheduling from producing useful overlap. A reasonable prioritization is:

  1. Make sampled-token handoff available as early as possible.

    AsyncGPUModelRunnerOutput is only useful for async scheduling if the next token can be handed back before unrelated payload work completes. The first thing to check is whether sample_tokens() can publish the sampled token and start the async sampled-token D2H copy before building Omni payloads.

  2. Separate payload, hidden-state, and multimodal D2H from the next-token path.

    Several paths can move tensors from GPU to CPU before the async output is returned:

    • generic hidden payload D2H inside sample_tokens()
    • build_mm_cpu() recursive multimodal D2H
    • hidden-state slicing/copying for downstream payloads

    These copies should be measured separately. If they are confirmed to be on the next-launch path, they are candidates for a separate copy stream, pinned CPU staging buffers, and event-based materialization at the point where the payload is actually consumed.

  3. Audit prefix-cache CPU staging.

    Prefix-cache updates may force hidden-state D2H and slot_mapping.cpu() before the next decode launch. This path should be profiled separately from generic sampling. If it is on the critical path, the key question is whether hidden states and slot mappings can be staged asynchronously instead of being synchronously materialized on CPU.

  4. Avoid connector-side CPU serialization when raw GPU payloads are possible.

    OmniKVTransferManager may fall back to KV D2H and byte serialization when the connector does not support raw data. Connector serialization and SHM writes may also remain synchronous. The connector boundary should be checked to see whether raw GPU tensors can be passed through for local or downstream consumers, avoiding unnecessary .cpu() and byte serialization on the AR decode path.

  5. Limit model_intermediate_buffer and postprocess CPU materialization.

    Some model postprocess() or update_dict paths may move GPU tensors to CPU while updating intermediate buffers. Keys that can remain GPU-resident should stay GPU-resident until a cross-process or downstream consumer actually requires CPU materialization.

  6. Treat CPU output-token history waits as conditional.

    When a logits processor or output mode needs CPU sampled-token history, async_copy_ready_event.synchronize() may be required. That dependency should be kept conditional and measured separately from the common path. It should not force all Qwen3-Omni decode steps to wait on CPU token history if the next decode input can be prepared without it.

  7. Measure Talker preprocess, MTP, and postprocess separately from D2H.

    The Talker stage has real computation in preprocess, MTP/code predictor, and postprocess. These are not simple CPU copy problems. They should be profiled as their own buckets so that D2H optimization is not confused with real Talker-side model work.
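
Several of the directions above ask for paths to be measured as separate buckets. A small host-side timer is one way to do that. BucketTimer and the bucket names in the usage note are hypothetical, and the class only measures wall-clock time on the calling thread: GPU work that was launched asynchronously is not attributed to a bucket unless the code inside it synchronizes, so CUDA events or profiler ranges would still be needed for GPU-side attribution.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean, median


class BucketTimer:
    """Collects host-side durations per named bucket, in milliseconds."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be tested
        self.samples = defaultdict(list)

    @contextmanager
    def bucket(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record even if the timed block raises.
            self.samples[name].append((self._clock() - start) * 1e3)

    def summary(self):
        return {
            name: {"avg_ms": mean(vals), "p50_ms": median(vals), "n": len(vals)}
            for name, vals in self.samples.items()
        }
```

A possible usage is to wrap each suspected path in its own bucket, for example `with timer.bucket("talker_postprocess"): ...` and `with timer.bucket("payload_d2h"): ...`, and then compare the avg and p50 per bucket in `timer.summary()`.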

@hsliuustc0106 @amy-why-3459 @tzhouam @natureofnature @Bounty-hunter

Labels: core, critical, high priority
