[Feature]: AR async scheduling is enabled but decode launches still do not overlap


## 🚀 The feature, motivation and pitch

## Summary

Qwen3-Omni AR stages are configured to use vLLM V1 async scheduling, and the engine does enter the upstream `step_with_batch_queue()` path. However, traces show that the next decode `cudaGraphLaunch` is still serialized behind the previous step's sampling, output, and payload work.

In other words, async scheduling is active at the engine/scheduler level, but it does not currently provide the expected host/GPU overlap for Qwen3-Omni AR decode.

This issue intentionally focuses on the failure mode and the supporting evidence. The optimization directions listed below are investigation targets rather than a finalized implementation plan.

## Expected behavior

When async scheduling is enabled for an AR stage:

- The engine should be able to schedule the next batch while previous GPU work is still in flight.
- The next decode launch should not be delayed by user-visible output packaging or downstream-stage payload packaging from the previous step.
- For decode workloads, some autoregressive dependency is unavoidable, but CPU work that is not required to produce the next token should not remain on the critical path of the next launch.

## Actual behavior

The current behavior still looks like a serialized decode loop from the `cudaGraphLaunch` point of view. The engine uses the async scheduling path, but the next long decode graph launch is issued only after the previous GPU `execute_context` has already ended.

<img width="1310" height="1089" alt="Image" src="https://github.com/user-attachments/assets/1ded1a9a-a474-4cd8-ba18-3a1363b1d47d" />

The issue is more pronounced in text+audio traces. There, the full `sample_tokens()` path takes much longer than the sampler range (`_sample`) nested inside it:

```text
stage0 sample_tokens:
  avg 5.461 ms, p50 4.837 ms

stage1 sample_tokens:
  avg 6.181 ms, p50 6.381 ms

stage1 _sample:
  avg 0.991 ms, p50 0.952 ms
```

For the Talker stage, `_sample` is about 1 ms p50 while `sample_tokens()` is about 6.38 ms p50. This means most of the time spent before the async output can be returned is not token sampling itself. It is consistent with Qwen3-Omni output and payload work, such as hidden-state or multimodal payload processing, connector attachment, and output object construction, remaining on the next-launch critical path.
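The size of the gap can be estimated directly from the p50 figures above. The calculation below only restates the reported numbers; the variable names are illustrative, and because percentiles are not strictly additive the result should be read as a rough estimate rather than a measurement.

```python
# p50 timings for stage1 (Talker), copied from the trace summary above.
sample_tokens_p50_ms = 6.381
sample_p50_ms = 0.952

# Host-side time inside sample_tokens() that is not the sampler itself.
non_sampling_ms = sample_tokens_p50_ms - sample_p50_ms

# Fraction of sample_tokens() that is actual token sampling.
sampling_share = sample_p50_ms / sample_tokens_p50_ms

print(f"non-sampling work: {non_sampling_ms:.3f} ms")
print(f"sampler share of sample_tokens(): {sampling_share:.1%}")
```

Under this estimate, roughly 5.4 ms of each Talker step, about 85% of `sample_tokens()`, is spent on something other than sampling.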

## Why async scheduling does not currently produce the expected overlap

### 1. `non_block=True` does not make the worker method body asynchronous in the UniProc path

In the common single-GPU-per-stage path, `UniProcExecutor.collective_rpc()` still calls the worker method inline even when `non_block=True` is passed:

```python
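# The worker method runs to completion on the calling thread here,
# regardless of non_block.
result = run_method(self.driver_worker, method, args, kwargs)
if isinstance(result, AsyncModelRunnerOutput):
    # The future is only created after the worker method has returned.
    return AsyncOutputFuture(result, single_value)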
```

The future is created only after the worker method returns. This means any CPU work performed inside `execute_model()` or `sample_tokens()` before returning an `AsyncModelRunnerOutput` still blocks the engine thread.

For Qwen3-Omni, this matters because `sample_tokens()` does not just sample the next token. It also performs output and downstream payload work before returning the async output object.
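The difference between "a future was returned" and "the work ran asynchronously" can be shown with a minimal standard-library sketch. It is not vLLM code: `call_inline`, `call_on_worker`, and `worker_body` are hypothetical names, and the thread pool is only a contrast case, not a proposed fix.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def call_inline(fn) -> Future:
    """Mimics the pattern described above: run first, wrap afterwards."""
    result = fn()  # the caller is blocked for the whole body of fn
    future: Future = Future()
    future.set_result(result)
    return future


def call_on_worker(pool: ThreadPoolExecutor, fn) -> Future:
    """Contrast case: the body runs on another thread."""
    return pool.submit(fn)


events: list[str] = []
gate = threading.Event()


def worker_body() -> str:
    gate.wait(timeout=5)  # stands in for host-side output/payload work
    events.append("body finished")
    return "output"


# Inline: the body has already finished by the time a future exists.
gate.set()
inline_future = call_inline(worker_body)
events.append("caller resumed (inline)")
inline_done_on_return = inline_future.done()

# Threaded: the caller receives a pending future and can proceed first.
gate.clear()
with ThreadPoolExecutor(max_workers=1) as pool:
    threaded_future = call_on_worker(pool, worker_body)
    threaded_done_on_return = threaded_future.done()
    events.append("caller resumed (threaded)")
    gate.set()
    threaded_future.result(timeout=5)
```

In the inline case the future is already complete when the caller gets it, so the caller had no window in which to schedule anything else. That is the behavior the UniProc path exhibits for any CPU work done before `AsyncModelRunnerOutput` is returned.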

### 2. `GPUARModelRunner.sample_tokens()` combines three different responsibilities

The current Qwen3-Omni AR sampling path mixes three responsibilities:

```text
produce the next sampled token
produce user-visible model output
produce next-stage Omni payload
```

Only the first item is strictly required before the next decode step can be prepared. The other two can be large in text+audio workloads.

Inside `GPUARModelRunner.sample_tokens()`, the path includes work such as:

```text
sampling and logprob processing
output-token bookkeeping
hidden-state slicing
hidden-state CPU copies
multimodal output conversion
pooler/multimodal payload construction
connector output attachment
routed-experts output extraction when initialized
OmniModelRunnerOutput construction
AsyncGPUModelRunnerOutput construction
```

Because the async output object is created only after the full `OmniModelRunnerOutput` is built, async scheduling cannot hide the earlier output and payload work. The output future only exists after that work has already run.

This matches the traces: in text+audio stage1, `_sample` is about 1 ms p50, while the full `sample_tokens()` path is about 6.38 ms p50.
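One way to picture the separation is a result object that exposes the sampled token immediately and builds the output and payload only when they are consumed. The sketch below is a pure-Python illustration under that assumption: `DeferredStepOutput`, `sample_tokens_split`, and the helper functions are hypothetical and do not correspond to existing vLLM or Qwen3-Omni APIs. It also ignores the tensor lifetime and stream-ordering questions a real implementation would have to answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DeferredStepOutput:
    """Hypothetical container: token available now, payload built on demand."""

    next_token: int
    _build_payload: Callable[[], dict]
    _payload: Optional[dict] = field(default=None, init=False)

    def payload(self) -> dict:
        # Build the output/payload once, the first time it is requested.
        if self._payload is None:
            self._payload = self._build_payload()
        return self._payload


trace: list[str] = []


def sample_next_token(logits: list[float]) -> int:
    trace.append("sample")
    return max(range(len(logits)), key=logits.__getitem__)  # greedy argmax


def build_output_and_payload(token: int) -> dict:
    trace.append("package")
    return {"token": token, "omni_payload": f"payload-for-{token}"}


def sample_tokens_split(logits: list[float]) -> DeferredStepOutput:
    # Only the sampler runs here; packaging is captured as a deferred call.
    token = sample_next_token(logits)
    return DeferredStepOutput(token, lambda: build_output_and_payload(token))


out = sample_tokens_split([0.1, 2.5, 0.3])
trace.append(f"launch next step with token {out.next_token}")
final = out.payload()  # packaging happens after the next launch was issued
```

The recorded `trace` shows the intended ordering: sampling, then the next launch, then packaging.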

### 3. The runner keeps only one in-flight execute state

`GPUARModelRunner.execute_model()` stores its state in a single slot:

```python
if self.execute_model_state is not None:
    raise RuntimeError(
        "State error: sample_tokens() must be called after execute_model() returns None."
    )

self.execute_model_state = ExecuteModelState(...)
```

`sample_tokens()` consumes and clears the same singleton:

```python
if self.execute_model_state is None:
    ...

(...) = self.execute_model_state
self.execute_model_state = None
```

This means the runner itself is not structured to hold multiple in-flight execute/sample states. The upstream engine can queue futures, but the runner's internal state model still forces a strict execute-then-sample handoff.

This is not necessarily wrong for correctness, but it limits how much overlap can be achieved for Qwen3-Omni AR workloads.
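For comparison, a runner that could hold more than one pending state might replace the single slot with a small bounded FIFO. The sketch below is an assumption-laden illustration, not a proposal for the actual data structure: `StepState` and `InFlightStates` are hypothetical names, and whether a depth greater than one is safe depends on the AR dependency and buffer reuse, which this sketch does not model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class StepState:
    """Hypothetical stand-in for the per-step execute state."""

    step_id: int
    logits: list[float]


class InFlightStates:
    """FIFO of pending execute states, bounded by max_depth."""

    def __init__(self, max_depth: int = 2) -> None:
        self._pending: deque[StepState] = deque()
        self._max_depth = max_depth

    def push(self, state: StepState) -> None:
        # With max_depth=1 this reproduces the single-slot behavior.
        if len(self._pending) >= self._max_depth:
            raise RuntimeError("too many in-flight execute states")
        self._pending.append(state)

    def pop_oldest(self) -> StepState:
        if not self._pending:
            raise RuntimeError("sample requested with no pending execute state")
        return self._pending.popleft()

    def __len__(self) -> int:
        return len(self._pending)


states = InFlightStates(max_depth=2)
states.push(StepState(0, [0.1, 0.9]))
states.push(StepState(1, [0.7, 0.2]))  # a single slot would raise here
first = states.pop_oldest()
```

States are consumed in the order they were produced, so the execute-then-sample pairing is preserved per step while the second state waits.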

### 4. The unavoidable AR dependency is being coupled with avoidable payload work

For a single decode request, step N+1 depends on the token sampled at step N. That dependency is real.

The problem observed here is narrower: the next step is not only waiting for the sampled token; it is also indirectly waiting for work that packages user-visible output and downstream Omni payloads.

The current effective timeline is:

```text
step N graph work
step N sample_tokens host path
  sample next token
  package output
  package downstream Omni payload
step N+1 decode launch
```

The lack of overlap in the trace suggests that the next launch cannot progress until this combined sampling, output, and payload path has completed.

## Impact

This affects Qwen3-Omni AR stages, especially text+audio workloads:

- The engine enters the async scheduling path, but measured decode-launch overlap remains effectively zero in the observed text-only trace.
- `sample_tokens()` is much more expensive than `_sample` in text+audio workloads, especially in the Talker stage.
- The extra work sits on the host-side path that must complete before the next decode launch can be issued.
- The issue is likely to hurt TTFT and TPOT and reduce the benefit of enabling async scheduling for Qwen3-Omni.

## Candidate optimization directions to investigate

The trace evidence points to multiple D2H, CPU staging, and synchronous payload-transfer paths that can keep async scheduling from producing useful overlap. A reasonable prioritization is:

1. **Make sampled-token handoff available as early as possible.**

   `AsyncGPUModelRunnerOutput` is only useful for async scheduling if the next token can be handed back before unrelated payload work completes. The first thing to check is whether `sample_tokens()` can publish the sampled token and start the async sampled-token D2H copy before building Omni payloads.

2. **Separate payload, hidden-state, and multimodal D2H from the next-token path.**

   Several paths can move tensors from GPU to CPU before the async output is returned:

   ```text
   generic hidden payload D2H inside sample_tokens()
   build_mm_cpu() recursive multimodal D2H
   hidden-state slicing/copying for downstream payloads
   ```

   These copies should be measured separately. If they are confirmed to be on the next-launch path, they are candidates for a separate copy stream, pinned CPU staging buffers, and event-based materialization at the point where the payload is actually consumed.

3. **Audit prefix-cache CPU staging.**

   Prefix-cache updates may force hidden-state D2H and `slot_mapping.cpu()` before the next decode launch. This path should be profiled separately from generic sampling. If it is on the critical path, the key question is whether hidden states and slot mappings can be staged asynchronously instead of being synchronously materialized on CPU.

4. **Avoid connector-side CPU serialization when raw GPU payloads are possible.**

   `OmniKVTransferManager` may fall back to KV D2H and byte serialization when the connector does not support raw data. Connector serialization and SHM writes may also remain synchronous. The connector boundary should be checked to see whether raw GPU tensors can be passed through for local or downstream consumers, avoiding unnecessary `.cpu()` and byte serialization on the AR decode path.

5. **Limit `model_intermediate_buffer` and postprocess CPU materialization.**

   Some model `postprocess()` or `update_dict` paths may move GPU tensors to CPU while updating intermediate buffers. Keys that can remain GPU-resident should stay GPU-resident until a cross-process or downstream consumer actually requires CPU materialization.

6. **Treat CPU output-token history waits as conditional.**

   When a logits processor or output mode needs CPU sampled-token history, `async_copy_ready_event.synchronize()` may be required. That dependency should be kept conditional and measured separately from the common path. It should not force all Qwen3-Omni decode steps to wait on CPU token history if the next decode input can be prepared without it.

7. **Measure Talker preprocess, MTP, and postprocess separately from D2H.**

   The Talker stage has real computation in preprocess, MTP/code predictor, and postprocess. These are not simple CPU copy problems. They should be profiled as their own buckets so that D2H optimization is not confused with real Talker-side model work.
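Several of the directions above depend on measuring individual paths as separate buckets. A minimal host-side sketch of such bucketing is shown below. `BucketTimer` and the bucket names are hypothetical, and the loop bodies are placeholders for real work. The timer captures wall-clock CPU time only, so GPU-side ranges would still need a profiler facility such as NVTX ranges or CUDA events.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class BucketTimer:
    """Accumulates host wall-clock time per named bucket."""

    def __init__(self) -> None:
        self.totals_ms: dict[str, float] = defaultdict(float)
        self.counts: dict[str, int] = defaultdict(int)

    @contextmanager
    def bucket(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raises.
            self.totals_ms[name] += (time.perf_counter() - start) * 1e3
            self.counts[name] += 1


timer = BucketTimer()
for _ in range(3):
    with timer.bucket("sample"):
        sum(range(1_000))  # placeholder for the sampler
    with timer.bucket("payload_d2h"):
        sum(range(10_000))  # placeholder for payload copies

for name in sorted(timer.totals_ms):
    avg = timer.totals_ms[name] / timer.counts[name]
    print(f"{name}: n={timer.counts[name]} avg={avg:.4f} ms")
```

Keeping sampling, D2H, prefix-cache staging, connector serialization, and Talker model work in distinct buckets makes it possible to tell which of them actually sits on the next-launch path.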

@hsliuustc0106 @amy-why-3459 @tzhouam @natureofnature @Bounty-hunter 

## Alternatives

_No response_

## Additional context

_No response_

## Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

