[Bug]: Orchestrator bottleneck caused by redundant forwarding of thinker hidden states over the orchestrator wire under async-chunk

### Your current environment

# Orchestrator bottleneck under high concurrency (cc64 vs cc128)

## Concurrency sweep: cc64 vs cc128

We profiled the effect of **max concurrency** (64 vs 128) under **`request_rate=inf`**. Both thinker and talker use **`max_num_seqs=32`**. We tested three pipeline layouts: **1:1**, **1:2**, and **1:3** (thinker : talker GPU allocation).

**64 concurrency:**

<img width="2794" height="1818" alt="Image" src="https://github.com/user-attachments/assets/0c80026f-5a66-441f-bd8a-da5ae4578677" />

**128 concurrency:**

<img width="2794" height="1818" alt="Image" src="https://github.com/user-attachments/assets/b85435cc-9117-4405-a21a-b285033332eb" />

**128 concurrency consistently underperforms 64** on throughput and E2E latency. This suggests that at very high in-flight request counts, the system hits a **non-GPU bottleneck** rather than simply benefiting from more parallelism.

Profiling points to the **orchestrator** as the main source of this regression.

---

## Orchestrator behavior under load

### Idle vs active loop iterations

We instrumented the orchestrator main loop to track **idle vs working iterations** per 1s window. Under cc128, the loop stays busy for most of the run, but during **thinker-heavy** periods we see sharp drops in active iteration rate — in some 1s windows it falls **close to zero**:

<img width="2100" height="600" alt="Image" src="https://github.com/user-attachments/assets/903e067c-91ce-40c5-978f-3e7f5de59e50" />

### `process_outputs` dominates blocked time

When active iterations collapse, wall time is dominated by **`process_outputs`**:

<img width="2100" height="750" alt="Image" src="https://github.com/user-attachments/assets/cd10bbd2-a6c3-4fd2-8293-0b7e728b3cc5" />

The cost is concentrated on **thinker (stage 0)** replicas — not evenly spread across stages:

<img width="2100" height="750" alt="Image" src="https://github.com/user-attachments/assets/a7c44061-a859-47e3-8636-104b1a43c6b3" />

### Head-of-line blocking and output queue backlog

The orchestrator runs a **single asyncio loop** that polls all stage replicas sequentially (thinker → talker replicas → codec replicas). If one poll triggers a slow, synchronous `process_outputs` call, **every other stage waits** on the same event loop.

Meanwhile, each stage's EngineCore client keeps pushing step outputs into an **`outputs_queue`** (ZMQ receiver → asyncio queue → orchestrator `get_output_async()`). When the orchestrator is blocked, this queue backs up. We profiled the **max queue depth per 1s window**:

<img width="2100" height="600" alt="Image" src="https://github.com/user-attachments/assets/6902e687-a9bd-4993-a9b6-131d698cf59a" />

In the worst windows, queue depth exceeds **100** for thinker/talker. This adds latency variance and hurts **RTF** and **E2E** time, even when GPU utilization looks healthy.

Under cc128 this effect is worse because (1) more concurrent requests produce more orchestrator work per second, and (2) thinker `process_outputs` occasionally stalls for **hundreds of milliseconds** per batch (see below).

---

## Root cause: redundant inter-stage latent IPC + tensor consolidation

Tracing the code, the expensive thinker path is **not** normal text detokenization. With **`--async-chunk`**, thinker hidden states are already sent to talker via the **stage connector** (`save_async`). However, on **`main`**, the scheduler **also forwards the same latent tensors** to the orchestrator on every decode step — there is no gate today:

```466:489:vllm_omni/core/sched/omni_ar_scheduler.py
            if new_token_ids or mm_output is not None or pooler_output is not None or kv_transfer_params or stopped:
                # Add EngineCoreOutput for this Request.
                outputs[request.client_index].append(
                    OmniEngineCoreOutput(
                        request_id=req_id,
                        new_token_ids=new_token_ids,
                        finish_reason=finish_reason,
                        new_logprobs=new_logprobs,
                        new_prompt_logprobs_tensors=prompt_logprobs_tensors,
                        pooling_output=pooler_output,
                        multimodal_output=mm_output,
                        stop_reason=request.stop_reason,
                        events=request.take_events(),
                        prefill_stats=request.take_prefill_stats(),
                        kv_transfer_params=kv_transfer_params,
                        trace_headers=request.trace_headers,
                        routed_experts=routed_experts,
                        num_nans_in_logits=request.num_nans_in_logits,
                        is_segment_finished=is_segment_finished,
                        new_prompt_len_snapshot=self._new_prompt_len_snapshot.get(req_id, None),
                    )
                )
                if self.chunk_transfer_adapter is not None:
                    self.chunk_transfer_adapter.save_async(mm_output, request, is_segment_finished)
```

So each thinker step sends latents **twice**: once through `save_async()` to the next stage, and again on the wire to the orchestrator via `multimodal_output`.

For Qwen3-Omni thinker → talker, the payload contains **layer-0 word embeddings** and **layer-24 hidden states** (used by `thinker2talker_async_chunk`):

```36:38:vllm_omni/model_executor/stage_input_processors/qwen3_omni.py
# Pooling output layer keys: "0" = word embedding, "24" = accept_hidden_layer
_EMBED_LAYER_KEY = "0"
_HIDDEN_LAYER_KEY = "24"
```

On the orchestrator side, `MultimodalOutputProcessor.process_outputs()` **accumulates every step's `multimodal_output`**:

```603:608:vllm_omni/engine/output_processor.py
            if isinstance(req_state, OmniRequestState):
                mm_output = getattr(eco, "multimodal_output", None)
                if mm_output is not None:
                    mm_type = getattr(eco, "output_type", None) or default_mm_type
                    req_state.add_multimodal_tensor(mm_output, mm_type)
```

Per step, `add_multimodal_tensor()` **copies tensors to CPU** and appends them to deferred lists (to avoid O(n²) `cat` on every token):

```154:195:vllm_omni/engine/output_processor.py
    def add_multimodal_tensor(self, payload: Any | None, mm_type: str | None) -> None:
        ...
                    remapped[k] = _to_cpu(v)
        ...
                _merge_payload(self.mm_accumulated, incoming)
```

When a request finishes (or on cumulative output steps), `make_request_output()` calls **`_consolidate_multimodal_tensors()`**, which runs **`torch.cat`** over the full accumulated sequence — for a 900-token request this can mean concatenating **~900 chunks × 2 layers** of hidden states on the orchestrator thread:

```199:219:vllm_omni/engine/output_processor.py
    def _consolidate_multimodal_tensors(self) -> None:
        ...
            for k, v in list(self.mm_accumulated.tensors.items()):
                if isinstance(v, list) and v and isinstance(v[0], torch.Tensor):
                    ...
                        self.mm_accumulated.tensors[k] = _cat_tensors(v, strategy)
```

**Why this is redundant:** with `--async-chunk`, talker already receives these tensors through the connector. The orchestrator only needs **text tokens** for client streaming; it does not consume thinker hidden states for stage routing. Yet under the default path it still pays **ZMQ IPC + CPU copy + finish-time concat** on every in-flight request — cost scales with **concurrency × output length**, which explains why cc128 is worse than cc64.

> **Note:** An earlier attempt to strip only `pooling_output` had no effect, because on current AR runners **`pooler_output` is `None` on the wire** and latents are carried on **`multimodal_output`** instead.

---

@hsliuustc0106 @Gaohan123 @fake0fan @amy-why-3459 

### Your code version

<details>
<summary>The commit id or version of vllm</summary>

```text

```
</details>
<details>
<summary>The commit id or version of vllm-omni</summary>

```text

```
</details>


### 🐛 Describe the bug

```

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Orchestrator bottleneck caused by redundant forwarding of thinker hidden states over the orchestrator wire under async-chunk #4469

Your current environment

Orchestrator bottleneck under high concurrency (cc64 vs cc128)

Concurrency sweep: cc64 vs cc128

Orchestrator behavior under load

Idle vs active loop iterations

`process_outputs` dominates blocked time

Head-of-line blocking and output queue backlog

Root cause: redundant inter-stage latent IPC + tensor consolidation

Your code version

🐛 Describe the bug

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Bug]: Orchestrator bottleneck caused by redundant forwarding of thinker hidden states over the orchestrator wire under async-chunk #4469

Description

Your current environment

Orchestrator bottleneck under high concurrency (cc64 vs cc128)

Concurrency sweep: cc64 vs cc128

Orchestrator behavior under load

Idle vs active loop iterations

process_outputs dominates blocked time

Head-of-line blocking and output queue backlog

Root cause: redundant inter-stage latent IPC + tensor consolidation

Your code version

🐛 Describe the bug

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`process_outputs` dominates blocked time