Skip to content

[Bug]: Orchestrator bottleneck caused by redundant forwarding of thinker hidden states over the orchestrator wire under async-chunk #4469

@yinpeiqi

Description

@yinpeiqi

Your current environment

Orchestrator bottleneck under high concurrency (cc64 vs cc128)

Concurrency sweep: cc64 vs cc128

We profiled the effect of max concurrency (64 vs 128) under request_rate=inf. Both thinker and talker use max_num_seqs=32. We tested three pipeline layouts: 1:1, 1:2, and 1:3 (thinker : talker GPU allocation).

64 concurrency:

Image

128 concurrency:

Image

128 concurrency consistently underperforms 64 on throughput and E2E latency. This suggests that at very high in-flight request counts, the system hits a non-GPU bottleneck rather than simply benefiting from more parallelism.

Profiling points to the orchestrator as the main source of this regression.


Orchestrator behavior under load

Idle vs active loop iterations

We instrumented the orchestrator main loop to track idle vs working iterations per 1s window. Under cc128, the loop stays busy for most of the run, but during thinker-heavy periods we see sharp drops in active iteration rate — in some 1s windows it falls close to zero:

Image

process_outputs dominates blocked time

When active iterations collapse, wall time is dominated by process_outputs:

Image

The cost is concentrated on thinker (stage 0) replicas — not evenly spread across stages:

Image

Head-of-line blocking and output queue backlog

The orchestrator runs a single asyncio loop that polls all stage replicas sequentially (thinker → talker replicas → codec replicas). If one poll triggers a slow, synchronous process_outputs call, every other stage waits on the same event loop.

Meanwhile, each stage's EngineCore client keeps pushing step outputs into an outputs_queue (ZMQ receiver → asyncio queue → orchestrator get_output_async()). When the orchestrator is blocked, this queue backs up. We profiled the max queue depth per 1s window:

Image

In the worst windows, queue depth exceeds 100 for thinker/talker. This adds latency variance and hurts RTF and E2E time, even when GPU utilization looks healthy.

Under cc128 this effect is worse because (1) more concurrent requests produce more orchestrator work per second, and (2) thinker process_outputs occasionally stalls for hundreds of milliseconds per batch (see below).


Root cause: redundant inter-stage latent IPC + tensor consolidation

Tracing the code, the expensive thinker path is not normal text detokenization. With --async-chunk, thinker hidden states are already sent to talker via the stage connector (save_async). However, on main, the scheduler also forwards the same latent tensors to the orchestrator on every decode step — there is no gate today:

            if new_token_ids or mm_output is not None or pooler_output is not None or kv_transfer_params or stopped:
                # Add EngineCoreOutput for this Request.
                outputs[request.client_index].append(
                    OmniEngineCoreOutput(
                        request_id=req_id,
                        new_token_ids=new_token_ids,
                        finish_reason=finish_reason,
                        new_logprobs=new_logprobs,
                        new_prompt_logprobs_tensors=prompt_logprobs_tensors,
                        pooling_output=pooler_output,
                        multimodal_output=mm_output,
                        stop_reason=request.stop_reason,
                        events=request.take_events(),
                        prefill_stats=request.take_prefill_stats(),
                        kv_transfer_params=kv_transfer_params,
                        trace_headers=request.trace_headers,
                        routed_experts=routed_experts,
                        num_nans_in_logits=request.num_nans_in_logits,
                        is_segment_finished=is_segment_finished,
                        new_prompt_len_snapshot=self._new_prompt_len_snapshot.get(req_id, None),
                    )
                )
                if self.chunk_transfer_adapter is not None:
                    self.chunk_transfer_adapter.save_async(mm_output, request, is_segment_finished)

So each thinker step sends latents twice: once through save_async() to the next stage, and again on the wire to the orchestrator via multimodal_output.

For Qwen3-Omni thinker → talker, the payload contains layer-0 word embeddings and layer-24 hidden states (used by thinker2talker_async_chunk):

# Pooling output layer keys: "0" = word embedding, "24" = accept_hidden_layer
_EMBED_LAYER_KEY = "0"
_HIDDEN_LAYER_KEY = "24"

On the orchestrator side, MultimodalOutputProcessor.process_outputs() accumulates every step's multimodal_output:

            if isinstance(req_state, OmniRequestState):
                mm_output = getattr(eco, "multimodal_output", None)
                if mm_output is not None:
                    mm_type = getattr(eco, "output_type", None) or default_mm_type
                    req_state.add_multimodal_tensor(mm_output, mm_type)

Per step, add_multimodal_tensor() copies tensors to CPU and appends them to deferred lists (to avoid O(n²) cat on every token):

    def add_multimodal_tensor(self, payload: Any | None, mm_type: str | None) -> None:
        ...
                    remapped[k] = _to_cpu(v)
        ...
                _merge_payload(self.mm_accumulated, incoming)

When a request finishes (or on cumulative output steps), make_request_output() calls _consolidate_multimodal_tensors(), which runs torch.cat over the full accumulated sequence — for a 900-token request this can mean concatenating ~900 chunks × 2 layers of hidden states on the orchestrator thread:

    def _consolidate_multimodal_tensors(self) -> None:
        ...
            for k, v in list(self.mm_accumulated.tensors.items()):
                if isinstance(v, list) and v and isinstance(v[0], torch.Tensor):
                    ...
                        self.mm_accumulated.tensors[k] = _cat_tensors(v, strategy)

Why this is redundant: with --async-chunk, talker already receives these tensors through the connector. The orchestrator only needs text tokens for client streaming; it does not consume thinker hidden states for stage routing. Yet under the default path it still pays ZMQ IPC + CPU copy + finish-time concat on every in-flight request — cost scales with concurrency × output length, which explains why cc128 is worse than cc64.

Note: An earlier attempt to strip only pooling_output had no effect, because on current AR runners pooler_output is None on the wire and latents are carried on multimodal_output instead.


@hsliuustc0106 @Gaohan123 @fake0fan @amy-why-3459

Your code version

The commit id or version of vllm

The commit id or version of vllm-omni

🐛 Describe the bug


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingmedium prioritymedium priority issue

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions