Your current environment
Orchestrator bottleneck under high concurrency (cc64 vs cc128)
Concurrency sweep: cc64 vs cc128
We profiled the effect of max concurrency (64 vs 128) under request_rate=inf. Both thinker and talker use max_num_seqs=32. We tested three pipeline layouts: 1:1, 1:2, and 1:3 (thinker : talker GPU allocation).
64 concurrency:
128 concurrency:
128 concurrency consistently underperforms 64 on throughput and E2E latency. This suggests that at very high in-flight request counts, the system hits a non-GPU bottleneck rather than simply benefiting from more parallelism.
Profiling points to the orchestrator as the main source of this regression.
Orchestrator behavior under load
Idle vs active loop iterations
We instrumented the orchestrator main loop to track idle vs working iterations per 1s window. Under cc128, the loop stays busy for most of the run, but during thinker-heavy periods we see sharp drops in active iteration rate — in some 1s windows it falls close to zero:
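For reference, the kind of per-window counting described above can be sketched as follows. This is a minimal illustration, not the instrumentation we actually ran: `LoopWindowStats` and its methods are hypothetical names, and timestamps are passed in explicitly so the bucketing is easy to check.

```python
from collections import defaultdict


class LoopWindowStats:
    """Counts active vs idle loop iterations per fixed-size time window.

    Hypothetical helper for illustration; not part of vllm-omni.
    """

    def __init__(self, window_s: float = 1.0) -> None:
        self.window_s = window_s
        self.active: dict[int, int] = defaultdict(int)
        self.idle: dict[int, int] = defaultdict(int)

    def record(self, now_s: float, did_work: bool) -> None:
        # Bucket the iteration by the window its timestamp falls into.
        window = int(now_s // self.window_s)
        if did_work:
            self.active[window] += 1
        else:
            self.idle[window] += 1

    def summary(self, window: int) -> tuple[int, int]:
        # Returns (active_iterations, idle_iterations) for one window.
        return self.active.get(window, 0), self.idle.get(window, 0)


# Three iterations in window 0, then one iteration in window 1.
stats = LoopWindowStats(window_s=1.0)
stats.record(0.10, did_work=True)
stats.record(0.40, did_work=False)
stats.record(0.90, did_work=True)
stats.record(1.20, did_work=False)
```

In a real loop, `now_s` would come from something like `time.monotonic()` at the top of each iteration. A window whose active count is near zero while outputs are still pending is the signature we are describing here.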
process_outputs dominates blocked time
When active iterations collapse, wall time is dominated by process_outputs:
The cost is concentrated on thinker (stage 0) replicas — not evenly spread across stages:
Head-of-line blocking and output queue backlog
The orchestrator runs a single asyncio loop that polls all stage replicas sequentially (thinker → talker replicas → codec replicas). If one poll triggers a slow, synchronous process_outputs call, every other stage waits on the same event loop.
Meanwhile, each stage's EngineCore client keeps pushing step outputs into an outputs_queue (ZMQ receiver → asyncio queue → orchestrator get_output_async()). When the orchestrator is blocked, this queue backs up. We profiled the max queue depth per 1s window:
In the worst windows, queue depth exceeds 100 for thinker/talker. This adds latency variance and hurts RTF and E2E time, even when GPU utilization looks healthy.
Under cc128 this effect is worse because (1) more concurrent requests produce more orchestrator work per second, and (2) thinker process_outputs occasionally stalls for hundreds of milliseconds per batch (see below).
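The head-of-line effect can be reproduced in isolation. The sketch below is a toy model, not orchestrator code: `other_stage_poll` stands in for polling the remaining stages, `process_outputs` is a stand-in with the same name as the real method but none of its logic, and the CPU work is a placeholder. It counts how many times the other poller gets to run while one `process_outputs` call is in progress.

```python
import asyncio


async def other_stage_poll(counter: list[int]) -> None:
    # Stand-in for polling the other stages; each turn on the loop bumps the counter.
    while True:
        counter[0] += 1
        await asyncio.sleep(0)


async def process_outputs(num_chunks: int, cooperative: bool) -> None:
    # Toy stand-in for per-batch output processing.
    for _ in range(num_chunks):
        sum(range(1000))  # placeholder CPU work (copy/concat in the real path)
        if cooperative:
            await asyncio.sleep(0)  # hand control back to the event loop


async def polls_during_processing(cooperative: bool, num_chunks: int = 100) -> int:
    counter = [0]
    poller = asyncio.create_task(other_stage_poll(counter))
    await asyncio.sleep(0)  # let the poller start
    before = counter[0]
    await process_outputs(num_chunks, cooperative)
    after = counter[0]
    poller.cancel()
    try:
        await poller
    except asyncio.CancelledError:
        pass
    return after - before


if __name__ == "__main__":
    print("synchronous:", asyncio.run(polls_during_processing(cooperative=False)))
    print("cooperative:", asyncio.run(polls_during_processing(cooperative=True)))
```

When `process_outputs` never awaits, the other poller gets no turns at all until it returns, which matches the collapse in active iterations and the queue backlog described above. Yielding inside the loop only illustrates the scheduling behavior; it is not proposed as the fix, since the work itself is redundant.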
Root cause: redundant inter-stage latent IPC + tensor consolidation
Tracing the code, the expensive thinker path is not normal text detokenization. With --async-chunk, thinker hidden states are already sent to talker via the stage connector (save_async). However, on main, the scheduler also forwards the same latent tensors to the orchestrator on every decode step — there is no gate today:
```python
if new_token_ids or mm_output is not None or pooler_output is not None or kv_transfer_params or stopped:
    # Add EngineCoreOutput for this Request.
    outputs[request.client_index].append(
        OmniEngineCoreOutput(
            request_id=req_id,
            new_token_ids=new_token_ids,
            finish_reason=finish_reason,
            new_logprobs=new_logprobs,
            new_prompt_logprobs_tensors=prompt_logprobs_tensors,
            pooling_output=pooler_output,
            multimodal_output=mm_output,
            stop_reason=request.stop_reason,
            events=request.take_events(),
            prefill_stats=request.take_prefill_stats(),
            kv_transfer_params=kv_transfer_params,
            trace_headers=request.trace_headers,
            routed_experts=routed_experts,
            num_nans_in_logits=request.num_nans_in_logits,
            is_segment_finished=is_segment_finished,
            new_prompt_len_snapshot=self._new_prompt_len_snapshot.get(req_id, None),
        )
    )
    if self.chunk_transfer_adapter is not None:
        self.chunk_transfer_adapter.save_async(mm_output, request, is_segment_finished)
```
So each thinker step sends latents twice: once through save_async() to the next stage, and again on the wire to the orchestrator via multimodal_output.
For Qwen3-Omni thinker → talker, the payload contains layer-0 word embeddings and layer-24 hidden states (used by thinker2talker_async_chunk):
```python
# Pooling output layer keys: "0" = word embedding, "24" = accept_hidden_layer
_EMBED_LAYER_KEY = "0"
_HIDDEN_LAYER_KEY = "24"
```
On the orchestrator side, MultimodalOutputProcessor.process_outputs() accumulates every step's multimodal_output:
```python
if isinstance(req_state, OmniRequestState):
    mm_output = getattr(eco, "multimodal_output", None)
    if mm_output is not None:
        mm_type = getattr(eco, "output_type", None) or default_mm_type
        req_state.add_multimodal_tensor(mm_output, mm_type)
```
Per step, add_multimodal_tensor() copies tensors to CPU and appends them to deferred lists (to avoid O(n²) cat on every token):
```python
def add_multimodal_tensor(self, payload: Any | None, mm_type: str | None) -> None:
    ...
    remapped[k] = _to_cpu(v)
    ...
    _merge_payload(self.mm_accumulated, incoming)
```
When a request finishes (or on cumulative output steps), make_request_output() calls _consolidate_multimodal_tensors(), which runs torch.cat over the full accumulated sequence — for a 900-token request this can mean concatenating ~900 chunks × 2 layers of hidden states on the orchestrator thread:
```python
def _consolidate_multimodal_tensors(self) -> None:
    ...
    for k, v in list(self.mm_accumulated.tensors.items()):
        if isinstance(v, list) and v and isinstance(v[0], torch.Tensor):
            ...
            self.mm_accumulated.tensors[k] = _cat_tensors(v, strategy)
```
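To make the O(n²) remark concrete, the sketch below counts how many token positions get copied under each strategy. It uses plain integers instead of tensors, and both function names are illustrative, so this is a cost model and not the actual accumulation code. It assumes each concatenation allocates a new buffer and copies everything accumulated so far, which is how `torch.cat` behaves.

```python
def positions_copied_eager(num_steps: int, chunk_len: int = 1) -> int:
    """Concatenate on every step: each cat re-copies the whole accumulated prefix."""
    total = 0
    accumulated = 0
    for _ in range(num_steps):
        accumulated += chunk_len
        total += accumulated  # the new buffer holds everything seen so far
    return total


def positions_copied_deferred(num_steps: int, chunk_len: int = 1) -> int:
    """Append to a list per step, then concatenate once at the end."""
    return num_steps * chunk_len  # every position is copied exactly once


steps = 900  # the 900-token request mentioned above
eager = positions_copied_eager(steps)
deferred = positions_copied_deferred(steps)
```

For 900 single-token steps this is 405,450 copied positions with per-step concatenation versus 900 with the deferred list. The deferred path is therefore already linear, but the single finish-time concat, plus the per-step CPU copy, still runs on the orchestrator thread for every request.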
Why this is redundant: with --async-chunk, talker already receives these tensors through the connector. The orchestrator only needs text tokens for client streaming; it does not consume thinker hidden states for stage routing. Yet under the default path it still pays ZMQ IPC + CPU copy + finish-time concat on every in-flight request — cost scales with concurrency × output length, which explains why cc128 is worse than cc64.
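A back-of-envelope estimate shows how this scales. The numbers below are assumptions for illustration only: `ASSUMED_HIDDEN_SIZE` is a placeholder and is not read from the model config, 2 bytes per element assumes bf16/fp16, and ZMQ serialization overhead is ignored. Only the two-layer payload and the 900-token example come from the description above.

```python
ASSUMED_HIDDEN_SIZE = 2048  # placeholder value, NOT taken from the Qwen3-Omni config
BYTES_PER_ELEM = 2          # assumes bf16/fp16 latents
NUM_LAYERS = 2              # layer "0" embeddings + layer "24" hidden states


def latent_bytes_per_request(output_tokens: int, hidden_size: int = ASSUMED_HIDDEN_SIZE) -> int:
    # One hidden vector per layer per decode step, forwarded to the orchestrator.
    return output_tokens * NUM_LAYERS * hidden_size * BYTES_PER_ELEM


def latent_bytes_in_flight(concurrency: int, output_tokens: int) -> int:
    # Upper bound if every in-flight request runs to the same output length.
    return concurrency * latent_bytes_per_request(output_tokens)


per_request = latent_bytes_per_request(900)
cc64 = latent_bytes_in_flight(64, 900)
cc128 = latent_bytes_in_flight(128, 900)
```

Under these assumptions a single 900-token request moves about 7 MiB of latents through the orchestrator, and the in-flight total doubles from cc64 to cc128. The absolute figures depend on the real hidden size and dtype; the linear scaling in concurrency and output length does not.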
Note: An earlier attempt to strip only pooling_output had no effect, because on current AR runners pooler_output is None on the wire and latents are carried on multimodal_output instead.
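One possible shape for a gate, given the note above, is sketched here. Everything in it is hypothetical: `StepOutput` is a small stand-in for `OmniEngineCoreOutput`, and `strip_latents_for_orchestrator` and its two flags are names made up for this example, not existing vllm-omni APIs. The point is only that both `pooling_output` and `multimodal_output` would need to be cleared, and only when the connector has already delivered the latents.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class StepOutput:
    """Hypothetical stand-in for OmniEngineCoreOutput (only the fields needed here)."""

    request_id: str
    new_token_ids: list[int]
    pooling_output: Any = None
    multimodal_output: Any = None


def strip_latents_for_orchestrator(
    out: StepOutput,
    async_chunk_enabled: bool,
    sent_via_connector: bool,
) -> StepOutput:
    # Keep latents unless the next stage already received them through the connector.
    if not (async_chunk_enabled and sent_via_connector):
        return out
    # Clear both carriers; text tokens stay so client streaming is unaffected.
    return replace(out, pooling_output=None, multimodal_output=None)


step = StepOutput("req-0", [42], multimodal_output={"0": "embed", "24": "hidden"})
gated = strip_latents_for_orchestrator(step, async_chunk_enabled=True, sent_via_connector=True)
ungated = strip_latents_for_orchestrator(step, async_chunk_enabled=False, sent_via_connector=True)
```

Whether the orchestrator still needs these tensors for the final stage's output or for non-async-chunk pipelines is something the real change would have to check; this sketch does not address that.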
@hsliuustc0106 @Gaohan123 @fake0fan @amy-why-3459
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.