Proposal to improve performance
This issue is distinct from #1238. #1238 reported an aggregate benchmark-level increase in Qwen3-Omni audio TTFP when max_concurrency changed from 1 to 2, but the behavior was not consistently reproducible and no queue-level mechanism was established. Here, request-wave and queue-level profiling isolates a generic Orchestrator service-rate problem: each polling round consumes at most one raw output per stage replica and may wait on an empty stage, while downstream queues continue to accumulate non-user-visible raw outputs. The behavior affects every request; num_prompts > max_concurrency merely exposes it more clearly because later request waves inherit a deeper backlog. This proposal therefore focuses on stage-output retrieval and routing, rather than Code2Wav computation, batching, or cross-stage chunk transfer.
Expected behavior
After stage-2 produces the first audio packet for a request, the Orchestrator should retrieve and route it promptly. Later request waves may incur legitimate admission or compute queueing, but should not incur a large additional delay once the packet is ready.
Actual behavior
For any request, the first audio packet can remain behind earlier non-user-visible raw outputs when the Orchestrator drain rate is lower than the downstream production rate. The effect is usually modest in the first concurrency-sized wave, but becomes high and unstable in later waves as queue backlog accumulates.
Proposed change
The minimal prototype changes the following locations:
- vLLM: add a non-blocking dequeue API
  Add a generic API such as get_output_nowait() or try_get_output() that:
  - returns immediately when the output queue is empty;
  - preserves FIFO ordering;
  - propagates queued exceptions consistently;
  - leaves the existing blocking get_output_async() semantics unchanged.
- vLLM-Omni: perform non-blocking, bounded draining in StagePool
  Update _poll_stage_raw() / poll_llm_raw_output() so that one polling pass:
  - returns immediately when no raw output is ready;
  - drains already-ready raw outputs until the first routable request output is produced, the queue becomes empty, or a bounded item/time budget is reached;
  - processes statistics, errors, terminal events, KV-ready metadata, and streaming state transitions encountered during draining.
Conceptually:
empty queue -> return immediately
[N N N A ...] -> process N entries and return A in the same bounded polling pass
The drain must be bounded to preserve fairness across stages. Reducing the current timeout alone is insufficient because it does not change the one-item-per-stage-per-round drain rate.
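The two changes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype itself: `RawOutput`, `try_get_output`, `drain_until_routable`, the `is_routable` and `on_side_effect` callbacks, and the default budgets are all hypothetical, and a plain `asyncio.Queue` stands in for the stage output queue. The sketch also assumes that errors are queued as exception objects.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RawOutput:
    """Hypothetical stand-in for one raw stage output."""
    request_id: str
    payload: Optional[Any] = None  # None models a non-user-visible entry


def try_get_output(queue: asyncio.Queue) -> Optional[Any]:
    """Non-blocking dequeue: return None at once if the queue is empty.

    FIFO order comes from asyncio.Queue itself. A queued exception
    (assumption: errors are enqueued as exception objects) is re-raised.
    """
    try:
        item = queue.get_nowait()
    except asyncio.QueueEmpty:
        return None
    if isinstance(item, Exception):
        raise item
    return item


def drain_until_routable(
    queue: asyncio.Queue,
    is_routable: Callable[[Any], bool],
    on_side_effect: Callable[[Any], None],
    max_items: int = 32,
    max_seconds: float = 0.002,
) -> Optional[Any]:
    """Drain ready entries until one is routable or a budget is exhausted."""
    deadline = time.monotonic() + max_seconds
    for _ in range(max_items):
        item = try_get_output(queue)
        if item is None:
            return None  # empty queue: return immediately
        if is_routable(item):
            return item  # first routable output ends the pass
        # Non-routable entries are still processed (stats, control
        # metadata, state transitions), never silently dropped.
        on_side_effect(item)
        if time.monotonic() >= deadline:
            break  # time budget reached: yield to other stages
    return None
```

With a queue holding `[N N N A]`, a single call processes the three `N` entries through `on_side_effect` and returns `A`; with an empty queue it returns `None` without waiting.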
Report of performance regression
Observed pattern
When serving Qwen/Qwen3-Omni-30B-A3B-Instruct with text input and streaming text+audio output, we observe:
- The same stage-output polling path is used for every request, so the underlying drain-rate limitation is not restricted to requests beyond max_concurrency.
- With max_concurrency == 1, the queue usually does not build deeply enough for the resulting delay to become prominent.
- With max_concurrency > 1 and num_prompts == max_concurrency, only the first request wave is present. These requests may still incur output-queue polling delay, but the backlog is generally shallower and the impact is less visible.
- With max_concurrency > 1 and num_prompts > max_concurrency, later request waves inherit accumulated downstream queue backlog and therefore show much higher and more variable AUDIO_TTFP.
Therefore, num_prompts > max_concurrency is an observation and amplification condition, not a necessary trigger or the root cause. Although discovered with Qwen3-Omni, the underlying behavior is in the generic multi-stage output polling path and may also affect E2E and tail latency in other pipelines with imbalanced stage-output rates.
The benchmark used 16 requests at concurrency levels of 1, 4, 8, 12, and 16. TTFP grows as expected when concurrency increases, but requests beyond the first max_concurrency-sized wave show a sharp TTFP spike, as the figure shows.
For example, with a maximum concurrency of 4 and 16 requests, the TTFP of the first wave of 4 concurrent requests stays within a narrow range. The TTFP of the remaining 12 requests first rises and then falls, yet remains significantly higher than in the first wave.
Root cause analysis
The Orchestrator polls stage replicas sequentially and consumes at most one raw EngineCoreOutputs item from each replica in one orchestration round. If an upstream stage has no ready output, its poll can wait until the per-stage timeout expires, even when a downstream stage already has a backlog.
The downstream queue may contain many raw outputs that do not produce a frontend-visible text or audio payload after vLLM-Omni output processing. These outputs are non-user-visible for that iteration, but they may still carry statistics or control metadata and must not be silently discarded.
For the Qwen3-Omni audio path, the queue can conceptually become:
stage-2 output queue:
[N N N N A N N ...]
         ^
         first frontend-visible audio packet
N: raw output without a frontend-visible payload
A: first frontend-visible audio packet
If stage-2 produces raw outputs faster than the Orchestrator drains them, the first audio packet may already be enqueued but remain behind earlier entries. Its additional latency is therefore queue residence time rather than model execution time:
first audio produced
-> waits in stage-2 output queue
-> dequeued by Orchestrator
-> routed to frontend
The mechanism can affect every request. Its severity depends on the queue depth and Orchestrator round duration when the first audio packet is inserted: the initial request wave typically encounters a shallow queue, whereas later waves inherit accumulated backlog and therefore exhibit larger and more variable TTFP.
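To make the queue-residence argument concrete, the following toy model counts how many orchestration rounds pass before the first audio packet is dequeued. The function name, the queue contents, and the 50 ms round duration are illustrative assumptions rather than measurements, and the model ignores entries produced while draining.

```python
from collections import deque
from typing import Iterable, Optional


def rounds_until_first_audio(entries: Iterable[str], items_per_round: int) -> Optional[int]:
    """Return the 1-based round in which the first "A" entry is dequeued."""
    queue = deque(entries)
    rounds = 0
    while queue:
        rounds += 1
        for _ in range(items_per_round):
            if not queue:
                break
            if queue.popleft() == "A":
                return rounds
    return None  # no audio packet in the queue


# Four non-user-visible entries sit ahead of the first audio packet.
backlog = ["N"] * 4 + ["A", "N", "N"]

one_per_round = rounds_until_first_audio(backlog, items_per_round=1)
bounded_drain = rounds_until_first_audio(backlog, items_per_round=8)

ROUND_MS = 50  # assumed round duration, e.g. dominated by an empty-stage timeout
print(one_per_round * ROUND_MS, bounded_drain * ROUND_MS)  # prints "250 50"
```

In this model the added latency scales with the number of entries ahead of the packet multiplied by the round duration, which is why a deeper inherited backlog in later waves translates directly into higher TTFP.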
Reproduction and results
The chart below compares results before and after improving the output-polling method. The test load used a maximum concurrency of 4 with 16 requests.
The previously high TTFP was caused primarily by delayed retrieval of the first valid audio packet, not by slow audio computation. Outputs without an audio payload accumulated in the stage output queue, and the Orchestrator drained them too slowly. By draining past those entries within the same polling pass and returning only effective outputs, the new polling logic removes this queue-induced head-of-line blocking.
============ Serving Benchmark Result ============
Successful requests: 12
Failed requests: 0
Maximum request concurrency: 4
Benchmark duration (s): 84.83
Request throughput (req/s): 0.14
Peak concurrent requests: 6.00
================== Text Result ===================
Total input tokens: 552
Total generated tokens: 2585
Output token throughput (tok/s): 30.47
Peak output token throughput (tok/s): 272.00
Peak concurrent requests: 6.00
Total Token throughput (tok/s): 36.98
---------------Time to First Token----------------
Mean TTFT (ms): 192.09
Median TTFT (ms): 181.19
P99 TTFT (ms): 357.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 17.12
Median TPOT (ms): 15.72
P99 TPOT (ms): 28.01
---------------Inter-token Latency----------------
Mean ITL (ms): 13.25
Median ITL (ms): 9.76
P99 ITL (ms): 187.52
================== Audio Result ==================
Total audio duration generated(s): 1035.67
Total audio frames generated: 24856020
Audio throughput(audio duration/s): 12.21
============= Omni Pipeline Metrics ==============
Mean TTFA (ms): 1268.81
Median TTFA (ms): 1204.98
Mean Text-Audio Gap (ms): 1076.72
Median Text-Audio Gap (ms): 1066.77
Mean Audio E2EL (ms): 24496.26
Median Audio E2EL (ms): 25202.14
Mean Audio ITL (ms): 539.13
Median Audio ITL (ms): 462.84
Mean Pipeline Ratio (TTFT/TTFA): 0.15
==================================================
Queue-level evidence should include the first-audio production, Orchestrator dequeue, and frontend-send timestamps, together with stage queue depth and the number of entries ahead of the first audio packet.
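One possible shape for that evidence is a per-request trace record. The sketch below is an assumption about how such a record could look; the class and field names are hypothetical and do not correspond to existing vLLM-Omni instrumentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstAudioTrace:
    """Hypothetical per-request record for the first audio packet."""
    request_id: str
    produced_ts: float   # stage-2 enqueues the first audio packet (seconds)
    dequeued_ts: float   # Orchestrator dequeues it (seconds)
    sent_ts: float       # frontend send (seconds)
    queue_depth: int     # stage queue depth when the packet was enqueued
    entries_ahead: int   # entries in front of the packet at enqueue time

    @property
    def queue_residence_ms(self) -> float:
        """Time spent waiting in the stage output queue."""
        return (self.dequeued_ts - self.produced_ts) * 1e3

    @property
    def routing_ms(self) -> float:
        """Time from dequeue to frontend send."""
        return (self.sent_ts - self.dequeued_ts) * 1e3


# Illustrative values only, not measured data.
trace = FirstAudioTrace("req-5", 10.0, 10.25, 10.26, queue_depth=6, entries_ahead=5)
print(f"{trace.queue_residence_ms:.1f} ms in queue, {trace.routing_ms:.1f} ms routing")
```

Separating queue residence from routing time makes it possible to attribute a TTFP increase to the drain rate rather than to compute or transfer.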
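Please see related prior performance discussions and optimizations:
- #1238: reported an aggregate benchmark-level increase in Qwen3-Omni audio TTFP when changing max_concurrency from 1 to 2. Its follow-up results changed substantially across later tests, and it did not isolate the request-wave pattern or queue-consumption mechanism discussed here. It is included only for symptom-level context, not because this issue is a duplicate or continuation of it.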
Misc discussion on performance
This proposal is complementary to async-chunk, batching, and inter-packet-latency optimizations. Those efforts primarily improve stage computation and cross-stage transfer; this issue focuses on consuming and routing outputs that have already been produced.
Open design points:
- Upstream vLLM should preferably expose only a generic non-blocking dequeue API; Omni-specific routability should remain in StagePool.
- A routable-output predicate should be based on processed request outputs, not hard-coded to pooling_output or a model-specific audio field.
- Draining should use an item or elapsed-time budget to prevent a hot stage from starving other stages.
- The regression test should measure both the first concurrency-sized request window and later windows, verify that the first wave can also benefit from faster draining, and check output ordering, terminal/error propagation, and stage fairness.
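The stage-fairness point can be checked with a small model of one polling round. This is a sketch under assumptions: `poll_round`, the dictionary of per-stage deques, and the budget value are hypothetical and only illustrate the property a regression test might assert.

```python
from collections import deque
from typing import Any, Deque, Dict, List


def poll_round(stage_queues: Dict[int, Deque[Any]], item_budget: int) -> Dict[int, List[Any]]:
    """Visit every stage once, draining at most item_budget entries from each."""
    drained: Dict[int, List[Any]] = {}
    for stage_id, queue in stage_queues.items():
        taken: List[Any] = []
        while queue and len(taken) < item_budget:
            taken.append(queue.popleft())  # FIFO order is preserved
        drained[stage_id] = taken
    return drained


# A hot stage with a deep backlog must not delay a ready output elsewhere.
queues = {1: deque(range(100)), 2: deque(["A"])}
first_round = poll_round(queues, item_budget=4)
assert first_round[1] == [0, 1, 2, 3]  # hot stage is capped by the budget
assert first_round[2] == ["A"]         # cold stage is served in the same round
```

An elapsed-time budget could be layered on top of the item budget in the same loop; the item-only version is shown to keep the fairness property easy to see.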
Your current environment (if you think it is necessary)
Workload
- Model: Qwen/Qwen3-Omni-30B-A3B-Instruct
- Input: text
- Output: text + audio
- Underlying scope: all requests traversing the multi-stage output polling path
- Amplification/reproduction condition: max_concurrency > 1 and num_prompts > max_concurrency
- Primary metric: audio_ttfp
- Secondary metrics: e2el and output-queue residence time
Current environment and Scripts
# vllm bench serve command
vllm bench serve \
--omni \
--dataset-name random-mm \
--port 8000 \
--model /data/models/Qwen3-Omni-30B-A3B-Instruct \
--endpoint /v1/chat/completions \
--backend openai-chat-omni \
--request-rate inf \
--burstiness 1.0 \
--max-concurrency 2 \
--num-prompts 16 \
--ready-check-timeout-sec 600 \
--random-input-len 32 \
--random-range-ratio 0.0 \
--random-mm-base-items-per-request 0 \
--random-mm-num-mm-items-range-ratio 0 \
--random-mm-limit-mm-per-prompt '{"image":0,"video":0,"audio":0}' \
--ignore-eos \
--random-output-len 256 \
--extra_body '{"modalities": ["text", "audio"]}'
# deploy:~/vllm-omni/vllm_omni/deploy/qwen3_omni_moe.yaml
async_chunk: true
connectors:
  connector_of_shared_memory:
    name: SharedMemoryConnector
    extra:
      codec_chunk_frames: 25
      codec_left_context_frames: 25
stages:
  - stage_id: 0
    gpu_memory_utilization: 0.9
    devices: "0"
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      repetition_penalty: 1.05
  - stage_id: 1
    gpu_memory_utilization: 0.6
    devices: "1"
    input_connectors:
      from_stage_0: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.9
      top_k: 50
      max_tokens: 4096
      seed: 42
      repetition_penalty: 1.05
  - stage_id: 2
    gpu_memory_utilization: 0.1
    enforce_eager: true
    async_scheduling: false
    max_num_batched_tokens: 51200
    devices: "1"
    input_connectors:
      from_stage_1: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 65536
      seed: 42
      repetition_penalty: 1.1
platforms:
  npu:
    stages:
      - stage_id: 0
        gpu_memory_utilization: 0.6
        tensor_parallel_size: 2
        max_num_batched_tokens: 8192
        devices: "0,1"
      - stage_id: 1
        gpu_memory_utilization: 0.6
        max_num_batched_tokens: 8192
        devices: "2"
      - stage_id: 2
        gpu_memory_utilization: 0.3
        devices: "2"
  rocm:
    stages:
      - stage_id: 0
        enforce_eager: true
  xpu:
    stages:
      - stage_id: 0
        tensor_parallel_size: 4
        enforce_eager: true
        max_cudagraph_capture_size: 0
        devices: "0,1,2,3"
      - stage_id: 1
        enforce_eager: true
        max_cudagraph_capture_size: 0
        devices: "4"
      - stage_id: 2
        gpu_memory_utilization: 0.3
        max_cudagraph_capture_size: 0
        devices: "4"