[Performance]: Head-of-line blocking in stage output polling inflates audio TTFP and E2E under queued concurrency #4561

@yutou-03

Proposal to improve performance

This issue is distinct from #1238. #1238 reported an aggregate benchmark-level increase in Qwen3-Omni audio TTFP when max_concurrency changed from 1 to 2, but the behavior was not consistently reproducible and no queue-level mechanism was established. Here, request-wave and queue-level profiling isolates a generic Orchestrator service-rate problem: each polling round consumes at most one raw output per stage replica and may wait on an empty stage, while downstream queues continue to accumulate non-user-visible raw outputs. The behavior affects every request; num_prompts > max_concurrency merely exposes it more clearly because later request waves inherit a deeper backlog. This proposal therefore focuses on stage-output retrieval and routing, rather than Code2Wav computation, batching, or cross-stage chunk transfer.

Expected behavior

After stage-2 produces the first audio packet for a request, the Orchestrator should retrieve and route it promptly. Later request windows may incur legitimate admission or compute queueing, but should not incur large additional delay after the packet is already ready.

Actual behavior

For any request, the first audio packet can remain behind earlier non-user-visible raw outputs when the Orchestrator drain rate is lower than the downstream production rate. The effect is usually modest in the first concurrency-sized wave, but becomes high and unstable in later waves as queue backlog accumulates.

Proposed change

The minimal prototype changes the following locations:

  1. vLLM: add a non-blocking dequeue API

    Add a generic API such as get_output_nowait() or try_get_output() that:

    • returns immediately when the output queue is empty;
    • preserves FIFO ordering;
    • propagates queued exceptions consistently;
    • leaves the existing blocking get_output_async() semantics unchanged.
  2. vLLM-Omni: perform non-blocking, bounded draining in StagePool

    Update _poll_stage_raw() / poll_llm_raw_output() so that one polling pass:

    • returns immediately when no raw output is ready;
    • drains already-ready raw outputs until the first routable request output is produced, the queue becomes empty, or a bounded item/time budget is reached;
    • processes statistics, errors, terminal events, KV-ready metadata, and streaming state transitions encountered during draining.

Conceptually:

empty queue -> return immediately
[N N N A ...] -> process N entries and return A in the same bounded polling pass

The drain must be bounded to preserve fairness across stages. Reducing the current timeout alone is insufficient because it does not change the one-item-per-stage-per-round drain rate.
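
A minimal sketch of the two pieces above, assuming a plain `queue.Queue` as a stand-in for the stage output queue. `RawOutput`, `try_get_output`, `poll_stage_bounded`, `on_side_effects`, and the default budgets are hypothetical names and values chosen for illustration; they are not the existing vLLM or vLLM-Omni API.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RawOutput:
    """Hypothetical stand-in for one raw stage output."""
    request_id: str
    payload: Optional[bytes] = None  # None models a non-user-visible entry (N)


def try_get_output(q: "queue.Queue") -> Optional[RawOutput]:
    """Non-blocking dequeue: return None at once when the queue is empty.

    A queued exception is re-raised, so errors surface in FIFO order.
    """
    try:
        item = q.get_nowait()
    except queue.Empty:
        return None
    if isinstance(item, Exception):
        raise item
    return item


def poll_stage_bounded(
    q: "queue.Queue",
    on_side_effects: Callable[[RawOutput], None],
    max_items: int = 64,         # assumed item budget
    max_seconds: float = 0.002,  # assumed time budget
) -> Optional[RawOutput]:
    """Drain ready outputs until one is routable, the queue is empty,
    or the item/time budget is spent."""
    deadline = time.monotonic() + max_seconds
    for _ in range(max_items):
        item = try_get_output(q)
        if item is None:
            return None            # empty queue -> return immediately
        on_side_effects(item)      # statistics, terminal events, metadata
        if item.payload is not None:
            return item            # first routable output (A)
        if time.monotonic() >= deadline:
            return None            # budget spent; yield to other stages
    return None
```

With a queue holding `[N N N A N]`, one call processes the three `N` entries, returns `A`, and leaves the trailing `N` for the next pass. The side-effect callback runs for every dequeued entry, so nothing is silently discarded.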

Report of performance regression

Observed pattern

When serving Qwen/Qwen3-Omni-30B-A3B-Instruct with text input and streaming text+audio output, we observe:

  • The same stage-output polling path is used for every request, so the underlying drain-rate limitation is not restricted to requests beyond max_concurrency.
  • With max_concurrency == 1, the queue usually does not build deeply enough for the resulting delay to become prominent.
  • With max_concurrency > 1 and num_prompts == max_concurrency, only the first request wave is present. These requests may still incur output-queue polling delay, but the backlog is generally shallower and the impact is less visible.
  • With max_concurrency > 1 and num_prompts > max_concurrency, later request waves inherit accumulated downstream queue backlog and therefore show much higher and more variable AUDIO_TTFP.

Therefore, num_prompts > max_concurrency is an observation and amplification condition, not a necessary trigger or the root cause. Although discovered with Qwen3-Omni, the underlying behavior is in the generic multi-stage output polling path and may also affect E2E and tail latency in other pipelines with imbalanced stage-output rates.

The benchmark issues 16 requests at max_concurrency levels of 1, 4, 8, 12, and 16. TTFP grows as expected when concurrency increases, but requests beyond the first max_concurrency-sized wave show a sharp TTFP spike, as shown in the figure below.

Image

Taking the case with a maximum concurrency of 4 and 16 requests as an example: TTFP for the first wave of 4 concurrent requests stays within a narrow range, while TTFP for the remaining 12 requests first rises and then falls, yet remains significantly higher than in the first wave.

Image

Root cause analysis

The Orchestrator polls stage replicas sequentially and consumes at most one raw EngineCoreOutputs item from each replica in one orchestration round. If an upstream stage has no ready output, its poll can wait until the per-stage timeout expires, even when a downstream stage already has a backlog.

The downstream queue may contain many raw outputs that do not produce a frontend-visible text or audio payload after vLLM-Omni output processing. These outputs are non-user-visible for that iteration, but they may still carry statistics or control metadata and must not be silently discarded.

For the Qwen3-Omni audio path, the queue can conceptually become:

stage-2 output queue:
[N N N N A N N ...]
         ^
         first frontend-visible audio packet

N: raw output without a frontend-visible payload
A: first frontend-visible audio packet

If stage-2 produces raw outputs faster than the Orchestrator drains them, the first audio packet may already be enqueued but remain behind earlier entries. Its additional latency is therefore queue residence time rather than model execution time:

first audio produced
    -> waits in stage-2 output queue
    -> dequeued by Orchestrator
    -> routed to frontend

The mechanism can affect every request. Its severity depends on the queue depth and Orchestrator round duration when the first audio packet is inserted: the initial request wave typically encounters a shallow queue, whereas later waves inherit accumulated backlog and therefore exhibit larger and more variable TTFP.
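
The dependence on queue depth and round duration can be sketched as a toy model. It assumes a fixed round duration and ignores entries produced after the audio packet was enqueued; the function names and the numbers in the usage note are illustrative, not measurements.

```python
def rounds_until_first_audio(entries_ahead: int, items_per_round: int) -> int:
    """Polling rounds needed before the first audio packet is dequeued.

    entries_ahead: raw outputs queued in front of the audio packet.
    items_per_round: items the Orchestrator drains from this stage per round.
    """
    if entries_ahead < 0 or items_per_round < 1:
        raise ValueError("entries_ahead must be >= 0 and items_per_round >= 1")
    # The audio packet sits at 0-based position `entries_ahead` in the queue.
    return entries_ahead // items_per_round + 1


def queue_residence_ms(entries_ahead: int, items_per_round: int, round_ms: int) -> int:
    """Approximate queue residence time under a fixed round duration."""
    return rounds_until_first_audio(entries_ahead, items_per_round) * round_ms
```

For example, with 4 entries ahead and an assumed 10 ms round, draining one item per round gives `queue_residence_ms(4, 1, 10) == 50`, while a budget of 8 items per round gives `queue_residence_ms(4, 8, 10) == 10`. Shortening the timeout only reduces `round_ms`; the number of rounds stays the same.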

Reproduction and results

The chart below compares the results before and after changing how stage outputs are polled. The test load used a maximum concurrency of 4 and 16 requests.

The previously high TTFP was caused primarily by delayed retrieval of the first valid audio packet, not by slow audio computation. Outputs without an effective audio payload accumulated in the stage output queue, and the Orchestrator drained them too slowly. By draining past those entries within the same polling pass and returning the first effective output, the new polling logic removes this queue-induced head-of-line blocking.

Image
============ Serving Benchmark Result ============
Successful requests:                     12        
Failed requests:                         0         
Maximum request concurrency:             4         
Benchmark duration (s):                  84.83     
Request throughput (req/s):              0.14      
Peak concurrent requests:                6.00      
================== Text Result ===================
Total input tokens:                      552       
Total generated tokens:                  2585      
Output token throughput (tok/s):         30.47     
Peak output token throughput (tok/s):    272.00    
Peak concurrent requests:                6.00      
Total Token throughput (tok/s):          36.98     
---------------Time to First Token----------------
Mean TTFT (ms):                          192.09    
Median TTFT (ms):                        181.19    
P99 TTFT (ms):                           357.88    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.12     
Median TPOT (ms):                        15.72     
P99 TPOT (ms):                           28.01     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.25     
Median ITL (ms):                         9.76      
P99 ITL (ms):                            187.52    
================== Audio Result ==================
Total audio duration generated(s):       1035.67   
Total audio frames generated:            24856020  
Audio throughput(audio duration/s):      12.21     
============= Omni Pipeline Metrics ==============
Mean TTFA (ms):                          1268.81   
Median TTFA (ms):                        1204.98   
Mean Text-Audio Gap (ms):                1076.72   
Median Text-Audio Gap (ms):              1066.77   
Mean Audio E2EL (ms):                    24496.26  
Median Audio E2EL (ms):                  25202.14  
Mean Audio ITL (ms):                     539.13    
Median Audio ITL (ms):                   462.84    
Mean Pipeline Ratio (TTFT/TTFA):         0.15      
==================================================

Queue-level evidence should include the first-audio production, Orchestrator dequeue, and frontend-send timestamps, together with stage queue depth and the number of entries ahead of the first audio packet.
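
One possible shape for such a per-request record is sketched below. `FirstAudioTrace` and its field names are hypothetical; the only assumption is that all three timestamps come from the same monotonic clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstAudioTrace:
    """Hypothetical per-request record of first-audio queue evidence."""
    request_id: str
    produced_ts: float   # stage-2 enqueues the first audio packet
    dequeued_ts: float   # Orchestrator dequeues it
    sent_ts: float       # frontend sends it to the client
    queue_depth: int     # stage queue depth when the packet was enqueued
    entries_ahead: int   # entries ahead of the first audio packet

    @property
    def queue_residence_s(self) -> float:
        """Time the packet waited in the stage output queue."""
        return self.dequeued_ts - self.produced_ts

    @property
    def routing_delay_s(self) -> float:
        """Time from dequeue to frontend send."""
        return self.sent_ts - self.dequeued_ts
```

Separating `queue_residence_s` from `routing_delay_s` shows whether the added TTFP comes from waiting behind earlier entries or from routing after dequeue.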

Please see related prior performance discussions and optimizations:

Misc discussion on performance

This proposal is complementary to async-chunk, batching, and inter-packet-latency optimizations. Those efforts primarily improve stage computation and cross-stage transfer; this issue focuses on consuming and routing outputs that have already been produced.

Open design points:

  1. Upstream vLLM should preferably expose only a generic non-blocking dequeue API; Omni-specific routability should remain in StagePool.
  2. A routable-output predicate should be based on processed request outputs, not hard-coded to pooling_output or a model-specific audio field.
  3. Draining should use an item or elapsed-time budget to prevent a hot stage from starving other stages.
  4. The regression test should measure both the first concurrency-sized request window and later windows, verify that the first wave can also benefit from faster draining, and check output ordering, terminal/error propagation, and stage fairness.
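
For point 4, a regression check could split per-request results into the first window and later windows. The sketch below assumes the TTFP values are listed in submission order, so the first max_concurrency entries form the first wave; `split_waves` and `later_wave_inflation` are hypothetical helper names.

```python
from statistics import median
from typing import List, Optional, Tuple


def split_waves(ttfp_ms: List[float], max_concurrency: int) -> Tuple[List[float], List[float]]:
    """Split TTFP values (in submission order) into first and later waves."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    return ttfp_ms[:max_concurrency], ttfp_ms[max_concurrency:]


def later_wave_inflation(ttfp_ms: List[float], max_concurrency: int) -> Optional[float]:
    """Ratio of later-wave median TTFP to first-wave median TTFP.

    Returns None when there is no later wave (num_prompts <= max_concurrency).
    """
    first, later = split_waves(ttfp_ms, max_concurrency)
    if not first or not later:
        return None
    return median(later) / median(first)
```

A test could assert that this ratio stays below a chosen threshold after the change, and separately compare the first-wave median before and after to confirm that the first wave also benefits.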

Your current environment (if you think it is necessary)

Workload

  • Model: Qwen/Qwen3-Omni-30B-A3B-Instruct
  • Input: text
  • Output: text + audio
  • Underlying scope: all requests traversing the multi-stage output polling path
  • Amplification/reproduction condition: max_concurrency > 1 and num_prompts > max_concurrency
  • Primary metric: audio_ttfp
  • Secondary metrics: e2el and output-queue residence time

Current environment and Scripts

# vllm bench serve command
vllm bench serve \
  --omni \
  --dataset-name random-mm \
  --port 8000 \
  --model /data/models/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --request-rate inf \
  --burstiness 1.0 \
  --max-concurrency 2 \
  --num-prompts 16 \
  --ready-check-timeout-sec 600 \
  --random-input-len 32 \
  --random-range-ratio 0.0 \
  --random-mm-base-items-per-request 0 \
  --random-mm-num-mm-items-range-ratio 0 \
  --random-mm-limit-mm-per-prompt '{"image":0,"video":0,"audio":0}' \
  --ignore-eos \
  --random-output-len 256 \
  --extra_body '{"modalities": ["text", "audio"]}'

# deploy: ~/vllm-omni/vllm_omni/deploy/qwen3_omni_moe.yaml
async_chunk: true

connectors:
  connector_of_shared_memory:
    name: SharedMemoryConnector
    extra:
      codec_chunk_frames: 25
      codec_left_context_frames: 25

stages:
  - stage_id: 0
    gpu_memory_utilization: 0.9
    devices: "0"
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      repetition_penalty: 1.05

  - stage_id: 1
    gpu_memory_utilization: 0.6
    devices: "1"
    input_connectors:
      from_stage_0: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.9
      top_k: 50
      max_tokens: 4096
      seed: 42
      repetition_penalty: 1.05

  - stage_id: 2
    gpu_memory_utilization: 0.1
    enforce_eager: true
    async_scheduling: false
    max_num_batched_tokens: 51200
    devices: "1"
    input_connectors:
      from_stage_1: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 65536
      seed: 42
      repetition_penalty: 1.1

platforms:
  npu:
    stages:
      - stage_id: 0
        gpu_memory_utilization: 0.6
        tensor_parallel_size: 2
        max_num_batched_tokens: 8192
        devices: "0,1"
      - stage_id: 1
        gpu_memory_utilization: 0.6
        max_num_batched_tokens: 8192
        devices: "2"
      - stage_id: 2
        gpu_memory_utilization: 0.3
        devices: "2"

  rocm:
    stages:
      - stage_id: 0
        enforce_eager: true

  xpu:
    stages:
      - stage_id: 0
        tensor_parallel_size: 4
        enforce_eager: true
        max_cudagraph_capture_size: 0
        devices: "0,1,2,3"
      - stage_id: 1
        enforce_eager: true
        max_cudagraph_capture_size: 0
        devices: "4"
      - stage_id: 2
        gpu_memory_utilization: 0.3
        max_cudagraph_capture_size: 0
        devices: "4"
