[Performance]: Head-of-line blocking in stage output polling inflates audio TTFP and E2E under queued concurrency

### Proposal to improve performance

This issue is **distinct from #1238**. #1238 reported an aggregate benchmark-level increase in Qwen3-Omni audio TTFP when `max_concurrency` changed from 1 to 2, but the behavior was not consistently reproducible and no queue-level mechanism was established. Here, request-wave and queue-level profiling isolates a generic Orchestrator service-rate problem: each polling round consumes at most one raw output per stage replica and may wait on an empty stage, while downstream queues continue to accumulate non-user-visible raw outputs. The behavior affects every request; `num_prompts > max_concurrency` merely exposes it more clearly because later request waves inherit a deeper backlog. This proposal therefore focuses on stage-output retrieval and routing, rather than Code2Wav computation, batching, or cross-stage chunk transfer.

### Expected behavior

After stage-2 produces the first audio packet for a request, the Orchestrator should retrieve and route it promptly. Later request windows may incur legitimate admission or compute queueing, but should not incur large additional delay after the packet is already ready.

### Actual behavior

For any request, the first audio packet can remain behind earlier non-user-visible raw outputs when the Orchestrator drain rate is lower than the downstream production rate. The effect is usually modest in the first concurrency-sized wave, but becomes high and unstable in later waves as queue backlog accumulates.

### Proposed change

The minimal prototype changes the following locations:

1. **vLLM: add a non-blocking dequeue API**

   Add a generic API such as `get_output_nowait()` or `try_get_output()` that:

   - returns immediately when the output queue is empty;
   - preserves FIFO ordering;
   - propagates queued exceptions consistently;
   - leaves the existing blocking `get_output_async()` semantics unchanged.

2. **vLLM-Omni: perform non-blocking, bounded draining in `StagePool`**

   Update `_poll_stage_raw()` / `poll_llm_raw_output()` so that one polling pass:

   - returns immediately when no raw output is ready;
   - drains already-ready raw outputs until the first routable request output is produced, the queue becomes empty, or a bounded item/time budget is reached;
   - processes statistics, errors, terminal events, KV-ready metadata, and streaming state transitions encountered during draining.

Conceptually:

```text
empty queue -> return immediately
[N N N A ...] -> process N entries and return A in the same bounded polling pass
```

The drain must be bounded to preserve fairness across stages. Reducing the current timeout alone is insufficient because it does not change the one-item-per-stage-per-round drain rate.
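The two changes above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the actual vLLM or vLLM-Omni code: a stdlib `queue.Queue` stands in for the stage output queue, and `try_get_output`, `drain_until_routable`, `is_routable`, `on_side_effects`, `max_items`, and `max_seconds` are hypothetical names chosen for this example.

```python
import queue
import time
from typing import Any, Callable, Optional


def try_get_output(q: "queue.Queue[Any]") -> Optional[Any]:
    """Hypothetical non-blocking dequeue.

    Returns None immediately when the queue is empty, keeps FIFO order,
    and re-raises an exception object that was queued by the producer.
    """
    try:
        item = q.get_nowait()
    except queue.Empty:
        return None
    if isinstance(item, Exception):
        raise item
    return item


def drain_until_routable(
    q: "queue.Queue[Any]",
    is_routable: Callable[[Any], bool],
    on_side_effects: Callable[[Any], None],
    max_items: int = 32,
    max_seconds: float = 0.002,
) -> Optional[Any]:
    """One bounded polling pass over a single stage queue.

    Stops at the first routable output, when the queue is empty, or when
    the item or time budget is used up. Every dequeued entry is passed to
    on_side_effects first, so statistics and control metadata are
    processed rather than discarded.
    """
    deadline = time.monotonic() + max_seconds
    for _ in range(max_items):
        raw = try_get_output(q)
        if raw is None:
            return None  # nothing ready: return without waiting
        on_side_effects(raw)
        if is_routable(raw):
            return raw
        if time.monotonic() >= deadline:
            break  # time budget exhausted: yield to other stages
    return None
```

With a queue holding `[N, N, N, A, N]`, one pass processes the three `N` entries, returns `A`, and leaves the trailing `N` for the next pass. The budget values shown are placeholders; suitable limits would need to be tuned against stage fairness.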

### Report of performance regression

### Observed pattern

When serving `Qwen/Qwen3-Omni-30B-A3B-Instruct` with text input and streaming text+audio output, we observe:

- The same stage-output polling path is used for every request, so the underlying drain-rate limitation is not restricted to requests beyond `max_concurrency`.
- With `max_concurrency == 1`, the queue usually does not build deeply enough for the resulting delay to become prominent.
- With `max_concurrency > 1` and `num_prompts == max_concurrency`, only the first request wave is present. These requests may still incur output-queue polling delay, but the backlog is generally shallower and the impact is less visible.
- With `max_concurrency > 1` and `num_prompts > max_concurrency`, later request waves inherit accumulated downstream queue backlog and therefore show much higher and more variable `AUDIO_TTFP`.

Therefore, `num_prompts > max_concurrency` is an observation and amplification condition, not a necessary trigger or the root cause. Although discovered with Qwen3-Omni, the underlying behavior is in the generic multi-stage output polling path and may also affect **E2E** and **tail latency** in other pipelines with imbalanced stage-output rates.

The benchmark sends 16 requests at concurrency levels of 1, 4, 8, 12, and 16. As the figure below shows, TTFP grows as expected when concurrency increases, but it spikes sharply for requests beyond the first concurrency-sized wave.

<img width="709" height="418" alt="Image" src="https://github.com/user-attachments/assets/b85e24f8-b5b2-452b-868f-6bd76f305c87" />

Taking the case with a maximum concurrency of 4 and 16 requests as an example, the TTFP of the first wave of 4 concurrent requests stays within a small range. The TTFP of the remaining 12 requests first rises and then falls, but remains significantly higher than that of the first wave.

<img width="748" height="417" alt="Image" src="https://github.com/user-attachments/assets/0931e8b1-6d61-4c34-8bf8-f3bf9a755b27" />

### Root cause analysis

The Orchestrator polls stage replicas sequentially and consumes at most one raw `EngineCoreOutputs` item from each replica in one orchestration round. If an upstream stage has no ready output, its poll can wait until the per-stage timeout expires, even when a downstream stage already has a backlog.

The downstream queue may contain many raw outputs that do not produce a frontend-visible text or audio payload after vLLM-Omni output processing. These outputs are *non-user-visible* for that iteration, but they may still carry statistics or control metadata and must not be silently discarded.

For the Qwen3-Omni audio path, the queue can conceptually become:

```text
stage-2 output queue:
[N N N N A N N ...]
         ^
         first frontend-visible audio packet

N: raw output without a frontend-visible payload
A: first frontend-visible audio packet
```

If stage-2 produces raw outputs faster than the Orchestrator drains them, the first audio packet may already be enqueued but remain behind earlier entries. Its additional latency is therefore queue residence time rather than model execution time:

```text
first audio produced
    -> waits in stage-2 output queue
    -> dequeued by Orchestrator
    -> routed to frontend
```

The mechanism can affect every request. Its severity depends on the queue depth and Orchestrator round duration when the first audio packet is inserted: the initial request wave typically encounters a shallow queue, whereas later waves inherit accumulated backlog and therefore exhibit larger and more variable TTFP.
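The dependence on queue depth and round duration can be made concrete with a toy model. The sketch below is an idealized illustration, not a measurement: it assumes fixed per-round production and drain counts and a constant round duration, and the function names are hypothetical.

```python
from typing import List


def backlog_over_rounds(
    produced_per_round: int, drained_per_round: int, rounds: int
) -> List[int]:
    """Queue depth at the end of each orchestration round.

    Each round the stage enqueues produced_per_round raw outputs and the
    Orchestrator dequeues at most drained_per_round of them.
    """
    depth = 0
    history = []
    for _ in range(rounds):
        depth += produced_per_round
        depth = max(0, depth - drained_per_round)
        history.append(depth)
    return history


def residence_seconds(
    entries_ahead: int, round_seconds: float, drained_per_round: int = 1
) -> float:
    """Approximate wait of the first audio packet before its own round.

    With one item drained per round, a packet with k entries ahead of it
    waits for k full rounds before it is dequeued.
    """
    full_rounds_waited = entries_ahead // drained_per_round
    return full_rounds_waited * round_seconds
```

In this model, producing three entries per round while draining one grows the backlog by two entries every round, and a packet with four entries ahead of it waits four full rounds. Raising the per-round drain count shortens the wait proportionally, which is why reducing the timeout alone does not address the backlog.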

### Reproduction and results

The chart below compares results before and after changing how the Orchestrator polls stage outputs. The test load used a maximum concurrency of 4 and 16 requests.

The previously high TTFP was caused primarily by delayed retrieval of the first valid audio packet, not by slow audio computation. Empty or ineffective outputs accumulated in the stage output queue, and the Orchestrator drained them too slowly. By draining past empty audio packets within the same polling pass and returning only effective outputs, the new polling logic removes this queue-induced head-of-line blocking.

<img width="1197" height="510" alt="Image" src="https://github.com/user-attachments/assets/7f5da618-3db0-4b52-83de-8d37572aa033" />

```text
============ Serving Benchmark Result ============
Successful requests:                     12        
Failed requests:                         0         
Maximum request concurrency:             4         
Benchmark duration (s):                  84.83     
Request throughput (req/s):              0.14      
Peak concurrent requests:                6.00      
================== Text Result ===================
Total input tokens:                      552       
Total generated tokens:                  2585      
Output token throughput (tok/s):         30.47     
Peak output token throughput (tok/s):    272.00    
Peak concurrent requests:                6.00      
Total Token throughput (tok/s):          36.98     
---------------Time to First Token----------------
Mean TTFT (ms):                          192.09    
Median TTFT (ms):                        181.19    
P99 TTFT (ms):                           357.88    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.12     
Median TPOT (ms):                        15.72     
P99 TPOT (ms):                           28.01     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.25     
Median ITL (ms):                         9.76      
P99 ITL (ms):                            187.52    
================== Audio Result ==================
Total audio duration generated(s):       1035.67   
Total audio frames generated:            24856020  
Audio throughput(audio duration/s):      12.21     
============= Omni Pipeline Metrics ==============
Mean TTFA (ms):                          1268.81   
Median TTFA (ms):                        1204.98   
Mean Text-Audio Gap (ms):                1076.72   
Median Text-Audio Gap (ms):              1066.77   
Mean Audio E2EL (ms):                    24496.26  
Median Audio E2EL (ms):                  25202.14  
Mean Audio ITL (ms):                     539.13    
Median Audio ITL (ms):                   462.84    
Mean Pipeline Ratio (TTFT/TTFA):         0.15      
==================================================
```

Queue-level evidence should include the first-audio production, Orchestrator dequeue, and frontend-send timestamps, together with stage queue depth and the number of entries ahead of the first audio packet.
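One possible shape for such a per-request record is sketched below. The class and field names are hypothetical and only illustrate which quantities separate queue residence time from routing time; they do not correspond to existing vLLM-Omni instrumentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstAudioTrace:
    """Hypothetical per-request trace of the first audio packet."""

    request_id: str
    produced_at: float  # stage-2 enqueued the first audio packet (seconds)
    dequeued_at: float  # Orchestrator dequeued that packet (seconds)
    sent_at: float  # packet was sent to the frontend (seconds)
    queue_depth_at_enqueue: int  # stage-2 output queue depth at enqueue
    entries_ahead: int  # raw outputs queued ahead of the packet

    @property
    def queue_residence(self) -> float:
        """Time the packet spent waiting in the stage output queue."""
        return self.dequeued_at - self.produced_at

    @property
    def routing_delay(self) -> float:
        """Time between dequeue and delivery to the frontend."""
        return self.sent_at - self.dequeued_at
```

A large `queue_residence` combined with a small `routing_delay` would point at the drain rate rather than at output processing or model execution.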

Please see related prior performance discussions and optimizations:

- #4469 — the closest Orchestrator-level discussion: it also identifies head-of-line blocking and output-queue backlog under high concurrency, but its direct cause is expensive synchronous output processing and redundant hidden-state forwarding rather than blocking single-item polling and an insufficient queue-drain rate.
- #1238 — a historical benchmark-level observation of elevated Qwen3-Omni audio TTFP after increasing `max_concurrency` from 1 to 2. Its follow-up results changed substantially across later tests, and it did not isolate the request-wave pattern or queue-consumption mechanism discussed here. It is included only for symptom-level context, not because this issue is a duplicate or continuation of it.
- #934 and #1211 — were proposed in the discussion of #1238 to improve asynchronous connector I/O and Code2Wav batching. They target cross-stage transfer or model-stage computation, whereas this issue occurs after raw outputs have already been produced and queued.
- #268 and PR #951 — optimize async-chunk production and cross-stage communication.
- PR #1656 — reduces inter-packet latency in the async-chunk transfer path.
- #696, #1191, and #2207 — provide broader first-packet metrics and Qwen-Omni performance roadmaps.


### Misc discussion on performance

This proposal is complementary to async-chunk, batching, and inter-packet-latency optimizations. Those efforts primarily improve stage computation and cross-stage transfer; this issue focuses on consuming and routing outputs that have already been produced.

Open design points:

1. Upstream vLLM should preferably expose only a generic non-blocking dequeue API; Omni-specific routability should remain in `StagePool`.
2. A routable-output predicate should be based on processed request outputs, not hard-coded to `pooling_output` or a model-specific audio field.
3. Draining should use an item or elapsed-time budget to prevent a hot stage from starving other stages.
4. The regression test should measure both the first concurrency-sized request window and later windows, verify that the first wave can also benefit from faster draining, and check output ordering, terminal/error propagation, and stage fairness.
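For the wave comparison in point 4, a helper along the following lines could summarize per-request results. It is a sketch that assumes per-request TTFP values are available in submission order and treats the first `max_concurrency` requests as the first wave; `summarize_waves` is a hypothetical name, not part of the benchmark tooling.

```python
from statistics import median
from typing import Dict, Optional, Sequence


def summarize_waves(
    ttfp_ms: Sequence[float], max_concurrency: int
) -> Dict[str, Optional[float]]:
    """Median TTFP for the first concurrency-sized wave and for the rest.

    ttfp_ms must be ordered by request submission. The later-wave median
    is None when num_prompts does not exceed max_concurrency.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    first = list(ttfp_ms[:max_concurrency])
    later = list(ttfp_ms[max_concurrency:])
    return {
        "first_wave_median_ms": median(first) if first else None,
        "later_waves_median_ms": median(later) if later else None,
    }
```

A regression test could then compare both medians before and after the change, so that an improvement limited to later waves is distinguishable from one that also benefits the first wave.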

### Your current environment (if you think it is necessary)

### Workload

- Model: `Qwen/Qwen3-Omni-30B-A3B-Instruct`
- Input: text
- Output: text + audio
- Underlying scope: all requests traversing the multi-stage output polling path
- Amplification/reproduction condition: `max_concurrency > 1` and `num_prompts > max_concurrency`
- Primary metric: `audio_ttfp`
- Secondary metrics: `e2el` and output-queue residence time

### Current environment and Scripts

```bash
# vllm bench serve command
vllm bench serve \
  --omni \
  --dataset-name random-mm \
  --port 8000 \
  --model /data/models/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --request-rate inf \
  --burstiness 1.0 \
  --max-concurrency 2 \
  --num-prompts 16 \
  --ready-check-timeout-sec 600 \
  --random-input-len 32 \
  --random-range-ratio 0.0 \
  --random-mm-base-items-per-request 0 \
  --random-mm-num-mm-items-range-ratio 0 \
  --random-mm-limit-mm-per-prompt '{"image":0,"video":0,"audio":0}' \
  --ignore-eos \
  --random-output-len 256 \
  --extra_body '{"modalities": ["text", "audio"]}'
```

```yaml
# deploy:~/vllm-omni/vllm_omni/deploy/qwen3_omni_moe.yaml
async_chunk: true

connectors:
  connector_of_shared_memory:
    name: SharedMemoryConnector
    extra:
      codec_chunk_frames: 25
      codec_left_context_frames: 25

stages:
  - stage_id: 0
    gpu_memory_utilization: 0.9
    devices: "0"
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      repetition_penalty: 1.05

  - stage_id: 1
    gpu_memory_utilization: 0.6
    devices: "1"
    input_connectors:
      from_stage_0: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.9
      top_k: 50
      max_tokens: 4096
      seed: 42
      repetition_penalty: 1.05

  - stage_id: 2
    gpu_memory_utilization: 0.1
    enforce_eager: true
    async_scheduling: false
    max_num_batched_tokens: 51200
    devices: "1"
    input_connectors:
      from_stage_1: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 65536
      seed: 42
      repetition_penalty: 1.1

platforms:
  npu:
    stages:
      - stage_id: 0
        gpu_memory_utilization: 0.6
        tensor_parallel_size: 2
        max_num_batched_tokens: 8192
        devices: "0,1"
      - stage_id: 1
        gpu_memory_utilization: 0.6
        max_num_batched_tokens: 8192
        devices: "2"
      - stage_id: 2
        gpu_memory_utilization: 0.3
        devices: "2"

  rocm:
    stages:
      - stage_id: 0
        enforce_eager: true

  xpu:
    stages:
      - stage_id: 0
        tensor_parallel_size: 4
        enforce_eager: true
        max_cudagraph_capture_size: 0
        devices: "0,1,2,3"
      - stage_id: 1
        enforce_eager: true
        max_cudagraph_capture_size: 0
        devices: "4"
      - stage_id: 2
        gpu_memory_utilization: 0.3
        max_cudagraph_capture_size: 0
        devices: "4"
```

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.
