[RFC]: Qwen3-omni performance analyze #696

@Bounty-hunter

Motivation.

https://arxiv.org/pdf/2509.17765 provides the theoretical performance of Qwen3-omni. We need to analyze and optimize vllm-omni to achieve comparable performance.

Performance metrics

  • End-to-end first-packet latency
  • Thinker/Talker TPS
  • Generation RTF (time to generate one unit of audio / playback duration of one unit of audio, 80 ms)

End-to-end first-packet latency = 72 + 88 + 57 + 14 + 3 = 234 ms
RTF = (1000/75 + 1000/140 + 14 + 3) / 80 ≈ 0.47

These two metrics can be understood as the "TTFT" and "TPOT" for speech.
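The arithmetic above can be reproduced with a small sketch. The function and parameter names are illustrative assumptions, and the mapping of each number to a pipeline stage (Thinker/Talker TPS of 75 and 140, per-frame costs of 14 ms and 3 ms, an 80 ms audio frame) is my reading of the formulas rather than something stated in this issue.

```python
def first_packet_latency_ms(stage_latencies_ms):
    """Sum the per-stage latencies (ms) on the path to the first audio packet."""
    return sum(stage_latencies_ms)


def generation_rtf(thinker_tps, talker_tps, extra_ms_per_frame, frame_ms=80.0):
    """Real-time factor: time spent producing one audio frame / its playback time.

    thinker_tps, talker_tps: tokens per second of each stage (assumed meaning).
    extra_ms_per_frame: fixed per-frame costs in ms (assumed meaning).
    frame_ms: playback duration of one audio frame in ms.
    """
    per_frame_ms = 1000.0 / thinker_tps + 1000.0 / talker_tps + sum(extra_ms_per_frame)
    return per_frame_ms / frame_ms


latency = first_packet_latency_ms([72, 88, 57, 14, 3])
rtf = generation_rtf(thinker_tps=75, talker_tps=140, extra_ms_per_frame=[14, 3])
print(f"first-packet latency: {latency} ms, RTF: {rtf:.2f}")
```

An RTF below 1 means audio is generated faster than it is played back, so the same helper can be fed measured values from vllm-omni logs to compare against the paper's figures.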

How to get metrics from vllm-omni

Benchmark: run vllm-omni/benchmarks/qwen3-omni/vllm_omni/eval_qwen3_moe_omni.sh and read the summary metrics from the log.

End-to-end first-packet latency: Since streaming output is not currently supported, we can set the Thinker's max output length to 1 to estimate it approximately.

RTF: Once streaming audio output is supported, we can read it from the logged metrics.

How to analyze

If the performance does not meet expectations, we can set VLLM_TORCH_PROFILER_DIR to enable profiling for further analysis.
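As a minimal sketch of that step, the helper below builds an environment with VLLM_TORCH_PROFILER_DIR pointing at a trace directory. The helper name `profiler_env` and the directory layout are hypothetical; whether the benchmark script starts and stops the profiler on its own is not stated here, so treat this as a starting point only.

```python
import os


def profiler_env(trace_dir, base_env=None):
    """Return a copy of the environment with VLLM_TORCH_PROFILER_DIR set.

    trace_dir: directory where profiler traces should be written (created if missing).
    base_env: environment to extend; defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    os.makedirs(trace_dir, exist_ok=True)
    env["VLLM_TORCH_PROFILER_DIR"] = os.path.abspath(trace_dir)
    return env


# Example (not executed here): pass the environment to the benchmark launcher.
# import subprocess
# subprocess.run(["bash", "<path to eval_qwen3_moe_omni.sh>"],
#                env=profiler_env("./traces"), check=True)
```

Returning a copy instead of mutating `os.environ` keeps profiled and unprofiled runs easy to compare from the same driver process.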

Scenario

  • AudioVisual Video to Text (dataset: WorldSense)
  • Text to speech (dataset: SEED)
  • AudioVisual Video to speech (dataset: WorldSense)

The datasets can be selected according to https://arxiv.org/pdf/2509.17765

Proposed Change.

Todo:
Currently we set batch size = 1 and enable CUDA graphs for the Thinker.

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response
