Motivation.
https://arxiv.org/pdf/2509.17765 reports the theoretical performance of Qwen3-Omni. We need to analyze and optimize vllm-omni to achieve comparable performance.
Performance metrics
- End-to-end first-packet latency
- Thinker/Talker TPS
- Generation RTF (time needed to generate one unit of audio / playback duration of that unit, 80 ms)
End-to-end first-packet latency = 72 + 88 + 57 + 14 + 3 = 234ms
RTF = (1000/75 + 1000/140 + 14 + 3)/80 = 0.47
First-packet latency and RTF can be understood as the "TTFT" and "TPOT" for speech.
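The arithmetic above can be reproduced with a short sketch. The function and parameter names are hypothetical. Reading 75 and 140 as the Thinker and Talker TPS, and 14 and 3 as fixed per-frame overheads in milliseconds, is an assumption based on the metric list above, not something this issue states explicitly.

```python
FRAME_MS = 80.0  # playback duration of one audio unit, from the RTF definition


def first_packet_latency_ms(stage_latencies_ms):
    """Sum the per-stage latencies (ms) on the path to the first audio packet."""
    return sum(stage_latencies_ms)


def theoretical_rtf(thinker_tps, talker_tps, per_frame_overheads_ms, frame_ms=FRAME_MS):
    """Time to produce one audio frame divided by the frame's playback duration.

    Assumption: one Thinker token and one Talker token are needed per frame,
    so each contributes 1000 / TPS milliseconds.
    """
    per_token_ms = 1000.0 / thinker_tps + 1000.0 / talker_tps
    return (per_token_ms + sum(per_frame_overheads_ms)) / frame_ms


latency = first_packet_latency_ms([72, 88, 57, 14, 3])
rtf = theoretical_rtf(75, 140, [14, 3])
print(f"first-packet latency: {latency} ms")
print(f"RTF: {rtf:.2f}")
```

An RTF below 1 means audio is generated faster than it is played back.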
How to get metrics from vllm-omni
Benchmark: run vllm-omni/benchmarks/qwen3-omni/vllm_omni/eval_qwen3_moe_omni.sh and read the summary metrics from the log.
End-to-end first-packet latency: since streaming output is not yet supported, we can set the Thinker's max output length to 1 to estimate it approximately.
RTF: once streaming audio output is supported, it can be read from the logged metrics.
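Until the RTF appears in the logs, it could be computed by hand from a wall-clock measurement and the length of the produced audio. This is a minimal sketch; `measured_rtf` is a hypothetical helper and the 24 kHz sample rate in the example is an assumed value, not taken from vllm-omni.

```python
def measured_rtf(generation_seconds, num_samples, sample_rate_hz):
    """Return generation time divided by the playback duration of the audio."""
    if num_samples <= 0 or sample_rate_hz <= 0:
        raise ValueError("num_samples and sample_rate_hz must be positive")
    audio_seconds = num_samples / sample_rate_hz
    return generation_seconds / audio_seconds


# Example with made-up numbers: 2 s of audio at an assumed 24 kHz sample rate,
# generated in 0.94 s of wall-clock time.
example_rtf = measured_rtf(0.94, num_samples=48_000, sample_rate_hz=24_000)
print(f"measured RTF: {example_rtf:.2f}")
```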
How to analyze
If performance does not meet expectations, we can set VLLM_TORCH_PROFILER_DIR to enable further analysis.
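One way to set the variable is to prepare an environment for the benchmark process, sketched below. `profiler_env` and the trace directory name are hypothetical, and it is an assumption that the benchmark script picks up VLLM_TORCH_PROFILER_DIR from its environment the way this issue suggests.

```python
import os
import tempfile


def profiler_env(trace_dir, base_env=None):
    """Return a copy of the environment with VLLM_TORCH_PROFILER_DIR set.

    The trace directory is created if it does not exist yet.
    """
    os.makedirs(trace_dir, exist_ok=True)
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_TORCH_PROFILER_DIR"] = os.path.abspath(trace_dir)
    return env


# Hypothetical trace location under the system temp directory.
env = profiler_env(os.path.join(tempfile.gettempdir(), "vllm_omni_traces"))
print(env["VLLM_TORCH_PROFILER_DIR"])
```

The returned dictionary can then be passed as `env=` to `subprocess.run` when launching the benchmark script.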
Scenario
- AudioVisual Video to Text (dataset: WorldSense)
- Text to speech (dataset: SEED)
- AudioVisual Video to speech (dataset: WorldSense)
The datasets can be selected according to https://arxiv.org/pdf/2509.17765
Proposed Change.
Todo:
Currently we set batch size = 1 and enable CUDA graph for the Thinker.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...