[Breakable Graph] add eager break for unified attention by zhenwei-intel · Pull Request #44275 · vllm-project/vllm

zhenwei-intel · 2026-06-02T02:42:19Z

Purpose

Add eager-break support for unified attention in breakable CUDA graph. Follow-up to #42304.

Test Plan

Online benchmark on 4090D

Config: 1024 input tokens, 256 output tokens, concurrency of 16

--- Performance Results ---
Model                          Mode                      req/s  out_tok/s    ttft_ms    tpot_ms
--------------------------------------------------------------------------------------------
llama3-8b                      eager                      1.20     613.32     661.02      22.64
llama3-8b                      compile-only               1.21     620.80     664.49      22.35
llama3-8b                      full_and_piecewise         1.23     628.18     660.44      22.09
llama3-8b                      breakable-cudagraph        1.21     621.24     662.75      22.34
llama3-8b-tp4                  eager                      2.31    1184.53     517.26      11.33
llama3-8b-tp4                  compile-only               2.41    1235.01     521.39      10.83
llama3-8b-tp4                  full_and_piecewise         2.74    1401.45     548.74       9.56
llama3-8b-tp4                  breakable-cudagraph        2.72    1394.69     541.41       9.62
qwen3-30b-a3b                  eager                      0.78     399.77     410.93      35.33
qwen3-30b-a3b                  compile-only               0.75     384.61     421.48      36.71
qwen3-30b-a3b                  full_and_piecewise         1.79     914.67     484.59      15.33
qwen3-30b-a3b                  breakable-cudagraph        1.84     944.29     511.77      14.93
deepseek-v2-lite               eager                      0.79     406.86     300.64      34.73
deepseek-v2-lite               compile-only               0.74     379.10     328.31      37.32
deepseek-v2-lite               full_and_piecewise         1.68     859.28     365.88      16.87
deepseek-v2-lite               breakable-cudagraph        1.69     864.51     297.19      16.90

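Breakable CUDA graph achieves parity with full_and_piecewise CUDA graph mode: output throughput (out_tok/s) and TPOT are within about 3% on all four models. TTFT varies more between the two modes.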
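As a quick check on the comparison above, the sketch below recomputes the relative difference between the breakable-cudagraph and full_and_piecewise rows. The numbers are copied from the results table; the names `results`, `pct_diff`, and `compare` are illustrative helpers and not part of vLLM.

```python
# Rows copied from the results table: (out_tok/s, ttft_ms, tpot_ms).
results = {
    "llama3-8b": {
        "full_and_piecewise": (628.18, 660.44, 22.09),
        "breakable-cudagraph": (621.24, 662.75, 22.34),
    },
    "llama3-8b-tp4": {
        "full_and_piecewise": (1401.45, 548.74, 9.56),
        "breakable-cudagraph": (1394.69, 541.41, 9.62),
    },
    "qwen3-30b-a3b": {
        "full_and_piecewise": (914.67, 484.59, 15.33),
        "breakable-cudagraph": (944.29, 511.77, 14.93),
    },
    "deepseek-v2-lite": {
        "full_and_piecewise": (859.28, 365.88, 16.87),
        "breakable-cudagraph": (864.51, 297.19, 16.90),
    },
}


def pct_diff(new: float, base: float) -> float:
    """Relative difference of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


def compare(table: dict) -> dict:
    """Per-model percent deltas of breakable-cudagraph vs full_and_piecewise."""
    deltas = {}
    for model, modes in table.items():
        base = modes["full_and_piecewise"]
        new = modes["breakable-cudagraph"]
        deltas[model] = tuple(pct_diff(n, b) for n, b in zip(new, base))
    return deltas


deltas = compare(results)
for model, (tput, ttft, tpot) in deltas.items():
    print(f"{model:<20} out_tok/s {tput:+.1f}%  ttft {ttft:+.1f}%  tpot {tpot:+.1f}%")
```

Positive out_tok/s deltas favor breakable-cudagraph, while positive ttft and tpot deltas favor full_and_piecewise, since those two columns are latencies.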

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

zhenwei-intel · 2026-06-02T05:40:25Z

cc @ZJY0516 @WoosukKwon

[Breakable Graph] add eager break for unified attention

1791dbd

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

zhenwei-intel marked this pull request as ready for review June 2, 2026 05:40

zhenwei-intel requested review from LucasWilkinson and MatthewBonanni as code owners June 2, 2026 05:40

zhenwei-intel wants to merge 1 commit into vllm-project:main from zhenwei-intel:bg_attn
