[Breakable Graph] add eager break for unified attention #44275

Open
zhenwei-intel wants to merge 1 commit into vllm-project:main from zhenwei-intel:bg_attn
@zhenwei-intel zhenwei-intel commented Jun 2, 2026

Contributor

Purpose

Add eager-break support for unified attention in breakable CUDA graph mode. This is a follow-up to #42304.

Test Plan

Online benchmark on 4090D

Config: 1024 input tokens, 256 output tokens, 16 concurrency

--- Performance Results ---
Model                          Mode                      req/s  out_tok/s    ttft_ms    tpot_ms
--------------------------------------------------------------------------------------------
llama3-8b                      eager                      1.20     613.32     661.02      22.64
llama3-8b                      compile-only               1.21     620.80     664.49      22.35
llama3-8b                      full_and_piecewise         1.23     628.18     660.44      22.09
llama3-8b                      breakable-cudagraph        1.21     621.24     662.75      22.34
llama3-8b-tp4                  eager                      2.31    1184.53     517.26      11.33
llama3-8b-tp4                  compile-only               2.41    1235.01     521.39      10.83
llama3-8b-tp4                  full_and_piecewise         2.74    1401.45     548.74       9.56
llama3-8b-tp4                  breakable-cudagraph        2.72    1394.69     541.41       9.62
qwen3-30b-a3b                  eager                      0.78     399.77     410.93      35.33
qwen3-30b-a3b                  compile-only               0.75     384.61     421.48      36.71
qwen3-30b-a3b                  full_and_piecewise         1.79     914.67     484.59      15.33
qwen3-30b-a3b                  breakable-cudagraph        1.84     944.29     511.77      14.93
deepseek-v2-lite               eager                      0.79     406.86     300.64      34.73
deepseek-v2-lite               compile-only               0.74     379.10     328.31      37.32
deepseek-v2-lite               full_and_piecewise         1.68     859.28     365.88      16.87
deepseek-v2-lite               breakable-cudagraph        1.69     864.51     297.19      16.90

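Breakable CUDA graph reaches parity with full_and_piecewise CUDA graph: across the four configurations, output throughput is within about -1% to +3% of full_and_piecewise, and TPOT differs by at most 0.4 ms.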
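
The parity claim can be checked directly from the table. The following is a minimal sketch, not part of the PR: the `RESULTS` dictionary and the helper names `relative_delta` and `compare` are hypothetical, and the numbers are copied by hand from the out_tok/s and tpot_ms columns above.

```python
# Sketch only: values copied from the results table (out_tok/s, tpot_ms).
# RESULTS, relative_delta and compare are illustrative names, not vLLM APIs.
RESULTS = {
    "llama3-8b": {
        "full_and_piecewise": (628.18, 22.09),
        "breakable-cudagraph": (621.24, 22.34),
    },
    "llama3-8b-tp4": {
        "full_and_piecewise": (1401.45, 9.56),
        "breakable-cudagraph": (1394.69, 9.62),
    },
    "qwen3-30b-a3b": {
        "full_and_piecewise": (914.67, 15.33),
        "breakable-cudagraph": (944.29, 14.93),
    },
    "deepseek-v2-lite": {
        "full_and_piecewise": (859.28, 16.87),
        "breakable-cudagraph": (864.51, 16.90),
    },
}


def relative_delta(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


def compare(results: dict) -> dict:
    """Per model: (throughput change in %, TPOT change in ms) vs. full_and_piecewise."""
    rows = {}
    for model, modes in results.items():
        base_tput, base_tpot = modes["full_and_piecewise"]
        bg_tput, bg_tpot = modes["breakable-cudagraph"]
        rows[model] = (relative_delta(bg_tput, base_tput), bg_tpot - base_tpot)
    return rows


if __name__ == "__main__":
    for model, (tput_pct, tpot_ms) in compare(RESULTS).items():
        print(f"{model:<20} out_tok/s {tput_pct:+.2f}%   tpot {tpot_ms:+.2f} ms")
```

A positive throughput delta or a negative TPOT delta favors breakable-cudagraph. These are single-run numbers, so differences of around 1% may be within run-to-run noise.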

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
@zhenwei-intel

Contributor Author

cc @ZJY0516 @WoosukKwon

@zhenwei-intel zhenwei-intel marked this pull request as ready for review June 2, 2026 05:40