[DSv4] Adding TRTLLM gen attention kernel by zyongye · Pull Request #43827 · vllm-project/vllm

zyongye · 2026-05-28T02:49:59Z

Summary

Rebase of @PerkzZheng's #42316 onto current main, plus a few materially
new pieces:

- Once-per-step C128A metadata caching: adds FlashInferMLASparseMetadata
  and FlashInferMLASparseMetadataBuilder so that for compress_ratio == 128
  layers the mixed-sparse-index Triton kernel runs once per step instead
  of once per layer. The SWA-baked combine is materialized lazily on first
  access by ensure_sparse_indices and cached on the metadata for the
  remaining C128A layers in the same step. C4 / SWA-only paths are
  unchanged (they read layer.topk_indices_buffer, which is per-layer).
- Single-call launcher: collapses the previous decode+prefill split
  into one flashinfer_trtllm_batch_decode_sparse_mla_dsv4_raw call over
  the mixed batch.
- FP8 scalar-vs-tensor scale dispatch: documents and consolidates the
  precomputed Python-float bmm1_scale / bmm2_scale derived from the
  per-tensor placeholders. The TRTLLM-GEN sparse-MLA launcher takes
  different C++ code paths for scalar vs 1-elem-tensor scale args, so this
  matters for correctness.
- CuTeDSL compressor: writes contiguous BF16 / per-tensor FP8 cache,
  matching what the FlashInfer V4 backend reads (no UE8M0 padding).
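The once-per-step caching described above can be illustrated with a minimal sketch. This is not the PR's implementation: only the method name ensure_sparse_indices comes from the summary, while the class StepSparseMetadata, the build_indices callback, and the run_step helper are hypothetical stand-ins that use plain Python lists in place of GPU tensors.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StepSparseMetadata:
    """Hypothetical stand-in for metadata shared by all C128A layers in a step."""

    # Expensive builder, standing in for the mixed-sparse-index kernel launch.
    build_indices: Callable[[], list[int]]
    _cached: Optional[list[int]] = field(default=None, repr=False)
    build_calls: int = 0

    def ensure_sparse_indices(self) -> list[int]:
        # Materialize on first access, then reuse the result for later layers.
        if self._cached is None:
            self._cached = self.build_indices()
            self.build_calls += 1
        return self._cached


def run_step(num_layers: int) -> int:
    """Simulate one step and return how many times the builder ran."""
    meta = StepSparseMetadata(build_indices=lambda: [0, 4, 8, 12])
    for _ in range(num_layers):
        indices = meta.ensure_sparse_indices()
        assert indices == [0, 4, 8, 12]
    return meta.build_calls
```

Because a fresh metadata object is created per step in this sketch, the cache is naturally invalidated between steps; how the real builder scopes its lifetime is not shown in this thread.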

Relation to #42316

This branch is a rebase of #42316; the C128A metadata caching, single-call
launcher collapse, and FP8 scale dispatch documentation are the new pieces.
Opening as a separate PR rather than pushing into #42316 because the rebase
plus the new caching work changes a non-trivial part of the surface; happy to fold
into #42316 if @PerkzZheng prefers.

End-to-end eval

deepseek-ai/DeepSeek-V4-Flash, 4× GB200, TP=4, V4_FLASHINFER_MLA_SPARSE
backend, --kv-cache-dtype fp8 --block-size 256,
cudagraph_mode=FULL_DECODE_ONLY. vLLM commit ce4e168ba (branch tip
before today's mechanical rebase onto main; rebase only resolved stale
torch-stable-ABI decls in csrc/ops.h / csrc/torch_bindings.cpp):

| Task | Setting | n | Score |
| ---- | ------- | - | ----- |
| GSM8K | 5-shot, T=0, completions | 1319 | 0.9538 strict / 0.9530 flexible |
| GPQA-Diamond | 0-shot, T=1.0, top_p=0.95, thinking=on, 4 epochs | 792 | 0.8586 |
| AIME25 | 0-shot, T=1.0, top_p=0.95, thinking=on, 4 epochs | 120 | 0.9750 |

Test plan

- pre-commit run --files vllm/models/deepseek_v4/nvidia/flashinfer_sparse.py (passed, including mypy)
- tests/kernels/test_compressor_kv_cache.py: 36/36 passed
- End-to-end eval on DSv4-Flash (table above)
- Profile to confirm that the build_flashinfer_mixed_sparse_indices Triton kernel
  now appears once per step instead of N times per step.

AI-assisted disclosure

Developed with AI assistance (Claude Code). Per AGENTS.md, the submitter
has reviewed each changed line.

🤖 Generated with Claude Code

mergify · 2026-05-28T02:50:36Z

Documentation preview: https://vllm--43827.org.readthedocs.build/en/43827/

mergify · 2026-05-28T02:50:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-28T03:30:52Z

Hi @zyongye, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

PerkzZheng

@zyongye Hi Yongye, thanks for rebasing my MR. I have left some comments.

PerkzZheng · 2026-05-29T05:57:27Z

+
+| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Non-Causal | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
+| ------- | ------ | --------- | ----------- | ---------- | ---- | ---------- | ------ | --------- | --- | --------------- | ------------ |
+| `V4_FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto` | 256 | 512 | ❌ | ❌ | ❌ | ❌ | ❌ | Decoder | Any |


Curious what the 256 under Block Sizes means here?

I think it's guidance on what block size the user should specify when launching the engine with this kernel. Currently, since we use a custom KV cache layout, we restrict the block size to 256 by passing --block-size 256. I need to look into whether this is really necessary.

Thanks. For your information, it should then be 512 for FlashInfer sparse MLA.

PerkzZheng · 2026-05-29T05:59:18Z

+        decode_compressed_topk_lens = token_to_req_indices
+
+    padded_topk = max(topk, decode_compressed_topk)
+    padded_topk = (padded_topk + 3) // 4 * 4


FlashInfer MLA kernels require 16B alignment for the top-k indices. Could you add a comment here explaining that? Thanks.
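To make the alignment arithmetic concrete, here is a small sketch of the rounding in the hunk above. It assumes the top-k indices are stored as 4-byte int32 values, which the thread does not state explicitly; under that assumption 16 bytes hold 4 indices, which matches the `(padded_topk + 3) // 4 * 4` expression. The function name pad_topk and both constants are illustrative.

```python
INDEX_BYTES = 4   # assumption: top-k indices are stored as int32
ALIGN_BYTES = 16  # alignment requirement quoted in the review comment


def pad_topk(topk: int, decode_compressed_topk: int) -> int:
    """Round the larger top-k up to a whole number of 16-byte groups."""
    per_group = ALIGN_BYTES // INDEX_BYTES  # 4 int32 indices per 16 bytes
    padded = max(topk, decode_compressed_topk)
    # Ceiling division, then scale back up to a multiple of per_group.
    return (padded + per_group - 1) // per_group * per_group
```

With these constants, every returned value multiplied by INDEX_BYTES is a multiple of 16, so each row of indices starts on a 16-byte boundary when rows are laid out back to back from an aligned base.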

PerkzZheng · 2026-05-29T06:03:55Z

+    # paged path and writes a contiguous 512-wide cache row per token; bf16
+    # vs per-tensor fp8 is selected by ``store_full_fp8`` (with the scale
+    # source supplied via ``fp8_scale``).
+    store_full_bf16: bool = False,


It seems better to rename this to store_full_kv; otherwise it is misleading. Sorry if it was introduced in my commits. We will also need to update the other places that use this term.

PerkzZheng · 2026-05-29T06:09:23Z

+        num_decodes = swa_metadata.num_decodes
+        num_prefills = swa_metadata.num_prefills
+        num_decode_tokens = swa_metadata.num_decode_tokens
+        num_prefill_tokens = swa_metadata.num_prefill_tokens


I split it into two calls (prefill and decode) in my previous MR because we pad gridDim.x to maxSeqLenQ, which can launch too many padding CTAs for mixed requests. I still observe clear perf gains from splitting, even though those padding CTAs are skipped at runtime, which means the CTA switching overhead is not negligible.
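A rough way to see the padding cost is to count padded grid slots under a simplified model in which every request in a launch is padded to the longest query length in that launch. This model is an assumption for illustration only: the real kernel's grid mapping (tile sizes, heads, and so on) is not described in this thread, and padding_ctas and compare_launches are hypothetical helper names.

```python
def padding_ctas(q_lens: list[int]) -> int:
    """Count padded slots when each request is padded to the longest query."""
    if not q_lens:
        return 0
    max_q = max(q_lens)
    return sum(max_q - q for q in q_lens)


def compare_launches(
    decode_q_lens: list[int], prefill_q_lens: list[int]
) -> tuple[int, int]:
    """Return (padding in one mixed launch, padding in split launches)."""
    single = padding_ctas(decode_q_lens + prefill_q_lens)
    split = padding_ctas(decode_q_lens) + padding_ctas(prefill_q_lens)
    return single, split
```

For example, six decode requests with one query token each, mixed with prefills of 512 and 300 tokens, give `compare_launches([1] * 6, [512, 300]) == (3278, 212)` under this model: the single launch pads every decode request up to 512, while the split launches pad only the shorter prefill.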

mergify · 2026-06-02T03:29:30Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

WoosukKwon · 2026-06-02T07:16:03Z

+@dataclass
+class DeepseekV4MLAModules:
+    """Modules used in DeepseekV4 MLA."""
+
+    vllm_config: VllmConfig
+    fused_wqa_wkv: torch.nn.Module
+    q_norm: torch.nn.Module
+    wq_b: torch.nn.Module
+    kv_norm: torch.nn.Module
+    wo_a: torch.nn.Module
+    wo_b: torch.nn.Module
+    attn_sink: torch.nn.Module
+    rotary_emb: torch.nn.Module
+    indexer: torch.nn.Module | None
+    indexer_rotary_emb: torch.nn.Module
+    topk_indices_buffer: torch.Tensor | None
+    aux_stream_list: list[torch.cuda.Stream] | None = None
+
+
+# --8<-- [start:multi_head_latent_attention]
+@PluggableLayer.register("deepseek_v4_multi_head_latent_attention")
+class DeepseekV4MultiHeadLatentAttentionWrapper(PluggableLayer):


Can you remove these classes, as I did in #44246? I think there was an error in resolving the merge conflict.

WoosukKwon · 2026-06-02T07:16:37Z

+        torch.ops.vllm.deepseek_v4_attention(
+            hidden_states,
+            positions,
+            o_padded,
+            self.layer_name,
+        )


This torch op was also removed in the main branch

WoosukKwon · 2026-06-02T07:16:42Z

-            "bhr,hdr->bhd",
-            (o_fp8, o_scale),
-            (wo_a_fp8, wo_a_scale),
+        torch.ops.vllm.deepseek_v4_fp8_einsum(


WoosukKwon · 2026-06-02T07:17:23Z

+def deepseek_v4_attention_fake(
+    hidden_states: torch.Tensor,
+    positions: torch.Tensor,
+    out: torch.Tensor,
+    layer_name: str,
+) -> None:
+    return None
+
+
+direct_register_custom_op(
+    op_name="deepseek_v4_attention",
+    op_func=deepseek_v4_attention,
+    mutates_args=["out"],
+    fake_impl=deepseek_v4_attention_fake,
+)
+
+
+def deepseek_v4_fp8_einsum(
+    a: torch.Tensor,
+    a_scale: torch.Tensor,
+    b: torch.Tensor,
+    b_scale: torch.Tensor,
+    out: torch.Tensor,
+    equation: str,
+    recipe: list[int],
+) -> None:
+    fp8_einsum(equation, (a, a_scale), (b, b_scale), out, recipe=tuple(recipe))
+
+
+def deepseek_v4_fp8_einsum_fake(
+    a: torch.Tensor,
+    a_scale: torch.Tensor,
+    b: torch.Tensor,
+    b_scale: torch.Tensor,
+    out: torch.Tensor,
+    equation: str,
+    recipe: list[int],
+) -> None:
+    return None
+
+
+direct_register_custom_op(
+    op_name="deepseek_v4_fp8_einsum",
+    op_func=deepseek_v4_fp8_einsum,
+    mutates_args=["out"],
+    fake_impl=deepseek_v4_fp8_einsum_fake,
+)


This also needs to be removed

WoosukKwon · 2026-06-02T07:21:42Z

+                                ),
+                                Float32(self.fp8_max),
+                            )
+                            y1 = cute.arch.fmin(


Should we avoid using fmin?

You're right. Will change

WoosukKwon · 2026-06-04T01:28:17Z

+    FLASHMLA_SPARSE_V4 = (
+        "vllm.models.deepseek_v4.nvidia.flashmla.DeepseekV4FlashMLASparseBackend"
+    )
+    FLASHINFER_MLA_SPARSE_V4 = (
+        "vllm.models.deepseek_v4.nvidia.flashinfer_sparse."
+        "DeepseekV4FlashInferMLASparseBackend"
+    )
+    ROCM_FLASHMLA_SPARSE_V4 = (


nit: What about DSV4 instead of V4?

Suggested change

-    FLASHMLA_SPARSE_V4 = (
-        "vllm.models.deepseek_v4.nvidia.flashmla.DeepseekV4FlashMLASparseBackend"
-    )
-    FLASHINFER_MLA_SPARSE_V4 = (
-        "vllm.models.deepseek_v4.nvidia.flashinfer_sparse."
-        "DeepseekV4FlashInferMLASparseBackend"
-    )
-    ROCM_FLASHMLA_SPARSE_V4 = (
+    FLASHMLA_SPARSE_DSV4 = (
+        "vllm.models.deepseek_v4.nvidia.flashmla.DeepseekV4FlashMLASparseBackend"
+    )
+    FLASHINFER_MLA_SPARSE_DSV4 = (
+        "vllm.models.deepseek_v4.nvidia.flashinfer_sparse."
+        "DeepseekV4FlashInferMLASparseBackend"
+    )
+    ROCM_FLASHMLA_SPARSE_DSV4 = (

mergify · 2026-06-04T03:23:57Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Add a selectable DeepSeek V4 sparse-MLA decode backend that runs through FlashInfer's TRTLLM-gen kernel with a contiguous bf16 / per-tensor FP8 KV cache, alongside the existing FlashMLA path. Re-ported from PerkzZheng's

- csrc: sibling fusedDeepseekV4FullCacheKernel (contiguous 512-wide bf16 / per-tensor fp8 insert) + full_cache_{bf16,fp8}_insert ops, in the libtorch stable ABI (csrc/libtorch_stable/).
- nvidia/flashinfer_sparse.py: DeepseekV4FlashInferMLASparseBackend/Impl; public flashinfer.mla.trtllm_batch_decode_sparse_mla_dsv4 launcher, two-call decode/prefill split, q-head padding to {64,128}, fp8 scale buffers.
- registry + selection: FLASHMLA_SPARSE_DSV4 / FLASHINFER_MLA_SPARSE_DSV4 / ROCM_FLASHMLA_SPARSE_DSV4; _select_v4_sparse_impl consults the backend; _resolve_dsv4_kv_cache_dtype maps dtype per backend.
- compressor: CuTeDSL full-cache classes (SparseAttnCompressNormRopeStoreFullC4Kernel, SparseAttnNormRopeStoreFullKernel) separate from the pristine legacy UE8M0 classes so the legacy path keeps its perf; build_flashinfer_mixed_sparse_indices in common/ops.
- SWA cache / kv_cache_interface: accept bf16/fp8 dtypes, gate 576B alignment on fp8_ds_mla.
- docs: dedicated "DeepSeek V4 Decode Backends" section.
- tests: full-cache parity (insert ops + cutedsl compressor).

Verified: kernel/compressor tests pass; e2e GSM8K on DSv4-Flash (TP=4, fp8) matches the FlashMLA baseline (~0.953).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>

WoosukKwon

Thanks for the PR!
I will follow up with some refactoring.

wzhao18 · 2026-06-04T14:56:46Z

Hi @zyongye, thanks for the PR! Have you done benchmarking to measure the performance of the TRTLLM gen attention kernel vs FlashMLA? Which settings does this kernel have better perf for dsv4?

zyongye · 2026-06-04T16:56:37Z

> Hi @zyongye, thanks for the PR! Have you done benchmarking to measure the performance of the TRTLLM gen attention kernel vs FlashMLA? Which settings does this kernel have better perf for dsv4?

I only tested c2048, and I didn't actually see any perf improvement. I haven't checked other points yet.

zyongye requested review from AndreasKaratzas, LucasWilkinson, MatthewBonanni, WoosukKwon, heheda12345, mgoin, pavanimajety, tlrmchlsmth and yewentao256 as code owners May 28, 2026 02:50

mergify Bot added the documentation, deepseek, nvidia, and v1 labels May 28, 2026

github-project-automation Bot added this to NVIDIA May 28, 2026

mergify Bot added the needs-rebase label May 28, 2026

zyongye changed the title ~~[DSv4] FlashInfer sparse MLA: rebase + once-per-step C128A metadata caching~~ [DSv4] Adding TRTLLM gen attention kernel May 28, 2026

zyongye force-pushed the dsv4-sparse-mla-flashinfer-rebased branch from df2a27f to b96b676 Compare May 28, 2026 03:25

zyongye requested a review from hmellor as a code owner May 28, 2026 03:25

mergify Bot removed the needs-rebase label May 28, 2026

PerkzZheng reviewed May 29, 2026

zyongye requested review from dllehr-amd and tjtanaa as code owners May 31, 2026 20:28

zyongye added and removed the ready label May 31, 2026

PerkzZheng approved these changes Jun 1, 2026

mergify Bot added the needs-rebase label Jun 2, 2026

WoosukKwon reviewed Jun 2, 2026

zyongye force-pushed the dsv4-sparse-mla-flashinfer-rebased branch 2 times, most recently from f73111f to a37d8af Compare June 3, 2026 05:30

zyongye added the ready label Jun 3, 2026

zyongye force-pushed the dsv4-sparse-mla-flashinfer-rebased branch from a37d8af to 3f596f3 Compare June 3, 2026 18:13

WoosukKwon reviewed Jun 4, 2026

mergify Bot added the needs-rebase label Jun 4, 2026

zyongye force-pushed the dsv4-sparse-mla-flashinfer-rebased branch from 8e39bc7 to 6a65ae7 Compare June 4, 2026 04:45

mergify Bot removed the needs-rebase label Jun 4, 2026

WoosukKwon self-assigned this Jun 4, 2026

WoosukKwon self-requested a review June 4, 2026 08:01

WoosukKwon approved these changes Jun 4, 2026

github-project-automation Bot moved this to Ready in NVIDIA Jun 4, 2026

WoosukKwon enabled auto-merge (squash) June 4, 2026 08:02

WoosukKwon merged commit b5235fc into vllm-project:main Jun 4, 2026
164 of 165 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 4, 2026

zyongye deleted the dsv4-sparse-mla-flashinfer-rebased branch June 4, 2026 16:55

JisoLya pushed a commit to JisoLya/vllm that referenced this pull request Jun 5, 2026

[DSv4] Adding TRTLLM gen attention kernel (vllm-project#43827)

809de4c

Signed-off-by: JisoLya <523420504@qq.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[DSv4] Adding TRTLLM gen attention kernel (vllm-project#43827)

d3cde13

waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026

[DSv4] Adding TRTLLM gen attention kernel (vllm-project#43827)

48ff37f

Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

zyongye mentioned this pull request Jun 12, 2026

Port DeepSeek V4 FlashInfer sparse MLA kernels #42316

Closed

Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026

[DSv4] Adding TRTLLM gen attention kernel (vllm-project#43827)

ad8a960
