[None][feat] : Add FP8 context MLA support for SM120 #6059
Conversation
/bot run
PR_Github #11939 [ run ] triggered by Bot
PR_Github #11939 [ run ] completed with state
Force-pushed from 7eee4ec to 4339442
## 📝 Walkthrough
Support for a new boolean flag, `mFP8ContextMLA`, was introduced to enable FP8 context mode for Multi-head Latent Attention (MLA) alongside the existing FP8 context FMHA mode. This required updates to buffer size calculations, parameter passing, kernel launches, and conditional logic across the attention operator implementations, kernel headers, and related tests. Additionally, a new CUDA kernel was added to quantize input data to FP8 format within the MLA context path.
## Changes
| File(s) | Change Summary |
|-----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| **Attention Operator Source**<br>`cpp/tensorrt_llm/common/attentionOp.cpp`, `cpp/tensorrt_llm/thop/attentionOp.cpp` | Integrated `mFP8ContextMLA` flag into buffer size, parameter, and control flow logic; added robust checks for optional tensors; set MLA FP8 context flags during generation and enqueue phases |
| **Attention Operator Header**<br>`cpp/tensorrt_llm/common/attentionOp.h` | Added private member variable `mFP8ContextMLA` to `AttentionOp` class |
| **MLA Kernels Source and Header**<br>`cpp/tensorrt_llm/kernels/mlaKernels.cu`, `cpp/tensorrt_llm/kernels/mlaKernels.h` | Added `QuantizeCopyInputToFp8Kernel` CUDA kernel; extended `MlaParams` struct with `quant_scale_qkv`; updated MLA kernel invocation to launch FP8 quantization kernel conditionally |
| **Unit Tests**<br>`tests/unittest/_torch/test_attention_mla.py` | Increased FP8 accuracy tolerances in test dictionary |
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant User
participant AttentionOp
participant MLA Kernel
participant CUDA Device
User->>AttentionOp: Request context enqueue (MLA, FP8 enabled)
AttentionOp->>MLA Kernel: invokeMLARopeContext(params, ...)
MLA Kernel->>CUDA Device: applyMLARopeAndAssignQKVKernelOptContext
alt FP8 Context MLA enabled
MLA Kernel->>CUDA Device: QuantizeCopyInputToFp8Kernel (quantize input to FP8)
end
MLA Kernel-->>AttentionOp: Return
AttentionOp-->>User: Complete
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers
Actionable comments posted: 5
♻️ Duplicate comments (1)
cpp/tensorrt_llm/common/attentionOp.cpp (1)
749-749: Address the existing review comment: unit tests are still missing for this code path. QiJune previously requested unit tests for this FP8 context MLA path, which haven't been added yet.
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)
926-968: Remove the explicit `cudaStreamSynchronize` and drop the unused `headDim`.
`cudaStreamSynchronize(stream);` forces a host/device sync on every context-level call, nullifying any overlap with subsequent kernels and hurting throughput. Nothing in this path depends on a hard sync; the earlier `sync_check_cuda_error(stream)` is enough.
➜ Delete the sync or make it optional behind a debug flag.
`size_t headDim = …;` is never used and will trigger "set but not used" warnings when compiling with `-Wall`.
➜ Remove the variable.

cpp/tensorrt_llm/common/attentionOp.cpp (2)
732-754: Consider improving code clarity and consistency. The buffer size calculation logic is correct, but could benefit from:
- a consistent naming convention (e.g., `dim_*_per_head` vs `total_*_dim_all_heads`)
- a comment explaining why MLA requires a different buffer size calculation

```diff
-    int const num_total_qkv_elements
-        = max_num_tokens * (total_q_dim_all_heads + total_k_dim_all_heads + total_v_dim_all_heads);
+    // MLA uses different head dimensions, requiring custom buffer size calculation
+    int const total_qkv_elements
+        = max_num_tokens * (total_q_dim_all_heads + total_k_dim_all_heads + total_v_dim_all_heads);
```
277-2887: Verify the performance improvements claimed in the PR. The PR claims ~24% TTFT improvement on SM120. Please ensure performance benchmarks are included in the test suite to validate these gains and prevent regressions.
Would you like me to help create a performance test framework to track these metrics?
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/tensorrt_llm/common/attentionOp.cpp (9 hunks)
- cpp/tensorrt_llm/common/attentionOp.h (1 hunks)
- cpp/tensorrt_llm/kernels/mlaKernels.cu (2 hunks)
- cpp/tensorrt_llm/kernels/mlaKernels.h (2 hunks)
- cpp/tensorrt_llm/thop/attentionOp.cpp (4 hunks)
- tests/unittest/_torch/test_attention_mla.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
cpp/tensorrt_llm/thop/attentionOp.cpp (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
- host_kv_cache_pool_mapping (586-590)
- host_kv_cache_pool_pointers (579-583)

cpp/tensorrt_llm/common/attentionOp.cpp (1)
- T (1285-2054)
🔇 Additional comments (12)
cpp/tensorrt_llm/common/attentionOp.h (1)
384-384: LGTM! The addition of the `mFP8ContextMLA` flag follows the established naming convention and is properly initialized.

tests/unittest/_torch/test_attention_mla.py (1)
342-342: The tolerance adjustment for FP8 precision looks reasonable. The minor increase in tolerance values (1.875%) is appropriate for FP8 operations with the new FP8 context MLA support.
cpp/tensorrt_llm/kernels/mlaKernels.h (2)
90-91: LGTM! The addition of the `quant_scale_qkv` pointer is properly typed and initialized, following the existing pattern for quantization scale pointers.

116-119: LGTM! The kernel declaration follows proper CUDA conventions, with an appropriate template parameter for input flexibility and a fixed FP8 output type.
cpp/tensorrt_llm/thop/attentionOp.cpp (6)
152-155: Good defensive programming practice. Adding the null check before accessing the optional tensor prevents potential crashes.

210-212: LGTM! Proper fallback to `attention_window_size` when `cache_indirection` is not available.

218-233: Excellent robustness improvements. All the added checks for KV-cache-related optional tensors follow a consistent pattern and prevent null pointer dereferences.

254-259: LGTM! The compound check ensures both quant mode and tensor availability before pointer assignment.

261-266: Good consistency in output scale handling. The checks properly differentiate between FP8 and FP4 output scale scenarios.

332-334: LGTM! Consistent with the pattern of checking optional tensors before access.
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)
931-944: Sanity-check the per-head dimensional assumptions. `total_k_dim_all_heads` and `total_v_dim_all_heads` are computed with `head_num`, implicitly assuming `num_kv_heads == head_num`. For MQA/GQA models that use 1 or 2 KV heads, this will over-estimate the element count and overrun the destination buffer. Please confirm that:

```cpp
total_k_dim_all_heads = effective_num_kv_heads * dim_k_per_head;
total_v_dim_all_heads = effective_num_kv_heads * dim_v_per_head;
```

or guard with an assertion.
cpp/tensorrt_llm/common/attentionOp.cpp (1)
2623-2623: Verify the XQA enablement conditions for SM120. XQA is enabled only for SM120 with generation MLA. Please confirm:
- Is XQA support exclusive to SM120, or should it also include SM121?
- Are there any other conditions that should be checked (e.g., FP8 mode)?
Signed-off-by: peaceh <[email protected]>
Description
Add FP8 context MLA support for SM120
Compared to FP8 context FMHA, FP8 context MLA needs BF16 output.
Accuracy:
The GPQA Diamond score for the DeepSeek-R1 quant wo_gemm checkpoint + FP8 context MLA + FP8 xqa-mla generation is 0.707 on SM120, which is reasonable given the baseline: the nvfp4 DeepSeek-R1 checkpoint + BF16 MLA + FP8 MLA scores 0.702 on B200.
Performance:
TTFT on SM120:
- BF16 context MLA: 121244.7586 ms
- FP8 context MLA: 97594.8747 ms
- ~24% improvement
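For reference, the arithmetic behind the headline number: 121244.7586 − 97594.8747 ≈ 23649.88 ms saved, which is ≈ 24.2% of the FP8 time (and ≈ 19.5% of the BF16 baseline), so the ~24% figure is measured relative to the FP8 run.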