
Conversation


@peaceh-nv peaceh-nv commented Jul 15, 2025

Description

Add FP8 context MLA support for SM120

Unlike FP8 context FMHA, FP8 context MLA requires BF16 output.

Accuracy:
The GPQA Diamond score for the DeepSeek-R1 wo_gemm-quantized checkpoint with FP8 context MLA + FP8 xqa-mla generation is 0.707 on SM120. This is reasonable given the baseline (an NVFP4 DeepSeek-R1 checkpoint with BF16 context MLA + FP8 generation MLA) scores 0.702 on B200.

Performance:
TTFT on SM120:
BF16 context MLA: 121244.7586 ms
FP8 context MLA: 97594.8747 ms
~24% speedup (121244.7586 / 97594.8747 ≈ 1.24, i.e. TTFT reduced by ~19.5%)
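The quoted percentage depends on whether you take old/new or the fractional reduction; a quick sanity check of the arithmetic on the two TTFT figures above (figures from this PR, function names are just for illustration):

```cpp
// Speedup as baseline/optimized: this is where the "~24%" figure comes from.
double speedup(double baseline_ms, double optimized_ms) {
    return baseline_ms / optimized_ms;
}

// Fractional TTFT reduction, the other common way to report the same gain.
double ttft_reduction(double baseline_ms, double optimized_ms) {
    return 1.0 - optimized_ms / baseline_ms;
}
```

With the numbers above, `speedup(121244.7586, 97594.8747)` is ≈1.242 and `ttft_reduction(...)` is ≈0.195.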

Summary by CodeRabbit

  • New Features

    • Added support for FP8 quantization in Multi-head Latent Attention (MLA) mode, including buffer management, runtime configuration, and input data quantization.
  • Bug Fixes

    • Improved robustness by adding checks for optional input tensors to prevent errors from missing data.
  • Tests

    • Updated accuracy tolerances for FP8 data type in MLA attention tests to better reflect expected results.

@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #11939 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11939 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #8857 completed with status: 'FAILURE'

@peaceh-nv force-pushed the fp8-context-sm120 branch from 7eee4ec to 4339442 on Jul 21, 2025 07:35

coderabbitai bot commented Jul 21, 2025

📝 Walkthrough

A new boolean flag, `mFP8ContextMLA`, enables FP8 context mode for Multi-head Latent Attention (MLA) alongside the existing FMHA path. This required updates to buffer size calculations, parameter passing, kernel launches, and conditional logic across the attention operator implementations, kernel headers, and related tests. A new CUDA kernel was also added to quantize input data to FP8 within the MLA context phase.

## Changes

| File(s)                                                                                 | Change Summary                                                                                                          |
|-----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| **Attention Operator Source**<br>`cpp/tensorrt_llm/common/attentionOp.cpp`, `cpp/tensorrt_llm/thop/attentionOp.cpp`          | Integrated `mFP8ContextMLA` flag into buffer size, parameter, and control flow logic; added robust checks for optional tensors; set MLA FP8 context flags during generation and enqueue phases |
| **Attention Operator Header**<br>`cpp/tensorrt_llm/common/attentionOp.h`                                                   | Added private member variable `mFP8ContextMLA` to `AttentionOp` class                                                   |
| **MLA Kernels Source and Header**<br>`cpp/tensorrt_llm/kernels/mlaKernels.cu`, `cpp/tensorrt_llm/kernels/mlaKernels.h`           | Added `QuantizeCopyInputToFp8Kernel` CUDA kernel; extended `MlaParams` struct with `quant_scale_qkv`; updated MLA kernel invocation to launch FP8 quantization kernel conditionally    |
| **Unit Tests**<br>`tests/unittest/_torch/test_attention_mla.py`                                             | Increased FP8 accuracy tolerances in test dictionary                                                                    |
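The quantize-and-copy step the table refers to can be sketched on the host as follows. This is a minimal illustration of the math only (scale by `quant_scale_qkv`, clamp to the finite FP8 E4M3 range of ±448); the names, layout, and the per-thread device implementation in `mlaKernels.cu` are not reproduced here:

```cpp
#include <algorithm>
#include <vector>

// Host-side sketch of a QuantizeCopyInputToFp8Kernel-style pass: scale each
// QKV element by a single quantization scale and clamp to the representable
// FP8 E4M3 range. Output stays float here; the real kernel stores fp8 bytes.
std::vector<float> quantize_to_fp8_range(const std::vector<float>& qkv,
                                         float quant_scale_qkv) {
    constexpr float kFp8E4m3Max = 448.0f;  // largest finite E4M3 value
    std::vector<float> out(qkv.size());
    for (size_t i = 0; i < qkv.size(); ++i) {
        float v = qkv[i] * quant_scale_qkv;
        out[i] = std::min(std::max(v, -kFp8E4m3Max), kFp8E4m3Max);
    }
    return out;
}
```

Values outside ±448 saturate rather than overflow, which is why a well-chosen `quant_scale_qkv` matters for accuracy.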

## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant AttentionOp
    participant MLA Kernel
    participant CUDA Device

    User->>AttentionOp: Request context enqueue (MLA, FP8 enabled)
    AttentionOp->>MLA Kernel: invokeMLARopeContext(params, ...)
    MLA Kernel->>CUDA Device: applyMLARopeAndAssignQKVKernelOptContext
    alt FP8 Context MLA enabled
        MLA Kernel->>CUDA Device: QuantizeCopyInputToFp8Kernel (quantize input to FP8)
    end
    MLA Kernel-->>AttentionOp: Return
    AttentionOp-->>User: Complete

```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • lucifer1004
  • kaiyux


---

<details>
<summary>📜 Recent review details</summary>

**Configuration used: .coderabbit.yaml**
**Review profile: CHILL**
**Plan: Pro**


<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between c03bae23eef8f660361ee2985fb8ee018cf1e8d2 and e1f611ca686ff4ab0b1ec957124c2e8638e2897d.

</details>

<details>
<summary>📒 Files selected for processing (6)</summary>

* `cpp/tensorrt_llm/common/attentionOp.cpp` (9 hunks)
* `cpp/tensorrt_llm/common/attentionOp.h` (1 hunks)
* `cpp/tensorrt_llm/kernels/mlaKernels.cu` (2 hunks)
* `cpp/tensorrt_llm/kernels/mlaKernels.h` (2 hunks)
* `cpp/tensorrt_llm/thop/attentionOp.cpp` (4 hunks)
* `tests/unittest/_torch/test_attention_mla.py` (1 hunks)

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (6)</summary>

* tests/unittest/_torch/test_attention_mla.py
* cpp/tensorrt_llm/common/attentionOp.h
* cpp/tensorrt_llm/kernels/mlaKernels.h
* cpp/tensorrt_llm/thop/attentionOp.cpp
* cpp/tensorrt_llm/kernels/mlaKernels.cu
* cpp/tensorrt_llm/common/attentionOp.cpp

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)</summary>

* GitHub Check: Pre-commit Check

</details>

</details>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (1)
cpp/tensorrt_llm/common/attentionOp.cpp (1)

749-749: Address the existing review comment - unit tests are still missing for this code path.

QiJune previously requested unit tests for this FP8 context MLA path, which haven't been added yet.

🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)

926-968: Remove the explicit cudaStreamSynchronize & drop the unused headDim

  1. cudaStreamSynchronize(stream); forces a host/device sync for every context-level call, nullifying any overlap with subsequent kernels and hurting throughput.
    Nothing in this path depends on a hard sync; the earlier sync_check_cuda_error(stream) is enough.
    ➜ Delete the sync or make it optional behind a debug flag.

  2. size_t headDim = …; is never used – will trigger “set but not used” warnings when compiling with -Wall.
    ➜ Remove the variable.
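The "make the sync optional behind a debug flag" suggestion could look roughly like this. `cudaStreamSynchronize` is stubbed so the snippet compiles without CUDA (the real call takes a `cudaStream_t` and returns `cudaError_t`), and the flag plumbing is hypothetical:

```cpp
// Stub standing in for the CUDA runtime call; real code would include
// <cuda_runtime.h> and use the actual API.
static int cudaStreamSynchronize(void* /*stream*/) { return 0; }

// Only pay for a hard host/device sync when a debug flag is set; the
// normal path stays asynchronous so later kernels can overlap.
bool maybe_debug_sync(void* stream, bool debug_sync_enabled) {
    if (debug_sync_enabled) {
        cudaStreamSynchronize(stream);
        return true;  // sync taken (debug/diagnostic runs only)
    }
    return false;     // production path: no host/device stall
}
```

This keeps the diagnostic capability without stalling every context-phase call.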

cpp/tensorrt_llm/common/attentionOp.cpp (2)

732-754: Consider improving code clarity and consistency.

The buffer size calculation logic is correct, but could benefit from:

  1. Consistent naming convention (e.g., dim_*_per_head vs total_*_dim_all_heads)
  2. Adding a comment explaining why MLA requires different buffer size calculation
```diff
-    int const num_total_qkv_elements
-        = max_num_tokens * (total_q_dim_all_heads + total_k_dim_all_heads + total_v_dim_all_heads);
+    // MLA uses different head dimensions, requiring custom buffer size calculation
+    int const total_qkv_elements
+        = max_num_tokens * (total_q_dim_all_heads + total_k_dim_all_heads + total_v_dim_all_heads);
```
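Why MLA needs its own element count: in context MLA, Q and K carry a "nope" part plus a rope part per head while V carries only the value dimension, so the three terms are summed separately instead of using one uniform head size. A sketch with illustrative DeepSeek-style dimensions (128 nope + 64 rope, 128 value), not values read from the actual config:

```cpp
#include <cstddef>

// Element count of the context-phase MLA QKV buffer. Q and K per head are
// (nope_dim + rope_dim) wide; V per head is v_dim wide.
size_t mla_qkv_elements(size_t max_num_tokens, size_t head_num,
                        size_t nope_dim, size_t rope_dim, size_t v_dim) {
    size_t total_q_dim_all_heads = head_num * (nope_dim + rope_dim);
    size_t total_k_dim_all_heads = head_num * (nope_dim + rope_dim);
    size_t total_v_dim_all_heads = head_num * v_dim;
    return max_num_tokens
        * (total_q_dim_all_heads + total_k_dim_all_heads + total_v_dim_all_heads);
}
```

For one head with dims 128/64/128, each token contributes 192 + 192 + 128 = 512 elements.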

277-2887: Verify the performance improvements claimed in the PR.

The PR claims ~24% TTFT improvement on SM120. Please ensure performance benchmarks are included in the test suite to validate these gains and prevent regressions.

Would you like me to help create a performance test framework to track these metrics?

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8c068b and 4339442.

📒 Files selected for processing (6)
  • cpp/tensorrt_llm/common/attentionOp.cpp (9 hunks)
  • cpp/tensorrt_llm/common/attentionOp.h (1 hunks)
  • cpp/tensorrt_llm/kernels/mlaKernels.cu (2 hunks)
  • cpp/tensorrt_llm/kernels/mlaKernels.h (2 hunks)
  • cpp/tensorrt_llm/thop/attentionOp.cpp (4 hunks)
  • tests/unittest/_torch/test_attention_mla.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
cpp/tensorrt_llm/thop/attentionOp.cpp (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • host_kv_cache_pool_mapping (586-590)
  • host_kv_cache_pool_pointers (579-583)
cpp/tensorrt_llm/common/attentionOp.cpp (1)
  • T (1285-2054)
🔇 Additional comments (12)
cpp/tensorrt_llm/common/attentionOp.h (1)

384-384: LGTM!

The addition of the mFP8ContextMLA flag follows the established naming convention and is properly initialized.

tests/unittest/_torch/test_attention_mla.py (1)

342-342: Tolerance adjustment for FP8 precision looks reasonable.

The minor increase in tolerance values (1.875%) is appropriate for FP8 operations with the new FP8 context MLA support.

cpp/tensorrt_llm/kernels/mlaKernels.h (2)

90-91: LGTM!

The addition of quant_scale_qkv pointer is properly typed and initialized, following the existing pattern for quantization scale pointers.


116-119: LGTM!

The kernel declaration follows proper CUDA conventions with appropriate template parameter for input flexibility and fixed FP8 output type.

cpp/tensorrt_llm/thop/attentionOp.cpp (6)

152-155: Good defensive programming practice.

Adding the null check before accessing the optional tensor prevents potential crashes.


210-212: LGTM!

Proper fallback to attention_window_size when cache_indirection is not available.


218-233: Excellent robustness improvements.

All the added checks for KV cache related optional tensors follow a consistent pattern and prevent null pointer dereferences.


254-259: LGTM!

The compound check ensures both quant mode and tensor availability before pointer assignment.


261-266: Good consistency in output scale handling.

The checks properly differentiate between FP8 and FP4 output scale scenarios.


332-334: LGTM!

Consistent with the pattern of checking optional tensors before access.

cpp/tensorrt_llm/kernels/mlaKernels.cu (1)

931-944: Sanity-check the per-head dimensional assumptions

total_k_dim_all_heads and total_v_dim_all_heads are computed with head_num, implicitly assuming
num_kv_heads == head_num.
For MQA/GQA models that use 1 or 2 KV-heads this will over-estimate the element count and overrun the
destination buffer.

Please confirm that:

```cpp
total_k_dim_all_heads = effective_num_kv_heads * dim_k_per_head;
total_v_dim_all_heads = effective_num_kv_heads * dim_v_per_head;
```

or guard with an assertion.
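The suggested assertion guard could be sketched as below. All names are illustrative, not the actual kernel variables; the point is to make the `num_kv_heads == head_num` assumption fail loudly instead of silently over-sizing the buffer:

```cpp
#include <cassert>
#include <cstddef>

struct KvDims { size_t total_k, total_v; };

// Compute K/V totals from the effective number of KV heads, asserting the
// MHA-shaped-KV assumption that context MLA currently relies on. A GQA/MQA
// model (1-2 KV heads) would otherwise over-estimate the element count.
KvDims kv_dims(size_t head_num, size_t num_kv_heads,
               size_t dim_k_per_head, size_t dim_v_per_head) {
    assert(num_kv_heads == head_num && "context MLA assumes num_kv_heads == head_num");
    return {num_kv_heads * dim_k_per_head, num_kv_heads * dim_v_per_head};
}
```
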

cpp/tensorrt_llm/common/attentionOp.cpp (1)

2623-2623: Verify XQA enablement conditions for SM120.

The XQA is enabled only for SM120 with generation MLA. Please confirm:

  1. Is XQA support exclusive to SM120, or should it also include SM121?
  2. Are there any other conditions that should be checked (e.g., FP8 mode)?

@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #12430 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #12430 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9243 completed with status: 'FAILURE'

@peaceh-nv force-pushed the fp8-context-sm120 branch from 4339442 to 711a421 on Jul 23, 2025 01:02
@peaceh-nv

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12629 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #12629 [ run ] completed with state FAILURE

@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #12693 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #12693 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9442 completed with status: 'FAILURE'

@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #12777 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #12777 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9515 completed with status: 'FAILURE'

@peaceh-nv force-pushed the fp8-context-sm120 branch from 711a421 to 606608e on Jul 25, 2025 00:16
@peaceh-nv

/bot run

@coderabbitai coderabbitai bot requested review from kaiyux and lucifer1004 July 25, 2025 00:17
@tensorrt-cicd

PR_Github #12910 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #12910 [ run ] completed with state FAILURE

@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #13323 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #13323 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9957 completed with status: 'FAILURE'

@peaceh-nv force-pushed the fp8-context-sm120 branch from 606608e to e8d5dcf on Jul 30, 2025 02:00
@peaceh-nv

/bot run

@peaceh-nv peaceh-nv changed the title [feat] : Add FP8 context MLA support for SM120 [None][feat] : Add FP8 context MLA support for SM120 Aug 1, 2025
@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #13760 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #13761 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #13760 [ run ] completed with state ABORTED

@tensorrt-cicd

PR_Github #13761 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10341 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #13954 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #13954 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10509 completed with status: 'FAILURE'

@peaceh-nv

/bot run

@tensorrt-cicd

PR_Github #14077 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #14077 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10622 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.
