[Fix] the bug in the trtllm-gen heuristic for MLA kernels. #6284
Walkthrough: This update modifies the kernel selection logic for MLA generation in the FMHA kernels. It replaces the integer-returning method that computed the KV sequence length per CTA with a boolean method that decides whether to use a specific kernel based on the sequence length and the CTA count. The kernel selection logic is updated accordingly.
LGTM
Signed-off-by: Perkz Zheng <[email protected]>
This fixes a bug where low-latency (swapsMmaAb) MLA kernels were still selected when the batch size is quite large in the high-throughput case (when attention DP is used).
The same fix has been merged into FlashInfer; see flashinfer-ai/flashinfer#1307.
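The described change can be sketched as follows. This is a hypothetical illustration, not the actual TensorRT-LLM code: the function name, parameter names, and the sequence-length threshold are all assumptions. It shows the shape of the fix, where an integer "sequence length per CTA" heuristic is replaced by a boolean decision that also considers the total CTA count, so the low-latency (swapsMmaAb) kernel is no longer chosen when a large batch already saturates the GPU.

```cpp
#include <cassert>

// Hypothetical sketch of the corrected heuristic (names and the
// threshold are assumptions, not the real TensorRT-LLM API).
// The low-latency (swapsMmaAb) kernel splits work along the KV
// dimension, which only pays off when there are too few CTAs to
// fill the device's SMs. With a large batch (high-throughput case,
// e.g. attention DP), the CTA count alone saturates the GPU, so
// the function must return false regardless of sequence length.
bool useSwapsMmaAbKernel(int seqLenKv, int numCtas, int numMultiProcessors)
{
    // Only split along KV when the device is undersubscribed.
    bool const deviceUndersubscribed = numCtas < numMultiProcessors;
    // Short KV sequences do not benefit from the split (assumed threshold).
    bool const seqLenLargeEnough = seqLenKv >= 256;
    return deviceUndersubscribed && seqLenLargeEnough;
}
```

As a usage sketch, with 132 SMs: a small batch producing 8 CTAs over a 4096-token KV sequence would select the low-latency kernel, while a large batch producing 512 CTAs would not, which is exactly the case the bug previously got wrong.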