DeepEP LL dispatch FP4 #6296
Conversation
📝 Walkthrough
The changes update the DeepEP dependency version in the CMake configuration, add a new `low_latency_dispatch_fp4` method to the DeepEP buffer utilities, and update the wide-EP fused MoE path to use it.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant ForwardChunk
    participant VariableLengthLowLatencyBuffer
    participant DeepEPBuffer
    ForwardChunk->>VariableLengthLowLatencyBuffer: low_latency_dispatch_fp4(hidden_states, scales, topk_idx, ...)
    VariableLengthLowLatencyBuffer->>DeepEPBuffer: low_latency_dispatch_fp4(hidden_states, scales, topk_idx, ...)
    DeepEPBuffer-->>VariableLengthLowLatencyBuffer: (recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook)
    VariableLengthLowLatencyBuffer-->>ForwardChunk: (recv_hidden_states, recv_scales, recv_expert_count, handle)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
165-165: Line too long (128 > 120) (E501)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
605-605: Line too long (132 > 120) (E501)
🔇 Additional comments (3)
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
157-174: LGTM! Well-implemented FP4 dispatch method with a minor formatting issue.
The new `low_latency_dispatch_fp4` method follows the established patterns and correctly extends the functionality to support FP4 data with scales. The implementation is consistent with the existing `low_latency_dispatch` method. However, please fix the line length issue on line 166:

```diff
-        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
-            self.buffer.low_latency_dispatch_fp4(hidden_states, scales, topk_idx, num_max_dispatch_tokens_per_rank, num_experts)
+        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
+            self.buffer.low_latency_dispatch_fp4(
+                hidden_states, scales, topk_idx,
+                num_max_dispatch_tokens_per_rank, num_experts)
```
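For reference, here is a minimal sketch of what such a wrapper method might look like, assuming a `self.buffer` object that exposes DeepEP's `low_latency_dispatch_fp4` as discussed above; the signature and return handling are illustrative, not the actual TensorRT-LLM source.

```python
# Illustrative sketch only -- not the reviewed implementation.
def low_latency_dispatch_fp4(self, hidden_states, scales, topk_idx,
                             num_max_dispatch_tokens_per_rank, num_experts):
    # Delegate to the underlying DeepEP buffer (attribute name assumed).
    recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
        self.buffer.low_latency_dispatch_fp4(
            hidden_states, scales, topk_idx,
            num_max_dispatch_tokens_per_rank, num_experts)
    # Per the walkthrough's sequence diagram, only the received tensors and
    # the handle propagate back to the caller; event/hook stay internal here.
    return recv_hidden_states, recv_scales, recv_expert_count, handle
```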
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
166-166: Line too long (128 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)
1-1: DeepEP commit hash verified
- Commit 5ac330ba4157b46c439e7575a9b558d2b01224e6 exists in the DeepEP repo.
- Merge message confirms “LL dispatch FP4”.
- FP4 functionality is present in tests/test_low_latency_fp4.py.
This update is consistent with the PR objectives. Approved.
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
166-166: DeepEP buffer method verified
The `low_latency_dispatch_fp4` method is present in the DeepEP codebase at the specified commit:
- In csrc/deep_ep.hpp (line 146): method signature
- In csrc/deep_ep.cpp (line 1187): method definition
- In csrc/deep_ep.cpp (line 1440): Python binding via `.def`
No further action is required.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
605-605: Consider breaking the long line for better readability.
The line exceeds the 120 character limit:
```diff
-        x, x_sf, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch_fp4(x, x_sf, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        x, x_sf, recv_expert_count, deep_ep_handle = \
+            self.deep_ep_buffer.low_latency_dispatch_fp4(
+                x, x_sf, deep_ep_topk_idx,
+                all_rank_max_num_tokens, self.num_slots)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
605-605: Line too long (132 > 120)
(E501)
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
591-592: Critical dimension assignment fix.
The correction of swapping the `token_num` and `hidden_size` assignments is appropriate. The previous assignment was incorrect, as `x_row` (shape[0]) represents the token dimension while `x_col` (shape[1]) represents the hidden dimension.
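As a minimal illustration of the convention this comment describes (tokens along `shape[0]`, hidden features along `shape[1]`); the shapes and variable names below are hypothetical, not the reviewed source.

```python
import torch

# Hypothetical input: 8 tokens with a 4096-wide hidden dimension.
x = torch.randn(8, 4096)
x_row, x_col = x.shape  # x_row = shape[0], x_col = shape[1]

# Corrected semantics as described in this comment:
token_num = x_row      # shape[0] is the token dimension
hidden_size = x_col    # shape[1] is the hidden dimension

assert (token_num, hidden_size) == (8, 4096)
```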
595-595: Correct FP4 data type handling and dispatch method.
The changes correctly handle FP4 quantized data:
- Using `torch.uint8` dtype for packed FP4 values is appropriate (see the packing sketch after this list)
- The dimension assertions are updated to match the corrected variable assignments
- The new `low_latency_dispatch_fp4` method provides a cleaner interface for FP4 dispatch
Also applies to: 598-599, 606-609
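To make the `torch.uint8` packing concrete, here is a small standalone sketch of how two 4-bit codes can share one byte, which is why a packed FP4 row has `hidden_size // 2` entries. The element order within each byte is an assumption for illustration; the actual DeepEP/TensorRT-LLM kernels define their own layout.

```python
import torch

def pack_fp4_pairs(nibbles: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit codes (values 0..15) into uint8 bytes (illustrative)."""
    assert nibbles.dtype == torch.uint8 and nibbles.shape[-1] % 2 == 0
    lo = nibbles[..., 0::2]   # even positions -> low nibble (assumed order)
    hi = nibbles[..., 1::2]   # odd positions  -> high nibble
    return lo | (hi << 4)

codes = torch.randint(0, 16, (2, 8), dtype=torch.uint8)    # hidden_size = 8
packed = pack_fp4_pairs(codes)
assert packed.shape[-1] == codes.shape[-1] // 2             # hidden_size // 2
```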
619-622: Reshape dimensions are consistent with `low_latency_dispatch_fp4`
The preceding assertions in fused_moe_wide_ep.py (`x.shape[2] == hidden_size // 2` and `x_sf.shape[2] == hidden_size // 16 // 2`) confirm the 3D layouts for hidden states and scales. Flattening them to `[batch_size * seq_length, feature_dim]` matches the buffer call and requires no further changes.
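A hedged sketch of the flattening described here, using hypothetical shapes; the reviewed code's actual variable names and leading dimensions may differ.

```python
import torch

# Hypothetical 3D dispatch outputs (illustrative shapes only).
batch_size, seq_length, hidden_size = 4, 16, 4096
x = torch.empty(batch_size, seq_length, hidden_size // 2, dtype=torch.uint8)
x_sf = torch.empty(batch_size, seq_length, hidden_size // 16 // 2, dtype=torch.uint8)

# Flatten the leading two dimensions to [batch_size * seq_length, feature_dim]
# before handing the tensors to the fused MoE kernel, as the review describes.
x = x.reshape(-1, x.shape[2])
x_sf = x_sf.reshape(-1, x_sf.shape[2])
assert x.shape == (batch_size * seq_length, hidden_size // 2)
```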
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
604-609: Approve FP4 dispatch method integration, but fix line length
The direct call to `low_latency_dispatch_fp4` is an architectural improvement over the previous temporary packed-tensor approach. However, there's a line length violation. Apply this diff to fix it:
```diff
-        x, x_sf, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch_fp4(x, x_sf, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        x, x_sf, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch_fp4(
+            x, x_sf, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots
+        )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
605-605: Line too long (132 > 120)
(E501)
🔇 Additional comments (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
591-592: LGTM: Variable assignment correction
The swap of `token_num` and `hidden_size` assignments correctly aligns with tensor dimension semantics, where `x.shape[0]` represents the token count and `x.shape[1]` represents the hidden size.
595-598: LGTM: Assertions updated for corrected variable semantics
The assertions correctly validate tensor shapes and data types using the corrected `token_num` and `hidden_size` variable assignments.
619-621: LGTM: Tensor reshaping updated for 3D structure
The reshaping operations correctly flatten the first two dimensions of the 3D tensors returned by the new FP4 dispatch method, preparing them for the fused MoE kernel.
590-626: Excellent architectural improvement for FP4 dispatch
The overall changes represent a well-executed architectural improvement:
- Corrected semantics: Variable assignments now properly reflect tensor dimensions (`token_num = x_col`, `hidden_size = x_row`)
- Cleaner architecture: The direct `low_latency_dispatch_fp4` call eliminates the need for temporary packed tensors
- Consistent updates: All assertions, reshaping operations, and tensor handling have been updated to maintain consistency
- Maintained functionality: The changes preserve the existing behavior while improving code clarity
This modernization of the FP4 dispatch logic enhances maintainability and aligns with the broader DeepEP integration improvements.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
606-609: Consider documenting the FP4 tensor layout.
The post-dispatch assertions correctly validate the 3D tensor format, but the FP4 packing scheme (2 values per uint8) could benefit from documentation for future maintainability.
Add a comment explaining the FP4 layout:
```diff
         assert x.dtype == torch.uint8 and x_sf.dtype == torch.uint8
         assert x.dim() == 3 and x_sf.dim() == 3
+        # FP4 packing: each uint8 holds 2 FP4 values, hence hidden_size // 2
+        # Scaling factors: one per 16 FP4 values, hence hidden_size // 16
         assert x.shape[2] == hidden_size // 2 and x_sf.shape[2] == hidden_size // 16
```
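A quick arithmetic check of the layout the suggested comment describes, for a hypothetical hidden size (the number 4096 is illustrative, not taken from the PR).

```python
# Illustrative arithmetic for the FP4 layout described above.
hidden_size = 4096                  # hypothetical model hidden size

packed_bytes = hidden_size // 2     # 2 FP4 values per uint8 -> 2048 bytes per row
num_scales = hidden_size // 16      # 1 scaling factor per 16 values -> 256 scales

assert packed_bytes == 2048 and num_scales == 256
```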
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
605-605: Line too long (132 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
593-599: LGTM! Proper validation for FP4 tensor format.
The assertions correctly validate the FP4 tensor format requirements:
- Ensures scaling factors are present when using FP4
- Validates the uint8 data type (as FP4 values are packed into uint8)
- Confirms proper shape constraints where scaling factors are 1/16th of the hidden size (see the sketch after this list)
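A minimal, standalone sketch of the kind of pre-dispatch checks this comment describes; the variable names and the exact assertion set are assumptions for illustration, not the reviewed source.

```python
import torch

def validate_fp4_inputs(x: torch.Tensor, x_sf: torch.Tensor, hidden_size: int) -> None:
    """Illustrative pre-dispatch checks mirroring the review's bullet points."""
    assert x_sf is not None                                      # scales required for FP4
    assert x.dtype == torch.uint8 and x_sf.dtype == torch.uint8  # packed into uint8
    assert x_sf.shape[-1] == hidden_size // 16                   # 1 scale per 16 values

hidden_size = 4096
x = torch.empty(8, hidden_size // 2, dtype=torch.uint8)      # packed FP4 values
x_sf = torch.empty(8, hidden_size // 16, dtype=torch.uint8)  # scaling factors
validate_fp4_inputs(x, x_sf, hidden_size)
```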
619-621: LGTM! Correct tensor reshaping for the fused_moe operation.
The reshape operations properly flatten the 3D tensors from the dispatch output into the 2D format expected by the downstream `fused_moe` operation.
46580cf to df6b097 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
604-605: Fix line length issue.
Line 605 exceeds the 120 character limit. Split the function call across multiple lines:
```diff
-        x, x_sf, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch_fp4(x, x_sf, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        x, x_sf, recv_expert_count, deep_ep_handle = \
+            self.deep_ep_buffer.low_latency_dispatch_fp4(
+                x, x_sf, deep_ep_topk_idx,
+                all_rank_max_num_tokens, self.num_slots)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
605-605: Line too long (132 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
166-166: Line too long (128 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
157-174: LGTM! FP4 dispatch method follows the established pattern.
The new `low_latency_dispatch_fp4` method correctly extends the low-latency dispatch functionality for FP4 data, properly handling separate scale tensors alongside hidden states.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
593-598: Comprehensive FP4 data validation.
The assertions properly validate FP4 data format requirements, ensuring correct tensor types and shape constraints for the dispatch operation.
606-626: Correct FP4 post-dispatch processing.
The code properly handles the dispatched FP4 data by validating output formats, reshaping tensors, and applying the necessary swizzle operation for optimal memory layout.
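Putting those post-dispatch steps into one hedged sketch (validate, flatten, then swizzle): `swizzle_scales` below is a hypothetical placeholder for whatever layout transform the reviewed code applies, and all shapes are illustrative.

```python
import torch

def swizzle_scales(x_sf: torch.Tensor) -> torch.Tensor:
    """Hypothetical placeholder for the scale-layout swizzle mentioned in the
    review; the real transform is kernel-specific and not reproduced here."""
    return x_sf.contiguous()

def postprocess_fp4_dispatch(x: torch.Tensor, x_sf: torch.Tensor):
    # Validate the 3D uint8 output format of the FP4 dispatch (illustrative).
    assert x.dtype == torch.uint8 and x_sf.dtype == torch.uint8
    assert x.dim() == 3 and x_sf.dim() == 3
    # Flatten to 2D for the downstream fused MoE operation.
    x = x.reshape(-1, x.shape[2])
    x_sf = x_sf.reshape(-1, x_sf.shape[2])
    # Apply the memory-layout transform to the scaling factors.
    return x, swizzle_scales(x_sf)

x = torch.empty(2, 8, 4096 // 2, dtype=torch.uint8)
x_sf = torch.empty(2, 8, 4096 // 16, dtype=torch.uint8)
x2d, sf2d = postprocess_fp4_dispatch(x, x_sf)
```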
df6b097 to 5817c7f (Compare)
5817c7f to a854cc6 (Compare)
/bot run
PR_Github #12820 [ run ] triggered by Bot
LGTM.
PR_Github #12820 [ run ] completed with state
Signed-off-by: Yilin Zhang <[email protected]>
a854cc6 to d9df261 (Compare)
/bot reuse-pipeline
PR_Github #13133 [ reuse-pipeline ] triggered by Bot
PR_Github #13133 [ reuse-pipeline ] completed with state
Signed-off-by: Yilin Zhang <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]> Signed-off-by: Ransiki Zhang <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
DeepEP diff: https://github.com/deepseek-ai/DeepEP/compare/7b15af835942675df041eca2dcb9930b880287e1...edf3ea2b086a393d3163bf2773eab69d9191cc01?expand=1
Summary by CodeRabbit
New Features
Bug Fixes
Chores