Don't review - Testing: Support fp8 block wide ep #7209
Conversation
Signed-off-by: xxi <[email protected]>
modified: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
new file: tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
Signed-off-by: xxi <[email protected]>
📝 Walkthrough
Adds per-item debug checks in reducescatter; robustly gates MNNVL AllReduce initialization on strategy, tp/world-size, dtype, and runtime capability; conditions DeepseekV3 MLP TP sizing on allreduce strategy and MNNVL support; introduces a pluggable MoE backend API with Cutlass and DeepGemm backends and runtime selection; and adds a minor MoE load-balancer debug print.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Caller
participant Ops as distributed.ops
participant MNN as MNNVLAllReduce
Caller->>Ops: AllReduce.__init__(strategy, tp_size, world_size, dtype, mapping)
alt strategy in (AUTO, MNNVL)
alt tp_size == world_size and dtype provided
Ops->>MNN: is_mnnvl(mapping, dtype)?
alt supported
Ops->>MNN: try instantiate MNNVLAllReduce(mapping, dtype)
MNN-->>Ops: instance or raise
alt success
Ops-->>Caller: debug "MNNVL enabled"
else exception/fail
Ops-->>Caller: debug "MNNVL disabled"
end
else unsupported
Ops-->>Caller: debug "MNNVL cannot be enabled"
end
else size mismatch
Ops-->>Caller: debug "MNNVL disabled (size mismatch)"
end
else other strategy
Ops-->>Caller: debug "MNNVL disabled (strategy)"
end
sequenceDiagram
autonumber
participant Layer as WideEPMoE
participant Selector as MoEBackendSelection
participant Cut as MoECutlassBackend
participant DG as MoEDeepGemmBackend
Layer->>Selector: select_backend(module=self)
alt SM==100 and module/runtime prefers DeepGemm
Selector-->>Layer: MoEDeepGemmBackend
Layer->>DG: run_moe(..., module=self)
DG-->>Layer: final_hidden_states
else
Selector-->>Layer: MoECutlassBackend
Layer->>Cut: run_moe(..., module=self)
Cut-->>Layer: final_hidden_states
end
Layer-->>Layer: continue with alltoall / post-processing
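The runtime selection the second diagram describes boils down to a couple of checks. The sketch below is illustrative only: the class names follow this PR (the diff itself mixes the spellings MoECutlassBackend/MoeDeepGemmBackend), the SM probe is a stand-in for the project's get_sm_version(), and the callable-safe flag handling mirrors the fix suggested later in this review.

import torch


class MoECutlassBackend:  # placeholder stand-ins for the PR's backend classes
    ...


class MoEDeepGemmBackend:
    ...


def _sm_version() -> int:
    # Stand-in for the project's get_sm_version(); SM 100 is the Blackwell generation.
    major, minor = torch.cuda.get_device_capability(0)
    return major * 10 + minor


def select_backend(module) -> object:
    """DeepGemm only on SM100 with DeepSeek FP8 block scales; Cutlass otherwise."""
    flag = getattr(module, "has_deepseek_fp8_block_scales", None)
    has_block_fp8 = bool(flag() if callable(flag) else flag)
    if _sm_version() == 100 and has_block_fp8:
        return MoEDeepGemmBackend()
    return MoECutlassBackend()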
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
Possibly related PRs
Suggested reviewers
Actionable comments posted: 3
🧹 Nitpick comments (5)
tensorrt_llm/_torch/distributed/ops.py (1)
460-484
: AUTO now considers MNNVL; add a safety check on tp/world mismatch and keep graceful fallback — LGTM with one gating question
- The broadened gating (AUTO or MNNVL) with a try/except fallback is solid. Good call on logging enable/disable states and exceptions.
- The new tp_size != world_size guard disables MNNVL when TP is a subgroup (e.g., DP>1). Is this strictly required by MNNVL AR? The workspace and kernels appear to use the TP group, so MNNVL might still be valid with dp_size > 1. If the intent is “MNNVL only when TP spans the full world,” consider documenting that assumption or keying the check off the actual communicators used by MNNVL rather than global world_size to avoid unnecessary disablement.
If the restriction is not required, consider this change:
- if self.mapping.tp_size != self.mapping.world_size:
+ # Disable only when the TP group spans multiple process groups in a way MNNVL can't handle.
+ # If the MNNVL implementation only depends on the TP group, disablement on DP>1 may be unnecessary.
+ if getattr(self.mapping, "tp_spans_world", None) is False:
      logger.debug(
-         f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} "
-         f"!= world_size:{self.mapping.world_size}")
+         f"MNNVLAllReduce is disabled: TP group does not span the intended MNNVL domain.")
      self.mnnvl_allreduce = None
Or, if you do require world_size equality, add a one-liner comment to codify the constraint to help future readers.
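For orientation, the overall gating order under discussion (strategy, then group sizes, then dtype/capability, then a guarded construction) can be sketched as below. This is not the actual ops.py code; the helper name is made up, the import paths follow this review's code graph, and the broad except simply mirrors the try/except fallback being praised above.

from tensorrt_llm.functional import AllReduceStrategy
from tensorrt_llm.logger import logger
from tensorrt_llm._torch.distributed.ops import MNNVLAllReduce


def maybe_create_mnnvl_allreduce(strategy, mapping, dtype):
    """Return an MNNVLAllReduce instance or None, logging why it was skipped."""
    if strategy not in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL):
        logger.debug("MNNVLAllReduce disabled by strategy.")
        return None
    if mapping.tp_size != mapping.world_size:
        logger.debug("MNNVLAllReduce disabled: tp_size != world_size.")
        return None
    if dtype is None or not MNNVLAllReduce.is_mnnvl(mapping, dtype):
        logger.debug("MNNVLAllReduce cannot be enabled for this dtype/platform.")
        return None
    try:
        return MNNVLAllReduce(mapping, dtype)
    except Exception:  # mirrors the graceful fallback noted above
        logger.debug("MNNVLAllReduce disabled due to an initialization error.")
        return None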
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
742-759
: Avoid repetitive capability probes for MNNVL in hot paths
The call to MNNVLAllReduce.is_mnnvl(...) can be relatively expensive (imports + capability checks). If this is executed per layer instantiation, consider caching the result on the model_config or mapping (e.g., a lazy singleton per process) to reduce overhead.
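One lightweight way to realize the caching suggested here is to memoize the probe per (mapping, dtype). The helper below is only a sketch: the cache attribute name is hypothetical, and the import path follows the code graph listed in this review.

import torch

from tensorrt_llm._torch.distributed.ops import MNNVLAllReduce


def mnnvl_supported(mapping, dtype: torch.dtype) -> bool:
    """Probe MNNVL support once per (mapping, dtype) and reuse the result."""
    cache = getattr(mapping, "_mnnvl_support_cache", None)  # hypothetical attribute
    if cache is None:
        cache = {}
        mapping._mnnvl_support_cache = cache
    if dtype not in cache:
        cache[dtype] = MNNVLAllReduce.is_mnnvl(mapping, dtype)
    return cache[dtype]

A per-process singleton hung off model_config would work equally well; the key point is that the import and capability checks run once, not once per layer.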
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)
568-616
: Avoid per-call workspace allocations in DeepGemm path
_get_deepgemm_workspace
allocates multiple large CUDA tensors on every forward. Cache by (m_max, group_size) to avoid repeated alloc/free, which can fragment memory and add latency in steady state.
Example:
 class MoeDeepGemmBackend(MoEBackend):
     def __init__(self):
         ...
         self._workspace_cache = {}
@@
-    def _get_deepgemm_workspace(self, module: Any, m_max: int,
-                                group_size: int) -> Dict[str, torch.Tensor]:
+    def _get_deepgemm_workspace(self, module: Any, m_max: int,
+                                group_size: int) -> Dict[str, torch.Tensor]:
+        key = (module.expert_size_per_partition, m_max, group_size,
+               module.hidden_size, module.intermediate_size)
+        if key in self._workspace_cache:
+            return self._workspace_cache[key]
         ...
-        return workspace
+        self._workspace_cache[key] = workspace
+        return workspace
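The same idea in a self-contained form, independent of the PR's classes: a small cache keyed by whatever shapes determine the workspace. The tensor names and sizes below are placeholders, not the real DeepGemm buffer layout.

from typing import Callable, Dict, Tuple

import torch

WorkspaceKey = Tuple[int, ...]


class WorkspaceCache:
    """Reuse large CUDA buffers across forward calls instead of reallocating."""

    def __init__(self) -> None:
        self._cache: Dict[WorkspaceKey, Dict[str, torch.Tensor]] = {}

    def get(self, key: WorkspaceKey,
            alloc: Callable[[], Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
        workspace = self._cache.get(key)
        if workspace is None:
            workspace = alloc()  # allocate once per unique shape key
            self._cache[key] = workspace
        return workspace


# Usage with placeholder shapes:
cache = WorkspaceCache()
ws = cache.get((8, 1024, 7168), lambda: {
    "act": torch.empty(8, 1024, 7168, device="cuda", dtype=torch.bfloat16),
})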
316-413
: Cutlass compute_moe contract — return type clarityYou return a list
[output]
in the non–min-latency case and a bare tensor in min latency viarun_moe_min_latency
. Document this in the docstring and keep it aligned with callers. With the DeepGemm fix above, both backends will behave consistently for non–min-latency.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
680-708
: Route fused MoE through pluggable backend — LGTM with one robustness tweakThe call mirrors the original fused_moe signature and passes module for backend-specific needs — good. After addressing the DeepGemm return type in moe_backend.py, this will work. For extra safety at call sites, make the unpacking resilient to both list and tensor returns.
Apply:
- # Only in cutlass_min_latency_mode, the output is a list of tensors.
- # Otherwise, the output should be unpacked as a single tensor.
- final_hidden_states = final_hidden_states[0]
+ # Be robust to either a bare tensor or a single-element list/tuple
+ if isinstance(final_hidden_states, (list, tuple)):
+     final_hidden_states = final_hidden_states[0]
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py
(1 hunks)tensorrt_llm/_torch/models/modeling_deepseekv3.py
(3 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(7 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/distributed/ops.py (4)
tensorrt_llm/functional.py (1)
AllReduceStrategy
(3876-3885)tensorrt_llm/_torch/distributed/communicator.py (2)
tp_size
(46-47)world_size
(26-27)tensorrt_llm/llmapi/llm_args.py (2)
world_size
(246-258)world_size
(267-274)tensorrt_llm/logger.py (1)
debug
(143-144)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/functional.py (1)
AllReduceStrategy
(3876-3885)tensorrt_llm/_torch/distributed/ops.py (2)
MNNVLAllReduce
(290-396)is_mnnvl
(316-323)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
MoEBackendSelection
(823-866)select_backend
(835-866)run_moe
(101-212)run_moe
(413-533)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)deepgemm_fp8_group_blockwise_gemm
(246-284)tensorrt_llm/quantization/utils/fp8_utils.py (1)
silu_and_mul_masked_post_quant_fwd
(304-375)tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
has_deepseek_fp8_block_scales
(127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
742-759
: MLP TP-size capping conditioned on strategy and MNNVL capability — LGTM and consistent with design
- The gating to avoid inter-node TP when MNNVL is not in use or unsupported is reasonable.
- This intentionally differs from AllReduce.init’s guard and is expected by design per prior learnings (DeepSeek V3 decisions). No change requested.
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
251-275
: Verify that MoERunner and profiling paths accept optional biases as None
You pass w3_w1_bias / w2_bias potentially as None to AutoTuner/MoERunner. Ensure the custom op path gracefully handles None for these inputs during profiling; if not, use zero-sized or zero-filled tensors of appropriate dtype/shape.
Minimal repro to validate:
# Expected: no crash during profiling when biases are None
# with realistic dtypes and small shapes
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
326-329
: Backend-specific quant method choice for SM100 — LGTMSelecting DeepGemm-specific FP8 block-scales method only on SM100 keeps behavior correct on other arches. No change requested.
352-354
: Initialize backend after weights — LGTMDeferring backend selection until weights exist makes sense, since backend needs module state. No change requested.
680-708
: Confirm quant_scales layout matches backend expectationDeepGemm expects pre-transformed scales (see DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm.transform_sf_into_required_layout) while Cutlass expects original layout. Ensure
self.quant_scales
correspond to the runtime-selected backend.
If there's any chance of mismatch, add an assert near backend selection:
assert (isinstance(self.quant_method, DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm)) == isinstance(self.moe_backend, MoeDeepGemmBackend)
Or gate a small shape/layout check before the call.
use_deepseek_fp8_block_scale=getattr(
    module, 'has_deepseek_fp8_block_scales', False),
use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
use_int8_woq_per_channel=getattr(module,
                                 'has_int8_woq_per_channel',
                                 False),
use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
                              False),
min_latency_mode=min_latency_mode,
use_fused_finalize=getattr(module, 'use_fused_finalize', False),
)
Bug: passing a bound method where a bool is expected (missing “()”)
has_deepseek_fp8_block_scales
is a method, but it’s passed as an attribute to MoERunner. This will always be truthy when present, enabling FP8 block-scale paths unintentionally.
Apply:
- use_deepseek_fp8_block_scale=getattr(
- module, 'has_deepseek_fp8_block_scales', False),
+ use_deepseek_fp8_block_scale=(
+ module.has_deepseek_fp8_block_scales()
+ if hasattr(module, 'has_deepseek_fp8_block_scales') else False),
Also double-check other getattr flags (e.g., has_int8_woq_per_channel
, has_mxfp8_act_scaling
) to ensure they are booleans, not methods.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
use_deepseek_fp8_block_scale=(
    module.has_deepseek_fp8_block_scales()
    if hasattr(module, 'has_deepseek_fp8_block_scales') else False),
use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
use_int8_woq_per_channel=getattr(module,
                                 'has_int8_woq_per_channel',
                                 False),
use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
                              False),
min_latency_mode=min_latency_mode,
use_fused_finalize=getattr(module, 'use_fused_finalize', False),
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 264 to 274,
a bound method (has_deepseek_fp8_block_scales) is being passed to MoERunner via
getattr without calling it, causing the flag to be truthy even when it should
evaluate to False; change the getattr call to invoke the method when present
(e.g., getattr(module, 'has_deepseek_fp8_block_scales', False)()) or better:
check if the attribute is callable and call it, otherwise use its boolean value,
and apply the same guard/call pattern to the other flags
(has_int8_woq_per_channel, has_mxfp8_act_scaling, etc.) so each flag passed is a
real boolean rather than a bound method.
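The failure mode is easy to reproduce in isolation: a bound method is always truthy, so the flag never reflects the method's return value. The tiny helper below is hypothetical (not part of the PR) and shows one way to coerce method-or-bool flags safely.

class FakeModule:
    """Toy stand-in for the MoE module."""

    def has_deepseek_fp8_block_scales(self) -> bool:
        return False  # block-FP8 quantization is NOT active


def as_bool_flag(obj, name: str) -> bool:
    """Return a real boolean whether the attribute is a method or a plain value."""
    attr = getattr(obj, name, False)
    return bool(attr() if callable(attr) else attr)


m = FakeModule()
print(bool(getattr(m, "has_deepseek_fp8_block_scales", False)))  # True: the bound method itself is truthy
print(as_bool_flag(m, "has_deepseek_fp8_block_scales"))          # False: the intended value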
Force-pushed 91eb8b8 to b206849
Actionable comments posted: 1
♻️ Duplicate comments (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
264-274
: Bug: passing a bound method where a bool is expected (missing “()”)
has_deepseek_fp8_block_scales
is a method on the module (see interface.py), but it’s passed through toMoERunner
as an attribute. This will always be truthy and incorrectly enable block-FP8 paths.
Apply:
- use_deepseek_fp8_block_scale=getattr(
-     module, 'has_deepseek_fp8_block_scales', False),
+ use_deepseek_fp8_block_scale=(
+     module.has_deepseek_fp8_block_scales()
+     if hasattr(module, 'has_deepseek_fp8_block_scales') and
+     callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+     else bool(getattr(module, 'has_deepseek_fp8_block_scales', False))
+ ),
Optional: audit similar flags (
has_int8_woq_per_channel
,has_mxfp8_act_scaling
) if any are methods.
859-868
: Backend selection condition always true when attribute exists (missing “()”)Same issue here: treating
has_deepseek_fp8_block_scales
as a property makeshas_block_fp8
true whenever the attribute exists, even if the method would return False.
Apply:
- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                  and module.has_deepseek_fp8_block_scales)
+ flag = getattr(module, 'has_deepseek_fp8_block_scales', None)
+ has_block_fp8 = bool(flag() if callable(flag) else flag)
822-822
: Unify run_moe return contract across backends to List[Tensor]DeepGemm returns a bare Tensor while Cutlass returns a one-element list, and the caller unconditionally indexes
[0]
. This will slice/corrupt output on DeepGemm.
Apply:
- return final_hidden_states
+ return [final_hidden_states]
Follow-up: with this change, both backends consistently return
List[Tensor]
. Consider updating the base interface’s type hints/docstring accordingly (see next comment).
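A minimal sketch of the unified contract being proposed, abstract base only; the real MoEBackend in moe_backend.py takes many more parameters, and this is just the shape of the signature change.

from abc import ABC, abstractmethod
from typing import Any, List

import torch


class MoEBackend(ABC):
    """Unified contract: run_moe returns a list of output tensors."""

    @abstractmethod
    def run_moe(self, x: torch.Tensor, *args: Any,
                **kwargs: Any) -> List[torch.Tensor]:
        """Return a single-element list in the non-min-latency path."""
        raise NotImplementedError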
🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replacelogger.debug
and fix E501 long lineAvoid stdout in library code; use the project logger. Also split the long f-string to satisfy Ruff (E501).
Apply:
- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + logger.debug( + "maybe_create_moe_load_balancer: in_supported_model_arch=%s, using_ep=%s, " + "using_smart_router=%s, model_config.moe_load_balancer=%s", + bool(in_supported_model_arch), + bool(using_ep), + bool(using_smart_router), + model_config.moe_load_balancer, + )Note: casting to bool avoids logging None for flag-like variables when
mapping
is None. If you prefer one-time logging, switch tologger.info_once
(if available) instead ofdebug
.tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)
101-113
: Type hints/docstring: reflect List[Tensor] return to match unified APIAfter unifying the contract, adjust type hints and docs so callers and static tooling don’t get misled.
Apply:
- def run_moe( + def run_moe( ... - **kwargs) -> torch.Tensor: + **kwargs) -> List[torch.Tensor]: @@ - Returns: - Computed MoE output tensor + Returns: + List[torch.Tensor]: one tensor per pipeline (single-element list in current backends)And similarly in
compute_moe
:- def compute_moe( + def compute_moe( ... - **kwargs) -> torch.Tensor: + **kwargs) -> List[torch.Tensor]:Also applies to: 136-138
381-383
: Fix E501: break long conditional assignment lineReadability and Ruff compliance.
Apply:
- run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe + run_moe = ( + self.moe_runner.fused_moe_runner.run_moe_min_latency + if min_latency_mode + else self.moe_runner.fused_moe_runner.run_moe + )tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
402-405
: Replacelogger.debug
and fix E501Avoid stdout spam in multi-rank runs; also satisfy Ruff (E501).
Apply:
- print( - f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}" - ) + logger.debug( + "wide_ep forward_chunk: layer_load_balancer=%s, is_first_call=%s, is_last_call=%s", + bool(self.layer_load_balancer), + is_first_call, + is_last_call, + )
719-719
: Make caller robust to either Tensor or List[Tensor] (defensive until backends converge)Even with backend unification, this guard is harmless and future-proof.
Apply:
- final_hidden_states = final_hidden_states[0] + if isinstance(final_hidden_states, list): + final_hidden_states = final_hidden_states[0]
📒 Files selected for processing (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(9 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
MoEBackendSelection
(825-868)select_backend
(837-868)run_moe
(101-212)run_moe
(413-533)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
get_moe_load_balancer
(1027-1037)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/quantization/utils/fp8_utils.py (1)
silu_and_mul_masked_post_quant_fwd
(304-375)tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
has_deepseek_fp8_block_scales
(127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
403-403: Line too long (144 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
🔇 Additional comments (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
326-330
: Good: SM100-aware quant-method swap for block FP8Selecting
DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
only on SM100 keeps legacy paths intact elsewhere. Looks correct.
352-354
: Good: backend selection deferred until after weights are createdInitializing
self.moe_backend
post-create_weights()
guarantees the module has everything the backend needs to introspect.
482-487
: LGTM: guardget_local_statistic_tensor()
with both LB presence and last-callPrevents invalid access when LB is disabled or on intermediate repeats.
684-712
: Forward delegates to pluggable backend correctly (module passed, tuner args wired)The call mirrors the original
fused_moe
signature and threads through tuning knobs and module context. Once the backend return type is unified toList[Tensor]
, this section is good.
Force-pushed b206849 to ed20793
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
1-1
: Missing NVIDIA copyright headerPer coding guidelines, prepend the NVIDIA copyright header (current year) to this source file.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
1-1
: Missing NVIDIA copyright headerPer coding guidelines, prepend the NVIDIA copyright header (current year) to this source file.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
♻️ Duplicate comments (4)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
254-274
: MoERunner constructed with storage dtype; should use view dtype used at runtime
weight_view_dtype
is computed (e.g.,torch.quint4x2
for W4), butMoERunner
is created withmodule.w3_w1_weight.dtype
. This can desynchronize tuning from runtime execution.- weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype,
264-271
: Passing a bound method instead of a boolean for flags (missing “()”)
has_deepseek_fp8_block_scales
is a method on the module. Passing it without calling makes the flag truthy whenever the attribute exists.- use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') else False + ),Also audit other boolean flags if they can be methods and guard with
callable(...)
.
822-822
: Unify return type across backends to List[Tensor]DeepGemm returns a bare
Tensor
, while Cutlass returnsList[Tensor]
. Callers (e.g., WideEPMoE.forward_chunk) index[0]
unconditionally; returning a bare tensor will slice the first token and corrupt output.- return final_hidden_states + return [final_hidden_states]
859-868
: Backend selection wrongly treats method as a property; FP8 block path always true when attribute exists
module.has_deepseek_fp8_block_scales
is a method. Current code evaluates truthiness of the attribute, not its return value, enabling the DeepGemm path unintentionally.- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + has_block_fp8 = ( + hasattr(module, 'has_deepseek_fp8_block_scales') and + ( + module.has_deepseek_fp8_block_scales() + if callable(getattr(module, 'has_deepseek_fp8_block_scales')) + else bool(getattr(module, 'has_deepseek_fp8_block_scales')) + ) + )
🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replace noisy print with structured logging and avoid long line (Ruff E501)Unconditional print will spam multi-rank logs and breaks line-length rule. Use
logger.info_once
(or rank-gated logging) and compress the payload; do not print the entiremodel_config.moe_load_balancer
object.- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + # Log once to avoid multi-rank spam; keep message under 120 chars per line + has_moe_lb = model_config.moe_load_balancer is not None + logger.info_once( + "maybe_create_moe_load_balancer: " + f"in_supported_model_arch={in_supported_model_arch}, " + f"using_ep={bool(using_ep)}, " + f"using_smart_router={bool(using_smart_router)}, " + f"has_moe_lb={has_moe_lb}", + key="moe_lb_debug", + )tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
382-382
: Break long line per Ruff E501 and improve readabilityLine exceeds 120 chars. Assign the function to a local variable with line breaks.
- run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe + run_moe = ( + self.moe_runner.fused_moe_runner.run_moe_min_latency + if min_latency_mode + else self.moe_runner.fused_moe_runner.run_moe + )tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
326-338
: Replace debug prints with structured, rank-safe logging (and keep lines short)These
logger.info_once
(or rank-gated logs) to emit concise messages once.- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.info_once( + f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}", + key="wide_ep_quant_sm" + ) @@ - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.info_once( + "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm", + key="wide_ep_quant_method" + ) @@ - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.info_once( + "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod", + key="wide_ep_quant_method" + )
411-413
: Replace debug print with structured logging and fix long line (Ruff E501)Same rationale as above; this print is noisy across ranks and exceeds 120 columns.
- print( - f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}" - ) + logger.info_once( + "wide_ep forward_chunk: " + f"layer_load_balancer={bool(self.layer_load_balancer)}, " + f"is_first_call={is_first_call}, " + f"is_last_call={is_last_call}", + key="wide_ep_forward_chunk_debug", + )
403-408
: Tidy up redundant assignment
output_dtype = output_dtype
is a no-op. Simplify the block; keep the assert for FP4 tensors.- if isinstance(x, Fp4QuantizedTensor): - assert output_dtype is not None - output_dtype = output_dtype - else: - output_dtype = x.dtype + if isinstance(x, Fp4QuantizedTensor): + assert output_dtype is not None, "Provide output_dtype when passing Fp4QuantizedTensor" + else: + output_dtype = x.dtype
726-729
: Be robust to backend return type until all backends are alignedEven after unifying backends, making the caller tolerant avoids future regressions.
- # Only in cutlass_min_latency_mode, the output is a list of tensors. - # Otherwise, the output should be unpacked as a single tensor. - final_hidden_states = final_hidden_states[0] + # Backends return List[Tensor] (even for single output); keep robust. + if isinstance(final_hidden_states, list): + final_hidden_states = final_hidden_states[0]
📒 Files selected for processing (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(9 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
MoEBackendSelection
(825-868)select_backend
(837-868)run_moe
(101-212)run_moe
(413-533)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
get_moe_load_balancer
(1027-1037)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
has_deepseek_fp8_block_scales
(127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
412-412: Line too long (144 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
Force-pushed ed20793 to 36f4493
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/distributed/ops.py (1)
463-489
: Missingdtype
argument disables MNNVLAllReduce under AUTO/MNNVL
AllReduce’s constructor only enables MNNVL when a non-Nonedtype
is passed. Call sites that omit thedtype
parameter will never activate MNNVL (even ifstrategy
is AUTO/MNNVL andtp_size == world_size
). Please update the following invocations to include the appropriatedtype
(e.g.config.torch_dtype
):
tensorrt_llm/_torch/models/modeling_qwen3_moe.py
• Line 93:AllReduce(mapping=…, strategy=…)
• Line 187:AllReduce(mapping=…, strategy=…)
tensorrt_llm/_torch/models/modeling_deepseekv3.py
• Line 485:AllReduce(mapping=…, strategy=…)
tensorrt_llm/_torch/models/modeling_llama.py
• Lines 304–307:AllReduce(mapping=…, strategy=…)
• Lines 455–456:AllReduce(mapping=…, strategy=…)
• Line 645:AllReduce(mapping=…)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
• Line 72:AllReduce(mapping=…, strategy=…)
For each, modify to:
- AllReduce(mapping=…, strategy=…, …) + AllReduce(mapping=…, strategy=…, dtype=config.torch_dtype)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
700-728
: Contract mismatch risk: backend.run_moe return typeDownstream you index final_hidden_states[0]. Ensure all backends return List[Tensor] to avoid slicing a Tensor (first-token bug). This will be fixed if DeepGemm returns [tensor], but add a defensive guard here or in the backend meantime.
No change needed here if you accept the backend fix in moe_backend.py. If not, adjust the unpack site below (Lines 733-736).
733-736
: Defensive unpack to handle both List[Tensor] and TensorUntil all backends consistently return List[Tensor], avoid accidental first-dimension slicing.
- # Only in cutlass_min_latency_mode, the output is a list of tensors. - # Otherwise, the output should be unpacked as a single tensor. - final_hidden_states = final_hidden_states[0] + # Normalize backend output to a tensor + if isinstance(final_hidden_states, (list, tuple)): + final_hidden_states = final_hidden_states[0]
♻️ Duplicate comments (5)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
253-275
: MoERunner should use the actual weight view dtypeDuring tuning you view weights with a possibly different dtype (e.g., W4 group scaling). Pass that view dtype into MoERunner for consistency.
- self.moe_runner = MoERunner( + self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, tp_size=module.tp_size,
264-272
: Don’t pass a bound method as a bool flag; call it or cast safelymodule.has_deepseek_fp8_block_scales is a method; passing it directly will always be truthy.
- use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') + and callable(module.has_deepseek_fp8_block_scales) + else bool(getattr(module, 'has_deepseek_fp8_block_scales', False)) + ),Also double-check similar flags (has_int8_woq_per_channel, has_mxfp8_act_scaling) are booleans, not callables.
822-823
: Unify return type: wrap DeepGemm output in a listCallers assume a List[Tensor] in the non-min-latency path. Returning a bare tensor here will corrupt outputs when indexed.
- return final_hidden_states + return [final_hidden_states]
836-868
: Fix backend selection: call has_deepseek_fp8_block_scales()The check currently treats the method as a property; it should be invoked (and safely guarded).
- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + has_block_fp8 = ( + hasattr(module, 'has_deepseek_fp8_block_scales') + and ( + module.has_deepseek_fp8_block_scales() + if callable(getattr(module, 'has_deepseek_fp8_block_scales')) + else bool(getattr(module, 'has_deepseek_fp8_block_scales')) + ) + )
1-4
: Add mandatory NVIDIA copyright headerPer repo guidelines, prepend the NVIDIA copyright header to all source files.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# """ MoE Backend abstraction for supporting different MoE computation implementations. This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.) """
🧹 Nitpick comments (9)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replace print with project logger and wrap long lineUse the project logger instead of an unconditional print; also fix the E501 long line warning by splitting the message.
- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + logger.debug( + "maybe_create_moe_load_balancer: " + f"in_supported_model_arch={in_supported_model_arch}, " + f"using_ep={using_ep}, " + f"using_smart_router={using_smart_router}, " + f"model_config.moe_load_balancer={model_config.moe_load_balancer}" + )tensorrt_llm/_torch/distributed/ops.py (1)
243-251
: Avoid stdout in hot path; log mismatch and strengthen assertion messageUnconditional print in collective validation will spam logs and hurt perf. Prefer debug logging and attach context to the assertion.
- for val in input: - if val is not None and val.shape[dim] != sum_split_size: - print( - f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}" - ) - assert all([ - val.shape[dim] == sum_split_size for val in input - if val is not None - ]) + for val in input: + if val is not None and val.shape[dim] != sum_split_size: + logger.debug( + "reducescatter size mismatch: val_shape=%s, dim=%d, actual=%d, expected=%d, sizes=%s", + tuple(val.shape), dim, val.shape[dim], sum_split_size, sizes, + ) + assert all( + val.shape[dim] == sum_split_size for val in input if val is not None + ), ( + f"reducescatter: mismatched input shapes along dim={dim}; " + f"expected {sum_split_size}, sizes={sizes}" + )tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)
382-383
: Wrap long line and improve readability of run_moe selectionRefactor the ternary for readability and to satisfy E501.
- run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe + if min_latency_mode: + run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency + else: + run_moe = self.moe_runner.fused_moe_runner.run_moe
569-618
: Avoid per-call large allocations in DeepGemm workspace_get_deepgemm_workspace allocates multiple large tensors per call. Cache workspaces keyed by (m_max, hidden/intermediate sizes, group size) on self to avoid repeated allocations.
Example implementation outside this hunk (add in init and use here):
# In __init__ self._workspace_cache = {} # In _get_deepgemm_workspace(...) key = (m_max, module.hidden_size, module.intermediate_size, group_size) ws = self._workspace_cache.get(key) if ws is None: ws = { ... allocate once ... } self._workspace_cache[key] = ws return wsHappy to provide a full patch if you want it wired end-to-end.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
311-325
: Replace prints with logger and trim long linesThese diagnostics should go through the project logger and avoid long lines.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Disable alltoall: chunking=%d", + self.calculate_num_chunks(all_rank_num_tokens), + ) @@ - print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Disable DeepEPLowLatency: max_tokens=%d > threshold=%d", + all_rank_max_num_tokens, self.deep_ep_max_num_tokens, + ) @@ - print(f"all to all type {self.alltoall_method_type}") + logger.debug("alltoall method type: %s", self.alltoall_method_type)
333-345
: Use logger for backend pick trace and avoid noisy stdoutLeverage logger and keep consistent formatting.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug("wide_ep _get_quant_method: sm=%s", get_sm_version()) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug("Select DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm") return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug("Select DeepSeekFP8BlockScalesFusedMoEMethod") return DeepSeekFP8BlockScalesFusedMoEMethod()
418-421
: Debug prints in forward path should use loggerReplace the print with logger.debug and keep messages short.
- print( - f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}" - ) + logger.debug( + "wide_ep forward_chunk: has_lb=%s, first=%s, last=%s", + self.layer_load_balancer is not None, is_first_call, is_last_call, + )
672-699
: Remove large commented-out legacy callThe old fused_moe call is fully preserved in comments. Consider removing it or moving to docs to reduce noise.
809-811
: Replace extra-verbose print with logger and keep it readableUse logger.debug and reduce verbosity; very long single-line prints violate E501 and clutter logs.
- # 一行打印所有信息 - print( - f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - ) + logger.debug( + "wide_ep forward: x.shape=%s, use_all_to_all=%s, " + "num_tokens=%s, num_tokens_padded=%s, max_tokens=%s, " + "use_dp_padding=%s, out.shape=%s", + getattr(x, "shape", None), + use_all_to_all, + all_rank_num_tokens, + all_rank_num_tokens_padded, + all_rank_max_num_tokens, + use_dp_padding, + getattr(outputs, "shape", None), + )
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(11 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
MoEBackendSelection
(825-868)select_backend
(837-868)run_moe
(101-212)run_moe
(413-533)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
get_moe_load_balancer
(1027-1037)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
AllReduceStrategy
(3876-3885)tensorrt_llm/_torch/distributed/communicator.py (2)
tp_size
(46-47)world_size
(26-27)tensorrt_llm/logger.py (1)
debug
(143-144)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
has_deepseek_fp8_block_scales
(127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
419-419: Line too long (144 > 120)
(E501)
810-810: Line too long (372 > 120)
(E501)
tensorrt_llm/_torch/distributed/ops.py
246-246: Line too long (156 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
237-239
: Lazy backend init acknowledged. Initializing moe_backend after weight creation is reasonable and keeps state consistent.
368-370
: Backend selection timing is good. Selecting the backend right after weight creation ensures quant flags are available.
498-503
: Good: guard stat fetch behind load balancer and last_call. The additional check avoids None access when LB is disabled.
36f4493 to bfb9f40 (Compare)
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/distributed/ops.py (2)
15-21
: Avoid shadowing the project logger. You import logger from tensorrt_llm.logger at Line 15, then overwrite it with logging.getLogger(__name__) at Line 20. This changes the behavior of logger.debug/info_once and can silently route logs incorrectly.
-from tensorrt_llm.logger import logger +from tensorrt_llm.logger import logger ... -_thread_local = threading.local() -logger = logging.getLogger(__name__) +_thread_local = threading.local()
243-251
: Use logger.debug instead of print in reducescatter and fix E501. Printing inside reducescatter is expensive and noisy. Prefer the project logger, and split the message to satisfy line length.
- for val in input: - if val is not None and val.shape[dim] != sum_split_size: - print( - f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}" - ) + for val in input: + if val is not None and val.shape[dim] != sum_split_size: + logger.debug( + "[reducescatter] shape_mismatch: val.shape=%s, dim=%d, " + "val_dim=%d, sum_split_size=%d, sizes=%s", + tuple(val.shape), dim, val.shape[dim], sum_split_size, sizes + )
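As an aside, a minimal standalone sketch of why the parameterized form is cheaper when debug logging is off. It uses the plain stdlib logging module for illustration only (the project logger may differ); class and message names are made up.

```python
import logging

logger = logging.getLogger("demo")
logging.basicConfig(level=logging.INFO)  # DEBUG messages are filtered out


class Expensive:
    def __str__(self):
        print("formatting happened")  # visible side effect when formatting runs
        return "expensive"


val = Expensive()

# f-string: the message is built eagerly, even though DEBUG is disabled.
logger.debug(f"value={val}")      # prints "formatting happened"

# %-style arguments: formatting is deferred and skipped entirely here.
logger.debug("value=%s", val)     # no side effect
```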
♻️ Duplicate comments (10)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (9)
1-1
: Missing NVIDIA copyright header. Per coding guidelines, prepend the NVIDIA copyright header (current year) to this new file.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. """ MoE Backend abstraction for supporting different MoE computation implementations.
264-274
: Fix incorrect flag handling for has_deepseek_fp8_block_scales
has_deepseek_fp8_block_scales
is a method, but it's being passed without calling it. This will always be truthy when the attribute exists, incorrectly enabling FP8 block-scale paths. Apply:
- use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') and callable(getattr(module, 'has_deepseek_fp8_block_scales')) + else False),Also verify other flags (e.g.,
has_int8_woq_per_channel
,has_mxfp8_act_scaling
) to ensure they are booleans, not methods.
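For illustration, a small self-contained sketch of the truthiness pitfall. The class and flag value below are made up; only the getattr-vs-call behavior is the point.

```python
class FakeMoE:
    def has_deepseek_fp8_block_scales(self):
        return False  # the real flag is only visible when the method is called


m = FakeMoE()

bound = getattr(m, "has_deepseek_fp8_block_scales", False)
print(bool(bound))   # True -- a bound method object is always truthy

called = m.has_deepseek_fp8_block_scales() if callable(bound) else bool(bound)
print(called)        # False -- the intended value
```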
253-255
: Fix weight_dtype inconsistency in MoERunner initialization. The MoERunner should be constructed with
weight_view_dtype
(the actual view dtype used at runtime), not the original storage dtype.self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype,
846-846
: Fix API contract: DeepGemm backend should return List[Tensor] to match Cutlass
WideEPMoE.forward_chunk
unconditionally indexes[0]
afterrun_moe
. Cutlass returns[output]
, but DeepGemm returns a bare tensor, which will slice the first token and corrupt output.- return final_hidden_states + return [final_hidden_states]
883-885
: Fix method call for has_deepseek_fp8_block_scales in backend selection
has_deepseek_fp8_block_scales
is a method but is being accessed as a property. This makeshas_block_fp8
true whenever the attribute exists.has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + and (module.has_deepseek_fp8_block_scales() + if callable(getattr(module, 'has_deepseek_fp8_block_scales')) + else bool(getattr(module, 'has_deepseek_fp8_block_scales'))))
251-275
: Two fixes: pass the correct weight view dtype and call boolean flags, not bound methods.
- weight_dtype should match the view dtype used during profiling (e.g., W4 paths), otherwise the MoERunner key diverges.
- has_deepseek_fp8_block_scales is a method; passing the bound method makes the flag always truthy.
if self.moe_runner is None: self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, tp_size=module.tp_size, tp_rank=module.tp_rank, ep_size=module.ep_size, ep_rank=module.ep_rank, cluster_size=module.cluster_size, cluster_rank=module.cluster_rank, - use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') else False + ), use_w4_group_scaling=getattr(module, 'has_w4afp8', False), use_int8_woq_per_channel=getattr(module, 'has_int8_woq_per_channel', False), use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling', False), min_latency_mode=min_latency_mode, use_fused_finalize=getattr(module, 'use_fused_finalize', False), )
846-846
: Unify return type to match caller (wrap DeepGemm output in a list).WideEPMoE.forward_chunk indexes [0]. Returning a bare Tensor here will slice the first token. Wrap in a one-element list, aligning with Cutlass.
- return final_hidden_states + return [final_hidden_states]
883-889
: Call the method when checking block-FP8 support.module.has_deepseek_fp8_block_scales is a method; the current code treats it as a property and will always be truthy if present.
- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + has_block_fp8 = ( + hasattr(module, "has_deepseek_fp8_block_scales") + and ( + module.has_deepseek_fp8_block_scales() + if callable(getattr(module, "has_deepseek_fp8_block_scales")) + else bool(getattr(module, "has_deepseek_fp8_block_scales")) + ) + )
1-4
: Add required NVIDIA copyright header to new file.Repo guideline requires the NVIDIA header on all source files. Add it above the module docstring.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# +""" MoE Backend abstraction for supporting different MoE computation implementations. This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.) """tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
739-739
: Robustness: handle both Tensor and List[Tensor] return types.Even after aligning DeepGemm to return a list, making the caller tolerant prevents future regressions.
- final_hidden_states = final_hidden_states[0] + if isinstance(final_hidden_states, list): + final_hidden_states = final_hidden_states[0]
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
311-326
: Use logger.debug instead of print in can_use_alltoall.Stdout spam in control flow harms perf and clobbers logs. Prefer logger.debug and keep messages short.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Disable alltoall: chunking=%d", + self.calculate_num_chunks(all_rank_num_tokens), + ) ... - print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Disable alltoall: deep_ep_max_tokens %d > %d", + all_rank_max_num_tokens, self.deep_ep_max_num_tokens, + ) - print(f"all to all type {self.alltoall_method_type}") + logger.debug("alltoall method type: %s", self.alltoall_method_type)
333-345
: Drop prints in _get_quant_method; log at debug level.These messages are useful during bring-up but too verbose for regular runs.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug("wide_ep::_get_quant_method: sm=%s", get_sm_version()) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug("wide_ep::using DeepGemm MoE method for block FP8") return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug("wide_ep::using Cutlass MoE method for block FP8") return DeepSeekFP8BlockScalesFusedMoEMethod()
701-728
: Small safety: assert backend is initialized when weights are created.create_weights assigns moe_backend, but if skip_create_weights_in_init=True and forward_chunk is called prematurely, this could be None. Either assert here or lazy-init on first use.
# Use the selected backend to compute MoE with the same parameters as fused_moe + assert self.moe_backend is not None, "MoE backend not initialized" final_hidden_states = self.moe_backend.run_moe( x, token_selected_slots, token_final_scales,
814-815
: Replace single-line mega print with logger.debug and shorten.This violates E501 and is too verbose for every forward.
- print( - f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - ) + logger.debug( + "forward: x=%s alltoall=%s tokens=%s padded=%s max=%s dp_pad=%s out=%s", + getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens, + all_rank_num_tokens_padded, all_rank_max_num_tokens, + use_dp_padding, getattr(outputs, "shape", None), + )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(11 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
MoEBackendSelection
(849-892)select_backend
(861-892)run_moe
(101-212)run_moe
(413-533)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
get_moe_load_balancer
(1027-1037)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/quantization/utils/fp8_utils.py (1)
silu_and_mul_masked_post_quant_fwd
(304-375)tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
has_deepseek_fp8_block_scales
(127-130)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
AllReduceStrategy
(3876-3885)tensorrt_llm/_torch/distributed/communicator.py (2)
tp_size
(46-47)world_size
(26-27)tensorrt_llm/logger.py (1)
debug
(143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
419-419: Line too long (472 > 120)
(E501)
731-731: Line too long (952 > 120)
(E501)
814-814: Line too long (372 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
721-721: Line too long (127 > 120)
(E501)
722-722: Line too long (129 > 120)
(E501)
725-725: Line too long (121 > 120)
(E501)
726-726: Line too long (125 > 120)
(E501)
839-839: Line too long (125 > 120)
(E501)
840-840: Line too long (125 > 120)
(E501)
tensorrt_llm/_torch/distributed/ops.py
246-246: Line too long (156 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
🔇 Additional comments (7)
tensorrt_llm/_torch/distributed/ops.py (2)
463-488
: LGTM: Enhanced MNNVL AllReduce initialization with broader runtime detection. The changes properly expand MNNVL support to AUTO strategy and add runtime checks for tp_size/world_size compatibility. The graceful fallback with debug logging is well-implemented.
463-489
: LGTM: safer MNNVL init path with AUTO fallback and explicit gating. Allowing MNNVL under AUTO and guarding by tp_size == world_size plus try/except improves robustness. The debug traces on enable/disable paths are helpful.
Please double-check that requiring tp_size == world_size is intended for your typical mappings (e.g., TP×DP×PP > 1) to avoid over-disabling MNNVL where it should be valid. If needed, we can gate on the actual communicator/group instead of global world_size.
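A toy sketch of the concern, with hypothetical sizes (not taken from any real config):

```python
# Toy numbers, for illustration only.
tp_size, pp_size, dp_size = 4, 2, 1
world_size = tp_size * pp_size * dp_size   # 8

# The guard above compares tp_size to the global world size, so any PP/DP
# dimension > 1 disables MNNVL even if the TP group itself could support it.
print(tp_size == world_size)   # False -> MNNVL would be turned off
```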
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
498-499
: Good defensive programming with conditional check. The added conditional check for
self.layer_load_balancer
before calling methods prevents potential null pointer issues.
237-239
: Good modular design with backend selection after weights creation. The initialization of MoE backend after weights creation ensures all necessary properties are available for backend selection. The lazy initialization pattern is well-structured.
Also applies to: 368-370
701-728
: Preserved original fused_moe call as reference documentation. Keeping the original call as a comment provides valuable reference for future maintenance. The backend call maintains the same parameter structure ensuring compatibility.
498-505
: LGTM: guard load-balancer stat retrieval by is_last_call. This avoids early access and aligns with staged statistic updates.
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
381-412
: Contract: return a list (non-min-latency) to match caller behavior. WideEPMoE.forward_chunk unconditionally indexes [0] later. Ensure Cutlass backend consistently returns a list in non-min-latency mode (already does) and keep min-latency path returning a Tensor. No change needed here; just highlighting the contract for the DeepGemm backend below.
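A small torch sketch of why the [0] contract matters; the shapes are made up and only illustrate the indexing difference.

```python
import torch

hidden = torch.randn(4, 8)   # toy output: 4 tokens, hidden size 8

as_list = [hidden]           # Cutlass-style: one-element list
as_tensor = hidden           # bare tensor

# Caller pattern: final_hidden_states = backend_result[0]
print(as_list[0].shape)      # torch.Size([4, 8]) -> the full output
print(as_tensor[0].shape)    # torch.Size([8])    -> only the first token's row
```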
    for val in input:
        if val is not None and val.shape[dim] != sum_split_size:
            print(
                f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
            )
Remove debug print statement from production code
Debug print statements should not be in production code. Consider using proper logging instead.
- for val in input:
- if val is not None and val.shape[dim] != sum_split_size:
- print(
- f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
- )
+ for val in input:
+ if val is not None and val.shape[dim] != sum_split_size:
+ logger.debug(
+ f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    for val in input:
        if val is not None and val.shape[dim] != sum_split_size:
            logger.debug(
                f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
            )
🧰 Tools
🪛 Ruff (0.12.2)
246-246: Line too long (156 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/distributed/ops.py around lines 243 to 247, remove the
debug print statement that outputs shape and size info; replace it with a proper
logging call using the module logger (e.g., logger.debug or logger.warning) and
include the same contextual message and variables so it's available in logs but
not printed to stdout; ensure the module imports and uses the logger
consistently and that the log level chosen is appropriate for diagnostics.
            print(
                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
            )
Remove debug print statement from production code
Debug print statements should be removed.
- print(
- f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
- )
+ logger.debug(
+ f"Cannot use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
            logger.debug(
                f"Cannot use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
            )
🧰 Tools
🪛 Ruff (0.12.2)
320-320: Line too long (127 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around lines
319-321, remove the debug print statement; replace it with a proper logging call
(e.g., module logger or logging.warning) or drop it entirely if no runtime
notice is needed. If using logging, ensure an appropriate logger is available or
import logging and call logger.warning(f"cannot use alltoall due to
deep_ep_max_num_tokens {all_rank_max_num_tokens} >
{self.deep_ep_max_num_tokens}") to preserve context instead of printing to
stdout.
            return False

        print(f"all to all type {self.alltoall_method_type}")
Remove debug print statement from production code
Debug print statement should be removed.
- print(f"all to all type {self.alltoall_method_type}")
+ logger.debug(f"Alltoall type: {self.alltoall_method_type}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        logger.debug(f"Alltoall type: {self.alltoall_method_type}")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around line 324,
there is a debug print statement printing the all-to-all method type; remove
this print and replace it with either no output or a proper logger call if
runtime visibility is required (use the module's logger or Python logging at an
appropriate log level). Ensure no stray print() calls remain in production code
and run tests/linters to confirm removal.
        print(
            f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
        )
        if get_sm_version() == 100:
            print(
                f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
            )
            return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
        else:
            print(
                f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
            )
            return DeepSeekFP8BlockScalesFusedMoEMethod()
Remove debug print statements from production code
Multiple debug print statements in the quantization method selection should be removed.
- print(
- f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
- )
+ logger.debug(
+ f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
+ )
if get_sm_version() == 100:
- print(
- f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
- )
+ logger.debug(
+ "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+ )
return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
else:
- print(
- f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
- )
+ logger.debug(
+ "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
+ )
return DeepSeekFP8BlockScalesFusedMoEMethod()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        logger.debug(
            f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
        )
        if get_sm_version() == 100:
            logger.debug(
                "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
            )
            return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
        else:
            logger.debug(
                "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
            )
            return DeepSeekFP8BlockScalesFusedMoEMethod()
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around lines 333
to 345 there are debug print() calls in _get_quant_method; remove these print
statements so production code has no stdout debug noise, and if project logging
is available, replace them with appropriate logger.debug(...) calls, preserving
the same conditional flow and returned classes (no behavioral change).
            module.routing_method.top_k if module else 1,
            expert_size_per_partition,
            tp_size,
            tp_rank,
            ep_size,
            ep_rank,
        )
Fix attribute: use experts_per_token (not top_k) for routing method.
Routing methods expose experts_per_token elsewhere in this codebase. Using .top_k risks AttributeError at runtime.
- module.routing_method.top_k if module else 1,
+ module.routing_method.experts_per_token if module else 1,
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
            module.routing_method.experts_per_token if module else 1,
            expert_size_per_partition,
            tp_size,
            tp_rank,
            ep_size,
            ep_rank,
        )
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 828 to 834,
the code accesses module.routing_method.top_k which can raise AttributeError
because routing methods in this codebase expose experts_per_token; replace the
.top_k access with module.routing_method.experts_per_token (or a safe fallback
to 1 if missing) so the routing parameter uses the correct attribute name and
avoids runtime errors.
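A minimal sketch of the safe-fallback variant suggested above; the helper name is hypothetical and not an existing API in the module.

```python
# Hypothetical helper sketching the "safe fallback to 1" mentioned above.
def experts_per_token(routing_method, default=1):
    value = getattr(routing_method, "experts_per_token", None)
    if value is None:
        value = getattr(routing_method, "top_k", default)
    return value
```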
    print(
        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
    )
Remove debug print statement from production code
Debug print statements should not be committed to production code. Please remove this print statement.
- print(
- f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
- )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    print(
        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
    )
🧰 Tools
🪛 Ruff (0.12.2)
964-964: Line too long (219 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py around lines
963-965, remove the debug print statement that outputs
maybe_create_moe_load_balancer state; delete the print(...) call and, if runtime
diagnostics are needed, replace it with a proper logger.debug(...) call using
the module logger (ensure the logger is imported/initialized) so no raw prints
remain in production code.
🛠️ Refactor suggestion
Replace noisy print with logger.debug and wrap long line (E501).
Stdout prints in distributed hot paths cause interleaved logs and measurable overhead. Use the project logger and keep within 120 cols.
- print(
- f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
- )
+ logger.debug(
+ "maybe_create_moe_load_balancer: in_supported_model_arch=%s, "
+ "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s",
+ in_supported_model_arch, using_ep, using_smart_router,
+ getattr(model_config, "moe_load_balancer", None),
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    logger.debug(
        "maybe_create_moe_load_balancer: in_supported_model_arch=%s, "
        "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s",
        in_supported_model_arch, using_ep, using_smart_router,
        getattr(model_config, "moe_load_balancer", None),
    )
🧰 Tools
🪛 Ruff (0.12.2)
964-964: Line too long (219 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py around lines 963
to 965, replace the noisy print call with a project logger.debug call and wrap
the long formatted message to keep line length under 120 characters; ensure you
obtain the module logger (e.g., via logging.getLogger(__name__) or use the
existing module logger) and pass the same f-string content to logger.debug,
splitting the format across concatenated strings or multiple arguments to stay
within the line length limit.
bfb9f40 to 360f36d (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (16)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replace stdout print with logger.debug and wrap to <=120 cols (fix E501).Raw prints in hot/distributed paths are noisy and can materially impact performance. Use the module logger with parameterized formatting to avoid eager string interpolation and keep within style limits.
Apply this diff:
- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + logger.debug( + "maybe_create_moe_load_balancer: in_supported_model_arch=%s, " + "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s", + in_supported_model_arch, + bool(using_ep), + bool(using_smart_router), + getattr(model_config, "moe_load_balancer", None), + )tensorrt_llm/_torch/distributed/ops.py (1)
243-247
: Replace debug print with proper logging (and fix the long line).Using
Apply:
- for val in input: - if val is not None and val.shape[dim] != sum_split_size: - print( - f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}" - ) + for val in input: + if val is not None and val.shape[dim] != sum_split_size: + logger.debug( + f"[reducescatter] val.shape={val.shape}, dim={dim}, " + f"val.shape[dim]={val.shape[dim]}, " + f"sum_split_size={sum_split_size}, sizes={sizes}" + )Optional: if you prefer lazy logging to avoid f-string formatting cost when DEBUG is off, use the parameterized form:
logger.debug( "[reducescatter] val.shape=%s, dim=%s, val.shape[dim]=%s, sum_split_size=%s, sizes=%s", tuple(val.shape), dim, val.shape[dim], sum_split_size, sizes, )Verification (find remaining prints in this file):
#!/bin/bash rg -n -C2 -g 'tensorrt_llm/_torch/distributed/ops.py' '^\s*print\('Note: This file currently both imports
logger
fromtensorrt_llm.logger
and redefineslogger = logging.getLogger(__name__)
. Pick one to avoid confusion; project-wide consistency typically favors thetensorrt_llm.logger
logger.tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (7)
715-727
: Remove debug prints; use logger.debug with short, wrapped message. Raw prints spam stdout and violate E501. Convert to logger.debug and keep under 120 cols.
- print( - "enter deepgemm backend compute_moe \n" - f"x.shape: {getattr(x, 'shape', None)}, \n" - f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n" - f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n" - f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n" - f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n" - f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n" - ) + logger.debug( + "[DeepGemm::enter] x=%s input_sf=%s slots=%s scales=%s " + "perm_row2unperm=%s perm_experts=%s perm_data=%s " + "first_off=%s perm_final_scales=%s unperm_row2perm=%s", + getattr(x, "shape", None), + getattr(input_sf, "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(permuted_row_to_unpermuted_row_tensor, "shape", None), + getattr(permuted_token_selected_experts_tensor, "shape", None), + getattr(permuted_data_tensor, "shape", None), + getattr(expert_first_token_offset_tensor, "shape", None), + getattr(permuted_token_final_scales_tensor, "shape", None), + getattr(unpermuted_row_to_permuted_row_tensor, "shape", None), + )
835-844
: Remove trailing debug print; replace with logger.debugSame rationale as the entry print and resolves E501.
- print( - "exit deepgemm backend compute_moe \n" - f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, " - f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, " - f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, " - f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, " - f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, " - f"x.shape: {getattr(x, 'shape', None)}, " - f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}" - ) + logger.debug( + "[DeepGemm::exit] perm_data=%s final_scales=%s unperm_row2perm=%s " + "perm_row2unperm=%s first_off=%s x=%s out=%s", + getattr(permuted_data_tensor, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(unpermuted_row_to_permuted_row_tensor, "shape", None), + getattr(permuted_row_to_unpermuted_row_tensor, "shape", None), + getattr(expert_first_token_offset_tensor, "shape", None), + getattr(x, "shape", None), + getattr(final_hidden_states, "shape", None), + )
1-4
: Add required NVIDIA copyright header at file topPer repo guidelines, source files must prepend the NVIDIA copyright header. Insert it before the module docstring.
+ # Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. """ MoE Backend abstraction for supporting different MoE computation implementations. This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.) """
245-256
: MoERunner init: pass the runtime view dtype and ensure flags are booleans (not bound methods)
- weight_dtype should match the view dtype used during profiling (weight_view_dtype), not the storage dtype.
- Several flags are obtained via getattr and may be bound methods; pass actual booleans by calling when callable or coercing with bool(...).
self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, tp_size=module.tp_size, tp_rank=module.tp_rank, ep_size=module.ep_size, ep_rank=module.ep_rank, cluster_size=module.cluster_size, cluster_rank=module.cluster_rank, - use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), - use_w4_group_scaling=getattr(module, 'has_w4afp8', False), - use_int8_woq_per_channel=getattr(module, - 'has_int8_woq_per_channel', - False), - use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling', - False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') + and callable(getattr(module, 'has_deepseek_fp8_block_scales')) + else bool(getattr(module, 'has_deepseek_fp8_block_scales', False)) + ), + use_w4_group_scaling=bool(getattr(module, 'has_w4afp8', False)), + use_int8_woq_per_channel=bool( + getattr(module, 'has_int8_woq_per_channel', False) + ), + use_mxfp8_act_scaling=bool( + getattr(module, 'has_mxfp8_act_scaling', False) + ), min_latency_mode=min_latency_mode, use_fused_finalize=getattr(module, 'use_fused_finalize', False), )Also applies to: 264-274
827-834
: Use experts_per_token (not top_k) to avoid AttributeError on routing method. Project-wide the attribute is experts_per_token.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,
846-846
: Align DeepGemm return type with Cutlass: return a one-element list. Callers expect List[Tensor] in the non-min-latency path. Wrap the tensor to avoid slicing bugs.
- return final_hidden_states + return [final_hidden_states]
861-889
: Backend selection: call has_deepseek_fp8_block_scales instead of passing a bound method. As written, has_block_fp8 is truthy whenever the attribute exists, regardless of the method's return. Ensure it's called when callable.
is_blackwell = get_sm_version() == 100 - has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + has_block_fp8 = ( + hasattr(module, 'has_deepseek_fp8_block_scales') + and ( + module.has_deepseek_fp8_block_scales() + if callable(getattr(module, 'has_deepseek_fp8_block_scales')) + else bool(getattr(module, 'has_deepseek_fp8_block_scales')) + ) + )tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7)
311-313
: Replace debug print with logger.debug and keep under 120 colsAvoid stdout noise in production paths.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Cannot use alltoall: chunking=%s", + self.calculate_num_chunks(all_rank_num_tokens), + )
319-321
: Replace debug print with logger.debugAlso fixes E501.
- print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Cannot use alltoall: deep_ep_max_num_tokens %s > %s", + all_rank_max_num_tokens, + self.deep_ep_max_num_tokens, + )
324-324
: Replace debug print with logger.debugShort and consistent logging.
- print(f"all to all type {self.alltoall_method_type}") + logger.debug("Alltoall type: %s", self.alltoall_method_type)
333-345
: Use logger.debug for SM/back-end selection tracesKeep diagnostics but avoid prints; also fix long lines.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug( + "wide_ep _get_quant_method: get_sm_version()=%s", + get_sm_version(), + ) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug( + "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" + ) return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug( + "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" + ) return DeepSeekFP8BlockScalesFusedMoEMethod()
418-420
: Replace verbose print in forward_chunk with logger.debugThis line floods stdout and violates E501. Keep it concise and wrapped.
- print( - f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}" - ) + logger.debug( + "forward_chunk: lb=%s first=%s last=%s x=%s router=%s alltoall=%s " + "num_tokens=%s max_tokens=%s dp_pad=%s repeat=%s", + bool(self.layer_load_balancer), is_first_call, is_last_call, + getattr(x, "shape", None), getattr(router_logits, "shape", None), + use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens, + use_dp_padding, repeating_info, + )
730-732
: Remove extremely verbose print after backend runThis prints shapes and flags every call; switch to a gated debug log or remove.
- print( - f"xxi x.shape: {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states[0], 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}" - ) + logger.debug( + "forward_chunk result: out=%s slots=%s scales=%s alltoall=%s tp=%s/%s ep=%s/%s", + getattr(final_hidden_states[0], "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + use_all_to_all, self.tp_rank, self.tp_size, ep_rank, ep_size, + )
812-815
: Remove leftover print in forward; convert to concise debug log if neededAvoid production prints and long lines.
- # 一行打印所有信息 - print( - f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - ) + logger.debug( + "forward: x=%s alltoall=%s num_tokens=%s padded=%s max_tokens=%s " + "dp_pad=%s out=%s", + getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens, + all_rank_num_tokens_padded, all_rank_max_num_tokens, + use_dp_padding, getattr(outputs, "shape", None), + )
🧹 Nitpick comments (7)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (3)
959-962
: Normalize booleans to avoid None/strange truthiness fromand
chaining.
using_ep = mapping and ...
returnsNone
whenmapping
is falsy; prefer explicitbool(mapping)
for clarity and stable typing.- using_ep = mapping and mapping.moe_ep_size > 1 + using_ep = bool(mapping) and mapping.moe_ep_size > 1 - using_smart_router = mapping and mapping.moe_cluster_size > 1 + using_smart_router = bool(mapping) and mapping.moe_cluster_size > 1
954-956
: Return type annotation doesn’t match actual return (nullcontext vs. MoeLoadBalancer).This function returns either a
MoeLoadBalancer
(context manager) or anullcontext()
;Optional[MoeLoadBalancer]
is misleading. Consider annotating as a context manager to reflect usage (with maybe_create_moe_load_balancer(...):
).Minimal edits:
-from typing import Callable, Dict, List, Optional, Tuple +from typing import Callable, Dict, List, Optional, Tuple, Any, ContextManager @@ -def maybe_create_moe_load_balancer( - model_config, mapping: Optional[Mapping]) -> Optional[MoeLoadBalancer]: +def maybe_create_moe_load_balancer( + model_config, mapping: Optional[Mapping]) -> ContextManager[Any]:
1-1
: Missing NVIDIA copyright header.Project guideline requires the NVIDIA copyright header on all source files.
Suggested header (adjust exact wording/year to match repo standard):
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/distributed/ops.py (1)
463-489
: MNNVL runtime gate looks good; add a small guard and improve exception logging.Broadening enablement to AUTO/MNNVL and gating on
tp_size == world_size
is reasonable. The divergence from DeepSeekV3’s TP-size logic is known/intentional per prior learnings. Minor polish:
- Avoid calling
is_mnnvl
whendtype is None
.- Include stack info in the debug log on init failure for easier diagnosis.
- if self.strategy in (AllReduceStrategy.AUTO, - AllReduceStrategy.MNNVL): + if self.strategy in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL): if self.mapping.tp_size != self.mapping.world_size: logger.debug( f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} " f"!= world_size:{self.mapping.world_size}") self.mnnvl_allreduce = None - elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype): + elif dtype is not None and MNNVLAllReduce.is_mnnvl(self.mapping, dtype): try: self.mnnvl_allreduce = MNNVLAllReduce( self.mapping, dtype) if dtype else None if self.mnnvl_allreduce: logger.debug(f"MNNVLAllReduce is enabled") else: logger.debug(f"MNNVLAllReduce is disabled") - except Exception as e: - logger.debug( - f"MNNVL AllReduce can't be enabled due to {e}.") + except Exception: + logger.debug("MNNVL AllReduce init failed.", exc_info=True) self.mnnvl_allreduce = None else: logger.debug( f"MNNVLAllReduce can't be enabled due to failing the is_mnnvl check." ) self.mnnvl_allreduce = NoneSide note: since this block adds several new
logger.debug
calls, consider unifying on a single logger source for the module (see note in the reducescatter comment).tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
11-11
: Import the module logger for use instead of print()Follow project practice and avoid raw prints. Import logger alongside get_sm_version.
- from tensorrt_llm._utils import get_sm_version + from tensorrt_llm._utils import get_sm_version, logger
381-383
: Wrap long conditional assignment to satisfy line length and readabilityNo functional change; resolves Ruff E501 and improves clarity.
- run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe + if min_latency_mode: + run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency + else: + run_moe = self.moe_runner.fused_moe_runner.run_moe
97-99
: Doc/typing: clarify run_moe/compute_moe return contract (List[Tensor] vs Tensor)With the DeepGemm fix, both backends return List[Tensor] (non-min-latency). Update docstrings and type hints to reflect this, or standardize on always returning List[Tensor] except when explicitly in min-latency mode.
Also applies to: 172-174
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(11 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
721-721: Line too long (127 > 120)
(E501)
722-722: Line too long (129 > 120)
(E501)
725-725: Line too long (121 > 120)
(E501)
726-726: Line too long (125 > 120)
(E501)
839-839: Line too long (125 > 120)
(E501)
840-840: Line too long (125 > 120)
(E501)
tensorrt_llm/_torch/distributed/ops.py
246-246: Line too long (156 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
419-419: Line too long (442 > 120)
(E501)
731-731: Line too long (949 > 120)
(E501)
814-814: Line too long (372 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
214-223
: Nice: the backend abstraction reads cleanly and keeps a clear strategy separation. The split between tactic finalization and compute paths is well-structured and should ease future backend additions.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
368-370
: Backend selection timing: ensure moe_backend is set before first use. moe_backend is initialized in create_weights(); if skip_create_weights_in_init is True and forward is called before create_weights(), self.moe_backend may be None.
Consider asserting at first use (in forward_chunk) or lazy-selecting if None:
  # Initialize MoE backend after weights are created
- self.moe_backend = MoEBackendSelection.select_backend(self)
+ self.moe_backend = MoEBackendSelection.select_backend(self)
And in forward_chunk before calling run_moe:
- final_hidden_states = self.moe_backend.run_moe(
+ assert self.moe_backend is not None, "moe_backend is not initialized"
+ final_hidden_states = self.moe_backend.run_moe(
Please confirm the repo always calls create_weights() prior to forward when skipping in init.
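A minimal sketch of the lazy-selection alternative, in case the assert is too strict; the helper name _get_moe_backend is hypothetical and only assumes MoEBackendSelection.select_backend(self) is safe to call once weights exist:

def _get_moe_backend(self):
    # Hypothetical lazy getter: select the backend on first use when
    # create_weights() ran after __init__ (skip_create_weights_in_init=True).
    if self.moe_backend is None:
        self.moe_backend = MoEBackendSelection.select_backend(self)
    return self.moe_backend

forward_chunk would then call self._get_moe_backend().run_moe(...) instead of touching self.moe_backend directly.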
700-728
: Good integration of backend run_moe with module context. Passing module=self keeps backends decoupled yet informed. The argument mapping mirrors the legacy fused_moe signature well.
360f36d to 6b3430e
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/distributed/ops.py (1)
1-21
: Fix logger shadowing (imported logger is overwritten). The module imports logger from tensorrt_llm.logger (Line 15) and then overwrites it with a stdlib logger (Line 20). This breaks the project’s logging conventions.
- import logging
@@
-from tensorrt_llm.logger import logger
+from tensorrt_llm.logger import logger
@@
-logger = logging.getLogger(__name__)
If nothing else uses the stdlib logging symbol after this change, drop the unused import as shown.
♻️ Duplicate comments (16)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replace debug print with logger.debug and wrap lines (<120 cols). Raw prints are noisy in distributed runs and fail Ruff E501. Use the project logger at debug level.
- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + logger.debug( + "maybe_create_moe_load_balancer: in_supported_model_arch=%s, " + "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s", + in_supported_model_arch, using_ep, using_smart_router, + getattr(model_config, "moe_load_balancer", None), + )tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (8)
715-727
: Remove verbose prints; use project logger at debug level. These prints run every forward and exceed 120 cols.
+ from tensorrt_llm.logger import logger - print( - "xxi shape 2: enter deepgemm backend compute_moe \n" - f"x.shape: {getattr(x, 'shape', None)}, \n" - f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n" - f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n" - f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n" - f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n" - f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n" - ) + logger.debug( + "[DeepGemm.enter] x=%s input_sf=%s topk_slots=%s final_scales=%s " + "perm_row2unperm=%s perm_experts=%s perm_data=%s " + "expert_first_off=%s perm_final_scales=%s unperm_row2perm=%s", + getattr(x, "shape", None), + getattr(input_sf, "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(permuted_row_to_unpermuted_row_tensor, "shape", None), + getattr(permuted_token_selected_experts_tensor, "shape", None), + getattr(permuted_data_tensor, "shape", None), + getattr(expert_first_token_offset_tensor, "shape", None), + getattr(permuted_token_final_scales_tensor, "shape", None), + getattr(unpermuted_row_to_permuted_row_tensor, "shape", None), + )
835-843
: Replace exit print with logger.debug; keep within 120 cols.- print( - "xxi shape 3: exit deepgemm backend compute_moe \n" - f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n" - f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n" - f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - f"x.shape: {getattr(x, 'shape', None)}, \n") + logger.debug( + "[DeepGemm.exit] out=%s perm_data=%s final_scales=%s " + "unperm_row2perm=%s perm_row2unperm=%s expert_first_off=%s x=%s", + getattr(final_hidden_states, "shape", None), + getattr(permuted_data_tensor, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(unpermuted_row_to_permuted_row_tensor, "shape", None), + getattr(permuted_row_to_unpermuted_row_tensor, "shape", None), + getattr(expert_first_token_offset_tensor, "shape", None), + getattr(x, "shape", None), + )
828-834
: Use experts_per_token instead of top_k for routing. Avoid AttributeError and ensure correct semantics.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,
845-845
: Unify return type: DeepGemm backend should return List[Tensor]. WideEPMoE.forward_chunk indexes [0] unconditionally; returning a bare Tensor will slice the first row.
- return final_hidden_states + return [final_hidden_states]
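If touching both backends is undesirable, the caller could normalize instead; a small sketch under the assumption that a backend returns either a Tensor or a list of Tensors (the helper name as_tensor_list is made up for illustration):

from typing import List, Union
import torch

def as_tensor_list(out: Union[torch.Tensor, List[torch.Tensor]]) -> List[torch.Tensor]:
    # Normalize a backend result so callers can always index [0] safely.
    return out if isinstance(out, list) else [out]

final_hidden_states = as_tensor_list(self.moe_backend.run_moe(...))[0]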
883-891
: Bug: evaluating a method object as truthy (has_block_fp8 always true). Call has_deepseek_fp8_block_scales if callable; otherwise coerce to bool.
- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + has_block_fp8 = ( + hasattr(module, 'has_deepseek_fp8_block_scales') + and ( + module.has_deepseek_fp8_block_scales() + if callable(getattr(module, 'has_deepseek_fp8_block_scales')) + else bool(getattr(module, 'has_deepseek_fp8_block_scales')) + ) + )
1-5
: Add the required NVIDIA copyright header. New Python source must start with the standard header.
+ # Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # SPDX-License-Identifier: LicenseRef-NvidiaProprietary + """ MoE Backend abstraction for supporting different MoE computation implementations. This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.) """
264-271
: Bug: passing a bound method instead of bool to MoERunner flags. has_deepseek_fp8_block_scales is a method; using getattr without calling makes the flag truthy whenever present.
- use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') else False + ),Also verify other flags are real booleans (has_int8_woq_per_channel, has_mxfp8_act_scaling). If any are methods, apply the same callable guard.
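For context, a bound method object is always truthy in Python, which is why the uncalled attribute silently enables the flag; a minimal, self-contained illustration of the callable-guard pattern (the _FakeModule class is hypothetical):

class _FakeModule:
    def has_deepseek_fp8_block_scales(self) -> bool:
        return False  # the capability is actually off

m = _FakeModule()
print(bool(m.has_deepseek_fp8_block_scales))  # True  - the bound method object is truthy
print(m.has_deepseek_fp8_block_scales())      # False - the real capability

attr = getattr(m, "has_deepseek_fp8_block_scales", False)
flag = attr() if callable(attr) else bool(attr)
print(flag)                                   # False - callable guard gives the intended value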
253-274
: MoERunner should receive weight_view_dtype (not storage dtype). You compute weight_view_dtype but pass module.w3_w1_weight.dtype to MoERunner, risking mismatch when viewing weights (e.g., W4 group scaling, MXFP4).
if self.moe_runner is None: self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k,tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7)
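A compact sketch of the intended dtype choice; the quantization flags mirror those named in the diffs above, but the helper itself (pick_weight_view_dtype) is illustrative rather than part of the module:

import torch

def pick_weight_view_dtype(module) -> torch.dtype:
    # Return the dtype the packed weights are viewed as during profiling.
    if getattr(module, "has_w4afp8", False):
        return torch.quint4x2         # 4-bit weights packed two per byte
    if getattr(module, "has_w4a16_mxfp4", False):
        return torch.uint8            # MXFP4 payload viewed as raw bytes
    return module.w3_w1_weight.dtype  # no repacking: storage dtype is fine

Passing this value as weight_dtype keeps MoERunner consistent with the .view(...) calls used at runtime.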
311-314
: Replace print with logger.debug (chunking gate); keep lines <120.
-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to chunking=%d",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
319-321
: Replace print with logger.debug (DeepEPLowLatency token cap).
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: deep_ep_max_num_tokens %d > %d",
+                all_rank_max_num_tokens, self.deep_ep_max_num_tokens,
+            )
324-324
: Replace print with logger.debug (alltoall method type).
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall method type: %s", self.alltoall_method_type)
333-345
: Swap verbose prints in _get_quant_method for logger.debug. Also keeps lines within 120 columns.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug( + "wide_ep _get_quant_method: get_sm_version()=%s", + get_sm_version(), + ) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug( + "wide_ep _get_quant_method: use DeepGemm block-FP8" + ) return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug( + "wide_ep _get_quant_method: use Cutlass block-FP8" + ) return DeepSeekFP8BlockScalesFusedMoEMethod()
418-420
: Remove large print in forward_chunk; use concise logger.debug. This runs per forward and exceeds 120 columns.
- print( - f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}" - ) + logger.debug( + "forward_chunk.enter lb=%s first=%s last=%s x=%s router=%s " + "alltoall=%s num_tokens=%s max_tokens=%s dp_pad=%s repeat=%s", + bool(self.layer_load_balancer), is_first_call, is_last_call, + getattr(x, "shape", None), getattr(router_logits, "shape", None), + use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens, + use_dp_padding, repeating_info, + )
730-732
: Replace post-backend shape print with logger.debug; avoid indexing cost. Also helps when DeepGemm returns a Tensor list after your fix.
- print( - f"xxi shape 4 after moe backend : {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states[0], 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}" - ) + logger.debug( + "moe_backend.out x=%s out=%s topk_slots=%s final_scales=%s " + "w3w1=%s w2=%s x_sf=%s alltoall=%s tp=(%d,%d) ep=(%d,%d) " + "cluster=(%d,%d) block_fp8=%s w4_group=%s tune_max_tokens=%d " + "tuner_tokens=%s tuner_top_k=%s", + getattr(x, "shape", None), + getattr(final_hidden_states[0], "shape", None) + if isinstance(final_hidden_states, list) + else getattr(final_hidden_states, "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(w3_w1_weight, "shape", None), + getattr(w2_weight, "shape", None), + getattr(x_sf, "shape", None), + use_all_to_all, self.tp_size, self.tp_rank, ep_size, ep_rank, + cluster_size, cluster_rank, use_deepseek_fp8_block_scale, + use_w4_group_scaling, self.tune_max_num_tokens, + tuner_num_tokens, tuner_top_k, + )
813-815
: Remove one-line “all info” print; log succinctly with logger.debug.- print( - f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - ) + logger.debug( + "forward.out x=%s alltoall=%s num_tokens=%s num_tokens_pad=%s " + "max_tokens=%s dp_pad=%s out=%s", + getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens, + all_rank_num_tokens_padded, all_rank_max_num_tokens, + use_dp_padding, getattr(outputs, "shape", None), + )
🧹 Nitpick comments (2)
tensorrt_llm/_torch/distributed/ops.py (1)
243-247
: Swap print() for logger.debug; keep line length under 120. Printing inside collectives spams stdout and is slow. Use logger.debug with placeholders.
- for val in input: - if val is not None and val.shape[dim] != sum_split_size: - print( - f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}" - ) + for val in input: + if val is not None and val.shape[dim] != sum_split_size: + logger.debug( + "[reducescatter] val.shape=%s, dim=%d, val.shape[dim]=%d, " + "sum_split_size=%d, sizes=%s", + tuple(val.shape), dim, int(val.shape[dim]), + int(sum_split_size), sizes, + )tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
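As a standalone illustration of the suggested pattern (plain stdlib logging and a bare helper name, both assumptions rather than the project's actual API), %-style placeholders defer formatting until DEBUG is actually enabled:

import logging
import torch

logger = logging.getLogger(__name__)

def check_split_sizes(tensors, sizes, dim=0):
    # Sanity-check that each tensor matches the summed split sizes; the
    # message is only formatted when debug logging is turned on.
    expected = sum(sizes)
    for t in tensors:
        if t is not None and t.shape[dim] != expected:
            logger.debug("[reducescatter] shape=%s dim=%d expected=%d sizes=%s",
                         tuple(t.shape), dim, expected, sizes)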
382-383
: Wrap long conditional assignment to satisfy E501 and readability.
-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        run_moe = (
+            self.moe_runner.fused_moe_runner.run_moe_min_latency
+            if min_latency_mode
+            else self.moe_runner.fused_moe_runner.run_moe
+        )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4): MoEBackendSelection (848-891), select_backend (860-891), run_moe (101-212), run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1): get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2): DeepSeekFP8BlockScalesFusedMoEMethod (604-737), DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/distributed/communicator.py (1): tp_size (46-47)
tensorrt_llm/mapping.py (1): tp_rank (338-339)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1): enable_alltoall (298-301)
tensorrt_llm/_torch/autotuner.py (2): AutoTuner (271-752), choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1): MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5): deepgemm_fp8_group_blockwise_gemm (246-284), masked_index_copy_group_quant_fp8 (89-160), preprocess_after_permute (233-242), set_strides (287-293), triton_masked_index_gather (195-216)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1): has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1): AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/communicator.py (2): tp_size (46-47), world_size (26-27)
tensorrt_llm/logger.py (1): debug (143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
419-419: Line too long (461 > 120)
(E501)
731-731: Line too long (968 > 120)
(E501)
814-814: Line too long (372 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
721-721: Line too long (127 > 120)
(E501)
722-722: Line too long (129 > 120)
(E501)
725-725: Line too long (121 > 120)
(E501)
726-726: Line too long (125 > 120)
(E501)
840-840: Line too long (127 > 120)
(E501)
841-841: Line too long (127 > 120)
(E501)
tensorrt_llm/_torch/distributed/ops.py
246-246: Line too long (156 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
@@ -5,7 +5,7 @@
 import torch

 from tensorrt_llm._mnnvl_utils import MnnvlMemory, MnnvlMoe, MoEAlltoallInfo
-from tensorrt_llm._utils import logger
+from tensorrt_llm._utils import get_sm_version, logger
Incorrect logger import; use tensorrt_llm.logger.
logger is not provided by tensorrt_llm._utils. Import from the canonical module.
-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
📝 Committable suggestion
-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around line 8, the
code imports logger from tensorrt_llm._utils but that module does not export
logger; replace the import to pull logger from the canonical module
tensorrt_llm.logger (i.e., import get_sm_version from tensorrt_llm._utils and
import logger from tensorrt_llm.logger) so the module uses the correct logger
object.
6b3430e to ed54d31
Actionable comments posted: 0
♻️ Duplicate comments (13)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replace noisy print with logger.debug; keep within 120 cols (E501). Raw prints in hot paths cause interleaved stdout and overhead. Use the module logger and wrap lines.
- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + logger.debug( + "maybe_create_moe_load_balancer: " + f"in_supported_model_arch={in_supported_model_arch}, " + f"using_ep={using_ep}, using_smart_router={using_smart_router}, " + f"moe_load_balancer={getattr(model_config, 'moe_load_balancer', None)}" + )tensorrt_llm/_torch/distributed/ops.py (1)
243-247
: Remove debug print in reducescatter; use logger.debug and wrap long line. Printing inside the validation loop is noisy; prefer logger with structured fields. Also fixes Ruff E501.
- for val in input: - if val is not None and val.shape[dim] != sum_split_size: - print( - f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}" - ) + for val in input: + if val is not None and val.shape[dim] != sum_split_size: + logger.debug( + "[reducescatter] val.shape=%s, dim=%s, " + "val.shape[dim]=%s, sum_split_size=%s, sizes=%s", + getattr(val, "shape", None), dim, + val.shape[dim], sum_split_size, sizes, + )tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
680-727
: Remove verbose prints in DeepGemm compute; use logger.debug. These multi-line prints fire every forward; swap for concise debug logging. Also ensure logger is available in this module.
+from tensorrt_llm.logger import logger @@ - print( - "xxi shape 2: enter deepgemm backend compute_moe \n" - f"x.shape: {getattr(x, 'shape', None)}, \n" - f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n" - f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n" - f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n" - f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n" - f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n" - ) + logger.debug( + "[DeepGemm::enter] x=%s input_sf=%s topk_slots=%s final_scales=%s " + "perm_row2unperm=%s perm_experts=%s perm_data=%s " + "expert_first_offset=%s perm_final_scales=%s unperm_row2perm=%s", + getattr(x, "shape", None), + getattr(input_sf, "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(permuted_row_to_unpermuted_row_tensor, "shape", None), + getattr(permuted_token_selected_experts_tensor, "shape", None), + getattr(permuted_data_tensor, "shape", None), + getattr(expert_first_token_offset_tensor, "shape", None), + getattr(permuted_token_final_scales_tensor, "shape", None), + getattr(unpermuted_row_to_permuted_row_tensor, "shape", None), + )
237-275
: Use view dtype for MoERunner and call feature flags correctly.
- MoERunner should get the runtime view dtype (e.g., quint4x2/uint8) to match .view(...) during profiling.
- has_deepseek_fp8_block_scales
is a method; passing the bound method makes the flag truthy even when False.- if self.moe_runner is None: - self.moe_runner = MoERunner( + if self.moe_runner is None: + self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, tp_size=module.tp_size, tp_rank=module.tp_rank, ep_size=module.ep_size, ep_rank=module.ep_rank, cluster_size=module.cluster_size, cluster_rank=module.cluster_rank, - use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') else False + ), use_w4_group_scaling=getattr(module, 'has_w4afp8', False), use_int8_woq_per_channel=getattr(module, 'has_int8_woq_per_channel', False), use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling', False), min_latency_mode=min_latency_mode, use_fused_finalize=getattr(module, 'use_fused_finalize', False), )
828-846
: Fix API contract: use experts_per_token and return List[Tensor] for consistency.
- The routing method exposes experts_per_token; .top_k risks AttributeError.
- WideEPMoE forward unconditionally indexes [0]. DeepGemm must return [tensor] to match Cutlass behavior.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1, @@ - return final_hidden_states + return [final_hidden_states]
880-891
: Call has_deepseek_fp8_block_scales() when selecting backend. Currently it treats the bound method as truthy; call it (guarding callability) so selection reflects the real capability.
- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + fn = getattr(module, 'has_deepseek_fp8_block_scales', None) + has_block_fp8 = (callable(fn) and fn()) or (fn if isinstance(fn, bool) else False)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7)
311-325
: Replace leftover prints with logger.debug; wrap long lines. Avoid stdout noise and satisfy E501.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Cannot use alltoall due to chunking %s", + self.calculate_num_chunks(all_rank_num_tokens), + ) @@ - print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Cannot use alltoall due to deep_ep_max_num_tokens %s > %s", + all_rank_max_num_tokens, self.deep_ep_max_num_tokens, + ) @@ - print(f"all to all type {self.alltoall_method_type}") + logger.debug("Alltoall type: %s", self.alltoall_method_type)
333-345
: Swap diagnostic prints in _get_quant_method for logger.debug. Keep runtime quiet; small, readable debug lines.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug( + "wide_ep _get_quant_method: get_sm_version()=%s", + get_sm_version(), + ) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug( + "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" + ) return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug( + "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod" + ) return DeepSeekFP8BlockScalesFusedMoEMethod()
418-420
: Remove verbose entry print in forward_chunk; use concise logger.debug. Giant prints exceed 120 cols and spam stdout.
- print( - f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}" - ) + logger.debug( + "forward_chunk: lb=%s first=%s last=%s x=%s router=%s " + "alltoall=%s num_tokens=%s max_tokens=%s dp_pad=%s", + bool(self.layer_load_balancer), is_first_call, is_last_call, + getattr(x, "shape", None), getattr(router_logits, "shape", None), + use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens, + use_dp_padding, + )
730-732
: Remove post-backend verbose print; replace with logger.debug. Also note quant_scales is a list; accessing .shape on it is invalid.
- print( - f"xxi shape 4 after moe backend : {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states, 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}" - ) + logger.debug( + "after backend: x=%s out=%s slots=%s scales=%s w3w1=%s w2=%s " + "x_sf=%s tp=(%s,%s) ep=(%s,%s) alltoall=%s flags(fp8=%s,w4=%s) " + "tune=(max_tokens=%s, num_tokens=%s, top_k=%s)", + getattr(x, "shape", None), + getattr(final_hidden_states, "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(w3_w1_weight, "shape", None), + getattr(w2_weight, "shape", None), + getattr(x_sf, "shape", None), + self.tp_size, self.tp_rank, ep_size, ep_rank, use_all_to_all, + use_deepseek_fp8_block_scale, use_w4_group_scaling, + self.tune_max_num_tokens, tuner_num_tokens, tuner_top_k, + )
812-815
: Remove one-line “dump everything” print; prefer structured debug logging. This line is extremely long (E501) and duplicates information already logged.
- # 一行打印所有信息 - print( - f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - ) + logger.debug( + "forward (single-chunk): x=%s out=%s alltoall=%s " + "num_tokens=%s padded=%s max_tokens=%s dp_pad=%s", + getattr(x, "shape", None), getattr(outputs, "shape", None), + use_all_to_all, all_rank_num_tokens, all_rank_num_tokens_padded, + all_rank_max_num_tokens, use_dp_padding, + )
859-867
: Consistent attribute name: experts_per_token vs top_k across codebase. You already use experts_per_token elsewhere; keep it consistent to avoid AttributeError on routing methods that don’t expose .top_k.
8-9
: Fix logger import; use the canonical project logger. tensorrt_llm._utils does not export logger; import from tensorrt_llm.logger.
-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🧹 Nitpick comments (6)
tensorrt_llm/_torch/distributed/ops.py (1)
463-489
: MNNVL gating: confirm tp_size == world_size requirement is intended. The added guard disables MNNVL when tp_size != world_size. Given past divergence in MNNVL criteria vs. TP sizing elsewhere is by design, confirm this stricter requirement won’t preclude valid configurations (e.g., pipeline/data parallel present but TP group is full-node and MNNVL-capable).
If the intent is to gate on the TP group’s multi-node NVLink availability rather than full world_size equality, consider checking the group topology (e.g., mapping.tp_group vs. cluster topology) instead of world_size.
Would you like a follow-up PR to probe/validate group-level MNNVL capabilities at init time?
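To make the trade-off concrete, a hedged sketch of the gate being discussed; the helper name, the supported_strategies argument, and the is_mnnvl_fn probe are placeholders standing in for the real AllReduce/MNNVLAllReduce APIs, not their actual signatures:

def should_enable_mnnvl(strategy, mapping, dtype, supported_strategies, is_mnnvl_fn):
    # Placeholder gate: strategy must allow MNNVL (e.g. AUTO or MNNVL),
    # the TP group must span the whole job, a dtype must be known,
    # and the runtime capability probe must pass.
    if strategy not in supported_strategies:
        return False
    if mapping.tp_size != mapping.world_size:   # the guard being questioned above
        return False
    if dtype is None:
        return False
    return is_mnnvl_fn(mapping, dtype)

Relaxing the second check to a group-topology test would be the alternative sketched in the comment above.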
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
1-5
: Add the required NVIDIA copyright header. New source files must include the current-year NVIDIA header per guidelines.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. """ MoE Backend abstraction for supporting different MoE computation implementations. This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.) """
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
700-728
: Backend call: ensure downstream return type is uniform. Since forward_chunk later indexes [0], ensure both backends return List[Tensor]. This is addressed by changing DeepGemm to wrap the tensor; please apply that diff alongside this file.
739-745
: Unpack backend output robustly. Given Cutlass returns [tensor] and DeepGemm now wraps output per fix, this indexing is fine. If you’d like extra safety, guard with:
if isinstance(final_hidden_states, list): final_hidden_states = final_hidden_states[0]
369-369
: Minor: ensure backend is non-None before use. If selection ever returns None, forward would fail. Optional: assert after init.
self._weights_created = True self._check_configs() # Initialize MoE backend after weights are created - self.moe_backend = MoEBackendSelection.select_backend(self) + self.moe_backend = MoEBackendSelection.select_backend(self) + assert self.moe_backend is not None, "Failed to select a MoE backend"
1-1
: Header check (optional): Consider adding the NVIDIA header to this file as well if it’s missing project-wide.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
MoEBackendSelection
(848-891)select_backend
(860-891)run_moe
(101-212)run_moe
(413-533)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
AllReduceStrategy
(3876-3885)tensorrt_llm/_torch/distributed/communicator.py (2)
tp_size
(46-47)world_size
(26-27)tensorrt_llm/logger.py (1)
debug
(143-144)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/quantization/utils/fp8_utils.py (1)
silu_and_mul_masked_post_quant_fwd
(304-375)tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
has_deepseek_fp8_block_scales
(127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
419-419: Line too long (461 > 120)
(E501)
731-731: Line too long (965 > 120)
(E501)
814-814: Line too long (372 > 120)
(E501)
tensorrt_llm/_torch/distributed/ops.py
246-246: Line too long (156 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
721-721: Line too long (127 > 120)
(E501)
722-722: Line too long (129 > 120)
(E501)
725-725: Line too long (121 > 120)
(E501)
726-726: Line too long (125 > 120)
(E501)
840-840: Line too long (127 > 120)
(E501)
841-841: Line too long (127 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
368-370
: LGTM: Backend selection initialized post weights creation. Initializing MoE backend after weight creation is appropriate and keeps selection aligned with quant flags. No issues.
498-504
: Nice guard: only fetch LB stats when last call or LB present. This avoids unnecessary syncs/calls mid-iteration. Looks good.
ed54d31 to e3b58fb
Actionable comments posted: 2
♻️ Duplicate comments (15)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replace debug print with logger.debug and wrap to avoid E501. Use the project logger instead of stdout; also keeps the line length within 120 and avoids noisy logs in distributed runs.
- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + logger.debug( + "maybe_create_moe_load_balancer: supported_arch=%s, using_ep=%s, " + "smart_router=%s, lb_cfg=%s", + in_supported_model_arch, + bool(using_ep), + bool(using_smart_router), + getattr(model_config, "moe_load_balancer", None), + )tensorrt_llm/_torch/distributed/ops.py (1)
243-247
: Remove stdout print in reducescatter size-check; use logger.debug (fixes E501). Stick to the module logger to avoid flooding stdout and interleaving across ranks.
- for val in input: - if val is not None and val.shape[dim] != sum_split_size: - print( - f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}" - ) + for val in input: + if val is not None and val.shape[dim] != sum_split_size: + logger.debug( + "[reducescatter] shape=%s dim=%s dim_size=%s sum_split_size=%s sizes=%s", + val.shape, dim, val.shape[dim], sum_split_size, sizes, + )tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (8)
311-313
: Replace debug print with logger and avoid duplicate compute. Capture the chunk count once and log via logger.debug.
-        if self.calculate_num_chunks(all_rank_num_tokens) > 1:
-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+        chunks = self.calculate_num_chunks(all_rank_num_tokens)
+        if chunks > 1:
+            logger.debug("Cannot use alltoall due to chunking: %s", chunks)
             return False
319-321
: Replace DeepEPLowLatency limit print with logger.debug and fix E501.
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: max_tokens %s > threshold %s",
+                all_rank_max_num_tokens, self.deep_ep_max_num_tokens,
+            )
324-324
: Replace stray print with logger.debug.
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("alltoall method type: %s", self.alltoall_method_type)
333-345
: Swap prints for logger.debug in quant-method selection. Also reduces log noise and respects line length.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug("wide_ep _get_quant_method: sm=%s", + get_sm_version()) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug("wide_ep _get_quant_method: DeepGemm path") return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug("wide_ep _get_quant_method: Cutlass path") return DeepSeekFP8BlockScalesFusedMoEMethod()
418-421
: Remove verbose print from hot-path forward_chunk; use logger.debug placeholders. This avoids formatting cost when debug is disabled and fixes E501.
- print( - f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}" - ) + logger.debug( + "forward_chunk enter: lb=%s first=%s last=%s x=%s router=%s " + "alltoall=%s num_tokens=%s max_tokens=%s dp_pad=%s repeat=%s", + bool(self.layer_load_balancer), is_first_call, is_last_call, + getattr(x, "shape", None), getattr(router_logits, "shape", None), + use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens, + use_dp_padding, repeating_info, + )
730-732
: Drop extremely verbose print after backend run. Replace with concise, gated logger.debug if you need quick shape sanity.
- print( - f"xxi shape 4 after moe backend : {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states, 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}" - ) + logger.debug( + "forward_chunk exit: x=%s out=%s slots=%s scales=%s alltoall=%s", + getattr(x, "shape", None), + getattr(final_hidden_states, "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + use_all_to_all, + )
812-815
: Remove leftover one-line diagnostic print (non-English comment)
Replace with logger.debug if needed; prefer English comments for consistency.- # 一行打印所有信息 - print( - f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - ) + # Log summary for debugging (optional) + # logger.debug("wide_ep summary: x=%s alltoall=%s tokens=%s padded=%s max_tokens=%s dp_pad=%s out=%s", + # getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens, + # all_rank_num_tokens_padded, all_rank_max_num_tokens, use_dp_padding, + # getattr(outputs, "shape", None))
8-9
: Fix logger import source. logger should come from tensorrt_llm.logger, not tensorrt_llm._utils.
-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
835-844
: Remove exit-path print; replace with concise logger.debug- print( - "xxi shape 3: exit deepgemm backend compute_moe \n" - f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n" - f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n" - f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - f"x.shape: {getattr(x, 'shape', None)}, \n") + logger.debug( + "[DeepGemm::exit] out=%s perm_data=%s final_scales=%s " + "unperm_row2perm=%s perm_row2unperm=%s expert_first_off=%s x=%s", + getattr(final_hidden_states, "shape", None), + getattr(permuted_data_tensor, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(unpermuted_row_to_permuted_row_tensor, "shape", None), + getattr(permuted_row_to_unpermuted_row_tensor, "shape", None), + getattr(expert_first_token_offset_tensor, "shape", None), + getattr(x, "shape", None), + )
1-4
: Add NVIDIA copyright header. Per repo guidelines, prepend the current year header to all source files.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. """ MoE Backend abstraction for supporting different MoE computation implementations. This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.) """
885-887
: Fix: call has_deepseek_fp8_block_scales() during backend selection. Currently passing the method object makes the condition always truthy when present.
- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + attr = getattr(module, 'has_deepseek_fp8_block_scales', None) + has_block_fp8 = bool(attr() if callable(attr) else attr)
224-275
: Fix: pass the actual weight view dtype to MoERunner; also ensure boolean flags aren’t bound methods
MoERunner
should see the same dtype you use for.view(...)
during profiling.has_deepseek_fp8_block_scales
is a method; passing the bound method is always truthy.- # Determine view dtype for weights to match runtime quantization layout + # Determine view dtype for weights to match runtime quantization layout weight_view_dtype = module.w3_w1_weight.dtype if getattr(module, 'has_w4afp8', False): weight_view_dtype = torch.quint4x2 elif getattr(module, 'has_w4a16_mxfp4', False): weight_view_dtype = torch.uint8 ... if self.moe_runner is None: self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, tp_size=module.tp_size, tp_rank=module.tp_rank, ep_size=module.ep_size, ep_rank=module.ep_rank, cluster_size=module.cluster_size, cluster_rank=module.cluster_rank, - use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') else False + ), use_w4_group_scaling=getattr(module, 'has_w4afp8', False), use_int8_woq_per_channel=getattr(module, 'has_int8_woq_per_channel', False), use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling', False), min_latency_mode=min_latency_mode, use_fused_finalize=getattr(module, 'use_fused_finalize', False), )
816-834
: Use experts_per_token (not .top_k) in finalize-scale call. Routing methods expose experts_per_token; .top_k may not exist.
- module.routing_method.top_k if module else 1,
+ (module.routing_method.experts_per_token if module else 1),
🧹 Nitpick comments (6)
tensorrt_llm/_torch/distributed/ops.py (1)
15-21
: Avoid shadowing the project logger with a stdlib logger instance. This file imports the project logger and then reassigns logger to a stdlib logger. Keep the project logger for consistent behavior and formats.
-from tensorrt_llm.logger import logger
+from tensorrt_llm.logger import logger
...
-import logging
+import logging
...
-logger = logging.getLogger(__name__)
+# Do not override the imported project logger.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
672-699
: Consider removing commented-out legacy call before merge. Keeping large commented blocks adds noise. Retain in Git history if needed.
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
381-386
: Nit: long call fits, but keep line length < 120 in future edits. Minor style; nothing blocking.
715-727
: Remove verbose prints from DeepGemm path; use logger.debug with placeholders. Prevents huge stdout spam and avoids string interpolation cost when debug is off.
- print( - "xxi shape 2: enter deepgemm backend compute_moe \n" - f"x.shape: {getattr(x, 'shape', None)}, \n" - f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n" - f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n" - f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n" - f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n" - f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n" - ) + logger.debug( + "[DeepGemm::enter] x=%s sf=%s slots=%s scales=%s " + "perm_row2unperm=%s perm_experts=%s perm_data=%s " + "expert_first_off=%s perm_scales=%s unperm_row2perm=%s", + getattr(x, "shape", None), + getattr(input_sf, "shape", None), + getattr(token_selected_slots, "shape", None), + getattr(token_final_scales, "shape", None), + getattr(permuted_row_to_unpermuted_row_tensor, "shape", None), + getattr(permuted_token_selected_experts_tensor, "shape", None), + getattr(permuted_data_tensor, "shape", None), + getattr(expert_first_token_offset_tensor, "shape", None), + getattr(permuted_token_final_scales_tensor, "shape", None), + getattr(unpermuted_row_to_permuted_row_tensor, "shape", None), + )
720-744: Optional: add a guard for zero-sized inputs earlier
You handle numel() == 0 shortly after; consider an early return just after permute to skip allocations.
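A minimal sketch of the suggested guard, with the tensor and function names assumed from the surrounding discussion rather than taken from the actual backend:

```python
import torch


def compute_moe_after_permute(permuted_data_tensor: torch.Tensor,
                              x: torch.Tensor) -> torch.Tensor:
    """Illustrative only: bail out before allocating workspaces when no tokens landed here."""
    if permuted_data_tensor.numel() == 0:
        # No tokens were routed to this rank; skip the quant/GEMM allocations entirely.
        return torch.zeros_like(x)

    # ... expensive path: masked quant, grouped GEMMs, gather, finalize ...
    return x  # placeholder for the real result
```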
700-714: Note: finalize permute uses use_fp8_block_scaling=True unconditionally
Assumed invariant for the DeepGemm path. If this ever becomes runtime-configurable, plumb it through from the module/backend selector.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/_torch/distributed/communicator.py (2): tp_size (46-47), world_size (26-27)
tensorrt_llm/llmapi/llm_args.py (2): world_size (246-258), world_size (267-274)
tensorrt_llm/logger.py (1): debug (143-144)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4): MoEBackendSelection (850-893), select_backend (862-893), run_moe (101-212), run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1): get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2): DeepSeekFP8BlockScalesFusedMoEMethod (604-737), DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1): MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5): deepgemm_fp8_group_blockwise_gemm (246-284), masked_index_copy_group_quant_fp8 (89-160), preprocess_after_permute (233-242), set_strides (287-293), triton_masked_index_gather (195-216)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1): has_deepseek_fp8_block_scales (127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py
246-246: Line too long (156 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
419-419: Line too long (461 > 120)
(E501)
731-731: Line too long (965 > 120)
(E501)
814-814: Line too long (372 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
721-721: Line too long (127 > 120)
(E501)
722-722: Line too long (129 > 120)
(E501)
725-725: Line too long (121 > 120)
(E501)
726-726: Line too long (125 > 120)
(E501)
840-840: Line too long (127 > 120)
(E501)
841-841: Line too long (127 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
🔇 Additional comments (10)
tensorrt_llm/_torch/distributed/ops.py (1)
463-489: MNNVL AUTO/MNNVL enablement and fallback logic looks sound (see the sketch after the bullets below)
- Gated under AUTO or MNNVL strategies.
- Disables when tp_size != world_size, as expected for multi-node NVLink scope.
- Safe instantiation with try/except; clean fallbacks and debug logs.
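A minimal sketch of that gating order. MNNVLAllReduce and its is_mnnvl check are the names used in the review; the standalone helper, its signature, and the plain string strategy values are simplifications, not the actual AllReduce.__init__ code.

```python
def maybe_init_mnnvl(strategy, mapping, dtype, mnnvl_cls, logger):
    """Illustrative gating order: strategy -> tp/world match -> capability probe -> guarded construction.

    mnnvl_cls stands in for MNNVLAllReduce (exposes is_mnnvl() and a (mapping, dtype) constructor).
    """
    if strategy not in ("AUTO", "MNNVL"):  # stand-ins for the AllReduceStrategy members
        logger.debug("MNNVL disabled by strategy %s", strategy)
        return None
    if mapping.tp_size != mapping.world_size:
        logger.debug("MNNVL disabled: tp_size %d != world_size %d",
                     mapping.tp_size, mapping.world_size)
        return None
    if dtype is None or not mnnvl_cls.is_mnnvl(mapping, dtype):
        logger.debug("MNNVL disabled: is_mnnvl check failed")
        return None
    try:
        return mnnvl_cls(mapping, dtype)
    except Exception as exc:  # fall back to the other allreduce strategies
        logger.debug("MNNVL disabled: %s", exc)
        return None
```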
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
237-239: Early attribute initialization is fine
Lazy-init of self.moe_backend to None pre-weights is OK and avoids accidental use before weights are created.
368-370: Backend selection after weight creation is appropriate
Selecting the backend only after weights are created aligns with quant-dependent routes. LGTM.
498-509: Good: gate statistic fetch with layer_load_balancer and is_last_call
Prevents unconditional get_local_statistic_tensor calls.
700-729: Backend switchover looks correct; passes through fused_moe-compatible params
The backend API parity with torch.ops.trtllm.fused_moe is preserved; passing module=self enables backend-specific access. LGTM.
777-787: Add lazy initialization of moe_backend in forward
To guard against model_config.skip_create_weights_in_init=True skipping your backend setup, ensure self.moe_backend is always initialized before use:

- File: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- Location: forward method (around lines 777-787)

 def forward(
     self,
     x: Union[torch.Tensor, Fp4QuantizedTensor],
     router_logits: torch.Tensor,
     do_finalize: bool = True,
     output_dtype: Optional[torch.dtype] = None,
     all_rank_num_tokens: Optional[List[int]] = None,
     all_rank_max_num_tokens: Optional[int] = None,
     use_dp_padding: Optional[bool] = None,
 ) -> torch.Tensor:
+    if getattr(self, "moe_backend", None) is None:
+        self.moe_backend = MoEBackendSelection.select_backend(self)
     assert all_rank_num_tokens is not None

This lazy-init ensures that, even if weight creation is deferred, your MoE backend is ready when forward runs. Please verify that no other code paths already cover this initialization.

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
861-893: Backend selection policy is clear; criteria read well
SM100 + block FP8 -> DeepGemm; else Cutlass. Nice and contained.
536-568: No-op finalize_tactic for DeepGemm is appropriate
DeepGemm's static tactic design justifies this. Good separation from Cutlass.
497-534: Cutlass run_moe flow is coherent and matches compute_moe contract
Tuner input derivation mirrors fused_moe path; return type aligns with caller.
619-667: DeepGemm path composition is correct
Permutation → (W3/W1) GEMM → SiLU → (W2) GEMM → gather → finalize/scale. Looks consistent with the intended block-FP8 pipeline.
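For readers unfamiliar with that flow, here is a self-contained, unquantized toy in plain PyTorch that walks the same six steps (top-1 routing and a Python loop instead of the real block-FP8 grouped kernels); it is illustrative only and shares no code with the backend.

```python
import torch


def toy_grouped_moe(x, w3_w1, w2, expert_ids, expert_scales):
    """Toy rendition of permute -> grouped GEMM -> SiLU*mul -> GEMM -> gather -> scale.

    x:             [tokens, hidden]
    w3_w1:         [experts, 2 * inter, hidden]   (W3 and W1 stacked)
    w2:            [experts, hidden, inter]
    expert_ids:    [tokens] single expert per token (top-1 for brevity)
    expert_scales: [tokens] routing scale per token
    """
    out = torch.zeros_like(x)
    order = torch.argsort(expert_ids)                  # 1) permute tokens by expert
    for e in expert_ids.unique():
        rows = order[expert_ids[order] == e]           #    contiguous slab for expert e
        h = x[rows] @ w3_w1[e].t()                     # 2) W3/W1 grouped GEMM
        w3_out, w1_out = h.chunk(2, dim=-1)
        h = torch.nn.functional.silu(w3_out) * w1_out  # 3) SiLU * mul
        h = h @ w2[e].t()                              # 4) W2 GEMM back to hidden size
        out[rows] = h                                  # 5) gather back to original rows
    return out * expert_scales.unsqueeze(-1)           # 6) finalize: apply routing scales


num_experts, hidden, inter, tokens = 4, 8, 16, 10
x = torch.randn(tokens, hidden)
w3_w1 = torch.randn(num_experts, 2 * inter, hidden)
w2 = torch.randn(num_experts, hidden, inter)
ids = torch.randint(0, num_experts, (tokens,))
scales = torch.rand(tokens)
print(toy_grouped_moe(x, w3_w1, w2, ids, scales).shape)  # torch.Size([10, 8])
```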
import torch

from tensorrt_llm._utils import get_sm_version
🛠️ Refactor suggestion
Import the project logger for consistent logging across backends
You use print
later; switch to logger.debug
and import the project logger here.
import torch
-from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 9 to 13,
the module currently imports torch and get_sm_version but later uses print for
debug output; import the project logger (e.g., from tensorrt_llm import logger
or the module's standard logger entrypoint) at the top alongside existing
imports and replace any print(...) calls in this file with logger.debug(...)
calls so logging is consistent across backends and uses the configured logger.
Signed-off-by: xxi <[email protected]> modified: tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py modified: tensorrt_llm/_torch/distributed/ops.py modified: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py modified: tensorrt_llm/_torch/modules/fused_moe/moe_backend.py modified: tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
force-pushed from e3b58fb to 15d9841 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (9)
tensorrt_llm/_torch/distributed/ops.py (1)
243-247
: Replace print with proper logging; fix long line (E501).Use the module logger instead of printing to stdout and wrap the message to satisfy the 120-char limit.
- for val in input: - if val is not None and val.shape[dim] != sum_split_size: - print( - f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}" - ) + for val in input: + if val is not None and val.shape[dim] != sum_split_size: + logger.debug( + f"[reducescatter] val.shape={val.shape}, dim={dim}, " + f"val.shape[dim]={val.shape[dim]}, " + f"sum_split_size={sum_split_size}, sizes={sizes}" + )Run to ensure no stray prints remain:
#!/bin/bash rg -n -C2 '^\s*print\(' tensorrt_llm/_torch/distributed/ops.pytensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
963-965
: Replace noisy print with logger.debug; fix E501 and keep logs structured.Use the project logger and avoid long f-strings on hot paths.
- print( - f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}" - ) + logger.debug( + "maybe_create_moe_load_balancer: supported_arch=%s, using_ep=%s, " + "using_smart_router=%s, moe_load_balancer=%s", + in_supported_model_arch, + using_ep, + using_smart_router, + getattr(model_config, "moe_load_balancer", None), + )tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)
245-274
: Two fixes: pass the correct weight dtype to MoERunner; call the FP8 flag method.
- weight_dtype should match the runtime weight.view(dtype).
- has_deepseek_fp8_block_scales is a method; calling it is required.
- # Determine view dtype for weights to match runtime quantization layout + # Determine view dtype for weights to match runtime quantization layout weight_view_dtype = module.w3_w1_weight.dtype if getattr(module, 'has_w4afp8', False): weight_view_dtype = torch.quint4x2 elif getattr(module, 'has_w4a16_mxfp4', False): weight_view_dtype = torch.uint8 @@ if self.moe_runner is None: self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, tp_size=module.tp_size, tp_rank=module.tp_rank, ep_size=module.ep_size, ep_rank=module.ep_rank, cluster_size=module.cluster_size, cluster_rank=module.cluster_rank, - use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=( + module.has_deepseek_fp8_block_scales() + if hasattr(module, 'has_deepseek_fp8_block_scales') else False + ), use_w4_group_scaling=getattr(module, 'has_w4afp8', False), use_int8_woq_per_channel=getattr(module, 'has_int8_woq_per_channel', False), use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling', False), min_latency_mode=min_latency_mode, use_fused_finalize=getattr(module, 'use_fused_finalize', False), )
830-836
: Use experts_per_token (not top_k) to avoid AttributeError and ensure correct routing.- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
311-314
: Replace print with logger.debug and wrap message.- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Cannot use alltoall due to chunking %s", + self.calculate_num_chunks(all_rank_num_tokens), + )
319-321
: Replace print with logger.debug; fixes E501 long line too.- print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Cannot use alltoall due to deep_ep_max_num_tokens %s > %s", + all_rank_max_num_tokens, + self.deep_ep_max_num_tokens, + )
324-324
: Replace print with logger.debug for consistency.- print(f"all to all type {self.alltoall_method_type}") + logger.debug("alltoall method type: %s", self.alltoall_method_type)
333-345
: Swap verbose prints for compact logger.debug in quant-method selection.- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug( + "wide_ep _get_quant_method: get_sm_version()=%s", + get_sm_version(), + ) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug( + "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" + ) return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug( + "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod" + ) return DeepSeekFP8BlockScalesFusedMoEMethod()
8-8
: Fix logger import; use tensorrt_llm.logger (not _utils).-from tensorrt_llm._utils import get_sm_version, logger +from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
🧹 Nitpick comments (7)
tensorrt_llm/_torch/distributed/ops.py (2)
465-487
: Make forced MNNVL fallback noisy; clarify dtype=None disable reason.If strategy=MNNVL but it’s disabled, emit a warning (not debug). Also clarify when dtype=None is the reason.
- if self.strategy in (AllReduceStrategy.AUTO, - AllReduceStrategy.MNNVL): + if self.strategy in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL): + forced_mnnvl = (self.strategy == AllReduceStrategy.MNNVL) if self.mapping.tp_size != self.mapping.world_size: - logger.debug( - f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} " - f"!= world_size:{self.mapping.world_size}") + (logger.warning if forced_mnnvl else logger.debug)( + f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} " + f"!= world_size:{self.mapping.world_size}" + ) self.mnnvl_allreduce = None elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype): try: self.mnnvl_allreduce = MNNVLAllReduce( self.mapping, dtype) if dtype else None if self.mnnvl_allreduce: - logger.debug(f"MNNVLAllReduce is enabled") + logger.debug("MNNVLAllReduce is enabled") else: - logger.debug(f"MNNVLAllReduce is disabled") + (logger.warning if forced_mnnvl else logger.debug)( + "MNNVLAllReduce is disabled (dtype is None)." + ) except Exception as e: - logger.debug( - f"MNNVL AllReduce can't be enabled due to {e}.") + (logger.warning if forced_mnnvl else logger.debug)( + f"MNNVL AllReduce can't be enabled due to: {e}" + ) self.mnnvl_allreduce = None else: - logger.debug( - f"MNNVLAllReduce can't be enabled due to failing the is_mnnvl check." - ) + (logger.warning if forced_mnnvl else logger.debug)( + "MNNVLAllReduce can't be enabled; is_mnnvl check failed." + ) self.mnnvl_allreduce = NoneTo verify behavior, please confirm intended policy: when users explicitly choose MNNVL, should fallback be “warn-and-continue” (as above) or “error-and-stop”?
15-21
: Don’t shadow the project logger.
from tensorrt_llm.logger import logger
is immediately shadowed bylogging.getLogger(__name__)
, leading to inconsistent log routing.-logger = logging.getLogger(__name__) +# Use the project logger imported above for consistent routing/levels.Alternatively, prefer module import per guidelines and use
tensorrt_llm.logger.logger
everywhere.tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
382-383
: Wrap long conditional assignment to satisfy E501 and readability.- run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe + run_moe = ( + self.moe_runner.fused_moe_runner.run_moe_min_latency + if min_latency_mode + else self.moe_runner.fused_moe_runner.run_moe + )
717-729
: Delete commented debug prints; they still trigger E501 and clutter the file.- # print( - # "xxi shape 2: enter deepgemm backend compute_moe \n" - # f"x.shape: {getattr(x, 'shape', None)}, \n" - # f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n" - # f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n" - # f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - # f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - # f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n" - # f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - # f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - # f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n" - # f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n" - # ) @@ - # print( - # "xxi shape 3: exit deepgemm backend compute_moe \n" - # f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n" - # f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - # f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - # f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n" - # f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - # f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - # f"x.shape: {getattr(x, 'shape', None)}, \n")Also applies to: 837-845
1-13
: Import the project logger if you adopt logger.debug above.from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
418-421
: Remove commented debug block; it violates E501 and adds noise.- # print( - # f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}" - # )
812-815
: Delete leftover commented print (non-English comment, long line).- # 一行打印所有信息 - # print( - # f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - # )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/distributed/ops.py
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(11 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
MoEBackendSelection
(852-898)select_backend
(864-898)run_moe
(101-212)run_moe
(413-533)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
get_moe_load_balancer
(1027-1037)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/quantization/utils/fp8_utils.py (1)
silu_and_mul_masked_post_quant_fwd
(304-375)tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
has_deepseek_fp8_block_scales
(127-130)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
AllReduceStrategy
(3876-3885)tensorrt_llm/_torch/distributed/communicator.py (2)
tp_size
(46-47)world_size
(26-27)tensorrt_llm/logger.py (1)
debug
(143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
419-419: Line too long (463 > 120)
(E501)
731-731: Line too long (967 > 120)
(E501)
814-814: Line too long (374 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
382-382: Line too long (136 > 120)
(E501)
723-723: Line too long (129 > 120)
(E501)
724-724: Line too long (131 > 120)
(E501)
727-727: Line too long (123 > 120)
(E501)
728-728: Line too long (127 > 120)
(E501)
842-842: Line too long (129 > 120)
(E501)
843-843: Line too long (129 > 120)
(E501)
tensorrt_llm/_torch/distributed/ops.py
246-246: Line too long (156 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
964-964: Line too long (219 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
368-370: LGTM: backend selection deferred until weights exist.
The lazy init avoids probing module state too early and keeps forward path clean.

700-728: LGTM: unified backend.run_moe call mirrors fused_moe signature.
Passing module=self enables backend-specific decisions without leaking details here.
            print(
                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
            )
            return False

        # For DeepEPLowLatency, check if tokens exceed the threshold
        if (self.alltoall_method_type == AlltoallMethodType.DeepEPLowLatency
                and all_rank_max_num_tokens > self.deep_ep_max_num_tokens):
            print(
                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
            )
💡 Verification agent
🧩 Analysis chain
Quick audit: ensure no stray print() calls remain in fused_moe modules.
Run this repo-local grep to confirm cleanup across the module.
Also applies to: 324-324, 333-345, 418-421, 812-815
🏁 Script executed:
#!/bin/bash
# Find any remaining print() in fused_moe Python sources (code, not tests).
rg -nP '^\s*print\(' tensorrt_llm/_torch/modules/fused_moe -g '!**/test/**' -S
Length of output: 1144
Remove all print() calls from fused_moe modules
Audit via the provided rg output shows the following remaining print() statements that should be removed or replaced with proper logging:

- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • Line 963: print( … )
- tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • Lines 893, 897: print("xxi select backend: …")
- tensorrt_llm/_torch/modules/fused_moe/routing.py
  • Lines 77, 95, 102: debug routing prints
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • Lines 311, 319, 324, 333, 337, 342: alltoall and chunking diagnostics

Please remove these print() calls. If retaining these messages is necessary for debugging or telemetry, convert them to the module's logger (e.g., logging.debug or logging.info) at the appropriate level.
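If the messages are kept, a minimal module-level pattern that avoids eager f-string interpolation is sketched below. It uses the standard library logging module purely as an illustration; in-tree, the project's own tensorrt_llm logger would be the natural choice, and the function here is a toy stand-in for the real check.

```python
import logging

logger = logging.getLogger(__name__)


def can_use_alltoall(num_chunks: int, max_tokens: int, limit: int) -> bool:
    """Toy stand-in for the real check; only the logging pattern matters here."""
    if num_chunks > 1:
        # %s placeholders are formatted only when DEBUG is enabled, unlike f-strings.
        logger.debug("can not use alltoall due to chunking %s", num_chunks)
        return False
    if max_tokens > limit:
        logger.debug("can not use alltoall due to deep_ep_max_num_tokens %s > %s",
                     max_tokens, limit)
        return False
    return True
```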
🧰 Tools
🪛 Ruff (0.12.2)
320-320: Line too long (127 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around lines 311
to 321, remove the inline print() calls and replace them with the module logger
(e.g., logger.debug or logger.info) preserving the original message
text/formatting; ensure a logger is available (import logging and create logger
= logging.getLogger(__name__) if not already present) and keep control
flow/return behavior unchanged so the function still returns False where
intended.
        is_blackwell = get_sm_version() == 100
        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
                         and module.has_deepseek_fp8_block_scales)

        # print(f"xxi select backend: is_blackwell: {is_blackwell}, has_block_fp8: {has_block_fp8}")
        if is_blackwell and has_block_fp8:
            # Use DeepGemm backend for Blackwell with block FP8
            print("xxi select backend: use deepgemm backend")
            return MoeDeepGemmBackend()
        else:
            # Use Cutlass backend for all other cases
            print("xxi select backend: use cutlass backend")
            return MoeCutlassBackend()
Fix method/boolean misuse in backend selection and replace prints with logger.
- Call has_deepseek_fp8_block_scales() safely (it’s a method).
- Use logger.debug instead of print.
- has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
- and module.has_deepseek_fp8_block_scales)
+ has_block_fp8 = (
+ hasattr(module, 'has_deepseek_fp8_block_scales')
+ and (
+ module.has_deepseek_fp8_block_scales()
+ if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+ else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+ )
+ )
@@
- print("xxi select backend: use deepgemm backend")
+ logger.debug("MoEBackendSelection: using DeepGemm backend")
return MoeDeepGemmBackend()
else:
- print("xxi select backend: use cutlass backend")
- return MoeCutlassBackend()
+ logger.debug("MoEBackendSelection: using Cutlass backend")
+ return MoeCutlassBackend()
Add this import near the top of the file (outside this hunk):
from tensorrt_llm.logger import logger
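Putting the two suggestions together, a hedged sketch of what the selection helper could look like. It assumes it lives next to the MoeCutlassBackend / MoeDeepGemmBackend classes in moe_backend.py (their spelling follows the quoted snippet above), and it is not the actual implementation.

```python
from tensorrt_llm._utils import get_sm_version
from tensorrt_llm.logger import logger


def select_backend_sketch(module):
    """Illustrative selection: SM100 + block-FP8 quantization -> DeepGemm, otherwise Cutlass."""
    attr = getattr(module, 'has_deepseek_fp8_block_scales', None)
    has_block_fp8 = bool(attr() if callable(attr) else attr)

    if get_sm_version() == 100 and has_block_fp8:
        logger.debug("MoEBackendSelection: using DeepGemm backend")
        return MoeDeepGemmBackend()
    logger.debug("MoEBackendSelection: using Cutlass backend")
    return MoeCutlassBackend()
```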
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 886 to 898,
the code treats has_deepseek_fp8_block_scales as a boolean attribute and uses
print statements; change it to safely call the method only if present and
callable (e.g., hasattr(...) and callable(...)) and capture its boolean result
into has_block_fp8, replace both print calls with logger.debug, and add the
required import "from tensorrt_llm.logger import logger" near the top of the
file.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
1-1
: Add NVIDIA copyright header (2025).All source files must prepend the current-year NVIDIA copyright header.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
♻️ Duplicate comments (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
311-321
: Replace print() diagnostics with logger.debug and fix lints.Raw prints in hot path spam stdout and violate E501.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Disable alltoall: chunking=%d", + self.calculate_num_chunks(all_rank_num_tokens), + ) ... - print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Disable alltoall: deep_ep_max_num_tokens %d > %d", + all_rank_max_num_tokens, + self.deep_ep_max_num_tokens, + ) ... - print(f"all to all type {self.alltoall_method_type}") + logger.debug("Alltoall method type: %s", self.alltoall_method_type)Also applies to: 324-324
333-345
: Use logger for quant-method selection; remove prints.Convert to structured debug logs and keep lines <120.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug("wide_ep _get_quant_method: sm=%s", get_sm_version()) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug("wide_ep _get_quant_method: DeepGemm variant") return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug("wide_ep _get_quant_method: Cutlass variant") return DeepSeekFP8BlockScalesFusedMoEMethod()
8-8
: Fix logger import; _utils does not export logger.Import logger from tensorrt_llm.logger to avoid ImportError at runtime.
-from tensorrt_llm._utils import get_sm_version, logger +from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import loggertensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
11-11
: Import project logger and stop using print.Use tensorrt_llm.logger for consistent, configurable logging.
-from tensorrt_llm._utils import get_sm_version +from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
748-754
: Fix attribute: use experts_per_token (not top_k).Routing exposes experts_per_token elsewhere; .top_k may raise AttributeError.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,
1-816
: Remove all stray print statements in fused_moe modulesThe following debug
print()
calls were detected and should be removed (or replaced with proper logging) to avoid unintended console output:• tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py:963
• tensorrt_llm/_torch/modules/fused_moe/moe_backend.py:811, 815
• tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py:311, 319, 324, 333, 337, 342
• tensorrt_llm/_torch/modules/fused_moe/routing.py:77, 95, 102Please remove these calls (or switch to a standardized logger) and ensure no new
print()
statements are introduced.
🧹 Nitpick comments (8)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
428-431
: Remove commented-out print blocks to satisfy E501 and reduce noise.These long commented prints trigger line-length lint and add churn in diffs.
Would you like me to open a follow-up patch to delete these comment blocks?
Also applies to: 679-706, 728-731, 811-813
309-326
: Optional: add a fast path guard for can_use_alltoall.Return early if self.alltoall_method_type is NotEnabled to save work.
def can_use_alltoall(self, all_rank_num_tokens, all_rank_max_num_tokens): + if not self.enable_alltoall: + return Falsetensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)
217-241
: Use weight_view_dtype for MoERunner to match runtime .view dtype.Prevents tactic/profile mismatch when view dtype differs from storage (e.g., W4 group scaling).
- self.moe_runner = MoERunner( + self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, @@ - use_deepseek_fp8_block_scale=getattr( - module, 'has_deepseek_fp8_block_scales', False), + use_deepseek_fp8_block_scale=getattr( + module, 'has_deepseek_fp8_block_scales', False),
321-328
: Remove unused locals flagged by Ruff.These are never referenced in compute_moe.
- use_deepseek_fp8_block_scale = getattr(module, - 'has_deepseek_fp8_block_scales', - False) use_w4_group_scaling = getattr(module, 'has_w4afp8', False) - use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel', - False)
348-349
: Wrap long line to satisfy E501.Minor readability and lint fix.
- run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe + run_moe = ( + self.moe_runner.fused_moe_runner.run_moe_min_latency + if min_latency_mode + else self.moe_runner.fused_moe_runner.run_moe + )
635-647
: Delete long commented debug prints to appease Ruff E501.These blocks are commented and unnecessary in production.
Would you like me to submit a follow-up commit removing them?
Also applies to: 755-764
804-816
: Replace backend-selection prints with logger.debug.Keep selection diagnostics but via project logger.
- is_blackwell = get_sm_version() == 100 - has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + is_blackwell = get_sm_version() == 100 + has_block_fp8 = ( + hasattr(module, 'has_deepseek_fp8_block_scales') + and module.has_deepseek_fp8_block_scales + ) @@ - # print(f"xxi select backend: is_blackwell: {is_blackwell}, has_block_fp8: {has_block_fp8}") if is_blackwell and has_block_fp8: - # Use DeepGemm backend for Blackwell with block FP8 - print("xxi select backend: use deepgemm backend") + logger.debug( + "MoEBackendSelection: using DeepGemm (SM=%s, block_fp8=%s)", + is_blackwell, has_block_fp8, + ) return MoEDeepGemmBackend() else: - # Use Cutlass backend for all other cases - print("xxi select backend: use cutlass backend") + logger.debug( + "MoEBackendSelection: using Cutlass (SM=%s, block_fp8=%s)", + is_blackwell, has_block_fp8, + ) return MoECutlassBackend()
202-206
: Assert message is useful; consider early-return if tuner_input is None.You already assert; leaving as-is is fine.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(13 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
MoEBackend
(17-177)MoEBackendSelection
(770-816)select_backend
(782-816)run_moe
(109-177)run_moe
(379-461)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
MoE
(22-181)has_deepseek_fp8_block_scales
(127-130)tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
429-429: Line too long (463 > 120)
(E501)
729-729: Line too long (967 > 120)
(E501)
812-812: Line too long (374 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
322-322: Local variable use_deepseek_fp8_block_scale
is assigned to but never used
Remove assignment to unused variable use_deepseek_fp8_block_scale
(F841)
326-326: Local variable use_int8_woq_per_channel
is assigned to but never used
Remove assignment to unused variable use_int8_woq_per_channel
(F841)
348-348: Line too long (136 > 120)
(E501)
641-641: Line too long (129 > 120)
(E501)
642-642: Line too long (131 > 120)
(E501)
645-645: Line too long (123 > 120)
(E501)
646-646: Line too long (127 > 120)
(E501)
760-760: Line too long (129 > 120)
(E501)
761-761: Line too long (129 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
368-380
: LGTM: Lazy backend property and module-first call.The lazy initialization via moe_backend and passing self to run_moe is clean and keeps backend selection after weights exist.
237-239
: LGTM: Backend lazy-init sentinel.Storing _moe_backend=None with a clear comment is fine.
500-518
: No changes needed forget_local_statistic_tensor
guardThe call to
get_local_statistic_tensor
is intentionally limited to the final iteration (is_last_call
) of the All-to-All loop:
update_local_statistic(...)
is invoked on every invocation whenuse_all_to_all && method==MNNVL
, satisfying theassert
inget_local_statistic_tensor
and populatinglocal_statistic_tensor
before the last call .- Only on
is_last_call
do we fetch the local statistics and pass them intoalltoall_prepare_maybe_dispatch
, so that the per-rank stats are gathered exactly once for the global update path. Skipping this on non-last calls avoids redundant or premature gathers.This matches the intended protocol. No further guard or changes required.
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
376-378
: Return-type contract alignment looks good.Returning [output] in non-min-latency mode keeps caller’s indexing stable across backends.
Also applies to: 765-767
441-442
: Same bug in Cutlass.run_moe: pass use_fused_finalize correctly.Positional args currently misbind. Use keywords.
- self.finalize_tactic(module, tuner_input, output_dtype, - min_latency_mode, tuner_top_k) + self.finalize_tactic( + module, + tuner_input, + output_dtype, + min_latency_mode=min_latency_mode, + use_fused_finalize=use_fused_finalize, + tuner_top_k=tuner_top_k, + )Likely an incorrect or invalid review comment.
157-160
: Bug: finalize_tactic called with wrong positional args.tuner_top_k is passed into use_fused_finalize in Cutlass override. Use keywords and pass use_fused_finalize.
- self.finalize_tactic(module, input, output_dtype, min_latency_mode, - tuner_top_k) + self.finalize_tactic( + module, + input, + output_dtype, + min_latency_mode=min_latency_mode, + use_fused_finalize=use_fused_finalize, + tuner_top_k=tuner_top_k, + )Likely an incorrect or invalid review comment.
@@ -0,0 +1,816 @@
"""
🛠️ Refactor suggestion
Add NVIDIA copyright header (2025).
Comply with repo policy for all source files.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
""" | |
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | |
""" |
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py lines 1-1, the file is
missing the required NVIDIA copyright header for 2025; insert the repo-standard
NVIDIA copyright/license header at the very top of the file before any docstring
or code, updating the year to 2025 and including the required
SPDX-License-Identifier and organization attribution as specified by the
repository policy.
force-pushed from ddc8278 to 1c7826c (Compare)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
1-1
: Prepend NVIDIA copyright header (2025).This production source file needs the standard header per repo policy.
Add at the very top:
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. # # Licensed under the NVIDIA Source Code License. # # SPDX-License-Identifier: NVIDIA-Source-Code
♻️ Duplicate comments (9)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)
811-816
: Replace print with logger and standardize messages in backend selection.Avoid stdout spam; use project logger.
- print("xxi select backend: use deepgemm backend") + logger.debug("MoEBackendSelection: using DeepGemm backend") return MoEDeepGemmBackend() else: - print("xxi select backend: use cutlass backend") - return MoECutlassBackend() + logger.debug("MoEBackendSelection: using Cutlass backend") + return MoECutlassBackend()
28-37
: API mismatch: base finalize_tactic missing use_fused_finalize; callers misbind args.Align the base signature with MoECutlassBackend to avoid positional misbinding and enable named passing.
def finalize_tactic( self, module: Any, tuner_input: torch.Tensor, output_dtype: torch.dtype, - min_latency_mode: bool = False, - tuner_top_k: Optional[int] = None, + min_latency_mode: bool = False, + use_fused_finalize: bool = True, + tuner_top_k: Optional[int] = None, ) -> None: @@ - min_latency_mode: Whether to profile for min-latency path + min_latency_mode: Whether to profile for min-latency path + use_fused_finalize: Whether to use fused finalize epilogue (backend-dependent) tuner_top_k: Top-k value for tuning (Cutlass specific)
210-216
: Use weight_view_dtype when constructing MoERunner.MoERunner should be initialized with the dtype actually used at runtime views to avoid inconsistency.
# Determine view dtype for weights to match runtime quantization layout weight_view_dtype = module.w3_w1_weight.dtype if getattr(module, 'has_w4afp8', False): weight_view_dtype = torch.quint4x2 elif getattr(module, 'has_w4a16_mxfp4', False): weight_view_dtype = torch.uint8 @@ if self.moe_runner is None: self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k, tp_size=module.tp_size, tp_rank=module.tp_rank, ep_size=module.ep_size, ep_rank=module.ep_rank, cluster_size=module.cluster_size, cluster_rank=module.cluster_rank, use_deepseek_fp8_block_scale=getattr( module, 'has_deepseek_fp8_block_scales', False), use_w4_group_scaling=getattr(module, 'has_w4afp8', False), use_int8_woq_per_channel=getattr(module, 'has_int8_woq_per_channel', False), use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling', False), min_latency_mode=min_latency_mode, use_fused_finalize=use_fused_finalize, )Also applies to: 219-240
748-754
: Routing API: use experts_per_token (not top_k).Else this can raise AttributeError or wrong values with some routing methods.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,
11-13
: Import the canonical logger and stop using print().Prepare for replacing print() with logger calls below.
from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
1-4
: Add NVIDIA copyright header (2025).Source files must include the NVIDIA header per repo policy.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the NVIDIA Source Code License. +# +# SPDX-License-Identifier: NVIDIA-Source-Codetensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
311-325: Swap debug prints for logger.debug in can_use_alltoall. Avoid stdout noise and align with project logging.
-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall: chunking required (%d chunks)",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
@@
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: deep_ep_max_num_tokens %d > %d",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )
@@
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall type: %s", self.alltoall_method_type)
333-345: Replace prints in _get_quant_method with logger.debug. Keep diagnostics but not as stdout prints.
-        print(
-            f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-        )
+        logger.debug(
+            "wide_ep _get_quant_method: get_sm_version()=%s",
+            get_sm_version(),
+        )
         if get_sm_version() == 100:
-            print(
-                f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-            )
+            logger.debug(
+                "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+            )
             return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
         else:
-            print(
-                f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-            )
+            logger.debug(
+                "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod"
+            )
             return DeepSeekFP8BlockScalesFusedMoEMethod()
8-9: Incorrect logger import; use tensorrt_llm.logger. Fixes runtime import error.
-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
322-328: Remove unused locals flagged by Ruff. Drop variables that are assigned but not used.
-        use_deepseek_fp8_block_scale = getattr(module,
-                                               'has_deepseek_fp8_block_scales',
-                                               False)
         use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
-        use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel',
-                                           False)
+        # int8 per-channel flag is handled inside fused op if needed
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(13 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
MoEBackend
(17-177)MoEBackendSelection
(770-816)select_backend
(782-816)run_moe
(109-177)run_moe
(379-461)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
MoE
(22-181)has_deepseek_fp8_block_scales
(127-130)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
enable_alltoall
(298-301)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
429-429: Line too long (463 > 120)
(E501)
729-729: Line too long (967 > 120)
(E501)
812-812: Line too long (374 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
322-322: Local variable use_deepseek_fp8_block_scale
is assigned to but never used
Remove assignment to unused variable use_deepseek_fp8_block_scale
(F841)
326-326: Local variable use_int8_woq_per_channel
is assigned to but never used
Remove assignment to unused variable use_int8_woq_per_channel
(F841)
348-348: Line too long (136 > 120)
(E501)
641-641: Line too long (129 > 120)
(E501)
642-642: Line too long (131 > 120)
(E501)
645-645: Line too long (123 > 120)
(E501)
646-646: Line too long (127 > 120)
(E501)
760-760: Line too long (129 > 120)
(E501)
761-761: Line too long (129 > 120)
(E501)
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
368-380: Property uses _moe_backend_impl; ensure consistent lazy init. After fixing init, this block is fine. If you prefer _moe_backend naming, rename consistently in both places.
Would you like me to submit a follow-up diff to rename to _moe_backend across the file for consistency?
Force-pushed from 1c7826c to e8af58e (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (12)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (8)
477-485
: Keep DeepGemm finalize_tactic signature compatible with base.Add unused use_fused_finalize for API consistency.
def finalize_tactic( self, module: Any, tuner_input: torch.Tensor, output_dtype: torch.dtype, - min_latency_mode: bool = False, + min_latency_mode: bool = False, + use_fused_finalize: bool = True, tuner_top_k: Optional[int] = None, ) -> None:
805-816: Replace prints with logger and make has_block_fp8 robust. Use logger.debug and handle property vs callable for has_deepseek_fp8_block_scales.
-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales') and
+            (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+            )
+        )
@@
-        if is_blackwell and has_block_fp8:
-            # Use DeepGemm backend for Blackwell with block FP8
-            print("xxi select backend: use deepgemm backend")
-            return MoEDeepGemmBackend()
-        else:
-            # Use Cutlass backend for all other cases
-            print("xxi select backend: use cutlass backend")
-            return MoECutlassBackend()
+        if is_blackwell and has_block_fp8:
+            logger.debug("MoEBackendSelection: using DeepGemm backend")
+            return MoEDeepGemmBackend()
+        else:
+            logger.debug("MoEBackendSelection: using Cutlass backend")
+            return MoECutlassBackend()
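A compact, standalone sketch of the same selection rule, assuming only the two facts the diff relies on (SM 100 indicates Blackwell, and the module may expose has_deepseek_fp8_block_scales as either a plain attribute/property or a callable). Names below are illustrative, not the repository's API:

def select_backend_name(sm_version: int, module) -> str:
    """Pick a MoE backend name from the SM version and the block-FP8 flag."""
    flag = getattr(module, "has_deepseek_fp8_block_scales", False)
    has_block_fp8 = bool(flag() if callable(flag) else flag)
    is_blackwell = sm_version == 100
    return "deepgemm" if (is_blackwell and has_block_fp8) else "cutlass"

class _FakeModule:
    # Modeled as a plain attribute here; a real module may use a property.
    has_deepseek_fp8_block_scales = True

print(select_backend_name(100, _FakeModule()))  # deepgemm
print(select_backend_name(90, _FakeModule()))   # cutlass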
1-4
: Add NVIDIA copyright header (2025).Required by repo policy for Python sources.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
28-37
: Align base finalize_tactic signature with overrides (add use_fused_finalize).Prevents positional misbinding in callers and keeps LSP-safe.
def finalize_tactic( self, module: Any, tuner_input: torch.Tensor, output_dtype: torch.dtype, - min_latency_mode: bool = False, + min_latency_mode: bool = False, + use_fused_finalize: bool = True, tuner_top_k: Optional[int] = None, ) -> None: @@ - min_latency_mode: Whether to profile for min-latency path + min_latency_mode: Whether to profile for min-latency path + use_fused_finalize: Whether to use fused finalize epilogue (backend-dependent) tuner_top_k: Top-k value for tuning (Cutlass specific)
157-159: Bug: positional arg misbound to use_fused_finalize. Pass kwargs. Currently tuner_top_k is bound to use_fused_finalize in subclass overrides.
-        self.finalize_tactic(module, input, output_dtype, min_latency_mode,
-                             tuner_top_k)
+        self.finalize_tactic(
+            module,
+            input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )
217-241
: MoERunner weight_dtype should match view dtype used during tuning.Avoids constructor/runtime dtype mismatch.
if self.moe_runner is None: self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k,
441-443
: Bug: positional arg misbound to use_fused_finalize (Cutlass.run_moe).Pass keywords to avoid binding tuner_top_k into use_fused_finalize.
- self.finalize_tactic(module, tuner_input, output_dtype, - min_latency_mode, tuner_top_k) + self.finalize_tactic( + module, + tuner_input, + output_dtype, + min_latency_mode=min_latency_mode, + use_fused_finalize=use_fused_finalize, + tuner_top_k=tuner_top_k, + )
748-748
: Use experts_per_token (not top_k) for routing attribute.Prevents AttributeError on some routing implementations.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
1-1
: Add NVIDIA copyright header (2025).Required by repo policy for production sources.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
311-325
: Replace debug prints with logger.debug in can_use_alltoall.Avoid stdout noise and satisfy linters.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Cannot use alltoall: chunking=%d", + self.calculate_num_chunks(all_rank_num_tokens), + ) @@ - print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Cannot use alltoall: deep_ep_max_num_tokens %d > %d", + all_rank_max_num_tokens, + self.deep_ep_max_num_tokens, + ) @@ - print(f"all to all type {self.alltoall_method_type}") + logger.debug("Alltoall method type: %s", self.alltoall_method_type)
333-345
: Replace prints in _get_quant_method with logger.debug.Keep logs gated and concise.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug( + "wide_ep _get_quant_method: get_sm_version()=%s", + get_sm_version(), + ) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug( + "wide_ep _get_quant_method: DeepGemm FP8 block-scales" + ) return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug( + "wide_ep _get_quant_method: Cutlass FP8 block-scales" + ) return DeepSeekFP8BlockScalesFusedMoEMethod()
8-8
: Fix logger import; use canonical module.logger isn’t exported from _utils.
-from tensorrt_llm._utils import get_sm_version, logger +from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)
11-12
: Use project logger; import from tensorrt_llm.logger.Prepare for replacing print() below and keep logging consistent.
-from tensorrt_llm._utils import get_sm_version +from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
322-328
: Drop unused locals flagged by Ruff.Assignments are unused; remove to satisfy lint and reduce noise.
- use_deepseek_fp8_block_scale = getattr(module, - 'has_deepseek_fp8_block_scales', - False) - use_w4_group_scaling = getattr(module, 'has_w4afp8', False) - use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel', - False) + use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(13 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
MoEBackend
(17-177)MoEBackendSelection
(770-816)select_backend
(782-816)run_moe
(109-177)run_moe
(379-461)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
get_moe_load_balancer
(1027-1037)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
MoE
(22-181)has_deepseek_fp8_block_scales
(127-130)tensorrt_llm/_torch/autotuner.py (2)
AutoTuner
(271-752)choose_one
(329-436)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
enable_alltoall
(298-301)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/quantization/utils/fp8_utils.py (1)
silu_and_mul_masked_post_quant_fwd
(304-375)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
429-429: Line too long (463 > 120)
(E501)
729-729: Line too long (967 > 120)
(E501)
812-812: Line too long (374 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
322-322: Local variable use_deepseek_fp8_block_scale
is assigned to but never used
Remove assignment to unused variable use_deepseek_fp8_block_scale
(F841)
326-326: Local variable use_int8_woq_per_channel
is assigned to but never used
Remove assignment to unused variable use_int8_woq_per_channel
(F841)
348-348: Line too long (136 > 120)
(E501)
641-641: Line too long (129 > 120)
(E501)
642-642: Line too long (131 > 120)
(E501)
645-645: Line too long (123 > 120)
(E501)
646-646: Line too long (127 > 120)
(E501)
760-760: Line too long (129 > 120)
(E501)
761-761: Line too long (129 > 120)
(E501)
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
368-379: LGTM: lazy backend initialization property is clean and safe. Correctly guards on weights and centralizes backend selection.
Force-pushed from e8af58e to 6c62b25 (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)
800-815
: Replace prints with logger and make has_block_fp8 robust.Use the project logger and bool(getattr(...)) for clarity; add the logger import.
+from tensorrt_llm.logger import logger @@ - has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales') - and module.has_deepseek_fp8_block_scales) + has_block_fp8 = bool(getattr(module, 'has_deepseek_fp8_block_scales', False)) @@ - print("xxi select backend: use deepgemm backend") + logger.debug("MoEBackendSelection: using DeepGemm backend") return MoEDeepGemmBackend() else: - print("xxi select backend: use cutlass backend") + logger.debug("MoEBackendSelection: using Cutlass backend") return MoECutlassBackend()tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
1-1
: Add NVIDIA copyright header (2025).Required for production Python files.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. import os
♻️ Duplicate comments (9)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)
1-4
: Add NVIDIA copyright header (2025).Comply with repo policy for source files.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. """ MoE Backend abstraction for supporting different MoE computation implementations. This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.) """
476-483
: Keep DeepGemm.finalize_tactic signature compatible with base.Add unused use_fused_finalize to the override.
def finalize_tactic( self, module: Any, tuner_input: torch.Tensor, output_dtype: torch.dtype, min_latency_mode: bool = False, + use_fused_finalize: bool = True, tuner_top_k: Optional[int] = None, ) -> None:
441-443
: Fix Cutlass.run_moe finalize_tactic call: pass named args (same misbinding).Mirror the base run_moe fix to avoid positional confusion.
- self.finalize_tactic(module, tuner_input, output_dtype, - min_latency_mode, tuner_top_k) + self.finalize_tactic( + module, + tuner_input, + output_dtype, + min_latency_mode=min_latency_mode, + use_fused_finalize=use_fused_finalize, + tuner_top_k=tuner_top_k, + )
157-158
: Fix finalize_tactic call: pass by name to avoid positional misbinding.Currently tuner_top_k binds to use_fused_finalize in overrides.
- self.finalize_tactic(module, input, output_dtype, min_latency_mode, - tuner_top_k) + self.finalize_tactic( + module, + input, + output_dtype, + min_latency_mode=min_latency_mode, + use_fused_finalize=use_fused_finalize, + tuner_top_k=tuner_top_k, + )
30-37
: Align base API: add use_fused_finalize to finalize_tactic signature.Keep LSP with MoECutlassBackend override and avoid caller misbinding.
def finalize_tactic( self, module: Any, tuner_input: torch.Tensor, output_dtype: torch.dtype, min_latency_mode: bool = False, + use_fused_finalize: bool = True, tuner_top_k: Optional[int] = None, ) -> None:
746-752
: Use experts_per_token (not top_k) when finalizing.Avoid AttributeError and ensure correct routing parameter.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
311-325
: Replace print diagnostics with logger.debug.Avoid noisy stdout; adhere to project logging. Also fixes E501 by shortening lines.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Cannot use alltoall due to chunking=%s", + self.calculate_num_chunks(all_rank_num_tokens), + ) @@ - print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Cannot use alltoall: max_tokens %s > limit %s", + all_rank_max_num_tokens, + self.deep_ep_max_num_tokens, + ) @@ - print(f"all to all type {self.alltoall_method_type}") + logger.debug("Alltoall type: %s", self.alltoall_method_type)
333-345
: Swap remaining prints in _get_quant_method for logger.debug.Keep logging consistent; no behavior change.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug( + "wide_ep _get_quant_method: get_sm_version()=%s", + get_sm_version(), + ) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug( + "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" + ) return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug( + "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" + ) return DeepSeekFP8BlockScalesFusedMoEMethod()
8-8
: Fix logger import (use tensorrt_llm.logger)._tensorrt_llm.utils doesn’t export logger; this import will fail at import time.
-from tensorrt_llm._utils import get_sm_version, logger +from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
🧹 Nitpick comments (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)
322-328
: Remove unused locals flagged by Ruff.These flags aren’t used in this scope; runner already captured them during finalize.
- use_deepseek_fp8_block_scale = getattr(module, - 'has_deepseek_fp8_block_scales', - False) use_w4_group_scaling = getattr(module, 'has_w4afp8', False) - use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel', - False)
633-645
: Delete long commented debug prints (line-length lint).These commented blocks trigger E501; remove to keep the file clean.
- # print( - # "xxi shape 2: enter deepgemm backend compute_moe \n" - # f"x.shape: {getattr(x, 'shape', None)}, \n" - # f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n" - # f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n" - # f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - # f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - # f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n" - # f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - # f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - # f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n" - # f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n" - # ) @@ - # print( - # "xxi shape 3: exit deepgemm backend compute_moe \n" - # f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n" - # f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n" - # f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n" - # f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n" - # f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n" - # f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n" - # f"x.shape: {getattr(x, 'shape', None)}, \n")Also applies to: 753-762
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
428-431
: Remove long commented print blocks (lint: E501).They’re not useful in production; delete to keep the file clean.
- # print( - # f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}" - # ) @@ - # 一行打印所有信息 - # print( - # f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}" - # )Also applies to: 810-813
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(13 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; no tabs; indent by 4 spaces
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; no tabs
Keep module namespace on import; import the module, not individual names; use module.symbol
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile)
Global variables use UPPER_SNAKE with G_ prefix (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE
Avoid shadowing variables from outer scopes
Initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with trailing docstrings
Avoid reflection when simple alternatives exist (e.g., avoid dict(**locals()) patterns)
Limit except clauses to specific exceptions; avoid bare except
When duck-typing with try/except, keep try body minimal and use else for logic
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-28T15:21:03.571Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-28T15:21:03.571Z
Learning: Applies to **/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
MoEBackend
(17-177)MoEBackendSelection
(768-814)select_backend
(780-814)run_moe
(109-177)run_moe
(379-461)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
MoE
(22-181)has_deepseek_fp8_block_scales
(127-130)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)tensorrt_llm/quantization/utils/fp8_utils.py (1)
silu_and_mul_masked_post_quant_fwd
(304-375)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
429-429: Line too long (463 > 120)
(E501)
729-729: Line too long (967 > 120)
(E501)
812-812: Line too long (374 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
322-322: Local variable use_deepseek_fp8_block_scale
is assigned to but never used
Remove assignment to unused variable use_deepseek_fp8_block_scale
(F841)
326-326: Local variable use_int8_woq_per_channel
is assigned to but never used
Remove assignment to unused variable use_int8_woq_per_channel
(F841)
348-348: Line too long (136 > 120)
(E501)
639-639: Line too long (129 > 120)
(E501)
640-640: Line too long (131 > 120)
(E501)
643-643: Line too long (123 > 120)
(E501)
644-644: Line too long (127 > 120)
(E501)
758-758: Line too long (129 > 120)
(E501)
759-759: Line too long (129 > 120)
(E501)
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
707-726: LGTM: backend invocation is correct and passes runtime-only knobs. Uses keywords, aligns with backend API, and unwraps later.
Signed-off-by: xxi <[email protected]>
modified: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
modified: tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
modified: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
modified: tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
modified: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
modified: tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
modified: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
modified: tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
modified: tests/unittest/_torch/modules/test_fused_moe.py
Force-pushed from 6c62b25 to bfc5496 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (8)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
157-158
: Fix positional arg misbinding in finalize_tactic call (use named args).Passing tuner_top_k positionally can bind into use_fused_finalize in subclass overrides. Bind by name.
- self.finalize_tactic(module, input, output_dtype, min_latency_mode, - tuner_top_k) + self.finalize_tactic( + module, + input, + output_dtype, + min_latency_mode=min_latency_mode, + tuner_top_k=tuner_top_k, + )
441-443
: Cutlass.run_moe: pass finalize_tactic args by name (and include use_fused_finalize).Avoids misbinding and ensures the intended flag is honored during profiling.
- self.finalize_tactic(module, tuner_input, output_dtype, - min_latency_mode, tuner_top_k) + self.finalize_tactic( + module, + tuner_input, + output_dtype, + min_latency_mode=min_latency_mode, + use_fused_finalize=use_fused_finalize, + tuner_top_k=tuner_top_k, + )
217-241
: Use weight_view_dtype when constructing MoERunner to match runtime .view(...).The tuner currently receives storage dtype; runtime uses a view dtype (e.g., quint4x2, uint8). Use the same dtype for accurate tactic selection.
if self.moe_runner is None: self.moe_runner = MoERunner( x_dtype=tuner_input.dtype, - weight_dtype=module.w3_w1_weight.dtype, + weight_dtype=weight_view_dtype, output_dtype=output_dtype, top_k=tuner_top_k,
746-752
: Routing attr: use experts_per_token, not top_k.APIs elsewhere expose experts_per_token; using top_k risks AttributeError and inconsistency.
- module.routing_method.top_k if module else 1, + module.routing_method.experts_per_token if module else 1,
11-13
: Replace print with logger and import the project logger.Avoid raw prints in production; use the configured logger.
from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger @@ - print("xxi select backend: use deepgemm backend") + logger.debug("MoEBackendSelection: using DeepGemm backend") return MoEDeepGemmBackend() else: - print("xxi select backend: use cutlass backend") - return MoECutlassBackend() + logger.debug("MoEBackendSelection: using Cutlass backend") + return MoECutlassBackend()Also applies to: 806-814
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
8-8
: Fix logger import; _utils doesn’t export logger.Import logger from tensorrt_llm.logger.
-from tensorrt_llm._utils import get_sm_version, logger +from tensorrt_llm._utils import get_sm_version +from tensorrt_llm.logger import logger
311-325
: Replace debug prints with logger.debug.No raw prints in production paths; keep messages but gate via logger.
- print( - f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}" - ) + logger.debug( + "Cannot use alltoall due to chunking %d", + self.calculate_num_chunks(all_rank_num_tokens), + ) @@ - print( - f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}" - ) + logger.debug( + "Cannot use alltoall: deep_ep_max_num_tokens %d > %d", + all_rank_max_num_tokens, + self.deep_ep_max_num_tokens, + ) @@ - print(f"all to all type {self.alltoall_method_type}") + logger.debug("Alltoall type: %s", self.alltoall_method_type)
333-345
: _get_quant_method: replace prints with logger.debug.Keeps diagnostics without spamming stdout.
- print( - f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}" - ) + logger.debug("wide_ep _get_quant_method: sm=%s", get_sm_version()) if get_sm_version() == 100: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" - ) + logger.debug( + "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm" + ) return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm() else: - print( - f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod" - ) + logger.debug( + "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod" + ) return DeepSeekFP8BlockScalesFusedMoEMethod()
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)
317-333
: Remove unused locals in Cutlass.compute_moe.These assignments aren’t used (Ruff F841) and add noise.
- getattr(module, 'tune_max_num_tokens', 8192) + # no-op: tune_max_num_tokens handled during finalize_tactic - use_deepseek_fp8_block_scale = getattr(module, - 'has_deepseek_fp8_block_scales', - False) use_w4_group_scaling = getattr(module, 'has_w4afp8', False) - use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel', - False)
347-349
: Split long ternary for readability and linter E501.- run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe + run_moe = ( + self.moe_runner.fused_moe_runner.run_moe_min_latency + if min_latency_mode + else self.moe_runner.fused_moe_runner.run_moe + )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(13 hunks)tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
(1 hunks)tests/unittest/_torch/modules/test_fused_moe.py
(2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
MoEBackend
(17-177)MoEBackendSelection
(768-814)select_backend
(780-814)run_moe
(109-177)run_moe
(379-461)tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
DeepSeekFP8BlockScalesFusedMoEMethod
(604-737)DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm
(740-781)
tests/unittest/_torch/modules/test_fused_moe.py (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
AlltoallMethodType
(29-37)tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
DefaultMoeRoutingMethod
(184-210)tensorrt_llm/mapping.py (1)
Mapping
(32-513)tests/unittest/_torch/helpers.py (1)
per_block_cast_to_fp8_e8m0
(55-68)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
MoE
(22-181)has_deepseek_fp8_block_scales
(127-130)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
MoERunner
(23-114)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
enable_alltoall
(298-301)tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
deepgemm_fp8_group_blockwise_gemm
(246-284)masked_index_copy_group_quant_fp8
(89-160)preprocess_after_permute
(233-242)set_strides
(287-293)triton_masked_index_gather
(195-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
320-320: Line too long (127 > 120)
(E501)
429-429: Line too long (463 > 120)
(E501)
729-729: Line too long (967 > 120)
(E501)
812-812: Line too long (374 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
322-322: Local variable use_deepseek_fp8_block_scale
is assigned to but never used
Remove assignment to unused variable use_deepseek_fp8_block_scale
(F841)
326-326: Local variable use_int8_woq_per_channel
is assigned to but never used
Remove assignment to unused variable use_int8_woq_per_channel
(F841)
348-348: Line too long (136 > 120)
(E501)
639-639: Line too long (129 > 120)
(E501)
640-640: Line too long (131 > 120)
(E501)
643-643: Line too long (123 > 120)
(E501)
644-644: Line too long (127 > 120)
(E501)
758-758: Line too long (129 > 120)
(E501)
759-759: Line too long (129 > 120)
(E501)
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
368-379: LGTM: lazy backend initialization is clean. Property-based selection after weights exist avoids premature capability checks.
tests/unittest/_torch/modules/test_fused_moe.py (2)
520-664: LGTM: FP8 blockwise WideEP test adds valuable coverage. Good per-rank setup, deterministic tensors, and DeepGemm reference comparison with sane tolerances.
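For readers unfamiliar with the blockwise FP8 scheme this test exercises, here is a rough, self-contained sketch of the general idea (one scale per weight tile, quantized to float8_e4m3fn). It is not the repository's per_block_cast_to_fp8_e8m0 helper, and the 128x128 tile size is simply the commonly used choice:

import torch

def per_block_quant_fp8(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight to FP8 with one scale per (block x block) tile."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    rows = (w.shape[0] + block - 1) // block
    cols = (w.shape[1] + block - 1) // block
    scales = torch.empty(rows, cols, dtype=torch.float32)
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    for i in range(rows):
        for j in range(cols):
            tile = w[i * block:(i + 1) * block, j * block:(j + 1) * block]
            amax = tile.abs().amax().clamp(min=1e-12)
            scale = amax / fp8_max  # dequantization scale for this tile
            q[i * block:(i + 1) * block,
              j * block:(j + 1) * block] = (tile / scale).to(torch.float8_e4m3fn)
            scales[i, j] = scale
    return q, scales

q, s = per_block_quant_fp8(torch.randn(256, 384))
print(q.shape, s.shape)  # torch.Size([256, 384]) torch.Size([2, 3])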
477-518
: LGTM: helper patterns are straightforward and reusable.
Summary by CodeRabbit
New Features
Improvements
Tests
Style
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
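As an illustrative combination of the flags documented above (values taken from the examples given there, not a prescribed invocation), a pre-merge run restricted to PyTorch stages on H100 with fail-fast disabled could be requested with a comment such as:
/bot run --gpu-type "H100_PCIe" --test-backend "pytorch" --disable-fail-fast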
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.