
Conversation


@xxi-nv xxi-nv commented Aug 25, 2025

Summary by CodeRabbit

  • New Features

    • Pluggable MoE backend framework with lazy initialization and runtime selection to route MoE execution through the chosen backend; exposes a backend-aware MoE execution path.
  • Improvements

    • Runtime-aware tensor-parallel sizing that respects all-reduce strategy and backend capability.
    • More robust all-reduce setup with strategy gating, node-size checks and exception-safe fallbacks.
  • Tests

    • Re-enabled alltoall tests and added comprehensive FP8 blockwise MoE tests plus deterministic test utilities.
  • Style

    • Additional runtime diagnostic logging for load-balancer and backend/path decisions.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
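
For example, a typical pre-merge run restricted to one stage (reusing the example stage name from above) would be:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast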

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous: insufficient care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous: insufficient care and validation can break the top of tree.

xxi-nv added 2 commits August 25, 2025 06:59
Signed-off-by: xxi <[email protected]>

	modified:   tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
	new file:   tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
Signed-off-by: xxi <[email protected]>
@xxi-nv xxi-nv requested review from a team as code owners August 25, 2025 09:45
@xxi-nv xxi-nv requested review from hlu1 and symphonylyh August 25, 2025 09:45

coderabbitai bot commented Aug 25, 2025

📝 Walkthrough

Walkthrough

Adds per-item debug checks in reducescatter; robustly gates MNNVL AllReduce initialization on strategy, tp/world-size, dtype, and runtime capability; conditions DeepseekV3 MLP TP sizing on allreduce strategy and MNNVL support; introduces a pluggable MoE backend API with Cutlass and DeepGemm backends and runtime selection; adds a minor MoE load-balancer debug print.

Changes

  • Distributed AllReduce & reducescatter (tensorrt_llm/_torch/distributed/ops.py): Adds per-item debug checks in reducescatter for sequence inputs when sizes is provided; broadens and hardens MNNVL AllReduce initialization to run when strategy ∈ {AUTO, MNNVL}, checks tp_size == world_size, verifies MNNVLAllReduce.is_mnnvl(mapping, dtype), attempts instantiation inside try/except (only if dtype provided), and logs enable/disable paths.
  • DeepseekV3 TP sizing logic (tensorrt_llm/_torch/models/modeling_deepseekv3.py): Imports AllReduceStrategy/MNNVLAllReduce; _compute_mlp_tp_size now uses full tp when strategy is AUTO or MNNVL and MNNVLAllReduce.is_mnnvl(mapping, pretrained_config.torch_dtype) is true; otherwise falls back to gcd(tp, GPUs_per_node).
  • Fused MoE: module integration & forward (tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py): Adds lazy self._moe_backend_impl / moe_backend_impl property via MoEBackendSelection.select_backend(self) after weights exist (a minimal sketch of this lazy property follows this list); forwards MoE execution to self.moe_backend_impl.run_moe(self, ...); selects FP8 block-scales implementation by SM version; adds diagnostic prints and tighter guard for local-stat collection (if self.layer_load_balancer and is_last_call).
  • MoE backend abstraction (tensorrt_llm/_torch/modules/fused_moe/moe_backend.py): New MoEBackend abstract API and concrete MoECutlassBackend and MoEDeepGemmBackend implementations exposing finalize_tactic, compute_moe, and run_moe (signatures accept module and runtime flags); adds MoEBackendSelection.select_backend(module) for runtime backend selection.
  • MoE load-balancer debug (tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py): Adds a debug print in maybe_create_moe_load_balancer logging in_supported_model_arch, using_ep, using_smart_router, and model_config.moe_load_balancer; no control-flow changes.
  • Tests: fused_moe unit tests (tests/unittest/_torch/modules/test_fused_moe.py): Re-enables two alltoall tests; adds test_wide_ep_fp8_blockwise exercising FP8 blockwise path with per-rank runner and deterministic helper data generators; introduces set_tensor_value_2/3/4 helpers and removes duplicate definitions.
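
A minimal sketch of the lazy backend property mentioned in the Fused MoE item above. The attribute and selector names come from the summary; the surrounding class is a simplified stand-in, not the real module:

from tensorrt_llm._torch.modules.fused_moe.moe_backend import MoEBackendSelection


class WideEPMoE:  # simplified stand-in for illustration only
    def __init__(self):
        self._moe_backend_impl = None  # populated lazily once weights exist

    @property
    def moe_backend_impl(self):
        # Defer selection until after create_weights() so the backend can
        # inspect the module's quantization method and device capability.
        if self._moe_backend_impl is None:
            self._moe_backend_impl = MoEBackendSelection.select_backend(self)
        return self._moe_backend_impl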

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Ops as distributed.ops
  participant MNN as MNNVLAllReduce

  Caller->>Ops: AllReduce.__init__(strategy, tp_size, world_size, dtype, mapping)
  alt strategy in (AUTO, MNNVL)
    alt tp_size == world_size and dtype provided
      Ops->>MNN: is_mnnvl(mapping, dtype)?
      alt supported
        Ops->>MNN: try instantiate MNNVLAllReduce(mapping, dtype)
        MNN-->>Ops: instance or raise
        alt success
          Ops-->>Caller: debug "MNNVL enabled"
        else exception/fail
          Ops-->>Caller: debug "MNNVL disabled"
        end
      else unsupported
        Ops-->>Caller: debug "MNNVL cannot be enabled"
      end
    else size mismatch
      Ops-->>Caller: debug "MNNVL disabled (size mismatch)"
    end
  else other strategy
    Ops-->>Caller: debug "MNNVL disabled (strategy)"
  end
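
Condensing the first diagram into code, a rough sketch of the gating follows. The helper name is made up, the import paths follow the code-graph analysis later in this review, and the real logic lives inside AllReduce.__init__ with narrower exception handling:

from tensorrt_llm.functional import AllReduceStrategy
from tensorrt_llm.logger import logger
from tensorrt_llm._torch.distributed.ops import MNNVLAllReduce


def maybe_create_mnnvl_allreduce(mapping, strategy, dtype):
    """Return an MNNVLAllReduce instance if it can be enabled, else None."""
    if strategy not in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL):
        logger.debug("MNNVL disabled (strategy)")
        return None
    if dtype is None or mapping.tp_size != mapping.world_size:
        logger.debug("MNNVL disabled (missing dtype or tp_size != world_size)")
        return None
    if not MNNVLAllReduce.is_mnnvl(mapping, dtype):
        logger.debug("MNNVL cannot be enabled on this runtime")
        return None
    try:
        instance = MNNVLAllReduce(mapping, dtype)
    except Exception:  # simplified; the real code logs and falls back
        logger.debug("MNNVL disabled (instantiation failed)")
        return None
    logger.debug("MNNVL enabled")
    return instance
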
sequenceDiagram
  autonumber
  participant Layer as WideEPMoE
  participant Selector as MoEBackendSelection
  participant Cut as MoECutlassBackend
  participant DG as MoEDeepGemmBackend

  Layer->>Selector: select_backend(module=self)
  alt SM==100 and module/runtime prefers DeepGemm
    Selector-->>Layer: MoEDeepGemmBackend
    Layer->>DG: run_moe(..., module=self)
    DG-->>Layer: final_hidden_states
  else
    Selector-->>Layer: MoECutlassBackend
    Layer->>Cut: run_moe(..., module=self)
    Cut-->>Layer: final_hidden_states
  end
  Layer-->>Layer: continue with alltoall / post-processing
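
To make the selection flow above concrete, here is a minimal sketch of the backend surface. The class and method names (MoEBackend, MoECutlassBackend, MoEDeepGemmBackend, finalize_tactic/compute_moe/run_moe, MoEBackendSelection.select_backend) come from the PR summary; the bodies, the sm_version attribute, and the exact selection criteria are placeholders, not the real implementation:

from typing import Any, List

import torch


class MoEBackend:
    """Illustrative backend interface; not the real class in moe_backend.py."""

    def finalize_tactic(self, module: Any, **kwargs) -> None:
        """Autotune / pick kernel tactics once weights and shapes are known."""

    def compute_moe(self, module: Any, x: torch.Tensor, **kwargs) -> List[torch.Tensor]:
        """Run the expert computation; both backends return a one-element list."""
        raise NotImplementedError

    def run_moe(self, module: Any, x: torch.Tensor, **kwargs) -> List[torch.Tensor]:
        # Shared driver: tune first, then dispatch to backend-specific kernels.
        self.finalize_tactic(module, **kwargs)
        return self.compute_moe(module, x, **kwargs)


class MoECutlassBackend(MoEBackend):
    def compute_moe(self, module: Any, x: torch.Tensor, **kwargs) -> List[torch.Tensor]:
        return [x]  # placeholder for the Cutlass fused-MoE kernel call


class MoEDeepGemmBackend(MoEBackend):
    def compute_moe(self, module: Any, x: torch.Tensor, **kwargs) -> List[torch.Tensor]:
        return [x]  # placeholder for the DeepGemm grouped-GEMM path


class MoEBackendSelection:

    @staticmethod
    def select_backend(module: Any) -> MoEBackend:
        # Per the walkthrough and diagram: DeepGemm only on SM100 when the
        # module uses DeepSeek FP8 block scales; otherwise Cutlass.
        is_sm100 = getattr(module, "sm_version", None) == 100  # placeholder check
        has_block_fp8 = bool(module.has_deepseek_fp8_block_scales())
        if is_sm100 and has_block_fp8:
            return MoEDeepGemmBackend()
        return MoECutlassBackend()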

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes


Suggested reviewers

  • hlu1
  • symphonylyh
  • yuxianq



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
tensorrt_llm/_torch/distributed/ops.py (1)

460-484: AUTO now considers MNNVL; add a safety check on tp/world mismatch and keep graceful fallback — LGTM with one gating question

  • The broadened gating (AUTO or MNNVL) with a try/except fallback is solid. Good call on logging enable/disable states and exceptions.
  • The new tp_size != world_size guard disables MNNVL when TP is a subgroup (e.g., DP>1). Is this strictly required by MNNVL AR? The workspace and kernels appear to use the TP group, so MNNVL might still be valid with dp_size > 1. If the intent is “MNNVL only when TP spans the full world,” consider documenting that assumption or keying the check off the actual communicators used by MNNVL rather than global world_size to avoid unnecessary disablement.

If the restriction is not required, consider this change:

-                if self.mapping.tp_size != self.mapping.world_size:
+                # Disable only when the TP group spans multiple process groups in a way MNNVL can't handle.
+                # If the MNNVL implementation only depends on the TP group, disablement on DP>1 may be unnecessary.
+                if getattr(self.mapping, "tp_spans_world", None) is False:
                     logger.debug(
-                        f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} "
-                        f"!= world_size:{self.mapping.world_size}")
+                        f"MNNVLAllReduce is disabled: TP group does not span the intended MNNVL domain.")
                     self.mnnvl_allreduce = None

Or, if you do require world_size equality, add a one‑liner comment to codify the constraint to help future readers.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

742-759: Avoid repetitive capability probes for MNNVL in hot paths

The call to MNNVLAllReduce.is_mnnvl(...) can be relatively expensive (imports + capability checks). If this is executed per layer instantiation, consider caching the result on the model_config or mapping (e.g., a lazy singleton per process) to reduce overhead.
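
One possible shape for such a cache, sketched here against the existing MNNVLAllReduce.is_mnnvl probe; the _mnnvl_probe_cache attribute name is made up for illustration and is not part of the real Mapping API:

from tensorrt_llm._torch.distributed.ops import MNNVLAllReduce


def is_mnnvl_cached(mapping, dtype) -> bool:
    # Memoize the probe on the mapping object so per-layer construction does
    # not repeat the import and capability checks.
    cache = getattr(mapping, "_mnnvl_probe_cache", None)
    if cache is None:
        cache = {}
        mapping._mnnvl_probe_cache = cache
    if dtype not in cache:
        cache[dtype] = MNNVLAllReduce.is_mnnvl(mapping, dtype)
    return cache[dtype]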

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)

568-616: Avoid per-call workspace allocations in DeepGemm path

_get_deepgemm_workspace allocates multiple large CUDA tensors on every forward. Cache by (m_max, group_size) to avoid repeated alloc/free, which can fragment memory and add latency in steady state.

Example:

 class MoeDeepGemmBackend(MoEBackend):
     def __init__(self):
         ...
         self._workspace_cache = {}
@@
-    def _get_deepgemm_workspace(self, module: Any, m_max: int,
-                                group_size: int) -> Dict[str, torch.Tensor]:
+    def _get_deepgemm_workspace(self, module: Any, m_max: int,
+                                group_size: int) -> Dict[str, torch.Tensor]:
+        key = (module.expert_size_per_partition, m_max, group_size,
+               module.hidden_size, module.intermediate_size)
+        if key in self._workspace_cache:
+            return self._workspace_cache[key]
         ...
-        return workspace
+        self._workspace_cache[key] = workspace
+        return workspace

316-413: Cutlass compute_moe contract — return type clarity

You return a list [output] in the non–min-latency case and a bare tensor in min latency via run_moe_min_latency. Document this in the docstring and keep it aligned with callers. With the DeepGemm fix above, both backends will behave consistently for non–min-latency.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

680-708: Route fused MoE through pluggable backend — LGTM with one robustness tweak

The call mirrors the original fused_moe signature and passes module for backend-specific needs — good. After addressing the DeepGemm return type in moe_backend.py, this will work. For extra safety at call sites, make the unpacking resilient to both list and tensor returns.

Apply:

-        # Only in cutlass_min_latency_mode, the output is a list of tensors.
-        # Otherwise, the output should be unpacked as a single tensor.
-        final_hidden_states = final_hidden_states[0]
+        # Be robust to either a bare tensor or a single-element list/tuple
+        if isinstance(final_hidden_states, (list, tuple)):
+            final_hidden_states = final_hidden_states[0]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a1e03af and 91eb8b8.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/distributed/ops.py (4)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • world_size (26-27)
tensorrt_llm/llmapi/llm_args.py (2)
  • world_size (246-258)
  • world_size (267-274)
tensorrt_llm/logger.py (1)
  • debug (143-144)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/ops.py (2)
  • MNNVLAllReduce (290-396)
  • is_mnnvl (316-323)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (823-866)
  • select_backend (835-866)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • silu_and_mul_masked_post_quant_fwd (304-375)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

742-759: MLP TP-size capping conditioned on strategy and MNNVL capability — LGTM and consistent with design

  • The gating to avoid inter-node TP when MNNVL is not in use or unsupported is reasonable.
  • This intentionally differs from AllReduce.init’s guard and is expected by design per prior learnings (DeepSeek V3 decisions). No change requested.
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

251-275: Verify that MoERunner and profiling paths accept optional biases as None

You pass w3_w1_bias/w2_bias potentially as None to AutoTuner/MoERunner. Ensure the custom op path gracefully handles None for these inputs during profiling; if not, use zero-sized or zero-filled tensors of appropriate dtype/shape.

Minimal repro to validate:

# Expected: no crash during profiling when biases are None
# with realistic dtypes and small shapes
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

326-329: Backend-specific quant method choice for SM100 — LGTM

Selecting DeepGemm-specific FP8 block-scales method only on SM100 keeps behavior correct on other arches. No change requested.


352-354: Initialize backend after weights — LGTM

Deferring backend selection until weights exist makes sense, since backend needs module state. No change requested.


680-708: Confirm quant_scales layout matches backend expectation

DeepGemm expects pre-transformed scales (see DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm.transform_sf_into_required_layout) while Cutlass expects original layout. Ensure self.quant_scales correspond to the runtime-selected backend.

If there’s any chance of mismatch, add an assert near backend selection:

assert (isinstance(self.quant_method, DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm)) == isinstance(self.moe_backend, MoeDeepGemmBackend)

Or gate a small shape/layout check before the call.

Comment on lines 264 to 274
use_deepseek_fp8_block_scale=getattr(
module, 'has_deepseek_fp8_block_scales', False),
use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
use_int8_woq_per_channel=getattr(module,
'has_int8_woq_per_channel',
False),
use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
False),
min_latency_mode=min_latency_mode,
use_fused_finalize=getattr(module, 'use_fused_finalize', False),
)

⚠️ Potential issue

Bug: passing a bound method where a bool is expected (missing “()”)

has_deepseek_fp8_block_scales is a method, but it’s passed as an attribute to MoERunner. This will always be truthy when present, enabling FP8 block-scale paths unintentionally.

Apply:

-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') else False),

Also double-check other getattr flags (e.g., has_int8_woq_per_channel, has_mxfp8_act_scaling) to ensure they are booleans, not methods.


Suggested change
use_deepseek_fp8_block_scale=getattr(
module, 'has_deepseek_fp8_block_scales', False),
use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
use_int8_woq_per_channel=getattr(module,
'has_int8_woq_per_channel',
False),
use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
False),
min_latency_mode=min_latency_mode,
use_fused_finalize=getattr(module, 'use_fused_finalize', False),
)
use_deepseek_fp8_block_scale=(
module.has_deepseek_fp8_block_scales()
if hasattr(module, 'has_deepseek_fp8_block_scales') else False),
use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
use_int8_woq_per_channel=getattr(module,
'has_int8_woq_per_channel',
False),
use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
False),
min_latency_mode=min_latency_mode,
use_fused_finalize=getattr(module, 'use_fused_finalize', False),
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 264 to 274,
a bound method (has_deepseek_fp8_block_scales) is being passed to MoERunner via
getattr without calling it, causing the flag to be truthy even when it should
evaluate to False; change the getattr call to invoke the method when present
(e.g., getattr(module, 'has_deepseek_fp8_block_scales', False)()) or better:
check if the attribute is callable and call it, otherwise use its boolean value,
and apply the same guard/call pattern to the other flags
(has_int8_woq_per_channel, has_mxfp8_act_scaling, etc.) so each flag passed is a
real boolean rather than a bound method.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from 91eb8b8 to b206849 on August 26, 2025 03:10

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)

264-274: Bug: passing a bound method where a bool is expected (missing “()”)

has_deepseek_fp8_block_scales is a method on the module (see interface.py), but it’s passed through to MoERunner as an attribute. This will always be truthy and incorrectly enable block-FP8 paths.

Apply:

-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') and
+                       callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                    else bool(getattr(module, 'has_deepseek_fp8_block_scales', False))
+                ),

Optional: audit similar flags (has_int8_woq_per_channel, has_mxfp8_act_scaling) if any are methods.


859-868: Backend selection condition always true when attribute exists (missing “()”)

Same issue here: treating has_deepseek_fp8_block_scales as a property makes has_block_fp8 true whenever the attribute exists, even if the method would return False.

Apply:

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        flag = getattr(module, 'has_deepseek_fp8_block_scales', None)
+        has_block_fp8 = bool(flag() if callable(flag) else flag)

822-822: Unify run_moe return contract across backends to List[Tensor]

DeepGemm returns a bare Tensor while Cutlass returns a one-element list, and the caller unconditionally indexes [0]. This will slice/corrupt output on DeepGemm.

Apply:

-        return final_hidden_states
+        return [final_hidden_states]

Follow-up: with this change, both backends consistently return List[Tensor]. Consider updating the base interface’s type hints/docstring accordingly (see next comment).

🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace print with logger.debug and fix E501 long line

Avoid stdout in library code; use the project logger. Also split the long f-string to satisfy Ruff (E501).

Apply:

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: in_supported_model_arch=%s, using_ep=%s, "
+        "using_smart_router=%s, model_config.moe_load_balancer=%s",
+        bool(in_supported_model_arch),
+        bool(using_ep),
+        bool(using_smart_router),
+        model_config.moe_load_balancer,
+    )

Note: casting to bool avoids logging None for flag-like variables when mapping is None. If you prefer one-time logging, switch to logger.info_once (if available) instead of debug.

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)

101-113: Type hints/docstring: reflect List[Tensor] return to match unified API

After unifying the contract, adjust type hints and docs so callers and static tooling don’t get misled.

Apply:

-    def run_moe(
+    def run_moe(
         ...
-            **kwargs) -> torch.Tensor:
+            **kwargs) -> List[torch.Tensor]:
@@
-        Returns:
-            Computed MoE output tensor
+        Returns:
+            List[torch.Tensor]: one tensor per pipeline (single-element list in current backends)

And similarly in compute_moe:

-    def compute_moe(
+    def compute_moe(
         ...
-            **kwargs) -> torch.Tensor:
+            **kwargs) -> List[torch.Tensor]:

Also applies to: 136-138


381-383: Fix E501: break long conditional assignment line

Readability and Ruff compliance.

Apply:

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        run_moe = (
+            self.moe_runner.fused_moe_runner.run_moe_min_latency
+            if min_latency_mode
+            else self.moe_runner.fused_moe_runner.run_moe
+        )
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

402-405: Replace print with logger.debug and fix E501

Avoid stdout spam in multi-rank runs; also satisfy Ruff (E501).

Apply:

-        print(
-            f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}"
-        )
+        logger.debug(
+            "wide_ep forward_chunk: layer_load_balancer=%s, is_first_call=%s, is_last_call=%s",
+            bool(self.layer_load_balancer),
+            is_first_call,
+            is_last_call,
+        )

719-719: Make caller robust to either Tensor or List[Tensor] (defensive until backends converge)

Even with backend unification, this guard is harmless and future-proof.

Apply:

-        final_hidden_states = final_hidden_states[0]
+        if isinstance(final_hidden_states, list):
+            final_hidden_states = final_hidden_states[0]
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 91eb8b8 and b206849.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (9 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (825-868)
  • select_backend (837-868)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • silu_and_mul_masked_post_quant_fwd (304-375)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

403-403: Line too long (144 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

🔇 Additional comments (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)

326-330: Good: SM100-aware quant-method swap for block FP8

Selecting DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm only on SM100 keeps legacy paths intact elsewhere. Looks correct.


352-354: Good: backend selection deferred until after weights are created

Initializing self.moe_backend post-create_weights() guarantees the module has everything the backend needs to introspect.


482-487: LGTM: guard get_local_statistic_tensor() with both LB presence and last-call

Prevents invalid access when LB is disabled or on intermediate repeats.


684-712: Forward delegates to pluggable backend correctly (module passed, tuner args wired)

The call mirrors the original fused_moe signature and threads through tuning knobs and module context. Once the backend return type is unified to List[Tensor], this section is good.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from b206849 to ed20793 on August 26, 2025 03:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

1-1: Missing NVIDIA copyright header

Per coding guidelines, prepend the NVIDIA copyright header (current year) to this source file.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

1-1: Missing NVIDIA copyright header

Per coding guidelines, prepend the NVIDIA copyright header (current year) to this source file.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
♻️ Duplicate comments (4)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)

254-274: MoERunner constructed with storage dtype; should use view dtype used at runtime

weight_view_dtype is computed (e.g., torch.quint4x2 for W4), but MoERunner is created with module.w3_w1_weight.dtype. This can desynchronize tuning from runtime execution.

-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,

264-271: Passing a bound method instead of a boolean for flags (missing “()”)

has_deepseek_fp8_block_scales is a method on the module. Passing it without calling makes the flag truthy whenever the attribute exists.

-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') else False
+                ),

Also audit other boolean flags if they can be methods and guard with callable(...).


822-822: Unify return type across backends to List[Tensor]

DeepGemm returns a bare Tensor, while Cutlass returns List[Tensor]. Callers (e.g., WideEPMoE.forward_chunk) index [0] unconditionally; returning a bare tensor will slice the first token and corrupt output.

-        return final_hidden_states
+        return [final_hidden_states]

859-868: Backend selection wrongly treats method as a property; FP8 block path always true when attribute exists

module.has_deepseek_fp8_block_scales is a method. Current code evaluates truthiness of the attribute, not its return value, enabling the DeepGemm path unintentionally.

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales') and
+            (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+            )
+        )
🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace noisy print with structured logging and avoid long line (Ruff E501)

Unconditional print will spam multi-rank logs and breaks line-length rule. Use logger.info_once (or rank-gated logging) and compress the payload; do not print the entire model_config.moe_load_balancer object.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    # Log once to avoid multi-rank spam; keep message under 120 chars per line
+    has_moe_lb = model_config.moe_load_balancer is not None
+    logger.info_once(
+        "maybe_create_moe_load_balancer: "
+        f"in_supported_model_arch={in_supported_model_arch}, "
+        f"using_ep={bool(using_ep)}, "
+        f"using_smart_router={bool(using_smart_router)}, "
+        f"has_moe_lb={has_moe_lb}",
+        key="moe_lb_debug",
+    )
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

382-382: Break long line per Ruff E501 and improve readability

Line exceeds 120 chars. Assign the function to a local variable with line breaks.

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        run_moe = (
+            self.moe_runner.fused_moe_runner.run_moe_min_latency
+            if min_latency_mode
+            else self.moe_runner.fused_moe_runner.run_moe
+        )
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)

326-338: Replace debug prints with structured, rank-safe logging (and keep lines short)

These print calls will flood multi-rank logs and violate line-length. Use logger.info_once (or rank-gated logs) to emit concise messages once.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.info_once(
+                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}",
+                    key="wide_ep_quant_sm"
+                )
@@
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.info_once(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm",
+                        key="wide_ep_quant_method"
+                    )
@@
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.info_once(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod",
+                        key="wide_ep_quant_method"
+                    )

411-413: Replace debug print with structured logging and fix long line (Ruff E501)

Same rationale as above; this print is noisy across ranks and exceeds 120 columns.

-        print(
-            f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}"
-        )
+        logger.info_once(
+            "wide_ep forward_chunk: "
+            f"layer_load_balancer={bool(self.layer_load_balancer)}, "
+            f"is_first_call={is_first_call}, "
+            f"is_last_call={is_last_call}",
+            key="wide_ep_forward_chunk_debug",
+        )

403-408: Tidy up redundant assignment

output_dtype = output_dtype is a no-op. Simplify the block; keep the assert for FP4 tensors.

-        if isinstance(x, Fp4QuantizedTensor):
-            assert output_dtype is not None
-            output_dtype = output_dtype
-        else:
-            output_dtype = x.dtype
+        if isinstance(x, Fp4QuantizedTensor):
+            assert output_dtype is not None, "Provide output_dtype when passing Fp4QuantizedTensor"
+        else:
+            output_dtype = x.dtype

726-729: Be robust to backend return type until all backends are aligned

Even after unifying backends, making the caller tolerant avoids future regressions.

-        # Only in cutlass_min_latency_mode, the output is a list of tensors.
-        # Otherwise, the output should be unpacked as a single tensor.
-        final_hidden_states = final_hidden_states[0]
+        # Backends return List[Tensor] (even for single output); keep robust.
+        if isinstance(final_hidden_states, list):
+            final_hidden_states = final_hidden_states[0]
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b206849 and ed20793.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (9 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (825-868)
  • select_backend (837-868)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

412-412: Line too long (144 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from ed20793 to 36f4493 on August 26, 2025 05:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/distributed/ops.py (1)

463-489: Missing dtype argument disables MNNVLAllReduce under AUTO/MNNVL
AllReduce’s constructor only enables MNNVL when a non-None dtype is passed. Call sites that omit the dtype parameter will never activate MNNVL (even if strategy is AUTO/MNNVL and tp_size == world_size). Please update the following invocations to include the appropriate dtype (e.g. config.torch_dtype):

  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
    • Line 93: AllReduce(mapping=…, strategy=…)
    • Line 187: AllReduce(mapping=…, strategy=…)

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
    • Line 485: AllReduce(mapping=…, strategy=…)

  • tensorrt_llm/_torch/models/modeling_llama.py
    • Lines 304–307: AllReduce(mapping=…, strategy=…)
    • Lines 455–456: AllReduce(mapping=…, strategy=…)
    • Line 645: AllReduce(mapping=…)

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py
    • Line 72: AllReduce(mapping=…, strategy=…)

For each, modify to:

- AllReduce(mapping=…, strategy=…, …)
+ AllReduce(mapping=…, strategy=…, dtype=config.torch_dtype)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

700-728: Contract mismatch risk: backend.run_moe return type

Downstream you index final_hidden_states[0]. Ensure all backends return List[Tensor] to avoid slicing a Tensor (first-token bug). This will be fixed if DeepGemm returns [tensor], but add a defensive guard here or in the backend meantime.

No change needed here if you accept the backend fix in moe_backend.py. If not, adjust the unpack site below (Lines 733-736).


733-736: Defensive unpack to handle both List[Tensor] and Tensor

Until all backends consistently return List[Tensor], avoid accidental first-dimension slicing.

-        # Only in cutlass_min_latency_mode, the output is a list of tensors.
-        # Otherwise, the output should be unpacked as a single tensor.
-        final_hidden_states = final_hidden_states[0]
+        # Normalize backend output to a tensor
+        if isinstance(final_hidden_states, (list, tuple)):
+            final_hidden_states = final_hidden_states[0]
♻️ Duplicate comments (5)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)

253-275: MoERunner should use the actual weight view dtype

During tuning you view weights with a possibly different dtype (e.g., W4 group scaling). Pass that view dtype into MoERunner for consistency.

-            self.moe_runner = MoERunner(
+            self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
                 tp_size=module.tp_size,

264-272: Don’t pass a bound method as a bool flag; call it or cast safely

module.has_deepseek_fp8_block_scales is a method; passing it directly will always be truthy.

-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales')
+                    and callable(module.has_deepseek_fp8_block_scales)
+                    else bool(getattr(module, 'has_deepseek_fp8_block_scales', False))
+                ),

Also double-check similar flags (has_int8_woq_per_channel, has_mxfp8_act_scaling) are booleans, not callables.


822-823: Unify return type: wrap DeepGemm output in a list

Callers assume a List[Tensor] in the non-min-latency path. Returning a bare tensor here will corrupt outputs when indexed.

-        return final_hidden_states
+        return [final_hidden_states]

836-868: Fix backend selection: call has_deepseek_fp8_block_scales()

The check currently treats the method as a property; it should be invoked (and safely guarded).

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales')
+            and (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+            )
+        )

1-4: Add mandatory NVIDIA copyright header

Per repo guidelines, prepend the NVIDIA copyright header to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+#
 """
 MoE Backend abstraction for supporting different MoE computation implementations.
 This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.)
 """
🧹 Nitpick comments (9)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace print with project logger and wrap long line

Use the project logger instead of an unconditional print; also fix the E501 long line warning by splitting the message.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: "
+        f"in_supported_model_arch={in_supported_model_arch}, "
+        f"using_ep={using_ep}, "
+        f"using_smart_router={using_smart_router}, "
+        f"model_config.moe_load_balancer={model_config.moe_load_balancer}"
+    )
tensorrt_llm/_torch/distributed/ops.py (1)

243-251: Avoid stdout in hot path; log mismatch and strengthen assertion message

Unconditional print in collective validation will spam logs and hurt perf. Prefer debug logging and attach context to the assertion.

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
-            assert all([
-                val.shape[dim] == sum_split_size for val in input
-                if val is not None
-            ])
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        "reducescatter size mismatch: val_shape=%s, dim=%d, actual=%d, expected=%d, sizes=%s",
+                        tuple(val.shape), dim, val.shape[dim], sum_split_size, sizes,
+                    )
+            assert all(
+                val.shape[dim] == sum_split_size for val in input if val is not None
+            ), (
+                f"reducescatter: mismatched input shapes along dim={dim}; "
+                f"expected {sum_split_size}, sizes={sizes}"
+            )
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)

382-383: Wrap long line and improve readability of run_moe selection

Refactor the ternary for readability and to satisfy E501.

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        if min_latency_mode:
+            run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency
+        else:
+            run_moe = self.moe_runner.fused_moe_runner.run_moe

569-618: Avoid per-call large allocations in DeepGemm workspace

_get_deepgemm_workspace allocates multiple large tensors per call. Cache workspaces keyed by (m_max, hidden/intermediate sizes, group size) on self to avoid repeated allocations.

Example implementation outside this hunk (add in __init__ and use here):

# In __init__
self._workspace_cache = {}

# In _get_deepgemm_workspace(...)
key = (m_max, module.hidden_size, module.intermediate_size, group_size)
ws = self._workspace_cache.get(key)
if ws is None:
    ws = { ... allocate once ... }
    self._workspace_cache[key] = ws
return ws

Happy to provide a full patch if you want it wired end-to-end.
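
A slightly fuller sketch of that caching pattern (illustrative only: the tensor names, shapes, and dtypes are placeholders, not the actual DeepGemm workspace layout; torch is assumed to be imported in the module):

# In __init__
self._workspace_cache = {}

# In _get_deepgemm_workspace(...)
def _get_deepgemm_workspace(self, module, m_max, group_size):
    """Return per-shape workspace tensors, allocating them only on first use."""
    key = (m_max, module.hidden_size, module.intermediate_size, group_size)
    ws = self._workspace_cache.get(key)
    if ws is None:
        # Placeholder buffers; the real shapes/dtypes come from the DeepGemm kernels.
        ws = {
            "permuted_act": torch.empty((m_max, module.hidden_size),
                                        dtype=torch.bfloat16, device="cuda"),
            "gemm_out": torch.empty((m_max, module.intermediate_size),
                                    dtype=torch.bfloat16, device="cuda"),
        }
        self._workspace_cache[key] = ws
    return ws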

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)

311-325: Replace prints with logger and trim long lines

These diagnostics should go through the project logger and avoid long lines.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Disable alltoall: chunking=%d",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
@@
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Disable DeepEPLowLatency: max_tokens=%d > threshold=%d",
+                all_rank_max_num_tokens, self.deep_ep_max_num_tokens,
+            )
@@
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("alltoall method type: %s", self.alltoall_method_type)

333-345: Use logger for backend pick trace and avoid noisy stdout

Leverage logger and keep consistent formatting.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug("wide_ep _get_quant_method: sm=%s", get_sm_version())
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug("Select DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm")
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug("Select DeepSeekFP8BlockScalesFusedMoEMethod")
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

418-421: Debug prints in forward path should use logger

Replace the print with logger.debug and keep messages short.

-        print(
-            f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}"
-        )
+        logger.debug(
+            "wide_ep forward_chunk: has_lb=%s, first=%s, last=%s",
+            self.layer_load_balancer is not None, is_first_call, is_last_call,
+        )

672-699: Remove large commented-out legacy call

The old fused_moe call is fully preserved in comments. Consider removing it or moving to docs to reduce noise.


809-811: Replace extra-verbose print with logger and keep it readable

Use logger.debug and reduce verbosity; very long single-line prints violate E501 and clutter logs.

-            # 一行打印所有信息
-            print(
-                f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            )
+            logger.debug(
+                "wide_ep forward: x.shape=%s, use_all_to_all=%s, "
+                "num_tokens=%s, num_tokens_padded=%s, max_tokens=%s, "
+                "use_dp_padding=%s, out.shape=%s",
+                getattr(x, "shape", None),
+                use_all_to_all,
+                all_rank_num_tokens,
+                all_rank_num_tokens_padded,
+                all_rank_max_num_tokens,
+                use_dp_padding,
+                getattr(outputs, "shape", None),
+            )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ed20793 and 36f4493.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (825-868)
  • select_backend (837-868)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • world_size (26-27)
tensorrt_llm/logger.py (1)
  • debug (143-144)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


419-419: Line too long (144 > 120)

(E501)


810-810: Line too long (372 > 120)

(E501)

tensorrt_llm/_torch/distributed/ops.py

246-246: Line too long (156 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

237-239: Lazy backend init acknowledged

Initializing moe_backend after weight creation is reasonable and keeps state consistent.


368-370: Backend selection timing is good

Selecting the backend right after weight creation ensures quant flags are available.


498-503: Good: guard stat fetch behind load balancer and last_call

The additional check avoids None access when LB is disabled.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from 36f4493 to bfb9f40 on August 26, 2025 07:04

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/distributed/ops.py (2)

15-21: Avoid shadowing the project logger.

You import logger from tensorrt_llm.logger at Line 15, then overwrite it with logging.getLogger(__name__) at Line 20. This changes the behavior of logger.debug/info_once and can silently route logs incorrectly.

-from tensorrt_llm.logger import logger
+from tensorrt_llm.logger import logger
 ...
-_thread_local = threading.local()
-logger = logging.getLogger(__name__)
+_thread_local = threading.local()

243-251: Use logger.debug instead of print in reducescatter and fix E501.

Printing inside reducescatter is expensive and noisy. Prefer the project logger, and split the message to satisfy line length.

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        "[reducescatter] shape_mismatch: val.shape=%s, dim=%d, "
+                        "val_dim=%d, sum_split_size=%d, sizes=%s",
+                        tuple(val.shape), dim, val.shape[dim], sum_split_size, sizes
+                    )
♻️ Duplicate comments (10)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (9)

1-1: Missing NVIDIA copyright header

Per coding guidelines, prepend the NVIDIA copyright header (current year) to this new file.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 """
 MoE Backend abstraction for supporting different MoE computation implementations.

264-274: Fix incorrect flag handling for has_deepseek_fp8_block_scales

has_deepseek_fp8_block_scales is a method, but it's being passed without calling it. This will always be truthy when the attribute exists, incorrectly enabling FP8 block-scale paths.

Apply:

-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') and callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                    else False),

Also verify other flags (e.g., has_int8_woq_per_channel, has_mxfp8_act_scaling) to ensure they are booleans, not methods.


253-255: Fix weight_dtype inconsistency in MoERunner initialization

The MoERunner should be constructed with weight_view_dtype (the actual view dtype used at runtime), not the original storage dtype.

             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,

846-846: Fix API contract: DeepGemm backend should return List[Tensor] to match Cutlass

WideEPMoE.forward_chunk unconditionally indexes [0] after run_moe. Cutlass returns [output], but DeepGemm returns a bare tensor, which will slice the first token and corrupt output.

-        return final_hidden_states
+        return [final_hidden_states]

883-885: Fix method call for has_deepseek_fp8_block_scales in backend selection

has_deepseek_fp8_block_scales is a method but is being accessed as a property. This makes has_block_fp8 true whenever the attribute exists.

         has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+                         and (module.has_deepseek_fp8_block_scales() 
+                              if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                              else bool(getattr(module, 'has_deepseek_fp8_block_scales'))))

251-275: Two fixes: pass the correct weight view dtype and call boolean flags, not bound methods.

  • weight_dtype should match the view dtype used during profiling (e.g., W4 paths), otherwise the MoERunner key diverges.
  • has_deepseek_fp8_block_scales is a method; passing the bound method makes the flag always truthy.
         if self.moe_runner is None:
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
                 tp_size=module.tp_size,
                 tp_rank=module.tp_rank,
                 ep_size=module.ep_size,
                 ep_rank=module.ep_rank,
                 cluster_size=module.cluster_size,
                 cluster_rank=module.cluster_rank,
-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') else False
+                ),
                 use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
                 use_int8_woq_per_channel=getattr(module,
                                                  'has_int8_woq_per_channel',
                                                  False),
                 use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
                                               False),
                 min_latency_mode=min_latency_mode,
                 use_fused_finalize=getattr(module, 'use_fused_finalize', False),
             )

846-846: Unify return type to match caller (wrap DeepGemm output in a list).

WideEPMoE.forward_chunk indexes [0]. Returning a bare Tensor here will slice the first token. Wrap in a one-element list, aligning with Cutlass.

-        return final_hidden_states
+        return [final_hidden_states]

883-889: Call the method when checking block-FP8 support.

module.has_deepseek_fp8_block_scales is a method; the current code treats it as a property and will always be truthy if present.

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, "has_deepseek_fp8_block_scales")
+            and (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, "has_deepseek_fp8_block_scales"))
+                else bool(getattr(module, "has_deepseek_fp8_block_scales"))
+            )
+        )

1-4: Add required NVIDIA copyright header to new file.

Repo guideline requires the NVIDIA header on all source files. Add it above the module docstring.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+#
+"""
 MoE Backend abstraction for supporting different MoE computation implementations.
 This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.)
 """
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

739-739: Robustness: handle both Tensor and List[Tensor] return types.

Even after aligning DeepGemm to return a list, making the caller tolerant prevents future regressions.

-        final_hidden_states = final_hidden_states[0]
+        if isinstance(final_hidden_states, list):
+            final_hidden_states = final_hidden_states[0]
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)

311-326: Use logger.debug instead of print in can_use_alltoall.

Stdout spam in control flow harms perf and clobbers logs. Prefer logger.debug and keep messages short.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Disable alltoall: chunking=%d",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
 ...
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Disable alltoall: deep_ep_max_tokens %d > %d",
+                all_rank_max_num_tokens, self.deep_ep_max_num_tokens,
+            )
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("alltoall method type: %s", self.alltoall_method_type)

333-345: Drop prints in _get_quant_method; log at debug level.

These messages are useful during bring-up but too verbose for regular runs.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug("wide_ep::_get_quant_method: sm=%s", get_sm_version())
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug("wide_ep::using DeepGemm MoE method for block FP8")
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug("wide_ep::using Cutlass MoE method for block FP8")
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

701-728: Small safety: assert backend is initialized when weights are created.

create_weights assigns moe_backend, but if skip_create_weights_in_init=True and forward_chunk is called prematurely, this could be None. Either assert here or lazy-init on first use.

         # Use the selected backend to compute MoE with the same parameters as fused_moe
+        assert self.moe_backend is not None, "MoE backend not initialized"
         final_hidden_states = self.moe_backend.run_moe(
             x,
             token_selected_slots,
             token_final_scales,

814-815: Replace single-line mega print with logger.debug and shorten.

This violates E501 and is too verbose for every forward.

-            print(
-                f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            )
+            logger.debug(
+                "forward: x=%s alltoall=%s tokens=%s padded=%s max=%s dp_pad=%s out=%s",
+                getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens,
+                all_rank_num_tokens_padded, all_rank_max_num_tokens,
+                use_dp_padding, getattr(outputs, "shape", None),
+            )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 36f4493 and bfb9f40.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (849-892)
  • select_backend (861-892)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • silu_and_mul_masked_post_quant_fwd (304-375)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • world_size (26-27)
tensorrt_llm/logger.py (1)
  • debug (143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


419-419: Line too long (472 > 120)

(E501)


731-731: Line too long (952 > 120)

(E501)


814-814: Line too long (372 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)


721-721: Line too long (127 > 120)

(E501)


722-722: Line too long (129 > 120)

(E501)


725-725: Line too long (121 > 120)

(E501)


726-726: Line too long (125 > 120)

(E501)


839-839: Line too long (125 > 120)

(E501)


840-840: Line too long (125 > 120)

(E501)

tensorrt_llm/_torch/distributed/ops.py

246-246: Line too long (156 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

🔇 Additional comments (7)
tensorrt_llm/_torch/distributed/ops.py (2)

463-488: LGTM: Enhanced MNNVL AllReduce initialization with broader runtime detection

The changes properly expand MNNVL support to AUTO strategy and add runtime checks for tp_size/world_size compatibility. The graceful fallback with debug logging is well-implemented.


463-489: LGTM: safer MNNVL init path with AUTO fallback and explicit gating.

Allowing MNNVL under AUTO and guarding by tp_size == world_size plus try/except improves robustness. The debug traces on enable/disable paths are helpful.

Please double-check that requiring tp_size == world_size is intended for your typical mappings (e.g., TP×DP×PP > 1) to avoid over-disabling MNNVL where it should be valid. If needed, we can gate on the actual communicator/group instead of global world_size.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)

498-499: Good defensive programming with conditional check

The added conditional check for self.layer_load_balancer before calling methods prevents potential null pointer issues.


237-239: Good modular design with backend selection after weights creation

The initialization of MoE backend after weights creation ensures all necessary properties are available for backend selection. The lazy initialization pattern is well-structured.

Also applies to: 368-370


701-728: Preserved original fused_moe call as reference documentation

Keeping the original call as a comment provides a valuable reference for future maintenance. The backend call keeps the same parameter structure, ensuring compatibility.


498-505: LGTM: guard load-balancer stat retrieval by is_last_call.

This avoids early access and aligns with staged statistic updates.

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

381-412: Contract: return a list (non-min-latency) to match caller behavior.

WideEPMoE.forward_chunk unconditionally indexes [0] later. Ensure Cutlass backend consistently returns a list in non-min-latency mode (already does) and keep min-latency path returning a Tensor. No change needed here; just highlighting the contract for the DeepGemm backend below.

Comment on lines +243 to +247
            for val in input:
                if val is not None and val.shape[dim] != sum_split_size:
                    print(
                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
                    )

⚠️ Potential issue

Remove debug print statement from production code

Debug print statements should not be in production code. Consider using proper logging instead.

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
+                    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
            for val in input:
                if val is not None and val.shape[dim] != sum_split_size:
                    print(
                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
                    )
            for val in input:
                if val is not None and val.shape[dim] != sum_split_size:
                    logger.debug(
                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
                    )
🧰 Tools
🪛 Ruff (0.12.2)

246-246: Line too long (156 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/distributed/ops.py around lines 243 to 247, remove the
debug print statement that outputs shape and size info; replace it with a proper
logging call using the module logger (e.g., logger.debug or logger.warning) and
include the same contextual message and variables so it's available in logs but
not printed to stdout; ensure the module imports and uses the logger
consistently and that the log level chosen is appropriate for diagnostics.

Comment on lines +319 to +321
            print(
                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
            )

⚠️ Potential issue

Remove debug print statement from production code

Debug print statements should be removed.

-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                f"Cannot use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
            print(
                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
            )
            logger.debug(
                f"Cannot use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
            )
🧰 Tools
🪛 Ruff (0.12.2)

320-320: Line too long (127 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around lines
319-321, remove the debug print statement; replace it with a proper logging call
(e.g., module logger or logging.warning) or drop it entirely if no runtime
notice is needed. If using logging, ensure an appropriate logger is available or
import logging and call logger.warning(f"cannot use alltoall due to
deep_ep_max_num_tokens {all_rank_max_num_tokens} >
{self.deep_ep_max_num_tokens}") to preserve context instead of printing to
stdout.

return False

print(f"all to all type {self.alltoall_method_type}")

⚠️ Potential issue

Remove debug print statement from production code

Debug print statement should be removed.

-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug(f"Alltoall type: {self.alltoall_method_type}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"all to all type {self.alltoall_method_type}")
logger.debug(f"Alltoall type: {self.alltoall_method_type}")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around line 324,
there is a debug print statement printing the all-to-all method type; remove
this print and replace it with either no output or a proper logger call if
runtime visibility is required (use the module's logger or Python logging at an
appropriate log level). Ensure no stray print() calls remain in production code
and run tests/linters to confirm removal.

Comment on lines +333 to +345
                print(
                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
                )
                if get_sm_version() == 100:
                    print(
                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
                    )
                    return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                else:
                    print(
                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
                    )
                    return DeepSeekFP8BlockScalesFusedMoEMethod()

⚠️ Potential issue

Remove debug print statements from production code

Multiple debug print statements in the quantization method selection should be removed.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
                print(
                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
                )
                if get_sm_version() == 100:
                    print(
                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
                    )
                    return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                else:
                    print(
                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
                    )
                    return DeepSeekFP8BlockScalesFusedMoEMethod()
                logger.debug(
                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
                )
                if get_sm_version() == 100:
                    logger.debug(
                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
                    )
                    return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                else:
                    logger.debug(
                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
                    )
                    return DeepSeekFP8BlockScalesFusedMoEMethod()
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around lines 333
to 345 there are debug print() calls in _get_quant_method; remove these print
statements so production code has no stdout debug noise, and if project logging
is available, replace them with appropriate logger.debug(...) calls, preserving
the same conditional flow and returned classes (no behavioral change).

Comment on lines +828 to +836
module.routing_method.top_k if module else 1,
expert_size_per_partition,
tp_size,
tp_rank,
ep_size,
ep_rank,
)

⚠️ Potential issue

Fix attribute: use experts_per_token (not top_k) for routing method.

Routing methods expose experts_per_token elsewhere in this codebase. Using .top_k risks AttributeError at runtime.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
module.routing_method.top_k if module else 1,
expert_size_per_partition,
tp_size,
tp_rank,
ep_size,
ep_rank,
)
module.routing_method.experts_per_token if module else 1,
expert_size_per_partition,
tp_size,
tp_rank,
ep_size,
ep_rank,
)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 828 to 834,
the code accesses module.routing_method.top_k which can raise AttributeError
because routing methods in this codebase expose experts_per_token; replace the
.top_k access with module.routing_method.experts_per_token (or a safe fallback
to 1 if missing) so the routing parameter uses the correct attribute name and
avoids runtime errors.

Comment on lines +963 to +965
    print(
        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
    )

⚠️ Potential issue

Remove debug print statement from production code

Debug print statements should not be committed to production code. Please remove this print statement.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    print(
        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
    )
🧰 Tools
🪛 Ruff (0.12.2)

964-964: Line too long (219 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py around lines
963-965, remove the debug print statement that outputs
maybe_create_moe_load_balancer state; delete the print(...) call and, if runtime
diagnostics are needed, replace it with a proper logger.debug(...) call using
the module logger (ensure the logger is imported/initialized) so no raw prints
remain in production code.

🛠️ Refactor suggestion

Replace noisy print with logger.debug and wrap long line (E501).

Stdout prints in distributed hot paths cause interleaved logs and measurable overhead. Use the project logger and keep within 120 cols.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: in_supported_model_arch=%s, "
+        "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s",
+        in_supported_model_arch, using_ep, using_smart_router,
+        getattr(model_config, "moe_load_balancer", None),
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    print(
        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
    )
    logger.debug(
        "maybe_create_moe_load_balancer: in_supported_model_arch=%s, "
        "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s",
        in_supported_model_arch, using_ep, using_smart_router,
        getattr(model_config, "moe_load_balancer", None),
    )
🧰 Tools
🪛 Ruff (0.12.2)

964-964: Line too long (219 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py around lines 963
to 965, replace the noisy print call with a project logger.debug call and wrap
the long formatted message to keep line length under 120 characters; ensure you
obtain the module logger (e.g., via logging.getLogger(__name__) or use the
existing module logger) and pass the same f-string content to logger.debug,
splitting the format across concatenated strings or multiple arguments to stay
within the line length limit.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from bfb9f40 to 360f36d on August 26, 2025 07:49

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (16)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace stdout print with logger.debug and wrap to <=120 cols (fix E501).

Raw prints in hot/distributed paths are noisy and can materially impact performance. Use the module logger with parameterized formatting to avoid eager string interpolation and keep within style limits.

Apply this diff:

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: in_supported_model_arch=%s, "
+        "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s",
+        in_supported_model_arch,
+        bool(using_ep),
+        bool(using_smart_router),
+        getattr(model_config, "moe_load_balancer", None),
+    )
tensorrt_llm/_torch/distributed/ops.py (1)

243-247: Replace debug print with proper logging (and fix the long line).

Using print in library code is noisy and bypasses configured logging. Also, Ruff flags this line as too long (E501). Switch to the module logger and wrap the message to stay under 120 chars. This also addresses the earlier bot comment on the same lines.

Apply:

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        f"[reducescatter] val.shape={val.shape}, dim={dim}, "
+                        f"val.shape[dim]={val.shape[dim]}, "
+                        f"sum_split_size={sum_split_size}, sizes={sizes}"
+                    )

Optional: if you prefer lazy logging to avoid f-string formatting cost when DEBUG is off, use the parameterized form:

logger.debug(
    "[reducescatter] val.shape=%s, dim=%s, val.shape[dim]=%s, sum_split_size=%s, sizes=%s",
    tuple(val.shape), dim, val.shape[dim], sum_split_size, sizes,
)
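
A general note on the two forms (standard library logging behavior, shown as a standalone sketch; the project's own logger wrapper may differ): the parameterized call defers string formatting until a handler actually emits the record, and an explicit level check can skip building expensive arguments altogether.

import logging

log = logging.getLogger(__name__)
sizes = [128, 128, 64]

log.debug(f"sizes={sizes}")      # the f-string is rendered even when DEBUG is off
log.debug("sizes=%s", sizes)     # rendered only if a DEBUG record is emitted

if log.isEnabledFor(logging.DEBUG):
    # Skip constructing costly arguments entirely when DEBUG is disabled.
    log.debug("sorted sizes=%s", sorted(sizes))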

Verification (find remaining prints in this file):

#!/bin/bash
rg -n -C2 -g 'tensorrt_llm/_torch/distributed/ops.py' '^\s*print\('

Note: This file currently both imports logger from tensorrt_llm.logger and redefines logger = logging.getLogger(__name__). Pick one to avoid confusion; project-wide consistency typically favors the tensorrt_llm.logger logger.

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (7)

715-727: Remove debug prints; use logger.debug with short, wrapped message

Raw prints spam stdout and violate E501. Convert to logger.debug and keep under 120 cols.

-        print(
-            "enter deepgemm backend compute_moe \n"
-            f"x.shape: {getattr(x, 'shape', None)}, \n"
-            f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n"
-            f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n"
-            f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-            f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-            f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n"
-            f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-            f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-            f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n"
-            f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n"
-        )
+        logger.debug(
+            "[DeepGemm::enter] x=%s input_sf=%s slots=%s scales=%s "
+            "perm_row2unperm=%s perm_experts=%s perm_data=%s "
+            "first_off=%s perm_final_scales=%s unperm_row2perm=%s",
+            getattr(x, "shape", None),
+            getattr(input_sf, "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(permuted_row_to_unpermuted_row_tensor, "shape", None),
+            getattr(permuted_token_selected_experts_tensor, "shape", None),
+            getattr(permuted_data_tensor, "shape", None),
+            getattr(expert_first_token_offset_tensor, "shape", None),
+            getattr(permuted_token_final_scales_tensor, "shape", None),
+            getattr(unpermuted_row_to_permuted_row_tensor, "shape", None),
+        )

835-844: Remove trailing debug print; replace with logger.debug

Same rationale as the entry print and resolves E501.

-        print(
-            "exit deepgemm backend compute_moe \n"
-            f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, "
-            f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, "
-            f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, "
-            f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, "
-            f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, "
-            f"x.shape: {getattr(x, 'shape', None)}, "
-            f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}"
-        )
+        logger.debug(
+            "[DeepGemm::exit] perm_data=%s final_scales=%s unperm_row2perm=%s "
+            "perm_row2unperm=%s first_off=%s x=%s out=%s",
+            getattr(permuted_data_tensor, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(unpermuted_row_to_permuted_row_tensor, "shape", None),
+            getattr(permuted_row_to_unpermuted_row_tensor, "shape", None),
+            getattr(expert_first_token_offset_tensor, "shape", None),
+            getattr(x, "shape", None),
+            getattr(final_hidden_states, "shape", None),
+        )

1-4: Add required NVIDIA copyright header at file top

Per repo guidelines, source files must start with the NVIDIA copyright header. Insert it before the module docstring.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 """
 MoE Backend abstraction for supporting different MoE computation implementations.
 This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.)
 """

245-256: MoERunner init: pass the runtime view dtype and ensure flags are booleans (not bound methods)

  • weight_dtype should match the view dtype used during profiling (weight_view_dtype), not the storage dtype.
  • Several flags are obtained via getattr and may be bound methods; pass actual booleans by calling when callable or coercing with bool(...).
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
                 tp_size=module.tp_size,
                 tp_rank=module.tp_rank,
                 ep_size=module.ep_size,
                 ep_rank=module.ep_rank,
                 cluster_size=module.cluster_size,
                 cluster_rank=module.cluster_rank,
-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
-                use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
-                use_int8_woq_per_channel=getattr(module,
-                                                 'has_int8_woq_per_channel',
-                                                 False),
-                use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
-                                              False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales')
+                    and callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                    else bool(getattr(module, 'has_deepseek_fp8_block_scales', False))
+                ),
+                use_w4_group_scaling=bool(getattr(module, 'has_w4afp8', False)),
+                use_int8_woq_per_channel=bool(
+                    getattr(module, 'has_int8_woq_per_channel', False)
+                ),
+                use_mxfp8_act_scaling=bool(
+                    getattr(module, 'has_mxfp8_act_scaling', False)
+                ),
                 min_latency_mode=min_latency_mode,
                 use_fused_finalize=getattr(module, 'use_fused_finalize', False),
             )

Also applies to: 264-274


827-834: Use experts_per_token (not top_k) to avoid AttributeError on routing method

Project-wide the attribute is experts_per_token.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,

846-846: Align DeepGemm return type with Cutlass: return a one-element list

Callers expect List[Tensor] in the non-min-latency path. Wrap the tensor to avoid slicing bugs.

-        return final_hidden_states
+        return [final_hidden_states]
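
A quick illustration of the slicing bug being avoided (plain PyTorch, hypothetical shapes): indexing a bare 2-D tensor with [0] returns its first row, while indexing a one-element list returns the whole tensor, which keeps callers' existing [0] indexing correct.

import torch

final_hidden_states = torch.zeros(8, 4096)   # hypothetical (num_tokens, hidden_size)

print(final_hidden_states[0].shape)          # torch.Size([4096]) -- first row only
print([final_hidden_states][0].shape)        # torch.Size([8, 4096]) -- the full output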

861-889: Backend selection: call has_deepseek_fp8_block_scales instead of passing a bound method

As written, has_block_fp8 is truthy whenever the attribute exists, regardless of the method’s return. Ensure it’s called when callable.

         is_blackwell = get_sm_version() == 100
-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales')
+            and (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+            )
+        )
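
The pitfall in isolation (a toy class, not the real module): a bound method object is always truthy, so the unpatched check passes even when the method would return False.

class ToyModule:
    def has_deepseek_fp8_block_scales(self) -> bool:
        return False

m = ToyModule()
print(bool(m.has_deepseek_fp8_block_scales))    # True  -- truthiness of the method object
print(bool(m.has_deepseek_fp8_block_scales()))  # False -- the value the method returns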
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7)

311-313: Replace debug print with logger.debug and keep under 120 cols

Avoid stdout noise in production paths.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall: chunking=%s",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )

319-321: Replace debug print with logger.debug

Also fixes E501.

-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: deep_ep_max_num_tokens %s > %s",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )

324-324: Replace debug print with logger.debug

Short and consistent logging.

-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall type: %s", self.alltoall_method_type)

333-345: Use logger.debug for SM/back-end selection traces

Keep diagnostics but avoid prints; also fix long lines.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    "wide_ep _get_quant_method: get_sm_version()=%s",
+                    get_sm_version(),
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

418-420: Replace verbose print in forward_chunk with logger.debug

This line floods stdout and violates E501. Keep it concise and wrapped.

-        print(
-            f"wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}"
-        )
+        logger.debug(
+            "forward_chunk: lb=%s first=%s last=%s x=%s router=%s alltoall=%s "
+            "num_tokens=%s max_tokens=%s dp_pad=%s repeat=%s",
+            bool(self.layer_load_balancer), is_first_call, is_last_call,
+            getattr(x, "shape", None), getattr(router_logits, "shape", None),
+            use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens,
+            use_dp_padding, repeating_info,
+        )

730-732: Remove extremely verbose print after backend run

This prints shapes and flags every call; switch to a gated debug log or remove it.

-        print(
-            f"xxi x.shape: {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states[0], 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}"
-        )
+        logger.debug(
+            "forward_chunk result: out=%s slots=%s scales=%s alltoall=%s tp=%s/%s ep=%s/%s",
+            getattr(final_hidden_states[0], "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            use_all_to_all, self.tp_rank, self.tp_size, ep_rank, ep_size,
+        )

812-815: Remove leftover print in forward; convert to concise debug log if needed

Avoid production prints and long lines.

-            # 一行打印所有信息
-            print(
-                f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            )
+            logger.debug(
+                "forward: x=%s alltoall=%s num_tokens=%s padded=%s max_tokens=%s "
+                "dp_pad=%s out=%s",
+                getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens,
+                all_rank_num_tokens_padded, all_rank_max_num_tokens,
+                use_dp_padding, getattr(outputs, "shape", None),
+            )
🧹 Nitpick comments (7)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (3)

959-962: Normalize booleans to avoid None or surprising truthiness from and-chaining.

using_ep = mapping and ... returns None when mapping is falsy; prefer explicit bool(mapping) for clarity and stable typing.

-    using_ep = mapping and mapping.moe_ep_size > 1
+    using_ep = bool(mapping) and mapping.moe_ep_size > 1
-    using_smart_router = mapping and mapping.moe_cluster_size > 1
+    using_smart_router = bool(mapping) and mapping.moe_cluster_size > 1
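
For illustration (plain Python, hypothetical values): and returns its first falsy operand unchanged, so the unnormalized form can evaluate to None instead of False.

mapping = None  # hypothetical: no Mapping configured

using_ep = mapping and getattr(mapping, "moe_ep_size", 0) > 1
print(using_ep)                                                  # None, not False
print(bool(mapping) and getattr(mapping, "moe_ep_size", 0) > 1)  # False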

954-956: Return type annotation doesn’t match actual return (nullcontext vs. MoeLoadBalancer).

This function returns either a MoeLoadBalancer (context manager) or a nullcontext(); Optional[MoeLoadBalancer] is misleading. Consider annotating as a context manager to reflect usage (with maybe_create_moe_load_balancer(...):).

Minimal edits:

-from typing import Callable, Dict, List, Optional, Tuple
+from typing import Callable, Dict, List, Optional, Tuple, Any, ContextManager
@@
-def maybe_create_moe_load_balancer(
-        model_config, mapping: Optional[Mapping]) -> Optional[MoeLoadBalancer]:
+def maybe_create_moe_load_balancer(
+        model_config, mapping: Optional[Mapping]) -> ContextManager[Any]:
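
With that annotation, both branches behave identically at the call site; a toy, self-contained sketch of the pattern (stand-in names, not the module's real code):

from contextlib import nullcontext
from typing import Any, ContextManager

def maybe_create_balancer(enabled: bool) -> ContextManager[Any]:
    # Stand-in for maybe_create_moe_load_balancer: a real context manager when
    # enabled, otherwise nullcontext() (which yields None).
    return nullcontext("balancer") if enabled else nullcontext()

with maybe_create_balancer(enabled=False) as lb:
    print(lb)  # None
with maybe_create_balancer(enabled=True) as lb:
    print(lb)  # "balancer" (stands in for a MoeLoadBalancer instance)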

1-1: Missing NVIDIA copyright header.

Project guideline requires the NVIDIA copyright header on all source files.

Suggested header (adjust exact wording/year to match repo standard):

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/distributed/ops.py (1)

463-489: MNNVL runtime gate looks good; add a small guard and improve exception logging.

Broadening enablement to AUTO/MNNVL and gating on tp_size == world_size is reasonable. The divergence from DeepSeekV3’s TP-size logic is known/intentional per prior learnings. Minor polish:

  • Avoid calling is_mnnvl when dtype is None.
  • Include stack info in the debug log on init failure for easier diagnosis.
-            if self.strategy in (AllReduceStrategy.AUTO,
-                                 AllReduceStrategy.MNNVL):
+            if self.strategy in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL):
                 if self.mapping.tp_size != self.mapping.world_size:
                     logger.debug(
                         f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} "
                         f"!= world_size:{self.mapping.world_size}")
                     self.mnnvl_allreduce = None
-                elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
+                elif dtype is not None and MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
                     try:
                         self.mnnvl_allreduce = MNNVLAllReduce(
                             self.mapping, dtype) if dtype else None
                         if self.mnnvl_allreduce:
                             logger.debug(f"MNNVLAllReduce is enabled")
                         else:
                             logger.debug(f"MNNVLAllReduce is disabled")
-                    except Exception as e:
-                        logger.debug(
-                            f"MNNVL AllReduce can't be enabled due to {e}.")
+                    except Exception:
+                        logger.debug("MNNVL AllReduce init failed.", exc_info=True)
                         self.mnnvl_allreduce = None
                 else:
                     logger.debug(
                         f"MNNVLAllReduce can't be enabled due to failing the is_mnnvl check."
                     )
                     self.mnnvl_allreduce = None

Side note: since this block adds several new logger.debug calls, consider unifying on a single logger source for the module (see note in the reducescatter comment).
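
On the exc_info=True change in the diff above, a minimal illustration using the stdlib logger as a stand-in (assumes the project logger forwards the keyword the same way):

import logging

logger = logging.getLogger("mnnvl_init_example")  # stdlib stand-in for the project logger
logging.basicConfig(level=logging.DEBUG)

try:
    raise RuntimeError("simulated MNNVL init failure")
except RuntimeError:
    # exc_info=True attaches the active traceback to the record,
    # so the message itself can stay short.
    logger.debug("MNNVL AllReduce init failed.", exc_info=True)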

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)

11-11: Import the module logger for use instead of print()

Follow project practice and avoid raw prints. Import logger from tensorrt_llm.logger rather than tensorrt_llm._utils (a later comment below flags that _utils does not export logger).

 from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger

381-383: Wrap long conditional assignment to satisfy line length and readability

No functional change; resolves Ruff E501 and improves clarity.

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        if min_latency_mode:
+            run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency
+        else:
+            run_moe = self.moe_runner.fused_moe_runner.run_moe

97-99: Doc/typing: clarify run_moe/compute_moe return contract (List[Tensor] vs Tensor)

With the DeepGemm fix, both backends return List[Tensor] (non-min-latency). Update docstrings and type hints to reflect this, or standardize on always returning List[Tensor] except when explicitly in min-latency mode.

Also applies to: 172-174
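
One way the standardized contract could read as a type hint (illustrative signature only, not the file's actual code):

from typing import List, Union

import torch

def run_moe_contract_sketch(min_latency_mode: bool) -> Union[List[torch.Tensor], torch.Tensor]:
    """Sketch of the documented contract: List[Tensor] on the normal path,
    a bare Tensor only in explicit min-latency mode."""
    out = torch.zeros(2, 8)
    return out if min_latency_mode else [out]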

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bfb9f40 and 360f36d.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)


721-721: Line too long (127 > 120)

(E501)


722-722: Line too long (129 > 120)

(E501)


725-725: Line too long (121 > 120)

(E501)


726-726: Line too long (125 > 120)

(E501)


839-839: Line too long (125 > 120)

(E501)


840-840: Line too long (125 > 120)

(E501)

tensorrt_llm/_torch/distributed/ops.py

246-246: Line too long (156 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


419-419: Line too long (442 > 120)

(E501)


731-731: Line too long (949 > 120)

(E501)


814-814: Line too long (372 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

214-223: Nice: backend abstraction reads clean and contains a clear strategy separation

The split between tactic finalization and compute paths is well-structured and should ease future backend additions.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

368-370: Backend selection timing: ensure moe_backend is set before first use

moe_backend is initialized in create_weights(); if skip_create_weights_in_init is True and forward is called before create_weights(), self.moe_backend may be None.

Consider asserting at first use (in forward_chunk) or lazy-selecting if None:

         # Initialize MoE backend after weights are created
         self.moe_backend = MoEBackendSelection.select_backend(self)

And in forward_chunk before calling run_moe:

-        final_hidden_states = self.moe_backend.run_moe(
+        assert self.moe_backend is not None, "moe_backend is not initialized"
+        final_hidden_states = self.moe_backend.run_moe(

Please confirm the repo always calls create_weights() prior to forward when skipping in init.
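
A toy sketch of the lazy-selection alternative (stand-in class and values; the real module would call MoEBackendSelection.select_backend(self) in the same spot):

class ToyWideEP:
    def __init__(self, skip_create_weights_in_init: bool):
        self.moe_backend = None
        if not skip_create_weights_in_init:
            self.create_weights()

    def create_weights(self):
        # In the PR, backend selection happens here, after weights exist.
        self.moe_backend = "selected-backend"

    def forward_chunk(self):
        # Lazy fallback: select on first use if create_weights() ran late or not at all.
        if self.moe_backend is None:
            self.create_weights()
        return self.moe_backend

print(ToyWideEP(skip_create_weights_in_init=True).forward_chunk())  # "selected-backend"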


700-728: Good integration of backend run_moe with module context

Passing module=self keeps backends decoupled yet informed. The argument mapping mirrors the legacy fused_moe signature well.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from 360f36d to 6b3430e Compare August 26, 2025 08:19
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/distributed/ops.py (1)

1-21: Fix logger shadowing (imported logger is overwritten).

The module imports logger from tensorrt_llm.logger (Line 15) and then overwrites it with a stdlib logger (Line 20). This breaks the project’s logging conventions.

-import logging
@@
 from tensorrt_llm.logger import logger
@@
-logger = logging.getLogger(__name__)

If nothing else uses the stdlib logging symbol after this change, drop the unused import as shown.

♻️ Duplicate comments (16)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace debug print with logger.debug and wrap lines (<120 cols).

Raw prints are noisy in distributed runs and fail Ruff E501. Use the project logger at debug level.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: in_supported_model_arch=%s, "
+        "using_ep=%s, using_smart_router=%s, moe_load_balancer=%s",
+        in_supported_model_arch, using_ep, using_smart_router,
+        getattr(model_config, "moe_load_balancer", None),
+    )
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (8)

715-727: Remove verbose prints; use project logger at debug level.

These prints run every forward and exceed 120 cols.

+        from tensorrt_llm.logger import logger
-        print(
-            "xxi shape 2: enter deepgemm backend compute_moe \n"
-            f"x.shape: {getattr(x, 'shape', None)}, \n"
-            f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n"
-            f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n"
-            f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-            f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-            f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n"
-            f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-            f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-            f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n"
-            f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n"
-        )
+        logger.debug(
+            "[DeepGemm.enter] x=%s input_sf=%s topk_slots=%s final_scales=%s "
+            "perm_row2unperm=%s perm_experts=%s perm_data=%s "
+            "expert_first_off=%s perm_final_scales=%s unperm_row2perm=%s",
+            getattr(x, "shape", None),
+            getattr(input_sf, "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(permuted_row_to_unpermuted_row_tensor, "shape", None),
+            getattr(permuted_token_selected_experts_tensor, "shape", None),
+            getattr(permuted_data_tensor, "shape", None),
+            getattr(expert_first_token_offset_tensor, "shape", None),
+            getattr(permuted_token_final_scales_tensor, "shape", None),
+            getattr(unpermuted_row_to_permuted_row_tensor, "shape", None),
+        )

835-843: Replace exit print with logger.debug; keep within 120 cols.

-        print(
-            "xxi shape 3: exit deepgemm backend compute_moe \n"
-            f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n"
-            f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-            f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-            f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n"
-            f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-            f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-            f"x.shape: {getattr(x, 'shape', None)}, \n")
+        logger.debug(
+            "[DeepGemm.exit] out=%s perm_data=%s final_scales=%s "
+            "unperm_row2perm=%s perm_row2unperm=%s expert_first_off=%s x=%s",
+            getattr(final_hidden_states, "shape", None),
+            getattr(permuted_data_tensor, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(unpermuted_row_to_permuted_row_tensor, "shape", None),
+            getattr(permuted_row_to_unpermuted_row_tensor, "shape", None),
+            getattr(expert_first_token_offset_tensor, "shape", None),
+            getattr(x, "shape", None),
+        )

828-834: Use experts_per_token instead of top_k for routing.

Avoid AttributeError and ensure correct semantics.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,

845-845: Unify return type: DeepGemm backend should return List[Tensor].

WideEPMoE.forward_chunk indexes [0] unconditionally; returning a bare Tensor will slice the first row.

-        return final_hidden_states
+        return [final_hidden_states]

883-891: Bug: evaluating a method object as truthy (has_block_fp8 always true).

Call has_deepseek_fp8_block_scales if callable; otherwise coerce to bool.

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales')
+            and (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+            )
+        )

1-5: Add the required NVIDIA copyright header.

New Python source must start with the standard header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+
 """
 MoE Backend abstraction for supporting different MoE computation implementations.
 This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.)
 """

264-271: Bug: passing a bound method instead of bool to MoERunner flags.

has_deepseek_fp8_block_scales is a method; using getattr without calling makes the flag truthy whenever present.

-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') else False
+                ),

Also verify other flags are real booleans (has_int8_woq_per_channel, has_mxfp8_act_scaling). If any are methods, apply the same callable guard.


253-274: MoERunner should receive weight_view_dtype (not storage dtype).

You compute weight_view_dtype but pass module.w3_w1_weight.dtype to MoERunner, risking mismatch when viewing weights (e.g., W4 group scaling, MXFP4).

         if self.moe_runner is None:
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7)

311-314: Replace print with logger.debug (chunking gate); keep lines <120.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to chunking=%d",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )

319-321: Replace print with logger.debug (DeepEPLowLatency token cap).

-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: deep_ep_max_num_tokens %d > %d",
+                all_rank_max_num_tokens, self.deep_ep_max_num_tokens,
+            )

324-324: Replace print with logger.debug (alltoall method type).

-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall method type: %s", self.alltoall_method_type)

333-345: Swap verbose prints in _get_quant_method for logger.debug.

Also keeps lines within 120 columns.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    "wide_ep _get_quant_method: get_sm_version()=%s",
+                    get_sm_version(),
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use DeepGemm block-FP8"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use Cutlass block-FP8"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

418-420: Remove large print in forward_chunk; use concise logger.debug.

This runs per forward and exceeds 120 columns.

-        print(
-            f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}"
-        )
+        logger.debug(
+            "forward_chunk.enter lb=%s first=%s last=%s x=%s router=%s "
+            "alltoall=%s num_tokens=%s max_tokens=%s dp_pad=%s repeat=%s",
+            bool(self.layer_load_balancer), is_first_call, is_last_call,
+            getattr(x, "shape", None), getattr(router_logits, "shape", None),
+            use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens,
+            use_dp_padding, repeating_info,
+        )

730-732: Replace post-backend shape print with logger.debug; avoid indexing cost.

Also helps when DeepGemm returns a Tensor list after your fix.

-        print(
-            f"xxi shape 4 after moe backend : {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states[0], 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}"
-        )
+        logger.debug(
+            "moe_backend.out x=%s out=%s topk_slots=%s final_scales=%s "
+            "w3w1=%s w2=%s x_sf=%s alltoall=%s tp=(%d,%d) ep=(%d,%d) "
+            "cluster=(%d,%d) block_fp8=%s w4_group=%s tune_max_tokens=%d "
+            "tuner_tokens=%s tuner_top_k=%s",
+            getattr(x, "shape", None),
+            getattr(final_hidden_states[0], "shape", None)
+            if isinstance(final_hidden_states, list)
+            else getattr(final_hidden_states, "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(w3_w1_weight, "shape", None),
+            getattr(w2_weight, "shape", None),
+            getattr(x_sf, "shape", None),
+            use_all_to_all, self.tp_size, self.tp_rank, ep_size, ep_rank,
+            cluster_size, cluster_rank, use_deepseek_fp8_block_scale,
+            use_w4_group_scaling, self.tune_max_num_tokens,
+            tuner_num_tokens, tuner_top_k,
+        )

813-815: Remove one-line “all info” print; log succinctly with logger.debug.

-            print(
-                f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            )
+            logger.debug(
+                "forward.out x=%s alltoall=%s num_tokens=%s num_tokens_pad=%s "
+                "max_tokens=%s dp_pad=%s out=%s",
+                getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens,
+                all_rank_num_tokens_padded, all_rank_max_num_tokens,
+                use_dp_padding, getattr(outputs, "shape", None),
+            )
🧹 Nitpick comments (2)
tensorrt_llm/_torch/distributed/ops.py (1)

243-247: Swap print() for logger.debug; keep line length under 120.

Printing inside collectives spams stdout and is slow. Use logger.debug with placeholders.

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        "[reducescatter] val.shape=%s, dim=%d, val.shape[dim]=%d, "
+                        "sum_split_size=%d, sizes=%s",
+                        tuple(val.shape), dim, int(val.shape[dim]),
+                        int(sum_split_size), sizes,
+                    )
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

382-383: Wrap long conditional assignment to satisfy E501 and readability.

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        run_moe = (
+            self.moe_runner.fused_moe_runner.run_moe_min_latency
+            if min_latency_mode
+            else self.moe_runner.fused_moe_runner.run_moe
+        )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 360f36d and 6b3430e.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (848-891)
  • select_backend (860-891)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (46-47)
tensorrt_llm/mapping.py (1)
  • tp_rank (338-339)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • enable_alltoall (298-301)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • world_size (26-27)
tensorrt_llm/logger.py (1)
  • debug (143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


419-419: Line too long (461 > 120)

(E501)


731-731: Line too long (968 > 120)

(E501)


814-814: Line too long (372 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)


721-721: Line too long (127 > 120)

(E501)


722-722: Line too long (129 > 120)

(E501)


725-725: Line too long (121 > 120)

(E501)


726-726: Line too long (125 > 120)

(E501)


840-840: Line too long (127 > 120)

(E501)


841-841: Line too long (127 > 120)

(E501)

tensorrt_llm/_torch/distributed/ops.py

246-246: Line too long (156 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

@@ -5,7 +5,7 @@
 import torch
 
 from tensorrt_llm._mnnvl_utils import MnnvlMemory, MnnvlMoe, MoEAlltoallInfo
-from tensorrt_llm._utils import logger
+from tensorrt_llm._utils import get_sm_version, logger
Contributor

⚠️ Potential issue

Incorrect logger import; use tensorrt_llm.logger.

logger is not provided by tensorrt_llm._utils. Import from the canonical module.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around line 8, the
code imports logger from tensorrt_llm._utils but that module does not export
logger; replace the import to pull logger from the canonical module
tensorrt_llm.logger (i.e., import get_sm_version from tensorrt_llm._utils and
import logger from tensorrt_llm.logger) so the module uses the correct logger
object.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from 6b3430e to ed54d31 Compare August 26, 2025 08:41
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (13)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace noisy print with logger.debug; keep within 120 cols (E501).

Raw prints in hot paths cause interleaved stdout and overhead. Use the module logger and wrap lines.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: "
+        f"in_supported_model_arch={in_supported_model_arch}, "
+        f"using_ep={using_ep}, using_smart_router={using_smart_router}, "
+        f"moe_load_balancer={getattr(model_config, 'moe_load_balancer', None)}"
+    )
tensorrt_llm/_torch/distributed/ops.py (1)

243-247: Remove debug print in reducescatter; use logger.debug and wrap long line.

Printing inside the validation loop is noisy; prefer logger with structured fields. Also fixes Ruff E501.

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        "[reducescatter] val.shape=%s, dim=%s, "
+                        "val.shape[dim]=%s, sum_split_size=%s, sizes=%s",
+                        getattr(val, "shape", None), dim,
+                        val.shape[dim], sum_split_size, sizes,
+                    )
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)

680-727: Remove verbose prints in DeepGemm compute; use logger.debug.

These multi-line prints fire every forward; swap for concise debug logging. Also ensure logger is available in this module.

+from tensorrt_llm.logger import logger
@@
-        print(
-            "xxi shape 2: enter deepgemm backend compute_moe \n"
-            f"x.shape: {getattr(x, 'shape', None)}, \n"
-            f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n"
-            f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n"
-            f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-            f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-            f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n"
-            f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-            f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-            f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n"
-            f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n"
-        )
+        logger.debug(
+            "[DeepGemm::enter] x=%s input_sf=%s topk_slots=%s final_scales=%s "
+            "perm_row2unperm=%s perm_experts=%s perm_data=%s "
+            "expert_first_offset=%s perm_final_scales=%s unperm_row2perm=%s",
+            getattr(x, "shape", None),
+            getattr(input_sf, "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(permuted_row_to_unpermuted_row_tensor, "shape", None),
+            getattr(permuted_token_selected_experts_tensor, "shape", None),
+            getattr(permuted_data_tensor, "shape", None),
+            getattr(expert_first_token_offset_tensor, "shape", None),
+            getattr(permuted_token_final_scales_tensor, "shape", None),
+            getattr(unpermuted_row_to_permuted_row_tensor, "shape", None),
+        )

237-275: Use view dtype for MoERunner and call feature flags correctly.

  • MoERunner should get the runtime view dtype (e.g., quint4x2/uint8) to match .view(...) during profiling.
  • has_deepseek_fp8_block_scales is a method; passing the bound method makes the flag truthy even when False.
-        if self.moe_runner is None:
-            self.moe_runner = MoERunner(
+        if self.moe_runner is None:
+            self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
                 tp_size=module.tp_size,
                 tp_rank=module.tp_rank,
                 ep_size=module.ep_size,
                 ep_rank=module.ep_rank,
                 cluster_size=module.cluster_size,
                 cluster_rank=module.cluster_rank,
-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') else False
+                ),
                 use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
                 use_int8_woq_per_channel=getattr(module,
                                                  'has_int8_woq_per_channel',
                                                  False),
                 use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
                                               False),
                 min_latency_mode=min_latency_mode,
                 use_fused_finalize=getattr(module, 'use_fused_finalize', False),
             )

828-846: Fix API contract: use experts_per_token and return List[Tensor] for consistency.

  • The routing method exposes experts_per_token; .top_k risks AttributeError.
  • WideEPMoE forward unconditionally indexes [0]. DeepGemm must return [tensor] to match Cutlass behavior.
-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,
@@
-        return final_hidden_states
+        return [final_hidden_states]

880-891: Call has_deepseek_fp8_block_scales() when selecting backend.

Currently it treats the bound method as truthy; call it (guarding callability) so selection reflects the real capability.

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        fn = getattr(module, 'has_deepseek_fp8_block_scales', None)
+        has_block_fp8 = (callable(fn) and fn()) or (fn if isinstance(fn, bool) else False)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7)

311-325: Replace leftover prints with logger.debug; wrap long lines.

Avoid stdout noise and satisfy E501.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to chunking %s",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
@@
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to deep_ep_max_num_tokens %s > %s",
+                all_rank_max_num_tokens, self.deep_ep_max_num_tokens,
+            )
@@
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall type: %s", self.alltoall_method_type)

333-345: Swap diagnostic prints in _get_quant_method for logger.debug.

Keep runtime quiet; small, readable debug lines.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    "wide_ep _get_quant_method: get_sm_version()=%s",
+                    get_sm_version(),
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

418-420: Remove verbose entry print in forward_chunk; use concise logger.debug.

Giant prints exceed 120 cols and spam stdout.

-        print(
-            f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}"
-        )
+        logger.debug(
+            "forward_chunk: lb=%s first=%s last=%s x=%s router=%s "
+            "alltoall=%s num_tokens=%s max_tokens=%s dp_pad=%s",
+            bool(self.layer_load_balancer), is_first_call, is_last_call,
+            getattr(x, "shape", None), getattr(router_logits, "shape", None),
+            use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens,
+            use_dp_padding,
+        )

730-732: Remove post-backend verbose print; replace with logger.debug.

Also note quant_scales is a list; accessing .shape on it is invalid.

-        print(
-            f"xxi shape 4 after moe backend : {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states, 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}"
-        )
+        logger.debug(
+            "after backend: x=%s out=%s slots=%s scales=%s w3w1=%s w2=%s "
+            "x_sf=%s tp=(%s,%s) ep=(%s,%s) alltoall=%s flags(fp8=%s,w4=%s) "
+            "tune=(max_tokens=%s, num_tokens=%s, top_k=%s)",
+            getattr(x, "shape", None),
+            getattr(final_hidden_states, "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(w3_w1_weight, "shape", None),
+            getattr(w2_weight, "shape", None),
+            getattr(x_sf, "shape", None),
+            self.tp_size, self.tp_rank, ep_size, ep_rank, use_all_to_all,
+            use_deepseek_fp8_block_scale, use_w4_group_scaling,
+            self.tune_max_num_tokens, tuner_num_tokens, tuner_top_k,
+        )

812-815: Remove one-line “dump everything” print; prefer structured debug logging.

This line is extremely long (E501) and duplicates information already logged.

-            # 一行打印所有信息
-            print(
-                f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            )
+            logger.debug(
+                "forward (single-chunk): x=%s out=%s alltoall=%s "
+                "num_tokens=%s padded=%s max_tokens=%s dp_pad=%s",
+                getattr(x, "shape", None), getattr(outputs, "shape", None),
+                use_all_to_all, all_rank_num_tokens, all_rank_num_tokens_padded,
+                all_rank_max_num_tokens, use_dp_padding,
+            )

859-867: Consistent attribute name: experts_per_token vs top_k across codebase.

You already use experts_per_token elsewhere; keep it consistent to avoid AttributeError on routing methods that don’t expose .top_k.
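
If a shared defensive accessor ever helps, here is a tiny illustrative sketch (experts_per_token is the attribute the codebase exposes; the top_k fallback and the helper name are hypothetical):

    def _experts_per_token(routing_method) -> int:
        # Prefer the canonical attribute; fall back only if a routing method
        # happens to expose top_k (defensive, illustrative only).
        return getattr(routing_method, "experts_per_token",
                       getattr(routing_method, "top_k", 1))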


8-9: Fix logger import; use the canonical project logger.

tensorrt_llm._utils does not export logger. Import from tensorrt_llm.logger.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🧹 Nitpick comments (6)
tensorrt_llm/_torch/distributed/ops.py (1)

463-489: MNNVL gating: confirm tp_size == world_size requirement is intended.

The added guard disables MNNVL when tp_size != world_size. Since the divergence between the MNNVL criteria here and the TP-sizing logic elsewhere is by design, confirm that this stricter requirement won't preclude valid configurations (e.g., pipeline/data parallelism present while the TP group is still full-node and MNNVL-capable).

If the intent is to gate on the TP group’s multi-node NVLink availability rather than full world_size equality, consider checking the group topology (e.g., mapping.tp_group vs. cluster topology) instead of world_size.
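
A minimal sketch of such a group-scoped check, assuming for illustration that mapping.tp_group lists the global ranks of the TP group and mapping.gpus_per_node gives the node width (neither attribute is verified here):

    def tp_group_spans_multiple_nodes(mapping) -> bool:
        """Return True if the TP group crosses node boundaries (illustrative only)."""
        ranks = getattr(mapping, "tp_group", None)
        gpus_per_node = getattr(mapping, "gpus_per_node", None)
        if not ranks or not gpus_per_node:
            return False
        nodes = {rank // gpus_per_node for rank in ranks}
        return len(nodes) > 1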

Would you like a follow-up PR to probe/validate group-level MNNVL capabilities at init time?

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

1-5: Add the required NVIDIA copyright header.

New source files must include the current-year NVIDIA header per guidelines.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 """
 MoE Backend abstraction for supporting different MoE computation implementations.
 This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.)
 """
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)

700-728: Backend call: ensure downstream return type is uniform.

Since forward_chunk later indexes [0], ensure both backends return List[Tensor]. This is addressed by changing DeepGemm to wrap the tensor; please apply that diff alongside this file.


739-745: Unpack backend output robustly.

Given Cutlass returns [tensor] and DeepGemm now wraps output per fix, this indexing is fine. If you’d like extra safety, guard: if isinstance(final_hidden_states, list): final_hidden_states = final_hidden_states[0].


369-369: Minor: ensure backend is non-None before use.

If selection ever returns None, forward would fail. Optional: assert after init.

         self._weights_created = True
         self._check_configs()
 
         # Initialize MoE backend after weights are created
         self.moe_backend = MoEBackendSelection.select_backend(self)
+        assert self.moe_backend is not None, "Failed to select a MoE backend"

1-1: Header check (optional):

Consider adding the NVIDIA header to this file as well if it’s missing project-wide.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6b3430e and ed54d31.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (848-891)
  • select_backend (860-891)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • world_size (26-27)
tensorrt_llm/logger.py (1)
  • debug (143-144)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • silu_and_mul_masked_post_quant_fwd (304-375)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


419-419: Line too long (461 > 120)

(E501)


731-731: Line too long (965 > 120)

(E501)


814-814: Line too long (372 > 120)

(E501)

tensorrt_llm/_torch/distributed/ops.py

246-246: Line too long (156 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)


721-721: Line too long (127 > 120)

(E501)


722-722: Line too long (129 > 120)

(E501)


725-725: Line too long (121 > 120)

(E501)


726-726: Line too long (125 > 120)

(E501)


840-840: Line too long (127 > 120)

(E501)


841-841: Line too long (127 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

368-370: LGTM: Backend selection initialized post weights creation.

Initializing MoE backend after weight creation is appropriate and keeps selection aligned with quant flags. No issues.


498-504: Nice guard: only fetch LB stats when last call or LB present.

This avoids unnecessary syncs/calls mid-iteration. Looks good.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from ed54d31 to e3b58fb Compare August 26, 2025 09:07
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (15)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace debug print with logger.debug and wrap to avoid E501

Use the project logger instead of stdout; also keeps the line length within 120 and avoids noisy logs in distributed runs.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: supported_arch=%s, using_ep=%s, "
+        "smart_router=%s, lb_cfg=%s",
+        in_supported_model_arch,
+        bool(using_ep),
+        bool(using_smart_router),
+        getattr(model_config, "moe_load_balancer", None),
+    )
tensorrt_llm/_torch/distributed/ops.py (1)

243-247: Remove stdout print in reducescatter size-check; use logger.debug (fixes E501)

Stick to the module logger to avoid flooding stdout and interleaving across ranks.

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        "[reducescatter] shape=%s dim=%s dim_size=%s sum_split_size=%s sizes=%s",
+                        val.shape, dim, val.shape[dim], sum_split_size, sizes,
+                    )
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (8)

311-313: Replace debug print with logger and avoid duplicate compute

Capture chunk count once and log via logger.debug.

-        if self.calculate_num_chunks(all_rank_num_tokens) > 1:
-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+        chunks = self.calculate_num_chunks(all_rank_num_tokens)
+        if chunks > 1:
+            logger.debug("Cannot use alltoall due to chunking: %s", chunks)
             return False

319-321: Replace DeepEPLowLatency limit print with logger.debug and fix E501

-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: max_tokens %s > threshold %s",
+                all_rank_max_num_tokens, self.deep_ep_max_num_tokens,
+            )

324-324: Replace stray print with logger.debug

-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("alltoall method type: %s", self.alltoall_method_type)

333-345: Swap prints for logger.debug in quant-method selection

Also reduces log noise and respects line length.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug("wide_ep _get_quant_method: sm=%s",
+                             get_sm_version())
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug("wide_ep _get_quant_method: DeepGemm path")
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug("wide_ep _get_quant_method: Cutlass path")
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

418-421: Remove verbose print from hot-path forward_chunk; use logger.debug placeholders

This avoids formatting cost when debug is disabled and fixes E501.

-        print(
-            f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}"
-        )
+        logger.debug(
+            "forward_chunk enter: lb=%s first=%s last=%s x=%s router=%s "
+            "alltoall=%s num_tokens=%s max_tokens=%s dp_pad=%s repeat=%s",
+            bool(self.layer_load_balancer), is_first_call, is_last_call,
+            getattr(x, "shape", None), getattr(router_logits, "shape", None),
+            use_all_to_all, all_rank_num_tokens, all_rank_max_num_tokens,
+            use_dp_padding, repeating_info,
+        )

730-732: Drop extremely verbose print after backend run

Replace with concise, gated logger.debug if you need quick shape sanity.

-        print(
-            f"xxi shape 4 after moe backend : {getattr(x, 'shape', None)}, final_hidden_states shape: {getattr(final_hidden_states, 'shape', None)}, token_selected_slots shape: {getattr(token_selected_slots, 'shape', None)}, token_final_scales shape: {getattr(token_final_scales, 'shape', None)}, w3_w1_weight shape: {getattr(w3_w1_weight, 'shape', None)}, w2_weight shape: {getattr(w2_weight, 'shape', None)}, quant_scales: {getattr(quant_scales, 'shape', None)}, input_sf: {getattr(x_sf, 'shape', None)}, swizzled_input_sf: False, tp_size: {self.tp_size}, tp_rank: {self.tp_rank}, ep_size: {ep_size}, ep_rank: {ep_rank}, cluster_size: {cluster_size}, cluster_rank: {cluster_rank}, enable_alltoall: {use_all_to_all}, use_deepseek_fp8_block_scale: {use_deepseek_fp8_block_scale}, use_w4_group_scaling: {use_w4_group_scaling}, min_latency_mode: False, tune_max_num_tokens: {self.tune_max_num_tokens}, tuner_num_tokens: {tuner_num_tokens}, tuner_top_k: {tuner_top_k}"
-        )
+        logger.debug(
+            "forward_chunk exit: x=%s out=%s slots=%s scales=%s alltoall=%s",
+            getattr(x, "shape", None),
+            getattr(final_hidden_states, "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            use_all_to_all,
+        )

812-815: Remove leftover one-line diagnostic print (non-English comment)
Replace with logger.debug if needed; prefer English comments for consistency.

-            # 一行打印所有信息
-            print(
-                f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            )
+            # Log summary for debugging (optional)
+            # logger.debug("wide_ep summary: x=%s alltoall=%s tokens=%s padded=%s max_tokens=%s dp_pad=%s out=%s",
+            #              getattr(x, "shape", None), use_all_to_all, all_rank_num_tokens,
+            #              all_rank_num_tokens_padded, all_rank_max_num_tokens, use_dp_padding,
+            #              getattr(outputs, "shape", None))

8-9: Fix logger import source

logger should come from tensorrt_llm.logger, not _utils.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)

835-844: Remove exit-path print; replace with concise logger.debug

-        print(
-            "xxi shape 3: exit deepgemm backend compute_moe \n"
-            f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n"
-            f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-            f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-            f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n"
-            f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-            f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-            f"x.shape: {getattr(x, 'shape', None)}, \n")
+        logger.debug(
+            "[DeepGemm::exit] out=%s perm_data=%s final_scales=%s "
+            "unperm_row2perm=%s perm_row2unperm=%s expert_first_off=%s x=%s",
+            getattr(final_hidden_states, "shape", None),
+            getattr(permuted_data_tensor, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(unpermuted_row_to_permuted_row_tensor, "shape", None),
+            getattr(permuted_row_to_unpermuted_row_tensor, "shape", None),
+            getattr(expert_first_token_offset_tensor, "shape", None),
+            getattr(x, "shape", None),
+        )

1-4: Add NVIDIA copyright header

Per repo guidelines, prepend the current year header to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 """
 MoE Backend abstraction for supporting different MoE computation implementations.
 This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.)
 """

885-887: Fix: call has_deepseek_fp8_block_scales() during backend selection

Currently passing the method object makes the condition always truthy when present.

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        attr = getattr(module, 'has_deepseek_fp8_block_scales', None)
+        has_block_fp8 = bool(attr() if callable(attr) else attr)

224-275: Fix: pass the actual weight view dtype to MoERunner; also ensure boolean flags aren’t bound methods

  • MoERunner should see the same dtype you use for .view(...) during profiling.
  • has_deepseek_fp8_block_scales is a method; passing the bound method is always truthy.
-        # Determine view dtype for weights to match runtime quantization layout
+        # Determine view dtype for weights to match runtime quantization layout
         weight_view_dtype = module.w3_w1_weight.dtype
         if getattr(module, 'has_w4afp8', False):
             weight_view_dtype = torch.quint4x2
         elif getattr(module, 'has_w4a16_mxfp4', False):
             weight_view_dtype = torch.uint8
 ...
         if self.moe_runner is None:
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
                 tp_size=module.tp_size,
                 tp_rank=module.tp_rank,
                 ep_size=module.ep_size,
                 ep_rank=module.ep_rank,
                 cluster_size=module.cluster_size,
                 cluster_rank=module.cluster_rank,
-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') else False
+                ),
                 use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
                 use_int8_woq_per_channel=getattr(module,
                                                  'has_int8_woq_per_channel',
                                                  False),
                 use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
                                               False),
                 min_latency_mode=min_latency_mode,
                 use_fused_finalize=getattr(module, 'use_fused_finalize', False),
             )

816-834: Use experts_per_token (not .top_k) in finalize-scale call

Routing methods expose experts_per_token; .top_k may not exist.

-            module.routing_method.top_k if module else 1,
+            (module.routing_method.experts_per_token if module else 1),
🧹 Nitpick comments (6)
tensorrt_llm/_torch/distributed/ops.py (1)

15-21: Avoid shadowing the project logger with a stdlib logger instance

This file imports the project logger and then reassigns logger to a stdlib logger. Keep the project logger for consistent behavior and formats.

 from tensorrt_llm.logger import logger
 ...
 import logging
 ...
-logger = logging.getLogger(__name__)
+# Do not override the imported project logger.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

672-699: Consider removing commented-out legacy call before merge

Keeping large commented blocks adds noise. Retain in Git history if needed.

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)

381-386: Nit: long call fits, but keep line length < 120 in future edits

Minor style; nothing blocking.


715-727: Remove verbose prints from DeepGemm path; use logger.debug with placeholders

Prevents huge stdout spam and avoids string interpolation cost when debug is off.

-        print(
-            "xxi shape 2: enter deepgemm backend compute_moe \n"
-            f"x.shape: {getattr(x, 'shape', None)}, \n"
-            f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n"
-            f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n"
-            f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-            f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-            f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n"
-            f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-            f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-            f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n"
-            f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n"
-        )
+        logger.debug(
+            "[DeepGemm::enter] x=%s sf=%s slots=%s scales=%s "
+            "perm_row2unperm=%s perm_experts=%s perm_data=%s "
+            "expert_first_off=%s perm_scales=%s unperm_row2perm=%s",
+            getattr(x, "shape", None),
+            getattr(input_sf, "shape", None),
+            getattr(token_selected_slots, "shape", None),
+            getattr(token_final_scales, "shape", None),
+            getattr(permuted_row_to_unpermuted_row_tensor, "shape", None),
+            getattr(permuted_token_selected_experts_tensor, "shape", None),
+            getattr(permuted_data_tensor, "shape", None),
+            getattr(expert_first_token_offset_tensor, "shape", None),
+            getattr(permuted_token_final_scales_tensor, "shape", None),
+            getattr(unpermuted_row_to_permuted_row_tensor, "shape", None),
+        )

720-744: Optional: add a guard for zero-sized inputs earlier

You handle numel()==0 shortly after; consider early return just after permute to skip allocations.
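
A sketch of that hoisted exit, assuming permuted_data_tensor is the permuted activation buffer and that the result of the existing numel()==0 branch (shown here only as a zero placeholder) is acceptable for an empty chunk:

    # Right after permutation, before FP8 buffers/workspaces are allocated.
    if permuted_data_tensor.numel() == 0:
        # Placeholder; in practice reuse whatever the existing empty-input branch returns.
        return torch.zeros_like(x)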


700-714: Note: finalize permute uses use_fp8_block_scaling=True unconditionally

Assumed invariant for DeepGemm path. If this ever becomes runtime-configurable, plumb through from module/backend selector.
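
If it does become configurable, one way to plumb it through (has_deepseek_fp8_block_scales() exists on the MoE interface per the code graph above; defaulting to True preserves today's DeepGemm-only behavior):

    use_fp8_block_scaling = (
        module.has_deepseek_fp8_block_scales()
        if module is not None and hasattr(module, "has_deepseek_fp8_block_scales")
        else True
    )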

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ed54d31 and e3b58fb.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • world_size (26-27)
tensorrt_llm/llmapi/llm_args.py (2)
  • world_size (246-258)
  • world_size (267-274)
tensorrt_llm/logger.py (1)
  • debug (143-144)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (850-893)
  • select_backend (862-893)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py

246-246: Line too long (156 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


419-419: Line too long (461 > 120)

(E501)


731-731: Line too long (965 > 120)

(E501)


814-814: Line too long (372 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)


721-721: Line too long (127 > 120)

(E501)


722-722: Line too long (129 > 120)

(E501)


725-725: Line too long (121 > 120)

(E501)


726-726: Line too long (125 > 120)

(E501)


840-840: Line too long (127 > 120)

(E501)


841-841: Line too long (127 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

🔇 Additional comments (10)
tensorrt_llm/_torch/distributed/ops.py (1)

463-489: MNNVL AUTO/MNNVL enablement and fallback logic looks sound

  • Gated under AUTO or MNNVL strategies.
  • Disables when tp_size != world_size, as expected for multi-node NVLink scope.
  • Safe instantiation with try/except; clean fallbacks and debug logs.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)

237-239: Early attribute initialization is fine

Lazy-init self.moe_backend to None pre-weights is OK and avoids accidental use before weights are created.


368-370: Backend selection after weight creation is appropriate

Selecting the backend only after weights are created aligns with quant-dependent routes. LGTM.


498-509: Good: gate statistic fetch with layer_load_balancer and is_last_call

Prevents unconditional get_local_statistic_tensor calls.


700-729: Backend switchover looks correct; passes through fused_moe-compatible params

The backend API parity with torch.ops.trtllm.fused_moe is preserved; passing module=self enables backend-specific access. LGTM.


777-787: Add lazy initialization of moe_backend in forward

To guard against model_config.skip_create_weights_in_init=True skipping your backend setup, ensure self.moe_backend is always initialized before use:

  • File: tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
    Location: forward method (around lines 777–787)
     def forward(
         self,
         x: Union[torch.Tensor, Fp4QuantizedTensor],
         router_logits: torch.Tensor,
         do_finalize: bool = True,
         output_dtype: Optional[torch.dtype] = None,
         all_rank_num_tokens: Optional[List[int]] = None,
        all_rank_max_num_tokens: Optional[int] = None,
         use_dp_padding: Optional[bool] = None,
     ) -> torch.Tensor:
+        if getattr(self, "moe_backend", None) is None:
+            self.moe_backend = MoEBackendSelection.select_backend(self)
         assert all_rank_num_tokens is not None

This lazy-init ensures that, even if weight creation is deferred, your MoE backend is ready when forward runs. Please verify that no other code paths already cover this initialization.

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)

861-893: Backend selection policy is clear; criteria read well

SM100 + block FP8 -> DeepGemm; else Cutlass. Nice and contained.
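
For reference, the described policy reduces to roughly the following (the backend class names are placeholders; only MoEBackendSelection.select_backend and the SM100 + block-FP8 criterion come from this PR):

    @staticmethod
    def select_backend(module):
        attr = getattr(module, "has_deepseek_fp8_block_scales", None)
        has_block_fp8 = bool(attr() if callable(attr) else attr)
        if get_sm_version() == 100 and has_block_fp8:
            return DeepGemmBackend()  # placeholder name
        return CutlassBackend()  # placeholder name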


536-568: No-op finalize_tactic for DeepGemm is appropriate

DeepGemm’s static tactic design justifies this. Good separation from Cutlass.
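
An illustrative no-op, assuming the backend interface declares finalize_tactic (the method name comes from this review; the signature is a guess):

    def finalize_tactic(self, *args, **kwargs) -> None:
        # DeepGemm uses a static tactic; there is nothing to tune or cache here.
        return None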


497-534: Cutlass run_moe flow is coherent and matches compute_moe contract

Tuner input derivation mirrors fused_moe path; return type aligns with caller.


619-667: DeepGemm path composition is correct

Permutation → (W3/W1) GEMM → SiLU → (W2) GEMM → gather → finalize/scale. Looks consistent with the intended block-FP8 pipeline.

Comment on lines 9 to 13
import torch

from tensorrt_llm._utils import get_sm_version


Contributor

🛠️ Refactor suggestion

Import the project logger for consistent logging across backends

You use print later; switch to logger.debug and import the project logger here.

 import torch
 
-from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 9 to 13,
the module currently imports torch and get_sm_version but later uses print for
debug output; import the project logger (e.g., from tensorrt_llm import logger
or the module's standard logger entrypoint) at the top alongside existing
imports and replace any print(...) calls in this file with logger.debug(...)
calls so logging is consistent across backends and uses the configured logger.

Signed-off-by: xxi <[email protected]>

	modified:   tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

	modified:   tensorrt_llm/_torch/distributed/ops.py
	modified:   tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
	modified:   tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
	modified:   tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from e3b58fb to 15d9841 Compare August 27, 2025 04:59
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (9)
tensorrt_llm/_torch/distributed/ops.py (1)

243-247: Replace print with proper logging; fix long line (E501).

Use the module logger instead of printing to stdout and wrap the message to satisfy the 120-char limit.

-            for val in input:
-                if val is not None and val.shape[dim] != sum_split_size:
-                    print(
-                        f"[reducescatter] val.shape={val.shape}, dim={dim}, val.shape[dim]={val.shape[dim]}, sum_split_size={sum_split_size}, sizes={sizes}"
-                    )
+            for val in input:
+                if val is not None and val.shape[dim] != sum_split_size:
+                    logger.debug(
+                        f"[reducescatter] val.shape={val.shape}, dim={dim}, "
+                        f"val.shape[dim]={val.shape[dim]}, "
+                        f"sum_split_size={sum_split_size}, sizes={sizes}"
+                    )

Run to ensure no stray prints remain:

#!/bin/bash
rg -n -C2 '^\s*print\(' tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

963-965: Replace noisy print with logger.debug; fix E501 and keep logs structured.

Use the project logger and avoid long f-strings on hot paths.

-    print(
-        f"maybe_create_moe_load_balancer: in_supported_model_arch={in_supported_model_arch}, using_ep={using_ep}, using_smart_router={using_smart_router}, model_config.moe_load_balancer={model_config.moe_load_balancer}"
-    )
+    logger.debug(
+        "maybe_create_moe_load_balancer: supported_arch=%s, using_ep=%s, "
+        "using_smart_router=%s, moe_load_balancer=%s",
+        in_supported_model_arch,
+        using_ep,
+        using_smart_router,
+        getattr(model_config, "moe_load_balancer", None),
+    )
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)

245-274: Two fixes: pass the correct weight dtype to MoERunner; call the FP8 flag method.

  • weight_dtype should match the runtime weight.view(dtype).
  • has_deepseek_fp8_block_scales is a method; calling it is required.
-        # Determine view dtype for weights to match runtime quantization layout
+        # Determine view dtype for weights to match runtime quantization layout
         weight_view_dtype = module.w3_w1_weight.dtype
         if getattr(module, 'has_w4afp8', False):
             weight_view_dtype = torch.quint4x2
         elif getattr(module, 'has_w4a16_mxfp4', False):
             weight_view_dtype = torch.uint8
@@
         if self.moe_runner is None:
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
                 tp_size=module.tp_size,
                 tp_rank=module.tp_rank,
                 ep_size=module.ep_size,
                 ep_rank=module.ep_rank,
                 cluster_size=module.cluster_size,
                 cluster_rank=module.cluster_rank,
-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=(
+                    module.has_deepseek_fp8_block_scales()
+                    if hasattr(module, 'has_deepseek_fp8_block_scales') else False
+                ),
                 use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
                 use_int8_woq_per_channel=getattr(module,
                                                  'has_int8_woq_per_channel',
                                                  False),
                 use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
                                               False),
                 min_latency_mode=min_latency_mode,
                 use_fused_finalize=getattr(module, 'use_fused_finalize', False),
             )

830-836: Use experts_per_token (not top_k) to avoid AttributeError and ensure correct routing.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)

311-314: Replace print with logger.debug and wrap message.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to chunking %s",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )

319-321: Replace print with logger.debug; fixes E501 long line too.

-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to deep_ep_max_num_tokens %s > %s",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )

324-324: Replace print with logger.debug for consistency.

-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("alltoall method type: %s", self.alltoall_method_type)

333-345: Swap verbose prints for compact logger.debug in quant-method selection.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    "wide_ep _get_quant_method: get_sm_version()=%s",
+                    get_sm_version(),
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

8-8: Fix logger import; use tensorrt_llm.logger (not _utils).

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🧹 Nitpick comments (7)
tensorrt_llm/_torch/distributed/ops.py (2)

465-487: Make forced MNNVL fallback noisy; clarify dtype=None disable reason.

If strategy=MNNVL but it’s disabled, emit a warning (not debug). Also clarify when dtype=None is the reason.

-            if self.strategy in (AllReduceStrategy.AUTO,
-                                 AllReduceStrategy.MNNVL):
+            if self.strategy in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL):
+                forced_mnnvl = (self.strategy == AllReduceStrategy.MNNVL)
                 if self.mapping.tp_size != self.mapping.world_size:
-                    logger.debug(
-                        f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} "
-                        f"!= world_size:{self.mapping.world_size}")
+                    (logger.warning if forced_mnnvl else logger.debug)(
+                        f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} "
+                        f"!= world_size:{self.mapping.world_size}"
+                    )
                     self.mnnvl_allreduce = None
                 elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
                     try:
                         self.mnnvl_allreduce = MNNVLAllReduce(
                             self.mapping, dtype) if dtype else None
                         if self.mnnvl_allreduce:
-                            logger.debug(f"MNNVLAllReduce is enabled")
+                            logger.debug("MNNVLAllReduce is enabled")
                         else:
-                            logger.debug(f"MNNVLAllReduce is disabled")
+                            (logger.warning if forced_mnnvl else logger.debug)(
+                                "MNNVLAllReduce is disabled (dtype is None)."
+                            )
                     except Exception as e:
-                        logger.debug(
-                            f"MNNVL AllReduce can't be enabled due to {e}.")
+                        (logger.warning if forced_mnnvl else logger.debug)(
+                            f"MNNVL AllReduce can't be enabled due to: {e}"
+                        )
                         self.mnnvl_allreduce = None
                 else:
-                    logger.debug(
-                        f"MNNVLAllReduce can't be enabled due to failing the is_mnnvl check."
-                    )
+                    (logger.warning if forced_mnnvl else logger.debug)(
+                        "MNNVLAllReduce can't be enabled; is_mnnvl check failed."
+                    )
                     self.mnnvl_allreduce = None

To verify behavior, please confirm intended policy: when users explicitly choose MNNVL, should fallback be “warn-and-continue” (as above) or “error-and-stop”?
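
If "error-and-stop" is preferred, a minimal sketch placed after the selection block (assuming that raising at AllReduce init time is acceptable):

    if self.strategy == AllReduceStrategy.MNNVL and self.mnnvl_allreduce is None:
        raise RuntimeError(
            "AllReduceStrategy.MNNVL was requested explicitly, "
            "but MNNVLAllReduce could not be enabled.")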


15-21: Don’t shadow the project logger.

from tensorrt_llm.logger import logger is immediately shadowed by logging.getLogger(__name__), leading to inconsistent log routing.

-logger = logging.getLogger(__name__)
+# Use the project logger imported above for consistent routing/levels.

Alternatively, prefer module import per guidelines and use tensorrt_llm.logger.logger everywhere.
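
A minimal sketch of that module-namespace form (the trtllm_logger alias is illustrative):

    import tensorrt_llm.logger as trtllm_logger

    trtllm_logger.logger.debug("MNNVLAllReduce is enabled")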

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)

382-383: Wrap long conditional assignment to satisfy E501 and readability.

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        run_moe = (
+            self.moe_runner.fused_moe_runner.run_moe_min_latency
+            if min_latency_mode
+            else self.moe_runner.fused_moe_runner.run_moe
+        )

717-729: Delete commented debug prints; they still trigger E501 and clutter the file.

-        # print(
-        #     "xxi shape 2: enter deepgemm backend compute_moe \n"
-        #     f"x.shape: {getattr(x, 'shape', None)}, \n"
-        #     f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n"
-        #     f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n"
-        #     f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-        #     f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-        #     f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n"
-        #     f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-        #     f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-        #     f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n"
-        #     f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n"
-        # )
@@
-        # print(
-        #     "xxi shape 3: exit deepgemm backend compute_moe \n"
-        #     f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n"
-        #     f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-        #     f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-        #     f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n"
-        #     f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-        #     f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-        #     f"x.shape: {getattr(x, 'shape', None)}, \n")

Also applies to: 837-845


1-13: Import the project logger if you adopt logger.debug above.

 from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

418-421: Remove commented debug block; it violates E501 and adds noise.

-        # print(
-        #     f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}"
-        # )

812-815: Delete leftover commented print (non-English comment, long line).

-            # 一行打印所有信息
-            # print(
-            #     f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            # )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e3b58fb and 15d9841.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (11 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/distributed/ops.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
  • MoEBackendSelection (852-898)
  • select_backend (864-898)
  • run_moe (101-212)
  • run_moe (413-533)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • silu_and_mul_masked_post_quant_fwd (304-375)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/distributed/ops.py (3)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • world_size (26-27)
tensorrt_llm/logger.py (1)
  • debug (143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


419-419: Line too long (463 > 120)

(E501)


731-731: Line too long (967 > 120)

(E501)


814-814: Line too long (374 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

382-382: Line too long (136 > 120)

(E501)


723-723: Line too long (129 > 120)

(E501)


724-724: Line too long (131 > 120)

(E501)


727-727: Line too long (123 > 120)

(E501)


728-728: Line too long (127 > 120)

(E501)


842-842: Line too long (129 > 120)

(E501)


843-843: Line too long (129 > 120)

(E501)

tensorrt_llm/_torch/distributed/ops.py

246-246: Line too long (156 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py

964-964: Line too long (219 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

368-370: LGTM: backend selection deferred until weights exist.

The lazy init avoids probing module state too early and keeps forward path clean.
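For readers following along, a minimal sketch of the lazy-init pattern being approved here, using names referenced elsewhere in this review (_moe_backend_impl, MoEBackendSelection.select_backend); treat it as illustrative rather than the exact implementation:

from tensorrt_llm._torch.modules.fused_moe.moe_backend import MoEBackendSelection


class _WideEPMoESketch:
    """Illustrative stand-in for the reviewed module."""

    def __init__(self):
        # Sentinel: the backend is chosen lazily, after weights/quant flags exist.
        self._moe_backend_impl = None

    @property
    def moe_backend(self):
        if self._moe_backend_impl is None:
            # Selection can now safely inspect weights and quantization flags.
            self._moe_backend_impl = MoEBackendSelection.select_backend(self)
        return self._moe_backend_impl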


700-728: LGTM: unified backend.run_moe call mirrors fused_moe signature.

Passing module=self enables backend-specific decisions without leaking details here.

Comment on lines +311 to +321
print(
f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
)
return False

# For DeepEPLowLatency, check if tokens exceed the threshold
if (self.alltoall_method_type == AlltoallMethodType.DeepEPLowLatency
and all_rank_max_num_tokens > self.deep_ep_max_num_tokens):
print(
f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
)

💡 Verification agent

🧩 Analysis chain

Quick audit: ensure no stray print() calls remain in fused_moe modules.

Run this repo-local grep to confirm cleanup across the module.

Also applies to: 324-324, 333-345, 418-421, 812-815


🏁 Script executed:

#!/bin/bash
# Find any remaining print() in fused_moe Python sources (code, not tests).
rg -nP '^\s*print\(' tensorrt_llm/_torch/modules/fused_moe -g '!**/test/**' -S

Length of output: 1144


Remove all print() calls from fused_moe modules

Audit via the provided rg output shows the following remaining print() statements that should be removed or replaced with proper logging:

  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
    • Line 963: print( … )
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
    • Lines 893, 897: print("xxi select backend: …")
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
    • Lines 77, 95, 102: debug routing prints
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
    • Lines 311, 319, 324, 333, 337, 342: alltoall and chunking diagnostics

Please remove these print() calls. If retaining these messages is necessary for debugging or telemetry, convert them to the module’s logger (e.g., logging.debug or logging.info) at the appropriate level.
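A minimal before/after sketch of that conversion, using the standard-library logging fallback suggested in the prompt further below (the project-wide tensorrt_llm.logger is preferred where available); the function and argument names are illustrative:

import logging

logger = logging.getLogger(__name__)


def report_alltoall_disabled(num_chunks: int) -> None:
    # Before: print(f"can not use alltoall due to chunking {num_chunks}")
    # After: routed through the module logger at DEBUG level with lazy formatting.
    logger.debug("can not use alltoall due to chunking %d", num_chunks)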

🧰 Tools
🪛 Ruff (0.12.2)

320-320: Line too long (127 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py around lines 311
to 321, remove the inline print() calls and replace them with the module logger
(e.g., logger.debug or logger.info) preserving the original message
text/formatting; ensure a logger is available (import logging and create logger
= logging.getLogger(__name__) if not already present) and keep control
flow/return behavior unchanged so the function still returns False where
intended.

Comment on lines 886 to 898
is_blackwell = get_sm_version() == 100
has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
and module.has_deepseek_fp8_block_scales)

# print(f"xxi select backend: is_blackwell: {is_blackwell}, has_block_fp8: {has_block_fp8}")
if is_blackwell and has_block_fp8:
# Use DeepGemm backend for Blackwell with block FP8
print("xxi select backend: use deepgemm backend")
return MoeDeepGemmBackend()
else:
# Use Cutlass backend for all other cases
print("xxi select backend: use cutlass backend")
return MoeCutlassBackend()

⚠️ Potential issue

Fix method/boolean misuse in backend selection and replace prints with logger.

  • Call has_deepseek_fp8_block_scales() safely (it’s a method).
  • Use logger.debug instead of print.
-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales')
+            and (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+            )
+        )
@@
-            print("xxi select backend: use deepgemm backend")
+            logger.debug("MoEBackendSelection: using DeepGemm backend")
             return MoeDeepGemmBackend()
         else:
-            print("xxi select backend: use cutlass backend")
-            return MoeCutlassBackend()
+            logger.debug("MoEBackendSelection: using Cutlass backend")
+            return MoeCutlassBackend()

Add this import near the top of the file (outside this hunk):

from tensorrt_llm.logger import logger
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py around lines 886 to 898,
the code treats has_deepseek_fp8_block_scales as a boolean attribute and uses
print statements; change it to safely call the method only if present and
callable (e.g., hasattr(...) and callable(...)) and capture its boolean result
into has_block_fp8, replace both print calls with logger.debug, and add the
required import "from tensorrt_llm.logger import logger" near the top of the
file.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

1-1: Add NVIDIA copyright header (2025).

All source files must prepend the current-year NVIDIA copyright header.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
♻️ Duplicate comments (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

311-321: Replace print() diagnostics with logger.debug and fix lints.

Raw prints in hot path spam stdout and violate E501.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Disable alltoall: chunking=%d",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
...
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Disable alltoall: deep_ep_max_num_tokens %d > %d",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )
...
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall method type: %s", self.alltoall_method_type)

Also applies to: 324-324


333-345: Use logger for quant-method selection; remove prints.

Convert to structured debug logs and keep lines <120.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug("wide_ep _get_quant_method: sm=%s", get_sm_version())
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug("wide_ep _get_quant_method: DeepGemm variant")
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug("wide_ep _get_quant_method: Cutlass variant")
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

8-8: Fix logger import; _utils does not export logger.

Import logger from tensorrt_llm.logger to avoid ImportError at runtime.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)

11-11: Import project logger and stop using print.

Use tensorrt_llm.logger for consistent, configurable logging.

-from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger

748-754: Fix attribute: use experts_per_token (not top_k).

Routing exposes experts_per_token elsewhere; .top_k may raise AttributeError.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,

1-816: Remove all stray print statements in fused_moe modules

The following debug print() calls were detected and should be removed (or replaced with proper logging) to avoid unintended console output:

• tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py:963
• tensorrt_llm/_torch/modules/fused_moe/moe_backend.py:811, 815
• tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py:311, 319, 324, 333, 337, 342
• tensorrt_llm/_torch/modules/fused_moe/routing.py:77, 95, 102

Please remove these calls (or switch to a standardized logger) and ensure no new print() statements are introduced.

🧹 Nitpick comments (8)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

428-431: Remove commented-out print blocks to satisfy E501 and reduce noise.

These long commented prints trigger line-length lint and add churn in diffs.

Would you like me to open a follow-up patch to delete these comment blocks?

Also applies to: 679-706, 728-731, 811-813


309-326: Optional: add a fast path guard for can_use_alltoall.

Return early if self.alltoall_method_type is NotEnabled to save work.

     def can_use_alltoall(self, all_rank_num_tokens, all_rank_max_num_tokens):
+        if not self.enable_alltoall:
+            return False
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)

217-241: Use weight_view_dtype for MoERunner to match runtime .view dtype.

Prevents tactic/profile mismatch when view dtype differs from storage (e.g., W4 group scaling).

-            self.moe_runner = MoERunner(
+            self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
@@
-                use_deepseek_fp8_block_scale=getattr(
-                    module, 'has_deepseek_fp8_block_scales', False),
+                use_deepseek_fp8_block_scale=getattr(
+                    module, 'has_deepseek_fp8_block_scales', False),

321-328: Remove unused locals flagged by Ruff.

These are never referenced in compute_moe.

-        use_deepseek_fp8_block_scale = getattr(module,
-                                               'has_deepseek_fp8_block_scales',
-                                               False)
         use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
-        use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel',
-                                           False)

348-349: Wrap long line to satisfy E501.

Minor readability and lint fix.

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        run_moe = (
+            self.moe_runner.fused_moe_runner.run_moe_min_latency
+            if min_latency_mode
+            else self.moe_runner.fused_moe_runner.run_moe
+        )

635-647: Delete long commented debug prints to appease Ruff E501.

These blocks are commented and unnecessary in production.

Would you like me to submit a follow-up commit removing them?

Also applies to: 755-764


804-816: Replace backend-selection prints with logger.debug.

Keep selection diagnostics but via project logger.

-        is_blackwell = get_sm_version() == 100
-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        is_blackwell = get_sm_version() == 100
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales')
+            and module.has_deepseek_fp8_block_scales
+        )
@@
-        # print(f"xxi select backend: is_blackwell: {is_blackwell}, has_block_fp8: {has_block_fp8}")
         if is_blackwell and has_block_fp8:
-            # Use DeepGemm backend for Blackwell with block FP8
-            print("xxi select backend: use deepgemm backend")
+            logger.debug(
+                "MoEBackendSelection: using DeepGemm (SM=%s, block_fp8=%s)",
+                is_blackwell, has_block_fp8,
+            )
             return MoEDeepGemmBackend()
         else:
-            # Use Cutlass backend for all other cases
-            print("xxi select backend: use cutlass backend")
+            logger.debug(
+                "MoEBackendSelection: using Cutlass (SM=%s, block_fp8=%s)",
+                is_blackwell, has_block_fp8,
+            )
             return MoECutlassBackend()

202-206: The assert message is useful; an early return when tuner_input is None would be a reasonable alternative.

The existing assert already covers this case, so leaving it as-is is fine.
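If the early-return variant is ever preferred over the assert, a sketch under the assumption that skipping tuning is acceptable when no tuner input is available; names are hypothetical:

from typing import Optional

import torch


def should_run_tuning(tuner_input: Optional[torch.Tensor]) -> bool:
    """Hypothetical guard: skip profiling instead of asserting on None input."""
    if tuner_input is None:
        return False  # caller keeps the default tactic
    return True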

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 15d9841 and ddc8278.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (13 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
  • MoEBackend (17-177)
  • MoEBackendSelection (770-816)
  • select_backend (782-816)
  • run_moe (109-177)
  • run_moe (379-461)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • MoE (22-181)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


429-429: Line too long (463 > 120)

(E501)


729-729: Line too long (967 > 120)

(E501)


812-812: Line too long (374 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

322-322: Local variable use_deepseek_fp8_block_scale is assigned to but never used

Remove assignment to unused variable use_deepseek_fp8_block_scale

(F841)


326-326: Local variable use_int8_woq_per_channel is assigned to but never used

Remove assignment to unused variable use_int8_woq_per_channel

(F841)


348-348: Line too long (136 > 120)

(E501)


641-641: Line too long (129 > 120)

(E501)


642-642: Line too long (131 > 120)

(E501)


645-645: Line too long (123 > 120)

(E501)


646-646: Line too long (127 > 120)

(E501)


760-760: Line too long (129 > 120)

(E501)


761-761: Line too long (129 > 120)

(E501)

🔇 Additional comments (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

368-380: LGTM: Lazy backend property and module-first call.

The lazy initialization via the moe_backend property, together with passing self into run_moe, is clean and defers backend selection until after the weights exist.


237-239: LGTM: Backend lazy-init sentinel.

Storing _moe_backend=None with a clear comment is fine.


500-518: No changes needed for get_local_statistic_tensor guard

The call to get_local_statistic_tensor is intentionally limited to the final iteration (is_last_call) of the All-to-All loop:

  • update_local_statistic(...) is invoked on every call when use_all_to_all && method==MNNVL, satisfying the assert in get_local_statistic_tensor and populating local_statistic_tensor before the last call.
  • Only on is_last_call do we fetch the local statistics and pass them into alltoall_prepare_maybe_dispatch, so that the per-rank stats are gathered exactly once for the global update path. Skipping this on non-last calls avoids redundant or premature gathers.

This matches the intended protocol; no further guard or changes are required.
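A simplified sketch of that protocol for readers skimming the thread; the helper name and argument lists are hypothetical placeholders, and only the call ordering mirrors the description above:

def gather_local_stats_if_last(layer_load_balancer, use_all_to_all: bool,
                               is_mnnvl: bool, token_selected_slots,
                               is_last_call: bool):
    """Hypothetical helper mirroring the described statistics protocol."""
    if not (use_all_to_all and is_mnnvl):
        return None
    # Update per-rank statistics on every invocation of the loop body ...
    layer_load_balancer.update_local_statistic(token_selected_slots)
    if is_last_call:
        # ... but fetch the tensor exactly once, on the final call, before it
        # is handed to alltoall_prepare_maybe_dispatch.
        return layer_load_balancer.get_local_statistic_tensor()
    return None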

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (3)

376-378: Return-type contract alignment looks good.

Returning [output] in non-min-latency mode keeps caller’s indexing stable across backends.

Also applies to: 765-767
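A tiny sketch of that return contract; the contents of the min-latency result are an assumption, and the helper is purely illustrative:

from typing import List

import torch


def wrap_moe_output(output: torch.Tensor, outputs: List[torch.Tensor],
                    min_latency_mode: bool) -> List[torch.Tensor]:
    """Illustrative: callers always index result[0], regardless of backend."""
    if min_latency_mode:
        return outputs      # assumed to already be a list in this path
    return [output]         # wrap the single tensor so indexing stays uniform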


441-442: Same bug in Cutlass.run_moe: pass use_fused_finalize correctly.

Positional args currently misbind. Use keywords.

-        self.finalize_tactic(module, tuner_input, output_dtype,
-                             min_latency_mode, tuner_top_k)
+        self.finalize_tactic(
+            module,
+            tuner_input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )

Likely an incorrect or invalid review comment.


157-160: Bug: finalize_tactic called with wrong positional args.

tuner_top_k is passed into use_fused_finalize in Cutlass override. Use keywords and pass use_fused_finalize.

-        self.finalize_tactic(module, input, output_dtype, min_latency_mode,
-                             tuner_top_k)
+        self.finalize_tactic(
+            module,
+            input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )

Likely an incorrect or invalid review comment.

@@ -0,0 +1,816 @@
"""

🛠️ Refactor suggestion

Add NVIDIA copyright header (2025).

Comply with repo policy for all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
"""
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/moe_backend.py lines 1-1, the file is
missing the required NVIDIA copyright header for 2025; insert the repo-standard
NVIDIA copyright/license header at the very top of the file before any docstring
or code, updating the year to 2025 and including the required
SPDX-License-Identifier and organization attribution as specified by the
repository policy.

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from ddc8278 to 1c7826c on August 27, 2025 at 11:37
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

1-1: Prepend NVIDIA copyright header (2025).

This production source file needs the standard header per repo policy.

Add at the very top:

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the NVIDIA Source Code License.
#
# SPDX-License-Identifier: NVIDIA-Source-Code
♻️ Duplicate comments (9)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)

811-816: Replace print with logger and standardize messages in backend selection.

Avoid stdout spam; use project logger.

-            print("xxi select backend: use deepgemm backend")
+            logger.debug("MoEBackendSelection: using DeepGemm backend")
             return MoEDeepGemmBackend()
         else:
-            print("xxi select backend: use cutlass backend")
-            return MoECutlassBackend()
+            logger.debug("MoEBackendSelection: using Cutlass backend")
+            return MoECutlassBackend()

28-37: API mismatch: base finalize_tactic missing use_fused_finalize; callers misbind args.

Align the base signature with MoECutlassBackend to avoid positional misbinding and enable named passing.

     def finalize_tactic(
         self,
         module: Any,
         tuner_input: torch.Tensor,
         output_dtype: torch.dtype,
-        min_latency_mode: bool = False,
-        tuner_top_k: Optional[int] = None,
+        min_latency_mode: bool = False,
+        use_fused_finalize: bool = True,
+        tuner_top_k: Optional[int] = None,
     ) -> None:
@@
-            min_latency_mode: Whether to profile for min-latency path
+            min_latency_mode: Whether to profile for min-latency path
+            use_fused_finalize: Whether to use fused finalize epilogue (backend-dependent)
             tuner_top_k: Top-k value for tuning (Cutlass specific)

210-216: Use weight_view_dtype when constructing MoERunner.

MoERunner should be initialized with the dtype actually used by the runtime weight views, to avoid an inconsistency between tuning and execution.

         # Determine view dtype for weights to match runtime quantization layout
         weight_view_dtype = module.w3_w1_weight.dtype
         if getattr(module, 'has_w4afp8', False):
             weight_view_dtype = torch.quint4x2
         elif getattr(module, 'has_w4a16_mxfp4', False):
             weight_view_dtype = torch.uint8
@@
         if self.moe_runner is None:
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
                 tp_size=module.tp_size,
                 tp_rank=module.tp_rank,
                 ep_size=module.ep_size,
                 ep_rank=module.ep_rank,
                 cluster_size=module.cluster_size,
                 cluster_rank=module.cluster_rank,
                 use_deepseek_fp8_block_scale=getattr(
                     module, 'has_deepseek_fp8_block_scales', False),
                 use_w4_group_scaling=getattr(module, 'has_w4afp8', False),
                 use_int8_woq_per_channel=getattr(module,
                                                  'has_int8_woq_per_channel',
                                                  False),
                 use_mxfp8_act_scaling=getattr(module, 'has_mxfp8_act_scaling',
                                               False),
                 min_latency_mode=min_latency_mode,
                 use_fused_finalize=use_fused_finalize,
             )

Also applies to: 219-240


748-754: Routing API: use experts_per_token (not top_k).

Else this can raise AttributeError or wrong values with some routing methods.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,

11-13: Import the canonical logger and stop using print().

Prepare for replacing print() with logger calls below.

 from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger

1-4: Add NVIDIA copyright header (2025).

Source files must include the NVIDIA header per repo policy.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the NVIDIA Source Code License.
+#
+# SPDX-License-Identifier: NVIDIA-Source-Code
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

311-325: Swap debug prints for logger.debug in can_use_alltoall.

Avoid stdout noise and align with project logging.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall: chunking required (%d chunks)",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
@@
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: deep_ep_max_num_tokens %d > %d",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )
@@
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall type: %s", self.alltoall_method_type)

333-345: Replace prints in _get_quant_method with logger.debug.

Keep diagnostics but not as stdout prints.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    "wide_ep _get_quant_method: get_sm_version()=%s",
+                    get_sm_version(),
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

8-9: Incorrect logger import; use tensorrt_llm.logger.

Fixes runtime import error.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

322-328: Remove unused locals flagged by Ruff.

Drop variables that are assigned but not used.

-        use_deepseek_fp8_block_scale = getattr(module,
-                                               'has_deepseek_fp8_block_scales',
-                                               False)
         use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
-        use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel',
-                                           False)
+        # int8 per-channel flag is handled inside fused op if needed
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ddc8278 and 1c7826c.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (13 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
  • MoEBackend (17-177)
  • MoEBackendSelection (770-816)
  • select_backend (782-816)
  • run_moe (109-177)
  • run_moe (379-461)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • MoE (22-181)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • enable_alltoall (298-301)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


429-429: Line too long (463 > 120)

(E501)


729-729: Line too long (967 > 120)

(E501)


812-812: Line too long (374 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

322-322: Local variable use_deepseek_fp8_block_scale is assigned to but never used

Remove assignment to unused variable use_deepseek_fp8_block_scale

(F841)


326-326: Local variable use_int8_woq_per_channel is assigned to but never used

Remove assignment to unused variable use_int8_woq_per_channel

(F841)


348-348: Line too long (136 > 120)

(E501)


641-641: Line too long (129 > 120)

(E501)


642-642: Line too long (131 > 120)

(E501)


645-645: Line too long (123 > 120)

(E501)


646-646: Line too long (127 > 120)

(E501)


760-760: Line too long (129 > 120)

(E501)


761-761: Line too long (129 > 120)

(E501)

🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

368-380: Property uses _moe_backend_impl; ensure consistent lazy init.

After fixing __init__, this block is fine. If you prefer the _moe_backend naming, rename it consistently in both places.

Would you like me to submit a follow-up diff to rename to _moe_backend across the file for consistency?

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from 1c7826c to e8af58e on August 27, 2025 at 12:47
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (12)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (8)

477-485: Keep DeepGemm finalize_tactic signature compatible with base.

Add unused use_fused_finalize for API consistency.

     def finalize_tactic(
         self,
         module: Any,
         tuner_input: torch.Tensor,
         output_dtype: torch.dtype,
-        min_latency_mode: bool = False,
+        min_latency_mode: bool = False,
+        use_fused_finalize: bool = True,
         tuner_top_k: Optional[int] = None,
     ) -> None:

805-816: Replace prints with logger and make has_block_fp8 robust.

Use logger.debug and handle property vs callable for has_deepseek_fp8_block_scales.

-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = (
+            hasattr(module, 'has_deepseek_fp8_block_scales') and
+            (
+                module.has_deepseek_fp8_block_scales()
+                if callable(getattr(module, 'has_deepseek_fp8_block_scales'))
+                else bool(getattr(module, 'has_deepseek_fp8_block_scales'))
+            )
+        )
@@
-        if is_blackwell and has_block_fp8:
-            # Use DeepGemm backend for Blackwell with block FP8
-            print("xxi select backend: use deepgemm backend")
-            return MoEDeepGemmBackend()
-        else:
-            # Use Cutlass backend for all other cases
-            print("xxi select backend: use cutlass backend")
-            return MoECutlassBackend()
+        if is_blackwell and has_block_fp8:
+            logger.debug("MoEBackendSelection: using DeepGemm backend")
+            return MoEDeepGemmBackend()
+        else:
+            logger.debug("MoEBackendSelection: using Cutlass backend")
+            return MoECutlassBackend()

1-4: Add NVIDIA copyright header (2025).

Required by repo policy for Python sources.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

28-37: Align base finalize_tactic signature with overrides (add use_fused_finalize).

Prevents positional misbinding in callers and keeps LSP-safe.

     def finalize_tactic(
         self,
         module: Any,
         tuner_input: torch.Tensor,
         output_dtype: torch.dtype,
-        min_latency_mode: bool = False,
+        min_latency_mode: bool = False,
+        use_fused_finalize: bool = True,
         tuner_top_k: Optional[int] = None,
     ) -> None:
@@
-            min_latency_mode: Whether to profile for min-latency path
+            min_latency_mode: Whether to profile for min-latency path
+            use_fused_finalize: Whether to use fused finalize epilogue (backend-dependent)
             tuner_top_k: Top-k value for tuning (Cutlass specific)

157-159: Bug: positional arg misbound to use_fused_finalize. Pass kwargs.

Currently tuner_top_k is bound to use_fused_finalize in subclass overrides.

-        self.finalize_tactic(module, input, output_dtype, min_latency_mode,
-                             tuner_top_k)
+        self.finalize_tactic(
+            module,
+            input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )

217-241: MoERunner weight_dtype should match view dtype used during tuning.

Avoids constructor/runtime dtype mismatch.

         if self.moe_runner is None:
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,

441-443: Bug: positional arg misbound to use_fused_finalize (Cutlass.run_moe).

Pass keywords to avoid binding tuner_top_k into use_fused_finalize.

-        self.finalize_tactic(module, tuner_input, output_dtype,
-                             min_latency_mode, tuner_top_k)
+        self.finalize_tactic(
+            module,
+            tuner_input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )

748-748: Use experts_per_token (not top_k) for routing attribute.

Prevents AttributeError on some routing implementations.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)

1-1: Add NVIDIA copyright header (2025).

Required by repo policy for production sources.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

311-325: Replace debug prints with logger.debug in can_use_alltoall.

Avoid stdout noise and satisfy linters.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall: chunking=%d",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
@@
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: deep_ep_max_num_tokens %d > %d",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )
@@
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall method type: %s", self.alltoall_method_type)

333-345: Replace prints in _get_quant_method with logger.debug.

Keep logs gated and concise.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    "wide_ep _get_quant_method: get_sm_version()=%s",
+                    get_sm_version(),
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepGemm FP8 block-scales"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: Cutlass FP8 block-scales"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

8-8: Fix logger import; use canonical module.

logger isn’t exported from _utils.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)

11-12: Use project logger; import from tensorrt_llm.logger.

Prepare for replacing print() below and keep logging consistent.

-from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger

322-328: Drop unused locals flagged by Ruff.

Assignments are unused; remove to satisfy lint and reduce noise.

-        use_deepseek_fp8_block_scale = getattr(module,
-                                               'has_deepseek_fp8_block_scales',
-                                               False)
-        use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
-        use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel',
-                                           False)
+        use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1c7826c and e8af58e.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (13 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
  • MoEBackend (17-177)
  • MoEBackendSelection (770-816)
  • select_backend (782-816)
  • run_moe (109-177)
  • run_moe (379-461)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • get_moe_load_balancer (1027-1037)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • MoE (22-181)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/autotuner.py (2)
  • AutoTuner (271-752)
  • choose_one (329-436)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • enable_alltoall (298-301)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • silu_and_mul_masked_post_quant_fwd (304-375)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


429-429: Line too long (463 > 120)

(E501)


729-729: Line too long (967 > 120)

(E501)


812-812: Line too long (374 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

322-322: Local variable use_deepseek_fp8_block_scale is assigned to but never used

Remove assignment to unused variable use_deepseek_fp8_block_scale

(F841)


326-326: Local variable use_int8_woq_per_channel is assigned to but never used

Remove assignment to unused variable use_int8_woq_per_channel

(F841)


348-348: Line too long (136 > 120)

(E501)


641-641: Line too long (129 > 120)

(E501)


642-642: Line too long (131 > 120)

(E501)


645-645: Line too long (123 > 120)

(E501)


646-646: Line too long (127 > 120)

(E501)


760-760: Line too long (129 > 120)

(E501)


761-761: Line too long (129 > 120)

(E501)

🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

368-379: LGTM: lazy backend initialization property is clean and safe.

Correctly guards on weights and centralizes backend selection.
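
For readers outside the diff, a minimal sketch of the lazy-initialization pattern being approved here (the guard attribute and the assumption that MoEBackendSelection.select_backend(module) is a static helper are inferred, not taken from the PR):

from typing import Optional

from tensorrt_llm._torch.modules.fused_moe.moe_backend import (
    MoEBackend, MoEBackendSelection)


class LazyBackendMixin:
    """Sketch: select the MoE backend lazily, gated on weight creation."""

    _moe_backend: Optional[MoEBackend] = None
    _weights_created: bool = False

    @property
    def moe_backend_impl(self) -> MoEBackend:
        # Defer selection until weights exist so capability checks
        # (block-FP8 flags, SM version) see the final module state.
        if self._moe_backend is None:
            assert self._weights_created, "create weights before selecting a backend"
            self._moe_backend = MoEBackendSelection.select_backend(self)
        return self._moe_backend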

@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from e8af58e to 6c62b25 Compare August 28, 2025 23:48
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1)

800-815: Replace prints with logger and make has_block_fp8 robust.

Use the project logger and bool(getattr(...)) for clarity; add the logger import.

+from tensorrt_llm.logger import logger
@@
-        has_block_fp8 = (hasattr(module, 'has_deepseek_fp8_block_scales')
-                         and module.has_deepseek_fp8_block_scales)
+        has_block_fp8 = bool(getattr(module, 'has_deepseek_fp8_block_scales', False))
@@
-            print("xxi select backend: use deepgemm backend")
+            logger.debug("MoEBackendSelection: using DeepGemm backend")
             return MoEDeepGemmBackend()
         else:
-            print("xxi select backend: use cutlass backend")
+            logger.debug("MoEBackendSelection: using Cutlass backend")
             return MoECutlassBackend()
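
Taken together, the hunks above reduce to roughly the following selection rule (combining the block-FP8 flag with the SM-100 check is an inference from these diffs, not a statement of the exact condition):

from tensorrt_llm._torch.modules.fused_moe.moe_backend import (
    MoECutlassBackend, MoEDeepGemmBackend)
from tensorrt_llm._utils import get_sm_version
from tensorrt_llm.logger import logger


def select_backend_sketch(module):
    # Prefer DeepGemm only for DeepSeek block-FP8 scales on SM100; otherwise
    # fall back to the Cutlass backend.
    has_block_fp8 = bool(getattr(module, 'has_deepseek_fp8_block_scales', False))
    if has_block_fp8 and get_sm_version() == 100:
        logger.debug("MoEBackendSelection: using DeepGemm backend")
        return MoEDeepGemmBackend()
    logger.debug("MoEBackendSelection: using Cutlass backend")
    return MoECutlassBackend()
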
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

1-1: Add NVIDIA copyright header (2025).

Required for production Python files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import os
♻️ Duplicate comments (9)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (6)

1-4: Add NVIDIA copyright header (2025).

Comply with repo policy for source files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 """
 MoE Backend abstraction for supporting different MoE computation implementations.
 This module provides a unified interface for different MoE backends (Cutlass, DeepGemm, etc.)
 """

476-483: Keep DeepGemm.finalize_tactic signature compatible with base.

Add the (currently unused) use_fused_finalize parameter to the override so its signature matches the base class.

     def finalize_tactic(
         self,
         module: Any,
         tuner_input: torch.Tensor,
         output_dtype: torch.dtype,
         min_latency_mode: bool = False,
+        use_fused_finalize: bool = True,
         tuner_top_k: Optional[int] = None,
     ) -> None:

441-443: Fix Cutlass.run_moe finalize_tactic call: pass named args (same misbinding).

Mirror the base run_moe fix to avoid positional confusion.

-        self.finalize_tactic(module, tuner_input, output_dtype,
-                             min_latency_mode, tuner_top_k)
+        self.finalize_tactic(
+            module,
+            tuner_input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )

157-158: Fix finalize_tactic call: pass by name to avoid positional misbinding.

Currently tuner_top_k binds to use_fused_finalize in overrides.

-        self.finalize_tactic(module, input, output_dtype, min_latency_mode,
-                             tuner_top_k)
+        self.finalize_tactic(
+            module,
+            input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )
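
A toy reproduction of the misbinding (signatures simplified, not the PR's actual code): once an override inserts use_fused_finalize before tuner_top_k, the fifth positional argument lands in the wrong parameter.

class Base:
    def finalize_tactic(self, module, x, dtype,
                        min_latency_mode=False, tuner_top_k=None):
        return {"tuner_top_k": tuner_top_k}


class Cutlass(Base):
    def finalize_tactic(self, module, x, dtype,
                        min_latency_mode=False, use_fused_finalize=True,
                        tuner_top_k=None):
        return {"use_fused_finalize": use_fused_finalize,
                "tuner_top_k": tuner_top_k}


# A caller written against Base's positional order: the value meant for
# tuner_top_k (2) silently binds to use_fused_finalize in the override.
assert Cutlass().finalize_tactic(None, None, None, False, 2) == {
    "use_fused_finalize": 2, "tuner_top_k": None}
# Binding by keyword avoids the problem.
assert Cutlass().finalize_tactic(None, None, None, tuner_top_k=2)["tuner_top_k"] == 2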

30-37: Align base API: add use_fused_finalize to finalize_tactic signature.

Keep LSP with MoECutlassBackend override and avoid caller misbinding.

     def finalize_tactic(
         self,
         module: Any,
         tuner_input: torch.Tensor,
         output_dtype: torch.dtype,
         min_latency_mode: bool = False,
+        use_fused_finalize: bool = True,
         tuner_top_k: Optional[int] = None,
     ) -> None:

746-752: Use experts_per_token (not top_k) when finalizing.

Avoid AttributeError and ensure correct routing parameter.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

311-325: Replace print diagnostics with logger.debug.

Avoid noisy stdout; adhere to project logging. Also fixes E501 by shortening lines.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to chunking=%s",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
@@
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: max_tokens %s > limit %s",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )
@@
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall type: %s", self.alltoall_method_type)
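
For context, the gates whose diagnostics change here amount to something like the sketch below (the free-function form and the interpretation of deep_ep_max_num_tokens as a fixed DeepEP token budget are assumptions):

from tensorrt_llm.logger import logger


def can_use_alltoall(num_chunks, all_rank_max_num_tokens, deep_ep_max_num_tokens):
    # Gate 1: chunked execution rules alltoall out.
    if num_chunks > 1:
        logger.debug("Cannot use alltoall due to chunking %d", num_chunks)
        return False
    # Gate 2: the per-rank token count must fit the DeepEP limit.
    if all_rank_max_num_tokens > deep_ep_max_num_tokens:
        logger.debug("Cannot use alltoall: max_tokens %d > limit %d",
                     all_rank_max_num_tokens, deep_ep_max_num_tokens)
        return False
    return True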

333-345: Swap remaining prints in _get_quant_method for logger.debug.

Keep logging consistent; no behavior change.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug(
+                    "wide_ep _get_quant_method: get_sm_version()=%s",
+                    get_sm_version(),
+                )
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()

8-8: Fix logger import (use tensorrt_llm.logger).

tensorrt_llm._utils doesn’t export logger; this import will fail at import time.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
🧹 Nitpick comments (3)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)

322-328: Remove unused locals flagged by Ruff.

These flags aren’t used in this scope; runner already captured them during finalize.

-        use_deepseek_fp8_block_scale = getattr(module,
-                                               'has_deepseek_fp8_block_scales',
-                                               False)
         use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
-        use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel',
-                                           False)

633-645: Delete long commented debug prints (line-length lint).

These commented blocks trigger E501; remove to keep the file clean.

-        # print(
-        #     "xxi shape 2: enter deepgemm backend compute_moe \n"
-        #     f"x.shape: {getattr(x, 'shape', None)}, \n"
-        #     f"input_sf.shape: {getattr(input_sf, 'shape', None)}, \n"
-        #     f"token_selected_slots.shape: {getattr(token_selected_slots, 'shape', None)}, \n"
-        #     f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-        #     f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-        #     f"permuted_token_selected_experts_tensor.shape: {getattr(permuted_token_selected_experts_tensor, 'shape', None)}, \n"
-        #     f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-        #     f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-        #     f"permuted_token_final_scales_tensor.shape: {getattr(permuted_token_final_scales_tensor, 'shape', None)}, \n"
-        #     f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}\n"
-        # )
@@
-        # print(
-        #     "xxi shape 3: exit deepgemm backend compute_moe \n"
-        #     f"final_hidden_states.shape: {getattr(final_hidden_states, 'shape', None)}\n"
-        #     f"permuted_data_tensor.shape: {getattr(permuted_data_tensor, 'shape', None)}, \n"
-        #     f"token_final_scales.shape: {getattr(token_final_scales, 'shape', None)}, \n"
-        #     f"unpermuted_row_to_permuted_row_tensor.shape: {getattr(unpermuted_row_to_permuted_row_tensor, 'shape', None)}, \n"
-        #     f"permuted_row_to_unpermuted_row_tensor.shape: {getattr(permuted_row_to_unpermuted_row_tensor, 'shape', None)}, \n"
-        #     f"expert_first_token_offset_tensor.shape: {getattr(expert_first_token_offset_tensor, 'shape', None)}, \n"
-        #     f"x.shape: {getattr(x, 'shape', None)}, \n")

Also applies to: 753-762

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

428-431: Remove long commented print blocks (lint: E501).

They’re not useful in production; delete to keep the file clean.

-        # print(
-        #     f"xxi shape 1: enter wide_ep forward_chunk: layer_load_balancer={self.layer_load_balancer}, is_first_call={is_first_call}, is_last_call={is_last_call}, x shape: {getattr(x, 'shape', None)}, router_logits shape: {getattr(router_logits, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, repeating_info: {repeating_info}"
-        # )
@@
-            # 一行打印所有信息
-            # print(
-            #     f"xxi x.shape: {getattr(x, 'shape', None)}, use_all_to_all: {use_all_to_all}, all_rank_num_tokens: {all_rank_num_tokens}, all_rank_num_tokens_padded: {all_rank_num_tokens_padded}, all_rank_max_num_tokens: {all_rank_max_num_tokens}, use_dp_padding: {use_dp_padding}, outputs.shape: {getattr(outputs, 'shape', None)}, use_dp_padding(again): {use_dp_padding}"
-            # )

Also applies to: 810-813

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between e8af58e and 6c62b25.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (13 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; no tabs; indent by 4 spaces

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; no tabs
Keep module namespace on import; import the module, not individual names; use module.symbol
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile)
Global variables use UPPER_SNAKE with G_ prefix (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE
Avoid shadowing variables from outer scopes
Initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with trailing docstrings
Avoid reflection when simple alternatives exist (e.g., avoid dict(**locals()) patterns)
Limit except clauses to specific exceptions; avoid bare except
When duck-typing with try/except, keep try body minimal and use else for logic

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-28T15:21:03.571Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-28T15:21:03.571Z
Learning: Applies to **/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
  • MoEBackend (17-177)
  • MoEBackendSelection (768-814)
  • select_backend (780-814)
  • run_moe (109-177)
  • run_moe (379-461)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • MoE (22-181)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • silu_and_mul_masked_post_quant_fwd (304-375)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


429-429: Line too long (463 > 120)

(E501)


729-729: Line too long (967 > 120)

(E501)


812-812: Line too long (374 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

322-322: Local variable use_deepseek_fp8_block_scale is assigned to but never used

Remove assignment to unused variable use_deepseek_fp8_block_scale

(F841)


326-326: Local variable use_int8_woq_per_channel is assigned to but never used

Remove assignment to unused variable use_int8_woq_per_channel

(F841)


348-348: Line too long (136 > 120)

(E501)


639-639: Line too long (129 > 120)

(E501)


640-640: Line too long (131 > 120)

(E501)


643-643: Line too long (123 > 120)

(E501)


644-644: Line too long (127 > 120)

(E501)


758-758: Line too long (129 > 120)

(E501)


759-759: Line too long (129 > 120)

(E501)

🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

707-726: LGTM: backend invocation is correct and passes runtime-only knobs.

Uses keyword arguments, aligns with the backend API, and unwraps the result later.
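
For illustration, the keyword-style call being approved looks roughly like the hypothetical sketch below; apart from the runtime-only knobs discussed in this review, the parameter names are assumptions, not the module's actual run_moe signature.

result = self.moe_backend_impl.run_moe(
    self,                      # the module, so the backend can read weights and quant flags
    x,                         # hidden states for this chunk
    output_dtype=output_dtype,
    min_latency_mode=False,    # runtime-only knob
    use_fused_finalize=True,   # runtime-only knob
)
# Min-latency mode may return extra tensors; unwrap the final hidden states afterwards.
final_hidden_states = result[0] if isinstance(result, tuple) else result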

Signed-off-by: xxi <[email protected]>

    modified:   tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
    modified:   tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
    modified:   tests/unittest/_torch/modules/test_fused_moe.py
@xxi-nv xxi-nv force-pushed the supportFP8BlockWideEp branch from 6c62b25 to bfc5496 Compare August 29, 2025 05:23
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (8)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)

157-158: Fix positional arg misbinding in finalize_tactic call (use named args).

Passing tuner_top_k positionally can bind into use_fused_finalize in subclass overrides. Bind by name.

-        self.finalize_tactic(module, input, output_dtype, min_latency_mode,
-                             tuner_top_k)
+        self.finalize_tactic(
+            module,
+            input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            tuner_top_k=tuner_top_k,
+        )

441-443: Cutlass.run_moe: pass finalize_tactic args by name (and include use_fused_finalize).

Avoids misbinding and ensures the intended flag is honored during profiling.

-        self.finalize_tactic(module, tuner_input, output_dtype,
-                             min_latency_mode, tuner_top_k)
+        self.finalize_tactic(
+            module,
+            tuner_input,
+            output_dtype,
+            min_latency_mode=min_latency_mode,
+            use_fused_finalize=use_fused_finalize,
+            tuner_top_k=tuner_top_k,
+        )

217-241: Use weight_view_dtype when constructing MoERunner to match runtime .view(...).

The tuner currently receives storage dtype; runtime uses a view dtype (e.g., quint4x2, uint8). Use the same dtype for accurate tactic selection.

         if self.moe_runner is None:
             self.moe_runner = MoERunner(
                 x_dtype=tuner_input.dtype,
-                weight_dtype=module.w3_w1_weight.dtype,
+                weight_dtype=weight_view_dtype,
                 output_dtype=output_dtype,
                 top_k=tuner_top_k,
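
One way to derive weight_view_dtype so the tuner sees the same layout as the runtime .view(...) (the flag-to-dtype mapping below is an assumption; only the attribute names come from the surrounding diffs):

import torch

weight_view_dtype = module.w3_w1_weight.dtype
if getattr(module, 'has_w4afp8', False):
    weight_view_dtype = torch.quint4x2   # packed 4-bit weights viewed at runtime
elif getattr(module, 'has_int8_woq_per_channel', False):
    weight_view_dtype = torch.uint8      # per-channel int8 weight-only quant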

746-752: Routing attr: use experts_per_token, not top_k.

APIs elsewhere expose experts_per_token; using top_k risks AttributeError and inconsistency.

-            module.routing_method.top_k if module else 1,
+            module.routing_method.experts_per_token if module else 1,

11-13: Replace print with logger and import the project logger.

Avoid raw prints in production; use the configured logger.

 from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger
@@
-            print("xxi select backend: use deepgemm backend")
+            logger.debug("MoEBackendSelection: using DeepGemm backend")
             return MoEDeepGemmBackend()
         else:
-            print("xxi select backend: use cutlass backend")
-            return MoECutlassBackend()
+            logger.debug("MoEBackendSelection: using Cutlass backend")
+            return MoECutlassBackend()

Also applies to: 806-814

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

8-8: Fix logger import; _utils doesn’t export logger.

Import logger from tensorrt_llm.logger.

-from tensorrt_llm._utils import get_sm_version, logger
+from tensorrt_llm._utils import get_sm_version
+from tensorrt_llm.logger import logger

311-325: Replace debug prints with logger.debug.

No raw prints in production paths; keep messages but gate via logger.

-            print(
-                f"can not use alltoall due to chunking {self.calculate_num_chunks(all_rank_num_tokens)}"
-            )
+            logger.debug(
+                "Cannot use alltoall due to chunking %d",
+                self.calculate_num_chunks(all_rank_num_tokens),
+            )
@@
-            print(
-                f"can not use alltoall due to deep_ep_max_num_tokens {all_rank_max_num_tokens} > {self.deep_ep_max_num_tokens}"
-            )
+            logger.debug(
+                "Cannot use alltoall: deep_ep_max_num_tokens %d > %d",
+                all_rank_max_num_tokens,
+                self.deep_ep_max_num_tokens,
+            )
@@
-        print(f"all to all type {self.alltoall_method_type}")
+        logger.debug("Alltoall type: %s", self.alltoall_method_type)

333-345: _get_quant_method: replace prints with logger.debug.

Keeps diagnostics without spamming stdout.

-                print(
-                    f"wide_ep _get_quant_method: get_sm_version()={get_sm_version()}"
-                )
+                logger.debug("wide_ep _get_quant_method: sm=%s", get_sm_version())
                 if get_sm_version() == 100:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm()
                 else:
-                    print(
-                        f"wide_ep _get_quant_method: use DeepSeekFP8BlockScalesFusedMoEMethod"
-                    )
+                    logger.debug(
+                        "wide_ep _get_quant_method: DeepSeekFP8BlockScalesFusedMoEMethod"
+                    )
                     return DeepSeekFP8BlockScalesFusedMoEMethod()
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (2)

317-333: Remove unused locals in Cutlass.compute_moe.

These assignments aren’t used (Ruff F841) and add noise.

-        getattr(module, 'tune_max_num_tokens', 8192)
+        # no-op: tune_max_num_tokens handled during finalize_tactic

-        use_deepseek_fp8_block_scale = getattr(module,
-                                               'has_deepseek_fp8_block_scales',
-                                               False)
         use_w4_group_scaling = getattr(module, 'has_w4afp8', False)
-        use_int8_woq_per_channel = getattr(module, 'has_int8_woq_per_channel',
-                                           False)

347-349: Split long ternary for readability and linter E501.

-        run_moe = self.moe_runner.fused_moe_runner.run_moe_min_latency if min_latency_mode else self.moe_runner.fused_moe_runner.run_moe
+        run_moe = (
+            self.moe_runner.fused_moe_runner.run_moe_min_latency
+            if min_latency_mode
+            else self.moe_runner.fused_moe_runner.run_moe
+        )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 6c62b25 and bfc5496.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (13 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (1 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/moe_backend.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (5)
  • MoEBackend (17-177)
  • MoEBackendSelection (768-814)
  • select_backend (780-814)
  • run_moe (109-177)
  • run_moe (379-461)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
  • DeepSeekFP8BlockScalesFusedMoEMethod (604-737)
  • DeepSeekFP8BlockScalesFusedMoEMethodDeepGemm (740-781)
tests/unittest/_torch/modules/test_fused_moe.py (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • AlltoallMethodType (29-37)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
  • DefaultMoeRoutingMethod (184-210)
tensorrt_llm/mapping.py (1)
  • Mapping (32-513)
tests/unittest/_torch/helpers.py (1)
  • per_block_cast_to_fp8_e8m0 (55-68)
tensorrt_llm/_torch/modules/fused_moe/moe_backend.py (4)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • MoE (22-181)
  • has_deepseek_fp8_block_scales (127-130)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • MoERunner (23-114)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • enable_alltoall (298-301)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (5)
  • deepgemm_fp8_group_blockwise_gemm (246-284)
  • masked_index_copy_group_quant_fp8 (89-160)
  • preprocess_after_permute (233-242)
  • set_strides (287-293)
  • triton_masked_index_gather (195-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

320-320: Line too long (127 > 120)

(E501)


429-429: Line too long (463 > 120)

(E501)


729-729: Line too long (967 > 120)

(E501)


812-812: Line too long (374 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/moe_backend.py

322-322: Local variable use_deepseek_fp8_block_scale is assigned to but never used

Remove assignment to unused variable use_deepseek_fp8_block_scale

(F841)


326-326: Local variable use_int8_woq_per_channel is assigned to but never used

Remove assignment to unused variable use_int8_woq_per_channel

(F841)


348-348: Line too long (136 > 120)

(E501)


639-639: Line too long (129 > 120)

(E501)


640-640: Line too long (131 > 120)

(E501)


643-643: Line too long (123 > 120)

(E501)


644-644: Line too long (127 > 120)

(E501)


758-758: Line too long (129 > 120)

(E501)


759-759: Line too long (129 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

368-379: LGTM: lazy backend initialization is clean.

Property-based selection after weights exist avoids premature capability checks.

tests/unittest/_torch/modules/test_fused_moe.py (2)

520-664: LGTM: FP8 blockwise WideEP test adds valuable coverage.

Good per-rank setup, deterministic tensors, and DeepGemm reference comparison with sane tolerances.
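
A bare-bones version of the pattern being praised (seed, shapes, tolerances, and the two helper names are illustrative only, not the test's actual values):

import torch

torch.manual_seed(0)                     # deterministic per-rank inputs
x = torch.randn(128, 7168, dtype=torch.bfloat16, device="cuda")

ref = reference_fp8_blockwise_moe(x)     # hypothetical DeepGemm reference path
out = wide_ep_moe_under_test(x)          # hypothetical module under test

torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)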


477-518: LGTM: helper patterns are straightforward and reusable.

@xxi-nv xxi-nv closed this Aug 29, 2025