Conversation

@moraxu moraxu commented Aug 19, 2025

Description

[https://nvbugs/5463720][fix]: Naively handle per-layer MLP dimensions for Nemotron 49B

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
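For example, a hypothetical invocation combining a few of the flags documented above (the stage name is taken from the help examples, not from this PR):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"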

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Added optional coarse MLP hidden-size in model configs; LoRA config can accept it and use it to size/pad adapter weights across tensor-parallel shards.
    • LoRA loading now honors per-executor runtime mapping when present.
  • Chores

    • Increased debug/runtime logging to surface model, mapping and LoRA sizing/extraction details.
  • Tests

    • Added an integration test exercising a real LoRA adapter on Nemotron Super 49B (duplicated).

Signed-off-by: Michal Guzek <[email protected]>

coderabbitai bot commented Aug 19, 2025

📝 Walkthrough

Walkthrough

Adds a new ModelConfig property, coarse_mlp_hidden_size, with Nemotron-specific inference and logging; threads that optional value through pyexecutor to LoraModelConfig; updates LoRA padding to use coarse MLP sizing when provided; adds C++ LoRA validation logs; and introduces a duplicated integration test for Nemotron Super 49B with a real LoRA adapter.
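As a reading aid, here is a minimal, self-contained sketch of the inference logic described above. The names (pretrained_config, tp_size, _infer_nemotron_ffn_mult, the DeciLM architecture check) follow this walkthrough, but the real property in tensorrt_llm/_torch/model_config.py may differ in structure and detail:

class SketchModelConfig:
    """Illustrative stand-in for ModelConfig; not the actual class."""

    def __init__(self, pretrained_config, tp_size):
        self.pretrained_config = pretrained_config
        self.tp_size = tp_size

    @property
    def coarse_mlp_hidden_size(self):
        cfg = self.pretrained_config
        if getattr(cfg, "intermediate_size", None) is not None:
            # Uniform MLP dimensions: return the TP-split intermediate size.
            return cfg.intermediate_size // self.tp_size
        if getattr(cfg, "architectures", None) == ["DeciLMForCausalLM"]:
            # Nemotron/DeciLM: derive a coarse size from per-layer ffn_mult values.
            return self._infer_nemotron_ffn_mult()
        # Mirrors the "raises / None" branch in the sequence diagram below.
        raise ValueError("coarse_mlp_hidden_size is not supported for this architecture")

    def _infer_nemotron_ffn_mult(self):
        # Placeholder for the helper that builds an explicit list of ffn_mults.
        raise NotImplementedError

# Example with a hypothetical config exposing intermediate_size:
class _Cfg:
    intermediate_size = 36864

print(SketchModelConfig(_Cfg(), tp_size=4).coarse_mlp_hidden_size)  # 9216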

Changes

  • ModelConfig: coarse MLP inference & logging (tensorrt_llm/_torch/model_config.py): Adds a public property coarse_mlp_hidden_size that returns pretrained_config.intermediate_size // tp_size or infers the Nemotron FFN multiplier via _infer_nemotron_ffn_mult(); refactors _infer_nemotron_ffn_mult() to build an explicit list of ffn_mults; adds extensive debug logging.
  • PyExecutor: extract coarse size and pass to LoraModelConfig (tensorrt_llm/_torch/pyexecutor/model_engine.py): Reads coarse_mlp_hidden_size (if present) from self.model.model_config with debug prints and safe fallbacks; passes the value (or None) into the LoraModelConfig constructor.
  • LoRA config & padding logic (tensorrt_llm/lora_manager.py): Adds coarse_mlp_hidden_size: Optional[int] = None to LoraModelConfig; when provided, computes a TP-aware max_weight_size from the coarse MLP size, tp_size, rank, and hidden size, and pads CPU LoRA weights to that size before stacking (see the sketch after this list); preserves the previous padding when absent; adds debug logs.
  • C++ LoRA validation logging (cpp/tensorrt_llm/runtime/loraUtils.cpp): Adds a debug log showing expected vs. actual flattened LoRA dimension and related parameters during loraValidateRequestTensors; no behavior change.
  • Resource/PEFT init changes (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Removes the model_engine-based branch for populating _lora_model_config; always constructs LoraModelConfig directly with dtype binding; adds a debug print for py_lora_path in add_request_peft.
  • Executor LoRA load mapping (tensorrt_llm/executor/worker.py): Builds a runtime_mapping from executor_config.mapping (world_size, tp/pp/cp sizes, rank, gpus_per_node) and passes it to _lora_manager.load_from_ckpt; requires a Mapping constructor accepting world_size.
  • PyExecutor util, non-functional comment (tensorrt_llm/_torch/pyexecutor/_util.py): Inserts a comment line before the LoraModule.create_lora_modules call; no functional change.
  • Integration test, duplicated (tests/integration/defs/examples/test_nemotron_nas.py): Adds test_nemotron_super_49b_real_lora_torch, which loads a real LoRA adapter for Nemotron Super 49B and runs generate with a LoRARequest; the function appears twice (duplicate).
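As referenced in the lora_manager.py item above, here is a minimal, self-contained sketch of the padding step: flattened per-module LoRA weights are padded to a common upper bound and stacked. The helper name, the rank * (M + H) bound, and all sizes are illustrative assumptions; the PR's actual code derives the bound from coarse_mlp_hidden_size, tp_size, rank, and hidden size (see the review discussion below).

import torch
import torch.nn.functional as F

def pad_and_stack_lora_weights(weights, rank, hidden_size, coarse_mlp_hidden_size=None):
    """Pad flattened per-module LoRA weights to a common size, then stack them."""
    if coarse_mlp_hidden_size is not None:
        # Upper-bound the flattened in/out size; never pad below the observed maximum.
        max_size = max(rank * (coarse_mlp_hidden_size + hidden_size),
                       max(w.size(0) for w in weights))
    else:
        # Previous behavior: pad to the largest weight actually present.
        max_size = max(w.size(0) for w in weights)
    return torch.stack([F.pad(w, (0, max_size - w.size(0))) for w in weights])

# Hypothetical flattened module weights (sizes are illustrative only).
weights = [torch.randn(4096), torch.randn(6144)]
stacked = pad_and_stack_lora_weights(weights, rank=8, hidden_size=256, coarse_mlp_hidden_size=512)
print(stacked.shape)  # torch.Size([2, 6144])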

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Caller
  participant Engine as PyExecutor.model_engine
  participant Config as ModelConfig
  participant LoraCfg as LoraModelConfig
  participant LoraMgr as lora_manager
  participant CPP as loraUtils.cpp

  Dev->>Engine: instantiate executor
  Engine->>Config: access model_config.coarse_mlp_hidden_size
  alt pretrained_config.intermediate_size present
    Config-->>Engine: coarse_mlp_hidden_size = intermediate_size // tp_size
  else architectures == ["DeciLMForCausalLM"]
    Config-->>Engine: coarse_mlp_hidden_size = _infer_nemotron_ffn_mult()
  else
    Config-->>Engine: raises / None (handled by Engine)
  end
  Engine->>LoraCfg: construct LoraModelConfig(..., coarse_mlp_hidden_size)
  LoraCfg->>LoraMgr: load LoRA weights (CPU path)
  alt coarse_mlp_hidden_size provided
    LoraMgr->>LoraMgr: M_coarse_tp = coarse_mlp_hidden_size * tp_size
    LoraMgr->>LoraMgr: max_weight_size = r*M_coarse_tp + r*H
    LoraMgr->>LoraMgr: pad weights -> max_weight_size
  else not provided
    LoraMgr->>LoraMgr: pad to max existing weight size (old path)
  end
  LoraMgr->>CPP: loraValidateRequestTensors (logs expected vs actual)
  CPP-->>LoraMgr: validation result
  LoraMgr-->>Engine: loaded LoRA modules
  Engine-->>Dev: executor ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • yilin-void
  • syuoni
  • shaharmor98
  • Naveassaf
  • litaotju
  • amitz-nv



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
tensorrt_llm/_torch/model_config.py (1)

456-459: Make mlp_hidden_size_per_layer available even for uniform MLP dims

Downstream code reads mlp_hidden_size_per_layer. Setting it in the scalar path avoids None/empty semantics and keeps the binding contract consistent.

Apply this diff:

             else:
                 # Uniform MLP dimensions across all layers
                 mlp_hidden_size = self.pretrained_config.intermediate_size // self.mapping.tp_size
+                # Populate per-layer vector for consistency with variable-MLP models
+                model_config_cpp.mlp_hidden_size_per_layer = [
+                    mlp_hidden_size
+                ] * self.pretrained_config.num_hidden_layers
tests/integration/defs/examples/test_nemotron_nas.py (1)

173-177: Nit: shorten long lines and keep messages readable

Wrap arguments to stay within 120 characters and improve readability (Ruff E501).

Apply this diff:

-        sampling_params = SamplingParams(max_tokens=50,
-                                         temperature=0.7,
-                                         top_p=0.9)
+        sampling_params = SamplingParams(
+            max_tokens=50,
+            temperature=0.7,
+            top_p=0.9,
+        )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e76e5c6 and abf1402.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tests/integration/defs/examples/test_nemotron_nas.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/model_config.py
  • tests/integration/defs/examples/test_nemotron_nas.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/model_config.py
  • tests/integration/defs/examples/test_nemotron_nas.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/examples/test_nemotron_nas.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/examples/test_nemotron_nas.py
🪛 Ruff (0.12.2)
tests/integration/defs/examples/test_nemotron_nas.py

146-146: Line too long (125 > 120)

(E501)

tensorrt_llm/_torch/pyexecutor/_util.py

474-474: Line too long (128 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

482-491: Bindings include tp_size & num_experts – API compatibility confirmed

I’ve verified that both the pybind and nanobind registrations for LoraModule.create_lora_modules declare the new parameters with defaults:

  • cpp/tensorrt_llm/pybind/bindings.cpp (lines 213–216):
    .def_static("create_lora_modules", …, py::arg("tp_size") = 1, py::arg("num_experts") = 0);
  • cpp/tensorrt_llm/nanobind/bindings.cpp (lines 221–224):
    .def_static("create_lora_modules", …, nb::arg("tp_size") = 1, nb::arg("num_experts") = 0);

These signatures match the Python call site in tensorrt_llm/_torch/pyexecutor/_util.py. No further binding updates are necessary.

tests/integration/defs/examples/test_nemotron_nas.py (1)

133-199: No duplicate test definitions found
I searched the entire repository for def test_nemotron_super_49b_real_lora_torch and only one occurrence exists, so there’s no shadowing or redundant runs. All pytest markers and fixtures can remain as-is.

@moraxu moraxu commented Aug 19, 2025

I mixed up config fields, this won't work. Closing

@moraxu moraxu closed this Aug 19, 2025
@moraxu moraxu reopened this Aug 21, 2025
Signed-off-by: Michal Guzek <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
tensorrt_llm/lora_manager.py (1)

1216-1217: Trim excessive debug prints; emit one-liners via logger.debug.

The surrounding print bursts (final max size, stacked shapes) are excessive for hot paths. Use concise logger.debug; avoid per-weight logs.

Apply this diff:

-            print(f"DEBUG: load_from_hf - Final max_weight_size: {max_weight_size}")
+            logger.debug("LoRA padding | final_max_weight_size=%d", max_weight_size)
...
-            print(
-                f"DEBUG: load_from_hf - Final stacked weights shape: {self._cpp_lora_weights[uid].shape}"
-            )
+            logger.debug("LoRA padding | stacked_weights_shape=%s", tuple(self._cpp_lora_weights[uid].shape))
...
-            print(
-                f"DEBUG: load_from_hf - Final stacked config shape: {self._cpp_lora_config[uid].shape}"
-            )
+            logger.debug("LoRA padding | stacked_config_shape=%s", tuple(self._cpp_lora_config[uid].shape))

Also applies to: 1229-1237

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between abf1402 and 90b369c.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/runtime/loraUtils.cpp (1 hunks)
  • tensorrt_llm/_torch/model_config.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/lora_manager.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/_torch/pyexecutor/_util.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/model_config.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/lora_manager.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/lora_manager.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
tensorrt_llm/_torch/model_config.py (2)
  • coarse_mlp_hidden_size (541-556)
  • torch_dtype (147-152)
tensorrt_llm/lora_manager.py (1)
  • LoraModelConfig (241-246)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • config (500-501)
tensorrt_llm/_utils.py (3)
  • dtype (935-936)
  • dtype (943-953)
  • torch_dtype_to_str (218-219)
tensorrt_llm/lora_manager.py (2)
tensorrt_llm/_torch/model_config.py (1)
  • coarse_mlp_hidden_size (541-556)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (46-47)
🔇 Additional comments (2)
tensorrt_llm/lora_manager.py (1)

246-247: Adding coarse_mlp_hidden_size to LoraModelConfig is appropriate.

Cleanly threads the optional MLP-size hint into the LoRA loading logic. No issues here.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

496-498: LGTM: dtype now passed as string via torch_dtype_to_str.

This aligns with downstream str_dtype_to_torch usage in lora_manager and eliminates dtype mismatches.

Comment on lines 108 to +115
TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
"lora_weights has to few values for " + moduleName);

auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
auto actualSize = weights->getShape().d[2];
TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);


🛠️ Refactor suggestion

Logging after the check makes debugging impossible on failure; move the log before the check and reuse the computed sizes. Also fix the typo "to few" → "too few".

If the size check fails, the debug line never emits. Compute expected/actual once, log, then assert using the computed values. This also fixes the user-facing message grammar.

Apply this diff:

-            TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
-                "lora_weights has to few values for " + moduleName);
-
-            auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
-            auto actualSize = weights->getShape().d[2];
-            TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
-                moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);
+            auto const expectedSize = it->flattenedInOutSize(adapterSize, isDora);
+            auto const actualSize = weights->getShape().d[2];
+            TLLM_LOG_DEBUG(
+                "LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
+                moduleName.c_str(), expectedSize, actualSize, adapterSize, static_cast<int>(isDora));
+            TLLM_CHECK_WITH_INFO(expectedSize <= actualSize,
+                "lora_weights has too few values for " + moduleName);
📝 Committable suggestion


Suggested change
TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
"lora_weights has to few values for " + moduleName);
auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
auto actualSize = weights->getShape().d[2];
TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);
auto const expectedSize = it->flattenedInOutSize(adapterSize, isDora);
auto const actualSize = weights->getShape().d[2];
TLLM_LOG_DEBUG(
"LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
moduleName.c_str(), expectedSize, actualSize, adapterSize, static_cast<int>(isDora));
TLLM_CHECK_WITH_INFO(expectedSize <= actualSize,
"lora_weights has too few values for " + moduleName);
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/runtime/loraUtils.cpp around lines 108 to 115, the debug log
is placed after the size check so it won't run on failure and the check message
contains a typo; compute expectedSize and actualSize once, move the
TLLM_LOG_DEBUG call to before TLLM_CHECK_WITH_INFO and pass the computed
expectedSize and actualSize (and adapterSize/isDora) to it, and change the
TLLM_CHECK_WITH_INFO message text from "to few" to "too few" while using the
already computed expectedSize/actualSize values in the check.

Comment on lines +464 to +491
coarse_mlp_hidden_size = None

# Debug: Check what type self.model.model_config is
print(
f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
)
print(
f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
)

if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
)
else:
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
)
# Try direct access to see if it works
try:
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
)
except AttributeError as e:
print(f"DEBUG: model_engine.py - Direct access failed: {e}")


🛠️ Refactor suggestion

⚠️ Potential issue

Avoid hasattr on properties that may raise; replace prints with robust getattr + logger.debug.

coarse_mlp_hidden_size is a property that can raise ValueError for unsupported architectures. hasattr will evaluate the property and won’t catch ValueError, causing unexpected failures. Use getattr in a try/except, and use logger.debug instead of print.
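A small, self-contained demonstration of the pitfall described above (not project code): hasattr only swallows AttributeError, so a property that raises ValueError propagates the exception instead of returning False.

class BrokenConfig:
    @property
    def coarse_mlp_hidden_size(self):
        # Stand-in for the unsupported-architecture case.
        raise ValueError("unsupported architecture")

try:
    hasattr(BrokenConfig(), "coarse_mlp_hidden_size")
except ValueError as exc:
    print(f"hasattr propagated: {exc}")  # hasattr propagated: unsupported architecture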

Apply this diff:

-        coarse_mlp_hidden_size = None
-
-        # Debug: Check what type self.model.model_config is
-        print(
-            f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
-        )
-        print(
-            f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
-        )
-
-        if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
-            coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
-            print(
-                f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
-            )
-        else:
-            print(
-                f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
-            )
-            # Try direct access to see if it works
-            try:
-                coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
-                print(
-                    f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
-                )
-            except AttributeError as e:
-                print(f"DEBUG: model_engine.py - Direct access failed: {e}")
+        coarse_mlp_hidden_size = None
+        try:
+            coarse_mlp_hidden_size = getattr(self.model.model_config, 'coarse_mlp_hidden_size', None)
+            logger.debug("model_engine | coarse_mlp_hidden_size=%s", str(coarse_mlp_hidden_size))
+        except Exception as e:
+            # Property exists but inference failed; proceed without it.
+            logger.debug("model_engine | coarse_mlp_hidden_size unavailable: %s", e)
📝 Committable suggestion


Suggested change
coarse_mlp_hidden_size = None
# Debug: Check what type self.model.model_config is
print(
f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
)
print(
f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
)
if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
)
else:
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
)
# Try direct access to see if it works
try:
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
)
except AttributeError as e:
print(f"DEBUG: model_engine.py - Direct access failed: {e}")
coarse_mlp_hidden_size = None
try:
coarse_mlp_hidden_size = getattr(self.model.model_config, 'coarse_mlp_hidden_size', None)
logger.debug("model_engine | coarse_mlp_hidden_size=%s", str(coarse_mlp_hidden_size))
except Exception as e:
# Property exists but inference failed; proceed without it.
logger.debug("model_engine | coarse_mlp_hidden_size unavailable: %s", e)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 464 to 491,
replace the current hasattr/print debug block with a single try/except that
reads coarse_mlp_hidden_size once via getattr (or direct attribute access) and
catches both ValueError and AttributeError so property access that raises is
handled; on success assign the value to coarse_mlp_hidden_size and call
logger.debug with the value, on exception set coarse_mlp_hidden_size = None and
log the failure with logger.debug (do not use print), ensuring you do not
evaluate the property twice and you handle both kinds of errors.

Comment on lines +1137 to +1148
# Handle both ModelConfig and LoraModelConfig types
print(f"DEBUG: model_config type: {type(model_config)}")
print(
f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
)


💡 Verification agent

🧩 Analysis chain

Replace print-based debugging with logger.debug and trim verbosity.

Using print in library code is noisy and hard to control. Prefer logger.debug (already configured as logger = logging.getLogger(name)). Collapse multiple prints into a single succinct line.

Apply this diff:

-            print(f"DEBUG: model_config type: {type(model_config)}")
-            print(
-                f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
-            )
+            logger.debug(
+                "LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
+                type(model_config),
+                hasattr(model_config, "coarse_mlp_hidden_size"),
+                getattr(model_config, "coarse_mlp_hidden_size", None),
+            )

Run to confirm no stray prints remain in lora_manager/model_engine:


🏁 Script executed:

#!/bin/bash
rg -nP '^\s*print\(' tensorrt_llm/lora_manager.py tensorrt_llm/_torch/pyexecutor/model_engine.py

Length of output: 3027


Mandate: Replace all print-based debugging with logger.debug across Lora Manager and Model Engine

Your grep shows numerous stray print(...) calls—both in tensorrt_llm/lora_manager.py and in tensorrt_llm/_torch/pyexecutor/model_engine.py. These should all be converted to controlled, leveled logging rather than uncontrolled stdout.

• Files and ranges needing refactor:

  • tensorrt_llm/lora_manager.py: lines ~1138–1154, 1162–1235
  • tensorrt_llm/_torch/pyexecutor/model_engine.py: lines ~467–490

• Primary change (example for LoraManager snippet at 1137–1148):

-            print(f"DEBUG: model_config type: {type(model_config)}")
-            print(
-                f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
-            )
+            logger.debug(
+                "LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
+                type(model_config),
+                hasattr(model_config, "coarse_mlp_hidden_size"),
+                getattr(model_config, "coarse_mlp_hidden_size", None),
+            )

Apply the same pattern to all other debug prints:

  • Consolidate related prints into single logger.debug calls with structured arguments.
  • Ensure you import and use the module logger (logger = logging.getLogger(__name__)).
  • Remove any remaining print(...) debug statements.
📝 Committable suggestion


Suggested change
# Handle both ModelConfig and LoraModelConfig types
print(f"DEBUG: model_config type: {type(model_config)}")
print(
f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
)
# Handle both ModelConfig and LoraModelConfig types
logger.debug(
"LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
type(model_config),
hasattr(model_config, "coarse_mlp_hidden_size"),
getattr(model_config, "coarse_mlp_hidden_size", None),
)
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1137 to 1148 (and extend the same
change across the other indicated ranges ~1162–1235) replace the ad-hoc
print(...) debug statements with structured logger.debug calls: add or ensure a
module-level logger = logging.getLogger(__name__) import logging at top if
missing, consolidate related prints into single logger.debug invocations with
clear messages and variables (e.g., logger.debug("model_config type=%s
coarse_mlp_hidden_size=%s is_none=%s", type(model_config), getattr(model_config,
"coarse_mlp_hidden_size", None), getattr(model_config, "coarse_mlp_hidden_size",
None) is None)), and remove all remaining print(...) calls; apply the identical
replacement pattern to tensorrt_llm/_torch/pyexecutor/model_engine.py lines
~467–490 so no print-based debugging remains.

Comment on lines +1167 to +1169
M_coarse_tp = M_coarse * tp_size
max_weight_size = rank * M_coarse_tp + rank * H


⚠️ Potential issue

TP double-counting in max_weight_size; coarse_mlp_hidden_size is already TP-split.

coarse_mlp_hidden_size returns the TP-split M value. Multiplying by tp_size again inflates padding by tp_size, wasting memory significantly (especially on 49B). Compute with the TP-split value directly.
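For a rough sense of scale (hypothetical numbers, not taken from this PR): with rank r = 64, hidden size H = 8192, TP-split M_coarse = 9216, and tp_size = 4, the doubled formula pads each module to 64 * (9216 * 4) + 64 * 8192 = 2,883,584 values, whereas 64 * (9216 + 8192) = 1,114,112, roughly a 2.6x inflation per adapter module.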

Apply this diff:

-                M_coarse_tp = M_coarse * tp_size
-                max_weight_size = rank * M_coarse_tp + rank * H
+                # coarse_mlp_hidden_size is already TP-split; do not multiply by tp_size again.
+                M_coarse_tp = M_coarse
+                # For MLP up/gate or down, flattened size per module is rank*(M_coarse_tp + H).
+                # This upper-bounds attention modules too since M_coarse_tp >= H in typical configs.
+                calc_max_weight_size = rank * (M_coarse_tp + H)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1167 to 1169, the code multiplies
the TP-split coarse_mlp_hidden_size (M_coarse) by tp_size (M_coarse_tp) and then
uses that to compute max_weight_size, which double-counts tensor-parallel
splitting and inflates padding; remove the extra multiplication and compute
max_weight_size using the TP-split value directly (drop M_coarse_tp and use
M_coarse in the expression so max_weight_size = rank * M_coarse + rank * H).

Comment on lines +1186 to +1191
actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
print(
f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
)

⚠️ Potential issue

Guard against negative padding; ensure computed max covers actual max.

If calc_max_weight_size underestimates (e.g., DoRA adds a magnitude vector), padding_needed becomes negative and F.pad will error. Use the actual maximum size as a floor.

Apply this diff:

-                actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
-                print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
-                print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
-                print(
-                    f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
-                )
+                actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
+                # Account for possible DoRA magnitude vector (+M_coarse_tp) by taking the max with actual.
+                max_weight_size = max(calc_max_weight_size, actual_max_weight_size)
+                logger.debug(
+                    "LoRA padding | calc=%d actual=%d final=%d",
+                    calc_max_weight_size, actual_max_weight_size, max_weight_size
+                )
📝 Committable suggestion


Suggested change
actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
print(
f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
)
actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
# Account for possible DoRA magnitude vector (+M_coarse_tp) by taking the max with actual.
max_weight_size = max(calc_max_weight_size, actual_max_weight_size)
logger.debug(
"LoRA padding | calc=%d actual=%d final=%d",
calc_max_weight_size, actual_max_weight_size, max_weight_size
)
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1186 to 1191, the computed
max_weight_size can be less than the actual_max_weight_size causing negative
padding_needed and F.pad to fail; before padding, ensure max_weight_size is
floored to the observed actual maximum (e.g., max_weight_size =
max(max_weight_size, actual_max_weight_size)) and/or compute padding_needed =
max(0, max_weight_size - w.size(0)) so padding_needed is never negative.

Comment on lines +1219 to +1228
padded_weights = []
for i, w in enumerate(self._cpp_lora_weights[uid]):
padding_needed = max_weight_size - w.size(0)
print(
f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
)
padded_w = torch.nn.functional.pad(w, (0, padding_needed))
print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
padded_weights.append(padded_w)


🛠️ Refactor suggestion

Safer padding: clamp to zero and avoid per-weight print loops.

Even with the fix above, clamping is a safe guard and avoids crash on rare edge-cases. Replace prints with a single debug if needed.

Apply this diff:

-            padded_weights = []
-            for i, w in enumerate(self._cpp_lora_weights[uid]):
-                padding_needed = max_weight_size - w.size(0)
-                print(
-                    f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
-                )
-                padded_w = torch.nn.functional.pad(w, (0, padding_needed))
-                print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
-                padded_weights.append(padded_w)
+            padded_weights = []
+            for i, w in enumerate(self._cpp_lora_weights[uid]):
+                padding_needed = max_weight_size - w.size(0)
+                if padding_needed < 0:
+                    # Should not happen with the guard above; fall back defensively.
+                    logger.warning("LoRA padding | negative padding=%d at idx=%d; widening to actual size", padding_needed, i)
+                    padding_needed = 0
+                padded_weights.append(torch.nn.functional.pad(w, (0, padding_needed)))
📝 Committable suggestion


Suggested change
padded_weights = []
for i, w in enumerate(self._cpp_lora_weights[uid]):
padding_needed = max_weight_size - w.size(0)
print(
f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
)
padded_w = torch.nn.functional.pad(w, (0, padding_needed))
print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
padded_weights.append(padded_w)
padded_weights = []
for i, w in enumerate(self._cpp_lora_weights[uid]):
padding_needed = max_weight_size - w.size(0)
if padding_needed < 0:
# Should not happen with the guard above; fall back defensively.
logger.warning(
"LoRA padding | negative padding=%d at idx=%d; widening to actual size",
padding_needed,
i,
)
padding_needed = 0
padded_weights.append(torch.nn.functional.pad(w, (0, padding_needed)))
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1219 to 1228, the per-weight debug
prints and unguarded padding can cause noisy logs and crashes when
padding_needed is negative; change padding_needed to be clamped to zero (e.g.
use max(padding_needed, 0) or torch.clamp_min) before calling pad, remove the
per-weight print statements inside the loop, and instead emit a single debug
summarizing counts/sizes after the loop (or conditionally log once when any
padding occurred). Ensure padded_weights.append is still executed and behavior
unchanged when padding_needed is zero.

Signed-off-by: Michal Guzek <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1047-1051: Include coarse_mlp_hidden_size in LoraModelConfig before passing it

We verified that LoraModelConfig is declared as a @dataclass without a coarse_mlp_hidden_size field, so the proposed diff in resource_manager.py would break the constructor call. To support Nemotron-49B’s per-layer MLP sizing, first extend LoraModelConfig, then pass the new argument:

• In tensorrt_llm/lora_manager.py, add the field to the dataclass:

@@ class LoraModelConfig:
     lora_target_modules: list[str]
     trtllm_modules_to_hf_modules: dict[str, str]
     hidden_size: int
     data_type: str
+    coarse_mlp_hidden_size: Optional[int] = None

• In tensorrt_llm/_torch/pyexecutor/resource_manager.py, instantiate with the new parameter:

@@ resource_manager.py:1047
-        self._lora_model_config = LoraModelConfig(
-            lora_config.lora_target_modules,
-            lora_config.trtllm_modules_to_hf_modules,
-            model_config.hidden_size,
-            binding_to_str_dtype(model_config.data_type))
+        coarse_mlp_hidden_size = getattr(model_config, "coarse_mlp_hidden_size", None)
+        self._lora_model_config = LoraModelConfig(
+            lora_config.lora_target_modules,
+            lora_config.trtllm_modules_to_hf_modules,
+            model_config.hidden_size,
+            binding_to_str_dtype(model_config.data_type),
+            coarse_mlp_hidden_size=coarse_mlp_hidden_size,
+        )

These changes ensure the constructor accepts and stores coarse_mlp_hidden_size for correct LoRA sizing on Nemotron-49B.

🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)

1040-1047: Remove dead commented code referencing model_engine.

This block references model_engine that isn’t in scope for this initializer and adds noise. Please delete instead of commenting out.

-        # if model_engine is not None and hasattr(model_engine, "lora_model_config"):
-        #     self._lora_model_config = model_engine.lora_model_config
-        # else:
-        #     self._lora_model_config = LoraModelConfig(
-        #         lora_config.lora_target_modules,
-        #         lora_config.trtllm_modules_to_hf_modules, model_config.hidden_size,
-        #         binding_to_str_dtype(model_config.data_type))

1062-1064: Replace print with logger.debug to avoid noisy stdout and respect log levels.

Library code shouldn’t use print. Use the existing logger with debug level; include task id for traceability.

-                print(
-                    f"DEBUG: INSIDE add_request_peft: request.py_lora_path: {request.py_lora_path}"
-                )
+                logger.debug(
+                    f"add_request_peft: loading LoRA from py_lora_path='{request.py_lora_path}' "
+                    f"for task_id={request.lora_task_id}"
+                )

1-1: Add NVIDIA copyright header (2025).

Per coding guidelines, prepend the current-year NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import copy
tensorrt_llm/executor/worker.py (2)

361-361: Local import is fine; consider module-level import only if no circular deps.

Keeping the import inside the function avoids import-time cycles and is acceptable here, since LoRA loading isn’t in the hot path. If there’s no cycle, moving it to the top improves discoverability.


1-1: Missing NVIDIA copyright header.

Per repo guidelines, prepend the current-year NVIDIA header to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 90b369c and 794d28b.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (2)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Weight-Decomposed Low-Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🔇 Additional comments (3)
tensorrt_llm/executor/worker.py (3)

379-386: Only pass runtime_mapping when fully specified; otherwise defer to loader defaults.

The current call always forwards a runtime_mapping object (potentially empty or malformed). To avoid surprising behavior, pass None when you don’t have a trustworthy mapping so LoraManager.load_from_ckpt can apply its internal defaults.

If you apply the previous diff (returning None in the fallback), this call site is already compatible. Please confirm that load_from_ckpt(..., runtime_mapping=None) preserves prior behavior across TRT and PyTorch backends.


399-407: Known LoRA CPU cache race remains; this change doesn’t address it.

Per our prior learning on this file, the is_adapter_in_cpu_cache() path has a race that can’t be solved with simple guards. Your new runtime mapping doesn’t worsen it, but also doesn’t mitigate it. Just calling it out to avoid false confidence. If you want, we can scope a targeted fix separately.


362-375: Mapping and LoraManager Signatures Verified – All Good

  • The Mapping.__init__ signature in tensorrt_llm/mapping.py begins with
    def __init__(self, world_size=1, rank=0, gpus_per_node=8, *, cp_size=1, cp_config=None, tp_size=1, pp_size=1, ...)
    confirming that the world_size keyword argument is supported and all of the existing default‐ctor parameters remain unchanged.
  • The LoraManager.load_from_ckpt method in tensorrt_llm/lora_manager.py is declared as
    def load_from_ckpt(..., runtime_mapping: Optional[Mapping] = None, ...)
    so it fully accepts a runtime_mapping argument without breaking existing behavior.

No further changes are needed for the constructor or loader signatures.
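A minimal usage sketch based on the signature quoted above; the 2-way TP x 2-way PP layout on a single 8-GPU node is an illustrative assumption, not this PR's configuration:

from tensorrt_llm.mapping import Mapping

runtime_mapping = Mapping(
    world_size=2 * 2 * 1,  # tp_size * pp_size * cp_size
    rank=0,
    gpus_per_node=8,
    tp_size=2,
    pp_size=2,
    cp_size=1,
)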

Comment on lines +362 to +377
if hasattr(self._executor_config,
"mapping") and self._executor_config.mapping is not None:
mapping = self._executor_config.mapping
# Calculate world_size to satisfy the constraint: world_size = tp_size * pp_size * cp_size
world_size = mapping.tp_size * mapping.pp_size * mapping.cp_size

runtime_mapping = Mapping(
world_size=world_size, # ← Add world_size
tp_size=mapping.tp_size,
pp_size=mapping.pp_size,
cp_size=mapping.cp_size,
rank=mapping.rank,
gpus_per_node=mapping.gpus_per_node)
else:
# Fallback to default mapping
runtime_mapping = Mapping()

🛠️ Refactor suggestion

⚠️ Potential issue

World size calculation likely incorrect and brittle; cp_size may be None, dp_size ignored, and several attrs may be absent.

  • Using world_size = tp_size * pp_size * cp_size will fail if cp_size is None (it is optional elsewhere in this file) and likely undercounts when dp_size exists. Many runtimes define world_size = tp * pp * dp and treat CP orthogonally; include CP only if it's part of the global process group.
  • Accessing mapping.rank and mapping.gpus_per_node without fallbacks can raise if those fields are missing.
  • Falling back to Mapping() with no args risks a runtime break if the constructor now requires world_size (or other) or if its defaults differ from the LoRA manager’s expectations.

Refactor to:

  • Prefer exec_map.world_size if present; otherwise compute with safe defaults, including dp_size.
  • Treat cp_size as 1 unless you can prove it participates in the global PG (introduce a boolean gate if available).
  • Provide robust fallbacks for rank and gpus_per_node.
  • When no mapping is available, pass None so the downstream loader can apply its own defaults.

Suggested patch:

-        from tensorrt_llm.mapping import Mapping
-        if hasattr(self._executor_config,
-                   "mapping") and self._executor_config.mapping is not None:
-            mapping = self._executor_config.mapping
-            # Calculate world_size to satisfy the constraint: world_size = tp_size * pp_size * cp_size
-            world_size = mapping.tp_size * mapping.pp_size * mapping.cp_size
-
-            runtime_mapping = Mapping(
-                world_size=world_size,  # ← Add world_size
-                tp_size=mapping.tp_size,
-                pp_size=mapping.pp_size,
-                cp_size=mapping.cp_size,
-                rank=mapping.rank,
-                gpus_per_node=mapping.gpus_per_node)
-        else:
-            # Fallback to default mapping
-            runtime_mapping = Mapping()
+        from tensorrt_llm.mapping import Mapping as RuntimeMapping
+        exec_map = getattr(self._executor_config, "mapping", None)
+        if exec_map is not None:
+            # Prefer explicit world_size if available on the mapping.
+            world_size = getattr(exec_map, "world_size", None)
+            if world_size is None:
+                tp = getattr(exec_map, "tp_size", 1) or 1
+                pp = getattr(exec_map, "pp_size", 1) or 1
+                dp = getattr(exec_map, "dp_size", 1) or 1
+                cp = getattr(exec_map, "cp_size", 1) or 1
+                # Include CP in world_size only if it participates in the global PG.
+                include_cp = getattr(exec_map, "cp_participates_in_global_pg", False)
+                world_size = tp * pp * dp * (cp if include_cp else 1)
+
+            runtime_mapping = RuntimeMapping(
+                world_size=world_size,
+                tp_size=getattr(exec_map, "tp_size", 1),
+                pp_size=getattr(exec_map, "pp_size", 1),
+                cp_size=getattr(exec_map, "cp_size", 1),
+                rank=getattr(exec_map, "rank", self.global_rank),
+                gpus_per_node=getattr(exec_map, "gpus_per_node", torch.cuda.device_count()),
+            )
+        else:
+            # Let downstream choose sane defaults when no executor mapping is provided.
+            runtime_mapping = None

Committable suggestion skipped: line range outside the PR's diff.

@moraxu moraxu commented Aug 30, 2025

Resolved in #7231

@moraxu moraxu closed this Aug 30, 2025