Conversation

@moraxu moraxu commented Aug 19, 2025

Description

[https://nvbugs/5463720][fix]: Naively handle per-layer MLP dimensions for Nemotron 49B

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
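For example, a hypothetical invocation combining a few of the flags documented above (the stage name is taken from the help examples, not from this PR):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"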

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Added optional coarse MLP hidden-size in model configs; LoRA config can accept it and use it to size/pad adapter weights across tensor-parallel shards.
    • LoRA loading now honors per-executor runtime mapping when present.
  • Chores

    • Increased debug/runtime logging to surface model, mapping and LoRA sizing/extraction details.
  • Tests

    • Added an integration test exercising a real LoRA adapter on Nemotron Super 49B (duplicated).

Signed-off-by: Michal Guzek <[email protected]>

coderabbitai bot commented Aug 19, 2025

📝 Walkthrough

Walkthrough

Adds a new ModelConfig property, coarse_mlp_hidden_size, with Nemotron-specific inference and logging; threads that optional value through pyexecutor to LoraModelConfig; updates LoRA padding to use coarse MLP sizing when provided; adds C++ LoRA validation logs; and introduces a duplicated integration test for Nemotron Super 49B with a real LoRA adapter.
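As a reading aid, here is a minimal, self-contained sketch of the inference logic described above. The names (pretrained_config, tp_size, _infer_nemotron_ffn_mult, the DeciLM architecture check) follow this walkthrough, but the real property in tensorrt_llm/_torch/model_config.py may differ in structure and detail:

class SketchModelConfig:
    """Illustrative stand-in for ModelConfig; not the actual class."""

    def __init__(self, pretrained_config, tp_size):
        self.pretrained_config = pretrained_config
        self.tp_size = tp_size

    @property
    def coarse_mlp_hidden_size(self):
        cfg = self.pretrained_config
        if getattr(cfg, "intermediate_size", None) is not None:
            # Uniform MLP dimensions: return the TP-split intermediate size.
            return cfg.intermediate_size // self.tp_size
        if getattr(cfg, "architectures", None) == ["DeciLMForCausalLM"]:
            # Nemotron/DeciLM: derive a coarse size from per-layer ffn_mult values.
            return self._infer_nemotron_ffn_mult()
        # Mirrors the "raises / None" branch in the sequence diagram below.
        raise ValueError("coarse_mlp_hidden_size is not supported for this architecture")

    def _infer_nemotron_ffn_mult(self):
        # Placeholder for the helper that builds an explicit list of ffn_mults.
        raise NotImplementedError

# Example with a hypothetical config exposing intermediate_size:
class _Cfg:
    intermediate_size = 36864

print(SketchModelConfig(_Cfg(), tp_size=4).coarse_mlp_hidden_size)  # 9216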

Changes

  • ModelConfig: coarse MLP inference & logging (tensorrt_llm/_torch/model_config.py): Adds a public property coarse_mlp_hidden_size that returns pretrained_config.intermediate_size // tp_size or infers the Nemotron FFN multiplier via _infer_nemotron_ffn_mult(); refactors _infer_nemotron_ffn_mult() to build an explicit list of ffn_mults; adds extensive debug logging.
  • PyExecutor: extract coarse size and pass to LoraModelConfig (tensorrt_llm/_torch/pyexecutor/model_engine.py): Reads coarse_mlp_hidden_size (if present) from self.model.model_config with debug prints and safe fallbacks; passes the value (or None) into the LoraModelConfig constructor.
  • LoRA config & padding logic (tensorrt_llm/lora_manager.py): Adds coarse_mlp_hidden_size: Optional[int] = None to LoraModelConfig; when provided, computes a TP-aware max_weight_size from the coarse MLP size, tp_size, rank, and hidden size, and pads CPU LoRA weights to that size before stacking (see the sketch after this list); preserves the previous padding when absent; adds debug logs.
  • C++ LoRA validation logging (cpp/tensorrt_llm/runtime/loraUtils.cpp): Adds a debug log showing expected vs. actual flattened LoRA dimension and related parameters during loraValidateRequestTensors; no behavior change.
  • Resource/PEFT init changes (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Removes the model_engine-based branch for populating _lora_model_config; always constructs LoraModelConfig directly with dtype binding; adds a debug print for py_lora_path in add_request_peft.
  • Executor LoRA load mapping (tensorrt_llm/executor/worker.py): Builds a runtime_mapping from executor_config.mapping (world_size, tp/pp/cp sizes, rank, gpus_per_node) and passes it to _lora_manager.load_from_ckpt; requires a Mapping constructor accepting world_size.
  • PyExecutor util, non-functional comment (tensorrt_llm/_torch/pyexecutor/_util.py): Inserts a comment line before the LoraModule.create_lora_modules call; no functional change.
  • Integration test, duplicated (tests/integration/defs/examples/test_nemotron_nas.py): Adds test_nemotron_super_49b_real_lora_torch, which loads a real LoRA adapter for Nemotron Super 49B and runs generate with a LoRARequest; the function appears twice (duplicate).
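As referenced in the lora_manager.py item above, here is a minimal, self-contained sketch of the padding step: flattened per-module LoRA weights are padded to a common upper bound and stacked. The helper name, the rank * (M + H) bound, and all sizes are illustrative assumptions; the PR's actual code derives the bound from coarse_mlp_hidden_size, tp_size, rank, and hidden size (see the review discussion below).

import torch
import torch.nn.functional as F

def pad_and_stack_lora_weights(weights, rank, hidden_size, coarse_mlp_hidden_size=None):
    """Pad flattened per-module LoRA weights to a common size, then stack them."""
    if coarse_mlp_hidden_size is not None:
        # Upper-bound the flattened in/out size; never pad below the observed maximum.
        max_size = max(rank * (coarse_mlp_hidden_size + hidden_size),
                       max(w.size(0) for w in weights))
    else:
        # Previous behavior: pad to the largest weight actually present.
        max_size = max(w.size(0) for w in weights)
    return torch.stack([F.pad(w, (0, max_size - w.size(0))) for w in weights])

# Hypothetical flattened module weights (sizes are illustrative only).
weights = [torch.randn(4096), torch.randn(6144)]
stacked = pad_and_stack_lora_weights(weights, rank=8, hidden_size=256, coarse_mlp_hidden_size=512)
print(stacked.shape)  # torch.Size([2, 6144])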

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Caller
  participant Engine as PyExecutor.model_engine
  participant Config as ModelConfig
  participant LoraCfg as LoraModelConfig
  participant LoraMgr as lora_manager
  participant CPP as loraUtils.cpp

  Dev->>Engine: instantiate executor
  Engine->>Config: access model_config.coarse_mlp_hidden_size
  alt pretrained_config.intermediate_size present
    Config-->>Engine: coarse_mlp_hidden_size = intermediate_size // tp_size
  else architectures == ["DeciLMForCausalLM"]
    Config-->>Engine: coarse_mlp_hidden_size = _infer_nemotron_ffn_mult()
  else
    Config-->>Engine: raises / None (handled by Engine)
  end
  Engine->>LoraCfg: construct LoraModelConfig(..., coarse_mlp_hidden_size)
  LoraCfg->>LoraMgr: load LoRA weights (CPU path)
  alt coarse_mlp_hidden_size provided
    LoraMgr->>LoraMgr: M_coarse_tp = coarse_mlp_hidden_size * tp_size
    LoraMgr->>LoraMgr: max_weight_size = r*M_coarse_tp + r*H
    LoraMgr->>LoraMgr: pad weights -> max_weight_size
  else not provided
    LoraMgr->>LoraMgr: pad to max existing weight size (old path)
  end
  LoraMgr->>CPP: loraValidateRequestTensors (logs expected vs actual)
  CPP-->>LoraMgr: validation result
  LoraMgr-->>Engine: loaded LoRA modules
  Engine-->>Dev: executor ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • yilin-void
  • syuoni
  • shaharmor98
  • Naveassaf
  • litaotju
  • amitz-nv



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
tensorrt_llm/_torch/model_config.py (1)

456-459: Make mlp_hidden_size_per_layer available even for uniform MLP dims

Downstream code reads mlp_hidden_size_per_layer. Setting it in the scalar path avoids None/empty semantics and keeps the binding contract consistent.

Apply this diff:

             else:
                 # Uniform MLP dimensions across all layers
                 mlp_hidden_size = self.pretrained_config.intermediate_size // self.mapping.tp_size
+                # Populate per-layer vector for consistency with variable-MLP models
+                model_config_cpp.mlp_hidden_size_per_layer = [
+                    mlp_hidden_size
+                ] * self.pretrained_config.num_hidden_layers
tests/integration/defs/examples/test_nemotron_nas.py (1)

173-177: Nit: shorten long lines and keep messages readable

Wrap arguments to stay within 120 characters and improve readability (Ruff E501).

Apply this diff:

-        sampling_params = SamplingParams(max_tokens=50,
-                                         temperature=0.7,
-                                         top_p=0.9)
+        sampling_params = SamplingParams(
+            max_tokens=50,
+            temperature=0.7,
+            top_p=0.9,
+        )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e76e5c6 and abf1402.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tests/integration/defs/examples/test_nemotron_nas.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/model_config.py
  • tests/integration/defs/examples/test_nemotron_nas.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/model_config.py
  • tests/integration/defs/examples/test_nemotron_nas.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/examples/test_nemotron_nas.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/examples/test_nemotron_nas.py
🪛 Ruff (0.12.2)
tests/integration/defs/examples/test_nemotron_nas.py

146-146: Line too long (125 > 120)

(E501)

tensorrt_llm/_torch/pyexecutor/_util.py

474-474: Line too long (128 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

482-491: Bindings include tp_size & num_experts – API compatibility confirmed

I’ve verified that both the pybind and nanobind registrations for LoraModule.create_lora_modules declare the new parameters with defaults:

  • cpp/tensorrt_llm/pybind/bindings.cpp (lines 213–216):
    .def_static("create_lora_modules", …, py::arg("tp_size") = 1, py::arg("num_experts") = 0);
  • cpp/tensorrt_llm/nanobind/bindings.cpp (lines 221–224):
    .def_static("create_lora_modules", …, nb::arg("tp_size") = 1, nb::arg("num_experts") = 0);

These signatures match the Python call site in tensorrt_llm/_torch/pyexecutor/_util.py. No further binding updates are necessary.

tests/integration/defs/examples/test_nemotron_nas.py (1)

133-199: No duplicate test definitions found
I searched the entire repository for def test_nemotron_super_49b_real_lora_torch and only one occurrence exists, so there’s no shadowing or redundant runs. All pytest markers and fixtures can remain as-is.

@moraxu moraxu commented Aug 19, 2025

I mixed up config fields, this won't work. Closing

@moraxu moraxu closed this Aug 19, 2025
@moraxu moraxu reopened this Aug 21, 2025
Signed-off-by: Michal Guzek <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
tensorrt_llm/lora_manager.py (1)

1216-1217: Trim excessive debug prints; emit one-liners via logger.debug.

The surrounding print bursts (final max size, stacked shapes) are excessive for hot paths. Use concise logger.debug; avoid per-weight logs.

Apply this diff:

-            print(f"DEBUG: load_from_hf - Final max_weight_size: {max_weight_size}")
+            logger.debug("LoRA padding | final_max_weight_size=%d", max_weight_size)
...
-            print(
-                f"DEBUG: load_from_hf - Final stacked weights shape: {self._cpp_lora_weights[uid].shape}"
-            )
+            logger.debug("LoRA padding | stacked_weights_shape=%s", tuple(self._cpp_lora_weights[uid].shape))
...
-            print(
-                f"DEBUG: load_from_hf - Final stacked config shape: {self._cpp_lora_config[uid].shape}"
-            )
+            logger.debug("LoRA padding | stacked_config_shape=%s", tuple(self._cpp_lora_config[uid].shape))

Also applies to: 1229-1237

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between abf1402 and 90b369c.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/runtime/loraUtils.cpp (1 hunks)
  • tensorrt_llm/_torch/model_config.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/lora_manager.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/_torch/pyexecutor/_util.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/model_config.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/runtime/loraUtils.cpp
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/lora_manager.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/lora_manager.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
tensorrt_llm/_torch/model_config.py (2)
  • coarse_mlp_hidden_size (541-556)
  • torch_dtype (147-152)
tensorrt_llm/lora_manager.py (1)
  • LoraModelConfig (241-246)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • config (500-501)
tensorrt_llm/_utils.py (3)
  • dtype (935-936)
  • dtype (943-953)
  • torch_dtype_to_str (218-219)
tensorrt_llm/lora_manager.py (2)
tensorrt_llm/_torch/model_config.py (1)
  • coarse_mlp_hidden_size (541-556)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (46-47)
🔇 Additional comments (2)
tensorrt_llm/lora_manager.py (1)

246-247: Adding coarse_mlp_hidden_size to LoraModelConfig is appropriate.

Cleanly threads the optional MLP-size hint into the LoRA loading logic. No issues here.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

496-498: LGTM: dtype now passed as string via torch_dtype_to_str.

This aligns with downstream str_dtype_to_torch usage in lora_manager and eliminates dtype mismatches.

Comment on lines 108 to +115
TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
"lora_weights has to few values for " + moduleName);

auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
auto actualSize = weights->getShape().d[2];
TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);


🛠️ Refactor suggestion

Logging after the check makes debugging impossible on failure; move the log before the check and reuse the computed sizes. Also fix the typo "to few" → "too few".

If the size check fails, the debug line never emits. Compute expected/actual once, log, then assert using the computed values. This also fixes the user-facing message grammar.

Apply this diff:

-            TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
-                "lora_weights has to few values for " + moduleName);
-
-            auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
-            auto actualSize = weights->getShape().d[2];
-            TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
-                moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);
+            auto const expectedSize = it->flattenedInOutSize(adapterSize, isDora);
+            auto const actualSize = weights->getShape().d[2];
+            TLLM_LOG_DEBUG(
+                "LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
+                moduleName.c_str(), expectedSize, actualSize, adapterSize, static_cast<int>(isDora));
+            TLLM_CHECK_WITH_INFO(expectedSize <= actualSize,
+                "lora_weights has too few values for " + moduleName);
📝 Committable suggestion


Suggested change
TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
"lora_weights has to few values for " + moduleName);
auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
auto actualSize = weights->getShape().d[2];
TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);
auto const expectedSize = it->flattenedInOutSize(adapterSize, isDora);
auto const actualSize = weights->getShape().d[2];
TLLM_LOG_DEBUG(
"LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
moduleName.c_str(), expectedSize, actualSize, adapterSize, static_cast<int>(isDora));
TLLM_CHECK_WITH_INFO(expectedSize <= actualSize,
"lora_weights has too few values for " + moduleName);
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/runtime/loraUtils.cpp around lines 108 to 115, the debug log
is placed after the size check so it won't run on failure and the check message
contains a typo; compute expectedSize and actualSize once, move the
TLLM_LOG_DEBUG call to before TLLM_CHECK_WITH_INFO and pass the computed
expectedSize and actualSize (and adapterSize/isDora) to it, and change the
TLLM_CHECK_WITH_INFO message text from "to few" to "too few" while using the
already computed expectedSize/actualSize values in the check.

Comment on lines +464 to +491
coarse_mlp_hidden_size = None

# Debug: Check what type self.model.model_config is
print(
f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
)
print(
f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
)

if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
)
else:
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
)
# Try direct access to see if it works
try:
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
)
except AttributeError as e:
print(f"DEBUG: model_engine.py - Direct access failed: {e}")


🛠️ Refactor suggestion

⚠️ Potential issue

Avoid hasattr on properties that may raise; replace prints with robust getattr + logger.debug.

coarse_mlp_hidden_size is a property that can raise ValueError for unsupported architectures. hasattr will evaluate the property and won’t catch ValueError, causing unexpected failures. Use getattr in a try/except, and use logger.debug instead of print.
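A small, self-contained demonstration of the pitfall described above (not project code): hasattr only swallows AttributeError, so a property that raises ValueError propagates the exception instead of returning False.

class BrokenConfig:
    @property
    def coarse_mlp_hidden_size(self):
        # Stand-in for the unsupported-architecture case.
        raise ValueError("unsupported architecture")

try:
    hasattr(BrokenConfig(), "coarse_mlp_hidden_size")
except ValueError as exc:
    print(f"hasattr propagated: {exc}")  # hasattr propagated: unsupported architecture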

Apply this diff:

-        coarse_mlp_hidden_size = None
-
-        # Debug: Check what type self.model.model_config is
-        print(
-            f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
-        )
-        print(
-            f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
-        )
-
-        if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
-            coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
-            print(
-                f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
-            )
-        else:
-            print(
-                f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
-            )
-            # Try direct access to see if it works
-            try:
-                coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
-                print(
-                    f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
-                )
-            except AttributeError as e:
-                print(f"DEBUG: model_engine.py - Direct access failed: {e}")
+        coarse_mlp_hidden_size = None
+        try:
+            coarse_mlp_hidden_size = getattr(self.model.model_config, 'coarse_mlp_hidden_size', None)
+            logger.debug("model_engine | coarse_mlp_hidden_size=%s", str(coarse_mlp_hidden_size))
+        except Exception as e:
+            # Property exists but inference failed; proceed without it.
+            logger.debug("model_engine | coarse_mlp_hidden_size unavailable: %s", e)
📝 Committable suggestion


Suggested change
coarse_mlp_hidden_size = None
# Debug: Check what type self.model.model_config is
print(
f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
)
print(
f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
)
if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
)
else:
print(
f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
)
# Try direct access to see if it works
try:
coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
print(
f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
)
except AttributeError as e:
print(f"DEBUG: model_engine.py - Direct access failed: {e}")
coarse_mlp_hidden_size = None
try:
coarse_mlp_hidden_size = getattr(self.model.model_config, 'coarse_mlp_hidden_size', None)
logger.debug("model_engine | coarse_mlp_hidden_size=%s", str(coarse_mlp_hidden_size))
except Exception as e:
# Property exists but inference failed; proceed without it.
logger.debug("model_engine | coarse_mlp_hidden_size unavailable: %s", e)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 464 to 491,
replace the current hasattr/print debug block with a single try/except that
reads coarse_mlp_hidden_size once via getattr (or direct attribute access) and
catches both ValueError and AttributeError so property access that raises is
handled; on success assign the value to coarse_mlp_hidden_size and call
logger.debug with the value, on exception set coarse_mlp_hidden_size = None and
log the failure with logger.debug (do not use print), ensuring you do not
evaluate the property twice and you handle both kinds of errors.

Comment on lines +1137 to +1148
# Handle both ModelConfig and LoraModelConfig types
print(f"DEBUG: model_config type: {type(model_config)}")
print(
f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
)


💡 Verification agent

🧩 Analysis chain

Replace print-based debugging with logger.debug and trim verbosity.

Using print in library code is noisy and hard to control. Prefer logger.debug (already configured as logger = logging.getLogger(name)). Collapse multiple prints into a single succinct line.

Apply this diff:

-            print(f"DEBUG: model_config type: {type(model_config)}")
-            print(
-                f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
-            )
+            logger.debug(
+                "LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
+                type(model_config),
+                hasattr(model_config, "coarse_mlp_hidden_size"),
+                getattr(model_config, "coarse_mlp_hidden_size", None),
+            )

Run to confirm no stray prints remain in lora_manager/model_engine:


🏁 Script executed:

#!/bin/bash
rg -nP '^\s*print\(' tensorrt_llm/lora_manager.py tensorrt_llm/_torch/pyexecutor/model_engine.py

Length of output: 3027


Mandate: Replace all print-based debugging with logger.debug across Lora Manager and Model Engine

Your grep shows numerous stray print(...) calls—both in tensorrt_llm/lora_manager.py and in tensorrt_llm/_torch/pyexecutor/model_engine.py. These should all be converted to controlled, leveled logging rather than uncontrolled stdout.

• Files and ranges needing refactor:

  • tensorrt_llm/lora_manager.py: lines ~1138–1154, 1162–1235
  • tensorrt_llm/_torch/pyexecutor/model_engine.py: lines ~467–490

• Primary change (example for LoraManager snippet at 1137–1148):

-            print(f"DEBUG: model_config type: {type(model_config)}")
-            print(
-                f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
-            )
-            print(
-                f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
-            )
+            logger.debug(
+                "LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
+                type(model_config),
+                hasattr(model_config, "coarse_mlp_hidden_size"),
+                getattr(model_config, "coarse_mlp_hidden_size", None),
+            )

Apply the same pattern to all other debug prints:

  • Consolidate related prints into single logger.debug calls with structured arguments.
  • Ensure you import and use the module logger (logger = logging.getLogger(__name__)).
  • Remove any remaining print(...) debug statements.
📝 Committable suggestion


Suggested change
# Handle both ModelConfig and LoraModelConfig types
print(f"DEBUG: model_config type: {type(model_config)}")
print(
f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
)
print(
f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
)
# Handle both ModelConfig and LoraModelConfig types
logger.debug(
"LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
type(model_config),
hasattr(model_config, "coarse_mlp_hidden_size"),
getattr(model_config, "coarse_mlp_hidden_size", None),
)
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1137 to 1148 (and extend the same
change across the other indicated ranges ~1162–1235) replace the ad-hoc
print(...) debug statements with structured logger.debug calls: add or ensure a
module-level logger = logging.getLogger(__name__) import logging at top if
missing, consolidate related prints into single logger.debug invocations with
clear messages and variables (e.g., logger.debug("model_config type=%s
coarse_mlp_hidden_size=%s is_none=%s", type(model_config), getattr(model_config,
"coarse_mlp_hidden_size", None), getattr(model_config, "coarse_mlp_hidden_size",
None) is None)), and remove all remaining print(...) calls; apply the identical
replacement pattern to tensorrt_llm/_torch/pyexecutor/model_engine.py lines
~467–490 so no print-based debugging remains.

Comment on lines +1167 to +1169
M_coarse_tp = M_coarse * tp_size
max_weight_size = rank * M_coarse_tp + rank * H


⚠️ Potential issue

TP double-counting in max_weight_size; coarse_mlp_hidden_size is already TP-split.

coarse_mlp_hidden_size returns the TP-split M value. Multiplying by tp_size again inflates padding by tp_size, wasting memory significantly (especially on 49B). Compute with the TP-split value directly.
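For a rough sense of scale (hypothetical numbers, not taken from this PR): with rank r = 64, hidden size H = 8192, TP-split M_coarse = 9216, and tp_size = 4, the doubled formula pads each module to 64 * (9216 * 4) + 64 * 8192 = 2,883,584 values, whereas 64 * (9216 + 8192) = 1,114,112, roughly a 2.6x inflation per adapter module.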

Apply this diff:

-                M_coarse_tp = M_coarse * tp_size
-                max_weight_size = rank * M_coarse_tp + rank * H
+                # coarse_mlp_hidden_size is already TP-split; do not multiply by tp_size again.
+                M_coarse_tp = M_coarse
+                # For MLP up/gate or down, flattened size per module is rank*(M_coarse_tp + H).
+                # This upper-bounds attention modules too since M_coarse_tp >= H in typical configs.
+                calc_max_weight_size = rank * (M_coarse_tp + H)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1167 to 1169, the code multiplies
the TP-split coarse_mlp_hidden_size (M_coarse) by tp_size (M_coarse_tp) and then
uses that to compute max_weight_size, which double-counts tensor-parallel
splitting and inflates padding; remove the extra multiplication and compute
max_weight_size using the TP-split value directly (drop M_coarse_tp and use
M_coarse in the expression so max_weight_size = rank * M_coarse + rank * H).

Comment on lines +1186 to +1191
actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
print(
f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
)

⚠️ Potential issue

Guard against negative padding; ensure computed max covers actual max.

If calc_max_weight_size underestimates (e.g., DoRA adds a magnitude vector), padding_needed becomes negative and F.pad will error. Use the actual maximum size as a floor.

Apply this diff:

-                actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
-                print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
-                print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
-                print(
-                    f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
-                )
+                actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
+                # Account for possible DoRA magnitude vector (+M_coarse_tp) by taking the max with actual.
+                max_weight_size = max(calc_max_weight_size, actual_max_weight_size)
+                logger.debug(
+                    "LoRA padding | calc=%d actual=%d final=%d",
+                    calc_max_weight_size, actual_max_weight_size, max_weight_size
+                )
📝 Committable suggestion


Suggested change
actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
print(
f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
)
actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
# Account for possible DoRA magnitude vector (+M_coarse_tp) by taking the max with actual.
max_weight_size = max(calc_max_weight_size, actual_max_weight_size)
logger.debug(
"LoRA padding | calc=%d actual=%d final=%d",
calc_max_weight_size, actual_max_weight_size, max_weight_size
)
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1186 to 1191, the computed
max_weight_size can be less than the actual_max_weight_size causing negative
padding_needed and F.pad to fail; before padding, ensure max_weight_size is
floored to the observed actual maximum (e.g., max_weight_size =
max(max_weight_size, actual_max_weight_size)) and/or compute padding_needed =
max(0, max_weight_size - w.size(0)) so padding_needed is never negative.

Comment on lines +1219 to +1228
padded_weights = []
for i, w in enumerate(self._cpp_lora_weights[uid]):
padding_needed = max_weight_size - w.size(0)
print(
f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
)
padded_w = torch.nn.functional.pad(w, (0, padding_needed))
print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
padded_weights.append(padded_w)


🛠️ Refactor suggestion

Safer padding: clamp to zero and avoid per-weight print loops.

Even with the fix above, clamping is a safe guard and avoids crash on rare edge-cases. Replace prints with a single debug if needed.

Apply this diff:

-            padded_weights = []
-            for i, w in enumerate(self._cpp_lora_weights[uid]):
-                padding_needed = max_weight_size - w.size(0)
-                print(
-                    f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
-                )
-                padded_w = torch.nn.functional.pad(w, (0, padding_needed))
-                print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
-                padded_weights.append(padded_w)
+            padded_weights = []
+            for i, w in enumerate(self._cpp_lora_weights[uid]):
+                padding_needed = max_weight_size - w.size(0)
+                if padding_needed < 0:
+                    # Should not happen with the guard above; fall back defensively.
+                    logger.warning("LoRA padding | negative padding=%d at idx=%d; widening to actual size", padding_needed, i)
+                    padding_needed = 0
+                padded_weights.append(torch.nn.functional.pad(w, (0, padding_needed)))
📝 Committable suggestion


Suggested change
padded_weights = []
for i, w in enumerate(self._cpp_lora_weights[uid]):
padding_needed = max_weight_size - w.size(0)
print(
f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
)
padded_w = torch.nn.functional.pad(w, (0, padding_needed))
print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
padded_weights.append(padded_w)
padded_weights = []
for i, w in enumerate(self._cpp_lora_weights[uid]):
padding_needed = max_weight_size - w.size(0)
if padding_needed < 0:
# Should not happen with the guard above; fall back defensively.
logger.warning(
"LoRA padding | negative padding=%d at idx=%d; widening to actual size",
padding_needed,
i,
)
padding_needed = 0
padded_weights.append(torch.nn.functional.pad(w, (0, padding_needed)))
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1219 to 1228, the per-weight debug
prints and unguarded padding can cause noisy logs and crashes when
padding_needed is negative; change padding_needed to be clamped to zero (e.g.
use max(padding_needed, 0) or torch.clamp_min) before calling pad, remove the
per-weight print statements inside the loop, and instead emit a single debug
summarizing counts/sizes after the loop (or conditionally log once when any
padding occurred). Ensure padded_weights.append is still executed and behavior
unchanged when padding_needed is zero.

Signed-off-by: Michal Guzek <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1047-1051: Include coarse_mlp_hidden_size in LoraModelConfig before passing it

We verified that LoraModelConfig is declared as a @dataclass without a coarse_mlp_hidden_size field, so the proposed diff in resource_manager.py would break the constructor call. To support Nemotron-49B’s per-layer MLP sizing, first extend LoraModelConfig, then pass the new argument:

• In tensorrt_llm/lora_manager.py, add the field to the dataclass:

@@ class LoraModelConfig:
     lora_target_modules: list[str]
     trtllm_modules_to_hf_modules: dict[str, str]
     hidden_size: int
     data_type: str
+    coarse_mlp_hidden_size: Optional[int] = None

• In tensorrt_llm/_torch/pyexecutor/resource_manager.py, instantiate with the new parameter:

@@ resource_manager.py:1047
-        self._lora_model_config = LoraModelConfig(
-            lora_config.lora_target_modules,
-            lora_config.trtllm_modules_to_hf_modules,
-            model_config.hidden_size,
-            binding_to_str_dtype(model_config.data_type))
+        coarse_mlp_hidden_size = getattr(model_config, "coarse_mlp_hidden_size", None)
+        self._lora_model_config = LoraModelConfig(
+            lora_config.lora_target_modules,
+            lora_config.trtllm_modules_to_hf_modules,
+            model_config.hidden_size,
+            binding_to_str_dtype(model_config.data_type),
+            coarse_mlp_hidden_size=coarse_mlp_hidden_size,
+        )

These changes ensure the constructor accepts and stores coarse_mlp_hidden_size for correct LoRA sizing on Nemotron-49B.

🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)

1040-1047: Remove dead commented code referencing model_engine.

This block references model_engine that isn’t in scope for this initializer and adds noise. Please delete instead of commenting out.

-        # if model_engine is not None and hasattr(model_engine, "lora_model_config"):
-        #     self._lora_model_config = model_engine.lora_model_config
-        # else:
-        #     self._lora_model_config = LoraModelConfig(
-        #         lora_config.lora_target_modules,
-        #         lora_config.trtllm_modules_to_hf_modules, model_config.hidden_size,
-        #         binding_to_str_dtype(model_config.data_type))

1062-1064: Replace print with logger.debug to avoid noisy stdout and respect log levels.

Library code shouldn’t use print. Use the existing logger with debug level; include task id for traceability.

-                print(
-                    f"DEBUG: INSIDE add_request_peft: request.py_lora_path: {request.py_lora_path}"
-                )
+                logger.debug(
+                    f"add_request_peft: loading LoRA from py_lora_path='{request.py_lora_path}' "
+                    f"for task_id={request.lora_task_id}"
+                )

1-1: Add NVIDIA copyright header (2025).

Per coding guidelines, prepend the current-year NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import copy
tensorrt_llm/executor/worker.py (2)

361-361: Local import is fine; consider module-level import only if no circular deps.

Keeping the import inside the function avoids import-time cycles and is acceptable here, since LoRA loading isn’t in the hot path. If there’s no cycle, moving it to the top improves discoverability.


1-1: Missing NVIDIA copyright header.

Per repo guidelines, prepend the current-year NVIDIA header to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 90b369c and 794d28b.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (2)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Weight-Decomposed Low-Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🔇 Additional comments (3)
tensorrt_llm/executor/worker.py (3)

379-386: Only pass runtime_mapping when fully specified; otherwise defer to loader defaults.

The current call always forwards a runtime_mapping object (potentially empty or malformed). To avoid surprising behavior, pass None when you don’t have a trustworthy mapping so LoraManager.load_from_ckpt can apply its internal defaults.

If you apply the previous diff (returning None in the fallback), this call site is already compatible. Please confirm that load_from_ckpt(..., runtime_mapping=None) preserves prior behavior across TRT and PyTorch backends.


399-407: Known LoRA CPU cache race remains; this change doesn’t address it.

Per our prior learning on this file, the is_adapter_in_cpu_cache() path has a race that can’t be solved with simple guards. Your new runtime mapping doesn’t worsen it, but also doesn’t mitigate it. Just calling it out to avoid false confidence. If you want, we can scope a targeted fix separately.


362-375: Mapping and LoraManager Signatures Verified – All Good

  • The Mapping.__init__ signature in tensorrt_llm/mapping.py begins with
    def __init__(self, world_size=1, rank=0, gpus_per_node=8, *, cp_size=1, cp_config=None, tp_size=1, pp_size=1, ...)
    confirming that the world_size keyword argument is supported and all of the existing default‐ctor parameters remain unchanged.
  • The LoraManager.load_from_ckpt method in tensorrt_llm/lora_manager.py is declared as
    def load_from_ckpt(..., runtime_mapping: Optional[Mapping] = None, ...)
    so it fully accepts a runtime_mapping argument without breaking existing behavior.

No further changes are needed for the constructor or loader signatures.
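A minimal usage sketch based on the signature quoted above; the 2-way TP x 2-way PP layout on a single 8-GPU node is an illustrative assumption, not this PR's configuration:

from tensorrt_llm.mapping import Mapping

runtime_mapping = Mapping(
    world_size=2 * 2 * 1,  # tp_size * pp_size * cp_size
    rank=0,
    gpus_per_node=8,
    tp_size=2,
    pp_size=2,
    cp_size=1,
)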

Comment on lines +362 to +377
if hasattr(self._executor_config,
"mapping") and self._executor_config.mapping is not None:
mapping = self._executor_config.mapping
# Calculate world_size to satisfy the constraint: world_size = tp_size * pp_size * cp_size
world_size = mapping.tp_size * mapping.pp_size * mapping.cp_size

runtime_mapping = Mapping(
world_size=world_size, # ← Add world_size
tp_size=mapping.tp_size,
pp_size=mapping.pp_size,
cp_size=mapping.cp_size,
rank=mapping.rank,
gpus_per_node=mapping.gpus_per_node)
else:
# Fallback to default mapping
runtime_mapping = Mapping()

🛠️ Refactor suggestion

⚠️ Potential issue

World size calculation likely incorrect and brittle; cp_size may be None, dp_size ignored, and several attrs may be absent.

  • Using world_size = tp_size * pp_size * cp_size will fail if cp_size is None (it is optional elsewhere in this file) and likely undercounts when dp_size exists. Many runtimes define world_size = tp * pp * dp and treat CP orthogonally; include CP only if it's part of the global process group.
  • Accessing mapping.rank and mapping.gpus_per_node without fallbacks can raise if those fields are missing.
  • Falling back to Mapping() with no args risks a runtime break if the constructor now requires world_size (or other) or if its defaults differ from the LoRA manager’s expectations.

Refactor to:

  • Prefer exec_map.world_size if present; otherwise compute with safe defaults, including dp_size.
  • Treat cp_size as 1 unless you can prove it participates in the global PG (introduce a boolean gate if available).
  • Provide robust fallbacks for rank and gpus_per_node.
  • When no mapping is available, pass None so the downstream loader can apply its own defaults.

Suggested patch:

-        from tensorrt_llm.mapping import Mapping
-        if hasattr(self._executor_config,
-                   "mapping") and self._executor_config.mapping is not None:
-            mapping = self._executor_config.mapping
-            # Calculate world_size to satisfy the constraint: world_size = tp_size * pp_size * cp_size
-            world_size = mapping.tp_size * mapping.pp_size * mapping.cp_size
-
-            runtime_mapping = Mapping(
-                world_size=world_size,  # ← Add world_size
-                tp_size=mapping.tp_size,
-                pp_size=mapping.pp_size,
-                cp_size=mapping.cp_size,
-                rank=mapping.rank,
-                gpus_per_node=mapping.gpus_per_node)
-        else:
-            # Fallback to default mapping
-            runtime_mapping = Mapping()
+        from tensorrt_llm.mapping import Mapping as RuntimeMapping
+        exec_map = getattr(self._executor_config, "mapping", None)
+        if exec_map is not None:
+            # Prefer explicit world_size if available on the mapping.
+            world_size = getattr(exec_map, "world_size", None)
+            if world_size is None:
+                tp = getattr(exec_map, "tp_size", 1) or 1
+                pp = getattr(exec_map, "pp_size", 1) or 1
+                dp = getattr(exec_map, "dp_size", 1) or 1
+                cp = getattr(exec_map, "cp_size", 1) or 1
+                # Include CP in world_size only if it participates in the global PG.
+                include_cp = getattr(exec_map, "cp_participates_in_global_pg", False)
+                world_size = tp * pp * dp * (cp if include_cp else 1)
+
+            runtime_mapping = RuntimeMapping(
+                world_size=world_size,
+                tp_size=getattr(exec_map, "tp_size", 1),
+                pp_size=getattr(exec_map, "pp_size", 1),
+                cp_size=getattr(exec_map, "cp_size", 1),
+                rank=getattr(exec_map, "rank", self.global_rank),
+                gpus_per_node=getattr(exec_map, "gpus_per_node", torch.cuda.device_count()),
+            )
+        else:
+            # Let downstream choose sane defaults when no executor mapping is provided.
+            runtime_mapping = None

Committable suggestion skipped: line range outside the PR's diff.

@moraxu moraxu commented Aug 30, 2025

Resolved in #7231

@moraxu moraxu closed this Aug 30, 2025