[https://nvbugs/5463720][fix]: Naively handle per-layer MLP dimensions for Nemotron 49B #7016
Conversation
Signed-off-by: Michal Guzek <[email protected]>
📝 Walkthrough
Adds a new ModelConfig property coarse_mlp_hidden_size with Nemotron inference and logging, threads that optional value through pyexecutor to LoraModelConfig, updates LoRA padding to use coarse MLP sizing when provided, adds C++ LoRA validation logs, and introduces a duplicated integration test for Nemotron Super 49B with a real LoRA adapter.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Dev as Caller
    participant Engine as PyExecutor.model_engine
    participant Config as ModelConfig
    participant LoraCfg as LoraModelConfig
    participant LoraMgr as lora_manager
    participant CPP as loraUtils.cpp
    Dev->>Engine: instantiate executor
    Engine->>Config: access model_config.coarse_mlp_hidden_size
    alt pretrained_config.intermediate_size present
        Config-->>Engine: coarse_mlp_hidden_size = intermediate_size // tp_size
    else architectures == ["DeciLMForCausalLM"]
        Config-->>Engine: coarse_mlp_hidden_size = _infer_nemotron_ffn_mult()
    else
        Config-->>Engine: raises / None (handled by Engine)
    end
    Engine->>LoraCfg: construct LoraModelConfig(..., coarse_mlp_hidden_size)
    LoraCfg->>LoraMgr: load LoRA weights (CPU path)
    alt coarse_mlp_hidden_size provided
        LoraMgr->>LoraMgr: M_coarse_tp = coarse_mlp_hidden_size * tp_size
        LoraMgr->>LoraMgr: max_weight_size = r*M_coarse_tp + r*H
        LoraMgr->>LoraMgr: pad weights -> max_weight_size
    else not provided
        LoraMgr->>LoraMgr: pad to max existing weight size (old path)
    end
    LoraMgr->>CPP: loraValidateRequestTensors (logs expected vs actual)
    CPP-->>LoraMgr: validation result
    LoraMgr-->>Engine: loaded LoRA modules
    Engine-->>Dev: executor ready
```
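To make the two padding branches above concrete, here is a minimal, self-contained Python sketch of the size selection the diagram describes. The function name and signature are illustrative assumptions rather than the repository's actual helper; only the two formulas come from the diagram.

```python
from typing import List, Optional


def select_padded_weight_size(rank: int, hidden_size: int,
                              weight_sizes: List[int],
                              coarse_mlp_hidden_size: Optional[int] = None,
                              tp_size: int = 1) -> int:
    """Pick the flattened LoRA weight size that every module gets padded to."""
    if coarse_mlp_hidden_size is not None:
        # New path: derive an upper bound from the coarse MLP dimension.
        m_coarse_tp = coarse_mlp_hidden_size * tp_size
        return rank * m_coarse_tp + rank * hidden_size
    # Old path: pad to the largest weight actually observed.
    return max(weight_sizes)


# Example with made-up sizes: a rank-16 adapter, hidden size 4096, coarse MLP size 11008.
print(select_padded_weight_size(16, 4096, [65536, 131072], coarse_mlp_hidden_size=11008))
```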
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Actionable comments posted: 3
🧹 Nitpick comments (2)
tensorrt_llm/_torch/model_config.py (1)
456-459: Make mlp_hidden_size_per_layer available even for uniform MLP dims. Downstream code reads mlp_hidden_size_per_layer. Setting it in the scalar path avoids None/empty semantics and keeps the binding contract consistent.
Apply this diff:
```diff
         else:  # Uniform MLP dimensions across all layers
             mlp_hidden_size = self.pretrained_config.intermediate_size // self.mapping.tp_size
+            # Populate per-layer vector for consistency with variable-MLP models
+            model_config_cpp.mlp_hidden_size_per_layer = [
+                mlp_hidden_size
+            ] * self.pretrained_config.num_hidden_layers
```
tests/integration/defs/examples/test_nemotron_nas.py (1)
173-177: Nit: shorten long lines and keep messages readable. Wrap arguments to stay within 120 characters and improve readability (Ruff E501).
Apply this diff:
```diff
-    sampling_params = SamplingParams(max_tokens=50,
-                                     temperature=0.7,
-                                     top_p=0.9)
+    sampling_params = SamplingParams(
+        max_tokens=50,
+        temperature=0.7,
+        top_p=0.9,
+    )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/model_config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tests/integration/defs/examples/test_nemotron_nas.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/model_config.py
tests/integration/defs/examples/test_nemotron_nas.py
tensorrt_llm/_torch/pyexecutor/_util.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/model_config.py
tests/integration/defs/examples/test_nemotron_nas.py
tensorrt_llm/_torch/pyexecutor/_util.py
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/examples/test_nemotron_nas.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/examples/test_nemotron_nas.py
🪛 Ruff (0.12.2)
tests/integration/defs/examples/test_nemotron_nas.py
146-146: Line too long (125 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/_util.py
474-474: Line too long (128 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
482-491
: Bindings include tp_size & num_experts – API compatibility confirmed. I've verified that both the pybind and nanobind registrations for
LoraModule.create_lora_modules
declare the new parameters with defaults:
- cpp/tensorrt_llm/pybind/bindings.cpp (lines 213–216):
.def_static("create_lora_modules", …, py::arg("tp_size") = 1, py::arg("num_experts") = 0);
- cpp/tensorrt_llm/nanobind/bindings.cpp (lines 221–224):
.def_static("create_lora_modules", …, nb::arg("tp_size") = 1, nb::arg("num_experts") = 0);
These signatures match the Python call site in
tensorrt_llm/_torch/pyexecutor/_util.py
. No further binding updates are necessary.
tests/integration/defs/examples/test_nemotron_nas.py (1)
133-199
: No duplicate test definitions found
I searched the entire repository for def test_nemotron_super_49b_real_lora_torch
and only one occurrence exists, so there’s no shadowing or redundant runs. All pytest markers and fixtures can remain as-is.
I mixed up config fields, this won't work. Closing
Signed-off-by: Michal Guzek <[email protected]>
Actionable comments posted: 6
🧹 Nitpick comments (1)
tensorrt_llm/lora_manager.py (1)
1216-1217: Trim excessive debug prints; emit one-liners via logger.debug. The surrounding print bursts (final max size, stacked shapes) are excessive for hot paths. Use concise logger.debug; avoid per-weight logs.
Apply this diff:
- print(f"DEBUG: load_from_hf - Final max_weight_size: {max_weight_size}") + logger.debug("LoRA padding | final_max_weight_size=%d", max_weight_size) ... - print( - f"DEBUG: load_from_hf - Final stacked weights shape: {self._cpp_lora_weights[uid].shape}" - ) + logger.debug("LoRA padding | stacked_weights_shape=%s", tuple(self._cpp_lora_weights[uid].shape)) ... - print( - f"DEBUG: load_from_hf - Final stacked config shape: {self._cpp_lora_config[uid].shape}" - ) + logger.debug("LoRA padding | stacked_config_shape=%s", tuple(self._cpp_lora_config[uid].shape))Also applies to: 1229-1237
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- cpp/tensorrt_llm/runtime/loraUtils.cpp (1 hunks)
- tensorrt_llm/_torch/model_config.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/lora_manager.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tensorrt_llm/_torch/pyexecutor/_util.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/model_config.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/runtime/loraUtils.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/runtime/loraUtils.cpp
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/lora_manager.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/lora_manager.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
- tensorrt_llm/_torch/model_config.py (2): coarse_mlp_hidden_size (541-556), torch_dtype (147-152)
- tensorrt_llm/lora_manager.py (1): LoraModelConfig (241-246)
- tensorrt_llm/_torch/models/modeling_utils.py (1): config (500-501)
- tensorrt_llm/_utils.py (3): dtype (935-936), dtype (943-953), torch_dtype_to_str (218-219)
tensorrt_llm/lora_manager.py (2)
- tensorrt_llm/_torch/model_config.py (1): coarse_mlp_hidden_size (541-556)
- tensorrt_llm/_torch/distributed/communicator.py (1): tp_size (46-47)
🔇 Additional comments (2)
tensorrt_llm/lora_manager.py (1)
246-247: Adding coarse_mlp_hidden_size to LoraModelConfig is appropriate. Cleanly threads the optional MLP-size hint into LoRA loading logic. No issues here.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
496-498: LGTM: dtype is now passed as a string via torch_dtype_to_str. This aligns with downstream str_dtype_to_torch usage in lora_manager and eliminates dtype mismatches.
```cpp
TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
    "lora_weights has to few values for " + moduleName);

auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
auto actualSize = weights->getShape().d[2];
TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
    moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);
```
🛠️ Refactor suggestion
Log after-check makes debugging impossible on failure; move log before check and reuse computed sizes. Also fix typo "to few" → "too few".
If the size check fails, the debug line never emits. Compute expected/actual once, log, then assert using the computed values. This also fixes the user-facing message grammar.
Apply this diff:
- TLLM_CHECK_WITH_INFO(it->flattenedInOutSize(adapterSize, isDora) <= weights->getShape().d[2],
- "lora_weights has to few values for " + moduleName);
-
- auto expectedSize = it->flattenedInOutSize(adapterSize, isDora);
- auto actualSize = weights->getShape().d[2];
- TLLM_LOG_DEBUG("LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
- moduleName.c_str(), expectedSize, actualSize, adapterSize, isDora);
+ auto const expectedSize = it->flattenedInOutSize(adapterSize, isDora);
+ auto const actualSize = weights->getShape().d[2];
+ TLLM_LOG_DEBUG(
+ "LoRA validation for %s - Expected: %d, Actual: %d, AdapterSize: %d, IsDora: %d",
+ moduleName.c_str(), expectedSize, actualSize, adapterSize, static_cast<int>(isDora));
+ TLLM_CHECK_WITH_INFO(expectedSize <= actualSize,
+ "lora_weights has too few values for " + moduleName);
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/runtime/loraUtils.cpp around lines 108 to 115, the debug log
is placed after the size check so it won't run on failure and the check message
contains a typo; compute expectedSize and actualSize once, move the
TLLM_LOG_DEBUG call to before TLLM_CHECK_WITH_INFO and pass the computed
expectedSize and actualSize (and adapterSize/isDora) to it, and change the
TLLM_CHECK_WITH_INFO message text from "to few" to "too few" while using the
already computed expectedSize/actualSize values in the check.
```python
coarse_mlp_hidden_size = None

# Debug: Check what type self.model.model_config is
print(
    f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
)
print(
    f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
)

if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
    coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
    print(
        f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
    )
else:
    print(
        f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
    )
    # Try direct access to see if it works
    try:
        coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
        print(
            f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
        )
    except AttributeError as e:
        print(f"DEBUG: model_engine.py - Direct access failed: {e}")
```
🛠️ Refactor suggestion
Avoid hasattr on properties that may raise; replace prints with robust getattr + logger.debug.
coarse_mlp_hidden_size is a property that can raise ValueError for unsupported architectures. hasattr will evaluate the property and won’t catch ValueError, causing unexpected failures. Use getattr in a try/except, and use logger.debug instead of print.
Apply this diff:
- coarse_mlp_hidden_size = None
-
- # Debug: Check what type self.model.model_config is
- print(
- f"DEBUG: model_engine.py - self.model.model_config type: {type(self.model.model_config)}"
- )
- print(
- f"DEBUG: model_engine.py - self.model.model_config dir: {dir(self.model.model_config)}"
- )
-
- if hasattr(self.model.model_config, 'coarse_mlp_hidden_size'):
- coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
- print(
- f"DEBUG: model_engine.py - coarse_mlp_hidden_size: {coarse_mlp_hidden_size}"
- )
- else:
- print(
- f"DEBUG: model_engine.py - coarse_mlp_hidden_size property not found"
- )
- # Try direct access to see if it works
- try:
- coarse_mlp_hidden_size = self.model.model_config.coarse_mlp_hidden_size
- print(
- f"DEBUG: model_engine.py - Direct access worked: {coarse_mlp_hidden_size}"
- )
- except AttributeError as e:
- print(f"DEBUG: model_engine.py - Direct access failed: {e}")
+ coarse_mlp_hidden_size = None
+ try:
+ coarse_mlp_hidden_size = getattr(self.model.model_config, 'coarse_mlp_hidden_size', None)
+ logger.debug("model_engine | coarse_mlp_hidden_size=%s", str(coarse_mlp_hidden_size))
+ except Exception as e:
+ # Property exists but inference failed; proceed without it.
+ logger.debug("model_engine | coarse_mlp_hidden_size unavailable: %s", e)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 464 to 491,
replace the current hasattr/print debug block with a single try/except that
reads coarse_mlp_hidden_size once via getattr (or direct attribute access) and
catches both ValueError and AttributeError so property access that raises is
handled; on success assign the value to coarse_mlp_hidden_size and call
logger.debug with the value, on exception set coarse_mlp_hidden_size = None and
log the failure with logger.debug (do not use print), ensuring you do not
evaluate the property twice and you handle both kinds of errors.
```python
# Handle both ModelConfig and LoraModelConfig types
print(f"DEBUG: model_config type: {type(model_config)}")
print(
    f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
)
print(
    f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
)
print(
    f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
)
```
💡 Verification agent
🧩 Analysis chain
Replace print-based debugging with logger.debug and trim verbosity.
Using print in library code is noisy and hard to control. Prefer logger.debug (already configured as logger = logging.getLogger(__name__)). Collapse multiple prints into a single succinct line.
Apply this diff:
- print(f"DEBUG: model_config type: {type(model_config)}")
- print(
- f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
- )
- print(
- f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
- )
- print(
- f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
- )
+ logger.debug(
+ "LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
+ type(model_config),
+ hasattr(model_config, "coarse_mlp_hidden_size"),
+ getattr(model_config, "coarse_mlp_hidden_size", None),
+ )
Run to confirm no stray prints remain in lora_manager/model_engine:
🏁 Script executed:
#!/bin/bash
rg -nP '^\s*print\(' tensorrt_llm/lora_manager.py tensorrt_llm/_torch/pyexecutor/model_engine.py
Length of output: 3027
Mandate: Replace all print-based debugging with logger.debug across Lora Manager and Model Engine. Your grep shows numerous stray print(...) calls, both in tensorrt_llm/lora_manager.py and in tensorrt_llm/_torch/pyexecutor/model_engine.py. These should all be converted to controlled, leveled logging rather than uncontrolled stdout.
• Files and ranges needing refactor:
- tensorrt_llm/lora_manager.py: lines ~1138–1154, 1162–1235
- tensorrt_llm/_torch/pyexecutor/model_engine.py: lines ~467–490
• Primary change (example for LoraManager snippet at 1137–1148):
- print(f"DEBUG: model_config type: {type(model_config)}")
- print(
- f"DEBUG: model_config has coarse_mlp_hidden_size: {hasattr(model_config, 'coarse_mlp_hidden_size')}"
- )
- print(
- f"DEBUG: model_config.coarse_mlp_hidden_size value: {model_config.coarse_mlp_hidden_size}"
- )
- print(
- f"DEBUG: model_config.coarse_mlp_hidden_size is None: {model_config.coarse_mlp_hidden_size is None}"
- )
+ logger.debug(
+ "LoRA padding | model_config=%s | has_coarse=%s | coarse=%s",
+ type(model_config),
+ hasattr(model_config, "coarse_mlp_hidden_size"),
+ getattr(model_config, "coarse_mlp_hidden_size", None),
+ )
Apply the same pattern to all other debug prints:
- Consolidate related prints into single logger.debug calls with structured arguments.
- Ensure you import and use the module logger (logger = logging.getLogger(__name__)).
- Remove any remaining print(...) debug statements.
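For completeness, a minimal runnable sketch of the suggested pattern (module-level logger plus one consolidated logger.debug call); the helper name, message format, and the fake config class are illustrative assumptions, not existing TensorRT-LLM code:

```python
import logging

logger = logging.getLogger(__name__)


def log_lora_padding_debug(model_config) -> None:
    """Emit one structured debug line instead of several print() calls."""
    logger.debug(
        "LoRA padding | model_config=%s | coarse=%s",
        type(model_config).__name__,
        getattr(model_config, "coarse_mlp_hidden_size", None),
    )


if __name__ == "__main__":
    # Unlike print(), this output only appears when DEBUG logging is enabled.
    logging.basicConfig(level=logging.DEBUG)

    class FakeConfig:
        coarse_mlp_hidden_size = 11008

    log_lora_padding_debug(FakeConfig())
```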
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1137 to 1148 (and extend the same
change across the other indicated ranges ~1162–1235) replace the ad-hoc
print(...) debug statements with structured logger.debug calls: add or ensure a
module-level logger = logging.getLogger(__name__) import logging at top if
missing, consolidate related prints into single logger.debug invocations with
clear messages and variables (e.g., logger.debug("model_config type=%s
coarse_mlp_hidden_size=%s is_none=%s", type(model_config), getattr(model_config,
"coarse_mlp_hidden_size", None), getattr(model_config, "coarse_mlp_hidden_size",
None) is None)), and remove all remaining print(...) calls; apply the identical
replacement pattern to tensorrt_llm/_torch/pyexecutor/model_engine.py lines
~467–490 so no print-based debugging remains.
```python
M_coarse_tp = M_coarse * tp_size
max_weight_size = rank * M_coarse_tp + rank * H
```
TP double-counting in max_weight_size; coarse_mlp_hidden_size is already TP-split.
coarse_mlp_hidden_size returns the TP-split M value. Multiplying by tp_size again inflates padding by tp_size, wasting memory significantly (especially on 49B). Compute with the TP-split value directly.
Apply this diff:
- M_coarse_tp = M_coarse * tp_size
- max_weight_size = rank * M_coarse_tp + rank * H
+ # coarse_mlp_hidden_size is already TP-split; do not multiply by tp_size again.
+ M_coarse_tp = M_coarse
+ # For MLP up/gate or down, flattened size per module is rank*(M_coarse_tp + H).
+ # This upper-bounds attention modules too since M_coarse_tp >= H in typical configs.
+ calc_max_weight_size = rank * (M_coarse_tp + H)
Committable suggestion skipped: line range outside the PR's diff.
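A small worked example (with made-up sizes, not the real Nemotron 49B dimensions) shows roughly how much padding the extra tp_size factor costs per weight:

```python
# All values here are hypothetical; coarse_tp_split is the already-TP-split coarse MLP size.
rank, hidden, coarse_tp_split, tp_size = 16, 8192, 14336, 4

inflated = rank * (coarse_tp_split * tp_size) + rank * hidden  # 1,048,576 values
correct = rank * coarse_tp_split + rank * hidden               # 360,448 values
print(f"over-padding factor: {inflated / correct:.2f}x")       # ~2.91x
```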
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1167 to 1169, the code multiplies
the TP-split coarse_mlp_hidden_size (M_coarse) by tp_size (M_coarse_tp) and then
uses that to compute max_weight_size, which double-counts tensor-parallel
splitting and inflates padding; remove the extra multiplication and compute
max_weight_size using the TP-split value directly (drop M_coarse_tp and use
M_coarse in the expression so max_weight_size = rank * M_coarse + rank * H).
```python
actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
print(
    f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
)
```
Guard against negative padding; ensure computed max covers actual max.
If calc_max_weight_size underestimates (e.g., DoRA adds a magnitude vector), padding_needed becomes negative and F.pad will error. Use the actual maximum size as a floor.
Apply this diff:
- actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
- print(f"DEBUG: load_from_hf - Actual max weight size: {actual_max_weight_size}")
- print(f"DEBUG: load_from_hf - Calculated max_weight_size: {max_weight_size}")
- print(
- f"DEBUG: load_from_hf - Difference: {max_weight_size - actual_max_weight_size}"
- )
+ actual_max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
+ # Account for possible DoRA magnitude vector (+M_coarse_tp) by taking the max with actual.
+ max_weight_size = max(calc_max_weight_size, actual_max_weight_size)
+ logger.debug(
+ "LoRA padding | calc=%d actual=%d final=%d",
+ calc_max_weight_size, actual_max_weight_size, max_weight_size
+ )
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1186 to 1191, the computed
max_weight_size can be less than the actual_max_weight_size causing negative
padding_needed and F.pad to fail; before padding, ensure max_weight_size is
floored to the observed actual maximum (e.g., max_weight_size =
max(max_weight_size, actual_max_weight_size)) and/or compute padding_needed =
max(0, max_weight_size - w.size(0)) so padding_needed is never negative.
```python
padded_weights = []
for i, w in enumerate(self._cpp_lora_weights[uid]):
    padding_needed = max_weight_size - w.size(0)
    print(
        f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
    )
    padded_w = torch.nn.functional.pad(w, (0, padding_needed))
    print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
    padded_weights.append(padded_w)
```
🛠️ Refactor suggestion
Safer padding: clamp to zero and avoid per-weight print loops.
Even with the fix above, clamping is a safe guard and avoids crash on rare edge-cases. Replace prints with a single debug if needed.
Apply this diff:
- padded_weights = []
- for i, w in enumerate(self._cpp_lora_weights[uid]):
- padding_needed = max_weight_size - w.size(0)
- print(
- f"DEBUG: load_from_hf - Weight {i}: original size {w.size(0)}, padding {padding_needed}"
- )
- padded_w = torch.nn.functional.pad(w, (0, padding_needed))
- print(f"DEBUG: load_from_hf - Weight {i}: padded size {padded_w.size(0)}")
- padded_weights.append(padded_w)
+ padded_weights = []
+ for i, w in enumerate(self._cpp_lora_weights[uid]):
+ padding_needed = max_weight_size - w.size(0)
+ if padding_needed < 0:
+ # Should not happen with the guard above; fall back defensively.
+ logger.warning("LoRA padding | negative padding=%d at idx=%d; widening to actual size", padding_needed, i)
+ padding_needed = 0
+ padded_weights.append(torch.nn.functional.pad(w, (0, padding_needed)))
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 1219 to 1228, the per-weight debug
prints and unguarded padding can cause noisy logs and crashes when
padding_needed is negative; change padding_needed to be clamped to zero (e.g.
use max(padding_needed, 0) or torch.clamp_min) before calling pad, remove the
per-weight print statements inside the loop, and instead emit a single debug
summarizing counts/sizes after the loop (or conditionally log once when any
padding occurred). Ensure padded_weights.append is still executed and behavior
unchanged when padding_needed is zero.
Signed-off-by: Michal Guzek <[email protected]>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
1047-1051: Include coarse_mlp_hidden_size in LoraModelConfig before passing it. We verified that LoraModelConfig is declared as a @dataclass without a coarse_mlp_hidden_size field, so the proposed diff in resource_manager.py would break the constructor call. To support Nemotron-49B's per-layer MLP sizing, first extend LoraModelConfig, then pass the new argument:
• In tensorrt_llm/lora_manager.py, add the field to the dataclass:
```diff
@@ class LoraModelConfig:
     lora_target_modules: list[str]
     trtllm_modules_to_hf_modules: dict[str, str]
     hidden_size: int
     data_type: str
+    coarse_mlp_hidden_size: int | None = None
```
• In tensorrt_llm/_torch/pyexecutor/resource_manager.py, instantiate with the new parameter:
```diff
@@ resource_manager.py:1047
-        self._lora_model_config = LoraModelConfig(
-            lora_config.lora_target_modules,
-            lora_config.trtllm_modules_to_hf_modules,
-            model_config.hidden_size,
-            binding_to_str_dtype(model_config.data_type))
+        coarse_mlp_hidden_size = getattr(model_config, "coarse_mlp_hidden_size", None)
+        self._lora_model_config = LoraModelConfig(
+            lora_config.lora_target_modules,
+            lora_config.trtllm_modules_to_hf_modules,
+            model_config.hidden_size,
+            binding_to_str_dtype(model_config.data_type),
+            coarse_mlp_hidden_size=coarse_mlp_hidden_size,
+        )
```
These changes ensure the constructor accepts and stores coarse_mlp_hidden_size for correct LoRA sizing on Nemotron-49B.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
1040-1047: Remove dead commented code referencing model_engine. This block references model_engine that isn't in scope for this initializer and adds noise. Please delete instead of commenting out.
```diff
-        # if model_engine is not None and hasattr(model_engine, "lora_model_config"):
-        #     self._lora_model_config = model_engine.lora_model_config
-        # else:
-        #     self._lora_model_config = LoraModelConfig(
-        #         lora_config.lora_target_modules,
-        #         lora_config.trtllm_modules_to_hf_modules, model_config.hidden_size,
-        #         binding_to_str_dtype(model_config.data_type))
```
1062-1064: Replace print with logger.debug to avoid noisy stdout and respect log levels. Library code shouldn't use print. Use the existing logger at debug level; include the task id for traceability.
```diff
-        print(
-            f"DEBUG: INSIDE add_request_peft: request.py_lora_path: {request.py_lora_path}"
-        )
+        logger.debug(
+            f"add_request_peft: loading LoRA from py_lora_path='{request.py_lora_path}' "
+            f"for task_id={request.lora_task_id}"
+        )
```
1-1: Add NVIDIA copyright header (2025). Per coding guidelines, prepend the current-year NVIDIA copyright header.
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import copy
```
tensorrt_llm/executor/worker.py (2)
361-361: Local import is fine; consider a module-level import only if there are no circular deps. Keeping the import inside the function avoids import-time cycles and is acceptable here, since LoRA loading isn't in the hot path. If there's no cycle, moving it to the top improves discoverability.
1-1: Missing NVIDIA copyright header. Per repo guidelines, prepend the current-year NVIDIA header to all source files.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (2)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🔇 Additional comments (3)
tensorrt_llm/executor/worker.py (3)
379-386: Only pass runtime_mapping when fully specified; otherwise defer to loader defaults. The current call always forwards a runtime_mapping object (potentially empty or malformed). To avoid surprising behavior, pass None when you don't have a trustworthy mapping so LoraManager.load_from_ckpt can apply its internal defaults. If you apply the previous diff (returning None in the fallback), this call site is already compatible. Please confirm that load_from_ckpt(..., runtime_mapping=None) preserves prior behavior across TRT and PyTorch backends.
399-407: Known LoRA CPU cache race remains; this change doesn't address it. Per our prior learning on this file, the is_adapter_in_cpu_cache() path has a race that can't be solved with simple guards. Your new runtime mapping doesn't worsen it, but also doesn't mitigate it. Just calling it out to avoid false confidence. If you want, we can scope a targeted fix separately.
362-375: Mapping and LoraManager signatures verified – all good.
- The Mapping.__init__ signature in tensorrt_llm/mapping.py begins with def __init__(self, world_size=1, rank=0, gpus_per_node=8, *, cp_size=1, cp_config=None, tp_size=1, pp_size=1, ...), confirming that the world_size keyword argument is supported and all of the existing default-ctor parameters remain unchanged.
- The LoraManager.load_from_ckpt method in tensorrt_llm/lora_manager.py is declared as def load_from_ckpt(..., runtime_mapping: Optional[Mapping] = None, ...), so it fully accepts a runtime_mapping argument without breaking existing behavior.
No further changes are needed for the constructor or loader signatures.
```python
if hasattr(self._executor_config,
           "mapping") and self._executor_config.mapping is not None:
    mapping = self._executor_config.mapping
    # Calculate world_size to satisfy the constraint: world_size = tp_size * pp_size * cp_size
    world_size = mapping.tp_size * mapping.pp_size * mapping.cp_size

    runtime_mapping = Mapping(
        world_size=world_size,  # ← Add world_size
        tp_size=mapping.tp_size,
        pp_size=mapping.pp_size,
        cp_size=mapping.cp_size,
        rank=mapping.rank,
        gpus_per_node=mapping.gpus_per_node)
else:
    # Fallback to default mapping
    runtime_mapping = Mapping()
```
🛠️ Refactor suggestion
World size calculation likely incorrect and brittle; cp_size may be None, dp_size ignored, and several attrs may be absent.
- Using world_size = tp_size * pp_size * cp_size will fail if cp_size is None (it is optional elsewhere in this file) and likely undercounts when dp_size exists. Many runtimes define world_size = tp * pp * dp and treat CP orthogonally; include CP only if it's part of the global process group.
- Accessing mapping.rank and mapping.gpus_per_node without fallbacks can raise if those fields are missing.
- Falling back to Mapping() with no args risks a runtime break if the constructor now requires world_size (or other) or if its defaults differ from the LoRA manager's expectations.
Refactor to:
- Prefer exec_map.world_size if present; otherwise compute with safe defaults, including dp_size.
- Treat cp_size as 1 unless you can prove it participates in the global PG (introduce a boolean gate if available).
- Provide robust fallbacks for rank and gpus_per_node.
- When no mapping is available, pass None so the downstream loader can apply its own defaults.
Suggested patch:
- from tensorrt_llm.mapping import Mapping
- if hasattr(self._executor_config,
- "mapping") and self._executor_config.mapping is not None:
- mapping = self._executor_config.mapping
- # Calculate world_size to satisfy the constraint: world_size = tp_size * pp_size * cp_size
- world_size = mapping.tp_size * mapping.pp_size * mapping.cp_size
-
- runtime_mapping = Mapping(
- world_size=world_size, # ← Add world_size
- tp_size=mapping.tp_size,
- pp_size=mapping.pp_size,
- cp_size=mapping.cp_size,
- rank=mapping.rank,
- gpus_per_node=mapping.gpus_per_node)
- else:
- # Fallback to default mapping
- runtime_mapping = Mapping()
+ from tensorrt_llm.mapping import Mapping as RuntimeMapping
+ exec_map = getattr(self._executor_config, "mapping", None)
+ if exec_map is not None:
+ # Prefer explicit world_size if available on the mapping.
+ world_size = getattr(exec_map, "world_size", None)
+ if world_size is None:
+ tp = getattr(exec_map, "tp_size", 1) or 1
+ pp = getattr(exec_map, "pp_size", 1) or 1
+ dp = getattr(exec_map, "dp_size", 1) or 1
+ cp = getattr(exec_map, "cp_size", 1) or 1
+ # Include CP in world_size only if it participates in the global PG.
+ include_cp = getattr(exec_map, "cp_participates_in_global_pg", False)
+ world_size = tp * pp * dp * (cp if include_cp else 1)
+
+ runtime_mapping = RuntimeMapping(
+ world_size=world_size,
+ tp_size=getattr(exec_map, "tp_size", 1),
+ pp_size=getattr(exec_map, "pp_size", 1),
+ cp_size=getattr(exec_map, "cp_size", 1),
+ rank=getattr(exec_map, "rank", self.global_rank),
+ gpus_per_node=getattr(exec_map, "gpus_per_node", torch.cuda.device_count()),
+ )
+ else:
+ # Let downstream choose sane defaults when no executor mapping is provided.
+ runtime_mapping = None
Committable suggestion skipped: line range outside the PR's diff.
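A tiny numeric illustration of the proposed world-size arithmetic, using made-up parallelism degrees; whether CP joins the product depends on the (assumed) cp_participates_in_global_pg gate from the sketch above:

```python
tp_size, pp_size, dp_size, cp_size = 4, 2, 1, 2
include_cp = False  # count CP only if it participates in the global process group

world_size = tp_size * pp_size * dp_size * (cp_size if include_cp else 1)
assert world_size == 8  # would be 16 with include_cp=True
```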
Resolved in #7231
Description
[https://nvbugs/5463720][fix]: Naively handle per-layer MLP dimensions for Nemotron 49B
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
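For example, a few illustrative invocations built only from the flags documented above (the stage and GPU names are the placeholder examples from the help text, not a recommendation):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"
/bot run --stage-list "A10-PyTorch-1" --detailed-log
/bot kill
```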
Summary by CodeRabbit
New Features
Chores
Tests