[TRTLLM-7346][fix] Improve performance of PyTorchModelEngine._get_lora_params_from_requests #7033
Conversation
📝 Walkthrough
Adds an explicit type annotation to an internal LlmRequest attribute and refactors internal LoRA aggregation: removes per-module DoRA (is_dora) handling and consolidates per-request entries into per-module tensors.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
participant Client
participant ModelEngine as ME
participant TMP as tmp_lora_params
participant AGG as lora_params
participant Torch as Torch
Client->>ME: submit request (may include LoRA modules)
ME->>TMP: record per-(req,layer,module) entries
note right of TMP #DDFFDD: entry = {adapter_size, weight_pointers}\n(scaling_vec_pointer -> 0 if None)
loop per layer/module
alt tmp has entry for request
TMP->>AGG: append adapter_size and weight_pointers
else
AGG->>AGG: append defaults (0, [0,0,0])
end
end
ME->>Torch: convert adapter_size lists -> IntTensor
ME->>Torch: convert weight_pointers lists -> LongTensor
    Torch-->>ME: tensors ready for runtime
```
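To make the flow above concrete, here is a minimal, self-contained sketch of the consolidation step. It is an illustration only: `consolidate_lora_params`, `Key`, `request_ids`, and `layer_module_ids` are hypothetical helpers standing in for the real engine internals (`request_list`, `LlmRequest`, `TaskLayerModuleConfig`).

```python
import torch
from typing import Dict, List, Tuple

Key = Tuple[int, int, int]  # (request_id, layer_id, module_id)


def consolidate_lora_params(tmp_lora_params: Dict[Key, dict],
                            request_ids: List[int],
                            layer_module_ids: List[Tuple[int, int]]) -> dict:
    """Flatten per-(request, layer, module) entries into per-module tensors."""
    lora_params: dict = {}
    for layer_id, module_id in layer_module_ids:
        module_params = lora_params.setdefault(layer_id, {}).setdefault(
            module_id, {'adapter_size': [], 'weight_pointers': []})
        for req_id in request_ids:
            entry = tmp_lora_params.get((req_id, layer_id, module_id))
            if entry is None:
                # Request has no LoRA entry for this module: pad with zeros
                # to keep per-request alignment across modules.
                module_params['adapter_size'].append(0)
                module_params['weight_pointers'].extend((0, 0, 0))
            else:
                module_params['adapter_size'].extend(entry['adapter_size'])
                module_params['weight_pointers'].extend(entry['weight_pointers'])
    # Convert aggregated lists into the tensors the runtime expects.
    for modules in lora_params.values():
        for module_params in modules.values():
            module_params['adapter_size'] = torch.IntTensor(
                module_params['adapter_size'])
            module_params['weight_pointers'] = torch.LongTensor(
                module_params['weight_pointers'])
    return lora_params
```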
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
/bot run
Actionable comments posted: 2
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
1-1: Check for required NVIDIA copyright header.

Per repository guidelines, source files should prepend the NVIDIA copyright header for the current year. This file appears to be missing it.
Would you like me to generate the correct header block to add at the top?
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
2067-2071: Ensure per-request alignment across modules stays consistent.

The default path appends zeros when a request lacks a module entry, which is correct to preserve alignment. However, consider asserting equal lengths after consolidation to catch subtle mismatches early during development.
If you want, I can add a lightweight consistency check (guarded by an env flag) that validates each module’s lists are length==len(request_list) after the loops.
Also applies to: 2075-2091
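As a rough sketch of that idea (the `TLLM_LORA_DEBUG_CHECKS` flag name is made up here; the check assumes the list-based layout before tensor conversion):

```python
import os


def _assert_lora_params_aligned(lora_params: dict, num_requests: int) -> None:
    """Debug-only sanity check: every module's lists must cover every request."""
    if os.environ.get("TLLM_LORA_DEBUG_CHECKS", "0") != "1":
        return
    for layer_id, modules in lora_params.items():
        for module_id, params in modules.items():
            assert len(params['adapter_size']) == num_requests, (
                f"adapter_size misaligned for layer {layer_id}, module {module_id}")
            assert len(params['weight_pointers']) == 3 * num_requests, (
                f"weight_pointers misaligned for layer {layer_id}, module {module_id}")
```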
2100-2104: Minor: use explicit truthiness only if intentional.

You switched to `if lora_params:` further down while earlier call sites check `if bool(lora_params)`. Behavior is the same here, but please keep it consistent across the file to avoid confusion.
```diff
-        if lora_params:
+        if bool(lora_params):
             lora_params['host_request_types'] = attn_metadata.host_request_types
             lora_params['prompt_lens_cpu'] = attn_metadata.prompt_lens_cpu
             lora_params['num_seqs'] = attn_metadata.num_seqs
```
2025-2061: Naming consistency: weights_pointer vs weight_pointers.

Within tmp_lora_params the key is `weights_pointer`, but the consolidated dict uses `weight_pointers`. It's correct here but easy to trip on later. Consider standardizing the key name across both structures.
I can provide a follow-up patch to standardize on weight_pointers throughout this function.
2008-2024: Docstring should reflect final dtypes and shapes.

The docstring states `is_dora` is a torch tensor (bool), but the current code leaves it as a list unless fixed per the above. Also clarify that `weight_pointers` is a flattened 1D int64 tensor of length 3 * num_requests (per module).
I can update the docstring accordingly once the is_dora tensorization change is in.
1-1: Check for required NVIDIA copyright header.

Per repository guidelines, source files should prepend the NVIDIA copyright header for the current year. This file appears to be missing it.
I can add the proper header block if desired.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
331-333: Good addition: explicit typing for py_lora_task_layer_module_configs.

The type is clear and consistent with downstream usage in model_engine. No behavioral change introduced.
331-333: All references to the old attribute removed.

I've verified that `py_lora_task_layer_configs` no longer appears anywhere, and only the new `py_lora_task_layer_module_configs` is used in:
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
No further changes are needed.
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
2042-2047: Initialization structure LGTM and matches the new per-module consolidation design.
2031-2041: Confirm producer of TaskLayerModuleConfig provides stable pointers.

Given the known LoRA adapter cache race noted previously in executor/worker.py, please double-check that weights_in_pointer/out_pointer/scaling_vec_pointer are stable across the consolidation window here. Any reuse or deferred population could lead to stale addresses being captured.
Would you like a short script that scans for the adapter CPU cache optimization sites to cross-check ordering with this consolidation?
PR_Github #15766 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
2048-2060: Potential inversion: DoRA detection likely should be based on scaling vector presence (!= 0), not absence.

Conventionally, DoRA uses a non-null scaling vector; LoRA does not. Current code sets `is_dora` to True when `scaling_vec_pointer == 0`. If there isn't a counter-convention downstream, flip the predicate.

Apply this diff:
```diff
-                    tmp_lora_params[(request.py_request_id, layer_id,
-                                     module_id)] = {
-                        'adapter_size': [module.adapter_size],
-                        'is_dora': [scaling_vec_pointer == 0],
+                    tmp_lora_params[(request.py_request_id, layer_id,
+                                     module_id)] = {
+                        'adapter_size': [module.adapter_size],
+                        # DoRA iff a scaling vector exists
+                        'is_dora': [scaling_vec_pointer != 0],
                         'weight_pointers': [
                             module.weights_in_pointer,
                             module.weights_out_pointer, scaling_vec_pointer
                         ],
                     }
```

To double-check the impact, run:
```bash
#!/bin/bash
# Verify whether 'is_dora' is consumed anywhere (so flipping the predicate cannot break usage).
rg -n -C3 --type=py '\bis_dora\b'

# Inspect lora_grouped_gemm usage/params.
rg -n -C3 --type=py '\blora_grouped_gemm\b'
```
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
2067-2071: Use append/extend over list concatenation for clarity and a tiny perf win.

`+=` with lists allocates intermediate lists; `append`/`extend` is clearer and marginally cheaper in tight loops.

Diff 1 (defaults):

```diff
-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['weight_pointers'].extend((0, 0, 0))
```

Diff 2 (merging actual entries):

```diff
-                        current_lora_params[
-                            'adapter_size'] += current_tmp_lora_params[
-                                'adapter_size']
-                        current_lora_params[
-                            'is_dora'] += current_tmp_lora_params['is_dora']
-                        current_lora_params[
-                            'weight_pointers'] += current_tmp_lora_params[
-                                'weight_pointers']
+                        current_lora_params['adapter_size'].append(
+                            current_tmp_lora_params['adapter_size'][0])
+                        current_lora_params['is_dora'].append(
+                            current_tmp_lora_params['is_dora'][0])
+                        current_lora_params['weight_pointers'].extend(
+                            current_tmp_lora_params['weight_pointers'])
```

Also applies to: 2075-2090
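A quick, standalone illustration of the allocation difference (numbers are machine-dependent; the point is the temporary `[0, 0, 0]` list built on every `+=` versus the constant tuple passed to `extend`):

```python
import timeit

print("+= [0, 0, 0]      :",
      timeit.timeit("lst += [0, 0, 0]", setup="lst = []", number=1_000_000))
print("extend((0, 0, 0)) :",
      timeit.timeit("lst.extend((0, 0, 0))", setup="lst = []", number=1_000_000))
```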
2062-2092: Optional: pre-size per-module arrays by request count to avoid repeated growth.

Given this is on the hot path, consider allocating once per (layer_id, module_id) with zero-filled arrays sized to `len(request_list)` and filling by `req_idx` (and `3*req_idx:3*req_idx+3` for weight pointers). This eliminates repeated appends/concats and can reduce Python overhead in larger batches.

I can provide a targeted refactor if you want to pursue this micro-optimization.
2095-2101: Doc/type mismatch for is_dora; deliberate non-tensor is fine, but update the docstring or gate the conversion.

You intentionally leave `is_dora` as a Python list (good for perf if unused today), but the function's docstring says "torch tensor: bool". Either update the docstring to reflect the current type or convert conditionally when DoRA is supported.

If you prefer to update the docstring now (outside this hunk), here's a minimal tweak:

```python
def _get_lora_params_from_requests(...):
    '''
    ...
    is_dora: list[bool]  # TODO: convert to torch.bool tensor when DoRA path uses it
    ...
    '''
```

Alternatively, when enabling DoRA, add:

```python
current_lora_params['is_dora'] = torch.tensor(
    current_lora_params['is_dora'], dtype=torch.bool)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
2051-2053: Good move: tuple keys + per-request tmp map reduce string allocations and lookups.

Keying `tmp_lora_params` by `(request_id, layer_id, module_id)` avoids repeated string formatting and typically hashes faster than formatted strings, aligning with the PR's perf goals.
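A standalone micro-benchmark of the two keying strategies (illustrative only; the real entries hold dicts rather than None, and absolute timings vary by machine):

```python
import timeit

request_ids, layer_ids, module_ids = range(16), range(32), range(4)


def build_with_str_keys():
    d = {}
    for r in request_ids:
        for layer in layer_ids:
            for m in module_ids:
                d[f"{r}_{layer}_{m}"] = None  # formatted-string key
    return d


def build_with_tuple_keys():
    d = {}
    for r in request_ids:
        for layer in layer_ids:
            for m in module_ids:
                d[(r, layer, m)] = None  # tuple key, no string formatting
    return d


print("str keys  :", timeit.timeit(build_with_str_keys, number=1000))
print("tuple keys:", timeit.timeit(build_with_tuple_keys, number=1000))
```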
2102-2106: LGTM: attaching host_request_types/prompt_lens_cpu/num_seqs only when LoRA present.

Keeps payload lean and preserves previous behavior. Minor nit: `if lora_params:` vs `if bool(lora_params):` is equivalent; the change is fine.
Force-pushed 68cee80 to 8b38d50
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
2041-2044: Minor: reduce chained dict lookups for lora_params initialization.

You can slightly trim Python-level overhead by caching the layer dict locally (or using setdefault) before checking/setting the module key.

Example (no need to change lines above if you prefer clarity over micro-opts):

```python
layer_map = lora_params.setdefault(layer_id, {})
if module_id not in layer_map:
    layer_map[module_id] = {'adapter_size': [], 'weight_pointers': []}
```
2064-2067: Prefer extend() with a tuple to avoid constructing temporary lists repeatedly.

Use `extend((0, 0, 0))` instead of `+= [0, 0, 0]` to skip allocating a new list object each time.

```diff
-                        current_lora_params['adapter_size'].append(0)
-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['adapter_size'].append(0)
+                        current_lora_params['weight_pointers'].extend((0, 0, 0))
```
2071-2083: Use extend() instead of += for lists; minor micro-alloc optimization.

Same rationale as above; also keeps style uniform.

```diff
-                    current_tmp_lora_params = tmp_lora_params.get(
-                        (request.py_request_id, layer_id, module_id), None)
+                    current_tmp_lora_params = tmp_lora_params.get(
+                        (request.py_request_id, layer_id, module_id), None)
                     current_lora_params = lora_params[layer_id][module_id]
                     if current_tmp_lora_params is None:
-                        current_lora_params['adapter_size'].append(0)
-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['adapter_size'].append(0)
+                        current_lora_params['weight_pointers'].extend((0, 0, 0))
                     else:
-                        current_lora_params[
-                            'adapter_size'] += current_tmp_lora_params[
-                                'adapter_size']
-                        current_lora_params[
-                            'weight_pointers'] += current_tmp_lora_params[
-                                'weight_pointers']
+                        current_lora_params['adapter_size'].extend(
+                            current_tmp_lora_params['adapter_size'])
+                        current_lora_params['weight_pointers'].extend(
+                            current_tmp_lora_params['weight_pointers'])
```
2087-2091: Optional: be explicit with dtypes and consider pinning for H2D speed.

`torch.IntTensor`/`torch.LongTensor` are fine, but using `torch.tensor(..., dtype=...)` is explicit, and you can set `pin_memory=True` in case these tensors are transferred to CUDA frequently.

```diff
-                    current_lora_params['adapter_size'] = torch.IntTensor(
-                        current_lora_params['adapter_size'])
-                    current_lora_params['weight_pointers'] = torch.LongTensor(
-                        current_lora_params['weight_pointers'])
+                    current_lora_params['adapter_size'] = torch.tensor(
+                        current_lora_params['adapter_size'], dtype=torch.int32, pin_memory=True)
+                    current_lora_params['weight_pointers'] = torch.tensor(
+                        current_lora_params['weight_pointers'], dtype=torch.int64, pin_memory=True)
```

If these tensors stay on CPU and are not copied to CUDA, feel free to ignore the pinning bit. Otherwise, this typically helps.
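For context, a tiny example of why pinning can pay off when the tensors do move to the GPU (requires a CUDA device, hence the guard):

```python
import torch

if torch.cuda.is_available():
    # pin_memory=True places the tensor in page-locked host memory,
    # which allows the asynchronous host-to-device copy below.
    adapter_size = torch.tensor([8, 0, 16], dtype=torch.int32, pin_memory=True)
    adapter_size_gpu = adapter_size.to('cuda', non_blocking=True)
```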
2093-2096: Docstring now diverges: top-level keys are missing.

You add `host_request_types`, `prompt_lens_cpu`, and `num_seqs` at the top level of `lora_params`, but the docstring above only documents the per-layer/per-module structure. Consider updating it for clarity and to prevent misuse.

Proposed docstring patch (structure-only):

```python
'''
lora_params: dict
{
    layer_id: {
        module_id: {
            adapter_size: torch.Tensor[int32],    # shape: [num_requests]
            weight_pointers: torch.Tensor[int64], # shape: [3 * num_requests]
        }
    },
    # Extra metadata aligned with request_list order:
    host_request_types: list[int],
    prompt_lens_cpu: list[int],
    num_seqs: int,
}
'''
```
2041-2057: Optional: preallocate per-module arrays to avoid repeated appends.

If you want to squeeze a bit more perf, you could (see the sketch below):

- Compute `N = len(request_list)` once.
- For each `(layer_id, module_id)` in the union, preallocate `adapter_size = [0] * N` and `weight_pointers = [0] * (3 * N)`.
- On the first pass, fill by index using a `req_index` map instead of appending/extend. This removes list growth and reduces Python allocations.

Happy to provide a concrete refactor if you'd like to go this route.
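One possible shape of that refactor, sketched under the assumption that `request_list` and `tmp_lora_params` look as described above (names such as `req_index` are illustrative):

```python
N = len(request_list)
req_index = {request.py_request_id: i for i, request in enumerate(request_list)}

lora_params = {}
# Iterating the temporary map fills only the (layer, module) pairs actually seen,
# while the zero-filled preallocation keeps absent requests aligned.
for (req_id, layer_id, module_id), entry in tmp_lora_params.items():
    module_params = lora_params.setdefault(layer_id, {}).setdefault(
        module_id, {'adapter_size': [0] * N, 'weight_pointers': [0] * (3 * N)})
    i = req_index[req_id]
    module_params['adapter_size'][i] = entry['adapter_size'][0]
    module_params['weight_pointers'][3 * i:3 * i + 3] = entry['weight_pointers']
```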
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/peft/lora/layer.py (0 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py (0 hunks)
💤 Files with no reviewable changes (2)
- tensorrt_llm/_torch/peft/lora/layer.py
- tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
2046-2057: Good call on tuple keys; avoids f-string formatting and reduces lookups.

Using `(request_id, layer_id, module_id)` as the key in `tmp_lora_params` is a clean and low-overhead way to index per-request/per-module entries. This aligns well with the PR objective and should contribute to the reported speedup.
2046-2057: Lora params no longer include `is_dora`, and per-request ordering aligns with `host_request_types`.

- We only build `tmp_lora_params` and then flatten into `lora_params` without ever emitting an `is_dora` key, and no consumer in _torch/pyexecutor/model_engine.py or in its downstream calls references `lora_params['is_dora']`.
- The two-pass over `scheduled_requests.context_requests + scheduled_requests.generation_requests` produces adapter sizes and weight pointers in the same order as `attn_metadata.host_request_types` (0s for context, then 1s for generation).

All good here.
PR_Github #15783 [ run ] triggered by Bot

PR_Github #15766 [ run ] completed with state
Actionable comments posted: 0
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
2041-2044: Initialization of per-layer/module aggregation looks good; minor micro-optimization possible.

The structure {'adapter_size': [], 'weight_pointers': []} is clear and aligns with the refactor. If you want to squeeze a bit more from CPython dict lookups, you can reduce repeated indexing by using a local handle for the layer map and/or setdefault (optional; readability trade-off).
I can propose a concise refactor that preserves behavior if you'd like.
2064-2067: Zero-fill keeps per-request alignment; use extend for clarity.

The zero-padding logic maintains per-request alignment across modules. As a small readability/intent improvement, prefer list.extend over +=.
Apply this diff:
```diff
-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['weight_pointers'].extend([0, 0, 0])
```
2071-2083: Aggregation path is correct; tiny reductions in overhead are possible.

The use of a temporary map plus a single pass over the union ensures stable ordering and avoids repeated lookups. Two micro-nits if you want to chip away a few more cycles:
- Hoist request.py_request_id to a local req_id.
- Replace repeated tuple construction in get(...) with a prebuilt key.
For example:
```python
# before iterating layer/module
req_id = request.py_request_id
# ...
key = (req_id, layer_id, module_id)
current_tmp_lora_params = tmp_lora_params.get(key)
```
2075-2076: Mirror the extend nit for the present-but-missing module case.

Same as above, prefer extend over += for intent clarity.
Apply this diff:
```diff
-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['weight_pointers'].extend([0, 0, 0])
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tensorrt_llm/_torch/peft/lora/layer.py (0 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py (0 hunks)
💤 Files with no reviewable changes (2)
- tensorrt_llm/_torch/peft/lora/layer.py
- tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (4)
📚 Learning: 2025-08-19T12:45:11.951Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.951Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-19T12:45:35.390Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:2086-2092
Timestamp: 2025-08-19T12:45:35.390Z
Learning: DoRA (Delta Orthogonal Rank Adaptation) functionality has been removed from the PyTorch flow in tensorrt_llm/_torch/pyexecutor/model_engine.py. The is_dora field is computed but not used downstream in the PyTorch flow, so converting it to a tensor would be wasteful overhead.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
2046-2057: Tuple key adoption and pointer normalization are on point.

Switching to (request_id, layer_id, module_id) tuple keys removes format-string overhead and cuts hashing/allocations. Normalizing scaling_vec_pointer None → 0 is consistent with removing DoRA while still keeping a stable 3-pointer tuple.
2087-2091: Tensorization LGTM; confirm downstream shape expectations for weight_pointers.

Using IntTensor/LongTensor is fine and consistent with pointer sizes. Weight pointers end up as a flat 1D tensor of length 3 * num_requests. If downstream expects a 2D [num_requests, 3] view, you may want to reshape here to avoid repeated view ops later. If consumers already handle 1D, keep as-is.
If you want to verify usage sites, I can generate a quick repo scan to check whether consumers reshape weight_pointers into (N, 3) or expect 1D.
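If a 2D view ever turns out to be needed, it is a zero-copy reshape of the flat tensor; a small standalone illustration:

```python
import torch

num_requests = 4
# Flat layout: [weights_in_ptr, weights_out_ptr, scaling_vec_ptr] per request.
weight_pointers = torch.arange(3 * num_requests, dtype=torch.int64)
weight_pointers_2d = weight_pointers.view(num_requests, 3)  # shares storage, no copy
```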
2093-2093: Conditional attachment of metadata is correct.

Gating host_request_types/prompt_lens_cpu/num_seqs on lora_params being non-empty preserves prior semantics and avoids extraneous payloads.
PR_Github #15783 [ run ] completed with state
shaharmor98 left a comment:
LGTM
Force-pushed 8b38d50 to b90a55d
/bot run

PR_Github #15887 [ run ] triggered by Bot

PR_Github #15887 [ run ] completed with state
Force-pushed b90a55d to 85bf438
/bot run

PR_Github #15939 [ run ] triggered by Bot

PR_Github #15939 [ run ] completed with state
Force-pushed 85bf438 to 24593e3
/bot run

PR_Github #16023 [ run ] triggered by Bot

PR_Github #16023 [ run ] completed with state
Force-pushed 24593e3 to d51da9c
/bot run

PR_Github #16068 [ run ] triggered by Bot

PR_Github #16068 [ run ] completed with state
Force-pushed d51da9c to 500a336
/bot run

PR_Github #16297 [ run ] triggered by Bot

PR_Github #16297 [ run ] completed with state
…sts, add type hint to LlmRequest.py_lora_task_layer_module_configs Signed-off-by: Amit Zuker <[email protected]>
…tensor conversion for is_dora Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Force-pushed 500a336 to 8439b60
/bot run

PR_Github #16310 [ run ] triggered by Bot

PR_Github #16310 [ run ] completed with state
…a_params_from_requests (NVIDIA#7033) Signed-off-by: Amit Zuker <[email protected]>
Description
Improve the runtime performance of `PyTorchModelEngine._get_lora_params_from_requests`.

Performance improvement of these changes when benchmarked on Llama-3.1-8B-Instruct-FP8, TP=1, BS=16: `PyTorchModelEngine._get_lora_params_from_requests` run time reduced by ~33%.

In addition, some DoRA-related code was removed, as it is not supported in the PyTorch flow.
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
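For example, based on the options listed above, a typical invocation could be `/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"` to run a single stage without fail-fast.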
kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Refactor
Chores
Tests