
Conversation

@amitz-nv
Collaborator

@amitz-nv amitz-nv commented Aug 19, 2025

Description

Improve the runtime performance of PyTorchModelEngine._get_lora_params_from_requests by:

  1. Changing the dict key from a formatted string containing request_id, layer_id and module_id to a tuple.
  2. Avoiding repeated dict lookups.

Performance improvements from these changes, benchmarked on Llama-3.1-8B-Instruct-FP8 with TP=1, BS=16:

  • PyTorchModelEngine._get_lora_params_from_requests run time reduced by ~33%.
  • forward step run time reduced by ~10%.

In addition, some DoRA-related code was removed, since DoRA is not supported in the PyTorch flow.
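
For illustration, a minimal sketch of the new keying and aggregation scheme, using simplified stand-in data rather than the real per-request TaskLayerModuleConfig objects (field names and values here are placeholders):

import torch

# Stand-in per-request LoRA entries: (request_id, layer_id, module_id, adapter_size, in_ptr, out_ptr, scaling_ptr)
entries = [
    (0, 0, 0, 8, 0x1000, 0x2000, None),
    (1, 0, 0, 16, 0x3000, 0x4000, None),
]
request_ids = [0, 1, 2]  # request 2 has no LoRA module here

# Tuple keys avoid formatting and hashing a "{req}_{layer}_{module}" string per entry.
tmp_lora_params = {}
for req_id, layer_id, module_id, size, in_ptr, out_ptr, scale_ptr in entries:
    tmp_lora_params[(req_id, layer_id, module_id)] = {
        'adapter_size': [size],
        'weight_pointers': [in_ptr, out_ptr, scale_ptr or 0],  # normalize None -> 0
    }

# Aggregate per (layer, module): a single .get() per request instead of an `in` check plus indexing,
# pad requests without this module with zeros, then convert the lists to tensors once.
agg = {'adapter_size': [], 'weight_pointers': []}
for req_id in request_ids:
    entry = tmp_lora_params.get((req_id, 0, 0))
    if entry is None:
        agg['adapter_size'].append(0)
        agg['weight_pointers'] += [0, 0, 0]
    else:
        agg['adapter_size'] += entry['adapter_size']
        agg['weight_pointers'] += entry['weight_pointers']

adapter_size = torch.IntTensor(agg['adapter_size'])         # shape [num_requests]
weight_pointers = torch.LongTensor(agg['weight_pointers'])  # flat, 3 * num_requests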

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
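
For example, a pre-merge run limited to one test stage with fail-fast disabled might look like:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast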

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

  • Refactor

    • Consolidated internal LoRA parameter aggregation so per-layer/module data is merged consistently and converted to tensors; removed an obsolete internal per-module flag from the runtime/forward path.
  • Chores

    • Added an explicit type annotation to an internal attribute; no behavioral or public API changes.
  • Tests

    • Updated unit test LoRA configurations to match the new internal representation.

@amitz-nv amitz-nv self-assigned this Aug 19, 2025
@amitz-nv amitz-nv requested a review from a team as a code owner August 19, 2025 09:59
@amitz-nv amitz-nv requested a review from HuiGao-NV August 19, 2025 09:59
@coderabbitai
Contributor

coderabbitai bot commented Aug 19, 2025

📝 Walkthrough

Adds an explicit type annotation to an internal LlmRequest attribute and refactors internal LoRA aggregation: removes per-module is_dora, consolidates per-request LoRA entries into a per-(layer,module) aggregated structure, pads missing entries with zeros, normalizes null pointers to 0, and converts aggregated lists to torch tensors.

Changes

Cohort / File(s): Summary of changes

  • LlmRequest type annotation (tensorrt_llm/_torch/pyexecutor/llm_request.py):
    Added explicit type: `self.py_lora_task_layer_module_configs: list[tensorrt_llm.bindings.internal.runtime.TaskLayerModuleConfig]`
  • LoRA params aggregation refactor, ModelEngine (tensorrt_llm/_torch/pyexecutor/model_engine.py):
    Removed per-module is_dora; store per-request values keyed by (request_id, layer_id, module_id) as {'adapter_size': [...], 'weight_pointers': [...]} (normalize scaling_vec_pointer None → 0); aggregate into lora_params[layer_id][module_id], pad missing request slots with zeros, then convert adapter_size → torch.IntTensor and weight_pointers → torch.LongTensor. Docstring updated; public signatures unchanged.
  • LoraLayer forward simplified (tensorrt_llm/_torch/peft/lora/layer.py):
    Removed is_dora lookup/use in forward; forward now uses only adapter_size and weight_pointers to construct ranks and pointers for grouped_gemm.
  • Tests updated (tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py):
    Removed explicit is_dora: False entries for Q/K/V/DENSE module definitions to match the new LoRA params shape. No other test logic changed.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant ModelEngine as ME
  participant TMP as tmp_lora_params
  participant AGG as lora_params
  participant Torch as Torch

  Client->>ME: submit request (may include LoRA modules)
  ME->>TMP: record per-(req,layer,module) entries
  note right of TMP #DDFFDD: entry = {adapter_size, weight_pointers}\n(scaling_vec_pointer -> 0 if None)

  loop per layer/module
    alt tmp has entry for request
      TMP->>AGG: append adapter_size and weight_pointers
    else
      AGG->>AGG: append defaults (0, [0,0,0])
    end
  end

  ME->>Torch: convert adapter_size lists -> IntTensor
  ME->>Torch: convert weight_pointers lists -> LongTensor
  Torch-->>ME: tensors ready for runtime

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • Naveassaf
  • shaharmor98
  • venkywonka
  • yilin-void



📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 500a336 and 8439b60.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/peft/lora/layer.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py (0 hunks)
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check


@amitz-nv
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

1-1: Check for required NVIDIA copyright header.

Per repository guidelines, source files should prepend the NVIDIA copyright header for the current year. This file appears to be missing it.

Would you like me to generate the correct header block to add at the top?

tensorrt_llm/_torch/pyexecutor/model_engine.py (5)

2067-2071: Ensure per-request alignment across modules stays consistent.

The default-path appends zeros when a request lacks a module entry, which is correct to preserve alignment. However, consider asserting equal lengths after consolidation to catch subtle mismatches early during development.

If you want, I can add a lightweight consistency check (guarded by an env flag) that validates each module’s lists are length==len(request_list) after the loops.

Also applies to: 2075-2091
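
For instance, such a check could take roughly this shape (the env-flag name and the placement right after the consolidation loops are illustrative, not part of the current code):

import os

def check_lora_params_alignment(lora_params: dict, num_requests: int) -> None:
    """Debug-only sketch: verify each module's lists cover every scheduled request."""
    if os.environ.get("TLLM_LORA_CONSISTENCY_CHECK") != "1":  # hypothetical flag name
        return
    for layer_id, modules in lora_params.items():
        for module_id, params in modules.items():
            assert len(params['adapter_size']) == num_requests, (layer_id, module_id)
            assert len(params['weight_pointers']) == 3 * num_requests, (layer_id, module_id)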


2100-2104: Minor: use explicit truthiness only if intentional.

You switched to if lora_params: further down while earlier call sites check if bool(lora_params). Behavior is the same here, but please keep it consistent across the file to avoid confusion.

-        if lora_params:
+        if bool(lora_params):
             lora_params['host_request_types'] = attn_metadata.host_request_types
             lora_params['prompt_lens_cpu'] = attn_metadata.prompt_lens_cpu
             lora_params['num_seqs'] = attn_metadata.num_seqs

2025-2061: Naming consistency: weights_pointer vs weight_pointers.

Within tmp_lora_params the key is weights_pointer but the consolidated dict uses weight_pointers. It’s correct here but easy to trip on later. Consider standardizing the key name across both structures.

I can provide a follow-up patch to standardize on weight_pointers throughout this function.


2008-2024: Docstring should reflect final dtypes and shapes.

The docstring states is_dora is a torch tensor (bool), but current code leaves it as list unless fixed per above. Also clarify that weight_pointers is a flattened 1D int64 tensor of length 3 * num_requests (per module).

I can update the docstring accordingly once the is_dora tensorization change is in.


1-1: Check for required NVIDIA copyright header.

Per repository guidelines, source files should prepend the NVIDIA copyright header for the current year. This file appears to be missing it.

I can add the proper header block if desired.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 953f4fd and eb4c815.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)

331-333: Good addition: explicit typing for py_lora_task_layer_module_configs.

The type is clear and consistent with downstream usage in model_engine. No behavioral change introduced.


331-333: All references to the old attribute removed

I’ve verified that py_lora_task_layer_configs no longer appears anywhere, and only the new py_lora_task_layer_module_configs is used in:

  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py

No further changes are needed.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

2042-2047: Initialization structure LGTM and matches the new per-module consolidation design.


2031-2041: Confirm producer of TaskLayerModuleConfig provides stable pointers.

Given the known LoRA adapter cache race noted previously in executor/worker.py, please double-check that weights_in_pointer/out_pointer/scaling_vec_pointer are stable across the consolidation window here. Any reuse or deferred population could lead to stale addresses being captured.

Would you like a short script that scans for the adapter CPU cache optimization sites to cross-check ordering with this consolidation?

@tensorrt-cicd
Collaborator

PR_Github #15766 [ run ] triggered by Bot

@amitz-nv amitz-nv requested review from shaharmor98 and removed request for HuiGao-NV August 19, 2025 11:30
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

2048-2060: Potential inversion: DoRA detection likely should be based on scaling vector presence (!= 0), not absence.

Conventionally, DoRA uses a non-null scaling vector; LoRA does not. Current code sets is_dora to True when scaling_vec_pointer == 0. If there isn’t a counter-convention downstream, flip the predicate.

Apply this diff:

-                tmp_lora_params[(request.py_request_id, layer_id,
-                                 module_id)] = {
-                    'adapter_size': [module.adapter_size],
-                    'is_dora': [scaling_vec_pointer == 0],
+                tmp_lora_params[(request.py_request_id, layer_id,
+                                 module_id)] = {
+                    'adapter_size': [module.adapter_size],
+                    # DoRA iff a scaling vector exists
+                    'is_dora': [scaling_vec_pointer != 0],
                     'weight_pointers': [
                         module.weights_in_pointer,
                         module.weights_out_pointer,
                         scaling_vec_pointer
                     ],
                 }

To double-check the impact, run:

#!/bin/bash
# Verify whether 'is_dora' is consumed anywhere (so flipping the predicate cannot break usage).
rg -n -C3 --type=py '\bis_dora\b'

# Inspect lora_grouped_gemm usage/params.
rg -n -C3 --type=py '\blora_grouped_gemm\b'
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

2067-2071: Use append/extend over list concatenation for clarity and a tiny perf win.

+= with lists allocates intermediate lists; append/extend is clearer and marginally cheaper in tight loops.

Diff 1 (defaults):

-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['weight_pointers'].extend((0, 0, 0))

Diff 2 (merging actual entries):

-                            current_lora_params[
-                                'adapter_size'] += current_tmp_lora_params[
-                                    'adapter_size']
-                            current_lora_params[
-                                'is_dora'] += current_tmp_lora_params['is_dora']
-                            current_lora_params[
-                                'weight_pointers'] += current_tmp_lora_params[
-                                    'weight_pointers']
+                            current_lora_params['adapter_size'].append(
+                                current_tmp_lora_params['adapter_size'][0])
+                            current_lora_params['is_dora'].append(
+                                current_tmp_lora_params['is_dora'][0])
+                            current_lora_params['weight_pointers'].extend(
+                                current_tmp_lora_params['weight_pointers'])

Also applies to: 2075-2090


2062-2092: Optional: pre-size per-module arrays by request count to avoid repeated growth.

Given this is on the hot path, consider allocating once per (layer_id, module_id) with zero-filled arrays sized to len(request_list) and fill by req_idx (and 3*req_idx:3*req_idx+3 for weight pointers). This eliminates repeated appends/concats and can reduce Python overhead in larger batches.

I can provide a targeted refactor if you want to pursue this micro-optimization.


2095-2101: Doc/type mismatch for is_dora; deliberate non-tensor is fine but update docstring or gate conversion.

You intentionally leave is_dora as a Python list (good for perf if unused today), but the function’s docstring says “torch tensor: bool”. Either update the docstring to reflect the current type or convert conditionally when DoRA is supported.

If you prefer to update the docstring now (outside this hunk), here’s a minimal tweak:

def _get_lora_params_from_requests(...):
    '''
    ...
                    is_dora: list[bool]  # TODO: convert to torch.bool tensor when DoRA path uses it
    ...
    '''

Alternatively, when enabling DoRA, add:

current_lora_params['is_dora'] = torch.tensor(
    current_lora_params['is_dora'], dtype=torch.bool)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between eb4c815 and d59876e.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

2051-2053: Good move: tuple keys + per-request tmp map reduce string allocations and lookups.

Keying tmp_lora_params by (request_id, layer_id, module_id) avoids repeated string formatting and typically hashes faster than formatted strings, aligning with the PR’s perf goals.


2102-2106: LGTM: attaching host_request_types/prompt_lens_cpu/num_seqs only when LoRA present.

Keeps payload lean and preserves previous behavior. Minor nit: if lora_params: vs if bool(lora_params): is equivalent; the change is fine.

@amitz-nv amitz-nv requested a review from a team as a code owner August 19, 2025 12:39
@amitz-nv amitz-nv force-pushed the dev-improve-perf-get-lora-params-from-requests branch from 68cee80 to 8b38d50 Compare August 19, 2025 12:42
@amitz-nv
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)

2041-2044: Minor: reduce chained dict lookups for lora_params initialization.

You can slightly trim Python-level overhead by caching the layer dict locally (or using setdefault) before checking/setting the module key.

Example (no need to change lines above if you prefer clarity over micro-opts):

layer_map = lora_params.setdefault(layer_id, {})
if module_id not in layer_map:
    layer_map[module_id] = {'adapter_size': [], 'weight_pointers': []}

2064-2067: Prefer extend() with a tuple to avoid constructing temporary lists repeatedly.

Use extend((0, 0, 0)) instead of += [0, 0, 0] to skip allocating a new list object each time.

-                        current_lora_params['adapter_size'].append(0)
-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['adapter_size'].append(0)
+                        current_lora_params['weight_pointers'].extend((0, 0, 0))

2071-2083: Use extend() instead of += for lists; minor micro-alloc optimization.

Same rationale as above; also keeps style uniform.

-                        current_tmp_lora_params = tmp_lora_params.get(
-                            (request.py_request_id, layer_id, module_id), None)
+                        current_tmp_lora_params = tmp_lora_params.get(
+                            (request.py_request_id, layer_id, module_id), None)

                         current_lora_params = lora_params[layer_id][module_id]
                         if current_tmp_lora_params is None:
-                            current_lora_params['adapter_size'].append(0)
-                            current_lora_params['weight_pointers'] += [0, 0, 0]
+                            current_lora_params['adapter_size'].append(0)
+                            current_lora_params['weight_pointers'].extend((0, 0, 0))
                         else:
-                            current_lora_params[
-                                'adapter_size'] += current_tmp_lora_params[
-                                    'adapter_size']
-                            current_lora_params[
-                                'weight_pointers'] += current_tmp_lora_params[
-                                    'weight_pointers']
+                            current_lora_params['adapter_size'].extend(
+                                current_tmp_lora_params['adapter_size'])
+                            current_lora_params['weight_pointers'].extend(
+                                current_tmp_lora_params['weight_pointers'])

2087-2091: Optional: be explicit with dtypes and consider pinning for H2D speed.

torch.IntTensor/torch.LongTensor are fine, but using torch.tensor(..., dtype=...) is explicit and you can set pin_memory=True in case these tensors are transferred to CUDA frequently.

-                current_lora_params['adapter_size'] = torch.IntTensor(
-                    current_lora_params['adapter_size'])
-                current_lora_params['weight_pointers'] = torch.LongTensor(
-                    current_lora_params['weight_pointers'])
+                current_lora_params['adapter_size'] = torch.tensor(
+                    current_lora_params['adapter_size'], dtype=torch.int32, pin_memory=True)
+                current_lora_params['weight_pointers'] = torch.tensor(
+                    current_lora_params['weight_pointers'], dtype=torch.int64, pin_memory=True)

If these tensors stay on CPU and are not copied to CUDA, feel free to ignore the pinning bit. Otherwise, this typically helps.


2093-2096: Docstring now diverges: top-level keys are missing.

You add host_request_types, prompt_lens_cpu, and num_seqs at the top level of lora_params, but the docstring above only documents the per-layer/per-module structure. Consider updating it for clarity and to prevent misuse.

Proposed docstring patch (structure-only):

'''
lora_params: dict
{
    layer_id: {
        module_id: {
            adapter_size: torch.Tensor[int32],    # shape: [num_requests]
            weight_pointers: torch.Tensor[int64], # shape: [3 * num_requests]
        }
    },
    # Extra metadata aligned with request_list order:
    host_request_types: list[int],
    prompt_lens_cpu: list[int],
    num_seqs: int,
}
'''

2041-2057: Optional: preallocate per-module arrays to avoid repeated appends.

If you want to squeeze a bit more perf, you could:

  • Compute N = len(request_list) once.
  • For each (layer_id, module_id) in the union, preallocate:
    • adapter_size = [0] * N
    • weight_pointers = [0] * (3 * N)
  • On the first pass, fill by index using a req_index map instead of appending/extend. This removes list growth and reduces Python allocations.

Happy to provide a concrete refactor if you'd like to go this route.
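
Roughly, per (layer_id, module_id) the preallocation could look like this (stand-in data; names are illustrative):

import torch

request_ids = [7, 8, 9]
req_index = {rid: i for i, rid in enumerate(request_ids)}
n = len(request_ids)

# Stand-in entries for one (layer_id, module_id): (request_id, adapter_size, in_ptr, out_ptr, scaling_ptr)
entries = [(7, 8, 0x1000, 0x2000, 0), (9, 16, 0x3000, 0x4000, 0)]

# Preallocate once; requests without this module simply keep their zero slots.
adapter_size = [0] * n
weight_pointers = [0] * (3 * n)
for rid, size, in_ptr, out_ptr, scale_ptr in entries:
    i = req_index[rid]
    adapter_size[i] = size
    weight_pointers[3 * i:3 * i + 3] = (in_ptr, out_ptr, scale_ptr)

adapter_size_t = torch.tensor(adapter_size, dtype=torch.int32)
weight_pointers_t = torch.tensor(weight_pointers, dtype=torch.int64)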

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d59876e and 68cee80.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/peft/lora/layer.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py (0 hunks)
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

2046-2057: Good call on tuple keys; avoids f-string formatting and reduces lookups.

Using (request_id, layer_id, module_id) as the key in tmp_lora_params is a clean and low-overhead way to index per-request/per-module entries. This aligns well with the PR objective and should contribute to the reported speedup.


2046-2057: Lora params no longer include is_dora, and per‐request ordering aligns with host_request_types.

  • We only build tmp_lora_params and then flatten into lora_params without ever emitting an is_dora key, and no consumer in _torch/pyexecutor/model_engine.py or in its downstream calls references lora_params['is_dora'].
  • The two‐pass over scheduled_requests.context_requests + scheduled_requests.generation_requests produces adapter sizes and weight pointers in the same order as attn_metadata.host_request_types (0s for context, then 1s for generation).

All good here.

@tensorrt-cicd
Collaborator

PR_Github #15783 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15766 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)

2041-2044: Initialization of per-layer/module aggregation looks good; minor micro-optimization possible

The structure {'adapter_size': [], 'weight_pointers': []} is clear and aligns with the refactor. If you want to squeeze a bit more from CPython dict lookups, you can reduce repeated indexing by using a local handle for the layer map and/or setdefault (optional; readability trade-off).

I can propose a concise refactor that preserves behavior if you'd like.


2064-2067: Zero-fill keeps per-request alignment; use extend for clarity

The zero-padding logic maintains per-request alignment across modules. As a small readability/intent improvement, prefer list.extend over +=.

Apply this diff:

-                        current_lora_params['weight_pointers'] += [0, 0, 0]
+                        current_lora_params['weight_pointers'].extend([0, 0, 0])

2071-2083: Aggregation path is correct; tiny reductions in overhead are possible

The use of a temporary map plus a single pass over the union ensures stable ordering and avoids repeated lookups. Two micro-nits if you want to chip away a few more cycles:

  • Hoist request.py_request_id to a local req_id.
  • Replace repeated tuple construction in get(...) with a prebuilt key.

For example:

# before iterating layer/module
req_id = request.py_request_id
# ...
key = (req_id, layer_id, module_id)
current_tmp_lora_params = tmp_lora_params.get(key)

2075-2076: Mirror the extend nit for the present-but-missing module case

Same as above, prefer extend over += for intent clarity.

Apply this diff:

-                            current_lora_params['weight_pointers'] += [0, 0, 0]
+                            current_lora_params['weight_pointers'].extend([0, 0, 0])
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 68cee80 and 8b38d50.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/peft/lora/layer.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py (0 hunks)
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tests/unittest/_torch/modules/tests_lora_modules/test_lora_attention_pytorch_flow_vs_trt.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
🧠 Learnings (4)
📚 Learning: 2025-08-19T12:45:11.951Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.951Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-19T12:45:35.390Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:2086-2092
Timestamp: 2025-08-19T12:45:35.390Z
Learning: DoRA (Delta Orthogonal Rank Adaptation) functionality has been removed from the PyTorch flow in tensorrt_llm/_torch/pyexecutor/model_engine.py. The is_dora field is computed but not used downstream in the PyTorch flow, so converting it to a tensor would be wasteful overhead.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

2046-2057: Tuple key adoption and pointer normalization are on-point

Switching to (request_id, layer_id, module_id) tuple keys removes format-string overhead and cuts hashing/allocations. Normalizing scaling_vec_pointer None → 0 is consistent with removing DoRA while still keeping a stable 3-pointer tuple.


2087-2091: Tensorization LGTM; confirm downstream shape expectations for weight_pointers

Using IntTensor/LongTensor is fine and consistent with pointer sizes. Weight pointers end up as a flat 1D tensor of length 3*num_requests. If downstream expects a 2D [num_requests, 3] view, you may want to reshape here to avoid repeated view ops later. If consumers already handle 1D, keep as-is.

If you want to verify usage sites, I can generate a quick repo scan to check whether consumers reshape weight_pointers into (N, 3) or expect 1D.
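
For reference, if a 2D view were ever needed downstream, the flat layout converts without a copy (illustrative values):

import torch

# Two requests, three pointers each, stored flat as produced by the aggregation.
weight_pointers = torch.tensor([0x1000, 0x2000, 0, 0x3000, 0x4000, 0], dtype=torch.int64)
per_request = weight_pointers.view(-1, 3)  # shape [num_requests, 3]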


2093-2093: Conditional attachment of metadata is correct

Gating host_request_types/prompt_lens_cpu/num_seqs on lora_params being non-empty preserves prior semantics and avoids extraneous payloads.

@tensorrt-cicd
Collaborator

PR_Github #15783 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11864 completed with status: 'FAILURE'

Collaborator

@shaharmor98 shaharmor98 left a comment


LGTM

@amitz-nv amitz-nv force-pushed the dev-improve-perf-get-lora-params-from-requests branch from 8b38d50 to b90a55d Compare August 20, 2025 07:02
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15887 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15887 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11940 completed with status: 'FAILURE'

@amitz-nv amitz-nv force-pushed the dev-improve-perf-get-lora-params-from-requests branch from b90a55d to 85bf438 Compare August 20, 2025 16:40
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15939 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15939 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11979 completed with status: 'FAILURE'

@amitz-nv amitz-nv force-pushed the dev-improve-perf-get-lora-params-from-requests branch from 85bf438 to 24593e3 Compare August 21, 2025 07:37
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16023 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16023 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12045 completed with status: 'FAILURE'

@amitz-nv amitz-nv force-pushed the dev-improve-perf-get-lora-params-from-requests branch from 24593e3 to d51da9c Compare August 21, 2025 15:27
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16068 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16068 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12082 completed with status: 'FAILURE'

@amitz-nv amitz-nv force-pushed the dev-improve-perf-get-lora-params-from-requests branch from d51da9c to 500a336 Compare August 24, 2025 06:57
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16297 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16297 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12252 completed with status: 'FAILURE'

…sts, add type hint to LlmRequest.py_lora_task_layer_module_configs

Signed-off-by: Amit Zuker <[email protected]>
…tensor conversion for is_dora

Signed-off-by: Amit Zuker <[email protected]>
@amitz-nv amitz-nv force-pushed the dev-improve-perf-get-lora-params-from-requests branch from 500a336 to 8439b60 Compare August 24, 2025 13:35
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16310 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16310 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12262 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@amitz-nv amitz-nv merged commit a1e03af into NVIDIA:main Aug 25, 2025
4 checks passed
amitz-nv added a commit to amitz-nv/TensorRT-LLM that referenced this pull request Aug 25, 2025
