Feat/support lora cuda graph #7335
base: main
Conversation
Signed-off-by: Shahar Mor <[email protected]>
Signed-off-by: Shahar Mor <[email protected]>
Signed-off-by: Shahar Mor <[email protected]>
Signed-off-by: Shahar Mor <[email protected]>
Walkthrough

Adds optional LoRA integration across the PyTorch executor: wires a LoraManager with a PEFT cache, supports prefetching LoRA adapters, propagates lora_params into CUDA graph capture/replay, updates resource/dummy request handling to carry LoRA fields, exposes get_lora_manager(), and adds a LoRA+CUDA graph unit test.
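For orientation, a minimal usage sketch of what the PR wires together, modeled on the unit test added in this PR. Paths are placeholders, and the import locations are taken from the Code graph analysis section of this review; the package may also re-export these symbols at the top level.

```python
# Sketch modeled on tests/unittest/llmapi/test_llm_pytorch.py::test_lora_dir_with_graph;
# model/adapter paths are placeholders, import locations are assumptions from this review.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.llmapi.llm_args import CudaGraphConfig
from tensorrt_llm.lora_manager import LoraConfig

lora_req = LoRARequest("task-0", 0, "/path/to/lora-adapter")

# Listing the adapter in LoraConfig.lora_request lets the PyTorch executor prefetch it
# before warmup, so lora_params can be baked into CUDA graph capture.
lora_config = LoraConfig(lora_dir=["/path/to/lora-adapter"],
                         max_lora_rank=8,
                         lora_request=[lora_req])

llm = LLM(model="/path/to/base-model",
          lora_config=lora_config,
          cuda_graph_config=CudaGraphConfig(max_batch_size=1))

outputs = llm.generate(["Where is the capital of the USA?"],
                       SamplingParams(max_tokens=20),
                       lora_request=[lora_req])
print(outputs[0].outputs[0].text)
```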
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant PyExecutor
    participant ModelEngine
    participant ResourceManager
    participant LoraManager
    participant PeftCacheMgr as PEFT Cache Manager (CPP)
    User->>PyExecutor: init(...)
    PyExecutor->>ModelEngine: construct(...)
    PyExecutor->>ModelEngine: set_lora_manager_cpp_peft_cache_manager(ResourceManager)
    ModelEngine->>ResourceManager: get(ResourceManagerType.PEFT_CACHE_MANAGER)
    ResourceManager-->>ModelEngine: PEFT cache mgr
    ModelEngine->>LoraManager: set_cpp_peft_cache_manager(PeftCacheMgr)
    PyExecutor->>ModelEngine: prefetch_lora_dirs()
    ModelEngine->>LoraManager: load adapters / prefetch
    LoraManager-->>ModelEngine: adapters ready
    ModelEngine-->>PyExecutor: has_lora_prefetched = True
```

```mermaid
sequenceDiagram
    autonumber
    participant Scheduler as Request Scheduler
    participant ModelEngine
    participant ResourceManager
    participant CudaGraph as DecodingCUDAGraphRunner
    participant LoraManager
    Scheduler->>ModelEngine: forward(batch, resource_manager)
    ModelEngine->>ModelEngine: _maybe_get_cuda_graph(..., resource_manager)
    alt LoRA prefetched
        ModelEngine->>LoraManager: build lora_config / params
        LoraManager-->>ModelEngine: lora_params
        ModelEngine->>CudaGraph: construct(..., lora_params)
        CudaGraph->>CudaGraph: capture(forward_fn, inputs + lora_params)
    else No LoRA
        ModelEngine->>CudaGraph: construct(..., lora_params=None)
        CudaGraph->>CudaGraph: capture(forward_fn, inputs)
    end
    ModelEngine->>CudaGraph: replay(...)
    CudaGraph-->>ModelEngine: outputs
    ModelEngine-->>Scheduler: outputs
```
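A condensed, hypothetical sketch of the capture/replay path in the second diagram. The class and parameter names mirror the PR (DecodingCUDAGraphRunner, lora_params), but the body is illustrative only, not the actual implementation.

```python
# Illustrative only: how lora_params might be threaded through capture and replay.
# (Stream warm-up and memory-pool handling are omitted for brevity.)
from typing import Callable, Dict, Optional
import torch


class GraphRunnerSketch:
    """Minimal stand-in for DecodingCUDAGraphRunner with optional lora_params."""

    def __init__(self, lora_params: Optional[Dict[str, torch.Tensor]] = None):
        self.lora_params = lora_params
        self._graph = torch.cuda.CUDAGraph()
        self._static_inputs: Dict[str, torch.Tensor] = {}
        self._output: Optional[torch.Tensor] = None

    def capture(self, forward_fn: Callable, inputs: Dict[str, torch.Tensor]) -> None:
        # The tensors passed here become the graph's static buffers.
        self._static_inputs = inputs
        if self.lora_params is not None:
            # LoRA tensors join the captured input set; their storage addresses
            # must stay fixed for every later replay.
            inputs = {**inputs, "lora_params": self.lora_params}
        with torch.cuda.graph(self._graph):
            self._output = forward_fn(**inputs)

    def replay(self, new_inputs: Dict[str, torch.Tensor]) -> torch.Tensor:
        # Copy fresh data into the captured buffers in place, then replay.
        for name, value in new_inputs.items():
            self._static_inputs[name].copy_(value)
        self._graph.replay()
        return self._output
```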
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/lora_manager.py (1)
442-444: Constructor ignores the cpp_peft_cache_manager argument — set the field.

Currently the passed manager is dropped and a new None field is created later. Initialize it in __init__ and drop the redundant defaulting.

```diff
 class LoraManager(object):
@@
-    def __init__(
-        self, cpp_peft_cache_manager: tb_internal.batch_manager.PeftCacheManager | None = None
-    ):
+    def __init__(
+        self, cpp_peft_cache_manager: tb_internal.batch_manager.PeftCacheManager | None = None
+    ):
@@
-        self._lora_uid_counter = 0
+        self._lora_uid_counter = 0
@@
-        self.lora_target_modules: List[str] = []
-        self._cpp_peft_cache_manager: Optional[tb_internal.batch_manager.PeftCacheManager] = None
+        self.lora_target_modules: List[str] = []
+        self._cpp_peft_cache_manager: Optional[
+            tb_internal.batch_manager.PeftCacheManager
+        ] = cpp_peft_cache_manager
```

Also applies to: 487-493
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1978-1979: DoRA detection logic reintroduced; it was intentionally removed.

Per the prior removal, set is_dora to False to avoid inverted detection.

```diff
-                is_dora = module.scaling_vec_pointer == 0
+                is_dora = False  # DoRA disabled in PyTorch flow
```
🧹 Nitpick comments (8)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
322-324: Add a return type and minimal docstring for the public API get_lora_manager().

Improves clarity and external usage.

```diff
-    def get_lora_manager(self):
-        return self.model_engine.lora_manager
+    def get_lora_manager(self) -> Optional["LoraManager"]:
+        """Return the LoRA manager associated with this executor (PyTorch backend only)."""
+        return self.model_engine.lora_manager
```

tensorrt_llm/executor/worker.py (1)

162-168: Guard against a missing LoRA manager in the PyTorch path.

If engine.get_lora_manager() unexpectedly returns None, later access will fail. Add an assert with a clear error.

```diff
-        self._lora_manager = self.engine.get_lora_manager()
+        self._lora_manager = self.engine.get_lora_manager()
+        assert self._lora_manager is not None, (
+            "LoRA config provided but no LoraManager available from engine."
+        )
```

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
30-38: Document and type lora_params; ensure it is capture-safe.

Clarify the expected structure/device and narrow the typing for safer usage during capture/replay.

```diff
-        use_mrope: bool = False,
-        lora_params: Optional[dict] = None,
+        use_mrope: bool = False,
+        lora_params: Optional[Dict[str, torch.Tensor]] = None,
```

Add to the constructor docstring (not shown) that:
- lora_params tensors must be on the capture device,
- shapes and storage addresses must remain constant across replays (contents may mutate).
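As a self-contained illustration of the second bullet, here is a generic torch.cuda.CUDAGraph example (standard PyTorch behavior, not code from this PR) showing why captured buffers must keep their storage while their contents may be updated in place:

```python
# Minimal demonstration (requires a CUDA device): in-place updates are visible
# to replay, but rebinding a name to a new tensor is not.
import torch

static_x = torch.zeros(4, device="cuda")
weight = torch.full((4,), 2.0, device="cuda")

graph = torch.cuda.CUDAGraph()
# Warm up on a side stream, as recommended before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = static_x * weight
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(graph):
    y = static_x * weight  # captured: reads static_x's current storage

static_x.copy_(torch.ones(4, device="cuda"))   # OK: contents mutate in place
graph.replay()
print(y)  # tensor([2., 2., 2., 2.]) — replay saw the new contents

static_x = torch.full((4,), 3.0, device="cuda")  # rebinding allocates new storage
graph.replay()
print(y)  # still tensor([2., 2., 2., 2.]) — the graph keeps reading the old buffer
```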
72-72: Persist lora_params and mark it as an optional model input.

Include lora_params in optional_extra_model_inputs to mirror the mrope handling and avoid accidental shape checks elsewhere that rely on this list.

```diff
-        self.lora_params = lora_params
-        self._output = None
+        self.lora_params = lora_params
+        self._output = None
         self._graph = None
-        self.optional_extra_model_inputs = ["mrope_position_deltas"]
+        self.optional_extra_model_inputs = ["mrope_position_deltas", "lora_params"]
```
95-97: Inject lora_params during capture — OK; add minimal validation.

Pre-capture, assert the tensors live on the target device to catch misconfigurations early.

```diff
         if self.lora_params is not None:
+            # lightweight validation
+            for k, v in self.lora_params.items():
+                assert isinstance(v, torch.Tensor) and v.device.type == "cuda", \
+                    f"lora_params['{k}'] must be a CUDA tensor"
             inputs["lora_params"] = self.lora_params
```

tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
535-537: Comment cleanup and consistency.

Remove the temporary "SMOR" comments; they will leak into production.

```diff
-                lora_request=
-                lora_config,  # TODO smor- tests assume BS1 then this will be ignored for now, need to resolve
+                lora_request=lora_binding,
 ...
-                lora_request=lora_config,
+                lora_request=lora_binding,
```

Also applies to: 550-551
1001-1002: Replace print with logger.

```diff
-            print(f"SMOR, not failed on lora_params in maybe_get_cuda_graph")
+            logger.debug("LoRA params prepared for CUDA graph.")
```
1-1: Missing NVIDIA copyright header.

Add the standard NVIDIA header (current year) per the guidelines.
Please ensure the repo’s standard header is applied uniformly.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (9 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/executor/worker.py (1 hunk)
- tensorrt_llm/lora_manager.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic
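A generic illustration of the last two guidelines (a sketch, not code from this repository):

```python
from typing import List


# Narrow except clause; keep the try body minimal and put the main logic in `else`.
def read_adapter_ids(path: str) -> List[int]:
    try:
        handle = open(path, encoding="utf-8")
    except FileNotFoundError:  # narrow, not a bare `except`
        return []
    else:
        with handle:
            return [int(line) for line in handle if line.strip()]
```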
Files:
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header with current year to all source files
Files:
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (6)
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code graph analysis (5)
tensorrt_llm/executor/worker.py (1)
  tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
    - get_lora_manager (322-323)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
  tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
    - set_lora_manager_cpp_peft_cache_manager (439-445)
    - prefetch_lora_dirs (447-458)
tensorrt_llm/lora_manager.py (3)
  tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
    - lora_request (265-286)
  tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
    - PeftCacheManager (1158-1239)
  cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)
    - PeftCacheManager (231-255)
tests/unittest/llmapi/test_llm_pytorch.py (4)
  tensorrt_llm/llmapi/llm_args.py (1)
    - CudaGraphConfig (63-88)
  tensorrt_llm/executor/request.py (1)
    - LoRARequest (23-43)
  tensorrt_llm/lora_manager.py (1)
    - LoraConfig (141-158)
  tests/unittest/utils/util.py (1)
    - similar (369-371)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
  tensorrt_llm/_torch/models/modeling_phi4mm.py (2)
    - lora_request (265-286)
    - lora_config (242-262)
  tensorrt_llm/lora_manager.py (1)
    - lora_weights (917-918)
  tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
    - LlmRequest (264-351)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
909-909: Undefined name DecodingBaseConfig
(F821)
🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
282-285: LoRA PEFT cache wiring and prefetch happen at the right time (pre-warmup). Good placement and null-safe on the engine side. No issues.
tests/unittest/llmapi/test_llm_pytorch.py (1)
4-4: Import of CudaGraphConfig looks correct. No action needed.
tensorrt_llm/executor/worker.py (1)
162-168: Decouple via engine.get_lora_manager() — good cleanup. This removes the CPP resource-manager coupling and aligns with the new PyExecutor API.
tensorrt_llm/lora_manager.py (1)
494-505: CPU cache check is fine; the race caveat remains. The known race from prior discussions still applies when relying on CPU cache presence to omit weights. Consider adding a brief comment referencing the limitation where applicable.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
411-414: No action needed: LlmRequest supports LoRA kwargs. The Python wrapper's __init__ takes **kwargs and forwards lora_task_id, lora_weights, and lora_config to the underlying C++ binding, which declares optional parameters for each — so the call will succeed.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
447-459: Ensure lora_model_config is initialized before loading adapters.

In tensorrt_llm/_torch/pyexecutor/model_engine.py, guard prefetch_lora_dirs against a missing config and fall back to set_lora_model_config:

```diff
     def prefetch_lora_dirs(self):
         if self.lora_prefetch_requests_list is None:
             return
+        if self.lora_model_config is None:
+            # Derive default model config for LoRA
+            self.set_lora_model_config(
+                lora_target_modules=self.model.model_config.lora_target_modules,
+                trtllm_modules_to_hf_modules=self.model.model_config.trtllm_modules_to_hf_modules,
+            )
         for request in self.lora_prefetch_requests_list:
             self.lora_manager.load_from_ckpt(
                 [request.path],
                 model_config=self.lora_model_config,
                 runtime_mapping=None,
                 uids=[request.adapter_id])
         self.has_lora_prefetched = True
```

Also verify that model.model_config.lora_target_modules and trtllm_modules_to_hf_modules are always present to avoid attribute errors.
1003-1006: DecodingCUDAGraphRunner supports lora_params.

The constructor signature already includes an optional lora_params parameter, so no version gate or changes are required.
```diff
 from tensorrt_llm.inputs.multimodal import MultimodalParams
 from tensorrt_llm.logger import logger
-from tensorrt_llm.lora_manager import LoraConfig, LoraModelConfig
+from tensorrt_llm.lora_manager import LoraConfig, LoraManager, LoraModelConfig
```
🛠️ Refactor suggestion
Follow import style guideline; avoid symbol imports and name collisions.
Import the module and use qualified names to avoid colliding with bindings’ LoraConfig.
```diff
-from tensorrt_llm.lora_manager import LoraConfig, LoraManager, LoraModelConfig
+import tensorrt_llm.lora_manager as lora_mgr
```

And update usages, e.g.:

```diff
-        self.lora_manager: Optional[LoraManager] = None
+        self.lora_manager: Optional[lora_mgr.LoraManager] = None
```

📝 Committable suggestion
```python
import tensorrt_llm.lora_manager as lora_mgr
```
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around line 26, replace the
symbol import "from tensorrt_llm.lora_manager import LoraConfig, LoraManager,
LoraModelConfig" with a module import (e.g. "import tensorrt_llm.lora_manager as
lora_manager") and update all references in this file to use qualified names
(e.g. lora_manager.LoraConfig, lora_manager.LoraManager,
lora_manager.LoraModelConfig) to avoid symbol collisions and follow the import
style guideline.
```python
        self.lora_manager: Optional[LoraManager] = None
        if lora_config is not None:
            self.lora_manager = LoraManager()

        self.lora_prefetch_requests_list = None  # TODO smor - fix "LoRARequest" import
        if lora_config is not None and lora_config.lora_request is not None:
            self.lora_prefetch_requests_list = lora_config.lora_request
            self.has_lora_prefetched = False
```
Initialize has_lora_prefetched unconditionally; avoid AttributeError.
Currently it is only set when lora_request is not None, but it is later read unguarded.
```diff
-        self.lora_manager: Optional[LoraManager] = None
+        self.lora_manager: Optional[lora_mgr.LoraManager] = None
         if lora_config is not None:
-            self.lora_manager = LoraManager()
+            self.lora_manager = lora_mgr.LoraManager()

-        self.lora_prefetch_requests_list = None  # TODO smor - fix "LoRARequest" import
-        if lora_config is not None and lora_config.lora_request is not None:
-            self.lora_prefetch_requests_list = lora_config.lora_request
-            self.has_lora_prefetched = False
+        self.lora_prefetch_requests_list = None  # LoRA prefetch requests (bindings executor side)
+        self.has_lora_prefetched = False
+        if lora_config is not None and getattr(lora_config, "lora_request", None):
+            self.lora_prefetch_requests_list = lora_config.lora_request
```

📝 Committable suggestion
```python
        self.lora_manager: Optional[lora_mgr.LoraManager] = None
        if lora_config is not None:
            self.lora_manager = lora_mgr.LoraManager()

        self.lora_prefetch_requests_list = None  # LoRA prefetch requests (bindings executor side)
        self.has_lora_prefetched = False
        if lora_config is not None and getattr(lora_config, "lora_request", None):
            self.lora_prefetch_requests_list = lora_config.lora_request
```
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 285-293,
has_lora_prefetched is only set when lora_config.lora_request is not None which
leads to AttributeError when accessed later; initialize self.has_lora_prefetched
= False unconditionally (e.g., immediately after setting self.lora_manager) and
keep setting self.lora_prefetch_requests_list = lora_config.lora_request only
when present so the attribute always exists and is safely readable thereafter.
```python
    def set_lora_manager_cpp_peft_cache_manager(
            self, resource_manager: ResourceManager):
        cpp_peft_cache_manager = resource_manager.get_resource_manager(
            ResourceManagerType.PEFT_CACHE_MANAGER)
        if cpp_peft_cache_manager is not None and self.lora_manager is not None:
            self.lora_manager.set_cpp_peft_cache_manager(
                cpp_peft_cache_manager.impl)
```
🛠️ Refactor suggestion
Don’t reach into .impl; use the manager’s public API consistently.
This helper should set the cpp manager via a public setter, but consumers must then call the Python PeftCacheManager methods, not impl.
No change here; see refactor below in _maybe_get_cuda_graph to stop using impl.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 439 to 446, the
helper currently reaches into cpp_peft_cache_manager.impl; update it to pass the
ResourceManager-returned object through the public API instead of accessing
.impl. Replace the .impl usage by calling
self.lora_manager.set_cpp_peft_cache_manager(cpp_peft_cache_manager) (keeping
the existing None checks) and add a short comment that consumers must interact
with the Python PeftCacheManager methods rather than its internal impl to avoid
direct implementation coupling.
```python
            lora_config = None
            if self.has_lora_prefetched:
                # TODO smor currently I assume a single adapter with uid 0, change this
                uid = 0
                from tensorrt_llm.bindings import executor as tllm
                lora_config = tllm.LoraConfig(
                    task_id=uid,
                    weights=self.lora_manager.cpp_lora_weights[uid],
                    config=self.lora_manager.cpp_lora_config[uid])
```
🛠️ Refactor suggestion
Avoid shadowing LoraConfig; remove hard-coded uid=0.
Use a distinct name for the bindings object and derive uid from prefetch requests.
```diff
-            lora_config = tllm.LoraConfig(
-                task_id=uid,
-                weights=self.lora_manager.cpp_lora_weights[uid],
-                config=self.lora_manager.cpp_lora_config[uid])
+            lora_binding = tllm.LoraConfig(
+                task_id=uid,
+                weights=self.lora_manager.cpp_lora_weights[uid],
+                config=self.lora_manager.cpp_lora_config[uid])
```

And when passing to add_dummy_requests:

```diff
-                lora_request=
-                lora_config,  # TODO smor- tests assume BS1 then this will be ignored for now, need to resolve
+                lora_request=lora_binding,
```

Also, compute uid:

```diff
-            uid = 0
+            # Prefer the first prefetched adapter id
+            uid = getattr(self.lora_prefetch_requests_list[0], "adapter_id", 0)
```

📝 Committable suggestion
```python
            lora_config = None
            if self.has_lora_prefetched:
                # Prefer the first prefetched adapter id
                uid = getattr(self.lora_prefetch_requests_list[0], "adapter_id", 0)
                from tensorrt_llm.bindings import executor as tllm
                lora_binding = tllm.LoraConfig(
                    task_id=uid,
                    weights=self.lora_manager.cpp_lora_weights[uid],
                    config=self.lora_manager.cpp_lora_config[uid])

            # later, when enqueuing the dummy request:
            self.add_dummy_requests(
                # ... other parameters ...
                lora_request=lora_binding,
                # ... remaining parameters ...
            )
```
```diff
     def _maybe_get_cuda_graph(
             self,
             batch: ScheduledRequests,
-            spec_config: Optional["DecodingBaseConfig"] = None
+            spec_config: Optional["DecodingBaseConfig"] = None,
+            resource_manager: Optional[ResourceManager] = None
     ) -> Optional[DecodingCUDAGraphRunner]:
```
Fix Ruff F821: DecodingBaseConfig undefined (even in quotes).
Gate a type-only import to satisfy static analysis without runtime dep.
```diff
 from typing import Any, Dict, Optional, Tuple
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from ..speculative.decoding_config import DecodingBaseConfig
```

📝 Committable suggestion
```python
from typing import Any, Dict, Optional, Tuple
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    from ..speculative.decoding_config import DecodingBaseConfig
```
🧰 Tools
🪛 Ruff (0.12.2)
909-909: Undefined name DecodingBaseConfig
(F821)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 906 to 911, Ruff
F821 is raised because DecodingBaseConfig is referenced in type annotations but
not imported (even as a string); add a type-only import to satisfy static
analysis without introducing a runtime dependency: import TYPE_CHECKING from
typing at top of the file and then, inside an if TYPE_CHECKING: block, import
DecodingBaseConfig from the module where it is defined (replace with the correct
module path), leaving the runtime code unchanged so the annotation remains only
for type checking.
```python
        lora_params = None

        if self.has_lora_prefetched:
            peft_cache_manager = resource_manager.get_resource_manager(
                ResourceManagerType.PEFT_CACHE_MANAGER)

            context_requests = batch.context_requests
            generation_requests = batch.generation_requests

            if len(context_requests) > 0 and len(generation_requests) > 0:
                raise ValueError(
                    "SMOR, non empty context and generation requests isn't tested yet"
                )

            if len(context_requests) > 0:
                raise ValueError("SMOR, context requests isn't tested yet")

            if len(generation_requests) > 1:
                raise ValueError("SMOR, generation requests isn't tested yet")

            generation_request = generation_requests[0]
            # TODO smor I have no idea why this is happening
            generation_request.lora_weights = generation_request.lora_weights.reshape(
                [1] + list(generation_request.lora_weights.shape))
            generation_request.lora_config = generation_request.lora_config.reshape(
                [1] + list(generation_request.lora_config.shape))
            peft_cache_manager.impl.add_request_peft(generation_request, True)

            py_lora_task_layer_module_configs = peft_cache_manager.impl.ensure_batch(
                context_requests, generation_requests, False)
            for req in context_requests:
                req.py_lora_task_layer_module_configs = py_lora_task_layer_module_configs[
                    req.
                    py_request_id] if req.py_request_id in py_lora_task_layer_module_configs else None
            for req in generation_requests:
                req.py_lora_task_layer_module_configs = py_lora_task_layer_module_configs[
                    req.
                    py_request_id] if req.py_request_id in py_lora_task_layer_module_configs else None
```
🛠️ Refactor suggestion
LoRA PEFT setup inside CUDA-graph path is brittle: direct .impl access, ad-hoc reshapes, and hard errors.
- Don’t reshape here; PeftCacheManager.prepare_resources already does.
- Stop calling impl methods directly; use PeftCacheManager.add_request_peft/ensure_batch.
- Replace raises with graceful fallback when encountering unsupported mixed batches.
- Replace print with logger.
```diff
-        lora_params = None
-
-        if self.has_lora_prefetched:
-            peft_cache_manager = resource_manager.get_resource_manager(
-                ResourceManagerType.PEFT_CACHE_MANAGER)
-
-            context_requests = batch.context_requests
-            generation_requests = batch.generation_requests
-
-            if len(context_requests) > 0 and len(generation_requests) > 0:
-                raise ValueError(
-                    "SMOR, non empty context and generation requests isn't tested yet"
-                )
-
-            if len(context_requests) > 0:
-                raise ValueError("SMOR, context requests isn't tested yet")
-
-            if len(generation_requests) > 1:
-                raise ValueError("SMOR, generation requests isn't tested yet")
-
-            generation_request = generation_requests[0]
-            # TODO smor I have no idea why this is happening
-            generation_request.lora_weights = generation_request.lora_weights.reshape(
-                [1] + list(generation_request.lora_weights.shape))
-            generation_request.lora_config = generation_request.lora_config.reshape(
-                [1] + list(generation_request.lora_config.shape))
-            peft_cache_manager.impl.add_request_peft(generation_request, True)
-
-            py_lora_task_layer_module_configs = peft_cache_manager.impl.ensure_batch(
-                context_requests, generation_requests, False)
-            for req in context_requests:
-                req.py_lora_task_layer_module_configs = py_lora_task_layer_module_configs[
-                    req.
-                    py_request_id] if req.py_request_id in py_lora_task_layer_module_configs else None
-            for req in generation_requests:
-                req.py_lora_task_layer_module_configs = py_lora_task_layer_module_configs[
-                    req.
-                    py_request_id] if req.py_request_id in py_lora_task_layer_module_configs else None
-
-            # TODO smor - look at get lora params from requests
-            # You need something that isn't scheduled requests
-            # It also appears that you should make sure resource manager is called, because prefetch
-            # has to be added to peftCacheManager as well. So it still shouldn't work
-
-            lora_params = self._get_lora_params_from_requests(
-                batch, attn_metadata)
-            print(f"SMOR, not failed on lora_params in maybe_get_cuda_graph")
+        lora_params = None
+        if self.has_lora_prefetched:
+            peft_cache_manager = resource_manager.get_resource_manager(
+                ResourceManagerType.PEFT_CACHE_MANAGER)
+            if peft_cache_manager is None:
+                logger.debug("LoRA prefetched, but no PEFT cache manager present; skipping LoRA for graphs.")
+            else:
+                # Only generation-only batches are CUDA-graphable today.
+                if len(batch.context_requests) == 0 and len(batch.generation_requests) >= 1:
+                    for req in batch.generation_requests:
+                        peft_cache_manager.add_request_peft(req)
+                    py_cfgs = peft_cache_manager.ensure_batch(
+                        batch.context_requests, batch.generation_requests, reset_gpu_cache=False)
+                    for req in batch.generation_requests:
+                        req.py_lora_task_layer_module_configs = py_cfgs.get(req.py_request_id)
+                    lora_params = self._get_lora_params_from_requests(batch, attn_metadata)
+                else:
+                    logger.debug("LoRA + CUDA graph currently supports generation-only batches; skipping LoRA params.")
```

📝 Committable suggestion
```python
        lora_params = None
        if self.has_lora_prefetched:
            peft_cache_manager = resource_manager.get_resource_manager(
                ResourceManagerType.PEFT_CACHE_MANAGER)
            if peft_cache_manager is None:
                logger.debug(
                    "LoRA prefetched, but no PEFT cache manager present; "
                    "skipping LoRA for graphs."
                )
            else:
                # Only generation-only batches are CUDA-graphable today.
                if len(batch.context_requests) == 0 and len(batch.generation_requests) >= 1:
                    for req in batch.generation_requests:
                        peft_cache_manager.add_request_peft(req)
                    py_cfgs = peft_cache_manager.ensure_batch(
                        batch.context_requests,
                        batch.generation_requests,
                        reset_gpu_cache=False
                    )
                    for req in batch.generation_requests:
                        req.py_lora_task_layer_module_configs = (
                            py_cfgs.get(req.py_request_id)
                        )
                    lora_params = self._get_lora_params_from_requests(
                        batch, attn_metadata
                    )
                else:
                    logger.debug(
                        "LoRA + CUDA graph currently supports generation-only "
                        "batches; skipping LoRA params."
                    )
```
```python
        lora_request=None,
    ):
```
🛠️ Refactor suggestion
Type and API contract for lora_request are unclear; initialize and type it.
Document/annotate the expected type (bindings executor LoraConfig-like) and make has_lora_prefetched paths robust to None. Also initialize has_lora_prefetched at engine level to avoid AttributeError elsewhere.
Apply:
```diff
-        lora_request=None,
+        lora_request: Optional[object] = None,
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/resource_manager.py around lines 376-377, the
lora_request parameter is untyped and the code assumes an engine attribute
has_lora_prefetched exists causing potential AttributeError; annotate
lora_request with the expected type (e.g., a LoraConfig-like binding/executor
object or Optional[LoraConfig]) in the function signature and add a short
docstring/comment describing the expected fields, initialize
engine.has_lora_prefetched = False when engines are created/registered so the
attribute always exists, and update any checks like if
engine.has_lora_prefetched or engine.has_lora_prefetched == True to safely
handle None (use bool(engine.has_lora_prefetched) or explicit is True) and guard
code paths with None-safe checks before accessing lora_request properties.
```python
        lora_task_id = None
        lora_weights = None
        lora_config = None

        if lora_request is not None:
            # TODO smor currently work with single adapter only, not sure how this should work with request ids
            lora_task_id = lora_request.task_id
            lora_weights = lora_request.weights
            lora_config = lora_request.config
```
Guard against missing fields and wrong shapes on lora_request.
Accessing task_id/weights/config blindly risks AttributeError; torch reshape later expects tensors. Add duck-typing checks and early validation.
```diff
-        lora_task_id = None
-        lora_weights = None
-        lora_config = None
-
-        if lora_request is not None:
-            # TODO smor currently work with single adapter only, not sure how this should work with request ids
-            lora_task_id = lora_request.task_id
-            lora_weights = lora_request.weights
-            lora_config = lora_request.config
+        lora_task_id = None
+        lora_weights = None
+        lora_config = None
+        if lora_request is not None:
+            # Single-adapter warmup; multi-adapter not yet supported.
+            if not all(hasattr(lora_request, a) for a in ("task_id", "weights", "config")):
+                raise TypeError("lora_request must expose task_id, weights, and config")
+            lora_task_id = int(lora_request.task_id)
+            lora_weights = lora_request.weights
+            lora_config = lora_request.config
```

📝 Committable suggestion
```python
        lora_task_id = None
        lora_weights = None
        lora_config = None
        if lora_request is not None:
            # Single-adapter warmup; multi-adapter not yet supported.
            if not all(hasattr(lora_request, a) for a in ("task_id", "weights", "config")):
                raise TypeError("lora_request must expose task_id, weights, and config")
            lora_task_id = int(lora_request.task_id)
            lora_weights = lora_request.weights
            lora_config = lora_request.config
```
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/resource_manager.py around lines 394 to 403,
the code accesses lora_request.task_id, .weights and .config without validation
which can raise AttributeError or later cause tensor reshape errors; update this
block to first duck-type-check that lora_request has the attributes (hasattr or
try/except AttributeError), verify lora_request.weights is not None and is a
torch.Tensor or convertable to one, check its ndim/shape matches the expected
shape before any reshape and raise a clear ValueError if not, and validate that
lora_request.config contains required keys/types (or set sensible defaults) so
downstream code doesn’t assume missing fields. Ensure any early-return or error
message includes the problematic field and expected shape/type.
```python
    lora_request: Optional[List[Any]] = None  # TODO smor fix
```
🛠️ Refactor suggestion
Avoid Any for LoraConfig.lora_request; use typed forward reference to LoRARequest.
Prevents loss of type-safety and documents intent, while avoiding import cycles with TYPE_CHECKING.
```diff
-from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union
+from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union
@@
 if TYPE_CHECKING:
     from .runtime import ModelConfig
+    from .executor.request import LoRARequest
@@
-    lora_request: Optional[List[Any]] = None  # TODO smor fix
+    lora_request: Optional[List["LoRARequest"]] = None
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/lora_manager.py around lines 149-150, replace the liberal Any
annotation for lora_request with a typed forward reference to LoRARequest to
preserve type-safety and intent; change the type to
Optional[List["LoRARequest"]], add from typing import TYPE_CHECKING at the top
and under if TYPE_CHECKING: import LoRARequest from its module (or appropriate
path) so the runtime import cycle is avoided while static type checkers see the
real type.
```python
def test_lora_dir_with_graph():
    lora_req = LoRARequest(
        "task-0", 0, f"{llm_models_root()}/llama-models/luotuo-lora-7b-0.1")

    lora_config = LoraConfig(
        lora_dir=[f"{llm_models_root()}/llama-models/luotuo-lora-7b-0.1"],
        max_lora_rank=8,
        lora_request=[lora_req])

    llm = LLM(model=f"{llm_models_root()}/llama-models/llama-7b-hf",
              lora_config=lora_config,
              cuda_graph_config=CudaGraphConfig(max_batch_size=1))
    # cuda_graph_config=None)

    prompts = [
        "美国的首都在哪里? \n答案:",
    ]
    references = [
        "美国的首都是华盛顿。\n\n美国的",
    ]
    sampling_params = SamplingParams(max_tokens=20)
    lora_request = [lora_req]

    outputs = llm.generate(prompts, sampling_params, lora_request=lora_request)

    assert similar(outputs[0].outputs[0].text, references[0])
    print(f"lora output: {outputs[0].outputs[0].text}")
    print(f"ref output: {references[0]}")
```
🛠️ Refactor suggestion
Ensure resource cleanup, avoid redundant config, and gate by memory.
- Always shutdown LLM in finally to prevent resource leaks.
- Avoid providing LoRA adapter both in LoraConfig and per-generate arg; the per-call lora_request is sufficient here.
- Align with other 7B tests by adding the 40GB memory guard.
```diff
-@pytest.mark.parametrize(
+# keep above tests unchanged
@@
-def test_lora_dir_with_graph():
+@skip_gpu_memory_less_than_40gb
+def test_lora_dir_with_graph():
@@
-    lora_config = LoraConfig(
-        lora_dir=[f"{llm_models_root()}/llama-models/luotuo-lora-7b-0.1"],
-        max_lora_rank=8,
-        lora_request=[lora_req])
+    lora_config = LoraConfig(
+        lora_dir=[f"{llm_models_root()}/llama-models/luotuo-lora-7b-0.1"],
+        max_lora_rank=8)
@@
-    llm = LLM(model=f"{llm_models_root()}/llama-models/llama-7b-hf",
-              lora_config=lora_config,
-              cuda_graph_config=CudaGraphConfig(max_batch_size=1))
+    llm = LLM(model=f"{llm_models_root()}/llama-models/llama-7b-hf",
+              lora_config=lora_config,
+              cuda_graph_config=CudaGraphConfig(max_batch_size=1))
@@
-    outputs = llm.generate(prompts, sampling_params, lora_request=lora_request)
-
-    assert similar(outputs[0].outputs[0].text, references[0])
-    print(f"lora output: {outputs[0].outputs[0].text}")
-    print(f"ref output: {references[0]}")
+    try:
+        outputs = llm.generate(prompts, sampling_params, lora_request=lora_request)
+        assert similar(outputs[0].outputs[0].text, references[0])
+        print(f"lora output: {outputs[0].outputs[0].text}")
+        print(f"ref output: {references[0]}")
+    finally:
+        llm.shutdown()
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/unittest/llmapi/test_llm_pytorch.py around lines 436-463, ensure the
test guards memory, avoids redundant LoRA specification, and always cleans up
the LLM: add the same 40GB memory guard used by other 7B tests at the top of the
test and return/skip if not met; construct the LoraConfig without duplicating
per-call adapters (remove lora_request from LoraConfig and keep the single
lora_request passed into llm.generate, or alternatively remove the per-call
lora_request and keep it only in LoraConfig — pick one approach and make them
consistent); wrap LLM usage in try/finally and call llm.shutdown() in the
finally block to guarantee resource cleanup even on assertion failures or
exceptions.
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.