[AutoDeploy] merge feat/ad-2025-07-22 #6520
Conversation
📝 Walkthrough

This change overhauls the attention and quantization transformation infrastructure in the TensorRT-LLM PyTorch auto-deploy pipeline. It migrates pattern matching and quantization logic from function-based transformations to modular, class-based transforms registered in a new transform library. The configuration, test harnesses, and transformation application logic are refactored to use this new modular system. Several new features and correctness improvements are introduced for attention pattern matching, quantization, and layout handling.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant ModelFactory
    participant InferenceOptimizer
    participant TransformRegistry
    participant TransformClass
    participant GraphModule

    ModelFactory->>InferenceOptimizer: Provide model, config, quant_config
    InferenceOptimizer->>TransformRegistry: Lookup transforms (e.g., quantize, match_eager_attention)
    loop For each transform in config
        InferenceOptimizer->>TransformClass: Call _apply(GraphModule, ...)
        TransformClass->>GraphModule: Pattern match & transform
        TransformClass-->>InferenceOptimizer: Return (GraphModule, TransformInfo)
    end
    InferenceOptimizer-->>ModelFactory: Return transformed GraphModule
```
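The registry-based dispatch shown in the diagram can be sketched as follows. This is an illustrative assumption of the pattern, not the PR's actual code: the real `TransformRegistry` and `BaseTransform` live in `tensorrt_llm/_torch/auto_deploy/transform/interface.py`, and the class body here is a stand-in.

```python
# Minimal sketch of a name -> class transform registry (illustrative only).
from typing import Callable, Dict, Type


class TransformRegistry:
    """Maps transform names (as used in the YAML config) to transform classes."""

    _registry: Dict[str, Type["BaseTransform"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type["BaseTransform"]], Type["BaseTransform"]]:
        def decorator(transform_cls: Type["BaseTransform"]) -> Type["BaseTransform"]:
            cls._registry[name] = transform_cls
            return transform_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> Type["BaseTransform"]:
        return cls._registry[name]


class BaseTransform:
    def _apply(self, gm):  # gm would be a torch.fx.GraphModule in the real code
        raise NotImplementedError


@TransformRegistry.register("match_repeat_kv")
class MatchRepeatKV(BaseTransform):
    def _apply(self, gm):
        # pattern-match and rewrite the graph here; return (gm, info)
        return gm, {"num_matches": 0}


# The optimizer looks transforms up by the names listed in the config:
transform_cls = TransformRegistry.get("match_repeat_kv")
gm, info = transform_cls()._apply(object())
```

The key design point is that adding a new transform only requires registering a class under a name; the optimizer loop stays generic.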
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90+ minutes

Suggested reviewers
/bot run
PR_Github #13696 [ run ] triggered by Bot
Actionable comments posted: 6
🧹 Nitpick comments (7)
tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1)

`59-64`: Consider documenting the YAML limitation workaround

While the dynamic setting of `attention_op` is necessary due to YAML limitations, consider adding more detailed documentation about why this workaround is needed and what the long-term solution might be.

```diff
 # TODO (hg): default values that are not representable in YAML.
+# Function references like attention operators cannot be serialized in YAML config files,
+# so we need to dynamically inject them based on the string backend name.
 if "match_attention_layout" in self.ad_config.transforms:
```

tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (1)
`142-144`: Make error handling more explicit

Consider using an explicit error message when shape cannot be determined.

```diff
 else:
-    # If we can't determine the shape, skip quantization
-    return
+    # If we can't determine the shape, skip quantization
+    ad_logger.debug(f"Cannot determine weight shape for BMM node {node}, skipping quantization")
+    return
```

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (1)
`59-59`: Remove commented skip statement

If the test is working, remove the commented skip statement. If it's still flaky, uncomment it with a proper reason.

```diff
-# pytest.skip("https://nvbugspro.nvidia.com/bug/5170222")
```
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (1)
`100-101`: Track TODO for robust parameter extraction

The TODO comment indicates this function needs more robust handling similar to `extract_param_names_from_lin_node`.

Would you like me to create an issue to track implementing more robust parameter extraction for MoE nodes?
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
`80-81`: Remove or document unused args parameter

The `*args` parameter is not used in the function. Either remove it or document its purpose.

```diff
     skip_output_assert: bool = False,
-    *args,  # Additional arguments for transform
 ) -> GraphModule:
```
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (2)
`496-498`: Consider checking against the source attention op.

The transform should also skip nodes that already match the `attention_op.get_source_attention_op()` to avoid redundant transformations. Add a check to skip already-transformed nodes:

```diff
 for sdpa_node in list(graph.nodes):
     if sdpa_node.op != "call_function" or not is_op(sdpa_node, sdpa_ops):
         continue
+
+    # Skip if already using the target attention op
+    if is_op(sdpa_node, self.config.attention_op.get_source_attention_op()):
+        continue
```
`502-504`: Consider adding input validation.

The transform assumes q, k, v are the first 3 arguments and that they have metadata. Consider adding validation for robustness. Add validation before processing:

```diff
 # Extract q, k, v inputs
 q, k, v = sdpa_node.args[:3]
+
+# Validate inputs have metadata
+if not all(hasattr(arg, 'meta') and 'val' in arg.meta for arg in [q, k, v]):
+    ad_logger.warning(f"Skipping node {sdpa_node} due to missing metadata")
+    continue
+
+# Validate tensor dimensions
+for tensor, name in [(q, 'q'), (k, 'k'), (v, 'v')]:
+    if len(tensor.meta['val'].shape) != 4:
+        ad_logger.warning(f"Skipping node {sdpa_node}: {name} tensor is not 4D")
+        continue
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- tensorrt_llm/_torch/auto_deploy/config/default.yaml (1 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1 hunks)
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (0 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (16 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py (2 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (5 hunks)
💤 Files with no reviewable changes (3)
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
🧬 Code Graph Analysis (8)
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (2)
- tensorrt_llm/_torch/distributed/ops.py (1): AllReduce (362-491)
- tensorrt_llm/functional.py (1): AllReduceStrategy (3876-3884)
tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1): register_ad_pattern (97-180)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (1)
- tensorrt_llm/functional.py (1): matmul (1056-1107)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (2)
- tensorrt_llm/logger.py (1): warning (131-132)
- tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (2): lift_to_meta (76-89), canonicalize_graph (194-219)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (6)
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (4): run_test_transformed_gm (68-138), _build_model (28-29), _load_checkpoint (31-32), get_quant_config (37-38)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (3): MLP (97-111), BMMDynamicModel (237-253), BMMModel (225-234)
- tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (2): fp4_compatible (33-34), fp8_compatible (29-30)
- tensorrt_llm/_torch/auto_deploy/export/export.py (1): torch_export_to_gm (198-284)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): ModelFactory (15-207)
- tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1): InferenceOptimizer (37-202)
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (7)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1): AttentionDescriptor (498-624)
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1): is_op (183-206)
- tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (3): ADPatternMatcherPass (59-65), register_ad_pattern (97-180), call_function (247-274)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (4): BaseTransform (129-354), TransformConfig (50-89), TransformInfo (98-123), TransformRegistry (357-385)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2): get_attention_layout (29-30), get_source_attention_op (33-34)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (2): get_attention_layout (1025-1026), get_source_attention_op (1029-1030)
- tensorrt_llm/logger.py (1): debug (143-144)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2): SequenceEmbeddingInfo (92-104), set_example_sequence (96-104)
- tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_collective_fusion.py (1): check_transformed_graph (59-62)
- tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (2): reset_parameters (24-26), all_close (6-21)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (3)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4): AttentionDescriptor (498-624), to (310-314), get (644-646), args (151-159)
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1): run_test (141-225)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1): MockAttentionDescriptor (23-34)
🔇 Additional comments (50)
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (1)

`177-177`: LGTM! Critical unit conversion fix.

The change correctly converts `free_mem_post` from megabytes to bytes (multiplying by `1024 * 1024`) before applying the `free_mem_ratio` and adding to `current_cache_size`. This ensures consistent units throughout the calculation, as `current_cache_size` is already in bytes from `cm.current_cache_size_bytes()`.
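As a quick sanity check of the unit math described above: the helper below mirrors the review's variable names, but the function itself and the numbers are made up for illustration.

```python
# Illustrative sketch of the unit fix: free_mem_post is reported in MiB,
# while current_cache_size is already in bytes, so the free memory must be
# converted before the two are combined.
MIB = 1024 * 1024

def resized_cache_bytes(current_cache_size: int, free_mem_post_mib: int,
                        free_mem_ratio: float) -> int:
    # Fix: convert MiB -> bytes first, then apply the ratio and add.
    free_mem_post_bytes = free_mem_post_mib * MIB
    return int(current_cache_size + free_mem_ratio * free_mem_post_bytes)

# e.g. a 1 GiB cache, 2048 MiB free, using 50% of free memory:
size = resized_cache_bytes(1024 * MIB, 2048, 0.5)
assert size == 2048 * MIB
```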
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1)

`20-21`: LGTM! Temporary workaround is well-documented.

The temporary switch from `AllReduceStrategy.AUTO` to `AllReduceStrategy.NCCL` is appropriately documented with the bug URL and a clear intention to revert. This ensures the workaround won't be forgotten once the underlying bug is resolved.

tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)

`156-157`: Excellent documentation enhancement.

This clarifying note about functional isolation requirements is valuable for developers using the pattern matching system. It helps explain why certain subgraph patterns might not be eligible for replacement and reinforces the correctness guarantees of the PatternMatcherPass.
tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (2)

`144-149`: Good addition to handle dtype variations.

The float32 dummy inputs correctly address the case where `.float()` calls in the `_complex_rope_pattern` function can alter the graph topology based on input dtype. This ensures both float16 and float32 input scenarios are properly matched.

`181-190`: LGTM! Proper pattern registration for float32 variant.

The additional pattern registration using `dummy_complex_2` follows the established pattern and correctly handles the float32 input case. The reuse of the same search and replace functions avoids code duplication while ensuring comprehensive pattern coverage.

tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)
`22-33`: Well-structured configuration for new modular transforms.

The addition of the six new transforms (`quantize`, `quantize_moe`, and the four attention-related transforms) follows the established configuration pattern and correctly assigns them to the `pattern_matcher` stage. This properly integrates the new class-based transform system into the default pipeline.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py (4)
`3-3`: Proper import updates for new testing framework.

The imports correctly switch to the new modular testing helpers (`FakeFactory`, `run_test_transformed_gm`) that support the class-based transform architecture.

`7-8`: Correct imports for modular optimizer approach.

The addition of `torch_export_to_gm` and `InferenceOptimizer` imports aligns with the refactoring to use the new modular transform framework instead of direct function calls.

`66-74`: Well-structured adaptation to modular transform system.

The test correctly:
- Exports the model to a GraphModule with dynamic shapes
- Configures InferenceOptimizer with the `quantize_moe` transform
- Uses FakeFactory to provide quantization configuration
- Maintains the same transform behavior in the new framework

`76-86`: Proper use of new testing helper.

The switch to `run_test_transformed_gm` maintains all the original validation logic (graph checking, parameter counting, output verification) while adapting to the new modular transform testing approach.

tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
`99-119`: Good encapsulation of recompilation logic.

The refactoring to determine `recompile_graph` internally based on the presence of device-related operations is a solid improvement. This simplifies the API and ensures consistent behavior.

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (3)

`106-106`: Simplified sinks reshaping is correct.

The simplified reshape operation for sinks in the generate phase is appropriate since we're dealing with single-token generation.

`205-205`: Correct fix for sliding window mask logic.

Removing the `(pos_diff < 0)` condition is the right fix. The causal mask already handles preventing attention to future positions, so the sliding window mask should only enforce the backward-looking window constraint.

`220-228`: Improved sinks tensor handling for context phase.

The new approach with `new_sinks` properly handles dimension expansion for the context phase, making the implementation more robust and correct for batch processing.

tensorrt_llm/_torch/auto_deploy/transform/interface.py (2)
`290-319`: Improved state tracking in pre-cleanup.

The change to return an `(is_clean, has_valid_shapes)` tuple provides better visibility into the graph state after pre-cleanup. The logic correctly handles the different cleanup scenarios based on configuration requirements.

`233-242`: Better debugging experience with conditional error handling.

The conditional error handling based on `skip_on_error` is a great improvement. It provides better debugging capabilities when transforms fail while maintaining the option to skip errors in production scenarios.

tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

`79-84`: Good abstraction for position embeddings configuration.

The `_get_max_position_embeddings_config` method provides a clean way to configure maximum sequence length. This ensures consistency between the factory's `max_seq_len` and the model's position embeddings configuration.

`326-333`: Correct quantization config loading from fetched directory.

The change to load `hf_quant_config.json` from the fetched directory is correct. This ensures the quantization config is loaded from the same location as the model weights, which is especially important for HuggingFace hub downloads.

tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (2)
`27-84`: Well-structured quantization logic for linear operations.

The function properly handles both regular and pre-quantized graphs, correctly registers hooks and scales, and manages state dict updates.

`169-235`: Clean implementation of class-based quantization transform.

The transform properly handles quantization configuration, tracks quantized operations by type and algorithm, and returns comprehensive metadata. The use of defaultdict for tracking is elegant.
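The per-(op type, algorithm) bookkeeping praised above can be sketched like this. The node names and algorithm labels are invented for illustration; the real transform records similar counts in its `TransformInfo` metadata.

```python
from collections import defaultdict

# Sketch of tracking quantized ops by (op type, algorithm).
quantized = defaultdict(lambda: defaultdict(int))

# Pretend these (op_type, algorithm) pairs came from pattern matching:
matched_nodes = [
    ("linear", "fp8"),
    ("linear", "fp8"),
    ("bmm", "nvfp4"),
]
for op_type, algo in matched_nodes:
    quantized[op_type][algo] += 1

# Flatten the nested defaultdicts into plain dicts for reporting.
summary = {op: dict(algos) for op, algos in quantized.items()}
assert summary == {"linear": {"fp8": 2}, "bmm": {"nvfp4": 1}}
```

The nested `defaultdict` avoids key-existence checks in the match loop, which is what makes this pattern tidy for accumulating transform statistics.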
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (3)
`23-37`: Good test factory implementation.

The DummyFactory properly implements the minimal interface needed for testing quantization transforms.

`71-81`: Well-structured test using new transform infrastructure.

The test properly exports the model to GraphModule, applies the quantization transform via InferenceOptimizer, and validates the results.

`126-184`: Comprehensive BMM quantization testing.

The test properly handles both parameter-based and dynamic BMM models, correctly registers scales for each case, and validates quantization through the new transform infrastructure.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)

`23-36`: Good mock implementation for testing.

The MockAttentionDescriptor properly implements the AttentionDescriptor interface for testing purposes.
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2)
`20-98`: Well-implemented MoE quantization logic.

The function properly handles quantization of all three expert weight lists, correctly registers scales and hooks, and replaces the node with the quantized version.

`134-179`: Clean implementation of MoE quantization transform.

The transform properly handles MoE node quantization with appropriate exclusion patterns and returns comprehensive metadata.
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)
`17-39`: Well-designed test factory.

The FakeFactory provides a clean interface for testing with configurable cache and quantization configs.

`41-53`: Good extension of SequenceInfo for embedding tests.

The class properly extends SequenceInfo to support 3D input tensors for embedding-based models.

`68-139`: Comprehensive graph transformation testing utility.

The function provides thorough validation including parameter counting, output comparison, state dict loading from both original and transformed models, and re-export capability. This is a valuable addition to the testing infrastructure.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (9)
`10-11`: LGTM!

The import changes correctly reflect the migration to the new modular transform framework.

`161-170`: LGTM! Improved causal mask construction.

The updated causal mask construction is clearer and more standard, using `torch.triu` with `-inf` values and explicit dtype casting.

`245-252`: LGTM! Consistent mask construction pattern.

The causal mask construction is consistent with the pattern used in `EagerAttentionModel`.

`380-398`: LGTM! Efficient use of is_causal flag.

The change to use `is_causal=True` instead of explicit causal masks is more efficient and aligns with the pattern matching optimization that detects and replaces causal masks.

`410-459`: LGTM! Well-structured optimizer helper functions.

The optimizer getter functions provide clean encapsulation for creating `InferenceOptimizer` instances with appropriate configurations for each pattern type.

`544-545`: LGTM! Comprehensive test updates for eager attention.

The test properly validates the new pattern matching behavior:
- Tests both masked and unmasked attention scenarios
- Ensures models are in eval mode
- Correctly verifies the `is_causal` flag instead of explicit mask presence
- Uses the new optimizer infrastructure

Also applies to: 565-583, 669-678, 686-686

`744-746`: LGTM! Grouped attention test properly updated.

The test correctly validates grouped attention pattern matching and preservation of the `is_causal` flag.

Also applies to: 773-780, 788-788

`1018-1031`: LGTM! Proper inheritance from AttentionDescriptor.

The `MockAttentionDescriptor` correctly inherits from the abstract base class and implements the required interface methods.

`1323-1331`: LGTM! Clean implementation of attention layout test.

The test properly uses `InferenceOptimizer` with the `match_attention_layout` transform and `MockAttentionDescriptor` to validate layout transformations.

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (6)
`31-42`: LGTM! Standard repeat_kv pattern implementation.

The pattern correctly captures the unsqueeze→expand→reshape sequence used for key-value expansion in grouped query attention.
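For readers unfamiliar with the pattern, here is a NumPy analogue of the unsqueeze→expand→reshape KV expansion; the shapes and the function name are illustrative, not taken from the PR.

```python
import numpy as np

def repeat_kv(x: np.ndarray, n_rep: int) -> np.ndarray:
    """NumPy analogue of the unsqueeze->expand->reshape repeat_kv pattern:
    [b, n_kv, s, d] -> [b, n_kv * n_rep, s, d]."""
    b, n_kv, s, d = x.shape
    # unsqueeze: [b, n_kv, 1, s, d]
    x = x[:, :, None, :, :]
    # expand (broadcast) along the new axis: [b, n_kv, n_rep, s, d]
    x = np.broadcast_to(x, (b, n_kv, n_rep, s, d))
    # reshape to merge the kv-head and repeat axes: [b, n_kv * n_rep, s, d]
    return x.reshape(b, n_kv * n_rep, s, d)

kv = np.arange(2 * 2 * 3 * 4, dtype=np.float32).reshape(2, 2, 3, 4)
out = repeat_kv(kv, n_rep=4)
assert out.shape == (2, 8, 3, 4)
# each kv head is repeated contiguously, so adjacent output heads match
assert np.array_equal(out[:, 0], out[:, 1])
```

This three-op sequence is exactly what the pattern matcher looks for in traced graphs, which is why it can be replaced wholesale by a grouped attention op.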
`45-224`: LGTM! Comprehensive SDPA pattern coverage.

The eight SDPA patterns effectively cover various attention implementations:
- Causal vs non-causal attention
- Multiplication vs division scaling (with proper conversion)
- Explicit fp32 casting variations
- Downstream standardized pipeline handling

The optimization of replacing explicit causal masks with `is_causal=True` enables backend optimizations.

`226-265`: LGTM! Well-structured pattern configuration helper.

The function properly configures SDPA patterns with appropriate dummy tensors, scalar workarounds for literal values, and type ignoring for dtype conversions.

`267-319`: LGTM! Effective grouped attention pattern matching.

The patterns correctly handle both:
- Explicit repeat_kv + SDPA sequences → grouped_sdpa replacement
- Direct SDPA → grouped_sdpa replacement for unified handling

Preserves `is_causal` flag settings appropriately.

`321-360`: LGTM! Clean repeat_kv transform implementation.

The transform correctly registers the pattern with appropriate scalar workarounds and type ignoring for reshape/expand operations. Setting `run_shape_prop=True` when matches are found ensures proper shape information propagation.

`514-516`: LGTM! Proper metadata preservation.

The transform correctly preserves fake tensor metadata during transpose operations, which is crucial for maintaining shape information through the transformation.

Also applies to: 543-544
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (6)
`11-32`: LGTM! Well-implemented helper functions.

Both helper functions are correctly implemented:
- Logit softcapping uses the standard `cap * tanh(scores / cap)` formula
- Boolean mask conversion properly maps `True → 1.0` and `False → -inf`
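A quick standalone illustration of the softcapping formula mentioned above, in pure Python; the function name is mine, not the custom op's.

```python
import math

def soft_cap(score: float, cap: float) -> float:
    # Standard logit softcapping: smoothly bounds scores to (-cap, cap).
    return cap * math.tanh(score / cap)

# Small scores pass through nearly unchanged; large scores saturate near the cap.
assert abs(soft_cap(0.1, 50.0) - 0.1) < 1e-3
assert soft_cap(1e6, 50.0) <= 50.0
assert soft_cap(1e6, 50.0) > 49.9
```

Because tanh is smooth, this keeps gradients well-behaved for large logits, unlike a hard clamp.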
`105-126`: LGTM! Solid foundation for grouped SDPA.

The implementation correctly:
- Handles bnsd format inputs without unnecessary transposes
- Implements GQA by repeating K,V tensors when `n_heads != n_kv_heads`
- Sets appropriate default scale based on head dimension

`134-140`: LGTM! Correct causal mask application.

The causal mask is correctly applied only during context phase when `s_q == s_k`, avoiding unnecessary masking during generation.

`181-188`: LGTM! Standard attention path is correct.

The implementation properly handles:
- Softmax computation in float32 for numerical stability
- Casting back to query dtype
- Dropout application in inference mode
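The numerical-stability pattern noted above (softmax in float32, then cast back to the query dtype) looks roughly like this in NumPy, with float16 standing in for the query dtype; this is a sketch of the technique, not the kernel's actual code.

```python
import numpy as np

def stable_softmax_cast(scores_f16: np.ndarray) -> np.ndarray:
    # Upcast to float32 so exp/sum don't overflow or lose precision in half.
    s = scores_f16.astype(np.float32)
    s = s - s.max(axis=-1, keepdims=True)  # max-shift for numerical stability
    w = np.exp(s)
    w = w / w.sum(axis=-1, keepdims=True)
    # Cast back to the query dtype, as the attention path does.
    return w.astype(np.float16)

# These logits are representable in fp16, but exp() of them is not:
scores = np.array([[1000.0, 1001.0, 1002.0]], dtype=np.float16)
w = stable_softmax_cast(scores)
assert w.dtype == np.float16
assert abs(float(w.sum()) - 1.0) < 1e-2
```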
`228-238`: LGTM! Correct layout handling in bsnd wrapper.

The wrapper properly:
- Transposes inputs from bsnd to bnsd format
- Forwards all parameters to grouped_sdpa
- Transposes output back to bsnd format

Uses `contiguous()` to ensure memory layout efficiency.
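The layout round-trip described above can be shown with a NumPy stand-in; axis names follow the b/s/n/d convention and the helper names are illustrative assumptions.

```python
import numpy as np

def bsnd_to_bnsd(x: np.ndarray) -> np.ndarray:
    # [batch, seq, num_heads, head_dim] -> [batch, num_heads, seq, head_dim]
    return np.ascontiguousarray(x.transpose(0, 2, 1, 3))

def bnsd_to_bsnd(x: np.ndarray) -> np.ndarray:
    # inverse: swapping the same two axes back restores the original layout
    return np.ascontiguousarray(x.transpose(0, 2, 1, 3))

q_bsnd = np.random.rand(2, 16, 8, 64).astype(np.float32)  # b=2, s=16, n=8, d=64
q_bnsd = bsnd_to_bnsd(q_bsnd)
assert q_bnsd.shape == (2, 8, 16, 64)
# a bsnd wrapper would call the bnsd kernel here, then transpose back:
assert np.array_equal(bnsd_to_bsnd(q_bnsd), q_bsnd)
```

`ascontiguousarray` plays the role of `contiguous()` in the wrapper: transposes alone only change strides, so forcing a contiguous copy keeps the downstream kernel's memory access efficient.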
`193-208`: LGTM! Fake implementations properly updated.

Both fake implementations correctly include the new parameters (sinks, sliding_window, logit_cap) in their signatures.

Also applies to: 242-256
PR_Github #13696 [ run ] completed with state
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
`1-2`: Add NVIDIA copyright header.

According to the coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header that includes the current year. Add the following header at the beginning of the file:

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```
♻️ Duplicate comments (5)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2)
`47-66`: Fix transform application.

The transformed graph from `InferenceOptimizer` is not being returned. The function assigns the result to `gm` but doesn't return it.

```diff
-def _joint_transform(gm: GraphModule) -> None:
+def _joint_transform(gm: GraphModule) -> GraphModule:
     gm = InferenceOptimizer(
         None,
         {
             "match_repeat_kv": {
                 "stage": "pattern_matcher",
             },
             "match_eager_attention": {
                 "stage": "pattern_matcher",
             },
             "match_grouped_attention": {
                 "stage": "pattern_matcher",
             },
             "match_attention_layout": {
                 "stage": "pattern_matcher",
                 "attention_op": MockAttentionDescriptor,
             },
         },
     )(None, gm)
+    return gm
```
`154-155`: Capture transformed graph.

The transformed graph returned by `_joint_transform` should be assigned back to `gm`.

```diff
 # Apply transformation
-_joint_transform(gm)
+gm = _joint_transform(gm)
 assert verify_matcher(gm)
```

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)
`720-720`: Incorrect optimizer used in counter example test.

The test is designed to verify that similar tensor operations are not falsely matched as `repeat_kv` patterns. It should use `_get_match_repeat_kv_optimizer()` instead of `_get_match_eager_attention_optimizer()`.

```diff
-    _get_match_eager_attention_optimizer(),
+    _get_match_repeat_kv_optimizer(),
```

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)
`142-159`: Sliding window mask logic needs adjustment.

The sliding window implementation has an off-by-one error. The current condition masks positions where the key is after the query OR at exactly `sliding_window` distance, but typical sliding window attention allows attending to exactly `sliding_window + 1` positions (the current position plus `sliding_window` past positions).

```diff
-sliding_window_mask = (pos_diff < 0) | (pos_diff >= sliding_window)  # [s_q, s_k]
+# Mask positions where key is after query OR key is too far before query
+sliding_window_mask = (pos_diff < 0) | (pos_diff > sliding_window)  # [s_q, s_k]
```
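To make the off-by-one concrete, here is a small NumPy reproduction of the two boundary conditions. The names follow the review's diff; the sequence length and window size are made up.

```python
import numpy as np

s = 6
sliding_window = 2
pos = np.arange(s)
# pos_diff[i, j] = query position i minus key position j
pos_diff = pos[:, None] - pos[None, :]

masked_ge = (pos_diff < 0) | (pos_diff >= sliding_window)  # current condition
masked_gt = (pos_diff < 0) | (pos_diff > sliding_window)   # suggested fix

# With `>=`, each query attends to `sliding_window` keys (itself + window-1 past);
# with `>`, it attends to `sliding_window + 1` keys (itself + window past).
q = s - 1  # a query far enough from the start of the sequence
assert int((~masked_ge[q]).sum()) == sliding_window
assert int((~masked_gt[q]).sum()) == sliding_window + 1
```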
`164-180`: Incomplete sinks implementation.

The sinks mechanism computes attention scores with an additional sink column but doesn't concatenate corresponding sink values to the value tensor. This means the matrix multiplication at line 180 has mismatched dimensions or ignores the sink scores.

If sinks are meant to be attention sinks (tokens that always receive some attention), you need to either:
- Concatenate sink value vectors to `value_t` before the matmul, or
- Use only the non-sink portion of the scores (exclude the last column)

Please clarify the intended behavior and complete the implementation accordingly.
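The second option above (softmax over an extra sink column, then dropping it before the value matmul) can be sketched in NumPy. This is one plausible reading of the mechanism, not the PR's actual code; all names are invented.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

s_q, s_k = 4, 4
scores = np.random.rand(s_q, s_k)
sink = np.zeros((s_q, 1))  # one sink logit per query row

# Softmax over [scores | sink], then drop the sink column before the value
# matmul. The sink absorbs probability mass, so each row now sums to < 1.
weights = softmax(np.concatenate([scores, sink], axis=-1))[:, :s_k]
row_sums = weights.sum(axis=-1)
assert weights.shape == (s_q, s_k)
assert np.all(row_sums < 1.0) and np.all(row_sums > 0.0)
```

Dropping the sink column keeps the score/value shapes consistent, which is exactly the dimension mismatch the review is flagging.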
🧹 Nitpick comments (4)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
`17-39`: Consider updating the docstring to reflect dual purpose.

The `FakeFactory` class correctly implements a test factory, but the docstring only mentions `cache_config` while it supports both cache and quantization configurations.

```diff
-    """Dummy factory to pass cache_config for testing."""
+    """Dummy factory to pass cache_config and quant_config for testing."""
```
`68-139`: **LGTM! Comprehensive testing function for transformed graph modules.**

The `run_test_transformed_gm` function provides thorough validation including parameter counts, output correctness, state dict loading, and re-export capability. The logic correctly handles the pre-transformed graph module testing pattern needed for the new transform framework.

Consider extracting common testing logic between `run_test` and `run_test_transformed_gm` to reduce code duplication, as both functions share significant baseline testing code.
to reduce code duplication, as both functions share significant baseline testing code.tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)
`136-147`: **Refactor duplicated tensor initialization code**

The tensor initialization logic is duplicated. Consider extracting it into a helper function to improve maintainability.

```python
def _initialize_meta_tensors(module, device="cuda"):
    """Initialize meta tensors with random values on the specified device."""
    module._apply(
        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
        if t.device == torch.device("meta")
        else t.to(device)
    )
```

Then use it as:

```diff
-    model._apply(
-        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
-        if t.device == torch.device("meta")
-        else t.to(device)
-    )
+    _initialize_meta_tensors(model, device)

-    gm_exported._apply(
-        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
-        if t.device == torch.device("meta")
-        else t.to(device)
-    )
+    _initialize_meta_tensors(gm_exported, device)

-    gm._apply(
-        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
-        if t.device == torch.device("meta")
-        else t.to(device)
-    )
+    _initialize_meta_tensors(gm, device)
```

Also applies to: 159-163
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1)
`179-202`: **Add documentation for pattern 7**

Pattern 7 appears to handle a special case where the attention already uses `torch_attention_sdpa`. Consider adding a comment to clarify why this pattern exists.

```diff
-# Only pass in causal attention mask in downstream standardized pipeline
+# Pattern 7: Handle existing torch_attention_sdpa calls that incorrectly pass attention_mask
+# This pattern detects cases where causal mask is passed as attn_mask parameter and converts
+# it to use the is_causal flag instead for better performance
 def _sfdp_pattern_7(query, key, value, attention_mask, scaling, dropout):
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- `tensorrt_llm/_torch/auto_deploy/config/default.yaml` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py` (4 hunks)
- `tensorrt_llm/_torch/auto_deploy/models/hf.py` (4 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/interface.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/library/attention.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py` (3 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/_graph.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py` (0 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py` (0 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/transform.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py` (1 hunks)
- `tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py` (2 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py` (16 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py` (3 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py` (2 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py` (4 hunks)
💤 Files with no reviewable changes (2)
- tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py
✅ Files skipped from review due to trivial changes (2)
- tensorrt_llm/_torch/auto_deploy/config/default.yaml
- tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py
🚧 Files skipped from review as they are similar to previous changes (8)
- tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py
- tensorrt_llm/_torch/auto_deploy/transformations/transform.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py
- tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py
- tensorrt_llm/_torch/auto_deploy/models/hf.py
- tensorrt_llm/_torch/auto_deploy/transformations/_graph.py
- tensorrt_llm/_torch/auto_deploy/transform/interface.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🧬 Code Graph Analysis (1)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (5)

- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  - `AttentionDescriptor` (498-624)
  - `device` (147-148)
  - `args` (151-159)
  - `to` (310-314)
- tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  - `torch_export_to_gm` (198-284)
- tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1)
  - `InferenceOptimizer` (37-202)
- tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
  - `move_to_device` (132-139)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (5)
  - `MockAttentionDescriptor` (1018-1030)
  - `verify_matcher` (478-528)
  - `verify_matcher` (598-680)
  - `verify_matcher` (747-782)
  - `verify_matcher` (1207-1317)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2)
`2-2`: **LGTM! Import changes align with the class-based transform framework.**

The new imports support the refactored architecture: `List` and `Tuple` for type hints, `ModelFactory` for configuration access, and the transform framework interfaces.

Also applies to: 8-9, 12-12
`134-179`: **LGTM! Well-structured class-based transform implementation.**

The `QuantizeMOE` class correctly:

- Registers with the transform framework using the decorator
- Implements the required `_apply` method signature
- Handles configuration retrieval and early returns appropriately
- Preserves the original quantization logic while integrating with the new architecture
- Returns meaningful `TransformInfo` metadata for different scenarios

The refactoring successfully encapsulates the quantization functionality within the modular transform framework.
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
`11-11`: **LGTM! Import changes support the new testing utilities.**

The imports are necessary for the new `FakeFactory` and `SequenceEmbeddingInfo` classes.

Also applies to: 13-13
`41-53`: **LGTM! Well-designed extension for embedding input testing.**

The `SequenceEmbeddingInfo` class correctly extends `SequenceInfo` to support 3D embedding tensors instead of token IDs, maintaining proper device placement and data types.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)
`169-170`: **Consider reducing tolerance values**

The tolerance values (`atol=5e-2`, `rtol=5e-2`) seem high for float16 precision. Unless there's a specific reason for such high tolerances (e.g., accumulated errors from transformations), consider using more typical values like `1e-3`.

If the high tolerances are necessary due to numerical instability in the transformations, please add a comment explaining why these specific values were chosen.
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (2)
`31-42`: **LGTM! Correct repeat_kv pattern implementation**

The pattern correctly matches the HuggingFace `repeat_kv` implementation with unsqueeze, expand, and reshape operations.
`375-384`: **Consider setting run_shape_prop for consistency**

Unlike `MatchRepeatKV` and other transforms, `MatchEagerAttention` doesn't set `self.config.run_shape_prop = True` when patterns are matched. This might be intentional, but consider adding it for consistency if shape propagation is needed after eager attention transformations.

```python
if num_eager_patterns > 0:
    self.config.run_shape_prop = True
```
* Change the all-reduce strategy to NCCL When the strategy is set to AUTO and world_size>1 we experience hangs and CUDA memory errors. * This is the same issue as https://nvbugspro.nvidia.com/bug/5331013 * Without this change test test_ad_build_small_multi.py fails (tp==2) * This is a temporary change until we understand why this hang is happening. * On dllcuster this issue does not manifest. Signed-off-by: Neta Zmora <[email protected]> * Re-enable test_ad_build_small_multi.py tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py Signed-off-by: Neta Zmora <[email protected]> * fix kvcache mem size compute - convert to MB Signed-off-by: Gal Agam <[email protected]> --------- Signed-off-by: Neta Zmora <[email protected]> Signed-off-by: Gal Agam <[email protected]> Co-authored-by: Gal Agam <[email protected]> Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]> Signed-off-by: h-guo18 <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests Signed-off-by: Frida Hou <[email protected]> * Fix the torch backend Attention Signed-off-by: nvchenghaoz <[email protected]> * Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests" This reverts commit 5743fb3. --------- Signed-off-by: Frida Hou <[email protected]> Signed-off-by: nvchenghaoz <[email protected]> Co-authored-by: Frida Hou <[email protected]>
…tcher (#101) * attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests Signed-off-by: Frida Hou <[email protected]> * update matcher to only handle causal attn mask and set is_causal=True Signed-off-by: Frida Hou <[email protected]> * separate into three transformations Signed-off-by: Frida Hou <[email protected]> --------- Signed-off-by: Frida Hou <[email protected]>
* improve error handling and graph clean-up Signed-off-by: Lucas Liebenwein <[email protected]> * fix: avoid modify immutable type TransformInfo Signed-off-by: haoguo <[email protected]> --------- Signed-off-by: Lucas Liebenwein <[email protected]> Signed-off-by: haoguo <[email protected]> Co-authored-by: haoguo <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests Signed-off-by: Frida Hou <[email protected]> * Update the torch ref op Signed-off-by: nvchenghaoz <[email protected]> * Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests" This reverts commit 5743fb3. --------- Signed-off-by: Frida Hou <[email protected]> Signed-off-by: nvchenghaoz <[email protected]> Co-authored-by: Frida Hou <[email protected]>
* refactor: move quantization and quant_moe to new inf optimizer Signed-off-by: haoguo <[email protected]> * refactor: use quant_config from factory instead of new config type Signed-off-by: haoguo <[email protected]> * refactor: del old files; update default.yaml Signed-off-by: haoguo <[email protected]> * move helper class FakeFacotry to _graph_test_helpers.py Signed-off-by: haoguo <[email protected]> * polish: remove unreachable branch in quantization.py Co-authored-by: Fridah-nv <[email protected]> Signed-off-by: h-guo18 <[email protected]> * style: run pre-commit Signed-off-by: haoguo <[email protected]> * fix to fetch hf_quant_config from fetched dir Signed-off-by: Frida Hou <[email protected]> --------- Signed-off-by: haoguo <[email protected]> Signed-off-by: h-guo18 <[email protected]> Signed-off-by: Frida Hou <[email protected]> Co-authored-by: Fridah-nv <[email protected]>
…121) Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
* refactor: merge attn updates;move to new inf optimizer Signed-off-by: haoguo <[email protected]> * minor: fix import Signed-off-by: haoguo <[email protected]> * doc: fix file docstring Signed-off-by: haoguo <[email protected]> * Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py Co-authored-by: Lucas Liebenwein <[email protected]> Signed-off-by: h-guo18 <[email protected]> * Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py Co-authored-by: Lucas Liebenwein <[email protected]> Signed-off-by: h-guo18 <[email protected]> * Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py Co-authored-by: Lucas Liebenwein <[email protected]> Signed-off-by: h-guo18 <[email protected]> * polish: use config.run_shape_prop for shape prop Signed-off-by: haoguo <[email protected]> * polish: remove redundant canonicalize() Signed-off-by: haoguo <[email protected]> --------- Signed-off-by: haoguo <[email protected]> Signed-off-by: h-guo18 <[email protected]> Co-authored-by: Lucas Liebenwein <[email protected]>
…s and refactored (#119) * refactored compile_limit Signed-off-by: Eran Geva <[email protected]> * removed changes made to TorchCompileCompiler Signed-off-by: Eran Geva <[email protected]> * set cache_size_limit in TorchCompileCompiler Signed-off-by: Eran Geva <[email protected]> --------- Signed-off-by: Eran Geva <[email protected]>
…tch_sizes and refactored (#119)" This reverts commit 2c81f8a. Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Fridah-nv <[email protected]>
25e1230
to
ee0252a
/bot run
PR_Github #13723 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (5)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2)
`47-65`: **Add return statement for clarity**

The function modifies `gm` but doesn't return the transformed result from `InferenceOptimizer`.

```diff
 def _joint_transform(gm: GraphModule) -> None:
-    gm = InferenceOptimizer(
+    return InferenceOptimizer(
         None,
         {
             "match_repeat_kv": {
                 "stage": "pattern_matcher",
             },
             "match_eager_attention": {
                 "stage": "pattern_matcher",
             },
             "match_grouped_attention": {
                 "stage": "pattern_matcher",
             },
             "match_attention_layout": {
                 "stage": "pattern_matcher",
                 "attention_op": MockAttentionDescriptor,
             },
         },
     )(None, gm)
```
`154-155`: **Fix transform application**

The transformed graph from `_joint_transform` is not being used. The function should return the transformed graph.

```diff
     # Apply transformation
-    _joint_transform(gm)
+    gm = _joint_transform(gm)
```

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)
`720-720`: **Incorrect optimizer used in counter example test.**

The `test_counter_example` is designed to verify that similar tensor operations with different patterns are not falsely matched as `repeat_kv` patterns. It should use `_get_match_repeat_kv_optimizer()` instead of `_get_match_eager_attention_optimizer()`.

Apply this diff to fix the optimizer:

```diff
-        _get_match_eager_attention_optimizer(),
+        _get_match_repeat_kv_optimizer(),
```

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)
`142-159`: **Sliding window mask condition needs correction.**

The sliding window mask condition on line 157 still has the issue identified in previous reviews. The current condition `(pos_diff < 0) | (pos_diff >= sliding_window)` should be `(pos_diff < 0) | (pos_diff > sliding_window)` to allow attention to exactly `sliding_window + 1` positions (the current position plus `sliding_window` past positions), which is the standard behavior for sliding window attention.

`164-184`: **Sinks implementation remains incomplete.**

The sinks mechanism still has the issue identified in previous reviews. While sink values are included in the attention score normalization, the output computation on line 180 uses the original `value_t` tensor without concatenating corresponding sink values. For a complete sinks implementation, sink values should be concatenated to the value tensor to match the expanded attention scores that include the sink column.
🧹 Nitpick comments (1)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
`68-139`: **Comprehensive testing utility for transformed GraphModules.**

This function provides thorough validation of transformed GraphModules including parameter consistency, output correctness, and state dict loading. The implementation is well-structured with clear error messages and appropriate test coverage.

Consider extracting common logic between `run_test_transformed_gm` and `run_test` into shared helper functions to reduce code duplication, particularly for parameter counting, output verification, and state dict loading tests.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- `tensorrt_llm/_torch/auto_deploy/config/default.yaml` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py` (4 hunks)
- `tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py` (3 hunks)
- `tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/models/hf.py` (4 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/interface.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/library/attention.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py` (3 hunks)
- `tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/_graph.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py` (0 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py` (0 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py` (2 hunks)
- `tensorrt_llm/_torch/auto_deploy/transformations/transform.py` (1 hunks)
- `tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py` (1 hunks)
- `tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py` (2 hunks)
- `tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py` (0 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py` (16 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py` (3 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py` (2 hunks)
- `tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py` (4 hunks)
💤 Files with no reviewable changes (3)
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py
✅ Files skipped from review due to trivial changes (1)
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
🚧 Files skipped from review as they are similar to previous changes (13)
- tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py
- tensorrt_llm/_torch/auto_deploy/config/default.yaml
- tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
- tensorrt_llm/_torch/auto_deploy/transformations/_graph.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py
- tensorrt_llm/_torch/auto_deploy/transformations/transform.py
- tensorrt_llm/_torch/auto_deploy/transform/interface.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
- tensorrt_llm/_torch/auto_deploy/models/hf.py
- tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
- tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (7)

- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  - `AttentionDescriptor` (498-624)
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  - `is_op` (183-206)
- tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (3)
  - `ADPatternMatcherPass` (59-65)
  - `register_ad_pattern` (97-180)
  - `call_function` (247-274)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (4)
  - `BaseTransform` (129-354)
  - `TransformConfig` (50-89)
  - `TransformInfo` (98-123)
  - `TransformRegistry` (357-385)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2)
  - `get_attention_layout` (29-30)
  - `get_source_attention_op` (33-34)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (2)
  - `get_attention_layout` (1025-1026)
  - `get_source_attention_op` (1029-1030)
- tensorrt_llm/logger.py (1)
  - `debug` (143-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (31)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)
`11-13`: **LGTM!**

The new imports correctly support the added functionality and follow the existing import patterns.

`17-39`: **Well-designed test utility class.**

The `FakeFactory` class provides a clean way to inject configuration data into tests without requiring full model instantiation. The stub methods are appropriately implemented for testing purposes.

`41-53`: **Good extension for embedding-like test inputs.**

The `SequenceEmbeddingInfo` class properly extends `SequenceInfo` to support tests requiring 3D embedding tensors. The implementation maintains device and dtype consistency.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (4)
`3-20`: **Good import updates for new functionality.**

The added imports support meta device initialization, graph transformations, and reproducible testing. The manual seed setting ensures test consistency.

`23-35`: **Proper implementation of AttentionDescriptor interface.**

The `MockAttentionDescriptor` class correctly implements the required interface methods and aligns with the new attention descriptor design.

`112-156`: **Well-structured test with comprehensive verification.**

The test properly uses meta device initialization, exports to GraphModule, and performs thorough verification of the transformation and output consistency.

`105-108`: **Good verification of pattern transformation.**

The verification correctly checks that old attention patterns are replaced with the expected new patterns after transformation.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (10)
10-10: LGTM! The import addition supports the new attention descriptor interface used in the updated code.

161-170: Improved attention mask construction. The new implementation creates causal masks more efficiently by directly using `torch.triu` with `-inf` values instead of boolean masks with `masked_fill`. This is cleaner and more direct.
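To illustrate the contrast this comment draws, here is a minimal sketch (sizes are made up) showing that building the additive mask directly with `torch.triu` yields the same tensor as the older boolean-mask-plus-`masked_fill` approach:

```python
import torch

s = 4  # sequence length (illustrative)

# Old style: boolean upper-triangular mask + masked_fill
bool_mask = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
old_mask = torch.zeros(s, s).masked_fill(bool_mask, float("-inf"))

# New style: build the additive -inf mask directly with torch.triu
new_mask = torch.triu(torch.full((s, s), float("-inf")), diagonal=1)

assert torch.equal(old_mask, new_mask)
```

Both produce a mask that is 0 on and below the diagonal and -inf above it; the second form skips the intermediate boolean tensor.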
244-252: Consistent mask construction improvement. The same efficient causal mask construction is applied here, maintaining consistency across model classes.

380-398: Good update to use the explicit is_causal flag. The change to use the `is_causal` parameter instead of explicit mask tensors for causal attention is more efficient and aligns with the updated attention operator interface.
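A small sketch of why the `is_causal` flag can replace an explicit mask tensor: for self-attention with equal query and key lengths, PyTorch's `F.scaled_dot_product_attention` produces the same output either way (the shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 2, 4, 8)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# Explicit additive causal mask ...
mask = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# ... is equivalent to passing is_causal=True with no mask at all
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

assert torch.allclose(out_masked, out_causal, atol=1e-6)
```

The flag variant avoids materializing the mask tensor and lets fused kernels pick a specialized causal path.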
410-458: Well-designed modular optimizer functions. The new optimizer getter functions provide clean separation of concerns with consistent configuration patterns. Each function focuses on a specific pattern-matching stage while including necessary cleanup operations.

564-583: Good test practice improvements. Chaining `.eval()` after `.to()` ensures models are properly set to evaluation mode for inference testing, which is important for consistent test results.
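A toy example of the practice being praised: chaining `.eval()` after `.to()` switches off training-mode behavior such as dropout, making test outputs deterministic (the two-layer model here is made up for illustration):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(p=0.5))
model = model.to(torch.float32).eval()  # one chained setup expression

assert not model.training  # Dropout is now a no-op

x = torch.randn(2, 4)
# Two forward passes agree only because eval mode disabled dropout
assert torch.equal(model(x), model(x))
```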
669-678: Proper verification of the new attention interface. The verification logic correctly checks for the `is_causal` parameter in both kwargs and positional arguments, ensuring the new attention interface is working as expected.

744-745: Good test simplification. Removing the conditional logic and always expecting 1 grouped SDPA node makes the test more straightforward and aligns with the model implementation.

1018-1031: Proper AttentionDescriptor implementation. The updated `MockAttentionDescriptor` class correctly implements the required interface methods and follows the new attention descriptor pattern.

1323-1331: Consistent optimizer-based approach. The test correctly uses `InferenceOptimizer` with the `match_attention_layout` stage, maintaining consistency with the overall refactoring approach.

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (9)
1-17: LGTM! Well-structured module with clear imports. The module docstring clearly describes the purpose, and imports are properly organized following Python best practices.

19-29: LGTM! Clean utility function for pattern application. The function properly encapsulates pattern registration and application logic with a clear interface.

31-42: LGTM! Correct repeat_kv pattern implementation. The pattern accurately captures the standard repeat_kv logic from HuggingFace models, and the replacement properly uses the custom operator.
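For context, the standard HuggingFace-style `repeat_kv` logic this pattern targets looks roughly like the following (a re-implementation for illustration, not the code under review); the expand/reshape trick is equivalent to `repeat_interleave` over the head dimension:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand KV heads for GQA: (b, n_kv, s, d) -> (b, n_kv * n_rep, s, d)."""
    b, n_kv, s, d = x.shape
    if n_rep == 1:
        return x
    # Insert a repeat axis, expand it (no copy), then flatten into the head dim
    x = x[:, :, None, :, :].expand(b, n_kv, n_rep, s, d)
    return x.reshape(b, n_kv * n_rep, s, d)

kv = torch.randn(2, 2, 5, 8)   # 2 KV heads
expanded = repeat_kv(kv, 3)    # shared across 6 query heads
assert expanded.shape == (2, 6, 5, 8)
assert torch.equal(expanded, kv.repeat_interleave(3, dim=1))
```

Matching this expand/reshape sequence structurally is what lets the transform collapse it into a single custom op.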
44-264: LGTM! Comprehensive SDPA pattern coverage. The 8 patterns effectively cover the major variations of scaled dot-product attention found in transformer models, including different scaling approaches, masking, and casting behaviors. The pattern generation function is well-structured with appropriate dummy data and scalar workarounds.

267-319: LGTM! Well-designed grouped attention patterns. The 4 patterns effectively handle different GQA scenarios with and without explicit repeat_kv operations, covering both causal and non-causal attention variants. The replacements appropriately use the grouped SDPA operator.

321-360: LGTM! Proper transform class implementation. The MatchRepeatKV class correctly follows the BaseTransform pattern, uses appropriate dummy arguments and scalar workarounds, and properly handles shape propagation when patterns are found.

362-385: LGTM! Clean eager attention transform implementation. The class properly leverages the pattern generation function to register all SDPA patterns efficiently.

387-454: LGTM! Comprehensive grouped attention transform. The class properly registers all grouped attention patterns with appropriate dummy tensors and scalar workarounds for GQA scenarios.

456-563: LGTM! Comprehensive attention layout transformation. The MatchAttentionLayout transform correctly handles layout conversions between 'bnsd' and 'bsnd' formats, properly inserts transpose operations, and preserves meta tensor information for shape propagation. The implementation covers both SDPA variants appropriately.
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (5)
11-31: LGTM! Well-implemented helper functions. Both helper functions are correctly implemented with clear docstrings and proper parameter validation, and follow the expected mathematical formulations for logit softcapping and mask conversion.
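The exact helper names and signatures are not shown in this review, but the two formulations it refers to can be sketched as follows (`soft_cap` and `bool_mask_to_float` are hypothetical names for illustration):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Softcapping squashes attention logits smoothly into (-cap, cap)
    return cap * torch.tanh(logits / cap)

def bool_mask_to_float(mask: torch.Tensor) -> torch.Tensor:
    # True = attend, False = masked out  ->  additive 0.0 / -inf mask
    return torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))

capped = soft_cap(torch.tensor([0.0, 100.0, -100.0]), cap=50.0)
assert capped[0] == 0.0 and capped.abs().max() < 50.0  # bounded by the cap

additive = bool_mask_to_float(torch.tensor([True, False]))
assert additive[0] == 0.0 and additive[1] == float("-inf")
```

The additive form lets the converted mask be summed directly onto the attention logits before the softmax.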
92-123: LGTM! Proper function signature and GQA handling. The updated function signature includes the necessary parameters, and the GQA implementation correctly handles key-value tensor repetition when needed. Scale calculation follows standard practices.

127-140: LGTM! Correct attention and causal mask handling. The implementation properly converts boolean masks to float and applies causal masking only during the context phase (when s_q == s_k), following standard practices.

182-191: LGTM! Correct standard attention computation. The fallback softmax path and dropout application are properly implemented, maintaining format consistency between input and output.

194-255: LGTM! Consistent function signature updates. All fake functions properly match their real counterparts with the new parameters, and the bsnd variant correctly handles layout transformations between bsnd and bnsd formats.
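The bsnd/bnsd handling mentioned here boils down to a single transpose of the sequence and head dimensions; a quick sketch with made-up shapes:

```python
import torch

b, s, n, d = 2, 5, 4, 8  # batch, seq, heads, head_dim (illustrative)
x_bsnd = torch.randn(b, s, n, d)

# Converting [b, s, n, d] -> [b, n, s, d] is a transpose of dims 1 and 2
x_bnsd = x_bsnd.transpose(1, 2)
assert x_bnsd.shape == (b, n, s, d)

# The round trip recovers the original layout exactly
assert torch.equal(x_bnsd.transpose(1, 2), x_bsnd)
```

A bsnd variant can therefore wrap a bnsd kernel by transposing inputs on the way in and the output on the way out.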
PR_Github #13723 [ run ] completed with state |
Signed-off-by: Neta Zmora <[email protected]> Signed-off-by: Gal Agam <[email protected]> Signed-off-by: Lucas Liebenwein <[email protected]> Signed-off-by: haoguo <[email protected]> Signed-off-by: h-guo18 <[email protected]> Signed-off-by: Frida Hou <[email protected]> Signed-off-by: nvchenghaoz <[email protected]> Signed-off-by: Eran Geva <[email protected]> Signed-off-by: Fridah-nv <[email protected]> Co-authored-by: Neta Zmora <[email protected]> Co-authored-by: Gal Agam <[email protected]> Co-authored-by: h-guo18 <[email protected]> Co-authored-by: nvchenghaoz <[email protected]> Co-authored-by: Frida Hou <[email protected]> Co-authored-by: Eran Geva <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Documentation
Tests
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
Kill all running builds associated with the pull request.
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.