
Conversation

lucaslie (Member) commented Jul 31, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a comprehensive framework for detecting and transforming various attention patterns into optimized backend calls with support for grouped attention, sliding window masking, logit softcapping, and sinks.
    • Added configurable class-based transforms for quantization and mixture-of-experts (MoE) quantization integrated into a modular transform registry.
    • Enhanced model factories and testing utilities for improved export, transformation, and validation workflows.
  • Bug Fixes

    • Corrected memory size calculations for cache resizing to use bytes instead of megabytes.
    • Fixed sliding window attention mask logic and improved causal masking handling in attention operations.
  • Refactor

    • Migrated quantization and MoE quantization from standalone functions to class-based transforms for better modularity and metadata reporting.
    • Centralized attention layout configuration and pattern matching into a modular inference optimizer, removing legacy pattern matcher calls.
    • Simplified device graph recompilation logic and streamlined transform error handling and status tracking.
    • Removed legacy attention pattern matching implementations and replaced them with modular transform registrations.
  • Documentation

    • Enhanced pattern matcher documentation to clarify isolation requirements for pattern nodes.
  • Tests

    • Refactored and expanded tests to use unified inference optimizer calls for attention and quantization transforms, verifying operator replacement and output consistency.
    • Improved test models to support meta device initialization, graph export, and dynamic shape handling.
    • Added utilities for testing transformed graph modules including parameter count checks and state dict loading validation.
    • Removed tests related to legacy causal attention mask matching.
    • Removed conditional test skips for multi-GPU to enable broader test coverage.
  • Chores

    • Updated default configuration to add new transform entries for pattern matching stages.
    • Removed legacy imports of attention and quantization modules to align with new modular transform structure.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
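
For example, the flags documented above can be combined in a single PR comment (stage and GPU names below are illustrative placeholders, not a recommended configuration):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "A30"
/bot run --reuse-test --post-merge
```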

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

@lucaslie lucaslie requested a review from a team as a code owner July 31, 2025 15:00
@lucaslie lucaslie requested a review from Fridah-nv July 31, 2025 15:00
coderabbitai bot (Contributor) commented Jul 31, 2025

📝 Walkthrough

This change overhauls the attention and quantization transformation infrastructure in the TensorRT-LLM PyTorch auto-deploy pipeline. It migrates pattern matching and quantization logic from function-based transformations to modular, class-based transforms registered in a new transform library. The configuration, test harnesses, and transformation application logic are refactored to use this new modular system. Several new features and correctness improvements are introduced for attention pattern matching, quantization, and layout handling.
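
As a rough illustration of the migration described here, a class-based transform registry typically pairs a registration decorator with transforms that return metadata instead of mutating silently. All names below are hypothetical sketches of the pattern, not the actual TensorRT-LLM API:

```python
# Minimal sketch of a class-based transform registry; names are illustrative
# stand-ins for the pattern described in the walkthrough, not real API.
from dataclasses import dataclass
from typing import Callable, Dict, Type


@dataclass
class TransformInfo:
    """Metadata returned by a transform (richer than a bare pass/fail)."""
    skipped: bool
    num_matches: int


class BaseTransform:
    def apply(self, gm):
        raise NotImplementedError


class TransformRegistry:
    _registry: Dict[str, Type[BaseTransform]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[BaseTransform]], Type[BaseTransform]]:
        def decorator(t_cls: Type[BaseTransform]) -> Type[BaseTransform]:
            cls._registry[name] = t_cls
            return t_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> Type[BaseTransform]:
        return cls._registry[name]


@TransformRegistry.register("quantize")
class Quantize(BaseTransform):
    def apply(self, gm):
        # a real transform would pattern-match and rewrite the graph here
        return gm, TransformInfo(skipped=False, num_matches=3)


gm, info = TransformRegistry.get("quantize")().apply(object())
print(info.num_matches)  # → 3
```

Because transforms are looked up by name, the set of applied transforms can be driven entirely by a config file (as the `default.yaml` entries in this PR do), rather than by hard-coded function calls.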

Changes

Cohort / File(s) Change Summary
Attention Transform Library Migration & Enhancements
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
Introduces a new attention transform library with modular pattern matchers for repeat_kv, eager attention, grouped attention, and layout transforms. Implements custom grouped SDPA with grouped-query attention (GQA), sliding window masking, logit softcapping, and sinks. Refactors backend attention ops for correct masking and sinks handling. Legacy attention pattern matcher functions and their imports are removed.
Quantization Transform Refactor
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
Refactors quantization and MoE quantization from function-based to class-based transforms, integrating them into the new transform registry and returning richer metadata including transform info. Legacy standalone quantization functions are removed.
Legacy Transform Removal
tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py
tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
Removes legacy function-based attention pattern matchers and their wildcard imports from the transformations library.
Transform Application Refactor
tensorrt_llm/_torch/auto_deploy/transformations/transform.py
Removes legacy attention and quantization pattern matcher calls from the transformation pipeline, replacing them with modular transform configuration and invocation via InferenceOptimizer. The attention layout transform is configured with the appropriate attention operator from the registry.
Transform Interface & Cleanup Logic
tensorrt_llm/_torch/auto_deploy/transform/interface.py
Updates the transform interface to propagate and combine cleanliness and shape validity status, and improves error handling during transform application by conditionally wrapping calls based on config flags.
Config & Registry Updates
tensorrt_llm/_torch/auto_deploy/config/default.yaml
Adds six new transform entries for quantization and attention pattern matchers (quantize, quantize_moe, match_repeat_kv, match_eager_attention, match_grouped_attention, match_attention_layout) to the configuration under the pattern_matcher stage.
Device Move Logic
tensorrt_llm/_torch/auto_deploy/transformations/_graph.py
Simplifies device move logic by removing external recompilation control; recompilation is now internally determined based on node inspection during device move.
Distributed AllReduce Strategy
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
Temporarily forces NCCL AllReduce strategy due to an NVIDIA bug, replacing the previous AUTO strategy with a comment noting the workaround.
Model Factory & Quant Config Handling
tensorrt_llm/_torch/auto_deploy/models/hf.py
Adds methods to dynamically merge max position embedding configuration into model kwargs. Refactors quantization config loading to use the fetched directory path instead of model path.
Pattern Matcher Documentation
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py
Extends docstring to clarify that pattern matcher passes require patterns to be functionally isolated with no shared intermediate nodes (checking num_users).
Test Harness & Utility Enhancements
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
Adds advanced test harness utilities including FakeFactory for model/quant config management, SequenceEmbeddingInfo for embedding shape/dtype setup, and run_test_transformed_gm for comprehensive testing of transformed graph modules with parameter and output validation.
Attention Matcher Test Refactor
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
Refactors tests to use the new modular optimizer for attention pattern matching. Updates attention mask handling to use explicit is_causal flags and masks with -inf values. Removes legacy causal attention mask tests. Enhances verification logic to check for correct attention operator arguments and layout. Implements meta device model initialization and graph export in tests.
Quantization Test Refactor
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py
Refactors quantization tests to use the new modular optimizer and advanced test harnesses, replacing direct transform calls with InferenceOptimizer invocations on graph modules. Introduces dummy/fake factories for config provision.
Other Test and Utility Updates
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py
tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
Fixes cache size calculation units to bytes. Adds float32 dummy inputs for RoPE pattern matching to handle .float() calls. Removes multi-GPU test skip to enable test execution on multi-GPU setups.
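
The repeat_kv pattern referenced in the attention cohort above is the standard GQA trick: each of the `n_kv_heads` key/value heads is repeated to serve a group of query heads. A torch-free sketch of the head index mapping (assuming, as usual, that the query head count is a multiple of the KV head count; not the actual library code):

```python
# Hypothetical sketch of the repeat_kv idea behind grouped-query attention
# (GQA): each of n_kv_heads KV heads serves n_heads // n_kv_heads query
# heads, so KV heads are repeated to line up one-to-one with query heads.
def repeat_kv_head_ids(n_heads: int, n_kv_heads: int) -> list:
    assert n_heads % n_kv_heads == 0, "query heads must be a multiple of KV heads"
    n_rep = n_heads // n_kv_heads
    # query head i attends against KV head i // n_rep
    return [kv for kv in range(n_kv_heads) for _ in range(n_rep)]


print(repeat_kv_head_ids(8, 2))  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

In the real graph transform, matching this repeat/expand structure lets the matcher fuse it into a single grouped SDPA call instead of materializing the repeated KV tensors.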

Sequence Diagram(s)

sequenceDiagram
    participant ModelFactory
    participant InferenceOptimizer
    participant TransformRegistry
    participant TransformClass
    participant GraphModule

    ModelFactory->>InferenceOptimizer: Provide model, config, quant_config
    InferenceOptimizer->>TransformRegistry: Lookup transforms (e.g., quantize, match_eager_attention)
    loop For each transform in config
        InferenceOptimizer->>TransformClass: Call _apply(GraphModule, ...)
        TransformClass->>GraphModule: Pattern match & transform
        TransformClass-->>InferenceOptimizer: Return (GraphModule, TransformInfo)
    end
    InferenceOptimizer-->>ModelFactory: Return transformed GraphModule

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Suggested reviewers

  • yilin-void
  • liji-nv
  • pamelap-nvidia

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@lucaslie force-pushed the ll/merge-ad-2025-07-22 branch from 5e06672 to eef1c71 on July 31, 2025 15:02
lucaslie (Member, Author) commented:
/bot run

tensorrt-cicd (Collaborator) commented:

PR_Github #13696 [ run ] triggered by Bot

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (7)
tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1)

59-64: Consider documenting the YAML limitation workaround

While the dynamic setting of attention_op is necessary due to YAML limitations, consider adding more detailed documentation about why this workaround is needed and what the long-term solution might be.

 # TODO (hg): default values that are not representable in YAML.
+# Function references like attention operators cannot be serialized in YAML config files,
+# so we need to dynamically inject them based on the string backend name.
 if "match_attention_layout" in self.ad_config.transforms:
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (1)

142-144: Make error handling more explicit

Consider using an explicit error message when shape cannot be determined.

     else:
-        # If we can't determine the shape, skip quantization
-        return
+        # If we can't determine the shape, skip quantization
+        ad_logger.debug(f"Cannot determine weight shape for BMM node {node}, skipping quantization")
+        return
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (1)

59-59: Remove commented skip statement

If the test is working, remove the commented skip statement. If it's still flaky, uncomment it with a proper reason.

-    # pytest.skip("https://nvbugspro.nvidia.com/bug/5170222")
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (1)

100-101: Track TODO for robust parameter extraction

The TODO comment indicates this function needs more robust handling similar to extract_param_names_from_lin_node.

Would you like me to create an issue to track implementing more robust parameter extraction for MoE nodes?

tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)

80-81: Remove or document unused args parameter

The *args parameter is not used in the function. Either remove it or document its purpose.

     skip_output_assert: bool = False,
-    *args,  # Additional arguments for transform
 ) -> GraphModule:
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (2)

496-498: Consider checking against the source attention op.

The transform should also skip nodes that already match the attention_op.get_source_attention_op() to avoid redundant transformations.

Add a check to skip already-transformed nodes:

 for sdpa_node in list(graph.nodes):
     if sdpa_node.op != "call_function" or not is_op(sdpa_node, sdpa_ops):
         continue
+    
+    # Skip if already using the target attention op
+    if is_op(sdpa_node, self.config.attention_op.get_source_attention_op()):
+        continue

502-504: Consider adding input validation.

The transform assumes q, k, v are the first 3 arguments and that they have metadata. Consider adding validation for robustness.

Add validation before processing:

 # Extract q, k, v inputs
 q, k, v = sdpa_node.args[:3]
+
+# Validate inputs have metadata
+if not all(hasattr(arg, 'meta') and 'val' in arg.meta for arg in [q, k, v]):
+    ad_logger.warning(f"Skipping node {sdpa_node} due to missing metadata")
+    continue
+
+# Validate tensor dimensions
+for tensor, name in [(q, 'q'), (k, 'k'), (v, 'v')]:
+    if len(tensor.meta['val'].shape) != 4:
+        ad_logger.warning(f"Skipping node {sdpa_node}: {name} tensor is not 4D")
+        continue
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between baece56 and eef1c71.

📒 Files selected for processing (22)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (16 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (5 hunks)
💤 Files with no reviewable changes (3)
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
🧬 Code Graph Analysis (8)
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (2)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (362-491)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3884)
tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (1)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)
  • register_ad_pattern (97-180)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (1)
tensorrt_llm/functional.py (1)
  • matmul (1056-1107)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (2)
tensorrt_llm/logger.py (1)
  • warning (131-132)
tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (2)
  • lift_to_meta (76-89)
  • canonicalize_graph (194-219)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (6)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (4)
  • run_test_transformed_gm (68-138)
  • _build_model (28-29)
  • _load_checkpoint (31-32)
  • get_quant_config (37-38)
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (3)
  • MLP (97-111)
  • BMMDynamicModel (237-253)
  • BMMModel (225-234)
tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (2)
  • fp4_compatible (33-34)
  • fp8_compatible (29-30)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  • torch_export_to_gm (198-284)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • ModelFactory (15-207)
tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1)
  • InferenceOptimizer (37-202)
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • AttentionDescriptor (498-624)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • is_op (183-206)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (3)
  • ADPatternMatcherPass (59-65)
  • register_ad_pattern (97-180)
  • call_function (247-274)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (4)
  • BaseTransform (129-354)
  • TransformConfig (50-89)
  • TransformInfo (98-123)
  • TransformRegistry (357-385)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2)
  • get_attention_layout (29-30)
  • get_source_attention_op (33-34)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (2)
  • get_attention_layout (1025-1026)
  • get_source_attention_op (1029-1030)
tensorrt_llm/logger.py (1)
  • debug (143-144)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
  • SequenceEmbeddingInfo (92-104)
  • set_example_sequence (96-104)
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_collective_fusion.py (1)
  • check_transformed_graph (59-62)
tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (2)
  • reset_parameters (24-26)
  • all_close (6-21)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • AttentionDescriptor (498-624)
  • to (310-314)
  • get (644-646)
  • args (151-159)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • run_test (141-225)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)
  • MockAttentionDescriptor (23-34)
🔇 Additional comments (50)
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (1)

177-177: LGTM! Critical unit conversion fix.

The change correctly converts free_mem_post from megabytes to bytes (multiplying by 1024 * 1024) before applying the free_mem_ratio and adding to current_cache_size. This ensures consistent units throughout the calculation, as current_cache_size is already in bytes from cm.current_cache_size_bytes().
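
The unit mismatch this fix addresses reduces to the following arithmetic (variable names are illustrative only, not the actual kvcache.py code):

```python
# Illustrative arithmetic only; names do not come from kvcache.py.
def new_cache_size(current_cache_size_bytes, free_mem_post_mb, free_mem_ratio):
    # free memory is reported in MiB, so convert to bytes before mixing it
    # with the byte-denominated current cache size
    free_mem_post_bytes = free_mem_post_mb * 1024 * 1024
    return int(current_cache_size_bytes + free_mem_ratio * free_mem_post_bytes)


# 1 GiB current cache, 2048 MiB free, grow by half the free memory:
print(new_cache_size(1 << 30, 2048, 0.5))  # → 2147483648 (2 GiB)
```

Without the `* 1024 * 1024` conversion, the growth term would be roughly a million times too small and the resize would be a no-op in practice.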

tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1)

20-21: LGTM! Temporary workaround is well-documented.

The temporary switch from AllReduceStrategy.AUTO to AllReduceStrategy.NCCL is appropriately documented with the bug URL and clear intention to revert. This ensures the workaround won't be forgotten once the underlying bug is resolved.

tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)

156-157: Excellent documentation enhancement.

This clarifying note about functional isolation requirements is valuable for developers using the pattern matching system. It helps explain why certain subgraph patterns might not be eligible for replacement and reinforces the correctness guarantees of the PatternMatcherPass.
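
The isolation requirement can be pictured with a toy users-count check (a simplified stand-in for torch.fx node bookkeeping, not the real pattern matcher):

```python
# Toy illustration of the "functionally isolated" requirement: intermediate
# nodes of a candidate pattern must not feed nodes outside the pattern.
# Node/users are simplified stand-ins for torch.fx internals.
class Node:
    def __init__(self, name):
        self.name = name
        self.users = []  # nodes that consume this node's output


def pattern_is_isolated(intermediates, pattern_nodes):
    pattern_set = set(pattern_nodes)
    # every consumer of an intermediate must itself be part of the pattern,
    # otherwise replacing the pattern would orphan that consumer
    return all(u in pattern_set for n in intermediates for u in n.users)


a, b, c, outside = Node("a"), Node("b"), Node("c"), Node("outside")
a.users = [b]
b.users = [c]
print(pattern_is_isolated([a, b], [a, b, c]))  # → True: safe to replace

b.users.append(outside)  # a node outside the pattern now reuses b
print(pattern_is_isolated([a, b], [a, b, c]))  # → False: shared intermediate
```

This is exactly why a subgraph whose intermediate result is reused elsewhere (num_users > 1) is not eligible for replacement.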

tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (2)

144-149: Good addition to handle dtype variations.

The float32 dummy inputs correctly address the case where .float() calls in the _complex_rope_pattern function can alter the graph topology based on input dtype. This ensures both float16 and float32 input scenarios are properly matched.


181-190: LGTM! Proper pattern registration for float32 variant.

The additional pattern registration using dummy_complex_2 follows the established pattern and correctly handles the float32 input case. The reuse of the same search and replace functions avoids code duplication while ensuring comprehensive pattern coverage.

tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)

22-33: Well-structured configuration for new modular transforms.

The addition of the six new transforms (quantize, quantize_moe, and the four attention-related transforms) follows the established configuration pattern and correctly assigns them to the pattern_matcher stage. This properly integrates the new class-based transform system into the default pipeline.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py (4)

3-3: Proper import updates for new testing framework.

The imports correctly switch to the new modular testing helpers (FakeFactory, run_test_transformed_gm) that support the class-based transform architecture.


7-8: Correct imports for modular optimizer approach.

The addition of torch_export_to_gm and InferenceOptimizer imports aligns with the refactoring to use the new modular transform framework instead of direct function calls.


66-74: Well-structured adaptation to modular transform system.

The test correctly:

  • Exports the model to a GraphModule with dynamic shapes
  • Configures InferenceOptimizer with the quantize_moe transform
  • Uses FakeFactory to provide quantization configuration
  • Maintains the same transform behavior in the new framework

76-86: Proper use of new testing helper.

The switch to run_test_transformed_gm maintains all the original validation logic (graph checking, parameter counting, output verification) while adapting to the new modular transform testing approach.

tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)

99-119: Good encapsulation of recompilation logic

The refactoring to determine recompile_graph internally based on the presence of device-related operations is a solid improvement. This simplifies the API and ensures consistent behavior.

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (3)

106-106: Simplified sinks reshaping is correct

The simplified reshape operation for sinks in the generate phase is appropriate, since the generate phase processes a single token per sequence.


205-205: Correct fix for sliding window mask logic

Removing the (pos_diff < 0) condition is the right fix. The causal mask already handles preventing attention to future positions, so the sliding window mask should only enforce the backward-looking window constraint.
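
For reference, a minimal sketch of the mask algebra (variable names mirror the review discussion, not the actual kernel code): once the causal mask is applied, the `(pos_diff < 0)` term in the sliding-window mask is redundant.

```python
import torch

s_q = s_k = 6
sliding_window = 3

q_pos = torch.arange(s_q).unsqueeze(1)   # [s_q, 1]
k_pos = torch.arange(s_k).unsqueeze(0)   # [1, s_k]
pos_diff = q_pos - k_pos                 # >0: key in the past; <0: key in the future

causal_mask = pos_diff < 0                # already blocks future keys
window_mask = pos_diff >= sliding_window  # blocks keys too far in the past
combined = causal_mask | window_mask

# The old form re-checked (pos_diff < 0) inside the window mask; once the
# causal mask is applied as well, both variants mask the same positions:
old_window = (pos_diff < 0) | (pos_diff >= sliding_window)
assert torch.equal(causal_mask | old_window, combined)
```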


220-228: Improved sinks tensor handling for context phase

The new approach with new_sinks properly handles dimension expansion for the context phase, making the implementation more robust and correct for batch processing.

tensorrt_llm/_torch/auto_deploy/transform/interface.py (2)

290-319: Improved state tracking in pre-cleanup

The change to return (is_clean, has_valid_shapes) tuple provides better visibility into the graph state after pre-cleanup. The logic correctly handles the different cleanup scenarios based on configuration requirements.


233-242: Better debugging experience with conditional error handling

The conditional error handling based on skip_on_error is a great improvement. It provides better debugging capabilities when transforms fail while maintaining the option to skip errors in production scenarios.

tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

79-84: Good abstraction for position embeddings configuration

The _get_max_position_embeddings_config method provides a clean way to configure maximum sequence length. This ensures consistency between the factory's max_seq_len and the model's position embeddings configuration.


326-333: Correct quantization config loading from fetched directory

The change to load hf_quant_config.json from the fetched directory is correct. This ensures the quantization config is loaded from the same location as the model weights, which is especially important for HuggingFace hub downloads.

tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (2)

27-84: Well-structured quantization logic for linear operations

The function properly handles both regular and pre-quantized graphs, correctly registers hooks and scales, and manages state dict updates.


169-235: Clean implementation of class-based quantization transform

The transform properly handles quantization configuration, tracks quantized operations by type and algorithm, and returns comprehensive metadata. The use of defaultdict for tracking is elegant.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (3)

23-37: Good test factory implementation

The DummyFactory properly implements the minimal interface needed for testing quantization transforms.


71-81: Well-structured test using new transform infrastructure

The test properly exports the model to GraphModule, applies the quantization transform via InferenceOptimizer, and validates the results.


126-184: Comprehensive BMM quantization testing

The test properly handles both parameter-based and dynamic BMM models, correctly registers scales for each case, and validates quantization through the new transform infrastructure.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)

23-36: Good mock implementation for testing

The MockAttentionDescriptor properly implements the AttentionDescriptor interface for testing purposes.

tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2)

20-98: Well-implemented MoE quantization logic

The function properly handles quantization of all three expert weight lists, correctly registers scales and hooks, and replaces the node with the quantized version.


134-179: Clean implementation of MoE quantization transform

The transform properly handles MoE node quantization with appropriate exclusion patterns and returns comprehensive metadata.

tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)

17-39: Well-designed test factory

The FakeFactory provides a clean interface for testing with configurable cache and quantization configs.


41-53: Good extension of SequenceInfo for embedding tests

The class properly extends SequenceInfo to support 3D input tensors for embedding-based models.


68-139: Comprehensive graph transformation testing utility

The function provides thorough validation including parameter counting, output comparison, state dict loading from both original and transformed models, and re-export capability. This is a valuable addition to the testing infrastructure.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (9)

10-11: LGTM!

The import changes correctly reflect the migration to the new modular transform framework.


161-170: LGTM! Improved causal mask construction.

The updated causal mask construction is clearer and more standard, using torch.triu with -inf values and explicit dtype casting.
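
As a sanity check (a standalone sketch, not the test code itself), the `torch.triu`-based additive mask is numerically equivalent to SDPA's `is_causal=True`:

```python
import torch
import torch.nn.functional as F

s = 4
# Upper-triangular additive mask: -inf above the diagonal, 0 elsewhere.
causal_mask = torch.triu(torch.full((s, s), float("-inf")), diagonal=1)

torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, s, 8) for _ in range(3))

out_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)
out_flag = F.scaled_dot_product_attention(q, k, v, is_causal=True)
masks_equivalent = torch.allclose(out_mask, out_flag, atol=1e-5)
assert masks_equivalent
```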


245-252: LGTM! Consistent mask construction pattern.

The causal mask construction is consistent with the pattern used in EagerAttentionModel.


380-398: LGTM! Efficient use of is_causal flag.

The change to use is_causal=True instead of explicit causal masks is more efficient and aligns with the pattern matching optimization that detects and replaces causal masks.


410-459: LGTM! Well-structured optimizer helper functions.

The optimizer getter functions provide clean encapsulation for creating InferenceOptimizer instances with appropriate configurations for each pattern type.


544-545: LGTM! Comprehensive test updates for eager attention.

The test properly validates the new pattern matching behavior:

  • Tests both masked and unmasked attention scenarios
  • Ensures models are in eval mode
  • Correctly verifies the is_causal flag instead of explicit mask presence
  • Uses the new optimizer infrastructure

Also applies to: 565-583, 669-678, 686-686


744-746: LGTM! Grouped attention test properly updated.

The test correctly validates grouped attention pattern matching and preservation of the is_causal flag.

Also applies to: 773-780, 788-788


1018-1031: LGTM! Proper inheritance from AttentionDescriptor.

The MockAttentionDescriptor correctly inherits from the abstract base class and implements the required interface methods.


1323-1331: LGTM! Clean implementation of attention layout test.

The test properly uses InferenceOptimizer with the match_attention_layout transform and MockAttentionDescriptor to validate layout transformations.

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (6)

31-42: LGTM! Standard repeat_kv pattern implementation.

The pattern correctly captures the unsqueeze→expand→reshape sequence used for key-value expansion in grouped query attention.
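
For illustration, a self-contained sketch of the matched sequence (the helper name `repeat_kv` is ours, mirroring the HF convention, not the module's actual identifier):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """unsqueeze -> expand -> reshape, as in the matched HF-style pattern."""
    b, n_kv, s, d = x.shape
    if n_rep == 1:
        return x
    # Insert a repeat axis after the kv-head axis, expand it, then fold it in.
    x = x[:, :, None, :, :].expand(b, n_kv, n_rep, s, d)
    return x.reshape(b, n_kv * n_rep, s, d)

kv = torch.randn(2, 4, 5, 8)
out = repeat_kv(kv, 3)
```

The fold keeps each kv head's copies adjacent, so it is equivalent to `repeat_interleave` along the head axis.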


45-224: LGTM! Comprehensive SDPA pattern coverage.

The eight SDPA patterns effectively cover various attention implementations:

  • Causal vs non-causal attention
  • Multiplication vs division scaling (with proper conversion)
  • Explicit fp32 casting variations
  • Downstream standardized pipeline handling

The optimization of replacing explicit causal masks with is_causal=True enables backend optimizations.


226-265: LGTM! Well-structured pattern configuration helper.

The function properly configures SDPA patterns with appropriate dummy tensors, scalar workarounds for literal values, and type ignoring for dtype conversions.


267-319: LGTM! Effective grouped attention pattern matching.

The patterns correctly handle both:

  • Explicit repeat_kv + SDPA sequences → grouped_sdpa replacement
  • Direct SDPA → grouped_sdpa replacement for unified handling

Both replacements preserve the is_causal flag settings appropriately.

321-360: LGTM! Clean repeat_kv transform implementation.

The transform correctly registers the pattern with appropriate scalar workarounds and type ignoring for reshape/expand operations. Setting run_shape_prop=True when matches are found ensures proper shape information propagation.


514-516: LGTM! Proper metadata preservation.

The transform correctly preserves fake tensor metadata during transpose operations, which is crucial for maintaining shape information through the transformation.

Also applies to: 543-544

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (6)

11-32: LGTM! Well-implemented helper functions.

Both helper functions are correctly implemented:

  • Logit softcapping uses the standard cap * tanh(scores/cap) formula
  • Boolean mask conversion properly maps True→1.0 and False→-inf
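
A standalone sketch of both helpers (function names are illustrative, not the module's actual identifiers). Note that mapping True→1.0 rather than 0.0 is harmless, because softmax is invariant to a constant shift over the unmasked entries:

```python
import torch

def apply_logit_softcapping(scores: torch.Tensor, cap: float) -> torch.Tensor:
    # cap * tanh(scores / cap): smoothly bounds logits to [-cap, cap].
    return cap * torch.tanh(scores / cap)

def convert_bool_mask(mask: torch.Tensor) -> torch.Tensor:
    # True -> 1.0, False -> -inf; the 1.0 offset cancels in the softmax.
    return torch.where(mask, torch.tensor(1.0), torch.tensor(float("-inf")))

torch.manual_seed(0)
scores = torch.randn(2, 3) * 100
capped = apply_logit_softcapping(scores, 50.0)

m = torch.tensor([[True, True, False]])
logits = torch.randn(1, 3)
p_add = torch.softmax(logits + convert_bool_mask(m), dim=-1)
p_ref = torch.softmax(logits.masked_fill(~m, float("-inf")), dim=-1)
```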

105-126: LGTM! Solid foundation for grouped SDPA.

The implementation correctly:

  • Handles bnsd format inputs without unnecessary transposes
  • Implements GQA by repeating K,V tensors when n_heads != n_kv_heads
  • Sets appropriate default scale based on head dimension

134-140: LGTM! Correct causal mask application.

The causal mask is correctly applied only during context phase when s_q == s_k, avoiding unnecessary masking during generation.


181-188: LGTM! Standard attention path is correct.

The implementation properly handles:

  • Softmax computation in float32 for numerical stability
  • Casting back to query dtype
  • Dropout application in inference mode
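
A minimal sketch of this numerical-stability path (illustrative only):

```python
import torch
import torch.nn.functional as F

scores = (torch.randn(2, 4, 4) * 10).to(torch.float16)

# Softmax in float32 for stability, then cast back to the query dtype:
probs = torch.softmax(scores.to(torch.float32), dim=-1).to(scores.dtype)

# Dropout with training=False is the identity, matching inference mode:
out = F.dropout(probs, p=0.1, training=False)
```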

228-238: LGTM! Correct layout handling in bsnd wrapper.

The wrapper properly:

  • Transposes inputs from bsnd to bnsd format
  • Forwards all parameters to grouped_sdpa
  • Transposes output back to bsnd format

contiguous() is used to ensure an efficient memory layout.
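
The layout round-trip can be illustrated with an identity "attention" callback (names are ours, not the op's):

```python
import torch

def bsnd_wrapper_sketch(q, k, v, bnsd_attention):
    # Inputs arrive as [b, s, n, d]; the backend expects [b, n, s, d].
    out = bnsd_attention(
        q.transpose(1, 2).contiguous(),
        k.transpose(1, 2).contiguous(),
        v.transpose(1, 2).contiguous(),
    )
    # Transpose the result back to [b, s, n, d].
    return out.transpose(1, 2).contiguous()

q = torch.randn(2, 5, 8, 16)  # bsnd
# An identity "attention" is enough to check the layout round-trip:
out = bsnd_wrapper_sketch(q, q, q, lambda q_, k_, v_: q_)
```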

193-208: LGTM! Fake implementations properly updated.

Both fake implementations correctly include the new parameters (sinks, sliding_window, logit_cap) in their signatures.

Also applies to: 242-256

@tensorrt-cicd
Collaborator

PR_Github #13696 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10287 completed with status: 'FAILURE'

@Fridah-nv Fridah-nv force-pushed the ll/merge-ad-2025-07-22 branch from f6591d5 to 25e1230 Compare July 31, 2025 21:14
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)

1-2: Add NVIDIA copyright header

According to the coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header that includes the current year.

Add the following copyright header at the beginning of the file:

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
♻️ Duplicate comments (5)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2)

47-66: Fix transform application

The transformed graph from InferenceOptimizer is not being returned. The function assigns the result to gm but doesn't return it.

-def _joint_transform(gm: GraphModule) -> None:
+def _joint_transform(gm: GraphModule) -> GraphModule:
     gm = InferenceOptimizer(
         None,
         {
             "match_repeat_kv": {
                 "stage": "pattern_matcher",
             },
             "match_eager_attention": {
                 "stage": "pattern_matcher",
             },
             "match_grouped_attention": {
                 "stage": "pattern_matcher",
             },
             "match_attention_layout": {
                 "stage": "pattern_matcher",
                 "attention_op": MockAttentionDescriptor,
             },
         },
     )(None, gm)
+    return gm

154-155: Capture transformed graph

The transformed graph returned by _joint_transform should be assigned back to gm.

     # Apply transformation
-    _joint_transform(gm)
+    gm = _joint_transform(gm)
     assert verify_matcher(gm)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)

720-720: Incorrect optimizer used in counter example test

The test is designed to verify that similar tensor operations are not falsely matched as repeat_kv patterns. It should use _get_match_repeat_kv_optimizer() instead of _get_match_eager_attention_optimizer().

-        _get_match_eager_attention_optimizer(),
+        _get_match_repeat_kv_optimizer(),
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)

142-159: Sliding window mask logic needs adjustment

The sliding window implementation has an off-by-one error. The current condition masks positions where the key is after the query OR at exactly sliding_window distance, but typical sliding window attention allows attending to exactly sliding_window + 1 positions (current position plus sliding_window past positions).

-        sliding_window_mask = (pos_diff < 0) | (pos_diff >= sliding_window)  # [s_q, s_k]
+        # Mask positions where key is after query OR key is too far before query
+        sliding_window_mask = (pos_diff < 0) | (pos_diff > sliding_window)  # [s_q, s_k]

164-180: Incomplete sinks implementation

The sinks mechanism computes attention scores with an additional sink column but doesn't concatenate corresponding sink values to the value tensor. This means the matrix multiplication at line 180 has mismatched dimensions or ignores the sink scores.

If sinks are meant to be attention sinks (tokens that always receive some attention), you need to either:

  1. Concatenate sink value vectors to value_t before the matmul, or
  2. Use only the non-sink portion of the scores (exclude the last column)

Please clarify the intended behavior and complete the implementation accordingly.
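
For option 2, a hedged sketch of what the completed computation could look like (names and shapes are illustrative only): the sink column participates in the softmax and absorbs probability mass, but is dropped before the value matmul so no sink value vectors are needed.

```python
import torch

def softmax_with_sinks_sketch(scores, value_t, sinks):
    # scores: [b, n, s_q, s_k]; value_t: [b, n, s_k, d]; sinks: [n] logits.
    b, n, s_q, _ = scores.shape
    sink_col = sinks.view(1, n, 1, 1).expand(b, n, s_q, 1)
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    # Drop the sink column before the value matmul: the sink absorbs
    # probability mass but contributes no value vector of its own.
    return probs[..., :-1] @ value_t

torch.manual_seed(0)
scores = torch.randn(1, 2, 3, 4)
value_t = torch.randn(1, 2, 4, 8)
out = softmax_with_sinks_sketch(scores, value_t, torch.zeros(2))
```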

🧹 Nitpick comments (4)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)

17-39: Consider updating the docstring to reflect dual purpose.

The FakeFactory class correctly implements a test factory but the docstring only mentions cache_config while it supports both cache and quantization configurations.

-    """Dummy factory to pass cache_config for testing."""
+    """Dummy factory to pass cache_config and quant_config for testing."""

68-139: LGTM! Comprehensive testing function for transformed graph modules.

The run_test_transformed_gm function provides thorough validation including parameter counts, output correctness, state dict loading, and re-export capability. The logic correctly handles the pre-transformed graph module testing pattern needed for the new transform framework.

Consider extracting common testing logic between run_test and run_test_transformed_gm to reduce code duplication, as both functions share significant baseline testing code.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)

136-147: Refactor duplicated tensor initialization code

The tensor initialization logic is duplicated. Consider extracting it into a helper function to improve maintainability.

def _initialize_meta_tensors(module, device="cuda"):
    """Initialize meta tensors with random values on the specified device."""
    module._apply(
        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
        if t.device == torch.device("meta")
        else t.to(device)
    )

Then use it as:

-    model._apply(
-        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
-        if t.device == torch.device("meta")
-        else t.to(device)
-    )
+    _initialize_meta_tensors(model, device)

-    gm_exported._apply(
-        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
-        if t.device == torch.device("meta")
-        else t.to(device)
-    )
+    _initialize_meta_tensors(gm_exported, device)

-    gm._apply(
-        lambda t: torch.normal(0.0, 1.0, size=t.shape, device=device).to(t.dtype)
-        if t.device == torch.device("meta")
-        else t.to(device)
-    )
+    _initialize_meta_tensors(gm, device)

Also applies to: 159-163

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1)

179-202: Add documentation for pattern 7

Pattern 7 appears to handle a special case where the attention already uses torch_attention_sdpa. Consider adding a comment to clarify why this pattern exists.

-# Only pass in causal attention mask in downstream standardized pipeline
+# Pattern 7: Handle existing torch_attention_sdpa calls that incorrectly pass attention_mask
+# This pattern detects cases where causal mask is passed as attn_mask parameter and converts
+# it to use the is_causal flag instead for better performance
 def _sfdp_pattern_7(query, key, value, attention_mask, scaling, dropout):
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6591d5 and 25e1230.

📒 Files selected for processing (18)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (16 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (4 hunks)
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py
✅ Files skipped from review due to trivial changes (2)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py
  • tensorrt_llm/_torch/auto_deploy/transformations/transform.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/transformations/_graph.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🧬 Code Graph Analysis (1)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (5)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • AttentionDescriptor (498-624)
  • device (147-148)
  • args (151-159)
  • to (310-314)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  • torch_export_to_gm (198-284)
tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1)
  • InferenceOptimizer (37-202)
tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
  • move_to_device (132-139)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (5)
  • MockAttentionDescriptor (1018-1030)
  • verify_matcher (478-528)
  • verify_matcher (598-680)
  • verify_matcher (747-782)
  • verify_matcher (1207-1317)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2)

2-2: LGTM! Import changes align with the class-based transform framework.

The new imports support the refactored architecture: List and Tuple for type hints, ModelFactory for configuration access, and the transform framework interfaces.

Also applies to: 8-9, 12-12


134-179: LGTM! Well-structured class-based transform implementation.

The QuantizeMOE class correctly:

  • Registers with the transform framework using the decorator
  • Implements the required _apply method signature
  • Handles configuration retrieval and early returns appropriately
  • Preserves the original quantization logic while integrating with the new architecture
  • Returns meaningful TransformInfo metadata for different scenarios

The refactoring successfully encapsulates the quantization functionality within the modular transform framework.

tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)

11-11: LGTM! Import changes support the new testing utilities.

The imports are necessary for the new FakeFactory and SequenceEmbeddingInfo classes.

Also applies to: 13-13


41-53: LGTM! Well-designed extension for embedding input testing.

The SequenceEmbeddingInfo class correctly extends SequenceInfo to support 3D embedding tensors instead of token IDs, maintaining proper device placement and data types.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)

169-170: Consider reducing tolerance values

The tolerance values (atol=5e-2, rtol=5e-2) seem high for float16 precision. Unless there's a specific reason for such high tolerances (e.g., accumulated errors from transformations), consider using more typical values like 1e-3.

If the high tolerances are necessary due to numerical instability in the transformations, please add a comment explaining why these specific values were chosen.

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (2)

31-42: LGTM! Correct repeat_kv pattern implementation

The pattern correctly matches the HuggingFace repeat_kv implementation with unsqueeze, expand, and reshape operations.


375-384: Consider setting run_shape_prop for consistency

Unlike MatchRepeatKV and other transforms, MatchEagerAttention doesn't set self.config.run_shape_prop = True when patterns are matched. This might be intentional, but consider adding it for consistency if shape propagation is needed after eager attention transformations.

if num_eager_patterns > 0:
    self.config.run_shape_prop = True

nzmora-nvidia and others added 14 commits July 31, 2025 21:22
* Change the all-reduce strategy to NCCL

When the strategy is set to AUTO and world_size>1 we experience hangs and CUDA
memory errors.

* This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
* Without this change test test_ad_build_small_multi.py fails (tp==2)
* This is a temporary change until we understand why this hang is happening.
* On dllcuster this issue does not manifest.

Signed-off-by: Neta Zmora <[email protected]>

* Re-enable test_ad_build_small_multi.py

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py

Signed-off-by: Neta Zmora <[email protected]>

* fix kvcache mem size compute - convert to MB

Signed-off-by: Gal Agam <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
Signed-off-by: Neta Zmora <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* Fix the torch backend Attention

Signed-off-by: nvchenghaoz <[email protected]>

* Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests"

This reverts commit 5743fb3.

---------

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
…tcher (#101)

* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* update matcher to only handle causal attn mask and set is_causal=True

Signed-off-by: Frida Hou <[email protected]>

* separate into three transformations

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* improve error handling and graph clean-up

Signed-off-by: Lucas Liebenwein <[email protected]>

* fix: avoid modify immutable type TransformInfo

Signed-off-by: haoguo <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
Co-authored-by: haoguo <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* Update the torch ref op

Signed-off-by: nvchenghaoz <[email protected]>

* Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests"

This reverts commit 5743fb3.

---------

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
* refactor: move quantization and quant_moe to new inf optimizer

Signed-off-by: haoguo <[email protected]>

* refactor: use quant_config from factory instead of new config type

Signed-off-by: haoguo <[email protected]>

* refactor: del old files; update default.yaml

Signed-off-by: haoguo <[email protected]>

* move helper class FakeFactory to _graph_test_helpers.py

Signed-off-by: haoguo <[email protected]>

* polish: remove unreachable branch in quantization.py

Co-authored-by: Fridah-nv <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* style: run pre-commit

Signed-off-by: haoguo <[email protected]>

* fix to fetch hf_quant_config from fetched dir

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Co-authored-by: Fridah-nv <[email protected]>
* refactor: merge attn updates;move to new inf optimizer

Signed-off-by: haoguo <[email protected]>

* minor: fix import

Signed-off-by: haoguo <[email protected]>

* doc: fix file docstring

Signed-off-by: haoguo <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* polish: use config.run_shape_prop for shape prop

Signed-off-by: haoguo <[email protected]>

* polish: remove redundant canonicalize()

Signed-off-by: haoguo <[email protected]>

---------

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…s and refactored (#119)

* refactored compile_limit

Signed-off-by: Eran Geva <[email protected]>

* removed changes made to TorchCompileCompiler

Signed-off-by: Eran Geva <[email protected]>

* set cache_size_limit in TorchCompileCompiler

Signed-off-by: Eran Geva <[email protected]>

---------

Signed-off-by: Eran Geva <[email protected]>
…tch_sizes and refactored (#119)"

This reverts commit 2c81f8a.

Signed-off-by: Lucas Liebenwein <[email protected]>
@Fridah-nv Fridah-nv force-pushed the ll/merge-ad-2025-07-22 branch from 25e1230 to ee0252a on July 31, 2025 21:23
@Fridah-nv
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13723 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2)

47-65: Add return statement for clarity

The function modifies gm but doesn't return the transformed result from InferenceOptimizer.

 def _joint_transform(gm: GraphModule) -> None:
-    gm = InferenceOptimizer(
+    return InferenceOptimizer(
         None,
         {
             "match_repeat_kv": {
                 "stage": "pattern_matcher",
             },
             "match_eager_attention": {
                 "stage": "pattern_matcher",
             },
             "match_grouped_attention": {
                 "stage": "pattern_matcher",
             },
             "match_attention_layout": {
                 "stage": "pattern_matcher",
                 "attention_op": MockAttentionDescriptor,
             },
         },
     )(None, gm)

154-155: Fix transform application

The transformed graph from _joint_transform is not being used. The function should return the transformed graph.

     # Apply transformation
-    _joint_transform(gm)
+    gm = _joint_transform(gm)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)

720-720: Incorrect optimizer used in counter example test.

The test_counter_example is designed to verify that similar tensor operations with different patterns are not falsely matched as repeat_kv patterns. It should use _get_match_repeat_kv_optimizer() instead of _get_match_eager_attention_optimizer().

Apply this diff to fix the optimizer:

-        _get_match_eager_attention_optimizer(),
+        _get_match_repeat_kv_optimizer(),
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)

142-159: Sliding window mask condition needs correction.

The sliding window mask condition on line 157 still has the issue identified in previous reviews. The current condition (pos_diff < 0) | (pos_diff >= sliding_window) should be (pos_diff < 0) | (pos_diff > sliding_window) to allow attention to exactly sliding_window + 1 positions (current position plus sliding_window past positions), which is the standard behavior for sliding window attention.
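The inclusive/exclusive boundary is easy to get wrong, so here is a scalar sketch of the suggested condition (hypothetical helper name; the real op works on tensors): keeping positions where `0 <= pos_diff <= sliding_window` yields attention over exactly `sliding_window + 1` positions.

```python
def sliding_window_keep(q_pos: int, k_pos: int, sliding_window: int) -> bool:
    """Scalar sketch of the suggested mask: True means k_pos is visible to q_pos.

    A query attends to itself plus up to `sliding_window` past positions,
    i.e. exactly sliding_window + 1 positions in total.
    """
    pos_diff = q_pos - k_pos
    # Mask future positions (pos_diff < 0) and positions strictly older than
    # the window (pos_diff > sliding_window); using `>=` drops one position.
    return not (pos_diff < 0 or pos_diff > sliding_window)
```

With `sliding_window=2`, query position 5 sees key positions {3, 4, 5}, i.e. three positions, which is what the `>` comparison preserves and the `>=` comparison loses.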


164-184: Sinks implementation remains incomplete.

The sinks mechanism still has the issue identified in previous reviews. While sink values are included in the attention score normalization, the output computation on line 180 uses the original value_t tensor without concatenating corresponding sink values. For a complete sinks implementation, sink values should be concatenated to the value tensor to match the expanded attention scores that include the sink column.
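A minimal scalar sketch of the reviewer's point (hypothetical helpers, not the actual op): the sink logit must enter the softmax normalization, and the output must account for the sink column — here by treating the sink's value as zero, so the sink absorbs probability mass and shrinks the weighted sum of real values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend_with_sink(scores, values, sink_logit):
    """Normalize over the real scores plus one sink logit.

    The sink column carries no value (sink value = 0), so the output is a
    sub-unit-weight average of `values` rather than a full convex combination.
    """
    probs = softmax(scores + [sink_logit])
    return sum(p * v for p, v in zip(probs[:-1], values))
```

With two zero scores and a zero sink logit, each real position gets weight 1/3 and the sink silently absorbs the remaining 1/3.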

🧹 Nitpick comments (1)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)

68-139: Comprehensive testing utility for transformed GraphModules.

This function provides thorough validation of transformed GraphModules including parameter consistency, output correctness, and state dict loading. The implementation is well-structured with clear error messages and appropriate test coverage.

Consider extracting common logic between run_test_transformed_gm and run_test into shared helper functions to reduce code duplication, particularly for parameter counting, output verification, and state dict loading tests.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25e1230 and ee0252a.

📒 Files selected for processing (22)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (16 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py (4 hunks)
💤 Files with no reviewable changes (3)
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
🚧 Files skipped from review as they are similar to previous changes (13)
  • tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transformations/_graph.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py
  • tensorrt_llm/_torch/auto_deploy/transformations/transform.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • AttentionDescriptor (498-624)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • is_op (183-206)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (3)
  • ADPatternMatcherPass (59-65)
  • register_ad_pattern (97-180)
  • call_function (247-274)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (4)
  • BaseTransform (129-354)
  • TransformConfig (50-89)
  • TransformInfo (98-123)
  • TransformRegistry (357-385)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (2)
  • get_attention_layout (29-30)
  • get_source_attention_op (33-34)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (2)
  • get_attention_layout (1025-1026)
  • get_source_attention_op (1029-1030)
tensorrt_llm/logger.py (1)
  • debug (143-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (31)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)

11-13: LGTM!

The new imports correctly support the added functionality and follow the existing import patterns.


17-39: Well-designed test utility class.

The FakeFactory class provides a clean way to inject configuration data into tests without requiring full model instantiation. The stub methods are appropriately implemented for testing purposes.


41-53: Good extension for embedding-like test inputs.

The SequenceEmbeddingInfo class properly extends SequenceInfo to support tests requiring 3D embedding tensors. The implementation maintains device and dtype consistency.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (4)

3-20: Good import updates for new functionality.

The added imports support meta device initialization, graph transformations, and reproducible testing. The manual seed setting ensures test consistency.


23-35: Proper implementation of AttentionDescriptor interface.

The MockAttentionDescriptor class correctly implements the required interface methods and aligns with the new attention descriptor design.


112-156: Well-structured test with comprehensive verification.

The test properly uses meta device initialization, exports to GraphModule, and performs thorough verification of the transformation and output consistency.


105-108: Good verification of pattern transformation.

The verification correctly checks that old attention patterns are replaced with the expected new patterns after transformation.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (10)

10-10: LGTM!

The import addition supports the new attention descriptor interface used in the updated code.


161-170: Improved attention mask construction.

The new implementation creates causal masks more efficiently by directly using torch.triu with -inf values instead of boolean masks with masked_fill. This is cleaner and more direct.
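The triu-based construction the test models now use can be sketched in plain Python (a list-of-lists stand-in for `torch.triu(torch.full((s, s), -inf), diagonal=1)`):

```python
def causal_bias(seq_len: int) -> list:
    """Additive causal bias: 0.0 on and below the diagonal, -inf above it."""
    neg_inf = float("-inf")
    # Row i is the query position; column j is the key position.
    return [[0.0 if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)]
```

Adding this bias to the raw scores before softmax zeroes out future positions directly, with no boolean intermediate and no `masked_fill` pass.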


244-252: Consistent mask construction improvement.

The same efficient causal mask construction is applied here, maintaining consistency across model classes.


380-398: Good update to use explicit is_causal flag.

The change to use the is_causal parameter instead of explicit mask tensors for causal attention is more efficient and aligns with the updated attention operator interface.


410-458: Well-designed modular optimizer functions.

The new optimizer getter functions provide clean separation of concerns with consistent configuration patterns. Each function focuses on a specific pattern matching stage while including necessary cleanup operations.


564-583: Good test practice improvements.

Chaining .eval() after .to() ensures models are properly set to evaluation mode for inference testing, which is important for consistent test results.


669-678: Proper verification of new attention interface.

The verification logic correctly checks for the is_causal parameter in both kwargs and positional arguments, ensuring the new attention interface is working as expected.


744-745: Good test simplification.

Removing the conditional logic and always expecting 1 grouped SDPA node makes the test more straightforward and aligns with the model implementation.


1018-1031: Proper AttentionDescriptor implementation.

The updated MockAttentionDescriptor class correctly implements the required interface methods and follows the new attention descriptor pattern.


1323-1331: Consistent optimizer-based approach.

The test correctly uses InferenceOptimizer with the match_attention_layout stage, maintaining consistency with the overall refactoring approach.

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (9)

1-17: LGTM! Well-structured module with clear imports.

The module docstring clearly describes the purpose, and imports are properly organized following Python best practices.


19-29: LGTM! Clean utility function for pattern application.

The function properly encapsulates pattern registration and application logic with a clear interface.


31-42: LGTM! Correct repeat_kv pattern implementation.

The pattern accurately captures the standard repeat_kv logic from HuggingFace models, and the replacement properly uses the custom operator.
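The repeat_kv semantics being matched boil down to the following head-level sketch (lists of head labels instead of tensors; the real pattern operates on `[b, n_kv, s, d]` tensors via expand/reshape):

```python
def repeat_kv_heads(kv_heads: list, n_rep: int) -> list:
    """Repeat each KV head n_rep times, preserving head order (interleaved),
    so len(result) == len(kv_heads) * n_rep matches the query head count."""
    return [head for head in kv_heads for _ in range(n_rep)]
```

Note the interleaving: each KV head's copies are adjacent, so query heads `[i * n_rep, (i + 1) * n_rep)` all read KV head `i`.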


44-264: LGTM! Comprehensive SDPA pattern coverage.

The 8 patterns effectively cover the major variations of scaled dot-product attention found in transformer models, including different scaling approaches, masking, and casting behaviors. The pattern generation function is well-structured with appropriate dummy data and scalar workarounds.


267-319: LGTM! Well-designed grouped attention patterns.

The 4 patterns effectively handle different GQA scenarios with and without explicit repeat_kv operations, covering both causal and non-causal attention variants. The replacements appropriately use the grouped SDPA operator.


321-360: LGTM! Proper transform class implementation.

The MatchRepeatKV class correctly follows the BaseTransform pattern, uses appropriate dummy arguments and scalar workarounds, and properly handles shape propagation when patterns are found.


362-385: LGTM! Clean eager attention transform implementation.

The class properly leverages the pattern generation function to register all SDPA patterns efficiently.


387-454: LGTM! Comprehensive grouped attention transform.

The class properly registers all grouped attention patterns with appropriate dummy tensors and scalar workarounds for GQA scenarios.


456-563: LGTM! Comprehensive attention layout transformation.

The MatchAttentionLayout transform correctly handles layout conversions between 'bnsd' and 'bsnd' formats, properly inserts transpose operations, and preserves meta tensor information for shape propagation. The implementation covers both SDPA variants appropriately.
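In shape terms, the conversion the transform inserts is a transpose of the head and sequence dimensions (a sketch, not the transform's code):

```python
def convert_layout_shape(shape: tuple, src: str, dst: str) -> tuple:
    """Map a 4-D attention tensor shape between 'bnsd' and 'bsnd' layouts.

    'bnsd' = (batch, num_heads, seq_len, head_dim); 'bsnd' swaps dims 1 and 2,
    which is what the inserted transpose(1, 2) nodes do on the actual tensors.
    """
    if src == dst:
        return shape
    b, x, y, d = shape
    return (b, y, x, d)
```

Because the swap is its own inverse, round-tripping a shape through both layouts returns the original, which is why a transpose pair around the backend op is sufficient.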

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (5)

11-31: LGTM! Well-implemented helper functions.

Both helper functions are correctly implemented with clear docstrings, proper parameter validation, and follow expected mathematical formulations for logit softcapping and mask conversion.


92-123: LGTM! Proper function signature and GQA handling.

The updated function signature includes necessary parameters, and the GQA implementation correctly handles key-value tensor repetition when needed. Scale calculation follows standard practices.
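The "standard practice" referenced here is the SDPA default of `1 / sqrt(head_dim)` when no explicit scale is supplied (sketch with a hypothetical helper name):

```python
import math


def resolve_attention_scale(head_dim: int, scale=None) -> float:
    """Use the caller-provided scale if set, else the SDPA default 1/sqrt(d)."""
    return scale if scale is not None else 1.0 / math.sqrt(head_dim)
```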


127-140: LGTM! Correct attention and causal mask handling.

The implementation properly converts boolean masks to float and applies causal masking only during the context phase (when s_q == s_k), following standard practices.
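The boolean-to-float conversion can be sketched as follows (list-of-lists stand-in for the tensor op; True means "keep"):

```python
def bool_mask_to_bias(mask: list) -> list:
    """True (keep) -> 0.0 additive bias; False (drop) -> -inf."""
    neg_inf = float("-inf")
    return [[0.0 if keep else neg_inf for keep in row] for row in mask]
```

The resulting float mask is simply added to the scores, so kept positions are untouched and dropped positions vanish after softmax.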


182-191: LGTM! Correct standard attention computation.

The fallback softmax path and dropout application are properly implemented, maintaining format consistency between input and output.


194-255: LGTM! Consistent function signature updates.

All fake functions properly match their real counterparts with the new parameters, and the bsnd variant correctly handles layout transformations between bsnd and bnsd formats.

@tensorrt-cicd
Collaborator

PR_Github #13723 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10311 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Fridah-nv Fridah-nv merged commit 5247df6 into NVIDIA:main Aug 1, 2025
3 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in AutoDeploy Board Aug 1, 2025
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025
Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Eran Geva <[email protected]>
Signed-off-by: Fridah-nv <[email protected]>
Co-authored-by: Neta Zmora <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
Co-authored-by: h-guo18 <[email protected]>
Co-authored-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
Co-authored-by: Eran Geva <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
jain-ria pushed a commit to jain-ria/TensorRT-LLM that referenced this pull request Aug 7, 2025
Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Eran Geva <[email protected]>
Signed-off-by: Fridah-nv <[email protected]>
Co-authored-by: Neta Zmora <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
Co-authored-by: h-guo18 <[email protected]>
Co-authored-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
Co-authored-by: Eran Geva <[email protected]>
7 participants