
Conversation

galagam
Collaborator

@galagam galagam commented Aug 5, 2025

Summary by CodeRabbit

  • New Features

    • Enhanced sequence management with device-aware tensor handling and host-pinned memory for faster CPU-GPU transfers.
    • Added support for more efficient input updates and position tracking during generation.
    • Introduced profiling instrumentation for key model operations.
  • Performance Improvements

    • Optimized data transfer and memory allocation for input and position tensors, reducing overhead.
    • Improved asynchronous tensor copying to accelerate input processing.
    • Introduced NVTX profiling for key operations to aid in performance tracing.
  • Bug Fixes

    • Improved device specification consistency and eliminated unnecessary data transfers between CPU and GPU.

Description

Optimize the prepare_inputs routine in AutoDeploy, as part of the effort to close the performance gap with the default backend.
This PR includes two major fixes and some other minor tweaks:

  • Avoid back-and-forth data copies between CPU and GPU.
  • Optimize the position IDs update by separating the implementations for generation mode and context mode.
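The generation/context split can be illustrated with a minimal pure-Python sketch (names and signature are hypothetical, not the actual SequenceInfo code; lists stand in for tensors):

```python
def compute_position_ids(seq_lens, input_pos):
    """Illustrative sketch of the generate/context split.

    seq_lens:  number of tokens each sequence contributes this step
    input_pos: number of tokens already cached per sequence
    """
    # Generation fast path: every sequence contributes exactly one token,
    # so its position is simply its current input_pos.
    if all(sl == 1 for sl in seq_lens):
        return list(input_pos)
    # Context path: one position per token, offset by each sequence's input_pos.
    positions = []
    for sl, start in zip(seq_lens, input_pos):
        positions.extend(range(start, start + sl))
    return positions
```

In generation mode this avoids any per-token arithmetic on the device; the host-side values can be copied over as-is.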

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

Contributor

coderabbitai bot commented Aug 5, 2025

📝 Walkthrough

The changes introduce device-aware tensor management, host-pinned memory usage, and profiling instrumentation in the SequenceInfo dataclass to optimize sequence and position ID updates. Input preparation in the executor is refactored to avoid unnecessary CPU transfers and to handle new tokens more efficiently. Minor updates are made to CUDA graph input copying.

Changes

Cohort / File(s) — Change Summary

  • Device-aware SequenceInfo & Host-Pinned Memory — tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py: Refactored SequenceInfo to add a device attribute and manage tensors on the specified device. Changed tensor shapes for input_ids and position_ids to 1D. Introduced host-pinned memory tensors (seq_len_host, input_pos_host, etc.) to optimize CPU-GPU transfers. Added methods for in-place updates (update_input_ids_with_new_tokens), conditional tensor reallocation (nest_sequences with allow_realloc), reshaping (maybe_reshape_for_generate), and sequence length updates (_update_sequence_lengths). Replaced position ID update logic with a new method using NVTX profiling and host-pinned memory. Updated cache location assignment to use pinned memory. Removed the old device property and related methods. Added NVTX profiling instrumentation to key update methods.
  • Input Preparation & Executor Refactor — tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py: Updated build_from_config to append the CUDA device index to the device string if missing before passing it to SequenceInfo. Removed conversion of the new_tokens tensor to a CPU list in _prepare_inputs, using tensors directly instead. When handling generation requests, appended a dummy token -1 to input_ids and tracked batch indices. Reordered sequence info updates to call update_pos and assign_cache_loc before nest_sequences. Added a call to update_input_ids_with_new_tokens if new tokens are present. Annotated _compute_logits with an NVTX profiling decorator. Renamed the new_tokens_device parameter to new_tensors_device in the forward method and updated internal usage accordingly.
  • CUDA Graph Input Copy Optimization — tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py: Modified CapturedGraph.forward to replace synchronous slice assignment of input tensors with asynchronous in-place copy_ using non_blocking=True, improving data transfer semantics without changing control flow or error handling.
  • Test Update for SequenceInfo Nesting — tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py: Updated the test call to nest_sequences by adding the allow_realloc=True argument to reflect changes in the method signature and behavior. No other changes to test logic.

Sequence Diagram(s)

sequenceDiagram
    participant Executor as ADEngine
    participant SeqInfo as SequenceInfo
    participant Device as CUDA/Host

    Executor->>SeqInfo: build_from_config (pass device string)
    Executor->>SeqInfo: _prepare_inputs(new_tokens)
    alt If new_tokens provided
        SeqInfo->>Device: update_input_ids_with_new_tokens (host-pinned copy)
    end
    Executor->>SeqInfo: update_pos, assign_cache_loc, nest_sequences
    SeqInfo->>Device: _update_position_ids (host-pinned, NVTX profiling)
    SeqInfo->>Device: Tensor updates (input_ids, position_ids)
    Executor->>Executor: _compute_logits (NVTX profiling)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested reviewers

  • yilin-void
  • venkywonka
  • symphonylyh

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac4ae69 and f8aa994.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (12 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

190-190: LGTM: Test updated to match new API signature.

The addition of the allow_realloc=True parameter aligns the test with the updated nest_sequences method signature in SequenceInfo. This ensures the test exercises the new device tensor reallocation behavior.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (13)

21-21: LGTM: Added NVTX profiling support.

The import of nvtx_range enables performance profiling instrumentation throughout the sequence management operations.


92-94: LGTM: Device-aware configuration.

Adding the device field with a default of "cuda" enables device-aware tensor management throughout the SequenceInfo class.


97-98: LGTM: Simplified tensor field initialization.

Converting input_ids and position_ids to 1D tensors with default device allocation is consistent with the flattened sequence approach and device-aware design.


117-155: Comprehensive tensor initialization improvements.

The enhanced __post_init__ method introduces several optimizations:

  1. Device-aware allocation: All tensors are now allocated on the specified device
  2. Host-pinned memory: Added seq_len_host and input_pos_host for efficient host-device transfers
  3. Full-length tensors: input_ids_full and position_ids_full reduce tensor creation overhead
  4. Optimized indexing: previous_batch_indices_cuda for efficient batch operations

The approach of maintaining both working and full-length versions of tensors is a good optimization to avoid repeated allocations.


167-169: LGTM: Added token count tracking.

The num_tokens field provides efficient access to the current batch token count, avoiding repeated calculations.


187-190: LGTM: Clarified method documentation.

The updated docstring clearly explains that this method returns the count of original graph arguments (input_ids and position_ids).


422-454: Excellent optimization for position ID updates.

The _update_position_ids method introduces several key optimizations:

  1. Fast path for generation mode: When all seq_len values are 1, it directly uses input_pos_host as position_ids
  2. Host-side computation: Position ID calculation is done on host to avoid device transfers
  3. Pinned memory: Uses pinned memory for efficient async transfers
  4. Conditional reallocation: allow_realloc parameter controls whether to create new tensors or update in-place

The NVTX profiling ranges will help identify performance bottlenecks in production.


455-462: LGTM: Optimized sequence length updates.

The method efficiently updates sequence information using pinned memory for host-device transfers and includes proper NVTX profiling.


463-489: Efficient in-place input ID updates.

The update_input_ids_with_new_tokens method implements an optimized approach:

  1. Pinned memory transfers: Uses pinned memory for async host-to-device copies
  2. Sorted indexing: Sorts indices for proper masked_scatter_ alignment
  3. In-place updates: Uses masked_scatter_ to efficiently update only the necessary positions
  4. Placeholder replacement: Updates positions marked with -1

This approach should significantly reduce memory allocation overhead compared to recreating tensors.
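A pure-Python analogue of the sorted-index masked_scatter_ update may make the placeholder scheme concrete (illustrative only; lists stand in for tensors and the helper name is hypothetical):

```python
def scatter_new_tokens(input_ids, new_tokens, batch_indices):
    """Overwrite -1 placeholders with new tokens, matched by batch order.

    Mirrors the masked_scatter_ pattern: placeholders are filled in the order
    they appear, so tokens must first be sorted by their batch indices.
    """
    order = sorted(range(len(batch_indices)), key=lambda i: batch_indices[i])
    tokens_sorted = [new_tokens[i] for i in order]
    out = list(input_ids)
    token_iter = iter(tokens_sorted)
    for pos, tok in enumerate(out):
        if tok == -1:  # placeholder marking where a new token belongs
            out[pos] = next(token_iter)
    return out
```

The real implementation performs the same matching on-device with masked_scatter_, avoiding any per-token host loop.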


491-524: Enhanced nest_sequences with reallocation control.

The updated method adds:

  1. Reallocation control: allow_realloc parameter enables choosing between tensor reallocation vs. in-place updates
  2. Pinned memory optimization: Uses pinned memory for efficient host-device transfers
  3. NVTX profiling: Added performance monitoring
  4. Consistent reshaping: Maintains the existing reshaping logic for generate vs. context modes

The dual-path approach (realloc vs. in-place) provides flexibility for different use cases while optimizing performance.
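The realloc vs. in-place dual path can be sketched in pure Python (illustrative only; lists stand in for preallocated device tensors and the signature is simplified):

```python
def nest_sequences(buffer, token_seqs, allow_realloc):
    """Sketch of the allow_realloc dual path."""
    flat = [tok for seq in token_seqs for tok in seq]
    if allow_realloc:
        # Reallocation path: return a fresh, exactly-sized buffer.
        return flat
    # In-place path: overwrite the prefix of the preallocated full-size
    # buffer and leave the tail untouched, avoiding a new allocation.
    buffer[: len(flat)] = flat
    return buffer
```

The in-place path is what makes the full-length input_ids_full / position_ids_full tensors pay off: steady-state decoding never allocates.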


525-528: LGTM: Added profiling to unnest_sequences.

The NVTX profiling range enables performance monitoring for the sequence unnesting operation.


530-548: Optimized position updates with async transfers.

The update_pos method improvements:

  1. Pinned memory: Creates tensors with pinned memory for efficient transfers
  2. Async copies: Uses non_blocking=True for better performance
  3. NVTX profiling: Monitors performance bottlenecks
  4. Host-device synchronization: Maintains consistency between host and device tensors

The approach reduces synchronization overhead and improves throughput.


549-563: LGTM: Optimized cache location assignment.

The assign_cache_loc method uses pinned memory and async transfers for both cache_loc and pages_per_seq updates, maintaining consistency with the overall optimization approach.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

152-154: Consider lazy allocation for previous_batch_indices_cuda.

The previous_batch_indices_cuda tensor is allocated during initialization but may not be used in all workflows. Consider lazy allocation when update_input_ids_with_new_tokens is first called to save memory.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78a75c2 and 96c77bd.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (11 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧠 Learnings (1)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • update_input_pos (526-544)
  • sequence_lengths (228-229)
  • nest_sequences (487-518)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • update_input_pos (526-544)
  • update_position_ids (418-448)
  • reset (351-369)
tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (1)
  • all_close (6-21)
🔇 Additional comments (10)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)

118-119: LGTM! API update is correct.

The change from update_pos to update_input_pos correctly follows the refactored API in SequenceInfo. The added comment appropriately documents that nest_sequences handles position ID updates.

tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)

165-165: Good optimization for asynchronous data transfer.

Using copy_ with non_blocking=True enables asynchronous copying when possible, which aligns with the broader performance optimizations in this PR for efficient CPU-GPU data transfers.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

210-211: Test updates correctly reflect the new API design.

The split of position updates into update_input_pos() and update_position_ids() is properly implemented. The comment explaining why explicit update_position_ids() calls are needed (when nest_sequences is not called) provides good documentation.

Also applies to: 215-216, 219-220, 224-225, 228-229

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)

97-102: Proper device handling implementation.

The enhanced device handling correctly ensures that CUDA device strings carry an explicit index, defaulting to the current device when none is specified. This prevents potential device-mismatch issues.
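A minimal sketch of the device-string normalization (the helper name and signature are hypothetical; current_index stands in for torch.cuda.current_device()):

```python
def normalize_device(device: str, current_index: int = 0) -> str:
    """Append an explicit index to a bare CUDA device string ("cuda" -> "cuda:0")."""
    if device.startswith("cuda") and ":" not in device:
        return f"{device}:{current_index}"
    return device
```

Pinning the index up front means every tensor SequenceInfo allocates lands on the same concrete device, even in multi-GPU processes.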


190-197: Efficient new token handling without CPU transfers.

The refactored logic efficiently handles new tokens by:

  1. Using placeholder tokens (-1) to mark positions for new tokens
  2. Tracking batch indices for proper token placement
  3. Utilizing the new update_input_ids_with_new_tokens method for in-place updates

This avoids unnecessary CPU transfers and aligns with the performance optimization goals.

Also applies to: 212-218
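Building the flattened batch on the executor side might look like this sketch (pure Python, names hypothetical; the exact batch-index bookkeeping in ad_executor.py may differ):

```python
def build_batch_inputs(context_seqs, num_generate):
    """Flatten context prompts and append one -1 placeholder per generation request.

    The placeholders are later overwritten on-device from new_tokens, so no
    token ever round-trips through a CPU list.
    """
    input_ids = [tok for seq in context_seqs for tok in seq]
    batch_indices = list(range(num_generate))
    input_ids.extend([-1] * num_generate)
    return input_ids, batch_indices
```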

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)

93-94: Excellent device-aware tensor management implementation.

The changes introduce proper device-aware initialization with:

  • Explicit device specification for all tensors
  • Host-pinned memory for seq_len_host and input_pos_host to optimize CPU-GPU transfers
  • Maintenance of full-size tensor variants to avoid reallocation overhead
  • Consistent 1D tensor shapes for flattened sequence handling

These optimizations align perfectly with the PR's performance goals.

Also applies to: 97-98, 130-154


417-448: Highly optimized position ID update implementation.

The refactored update_position_ids method includes excellent optimizations:

  • NVTX profiling for performance analysis
  • Fast path for generation mode (all seq_len == 1)
  • Host-side computation to avoid device synchronization
  • Non-blocking copies for better CPU-GPU overlap
  • Conditional reallocation support

This is a well-thought-out performance improvement.


458-485: Efficient in-place token update implementation.

The update_input_ids_with_new_tokens method efficiently updates tokens using:

  • Masked scatter for in-place updates where tokens are marked with -1
  • Sorted indices to ensure correct alignment
  • Host-pinned memory for index transfers
  • Minimal memory allocations

This approach avoids unnecessary copies and aligns with the performance optimization goals.


487-518: Well-structured sequence nesting with flexible memory management.

The refactored nest_sequences method provides:

  • Flexible memory management with allow_realloc parameter
  • Centralized sequence length updates via _update_sequence_lengths
  • Efficient non-blocking copies with pinned memory
  • Automatic position ID updates maintaining consistency

Good separation of concerns and performance optimization.


525-544: Clean refactoring with performance optimizations.

The renamed update_input_pos method includes:

  • More descriptive naming
  • NVTX profiling instrumentation
  • Host-pinned memory for efficient CPU-GPU transfers
  • Non-blocking copy operations

These changes align with the overall performance optimization strategy.

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from 96c77bd to 3899933 Compare August 6, 2025 11:09
@galagam galagam changed the title [#5048][AutoDeploy: Optimize prepare_inputs [#5048][AutoDeploy] Optimize prepare_inputs Aug 6, 2025
@galagam galagam changed the title [#5048][AutoDeploy] Optimize prepare_inputs [#5048][feat] AutoDeploy: Optimize prepare_inputs Aug 6, 2025
@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from 3899933 to 413cd3f Compare August 6, 2025 13:59
@galagam galagam changed the title [#5048][feat] AutoDeploy: Optimize prepare_inputs [#5048][fix] AutoDeploy: Optimize prepare_inputs Aug 6, 2025
@galagam galagam marked this pull request as ready for review August 6, 2025 14:15
@galagam galagam requested a review from a team as a code owner August 6, 2025 14:15
@galagam galagam requested a review from Fridah-nv August 6, 2025 14:15
@galagam
Collaborator Author

galagam commented Aug 6, 2025

/bot run

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from 413cd3f to ac4ae69 Compare August 6, 2025 17:22
@galagam
Collaborator Author

galagam commented Aug 6, 2025

/bot run

@venkywonka
Collaborator

/bot run

@github-project-automation github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Aug 6, 2025
@tensorrt-cicd
Collaborator

PR_Github #14323 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14323 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10821 completed with status: 'FAILURE'

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from ac4ae69 to f8aa994 Compare August 7, 2025 09:24
@galagam galagam changed the title [#5048][fix] AutoDeploy: Optimize prepare_inputs [#5048][enhance] AutoDeploy: Optimize prepare_inputs Aug 7, 2025
@galagam
Collaborator Author

galagam commented Aug 7, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14442 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14442 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10916 completed with status: 'FAILURE'

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from f8aa994 to f84a16f Compare August 7, 2025 11:15
@galagam
Collaborator Author

galagam commented Aug 7, 2025

/bot run


github-actions bot commented Aug 7, 2025


--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
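As an illustration, a typical invocation combining several of the flags above might look like the following (the stage and GPU names here are hypothetical examples, not a recommendation):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "A30, H100_PCIe"
```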

kill

Kill all running builds associated with the pull request.

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@suyoggupta (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14503 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14503 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10954 completed with status: 'FAILURE'

suyoggupta and others added 3 commits August 9, 2025 22:40
avoid gpu->cpu transfer when using overlap scheduler

Signed-off-by: Suyog Gupta <[email protected]>

prealloc

Signed-off-by: Suyog Gupta <[email protected]>
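The "prealloc" and sync-removal commits above can be sketched roughly as follows. This is an illustrative sketch only, not the actual TensorRT-LLM code: the helper names (`make_host_buffer`, `stage_input_ids`) and shapes are assumptions. The idea is to preallocate one host staging buffer (pinned when CUDA is available) and reuse it with a single non-blocking host-to-device copy per step, instead of allocating a fresh tensor and synchronizing each time.

```python
import torch

def make_host_buffer(max_tokens: int) -> torch.Tensor:
    # Preallocate the host-side staging buffer once. Pinned (page-locked)
    # memory allows truly asynchronous H2D copies; fall back to pageable
    # memory when no CUDA device is present.
    pin = torch.cuda.is_available()
    return torch.empty(max_tokens, dtype=torch.long, pin_memory=pin)

def stage_input_ids(host_buf: torch.Tensor, new_ids, device) -> torch.Tensor:
    # Write the new token ids into the preallocated buffer, then issue a
    # single non-blocking copy to the target device rather than allocating
    # and copying on every generation step.
    n = len(new_ids)
    host_buf[:n] = torch.as_tensor(new_ids, dtype=torch.long)
    return host_buf[:n].to(device, non_blocking=True)
```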

optimize prepare input

Signed-off-by: Suyog Gupta <[email protected]>

more changes

Signed-off-by: Suyog Gupta <[email protected]>

remove more syncs

Signed-off-by: Suyog Gupta <[email protected]>

clean up

Signed-off-by: Suyog Gupta <[email protected]>

clean up position_id handling

Signed-off-by: Suyog Gupta <[email protected]>

revert spurious changes

Signed-off-by: Suyog Gupta <[email protected]>

revert spurious print

Signed-off-by: Suyog Gupta <[email protected]>

Update export_to_gm.py

Signed-off-by: Suyog Gupta <[email protected]>

revert spurious print

Signed-off-by: Suyog Gupta <[email protected]>

revert nvtx debug

Signed-off-by: Suyog Gupta <[email protected]>

revert nvtx debug

Signed-off-by: Suyog Gupta <[email protected]>

update_position_ids is the bottleneck - add nvtx markers and small list comprehension perf improvement

Signed-off-by: Gal Hubara Agam <[email protected]>

optimize prepare_list for generation mode

Signed-off-by: Gal Hubara Agam <[email protected]>

bugfix in _update_position_ids optimized code

Signed-off-by: Gal Hubara Agam <[email protected]>
Signed-off-by: Gal Hubara Agam <[email protected]>
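The position-id commits above can be summarized with a small sketch (illustrative only; `seq_lens`, `cache_lens`, and the mode split mirror the PR description, not the exact TensorRT-LLM implementation). In generation mode each sequence contributes exactly one new token, so its position is simply the current KV-cache length; context mode needs a full range of positions per sequence, which a flat list comprehension builds without per-sequence appends.

```python
def update_position_ids(seq_lens, cache_lens, is_generate: bool):
    # Generation (decode) mode: one new token per sequence, positioned
    # right after the tokens already held in that sequence's KV cache.
    if is_generate:
        return list(cache_lens)
    # Context (prefill) mode: every token needs a position; a single
    # flat list comprehension replaces a Python loop with repeated appends.
    return [
        pos
        for start, n in zip(cache_lens, seq_lens)
        for pos in range(start, start + n)
    ]
```

Splitting the two modes is what avoids recomputing full ranges on every decode step, which the commit messages identify as the bottleneck.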
@galagam (Collaborator, Author) commented Aug 10, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14697 [ run ] triggered by Bot

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from f84a16f to a0bcb2c on August 10, 2025, 05:52
@tensorrt-cicd (Collaborator)

PR_Github #14697 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11094 completed with status: 'FAILURE'

@galagam (Collaborator, Author) commented Aug 10, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14702 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14702 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11097 completed with status: 'SUCCESS'

@galagam galagam merged commit 3c5aec1 into NVIDIA:main Aug 10, 2025
4 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Aug 10, 2025
4 participants