
Conversation

galagam
Collaborator

@galagam galagam commented Aug 5, 2025

Summary by CodeRabbit

  • New Features

    • Enhanced sequence management with device-aware tensor handling and host-pinned memory for faster CPU-GPU transfers.
    • Added support for more efficient input updates and position tracking during generation.
    • Introduced profiling instrumentation for key model operations.
  • Performance Improvements

    • Optimized data transfer and memory allocation for input and position tensors, reducing overhead.
    • Improved asynchronous tensor copying to accelerate input processing.
    • Introduced NVTX profiling for key operations to aid in performance tracing.
  • Bug Fixes

    • Improved device specification consistency and eliminated unnecessary data transfers between CPU and GPU.

Description

Optimize the prepare_inputs routine in AutoDeploy, as part of the effort to close the performance gap with the default backend.
This PR includes two major fixes and some other minor tweaks:

  • Avoid back-and-forth data copies between CPU and GPU.
  • Optimize the position IDs update by separating the implementations for generation mode and context mode.
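The generation/context split can be illustrated with a minimal pure-Python sketch (names and signature are hypothetical, not the actual SequenceInfo code; lists stand in for tensors):

```python
def compute_position_ids(seq_lens, input_pos):
    """Illustrative sketch of the generate/context split.

    seq_lens:  number of tokens each sequence contributes this step
    input_pos: number of tokens already cached per sequence
    """
    # Generation fast path: every sequence contributes exactly one token,
    # so its position is simply its current input_pos.
    if all(sl == 1 for sl in seq_lens):
        return list(input_pos)
    # Context path: one position per token, offset by each sequence's input_pos.
    positions = []
    for sl, start in zip(seq_lens, input_pos):
        positions.extend(range(start, start + sl))
    return positions
```

In generation mode this avoids any per-token arithmetic on the device; the host-side values can be copied over as-is.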

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

Contributor

coderabbitai bot commented Aug 5, 2025

📝 Walkthrough

The changes introduce device-aware tensor management, host-pinned memory usage, and profiling instrumentation in the SequenceInfo dataclass to optimize sequence and position ID updates. Input preparation in the executor is refactored to avoid unnecessary CPU transfers and to handle new tokens more efficiently. Minor updates are made to CUDA graph input copying.

Changes

Cohort / File(s) — Change Summary

  • Device-aware SequenceInfo & Host-Pinned Memory — tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py: Refactored SequenceInfo to add a device attribute and manage tensors on the specified device. Changed tensor shapes for input_ids and position_ids to 1D. Introduced host-pinned memory tensors (seq_len_host, input_pos_host, etc.) to optimize CPU-GPU transfers. Added methods for in-place updates (update_input_ids_with_new_tokens), conditional tensor reallocation (nest_sequences with allow_realloc), reshaping (maybe_reshape_for_generate), and sequence length updates (_update_sequence_lengths). Replaced position ID update logic with a new method using NVTX profiling and host-pinned memory. Updated cache location assignment to use pinned memory. Removed the old device property and related methods. Added NVTX profiling instrumentation to key update methods.
  • Input Preparation & Executor Refactor — tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py: Updated build_from_config to append the CUDA device index to the device string if missing before passing it to SequenceInfo. Removed conversion of the new_tokens tensor to a CPU list in _prepare_inputs, using tensors directly instead. When handling generation requests, appended a dummy token -1 to input_ids and tracked batch indices. Reordered sequence info updates to call update_pos and assign_cache_loc before nest_sequences. Added a call to update_input_ids_with_new_tokens if new tokens are present. Annotated _compute_logits with an NVTX profiling decorator. Renamed the new_tokens_device parameter to new_tensors_device in the forward method and updated internal usage accordingly.
  • CUDA Graph Input Copy Optimization — tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py: Modified CapturedGraph.forward to replace synchronous slice assignment of input tensors with asynchronous in-place copy_ using non_blocking=True, improving data transfer semantics without changing control flow or error handling.
  • Test Update for SequenceInfo Nesting — tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py: Updated the test call to nest_sequences by adding the allow_realloc=True argument to reflect changes in the method signature and behavior. No other changes to test logic.

Sequence Diagram(s)

sequenceDiagram
    participant Executor as ADEngine
    participant SeqInfo as SequenceInfo
    participant Device as CUDA/Host

    Executor->>SeqInfo: build_from_config (pass device string)
    Executor->>SeqInfo: _prepare_inputs(new_tokens)
    alt If new_tokens provided
        SeqInfo->>Device: update_input_ids_with_new_tokens (host-pinned copy)
    end
    Executor->>SeqInfo: update_pos, assign_cache_loc, nest_sequences
    SeqInfo->>Device: _update_position_ids (host-pinned, NVTX profiling)
    SeqInfo->>Device: Tensor updates (input_ids, position_ids)
    Executor->>Executor: _compute_logits (NVTX profiling)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested reviewers

  • yilin-void
  • venkywonka
  • symphonylyh

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac4ae69 and f8aa994.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (12 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

190-190: LGTM: Test updated to match new API signature.

The addition of the allow_realloc=True parameter aligns the test with the updated nest_sequences method signature in SequenceInfo. This ensures the test exercises the new device tensor reallocation behavior.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (13)

21-21: LGTM: Added NVTX profiling support.

The import of nvtx_range enables performance profiling instrumentation throughout the sequence management operations.


92-94: LGTM: Device-aware configuration.

Adding the device field with a default of "cuda" enables device-aware tensor management throughout the SequenceInfo class.


97-98: LGTM: Simplified tensor field initialization.

Converting input_ids and position_ids to 1D tensors with default device allocation is consistent with the flattened sequence approach and device-aware design.


117-155: Comprehensive tensor initialization improvements.

The enhanced __post_init__ method introduces several optimizations:

  1. Device-aware allocation: All tensors are now allocated on the specified device
  2. Host-pinned memory: Added seq_len_host and input_pos_host for efficient host-device transfers
  3. Full-length tensors: input_ids_full and position_ids_full reduce tensor creation overhead
  4. Optimized indexing: previous_batch_indices_cuda for efficient batch operations

The approach of maintaining both working and full-length versions of tensors is a good optimization to avoid repeated allocations.


167-169: LGTM: Added token count tracking.

The num_tokens field provides efficient access to the current batch token count, avoiding repeated calculations.


187-190: LGTM: Clarified method documentation.

The updated docstring clearly explains that this method returns the count of original graph arguments (input_ids and position_ids).


422-454: Excellent optimization for position ID updates.

The _update_position_ids method introduces several key optimizations:

  1. Fast path for generation mode: When all seq_len values are 1, it directly uses input_pos_host as position_ids
  2. Host-side computation: Position ID calculation is done on host to avoid device transfers
  3. Pinned memory: Uses pinned memory for efficient async transfers
  4. Conditional reallocation: allow_realloc parameter controls whether to create new tensors or update in-place

The NVTX profiling ranges will help identify performance bottlenecks in production.


455-462: LGTM: Optimized sequence length updates.

The method efficiently updates sequence information using pinned memory for host-device transfers and includes proper NVTX profiling.


463-489: Efficient in-place input ID updates.

The update_input_ids_with_new_tokens method implements an optimized approach:

  1. Pinned memory transfers: Uses pinned memory for async host-to-device copies
  2. Sorted indexing: Sorts indices for proper masked_scatter_ alignment
  3. In-place updates: Uses masked_scatter_ to efficiently update only the necessary positions
  4. Placeholder replacement: Updates positions marked with -1

This approach should significantly reduce memory allocation overhead compared to recreating tensors.
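A pure-Python analogue of the sorted-index masked_scatter_ update may make the placeholder scheme concrete (illustrative only; lists stand in for tensors and the helper name is hypothetical):

```python
def scatter_new_tokens(input_ids, new_tokens, batch_indices):
    """Overwrite -1 placeholders with new tokens, matched by batch order.

    Mirrors the masked_scatter_ pattern: placeholders are filled in the order
    they appear, so tokens must first be sorted by their batch indices.
    """
    order = sorted(range(len(batch_indices)), key=lambda i: batch_indices[i])
    tokens_sorted = [new_tokens[i] for i in order]
    out = list(input_ids)
    token_iter = iter(tokens_sorted)
    for pos, tok in enumerate(out):
        if tok == -1:  # placeholder marking where a new token belongs
            out[pos] = next(token_iter)
    return out
```

The real implementation performs the same matching on-device with masked_scatter_, avoiding any per-token host loop.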


491-524: Enhanced nest_sequences with reallocation control.

The updated method adds:

  1. Reallocation control: allow_realloc parameter enables choosing between tensor reallocation vs. in-place updates
  2. Pinned memory optimization: Uses pinned memory for efficient host-device transfers
  3. NVTX profiling: Added performance monitoring
  4. Consistent reshaping: Maintains the existing reshaping logic for generate vs. context modes

The dual-path approach (realloc vs. in-place) provides flexibility for different use cases while optimizing performance.
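The realloc vs. in-place dual path can be sketched in pure Python (illustrative only; lists stand in for preallocated device tensors and the signature is simplified):

```python
def nest_sequences(buffer, token_seqs, allow_realloc):
    """Sketch of the allow_realloc dual path."""
    flat = [tok for seq in token_seqs for tok in seq]
    if allow_realloc:
        # Reallocation path: return a fresh, exactly-sized buffer.
        return flat
    # In-place path: overwrite the prefix of the preallocated full-size
    # buffer and leave the tail untouched, avoiding a new allocation.
    buffer[: len(flat)] = flat
    return buffer
```

The in-place path is what makes the full-length input_ids_full / position_ids_full tensors pay off: steady-state decoding never allocates.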


525-528: LGTM: Added profiling to unnest_sequences.

The NVTX profiling range enables performance monitoring for the sequence unnesting operation.


530-548: Optimized position updates with async transfers.

The update_pos method improvements:

  1. Pinned memory: Creates tensors with pinned memory for efficient transfers
  2. Async copies: Uses non_blocking=True for better performance
  3. NVTX profiling: Monitors performance bottlenecks
  4. Host-device synchronization: Maintains consistency between host and device tensors

The approach reduces synchronization overhead and improves throughput.


549-563: LGTM: Optimized cache location assignment.

The assign_cache_loc method uses pinned memory and async transfers for both cache_loc and pages_per_seq updates, maintaining consistency with the overall optimization approach.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

152-154: Consider lazy allocation for previous_batch_indices_cuda.

The previous_batch_indices_cuda tensor is allocated during initialization but may not be used in all workflows. Consider lazy allocation when update_input_ids_with_new_tokens is first called to save memory.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78a75c2 and 96c77bd.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (11 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧠 Learnings (1)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • update_input_pos (526-544)
  • sequence_lengths (228-229)
  • nest_sequences (487-518)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • update_input_pos (526-544)
  • update_position_ids (418-448)
  • reset (351-369)
tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (1)
  • all_close (6-21)
🔇 Additional comments (10)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)

118-119: LGTM! API update is correct.

The change from update_pos to update_input_pos correctly follows the refactored API in SequenceInfo. The added comment appropriately documents that nest_sequences handles position ID updates.

tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)

165-165: Good optimization for asynchronous data transfer.

Using copy_ with non_blocking=True enables asynchronous copying when possible, which aligns with the broader performance optimizations in this PR for efficient CPU-GPU data transfers.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

210-211: Test updates correctly reflect the new API design.

The split of position updates into update_input_pos() and update_position_ids() is properly implemented. The comment explaining why explicit update_position_ids() calls are needed (when nest_sequences is not called) provides good documentation.

Also applies to: 215-216, 219-220, 224-225, 228-229

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)

97-102: Proper device handling implementation.

The enhanced device handling correctly ensures that CUDA device strings carry an explicit index, defaulting to the current device when none is specified. This prevents potential device-mismatch issues.
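A minimal sketch of the device-string normalization (the helper name and signature are hypothetical; current_index stands in for torch.cuda.current_device()):

```python
def normalize_device(device: str, current_index: int = 0) -> str:
    """Append an explicit index to a bare CUDA device string ("cuda" -> "cuda:0")."""
    if device.startswith("cuda") and ":" not in device:
        return f"{device}:{current_index}"
    return device
```

Pinning the index up front means every tensor SequenceInfo allocates lands on the same concrete device, even in multi-GPU processes.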


190-197: Efficient new token handling without CPU transfers.

The refactored logic efficiently handles new tokens by:

  1. Using placeholder tokens (-1) to mark positions for new tokens
  2. Tracking batch indices for proper token placement
  3. Utilizing the new update_input_ids_with_new_tokens method for in-place updates

This avoids unnecessary CPU transfers and aligns with the performance optimization goals.

Also applies to: 212-218
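Building the flattened batch on the executor side might look like this sketch (pure Python, names hypothetical; the exact batch-index bookkeeping in ad_executor.py may differ):

```python
def build_batch_inputs(context_seqs, num_generate):
    """Flatten context prompts and append one -1 placeholder per generation request.

    The placeholders are later overwritten on-device from new_tokens, so no
    token ever round-trips through a CPU list.
    """
    input_ids = [tok for seq in context_seqs for tok in seq]
    batch_indices = list(range(num_generate))
    input_ids.extend([-1] * num_generate)
    return input_ids, batch_indices
```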

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)

93-94: Excellent device-aware tensor management implementation.

The changes introduce proper device-aware initialization with:

  • Explicit device specification for all tensors
  • Host-pinned memory for seq_len_host and input_pos_host to optimize CPU-GPU transfers
  • Maintenance of full-size tensor variants to avoid reallocation overhead
  • Consistent 1D tensor shapes for flattened sequence handling

These optimizations align perfectly with the PR's performance goals.

Also applies to: 97-98, 130-154


417-448: Highly optimized position ID update implementation.

The refactored update_position_ids method includes excellent optimizations:

  • NVTX profiling for performance analysis
  • Fast path for generation mode (all seq_len == 1)
  • Host-side computation to avoid device synchronization
  • Non-blocking copies for better CPU-GPU overlap
  • Conditional reallocation support

This is a well-thought-out performance improvement.


458-485: Efficient in-place token update implementation.

The update_input_ids_with_new_tokens method efficiently updates tokens using:

  • Masked scatter for in-place updates where tokens are marked with -1
  • Sorted indices to ensure correct alignment
  • Host-pinned memory for index transfers
  • Minimal memory allocations

This approach avoids unnecessary copies and aligns with the performance optimization goals.


487-518: Well-structured sequence nesting with flexible memory management.

The refactored nest_sequences method provides:

  • Flexible memory management with allow_realloc parameter
  • Centralized sequence length updates via _update_sequence_lengths
  • Efficient non-blocking copies with pinned memory
  • Automatic position ID updates maintaining consistency

Good separation of concerns and performance optimization.


525-544: Clean refactoring with performance optimizations.

The renamed update_input_pos method includes:

  • More descriptive naming
  • NVTX profiling instrumentation
  • Host-pinned memory for efficient CPU-GPU transfers
  • Non-blocking copy operations

These changes align with the overall performance optimization strategy.

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from 96c77bd to 3899933 Compare August 6, 2025 11:09
@galagam galagam changed the title [#5048][AutoDeploy: Optimize prepare_inputs [#5048][AutoDeploy] Optimize prepare_inputs Aug 6, 2025
@galagam galagam changed the title [#5048][AutoDeploy] Optimize prepare_inputs [#5048][feat] AutoDeploy: Optimize prepare_inputs Aug 6, 2025
@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from 3899933 to 413cd3f Compare August 6, 2025 13:59
@galagam galagam changed the title [#5048][feat] AutoDeploy: Optimize prepare_inputs [#5048][fix] AutoDeploy: Optimize prepare_inputs Aug 6, 2025
@galagam galagam marked this pull request as ready for review August 6, 2025 14:15
@galagam galagam requested a review from a team as a code owner August 6, 2025 14:15
@galagam galagam requested a review from Fridah-nv August 6, 2025 14:15
@galagam
Collaborator Author

galagam commented Aug 6, 2025

/bot run

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from 413cd3f to ac4ae69 Compare August 6, 2025 17:22
@galagam
Collaborator Author

galagam commented Aug 6, 2025

/bot run

@venkywonka
Collaborator

/bot run

@github-project-automation github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Aug 6, 2025
@tensorrt-cicd
Collaborator

PR_Github #14323 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14323 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10821 completed with status: 'FAILURE'

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from ac4ae69 to f8aa994 Compare August 7, 2025 09:24
@galagam galagam changed the title [#5048][fix] AutoDeploy: Optimize prepare_inputs [#5048][enhance] AutoDeploy: Optimize prepare_inputs Aug 7, 2025
@galagam
Collaborator Author

galagam commented Aug 7, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14442 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14442 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10916 completed with status: 'FAILURE'

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from f8aa994 to f84a16f Compare August 7, 2025 11:15
@galagam
Collaborator Author

galagam commented Aug 7, 2025

/bot run


github-actions bot commented Aug 7, 2025


--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
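As an illustration, a typical invocation combining several of the flags above might look like the following (the stage and GPU names here are hypothetical examples, not a recommendation):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "A30, H100_PCIe"
```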

kill

Kill all running builds associated with the pull request.

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@suyoggupta (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14503 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14503 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10954 completed with status: 'FAILURE'

suyoggupta and others added 3 commits August 9, 2025 22:40
avoid gpu->cpu transfer when using overlap scheduler

Signed-off-by: Suyog Gupta <[email protected]>

prealloc

Signed-off-by: Suyog Gupta <[email protected]>
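The "prealloc" and sync-removal commits above can be sketched roughly as follows. This is an illustrative sketch only, not the actual TensorRT-LLM code: the helper names (`make_host_buffer`, `stage_input_ids`) and shapes are assumptions. The idea is to preallocate one host staging buffer (pinned when CUDA is available) and reuse it with a single non-blocking host-to-device copy per step, instead of allocating a fresh tensor and synchronizing each time.

```python
import torch

def make_host_buffer(max_tokens: int) -> torch.Tensor:
    # Preallocate the host-side staging buffer once. Pinned (page-locked)
    # memory allows truly asynchronous H2D copies; fall back to pageable
    # memory when no CUDA device is present.
    pin = torch.cuda.is_available()
    return torch.empty(max_tokens, dtype=torch.long, pin_memory=pin)

def stage_input_ids(host_buf: torch.Tensor, new_ids, device) -> torch.Tensor:
    # Write the new token ids into the preallocated buffer, then issue a
    # single non-blocking copy to the target device rather than allocating
    # and copying on every generation step.
    n = len(new_ids)
    host_buf[:n] = torch.as_tensor(new_ids, dtype=torch.long)
    return host_buf[:n].to(device, non_blocking=True)
```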

optimize prepare input

Signed-off-by: Suyog Gupta <[email protected]>

more changes

Signed-off-by: Suyog Gupta <[email protected]>

remove more syncs

Signed-off-by: Suyog Gupta <[email protected]>

clean up

Signed-off-by: Suyog Gupta <[email protected]>

clean up position_id handling

Signed-off-by: Suyog Gupta <[email protected]>

revert spurious changes

Signed-off-by: Suyog Gupta <[email protected]>

revert spurious print

Signed-off-by: Suyog Gupta <[email protected]>

Update export_to_gm.py

Signed-off-by: Suyog Gupta <[email protected]>

revert spurious print

Signed-off-by: Suyog Gupta <[email protected]>

revert nvtx debug

Signed-off-by: Suyog Gupta <[email protected]>

revert nvtx debug

Signed-off-by: Suyog Gupta <[email protected]>

update_position_ids is the bottleneck - add nvtx markers and small list comprehension perf improvement

Signed-off-by: Gal Hubara Agam <[email protected]>

optimize prepare_list for generation mode

Signed-off-by: Gal Hubara Agam <[email protected]>

bugfix in _update_position_ids optimized code

Signed-off-by: Gal Hubara Agam <[email protected]>
Signed-off-by: Gal Hubara Agam <[email protected]>
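The position-id commits above can be summarized with a small sketch (illustrative only; `seq_lens`, `cache_lens`, and the mode split mirror the PR description, not the exact TensorRT-LLM implementation). In generation mode each sequence contributes exactly one new token, so its position is simply the current KV-cache length; context mode needs a full range of positions per sequence, which a flat list comprehension builds without per-sequence appends.

```python
def update_position_ids(seq_lens, cache_lens, is_generate: bool):
    # Generation (decode) mode: one new token per sequence, positioned
    # right after the tokens already held in that sequence's KV cache.
    if is_generate:
        return list(cache_lens)
    # Context (prefill) mode: every token needs a position; a single
    # flat list comprehension replaces a Python loop with repeated appends.
    return [
        pos
        for start, n in zip(cache_lens, seq_lens)
        for pos in range(start, start + n)
    ]
```

Splitting the two modes is what avoids recomputing full ranges on every decode step, which the commit messages identify as the bottleneck.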
@galagam (Collaborator, Author) commented Aug 10, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14697 [ run ] triggered by Bot

@galagam galagam force-pushed the user/galagam/ad-prepare-inputs-opt branch from f84a16f to a0bcb2c on August 10, 2025, 05:52
@tensorrt-cicd (Collaborator)

PR_Github #14697 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11094 completed with status: 'FAILURE'

@galagam (Collaborator, Author) commented Aug 10, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14702 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14702 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11097 completed with status: 'SUCCESS'

@galagam galagam merged commit 3c5aec1 into NVIDIA:main Aug 10, 2025
4 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Aug 10, 2025
4 participants