Skip to content

Conversation

pcastonguay
Copy link
Collaborator

@pcastonguay pcastonguay commented Aug 1, 2025

Summary by CodeRabbit

  • New Features

    • Added support for attention data parallelism in event management, allowing configuration of attention DP rank, size, and event gathering period.
    • Introduced asynchronous MPI-based event exchange and merging across ranks for attention DP.
    • Python bindings for KVCacheEvent now include a read-only attribute for attention DP rank.
    • Added a new test verifying attention DP KV cache event behavior across multiple requests.
  • Enhancements

    • Event serialization and JSON export now include attention DP rank information.
    • Extended serialization support for detailed KVCache event data structures enabling efficient data exchange.
    • Added MPI communication tags and improved event queue handling for distributed attention DP event synchronization.
    • Added configuration parameter for attention DP events gather period with validation and Python API exposure.
    • Updated tests to cover serialization/deserialization of KVCache events and configuration including attention DP parameters.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@pcastonguay pcastonguay requested a review from a team as a code owner August 1, 2025 17:30
@pcastonguay pcastonguay requested a review from achartier August 1, 2025 17:30
Copy link
Contributor

coderabbitai bot commented Aug 1, 2025

📝 Walkthrough

Walkthrough

Support for attention data parallelism was added to the KVCache event management system. This includes new constructor parameters, member variables, and logic for handling attention DP ranks and sizes in both C++ and Python. Event serialization, event queue management, distributed event exchange via MPI, and Python bindings were updated to propagate and expose the new attention DP context.

Changes

Cohort / File(s) Change Summary
KVCacheEventManager attention DP support
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h, cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
KVCacheEventManager constructors updated to accept and validate attention DP parameters. New member variables for attention DP state. Event queue serialization and distributed event gathering logic via MPI added. Event creation methods propagate attention DP rank. Worker thread and attention DP exchange thread added.
KVCacheEvent struct and constructor
cpp/include/tensorrt_llm/executor/executor.h, cpp/tensorrt_llm/executor/executor.cpp
KVCacheEvent struct and constructor updated to include an optional attention DP rank parameter and member. KVCacheUpdatedData constructor overload added.
Python bindings for KVCacheEvent
cpp/tensorrt_llm/pybind/executor/bindings.cpp
Python bindings for KVCacheEvent extended to expose attention_dp_rank as a read-only property.
KVCacheManager initialization
tensorrt_llm/_torch/pyexecutor/resource_manager.py
KVCacheEventManagerCpp initialization updated to pass attention DP and pipeline parallelism parameters based on mapping attributes, including the new attention DP events gather period.
KVCacheEvent serialization
tensorrt_llm/_utils.py
KVCacheEventSerializer updated to include attention_dp_rank in the event's JSON serialization.
Serialization support for KVCacheEvent and related types
cpp/include/tensorrt_llm/executor/serialization.h, cpp/tensorrt_llm/executor/serialization.cpp, cpp/tensorrt_llm/executor/serializeUtils.h
Added serialization and deserialization methods for KVCacheEvent and related data structures including KVCacheCreatedData, KVCacheStoredData, KVCacheStoredBlockData, KVCacheRemovedData, KVCacheEventDiff, KVCacheUpdatedData, and UniqueToken. Integrated KVCacheEvent into existing serialization utilities. Extended KvCacheConfig serialization to include attention DP events gather period.
MPI Tags for KVCacheEventManager
cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
Added new MPI tags kKvCacheEventSize and kKvCacheEvent for attention DP event communication.
KvCacheConfig class updates
cpp/include/tensorrt_llm/executor/executor.h, cpp/tensorrt_llm/executor/kvCacheConfig.cpp, cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp, cpp/tensorrt_llm/pybind/executor/executorConfig.cpp, tensorrt_llm/llmapi/llm_args.py
Added new member variable and parameter attentionDpEventsGatherPeriodMs with validation, getter, and setter in KvCacheConfig. Exposed this parameter in nanobind and pybind Python bindings and in the LLM API config class.
Unit tests for serialization and config
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp, tests/unittest/bindings/test_executor_bindings.py, tests/unittest/llmapi/test_llm_args.py
Added tests for serialization/deserialization of KVCacheEvent variants and UniqueToken. Extended KvCacheConfig serialization test to cover new attention DP events gather period field. Updated Python binding tests to verify new config parameter default and set values.
New attention DP KV cache events test
tests/unittest/llmapi/test_llm_kv_cache_events.py
Added a new test function to validate attention data parallel KV cache events across multiple requests, checking event types, ranks, window sizes, and stored block counts.
Test list update
tests/integration/test_lists/test-db/l0_dgx_h200.yml
Added the new attention DP KV cache events test to the DGX H200 4-GPU TensorRT backend test list.
KVCacheEventManager nanobind and pybind constructor bindings
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
Updated constructor bindings for KVCacheEventManager to new signature with attention DP rank, size, and event gather period parameters.

Sequence Diagram(s)

sequenceDiagram
    participant PythonApp
    participant KVCacheManager
    participant KVCacheEventManagerCpp
    participant MPI
    participant KVCacheEvent

    PythonApp->>KVCacheManager: create (with mapping)
    KVCacheManager->>KVCacheEventManagerCpp: initialize(max_kv_event_entries, attention_dp_rank, attention_dp_size, pp_size)
    KVCacheEventManagerCpp->>KVCacheEventManagerCpp: validate and store DP parameters
    KVCacheEventManagerCpp->>KVCacheEventManagerCpp: start exchangeAttentionDpThread()

    PythonApp->>KVCacheManager: trigger event
    KVCacheManager->>KVCacheEventManagerCpp: enqueueCreatedEvent / enqueueStoredEvent / enqueueUpdatedEvent(...)
    KVCacheEventManagerCpp->>KVCacheEvent: create(event_id, data, window_size, attention_dp_rank)
    KVCacheEvent-->>KVCacheEventManagerCpp: event instance

    KVCacheManager->>KVCacheEventManagerCpp: flush()
    KVCacheEventManagerCpp->>KVCacheEventManagerCpp: notify worker thread

    alt attention DP enabled and size > 1
        KVCacheEventManagerCpp->>MPI: non-blocking probe for incoming events (rank 0)
        KVCacheEventManagerCpp->>MPI: async send serialized events (non-zero ranks)
        MPI->>KVCacheEventManagerCpp: receive serialized events (rank 0)
        KVCacheEventManagerCpp->>KVCacheEventManagerCpp: deserialize and merge events
        KVCacheEventManagerCpp->>KVCacheEventManagerCpp: notify worker thread after merge
    end

    KVCacheEventManagerCpp->>KVCacheEventManagerCpp: worker thread processes event queue
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Suggested labels

Community want to contribute

Suggested reviewers

  • litaotju
  • Shixiaowei02
  • nv-guomingz

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)

91-103: Add missing mAttentionDpRank to removed‐event enqueue for consistency

The other enqueue*Event methods all pass mAttentionDpRank into the KVCacheEvent constructor, but in enqueueRemovedEvent (line 101) it uses the three-argument overload and drops the rank info (defaulting to nullopt). To keep all events carrying the same metadata, include mAttentionDpRank here as well.

• File: cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
• Line: 101

Suggested change:

-        enqueueEvent({mEventId++, tle::KVCacheRemovedData{{block->getHash()}}, windowSize});
+        enqueueEvent({mEventId++, tle::KVCacheRemovedData{{block->getHash()}}, windowSize, mAttentionDpRank});
🧹 Nitpick comments (2)
cpp/include/tensorrt_llm/executor/executor.h (2)

1730-1735: Consider passing attentionDpRank by const-reference and documenting it in the Doxygen header.

std::optional<SizeType32> is small, but by convention optional parameters are usually taken as const& to avoid an extra move/copy and to keep the signature uniform with the rest of the header.
While you are touching the constructor, please also extend the Doxygen comment that precedes the struct (or add one above the ctor) so that the new field is visible in generated docs.

-    KVCacheEvent(IdType eventId, KVCacheEventData data, SizeType32 windowSize,
-        std::optional<SizeType32> attentionDpRank = std::nullopt)
+    KVCacheEvent(IdType eventId, KVCacheEventData data, SizeType32 windowSize,
+        std::optional<SizeType32> const& attentionDpRank = std::nullopt)

This is a low-impact change, so feel free to defer if binary size or compile time is a larger concern.


1743-1744: Add a default member-initializer to keep aggregate defaults predictable.

If somebody default-constructs a KVCacheEvent via aggregate initialization (e.g. KVCacheEvent e{}) the new attentionDpRank will be value-initialised, but making that explicit improves readability and prevents future regressions when additional constructors are added.

-    std::optional<SizeType32> attentionDpRank;
+    std::optional<SizeType32> attentionDpRank{};   //!< Attention DP rank (nullopt if not applicable)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 263c6c0 and 832bb10.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (2 hunks)
  • cpp/include/tensorrt_llm/executor/executor.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/executor.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tensorrt_llm/_utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...

Files:

  • cpp/tensorrt_llm/pybind/executor/bindings.cpp
  • cpp/tensorrt_llm/executor/executor.cpp
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/pybind/executor/bindings.cpp
  • cpp/tensorrt_llm/executor/executor.cpp
  • tensorrt_llm/_utils.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_utils.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)

243-244: LGTM! Property binding follows established patterns.

The new attention_dp_rank read-only property correctly exposes the C++ attentionDpRank member using appropriate Python naming conventions.

cpp/tensorrt_llm/executor/executor.cpp (1)

135-142: LGTM! Constructor correctly extended with optional parameter.

The constructor properly adds the new attentionDpRank optional parameter and includes it in the initializer list, maintaining backwards compatibility while following C++ coding conventions.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

302-309: LGTM! Attention DP parameters correctly passed to event manager.

The changes properly propagate attention data parallelism configuration to the KV cache event manager. When attention DP is enabled, the rank and world size are correctly mapped to provide the necessary context for event handling across distributed ranks.

cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (1)

39-41: Clarify the usage of enableAttentionDp parameter.

The constructor accepts an enableAttentionDp parameter but it doesn't appear to be used to initialize mEnableAttentionDp. Consider either:

  1. Using the parameter to initialize the member variable, or
  2. Removing the parameter if mEnableAttentionDp is meant to be derived from the presence of attentionDpRank/attentionDpSize

Also applies to: 98-101

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)

27-34: Constructor signature still has the same issues as previously identified.

The constructor implementation still has the same problems that were flagged in previous reviews:

  1. Line 28: Parameter type std::optional<attentionDpSize> should be std::optional<SizeType32>
  2. Missing enableAttentionDp parameter that should be declared in the header
🧹 Nitpick comments (2)
cpp/tensorrt_llm/executor/serialization.cpp (2)

2185-2185: Fix typo in comment

The comment has a minor typo - it should be "KVCacheEvent deque" not "KVCacheEvents deque" to match the actual type.

-//  KVCacheEvents deque
+//  KVCacheEvent deque

2324-2324: Fix capitalization in comment

The comment should use consistent capitalization to match the actual type name.

-// KVcacheRemovedData
+// KVCacheRemovedData
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 832bb10 and 3a56c1e.

📒 Files selected for processing (6)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/include/tensorrt_llm/executor/serialization.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/utils/mpiTags.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/serialization.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/serializeUtils.h (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/include/tensorrt_llm/executor/executor.h
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope and function-...

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
  • cpp/include/tensorrt_llm/executor/serialization.h
  • cpp/tensorrt_llm/executor/serializeUtils.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
  • cpp/include/tensorrt_llm/executor/serialization.h
  • cpp/tensorrt_llm/executor/serializeUtils.h
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/executor/serialization.h
  • cpp/tensorrt_llm/executor/serializeUtils.h
🧠 Learnings (1)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • cpp/tensorrt_llm/executor/serialization.cpp
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)
cpp/tensorrt_llm/executor/serialization.cpp (18)
  • serialize (44-47)
  • serialize (44-44)
  • serialize (89-113)
  • serialize (89-89)
  • serialize (173-194)
  • serialize (173-173)
  • serialize (235-244)
  • serialize (235-235)
  • serialize (267-271)
  • serialize (267-267)
  • serialize (292-298)
  • serialize (292-292)
  • serialize (318-322)
  • serialize (318-318)
  • serialize (341-346)
  • serialize (341-341)
  • deserializeKVCacheEvents (2209-2220)
  • deserializeKVCacheEvents (2209-2209)
cpp/tensorrt_llm/executor/serialization.cpp (2)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (6)
  • serialize (54-60)
  • serialize (54-54)
  • serializedSize (71-76)
  • serializedSize (71-71)
  • deserialize (62-69)
  • deserialize (62-62)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2)
  • serializedSize (147-165)
  • deserialize (127-145)
🔇 Additional comments (15)
cpp/tensorrt_llm/executor/serializeUtils.h (3)

125-125: LGTM!

The static assertion correctly verifies that KVCacheEvent supports serialized size calculation, following the established pattern for other types in the codebase.


223-223: LGTM!

The static assertion correctly verifies that KVCacheEvent supports serialization, maintaining consistency with the existing compile-time verification pattern.


516-519: LGTM!

The deserialization branch for KVCacheEvent is correctly implemented, following the established pattern of using constexpr if with type checking and delegating to the appropriate Serialization class method.

cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (5)

36-49: LGTM!

The validation logic for attention DP parameters is comprehensive and well-designed:

  • Ensures consistency between rank and size parameters
  • Properly validates constraints (PP > 1 not supported with attention DP)
  • Provides informative error messages
  • Covers both directions of the relationship between rank and size parameters

51-52: LGTM!

The thread initialization is properly implemented, following the same pattern as the existing worker thread. The new attention DP exchange thread is appropriately initialized for handling distributed event exchange.


67-67: LGTM!

The event enqueuing methods are consistently updated to include attention DP rank information, enabling proper event tagging for distributed processing across all event types.

Also applies to: 87-90, 109-109


139-140: LGTM!

The explicit notification to the worker thread after flushing events is a good improvement that ensures responsive event processing. The comment clearly explains the rationale.


194-231: LGTM!

The worker method refactoring is a significant improvement that adds robust event queue management:

  • Proper condition variable-based waiting for pending events
  • Clear size-based eviction strategy (oldest events removed first)
  • Appropriate logging when events are discarded due to queue limits
  • Proper notification of waiting threads

The implementation correctly handles the distributed event processing requirements.

cpp/include/tensorrt_llm/executor/serialization.h (1)

305-342: LGTM!

The serialization method declarations for KVCacheEvent and related types are comprehensive and well-designed:

  • Follow established naming conventions and patterns
  • Proper const-correctness and [[nodiscard]] usage
  • Complete coverage of all KVCacheEvent data structures
  • Consistent method signatures with the existing serialization framework
  • Include both individual object serialization and batch operations for deques

These declarations properly extend the serialization infrastructure to support the new attention DP functionality.

cpp/tensorrt_llm/executor/serialization.cpp (6)

2223-2249: Correctly implements KVCacheEvent serialization with attention DP rank

The implementation properly handles the new attentionDpRank field as an optional value, maintaining consistency with the serialization pattern used throughout the file.


2252-2268: Simple and correct implementation


2271-2290: Correct implementation for KVCacheStoredData


2293-2322: Proper serialization of KVCacheStoredBlockData


2346-2365: Correct template specialization for KVCacheEventDiff


2368-2390: Well-structured serialization for KVCacheUpdatedData

@pcastonguay pcastonguay requested a review from a team as a code owner August 4, 2025 18:59
@pcastonguay pcastonguay requested a review from Superjomn August 4, 2025 18:59
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (3)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

108-127: Critical: Update pickle serialization to include the new attention_dp_events_gather_period_ms field.

The pickle state methods still expect 13 elements but should handle 14 with the new field. This will cause serialization failures.

Apply this diff to fix the serialization:

 auto kvCacheConfigGetstate = [](tle::KvCacheConfig const& self)
 {
     return nb::make_tuple(self.getEnableBlockReuse(), self.getMaxTokens(), self.getMaxAttentionWindowVec(),
         self.getSinkTokenLength(), self.getFreeGpuMemoryFraction(), self.getHostCacheSize(),
         self.getOnboardBlocks(), self.getCrossKvCacheFraction(), self.getSecondaryOffloadMinPriority(),
-        self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm());
+        self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm(),
+        self.getAttentionDpEventsGatherPeriodMs());
 };
 auto kvCacheConfigSetstate = [](tle::KvCacheConfig& self, nb::tuple const& state)
 {
-    if (state.size() != 13)
+    if (state.size() != 14)
     {
         throw std::runtime_error("Invalid state!");
     }
     new (&self) tle::KvCacheConfig(nb::cast<bool>(state[0]), nb::cast<std::optional<SizeType32>>(state[1]),
         nb::cast<std::optional<std::vector<SizeType32>>>(state[2]), nb::cast<std::optional<SizeType32>>(state[3]),
         nb::cast<std::optional<float>>(state[4]), nb::cast<std::optional<size_t>>(state[5]),
         nb::cast<bool>(state[6]), nb::cast<std::optional<float>>(state[7]),
         nb::cast<std::optional<tle::RetentionPriority>>(state[8]), nb::cast<size_t>(state[9]),
-        nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12]));
+        nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12]),
+        nb::cast<SizeType32>(state[13]));
 };
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)

196-196: Fix function name typo

There's a typo in the function call - it should be compareStaticBatchingStats instead of compareStaticBatching.

-        compareStaticBatching(val, val2);
+        compareStaticBatchingStats(val, val2);
cpp/tensorrt_llm/executor/serialization.cpp (1)

1191-1209: Missing size calculation for attentionDpEventsGatherPeriodMs

The serializedSize function is missing the size calculation for the new attentionDpEventsGatherPeriodMs field, which could lead to buffer size miscalculation.

Apply this fix:

 size_t Serialization::serializedSize(KvCacheConfig const& kvCacheConfig)
 {
 
     size_t totalSize = 0;
     totalSize += su::serializedSize(kvCacheConfig.getEnableBlockReuse());
     totalSize += su::serializedSize(kvCacheConfig.getEnablePartialReuse());
     totalSize += su::serializedSize(kvCacheConfig.getCopyOnPartialReuse());
     totalSize += su::serializedSize(kvCacheConfig.getMaxTokens());
     totalSize += su::serializedSize(kvCacheConfig.getMaxAttentionWindowVec());
     totalSize += su::serializedSize(kvCacheConfig.getSinkTokenLength());
     totalSize += su::serializedSize(kvCacheConfig.getFreeGpuMemoryFraction());
     totalSize += su::serializedSize(kvCacheConfig.getHostCacheSize());
     totalSize += su::serializedSize(kvCacheConfig.getOnboardBlocks());
     totalSize += su::serializedSize(kvCacheConfig.getCrossKvCacheFraction());
     totalSize += su::serializedSize(kvCacheConfig.getSecondaryOffloadMinPriority());
     totalSize += su::serializedSize(kvCacheConfig.getEventBufferMaxSize());
     totalSize += su::serializedSize(kvCacheConfig.getUseUvm());
+    totalSize += su::serializedSize(kvCacheConfig.getAttentionDpEventsGatherPeriodMs());
     return totalSize;
 }
🧹 Nitpick comments (5)
tests/unittest/bindings/test_executor_bindings.py (2)

1315-1318: Avoid hard-coding the new default – derive it from the implementation

5 is duplicated knowledge: if the implementation later changes its default period (e.g. to tune performance) this test would fail for the wrong reason. Make the assertion self-describing by asking the class for its default instead of inlining the literal.

-    assert config.attention_dp_events_gather_period_ms == 5
+    default_period = trtllm.KvCacheConfig().attention_dp_events_gather_period_ms
+    assert config.attention_dp_events_gather_period_ms == default_period

1331-1346: Style: prefer is False / not … over == False

Several assertions in this block use equality against the boolean literal. PEP 8 (and Ruff rule E712) recommend assert not config.enable_block_reuse or assert config.enable_block_reuse is False for better readability and to avoid subtle bugs with objects that override __eq__.

Updating the handful of new/modified lines while you touch this area will silence the linter without a large diff:

-    assert config.enable_block_reuse == False
+    assert not config.enable_block_reuse

(Apply the same pattern to the other boolean comparisons in this stanza if convenient.)

cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (1)

1-15: Update copyright year to 2025 for consistency.

Other files in this PR have updated the copyright to 2025, but this file still shows 2024.

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)

927-933: Rename test to be more descriptive

The test name "KVCacheEvents" is too generic. Since this test specifically validates KVCacheRemovedData, consider renaming it to "KVCacheRemovedEvent" for consistency with other test names.

-TEST(SerializeUtilsTest, KVCacheEvents)
+TEST(SerializeUtilsTest, KVCacheRemovedEvent)
cpp/include/tensorrt_llm/executor/serialization.h (1)

342-342: Fix comment typo

The comment says "KVCacheUpdateData" but the actual type is "KVCacheUpdatedData".

-    // KVCacheUpdateData
+    // KVCacheUpdatedData
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a56c1e and 8adbd75.

📒 Files selected for processing (14)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (4 hunks)
  • cpp/include/tensorrt_llm/executor/executor.h (6 hunks)
  • cpp/include/tensorrt_llm/executor/serialization.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (6 hunks)
  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/serialization.cpp (4 hunks)
  • cpp/tensorrt_llm/executor/serializeUtils.h (3 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (3 hunks)
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/unittest/bindings/test_executor_bindings.py (4 hunks)
  • tests/unittest/llmapi/test_llm_args.py (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • tests/unittest/llmapi/test_llm_args.py
  • tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
  • cpp/include/tensorrt_llm/executor/executor.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope and function-...

Files:

  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp
  • cpp/tensorrt_llm/executor/serializeUtils.h
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
  • cpp/include/tensorrt_llm/executor/serialization.h
  • cpp/tensorrt_llm/executor/serialization.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp
  • cpp/tensorrt_llm/executor/serializeUtils.h
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
  • cpp/include/tensorrt_llm/executor/serialization.h
  • cpp/tensorrt_llm/executor/serialization.cpp
  • tests/unittest/bindings/test_executor_bindings.py
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/executor/serializeUtils.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
  • cpp/include/tensorrt_llm/executor/serialization.h
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/bindings/test_executor_bindings.py
🧠 Learnings (1)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • cpp/tensorrt_llm/executor/serialization.cpp
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py

1332-1332: Avoid equality comparisons to False; use not config.enable_block_reuse: for false checks

Replace with not config.enable_block_reuse

(E712)

🔇 Additional comments (16)
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (2)

30-68: LGTM! Constructor implementation is clean and follows established patterns.

The new attentionDpEventsGatherPeriodMs parameter is properly validated and initialized.


135-139: LGTM! Getter implementation is correct.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

128-164: LGTM! Nanobind property and constructor bindings are properly implemented.

The new parameter is correctly added as a keyword-only argument with appropriate default value.

cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (2)

39-67: LGTM! Clean API design with backward compatibility.

The constructor uses optional parameters with sensible defaults, and the new exchange thread method is appropriately placed.


76-108: LGTM! Well-structured member variables with clear documentation.

The new members are properly organized and documented for the attention DP feature.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2)

101-121: LGTM! Pickle serialization correctly updated for the new field.

The state size check and tuple handling are properly updated to accommodate the new attention_dp_events_gather_period_ms field.


122-159: LGTM! Property bindings and constructor are properly implemented.

The new parameter integrates seamlessly with the existing API.

cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (5)

918-925: LGTM!

The test correctly validates serialization/deserialization of KVCacheEvent with KVCacheCreatedData.


935-945: LGTM!

The test thoroughly validates serialization of KVCacheStoredData with nested block data and token structures.


947-956: LGTM!

The test properly validates serialization of KVCacheUpdatedData with optional cache level and priority diffs.


958-976: LGTM!

Both UniqueToken tests correctly validate serialization of individual tokens and token vectors.


477-477: LGTM!

The test correctly validates the new attentionDpEventsGatherPeriodMs parameter in KvCacheConfig.

Also applies to: 493-493

cpp/include/tensorrt_llm/executor/serialization.h (1)

305-351: LGTM!

The serialization method declarations for KVCache events and UniqueToken follow the established patterns and are properly marked with [[nodiscard]] attributes where appropriate.

cpp/tensorrt_llm/executor/serializeUtils.h (3)

125-133: LGTM!

Static assertions correctly validate serializedSize support for all KVCache event types and UniqueToken.


230-237: LGTM!

Static assertions correctly validate serialize support for all KVCache event types and UniqueToken.


530-561: LGTM!

Deserialization branches correctly handle all KVCache event types and UniqueToken, properly delegating to the corresponding Serialization methods.

@pcastonguay pcastonguay force-pushed the attention_dp_events branch from 431ebff to 1332536 Compare August 4, 2025 19:27
@pcastonguay pcastonguay changed the title Draft: feat: Include attention dp rank info with KV cache events [TRTLLM-6881][feat] Include attention dp rank info with KV cache events Aug 4, 2025
@pcastonguay
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14028 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14028 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10577 completed with status: 'FAILURE'

@pcastonguay
Copy link
Collaborator Author

/bot run

@pcastonguay
Copy link
Collaborator Author

@coderabbitai full review

Copy link
Contributor

coderabbitai bot commented Aug 4, 2025

✅ Actions performed

Full review triggered.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (2)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

108-127: Fix missing serialization support for attention_dp_events_gather_period_ms.

The serialization/deserialization methods have not been updated to handle the new attention_dp_events_gather_period_ms parameter, which will cause runtime failures during pickle operations.

Apply this diff to fix the serialization:

 auto kvCacheConfigGetstate = [](tle::KvCacheConfig const& self)
 {
     return nb::make_tuple(self.getEnableBlockReuse(), self.getMaxTokens(), self.getMaxAttentionWindowVec(),
         self.getSinkTokenLength(), self.getFreeGpuMemoryFraction(), self.getHostCacheSize(),
         self.getOnboardBlocks(), self.getCrossKvCacheFraction(), self.getSecondaryOffloadMinPriority(),
-        self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm());
+        self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm(),
+        self.getAttentionDpEventsGatherPeriodMs());
 };
 auto kvCacheConfigSetstate = [](tle::KvCacheConfig& self, nb::tuple const& state)
 {
-    if (state.size() != 13)
+    if (state.size() != 14)
     {
         throw std::runtime_error("Invalid state!");
     }
     new (&self) tle::KvCacheConfig(nb::cast<bool>(state[0]), nb::cast<std::optional<SizeType32>>(state[1]),
         nb::cast<std::optional<std::vector<SizeType32>>>(state[2]), nb::cast<std::optional<SizeType32>>(state[3]),
         nb::cast<std::optional<float>>(state[4]), nb::cast<std::optional<size_t>>(state[5]),
         nb::cast<bool>(state[6]), nb::cast<std::optional<float>>(state[7]),
         nb::cast<std::optional<tle::RetentionPriority>>(state[8]), nb::cast<size_t>(state[9]),
-        nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12]));
+        nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12]),
+        nb::cast<SizeType32>(state[13]));
 };
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

109-121: Fix inconsistent tuple size check in deserialization.

The kvCacheConfigSetstate function checks for 13 elements but actually uses 14 elements from the tuple, causing a runtime error.

Apply this diff to fix the size check:

 auto kvCacheConfigSetstate = [](py::tuple const& state)
 {
-    if (state.size() != 13)
+    if (state.size() != 14)
     {
         throw std::runtime_error("Invalid state!");
     }
♻️ Duplicate comments (4)
tensorrt_llm/_utils.py (1)

1020-1020: Fix the incorrect serialization of attention_dp_rank field.

The attention_dp_rank field is being serialized using event_serialize_func, which is designed for complex event data objects. Since attention_dp_rank is likely an Optional[int], it should be serialized directly rather than passed through the event serialization function.

Apply this fix:

-            "attention_dp_rank": event_serialize_func(event.attention_dp_rank),
+            "attention_dp_rank": event.attention_dp_rank,
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1)

216-220: Fix parameter name inconsistency in setter method.

The parameter name uses "Poll" while the member variable and other methods use "Gather". This inconsistency could cause confusion.

Apply this diff to fix the naming inconsistency:

-void KvCacheConfig::setAttentionDpEventsGatherPeriodMs(SizeType32 attentionDpEventPollPeriodMs)
+void KvCacheConfig::setAttentionDpEventsGatherPeriodMs(SizeType32 attentionDpEventsGatherPeriodMs)
 {
-    TLLM_CHECK(attentionDpEventPollPeriodMs > 0);
-    mAttentionDpEventsGatherPeriodMs = attentionDpEventPollPeriodMs;
+    TLLM_CHECK(attentionDpEventsGatherPeriodMs > 0);
+    mAttentionDpEventsGatherPeriodMs = attentionDpEventsGatherPeriodMs;
 }
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)

170-173: Simplify mutex locking with std::lock_guard.

The current unique_lock usage can be simplified since there's no need for the advanced features of unique_lock.

Apply this diff to use lock_guard:

-                std::unique_lock<std::mutex> lck(mEventsMutex);
+                std::lock_guard<std::mutex> lck(mEventsMutex);
                 serializedEvents = executor::Serialization::serialize(mEvents);
                 mEvents.clear();

And:

-                        std::unique_lock<std::mutex> lck(mEventsMutex);
+                        std::lock_guard<std::mutex> lck(mEventsMutex);
                         mEvents.insert(mEvents.end(), rankEvents.begin(), rankEvents.end());
                         mEmptyCV.notify_one();

Also applies to: 200-203

cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (1)

65-65: Fix grammatical issue in comment

-    // Thread which exchange events if attentionDP is enabled
+    // Thread which exchanges events if attentionDP is enabled
🧹 Nitpick comments (3)
tests/unittest/bindings/test_executor_bindings.py (1)

1333-1333: Fix boolean comparison style issue.

Use not config.enable_block_reuse instead of == False for better Python style.

-    assert config.enable_block_reuse == False
+    assert not config.enable_block_reuse
tensorrt_llm/llmapi/llm_args.py (1)

990-993: Minor: keep constructor kwargs grouped for readability

attention_dp_events_gather_period_ms is conceptually related to the event-buffer knobs that precede it (Lines 950-955). Consider moving the keyword argument next to event_buffer_max_size to keep semantically related parameters together and avoid future merge conflicts when the C++ signature changes.

No functional impact, purely readability.

cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)

928-933: Rename test to be more specific.

The test name KVCacheEvents is too generic. Consider renaming it to KVCacheRemovedEvent to match the naming pattern of other KVCache event tests and clearly indicate what variant is being tested.

-TEST(SerializeUtilsTest, KVCacheEvents)
+TEST(SerializeUtilsTest, KVCacheRemovedEvent)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6135f75 and 0c1a2ea.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (4 hunks)
  • cpp/include/tensorrt_llm/executor/executor.h (6 hunks)
  • cpp/include/tensorrt_llm/executor/serialization.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/utils/mpiTags.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (6 hunks)
  • cpp/tensorrt_llm/executor/executor.cpp (1 hunks)
  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/serialization.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/serializeUtils.h (5 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (3 hunks)
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
  • tensorrt_llm/_utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/unittest/bindings/test_executor_bindings.py (4 hunks)
  • tests/unittest/llmapi/test_llm_args.py (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : parameter names should be consistent across function definition...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Parameter names should be consistent across function definition and corresponding function declarations.

Applied to files:

  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • cpp/tensorrt_llm/executor/serialization.cpp
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py

1332-1332: Avoid equality comparisons to False; use not config.enable_block_reuse: for false checks

Replace with not config.enable_block_reuse

(E712)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (37)
cpp/include/tensorrt_llm/runtime/utils/mpiTags.h (1)

72-74: LGTM! New MPI tags for KV cache event communication.

The addition of kKvCacheEventSize and kKvCacheEvent MPI tags is well-structured and follows the sequential numbering pattern. These tags will enable the MPI-based event exchange mechanism for attention data parallelism.

tests/unittest/llmapi/test_llm_args.py (2)

165-166: LGTM! Test coverage for new attention DP configuration parameter.

The addition of attention_dp_events_gather_period_ms=10 to the test configuration is appropriate and follows the existing testing pattern.


181-181: LGTM! Assertion verifies pybind conversion of new parameter.

The assertion correctly verifies that the new attention_dp_events_gather_period_ms parameter is properly converted in the pybind layer.

cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)

244-244: LGTM! Python binding for attention DP rank attribute.

The addition of the read-only attention_dp_rank attribute to the KVCacheEvent Python binding correctly exposes the new C++ member to Python code, following the established pattern.

cpp/tensorrt_llm/executor/executor.cpp (1)

135-142: LGTM! Constructor updated to support attention DP rank.

The KVCacheEvent constructor correctly adds the optional attentionDpRank parameter and properly initializes the member variable. The use of std::optional<SizeType32> is appropriate since the attention DP rank may not always be available.

tests/unittest/bindings/test_executor_bindings.py (1)

1317-1317: LGTM! Comprehensive test coverage for the new attention DP events field.

The test additions properly cover the new attention_dp_events_gather_period_ms field with default value assertions, direct assignment testing, and kwargs-based construction verification. This follows the established testing patterns for other KvCacheConfig fields.

Also applies to: 1332-1332, 1346-1346, 1360-1361

tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)

199-199: LGTM!

The assignment follows the established pattern and correctly initializes the attention DP events gather period from the configuration.


304-310: LGTM!

The conditional parameter passing logic is well-implemented:

  • Correctly passes attention DP rank and size only when attention DP is enabled
  • Uses appropriate values from the mapping object
  • Always passes the gather period configuration parameter
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (4)

30-30: LGTM!

The constructor parameter is correctly added with appropriate type and consistent naming.


40-40: LGTM!

The member initialization is correct and follows the established pattern.


66-67: LGTM!

The validation check appropriately ensures the gather period is positive with a descriptive error message.


135-138: LGTM!

The getter method is correctly implemented following established patterns with appropriate const-correctness.

cpp/include/tensorrt_llm/executor/executor.h (7)

1004-1004: LGTM: Well-designed constructor parameter addition.

The new attentionDpEventsGatherPeriodMs parameter follows consistent naming conventions and provides a reasonable default value of 5ms for the attention DP events gathering period.


1020-1020: LGTM: Consistent getter method implementation.

The getter method follows the established pattern with appropriate [[nodiscard]] attribute and correct return type.


1035-1035: LGTM: Consistent setter method implementation.

The setter method follows the established naming convention and parameter type consistency.


1711-1715: LGTM: Well-designed constructor overload.

The new constructor overload properly uses member initializer lists and provides flexibility with optional parameters while maintaining backward compatibility with the existing constructor.


1741-1742: LGTM: Appropriate constructor extension.

The constructor modification properly adds the optional attentionDpRank parameter with a sensible default, maintaining backward compatibility.


1750-1751: LGTM: Well-documented member addition.

The new attentionDpRank member variable is properly typed as optional and includes clear documentation about its purpose.


1092-1093: LGTM: Well-documented member variable addition.

The new member variable follows naming conventions and includes clear documentation explaining its purpose for attention DP events polling period.

cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (4)

477-477: LGTM! Properly tests the new attention DP events gather period field.

The test correctly passes the new attentionDpEventsGatherPeriodMs parameter (value 77) and verifies it after serialization/deserialization.

Also applies to: 493-493


850-916: Well-structured comparison helper for KVCacheEvent testing.

The compareKvCacheEvents function provides comprehensive equality checks for all KVCacheEvent variant types and properly includes the new attentionDpRank field. The deep comparison logic for complex nested structures is thorough and correct.


918-925: LGTM! Test properly validates KVCacheCreatedData serialization.


936-976: LGTM! Comprehensive test coverage for KVCache events and UniqueToken.

The tests properly validate serialization/deserialization for:

  • Complex nested structures in KVCacheStoredData
  • Optional fields in KVCacheUpdatedData
  • Basic and vector forms of UniqueToken
cpp/include/tensorrt_llm/executor/serialization.h (1)

305-351: LGTM! Well-structured serialization interface extensions.

The new serialization methods for KVCache events and UniqueToken follow the established pattern consistently:

  • Proper use of [[nodiscard]] attributes for functions that return values
  • Consistent parameter types (const references for serialize, streams for I/O)
  • Complete coverage for all KVCache event variant types
  • Template support for KVCacheEventDiff<T>
cpp/tensorrt_llm/executor/serialization.cpp (5)

26-26: LGTM!

Including <cstddef> explicitly is a good practice for code clarity, even though size_t might be available through other headers.


1166-1170: Well-implemented serialization for attention DP events gathering period.

The new attentionDpEventsGatherPeriodMs field is correctly handled in all serialization methods with consistent ordering, maintaining backward compatibility.

Also applies to: 1188-1188, 1208-1208


2188-2223: Properly implemented KVCacheEvent deque serialization.

The serialization correctly handles the queue size first, enabling proper deserialization of the collection.


2225-2252: Excellent KVCacheEvent serialization with attention DP support.

The implementation correctly includes the new attentionDpRank field as an optional parameter, aligning with the PR's objective to include attention DP rank information with KV cache events.


2254-2417: Comprehensive and consistent serialization for all KVCache event types.

All event data structures (KVCacheCreatedData, KVCacheStoredData, KVCacheRemovedData, KVCacheUpdatedData) follow a consistent serialization pattern with proper size calculation, field ordering, and type safety. The template implementation for KVCacheEventDiff is particularly well-designed for reusability.

cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (3)

39-40: LGTM!

The constructor signature correctly adds the new attention DP parameters with appropriate default values.


66-67: Thread management looks correct

The new exchange thread follows the same pattern as the existing worker thread, which is good for consistency.

Also applies to: 76-78


101-108: Member variables are well-structured

The attention DP configuration members are properly typed with std::optional for values that may not be present, and the documentation clearly explains their purpose.

cpp/tensorrt_llm/executor/serializeUtils.h (5)

125-132: LGTM! Static assertions properly verify serialization support for new types.

The static assertions correctly verify that all new KVCache event types and UniqueToken have the required serializedSize implementation, maintaining consistency with the existing pattern.


311-324: Well-designed variant deserialization helper using modern C++ features.

The deserializeVariantByIndex helper function elegantly handles variant deserialization using fold expressions and template recursion. The forward declaration of deserialize is appropriately placed, and the error handling for invalid indices is correct.


546-577: Deserialization branches correctly implemented for all new types.

All KVCache event types and UniqueToken deserialization branches follow the established pattern, properly delegating to their respective Serialization::deserialize* functions.


614-614: Clean integration of the new variant deserialization helper.

The updated call correctly uses std::make_index_sequence to generate compile-time indices for all variant alternatives, making the deserialization more robust and maintainable.


230-237: Serialize assertions properly added for consistency.

The static assertions for hasSerialize correctly mirror the hasSerializedSize assertions, ensuring complete serialization support for all new KVCache event types and UniqueToken.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14033 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14033 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10582 completed with status: 'FAILURE'

@pcastonguay
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14159 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)

258-275: Remove commented code or provide explanation.

The third request generation and its validation are commented out. Either remove this commented code to keep the test clean, or uncomment it if it's needed for comprehensive testing.

If this code should remain commented, please add a comment explaining why (e.g., "TODO: Enable after fixing issue #XXXX").

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c1a2ea and a73d28d.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • tensorrt_llm/_utils.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml (2 hunks)
  • tests/unittest/llmapi/test_llm_kv_cache_events.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_utils.py
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
🧠 Learnings (5)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tests/integration/test_lists/test-db/l0_dgx_h200.yml (1)

139-139: LGTM! New test added for attention DP KV cache events.

The addition of unittest/llmapi/test_llm_kv_cache_events.py::test_llm_api_attention_dp_kv_events correctly integrates the new test for attention data parallelism KV cache events functionality into the H200 test suite.

tests/unittest/llmapi/test_llm_kv_cache_events.py (3)

4-4: LGTM! Appropriate import for multi-GPU test.

The import of skip_single_gpu decorator is correct for a test that requires tensor parallelism size of 2.


210-216: LGTM! Proper test setup for attention DP functionality.

The test function is correctly decorated with @skip_single_gpu and creates an LLM instance with tensor parallelism size 2 and attention data parallelism enabled, which aligns with the PR objectives.


235-237: No change needed: the Python executor always uses rank 0 for attention DP events.
The tensor_parallel_size=2 setting applies to the C++ backend (covered by test_llm_get_kv_cache_events_tp2), but the Python‐backend implementation only ever emits events with attention_dp_rank == 0. Therefore the existing assertion correctly reflects the behavior of this test.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14159 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10685 completed with status: 'FAILURE'

@pcastonguay pcastonguay requested review from a team as code owners August 5, 2025 20:09
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)

150-227: Consider removing debug print statements.

The check_events helper function provides good code reuse and correctly validates attention DP rank information in events. However, the debug print statements (lines 163, 176, 205) should be removed before merging to production.

Apply this diff to remove debug prints:

-        print("event1:", event)
-        print("event2:", event)
-        print("event3:", event)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a73d28d and 82078cb.

📒 Files selected for processing (6)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (6 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • tests/unittest/llmapi/test_llm_kv_cache_events.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camelcase prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Constants (enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants) are uppercase snakecase with prefix 'k...

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/unittest/llmapi/test_llm_kv_cache_events.py
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
🧠 Learnings (5)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_kv_cache_events.py

245-245: Local variable kvcache_config is assigned to but never used

Remove assignment to unused variable kvcache_config

(F841)


266-266: Undefined name SchedulingParams

(F821)

🔇 Additional comments (4)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

328-330: LGTM! Constructor binding correctly updated for attention DP support.

The nanobind constructor binding properly exposes the new attention data parallelism parameters (attention_dp_rank, attention_dp_size, attention_dp_events_gather_period_ms) to Python, enabling configuration of KV cache event management with attention DP context.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

324-326: LGTM! Pybind11 constructor binding properly updated.

The pybind11 constructor binding is correctly updated to match the nanobind version, maintaining consistency between both Python binding approaches while exposing the new attention DP parameters.

tests/unittest/llmapi/test_llm_kv_cache_events.py (2)

4-4: LGTM! Import addition is appropriate.

The import of skip_single_gpu utility is correctly added for the new multi-GPU test.


230-239: LGTM! Good refactoring to use helper function.

The refactoring to use the check_events helper improves code maintainability and reduces duplication.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)

248-279: Fix unused variable and configuration inconsistency.

This test has the same issues identified in previous reviews:

  1. kvcache_config is defined but never used (lines 252-257)
  2. The test uses global_kvcache_config instead of the locally defined config (line 262)
  3. This creates configuration inconsistency and triggers static analysis warnings

Apply this diff to fix the issues:

-    kvcache_config = KvCacheConfig(free_gpu_memory_fraction=0.4,
-                                   event_buffer_max_size=1024,
-                                   attention_dp_events_gather_period_ms=10,
-                                   enable_block_reuse=True,
-                                   onboard_blocks=True,
-                                   max_tokens=256)
+    attention_dp_kvcache_config = KvCacheConfig(free_gpu_memory_fraction=0.4,
+                                                event_buffer_max_size=1024,
+                                                attention_dp_events_gather_period_ms=10,
+                                                enable_block_reuse=True,
+                                                onboard_blocks=True,
+                                                max_tokens=256)

     llm = LLM(model=llama_model_path,
               tensor_parallel_size=2,
               enable_attention_dp=True,
-              kv_cache_config=global_kvcache_config,
+              kv_cache_config=attention_dp_kvcache_config,
               enable_autotuner=False)
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)

152-232: Well-structured helper function with minor performance concern.

The check_events helper function effectively extracts common validation logic and properly handles both regular and attention DP scenarios. The event validation logic is comprehensive and correctly verifies event types, IDs, and block counts.

However, consider reducing the sleep duration from 1 second to improve test performance:

-    time.sleep(1)
+    time.sleep(0.1)  # Brief sleep to allow event gathering
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82078cb and 717e9cb.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • tests/unittest/llmapi/test_llm_kv_cache_events.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
🧠 Learnings (3)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T00:54:56.038Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tests/unittest/llmapi/test_llm_kv_cache_events.py
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_kv_cache_events.py

252-252: Local variable kvcache_config is assigned to but never used

Remove assignment to unused variable kvcache_config

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/llmapi/test_llm_kv_cache_events.py (2)

4-5: LGTM! Imports correctly added.

The new imports are properly added to support the test enhancements:

  • pytest for the threadleak decorator
  • skip_single_gpu for multi-GPU test requirements
  • SchedulingParams addresses the missing import from previous reviews

Also applies to: 15-15


234-246: Excellent refactoring!

The simplified test effectively leverages the new check_events helper function, reducing code duplication while maintaining the same test coverage. This follows the DRY principle and improves maintainability.

@pcastonguay pcastonguay requested review from Tabrizian and Shunkangz and removed request for Wanli-Jiang and tomeras91 August 6, 2025 15:52
@pcastonguay
Copy link
Collaborator Author

@Superjomn could you review the changes llm_args.py? Apparently I need an approval from someon on @NVIDIA/trt-llm-llmapi-devs. Thanks.

Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
@pcastonguay pcastonguay force-pushed the attention_dp_events branch from d6a384d to caa37e8 Compare August 6, 2025 23:15
@pcastonguay
Copy link
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14344 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14311 [ run ] completed with state ABORTED

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6a384d and caa37e8.

📒 Files selected for processing (22)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (4 hunks)
  • cpp/include/tensorrt_llm/executor/executor.h (6 hunks)
  • cpp/include/tensorrt_llm/executor/serialization.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/utils/mpiTags.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (6 hunks)
  • cpp/tensorrt_llm/executor/executor.cpp (1 hunks)
  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/serialization.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/serializeUtils.h (5 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (3 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
  • tensorrt_llm/_utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml (1 hunks)
  • tests/unittest/bindings/test_executor_bindings.py (4 hunks)
  • tests/unittest/llmapi/test_llm_args.py (2 hunks)
  • tests/unittest/llmapi/test_llm_kv_cache_events.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (18)
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml
  • tests/unittest/llmapi/test_llm_args.py
  • cpp/tensorrt_llm/executor/executor.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp
  • cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp
  • tensorrt_llm/llmapi/llm_args.py
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_utils.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
  • tests/unittest/llmapi/test_llm_kv_cache_events.py
  • cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/include/tensorrt_llm/executor/serialization.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/executor/serializeUtils.h
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/executor/serializeUtils.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/executor/serializeUtils.h
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
  • tests/unittest/bindings/test_executor_bindings.py
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/bindings/test_executor_bindings.py
🧠 Learnings (2)
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • cpp/tensorrt_llm/executor/serialization.cpp
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • cpp/tensorrt_llm/executor/serialization.cpp
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py

1332-1332: Avoid equality comparisons to False; use not config.enable_block_reuse: for false checks

Replace with not config.enable_block_reuse

(E712)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (18)
tests/unittest/bindings/test_executor_bindings.py (4)

1317-1317: Good test coverage for the new field default value.

The assertion correctly verifies that the new attention_dp_events_gather_period_ms field defaults to 5, following the established testing pattern in this function.


1332-1332: LGTM! Setter test is correctly implemented.

The assignment tests the setter functionality for the new field. The static analysis suggestion to use not instead of == False doesn't apply here since this is a test context where explicit comparisons are appropriate for clarity.


1346-1346: Good verification of setter functionality.

The assertion properly verifies that the attention_dp_events_gather_period_ms field was correctly set to 10, completing the setter/getter test cycle.


1360-1361: Comprehensive constructor testing with keyword arguments.

Adding the new field to the kwargs dictionary ensures that attention_dp_events_gather_period_ms can be properly initialized through constructor keyword arguments, providing complete test coverage for the field initialization patterns.

cpp/tensorrt_llm/executor/serializeUtils.h (5)

125-132: LGTM!

The new static assertions for KVCache event types and UniqueToken follow the established pattern and are necessary to ensure these types have proper serialization traits.


230-237: LGTM!

The static assertions for serialize traits complement the serializedSize assertions and ensure complete serialization support for the new KVCache event types and UniqueToken.


310-324: Well-designed generic variant deserialization helper.

The deserializeVariantByIndex function uses modern C++17 features (fold expressions, index sequences) to provide a clean, generic solution for deserializing std::variant types by their active index. The error handling is appropriate and the implementation is efficient.


546-577: Comprehensive deserialization support for new KVCache types.

The new deserialize branches properly extend support for all KVCache event-related types and UniqueToken, following the established pattern of delegating to the Serialization namespace functions.


614-614: Improved variant deserialization using the new generic helper.

The update to use deserializeVariantByIndex with std::make_index_sequence provides a more robust and scalable approach to variant deserialization compared to the previous implementation.

cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (6)

477-477: Correctly updated constructor call for new attention DP parameter.

The KvCacheConfig constructor call properly includes the new attentionDpEventsGatherPeriodMs parameter with test value 77.


493-493: Proper verification of new attention DP parameter.

The test correctly verifies that the attentionDpEventsGatherPeriodMs parameter is preserved through serialization/deserialization.


850-916: Comprehensive and well-structured event comparison function.

The compareKvCacheEvents function provides thorough comparison logic for all KVCacheEvent variant types, properly handling nested structures, optional fields, and using appropriate variant access methods. The implementation covers all the complexity of the KVCacheEvent data structure.


918-950: Comprehensive test coverage for KVCache events deque serialization.

The test properly covers all four KVCacheEvent variant types in a deque container, using meaningful test data and thorough verification through the comparison helper function.


952-990: Thorough individual test coverage for all KVCache event variant types.

The individual tests for each KVCacheEvent variant type (Created, Removed, Stored, Updated) use appropriate test data and provide comprehensive verification through the comparison helper function. The test data covers the complexity of nested structures appropriately.


992-1010: Adequate test coverage for UniqueToken serialization.

The tests for UniqueToken and vector of UniqueToken provide proper verification of the simple two-field structure, covering both individual object and container serialization scenarios.

cpp/tensorrt_llm/executor/serialization.cpp (3)

26-26: LGTM!

Proper use of C++ standard header <cstddef> instead of C-style <stddef.h>, following the coding guidelines.


1166-1170: LGTM!

The attentionDpEventsGatherPeriodMs field is correctly integrated into the KvCacheConfig serialization/deserialization methods, maintaining consistency with the existing pattern.

Also applies to: 1188-1188, 1208-1208


2188-2417: Well-structured KVCache event serialization implementation!

The serialization/deserialization functions for KVCache events and related data structures are comprehensive and follow established patterns consistently. All variant types, optional fields, and the new attentionDpRank field are properly handled.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14344 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10838 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@MartinMarciniszyn MartinMarciniszyn merged commit 453a06e into NVIDIA:main Aug 7, 2025
4 checks passed
Shunkangz pushed a commit to hcyezhang/TensorRT-LLM that referenced this pull request Aug 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants