-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events #6563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events #6563
Conversation
📝 WalkthroughWalkthroughSupport for attention data parallelism was added to the KVCache event management system. This includes new constructor parameters, member variables, and logic for handling attention DP ranks and sizes in both C++ and Python. Event serialization, event queue management, distributed event exchange via MPI, and Python bindings were updated to propagate and expose the new attention DP context. Changes
Sequence Diagram(s)sequenceDiagram
participant PythonApp
participant KVCacheManager
participant KVCacheEventManagerCpp
participant MPI
participant KVCacheEvent
PythonApp->>KVCacheManager: create (with mapping)
KVCacheManager->>KVCacheEventManagerCpp: initialize(max_kv_event_entries, attention_dp_rank, attention_dp_size, pp_size)
KVCacheEventManagerCpp->>KVCacheEventManagerCpp: validate and store DP parameters
KVCacheEventManagerCpp->>KVCacheEventManagerCpp: start exchangeAttentionDpThread()
PythonApp->>KVCacheManager: trigger event
KVCacheManager->>KVCacheEventManagerCpp: enqueueCreatedEvent / enqueueStoredEvent / enqueueUpdatedEvent(...)
KVCacheEventManagerCpp->>KVCacheEvent: create(event_id, data, window_size, attention_dp_rank)
KVCacheEvent-->>KVCacheEventManagerCpp: event instance
KVCacheManager->>KVCacheEventManagerCpp: flush()
KVCacheEventManagerCpp->>KVCacheEventManagerCpp: notify worker thread
alt attention DP enabled and size > 1
KVCacheEventManagerCpp->>MPI: non-blocking probe for incoming events (rank 0)
KVCacheEventManagerCpp->>MPI: async send serialized events (non-zero ranks)
MPI->>KVCacheEventManagerCpp: receive serialized events (rank 0)
KVCacheEventManagerCpp->>KVCacheEventManagerCpp: deserialize and merge events
KVCacheEventManagerCpp->>KVCacheEventManagerCpp: notify worker thread after merge
end
KVCacheEventManagerCpp->>KVCacheEventManagerCpp: worker thread processes event queue
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35 minutes Suggested labels
Suggested reviewers
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)
91-103
: Add missingmAttentionDpRank
to removed‐event enqueue for consistencyThe other
enqueue*Event
methods all passmAttentionDpRank
into theKVCacheEvent
constructor, but in enqueueRemovedEvent (line 101) it uses the three-argument overload and drops the rank info (defaulting tonullopt
). To keep all events carrying the same metadata, includemAttentionDpRank
here as well.• File: cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
• Line: 101Suggested change:
- enqueueEvent({mEventId++, tle::KVCacheRemovedData{{block->getHash()}}, windowSize}); + enqueueEvent({mEventId++, tle::KVCacheRemovedData{{block->getHash()}}, windowSize, mAttentionDpRank});
🧹 Nitpick comments (2)
cpp/include/tensorrt_llm/executor/executor.h (2)
1730-1735
: Consider passingattentionDpRank
by const-reference and documenting it in the Doxygen header.
std::optional<SizeType32>
is small, but by convention optional parameters are usually taken asconst&
to avoid an extra move/copy and to keep the signature uniform with the rest of the header.
While you are touching the constructor, please also extend the Doxygen comment that precedes the struct (or add one above the ctor) so that the new field is visible in generated docs.- KVCacheEvent(IdType eventId, KVCacheEventData data, SizeType32 windowSize, - std::optional<SizeType32> attentionDpRank = std::nullopt) + KVCacheEvent(IdType eventId, KVCacheEventData data, SizeType32 windowSize, + std::optional<SizeType32> const& attentionDpRank = std::nullopt)This is a low-impact change, so feel free to defer if binary size or compile time is a larger concern.
1743-1744
: Add a default member-initializer to keep aggregate defaults predictable.If somebody default-constructs a
KVCacheEvent
via aggregate initialization (e.g.KVCacheEvent e{}
) the newattentionDpRank
will be value-initialised, but making that explicit improves readability and prevents future regressions when additional constructors are added.- std::optional<SizeType32> attentionDpRank; + std::optional<SizeType32> attentionDpRank{}; //!< Attention DP rank (nullopt if not applicable)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
(2 hunks)cpp/include/tensorrt_llm/executor/executor.h
(1 hunks)cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
(5 hunks)cpp/tensorrt_llm/executor/executor.cpp
(1 hunks)cpp/tensorrt_llm/pybind/executor/bindings.cpp
(1 hunks)tensorrt_llm/_torch/pyexecutor/resource_manager.py
(1 hunks)tensorrt_llm/_utils.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...
Files:
cpp/tensorrt_llm/pybind/executor/bindings.cpp
cpp/tensorrt_llm/executor/executor.cpp
cpp/include/tensorrt_llm/executor/executor.h
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/pybind/executor/bindings.cpp
cpp/tensorrt_llm/executor/executor.cpp
tensorrt_llm/_utils.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
cpp/include/tensorrt_llm/executor/executor.h
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_utils.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/executor/executor.h
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)
243-244
: LGTM! Property binding follows established patterns.The new
attention_dp_rank
read-only property correctly exposes the C++attentionDpRank
member using appropriate Python naming conventions.cpp/tensorrt_llm/executor/executor.cpp (1)
135-142
: LGTM! Constructor correctly extended with optional parameter.The constructor properly adds the new
attentionDpRank
optional parameter and includes it in the initializer list, maintaining backwards compatibility while following C++ coding conventions.tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
302-309
: LGTM! Attention DP parameters correctly passed to event manager.The changes properly propagate attention data parallelism configuration to the KV cache event manager. When attention DP is enabled, the rank and world size are correctly mapped to provide the necessary context for event handling across distributed ranks.
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (1)
39-41
: Clarify the usage of enableAttentionDp parameter.The constructor accepts an
enableAttentionDp
parameter but it doesn't appear to be used to initializemEnableAttentionDp
. Consider either:
- Using the parameter to initialize the member variable, or
- Removing the parameter if
mEnableAttentionDp
is meant to be derived from the presence ofattentionDpRank
/attentionDpSize
Also applies to: 98-101
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)
27-34
: Constructor signature still has the same issues as previously identified.The constructor implementation still has the same problems that were flagged in previous reviews:
- Line 28: Parameter type
std::optional<attentionDpSize>
should bestd::optional<SizeType32>
- Missing
enableAttentionDp
parameter that should be declared in the header
🧹 Nitpick comments (2)
cpp/tensorrt_llm/executor/serialization.cpp (2)
2185-2185
: Fix typo in commentThe comment has a minor typo - it should be "KVCacheEvent deque" not "KVCacheEvents deque" to match the actual type.
-// KVCacheEvents deque +// KVCacheEvent deque
2324-2324
: Fix capitalization in commentThe comment should use consistent capitalization to match the actual type name.
-// KVcacheRemovedData +// KVCacheRemovedData
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cpp/include/tensorrt_llm/executor/executor.h
(2 hunks)cpp/include/tensorrt_llm/executor/serialization.h
(1 hunks)cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
(1 hunks)cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
(5 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(2 hunks)cpp/tensorrt_llm/executor/serializeUtils.h
(3 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/include/tensorrt_llm/executor/executor.h
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope and function-...
Files:
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/include/tensorrt_llm/executor/serialization.h
cpp/tensorrt_llm/executor/serializeUtils.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/include/tensorrt_llm/executor/serialization.h
cpp/tensorrt_llm/executor/serializeUtils.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/executor/serialization.h
cpp/tensorrt_llm/executor/serializeUtils.h
🧠 Learnings (1)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
cpp/tensorrt_llm/executor/serialization.cpp
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)
cpp/tensorrt_llm/executor/serialization.cpp (18)
serialize
(44-47)serialize
(44-44)serialize
(89-113)serialize
(89-89)serialize
(173-194)serialize
(173-173)serialize
(235-244)serialize
(235-235)serialize
(267-271)serialize
(267-267)serialize
(292-298)serialize
(292-292)serialize
(318-322)serialize
(318-318)serialize
(341-346)serialize
(341-341)deserializeKVCacheEvents
(2209-2220)deserializeKVCacheEvents
(2209-2209)
cpp/tensorrt_llm/executor/serialization.cpp (2)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (6)
serialize
(54-60)serialize
(54-54)serializedSize
(71-76)serializedSize
(71-71)deserialize
(62-69)deserialize
(62-62)cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2)
serializedSize
(147-165)deserialize
(127-145)
🔇 Additional comments (15)
cpp/tensorrt_llm/executor/serializeUtils.h (3)
125-125
: LGTM!The static assertion correctly verifies that
KVCacheEvent
supports serialized size calculation, following the established pattern for other types in the codebase.
223-223
: LGTM!The static assertion correctly verifies that
KVCacheEvent
supports serialization, maintaining consistency with the existing compile-time verification pattern.
516-519
: LGTM!The deserialization branch for
KVCacheEvent
is correctly implemented, following the established pattern of usingconstexpr if
with type checking and delegating to the appropriateSerialization
class method.cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (5)
36-49
: LGTM!The validation logic for attention DP parameters is comprehensive and well-designed:
- Ensures consistency between rank and size parameters
- Properly validates constraints (PP > 1 not supported with attention DP)
- Provides informative error messages
- Covers both directions of the relationship between rank and size parameters
51-52
: LGTM!The thread initialization is properly implemented, following the same pattern as the existing worker thread. The new attention DP exchange thread is appropriately initialized for handling distributed event exchange.
67-67
: LGTM!The event enqueuing methods are consistently updated to include attention DP rank information, enabling proper event tagging for distributed processing across all event types.
Also applies to: 87-90, 109-109
139-140
: LGTM!The explicit notification to the worker thread after flushing events is a good improvement that ensures responsive event processing. The comment clearly explains the rationale.
194-231
: LGTM!The worker method refactoring is a significant improvement that adds robust event queue management:
- Proper condition variable-based waiting for pending events
- Clear size-based eviction strategy (oldest events removed first)
- Appropriate logging when events are discarded due to queue limits
- Proper notification of waiting threads
The implementation correctly handles the distributed event processing requirements.
cpp/include/tensorrt_llm/executor/serialization.h (1)
305-342
: LGTM!The serialization method declarations for
KVCacheEvent
and related types are comprehensive and well-designed:
- Follow established naming conventions and patterns
- Proper const-correctness and
[[nodiscard]]
usage- Complete coverage of all KVCacheEvent data structures
- Consistent method signatures with the existing serialization framework
- Include both individual object serialization and batch operations for deques
These declarations properly extend the serialization infrastructure to support the new attention DP functionality.
cpp/tensorrt_llm/executor/serialization.cpp (6)
2223-2249
: Correctly implements KVCacheEvent serialization with attention DP rankThe implementation properly handles the new
attentionDpRank
field as an optional value, maintaining consistency with the serialization pattern used throughout the file.
2252-2268
: Simple and correct implementation
2271-2290
: Correct implementation for KVCacheStoredData
2293-2322
: Proper serialization of KVCacheStoredBlockData
2346-2365
: Correct template specialization for KVCacheEventDiff
2368-2390
: Well-structured serialization for KVCacheUpdatedData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (3)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
108-127
: Critical: Update pickle serialization to include the new attention_dp_events_gather_period_ms field.The pickle state methods still expect 13 elements but should handle 14 with the new field. This will cause serialization failures.
Apply this diff to fix the serialization:
auto kvCacheConfigGetstate = [](tle::KvCacheConfig const& self) { return nb::make_tuple(self.getEnableBlockReuse(), self.getMaxTokens(), self.getMaxAttentionWindowVec(), self.getSinkTokenLength(), self.getFreeGpuMemoryFraction(), self.getHostCacheSize(), self.getOnboardBlocks(), self.getCrossKvCacheFraction(), self.getSecondaryOffloadMinPriority(), - self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm()); + self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm(), + self.getAttentionDpEventsGatherPeriodMs()); }; auto kvCacheConfigSetstate = [](tle::KvCacheConfig& self, nb::tuple const& state) { - if (state.size() != 13) + if (state.size() != 14) { throw std::runtime_error("Invalid state!"); } new (&self) tle::KvCacheConfig(nb::cast<bool>(state[0]), nb::cast<std::optional<SizeType32>>(state[1]), nb::cast<std::optional<std::vector<SizeType32>>>(state[2]), nb::cast<std::optional<SizeType32>>(state[3]), nb::cast<std::optional<float>>(state[4]), nb::cast<std::optional<size_t>>(state[5]), nb::cast<bool>(state[6]), nb::cast<std::optional<float>>(state[7]), nb::cast<std::optional<tle::RetentionPriority>>(state[8]), nb::cast<size_t>(state[9]), - nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12])); + nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12]), + nb::cast<SizeType32>(state[13])); };cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)
196-196
: Fix function name typoThere's a typo in the function call - it should be
compareStaticBatchingStats
instead ofcompareStaticBatching
.- compareStaticBatching(val, val2); + compareStaticBatchingStats(val, val2);cpp/tensorrt_llm/executor/serialization.cpp (1)
1191-1209
: Missing size calculation forattentionDpEventsGatherPeriodMs
The
serializedSize
function is missing the size calculation for the newattentionDpEventsGatherPeriodMs
field, which could lead to buffer size miscalculation.Apply this fix:
size_t Serialization::serializedSize(KvCacheConfig const& kvCacheConfig) { size_t totalSize = 0; totalSize += su::serializedSize(kvCacheConfig.getEnableBlockReuse()); totalSize += su::serializedSize(kvCacheConfig.getEnablePartialReuse()); totalSize += su::serializedSize(kvCacheConfig.getCopyOnPartialReuse()); totalSize += su::serializedSize(kvCacheConfig.getMaxTokens()); totalSize += su::serializedSize(kvCacheConfig.getMaxAttentionWindowVec()); totalSize += su::serializedSize(kvCacheConfig.getSinkTokenLength()); totalSize += su::serializedSize(kvCacheConfig.getFreeGpuMemoryFraction()); totalSize += su::serializedSize(kvCacheConfig.getHostCacheSize()); totalSize += su::serializedSize(kvCacheConfig.getOnboardBlocks()); totalSize += su::serializedSize(kvCacheConfig.getCrossKvCacheFraction()); totalSize += su::serializedSize(kvCacheConfig.getSecondaryOffloadMinPriority()); totalSize += su::serializedSize(kvCacheConfig.getEventBufferMaxSize()); totalSize += su::serializedSize(kvCacheConfig.getUseUvm()); + totalSize += su::serializedSize(kvCacheConfig.getAttentionDpEventsGatherPeriodMs()); return totalSize; }
🧹 Nitpick comments (5)
tests/unittest/bindings/test_executor_bindings.py (2)
1315-1318
: Avoid hard-coding the new default – derive it from the implementation
5
is duplicated knowledge: if the implementation later changes its default period (e.g. to tune performance) this test would fail for the wrong reason. Make the assertion self-describing by asking the class for its default instead of inlining the literal.- assert config.attention_dp_events_gather_period_ms == 5 + default_period = trtllm.KvCacheConfig().attention_dp_events_gather_period_ms + assert config.attention_dp_events_gather_period_ms == default_period
1331-1346
: Style: preferis False
/not …
over== False
Several assertions in this block use equality against the boolean literal. PEP 8 (and Ruff rule E712) recommend
assert not config.enable_block_reuse
orassert config.enable_block_reuse is False
for better readability and to avoid subtle bugs with objects that override__eq__
.Updating the handful of new/modified lines while you touch this area will silence the linter without a large diff:
- assert config.enable_block_reuse == False + assert not config.enable_block_reuse(Apply the same pattern to the other boolean comparisons in this stanza if convenient.)
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (1)
1-15
: Update copyright year to 2025 for consistency.Other files in this PR have updated the copyright to 2025, but this file still shows 2024.
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)
927-933
: Rename test to be more descriptiveThe test name "KVCacheEvents" is too generic. Since this test specifically validates KVCacheRemovedData, consider renaming it to "KVCacheRemovedEvent" for consistency with other test names.
-TEST(SerializeUtilsTest, KVCacheEvents) +TEST(SerializeUtilsTest, KVCacheRemovedEvent)cpp/include/tensorrt_llm/executor/serialization.h (1)
342-342
: Fix comment typoThe comment says "KVCacheUpdateData" but the actual type is "KVCacheUpdatedData".
- // KVCacheUpdateData + // KVCacheUpdatedData
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
(4 hunks)cpp/include/tensorrt_llm/executor/executor.h
(6 hunks)cpp/include/tensorrt_llm/executor/serialization.h
(1 hunks)cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
(6 hunks)cpp/tensorrt_llm/executor/kvCacheConfig.cpp
(5 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(4 hunks)cpp/tensorrt_llm/executor/serializeUtils.h
(3 hunks)cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
(2 hunks)cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
(3 hunks)cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
(3 hunks)tensorrt_llm/_torch/pyexecutor/resource_manager.py
(2 hunks)tensorrt_llm/llmapi/llm_args.py
(2 hunks)tests/unittest/bindings/test_executor_bindings.py
(4 hunks)tests/unittest/llmapi/test_llm_args.py
(2 hunks)
✅ Files skipped from review due to trivial changes (2)
- tests/unittest/llmapi/test_llm_args.py
- tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
- cpp/include/tensorrt_llm/executor/executor.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope and function-...
Files:
cpp/tensorrt_llm/executor/kvCacheConfig.cpp
cpp/tensorrt_llm/executor/serializeUtils.h
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/include/tensorrt_llm/executor/serialization.h
cpp/tensorrt_llm/executor/serialization.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/executor/kvCacheConfig.cpp
cpp/tensorrt_llm/executor/serializeUtils.h
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/include/tensorrt_llm/executor/serialization.h
cpp/tensorrt_llm/executor/serialization.cpp
tests/unittest/bindings/test_executor_bindings.py
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/executor/serializeUtils.h
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
cpp/include/tensorrt_llm/executor/serialization.h
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/bindings/test_executor_bindings.py
🧠 Learnings (1)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
cpp/tensorrt_llm/executor/serialization.cpp
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
1332-1332: Avoid equality comparisons to False
; use not config.enable_block_reuse:
for false checks
Replace with not config.enable_block_reuse
(E712)
🔇 Additional comments (16)
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (2)
30-68
: LGTM! Constructor implementation is clean and follows established patterns.The new
attentionDpEventsGatherPeriodMs
parameter is properly validated and initialized.
135-139
: LGTM! Getter implementation is correct.cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
128-164
: LGTM! Nanobind property and constructor bindings are properly implemented.The new parameter is correctly added as a keyword-only argument with appropriate default value.
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (2)
39-67
: LGTM! Clean API design with backward compatibility.The constructor uses optional parameters with sensible defaults, and the new exchange thread method is appropriately placed.
76-108
: LGTM! Well-structured member variables with clear documentation.The new members are properly organized and documented for the attention DP feature.
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2)
101-121
: LGTM! Pickle serialization correctly updated for the new field.The state size check and tuple handling are properly updated to accommodate the new
attention_dp_events_gather_period_ms
field.
122-159
: LGTM! Property bindings and constructor are properly implemented.The new parameter integrates seamlessly with the existing API.
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (5)
918-925
: LGTM!The test correctly validates serialization/deserialization of KVCacheEvent with KVCacheCreatedData.
935-945
: LGTM!The test thoroughly validates serialization of KVCacheStoredData with nested block data and token structures.
947-956
: LGTM!The test properly validates serialization of KVCacheUpdatedData with optional cache level and priority diffs.
958-976
: LGTM!Both UniqueToken tests correctly validate serialization of individual tokens and token vectors.
477-477
: LGTM!The test correctly validates the new attentionDpEventsGatherPeriodMs parameter in KvCacheConfig.
Also applies to: 493-493
cpp/include/tensorrt_llm/executor/serialization.h (1)
305-351
: LGTM!The serialization method declarations for KVCache events and UniqueToken follow the established patterns and are properly marked with [[nodiscard]] attributes where appropriate.
cpp/tensorrt_llm/executor/serializeUtils.h (3)
125-133
: LGTM!Static assertions correctly validate serializedSize support for all KVCache event types and UniqueToken.
230-237
: LGTM!Static assertions correctly validate serialize support for all KVCache event types and UniqueToken.
530-561
: LGTM!Deserialization branches correctly handle all KVCache event types and UniqueToken, properly delegating to the corresponding Serialization methods.
431ebff
to
1332536
Compare
/bot run |
PR_Github #14028 [ run ] triggered by Bot |
PR_Github #14028 [ run ] completed with state |
/bot run |
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
108-127
: Fix missing serialization support forattention_dp_events_gather_period_ms
.The serialization/deserialization methods have not been updated to handle the new
attention_dp_events_gather_period_ms
parameter, which will cause runtime failures during pickle operations.Apply this diff to fix the serialization:
auto kvCacheConfigGetstate = [](tle::KvCacheConfig const& self) { return nb::make_tuple(self.getEnableBlockReuse(), self.getMaxTokens(), self.getMaxAttentionWindowVec(), self.getSinkTokenLength(), self.getFreeGpuMemoryFraction(), self.getHostCacheSize(), self.getOnboardBlocks(), self.getCrossKvCacheFraction(), self.getSecondaryOffloadMinPriority(), - self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm()); + self.getEventBufferMaxSize(), self.getEnablePartialReuse(), self.getCopyOnPartialReuse(), self.getUseUvm(), + self.getAttentionDpEventsGatherPeriodMs()); }; auto kvCacheConfigSetstate = [](tle::KvCacheConfig& self, nb::tuple const& state) { - if (state.size() != 13) + if (state.size() != 14) { throw std::runtime_error("Invalid state!"); } new (&self) tle::KvCacheConfig(nb::cast<bool>(state[0]), nb::cast<std::optional<SizeType32>>(state[1]), nb::cast<std::optional<std::vector<SizeType32>>>(state[2]), nb::cast<std::optional<SizeType32>>(state[3]), nb::cast<std::optional<float>>(state[4]), nb::cast<std::optional<size_t>>(state[5]), nb::cast<bool>(state[6]), nb::cast<std::optional<float>>(state[7]), nb::cast<std::optional<tle::RetentionPriority>>(state[8]), nb::cast<size_t>(state[9]), - nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12])); + nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12]), + nb::cast<SizeType32>(state[13])); };cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
109-121
: Fix inconsistent tuple size check in deserialization.The
kvCacheConfigSetstate
function checks for 13 elements but actually uses 14 elements from the tuple, causing a runtime error.Apply this diff to fix the size check:
auto kvCacheConfigSetstate = [](py::tuple const& state) { - if (state.size() != 13) + if (state.size() != 14) { throw std::runtime_error("Invalid state!"); }
♻️ Duplicate comments (4)
tensorrt_llm/_utils.py (1)
1020-1020
: Fix the incorrect serialization of attention_dp_rank field.The
attention_dp_rank
field is being serialized usingevent_serialize_func
, which is designed for complex event data objects. Sinceattention_dp_rank
is likely anOptional[int]
, it should be serialized directly rather than passed through the event serialization function.Apply this fix:
- "attention_dp_rank": event_serialize_func(event.attention_dp_rank), + "attention_dp_rank": event.attention_dp_rank,cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1)
216-220
: Fix parameter name inconsistency in setter method.The parameter name uses "Poll" while the member variable and other methods use "Gather". This inconsistency could cause confusion.
Apply this diff to fix the naming inconsistency:
-void KvCacheConfig::setAttentionDpEventsGatherPeriodMs(SizeType32 attentionDpEventPollPeriodMs) +void KvCacheConfig::setAttentionDpEventsGatherPeriodMs(SizeType32 attentionDpEventsGatherPeriodMs) { - TLLM_CHECK(attentionDpEventPollPeriodMs > 0); - mAttentionDpEventsGatherPeriodMs = attentionDpEventPollPeriodMs; + TLLM_CHECK(attentionDpEventsGatherPeriodMs > 0); + mAttentionDpEventsGatherPeriodMs = attentionDpEventsGatherPeriodMs; }cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp (1)
170-173
: Simplify mutex locking withstd::lock_guard
.The current
unique_lock
usage can be simplified since there's no need for the advanced features ofunique_lock
.Apply this diff to use
lock_guard
:- std::unique_lock<std::mutex> lck(mEventsMutex); + std::lock_guard<std::mutex> lck(mEventsMutex); serializedEvents = executor::Serialization::serialize(mEvents); mEvents.clear();And:
- std::unique_lock<std::mutex> lck(mEventsMutex); + std::lock_guard<std::mutex> lck(mEventsMutex); mEvents.insert(mEvents.end(), rankEvents.begin(), rankEvents.end()); mEmptyCV.notify_one();Also applies to: 200-203
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (1)
65-65
: Fix grammatical issue in comment- // Thread which exchange events if attentionDP is enabled + // Thread which exchanges events if attentionDP is enabled
🧹 Nitpick comments (3)
tests/unittest/bindings/test_executor_bindings.py (1)
1333-1333
: Fix boolean comparison style issue.Use
not config.enable_block_reuse
instead of== False
for better Python style.- assert config.enable_block_reuse == False + assert not config.enable_block_reusetensorrt_llm/llmapi/llm_args.py (1)
990-993
: Minor: keep constructor kwargs grouped for readability
attention_dp_events_gather_period_ms
is conceptually related to the event-buffer knobs that precede it (Lines 950-955). Consider moving the keyword argument next toevent_buffer_max_size
to keep semantically related parameters together and avoid future merge conflicts when the C++ signature changes.No functional impact, purely readability.
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)
928-933
: Rename test to be more specific.The test name
KVCacheEvents
is too generic. Consider renaming it toKVCacheRemovedEvent
to match the naming pattern of other KVCache event tests and clearly indicate what variant is being tested.-TEST(SerializeUtilsTest, KVCacheEvents) +TEST(SerializeUtilsTest, KVCacheRemovedEvent)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
(4 hunks)cpp/include/tensorrt_llm/executor/executor.h
(6 hunks)cpp/include/tensorrt_llm/executor/serialization.h
(1 hunks)cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
(1 hunks)cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
(6 hunks)cpp/tensorrt_llm/executor/executor.cpp
(1 hunks)cpp/tensorrt_llm/executor/kvCacheConfig.cpp
(5 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(5 hunks)cpp/tensorrt_llm/executor/serializeUtils.h
(5 hunks)cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
(2 hunks)cpp/tensorrt_llm/pybind/executor/bindings.cpp
(1 hunks)cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
(3 hunks)cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
(3 hunks)tensorrt_llm/_torch/pyexecutor/resource_manager.py
(2 hunks)tensorrt_llm/_utils.py
(1 hunks)tensorrt_llm/llmapi/llm_args.py
(2 hunks)tests/unittest/bindings/test_executor_bindings.py
(4 hunks)tests/unittest/llmapi/test_llm_args.py
(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : parameter names should be consistent across function definition...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Parameter names should be consistent across function definition and corresponding function declarations.
Applied to files:
cpp/tensorrt_llm/executor/kvCacheConfig.cpp
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
cpp/tensorrt_llm/executor/serialization.cpp
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
1332-1332: Avoid equality comparisons to False
; use not config.enable_block_reuse:
for false checks
Replace with not config.enable_block_reuse
(E712)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (37)
cpp/include/tensorrt_llm/runtime/utils/mpiTags.h (1)
72-74
: LGTM! New MPI tags for KV cache event communication.The addition of
kKvCacheEventSize
andkKvCacheEvent
MPI tags is well-structured and follows the sequential numbering pattern. These tags will enable the MPI-based event exchange mechanism for attention data parallelism.tests/unittest/llmapi/test_llm_args.py (2)
165-166
: LGTM! Test coverage for new attention DP configuration parameter.The addition of
attention_dp_events_gather_period_ms=10
to the test configuration is appropriate and follows the existing testing pattern.
181-181
: LGTM! Assertion verifies pybind conversion of new parameter.The assertion correctly verifies that the new
attention_dp_events_gather_period_ms
parameter is properly converted in the pybind layer.cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)
244-244
: LGTM! Python binding for attention DP rank attribute.The addition of the read-only
attention_dp_rank
attribute to theKVCacheEvent
Python binding correctly exposes the new C++ member to Python code, following the established pattern.cpp/tensorrt_llm/executor/executor.cpp (1)
135-142
: LGTM! Constructor updated to support attention DP rank.The
KVCacheEvent
constructor correctly adds the optionalattentionDpRank
parameter and properly initializes the member variable. The use ofstd::optional<SizeType32>
is appropriate since the attention DP rank may not always be available.tests/unittest/bindings/test_executor_bindings.py (1)
1317-1317
: LGTM! Comprehensive test coverage for the new attention DP events field.The test additions properly cover the new
attention_dp_events_gather_period_ms
field with default value assertions, direct assignment testing, and kwargs-based construction verification. This follows the established testing patterns for other KvCacheConfig fields.Also applies to: 1332-1332, 1346-1346, 1360-1361
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
199-199
: LGTM!The assignment follows the established pattern and correctly initializes the attention DP events gather period from the configuration.
304-310
: LGTM!The conditional parameter passing logic is well-implemented:
- Correctly passes attention DP rank and size only when attention DP is enabled
- Uses appropriate values from the mapping object
- Always passes the gather period configuration parameter
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (4)
30-30
: LGTM!The constructor parameter is correctly added with appropriate type and consistent naming.
40-40
: LGTM!The member initialization is correct and follows the established pattern.
66-67
: LGTM!The validation check appropriately ensures the gather period is positive with a descriptive error message.
135-138
: LGTM!The getter method is correctly implemented following established patterns with appropriate const-correctness.
cpp/include/tensorrt_llm/executor/executor.h (7)
1004-1004
: LGTM: Well-designed constructor parameter addition.The new
attentionDpEventsGatherPeriodMs
parameter follows consistent naming conventions and provides a reasonable default value of 5ms for the attention DP events gathering period.
1020-1020
: LGTM: Consistent getter method implementation.The getter method follows the established pattern with appropriate
[[nodiscard]]
attribute and correct return type.
1035-1035
: LGTM: Consistent setter method implementation.The setter method follows the established naming convention and parameter type consistency.
1711-1715
: LGTM: Well-designed constructor overload.The new constructor overload properly uses member initializer lists and provides flexibility with optional parameters while maintaining backward compatibility with the existing constructor.
1741-1742
: LGTM: Appropriate constructor extension.The constructor modification properly adds the optional
attentionDpRank
parameter with a sensible default, maintaining backward compatibility.
1750-1751
: LGTM: Well-documented member addition.The new
attentionDpRank
member variable is properly typed as optional and includes clear documentation about its purpose.
1092-1093
: LGTM: Well-documented member variable addition.The new member variable follows naming conventions and includes clear documentation explaining its purpose for attention DP events polling period.
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (4)
477-477
: LGTM! Properly tests the new attention DP events gather period field.The test correctly passes the new
attentionDpEventsGatherPeriodMs
parameter (value 77) and verifies it after serialization/deserialization.Also applies to: 493-493
850-916
: Well-structured comparison helper for KVCacheEvent testing.The
compareKvCacheEvents
function provides comprehensive equality checks for all KVCacheEvent variant types and properly includes the newattentionDpRank
field. The deep comparison logic for complex nested structures is thorough and correct.
918-925
: LGTM! Test properly validates KVCacheCreatedData serialization.
936-976
: LGTM! Comprehensive test coverage for KVCache events and UniqueToken.The tests properly validate serialization/deserialization for:
- Complex nested structures in KVCacheStoredData
- Optional fields in KVCacheUpdatedData
- Basic and vector forms of UniqueToken
cpp/include/tensorrt_llm/executor/serialization.h (1)
305-351
: LGTM! Well-structured serialization interface extensions.The new serialization methods for KVCache events and UniqueToken follow the established pattern consistently:
- Proper use of
[[nodiscard]]
attributes for functions that return values- Consistent parameter types (const references for serialize, streams for I/O)
- Complete coverage for all KVCache event variant types
- Template support for
KVCacheEventDiff<T>
cpp/tensorrt_llm/executor/serialization.cpp (5)
26-26
: LGTM!Including
<cstddef>
explicitly is a good practice for code clarity, even thoughsize_t
might be available through other headers.
1166-1170
: Well-implemented serialization for attention DP events gathering period.The new
attentionDpEventsGatherPeriodMs
field is correctly handled in all serialization methods with consistent ordering, maintaining backward compatibility.Also applies to: 1188-1188, 1208-1208
2188-2223
: Properly implemented KVCacheEvent deque serialization.The serialization correctly handles the queue size first, enabling proper deserialization of the collection.
2225-2252
: Excellent KVCacheEvent serialization with attention DP support.The implementation correctly includes the new
attentionDpRank
field as an optional parameter, aligning with the PR's objective to include attention DP rank information with KV cache events.
2254-2417
: Comprehensive and consistent serialization for all KVCache event types.All event data structures (KVCacheCreatedData, KVCacheStoredData, KVCacheRemovedData, KVCacheUpdatedData) follow a consistent serialization pattern with proper size calculation, field ordering, and type safety. The template implementation for KVCacheEventDiff is particularly well-designed for reusability.
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h (3)
39-40
: LGTM!The constructor signature correctly adds the new attention DP parameters with appropriate default values.
66-67
: Thread management looks correctThe new exchange thread follows the same pattern as the existing worker thread, which is good for consistency.
Also applies to: 76-78
101-108
: Member variables are well-structuredThe attention DP configuration members are properly typed with std::optional for values that may not be present, and the documentation clearly explains their purpose.
cpp/tensorrt_llm/executor/serializeUtils.h (5)
125-132
: LGTM! Static assertions properly verify serialization support for new types.The static assertions correctly verify that all new KVCache event types and
UniqueToken
have the requiredserializedSize
implementation, maintaining consistency with the existing pattern.
311-324
: Well-designed variant deserialization helper using modern C++ features.The
deserializeVariantByIndex
helper function elegantly handles variant deserialization using fold expressions and template recursion. The forward declaration ofdeserialize
is appropriately placed, and the error handling for invalid indices is correct.
546-577
: Deserialization branches correctly implemented for all new types.All KVCache event types and
UniqueToken
deserialization branches follow the established pattern, properly delegating to their respectiveSerialization::deserialize*
functions.
614-614
: Clean integration of the new variant deserialization helper.The updated call correctly uses
std::make_index_sequence
to generate compile-time indices for all variant alternatives, making the deserialization more robust and maintainable.
230-237
: Serialize assertions properly added for consistency.The static assertions for
hasSerialize
correctly mirror thehasSerializedSize
assertions, ensuring complete serialization support for all new KVCache event types andUniqueToken
.
PR_Github #14033 [ run ] triggered by Bot |
PR_Github #14033 [ run ] completed with state |
/bot run |
PR_Github #14159 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)
258-275
: Remove commented code or provide explanation.The third request generation and its validation are commented out. Either remove this commented code to keep the test clean, or uncomment it if it's needed for comprehensive testing.
If this code should remain commented, please add a comment explaining why (e.g., "TODO: Enable after fixing issue #XXXX").
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
(2 hunks)tensorrt_llm/_utils.py
(1 hunks)tests/integration/test_lists/test-db/l0_dgx_h200.yml
(2 hunks)tests/unittest/llmapi/test_llm_kv_cache_events.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_utils.py
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
🧠 Learnings (5)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
tests/integration/test_lists/test-db/l0_dgx_h200.yml
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
tests/integration/test_lists/test-db/l0_dgx_h200.yml
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tests/integration/test_lists/test-db/l0_dgx_h200.yml (1)
139-139
: LGTM! New test added for attention DP KV cache events.The addition of
unittest/llmapi/test_llm_kv_cache_events.py::test_llm_api_attention_dp_kv_events
correctly integrates the new test for attention data parallelism KV cache events functionality into the H200 test suite.tests/unittest/llmapi/test_llm_kv_cache_events.py (3)
4-4
: LGTM! Appropriate import for multi-GPU test.The import of
skip_single_gpu
decorator is correct for a test that requires tensor parallelism size of 2.
210-216
: LGTM! Proper test setup for attention DP functionality.The test function is correctly decorated with
@skip_single_gpu
and creates an LLM instance with tensor parallelism size 2 and attention data parallelism enabled, which aligns with the PR objectives.
235-237
: No change needed: the Python executor always uses rank 0 for attention DP events.
Thetensor_parallel_size=2
setting applies to the C++ backend (covered bytest_llm_get_kv_cache_events_tp2
), but the Python‐backend implementation only ever emits events withattention_dp_rank == 0
. Therefore the existing assertion correctly reflects the behavior of this test.
PR_Github #14159 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)
150-227
: Consider removing debug print statements.The
check_events
helper function provides good code reuse and correctly validates attention DP rank information in events. However, the debug print statements (lines 163, 176, 205) should be removed before merging to production.Apply this diff to remove debug prints:
- print("event1:", event) - print("event2:", event) - print("event3:", event)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
(4 hunks)cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
(6 hunks)cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
(1 hunks)cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
(1 hunks)cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
(2 hunks)tests/unittest/llmapi/test_llm_kv_cache_events.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
- cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camelcase prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Constants (enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants) are uppercase snakecase with prefix 'k...
Files:
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
tests/unittest/llmapi/test_llm_kv_cache_events.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
🧠 Learnings (5)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_kv_cache_events.py
245-245: Local variable kvcache_config
is assigned to but never used
Remove assignment to unused variable kvcache_config
(F841)
266-266: Undefined name SchedulingParams
(F821)
🔇 Additional comments (4)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
328-330
: LGTM! Constructor binding correctly updated for attention DP support.The nanobind constructor binding properly exposes the new attention data parallelism parameters (
attention_dp_rank
,attention_dp_size
,attention_dp_events_gather_period_ms
) to Python, enabling configuration of KV cache event management with attention DP context.cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
324-326
: LGTM! Pybind11 constructor binding properly updated.The pybind11 constructor binding is correctly updated to match the nanobind version, maintaining consistency between both Python binding approaches while exposing the new attention DP parameters.
tests/unittest/llmapi/test_llm_kv_cache_events.py (2)
4-4
: LGTM! Import addition is appropriate.The import of
skip_single_gpu
utility is correctly added for the new multi-GPU test.
230-239
: LGTM! Good refactoring to use helper function.The refactoring to use the
check_events
helper improves code maintainability and reduces duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)
248-279
: Fix unused variable and configuration inconsistency.This test has the same issues identified in previous reviews:
kvcache_config
is defined but never used (lines 252-257)- The test uses
global_kvcache_config
instead of the locally defined config (line 262)- This creates configuration inconsistency and triggers static analysis warnings
Apply this diff to fix the issues:
- kvcache_config = KvCacheConfig(free_gpu_memory_fraction=0.4, - event_buffer_max_size=1024, - attention_dp_events_gather_period_ms=10, - enable_block_reuse=True, - onboard_blocks=True, - max_tokens=256) + attention_dp_kvcache_config = KvCacheConfig(free_gpu_memory_fraction=0.4, + event_buffer_max_size=1024, + attention_dp_events_gather_period_ms=10, + enable_block_reuse=True, + onboard_blocks=True, + max_tokens=256) llm = LLM(model=llama_model_path, tensor_parallel_size=2, enable_attention_dp=True, - kv_cache_config=global_kvcache_config, + kv_cache_config=attention_dp_kvcache_config, enable_autotuner=False)
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_kv_cache_events.py (1)
152-232
: Well-structured helper function with minor performance concern.The
check_events
helper function effectively extracts common validation logic and properly handles both regular and attention DP scenarios. The event validation logic is comprehensive and correctly verifies event types, IDs, and block counts.However, consider reducing the sleep duration from 1 second to improve test performance:
- time.sleep(1) + time.sleep(0.1) # Brief sleep to allow event gathering
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
(1 hunks)cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
(1 hunks)tests/unittest/llmapi/test_llm_kv_cache_events.py
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
🧠 Learnings (3)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T00:54:56.038Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tests/unittest/llmapi/test_llm_kv_cache_events.py
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_kv_cache_events.py
252-252: Local variable kvcache_config
is assigned to but never used
Remove assignment to unused variable kvcache_config
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/llmapi/test_llm_kv_cache_events.py (2)
4-5
: LGTM! Imports correctly added.The new imports are properly added to support the test enhancements:
pytest
for the threadleak decoratorskip_single_gpu
for multi-GPU test requirementsSchedulingParams
addresses the missing import from previous reviewsAlso applies to: 15-15
234-246
: Excellent refactoring!The simplified test effectively leverages the new
check_events
helper function, reducing code duplication while maintaining the same test coverage. This follows the DRY principle and improves maintainability.
@Superjomn could you review the changes |
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
d6a384d
to
caa37e8
Compare
/bot run --disable-fail-fast --add-multi-gpu-test |
PR_Github #14344 [ run ] triggered by Bot |
PR_Github #14311 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
(4 hunks)cpp/include/tensorrt_llm/executor/executor.h
(6 hunks)cpp/include/tensorrt_llm/executor/serialization.h
(1 hunks)cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
(1 hunks)cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
(6 hunks)cpp/tensorrt_llm/executor/executor.cpp
(1 hunks)cpp/tensorrt_llm/executor/kvCacheConfig.cpp
(5 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(5 hunks)cpp/tensorrt_llm/executor/serializeUtils.h
(5 hunks)cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
(1 hunks)cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
(3 hunks)cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
(1 hunks)cpp/tensorrt_llm/pybind/executor/bindings.cpp
(1 hunks)cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
(2 hunks)cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
(3 hunks)tensorrt_llm/_torch/pyexecutor/resource_manager.py
(2 hunks)tensorrt_llm/_utils.py
(1 hunks)tensorrt_llm/llmapi/llm_args.py
(2 hunks)tests/integration/test_lists/test-db/l0_dgx_h200.yml
(1 hunks)tests/unittest/bindings/test_executor_bindings.py
(4 hunks)tests/unittest/llmapi/test_llm_args.py
(2 hunks)tests/unittest/llmapi/test_llm_kv_cache_events.py
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (18)
- tests/integration/test_lists/test-db/l0_dgx_h200.yml
- tests/unittest/llmapi/test_llm_args.py
- cpp/tensorrt_llm/executor/executor.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/pybind/executor/bindings.cpp
- cpp/include/tensorrt_llm/runtime/utils/mpiTags.h
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/tensorrt_llm/executor/kvCacheConfig.cpp
- tensorrt_llm/llmapi/llm_args.py
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_utils.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
- tests/unittest/llmapi/test_llm_kv_cache_events.py
- cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/include/tensorrt_llm/executor/serialization.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/executor/serializeUtils.h
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tensorrt_llm/executor/serialization.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/executor/serializeUtils.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/executor/serializeUtils.h
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tensorrt_llm/executor/serialization.cpp
tests/unittest/bindings/test_executor_bindings.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/bindings/test_executor_bindings.py
🧠 Learnings (2)
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
cpp/tensorrt_llm/executor/serialization.cpp
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
cpp/tensorrt_llm/executor/serialization.cpp
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
1332-1332: Avoid equality comparisons to False
; use not config.enable_block_reuse:
for false checks
Replace with not config.enable_block_reuse
(E712)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (18)
tests/unittest/bindings/test_executor_bindings.py (4)
1317-1317
: Good test coverage for the new field default value.The assertion correctly verifies that the new
attention_dp_events_gather_period_ms
field defaults to 5, following the established testing pattern in this function.
1332-1332
: LGTM! Setter test is correctly implemented.The assignment tests the setter functionality for the new field. The static analysis suggestion to use
not
instead of== False
doesn't apply here since this is a test context where explicit comparisons are appropriate for clarity.
1346-1346
: Good verification of setter functionality.The assertion properly verifies that the
attention_dp_events_gather_period_ms
field was correctly set to 10, completing the setter/getter test cycle.
1360-1361
: Comprehensive constructor testing with keyword arguments.Adding the new field to the kwargs dictionary ensures that
attention_dp_events_gather_period_ms
can be properly initialized through constructor keyword arguments, providing complete test coverage for the field initialization patterns.cpp/tensorrt_llm/executor/serializeUtils.h (5)
125-132
: LGTM!The new static assertions for KVCache event types and UniqueToken follow the established pattern and are necessary to ensure these types have proper serialization traits.
230-237
: LGTM!The static assertions for serialize traits complement the serializedSize assertions and ensure complete serialization support for the new KVCache event types and UniqueToken.
310-324
: Well-designed generic variant deserialization helper.The
deserializeVariantByIndex
function uses modern C++17 features (fold expressions, index sequences) to provide a clean, generic solution for deserializingstd::variant
types by their active index. The error handling is appropriate and the implementation is efficient.
546-577
: Comprehensive deserialization support for new KVCache types.The new deserialize branches properly extend support for all KVCache event-related types and UniqueToken, following the established pattern of delegating to the Serialization namespace functions.
614-614
: Improved variant deserialization using the new generic helper.The update to use
deserializeVariantByIndex
withstd::make_index_sequence
provides a more robust and scalable approach to variant deserialization compared to the previous implementation.cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (6)
477-477
: Correctly updated constructor call for new attention DP parameter.The KvCacheConfig constructor call properly includes the new
attentionDpEventsGatherPeriodMs
parameter with test value 77.
493-493
: Proper verification of new attention DP parameter.The test correctly verifies that the
attentionDpEventsGatherPeriodMs
parameter is preserved through serialization/deserialization.
850-916
: Comprehensive and well-structured event comparison function.The
compareKvCacheEvents
function provides thorough comparison logic for all KVCacheEvent variant types, properly handling nested structures, optional fields, and using appropriate variant access methods. The implementation covers all the complexity of the KVCacheEvent data structure.
918-950
: Comprehensive test coverage for KVCache events deque serialization.The test properly covers all four KVCacheEvent variant types in a deque container, using meaningful test data and thorough verification through the comparison helper function.
952-990
: Thorough individual test coverage for all KVCache event variant types.The individual tests for each KVCacheEvent variant type (Created, Removed, Stored, Updated) use appropriate test data and provide comprehensive verification through the comparison helper function. The test data covers the complexity of nested structures appropriately.
992-1010
: Adequate test coverage for UniqueToken serialization.The tests for UniqueToken and vector of UniqueToken provide proper verification of the simple two-field structure, covering both individual object and container serialization scenarios.
cpp/tensorrt_llm/executor/serialization.cpp (3)
26-26
: LGTM!Proper use of C++ standard header
<cstddef>
instead of C-style<stddef.h>
, following the coding guidelines.
1166-1170
: LGTM!The
attentionDpEventsGatherPeriodMs
field is correctly integrated into the KvCacheConfig serialization/deserialization methods, maintaining consistency with the existing pattern.Also applies to: 1188-1188, 1208-1208
2188-2417
: Well-structured KVCache event serialization implementation!The serialization/deserialization functions for KVCache events and related data structures are comprehensive and follow established patterns consistently. All variant types, optional fields, and the new
attentionDpRank
field are properly handled.
PR_Github #14344 [ run ] completed with state |
…ts (NVIDIA#6563) Signed-off-by: Patrice Castonguay <[email protected]>
Summary by CodeRabbit
New Features
Enhancements
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.