
Conversation

brb-nv (Collaborator) commented Sep 1, 2025

Summary by CodeRabbit

  • New Features

    • Added context-parallelism (CP) across KV-cache split/concat and transmission, extending partitioning from PP to PP×CP (with TP integration) and aligning send/receive sizing and concurrency.
  • Tests

    • Expanded multi-GPU tests for CP scenarios, asymmetric TP/PP/CP configs, migrated tests to the KV cache manager API, and added MPI debug hooks.
  • Chores

    • Relaxed a prior context-parallelism restriction, added CP-centric logging/debug prints and small runtime prints, and introduced a CP field/parameter in public interfaces that callers must accommodate.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
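
For example, /bot run --disable-fail-fast --stage-list "A10-PyTorch-1" launches a pipeline that runs only the A10-PyTorch-1 stage and does not stop early on individual failures.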

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

brb-nv added 7 commits August 28, 2025 10:21
$ export LD_LIBRARY_PATH=/home/bbuddharaju/scratch/TensorRT-LLM/cpp/build/tensorrt_llm/executor/cache_transmission/ucx_utils/:$LD_LIBRARY_PATH
$ TRTLLM_USE_UCX_KVCACHE=1 mpirun -n 8 ./cpp/build/tests/unit_tests/multi_gpu/cacheTransceiverTest --gtest_filter="AsymmetricCaseTestWithCPForMLA/AsymmetricalCacheTest.TestCase/*"

coderabbitai bot commented Sep 1, 2025

📝 Walkthrough

Format/unformat, split/concat kernels, and tests are updated to account for Context Parallelism (CP). Domain sizing, indexing, buffer allocation, kernel signatures, and rank math now use PP×CP and TP×CP×PP; inquireSupport no longer enforces context-parallelism==1.

Changes

Cohort / File(s) Summary of Changes
MLA Cache Formatter (PP×CP domain)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
Replace PP-only logic with PP×CP across divisibility checks, target buffer sizing/allocation, send/recv dispatch, indexing, and concurrency caps; update comments; remove context-parallelism==1 restriction in inquireSupport; minor variable rename.
KV Cache Split/Concat & header (TP×CP×PP)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu, cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
Add mDomainCPSize to TargetRanksInfo; extend IRanks and rank math to TP×CP×PP; validate mIRanks.size() against TP×CP×PP; propagate DomainCPSize into host indexing, MLA kernel signatures/launches and memory offsets; add CP logging and traces.
Unit tests: CP integration & manager refactor
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
Replace mManager with mKVCacheManager pointer API; add CP fields and CP-aware rank decomposition (mCpRank, mCpSize); propagate KVCacheManager through buffer/formatter/sender/receiver; expand test instantiations for CP scenarios; add MPI/debug hooks and CP-aware token math.
Buffer manager logging tweak
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
Minor formatting change and added printf debug in CacheTransBufferManager::getOrAllocateBuffers (runtime output only, no logic change).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Sender
  participant MLA_Formatter as MLA Cache Formatter
  participant SplitConcat as KV Split/Concat
  participant Receiver

  Note over MLA_Formatter: Format path partitions by PP × CP
  Sender->>MLA_Formatter: format(caches, pPDomainSize, cPDomainSize)
  MLA_Formatter->>MLA_Formatter: domainCount = PP × CP\nallocate buffers sized for PP×CP\ncacheIdx = procIdx % (PP×CP)
  MLA_Formatter->>SplitConcat: splitKVCache(..., DomainPP, DomainTP, DomainCP)
  SplitConcat->>SplitConcat: build IRanks over TP × CP × PP\nvalidate mIRanks.size == TP×CP×PP
  SplitConcat-->>MLA_Formatter: return split outputs (size == PP×CP)
  MLA_Formatter-->>Receiver: dispatch with concurrency cap = PP×CP

  Note over MLA_Formatter,Receiver: Unformat/recv mirrors PP×CP sizing and CP-aware indexing
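As a rough illustration of the rank math above, here is a minimal standalone sketch of a TP-fastest, then CP, then PP linearization; the names are illustrative and not the actual TensorRT-LLM API:

// Minimal sketch (illustrative names): linearize (tp, cp, pp) coordinates with
// TP fastest, then CP, then PP, and build the flat rank list whose size the PR
// validates as TP×CP×PP.
#include <vector>

int linearRank(int tpRank, int cpRank, int ppRank, int tpSize, int cpSize)
{
    return tpRank + cpRank * tpSize + ppRank * tpSize * cpSize;
}

std::vector<int> buildIRanks(int tpSize, int cpSize, int ppSize)
{
    int const total = tpSize * cpSize * ppSize;
    std::vector<int> ranks;
    ranks.reserve(total);
    for (int pp = 0; pp < ppSize; ++pp)
    {
        for (int cp = 0; cp < cpSize; ++cp)
        {
            for (int tp = 0; tp < tpSize; ++tp)
            {
                ranks.push_back(linearRank(tp, cp, pp, tpSize, cpSize));
            }
        }
    }
    return ranks; // ranks.size() == tpSize * cpSize * ppSize
}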

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested labels

KV-Cache Management

Suggested reviewers

  • pcastonguay
  • schetlur-nv
  • MatthiasKohl
  • chuangz0
  • Tabrizian

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1)

39-43: Add include guard and Doxygen comment for mDomainCPSize

  • Wrap the header with:
    #ifndef TRTLLM_CACHESPLITCONCAT_H
    #define TRTLLM_CACHESPLITCONCAT_H
    #pragma once
    …  
    #endif // TRTLLM_CACHESPLITCONCAT_H
  • Above int mDomainCPSize; in struct TargetRanksInfo, add a Doxygen comment, e.g.:
    /// Width of the Context-Parallel (CP) domain
    int mDomainCPSize;
  • No serializers/deserializers for TargetRanksInfo were found in the codebase; verify any ABI/serialization boundaries if this type is shared across DSOs.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

45-53: pickRecvConnections ignores CP; may drop CP shards.

When CP>1 for MLA, the receiver should pick one connection per (PP, CP) pair (ignoring TP replicas); the current code returns only PP entries.

Apply:

 std::vector<size_t> MLACacheFormatter::pickRecvConnections(
     size_t numConnections, CacheState const& selfConfig, SizeType32 selfIdx, CacheState const& destConfig) const
 {
     auto targetInfo = executor::kv_cache::targetIRanks(destConfig, selfConfig, selfIdx);
     TLLM_CHECK(numConnections == targetInfo.mIRanks.size());
     std::vector<size_t> ret;
-    // targetInfo , mRanks [tpranks, dpranks]
-    for (int i = 0; i < targetInfo.mDomainPPSize; i++)
-    {
-        ret.push_back(i);
-    }
+    // Pick one connection per (PP, CP). Skip TP replicas by stepping with DomainTPSize.
+    if (selfConfig.getAttentionConfig().mAttentionType == CacheState::AttentionType::kMLA
+        && destConfig.getParallelConfig().mContextParallelism > 1)
+    {
+        int const stride = targetInfo.mDomainTPSize * targetInfo.mDomainPPSize; // TP-major order in mIRanks
+        for (int cp = 0; cp < targetInfo.mDomainCPSize; ++cp)
+        {
+            for (int pp = 0; pp < targetInfo.mDomainPPSize; ++pp)
+            {
+                ret.push_back(static_cast<size_t>(cp * stride + pp));
+            }
+        }
+    }
+    else
+    {
+        for (int pp = 0; pp < targetInfo.mDomainPPSize; ++pp) { ret.push_back(static_cast<size_t>(pp)); }
+    }
     return ret;
 }
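
Note that this fix assumes mIRanks is laid out with TP varying fastest within each (PP, CP) group; if targetIRanks ever changes its ordering, the cp * stride + pp arithmetic must change with it.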

554-559: Bug: duplicated self check; dest heads-per-layer never checked.

Second operand repeats selfConfig; should reference destConfig.

-    if ((selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1)
-        || (selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1))
+    if ((selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1)
+        || (destConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1))
     {
         TLLM_LOG_WARNING("MLACacheFormatter::inquireSupport: only support MLA");
         return false;
     }
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

660-666: Logic bug: cacheType shadowed; K-only never applied.

Inner ‘auto cacheType’ creates a new variable; outer cacheType remains kSELF.

-        CacheType cacheType = CacheType::kSELF;
-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
+        CacheType cacheType = CacheType::kSELF;
+        if (kvFactor == 1)
+        {
+            cacheType = CacheType::kSELFKONLY;
+        }
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)

1267-1272: Assertion misses CP; fails for CP>1 on MLA vector concat path.

The IRanks size check must include CP, as the split path does; otherwise CP scenarios trip the assert before the MLA specialization.

-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(targetRankInfo.mIRanks.size()
+        == static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));

Note: subsequent inputCacheNum adjustment for MLA (DomainPPSize) is still correct.

🧹 Nitpick comments (6)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)

165-169: Typo in variable name: agentConnnection.

Rename for clarity/consistency.

-    auto* agentConnnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
-    if (agentConnnection != nullptr)
+    auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
+    if (agentConnection != nullptr)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

646-651: Clarify “shared blocks” comment and remove dead variable.

Comment is fine; also remove unused dataTypeSize earlier/later if present to keep tests clean.

Would you like me to sweep and drop trivial unused locals in test code?


944-954: Excessive stderr/logging in tight loops; gate by env.

Unconditional std::cerr will spam CI. Guard with the same TARGET_RANK check.

-        std::cerr << "fillBlockData called for rank " << mRank << " mRankInInstance " << mRankInInstance << " blockId " << blockId << std::endl;
+        if (TARGET_RANK == -1 || tensorrt_llm::mpi::MpiComm::world().getRank() == TARGET_RANK) {
+            std::cerr << "fillBlockData called for rank " << mRank
+                      << " mRankInInstance " << mRankInInstance
+                      << " blockId " << blockId << std::endl;
+        }

Mirror the same change in verifyBlockData.

Also applies to: 958-965

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

933-939: Reduce noisy INFO logs and fix format specifiers.

  • Printing every target rank at INFO is noisy; prefer DEBUG or guard by env.
  • %d for size_t is UB; use %zu or cast to unsigned long long.
-    TLLM_LOG_INFO("[splitKVCache] targetRankInfo.mIRanks.size(): %d", targetRankInfo.mIRanks.size());
+    TLLM_LOG_DEBUG("[splitKVCache] targetRankInfo.mIRanks.size(): %zu", targetRankInfo.mIRanks.size());
@@
-        TLLM_LOG_INFO("[splitKVCache] target rank: %d, ", rank);
+        TLLM_LOG_DEBUG("[splitKVCache] target rank: %d", rank);

1052-1056: DomainCPSize passed to MLA kernel but unused inside kernel.

This will trigger "unused parameter" warnings and confuse readers. Either use it in addressing or explicitly mark it unused.

 __global__ void splitKVCacheForMLAKernel(..., int DomainCPSize, int layerNumDomainPP, int kvFactor)
 {
+    (void)DomainCPSize;

If CP is intended to affect MLA layout later, add a TODO with rationale.

Also applies to: 1095-1099, 1123-1127, 1156-1160, 1185-1189


1006-1010: Good diagnostics.

Grid dims and domain logs help triage; keep at DEBUG level as already done in split; consider same for concat.

Also applies to: 1371-1376

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e257cb3 and e91f63b.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (13 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (35 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}: In C++ and CUDA files, closing braces of namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid magic literals: except 0, nullptr, true, false, use named constants (e.g., constexpr) instead of inline numeric or string literals
Use Allman brace style; always brace bodies of if/else/switch/while/do/for; put the semicolon of empty loops on a new line
C++ filenames should be camelCase starting lowercase (e.g., thisIsAFilename.cpp) and case-insensitive unique within a build target
Type names are UpperCamelCase; local variables, methods, and namespaces are lowerCamelCase
Global non-magic-number variables: prefix g for non-static globals and s for static or anonymous-namespace globals (e.g., gFoo, sBar)
Locally visible static variables should be lowerCamelCase starting with 's' (e.g., static std::once_flag sFlag)
Member variables use mPrefix (e.g., mCount); public members may omit but using m is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic-number constants) use uppercase SNAKE_CASE with k prefix (e.g., kMAX_SIZE)
Avoid macros; if necessary, use UPPER_SNAKE_CASE for macro names
Run clang-format (LLVM style) before submitting; maximum line length is 120; use clang-format off/on only for justified exceptions
Use C++ comments (//); C-style /* */ only for special inline cases; prefer Doxygen comments: //! and //!<; full-sentence comments are capitalized and punctuated; document public APIs with Doxygen
Disable code with #if/#endif (possibly via a DEBUG macro); do not comment out code; avoid dead code blocks
Do not throw exceptions across library boundaries
Use the least-forceful cast; avoid C-style and functional casts (except void casts); do not remove const/volatile; cast void* to T* via static_cast; reinterpret_cast only a...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{hpp,h,hxx,hh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header include guards are required with macro name TRTLLM_<FILE_NAME_IN_CAPS> (no directories, no leading/trailing underscores)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{cpp,cc,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu}: Prefer smart pointers; use std::unique_ptr for sole ownership, std::shared_ptr for shared ownership; avoid deprecated smart pointers
Do not use assignment in subexpressions (e.g., if (x = y)); avoid chained assignments (x = y = z)
Switch statements: provide cases for all enum values and omit default to catch new values; prohibit fall-through except between empty labels; terminate each case with break or throw; do not end a case with return; place break inside braces when using a compound statement

Files:

  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧠 Learnings (5)
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • fromAllBlockIds (34-41)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
  • numLayers (288-389)
  • numLayers (607-714)
  • numLayers (607-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4)

101-106: Early exit via needSendCache: LGTM.

Short-circuiting send when TP down-sampling says “no-op” is correct and avoids allocation.


157-173: PP×CP sizing and preallocation checks: LGTM.

Correct product-based divisibility and buffer sizing; agent path requiring full prealloc also looks right.


190-205: Indexing and concurrency cap over PP×CP: LGTM.

Modulo by PP×CP and concurrency min(bufferCoverTargetNum, PP×CP) is consistent with the split fanout.

Also applies to: 254-255


491-494: Comment matches behavior.

concatKvCacheV2Dispatch will expect recvSplitCaches sized to PP×(possibly CP) on MLA path; paired with pickRecvConnections fix above.

Please ensure tests cover MLA with CP>1 (you added one; good).

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

723-735: CommState propagation: LGTM.

UCX/NIXL path broadcasts context comm state; non-UCX path builds rank vector. Looks consistent with CP-aware topology.

Also applies to: 782-832, 833-836


885-906: Good: use KVCacheManager everywhere and sync per-pool.

BlockRange usage via KVCacheManager is consistent.

Also applies to: 931-942


1451-1467: Nice: CP-aware MLA test added.

Covers genCp=2 case; this should catch the pickRecvConnections bug above once fixed.

Please confirm this test fails before and passes after the pickRecvConnections fix.


534-588: Rank arithmetic with CP: LGTM.

mPpRank/mCpRank derived from (tp×cp) matches the 3D mapping used by the kernels.
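
For reference, a minimal sketch of the decomposition this comment describes, assuming TP-fastest ordering (the struct and names are illustrative, not the test's actual members):

// Illustrative decomposition of a flat rank into (tp, cp, pp) coordinates,
// mirroring ppRank = rank / (tpSize * cpSize) from the mapping above.
struct RankCoords
{
    int tpRank;
    int cpRank;
    int ppRank;
};

RankCoords decomposeRank(int rank, int tpSize, int cpSize)
{
    RankCoords coords{};
    coords.tpRank = rank % tpSize;
    coords.cpRank = (rank / tpSize) % cpSize;
    coords.ppRank = rank / (tpSize * cpSize);
    return coords;
}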

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

60-66: CP domain integration in TargetRanksInfo: LGTM.

3D (TP,CP,PP) mapping and rank linearization looks consistent.

Also applies to: 114-131, 136-142, 159-160


323-359: Potential CP indexing gap in non-MLA concatKVCache (ptr version).

inputAllRankNum excludes CP and tp/pp rank derivation ignores CP. If this path can run with CP>1, indexing will break.

Please confirm concatKVCacheDispatch is only used when contextParallelism==1. If not, I can send a patch to:

  • multiply inputAllRankNum by mContextParallelism
  • compute ppRank = rank / (tpNum*cpNum) and derive cpRank as needed
  • adjust startHead/startLayer accordingly

brb-nv requested a review from a team as a code owner September 1, 2025 06:31
coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

447-449: Bug: operator precedence breaks the byte-compare predicate

The expression compares (i == promptLen) first, then bitwise-ANDs with 0xff, yielding 0/1, not the intended masked compare.

Apply:

-            EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+            EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                [&llmRequest](uint8_t i) {
+                    return i == static_cast<uint8_t>(llmRequest->getPromptLen());
+                }));

660-665: Bug: cacheType shadowing prevents K-only mode

The inner ‘auto cacheType’ shadows the outer variable; kSELFKONLY is never applied.

-        CacheType cacheType = CacheType::kSELF;
-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
+        CacheType cacheType = kvFactor == 1 ? CacheType::kSELFKONLY : CacheType::kSELF;
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

323-346: Bug: concatKVCache (pointer variant) ignores CP in rank-stride and rank→(pp,tp) mapping

  • inputAllRankNum omits CP; indexing with global iRank (which includes CP) can go OOB.
  • ppRank derivation ignores CP (should divide by TP×CP).
-    int const inputAllRankNum = iParallelConfig.mPipelineParallelism * iParallelConfig.mTensorParallelism;
+    int const inputAllRankNum = iParallelConfig.mPipelineParallelism
+        * iParallelConfig.mTensorParallelism * iParallelConfig.mContextParallelism;
...
-        int const tpRank = rank % parallelConfig.mTensorParallelism;
-        int const ppRank = rank / parallelConfig.mTensorParallelism;
+        int const tpNum = parallelConfig.mTensorParallelism;
+        int const cpNum = parallelConfig.mContextParallelism;
+        int const tpRank = rank % tpNum;
+        int const ppRank = rank / (tpNum * cpNum);
         int const ppNum = parallelConfig.mPipelineParallelism;

Also ensure inputRanks passed in match TargetRanksInfo.mIRanks (which now includes CP).

Also applies to: 351-360


1271-1284: Bug: DomainCPSize omitted in concat (vector variant) rank validation; MLA input count mismatch

  • Validation should include CP: PP×TP×CP.
  • For MLA, inputCacheNum should be PP×CP (mirrors split side), not just PP.
-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(targetRankInfo.mIRanks.size()
+        == static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
...
-    if (selfCacheState.getAttentionConfig().mAttentionType == CacheState::AttentionType::kMLA)
-    {
-        inputCacheNum = targetRankInfo.mDomainPPSize;
-    }
+    if (selfCacheState.getAttentionConfig().mAttentionType == CacheState::AttentionType::kMLA)
+    {
+        inputCacheNum = targetRankInfo.mDomainPPSize * targetRankInfo.mDomainCPSize;
+    }
🧹 Nitpick comments (5)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)

646-651: Clarify/remove “@b” inline question

This is test code; keep comments actionable. Replace with a brief explanation or remove.

-        // @B: What are shared blocks?
+        // Shared blocks: tokens from the prefill window reused across beams before divergence.

944-953: getEnvMpiDebugRank utility is fine

-2 sentinel for “disabled” is reasonable. Consider documenting -1 vs -2 in a file-level comment.


957-976: Guard heavy test logging

Inner-loop TLLM_LOG_INFO calls can produce huge output. They’re already rank-gated; also gate on an env toggle to avoid accidental log floods.

-        if (TARGET_RANK == -1 || tensorrt_llm::mpi::MpiComm::world().getRank() == TARGET_RANK)
+        if ((TARGET_RANK == -1 || tensorrt_llm::mpi::MpiComm::world().getRank() == TARGET_RANK)
+            && tensorrt_llm::common::getEnv("TLLM_DEBUG_VERBOSE", "0") == "1")

Also applies to: 1001-1014, 1027-1041

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

499-505: Comment nit: “Why not use Domain{P,T,C}?”

Clarify or drop the @b note; comments should state intent, not questions.

-// @B: Why do we not use Domain{P,T,C}PSize?
+// Note: For MLA, head domain is singleton; kernel parameters pass PP/TP/CP domain sizes explicitly.

935-943: Verbose IRank logging in split path

Consider gating with an env var; printing all ranks per call is noisy.

-    TLLM_LOG_INFO("[splitKVCache] targetRankInfo.mIRanks.size(): %d", targetRankInfo.mIRanks.size());
-    for (auto rank : targetRankInfo.mIRanks)
-    {
-        TLLM_LOG_INFO("[splitKVCache] target rank: %d, ", rank);
-    }
+    if (tensorrt_llm::common::getEnv("TLLM_DEBUG_VERBOSE", "0") == "1")
+    {
+        TLLM_LOG_INFO("[splitKVCache] targetRankInfo.mIRanks.size(): %zu", targetRankInfo.mIRanks.size());
+        for (auto rank : targetRankInfo.mIRanks)
+        {
+            TLLM_LOG_INFO("[splitKVCache] target rank: %d", rank);
+        }
+    }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e91f63b and e465f43.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (14 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (36 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}: In C++ and CUDA files, closing braces of namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid magic literals: except 0, nullptr, true, false, use named constants (e.g., constexpr) instead of inline numeric or string literals
Use Allman brace style; always brace bodies of if/else/switch/while/do/for; put the semicolon of empty loops on a new line
C++ filenames should be camelCase starting lowercase (e.g., thisIsAFilename.cpp) and case-insensitive unique within a build target
Type names are UpperCamelCase; local variables, methods, and namespaces are lowerCamelCase
Global non-magic-number variables: prefix g for non-static globals and s for static or anonymous-namespace globals (e.g., gFoo, sBar)
Locally visible static variables should be lowerCamelCase starting with 's' (e.g., static std::once_flag sFlag)
Member variables use mPrefix (e.g., mCount); public members may omit but using m is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic-number constants) use uppercase SNAKE_CASE with k prefix (e.g., kMAX_SIZE)
Avoid macros; if necessary, use UPPER_SNAKE_CASE for macro names
Run clang-format (LLVM style) before submitting; maximum line length is 120; use clang-format off/on only for justified exceptions
Use C++ comments (//); C-style /* */ only for special inline cases; prefer Doxygen comments: //! and //!<; full-sentence comments are capitalized and punctuated; document public APIs with Doxygen
Disable code with #if/#endif (possibly via a DEBUG macro); do not comment out code; avoid dead code blocks
Do not throw exceptions across library boundaries
Use the least-forceful cast; avoid C-style and functional casts (except void casts); do not remove const/volatile; cast void* to T* via static_cast; reinterpret_cast only a...

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{cpp,cc,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu}: Prefer smart pointers; use std::unique_ptr for sole ownership, std::shared_ptr for shared ownership; avoid deprecated smart pointers
Do not use assignment in subexpressions (e.g., if (x = y)); avoid chained assignments (x = y = z)
Switch statements: provide cases for all enum values and omit default to catch new values; prohibit fall-through except between empty labels; terminate each case with break or throw; do not end a case with return; place break inside braces when using a compound statement

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧠 Learnings (5)
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • fromAllBlockIds (34-41)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
  • numLayers (288-389)
  • numLayers (607-714)
  • numLayers (607-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (8)

317-321: mKVCacheManager migration looks correct

Constructor args align with prior semantics; PP/TP/CP-neutral here and consistent with downstream usages.


388-388: Pool allocation moved to mKVCacheManager is fine

No behavior change; uses explicit useUvm=false consistent with MPI constraints.


394-406: Transceiver wiring uses mKVCacheManager throughout

Formatter/BufferManager constructed with the new manager correctly.


713-713: Pool allocation with window attention path

Good to allocate after window-pool config is finalized.


782-830: UCX/NIXL commState propagation path is sensible

Rank-0 broadcast → gen ranks deserialize; context ranks set same CommState. Matches mixed backends.


833-836: Non-UCX/MPI path: context rank vector

Deterministic 0..contextRankSize-1 vector is correct for comm topology.


1456-1474: New CP-for-MLA instantiation looks good

Covers CP growth on generation side (genCp=2) with MLA kvFactor=2 and INT8; good addition.


1545-1817: TargetRanksInfo/needSendCache matrix tests are comprehensive

Good coverage for PP/TP/CP grow/shrink and DP toggles.

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

60-66: CP domain integration in TargetRanksInfo is correct

  • Added CP derivation and power-of-two checks.
  • IRank mapping now PP×CP×TP; return struct extended accordingly.

Also applies to: 114-132, 136-144, 159-160
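
As a point of reference, a power-of-two check like the one referenced here is commonly written as follows; this is a generic sketch, not the file's exact code:

// Generic sketch: true for 1, 2, 4, 8, ...; rejects zero and negatives.
bool isPowerOfTwo(int value)
{
    return value > 0 && (value & (value - 1)) == 0;
}

Requiring powers of two makes the larger of the two CP widths an exact multiple of the smaller, so the peer/self domain ratio divides evenly in both grow and shrink directions.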


1054-1067: MLA split: CP-aware output count and launches are sensible

  • DomainCPSize plumbed into kernel args and used to compute outputCacheIdx.
  • Launch selection by vector width unchanged.

Also applies to: 1075-1192


488-498: makeShapeFromCacheState remains correct

Dims reflect [layers/PP, KV, blockSize]; unchanged by CP.

Comment on lines 1102 to 1116
// EXPECT_EQ(*dataPtr, generateValue);
// Debug print with rank information for MPI debugging (KEY values)
if (TARGET_RANK == -1 || tensorrt_llm::mpi::MpiComm::world().getRank() == TARGET_RANK)
{
TLLM_LOG_INFO(tensorrt_llm::mpi::MpiComm::world().getRank(),
"[RANK %d] [verifyBlockData::key] blockId=%d, layer=%d->%d, head=%d->%d, token=%d->%d, hidden=%d, "
"keyIdx=%zu, value=%s, dataType=%d",
tensorrt_llm::mpi::MpiComm::world().getRank(),
blockId, layerId, layerId + startLayerId,
headId, headId + startHeadId,
tokenId, tokenId + startTokenId,
hiddenId, keyIndex,
std::to_string(static_cast<double>(generateValue)).c_str(),
static_cast<int>(blockData.getDataType()));
}

🛠️ Refactor suggestion

⚠️ Potential issue

Critical: assertions disabled in verifyBlockData

Both EXPECT_EQ checks are commented out; tests won’t detect corruption.

-                                // EXPECT_EQ(*dataPtr, generateValue);
+                                EXPECT_EQ(*dataPtr, generateValue);
...
-                                    // EXPECT_EQ(*dataPtr, generateValue);
+                                    EXPECT_EQ(*dataPtr, generateValue);

If float types cause flakiness, switch to EXPECT_NEAR with 0 tolerance or reinterpret-cast bitwise equality for exact copy semantics.

Also applies to: 1127-1141

🤖 Prompt for AI Agents
In cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp around lines
1102-1116 (and also apply same fix at 1127-1141), the EXPECT_EQ assertions are
commented out so the test cannot detect data corruption; re-enable the checks
and replace fragile float comparisons with a stable equality check: for
floating-point types use EXPECT_NEAR(actual, expected, 0.0) or perform a bitwise
comparison by reinterpret_casting both values to an integral type of the same
size and using EXPECT_EQ on those bits; ensure the test handles non-float types
with the original EXPECT_EQ and keep the existing MPI debug logging unchanged.
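
A minimal sketch of the bitwise-equality option mentioned above, assuming a trivially copyable element type; toBits is a hypothetical helper, not part of the test file:

#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the object representation into an integer so that
// EXPECT_EQ compares bits exactly (std::memcpy avoids the undefined behavior
// of type punning through reinterpret_cast).
template <typename T>
std::uint64_t toBits(T value)
{
    static_assert(sizeof(T) <= sizeof(std::uint64_t), "widen the carrier type for larger T");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof(T));
    return bits;
}

// Usage in the test body: EXPECT_EQ(toBits(*dataPtr), toBits(generateValue));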

brb-nv marked this pull request as draft September 1, 2025 06:43
coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)

333-345: Replace raw printf with TLLM_LOG_DEBUG and fix type mismatch in log.

Raw printf is noisy and bypasses logging controls. Also, bufferCoverTargetNum is size_t but printed with %d.

-    printf("[CacheTransBufferManager::getOrAllocateBuffers] targetNum:%d, targetBufferEleSize:%ld, mTransferBufferSize:%ld\n", targetNum, targetBufferEleSize, mTransferBufferSize);
+    TLLM_LOG_DEBUG(
+        "[CacheTransBufferManager::getOrAllocateBuffers] targetNum:%d, targetBufferEleSize:%zu, mTransferBufferSize:%zu",
+        targetNum, targetBufferEleSize, mTransferBufferSize);
@@
-    TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%d", bufferCoverTargetNum);
+    TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%zu", bufferCoverTargetNum);
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)

1269-1289: MLA concat path not CP-aware (functional mismatch with split).

Split uses CP (DomainCPSize) but concat’s MLA path ignores CP: size checks and kernel indexing don’t include CP. This will mis-assemble caches when CP>1.

Apply the following updates:

  1. Validate ranks against PP×TP×CP for concat:
-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(targetRankInfo.mIRanks.size()
+        == static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
  2. For MLA, expect input caches = PP×CP:
-    if (selfCacheState.getAttentionConfig().mAttentionType == CacheState::AttentionType::kMLA)
-    {
-        inputCacheNum = targetRankInfo.mDomainPPSize;
-    }
+    if (selfCacheState.getAttentionConfig().mAttentionType == CacheState::AttentionType::kMLA)
+    {
+        inputCacheNum = targetRankInfo.mDomainPPSize * targetRankInfo.mDomainCPSize;
+    }
  3. Make MLA concat kernel CP-aware (signature + indexing) and pass DomainCPSize:
-__global__ void concatKVCacheForMLAKernel(T const** __restrict__ inputCaches, T** __restrict__ outputBlocks,
-    int tokensPerBlock, int numLayers, int headNum, int dimsPerHead, int outputBlockNum, int DomainPPSize,
-    int DomainTPSize, int layerNumDomainPP, int kvFactor)
+__global__ void concatKVCacheForMLAKernel(T const** __restrict__ inputCaches, T** __restrict__ outputBlocks,
+    int tokensPerBlock, int numLayers, int headNum, int dimsPerHead, int outputBlockNum, int DomainPPSize,
+    int DomainTPSize, int DomainCPSize, int layerNumDomainPP, int kvFactor)
@@
-                int inputCacheIdx = layerId / layerNumDomainPP;
-                T const* inputCachePtr = inputCaches[inputCacheIdx];
+                int const cpOffset = blockId % DomainCPSize;
+                int const inputCacheIdx = (layerId / layerNumDomainPP) * DomainCPSize + cpOffset;
+                T const* inputCachePtr = inputCaches[inputCacheIdx];
@@
-                T const* kInputPtr = inputCachePtr
-                    + blockId * (layerNumDomainPP * kvFactor * headNum * tokensPerBlock * dimsPerHead)
+                int const blockIdInDomainCP = blockId / DomainCPSize;
+                T const* kInputPtr = inputCachePtr
+                    + blockIdInDomainCP * (layerNumDomainPP * kvFactor * headNum * tokensPerBlock * dimsPerHead)
                     + layerIdInDomainPP * kvFactor * headNum * tokensPerBlock * dimsPerHead
                     + headIdInDomainTP * tokensPerBlock * dimsPerHead;

And pass DomainCPSize at call sites:

-            concatKVCacheForMLAKernel<T, mlaSubWarpSize, 16>(..., outputBlockNumSum, DomainPPSize, DomainTPSize,
-                    layerNumDomainPP, kvFactor);
+            concatKVCacheForMLAKernel<T, mlaSubWarpSize, 16>(..., outputBlockNumSum, DomainPPSize, DomainTPSize,
+                    DomainCPSize, layerNumDomainPP, kvFactor);

(Apply the same addition of DomainCPSize for the 8/4/2/1 vector-size cases.)

  4. Provide DomainCPSize from targetRankInfo:
-    int const DomainPPSize = targetRankInfo.mDomainPPSize;
-    int const DomainTPSize = targetRankInfo.mDomainTPSize;
+    int const DomainPPSize = targetRankInfo.mDomainPPSize;
+    int const DomainTPSize = targetRankInfo.mDomainTPSize;
+    int const DomainCPSize = targetRankInfo.mDomainCPSize;
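
One caveat on the sketch above: the blockId % DomainCPSize offset assumes blocks are interleaved round-robin across CP ranks on the split side; whatever layout splitKVCacheForMLAKernel actually produces must be mirrored here.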
♻️ Duplicate comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1102-1116: Re-enable assertions in verifyBlockData.

Tests won’t catch corruption while EXPECT_EQ is commented. For floats use EXPECT_NEAR(..., 0.0) or bitwise equality. This was flagged previously.

-                                // EXPECT_EQ(*dataPtr, generateValue);
+                                EXPECT_EQ(*dataPtr, generateValue);
@@
-                                    // EXPECT_EQ(*dataPtr, generateValue);
+                                    EXPECT_EQ(*dataPtr, generateValue);

Also applies to: 1127-1141

🧹 Nitpick comments (8)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)

231-235: Use correct format specifiers for size_t in logs.

Several args are size_t but formatted with %ld. Use %zu (or cast to unsigned long long and use %llu) to avoid UB on LP64/LLP64.

-    TLLM_LOG_INFO(
-        "CacheTransBufferManager: mMaxNumTokens:%ld, mRecvBufferCount:%ld, "
-        "mSendBufferCount:%ld, mTransferBufferSize:%ld, mPreAllocBufferSize:%ld, mOnlyUseDynamicBuffer:%d "
-        "mUseFabricMemory:%d mDataType:%d",
-        maxNumTokens.has_value() ? maxNumTokens.value() : 0, mRecvBufferCount, mSendBufferCount, mTransferBufferSize,
-        mPreAllocBufferSize, mOnlyUseDynamicBuffer, mUseFabricMemory, mDataType);
+    TLLM_LOG_INFO(
+        "CacheTransBufferManager: mMaxNumTokens:%zu, mRecvBufferCount:%zu, "
+        "mSendBufferCount:%zu, mTransferBufferSize:%zu, mPreAllocBufferSize:%zu, mOnlyUseDynamicBuffer:%d "
+        "mUseFabricMemory:%d mDataType:%d",
+        maxNumTokens.has_value() ? maxNumTokens.value() : 0, mRecvBufferCount, mSendBufferCount, mTransferBufferSize,
+        mPreAllocBufferSize, mOnlyUseDynamicBuffer, mUseFabricMemory, mDataType);
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

156-159: Remove internal “@b” question comment.

Keep code comments actionable; move questions to a follow-up issue.

-    // @B: This works as if all output caches are of the same size. Is this a fair assumption?

160-169: Typo: agentConnnection → agentConnection (and make pointer const).

Fix spelling for readability and consistency; apply in both places.

-    auto* agentConnnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
-    if (agentConnnection != nullptr)
+    auto const* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
+    if (agentConnection != nullptr)
@@
-        if (agentConnnection != nullptr)
+        if (agentConnection != nullptr)
         {
-            cacheBufferId = agentConnnection->getCacheBufferId();
+            cacheBufferId = agentConnection->getCacheBufferId();
             TLLM_CHECK(cacheBufferId.has_value());
         }

Also applies to: 346-352

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

114-131: CP domain calculation: LGTM, but add explicit error for zero-sizes.

The power-of-two check is good; also add an explicit check with a clear message when any dimension is zero, to avoid division by zero.

     for (auto val : {peerCPNum, selfCPNum})
     {
         TLLM_CHECK(isPowerOfTwo(val));
     }
+    TLLM_CHECK_WITH_INFO(peerCPNum > 0 && selfCPNum > 0, "Context parallelism must be > 0.");

499-507: Remove “@b” question; document rationale once.

Keep kernels free of conversational comments; add a brief note in header instead.

-// @B: Why do we not use Domain{P,T,C}PSize?

937-944: Reduce log spam: demote to DEBUG and avoid per-rank dumps.

Printing every IRank at INFO can flood logs. Prefer a single DEBUG line.

-    TLLM_LOG_INFO("[splitKVCache] targetRankInfo.mIRanks.size(): %d", targetRankInfo.mIRanks.size());
-    for (auto rank : targetRankInfo.mIRanks)
-    {
-        TLLM_LOG_INFO("[splitKVCache] target rank: %d, ", rank);
-    }
+    TLLM_LOG_DEBUG("[splitKVCache] IRanks.size(): %zu", targetRankInfo.mIRanks.size());
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

646-651: Remove conversational “@b” comments.

These distract in tests; move notes to a doc or issue.

-        // @B: What are shared blocks?

946-954: Harden getenv parsing.

std::stoi throws; prefer strtol and default on errors to avoid test crashes with invalid env.

-        return std::stoi(env);
+        char* end{};
+        long v = std::strtol(env, &end, 10);
+        return (end && *end == '\0') ? static_cast<int>(v) : -2;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e465f43 and b37ff08.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (7 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (17 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (36 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}: Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
Avoid using literals (except 0, nullptr, true, false) directly in logic; use named constants for comparisons
Use Allman brace style in C++
Place semicolon of empty for/while loop on its own line
Use brace-delimited statements for bodies of switch/while/do/for and always brace if/else bodies
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, externally visible globals use g prefix with lowerCamelCase (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix with lowerCamelCase (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use m prefix with CamelCase (public may omit but encouraged)
Constants (enums, globals, static consts, function-scope magic numbers) use k prefix with UPPER_SNAKE (e.g., kDIGIT_NUM)
Function-scope non-literal, non-magic constants use normal non-const naming (e.g., const bool pass)
If macros are necessary, name them in UPPER_SNAKE_CASE
Avoid Hungarian notation except allowed app’s hungarian like nb for counts
Constructor parameters conflicting with member names get a trailing underscore (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L not 1234l)
Format C++ with clang-format (LLVM style), max line length 120; justify any exceptions with clang-format off/on blocks
Use C++-style comments; C comments not allowed except special inline cases; single-line comments use //
Use inline parameter comments in calls when arguments aren’t obvious (e.g., /* checkForErrors = */ false)
Disable code with #if/#endif (optionally mnemonic conditions or no-op macros); do not comment out code; avoid dead code
Use the least forceful C++ cast; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cas...

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cc,cpp,cxx,cu}: Prefer const or constexpr variables over #define for constants in C++
Declare variables const if not modified after initialization
Use smart pointers for heap allocation; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only exceptionally; avoid deprecated smart pointers
Avoid declaring large functions inline unless there’s a quantifiable benefit; remember in-class definitions are implicitly inline
Every defined function must be referenced at least once; avoid unused methods

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces, not tabs; indent 4 spaces

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧠 Learnings (6)
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-25T00:03:39.294Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1185-1189
Timestamp: 2025-08-25T00:03:39.294Z
Learning: TLLM_CHECK_WITH_INFO is a host-side utility function and cannot be called from CUDA device functions (those marked with __device__ or __global__). In device code, assert() is the primary mechanism for handling "should never happen" conditions, and like standard C++ assert, CUDA's assert only works in debug builds and is compiled out in release builds.

Applied to files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧬 Code graph analysis (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • fromAllBlockIds (34-41)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
  • numLayers (288-389)
  • numLayers (607-715)
  • numLayers (607-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (3)

154-159: PP×CP sizing/divisibility checks: LGTM.


489-491: Several issues in your instrumentation can easily lead to those “extra” keys and counts:

• You’re recording during the decode phase (attn_metadata.prefill_metadata is None) rather than the prefill phase, so you’re indexing into a cache that already contains the prefix KV state plus newly appended entries. That means the “first” token you read isn’t always the same <im_start> vector.
• You loop over every entry in block_tables and never reset your global block_recorder, so you record every sub-block of every request (and every invocation), not just one sample per request. That drives your counts into the hundreds of thousands.
• The shared Python dict and file writes aren’t synchronized across async threads—using “w” on each invocation will race and overwrite your own data.

Fix it by flipping the condition to record only in prefill (attn_metadata.prefill_metadata is not None), capturing exactly one vector per request (e.g., guard with a per-request flag), and moving file I/O and summarization outside the hot path (or using a thread-safe logger).


191-207: Verify slice-to-connection mapping ordering
Modulo-based cacheIdx logic assumes session.getConnections() returns connections in PP-major then CP-minor order. Confirm all getConnections() implementations (MPI, Agent, UCX) adhere to that ordering.
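A minimal sketch of the assumed flattening (names hypothetical, not project code): if getConnections() is PP-major then CP-minor, index i encodes (pp, cp) as i = pp * cpDomainSize + cp, and division/modulo recovers the coordinates:

struct DomainCoord
{
    int pp;
    int cp;
};

// Inverse of i = pp * cpDomainSize + cp under PP-major/CP-minor ordering.
DomainCoord coordFromCacheIdx(int cacheIdx, int cpDomainSize)
{
    return DomainCoord{cacheIdx / cpDomainSize, cacheIdx % cpDomainSize};
}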

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

60-66: CP dimension plumbed into rank math: LGTM.


529-541: MLA split indexing needs CP>1 validation. Add (or confirm) a test for a 1-layer, 1-head, 2-token, CP=2 case (e.g., INT8) that exercises outputCacheIdx = (layerId / CP) * CPSize + blockId % CP and shows there is no off-by-one interleaving.
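Until that test exists, the interleaving can be hand-checked with a throwaway enumeration of the quoted formula (illustrative only; CP here stands for the CP domain size):

#include <cstdio>

int main()
{
    int const cpSize = 2; // CP domain size in the quoted formula
    int const layerId = 0;
    for (int blockId = 0; blockId < 4; ++blockId)
    {
        int const outputCacheIdx = (layerId / cpSize) * cpSize + blockId % cpSize;
        std::printf("blockId %d -> outputCacheIdx %d\n", blockId, outputCacheIdx);
    }
    return 0;
}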

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)

526-529: World-size guard for 4-proc tests: LGTM.


688-688: Clarify whether totalNumBlocks should account for CP.

If each CP shard owns a fraction of blocks, divide by mCpSize; if blocks are replicated per CP, keep as is. Please confirm intended semantics.


1456-1474: New CP=2 MLA test: LGTM.

Good minimal CP coverage.

Comment on lines 153 to 161
     size_t const pPDomainSize = targetInfo.mDomainPPSize;
-    TLLM_CHECK((cacheBlockSize * blockNum) % pPDomainSize == 0);
-    auto const targetBufferSize = (cacheBlockSize * blockNum) / pPDomainSize;
+    size_t const cPDomainSize = targetInfo.mDomainCPSize;
+    TLLM_CHECK((cacheBlockSize * blockNum) % (pPDomainSize * cPDomainSize) == 0);
+    // @B: This works as if all output caches are of the same size. Is this a fair assumption?
+    auto const targetBufferSize = (cacheBlockSize * blockNum) / (pPDomainSize * cPDomainSize);
+    TLLM_LOG_INFO("[MLACacheFormatter::format] BEFORE getOrAllocateSendBuffers cacheBlockSize: %zu, blockNum: %d, pPDomainSize: %zu, cPDomainSize: %zu, targetBufferSize: %zu", cacheBlockSize, blockNum, pPDomainSize, cPDomainSize, targetBufferSize);
     auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
-        cacheBufferId, pPDomainSize, targetBufferSize, bufferManager);
+        cacheBufferId, pPDomainSize * cPDomainSize, targetBufferSize, bufferManager);
     auto& outputSplitCaches = std::get<0>(result);

🛠️ Refactor suggestion

Guard: ensure connections.size() matches PP×CP before modulo mapping.

Add a sanity check to prevent modulo on a smaller connection set, which would scatter to wrong peers.

     auto targetInfo = executor::kv_cache::targetIRanks(destConfig, selfConfig, selfIdx);
     size_t const pPDomainSize = targetInfo.mDomainPPSize;
     size_t const cPDomainSize = targetInfo.mDomainCPSize;
+    TLLM_CHECK_WITH_INFO(
+        connections.size() == pPDomainSize * cPDomainSize,
+        "Mismatch: number of connections (%zu) must equal PP×CP (%zu).",
+        connections.size(), pPDomainSize * cPDomainSize);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     size_t const pPDomainSize = targetInfo.mDomainPPSize;
     size_t const cPDomainSize = targetInfo.mDomainCPSize;
+    TLLM_CHECK_WITH_INFO(
+        connections.size() == pPDomainSize * cPDomainSize,
+        "Mismatch: number of connections (%zu) must equal PP×CP (%zu).",
+        connections.size(), pPDomainSize * cPDomainSize);
     TLLM_CHECK((cacheBlockSize * blockNum) % (pPDomainSize * cPDomainSize) == 0);
     // @B: This works as if all output caches are of the same size. Is this a fair assumption?
     auto const targetBufferSize = (cacheBlockSize * blockNum) / (pPDomainSize * cPDomainSize);
     TLLM_LOG_INFO("[MLACacheFormatter::format] BEFORE getOrAllocateSendBuffers cacheBlockSize: %zu, blockNum: %d, "
         "pPDomainSize: %zu, cPDomainSize: %zu, targetBufferSize: %zu",
         cacheBlockSize, blockNum, pPDomainSize, cPDomainSize, targetBufferSize);
     auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
         cacheBufferId, pPDomainSize * cPDomainSize, targetBufferSize, bufferManager);
     auto& outputSplitCaches = std::get<0>(result);
🤖 Prompt for AI Agents
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp around lines 153 to 161:
the code assumes the number of connections equals pPDomainSize * cPDomainSize
before doing modulo-based mapping which can scatter to wrong peers if
connections is smaller; add a sanity guard that verifies connections.size() >=
pPDomainSize * cPDomainSize (or == if strict) and fail fast with an explanatory
TLLM_CHECK or error log if the condition is not met, before computing
targetBufferSize and calling getOrAllocateSendBuffers; ensure the check prevents
division/modulo mapping against a smaller connection set and include minimal
context in the error message (e.g., actual sizes) so callers can debug.

auto blockRange = getBlockRangeForReceiving(mCacheManager, llmRequest);
printf("[MLACacheFormatter::unformat] pickUpConnections.size(): %zu, connections.size(): %zu, blockRange.size(): %zu\n", pickUpConnections.size(), connections.size(), blockRange.size());
std::vector<runtime::ITensor::SharedPtr> recvBufferTmps;

🛠️ Refactor suggestion

Replace stdout printf with logger.

Use TLLM_LOG_DEBUG to keep debug output consistent and filterable.

-    printf("[MLACacheFormatter::unformat] pickUpConnections.size(): %zu, connections.size(): %zu, blockRange.size(): %zu\n", pickUpConnections.size(), connections.size(), blockRange.size());
+    TLLM_LOG_DEBUG("[MLACacheFormatter::unformat] pickUpConnections:%zu, connections:%zu, blockRange:%zu",
+        pickUpConnections.size(), connections.size(), blockRange.size());

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp around line 304, there's
a debug print to stdout using printf; replace that call with TLLM_LOG_DEBUG to
make debug output consistent and filterable. Change the printf to a
TLLM_LOG_DEBUG call that includes the same formatted message and variables,
ensure the file includes the logger header if missing, and remove the stdout
printf so all debug messages use TLLM_LOG_DEBUG.

Comment on lines +1223 to 1228
printf("[splitKVCacheDispatch] selfIdx: %d, kVCacheBlocksPerWindow.size(): %zu, ouputSplitBlocks.size(): %zu\n", selfIdx, kVCacheBlocksPerWindow.size(), ouputSplitBlocks.size());
for (auto const& [window, blocks] : kVCacheBlocksPerWindow)
{
printf("[splitKVCacheDispatch] window: %zu, blocks.size(): %zu\n", window, blocks.size());
}
auto dataType = kVCacheBlocksPerWindow.begin()->second.front()->getDataType();

🛠️ Refactor suggestion

Replace printf in dispatch with logger.

-    printf("[splitKVCacheDispatch] selfIdx: %d, kVCacheBlocksPerWindow.size(): %zu, ouputSplitBlocks.size(): %zu\n", selfIdx, kVCacheBlocksPerWindow.size(), ouputSplitBlocks.size());
-    for (auto const& [window, blocks] : kVCacheBlocksPerWindow)
-    {
-        printf("[splitKVCacheDispatch] window: %zu, blocks.size(): %zu\n", window, blocks.size());
-    }
+    TLLM_LOG_DEBUG("[splitKVCacheDispatch] selfIdx: %d, windows: %zu, outSplits: %zu",
+        selfIdx, kVCacheBlocksPerWindow.size(), ouputSplitBlocks.size());
+    for (auto const& [window, blocks] : kVCacheBlocksPerWindow)
+    {
+        TLLM_LOG_DEBUG("[splitKVCacheDispatch] window: %zu, blocks: %zu", window, blocks.size());
+    }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu around lines
1223-1228, replace the C-style printf calls with the project's logging API (use
the existing logger macro/function such as LOG_INFO or spdlog::info or the
project's logger instance) and preserve the same diagnostic text and values;
format sizes with the logger's preferred formatting (e.g., {} placeholders or
streams) instead of printf specifiers, and ensure you still check
kVCacheBlocksPerWindow is non-empty before calling
begin()->second.front()->getDataType() (move or guard the dataType access
accordingly).

Comment on lines +1544 to +1548
printf("[concatKvCacheV2Dispatch] selfIdx: %d, inputSplitBlocks.size(): %zu, outputKvCacheBlocksPerWindow.size(): %zu\n", selfIdx, inputSplitBlocks.size(), outputKvCacheBlocksPerWindow.size());
for (auto const& [window, blocks] : outputKvCacheBlocksPerWindow)
{
printf("[concatKvCacheV2Dispatch] window: %zu, blocks.size(): %zu\n", window, blocks.size());
}

🛠️ Refactor suggestion

Replace printf in concat dispatch with logger.

-    printf("[concatKvCacheV2Dispatch] selfIdx: %d, inputSplitBlocks.size(): %zu, outputKvCacheBlocksPerWindow.size(): %zu\n", selfIdx, inputSplitBlocks.size(), outputKvCacheBlocksPerWindow.size());
-    for (auto const& [window, blocks] : outputKvCacheBlocksPerWindow)
-    {
-        printf("[concatKvCacheV2Dispatch] window: %zu, blocks.size(): %zu\n", window, blocks.size());
-    }
+    TLLM_LOG_DEBUG("[concatKvCacheV2Dispatch] selfIdx: %d, inputSplits: %zu, windows: %zu",
+        selfIdx, inputSplitBlocks.size(), outputKvCacheBlocksPerWindow.size());
+    for (auto const& [window, blocks] : outputKvCacheBlocksPerWindow)
+    {
+        TLLM_LOG_DEBUG("[concatKvCacheV2Dispatch] window: %zu, blocks: %zu", window, blocks.size());
+    }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +835 to 836
std::iota(contextRankVec.begin(), contextRankVec.end(), 0);
mContextCommState = std::make_unique<tensorrt_llm::executor::kv_cache::CommState>(contextRankVec);

🛠️ Refactor suggestion

Missing header for std::iota.

Add <numeric>.

@@
 #include <random>
+#include <numeric>
🤖 Prompt for AI Agents
In cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp around lines 835-836,
the code uses std::iota but the <numeric> header is not included; add #include
<numeric> to the top of the file (or the appropriate header section) so
std::iota is declared and the file compiles.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

447-449: Fix operator precedence: mask the expected byte, not the boolean result.

== binds tighter than &, so the mask is applied to the boolean, not the expected value.

Apply this diff:

-            EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+            EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                [&llmRequest](uint8_t i) { return i == (llmRequest->getPromptLen() & 0xff); }));
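A tiny standalone demo of the pitfall (values here are assumed, for illustration only):

#include <cassert>

int main()
{
    int const promptLen = 0x1FF;  // 511; low byte is 0xFF
    unsigned char const byte = 0xFF;
    // == binds tighter than &, so this evaluates (byte == promptLen) & 0xff.
    bool const unparenthesized = byte == promptLen & 0xff;
    // Parenthesizing applies the mask to the expected value, as intended.
    bool const parenthesized = byte == (promptLen & 0xff);
    assert(!unparenthesized && parenthesized);
    return 0;
}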

660-664: Variable shadowing prevents SELF-K-ONLY mode from ever being selected.

auto cacheType = ... inside the if-statement shadows the outer cacheType.

Apply this diff:

         CacheType cacheType = CacheType::kSELF;
         if (kvFactor == 1)
         {
-            auto cacheType = CacheType::kSELFKONLY;
+            cacheType = CacheType::kSELFKONLY;
         }

164-171: gMock expectations missing argument matchers for a parameterized method.

getCounterpartsCount takes a requestId; expectations should use _.

Apply this diff:

-        ON_CALL(*this, getCounterpartsCount).WillByDefault(Return(1));
+        ON_CALL(*this, getCounterpartsCount(testing::_)).WillByDefault(Return(1));
-    EXPECT_CALL(*sender, getCounterpartsCount).WillOnce(Return(1));
+    EXPECT_CALL(*sender, getCounterpartsCount(testing::_)).WillOnce(Return(1));

Also applies to: 223-225


372-372: Logging format string missing placeholders (extra varargs ignored/UB).

bufferSize is passed but not formatted.

Apply this diff:

-                TLLM_LOG_DEBUG(tensorrt_llm::mpi::MpiComm::world().getRank(), "recv buffer from 0", bufferSize);
+                TLLM_LOG_DEBUG(tensorrt_llm::mpi::MpiComm::world().getRank(), "recv buffer from 0: %ld", bufferSize);
-                    TLLM_LOG_DEBUG(tensorrt_llm::mpi::MpiComm::world().getRank(), "recv buffer from 0", bufferSize);
+                    TLLM_LOG_DEBUG(tensorrt_llm::mpi::MpiComm::world().getRank(), "recv buffer from 0: %ld", bufferSize);

Also applies to: 815-815

♻️ Duplicate comments (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

1102-1108: Don’t skip value checks; this hides data corruption. Use float-safe comparisons.

Guarding on != 0 bypasses verification for zero values. Compare unconditionally; use EXPECT_NEAR for floats.

Apply this diff (KEY path):

-                                if (*dataPtr != static_cast<ValueType>(0)) {
-                                    EXPECT_EQ(*dataPtr, generateValue);
-                                } else {
-                                    // // TODO: Remove this when over-allocation is fixed.
-                                    // printf("[verifyBlockData::key] SKIPPING 0! \n");
-                                }
+                                if constexpr (std::is_floating_point_v<ValueType>) {
+                                    EXPECT_NEAR(static_cast<double>(*dataPtr),
+                                                static_cast<double>(generateValue), 0.0);
+                                } else {
+                                    EXPECT_EQ(*dataPtr, generateValue);
+                                }

And (VALUE path):

-                                    if (*dataPtr != static_cast<ValueType>(0)) {
-                                        EXPECT_EQ(*dataPtr, generateValue);
-                                    } else {
-                                        // // TODO: Remove this when over-allocation is fixed.
-                                        // printf("[verifyBlockData::value] SKIPPING 0! \n");
-                                    }
+                                    if constexpr (std::is_floating_point_v<ValueType>) {
+                                        EXPECT_NEAR(static_cast<double>(*dataPtr),
+                                                    static_cast<double>(generateValue), 0.0);
+                                    } else {
+                                        EXPECT_EQ(*dataPtr, generateValue);
+                                    }

Add the missing header (outside these ranges):

+#include <type_traits>

Also applies to: 1132-1138


834-836: Missing header for std::iota.

Add the correct include to avoid build failure.

Outside this hunk, add:

+#include <numeric>
🧹 Nitpick comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

688-689: Block provisioning may be over-allocating under CP; consider per-CP sizing.

For per-rank KV managers, total blocks likely should be divided by mCpSize; current code allocates for all CP shards on each rank.

If intended per-rank allocation, change to:

-        auto blocksPerWindow = BlocksPerWindow{{maxAttentionWindow, {totalNumBlocks, blocksInSecondaryPool}}};
+        auto const perCpBlocks = totalNumBlocks / std::max(1, mCpSize);
+        auto blocksPerWindow = BlocksPerWindow{{maxAttentionWindow, {perCpBlocks, blocksInSecondaryPool}}};

Please confirm the expected ownership semantics of blocks under CP so we can finalize this safely.


945-954: Harden parsing of TLLM_DEBUG_RANK.

std::stoi throws on bad input; prefer strtol with validation.

Apply this diff:

-        char const* const env = std::getenv("TLLM_DEBUG_RANK");
-        if (env == nullptr)
-        {
-            return -2;  // -1 means all ranks, -2 means no debug rank.
-        }
-        return std::stoi(env);
+        if (auto const* env = std::getenv("TLLM_DEBUG_RANK"); env != nullptr)
+        {
+            char* end = nullptr;
+            long v = std::strtol(env, &end, 10);
+            if (end && *end == '\0') { return static_cast<int>(v); }
+        }
+        return -2; // -1 means all ranks, -2 means no debug rank.

373-376: Remove unused std::istringstream iStream variables.

They are redundant with VectorWrapBuf streams.

Apply this diff:

-                std::istringstream iStream(std::string(recvBuffer.begin(), recvBuffer.end()));
                 su::VectorWrapBuf<char> strbuf(recvBuffer);
                 std::istream is(&strbuf);

(and similarly in the later block)

Also applies to: 816-821


1-11: Update copyright year per repo guideline.

Guidelines require current year; header lists 2023-2024.

Apply this diff:

- * SPDX-FileCopyrightText: Copyright (c) 2023-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b37ff08 and d902264.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (37 hunks)
🧰 Additional context used
📓 Path-based instructions (5)

Same guidelines as listed in the first review.

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧠 Learnings (5)

Same entries as listed in the first review.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • fromAllBlockIds (34-41)
cpp/tensorrt_llm/executor/cache_transmission/mpi_utils/connection.h (1)
  • mRank (38-39)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
🔇 Additional comments (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

1466-1483: Good addition: CP-aware MLA test instantiation.

Covers CP growth on generation; pairs well with verify path using mCpRank. Once the verification guards are removed, this should give reliable signal.


1079-1080: Token start index: CP mapping looks correct; please confirm intent.

fillBlockData uses startTokenId = blockId * tokensPerBlock (context CP=1). verifyBlockData uses (blockId * mCpSize + mCpRank) * tokensPerBlock (generation CP≥1), i.e., global-block indexing interleaved by CP. This matches “global coordinate” comments, but double-check it matches the split/concat kernels’ semantics.

Would you like me to generate a focused test asserting equality across CP ranks for a small hand-crafted tensor to lock this in?

Also applies to: 982-982
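To make the intended striping concrete, a quick illustrative enumeration (assumed semantics, not project code): local block b on CP rank r owns global block b * cpSize + r, so the ranks jointly tile the global block sequence exactly once:

#include <cstdio>

int main()
{
    int const cpSize = 2;
    int const localBlocks = 3;
    int const tokensPerBlock = 4;
    for (int cpRank = 0; cpRank < cpSize; ++cpRank)
    {
        for (int blockId = 0; blockId < localBlocks; ++blockId)
        {
            int const startTokenId = (blockId * cpSize + cpRank) * tokensPerBlock;
            std::printf("cpRank %d blockId %d -> startTokenId %d\n", cpRank, blockId, startTokenId);
        }
    }
    return 0;
}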

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

646-650: Clarify block allocation and shared block terminology.

The comments mention "shared blocks" but the context is unclear. Consider adding documentation explaining:

  • What constitutes a shared block vs. unshared block
  • How CP affects block sharing semantics
  • Whether totalNumBlocks should be divided by mCpSize as suggested in line 687

1278-1280: Consider expanding test iteration for robustness.

The test currently runs with minimal iterations (for (int i = 0; i < 1; i++)) and a single length ({8}). For more comprehensive testing, consider:

  • Running multiple iterations to test cache reuse scenarios
  • Testing with varied sequence lengths to exercise different block allocation patterns
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d902264 and 4c51b47.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (37 hunks)
🧰 Additional context used
📓 Path-based instructions (5)

Same guidelines as listed in the first review.

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧠 Learnings (5)

Same entries as listed in the first review.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • fromAllBlockIds (34-41)
cpp/tensorrt_llm/executor/cache_transmission/mpi_utils/connection.h (1)
  • mRank (38-39)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
🔇 Additional comments (5)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (5)

882-893: Approve CP-aware token count calculation.

The tokenCountAdjustedForCP function correctly implements block distribution among CP ranks, handling the case where blocks are not evenly divisible by mCpSize.
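For readers without the diff open, a plausible shape of that helper (a hypothetical reconstruction, not the PR's code): blocks are striped across CP ranks, and ranks below the remainder own one extra block when the block count is not divisible by cpSize:

// Hypothetical reconstruction of the striped-ownership math.
int tokenCountAdjustedForCP(int numTokens, int tokensPerBlock, int cpSize, int cpRank)
{
    int const numBlocks = (numTokens + tokensPerBlock - 1) / tokensPerBlock;
    int const baseBlocks = numBlocks / cpSize;
    int const extraBlock = (cpRank < numBlocks % cpSize) ? 1 : 0;
    return (baseBlocks + extraBlock) * tokensPerBlock;
}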


1098-1098: Review token ID calculation for CP verification.

The token ID calculation in verifyBlockData now accounts for CP-distributed blocks:

int startTokenId = (blockId * mCpSize + mCpRank) * tokensPerBlock;

This assumes blocks are striped across CP ranks; verify that it matches the block allocation strategy in tokenCountAdjustedForCP.


1486-1504: Approve CP-focused test instantiation.

The new test case AsymmetricCaseTestWithCPForMLA properly exercises context parallelism with MLA, using reasonable parameter combinations for focused CP testing.


1575-1847: Comprehensive CP test coverage added.

The extensive test cases in targetTest.CacheStateNODP provide thorough verification of rank mapping logic across various TP/PP/CP growth and shrink scenarios. The test systematically validates:

  • TP shrinks/grows with CP changes
  • PP shrinks/grows with CP changes
  • Multi-dimensional parallelism transitions
  • Target rank calculation correctness

This demonstrates robust test coverage for the CP feature.


965-974: Approve MPI debugging infrastructure.

The getEnvMpiDebugRank() function provides useful debugging capability with proper environment variable handling and clear return value semantics (-1 for all ranks, -2 for disabled).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

447-449: Operator precedence bug in equality vs bitmask check

== has higher precedence than &; the current lambda evaluates (i == promptLen) & 0xff. Parenthesize the mask target.

-                EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                    [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+                EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                    [&llmRequest](uint8_t i) { return i == (llmRequest->getPromptLen() & 0xff); }));

660-664: Shadowed variable prevents K-only cache mode

auto cacheType = ... inside the if creates a new local and leaves the outer cacheType as kSELF.

         CacheType cacheType = CacheType::kSELF;
         if (kvFactor == 1)
         {
-            auto cacheType = CacheType::kSELFKONLY;
+            cacheType = CacheType::kSELFKONLY;
         }
♻️ Duplicate comments (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

834-837: Missing header for std::iota

 #include <random>
+#include <numeric>

1123-1128: Don’t skip 0 values in verification; this masks corruption

Guarding EXPECT_EQ behind *dataPtr != 0 defeats the test whenever the payload happens to encode zeros or when transmission dropped a region. Always compare against the generated expectation.

-                                if (*dataPtr != static_cast<ValueType>(0)) {
-                                    EXPECT_EQ(*dataPtr, generateValue);
-                                } else {
-                                    // // TODO: Remove this when over-allocation is fixed.
-                                    // printf("[verifyBlockData::key] SKIPPING 0! \n");
-                                }
+                                EXPECT_EQ(*dataPtr, generateValue);
-                                    if (*dataPtr != static_cast<ValueType>(0)) {
-                                        EXPECT_EQ(*dataPtr, generateValue);
-                                    } else {
-                                        // // TODO: Remove this when over-allocation is fixed.
-                                        // printf("[verifyBlockData::value] SKIPPING 0! \n");
-                                    }
+                                    EXPECT_EQ(*dataPtr, generateValue);

If partial last blocks are expected with CP, bound the token loop to the valid token count for the current rank rather than skipping zeros; I can provide a patch if you confirm the desired bounds.

Also applies to: 1153-1158
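If the partial-last-block case is real, one way to bound the loop is sketched below (assumed striped semantics; the name and signature are hypothetical):

// Valid tokens in a local block when the last global block may be partial.
int validTokensInLocalBlock(
    int localBlockId, int cpSize, int cpRank, int totalTokens, int tokensPerBlock)
{
    int const globalBlock = localBlockId * cpSize + cpRank;
    int const begin = globalBlock * tokensPerBlock;
    int const end = begin + tokensPerBlock;
    int const clampedEnd = end < totalTokens ? end : totalTokens;
    return clampedEnd > begin ? clampedEnd - begin : 0;
}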

🧹 Nitpick comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

998-1002: Align token indexing in fill with CP layout

Generation uses startTokenId = (blockId * mCpSize + mCpRank) * tokensPerBlock, but fill uses blockId * tokensPerBlock. Unify to CP-aware indexing to avoid implicit assumptions.

-        int startTokenId = blockId * tokensPerBlock;
+        int startTokenId = (blockId * mCpSize + mCpRank) * tokensPerBlock;

353-359: Replace magic MPI tags with named constants

Hard-coded tags risk collisions; use named constants to document intent and avoid reuse mistakes.

Add near the existing anonymous namespace (after Line 70):

 namespace
 {
 std::mutex mDllMutex;
+constexpr int kTagCommStateSize = 0x1F;
+constexpr int kTagCommStatePayload = 0x2F;

Then replace usages:

-                tensorrt_llm::mpi::MpiComm::world().sendRawTag(
-                    &bufferSize, 1, tensorrt_llm::mpi::MpiType::kINT64, genRank, 0x1F);
+                tensorrt_llm::mpi::MpiComm::world().sendRawTag(
+                    &bufferSize, 1, tensorrt_llm::mpi::MpiType::kINT64, genRank, kTagCommStateSize);
-                tensorrt_llm::mpi::MpiComm::world().sendRawTag(
-                    buffer.data(), buffer.size(), tensorrt_llm::mpi::MpiType::kCHAR, genRank, 0x2F);
+                tensorrt_llm::mpi::MpiComm::world().sendRawTag(
+                    buffer.data(), buffer.size(), tensorrt_llm::mpi::MpiType::kCHAR, genRank, kTagCommStatePayload);
-                tensorrt_llm::mpi::MpiComm::world().recvRawTag(
-                    &bufferSize, 1, tensorrt_llm::mpi::MpiType::kINT64, 0, 0x1F);
+                tensorrt_llm::mpi::MpiComm::world().recvRawTag(
+                    &bufferSize, 1, tensorrt_llm::mpi::MpiType::kINT64, 0, kTagCommStateSize);
-                tensorrt_llm::mpi::MpiComm::world().recvRawTag(
-                    recvBuffer.data(), bufferSize, tensorrt_llm::mpi::MpiType::kCHAR, 0, 0x2F);
+                tensorrt_llm::mpi::MpiComm::world().recvRawTag(
+                    recvBuffer.data(), bufferSize, tensorrt_llm::mpi::MpiType::kCHAR, 0, kTagCommStatePayload);

Also applies to: 365-373, 795-801, 809-816


902-904: Prefer TLLM logging over printf for consistency

Use TLLM_LOG_INFO/DEBUG to keep output consistent and rank-scoped.

-        printf("[addRequestAndTransportCacheForContext] mRankInInstance: %d numTokensAdjustedForCP: %d\n", mRankInInstance, numTokensAdjustedForCP);
+        TLLM_LOG_INFO("[addRequestAndTransportCacheForContext] mRankInInstance: %d numTokensAdjustedForCP: %d",
+            mRankInInstance, numTokensAdjustedForCP);
-        printf("[addRequestAndTransportCacheForGeneration] mRankInInstance: %d numTokensAdjustedForCP: %d\n", mRankInInstance, numTokensAdjustedForCP);
+        TLLM_LOG_INFO("[addRequestAndTransportCacheForGeneration] mRankInInstance: %d numTokensAdjustedForCP: %d",
+            mRankInInstance, numTokensAdjustedForCP);

Also applies to: 939-941


40-47: Missing headers used in this file

std::ostringstream/std::istringstream and std::all_of are used; include <sstream> and <algorithm> locally to avoid relying on transitive includes.

 #include <cstdint>
 #include <cstdio>
 #include <cstdlib>
 #include <memory>
 #include <random>
+#include <algorithm>
+#include <sstream>
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4c51b47 and 9d380bd.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (37 hunks)
🧰 Additional context used
📓 Path-based instructions (5)

Same guidelines as listed in the first review.

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧠 Learnings (6)
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-09-02T13:42:44.844Z
Learnt from: pcastonguay
PR: NVIDIA/TensorRT-LLM#7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.844Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • fromAllBlockIds (34-41)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

686-689: Capacity vs CP split: confirm block pool sizing

Total blocks are computed without factoring in CP. If each CP rank handles only a fraction of the tokens, either divide capacity per rank or justify the intentional overprovisioning to avoid fragmentation.

I can add an assertion comparing allocated block count vs. tokenCountAdjustedForCP(...)/tokensPerBlock to ensure sizing is tight. Want a patch?
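
For illustration, a standalone sketch of the per-rank distribution arithmetic such an assertion would check. blocksForRank is a hypothetical helper mirroring the test's CP-aware sizing, not the fixture's actual API:

#include <gtest/gtest.h>

// Hypothetical helper mirroring the per-rank block split discussed above.
static int blocksForRank(int numTokens, int tokensPerBlock, int cpSize, int cpRank)
{
    int const numTotalBlocks = (numTokens + tokensPerBlock - 1) / tokensPerBlock;
    int numBlocksCurrRank = numTotalBlocks / cpSize;
    if (numTotalBlocks % cpSize > cpRank)
    {
        numBlocksCurrRank++;
    }
    return numBlocksCurrRank;
}

TEST(CacheTransceiverSizing, BlocksSplitAcrossCpRanks)
{
    // 100 tokens at 16 tokens/block -> 7 blocks total; CP=2 splits them 4/3.
    EXPECT_EQ(blocksForRank(/* numTokens = */ 100, /* tokensPerBlock = */ 16, /* cpSize = */ 2, /* cpRank = */ 0), 4);
    EXPECT_EQ(blocksForRank(100, 16, 2, /* cpRank = */ 1), 3);
}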

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)

447-449: Fix operator precedence in byte check.

Because == binds tighter than &, the lambda currently evaluates (i == promptLen) & 0xff, so the check fails whenever promptLen > 255. Parenthesize the mask.

-                EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                    [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+                EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                    [&llmRequest](uint8_t i) { return i == (llmRequest->getPromptLen() & 0xff); }));
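
For reference, the trap in isolation: without parentheses, the mask is applied to the comparison result rather than to promptLen. A standalone snippet with made-up values:

#include <cstdint>
#include <iostream>

int main()
{
    int const promptLen = 300;          // > 255; low byte is 300 & 0xff == 44
    uint8_t const byteVal = 300 & 0xff; // 44, as the test data would store it

    bool const broken = byteVal == promptLen & 0xff;  // (byteVal == promptLen) & 0xff -> false
    bool const fixed = byteVal == (promptLen & 0xff); // byteVal == 44 -> true

    std::cout << "broken: " << broken << ", fixed: " << fixed << '\n'; // broken: 0, fixed: 1
}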

662-665: Bug: shadowed cacheType never applied.

Inside the kvFactor == 1 branch, auto cacheType declares a new block-scope variable that shadows the outer one, so kSELFKONLY is never applied.

-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
+        if (kvFactor == 1)
+        {
+            cacheType = CacheType::kSELFKONLY;
+        }
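
The shadowing pitfall in isolation (a standalone sketch; CacheType here is a stand-in enum, not the real type):

enum class CacheType { kSELF, kSELFKONLY };

int main()
{
    auto cacheType = CacheType::kSELF;
    int const kvFactor = 1;
    if (kvFactor == 1)
    {
        auto cacheType = CacheType::kSELFKONLY; // new block-scope variable; outer one untouched
        static_cast<void>(cacheType);
    }
    // cacheType is still kSELF here, which is exactly the bug above.
    return cacheType == CacheType::kSELF ? 0 : 1;
}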

783-831: Replace magic MPI tags in cacheTransceiverTest.cpp with named constants
Define kCOMM_STATE_SIZE_TAG and kCOMM_STATE_DATA_TAG (e.g. constexpr int kCOMM_STATE_SIZE_TAG = 0x101; and constexpr int kCOMM_STATE_DATA_TAG = 0x102;) and use them in both the sendRawTag and recvRawTag calls instead of 0x1F/0x2F to guarantee a disjoint tag namespace.
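
A minimal sketch of the suggested change. Plain MPI is used here for self-containment; the test's sendRawTag/recvRawTag wrappers would take the same tag arguments, and sendCommState/recvCommState are illustrative names:

#include <mpi.h>

#include <cstdint>
#include <vector>

// Named tags keep the size and payload messages in disjoint namespaces.
constexpr int kCOMM_STATE_SIZE_TAG = 0x101;
constexpr int kCOMM_STATE_DATA_TAG = 0x102;

void sendCommState(std::vector<uint8_t> const& state, int peer)
{
    uint64_t size = state.size();
    MPI_Send(&size, 1, MPI_UINT64_T, peer, kCOMM_STATE_SIZE_TAG, MPI_COMM_WORLD);
    MPI_Send(state.data(), static_cast<int>(size), MPI_UINT8_T, peer, kCOMM_STATE_DATA_TAG, MPI_COMM_WORLD);
}

std::vector<uint8_t> recvCommState(int peer)
{
    uint64_t size = 0;
    MPI_Recv(&size, 1, MPI_UINT64_T, peer, kCOMM_STATE_SIZE_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::vector<uint8_t> state(size);
    MPI_Recv(state.data(), static_cast<int>(size), MPI_UINT8_T, peer, kCOMM_STATE_DATA_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return state;
}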

♻️ Duplicate comments (3)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)

834-836: Missing header for std::iota.

Add <numeric>, which declares std::iota, to avoid relying on transitive includes.

 #include <random>
+#include <numeric>

1154-1160: Re-enable correctness checks in verifyBlockData (KEY).

Assertions are commented out; the test won't catch corruption.

-                                if (*dataPtr != static_cast<ValueType>(0)) {
-                                    // EXPECT_EQ(*dataPtr, generateValue);
-                                } else {
-                                    // // TODO: Remove this when over-allocation is fixed.
-                                    // printf("[verifyBlockData::key] SKIPPING 0! \n");
-                                }
+                                EXPECT_EQ(*dataPtr, generateValue);

1184-1190: Re-enable correctness checks in verifyBlockData (VALUE).

Same issue for V-cache lane.

-                                    if (*dataPtr != static_cast<ValueType>(0)) {
-                                        // EXPECT_EQ(*dataPtr, generateValue);
-                                    } else {
-                                        // // TODO: Remove this when over-allocation is fixed.
-                                        // printf("[verifyBlockData::value] SKIPPING 0! \n");
-                                    }
+                                    EXPECT_EQ(*dataPtr, generateValue);
🧹 Nitpick comments (3)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)

686-691: CP-aware pool sizing to reduce over-allocation.

The total block budget should be divided across CP ranks; the current sizing may over-allocate and produce zero-filled regions that the tests later skip.

-        auto blocksPerWindow = BlocksPerWindow{{maxAttentionWindow, {totalNumBlocks, blocksInSecondaryPool}}};
+        auto blocksPerWindow = BlocksPerWindow{
+            {maxAttentionWindow,
+             {static_cast<SizeType32>((totalNumBlocks + mCpSize - 1) / mCpSize), blocksInSecondaryPool}}};

861-890: Noisy stderr debug; gate behind logging or env.

Use TLLM_LOG_DEBUG and/or guard with TLLM_DEBUG_RANK to avoid spamming CI.

-        std::cerr << "subTokens: mCPSize: " << mCpSize << " mCPRank: " << mCpRank << " subTokens size: " << subTokens.size() << std::endl;
-        for (auto token : subTokens) {
-            std::cerr << token << ", ";
-        }
-        std::cerr << std::endl;
+        if (getEnvMpiDebugRank() != -2) {
+            TLLM_LOG_DEBUG("subTokens: mCPSize=%d mCPRank=%d size=%zu", mCpSize, mCpRank, subTokens.size());
+        }

2-10: Update copyright year to include 2025.

Keep headers current per repo guidelines.

- * SPDX-FileCopyrightText: Copyright (c) 2023-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 9d380bd and 6cb9787.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (39 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

913-925: CP token sizing is double-dividing; under-allocates tokens per rank.

llmRequest in CP mode already contains rank-local subTokens. tokenCountAdjustedForCP divides by mCpSize again, halving (or worse) the allocation.

-    // Blocks are distributed among CP ranks as evenly as possible.
-    int numTotalBlocks = (llmRequest->getNumTokens(beamIdx) + tokensPerBlock - 1) / tokensPerBlock;
-    int numBlocksCurrRank = numTotalBlocks / mCpSize;
-    if (numTotalBlocks % mCpSize > mCpRank)
-    {
-        numBlocksCurrRank++;
-    }
-    // TODO: Last block on the last overflow rank may not be full.
-    return numBlocksCurrRank * tokensPerBlock;
+    // llmRequest already carries this rank's CP-partitioned tokens.
+    int numBlocksCurrRank = (llmRequest->getNumTokens(beamIdx) + tokensPerBlock - 1) / tokensPerBlock;
+    return numBlocksCurrRank * tokensPerBlock; // round up to block
⛔ Skipped due to learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
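
To make the double division flagged above concrete, a standalone arithmetic check under assumed numbers: a 128-token request split across cpSize = 2 leaves 64 rank-local tokens, yet the old path divides again:

#include <cassert>

int main()
{
    int const tokensPerBlock = 16;
    int const cpSize = 2;
    int const rankLocalTokens = 64; // what getNumTokens() returns on this rank after CP partitioning

    int const numBlocks = (rankLocalTokens + tokensPerBlock - 1) / tokensPerBlock; // 4
    int const oldTokens = (numBlocks / cpSize) * tokensPerBlock;                   // 32: divided twice
    int const newTokens = numBlocks * tokensPerBlock;                              // 64: just rounded up to a block

    assert(oldTokens == 32 && newTokens == 64); // the old path halves the allocation
    return 0;
}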
