[None][fix]revert dp_tp optimal kvcache transfer #6657 #6707

chuangz0 · 2025-08-07T11:40:37Z

Summary by CodeRabbit

Bug Fixes
- Improved cache sending and receiving logic to more accurately account for data parallel ranks, refining cache transfer distribution in multi-rank scenarios.
Tests
- Updated test expectations to reflect the refined cache transfer logic, ensuring correct behavior for different data parallel rank combinations.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-08-07T11:40:44Z

Walkthrough

This change updates the internal logic of cache sending and receiving decisions in both CacheFormatter and MLACacheFormatter classes. The logic now incorporates data parallel ranks when determining whether to send or receive cache data, refining conditions based on duplication factors and alignment between tensor and data parallel ranks. Associated tests are updated to reflect the new logic.
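
The walkthrough's description of a send decision based on a duplication factor and the destination DP rank can be illustrated with a minimal sketch. This is not the code from `cacheFormatter.cpp` or `mlaCacheFormatter.cpp`: the `ParallelLayout` struct, the helper names, and the assumption that a DP group holds `tpSize / dpSize` TP ranks are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Hypothetical, simplified view of one side's parallel layout.
// It is NOT the real CacheState / ParallelConfig API.
struct ParallelLayout
{
    int tpSize;             // tensor-parallel size
    int dpSize;             // data-parallel size (only meaningful with attention DP)
    int dpRank;             // data-parallel rank of this process
    bool enableAttentionDP; // whether attention DP is enabled
};

// Assumption: with attention DP, each DP group holds tpSize / dpSize TP ranks.
int tpRanksPerDpGroup(ParallelLayout const& layout)
{
    return layout.enableAttentionDP ? layout.tpSize / layout.dpSize : layout.tpSize;
}

// Sketch of a DP-aware send decision for the source TP rank `selfTpRank`.
bool needSendSketch(int selfTpRank, ParallelLayout const& self, ParallelLayout const& dest)
{
    int const selfTpPerGroup = tpRanksPerDpGroup(self);
    int const destTpPerGroup = tpRanksPerDpGroup(dest);
    // Fall back to DP rank 0 when the destination does not use attention DP.
    int const destDpRank = dest.enableAttentionDP ? dest.dpRank : 0;
    if (selfTpPerGroup <= destTpPerGroup)
    {
        return true; // no heads are duplicated on the source side
    }
    // Number of source TP ranks that hold the same heads for one destination rank.
    int const dupHeadFactor = selfTpPerGroup / destTpPerGroup;
    int const selfTpRankInGroup = selfTpRank % selfTpPerGroup;
    // Spread the duplicates across destination DP ranks instead of always using the first.
    return selfTpRankInGroup % dupHeadFactor == destDpRank % dupHeadFactor;
}
```

With a source of TP size 4 and a destination of TP size 4 split into 2 DP groups, `dupHeadFactor` is 2 in this sketch, so a destination with DP rank 1 is served by the odd source TP ranks and one with DP rank 0 by the even ones.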

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| CacheFormatter logic update `cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp` | Refines logic in `needSendCache` and `pickRecvConnections` to use destination/self data parallel rank for cache transfer decisions, replacing previous modulo checks with conditions involving DP ranks and duplication factors. |
| MLACacheFormatter logic update `cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp` | Updates `pickRecvConnections` and `needSendCache` to consider data parallel ranks and duplication factors, introducing more nuanced checks for when to send or receive cache data based on tensor and data parallel group sizes and alignment. |
| Test expectation adjustment `cpp/tests/batch_manager/cacheTransceiverTest.cpp` | Adjusts expected outcomes in `CacheStateContextDP` test by flipping the expected value of `expectNeedSend` for two rank combinations, aligning test expectations with the new logic. Minor formatting changes in `CacheStateNODP` test (blank lines). |

Sequence Diagram(s)

sequenceDiagram
    participant Source as Source Rank
    participant Formatter as CacheFormatter/MLACacheFormatter
    participant Dest as Destination Rank

    Source->>Formatter: needSendCache(destRank, ...)
    alt Attention DP enabled
        Formatter->>Formatter: Compute destDPRank/selfDPRank
        Formatter->>Formatter: Compute duplication factor
        Formatter->>Formatter: Check (srcTPRank % dupHeadFactor == destDPRank)
    else Attention DP disabled
        Formatter->>Formatter: Compute duplication factor
        Formatter->>Formatter: Check (srcTPRank % dupHeadFactor == destDPRank)
    end
    Formatter-->>Source: Return true/false

    Dest->>Formatter: pickRecvConnections(...)
    Formatter->>Formatter: Compute selfDPRank/dpRank
    Formatter->>Formatter: For each index, check modulo condition with DP rank
    Formatter-->>Dest: Return list of connection indices

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Possibly related PRs

[None][chore] optimize kv cache transfer for context TEP and gen DEP #6657: The main PR and the retrieved PR both modify the same functions (needSendCache and pickRecvConnections) in cacheFormatter.cpp and mlaCacheFormatter.cpp to incorporate data parallel ranks into the cache sending and receiving logic, indicating a direct and strong code-level relationship.

Suggested reviewers

Tabrizian
Shixiaowei02

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b9781e and 2ab67b5.

📒 Files selected for processing (3)

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2 hunks)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (3 hunks)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signedness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp

**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp

🧠 Learnings (2)

Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp

Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (12)

cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)

1460-1460: LGTM - Minor formatting improvement.

The blank line addition improves test readability by separating different test cases.

1463-1463: LGTM - Consistent formatting improvement.

The blank line addition maintains consistent spacing between test cases.

1468-1468: LGTM - Formatting consistency maintained.

The blank line addition continues the consistent test case separation pattern.

1569-1569: Please verify these test expectation changes against MLACacheFormatter::needSendCache logic.

I wasn’t able to locate the verifyContext helper to see exactly how contextCache and genCache are constructed, so please double-check that flipping expectNeedSend for these cases matches the reverted logic in MLACacheFormatter::needSendCache (lines 58–158):

• contextRank=0, generationRank=1: true → false
• contextRank=1, generationRank=1: false → true

Ensure that with contextEnableDP = false (and however generationEnableDP is set), the computed selfTpRank, dupHeadFactor, and destDPRank indeed yield the new expectations.
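
One way to sanity-check the two flipped expectations is to compare a rule that ignores the destination DP rank with one that matches against it. The sketch below assumes a configuration that was not visible in the review: a duplication factor of 2 and a generation rank whose DP rank is 1. Both function names are hypothetical, and the real `verifyContext` setup may differ.

```cpp
#include <cassert>

// Hypothetical rule that ignores the destination DP rank:
// only the first source TP rank of each duplicate group sends.
bool needSendIgnoringDp(int srcTpRank, int dupHeadFactor)
{
    return srcTpRank % dupHeadFactor == 0;
}

// Hypothetical DP-aware rule: the source TP rank whose position in the
// duplicate group matches the destination DP rank sends.
bool needSendWithDp(int srcTpRank, int dupHeadFactor, int destDpRank)
{
    return srcTpRank % dupHeadFactor == destDpRank % dupHeadFactor;
}
```

Under these assumed values, the DP-unaware rule sends from context rank 0 and not from context rank 1, while the DP-aware rule does the opposite, which is the same pair of flips listed above (true → false and false → true).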

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)

92-92: LGTM: DP rank calculation for destination.

The introduction of destDPRank properly handles the case where attention DP is enabled vs disabled, setting it to the actual DP rank or 0 respectively.

94-94: LGTM: Updated cache sending logic with DP rank consideration.

The modified return condition now incorporates the destination DP rank in the decision-making process, which aligns with the data parallel optimization changes being reverted.

131-131: LGTM: Self DP rank calculation.

Similar to the destination DP rank calculation, this properly handles the self DP rank based on whether attention DP is enabled.

136-136: LGTM: Updated connection selection logic with DP rank consideration.

The condition now uses self DP rank modulo the peer duplication head factor, which is consistent with the overall approach of incorporating DP rank into cache transfer decisions.
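
A receive-side selection that filters peer TP ranks by the self DP rank modulo the peer duplication head factor might look roughly like the following. The `TargetInfoSketch` struct and the flat index layout (`tp * domainPpSize + pp`) are assumptions made for this illustration, not the actual `targetIRanks` result type.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the target-rank information used by the formatter.
struct TargetInfoSketch
{
    int domainTpSize;      // peer TP ranks this rank could receive from
    int domainPpSize;      // peer PP ranks per peer TP rank
    int peerDupHeadFactor; // how many peer TP ranks hold the same heads
};

// Sketch: keep only the connections whose peer TP index lines up with selfDpRank.
std::vector<std::size_t> pickRecvSketch(TargetInfoSketch const& info, int selfDpRank)
{
    std::vector<std::size_t> picked;
    for (int tp = 0; tp < info.domainTpSize; ++tp)
    {
        if (tp % info.peerDupHeadFactor != selfDpRank % info.peerDupHeadFactor)
        {
            continue; // this duplicate is left to a different DP rank
        }
        for (int pp = 0; pp < info.domainPpSize; ++pp)
        {
            // Assumed layout: connections are ordered TP-major, PP-minor.
            picked.push_back(static_cast<std::size_t>(tp) * info.domainPpSize + pp);
        }
    }
    return picked;
}
```

For example, with 4 peer TP ranks, 2 peer PP ranks, and a duplication factor of 2, a rank with DP rank 1 would pick the connections of peer TP ranks 1 and 3 in this sketch.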

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4)

48-53: LGTM: Enhanced connection selection with DP rank offset.

The addition of dpRank calculation and its use in the connection index calculation (line 53) properly incorporates data parallel considerations. The offset calculation (dpRank % (targetInfo.mDomainTPSize)) * targetInfo.mDomainPPSize ensures connections are selected based on the appropriate DP group.
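
The offset expression quoted above can be tried out in isolation. In this sketch only the expression `(dpRank % domainTpSize) * domainPpSize` comes from the review comment; the function name, the plain `int` parameters, and the loop over `domainPpSize` consecutive indices are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Sketch: pick one block of domainPpSize connections, shifted by the DP rank.
std::vector<std::size_t> pickMlaRecvSketch(int domainTpSize, int domainPpSize, int dpRank)
{
    std::vector<std::size_t> picked;
    // Offset of the peer TP block assigned to this DP rank (wraps around domainTpSize).
    std::size_t const offset = static_cast<std::size_t>(dpRank % domainTpSize) * domainPpSize;
    for (int i = 0; i < domainPpSize; ++i)
    {
        picked.push_back(offset + i);
    }
    return picked;
}
```

With 4 peer TP ranks and 2 peer PP ranks, DP rank 0 would read connections 0 and 1, and DP rank 5 would wrap around to the block of peer TP rank 1, that is, connections 2 and 3.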

63-66: LGTM: Destination DP configuration calculations.

The introduction of destTPNumInDPGroup and destDPRank properly handles the destination's tensor parallel and data parallel configuration, with appropriate fallbacks when attention DP is disabled.

79-80: LGTM: Updated duplication factor logic for attention DP.

The calculation of dupHeadFactor (line 79) and its usage in the return condition (line 80) properly handles the case where source TP groups are larger than destination TP groups, incorporating the destination DP rank in the decision.

91-92: LGTM: Consistent duplication factor logic for non-attention DP.

Similar to the attention DP case, this maintains consistency in the duplication factor calculation and usage for scenarios where attention DP is not enabled.

chuangz0 requested a review from a team as a code owner August 7, 2025 11:40

chuangz0 requested review from Tabrizian and schetlur-nv August 7, 2025 11:40

chuangz0 closed this Aug 7, 2025

chuangz0 force-pushed the revert_cache_transfer_tp_dp branch from 2ab67b5 to 1b9781e Compare August 7, 2025 11:44
