[TRTLLM-6444] Add some UCX trouble shooting docs and print UCX related logs #6085
Conversation
Force-pushed from 802441e to 6c6a8ae.
📝 Walkthrough

A new troubleshooting FAQ was added to the disaggregated service documentation, detailing UCX environment variable settings to avoid server hangs. In the code, the logic for selecting the KV cache backend was clarified, with explicit handling for UCX and MPI backends and improved environment variable checks. Informational logging was added for UCX users.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant KVCacheTransceiver
    participant Logger
    User->>KVCacheTransceiver: Request backend selection
    KVCacheTransceiver->>KVCacheTransceiver: Check config and env variables
    alt Backend is MPI
        KVCacheTransceiver->>Logger: Log deprecation warning
    else Backend is UCX
        KVCacheTransceiver->>Logger: Log UCX env variable guidance
    end
    KVCacheTransceiver-->>User: Return BindKvCacheTransceiver instance
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs
Suggested reviewers
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
Lines 50-54: Fix line length issue in the log message.

The implementation logic is sound - retrieving the cluster UUID and logging informative guidance when devices might not be in the same NVLink domain. However, there's a line length violation that should be addressed.

```diff
-        logger.info(
-            f"Using NVLink domain/ClusterUUID: {cluster_uuid}, if your devices are not in the same domain, please consider setting UCX_CUDA_IPC_ENABLE_MNNVL=n, UCX_RNDV_SCHEME=put_zcopy and unset UCX_NET_DEVICES when encountering server hangs."
-        )
+        logger.info(
+            f"Using NVLink domain/ClusterUUID: {cluster_uuid}, if your devices are not in the same domain, "
+            f"please consider setting UCX_CUDA_IPC_ENABLE_MNNVL=n, UCX_RNDV_SCHEME=put_zcopy and unset "
+            f"UCX_NET_DEVICES when encountering server hangs."
+        )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/source/advanced/disaggregated-service.md (1 hunks)
- tensorrt_llm/_mnnvl_utils.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
53-53: Line too long (248 > 120) (E501)
🔇 Additional comments (3)
tensorrt_llm/_mnnvl_utils.py (1)
Lines 286-308: LGTM! Well-implemented method with proper error handling and API versioning.

The `get_cluster_uuid()` method correctly implements a fallback strategy for different NVML API versions. The inner helper function `get_fabric_info` provides clean error handling by catching `pynvml.NVMLError` exceptions and returning `None` on failures. The method appropriately tries the newer `nvmlDeviceGetGpuFabricInfoV` API first before falling back to the older `nvmlDeviceGetGpuFabricInfo` API.

The return logic is also sound - returning the `clusterUuid` attribute when fabric info is available, or an empty string otherwise, which allows the calling code to easily check for availability.
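To make the described pattern concrete, here is a minimal sketch of such a versioned-API fallback, assuming the `nvmlDeviceGetGpuFabricInfoV` / `nvmlDeviceGetGpuFabricInfo` symbols and `clusterUuid` field exposed by recent `nvidia-ml-py` releases; it is an illustration, not the PR's exact code:

```python
import pynvml

def get_cluster_uuid(device_index: int = 0) -> str:
    """Return the GPU fabric clusterUuid, or "" when unavailable."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    def get_fabric_info(api_name: str):
        # A missing symbol or an NVML error both mean "API unavailable".
        api = getattr(pynvml, api_name, None)
        if api is None:
            return None
        try:
            return api(handle)
        except pynvml.NVMLError:
            return None

    # Try the newer versioned API first, then fall back to the older one.
    info = get_fabric_info("nvmlDeviceGetGpuFabricInfoV")
    if info is None:
        info = get_fabric_info("nvmlDeviceGetGpuFabricInfo")

    # Empty string lets callers check availability with a simple truth test.
    return info.clusterUuid if info is not None else ""
```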
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)

Lines 5-5: LGTM! Proper imports added for the new functionality.

The imports for `MnnvlMemory` and `logger` are correctly placed and necessary for the new logging functionality.

Also applies to: 8-8
docs/source/advanced/disaggregated-service.md (1)
Lines 122-130: LGTM! Well-structured FAQ entry with actionable troubleshooting guidance.

The new FAQ entry effectively addresses server hang issues in disaggregated deployments. The question is clearly stated, and the response provides:
- Proper context about NVLink domains
- Three specific UCX environment variables with clear recommendations
- Specific guidance for the `UCX_CUDA_IPC_ENABLE_MNNVL` variable, including the related error message users might encounter

The documentation aligns well with the programmatic detection and logging added in the code files, providing a complete troubleshooting solution for users.
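As a quick reference, the FAQ's suggested settings can also be applied from Python before the transceiver initializes (equivalent to `export`/`unset` in the shell). The variable names below are taken verbatim from the PR's docs; note that a later review comment questions the `UCX_CUDA_IPC_ENABLE_MNNVL` spelling:

```python
import os

# Suggested when servers span different NVLink domains (per the FAQ).
os.environ["UCX_CUDA_IPC_ENABLE_MNNVL"] = "n"
os.environ["UCX_RNDV_SCHEME"] = "put_zcopy"
os.environ.pop("UCX_NET_DEVICES", None)  # unset if present
```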
/bot run
Force-pushed from 79e2fbc to a0b4d1d.
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/source/advanced/disaggregated-service.md (1 hunks)
- tensorrt_llm/_mnnvl_utils.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/source/advanced/disaggregated-service.md
- tensorrt_llm/_mnnvl_utils.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
8-8: Redefinition of unused `logger` from line 5 (F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (4)
Lines 35-37: Good defensive programming improvement.

The explicit None check for `cache_transceiver_config` improves code robustness.
Lines 39-46: Clearer backend selection logic.

The explicit setting of UCX as the default backend, followed by environment variable overrides, makes the logic more transparent and maintainable.
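As an illustration of this flow, a minimal sketch follows; the environment variable names (`TRTLLM_USE_MPI_KVCACHE`, `TRTLLM_USE_UCX_KVCACHE`) and log messages are assumptions for the example, not the PR's exact code:

```python
import os
import logging

logger = logging.getLogger("kv_cache_transceiver")

def select_backend(configured_backend: str | None) -> str:
    # UCX is set explicitly as the default backend.
    backend = configured_backend or "UCX"
    # Exact "1" comparison, per the review's note on robust env checks.
    if os.getenv("TRTLLM_USE_MPI_KVCACHE") == "1":
        backend = "MPI"
        logger.warning("MPI KV cache transport is deprecated; "
                       "consider migrating to the UCX backend.")
    elif os.getenv("TRTLLM_USE_UCX_KVCACHE") == "1":
        backend = "UCX"
    return backend
```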
Lines 48-55: Excellent troubleshooting guidance.

The logging additions provide valuable guidance for users:
- MPI deprecation warning helps users migrate to newer backends
- UCX troubleshooting advice with specific environment variables aligns perfectly with the PR's documentation goals
Lines 57-59: LGTM!

The return statement formatting change improves readability.
Force-pushed from a0b4d1d to 5e7edd4.
/bot run
/bot -h
GitHub Bot Help

Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run: Launch build/test pipelines. All previously running jobs will be killed.

kill: Kill all running builds associated with pull request.

skip: Skip testing for latest commit on pull request.

reuse-pipeline: Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot run
PR_Github #12413 [ run ] triggered by Bot
PR_Github #12413 [ run ] completed with state
Force-pushed from 5e7edd4 to b8ec9a6.
/bot run
PR_Github #12437 [ run ] triggered by Bot
PR_Github #12437 [ run ] completed with state
I would suggest removing
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
docs/source/advanced/disaggregated-service.md
(1 hunks)tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/source/advanced/disaggregated-service.md
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
58-58: Line too long (137 > 120) (E501)
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (3)
Line 34: Good code cleanup.

The removal of redundant parentheses improves readability while maintaining the same logical behavior.
Lines 38-50: Improved backend selection logic.

The changes enhance the backend selection with:
- Clear UCX default setting
- Robust environment variable checking (exact "1" value)
- Appropriate warnings for deprecated environment variable usage
- Priority-based backend selection
Lines 61-62: Good code formatting improvement.

The line wrapping of the return statement improves readability while maintaining the same functionality.
PR_Github #12647 [ run ] completed with state
Force-pushed from f989a95 to a3182bb.
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/source/advanced/disaggregated-service.md (1)
Lines 35-38: Promote the paragraph to a sub-heading and tighten wording.

This paragraph introduces a new set of variables but sits at the same hierarchy as the bullets above, making the doc harder to scan. Converting it to its own third-level heading and replacing "please make sure" with a stronger verb improves readability and style.

```diff
-There are some other useful environment variables that may help when encountering failures or performance issues.
+### Additional Environment Variables
+
+The following variables can be helpful when troubleshooting failures or performance regressions:
```

```diff
-Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise there could be accuracy problem.
+Setting it to `0` reduces the number of CUDA streams, but **ensure** there are no other NCCL operations outside the single CUDA graph; otherwise, accuracy issues may occur.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- docs/source/advanced/disaggregated-service.md (2 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/disaggregated-service.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 LanguageTool
docs/source/advanced/disaggregated-service.md
[style] ~37-~37: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...
(MAKE_SURE_ENSURE)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/advanced/disaggregated-service.md (1)
Lines 88-96: Verify UCX variable names and add quick-check examples.

- `UCX_CUDA_IPC_ENABLE_MNNVL` looks atypical; double-check the spelling against UCX docs (common knobs are `UCX_CUDA_IPC_ENABLE`, `UCX_CUDA_IPC_OFFLOAD_ENABLE`, etc.).
- Consider adding a one-liner example that shows how to confirm the NVLink domain, e.g. `nvidia-smi -q | grep -A2 Fabric`, to make the FAQ immediately actionable.
Let me know if you’d like a follow-up patch once the variable names are confirmed.
Force-pushed from a3182bb to cb6186d.
Actionable comments posted: 1
🧹 Nitpick comments (3)
docs/source/advanced/disaggregated-service.md (3)
Lines 35-36: Consider promoting this sentence to a sub-heading for scannability.

"There are some other useful environment variables …" looks like the start of a new section; turning it into a level-3 heading (`### Other useful environment variables`) will make the page easier to skim.
Lines 37-38: Tighten wording & avoid "make sure" phrasing.

```diff
-* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, NCCL may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise there could be accuracy problems.
+* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, NCCL may create too many CUDA streams when operating inside a single CUDA graph, causing a performance drop. Setting it to `0` reduces the number of streams, but ensure no additional NCCL operations run outside that CUDA graph; otherwise accuracy issues may occur.
```
Lines 88-90: Polish FAQ question wording.

```diff
-*Q. When my servers are running on different NVLink domains, some servers hang or have a lower performance. How to fix that?*
+*Q. When servers run on different NVLink domains, some instances hang or show lower performance. How can I address this?*
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- docs/source/advanced/disaggregated-service.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/disaggregated-service.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 LanguageTool
docs/source/advanced/disaggregated-service.md
[style] ~37-~37: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...
(MAKE_SURE_ENSURE)
Force-pushed from cb6186d to 2fcaba5.
Actionable comments posted: 0
♻️ Duplicate comments (1)
docs/source/advanced/disaggregated-service.md (1)
Lines 88-96: Typo not fixed: invalid UCX variable `UCX_CUDA_IPC_ENABLE_MNNVL`.

This variable name is still incorrect and will mislead users; the correct knob is `UCX_CUDA_IPC_ENABLE`.

```diff
-* `UCX_CUDA_IPC_ENABLE_MNNVL`: Set to `n`. This also can reduce UCX timeout error messages like `UCX ERROR cuMemImportFromShareableHandle failed: invalid resource handle`, although these errors don't necessarily cause your trtllm-serve to fail.
+* `UCX_CUDA_IPC_ENABLE`: Set to `n`. This can also reduce UCX timeout messages such as `UCX ERROR cuMemImportFromShareableHandle failed: invalid resource handle`. These messages are expected when servers are on different NVLink domains and are not fatal.
```
🧹 Nitpick comments (1)
docs/source/advanced/disaggregated-service.md (1)
Lines 35-38: Avoid "make sure"; use assertive language for safety-critical guidance.

"...please make sure there are no other NCCL ops..." is a weak directive in a section describing a potentially unsafe configuration. Prefer a stronger, unequivocal wording such as "Ensure there are no other NCCL operations outside the CUDA graph; otherwise this setting is unsafe."

```diff
-...but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+...but ensure no other NCCL operations exist outside the single CUDA graph; otherwise this setting is unsafe.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- docs/source/advanced/disaggregated-service.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/disaggregated-service.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 LanguageTool
docs/source/advanced/disaggregated-service.md
[style] ~37-~37: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...
(MAKE_SURE_ENSURE)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
/bot run
PR_Github #12682 [ run ] triggered by Bot
PR_Github #12682 [ run ] completed with state
Signed-off-by: Lizhi Zhou <[email protected]>
Signed-off-by: Lizhi Zhou <[email protected]>
Signed-off-by: Lizhi Zhou <[email protected]>
Signed-off-by: Lizhi Zhou <[email protected]>
Force-pushed from 2fcaba5 to fd88452.
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
docs/source/advanced/disaggregated-service.md (1)
Lines 90-96: Incorrect UCX variable name still present.

`UCX_CUDA_IPC_ENABLE_MNNVL` is not a documented UCX knob; the correct variable is `UCX_CUDA_IPC_ENABLE`. Keeping the typo will mislead users, and any copy-pasted export commands will have no effect.

```diff
-* `UCX_CUDA_IPC_ENABLE_MNNVL`: Set to `n`. This also can reduce UCX timeout error messages like `UCX ERROR cuMemImportFromShareableHandle failed: invalid resource handle`, although these errors don't necessarily cause your trtllm-serve to fail.
+* `UCX_CUDA_IPC_ENABLE`: Set to `n`. This can reduce UCX timeout messages such as `UCX ERROR cuMemImportFromShareableHandle failed: invalid resource handle` (these messages are often benign).
```
🧹 Nitpick comments (2)
docs/source/advanced/disaggregated-service.md (2)
Lines 35-38: Section heading and wording can be tightened.

- The sentence starting with "There are some other useful…" is an orphan line between two sections. Either promote it to its own heading (e.g. "### Additional environment variables") or fold it into the preceding list to keep the structure consistent.
- In line 37 the phrase "please make sure there are no other NCCL ops" is conversational; prefer concise technical wording such as "Ensure no additional NCCL operations are issued outside the CUDA graph."

```diff
-There are some other useful environment variables that may help when encountering failures or performance issues.
+### Additional environment variables
+The following variables can mitigate failures or performance anomalies.
@@
-* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+* `NCCL_GRAPH_MIXING_SUPPORT`: Default `1`. Setting `0` reduces the number of CUDA streams created inside a CUDA graph. Only safe when no additional NCCL operations are issued outside that graph.
```
Lines 88-96: Minor clarity & reproducibility improvements.

Consider adding a one-liner command to help users locate the NVLink domain quickly:

```diff
-  A. NVLink domain can be found with `nvidia-smi -q` in the `Fabric.ClusterUUID` field.
+  A. Identify the NVLink domain via `nvidia-smi -q | grep -A2 "Fabric.*ClusterUUID"`
+  (look for differing `ClusterUUID` values across servers).
```

This makes the guidance directly actionable for end users.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- docs/source/advanced/disaggregated-service.md (2 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/disaggregated-service.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 LanguageTool
docs/source/advanced/disaggregated-service.md
[style] ~37-~37: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...
(MAKE_SURE_ENSURE)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #12794 [ run ] triggered by Bot
PR_Github #12794 [ run ] completed with state
…d logs (NVIDIA#6085) Signed-off-by: Lizhi Zhou <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
…d logs (NVIDIA#6085) Signed-off-by: Lizhi Zhou <[email protected]> Signed-off-by: Ransiki Zhang <[email protected]>
…d logs (NVIDIA#6085) Signed-off-by: Lizhi Zhou <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
[TRTLLM-6444][doc] Add some UCX trouble shooting Q&A docs
Description
There is no simple implementation in the short term for TRTLLM-6444, so add some Q&A docs on this UCX issue and print a log to display the NVLink domain when the KV cache transceiver is enabled.
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
- Documentation
  - `NCCL_GRAPH_MIXING_SUPPORT` and troubleshooting guidance for NVLink domain-related issues.
- Chores