[WIP][chore] : CUDA13 build #6890
Signed-off-by: Boris Fomitchev <[email protected]>
📝 Walkthrough

Updates build and dependency configuration for CUDA 13, refactors NVSHMEM discovery to prefer prebuilt libraries for CUDA ≥ 13 while keeping a patched ExternalProject fallback, migrates several cuTensorMap usages to a v12000 function-pointer type, narrows a CUTLASS arch, swaps multiple reductions to cuda::maximum, introduces Thrust iterators, updates cudaMemAdvise usage, and specializes/exports a beam-search kernel.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant CMake as CMake
    participant System as System (prebuilt NVSHMEM)
    participant EP as ExternalProject (nvshmem_project)
    CMake->>CMake: Detect CUDA version
    alt CUDA >= 13.0
        CMake->>System: find_path / find_library NVSHMEM
        System-->>CMake: NVSHMEM paths & libs
        CMake->>CMake: link NVSHMEM_DEVICE_LIBRARY & NVSHMEM_HOST_LIBRARY
    else CUDA < 13.0
        CMake->>EP: ExternalProject_Add nvshmem_project (patch+build)
        EP-->>CMake: nvshmem_project::nvshmem
        CMake->>CMake: add dependency deep_ep -> nvshmem_project
    end
    CMake->>CMake: set INSTALL_RPATH including NVSHMEM_INSTALL_PREFIX
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers
Actionable comments posted: 11
🔭 Outside diff range comments (3)
cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh (1)
98-114
: Use non-versioned PFN_cuTensorMapEncodeTiled in public headers; add missing <type_traits>

PFN_cuTensorMapEncodeTiled_v12000 is a CUDA-internal, versioned typedef — don't expose it in a public header. Keep the versioned symbol only for the runtime lookup/reinterpret_cast and use the non-suffixed PFN_cuTensorMapEncodeTiled in public APIs. Also, this header uses std::remove_cv / std::is_same but does not include <type_traits> (it already has <stdexcept>).
Files/locations to change
- cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
- Change get_cuTensorMapEncodeTiled() return type to PFN_cuTensorMapEncodeTiled and reinterpret_cast to PFN_cuTensorMapEncodeTiled.
- Change make_2d_tma_copy_desc(...) encode_func parameter to PFN_cuTensorMapEncodeTiled.
- Add #include <type_traits> near the other includes (note: <stdexcept> is already present).
- Check other occurrences of the versioned typedef and unify (example): cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh currently defines PFN_cuTensorMapEncodeTiled -> PFN_cuTensorMapEncodeTiled_v12000; ensure consistency across headers/source.
Suggested diffs (apply in tma_utils.cuh)
-PFN_cuTensorMapEncodeTiled_v12000 get_cuTensorMapEncodeTiled()
+PFN_cuTensorMapEncodeTiled get_cuTensorMapEncodeTiled()
@@
-    return reinterpret_cast<PFN_cuTensorMapEncodeTiled_v12000>(cuTensorMapEncodeTiled_ptr);
+    return reinterpret_cast<PFN_cuTensorMapEncodeTiled>(cuTensorMapEncodeTiled_ptr);
@@
-template <typename T>
-CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+template <typename T>
+CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled encode_func = nullptr)
@@
 #include <stdexcept>
+#include <type_traits>
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh (2)
129-134
: Remove debug I/O from header; handle errors deterministically

std::cout in a header is inappropriate, and the magic check int(res) == 1 is unclear — replace it with an assert or explicit error propagation. If logging is required, use the project logger (not std::cout) and keep the logging implementation out of headers.
- File: cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
- Lines: 129–134
Suggested change:
-    if (int(res) == 1)
-    {
-        std::cout << "check 0: " << int(res) << std::endl;
-        std::cout << gmem_dim[0] << "\t" << gmem_dim[1] << std::endl;
-    }
+    // Debug prints removed from header.
+    // Handle failures deterministically:
+    // DG_HOST_ASSERT(res == CUDA_SUCCESS); // or project assert macro
+    // or propagate the error (e.g. return/throw an error code/exception)

If you must emit runtime diagnostics, plumb a logger (or move prints to a .cpp) rather than printing from a header.
89-99
: Fix CUDA version guard and add missing includes in fp8_blockscale_tma_utils.cuh

The current guard (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 5) is false for CUDA 13.0 (major=13, minor=0), so the ByVersion path is incorrectly disabled. Use the numeric CUDA_VERSION check and add the missing includes used in this file.

Files / locations to update:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
- Includes near top (add <type_traits> and <stdexcept>)
- get_cuTensorMapEncodeTiled() function (change guard and return type/rewrap)
- make_2d_tma_copy_desc signature (use PFN_cuTensorMapEncodeTiled)
Suggested diff:
--- a/cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
+++ b/cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
@@
 #include <cuda_runtime.h>
+#include <type_traits>
+#include <stdexcept>
 #include <cassert>
@@
-PFN_cuTensorMapEncodeTiled_v12000 get_cuTensorMapEncodeTiled()
+PFN_cuTensorMapEncodeTiled get_cuTensorMapEncodeTiled()
 {
@@
-#if (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 5)
+#if CUDA_VERSION >= 12050
@@
-    return reinterpret_cast<PFN_cuTensorMapEncodeTiled_v12000>(cuTensorMapEncodeTiled_ptr);
+    return reinterpret_cast<PFN_cuTensorMapEncodeTiled>(cuTensorMapEncodeTiled_ptr);
 }
@@
-template <typename data_type>
-CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+template <typename data_type>
+CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled encode_func = nullptr)

Rationale: using CUDA_VERSION >= 12050 ensures 12.5+ (including 13.x) takes the ByVersion path. Adding <type_traits> and <stdexcept> covers std::is_same and std::runtime_error used in this file.
🧹 Nitpick comments (10)
cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu (1)
507-514
: Use typed functor for BlockReduce and ensure CCCL headers are availableSwitching to cuda::maximum is fine for CUDA 13/CCCL, but two nits:
- Be explicit about the functor type to avoid deduction pitfalls across toolkits.
- Ensure the compilation unit includes the CCCL functional header (directly or transitively) so cuda::maximum is defined.
Apply this minimal change for clarity and robustness:
-    auto const maxGenLength = BlockReduce(tempStorage.reduce).Reduce(nextDraftLen, cuda::maximum());
+    auto const maxGenLength = BlockReduce(tempStorage.reduce).Reduce(nextDraftLen, cuda::maximum<SizeType32>());

If not already guaranteed elsewhere, please confirm that <cuda/functional> is visible via includes (your CMake change adding the CCCL include path should satisfy this).
cpp/tensorrt_llm/runtime/utils/debugUtils.cu (1)
56-63
: Typed reduction functor and CCCL availabilityUsing cuda::maximum is correct, but specify the type to avoid any template argument deduction issues across CUDA/CCCL versions:
-    int blockFound = BlockReduceT(tempStorage).Reduce(found, cuda::maximum());
+    int blockFound = BlockReduceT(tempStorage).Reduce(found, cuda::maximum<int32_t>());

Also confirm CCCL headers (e.g., <cuda/functional>) are reachable in this TU; otherwise you'll hit an unresolved identifier on older stacks.
cpp/CMakeLists.txt (1)
378-378
: Prefer message(STATUS) to reduce noise in CMake outputUnscoped message() prints at the default log level and can be noisy in CI. Use STATUS or VERBOSE:
-message("CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")

cpp/tensorrt_llm/kernels/cutlass_kernels/CMakeLists.txt (1)
193-194
: Reducing fp8_blockscale_gemm architectures to only 90 may drop supportSetting fp8_blockscale_gemm_src to 90 only removes 89/100/120 fatbin variants and can lead to runtime failures on non-Hopper GPUs (e.g., Ada Lovelace 89) or newer (100/120) if this target is selected.
If this is intentional (e.g., kernels are Hopper-specific), consider adding a comment and guarding elsewhere to avoid launching on unsupported SMs. Otherwise, retain broader coverage:
-set_cuda_architectures(fp8_blockscale_gemm_src 90)
+set_cuda_architectures(fp8_blockscale_gemm_src 90 100f 120f)

At minimum, please verify build/run paths won't reference this lib on SM89 or SM100/SM120.
cpp/tensorrt_llm/kernels/sageAttentionKernels.cu (1)
253-255
: Add the proper header for cuda::maximum and consider specifying the functor type explicitlySwitching BlockReduce’s operator from cub::Max{} to cuda::maximum{} is fine, but you should include the correct header to guarantee availability across toolchains.
- Include header: <cuda/functional>
- Optional: For older/stricter compilers, make the functor type explicit (e.g., cuda::maximum<float>{}) to avoid template deduction issues on non-standard types (e.g., __nv_bfloat16).
Follow-up suggestions (applies outside the selected range):
// near other includes
#include <cuda/functional>

If you hit type resolution issues with bf16/half in some CUDA versions, prefer:
float aggregate = BlockReduce(temp_storage).Reduce(static_cast<float>(local_amax), cuda::maximum<float>{});

and define BlockReduce on float accordingly.
Also applies to: 430-434
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu (2)
157-159
: Ensure cuda::maximum is available: add <cuda/functional>Using cuda::maximum() with BlockReduce requires the libcudacxx functional header on some toolchains. Please add the include to prevent build failures.
Apply outside-range change:
// near other headers
#include <cuda/functional>
660-661
: Same as above: add <cuda/functional> for cuda::maximumRepeat usage here. One include at the top of this file suffices for both sites.
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)
1353-1355
: Avoid duplication: factor iterator construction into a small helperYou’re repeating the counting/transform iterator construction. Consider a small inline helper to compute begin/end transform iterators once to reduce duplication and the risk of drift.
Example helper (outside the selected range):
template <typename IdxT>
inline auto segmentedOffsets(IdxT k)
{
    auto base = thrust::make_counting_iterator<IdxT>(0);
    auto begin = thrust::make_transform_iterator(base, air_topk_stable::ComputeOffset<IdxT>(k));
    auto end = begin + 1;
    return std::make_pair(begin, end);
}

Then use:
auto [beginOff, endOff] = segmentedOffsets<IdxT>(k);
requirements.txt (1)
37-39
: Nit: clean up stray whitespace and obsolete comment for wheelThere’s a trailing space after ‘wheel’ and a leftover comment line. Consider removing to avoid confusion.
-pillow>=10.3.0
-wheel 
-#<=0.45.1
+pillow>=10.3.0
+wheel

cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)
2-3
: Remove dead variable NVSHMEM_URL_HASH

This variable is now unused after removing the ExternalProject flow.

-set(NVSHMEM_URL_HASH
-    SHA256=eb2c8fb3b7084c2db86bd9fd905387909f1dfd483e7b45f7b3c3d5fcf5374b5a)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
cpp/CMakeLists.txt (3 hunks)
cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh (2 hunks)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/CMakeLists.txt (1 hunk)
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh (3 hunks)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h (1 hunk)
cpp/tensorrt_llm/kernels/sageAttentionKernels.cu (2 hunks)
cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu (1 hunk)
cpp/tensorrt_llm/kernels/topkLastDim.cu (3 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu (2 hunks)
cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp (1 hunk)
cpp/tensorrt_llm/runtime/utils/debugUtils.cu (1 hunk)
requirements.txt (3 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/runtime/utils/debugUtils.cu
cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
cpp/tensorrt_llm/kernels/topkLastDim.cu
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/runtime/utils/debugUtils.cu
cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
cpp/tensorrt_llm/kernels/topkLastDim.cu
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/runtime/utils/debugUtils.cu
cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
cpp/tensorrt_llm/kernels/topkLastDim.cu
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/runtime/utils/debugUtils.cu
cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
cpp/tensorrt_llm/kernels/topkLastDim.cu
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM_<NAME>_H with no leading or trailing underscore (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
🔇 Additional comments (11)
cpp/tensorrt_llm/kernels/topkLastDim.cu (2)
28-30
: LGTM: Thrust iterators are correctly includedIncluding thrust counting and transform iterators is correct for the new iterator approach. No issues here.
1226-1231
: Segment offsets via Thrust iterators look correct; keep the functor __host__ __device__
- Using thrust::make_counting_iterator and thrust::make_transform_iterator with ComputeOffset(k) is correct for segments of uniform length k.
- ComputeOffset is already __host__ __device__, which is required.
No action needed.
If you want to double-check the API usage (num segments, begin/end offsets), confirm that k * batch_size equals the total number of items passed to SortPairs for all calls in this TU.
requirements.txt (9)
6-6
: cuda-python constraint looks appropriate; confirm compatibility with your base imageThe ~=13.0.0 spec is fine. Ensure the base NGC image provides a matching cuda-python to avoid downgrades/upgrades during install.
16-20
: Loosening pynvml/nvidia-ml-py: verify modelopt constraintsSince nvidia-modelopt transitively depends on pynvml, confirm that loosening these pins does not introduce incompatible versions at runtime.
22-22
: h5py unpin is reasonable; just ensure ABI compatibility for platforms used in CIh5py>=3.12.1 should be fine. If manylinux/arm64 targets are in scope, watch for prebuilt wheel availability.
25-27
: Torch range bump: double-check downstream constraints (TRT, transformers)torch>=2.7.1,<=2.9.0a0 widens the range. Verify compatibility with:
- TensorRT Python bindings used (~10.13.0)
- transformers~=4.55.0
33-33
: Transformers bump looks fine; verify model code paths for API shiftstransformers 4.55 may include minor API/behavior changes. Make sure your integration doesn’t rely on deprecated symbols from 4.53.
49-49
: FastAPI pin loosened: ensure Uvicorn and Pydantic versions remain compatiblefastapi>=0.115.4 with pydantic>=2.9.1 is expected to be compatible; confirm no transitive conflicts with other web dependencies.
55-55
: flashinfer-python rename/version: confirm correct package name and wheelsThere have been naming inconsistencies in the past. Verify that flashinfer-python>=0.2.5 is the intended package name and is available for your Python/CUDA matrix.
65-65
: llguidance loosened: confirm it’s still compatible with your runtimeNo action if tests pass; otherwise consider pinning to a known-good minor range.
1-1
: Confirm PyTorch cu130 index vs. pinned torch/torchvision — manual check requiredTop-level requirements.txt sets the extra index to cu130 and pins torch to >=2.7.1,<=2.9.0a0 (torchvision is unpinned). I could not fetch the PyTorch cu130 index from this sandbox, so I cannot confirm the required wheels are present for your CI platforms — please verify.
Files to check:
- requirements.txt (repo root) — line 1:
--extra-index-url https://download.pytorch.org/whl/cu130
(torch pinned at lines ~27–29)
- constraints.txt (referenced by requirements.txt) — check for any platform/version overrides.
- CI pipelines that consume this requirements.txt (no .github/workflows found in the repo root) — confirm which CI systems/images use this file and their CUDA/Python combos.
Quick local verification (run where network access is available):
python - <<'PY'
import urllib.request, re
url = "https://download.pytorch.org/whl/cu130/torch_stable.html"
data = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
for pkg in ("torch", "torchvision"):
    hits = sorted(set(re.findall(r'href=["\']([^"\']*%s[^"\']*)["\']' % pkg, data, re.I)))
print(pkg, len(hits))
for h in hits[:30]:
print(" ", h)
PYWhat to confirm:
- The cu130 index contains wheels matching the torch version range (and torchvision) for all Python versions / OS / CUDA builds your CI uses.
- If not, either change the extra-index-url to the appropriate CUDA index(s) or pin torch/torchvision to versions that are available on cu130.
 template <typename T>
 CUtensorMap make_2d_tma_copy_desc(T* global_address, uint64_t gmem_dim[2], uint64_t stride_in_bytes,
-    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
+    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
 {
💡 Verification agent
🧩 Analysis chain
Default parameter type should match the stable PFN type
Align the encode_func param with the stable, unversioned type to avoid exposing versioned types.
- uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+ uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
Double-check the other call sites for signature consistency.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== show file (if exists) ==="
FILE="cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh"
if [ -f "$FILE" ]; then
sed -n '1,240p' "$FILE" || true
else
echo "File not found: $FILE"
fi
echo
echo "=== rg: PFN_cuTensorMapEncodeTiled_v12000 ==="
rg --hidden -n "PFN_cuTensorMapEncodeTiled_v12000" || true
echo
echo "=== rg: PFN_cuTensorMapEncodeTiled ==="
rg --hidden -n "PFN_cuTensorMapEncodeTiled(?!_v12000)" || true
echo
echo "=== rg: make_2d_tma_copy_desc ==="
rg --hidden -n "make_2d_tma_copy_desc" || true
echo
echo "=== rg: cuTensorMapEncodeTiled occurrences ==="
rg --hidden -n "cuTensorMapEncodeTiled" || true
echo
echo "=== rg: cuTensorMap related typedefs/usages ==="
rg --hidden -n "cuTensorMap" || true
Length of output: 25549
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== fp8_blockscale_tma_utils.cuh (head) ==="
sed -n '1,240p' cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh || true
echo
echo "=== cudaDriverWrapper.h (head) ==="
sed -n '1,320p' cpp/tensorrt_llm/common/cudaDriverWrapper.h || true
echo
echo "=== cudaDriverWrapper.cpp (head) ==="
sed -n '1,360p' cpp/tensorrt_llm/common/cudaDriverWrapper.cpp || true
echo
echo "=== tma_utils.cuh (relevant region) ==="
sed -n '1,220p' cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh || true
echo
echo "=== rg: where PFN_cuTensorMapEncodeTiled is defined/mentioned ==="
rg --hidden -n "PFN_cuTensorMapEncodeTiled" || true
Length of output: 32409
Prefer the stable PFN_cuTensorMapEncodeTiled type for the default parameter
Use the unversioned PFN to avoid leaking the versioned typedef; callers I checked do not pass encode_func explicitly.
Files to update:
- cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
  - Replace:
    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
  - With:
    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
- cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
  - Make the same signature change (or rely on the existing #define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000 and prefer using the unversioned name).
Call sites verified (no explicit encode_func argument): cpp/include/tensorrt_llm/deep_gemm/fp8_gemm.cuh, cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_gemm_kernel.cuh, etc.
Optional safety: add a local fallback in headers to map the unversioned name to the v12000 typedef if the unversioned PFN is not present:
#ifndef PFN_cuTensorMapEncodeTiled
#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
#endif
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh around lines 116-119, the
function signature uses the versioned typedef PFN_cuTensorMapEncodeTiled_v12000
as the default parameter; change that parameter type to the unversioned
PFN_cuTensorMapEncodeTiled (i.e., replace PFN_cuTensorMapEncodeTiled_v12000 with
PFN_cuTensorMapEncodeTiled) so callers that don’t pass encode_func keep using
the stable name, and make the same signature change in
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
(or rely on an existing #define mapping the unversioned name to the v12000
typedef); optionally add a local fallback macro guard at the top of the headers:
if PFN_cuTensorMapEncodeTiled is not defined, #define it to
PFN_cuTensorMapEncodeTiled_v12000.
set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem")
# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")

find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
💡 Verification agent
🧩 Analysis chain
NVSHMEM discovery is brittle: hard-coded path, not REQUIRED, and host SONAME pinned to .so.3
- Hard-coding NVSHMEM_INSTALL_PREFIX to a Torch-relative path without allowing override will break on non-standard layouts.
- find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 ...) pins a Linux SONAME and will fail on other platforms or future versions. Use base names via NAMES and search both lib and lib64.
- None of the NVSHMEM finds are REQUIRED; a missing lib will fail late at link time.
Apply this diff to make discovery robust and fail-fast:
-set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem")
-# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")
-
-find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem" CACHE PATH "NVSHMEM install root")
+
+find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+ HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+ REQUIRED)
+find_library(NVSHMEM_HOST_LIBRARY
+ NAMES nvshmem_host nvshmem
+ HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+ REQUIRED)
+find_library(NVSHMEM_DEVICE_LIBRARY
+ NAMES nvshmem_device
+ HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+ REQUIRED)
Please confirm whether your deployment layout sometimes places NVSHMEM under lib64; if so, the added hint is required.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Repository root: $(pwd)"
echo
echo "=== Show cpp/tensorrt_llm/deep_ep/CMakeLists.txt (first 300 lines) ==="
sed -n '1,300p' cpp/tensorrt_llm/deep_ep/CMakeLists.txt || true
echo
echo "=== Search for NVSHMEM references ==="
rg "NVSHMEM|nvshmem" -n || true
echo
echo "=== Show surrounding context where NVSHMEM_INSTALL_PREFIX might be used ==="
rg -n "NVSHMEM_INSTALL_PREFIX" -n || true
rg -n "TORCH_INSTALL_PREFIX" -n || true
echo
echo "=== Search for usages of nvshmem_host or nvshmem_device in repo ==="
rg -n "nvshmem_host|nvshmem_device" -n || true
echo
echo "=== List typical lib directories referenced elsewhere (lib64) ==="
rg -n "lib64" -n || true
Length of output: 18877
Make NVSHMEM discovery robust and fail-fast
Current code in cpp/tensorrt_llm/deep_ep/CMakeLists.txt hard-codes a Torch-relative install path, pins a specific SONAME (nvshmem_host.so.3), and does not use REQUIRED — this is brittle and will fail late on non-standard layouts or future NVSHMEM versions.
Files/locations to fix:
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt — lines ~41–47 (NVSHMEM_INSTALL_PREFIX and the three find_* calls).
- Note: cpp/tensorrt_llm/CMakeLists.txt already uses find_package(NVSHMEM REQUIRED ...); also setup.py and scripts/build_wheel.py reference specific NVSHMEM .so files (e.g. nvshmem_bootstrap_uid.so.3), so ensure packaging and discovery remain consistent.
Apply this diff to make discovery configurable, search lib64, avoid SONAME pinning, and fail fast:
-set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem")
-# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")
-
-find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem" CACHE PATH "NVSHMEM install root")
+
+find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+ HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+ REQUIRED)
+find_library(NVSHMEM_HOST_LIBRARY
+ NAMES nvshmem_host nvshmem
+ HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+ REQUIRED)
+find_library(NVSHMEM_DEVICE_LIBRARY
+ NAMES nvshmem_device
+ HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+ REQUIRED)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem" CACHE PATH "NVSHMEM install root")

find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
  HINTS ${NVSHMEM_INSTALL_PREFIX}/include
  REQUIRED)
find_library(NVSHMEM_HOST_LIBRARY
  NAMES nvshmem_host nvshmem
  HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
  REQUIRED)
find_library(NVSHMEM_DEVICE_LIBRARY
  NAMES nvshmem_device
  HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
  REQUIRED)
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/deep_ep/CMakeLists.txt around lines 41–47, the NVSHMEM
discovery is brittle: it hard-codes a Torch-relative NVSHMEM_INSTALL_PREFIX,
pins a specific SONAME, doesn't search lib64, and doesn't fail-fast. Replace the
fixed install prefix with a cache variable NVSHMEM_INSTALL_PREFIX (allowing
override), use find_path and find_library with NAMES (no SONAME suffix) and
HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
${NVSHMEM_INSTALL_PREFIX}/include and system paths, mark the find_* calls
REQUIRED so configuration errors fail early, and prefer using
find_package(NVSHMEM REQUIRED ...) if available for consistency with the
top-level CMake; ensure variables (NVSHMEM_INCLUDE_DIR, NVSHMEM_HOST_LIBRARY,
NVSHMEM_DEVICE_LIBRARY) are set from those robust searches.
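To make the lib vs lib64 fallback concrete, here is a small Python model of the search order the suggested find_library HINTS would produce. This is illustrative only (CMake itself performs the real search, and the helper name candidate_nvshmem_paths is ours):

```python
from pathlib import Path

def candidate_nvshmem_paths(prefix: str, names=("nvshmem_host", "nvshmem")) -> list:
    """Model of the suggested find_library HINTS: each candidate name is
    probed under <prefix>/lib and then <prefix>/lib64, in that order."""
    candidates = []
    for name in names:
        for libdir in ("lib", "lib64"):
            candidates.append(str(Path(prefix) / libdir / f"lib{name}.so"))
    return candidates
```

Note that dropping the SONAME suffix (nvshmem_host instead of nvshmem_host.so.3) is what lets this keep working across NVSHMEM major versions.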
#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
🛠️ Refactor suggestion
Avoid redefining SDK typedefs via macros
#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
is brittle, violates macro naming guidelines, and can unintentionally affect other headers. Prefer explicit types or a local alias.
Remove the macro:
-#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
-
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
🤖 Prompt for AI Agents
In
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
around lines 27-28, remove the macro definition that rebinds
PFN_cuTensorMapEncodeTiled to PFN_cuTensorMapEncodeTiled_v12000; instead delete
the #define and create a local, scoped alias (preferably a typedef or C++
using-alias) that maps PFN_cuTensorMapEncodeTiled to
PFN_cuTensorMapEncodeTiled_v12000 within this translation unit or an appropriate
namespace to avoid leaking a macro into other headers. Ensure the alias is only
visible where needed and that no preprocessor macro remains.
    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
{
🛠️ Refactor suggestion
Unify encode_func parameter type with stable PFN
Avoid exposing the versioned PFN type in public signatures.
- uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+ uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
{
🤖 Prompt for AI Agents
In
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
around lines 111-112, the function signature exposes the versioned PFN type
PFN_cuTensorMapEncodeTiled_v12000; change the parameter type to the stable,
unversioned PFN alias (PFN_cuTensorMapEncodeTiled) so the public signature
doesn't leak a versioned typedef, and update any forward declarations or
includes so the unversioned PFN type is visible where this header is compiled.
...nsorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
    cudaMemLocation location {cudaMemLocationTypeDevice, currentDevId};
    TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
    hostPtr = devPtr;
💡 Verification agent
🧩 Analysis chain
Guard new cudaMemAdvise location-API for backward compatibility and fix indentation
The location-based cudaMemAdvise overload requires newer CUDA. Given CMake still allows CUDA >= 11.2, gate this with CUDART_VERSION to avoid compile errors with older toolkits. Also replace the tab with spaces to match file style.
Apply:
- cudaMemLocation location {cudaMemLocationTypeDevice, currentDevId};
- TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+ #if CUDART_VERSION >= 12000
+ cudaMemLocation location{cudaMemLocationTypeDevice, currentDevId};
+ TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+ #else
+ TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, currentDevId));
+ #endif
Alternatively, if you intend to raise the project’s minimum CUDA to 13, update find_package(CUDAToolkit ...) accordingly and remove the fallback.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp"
echo "PWD: $(pwd)"
if [ -f "$FILE" ]; then
echo "Found file: $FILE"
echo "---- File header (first 120 lines) ----"
sed -n '1,120p' "$FILE"
echo "---- Context lines 340-400 ----"
sed -n '340,400p' "$FILE"
echo "---- Search for relevant symbols in file ----"
rg -n "TLLM_CUDA_CHECK|cudaMemAdvise|cudaMemLocation|CUDART_VERSION" "$FILE" || true
echo "---- Lines containing tabs (show up to 200) ----"
nl -ba "$FILE" | sed -n '1,520p' | rg --line-number $'\t' || true
else
echo "File not found: $FILE"
rg -n "hostAccessibleDeviceAllocator.cpp" || true
fi
echo "---- Search for find_package(CUDAToolkit in repo ----"
rg -n "find_package\\(CUDAToolkit" || true
echo "---- Search for CUDART_VERSION usage in repo ----"
rg -n "CUDART_VERSION" || true
Length of output: 13048
Guard cudaMemAdvise(location) for older CUDA toolkits & fix indentation
The cudaMemLocation overload requires CUDA ≥ 12, and cpp/CMakeLists.txt still allows CUDA 11.2 — gate the new API with CUDART_VERSION and replace the stray tab with spaces.
- File: cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp — replace the cudaMemLocation / cudaMemAdvise lines (around lines ~366–369).
- CMake: cpp/CMakeLists.txt currently uses find_package(CUDAToolkit 11.2 ...), so keep the fallback unless you bump the min CUDA version.
Apply:
- cudaMemLocation location {cudaMemLocationTypeDevice, currentDevId};
- TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+#if defined(CUDART_VERSION) && (CUDART_VERSION >= 12000)
+ cudaMemLocation location{cudaMemLocationTypeDevice, currentDevId};
+ TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+#else
+ TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, currentDevId));
+#endif
Alternatively, if you intend to require CUDA ≥ 12 project-wide, update find_package(CUDAToolkit ...) in cpp/CMakeLists.txt and remove the fallback.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 12000)
    cudaMemLocation location{cudaMemLocationTypeDevice, currentDevId};
    TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
#else
    TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, currentDevId));
#endif
    hostPtr = devPtr;
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
around lines 367–369, the current cudaMemLocation/cudaMemAdvise usage requires
CUDA ≥ 12 and also contains a stray tab; guard the new API with a preprocessor
check (e.g., #if defined(CUDART_VERSION) && CUDART_VERSION >= 12000) to call
cudaMemLocation and cudaMemAdvise only when available and provide the existing
fallback for older CUDA versions in the #else branch, and replace the stray tab
with spaces to fix indentation; do not change cpp/CMakeLists.txt unless you
intend to raise the minimum CUDA version project-wide.
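The version gate suggested above can be stated at a high level; this Python sketch is not CUDA code, the function name mem_advise_overload is ours, and it only makes the threshold explicit: the cudaMemLocation overload of cudaMemAdvise is available from CUDA 12.0 (CUDART_VERSION 12000), while older toolkits take a plain device id.

```python
def mem_advise_overload(cudart_version: int) -> str:
    """Model of the CUDART_VERSION gate: pick the cudaMemAdvise overload
    the preprocessor branch above would compile for a given toolkit."""
    if cudart_version >= 12000:
        return "cudaMemAdvise(ptr, size, advice, cudaMemLocation)"
    return "cudaMemAdvise(ptr, size, advice, deviceId)"
```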
nvidia-nccl-cu13
nvidia-nvshmem-cu13
nvidia-cuda-nvrtc-cu13
transformers~=4.55.0
💡 Verification agent
🧩 Analysis chain
Validate availability of new CUDA 13 vendor wheels
Please verify that the following packages exist on pip mirrors used in CI/CD:
- nvidia-nccl-cu13
- nvidia-nvshmem-cu13
- nvidia-cuda-nvrtc-cu13
🏁 Script executed:
#!/bin/bash
# Check pip can find the cu13 variants (requires internet in CI)
for pkg in nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13; do
python -m pip index versions "$pkg" 2>/dev/null | head -n 5 || echo "Lookup failed for $pkg"
done
Length of output: 463
🏁 Script executed:
#!/bin/bash
set -euo pipefail
pkgs=(nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13)
for pkg in "${pkgs[@]}"; do
echo "=== $pkg ==="
echo "--- python -m pip index versions ---"
python -m pip index versions "$pkg" 2>&1 | sed -n '1,200p' || true
echo "--- pypi.org JSON check ---"
status=$(curl -s -o /tmp/pypi_${pkg}.json -w "%{http_code}" "https://pypi.org/pypi/${pkg}/json" || true)
echo "HTTP status: $status"
if [ "$status" = "200" ]; then
python - <<PY
import json,sys
f="/tmp/pypi_${pkg}.json"
data=json.load(open(f))
print("latest_version:", data.get("info",{}).get("version"))
names = [f.get("packagetype") for files in data.get("releases", {}).values() for f in files]
print("releases_count:", len(data.get("releases",{})))
PY
fi
echo
done
Length of output: 961
🏁 Script executed:
#!/bin/bash
set -euo pipefail
pkgs=(nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13)
for pkg in "${pkgs[@]}"; do
echo "=== $pkg ==="
echo "--- python -m pip index versions ---"
python -m pip index versions "$pkg" 2>&1 || true
echo "--- pypi.org JSON ---"
url="https://pypi.org/pypi/${pkg}/json"
status=$(curl -s -w "%{http_code}" -o /tmp/pypi_${pkg}.json "$url" || echo "000")
echo "HTTP status: $status"
if [ "$status" = "200" ]; then
python - <<PY
import json,sys
f="/tmp/pypi_${pkg}.json"
data=json.load(open(f))
info = data.get("info", {})
releases = data.get("releases", {})
print("info.version:", info.get("version"))
print("releases_count:", len(releases))
# show up to 10 newest releases
for idx,(ver, files) in enumerate(sorted(releases.items(), key=lambda x: x[0], reverse=True)):
if idx >= 10:
break
print(f"- {ver} (files: {len(files)})")
for fi in files[:5]:
print(" ", fi.get("filename"), fi.get("packagetype"), fi.get("python_version"), fi.get("url")[:140])
PY
fi
echo
done
Length of output: 4191
Action required: verify CUDA‑13 vendor wheels on your CI package mirror before merging
Short summary of verification:
- nvidia-nvshmem-cu13 — found on PyPI (v3.3.20) with manylinux wheels for x86_64 and aarch64.
- nvidia-nccl-cu13 — PyPI only shows a placeholder release (0.0.0a0); pip reports "No matching distribution".
- nvidia-cuda-nvrtc-cu13 — PyPI only shows a placeholder release (0.0.0a0); pip reports "No matching distribution".
Location to check:
- requirements.txt (lines ~30–33)
Current snippet:
nvidia-nccl-cu13
nvidia-nvshmem-cu13
nvidia-cuda-nvrtc-cu13
transformers~=4.55.0
Requested actions:
- Either remove/hold the two placeholder packages (nvidia-nccl-cu13, nvidia-cuda-nvrtc-cu13) until real vendor wheels are published, or ensure your CI/internal PyPI mirror contains the correct vendor-built wheels for your target platforms.
- If you expect these names to be different or hosted on a private index, confirm the exact package names and that CI will use that index.
🤖 Prompt for AI Agents
In requirements.txt around lines 30–33, the two CUDA‑13 vendor placeholder
packages (nvidia-nccl-cu13, nvidia-cuda-nvrtc-cu13) are not available on PyPI
and will break CI; either remove or comment them out (hold) until vendor wheels
are published, or update CI to point to your internal/alternate PyPI mirror that
contains the correct vendor-built wheel names and versions; if the package names
differ on your private index, replace them with the exact package names and
ensure the CI/pip install command includes the --index-url/--extra-index-url and
any credentials required.
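A quick offline sanity check in the spirit of the verification above: given release metadata in the PyPI JSON shape (version string mapped to a list of file entries), flag placeholder-only packages. The helper names and the 0.0.0 heuristic are ours, derived from the placeholder releases (0.0.0a0) observed above:

```python
def is_placeholder(version: str) -> bool:
    """Heuristic from the findings above: 0.0.0* pre-releases are
    name-squatting placeholders, not real vendor wheels."""
    return version.startswith("0.0.0")

def usable_versions(releases: dict) -> list:
    """Return versions that have files and are not placeholders."""
    return [v for v, files in releases.items() if files and not is_placeholder(v)]
```

A package whose usable_versions list is empty (as for nvidia-nccl-cu13 and nvidia-cuda-nvrtc-cu13 at the time of the check) should be held out of requirements.txt or served from an internal mirror.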
Signed-off-by: Boris Fomitchev <[email protected]>
Actionable comments posted: 2
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (3)

149-149: Cast float literals to T to avoid implicit narrowing for half/bfloat16

Minor, but avoids warnings and makes intent explicit when T is a low-precision type.

Apply this diff:

- pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
+ pLocalLogProbs[i] += (i == endIds[slot]) ? static_cast<T>(1.0f) : static_cast<T>(0.0f);

Note: This only addresses type-safety. The indexing/semantics issue on this line still needs to be resolved as noted in a separate comment.

134-134: Use a portability macro for symbol visibility rather than a raw GCC attribute

__attribute__((visibility("default"))) is GCC/Clang specific. Prefer a project-level macro (e.g., TLLM_PUBLIC) that maps to __declspec(dllexport) on MSVC and visibility("default") elsewhere to keep CUDA 13 builds portable across toolchains.

Add (outside this file, e.g., in tensorrt_llm/common/visibility.h):

#pragma once
#if defined(_MSC_VER)
#define TLLM_PUBLIC __declspec(dllexport)
#else
#define TLLM_PUBLIC __attribute__((visibility("default")))
#endif

Then update this declaration:

-__global__ __attribute__((visibility("default"))) void addCumLogProbs(
+__global__ TLLM_PUBLIC void addCumLogProbs(

157-157: Remove unnecessary return; from the end of a __global__ kernel

Redundant in a void kernel and slightly noisy.

Apply this diff:

- return;
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cpp/tensorrt_llm/kernels/beamSearchKernels.cu (0 hunks)
- cpp/tensorrt_llm/kernels/beamSearchKernels.h (1 hunks)
💤 Files with no reviewable changes (1)
- cpp/tensorrt_llm/kernels/beamSearchKernels.cu
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (1)
154-155: Verify nBM vs nBMIn for indexing cumLogProbs and finished

You intentionally use nBM for cumLogProbs due to VBWS initialization, but use nBMIn for finished. Given finished is documented as [BS*BM], verify whether BM here corresponds to nBMIn or the max beam width. If it's the latter, finished[slot * nBM + iBMIn] may be required for consistency.

If finished is sized with the max beam width, update the indexing accordingly:

- if (finished[slot * nBMIn + iBMIn].isFinished())
+ if (finished[slot * nBM + iBMIn].isFinished())

If it is truly sized with the current input beam width, keep as-is and consider adding a comment to document the difference from cumLogProbs for future maintainers.
if (finished[slot * nBMIn + iBMIn].isFinished())
{
    pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
}
else
💡 Verification agent
🧩 Analysis chain
EOS/endIds indexing appears incorrect and likely out-of-bounds for [BS, BM] layout
endIds is documented as [BS, BM], but the code indexes it as endIds[slot] and compares it to i (a flattened [BM_in, BM_out, 2] index). Two issues:
- Indexing: endIds[slot] ignores the beam dimension; for per-beam EOS, it should likely be endIds[slot * nBM + iBMIn] (or nBMIn, depending on layout).
- Semantics: i indexes the flattened candidate grid, not the vocabulary/token ID dimension. Comparing i to an EOS token ID does not look meaningful and may never match.
This can cause incorrect scoring for finished beams and undefined behavior if endIds has only [BS, BM] elements.
Please confirm shapes and intended semantics. If indeed per-beam EOS, a minimal indexing fix would be:

- Use the per-beam EOS id:
  int const eosId = endIds[slot * nBM + iBMIn]; // or slot * nBMIn + iBMIn if finished is also sized that way.
- Compare against the actual token index in pStage1LogProbs, not the flattened [BM_in, BM_out, 2] index. If the stage-1 buffer includes a vocab dimension elsewhere, determine the local token index before comparison.
To help locate references and prior logic, run:
🏁 Script executed:
#!/bin/bash
# Look for other usages of endIds indexing and EOS handling in kernels
rg -n -A3 -B3 $'endIds\\[' cpp | sed -n '1,200p'
# Find all write sites for pStage1LogProbs to infer its layout
rg -n -A3 -B3 'pStage1(Log)?Probs' cpp | sed -n '1,200p'
# Find all callers of addCumLogProbs and their launch grids to reason about mapping
rg -n -A3 -B3 'addCumLogProbs\s*<|<<<' cpp | sed -n '1,200p'
Length of output: 45150
🏁 Script executed:
#!/bin/bash
sed -n '1,240p' cpp/tensorrt_llm/kernels/beamSearchKernels.h
Length of output: 15146
🏁 Script executed:
#!/bin/bash
sed -n '1,420p' cpp/tensorrt_llm/kernels/beamSearchKernels/beamSearchKernelsTemplate.h
Length of output: 17803
Fix required — addCumLogProbs compares candidate index to EOS token and ignores beam dim
Short summary: addCumLogProbs (cpp/tensorrt_llm/kernels/beamSearchKernels.h) iterates i over the flattened candidate grid (nBMIn * nBMOut * 2) and does (i == endIds[slot]). i is a candidate index, not a token id; also the code ignores the beam index in endIds. This will almost never match the EOS token and therefore fails to prefer EOS for finished beams (and is inconsistent with other kernels that treat endIds as per-slot token ids).
Locations to fix:
- cpp/tensorrt_llm/kernels/beamSearchKernels.h
- Kernel addCumLogProbs — replace the incorrect comparison and add access to the stage-1 ids.
- cpp/tensorrt_llm/kernels/beamSearchKernels/beamSearchKernelsTemplate.h
- Call site that launches addCumLogProbs must pass pStage1Ids (pStage1Ids is already allocated in this scope).
Suggested minimal changes (apply and adjust to your code style):

- Change the kernel signature to accept the stage-1 ids:
  - before: addCumLogProbs(T* pStage1LogProbs, float const* cumLogProbs, FinishedState const* finished, int const* endIds, float const* diversityRates, ...)
  - after: addCumLogProbs(T* pStage1LogProbs, float const* cumLogProbs, FinishedState const* finished, int const* endIds, int const* pStage1Ids, float const* diversityRates, ...)
- In the kernel body, replace the block:
  - before:
    pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
  - after (minimal, assuming endIds is a per-slot token id; if endIds is per-beam, use the per-beam index shown below):
    int const* pLocalIds = pStage1Ids + bid * nBMIn * nBMOut * 2;
    int const eosId = endIds[slot]; // or: endIds[slot * nBM + iBMIn] if endIds is [BS, BM]
    pLocalLogProbs[i] += (pLocalIds[i] == eosId) ? (T)1.0f : (T)0.0f;
- Update the kernel launch in beamSearchKernelsTemplate.h:
  - before:
    addCumLogProbs<<<nBS, nThread>>>(pStage1LogProbs, bh.cumLogProbs, bh.finished, bh.endIds, bh.diversityRates, bh.batchSlots, ...);
  - after:
    addCumLogProbs<<<nBS, nThread>>>(pStage1LogProbs, bh.cumLogProbs, bh.finished, bh.endIds, pStage1Ids, bh.diversityRates, bh.batchSlots, ...);
Extra: clarify the declared shape of endIds in BeamHypotheses (cpp/tensorrt_llm/kernels/beamSearchKernels.h) — comments currently read [BS, BM] but most kernels use endIds[slot] (per-slot). Confirm whether endIds should be [BS] or [BS, BM] and update code/comments and indexing across kernels accordingly.
Reason: The kernel must compare the candidate's token id (from stage-1 ids) to the EOS token id; comparing the flattened candidate index is incorrect and will break finished-beam handling.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/kernels/beamSearchKernels.h around lines 147-151,
addCumLogProbs currently compares the flattened candidate index i to
endIds[slot] which is wrong — you must compare the candidate token id from
stage-1 ids and account for beam dim; update the kernel signature to accept int
const* pStage1Ids, inside the kernel compute a pointer to the local stage1 ids
for this batch/beam region (e.g. pStage1Ids + bid * nBMIn * nBMOut * 2) and
replace the (i == endIds[slot]) test with a comparison of pLocalIds[i] to the
appropriate eosId (use endIds[slot] if endIds is per-slot, or endIds[slot * nBM
+ iBMIn] if per-beam), then add pStage1Ids to the addCumLogProbs kernel launch
in cpp/tensorrt_llm/kernels/beamSearchKernelsTemplate.h so the kernel receives
the stage-1 ids; also verify and if needed fix/comment the declared
shape/indexing of endIds across BeamHypotheses to be consistently [BS] or [BS,
BM].
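To see why comparing the flattened candidate index to a token id is wrong, here is a small Python model of the stage-1 scoring step. The EOS bonus of 1.0 and the idea of a per-candidate token-id buffer follow the discussion above; the list sizes, token ids, and function names are illustrative only:

```python
def add_eos_bonus_buggy(log_probs, ids, eos_id):
    """Buggy form: compares the candidate INDEX i to the EOS token id,
    so the bonus is (almost) never applied for realistic token ids."""
    return [lp + (1.0 if i == eos_id else 0.0) for i, lp in enumerate(log_probs)]

def add_eos_bonus_fixed(log_probs, ids, eos_id):
    """Fixed form: compares the candidate's TOKEN id (from stage-1 ids) to EOS."""
    return [lp + (1.0 if tok == eos_id else 0.0) for lp, tok in zip(log_probs, ids)]
```

With a candidate grid of only nBMIn * nBMOut * 2 entries, the index never reaches a large vocabulary EOS id such as 50256, so the buggy form is a no-op while the fixed form rewards exactly the candidates whose token is EOS.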
Signed-off-by: Boris Fomitchev <[email protected]>
The wheel builds now. Still should be considered WIP until review.
Actionable comments posted: 0
♻️ Duplicate comments (3)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (3)

129-130: Good fix: host and device NVSHMEM libraries are now linked

Linking both ${NVSHMEM_DEVICE_LIBRARY} and ${NVSHMEM_HOST_LIBRARY} addresses the previous link/runtime risk when host APIs are used.

41-48: Harden NVSHMEM discovery: allow overrides, avoid SONAME pinning, search lib64, and fail fast

Current find logic is brittle:
- Hard-coded install prefix is not overrideable (not a CACHE var).
- find_path/find_library are not REQUIRED (fail late).
- Host library is pinned to a specific SONAME (libnvshmem_host.so.3) and includes the lib prefix — both are error-prone across platforms/versions.
- Only ${prefix}/lib is searched; many installs use lib64.

Refactor as below to make it robust and configurable. This is a continuation of prior feedback.

-set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem")
-# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")
-
-find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-# set(NVSHMEM_HOST_LIBRARY ${NVSHMEM_INSTALL_PREFIX}/lib/libnvshmem_host.so.3)
-find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-find_library(NVSHMEM_HOST_LIBRARY libnvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem"
+    CACHE PATH "NVSHMEM install root")
+
+find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+  HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+  REQUIRED)
+find_library(NVSHMEM_DEVICE_LIBRARY
+  NAMES nvshmem_device
+  HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+  REQUIRED)
+find_library(NVSHMEM_HOST_LIBRARY
+  NAMES nvshmem_host
+  HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+  REQUIRED)

118-119: Ensure runtime resolution: add $ORIGIN and lib64 to INSTALL_RPATH

Only ${NVSHMEM_INSTALL_PREFIX}/lib is in RPATH. Add $ORIGIN for relocatability and include lib64 to cover common layouts. This aligns with earlier feedback.

- INSTALL_RPATH "${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib"
+ INSTALL_RPATH "\$ORIGIN;${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib64"

🧹 Nitpick comments (1)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)

41-45: Remove stale commented SONAME line

The commented libnvshmem_host.so.3 line is obsolete and can confuse future readers.

-# set(NVSHMEM_HOST_LIBRARY ${NVSHMEM_INSTALL_PREFIX}/lib/libnvshmem_host.so.3)

📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4 hunks)
🔇 Additional comments (2)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2)

127-127: Include path OK, but depends on non-REQUIRED find_path

This is fine, but make sure find_path(NVSHMEM_INCLUDE_DIR ...) is REQUIRED (see the suggested refactor above) to fail fast if headers are missing.

138-139: LGTM: removed ExternalProject dependency

deep_ep no longer depends on an external NVSHMEM build target, consistent with switching to prebuilt discovery.
Signed-off-by: Boris Fomitchev <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (2)
162-164
: Still leaking global to non-CUDA TUs: guard gatherId() as well
gatherId
is declared__global__
outside any#ifdef __CUDACC__
and will break host-only compilation. Wrap it with the same CUDA guard (or move to a .cuh/.cu). This is the same class of issue called out previously for kernels in this header.Apply this diff:
-__global__ void gatherId(int const* __restrict pStage1Id, int* __restrict pStage2Id, size_t const nBS, - size_t const nBMIn, size_t const nBMOut, size_t const nV); +#ifdef __CUDACC__ +__global__ void gatherId(int const* __restrict pStage1Id, int* __restrict pStage2Id, size_t const nBS, + size_t const nBMIn, size_t const nBMOut, size_t const nV); +#endif
135-156
: Fix addCumLogProbs: compare candidate token id (from stage‑1 ids) to per‑slot EOS and avoid implicit float->T castsShort: The kernel currently compares the flattened candidate index
i
toendIds[slot]
(a token id). endIds is per‑slot (shape [BS]) and the correct comparison is between the candidate token id (from the stage‑1 ids buffer) and the per‑slot EOS id. Also avoid implicit float→T conversions by casting.Places to change
- cpp/tensorrt_llm/kernels/beamSearchKernels.h — update addCumLogProbs signature and body (around lines ~135–156).
- cpp/tensorrt_llm/kernels/beamSearchKernels/beamSearchKernelsTemplate.h — update kernel launch at ~line 631 to pass the stage‑1 ids pointer (pStage1Ids is available at ~line 619).
- cpp/tensorrt_llm/kernels/beamSearchKernels.cu — update explicit template instantiations (the addCumLogProbs<float/half> prototype lines at ~137 and ~141) to the new signature.
Updated minimal diff (apply in beamSearchKernels.h; adjust instantiations & launches accordingly):
-__global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs, - FinishedState const* finished, int const* endIds, float const* diversityRates, - runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM) +__global__ __attribute__((visibility("default"))) void addCumLogProbs( + T* __restrict pStage1LogProbs, + float const* __restrict cumLogProbs, + FinishedState const* __restrict finished, + int const* __restrict endIds, + int const* __restrict pStage1Ids, + float const* __restrict diversityRates, + runtime::SizeType32 const* __restrict batchSlots, + size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM) { int const bid = blockIdx.x; // Index of request in batch runtime::SizeType32 const slot = batchSlots[bid]; float const diversityRate{diversityRates[slot]}; T* pLocalLogProbs = pStage1LogProbs + bid * nBMIn * nBMOut * 2; + int const* pLocalIds = pStage1Ids + bid * nBMIn * nBMOut * 2; for (int i = threadIdx.x; i < nBMIn * nBMOut * 2; i += blockDim.x) { int const iBMIn = i / (nBMOut * 2); if (finished[slot * nBMIn + iBMIn].isFinished()) { - pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f; + // endIds is per-slot; compare the candidate token id from stage‑1 ids + int const eosId = endIds[slot]; + pLocalLogProbs[i] += (pLocalIds[i] == eosId) ? static_cast<T>(1.0f) : static_cast<T>(0.0f); } else { - pLocalLogProbs[i] += cumLogProbs[slot * nBM + iBMIn] + diversityRate * iBMIn; + float const addend = cumLogProbs[slot * nBM + iBMIn] + diversityRate * static_cast<float>(iBMIn); + pLocalLogProbs[i] += static_cast<T>(addend); } } }Follow‑ups required
- Update the kernel launch in beamSearchKernelsTemplate.h to pass pStage1Ids:
  addCumLogProbs<<<...>>>(pStage1LogProbs, bh.cumLogProbs, bh.finished, bh.endIds, pStage1Ids, bh.diversityRates, bh.batchSlots, ...);
- Update explicit instantiations in beamSearchKernels.cu to include the new int const* pStage1Ids parameter for both float and half.
- Run unit tests / beam search integration tests to validate correctness.
Reasoning from verification
- pStage1Ids buffer exists in the template workspace (pStage1Ids declared around beamSearchKernelsTemplate.h:619) and top‑K writes it (invokeTopkLastDim at ~627). The addCumLogProbs launch is at ~631 — so the ids are available to pass.
- endIds is used elsewhere as per‑slot (endIds[slot]) in the codebase; use endIds[slot] (not per‑beam indexing).
Result: bug confirmed and fix required.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (3)
- cpp/CMakeLists.txt (3 hunks)
- cpp/tensorrt_llm/kernels/beamSearchKernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/CMakeLists.txt
- cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/beamSearchKernels.h
🧬 Code Graph Analysis (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.cu (2)
- addCumLogProbs (137-139)
- addCumLogProbs (141-143)
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (1)
133-137: Good fix: CUDA guard now prevents leaking __global__ to host-only TUs

Placing the kernel definition under #ifdef __CUDACC__ addresses the earlier compilation issue for non-CUDA translation units.
 #ifdef __CUDACC__
 template <typename T>
-__global__ void addCumLogProbs(T* __restrict pStage1Probs, float const* __restrict cumLogProbs,
+__global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
     FinishedState const* finished, int const* endIds, float const* diversityRates,
-    runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
+    runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
+{
+    int const bid = blockIdx.x; // Index of request in batch
+    runtime::SizeType32 const slot = batchSlots[bid];
+    float const diversityRate{diversityRates[slot]};
+    T* pLocalLogProbs = pStage1LogProbs + bid * nBMIn * nBMOut * 2;
+
+    for (int i = threadIdx.x; i < nBMIn * nBMOut * 2; i += blockDim.x)
+    {
+        int const iBMIn = i / (nBMOut * 2);
+        if (finished[slot * nBMIn + iBMIn].isFinished())
+        {
+            pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
+        }
+        else
+        {
+            // nBM is used in VBWS since `cumLogProbs` is initialized with kMaxBeamWidth earlier than BeamSearchLayer
+            pLocalLogProbs[i] += cumLogProbs[slot * nBM + iBMIn] + diversityRate * iBMIn;
+        }
+    }
+    return;
+}
 #endif
💡 Verification agent
🧩 Analysis chain
Avoid ODR/duplicate definitions: add extern templates or move the kernel out of the header

Defining a templated __global__ kernel in a header while also explicitly instantiating it in a .cu risks multiple definitions if any TU implicitly instantiates it via a launch. Either:
- Keep the definition in the .cu and leave only a declaration in the header, or
- Keep the definition in the header and prevent implicit instantiation with extern template declarations, keeping explicit instantiations in the .cu.
Minimal change (add extern templates under the CUDA guard):
#ifdef __CUDACC__
template <typename T>
__global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
FinishedState const* finished, int const* endIds, float const* diversityRates,
runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
{
@@
}
-#endif
+// Prevent implicit instantiation; explicit instantiations live in the .cu
+extern template __global__ __attribute__((visibility("default"))) void addCumLogProbs<float>(
+ float* __restrict pStage1LogProbs, float const* __restrict cumLogProbs, FinishedState const* finished,
+ int const* __restrict endIds, int const* __restrict pStage1Ids, float const* __restrict diversityRates,
+ runtime::SizeType32 const* __restrict batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
+
+extern template __global__ __attribute__((visibility("default"))) void addCumLogProbs<half>(
+ half* __restrict pStage1LogProbs, float const* __restrict cumLogProbs, FinishedState const* finished,
+ int const* __restrict endIds, int const* __restrict pStage1Ids, float const* __restrict diversityRates,
+ runtime::SizeType32 const* __restrict batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
+#endif
And update the explicit instantiations in the .cu to match the new signature:
// cpp/tensorrt_llm/kernels/beamSearchKernels.cu (adjust signature)
template __global__ void addCumLogProbs<float>(float* __restrict, float const* __restrict,
FinishedState const*, int const* __restrict, int const* __restrict, float const* __restrict,
runtime::SizeType32 const* __restrict, size_t const, size_t const, size_t const, size_t const);
template __global__ void addCumLogProbs<half>(half* __restrict, float const* __restrict,
FinishedState const*, int const* __restrict, int const* __restrict, float const* __restrict,
runtime::SizeType32 const* __restrict, size_t const, size_t const, size_t const, size_t const);
Alternative: Move the kernel body to the .cu and leave a guarded declaration in the header to avoid any template instantiation from headers.
To confirm there are explicit instantiations today and avoid duplicates, run:
🏁 Script executed:
#!/bin/bash
rg -n -A2 -B2 $'template\\s+__global__\\s+void\\s+addCumLogProbs' cpp | sed -n '1,200p'
Length of output: 1181
Move the CUDA kernel body out of the header (or add correct extern template specializations) — explicit instantiations exist in the .cu
The header currently defines a templated global kernel while cpp/tensorrt_llm/kernels/beamSearchKernels.cu already contains explicit instantiations for float and half (rg found instantiations at lines ~137 and ~141). This risks ODR/duplicate-definition errors. Fix by removing the definition from the header and leaving only a declaration, or (less preferred) add matching extern template specializations.
Files to change
- cpp/tensorrt_llm/kernels/beamSearchKernels.h — replace the kernel definition (lines ~133–159) with a declaration.
- cpp/tensorrt_llm/kernels/beamSearchKernels.cu — already has explicit instantiations; verify they match the declaration (they do: float and half specializations exist).
Suggested minimal change (preferred: move body to .cu; in the header leave only the declaration):
#ifdef __CUDACC__
-template <typename T>
-__global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
- FinishedState const* finished, int const* endIds, float const* diversityRates,
- runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
-{
- ...
-}
+template <typename T>
+__global__ __attribute__((visibility("default"))) void addCumLogProbs(
+ T* __restrict pStage1LogProbs,
+ float const* __restrict cumLogProbs,
+ FinishedState const* finished,
+ int const* endIds,
+ float const* diversityRates,
+ runtime::SizeType32 const* batchSlots,
+ size_t const nBS,
+ size_t const nBMIn,
+ size_t const nBMOut,
+ size_t const nBM);
#endif
Optional alternative (if you must keep the body in the header): add extern template declarations that exactly match the explicit instantiations in the .cu (use the same parameter list and types). Example externs (place under the CUDA guard after the definition if you keep it, though moving the body is safer):
extern template __global__ void addCumLogProbs<float>(
float* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
FinishedState const* finished, int const* endIds, float const* diversityRates,
runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
extern template __global__ void addCumLogProbs<half>(
half* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
FinishedState const* finished, int const* endIds, float const* diversityRates,
runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
Action: prefer moving the kernel body to the .cu and keep the declaration in the header; that aligns with the existing explicit instantiations and avoids ODR issues.
Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: Boris Fomitchev <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
requirements.txt (1)
30-33: Blocking: cu13 vendor wheels (nccl/nvrtc) not on public PyPI; will break installs

Prior verification already showed nvidia-nccl-cu13 and nvidia-cuda-nvrtc-cu13 are placeholders on PyPI and not installable, while nvidia-nvshmem-cu13 exists. Unless your CI uses a private index with these exact names, pip will fail here. Suggest holding/commenting them until wheels are available or wiring CI to your internal index.
Apply this minimal, reversible change:
 nvidia-modelopt[torch]~=0.33.0
-nvidia-nccl-cu13
-nvidia-nvshmem-cu13
-nvidia-cuda-nvrtc-cu13
+## CUDA 13 vendor wheels
+# HOLD until available on configured index (or pin to private URLs):
+# nvidia-nccl-cu13
+nvidia-nvshmem-cu13
+# nvidia-cuda-nvrtc-cu13
 transformers~=4.55.0

Optionally replace with exact private URLs or add a CI-level --index-url/--extra-index-url pointing at your internal mirror.
#!/bin/bash
set -euo pipefail
echo "[info] Checking for vendor cu13 wheels on public PyPI (expect: nvshmem OK; nccl/nvrtc likely missing)"
for pkg in nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13; do
  echo "=== $pkg ==="
  url="https://pypi.org/pypi/${pkg}/json"
  status=$(curl -s -w "%{http_code}" -o /tmp/${pkg}.json "$url" || echo "000")
  echo "HTTP: $status"
  if [ "$status" = "200" ]; then
    python - "$pkg" <<'PY'
import json, sys
p = f"/tmp/{sys.argv[1]}.json"
d = json.load(open(p))
print("latest:", d.get("info", {}).get("version"), "releases:", len(d.get("releases", {})))
PY
  fi
done
echo
echo "[info] Inspect constraints.txt and grep for older cu12 pins that could conflict"
if [ -f constraints.txt ]; then
  sed -n '1,120p' constraints.txt | nl -ba | sed -n '1,120p'
  echo "--- grep cu12/cu13/torch/torchvision ---"
  rg -n "cu12|cu13|torch(vision)?|nvrtc|nccl" constraints.txt || true
else
  echo "constraints.txt not found at repo root"
fi
🧹 Nitpick comments (4)
requirements.txt (4)
1-1: Global extra-index-url affects all packages; consider scoping to torch installs in CI

Keeping --extra-index-url in a shared requirements file can cause resolver backtracking and accidental pulls from the PyTorch index for unrelated packages. If feasible, move the index flag to the CI pip install step for torch/torchvision/torchaudio only.
4-4: build/wheel are build-time tools; consider moving to a dev/build requirements file

These aren't needed at runtime and can bloat production images. Suggest creating requirements-dev.txt and installing them only in build contexts.
Example change in this file:
-build
...
-wheel
-#<=0.45.1

And create requirements-dev.txt (outside this diff):
- build
- wheel
Also applies to: 38-39
55-55: Clarify Triton dependency strategy (flashinfer/xgrammar comments only)

Comments imply Triton is provided transitively via flashinfer-python or xgrammar, but both are commented out here. If Triton-backed paths are required at runtime, expose these as optional extras (e.g., [triton]) or document installation steps in README/CI. Otherwise, guard imports with clear error messages.
Also applies to: 57-57
22-22: Heads-up on loosened pins (h5py, pillow) with numpy<2 cap

h5py and pillow are now minimum-pinned. Newer h5py may hard-require numpy>=2, causing resolver backtracking; pillow major bumps can be disruptive. Not blocking, but be prepared to constrain upper bounds if CI starts flaking.
Also applies to: 37-37
📜 Review details
📒 Files selected for processing (2)
- cpp/CMakeLists.txt (3 hunks)
- requirements.txt (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/CMakeLists.txt
 nvidia-ml-py
 # >=12,<13
 # Just a wrapper since nvidia-modelopt requires pynvml
-pynvml==12.0.0
+pynvml
🛠️ Refactor suggestion
Avoid installing both nvidia-ml-py and pynvml — they ship the same module (pynvml) and may conflict
Having both often leads to site-packages collisions and non-deterministic import behavior. Keep exactly one; pynvml is what downstream packages typically require.
-nvidia-ml-py
-# >=12,<13
-# Just a wrapper since nvidia-modelopt requires pynvml
-pynvml
+# NVML: use a single provider to avoid conflicts
+pynvml # required by nvidia-modelopt
If you must support environments that still depend on nvidia-ml-py, consider environment markers/extras rather than installing both globally.
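As a sketch of the marker approach (the condition below is purely hypothetical; substitute whatever actually distinguishes your legacy environments):

```
# requirements.txt fragment
pynvml                                   # single NVML provider
nvidia-ml-py; python_version < "3.9"     # hypothetical legacy-only condition
```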
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# NVML: use a single provider to avoid conflicts
pynvml # required by nvidia-modelopt
🤖 Prompt for AI Agents
In requirements.txt around lines 16 to 19, both nvidia-ml-py and pynvml are
listed which ship the same pynvml module and can conflict; remove one of them
(prefer pynvml) so only a single package providing pynvml is installed, or
replace with an environment-marker or extras-based conditional if you must
support older environments that require nvidia-ml-py.
torch>=2.7.1,<=2.9.0a0
torchvision
nvidia-modelopt[torch]~=0.33.0
💡 Verification agent
🧩 Analysis chain
Align torchvision with torch/cu130 to prevent binary/ABI mismatches
torch is range-pinned, but torchvision is unconstrained. For CUDA wheels, torch/torchvision must be ABI-matched (same CUDA build, major.minor pairing). Please either pin torchvision to the corresponding release series for the chosen torch range or ensure it’s constrained via constraints.txt.
Run to check if constraints already pin torchvision:
If not pinned elsewhere, consider adding something like:
- torchvision matched to the highest allowed torch version in your matrix (same CUDA build), or
- explicitly install both in CI in one transaction from the cu130 index.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "[info] Searching for torchvision pins or references"
rg -n "^torchvision(==|~=|>=|<=)" || true
if [ -f constraints.txt ]; then
echo "[info] constraints.txt entries for torch/vision"
rg -n "torch(vision)?" constraints.txt || true
fi
Length of output: 473
Pin torchvision to match torch/CUDA ABI (prevent binary mismatches)
Verification: I ran the grep you supplied — there are no torchvision pins in the repo and constraints.txt only contains a comment (no torch/torchvision constraints).
Current snippet (requirements.txt, lines 27–29):
torch>=2.7.1,<=2.9.0a0
torchvision
nvidia-modelopt[torch]~=0.33.0
Files to fix:
- requirements.txt (lines 27–29) — torchvision is unconstrained.
- constraints.txt — contains no torch/torchvision pins (only a comment).
Actionable fixes (pick one):
- Pin torchvision to the matching release series/CUDA build for your chosen torch range (same major.minor and same +cuNNN wheel), e.g. add a matching torchvision== (+cu130 if using cu130) in requirements.txt.
- Or add a torchvision constraint in constraints.txt.
- Or ensure CI installs torch and torchvision together from the correct CUDA index (install both in one transaction from the cu130 wheel index).
Signed-off-by: Boris Fomitchev <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2)
202-203: Include NVSHMEM lib64 and $ORIGIN in INSTALL_RPATH

To improve runtime robustness across layouts (and when packaging), include NVSHMEM lib64 and $ORIGIN.

 set_target_properties(
   deep_ep_cpp_tllm
   @@
-  INSTALL_RPATH "${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib"
+  INSTALL_RPATH "\$ORIGIN;${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib64"
   BUILD_WITH_INSTALL_RPATH TRUE)
39-45
: NVSHMEM discovery: make it configurable, robust, and fail-fast
- Hard-codes install prefix; not overridable.
- Doesn’t mark finds as REQUIRED (late failures).
- Pins host SONAME to libnvshmem_host.so.3.
- Doesn’t search lib64.
Apply this to improve portability and early error surfacing:
-# TODO: restore patched nvshmem for CUDA12
 if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
-  set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem")
-  find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-  find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-  find_library(NVSHMEM_HOST_LIBRARY libnvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+  set(NVSHMEM_INSTALL_PREFIX
+      "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem"
+      CACHE PATH "NVSHMEM install root")
+  find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+            HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+            REQUIRED)
+  find_library(NVSHMEM_DEVICE_LIBRARY
+               NAMES nvshmem_device
+               HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+               REQUIRED)
+  find_library(NVSHMEM_HOST_LIBRARY
+               NAMES nvshmem_host nvshmem
+               HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+               REQUIRED)
 else()
🧹 Nitpick comments (3)
cpp/CMakeLists.txt (1)
378-379: Reduce build log noise for CUDA flags output

Unconditional message() can clutter logs in CI. Consider downgrading the verbosity or gating behind a flag.

-message("CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+# Or gate it:
+# if(NVCC_TIMING)
+#   message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+# endif()
-message("CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}") +message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}") +# Or gate it: +# if(NVCC_TIMING) +# message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}") +# endif()cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2)
77-81: Avoid mutating the top-level compilers; constrain GCC usage to the ExternalProject

Overwriting CMAKE_C/CXX/CUDA host compilers mid-config can impact the entire project toolchain. Prefer scoping GCC selection to the ExternalProject via its CMAKE_CACHE_ARGS.

-  if(NOT CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
-    set(CMAKE_C_COMPILER gcc)
-    set(CMAKE_CXX_COMPILER g++)
-    set(CMAKE_CUDA_HOST_COMPILER g++)
-  endif()

Then, pass explicit compilers to the ExternalProject:

 ExternalProject_Add(
   nvshmem_project
   @@
-  CMAKE_CACHE_ARGS
-    -DCMAKE_C_COMPILER:STRING=${CMAKE_C_COMPILER}
-    -DCMAKE_C_COMPILER_LAUNCHER:STRING=${CMAKE_C_COMPILER_LAUNCHER}
-    -DCMAKE_CXX_COMPILER:STRING=${CMAKE_CXX_COMPILER}
-    -DCMAKE_CXX_COMPILER_LAUNCHER:STRING=${CMAKE_CXX_COMPILER_LAUNCHER}
+  CMAKE_CACHE_ARGS
+    -DCMAKE_C_COMPILER:STRING=gcc
+    -DCMAKE_C_COMPILER_LAUNCHER:STRING=${CMAKE_C_COMPILER_LAUNCHER}
+    -DCMAKE_CXX_COMPILER:STRING=g++
+    -DCMAKE_CXX_COMPILER_LAUNCHER:STRING=${CMAKE_CXX_COMPILER_LAUNCHER}
   @@
-    -DCMAKE_CUDA_HOST_COMPILER:STRING=${CMAKE_CUDA_HOST_COMPILER}
+    -DCMAKE_CUDA_HOST_COMPILER:STRING=g++
131-132: Avoid pinning MLX5 SONAME; let the linker resolve the correct version

Pinning libmlx5.so.1 is brittle. Use the base name and search standard locations (and lib64 if needed).

-find_library(MLX5_lib NAMES mlx5 libmlx5.so.1 REQUIRED)
+find_library(MLX5_lib NAMES mlx5 REQUIRED)
+# If you need to search custom prefixes:
+# find_library(MLX5_lib NAMES mlx5 HINTS ${CUSTOM_PREFIX}/lib ${CUSTOM_PREFIX}/lib64 REQUIRED)
📜 Review details
📒 Files selected for processing (2)
- cpp/CMakeLists.txt (3 hunks)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4 hunks)
🔇 Additional comments (5)
cpp/CMakeLists.txt (1)
211-211: No-op whitespace change. No functional impact here.
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4)
211-211: LGTM: NVSHMEM include directory is properly added. Including ${NVSHMEM_INCLUDE_DIR} ensures headers resolve for DeepEP CUDA/C++ sources.
213-214: LGTM: Link both NVSHMEM device and host libraries. Linking both device and host libraries resolves prior linkage gaps for host APIs.
222-222: LGTM: Cleaned dependency on nvshmem_project. Dropping the ExternalProject dependency for the CUDA >= 13 path removes unnecessary build coupling.
46-46: Escape "$ORIGIN" to preserve a literal in RPATH

In CMake, unescaped $ORIGIN may be consumed during variable expansion instead of passed through to the loader. Use \$ORIGIN to ensure the literal ends up in RPATH:

- File: cpp/tensorrt_llm/deep_ep/CMakeLists.txt, line 46

-  set(NVSHMEM_INSTALL_PREFIX "$ORIGIN/libs/nvshmem")
+  set(NVSHMEM_INSTALL_PREFIX "\$ORIGIN/libs/nvshmem")

After configuring your build (e.g. cmake -S . -B build), verify that \$ORIGIN appears in the build outputs:

grep -R '\$ORIGIN' build
if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
  message(
    STATUS
      "CUDAToolkit_VERSION ${CUDAToolkit_VERSION_MAJOR}.${CUDAToolkit_VERSION_MINOR} is greater or equal than 13.0, setting CMAKE_CUDA_RUNTIME_LIBRARY to Shared"
  )
  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false -cudart=shared")
  set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)
endif()
💡 Verification agent
🧩 Analysis chain
Make CUDA runtime selection consistent and avoid redundant flags
You're switching to the Shared runtime for CUDA >= 13, but you also append "-cudart=shared" to CMAKE_CUDA_FLAGS. Prefer one source of truth (CMake's CMAKE_CUDA_RUNTIME_LIBRARY) and avoid hard-coding -cudart in flags. Also, if you keep Shared for >= 13, ensure consumers don't still link against CUDA::cudart_static via CUDA_RT_LIB.
Apply this minimal diff to let CMake own the runtime selection and align the helper variable:
if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
message(
STATUS
"CUDAToolkit_VERSION ${CUDAToolkit_VERSION_MAJOR}.${CUDAToolkit_VERSION_MINOR} is greater or equal than 13.0, setting CMAKE_CUDA_RUNTIME_LIBRARY to Shared"
)
- set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false -cudart=shared")
+ set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false")
set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)
+ # Ensure helper points to the shared cudart for consumers relying on CUDA_RT_LIB.
+ set(CUDA_RT_LIB CUDA::cudart)
endif()
To verify no accidental static cudart linkage remains after this change, run:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for CUDA_RT_LIB usage and cudart_static references..."
rg -n -C2 '\bCUDA_RT_LIB\b|cudart_static|CUDA::cudart_static' --type cmake
Length of output: 3327
Ensure consistent CUDA runtime selection across the entire CMake setup

We should centralize the runtime choice via CMAKE_CUDA_RUNTIME_LIBRARY and the CUDA_RT_LIB helper, and remove all hard-coded references to cudart_static or -cudart=shared. In practice that means:
• In cpp/CMakeLists.txt's setup_cuda_architectures() (around lines 153–165), replace the separate static/default logic with a single if(VERSION >= 13)/else() that:
  – Sets CMAKE_CUDA_RUNTIME_LIBRARY to Shared for >= 13, Static otherwise.
  – Sets CUDA_RT_LIB to CUDA::cudart for >= 13, CUDA::cudart_static otherwise.
  – Drops the -cudart=shared flag entirely.
• Remove all other direct invocations of CUDA::cudart_static or -cudart=… and have consumers link only against ${CUDA_RT_LIB}.
• Update every CMakeLists that currently does

  find_package(CUDAToolkit REQUIRED COMPONENTS cuda_driver cudart_static nvml)
  …
  target_link_libraries(... CUDA::cudart_static ...)

  so that it:
  – Removes cudart_static from the find_package call (letting CMake pick the right runtime), and
  – Links against ${CUDA_RT_LIB} (or explicitly CUDA::cudart when you truly need the shared library).
Key files requiring edits (non-exhaustive):
- cpp/CMakeLists.txt (setup_cuda_architectures)
- cpp/tensorrt_llm/plugins/CMakeLists.txt
- cpp/tensorrt_llm/kernels/.../nvrtcWrapper/CMakeLists.txt
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt
- examples/cpp/executor/CMakeLists.txt
Example refactor in setup_cuda_architectures()
:
# before
find_package(CUDAToolkit 11.2 REQUIRED COMPONENTS cudart_static cuda_driver ...)
set(CUDA_RT_LIB CUDA::cudart_static)
set(CMAKE_CUDA_RUNTIME_LIBRARY Static)
...
if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
message(STATUS "… setting Shared runtime …")
- set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false -cudart=shared")
+ set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false")
set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)
+ set(CUDA_RT_LIB CUDA::cudart)
endif()
Then remove all other occurrences of cudart_static (e.g. in examples) and use ${CUDA_RT_LIB} for linking. This ensures a single source of truth and prevents stale static-runtime references.
🤖 Prompt for AI Agents
In cpp/CMakeLists.txt around lines 253-260 (and in setup_cuda_architectures()
around 153-165), remove the hard-coded -cudart=shared flag and instead
centralize CUDA runtime selection: set CMAKE_CUDA_RUNTIME_LIBRARY to Shared when
CUDAToolkit_VERSION >= 13.0 and to Static otherwise, and set a top-level
variable CUDA_RT_LIB to CUDA::cudart for ≥13 or CUDA::cudart_static for <13;
update these lines to drop any -cudart=... flags. Then search the repo for all
find_package(CUDAToolkit ...) and target_link_libraries(...) usages (notably in
cpp/tensorrt_llm/plugins, nvrtcWrapper, ucx_utils, and examples/cpp/executor)
and remove explicit cudart_static from find_package and replace any
CUDA::cudart_static or -cudart references with linking against ${CUDA_RT_LIB}
(or CUDA::cudart only when you explicitly require shared), ensuring every
consumer uses the centralized CUDA_RT_LIB variable.
Signed-off-by: Boris Fomitchev <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
scripts/build_wheel.py (1)
672-686
: Harden NVSHMEM packaging: guard directory creation, use robust path construction, handle SONAME variations, and warn on partial installs.
- Create libs/nvshmem only when License.txt exists to avoid shipping empty dirs.
- Build paths via Path segments for readability/portability.
- Be resilient to NVSHMEM library SONAME changes (e.g., .so.103) by globbing.
- Emit a clear message when the license exists but expected libs are missing.
Apply this diff:
-    (lib_dir / "nvshmem").mkdir(exist_ok=True)
-    nvshmem_license = build_dir / "tensorrt_llm/deep_ep/nvshmem-build/License.txt"
-    if nvshmem_license.exists():
-        install_file(
-            build_dir / "tensorrt_llm/deep_ep/nvshmem-build/License.txt",
-            lib_dir / "nvshmem")
-        install_file(
-            build_dir /
-            "tensorrt_llm/deep_ep/nvshmem-build/src/lib/nvshmem_bootstrap_uid.so.3",
-            lib_dir / "nvshmem")
-        install_file(
-            build_dir /
-            "tensorrt_llm/deep_ep/nvshmem-build/src/lib/nvshmem_transport_ibgda.so.103",
-            lib_dir / "nvshmem")
+    nvshmem_build_dir = build_dir / "tensorrt_llm" / "deep_ep" / "nvshmem-build"
+    nvshmem_license = nvshmem_build_dir / "License.txt"
+    if nvshmem_license.exists():
+        nvshmem_out_dir = lib_dir / "nvshmem"
+        nvshmem_out_dir.mkdir(exist_ok=True)
+        lib_src_dir = nvshmem_build_dir / "src" / "lib"
+        # Be resilient to SONAME changes across NVSHMEM versions
+        libs = []
+        for pattern in ("nvshmem_bootstrap_uid.so*", "nvshmem_transport_ibgda.so*"):
+            libs.extend(lib_src_dir.glob(pattern))
+        if not libs:
+            warnings.warn(
+                f"NVSHMEM license found at {nvshmem_license}, but no expected libraries in {lib_src_dir}. "
+                "Skipping NVSHMEM packaging.")
+        else:
+            install_file(nvshmem_license, nvshmem_out_dir)
+            for lib in libs:
+                install_file(lib, nvshmem_out_dir)
+    else:
+        print("-- NVSHMEM artifacts not detected (License.txt missing); skipping NVSHMEM packaging.")
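To illustrate why glob-based matching tolerates SONAME bumps (.so.3 today, .so.103 tomorrow), here is a small standalone sketch; the file names are examples and find_nvshmem_libs is a helper invented for this demo:

```python
import tempfile
from pathlib import Path

# Same patterns as the suggested change: match any SONAME suffix.
PATTERNS = ("nvshmem_bootstrap_uid.so*", "nvshmem_transport_ibgda.so*")

def find_nvshmem_libs(lib_dir: Path) -> list:
    libs = []
    for pattern in PATTERNS:
        libs.extend(sorted(lib_dir.glob(pattern)))
    return libs

with tempfile.TemporaryDirectory() as d:
    lib_dir = Path(d)
    # Simulate a build tree with versioned SONAMEs plus an unrelated library.
    for name in ("nvshmem_bootstrap_uid.so.3",
                 "nvshmem_transport_ibgda.so.103",
                 "unrelated.so.1"):
        (lib_dir / name).touch()
    found = [p.name for p in find_nvshmem_libs(lib_dir)]
    print(found)  # ['nvshmem_bootstrap_uid.so.3', 'nvshmem_transport_ibgda.so.103']
```

Because the pattern stops at the base name plus ".so", any future version suffix is picked up without touching the build script again.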
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
scripts/build_wheel.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
scripts/build_wheel.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
scripts/build_wheel.py
🔇 Additional comments (1)
scripts/build_wheel.py (1)
672-686
: Packaging includes nvshmem artifacts

I’ve verified that in setup.py’s
package_data
(lines 102–110) you explicitly list:
libs/nvshmem/License.txt
libs/nvshmem/nvshmem_bootstrap_uid.so.3
libs/nvshmem/nvshmem_transport_ibgda.so.103
These files will be bundled into the wheel. If you ever add more files under
libs/nvshmem
, remember to update this list (or switch to a glob pattern like
'libs/nvshmem/*'
) so they aren’t omitted.
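A sketch of that glob switch in setup.py's package_data; the surrounding keys and the exact package name here are assumptions for illustration, not copied from the repository:

```
package_data={
    "tensorrt_llm": [
        # One glob instead of three hard-coded names; picks up
        # License.txt plus any future .so.* SONAME automatically.
        "libs/nvshmem/*",
    ],
},
```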
Summary by CodeRabbit
New Features
Performance
Build
Chores
Description
An attempt to build on CUDA13. Work in progress.
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
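For instance, a minimal invocation posted as a PR comment might look like the following (the stage name is illustrative, not an actual stage):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```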
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail-fast on build/test/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build, package, and sanity-check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Example: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Example: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages that do not match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Example: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline plus the specified test stages. Example: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.