
Conversation


@borisfom borisfom commented Aug 14, 2025

Summary by CodeRabbit

  • New Features

    • Added CUDA 13 support and a build-time CUDA flags status message.
  • Performance

    • Replaced several internal reduction and iterator primitives for more consistent and uniform numerical behavior.
  • Build

    • Switched NVSHMEM to prebuilt discovery for newer toolchains, tightened runtime library paths, and reduced targeted GPU architectures.
  • Chores

    • Loosened and aligned Python dependency pins, standardized kernel symbol visibility/instantiations, and made NVSHMEM artifact installation conditional.

Description

An attempt to build with CUDA 13. Work in progress.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

borisfom added 7 commits July 29, 2025 16:14
Signed-off-by: Boris Fomitchev <[email protected]>

coderabbitai bot commented Aug 14, 2025

📝 Walkthrough

Walkthrough

Updates build and dependency configuration for CUDA 13, refactors NVSHMEM discovery to prefer prebuilt libraries for CUDA ≥13 while keeping a patched ExternalProject fallback, migrates several cuTensorMap usages to a v12000 function-pointer type, narrows a CUTLASS arch, swaps multiple reductions to cuda::maximum, introduces Thrust iterators, updates cudaMemAdvise usage, and specializes/exports a beam-search kernel.

Changes

Cohort / File(s) Summary
Top-level CMake CUDA flags
cpp/CMakeLists.txt
Formatting and new CUDA-version gated logic: emit status, append CUDA/NVCC flags for CUDA ≥13.0, set CMAKE_CUDA_RUNTIME_LIBRARY to Shared, and log CMAKE_CUDA_FLAGS.
NVSHMEM discovery & deep_ep target
cpp/tensorrt_llm/deep_ep/CMakeLists.txt
For CUDA ≥13 use prebuilt NVSHMEM (set NVSHMEM_INSTALL_PREFIX, find_path/find_library), link ${NVSHMEM_DEVICE_LIBRARY} & ${NVSHMEM_HOST_LIBRARY}, add ${NVSHMEM_INCLUDE_DIR}, update INSTALL_RPATH, and remove ExternalProject dependency; retain ExternalProject+patch flow for older toolchains.
cuTensorMap TMA API versioning
cpp/include/.../tma_utils.cuh, cpp/tensorrt_llm/kernels/.../fp8_blockscale_tma_utils.cuh
Change function-pointer type to PFN_cuTensorMapEncodeTiled_v12000: get_cuTensorMapEncodeTiled() returns v12000, casts and template parameter encode_func updated accordingly.
CUTLASS arch change
cpp/tensorrt_llm/kernels/cutlass_kernels/CMakeLists.txt
Reduce fp8_blockscale_gemm_src architectures from 89 90 100f to 90.
BlockReduce reduction primitive migration
cpp/.../decoderMaskedMultiheadAttentionTemplate.h, cpp/.../sageAttentionKernels.cu, cpp/.../eagleDecodingKernels.cu, cpp/.../blockScaleMoe/DevKernel.cu, cpp/.../runtime/utils/debugUtils.cu
Replace cub::Max with cuda::maximum in multiple BlockReduce(...).Reduce(...) calls; no other logic or signature changes.
Thrust iterators for segmented sort
cpp/tensorrt_llm/kernels/topkLastDim.cu
Replace cub::CountingInputIterator/cub::TransformInputIterator with thrust::make_counting_iterator and thrust::make_transform_iterator; add thrust includes; pass iterators to cub::DeviceSegmentedSort::SortPairs as before.
Beam-search kernel specialization & export
cpp/tensorrt_llm/kernels/beamSearchKernels.cu, cpp/tensorrt_llm/kernels/beamSearchKernels.h
Remove generic addCumLogProbs<T> definition, add explicit instantiations for <float> and <half>; header: rename first param to pStage1LogProbs, add __attribute__((visibility("default"))), and provide kernel body under #ifdef __CUDACC__.
cudaMemAdvise API update
cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
Replace cudaMemAdvise call that passed an int device id with a cudaMemLocation{cudaMemLocationTypeDevice, currentDevId} for cudaMemAdviseSetPreferredLocation.
Conditional NVSHMEM artifact install
scripts/build_wheel.py
Guard NVSHMEM artifact installation on presence of nvshmem-build/License.txt; only install NVSHMEM libs if license file exists.
Requirements / CUDA 13 alignment
requirements.txt
Move extra index to cu130, align CUDA-related packages to cu13, loosen/adjust many pins (torch/tensorrt/transformers/pillow/wheel/etc.), add nvidia-nvshmem-cu13, and annotate triton-related entries.
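
The build_wheel.py guard from the table above can be sketched as follows (a minimal Python illustration; `maybe_install_nvshmem` and the `install_libs` callback are hypothetical names, only the nvshmem-build/License.txt marker path comes from the change):

```python
from pathlib import Path


def maybe_install_nvshmem(build_dir: Path, install_libs) -> bool:
    """Install NVSHMEM artifacts only when the build's license marker exists.

    install_libs is an illustrative callback standing in for the actual
    artifact-copy logic in scripts/build_wheel.py.
    """
    license_file = build_dir / "nvshmem-build" / "License.txt"
    if not license_file.exists():
        # Prebuilt-NVSHMEM path (CUDA >= 13): nothing was built locally, skip.
        return False
    install_libs(build_dir / "nvshmem-build")
    return True
```

With this shape, the wheel build degrades gracefully: when the ExternalProject flow never ran (no License.txt), the install step is a no-op instead of failing on missing files.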

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CMake as CMake
  participant System as System (prebuilt NVSHMEM)
  participant EP as ExternalProject (nvshmem_project)

  CMake->>CMake: Detect CUDA version
  alt CUDA >= 13.0
    CMake->>System: find_path / find_library NVSHMEM
    System-->>CMake: NVSHMEM paths & libs
    CMake->>CMake: link NVSHMEM_DEVICE_LIBRARY & NVSHMEM_HOST_LIBRARY
  else CUDA < 13.0
    CMake->>EP: ExternalProject_Add nvshmem_project (patch+build)
    EP-->>CMake: nvshmem_project::nvshmem
    CMake->>CMake: add dependency deep_ep -> nvshmem_project
  end
  CMake->>CMake: set INSTALL_RPATH including NVSHMEM_INSTALL_PREFIX

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • nv-guomingz
  • chzblych
  • pcastonguay
  • litaotju
  • tomeras91
  • PerkzZheng



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🔭 Outside diff range comments (3)
cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh (1)

98-114: Use non-versioned PFN_cuTensorMapEncodeTiled in public headers; add missing <type_traits>

PFN_cuTensorMapEncodeTiled_v12000 is a CUDA-internal, versioned typedef — don't expose it in a public header. Keep the versioned symbol only for the runtime lookup/reinterpret_cast and use the non-suffixed PFN_cuTensorMapEncodeTiled in public APIs. This header also uses std::remove_cv / std::is_same but does not include <type_traits> (it already includes <stdexcept>).

Files/locations to change

  • cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
    • Change get_cuTensorMapEncodeTiled() return type to PFN_cuTensorMapEncodeTiled and reinterpret_cast to PFN_cuTensorMapEncodeTiled.
    • Change make_2d_tma_copy_desc(...) encode_func parameter to PFN_cuTensorMapEncodeTiled.
    • Add #include <type_traits> near the other includes (note: <stdexcept> is already present).
  • Check other occurrences of the versioned typedef and unify (example): cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh currently defines PFN_cuTensorMapEncodeTiled -> PFN_cuTensorMapEncodeTiled_v12000; ensure consistency across headers/source.

Suggested diffs (apply in tma_utils.cuh)

- PFN_cuTensorMapEncodeTiled_v12000 get_cuTensorMapEncodeTiled()
+ PFN_cuTensorMapEncodeTiled get_cuTensorMapEncodeTiled()
@@
-    return reinterpret_cast<PFN_cuTensorMapEncodeTiled_v12000>(cuTensorMapEncodeTiled_ptr);
+    return reinterpret_cast<PFN_cuTensorMapEncodeTiled>(cuTensorMapEncodeTiled_ptr);
- template <typename T>
- CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+ template <typename T>
+ CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled encode_func = nullptr)
 #include <stdexcept>
+#include <type_traits>
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh (2)

129-134: Remove debug I/O from header; handle errors deterministically

std::cout in a header is inappropriate, and the magic check int(res) == 1 is unclear — replace it with an assert or explicit error propagation. If logging is required, use the project logger (not std::cout) and keep the logging implementation out of headers.

  • File: cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
  • Lines: 129–134

Suggested change:

-    if (int(res) == 1)
-    {
-        std::cout << "check 0: " << int(res) << std::endl;
-        std::cout << gmem_dim[0] << "\t" << gmem_dim[1] << std::endl;
-    }
+    // Debug prints removed from header.
+    // Handle failures deterministically:
+    // DG_HOST_ASSERT(res == CUDA_SUCCESS); // or project assert macro
+    // or propagate error (e.g. return/throw an error code/exception)

If you must emit runtime diagnostics, plumb a logger (or move prints to a .cpp) rather than printing from a header.


89-99: Fix CUDA version guard and add missing includes in fp8_blockscale_tma_utils.cuh

The current guard (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 5) is false for CUDA 13.0 (major=13, minor=0), so the ByVersion path is incorrectly disabled. Use the numeric CUDA_VERSION check and add the missing includes used in this file.

Files / locations to update:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
    • Includes near top (add <type_traits> and <stdexcept>)
    • get_cuTensorMapEncodeTiled() function (change guard and return type/rewrap)
    • make_2d_tma_copy_desc signature (use PFN_cuTensorMapEncodeTiled)

Suggested diff:

--- a/cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
+++ b/cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
@@
 #include <cuda_runtime.h>
+#include <type_traits>
+#include <stdexcept>
 
 #include <cassert>
@@
-PFN_cuTensorMapEncodeTiled_v12000 get_cuTensorMapEncodeTiled()
+PFN_cuTensorMapEncodeTiled get_cuTensorMapEncodeTiled()
 {
@@
-#if (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 5)
+#if CUDA_VERSION >= 12050
@@
-    return reinterpret_cast<PFN_cuTensorMapEncodeTiled_v12000>(cuTensorMapEncodeTiled_ptr);
+    return reinterpret_cast<PFN_cuTensorMapEncodeTiled>(cuTensorMapEncodeTiled_ptr);
 }
@@
-template <typename data_type>
-CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+template <typename data_type>
+CUtensorMap make_2d_tma_copy_desc(..., PFN_cuTensorMapEncodeTiled encode_func = nullptr)

Rationale: using CUDA_VERSION >= 12050 ensures 12.5+ (including 13.x) take the ByVersion path. Adding <type_traits> and <stdexcept> covers std::is_same and std::runtime_error used in this file.
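
The pitfall in the original guard is easy to sanity-check outside CUDA; a minimal Python sketch (using the CUDA convention CUDA_VERSION = major * 1000 + minor * 10; function names are illustrative):

```python
def old_guard(major: int, minor: int) -> bool:
    # Mirrors: #if (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 5)
    return major >= 12 and minor >= 5


def new_guard(cuda_version: int) -> bool:
    # Mirrors: #if CUDA_VERSION >= 12050
    return cuda_version >= 12050


# CUDA 12.5: both forms agree.
assert old_guard(12, 5) and new_guard(12050)
# CUDA 13.0: the pairwise guard is wrongly false (minor 0 < 5); the numeric one is true.
assert not old_guard(13, 0)
assert new_guard(13000)
```

The pairwise major/minor comparison silently excludes every x.0 through x.4 release of any later major version, which is exactly the CUDA 13.0 case this PR targets.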

🧹 Nitpick comments (10)
cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu (1)

507-514: Use typed functor for BlockReduce and ensure CCCL headers are available

Switching to cuda::maximum is fine for CUDA 13/CCCL, but two nits:

  • Be explicit about the functor type to avoid deduction pitfalls across toolkits.
  • Ensure the compilation unit includes the CCCL functional header (directly or transitively) so cuda::maximum is defined.

Apply this minimal change for clarity and robustness:

-    auto const maxGenLength = BlockReduce(tempStorage.reduce).Reduce(nextDraftLen, cuda::maximum());
+    auto const maxGenLength = BlockReduce(tempStorage.reduce).Reduce(nextDraftLen, cuda::maximum<SizeType32>());

If not already guaranteed elsewhere, please confirm that <cuda/functional> is visible via includes (your CMake change adding the CCCL include path should satisfy this).

cpp/tensorrt_llm/runtime/utils/debugUtils.cu (1)

56-63: Typed reduction functor and CCCL availability

Using cuda::maximum is correct, but specify the type to avoid any template argument deduction issues across CUDA/CCCL versions:

-    int blockFound = BlockReduceT(tempStorage).Reduce(found, cuda::maximum());
+    int blockFound = BlockReduceT(tempStorage).Reduce(found, cuda::maximum<int32_t>());

Also confirm CCCL headers (e.g., <cuda/functional>) are reachable in this TU; otherwise you’ll hit an unresolved identifier on older stacks.

cpp/CMakeLists.txt (1)

378-378: Prefer message(STATUS) to reduce noise in CMake output

Unscoped message() prints at the default log level and can be noisy in CI. Use STATUS or VERBOSE:

-message("CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
cpp/tensorrt_llm/kernels/cutlass_kernels/CMakeLists.txt (1)

193-194: Reducing fp8_blockscale_gemm architectures to only 90 may drop support

Setting fp8_blockscale_gemm_src to 90 only removes 89/100/120 fatbin variants and can lead to runtime failures on non-Hopper GPUs (e.g., Ada Lovelace 89) or newer (100/120) if this target is selected.

If this is intentional (e.g., kernels are Hopper-specific), consider adding a comment and guarding elsewhere to avoid launching on unsupported SMs. Otherwise, retain broader coverage:

-set_cuda_architectures(fp8_blockscale_gemm_src 90)
+set_cuda_architectures(fp8_blockscale_gemm_src 90 100f 120f)

At minimum, please verify build/run paths won’t reference this lib on SM89 or SM100/SM120.

cpp/tensorrt_llm/kernels/sageAttentionKernels.cu (1)

253-255: Add the proper header for cuda::maximum and consider specifying the functor type explicitly

Switching BlockReduce’s operator from cub::Max{} to cuda::maximum{} is fine, but you should include the correct header to guarantee availability across toolchains.

  • Include header: <cuda/functional>
  • Optional: For older/stricter compilers, make the functor type explicit (e.g., cuda::maximum<float>{}) to avoid template deduction issues on non-standard types (e.g., __nv_bfloat16).

Follow-up suggestions (applies outside the selected range):

// near other includes
#include <cuda/functional>

If you hit type resolution issues with bf16/half in some CUDA versions, prefer:

float aggregate = BlockReduce(temp_storage).Reduce(static_cast<float>(local_amax), cuda::maximum<float>{});

and define BlockReduce on float accordingly.

Also applies to: 430-434

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu (2)

157-159: Ensure cuda::maximum is available: add <cuda/functional>

Using cuda::maximum() with BlockReduce requires the libcudacxx functional header on some toolchains. Please add the include to prevent build failures.

Apply outside-range change:

// near other headers
#include <cuda/functional>

660-661: Same as above: add <cuda/functional> for cuda::maximum

Repeat usage here. One include at the top of this file suffices for both sites.

cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1353-1355: Avoid duplication: factor iterator construction into a small helper

You’re repeating the counting/transform iterator construction. Consider a small inline helper to compute begin/end transform iterators once to reduce duplication and the risk of drift.

Example helper (outside the selected range):

template <typename IdxT>
inline auto segmentedOffsets(IdxT k) {
    auto base = thrust::make_counting_iterator<IdxT>(0);
    auto begin = thrust::make_transform_iterator(base, air_topk_stable::ComputeOffset<IdxT>(k));
    auto end   = begin + 1;
    return std::make_pair(begin, end);
}

Then use:

auto [beginOff, endOff] = segmentedOffsets<IdxT>(k);
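
The iterator pair above just materializes offset(i) = i * k for segments of uniform length; a host-side Python sketch of the same arithmetic (names are illustrative, mirroring ComputeOffset and the begin/end iterators):

```python
def compute_offset(k: int):
    # Host-side analog of air_topk_stable::ComputeOffset<IdxT>:
    # segment i starts at element i * k.
    return lambda i: i * k


def segment_offsets(k: int, batch_size: int):
    f = compute_offset(k)
    begin = [f(i) for i in range(batch_size)]    # what the begin iterator yields
    end = [f(i + 1) for i in range(batch_size)]  # the begin iterator shifted by one
    return begin, end


begin, end = segment_offsets(4, 3)
assert begin == [0, 4, 8]
assert end == [4, 8, 12]
# Sanity check suggested in the review: total items == k * batch_size.
assert end[-1] == 4 * 3
```

This also makes the SortPairs invariant explicit: the last end offset must equal the total number of items passed to cub::DeviceSegmentedSort::SortPairs.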
requirements.txt (1)

37-39: Nit: clean up stray whitespace and obsolete comment for wheel

There’s a trailing space after ‘wheel’ and a leftover comment line. Consider removing to avoid confusion.

-pillow>=10.3.0
-wheel 
-#<=0.45.1
+pillow>=10.3.0
+wheel
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)

2-3: Remove dead variable NVSHMEM_URL_HASH

This variable is now unused after removing the ExternalProject flow.

-set(NVSHMEM_URL_HASH
-    SHA256=eb2c8fb3b7084c2db86bd9fd905387909f1dfd483e7b45f7b3c3d5fcf5374b5a)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cd8c0f and c47c7b5.

📒 Files selected for processing (13)
  • cpp/CMakeLists.txt (3 hunks)
  • cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh (2 hunks)
  • cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh (3 hunks)
  • cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h (1 hunks)
  • cpp/tensorrt_llm/kernels/sageAttentionKernels.cu (2 hunks)
  • cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (3 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu (2 hunks)
  • cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp (1 hunks)
  • cpp/tensorrt_llm/runtime/utils/debugUtils.cu (1 hunks)
  • requirements.txt (3 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/runtime/utils/debugUtils.cu
  • cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
  • cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
  • cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
  • cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
  • cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/runtime/utils/debugUtils.cu
  • cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
  • cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
  • cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/runtime/utils/debugUtils.cu
  • cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
  • cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
  • cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
  • cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
  • cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/runtime/utils/debugUtils.cu
  • cpp/tensorrt_llm/kernels/sageAttentionKernels.cu
  • cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
  • cpp/tensorrt_llm/kernels/speculativeDecoding/eagleDecodingKernels.cu
  • cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
  • cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h
  • cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
🔇 Additional comments (11)
cpp/tensorrt_llm/kernels/topkLastDim.cu (2)

28-30: LGTM: Thrust iterators are correctly included

Including thrust counting and transform iterators is correct for the new iterator approach. No issues here.


1226-1231: Segment offsets via Thrust iterators look correct; keep the functor __host__ __device__

  • Using thrust::make_counting_iterator and thrust::make_transform_iterator with ComputeOffset<IdxT>(k) is correct for segments of uniform length k.
  • ComputeOffset is already marked __host__ __device__, which is required.

No action needed.

If you want to double-check the API usage (num segments, begin/end offsets), confirm that k * batch_size equals the total number of items passed to SortPairs for all calls in this TU.

requirements.txt (9)

6-6: cuda-python constraint looks appropriate; confirm compatibility with your base image

The ~=13.0.0 spec is fine. Ensure the base NGC image provides a matching cuda-python to avoid downgrades/upgrades during install.


16-20: Loosening pynvml/nvidia-ml-py: verify modelopt constraints

Since nvidia-modelopt transitively depends on pynvml, confirm that loosening these pins does not introduce incompatible versions at runtime.


22-22: h5py unpin is reasonable; just ensure ABI compatibility for platforms used in CI

h5py>=3.12.1 should be fine. If manylinux/arm64 targets are in scope, watch for prebuilt wheel availability.


25-27: Torch range bump: double-check downstream constraints (TRT, transformers)

torch>=2.7.1,<=2.9.0a0 widens the range. Verify compatibility with:

  • TensorRT Python bindings used (~10.13.0)
  • transformers~=4.55.0

33-33: Transformers bump looks fine; verify model code paths for API shifts

transformers 4.55 may include minor API/behavior changes. Make sure your integration doesn’t rely on deprecated symbols from 4.53.


49-49: FastAPI pin loosened: ensure Uvicorn and Pydantic versions remain compatible

fastapi>=0.115.4 with pydantic>=2.9.1 is expected to be compatible; confirm no transitive conflicts with other web dependencies.


55-55: flashinfer-python rename/version: confirm correct package name and wheels

There have been naming inconsistencies in the past. Verify that flashinfer-python>=0.2.5 is the intended package name and is available for your Python/CUDA matrix.


65-65: llguidance loosened: confirm it’s still compatible with your runtime

No action if tests pass; otherwise consider pinning to a known-good minor range.


1-1: Confirm PyTorch cu130 index vs. pinned torch/torchvision — manual check required

Top-level requirements.txt sets the extra index to cu130 and pins torch to >=2.7.1,<=2.9.0a0 (torchvision is unpinned). I could not fetch the PyTorch cu130 index from this sandbox, so I cannot confirm the required wheels are present for your CI platforms — please verify.

Files to check:

  • requirements.txt (repo root) — line 1: --extra-index-url https://download.pytorch.org/whl/cu130; torch pinned at lines ~27–29.
  • constraints.txt (referenced by requirements.txt) — check for any platform/version overrides.
  • CI pipelines that consume this requirements.txt (no .github/workflows found in the repo root) — confirm which CI systems/images use this file and their CUDA/Python combos.

Quick local verification (run where network access is available):
python - <<'PY'
import urllib.request, re
url = "https://download.pytorch.org/whl/cu130/torch_stable.html"
data = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
for pkg in ("torch", "torchvision"):
    hits = sorted(set(re.findall(r'href=["\']([^"\']*%s-[^"\']*)["\']' % pkg, data, re.I)))
    print(pkg, len(hits))
    for h in hits[:30]:
        print(" ", h)
PY

What to confirm:

  • The cu130 index contains wheels matching the torch version range (and torchvision) for all Python versions / OS / CUDA builds your CI uses.
  • If not, either change the extra-index-url to the appropriate CUDA index(s) or pin torch/torchvision to versions that are available on cu130.

Comment on lines 116 to 119
 template <typename T>
 CUtensorMap make_2d_tma_copy_desc(T* global_address, uint64_t gmem_dim[2], uint64_t stride_in_bytes,
-    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
+    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Default parameter type should match the stable PFN type

Align the encode_func param with the stable, unversioned type to avoid exposing versioned types.

-    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)

Double-check the other call sites for signature consistency.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== show file (if exists) ==="
FILE="cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh"
if [ -f "$FILE" ]; then
  sed -n '1,240p' "$FILE" || true
else
  echo "File not found: $FILE"
fi

echo
echo "=== rg: PFN_cuTensorMapEncodeTiled_v12000 ==="
rg --hidden -n "PFN_cuTensorMapEncodeTiled_v12000" || true

echo
echo "=== rg: PFN_cuTensorMapEncodeTiled ==="
rg --hidden -n -P "PFN_cuTensorMapEncodeTiled(?!_v12000)" || true

echo
echo "=== rg: make_2d_tma_copy_desc ==="
rg --hidden -n "make_2d_tma_copy_desc" || true

echo
echo "=== rg: cuTensorMapEncodeTiled occurrences ==="
rg --hidden -n "cuTensorMapEncodeTiled" || true

echo
echo "=== rg: cuTensorMap related typedefs/usages ==="
rg --hidden -n "cuTensorMap" || true

Length of output: 25549


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== fp8_blockscale_tma_utils.cuh (head) ==="
sed -n '1,240p' cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh || true

echo
echo "=== cudaDriverWrapper.h (head) ==="
sed -n '1,320p' cpp/tensorrt_llm/common/cudaDriverWrapper.h || true

echo
echo "=== cudaDriverWrapper.cpp (head) ==="
sed -n '1,360p' cpp/tensorrt_llm/common/cudaDriverWrapper.cpp || true

echo
echo "=== tma_utils.cuh (relevant region) ==="
sed -n '1,220p' cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh || true

echo
echo "=== rg: where PFN_cuTensorMapEncodeTiled is defined/mentioned ==="
rg --hidden -n "PFN_cuTensorMapEncodeTiled" || true

Length of output: 32409


Prefer the stable PFN_cuTensorMapEncodeTiled type for the default parameter

Use the unversioned PFN to avoid leaking the versioned typedef; callers I checked do not pass encode_func explicitly.

Files to update:

  • cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh

    • Replace:
      • uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
    • With:
      • uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh

    • Make the same signature change (or rely on the existing
      #define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000 and prefer using the unversioned name).

Call sites verified (no explicit encode_func argument): cpp/include/tensorrt_llm/deep_gemm/fp8_gemm.cuh, cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_gemm_kernel.cuh, etc.

Optional safety: add a local fallback in headers to map the unversioned name to the v12000 typedef if the unversioned PFN is not present:

#ifndef PFN_cuTensorMapEncodeTiled
#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
#endif

🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/deep_gemm/tma_utils.cuh around lines 116-119, the
function signature uses the versioned typedef PFN_cuTensorMapEncodeTiled_v12000
as the default parameter; change that parameter type to the unversioned
PFN_cuTensorMapEncodeTiled (i.e., replace PFN_cuTensorMapEncodeTiled_v12000 with
PFN_cuTensorMapEncodeTiled) so callers that don’t pass encode_func keep using
the stable name, and make the same signature change in
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
(or rely on an existing #define mapping the unversioned name to the v12000
typedef); optionally add a local fallback macro guard at the top of the headers:
if PFN_cuTensorMapEncodeTiled is not defined, #define it to
PFN_cuTensorMapEncodeTiled_v12000.

Comment on lines 41 to 47
set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem")
# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")

find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)


💡 Verification agent

🧩 Analysis chain

NVSHMEM discovery is brittle: hard-coded path, not REQUIRED, and host SONAME pinned to .so.3

  • Hard-coding NVSHMEM_INSTALL_PREFIX to a Torch-relative path without allowing override will break on non-standard layouts.
  • find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 ...) pins a Linux SONAME and will fail on other platforms or future versions. Use base names via NAMES and search both lib and lib64.
  • None of the NVSHMEM finds are REQUIRED; a missing lib will fail late at link time.

Apply this diff to make discovery robust and fail-fast:

-set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem")
-# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")
-
-find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem" CACHE PATH "NVSHMEM install root")
+
+find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+          HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+          REQUIRED)
+find_library(NVSHMEM_HOST_LIBRARY
+             NAMES nvshmem_host nvshmem
+             HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+             REQUIRED)
+find_library(NVSHMEM_DEVICE_LIBRARY
+             NAMES nvshmem_device
+             HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+             REQUIRED)

Please confirm whether your deployment layout sometimes places NVSHMEM under lib64; if so, the added hint is required.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Repository root: $(pwd)"
echo
echo "=== Show cpp/tensorrt_llm/deep_ep/CMakeLists.txt (first 300 lines) ==="
sed -n '1,300p' cpp/tensorrt_llm/deep_ep/CMakeLists.txt || true

echo
echo "=== Search for NVSHMEM references ==="
rg "NVSHMEM|nvshmem" -n || true

echo
echo "=== Show surrounding context where NVSHMEM_INSTALL_PREFIX might be used ==="
rg -n "NVSHMEM_INSTALL_PREFIX" -n || true
rg -n "TORCH_INSTALL_PREFIX" -n || true

echo
echo "=== Search for usages of nvshmem_host or nvshmem_device in repo ==="
rg -n "nvshmem_host|nvshmem_device" -n || true

echo
echo "=== List typical lib directories referenced elsewhere (lib64) ==="
rg -n "lib64" -n || true

Length of output: 18877


Make NVSHMEM discovery robust and fail-fast

Current code in cpp/tensorrt_llm/deep_ep/CMakeLists.txt hard-codes a Torch-relative install path, pins a specific SONAME (nvshmem_host.so.3), and does not use REQUIRED — this is brittle and will fail late on non-standard layouts or future NVSHMEM versions.

Files/locations to fix:

  • cpp/tensorrt_llm/deep_ep/CMakeLists.txt — lines ~41–47 (NVSHMEM_INSTALL_PREFIX and the three find_* calls).
  • Note: cpp/tensorrt_llm/CMakeLists.txt already uses find_package(NVSHMEM REQUIRED ...); also setup.py and scripts/build_wheel.py reference specific NVSHMEM .so files (e.g. nvshmem_bootstrap_uid.so.3), so ensure packaging and discovery remain consistent.

Apply this diff to make discovery configurable, search lib64, avoid SONAME pinning, and fail fast:

-set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem")
-# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")
-
-find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem" CACHE PATH "NVSHMEM install root")
+
+find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+          HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+          REQUIRED)
+find_library(NVSHMEM_HOST_LIBRARY
+             NAMES nvshmem_host nvshmem
+             HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+             REQUIRED)
+find_library(NVSHMEM_DEVICE_LIBRARY
+             NAMES nvshmem_device
+             HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+             REQUIRED)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem")
# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")
find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
find_library(NVSHMEM_HOST_LIBRARY nvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/shmem" CACHE PATH "NVSHMEM install root")
find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
HINTS ${NVSHMEM_INSTALL_PREFIX}/include
REQUIRED)
find_library(NVSHMEM_HOST_LIBRARY
NAMES nvshmem_host nvshmem
HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
REQUIRED)
find_library(NVSHMEM_DEVICE_LIBRARY
NAMES nvshmem_device
HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
REQUIRED)
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/deep_ep/CMakeLists.txt around lines 41–47, the NVSHMEM
discovery is brittle: it hard-codes a Torch-relative NVSHMEM_INSTALL_PREFIX,
pins a specific SONAME, doesn't search lib64, and doesn't fail-fast. Replace the
fixed install prefix with a cache variable NVSHMEM_INSTALL_PREFIX (allowing
override), use find_path and find_library with NAMES (no SONAME suffix) and
HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
${NVSHMEM_INSTALL_PREFIX}/include and system paths, mark the find_* calls
REQUIRED so configuration errors fail early, and prefer using
find_package(NVSHMEM REQUIRED ...) if available for consistency with the
top-level CMake; ensure variables (NVSHMEM_INCLUDE_DIR, NVSHMEM_HOST_LIBRARY,
NVSHMEM_DEVICE_LIBRARY) are set from those robust searches.

Comment on lines 27 to 28
#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000


🛠️ Refactor suggestion

⚠️ Potential issue

Avoid redefining SDK typedefs via macros

#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000 is brittle, violates macro naming guidelines, and can unintentionally affect other headers. Prefer explicit types or a local alias.

Remove the macro:

-#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
-
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#define PFN_cuTensorMapEncodeTiled PFN_cuTensorMapEncodeTiled_v12000
🤖 Prompt for AI Agents
In
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
around lines 27-28, remove the macro definition that rebinds
PFN_cuTensorMapEncodeTiled to PFN_cuTensorMapEncodeTiled_v12000; instead delete
the #define and create a local, scoped alias (preferably a typedef or C++
using-alias) that maps PFN_cuTensorMapEncodeTiled to
PFN_cuTensorMapEncodeTiled_v12000 within this translation unit or an appropriate
namespace to avoid leaking a macro into other headers. Ensure the alias is only
visible where needed and that no preprocessor macro remains.

Comment on lines +111 to 112
uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
{

🛠️ Refactor suggestion

Unify encode_func parameter type with stable PFN

Avoid exposing the versioned PFN type in public signatures.

-    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
+    uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled_v12000 encode_func = nullptr)
{
uint32_t smem_dim[2], CUtensorMapSwizzle swizzle_type, PFN_cuTensorMapEncodeTiled encode_func = nullptr)
{
🤖 Prompt for AI Agents
In
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
around lines 111-112, the function signature exposes the versioned PFN type
PFN_cuTensorMapEncodeTiled_v12000; change the parameter type to the stable,
unversioned PFN alias (PFN_cuTensorMapEncodeTiled) so the public signature
doesn't leak a versioned typedef, and update any forward declarations or
includes so the unversioned PFN type is visible where this header is compiled.

Comment on lines +367 to 369
cudaMemLocation location {cudaMemLocationTypeDevice, currentDevId};
TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
hostPtr = devPtr;

💡 Verification agent

🧩 Analysis chain

Guard new cudaMemAdvise location-API for backward compatibility and fix indentation

The location-based cudaMemAdvise overload requires newer CUDA. Given CMake still allows CUDA >= 11.2, gate this with CUDART_VERSION to avoid compile errors with older toolkits. Also replace the tab with spaces to match file style.

Apply:

-	cudaMemLocation location {cudaMemLocationTypeDevice, currentDevId};
-        TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+    #if CUDART_VERSION >= 12000
+    cudaMemLocation location{cudaMemLocationTypeDevice, currentDevId};
+    TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+    #else
+    TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, currentDevId));
+    #endif

Alternatively, if you intend to raise the project’s minimum CUDA to 13, update find_package(CUDAToolkit ...) accordingly and remove the fallback.


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp"
echo "PWD: $(pwd)"
if [ -f "$FILE" ]; then
  echo "Found file: $FILE"
  echo "---- File header (first 120 lines) ----"
  sed -n '1,120p' "$FILE"
  echo "---- Context lines 340-400 ----"
  sed -n '340,400p' "$FILE"
  echo "---- Search for relevant symbols in file ----"
  rg -n "TLLM_CUDA_CHECK|cudaMemAdvise|cudaMemLocation|CUDART_VERSION" "$FILE" || true
  echo "---- Lines containing tabs (show up to 200) ----"
  nl -ba "$FILE" | sed -n '1,520p' | rg --line-number $'\t' || true
else
  echo "File not found: $FILE"
  rg -n "hostAccessibleDeviceAllocator.cpp" || true
fi

echo "---- Search for find_package(CUDAToolkit in repo ----"
rg -n "find_package\\(CUDAToolkit" || true
echo "---- Search for CUDART_VERSION usage in repo ----"
rg -n "CUDART_VERSION" || true

Length of output: 13048


Guard cudaMemAdvise(location) for older CUDA toolkits & fix indentation

The cudaMemLocation overload requires CUDA ≥ 12, and cpp/CMakeLists.txt still allows CUDA 11.2 — gate the new API with CUDART_VERSION and replace the stray tab with spaces.

  • File: cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp — replace the cudaMemLocation / cudaMemAdvise lines (around lines ~366–369).
  • CMake: cpp/CMakeLists.txt currently uses find_package(CUDAToolkit 11.2 ...), so keep the fallback unless you bump the min CUDA version.

Apply:

-	cudaMemLocation location {cudaMemLocationTypeDevice, currentDevId};
-        TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+#if defined(CUDART_VERSION) && (CUDART_VERSION >= 12000)
+        cudaMemLocation location{cudaMemLocationTypeDevice, currentDevId};
+        TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
+#else
+        TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, currentDevId));
+#endif

Alternatively, if you intend to require CUDA ≥ 12 project-wide, update find_package(CUDAToolkit ...) in cpp/CMakeLists.txt and remove the fallback.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cudaMemLocation location {cudaMemLocationTypeDevice, currentDevId};
TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
hostPtr = devPtr;
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 12000)
cudaMemLocation location{cudaMemLocationTypeDevice, currentDevId};
TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, location));
#else
TLLM_CUDA_CHECK(cudaMemAdvise(devPtr, memorySize, cudaMemAdviseSetPreferredLocation, currentDevId));
#endif
hostPtr = devPtr;
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/runtime/moeLoadBalancer/hostAccessibleDeviceAllocator.cpp
around lines 367–369, the current cudaMemLocation/cudaMemAdvise usage requires
CUDA ≥ 12 and also contains a stray tab; guard the new API with a preprocessor
check (e.g., #if defined(CUDART_VERSION) && CUDART_VERSION >= 12000) to call
cudaMemLocation and cudaMemAdvise only when available and provide the existing
fallback for older CUDA versions in the #else branch, and replace the stray tab
with spaces to fix indentation; do not change cpp/CMakeLists.txt unless you
intend to raise the minimum CUDA version project-wide.

Comment on lines +30 to +33
nvidia-nccl-cu13
nvidia-nvshmem-cu13
nvidia-cuda-nvrtc-cu13
transformers~=4.55.0

💡 Verification agent

🧩 Analysis chain

Validate availability of new CUDA 13 vendor wheels

Please verify that the following packages exist on pip mirrors used in CI/CD:

  • nvidia-nccl-cu13
  • nvidia-nvshmem-cu13
  • nvidia-cuda-nvrtc-cu13

🏁 Script executed:

#!/bin/bash
# Check pip can find the cu13 variants (requires internet in CI)
for pkg in nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13; do
  python -m pip index versions "$pkg" 2>/dev/null | head -n 5 || echo "Lookup failed for $pkg"
done

Length of output: 463


🏁 Script executed:

#!/bin/bash
set -euo pipefail
pkgs=(nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13)

for pkg in "${pkgs[@]}"; do
  echo "=== $pkg ==="
  echo "--- python -m pip index versions ---"
  python -m pip index versions "$pkg" 2>&1 | sed -n '1,200p' || true

  echo "--- pypi.org JSON check ---"
  status=$(curl -s -o /tmp/pypi_${pkg}.json -w "%{http_code}" "https://pypi.org/pypi/${pkg}/json" || true)
  echo "HTTP status: $status"
  if [ "$status" = "200" ]; then
    python - <<PY
import json,sys
f="/tmp/pypi_${pkg}.json"
data=json.load(open(f))
print("latest_version:", data.get("info",{}).get("version"))
names = sorted({fi.get("packagetype") for files in data.get("releases", {}).values() for fi in files})
print("packagetypes:", names)
print("releases_count:", len(data.get("releases",{})))
PY
  fi
  echo
done

Length of output: 961


🏁 Script executed:

#!/bin/bash
set -euo pipefail
pkgs=(nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13)

for pkg in "${pkgs[@]}"; do
  echo "=== $pkg ==="
  echo "--- python -m pip index versions ---"
  python -m pip index versions "$pkg" 2>&1 || true

  echo "--- pypi.org JSON ---"
  url="https://pypi.org/pypi/${pkg}/json"
  status=$(curl -s -w "%{http_code}" -o /tmp/pypi_${pkg}.json "$url" || echo "000")
  echo "HTTP status: $status"
  if [ "$status" = "200" ]; then
    python - <<PY
import json,sys
f="/tmp/pypi_${pkg}.json"
data=json.load(open(f))
info = data.get("info", {})
releases = data.get("releases", {})
print("info.version:", info.get("version"))
print("releases_count:", len(releases))
# show up to 10 newest releases
for idx,(ver, files) in enumerate(sorted(releases.items(), key=lambda x: x[0], reverse=True)):
    if idx >= 10:
        break
    print(f"- {ver} (files: {len(files)})")
    for fi in files[:5]:
        print("   ", fi.get("filename"), fi.get("packagetype"), fi.get("python_version"), fi.get("url")[:140])
PY
  fi
  echo
done

Length of output: 4191


Action required: verify CUDA‑13 vendor wheels on your CI package mirror before merging

Short summary of verification:

  • nvidia-nvshmem-cu13 — found on PyPI (v3.3.20) with manylinux wheels for x86_64 and aarch64.
  • nvidia-nccl-cu13 — PyPI only shows a placeholder release (0.0.0a0); pip reports "No matching distribution".
  • nvidia-cuda-nvrtc-cu13 — PyPI only shows a placeholder release (0.0.0a0); pip reports "No matching distribution".

Location to check:

  • requirements.txt (lines ~30–33)

Current snippet:

nvidia-nccl-cu13
nvidia-nvshmem-cu13
nvidia-cuda-nvrtc-cu13
transformers~=4.55.0

Requested actions:

  • Either remove/hold the two placeholder packages (nvidia-nccl-cu13, nvidia-cuda-nvrtc-cu13) until real vendor wheels are published, or ensure your CI/internal PyPI mirror contains the correct vendor-built wheels for your target platforms.
  • If you expect these names to be different or hosted on a private index, confirm the exact package names and that CI will use that index.
🤖 Prompt for AI Agents
In requirements.txt around lines 30–33, the two CUDA‑13 vendor placeholder
packages (nvidia-nccl-cu13, nvidia-cuda-nvrtc-cu13) are not available on PyPI
and will break CI; either remove or comment them out (hold) until vendor wheels
are published, or update CI to point to your internal/alternate PyPI mirror that
contains the correct vendor-built wheel names and versions; if the package names
differ on your private index, replace them with the exact package names and
ensure the CI/pip install command includes the --index-url/--extra-index-url and
any credentials required.
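The placeholder check above can be automated; this is a hedged sketch (the heuristic treats every-release-is-`0.0.0*` as a placeholder, which matches the `0.0.0a0` findings reported above but is not authoritative):

```python
# Hedged sketch: flag packages whose only PyPI releases look like placeholders
# (e.g. a lone "0.0.0a0"). Network access is required for the live lookup.
import json
import urllib.request

def looks_like_placeholder(versions):
    # Heuristic, not authoritative: every release starts with "0.0.0".
    return bool(versions) and all(v.startswith("0.0.0") for v in versions)

def pypi_versions(pkg, timeout=10):
    url = f"https://pypi.org/pypi/{pkg}/json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return sorted(json.load(resp).get("releases", {}))

if __name__ == "__main__":  # live check; run where network is available
    for pkg in ("nvidia-nccl-cu13", "nvidia-nvshmem-cu13", "nvidia-cuda-nvrtc-cu13"):
        try:
            versions = pypi_versions(pkg)
            print(pkg, "placeholder-only" if looks_like_placeholder(versions) else versions[-5:])
        except Exception as exc:
            print(pkg, "lookup failed:", exc)
```

Point `pypi_versions` at an internal index's JSON endpoint instead if CI resolves packages from a private mirror.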

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (3)

149-149: Cast float literals to T to avoid implicit-narrowing for half/bfloat16

Minor, but avoids warnings and makes intent explicit when T is a low-precision type.

Apply this diff:

-            pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
+            pLocalLogProbs[i] += (i == endIds[slot]) ? static_cast<T>(1.0f) : static_cast<T>(0.0f);

Note: This only addresses type-safety. The indexing/semantics issue on this line still needs to be resolved as noted in a separate comment.


134-134: Use a portability macro for symbol visibility rather than a raw GCC attribute

__attribute__((visibility("default"))) is GCC/Clang specific. Prefer a project-level macro (e.g., TLLM_PUBLIC) that maps to __declspec(dllexport) on MSVC and __attribute__((visibility("default"))) elsewhere to keep CUDA 13 builds portable across toolchains.

Add (outside this file, e.g., in tensorrt_llm/common/visibility.h):

#pragma once
#if defined(_MSC_VER)
#define TLLM_PUBLIC __declspec(dllexport)
#else
#define TLLM_PUBLIC __attribute__((visibility("default")))
#endif

Then update this declaration:

-__global__ __attribute__((visibility("default"))) void addCumLogProbs(
+__global__ TLLM_PUBLIC void addCumLogProbs(

157-157: Remove unnecessary return; from the end of a global kernel

Redundant in a void kernel and slightly noisy.

Apply this diff:

-    return;
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c47c7b5 and 0fdaec9.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/kernels/beamSearchKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/beamSearchKernels.h (1 hunks)
💤 Files with no reviewable changes (1)
  • cpp/tensorrt_llm/kernels/beamSearchKernels.cu
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (1)

154-155: Verify nBM vs nBMIn for indexing cumLogProbs and finished

You intentionally use nBM for cumLogProbs due to VBWS initialization, but use nBMIn for finished. Given finished is documented as [BS*BM], verify whether BM here corresponds to nBMIn or the max beam width. If it’s the latter, finished[slot * nBM + iBMIn] may be required for consistency.

If finished is sized with the max beam width, update the indexing accordingly:

-        if (finished[slot * nBMIn + iBMIn].isFinished())
+        if (finished[slot * nBM + iBMIn].isFinished())

If it is truly sized with the current input beam width, keep as-is and consider adding a comment to document the difference from cumLogProbs for future maintainers.
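The stride question above can be sketched in a few lines (toy sizes; the names mirror nBM/nBMIn but are hypothetical):

```python
# Toy illustration: a 'finished' buffer allocated as [BS, BM] (max beam width)
# gives different elements depending on which width is used as the row stride.
nBS, nBM, nBMIn = 2, 4, 3  # hypothetical sizes; nBMIn < nBM

# Row-major [BS, BM] buffer; each entry records its (slot, beam) coordinates.
finished = [(s, b) for s in range(nBS) for b in range(nBM)]

slot, iBMIn = 1, 2
# Indexing with the allocated stride nBM hits the intended element:
print(finished[slot * nBM + iBMIn])    # (1, 2)
# Indexing with nBMIn as the stride silently reads a different beam:
print(finished[slot * nBMIn + iBMIn])  # (1, 1)
```

If the buffer really is allocated with the current input width, the second form is the correct one; the sketch only shows that the two cannot both be right for the same allocation.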

Comment on lines +147 to +151
if (finished[slot * nBMIn + iBMIn].isFinished())
{
pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
}
else

💡 Verification agent

🧩 Analysis chain

EOS/endIds indexing appears incorrect and likely out-of-bounds for [BS, BM] layout

endIds is documented as [BS, BM], but the code indexes it as endIds[slot] and compares it to i (a flattened [BM_in, BM_out, 2] index). Two issues:

  • Indexing: endIds[slot] ignores the beam dimension; for per-beam EOS, it should likely be endIds[slot * nBM + iBMIn] (or nBMIn, depending on layout).
  • Semantics: i indexes the flattened candidate grid, not the vocabulary/token ID dimension. Comparing i to an EOS token ID does not look meaningful and may never match.

This can cause incorrect scoring for finished beams and undefined behavior if endIds has only [BS, BM] elements.

Please confirm shapes and intended semantics. If indeed per-beam EOS, a minimal indexing fix would be:

  • Use the per-beam EOS id:

    • int const eosId = endIds[slot * nBM + iBMIn]; // or slot * nBMIn + iBMIn if finished is also sized that way.
  • Compare against the actual token index in pStage1LogProbs, not the flattened [BM_in, BM_out, 2] index. If stage-1 buffer includes a vocab dimension elsewhere, determine the local token index before comparison.

To help locate references and prior logic, run:


🏁 Script executed:

#!/bin/bash
# Look for other usages of endIds indexing and EOS handling in kernels
rg -n -A3 -B3 $'endIds\\[' cpp | sed -n '1,200p'

# Find all write sites for pStage1LogProbs to infer its layout
rg -n -A3 -B3 'pStage1(Log)?Probs' cpp | sed -n '1,200p'

# Find all callers of addCumLogProbs and their launch grids to reason about mapping
rg -n -A3 -B3 'addCumLogProbs\s*<|<<<' cpp | sed -n '1,200p'

Length of output: 45150


🏁 Script executed:

#!/bin/bash
sed -n '1,240p' cpp/tensorrt_llm/kernels/beamSearchKernels.h

Length of output: 15146


🏁 Script executed:

#!/bin/bash
sed -n '1,420p' cpp/tensorrt_llm/kernels/beamSearchKernels/beamSearchKernelsTemplate.h

Length of output: 17803


Fix required — addCumLogProbs compares candidate index to EOS token and ignores beam dim

Short summary: addCumLogProbs (cpp/tensorrt_llm/kernels/beamSearchKernels.h) iterates i over the flattened candidate grid (nBMIn * nBMOut * 2) and does (i == endIds[slot]). i is a candidate index, not a token id; also the code ignores the beam index in endIds. This will almost never match the EOS token and therefore fails to prefer EOS for finished beams (and is inconsistent with other kernels that treat endIds as per-slot token ids).

Locations to fix:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
    • Kernel addCumLogProbs — replace the incorrect comparison and add access to the stage-1 ids.
  • cpp/tensorrt_llm/kernels/beamSearchKernels/beamSearchKernelsTemplate.h
    • Call site that launches addCumLogProbs must pass pStage1Ids (pStage1Ids is already allocated in this scope).

Suggested minimal changes (apply and adjust to your code style):

  • Change kernel signature to accept the stage-1 ids:

    • before: addCumLogProbs(T* pStage1LogProbs, float const* cumLogProbs, FinishedState const* finished, int const* endIds, float const* diversityRates, ...)
    • after: addCumLogProbs(T* pStage1LogProbs, float const* cumLogProbs, FinishedState const* finished, int const* endIds, int const* pStage1Ids, float const* diversityRates, ...)
  • In the kernel body replace the block:

    • before:
      pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
    • after (minimal, assuming endIds is per-slot token id; if endIds is per-beam use the per-beam index shown below):
      int const* pLocalIds = pStage1Ids + bid * nBMIn * nBMOut * 2;
      int const eosId = endIds[slot]; // or: endIds[slot * nBM + iBMIn] if endIds is [BS, BM]
      pLocalLogProbs[i] += (pLocalIds[i] == eosId) ? (T)1.0f : (T)0.0f;
  • Update the kernel launch in beamSearchKernelsTemplate.h:

    • before:
      addCumLogProbs<<<nBS, nThread>>>(pStage1LogProbs, bh.cumLogProbs, bh.finished, bh.endIds, bh.diversityRates, bh.batchSlots, ...);
    • after:
      addCumLogProbs<<<nBS, nThread>>>(pStage1LogProbs, bh.cumLogProbs, bh.finished, bh.endIds, pStage1Ids, bh.diversityRates, bh.batchSlots, ...);

Extra: clarify the declared shape of endIds in BeamHypotheses (cpp/tensorrt_llm/kernels/beamSearchKernels.h) — comments currently read [BS, BM] but most kernels use endIds[slot] (per-slot). Confirm whether endIds should be [BS] or [BS, BM] and update code/comments and indexing across kernels accordingly.

Reason: The kernel must compare the candidate's token id (from stage-1 ids) to the EOS token id; comparing the flattened candidate index is incorrect and will break finished-beam handling.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/kernels/beamSearchKernels.h around lines 147-151,
addCumLogProbs currently compares the flattened candidate index i to
endIds[slot] which is wrong — you must compare the candidate token id from
stage-1 ids and account for beam dim; update the kernel signature to accept int
const* pStage1Ids, inside the kernel compute a pointer to the local stage1 ids
for this batch/beam region (e.g. pStage1Ids + bid * nBMIn * nBMOut * 2) and
replace the (i == endIds[slot]) test with a comparison of pLocalIds[i] to the
appropriate eosId (use endIds[slot] if endIds is per-slot, or endIds[slot * nBM
+ iBMIn] if per-beam), then add pStage1Ids to the addCumLogProbs kernel launch
in cpp/tensorrt_llm/kernels/beamSearchKernelsTemplate.h so the kernel receives
the stage-1 ids; also verify and if needed fix/comment the declared
shape/indexing of endIds across BeamHypotheses to be consistently [BS] or [BS,
BM].

Signed-off-by: Boris Fomitchev <[email protected]>
@borisfom (Author)

The wheel builds now. Still should be considered WIP until review.

@coderabbitai coderabbitai bot requested review from chzblych and dcampora August 14, 2025 07:38
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (3)

129-130: Good fix: host and device NVSHMEM libraries are now linked

Linking both ${NVSHMEM_DEVICE_LIBRARY} and ${NVSHMEM_HOST_LIBRARY} addresses previous link/runtime risk when host APIs are used.


41-48: Harden NVSHMEM discovery: allow overrides, avoid SONAME pinning, search lib64, and fail fast

Current find logic is brittle:

  • Hard-coded install prefix is not overrideable (not a CACHE var).
  • find_path/find_library are not REQUIRED (fail late).
  • Host library is pinned to a specific SONAME (libnvshmem_host.so.3) and includes lib prefix — both are error-prone across platforms/versions.
  • Only ${prefix}/lib is searched; many installs use lib64.

Refactor as below to make it robust and configurable. This is a continuation of prior feedback.

-set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem")
-# message("Torch Install Prefix: ${TORCH_INSTALL_PREFIX}")
-
-find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-# set(NVSHMEM_HOST_LIBRARY ${NVSHMEM_INSTALL_PREFIX}/lib/libnvshmem_host.so.3)
-find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-find_library(NVSHMEM_HOST_LIBRARY libnvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem"
+    CACHE PATH "NVSHMEM install root")
+
+find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+          HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+          REQUIRED)
+find_library(NVSHMEM_DEVICE_LIBRARY
+             NAMES nvshmem_device
+             HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+             REQUIRED)
+find_library(NVSHMEM_HOST_LIBRARY
+             NAMES nvshmem_host
+             HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+             REQUIRED)

118-119: Ensure runtime resolution: add $ORIGIN and lib64 to INSTALL_RPATH

Only ${NVSHMEM_INSTALL_PREFIX}/lib is in RPATH. Add $ORIGIN for relocatability and include lib64 to cover common layouts. This aligns with earlier feedback.

-             INSTALL_RPATH "${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib"
+             INSTALL_RPATH "\$ORIGIN;${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib64"
🧹 Nitpick comments (1)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)

41-45: Remove stale commented SONAME line

The commented libnvshmem_host.so.3 line is obsolete and can confuse future readers.

-# set(NVSHMEM_HOST_LIBRARY ${NVSHMEM_INSTALL_PREFIX}/lib/libnvshmem_host.so.3)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0fdaec9 and bcca984.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4 hunks)
🔇 Additional comments (2)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2)

127-127: Include path OK, but depends on non-REQUIRED find_path

This is fine, but make sure find_path(NVSHMEM_INCLUDE_DIR ...) is REQUIRED (see suggested refactor above) to fail fast if headers are missing.


138-139: LGTM: removed ExternalProject dependency

deep_ep no longer depends on an external NVSHMEM build target, consistent with switching to prebuilt discovery.

Signed-off-by: Boris Fomitchev <[email protected]>
@coderabbitai coderabbitai bot requested a review from kaiyux August 14, 2025 08:15
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (2)

162-164: Still leaking global to non-CUDA TUs: guard gatherId() as well

gatherId is declared __global__ outside any #ifdef __CUDACC__ and will break host-only compilation. Wrap it with the same CUDA guard (or move to a .cuh/.cu). This is the same class of issue called out previously for kernels in this header.

Apply this diff:

-__global__ void gatherId(int const* __restrict pStage1Id, int* __restrict pStage2Id, size_t const nBS,
-    size_t const nBMIn, size_t const nBMOut, size_t const nV);
+#ifdef __CUDACC__
+__global__ void gatherId(int const* __restrict pStage1Id, int* __restrict pStage2Id, size_t const nBS,
+    size_t const nBMIn, size_t const nBMOut, size_t const nV);
+#endif

135-156: Fix addCumLogProbs: compare candidate token id (from stage‑1 ids) to per‑slot EOS and avoid implicit float->T casts

Short: The kernel currently compares the flattened candidate index i to endIds[slot] (a token id). endIds is per‑slot (shape [BS]) and the correct comparison is between the candidate token id (from the stage‑1 ids buffer) and the per‑slot EOS id. Also avoid implicit float→T conversions by casting.

Places to change

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h — update addCumLogProbs signature and body (around lines ~135–156).
  • cpp/tensorrt_llm/kernels/beamSearchKernels/beamSearchKernelsTemplate.h — update kernel launch at ~line 631 to pass the stage‑1 ids pointer (pStage1Ids is available at ~line 619).
  • cpp/tensorrt_llm/kernels/beamSearchKernels.cu — update explicit template instantiations (the addCumLogProbs<float/half> prototype lines at ~137 and ~141) to the new signature.

Updated minimal diff (apply in beamSearchKernels.h; adjust instantiations & launches accordingly):

-__global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
-    FinishedState const* finished, int const* endIds, float const* diversityRates,
-    runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
+__global__ __attribute__((visibility("default"))) void addCumLogProbs(
+    T* __restrict pStage1LogProbs,
+    float const* __restrict cumLogProbs,
+    FinishedState const* __restrict finished,
+    int const* __restrict endIds,
+    int const* __restrict pStage1Ids,
+    float const* __restrict diversityRates,
+    runtime::SizeType32 const* __restrict batchSlots,
+    size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
 {
     int const bid = blockIdx.x; // Index of request in batch
     runtime::SizeType32 const slot = batchSlots[bid];
     float const diversityRate{diversityRates[slot]};
     T* pLocalLogProbs = pStage1LogProbs + bid * nBMIn * nBMOut * 2;
+    int const* pLocalIds = pStage1Ids + bid * nBMIn * nBMOut * 2;
 
     for (int i = threadIdx.x; i < nBMIn * nBMOut * 2; i += blockDim.x)
     {
         int const iBMIn = i / (nBMOut * 2);
         if (finished[slot * nBMIn + iBMIn].isFinished())
         {
-            pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
+            // endIds is per-slot; compare the candidate token id from stage‑1 ids
+            int const eosId = endIds[slot];
+            pLocalLogProbs[i] += (pLocalIds[i] == eosId) ? static_cast<T>(1.0f) : static_cast<T>(0.0f);
         }
         else
         {
-            pLocalLogProbs[i] += cumLogProbs[slot * nBM + iBMIn] + diversityRate * iBMIn;
+            float const addend = cumLogProbs[slot * nBM + iBMIn] + diversityRate * static_cast<float>(iBMIn);
+            pLocalLogProbs[i] += static_cast<T>(addend);
         }
     }
 }

Follow‑ups required

  • Update the kernel launch in beamSearchKernelsTemplate.h to pass pStage1Ids:
    addCumLogProbs<<<...>>>(pStage1LogProbs, bh.cumLogProbs, bh.finished, bh.endIds, pStage1Ids, bh.diversityRates, bh.batchSlots, ...);
  • Update explicit instantiations in beamSearchKernels.cu to include the new int const* pStage1Ids parameter for both float and half.
  • Run unit tests / beam search integration tests to validate correctness.

Reasoning from verification

  • pStage1Ids buffer exists in the template workspace (pStage1Ids declared around beamSearchKernelsTemplate.h:619) and top‑K writes it (invokeTopkLastDim at ~627). The addCumLogProbs launch is at ~631 — so the ids are available to pass.
  • endIds is used elsewhere as per‑slot (endIds[slot]) in the codebase; use endIds[slot] (not per‑beam indexing).

Result: bug confirmed and fix required.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bcca984 and 4eb7618.

📒 Files selected for processing (3)
  • cpp/CMakeLists.txt (3 hunks)
  • cpp/tensorrt_llm/kernels/beamSearchKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_tma_utils.cuh
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h
🧬 Code Graph Analysis (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.cu (2)
  • addCumLogProbs (137-139)
  • addCumLogProbs (141-143)
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/beamSearchKernels.h (1)

133-137: Good fix: CUDA guard now prevents leaking global to host-only TUs

Placing the kernel definition under #ifdef __CUDACC__ addresses the earlier compilation issue for non-CUDA translation units.

Comment on lines +133 to +159
#ifdef __CUDACC__
template <typename T>
__global__ void addCumLogProbs(T* __restrict pStage1Probs, float const* __restrict cumLogProbs,
__global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
FinishedState const* finished, int const* endIds, float const* diversityRates,
runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
{
int const bid = blockIdx.x; // Index of request in batch
runtime::SizeType32 const slot = batchSlots[bid];
float const diversityRate{diversityRates[slot]};
T* pLocalLogProbs = pStage1LogProbs + bid * nBMIn * nBMOut * 2;

for (int i = threadIdx.x; i < nBMIn * nBMOut * 2; i += blockDim.x)
{
int const iBMIn = i / (nBMOut * 2);
if (finished[slot * nBMIn + iBMIn].isFinished())
{
pLocalLogProbs[i] += (i == endIds[slot]) ? 1.0f : 0.0f;
}
else
{
// nBM is used in VBWS since `cumLogProbs` is initialized with kMaxBeamWidth earlier than BeamSearchLayer
pLocalLogProbs[i] += cumLogProbs[slot * nBM + iBMIn] + diversityRate * iBMIn;
}
}
return;
}
#endif

💡 Verification agent

🧩 Analysis chain

Avoid ODR/duplicate definitions: add extern templates or move the kernel out of the header

Defining a templated __global__ kernel in a header while also explicitly instantiating it in a .cu risks multiple definitions if any TU implicitly instantiates it via a launch. Either:

  • Keep the definition in the .cu and leave only a declaration in the header, or
  • Keep the definition in the header and prevent implicit instantiation with extern template declarations, keeping explicit instantiations in the .cu.

Minimal change (add extern templates under the CUDA guard):

 #ifdef __CUDACC__
 template <typename T>
 __global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
     FinishedState const* finished, int const* endIds, float const* diversityRates,
     runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
 {
@@
 }
-#endif
+// Prevent implicit instantiation; explicit instantiations live in the .cu
+extern template __global__ __attribute__((visibility("default"))) void addCumLogProbs<float>(
+    float* __restrict pStage1LogProbs, float const* __restrict cumLogProbs, FinishedState const* finished,
+    int const* __restrict endIds, int const* __restrict pStage1Ids, float const* __restrict diversityRates,
+    runtime::SizeType32 const* __restrict batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
+
+extern template __global__ __attribute__((visibility("default"))) void addCumLogProbs<half>(
+    half* __restrict pStage1LogProbs, float const* __restrict cumLogProbs, FinishedState const* finished,
+    int const* __restrict endIds, int const* __restrict pStage1Ids, float const* __restrict diversityRates,
+    runtime::SizeType32 const* __restrict batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);
+#endif

And update the explicit instantiations in the .cu to match the new signature:

// cpp/tensorrt_llm/kernels/beamSearchKernels.cu (adjust signature)
template __global__ void addCumLogProbs<float>(float* __restrict, float const* __restrict,
    FinishedState const*, int const* __restrict, int const* __restrict, float const* __restrict,
    runtime::SizeType32 const* __restrict, size_t const, size_t const, size_t const, size_t const);

template __global__ void addCumLogProbs<half>(half* __restrict, float const* __restrict,
    FinishedState const*, int const* __restrict, int const* __restrict, float const* __restrict,
    runtime::SizeType32 const* __restrict, size_t const, size_t const, size_t const, size_t const);

Alternative: Move the kernel body to the .cu and leave a guarded declaration in the header to avoid any template instantiation from headers.

To confirm there are explicit instantiations today and avoid duplicates, run:


🏁 Script executed:

#!/bin/bash
rg -n -A2 -B2 $'template\\s+__global__\\s+void\\s+addCumLogProbs' cpp | sed -n '1,200p'

Length of output: 1181


Move the CUDA kernel body out of the header (or add correct extern template specializations) — explicit instantiations exist in the .cu

The header currently defines a templated global kernel while cpp/tensorrt_llm/kernels/beamSearchKernels.cu already contains explicit instantiations for float and half (rg found instantiations at lines ~137 and ~141). This risks ODR/duplicate-definition errors. Fix by removing the definition from the header and leaving only a declaration, or (less preferred) add matching extern template specializations.

Files to change

  • cpp/tensorrt_llm/kernels/beamSearchKernels.h — replace the kernel definition (lines ~133–159) with a declaration.
  • cpp/tensorrt_llm/kernels/beamSearchKernels.cu — already has explicit instantiations; verify they match the declaration (they do: float and half specializations exist).

Suggested minimal change (preferred: move body to .cu; in the header leave only the declaration):

 #ifdef __CUDACC__
-template <typename T>
-__global__ __attribute__((visibility("default"))) void addCumLogProbs(T* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
-    FinishedState const* finished, int const* endIds, float const* diversityRates,
-    runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM)
-{
-    ...
-}
+#ifdef __CUDACC__
+template <typename T>
+__global__ __attribute__((visibility("default"))) void addCumLogProbs(
+    T* __restrict pStage1LogProbs,
+    float const* __restrict cumLogProbs,
+    FinishedState const* finished,
+    int const* endIds,
+    float const* diversityRates,
+    runtime::SizeType32 const* batchSlots,
+    size_t const nBS,
+    size_t const nBMIn,
+    size_t const nBMOut,
+    size_t const nBM);
 #endif

Optional alternative (if you must keep the body in the header): add extern template declarations that exactly match the explicit instantiations in the .cu (use the same parameter list and types). Example externs (place under the CUDA guard after the definition if you keep it, though moving the body is safer):

extern template __global__ void addCumLogProbs<float>(
    float* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
    FinishedState const* finished, int const* endIds, float const* diversityRates,
    runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);

extern template __global__ void addCumLogProbs<half>(
    half* __restrict pStage1LogProbs, float const* __restrict cumLogProbs,
    FinishedState const* finished, int const* endIds, float const* diversityRates,
    runtime::SizeType32 const* batchSlots, size_t const nBS, size_t const nBMIn, size_t const nBMOut, size_t const nBM);

Action: prefer moving the kernel body to the .cu and keep the declaration in the header; that aligns with the existing explicit instantiations and avoids ODR issues.

Signed-off-by: Boris Fomitchev <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
requirements.txt (1)

30-33: Blocking: cu13 vendor wheels (nccl/nvrtc) not on public PyPI; will break installs

Prior verification already showed nvidia-nccl-cu13 and nvidia-cuda-nvrtc-cu13 are placeholders on PyPI and not installable, while nvidia-nvshmem-cu13 exists. Unless your CI uses a private index with these exact names, pip will fail here. Suggest holding/commenting them until wheels are available or wiring CI to your internal index.

Apply this minimal, reversible change:

 nvidia-modelopt[torch]~=0.33.0
-nvidia-nccl-cu13
-nvidia-nvshmem-cu13
-nvidia-cuda-nvrtc-cu13
+## CUDA 13 vendor wheels
+# HOLD until available on configured index (or pin to private URLs):
+# nvidia-nccl-cu13
+nvidia-nvshmem-cu13
+# nvidia-cuda-nvrtc-cu13
 transformers~=4.55.0

Optionally replace with exact private URLs or add a CI-level --index-url/--extra-index-url pointing at your internal mirror.

#!/bin/bash
set -euo pipefail
echo "[info] Checking for vendor cu13 wheels on public PyPI (expect: nvshmem OK; nccl/nvrtc likely missing)"
for pkg in nvidia-nccl-cu13 nvidia-nvshmem-cu13 nvidia-cuda-nvrtc-cu13; do
  echo "=== $pkg ==="
  url="https://pypi.org/pypi/${pkg}/json"
  status=$(curl -s -w "%{http_code}" -o /tmp/${pkg}.json "$url" || echo "000")
  echo "HTTP: $status"
  if [ "$status" = "200" ]; then
    python - "$pkg" <<'PY'
import json, sys
p = f"/tmp/{sys.argv[1]}.json"
d = json.load(open(p))
print("latest:", d.get("info", {}).get("version"), "releases:", len(d.get("releases", {})))
PY
  fi
done

echo
echo "[info] Inspect constraints.txt and grep for older cu12 pins that could conflict"
if [ -f constraints.txt ]; then
  nl -ba constraints.txt | sed -n '1,120p'
  echo "--- grep cu12/cu13/torch/torchvision ---"
  rg -n "cu12|cu13|torch(vision)?|nvrtc|nccl" constraints.txt || true
else
  echo "constraints.txt not found at repo root"
fi
🧹 Nitpick comments (4)
requirements.txt (4)

1-1: Global extra-index-url affects all packages; consider scoping to torch installs in CI

Keeping --extra-index-url in a shared requirements file can cause resolver backtracking and accidental pulls from the PyTorch index for unrelated packages. If feasible, move the index flag to the CI pip install step for torch/torchvision/torchaudio only.


4-4: build/wheel are build-time tools; consider moving to a dev/build requirements file

These aren’t needed at runtime and can bloat production images. Suggest creating requirements-dev.txt and installing them only in build contexts.

Example change in this file:

-build
...
-wheel 
-#<=0.45.1

And create requirements-dev.txt (outside this diff):

  • build
  • wheel

Also applies to: 38-39


55-55: Clarify Triton dependency strategy (flashinfer/xgrammar comments only)

Comments imply Triton is provided transitively via flashinfer-python or xgrammar, but both are commented out here. If Triton-backed paths are required at runtime, expose these as optional extras (e.g., [triton]) or document installation steps in README/CI. Otherwise, guard imports with clear error messages.

Also applies to: 57-57


22-22: Heads-up on loosened pins (h5py, pillow) with numpy<2 cap

h5py and pillow are now minimum-pinned. Newer h5py may hard-require numpy>=2, causing resolver backtracking; pillow major bumps can be disruptive. Not blocking, but be prepared to constrain upper bounds if CI starts flaking.

Also applies to: 37-37

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4eb7618 and b5aa676.

📒 Files selected for processing (2)
  • cpp/CMakeLists.txt (3 hunks)
  • requirements.txt (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/CMakeLists.txt

Comment on lines +16 to +19
nvidia-ml-py
# >=12,<13
# Just a wrapper since nvidia-modelopt requires pynvml
pynvml==12.0.0
pynvml

🛠️ Refactor suggestion

⚠️ Potential issue

Avoid installing both nvidia-ml-py and pynvml — they ship the same module (pynvml) and may conflict

Having both often leads to site-packages collisions and non-deterministic import behavior. Keep exactly one; pynvml is what downstream packages typically require.

-nvidia-ml-py
-# >=12,<13
-# Just a wrapper since nvidia-modelopt requires pynvml
-pynvml
+# NVML: use a single provider to avoid conflicts
+pynvml  # required by nvidia-modelopt

If you must support environments that still depend on nvidia-ml-py, consider environment markers/extras rather than installing both globally.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
nvidia-ml-py
# >=12,<13
# Just a wrapper since nvidia-modelopt requires pynvml
pynvml==12.0.0
pynvml
# NVML: use a single provider to avoid conflicts
pynvml # required by nvidia-modelopt
🤖 Prompt for AI Agents
In requirements.txt around lines 16 to 19, both nvidia-ml-py and pynvml are
listed which ship the same pynvml module and can conflict; remove one of them
(prefer pynvml) so only a single package providing pynvml is installed, or
replace with an environment-marker or extras-based conditional if you must
support older environments that require nvidia-ml-py.

Comment on lines +27 to 29
torch>=2.7.1,<=2.9.0a0
torchvision
nvidia-modelopt[torch]~=0.33.0

💡 Verification agent

🧩 Analysis chain

Align torchvision with torch/cu130 to prevent binary/APT mismatches

torch is range-pinned, but torchvision is unconstrained. For CUDA wheels, torch/torchvision must be ABI-matched (same CUDA build, major.minor pairing). Please either pin torchvision to the corresponding release series for the chosen torch range or ensure it’s constrained via constraints.txt.

Run to check if constraints already pin torchvision:

If not pinned elsewhere, consider adding something like:

  • torchvision matched to the highest allowed torch version in your matrix (same CUDA build), or
  • explicitly install both in CI in one transaction from the cu130 index.

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "[info] Searching for torchvision pins or references"
rg -n "^torchvision(==|~=|>=|<=)" || true
if [ -f constraints.txt ]; then
  echo "[info] constraints.txt entries for torch/vision"
  rg -n "torch(vision)?" constraints.txt || true
fi

Length of output: 473


Pin torchvision to match torch/CUDA ABI (prevent binary mismatches)

Verification: I ran the grep you supplied — there are no torchvision pins in the repo and constraints.txt only contains a comment (no torch/torchvision constraints).

Current snippet (requirements.txt, lines 27–29):

torch>=2.7.1,<=2.9.0a0
torchvision
nvidia-modelopt[torch]~=0.33.0

Files to fix:

  • requirements.txt (lines 27–29) — torchvision is unconstrained.
  • constraints.txt — contains no torch/torchvision pins (only a comment).

Actionable fixes (pick one):

  • Pin torchvision to the matching release series/CUDA build for your chosen torch range (same major.minor and same +cuNNN wheel), e.g. add a matching torchvision== (+cu130 if using cu130) in requirements.txt.
  • Or add a torchvision constraint in constraints.txt.
  • Or ensure CI installs torch and torchvision together from the correct CUDA index (install both in one transaction from the cu130 wheel index).
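
Whichever option is chosen, CI can assert the installed pair is ABI-matched before running anything. A minimal sketch under the assumption that matching means an identical CUDA local-version tag (version strings below are illustrative; in CI you would feed in `torch.__version__` and `torchvision.__version__`):

```python
def cuda_tag(wheel_version):
    """Extract the CUDA local-version tag from a wheel version string.

    '2.8.0+cu130' -> 'cu130'; returns None for builds without a CUDA tag
    (e.g. '0.23.0+cpu' or a plain '2.8.0').
    """
    _, sep, local = wheel_version.partition("+")
    return local if sep and local.startswith("cu") else None

def abi_matched(torch_version, torchvision_version):
    """True when both wheels target the same CUDA build (e.g. cu130/cu130)."""
    tag = cuda_tag(torch_version)
    return tag is not None and tag == cuda_tag(torchvision_version)
```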

Signed-off-by: Boris Fomitchev <[email protected]>
@coderabbitai coderabbitai bot requested a review from PerkzZheng August 17, 2025 08:17
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2)

202-203: Include NVSHMEM lib64 and $ORIGIN in INSTALL_RPATH

To improve runtime robustness across layouts (and when packaging), include NVSHMEM lib64 and $ORIGIN.

 set_target_properties(
   deep_ep_cpp_tllm
@@
-             INSTALL_RPATH "${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib"
+             INSTALL_RPATH "\$ORIGIN;${TORCH_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib;${NVSHMEM_INSTALL_PREFIX}/lib64"
              BUILD_WITH_INSTALL_RPATH TRUE)

39-45: NVSHMEM discovery: make it configurable, robust, and fail-fast

  • Hard-codes install prefix; not overridable.
  • Doesn’t mark finds as REQUIRED (late failures).
  • Pins host SONAME to libnvshmem_host.so.3.
  • Doesn’t search lib64.

Apply this to improve portability and early error surfacing:

-# TODO: restore patched nvshmem for CUDA12
 if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
-  set(NVSHMEM_INSTALL_PREFIX "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem")
-  find_path(NVSHMEM_INCLUDE_DIR nvshmem.h HINTS ${NVSHMEM_INSTALL_PREFIX}/include)
-  find_library(NVSHMEM_DEVICE_LIBRARY nvshmem_device HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
-  find_library(NVSHMEM_HOST_LIBRARY libnvshmem_host.so.3 HINTS ${NVSHMEM_INSTALL_PREFIX}/lib)
+  set(NVSHMEM_INSTALL_PREFIX
+      "${TORCH_INSTALL_PREFIX}/../nvidia/nvshmem"
+      CACHE PATH "NVSHMEM install root")
+  find_path(NVSHMEM_INCLUDE_DIR nvshmem.h
+            HINTS ${NVSHMEM_INSTALL_PREFIX}/include
+            REQUIRED)
+  find_library(NVSHMEM_DEVICE_LIBRARY
+               NAMES nvshmem_device
+               HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+               REQUIRED)
+  find_library(NVSHMEM_HOST_LIBRARY
+               NAMES nvshmem_host nvshmem
+               HINTS ${NVSHMEM_INSTALL_PREFIX}/lib ${NVSHMEM_INSTALL_PREFIX}/lib64
+               REQUIRED)
 else()
🧹 Nitpick comments (3)
cpp/CMakeLists.txt (1)

378-379: Reduce build log noise for CUDA flags output

Unconditional message() can clutter logs in CI. Consider downgrading the verbosity or gating behind a flag.

-message("CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+# Or gate it:
+# if(NVCC_TIMING)
+#   message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
+# endif()
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2)

77-81: Avoid mutating the top-level compilers; constrain GCC usage to ExternalProject

Overwriting CMAKE_C/CXX/CUDA host compilers mid-config can impact the entire project toolchain. Prefer scoping GCC selection to the ExternalProject via its CMAKE_CACHE_ARGS.

-  if(NOT CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
-    set(CMAKE_C_COMPILER gcc)
-    set(CMAKE_CXX_COMPILER g++)
-    set(CMAKE_CUDA_HOST_COMPILER g++)
-  endif()

Then, pass explicit compilers to the ExternalProject:

   ExternalProject_Add(
     nvshmem_project
@@
-    CMAKE_CACHE_ARGS
-      -DCMAKE_C_COMPILER:STRING=${CMAKE_C_COMPILER}
-      -DCMAKE_C_COMPILER_LAUNCHER:STRING=${CMAKE_C_COMPILER_LAUNCHER}
-      -DCMAKE_CXX_COMPILER:STRING=${CMAKE_CXX_COMPILER}
-      -DCMAKE_CXX_COMPILER_LAUNCHER:STRING=${CMAKE_CXX_COMPILER_LAUNCHER}
+    CMAKE_CACHE_ARGS
+      -DCMAKE_C_COMPILER:STRING=gcc
+      -DCMAKE_C_COMPILER_LAUNCHER:STRING=${CMAKE_C_COMPILER_LAUNCHER}
+      -DCMAKE_CXX_COMPILER:STRING=g++
+      -DCMAKE_CXX_COMPILER_LAUNCHER:STRING=${CMAKE_CXX_COMPILER_LAUNCHER}
@@
-      -DCMAKE_CUDA_HOST_COMPILER:STRING=${CMAKE_CUDA_HOST_COMPILER}
+      -DCMAKE_CUDA_HOST_COMPILER:STRING=g++

131-132: Avoid pinning MLX5 SONAME; let the linker resolve the correct version

Pinning libmlx5.so.1 is brittle. Use the base name and search standard locations (and lib64 if needed).

-find_library(MLX5_lib NAMES mlx5 libmlx5.so.1 REQUIRED)
+find_library(MLX5_lib NAMES mlx5 REQUIRED)
+# If you need to search custom prefixes:
+# find_library(MLX5_lib NAMES mlx5 HINTS ${CUSTOM_PREFIX}/lib ${CUSTOM_PREFIX}/lib64 REQUIRED)
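
To sanity-check which mlx5 SONAME the dynamic linker actually resolves on a given host (and hence what an unpinned lookup would pick up), a quick check from Python, assuming a Linux host; the call returns None when the library is absent:

```python
from ctypes import util

# ctypes.util.find_library consults the same loader configuration ld.so
# uses, so it reports the SONAME an unpinned lookup resolves to.
soname = util.find_library("mlx5")
print(soname)  # e.g. 'libmlx5.so.1' on an MLX5-equipped host, None otherwise
```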
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b5aa676 and 155f262.

📒 Files selected for processing (2)
  • cpp/CMakeLists.txt (3 hunks)
  • cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4 hunks)
🔇 Additional comments (5)
cpp/CMakeLists.txt (1)

211-211: No-op whitespace change

No functional impact here.

cpp/tensorrt_llm/deep_ep/CMakeLists.txt (4)

211-211: LGTM: NVSHMEM include directory is properly added

Including ${NVSHMEM_INCLUDE_DIR} ensures headers resolve for DeepEP CUDA/C++ sources.


213-214: LGTM: Link both NVSHMEM device and host libraries

Linking both device and host libraries resolves prior linkage gaps for host APIs.


222-222: LGTM: Cleaned dependency on nvshmem_project

Dropping the ExternalProject dependency for CUDA >= 13 path removes unnecessary build coupling.


46-46: Escape “$ORIGIN” to preserve a literal in RPATH

In CMake, unescaped $ORIGIN may be consumed during variable expansion instead of passed through to the loader. Use \$ORIGIN to ensure the literal ends up in RPATH:

  • File: cpp/tensorrt_llm/deep_ep/CMakeLists.txt, line 46
- set(NVSHMEM_INSTALL_PREFIX "$ORIGIN/libs/nvshmem")
+ set(NVSHMEM_INSTALL_PREFIX "\$ORIGIN/libs/nvshmem")

After configuring your build (e.g. cmake -S . -B build), verify that \$ORIGIN appears in the build outputs:

grep -R '\\$ORIGIN' build

Comment on lines +253 to +260
if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
message(
STATUS
"CUDAToolkit_VERSION ${CUDAToolkit_VERSION_MAJOR}.${CUDAToolkit_VERSION_MINOR} is greater or equal than 13.0, setting CMAKE_CUDA_RUNTIME_LIBRARY to Shared"
)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false -cudart=shared")
set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)
endif()

💡 Verification agent

🧩 Analysis chain

Make CUDA runtime selection consistent and avoid redundant flags

You’re switching to Shared runtime for CUDA >= 13, but you also append “-cudart=shared” to CMAKE_CUDA_FLAGS. Prefer one source of truth (CMake’s CMAKE_CUDA_RUNTIME_LIBRARY) and avoid hard-coding -cudart in flags. Also, if you keep Shared for >=13, ensure consumers don’t still link against CUDA::cudart_static via CUDA_RT_LIB.

Apply this minimal diff to let CMake own the runtime selection and align the helper variable:

 if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
   message(
     STATUS
       "CUDAToolkit_VERSION ${CUDAToolkit_VERSION_MAJOR}.${CUDAToolkit_VERSION_MINOR} is greater or equal than 13.0, setting CMAKE_CUDA_RUNTIME_LIBRARY to Shared"
   )
-  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false -cudart=shared")
+  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false")
   set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)
+  # Ensure helper points to the shared cudart for consumers relying on CUDA_RT_LIB.
+  set(CUDA_RT_LIB CUDA::cudart)
 endif()

To verify no accidental static cudart linkage remains after this change, run:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Searching for CUDA_RT_LIB usage and cudart_static references..."
rg -n -C2 '\bCUDA_RT_LIB\b|cudart_static|CUDA::cudart_static' --type cmake

Length of output: 3327


Ensure consistent CUDA runtime selection across the entire CMake setup

We should centralize runtime choice via CMAKE_CUDA_RUNTIME_LIBRARY and the CUDA_RT_LIB helper, and remove all hard-coded references to cudart_static or -cudart=shared. In practice that means:

• In cpp/CMakeLists.txt’s setup_cuda_architectures() (around lines 153–165), replace the separate static/default logic with a single if(VERSION ≥ 13)/else() that:
– Sets CMAKE_CUDA_RUNTIME_LIBRARY to Shared for ≥13, Static otherwise.
– Sets CUDA_RT_LIB to CUDA::cudart for ≥13, CUDA::cudart_static otherwise.
– Drops the -cudart=shared flag entirely.

• Remove all other direct invocations of CUDA::cudart_static or -cudart=… and have consumers link only against ${CUDA_RT_LIB}.

• Update every CMakeLists that currently does

find_package(CUDAToolkit REQUIRED COMPONENTS cuda_driver cudart_static nvml)

target_link_libraries(... CUDA::cudart_static ...)

to either:
– Remove cudart_static from the find_package call (letting CMake pick the right runtime), and
– Link against ${CUDA_RT_LIB} (or explicitly CUDA::cudart when you truly need the shared library).

Key files requiring edits (non-exhaustive):

  • cpp/CMakeLists.txt (setup_cuda_architectures)
  • cpp/tensorrt_llm/plugins/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/.../nvrtcWrapper/CMakeLists.txt
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt
  • examples/cpp/executor/CMakeLists.txt

Example refactor in setup_cuda_architectures():

# before
find_package(CUDAToolkit 11.2 REQUIRED COMPONENTS cudart_static cuda_driver ...)
set(CUDA_RT_LIB CUDA::cudart_static)
set(CMAKE_CUDA_RUNTIME_LIBRARY Static)
...
if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "13.0")
  message(STATUS "… setting Shared runtime …")
- set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false -cudart=shared")
+ set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --device-entity-has-hidden-visibility=false")
  set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)
+ set(CUDA_RT_LIB CUDA::cudart)
endif()

Then remove all other occurrences of cudart_static (e.g. in examples) and use ${CUDA_RT_LIB} for linking. This ensures a single source of truth and prevents stale static‐runtime references.

🤖 Prompt for AI Agents
In cpp/CMakeLists.txt around lines 253-260 (and in setup_cuda_architectures()
around 153-165), remove the hard-coded -cudart=shared flag and instead
centralize CUDA runtime selection: set CMAKE_CUDA_RUNTIME_LIBRARY to Shared when
CUDAToolkit_VERSION >= 13.0 and to Static otherwise, and set a top-level
variable CUDA_RT_LIB to CUDA::cudart for ≥13 or CUDA::cudart_static for <13;
update these lines to drop any -cudart=... flags. Then search the repo for all
find_package(CUDAToolkit ...) and target_link_libraries(...) usages (notably in
cpp/tensorrt_llm/plugins, nvrtcWrapper, ucx_utils, and examples/cpp/executor)
and remove explicit cudart_static from find_package and replace any
CUDA::cudart_static or -cudart references with linking against ${CUDA_RT_LIB}
(or CUDA::cudart only when you explicitly require shared), ensuring every
consumer uses the centralized CUDA_RT_LIB variable.

@coderabbitai coderabbitai bot requested a review from tomeras91 August 17, 2025 22:17
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
scripts/build_wheel.py (1)

672-686: Harden NVSHMEM packaging: guard directory creation, use robust path construction, handle SONAME variations, and warn on partial installs.

  • Create libs/nvshmem only when License.txt exists to avoid shipping empty dirs.
  • Build paths via Path segments for readability/portability.
  • Be resilient to NVSHMEM library SONAME changes (e.g., .so.103) by globbing.
  • Emit a clear message when the license exists but expected libs are missing.

Apply this diff:

-            
-            (lib_dir / "nvshmem").mkdir(exist_ok=True)
-            nvshmem_license = build_dir / "tensorrt_llm/deep_ep/nvshmem-build/License.txt"
-            if nvshmem_license.exists():
-                install_file(
-                    build_dir / "tensorrt_llm/deep_ep/nvshmem-build/License.txt",
-                    lib_dir / "nvshmem")
-                install_file(
-                    build_dir /
-                    "tensorrt_llm/deep_ep/nvshmem-build/src/lib/nvshmem_bootstrap_uid.so.3",
-                    lib_dir / "nvshmem")
-                install_file(
-                    build_dir /
-                    "tensorrt_llm/deep_ep/nvshmem-build/src/lib/nvshmem_transport_ibgda.so.103",
-                    lib_dir / "nvshmem")
+            nvshmem_build_dir = build_dir / "tensorrt_llm" / "deep_ep" / "nvshmem-build"
+            nvshmem_license = nvshmem_build_dir / "License.txt"
+            if nvshmem_license.exists():
+                nvshmem_out_dir = lib_dir / "nvshmem"
+                nvshmem_out_dir.mkdir(exist_ok=True)
+                lib_src_dir = nvshmem_build_dir / "src" / "lib"
+                # Be resilient to SONAME changes across NVSHMEM versions
+                libs = []
+                for pattern in ("nvshmem_bootstrap_uid.so*", "nvshmem_transport_ibgda.so*"):
+                    libs.extend(lib_src_dir.glob(pattern))
+                if not libs:
+                    warnings.warn(
+                        f"NVSHMEM license found at {nvshmem_license}, but no expected libraries in {lib_src_dir}. "
+                        "Skipping NVSHMEM packaging.")
+                else:
+                    install_file(nvshmem_license, nvshmem_out_dir)
+                    for lib in libs:
+                        install_file(lib, nvshmem_out_dir)
+            else:
+                print("-- NVSHMEM artifacts not detected (License.txt missing); skipping NVSHMEM packaging.")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 155f262 and e331ef0.

📒 Files selected for processing (1)
  • scripts/build_wheel.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • scripts/build_wheel.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • scripts/build_wheel.py
🔇 Additional comments (1)
scripts/build_wheel.py (1)

672-686: Packaging includes nvshmem artifacts

I’ve verified that in setup.py’s package_data (lines 102–110) you explicitly list:

  • libs/nvshmem/License.txt
  • libs/nvshmem/nvshmem_bootstrap_uid.so.3
  • libs/nvshmem/nvshmem_transport_ibgda.so.103

These files will be bundled into the wheel. If you ever add more files under libs/nvshmem, remember to update this list (or switch to a glob pattern like 'libs/nvshmem/*') so they aren’t omitted.
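
If switching to a glob, the package_data entries can also be generated programmatically so an SONAME bump never requires editing setup.py again. A sketch (paths mirror the layout above; the helper name is made up):

```python
from pathlib import Path

def nvshmem_package_data(pkg_root):
    """Return libs/nvshmem/* file paths relative to the package root.

    Tracks whatever the build installed (License.txt plus any
    nvshmem_*.so.* artifacts), so an SONAME bump such as
    .so.103 -> .so.104 is picked up without touching an explicit list.
    """
    root = Path(pkg_root)
    nvshmem_dir = root / "libs" / "nvshmem"
    if not nvshmem_dir.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in nvshmem_dir.iterdir() if p.is_file()
    )
```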

@borisfom borisfom closed this Aug 20, 2025