Refactor kernel_density to use less memory #7833
Intron7 wants to merge 6 commits into rapidsai:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds a CUDA KDE scorer (header + CUDA source), exposes it via a new Cython module, integrates the source into CMake, refactors the Python KernelDensity to call the new backend, and expands tests to cover more metrics and kernels.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
python/cuml/cuml/neighbors/kernel_density.py (1)
252-259: Consider using `next(iter())` for single-value extraction. Per static analysis (RUF015), prefer `next(iter(self.metric_params.values()))` over creating an intermediate list for a single element.

Suggested improvement:

```diff
 if self.metric_params:
     if len(self.metric_params) != 1:
         raise ValueError(
             "Cuml only supports metrics with a single arg."
         )
-    metric_arg = float(list(self.metric_params.values())[0])
+    metric_arg = float(next(iter(self.metric_params.values())))
 else:
     metric_arg = 2.0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/neighbors/kernel_density.py` around lines 252 - 259, The code in kernel_density.py currently converts metric_params.values() to a list to extract a single value for metric_arg; replace that intermediate list with an iterator-based fetch using next(iter(self.metric_params.values())) and cast it to float (i.e., metric_arg = float(next(iter(self.metric_params.values())))) while preserving the existing single-value length check and default branch; update the block around the metric_params handling in the KernelDensity implementation where metric_arg is assigned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 438-442: The CUDA kernel launch of kde_fused_kernel<T, M, K> is
missing a post-launch error check; include the RAFT CUDA utilities header
(raft/util/cuda_utils.cuh) and add a RAFT_CUDA_TRY(...) check immediately after
the kernel launch inside the same scope (e.g., after
kde_fused_kernel<<<...>>>(...)) to catch asynchronous launch errors; ensure the
RAFT_CUDA_TRY invocation uses the appropriate CUDA error query
(cudaGetLastError()/cudaPeekAtLastError() as provided by RAFT) and keep the
change local to the kernel launch block.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cpp/CMakeLists.txt
- cpp/include/cuml/neighbors/kde.hpp
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/CMakeLists.txt
- python/cuml/cuml/neighbors/kde.pyx
- python/cuml/cuml/neighbors/kernel_density.py
- python/cuml/tests/test_kernel_density.py
Thanks for the PR! On a first brief skim the idea looks sound. I'm a bit wary of the code duplication between RAFT/cuvs/cuml here for distances, but it's honestly not so much code, so worst case merging as is may be fine. Others more versed on the C++ side of things may have some suggestions though. I probably won't have time to look more into this until Monday. One quick request I'd have, if you have some time, is to push up some more motivation for your use case here. How much of a memory savings is this providing for workloads you're running, and are there other benefits (perf, ...) worth noting? Any numbers you can provide to help motivate the change and use case would be very helpful here.
/ok to test 7ae3918
I have done some small benchmarks. For small datasets the performance is roughly the same; the new implementation is 1.1x faster for (10000 x 10000). However, for a bigger embedding (200000 x 200000), where I need to chunk to not blow up memory, this is 11 times faster. The memory use is the most impactful part: it goes from (n x m) to (n + m), since we never compute this massive pairwise distance matrix. I was trying to use the RAFT distances. Some of them worked, others didn't, because they assume a different thread layout. So I created custom distance functions.
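To put the (n x m) vs (n + m) claim in context, a back-of-the-envelope check for the 200000 x 200000 case mentioned above (assuming float32, 4 bytes per element):

```python
# Rough memory footprint comparison, assuming float32 (4 bytes per element).
n_query, n_train = 200_000, 200_000
bytes_per_elem = 4

# Old approach: materializes the full pairwise distance matrix, O(n * m).
pairwise_gb = n_query * n_train * bytes_per_elem / 1e9

# Streaming approach: inputs aside, temporaries scale as O(n + m).
streaming_gb = (n_query + n_train) * bytes_per_elem / 1e9

print(f"pairwise:  {pairwise_gb:,.1f} GB")
print(f"streaming: {streaming_gb:.4f} GB")
```

That is 160 GB of scratch for the pairwise matrix versus well under a megabyte for the streaming temporaries, which matches why the old implementation needed chunking at this scale.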
viclafargue left a comment
Thanks @Intron7! This would be very helpful to scale kernel_density to larger problem sizes. I could review the CUDA code. It looks like there is a loop over all the train vectors which would not scale well. However, this new solution would save a lot of memory. I suggested some optimizations. Have you benchmarked the old vs new solution on a case with a small n_query and large n_train? I wonder if this is really a drop-in replacement for what we had.
cpp/src/kde/kde.cu
Outdated
```cpp
T running_sum = T(0);

for (int j = 0; j < n_train; ++j) {
  T dist = Distance<T, Metric>::compute(&query[i * d], &train[j * d], d, metric_arg);
```
i * d and j * d here would likely result in integer overflow.
cpp/src/kde/kde.cu
Outdated
```cpp
kde_fused_kernel<T, M, K><<<blocks, threads, 0, stream>>>(
  query, train, weights, output, n_query, n_train, d, bandwidth, metric_arg, log_norm);
```

Suggested change:

```cpp
kde_fused_kernel<T, M, K><<<blocks, threads, 0, stream>>>(
  query, train, weights, output, n_query, n_train, d, bandwidth, metric_arg, log_norm);
RAFT_CUDA_TRY(cudaPeekAtLastError());
```
cpp/src/kde/kde.cu
Outdated
```cpp
// euclidean: sqrt(sum((a-b)^2))
template <typename T>
struct Distance<T, ML::distance::DistanceType::L2SqrtUnexpanded> {
  __device__ static T compute(const T* a, const T* b, int d, T)
```

Suggested change:

```cpp
inline __device__ static T compute(const T* a, const T* b, int d, T)
```
It is possible to inline the distance and log functions to remove function calls and enable better compiler optimizations. It is a small optimization, but not necessarily what we want to do, since it will bloat binary size (one kernel per metric, log and type).
cpp/src/kde/kde.cu
Outdated
```cpp
T running_max = -cuda::std::numeric_limits<T>::infinity();
T running_sum = T(0);

for (int j = 0; j < n_train; ++j) {
```
This will essentially make the kernel process rows sequentially and prevent scaling with n_train, which is quite bad. The pairwise_distance kernel, even though it uses a lot more memory and memory bandwidth, will not suffer from this issue.
If we want to keep with this design, one major improvement would be to store the train vectors once in shared memory. This would prevent accessing the same data several time from global memory. Access would be divided by (blockDim.x=256). To make it possible the train vectors would have to be processed as tiles (as shared memory is limited). And if possible coalesced. Also the query can be stored in registers.
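The tiling scheme described above can be sketched host-side. This NumPy stand-in (the function name and tile size are illustrative, not the actual kernel) processes train rows in fixed-size tiles so the scratch space is only (n_query, tile) rather than (n_query, n_train):

```python
import numpy as np

def kde_log_density_tiled(query, train, h, tile=256):
    """Gaussian-kernel log-density via tiled log-sum-exp.

    Mirrors the proposed design: each tile of train rows is staged once
    (shared memory on the GPU), and per-query running max/sum replace the
    full (n_query, n_train) distance matrix.
    """
    m = np.full(len(query), -np.inf)   # running max of log-kernel values
    s = np.zeros(len(query))           # running sum, rescaled to current max
    for start in range(0, len(train), tile):
        t = train[start:start + tile]
        d2 = ((query[:, None, :] - t[None, :, :]) ** 2).sum(-1)
        lk = -d2 / (2.0 * h * h)       # scratch is only (n_query, tile)
        new_m = np.maximum(m, lk.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(lk - new_m[:, None]).sum(axis=1)
        m = new_m
    return m + np.log(s)               # unnormalized log-density per query row
```

The rescaling step `s * np.exp(m - new_m)` is what lets each tile be folded into the running log-sum-exp without revisiting earlier tiles.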
In my current limited testing it's between 1.1x and 11x faster than the current implementation. Also, the speed being the same doesn't really matter if the other implementation breaks because the pairwise distance matrix blows up the memory. I can definitely work on prefetching the data into shared memory, but right now it looks like the kernel is compute bound, not memory bound.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 550-551: The code calls cudaDeviceGetAttribute(&sm_count,
cudaDevAttrMultiProcessorCount, 0) with a hard-coded device ID; change it to
query the current device first (e.g., call cudaGetDevice to obtain the active
device) and pass that device variable into cudaDeviceGetAttribute so sm_count is
obtained for the active GPU. Locate the cudaDeviceGetAttribute usage around
sm_count and replace the literal 0 with the retrieved current device (or obtain
the device from the provided raft::handle_t if available) to make the operation
device-agnostic.
- Around line 583-610: The code allocates partial_max and partial_sum with
cudaMallocAsync and manually frees them, which leaks if RAFT_CUDA_TRY throws;
replace raw T* allocations with RAII rmm::device_uvector<T> (construct with
buf_elems and stream) and pass .data() to kde_tiled_kernel and
kde_reduce_kernel, remove the explicit cudaFreeAsync calls, and ensure
includes/namespace for rmm are added so allocations are automatically freed on
exception or scope exit.
- Line 396: Avoid taking log(0) by skipping the log when a sample weight is
zero: in kde.cu where log_k is incremented using weights, add a guard that
checks weights is non-null and that weights[j_base + c] is greater than T(0)
before calling log, e.g., only add log(weights[j_base + c]) when the weight > 0;
update the same check around any other places that assume positive weights
and/or alternatively enforce weights > 0 in kernel_density.py validation if you
prefer failing earlier.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 439759da-d557-4f91-95c2-c3c10d90adcb
📒 Files selected for processing (3)
- cpp/include/cuml/neighbors/kde.hpp
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/kernel_density.py
viclafargue left a comment
Tiled processing is a great addition for overall performance. Please add checks in the C++ API for d > 0, n_train > 0, n_query > 0, bandwidth > 0 with RAFT_EXPECTS.
Also, could you add some Pytest tests to check the different metrics and tiling layout for correctness against the reference KDE?
cpp/src/kde/kde.cu
Outdated
```cpp
RAFT_CUDA_TRY(cudaMallocAsync(&partial_max, buf_elems * sizeof(T), stream));
RAFT_CUDA_TRY(cudaMallocAsync(&partial_sum, buf_elems * sizeof(T), stream));
```
Please use rmm::device_uvector<T> for safer memory management (RAII) and also to make use of the current RMM allocator.
Co-authored-by: Victor Lafargue <viclafargue@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/src/kde/kde.cu (1)
503-517: Consider adding input validation for edge cases. The function does not validate inputs like `n_query`, `n_train`, `d`, or null pointers. While callers should provide valid inputs, defensive checks (especially `d > 0`) would prevent undefined behavior from propagating silently. Some distance metrics (Hamming, RusselRao) divide by `d` in their finalize step.

💡 Optional validation:

```cpp
void score_samples(const raft::handle_t& handle,
                   const T* query,
                   const T* train,
                   const T* weights,
                   T* output,
                   int n_query,
                   int n_train,
                   int d,
                   ...)
{
  RAFT_EXPECTS(query != nullptr, "query must not be null");
  RAFT_EXPECTS(train != nullptr, "train must not be null");
  RAFT_EXPECTS(output != nullptr, "output must not be null");
  RAFT_EXPECTS(n_query > 0, "n_query must be positive");
  RAFT_EXPECTS(n_train > 0, "n_train must be positive");
  RAFT_EXPECTS(d > 0, "d (n_features) must be positive");
  // ... rest of function
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/kde/kde.cu` around lines 503 - 517, Add defensive input validation at the start of score_samples to guard against null pointers and non-positive sizes: check that query, train, and output (and weights if required by code path) are not null and that n_query, n_train, and d are > 0 (use RAFT_EXPECTS or the project’s preferred assertion macro). Also ensure any distance metric paths that divide by d (e.g., Hamming, RusselRao in finalize logic) only run when d > 0 and fail fast otherwise; update score_samples’ signature body accordingly so these checks occur before any computation or calls to finalize/metric-specific code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 81-88: The finalize of DistOp<T,
ML::distance::DistanceType::LpUnexpanded> can divide by a zero or near-zero p
(metric_arg); update finalize (and optionally accumulate) to guard against p <=
0 or extremely small values by checking p against a small epsilon (or explicitly
handling p == 0 case), and return a well-defined result (e.g., treat p==0 as
appropriate limit or clamp p to epsilon and compute pow(acc[0],
T(1)/clamped_p)); ensure the check uses the template type T and references the
same parameter p so callers of DistOp::finalize get a stable result instead of
Inf/NaN.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 136bc504-22d0-4130-a12c-56dba05c11e4
📒 Files selected for processing (1)
cpp/src/kde/kde.cu
cpp/src/kde/kde.cu
Outdated
```cpp
// minkowski: (sum(|a-b|^p))^(1/p)
template <typename T>
struct DistOp<T, ML::distance::DistanceType::LpUnexpanded> {
  static constexpr int N_ACC = 1;
  inline __device__ static void init(T* acc) { acc[0] = T(0); }
  inline __device__ static void accumulate(T* acc, T a, T b, T p) { acc[0] += pow(abs(a - b), p); }
  inline __device__ static T finalize(T* acc, int, T p) { return pow(acc[0], T(1) / p); }
};
```
Potential division by near-zero p in Minkowski finalize.
If metric_arg (p) is zero or very close to zero, T(1) / p in finalize will produce infinity or extreme values. Consider adding a guard or documenting that callers must ensure p > 0.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/kde/kde.cu` around lines 81 - 88, The finalize of DistOp<T,
ML::distance::DistanceType::LpUnexpanded> can divide by a zero or near-zero p
(metric_arg); update finalize (and optionally accumulate) to guard against p <=
0 or extremely small values by checking p against a small epsilon (or explicitly
handling p == 0 case), and return a well-defined result (e.g., treat p==0 as
appropriate limit or clamp p to epsilon and compute pow(acc[0],
T(1)/clamped_p)); ensure the check uses the template type T and references the
same parameter p so callers of DistOp::finalize get a stable result instead of
Inf/NaN.
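The failure mode is easy to reproduce outside CUDA. This NumPy sketch mirrors the accumulate/finalize split and realizes the first suggested option (validating `p > 0` up front, analogous to a `RAFT_EXPECTS` check); the function name is illustrative:

```python
import numpy as np

def minkowski_distance(a, b, p):
    # Mirrors DistOp<LpUnexpanded>: accumulate |a-b|^p, finalize with ^(1/p).
    # Without this guard, p == 0 makes 1/p infinite and the result blows up
    # to Inf/NaN, which is exactly the edge case flagged in the review.
    if p <= 0:
        raise ValueError("Minkowski metric requires p > 0")
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)
```

With `p = 2` this reduces to the Euclidean distance and with `p = 1` to Manhattan, while `p <= 0` fails fast instead of propagating Inf/NaN into the KDE scores.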
python/cuml/cuml/neighbors/kde.pyx
Outdated
```cython
from cuml.metrics.pairwise_distances import PAIRWISE_DISTANCE_METRICS


cdef extern from "cuml/neighbors/kde.hpp" namespace "ML::KDE" nogil:
```
This should really be getting done in RAFT/cuVS alongside the kernel gram APIs. These are inherently distance / neighborhood based.
python/cuml/cuml/neighbors/kde.pyx
Outdated
```cython
cdef extern from "cuml/neighbors/kde.hpp" namespace "ML::KDE" nogil:

    ctypedef enum class KernelType:
        Gaussian "ML::KDE::KernelType::Gaussian"
```
This should also be using the kernel types in the gramian APIs.
Hey @Intron7, we have a kernel gram API in cuVS that handles pairwise distance / gramian computations for the other kernel methods like SVR/SVM. Rather than scattering these implementations across cuML and cuVS, we should really be aiming to consolidate them into a shared API of sorts, even if they end up dispatching to different impls at first. Just want to make sure we are representing algorithms with as much composability and reuse as possible.
♻️ Duplicate comments (1)
cpp/src/kde/kde.cu (1)
84-91: ⚠️ Potential issue | 🟡 Minor — Minkowski distance: division by near-zero `p` remains unguarded.

If `metric_arg` (p) is zero or very close to zero, `T(1) / p` in `finalize` will produce infinity or extreme values. While this is an edge case (callers typically use p ≥ 1), consider either:

- Adding input validation in `score_samples` to require `p > 0` when the metric is Minkowski, or
- Documenting the constraint in the API.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/kde/kde.cu` around lines 84 - 91, The finalize implementation of DistOp for ML::distance::DistanceType::LpUnexpanded uses T(1)/p which can divide by zero or near-zero p; update validation to ensure metric_arg (p) > 0 before computing the power and surface the error to callers (e.g., in the score_samples caller path) or clamp/handle tiny p values: add an explicit check for p <= 0 (or p < epsilon) and return/report an error or fallback behavior, and document the constraint for LpUnexpanded; reference DistOp<T, ML::distance::DistanceType::LpUnexpanded>::finalize, accumulate, and the score_samples codepath that provides metric_arg.
🧹 Nitpick comments (1)
python/cuml/tests/test_kernel_density.py (1)
346-353: Replace ambiguous multiplication sign in docstring. Static analysis (RUF002) flags the `×` character as ambiguous. Consider using `x` or spelling out "by" for clarity.

Suggested fix:

```diff
-def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+def test_all_kernels_all_metrics(metric, kernel):
+    """Every metric x kernel combination produces output matching the reference.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/tests/test_kernel_density.py` around lines 346 - 353, The docstring in test_all_kernels_all_metrics uses the ambiguous multiplication sign "×"; replace it with a clear ASCII alternative such as "x" or the word "by" so static analysis (RUF002) no longer flags it — update the docstring text inside the test_all_kernels_all_metrics function accordingly to read e.g. "Every metric x kernel combination…" or "Every metric by kernel combination…".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 69410cce-7cea-4904-be89-0c89f04c6bde
📒 Files selected for processing (3)
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/kernel_density.py
- python/cuml/tests/test_kernel_density.py
rapidsai/cuvs#1915 is needed now for this PR since I moved the kernel to cuVS @cjnolet
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/tests/test_kernel_density.py`:
- Around line 382-389: In the docstring for test_all_kernels_all_metrics replace
the Unicode multiplication sign "×" with a plain ASCII "x" to avoid ambiguity
and ensure consistent encoding/reading across tools; update the string in the
function test_all_kernels_all_metrics accordingly so it reads "metric x kernel"
(or similar) instead of using the Unicode multiplication character.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b311b680-ea71-4df1-9695-d61ee7bbc297
📒 Files selected for processing (4)
- cpp/include/cuml/neighbors/kde.hpp
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/kde.pyx
- python/cuml/tests/test_kernel_density.py
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/src/kde/kde.cu
- cpp/include/cuml/neighbors/kde.hpp
```python
def test_all_kernels_all_metrics(metric, kernel):
    """Every metric × kernel combination produces output matching the reference.

    For metrics supported by sklearn.pairwise_distances the reference is
    compute_kernel_naive; for metrics absent from sklearn a matching numpy
    reference is used that mirrors the DistOp accumulate/finalize logic in
    kde.cu exactly.
    """
```
Minor: Replace ambiguous multiplication sign character.
The docstring uses × (Unicode multiplication sign) which can cause confusion. Consider using x for clarity.
✏️ Suggested fix

```diff
 def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+    """Every metric x kernel combination produces output matching the reference.
```
+ """Every metric x kernel combination produces output matching the reference.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_all_kernels_all_metrics(metric, kernel): | |
| """Every metric × kernel combination produces output matching the reference. | |
| For metrics supported by sklearn.pairwise_distances the reference is | |
| compute_kernel_naive; for metrics absent from sklearn a matching numpy | |
| reference is used that mirrors the DistOp accumulate/finalize logic in | |
| kde.cu exactly. | |
| """ | |
| def test_all_kernels_all_metrics(metric, kernel): | |
| """Every metric x kernel combination produces output matching the reference. | |
| For metrics supported by sklearn.pairwise_distances the reference is | |
| compute_kernel_naive; for metrics absent from sklearn a matching numpy | |
| reference is used that mirrors the DistOp accumulate/finalize logic in | |
| kde.cu exactly. | |
| """ |
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 383-383: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/cuml/tests/test_kernel_density.py` around lines 382 - 389, In the
docstring for test_all_kernels_all_metrics replace the Unicode multiplication
sign "×" with a plain ASCII "x" to avoid ambiguity and ensure consistent
encoding/reading across tools; update the string in the function
test_all_kernels_all_metrics accordingly so it reads "metric x kernel" (or
similar) instead of using the Unicode multiplication character.
Hey, this is my first time working on the C++ / Cython layer, so....
I recently came across Welford's algorithm and I thought something similar should work for kernel density, to avoid computing the full pairwise distance matrix. So this now does an online log-sum-exp with max tracking. This way we can run arbitrarily big embeddings without any memory issues or batching.
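The online log-sum-exp with max tracking described above can be sketched as a scalar NumPy reference (this is an illustration of the update rule, not the CUDA kernel itself):

```python
import numpy as np

def streaming_logsumexp(log_values):
    """Online log-sum-exp with running-max tracking.

    Processes values one at a time (like the kernel's loop over train rows),
    rescaling the running sum whenever a new maximum appears, so the full
    array of log-kernel values never needs to be materialized.
    """
    m, s = -np.inf, 0.0
    for lk in log_values:
        if lk > m:
            s = s * np.exp(m - lk) + 1.0  # rescale old sum to the new max
            m = lk
        else:
            s += np.exp(lk - m)
    return m + np.log(s)
```

This is the Welford-style idea: just as Welford's algorithm updates a running mean/variance per sample, each new log-kernel value updates a running (max, sum) pair, and the final score is `max + log(sum)`.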