
Refactor kernel_density to use less memory#7833

Open
Intron7 wants to merge 6 commits into rapidsai:main from Intron7:refactor-kernel-density

Conversation

Contributor

@Intron7 Intron7 commented Feb 26, 2026

Hey, this is my first time working on the C++ / Cython layer, so...

I recently came across Welford's algorithm and thought something similar should work for kernel density, so we don't need to compute the full pairwise distance matrix. This now does an online log-sum-exp with running-max tracking. This way we can run arbitrarily large embeddings without memory issues or batching.
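The idea can be sketched in plain Python (an illustrative sketch only, not the PR's CUDA code; `streaming_logsumexp` is a hypothetical name):

```python
import math

def streaming_logsumexp(values):
    """One-pass log-sum-exp with running-max tracking.

    Instead of materializing all values (as a full pairwise distance
    matrix would), keep a running max and a sum rescaled to that max.
    """
    running_max = -math.inf
    running_sum = 0.0
    for v in values:
        if v > running_max:
            # rescale the accumulated sum to the new maximum
            running_sum = running_sum * math.exp(running_max - v) + 1.0
            running_max = v
        else:
            running_sum += math.exp(v - running_max)
    return running_max + math.log(running_sum)
```

Because each value is folded in as it is produced, only O(1) state per query point is needed rather than the full row of distances.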


copy-pr-bot bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a CUDA KDE scorer (header + CUDA source), exposes it via a new Cython module, integrates the source into CMake, refactors Python KernelDensity to call the new backend, and expands tests to cover more metrics and kernels.

Changes

  • Build / CMake (cpp/CMakeLists.txt, python/cuml/cuml/neighbors/CMakeLists.txt): Added src/kde/kde.cu to cuml private sources; registered new GPU Cython modules (kde.pyx, kneighbors_classifier.pyx, kneighbors_regressor.pyx, nearest_neighbors.pyx); updated license year.
  • C++ public API (cpp/include/cuml/neighbors/kde.hpp): New header declaring the ML::KDE::score_samples template, using DensityKernelType, with extern template instantiations for float and double.
  • C++ CUDA implementation (cpp/src/kde/kde.cu): New CUDA-backed wrapper that forwards ML::KDE::score_samples<T> to cuvs::distance::kde_score_samples; explicit instantiations for float and double.
  • Python Cython module (python/cuml/cuml/neighbors/kde.pyx): New Cython bridge exposing kde_score_samples; maps kernel/metric strings to enums, validates inputs, dispatches float32/float64 to the CUDA backend, and returns output arrays.
  • Python KernelDensity refactor (python/cuml/cuml/neighbors/kernel_density.py): Replaces in-file kernel/distance/logsumexp computation with delegation to kde_score_samples; enforces input ordering and tightens sample_weight validation (<= 0 invalid).
  • Tests (python/cuml/tests/test_kernel_density.py): Expanded tests with custom distance functions (Hellinger, Jensen-Shannon, KL), new naive references, and broader kernel/metric coverage including tiling/multipass tests; minor import changes and copyright year bump.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

improvement, non-breaking

Suggested reviewers

  • betatim
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 15.38%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title accurately summarizes the main change: refactoring kernel_density to reduce memory usage from O(n·m) to O(n+m) by replacing the full pairwise distance matrix with an online log-sum-exp computation.
  • Description check ✅ Passed: the description is related to the changeset, explaining the motivation (Welford's algorithm inspiration) and the key technical approach (online log-sum-exp with max tracking) to address memory issues.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
python/cuml/cuml/neighbors/kernel_density.py (1)

252-259: Consider using next(iter()) for single-value extraction.

Per static analysis (RUF015), prefer next(iter(self.metric_params.values())) over creating an intermediate list for a single element.

Suggested improvement
         if self.metric_params:
             if len(self.metric_params) != 1:
                 raise ValueError(
                     "Cuml only supports metrics with a single arg."
                 )
-            metric_arg = float(list(self.metric_params.values())[0])
+            metric_arg = float(next(iter(self.metric_params.values())))
         else:
             metric_arg = 2.0
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 438-442: The CUDA kernel launch of kde_fused_kernel<T, M, K> is
missing a post-launch error check; include the RAFT CUDA utilities header
(raft/util/cuda_utils.cuh) and add a RAFT_CUDA_TRY(...) check immediately after
the kernel launch inside the same scope (e.g., after
kde_fused_kernel<<<...>>>(...)) to catch asynchronous launch errors; ensure the
RAFT_CUDA_TRY invocation uses the appropriate CUDA error query
(cudaGetLastError()/cudaPeekAtLastError() as provided by RAFT) and keep the
change local to the kernel launch block.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed4de0a and 7ae3918.

📒 Files selected for processing (7)
  • cpp/CMakeLists.txt
  • cpp/include/cuml/neighbors/kde.hpp
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/CMakeLists.txt
  • python/cuml/cuml/neighbors/kde.pyx
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/tests/test_kernel_density.py

Member

jcrist commented Feb 27, 2026

Thanks for the PR! On a first brief skim the idea looks sound. I'm a bit wary of the code duplication between RAFT/cuVS/cuML here for distances, but it's honestly not that much code, so in the worst case merging as-is may be fine. Others more versed on the C++ side of things may have some suggestions though.

I probably won't have time to look more into this until Monday. One quick request I'd have if you have some time is to push up some more motivation for your use case here. How much of a memory savings is this providing for workloads you're running, and are there other benefits (perf, ...) worth noting? Any numbers you can provide to help motivate the change and use case would be very helpful here.

Member

jcrist commented Feb 27, 2026

/ok to test 7ae3918

@jcrist jcrist requested review from jcrist and removed request for robertmaynard February 27, 2026 05:01
Contributor Author

Intron7 commented Feb 27, 2026

I have done some small benchmarks. For small datasets the performance is roughly the same: the new implementation is 1.1x faster for (10000 x 10000). However, for a bigger embedding (200000 x 200000), where I previously had to chunk to not blow up memory, this is 11 times faster. The memory use is the most impactful part: it goes from O(n x m) to O(n + m), since we never materialize the massive pairwise distance matrix. I tried to use the RAFT distances; some of them worked, others didn't because they assume a different thread layout, so I created custom distance functions.
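The memory contrast can be sketched in NumPy (an illustrative unnormalized Gaussian KDE, not the cuML kernel; both function names are hypothetical):

```python
import numpy as np

def kde_full_matrix(query, train, h):
    # materializes the full (n_query, n_train) squared-distance matrix
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return np.log(np.exp(-d2 / (2 * h * h)).sum(axis=1))

def kde_streaming(query, train, h, chunk=1024):
    # only an (n_query, chunk) slice is live at any time; partial
    # results are folded in log-space, so memory stays O(n + m)
    out = np.full(len(query), -np.inf)
    for start in range(0, len(train), chunk):
        block = train[start:start + chunk]
        d2 = ((query[:, None, :] - block[None, :, :]) ** 2).sum(axis=-1)
        part = np.log(np.exp(-d2 / (2 * h * h)).sum(axis=1))
        out = np.logaddexp(out, part)
    return out
```

Both return the same log-densities, but the streaming version never holds more than one chunk of distances.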

Contributor

@viclafargue viclafargue left a comment


Thanks @Intron7! This would be very helpful to scale kernel_density to larger problem sizes. I could review the CUDA code. It looks like there is a loop over all the train vectors which would not scale well. However, this new solution would save a lot of memory. I suggested some optimizations. Have you benchmarked the old vs new solution on a case with a small n_query and large n_train? I wonder if this is really a drop-in replacement for what we had.

T running_sum = T(0);

for (int j = 0; j < n_train; ++j) {
T dist = Distance<T, Metric>::compute(&query[i * d], &train[j * d], d, metric_arg);
Contributor


i * d and j * d here would likely result in integer overflow.

Comment on lines +438 to +439
kde_fused_kernel<T, M, K><<<blocks, threads, 0, stream>>>(
query, train, weights, output, n_query, n_train, d, bandwidth, metric_arg, log_norm);
Contributor


Suggested change
  kde_fused_kernel<T, M, K><<<blocks, threads, 0, stream>>>(
    query, train, weights, output, n_query, n_train, d, bandwidth, metric_arg, log_norm);
+ RAFT_CUDA_TRY(cudaPeekAtLastError());

// euclidean: sqrt(sum((a-b)^2))
template <typename T>
struct Distance<T, ML::distance::DistanceType::L2SqrtUnexpanded> {
__device__ static T compute(const T* a, const T* b, int d, T)
Contributor


Suggested change
- __device__ static T compute(const T* a, const T* b, int d, T)
+ inline __device__ static T compute(const T* a, const T* b, int d, T)

It is possible to inline the distance and log functions to remove function calls and improve compiler optimizations. It is a small optimization, but not necessarily what we want to do, since it will bloat binary size (one kernel per metric, log, and type).

T running_max = -cuda::std::numeric_limits<T>::infinity();
T running_sum = T(0);

for (int j = 0; j < n_train; ++j) {
Contributor


This will essentially make the kernel process rows sequentially and prevent scaling with n_train, which is quite bad. The pairwise_distance kernel, even though it uses a lot more memory and memory bandwidth, will not suffer from this issue.

If we want to keep this design, one major improvement would be to store the train vectors once in shared memory. This would prevent accessing the same data several times from global memory; accesses would be divided by (blockDim.x = 256). To make this possible, the train vectors would have to be processed as tiles (as shared memory is limited), and if possible coalesced. Also, the query vector can be stored in registers.

Contributor Author

Intron7 commented Mar 4, 2026

In my current limited testing it's faster than the current implementation, between 1.1x and 11x. Also, matching speed doesn't really matter if the other implementation breaks because the pairwise distance matrix blows up the memory. I can definitely work on prefetching the data into shared memory, but right now it looks like the kernel is compute-bound and not memory-bound.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 550-551: The code calls cudaDeviceGetAttribute(&sm_count,
cudaDevAttrMultiProcessorCount, 0) with a hard-coded device ID; change it to
query the current device first (e.g., call cudaGetDevice to obtain the active
device) and pass that device variable into cudaDeviceGetAttribute so sm_count is
obtained for the active GPU. Locate the cudaDeviceGetAttribute usage around
sm_count and replace the literal 0 with the retrieved current device (or obtain
the device from the provided raft::handle_t if available) to make the operation
device-agnostic.
- Around line 583-610: The code allocates partial_max and partial_sum with
cudaMallocAsync and manually frees them, which leaks if RAFT_CUDA_TRY throws;
replace raw T* allocations with RAII rmm::device_uvector<T> (construct with
buf_elems and stream) and pass .data() to kde_tiled_kernel and
kde_reduce_kernel, remove the explicit cudaFreeAsync calls, and ensure
includes/namespace for rmm are added so allocations are automatically freed on
exception or scope exit.
- Line 396: Avoid taking log(0) by skipping the log when a sample weight is
zero: in kde.cu where log_k is incremented using weights, add a guard that
checks weights is non-null and that weights[j_base + c] is greater than T(0)
before calling log, e.g., only add log(weights[j_base + c]) when the weight > 0;
update the same check around any other places that assume positive weights
and/or alternatively enforce weights > 0 in kernel_density.py validation if you
prefer failing earlier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 439759da-d557-4f91-95c2-c3c10d90adcb

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae3918 and 0de59e9.

📒 Files selected for processing (3)
  • cpp/include/cuml/neighbors/kde.hpp
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/kernel_density.py

Contributor

@viclafargue viclafargue left a comment


Tiled processing is a great addition for overall performance. Please add checks in the C++ API for d > 0, n_train > 0, n_query > 0, bandwidth > 0 with RAFT_EXPECTS.

Also, could you add some Pytest tests to check the different metrics and tiling layout for correctness against the reference KDE?

Comment on lines +586 to +587
RAFT_CUDA_TRY(cudaMallocAsync(&partial_max, buf_elems * sizeof(T), stream));
RAFT_CUDA_TRY(cudaMallocAsync(&partial_sum, buf_elems * sizeof(T), stream));
Contributor


Please use rmm::device_uvector<T> for safer memory management (RAII) and also to make use of the current RMM allocator.

Co-authored-by: Victor Lafargue <viclafargue@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/src/kde/kde.cu (1)

503-517: Consider adding input validation for edge cases.

The function does not validate inputs like n_query, n_train, d, or null pointers. While callers should provide valid inputs, defensive checks (especially d > 0) would prevent undefined behavior from propagating silently. Some distance metrics (Hamming, RusselRao) divide by d in their finalize step.

💡 Optional validation
void score_samples(const raft::handle_t& handle,
                   const T* query,
                   const T* train,
                   const T* weights,
                   T* output,
                   int n_query,
                   int n_train,
                   int d,
                   ...)
{
  RAFT_EXPECTS(query != nullptr, "query must not be null");
  RAFT_EXPECTS(train != nullptr, "train must not be null");
  RAFT_EXPECTS(output != nullptr, "output must not be null");
  RAFT_EXPECTS(n_query > 0, "n_query must be positive");
  RAFT_EXPECTS(n_train > 0, "n_train must be positive");
  RAFT_EXPECTS(d > 0, "d (n_features) must be positive");
  // ... rest of function

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 136bc504-22d0-4130-a12c-56dba05c11e4

📥 Commits

Reviewing files that changed from the base of the PR and between 0de59e9 and e1a66b3.

📒 Files selected for processing (1)
  • cpp/src/kde/kde.cu

Comment on lines +81 to +88
// minkowski: (sum(|a-b|^p))^(1/p)
template <typename T>
struct DistOp<T, ML::distance::DistanceType::LpUnexpanded> {
  static constexpr int N_ACC = 1;
  inline __device__ static void init(T* acc) { acc[0] = T(0); }
  inline __device__ static void accumulate(T* acc, T a, T b, T p) { acc[0] += pow(abs(a - b), p); }
  inline __device__ static T finalize(T* acc, int, T p) { return pow(acc[0], T(1) / p); }
};


⚠️ Potential issue | 🟡 Minor

Potential division by near-zero p in Minkowski finalize.

If metric_arg (p) is zero or very close to zero, T(1) / p in finalize will produce infinity or extreme values. Consider adding a guard or documenting that callers must ensure p > 0.


from cuml.metrics.pairwise_distances import PAIRWISE_DISTANCE_METRICS


cdef extern from "cuml/neighbors/kde.hpp" namespace "ML::KDE" nogil:
Member


This should really be getting done in RAFT/cuVS alongside the kernel Gram APIs. These are inherently distance / neighborhood based.

cdef extern from "cuml/neighbors/kde.hpp" namespace "ML::KDE" nogil:

ctypedef enum class KernelType:
Gaussian "ML::KDE::KernelType::Gaussian"
Member


This should also be using the kernel types in the Gramian APIs.

Member

cjnolet commented Mar 13, 2026

Hey @Intron7, we have a kernel Gram API in cuVS that handles pairwise distance / Gramian computations for the other kernel methods like SVR/SVM. Rather than scattering these implementations across cuML and cuVS, we should really be aiming to consolidate them into a shared API of sorts, even if they end up dispatching to different impls at first. Just want to make sure we are representing algorithms with as much composability and reuse as possible.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
cpp/src/kde/kde.cu (1)

84-91: ⚠️ Potential issue | 🟡 Minor

Minkowski distance: division by near-zero p remains unguarded.

If metric_arg (p) is zero or very close to zero, T(1) / p in finalize will produce infinity or extreme values. While this is an edge case (callers typically use p ≥ 1), consider either:

  1. Adding input validation in score_samples to require p > 0 when metric is Minkowski, or
  2. Documenting the constraint in the API.
🧹 Nitpick comments (1)
python/cuml/tests/test_kernel_density.py (1)

346-353: Replace ambiguous multiplication sign in docstring.

Static analysis (RUF002) flags the × character as ambiguous. Consider using x or spelling out "by" for clarity.

Suggested fix
-def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+def test_all_kernels_all_metrics(metric, kernel):
+    """Every metric x kernel combination produces output matching the reference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 69410cce-7cea-4904-be89-0c89f04c6bde

📥 Commits

Reviewing files that changed from the base of the PR and between e1a66b3 and 8827ad5.

📒 Files selected for processing (3)
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/tests/test_kernel_density.py

Contributor Author

Intron7 commented Mar 13, 2026

rapidsai/cuvs#1915 is now needed for this PR since I moved the kernel to cuVS. @cjnolet


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b311b680-ea71-4df1-9695-d61ee7bbc297

📥 Commits

Reviewing files that changed from the base of the PR and between 8827ad5 and 2951567.

📒 Files selected for processing (4)
  • cpp/include/cuml/neighbors/kde.hpp
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/kde.pyx
  • python/cuml/tests/test_kernel_density.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/src/kde/kde.cu
  • cpp/include/cuml/neighbors/kde.hpp

Comment on lines +382 to +389
def test_all_kernels_all_metrics(metric, kernel):
    """Every metric × kernel combination produces output matching the reference.

    For metrics supported by sklearn.pairwise_distances the reference is
    compute_kernel_naive; for metrics absent from sklearn a matching numpy
    reference is used that mirrors the DistOp accumulate/finalize logic in
    kde.cu exactly.
    """


⚠️ Potential issue | 🟡 Minor

Minor: Replace ambiguous multiplication sign character.

The docstring uses × (Unicode multiplication sign) which can cause confusion. Consider using x for clarity.

✏️ Suggested fix
 def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+    """Every metric x kernel combination produces output matching the reference.
🧰 Tools
🪛 Ruff (0.15.5)

[warning] 383-383: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF002)

