Refactor kernel_density to use less memory #7833
Intron7 wants to merge 6 commits into rapidsai:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds a CUDA KDE scorer (header + CUDA source), exposes it via a new Cython module, integrates the source into CMake, refactors the Python KernelDensity to call the new backend, and expands tests to cover more metrics and kernels.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
python/cuml/cuml/neighbors/kernel_density.py (1)
252-259: Consider using `next(iter())` for single-value extraction. Per static analysis (RUF015), prefer `next(iter(self.metric_params.values()))` over creating an intermediate list for a single element.

Suggested improvement:

```diff
 if self.metric_params:
     if len(self.metric_params) != 1:
         raise ValueError(
             "Cuml only supports metrics with a single arg."
         )
-    metric_arg = float(list(self.metric_params.values())[0])
+    metric_arg = float(next(iter(self.metric_params.values())))
 else:
     metric_arg = 2.0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/neighbors/kernel_density.py` around lines 252 - 259, The code in kernel_density.py currently converts metric_params.values() to a list to extract a single value for metric_arg; replace that intermediate list with an iterator-based fetch using next(iter(self.metric_params.values())) and cast it to float (i.e., metric_arg = float(next(iter(self.metric_params.values())))) while preserving the existing single-value length check and default branch; update the block around the metric_params handling in the KernelDensity implementation where metric_arg is assigned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 438-442: The CUDA kernel launch of kde_fused_kernel<T, M, K> is
missing a post-launch error check; include the RAFT CUDA utilities header
(raft/util/cuda_utils.cuh) and add a RAFT_CUDA_TRY(...) check immediately after
the kernel launch inside the same scope (e.g., after
kde_fused_kernel<<<...>>>(...)) to catch asynchronous launch errors; ensure the
RAFT_CUDA_TRY invocation uses the appropriate CUDA error query
(cudaGetLastError()/cudaPeekAtLastError() as provided by RAFT) and keep the
change local to the kernel launch block.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cpp/CMakeLists.txt
- cpp/include/cuml/neighbors/kde.hpp
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/CMakeLists.txt
- python/cuml/cuml/neighbors/kde.pyx
- python/cuml/cuml/neighbors/kernel_density.py
- python/cuml/tests/test_kernel_density.py
Thanks for the PR! On a first brief skim the idea looks sound. I'm a bit wary of the code duplication between RAFT/cuvs/cuml here for distances, but it's honestly not so much code, so worst case merging as is may be fine. Others more versed on the C++ side of things may have some suggestions though. I probably won't have time to look more into this until Monday. One quick request I'd have, if you have some time, is to push up some more motivation for your use case here. How much of a memory savings is this providing for workloads you're running, and are there other benefits (perf, ...) worth noting? Any numbers you can provide to help motivate the change and use case would be very helpful here.
/ok to test 7ae3918
I have done some small benchmarks. For small datasets the performance is roughly the same; the new implementation is 1.1x faster for (10000 x 10000). However, for a bigger embedding (200000 x 200000), where I need to chunk to not blow up memory, this is 11 times faster. The memory use is the most impactful part: it goes from (n x m) to (n + m), since we never compute this massive pairwise distance matrix. I was trying to use the RAFT distances. Some of them worked, others didn't, because they assume a different thread layout. So I created custom distance functions.
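To put the (n x m) vs (n + m) claim in context, a back-of-the-envelope check for the 200000 x 200000 case mentioned above (assuming float32, 4 bytes per element):

```python
# Rough memory footprint comparison, assuming float32 (4 bytes per element).
n_query, n_train = 200_000, 200_000
bytes_per_elem = 4

# Old approach: materializes the full pairwise distance matrix, O(n * m).
pairwise_gb = n_query * n_train * bytes_per_elem / 1e9

# Streaming approach: inputs aside, temporaries scale as O(n + m).
streaming_gb = (n_query + n_train) * bytes_per_elem / 1e9

print(f"pairwise:  {pairwise_gb:,.1f} GB")
print(f"streaming: {streaming_gb:.4f} GB")
```

That is 160 GB of scratch for the pairwise matrix versus well under a megabyte for the streaming temporaries, which matches why the old implementation needed chunking at this scale.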
viclafargue left a comment
Thanks @Intron7! This would be very helpful to scale kernel_density to larger problem sizes. I could review the CUDA code. It looks like there is a loop over all the train vectors which would not scale well. However, this new solution would save a lot of memory. I suggested some optimizations. Have you benchmarked the old vs new solution on a case with a small n_query and large n_train? I wonder if this is really a drop-in replacement for what we had.
cpp/src/kde/kde.cu
Outdated
```cpp
T running_sum = T(0);

for (int j = 0; j < n_train; ++j) {
  T dist = Distance<T, Metric>::compute(&query[i * d], &train[j * d], d, metric_arg);
```
i * d and j * d here would likely result in integer overflow.
cpp/src/kde/kde.cu
Outdated
```cpp
kde_fused_kernel<T, M, K><<<blocks, threads, 0, stream>>>(
  query, train, weights, output, n_query, n_train, d, bandwidth, metric_arg, log_norm);
```

Suggested change:

```cpp
kde_fused_kernel<T, M, K><<<blocks, threads, 0, stream>>>(
  query, train, weights, output, n_query, n_train, d, bandwidth, metric_arg, log_norm);
RAFT_CUDA_TRY(cudaPeekAtLastError());
```
cpp/src/kde/kde.cu
Outdated
```cpp
// euclidean: sqrt(sum((a-b)^2))
template <typename T>
struct Distance<T, ML::distance::DistanceType::L2SqrtUnexpanded> {
  __device__ static T compute(const T* a, const T* b, int d, T)
```

Suggested change:

```cpp
inline __device__ static T compute(const T* a, const T* b, int d, T)
```
It is possible to inline the distance and log functions to remove function calls and enable better compiler optimizations. It is a small optimization, but not necessarily what we want to do, since it will bloat binary size (one kernel per metric, log and type).
cpp/src/kde/kde.cu
Outdated
```cpp
T running_max = -cuda::std::numeric_limits<T>::infinity();
T running_sum = T(0);

for (int j = 0; j < n_train; ++j) {
```
This will essentially make the kernel process rows sequentially and prevent scaling with n_train, which is quite bad. The pairwise_distance kernel, even though it uses a lot more memory and memory bandwidth, will not suffer from this issue.
If we want to keep with this design, one major improvement would be to store the train vectors once in shared memory. This would prevent accessing the same data several time from global memory. Access would be divided by (blockDim.x=256). To make it possible the train vectors would have to be processed as tiles (as shared memory is limited). And if possible coalesced. Also the query can be stored in registers.
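The tiling scheme described above can be sketched host-side. This NumPy stand-in (the function name and tile size are illustrative, not the actual kernel) processes train rows in fixed-size tiles so the scratch space is only (n_query, tile) rather than (n_query, n_train):

```python
import numpy as np

def kde_log_density_tiled(query, train, h, tile=256):
    """Gaussian-kernel log-density via tiled log-sum-exp.

    Mirrors the proposed design: each tile of train rows is staged once
    (shared memory on the GPU), and per-query running max/sum replace the
    full (n_query, n_train) distance matrix.
    """
    m = np.full(len(query), -np.inf)   # running max of log-kernel values
    s = np.zeros(len(query))           # running sum, rescaled to current max
    for start in range(0, len(train), tile):
        t = train[start:start + tile]
        d2 = ((query[:, None, :] - t[None, :, :]) ** 2).sum(-1)
        lk = -d2 / (2.0 * h * h)       # scratch is only (n_query, tile)
        new_m = np.maximum(m, lk.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(lk - new_m[:, None]).sum(axis=1)
        m = new_m
    return m + np.log(s)               # unnormalized log-density per query row
```

The rescaling step `s * np.exp(m - new_m)` is what lets each tile be folded into the running log-sum-exp without revisiting earlier tiles.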
In my current limited testing it's between 1.1x and 11x faster than the current implementation. Also, the speed being the same doesn't really matter if the other implementation breaks because the pairwise distance matrix blows up the memory. I can definitely work on prefetching the data into shared memory, but right now it looks like the kernel is compute bound, not memory bound.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 550-551: The code calls cudaDeviceGetAttribute(&sm_count,
cudaDevAttrMultiProcessorCount, 0) with a hard-coded device ID; change it to
query the current device first (e.g., call cudaGetDevice to obtain the active
device) and pass that device variable into cudaDeviceGetAttribute so sm_count is
obtained for the active GPU. Locate the cudaDeviceGetAttribute usage around
sm_count and replace the literal 0 with the retrieved current device (or obtain
the device from the provided raft::handle_t if available) to make the operation
device-agnostic.
- Around line 583-610: The code allocates partial_max and partial_sum with
cudaMallocAsync and manually frees them, which leaks if RAFT_CUDA_TRY throws;
replace raw T* allocations with RAII rmm::device_uvector<T> (construct with
buf_elems and stream) and pass .data() to kde_tiled_kernel and
kde_reduce_kernel, remove the explicit cudaFreeAsync calls, and ensure
includes/namespace for rmm are added so allocations are automatically freed on
exception or scope exit.
- Line 396: Avoid taking log(0) by skipping the log when a sample weight is
zero: in kde.cu where log_k is incremented using weights, add a guard that
checks weights is non-null and that weights[j_base + c] is greater than T(0)
before calling log, e.g., only add log(weights[j_base + c]) when the weight > 0;
update the same check around any other places that assume positive weights
and/or alternatively enforce weights > 0 in kernel_density.py validation if you
prefer failing earlier.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 439759da-d557-4f91-95c2-c3c10d90adcb
📒 Files selected for processing (3)
- cpp/include/cuml/neighbors/kde.hpp
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/kernel_density.py
viclafargue left a comment
Tiled processing is a great addition for overall performance. Please add checks in the C++ API for d > 0, n_train > 0, n_query > 0, bandwidth > 0 with RAFT_EXPECTS.
Also, could you add some Pytest tests to check the different metrics and tiling layout for correctness against the reference KDE?
cpp/src/kde/kde.cu
Outdated
```cpp
RAFT_CUDA_TRY(cudaMallocAsync(&partial_max, buf_elems * sizeof(T), stream));
RAFT_CUDA_TRY(cudaMallocAsync(&partial_sum, buf_elems * sizeof(T), stream));
```
Please use rmm::device_uvector<T> for safer memory management (RAII) and also to make use of the current RMM allocator.
Co-authored-by: Victor Lafargue <viclafargue@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/src/kde/kde.cu (1)
503-517: Consider adding input validation for edge cases. The function does not validate inputs like `n_query`, `n_train`, `d`, or null pointers. While callers should provide valid inputs, defensive checks (especially `d > 0`) would prevent undefined behavior from propagating silently. Some distance metrics (Hamming, RusselRao) divide by `d` in their finalize step.

💡 Optional validation:

```cpp
void score_samples(const raft::handle_t& handle,
                   const T* query,
                   const T* train,
                   const T* weights,
                   T* output,
                   int n_query,
                   int n_train,
                   int d,
                   ...)
{
  RAFT_EXPECTS(query != nullptr, "query must not be null");
  RAFT_EXPECTS(train != nullptr, "train must not be null");
  RAFT_EXPECTS(output != nullptr, "output must not be null");
  RAFT_EXPECTS(n_query > 0, "n_query must be positive");
  RAFT_EXPECTS(n_train > 0, "n_train must be positive");
  RAFT_EXPECTS(d > 0, "d (n_features) must be positive");
  // ... rest of function
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/kde/kde.cu` around lines 503 - 517, Add defensive input validation at the start of score_samples to guard against null pointers and non-positive sizes: check that query, train, and output (and weights if required by code path) are not null and that n_query, n_train, and d are > 0 (use RAFT_EXPECTS or the project’s preferred assertion macro). Also ensure any distance metric paths that divide by d (e.g., Hamming, RusselRao in finalize logic) only run when d > 0 and fail fast otherwise; update score_samples’ signature body accordingly so these checks occur before any computation or calls to finalize/metric-specific code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 81-88: The finalize of DistOp<T,
ML::distance::DistanceType::LpUnexpanded> can divide by a zero or near-zero p
(metric_arg); update finalize (and optionally accumulate) to guard against p <=
0 or extremely small values by checking p against a small epsilon (or explicitly
handling p == 0 case), and return a well-defined result (e.g., treat p==0 as
appropriate limit or clamp p to epsilon and compute pow(acc[0],
T(1)/clamped_p)); ensure the check uses the template type T and references the
same parameter p so callers of DistOp::finalize get a stable result instead of
Inf/NaN.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 136bc504-22d0-4130-a12c-56dba05c11e4
📒 Files selected for processing (1)
cpp/src/kde/kde.cu
cpp/src/kde/kde.cu
Outdated
```cpp
// minkowski: (sum(|a-b|^p))^(1/p)
template <typename T>
struct DistOp<T, ML::distance::DistanceType::LpUnexpanded> {
  static constexpr int N_ACC = 1;
  inline __device__ static void init(T* acc) { acc[0] = T(0); }
  inline __device__ static void accumulate(T* acc, T a, T b, T p) { acc[0] += pow(abs(a - b), p); }
  inline __device__ static T finalize(T* acc, int, T p) { return pow(acc[0], T(1) / p); }
};
```
Potential division by near-zero p in Minkowski finalize.
If metric_arg (p) is zero or very close to zero, T(1) / p in finalize will produce infinity or extreme values. Consider adding a guard or documenting that callers must ensure p > 0.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/kde/kde.cu` around lines 81 - 88, The finalize of DistOp<T,
ML::distance::DistanceType::LpUnexpanded> can divide by a zero or near-zero p
(metric_arg); update finalize (and optionally accumulate) to guard against p <=
0 or extremely small values by checking p against a small epsilon (or explicitly
handling p == 0 case), and return a well-defined result (e.g., treat p==0 as
appropriate limit or clamp p to epsilon and compute pow(acc[0],
T(1)/clamped_p)); ensure the check uses the template type T and references the
same parameter p so callers of DistOp::finalize get a stable result instead of
Inf/NaN.
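The failure mode is easy to reproduce outside CUDA. This NumPy sketch mirrors the accumulate/finalize split and realizes the first suggested option (validating `p > 0` up front, analogous to a `RAFT_EXPECTS` check); the function name is illustrative:

```python
import numpy as np

def minkowski_distance(a, b, p):
    # Mirrors DistOp<LpUnexpanded>: accumulate |a-b|^p, finalize with ^(1/p).
    # Without this guard, p == 0 makes 1/p infinite and the result blows up
    # to Inf/NaN, which is exactly the edge case flagged in the review.
    if p <= 0:
        raise ValueError("Minkowski metric requires p > 0")
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)
```

With `p = 2` this reduces to the Euclidean distance and with `p = 1` to Manhattan, while `p <= 0` fails fast instead of propagating Inf/NaN into the KDE scores.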
python/cuml/cuml/neighbors/kde.pyx
Outdated
```cython
from cuml.metrics.pairwise_distances import PAIRWISE_DISTANCE_METRICS


cdef extern from "cuml/neighbors/kde.hpp" namespace "ML::KDE" nogil:
```
This should really be getting done in RAFT/cuVS alongside the kernel gram APIs. These are inherently distance / neighborhood based.
python/cuml/cuml/neighbors/kde.pyx
Outdated
```cython
cdef extern from "cuml/neighbors/kde.hpp" namespace "ML::KDE" nogil:

    ctypedef enum class KernelType:
        Gaussian "ML::KDE::KernelType::Gaussian"
```
This should also be using the kernel types in the gramian APIs.
Hey @Intron7, we have a kernel gram API in cuVS that handles pairwise distance / gramian computations for the other kernel methods like SVR/SVM. Rather than scattering these implementations across cuML and cuVS, we should really be aiming to consolidate them into a shared API of sorts, even if they end up dispatching to different impls at first. Just want to make sure we are representing algorithms with as much composability and reuse as possible.
♻️ Duplicate comments (1)
cpp/src/kde/kde.cu (1)
84-91: ⚠️ Potential issue | 🟡 Minor — Minkowski distance: division by near-zero `p` remains unguarded.

If `metric_arg` (p) is zero or very close to zero, `T(1) / p` in `finalize` will produce infinity or extreme values. While this is an edge case (callers typically use p ≥ 1), consider either:

- Adding input validation in `score_samples` to require `p > 0` when the metric is Minkowski, or
- Documenting the constraint in the API.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/kde/kde.cu` around lines 84 - 91, The finalize implementation of DistOp for ML::distance::DistanceType::LpUnexpanded uses T(1)/p which can divide by zero or near-zero p; update validation to ensure metric_arg (p) > 0 before computing the power and surface the error to callers (e.g., in the score_samples caller path) or clamp/handle tiny p values: add an explicit check for p <= 0 (or p < epsilon) and return/report an error or fallback behavior, and document the constraint for LpUnexpanded; reference DistOp<T, ML::distance::DistanceType::LpUnexpanded>::finalize, accumulate, and the score_samples codepath that provides metric_arg.
🧹 Nitpick comments (1)
python/cuml/tests/test_kernel_density.py (1)
346-353: Replace ambiguous multiplication sign in docstring. Static analysis (RUF002) flags the `×` character as ambiguous. Consider using `x` or spelling out "by" for clarity.

Suggested fix:

```diff
-def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+def test_all_kernels_all_metrics(metric, kernel):
+    """Every metric x kernel combination produces output matching the reference.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/tests/test_kernel_density.py` around lines 346 - 353, The docstring in test_all_kernels_all_metrics uses the ambiguous multiplication sign "×"; replace it with a clear ASCII alternative such as "x" or the word "by" so static analysis (RUF002) no longer flags it — update the docstring text inside the test_all_kernels_all_metrics function accordingly to read e.g. "Every metric x kernel combination…" or "Every metric by kernel combination…".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 69410cce-7cea-4904-be89-0c89f04c6bde
📒 Files selected for processing (3)
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/kernel_density.py
- python/cuml/tests/test_kernel_density.py
rapidsai/cuvs#1915 is needed now for this PR since I moved the kernel to cuVS @cjnolet
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/tests/test_kernel_density.py`:
- Around line 382-389: In the docstring for test_all_kernels_all_metrics replace
the Unicode multiplication sign "×" with a plain ASCII "x" to avoid ambiguity
and ensure consistent encoding/reading across tools; update the string in the
function test_all_kernels_all_metrics accordingly so it reads "metric x kernel"
(or similar) instead of using the Unicode multiplication character.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b311b680-ea71-4df1-9695-d61ee7bbc297
📒 Files selected for processing (4)
- cpp/include/cuml/neighbors/kde.hpp
- cpp/src/kde/kde.cu
- python/cuml/cuml/neighbors/kde.pyx
- python/cuml/tests/test_kernel_density.py
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/src/kde/kde.cu
- cpp/include/cuml/neighbors/kde.hpp
```python
def test_all_kernels_all_metrics(metric, kernel):
    """Every metric × kernel combination produces output matching the reference.

    For metrics supported by sklearn.pairwise_distances the reference is
    compute_kernel_naive; for metrics absent from sklearn a matching numpy
    reference is used that mirrors the DistOp accumulate/finalize logic in
    kde.cu exactly.
    """
```
Minor: Replace ambiguous multiplication sign character.
The docstring uses × (Unicode multiplication sign) which can cause confusion. Consider using x for clarity.
✏️ Suggested fix

```diff
 def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+    """Every metric x kernel combination produces output matching the reference.
```
+ """Every metric x kernel combination produces output matching the reference.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_all_kernels_all_metrics(metric, kernel): | |
| """Every metric × kernel combination produces output matching the reference. | |
| For metrics supported by sklearn.pairwise_distances the reference is | |
| compute_kernel_naive; for metrics absent from sklearn a matching numpy | |
| reference is used that mirrors the DistOp accumulate/finalize logic in | |
| kde.cu exactly. | |
| """ | |
| def test_all_kernels_all_metrics(metric, kernel): | |
| """Every metric x kernel combination produces output matching the reference. | |
| For metrics supported by sklearn.pairwise_distances the reference is | |
| compute_kernel_naive; for metrics absent from sklearn a matching numpy | |
| reference is used that mirrors the DistOp accumulate/finalize logic in | |
| kde.cu exactly. | |
| """ |
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 383-383: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/cuml/tests/test_kernel_density.py` around lines 382 - 389, In the
docstring for test_all_kernels_all_metrics replace the Unicode multiplication
sign "×" with a plain ASCII "x" to avoid ambiguity and ensure consistent
encoding/reading across tools; update the string in the function
test_all_kernels_all_metrics accordingly so it reads "metric x kernel" (or
similar) instead of using the Unicode multiplication character.
Hey, this is my first time working on the C++ / Cython layer, so....
I recently came across Welford's algorithm and I thought something similar should work for kernel density, to avoid computing the full pairwise distance matrix. So this now does an online log-sum-exp with max tracking. This way we can run arbitrarily big embeddings without any memory issues or batching.
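The online log-sum-exp with max tracking described above can be sketched as a scalar NumPy reference (this is an illustration of the update rule, not the CUDA kernel itself):

```python
import numpy as np

def streaming_logsumexp(log_values):
    """Online log-sum-exp with running-max tracking.

    Processes values one at a time (like the kernel's loop over train rows),
    rescaling the running sum whenever a new maximum appears, so the full
    array of log-kernel values never needs to be materialized.
    """
    m, s = -np.inf, 0.0
    for lk in log_values:
        if lk > m:
            s = s * np.exp(m - lk) + 1.0  # rescale old sum to the new max
            m = lk
        else:
            s += np.exp(lk - m)
    return m + np.log(s)
```

This is the Welford-style idea: just as Welford's algorithm updates a running mean/variance per sample, each new log-kernel value updates a running (max, sum) pair, and the final score is `max + log(sum)`.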