
Conversation

@binmahone
Collaborator

This PR fixes #3973, and the resulting nsys trace looks like:

[nsys trace screenshot]

You can see that the yellow bar portion shrinks significantly.

This improvement reduces our workload's end-to-end time by 10%.

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@greptile-apps
Contributor

greptile-apps bot commented Nov 19, 2025

Greptile Summary

  • Adds fast path optimization that eliminates validity vector allocation when both inputs have no nulls and overflow checking is disabled
  • Introduces multiply_no_validity_fn functor that directly computes multiplication without validity tracking, reducing memory usage by num_rows * sizeof(bool) bytes
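
For context, this is the shape of that fast path as a minimal sketch. It is not the PR's actual code: only the multiply_no_validity_fn name comes from the summary above, and the simplified int64_t signature and all other names are illustrative.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/transform.h>
#include <cstdint>

// Fast path: both inputs are known non-null, so only the product is written
// and no per-row validity is tracked.
struct multiply_no_validity_fn {
  __device__ int64_t operator()(int64_t a, int64_t b) const { return a * b; }
};

thrust::device_vector<int64_t> multiply(thrust::device_vector<int64_t> const& lhs,
                                        thrust::device_vector<int64_t> const& rhs,
                                        bool lhs_has_nulls, bool rhs_has_nulls,
                                        bool check_overflow)
{
  auto const n = lhs.size();
  thrust::device_vector<int64_t> out(n);
  bool const both_inputs_valid = !lhs_has_nulls && !rhs_has_nulls;

  if (both_inputs_valid && !check_overflow) {
    // Fast path: the n * sizeof(bool) validity vector is never allocated,
    // and the pass that converts it into a null mask is skipped too.
    thrust::transform(thrust::device, lhs.begin(), lhs.end(), rhs.begin(),
                      out.begin(), multiply_no_validity_fn{});
  } else {
    // General path: allocate the validity vector, run the null/overflow-aware
    // functor, then convert the vector into a bit-packed null mask (elided).
    thrust::device_vector<bool> validity(n);
    // ...
  }
  return out;
}
```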

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The optimization is well implemented, with correct logic for detecting when the fast path can be used. It properly handles all input combinations (column-column, column-scalar, scalar-column) and maintains correctness by skipping validity tracking only when both inputs are guaranteed valid and overflow checking is disabled
  • No files require special attention

Important Files Changed

Filename: src/main/cpp/src/multiply.cu
Overview: Adds an optimized fast path that skips validity vector allocation when both inputs have no nulls and overflow checking is disabled, reducing memory and compute overhead

Sequence Diagram

sequenceDiagram
    participant Caller
    participant multiply
    participant dispatch_multiply
    participant multiply_impl
    participant GPU

    Caller->>multiply: "multiply(left, right, ansi_mode, try_mode)"
    multiply->>dispatch_multiply: "dispatch with type and check_overflow flag"
    dispatch_multiply->>dispatch_multiply: "Check if both inputs have no nulls"
    
    alt Both inputs valid AND no overflow check
        dispatch_multiply->>multiply_impl: "multiply_impl(both_inputs_valid=true)"
        multiply_impl->>multiply_impl: "Skip validity vector allocation"
        multiply_impl->>GPU: "Launch multiply_no_validity_fn kernel"
        GPU-->>multiply_impl: "Computed results"
        multiply_impl-->>dispatch_multiply: "Return column with no null mask"
    else Need validity tracking
        dispatch_multiply->>multiply_impl: "multiply_impl(both_inputs_valid=false)"
        multiply_impl->>multiply_impl: "Allocate validity vector"
        multiply_impl->>GPU: "Launch multiply_fn kernel with validity tracking"
        GPU-->>multiply_impl: "Computed results and validity"
        multiply_impl->>multiply_impl: "Convert validity to null mask"
        multiply_impl-->>dispatch_multiply: "Return column with null mask"
    end
    
    dispatch_multiply-->>multiply: "Return result column"
    multiply-->>Caller: "Return result"

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

revans2
revans2 previously approved these changes Nov 19, 2025
Collaborator

@revans2 revans2 left a comment

This looks fine to me. My main concern is that this is going to disappear for Spark 4.0+, when ANSI is enabled by default and we always have to check for overflow (except for floating-point multiply).

Could you file a follow-on issue for us to explore what to do in a case like that? Is there a fast kernel we can run first to see if any overflow would happen, and then decide whether to allocate the validity buffer or not?
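
For the record, here is one possible shape of that idea as a rough, hypothetical sketch (none of these names exist in the PR; using __int128 in device code assumes a recent CUDA toolkit): a cheap read-only reduction over the row pairs that reports whether any multiplication could overflow, so the caller can decide up front whether the validity buffer is needed at all.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/inner_product.h>
#include <cstdint>

// Returns true when multiplying a row pair would overflow int64_t.
struct may_overflow_fn {
  __device__ bool operator()(int64_t a, int64_t b) const {
    __int128 const p = static_cast<__int128>(a) * static_cast<__int128>(b);
    return p > static_cast<__int128>(INT64_MAX) ||
           p < static_cast<__int128>(INT64_MIN);
  }
};

// Pre-pass: a logical-OR reduction over all row pairs. If this returns false,
// the multiply kernel that follows can skip the validity buffer entirely.
bool any_overflow(thrust::device_vector<int64_t> const& lhs,
                  thrust::device_vector<int64_t> const& rhs)
{
  return thrust::inner_product(thrust::device, lhs.begin(), lhs.end(),
                               rhs.begin(), false,
                               thrust::logical_or<bool>{}, may_overflow_fn{});
}
```

The trade-off is an extra read of both inputs in the common no-overflow case, so whether this wins depends on how often overflow actually occurs in practice.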

abellina
abellina previously approved these changes Nov 19, 2025
Collaborator

@abellina abellina left a comment

nice

@binmahone binmahone dismissed stale reviews from abellina and revans2 via 07ce0ec November 20, 2025 01:45
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

ttnghia
ttnghia previously approved these changes Nov 20, 2025
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

@binmahone
Collaborator Author

This looks fine to me. My main concern is that this is going to disappear for Spark 4.0+, when ANSI is enabled by default and we always have to check for overflow (except for floating-point multiply).

Could you file a follow-on issue for us to explore what to do in a case like that? Is there a fast kernel we can run first to see if any overflow would happen, and then decide whether to allocate the validity buffer or not?

Follow-up issue filed at #3982.

@binmahone
Collaborator Author

build

@binmahone
Collaborator Author

binmahone commented Nov 20, 2025

Let's take a step back and look beyond the scope of the multiply operator alone.

I remember that we previously observed in ClickHouse that, in a broader sense, declaring every column in the input schema as NOT NULL performs significantly better than using Nullable columns.

However, it seems our code doesn't handle the NOT NULL scenario very well. Do you think this direction has potential? If so, I feel I should first open an issue to investigate how we can quantify the overhead we're currently incurring even when the input is already not null. @revans2 @abellina

Collaborator

@GaryShen2008 GaryShen2008 left a comment

Approve again after a small format change.

@binmahone binmahone merged commit 7401237 into NVIDIA:main Nov 21, 2025
5 checks passed
@revans2
Collaborator

revans2 commented Nov 21, 2025

I remember that we previously observed in ClickHouse that, in a broader sense, declaring every column in the input schema as NOT NULL performs significantly better than using Nullable columns.

However, it seems our code doesn't handle the NOT NULL scenario very well. Do you think this direction has potential? If so, I feel I should first open an issue to investigate how we can quantify the overhead we're currently incurring even when the input is already not null. @revans2 @abellina

There are two levels of NOT NULL here that we need to think about/deal with.

There is the Spark level of nullable. Spark primarily uses this in code generation to avoid even checking whether the data is null or not. It is important for CPU operations, but not as much for GPU operations. To be conservative, Spark generally assumes everything is nullable unless there are explicit operations that prove it cannot be.

On the GPU we tend to react differently. Each operation/algorithm is responsible for determining if it should allocate a validity buffer or not. Most of the time they do the right thing and allocate it properly based on an actual null_count.

https://github.com/rapidsai/cudf/blob/1e109df66a6c2a6368165ef7f8491e265ff59f63/cpp/include/cudf/strings/detail/strings_column_factories.cuh#L71-L76

At times we can know up front if we even need to calculate this, like with multiply. But it is not perfect in all cases. It is probably worth doing an audit, possibly with AI, to validate that it is all optimized.
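
For illustration, that allocation policy boils down to something like the following sketch (a hypothetical helper, not cudf's real API; the linked strings_column_factories.cuh shows the actual pattern): compute the validity first, and only keep a mask when the null count is non-zero.

```cpp
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cstddef>
#include <cstdint>
#include <utility>

struct column_result {
  thrust::device_vector<int64_t> data;
  thrust::device_vector<bool> null_mask;  // stays empty when null_count == 0
  std::size_t null_count = 0;
};

column_result finalize(thrust::device_vector<int64_t>&& data,
                       thrust::device_vector<bool>&& validity)
{
  auto const null_count = static_cast<std::size_t>(
    thrust::count(thrust::device, validity.begin(), validity.end(), false));
  column_result out{std::move(data), {}, null_count};
  // Only keep (in cudf's case, bit-pack) the mask when nulls actually exist;
  // an all-valid column carries no mask at all.
  if (null_count > 0) { out.null_mask = std::move(validity); }
  return out;
}
```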



Development

Successfully merging this pull request may close these issues.

[PERF] Optimize multiply operation by avoiding unnecessary validity vector allocation

6 participants