
Conversation

@binmahone
Collaborator

This PR fixes #3973, and the resulting nsys trace looks like:

[nsys trace screenshot]

You can see that the yellow bar portion shrinks significantly.

This improvement reduces our workload's end-to-end time by 10%.

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@greptile-apps
Contributor

greptile-apps bot commented Nov 19, 2025

Greptile Summary

  • Adds fast path optimization that eliminates validity vector allocation when both inputs have no nulls and overflow checking is disabled
  • Introduces multiply_no_validity_fn functor that directly computes multiplication without validity tracking, reducing memory usage by num_rows * sizeof(bool) bytes
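
For context, this is the shape of that fast path as a minimal sketch. It is not the PR's actual code: only the multiply_no_validity_fn name comes from the summary above, and the simplified int64_t signature and all other names are illustrative.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/transform.h>
#include <cstdint>

// Fast path: both inputs are known non-null, so only the product is written
// and no per-row validity is tracked.
struct multiply_no_validity_fn {
  __device__ int64_t operator()(int64_t a, int64_t b) const { return a * b; }
};

thrust::device_vector<int64_t> multiply(thrust::device_vector<int64_t> const& lhs,
                                        thrust::device_vector<int64_t> const& rhs,
                                        bool lhs_has_nulls, bool rhs_has_nulls,
                                        bool check_overflow)
{
  auto const n = lhs.size();
  thrust::device_vector<int64_t> out(n);
  bool const both_inputs_valid = !lhs_has_nulls && !rhs_has_nulls;

  if (both_inputs_valid && !check_overflow) {
    // Fast path: the n * sizeof(bool) validity vector is never allocated,
    // and the pass that converts it into a null mask is skipped too.
    thrust::transform(thrust::device, lhs.begin(), lhs.end(), rhs.begin(),
                      out.begin(), multiply_no_validity_fn{});
  } else {
    // General path: allocate the validity vector, run the null/overflow-aware
    // functor, then convert the vector into a bit-packed null mask (elided).
    thrust::device_vector<bool> validity(n);
    // ...
  }
  return out;
}
```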

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The optimization is well implemented, with correct logic for detecting when the fast path can be used. It properly handles all input combinations (column-column, column-scalar, scalar-column) and maintains correctness by skipping validity tracking only when both inputs are guaranteed valid and overflow checking is disabled
  • No files require special attention

Important Files Changed

Filename: src/main/cpp/src/multiply.cu
Overview: Adds an optimized fast path that skips validity vector allocation when both inputs have no nulls and overflow checking is disabled, reducing memory and compute overhead

Sequence Diagram

sequenceDiagram
    participant Caller
    participant multiply
    participant dispatch_multiply
    participant multiply_impl
    participant GPU

    Caller->>multiply: "multiply(left, right, ansi_mode, try_mode)"
    multiply->>dispatch_multiply: "dispatch with type and check_overflow flag"
    dispatch_multiply->>dispatch_multiply: "Check if both inputs have no nulls"
    
    alt Both inputs valid AND no overflow check
        dispatch_multiply->>multiply_impl: "multiply_impl(both_inputs_valid=true)"
        multiply_impl->>multiply_impl: "Skip validity vector allocation"
        multiply_impl->>GPU: "Launch multiply_no_validity_fn kernel"
        GPU-->>multiply_impl: "Computed results"
        multiply_impl-->>dispatch_multiply: "Return column with no null mask"
    else Need validity tracking
        dispatch_multiply->>multiply_impl: "multiply_impl(both_inputs_valid=false)"
        multiply_impl->>multiply_impl: "Allocate validity vector"
        multiply_impl->>GPU: "Launch multiply_fn kernel with validity tracking"
        GPU-->>multiply_impl: "Computed results and validity"
        multiply_impl->>multiply_impl: "Convert validity to null mask"
        multiply_impl-->>dispatch_multiply: "Return column with null mask"
    end
    
    dispatch_multiply-->>multiply: "Return result column"
    multiply-->>Caller: "Return result"

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

revans2
revans2 previously approved these changes Nov 19, 2025
Collaborator

@revans2 revans2 left a comment

This looks fine to me. My main concern is that this is going to disappear for Spark 4.0+, when ANSI is enabled by default and we always have to check for overflow (except for floating-point multiply).

Could you file a follow-on issue for us to explore what to do in a case like that? Is there a fast kernel we can run first to see if any overflow would happen, and then decide whether to allocate the validity buffer or not?
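
For the record, here is one possible shape of that idea as a rough, hypothetical sketch (none of these names exist in the PR; using __int128 in device code assumes a recent CUDA toolkit): a cheap read-only reduction over the row pairs that reports whether any multiplication could overflow, so the caller can decide up front whether the validity buffer is needed at all.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/inner_product.h>
#include <cstdint>

// Returns true when multiplying a row pair would overflow int64_t.
struct may_overflow_fn {
  __device__ bool operator()(int64_t a, int64_t b) const {
    __int128 const p = static_cast<__int128>(a) * static_cast<__int128>(b);
    return p > static_cast<__int128>(INT64_MAX) ||
           p < static_cast<__int128>(INT64_MIN);
  }
};

// Pre-pass: a logical-OR reduction over all row pairs. If this returns false,
// the multiply kernel that follows can skip the validity buffer entirely.
bool any_overflow(thrust::device_vector<int64_t> const& lhs,
                  thrust::device_vector<int64_t> const& rhs)
{
  return thrust::inner_product(thrust::device, lhs.begin(), lhs.end(),
                               rhs.begin(), false,
                               thrust::logical_or<bool>{}, may_overflow_fn{});
}
```

The trade-off is an extra read of both inputs in the common no-overflow case, so whether this wins depends on how often overflow actually occurs in practice.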

abellina
abellina previously approved these changes Nov 19, 2025
Collaborator

@abellina abellina left a comment

nice

@binmahone binmahone dismissed stale reviews from abellina and revans2 via 07ce0ec November 20, 2025 01:45
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

ttnghia
ttnghia previously approved these changes Nov 20, 2025
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

@binmahone
Collaborator Author

This looks fine to me. My main concern is that this is going to disappear for Spark 4.0+, when ANSI is enabled by default and we always have to check for overflow (except for floating-point multiply).

Could you file a follow-on issue for us to explore what to do in a case like that? Is there a fast kernel we can run first to see if any overflow would happen, and then decide whether to allocate the validity buffer or not?

Follow-up issue filed at #3982.

@binmahone
Collaborator Author

build

@binmahone
Collaborator Author

binmahone commented Nov 20, 2025

Let's take a step back and look beyond the scope of the multiply operator alone.

I remember that we previously observed in ClickHouse that, in a broader sense, declaring every column in the input schema as NOT NULL performs significantly better than using Nullable columns.

However, it seems our code doesn't handle the NOT NULL scenario very well. Do you think this direction has potential? If so, I feel I should first open an issue to investigate how we can quantify the overhead we're currently incurring even when the input is already not null. @revans2 @abellina

Collaborator

@GaryShen2008 GaryShen2008 left a comment

Approve again after a small format change.

@binmahone binmahone merged commit 7401237 into NVIDIA:main Nov 21, 2025
5 checks passed
@revans2
Collaborator

revans2 commented Nov 21, 2025

I remember that we previously observed in ClickHouse that, in a broader sense, declaring every column in the input schema as NOT NULL performs significantly better than using Nullable columns.

However, it seems our code doesn't handle the NOT NULL scenario very well. Do you think this direction has potential? If so, I feel I should first open an issue to investigate how we can quantify the overhead we're currently incurring even when the input is already not null. @revans2 @abellina

There are two levels of NOT NULL here that we need to think about/deal with.

There is the Spark level of nullable. Spark primarily uses this in code generation to avoid even checking whether the data is null or not. It is important for CPU operations, but not as much for GPU operations. To be conservative, Spark generally assumes everything is nullable unless there are explicit operations that prove it cannot be.

On the GPU we tend to react differently. Each operation/algorithm is responsible for determining if it should allocate a validity buffer or not. Most of the time they do the right thing and allocate it properly based on an actual null_count.

https://github.com/rapidsai/cudf/blob/1e109df66a6c2a6368165ef7f8491e265ff59f63/cpp/include/cudf/strings/detail/strings_column_factories.cuh#L71-L76

At times we can know up front if we even need to calculate this, like with multiply. But it is not perfect in all cases. It is probably worth doing an audit, possibly with AI, to validate that it is all optimized.
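
For illustration, that allocation policy boils down to something like the following sketch (a hypothetical helper, not cudf's real API; the linked strings_column_factories.cuh shows the actual pattern): compute the validity first, and only keep a mask when the null count is non-zero.

```cpp
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cstddef>
#include <cstdint>
#include <utility>

struct column_result {
  thrust::device_vector<int64_t> data;
  thrust::device_vector<bool> null_mask;  // stays empty when null_count == 0
  std::size_t null_count = 0;
};

column_result finalize(thrust::device_vector<int64_t>&& data,
                       thrust::device_vector<bool>&& validity)
{
  auto const null_count = static_cast<std::size_t>(
    thrust::count(thrust::device, validity.begin(), validity.end(), false));
  column_result out{std::move(data), {}, null_count};
  // Only keep (in cudf's case, bit-pack) the mask when nulls actually exist;
  // an all-valid column carries no mask at all.
  if (null_count > 0) { out.null_mask = std::move(validity); }
  return out;
}
```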



Development

Successfully merging this pull request may close these issues.

[PERF] Optimize multiply operation by avoiding unnecessary validity vector allocation

6 participants