
[BUG] Fix performance regression with the updated MurmurHash3_x86_32 hasher #18475

@PointKernel

Description

Describe the bug
#17429 updated MurmurHash3_x86_32 to use the equivalent cuco implementation in order to reduce code duplication. The performance impact of this change was expected to be negligible.
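For context, MurmurHash3_x86_32 is a 4-byte block loop followed by a byte-wise tail and a finalization mix. The sketch below is a host-side version of the public-domain algorithm (Austin Appleby), not the cudf or cuco device code; the byte-wise tail is the part where the byte-type choice discussed later comes into play.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side reference sketch of MurmurHash3_x86_32 (public-domain algorithm);
// not the cudf/cuco device implementation.
inline uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

uint32_t murmurhash3_x86_32(void const* key, size_t len, uint32_t seed)
{
  auto const* data = static_cast<unsigned char const*>(key);
  uint32_t h1 = seed;
  constexpr uint32_t c1 = 0xcc9e2d51u;
  constexpr uint32_t c2 = 0x1b873593u;

  // Body: process the input in 4-byte blocks.
  size_t const nblocks = len / 4;
  for (size_t i = 0; i < nblocks; ++i) {
    uint32_t k1;
    std::memcpy(&k1, data + i * 4, sizeof(k1));
    k1 *= c1; k1 = rotl32(k1, 15); k1 *= c2;
    h1 ^= k1; h1 = rotl32(h1, 13); h1 = h1 * 5u + 0xe6546b64u;
  }

  // Tail: byte-wise reads of the trailing 1-3 bytes.
  unsigned char const* tail = data + nblocks * 4;
  uint32_t k1 = 0;
  switch (len & 3u) {
    case 3: k1 ^= static_cast<uint32_t>(tail[2]) << 16; [[fallthrough]];
    case 2: k1 ^= static_cast<uint32_t>(tail[1]) << 8; [[fallthrough]];
    case 1:
      k1 ^= static_cast<uint32_t>(tail[0]);
      k1 *= c1; k1 = rotl32(k1, 15); k1 *= c2;
      h1 ^= k1;
  }

  // Finalization mix.
  h1 ^= static_cast<uint32_t>(len);
  h1 ^= h1 >> 16; h1 *= 0x85ebca6bu;
  h1 ^= h1 >> 13; h1 *= 0xc2b2ae35u;
  h1 ^= h1 >> 16;
  return h1;
}
```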

However, nightly benchmark results show a significant performance regression in the groupby_max_cardinality benchmark on November 27, 2024. Upon investigation, #17429 appears to be the only related change that could have introduced this regression.

Steps/Code to reproduce bug
To verify the impact, I reverted https://github.com/rapidsai/cudf/blob/branch-25.06/cpp/include/cudf/hashing/detail/murmurhash3_x86_32.cuh to its state prior to #17429. After the reversion, benchmark results showed up to a 6x speedup with the original cudf hasher.

Sample benchmark comparison:

['groupby-card-latest.json', 'groupby-card-naive-murmurhash.json']
# groupby_max_cardinality

## [0] Quadro RTX 8000

|  T  |  num_aggregations  |  cardinality  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-----|--------------------|---------------|------------|-------------|------------|-------------|---------------|---------|----------|
| I32 |         1          |      10       |  12.254 ms |       0.59% |   2.124 ms |       3.35% | -10129.967 us | -82.67% |   FAST   |
| I32 |         1          |      20       |  12.694 ms |       0.21% |   2.155 ms |       0.45% | -10539.395 us | -83.03% |   FAST   |
| I32 |         1          |      50       |  13.608 ms |       0.43% |   2.508 ms |       0.43% | -11100.313 us | -81.57% |   FAST   |
| I32 |         1          |      100      |  14.456 ms |       0.43% |   2.718 ms |       0.46% | -11737.293 us | -81.19% |   FAST   |
| I32 |         1          |     1000      |   8.557 ms |       0.22% |   2.364 ms |       0.58% |  -6193.158 us | -72.37% |   FAST   |
| I32 |         1          |     10000     |  10.594 ms |       0.21% |   2.418 ms |       0.50% |  -8175.499 us | -77.17% |   FAST   |
| I32 |         1          |    100000     |  13.555 ms |       0.20% |   4.106 ms |       0.24% |  -9448.919 us | -69.71% |   FAST   |
| I32 |         1          |    1000000    |  15.194 ms |       0.24% |   6.443 ms |       0.09% |  -8750.558 us | -57.59% |   FAST   |
| I32 |         1          |   10000000    |  16.471 ms |       0.36% |   7.918 ms |       0.09% |  -8552.743 us | -51.93% |   FAST   |
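Independently of the benchmarks, the revert should change codegen only, not hash values: a byte-typed tail load must produce the same word as a plain unsigned-char load. A host-side sanity sketch of that equivalence, using std::byte as a stand-in for cuda::std::byte (the two expose the same interface):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Tail load written against a byte type (std::byte here as a host-side
// stand-in for cuda::std::byte; sketch, not the actual cudf code).
uint32_t tail_load_byte(std::byte const* tail, size_t rem)
{
  uint32_t k1 = 0;
  switch (rem) {
    case 3: k1 ^= std::to_integer<uint32_t>(tail[2]) << 16; [[fallthrough]];
    case 2: k1 ^= std::to_integer<uint32_t>(tail[1]) << 8; [[fallthrough]];
    case 1: k1 ^= std::to_integer<uint32_t>(tail[0]);
  }
  return k1;
}

// The same load via plain unsigned char; both must yield identical words.
uint32_t tail_load_uchar(unsigned char const* tail, size_t rem)
{
  uint32_t k1 = 0;
  switch (rem) {
    case 3: k1 ^= static_cast<uint32_t>(tail[2]) << 16; [[fallthrough]];
    case 2: k1 ^= static_cast<uint32_t>(tail[1]) << 8; [[fallthrough]];
    case 1: k1 ^= static_cast<uint32_t>(tail[0]);
  }
  return k1;
}
```

If the two disagree on any input, the regression would be a correctness bug rather than a codegen issue; agreement supports the register-pressure hypothesis below.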

Expected behavior
The source of this performance regression should be identified and fixed to restore the previous performance.

Additional context
The primary change introduced in #17429 is the replacement of std::byte with cuda::std::byte for improved device compatibility. A likely explanation is that cuda::std::byte introduces additional overhead or register pressure that hurts runtime performance. If this is confirmed, a corresponding issue should be filed upstream in CCCL to address the overhead introduced by cuda::std::byte.

On further reflection, this might also explain why we haven't observed any performance gains from applying primitive row operators in distinct joins and stream compaction (#17467, #17726).

Labels: Performance (Performance related issue), bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code)
