[BUG] Fix performance regression with the updated MurmurHash3_x86_32 hasher

  **Describe the bug**
#17429 updated `MurmurHash3_x86_32` to use the cuco equivalent implementation in order to reduce code duplication. The expected performance impact of this change was considered negligible.

However, nightly benchmark results show a significant performance regression in the `groupby_max_cardinality` benchmark on November 27, 2024. Upon investigation, #17429 appears to be the only related change that could have introduced this regression.
  
  **Steps/Code to reproduce bug**
To verify the impact, I reverted the changes made in https://github.com/rapidsai/cudf/blob/branch-25.06/cpp/include/cudf/hashing/detail/murmurhash3_x86_32.cuh to its state prior to #17429. After the reversion, benchmark results showed up to a 6x speedup when using the original cudf hasher.

Sample benchmark comparison:  
  ```
  ['groupby-card-latest.json', 'groupby-card-naive-murmurhash.json']
  # groupby_max_cardinality
  
  ## [0] Quadro RTX 8000
  
  |  T  |  num_aggregations  |  cardinality  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
  |-----|--------------------|---------------|------------|-------------|------------|-------------|---------------|---------|----------|
  | I32 |         1          |      10       |  12.254 ms |       0.59% |   2.124 ms |       3.35% | -10129.967 us | -82.67% |   FAST   |
  | I32 |         1          |      20       |  12.694 ms |       0.21% |   2.155 ms |       0.45% | -10539.395 us | -83.03% |   FAST   |
  | I32 |         1          |      50       |  13.608 ms |       0.43% |   2.508 ms |       0.43% | -11100.313 us | -81.57% |   FAST   |
  | I32 |         1          |      100      |  14.456 ms |       0.43% |   2.718 ms |       0.46% | -11737.293 us | -81.19% |   FAST   |
  | I32 |         1          |     1000      |   8.557 ms |       0.22% |   2.364 ms |       0.58% |  -6193.158 us | -72.37% |   FAST   |
  | I32 |         1          |     10000     |  10.594 ms |       0.21% |   2.418 ms |       0.50% |  -8175.499 us | -77.17% |   FAST   |
  | I32 |         1          |    100000     |  13.555 ms |       0.20% |   4.106 ms |       0.24% |  -9448.919 us | -69.71% |   FAST   |
  | I32 |         1          |    1000000    |  15.194 ms |       0.24% |   6.443 ms |       0.09% |  -8750.558 us | -57.59% |   FAST   |
  | I32 |         1          |   10000000    |  16.471 ms |       0.36% |   7.918 ms |       0.09% |  -8552.743 us | -51.93% |   FAST   |
  ```
  
  
  **Expected behavior**
We need to identify the source of this performance regression and resolve it to restore previous performance levels.
  
  
  **Additional context**
The primary change introduced in #17429 is the replacement of `std::byte` with `cuda::std::byte` for improved device compatibility. A likely explanation is that using `cuda::std::byte` requires additional overhead or register pressure, which negatively impacts runtime performance. If this is confirmed, a corresponding issue should be filed upstream in CCCL to address the performance overhead introduced by `cuda::std::byte`.

On further reflection, this might also explain why we haven’t observed any performance gains when applying primitive row operators in distinct joins and stream compaction: #17467 and #17726 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Fix performance regression with the updated MurmurHash3_x86_32 hasher #18475

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] Fix performance regression with the updated MurmurHash3_x86_32 hasher #18475

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions