Describe the bug
#17429 updated MurmurHash3_x86_32 to use the cuco equivalent implementation in order to reduce code duplication. The expected performance impact of this change was considered negligible.
However, nightly benchmark results show a significant performance regression in the groupby_max_cardinality benchmark on November 27, 2024. Upon investigation, #17429 appears to be the only related change that could have introduced this regression.
Steps/Code to reproduce bug
To verify the impact, I reverted the changes made in https://github.com/rapidsai/cudf/blob/branch-25.06/cpp/include/cudf/hashing/detail/murmurhash3_x86_32.cuh to its state prior to #17429. After the reversion, benchmark results showed up to a 6x speedup when using the original cudf hasher.
Sample benchmark comparison:
['groupby-card-latest.json', 'groupby-card-naive-murmurhash.json']
# groupby_max_cardinality
## [0] Quadro RTX 8000
| T | num_aggregations | cardinality | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----|--------------------|---------------|------------|-------------|------------|-------------|---------------|---------|----------|
| I32 | 1 | 10 | 12.254 ms | 0.59% | 2.124 ms | 3.35% | -10129.967 us | -82.67% | FAST |
| I32 | 1 | 20 | 12.694 ms | 0.21% | 2.155 ms | 0.45% | -10539.395 us | -83.03% | FAST |
| I32 | 1 | 50 | 13.608 ms | 0.43% | 2.508 ms | 0.43% | -11100.313 us | -81.57% | FAST |
| I32 | 1 | 100 | 14.456 ms | 0.43% | 2.718 ms | 0.46% | -11737.293 us | -81.19% | FAST |
| I32 | 1 | 1000 | 8.557 ms | 0.22% | 2.364 ms | 0.58% | -6193.158 us | -72.37% | FAST |
| I32 | 1 | 10000 | 10.594 ms | 0.21% | 2.418 ms | 0.50% | -8175.499 us | -77.17% | FAST |
| I32 | 1 | 100000 | 13.555 ms | 0.20% | 4.106 ms | 0.24% | -9448.919 us | -69.71% | FAST |
| I32 | 1 | 1000000 | 15.194 ms | 0.24% | 6.443 ms | 0.09% | -8750.558 us | -57.59% | FAST |
| I32 | 1 | 10000000 | 16.471 ms | 0.36% | 7.918 ms | 0.09% | -8552.743 us | -51.93% | FAST |
Expected behavior
We need to identify the source of this performance regression and resolve it to restore previous performance levels.
Additional context
The primary change introduced in #17429 is the replacement of std::byte with cuda::std::byte for improved device compatibility. A likely explanation is that using cuda::std::byte requires additional overhead or register pressure, which negatively impacts runtime performance. If this is confirmed, a corresponding issue should be filed upstream in CCCL to address the performance overhead introduced by cuda::std::byte.
On further reflection, this might also explain why we haven’t observed any performance gains when applying primitive row operators in distinct joins and stream compaction: #17467 and #17726
Describe the bug
#17429 updated
MurmurHash3_x86_32to use the cuco equivalent implementation in order to reduce code duplication. The expected performance impact of this change was considered negligible.However, nightly benchmark results show a significant performance regression in the
groupby_max_cardinalitybenchmark on November 27, 2024. Upon investigation, #17429 appears to be the only related change that could have introduced this regression.Steps/Code to reproduce bug
To verify the impact, I reverted the changes made in https://github.com/rapidsai/cudf/blob/branch-25.06/cpp/include/cudf/hashing/detail/murmurhash3_x86_32.cuh to its state prior to #17429. After the reversion, benchmark results showed up to a 6x speedup when using the original cudf hasher.
Sample benchmark comparison:
Expected behavior
We need to identify the source of this performance regression and resolve it to restore previous performance levels.
Additional context
The primary change introduced in #17429 is the replacement of
std::bytewithcuda::std::bytefor improved device compatibility. A likely explanation is that usingcuda::std::byterequires additional overhead or register pressure, which negatively impacts runtime performance. If this is confirmed, a corresponding issue should be filed upstream in CCCL to address the performance overhead introduced bycuda::std::byte.On further reflection, this might also explain why we haven’t observed any performance gains when applying primitive row operators in distinct joins and stream compaction: #17467 and #17726