Make Slice kernel tiling adaptive #3557
Conversation
Computes the optimal pixels-per-thread for the Slice GPU kernel based on the number of SMs instead of using a hardcoded constant value.
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
dali/kernels/common/utils.h
Outdated
  return strides;
}

inline int64_t GetSMCount() {
This function should go to cuda_utils.h, right next to MaxThreadsPerBlock. Please move it there and revert this file to the original version.
Also, make it return an int - I don't expect we're going to surpass 2^31 SMs any time soon and cudaDeviceProp::multiProcessorCount is declared as int.
Moved to cuda_utils.h and changed to int
dali/kernels/slice/slice_gpu.cuh
Outdated
  block_count_ += std::ceil(
-     sample_size / static_cast<float>(kBlockSize));
+     sample_size / static_cast<float>(blockSize));
This is prone to rounding errors. Use div_ceil instead.
Suggested change:
- block_count_ += std::ceil(
-     sample_size / static_cast<float>(blockSize));
+ block_count_ += div_ceil(sample_size, block_size);
Changed it to div_ceil, thank you
dali/kernels/slice/slice_gpu.cuh
Outdated
  total_volume += volume(args.shape);
}

auto minBlocks = 4 * GetSMCount();
We don't use camelCase. Also, don't use auto for trivial types.
Suggested change:
- auto minBlocks = 4 * GetSMCount();
+ int min_blocks = 4 * GetSMCount();
dali/kernels/slice/slice_gpu.cuh
Outdated
int64 blockSize;
Suggested change:
- int64 blockSize;
+ int64_t block_size_ = 256;

- snake_case
- trailing _ for a member (I don't like it, but that's what we use everywhere...) - stick it to the block_count_ field rather than to the constants
- some default would be nice
Sorry for that, I had block_count_ just below and didn't notice the style difference :/
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
jantonguirao left a comment:
@hugo213 Thanks for this contribution, and for the thorough explanation in the PR description. Good work!
!build

CI MESSAGE: [3542688]: BUILD STARTED

CI MESSAGE: [3542688]: BUILD FAILED

@hugo213 it seems that clang is unhappy:
As clang was complaining about comparing signed and unsigned, I've changed variables which are guaranteed to be non-negative to unsigned.
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>

Google Codestyle recommends treating abbreviations as words and CamelCasing them.
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
I've fixed the signedness issues, Clang should be happy now. Also, @szalpal pointed out that Google Codestyle recommends (https://google.github.io/styleguide/cppguide.html#General_Naming_Rules) treating abbreviations as words, so I've changed GetSMCount to GetSmCount.

!build

CI MESSAGE: [3544727]: BUILD STARTED

CI MESSAGE: [3544727]: BUILD FAILED
include/dali/core/cuda_utils.h
Outdated
inline int GetSmCount() {
  static int count = 0;
  if (!count) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, 0));
    count = prop.multiProcessorCount;
  }
  return count;
}
Sorry, but that's not sufficient. There can be more than one device - and indeed, more than one type of device.
Suggested change:

inline int GetSmCount(int device_id = -1) {
  if (device_id < 0)
    CUDA_CALL(cudaGetDevice(&device_id));
  static int dev_count = []() {
    int ndevs = 0;
    CUDA_CALL(cudaGetDeviceCount(&ndevs));
    return ndevs;
  }();
  static unique_ptr<int[]> count(new int[dev_count]());  // this should be zero-initialized
  if (!count[device_id]) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, device_id));
    count[device_id] = prop.multiProcessorCount;
  }
  return count[device_id];
}
I've fixed this as you suggested in the latest commit, with a small difference. I decided to use a vector instead of a unique_ptr to an array, because I think it's more readable that way. If there's some advantage of the unique_ptr approach, of course I'll change it.
!build

CI MESSAGE: [3545912]: BUILD STARTED

CI MESSAGE: [3545912]: BUILD PASSED
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
!build

CI MESSAGE: [3549166]: BUILD STARTED

CI MESSAGE: [3549166]: BUILD PASSED
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
Description
What happened in this PR
When SliceGPU runs the kernel, each thread processes a hardcoded count of 64 pixels (DALI/dali/kernels/slice/slice_gpu.cuh, line 197 in d7951e5).
Making this value adaptive increases the throughput by up to 60% for certain configurations.
The solution
This solution tries to make at least 4 * number of SMs tiles to improve total GPU occupancy. It starts with the original value of 64 pixels per thread and then divides it by 2 until the estimated number of tiles reaches 4 * number of SMs or a lower limit of 4 pixels per thread is reached. This means that for bigger data the behaviour is unchanged, as the original value of 64 pixels per thread is used.

Benchmarks
Benchmark. I have measured Slice's performance in cropping images of size 500x500x3, 1000x1000x3, 2000x2000x3 to 250x250x3, 500x500x3 and 1000x1000x3 respectively. The measurements were taken for batches of 1, 2, 4, 8, 16, 32, 64, 128 and 256 images. As the change is not specific to a particular shape of input data, I would expect a similar performance impact on other, more complex shapes.
The GPU. The benchmarks were run on Titan V, which has 80 SMs.
Performance for various pixels per thread. I have measured the performance of SliceGPU for values of pixels per thread (abbreviated to ppt on the plots) other than the original 64. An increased throughput can be observed for smaller values of pixels per thread at small batch sizes.
Performance of the adaptive method. In the second picture the results achieved by the adaptive method described above are presented. As you can see, no other value of pixels per thread performs better than the one chosen by the adaptive method.
Additional information
Affected modules and functionalities:
Key points relevant for the review:
The magic constant 4 in computing the minimal number of tiles (4 * number of SMs). This constant 4 turned out to be the best during benchmarks. The reason for that is probably that the kernel uses 48 registers, which means ~5 blocks can fit on an SM. This probably could be computed from the GPU properties at runtime, but it would add a lot of complexity to the code while having a small effect on performance, so I decided to leave the magic constant here. I'm not really convinced, though.

The placement of the GetSMCount method. I'm not very familiar with DALI, so I'm not sure if utils.h is a good place for the new GetSMCount function.

Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A