CUDA: mul_mat_v support for batch sizes > 1 #14262
base: master
Conversation
(hip)rocBLAS performing very poorly on RDNA is a known issue and not down to the exact calls we are using; it's pretty bad for RDNA2, it gets worse for RDNA3, and on RDNA4 it might as well be broken performance-wise. On MI hardware the performance is much better, so possibly we will not want to do this there, but that needs benchmarking.
I forgot: I changed the integer size in the kernel from 64-bit to 32-bit due to issues with register pressure.
I think this is OK as long as the pointers or indices into the weight matrix are still computed with 64-bit math, otherwise it will result in overflows with large matrices. E.g. the Command-R output matrix is 256000*8192 elements, which is very close to the limit of a 32-bit int.
Specifically, I changed the calculation of the initial offsets to 64-bit math. That is the only part of the kernel where the pointer offsets scale with the product of two tensor dimensions; the offsets that scale with a single tensor dimension are at least 1024x lower.
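To illustrate the point, here is a standalone sketch with made-up values (not code from this PR): the per-row base offset scales with the product of two dimensions and needs 64-bit math, while in-row offsets scale with a single dimension and comfortably fit in 32 bits.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Shape from the Command-R example above: 256000 x 8192 elements.
    const int nrows = 256000;
    const int ncols = 8192;

    // Offset of the last row: (nrows - 1) * ncols = 2'097'143'808, just under
    // INT32_MAX. A slightly larger tensor would overflow 32-bit math, so the
    // initial offset is promoted to 64 bit.
    const int64_t row_offset = (int64_t) (nrows - 1) * ncols;

    // Offsets within a row scale with only one dimension (at most ncols - 1),
    // so 32-bit indices are safe there.
    const int32_t col_offset = ncols - 1;

    printf("row offset: %lld, in-row offset: %d\n", (long long) row_offset, col_offset);
    return 0;
}
```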
Thank you, I forgot to check.
Merged your changes along with #13842 and tested on MTT S80 and S4000. All […]. However, I noticed a slight performance drop on the S4000 when running […].
On CDNA I am seeing a large (2x+) slowdown starting at batch size 4 in all data types. I will try to take a look soon, maybe Sunday.
This PR extends the `mul_mat_vec` kernels to batch sizes > 1; they seem to be viable up to a batch size of 8. The primary purpose is to help with speculative decoding and batched inference.
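As a rough illustration of the idea (a minimal sketch with hypothetical names, not the actual kernels in this PR): each weight element is loaded once and reused for every vector in the small batch, so the extra vectors come almost for free as long as the kernel stays bound by weight-matrix bandwidth.

```cpp
#include <cuda_fp16.h>
#include <cstdint>

#define WARP_SIZE 32

// One warp per weight row; templated on the (small) batch size,
// e.g. mul_mat_vec_batched_sketch<4><<<nrows, WARP_SIZE>>>(W, X, Y, ncols, nrows);
template <int nbatch>
__global__ void mul_mat_vec_batched_sketch(
        const half  * __restrict__ W,   // weight matrix, nrows x ncols
        const float * __restrict__ X,   // nbatch input vectors of length ncols
        float       * __restrict__ Y,   // nbatch output vectors of length nrows
        const int ncols, const int nrows) {
    const int row = blockIdx.x;

    // Initial offset in 64-bit math: it scales with the product of two dims.
    const half * W_row = W + (int64_t) row * ncols;

    float sum[nbatch] = {0.0f};
    for (int col = threadIdx.x; col < ncols; col += WARP_SIZE) {
        const float w = __half2float(W_row[col]);     // load the weight once...
        for (int b = 0; b < nbatch; ++b) {
            sum[b] += w * X[(int64_t) b*ncols + col]; // ...reuse it for each vector
        }
    }

    // Warp-level reduction per batch entry (assumes blockDim.x == WARP_SIZE).
    for (int b = 0; b < nbatch; ++b) {
        for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
            sum[b] += __shfl_xor_sync(0xFFFFFFFF, sum[b], offset);
        }
        if (threadIdx.x == 0) {
            Y[(int64_t) b*nrows + row] = sum[b];
        }
    }
}
```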
Performance changes

On modern NVIDIA GPUs the speedup vs. cuBLAS for FP16 and BF16 is relatively small, though the speedup for FP32 is larger than I expected. Conversely, the FP32 speedup for Pascal is much smaller, if there is any. What I think happened is that the NVIDIA engineers simply put less work into optimizing FP32 GEMM on more modern GPUs. The cuBLAS performance for old NVIDIA GPUs and the hipBLAS performance seem to be very bad for FP16/BF16, so this PR achieves a ridiculous 20x speedup for some use cases; maybe we are running the BLAS libraries in a suboptimal way.
@IMbackK @yeahdongcn it may be worth checking whether the logic I implemented in `ggml_cuda_should_use_mmv` can be improved for non-NVIDIA hardware.
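For reference, here is a hypothetical sketch of the kind of dispatch heuristic meant here (not the actual `ggml_cuda_should_use_mmv` implementation): prefer the custom kernels up to the batch size where they remain viable, and lower the crossover on hardware whose BLAS GEMM is already fast; the exact thresholds would need per-vendor benchmarking.

```cpp
#include <cstdint>

// Hypothetical helper, not ggml code: decide between the batched mul_mat_vec
// kernels and dispatching the matrix multiplication to cuBLAS/hipBLAS.
static bool should_use_mmv_sketch(int64_t batch_size, bool blas_fp16_is_fast) {
    // The batched mul_mat_vec kernels only look viable up to batch size 8.
    if (batch_size > 8) {
        return false;
    }
    // On hardware where the BLAS library already handles FP16/BF16 GEMM well
    // (e.g. modern NVIDIA GPUs), the crossover point may be lower.
    if (blas_fp16_is_fast && batch_size > 4) {
        return false;
    }
    return true;
}
```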