[CUDA] Use GEMM with epilogue instead of AddMM #2569
Conversation
```diff
-  if (cu::can_use_gemv(M, N, K, a_transposed, b_transposed)) {
+  // Use gemv when possible
+  if (!bias && cu::can_use_gemv(M, N, K, a_transposed, b_transposed)) {
```
Am I missing something here, or is the removed return statement a bug?
It looks like it will encode both the gemv and the full gemm right now.
It was a mistake, thanks for noticing it!
```cpp
  if (beta_ == 1 && c.strides(-1) == 1 && c.data_size() == out.shape(-1)) {
    out.set_data(allocator::malloc(out.nbytes()));
    gemm_and_bias(
        encoder,
        M,
        N,
        K,
        a_transposed,
        lda,
        b_transposed,
        ldb,
        out,
        a,
        b,
        c.data<void>());
    return;
```
Can we simply set up the right cublas matmul without adding a set_bias by checking ldc?
The epilogue bias requires the input to be a vector of size out.shape(-1), while the normal addmm interface requires a matrix of the same shape as out. So we can't really encapsulate the dispatching code inside CublasGemm because the shape of c is different, unless we move the code processing c (e.g. copy_gpu(c, out) and collapse_batches(a, b, c)) into the class, which I think would decrease code readability.
I tested the branch with inference and training; both showed no meaningful difference in performance. For inference that is good because it means I didn't make mistakes, but it is a bit disappointing that training did not become faster.
It's become less common to use a bias in linear layers, which would be the main place that we route to an addmm with a bias. Not sure which benchmark you ran, but it's possible there are simply no biases in the training run.
Ah, that is the reason; turning on biases would make training a lot slower. I also found that there are some cases that are not redirected to the bias epilogue code, so I'll take a further look.
Invoke cublasLt with the bias epilogue when possible, which is about 2x faster than the general addmm kernel. Also, because the CUBLASLT_ORDER_ROW option does not work with the bias epilogue, the CublasGemm class is changed to use a slightly counter-intuitive way to do matmul in row-major layout.