
[CUDA] Use GEMM with epilogue instead of AddMM#2569

Merged
zcbenz merged 1 commit intoml-explore:mainfrom
zcbenz:cublas-bias
Sep 9, 2025

Conversation


@zcbenz (Collaborator) commented Sep 6, 2025

Invoke cublasLt with the bias epilogue when possible, which is about 2x faster than the general addmm kernel. Also, because the CUBLASLT_ORDER_ROW option does not work with the bias epilogue, the CublasGemm class now uses a slightly counter-intuitive trick to do matmul in row-major layout.
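The row-major trick can be sketched on the CPU (a hedged illustration, not the actual CublasGemm code; `gemm_col_major` and `gemm_row_major` are stand-in names): a row-major M x N buffer is byte-identical to a column-major N x M buffer holding the transpose, so a column-major GEMM fed the operands in swapped order produces the row-major product without moving any data.

```cpp
#include <cassert>
#include <vector>

// Naive column-major GEMM: C(m x n) = A(m x k) * B(k x n),
// element (i, j) stored at column-major offset i + j * rows.
static void gemm_col_major(int m, int n, int k,
                           const std::vector<float>& A,
                           const std::vector<float>& B,
                           std::vector<float>& C) {
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < m; ++i) {
      float acc = 0.f;
      for (int p = 0; p < k; ++p)
        acc += A[i + p * m] * B[p + j * k];
      C[i + j * m] = acc;
    }
}

// Row-major C = A * B expressed through the column-major routine:
// compute C^T = B^T * A^T by swapping the operand order and the M/N
// dimensions; the column-major result C^T is exactly row-major C.
void gemm_row_major(int M, int N, int K,
                    const std::vector<float>& A,   // row-major M x K
                    const std::vector<float>& B,   // row-major K x N
                    std::vector<float>& C) {       // row-major M x N
  gemm_col_major(N, M, K, B, A, C);
}
```

This is the same identity a column-major-only cuBLASLt path can exploit when CUBLASLT_ORDER_ROW is unavailable.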


- if (cu::can_use_gemv(M, N, K, a_transposed, b_transposed)) {
+ // Use gemv when possible
+ if (!bias && cu::can_use_gemv(M, N, K, a_transposed, b_transposed)) {
Member
Am I missing something here, or is the removal of the return statement a bug?

It looks like it will encode both the gemv and the full gemm right now?

Collaborator Author

It was a mistake, thanks for noticing it!
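The fix can be sketched as a tiny dispatch function (hypothetical names, not the MLX source): after encoding the gemv fast path the function must return, otherwise execution falls through and the full gemm is encoded as well, and the new `!bias` guard keeps biased matmuls on the gemm path where the epilogue lives.

```cpp
#include <cassert>
#include <string>

// Stand-in predicate; in MLX this is cu::can_use_gemv with transpose flags.
// gemv applies to matrix-vector shapes, i.e. one output dimension is 1.
static bool can_use_gemv(int M, int N, int K) {
  return K > 0 && (M == 1 || N == 1);
}

// Returns which kernel would be encoded, illustrating the early return the
// review comment pointed out was missing.
std::string dispatch(int M, int N, int K, bool bias) {
  if (!bias && can_use_gemv(M, N, K)) {
    return "gemv";  // encode gemv and stop here; do not fall through
  }
  return "gemm";    // general path (with or without the bias epilogue)
}
```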

Comment on lines +157 to +173
if (beta_ == 1 && c.strides(-1) == 1 && c.data_size() == out.shape(-1)) {
  out.set_data(allocator::malloc(out.nbytes()));
  gemm_and_bias(
      encoder,
      M,
      N,
      K,
      a_transposed,
      lda,
      b_transposed,
      ldb,
      out,
      a,
      b,
      c.data<void>());
  return;
Member

Can we simply set up the right cublas matmul, without adding a set_bias, by checking ldc?

Collaborator Author

The epilogue bias requires the input to be a vector of size out.shape(-1), while the normal addmm interface requires a matrix of the same shape as out. So we can't really encapsulate the dispatch code inside CublasGemm, because the shape of c differs, unless we move the code processing c (e.g. copy_gpu(c, out) and collapse_batches(a, b, c)) into the class, which I think would hurt code readability.
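The shape difference can be sketched in plain C++ (an illustrative stand-in, not the cuBLASLt call): the epilogue consumes a single length-N bias vector added to every row of the M x N output, while addmm's c is a full matrix broadcast-compatible with out, which is why the dispatch above only takes the epilogue path when c is effectively a contiguous row vector.

```cpp
#include <cassert>
#include <vector>

// Adds one length-N bias vector to every row of a row-major M x N matrix,
// mirroring what a GEMM bias epilogue computes after the matmul itself.
std::vector<float> add_bias_epilogue(std::vector<float> out, int M, int N,
                                     const std::vector<float>& bias) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
      out[i * N + j] += bias[j];  // same vector reused for every row
  return out;
}
```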

@zcbenz

zcbenz commented Sep 9, 2025

Tested the branch with inference and training; both showed no meaningful difference in performance. For inference that is good news because it means I didn't make mistakes, but it is a bit disappointing that training did not become faster.

@awni (Member) left a comment

Looks good, thanks!

@awni

awni commented Sep 9, 2025

> It is a bit disappointing that training did not become faster.

It's become less common to use a bias in linear layers, which would be the main place we route to an addmm with a bias. I'm not sure which benchmark you ran, but it's possible there are simply no biases in the training run.

@zcbenz

zcbenz commented Sep 9, 2025

> it's possible there are simply no biases in the training run.

Ah, that is the reason: turning on biases makes training a lot slower. I also found some cases that are not redirected to the bias epilogue code; I'll look into them further.

@zcbenz zcbenz merged commit dde3682 into ml-explore:main Sep 9, 2025
7 checks passed
@zcbenz zcbenz deleted the cublas-bias branch September 9, 2025 04:18
zcbenz added a commit that referenced this pull request Sep 9, 2025
faisalmemon pushed a commit to faisalmemon/mlx that referenced this pull request Oct 30, 2025
