[CUDA] Use GEMM with epilogue instead of AddMM #2569
Conversation
```diff
-  if (cu::can_use_gemv(M, N, K, a_transposed, b_transposed)) {
+  // Use gemv when possible
+  if (!bias && cu::can_use_gemv(M, N, K, a_transposed, b_transposed)) {
```
Am I missing something here, or is the removed return statement a bug?
It looks like it will encode both the gemv and the full gemm right now.
It was a mistake, thanks for noticing it!
```cpp
  if (beta_ == 1 && c.strides(-1) == 1 && c.data_size() == out.shape(-1)) {
    out.set_data(allocator::malloc(out.nbytes()));
    gemm_and_bias(
        encoder,
        M,
        N,
        K,
        a_transposed,
        lda,
        b_transposed,
        ldb,
        out,
        a,
        b,
        c.data<void>());
    return;
```
Can we simply set up the right cublas matmul without adding a set_bias by checking ldc?
The epilogue bias requires the input to be a vector of size out.shape(-1), while the normal addmm interface requires a matrix of the same shape as out. So we can't really encapsulate the dispatching code inside CublasGemm because the shape of c is different, unless we move the code processing c (e.g. copy_gpu(c, out) and collapse_batches(a, b, c)) into the class, which I think would decrease code readability.
I tested the branch with inference and training; both showed no meaningful difference in performance. For inference that is good because it means I didn't make mistakes, but it is a bit disappointing that training did not become faster.
It's become less common to use a bias in linear layers, which would be the main place that we route to an addmm with a bias. Not sure which benchmark you ran, but it's possible there are simply no biases in the training run.
Ah, that is the reason; turning on biases would make training a lot slower. I also found that there are some cases that are not redirected to the bias epilogue code, so I'll take a further look.
Invoke cublasLt with the bias epilogue when possible, which is about 2x faster than the general addmm kernel. Also, because the CUBLASLT_ORDER_ROW option does not work with the bias epilogue, the CublasGemm class is changed to use a slightly counter-intuitive way to do matmul in row-major layout.