
Multiply with inverse scale in loop instead of dividing to improve perf #12


Merged
merged 1 commit on Jul 6, 2025

Conversation

danielvegamyhre

@danielvegamyhre danielvegamyhre commented Jul 6, 2025

Stacked PRs:


Multiply with inverse scale in loop instead of dividing to improve perf

  • NCU showed the kernel was now compute bound, which was surprising, since this should be a memory-bandwidth-bound kernel.
  • Debugging, I learned that calling __fdiv_rn 32 times in an unrolled loop is much slower than doing the division once to compute the inverse scale, then multiplying data * inverse_scale 32 times inside the loop.
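The transformation can be sketched in plain Python (function names are hypothetical; the actual kernel is CUDA with an unrolled 32-iteration loop using __fdiv_rn):

```python
BLOCK_SIZE = 32

def scale_block_slow(data, scale):
    # Before: one division per element -- 32 divisions per block.
    return [x / scale for x in data[:BLOCK_SIZE]]

def scale_block_fast(data, scale):
    # After: one division up front, then 32 cheap multiplies.
    inv_scale = 1.0 / scale
    return [x * inv_scale for x in data[:BLOCK_SIZE]]
```

Note that x * (1.0 / scale) is not always bit-identical to a correctly rounded division; for power-of-two scales (as MX formats use), the reciprocal is exact and the results match.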

Test plan

  • pytest test_kernels.py -k cuda_mx -s

Performance

  • Memory bandwidth utilization improved by ~47% (3469 → 5114 gbps) with this change, a ~1.47x runtime speedup (235us → 159us).

Before change:

[[email protected] ~/private-torchao/benchmarks/mx_formats (3631dea7)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_cuda
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_cuda
time_us 234.592005610466
mem_bw_gbps 3468.5537296233347

After change:

[[email protected] ~/private-torchao/benchmarks/mx_formats (work)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_cuda
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_cuda
time_us 159.10400450229645
mem_bw_gbps 5114.23316179484
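The ~47% figure follows directly from the logged numbers above:

```python
# Numbers copied from the before/after benchmark logs.
before_us, before_gbps = 234.592005610466, 3468.5537296233347
after_us, after_gbps = 159.10400450229645, 5114.23316179484

bw_gain = after_gbps / before_gbps - 1  # ~0.47 -> ~47% higher bandwidth
speedup = before_us / after_us          # ~1.47x faster
print(f"bandwidth: +{bw_gain:.1%}, speedup: {speedup:.2f}x")
```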

Benchmarking against Triton

With this change, the CUDA dim1 cast achieves ~23% faster runtime and ~23% higher mem bw utilization than the Triton kernel.

CUDA: 5114 gbps, 159us

[[email protected] ~/private-torchao/benchmarks/mx_formats (work)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_cuda
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_cuda
time_us 159.10400450229645
mem_bw_gbps 5114.23316179484

Triton: 4137 gbps, 197us

[[email protected] ~/private-torchao/benchmarks/mx_formats (work)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_mx_triton
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_mx_triton
time_us 196.6720074415207
mem_bw_gbps 4137.31972630598
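The CUDA-vs-Triton ratio likewise falls out of the two logs (it comes to ~1.24x on both metrics, consistent with the ~23% claim):

```python
# Numbers copied from the CUDA and Triton benchmark logs.
cuda_us, cuda_gbps = 159.10400450229645, 5114.23316179484
triton_us, triton_gbps = 196.6720074415207, 4137.31972630598

runtime_ratio = triton_us / cuda_us      # ~1.24x faster runtime
bw_ratio = cuda_gbps / triton_gbps       # ~1.24x higher bandwidth
print(f"runtime: {runtime_ratio:.3f}x, bandwidth: {bw_ratio:.3f}x")
```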

stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre

fyi @drisspg @vkuzo dim1 cast perf is looking good now!

@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/5 to main July 6, 2025 20:11
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from 555b024 to fed7987 Compare July 6, 2025 20:11
@danielvegamyhre danielvegamyhre changed the base branch from main to danielvegamyhre/stack/5 July 6, 2025 20:12
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/5 to main July 6, 2025 21:39
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from fed7987 to 7aa8fc0 Compare July 6, 2025 21:39
@danielvegamyhre danielvegamyhre changed the base branch from main to danielvegamyhre/stack/5 July 6, 2025 21:39
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 3631dea to 2039e19 Compare July 6, 2025 21:41
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from 7aa8fc0 to f797f4d Compare July 6, 2025 21:41
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 2039e19 to 28a31b6 Compare July 6, 2025 21:42
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from f797f4d to 52b2b0e Compare July 6, 2025 21:42
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 28a31b6 to 2c4677c Compare July 6, 2025 21:43
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from 52b2b0e to 503ec5d Compare July 6, 2025 21:43
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 2c4677c to c40ace9 Compare July 6, 2025 21:43
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch 2 times, most recently from bd8511c to 6a9276b Compare July 6, 2025 21:44
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/5 to main July 6, 2025 21:44
@danielvegamyhre danielvegamyhre merged commit f0413c7 into main Jul 6, 2025
0 of 16 checks passed