
Multiply with inverse scale in loop instead of dividing to improve perf #12


Merged
merged 1 commit on Jul 6, 2025

Conversation

danielvegamyhre

@danielvegamyhre danielvegamyhre commented Jul 6, 2025

Stacked PRs:


Multiply with inverse scale in loop instead of dividing to improve perf

  • NCU showed the kernel was now compute bound, which was surprising, since this should be a memory-bandwidth-bound kernel.
  • Debugging, I learned that calling __fdiv_rn 32 times in an unrolled loop is much slower than doing the division once to compute the inverse scale, then multiplying data * inverse_scale 32 times inside the loop.
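The transformation can be sketched in plain Python (function names are hypothetical; the actual kernel is CUDA with an unrolled 32-iteration loop using __fdiv_rn):

```python
BLOCK_SIZE = 32

def scale_block_slow(data, scale):
    # Before: one division per element -- 32 divisions per block.
    return [x / scale for x in data[:BLOCK_SIZE]]

def scale_block_fast(data, scale):
    # After: one division up front, then 32 cheap multiplies.
    inv_scale = 1.0 / scale
    return [x * inv_scale for x in data[:BLOCK_SIZE]]
```

Note that x * (1.0 / scale) is not always bit-identical to a correctly rounded division; for power-of-two scales (as MX formats use), the reciprocal is exact and the results match.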

Test plan

  • pytest test_kernels.py -k cuda_mx -s

Performance

  • Memory bandwidth utilization improved by ~47% (3469 → 5114 gbps) with this change, a ~1.47x runtime speedup (235us → 159us).

Before change:

[[email protected] ~/private-torchao/benchmarks/mx_formats (3631dea7)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_cuda
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_cuda
time_us 234.592005610466
mem_bw_gbps 3468.5537296233347

After change:

[[email protected] ~/private-torchao/benchmarks/mx_formats (work)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_cuda
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_cuda
time_us 159.10400450229645
mem_bw_gbps 5114.23316179484
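The ~47% figure follows directly from the logged numbers above:

```python
# Numbers copied from the before/after benchmark logs.
before_us, before_gbps = 234.592005610466, 3468.5537296233347
after_us, after_gbps = 159.10400450229645, 5114.23316179484

bw_gain = after_gbps / before_gbps - 1  # ~0.47 -> ~47% higher bandwidth
speedup = before_us / after_us          # ~1.47x faster
print(f"bandwidth: +{bw_gain:.1%}, speedup: {speedup:.2f}x")
```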

Benchmarking against Triton

With this change, the CUDA dim1 cast achieves ~23% faster runtime and ~23% higher mem bw utilization than the Triton kernel.

CUDA: 5114 gbps, 159us

[[email protected] ~/private-torchao/benchmarks/mx_formats (work)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_cuda
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_cuda
time_us 159.10400450229645
mem_bw_gbps 5114.23316179484

Triton: 4137 gbps, 197us

[[email protected] ~/private-torchao/benchmarks/mx_formats (work)]$ CUDA_VISIBLE_DEVICES=2 python cast_bench.py --mode dim1_mx_triton
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.9.0.dev20250702+cu128
triton version: 3.3.1
mode: dim1_mx_triton
time_us 196.6720074415207
mem_bw_gbps 4137.31972630598
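The CUDA-vs-Triton ratio likewise falls out of the two logs (it comes to ~1.24x on both metrics, consistent with the ~23% claim):

```python
# Numbers copied from the CUDA and Triton benchmark logs.
cuda_us, cuda_gbps = 159.10400450229645, 5114.23316179484
triton_us, triton_gbps = 196.6720074415207, 4137.31972630598

runtime_ratio = triton_us / cuda_us      # ~1.24x faster runtime
bw_ratio = cuda_gbps / triton_gbps       # ~1.24x higher bandwidth
print(f"runtime: {runtime_ratio:.3f}x, bandwidth: {bw_ratio:.3f}x")
```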

stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre

fyi @drisspg @vkuzo dim1 cast perf is looking good now!

@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/5 to main July 6, 2025 20:11
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from 555b024 to fed7987 Compare July 6, 2025 20:11
@danielvegamyhre danielvegamyhre changed the base branch from main to danielvegamyhre/stack/5 July 6, 2025 20:12
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/5 to main July 6, 2025 21:39
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from fed7987 to 7aa8fc0 Compare July 6, 2025 21:39
@danielvegamyhre danielvegamyhre changed the base branch from main to danielvegamyhre/stack/5 July 6, 2025 21:39
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 3631dea to 2039e19 Compare July 6, 2025 21:41
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from 7aa8fc0 to f797f4d Compare July 6, 2025 21:41
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 2039e19 to 28a31b6 Compare July 6, 2025 21:42
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from f797f4d to 52b2b0e Compare July 6, 2025 21:42
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 28a31b6 to 2c4677c Compare July 6, 2025 21:43
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch from 52b2b0e to 503ec5d Compare July 6, 2025 21:43
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/5 branch from 2c4677c to c40ace9 Compare July 6, 2025 21:43
danielvegamyhre added a commit that referenced this pull request Jul 6, 2025
stack-info: PR: #12, branch: danielvegamyhre/stack/6
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/6 branch 2 times, most recently from bd8511c to 6a9276b Compare July 6, 2025 21:44
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/5 to main July 6, 2025 21:44
@danielvegamyhre danielvegamyhre merged commit f0413c7 into main Jul 6, 2025
0 of 16 checks passed