Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 #694

seagater · 2025-11-21T18:06:27Z

Tune nThreadsPerBlock for message sizes in the 32KB to 256KB range for the FP8 and Half datatypes on MI300.

Before tuning:

| Msg Size | FP8 Counts | fp8e4m3 Out-of-place | fp8e4m3 In-place | fp8e4m3 Improvement | fp8e5m2 Out-of-place | fp8e5m2 In-place | fp8e5m2 Improvement | half Counts | half Out-of-place | half In-place |
|---|---|---|---|---|---|---|---|---|---|---|
| 1K | 1024 | 5.38 | 5.52 | -0.20% | 5.33 | 5.4 | 0.70% | 512 | 5.26 | 5.28 |
| 2K | 2048 | 5.44 | 5.51 | 2.50% | 5.42 | 5.51 | 2.90% | 1024 | 5.37 | 5.44 |
| 4K | 4096 | 5.54 | 5.6 | 6.40% | 5.54 | 5.62 | 6.40% | 2048 | 5.58 | 5.55 |
| 8K | 8192 | 5.95 | 6.08 | 7.60% | 5.95 | 6.07 | 7.60% | 4096 | 5.92 | 6 |
| 16K | 16384 | 6.5 | 6.56 | 25.90% | 6.48 | 6.57 | 26.10% | 8192 | 6.44 | 6.48 |
| 32K | 32768 | 9 | 9.1 | 1.70% | 8.96 | 9.03 | 2.20% | 16384 | 8.77 | 8.87 |
| 64K | 65536 | 9.35 | 9.45 | 4.50% | 9.32 | 9.43 | 4.80% | 32768 | 9.16 | 9.24 |
| 128K | 131072 | 11.72 | 11.89 | 3.10% | 11.73 | 11.89 | 3.00% | 65536 | 9.79 | 10.01 |
| 256K | 262144 | 12.37 | 12.51 | 10.10% | 12.34 | 12.51 | 10.30% | 131072 | 12.09 | 12.28 |
| 512K | 524288 | 13.96 | 14.04 | 27.20% | 13.99 | 14.07 | 27.00% | 262144 | 13.76 | 13.86 |
| 1M | 1048576 | 19.13 | 19.34 | 20.70% | 19.14 | 19.34 | 20.60% | 524288 | 19.17 | 19.33 |
| 2M | 2097152 | 24.55 | 24.55 | 32.90% | 24.49 | 24.55 | 33.10% | 1048576 | 24.11 | 24.13 |
| 4M | 4194304 | 37.25 | 37.3 | 38.70% | 37.25 | 37.23 | 38.70% | 2097152 | 36.58 | 36.45 |
| 8M | 8388608 | 61.36 | 61.75 | 43.00% | 61.32 | 61.69 | 43.10% | 4194304 | 60.72 | 61.02 |
| 16M | 16777216 | 109.3 | 109.5 | 44.70% | 109.2 | 109.6 | 44.70% | 8388608 | 107.7 | 108.2 |
| 32M | 33554432 | 200.7 | 201.6 | 47.70% | 200.8 | 201.6 | 47.70% | 16777216 | 197.6 | 198.3 |
| 64M | 67108864 | 388.9 | 389.5 | 48.30% | 389.1 | 389.3 | 48.30% | 33554432 | 384 | 384.8 |
| 128M | 134217728 | 763 | 761.9 | | 762.7 | 762.2 | | 67108864 | 752.6 | 752.9 |

After tuning:

| Msg Size | FP8 Counts | fp8e4m3 Out-of-place | fp8e4m3 In-place | fp8e4m3 Improvement | fp8e5m2 Out-of-place | fp8e5m2 In-place | fp8e5m2 Improvement | half Counts | half Out-of-place | half In-place |
|---|---|---|---|---|---|---|---|---|---|---|
| 1K | 1024 | 5.28 | 5.32 | 0.8% | 5.29 | 5.34 | 0.6% | 512 | 5.21 | 5.25 |
| 2K | 2048 | 5.36 | 5.49 | 3.1% | 5.37 | 5.5 | 2.9% | 1024 | 5.32 | 5.43 |
| 4K | 4096 | 5.51 | 5.6 | 6.5% | 5.53 | 5.59 | 6.1% | 2048 | 5.53 | 5.53 |
| 8K | 8192 | 5.9 | 6.03 | 8.2% | 5.92 | 6.03 | 7.9% | 4096 | 5.89 | 5.96 |
| 16K | 16384 | 6.45 | 6.54 | 18.7% | 6.48 | 6.53 | 18.3% | 8192 | 6.43 | 6.46 |
| 32K | 32768 | 8.14 | 8.21 | 5.3% | 8.14 | 8.21 | 5.3% | 16384 | 7.93 | 8.01 |
| 64K | 65536 | 8.83 | 8.91 | 7.2% | 8.83 | 8.95 | 7.2% | 32768 | 8.6 | 8.74 |
| 128K | 131072 | 9.23 | 9.41 | 21.7% | 9.25 | 9.44 | 21.5% | 65536 | 9.52 | 9.71 |
| 256K | 262144 | 10.32 | 10.62 | 24.8% | 10.33 | 10.6 | 24.8% | 131072 | 11.79 | 12.25 |
| 512K | 524288 | 13.93 | 14 | 27.1% | 13.97 | 14.05 | 26.9% | 262144 | 13.73 | 13.9 |
| 1M | 1048576 | 19.12 | 19.32 | 20.7% | 19.13 | 19.31 | 20.6% | 524288 | 19.12 | 19.29 |
| 2M | 2097152 | 24.51 | 24.53 | 32.8% | 24.51 | 24.6 | 32.8% | 1048576 | 24.1 | 24.03 |
| 4M | 4194304 | 37.07 | 37.23 | 38.8% | 37.21 | 37.23 | 38.6% | 2097152 | 36.49 | 36.51 |
| 8M | 8388608 | 61.36 | 61.63 | 43.0% | 61.48 | 61.74 | 42.9% | 4194304 | 60.58 | 60.99 |
| 16M | 16777216 | 109.2 | 109.7 | 44.8% | 109 | 109.5 | 44.9% | 8388608 | 107.7 | 108.3 |
| 32M | 33554432 | 200.4 | 201.4 | 47.8% | 200.9 | 201.8 | 47.7% | 16777216 | 197.7 | 198.1 |
| 64M | 67108864 | 388.4 | 388.6 | 48.4% | 388.9 | 388.9 | 48.3% | 33554432 | 384.2 | 384.7 |
| 128M | 134217728 | 761.5 | 761.6 | | 762.9 | 763 | | 67108864 | 752.6 | 753.3 |
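
The PR does not say how the Improvement column is derived. Recomputing it from the out-of-place columns suggests it is the FP8 time measured against the half time at the same element count (for example, FP8 at 1M against half at the 2M row, both 1048576 elements). The sketch below reproduces one such value and also computes the before/after change for fp8e4m3 at 128K and 256K. The helper names `improvementPct` and `printExamples` are hypothetical, and this reading of the column is an inference from the numbers, not a statement from the author.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Percentage by which `candidate` beats `baseline` (positive means faster).
// Hypothetical helper, not part of the PR.
inline double improvementPct(double baseline, double candidate) {
  return (baseline - candidate) / baseline * 100.0;
}

// Recomputes a few values from the out-of-place columns of the tables.
inline void printExamples() {
  // FP8 vs half at the same element count (1048576 elements, before tuning):
  // half 2M row = 24.11, fp8e4m3 1M row = 19.13. The table lists 20.70%.
  const double fp8VsHalf = improvementPct(24.11, 19.13);
  assert(std::fabs(fp8VsHalf - 20.7) < 0.1);
  std::printf("fp8e4m3 vs half, 1M elements: %.1f%%\n", fp8VsHalf);

  // Before vs after tuning for fp8e4m3, out-of-place.
  std::printf("fp8e4m3 128K, tuning gain: %.1f%%\n", improvementPct(11.72, 9.23));
  std::printf("fp8e4m3 256K, tuning gain: %.1f%%\n", improvementPct(12.37, 10.32));
}
```

Calling `printExamples()` from a `main` function prints the three percentages. Because the column compares datatypes, the direct effect of the tuning is easier to read by comparing the same cell across the two tables.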


Copilot

Pull request overview

This PR optimizes GPU kernel performance for AllReduce operations on AMD MI300 by tuning the nThreadsPerBlock parameter for FP8 (both e4m3 and e5m2 variants) and Half datatypes in the 32KB-256KB message size range. Comparing the before and after tables, FP8 out-of-place times drop by about 21% at 128K, 16-17% at 256K, 9-10% at 32K, and 5-6% at 64K. The 21-25% and 7% figures in the Improvement column at 128K-256K and 64K match the gap between FP8 and Half at the same element count rather than the before/after change.

Key changes:

- Added AMD HIP platform-specific tuning for the Half datatype with reduced thread counts (64 for 32KB, 128 for 64-256KB)
- Added FP8-specific tuning with progressively larger thread counts (64 for 32KB, 128 for 64KB, 256 for 128-256KB)
- Used nested preprocessor directives to ensure platform and type compatibility
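
The selection described in the key changes can be sketched roughly as follows. This is not the code from apps/nccl/src/allreduce.cu: `DataKind`, `pickThreadsPerBlock`, `spotCheck`, and the `SKETCH_*` macros are hypothetical names, the fallback is passed in because the overview does not state the default thread count, and the handling of sizes between the listed points (half-open ranges, with 256KB included) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins so the sketch compiles on any host. A real build would rely on
// the platform macro (__HIP_PLATFORM_AMD__) and the project's own FP8 guard.
#define SKETCH_FORCE_AMD 1
#define SKETCH_HAS_FP8 1

// Hypothetical datatype tag, for illustration only.
enum class DataKind { Half, Fp8, Other };

constexpr std::size_t KiB = 1024;

// Returns the tuned thread count from the overview, or `fallback` when the
// platform, datatype, or message size is outside the tuned range.
inline int pickThreadsPerBlock(DataKind kind, std::size_t bytes, int fallback) {
#if defined(__HIP_PLATFORM_AMD__) || defined(SKETCH_FORCE_AMD)
  if (bytes >= 32 * KiB && bytes <= 256 * KiB) {
    if (kind == DataKind::Half) {
      // Half: 64 threads at 32KB, 128 threads from 64KB to 256KB.
      return bytes < 64 * KiB ? 64 : 128;
    }
#if defined(SKETCH_HAS_FP8)
    if (kind == DataKind::Fp8) {
      // FP8: 64 at 32KB, 128 at 64KB, 256 from 128KB to 256KB.
      if (bytes < 64 * KiB) {
        return 64;
      } else if (bytes < 128 * KiB) {
        return 128;
      } else {
        return 256;
      }
    }
#endif  // FP8 guard
  }
#endif  // AMD platform guard
  return fallback;
}

// Spot checks against the values listed in the overview.
inline void spotCheck() {
  assert(pickThreadsPerBlock(DataKind::Half, 32 * KiB, 512) == 64);
  assert(pickThreadsPerBlock(DataKind::Fp8, 128 * KiB, 512) == 256);
  assert(pickThreadsPerBlock(DataKind::Fp8, 1024 * KiB, 512) == 512);
}
```

Nesting the FP8 guard inside the platform guard keeps the FP8 branch out of builds that lack FP8 types, while the Half branch stays available on every AMD build. The value 512 in `spotCheck` is an arbitrary fallback chosen for the checks.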


apps/nccl/src/allreduce.cu

Binyang2014

LGTM

seagater added 3 commits November 20, 2025 06:42

Tune the nThreadsPerBlock for 32KB to 256KB with FP8 and Half datatype

c409d0a

Improve performance of 32KB and 64KB further for FP8

ecefa62

Merge branch 'main' of https://github.com/microsoft/mscclpp into qing…

afc8801

…huazhou/fp8_performance_tuning

seagater requested review from Binyang2014, chhwang and Copilot and removed request for chhwang and Copilot November 21, 2025 18:07

Copilot started reviewing on behalf of seagater November 21, 2025 18:07

Copilot finished reviewing on behalf of seagater November 21, 2025 18:09

Copilot AI reviewed Nov 21, 2025


apps/nccl/src/allreduce.cu (outdated, resolved)

apps/nccl/src/allreduce.cu (resolved)

seagater requested a review from chhwang November 21, 2025 19:13

seagater and others added 2 commits November 21, 2025 23:14

Use else if for condition checks

cceb271

Merge branch 'main' into qinghuazhou/fp8_performance_tuning

90540d1

Binyang2014 approved these changes Dec 1, 2025


mahdiehghazim approved these changes Dec 8, 2025

