Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 #694
base: main
Pull request overview
This PR optimizes GPU kernel performance for AllReduce operations on AMD MI300 by tuning the nThreadsPerBlock parameter for FP8 (both e4m3 and e5m2 variants) and Half datatypes in the 32KB-256KB message size range. The tuning achieves significant performance improvements, particularly for FP8 datatypes at 128K-256K sizes (21-25% improvement) and 64K sizes (7% improvement).
Key changes:
- Added AMD HIP platform-specific tuning for Half datatype with reduced thread counts (64 for 32KB, 128 for 64-256KB)
- Added FP8-specific tuning with progressively larger thread counts (64 for 32KB, 128 for 64KB, 256 for 128-256KB)
- Used nested preprocessor directives to ensure platform and type compatibility
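The thread counts listed above can be sketched as a small lookup function. This is a minimal illustration, not the code from the PR: the names `DataKind`, `tunedThreadsPerBlock`, and `fallback` are hypothetical, the preprocessor guards for platform and FP8 type support are omitted, and the handling of sizes that fall between the listed points (for example 48KB) is an assumption.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; they are not taken from the PR.
enum class DataKind { Half, Fp8E4M3, Fp8E5M2, Other };

constexpr std::size_t KiB = 1024;

// Returns the tuned thread count described in the overview for message sizes
// in [32KB, 256KB]; any other size or datatype keeps the caller's fallback.
// Assumption: each listed size starts a bucket that runs up to the next one.
inline int tunedThreadsPerBlock(DataKind kind, std::size_t bytes, int fallback) {
  if (bytes < 32 * KiB || bytes > 256 * KiB) return fallback;

  if (kind == DataKind::Half) {
    // Half: 64 threads at 32KB, 128 threads for 64KB through 256KB.
    return bytes < 64 * KiB ? 64 : 128;
  }

  if (kind == DataKind::Fp8E4M3 || kind == DataKind::Fp8E5M2) {
    // FP8: 64 threads at 32KB, 128 at 64KB, 256 for 128KB through 256KB.
    if (bytes < 64 * KiB) return 64;
    if (bytes < 128 * KiB) return 128;
    return 256;
  }

  // Other datatypes are not covered by this tuning.
  return fallback;
}
```

A caller would pass its existing default as `fallback`, so `tunedThreadsPerBlock(DataKind::Fp8E4M3, 128 * KiB, 512)` yields 256 while a 1MB message keeps 512. In the actual change, this selection is compiled only for the AMD HIP platform and only when FP8 types are available.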
Binyang2014 left a comment
LGTM
Tune nThreadsPerBlock for message sizes in the 32KB to 256KB range for the FP8 and Half datatypes on MI300.
Before tuning:
After tuning: