[MLAS] Enable FP16 for Gelu #26815
There is a very common error that looks like this:
> -- Hash mismatch, removing...
> -- Using src='https://gitlab.com/libeigen/eigen/-/archive/e7248b26a1ed53fa030c5c459f7ea095dfd276ac/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip'
> -- verifying file...
> file='/home/nikhil/KONARK/onnxruntime/build/Linux/Release/_deps/eigen-subbuild/eigen-populate-prefix/src/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip'
> -- SHA1 hash of /home/nikhil/KONARK/onnxruntime/build/Linux/Release/_deps/eigen-subbuild/eigen-populate-prefix/src/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip does not match expected value
> expected: 'be8be39fdbc6e60e94fa7870b280707069b5b81a'
> actual: '32b145f525a8308d7ab1c09388b2e288312d8eba'
> -- Hash mismatch, removing...
> CMake Error at eigen-subbuild/eigen-populate-prefix/src/eigen-populate-stamp/download-eigen-populate.cmake:170 (message):
> Each download failed!
>
> gmake[2]: *** [CMakeFiles/eigen-populate.dir/build.make:100: eigen-populate-prefix/src/eigen-populate-stamp/eigen-populate-download] Error 1
> gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/eigen-populate.dir/all] Error 2
> gmake: *** [Makefile:91: all] Error 2
>
> CMake Error at /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1679 (message):
> Build step for eigen failed: 2
> Call Stack (most recent call first):
> /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1819:EVAL:2 (__FetchContent_directPopulate)
> /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1819 (cmake_language)
> external/eigen.cmake:19 (FetchContent_Populate)
> external/onnxruntime_external_deps.cmake:546 (include)
> CMakeLists.txt:694 (include)

This error can be resolved by updating this branch to the latest main from the open source repository.
Got an error related to external dependencies:
`
CMake Error: install(EXPORT "onnxruntimeTargets" ...) includes target "onnxruntime_mlas" which requires target "kleidiai" that is not in any export set.
CMake Error: install(EXPORT "onnxruntimeTargets" ...) includes target "onnxruntime" which requires target "kleidiai" that is not in any export set.
-- Generating done (0.6s)
CMake Generate step failed. Build files cannot be regenerated correctly.
Traceback (most recent call last):
File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 2630, in <module>
sys.exit(main())
File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 2497, in main
generate_build_tree(
File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 1290, in generate_build_tree
run_subprocess(
File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 147, in run_subprocess
return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
File "/home/nikhil/KONARK/onnxruntime/tools/python/util/run.py", line 50, in run
completed_process = subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
`
Pull request overview
This PR enables FP16 (half-precision floating-point) support for the GELU (Gaussian Error Linear Unit) activation operator in ONNX Runtime opset 20. The implementation provides optimized compute paths using ARM SVE (Scalable Vector Extension) and NEON intrinsics for both tanh and erf approximation methods, with fallback to scalar FP32 computation when vector intrinsics are not available.
Key changes:
- Adds FP16 kernel registration for GELU operator alongside the existing FP32 implementation
- Implements optimized FP16 ERF and TANH kernels using ARM SVE and NEON intrinsics
- Adds comprehensive test coverage for both tanh and erf approximation modes with FP16 inputs
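For reference, the two approximation modes mentioned above can be sketched as a plain scalar FP32 routine. This is only an illustrative sketch, not the code in this PR: the names `GeluErfRef`, `GeluTanhRef`, and `GeluRef` are hypothetical, and the formulas are the standard GELU definitions (exact erf form and the tanh approximation). An FP16 fallback would presumably convert each element to FP32, apply one of these, and convert back.

```cpp
#include <cmath>
#include <cstddef>
#include <string>

// Exact form: 0.5 * x * (1 + erf(x / sqrt(2))).
inline float GeluErfRef(float x) {
  constexpr float kInvSqrt2 = 0.70710678f;
  return 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
}

// Tanh form: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
inline float GeluTanhRef(float x) {
  constexpr float kSqrt2OverPi = 0.7978845608f;
  constexpr float kCubicCoeff = 0.044715f;
  return 0.5f * x * (1.0f + std::tanh(kSqrt2OverPi * (x + kCubicCoeff * x * x * x)));
}

// Hypothetical helper: applies the selected form element-wise.
// `approximate` is assumed to be "tanh" for the tanh form; any other
// value selects the erf form.
inline void GeluRef(const float* input, float* output, std::size_t n,
                    const std::string& approximate) {
  const bool use_tanh = (approximate == "tanh");
  for (std::size_t i = 0; i < n; ++i) {
    output[i] = use_tanh ? GeluTanhRef(input[i]) : GeluErfRef(input[i]);
  }
}
```

A scalar reference of this kind is also a convenient oracle for tests: a vectorized kernel can be compared against it element by element with an FP16-appropriate tolerance.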
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 27 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/cpu_execution_provider.cc | Registers typed GELU kernels for float and MLFloat16 types |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Implements FP16 GELU computation with SVE/NEON optimizations and scalar fallback |
| onnxruntime/core/providers/cpu/math/element_wise_ops.cc | Adds FP16 ERF operator support using new SVE/NEON kernels |
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds FP16 GELU tests for both tanh and erf approximations |
| onnxruntime/core/mlas/lib/tanh.cpp | Adds SVE path for FP16 tanh computation |
| onnxruntime/core/mlas/lib/sve/mlasi_sve.h | Declares SVE FP16 function signatures |
| onnxruntime/core/mlas/lib/sve/mlas_sve_fp16.h | Adds SVE FP16 intrinsic wrapper functions |
| onnxruntime/core/mlas/lib/sve/Elementwise_sve_fp16.cpp | Implements SVE FP16 tanh, erf, and GELU kernels |
| onnxruntime/core/mlas/lib/fp16_common.h | Adds NEON FP16 helper functions for erf computation |
| onnxruntime/core/mlas/lib/erf.cpp | Implements NEON FP16 erf kernel |
| onnxruntime/core/mlas/inc/mlas.h | Exports NEON FP16 erf kernel function |
| cmake/onnxruntime_providers_cpu.cmake | Adds ARM FP16 compile flags for gelu.cc and includes MLAS headers |
| cmake/onnxruntime_mlas.cmake | Adds SVE FP16 elementwise source and compile flags for erf.cpp |
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Separate platform-dependent code
ca56982 to cc2625d
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
list(APPEND mlas_platform_srcs ${MLAS_SRC_DIR}/sve/elementwise_sve.cpp)
list(APPEND mlas_platform_srcs ${MLAS_SRC_DIR}/sve/Elementwise_sve_fp16.cpp)
set_source_files_properties(${MLAS_SRC_DIR}/sve/elementwise_sve.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+sve+fp16 ")
set_source_files_properties(${MLAS_SRC_DIR}/sve/Elementwise_sve_fp16.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+sve+fp16 ")
Nit: Rename to elementwise_sve_fp16.cpp? (The casing is inconsistent with elementwise_sve.cpp.)
#include "core/util/math.h"
#include "core/mlas/inc/mlas.h"

#if defined(MLAS_NEON_INTRINSICS)
Isn't this macro defined in mlasi.h (which is not included here)?
There seems to be something wrong with the design here. I think MLAS internal implementation macros are leaking into the CPU EP files. It may need a re-think.
int64_t i = 0;

if (algo == "tanh") {
I think all this logic needs to live in MLAS and not in a CPU EP file. Is there a limitation that necessitated doing things this way?
void
MLASCALL
MlasSveErfKernelFp16(
It would be nice to stick to one naming convention. Some APIs are named MlasSve... and others ..._SVE, which is confusing. Some start with Mlas... and another one starts with Compute...
target_include_directories(onnxruntime_providers PRIVATE ${ONNXRUNTIME_ROOT})
if(onnxruntime_target_platform STREQUAL "aarch64" OR onnxruntime_target_platform STREQUAL "ARM64" OR onnxruntime_target_platform STREQUAL "arm64")
set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/tensor/gelu.cc" PROPERTIES COMPILE_FLAGS -march=armv8.2-a+fp16)
This may be a duplicate comment. I think this flag is needed only because some CPU EP files now use intrinsics directly. The hardware-accelerated routines that use intrinsics should live in MLAS and only be called from the CPU EP files.
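One possible shape for that separation, sketched under assumptions: every name below (`GeluKernelFn`, `Platform`, `GetPlatform`, `ComputeGelu`, `GeluErfScalar`) is hypothetical and is not an existing MLAS API, and the sketch uses `float` instead of an FP16 type to stay portable. The idea is that the library owns the kernel table and CPU feature checks, while the caller (here, the CPU EP) sees a single entry point and needs no `-march` flags or internal macros.

```cpp
#include <cmath>
#include <cstddef>

namespace sketch {

// Signature shared by every kernel variant (hypothetical).
using GeluKernelFn = void (*)(const float* input, float* output, std::size_t n);

// Portable scalar fallback using the erf form of GELU.
inline void GeluErfScalar(const float* input, float* output, std::size_t n) {
  constexpr float kInvSqrt2 = 0.70710678f;
  for (std::size_t i = 0; i < n; ++i) {
    output[i] = 0.5f * input[i] * (1.0f + std::erf(input[i] * kInvSqrt2));
  }
}

// Kernel table owned by the library. A real build would overwrite the
// entry in the constructor with an SVE or NEON kernel after checking CPU
// features; that part is omitted here, so the scalar fallback is always used.
struct Platform {
  GeluKernelFn gelu_erf = &GeluErfScalar;
  Platform() {}
};

inline const Platform& GetPlatform() {
  static const Platform platform;  // initialized once, on first use
  return platform;
}

// Single public entry point: the only symbol the caller needs.
inline void ComputeGelu(const float* input, float* output, std::size_t n) {
  GetPlatform().gelu_erf(input, output, n);
}

}  // namespace sketch
```

With this layout, the operator code would reduce to a call such as `sketch::ComputeGelu(in, out, count)`, and the per-file compile flags stay confined to the library's own kernel sources.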
Enabled FP16 Gelu for opset 20. Gelu uses the tanh or erf function depending on the approximation method selected. Implemented tanh in SVE, and erf in both SVE and NEON.
Gr3E results: with tanh and erf approximation:
Gr4 results: with tanh and erf approximation:
This PR is a joint contribution by:
Aruna K (@akote123)
Abhishek Jain (@abhijain1204fujitsu)