[MLAS] Enable FP16 for Gelu #26815

akote123 · 2025-12-17T04:57:22Z

Enabled FP16 Gelu for opset 20. Gelu uses the tanh or erf function depending on the approximation method selected. Implemented tanh in SVE, and erf in both SVE and NEON.
Gr3E results with the tanh and erf approximations:

GELU(ms)	Tanh_SVE	ERF_SVE	Tanh_NEON	ERF_NEON
Shape	F32	F16	F32	F16
100	0.007	0.007	0.007	0.007
1000	0.008	0.007	0.012	0.008
1000000	0.076	0.039	0.203	0.07

Gr4 results with the tanh and erf approximations:

GELU(ms)	Tanh_SVE	ERF_SVE	Tanh_NEON	ERF_NEON
Shape	F32	F16	F32	F16
100	0.005	0.005	0.005	0.005
1000	0.006	0.006	0.008	0.006
1000000	0.092	0.046	0.224	0.088

This PR is a joint contribution by:
Aruna K (@akote123)
Abhishek Jain (@abhijain1204fujitsu)

A very common error looks like this:

> -- Hash mismatch, removing...
> -- Using src='https://gitlab.com/libeigen/eigen/-/archive/e7248b26a1ed53fa030c5c459f7ea095dfd276ac/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip'
> -- verifying file...
> file='/home/nikhil/KONARK/onnxruntime/build/Linux/Release/_deps/eigen-subbuild/eigen-populate-prefix/src/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip'
> -- SHA1 hash of
> /home/nikhil/KONARK/onnxruntime/build/Linux/Release/_deps/eigen-subbuild/eigen-populate-prefix/src/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip
> does not match expected value
> expected: 'be8be39fdbc6e60e94fa7870b280707069b5b81a'
> actual: '32b145f525a8308d7ab1c09388b2e288312d8eba'
> -- Hash mismatch, removing...
> CMake Error at eigen-subbuild/eigen-populate-prefix/src/eigen-populate-stamp/download-eigen-populate.cmake:170 (message):
> Each download failed!
>
> gmake[2]: *** [CMakeFiles/eigen-populate.dir/build.make:100: eigen-populate-prefix/src/eigen-populate-stamp/eigen-populate-download] Error 1
> gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/eigen-populate.dir/all] Error 2
> gmake: *** [Makefile:91: all] Error 2
>
> CMake Error at /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1679 (message):
> Build step for eigen failed: 2
> Call Stack (most recent call first):
> /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1819:EVAL:2 (__FetchContent_directPopulate)
> /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1819 (cmake_language)
> external/eigen.cmake:19 (FetchContent_Populate)
> external/onnxruntime_external_deps.cmake:546 (include)
> CMakeLists.txt:694 (include)

This error can be resolved by updating main to the latest from the open-source repository.

Got an error related to external dependencies:

> CMake Error: install(EXPORT "onnxruntimeTargets" ...) includes target "onnxruntime_mlas" which requires target "kleidiai" that is not in any export set.
> CMake Error: install(EXPORT "onnxruntimeTargets" ...) includes target "onnxruntime" which requires target "kleidiai" that is not in any export set.
> -- Generating done (0.6s)
> CMake Generate step failed. Build files cannot be regenerated correctly.
> Traceback (most recent call last):
> File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 2630, in <module>
> sys.exit(main())
> File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 2497, in main
> generate_build_tree(
> File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 1290, in generate_build_tree
> run_subprocess(
> File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 147, in run_subprocess
> return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
> File "/home/nikhil/KONARK/onnxruntime/tools/python/util/run.py", line 50, in run
> completed_process = subprocess.run(
> File "/usr/lib/python3.10/subprocess.py", line 526, in run
> raise CalledProcessError(retcode, process.args,

Copilot

Pull request overview

This PR enables FP16 (half-precision floating-point) support for the GELU (Gaussian Error Linear Unit) activation operator in ONNX Runtime opset 20. The implementation provides optimized compute paths using ARM SVE (Scalable Vector Extension) and NEON intrinsics for both tanh and erf approximation methods, with fallback to scalar FP32 computation when vector intrinsics are not available.

Key changes:

- Adds FP16 kernel registration for the GELU operator alongside the existing FP32 implementation
- Implements optimized FP16 ERF and TANH kernels using ARM SVE and NEON intrinsics
- Adds comprehensive test coverage for both tanh and erf approximation modes with FP16 inputs
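
For reference, the two GELU forms that the overview refers to can be written as a scalar FP32 sketch. This is only an illustration of the standard erf and tanh formulas, not the code in this PR; the names `GeluErfRef`, `GeluTanhRef`, and `GeluRef` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical reference helpers; not part of MLAS or ONNX Runtime.

// Exact form: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
inline float GeluErfRef(float x) {
    constexpr float kInvSqrt2 = 0.70710678118654752f;
    return 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
}

// Tanh form: GELU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
inline float GeluTanhRef(float x) {
    constexpr float kSqrt2OverPi = 0.79788456080286536f;
    constexpr float kCubicCoeff = 0.044715f;
    const float inner = kSqrt2OverPi * (x + kCubicCoeff * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}

// Applies one of the two reference forms element by element.
inline void GeluRef(const float* input, float* output, std::size_t n, bool use_tanh) {
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = use_tanh ? GeluTanhRef(input[i]) : GeluErfRef(input[i]);
    }
}
```

A scalar reference like this is mainly useful as a baseline when checking vectorized FP16 results against a tolerance, since the two forms differ slightly even in FP32.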

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 27 comments.

File	Description
onnxruntime/core/providers/cpu/cpu_execution_provider.cc	Registers typed GELU kernels for float and MLFloat16 types
onnxruntime/core/providers/cpu/tensor/gelu.cc	Implements FP16 GELU computation with SVE/NEON optimizations and scalar fallback
onnxruntime/core/providers/cpu/math/element_wise_ops.cc	Adds FP16 ERF operator support using new SVE/NEON kernels
onnxruntime/test/providers/cpu/activation/activation_op_test.cc	Adds FP16 GELU tests for both tanh and erf approximations
onnxruntime/core/mlas/lib/tanh.cpp	Adds SVE path for FP16 tanh computation
onnxruntime/core/mlas/lib/sve/mlasi_sve.h	Declares SVE FP16 function signatures
onnxruntime/core/mlas/lib/sve/mlas_sve_fp16.h	Adds SVE FP16 intrinsic wrapper functions
onnxruntime/core/mlas/lib/sve/Elementwise_sve_fp16.cpp	Implements SVE FP16 tanh, erf, and GELU kernels
onnxruntime/core/mlas/lib/fp16_common.h	Adds NEON FP16 helper functions for erf computation
onnxruntime/core/mlas/lib/erf.cpp	Implements NEON FP16 erf kernel
onnxruntime/core/mlas/inc/mlas.h	Exports NEON FP16 erf kernel function
cmake/onnxruntime_providers_cpu.cmake	Adds ARM FP16 compile flags for gelu.cc and includes MLAS headers
cmake/onnxruntime_mlas.cmake	Adds SVE FP16 elementwise source and compile flags for erf.cpp

hariharans29 · 2025-12-18T04:28:28Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-12-18T04:28:47Z

Azure Pipelines successfully started running 4 pipeline(s).

Seperate platform dependant code

hariharans29 · 2025-12-18T18:56:29Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-12-18T18:56:49Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-12-19T05:54:31Z

cmake/onnxruntime_mlas.cmake

          list(APPEND mlas_platform_srcs ${MLAS_SRC_DIR}/sve/elementwise_sve.cpp)
+          list(APPEND mlas_platform_srcs ${MLAS_SRC_DIR}/sve/Elementwise_sve_fp16.cpp)
          set_source_files_properties(${MLAS_SRC_DIR}/sve/elementwise_sve.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+sve+fp16 ")
+          set_source_files_properties(${MLAS_SRC_DIR}/sve/Elementwise_sve_fp16.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+sve+fp16 ")


Nit: Rename to elementwise_sve_fp16.cpp? (The casing is inconsistent.)

hariharans29 · 2025-12-19T06:04:25Z

onnxruntime/core/providers/cpu/math/element_wise_ops.cc

 #include "core/util/math.h"
 #include "core/mlas/inc/mlas.h"

+#if defined(MLAS_NEON_INTRINSICS)


Isn't this macro defined in mlasi.h (which is not included here)?
There seems to be something wrong with the design here: MLAS internal implementation macros appear to be leaking into the CPU EP files. It may need a rethink.

hariharans29 · 2025-12-19T06:11:37Z

onnxruntime/core/providers/cpu/tensor/gelu.cc

+
+  int64_t i = 0;
+
+  if (algo == "tanh") {


I think all this logic needs to live in MLAS and not in a CPU EP file. Is there a limitation that necessitated doing things this way?
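
A minimal sketch of the structure being suggested, under the assumption that the library exposes a single entry point and picks the kernel internally. Every name here (`HypoGeluAlgo`, `ParseGeluAlgo`, `hypo_lib::ComputeGelu`, and so on) is hypothetical, and the kernels are scalar FP32 stand-ins rather than the SVE/NEON FP16 kernels from this PR.

```cpp
#include <cmath>
#include <cstddef>
#include <string>

// All names below are hypothetical and exist only for this sketch.
enum class HypoGeluAlgo { kErf, kTanh };

// Caller side: resolve the attribute string once, before any element loop.
inline HypoGeluAlgo ParseGeluAlgo(const std::string& approximate) {
    return approximate == "tanh" ? HypoGeluAlgo::kTanh : HypoGeluAlgo::kErf;
}

namespace hypo_lib {

using GeluKernel = void (*)(const float* in, float* out, std::size_t n);

inline void GeluErfScalar(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = 0.5f * in[i] * (1.0f + std::erf(in[i] * 0.70710678f));
    }
}

inline void GeluTanhScalar(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float x = in[i];
        const float inner = 0.79788456f * (x + 0.044715f * x * x * x);
        out[i] = 0.5f * x * (1.0f + std::tanh(inner));
    }
}

// Library side: the one place that would consult platform macros or CPU
// feature flags. This sketch always returns the scalar kernels.
inline GeluKernel SelectGeluKernel(HypoGeluAlgo algo) {
    return algo == HypoGeluAlgo::kTanh ? GeluTanhScalar : GeluErfScalar;
}

// Single entry point seen by the caller.
inline void ComputeGelu(const float* in, float* out, std::size_t n, HypoGeluAlgo algo) {
    SelectGeluKernel(algo)(in, out, n);
}

}  // namespace hypo_lib
```

The caller then reduces to one call such as `hypo_lib::ComputeGelu(x, y, n, ParseGeluAlgo(approximate))`, with no intrinsics or architecture checks in the operator file.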

hariharans29 · 2025-12-19T06:13:04Z

onnxruntime/core/mlas/lib/sve/mlasi_sve.h

+
+void
+MLASCALL
+MlasSveErfKernelFp16(


It would be nice to stick to one naming convention. Some APIs have MlasSve... and others have ..._SVE, which is confusing. Some have Mlas.... and another has Compute.....

hariharans29 · 2025-12-19T06:15:09Z

cmake/onnxruntime_providers_cpu.cmake


-target_include_directories(onnxruntime_providers PRIVATE ${ONNXRUNTIME_ROOT})
+if(onnxruntime_target_platform STREQUAL "aarch64" OR onnxruntime_target_platform STREQUAL "ARM64" OR onnxruntime_target_platform STREQUAL "arm64")
+set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/tensor/gelu.cc" PROPERTIES COMPILE_FLAGS -march=armv8.2-a+fp16)


This may be a duplicate comment. I think this comes about because some CPU EP files now use intrinsics directly. I feel the hardware-accelerated routines that use intrinsics should live in MLAS and only be called from the CPU EP files.

akote123 and others added 9 commits June 10, 2025 04:34

Merged PR 432: Update fj-develop with main

645c3fa

Merged PR 453: Update FJ-Develop

e69b8e9

Merged PR 465: update fj-develop

ae0da8f

Merged PR 508: Update fj-develop with main

7e94153

Merged PR 613: Rebase fj-develop with main

1bfec70

Merged PR 621: Merge oss main with fj-develop

dc47d84

Merged PR 779: Rebase fj-develop with main

52a38a5

tianleiwu requested a review from Copilot December 17, 2025 05:52

Copilot started reviewing on behalf of tianleiwu December 17, 2025 05:53

Copilot AI reviewed Dec 17, 2025

hariharans29 changed the title ~~Enable FP16 for Gelu~~ [MLAS] Enable FP16 for Gelu Dec 18, 2025

Enable Gelu Fp16

cc2625d

Seperate platform dependant code

abhijain1204fujitsu force-pushed the gelu_fp16 branch from ca56982 to cc2625d December 18, 2025 16:17

hariharans29 reviewed Dec 19, 2025
