@akote123 akote123 commented Dec 17, 2025

Enabled FP16 GELU for opset 20. GELU uses either the tanh or the erf function, depending on the approximation method. Tanh is implemented in SVE, and erf in both SVE and NEON.
Gr3E results with the tanh and erf approximations:

| GELU(ms) | Tanh_SVE | ERF_SVE | Tanh_NEON | ERF_NEON |
|---|---|---|---|---|
| Shape | F32 | F16 | F32 | F16 |
| 100 | 0.007 | 0.007 | 0.007 | 0.007 |
| 1000 | 0.008 | 0.007 | 0.012 | 0.008 |
| 1000000 | 0.076 | 0.039 | 0.203 | 0.07 |

Gr4 results with the tanh and erf approximations:

| GELU(ms) | Tanh_SVE | ERF_SVE | Tanh_NEON | ERF_NEON |
|---|---|---|---|---|
| Shape | F32 | F16 | F32 | F16 |
| 100 | 0.005 | 0.005 | 0.005 | 0.005 |
| 1000 | 0.006 | 0.006 | 0.008 | 0.006 |
| 1000000 | 0.092 | 0.046 | 0.224 | 0.088 |
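
For readers unfamiliar with the two approximation modes named in the description, here is a minimal scalar sketch in C++ of the standard GELU definitions. It is a plain FP32 reference written for illustration, not the SVE or NEON code from this PR; the function names `GeluErfRef` and `GeluTanhRef` are hypothetical.

```cpp
#include <cmath>

// Hypothetical reference helpers; not part of ONNX Runtime or MLAS.

// Erf-based GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
inline float GeluErfRef(float x) {
    const float kInvSqrt2 = 0.70710678118654752f;  // 1 / sqrt(2)
    return 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
}

// Tanh-based GELU:
// 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3))).
inline float GeluTanhRef(float x) {
    const float kSqrt2OverPi = 0.7978845608028654f;  // sqrt(2 / pi)
    const float inner = kSqrt2OverPi * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}
```

The two variants agree closely near zero and differ slightly elsewhere, which is why tests for each mode typically compare against the matching reference rather than against each other.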

This PR is a joint contribution by:
Aruna K (@akote123)
Abhishek Jain (@abhijain1204fujitsu)

akote123 and others added 9 commits June 10, 2025 04:34
There is a very common error that looks like this:

```
-- Hash mismatch, removing...
-- Using src='https://gitlab.com/libeigen/eigen/-/archive/e7248b26a1ed53fa030c5c459f7ea095dfd276ac/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip'
-- verifying file...
       file='/home/nikhil/KONARK/onnxruntime/build/Linux/Release/_deps/eigen-subbuild/eigen-populate-prefix/src/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip'
-- SHA1 hash of
    /home/nikhil/KONARK/onnxruntime/build/Linux/Release/_deps/eigen-subbuild/eigen-populate-prefix/src/eigen-e7248b26a1ed53fa030c5c459f7ea095dfd276ac.zip
  does not match expected value
    expected: 'be8be39fdbc6e60e94fa7870b280707069b5b81a'
      actual: '32b145f525a8308d7ab1c09388b2e288312d8eba'
-- Hash mismatch, removing...
CMake Error at eigen-subbuild/eigen-populate-prefix/src/eigen-populate-stamp/download-eigen-populate.cmake:170 (message):
  Each download failed!

gmake[2]: *** [CMakeFiles/eigen-populate.dir/build.make:100: eigen-populate-prefix/src/eigen-populate-stamp/eigen-populate-download] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/eigen-populate.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

CMake Error at /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1679 (message):
  Build step for eigen failed: 2
Call Stack (most recent call first):
  /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1819:EVAL:2 (__FetchContent_directPopulate)
  /usr/local/share/cmake-3.28/Modules/FetchContent.cmake:1819 (cmake_language)
  external/eigen.cmake:19 (FetchContent_Populate)
  external/onnxruntime_external_deps.cmake:546 (include)
  CMakeLists.txt:694 (include)
```

This error can be resolved by updating main to the latest from the open-source repository.
Got an error related to external dependencies:

```
CMake Error: install(EXPORT "onnxruntimeTargets" ...) includes target "onnxruntime_mlas" which requires target "kleidiai" that is not in any export set.
CMake Error: install(EXPORT "onnxruntimeTargets" ...) includes target "onnxruntime" which requires target "kleidiai" that is not in any export set.
-- Generating done (0.6s)
CMake Generate step failed.  Build files cannot be regenerated correctly.
Traceback (most recent call last):
  File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 2630, in <module>
    sys.exit(main())
  File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 2497, in main
    generate_build_tree(
  File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 1290, in generate_build_tree
    run_subprocess(
  File "/home/nikhil/KONARK/onnxruntime/tools/ci_build/build.py", line 147, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
  File "/home/nikhil/KONARK/onnxruntime/tools/python/util/run.py", line 50, in run
    completed_process = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
```
Contributor

Copilot AI left a comment

Pull request overview

This PR enables FP16 (half-precision floating-point) support for the GELU (Gaussian Error Linear Unit) activation operator in ONNX Runtime opset 20. The implementation provides optimized compute paths using ARM SVE (Scalable Vector Extension) and NEON intrinsics for both tanh and erf approximation methods, with fallback to scalar FP32 computation when vector intrinsics are not available.

Key changes:

  • Adds FP16 kernel registration for GELU operator alongside the existing FP32 implementation
  • Implements optimized FP16 ERF and TANH kernels using ARM SVE and NEON intrinsics
  • Adds comprehensive test coverage for both tanh and erf approximation modes with FP16 inputs

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 27 comments.

| File | Description |
|---|---|
| `onnxruntime/core/providers/cpu/cpu_execution_provider.cc` | Registers typed GELU kernels for float and MLFloat16 types |
| `onnxruntime/core/providers/cpu/tensor/gelu.cc` | Implements FP16 GELU computation with SVE/NEON optimizations and scalar fallback |
| `onnxruntime/core/providers/cpu/math/element_wise_ops.cc` | Adds FP16 ERF operator support using new SVE/NEON kernels |
| `onnxruntime/test/providers/cpu/activation/activation_op_test.cc` | Adds FP16 GELU tests for both tanh and erf approximations |
| `onnxruntime/core/mlas/lib/tanh.cpp` | Adds SVE path for FP16 tanh computation |
| `onnxruntime/core/mlas/lib/sve/mlasi_sve.h` | Declares SVE FP16 function signatures |
| `onnxruntime/core/mlas/lib/sve/mlas_sve_fp16.h` | Adds SVE FP16 intrinsic wrapper functions |
| `onnxruntime/core/mlas/lib/sve/Elementwise_sve_fp16.cpp` | Implements SVE FP16 tanh, erf, and GELU kernels |
| `onnxruntime/core/mlas/lib/fp16_common.h` | Adds NEON FP16 helper functions for erf computation |
| `onnxruntime/core/mlas/lib/erf.cpp` | Implements NEON FP16 erf kernel |
| `onnxruntime/core/mlas/inc/mlas.h` | Exports NEON FP16 erf kernel function |
| `cmake/onnxruntime_providers_cpu.cmake` | Adds ARM FP16 compile flags for gelu.cc and includes MLAS headers |
| `cmake/onnxruntime_mlas.cmake` | Adds SVE FP16 elementwise source and compile flags for erf.cpp |


@hariharans29 hariharans29 changed the title Enable FP16 for Gelu [MLAS] Enable FP16 for Gelu Dec 18, 2025
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

Separate platform-dependent code
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

list(APPEND mlas_platform_srcs ${MLAS_SRC_DIR}/sve/elementwise_sve.cpp)
list(APPEND mlas_platform_srcs ${MLAS_SRC_DIR}/sve/Elementwise_sve_fp16.cpp)
set_source_files_properties(${MLAS_SRC_DIR}/sve/elementwise_sve.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+sve+fp16 ")
set_source_files_properties(${MLAS_SRC_DIR}/sve/Elementwise_sve_fp16.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+sve+fp16 ")
Member

Nit: Rename to elementwise_sve_fp16.cpp? (The casing is inconsistent.)

#include "core/util/math.h"
#include "core/mlas/inc/mlas.h"

#if defined(MLAS_NEON_INTRINSICS)
Member

Isn't this macro defined in mlasi.h (which is not included here)?
There seems to be something wrong with the design here: I think MLAS-internal implementation macros are leaking into the CPU EP files. It may need a rethink.


int64_t i = 0;

if (algo == "tanh") {
Member

I think all this logic needs to live in MLAS and not in a CPU EP file. Is there a limitation that necessitated doing things this way?
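
One way to picture this suggestion is the C++ sketch below. Every name in it is hypothetical and none of it is an existing MLAS API. The operator code calls a single entry point, and the choice of kernel happens behind that entry point, so no intrinsics or library-internal macros appear in the caller. The sketch uses `float` and portable scalar kernels so that it compiles anywhere; an FP16 version with SVE or NEON kernels would presumably follow the same shape.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical names throughout; this is not the MLAS API.
enum class GeluApprox { kNone, kTanh };

using GeluKernel = void (*)(const float* in, float* out, std::size_t n);

// Portable scalar kernel for the erf form, standing in for the fallback path.
void GeluErfScalar(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = 0.5f * in[i] * (1.0f + std::erf(in[i] * 0.70710678118654752f));
    }
}

// Portable scalar kernel for the tanh form.
void GeluTanhScalar(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float x = in[i];
        const float inner = 0.7978845608028654f * (x + 0.044715f * x * x * x);
        out[i] = 0.5f * x * (1.0f + std::tanh(inner));
    }
}

// Library-internal dispatch table. In this sketch both slots hold the scalar
// kernels; a real library could instead select vector kernels after probing
// CPU features.
struct GeluDispatch {
    GeluKernel erf_kernel = GeluErfScalar;
    GeluKernel tanh_kernel = GeluTanhScalar;
};

inline const GeluDispatch& GetGeluDispatch() {
    static const GeluDispatch dispatch{};
    return dispatch;
}

// Single public entry point: the only symbol the operator code would call.
void ComputeGeluSketch(const float* in, float* out, std::size_t n, GeluApprox approx) {
    const GeluDispatch& d = GetGeluDispatch();
    const GeluKernel kernel = (approx == GeluApprox::kTanh) ? d.tanh_kernel : d.erf_kernel;
    kernel(in, out, n);
}
```

With this layout, per-file architecture flags and feature macros would only be needed where the kernels are defined, not in the file that calls `ComputeGeluSketch`.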


void
MLASCALL
MlasSveErfKernelFp16(
Member

It would be nice to stick to one naming convention. Some APIs are named MlasSve... while others are named ..._SVE, which is confusing. Some start with Mlas... and another starts with Compute....


target_include_directories(onnxruntime_providers PRIVATE ${ONNXRUNTIME_ROOT})
if(onnxruntime_target_platform STREQUAL "aarch64" OR onnxruntime_target_platform STREQUAL "ARM64" OR onnxruntime_target_platform STREQUAL "arm64")
set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/tensor/gelu.cc" PROPERTIES COMPILE_FLAGS -march=armv8.2-a+fp16)
Member

This may be a duplicate comment. I think this comes about because some CPU EP files now use intrinsics directly. I feel the routines that use hardware-accelerated intrinsics should live in MLAS and only be called from the CPU EP files.
