Discussion on performance optimization of expf in softmax #6633

@crafcat7

Description

In neural networks, Softmax is a frequently used operation. In the current implementation we call the standard library's expf, which is a comparatively expensive function.

Therefore, I wondered if there was a better way to optimize it.🤔

I looked at other inference frameworks for reference. MNN, for example, implements a fast approximation of exp using a polynomial approximation, and its performance appears to be good. So I added a fast_expf function for NCNN based on range reduction + polynomial approximation.
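The range reduction + polynomial idea can be sketched as follows. This is a hypothetical illustration, not NCNN's or MNN's actual code: write x = n·ln2 + r with |r| ≤ ln2/2, so exp(x) = 2^n · exp(r); exp(r) is then cheap to approximate with a short polynomial, and 2^n is assembled directly in the float's exponent bits. The function name and the degree-4 polynomial are my own choices for the sketch.

```cpp
#include <cmath>
#include <cstring>

// Hypothetical sketch: fast expf via range reduction + polynomial approximation.
static inline float fast_expf(float x)
{
    // clamp so the exponent reconstruction below cannot overflow/underflow
    if (x > 87.f) x = 87.f;
    if (x < -87.f) x = -87.f;

    const float log2e = 1.44269504f; // 1 / ln(2)
    const float ln2 = 0.69314718f;

    float nf = nearbyintf(x * log2e); // n = round(x / ln2)
    float r = x - nf * ln2;           // remainder, |r| <= ln2/2

    // degree-4 Taylor polynomial for exp(r) on [-ln2/2, ln2/2], Horner form
    float p = 1.f + r * (1.f + r * (0.5f + r * (1.f / 6.f + r * (1.f / 24.f))));

    // assemble 2^n by shifting the biased exponent into IEEE-754 position
    unsigned int bits = (unsigned int)((int)nf + 127) << 23;
    float two_n;
    memcpy(&two_n, &bits, sizeof(two_n));

    return p * two_n;
}
```

On [-ln2/2, ln2/2] the degree-4 remainder is about r^5/120 ≈ 4e-5 at worst, which is in the same ballpark as the error figures reported below.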

I added a local benchmark to evaluate the accuracy and performance of my implementation compared to expf, and the results are as follows:

| Size | Original | Fast Exp | Ratio |
|------|----------|----------|-------|
| 64 | 0.191 ms | 0.088 ms | 2.17x |
| 512 | 1.518 ms | 0.613 ms | 2.48x |
| 4096 | 12.014 ms | 4.346 ms | 2.76x |
| 10000 | 29.199 ms | 10.585 ms | 2.76x |

| | Maximum absolute error | Relative error |
|---|---|---|
| Random data | < 0.00001 | < 0.02% |
| Boundary values | < 0.0001 | < 0.05% |
```text
=== Softmax Performance Benchmark ===

=== System Information ===
CPU: AMD Ryzen 7 9800X3D 8-Core Processor
OS: Linux 6.6.87.2-microsoft-standard-WSL2
Compiler: GCC 13.3.0
Build: -O3
Arch: x86_64

=== Accuracy Test ===
Input:  0.0001 0.0002 0.0006 0.0016 0.0043 0.0116 0.0315 0.0856 0.2326 0.6321
Fast:   0.0001 0.0002 0.0006 0.0016 0.0043 0.0116 0.0315 0.0856 0.2326 0.6321
Max absolute difference: 0.000009

Edge case tests:
All zeros - Original:    0.2000 0.2000 0.2000 0.2000 0.2000
All zeros - Fast:        0.2000 0.2000 0.2000 0.2000 0.2000
Large values - Original: 0.011656 0.031685 0.086129 0.234122 0.636409
Large values - Fast:     0.011656 0.031685 0.086128 0.234130 0.636401
Mixed values - Original: 0.000000 0.000000 0.000045 0.006693 0.993262
Mixed values - Fast:     0.000000 0.000000 0.000045 0.006693 0.993262

=== Performance Benchmark ===

Warmup...

--- Sequential Access Benchmarks ---

Size = 64:
Original expf: 0.201 ms (1000 iterations, size=64, avg=0.201 us)
Fast expf:     0.093 ms (1000 iterations, size=64, avg=0.093 us)

Size = 128:
Original expf: 0.378 ms (1000 iterations, size=128, avg=0.378 us)
Fast expf:     0.140 ms (1000 iterations, size=128, avg=0.140 us)

Size = 256:
Original expf: 0.768 ms (1000 iterations, size=256, avg=0.768 us)
Fast expf:     0.287 ms (1000 iterations, size=256, avg=0.287 us)

Size = 512:
Original expf: 1.531 ms (1000 iterations, size=512, avg=1.531 us)
Fast expf:     0.562 ms (1000 iterations, size=512, avg=0.562 us)

Size = 1024:
Original expf: 3.047 ms (1000 iterations, size=1024, avg=3.047 us)
Fast expf:     1.202 ms (1000 iterations, size=1024, avg=1.202 us)

Size = 4096:
Original expf: 12.328 ms (1000 iterations, size=4096, avg=12.328 us)
Fast expf:     4.670 ms (1000 iterations, size=4096, avg=4.670 us)

Size = 10000:
Original expf: 29.508 ms (1000 iterations, size=10000, avg=29.508 us)
Fast expf:     10.873 ms (1000 iterations, size=10000, avg=10.873 us)

--- Strided Access Benchmarks (Attention Pattern) ---

Attention softmax (seq_len=512):
Original expf: 15.371 ms (10000 iterations, size=512, stride=1, avg=1.537 us)
Fast expf:     5.377 ms (10000 iterations, size=512, stride=1, avg=0.538 us)

Cross-attention pattern (stride=768):
Original expf: 3.661 ms (1000 iterations, size=512, stride=768, avg=3.661 us)
Fast expf:     2.164 ms (1000 iterations, size=512, stride=768, avg=2.164 us)

=== Benchmark Complete ===
```

Based on the data above, the polynomial-approximation expf should bring a significant performance gain to NCNN's Softmax.🚀
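For context, here is a minimal sketch of the kind of Softmax inner loop this would plug into (illustrative only; fast_expf below is a stand-in aliased to expf so the sketch stands alone). The usual max subtraction matters for the approximation too: it bounds every exp argument to at most 0, keeping inputs away from the overflow region.

```cpp
#include <cmath>

// stand-in for the proposed fast approximation
static inline float fast_expf(float x) { return expf(x); }

// in-place softmax over a contiguous buffer
static void softmax_inplace(float* ptr, int size)
{
    // subtract the max first: every exp argument is then <= 0,
    // which prevents overflow and keeps the approximation in a benign range
    float maxv = ptr[0];
    for (int i = 1; i < size; i++)
        if (ptr[i] > maxv) maxv = ptr[i];

    float sum = 0.f;
    for (int i = 0; i < size; i++)
    {
        ptr[i] = fast_expf(ptr[i] - maxv);
        sum += ptr[i];
    }

    float rsum = 1.f / sum;
    for (int i = 0; i < size; i++)
        ptr[i] *= rsum; // normalize so the outputs sum to 1
}
```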

However, I have some concerns:

  1. What precision tolerance does NCNN currently require for Softmax?

  2. Could the reduced precision cause issues in quantization scenarios?

  3. If this optimization were introduced, could it be offered as an option alongside the standard expf path?

I look forward to your reply. 🙏
If necessary, I can submit a PR to demonstrate my current implementation.
