Discussion on performance optimization of expf in softmax #6633
Description
In neural networks, Softmax is a frequently used operation. I found that the current implementation calls `expf` to compute it, which is a comparatively expensive standard-library function.
Therefore, I wondered if there was a better way to optimize it. 🤔
I looked at implementations in other inference frameworks, such as MNN, which implements a fast approximation of exp using polynomial approximation, and its performance appears to be good. So I added a `fast_expf` function for NCNN based on range reduction + polynomial approximation.
I wrote a local benchmark to compare the accuracy and performance of my implementation against `expf`; the results are as follows:
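The idea can be sketched as follows. This is an illustrative implementation, not the actual code from MNN or my patch: it splits the input as x = n·ln2 + r with r in [-ln2/2, ln2/2], evaluates e^r with a degree-5 polynomial, and reconstructs 2^n by writing the float exponent bits directly. The name `fast_expf`, the clamp bounds, and the coefficients are assumptions for the sketch.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch of a fast expf via range reduction + polynomial approximation.
// Illustrative only; coefficients and bounds are not ncnn's actual code.
static inline float fast_expf(float x)
{
    // clamp to keep the 2^n reconstruction inside the float exponent range
    if (x > 87.f) x = 87.f;
    if (x < -87.f) x = -87.f;

    const float ln2 = 0.6931471805599453f;
    const float inv_ln2 = 1.4426950408889634f;

    // range reduction: x = n * ln2 + r, with r in [-ln2/2, ln2/2]
    float nf = std::floor(x * inv_ln2 + 0.5f);
    int n = (int)nf;
    float r = x - nf * ln2;

    // degree-5 polynomial for e^r, evaluated in Horner form
    float p = 1.f + r * (1.f + r * (0.5f + r * (1.f / 6 + r * (1.f / 24 + r * (1.f / 120)))));

    // reconstruct 2^n by building the IEEE 754 exponent field (bias 127)
    uint32_t bits = (uint32_t)(n + 127) << 23;
    float two_n;
    std::memcpy(&two_n, &bits, sizeof(two_n));

    return p * two_n;
}
```

Because |r| ≤ ln2/2 ≈ 0.347, the degree-5 polynomial's truncation error stays in the 1e-6 relative range, which matches the error magnitudes measured below.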
| Size | Original | Fast Exp | Speedup |
|---|---|---|---|
| 64 | 0.191ms | 0.088ms | 2.17x |
| 512 | 1.518ms | 0.613ms | 2.48x |
| 4096 | 12.014ms | 4.346ms | 2.76x |
| 10000 | 29.199ms | 10.585ms | 2.76x |
| | Maximum absolute error | Relative error |
|---|---|---|
| Random data | < 0.00001 | < 0.02% |
| Boundary values | < 0.0001 | < 0.05% |
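For context on where the approximation sits in the pipeline, here is a minimal softmax sketch (not ncnn's actual layer code) marking the call that `fast_expf` would replace; the max-subtraction step keeps the exponent arguments non-positive, which is also why the clamped input range of a fast exp is sufficient here.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal softmax sketch: subtract the max for numerical stability,
// exponentiate, then normalize. Illustrative, not ncnn's layer code.
void softmax(std::vector<float>& v)
{
    float m = *std::max_element(v.begin(), v.end());
    float sum = 0.f;
    for (float& x : v)
    {
        x = std::exp(x - m); // the call a fast_expf would replace
        sum += x;
    }
    for (float& x : v)
        x /= sum;
}
```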
```
=== Softmax Performance Benchmark ===
=== System Information ===
CPU: AMD Ryzen 7 9800X3D 8-Core Processor
OS: Linux 6.6.87.2-microsoft-standard-WSL2
Compiler: GCC 13.3.0
Build: -O3
Arch: x86_64
=== Accuracy Test ===
Input: 0.0001 0.0002 0.0006 0.0016 0.0043 0.0116 0.0315 0.0856 0.2326 0.6321
Fast:  0.0001 0.0002 0.0006 0.0016 0.0043 0.0116 0.0315 0.0856 0.2326 0.6321
Max absolute difference: 0.000009
Edge case tests:
All zeros - Original: 0.2000 0.2000 0.2000 0.2000 0.2000
All zeros - Fast:     0.2000 0.2000 0.2000 0.2000 0.2000
Large values - Original: 0.011656 0.031685 0.086129 0.234122 0.636409
Large values - Fast:     0.011656 0.031685 0.086128 0.234130 0.636401
Mixed values - Original: 0.000000 0.000000 0.000045 0.006693 0.993262
Mixed values - Fast:     0.000000 0.000000 0.000045 0.006693 0.993262
=== Performance Benchmark ===
Warmup...
--- Sequential Access Benchmarks ---
Size = 64:
Original expf: 0.201 ms (1000 iterations, size=64, avg=0.201 us)
Fast expf:     0.093 ms (1000 iterations, size=64, avg=0.093 us)
Size = 128:
Original expf: 0.378 ms (1000 iterations, size=128, avg=0.378 us)
Fast expf:     0.140 ms (1000 iterations, size=128, avg=0.140 us)
Size = 256:
Original expf: 0.768 ms (1000 iterations, size=256, avg=0.768 us)
Fast expf:     0.287 ms (1000 iterations, size=256, avg=0.287 us)
Size = 512:
Original expf: 1.531 ms (1000 iterations, size=512, avg=1.531 us)
Fast expf:     0.562 ms (1000 iterations, size=512, avg=0.562 us)
Size = 1024:
Original expf: 3.047 ms (1000 iterations, size=1024, avg=3.047 us)
Fast expf:     1.202 ms (1000 iterations, size=1024, avg=1.202 us)
Size = 4096:
Original expf: 12.328 ms (1000 iterations, size=4096, avg=12.328 us)
Fast expf:     4.670 ms (1000 iterations, size=4096, avg=4.670 us)
Size = 10000:
Original expf: 29.508 ms (1000 iterations, size=10000, avg=29.508 us)
Fast expf:     10.873 ms (1000 iterations, size=10000, avg=10.873 us)
--- Strided Access Benchmarks (Attention Pattern) ---
Attention softmax (seq_len=512):
Original expf: 15.371 ms (10000 iterations, size=512, stride=1, avg=1.537 us)
Fast expf:     5.377 ms (10000 iterations, size=512, stride=1, avg=0.538 us)
Cross-attention pattern (stride=768):
Original expf: 3.661 ms (1000 iterations, size=512, stride=768, avg=3.661 us)
Fast expf:     2.164 ms (1000 iterations, size=512, stride=768, avg=2.164 us)
=== Benchmark Complete ===
```

Based on the data above, the polynomial-approximation exp implementation should bring significant performance gains to NCNN's Softmax. 🚀
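For reproducibility, a timing harness along these lines was used. This is a simplified sketch under stated assumptions (the function name `bench_expf`, the input data, and the checksum print are illustrative, not the exact code behind the log above):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Rough benchmark sketch: time `iters` passes of expf over a buffer of
// `size` floats and return the total elapsed time in milliseconds.
double bench_expf(int size, int iters)
{
    std::vector<float> in(size), out(size);
    for (int i = 0; i < size; i++)
        in[i] = 0.001f * i; // arbitrary, deterministic input data

    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; it++)
        for (int i = 0; i < size; i++)
            out[i] = std::exp(in[i]);
    auto t1 = std::chrono::steady_clock::now();

    // print a checksum so the compiler cannot elide the timed loop
    std::printf("checksum %.3f\n", (double)out[size - 1]);

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```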
However, I have some concerns:

- What is NCNN's current precision tolerance for Softmax?
- Are there any issues in quantization scenarios?
- If we were to introduce this optimization, could we offer it as an option alongside `expf`?

I look forward to your reply. 🙏
If necessary, I can submit a PR to demonstrate my current implementation.