Discussion on performance optimization of expf in softmax #6633
Description
In neural networks, Softmax is a frequently used operation. I found that the current implementation calls `expf` to compute it, which is a comparatively expensive standard-library function.
Therefore, I wondered if there was a better way to optimize it. 🤔
I looked at implementations in other inference frameworks, such as MNN, which implements a fast approximation of exp using polynomial approximation, and its performance appears to be good. So I added a `fast_expf` function for NCNN based on range reduction + polynomial approximation.
I wrote a local benchmark to compare the accuracy and performance of my implementation against `expf`; the results are as follows:
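The idea can be sketched as follows. This is an illustrative implementation, not the actual code from MNN or my patch: it splits the input as x = n·ln2 + r with r in [-ln2/2, ln2/2], evaluates e^r with a degree-5 polynomial, and reconstructs 2^n by writing the float exponent bits directly. The name `fast_expf`, the clamp bounds, and the coefficients are assumptions for the sketch.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch of a fast expf via range reduction + polynomial approximation.
// Illustrative only; coefficients and bounds are not ncnn's actual code.
static inline float fast_expf(float x)
{
    // clamp to keep the 2^n reconstruction inside the float exponent range
    if (x > 87.f) x = 87.f;
    if (x < -87.f) x = -87.f;

    const float ln2 = 0.6931471805599453f;
    const float inv_ln2 = 1.4426950408889634f;

    // range reduction: x = n * ln2 + r, with r in [-ln2/2, ln2/2]
    float nf = std::floor(x * inv_ln2 + 0.5f);
    int n = (int)nf;
    float r = x - nf * ln2;

    // degree-5 polynomial for e^r, evaluated in Horner form
    float p = 1.f + r * (1.f + r * (0.5f + r * (1.f / 6 + r * (1.f / 24 + r * (1.f / 120)))));

    // reconstruct 2^n by building the IEEE 754 exponent field (bias 127)
    uint32_t bits = (uint32_t)(n + 127) << 23;
    float two_n;
    std::memcpy(&two_n, &bits, sizeof(two_n));

    return p * two_n;
}
```

Because |r| ≤ ln2/2 ≈ 0.347, the degree-5 polynomial's truncation error stays in the 1e-6 relative range, which matches the error magnitudes measured below.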
| Size | Original | Fast Exp | Speedup |
|---|---|---|---|
| 64 | 0.191ms | 0.088ms | 2.17x |
| 512 | 1.518ms | 0.613ms | 2.48x |
| 4096 | 12.014ms | 4.346ms | 2.76x |
| 10000 | 29.199ms | 10.585ms | 2.76x |
| | Maximum absolute error | Relative error |
|---|---|---|
| Random data | < 0.00001 | < 0.02% |
| Boundary values | < 0.0001 | < 0.05% |
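For context on where the approximation sits in the pipeline, here is a minimal softmax sketch (not ncnn's actual layer code) marking the call that `fast_expf` would replace; the max-subtraction step keeps the exponent arguments non-positive, which is also why the clamped input range of a fast exp is sufficient here.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal softmax sketch: subtract the max for numerical stability,
// exponentiate, then normalize. Illustrative, not ncnn's layer code.
void softmax(std::vector<float>& v)
{
    float m = *std::max_element(v.begin(), v.end());
    float sum = 0.f;
    for (float& x : v)
    {
        x = std::exp(x - m); // the call a fast_expf would replace
        sum += x;
    }
    for (float& x : v)
        x /= sum;
}
```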
```
=== Softmax Performance Benchmark ===
=== System Information ===
CPU: AMD Ryzen 7 9800X3D 8-Core Processor
OS: Linux 6.6.87.2-microsoft-standard-WSL2
Compiler: GCC 13.3.0
Build: -O3
Arch: x86_64
=== Accuracy Test ===
Input: 0.0001 0.0002 0.0006 0.0016 0.0043 0.0116 0.0315 0.0856 0.2326 0.6321
Fast:  0.0001 0.0002 0.0006 0.0016 0.0043 0.0116 0.0315 0.0856 0.2326 0.6321
Max absolute difference: 0.000009
Edge case tests:
All zeros - Original: 0.2000 0.2000 0.2000 0.2000 0.2000
All zeros - Fast:     0.2000 0.2000 0.2000 0.2000 0.2000
Large values - Original: 0.011656 0.031685 0.086129 0.234122 0.636409
Large values - Fast:     0.011656 0.031685 0.086128 0.234130 0.636401
Mixed values - Original: 0.000000 0.000000 0.000045 0.006693 0.993262
Mixed values - Fast:     0.000000 0.000000 0.000045 0.006693 0.993262
=== Performance Benchmark ===
Warmup...
--- Sequential Access Benchmarks ---
Size = 64:
Original expf: 0.201 ms (1000 iterations, size=64, avg=0.201 us)
Fast expf:     0.093 ms (1000 iterations, size=64, avg=0.093 us)
Size = 128:
Original expf: 0.378 ms (1000 iterations, size=128, avg=0.378 us)
Fast expf:     0.140 ms (1000 iterations, size=128, avg=0.140 us)
Size = 256:
Original expf: 0.768 ms (1000 iterations, size=256, avg=0.768 us)
Fast expf:     0.287 ms (1000 iterations, size=256, avg=0.287 us)
Size = 512:
Original expf: 1.531 ms (1000 iterations, size=512, avg=1.531 us)
Fast expf:     0.562 ms (1000 iterations, size=512, avg=0.562 us)
Size = 1024:
Original expf: 3.047 ms (1000 iterations, size=1024, avg=3.047 us)
Fast expf:     1.202 ms (1000 iterations, size=1024, avg=1.202 us)
Size = 4096:
Original expf: 12.328 ms (1000 iterations, size=4096, avg=12.328 us)
Fast expf:     4.670 ms (1000 iterations, size=4096, avg=4.670 us)
Size = 10000:
Original expf: 29.508 ms (1000 iterations, size=10000, avg=29.508 us)
Fast expf:     10.873 ms (1000 iterations, size=10000, avg=10.873 us)
--- Strided Access Benchmarks (Attention Pattern) ---
Attention softmax (seq_len=512):
Original expf: 15.371 ms (10000 iterations, size=512, stride=1, avg=1.537 us)
Fast expf:     5.377 ms (10000 iterations, size=512, stride=1, avg=0.538 us)
Cross-attention pattern (stride=768):
Original expf: 3.661 ms (1000 iterations, size=512, stride=768, avg=3.661 us)
Fast expf:     2.164 ms (1000 iterations, size=512, stride=768, avg=2.164 us)
=== Benchmark Complete ===
```

Based on the data above, the polynomial-approximation exp implementation should bring significant performance gains to NCNN's Softmax. 🚀
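For reproducibility, a timing harness along these lines was used. This is a simplified sketch under stated assumptions (the function name `bench_expf`, the input data, and the checksum print are illustrative, not the exact code behind the log above):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Rough benchmark sketch: time `iters` passes of expf over a buffer of
// `size` floats and return the total elapsed time in milliseconds.
double bench_expf(int size, int iters)
{
    std::vector<float> in(size), out(size);
    for (int i = 0; i < size; i++)
        in[i] = 0.001f * i; // arbitrary, deterministic input data

    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; it++)
        for (int i = 0; i < size; i++)
            out[i] = std::exp(in[i]);
    auto t1 = std::chrono::steady_clock::now();

    // print a checksum so the compiler cannot elide the timed loop
    std::printf("checksum %.3f\n", (double)out[size - 1]);

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```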
However, I have some concerns:

- What is NCNN's current precision tolerance for Softmax?
- Are there any issues in quantization scenarios?
- If we were to introduce this optimization, could we offer it as an option alongside `expf`?

I look forward to your reply. 🙏
If necessary, I can submit a PR to demonstrate my current implementation.