This repository was archived by the owner on Dec 22, 2021. It is now read-only.
ARM NEON has pairwise-folding addition instructions in which pairs of narrow (e.g. 8-bit) input lanes are added together and the sums are written to, or accumulated into, wider (e.g. 16-bit) integer lanes. Examples are SADDLP (widening pairwise add) and SADALP (widening pairwise add and accumulate).
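To make the lane arithmetic concrete, here is a minimal scalar sketch in portable C of the 16 x 8-bit to 8 x 16-bit signed forms. The function names `saddlp_model` and `sadalp_model` are hypothetical and chosen for illustration only; they are not NEON intrinsics, and the loops model the lane semantics rather than any real implementation.

```c
#include <stdint.h>

/* Hypothetical scalar model of SADDLP: each adjacent pair of signed
 * 8-bit lanes is sign-extended, added, and written to one 16-bit lane.
 * The sum of two int8 values always fits in int16, so nothing is lost. */
void saddlp_model(const int8_t in[16], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)(in[2 * i] + in[2 * i + 1]);
    }
}

/* Hypothetical scalar model of SADALP: same pairwise widening sum,
 * but added into an existing 16-bit accumulator lane. The accumulation
 * is modelled as wrapping (modulo 2^16) via unsigned arithmetic. */
void sadalp_model(int16_t acc[8], const int8_t in[16]) {
    for (int i = 0; i < 8; i++) {
        int32_t pair_sum = in[2 * i] + in[2 * i + 1];
        uint16_t wrapped = (uint16_t)((uint16_t)acc[i] + (uint16_t)pair_sum);
        acc[i] = (int16_t)wrapped;
    }
}
```

Each call consumes 16 input lanes and produces 8 output lanes, which is the "folding" referred to above.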
This is in addition to plain pairwise-folding additions in which all operands have the same bit width, such as ADDP (which the original text refers to as SADDP).
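For contrast, a same-width pairwise add can be sketched as below for 8 x 16-bit lanes. This is an illustrative scalar model under the assumption that the instruction takes two source vectors, folds adjacent pairs of each, and places the first operand's sums in the low half of the result; `addp16_model` is a hypothetical name, not an intrinsic.

```c
#include <stdint.h>

/* Hypothetical scalar model of a same-width pairwise add on 16-bit lanes.
 * Adjacent pairs of `a` fill the low four result lanes and adjacent pairs
 * of `b` fill the high four. There is no widening, so sums wrap mod 2^16. */
void addp16_model(const int16_t a[8], const int16_t b[8], int16_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[i]     = (int16_t)(uint16_t)((uint16_t)a[2 * i] + (uint16_t)a[2 * i + 1]);
        out[4 + i] = (int16_t)(uint16_t)((uint16_t)b[2 * i] + (uint16_t)b[2 * i + 1]);
    }
}
```

Because the output lanes are no wider than the inputs, this form needs two source vectors to fill one result vector, and it cannot absorb overflow the way the widening forms do.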
An extreme case of such folding is the dot-product instructions (SDOT, see PR #127), where the folding addition is performed 4-fold. When one of the source operands has all lanes set to 1, this acts as a 4-fold addition of 8-bit values into 32-bit accumulators.
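The all-ones trick can be checked with a scalar sketch of the 128-bit signed dot product. As before, `sdot_model` is a hypothetical helper that models the lane semantics only: each 32-bit accumulator lane receives the sum of four 8-bit by 8-bit products.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 128-bit SDOT: for each of the four
 * 32-bit lanes, multiply four pairs of signed 8-bit values and add
 * the four products into the accumulator lane. */
void sdot_model(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int lane = 0; lane < 4; lane++) {
        int32_t sum = 0;
        for (int j = 0; j < 4; j++) {
            sum += (int32_t)a[4 * lane + j] * (int32_t)b[4 * lane + j];
        }
        /* Accumulate with wrapping semantics via unsigned arithmetic. */
        acc[lane] = (int32_t)((uint32_t)acc[lane] + (uint32_t)sum);
    }
}

/* With every lane of the second operand set to 1, each product equals
 * the corresponding lane of `a`, so the operation reduces to a 4-fold
 * widening add of 8-bit values into 32-bit accumulators. */
void fold4_add_model(int32_t acc[4], const int8_t a[16]) {
    int8_t ones[16];
    for (int i = 0; i < 16; i++) {
        ones[i] = 1;
    }
    sdot_model(acc, a, ones);
}
```

For example, with `a` holding the values -8 through 7 and the accumulators starting at zero, the four result lanes are the sums of each consecutive group of four inputs.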
This combination of folding behavior and mixed bit widths makes it possible to maximize the number of scalar operations done per instruction. For example, a single 128-bit SDOT performs 16 multiplications and 16 additions.