Use predicated stores on more targets #5846
Conversation
LGTM pending green
// Should only attempt to predicate store/load if the lane size is
// no less than 4
// TODO: disabling for now due to trunk LLVM breakage.
// See: https://github.com/halide/Halide/issues/3534
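For context, a minimal sketch of the kind of guard that comment describes; the helper name and exact condition are hypothetical and this is not the actual Halide codegen:

#include "Halide.h"

// Hypothetical helper (illustration only): only consider predicating a
// store/load when each lane is at least 4 bytes wide, i.e. 32-bit and
// wider element types.
bool should_try_predication(const Halide::Type &t) {
    return t.bytes() >= 4;
}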
#3534 is still open -- probably time to check whether it is still relevant, or to close it. (Also, it might be worth a reality check on older LLVM versions to see if this should remain disabled for them.)
The reason predicated stores are still off on x86 is that 90% of the time they're slightly faster, but occasionally there's a massive performance cliff where they're 10x slower. So we'd have to check a larger number of things.
Can you say what things we should check? I have very little code that uses 32-bit inputs/outputs, so I don't have much code with which to check performance. On x86, I'm seeing much more than slight improvements for 32-bit and larger types. This branch:
Master:
Up to 16-bit there's no change (as expected; the code should not change). For 32-bit types and larger, the improvement is 2-3x. This also really depends on the content of the loop: if the body of the loop is expensive, this can make a huge (positive) difference. I have spent a lot of time trying to write the schedule such that the expensive work is computed in a separate stage (with TailStrategy::RoundUp), but that adds its own overhead, I haven't been able to match the performance of this branch, and it requires a lot of programmer effort that shouldn't be necessary.
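For concreteness, a minimal sketch of that two-stage workaround, assuming a simple int32 pipeline; the Func names, the stand-in expression, and the vector width of 8 are illustrative:

#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x");

    // Stage holding the expensive work, rounded up to the vector width so
    // its vectorized loop has no tail.
    Func expensive("expensive");
    expensive(x) = cast<int32_t>(x) * cast<int32_t>(x) + 7;  // stand-in for the costly body

    // Output stage just copies, so its tail handling is cheap.
    Func out("out");
    out(x) = expensive(x);

    expensive.compute_root()
             .vectorize(x, 8, TailStrategy::RoundUp);   // over-compute past the end
    out.vectorize(x, 8, TailStrategy::GuardWithIf);     // output extent stays exact

    Buffer<int32_t> result = out.realize({100});  // 100 is not a multiple of 8
    return 0;
}

Relying on a predicated store for the tail would avoid both the extra stage and the guard, which is roughly the effort this branch is meant to save.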
I'll see if I can recreate the effect...
OK, I remember what the problem was. AVX/AVX2's vpmaskmov is a predicated store and it's fine. However, maskmovq and maskmovdqu (the pre-AVX versions) are also non-temporal stores, so it's a disaster for locality if you try to load the data again immediately. Quoting from the Intel docs on vpmaskmov: "Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions." However, I can no longer get LLVM to emit those non-temporal variants. It seems to just scalarize pre-AVX, and it doesn't want to emit them for narrow vectors with AVX on. Previously I was seeing them used even on AVX machines, and that's what was causing nasty stalls. So I think the problem I encountered has been fixed on the LLVM side.
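To illustrate the distinction with plain x86 intrinsics (this is not Halide output; the function names and mask conventions are only for illustration):

#include <immintrin.h>

// AVX2 vpmaskmovd: an ordinary (temporal) masked store, so reading the
// stored data back right away is cheap.
void masked_store_avx2(int *dst, __m256i lane_mask, __m256i vals) {
    _mm256_maskstore_epi32(dst, lane_mask, vals);
}

// SSE2 maskmovdqu: carries an implicit non-temporal hint, so an immediate
// load of the same cache line can stall badly.
void masked_store_sse2(char *dst, __m128i byte_mask, __m128i vals) {
    _mm_maskmoveu_si128(vals, byte_mask, dst);
}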
Closing in favor of #5856. We can't assume predicated loads/stores are always faster.
This PR enables predicated stores by default. I've done some experiments with performance; my findings are: