Conversation

@dsharletg (Contributor)
This PR enables predicated stores by default. I've done some experiments with performance, my findings are:

  • On x86, for a cheap loop body, predicated stores are faster than Halide scalarizing the loop for 32-bit and larger types, slower for 16-bit and smaller types.
  • On x86, for an expensive loop body, predicated stores are faster than Halide scalarizing, even for small types. But, the loop needs to be quite expensive for this to be the case.
  • On ARM, I couldn't find a case where predicated stores were worse than Halide scalarizing (I patched this change and modified a bunch of benchmarks to run narrow x tall images to check this).
  • On ARM, this significantly speeds up a case I care about (the interpret_nn branch).

@steven-johnson (Contributor) left a comment
LGTM pending green

// Should only attempt to predicate store/load if the lane size is
// no less than 4
// TODO: disabling for now due to trunk LLVM breakage.
// See: https://github.com/halide/Halide/issues/3534
@steven-johnson (Contributor)

#3534 is still open -- probably time to check whether it's still relevant, or to close it. (Also, it might be worth a reality check on older LLVM versions to see whether this should remain disabled for them.)

@abadams (Member) commented Mar 24, 2021

The reason predicated stores are still off on x86 is that 90% of the time they're slightly faster, but occasionally there's a massive performance cliff where they're 10x slower. So we'd have to benchmark a broader set of cases before enabling them by default.

@dsharletg (Contributor, Author)

Can you say what things we should check? I have very little code that uses 32-bit inputs/outputs, so I don't have much code whose performance I can check.

On x86, I'm seeing much more than slight improvements for 32-bit and larger types:

This branch:

cd bin/build/tmp ; /Users/dsharlet/GitHub/Halide/bin/performance_vectorize
Vectorized vs scalar (uint8_t x 32): 0.042ms 1.3ms. Speedup = 30.873
Vectorized vs scalar (uint8_t x 32): 0.193ms 1.26ms. Speedup = 6.523
Vectorized vs scalar (int8_t x 32): 0.0395ms 1.3ms. Speedup = 32.946
Vectorized vs scalar (int8_t x 32): 0.191ms 1.3ms. Speedup = 6.801
Vectorized vs scalar (uint16_t x 16): 0.0394ms 0.635ms. Speedup = 16.130
Vectorized vs scalar (uint16_t x 16): 0.115ms 0.638ms. Speedup = 5.534
Vectorized vs scalar (int16_t x 16): 0.0401ms 0.61ms. Speedup = 15.230
Vectorized vs scalar (int16_t x 16): 0.125ms 0.67ms. Speedup = 5.370
Vectorized vs scalar (uint32_t x 8): 0.0376ms 0.293ms. Speedup = 7.780
Vectorized vs scalar (uint32_t x 8): 0.0494ms 0.294ms. Speedup = 5.952
Vectorized vs scalar (int32_t x 8): 0.0441ms 0.333ms. Speedup = 7.551
Vectorized vs scalar (int32_t x 8): 0.053ms 0.333ms. Speedup = 6.295
Vectorized vs scalar (float x 8): 0.0457ms 0.303ms. Speedup = 6.633
Vectorized vs scalar (float x 8): 0.06ms 0.34ms. Speedup = 5.668
Vectorized vs scalar (double x 4): 0.0497ms 0.145ms. Speedup = 2.917
Vectorized vs scalar (double x 4): 0.0561ms 0.133ms. Speedup = 2.376

Master:

cd bin/build/tmp ; /Users/dsharlet/GitHub/Halide/bin/performance_vectorize
Vectorized vs scalar (uint8_t x 32): 0.0438ms 1.4ms. Speedup = 32.015
Vectorized vs scalar (uint8_t x 32): 0.187ms 1.33ms. Speedup = 7.123
Vectorized vs scalar (int8_t x 32): 0.0412ms 1.27ms. Speedup = 30.863
Vectorized vs scalar (int8_t x 32): 0.188ms 1.34ms. Speedup = 7.160
Vectorized vs scalar (uint16_t x 16): 0.0407ms 0.641ms. Speedup = 15.759
Vectorized vs scalar (uint16_t x 16): 0.12ms 0.628ms. Speedup = 5.243
Vectorized vs scalar (int16_t x 16): 0.0415ms 0.655ms. Speedup = 15.774
Vectorized vs scalar (int16_t x 16): 0.119ms 0.674ms. Speedup = 5.642
Vectorized vs scalar (uint32_t x 8): 0.0391ms 0.315ms. Speedup = 8.061
Vectorized vs scalar (uint32_t x 8): 0.119ms 0.302ms. Speedup = 2.540
Vectorized vs scalar (int32_t x 8): 0.0401ms 0.317ms. Speedup = 7.904
Vectorized vs scalar (int32_t x 8): 0.121ms 0.305ms. Speedup = 2.531
Vectorized vs scalar (float x 8): 0.0478ms 0.307ms. Speedup = 6.420
Vectorized vs scalar (float x 8): 0.15ms 0.328ms. Speedup = 2.191
Vectorized vs scalar (double x 4): 0.0497ms 0.144ms. Speedup = 2.906
Vectorized vs scalar (double x 4): 0.0986ms 0.141ms. Speedup = 1.433

Up to 16-bit, there's no change (as expected, the code should not change). For 32-bit types and larger, the improvement is 2-3x.

This also really depends on the content of the loop. If the body of the loop is expensive, this can make a huge (positive) difference. I have spent a lot of time trying to write schedules so that the expensive work is computed in a separate stage (with TailStrategy::RoundUp), but that adds its own overhead; I haven't been able to match the performance of this branch, and it requires a lot of programmer effort that shouldn't be necessary.

@abadams (Member) commented Mar 24, 2021

I'll see if I can recreate the effect...

@abadams (Member) commented Mar 24, 2021

OK, I remember what the problem was. AVX/AVX2's vpmaskmov is a predicated store and it's fine. However, maskmovq and maskmovdqu (the pre-AVX versions) are also non-temporal stores, so it's a disaster for locality if you load the stored data again immediately. Quoting from the Intel docs on vpmaskmov:

"Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions"

However, I can no longer get LLVM to emit those non-temporal variants. It seems to just scalarize pre-AVX, and it doesn't want to emit them for narrow vectors with AVX on. Previously I was seeing them used even on AVX machines, and that's what was causing the nasty stalls.

So I think the problem I encountered has been fixed on the llvm side.

@dsharletg (Contributor, Author)

Closing in favor of #5856. We can't assume predicated loads/stores are always faster.

@dsharletg dsharletg closed this Mar 26, 2021
@dsharletg dsharletg deleted the dsharletg/predicate-stores2 branch March 26, 2021 02:52
@alexreinking alexreinking modified the milestone: v12.0.0 May 19, 2021