Conversation

@dsharletg (Contributor)
This PR enables predicated stores by default. I've done some experiments with performance, my findings are:

  • On x86, for a cheap loop body, predicated stores are faster than Halide scalarizing the loop for 32-bit and larger types, slower for 16-bit and smaller types.
  • On x86, for an expensive loop body, predicated stores are faster than Halide scalarizing, even for small types. But, the loop needs to be quite expensive for this to be the case.
  • On ARM, I couldn't find a case where predicated stores were worse than Halide scalarizing (I patched this change and modified a bunch of benchmarks to run narrow x tall images to check this).
  • On ARM, this significantly speeds up a case I care about (the interpret_nn branch).

@steven-johnson (Contributor) left a comment
LGTM pending green

// Should only attempt to predicate store/load if the lane size is
// no less than 4
// TODO: disabling for now due to trunk LLVM breakage.
// See: https://github.com/halide/Halide/issues/3534
@steven-johnson (Contributor)

#3534 is still open -- probably time to check whether it's still relevant, or to close it. (Also, it might be worth a reality check on older LLVM versions to see whether this should remain disabled for them.)

@abadams (Member) commented Mar 24, 2021

The reason predicated stores are still off on x86 is that 90% of the time they're slightly faster, but occasionally there's a massive performance cliff where they're 10x slower. So we'd have to benchmark a broader set of cases before enabling them by default.

@dsharletg (Contributor, Author)

Can you say what things we should check? I have very little code that uses 32-bit inputs/outputs, so I don't have much code whose performance I can check.

On x86, I'm seeing much more than slight improvements for 32-bit and larger types:

This branch:

cd bin/build/tmp ; /Users/dsharlet/GitHub/Halide/bin/performance_vectorize
Vectorized vs scalar (uint8_t x 32): 0.042ms 1.3ms. Speedup = 30.873
Vectorized vs scalar (uint8_t x 32): 0.193ms 1.26ms. Speedup = 6.523
Vectorized vs scalar (int8_t x 32): 0.0395ms 1.3ms. Speedup = 32.946
Vectorized vs scalar (int8_t x 32): 0.191ms 1.3ms. Speedup = 6.801
Vectorized vs scalar (uint16_t x 16): 0.0394ms 0.635ms. Speedup = 16.130
Vectorized vs scalar (uint16_t x 16): 0.115ms 0.638ms. Speedup = 5.534
Vectorized vs scalar (int16_t x 16): 0.0401ms 0.61ms. Speedup = 15.230
Vectorized vs scalar (int16_t x 16): 0.125ms 0.67ms. Speedup = 5.370
Vectorized vs scalar (uint32_t x 8): 0.0376ms 0.293ms. Speedup = 7.780
Vectorized vs scalar (uint32_t x 8): 0.0494ms 0.294ms. Speedup = 5.952
Vectorized vs scalar (int32_t x 8): 0.0441ms 0.333ms. Speedup = 7.551
Vectorized vs scalar (int32_t x 8): 0.053ms 0.333ms. Speedup = 6.295
Vectorized vs scalar (float x 8): 0.0457ms 0.303ms. Speedup = 6.633
Vectorized vs scalar (float x 8): 0.06ms 0.34ms. Speedup = 5.668
Vectorized vs scalar (double x 4): 0.0497ms 0.145ms. Speedup = 2.917
Vectorized vs scalar (double x 4): 0.0561ms 0.133ms. Speedup = 2.376

Master:

cd bin/build/tmp ; /Users/dsharlet/GitHub/Halide/bin/performance_vectorize
Vectorized vs scalar (uint8_t x 32): 0.0438ms 1.4ms. Speedup = 32.015
Vectorized vs scalar (uint8_t x 32): 0.187ms 1.33ms. Speedup = 7.123
Vectorized vs scalar (int8_t x 32): 0.0412ms 1.27ms. Speedup = 30.863
Vectorized vs scalar (int8_t x 32): 0.188ms 1.34ms. Speedup = 7.160
Vectorized vs scalar (uint16_t x 16): 0.0407ms 0.641ms. Speedup = 15.759
Vectorized vs scalar (uint16_t x 16): 0.12ms 0.628ms. Speedup = 5.243
Vectorized vs scalar (int16_t x 16): 0.0415ms 0.655ms. Speedup = 15.774
Vectorized vs scalar (int16_t x 16): 0.119ms 0.674ms. Speedup = 5.642
Vectorized vs scalar (uint32_t x 8): 0.0391ms 0.315ms. Speedup = 8.061
Vectorized vs scalar (uint32_t x 8): 0.119ms 0.302ms. Speedup = 2.540
Vectorized vs scalar (int32_t x 8): 0.0401ms 0.317ms. Speedup = 7.904
Vectorized vs scalar (int32_t x 8): 0.121ms 0.305ms. Speedup = 2.531
Vectorized vs scalar (float x 8): 0.0478ms 0.307ms. Speedup = 6.420
Vectorized vs scalar (float x 8): 0.15ms 0.328ms. Speedup = 2.191
Vectorized vs scalar (double x 4): 0.0497ms 0.144ms. Speedup = 2.906
Vectorized vs scalar (double x 4): 0.0986ms 0.141ms. Speedup = 1.433

Up to 16-bit, there's no change (as expected, the code should not change). For 32-bit types and larger, the improvement is 2-3x.

This also really depends on the content of the loop. If the body of the loop is expensive, this can make a huge (positive) difference. I have spent a lot of time trying to write schedules so that the expensive work is computed in a separate stage (with TailStrategy::RoundUp), but that adds its own overhead; I haven't been able to match the performance of this branch, and it requires a lot of programmer effort that shouldn't be necessary.

@abadams (Member) commented Mar 24, 2021

I'll see if I can recreate the effect...

@abadams (Member) commented Mar 24, 2021

OK, I remember what the problem was. AVX/AVX2's vpmaskmov is a predicated store and it's fine. However, maskmovq and maskmovdqu (the pre-AVX versions) are also non-temporal stores, so it's a disaster for locality if you load the stored data again immediately. Quoting from the Intel docs on vpmaskmov:

"Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions"

However, I can no longer get LLVM to emit those non-temporal variants. It seems to just scalarize pre-AVX, and it doesn't want to emit them for narrow vectors with AVX on. Previously I was seeing them used even on AVX machines, and that's what was causing the nasty stalls.

So I think the problem I encountered has been fixed on the llvm side.

@dsharletg (Contributor, Author)

Closing in favor of #5856. We can't assume predicated loads/stores are always faster.

@dsharletg dsharletg closed this Mar 26, 2021
@dsharletg dsharletg deleted the dsharletg/predicate-stores2 branch March 26, 2021 02:52
@alexreinking alexreinking modified the milestone: v12.0.0 May 19, 2021