[LoopInterchange] Cost model more dedicated to vectorization #131130

Open
@kasuga-fj

Description

LoopInterchange can expose vectorization opportunities in some cases. However, the current implementation of LoopInterchange gives little consideration to vectorization. There are several issues in the LoopInterchange cost model that need to be addressed to increase vectorization opportunities.

First, the LoopInterchange cost model consists of several individual decision rules. They are applied one at a time, and a rule applied earlier has higher priority. In the current implementation, the rule based on CacheCostAnalysis has the highest priority, and the rule for vectorization has the lowest. However, there are cases where it is profitable to interchange the loops for vectorization even if doing so is detrimental to cache behavior. For example, interchanging the inner two loops in the following example appears to make it about 3x faster on my local machine (compiled with -O3 -mcpu=neoverse-v2 -mllvm -cache-line-size=64).

__attribute__((aligned(64))) float aa[256][256],bb[256][256],cc[256][256],dd[256][256],ee[256][256],ff[256][256],gg[256][256];

// Alternative version of TSVC s231 with more array accesses than the original.
void s231_alternative() {
  for (int nl = 0; nl < 100*(100000/256); nl++) {
    for (int i = 0; i < 256; ++i) {
      for (int j = 1; j < 256; j++) {
        aa[j][i] = aa[j - 1][i] + bb[j][i] + cc[i][j] + dd[i][j] + ff[i][j] + gg[i][j];
      }
    }
  }
}
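To make the transformation concrete, here is a minimal sketch of the inner two loops before and after interchange. The function names, the fill pattern, and the comparison helper are mine for illustration only; the repetition loop over nl is omitted because it does not affect the result. After interchange the i-loop is innermost, so the aa[j - 1][i] dependence is carried by the outer loop and aa and bb are accessed contiguously, while cc, dd, ff and gg become strided, which is why a cache-based rule would reject it.

```c
#include <string.h>

#define N 256

static float aa[N][N], bb[N][N], cc[N][N], dd[N][N], ff[N][N], gg[N][N];

/* Hypothetical deterministic fill so both loop orders start from the same data. */
static void fill_inputs(void) {
  for (int r = 0; r < N; ++r) {
    for (int c = 0; c < N; ++c) {
      aa[r][c] = 0.25f * (float)((r + c) % 7);
      bb[r][c] = 0.25f * (float)((r + 2 * c) % 5);
      cc[r][c] = 0.25f * (float)((2 * r + c) % 3);
      dd[r][c] = 0.25f * (float)((r * c) % 4);
      ff[r][c] = 0.25f * (float)((r + 3 * c) % 6);
      gg[r][c] = 0.25f * (float)((3 * r + c) % 2);
    }
  }
}

/* Original order: i outer, j inner. The inner j-loop carries the
   aa[j - 1][i] -> aa[j][i] dependence, which prevents vectorizing it. */
static void s231_original_order(void) {
  for (int i = 0; i < N; ++i)
    for (int j = 1; j < N; ++j)
      aa[j][i] = aa[j - 1][i] + bb[j][i] + cc[i][j] + dd[i][j] + ff[i][j] + gg[i][j];
}

/* Interchanged order: j outer, i inner. The dependence is now carried by
   the outer loop, so iterations of the inner i-loop are independent. */
static void s231_interchanged(void) {
  for (int j = 1; j < N; ++j)
    for (int i = 0; i < N; ++i)
      aa[j][i] = aa[j - 1][i] + bb[j][i] + cc[i][j] + dd[i][j] + ff[i][j] + gg[i][j];
}

/* Returns 1 if both loop orders leave identical contents in aa. */
static int interchange_preserves_result(void) {
  static float expected[N][N];
  fill_inputs();
  s231_original_order();
  memcpy(expected, aa, sizeof aa);
  fill_inputs();
  s231_interchanged();
  return memcmp(expected, aa, sizeof aa) == 0;
}
```

Each element is computed by the same expression in both versions, so the helper only checks that the reordering is legal; it says nothing about which order is faster.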

Next, the vectorization rule in the cost model appears to have a bug. For example, in the following case isProfitableForVectorization returns false even though interchanging the loops is necessary in order to vectorize the innermost loop.

__attribute__((aligned(64))) float aa[256][256],bb[256][256],cc[256][256],dd[256][256],ee[256][256];

void f() {
  for (int i = 0; i < 256; ++i) {
    for (int j = 1; j < 256; j++) {
      cc[i][j] *= dd[i][j] + ee[i][j];
      aa[j][i] = aa[j-1][i] + bb[j][i];
    }
  }
}

See also https://godbolt.org/z/f8TW9dG89. The debug output Cost = 0 means that the profitability decision is delegated to isProfitableForVectorization.
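To illustrate why interchange is needed for f(), here is a small sketch based on textbook dependence distance vectors. It is not how isProfitableForVectorization is implemented; all names are hypothetical. In f(), the aa statement has distance (0, 1) over (i, j), and the cc statement is loop-independent with distance (0, 0). A dependence is carried by the first loop level with a nonzero distance, and the innermost loop is a vectorization candidate only if it carries none.

```c
#include <stddef.h>

#define DEPTH 2 /* two-deep loop nest: level 0 is outer, level 1 is inner */

/* Loop level that carries the dependence (first nonzero distance),
   or -1 if the dependence is loop-independent. */
static int carrying_level(const int dist[DEPTH]) {
  for (int l = 0; l < DEPTH; ++l)
    if (dist[l] != 0)
      return l;
  return -1;
}

/* Returns 1 if no dependence is carried by the innermost loop. */
static int innermost_is_dependence_free(int deps[][DEPTH], size_t n) {
  for (size_t d = 0; d < n; ++d)
    if (carrying_level(deps[d]) == DEPTH - 1)
      return 0;
  return 1;
}

/* Model loop interchange by swapping the two levels of every vector. */
static void interchange(int deps[][DEPTH], size_t n) {
  for (size_t d = 0; d < n; ++d) {
    int tmp = deps[d][0];
    deps[d][0] = deps[d][1];
    deps[d][1] = tmp;
  }
}
```

With the vectors {0, 0} and {0, 1}, the innermost loop carries a dependence before the swap and none after it, which matches the expectation that the rule should report the interchange as profitable here.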

Based on the above, I suggest the following so that LoopInterchange can make profitability decisions geared toward vectorization.

  • Add a new option to give higher priority to the vectorization profitability decision in the cost model.
  • Fix the bug in the vectorization profitability decision.
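As a rough sketch of the first item, the rule chain can be modeled as an ordered list in which the first rule with an opinion decides. Everything below is hypothetical (the Candidate fields, the rule functions, and the two orderings) and only models the priority mechanism, not the actual LLVM code or option name.

```c
#include <stddef.h>

/* Hypothetical tri-state verdict returned by a single rule. */
typedef enum { UNDECIDED, PROFITABLE, NOT_PROFITABLE } Verdict;

/* Hypothetical summary of one loop pair; a real pass would compute these. */
typedef struct {
  int cache_cost_delta;    /* < 0: interchange helps locality, > 0: hurts, 0: no opinion */
  int inner_vec_before;    /* 1 if the inner loop is vectorizable as written */
  int inner_vec_after;     /* 1 if the inner loop is vectorizable after interchange */
} Candidate;

typedef Verdict (*Rule)(const Candidate *);

static Verdict cache_rule(const Candidate *c) {
  if (c->cache_cost_delta < 0) return PROFITABLE;
  if (c->cache_cost_delta > 0) return NOT_PROFITABLE;
  return UNDECIDED;
}

static Verdict vectorization_rule(const Candidate *c) {
  if (c->inner_vec_after && !c->inner_vec_before) return PROFITABLE;
  if (!c->inner_vec_after && c->inner_vec_before) return NOT_PROFITABLE;
  return UNDECIDED;
}

/* Apply rules in order; the first rule that is not UNDECIDED wins.
   If no rule has an opinion, do not interchange. */
static int should_interchange(const Rule *rules, size_t n, const Candidate *c) {
  for (size_t k = 0; k < n; ++k) {
    Verdict v = rules[k](c);
    if (v != UNDECIDED)
      return v == PROFITABLE;
  }
  return 0;
}

/* Current ordering versus the ordering the proposed option would select. */
static const Rule cache_first[] = {cache_rule, vectorization_rule};
static const Rule vectorization_first[] = {vectorization_rule, cache_rule};
```

For a case like s231_alternative (locality gets worse, but the inner loop becomes vectorizable), cache_first rejects the interchange and vectorization_first accepts it. For a case like f(), where the cache rule has no opinion, both orderings defer to the vectorization rule, so its correctness matters regardless of the option.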

Any comments are welcome, thanks.
