likely vectorization discrepancy between julia and clang triple-nested-loop gemms

A performance discrepancy between Julia and clang implementations of a pji-ordered triple-nested-loop gemm is evident in https://github.com/Sacha0/TripleNestedLoopDemo.jl. Though I haven't had the bandwidth to check yet, I suspect the discrepancy comes from vectorization differences. Repro code:
```julia
function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
        @inbounds C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* C, double* A, double* B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

using Libdl
# Compile the C source (piped in via stdin) into a shared library at a temporary path.
const CGemmLib = tempname()
open(`clang -fPIC -O3 -xc -shared -o $(CGemmLib * "." * Libdl.dlext) -`, "w") do f
    print(f, gemm_pji_ccode)
end

gemm_pji_c!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}) =
    (ccall(("gemm_pji_c", CGemmLib), Cvoid,
            (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Cint, Cint, Cint),
            C, A, B, size(C, 1), size(A, 2), size(C, 2)); return C)

using Test
let
    m, n, k = 48*3, 48*2, 48
    C = rand(m, n)
    A = rand(m, k)
    B = rand(k, n)
    Cref = A * B
    @test gemm_pji!(fill!(C, 0), A, B) ≈ Cref
    @test gemm_pji_c!(fill!(C, 0), A, B) ≈ Cref
end

using BenchmarkTools
mnk = 48;
A = rand(mnk, mnk);
B = rand(mnk, mnk);
C = rand(mnk, mnk);
@benchmark gemm_pji!($C, $A, $B)
@benchmark gemm_pji_c!($C, $A, $B)
```
yielding
```
julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.311 μs (0.00% GC)
  median time:      66.397 μs (0.00% GC)
  mean time:        67.214 μs (0.00% GC)
  maximum time:     244.067 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.923 μs (0.00% GC)
  median time:      32.973 μs (0.00% GC)
  mean time:        34.007 μs (0.00% GC)
  maximum time:     133.430 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
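i.e. almost precisely a factor of two discrepancy: the Julia version takes about 66 μs at minimum, against about 33 μs for the clang-compiled version. Best!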
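
One way to start checking the vectorization suspicion is to look at the LLVM IR Julia generates for the kernel and see whether it contains vector operations on doubles. The following is only a minimal sketch: the regex check for `<N x double>` is a rough heuristic of my own, not a definitive test, and the names `ir` and `has_vector_ops` are just illustrative. Finding vector types would only show that some vectorization happened, not that it matches what clang does (vector width, unrolling, and runtime alias checks may all differ).

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
        @inbounds C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

# Capture the optimized LLVM IR for the Float64 matrix method as a string.
argtypes = (Matrix{Float64}, Matrix{Float64}, Matrix{Float64})
ir = sprint(io -> code_llvm(io, gemm_pji!, argtypes))

# Heuristic (an assumption, not a guarantee): vectorized code operates on
# LLVM vector types such as `<4 x double>`.
has_vector_ops = occursin(r"<\d+ x double>", ir)
println("vector double operations present: ", has_vector_ops)
```

The same IR could then be compared by eye against what clang emits for the C kernel, for instance by compiling it with `-O3 -S -emit-llvm`.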
