likely vectorization discrepancy between julia and clang triple-nested-loop gemms #29445

@Sacha0

A performance discrepancy between julia and clang pji-ordered triple-nested-loop gemm implementations is evident in https://github.com/Sacha0/TripleNestedLoopDemo.jl. Though I haven't had the bandwidth to check yet, I suspect the discrepancy comes from vectorization differences. Repro code:

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
        @inbounds C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* C, double* A, double* B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

using Libdl
const CGemmLib = tempname()
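# Compile the C source (fed to clang on stdin) into a shared library at CGemmLib plus the platform extension
open(`clang -fPIC -O3 -xc -shared -o $(CGemmLib * "." * Libdl.dlext) -`, "w") do f
    print(f, gemm_pji_ccode)
end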

gemm_pji_c!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}) =
    (ccall(("gemm_pji_c", CGemmLib), Cvoid,
            (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Cint, Cint, Cint),
            C, A, B, size(C, 1), size(A, 2), size(C, 2)); return C)

using Test
let
    m, n, k = 48*3, 48*2, 48
    C = rand(m, n)
    A = rand(m, k)
    B = rand(k, n)
    Cref = A * B
    @test gemm_pji!(fill!(C, 0), A, B) ≈ Cref
    @test gemm_pji_c!(fill!(C, 0), A, B) ≈ Cref
end

using BenchmarkTools
mnk = 48;
A = rand(mnk, mnk);
B = rand(mnk, mnk);
C = rand(mnk, mnk);
@benchmark gemm_pji!($C, $A, $B)
@benchmark gemm_pji_c!($C, $A, $B)

yielding

julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.311 μs (0.00% GC)
  median time:      66.397 μs (0.00% GC)
  mean time:        67.214 μs (0.00% GC)
  maximum time:     244.067 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.923 μs (0.00% GC)
  median time:      32.973 μs (0.00% GC)
  mean time:        34.007 μs (0.00% GC)
  maximum time:     133.430 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

i.e. almost precisely a factor of two discrepancy (minimum time 66.311 μs for julia versus 32.923 μs for clang). Best!
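
One way the vectorization hypothesis might be probed is sketched below. This was not run as part of the report above: `gemm_pji_simd!` and `has_vector_ir` are hypothetical names introduced here for illustration. The first is the same pji-ordered loop with an explicit `@simd` on the innermost loop, and the second merely searches the optimized LLVM IR for a vector-of-double type. Seeing such a type is only a rough signal and says nothing about vector width or unrolling, so it would not by itself explain the factor of two.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL
using Test

# Hypothetical variant of gemm_pji!: same pji ordering, but the innermost
# loop is written separately so that @simd can be applied to it.
function gemm_pji_simd!(C, A, B)
    for p in 1:size(A, 2), j in 1:size(C, 2)
        b = @inbounds B[p, j]  # loop-invariant in i, hoisted by hand
        @inbounds @simd for i in 1:size(C, 1)
            C[i, j] += A[i, p] * b
        end
    end
    return C
end

# Hypothetical helper: returns true if the optimized LLVM IR for f mentions
# a vector-of-double type such as <4 x double>.
function has_vector_ir(f, argtypes)
    ir = sprint(io -> code_llvm(io, f, argtypes))
    return occursin(r"<\d+ x double>", ir)
end

A = rand(48, 48)
B = rand(48, 48)
C = zeros(48, 48)
@test gemm_pji_simd!(C, A, B) ≈ A * B

MatTypes = Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}
println("vector IR in @simd variant: ", has_vector_ir(gemm_pji_simd!, MatTypes))
```

Applying `has_vector_ir` to `gemm_pji!` from the repro and benchmarking `gemm_pji_simd!` with the same `@benchmark` call would then show whether the gap to clang narrows.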
