Labels: compiler:simd (instruction-level vectorization), performance (Must go faster)
Description
A performance discrepancy between Julia and clang implementations of a pji-ordered, triple-nested-loop gemm is evident in https://github.com/Sacha0/TripleNestedLoopDemo.jl. Though I haven't had the bandwidth to check yet, I suspect the discrepancy comes from vectorization differences. Repro code:
function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
        j in 1:size(C, 2),
        i in 1:size(C, 1)
        @inbounds C[i, j] += A[i, p] * B[p, j]
    end
    return C
end
gemm_pji_ccode = """
void gemm_pji_c(double* C, double* A, double* B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";
using Libdl
const CGemmLib = tempname()
open(`clang -fPIC -O3 -xc -shared -o $(CGemmLib * "." * Libdl.dlext) -`, "w") do f
    print(f, gemm_pji_ccode)
end
gemm_pji_c!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}) =
    (ccall(("gemm_pji_c", CGemmLib), Cvoid,
           (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Cint, Cint, Cint),
           C, A, B, size(C, 1), size(A, 2), size(C, 2)); return C)
using Test
let
    m, n, k = 48*3, 48*2, 48
    C = rand(m, n)
    A = rand(m, k)
    B = rand(k, n)
    Cref = A * B
    @test gemm_pji!(fill!(C, 0), A, B) ≈ Cref
    @test gemm_pji_c!(fill!(C, 0), A, B) ≈ Cref
end
using BenchmarkTools
mnk = 48;
A = rand(mnk, mnk);
B = rand(mnk, mnk);
C = rand(mnk, mnk);
@benchmark gemm_pji!($C, $A, $B)
@benchmark gemm_pji_c!($C, $A, $B)
yielding
julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.311 μs (0.00% GC)
  median time:      66.397 μs (0.00% GC)
  mean time:        67.214 μs (0.00% GC)
  maximum time:     244.067 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.923 μs (0.00% GC)
  median time:      32.973 μs (0.00% GC)
  mean time:        34.007 μs (0.00% GC)
  maximum time:     133.430 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
i.e. almost precisely a factor-of-two discrepancy (66.3 μs versus 32.9 μs minimum time). Best!
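For anyone who wants to look at the C side in isolation, here is a minimal standalone sketch of the kernel from the repro plus a sanity check. The kernel body is copied from above; `gemm_ref` and `gemm_pji_max_diff` are hypothetical helpers introduced only for this illustration and are not part of the report or the linked repository. The sketch assumes column-major storage (element (i, j) of an m-by-n matrix at offset `j*m + i`), which matches Julia's `Matrix{Float64}` layout, and it uses small integer-valued inputs so that both loop orders produce exactly the same sums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Kernel from the repro: pji loop order, column-major storage. */
void gemm_pji_c(double* C, double* A, double* B, int m, int k, int n) {
    for (int p = 0; p < k; ++p)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}

/* Hypothetical reference: dot-product (ijp) order, same column-major layout.
   Overwrites C rather than accumulating into it. */
void gemm_ref(double* C, double* A, double* B, int m, int k, int n) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[p*m + i] * B[j*k + p];
            C[j*m + i] = acc;
        }
}

/* Hypothetical self-check: largest absolute difference between the two
   kernels for an (m x k) * (k x n) product, or -1.0 if allocation fails. */
double gemm_pji_max_diff(int m, int k, int n) {
    assert(m > 0 && k > 0 && n > 0);
    size_t na = (size_t)m * (size_t)k;
    size_t nb = (size_t)k * (size_t)n;
    size_t nc = (size_t)m * (size_t)n;
    double* A  = malloc(na * sizeof *A);
    double* B  = malloc(nb * sizeof *B);
    double* C1 = calloc(nc, sizeof *C1); /* pji kernel accumulates, so start at zero */
    double* C2 = calloc(nc, sizeof *C2);
    double worst = -1.0;
    if (A && B && C1 && C2) {
        /* Small integer values keep every product and partial sum exact. */
        for (size_t t = 0; t < na; ++t) A[t] = (double)((t * 7 + 3) % 11);
        for (size_t t = 0; t < nb; ++t) B[t] = (double)((t * 5 + 1) % 13);
        gemm_pji_c(C1, A, B, m, k, n);
        gemm_ref(C2, A, B, m, k, n);
        worst = 0.0;
        for (size_t t = 0; t < nc; ++t) {
            double d = C1[t] - C2[t];
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    }
    free(A); free(B); free(C1); free(C2);
    return worst;
}
```

Calling `gemm_pji_max_diff(48, 48, 48)` from a small `main` returns 0 when the indexing is consistent. Compiling the same file with `clang -O3 -Rpass=loop-vectorize` is one way to see which loops clang chose to vectorize, which may help narrow down whether vectorization explains the gap.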