Triton's stream-K GEMM performance still lags behind XeTLA's. We need to investigate the XeTLA implementation and optimize the Triton implementation accordingly.