CUDA: add conv_2d_dw #14265


Merged · 4 commits merged into ggml-org:master on Jun 20, 2025

Conversation

am17an (Collaborator) commented on Jun 18, 2025:

Similar to the Vulkan implementation. Posting some performance numbers from test_backend_ops:

CPU

Device description: AMD Ryzen 7 3800XT 8-Core Processor
Device memory: 15968 MB (15968 MB free) 
CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=0):                16 runs - 101635.44 us/run -   524297 kB/run -    4.92 GB/s 
CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=1):                32 runs - 44502.06 us/run -   524297 kB/run -   11.94 GB/s

vs. CUDA

Device description: NVIDIA GeForce RTX 3090
Device memory: 24575 MB (23306 MB free)

CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=0):               576 runs -  1853.15 us/run -   524297 kB/run -  269.82 GB/s
CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=1):               512 runs -  2176.38 us/run -   524297 kB/run -  233.33 GB/s

Tests pass with test_backend_ops -o CONV_2D_DW; I also did some manual tests.
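For context, depthwise convolution applies one small KW×KH filter per channel, with no cross-channel mixing; the cwhn=0 case above benchmarks the default ggml layout (W fastest in memory, i.e. NCHW-like). Below is a minimal illustrative CUDA kernel for that layout, assuming batch size 1 — a sketch with made-up names, not the kernel added in this PR:

// Depthwise conv2d sketch, default (whcn / NCHW-like) layout, batch size 1.
// Launch with e.g. <<<(OW*OH*C + 255) / 256, 256>>>.
__global__ void conv2d_dw_whcn_sketch(const float * in, const float * kern, float * out,
                                      int W, int H, int C,      // input dims
                                      int KW, int KH,           // kernel dims
                                      int stride, int pad, int dil,
                                      int OW, int OH) {         // output dims
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= OW * OH * C) return;

    const int ox = idx % OW;
    const int oy = (idx / OW) % OH;
    const int c  = idx / (OW * OH);

    float acc = 0.0f;
    for (int ky = 0; ky < KH; ++ky) {
        for (int kx = 0; kx < KW; ++kx) {
            const int ix = ox * stride + kx * dil - pad;
            const int iy = oy * stride + ky * dil - pad;
            if (ix >= 0 && ix < W && iy >= 0 && iy < H) {
                // each channel is convolved only with its own KWxKH filter
                acc += in[(c * H + iy) * W + ix] * kern[(c * KH + ky) * KW + kx];
            }
        }
    }
    out[(c * OH + oy) * OW + ox] = acc;
}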

@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 18, 2025.
JohannesGaessler (Collaborator) commented:
Not that it was introduced in this PR, but is the logic in ggml_is_contiguous_channels correct? I think the comparisons should be >= rather than > to cover cases where one of the dimensions is 1. @Acly since you are the one who added that function, do you agree?
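For reference, a sketch of the check in question (assuming the ggml.c form at the time, with the strict > comparisons at issue):

// "channels-contiguous" means the channel stride nb[2] is the smallest,
// i.e. the cwhn/NHWC-style layout obtained by permuting a contiguous tensor
bool ggml_is_contiguous_channels(const struct ggml_tensor * tensor) {
    return
        tensor->nb[0] > tensor->nb[2] &&
        tensor->nb[1] > tensor->nb[0] &&
        tensor->nb[2] == ggml_type_size(tensor->type);
}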

am17an force-pushed the add_conv2d_dw_cuda branch from 790824c to d64ba79 on June 19, 2025 12:27.
Acly (Contributor) commented on Jun 19, 2025:

> Not that it was introduced in this PR, but is the logic in ggml_is_contiguous_channels correct?

I believe it does the intended thing, including for typical 1D tensors. I used > so that in the (perhaps unlikely) case of ambiguity when both spatial dimensions are 1, it is "biased" to report false and select the non-permuted kernel. Making both comparisons use >= would change that behavior.

// 1D tensor, channels most contiguous in memory
x = ggml_new_tensor_4d(ctx, GGML_TYPE_I8, C, W, 1, 1);
x = ggml_permute(ctx, x, 2, 0, 1, 3);   // -> ne = [W, 1, C, 1]
ggml_is_contiguous_channels(x);         // -> true, nb = [C, C*W, 1, C*W]

A tensor with a C × 1 × H × 1 memory layout will not work, though. For that case, changing only the nb1 > nb0 check to >= would enable it, without any downsides that I can see.
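Concretely, the relaxation discussed would look something like this (a sketch, not a committed change; only the nb[1] vs nb[0] comparison is made non-strict):

// sketch of the proposed tweak: >= on the nb[1] vs nb[0] comparison admits
// tensors whose spatial width is 1 (the C x 1 x H x 1 memory layout above)
bool ggml_is_contiguous_channels(const struct ggml_tensor * t) {
    return
        t->nb[0] >  t->nb[2] &&
        t->nb[1] >= t->nb[0] &&   // was: t->nb[1] > t->nb[0]
        t->nb[2] == ggml_type_size(t->type);
}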

am17an merged commit 9eaa51e into ggml-org:master on Jun 20, 2025.
47 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jun 20, 2025
* CUDA: add conv_2d_dw

* better naming

* simplify using template

* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 20, 2025
* mamba2-sync: (24 commits)
sync : ggml
Add `ggml_roll` (ggml/1274)
docs : fix the link to llama.h (ggml-org#14293)
CUDA: add conv_2d_transpose (ggml-org#14287)
lint : remove trailing whitepace (ggml-org#14304)
vocab : prevent tokenizer overflow (ggml-org#14301)
sycl: add usage of enqueue_functions extension (ggml-org#14244)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (ggml-org#14286)
llama : improve sep token handling (ggml-org#14272)
cuda : synchronize graph capture and cublas handle destruction (ggml-org#14288)
ggml : fix repack work size for mul_mat_id (ggml-org#14292)
ggml: Update KleidiAI to v1.9.0 (ggml-org#14277)
model : more uniform output id handling (ggml-org#14275)
ubatch : new splitting logic (ggml-org#14217)
CUDA: add conv_2d_dw (ggml-org#14265)
ggml-cpu : remove unnecesary arm feature detection (ggml-org#14281)
gguf-py : make sentencepiece optional (ggml-org#14200)
server : add server parameters for draft model cache type (ggml-org#13782)
build : suppress gcc15 compile warnings (ggml-org#14261)
sycl: Cleanup codepaths in Get Rows in sycl backend (ggml-org#14215)
...