CUDA: add conv_2d_dw #14265


Merged · 4 commits merged into ggml-org:master on Jun 20, 2025

Conversation

am17an (Collaborator) commented on Jun 18, 2025:

Similar to the Vulkan implementation. Posting some performance numbers from test_backend_ops:

CPU

Device description: AMD Ryzen 7 3800XT 8-Core Processor
Device memory: 15968 MB (15968 MB free) 
CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=0):                16 runs - 101635.44 us/run -   524297 kB/run -    4.92 GB/s 
CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=1):                32 runs - 44502.06 us/run -   524297 kB/run -   11.94 GB/s

vs. CUDA

Device description: NVIDIA GeForce RTX 3090
Device memory: 24575 MB (23306 MB free)

CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=0):               576 runs -  1853.15 us/run -   524297 kB/run -  269.82 GB/s
CONV_2D_DW(ne_input=[512,512,256,1],ne_kernel=[3,3,1,256],stride=1,padding=1,dilation=1,cwhn=1):               512 runs -  2176.38 us/run -   524297 kB/run -  233.33 GB/s

Tests pass with test_backend_ops -o CONV_2D_DW; I also did some manual tests.
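For context, depthwise convolution applies one small KW×KH filter per channel, with no cross-channel mixing; the cwhn=0 case above benchmarks the default ggml layout (W fastest in memory, i.e. NCHW-like). Below is a minimal illustrative CUDA kernel for that layout, assuming batch size 1 — a sketch with made-up names, not the kernel added in this PR:

// Depthwise conv2d sketch, default (whcn / NCHW-like) layout, batch size 1.
// Launch with e.g. <<<(OW*OH*C + 255) / 256, 256>>>.
__global__ void conv2d_dw_whcn_sketch(const float * in, const float * kern, float * out,
                                      int W, int H, int C,      // input dims
                                      int KW, int KH,           // kernel dims
                                      int stride, int pad, int dil,
                                      int OW, int OH) {         // output dims
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= OW * OH * C) return;

    const int ox = idx % OW;
    const int oy = (idx / OW) % OH;
    const int c  = idx / (OW * OH);

    float acc = 0.0f;
    for (int ky = 0; ky < KH; ++ky) {
        for (int kx = 0; kx < KW; ++kx) {
            const int ix = ox * stride + kx * dil - pad;
            const int iy = oy * stride + ky * dil - pad;
            if (ix >= 0 && ix < W && iy >= 0 && iy < H) {
                // each channel is convolved only with its own KWxKH filter
                acc += in[(c * H + iy) * W + ix] * kern[(c * KH + ky) * KW + kx];
            }
        }
    }
    out[(c * OH + oy) * OW + ox] = acc;
}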

@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 18, 2025.
JohannesGaessler (Collaborator) commented:
Not that it was introduced in this PR, but is the logic in ggml_is_contiguous_channels correct? I think the comparisons should be >= rather than > to cover cases where one of the dimensions is 1. @Acly since you are the one who added that function, do you agree?
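For reference, a sketch of the check in question (assuming the ggml.c form at the time, with the strict > comparisons at issue):

// "channels-contiguous" means the channel stride nb[2] is the smallest,
// i.e. the cwhn/NHWC-style layout obtained by permuting a contiguous tensor
bool ggml_is_contiguous_channels(const struct ggml_tensor * tensor) {
    return
        tensor->nb[0] > tensor->nb[2] &&
        tensor->nb[1] > tensor->nb[0] &&
        tensor->nb[2] == ggml_type_size(tensor->type);
}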

am17an force-pushed the add_conv2d_dw_cuda branch from 790824c to d64ba79 on June 19, 2025 12:27.
Acly (Contributor) commented on Jun 19, 2025:

> Not that it was introduced in this PR, but is the logic in ggml_is_contiguous_channels correct?

I believe it does the intended thing, including for typical 1D tensors. I used > so that in the (perhaps unlikely) case of ambiguity when both spatial dimensions are 1, it is "biased" to report false and select the non-permuted kernel. Making both comparisons use >= would change that behavior.

// 1D tensor, channels most contiguous in memory
x = ggml_new_tensor_4d(ctx, GGML_TYPE_I8, C, W, 1, 1);
x = ggml_permute(ctx, x, 2, 0, 1, 3);   // -> ne = [W, 1, C, 1]
ggml_is_contiguous_channels(x);         // -> true, nb = [C, C*W, 1, C*W]

A tensor with a C × 1 × H × 1 memory layout will not work, though. For that case, changing only the nb1 > nb0 check to >= would enable it, without any downsides that I can see.
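Concretely, the relaxation discussed would look something like this (a sketch, not a committed change; only the nb[1] vs nb[0] comparison is made non-strict):

// sketch of the proposed tweak: >= on the nb[1] vs nb[0] comparison admits
// tensors whose spatial width is 1 (the C x 1 x H x 1 memory layout above)
bool ggml_is_contiguous_channels(const struct ggml_tensor * t) {
    return
        t->nb[0] >  t->nb[2] &&
        t->nb[1] >= t->nb[0] &&   // was: t->nb[1] > t->nb[0]
        t->nb[2] == ggml_type_size(t->type);
}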

am17an merged commit 9eaa51e into ggml-org:master on Jun 20, 2025.
47 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jun 20, 2025
* CUDA: add conv_2d_dw

* better naming

* simplify using template

* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 20, 2025
* mamba2-sync: (24 commits)
sync : ggml
Add `ggml_roll` (ggml/1274)
docs : fix the link to llama.h (ggml-org#14293)
CUDA: add conv_2d_transpose (ggml-org#14287)
lint : remove trailing whitepace (ggml-org#14304)
vocab : prevent tokenizer overflow (ggml-org#14301)
sycl: add usage of enqueue_functions extension (ggml-org#14244)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (ggml-org#14286)
llama : improve sep token handling (ggml-org#14272)
cuda : synchronize graph capture and cublas handle destruction (ggml-org#14288)
ggml : fix repack work size for mul_mat_id (ggml-org#14292)
ggml: Update KleidiAI to v1.9.0 (ggml-org#14277)
model : more uniform output id handling (ggml-org#14275)
ubatch : new splitting logic (ggml-org#14217)
CUDA: add conv_2d_dw (ggml-org#14265)
ggml-cpu : remove unnecesary arm feature detection (ggml-org#14281)
gguf-py : make sentencepiece optional (ggml-org#14200)
server : add server parameters for draft model cache type (ggml-org#13782)
build : suppress gcc15 compile warnings (ggml-org#14261)
sycl: Cleanup codepaths in Get Rows in sycl backend (ggml-org#14215)
...