64 bit CUDA copy routines via GGML_CUDA_ALLOW_LARGE_TENSORS #15298
Disclaimer: I couldn't code my way out of a wet paper bag in C++. This is 100% vibe coded AI slop. Upstream issue is #15049
What?
This PR adds `GGML_CUDA_ALLOW_LARGE_TENSORS`. When enabled, it allows 64-bit sizes in the CUDA copy routines.

Q. What is the difference between INT_MAX and `SIZE_MAX / 4`? How much larger a tensor will this accommodate?

A. The difference between INT_MAX and SIZE_MAX/4 is enormous:

INT_MAX: 2,147,483,647 bytes ≈ 2 GiB
SIZE_MAX/4: 4,611,686,018,427,387,903 bytes ≈ 4,294,967,296 GiB ≈ 4 EiB
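
To make the bound change concrete, here is a minimal sketch, not the actual ggml code: `copy_size_ok()` is a hypothetical helper and the real copy path is more involved. It only illustrates the guard moving from INT_MAX to `SIZE_MAX / 4` when the flag is defined:

```cpp
// Minimal sketch only -- not the real ggml CUDA copy path.
// copy_size_ok() is a hypothetical helper that illustrates the guard change.
#include <climits>
#include <cstdint>
#include <cstdio>

static bool copy_size_ok(size_t nbytes) {
#ifdef GGML_CUDA_ALLOW_LARGE_TENSORS
    // 64-bit path: accept copies up to SIZE_MAX / 4 bytes.
    return nbytes <= SIZE_MAX / 4;
#else
    // Legacy path: the byte count must fit in a signed 32-bit int.
    return nbytes <= (size_t) INT_MAX;
#endif
}

int main() {
    const size_t three_gib = 3ULL * 1024 * 1024 * 1024;  // 3 GiB, larger than INT_MAX bytes
    std::printf("3 GiB copy allowed: %s\n", copy_size_ok(three_gib) ? "yes" : "no");
}
```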
How?
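The build command isn't shown here. Assuming the new flag is exposed as a CMake option of the same name (an assumption on my part; check the diff for the exact spelling), configuring and building would look roughly like:

```sh
# Assumed configure step: GGML_CUDA is the usual CUDA switch,
# GGML_CUDA_ALLOW_LARGE_TENSORS is assumed to be a same-named CMake option.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_ALLOW_LARGE_TENSORS=ON
cmake --build build --config Release -j
```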
Then:
./build/bin/llama-server \
  --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
  --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
  --no-webui \
  --numa numactl \
  --threads 32 \
  --ctx-size 400000 \
  --n-gpu-layers 63 \
  -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
  -ot exps=CPU \
  -ub 4096 -b 4096 \
  --cache-type-k q4_1 \
  --cache-type-v q4_1 \
  --seed 3407 \
  --prio 3 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --repeat-penalty 1.05 \
  --min-p 0.0 \
  --log-colors \
  --flash-attn \
  --host 0.0.0.0 \
  --jinja \
  --port 11434
Why?
Cards with a lot of VRAM, like the Blackwell RTX 6000 Pro, may enable larger in-GPU context lengths than INT_MAX allows.
Results
This model starts out with 20-22 tok/s generation at 0 context, so that's pretty terrible performance. Still, when you absolutely, positively, MUST read a huge number of tokens, this may be a potential solution.