64 bit CUDA copy routines via GGML_CUDA_ALLOW_LARGE_TENSORS #15298
Disclaimer: I couldn't code my way out of a wet paper bag in C++. This is 100% vibe coded AI slop. Upstream issue is #15049
What?
This PR adds `GGML_CUDA_ALLOW_LARGE_TENSORS`. When enabled, it allows 64-bit sizes in the CUDA copy routines.

Q. What is the difference between INT_MAX and `SIZE_MAX / 4`? How much larger a tensor will this accommodate?

A. The difference between INT_MAX and SIZE_MAX/4 is enormous:

INT_MAX: 2,147,483,647 bytes ≈ 2 GiB
SIZE_MAX/4: 4,611,686,018,427,387,903 bytes ≈ 4,294,967,296 GiB ≈ 4 EiB
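
To make the bound change concrete, here is a minimal sketch, not the actual ggml code: `copy_size_ok()` is a hypothetical helper and the real copy path is more involved. It only illustrates the guard moving from INT_MAX to `SIZE_MAX / 4` when the flag is defined:

```cpp
// Minimal sketch only -- not the real ggml CUDA copy path.
// copy_size_ok() is a hypothetical helper that illustrates the guard change.
#include <climits>
#include <cstdint>
#include <cstdio>

static bool copy_size_ok(size_t nbytes) {
#ifdef GGML_CUDA_ALLOW_LARGE_TENSORS
    // 64-bit path: accept copies up to SIZE_MAX / 4 bytes.
    return nbytes <= SIZE_MAX / 4;
#else
    // Legacy path: the byte count must fit in a signed 32-bit int.
    return nbytes <= (size_t) INT_MAX;
#endif
}

int main() {
    const size_t three_gib = 3ULL * 1024 * 1024 * 1024;  // 3 GiB, larger than INT_MAX bytes
    std::printf("3 GiB copy allowed: %s\n", copy_size_ok(three_gib) ? "yes" : "no");
}
```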
How?
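The build command isn't shown here. Assuming the new flag is exposed as a CMake option of the same name (an assumption on my part; check the diff for the exact spelling), configuring and building would look roughly like:

```sh
# Assumed configure step: GGML_CUDA is the usual CUDA switch,
# GGML_CUDA_ALLOW_LARGE_TENSORS is assumed to be a same-named CMake option.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_ALLOW_LARGE_TENSORS=ON
cmake --build build --config Release -j
```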
Then:
./build/bin/llama-server \
  --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
  --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
  --no-webui \
  --numa numactl \
  --threads 32 \
  --ctx-size 400000 \
  --n-gpu-layers 63 \
  -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
  -ot exps=CPU \
  -ub 4096 -b 4096 \
  --cache-type-k q4_1 \
  --cache-type-v q4_1 \
  --seed 3407 \
  --prio 3 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --repeat-penalty 1.05 \
  --min-p 0.0 \
  --log-colors \
  --flash-attn \
  --host 0.0.0.0 \
  --jinja \
  --port 11434
Why?
Cards with a lot of VRAM, like the Blackwell RTX 6000 Pro, may enable larger in-GPU context lengths than INT_MAX allows.
Results
This model starts out with 20-22 tok/s generation at 0 context, so that's pretty terrible performance. Still, when you absolutely, positively, MUST read a huge number of tokens, this may be a potential solution.