Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 6087 (c3eb159)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA RTX PRO 6000 Blackwell Workstation Edition
AMD EPYC 9355
Models
Qwen3-Coder-480B-A35B-Instruct-1M-GGUF
Problem description & steps to reproduce
./build/bin/llama-server \
--model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
--alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
--no-webui \
--numa numactl \
--threads 32 \
--ctx-size 400000 \
--n-gpu-layers 63 \
-ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
-ot exps=CPU \
-ub 4096 -b 4096 \
--cache-type-k q4_1 \
--cache-type-v q4_1 \
--seed 3407 \
--prio 3 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--repeat-penalty 1.05 \
--min-p 0.0 \
--log-colors \
--flash-attn \
--host 0.0.0.0 \
--jinja \
--port 11434
Feed it a prompt longer than 131072 tokens of context and the server aborts: once prompt processing reaches n_past = 131072, it fails the GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) check in ggml-cuda/cpy.cu and dumps core (full backtrace below).
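
For illustration, here is a minimal, standalone sketch of the size arithmetic behind that assertion. The tensor shape is an assumption (the log does not print the offending tensor's dimensions); the point is simply that a single contiguous f32 copy spanning a 4096-token micro-batch (-ub 4096) across a context just past 131072 positions crosses the 2 GiB INT_MAX limit that ggml_cuda_cpy enforces.

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Illustrative shape only: one f32 activation covering a 4096-token
    // micro-batch against 131073 cached positions.
    const int64_t ne0 = 4096;     // micro-batch size (-ub 4096)
    const int64_t ne1 = 131073;   // one position past the 131072 boundary
    const int64_t nbytes = ne0 * ne1 * (int64_t) sizeof(float);

    printf("copy size: %lld bytes, INT_MAX: %d\n", (long long) nbytes, INT_MAX);
    // 4096 * 131073 * 4 = 2147500032 > 2147483647, so copying a tensor of
    // this size would trip GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX)
    // in ggml-cuda/cpy.cu.
    return 0;
}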
First Bad Commit
No response
Relevant log output
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 131072, n_tokens = 4096, progress = 0.492306
/home/jesse/llama.cpp/ggml/src/ggml-cuda/cpy.cu:285: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
/home/jesse/llama.cpp/build/bin/libggml-base.so(+0x1594b)[0x7b7943a9294b]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21c)[0x7b7943a92dac]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x15b)[0x7b7943a92f8b]
/home/jesse/llama.cpp/build/bin/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_b+0xa62)[0x7b7940c9dcb2]
/home/jesse/llama.cpp/build/bin/libggml-cuda.so(+0xeed58)[0x7b7940ceed58]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x463)[0x7b7943aaab13]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7b794389c0e1]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x104)[0x7b794389d794]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x3bd)[0x7b79438a35dd]
/home/jesse/llama.cpp/build/bin/libllama.so(llama_decode+0xf)[0x7b79438a453f]
./build/bin/llama-server(+0xc1bbe)[0x619405c79bbe]
./build/bin/llama-server(+0x879e5)[0x619405c3f9e5]
./build/bin/llama-server(+0x4ef0e)[0x619405c06f0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7b794302a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7b794302a28b]
./build/bin/llama-server(+0x50f35)[0x619405c08f35]
Aborted (core dumped)