Name and Version
$ ./llama-cli --version
version: 9279 (52be242ad)
built with GNU 13.3.0 for Linux x86_64
Built in Docker with:
-DGGML_CUDA=ON
-DGGML_CUDA_FA_ALL_QUANTS=ON
-DGGML_CUDA_NCCL=ON
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
./llama-server
--jinja
-t 16
-fa on
--no-mmap
-dio
--slot-save-path /slots
--metrics
--log-prefix
--log-timestamps
--cache-ram 6000
-m /models/final/qwen36-27b/mtp/Qwen3.6-27B-Q8_0.gguf
--min-p 0.0
--top-k 20
--top-p 0.95
--temp 0.6
--image-min-tokens 1024
--mmproj /models/final/qwen36-27b/mmproj-BF16.gguf
-c 1040000
-np 4
-ngl 999
-ctk f16
-ctv f16
-sm tensor
Problem description & steps to reproduce
Running with 10 NVIDIA RTX 5016 16GB GPUs in tensor split mode with 4 parallel slots for some time results in llama.cpp crashing; see the stack trace below.
Host memory fills up over time. I can postpone the crash by increasing the reserved 1 GB to 8 GB in ggml-backend-meta.cpp and here, but it still crashes at the 80 GB mark (10 GPUs x 8 GB).
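For reference, the numbers in the assertion message in the log output can be checked with a few lines of arithmetic. This is only a sketch: the two byte counts are copied from the `ggml_new_object` warning, and the assumption that the raised pool is allocated once per GPU is my own reading of the 80 GB figure, not something I have confirmed in the code.

```python
# Byte counts copied from the ggml_new_object warning in the log.
GIB = 1 << 30
needed = 1_073_741_936
available = 1_073_741_824

# The pool is exactly 1 GiB and the failed allocation overshoots it slightly.
pool_is_one_gib = available == GIB
shortfall = needed - available
print(pool_is_one_gib)  # True
print(shortfall)        # 112

# Assumption: one pool per GPU, each raised from 1 GB to 8 GB.
gpus = 10
raised_pool_gb = 8
total_host_gb = gpus * raised_pool_gb
print(total_host_gb)    # 80
```

So the context pool appears to be completely full at the point of the crash, with the final object needing only 112 bytes more than the pool holds.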
First Bad Commit
No response
Relevant log output
Logs
20.21.856.308 I slot print_timing: id 0 | task 30056 | prompt processing, n_tokens = 22495, progress = 0.84, t = 26.75 s / 840.87 tokens per second
/tmp/llama.cpp/ggml/src/ggml.c:1766: GGML_ASSERT(obj_new) failed
20.22.614.845 W ggml_new_object: not enough space in the context's memory pool (needed 1073741936, available 1073741824)
./llama-server(+0x130e35b)[0x5c008490735b]
./llama-server(+0x130e8ac)[0x5c00849078ac]
./llama-server(+0x130ea8b)[0x5c0084907a8b]
./llama-server(+0x130f6e1)[0x5c00849086e1]
./llama-server(+0x132c02c)[0x5c008492502c]
./llama-server(+0x131fd16)[0x5c0084918d16]
./llama-server(+0x13250ff)[0x5c008491e0ff]
./llama-server(+0x523f17)[0x5c0083b1cf17]
./llama-server(+0x52a567)[0x5c0083b23567]
./llama-server(+0x52bd1f)[0x5c0083b24d1f]
./llama-server(+0x24c990)[0x5c0083845990]
./llama-server(+0x2dfb21)[0x5c00838d8b21]
./llama-server(+0x1bc845)[0x5c00837b5845]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7461755701ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x74617557028b]
./llama-server(+0x1b7585)[0x5c00837b0585]