Eval bug: vulkan crash on multi-gpu setup #18297

@daniandtheweb

Description

Name and Version

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
version: 7512 (179fd82)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

Ryzen 7 9700X, Radeon RX 5700XT, Radeon RX 7800XT

Models

Tested on Mistral Small 24B and Qwen3-VL 8B

Problem description & steps to reproduce

Starting with commit e1f15b4, my multi-GPU setup has stopped working correctly on Vulkan.

Watching nvtop, I've noticed that when the model loads, the warmup pass runs only on one GPU (the 5700 XT), and when I send a request the processing seems to happen only on the other GPU (the 7800 XT). After a few seconds the backend just crashes.

Here's the command I use to reliably reproduce the issue (the multi-GPU setup is of course meant for larger models, but it's much faster to reproduce with this one):

GGML_VK_VISIBLE_DEVICES=0,1 ./llama-server -t 8 --context-shift -ctk q8_0 -ctv q8_0 -c 24576 -ngl 100 -m ~/Applications/chat/gguf/Qwen3-VL-8B-Thinking-UD-Q4_K_XL.gguf --mmproj ~/Applications/chat/gguf/Qwen3-VL-8B-Thinking-mmproj-F16.gguf
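
For reference, the same command restricted to a single device (device 0 is the 7800 XT here) should give a single-GPU run to compare against:

GGML_VK_VISIBLE_DEVICES=0 ./llama-server -t 8 --context-shift -ctk q8_0 -ctv q8_0 -c 24576 -ngl 100 -m ~/Applications/chat/gguf/Qwen3-VL-8B-Thinking-UD-Q4_K_XL.gguf --mmproj ~/Applications/chat/gguf/Qwen3-VL-8B-Thinking-mmproj-F16.gguf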

First Bad Commit

e1f15b4
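
For anyone re-checking the regression window, a plain git bisect driven by the repro command above should narrow it down to this commit; roughly (the known-good starting point is a placeholder, not a commit I verified):

git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>
# rebuild, run the repro command, then mark each step:
git bisect good    # or: git bisect bad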

Relevant log output

GGML_VK_VISIBLE_DEVICES=0,1 ./llama-server -t 8 --context-shift -ctk q8_0 -ctv q8_0 -c 24576 -ngl 100 -m ~/Applications/chat/gguf/Qwen3-VL-8B-Thinking-UD-Q4_K_XL.gguf --mmproj ~/Applications/chat/gguf/Qwen3-VL-8B-Thinking-mmproj-F16.gguf
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7512 (179fd82a7) with GNU 15.2.1 for Linux x86_64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/daniandtheweb/Applications/chat/gguf/Qwen3-VL-8B-Thinking-UD-Q4_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - Vulkan0 (AMD Radeon RX 7800 XT (RADV NAVI32)):  16368 total,   4416 used,  10938 surplus
llama_params_fit_impl:   - Vulkan1 (AMD Radeon RX 5700 XT (RADV NAVI10)):   8176 total,   2766 used,   5392 surplus
llama_params_fit_impl: projected to use 7182 MiB of device memory vs. 24544 MiB of free device memory
llama_params_fit_impl: will leave at least 5392 >= 1024 MiB of free memory on all devices, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.15 seconds
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 7800 XT (RADV NAVI32)) (0000:03:00.0) - 15354 MiB free
llama_model_load_from_file_impl: using device Vulkan1 (AMD Radeon RX 5700 XT (RADV NAVI10)) (0000:09:00.0) - 8158 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 399 tensors from /home/daniandtheweb/Applications/chat/gguf/Qwen3-VL-8B-Thinking-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-Vl-8B-Thinking
llama_model_loader: - kv   3:                           general.finetune str              = Thinking
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Vl-8B-Thinking
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 8B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  14:                        qwen3vl.block_count u32              = 36
llama_model_loader: - kv  15:                     qwen3vl.context_length u32              = 262144
llama_model_loader: - kv  16:                   qwen3vl.embedding_length u32              = 4096
llama_model_loader: - kv  17:                qwen3vl.feed_forward_length u32              = 12288
llama_model_loader: - kv  18:               qwen3vl.attention.head_count u32              = 32
llama_model_loader: - kv  19:            qwen3vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                     qwen3vl.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  21:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:               qwen3vl.attention.key_length u32              = 128
llama_model_loader: - kv  23:             qwen3vl.attention.value_length u32              = 128
llama_model_loader: - kv  24:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
llama_model_loader: - kv  25:                 qwen3vl.n_deepstack_layers u32              = 3
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                          general.file_type u32              = 15
llama_model_loader: - kv  38:                      quantize.imatrix.file str              = Qwen3-VL-8B-Thinking-GGUF/imatrix_uns...
llama_model_loader: - kv  39:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-VL-8B-Think...
llama_model_loader: - kv  40:             quantize.imatrix.entries_count u32              = 252
llama_model_loader: - kv  41:              quantize.imatrix.chunks_count u32              = 684
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_K:  155 tensors
llama_model_loader: - type q5_K:   25 tensors
llama_model_loader: - type q6_K:   54 tensors
llama_model_loader: - type iq4_xs:   20 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.77 GiB (5.00 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3vl
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 4096
print_info: n_embd_inp       = 16384
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 40
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: mrope sections   = [24, 20, 20, 0]
print_info: model type       = 8B
print_info: model params     = 8.19 B
print_info: general.name     = Qwen3-Vl-8B-Thinking
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   333.84 MiB
load_tensors:      Vulkan0 model buffer size =  2805.34 MiB
load_tensors:      Vulkan1 model buffer size =  1740.57 MiB
......................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 24576
llama_context: n_ctx_seq     = 24576
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (24576) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     2.32 MiB
llama_kv_cache:    Vulkan0 KV buffer size =  1275.00 MiB
llama_kv_cache:    Vulkan1 KV buffer size =   561.00 MiB
llama_kv_cache: size = 1836.00 MiB ( 24576 cells,  36 layers,  4/1 seqs), K (q8_0):  918.00 MiB, V (q8_0):  918.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan0 compute buffer size =   336.06 MiB
llama_context:    Vulkan1 compute buffer size =   432.82 MiB
llama_context: Vulkan_Host compute buffer size =   200.09 MiB
llama_context: graph nodes  = 1267
llama_context: graph splits = 3
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv  log_server_r: request: GET / 127.0.0.1 503
srv  log_server_r: request: GET / 127.0.0.1 503
srv  log_server_r: request: GET / 127.0.0.1 503
srv  log_server_r: request: GET / 127.0.0.1 503
srv  log_server_r: request: GET / 127.0.0.1 503
srv  log_server_r: request: GET / 127.0.0.1 503
clip_model_loader: model name:   Qwen3-Vl-8B-Thinking
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    352
clip_model_loader: n_kv:         31

clip_model_loader: has vision encoder
clip_ctx: CLIP using Vulkan0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

load_hparams: projector:          qwen3vl_merger
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     4096

--- vision hparams ---
load_hparams: image_size:         768
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern:       0
load_hparams: image_min_pixels:   8192
load_hparams: image_max_pixels:   4194304

load_hparams: model size:         1105.32 MiB
load_hparams: metadata size:      0.12 MiB
warmup: warmup with image size = 1472 x 1472
alloc_compute_meta:    Vulkan0 compute buffer size =   362.27 MiB
alloc_compute_meta:        CPU compute buffer size =    62.12 MiB
alloc_compute_meta: graph splits = 3, nodes = 853
warmup: flash attention is enabled
warmup: *****************************************************************
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup:          the performance will be suboptimal                      
warmup:          list of unsupported ops (backend=Vulkan0):
warmup:          UPSCALE: type = f32, ne = [92 92 1152 1]
warmup: flash attention is enabled
warmup: please report this on github as an issue
warmup: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118
warmup: *****************************************************************
srv    load_model: loaded multimodal model, '/home/daniandtheweb/Applications/chat/gguf/Qwen3-VL-8B-Thinking-mmproj-F16.gguf'
srv    load_model: ctx_shift is not supported by multimodal, it will be disabled
srv    load_model: initializing slots, n_slots = 4
slot   load_model: id  0 | task -1 | new slot, n_ctx = 24576
slot   load_model: id  1 | task -1 | new slot, n_ctx = 24576
slot   load_model: id  2 | task -1 | new slot, n_ctx = 24576
slot   load_model: id  3 | task -1 | new slot, n_ctx = 24576
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv    load_model: thinking = 1
load_model: chat template, chat_template: {# Unsloth template fixes #}
{%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- macro render_content(content, do_vision_count) %}
    {%- if content is string %}
        {{- content }}
    {%- else %}
        {%- for item in content %}
            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
                {%- if do_vision_count %}
                    {%- set image_count.value = image_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
                <|vision_start|><|image_pad|><|vision_end|>
            {%- elif 'video' in item or item.type == 'video' %}
                {%- if do_vision_count %}
                    {%- set video_count.value = video_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
                <|vision_start|><|video_pad|><|vision_end|>
            {%- elif 'text' in item %}
                {{- item.text }}
            {%- endif %}
        {%- endfor %}
    {%- endif %}
{%- endmacro %}
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- render_content(messages[0].content, false) + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + render_content(messages[0].content, false) + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" %}
        {%- set content = render_content(message.content, false) %}
        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
            {%- set ns.multi_step_tool = false %}
            {%- set ns.last_query_index = index %}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- set content = render_content(message.content, True) %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {# Unsloth template fixes - must change to for loop since llama.cpp will error out if not #}
                {%- set parts = content.split('</think>') %}
                {%- for part in parts %}
                    {%- if loop.index0 == 0 -%}
                        {%- set reasoning_content = part.rstrip('\n') %}
                        {%- set reasoning_content = (reasoning_content.split('<think>')|last) %}
                        {%- set reasoning_content = reasoning_content.lstrip('\n') -%}
                    {%- else -%}
                        {%- set content = part.lstrip('\n') %}
                    {%- endif %}
                {%- endfor %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
{# Copyright 2025-present Unsloth. Apache 2.0 License. #}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: request: GET / 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  3 | task 0 | processing task
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 24576, n_keep = 0, task.n_tokens = 48
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 48, batch.n_tokens = 48, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 48, batch.n_tokens = 48
radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a hard recovery.
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libggml-base.so.0(+0x156f6) [0x7ff7a58d46f6]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x203) [0x7ff7a58d4b33]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libggml-base.so.0(+0x284d9) [0x7ff7a58e74d9]
/usr/lib/libstdc++.so.6(+0xb1eba) [0x7ff7a1eb1eba]
/usr/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0) [0x7ff7a1e975d9]
/usr/lib/libstdc++.so.6(+0xb2176) [0x7ff7a1eb2176]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libggml-vulkan.so.0(+0x8f984) [0x7ff7a228f984]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libggml-vulkan.so.0(+0x1b440e) [0x7ff7a23b440e]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libggml-vulkan.so.0(+0x1b504a) [0x7ff7a23b504a]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x813) [0x7ff7a58f0583]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa0) [0x7ff7a56a4bb0]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xf3) [0x7ff7a56a6883]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x40f) [0x7ff7a56ac0ef]
/home/daniandtheweb/Applications/chat/llama.cpp/build/bin/libllama.so.0(llama_decode+0xe) [0x7ff7a56ad05e]
./llama-server(+0x17004e) [0x55d079ca104e]
./llama-server(+0x114971) [0x55d079c45971]
./llama-server(+0xa1346) [0x55d079bd2346]
/usr/lib/libc.so.6(+0x27635) [0x7ff7a1a27635]
/usr/lib/libc.so.6(__libc_start_main+0x89) [0x7ff7a1a276e9]
./llama-server(+0xa37f5) [0x55d079bd47f5]
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Queue::submit: ErrorDeviceLost
zsh: IOT instruction (core dumped)  GGML_VK_VISIBLE_DEVICES=0,1 ./llama-server -t 8 --context-shift -ctk q8_0 -ct
