Vulkan backend crashes when reasoning is disabled with Qwen3.5-35B-A3B #21608

@Zirconium419122

Description

When running the Qwen3.5-35B-A3B GGUF model with the Vulkan backend in llama.cpp, inference succeeds when reasoning (thinking) is enabled but consistently crashes when it is disabled.

The behavior is reproducible with both llama-cli and llama-server. The failure manifests as a Vulkan device loss (vk::DeviceLostError) during inference when reasoning is turned off, while identical configurations with reasoning enabled run without issues.

Notably, disabling reasoning via --reasoning off triggers the crash in llama-cli, whereas passing chat template kwargs (--chat-template-kwargs '{"enable_thinking":false}') does not, which suggests a discrepancy in how the two reasoning controls are applied internally.
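
For quick reference, the two llama-cli invocations differ only in how reasoning is disabled (full commands and logs below; MODEL stands for the GGUF path):

# crashes with vk::DeviceLostError
llama-cli -m MODEL --jinja -c 32768 --reasoning off -n 100 -p "Hello!"
# runs, but the output still contains a thinking block
llama-cli -m MODEL --jinja -c 32768 --chat-template-kwargs '{"enable_thinking":false}' -n 100 -p "Hello!"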

llama-bench:

llama-bench -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf -p 0 -n 128,256,512
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ARL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q2_K - Medium |  11.31 GiB |    34.66 B | Vulkan     |  99 |           tg128 |         12.39 ± 0.12 |
| qwen35moe 35B.A3B Q2_K - Medium |  11.31 GiB |    34.66 B | Vulkan     |  99 |           tg256 |         12.46 ± 0.10 |
| qwen35moe 35B.A3B Q2_K - Medium |  11.31 GiB |    34.66 B | Vulkan     |  99 |           tg512 |         12.30 ± 0.14 |

build: 9bcb4ef (8548)

llama-cli with reasoning:

llama-cli -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf --jinja -c 32768 -n 100 -p "Hello\!"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ARL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8548-9bcb4ef
model      : Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Hello!

[Start thinking]

Thinking Process:

1.  **Analyze the Input:**
    *   Input: "Hello!"
    *   Intent: Greeting.
    *   Tone: Friendly, casual.
    *   Context: Initial interaction.

2.  **Determine the Appropriate Response:**
    *   Acknowledge the greeting.
    *   Offer assistance.
    *   Maintain a friendly and helpful tone.
    *   Keep it concise (since it's

[ Prompt: 23,8 t/s | Generation: 9,8 t/s ]

llama-cli without reasoning:

llama-cli -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf --jinja -c 32768 --reasoning off -n 100 -p "Hello\!"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ARL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8548-9bcb4ef
model      : Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Hello!

-/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(+0x17c5a) [0x7f0a9b6b7c5a]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(ggml_print_backtrace+0x204) [0x7f0a9b6b8114]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(+0x2c5d9) [0x7f0a9b6cc5d9]
/nix/store/ab3753m6i7isgvzphlar0a8xb84gl96i-gcc-15.2.0-lib/lib/libstdc++.so.6(+0xc539a) [0x7f0a970c539a]
/nix/store/ab3753m6i7isgvzphlar0a8xb84gl96i-gcc-15.2.0-lib/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0) [0x7f0a970b286e]
/nix/store/ab3753m6i7isgvzphlar0a8xb84gl96i-gcc-15.2.0-lib/lib/libstdc++.so.6(+0xc5637) [0x7f0a970c5637]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-vulkan.so.0(+0x89426) [0x7f0a97489426]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-vulkan.so.0(+0x193945) [0x7f0a97593945]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-vulkan.so.0(+0x193c28) [0x7f0a97593c28]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x92c) [0x7f0a9b6d5ecc]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1) [0x7f0a9accd2f1]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x11f) [0x7f0a9accfd7f]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x3c9) [0x7f0a9acd6aa9]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(llama_decode+0x11) [0x7f0a9acd8781]
llama-cli(+0x1a7fcb) [0x55cc2e4affcb]
llama-cli(+0x139fc6) [0x55cc2e441fc6]
/nix/store/ab3753m6i7isgvzphlar0a8xb84gl96i-gcc-15.2.0-lib/lib/libstdc++.so.6(+0xf2fa4) [0x7f0a970f2fa4]
/nix/store/jms7zxzm7w1whczwny5m3gkgdjghmi2r-glibc-2.42-51/lib/libc.so.6(+0x9dd53) [0x7f0a96c9dd53]
/nix/store/jms7zxzm7w1whczwny5m3gkgdjghmi2r-glibc-2.42-51/lib/libc.so.6(+0x12563c) [0x7f0a96d2563c]
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
zsh: IOT instruction (core dumped)  llama-cli -m  --jinja -c 32768 --reasoning off -n 100 -p "Hello!"

llama-cli works when using --chat-template-kwargs '{"enable_thinking":false}' instead, for some reason (note that the output below still contains a thinking block, so the kwarg may not actually be taking effect):

llama-cli -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf --jinja -c 32768 --chat-template-kwargs '{"enable_thinking":false}' -n 100 -p "Hello\!"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ARL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8548-9bcb4ef
model      : Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Hello!

[Start thinking]

Thinking Process:

1.  **Analyze the Input:**
    *   Input: "Hello!"
    *   Intent: Greeting.
    *   Tone: Friendly, casual.

2.  **Determine the appropriate response:**
    *   Acknowledge the greeting.
    *   Offer assistance.
    *   Maintain a friendly and helpful tone.
    *   Keep it concise (since it's just a greeting).

3.  **Draft

[ Prompt: 24,0 t/s | Generation: 9,5 t/s ]

llama-server fails with --chat-template-kwargs '{"enable_thinking":false}' (the prompt was "Hello!"; a representative request is sketched after this log):

llama-server -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf -np 1 --jinja -c 32768 --chat-template-kwargs '{"enable_thinking":false}' -lv 2 --host 127.0.0.1 --port 8088
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ARL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
common_speculative_is_compat: the target context does not support partial sequence removal
srv    load_model: speculative decoding not supported by this context
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(+0x17c5a) [0x7f2385365c5a]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(ggml_print_backtrace+0x204) [0x7f2385366114]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(+0x2c5d9) [0x7f238537a5d9]
/nix/store/ab3753m6i7isgvzphlar0a8xb84gl96i-gcc-15.2.0-lib/lib/libstdc++.so.6(+0xc539a) [0x7f23814c539a]
/nix/store/ab3753m6i7isgvzphlar0a8xb84gl96i-gcc-15.2.0-lib/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0) [0x7f23814b286e]
/nix/store/ab3753m6i7isgvzphlar0a8xb84gl96i-gcc-15.2.0-lib/lib/libstdc++.so.6(+0xc5637) [0x7f23814c5637]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-vulkan.so.0(+0x89426) [0x7f2381889426]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-vulkan.so.0(+0x193945) [0x7f2381993945]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-vulkan.so.0(+0x193c28) [0x7f2381993c28]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x92c) [0x7f2385383ecc]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1) [0x7f23850cd2f1]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x11f) [0x7f23850cfd7f]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x3c9) [0x7f23850d6aa9]
/nix/store/g120d4bqgw63ashvjda55a1z9dgl8bd7-llama-cpp-8548/lib/libllama.so.0(llama_decode+0x11) [0x7f23850d8781]
llama-server(+0x186feb) [0x5653b7877feb]
llama-server(+0x1d4e66) [0x5653b78c5e66]
llama-server(+0xd4ced) [0x5653b77c5ced]
/nix/store/jms7zxzm7w1whczwny5m3gkgdjghmi2r-glibc-2.42-51/lib/libc.so.6(+0x2b285) [0x7f238102b285]
/nix/store/jms7zxzm7w1whczwny5m3gkgdjghmi2r-glibc-2.42-51/lib/libc.so.6(__libc_start_main+0x88) [0x7f238102b338]
llama-server(+0xdbfe5) [0x5653b77ccfe5]
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
zsh: IOT instruction (core dumped)  llama-server -m  -np 1 --jinja -c 32768 --chat-template-kwargs  -lv 2 --host
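
The prompt was sent to the server's OpenAI-compatible endpoint (the working run below logs it as POST /v1/chat/completions). A representative request, reconstructed rather than copied from the original client, looks like:

curl http://127.0.0.1:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'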

llama-server with reasoning:

llama-server -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf -np 1 --jinja -c 32768 -lv 3 --host 127.0.0.1 --port 8088
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ARL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
build: 8548 (9bcb4ef) with GNU 15.2.0 for Linux x86_64
system info: n_threads = 3, n_threads_batch = 3, total_threads = 16

system_info: n_threads = 3 (n_threads_batch = 3) / 16 | CPU : LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/user/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 12526 MiB of device memory vs. 21316 MiB of free device memory
llama_params_fit_impl: will leave 8790 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.78 seconds
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Graphics (ARL)) (0000:00:02.0) - 21317 MiB free
llama_model_loader: loaded meta data with 52 key-value pairs and 733 tensors from /home/user/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Qwen3.5-35B-A3B
llama_model_loader: - kv 6: general.basename str = Qwen3.5-35B-A3B
llama_model_loader: - kv 7: general.quantized_by str = Unsloth
llama_model_loader: - kv 8: general.size_label str = 35B-A3B
llama_model_loader: - kv 9: general.license str = apache-2.0
llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 12: general.base_model.count u32 = 1
llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.5 35B A3B
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 17: qwen35moe.block_count u32 = 40
llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen35moe.embedding_length u32 = 2048
llama_model_loader: - kv 20: qwen35moe.attention.head_count u32 = 16
llama_model_loader: - kv 21: qwen35moe.attention.head_count_kv u32 = 2
llama_model_loader: - kv 22: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 23: qwen35moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 24: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 256
llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 8
llama_model_loader: - kv 27: qwen35moe.attention.key_length u32 = 256
llama_model_loader: - kv 28: qwen35moe.attention.value_length u32 = 256
llama_model_loader: - kv 29: qwen35moe.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 30: qwen35moe.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 31: qwen35moe.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 32: qwen35moe.ssm.state_size u32 = 128
llama_model_loader: - kv 33: qwen35moe.ssm.group_count u32 = 16
llama_model_loader: - kv 34: qwen35moe.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 35: qwen35moe.ssm.inner_size u32 = 4096
llama_model_loader: - kv 36: qwen35moe.full_attention_interval u32 = 4
llama_model_loader: - kv 37: qwen35moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 39: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 42: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 248055
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 10
llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3.5-35B-A3B-GGUF/imatrix_unsloth....
llama_model_loader: - kv 49: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.5-35B-A3B.txt
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 510
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 76
llama_model_loader: - type f32: 301 tensors
llama_model_loader: - type q8_0: 61 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type q5_K: 177 tensors
llama_model_loader: - type q6_K: 73 tensors
llama_model_loader: - type iq2_xs: 78 tensors
llama_model_loader: - type iq3_xxs: 41 tensors
llama_model_loader: - type iq4_xs: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q2_K - Medium
print_info: file size = 11.31 GiB (2.80 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch = qwen35moe
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 40
print_info: n_head = 16
print_info: n_head_kv = 2
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 0
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [11, 11, 10, 0]
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 4096
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 32
print_info: ssm_n_group = 16
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 35B.A3B
print_info: model params = 34.66 B
print_info: general.name = Qwen3.5-35B-A3B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248055 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 39 repeating layers to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CPU_Mapped model buffer size = 333.44 MiB
load_tensors: Vulkan0 model buffer size = 11249.67 MiB
.................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.95 MiB
llama_kv_cache: Vulkan0 KV buffer size = 640.00 MiB
llama_kv_cache: size = 640.00 MiB ( 32768 cells, 10 layers, 1/1 seqs), K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_memory_recurrent: Vulkan0 RS buffer size = 62.81 MiB
llama_memory_recurrent: size = 62.81 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 2.81 MiB, S (f32): 60.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: Vulkan0 compute buffer size = 574.02 MiB
sched_reserve: Vulkan_Host compute buffer size = 72.03 MiB
sched_reserve: graph nodes = 3729
sched_reserve: graph splits = 2
sched_reserve: reserve took 45.75 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 1
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use --cache-ram 0 to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8088
main: starting the main loop...
srv update_slots: all slots are idle
srv log_server_r: done request: GET / 127.0.0.1 200
srv log_server_r: done request: HEAD /cors-proxy 127.0.0.1 404
srv params_from_: Chat format: peg-native
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 12
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 8, batch.n_tokens = 8, progress = 0.666667
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot update_slots: id 0 | task 0 | n_tokens = 8, memory_seq_rm [8, end)
slot init_sampler: id 0 | task 0 | init sampler, took 0.01 ms, tokens: text = 12, total = 12
slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 12, batch.n_tokens = 4
slot print_timing: id 0 | task 0 |
prompt eval time = 548.88 ms / 12 tokens ( 45.74 ms per token, 21.86 tokens per second)
eval time = 7830.80 ms / 78 tokens ( 100.39 ms per token, 9.96 tokens per second)
total time = 8379.68 ms / 90 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 89, truncated = 0
srv update_slots: all slots are idle
