Eval bug: Frequent Crash occurs randomly #19679

@chhil

Description

Name and Version

 ./llama-cli --version
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.508 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 55662.79 MB
version: 8068 (267ba5a1d)
built with AppleClang 15.0.0.15000309 for Darwin arm64

Operating systems

Mac

GGML backends

Metal

Hardware

M4 Mac Mini 64GB

Models

Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf

Problem description & steps to reproduce

Crashes

I get the following crash frequently. It can happen at the very beginning of a session or midway through, so my understanding is that it's not related to the number of tokens consumed.
The command line:

      -m "$MODEL_DIR/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf" \
      --host 0.0.0.0 \
      --port 4000 \
      --jinja \
      --n-gpu-layers -1 \
      --threads 8 \
      --batch-size 1024 \
      -fa on \
      -np 1 \
      --top-k 40 --top-p 0.95 --min-p 0 -ngl 99 -sm row --temp 1.0 --no-context-shift
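
For context, the backtrace below shows the crash inside grammar acceptance (llama_grammar_accept_token), so a grammar was active during sampling. My assumption (the request payload isn't shown here) is that this is the tool-call grammar the server builds with --jinja and the Qwen3 Coder chat template; a request of roughly this shape exercises that path. The tool name and payload are hypothetical, purely for illustration:

      # Hypothetical request body; with --jinja, sending any "tools" array
      # makes the server constrain tool-call output with a grammar.
      curl http://localhost:4000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "messages": [{"role": "user", "content": "List the files in the project"}],
          "tools": [{
            "type": "function",
            "function": {
              "name": "list_files",
              "description": "List files in a directory",
              "parameters": {
                "type": "object",
                "properties": { "path": { "type": "string" } },
                "required": ["path"]
              }
            }
          }]
        }'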

First Bad Commit

No response

Relevant log output

main: server is listening on http://0.0.0.0:4000
main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Qwen3 Coder
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 221184, n_keep = 0, task.n_tokens = 14484
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 1024, batch.n_tokens = 1024, progress = 0.070699
slot update_slots: id  0 | task 0 | n_tokens = 1024, memory_seq_rm [1024, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 1024, progress = 0.141397
slot update_slots: id  0 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3072, batch.n_tokens = 1024, progress = 0.212096
slot update_slots: id  0 | task 0 | n_tokens = 3072, memory_seq_rm [3072, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 1024, progress = 0.282795
slot update_slots: id  0 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 5120, batch.n_tokens = 1024, progress = 0.353494
slot update_slots: id  0 | task 0 | n_tokens = 5120, memory_seq_rm [5120, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 1024, progress = 0.424192
slot update_slots: id  0 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 7168, batch.n_tokens = 1024, progress = 0.494891
slot update_slots: id  0 | task 0 | n_tokens = 7168, memory_seq_rm [7168, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 1024, progress = 0.565590
slot update_slots: id  0 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 9216, batch.n_tokens = 1024, progress = 0.636288
slot update_slots: id  0 | task 0 | n_tokens = 9216, memory_seq_rm [9216, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 1024, progress = 0.706987
slot update_slots: id  0 | task 0 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 11264, batch.n_tokens = 1024, progress = 0.777686
slot update_slots: id  0 | task 0 | n_tokens = 11264, memory_seq_rm [11264, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 1024, progress = 0.848384
slot update_slots: id  0 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 13312, batch.n_tokens = 1024, progress = 0.919083
slot update_slots: id  0 | task 0 | n_tokens = 13312, memory_seq_rm [13312, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 1024, progress = 0.989782
slot update_slots: id  0 | task 0 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 14420, batch.n_tokens = 84, progress = 0.995581
slot update_slots: id  0 | task 0 | n_tokens = 14420, memory_seq_rm [14420, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 14484, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_tokens = 14484, batch.n_tokens = 64
slot init_sampler: id  0 | task 0 | init sampler, took 1.08 ms, tokens: text = 14484, total = 14484
slot update_slots: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 14419, pos_max = 14419, size = 75.376 MiB)
WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
See: https://github.com/ggml-org/llama.cpp/pull/17869
0   libggml-base.0.9.5.dylib            0x00000001011753bc ggml_print_backtrace + 276
1   libggml-base.0.9.5.dylib            0x0000000101183a70 _ZL23ggml_uncaught_exceptionv + 12
2   libc++abi.dylib                     0x000000019e642c2c _ZSt11__terminatePFvvE + 16
3   libc++abi.dylib                     0x000000019e646394 __cxa_get_exception_ptr + 0
4   libc++abi.dylib                     0x000000019e64633c _ZN10__cxxabiv1L12failed_throwEPNS_15__cxa_exceptionE + 0
5   libllama.0.0.7960.dylib             0x0000000101286030 _Z26llama_grammar_accept_tokenR13llama_grammariRKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE + 1016
6   libllama.0.0.7960.dylib             0x0000000101285af0 _Z25llama_grammar_accept_implR13llama_grammari + 464
7   llama-server                        0x000000010088b5a8 _Z21common_sampler_acceptP14common_samplerib + 76
8   llama-server                        0x000000010078c860 _ZN19server_context_impl12update_slotsEv + 10112
9   llama-server                        0x000000010075d688 _ZN12server_queue10start_loopEx + 848
10  llama-server                        0x00000001006fa4d0 main + 9568
11  dyld                                0x000000019e2c5d54 start + 7184
libc++abi: terminating due to uncaught exception of type std::runtime_error: Unexpected empty grammar stack after accepting piece: =list (40972)
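
The warnings at the top of the backtrace point to a way to capture a fuller trace. A sketch, assuming the same launch command as above (and, per the warning and the linked PR, run from something other than the native macOS Terminal.app):

      # Opt in to the lldb-based backtrace mentioned in the warning above.
      # It may crash Terminal.app, so use a different terminal emulator.
      GGML_BACKTRACE_LLDB=1 ./llama-server \
          -m "$MODEL_DIR/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf" \
          --host 0.0.0.0 --port 4000 --jinja -fa on -np 1 ...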
