
Eval bug: Qwen3-Coder-480B-A35B-Instruct-1M-GGUF GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed #15049

@createthis

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 6087 (c3eb159)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA RTX PRO 6000 Blackwell Workstation Edition
AMD EPYC 9355

Models

Qwen3-Coder-480B-A35B-Instruct-1M-GGUF

Problem description & steps to reproduce

./build/bin/llama-server \
    --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
    --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 400000 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --seed 3407 \
    --prio 3 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --min-p 0.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --port 11434

Feed it more than 131072 tokens of context, then watch it crash and burn.
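
The crash lands exactly as n_past crosses 131072 = 2^17, which smells like a 32-bit size overflow. Here is a back-of-the-envelope sketch (mine, not derived from the assert; the 16384 bytes-per-position figure is a hypothetical illustration, not the model's real KV geometry): if the tensor being copied grows linearly with the number of cached positions, it crosses INT_MAX = 2^31 - 1 bytes precisely at that point.

// sketch.cpp - hypothetical arithmetic only, not llama.cpp code
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t bytes_per_pos = 16384;  // hypothetical: 2^14 bytes per cached position
    const int64_t n_past        = 131072; // 2^17, where the log below shows the abort
    const int64_t nbytes        = bytes_per_pos * n_past; // 2^14 * 2^17 = 2^31 bytes

    // Same bound as GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX):
    printf("nbytes = %lld vs INT_MAX = %d -> %s\n",
           (long long) nbytes, INT_MAX,
           nbytes <= INT_MAX ? "ok" : "assert fails");
    return 0;
}

Compiled and run, this prints "assert fails": 2147483648 is one byte past INT_MAX.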

First Bad Commit

No response

Relevant log output

slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 131072, n_tokens = 4096, progress = 0.492306
/home/jesse/llama.cpp/ggml/src/ggml-cuda/cpy.cu:285: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
/home/jesse/llama.cpp/build/bin/libggml-base.so(+0x1594b)[0x7b7943a9294b]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21c)[0x7b7943a92dac]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x15b)[0x7b7943a92f8b]
/home/jesse/llama.cpp/build/bin/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_b+0xa62)[0x7b7940c9dcb2]
/home/jesse/llama.cpp/build/bin/libggml-cuda.so(+0xeed58)[0x7b7940ceed58]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x463)[0x7b7943aaab13]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7b794389c0e1]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x104)[0x7b794389d794]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x3bd)[0x7b79438a35dd]
/home/jesse/llama.cpp/build/bin/libllama.so(llama_decode+0xf)[0x7b79438a453f]
./build/bin/llama-server(+0xc1bbe)[0x619405c79bbe]
./build/bin/llama-server(+0x879e5)[0x619405c3f9e5]
./build/bin/llama-server(+0x4ef0e)[0x619405c06f0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7b794302a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7b794302a28b]
./build/bin/llama-server(+0x50f35)[0x619405c08f35]
Aborted (core dumped)
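
For readers unfamiliar with the guard: below is a minimal sketch of what the failing assert at cpy.cu:285 protects against, assuming (as the INT_MAX bound suggests) that the CUDA copy kernel addresses the tensor with 32-bit int offsets. The struct and function names are hypothetical stand-ins, not the real ggml-cuda source.

// guard_sketch.cpp - self-contained illustration, not ggml code
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdio>

struct fake_tensor {
    size_t nbytes; // total byte size, analogous to ggml_nbytes()
};

void cuda_cpy_sketch(const fake_tensor & src0) {
    // Mirrors GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX): 32-bit kernel
    // offsets would overflow on a copy larger than 2^31 - 1 bytes.
    assert(src0.nbytes <= (size_t) INT_MAX);
    printf("copy of %zu bytes would be launched\n", src0.nbytes);
}

int main() {
    cuda_cpy_sketch({ (size_t) 1 << 30 }); // 1 GiB: passes
    cuda_cpy_sketch({ (size_t) 1 << 31 }); // 2 GiB: aborts, like the report
    return 0;
}

Under that assumption, any single tensor view over ~2 GiB trips the abort, which would explain why the failure appears only at very large context sizes.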
