Labels: bug (Something isn't working)
Description
While investigating #14893 (reply in thread), it appears that there is a problem with parallel streams on the Vulkan backend when LLAMA_SET_ROWS=1 is enabled. The problem occurs only when -fa is disabled.
Here is a basic repro:
make -j && LLAMA_SET_ROWS=1 ./bin/llama-parallel -hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF -np 2 -ns 32 -s 1 -c 4096 -ngl 99 --top-k 1
Garbage output in second request
Input: What is the meaning of life?
Response: The meaning of life is a philosophical question that has been debated for centuries. There is no one definitive answer, but some people believe that the meaning of life is to find happiness and fulfillment in one's own life. Others believe that the meaning of life is to contribute to society and make a positive impact on the world. Ultimately, the meaning of life is up to each individual to define for themselves.
0.05.487.441 I Client 0, seq 2, junk = 0, prompt = 284, started decoding ...
0.07.614.732 I Client 1, seq 1/ 32, prompt 284 t, response 128 t, time 5.29 s, speed 77.83 t/s, cache miss 0
Input: What is the meaning of life?
Response: The _, _,SID _, _,SID
_,
_,
_,SID _,
_,
_,
_,
_,SIDdata _,SID _, = _, _, _, _, _,SID = _, _,SID` _,
_, _,
_,SID _, _, _, _, _,SIDim _,SID = _, _, _,
_,SID _,
_, _,\
_, _, _,
_,
_,SID
_, _,SID
_,SID
_,ID _,
_,\ _,SID =
The problem is not observed if:
- LLAMA_SET_ROWS=1 is removed
- LLAMA_SET_ROWS=1 is kept and -fa is added
This makes me think there is a bug in the SOFT_MAX implementation for batched cases (#14449).
Edit: investigating further, it seems the problem is not in SOFT_MAX, because the problem persists even when I keep SOFT_MAX on the CPU:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 75b58c26f..3f3efad7f 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -10863,7 +10863,9 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
case GGML_OP_PAD:
case GGML_OP_ROLL:
case GGML_OP_DIAG_MASK_INF:
+ return true;
case GGML_OP_SOFT_MAX:
+ return false;
case GGML_OP_SOFT_MAX_BACK:
case GGML_OP_ARGSORT:
case GGML_OP_SUM:
Looking at the batched matrix multiplications now across dim 3.