Labels: bug (Something isn't working)
Description
While investigating #14893 (reply in thread), it appears that there is a problem with parallel streams on the Vulkan backend when LLAMA_SET_ROWS=1 is enabled. The problem occurs only when -fa is disabled.
Here is a basic repro:
make -j && LLAMA_SET_ROWS=1 ./bin/llama-parallel -hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF -np 2 -ns 32 -s 1 -c 4096 -ngl 99 --top-k 1
Garbage output in second request
Input: What is the meaning of life?
Response: The meaning of life is a philosophical question that has been debated for centuries. There is no one definitive answer, but some people believe that the meaning of life is to find happiness and fulfillment in one's own life. Others believe that the meaning of life is to contribute to society and make a positive impact on the world. Ultimately, the meaning of life is up to each individual to define for themselves.
0.05.487.441 I Client 0, seq 2, junk = 0, prompt = 284, started decoding ...
0.07.614.732 I Client 1, seq 1/ 32, prompt 284 t, response 128 t, time 5.29 s, speed 77.83 t/s, cache miss 0
Input: What is the meaning of life?
Response: The _, _,SID _, _,SID
_,
_,
_,SID _,
_,
_,
_,
_,SIDdata _,SID _, = _, _, _, _, _,SID = _, _,SID` _,
_, _,
_,SID _, _, _, _, _,SIDim _,SID = _, _, _,
_,SID _,
_, _,\
_, _, _,
_,
_,SID
_, _,SID
_,SID
_,ID _,
_,\ _,SID =
The problem is not observed if:
- LLAMA_SET_ROWS=1 is removed
- LLAMA_SET_ROWS=1 is kept and -fa is added
This makes me think there is a bug in the SOFT_MAX implementation for batched cases (#14449).
Edit: investigating further, it seems the problem is not in SOFT_MAX, because the problem persists even when I keep SOFT_MAX on the CPU:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 75b58c26f..3f3efad7f 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -10863,7 +10863,9 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
case GGML_OP_PAD:
case GGML_OP_ROLL:
case GGML_OP_DIAG_MASK_INF:
+ return true;
case GGML_OP_SOFT_MAX:
+ return false;
case GGML_OP_SOFT_MAX_BACK:
case GGML_OP_ARGSORT:
case GGML_OP_SUM:
Looking at the batched matrix multiplications now across dim 3.