
Qwen2-VL-72B inference fails with StaticCache #1790

@Spycsh

Description

System Info

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.2-fw-57.2.4.0          |
| Driver Version:                                     1.19.2-ff37fea          |
|-------------------------------+----------------------+----------------------+

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pip install optimum-habana
pip install git+https://github.com/HabanaAI/[email protected]
git clone https://github.com/huggingface/optimum-habana.git
export PYTHONPATH=/optimum-habana
cd optimum-habana/examples/image-to-text/
pip install -r requirements.txt
python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 4096 --bf16 --use_hpu_graphs --sdp_on_bf16

Got

[rank1]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/layers.py", line 142, in forward
[rank1]:     output = torch.matmul(input, self.weight.transpose(-1, -2))
[rank1]: RuntimeError: Common dimension sizes of matmul inputs should be the same. Got 640 and 1280
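For context (my reading of the shapes, not something DeepSpeed prints): 1280 looks like the vision tower's hidden size and 640 is half of it, i.e. what each of the two ranks holds after the preceding qkv projection has been column-sharded, while the proj weight is not sharded along its input dimension (whether it was left full-size or column-sharded, it still expects the full 1280-wide input). A minimal plain-PyTorch sketch of the same mismatch, with the shapes assumed from the log and not taken from the DeepSpeed code itself:

import torch

hidden = 1280                      # vision hidden size implied by the error log
world_size = 2

# each rank only holds half of the attention output after column-parallel qkv ...
attn_out_per_rank = torch.randn(4, hidden // world_size)   # last dim = 640

# ... but the proj weight still expects a 1280-wide input; this mirrors the
# matmul(input, weight.transpose(-1, -2)) from the traceback above
proj_weight = torch.randn(hidden, hidden)

try:
    torch.matmul(attn_out_per_rank, proj_weight.transpose(-1, -2))
except RuntimeError as e:
    print(e)   # common-dimension mismatch: 640 vs 1280 (wording differs off-HPU)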

It seems DeepSpeed's auto_tp does not capture the proj linear in Qwen2VLVisionBlock, so I edited auto_tp:

vim /usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/auto_tp.py

# Add the following branch to extend gem_list
                elif 'proj' in layer and 'Qwen2VLVisionBlock' in str(type(module)): # get the vision block linear to replace
                    gem_list = gem_list + [layer]

With that patch the vision part runs, but it then hits another error:

[rank1]:   File "/optimum-habana/optimum/habana/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 242, in forward
[rank1]:     key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py", line 1186, in update
[rank1]:     k_out.index_copy_(2, cache_position, key_states)
[rank1]: RuntimeError:  Source/destination tensor must have same slice shapes except at dimension 2 Destination slice shape: 1 8 4988 128 and source slice shape: 1 4 892 128

It seems the past_key_value buffers are not split across ranks while the incoming key/value tensors are, hence the shape mismatch. My guess is that DeepSpeed's auto TP cannot capture the StaticCache, so it never shards it correctly. As shown in https://github.com/huggingface/optimum-habana/blob/main/examples/image-to-text/run_pipeline.py#L350, the OH optimization uses the HF StaticCache there, which may be the cause.
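The numbers in the error are consistent with that reading: the preallocated StaticCache slot has the full 8 KV heads and the max cache length (1 x 8 x 4988 x 128), while key_states coming out of the TP-sharded attention only carries 4 heads and the prompt length (1 x 4 x 892 x 128). A standalone sketch of the same index_copy_ failure (plain torch, shapes copied from the log):

import torch

# StaticCache buffer allocated from the unsharded config: 8 KV heads, 4988 slots
k_cache = torch.zeros(1, 8, 4988, 128)

# key_states produced by the TP-split attention on one rank: only 4 KV heads
key_states = torch.randn(1, 4, 892, 128)
cache_position = torch.arange(892)

try:
    k_cache.index_copy_(2, cache_position, key_states)
except RuntimeError as e:
    print(e)   # slice shapes must match everywhere except dimension 2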

Expected behavior

I then ran an experiment on my branch, which uses the default DynamicCache, and it works correctly with a few small fixes in DeepSpeed and the script. However, that branch has diverged quite far from main, so turning it into a PR will take extra effort.

PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 128 --bf16 --batch_size 1 --use_hpu_graphs

18.897807351919962 tokens/second

PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 128 --bf16 --batch_size 1 --use_hpu_graphs

13.218333355546587 tokens/second
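For reference, the cache choice itself is just a generate-time option in recent transformers. A hedged sketch of the two paths using the standard HF API (this is not the actual diff in my branch, and whether the OH/DeepSpeed path on Gaudi honors these kwargs is exactly what needs the small fixes mentioned above):

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration, DynamicCache

model_id = "Qwen/Qwen2-VL-72B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = processor(text=["Describe the image."], return_tensors="pt")

# StaticCache path, roughly what the run_pipeline.py line linked above enables:
# it preallocates per-layer K/V buffers from the unsharded config, which is
# what clashes with auto TP.
# out = model.generate(**inputs, max_new_tokens=128, cache_implementation="static")

# DynamicCache path (the default): the cache grows with whatever head count the
# sharded attention actually produces, so no fixed 8-head buffer is involved.
out = model.generate(**inputs, max_new_tokens=128, past_key_values=DynamicCache())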
