
Qwen2-VL-72B inference fails with StaticCache #1790

@Spycsh

Description

System Info

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.2-fw-57.2.4.0          |
| Driver Version:                                     1.19.2-ff37fea          |
|-------------------------------+----------------------+----------------------+

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pip install optimum-habana
pip install git+https://github.com/HabanaAI/[email protected]
git clone https://github.com/huggingface/optimum-habana.git
export PYTHONPATH=/optimum-habana
cd optimum-habana/examples/image-to-text/
pip install -r requirements.txt
python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 4096 --bf16 --use_hpu_graphs --sdp_on_bf16

Got

[rank1]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/layers.py", line 142, in forward
[rank1]:     output = torch.matmul(input, self.weight.transpose(-1, -2))
[rank1]: RuntimeError: Common dimension sizes of matmul inputs should be the same. Got 640 and 1280
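For context (my reading of the shapes, not something DeepSpeed prints): 1280 looks like the vision tower's hidden size and 640 is half of it, i.e. what each of the two ranks holds after the preceding qkv projection has been column-sharded, while the proj weight is not sharded along its input dimension (whether it was left full-size or column-sharded, it still expects the full 1280-wide input). A minimal plain-PyTorch sketch of the same mismatch, with the shapes assumed from the log and not taken from the DeepSpeed code itself:

import torch

hidden = 1280                      # vision hidden size implied by the error log
world_size = 2

# each rank only holds half of the attention output after column-parallel qkv ...
attn_out_per_rank = torch.randn(4, hidden // world_size)   # last dim = 640

# ... but the proj weight still expects a 1280-wide input; this mirrors the
# matmul(input, weight.transpose(-1, -2)) from the traceback above
proj_weight = torch.randn(hidden, hidden)

try:
    torch.matmul(attn_out_per_rank, proj_weight.transpose(-1, -2))
except RuntimeError as e:
    print(e)   # common-dimension mismatch: 640 vs 1280 (wording differs off-HPU)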

It seems DeepSpeed's auto_tp does not capture the proj linear in Qwen2VLVisionBlock, so I edited auto_tp:

vim /usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/auto_tp.py

# Add the following branch to extend gem_list
                elif 'proj' in layer and 'Qwen2VLVisionBlock' in str(type(module)): # get the vision block linear to replace
                    gem_list = gem_list + [layer]

With that patch the vision part runs, but it then hits another error:

[rank1]:   File "/optimum-habana/optimum/habana/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 242, in forward
[rank1]:     key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py", line 1186, in update
[rank1]:     k_out.index_copy_(2, cache_position, key_states)
[rank1]: RuntimeError:  Source/destination tensor must have same slice shapes except at dimension 2 Destination slice shape: 1 8 4988 128 and source slice shape: 1 4 892 128

It seems the past_key_value buffers are not split across ranks while the incoming key/value tensors are, hence the shape mismatch. My guess is that DeepSpeed's auto TP cannot capture the StaticCache, so it never shards it correctly. As shown in https://github.com/huggingface/optimum-habana/blob/main/examples/image-to-text/run_pipeline.py#L350, the OH optimization uses the HF StaticCache there, which may be the cause.
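The numbers in the error are consistent with that reading: the preallocated StaticCache slot has the full 8 KV heads and the max cache length (1 x 8 x 4988 x 128), while key_states coming out of the TP-sharded attention only carries 4 heads and the prompt length (1 x 4 x 892 x 128). A standalone sketch of the same index_copy_ failure (plain torch, shapes copied from the log):

import torch

# StaticCache buffer allocated from the unsharded config: 8 KV heads, 4988 slots
k_cache = torch.zeros(1, 8, 4988, 128)

# key_states produced by the TP-split attention on one rank: only 4 KV heads
key_states = torch.randn(1, 4, 892, 128)
cache_position = torch.arange(892)

try:
    k_cache.index_copy_(2, cache_position, key_states)
except RuntimeError as e:
    print(e)   # slice shapes must match everywhere except dimension 2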

Expected behavior

I then ran an experiment on my branch, which uses the default DynamicCache, and it works correctly with a few small fixes in DeepSpeed and the script. However, that branch has diverged quite far from main, so turning it into a PR will take extra effort.

PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 128 --bf16 --batch_size 1 --use_hpu_graphs

18.897807351919962 tokens/second

PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 128 --bf16 --batch_size 1 --use_hpu_graphs

13.218333355546587 tokens/second
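For reference, the cache choice itself is just a generate-time option in recent transformers. A hedged sketch of the two paths using the standard HF API (this is not the actual diff in my branch, and whether the OH/DeepSpeed path on Gaudi honors these kwargs is exactly what needs the small fixes mentioned above):

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration, DynamicCache

model_id = "Qwen/Qwen2-VL-72B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = processor(text=["Describe the image."], return_tensors="pt")

# StaticCache path, roughly what the run_pipeline.py line linked above enables:
# it preallocates per-layer K/V buffers from the unsharded config, which is
# what clashes with auto TP.
# out = model.generate(**inputs, max_new_tokens=128, cache_implementation="static")

# DynamicCache path (the default): the cache grows with whatever head count the
# sharded attention actually produces, so no fixed 8-head buffer is involved.
out = model.generate(**inputs, max_new_tokens=128, past_key_values=DynamicCache())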
