System Info
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.19.2-fw-57.2.4.0 |
| Driver Version: 1.19.2-ff37fea |
|-------------------------------+----------------------+----------------------+
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
pip install optimum-habana
pip install git+https://github.com/HabanaAI/[email protected]
git clone https://github.com/huggingface/optimum-habana.git
export PYTHONPATH=/optimum-habana
cd optimum-habana/examples/image-to-text/
pip install -r requirements.txt
python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 4096 --bf16 --use_hpu_graphs --sdp_on_bf16

Got:
[rank1]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/layers.py", line 142, in forward
[rank1]: output = torch.matmul(input, self.weight.transpose(-1, -2))
[rank1]: RuntimeError: Common dimension sizes of matmul inputs should be the same. Got 640 and 1280
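My reading of the mismatch (a minimal sketch with sizes that match the report but are otherwise hypothetical): under world_size 2 one matmul operand is already sharded to half width while the other keeps the full width, which is what happens if proj's weight is left unpartitioned while the tensor feeding it is split across ranks. For example:

import torch

hidden_size = 1280   # hypothetical full width, consistent with the 640-vs-1280 report
world_size = 2

per_rank_input = torch.randn(1, 16, hidden_size // world_size)  # this rank only sees 640 features
unsplit_weight = torch.randn(hidden_size, hidden_size)          # proj.weight kept at full width

# Same call shape as in deepspeed/module_inject/layers.py: matmul(input, weight.T)
torch.matmul(per_rank_input, unsplit_weight.transpose(-1, -2))
# raises a RuntimeError because the inner dimensions are 640 vs 1280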
It seems DeepSpeed auto_tp does not capture the proj linear in Qwen2VLVisionBlock, so I edited auto_tp:
vim /usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/auto_tp.py
# Add the following branch so the vision block's proj linear is added to gem_list
elif 'proj' in layer and 'Qwen2VLVisionBlock' in str(type(module)):  # pick up the vision block linear to replace
    gem_list = gem_list + [layer]

Now it goes to another bug:
[rank1]: File "/optimum-habana/optimum/habana/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 242, in forward
[rank1]: key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py", line 1186, in update
[rank1]: k_out.index_copy_(2, cache_position, key_states)
[rank1]: RuntimeError: Source/destination tensor must have same slice shapes except at dimension 2 Destination slice shape: 1 8 4988 128 and source slice shape: 1 4 892 128

It seems the past_key_value is not split across ranks while the input tensors are, hence the shape mismatch. I guess this is because DeepSpeed cannot capture the StaticCache and therefore does not split it correctly. As shown in https://github.com/huggingface/optimum-habana/blob/main/examples/image-to-text/run_pipeline.py#L350, the optimum-habana optimization uses the HF StaticCache, so that may be the cause.
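The shapes in the traceback tell the same story: the cache slot was preallocated for all 8 KV heads, while this rank's key_states only carries its 4-head shard. A minimal sketch (shapes copied from the error, variable names mine):

import torch

k_out = torch.zeros(1, 8, 4988, 128)        # preallocated static cache: full 8 KV heads, full cache length
key_states = torch.randn(1, 4, 892, 128)    # per-rank shard after tensor parallelism: only 4 heads
cache_position = torch.arange(892)

# Same call as in transformers/cache_utils.py StaticCache.update
k_out.index_copy_(2, cache_position, key_states)
# raises the same "Source/destination tensor must have same slice shapes" RuntimeError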
Expected behavior
I then ran an experiment based on my branch, which uses the default DynamicCache, and it works correctly with a few small fixes in DeepSpeed and the script. However, that branch has diverged quite far from main, so turning it into a PR will take extra effort. Throughput from this experiment (a sketch of why the DynamicCache path avoids the mismatch follows the numbers):
PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 128 --bf16 --batch_size 1 --use_hpu_graph
18.897807351919962 tokens/second
PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_pipeline.py --model_name_or_path Qwen/Qwen2-VL-72B-Instruct --max_new_tokens 128 --bf16 --batch_size 1 --use_hpu_graph
13.218333355546587 tokens/second
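For reference, a minimal sketch of why I think the DynamicCache path avoids the mismatch (hypothetical per-rank shapes; an illustration, not code from my branch): DynamicCache takes its head count from the incoming key_states and simply concatenates along the sequence dimension, so nothing is preallocated with the unsplit 8-head shape.

import torch
from transformers.cache_utils import DynamicCache

key_states = torch.randn(1, 4, 892, 128)    # the same per-rank 4-head shard as above
value_states = torch.randn(1, 4, 892, 128)

cache = DynamicCache()
k, v = cache.update(key_states, value_states, layer_idx=0)
print(k.shape)   # torch.Size([1, 4, 892, 128]) - shaped by the shard itself

# The next decoding step just appends one more position per head
k, v = cache.update(torch.randn(1, 4, 1, 128), torch.randn(1, 4, 1, 128), layer_idx=0)
print(k.shape)   # torch.Size([1, 4, 893, 128])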