[Intel HPU] Support intel hpu platform #4161
base: develop
Conversation
Thanks for your contribution!
Force-pushed from 7e59562 to d7509a6.
```
try:
    # assert len(paddle.static.cuda_places()) > 0
    return True
except Exception as e:
```
This check doesn't seem to work.
fixed
```
# PACKAGE = "fastdeploy.model_executor.ops.intel_hpu"
PACKAGE = "paddlenlp_ops"

import_custom_ops(PACKAGE, "paddlenlp_ops", globals())
```
Should this be fastdeploy.model_executor.ops.intel_hpu instead of paddlenlp_ops? Is this because of the naming convention of the ops implementation in the custom device?
Yes, the real custom ops come from PaddleCustomDevice; we just rename them in FastDeploy.
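For context, a hedged sketch of what such a re-export helper could look like. The body below is an assumption (it treats the second argument as a fallback package name), not FastDeploy's actual import_custom_ops implementation:

```python
import importlib


def import_custom_ops(package: str, fallback: str, dest: dict) -> None:
    """Load the custom-op module (built by PaddleCustomDevice) and re-export
    its public symbols into `dest`, typically a module's globals()."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        # Fall back to the canonical name the ops are actually built under.
        module = importlib.import_module(fallback)
    for name in dir(module):
        if not name.startswith("_"):
            dest.setdefault(name, getattr(module, name))
```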
```
@@ -0,0 +1,21 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
```
2024->2025
fixed
```
raise NotImplementedError


class AttentionBackend_HPU(AttentionBackend):
```
Would it be better to move this class to fastdeploy/model_executor/layers/attention/hpu_attn_backend.py?
moved
fastdeploy/engine/args_utils.py (outdated)
"--enable-tensor-or-expert-parallel", | ||
action='store_true', | ||
default=EngineArgs.enable_tensor_or_expert_parallel, | ||
help="Enable tensor parallelism for non-MoE and expert parallelism for MoE.") |
Could we enable TP + EP by setting --enable-expert-parallel and --tensor-parallel-size, without adding a new argument?
Currently EP is tied to DP, so we can't enable TP + EP with the existing parameters:
https://github.com/PaddlePaddle/FastDeploy/blob/develop/fastdeploy/config.py#L316-L318
https://github.com/PaddlePaddle/FastDeploy/blob/develop/fastdeploy/model_executor/layers/moe/moe.py#L132-L134
fastdeploy/worker/worker_process.py (outdated)
```
parallel_config.engine_worker_queue_port = parallel_config.engine_worker_queue_port[
    parallel_config.local_data_parallel_id
]
```
All CI runs fail at this line with TypeError: 'int' object is not subscriptable. We need to solve that first and then see if there are any other problems.
fixed
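A hypothetical defensive version of the failing line, assuming engine_worker_queue_port may arrive either as a single int or as a per-DP-rank list depending on configuration (a sketch, not necessarily the fix that landed):

```python
# engine_worker_queue_port may be one port (int) or one port per DP rank (list).
port = parallel_config.engine_worker_queue_port
if isinstance(port, (list, tuple)):
    port = port[parallel_config.local_data_parallel_id]
parallel_config.engine_worker_queue_port = port
```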
```
@@ -0,0 +1,314 @@
"""
```
There is a backends folder under the layers directory that holds the device-specific layer implementations; please move the attention and MoE implementations into that folder.
As requested, they have been moved into the backends directory.
```
elif current_platform.is_intel_hpu():
    self.forward = self.forward_intel_hpu
```
The name forward_cuda may no longer be a great fit at this point, but you should be able to reuse forward_cuda here; the logic is the same.
Changed to reuse forward_cuda.
```
elif current_platform.is_intel_hpu():
    self.forward = self.forward_intel_hpu
```
How does this differ from the other hardware platforms? Why does it need separate logic; can't it be abstracted into a few ops that then call forward_cuda?
For now we use a fused implementation because it performs better on our platform; we will consider splitting it up later, provided performance is not affected.
```
from fastdeploy.platforms import current_platform


def reload_ep_checkpoint(model_path: str, fd_config: FDConfig, state_dict: dict, return_numpy: bool = False):
```
Why was the model-loading code changed here? Is it because you are not using the official model?
The model is unchanged; it is still the official model. The change only supports model loading in TP + EP mode.
In the TP + EP mode we support, the dense part uses TP, while the MoE part uses neither TP nor DP, only EP (EP degree = TP degree). So at model-load time, if TP is configured, the MoE weights would by default also be sharded in TP fashion. What reload_ep_checkpoint does is first drop those TP-sharded MoE weights and then re-partition each full weight along the expert dimension across the different cards.
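A minimal sketch of the re-partitioning idea described above, under the stated assumptions (weights shaped [num_experts, ...], experts divisible by the EP degree); repartition_moe_weights is a hypothetical helper, not FastDeploy's reload_ep_checkpoint itself:

```python
import numpy as np


def repartition_moe_weights(full_weights: np.ndarray, rank: int, ep_size: int) -> np.ndarray:
    """Give each rank a contiguous slice of the full expert weights along
    the expert dimension (axis 0), instead of the default TP sharding."""
    num_experts = full_weights.shape[0]
    assert num_experts % ep_size == 0, "experts must divide evenly across ranks"
    per_rank = num_experts // ep_size
    return full_weights[rank * per_rank : (rank + 1) * per_rank]
```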
fastdeploy/config.py (outdated)
```
self.expert_parallel_size = 1  # EP degree
self.data_parallel_size = 1  # DP degree
self.enable_expert_parallel = False
self.enable_tensor_or_expert_parallel = False
```
Can't this be inferred from a combination of existing fields such as enable_expert_parallel, expert_parallel_size, and tensor_parallel_size? Must a new field really be added to the user-facing interface?
Currently in FastDeploy, EP is bound to DP (EP size equals DP size), and the MoE layer forbids enabling TP and EP at the same time, so adding a new parameter is the cleanest way to support TP + EP:
https://github.com/PaddlePaddle/FastDeploy/blob/develop/fastdeploy/model_executor/layers/moe/moe.py#L132-L134
Is the purpose of this parameter to enable TP + EP parallelism simultaneously for the MoE part?
The dense part uses TP; the MoE part uses EP, with the EP degree equal to the TP degree.
fastdeploy/engine/args_utils.py (outdated)
```
cache_cfg = CacheConfig(all_dict)
load_cfg = LoadConfig(all_dict)
parallel_cfg = ParallelConfig(all_dict)
cache_cfg.enc_dec_block_num = self.static_decode_blocks
```
It could be better to set this value as in https://github.com/PaddlePaddle/FastDeploy/blob/release/2.2/fastdeploy/config.py#L899 to avoid impacting other hardware.
This isn't only for a specific platform. It may be a bug: the static_decode_blocks parameter in EngineArgs can't be passed to cache_cfg even on GPUs, because CacheConfig has no static_decode_blocks field, only enc_dec_block_num.
It does seem there is a problem with static_decode_blocks not being passed to cache_cfg. Could you please move the per-platform enc_dec_block_num setting into this file? Since this line runs after cache_cfg is initialized, the default value of 2 may cause errors on some platforms, e.g. Iluvatar.
After rebasing to the latest code, we can use FD_ENC_DEC_BLOCK_NUM to solve this problem. I have removed this line.
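For illustration, a small sketch of the environment-variable route, assuming FD_ENC_DEC_BLOCK_NUM carries an integer block count with the default of 2 discussed above (the exact wiring inside FastDeploy may differ):

```python
import os

# Per-platform block count, overridable without touching other hardware.
enc_dec_block_num = int(os.getenv("FD_ENC_DEC_BLOCK_NUM", "2"))
```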
fastdeploy/worker/worker_process.py (outdated)
```
else:
    num_experts = model_config.moe_num_experts

num_experts_per_rank = num_experts // parallel_config.tensor_parallel_size
```
Why are the experts partitioned by tp_size?
The current logic in FastDeploy is: when EP is enabled, experts are partitioned by dp_size together with --enable-expert-parallel. By analogy, we partition experts by tp_size together with enable_tensor_or_expert_parallel to support the TP + EP mode (the dense part uses TP; the MoE part uses EP, with EP degree = TP degree).
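A worked example of that partitioning rule with hypothetical numbers (64 experts, tensor_parallel_size = 8):

```python
num_experts = 64
tensor_parallel_size = 8
num_experts_per_rank = num_experts // tensor_parallel_size  # 8 experts per rank

for rank in range(tensor_parallel_size):
    start = rank * num_experts_per_rank
    print(f"rank {rank} owns experts [{start}, {start + num_experts_per_rank})")
```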
The enable_tensor_or_expert_parallel parameter doesn't feel very clear. For this dense-TP / MoE-EP sharding, could you follow the naming used by open-source frameworks such as vLLM/SGLang? As it stands it is rather confusing.
The naming is indeed a problem. We have removed the related code from this PR for now and will submit another PR once it has been refined.
Force-pushed from 5137adb to 2ae2c61.
@zoooo0820 @carryyu @YuanRisheng @gzy19990617, we have removed the TP + EP mode for now and will merge it separately once it has been refined.
Force-pushed from 2ae2c61 to e81e85a.
Force-pushed from e81e85a to cdc1d07.
LGTM
```
@dataclass
class ForwardMeta_HPU:
```
Could the naming be kept consistent with the other hardware above, i.e. HPUForwardMeta?
Renamed to HPUForwardMeta.
FastDeploy now supports the ERNIE 4.5 model on Intel HPU.
Dependencies:
Gaudi software: 1.22.0
PaddlePaddle: 3.1.1
PaddleCustomDevice: latest develop branch
Support for more models and further performance optimizations will follow.